Low-Rank Adaptation (LoRA)

Low-Rank Adaptation (LoRA) is the most popular technique for Parameter-Efficient Fine-Tuning (PEFT). It allows fine-tuning of massive models by freezing the original pre-trained weights and injecting pairs of trainable rank-decomposition matrices into each layer of the Transformer architecture.

How it Works

Instead of updating the full weight matrix $W \in \mathbb{R}^{d \times k}$ (which might have millions of parameters), LoRA represents the weight update $\Delta W$ as the product of two much smaller matrices, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, where the rank $r \ll \min(d, k)$:

$$W' = W + \Delta W = W + BA$$
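As a concrete illustration, here is a minimal sketch of this idea in PyTorch: a frozen linear layer wrapped with a trainable low-rank branch. The class name, initialization, and the alpha/r scaling follow the general recipe from the LoRA paper, but the exact names and hyperparameters here are illustrative, not taken from any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear augmented with a trainable low-rank update BA."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze pre-trained W
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)

        d_out, d_in = base.weight.shape
        # A starts with small random values, B with zeros, so BA = 0
        # and training begins from the unmodified base model.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W'x = Wx + (BA)x, computed without ever materializing BA
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

Wrapping, for example, a 768x768 attention projection this way means only the roughly 2·r·768 entries of A and B receive gradients, while the frozen base weights remain identical to the original model.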

The Intuition

The “intrinsic dimension” of the change needed to adapt a model to a new task is low. You don’t need to change every single parameter to teach a general model to speak like a pirate; you just need to shift its representations in a specific, low-rank direction.

Benefits

  1. Efficiency: Reduces the number of trainable parameters by up to 10,000x and GPU memory requirements by roughly 3x (figures reported for GPT-3 175B in the original LoRA paper).
  2. No Latency: The learned product $BA$ can be merged back into the original weights $W$ after training, meaning there is no inference slowdown compared to the base model (see the sketch after this list).
  3. Portability: LoRA adapters are tiny (often <100MB) compared to the full model (tens of GBs), making them easy to share.
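As a rough sketch of the merge mentioned in point 2, continuing the illustrative LoRALinear above: the low-rank product is added into the frozen weight once, after which the adapter branch can be dropped and inference proceeds through a plain linear layer.

```python
@torch.no_grad()
def merge_lora(layer: LoRALinear) -> nn.Linear:
    """Fold the learned BA update into the frozen weight for zero-overhead inference."""
    merged = nn.Linear(layer.base.in_features, layer.base.out_features,
                       bias=layer.base.bias is not None)
    merged.weight.copy_(layer.base.weight + layer.scaling * (layer.B @ layer.A))
    if layer.base.bias is not None:
        merged.bias.copy_(layer.base.bias)
    return merged
```

Because only A and B need to be saved, the adapter file is a tiny fraction of the full checkpoint, which is what makes the portability in point 3 possible.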
