Low-Rank Adaptation (LoRA) is one of the most widely used techniques for Parameter-Efficient Fine-Tuning (PEFT). It makes fine-tuning massive models tractable by freezing the original pre-trained weights and injecting pairs of trainable rank-decomposition matrices into layers of the Transformer architecture.
How it Works
Instead of updating the full weight matrix (which might have millions of parameters), LoRA represents the weight update $\Delta W$ as the product of two much smaller matrices, $B$ and $A$, so a forward pass computes:

$$h = W_0 x + \Delta W x = W_0 x + B A x$$

- $W_0$: Frozen pre-trained weights (dimensions $d \times k$).
- $B$: Trainable matrix (dimensions $d \times r$), initialized to zeros.
- $A$: Trainable matrix (dimensions $r \times k$), initialized with small Gaussian noise.
- $r$: The rank. A hyperparameter (e.g., 8, 16, 64) that is much smaller than $\min(d, k)$.
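To make the mechanics concrete, here is a minimal sketch of a LoRA-wrapped linear layer in PyTorch. The class name `LoRALinear` and the `alpha / r` scaling convention are illustrative choices (the scaling follows the original paper's recipe), not any specific library's API:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update.

    Illustrative sketch: h = W0 x + (alpha / r) * B A x, where only A and B train.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze W0
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d, k = base.out_features, base.in_features
        # A starts as small Gaussian noise, B as zeros, so the update B @ A
        # is zero at step 0 and training begins exactly from W0.
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(4096, 4096), r=8)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
    print(f"trainable: {trainable:,} vs frozen: {frozen:,}")
    # trainable: 65,536 vs frozen: 16,781,312 -- roughly 256x fewer for this layer
```

Note how the savings compound with rank: the trainable parameter count is $r(d + k)$ instead of $d \times k$, so halving $r$ halves the adapter size.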
The Intuition
LoRA rests on the hypothesis that the “intrinsic dimension” of the change needed to adapt a pre-trained model to a new task is low. You don’t need to change every single parameter to teach a general model to speak like a pirate; you just need to shift its representations in a specific, low-rank direction.
- Analogy: Instead of rewriting an entire textbook (Full Fine-Tuning) to add a new chapter, you add a small “sticky note” (LoRA matrices) that modifies the relevant sections.
Benefits
- Efficiency: For GPT-3 175B, the LoRA paper reports reducing the number of trainable parameters by 10,000x and GPU memory requirements by 3x compared to full fine-tuning.
- No Added Latency: The learned matrices can be merged into the original weights after training, meaning there is no inference slowdown compared to the base model (see the merge sketch after this list).
- Portability: LoRA adapters are tiny (often <100MB) compared to the full model (tens of GBs), making them easy to share.
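A short sketch of the merge step, assuming the hypothetical `LoRALinear` module from the earlier snippet: after training, the low-rank product is folded into the frozen weight, leaving a plain `nn.Linear` that runs exactly as fast as the unmodified base layer.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_lora(layer: "LoRALinear") -> nn.Linear:
    # Fold the low-rank update into the frozen weight:
    #   W = W0 + scale * (B @ A)
    # The result is an ordinary linear layer with no extra matmuls at inference.
    merged = nn.Linear(layer.base.in_features, layer.base.out_features,
                       bias=layer.base.bias is not None)
    merged.weight.copy_(layer.base.weight + layer.scale * (layer.B @ layer.A))
    if layer.base.bias is not None:
        merged.bias.copy_(layer.base.bias)
    return merged
```

Because the merge is a plain weight addition, it is also reversible: subtracting `scale * (B @ A)` recovers the base weights, which is part of what makes swapping small adapters in and out of one shared base model so cheap.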
