Low-Rank Adaptation (LoRA) is one of the most widely used techniques for Parameter-Efficient Fine-Tuning (PEFT). It makes fine-tuning massive models tractable by freezing the original pre-trained weights and injecting pairs of trainable rank-decomposition matrices into layers of the Transformer architecture.
How it Works
Instead of updating the full weight matrix (which might have millions of parameters), LoRA represents the weight update $\Delta W$ as the product of two much smaller matrices, $B$ and $A$, so a forward pass computes:

$$h = W_0 x + \Delta W x = W_0 x + B A x$$

- $W_0$: Frozen pre-trained weights (dimensions $d \times k$).
- $B$: Trainable matrix (dimensions $d \times r$), initialized to zeros.
- $A$: Trainable matrix (dimensions $r \times k$), initialized with small Gaussian noise.
- $r$: The rank. A hyperparameter (e.g., 8, 16, 64) that is much smaller than $\min(d, k)$.
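To make the mechanics concrete, here is a minimal sketch of a LoRA-wrapped linear layer in PyTorch. The class name `LoRALinear` and the `alpha / r` scaling convention are illustrative choices (the scaling follows the original paper's recipe), not any specific library's API:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update.

    Illustrative sketch: h = W0 x + (alpha / r) * B A x, where only A and B train.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze W0
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d, k = base.out_features, base.in_features
        # A starts as small Gaussian noise, B as zeros, so the update B @ A
        # is zero at step 0 and training begins exactly from W0.
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(4096, 4096), r=8)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
    print(f"trainable: {trainable:,} vs frozen: {frozen:,}")
    # trainable: 65,536 vs frozen: 16,781,312 -- roughly 256x fewer for this layer
```

Note how the savings compound with rank: the trainable parameter count is $r(d + k)$ instead of $d \times k$, so halving $r$ halves the adapter size.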
The Intuition
LoRA rests on the hypothesis that the “intrinsic dimension” of the change needed to adapt a pre-trained model to a new task is low. You don’t need to change every single parameter to teach a general model to speak like a pirate; you just need to shift its representations in a specific, low-rank direction.
- Analogy: Instead of rewriting an entire textbook (Full Fine-Tuning) to add a new chapter, you add a small “sticky note” (LoRA matrices) that modifies the relevant sections.
Benefits
- Efficiency: For GPT-3 175B, the LoRA paper reports reducing the number of trainable parameters by 10,000x and GPU memory requirements by 3x compared to full fine-tuning.
- No Added Latency: The learned matrices can be merged into the original weights after training, meaning there is no inference slowdown compared to the base model (see the merge sketch after this list).
- Portability: LoRA adapters are tiny (often <100MB) compared to the full model (tens of GBs), making them easy to share.
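A short sketch of the merge step, assuming the hypothetical `LoRALinear` module from the earlier snippet: after training, the low-rank product is folded into the frozen weight, leaving a plain `nn.Linear` that runs exactly as fast as the unmodified base layer.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_lora(layer: "LoRALinear") -> nn.Linear:
    # Fold the low-rank update into the frozen weight:
    #   W = W0 + scale * (B @ A)
    # The result is an ordinary linear layer with no extra matmuls at inference.
    merged = nn.Linear(layer.base.in_features, layer.base.out_features,
                       bias=layer.base.bias is not None)
    merged.weight.copy_(layer.base.weight + layer.scale * (layer.B @ layer.A))
    if layer.base.bias is not None:
        merged.bias.copy_(layer.base.bias)
    return merged
```

Because the merge is a plain weight addition, it is also reversible: subtracting `scale * (B @ A)` recovers the base weights, which is part of what makes swapping small adapters in and out of one shared base model so cheap.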
