Quantized Low-Rank Adaptation (QLoRA)

Quantized Low-Rank Adaptation (QLoRA) is an extension of Low-Rank Adaptation (LoRA) that enables fine-tuning of very large models (such as 65B or 70B parameter models) on a single 48GB GPU without degrading performance relative to full 16-bit fine-tuning.

It works by quantizing the frozen base model to 4-bit precision while keeping the LoRA adapter weights in higher precision (16-bit) during training; gradients are backpropagated through the frozen 4-bit weights into the adapters, which are the only parameters that get updated.
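In practice this is typically wired up with the Hugging Face transformers, peft, and bitsandbytes libraries. The following is a minimal sketch; the checkpoint name and LoRA hyperparameters are illustrative, not prescriptive:

```python
# Minimal sketch: load a frozen base model in 4-bit and attach 16-bit LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base model to 4-bit
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat (see below)
    bnb_4bit_use_double_quant=True,         # double quantization (see below)
    bnb_4bit_compute_dtype=torch.bfloat16,  # matrix multiplies run in 16-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which layers receive adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the adapter weights are trainable
```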

Key Innovations

QLoRA introduced three main memory-saving techniques:

1. 4-bit NormalFloat (NF4)

A data type whose quantization levels are information-theoretically optimal for normally distributed data. Because pretrained weights are approximately zero-centered and normally distributed, NF4 retains more fidelity of the original weights than standard 4-bit integers or floats at the same bit width.
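The exact bitsandbytes construction differs in details (it uses asymmetric bins and reserves an exact zero code), but the core idea can be sketched as: place the 16 levels at quantiles of a standard normal so each code covers an equal share of probability mass, then scale each weight block by its absolute maximum.

```python
# Sketch of the idea behind NF4 (not the exact bitsandbytes construction).
import numpy as np
from scipy.stats import norm

def normal_float_levels(bits: int = 4) -> np.ndarray:
    n = 2 ** bits
    probs = np.linspace(0.0, 1.0, n + 2)[1:-1]   # drop the infinite end quantiles
    levels = norm.ppf(probs)                     # quantiles of a standard normal
    return levels / np.abs(levels).max()         # normalize into [-1, 1]

def quantize_block(weights: np.ndarray, levels: np.ndarray):
    scale = np.abs(weights).max()                # per-block absmax scaling constant
    codes = np.abs(weights / scale - levels[:, None]).argmin(axis=0)
    return codes, scale                          # 4-bit codes + one float constant

levels = normal_float_levels()
codes, scale = quantize_block(np.random.randn(64), levels)
dequantized = levels[codes] * scale              # approximate reconstruction
```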

2. Double Quantization

Quantizing the quantization constants themselves. Each block of 4-bit weights carries a 32-bit scaling constant (its absolute maximum); double quantization compresses these constants to 8-bit using a second, coarser set of constants, saving roughly 0.37 bits per parameter (about 3GB for a 65B model).
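Back-of-the-envelope arithmetic with the block sizes reported in the QLoRA paper shows where the saving comes from:

```python
# Per-parameter overhead of the quantization constants.
W_BLOCK = 64      # weights sharing one first-level constant
C_BLOCK = 256     # first-level constants sharing one second-level constant

plain = 32 / W_BLOCK                             # one fp32 constant per 64 weights
double = 8 / W_BLOCK + 32 / (W_BLOCK * C_BLOCK)  # 8-bit constants + fp32 second level

print(f"without double quantization: {plain:.3f} bits/param")    # 0.500
print(f"with double quantization:    {double:.3f} bits/param")   # 0.127
print(f"saved:                       {plain - double:.3f} bits/param")  # 0.373
# ~0.37 bits/param is roughly 3 GB for a 65B-parameter model.
```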

3. Paged Optimizers

Uses NVIDIA Unified Memory to automatically page optimizer states to CPU RAM when the GPU runs low on memory, and page them back when the optimizer update needs them. This prevents the dreaded “Out of Memory” (OOM) errors caused by memory spikes during training (e.g., when processing a mini-batch with long sequences).
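A paged optimizer is a drop-in replacement for a regular one. The sketch below uses bitsandbytes' PagedAdamW32bit (the Hugging Face Trainer exposes the same idea via optim="paged_adamw_32bit"); the tiny linear model is only a placeholder to keep the snippet self-contained:

```python
# Minimal sketch: a paged AdamW from bitsandbytes as a drop-in optimizer (needs a CUDA GPU).
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()      # placeholder model
optimizer = bnb.optim.PagedAdamW32bit(model.parameters(), lr=2e-4)

loss = model(torch.randn(8, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()       # optimizer state lives in pageable (unified) memory,
optimizer.zero_grad()  # so memory spikes page to CPU RAM instead of raising OOM
```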

Impact

QLoRA “democratized” fine-tuning. Before QLoRA, regular 16-bit fine-tuning of a 65B model required more than 780GB of GPU memory (multiple A100s). With QLoRA, it can be done on a single 48GB GPU, making state-of-the-art model customization accessible to individual researchers and hobbyists.
