For Developers/Glossary/LoRA (Low-Rank Adaptation)
Training

LoRA (Low-Rank Adaptation)

A parameter-efficient fine-tuning method that adds small trainable matrices to frozen model weights, enabling fast and cheap fine-tuning.

LoRA (Low-Rank Adaptation) is a fine-tuning technique that avoids updating all model weights. Instead, it freezes the original weights and adds two small matrices (A and B) to each attention layer. Only A and B are trained. Because they're low-rank (their inner dimension is small), the total number of trainable parameters is a tiny fraction of the full model's weight count.

Why LoRA dominates open-source fine-tuning

Full fine-tuning a 70B parameter model requires multiple high-end GPUs, weeks of training, and significant storage. LoRA reduces trainable parameters by 10,000x while achieving comparable results on most tasks. A 70B model can be LoRA fine-tuned on a single A100 in hours. The resulting LoRA adapter (the A and B matrices) is typically 10-100MB rather than hundreds of GB.

QLoRA: LoRA at even lower cost

QLoRA combines LoRA with 4-bit quantization of the base model. The frozen base weights are stored in 4-bit, the LoRA adapters in 16-bit. This lets you fine-tune a 70B model on a single 48GB GPU - accessible on consumer high-end hardware or a single cloud GPU instance.

LoRA vs full fine-tuning

LoRA is best when the task is well-defined and the base model already has relevant capabilities. If you're trying to inject entirely new knowledge or radically change model behavior, full fine-tuning or continued pre-training may be necessary. For most practical adaptation tasks (format, style, domain-specific terminology), LoRA is the right default.

Related terms

Models relevant to LoRA (Low-Rank Adaptation)