411: Low-Rank Adaptation (LoRA)¶
Chapter Overview
LoRA (Low-Rank Adaptation) is the most popular and effective PEFT method used today. It achieves performance comparable to full fine-tuning while only training a tiny fraction (often <0.1%) of the model's parameters.
The core insight of LoRA is based on the hypothesis that the "change" in a model's weights during adaptation has a low intrinsic rank.
The Core Idea: Decomposing the Update¶
Instead of directly updating a large, pre-trained weight matrix \(W\) (a single projection matrix in a modern LLM can contain millions of parameters), LoRA freezes \(W\) and learns its change, \(\Delta W\), indirectly.

LoRA represents this change \(\Delta W\) as the product of two much smaller, low-rank matrices \(B\) and \(A\):

\[
\Delta W = B A
\]

This is a low-rank decomposition: for a small rank \(r\), the two factors together contain far fewer trainable parameters than \(W\) itself.
```mermaid
graph TD
    subgraph "Original Weight Matrix (Frozen)"
        W("W<br/>d × k<br/>(e.g., 4096 × 4096)<br/>16.7M Parameters")
    end
    subgraph "LoRA Adapter Matrices (Trainable)"
        A("A<br/>r × k<br/>(e.g., 8 × 4096)<br/>32k Params") --> B("B<br/>d × r<br/>(e.g., 4096 × 8)<br/>32k Params")
        note1["Down-projection"]
        note2["Up-projection"]
    end
    C["Total Trainable Params:<br/>~65,000 (~0.4% of original)"]
    A -.-> note1
    B -.-> note2
    B --> C
    style W fill:#e3f2fd,stroke:#1976d2
    style A fill:#e8f5e8,stroke:#388e3c
    style B fill:#e8f5e8,stroke:#388e3c
    style C fill:#c8e6c9,stroke:#1B5E20
```
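A quick way to sanity-check the numbers in the diagram is to count parameters directly. The sketch below (plain Python, dimensions taken from the example above) compares the frozen matrix with its LoRA factors:

```python
# Parameter counts for the example above: d = k = 4096, rank r = 8.
d, k, r = 4096, 4096, 8

full_params = d * k          # frozen W: 16,777,216 (~16.7M)
lora_params = d * r + r * k  # B (d x r) + A (r x k): 65,536

print(f"Frozen W:        {full_params:,}")
print(f"LoRA (B and A):  {lora_params:,}")
print(f"Trainable share: {lora_params / full_params:.2%}")  # ~0.39%
```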
Mathematical Foundation¶
The Low-Rank Hypothesis¶
LoRA is based on the hypothesis that the change in weights during fine-tuning has a low intrinsic rank. This means the adapted weight matrix can be written as:

\[
W = W_0 + \Delta W = W_0 + B A
\]

Where:

- \(W_0\) is the pre-trained weight matrix (frozen)
- \(\Delta W\) is the change we want to learn
- \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\) are low-rank matrices
- \(r \ll \min(d, k)\) is the rank
Forward Pass¶
During the forward pass, the output is computed as:

\[
h = W_0 x + \Delta W x = W_0 x + B A x
\]

Where \(x\) is the input and \(h\) is the output. Because \(W_0\) is frozen, only \(B\) and \(A\) receive gradient updates during training.
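A minimal PyTorch sketch of this computation; the class name `LoRALinear`, the initialization scheme, and the layer size are illustrative, not the internal implementation of any particular library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update: h = W0 x + B A x."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # freeze W0 (and its bias)
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # down-projection (r x k), small random init
        self.B = nn.Parameter(torch.zeros(d, r))         # up-projection (d x r), zero init so ΔW = 0 at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T    # W0 x + B A x

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536
```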
Key Parameters¶
Rank (r)¶
The most important hyperparameter in LoRA. It determines the dimensionality of the low-rank adaptation.
- Lower rank (r=1-4): Fewer parameters, more efficient, but potentially less expressive
- Higher rank (r=16-64): More parameters, more expressive, but less efficient
Alpha (α)¶
A scaling factor that controls the magnitude of the LoRA adaptation. The adapter output is scaled by \(\frac{\alpha}{r}\) before being added to the frozen path:

\[
h = W_0 x + \frac{\alpha}{r} B A x
\]

A common heuristic is to set \(\alpha\) equal to the rank or to twice the rank (e.g., \(r = 16\), \(\alpha = 32\)).
Target Modules¶
Which layers to apply LoRA to:

- Query/Key/Value projections in attention layers (most common)
- Feed-forward layers
- Output projections
Practical Implementation¶
LoRA Configuration Example
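A minimal sketch using the Hugging Face `peft` library; the model name and hyperparameter values below are illustrative placeholders rather than recommendations:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative base model; substitute any causal LM you have access to.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                    # rank of the update matrices
    lora_alpha=32,           # scaling factor (effective scale = alpha / r)
    lora_dropout=0.05,       # dropout applied to the LoRA path
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",             # leave bias terms frozen
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```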
Memory and Computational Benefits¶
Memory Comparison¶
| Model Size | Full Fine-Tuning | LoRA (r=16) | Memory Reduction |
|---|---|---|---|
| 7B params | ~28GB VRAM | ~8GB VRAM | 71% reduction |
| 13B params | ~52GB VRAM | ~12GB VRAM | 77% reduction |
| 70B params | ~280GB VRAM | ~35GB VRAM | 87% reduction |
Parameter Efficiency¶
```mermaid
graph LR
    subgraph "7B Parameter Model"
        A[Full Model<br/>7,000,000,000 params] -->|"LoRA r=16"| B[Trainable Params<br/>~4,000,000 params]
        C[Efficiency<br/>99.94% reduction]
    end
    style A fill:#ffcdd2,stroke:#B71C1C
    style B fill:#c8e6c9,stroke:#1B5E20
    style C fill:#e8f5e8,stroke:#388e3c
```
Advantages of LoRA¶
1. Extreme Parameter Efficiency¶
- Typically <1% of original parameters need training
- Enables fine-tuning on consumer hardware
2. No Inference Latency¶
- LoRA weights can be merged with original weights
- No additional computational overhead during inference
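Merging simply folds the low-rank product back into the frozen weights, \(W' = W_0 + \frac{\alpha}{r} B A\). With the `peft` library this is a one-liner; the sketch below assumes `model` is a `PeftModel` as in the configuration example above, and the output path is hypothetical:

```python
# After training, fold the adapters into the base weights so inference
# uses a single dense matrix per layer -- no extra matmuls at runtime.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama2-7b-merged")  # hypothetical output path
```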
3. Modular and Swappable¶
- Easy to switch between different LoRA adapters
- One base model can serve multiple tasks
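Because adapters are small and self-contained, one base model can host several of them. The sketch below uses the `peft` adapter-management API; the adapter names and paths are made up for illustration:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load one adapter, then attach a second one to the same base model.
model = PeftModel.from_pretrained(base, "adapters/customer-support", adapter_name="support")
model.load_adapter("adapters/sql-generation", adapter_name="sql")

model.set_adapter("support")  # route requests through the support adapter
# ... generate ...
model.set_adapter("sql")      # switch tasks without reloading the 7B base weights
```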
4. Storage Efficient¶
- Only need to store small adapter weights
- Easy to distribute and version control
Best Practices¶
Choosing the Right Rank¶
Rank Selection Guidelines
- Start with r=16 for most tasks
- Use r=4-8 for simple tasks or when memory is very limited
- Use r=32-64 for complex tasks requiring high expressiveness
- Monitor performance vs. efficiency trade-offs
Target Module Selection¶
Module Selection Strategy
- Attention layers first: `q_proj`, `k_proj`, `v_proj`
- Add output projection: `o_proj` for more capacity
- Include feed-forward: `gate_proj`, `up_proj`, `down_proj` for complex tasks
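As a rough sketch, these three tiers translate into progressively larger `target_modules` lists (module names follow the LLaMA-style naming used above; other architectures may name their projections differently):

```python
# Progressively wider LoRA coverage (LLaMA-style module names).
ATTENTION_ONLY = ["q_proj", "k_proj", "v_proj"]
ATTENTION_PLUS_OUTPUT = ATTENTION_ONLY + ["o_proj"]
ALL_LINEAR = ATTENTION_PLUS_OUTPUT + ["gate_proj", "up_proj", "down_proj"]

# Drop the chosen list into LoraConfig(target_modules=...), e.g.:
# LoraConfig(r=16, lora_alpha=32, target_modules=ALL_LINEAR, task_type=TaskType.CAUSAL_LM)
```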
Interactive Exercise¶
Calculate LoRA Efficiency
Given a transformer layer with:

- Hidden dimension: 4096
- Each attention projection: 4096 × 4096 parameters
- LoRA rank: 16

Calculate:

1. Original parameters in one projection layer
2. LoRA parameters for the same layer
3. Parameter reduction percentage

Solution:

1. Original: 4096 × 4096 = 16,777,216 parameters
2. LoRA: (4096 × 16) + (16 × 4096) = 131,072 parameters
3. Reduction: 99.22%
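A quick check of the arithmetic in plain Python:

```python
d = k = 4096
r = 16
original = d * k      # 16,777,216
lora = d * r + r * k  # 131,072
print(f"reduction: {1 - lora / original:.2%}")  # 99.22%
```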
Common Pitfalls¶
1. Rank Too Low¶
- Insufficient capacity for complex adaptations
- Poor performance on downstream tasks
2. Rank Too High¶
- Diminishing returns on performance
- Increased memory usage and training time
3. Wrong Target Modules¶
- Missing critical layers for the task
- Applying to too many layers unnecessarily
Next Steps¶
- [[412-QLoRA]]: Learn how to combine LoRA with quantization for even greater efficiency
- [[413-Adapter-Tuning]]: Understand the foundational PEFT approach that inspired LoRA
- Practice: Try implementing LoRA on a small model to see the concepts in action