411: Low-Rank Adaptation (LoRA)¶
Chapter Overview
LoRA (Low-Rank Adaptation) is the most popular and effective PEFT method used today. It achieves performance comparable to full fine-tuning while only training a tiny fraction (often <0.1%) of the model's parameters.
The core insight of LoRA is based on the hypothesis that the "change" in a model's weights during adaptation has a low intrinsic rank.
The Core Idea: Decomposing the Update¶
Instead of directly updating a large, pre-trained weight matrix \(W\) (a single projection matrix in a modern LLM can contain millions of parameters), LoRA freezes \(W\) and learns its change, \(\Delta W\), indirectly.

LoRA represents this change \(\Delta W\) as the product of two much smaller, low-rank matrices \(B\) and \(A\):

\[
\Delta W = B A
\]

This is a low-rank decomposition: for a small rank \(r\), the two factors together contain far fewer trainable parameters than \(W\) itself.
```mermaid
graph TD
    subgraph "Original Weight Matrix (Frozen)"
        W("W<br/>d × k<br/>(e.g., 4096 × 4096)<br/>16.7M Parameters")
    end
    subgraph "LoRA Adapter Matrices (Trainable)"
        A("A<br/>r × k<br/>(e.g., 8 × 4096)<br/>32k Params") --> B("B<br/>d × r<br/>(e.g., 4096 × 8)<br/>32k Params")
        note1["Down-projection"]
        note2["Up-projection"]
    end
    C["Total Trainable Params:<br/>~65,000 (~0.4% of original)"]
    A -.-> note1
    B -.-> note2
    B --> C
    style W fill:#e3f2fd,stroke:#1976d2
    style A fill:#e8f5e8,stroke:#388e3c
    style B fill:#e8f5e8,stroke:#388e3c
    style C fill:#c8e6c9,stroke:#1B5E20
```
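A quick way to sanity-check the numbers in the diagram is to count parameters directly. The sketch below (plain Python, dimensions taken from the example above) compares the frozen matrix with its LoRA factors:

```python
# Parameter counts for the example above: d = k = 4096, rank r = 8.
d, k, r = 4096, 4096, 8

full_params = d * k          # frozen W: 16,777,216 (~16.7M)
lora_params = d * r + r * k  # B (d x r) + A (r x k): 65,536

print(f"Frozen W:        {full_params:,}")
print(f"LoRA (B and A):  {lora_params:,}")
print(f"Trainable share: {lora_params / full_params:.2%}")  # ~0.39%
```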
Mathematical Foundation¶
The Low-Rank Hypothesis¶
LoRA is based on the hypothesis that the change in weights during fine-tuning has a low intrinsic rank. This means the adapted weight matrix can be written as:

\[
W = W_0 + \Delta W = W_0 + B A
\]

Where:

- \(W_0\) is the pre-trained weight matrix (frozen)
- \(\Delta W\) is the change we want to learn
- \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\) are low-rank matrices
- \(r \ll \min(d, k)\) is the rank
Forward Pass¶
During the forward pass, the output is computed as:

\[
h = W_0 x + \Delta W x = W_0 x + B A x
\]

Where \(x\) is the input and \(h\) is the output. Because \(W_0\) is frozen, only \(B\) and \(A\) receive gradient updates during training.
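A minimal PyTorch sketch of this computation; the class name `LoRALinear`, the initialization scheme, and the layer size are illustrative, not the internal implementation of any particular library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update: h = W0 x + B A x."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # freeze W0 (and its bias)
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # down-projection (r x k), small random init
        self.B = nn.Parameter(torch.zeros(d, r))         # up-projection (d x r), zero init so ΔW = 0 at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T    # W0 x + B A x

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536
```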
Key Parameters¶
Rank (r)¶
The most important hyperparameter in LoRA. It determines the dimensionality of the low-rank adaptation.
- Lower rank (r=1-4): Fewer parameters, more efficient, but potentially less expressive
- Higher rank (r=16-64): More parameters, more expressive, but less efficient
Alpha (α)¶
A scaling factor that controls the magnitude of the LoRA adaptation. The adapter output is scaled by \(\frac{\alpha}{r}\) before being added to the frozen path:

\[
h = W_0 x + \frac{\alpha}{r} B A x
\]

A common heuristic is to set \(\alpha\) equal to the rank or to twice the rank (e.g., \(r = 16\), \(\alpha = 32\)).
Target Modules¶
Which layers to apply LoRA to:

- Query/Key/Value projections in attention layers (most common)
- Feed-forward layers
- Output projections
Practical Implementation¶
LoRA Configuration Example
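A minimal sketch using the Hugging Face `peft` library; the model name and hyperparameter values below are illustrative placeholders rather than recommendations:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative base model; substitute any causal LM you have access to.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                    # rank of the update matrices
    lora_alpha=32,           # scaling factor (effective scale = alpha / r)
    lora_dropout=0.05,       # dropout applied to the LoRA path
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",             # leave bias terms frozen
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```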
Memory and Computational Benefits¶
Memory Comparison¶
| Model Size | Full Fine-Tuning | LoRA (r=16) | Memory Reduction |
|---|---|---|---|
| 7B params | ~28GB VRAM | ~8GB VRAM | 71% reduction |
| 13B params | ~52GB VRAM | ~12GB VRAM | 77% reduction |
| 70B params | ~280GB VRAM | ~35GB VRAM | 87% reduction |
Parameter Efficiency¶
```mermaid
graph LR
    subgraph "7B Parameter Model"
        A[Full Model<br/>7,000,000,000 params] -->|"LoRA r=16"| B[Trainable Params<br/>~4,000,000 params]
        C[Efficiency<br/>99.94% reduction]
    end
    style A fill:#ffcdd2,stroke:#B71C1C
    style B fill:#c8e6c9,stroke:#1B5E20
    style C fill:#e8f5e8,stroke:#388e3c
```
Advantages of LoRA¶
1. Extreme Parameter Efficiency¶
- Typically <1% of original parameters need training
- Enables fine-tuning on consumer hardware
2. No Inference Latency¶
- LoRA weights can be merged with original weights
- No additional computational overhead during inference
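Merging simply folds the low-rank product back into the frozen weights, \(W' = W_0 + \frac{\alpha}{r} B A\). With the `peft` library this is a one-liner; the sketch below assumes `model` is a `PeftModel` as in the configuration example above, and the output path is hypothetical:

```python
# After training, fold the adapters into the base weights so inference
# uses a single dense matrix per layer -- no extra matmuls at runtime.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama2-7b-merged")  # hypothetical output path
```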
3. Modular and Swappable¶
- Easy to switch between different LoRA adapters
- One base model can serve multiple tasks
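Because adapters are small and self-contained, one base model can host several of them. The sketch below uses the `peft` adapter-management API; the adapter names and paths are made up for illustration:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load one adapter, then attach a second one to the same base model.
model = PeftModel.from_pretrained(base, "adapters/customer-support", adapter_name="support")
model.load_adapter("adapters/sql-generation", adapter_name="sql")

model.set_adapter("support")  # route requests through the support adapter
# ... generate ...
model.set_adapter("sql")      # switch tasks without reloading the 7B base weights
```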
4. Storage Efficient¶
- Only need to store small adapter weights
- Easy to distribute and version control
Best Practices¶
Choosing the Right Rank¶
Rank Selection Guidelines
- Start with r=16 for most tasks
- Use r=4-8 for simple tasks or when memory is very limited
- Use r=32-64 for complex tasks requiring high expressiveness
- Monitor performance vs. efficiency trade-offs
Target Module Selection¶
Module Selection Strategy
- Attention layers first: `q_proj`, `k_proj`, `v_proj`
- Add output projection: `o_proj` for more capacity
- Include feed-forward: `gate_proj`, `up_proj`, `down_proj` for complex tasks
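As a rough sketch, these three tiers translate into progressively larger `target_modules` lists (module names follow the LLaMA-style naming used above; other architectures may name their projections differently):

```python
# Progressively wider LoRA coverage (LLaMA-style module names).
ATTENTION_ONLY = ["q_proj", "k_proj", "v_proj"]
ATTENTION_PLUS_OUTPUT = ATTENTION_ONLY + ["o_proj"]
ALL_LINEAR = ATTENTION_PLUS_OUTPUT + ["gate_proj", "up_proj", "down_proj"]

# Drop the chosen list into LoraConfig(target_modules=...), e.g.:
# LoraConfig(r=16, lora_alpha=32, target_modules=ALL_LINEAR, task_type=TaskType.CAUSAL_LM)
```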
Interactive Exercise¶
Calculate LoRA Efficiency
Given a transformer layer with:

- Hidden dimension: 4096
- Each attention projection: 4096 × 4096 parameters
- LoRA rank: 16

Calculate:

1. Original parameters in one projection layer
2. LoRA parameters for the same layer
3. Parameter reduction percentage

Solution:

1. Original: 4096 × 4096 = 16,777,216 parameters
2. LoRA: (4096 × 16) + (16 × 4096) = 131,072 parameters
3. Reduction: 99.22%
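A quick check of the arithmetic in plain Python:

```python
d = k = 4096
r = 16
original = d * k      # 16,777,216
lora = d * r + r * k  # 131,072
print(f"reduction: {1 - lora / original:.2%}")  # 99.22%
```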
Common Pitfalls¶
1. Rank Too Low¶
- Insufficient capacity for complex adaptations
- Poor performance on downstream tasks
2. Rank Too High¶
- Diminishing returns on performance
- Increased memory usage and training time
3. Wrong Target Modules¶
- Missing critical layers for the task
- Applying to too many layers unnecessarily
Next Steps¶
- [[412-QLoRA]]: Learn how to combine LoRA with quantization for even greater efficiency
- [[413-Adapter-Tuning]]: Understand the foundational PEFT approach that inspired LoRA
- Practice: Try implementing LoRA on a small model to see the concepts in action