
411: Low-Rank Adaptation (LoRA)

Chapter Overview

LoRA (Low-Rank Adaptation) is the most popular and effective PEFT method used today. It achieves performance comparable to full fine-tuning while only training a tiny fraction (often <0.1%) of the model's parameters.

The core insight of LoRA is based on the hypothesis that the "change" in a model's weights during adaptation has a low intrinsic rank.


The Core Idea: Decomposing the Update

Instead of directly updating a large, pre-trained weight matrix W (which can have billions of parameters), LoRA freezes W and learns its change, ΔW, indirectly.

LoRA represents this change ΔW as the product of two much smaller, "low-rank" matrices: A and B.

ΔW = B @ A

This is a low-rank decomposition.

graph TD
    subgraph "Original Weight Matrix (Frozen)"
        W("W<br/>d × k<br/>(e.g., 4096 × 4096)<br/>16.7M Parameters")
    end

    subgraph "LoRA Adapter Matrices (Trainable)"
        A("A<br/>d × r<br/>(e.g., 4096 × 8)<br/>32k Params") --> B("B<br/>r × k<br/>(e.g., 8 × 4096)<br/>32k Params")
        note1["Down-projection"]
        note2["Up-projection"]
    end

    C["Total Trainable Params:<br/>~65,000 (~0.4% of original)"]

    A -.-> note1
    B -.-> note2
    B --> C

    style W fill:#e3f2fd,stroke:#1976d2
    style A fill:#e8f5e8,stroke:#388e3c
    style B fill:#e8f5e8,stroke:#388e3c
    style C fill:#c8e6c9,stroke:#1B5E20
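
As a rough sanity check of the numbers in the diagram above, the following NumPy sketch (using the hypothetical d = k = 4096 and r = 8 from the figure) builds the two adapter matrices and counts their parameters:

# Minimal NumPy sketch of the low-rank decomposition (illustrative only)
import numpy as np

d, k, r = 4096, 4096, 8            # dimensions from the diagram; r is the LoRA rank

W = np.zeros((d, k))               # stand-in for the frozen pre-trained weight
B = np.zeros((d, r))               # up-projection; LoRA initializes B to zero
A = np.random.randn(r, k) * 0.01   # down-projection; small random init

delta_W = B @ A                    # (d, r) @ (r, k) -> (d, k), same shape as W

print(W.size)                      # 16,777,216 frozen parameters
print(A.size + B.size)             # 65,536 trainable parameters (~0.4% of W)

Because B starts at zero, the adapted model is identical to the pre-trained model at the beginning of training.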

Mathematical Foundation

The Low-Rank Hypothesis

LoRA is based on the hypothesis that the change in weights during fine-tuning has a low intrinsic rank. This means:

\[W = W_0 + \Delta W = W_0 + BA\]

Where:

  • \(W_0\) is the pre-trained weight matrix (frozen)
  • \(\Delta W\) is the change we want to learn
  • \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\) are low-rank matrices
  • \(r \ll \min(d, k)\) is the rank

Forward Pass

During inference, the output is computed as:

\[h = W_0 x + \Delta W x = W_0 x + BA x\]

Where \(x\) is the input and \(h\) is the output.
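
As a minimal illustration (a sketch, not a full implementation), a linear layer with a LoRA branch can be written in PyTorch as follows; the α/r scaling introduced in the next section is omitted here:

# Sketch of h = W0 x + B A x as a PyTorch module (illustrative only)
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)              # W0 is frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection (random init)
        self.B = nn.Parameter(torch.zeros(d_out, r))        # up-projection (zero init, so ΔW starts at 0)

    def forward(self, x):
        # h = W0 x + B A x
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(4096, 4096, r=16)
x = torch.randn(2, 4096)
h = layer(x)    # shape: (2, 4096)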

Key Parameters

Rank (r)

The most important hyperparameter in LoRA. It determines the dimensionality of the low-rank adaptation.

  • Lower rank (r=1-4): Fewer parameters, more efficient, but potentially less expressive
  • Higher rank (r=16-64): More parameters, more expressive, but less efficient

Alpha (α)

A scaling factor that controls the magnitude of the LoRA adaptation:

\[\Delta W = \frac{\alpha}{r} \cdot BA\]
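
For example, with the values used in the configuration shown later (r = 16, α = 32), the LoRA branch is scaled by a constant factor of 2; in the sketch above, this is the only change to the forward pass:

# Applying the alpha/r scaling to the LoRA branch
r, alpha = 16, 32
scaling = alpha / r    # 2.0
# forward pass becomes: h = W0 x + scaling * (B @ (A @ x))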

Target Modules

Which layers to apply LoRA to:

  • Query/Key/Value projections in attention layers (most common)
  • Feed-forward layers
  • Output projections

Practical Implementation

LoRA Configuration Example

# Typical LoRA configuration for a 7B model
lora_config = {
    "r": 16,                    # Rank
    "alpha": 32,                # Alpha scaling
    "target_modules": ["q_proj", "k_proj", "v_proj"],
    "dropout": 0.1,
    "bias": "none"
}
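
If you are using the Hugging Face peft library, roughly the same settings map onto its LoraConfig. This is a sketch, and the base model identifier is just a placeholder; check the parameter names against the peft version you have installed:

# Rough peft equivalent of the dictionary above (sketch; verify against your peft version)
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-7b-base-model")  # placeholder model id

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, peft_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts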

Memory and Computational Benefits

Memory Comparison

Model Size    Full Fine-Tuning    LoRA (r=16)    Memory Reduction
7B params     ~28GB VRAM          ~8GB VRAM      ~71% reduction
13B params    ~52GB VRAM          ~12GB VRAM     ~77% reduction
70B params    ~280GB VRAM         ~35GB VRAM     ~87% reduction

Parameter Efficiency

graph LR
    subgraph "7B Parameter Model"
        A[Full Model<br/>7,000,000,000 params] -->|"LoRA r=16"| B[Trainable Params<br/>~4,000,000 params]
        C[Efficiency<br/>99.94% reduction]
    end

    style A fill:#ffcdd2,stroke:#B71C1C
    style B fill:#c8e6c9,stroke:#1B5E20
    style C fill:#e8f5e8,stroke:#388e3c

Advantages of LoRA

1. Extreme Parameter Efficiency

  • Typically <1% of original parameters need training
  • Enables fine-tuning on consumer hardware

2. No Inference Latency

  • LoRA weights can be merged with original weights
  • No additional computational overhead during inference
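
A minimal sketch of that merge (stand-in weights, using the α/r scaling defined earlier) shows that the merged layer produces the same output as keeping the adapter separate:

# Merging a LoRA adapter into the base weight (illustrative, stand-in values)
import torch

d, k, r, alpha = 4096, 4096, 16, 32
W0 = torch.randn(d, k)                  # frozen base weight
A  = torch.randn(r, k) * 0.01           # trained down-projection
B  = torch.randn(d, r) * 0.01           # trained up-projection

W_merged = W0 + (alpha / r) * (B @ A)   # one-time merge; nothing extra at inference

x = torch.randn(k)
h_separate = W0 @ x + (alpha / r) * (B @ (A @ x))
h_merged   = W_merged @ x
print(torch.allclose(h_separate, h_merged, atol=1e-3))   # True (up to float error)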

3. Modular and Swappable

  • Easy to switch between different LoRA adapters
  • One base model can serve multiple tasks

4. Storage Efficient

  • Only need to store small adapter weights
  • Easy to distribute and version control

Best Practices

Choosing the Right Rank

Rank Selection Guidelines

  • Start with r=16 for most tasks
  • Use r=4-8 for simple tasks or when memory is very limited
  • Use r=32-64 for complex tasks requiring high expressiveness
  • Monitor performance vs. efficiency trade-offs

Target Module Selection

Module Selection Strategy

  • Attention layers first: q_proj, k_proj, v_proj
  • Add output projection: o_proj for more capacity
  • Include feed-forward: gate_proj, up_proj, down_proj for complex tasks
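
As a concrete sketch, the three tiers above could be expressed as target-module lists (the names follow Llama-style checkpoints; other architectures use different module names):

# Progressively larger target_modules sets (Llama-style names, illustrative)
attention_only     = ["q_proj", "k_proj", "v_proj"]
attention_plus_out = ["q_proj", "k_proj", "v_proj", "o_proj"]
attention_and_mlp  = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"]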

Interactive Exercise

Calculate LoRA Efficiency

Given a transformer layer with:

  • Hidden dimension: 4096
  • Each attention projection: 4096 × 4096 parameters
  • LoRA rank: 16

Calculate:

  1. Original parameters in one projection layer
  2. LoRA parameters for the same layer
  3. Parameter reduction percentage

Solution:

  1. Original: 4096 × 4096 = 16,777,216 parameters
  2. LoRA: (4096 × 16) + (16 × 4096) = 131,072 parameters
  3. Reduction: 99.22%
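
A few lines of Python confirm the arithmetic:

# Quick check of the exercise arithmetic
d = k = 4096
r = 16
original = d * k                # 16,777,216
lora = d * r + r * k            # 131,072
print(f"{100 * (1 - lora / original):.2f}% reduction")   # 99.22% reduction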

Common Pitfalls

1. Rank Too Low

  • Insufficient capacity for complex adaptations
  • Poor performance on downstream tasks

2. Rank Too High

  • Diminishing returns on performance
  • Increased memory usage and training time

3. Wrong Target Modules

  • Missing critical layers for the task
  • Applying to too many layers unnecessarily

Next Steps

  • [[412-QLoRA]]: Learn how to combine LoRA with quantization for even greater efficiency
  • [[413-Adapter-Tuning]]: Understand the foundational PEFT approach that inspired LoRA
  • Practice: Try implementing LoRA on a small model to see the concepts in action