# 410: Parameter-Efficient Fine-Tuning (PEFT)
Chapter Overview
Parameter-Efficient Fine-Tuning (PEFT) is a collection of techniques designed to adapt a large Foundation Model to a new task by updating only a small fraction of its parameters.
PEFT is a breakthrough because it dramatically reduces the computational cost and memory requirements of fine-tuning, making it possible to customize massive models on consumer-grade hardware.
## The Problem: The Cost of Full Fine-Tuning
A full fine-tuning process, where every weight in a multi-billion parameter model is updated, is incredibly resource-intensive.
### Resource Challenges
- Memory Footprint: The model's weights, its gradients, and the optimizer states must all be held in GPU memory. For a large model, this can require hundreds of gigabytes of VRAM, far beyond what a single GPU can handle.
- Storage Cost: If you have one base model and need to customize it for 100 different tasks, full fine-tuning would require you to store 100 separate, massive copies of the entire model.
- Computational Overhead: Training all parameters requires significant computational resources and time.
PEFT methods address all three of these challenges elegantly.
## The PEFT Solution: Freeze and Inject
The core idea behind most PEFT methods is to freeze the vast majority of the pre-trained model's parameters and then inject a small number of new, trainable parameters at strategic locations.
```mermaid
graph TD
    subgraph "Full Fine-Tuning"
        A[Base Model<br/>7 Billion Parameters] --"Update All Weights"--> B[Fine-Tuned Model<br/>7B Trainable Parameters]
        C((Result:<br/>- Very High VRAM needed<br/>- Store a full 7B model copy))
        B --> C
    end
    subgraph "PEFT (e.g., LoRA)"
        D[Base Model<br/>7 Billion Parameters - FROZEN] --"Inject & Update Only Adapters"--> E[Small, Trainable Adapters<br/>4 Million Parameters]
        F((Result:<br/>- Low VRAM needed<br/>- Store only the tiny adapter))
        E --> F
    end
    style A fill:#ffcdd2,stroke:#B71C1C
    style C fill:#ffcdd2,stroke:#B71C1C
    style D fill:#c8e6c9,stroke:#1B5E20
    style F fill:#c8e6c9,stroke:#1B5E20
```
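The freeze-and-inject idea can be sketched in a few lines of NumPy. This is a hypothetical toy layer, not any library's API: the pre-trained weight `W` stays frozen, while only the injected low-rank factors `A` and `B` receive gradient updates.

```python
import numpy as np

# Toy "freeze and inject" sketch (illustrative names, not a real API).
rng = np.random.default_rng(0)
d, r = 8, 2                            # hidden size and adapter rank (toy values)

W = rng.normal(size=(d, d))            # pre-trained weight: FROZEN, never updated
A = rng.normal(size=(r, d)) * 0.01     # injected trainable factor
B = np.zeros((d, r))                   # injected trainable factor (zero init)

def forward(x):
    # Frozen path plus the small trainable detour.
    return x @ W.T + x @ A.T @ B.T

x = rng.normal(size=(4, d))
target = rng.normal(size=(4, d))

def loss(y):
    return float(np.mean((y - target) ** 2))

loss_before = loss(forward(x))
for _ in range(500):
    y = forward(x)
    grad_y = 2 * (y - target) / y.size  # dLoss/dy for mean squared error
    # Gradients reach only the injected factors; W gets no update at all.
    grad_B = grad_y.T @ (x @ A.T)
    grad_A = B.T @ grad_y.T @ x
    B -= 0.05 * grad_B
    A -= 0.05 * grad_A
loss_after = loss(forward(x))

trainable_params = A.size + B.size     # 2 * r * d = 32
frozen_params = W.size                 # d * d = 64
```

Even in this toy, the trainable parameter count is a fraction of the frozen one; in a real 7B model the ratio is far more extreme.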
## Key Benefits of PEFT

### 1. Memory Efficiency

PEFT dramatically reduces the memory requirements for fine-tuning by keeping the base model frozen and only updating a small set of parameters.

### 2. Storage Efficiency

Instead of storing multiple full model copies, you only need to store small adapter modules for each task.

### 3. Computational Efficiency

Training fewer parameters means faster training times and lower computational costs.

### 4. Modular Design

Adapters can be easily swapped in and out, allowing for flexible multi-task deployment.
## Types of PEFT Methods
### Low-Rank Adaptation (LoRA)
The most popular PEFT method, using low-rank matrix decomposition to efficiently represent weight updates.
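A minimal NumPy sketch of the LoRA reparameterization, W' = W + (α/r)·B·A (shapes and initialization follow the convention in the LoRA paper; the variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
d_out, d_in, r, alpha = 64, 64, 4, 8

W = rng.normal(size=(d_out, d_in))     # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small Gaussian init
B = np.zeros((d_out, r))               # trainable, zero init -> W' == W at start

delta_W = (alpha / r) * (B @ A)        # the low-rank weight update
W_merged = W + delta_W                 # adapters can be merged back for inference

full_params = W.size                   # d_out * d_in = 4096
lora_params = A.size + B.size          # r * (d_in + d_out) = 512
```

Because `B` starts at zero, training begins from the unmodified pre-trained model, and after training the update can be merged into `W`, adding no inference latency.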
### Adapter Tuning
Inserts small neural network modules between transformer layers.
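As a rough sketch (toy sizes, illustrative names), a Houlsby-style bottleneck adapter down-projects the hidden state, applies a nonlinearity, up-projects, and adds a residual connection; zero-initializing the up-projection makes the adapter start out as an identity function:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_bottleneck = 64, 8

W_down = rng.normal(size=(d_bottleneck, d_model)) * 0.01  # trainable down-projection
W_up = np.zeros((d_model, d_bottleneck))  # trainable; zero init -> identity at start

def adapter(h):
    # h: (batch, d_model) hidden states from a frozen transformer layer
    z = np.maximum(0.0, h @ W_down.T)     # down-projection + ReLU bottleneck
    return h + z @ W_up.T                 # up-projection + residual connection

h = rng.normal(size=(2, d_model))
out = adapter(h)
adapter_params = W_down.size + W_up.size  # 2 * d_model * d_bottleneck = 1024
```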
### Prefix Tuning
Prepends trainable prefix tokens to the input sequence.
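Conceptually (toy shapes, hypothetical names), the trainable part is just a small matrix of prefix vectors concatenated in front of the frozen token embeddings, which the model then attends over:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, prefix_len, seq_len = 64, 10, 32

prefix = rng.normal(size=(prefix_len, d_model)) * 0.01  # trainable prefix vectors
tokens = rng.normal(size=(seq_len, d_model))            # frozen embeddings (stand-in)

augmented = np.concatenate([prefix, tokens], axis=0)    # sequence the model sees
trainable_params = prefix.size                          # prefix_len * d_model = 640
```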
### P-Tuning
Optimizes continuous prompts in the embedding space.
## Real-World Impact
Practical Example
Consider fine-tuning a 7B parameter model like LLaMA:
- Full Fine-Tuning: weights and gradients alone need ~28GB in FP16 (optimizer states and activations add more on top); stores all 7B parameters per task
- LoRA: Requires ~8GB VRAM; stores only ~4M adapter parameters per task
This means you can fine-tune on a single consumer GPU instead of requiring expensive enterprise hardware!
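The storage claim is simple arithmetic (FP16 = 2 bytes per parameter; the 100-task scenario mirrors the earlier example):

```python
BYTES_PER_PARAM_FP16 = 2
base_params = 7_000_000_000      # 7B base model
adapter_params = 4_000_000       # ~4M LoRA parameters per task
num_tasks = 100

# Full fine-tuning: one complete model copy per task.
full_ft_gb = base_params * BYTES_PER_PARAM_FP16 * num_tasks / 1e9

# PEFT: one shared base model plus one tiny adapter per task.
peft_gb = (base_params + adapter_params * num_tasks) * BYTES_PER_PARAM_FP16 / 1e9
```

That is 1,400 GB of checkpoints for full fine-tuning versus under 15 GB for the shared base model plus all 100 adapters.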
## Interactive Exercise
Think About It
If you had a base model and wanted to adapt it for 10 different tasks:
- How much storage would you need for full fine-tuning vs. PEFT?
- What are the trade-offs between parameter efficiency and model performance?
- Which scenarios would benefit most from PEFT approaches?
## Next Steps
In the following chapters, we'll dive deep into specific PEFT techniques:
- [[411-Low-Rank-Adaptation-LoRA]]: The most widely used PEFT method
- [[412-QLoRA]]: Combining quantization with LoRA for maximum efficiency
- [[413-Adapter-Tuning]]: The foundational PEFT approach
Each technique offers unique advantages and is suited for different use cases and resource constraints.