512: Model Compression & Quantization¶
Chapter Overview
Model Compression refers to techniques used to reduce model size, making it faster and more memory-efficient for inference. These optimizations are applied directly to the model itself.
Of these techniques, Quantization is by far the most popular and impactful for modern LLMs.
The Main Compression Techniques¶
1. Quantization¶
Definition: The process of reducing the numerical precision of a model's weights.
How it Works: Converts weights from a high-precision format (32-bit floating point, FP32) to a lower-precision format such as 8-bit integer (INT8) or 4-bit NormalFloat (NF4).
Impact:
- Memory Reduction: A 13B-parameter model is ~52GB in FP32, ~13GB in 8-bit, and ~6.5GB in 4-bit.
- Speed Increase: Lower-precision arithmetic is faster on modern hardware.
Relevance: Core component of QLoRA fine-tuning and essential for running large models on consumer hardware.
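As a concrete illustration, here is a minimal sketch of loading a model in 4-bit NF4 with the Hugging Face `transformers` and `bitsandbytes` stack, in the QLoRA-style configuration. The model id is only a placeholder; any causal LM checkpoint follows the same pattern.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config (bitsandbytes), as used in QLoRA-style setups.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # de-quantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

# "meta-llama/Llama-2-13b-hf" is an illustrative model id, not a requirement.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Note that the weights are stored in 4 bits but de-quantized to the compute dtype (here bfloat16) for each matrix multiply, which is where the memory savings come from without changing the model's interface.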
2. Pruning¶
Definition: Removing unnecessary or less important weights from the model.
Types:
- Structured Pruning: Removes entire neurons, layers, or channels
- Unstructured Pruning: Removes individual weights based on magnitude
Trade-off: Significant size reduction, but the pruned model may require retraining to maintain performance.
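A minimal sketch of unstructured magnitude pruning using PyTorch's built-in `torch.nn.utils.prune` utilities; the toy two-layer network and the 30% pruning ratio are illustrative choices, not recommendations.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy two-layer network standing in for a real model.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Unstructured magnitude pruning: zero out the 30% of weights with the
# smallest absolute value in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the mask into the weights

# Fraction of parameters that are now exactly zero.
total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"sparsity: {zeros / total:.2%}")
```

Keep in mind that unstructured sparsity only saves memory or time if the runtime can exploit sparse tensors; structured pruning shrinks the dense layers directly.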
3. Knowledge Distillation¶
Definition: Training a smaller "student" model to mimic a larger "teacher" model.
Process: The student learns from the teacher's outputs rather than just the original training data.
Benefit: Creates compact models that retain much of the original performance.
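The standard recipe blends a "soft" loss against the teacher's temperature-smoothed output distribution with the usual "hard" cross-entropy on ground-truth labels. A minimal PyTorch sketch follows; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft KL term against the teacher with the usual hard-label loss."""
    # Soft targets: student matches the teacher's temperature-smoothed distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 scaling keeps gradient magnitudes comparable across temperatures
    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```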
Quantization Deep Dive¶
Common Quantization Formats¶
| Format | Bits | Memory (vs. FP32) | Quality | Use Case |
|---|---|---|---|---|
| FP32 | 32 | 100% | Highest | Training, Research |
| FP16 | 16 | 50% | High | Inference |
| INT8 | 8 | 25% | Good | Mobile, Edge |
| NF4 | 4 | 12.5% | Acceptable | Consumer Hardware |
Quantization Methods¶
Post-Training Quantization (PTQ):
- Applied after training is complete
- Faster to implement
- May have a slight quality loss
Quantization-Aware Training (QAT):
- Quantization is simulated during training
- Better quality preservation
- Requires more compute resources
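For PTQ, a minimal sketch using PyTorch's dynamic quantization, which converts `nn.Linear` weights to INT8 after training with no retraining step; the toy model is a stand-in for a real trained network.

```python
import torch
import torch.nn as nn

# A trained FP32 model (toy stand-in for a real network).
model_fp32 = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 10))

# Post-training dynamic quantization: Linear weights are converted to INT8
# after training; activations are quantized on the fly during inference.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,          # the trained model
    {nn.Linear},         # which module types to quantize
    dtype=torch.qint8,   # target weight precision
)

# Inference works exactly as before, just with a smaller, faster (CPU) model.
x = torch.randn(1, 768)
print(model_int8(x).shape)  # torch.Size([1, 10])
```

QAT instead inserts "fake quantization" ops during training so the model learns to be robust to the rounding it will see at inference time.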
Practical Applications¶
Hardware Considerations¶
- GPU Memory: Quantization allows larger models to fit in limited VRAM
- CPU Inference: Essential for running models on standard hardware
- Mobile Deployment: 4-bit quantization enables on-device AI
Real-World Example¶
Original Model: Llama 2 70B (~280GB in FP32, ~140GB in FP16)
After 4-bit Quantization: ~35GB
Result: Fits on a single 48GB GPU, or split across two 24GB consumer GPUs
Interactive Exercise¶
Try This: Calculate the memory savings for different quantization levels.
Given a model with 7 billion parameters:
1. FP32: 7B × 4 bytes = 28GB
2. INT8: 7B × 1 byte = 7GB
3. NF4: 7B × 0.5 bytes = 3.5GB
Memory Reduction: 8x smaller with 4-bit quantization!
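The same arithmetic as a small helper, so you can plug in other parameter counts or bit widths (decimal gigabytes, weights only, ignoring activation and KV-cache memory):

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight memory: parameters x bytes per parameter."""
    return num_params * (bits_per_param / 8) / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("NF4", 4)]:
    print(f"{name:>4}: {model_memory_gb(7e9, bits):.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, NF4: 3.5 GB
```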
Key Takeaways¶
- Quantization is the most impactful compression technique for LLMs
- 4-bit quantization (e.g., NF4) often offers a strong balance of size reduction and quality
- Essential for democratizing access to large language models
- Enables deployment on consumer hardware and mobile devices