512: Model Compression & Quantization¶
Chapter Overview
Model Compression refers to techniques used to reduce model size, making it faster and more memory-efficient for inference. These optimizations are applied directly to the model itself.
Of these techniques, Quantization is by far the most popular and impactful for modern LLMs.
The Main Compression Techniques¶
1. Quantization¶
Definition: The process of reducing the numerical precision of a model's weights.
How it Works: Converts weights from a high-precision format (32-bit floating point, FP32) to a lower-precision format such as 8-bit integer (INT8) or 4-bit NormalFloat (NF4).
Impact:
- Memory Reduction: A 13B-parameter model is ~52GB in FP32, ~13GB in 8-bit, and ~6.5GB in 4-bit.
- Speed Increase: Lower-precision arithmetic is faster on modern hardware.
Relevance: Core component of QLoRA fine-tuning and essential for running large models on consumer hardware.
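As a concrete illustration, here is a minimal sketch of loading a model in 4-bit NF4 with the Hugging Face `transformers` and `bitsandbytes` stack, in the QLoRA-style configuration. The model id is only a placeholder; any causal LM checkpoint follows the same pattern.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config (bitsandbytes), as used in QLoRA-style setups.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # de-quantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

# "meta-llama/Llama-2-13b-hf" is an illustrative model id, not a requirement.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```

Note that the weights are stored in 4 bits but de-quantized to the compute dtype (here bfloat16) for each matrix multiply, which is where the memory savings come from without changing the model's interface.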
2. Pruning¶
Definition: Removing unnecessary or less important weights from the model.
Types:
- Structured Pruning: Removes entire neurons, layers, or channels
- Unstructured Pruning: Removes individual weights based on magnitude
Trade-off: Significant size reduction, but the pruned model may require retraining to maintain performance.
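A minimal sketch of unstructured magnitude pruning using PyTorch's built-in `torch.nn.utils.prune` utilities; the toy two-layer network and the 30% pruning ratio are illustrative choices, not recommendations.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy two-layer network standing in for a real model.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Unstructured magnitude pruning: zero out the 30% of weights with the
# smallest absolute value in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the mask into the weights

# Fraction of parameters that are now exactly zero.
total = sum(p.numel() for p in model.parameters())
zeros = sum((p == 0).sum().item() for p in model.parameters())
print(f"sparsity: {zeros / total:.2%}")
```

Keep in mind that unstructured sparsity only saves memory or time if the runtime can exploit sparse tensors; structured pruning shrinks the dense layers directly.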
3. Knowledge Distillation¶
Definition: Training a smaller "student" model to mimic a larger "teacher" model.
Process: The student learns from the teacher's outputs rather than just the original training data.
Benefit: Creates compact models that retain much of the original performance.
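The standard recipe blends a "soft" loss against the teacher's temperature-smoothed output distribution with the usual "hard" cross-entropy on ground-truth labels. A minimal PyTorch sketch follows; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft KL term against the teacher with the usual hard-label loss."""
    # Soft targets: student matches the teacher's temperature-smoothed distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 scaling keeps gradient magnitudes comparable across temperatures
    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```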
Quantization Deep Dive¶
Common Quantization Formats¶
| Format | Bits | Memory (vs. FP32) | Quality | Use Case |
|---|---|---|---|---|
| FP32 | 32 | 100% | Highest | Training, Research |
| FP16 | 16 | 50% | High | Inference |
| INT8 | 8 | 25% | Good | Mobile, Edge |
| NF4 | 4 | 12.5% | Acceptable | Consumer Hardware |
Quantization Methods¶
Post-Training Quantization (PTQ):
- Applied after training is complete
- Faster to implement
- May have a slight quality loss
Quantization-Aware Training (QAT):
- Quantization is simulated during training
- Better quality preservation
- Requires more compute resources
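For PTQ, a minimal sketch using PyTorch's dynamic quantization, which converts `nn.Linear` weights to INT8 after training with no retraining step; the toy model is a stand-in for a real trained network.

```python
import torch
import torch.nn as nn

# A trained FP32 model (toy stand-in for a real network).
model_fp32 = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 10))

# Post-training dynamic quantization: Linear weights are converted to INT8
# after training; activations are quantized on the fly during inference.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,          # the trained model
    {nn.Linear},         # which module types to quantize
    dtype=torch.qint8,   # target weight precision
)

# Inference works exactly as before, just with a smaller, faster (CPU) model.
x = torch.randn(1, 768)
print(model_int8(x).shape)  # torch.Size([1, 10])
```

QAT instead inserts "fake quantization" ops during training so the model learns to be robust to the rounding it will see at inference time.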
Practical Applications¶
Hardware Considerations¶
- GPU Memory: Quantization allows larger models to fit in limited VRAM
- CPU Inference: Essential for running models on standard hardware
- Mobile Deployment: 4-bit quantization enables on-device AI
Real-World Example¶
Original Model: Llama 2 70B (~280GB in FP32, ~140GB in FP16)
After 4-bit Quantization: ~35GB
Result: Fits on a single 48GB GPU, or split across two 24GB consumer GPUs
Interactive Exercise¶
Try This: Calculate the memory savings for different quantization levels.
Given a model with 7 billion parameters:
1. FP32: 7B × 4 bytes = 28GB
2. INT8: 7B × 1 byte = 7GB
3. NF4: 7B × 0.5 bytes = 3.5GB
Memory Reduction: 8x smaller with 4-bit quantization!
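The same arithmetic as a small helper, so you can plug in other parameter counts or bit widths (decimal gigabytes, weights only, ignoring activation and KV-cache memory):

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight memory: parameters x bytes per parameter."""
    return num_params * (bits_per_param / 8) / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("NF4", 4)]:
    print(f"{name:>4}: {model_memory_gb(7e9, bits):.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, NF4: 3.5 GB
```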
Key Takeaways¶
- Quantization is the most impactful compression technique for LLMs
- 4-bit quantization (e.g., NF4) often offers a strong balance of size reduction and quality
- Essential for democratizing access to large language models
- Enables deployment on consumer hardware and mobile devices