420: Aligning with Human Preferences (RLHF & DPO)¶
Topic Overview
Standard fine-tuning is excellent for teaching a model a new skill, but it struggles with teaching subjective qualities like "helpfulness," "harmlessness," or "honesty." How do you create a dataset for "being helpful"?
Alignment is the process of fine-tuning a model so that its behavior reflects complex, often subjective, human values. This section covers the two primary techniques for achieving alignment: Reinforcement Learning from Human Feedback (RLHF) and its modern successor, Direct Preference Optimization (DPO).
The Core Problem: Defining "Good"¶
It's easy to define a "good" response for a factual question, but much harder for an open-ended one. For example, which summary is "better"? Which creative story is more "engaging"?
These alignment techniques sidestep this problem by learning from human preferences rather than from static, pre-written "correct" answers: instead of a single gold response, the model is trained on comparisons showing which of two candidate responses humans prefer.
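In practice, that preference data is usually stored as prompt/chosen/rejected triples. The record below is a minimal sketch; the field names follow a common convention used by preference-tuning libraries, not a fixed standard.

```python
# A minimal sketch of one preference record (field names are a common
# convention, not a fixed standard).
preference_example = {
    "prompt": "Summarize the following article in two sentences: ...",
    "chosen": "A concise, faithful summary preferred by the annotator.",
    "rejected": "A vaguer or less helpful summary the annotator did not pick.",
}

# A full preference dataset is simply a list of such records, collected by
# showing human annotators two candidate responses and asking which is better.
preference_dataset = [preference_example]
```

Both RLHF and DPO consume exactly this kind of comparison data; they differ in how they turn it into a training signal.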
The Two Main Approaches¶
1. Reinforcement Learning from Human Feedback (RLHF)¶
This is the classic, three-step process that powered models like InstructGPT and the original ChatGPT. It's complex but powerful.
```mermaid
flowchart TD
    A[Step 1: Supervised Fine-Tuning] --> B[Step 2: Reward Model Training]
    B --> C[Step 3: RL Fine-Tuning with PPO]

    subgraph "RLHF Pipeline"
        direction TB
        A1[Teaches basic style<br/>and instruction following]
        B1[Learns human preferences<br/>through comparison data]
        C1[Optimizes policy to<br/>maximize reward scores]
    end

    A --> A1
    B --> B1
    C --> C1

    style A fill:#e3f2fd,stroke:#1976d2
    style B fill:#fff3e0,stroke:#f57c00
    style C fill:#e8f5e8,stroke:#388e3c
    style A1 fill:#e3f2fd,stroke:#1976d2,stroke-dasharray: 5 5
    style B1 fill:#fff3e0,stroke:#f57c00,stroke-dasharray: 5 5
    style C1 fill:#e8f5e8,stroke:#388e3c,stroke-dasharray: 5 5
```
Key Characteristics:

- **Proven track record**: Powers ChatGPT, Claude, and other major models
- **Complex but stable**: Three distinct phases with clear objectives
- **Resource intensive**: Requires training multiple models and human annotation
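The bridge between human comparisons and a trainable objective is Step 2: the reward model is trained with a pairwise (Bradley-Terry style) ranking loss so that it scores the chosen response above the rejected one. Below is a minimal sketch of that loss in plain PyTorch, assuming scalar reward scores have already been computed for each response; the tensors in the toy example are made up for illustration.

```python
import torch
import torch.nn.functional as F


def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss used in Step 2 of RLHF.

    chosen_rewards / rejected_rewards are the scalar scores the reward model
    assigns to the preferred and dispreferred responses for the same prompt.
    The loss pushes r(chosen) above r(rejected).
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy example: the model currently scores the first rejected answer higher
# than its chosen counterpart, so the loss is large and gradients correct it.
chosen = torch.tensor([0.2, 1.1])
rejected = torch.tensor([0.9, 0.3])
print(reward_model_loss(chosen, rejected))
```

Step 3 then uses PPO to fine-tune the policy against this learned reward, typically with a KL penalty toward the SFT model so the policy does not drift into degenerate, reward-hacking text.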
2. Direct Preference Optimization (DPO)¶
A newer, simpler approach that achieves similar results with less complexity.
```mermaid
flowchart TD
    A[Reference Model] --> B[DPO Training]
    C[Preference Dataset] --> B
    B --> D[Aligned Model]

    subgraph "DPO Advantages"
        E[✅ Single training phase]
        F[✅ No reward model needed]
        G[✅ Mathematically elegant]
        H[✅ Computationally efficient]
    end

    style A fill:#e3f2fd,stroke:#1976d2
    style B fill:#fff3e0,stroke:#f57c00
    style C fill:#f3e5f5,stroke:#7b1fa2
    style D fill:#c8e6c9,stroke:#1B5E20
```
Key Characteristics:

- **Simpler pipeline**: Direct optimization without intermediate models
- **Mathematically elegant**: Bypasses reward modeling through clever loss function design
- **Emerging standard**: Increasingly adopted by research and industry
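That "clever loss function" can be written down directly: DPO compares how much more likely the policy makes the chosen response than the rejected one, relative to a frozen reference model, with no reward model in the loop. The sketch below assumes you have already summed per-token log-probabilities into one log-probability per response for both the policy and the reference model; the numbers in the toy example are made up.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    Each argument is the total log-probability of a response under the policy
    or the frozen reference model. beta controls how strongly the policy may
    deviate from the reference.
    """
    # Implicit "reward" of each response: log-ratio of policy vs. reference.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected log-ratios.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


# Toy example with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -8.5]),
    policy_rejected_logps=torch.tensor([-11.0, -9.0]),
    ref_chosen_logps=torch.tensor([-12.5, -8.8]),
    ref_rejected_logps=torch.tensor([-10.5, -9.1]),
)
print(loss)
```

Because the reference model stays frozen, only one model is updated during training, which is where most of DPO's efficiency gains come from.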
When to Use Each Approach¶
| Scenario | Recommended Approach | Reasoning |
|---|---|---|
| Production systems | RLHF | Battle-tested, well-understood failure modes |
| Research experiments | DPO | Faster iteration, easier to debug |
| Limited compute | DPO | Requires fewer model training runs |
| Complex preferences | RLHF | Reward model provides interpretability |
Interactive Exercise¶
Alignment Challenge
Scenario: You're building a customer service chatbot. A user asks: "How do I cancel my subscription?"
Response A: "To cancel your subscription, go to Settings > Account > Cancel Subscription. Click 'Confirm' when prompted."
Response B: "I understand you want to cancel your subscription. While I can help with that, I'd love to know if there's anything we could do to improve your experience first. To cancel, visit Settings > Account > Cancel Subscription."
Question: Which response would you prefer and why? How might this preference data be used in RLHF vs DPO?
Key Takeaways¶
- Alignment addresses subjective qualities that traditional fine-tuning cannot handle effectively
- RLHF is the established method with proven results but higher complexity
- DPO is the emerging alternative offering similar results with simpler implementation
- Both methods rely on human preference data rather than absolute "correct" answers
- The choice depends on your specific use case and resource constraints
Navigation¶
- Next: 421: RLHF - Reward Modeling
- Also See: 423: Direct Preference Optimization (DPO)
- Previous: Fine-Tuning Overview