422: RLHF - Step 2: RL Fine-Tuning with PPO¶
Chapter Overview
This is the final and most complex stage of the RLHF pipeline. In this phase, we use a Reinforcement Learning (RL) algorithm to fine-tune our SFT model (now called the policy) to maximize the score from our trained Reward Model.
The most common algorithm used for this is Proximal Policy Optimization (PPO).
The Reinforcement Learning Setup¶
In the context of RLHF, we map language generation to the RL framework:
RL Component | RLHF Equivalent | Description |
---|---|---|
Agent | Policy Model (SFT Model) | The language model being optimized |
Environment | Conversational Context | The prompt and conversation history |
Action | Generated Token/Response | The text output from the model |
Reward | Reward Model Score | Scalar value indicating quality |
State | Current Text Context | The prompt + partial response |
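As a rough sketch of how this mapping looks in code, a single RLHF transition can be bundled into a small record; the class and field names below are illustrative, not part of any standard API.

```python
from dataclasses import dataclass

@dataclass
class RLHFTransition:
    """One step of the RL formulation applied to language generation (illustrative)."""
    state: str       # prompt + partially generated response (current text context)
    action: str      # next token, or the full response if rewards are per-response
    reward: float    # scalar score from the reward model (often only at sequence end)
    log_prob: float  # log-probability the policy assigned to the action

# Example: a sequence-level transition for one completed response
t = RLHFTransition(
    state="Explain gravity to a 5-year-old:",
    action="Gravity is an invisible force that pulls things toward the ground.",
    reward=0.82,
    log_prob=-14.7,
)
print(t.reward)
```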
The RLHF Training Loop¶
The process works as an iterative loop, continually refining the policy model based on feedback from the reward model.
flowchart TD
subgraph "PPO Training Loop"
A[1. Sample Prompt<br/>from training set] --> B[2. Generate Response<br/>using Policy Model]
B --> C[3. Get Reward Score<br/>from Reward Model]
C --> D[4. Calculate PPO Loss<br/>Reward + KL Penalty]
D --> E[5. Update Policy Model<br/>using PPO algorithm]
E --> F{6. Convergence<br/>Check}
F -->|No| A
F -->|Yes| G[✅ Aligned Model]
end
subgraph "Key Components"
H[Policy Model<br/>πθ current]
I[Reference Model<br/>πθ frozen]
J[Reward Model<br/>RM frozen]
K[KL Divergence<br/>Penalty]
end
B --> H
C --> J
D --> I
D --> K
style A fill:#e3f2fd,stroke:#1976d2
style C fill:#fde0dc,stroke:#c43829
style E fill:#c8e6c9,stroke:#1B5E20
style G fill:#c8e6c9,stroke:#1B5E20,stroke-width:3px
The PPO Algorithm¶
PPO is designed to make stable, conservative updates to the policy while maximizing rewards.
Core PPO Objective¶

$$
L^{\text{CLIP}}(\theta) \;=\; \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,A_t,\;\operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\right],
\qquad
r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$

Where:

- r_t(θ) = probability ratio between the new and old policy
- A_t = advantage estimate (how much better this action is than the baseline)
- ε = clipping parameter (typically 0.2)
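To make the clipping concrete, here is a tiny numeric sketch (plain Python, made-up values) showing how the clipped term caps the credit an update can take when the ratio drifts outside [1 − ε, 1 + ε]:

```python
# Minimal numeric sketch of the clipped PPO objective (illustrative values)
epsilon = 0.2
advantage = 0.5   # A_t: this action was better than the baseline
ratio = 1.35      # r_t(θ): the new policy likes this action 35% more than the old one

clipped_ratio = max(1 - epsilon, min(ratio, 1 + epsilon))   # -> 1.2
objective = min(ratio * advantage, clipped_ratio * advantage)

print(clipped_ratio, objective)  # 1.2 0.6 -- the clip caps how far one update can push the policy
```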
RLHF-Specific Modifications¶
Combined Loss Function:

$$
L_{\text{total}}(\theta) \;=\; -\,L^{\text{CLIP}}(\theta) \;+\; \beta\, D_{\text{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \;-\; c\, H(\pi_\theta)
$$

Components:

1. PPO Loss: Maximize reward while staying close to the previous policy
2. KL Penalty: Prevent the model from deviating too far from the reference model
3. Entropy Bonus: Encourage exploration and prevent mode collapse
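A minimal PyTorch sketch of this combined loss, assuming per-token log-probabilities from the current policy, the rollout-time (old) policy, and the frozen reference model are already available; the function name, the simple log-ratio KL estimate, and the sampled-token entropy estimate are illustrative choices rather than a fixed recipe:

```python
import torch

def combined_ppo_loss(new_logprobs, old_logprobs, ref_logprobs, advantages,
                      epsilon=0.2, beta=0.05, entropy_coef=0.01):
    """Clipped surrogate + KL penalty - entropy bonus (all tensors shaped [batch, seq])."""
    ratio = torch.exp(new_logprobs - old_logprobs)

    # 1. Clipped PPO surrogate (negated, since we minimize)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    ppo_loss = -torch.min(unclipped, clipped).mean()

    # 2. KL penalty against the frozen reference model (simple log-ratio sample estimate)
    kl_penalty = beta * (new_logprobs - ref_logprobs).mean()

    # 3. Entropy bonus (approximated from the sampled tokens' log-probabilities)
    entropy_bonus = entropy_coef * (-new_logprobs).mean()

    return ppo_loss + kl_penalty - entropy_bonus

# Toy usage with random tensors
b, t = 2, 5
loss = combined_ppo_loss(torch.randn(b, t), torch.randn(b, t),
                         torch.randn(b, t), torch.randn(b, t))
print(loss)
```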
Detailed Training Process¶
Phase 1: Rollout Generation¶
# Pseudocode for rollout generation
for batch in training_data:
    prompts = sample_prompts(batch)
    responses = policy_model.generate(prompts)
    rewards = reward_model.score(prompts, responses)
    values = value_model.predict(prompts, responses)  # critic's estimate of expected reward

    # Calculate advantages (how much better each response was than expected)
    advantages = calculate_advantages(rewards, values)

    # Store experience for the PPO update phase
    rollout_buffer.add(prompts, responses, rewards, advantages)
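The calculate_advantages helper above is left abstract. A minimal sequence-level version, assuming one scalar reward and one value estimate per response (with whitening for stability), could look like this; GAE, covered under Advanced Techniques, is the more common token-level choice:

```python
import torch

def calculate_advantages(rewards, values):
    """Sequence-level advantages: reward minus the critic's baseline, then whitened."""
    advantages = rewards - values                                     # how much better than expected
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages

print(calculate_advantages(torch.tensor([0.8, 0.3, 0.6]), torch.tensor([0.5, 0.5, 0.5])))
```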
Phase 2: Policy Updates¶
# Pseudocode for PPO updates
for epoch in range(ppo_epochs):
    for mini_batch in rollout_buffer:
        # Probability ratio between the new and the rollout-time (old) policy
        old_log_probs = mini_batch.log_probs
        new_log_probs = policy_model.log_probs(mini_batch.responses)
        ratio = exp(new_log_probs - old_log_probs)

        # Clipped surrogate loss (negated, since we minimize)
        clipped_ratio = clip(ratio, 1 - epsilon, 1 + epsilon)
        ppo_loss = -min(ratio * mini_batch.advantages,
                        clipped_ratio * mini_batch.advantages).mean()

        # KL penalty keeps the policy close to the frozen reference model
        kl_penalty = beta * kl_divergence(policy_model, reference_model)

        # Combined loss
        total_loss = ppo_loss + kl_penalty

        # Update the policy
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
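The kl_divergence(policy_model, reference_model) call above hides the details; in practice the penalty is usually estimated per token from the two models' log-probabilities on the sampled responses. A minimal sketch of that estimator (function and argument names are illustrative):

```python
import torch

def estimate_kl(policy_logprobs, ref_logprobs, mask):
    """Per-token KL estimate between the policy and the frozen reference model.

    Uses the simple log-ratio estimator on the sampled tokens; `mask` is 1 for
    response tokens and 0 for prompt/padding tokens.
    """
    per_token_kl = policy_logprobs - ref_logprobs        # log pi_theta(a|s) - log pi_ref(a|s)
    return (per_token_kl * mask).sum() / mask.sum()      # mean over response tokens only

# Toy usage
lp = torch.tensor([[-1.2, -0.8, -2.0]])
ref = torch.tensor([[-1.3, -1.0, -1.9]])
mask = torch.tensor([[1.0, 1.0, 0.0]])
print(estimate_kl(lp, ref, mask))
```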
Hyperparameter Tuning¶
Critical Parameters¶
Parameter | Typical Range | Impact |
---|---|---|
Learning Rate | 1e-6 to 1e-4 | Too high: instability; Too low: slow convergence |
KL Coefficient (β) | 0.01 to 0.1 | Balance between reward and reference model similarity |
Clip Epsilon (ε) | 0.1 to 0.3 | Controls how much the policy can change per update |
Batch Size | 64 to 512 | Larger: more stable but slower; Smaller: faster but noisier |
PPO Epochs | 4 to 10 | More epochs: better optimization but risk of overfitting |
Dynamic Adjustments¶
- Adaptive KL coefficient: Increase β if KL divergence grows too large (see the sketch after this list)
- Learning rate scheduling: Decay learning rate over time
- Early stopping: Stop if reward plateaus or KL divergence explodes
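As an example of the adaptive KL coefficient, here is a minimal sketch of a proportional controller in the style popularized by early RLHF work; the class name, constants, and clipping range are illustrative. It nudges β up when the observed KL overshoots a target and down when it undershoots:

```python
class AdaptiveKLController:
    """Adjust the KL coefficient beta toward a target KL (illustrative sketch)."""

    def __init__(self, init_beta=0.05, target_kl=0.1, horizon=10_000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        # Proportional error, clipped to keep each adjustment gentle
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta

ctl = AdaptiveKLController()
print(ctl.update(observed_kl=0.3, n_steps=256))  # KL too high -> beta increases slightly
```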
Monitoring and Debugging¶
Key Metrics to Track¶
- Reward Metrics:
    - Average reward per episode
    - Reward distribution
    - Reward trend over time
- Policy Metrics:
    - KL divergence from reference model
    - Policy entropy (diversity of outputs)
    - Gradient norms
- Training Stability:
    - Loss convergence
    - Policy ratio distribution
    - Explained variance
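Several of these metrics fall directly out of tensors the training loop already has. A minimal sketch of computing a few of them from per-token log-probabilities and value predictions (names are illustrative):

```python
import torch

def training_metrics(new_logprobs, old_logprobs, ref_logprobs, values, returns):
    """Compute a handful of the metrics listed above from existing rollout tensors."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    return {
        "kl_to_reference": (new_logprobs - ref_logprobs).mean().item(),
        "policy_entropy": (-new_logprobs).mean().item(),   # sample-based estimate
        "ratio_mean": ratio.mean().item(),
        "ratio_max": ratio.max().item(),
        # Explained variance: how well the value head predicts the observed returns
        "explained_variance": (1 - (returns - values).var() / (returns.var() + 1e-8)).item(),
    }

b, t = 4, 8
print(training_metrics(torch.randn(b, t), torch.randn(b, t), torch.randn(b, t),
                       torch.randn(b), torch.randn(b)))
```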
Common Issues and Solutions¶
1. Reward Hacking¶
Symptoms: High reward scores but poor quality outputs

Solutions:

- Strengthen KL penalty
- Improve reward model training
- Add additional constraint terms
2. Mode Collapse¶
Symptoms: Model produces repetitive, safe responses

Solutions:

- Increase entropy bonus
- Reduce KL coefficient
- Use diverse training prompts
3. Training Instability¶
Symptoms: Erratic loss curves, NaN values

Solutions:

- Reduce learning rate
- Increase gradient clipping
- Check for numerical overflow
Practical Example¶
Let's trace through a concrete training step:
PPO Training Step
Input Prompt: "Explain the concept of gravity to a 5-year-old"
Policy Response: "Gravity is like an invisible force that pulls things down. When you drop a ball, gravity pulls it to the ground!"
Reward Score: 8.2/10 (good explanation, age-appropriate)
KL Divergence: 0.05 (reasonably close to reference model)
PPO Calculation:

- Policy ratio: 1.15 (slightly more likely than before)
- Clipped ratio: 1.15 (within epsilon bounds)
- Advantage: +0.3 (above average reward)
- PPO loss: -0.3 * 1.15 = -0.345
- KL penalty: 0.01 * 0.05 = 0.0005
- Total loss: -0.345 + 0.0005 = -0.3445
Result: Policy parameters updated to make similar responses more likely
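The arithmetic above can be verified with a few lines of Python (values copied from the example; the 8.2/10 reward enters only through the advantage estimate of +0.3):

```python
epsilon, beta = 0.2, 0.01
ratio, advantage, kl = 1.15, 0.3, 0.05

clipped_ratio = max(1 - epsilon, min(ratio, 1 + epsilon))      # 1.15, inside the clip range
ppo_loss = -min(ratio * advantage, clipped_ratio * advantage)  # -0.345
kl_penalty = beta * kl                                         # 0.0005
total_loss = ppo_loss + kl_penalty                             # -0.3445

print(ppo_loss, kl_penalty, total_loss)
```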
Interactive Exercise¶
PPO Optimization Challenge
Scenario: You're training a coding assistant using PPO. After 1000 steps, you observe:
- Average reward: 7.5/10 (good)
- KL divergence: 0.8 (high)
- Policy entropy: 0.1 (low)
- Generated code quality: Repetitive but functional
Analysis Questions:

1. What problems do you identify from these metrics?
2. How would you adjust the hyperparameters?
3. What risks might arise from your proposed changes?
4. What additional metrics would you want to monitor?
Advanced Techniques¶
1. Value Function Learning¶
Train a separate value function to estimate expected future rewards:
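A minimal sketch, assuming a small value head on top of the policy's hidden states and a squared-error loss against observed returns; the class name and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Small head mapping the policy's hidden states to a scalar value per token."""

    def __init__(self, hidden_size):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):                  # [batch, seq, hidden]
        return self.linear(hidden_states).squeeze(-1)  # [batch, seq]

# Value loss: squared error between predicted values and observed returns
hidden = torch.randn(2, 6, 768)
returns = torch.randn(2, 6)
value_head = ValueHead(hidden_size=768)
value_loss = ((value_head(hidden) - returns) ** 2).mean()
print(value_loss)
```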
2. Generalized Advantage Estimation (GAE)¶
Improve advantage calculations using exponential smoothing:
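A minimal sketch of GAE over one trajectory of per-token rewards and value estimates, following the standard formulation (γ is the discount factor, λ controls the exponential smoothing):

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory of length T."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last_adv = delta + gamma * lam * last_adv            # exponentially smoothed sum
        advantages[t] = last_adv
    return advantages

# Toy trajectory: reward arrives only at the final token
print(gae_advantages(torch.tensor([0.0, 0.0, 1.0]), torch.tensor([0.2, 0.4, 0.6])))
```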
3. Multiple Reward Models¶
Use an ensemble of reward models to reduce single-model bias:
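One simple way to combine an ensemble is to average the individual scores and optionally penalize disagreement, so the policy is not rewarded for outputs the models disagree on; the disagreement penalty below is one common heuristic, not the only option:

```python
import torch

def ensemble_reward(reward_models, prompt, response, disagreement_coef=0.5):
    """Average scores from several reward models, penalizing their disagreement."""
    scores = torch.stack([rm.score(prompt, response) for rm in reward_models])
    return scores.mean() - disagreement_coef * scores.std()

# Toy usage with stand-in reward models
class FakeRM:
    def __init__(self, bias):
        self.bias = bias
    def score(self, prompt, response):
        return torch.tensor(0.7 + self.bias)

print(ensemble_reward([FakeRM(0.0), FakeRM(0.1), FakeRM(-0.05)], "prompt", "response"))
```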
Key Takeaways¶
- PPO provides stable optimization for aligning language models with human preferences
- KL divergence penalty is crucial for preventing the model from forgetting its original capabilities
- Hyperparameter tuning significantly impacts training success and stability
- Continuous monitoring is essential to detect and address training issues
- Reward hacking remains a fundamental challenge requiring careful design
Navigation¶
- Next: 423: Direct Preference Optimization (DPO)
- Previous: 421: RLHF - Reward Modeling
- Overview: 420: Aligning with Human Preferences