
422: RLHF - Step 2: RL Fine-Tuning with PPO

Chapter Overview

This is the final and most complex stage of the RLHF pipeline. In this phase, we use a Reinforcement Learning (RL) algorithm to fine-tune our SFT model (now called the policy) to maximize the score from our trained Reward Model.

The most common algorithm used for this is Proximal Policy Optimization (PPO).


The Reinforcement Learning Setup

In the context of RLHF, we map language generation to the RL framework:

| RL Component | RLHF Equivalent | Description |
|---|---|---|
| Agent | Policy Model (SFT Model) | The language model being optimized |
| Environment | Conversational Context | The prompt and conversation history |
| Action | Generated Token/Response | The text output from the model |
| Reward | Reward Model Score | Scalar value indicating quality |
| State | Current Text Context | The prompt + partial response |
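
To make the mapping concrete, one agent-environment interaction can be stored as a single record per prompt/response pair. Below is a minimal sketch; the field names are illustrative, not taken from any particular library.

from dataclasses import dataclass
from typing import List

@dataclass
class RolloutSample:
    """One agent-environment interaction in the RLHF framing."""
    prompt: str                  # environment / conversational context
    response: str                # action sequence produced by the policy
    response_token_ids: List[int]
    logprobs: List[float]        # per-token log-probs under the policy that generated them
    reward: float                # scalar score from the reward model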

The RLHF Training Loop

The process works as an iterative loop, continually refining the policy model based on feedback from the reward model.

flowchart TD
    subgraph "PPO Training Loop"
        A[1. Sample Prompt<br/>from training set] --> B[2. Generate Response<br/>using Policy Model]
        B --> C[3. Get Reward Score<br/>from Reward Model]
        C --> D[4. Calculate PPO Loss<br/>Reward + KL Penalty]
        D --> E[5. Update Policy Model<br/>using PPO algorithm]
        E --> F{6. Convergence<br/>Check}
        F -->|No| A
        F -->|Yes| G[✅ Aligned Model]
    end

    subgraph "Key Components"
        H[Policy Model<br/>πθ current]
        I[Reference Model<br/>π_ref frozen]
        J[Reward Model<br/>RM frozen]
        K[KL Divergence<br/>Penalty]
    end

    B --> H
    C --> J
    D --> I
    D --> K

    style A fill:#e3f2fd,stroke:#1976d2
    style C fill:#fde0dc,stroke:#c43829
    style E fill:#c8e6c9,stroke:#1B5E20
    style G fill:#c8e6c9,stroke:#1B5E20,stroke-width:3px

The PPO Algorithm

PPO is designed to make stable, conservative updates to the policy while maximizing rewards.

Core PPO Objective

L_PPO = E[min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)]

Where:

  • r_t(θ) = probability ratio between the new and old policy
  • A_t = advantage estimate (how much better this action is than average)
  • ε = clipping parameter (typically 0.2)
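
As a concrete illustration, here is a minimal PyTorch sketch of the clipped objective, written as a loss to be minimized. The tensor names and shapes are assumptions for this example, not the API of any specific library.

import torch

def ppo_clip_loss(new_logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  epsilon: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate, negated so it can be minimized."""
    # r_t(θ) = π_new(a|s) / π_old(a|s), computed in log space for stability
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the pessimistic (smaller) objective per token, then average
    return -torch.min(unclipped, clipped).mean()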

RLHF-Specific Modifications

Combined Loss Function:

Loss = L_PPO + β * L_KL + γ * L_entropy

Components:

  1. PPO Loss: Maximize reward while staying close to the previous policy
  2. KL Penalty: Prevent the model from deviating too far from the reference model
  3. Entropy Bonus: Encourage exploration and prevent mode collapse (as a loss term, L_entropy is the negative entropy, so adding it rewards more diverse outputs)
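
A minimal sketch of how the three terms might be combined, assuming per-token log-probabilities are available for the sampled responses from both the policy and the frozen reference model. The KL term here is the simple sampled approximation E[log π - log π_ref]; the coefficients are illustrative defaults.

import torch

def rlhf_loss(ppo_loss: torch.Tensor,
              policy_logprobs: torch.Tensor,
              reference_logprobs: torch.Tensor,
              entropy: torch.Tensor,
              beta: float = 0.05,
              gamma: float = 0.01) -> torch.Tensor:
    """Combine the PPO surrogate, KL penalty, and entropy bonus into one loss."""
    # Approximate KL(π || π_ref) from samples drawn from the policy
    kl_penalty = (policy_logprobs - reference_logprobs).mean()
    # Entropy is a bonus, so subtracting it here is equivalent to adding γ * L_entropy
    return ppo_loss + beta * kl_penalty - gamma * entropy.mean()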


Detailed Training Process

Phase 1: Rollout Generation

# Pseudocode for rollout generation
for batch in training_data:
    prompts = sample_prompts(batch)
    responses = policy_model.generate(prompts)

    # Score each (prompt, response) pair with the frozen reward model
    rewards = reward_model.score(prompts, responses)

    # Estimate per-state values, then compute advantages
    values = value_model.estimate(prompts, responses)
    advantages = calculate_advantages(rewards, values)

    # Store the experience for the PPO update phase
    rollout_buffer.add(prompts, responses, rewards, advantages)

Phase 2: Policy Updates

# Pseudocode for PPO policy updates
for epoch in range(ppo_epochs):
    for mini_batch in rollout_buffer:
        # Probability ratio r_t(θ) between the new and old policy
        old_log_probs = mini_batch.log_probs
        new_log_probs = policy_model.log_probs(mini_batch.responses)
        ratio = exp(new_log_probs - old_log_probs)

        # Clipped surrogate loss (negated so that we minimize it)
        advantages = mini_batch.advantages
        clipped_ratio = clip(ratio, 1 - epsilon, 1 + epsilon)
        ppo_loss = -min(ratio * advantages, clipped_ratio * advantages)

        # KL penalty keeps the policy close to the frozen reference model
        kl_penalty = beta * kl_divergence(policy_model, reference_model)

        # Combined loss
        total_loss = ppo_loss + kl_penalty

        # Backpropagate, then update the policy parameters
        total_loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Hyperparameter Tuning

Critical Parameters

| Parameter | Typical Range | Impact |
|---|---|---|
| Learning Rate | 1e-6 to 1e-4 | Too high: instability; too low: slow convergence |
| KL Coefficient (β) | 0.01 to 0.1 | Balance between reward and similarity to the reference model |
| Clip Epsilon (ε) | 0.1 to 0.3 | Controls how much the policy can change per update |
| Batch Size | 64 to 512 | Larger: more stable but slower; smaller: faster but noisier |
| PPO Epochs | 4 to 10 | More epochs: better optimization but risk of overfitting |
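
In practice these values are usually collected in a single configuration object. A minimal sketch follows, with defaults picked from the middle of the ranges above (not taken from any specific library).

from dataclasses import dataclass

@dataclass
class PPOConfig:
    learning_rate: float = 1e-5   # 1e-6 to 1e-4
    kl_coef: float = 0.05         # β, 0.01 to 0.1
    clip_epsilon: float = 0.2     # ε, 0.1 to 0.3
    batch_size: int = 256         # 64 to 512
    ppo_epochs: int = 4           # 4 to 10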

Dynamic Adjustments

  • Adaptive KL coefficient: Increase β if the KL divergence grows too large (a controller sketch follows this list)
  • Learning rate scheduling: Decay learning rate over time
  • Early stopping: Stop if reward plateaus or KL divergence explodes
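
Here is a minimal sketch of an adaptive KL controller in the style used by several RLHF implementations: β is nudged up when the observed KL overshoots a target and down when it undershoots. The target and horizon values are illustrative assumptions.

class AdaptiveKLController:
    """Adjust the KL coefficient β toward a target KL divergence."""

    def __init__(self, init_beta: float = 0.05, target_kl: float = 6.0, horizon: int = 10_000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Proportional error, clipped so each adjustment stays gentle
        error = max(min(observed_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta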

Monitoring and Debugging

Key Metrics to Track

  1. Reward Metrics:
     • Average reward per episode
     • Reward distribution
     • Reward trend over time

  2. Policy Metrics:
     • KL divergence from reference model
     • Policy entropy (diversity of outputs)
     • Gradient norms

  3. Training Stability:
     • Loss convergence
     • Policy ratio distribution
     • Explained variance
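
A minimal sketch of how several of these metrics can be estimated from quantities already computed during an update step. The entropy and KL values are rough sampled approximations, and the 0.2 clip threshold is an assumption matching ε above.

import torch

def training_metrics(rewards: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     reference_logprobs: torch.Tensor,
                     ratio: torch.Tensor) -> dict:
    """Summary statistics worth logging at every PPO step."""
    return {
        "reward/mean": rewards.mean().item(),
        "reward/std": rewards.std().item(),
        "policy/approx_kl": (policy_logprobs - reference_logprobs).mean().item(),
        "policy/entropy": (-policy_logprobs).mean().item(),            # rough estimate from sampled tokens
        "ppo/ratio_mean": ratio.mean().item(),
        "ppo/clip_fraction": ((ratio - 1.0).abs() > 0.2).float().mean().item(),  # fraction of tokens hitting the clip
    }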

Common Issues and Solutions

1. Reward Hacking

Symptoms: High reward scores but poor-quality outputs

Solutions:
  • Strengthen the KL penalty
  • Improve reward model training
  • Add additional constraint terms

2. Mode Collapse

Symptoms: Model produces repetitive, safe responses

Solutions:
  • Increase the entropy bonus
  • Reduce the KL coefficient
  • Use diverse training prompts

3. Training Instability

Symptoms: Erratic loss curves, NaN values

Solutions:
  • Reduce the learning rate
  • Increase gradient clipping (see the sketch below)
  • Check for numerical overflow
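
For the stability issues above, a minimal sketch of the usual guards, assuming a PyTorch training loop: clip gradient norms before the optimizer step and skip any batch whose loss is not finite.

import torch

def safe_update(total_loss: torch.Tensor,
                model: torch.nn.Module,
                optimizer: torch.optim.Optimizer,
                max_grad_norm: float = 1.0) -> bool:
    """Apply one optimizer step with gradient clipping and a NaN/Inf guard."""
    if not torch.isfinite(total_loss):
        optimizer.zero_grad()
        return False                      # skip this batch entirely
    optimizer.zero_grad()
    total_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return True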


Practical Example

Let's trace through a concrete training step:

PPO Training Step

Input Prompt: "Explain the concept of gravity to a 5-year-old"

Policy Response: "Gravity is like an invisible force that pulls things down. When you drop a ball, gravity pulls it to the ground!"

Reward Score: 8.2/10 (good explanation, age-appropriate)

KL Divergence: 0.05 (reasonably close to reference model)

PPO Calculation:
  • Policy ratio: 1.15 (slightly more likely than before)
  • Clipped ratio: 1.15 (within the ε bounds)
  • Advantage: +0.3 (above-average reward)
  • PPO loss: -0.3 * 1.15 = -0.345
  • KL penalty: 0.01 * 0.05 = 0.0005
  • Total loss: -0.345 + 0.0005 = -0.3445

Result: Policy parameters updated to make similar responses more likely
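
The arithmetic above can be reproduced directly; the values are copied from the example, with the KL coefficient β taken as 0.01.

ratio = 1.15          # policy ratio
advantage = 0.3       # advantage estimate
epsilon = 0.2         # clip range
beta = 0.01           # KL coefficient
kl = 0.05             # observed KL divergence

clipped_ratio = max(min(ratio, 1 + epsilon), 1 - epsilon)      # 1.15, inside the bounds
ppo_loss = -min(ratio * advantage, clipped_ratio * advantage)  # -0.345
kl_penalty = beta * kl                                         # 0.0005
total_loss = ppo_loss + kl_penalty                             # -0.3445
print(ppo_loss, kl_penalty, total_loss)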


Interactive Exercise

PPO Optimization Challenge

Scenario: You're training a coding assistant using PPO. After 1000 steps, you observe:

  • Average reward: 7.5/10 (good)
  • KL divergence: 0.8 (high)
  • Policy entropy: 0.1 (low)
  • Generated code quality: Repetitive but functional

Analysis Questions:

  1. What problems do you identify from these metrics?
  2. How would you adjust the hyperparameters?
  3. What risks might arise from your proposed changes?
  4. What additional metrics would you want to monitor?


Advanced Techniques

1. Value Function Learning

Train a separate value function to estimate expected future rewards:

V(s) = E[R_t | s_t = s]
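
A common implementation adds a small scalar head on top of the policy's hidden states. A minimal PyTorch sketch follows; the hidden size is an assumption for illustration.

import torch

class ValueHead(torch.nn.Module):
    """Maps a hidden state to a scalar value estimate V(s)."""

    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.linear = torch.nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> values: (batch, seq_len)
        return self.linear(hidden_states).squeeze(-1)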

2. Generalized Advantage Estimation (GAE)

Improve advantage estimates with an exponentially weighted average of multi-step TD residuals:

A_t = Σ_{l≥0} (γλ)^l * δ_{t+l},   where δ_t = r_t + γ·V(s_{t+1}) - V(s_t)
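
A minimal sketch of GAE over a single trajectory, computed backwards so each step reuses the accumulated (γλ) sum. It assumes `values` contains one extra entry for the state after the final reward.

from typing import List

def compute_gae(rewards: List[float],
                values: List[float],
                gamma: float = 1.0,
                lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation for one trajectory.

    `values` must have len(rewards) + 1 entries: the last one is V(s_T)
    for the state after the final reward (0.0 if the episode ends there).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual δ_t
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages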

3. Multiple Reward Models

Use ensemble of reward models to reduce single-model bias:

R_combined = Σ w_i * R_i(prompt, response)
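
A minimal sketch of the weighted combination, assuming each reward model exposes a `score(prompt, response)` callable; this interface is hypothetical, chosen only for illustration.

from typing import Callable, List, Tuple

def combined_reward(prompt: str,
                    response: str,
                    reward_models: List[Tuple[float, Callable[[str, str], float]]]) -> float:
    """Weighted sum of scores from several reward models."""
    return sum(weight * score(prompt, response) for weight, score in reward_models)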


Key Takeaways

  1. PPO provides stable optimization for aligning language models with human preferences
  2. KL divergence penalty is crucial for preventing the model from forgetting its original capabilities
  3. Hyperparameter tuning significantly impacts training success and stability
  4. Continuous monitoring is essential to detect and address training issues
  5. Reward hacking remains a fundamental challenge requiring careful design