
422: RLHF - Step 2: RL Fine-Tuning with PPO

Chapter Overview

This is the final and most complex stage of the RLHF pipeline. In this phase, we use a Reinforcement Learning (RL) algorithm to fine-tune our SFT model (now called the policy) to maximize the score from our trained Reward Model.

The most common algorithm used for this is Proximal Policy Optimization (PPO).


The Reinforcement Learning Setup

In the context of RLHF, we map language generation to the RL framework:

| RL Component | RLHF Equivalent | Description |
|---|---|---|
| Agent | Policy Model (SFT Model) | The language model being optimized |
| Environment | Conversational Context | The prompt and conversation history |
| Action | Generated Token/Response | The text output from the model |
| Reward | Reward Model Score | Scalar value indicating quality |
| State | Current Text Context | The prompt + partial response |
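
To make the mapping concrete, one agent-environment interaction can be stored as a single record per prompt/response pair. Below is a minimal sketch; the field names are illustrative, not taken from any particular library.

from dataclasses import dataclass
from typing import List

@dataclass
class RolloutSample:
    """One agent-environment interaction in the RLHF framing."""
    prompt: str                  # environment / conversational context
    response: str                # action sequence produced by the policy
    response_token_ids: List[int]
    logprobs: List[float]        # per-token log-probs under the policy that generated them
    reward: float                # scalar score from the reward model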

The RLHF Training Loop

The process works as an iterative loop, continually refining the policy model based on feedback from the reward model.

flowchart TD
    subgraph "PPO Training Loop"
        A[1. Sample Prompt<br/>from training set] --> B[2. Generate Response<br/>using Policy Model]
        B --> C[3. Get Reward Score<br/>from Reward Model]
        C --> D[4. Calculate PPO Loss<br/>Reward + KL Penalty]
        D --> E[5. Update Policy Model<br/>using PPO algorithm]
        E --> F{6. Convergence<br/>Check}
        F -->|No| A
        F -->|Yes| G[✅ Aligned Model]
    end

    subgraph "Key Components"
        H[Policy Model<br/>πθ current]
        I[Reference Model<br/>π_ref frozen]
        J[Reward Model<br/>RM frozen]
        K[KL Divergence<br/>Penalty]
    end

    B --> H
    C --> J
    D --> I
    D --> K

    style A fill:#e3f2fd,stroke:#1976d2
    style C fill:#fde0dc,stroke:#c43829
    style E fill:#c8e6c9,stroke:#1B5E20
    style G fill:#c8e6c9,stroke:#1B5E20,stroke-width:3px

The PPO Algorithm

PPO is designed to make stable, conservative updates to the policy while maximizing rewards.

Core PPO Objective

L_PPO = E[min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)]

Where:

  • r_t(θ) = probability ratio between the new and old policy
  • A_t = advantage estimate (how much better this action is than average)
  • ε = clipping parameter (typically 0.2)
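
As a concrete illustration, here is a minimal PyTorch sketch of the clipped objective, written as a loss to be minimized. The tensor names and shapes are assumptions for this example, not the API of any specific library.

import torch

def ppo_clip_loss(new_logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  epsilon: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate, negated so it can be minimized."""
    # r_t(θ) = π_new(a|s) / π_old(a|s), computed in log space for stability
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the pessimistic (smaller) objective per token, then average
    return -torch.min(unclipped, clipped).mean()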

RLHF-Specific Modifications

Combined Loss Function:

Loss = L_PPO + β * L_KL + γ * L_entropy

Components:

  1. PPO Loss: Maximize reward while staying close to the previous policy
  2. KL Penalty: Prevent the model from deviating too far from the reference model
  3. Entropy Bonus: Encourage exploration and prevent mode collapse (as a loss term, L_entropy is the negative entropy, so adding it rewards more diverse outputs)
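
A minimal sketch of how the three terms might be combined, assuming per-token log-probabilities are available for the sampled responses from both the policy and the frozen reference model. The KL term here is the simple sampled approximation E[log π - log π_ref]; the coefficients are illustrative defaults.

import torch

def rlhf_loss(ppo_loss: torch.Tensor,
              policy_logprobs: torch.Tensor,
              reference_logprobs: torch.Tensor,
              entropy: torch.Tensor,
              beta: float = 0.05,
              gamma: float = 0.01) -> torch.Tensor:
    """Combine the PPO surrogate, KL penalty, and entropy bonus into one loss."""
    # Approximate KL(π || π_ref) from samples drawn from the policy
    kl_penalty = (policy_logprobs - reference_logprobs).mean()
    # Entropy is a bonus, so subtracting it here is equivalent to adding γ * L_entropy
    return ppo_loss + beta * kl_penalty - gamma * entropy.mean()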


Detailed Training Process

Phase 1: Rollout Generation

# Pseudocode for rollout generation
for batch in training_data:
    prompts = sample_prompts(batch)
    responses = policy_model.generate(prompts)

    # Score each (prompt, response) pair with the frozen reward model
    rewards = reward_model.score(prompts, responses)

    # Estimate per-state values, then compute advantages
    values = value_model.estimate(prompts, responses)
    advantages = calculate_advantages(rewards, values)

    # Store the experience for the PPO update phase
    rollout_buffer.add(prompts, responses, rewards, advantages)

Phase 2: Policy Updates

# Pseudocode for PPO policy updates
for epoch in range(ppo_epochs):
    for mini_batch in rollout_buffer:
        # Probability ratio r_t(θ) between the new and old policy
        old_log_probs = mini_batch.log_probs
        new_log_probs = policy_model.log_probs(mini_batch.responses)
        ratio = exp(new_log_probs - old_log_probs)

        # Clipped surrogate loss (negated so that we minimize it)
        advantages = mini_batch.advantages
        clipped_ratio = clip(ratio, 1 - epsilon, 1 + epsilon)
        ppo_loss = -min(ratio * advantages, clipped_ratio * advantages)

        # KL penalty keeps the policy close to the frozen reference model
        kl_penalty = beta * kl_divergence(policy_model, reference_model)

        # Combined loss
        total_loss = ppo_loss + kl_penalty

        # Backpropagate, then update the policy parameters
        total_loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Hyperparameter Tuning

Critical Parameters

| Parameter | Typical Range | Impact |
|---|---|---|
| Learning Rate | 1e-6 to 1e-4 | Too high: instability; too low: slow convergence |
| KL Coefficient (β) | 0.01 to 0.1 | Balance between reward and similarity to the reference model |
| Clip Epsilon (ε) | 0.1 to 0.3 | Controls how much the policy can change per update |
| Batch Size | 64 to 512 | Larger: more stable but slower; smaller: faster but noisier |
| PPO Epochs | 4 to 10 | More epochs: better optimization but risk of overfitting |
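
In practice these values are usually collected in a single configuration object. A minimal sketch follows, with defaults picked from the middle of the ranges above (not taken from any specific library).

from dataclasses import dataclass

@dataclass
class PPOConfig:
    learning_rate: float = 1e-5   # 1e-6 to 1e-4
    kl_coef: float = 0.05         # β, 0.01 to 0.1
    clip_epsilon: float = 0.2     # ε, 0.1 to 0.3
    batch_size: int = 256         # 64 to 512
    ppo_epochs: int = 4           # 4 to 10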

Dynamic Adjustments

  • Adaptive KL coefficient: Increase β if the KL divergence grows too large (a controller sketch follows this list)
  • Learning rate scheduling: Decay learning rate over time
  • Early stopping: Stop if reward plateaus or KL divergence explodes
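
Here is a minimal sketch of an adaptive KL controller in the style used by several RLHF implementations: β is nudged up when the observed KL overshoots a target and down when it undershoots. The target and horizon values are illustrative assumptions.

class AdaptiveKLController:
    """Adjust the KL coefficient β toward a target KL divergence."""

    def __init__(self, init_beta: float = 0.05, target_kl: float = 6.0, horizon: int = 10_000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Proportional error, clipped so each adjustment stays gentle
        error = max(min(observed_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta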

Monitoring and Debugging

Key Metrics to Track

  1. Reward Metrics:
     • Average reward per episode
     • Reward distribution
     • Reward trend over time

  2. Policy Metrics:
     • KL divergence from reference model
     • Policy entropy (diversity of outputs)
     • Gradient norms

  3. Training Stability:
     • Loss convergence
     • Policy ratio distribution
     • Explained variance
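
A minimal sketch of how several of these metrics can be estimated from quantities already computed during an update step. The entropy and KL values are rough sampled approximations, and the 0.2 clip threshold is an assumption matching ε above.

import torch

def training_metrics(rewards: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     reference_logprobs: torch.Tensor,
                     ratio: torch.Tensor) -> dict:
    """Summary statistics worth logging at every PPO step."""
    return {
        "reward/mean": rewards.mean().item(),
        "reward/std": rewards.std().item(),
        "policy/approx_kl": (policy_logprobs - reference_logprobs).mean().item(),
        "policy/entropy": (-policy_logprobs).mean().item(),            # rough estimate from sampled tokens
        "ppo/ratio_mean": ratio.mean().item(),
        "ppo/clip_fraction": ((ratio - 1.0).abs() > 0.2).float().mean().item(),  # fraction of tokens hitting the clip
    }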

Common Issues and Solutions

1. Reward Hacking

Symptoms: High reward scores but poor-quality outputs

Solutions:
  • Strengthen the KL penalty
  • Improve reward model training
  • Add additional constraint terms

2. Mode Collapse

Symptoms: Model produces repetitive, safe responses

Solutions:
  • Increase the entropy bonus
  • Reduce the KL coefficient
  • Use diverse training prompts

3. Training Instability

Symptoms: Erratic loss curves, NaN values

Solutions:
  • Reduce the learning rate
  • Increase gradient clipping (see the sketch below)
  • Check for numerical overflow
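
For the stability issues above, a minimal sketch of the usual guards, assuming a PyTorch training loop: clip gradient norms before the optimizer step and skip any batch whose loss is not finite.

import torch

def safe_update(total_loss: torch.Tensor,
                model: torch.nn.Module,
                optimizer: torch.optim.Optimizer,
                max_grad_norm: float = 1.0) -> bool:
    """Apply one optimizer step with gradient clipping and a NaN/Inf guard."""
    if not torch.isfinite(total_loss):
        optimizer.zero_grad()
        return False                      # skip this batch entirely
    optimizer.zero_grad()
    total_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return True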


Practical Example

Let's trace through a concrete training step:

PPO Training Step

Input Prompt: "Explain the concept of gravity to a 5-year-old"

Policy Response: "Gravity is like an invisible force that pulls things down. When you drop a ball, gravity pulls it to the ground!"

Reward Score: 8.2/10 (good explanation, age-appropriate)

KL Divergence: 0.05 (reasonably close to reference model)

PPO Calculation:
  • Policy ratio: 1.15 (slightly more likely than before)
  • Clipped ratio: 1.15 (within the ε bounds)
  • Advantage: +0.3 (above-average reward)
  • PPO loss: -0.3 * 1.15 = -0.345
  • KL penalty: 0.01 * 0.05 = 0.0005
  • Total loss: -0.345 + 0.0005 = -0.3445

Result: Policy parameters updated to make similar responses more likely
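
The arithmetic above can be reproduced directly; the values are copied from the example, with the KL coefficient β taken as 0.01.

ratio = 1.15          # policy ratio
advantage = 0.3       # advantage estimate
epsilon = 0.2         # clip range
beta = 0.01           # KL coefficient
kl = 0.05             # observed KL divergence

clipped_ratio = max(min(ratio, 1 + epsilon), 1 - epsilon)      # 1.15, inside the bounds
ppo_loss = -min(ratio * advantage, clipped_ratio * advantage)  # -0.345
kl_penalty = beta * kl                                         # 0.0005
total_loss = ppo_loss + kl_penalty                             # -0.3445
print(ppo_loss, kl_penalty, total_loss)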


Interactive Exercise

PPO Optimization Challenge

Scenario: You're training a coding assistant using PPO. After 1000 steps, you observe:

  • Average reward: 7.5/10 (good)
  • KL divergence: 0.8 (high)
  • Policy entropy: 0.1 (low)
  • Generated code quality: Repetitive but functional

Analysis Questions:

  1. What problems do you identify from these metrics?
  2. How would you adjust the hyperparameters?
  3. What risks might arise from your proposed changes?
  4. What additional metrics would you want to monitor?


Advanced Techniques

1. Value Function Learning

Train a separate value function to estimate expected future rewards:

V(s) = E[R_t | s_t = s]
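
A common implementation adds a small scalar head on top of the policy's hidden states. A minimal PyTorch sketch follows; the hidden size is an assumption for illustration.

import torch

class ValueHead(torch.nn.Module):
    """Maps a hidden state to a scalar value estimate V(s)."""

    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.linear = torch.nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> values: (batch, seq_len)
        return self.linear(hidden_states).squeeze(-1)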

2. Generalized Advantage Estimation (GAE)

Improve advantage estimates with an exponentially weighted average of multi-step TD residuals:

A_t = Σ_{l≥0} (γλ)^l * δ_{t+l},   where δ_t = r_t + γ·V(s_{t+1}) - V(s_t)
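
A minimal sketch of GAE over a single trajectory, computed backwards so each step reuses the accumulated (γλ) sum. It assumes `values` contains one extra entry for the state after the final reward.

from typing import List

def compute_gae(rewards: List[float],
                values: List[float],
                gamma: float = 1.0,
                lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation for one trajectory.

    `values` must have len(rewards) + 1 entries: the last one is V(s_T)
    for the state after the final reward (0.0 if the episode ends there).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual δ_t
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages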

3. Multiple Reward Models

Use ensemble of reward models to reduce single-model bias:

R_combined = Σ w_i * R_i(prompt, response)
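
A minimal sketch of the weighted combination, assuming each reward model exposes a `score(prompt, response)` callable; this interface is hypothetical, chosen only for illustration.

from typing import Callable, List, Tuple

def combined_reward(prompt: str,
                    response: str,
                    reward_models: List[Tuple[float, Callable[[str, str], float]]]) -> float:
    """Weighted sum of scores from several reward models."""
    return sum(weight * score(prompt, response) for weight, score in reward_models)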


Key Takeaways

  1. PPO provides stable optimization for aligning language models with human preferences
  2. KL divergence penalty is crucial for preventing the model from forgetting its original capabilities
  3. Hyperparameter tuning significantly impacts training success and stability
  4. Continuous monitoring is essential to detect and address training issues
  5. Reward hacking remains a fundamental challenge requiring careful design