
421: RLHF - Step 1: Reward Modeling

Chapter Overview

The first major phase of the RLHF pipeline is to train a Reward Model (RM). The RM is a separate language model whose sole purpose is to act as a proxy for human preference.

It takes a prompt and a response as input and outputs a single scalar value—a "reward" score—that predicts how much a human would like that response.


The Reward Model Training Process

Training a Reward Model is a supervised learning task, but the "labels" are human preferences, not static answers.

flowchart TD
    subgraph "Step 1: Data Generation"
        A[Prompt:<br/>'Explain quantum computing'] --> B[SFT Model]
        B --> C[Response A:<br/>'Quantum computing uses<br/>quantum bits...']
        B --> D[Response B:<br/>'It's like regular computing<br/>but with magic...']
        B --> E[Response C:<br/>'Quantum mechanics allows<br/>superposition...']
    end

    subgraph "Step 2: Human Preference Labeling"
        F{Human Labeler Reviews:<br/>Prompt + Response A + Response B}
        F -->|'A is better'| G[Preference: A > B]
        F -->|'B is better'| H[Preference: B > A]
        G --> I[Training Data Point:<br/>prompt, chosen_response, rejected_response]
        H --> I
    end

    subgraph "Step 3: Train the Reward Model"
        J[Preference Dataset<br/>thousands of comparisons] --> K[Reward Model Training]
        K -->|Objective: score_chosen > score_rejected| L[✅ Trained RM]
        L --> M[Input: prompt + response<br/>Output: scalar reward score]
    end

    style I fill:#e3f2fd,stroke:#1976d2
    style L fill:#c8e6c9,stroke:#1B5E20
    style M fill:#fff3e0,stroke:#f57c00
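
Below is a minimal sketch of the data-generation step in the flowchart: sampling several candidate responses per prompt from the SFT model. It assumes a Hugging Face causal LM; the checkpoint name "your-org/sft-model" is a placeholder, and the sampling settings are illustrative.

```python
# Sketch: sample several candidate responses per prompt from the SFT model.
# The checkpoint name below is a placeholder for your own SFT model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

sft_name = "your-org/sft-model"  # hypothetical SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(sft_name)
model = AutoModelForCausalLM.from_pretrained(sft_name)

prompt = "Explain quantum computing"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sampling yields diverse candidates
        temperature=0.9,
        max_new_tokens=128,
        num_return_sequences=3,  # Responses A, B, C for labelers to compare
    )

# Strip the prompt tokens and keep only the generated continuations.
candidates = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
for i, text in enumerate(candidates):
    print(f"Response {chr(65 + i)}: {text[:80]}...")
```

The labelers then see the prompt alongside pairs of these candidates and record which one they prefer.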

How the Reward Model Works

The reward model is trained to predict human preferences using a ranking loss function:

Training Objective

Loss = -log(σ(r_chosen - r_rejected))

Where:

  • r_chosen = reward score for the preferred response
  • r_rejected = reward score for the rejected response
  • σ = sigmoid function

Goal: Maximize the probability that the chosen response gets a higher score than the rejected response.
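
A minimal PyTorch sketch of this objective (often called the pairwise ranking or Bradley-Terry loss). logsigmoid equals log(σ(·)) but is more numerically stable.

```python
# Sketch of the pairwise ranking loss: maximize sigma(r_chosen - r_rejected).
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """-log(sigmoid(r_chosen - r_rejected)), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of scalar rewards for three comparison pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.1, 0.8, -0.5])
print(reward_ranking_loss(r_chosen, r_rejected))  # small when chosen scores exceed rejected
```

The loss is small when the chosen response already scores higher than the rejected one, and large when the ordering is reversed.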


Practical Example

Let's walk through a concrete example:

Reward Model Training Example

Prompt: "Write a professional email declining a job offer"

Response A: "Thanks but no thanks. I found something better."

Response B: "Thank you for the generous offer. After careful consideration, I've decided to pursue another opportunity that aligns more closely with my career goals. I appreciate your time and wish you success in finding the right candidate."

Human Preference: B > A (Response B is more professional)

Training Data Point:

{
  "prompt": "Write a professional email declining a job offer",
  "chosen": "Thank you for the generous offer...",
  "rejected": "Thanks but no thanks...",
  "preference": "chosen"
}
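
One way to turn a record like this into the two sequences the reward model actually scores is sketched below. The gpt2 tokenizer and the simple prompt/response concatenation are placeholders; real projects typically apply their own chat or prompt template.

```python
# Sketch: turn one preference record into the two sequences the RM scores.
# The tokenizer checkpoint and the concatenation format are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

record = {
    "prompt": "Write a professional email declining a job offer",
    "chosen": "Thank you for the generous offer...",
    "rejected": "Thanks but no thanks...",
}

# The RM sees prompt + response as a single sequence for each side of the pair.
chosen_inputs = tokenizer(record["prompt"] + "\n" + record["chosen"],
                          truncation=True, max_length=512, return_tensors="pt")
rejected_inputs = tokenizer(record["prompt"] + "\n" + record["rejected"],
                            truncation=True, max_length=512, return_tensors="pt")
print(chosen_inputs["input_ids"].shape, rejected_inputs["input_ids"].shape)
```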


Architecture Details

Base Model Selection

  • Typically: Same architecture as the SFT model
  • Size: Often smaller than the policy model (e.g., 7B vs 13B parameters)
  • Modification: Replace the language modeling head with a scalar output head
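
A minimal sketch of that modification, assuming a Hugging Face backbone: keep the transformer, drop the language modeling head, and attach a single linear layer that maps the final token's hidden state to one scalar. The class and checkpoint names are illustrative, not a fixed API.

```python
# Sketch of the architecture change: transformer backbone + scalar value head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, base_name: str = "gpt2"):  # placeholder backbone
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        hidden = self.backbone.config.hidden_size
        self.value_head = nn.Linear(hidden, 1)  # scalar reward head

    def forward(self, input_ids, attention_mask):
        hidden_states = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Score the last non-padding token of each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden_states[torch.arange(hidden_states.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)  # shape: (batch,)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
rm = RewardModel()
batch = tokenizer(["prompt + response"], return_tensors="pt", padding=True)
print(rm(batch["input_ids"], batch["attention_mask"]))  # one scalar per sequence
```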

Training Hyperparameters

| Parameter     | Typical Value | Purpose                                       |
|---------------|---------------|-----------------------------------------------|
| Learning Rate | 5e-6          | Slower than pretraining to preserve knowledge |
| Batch Size    | 64 pairs      | Balance between stability and efficiency      |
| Epochs        | 1-3           | Prevent overfitting to preferences            |
| Warmup Steps  | 500           | Gradual learning rate increase                |
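
A rough sketch of how these hyperparameters wire together, building on the RewardModel and reward_ranking_loss sketches defined earlier in this chapter (the linear warmup schedule and batch handling here are simplified for illustration).

```python
# Sketch: one training step of the reward model with the hyperparameters above.
# Assumes `rm` and `reward_ranking_loss` from the earlier sketches are in scope.
import torch

optimizer = torch.optim.AdamW(rm.parameters(), lr=5e-6)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / 500))  # 500 warmup steps

def train_step(chosen_batch, rejected_batch):
    # Score both sides of each comparison pair, then apply the ranking loss.
    r_chosen = rm(chosen_batch["input_ids"], chosen_batch["attention_mask"])
    r_rejected = rm(rejected_batch["input_ids"], rejected_batch["attention_mask"])
    loss = reward_ranking_loss(r_chosen, r_rejected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```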

Data Collection Challenges

1. Annotator Agreement

Not all preferences are clear-cut. Measuring inter-annotator agreement is crucial:

Cohen's Kappa = (P_observed - P_expected) / (1 - P_expected)

where P_observed is the fraction of comparison pairs on which the annotators agree, and P_expected is the agreement expected by chance given each annotator's label frequencies.

Acceptable Range: κ > 0.6 indicates good agreement
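
A small self-contained sketch of this computation for two annotators who each label the same comparison pairs with "A" or "B" (the label lists below are made-up toy data).

```python
# Sketch: Cohen's kappa for two annotators labeling the same comparison pairs.
from collections import Counter

def cohens_kappa(labels_1, labels_2):
    n = len(labels_1)
    p_observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Agreement expected by chance, given each annotator's label frequencies.
    c1, c2 = Counter(labels_1), Counter(labels_2)
    p_expected = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))
    return (p_observed - p_expected) / (1 - p_expected)

annotator_1 = ["A", "A", "B", "A", "B", "B", "A", "B"]
annotator_2 = ["A", "B", "B", "A", "B", "B", "A", "B"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.75, above the 0.6 threshold
```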

2. Preference Inconsistency

Humans aren't perfectly consistent. The same person might prefer A over B on Monday and B over A on Tuesday.

3. Demographic Bias

Different groups may have different preferences. Consider:

  • Cultural backgrounds
  • Age groups
  • Professional contexts
  • Personal values


Quality Metrics

Reward Model Evaluation

  1. Accuracy: How often does the RM agree with human preferences?
  2. Calibration: Are high-confidence predictions actually correct?
  3. Consistency: Does the RM give similar scores to similar responses?
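
A sketch of the first two metrics on a held-out preference set: pairwise accuracy (does the RM rank the chosen response higher?) and a simple calibration proxy (the mean of σ(r_chosen - r_rejected), which should roughly track accuracy). The score callable is a stand-in for the trained reward model.

```python
# Sketch: pairwise accuracy and a calibration proxy on held-out preference pairs.
# `score(prompt, response)` is a placeholder for a call to the trained RM.
import torch

def pairwise_accuracy(score, heldout_pairs):
    """heldout_pairs: list of (prompt, chosen, rejected) triples."""
    correct = sum(score(p, c) > score(p, r) for p, c, r in heldout_pairs)
    return correct / len(heldout_pairs)

def mean_confidence(score, heldout_pairs):
    """Mean sigma(r_chosen - r_rejected); should roughly match accuracy if calibrated."""
    diffs = torch.tensor([score(p, c) - score(p, r) for p, c, r in heldout_pairs])
    return torch.sigmoid(diffs).mean().item()

# Toy stand-in scorer for demonstration: longer responses score higher.
toy_score = lambda prompt, response: float(len(response))
pairs = [("p", "a long chosen answer", "short"), ("p", "ok", "a verbose rejected one")]
print(pairwise_accuracy(toy_score, pairs), mean_confidence(toy_score, pairs))
```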

Validation Approaches

  • Hold-out test set: Reserve 10-20% of preference data
  • Cross-validation: Rotate training/validation splits
  • Live evaluation: Test on new, unseen preference pairs

Interactive Exercise

Reward Model Design Challenge

Scenario: You're training a reward model for a creative writing assistant.

Given this prompt: "Write the opening paragraph of a mystery novel"

Response A: "It was a dark and stormy night. The detective walked into the room and saw the dead body."

Response B: "The grandfather clock in the corner had stopped at 3:17 AM, the exact moment Margaret Chen realized someone was in her house."

Questions:

  1. Which response would you prefer, and why?
  2. What aspects make one response "better" than the other?
  3. How might different people disagree on this preference?
  4. What challenges would this create for training a reward model?


Common Pitfalls and Solutions

1. Reward Hacking

Problem: During the RL step, the policy model learns to exploit weaknesses and biases in the reward model rather than becoming genuinely helpful.

Solution:

  • Diverse training data
  • Regular reward model updates
  • KL divergence penalties
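
The KL penalty is applied in the RL step (Step 2), but a rough sketch of the shape it takes is shown below: the reward handed to the policy is the RM score minus a penalty for drifting from the frozen SFT reference model. The function name, inputs, and β value are illustrative.

```python
# Sketch of a KL-penalized reward, as used later in the RL step (Step 2).
# `rm_score`, `logprob_policy`, and `logprob_ref` are placeholders for values
# produced by the reward model, the policy, and the frozen SFT reference model.
def penalized_reward(rm_score: float, logprob_policy: float,
                     logprob_ref: float, beta: float = 0.1) -> float:
    # (logprob_policy - logprob_ref) is a per-sample estimate of the KL term.
    return rm_score - beta * (logprob_policy - logprob_ref)

print(penalized_reward(rm_score=1.8, logprob_policy=-12.0, logprob_ref=-15.0))
# 1.8 - 0.1 * 3.0 = 1.5: the policy is docked for drifting from the reference model.
```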

2. Preference Overfitting

Problem: The reward model memorizes specific examples rather than learning general principles.

Solution:

  • Larger, more diverse datasets
  • Regularization techniques
  • Cross-validation

3. Distribution Shift

Problem: The reward model works well on training data but fails on new types of prompts.

Solution:

  • Continuous data collection
  • Domain adaptation techniques
  • Robust evaluation protocols


Key Takeaways

  1. Reward models are preference predictors, not truth evaluators
  2. Quality of preference data directly impacts alignment success
  3. Human consistency is a fundamental challenge in the process
  4. Evaluation metrics must go beyond simple accuracy
  5. Bias and fairness considerations are critical throughout