
421: RLHF - Step 1: Reward Modeling

Chapter Overview

The first major phase of the RLHF pipeline is to train a Reward Model (RM). The RM is a separate language model whose sole purpose is to act as a proxy for human preference.

It takes a prompt and a response as input and outputs a single scalar value—a "reward" score—that predicts how much a human would like that response.


The Reward Model Training Process

Training a Reward Model is a supervised learning task, but the "labels" are human preferences, not static answers.

flowchart TD
    subgraph "Step 1: Data Generation"
        A[Prompt:<br/>'Explain quantum computing'] --> B[SFT Model]
        B --> C[Response A:<br/>'Quantum computing uses<br/>quantum bits...']
        B --> D[Response B:<br/>'It's like regular computing<br/>but with magic...']
        B --> E[Response C:<br/>'Quantum mechanics allows<br/>superposition...']
    end

    subgraph "Step 2: Human Preference Labeling"
        F{Human Labeler Reviews:<br/>Prompt + Response A + Response B}
        F -->|'A is better'| G[Preference: A > B]
        F -->|'B is better'| H[Preference: B > A]
        G --> I[Training Data Point:<br/>prompt, chosen_response, rejected_response]
        H --> I
    end

    subgraph "Step 3: Train the Reward Model"
        J[Preference Dataset<br/>thousands of comparisons] --> K[Reward Model Training]
        K -->|Objective: score_chosen > score_rejected| L[✅ Trained RM]
        L --> M[Input: prompt + response<br/>Output: scalar reward score]
    end

    style I fill:#e3f2fd,stroke:#1976d2
    style L fill:#c8e6c9,stroke:#1B5E20
    style M fill:#fff3e0,stroke:#f57c00
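
Below is a minimal sketch of the data-generation step in the flowchart: sampling several candidate responses per prompt from the SFT model. It assumes a Hugging Face causal LM; the checkpoint name "your-org/sft-model" is a placeholder, and the sampling settings are illustrative.

```python
# Sketch: sample several candidate responses per prompt from the SFT model.
# The checkpoint name below is a placeholder for your own SFT model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

sft_name = "your-org/sft-model"  # hypothetical SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(sft_name)
model = AutoModelForCausalLM.from_pretrained(sft_name)

prompt = "Explain quantum computing"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sampling yields diverse candidates
        temperature=0.9,
        max_new_tokens=128,
        num_return_sequences=3,  # Responses A, B, C for labelers to compare
    )

# Strip the prompt tokens and keep only the generated continuations.
candidates = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
for i, text in enumerate(candidates):
    print(f"Response {chr(65 + i)}: {text[:80]}...")
```

The labelers then see the prompt alongside pairs of these candidates and record which one they prefer.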

How the Reward Model Works

The reward model is trained to predict human preferences using a ranking loss function:

Training Objective

Loss = -log(σ(r_chosen - r_rejected))

Where:

  • r_chosen = reward score for the preferred response
  • r_rejected = reward score for the rejected response
  • σ = sigmoid function

Goal: Maximize the probability that the chosen response gets a higher score than the rejected response.
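
A minimal PyTorch sketch of this objective (often called the pairwise ranking or Bradley-Terry loss). logsigmoid equals log(σ(·)) but is more numerically stable.

```python
# Sketch of the pairwise ranking loss: maximize sigma(r_chosen - r_rejected).
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """-log(sigmoid(r_chosen - r_rejected)), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of scalar rewards for three comparison pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.1, 0.8, -0.5])
print(reward_ranking_loss(r_chosen, r_rejected))  # small when chosen scores exceed rejected
```

The loss is small when the chosen response already scores higher than the rejected one, and large when the ordering is reversed.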


Practical Example

Let's walk through a concrete example:

Reward Model Training Example

Prompt: "Write a professional email declining a job offer"

Response A: "Thanks but no thanks. I found something better."

Response B: "Thank you for the generous offer. After careful consideration, I've decided to pursue another opportunity that aligns more closely with my career goals. I appreciate your time and wish you success in finding the right candidate."

Human Preference: B > A (Response B is more professional)

Training Data Point:

{
  "prompt": "Write a professional email declining a job offer",
  "chosen": "Thank you for the generous offer...",
  "rejected": "Thanks but no thanks...",
  "preference": "chosen"
}
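
One way to turn a record like this into the two sequences the reward model actually scores is sketched below. The gpt2 tokenizer and the simple prompt/response concatenation are placeholders; real projects typically apply their own chat or prompt template.

```python
# Sketch: turn one preference record into the two sequences the RM scores.
# The tokenizer checkpoint and the concatenation format are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

record = {
    "prompt": "Write a professional email declining a job offer",
    "chosen": "Thank you for the generous offer...",
    "rejected": "Thanks but no thanks...",
}

# The RM sees prompt + response as a single sequence for each side of the pair.
chosen_inputs = tokenizer(record["prompt"] + "\n" + record["chosen"],
                          truncation=True, max_length=512, return_tensors="pt")
rejected_inputs = tokenizer(record["prompt"] + "\n" + record["rejected"],
                            truncation=True, max_length=512, return_tensors="pt")
print(chosen_inputs["input_ids"].shape, rejected_inputs["input_ids"].shape)
```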


Architecture Details

Base Model Selection

  • Typically: Same architecture as the SFT model
  • Size: Often smaller than the policy model (e.g., 7B vs 13B parameters)
  • Modification: Replace the language modeling head with a scalar output head
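
A minimal sketch of that modification, assuming a Hugging Face backbone: keep the transformer, drop the language modeling head, and attach a single linear layer that maps the final token's hidden state to one scalar. The class and checkpoint names are illustrative, not a fixed API.

```python
# Sketch of the architecture change: transformer backbone + scalar value head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, base_name: str = "gpt2"):  # placeholder backbone
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        hidden = self.backbone.config.hidden_size
        self.value_head = nn.Linear(hidden, 1)  # scalar reward head

    def forward(self, input_ids, attention_mask):
        hidden_states = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Score the last non-padding token of each sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden_states[torch.arange(hidden_states.size(0)), last_idx]
        return self.value_head(last_hidden).squeeze(-1)  # shape: (batch,)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
rm = RewardModel()
batch = tokenizer(["prompt + response"], return_tensors="pt", padding=True)
print(rm(batch["input_ids"], batch["attention_mask"]))  # one scalar per sequence
```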

Training Hyperparameters

| Parameter     | Typical Value | Purpose                                       |
|---------------|---------------|-----------------------------------------------|
| Learning Rate | 5e-6          | Slower than pretraining to preserve knowledge |
| Batch Size    | 64 pairs      | Balance between stability and efficiency      |
| Epochs        | 1-3           | Prevent overfitting to preferences            |
| Warmup Steps  | 500           | Gradual learning rate increase                |
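
A rough sketch of how these hyperparameters wire together, building on the RewardModel and reward_ranking_loss sketches defined earlier in this chapter (the linear warmup schedule and batch handling here are simplified for illustration).

```python
# Sketch: one training step of the reward model with the hyperparameters above.
# Assumes `rm` and `reward_ranking_loss` from the earlier sketches are in scope.
import torch

optimizer = torch.optim.AdamW(rm.parameters(), lr=5e-6)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / 500))  # 500 warmup steps

def train_step(chosen_batch, rejected_batch):
    # Score both sides of each comparison pair, then apply the ranking loss.
    r_chosen = rm(chosen_batch["input_ids"], chosen_batch["attention_mask"])
    r_rejected = rm(rejected_batch["input_ids"], rejected_batch["attention_mask"])
    loss = reward_ranking_loss(r_chosen, r_rejected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```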

Data Collection Challenges

1. Annotator Agreement

Not all preferences are clear-cut. Measuring inter-annotator agreement is crucial:

Cohen's Kappa = (P_observed - P_expected) / (1 - P_expected)

where P_observed is the fraction of comparison pairs on which the annotators agree, and P_expected is the agreement expected by chance given each annotator's label frequencies.

Acceptable Range: κ > 0.6 indicates good agreement
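
A small self-contained sketch of this computation for two annotators who each label the same comparison pairs with "A" or "B" (the label lists below are made-up toy data).

```python
# Sketch: Cohen's kappa for two annotators labeling the same comparison pairs.
from collections import Counter

def cohens_kappa(labels_1, labels_2):
    n = len(labels_1)
    p_observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Agreement expected by chance, given each annotator's label frequencies.
    c1, c2 = Counter(labels_1), Counter(labels_2)
    p_expected = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))
    return (p_observed - p_expected) / (1 - p_expected)

annotator_1 = ["A", "A", "B", "A", "B", "B", "A", "B"]
annotator_2 = ["A", "B", "B", "A", "B", "B", "A", "B"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.75, above the 0.6 threshold
```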

2. Preference Inconsistency

Humans aren't perfectly consistent. The same person might prefer A over B on Monday and B over A on Tuesday.

3. Demographic Bias

Different groups may have different preferences. Consider:

  • Cultural backgrounds
  • Age groups
  • Professional contexts
  • Personal values


Quality Metrics

Reward Model Evaluation

  1. Accuracy: How often does the RM agree with human preferences?
  2. Calibration: Are high-confidence predictions actually correct?
  3. Consistency: Does the RM give similar scores to similar responses?
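
A sketch of the first two metrics on a held-out preference set: pairwise accuracy (does the RM rank the chosen response higher?) and a simple calibration proxy (the mean of σ(r_chosen - r_rejected), which should roughly track accuracy). The score callable is a stand-in for the trained reward model.

```python
# Sketch: pairwise accuracy and a calibration proxy on held-out preference pairs.
# `score(prompt, response)` is a placeholder for a call to the trained RM.
import torch

def pairwise_accuracy(score, heldout_pairs):
    """heldout_pairs: list of (prompt, chosen, rejected) triples."""
    correct = sum(score(p, c) > score(p, r) for p, c, r in heldout_pairs)
    return correct / len(heldout_pairs)

def mean_confidence(score, heldout_pairs):
    """Mean sigma(r_chosen - r_rejected); should roughly match accuracy if calibrated."""
    diffs = torch.tensor([score(p, c) - score(p, r) for p, c, r in heldout_pairs])
    return torch.sigmoid(diffs).mean().item()

# Toy stand-in scorer for demonstration: longer responses score higher.
toy_score = lambda prompt, response: float(len(response))
pairs = [("p", "a long chosen answer", "short"), ("p", "ok", "a verbose rejected one")]
print(pairwise_accuracy(toy_score, pairs), mean_confidence(toy_score, pairs))
```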

Validation Approaches

  • Hold-out test set: Reserve 10-20% of preference data
  • Cross-validation: Rotate training/validation splits
  • Live evaluation: Test on new, unseen preference pairs

Interactive Exercise

Reward Model Design Challenge

Scenario: You're training a reward model for a creative writing assistant.

Given this prompt: "Write the opening paragraph of a mystery novel"

Response A: "It was a dark and stormy night. The detective walked into the room and saw the dead body."

Response B: "The grandfather clock in the corner had stopped at 3:17 AM, the exact moment Margaret Chen realized someone was in her house."

Questions:

  1. Which response would you prefer, and why?
  2. What aspects make one response "better" than the other?
  3. How might different people disagree on this preference?
  4. What challenges would this create for training a reward model?


Common Pitfalls and Solutions

1. Reward Hacking

Problem: During the RL step, the policy model learns to exploit weaknesses and biases in the reward model rather than becoming genuinely helpful.

Solution:

  • Diverse training data
  • Regular reward model updates
  • KL divergence penalties
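
The KL penalty is applied in the RL step (Step 2), but a rough sketch of the shape it takes is shown below: the reward handed to the policy is the RM score minus a penalty for drifting from the frozen SFT reference model. The function name, inputs, and β value are illustrative.

```python
# Sketch of a KL-penalized reward, as used later in the RL step (Step 2).
# `rm_score`, `logprob_policy`, and `logprob_ref` are placeholders for values
# produced by the reward model, the policy, and the frozen SFT reference model.
def penalized_reward(rm_score: float, logprob_policy: float,
                     logprob_ref: float, beta: float = 0.1) -> float:
    # (logprob_policy - logprob_ref) is a per-sample estimate of the KL term.
    return rm_score - beta * (logprob_policy - logprob_ref)

print(penalized_reward(rm_score=1.8, logprob_policy=-12.0, logprob_ref=-15.0))
# 1.8 - 0.1 * 3.0 = 1.5: the policy is docked for drifting from the reference model.
```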

2. Preference Overfitting

Problem: The reward model memorizes specific examples rather than learning general principles.

Solution:

  • Larger, more diverse datasets
  • Regularization techniques
  • Cross-validation

3. Distribution Shift

Problem: The reward model works well on training data but fails on new types of prompts.

Solution:

  • Continuous data collection
  • Domain adaptation techniques
  • Robust evaluation protocols


Key Takeaways

  1. Reward models are preference predictors, not truth evaluators
  2. Quality of preference data directly impacts alignment success
  3. Human consistency is a fundamental challenge in the process
  4. Evaluation metrics must go beyond simple accuracy
  5. Bias and fairness considerations are critical throughout