RLHF - Step 1: Reward Modeling¶
Chapter Overview
The first major phase of the RLHF pipeline is to train a Reward Model (RM). The RM is a separate language model whose sole purpose is to act as a proxy for human preference.
It takes a prompt and a response as input and outputs a single scalar value—a "reward" score—that predicts how much a human would like that response.
The Reward Model Training Process¶
Training a Reward Model is a supervised learning task, but the "labels" are human preferences, not static answers.
```mermaid
flowchart TD
    subgraph "Step 1: Data Generation"
        A[Prompt:<br/>'Explain quantum computing'] --> B[SFT Model]
        B --> C[Response A:<br/>'Quantum computing uses<br/>quantum bits...']
        B --> D[Response B:<br/>'It's like regular computing<br/>but with magic...']
        B --> E[Response C:<br/>'Quantum mechanics allows<br/>superposition...']
    end
    subgraph "Step 2: Human Preference Labeling"
        F{Human Labeler Reviews:<br/>Prompt + Response A + Response B}
        F -->|'A is better'| G[Preference: A > B]
        F -->|'B is better'| H[Preference: B > A]
        G --> I[Training Data Point:<br/>prompt, chosen_response, rejected_response]
        H --> I
    end
    subgraph "Step 3: Train the Reward Model"
        J[Preference Dataset<br/>thousands of comparisons] --> K[Reward Model Training]
        K -->|Objective: score_chosen > score_rejected| L[✅ Trained RM]
        L --> M[Input: prompt + response<br/>Output: scalar reward score]
    end
    style I fill:#e3f2fd,stroke:#1976d2
    style L fill:#c8e6c9,stroke:#1B5E20
    style M fill:#fff3e0,stroke:#f57c00
```
How the Reward Model Works¶
The reward model is trained to predict human preferences using a ranking loss function:
Training Objective¶
$$\mathcal{L} = -\log \sigma\left(r_{\text{chosen}} - r_{\text{rejected}}\right)$$
Where:
- r_chosen = reward score for the preferred response
- r_rejected = reward score for the rejected response
- σ = sigmoid function
Goal: Maximize the probability that the chosen response gets a higher score than the rejected response.
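In code, this objective reduces to a log-sigmoid of the score margin. Below is a minimal PyTorch sketch; the function name and dummy scores are illustrative, not taken from any specific library:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_rewards: torch.Tensor,
                          rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected).

    Both inputs hold one scalar reward per comparison pair, shape (batch,).
    """
    # logsigmoid is numerically more stable than log(sigmoid(x))
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy scores: the loss decreases as the chosen/rejected margin grows.
chosen = torch.tensor([1.2, 0.8, 2.0])
rejected = torch.tensor([0.3, 1.1, -0.5])
print(pairwise_ranking_loss(chosen, rejected))
```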
Practical Example¶
Let's walk through a concrete example:
Reward Model Training Example
Prompt: "Write a professional email declining a job offer"
Response A: "Thanks but no thanks. I found something better."
Response B: "Thank you for the generous offer. After careful consideration, I've decided to pursue another opportunity that aligns more closely with my career goals. I appreciate your time and wish you success in finding the right candidate."
Human Preference: B > A (Response B is more professional)
Training Data Point:
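A sketch of how this record might be stored; the field names (`prompt`, `chosen`, `rejected`) are a common convention, not a fixed schema:

```python
# Hypothetical preference record built from the example above
training_example = {
    "prompt": "Write a professional email declining a job offer",
    # Response B: the human-preferred answer
    "chosen": ("Thank you for the generous offer. After careful consideration, "
               "I've decided to pursue another opportunity that aligns more "
               "closely with my career goals. I appreciate your time and wish "
               "you success in finding the right candidate."),
    # Response A: the rejected answer
    "rejected": "Thanks but no thanks. I found something better.",
}
```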
Architecture Details¶
Base Model Selection¶
- Typically: Same architecture as the SFT model
- Size: Often smaller than the policy model (e.g., 7B vs 13B parameters)
- Modification: Replace the language modeling head with a scalar output head
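A minimal sketch of that modification, assuming a Hugging Face-style backbone and right-padded batches; the class name, value head, and last-token pooling are illustrative choices rather than a particular library's API:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Transformer backbone with the LM head replaced by a scalar value head."""

    def __init__(self, base_model_name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.backbone.config.hidden_size
        self.value_head = nn.Linear(hidden_size, 1)  # scalar reward output

    def forward(self, input_ids: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Pool the last non-padding token (assumes right padding)
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(pooled).squeeze(-1)  # shape: (batch,)
```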
Training Hyperparameters¶
| Parameter | Typical Value | Purpose |
|---|---|---|
| Learning Rate | 5e-6 | Lower than pretraining to preserve knowledge |
| Batch Size | 64 pairs | Balance between stability and efficiency |
| Epochs | 1-3 | Prevent overfitting to preferences |
| Warmup Steps | 500 | Gradual learning rate increase |
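One way these values might be wired up in PyTorch, sketched under the assumption of a linear warmup followed by a constant learning rate (the stand-in model below takes the place of the reward model sketched earlier):

```python
import torch
import torch.nn as nn

# Stand-in for the reward model; in practice this would be the
# transformer-plus-value-head module from the architecture sketch above.
model = nn.Linear(4096, 1)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)  # learning rate from the table

warmup_steps = 500

def lr_lambda(step: int) -> float:
    # Ramp the learning rate linearly over the first 500 steps, then hold it.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```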
Data Collection Challenges¶
1. Annotator Agreement¶
Not all preferences are clear-cut, so measuring inter-annotator agreement (for example, Cohen's κ) is crucial.
Acceptable range: κ > 0.6 indicates good agreement.
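Cohen's κ can be computed directly with scikit-learn; the annotator labels below are invented for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# For each comparison, which response the annotator preferred
annotator_1 = ["A", "B", "B", "A", "B", "A", "A", "B"]
annotator_2 = ["A", "B", "A", "A", "B", "A", "B", "B"]

# Values above roughly 0.6 are usually treated as good agreement
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```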
2. Preference Inconsistency¶
Humans aren't perfectly consistent. The same person might prefer A over B on Monday and B over A on Tuesday.
3. Demographic Bias¶
Different groups may have different preferences. Consider:
- Cultural backgrounds
- Age groups
- Professional contexts
- Personal values
Quality Metrics¶
Reward Model Evaluation¶
- Accuracy: How often does the RM agree with human preferences?
- Calibration: Are high-confidence predictions actually correct?
- Consistency: Does the RM give similar scores to similar responses?
Validation Approaches¶
- Hold-out test set: Reserve 10-20% of preference data (see the accuracy sketch below)
- Cross-validation: Rotate training/validation splits
- Live evaluation: Test on new, unseen preference pairs
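For the hold-out test set, the accuracy metric reduces to counting how often the RM scores the human-chosen response above the rejected one. A minimal sketch with made-up scores:

```python
import torch

def preference_accuracy(chosen_scores: torch.Tensor,
                        rejected_scores: torch.Tensor) -> float:
    """Fraction of held-out pairs where the RM ranks the chosen response higher."""
    return (chosen_scores > rejected_scores).float().mean().item()

# Dummy RM scores for eight held-out comparison pairs
chosen = torch.tensor([1.4, 0.2, 0.9, 2.1, -0.3, 1.0, 0.5, 1.8])
rejected = torch.tensor([0.7, 0.6, 0.1, 1.0, -0.9, 1.2, 0.4, 0.3])
print(f"Held-out accuracy: {preference_accuracy(chosen, rejected):.2f}")
```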
Interactive Exercise¶
Reward Model Design Challenge
Scenario: You're training a reward model for a creative writing assistant.
Given this prompt: "Write the opening paragraph of a mystery novel"
Response A: "It was a dark and stormy night. The detective walked into the room and saw the dead body."
Response B: "The grandfather clock in the corner had stopped at 3:17 AM, the exact moment Margaret Chen realized someone was in her house."
Questions:
1. Which response would you prefer and why?
2. What aspects make one response "better" than the other?
3. How might different people disagree on this preference?
4. What challenges would this create for training a reward model?
Common Pitfalls and Solutions¶
1. Reward Hacking¶
Problem: The model learns to exploit biases in the reward model rather than being genuinely helpful.
Solution:
- Diverse training data
- Regular reward model updates
- KL divergence penalties
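The KL divergence penalty is actually applied during the RL fine-tuning step (Step 2) rather than during RM training, but its usual form is simple to sketch; `beta` and the log-probability tensors below are placeholders:

```python
import torch

def kl_penalized_reward(rm_score: torch.Tensor,
                        logprob_policy: torch.Tensor,
                        logprob_reference: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """RM score minus a KL-style penalty that keeps the policy close to the
    SFT reference model, which discourages reward hacking.

    rm_score: (batch,) scalar rewards; log-probs: (batch, seq_len) per token.
    """
    kl_estimate = logprob_policy - logprob_reference  # per-token log-ratio
    return rm_score - beta * kl_estimate.sum(dim=-1)
```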
2. Preference Overfitting¶
Problem: The reward model memorizes specific examples rather than learning general principles.
Solution:
- Larger, more diverse datasets
- Regularization techniques
- Cross-validation
3. Distribution Shift¶
Problem: The reward model works well on training data but fails on new types of prompts.
Solution:
- Continuous data collection
- Domain adaptation techniques
- Robust evaluation protocols
Key Takeaways¶
- Reward models are preference predictors, not truth evaluators
- Quality of preference data directly impacts alignment success
- Human consistency is a fundamental challenge in the process
- Evaluation metrics must go beyond simple accuracy
- Bias and fairness considerations are critical throughout