
420: Aligning with Human Preferences (RLHF & DPO)

Topic Overview

Standard fine-tuning is excellent for teaching a model a new skill, but it struggles with teaching subjective qualities like "helpfulness," "harmlessness," or "honesty." How do you create a dataset for "being helpful"?

Alignment is the process of fine-tuning a model to better reflect complex, often subjective, human values. This section covers the two primary techniques for achieving alignment: Reinforcement Learning from Human Feedback (RLHF) and its simpler modern alternative, Direct Preference Optimization (DPO).


The Core Problem: Defining "Good"

It's easy to define a "good" response for a factual question, but much harder for an open-ended one. For example, which summary is "better"? Which creative story is more "engaging"?

These alignment techniques solve this problem by learning from human preferences rather than from static, pre-written "correct" answers. They learn what humans prefer to see.


The Two Main Approaches

1. Reinforcement Learning from Human Feedback (RLHF)

This is the classic, three-step process that powered models like InstructGPT and the original ChatGPT. It's complex but powerful.

flowchart TD
    A[Step 1: Supervised Fine-Tuning] --> B[Step 2: Reward Model Training]
    B --> C[Step 3: RL Fine-Tuning with PPO]

    subgraph "RLHF Pipeline"
        direction TB
        A1[Teaches basic style<br/>and instruction following] 
        B1[Learns human preferences<br/>through comparison data]
        C1[Optimizes policy to<br/>maximize reward scores]
    end

    A --> A1
    B --> B1
    C --> C1

    style A fill:#e3f2fd,stroke:#1976d2
    style B fill:#fff3e0,stroke:#f57c00
    style C fill:#e8f5e8,stroke:#388e3c
    style A1 fill:#e3f2fd,stroke:#1976d2,stroke-dasharray: 5 5
    style B1 fill:#fff3e0,stroke:#f57c00,stroke-dasharray: 5 5
    style C1 fill:#e8f5e8,stroke:#388e3c,stroke-dasharray: 5 5

Key Characteristics:

- Proven track record: Powers ChatGPT, Claude, and other major models
- Complex but stable: Three distinct phases with clear objectives
- Resource intensive: Requires training multiple models and human annotation
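
To make Step 2 concrete, here is a minimal sketch of the pairwise ranking (Bradley-Terry) loss that reward models are typically trained with: the reward model scores both responses from a human comparison, and the loss pushes the preferred response's score above the rejected one's. The function name and the toy reward values below are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry ranking loss for reward model training:
    loss = -log sigmoid(r_chosen - r_rejected).
    Minimizing it pushes the preferred response's score above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar scores the reward model assigned to 3 comparison pairs.
r_chosen = torch.tensor([1.2, 0.4, 0.9])     # scores for the human-preferred responses
r_rejected = torch.tensor([0.3, 0.6, -0.1])  # scores for the rejected responses
print(pairwise_reward_loss(r_chosen, r_rejected))  # single scalar loss to backpropagate
```

In Step 3, the trained reward model's scores become the reward signal that PPO maximizes, with a KL penalty keeping the policy close to the supervised fine-tuned model.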

2. Direct Preference Optimization (DPO)

A newer approach that achieves similar results with a much simpler pipeline.

flowchart TD
    A[Reference Model] --> B[DPO Training]
    C[Preference Dataset] --> B
    B --> D[Aligned Model]

    subgraph "DPO Advantages"
        E[✅ Single training phase]
        F[✅ No reward model needed]
        G[✅ Mathematically elegant]
        H[✅ Computationally efficient]
    end

    style A fill:#e3f2fd,stroke:#1976d2
    style B fill:#fff3e0,stroke:#f57c00
    style C fill:#f3e5f5,stroke:#7b1fa2
    style D fill:#c8e6c9,stroke:#1B5E20

Key Characteristics:

- Simpler pipeline: Direct optimization without intermediate models
- Mathematically elegant: Bypasses reward modeling through clever loss function design
- Emerging standard: Increasingly adopted by research and industry
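
To see what "bypasses reward modeling" means in practice, here is a minimal sketch of the standard DPO loss: it compares how much the policy and the frozen reference model each favor the chosen response over the rejected one, and nudges the policy to widen that margin. The function signature and the toy log-probabilities are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective:
    loss = -log sigmoid( beta * [ (log pi(y_w|x) - log pi_ref(y_w|x))
                                - (log pi(y_l|x) - log pi_ref(y_l|x)) ] )
    where y_w is the chosen (preferred) response and y_l the rejected one."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with made-up summed log-probabilities for 2 preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -15.5]),
    policy_rejected_logps=torch.tensor([-14.0, -15.0]),
    ref_chosen_logps=torch.tensor([-13.0, -15.2]),
    ref_rejected_logps=torch.tensor([-13.5, -15.1]),
)
print(loss)  # scalar; in real training, gradients flow only through the policy log-probs
```

Because the reward is expressed implicitly through the policy/reference log-ratio, no separate reward model or reinforcement learning loop is needed.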


When to Use Each Approach

| Scenario | Recommended Approach | Reasoning |
| --- | --- | --- |
| Production systems | RLHF | Battle-tested, well-understood failure modes |
| Research experiments | DPO | Faster iteration, easier to debug |
| Limited compute | DPO | Requires fewer model training runs |
| Complex preferences | RLHF | Reward model provides interpretability |

Interactive Exercise

Alignment Challenge

Scenario: You're building a customer service chatbot. A user asks: "How do I cancel my subscription?"

Response A: "To cancel your subscription, go to Settings > Account > Cancel Subscription. Click 'Confirm' when prompted."

Response B: "I understand you want to cancel your subscription. While I can help with that, I'd love to know if there's anything we could do to improve your experience first. To cancel, visit Settings > Account > Cancel Subscription."

Question: Which response would you prefer and why? How might this preference data be used in RLHF vs DPO?
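
Whichever response you prefer, the judgment is captured the same way: as a preference record pairing the prompt with a chosen and a rejected response. In RLHF these records train the reward model; in DPO they feed the loss directly. The field names below follow a common convention but are only an illustrative assumption, as is the choice of Response A as the preferred answer.

```python
# A hypothetical preference record as it might appear in a JSONL dataset,
# assuming the annotator preferred the more direct Response A.
preference_record = {
    "prompt": "How do I cancel my subscription?",
    "chosen": "To cancel your subscription, go to Settings > Account > "
              "Cancel Subscription. Click 'Confirm' when prompted.",
    "rejected": "I understand you want to cancel your subscription. While I can "
                "help with that, I'd love to know if there's anything we could do "
                "to improve your experience first. To cancel, visit Settings > "
                "Account > Cancel Subscription.",
}
```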


Key Takeaways

  1. Alignment addresses subjective qualities that traditional fine-tuning cannot handle effectively
  2. RLHF is the established method with proven results but higher complexity
  3. DPO is the emerging alternative offering similar results with simpler implementation
  4. Both methods rely on human preference data rather than absolute "correct" answers
  5. The choice depends on your specific use case and resource constraints