201: Perplexity¶
Chapter Overview
Perplexity (PPL) is a metric used to measure how well a probability model (like a language model) predicts a sample of text. It is one of the most common intrinsic metrics for evaluating a model's general language modeling capability.
Intuitively, perplexity can be thought of as the model's uncertainty or "surprise" when encountering a piece of text.
Mathematical Definition¶
Perplexity is mathematically defined as the exponential of the cross-entropy loss:
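Concretely, for a tokenized sequence $X = (x_1, \dots, x_N)$ (notation here is one common convention; $x_{<i}$ denotes the tokens preceding $x_i$):

$$
\mathrm{PPL}(X) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i})\right)
$$

where $p_\theta(x_i \mid x_{<i})$ is the probability the model assigns to the actual token $x_i$ given its preceding context. The quantity inside the exponential is the average per-token cross-entropy loss, so perplexity is simply $e^{\text{loss}}$.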
Interpretation¶
- Low perplexity indicates the model is less "surprised" by the text: the probability distribution it predicted for the next tokens closely matches the actual tokens in the text. This suggests a good language model.
- High perplexity indicates the model is more "surprised": the tokens that actually appeared were considered less likely by the model. This suggests a poorer language model.
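To make the "surprise" intuition concrete, here is a minimal, self-contained Python sketch. The probabilities are made up for illustration: each value is the probability a hypothetical model assigned to the token that actually appeared. The sketch averages the negative log-likelihoods (the cross-entropy loss) and exponentiates to get perplexity.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the actual tokens.

    token_probs: list of p(actual token | context), one value per token.
    """
    # Average negative log-likelihood = cross-entropy loss per token.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities assigned to the same 5-token continuation:
confident_model = [0.90, 0.85, 0.70, 0.95, 0.80]   # rarely "surprised"
uncertain_model = [0.10, 0.15, 0.12, 0.20, 0.05]   # frequently "surprised"

print(perplexity(confident_model))  # ≈ 1.2  -> low perplexity
print(perplexity(uncertain_model))  # ≈ 8.9  -> high perplexity
```

The confident model's perplexity stays close to 1 (the minimum possible value), while the uncertain model's is several times higher, mirroring the low-surprise versus high-surprise cases described above.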
Visual Comparison¶
```mermaid
flowchart TD
    subgraph one ["Low Perplexity Model"]
        A["Input:<br/>The cat sat on the ___"] --> B["Model Prediction<br/>'mat': 90%<br/>'dog': 5%<br/>'sky': 1%"]
        B -->|"Actual next word is 'mat'"| C["✅ Low Surprise, Low Loss<br/>= Low Perplexity"]
    end
    subgraph two ["High Perplexity Model"]
        D["Input:<br/>The cat sat on the ___"] --> E["Model Prediction<br/>'mat': 10%<br/>'dog': 15%<br/>'sky': 12%"]
        E -->|"Actual next word is 'mat'"| F["❌ High Surprise, High Loss<br/>= High Perplexity"]
    end
    style C fill:#c8e6c9,stroke:#1B5E20
    style F fill:#ffcdd2,stroke:#B71C1C
```
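In practice, perplexity is usually computed from a model's cross-entropy loss rather than by hand. The sketch below assumes the Hugging Face `transformers` library and uses GPT-2 purely as an illustrative model; the key step is exponentiating the mean per-token loss that the model returns when labels are supplied.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM with a matching tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The cat sat on the mat."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    # over the predicted (shifted) tokens.
    outputs = model(**inputs, labels=inputs["input_ids"])

ppl = torch.exp(outputs.loss)
print(f"Perplexity: {ppl.item():.2f}")
```

Note that perplexity values depend on the tokenizer and the evaluation text, so comparisons are only meaningful between models evaluated on the same dataset under comparable tokenization.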
Practical Considerations¶
Advantages¶
- Intrinsic measure: Evaluates the model's core language modeling capability
- Standardized: Widely used across the field, enabling model comparisons
- Automated: Can be computed without human annotation
Limitations¶
- Task-agnostic: May not correlate with performance on specific downstream tasks
- Context-dependent: Perplexity varies significantly based on the evaluation dataset
- Not user-centric: Doesn't directly measure user satisfaction or task completion
When to Use Perplexity¶
Perplexity is most valuable when:

- Comparing different language models on the same dataset
- Monitoring model performance during training
- Establishing baseline performance for language modeling tasks
- Evaluating model improvements at the pre-training stage
Course Navigation¶
- Previous: 200: Evaluating LLM Systems
- Next: 202: Functional Correctness
This course material is part of the AI Engineering interactive course for beginners.