201: Perplexity¶
Chapter Overview
Perplexity (PPL) is a metric used to measure how well a probability model (like a language model) predicts a sample of text. It is one of the most common intrinsic metrics for evaluating a model's general language modeling capability.
Intuitively, perplexity can be thought of as the model's uncertainty or "surprise" when encountering a piece of text.
Mathematical Definition¶
Perplexity is mathematically defined as the exponential of the cross-entropy loss:
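Concretely, for a tokenized sequence $X = (x_1, \dots, x_N)$ (notation here is one common convention; $x_{<i}$ denotes the tokens preceding $x_i$):

$$
\mathrm{PPL}(X) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i})\right)
$$

where $p_\theta(x_i \mid x_{<i})$ is the probability the model assigns to the actual token $x_i$ given its preceding context. The quantity inside the exponential is the average per-token cross-entropy loss, so perplexity is simply $e^{\text{loss}}$.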
Interpretation¶
- Low perplexity indicates the model is less "surprised" by the text: the probability distribution it predicted for the next tokens closely matches the actual tokens in the text. This suggests a good language model.
- High perplexity indicates the model is more "surprised": the tokens that actually appeared were considered less likely by the model. This suggests a poorer language model.
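To make the "surprise" intuition concrete, here is a minimal, self-contained Python sketch. The probabilities are made up for illustration: each value is the probability a hypothetical model assigned to the token that actually appeared. The sketch averages the negative log-likelihoods (the cross-entropy loss) and exponentiates to get perplexity.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the actual tokens.

    token_probs: list of p(actual token | context), one value per token.
    """
    # Average negative log-likelihood = cross-entropy loss per token.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities assigned to the same 5-token continuation:
confident_model = [0.90, 0.85, 0.70, 0.95, 0.80]   # rarely "surprised"
uncertain_model = [0.10, 0.15, 0.12, 0.20, 0.05]   # frequently "surprised"

print(perplexity(confident_model))  # ≈ 1.2  -> low perplexity
print(perplexity(uncertain_model))  # ≈ 8.9  -> high perplexity
```

The confident model's perplexity stays close to 1 (the minimum possible value), while the uncertain model's is several times higher, mirroring the low-surprise versus high-surprise cases described above.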
Visual Comparison¶
```mermaid
flowchart TD
    subgraph one ["Low Perplexity Model"]
        A["Input:<br/>The cat sat on the ___"] --> B["Model Prediction<br/>'mat': 90%<br/>'dog': 5%<br/>'sky': 1%"]
        B -->|"Actual next word is 'mat'"| C["✅ Low Surprise, Low Loss<br/>= Low Perplexity"]
    end
    subgraph two ["High Perplexity Model"]
        D["Input:<br/>The cat sat on the ___"] --> E["Model Prediction<br/>'mat': 10%<br/>'dog': 15%<br/>'sky': 12%"]
        E -->|"Actual next word is 'mat'"| F["❌ High Surprise, High Loss<br/>= High Perplexity"]
    end
    style C fill:#c8e6c9,stroke:#1B5E20
    style F fill:#ffcdd2,stroke:#B71C1C
```
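In practice, perplexity is usually computed from a model's cross-entropy loss rather than by hand. The sketch below assumes the Hugging Face `transformers` library and uses GPT-2 purely as an illustrative model; the key step is exponentiating the mean per-token loss that the model returns when labels are supplied.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM with a matching tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The cat sat on the mat."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    # over the predicted (shifted) tokens.
    outputs = model(**inputs, labels=inputs["input_ids"])

ppl = torch.exp(outputs.loss)
print(f"Perplexity: {ppl.item():.2f}")
```

Note that perplexity values depend on the tokenizer and the evaluation text, so comparisons are only meaningful between models evaluated on the same dataset under comparable tokenization.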
Practical Considerations¶
Advantages¶
- Intrinsic measure: Evaluates the model's core language modeling capability
- Standardized: Widely used across the field, enabling model comparisons
- Automated: Can be computed without human annotation
Limitations¶
- Task-agnostic: May not correlate with performance on specific downstream tasks
- Context-dependent: Perplexity varies significantly based on the evaluation dataset
- Not user-centric: Doesn't directly measure user satisfaction or task completion
When to Use Perplexity¶
Perplexity is most valuable when:

- Comparing different language models on the same dataset
- Monitoring model performance during training
- Establishing baseline performance for language modeling tasks
- Evaluating model improvements at the pre-training stage
Course Navigation¶
- Previous: 200: Evaluating LLM Systems
- Next: 202: Functional Correctness
This course material is part of the AI Engineering interactive course for beginners.