203: Similarity Metrics (BLEU, ROUGE, etc.)¶
Chapter Overview
When you have reference data (a "ground-truth" or "golden" answer), you can evaluate a model's output by measuring how similar it is to that reference. This is a common strategy for tasks like translation and summarization.
These metrics fall into two main categories: Lexical (word-based) and Semantic (meaning-based).
1. Lexical Similarity¶
Lexical metrics measure the overlap of words and phrases (n-grams) between the model's generated output and the reference text.
```mermaid
flowchart TD
    subgraph lexcomp ["Lexical Comparison"]
        A["Model Output:<br/>'The quick brown cat jumped over the lazy dog.'"]
        B["Reference:<br/>'The speedy brown cat leaped over the lazy dog.'"]
        A -->|"Word Overlap Analysis"| C["Shared Words:<br/>the, brown, cat, over, the, lazy, dog"]
        B -->|"Word Overlap Analysis"| C
    end
    C --> D["Result: High n-gram overlap<br/>= High Lexical Score"]
    style D fill:#c8e6c9,stroke:#1B5E20
```
BLEU (Bilingual Evaluation Understudy)¶
Primary Use: Machine translation evaluation
How it works: BLEU measures the precision of n-grams (sequences of words) between the generated text and reference translations. It considers 1-grams through 4-grams and applies a brevity penalty to prevent overly short translations from scoring artificially high.
Formula: BLEU = BP × exp(∑ₙ wₙ × log pₙ)

Where:
- BP = brevity penalty
- wₙ = weight for n-gram order n (typically uniform, e.g. 1/4 each for n = 1 to 4)
- pₙ = modified precision of n-grams of order n
Example:
Reference: "The cat is on the mat"
Candidate: "The cat sits on the mat"
1-gram precision: 5/6 (5 matching words out of 6)
2-gram precision: 3/5 (3 matching bigrams out of 5)
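A minimal sketch of this calculation with NLTK's sentence-level BLEU (assuming the nltk package is available); the weights argument restricts scoring to 1-grams and 2-grams, since higher-order n-grams do not match in sentences this short:

```python
# Minimal BLEU sketch with NLTK (assumes `pip install nltk`).
from nltk.translate.bleu_score import sentence_bleu

reference = "the cat is on the mat".split()
candidate = "the cat sits on the mat".split()

# Geometric mean of 1-gram (5/6) and 2-gram (3/5) precision; the brevity
# penalty is 1 because candidate and reference have the same length.
score = sentence_bleu([reference], candidate, weights=(0.5, 0.5))
print(f"BLEU (1- and 2-grams): {score:.3f}")  # ≈ 0.707
```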
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)¶
Primary Use: Text summarization evaluation
How it works: ROUGE focuses on recall rather than precision, measuring what percentage of the reference text's n-grams appear in the generated text.
Variants:
- ROUGE-N: N-gram co-occurrence statistics
- ROUGE-L: Longest common subsequence
- ROUGE-W: Weighted longest common subsequence
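As a minimal sketch, ROUGE-1 and ROUGE-L can be computed with the rouge-score package (one of several implementations), which reports precision, recall, and F1 for each requested variant:

```python
# ROUGE sketch using the `rouge-score` package (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "The cat is on the mat"
candidate = "The cat sits on the mat"

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)  # score(target, prediction)

for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} "
          f"recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```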
```mermaid
flowchart LR
    subgraph bleu ["BLEU (Precision-focused)"]
        A["Generated Text"] --> B["How many words in<br/>generated text match<br/>reference?"]
        B --> C["Good for: Translation<br/>Penalizes extra words"]
    end
    subgraph rouge ["ROUGE (Recall-focused)"]
        D["Reference Text"] --> E["How many words in<br/>reference are captured<br/>by generated text?"]
        E --> F["Good for: Summarization<br/>Rewards completeness"]
    end
    style C fill:#e3f2fd,stroke:#1976d2
    style F fill:#fce4ec,stroke:#c2185b
```
2. Semantic Similarity¶
Semantic metrics evaluate the meaning and context of text, going beyond simple word matching to understand conceptual similarity.
Embedding-based Metrics¶
BERTScore: Uses contextual embeddings from BERT to score similarity, matching candidate and reference tokens by cosine similarity and reporting precision, recall, and F1 rather than relying on exact word matches.
Sentence-BERT: Generates sentence-level embeddings and computes cosine similarity between generated and reference texts.
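For example, BERTScore can be computed with the bert-score package (one of several implementations); a minimal sketch:

```python
# BERTScore sketch using the `bert-score` package (pip install bert-score).
# Downloads a pretrained model on first use.
from bert_score import score

candidates = ["The feline jumped"]
references = ["The cat leaped"]

# P, R, F1 are tensors with one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1[0].item():.3f}")
```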
```mermaid
flowchart TD
    subgraph semantic ["Semantic Similarity Process"]
        A["Model Output:<br/>'The feline jumped'"] --> B["Convert to<br/>Embeddings"]
        C["Reference:<br/>'The cat leaped'"] --> D["Convert to<br/>Embeddings"]
        B --> E["Embedding Vector A"]
        D --> F["Embedding Vector B"]
        E --> G["Cosine Similarity<br/>Calculation"]
        F --> G
        G --> H["High Similarity Score<br/>(Despite different words)"]
    end
    style H fill:#c8e6c9,stroke:#1B5E20
```
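The pipeline in the diagram can be sketched with the sentence-transformers library; the model name below is an assumed common choice, not a requirement:

```python
# Sentence-embedding similarity sketch (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; any sentence encoder works

embeddings = model.encode(["The feline jumped", "The cat leaped"])
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity.item():.3f}")  # high despite different content words
```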
Advantages and Limitations¶
Lexical Metrics (BLEU, ROUGE)¶
Advantages:
- Fast and efficient to compute
- Widely standardized across the research community
- Largely language-agnostic (given suitable tokenization)
- Reproducible results

Limitations:
- Ignore semantic meaning
- Sensitive to exact word choice
- May penalize valid paraphrases
- Limited correlation with human judgment
Semantic Metrics (BERTScore, etc.)¶
Advantages:
- Captures meaning beyond exact words
- Better correlation with human evaluation
- Handles paraphrases effectively
- Context-aware evaluation

Limitations:
- Computationally expensive
- Requires pre-trained models
- May be biased by training data
- Less interpretable than lexical metrics
Practical Implementation¶
Choosing the Right Metric¶
For Translation Tasks: Use BLEU as the primary metric, supplemented by semantic metrics for nuanced evaluation.
For Summarization: ROUGE metrics provide good baseline evaluation, particularly ROUGE-L for capturing structural similarity.
For Open-ended Generation: Combine multiple metrics including semantic similarity measures for comprehensive evaluation.
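A sketch of what combining metrics can look like in practice; the function name, metric choices, and embedding model below are illustrative assumptions, not a prescribed recipe:

```python
# Illustrative sketch of reporting several complementary metrics side by side.
# Assumes nltk, rouge-score, and sentence-transformers are installed.
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

_rouge = rouge_scorer.RougeScorer(["rougeL"])
_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def evaluate(candidate: str, reference: str) -> dict:
    """Return one lexical metric, one overlap metric, and one semantic metric."""
    bleu = sentence_bleu([reference.split()], candidate.split(), weights=(0.5, 0.5))
    rouge_l = _rouge.score(reference, candidate)["rougeL"].fmeasure
    emb = _encoder.encode([candidate, reference])
    cosine = util.cos_sim(emb[0], emb[1]).item()
    return {"bleu_1_2": bleu, "rougeL_f1": rouge_l, "embedding_cosine": cosine}

print(evaluate("The cat sits on the mat", "The cat is on the mat"))
```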
Best Practices¶
- Multiple References: Use several reference texts when possible to reduce bias toward specific phrasings.
- Metric Combination: No single metric captures all aspects of text quality. Use multiple complementary metrics.
- Human Validation: Regularly validate automated metrics against human judgments to ensure relevance.
- Domain Adaptation: Consider domain-specific modifications or additional metrics for specialized applications.
Example Comparison¶
Task: Summarize a news article about climate change
Reference: "Global temperatures continue rising due to increased greenhouse gas emissions."
Model A: "Earth's temperature increases because of more greenhouse gases."
Model B: "The planet gets warmer from pollution and carbon dioxide."
Lexical Similarity:
- Model A: Higher BLEU/ROUGE (more word overlap)
- Model B: Lower BLEU/ROUGE (different vocabulary)
Semantic Similarity:
- Model A: High BERTScore (similar meaning, structure)
- Model B: High BERTScore (captures same concept despite different words)
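This comparison can be reproduced with the tools from the earlier sections; exact scores depend on the implementation and embedding model, so treat the sketch below as illustrative:

```python
# Sketch: score Model A and Model B against the reference with ROUGE-L and
# embedding cosine similarity (assumes rouge-score and sentence-transformers).
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "Global temperatures continue rising due to increased greenhouse gas emissions."
outputs = {
    "Model A": "Earth's temperature increases because of more greenhouse gases.",
    "Model B": "The planet gets warmer from pollution and carbon dioxide.",
}

rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
ref_emb = encoder.encode(reference)

for name, text in outputs.items():
    lexical = rouge.score(reference, text)["rougeL"].fmeasure
    semantic = util.cos_sim(encoder.encode(text), ref_emb).item()
    print(f"{name}: ROUGE-L={lexical:.2f}  cosine={semantic:.2f}")
```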
When to Use Similarity Metrics¶
Ideal Scenarios:
- Tasks with clear reference standards
- Automated evaluation pipelines
- Comparing multiple model variants
- Initial screening of model outputs

Limitations to Consider:
- Creative or open-ended tasks
- Tasks requiring factual accuracy verification
- Multi-turn conversations
- Tasks where multiple valid answers exist
This course material is part of the AI Engineering interactive course for beginners.