203: Similarity Metrics (BLEU, ROUGE, etc.)¶
Chapter Overview
When you have reference data (a "ground-truth" or "golden" answer), you can evaluate a model's output by measuring how similar it is to that reference. This is a common strategy for tasks like translation and summarization.
These metrics fall into two main categories: Lexical (word-based) and Semantic (meaning-based).
1. Lexical Similarity¶
Lexical metrics measure the overlap of words and phrases (n-grams) between the model's generated output and the reference text.
```mermaid
flowchart TD
    subgraph lexcomp ["Lexical Comparison"]
        A["Model Output:<br/>'The quick brown cat jumped over the lazy dog.'"]
        B["Reference:<br/>'The speedy brown cat leaped over the lazy dog.'"]
        A -->|"Word Overlap Analysis"| C["Shared Words:<br/>the, brown, cat, over, the, lazy, dog"]
        B -->|"Word Overlap Analysis"| C
    end
    C --> D["Result: High n-gram overlap<br/>= High Lexical Score"]
    style D fill:#c8e6c9,stroke:#1B5E20
```
BLEU (Bilingual Evaluation Understudy)¶
Primary Use: Machine translation evaluation
How it works: BLEU measures the precision of n-grams (sequences of words) between the generated text and reference translations. It considers 1-grams through 4-grams and applies a brevity penalty to prevent overly short translations from scoring artificially high.
Formula: BLEU = BP × exp(∑ₙ wₙ × log pₙ)

Where:
- BP = brevity penalty
- wₙ = weight for n-gram order n (typically uniform, e.g. 1/4 each for n = 1 to 4)
- pₙ = modified precision of n-grams of order n
Example:
Reference: "The cat is on the mat"
Candidate: "The cat sits on the mat"
1-gram precision: 5/6 (5 matching words out of 6)
2-gram precision: 3/5 (3 matching bigrams out of 5)
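A minimal sketch of this calculation with NLTK's sentence-level BLEU (assuming the nltk package is available); the weights argument restricts scoring to 1-grams and 2-grams, since higher-order n-grams do not match in sentences this short:

```python
# Minimal BLEU sketch with NLTK (assumes `pip install nltk`).
from nltk.translate.bleu_score import sentence_bleu

reference = "the cat is on the mat".split()
candidate = "the cat sits on the mat".split()

# Geometric mean of 1-gram (5/6) and 2-gram (3/5) precision; the brevity
# penalty is 1 because candidate and reference have the same length.
score = sentence_bleu([reference], candidate, weights=(0.5, 0.5))
print(f"BLEU (1- and 2-grams): {score:.3f}")  # ≈ 0.707
```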
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)¶
Primary Use: Text summarization evaluation
How it works: ROUGE focuses on recall rather than precision, measuring what percentage of the reference text's n-grams appear in the generated text.
Variants:
- ROUGE-N: N-gram co-occurrence statistics
- ROUGE-L: Longest common subsequence
- ROUGE-W: Weighted longest common subsequence
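As a minimal sketch, ROUGE-1 and ROUGE-L can be computed with the rouge-score package (one of several implementations), which reports precision, recall, and F1 for each requested variant:

```python
# ROUGE sketch using the `rouge-score` package (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "The cat is on the mat"
candidate = "The cat sits on the mat"

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)  # score(target, prediction)

for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} "
          f"recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```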
```mermaid
flowchart LR
    subgraph bleu ["BLEU (Precision-focused)"]
        A["Generated Text"] --> B["How many words in<br/>generated text match<br/>reference?"]
        B --> C["Good for: Translation<br/>Penalizes extra words"]
    end
    subgraph rouge ["ROUGE (Recall-focused)"]
        D["Reference Text"] --> E["How many words in<br/>reference are captured<br/>by generated text?"]
        E --> F["Good for: Summarization<br/>Rewards completeness"]
    end
    style C fill:#e3f2fd,stroke:#1976d2
    style F fill:#fce4ec,stroke:#c2185b
```
2. Semantic Similarity¶
Semantic metrics evaluate the meaning and context of text, going beyond simple word matching to understand conceptual similarity.
Embedding-based Metrics¶
BERTScore: Uses contextual embeddings from BERT to score similarity, matching candidate and reference tokens by cosine similarity and reporting precision, recall, and F1 rather than relying on exact word matches.
Sentence-BERT: Generates sentence-level embeddings and computes cosine similarity between generated and reference texts.
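For example, BERTScore can be computed with the bert-score package (one of several implementations); a minimal sketch:

```python
# BERTScore sketch using the `bert-score` package (pip install bert-score).
# Downloads a pretrained model on first use.
from bert_score import score

candidates = ["The feline jumped"]
references = ["The cat leaped"]

# P, R, F1 are tensors with one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1[0].item():.3f}")
```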
```mermaid
flowchart TD
    subgraph semantic ["Semantic Similarity Process"]
        A["Model Output:<br/>'The feline jumped'"] --> B["Convert to<br/>Embeddings"]
        C["Reference:<br/>'The cat leaped'"] --> D["Convert to<br/>Embeddings"]
        B --> E["Embedding Vector A"]
        D --> F["Embedding Vector B"]
        E --> G["Cosine Similarity<br/>Calculation"]
        F --> G
        G --> H["High Similarity Score<br/>(Despite different words)"]
    end
    style H fill:#c8e6c9,stroke:#1B5E20
```
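The pipeline in the diagram can be sketched with the sentence-transformers library; the model name below is an assumed common choice, not a requirement:

```python
# Sentence-embedding similarity sketch (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; any sentence encoder works

embeddings = model.encode(["The feline jumped", "The cat leaped"])
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity.item():.3f}")  # high despite different content words
```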
Advantages and Limitations¶
Lexical Metrics (BLEU, ROUGE)¶
Advantages:
- Fast and efficient to compute
- Widely standardized across the research community
- Largely language-agnostic (given suitable tokenization)
- Reproducible results

Limitations:
- Ignore semantic meaning
- Sensitive to exact word choice
- May penalize valid paraphrases
- Limited correlation with human judgment
Semantic Metrics (BERTScore, etc.)¶
Advantages:
- Captures meaning beyond exact words
- Better correlation with human evaluation
- Handles paraphrases effectively
- Context-aware evaluation

Limitations:
- Computationally expensive
- Requires pre-trained models
- May be biased by training data
- Less interpretable than lexical metrics
Practical Implementation¶
Choosing the Right Metric¶
For Translation Tasks: Use BLEU as the primary metric, supplemented by semantic metrics for nuanced evaluation.
For Summarization: ROUGE metrics provide good baseline evaluation, particularly ROUGE-L for capturing structural similarity.
For Open-ended Generation: Combine multiple metrics including semantic similarity measures for comprehensive evaluation.
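A sketch of what combining metrics can look like in practice; the function name, metric choices, and embedding model below are illustrative assumptions, not a prescribed recipe:

```python
# Illustrative sketch of reporting several complementary metrics side by side.
# Assumes nltk, rouge-score, and sentence-transformers are installed.
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

_rouge = rouge_scorer.RougeScorer(["rougeL"])
_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def evaluate(candidate: str, reference: str) -> dict:
    """Return one lexical metric, one overlap metric, and one semantic metric."""
    bleu = sentence_bleu([reference.split()], candidate.split(), weights=(0.5, 0.5))
    rouge_l = _rouge.score(reference, candidate)["rougeL"].fmeasure
    emb = _encoder.encode([candidate, reference])
    cosine = util.cos_sim(emb[0], emb[1]).item()
    return {"bleu_1_2": bleu, "rougeL_f1": rouge_l, "embedding_cosine": cosine}

print(evaluate("The cat sits on the mat", "The cat is on the mat"))
```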
Best Practices¶
- Multiple References: Use several reference texts when possible to reduce bias toward specific phrasings.
- Metric Combination: No single metric captures all aspects of text quality. Use multiple complementary metrics.
- Human Validation: Regularly validate automated metrics against human judgments to ensure relevance.
- Domain Adaptation: Consider domain-specific modifications or additional metrics for specialized applications.
Example Comparison¶
Task: Summarize a news article about climate change
Reference: "Global temperatures continue rising due to increased greenhouse gas emissions."
Model A: "Earth's temperature increases because of more greenhouse gases."
Model B: "The planet gets warmer from pollution and carbon dioxide."
Lexical Similarity:
- Model A: Higher BLEU/ROUGE (more word overlap)
- Model B: Lower BLEU/ROUGE (different vocabulary)
Semantic Similarity:
- Model A: High BERTScore (similar meaning, structure)
- Model B: High BERTScore (captures same concept despite different words)
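This comparison can be reproduced with the tools from the earlier sections; exact scores depend on the implementation and embedding model, so treat the sketch below as illustrative:

```python
# Sketch: score Model A and Model B against the reference with ROUGE-L and
# embedding cosine similarity (assumes rouge-score and sentence-transformers).
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "Global temperatures continue rising due to increased greenhouse gas emissions."
outputs = {
    "Model A": "Earth's temperature increases because of more greenhouse gases.",
    "Model B": "The planet gets warmer from pollution and carbon dioxide.",
}

rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
ref_emb = encoder.encode(reference)

for name, text in outputs.items():
    lexical = rouge.score(reference, text)["rougeL"].fmeasure
    semantic = util.cos_sim(encoder.encode(text), ref_emb).item()
    print(f"{name}: ROUGE-L={lexical:.2f}  cosine={semantic:.2f}")
```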
When to Use Similarity Metrics¶
Ideal Scenarios:
- Tasks with clear reference standards
- Automated evaluation pipelines
- Comparing multiple model variants
- Initial screening of model outputs

Limitations to Consider:
- Creative or open-ended tasks
- Tasks requiring factual accuracy verification
- Multi-turn conversations
- Tasks where multiple valid answers exist
This course material is part of the AI Engineering interactive course for beginners.