113: Multi-Head Attention

Chapter Overview

The self-attention mechanism is powerful, but a single attention calculation tends to capture only one type of relationship within the text. Multi-Head Attention is the architectural innovation that allows the model to overcome this limitation.

Instead of performing a single attention calculation, it runs multiple attention "heads" in parallel and then combines their outputs. This allows the model to simultaneously learn and focus on different aspects of the language.
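A quick note on dimensions helps explain why the extra heads come almost for free: rather than giving each head a full-width view, the model's embedding dimension is split across the heads, so the total computation stays close to that of a single wide head. A minimal sketch of the arithmetic, assuming the common convention of splitting the model dimension across heads (the sizes d_model = 512 and num_heads = 8 are illustrative, not from this chapter):

d_model = 512                     # width of each token's embedding vector
num_heads = 8                     # number of parallel attention heads
d_head = d_model // num_heads     # 64: each head attends in its own 64-dim subspace

# Concatenating all head outputs restores the original embedding width.
assert num_heads * d_head == d_model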


The Core Idea: Different Perspectives

Imagine you're analyzing a sentence. You might want to pay attention to different things at once:

- Syntactic relationships: Which word is the subject of this verb?
- Semantic relationships: Which words are synonyms or related in meaning?
- Positional relationships: Which words are nearby?

A single attention mechanism might learn to focus on just one of these. Multi-Head Attention allows the model to have dedicated "specialists" for each type of relationship.

flowchart TD
    subgraph one ["Step 1: Start with Input"]
        A[Input Embedding<br/>with Positional Encoding]
    end

    subgraph two ["Step 2: Project into Multiple 'Heads'"]
        A --> P1["Projection for Head 1<br/>(Q₁, K₁, V₁)"]
        A --> P2["Projection for Head 2<br/>(Q₂, K₂, V₂)"]
        A --> P3["..."]
        A --> PN["Projection for Head N<br/>(Qₙ, Kₙ, Vₙ)"]
    end

    subgraph three ["Step 3: Run Attention in Parallel"]
        P1 --> H1["Head 1 Output<br/>(focuses on syntax)"]
        P2 --> H2["Head 2 Output<br/>(focuses on semantics)"]
        P3 --> H3[...]
        PN --> HN["Head N Output<br/>(focuses on other patterns)"]
    end

    subgraph four ["Step 4: Combine and Finalize"]
        H1 --> C[Concatenate All<br/>Head Outputs]
        H2 --> C
        H3 --> C
        HN --> C
        C --> D[Final Linear Projection] --> E[Final Output Vector]
    end

    style A fill:#e3f2fd,stroke:#1976d2
    style C fill:#fff3e0,stroke:#f57c00
    style D fill:#fce4ec,stroke:#c2185b
    style E fill:#c8e6c9,stroke:#1B5E20,stroke-width:2px
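
To tie the diagram's four steps together, here is a minimal PyTorch sketch of the whole block. It is an illustration, not this chapter's reference implementation: the class and weight names (MultiHeadAttention, w_q, w_o) and the sizes are assumptions, and masking and dropout are omitted for clarity.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must divide evenly across heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Step 2: one fused projection each for Q, K, V; the reshape in
        # forward() splits them into per-head subspaces.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # Step 4: final linear projection applied after concatenation.
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Step 3: scaled dot-product attention, run for all heads at once
        # as a single batched matrix multiply.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = F.softmax(scores, dim=-1)
        per_head = weights @ v            # (batch, heads, seq, d_head)

        # Step 4: concatenate the heads, then mix them with the output projection.
        concat = per_head.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.w_o(concat)

# Usage: Step 1's input is an embedding with positional encoding already added.
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)               # (batch, seq_len, d_model)
out = mha(x)
print(out.shape)                          # torch.Size([2, 10, 512])

Note that Step 2 is implemented here as one fused linear layer per tensor rather than N separate per-head projections; the two are equivalent, and reshaping into (batch, heads, seq, d_head) is what actually creates the per-head subspaces that Step 3 attends over in parallel.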