
111: The Self-Attention Mechanism

Chapter Overview

The Self-Attention Mechanism is the engine of the [[110-MOC-Transformer-Architecture|Transformer]]. It's the innovation that allows a model to weigh the importance of different words in a sequence when processing any single word, enabling it to build a deep, contextual understanding of the text.

It answers the question: "When I am looking at this one word, which other words in the sentence should I pay the most attention to?"


The QKV Model: Query, Key, Value

Self-attention works by projecting the embedding of each input token into three distinct vectors for every "attention head". These vectors have specific purposes, best understood by an analogy to a library search: the Query is your research question, the Keys are the book titles, and the Values are the books' contents. A code sketch of these projections follows the list below.

  • Query (Q): Imagine you are researching a topic. Your query is the question you have in mind. In the model, the Query vector represents the current word that is "looking for" context.
  • Key (K): Think of the keys as the keywords or titles on the spines of all the books in the library. The Key vector of each word acts as a label that can be "matched" against a query.
  • Value (V): This is the actual content of the books. The Value vector of each word contains its meaningful content.
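
In code, these are just three learned linear projections applied to the same token embedding. Here is a minimal NumPy sketch; the dimensions and the random weight matrices are illustrative placeholders for learned parameters, not values from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4             # embedding size and Q/K/V size (illustrative)
x = rng.normal(size=d_model)    # embedding of one token, e.g. "making"

# Three separate learned projections (random placeholders here)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

q = x @ W_q   # the question this token asks of its context
k = x @ W_k   # the label other tokens' queries can match against
v = x @ W_v   # the content this token offers if attended to
```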

The Attention Process Illustrated

The process unfolds in a series of matrix operations, but the intuition is straightforward:

sequenceDiagram
    participant C as Input Context
    participant Q as Query Vector<br/>(for "making")
    participant K as Key Vectors<br/>(for all words)
    participant V as Value Vectors<br/>(for all words)
    participant S as Attention Scores
    participant O as Output Vector<br/>(for "making")

    C->>Q: Generate Query for "making"
    C->>K: Generate Keys for all words
    C->>V: Generate Values for all words

    loop For each word in context
        Q->>K: Compare Q("making") with K(word)
        K->>S: Calculate Similarity Score
    end

    Note over S: Scores are scaled and<br/>passed through Softmax<br/>to become weights

    S->>V: Apply weights to all Value vectors
    V->>O: Weighted sum of all values

    Note over O: The new vector for "making"<br/>is now context-rich, heavily<br/>influenced by relevant words

Step-by-Step Breakdown

Let's walk through what happens when we process the word "making" in the sentence: "The Transformer is making it possible to understand language."

Step 1: Create Query, Key, Value Vectors

flowchart TD
    subgraph INPUT ["Input: 'The Transformer is making it possible to understand language'"]
        direction TB
        W1[The] --- W2[Transformer] --- W3[is] --- W4[making] --- W5[it] --- W6[possible] --- W7[to] --- W8[understand] --- W9[language]
    end

    subgraph VECTORS ["Vector Creation"]
        direction TB
        W4 --> Q4[Query Vector<br/>for 'making']
        W1 --> K1[Key Vector<br/>for 'The']
        W2 --> K2[Key Vector<br/>for 'Transformer']
        W3 --> K3[Key Vector<br/>for 'is']
        W4 --> K4[Key Vector<br/>for 'making']
        W5 --> K5[Key Vector<br/>for 'it']
        W6 --> K6[Key Vector<br/>for 'possible']
        W7 --> K7[Key Vector<br/>for 'to']
        W8 --> K8[Key Vector<br/>for 'understand']
        W9 --> K9[Key Vector<br/>for 'language']
    end

    subgraph VALUES ["Value Vectors (Content)"]
        direction TB
        V1[Value for 'The'] --- V2[Value for 'Transformer'] --- V3[Value for 'is'] --- V4[Value for 'making'] --- V5[Value for 'it']
        V6[Value for 'possible'] --- V7[Value for 'to'] --- V8[Value for 'understand'] --- V9[Value for 'language']
    end

    INPUT --> VECTORS
    VECTORS --> VALUES

    style Q4 fill:#e1f5fe,stroke:#0277bd,stroke-width:3px
    style INPUT fill:#f3e5f5,stroke:#7b1fa2
    style VECTORS fill:#fff3e0,stroke:#f57c00
    style VALUES fill:#e8f5e8,stroke:#388e3c
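
Concretely, Step 1 is the batched version of the projection sketch above: every token gets a Key and a Value, and we take the Query for the word in focus. The embeddings and weights below are random placeholders for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
tokens = ["The", "Transformer", "is", "making", "it",
          "possible", "to", "understand", "language"]

d_model, d_k = 8, 4
X = rng.normal(size=(len(tokens), d_model))   # one embedding row per token

W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

q_making = X[tokens.index("making")] @ W_q    # Query for "making" only
K = X @ W_k                                    # Keys for all nine tokens
V = X @ W_v                                    # Values for all nine tokens
```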

Step 2: Calculate Attention Scores

flowchart LR
    subgraph ATTENTION ["Attention Score Calculation"]
        direction TB
        Q[Query: 'making'] --> DOT1[Q·K₁ = 0.1]
        Q --> DOT2[Q·K₂ = 0.8]
        Q --> DOT3[Q·K₃ = 0.2]
        Q --> DOT4[Q·K₄ = 0.9]
        Q --> DOT5[Q·K₅ = 0.7]
        Q --> DOT6[Q·K₆ = 0.3]
        Q --> DOT7[Q·K₇ = 0.1]
        Q --> DOT8[Q·K₈ = 0.6]
        Q --> DOT9[Q·K₉ = 0.4]
    end

    subgraph SOFTMAX ["After Softmax (Normalized)"]
        direction TB
        DOT1 --> S1[w₁ = 0.01]
        DOT2 --> S2[w₂ = 0.24]
        DOT3 --> S3[w₃ = 0.02]
        DOT4 --> S4[w₄ = 0.34]
        DOT5 --> S5[w₅ = 0.19]
        DOT6 --> S6[w₆ = 0.03]
        DOT7 --> S7[w₇ = 0.01]
        DOT8 --> S8[w₈ = 0.12]
        DOT9 --> S9[w₉ = 0.04]
    end

    style DOT2 fill:#ffcdd2,stroke:#c62828
    style DOT4 fill:#ffcdd2,stroke:#c62828
    style DOT5 fill:#ffcdd2,stroke:#c62828
    style S2 fill:#c8e6c9,stroke:#2e7d32
    style S4 fill:#c8e6c9,stroke:#2e7d32
    style S5 fill:#c8e6c9,stroke:#2e7d32
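
For illustration, here is the softmax step applied to the raw scores in the diagram, treated as already scaled by √d_k. Note that the weights in the diagram are stylized for readability; an actual softmax over these scores is flatter, though the ranking is the same:

```python
import numpy as np

# Raw (already-scaled) attention scores from the diagram above
raw_scores = np.array([0.1, 0.8, 0.2, 0.9, 0.7, 0.3, 0.1, 0.6, 0.4])

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

weights = softmax(raw_scores)
print(weights.round(2))
# [0.07 0.15 0.08 0.17 0.14 0.09 0.07 0.12 0.1 ]
# Same ranking as the diagram ("making" > "Transformer" > "it" > ...),
# but smoother than the stylized weights shown above.
```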

Step 3: Weighted Sum of Values

The final output for "making" is created by taking a weighted sum of all value vectors:

flowchart TD
    subgraph WEIGHTED ["Weighted Value Combination"]
        direction TB
        V1[Value: 'The'<br/>×0.01] --> SUM[Final Output<br/>for 'making']
        V2[Value: 'Transformer'<br/>×0.24] --> SUM
        V3[Value: 'is'<br/>×0.02] --> SUM
        V4[Value: 'making'<br/>×0.34] --> SUM
        V5[Value: 'it'<br/>×0.19] --> SUM
        V6[Value: 'possible'<br/>×0.03] --> SUM
        V7[Value: 'to'<br/>×0.01] --> SUM
        V8[Value: 'understand'<br/>×0.12] --> SUM
        V9[Value: 'language'<br/>×0.04] --> SUM
    end

    style V2 fill:#ffcdd2,stroke:#c62828,stroke-width:3px
    style V4 fill:#ffcdd2,stroke:#c62828,stroke-width:3px
    style V5 fill:#ffcdd2,stroke:#c62828,stroke-width:3px
    style SUM fill:#c8e6c9,stroke:#2e7d32,stroke-width:3px
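
Step 3 is a single matrix-vector product: the weight vector from Step 2 times the stacked value vectors. A short sketch, continuing the hypothetical setup from the earlier blocks:

```python
import numpy as np

rng = np.random.default_rng(2)
# Softmax weights from Step 2 (the stylized numbers from the diagram)
weights = np.array([0.01, 0.24, 0.02, 0.34, 0.19, 0.03, 0.01, 0.12, 0.04])
V = rng.normal(size=(9, 4))    # one value vector per token (random placeholders)

output_making = weights @ V    # weighted sum over all nine value vectors
print(output_making.shape)     # (4,): the new, context-rich vector for "making"
```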

Key Insights

Why This Works

The beauty of self-attention is that it allows each word to dynamically determine which other words in the sequence are most relevant to its meaning. In our example, "making" pays most attention to:

  • Itself (0.34) - maintaining its core meaning
  • "Transformer" (0.24) - what is doing the making
  • "it" (0.19) - what is being made
  • "understand" (0.12) - the purpose of the making

Mathematical Foundation

The attention mechanism can be summarized as:

Attention(Q, K, V) = softmax(QK^T / √d_k)V

Where:

  • Q, K, V are the query, key, and value matrices
  • d_k is the dimension of the key vectors, used to scale the dot products
  • The softmax ensures the attention weights sum to 1
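
The formula maps almost directly to code. Below is a minimal single-head NumPy sketch, with no masking or batching; the random matrices stand in for learned weights:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # all query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: each row sums to 1
    return weights @ V                                 # weighted sum of values

rng = np.random.default_rng(3)
X = rng.normal(size=(9, 8))                            # 9 token embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = attention(X @ W_q, X @ W_k, X @ W_v)             # self-attention: one source X
print(out.shape)  # (9, 4): one context-rich vector per token
```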


Multi-Head Attention Preview

In practice, Transformers use multiple attention heads simultaneously, each learning to focus on different types of relationships:

flowchart TD
    subgraph MULTI ["Multi-Head Attention"]
        direction TB
        INPUT[Input: 'making'] --> H1[Head 1:<br/>Syntactic Relations]
        INPUT --> H2[Head 2:<br/>Semantic Relations]
        INPUT --> H3[Head 3:<br/>Positional Relations]
        INPUT --> H4[Head 4:<br/>Task-Specific Relations]

        H1 --> CONCAT[Concatenate<br/>All Heads]
        H2 --> CONCAT
        H3 --> CONCAT
        H4 --> CONCAT

        CONCAT --> FINAL[Final Output<br/>for 'making']
    end

    style INPUT fill:#e1f5fe,stroke:#0277bd
    style H1 fill:#fff3e0,stroke:#f57c00
    style H2 fill:#e8f5e8,stroke:#388e3c
    style H3 fill:#fce4ec,stroke:#c2185b
    style H4 fill:#f3e5f5,stroke:#7b1fa2
    style FINAL fill:#ffecb3,stroke:#ff8f00,stroke-width:3px
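
Here is a corresponding sketch of the multi-head wiring. The head count, dimensions, and final output projection W_o are illustrative; in a real Transformer all of these matrices are learned:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention, as in the single-head sketch above
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(X, heads=4, d_k=4):
    rng = np.random.default_rng(4)
    d_model = X.shape[-1]
    head_outputs = []
    for _ in range(heads):
        # Each head gets its own projections, so it can learn its own relation type
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        head_outputs.append(attention(X @ W_q, X @ W_k, X @ W_v))
    concat = np.concatenate(head_outputs, axis=-1)     # (seq_len, heads * d_k)
    W_o = rng.normal(size=(heads * d_k, d_model))      # final output projection
    return concat @ W_o                                # back to (seq_len, d_model)

X = np.random.default_rng(5).normal(size=(9, 16))      # 9 tokens, d_model = 16
print(multi_head_attention(X).shape)                   # (9, 16)
```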

Next Steps

Now that you understand self-attention, you're ready to explore how multiple attention heads work together:

🔍 Multi-Head Attention →

Or dive deeper into the technical implementation:

← Transformer Architecture | Positional Encoding →