# 111: The Self-Attention Mechanism
**Chapter Overview**
The Self-Attention Mechanism is the engine of the [[110-MOC-Transformer-Architecture|Transformer]]. It's the innovation that allows a model to weigh the importance of different words in a sequence when processing any single word, enabling it to build a deep, contextual understanding of the text.
It answers the question: "When I am looking at this one word, which other words in the sentence should I pay the most attention to?"
## The QKV Model: Query, Key, Value
Self-attention works by projecting the embedding of each input token into three distinct vectors for every "attention head". Each vector has a specific purpose, best understood by an analogy to a library search: the Query is your research question, the Keys are the book titles, and the Values are the books' contents (a minimal code sketch follows the list below).
- **Query (Q):** Imagine you are researching a topic. Your query is the question you have in mind. In the model, the `Query` vector represents the current word that is "looking for" context.
- **Key (K):** Think of the keys as the keywords or titles on the spines of all the books in the library. The `Key` vector of each word acts as a label that can be "matched" against a query.
- **Value (V):** This is the actual content of the books. The `Value` vector of each word contains its meaningful content.
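To make the analogy concrete, here is a minimal NumPy sketch of the projection step. The matrix names (`W_q`, `W_k`, `W_v`) and all dimensions are illustrative assumptions; in a real Transformer these matrices are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4           # illustrative sizes, not from the text
x = rng.normal(size=d_model)  # embedding of one input token

# Learned projection matrices (random placeholders in this sketch)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

q = x @ W_q  # the "research question" this token asks
k = x @ W_k  # the "label" other tokens can match against
v = x @ W_v  # the "content" this token contributes

print(q.shape, k.shape, v.shape)  # (4,) (4,) (4,)
```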
## The Attention Process Illustrated
The process unfolds in a series of matrix operations, but the intuition is straightforward:
sequenceDiagram
participant C as Input Context
participant Q as Query Vector<br/>(for "making")
participant K as Key Vectors<br/>(for all words)
participant V as Value Vectors<br/>(for all words)
participant S as Attention Scores
participant O as Output Vector<br/>(for "making")
C->>Q: Generate Query for "making"
C->>K: Generate Keys for all words
C->>V: Generate Values for all words
loop For each word in context
Q->>K: Compare Q("making") with K(word)
K->>S: Calculate Similarity Score
end
Note over S: Scores are scaled and<br/>passed through Softmax<br/>to become weights
S->>V: Apply weights to all Value vectors
V->>O: Weighted sum of all values
Note over O: The new vector for "making"<br/>is now context-rich, heavily<br/>influenced by relevant words
## Step-by-Step Breakdown
Let's walk through what happens when we process the word "making" in the sentence: "The Transformer is making it possible to understand language."
### Step 1: Create Query, Key, Value Vectors
flowchart TD
subgraph INPUT ["Input: 'The Transformer is making it possible to understand language'"]
direction TB
W1[The] --- W2[Transformer] --- W3[is] --- W4[making] --- W5[it] --- W6[possible] --- W7[to] --- W8[understand] --- W9[language]
end
subgraph VECTORS ["Vector Creation"]
direction TB
W4 --> Q4[Query Vector<br/>for 'making']
W1 --> K1[Key Vector<br/>for 'The']
W2 --> K2[Key Vector<br/>for 'Transformer']
W3 --> K3[Key Vector<br/>for 'is']
W4 --> K4[Key Vector<br/>for 'making']
W5 --> K5[Key Vector<br/>for 'it']
W6 --> K6[Key Vector<br/>for 'possible']
W7 --> K7[Key Vector<br/>for 'to']
W8 --> K8[Key Vector<br/>for 'understand']
W9 --> K9[Key Vector<br/>for 'language']
end
subgraph VALUES ["Value Vectors (Content)"]
direction TB
V1[Value for 'The'] --- V2[Value for 'Transformer'] --- V3[Value for 'is'] --- V4[Value for 'making'] --- V5[Value for 'it']
V6[Value for 'possible'] --- V7[Value for 'to'] --- V8[Value for 'understand'] --- V9[Value for 'language']
end
INPUT --> VECTORS
VECTORS --> VALUES
style Q4 fill:#e1f5fe,stroke:#0277bd,stroke-width:3px
style INPUT fill:#f3e5f5,stroke:#7b1fa2
style VECTORS fill:#fff3e0,stroke:#f57c00
style VALUES fill:#e8f5e8,stroke:#388e3c
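Here is Step 1 as a minimal sketch, assuming random toy embeddings and small illustrative dimensions (a real model uses learned embeddings and far larger vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["The", "Transformer", "is", "making", "it",
          "possible", "to", "understand", "language"]

d_model, d_k = 8, 4
X = rng.normal(size=(len(tokens), d_model))  # one embedding row per token

W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q  # query vectors, one per token
K = X @ W_k  # key vectors, one per token
V = X @ W_v  # value vectors, one per token

q_making = Q[tokens.index("making")]  # the query we follow below
```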
### Step 2: Calculate Attention Scores
flowchart LR
subgraph ATTENTION ["Attention Score Calculation"]
direction TB
Q[Query: 'making'] --> DOT1[Q·K₁ = 0.1]
Q --> DOT2[Q·K₂ = 0.8]
Q --> DOT3[Q·K₃ = 0.2]
Q --> DOT4[Q·K₄ = 0.9]
Q --> DOT5[Q·K₅ = 0.7]
Q --> DOT6[Q·K₆ = 0.3]
Q --> DOT7[Q·K₇ = 0.1]
Q --> DOT8[Q·K₈ = 0.6]
Q --> DOT9[Q·K₉ = 0.4]
end
subgraph SOFTMAX ["After Softmax (Normalized)"]
direction TB
DOT1 --> S1[w₁ = 0.02]
DOT2 --> S2[w₂ = 0.22]
DOT3 --> S3[w₃ = 0.03]
DOT4 --> S4[w₄ = 0.30]
DOT5 --> S5[w₅ = 0.18]
DOT6 --> S6[w₆ = 0.04]
DOT7 --> S7[w₇ = 0.02]
DOT8 --> S8[w₈ = 0.12]
DOT9 --> S9[w₉ = 0.07]
end
style DOT2 fill:#ffcdd2,stroke:#c62828
style DOT4 fill:#ffcdd2,stroke:#c62828
style DOT5 fill:#ffcdd2,stroke:#c62828
style S2 fill:#c8e6c9,stroke:#2e7d32
style S4 fill:#c8e6c9,stroke:#2e7d32
style S5 fill:#c8e6c9,stroke:#2e7d32
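The same calculation in code, continuing the Step 1 sketch above. Because the toy matrices are random, the printed weights won't match the illustrative numbers in the diagram, but they will sum to 1:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d_k = K.shape[-1]
scores = q_making @ K.T         # one dot product per token: q · kᵢ
scaled = scores / np.sqrt(d_k)  # scaling keeps softmax gradients healthy
weights = softmax(scaled)

print(weights.round(2), weights.sum())  # attention weights sum to 1.0
```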
### Step 3: Weighted Sum of Values
The final output for "making" is created by taking a weighted sum of all value vectors:
flowchart TD
subgraph WEIGHTED ["Weighted Value Combination"]
direction TB
V1[Value: 'The'<br/>×0.02] --> SUM[Final Output<br/>for 'making']
V2[Value: 'Transformer'<br/>×0.22] --> SUM
V3[Value: 'is'<br/>×0.03] --> SUM
V4[Value: 'making'<br/>×0.30] --> SUM
V5[Value: 'it'<br/>×0.18] --> SUM
V6[Value: 'possible'<br/>×0.04] --> SUM
V7[Value: 'to'<br/>×0.02] --> SUM
V8[Value: 'understand'<br/>×0.12] --> SUM
V9[Value: 'language'<br/>×0.07] --> SUM
end
style V2 fill:#ffcdd2,stroke:#c62828,stroke-width:3px
style V4 fill:#ffcdd2,stroke:#c62828,stroke-width:3px
style V5 fill:#ffcdd2,stroke:#c62828,stroke-width:3px
style SUM fill:#c8e6c9,stroke:#2e7d32,stroke-width:3px
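Continuing the sketch, Step 3 is a single matrix-vector product: the attention weights applied to the value matrix.

```python
# Weighted sum of all value vectors: each token contributes its value
# vector in proportion to its attention weight.
output_making = weights @ V  # shape: (d_k,)

# Equivalent explicit form, mirroring the diagram above:
# output_making = sum(w_i * V[i] for i, w_i in enumerate(weights))
```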
## Key Insights
**Why This Works**
The beauty of self-attention is that it allows each word to dynamically determine which other words in the sequence are most relevant to its meaning. In our example, "making" pays most attention to:
- Itself (0.30) - maintaining its core meaning
- "Transformer" (0.22) - the subject doing the making
- "it" (0.18) - what is being made possible
- "understand" (0.12) - the purpose of the making
**Mathematical Foundation**
The attention mechanism can be summarized as:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
Where:

- Q, K, V are the query, key, and value matrices
- d_k is the dimension of the key vectors, used for scaling
- The softmax ensures the attention weights sum to 1
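The formula translates almost line for line into a self-contained NumPy function (a minimal sketch; production implementations add masking, batching, and dropout):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (n_queries, d_v)

# Toy usage: 9 tokens, d_k = d_v = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(9, 4))
K = rng.normal(size=(9, 4))
V = rng.normal(size=(9, 4))
print(attention(Q, K, V).shape)  # (9, 4): one context-rich vector per token
```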
## Multi-Head Attention Preview
In practice, Transformers use multiple attention heads simultaneously, each learning to focus on different types of relationships:
flowchart TD
subgraph MULTI ["Multi-Head Attention"]
direction TB
INPUT[Input: 'making'] --> H1[Head 1:<br/>Syntactic Relations]
INPUT --> H2[Head 2:<br/>Semantic Relations]
INPUT --> H3[Head 3:<br/>Positional Relations]
INPUT --> H4[Head 4:<br/>Task-Specific Relations]
H1 --> CONCAT[Concatenate<br/>All Heads]
H2 --> CONCAT
H3 --> CONCAT
H4 --> CONCAT
CONCAT --> FINAL[Final Output<br/>for 'making']
end
style INPUT fill:#e1f5fe,stroke:#0277bd
style H1 fill:#fff3e0,stroke:#f57c00
style H2 fill:#e8f5e8,stroke:#388e3c
style H3 fill:#fce4ec,stroke:#c2185b
style H4 fill:#f3e5f5,stroke:#7b1fa2
style FINAL fill:#ffecb3,stroke:#ff8f00,stroke-width:3px
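A compact sketch of the multi-head pattern, reusing the `attention` function from the previous block. The head count and dimensions are illustrative assumptions, and the per-head specializations shown in the diagram are learned tendencies, not assigned roles:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Split d_model across heads, attend per head, concatenate, project."""
    n, d_model = X.shape
    d_head = d_model // n_heads
    head_outputs = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        # Each head gets its own slice of the projection matrices
        q, k, v = X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl]
        head_outputs.append(attention(q, k, v))
    # Concatenate all heads, then mix them with a final linear projection
    return np.concatenate(head_outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d_model, n_heads = 9, 8, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads).shape)  # (9, 8)
```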
## Next Steps
Now that you understand self-attention, you're ready to explore how multiple attention heads work together:
Or dive deeper into the technical implementation: