112: Positional Encoding
Chapter Overview
The [[111-Self-Attention-Mechanism|Self-Attention]] mechanism is powerful, but it has a fundamental weakness: it is permutation-invariant. This means it treats the input "the cat sat on the mat" and "the mat sat on the cat" as identical because it has no inherent sense of word order.
Positional Encoding is the clever solution to this problem. It injects information about the position of each token directly into its embedding.
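To see the weakness concretely, here is a minimal sketch (not from the text above; the toy matrices, the sequence length of 5, and the dimension of 8 are illustrative choices) that runs plain scaled dot-product self-attention, with no projections and no positional information, on a sequence and on a shuffled copy of it. Each token ends up with exactly the same representation in both cases, just in a different row order, so the model cannot tell the two orderings apart.

```python
import numpy as np

def self_attention(x):
    """Plain scaled dot-product self-attention: no projections, no positional info."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                     # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ x                                # each output mixes all tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))   # 5 toy "word" vectors, model dimension 8
perm = rng.permutation(5)          # the same "sentence", reordered

out_original = self_attention(tokens)
out_shuffled = self_attention(tokens[perm])

# Every token gets exactly the same representation; only the row order differs.
print(np.allclose(out_shuffled, out_original[perm]))   # True
```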
The Core Concept
Before the input embeddings are fed into the first Transformer layer, a positional encoding vector is added to each token's embedding.
This encoding vector is not learned; in the original Transformer it is a fixed vector generated by a sine/cosine formula. This ensures that the model receives a unique and consistent signal for each position in the sequence.
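A minimal sketch of that formula (the `max_len` of 50 and `d_model` of 8 below are illustrative choices, not values from this chapter): even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine, so every position maps to a unique, deterministic vector.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed (non-learned) positional encodings, one row per position."""
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)   # pos / 10000^(2i/d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=8)
print(pe[5, :2].round(2))   # position 5 -> [-0.96  0.28], as in the diagram below
```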
```mermaid
flowchart TD
    subgraph one ["Step 1: Start with the Word"]
        A["Token: 'cat'"] --> B["Word Embedding<br/>[0.2, -0.1, 0.5, ...]"]
    end
    subgraph two ["Step 2: Generate Positional Information"]
        C["Position in Sequence: 5"] --> D["Positional Encoding Vector<br/>(Generated via sine/cosine formula)<br/>[-0.96, 0.28, 0.76, ...]"]
    end
    subgraph three ["Step 3: Combine them"]
        B -->|"Content"| E["➕<br/>Element-wise<br/>Addition"]
        D -->|"Position"| E
    end
    subgraph four ["Step 4: Final Input for Transformer"]
        E --> F["Final Input Vector<br/>(Now contains both content and position info)<br/>[-0.76, 0.18, 1.26, ...]"]
    end

    style A fill:#e3f2fd,stroke:#1976d2
    style C fill:#e3f2fd,stroke:#1976d2
    style B fill:#fff3e0,stroke:#f57c00
    style D fill:#fff3e0,stroke:#f57c00
    style E fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    style F fill:#c8e6c9,stroke:#1B5E20,stroke-width:2px
```
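Putting the steps together in code: the sketch below takes the diagram's toy embedding for 'cat' (padded with zeros out to an assumed `d_model` of 8; in a real model it would come from a learned embedding table), adds the encoding for position 5 element-wise, and recovers the first values shown in the final vector above. It reuses `sinusoidal_positional_encoding()` from the earlier sketch.

```python
import numpy as np

# Assumes sinusoidal_positional_encoding() from the sketch above is in scope.
d_model = 8
word_embedding = np.array([0.2, -0.1, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])  # toy embedding for 'cat'
position = 5

pe = sinusoidal_positional_encoding(max_len=50, d_model=d_model)
final_input = word_embedding + pe[position]   # element-wise addition: content + position

print(final_input[:2].round(2))               # [-0.76  0.18], matching the diagram
```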
Next Steps
With content and position now combined into a single vector, the input is ready for the main processing layers. Let's explore how the Transformer handles multiple attention calculations in parallel.