# 115: The Decoder Architecture Deep Dive
## Chapter Overview
While [[114-Encoder-vs-Decoder-Models|note 114]] compares Encoders and Decoders, this note takes a deeper look at the specific mechanics of a Decoder block, the fundamental building block of generative models such as GPT.
The Decoder's unique challenge is to generate a coherent sequence of text, one token at a time, using two distinct attention steps.
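To make that token-by-token loop concrete, here is a minimal greedy-decoding sketch in Python/PyTorch. The names `decoder`, `memory`, `bos_id`, and `eos_id` are illustrative assumptions (any callable that maps target token IDs plus encoder memory to next-token logits would do), not the API of a particular library.

```python
import torch

def greedy_decode(decoder, memory, bos_id, eos_id, max_len=50):
    """Illustrative greedy decoding loop: re-run the decoder on the partial
    output each step and append its single most likely next token."""
    tokens = [bos_id]                              # start with the beginning-of-sequence token
    for _ in range(max_len):
        tgt = torch.tensor([tokens])               # shape: (1, current_length)
        logits = decoder(tgt, memory)              # shape: (1, current_length, vocab_size)
        next_id = int(logits[0, -1].argmax())      # most likely *next* token
        tokens.append(next_id)
        if next_id == eos_id:                      # stop once end-of-sequence is produced
            break
    return tokens
```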
## The Two-Attention-Step Process
In the original encoder-decoder Transformer, each Decoder block contains not one but two separate attention mechanisms that run in sequence. (Decoder-only models such as GPT keep the masked self-attention and feed-forward steps but drop cross-attention, since there is no encoder output to attend to.)
flowchart TD
    subgraph Input ["🎯 Input to Decoder Block"]
        A["Previous Layer's Output<br/>(or initial token embeddings)"]
    end

    subgraph MaskedAttn ["🔒 Step 1: Masked Self-Attention"]
        A --> B["Masked Multi-Head<br/>Self-Attention"]
        B --> C1["Add & Norm"]
        B -.-> Note1["Only sees previous tokens<br/>in the target sequence"]
    end

    subgraph CrossAttn ["🔄 Step 2: Cross-Attention"]
        C1 --> D["Multi-Head<br/>Cross-Attention"]
        D --> E1["Add & Norm"]
        F["Encoder Output<br/>(Context Memory)"] --> D
        D -.-> Note2["Attends to full input<br/>sequence for context"]
    end

    subgraph FFN ["⚡ Step 3: Final Processing"]
        E1 --> H["Feed-Forward Network"]
        H --> I["Add & Norm"]
        I --> J["Output to<br/>Next Decoder Block"]
    end

    %% Styling
    classDef inputStyle fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef maskedStyle fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef crossStyle fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef outputStyle fill:#c8e6c9,stroke:#1B5E20,stroke-width:3px
    classDef noteStyle fill:#f5f5f5,stroke:#757575,stroke-dasharray: 5 5

    class Input inputStyle
    class MaskedAttn maskedStyle
    class CrossAttn crossStyle
    class FFN outputStyle
    class Note1,Note2 noteStyle
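The three steps in the diagram map almost line-for-line onto code. Below is a minimal PyTorch sketch of a single decoder block; the class name, dimensions, dropout rate, and ReLU feed-forward layer are illustrative assumptions rather than a reproduction of any specific model's implementation.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Step 1: masked multi-head self-attention over the target sequence
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Step 2: cross-attention (queries from the decoder, keys/values from the encoder output)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Step 3: position-wise feed-forward network
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_out):
        # Causal mask: position i may only attend to positions <= i
        T = x.size(1)
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)

        # Step 1: masked self-attention, then Add & Norm (residual connection)
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + self.dropout(attn_out))

        # Step 2: cross-attention over the encoder output, then Add & Norm
        attn_out, _ = self.cross_attn(x, encoder_out, encoder_out)
        x = self.norm2(x + self.dropout(attn_out))

        # Step 3: feed-forward network, then Add & Norm
        x = self.norm3(x + self.dropout(self.ffn(x)))
        return x

# Usage: a batch of 2 target sequences (length 5) attending to an encoder output of length 7
block = DecoderBlock()
tgt = torch.randn(2, 5, 512)
memory = torch.randn(2, 7, 512)
out = block(tgt, memory)   # shape: (2, 5, 512)
```

The "Add & Norm" ordering above (normalize after each residual addition) follows the original Transformer as drawn in the diagram; many modern implementations instead apply LayerNorm before each sublayer (pre-norm), but the data flow is otherwise the same.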