# 115: The Decoder Architecture Deep Dive
## Chapter Overview
While [[114-Encoder-vs-Decoder-Models|note 114]] compares Encoders and Decoders, this note takes a deeper look at the specific mechanics of a Decoder block, the fundamental building block of generative models such as GPT.
The Decoder's unique challenge is to generate a coherent sequence of text, one token at a time, using two distinct attention steps.
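To make that token-by-token loop concrete, here is a minimal greedy-decoding sketch in Python/PyTorch. The names `decoder`, `memory`, `bos_id`, and `eos_id` are illustrative assumptions (any callable that maps target token IDs plus encoder memory to next-token logits would do), not the API of a particular library.

```python
import torch

def greedy_decode(decoder, memory, bos_id, eos_id, max_len=50):
    """Illustrative greedy decoding loop: re-run the decoder on the partial
    output each step and append its single most likely next token."""
    tokens = [bos_id]                              # start with the beginning-of-sequence token
    for _ in range(max_len):
        tgt = torch.tensor([tokens])               # shape: (1, current_length)
        logits = decoder(tgt, memory)              # shape: (1, current_length, vocab_size)
        next_id = int(logits[0, -1].argmax())      # most likely *next* token
        tokens.append(next_id)
        if next_id == eos_id:                      # stop once end-of-sequence is produced
            break
    return tokens
```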
## The Two-Attention-Step Process
In the original encoder-decoder Transformer, each Decoder block contains not one but two separate attention mechanisms that run in sequence. (Decoder-only models such as GPT keep the masked self-attention and feed-forward steps but drop cross-attention, since there is no encoder output to attend to.)
flowchart TD
    subgraph Input ["🎯 Input to Decoder Block"]
        A["Previous Layer's Output<br/>(or initial token embeddings)"]
    end

    subgraph MaskedAttn ["🔒 Step 1: Masked Self-Attention"]
        A --> B["Masked Multi-Head<br/>Self-Attention"]
        B --> C1["Add & Norm"]
        B -.-> Note1["Only sees previous tokens<br/>in the target sequence"]
    end

    subgraph CrossAttn ["🔄 Step 2: Cross-Attention"]
        C1 --> D["Multi-Head<br/>Cross-Attention"]
        D --> E1["Add & Norm"]
        F["Encoder Output<br/>(Context Memory)"] --> D
        D -.-> Note2["Attends to full input<br/>sequence for context"]
    end

    subgraph FFN ["⚡ Step 3: Final Processing"]
        E1 --> H["Feed-Forward Network"]
        H --> I["Add & Norm"]
        I --> J["Output to<br/>Next Decoder Block"]
    end

    %% Styling
    classDef inputStyle fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef maskedStyle fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef crossStyle fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef outputStyle fill:#c8e6c9,stroke:#1B5E20,stroke-width:3px
    classDef noteStyle fill:#f5f5f5,stroke:#757575,stroke-dasharray: 5 5

    class Input inputStyle
    class MaskedAttn maskedStyle
    class CrossAttn crossStyle
    class FFN outputStyle
    class Note1,Note2 noteStyle
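The three steps in the diagram map almost line-for-line onto code. Below is a minimal PyTorch sketch of a single decoder block; the class name, dimensions, dropout rate, and ReLU feed-forward layer are illustrative assumptions rather than a reproduction of any specific model's implementation.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Step 1: masked multi-head self-attention over the target sequence
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Step 2: cross-attention (queries from the decoder, keys/values from the encoder output)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Step 3: position-wise feed-forward network
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_out):
        # Causal mask: position i may only attend to positions <= i
        T = x.size(1)
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)

        # Step 1: masked self-attention, then Add & Norm (residual connection)
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + self.dropout(attn_out))

        # Step 2: cross-attention over the encoder output, then Add & Norm
        attn_out, _ = self.cross_attn(x, encoder_out, encoder_out)
        x = self.norm2(x + self.dropout(attn_out))

        # Step 3: feed-forward network, then Add & Norm
        x = self.norm3(x + self.dropout(self.ffn(x)))
        return x

# Usage: a batch of 2 target sequences (length 5) attending to an encoder output of length 7
block = DecoderBlock()
tgt = torch.randn(2, 5, 512)
memory = torch.randn(2, 7, 512)
out = block(tgt, memory)   # shape: (2, 5, 512)
```

The "Add & Norm" ordering above (normalize after each residual addition) follows the original Transformer as drawn in the diagram; many modern implementations instead apply LayerNorm before each sublayer (pre-norm), but the data flow is otherwise the same.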