
114: Encoder vs. Decoder Models

Chapter Overview

A single Transformer "block" is a powerful unit, but deep, capable models are built by stacking many of these blocks together. However, not all Transformer blocks are the same. They come in two primary flavors: Encoder blocks and Decoder blocks.

The choice of which type of block to use, or whether to use both, defines the model's fundamental architecture and its primary purpose.


The Two Architectures

The key difference between an Encoder and a Decoder lies in the type of self-attention they use.

```mermaid
flowchart TD
    subgraph one ["Encoder Block: Bi-directional Attention"]
        A[Input Text] --> B["Multi-Head Self-Attention<br/>(Full Context)"]
        B --> C["Feed-Forward Network"]
        C --> D["Output:<br/>Contextual Embeddings"]
        B -.-> N1["Each token can 'see' all other tokens"]
    end

    subgraph two ["Decoder Block: Masked Causal Attention"]
        E[Input Text] --> F["Masked Multi-Head Self-Attention<br/>(Partial Context)"]
        F --> G["Feed-Forward Network"]
        G --> H["Output:<br/>Next Token Prediction"]
        F -.-> N2["Each token can only 'see' past tokens"]
    end

    style one fill:#e3f2fd,stroke:#1976d2
    style two fill:#e8f5e8,stroke:#388e3c
```
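
The contrast in the diagram comes down to one step: the decoder masks out future positions in the attention-score matrix before the softmax. Here is a minimal NumPy sketch of that idea (the array shapes and names are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((seq_len, d_k))  # queries, one row per token
K = rng.standard_normal((seq_len, d_k))  # keys, one row per token

# Scaled dot-product attention scores (same for both architectures)
scores = Q @ K.T / np.sqrt(d_k)

# Encoder: bi-directional attention — every token attends to all tokens
encoder_weights = softmax(scores)

# Decoder: causal attention — mask positions j > i with -inf before softmax,
# so token i assigns zero weight to future tokens
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
decoder_weights = softmax(np.where(causal_mask, -np.inf, scores))

# decoder_weights is lower-triangular: all entries above the diagonal are 0
```

Both weight matrices have rows that sum to 1, but the decoder's rows are strictly lower-triangular. That single masking step is what turns the same attention mechanism into a next-token predictor: during training, position *i* can never peek at the token it is supposed to predict.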