116: Feed-Forward Networks in Transformers

Chapter Overview

The Feed-Forward Network (FFN) is the second major component inside every Transformer block, positioned immediately after the [[113-Multi-Head-Attention|Multi-Head Attention]] layer. While attention mixes information between tokens, the FFN processes and transforms the information within each token's representation.


Purpose and Structure

After the attention mechanism has produced a context-rich vector for each token, the FFN acts as a non-linear processing stage that further enriches that representation.

It is a simple, fully connected neural network applied independently, with the same weights, to every token position, which is why it is also called the position-wise feed-forward network. The short check below illustrates what this independence means in practice.
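
A minimal sketch of that claim, using a single linear layer as a stand-in for the full FFN (the layer size and tensor shapes here are illustrative assumptions): applying the layer to the whole sequence at once gives the same result as applying it to each token separately.

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 8)    # stand-in for the FFN; acts on the last dimension
seq = torch.randn(5, 8)    # 5 tokens, each with d_model = 8

batched = layer(seq)                                  # all positions at once
per_token = torch.stack([layer(t) for t in seq])      # one position at a time
print(torch.allclose(batched, per_token, atol=1e-6))  # True
```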

The Two-Layer Architecture

The FFN typically consists of two linear layers with a non-linear activation function in between: an up-projection from the model dimension d_model to a larger inner dimension d_ff (commonly 4 × d_model in the original Transformer), followed by a down-projection back to d_model. A code sketch follows the diagram below.

```mermaid
flowchart TD
    A["🎯 Input from Attention Layer<br/>(for a single token)"] --> B["📈 Linear Layer 1<br/>(Up-projection)"]
    B --> C["⚡ Activation Function<br/>(e.g., GELU, SwiGLU)"]
    C --> D["📉 Linear Layer 2<br/>(Down-projection)"]
    D --> E["✨ Final Output<br/>(for that token)"]

    subgraph FFN ["🔄 Feed-Forward Network (FFN)"]
        B
        C
        D
    end

    %% Styling
    classDef inputStyle fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef ffnStyle fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef activationStyle fill:#ffecb3,stroke:#ff8f00,stroke-width:2px
    classDef outputStyle fill:#c8e6c9,stroke:#1B5E20,stroke-width:3px
    classDef subgraphStyle fill:#f9f9f9,stroke:#666,stroke-width:2px

    class A inputStyle
    class B,D ffnStyle
    class C activationStyle
    class E outputStyle
    class FFN subgraphStyle
```
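
To make the diagram concrete, here is a minimal PyTorch sketch of the classic two-layer FFN with a GELU activation, plus the gated SwiGLU variant named in the diagram. The dimensions (d_model = 512, d_ff = 2048) and class names are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """Classic position-wise FFN: up-project, non-linearity, down-project."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)    # Linear Layer 1 (up-projection)
        self.down = nn.Linear(d_ff, d_model)  # Linear Layer 2 (down-projection)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); nn.Linear acts on the last
        # dimension, so every position is transformed independently.
        return self.down(F.gelu(self.up(x)))


class SwiGLU(nn.Module):
    """Gated SwiGLU variant: three weight matrices instead of two."""

    def __init__(self, d_model: int = 512, d_ff: int = 1365):
        super().__init__()
        # d_ff ≈ (8/3) * d_model is a common choice to keep the parameter
        # count comparable to the classic 4x FFN (an assumption here).
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate(x)) elementwise-multiplied with up(x), then down-projected.
        return self.down(F.silu(self.gate(x)) * self.up(x))


x = torch.randn(2, 10, 512)      # (batch, seq_len, d_model)
print(FeedForward()(x).shape)    # torch.Size([2, 10, 512])
print(SwiGLU()(x).shape)         # torch.Size([2, 10, 512])
```

Note that both variants leave the sequence dimension untouched: the FFN never mixes information across positions; that job belongs to attention.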