116: Feed-Forward Networks in Transformers
Chapter Overview
The Feed-Forward Network (FFN) is the second major component inside every Transformer block, positioned right after the [[113-Multi-Head-Attention|Multi-Head Attention]] layer. While attention handles the mixing of information between tokens, the FFN is responsible for processing and transforming the information within each token's representation.
Purpose and Structure
After the attention mechanism has produced a context-rich vector for each token, the FFN acts as a non-linear processing stage that further transforms that representation.
It is a simple, fully connected neural network applied independently to each token position, using the same weights at every position.
The Two-Layer Architecture
The FFN typically consists of two linear layers with a non-linear activation function in between: the first expands the token's vector to a larger hidden dimension (commonly 4x the model dimension), and the second projects it back down to the model dimension.
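In symbols, a standard formulation of this (the activation $\phi$ and hidden width $d_{ff}$ here are generic stand-ins matching the diagram below):

$$\mathrm{FFN}(x) = W_2\,\phi(W_1 x + b_1) + b_2, \qquad W_1 \in \mathbb{R}^{d_{ff} \times d_{model}},\; W_2 \in \mathbb{R}^{d_{model} \times d_{ff}}$$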
```mermaid
flowchart TD
A["🎯 Input from Attention Layer<br/>(for a single token)"] --> B["📈 Linear Layer 1<br/>(Up-projection)"]
B --> C["⚡ Activation Function<br/>(e.g., GELU, SwiGLU)"]
C --> D["📉 Linear Layer 2<br/>(Down-projection)"]
D --> E["✨ Final Output<br/>(for that token)"]
subgraph FFN ["🔄 Feed-Forward Network (FFN)"]
B
C
D
end
%% Styling
classDef inputStyle fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef ffnStyle fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef activationStyle fill:#ffecb3,stroke:#ff8f00,stroke-width:2px
classDef outputStyle fill:#c8e6c9,stroke:#1B5E20,stroke-width:3px
classDef subgraphStyle fill:#f9f9f9,stroke:#666,stroke-width:2px
class A inputStyle
class B,D ffnStyle
class C activationStyle
class E outputStyle
class FFN subgraphStyle
```
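The whole pipeline fits in a few lines of code. Below is a minimal PyTorch sketch of this position-wise FFN, assuming the conventional 4x hidden expansion and GELU as the activation; the class name `TransformerFFN` and the dimensions are illustrative, not from a specific library:

```python
import torch
import torch.nn as nn

class TransformerFFN(nn.Module):
    """Position-wise feed-forward network: up-project, activate, down-project.

    Applied independently to each token position; the same weights are
    shared across all positions.
    """
    def __init__(self, d_model: int, d_ff: int | None = None):
        super().__init__()
        d_ff = d_ff or 4 * d_model            # conventional 4x expansion (assumption)
        self.up = nn.Linear(d_model, d_ff)    # Linear Layer 1 (up-projection)
        self.act = nn.GELU()                  # activation (GELU chosen here)
        self.down = nn.Linear(d_ff, d_model)  # Linear Layer 2 (down-projection)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). nn.Linear acts only on the last
        # dimension, so each token position is transformed independently.
        return self.down(self.act(self.up(x)))

# Usage: apply one block's FFN to a batch of attention outputs.
ffn = TransformerFFN(d_model=512)
tokens = torch.randn(2, 10, 512)  # (batch=2, seq_len=10, d_model=512)
out = ffn(tokens)
print(out.shape)                  # torch.Size([2, 10, 512])
```

Note that the output shape matches the input shape: the FFN temporarily widens each token's vector to extract richer features, then compresses it back so the block can be stacked with the next attention layer.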