413: Adapter Tuning¶
Chapter Overview
Adapter Tuning is one of the original and most intuitive PEFT methods. It involves injecting small, trainable neural network modules, known as "adapters," into a frozen pre-trained model.
While LoRA is often preferred today due to its efficiency, understanding adapters provides crucial context for the evolution of PEFT, and adapters themselves remain highly effective for many applications.
The Core Idea: Injecting New Layers¶
The main principle of Adapter Tuning is to leave the original Foundation Model completely untouched (frozen) and insert new, lightweight layers at strategic points. Only these new adapter layers are trained.
In a Transformer, adapters are typically inserted after the Multi-Head Attention and Feed-Forward Network sub-layers in each block.
```mermaid
graph TD
subgraph "Standard Transformer Block"
A[Input] --> B[Multi-Head Attention]
B --> C[Layer Norm & Residual]
C --> D[Feed-Forward Network]
D --> E[Layer Norm & Residual]
E --> F[Output]
end
subgraph "Adapter-Enhanced Transformer Block"
A2[Input] --> B2[Multi-Head Attention<br/>FROZEN]
B2 --> Adapter1[Adapter Module<br/>TRAINABLE]
Adapter1 --> C2[Layer Norm & Residual]
C2 --> D2[Feed-Forward Network<br/>FROZEN]
D2 --> Adapter2[Adapter Module<br/>TRAINABLE]
Adapter2 --> E2[Layer Norm & Residual]
E2 --> F2[Output]
end
style B2 fill:#e3f2fd,stroke:#1976d2
style D2 fill:#e3f2fd,stroke:#1976d2
style Adapter1 fill:#c8e6c9,stroke:#1B5E20
style Adapter2 fill:#c8e6c9,stroke:#1B5E20
```
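In code, "frozen vs. trainable" is just a matter of which parameters keep gradients. A minimal sketch (assumes PyTorch and the `Adapter` module implemented later in this chapter; `model` is a hypothetical backbone with adapters already injected):

```python
# Freeze every pretrained parameter, then re-enable gradients only for
# the injected adapter modules.
import torch.nn as nn

def freeze_base_train_adapters(model: nn.Module) -> None:
    # Freeze the entire pretrained model
    for param in model.parameters():
        param.requires_grad = False
    # Unfreeze only the adapter modules
    for module in model.modules():
        if isinstance(module, Adapter):
            for param in module.parameters():
                param.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```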
Adapter Architecture¶
Classic Adapter Design¶
An adapter module typically consists of:
- Down-projection layer: Reduces dimensionality
- Non-linear activation: Introduces non-linearity, giving the adapter representational capacity beyond a linear map
- Up-projection layer: Restores original dimensionality
- Residual connection: Ensures training stability
```mermaid
graph LR
A[Input<br/>d dimensions] --> B[Down-projection<br/>d → m]
B --> C[ReLU/GeLU<br/>Activation]
C --> D[Up-projection<br/>m → d]
D --> E[+]
A --> E
E --> F[Output<br/>d dimensions]
style B fill:#e8f5e8,stroke:#388e3c
style C fill:#fff3e0,stroke:#f57c00
style D fill:#e8f5e8,stroke:#388e3c
style E fill:#e3f2fd,stroke:#1976d2
```
Mathematical Formulation¶
For input \(x\) with dimension \(d\), the adapter computes:

\[
\text{Adapter}(x) = x + W_{\text{up}}\,\sigma(W_{\text{down}}\,x)
\]

Where:

- \(W_{\text{down}} \in \mathbb{R}^{m \times d}\) (down-projection)
- \(W_{\text{up}} \in \mathbb{R}^{d \times m}\) (up-projection)
- \(\sigma\) is the activation function (ReLU, GeLU, etc.)
- \(m \ll d\) is the bottleneck dimension
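Counting the weights (and biases) defined by these shapes gives the per-adapter parameter cost used in the next section; the bias terms are negligible, so the tables below use the \(2dm\) approximation:

\[
|\theta_{\text{adapter}}| = \underbrace{md + m}_{W_{\text{down}},\, b_{\text{down}}} + \underbrace{dm + d}_{W_{\text{up}},\, b_{\text{up}}} \;\approx\; 2dm \quad \text{for } m \ll d
\]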
Parameter Efficiency Analysis¶
Bottleneck Dimension Impact¶
The bottleneck dimension \(m\) controls the trade-off between efficiency and expressiveness:
| Bottleneck Size | Parameters per Adapter | Efficiency | Expressiveness |
|---|---|---|---|
| \(m = d/16\) | \(2 \cdot d \cdot (d/16) = d^2/8\) | Very High | Low |
| \(m = d/8\) | \(2 \cdot d \cdot (d/8) = d^2/4\) | High | Medium |
| \(m = d/4\) | \(2 \cdot d \cdot (d/4) = d^2/2\) | Medium | High |
Comparison with LoRA¶
For a transformer layer with hidden dimension \(d = 4096\):
| Method | Trainable Parameters | Parameter Reduction |
|---|---|---|
| Full Fine-tuning | ~67M | 0% |
| Adapter (m=512) | ~4.2M | ~94% |
| LoRA (r=16) | ~131k | ~99.8% |
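A few lines of Python reproduce these counts (weight matrices only, biases ignored; the ~67M full fine-tuning figure matches the four \(d \times d\) attention projections, \(4d^2\)):

```python
# Reproduce the parameter counts in the table above.
d = 4096

def adapter_params(d: int, m: int) -> int:
    # down-projection (d x m) + up-projection (m x d)
    return 2 * d * m

def lora_params(d: int, r: int) -> int:
    # low-rank factors A (r x d) and B (d x r) for one d x d weight matrix
    return 2 * d * r

print(f"Attention projections (4*d^2): {4 * d * d:,}")              # 67,108,864 (~67M)
print(f"Adapter, m=512:                {adapter_params(d, 512):,}")  # 4,194,304  (~4.2M)
print(f"LoRA, r=16:                    {lora_params(d, 16):,}")      # 131,072    (~131k)
```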
Adapter Variants¶
1. Houlsby Adapters (2019)¶
- Placement: After attention and feed-forward layers
- Architecture: Down-projection → ReLU → Up-projection
- Residual: Around the entire adapter
2. Pfeiffer Adapters (2020)¶
- Placement: Only after feed-forward layers
- Architecture: Similar to Houlsby but fewer insertion points
- Advantage: Fewer parameters, similar performance
3. Parallel Adapters¶
- Placement: In parallel with existing layers
- Architecture: Run alongside attention/feed-forward
- Advantage: No additional sequential depth
```mermaid
graph TD
subgraph "Houlsby Adapters"
A1[Attention] --> B1[Adapter]
B1 --> C1[FFN]
C1 --> D1[Adapter]
end
subgraph "Pfeiffer Adapters"
A2[Attention] --> C2[FFN]
C2 --> D2[Adapter]
end
subgraph "Parallel Adapters"
Input --> A3[Attention]
Input --> E3[Adapter]
A3 --> Sum[+]
E3 --> Sum
Sum --> Output
end
style B1 fill:#c8e6c9,stroke:#1B5E20
style D1 fill:#c8e6c9,stroke:#1B5E20
style D2 fill:#c8e6c9,stroke:#1B5E20
style E3 fill:#c8e6c9,stroke:#1B5E20
```
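To make the placement differences concrete, here is a minimal PyTorch-style sketch. The signatures are simplified assumptions: `attn` and `ffn` stand for the frozen sub-layers, `norm1`/`norm2` for the layer norms, and `adapter_*` for trainable `Adapter` modules; attention masks and multi-input attention are omitted.

```python
def houlsby_block(x, attn, norm1, ffn, norm2, adapter_attn, adapter_ffn):
    # Adapters after both the attention and the feed-forward sub-layers
    x = norm1(x + adapter_attn(attn(x)))
    h = adapter_ffn(ffn(x))
    return norm2(x + h)

def pfeiffer_block(x, attn, norm1, ffn, norm2, adapter_ffn):
    # Attention sub-layer left completely untouched;
    # a single adapter only after the feed-forward sub-layer
    x = norm1(x + attn(x))
    h = adapter_ffn(ffn(x))
    return norm2(x + h)

def parallel_adapter_block(x, attn, norm1, adapter_attn):
    # Adapter runs alongside attention; outputs are summed, so no extra depth
    return norm1(x + attn(x) + adapter_attn(x))
```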
Training Dynamics¶
Initialization Strategy¶
Proper Initialization
- Down-projection: Random initialization (Xavier/Kaiming)
- Up-projection: Zero initialization (critical!)
- Bias: Zero initialization
Zero initialization of up-projection ensures adapters start as identity functions.
Learning Rate Considerations¶
- Adapter layers: Higher learning rates than full fine-tuning (typically 1e-4 to 1e-3)
- Frozen layers: Learning rate = 0
- Layer norm parameters: Sometimes unfrozen with small learning rates
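A sketch of how these learning-rate choices translate into an optimizer configuration (assumes PyTorch, a hypothetical `model` with injected `Adapter` modules, and the freezing step shown earlier):

```python
import torch
import torch.nn as nn

adapter_params, layernorm_params = [], []
for module in model.modules():
    if isinstance(module, Adapter):
        adapter_params += list(module.parameters())
    elif isinstance(module, nn.LayerNorm):
        # Optionally unfreeze layer norms with a small learning rate
        for p in module.parameters():
            p.requires_grad = True
        layernorm_params += list(module.parameters())

optimizer = torch.optim.AdamW([
    {"params": adapter_params, "lr": 1e-4},    # adapters: higher learning rate
    {"params": layernorm_params, "lr": 1e-5},  # layer norms: small learning rate
])
```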
Advantages of Adapter Tuning¶
1. Intuitive Design¶
- Easy to understand and implement
- Clear separation between frozen and trainable components
2. Flexible Placement¶
- Can be inserted at various points in the architecture
- Allows for task-specific architectural choices
3. Stable Training¶
- Residual connections provide training stability
- Less sensitive to hyperparameter choices
4. Modular Composition¶
- Easy to combine multiple adapters
- Supports hierarchical and compositional learning
Limitations Compared to LoRA¶
1. Higher Parameter Count¶
- Typically requires more parameters than LoRA for similar performance
- Less memory efficient during training
2. Inference Overhead¶
- Adds computational layers during inference
- Cannot be merged with base model weights
3. Sequential Bottleneck¶
- Introduces additional sequential computation
- May impact inference speed
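To see why the merging limitation is fundamental: a LoRA update is linear in the input and can be folded into the frozen weight once before deployment, whereas the adapter's non-linearity sits between the two projections and cannot be absorbed into \(W\):

\[
W' = W + \tfrac{\alpha}{r} B A \qquad \text{(LoRA: merge once, no extra layers at inference)}
\]

\[
h = x + W_{\text{up}}\,\sigma(W_{\text{down}}\,x) \qquad \text{(adapter: } \sigma \text{ blocks folding, extra compute at inference)}
\]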
Modern Applications¶
1. Multi-Task Learning¶
- Different adapters for different tasks
- Share base model, swap adapters as needed
2. Continual Learning¶
- Add new adapters for new tasks
- Avoid catastrophic forgetting
3. Cross-Lingual Transfer¶
- Language-specific adapters
- Efficient multilingual model deployment
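All three applications follow the same pattern: a frozen backbone with a small, swappable adapter per task or language. A minimal sketch (the `MultiTaskAdapterLayer` name is hypothetical; it reuses the `Adapter` module from the Implementation Example below):

```python
import torch.nn as nn

class MultiTaskAdapterLayer(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck_dim: int, task_names: list[str]):
        super().__init__()
        # One small adapter per task/language; the backbone is shared and frozen
        self.adapters = nn.ModuleDict({
            name: Adapter(hidden_dim, bottleneck_dim) for name in task_names
        })

    def forward(self, x, task: str):
        # Route the hidden states through the adapter for the active task
        return self.adapters[task](x)

layer = MultiTaskAdapterLayer(768, 96, ["sentiment", "ner", "translation_de"])
```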
Best Practices¶
1. Bottleneck Dimension Selection¶
Choosing the Right Bottleneck Size
- Start with m = d/8 for most tasks
- Use m = d/16 for simple tasks or memory constraints
- Use m = d/4 for complex tasks requiring high capacity
- Monitor performance vs. efficiency trade-offs
2. Placement Strategy¶
Where to Insert Adapters
- Pfeiffer placement: After feed-forward layers (recommended for efficiency)
- Houlsby placement: After both attention and feed-forward (for maximum capacity)
- Parallel placement: When added inference latency is a concern (no extra sequential depth)
3. Training Configuration¶
Optimal Training Setup
- Learning rate: 1e-4 to 1e-3 for adapter parameters
- Batch size: Can be larger than full fine-tuning due to memory efficiency
- Warmup: Use learning rate warmup for stability
- Regularization: Dropout in adapter layers (0.1-0.2)
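A corresponding configuration sketch (assumes PyTorch and the `adapter_params` list collected earlier; the exact values are illustrative):

```python
import torch

optimizer = torch.optim.AdamW(adapter_params, lr=1e-4, weight_decay=0.01)

# Linear learning-rate warmup over the first 500 steps
warmup_steps = 500
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)
```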
Adapter Fusion and Composition¶
Multi-Adapter Architectures¶
Advanced adapter techniques allow combining multiple adapters:
```mermaid
graph TD
subgraph "Adapter Fusion"
A[Input] --> B[Task A Adapter]
A --> C[Task B Adapter]
A --> D[Task C Adapter]
B --> E[Fusion Layer]
C --> E
D --> E
E --> F[Output]
end
style B fill:#ffcdd2,stroke:#B71C1C
style C fill:#e8f5e8,stroke:#388e3c
style D fill:#e3f2fd,stroke:#1976d2
style E fill:#fff3e0,stroke:#f57c00
```
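A simplified sketch of such a fusion layer (assumes PyTorch; the published AdapterFusion additionally learns a value projection and uses a specific initialization, which is omitted here):

```python
import torch
import torch.nn as nn

class AdapterFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # query from the layer's hidden states
        self.key = nn.Linear(dim, dim)    # keys from each adapter's output

    def forward(self, x, adapter_outputs):
        # adapter_outputs: list of [batch, seq, dim] tensors, one per task adapter
        stacked = torch.stack(adapter_outputs, dim=2)        # [B, S, N, D]
        q = self.query(x).unsqueeze(2)                       # [B, S, 1, D]
        k = self.key(stacked)                                # [B, S, N, D]
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5         # [B, S, N]
        weights = scores.softmax(dim=-1).unsqueeze(-1)       # [B, S, N, 1]
        # Convex combination of the adapter outputs, per token
        return (weights * stacked).sum(dim=2)                # [B, S, D]
```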
AdapterFusion Benefits¶
- Knowledge composition: Combine knowledge from multiple tasks
- Few-shot learning: Leverage existing adapters for new tasks
- Efficient transfer: Share computation across related tasks
Comparison with Other PEFT Methods¶
Performance vs. Efficiency Trade-offs¶
```mermaid
graph LR
subgraph "PEFT Methods Comparison"
A[Full Fine-tuning<br/>100% Performance<br/>0% Efficiency]
B[Adapter Tuning<br/>95-98% Performance<br/>90-95% Efficiency]
C[LoRA<br/>97-99% Performance<br/>99%+ Efficiency]
D[Prefix Tuning<br/>90-95% Performance<br/>99%+ Efficiency]
end
style A fill:#ffcdd2,stroke:#B71C1C
style B fill:#fff3e0,stroke:#f57c00
style C fill:#c8e6c9,stroke:#1B5E20
style D fill:#e3f2fd,stroke:#1976d2
```
When to Choose Adapters¶
Adapter vs. LoRA Decision Matrix
Choose Adapters when:

- You need maximum interpretability
- Working with multi-task scenarios
- Inference speed is not critical
- You want to compose multiple capabilities

Choose LoRA when:

- Memory efficiency is paramount
- You want the fastest training
- Inference speed matters
- Single-task fine-tuning
Interactive Exercise¶
Adapter Design Challenge
Design an adapter configuration for a 12-layer transformer with hidden dimension 768:
Given:

- Model: 12 layers × 768 hidden dimensions
- Task: Sentiment analysis (relatively simple)
- Constraint: <1% of original parameters

Your Task:

1. Choose bottleneck dimension
2. Decide on placement strategy
3. Calculate total parameters
4. Justify your choices

Sample Solution:

- Bottleneck: m = 48 (768/16), appropriate for a simple task
- Placement: Pfeiffer (after FFN only)
- Parameters per adapter: 2 × 768 × 48 = 73,728
- Total: 12 × 73,728 = 884,736 parameters
- Efficiency: roughly 0.8% of a ~110M-parameter base model (≈99% parameter reduction), satisfying the <1% constraint
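A quick check of the arithmetic (the ~110M total is the usual parameter count for a 12-layer, 768-dimensional encoder such as BERT-base):

```python
layers, d, m = 12, 768, 48
per_adapter = 2 * d * m        # 73,728 weights per adapter
total = layers * per_adapter   # 884,736 trainable parameters
print(f"{total:,} adapter parameters = {100 * total / 110_000_000:.2f}% of ~110M")
```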
Real-World Case Studies¶
1. Google's Universal Sentence Encoder¶
- Uses adapters for multi-language support
- Single base model with language-specific adapters
- Achieves 95%+ of monolingual performance
2. Facebook's Multilingual BERT¶
- Adapters for cross-lingual transfer
- Efficient deployment across 100+ languages
- Maintains base model while adding language capabilities
3. OpenAI's GPT-3 Fine-tuning¶
- Early experiments with adapter-like mechanisms
- Balances customization with base model preservation
- Inspired modern PEFT approaches
Common Pitfalls and Solutions¶
1. Poor Initialization¶
Initialization Mistakes
- Problem: Random initialization of the up-projection layer
- Solution: Always initialize up-projection weights (and bias) to zero
- Reason: Ensures adapters start as identity functions
2. Wrong Bottleneck Size¶
Capacity Issues
- Problem: Bottleneck too small for complex tasks
- Solution: Start with m = d/8 and adjust based on performance
- Monitoring: Track whether validation loss plateaus early
3. Overfitting with Small Datasets¶
Overfitting Risk
- Problem: Adapters overfit on small datasets
- Solution: Increase dropout, reduce bottleneck size, add regularization
Implementation Example¶
```python
# Simple adapter implementation
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, input_dim, bottleneck_dim, dropout=0.1):
        super().__init__()
        self.down_proj = nn.Linear(input_dim, bottleneck_dim)
        self.up_proj = nn.Linear(bottleneck_dim, input_dim)
        self.dropout = nn.Dropout(dropout)
        self.activation = nn.ReLU()

        # Critical: initialize the up-projection to zero so the adapter
        # starts as an identity function
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, x):
        # Bottleneck: down-project, dropout, activate, up-project
        adapter_output = self.up_proj(
            self.activation(
                self.dropout(
                    self.down_proj(x)
                )
            )
        )
        # Residual connection around the adapter
        return x + adapter_output
```
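A quick usage check confirming the identity-at-initialization property (the shapes are illustrative):

```python
import torch

adapter = Adapter(input_dim=768, bottleneck_dim=96)
x = torch.randn(2, 16, 768)      # (batch, sequence, hidden)
out = adapter(x)
print(out.shape)                 # torch.Size([2, 16, 768])
print(torch.allclose(out, x))    # True: zero-initialized up-projection
```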
Future Directions¶
1. Conditional Adapters¶
- Adapters that activate based on input characteristics
- Dynamic routing between different adapter modules
2. Hierarchical Adapters¶
- Multi-level adapter architectures
- Coarse-to-fine task adaptation
3. Neural Architecture Search for Adapters¶
- Automated adapter design optimization
- Task-specific architecture discovery
Summary¶
Adapter Tuning represents a foundational approach to parameter-efficient fine-tuning that:
- Provides intuitive parameter efficiency through bottleneck architectures
- Offers flexible deployment options with various placement strategies
- Enables modular composition for multi-task scenarios
- Maintains strong performance while dramatically reducing trainable parameters
While LoRA has become more popular due to its superior efficiency, adapters remain valuable for scenarios requiring interpretability, modularity, and compositional learning.
Next Steps¶
- Compare and contrast: Implement both LoRA and Adapter tuning on the same task
- Experiment: Try different bottleneck dimensions and placement strategies
- Explore: Look into AdapterFusion for multi-task learning scenarios
- Practice: Apply adapter tuning to your own fine-tuning projects