503: Model Routing & Gateways

Chapter Overview

As your AI application matures, you'll discover that a single model isn't optimal for all tasks. Some queries need fast, inexpensive responses, while others require complex reasoning from powerful, costly models.

Model Routing dynamically directs each user query to the most appropriate model or pipeline. A Model Gateway is the architectural component that hosts this routing logic and also handles load balancing and failover.


The Model Router

A Model Router is typically a lightweight classification system that sits at the front of your architecture. It analyzes user intent and determines the optimal processing path.

flowchart TD
    A[User Query] --> B{Model Router<br/>Intent Classifier}

    subgraph "Processing Pipelines"
        C[Simple Q&A Pipeline<br/>Fast & Cheap Model]
        D[Complex Reasoning Pipeline<br/>Powerful & Slow Model]
        E[RAG Pipeline<br/>Knowledge-Based Questions]
    end

    B -->|Simple Question| C
    B -->|Multi-step Reasoning| D
    B -->|Knowledge Query| E

    C --> F((Final Response))
    D --> F
    E --> F

    style B fill:#fff3e0,stroke:#f57c00
    style C fill:#e3f2fd,stroke:#1976d2
    style D fill:#fce4ec,stroke:#c2185b
    style E fill:#e8f5e8,stroke:#388e3c

Routing Strategies

Intent-Based Routing

Route queries based on user intent or task type.

Common Intent Categories:

  • Factual Questions: "What is the capital of France?"
  • Creative Tasks: "Write a poem about summer"
  • Analysis Tasks: "Summarize this document"
  • Conversational: "How are you today?"

Complexity-Based Routing

Analyze query complexity and route accordingly.

Complexity Indicators:

  • Query length and structure
  • Number of entities mentioned
  • Presence of logical operators
  • Multi-step reasoning requirements
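
The indicators above can be combined into a rough score. The following is a minimal sketch, not a tuned scorer: the cue words, the equal weighting, the 0.3 threshold, and the route names `fast_model` and `powerful_model` are all assumptions chosen for illustration.

```python
import re

# Hypothetical cue lists; tune these against your own traffic.
LOGICAL_OPERATORS = {"and", "or", "not", "if", "then", "unless", "because"}
MULTI_STEP_CUES = ("step by step", "explain why", "compare", "pros and cons")


def complexity_score(query: str) -> float:
    """Return a rough 0..1 complexity score from surface features."""
    words = re.findall(r"[A-Za-z']+", query)
    lowered = [w.lower() for w in words]
    text = query.lower()

    length_score = min(len(words) / 40, 1.0)  # query length
    # Crude entity proxy: capitalized words that do not start the query.
    entities = sum(1 for w in words[1:] if w[0].isupper())
    entity_score = min(entities / 5, 1.0)
    operator_hits = sum(1 for w in lowered if w in LOGICAL_OPERATORS)
    operator_score = min(operator_hits / 3, 1.0)
    multi_step_score = 1.0 if any(cue in text for cue in MULTI_STEP_CUES) else 0.0

    # Equal weights are an arbitrary starting point, not a recommendation.
    return (length_score + entity_score + operator_score + multi_step_score) / 4


def route_by_complexity(query: str, threshold: float = 0.3) -> str:
    """Send queries at or above the threshold to the powerful model."""
    return "powerful_model" if complexity_score(query) >= threshold else "fast_model"
```

Surface features like these are cheap to compute but easy to fool, so treat the score as a first filter and check it against real routing outcomes.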

Cost-Optimization Routing

Balance performance requirements with cost constraints.

Cost Factors:

  • Model pricing per token
  • Expected response length
  • Required response time
  • Accuracy requirements
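
One way to apply these factors is to filter models by latency and quality constraints, then pick the cheapest one that remains. The sketch below assumes a hypothetical `ModelProfile` catalog; the prices, latencies, and quality scores are illustrative numbers, not real vendor figures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelProfile:
    name: str
    usd_per_1k_input: float   # price per 1,000 input tokens
    usd_per_1k_output: float  # price per 1,000 output tokens
    p95_latency_s: float      # typical worst-case response time
    quality: float            # offline evaluation score in 0..1


# Illustrative numbers only, not real vendor prices.
CATALOG = [
    ModelProfile("small", 0.0005, 0.0015, 0.8, 0.70),
    ModelProfile("medium", 0.003, 0.006, 2.0, 0.85),
    ModelProfile("large", 0.01, 0.03, 6.0, 0.95),
]


def estimated_cost(model: ModelProfile, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from token counts."""
    return (input_tokens / 1000) * model.usd_per_1k_input + (
        output_tokens / 1000
    ) * model.usd_per_1k_output


def cheapest_model(
    input_tokens: int,
    expected_output_tokens: int,
    max_latency_s: float,
    min_quality: float,
) -> Optional[ModelProfile]:
    """Return the cheapest model meeting both constraints, or None."""
    eligible = [
        m for m in CATALOG
        if m.p95_latency_s <= max_latency_s and m.quality >= min_quality
    ]
    if not eligible:
        return None
    return min(
        eligible,
        key=lambda m: estimated_cost(m, input_tokens, expected_output_tokens),
    )
```

Returning `None` when nothing qualifies forces the caller to decide explicitly whether to relax the latency constraint or the quality constraint.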


Implementation Approaches

Rule-Based Routing

Simple, fast, and predictable routing using predefined rules.

flowchart TD
    A[User Query] --> B{Contains Keywords?}
    B -->|calculate, solve| C[Math Model]
    B -->|translate| D[Translation Model]
    B -->|summarize| E[Summarization Model]
    B -->|Default| F[General Model]

    style C fill:#e3f2fd,stroke:#1976d2
    style D fill:#e8f5e8,stroke:#388e3c
    style E fill:#fff3e0,stroke:#f57c00
    style F fill:#fce4ec,stroke:#c2185b

Advantages:

  • Fast execution
  • Predictable behavior
  • Easy to debug and modify
  • Low computational overhead

Limitations:

  • Limited flexibility
  • Difficult to handle edge cases
  • Requires manual rule maintenance
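
The keyword diagram above maps almost directly to code. This is a minimal sketch using the same keywords; the route names such as `math_model` are placeholders, and rule order decides which route wins when several keywords appear.

```python
import re

# Ordered (pattern, route) rules mirroring the keyword diagram.
RULES = [
    (re.compile(r"\b(calculate|solve)\b", re.IGNORECASE), "math_model"),
    (re.compile(r"\btranslate\b", re.IGNORECASE), "translation_model"),
    (re.compile(r"\bsummarize\b", re.IGNORECASE), "summarization_model"),
]
DEFAULT_ROUTE = "general_model"


def route_by_rules(query: str) -> str:
    """Return the route of the first matching rule, else the default."""
    for pattern, route in RULES:
        if pattern.search(query):
            return route
    return DEFAULT_ROUTE
```

The word-boundary anchors (`\b`) keep a rule from firing inside a longer word, which also illustrates the maintenance cost: "summarise" or "summary" would fall through to the default until someone adds them.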

Model-Based Routing

Use a lightweight classifier to determine the optimal routing path.

Classification Features:

  • Query text embeddings
  • Query length and structure
  • Named entity recognition
  • Syntactic patterns

Model Options:

  • Fine-tuned BERT classifier
  • Lightweight neural networks
  • Ensemble methods
  • Traditional ML classifiers
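
To show the idea without external dependencies, here is a sketch of a "traditional ML" option: a tiny multinomial Naive Bayes classifier over word counts, written with the standard library only. The training examples and route labels are invented for illustration. In practice you would more likely train an embedding-based or fine-tuned classifier on logged queries.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesRouter:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # route -> word frequencies
        self.doc_counts = Counter()              # route -> number of examples
        self.vocab = set()

    def fit(self, examples):
        for text, route in examples:
            tokens = tokenize(text)
            self.word_counts[route].update(tokens)
            self.doc_counts[route] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_route, best_score = None, -math.inf
        for route, n_docs in self.doc_counts.items():
            counts = self.word_counts[route]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_route, best_score = route, score
        return best_route


# Invented training data; real routers learn from labeled production queries.
TRAINING = [
    ("what is the capital of france", "simple_qa"),
    ("who wrote hamlet", "simple_qa"),
    ("when was the telephone invented", "simple_qa"),
    ("prove that the sum of two even numbers is even", "reasoning"),
    ("plan a migration and compare the trade-offs step by step", "reasoning"),
    ("derive the formula and explain each step", "reasoning"),
    ("what does our refund policy say", "rag"),
    ("find the section of the handbook about vacation", "rag"),
    ("according to the internal docs how do i reset a password", "rag"),
]

router = NaiveBayesRouter().fit(TRAINING)
```

Calling `router.predict("what is the capital of spain")` picks the `simple_qa` route with this toy data. Nine examples are far too few for real use; the point is only that the router is itself a small model with its own training data and error rate.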

Hybrid Routing

Combine rule-based and model-based approaches for optimal performance.

flowchart TD
    A[User Query] --> B{Rule-Based<br/>Pre-Filter}
    B -->|Clear Match| C[Direct Route]
    B -->|Ambiguous| D[Model Classifier]
    D --> E[Model Route]

    C --> F[Execute Pipeline]
    E --> F

    style B fill:#fff3e0,stroke:#f57c00
    style D fill:#e8f5e8,stroke:#388e3c
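
The hybrid flow in the diagram can be sketched as a rule pre-filter that only answers when exactly one rule matches, deferring everything else to a classifier. The rule set, route names, and the `classifier` callable are assumptions; any function that maps a query string to a route name would fit.

```python
import re
from typing import Callable, Optional, Tuple

RULES = [
    (re.compile(r"\b(calculate|solve)\b", re.IGNORECASE), "math_model"),
    (re.compile(r"\btranslate\b", re.IGNORECASE), "translation_model"),
    (re.compile(r"\bsummarize\b", re.IGNORECASE), "summarization_model"),
]


def rule_prefilter(query: str) -> Optional[str]:
    """Return a route only on a clear match (exactly one rule fires)."""
    matches = {route for pattern, route in RULES if pattern.search(query)}
    return matches.pop() if len(matches) == 1 else None


def hybrid_route(query: str, classifier: Callable[[str], str]) -> Tuple[str, str]:
    """Return (route, decided_by) so the decision source can be logged."""
    route = rule_prefilter(query)
    if route is not None:
        return route, "rule"
    # Ambiguous or unmatched queries go to the (slower) model classifier.
    return classifier(query), "classifier"
```

Returning the decision source alongside the route makes it easy to measure how often the cheap rule path handles traffic versus the classifier.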

Gateway Architecture

A Model Gateway is the central component managing routing, load balancing, and failover.

Core Gateway Components

graph TD
    A[Client Request] --> B[Gateway Load Balancer]
    B --> C[Request Router]
    C --> D[Model Pool Manager]

    subgraph "Model Endpoints"
        E[Model A Instance 1]
        F[Model A Instance 2]
        G[Model B Instance 1]
        H[Model C Instance 1]
    end

    D --> E
    D --> F
    D --> G
    D --> H

    E --> I[Response Aggregator]
    F --> I
    G --> I
    H --> I

    I --> J[Client Response]

    style C fill:#fff3e0,stroke:#f57c00
    style D fill:#e8f5e8,stroke:#388e3c
    style I fill:#e3f2fd,stroke:#1976d2

Gateway Responsibilities

Request Management:

  • Route incoming requests
  • Load balance across instances
  • Handle request queuing
  • Implement rate limiting

Model Management:

  • Monitor model health
  • Handle model scaling
  • Manage model versions
  • Implement failover logic

Response Management:

  • Aggregate responses
  • Apply post-processing
  • Handle error responses
  • Implement caching
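
Two of these responsibilities, load balancing and failover, fit in a short sketch. The classes `ModelPool`, `Gateway`, and `PoolUnavailable` are hypothetical names, and model instances are represented as plain Python callables rather than real network endpoints.

```python
import itertools
from typing import Callable, Dict, List

ModelFn = Callable[[str], str]


class PoolUnavailable(Exception):
    """Raised when no instance in a pool can serve the request."""


class ModelPool:
    """Round-robin load balancing over the instances of one model."""

    def __init__(self, instances: Dict[str, ModelFn]):
        self.instances = instances
        self.healthy = set(instances)
        self._cycle = itertools.cycle(list(instances))

    def call(self, prompt: str) -> str:
        # Visit each instance at most once per request.
        for _ in range(len(self.instances)):
            name = next(self._cycle)
            if name not in self.healthy:
                continue
            try:
                return self.instances[name](prompt)
            except Exception:
                # Mark the instance unhealthy; a real gateway would re-probe it later.
                self.healthy.discard(name)
        raise PoolUnavailable("no healthy instance available")


class Gateway:
    """Sends a request to the routed pool, then to fallbacks in order."""

    def __init__(self, pools: Dict[str, ModelPool], fallback_order: List[str]):
        self.pools = pools
        self.fallback_order = fallback_order

    def handle(self, route: str, prompt: str) -> str:
        candidates = [route] + [p for p in self.fallback_order if p != route]
        for pool_name in candidates:
            try:
                return self.pools[pool_name].call(prompt)
            except PoolUnavailable:
                continue
        raise PoolUnavailable("all model pools failed")
```

This sketch never restores an instance to the healthy set, and it omits queuing, rate limiting, and caching; each of those would be a separate layer around `handle`.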


Advanced Routing Patterns

Cascade Routing

Try a fast, inexpensive model first, and escalate to a more powerful model only when the confidence in the first response falls below a threshold.

flowchart TD
    A[User Query] --> B[Fast Model]
    B --> C{Confidence > Threshold?}
    C -->|Yes| D[Return Response]
    C -->|No| E[Powerful Model]
    E --> F[Return Response]

    style B fill:#e3f2fd,stroke:#1976d2
    style E fill:#fce4ec,stroke:#c2185b
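
The cascade in the diagram reduces to a single comparison. This sketch assumes each model is a callable returning an `(answer, confidence)` pair; how the confidence is produced (for example from token log-probabilities or a separate verifier) is left open, and the 0.8 threshold is arbitrary.

```python
from typing import Callable, Tuple

# A model returns (answer, confidence in 0..1). This interface is an assumption.
ModelFn = Callable[[str], Tuple[str, float]]


def cascade(
    query: str,
    fast_model: ModelFn,
    powerful_model: ModelFn,
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Return (answer, tier_used), escalating only on low confidence."""
    answer, confidence = fast_model(query)
    if confidence > threshold:
        return answer, "fast"
    # The fast attempt is discarded, so an escalated query pays for both calls.
    answer, _ = powerful_model(query)
    return answer, "powerful"
```

Because escalated queries incur both latencies and both costs, a cascade tends to pay off only when the fast model clears the threshold for a large share of traffic.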

Ensemble Routing

Route to multiple models and combine their outputs.

flowchart TD
    A[User Query] --> B[Model 1]
    A --> C[Model 2]
    A --> D[Model 3]

    B --> E[Response Combiner]
    C --> E
    D --> E

    E --> F[Final Response]

    style E fill:#fff3e0,stroke:#f57c00
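
One simple response combiner is a majority vote. The sketch below calls the models concurrently and returns the most common answer after light normalization; it assumes short, directly comparable answers (labels, numbers, single facts). Free-form text would need a different combiner, such as a judge model.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def ensemble(query: str, models: List[Callable[[str], str]]) -> Tuple[str, int]:
    """Query all models in parallel and return (winning_answer, votes)."""
    if not models:
        raise ValueError("at least one model is required")
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        # map() preserves the order of `models` in the results.
        answers = list(pool.map(lambda model: model(query), models))

    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    # Return the first original answer whose normalized form won the vote.
    return answers[normalized.index(winner)], votes
```

Latency is bounded by the slowest model and cost is the sum of all calls, so ensembles are usually reserved for queries where accuracy matters more than expense.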

Contextual Routing

Consider user context and history when routing.

Context Factors:

  • User preferences
  • Previous interactions
  • Session state
  • Performance feedback
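
Context is often applied as an adjustment on top of a base routing decision. The sketch below is one possible arrangement: the `UserContext` fields, the precedence order, and the two-complaint threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UserContext:
    preferred_tier: str = "auto"  # user preference: "auto", "fast", or "quality"
    recent_routes: List[str] = field(default_factory=list)  # previous interactions
    thumbs_down_on_fast: int = 0  # performance feedback on fast-model answers


def contextual_route(base_route: str, ctx: UserContext) -> str:
    """Adjust a base routing decision using what is known about the user."""
    # 1. An explicit user preference overrides everything else.
    if ctx.preferred_tier == "fast":
        return "fast_model"
    if ctx.preferred_tier == "quality":
        return "powerful_model"
    # 2. Repeated negative feedback on the fast model triggers an upgrade.
    if ctx.thumbs_down_on_fast >= 2:
        return "powerful_model"
    # 3. Session state: stay on the powerful model once a conversation uses it.
    if ctx.recent_routes and ctx.recent_routes[-1] == "powerful_model":
        return "powerful_model"
    return base_route
```

Keeping the base decision separate from the contextual adjustment makes it possible to evaluate each layer on its own.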


Best Practices

Design Principles

  1. Start Simple: Begin with basic routing rules
  2. Measure Performance: Track routing accuracy and latency
  3. Optimize Costs: Balance quality with expense
  4. Plan for Failure: Implement robust fallback mechanisms

Implementation Guidelines

  • Monitor routing decisions and outcomes
  • A/B test different routing strategies
  • Collect user feedback on response quality
  • Implement gradual rollout for new routing logic

Performance Optimization

  • Cache routing decisions for similar queries
  • Pre-compute routing for common patterns
  • Optimize classifier model size
  • Use asynchronous processing where possible

Common Routing Scenarios

Customer Support

  • Simple FAQ: Fast, cheap model
  • Complex Issues: Powerful reasoning model
  • Escalation: Human handoff

Content Generation

  • Short Content: Fast generation model
  • Long Content: High-quality model
  • Creative Content: Specialized creative model

Data Analysis

  • Simple Queries: Basic analytics model
  • Complex Analysis: Advanced reasoning model
  • Visualization: Specialized chart generation

Tools and Frameworks

Open Source Solutions

  • LangChain: Routing and orchestration
  • Semantic Kernel: Model orchestration
  • Haystack: Pipeline management
  • Custom FastAPI: Build your own gateway

Commercial Solutions

  • OpenAI API: Multiple model tiers, with the caller selecting the model per request
  • Azure OpenAI: Service-level routing
  • AWS Bedrock: Multi-model routing
  • Google Vertex AI: Model selection

Monitoring and Observability

  • LangSmith: LLM application monitoring
  • Weights & Biases: Experiment tracking
  • MLflow: Model lifecycle management
  • Custom dashboards: Track routing metrics

Measuring Routing Effectiveness

Key Metrics

  • Routing Accuracy: Percentage of queries sent to the route that a labeled evaluation set marks as correct
  • Response Quality: User satisfaction with responses
  • Cost Efficiency: Cost per successful interaction
  • Latency: Time from query to response
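
These metrics can be computed from a log of routed interactions. The record layout below is hypothetical: it assumes each entry has a hand-labeled expected route and a boolean success signal (for example a thumbs-up, or a ticket resolved without escalation).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RoutedInteraction:
    chosen_route: str
    expected_route: str  # label from a hand-annotated evaluation set
    cost_usd: float
    latency_s: float
    successful: bool     # proxy for response quality


def routing_metrics(log: List[RoutedInteraction]) -> Dict[str, float]:
    """Summarize routing accuracy, quality, cost efficiency, and latency."""
    if not log:
        raise ValueError("log must not be empty")
    n = len(log)
    correct = sum(1 for r in log if r.chosen_route == r.expected_route)
    successes = sum(1 for r in log if r.successful)
    total_cost = sum(r.cost_usd for r in log)
    return {
        "routing_accuracy": correct / n,
        "success_rate": successes / n,
        # Failed interactions still cost money, so divide total cost by successes.
        "cost_per_success": total_cost / successes if successes else float("inf"),
        "mean_latency_s": sum(r.latency_s for r in log) / n,
    }
```

Dividing total cost by successful interactions, rather than by all interactions, penalizes a router that saves money by sending hard queries to a model that fails on them.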

Continuous Improvement

  • Regular evaluation of routing decisions
  • User feedback integration
  • Performance benchmarking
  • Cost optimization analysis

Interactive Exercise

Try designing a routing strategy for these scenarios:

  1. E-commerce Chatbot: Handle product questions, order status, and technical support
  2. Content Platform: Route between creative writing, fact-checking, and summarization
  3. Educational Assistant: Balance simple explanations with complex problem-solving

Consider factors like cost, speed, and accuracy for each use case.


Next Steps

  • Implement basic rule-based routing for your application
  • Experiment with different routing strategies
  • Set up monitoring for routing decisions
  • Study advanced optimization and caching techniques
  • Practice with real-world routing scenarios

Key Takeaways

  • Model routing optimizes cost and performance
  • Start simple with rule-based routing
  • Measure and iterate on routing decisions
  • Plan for failure with robust fallbacks
  • Consider user context in routing decisions