
511: Inference Performance Metrics

Chapter Overview

To effectively optimize an inference service, you must first measure its performance. Several key metrics help us evaluate and understand the trade-offs between speed, cost, and efficiency.


Latency: The User's Perception of Speed

Latency is the time from when a user sends a request until they receive a complete response. For autoregressive LLMs, this is broken down into two crucial components:

gantt
    title LLM Response Latency Breakdown
    dateFormat X
    axisFormat %s

    section User Experience
    User sends query     : 0, 1
    Waiting for response : 1, 12
    Response complete    : 12, 13

    section Processing Phases
    Prompt Processing    : 1, 3
    First Token Gen     : 3, 4
    Token 2 Generation  : 4, 5
    Token 3 Generation  : 5, 6
    Token 4 Generation  : 6, 7
    Token 5 Generation  : 7, 8
    Token 6 Generation  : 8, 9
    Token 7 Generation  : 9, 10
    Token 8 Generation  : 10, 11
    Final Token Gen     : 11, 12

    section Key Metrics
    TTFT Measurement    : 1, 4
    TPOT Measurement    : 4, 5

Time to First Token (TTFT)

How quickly the first token is generated after the user sends their query.

What it measures:

  • Initial "thinking" time of the model
  • Prompt processing (pre-fill) efficiency
  • System responsiveness

Why it matters:

  • Critical for making applications feel responsive
  • Shapes the user's perception of the AI's "intelligence"
  • First impression of system performance

Time Per Output Token (TPOT)

How long it takes to generate each subsequent token after the first one.

What it measures:

  • Streaming speed of the response
  • Consistency of generation performance
  • Model efficiency in autoregressive mode

Why it matters:

  • Determines reading pace for users
  • Affects overall response completion time
  • Critical for long-form content generation
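Both latency components can be measured directly from a streaming response: TTFT is the gap between sending the request and the first token arriving, and TPOT is the average gap between subsequent tokens. Below is a minimal Python sketch; `stream_tokens` is a hypothetical stand-in for whatever streaming client your serving stack provides.

```python
import time

def measure_latency(stream_tokens, prompt):
    """Measure TTFT and average TPOT from a streaming generation.

    `stream_tokens` is a hypothetical callable standing in for your
    serving client's streaming API; it should yield one token at a time.
    """
    start = time.perf_counter()
    arrival_times = [time.perf_counter() for _ in stream_tokens(prompt)]
    if not arrival_times:
        raise ValueError("stream produced no tokens")

    ttft = arrival_times[0] - start  # request sent -> first token
    # TPOT: average gap between consecutive tokens after the first
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot
```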

Total Latency Formula

graph LR
    A[Total Latency] --> B[TTFT]
    A --> C[TPOT × Output Tokens]

    B --> D[Prompt Processing<br/>+ First Token]
    C --> E[Remaining Tokens<br/>Generation Time]

    D --> F[Example: 800ms]
    E --> G[Example: 50ms × 100 tokens<br/>= 5000ms]

    F --> H[Total: 5800ms<br/>5.8 seconds]
    G --> H

    style A fill:#fff3e0,stroke:#f57c00
    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#e8f5e8,stroke:#388e3c

Total Latency = TTFT + (TPOT × Number of Output Tokens)

Strictly speaking, the first token is already counted inside TTFT, so the second term is TPOT × (N − 1); for responses longer than a few tokens the difference is negligible.
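The worked example from the diagram reduces to one line of arithmetic. A quick check (values are the diagram's; the first token is folded into TTFT, so only the remaining tokens are multiplied by TPOT):

```python
ttft_ms = 800           # prompt processing + first token
tpot_ms = 50            # per-token generation time
remaining_tokens = 100  # tokens generated after the first

total_ms = ttft_ms + tpot_ms * remaining_tokens
print(total_ms / 1000)  # 5.8 seconds, matching the diagram
```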


Throughput: System-Wide Efficiency

Throughput measures how many requests or tokens the system can process per unit of time.

graph TD
    A[Throughput Metrics] --> B[Requests per Second]
    A --> C[Tokens per Second]

    B --> D[Batch Processing<br/>Efficiency]
    C --> E[Model Generation<br/>Speed]

    D --> F[Example: 50 req/sec<br/>for simple queries]
    E --> G[Example: 1000 tokens/sec<br/>across all requests]

    F --> H[System Capacity<br/>Planning]
    G --> I[Resource Utilization<br/>Optimization]

    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#e8f5e8,stroke:#388e3c

Requests per Second (RPS)

  • Total number of completed requests per second
  • Important for capacity planning
  • Varies significantly with request complexity

Tokens per Second (TPS)

  • Total tokens generated across all requests per second
  • Better measure of actual computational work
  • More consistent metric for performance comparison
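Both throughput figures fall out of the same request log. A minimal sketch, assuming each completed request record carries its timestamps and an output-token count:

```python
from dataclasses import dataclass

@dataclass
class CompletedRequest:
    started_at: float    # seconds (epoch or monotonic, but consistent)
    finished_at: float
    output_tokens: int

def throughput(requests: list[CompletedRequest]) -> tuple[float, float]:
    """Return (requests/sec, tokens/sec) over the observed window."""
    window = max(r.finished_at for r in requests) - min(r.started_at for r in requests)
    rps = len(requests) / window
    tps = sum(r.output_tokens for r in requests) / window
    return rps, tps
```

Because TPS normalizes for response length, it is the figure to compare across workloads; RPS is only meaningful when the request mix stays stable.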

Cost Metrics: Economic Efficiency

Understanding the economic impact of inference decisions is crucial for production deployments.

graph TD
    A[Cost Metrics] --> B[Cost per Request]
    A --> C[Cost per Token]
    A --> D[Cost per User Session]

    B --> E[Infrastructure Cost<br/>÷ Total Requests]
    C --> F[Infrastructure Cost<br/>÷ Total Tokens]
    D --> G[Infrastructure Cost<br/>÷ Active Users]

    E --> H[Budget Planning<br/>ROI Analysis]
    F --> I[Model Efficiency<br/>Comparison]
    G --> J[User Economics<br/>Pricing Strategy]

    style B fill:#ffcdd2,stroke:#d32f2f
    style C fill:#c8e6c9,stroke:#388e3c
    style D fill:#fff3e0,stroke:#f57c00

Cost per Request

  • Total infrastructure cost divided by number of requests
  • Useful for simple query patterns
  • Easy to understand and communicate

Cost per Token

  • Total infrastructure cost divided by tokens generated
  • Better for comparing different models
  • More accurate for variable-length responses

Cost per User Session

  • Total cost divided by active user sessions
  • Important for user-facing applications
  • Helps determine pricing strategies
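All three cost metrics are ratios over the same infrastructure bill. A minimal sketch, assuming you already track request, token, and session counts per billing period (the example numbers are hypothetical):

```python
def cost_metrics(infra_cost: float, requests: int, tokens: int, sessions: int) -> dict:
    """Break one billing period's infrastructure cost into the three ratios."""
    return {
        "cost_per_request": infra_cost / requests,   # budget planning, ROI
        "cost_per_token":   infra_cost / tokens,     # model-to-model comparison
        "cost_per_session": infra_cost / sessions,   # pricing strategy
    }

# Hypothetical month: $5,000 of infrastructure, 1M requests,
# 80M generated tokens, 50k user sessions
print(cost_metrics(5_000, 1_000_000, 80_000_000, 50_000))
```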

Quality Metrics: Accuracy and Reliability

Performance optimization should never sacrifice the quality of model outputs.

graph TD
    A[Quality Metrics] --> B[Accuracy Metrics]
    A --> C[Consistency Metrics]
    A --> D[Reliability Metrics]

    B --> E[Task-Specific Scores<br/>BLEU, ROUGE, etc.]
    C --> F[Output Variance<br/>Across Runs]
    D --> G[Error Rate<br/>Failure Handling]

    E --> H[Model Performance<br/>Validation]
    F --> I[Optimization Impact<br/>Assessment]
    G --> J[System Robustness<br/>Evaluation]

    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#e8f5e8,stroke:#388e3c
    style D fill:#fff3e0,stroke:#f57c00

Accuracy Metrics

  • Task-specific evaluation scores
  • Comparison with baseline performance
  • Human evaluation when possible

Consistency Metrics

  • Output variance across multiple runs
  • Stability under optimization
  • Reproducibility of results

Reliability Metrics

  • Error rates and failure modes
  • System uptime and availability
  • Graceful degradation capabilities
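A simple way to quantify consistency and reliability together is to replay a fixed prompt set several times and record output variance and error rate. A hedged sketch, with `generate` as a hypothetical wrapper around your model call:

```python
def stability_report(generate, prompts, runs=5):
    """Replay each prompt `runs` times; report output stability and error rate.

    `generate` is a hypothetical callable wrapping the model. At
    temperature 0, any instability here points at non-deterministic
    kernels or an optimization that changed the numerics.
    """
    stable, errors = 0, 0
    for prompt in prompts:
        outputs = []
        for _ in range(runs):
            try:
                outputs.append(generate(prompt))
            except Exception:
                errors += 1
        if outputs and len(set(outputs)) == 1:
            stable += 1
    return {
        "stable_fraction": stable / len(prompts),
        "error_rate": errors / (len(prompts) * runs),
    }
```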

Measurement Tools and Techniques

Benchmarking Tools

graph LR
    A[Measurement Tools] --> B[Load Testing]
    A --> C[Profiling Tools]
    A --> D[Monitoring Systems]

    B --> E[Apache Bench<br/>wrk, Artillery]
    C --> F[NVIDIA Nsight<br/>TensorRT Profiler]
    D --> G[Prometheus<br/>Grafana, DataDog]

    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#e8f5e8,stroke:#388e3c
    style D fill:#fff3e0,stroke:#f57c00

Load Testing

  • Simulate realistic user traffic
  • Measure performance under stress
  • Identify bottlenecks and limits
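Purpose-built tools like wrk or Artillery are the usual choice, but the core idea fits in a few lines: fire concurrent requests and record per-request latency. A sketch using aiohttp, where `ENDPOINT` and the payload shape are placeholders for your own service:

```python
import asyncio
import time

import aiohttp  # third-party: pip install aiohttp

ENDPOINT = "http://localhost:8000/generate"  # placeholder for your service

async def one_request(session: aiohttp.ClientSession, payload: dict) -> float:
    start = time.perf_counter()
    async with session.post(ENDPOINT, json=payload) as resp:
        await resp.read()  # wait for the full response body
    return time.perf_counter() - start

async def load_test(concurrency: int = 50) -> None:
    payload = {"prompt": "Hello", "max_tokens": 64}  # shape depends on your API
    async with aiohttp.ClientSession() as session:
        latencies = sorted(await asyncio.gather(
            *(one_request(session, payload) for _ in range(concurrency))
        ))
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(len(latencies) * 0.99)]
    print(f"p50={p50:.3f}s  p99={p99:.3f}s")

asyncio.run(load_test())
```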

Profiling Tools

  • Analyze detailed performance characteristics
  • Identify optimization opportunities
  • Understand resource utilization

Monitoring Systems

  • Track metrics in production
  • Set up alerts for performance degradation
  • Historical performance analysis
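For production tracking, the metrics in this chapter map directly onto Prometheus primitives: histograms for the latency components, a counter for token throughput. A minimal sketch with the prometheus_client library (bucket boundaries are illustrative; tune them to your latency targets):

```python
from prometheus_client import Counter, Histogram, start_http_server

# Histograms for the two latency components; buckets are illustrative
TTFT = Histogram("llm_ttft_seconds", "Time to first token",
                 buckets=(0.1, 0.25, 0.5, 1.0, 2.0, 5.0))
TPOT = Histogram("llm_tpot_seconds", "Time per output token",
                 buckets=(0.01, 0.025, 0.05, 0.1, 0.2))
TOKENS = Counter("llm_output_tokens_total", "Total output tokens generated")

def record_request(ttft: float, tpot: float, n_tokens: int) -> None:
    """Call once per completed request from the serving loop."""
    TTFT.observe(ttft)
    TPOT.observe(tpot)
    TOKENS.inc(n_tokens)

# Expose /metrics on port 9100 for Prometheus to scrape; alerts and
# dashboards (e.g. Grafana) then build on these series.
start_http_server(9100)
```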

Metric Relationships and Trade-offs

Understanding how metrics interact helps make informed optimization decisions.

graph TD
    A[Metric Trade-offs] --> B[Latency vs Throughput]
    A --> C[Cost vs Quality]
    A --> D[Speed vs Accuracy]

    B --> E[Lower latency often<br/>reduces throughput]
    C --> F[Cheaper inference may<br/>compromise quality]
    D --> G[Faster models may<br/>sacrifice accuracy]

    E --> H[Optimization<br/>Strategy Choice]
    F --> H
    G --> H

    H --> I[Balanced Approach<br/>Based on Use Case]

    style B fill:#ffcdd2,stroke:#d32f2f
    style C fill:#c8e6c9,stroke:#388e3c
    style D fill:#fff3e0,stroke:#f57c00

Common Trade-offs

  • Latency vs. Throughput: Optimizing for one often hurts the other
  • Cost vs. Quality: Cheaper solutions may reduce output quality
  • Speed vs. Accuracy: Faster models may sacrifice precision

Optimization Strategies

  • Identify Priority Metrics: What matters most for your use case?
  • Set Acceptable Ranges: Define minimum thresholds for each metric
  • Monitor Continuously: Track changes over time

Best Practices for Measurement

Establish Baselines

  • Record current performance before optimization
  • Use consistent testing conditions
  • Document measurement methodology

Realistic Testing

  • Use production-like data and queries
  • Test under various load conditions
  • Include edge cases and failure scenarios

Continuous Monitoring

  • Track metrics in production
  • Set up automated alerts
  • Regular performance reviews

Interactive Exercise

Scenario: You're optimizing a customer service chatbot.

Current Performance:

  • TTFT: 1.2 seconds
  • TPOT: 80ms
  • Average response: 50 tokens
  • Cost per request: $0.05
  • User satisfaction: 85%

Questions:

  1. What's the total latency for an average response?
  2. If you reduce TTFT to 0.8s but TPOT increases to 100ms, is this better?
  3. How would you balance cost reduction with user satisfaction?
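To check your arithmetic for questions 1 and 2, apply the total-latency formula from earlier in the chapter (here all 50 tokens are charged at TPOT, as in the simplified formula):

```python
def total_latency(ttft_s: float, tpot_s: float, tokens: int) -> float:
    return ttft_s + tpot_s * tokens

print(total_latency(1.2, 0.080, 50))  # current:  5.2 s
print(total_latency(0.8, 0.100, 50))  # proposed: 5.8 s -- faster first token, slower overall
```

The proposed change trades overall completion time for responsiveness; whether that is "better" depends on whether users read the response as it streams.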


Key Takeaways

  • TTFT affects perceived responsiveness
  • TPOT determines streaming speed
  • Throughput measures system capacity
  • Cost metrics enable economic decisions
  • Quality metrics prevent degradation
  • Trade-offs exist between all metrics
  • Continuous measurement is essential