511: Inference Performance Metrics¶
Chapter Overview
To optimize an inference service effectively, you must first measure its performance. Several key metrics capture the trade-offs between speed, cost, and efficiency.
Latency: The User's Perception of Speed¶
Latency is the time from when a user sends a request until they receive a complete response. For autoregressive LLMs, this is broken down into two crucial components:
```mermaid
gantt
    title LLM Response Latency Breakdown
    dateFormat X
    axisFormat %s

    section User Experience
    User sends query     : 0, 1
    Waiting for response : 1, 12
    Response complete    : 12, 13

    section Processing Phases
    Prompt Processing    : 1, 3
    First Token Gen      : 3, 4
    Token 2 Generation   : 4, 5
    Token 3 Generation   : 5, 6
    Token 4 Generation   : 6, 7
    Token 5 Generation   : 7, 8
    Token 6 Generation   : 8, 9
    Token 7 Generation   : 9, 10
    Token 8 Generation   : 10, 11
    Final Token Gen      : 11, 12

    section Key Metrics
    TTFT Measurement     : 1, 4
    TPOT Measurement     : 4, 5
```
Time to First Token (TTFT)¶
How quickly the first token is generated after the user sends their query.
What it measures:

- Initial "thinking" time of the model
- Prompt processing (pre-fill) efficiency
- System responsiveness

Why it matters:

- Critical for making applications feel responsive
- User perception of AI "intelligence"
- First impression of system performance
Time Per Output Token (TPOT)¶
How long it takes to generate each subsequent token after the first one.
What it measures:

- Streaming speed of the response
- Consistent generation performance
- Model efficiency in autoregressive mode

Why it matters:

- Determines reading pace for users
- Affects overall response completion time
- Critical for long-form content generation
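Both metrics are easy to capture once you stream the response and timestamp each token as it arrives. Below is a minimal sketch in Python; `stream_tokens()` is a hypothetical stand-in for whatever streaming client you actually use (an OpenAI-compatible SDK, a raw SSE reader, etc.), so only the timing logic is meant to carry over.

```python
import time

def stream_tokens(prompt):
    """Hypothetical stand-in for a streaming inference client call."""
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)  # simulate per-token generation time
        yield token

def measure_latency(prompt):
    start = time.perf_counter()
    ttft = None
    token_times = []

    for _ in stream_tokens(prompt):
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start                 # Time to First Token
        token_times.append(now)

    # TPOT: average gap between consecutive tokens after the first
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    total = token_times[-1] - start if token_times else 0.0
    return ttft, tpot, total

ttft, tpot, total = measure_latency("Explain KV caching in one sentence.")
print(f"TTFT: {ttft * 1000:.0f} ms, TPOT: {tpot * 1000:.0f} ms, total: {total:.2f} s")
```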
Total Latency Formula¶
```mermaid
graph LR
    A[Total Latency] --> B[TTFT]
    A --> C[TPOT × Output Tokens]
    B --> D[Prompt Processing<br/>+ First Token]
    C --> E[Remaining Tokens<br/>Generation Time]
    D --> F[Example: 800ms]
    E --> G[Example: 50ms × 100 tokens<br/>= 5000ms]
    F --> H[Total: 5800ms<br/>5.8 seconds]
    G --> H
    style A fill:#fff3e0,stroke:#f57c00
    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#e8f5e8,stroke:#388e3c
```
Total Latency = TTFT + (TPOT × Number of Output Tokens)
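In code, the formula and the example numbers from the diagram look like this:

```python
def total_latency(ttft_s, tpot_s, output_tokens):
    """Total Latency = TTFT + (TPOT × Number of Output Tokens)."""
    return ttft_s + tpot_s * output_tokens

# Example from the diagram: 800 ms TTFT, 50 ms per token, 100 output tokens
print(total_latency(0.8, 0.050, 100))  # 5.8 seconds
```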
Throughput: System-Wide Efficiency¶
Throughput measures how many requests or tokens the system can process per unit of time.
```mermaid
graph TD
    A[Throughput Metrics] --> B[Requests per Second]
    A --> C[Tokens per Second]
    B --> D[Batch Processing<br/>Efficiency]
    C --> E[Model Generation<br/>Speed]
    D --> F[Example: 50 req/sec<br/>for simple queries]
    E --> G[Example: 1000 tokens/sec<br/>across all requests]
    F --> H[System Capacity<br/>Planning]
    G --> I[Resource Utilization<br/>Optimization]
    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#e8f5e8,stroke:#388e3c
```
Requests per Second (RPS)¶
- Total number of completed requests per second
- Important for capacity planning
- Varies significantly with request complexity
Tokens per Second (TPS)¶
- Total tokens generated across all requests per second
- Better measure of actual computational work
- More consistent metric for performance comparison
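A small sketch of how RPS and TPS fall out of per-request records collected during a benchmark run; the record shape here is an assumption, so adapt it to whatever your load generator logs.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    started_at: float       # seconds (monotonic clock or epoch)
    finished_at: float
    output_tokens: int

def throughput(records):
    """Return (requests/sec, tokens/sec) over the benchmark window."""
    window = max(r.finished_at for r in records) - min(r.started_at for r in records)
    rps = len(records) / window
    tps = sum(r.output_tokens for r in records) / window
    return rps, tps

records = [
    RequestRecord(0.0, 1.9, 120),
    RequestRecord(0.2, 2.4, 150),
    RequestRecord(0.5, 2.0, 90),
]
rps, tps = throughput(records)
print(f"{rps:.1f} req/s, {tps:.0f} tokens/s")
```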
Cost Metrics: Economic Efficiency¶
Understanding the economic impact of inference decisions is crucial for production deployments.
```mermaid
graph TD
    A[Cost Metrics] --> B[Cost per Request]
    A --> C[Cost per Token]
    A --> D[Cost per User Session]
    B --> E[Infrastructure Cost<br/>÷ Total Requests]
    C --> F[Infrastructure Cost<br/>÷ Total Tokens]
    D --> G[Infrastructure Cost<br/>÷ Active Users]
    E --> H[Budget Planning<br/>ROI Analysis]
    F --> I[Model Efficiency<br/>Comparison]
    G --> J[User Economics<br/>Pricing Strategy]
    style B fill:#ffcdd2,stroke:#d32f2f
    style C fill:#c8e6c9,stroke:#388e3c
    style D fill:#fff3e0,stroke:#f57c00
```
Cost per Request¶
- Total infrastructure cost divided by number of requests
- Useful for simple query patterns
- Easy to understand and communicate
Cost per Token¶
- Total infrastructure cost divided by tokens generated
- Better for comparing different models
- More accurate for variable-length responses
Cost per User Session¶
- Total cost divided by active user sessions
- Important for user-facing applications
- Helps determine pricing strategies
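All three cost metrics are simple divisions once you know the infrastructure cost for a fixed window. A sketch with purely illustrative numbers (not real prices):

```python
# Illustrative figures for a one-hour window (made up, not real prices)
infra_cost_per_hour = 4.00          # e.g. one GPU instance
requests_per_hour = 7_200
tokens_per_hour = 900_000
active_sessions_per_hour = 600

cost_per_request = infra_cost_per_hour / requests_per_hour            # ≈ $0.00056
cost_per_1k_tokens = infra_cost_per_hour / tokens_per_hour * 1000     # ≈ $0.0044
cost_per_session = infra_cost_per_hour / active_sessions_per_hour     # ≈ $0.0067

print(f"${cost_per_request:.5f}/request, "
      f"${cost_per_1k_tokens:.4f}/1k tokens, "
      f"${cost_per_session:.4f}/session")
```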
Quality Metrics: Accuracy and Reliability¶
Performance optimization should never sacrifice the quality of model outputs.
```mermaid
graph TD
    A[Quality Metrics] --> B[Accuracy Metrics]
    A --> C[Consistency Metrics]
    A --> D[Reliability Metrics]
    B --> E[Task-Specific Scores<br/>BLEU, ROUGE, etc.]
    C --> F[Output Variance<br/>Across Runs]
    D --> G[Error Rate<br/>Failure Handling]
    E --> H[Model Performance<br/>Validation]
    F --> I[Optimization Impact<br/>Assessment]
    G --> J[System Robustness<br/>Evaluation]
    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#e8f5e8,stroke:#388e3c
    style D fill:#fff3e0,stroke:#f57c00
```
Accuracy Metrics¶
- Task-specific evaluation scores
- Comparison with baseline performance
- Human evaluation when possible
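Task-specific scores usually come from evaluation libraries, but even a simple token-overlap F1 (in the spirit of SQuAD-style scoring) is enough to catch obvious regressions introduced by an optimization. A minimal sketch:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Simple token-overlap F1 between a model output and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat on the mat", "a cat sat on the mat"))  # ≈ 0.83
```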
Consistency Metrics¶
- Output variance across multiple runs
- Stability under optimization
- Reproducibility of results
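One practical way to measure consistency is to run the same prompt several times and check how often the outputs agree and how much their length varies. A sketch, with `generate()` as a hypothetical stand-in for your inference call:

```python
import statistics
from collections import Counter

def generate(prompt, seed):
    """Hypothetical inference call; replace with your real client."""
    return f"stub answer {seed % 2}"   # simulates mild nondeterminism

def consistency_report(prompt, runs=8):
    outputs = [generate(prompt, seed=i) for i in range(runs)]
    _, modal_count = Counter(outputs).most_common(1)[0]
    lengths = [len(o.split()) for o in outputs]
    return {
        "unique_outputs": len(set(outputs)),
        "modal_output_rate": modal_count / runs,   # how often the most common answer appears
        "length_stdev": statistics.pstdev(lengths),
    }

print(consistency_report("Summarize our refund policy in one sentence."))
```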
Reliability Metrics¶
- Error rates and failure modes
- System uptime and availability
- Graceful degradation capabilities
Measurement Tools and Techniques¶
Benchmarking Tools¶
```mermaid
graph LR
    A[Measurement Tools] --> B[Load Testing]
    A --> C[Profiling Tools]
    A --> D[Monitoring Systems]
    B --> E[Apache Bench<br/>wrk, Artillery]
    C --> F[NVIDIA Nsight<br/>TensorRT Profiler]
    D --> G[Prometheus<br/>Grafana, DataDog]
    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#e8f5e8,stroke:#388e3c
    style D fill:#fff3e0,stroke:#f57c00
```
Load Testing¶
- Simulate realistic user traffic
- Measure performance under stress
- Identify bottlenecks and limits
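A minimal load-testing sketch using a thread pool to fire concurrent requests and collect latencies. `send_request()` is a hypothetical placeholder; dedicated tools such as wrk, Artillery, or Apache Bench are usually a better fit for serious testing, but the overall structure is the same.

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt):
    """Hypothetical placeholder for a real HTTP call to your inference server."""
    time.sleep(0.2)  # simulate server latency
    return "ok"

def load_test(prompts, concurrency=16):
    latencies = []

    def timed_call(p):
        start = time.perf_counter()
        send_request(p)
        latencies.append(time.perf_counter() - start)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed_call, prompts))
    wall = time.perf_counter() - start

    latencies.sort()
    return {
        "rps": len(prompts) / wall,
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
    }

print(load_test(["hello"] * 200, concurrency=16))
```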
Profiling Tools¶
- Analyze detailed performance characteristics
- Identify optimization opportunities
- Understand resource utilization
Monitoring Systems¶
- Track metrics in production
- Set up alerts for performance degradation
- Historical performance analysis
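A minimal sketch of exposing latency and token-count metrics with the prometheus_client library; Prometheus can scrape them, Grafana can graph them, and alerting rules can fire on degradation. The metric names below are assumptions, not a standard.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Histograms let you query percentiles (p50/p95/p99) in Prometheus
TTFT_SECONDS = Histogram("llm_ttft_seconds", "Time to first token")
REQUEST_SECONDS = Histogram("llm_request_seconds", "Total request latency")
TOKENS_GENERATED = Counter("llm_tokens_generated_total", "Output tokens generated")

def record_request(ttft_s, total_s, output_tokens):
    """Call this from your serving loop after each completed request."""
    TTFT_SECONDS.observe(ttft_s)
    REQUEST_SECONDS.observe(total_s)
    TOKENS_GENERATED.inc(output_tokens)

start_http_server(9100)                # metrics exposed at http://localhost:9100/metrics
record_request(0.8, 5.8, 100)          # example observation
```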
Metric Relationships and Trade-offs¶
Understanding how metrics interact helps make informed optimization decisions.
```mermaid
graph TD
    A[Metric Trade-offs] --> B[Latency vs Throughput]
    A --> C[Cost vs Quality]
    A --> D[Speed vs Accuracy]
    B --> E[Lower latency often<br/>reduces throughput]
    C --> F[Cheaper inference may<br/>compromise quality]
    D --> G[Faster models may<br/>sacrifice accuracy]
    E --> H[Optimization<br/>Strategy Choice]
    F --> H
    G --> H
    H --> I[Balanced Approach<br/>Based on Use Case]
    style B fill:#ffcdd2,stroke:#d32f2f
    style C fill:#c8e6c9,stroke:#388e3c
    style D fill:#fff3e0,stroke:#f57c00
```
Common Trade-offs¶
- Latency vs. Throughput: Optimizing for one often hurts the other
- Cost vs. Quality: Cheaper solutions may reduce output quality
- Speed vs. Accuracy: Faster models may sacrifice precision
Optimization Strategies¶
- Identify Priority Metrics: What matters most for your use case?
- Set Acceptable Ranges: Define minimum thresholds for each metric
- Monitor Continuously: Track changes over time
Best Practices for Measurement¶
Establish Baselines¶
- Record current performance before optimization
- Use consistent testing conditions
- Document measurement methodology
Realistic Testing¶
- Use production-like data and queries
- Test under various load conditions
- Include edge cases and failure scenarios
Continuous Monitoring¶
- Track metrics in production
- Set up automated alerts
- Regular performance reviews
Interactive Exercise¶
Scenario: You're optimizing a customer service chatbot.
Current Performance:

- TTFT: 1.2 seconds
- TPOT: 80ms
- Average response: 50 tokens
- Cost per request: $0.05
- User satisfaction: 85%

Questions:

1. What's the total latency for an average response?
2. If you reduce TTFT to 0.8s but TPOT increases to 100ms, is this better?
3. How would you balance cost reduction with user satisfaction?
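For questions 1 and 2, the total-latency formula from earlier in the chapter applies directly; a quick check of the arithmetic:

```python
current  = 1.2 + 0.080 * 50   # 5.2 s end-to-end
proposed = 0.8 + 0.100 * 50   # 5.8 s end-to-end: the first token arrives sooner,
                              # but the complete response takes longer
```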
Key Takeaways¶
✅ TTFT affects perceived responsiveness
✅ TPOT determines streaming speed
✅ Throughput measures system capacity
✅ Cost metrics enable economic decisions
✅ Quality metrics prevent degradation
✅ Trade-offs exist between all metrics
✅ Continuous measurement is essential