511: Inference Performance Metrics¶
Chapter Overview
To optimize an inference service effectively, you must first measure its performance. Several key metrics capture the trade-offs between speed, cost, and efficiency.
Latency: The User's Perception of Speed¶
Latency is the time from when a user sends a request until they receive a complete response. For autoregressive LLMs, this is broken down into two crucial components:
```mermaid
gantt
    title LLM Response Latency Breakdown
    dateFormat X
    axisFormat %s

    section User Experience
    User sends query     : 0, 1
    Waiting for response : 1, 12
    Response complete    : 12, 13

    section Processing Phases
    Prompt Processing    : 1, 3
    First Token Gen      : 3, 4
    Token 2 Generation   : 4, 5
    Token 3 Generation   : 5, 6
    Token 4 Generation   : 6, 7
    Token 5 Generation   : 7, 8
    Token 6 Generation   : 8, 9
    Token 7 Generation   : 9, 10
    Token 8 Generation   : 10, 11
    Final Token Gen      : 11, 12

    section Key Metrics
    TTFT Measurement     : 1, 4
    TPOT Measurement     : 4, 5
```
Time to First Token (TTFT)¶
How quickly the first token is generated after the user sends their query.
What it measures:

- Initial "thinking" time of the model
- Prompt processing (pre-fill) efficiency
- System responsiveness

Why it matters:

- Critical for making applications feel responsive
- User perception of AI "intelligence"
- First impression of system performance
Time Per Output Token (TPOT)¶
How long it takes to generate each subsequent token after the first one.
What it measures:

- Streaming speed of the response
- Consistent generation performance
- Model efficiency in autoregressive mode

Why it matters:

- Determines reading pace for users
- Affects overall response completion time
- Critical for long-form content generation
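Both metrics are easy to capture once you stream the response and timestamp each token as it arrives. Below is a minimal sketch in Python; `stream_tokens()` is a hypothetical stand-in for whatever streaming client you actually use (an OpenAI-compatible SDK, a raw SSE reader, etc.), so only the timing logic is meant to carry over.

```python
import time

def stream_tokens(prompt):
    """Hypothetical stand-in for a streaming inference client call."""
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)  # simulate per-token generation time
        yield token

def measure_latency(prompt):
    start = time.perf_counter()
    ttft = None
    token_times = []

    for _ in stream_tokens(prompt):
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start                 # Time to First Token
        token_times.append(now)

    # TPOT: average gap between consecutive tokens after the first
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    total = token_times[-1] - start if token_times else 0.0
    return ttft, tpot, total

ttft, tpot, total = measure_latency("Explain KV caching in one sentence.")
print(f"TTFT: {ttft * 1000:.0f} ms, TPOT: {tpot * 1000:.0f} ms, total: {total:.2f} s")
```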
Total Latency Formula¶
```mermaid
graph LR
    A[Total Latency] --> B[TTFT]
    A --> C[TPOT × Output Tokens]
    B --> D[Prompt Processing<br/>+ First Token]
    C --> E[Remaining Tokens<br/>Generation Time]
    D --> F[Example: 800ms]
    E --> G[Example: 50ms × 100 tokens<br/>= 5000ms]
    F --> H[Total: 5800ms<br/>5.8 seconds]
    G --> H
    style A fill:#fff3e0,stroke:#f57c00
    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#e8f5e8,stroke:#388e3c
```
Total Latency = TTFT + (TPOT × Number of Output Tokens)
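In code, the formula and the example numbers from the diagram look like this:

```python
def total_latency(ttft_s, tpot_s, output_tokens):
    """Total Latency = TTFT + (TPOT × Number of Output Tokens)."""
    return ttft_s + tpot_s * output_tokens

# Example from the diagram: 800 ms TTFT, 50 ms per token, 100 output tokens
print(total_latency(0.8, 0.050, 100))  # 5.8 seconds
```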
Throughput: System-Wide Efficiency¶
Throughput measures how many requests or tokens the system can process per unit of time.
```mermaid
graph TD
    A[Throughput Metrics] --> B[Requests per Second]
    A --> C[Tokens per Second]
    B --> D[Batch Processing<br/>Efficiency]
    C --> E[Model Generation<br/>Speed]
    D --> F[Example: 50 req/sec<br/>for simple queries]
    E --> G[Example: 1000 tokens/sec<br/>across all requests]
    F --> H[System Capacity<br/>Planning]
    G --> I[Resource Utilization<br/>Optimization]
    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#e8f5e8,stroke:#388e3c
```
Requests per Second (RPS)¶
- Total number of completed requests per second
- Important for capacity planning
- Varies significantly with request complexity
Tokens per Second (TPS)¶
- Total tokens generated across all requests per second
- Better measure of actual computational work
- More consistent metric for performance comparison
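A small sketch of how RPS and TPS fall out of per-request records collected during a benchmark run; the record shape here is an assumption, so adapt it to whatever your load generator logs.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    started_at: float       # seconds (monotonic clock or epoch)
    finished_at: float
    output_tokens: int

def throughput(records):
    """Return (requests/sec, tokens/sec) over the benchmark window."""
    window = max(r.finished_at for r in records) - min(r.started_at for r in records)
    rps = len(records) / window
    tps = sum(r.output_tokens for r in records) / window
    return rps, tps

records = [
    RequestRecord(0.0, 1.9, 120),
    RequestRecord(0.2, 2.4, 150),
    RequestRecord(0.5, 2.0, 90),
]
rps, tps = throughput(records)
print(f"{rps:.1f} req/s, {tps:.0f} tokens/s")
```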
Cost Metrics: Economic Efficiency¶
Understanding the economic impact of inference decisions is crucial for production deployments.
```mermaid
graph TD
    A[Cost Metrics] --> B[Cost per Request]
    A --> C[Cost per Token]
    A --> D[Cost per User Session]
    B --> E[Infrastructure Cost<br/>÷ Total Requests]
    C --> F[Infrastructure Cost<br/>÷ Total Tokens]
    D --> G[Infrastructure Cost<br/>÷ Active Users]
    E --> H[Budget Planning<br/>ROI Analysis]
    F --> I[Model Efficiency<br/>Comparison]
    G --> J[User Economics<br/>Pricing Strategy]
    style B fill:#ffcdd2,stroke:#d32f2f
    style C fill:#c8e6c9,stroke:#388e3c
    style D fill:#fff3e0,stroke:#f57c00
```
Cost per Request¶
- Total infrastructure cost divided by number of requests
- Useful for simple query patterns
- Easy to understand and communicate
Cost per Token¶
- Total infrastructure cost divided by tokens generated
- Better for comparing different models
- More accurate for variable-length responses
Cost per User Session¶
- Total cost divided by active user sessions
- Important for user-facing applications
- Helps determine pricing strategies
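All three cost metrics are simple divisions once you know the infrastructure cost for a fixed window. A sketch with purely illustrative numbers (not real prices):

```python
# Illustrative figures for a one-hour window (made up, not real prices)
infra_cost_per_hour = 4.00          # e.g. one GPU instance
requests_per_hour = 7_200
tokens_per_hour = 900_000
active_sessions_per_hour = 600

cost_per_request = infra_cost_per_hour / requests_per_hour            # ≈ $0.00056
cost_per_1k_tokens = infra_cost_per_hour / tokens_per_hour * 1000     # ≈ $0.0044
cost_per_session = infra_cost_per_hour / active_sessions_per_hour     # ≈ $0.0067

print(f"${cost_per_request:.5f}/request, "
      f"${cost_per_1k_tokens:.4f}/1k tokens, "
      f"${cost_per_session:.4f}/session")
```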
Quality Metrics: Accuracy and Reliability¶
Performance optimization should never sacrifice the quality of model outputs.
```mermaid
graph TD
    A[Quality Metrics] --> B[Accuracy Metrics]
    A --> C[Consistency Metrics]
    A --> D[Reliability Metrics]
    B --> E[Task-Specific Scores<br/>BLEU, ROUGE, etc.]
    C --> F[Output Variance<br/>Across Runs]
    D --> G[Error Rate<br/>Failure Handling]
    E --> H[Model Performance<br/>Validation]
    F --> I[Optimization Impact<br/>Assessment]
    G --> J[System Robustness<br/>Evaluation]
    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#e8f5e8,stroke:#388e3c
    style D fill:#fff3e0,stroke:#f57c00
```
Accuracy Metrics¶
- Task-specific evaluation scores
- Comparison with baseline performance
- Human evaluation when possible
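Task-specific scores usually come from evaluation libraries, but even a simple token-overlap F1 (in the spirit of SQuAD-style scoring) is enough to catch obvious regressions introduced by an optimization. A minimal sketch:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Simple token-overlap F1 between a model output and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat on the mat", "a cat sat on the mat"))  # ≈ 0.83
```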
Consistency Metrics¶
- Output variance across multiple runs
- Stability under optimization
- Reproducibility of results
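One practical way to measure consistency is to run the same prompt several times and check how often the outputs agree and how much their length varies. A sketch, with `generate()` as a hypothetical stand-in for your inference call:

```python
import statistics
from collections import Counter

def generate(prompt, seed):
    """Hypothetical inference call; replace with your real client."""
    return f"stub answer {seed % 2}"   # simulates mild nondeterminism

def consistency_report(prompt, runs=8):
    outputs = [generate(prompt, seed=i) for i in range(runs)]
    _, modal_count = Counter(outputs).most_common(1)[0]
    lengths = [len(o.split()) for o in outputs]
    return {
        "unique_outputs": len(set(outputs)),
        "modal_output_rate": modal_count / runs,   # how often the most common answer appears
        "length_stdev": statistics.pstdev(lengths),
    }

print(consistency_report("Summarize our refund policy in one sentence."))
```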
Reliability Metrics¶
- Error rates and failure modes
- System uptime and availability
- Graceful degradation capabilities
Measurement Tools and Techniques¶
Benchmarking Tools¶
```mermaid
graph LR
    A[Measurement Tools] --> B[Load Testing]
    A --> C[Profiling Tools]
    A --> D[Monitoring Systems]
    B --> E[Apache Bench<br/>wrk, Artillery]
    C --> F[NVIDIA Nsight<br/>TensorRT Profiler]
    D --> G[Prometheus<br/>Grafana, DataDog]
    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#e8f5e8,stroke:#388e3c
    style D fill:#fff3e0,stroke:#f57c00
```
Load Testing¶
- Simulate realistic user traffic
- Measure performance under stress
- Identify bottlenecks and limits
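A minimal load-testing sketch using a thread pool to fire concurrent requests and collect latencies. `send_request()` is a hypothetical placeholder; dedicated tools such as wrk, Artillery, or Apache Bench are usually a better fit for serious testing, but the overall structure is the same.

```python
import time
import statistics
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt):
    """Hypothetical placeholder for a real HTTP call to your inference server."""
    time.sleep(0.2)  # simulate server latency
    return "ok"

def load_test(prompts, concurrency=16):
    latencies = []

    def timed_call(p):
        start = time.perf_counter()
        send_request(p)
        latencies.append(time.perf_counter() - start)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed_call, prompts))
    wall = time.perf_counter() - start

    latencies.sort()
    return {
        "rps": len(prompts) / wall,
        "p50_s": statistics.median(latencies),
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
    }

print(load_test(["hello"] * 200, concurrency=16))
```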
Profiling Tools¶
- Analyze detailed performance characteristics
- Identify optimization opportunities
- Understand resource utilization
Monitoring Systems¶
- Track metrics in production
- Set up alerts for performance degradation
- Historical performance analysis
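A minimal sketch of exposing latency and token-count metrics with the prometheus_client library; Prometheus can scrape them, Grafana can graph them, and alerting rules can fire on degradation. The metric names below are assumptions, not a standard.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Histograms let you query percentiles (p50/p95/p99) in Prometheus
TTFT_SECONDS = Histogram("llm_ttft_seconds", "Time to first token")
REQUEST_SECONDS = Histogram("llm_request_seconds", "Total request latency")
TOKENS_GENERATED = Counter("llm_tokens_generated_total", "Output tokens generated")

def record_request(ttft_s, total_s, output_tokens):
    """Call this from your serving loop after each completed request."""
    TTFT_SECONDS.observe(ttft_s)
    REQUEST_SECONDS.observe(total_s)
    TOKENS_GENERATED.inc(output_tokens)

start_http_server(9100)                # metrics exposed at http://localhost:9100/metrics
record_request(0.8, 5.8, 100)          # example observation
```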
Metric Relationships and Trade-offs¶
Understanding how metrics interact helps make informed optimization decisions.
```mermaid
graph TD
    A[Metric Trade-offs] --> B[Latency vs Throughput]
    A --> C[Cost vs Quality]
    A --> D[Speed vs Accuracy]
    B --> E[Lower latency often<br/>reduces throughput]
    C --> F[Cheaper inference may<br/>compromise quality]
    D --> G[Faster models may<br/>sacrifice accuracy]
    E --> H[Optimization<br/>Strategy Choice]
    F --> H
    G --> H
    H --> I[Balanced Approach<br/>Based on Use Case]
    style B fill:#ffcdd2,stroke:#d32f2f
    style C fill:#c8e6c9,stroke:#388e3c
    style D fill:#fff3e0,stroke:#f57c00
```
Common Trade-offs¶
- Latency vs. Throughput: Optimizing for one often hurts the other
- Cost vs. Quality: Cheaper solutions may reduce output quality
- Speed vs. Accuracy: Faster models may sacrifice precision
Optimization Strategies¶
- Identify Priority Metrics: What matters most for your use case?
- Set Acceptable Ranges: Define minimum thresholds for each metric
- Monitor Continuously: Track changes over time
Best Practices for Measurement¶
Establish Baselines¶
- Record current performance before optimization
- Use consistent testing conditions
- Document measurement methodology
Realistic Testing¶
- Use production-like data and queries
- Test under various load conditions
- Include edge cases and failure scenarios
Continuous Monitoring¶
- Track metrics in production
- Set up automated alerts
- Regular performance reviews
Interactive Exercise¶
Scenario: You're optimizing a customer service chatbot.
Current Performance:

- TTFT: 1.2 seconds
- TPOT: 80ms
- Average response: 50 tokens
- Cost per request: $0.05
- User satisfaction: 85%

Questions:

1. What's the total latency for an average response?
2. If you reduce TTFT to 0.8s but TPOT increases to 100ms, is this better?
3. How would you balance cost reduction with user satisfaction?
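For questions 1 and 2, the total-latency formula from earlier in the chapter applies directly; a quick check of the arithmetic:

```python
current  = 1.2 + 0.080 * 50   # 5.2 s end-to-end
proposed = 0.8 + 0.100 * 50   # 5.8 s end-to-end: the first token arrives sooner,
                              # but the complete response takes longer
```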
Key Takeaways¶
✅ TTFT affects perceived responsiveness
✅ TPOT determines streaming speed
✅ Throughput measures system capacity
✅ Cost metrics enable economic decisions
✅ Quality metrics prevent degradation
✅ Trade-offs exist between all metrics
✅ Continuous measurement is essential