Latency vs Throughput

Latency measures how long a single request takes to complete; throughput measures how much work a system completes per unit of time, usually expressed as requests or tokens per second.

These two metrics trade off against each other in LLM inference and drive different model and infrastructure choices depending on your use case.

Definitions

  • Time to first token (TTFT): How long until the first output token arrives. Critical for streaming interfaces where users see output appear progressively.
  • Time per output token (TPOT): The average time between successive output tokens after the first. Determines perceived response speed.
  • Total latency: Approximately TTFT + (TPOT x output length in tokens). What most users experience as "how slow is this?"
  • Throughput: Total tokens generated per second across all concurrent requests. The key metric for cost and batch processing.
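
The definitions above can be turned into a small calculation. The following is a minimal sketch, not a benchmark tool: the function names and all numbers are hypothetical and chosen only to illustrate the arithmetic.

```python
def total_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate end-to-end latency: TTFT + (TPOT x output length)."""
    return ttft_s + tpot_s * output_tokens


def throughput_tokens_per_s(output_tokens_per_request: list[int], wall_clock_s: float) -> float:
    """Aggregate throughput: all tokens generated across requests / elapsed time."""
    return sum(output_tokens_per_request) / wall_clock_s


# Hypothetical request: 0.5 s TTFT, 20 ms per output token, 200 output tokens.
single_request_latency = total_latency_s(ttft_s=0.5, tpot_s=0.02, output_tokens=200)

# Hypothetical server: 8 concurrent requests of 200 tokens each, all done in 10 s.
aggregate_throughput = throughput_tokens_per_s([200] * 8, wall_clock_s=10.0)

print(f"latency: {single_request_latency:.2f} s")
print(f"throughput: {aggregate_throughput:.1f} tokens/s")
```

Note that the two numbers describe different things: the first is what one user waits for, the second is what the whole server produces.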

The latency-throughput tradeoff

LLM inference servers can favor one metric at the expense of the other. Serving each request immediately minimizes latency but underutilizes GPUs. Batching multiple requests together maximizes GPU utilization (throughput) but adds queue wait time (latency). Production serving typically targets a latency SLA and maximizes throughput within that constraint.
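
To make "maximize throughput within a latency SLA" concrete, here is a toy sketch. It assumes a purely hypothetical linear cost model in which one decode step for a batch takes a fixed base time plus a small per-request cost; real servers behave differently, and the constants below are invented for illustration.

```python
from typing import Optional, Tuple


def step_time_ms(batch_size: int, base_ms: int = 20, per_request_ms: int = 2) -> int:
    """Hypothetical time for one decode step; every request in the batch gets one token."""
    return base_ms + per_request_ms * batch_size


def best_batch_size(tpot_sla_ms: int, max_batch: int = 64) -> Optional[Tuple[int, float]]:
    """Return (batch_size, tokens_per_second) with the highest throughput
    whose per-token latency stays within the SLA, or None if no size fits."""
    best = None
    for batch_size in range(1, max_batch + 1):
        tpot_ms = step_time_ms(batch_size)  # each request waits one full step per token
        if tpot_ms > tpot_sla_ms:
            continue  # violates the latency SLA
        tokens_per_s = 1000 * batch_size / tpot_ms
        if best is None or tokens_per_s > best[1]:
            best = (batch_size, tokens_per_s)
    return best


# Under this toy model, a 50 ms TPOT budget allows a batch of 15 requests.
print(best_batch_size(tpot_sla_ms=50))
```

In this model a batch of 1 gives the lowest per-token latency (22 ms) but only about 45 tokens per second, while the largest batch that fits the 50 ms budget gives 300 tokens per second. That gap is the tradeoff described above.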

Practical implications

  • Real-time chat: TTFT matters most - users notice when the first token takes longer than 1-2 seconds
  • Batch processing: Throughput matters most - jobs can run overnight, so the goal is to minimize cost per token
  • Agentic workflows: Latency compounds over many steps - an agent that makes 20 sequential calls at 2 seconds each takes at least 40 seconds
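
The agentic case is simple arithmetic, sketched below under the assumption that the calls run strictly one after another. The function name is hypothetical, and tool execution time between calls is ignored, which is why the result is a lower bound.

```python
def min_agent_latency_s(per_call_latency_s: float, steps: int) -> float:
    """Lower bound on wall-clock time for an agent making sequential model calls."""
    return per_call_latency_s * steps


# The example from the list above: 20 steps at 2 seconds per call.
baseline = min_agent_latency_s(per_call_latency_s=2.0, steps=20)

# Hypothetical faster setup: the same 20 steps at 0.5 seconds per call.
faster = min_agent_latency_s(per_call_latency_s=0.5, steps=20)

print(f"baseline: {baseline:.0f} s, faster: {faster:.0f} s")
```

Because the per-call latency is multiplied by the step count, a saving that looks small on a single call becomes significant across a full agent run.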
