Latency vs Throughput

Latency and throughput trade off against each other in LLM inference, and which one you prioritize drives different model and infrastructure choices depending on your use case.

Definitions

Time to first token (TTFT): How long until the first output token arrives. Critical for streaming interfaces where users see output appear progressively.
Time per output token (TPOT): How long each subsequent token takes to generate after the first. Determines how fast a streamed response appears to progress.
Total latency: Approximately TTFT + (TPOT x output length in tokens). What users experience as end-to-end response time.
Throughput: Total tokens generated per second across all concurrent requests. The key metric for cost and batch processing.
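
The definitions above reduce to simple arithmetic. The following is a minimal sketch; the function names and the example numbers are illustrative assumptions, not measurements from any particular model or server.

```python
def total_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """End-to-end latency for one request: TTFT + (TPOT x output length)."""
    return ttft_s + tpot_s * output_tokens


def throughput_tokens_per_s(total_output_tokens: int, wall_clock_s: float) -> float:
    """Tokens generated per second, summed across all concurrent requests."""
    return total_output_tokens / wall_clock_s


if __name__ == "__main__":
    # Hypothetical request: 0.5 s to first token, 20 ms per token, 200 tokens.
    print(total_latency_s(ttft_s=0.5, tpot_s=0.02, output_tokens=200))

    # Hypothetical server: 32 concurrent requests of 200 tokens finish in 8 s.
    print(throughput_tokens_per_s(total_output_tokens=32 * 200, wall_clock_s=8.0))
```

Note that the two numbers describe different things: the first is what one user waits, the second is what the whole server produces.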

The latency-throughput tradeoff

LLM inference servers can optimize for one or the other. Serving each request immediately minimizes latency but leaves GPUs underutilized. Batching multiple requests together maximizes GPU utilization (throughput) but adds queue wait time (latency). Production serving typically targets a latency SLA and maximizes throughput within that constraint.
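
To make the tradeoff concrete, here is a toy model of picking a batch size under a latency SLA. Everything in it is an assumption chosen for illustration: requests arrive at a fixed gap, a batch waits until it is full, and batch service time grows linearly with batch size. Real servers (for example, those using continuous batching) behave differently, so treat this as a sketch of the reasoning only.

```python
from typing import Optional

# Hypothetical parameters, in milliseconds.
ARRIVAL_GAP_MS = 10      # one new request every 10 ms
BASE_SERVICE_MS = 20     # fixed cost of running one batch
PER_REQUEST_MS = 1       # extra cost per request in the batch


def service_ms(batch_size: int) -> int:
    """Time the GPU spends processing one batch."""
    return BASE_SERVICE_MS + PER_REQUEST_MS * batch_size


def worst_case_latency_ms(batch_size: int) -> int:
    """Latency of the first request in a batch: queue wait plus service time."""
    queue_wait_ms = (batch_size - 1) * ARRIVAL_GAP_MS
    return queue_wait_ms + service_ms(batch_size)


def capacity_requests_per_s(batch_size: int) -> float:
    """Requests per second the GPU can sustain at this batch size."""
    return batch_size * 1000 / service_ms(batch_size)


def pick_batch_size(sla_ms: int, max_batch: int = 64) -> Optional[int]:
    """Highest-capacity batch size whose worst-case latency meets the SLA."""
    allowed = [b for b in range(1, max_batch + 1)
               if worst_case_latency_ms(b) <= sla_ms]
    if not allowed:
        return None
    return max(allowed, key=capacity_requests_per_s)


if __name__ == "__main__":
    for b in (1, 4, 8, 16):
        print(b, worst_case_latency_ms(b), round(capacity_requests_per_s(b), 1))
    print("batch size for a 100 ms SLA:", pick_batch_size(100))
```

In this model, larger batches always raise capacity and always raise worst-case latency, so the SLA is what caps the batch size.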

Practical implications

Real-time chat: TTFT matters most. Users notice when the first token takes over 1-2 seconds to arrive.
Batch processing: Throughput matters most. Jobs can run overnight, so the goal is to minimize cost per token rather than response time.
Agentic workflows: Latency compounds over many steps. A 2-second-per-call agent with 20 sequential steps takes 40 seconds minimum, regardless of throughput.
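
The agentic case can be checked with a few lines of arithmetic. This sketch assumes every step waits for the previous one to finish, which is what makes the per-call latencies add up; the function name is illustrative.

```python
def sequential_agent_latency_s(per_call_latencies_s: list[float]) -> float:
    """Lower bound on wall-clock time when each call must finish before the next starts."""
    return sum(per_call_latencies_s)


if __name__ == "__main__":
    # 20 sequential steps at 2 seconds per model call.
    steps = [2.0] * 20
    print(sequential_agent_latency_s(steps))
```

Higher server throughput does not shrink this sum, because it lets more agents run side by side without making any single call return sooner.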
