Inference

Inference is what happens when you call the API: the model processes your input tokens using its trained weights and generates output tokens one by one. Everything a developer interacts with in production (latency, cost, throughput, context windows) is a property of inference, not training.

Autoregressive inference

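Modern LLMs generate tokens one at a time, left to right. Each generated token is appended to the context and used to predict the next. Because a token cannot be computed until the one before it exists, generation cannot be parallelized across output positions. Training, by contrast, already has the full target sequence and can process every position in parallel, which is why generating a sequence is slow compared with processing a sequence of the same length during training.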
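
The loop described above can be sketched in a few lines. This is a minimal illustration, not a real model: toy_next_token_scores is a hypothetical stand-in for a forward pass, and VOCAB_SIZE and EOS_ID are made-up values. Only the control flow (score, pick, append, repeat) reflects how autoregressive generation works.

```python
import random

VOCAB_SIZE = 8  # toy vocabulary size (assumption for illustration)
EOS_ID = 0      # hypothetical end-of-sequence token id


def toy_next_token_scores(context):
    """Stand-in for a model forward pass: one score per vocabulary entry.

    The scores depend on the whole context, as they would for a real LLM.
    """
    seed = sum((pos + 1) * tok for pos, tok in enumerate(context))
    rng = random.Random(seed)
    return [rng.random() for _ in range(VOCAB_SIZE)]


def generate(prompt, max_new_tokens):
    """Greedy autoregressive decoding: each step needs the previous token."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(context)
        # Greedy choice: take the highest-scoring token.
        next_token = max(range(VOCAB_SIZE), key=scores.__getitem__)
        # The new token becomes part of the input for the next step.
        context.append(next_token)
        if next_token == EOS_ID:
            break
    return context[len(prompt):]


new_tokens = generate([3, 1, 4], max_new_tokens=5)
```

Each iteration has to wait for the previous one to finish, so the steps cannot be run in parallel no matter how much hardware is available.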

Prefill vs decode phases

Inference has two phases:

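Prefill: Process all input tokens in parallel to populate the KV cache (the attention keys and values stored for each position). This phase is compute-bound: it is fast per token, but its total cost grows with prompt length and determines the time to first token.
Decode: Generate one output token at a time, reusing the KV cache so earlier tokens are not reprocessed. This phase is sequential, typically limited by memory bandwidth, and the bottleneck for generation speed.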
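
To make the two phases concrete, here is a toy single-head attention sketch in pure Python. All names (KVCache, prefill, decode_step, the random W_Q/W_K/W_V matrices, the hidden size D) are illustrative assumptions, not any library's API. It shows that a decode step using cached keys and values gives the same result as recomputing them for the whole sequence.

```python
import math
import random

D = 4  # toy hidden size (assumption)
_rng = random.Random(0)


def _rand_matrix():
    return [[_rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]


# Hypothetical projection weights for one attention head.
W_Q, W_K, W_V = _rand_matrix(), _rand_matrix(), _rand_matrix()


def project(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(query, keys, values):
    """Softmax attention of one query over all given keys/values."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(D) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(D)]


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, x):
        self.keys.append(project(W_K, x))
        self.values.append(project(W_V, x))


def prefill(cache, prompt_embeddings):
    # A real system does this as one batched matrix multiply over all
    # positions (and also produces the first output token); the loop here
    # only keeps the toy readable.
    for x in prompt_embeddings:
        cache.append(x)


def decode_step(cache, x):
    # Only the new token's key/value are computed; the rest come from the cache.
    cache.append(x)
    return attend(project(W_Q, x), cache.keys, cache.values)


def attend_without_cache(embeddings):
    # Reference: recompute every key and value from scratch.
    keys = [project(W_K, x) for x in embeddings]
    values = [project(W_V, x) for x in embeddings]
    return attend(project(W_Q, embeddings[-1]), keys, values)
```

Without the cache, every decode step would redo the key and value projections for all earlier tokens. With it, each step does a fixed amount of projection work but has to read the entire cache, which is why decode speed tends to depend on memory bandwidth.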

Inference optimization techniques

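Quantization: Reduce weight precision from 16-bit or 32-bit floating point to 8-bit or 4-bit, cutting memory use and improving throughput, usually with minor quality loss.
FlashAttention: An exact, fused attention kernel that computes attention in tiles without materializing the full attention matrix in GPU memory, reducing memory traffic.
Continuous batching: Add new requests to the running batch and remove finished ones at each decode step, rather than waiting for a full batch to form or complete.
Speculative decoding: Use a small draft model to propose several tokens; the large model verifies them in a single parallel pass and keeps the ones it accepts.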
KV cache optimization: Compress or prune cached key-value pairs to reduce memory footprint during generation.
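
As one example from the list above, quantization can be illustrated with a minimal symmetric int8 scheme. This is a simplified sketch under stated assumptions (one scale per tensor, round-to-nearest); production systems typically use per-channel or per-group scales and more careful calibration. The function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor (a simplifying assumption).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Stored as int8, each weight takes one byte instead of the four bytes of a 32-bit float. The price is the rounding error measured by max_error, which is the source of the quality loss mentioned above.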

Self-hosted vs managed inference

Managed APIs (Anthropic, OpenAI, Groq, Together AI) handle infrastructure and scaling. Self-hosted solutions (vLLM, TGI, Ollama) give control over cost and data privacy but require GPU infrastructure expertise.
