Core Concepts

Inference

The process of running a trained model to generate outputs from inputs - the "serving" phase, as opposed to training.

Inference is what happens when you call the API: the model's weights process your input tokens, and output tokens are generated one by one. Nearly everything a developer deals with in production (latency, cost, throughput, context-window limits) shows up at inference time rather than during training.

Autoregressive inference

Modern LLMs generate output one token at a time, left to right, with one forward pass per generated token. Each generated token is appended to the context and used to predict the next. This sequential dependency is the fundamental reason generation cannot be parallelized across output positions, whereas training processes every position of an already-known sequence in a single pass.
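The loop described above can be sketched in a few lines. This is a minimal illustration, not a real model: `toy_next_token` is a hypothetical stand-in for a forward pass plus greedy sampling, and the `EOS` token id is an assumption made for the example.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id (an assumption for this sketch)


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for one model forward pass with greedy sampling.

    The result depends on the entire context, which is what forces
    generation to proceed one step at a time.
    """
    return (sum(context) * 31 + len(context)) % 10


def generate(
    prompt: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int,
) -> List[int]:
    """Autoregressive loop: predict, append, repeat."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)  # step t needs every token before t
        context.append(token)        # the new token becomes part of the input
        if token == EOS:
            break
    return context[len(prompt):]


if __name__ == "__main__":
    print(generate([3, 5, 7], toy_next_token, max_new_tokens=5))
```

Because each call to `next_token` needs the token produced by the previous call, the iterations of the loop cannot run concurrently for a single sequence.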

Prefill vs decode phases

Inference has two phases:

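  • Prefill: Process all input tokens in a single parallel pass to populate the KV cache (the stored attention keys and values for each token). This phase is highly parallel and efficient per token, but its duration, and therefore the time to first token, still grows with input length.
  • Decode: Generate one output token at a time, reusing the KV cache so earlier tokens are not reprocessed. This phase is sequential, typically limited by memory bandwidth, and is the bottleneck for generation speed.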
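To make the two phases concrete, here is a toy single-layer, single-head attention sketch in plain Python. Everything in it is an assumption made for illustration: `embed` and `project` are hypothetical stand-ins for learned embeddings and projection matrices, and the prefill loop runs sequentially here even though a real kernel computes all prompt positions in parallel.

```python
import math
from typing import List, Tuple

Vector = List[float]


def embed(token: int) -> Vector:
    """Hypothetical toy embedding: a fixed 4-dimensional vector per token id."""
    return [math.sin(token + i) for i in range(4)]


def project(x: Vector, scale: float) -> Vector:
    """Stand-in for a learned linear projection (W_q, W_k or W_v)."""
    return [scale * v for v in x]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Softmax attention of one query over all cached keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(len(q))
    ]


class KVCache:
    """Holds one key and one value vector per token seen so far."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def append(self, x: Vector) -> None:
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))


def prefill(prompt: List[int]) -> Tuple[KVCache, Vector]:
    """Build the cache for the whole prompt and return the last position's output."""
    cache = KVCache()
    for token in prompt:  # a real kernel handles all positions in one batched pass
        cache.append(embed(token))
    query = project(embed(prompt[-1]), 1.0)
    return cache, attend(query, cache.keys, cache.values)


def decode_step(cache: KVCache, token: int) -> Vector:
    """Process one new token: add a single K/V pair and attend over the cache."""
    x = embed(token)
    cache.append(x)
    return attend(project(x, 1.0), cache.keys, cache.values)


def full_recompute(tokens: List[int]) -> Vector:
    """Reference path without cache reuse: reprocess every token from scratch."""
    return prefill(tokens)[1]
```

Calling `prefill([1, 2, 3])` and then `decode_step(cache, 4)` yields the same vector as `full_recompute([1, 2, 3, 4])`, but the decode step computes only one new key/value pair instead of four. That saving is the purpose of the cache. The cost is that each decode step must read the whole cache, which grows with sequence length.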

Inference optimization techniques

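  • Quantization: Reduce weight precision from 16-bit or 32-bit floating point to 8-bit or 4-bit, cutting memory use and improving throughput, usually with minor quality loss.
  • FlashAttention: Fused, exact attention kernel that computes attention in tiles without materializing the full attention matrix, reducing reads and writes to GPU memory.
  • Continuous batching: Add new requests to the running batch at each decode step as earlier requests finish, rather than waiting for a full batch to form or complete.
  • Speculative decoding: Use a small draft model to propose several tokens ahead; the large model verifies them in one parallel pass and keeps the ones it agrees with.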
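As one worked example from the list above, speculative decoding can be sketched as follows. This is a simplified greedy variant under stated assumptions: `target_model` and `draft_model` are hypothetical toy functions, not real models, and the verification step is written as a Python loop where a real system would score all proposed positions in a single batched forward pass. Production implementations that sample from probability distributions use a rejection-sampling acceptance rule instead of the exact-match check shown here.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a context to its greedy next token


def target_model(ctx: List[int]) -> int:
    """Hypothetical 'large' model: a deterministic toy rule."""
    return (sum(ctx) * 7 + len(ctx)) % 11


def draft_model(ctx: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target most of the time."""
    guess = target_model(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 11


def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    """Baseline: one model call per generated token."""
    ctx = list(prompt)
    for _ in range(n):
        ctx.append(model(ctx))
    return ctx[len(prompt):]


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n: int, k: int = 4
) -> List[int]:
    """Greedy speculative decoding: draft proposes k tokens, target verifies."""
    ctx = list(prompt)
    while len(ctx) - len(prompt) < n:
        # 1. The draft model proposes k tokens sequentially (cheap per step).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(ctx + proposal))

        # 2. The target model gives its own choice at each of the k + 1
        #    positions. A real system computes these in one parallel pass.
        verified = [target(ctx + proposal[:i]) for i in range(k + 1)]

        # 3. Accept the longest prefix on which both models agree, then
        #    append the target's token at the first disagreement (or a
        #    bonus token if every proposal was accepted).
        accepted = 0
        while accepted < k and proposal[accepted] == verified[accepted]:
            accepted += 1
        ctx.extend(proposal[:accepted])
        ctx.append(verified[accepted])
    return ctx[len(prompt):len(prompt) + n]
```

In this greedy variant the output is identical to `greedy_decode(target_model, ...)`, so the draft model affects only speed, never the result. The speedup depends on how often the draft's proposals are accepted.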

Self-hosted vs managed inference

Managed APIs (Anthropic, OpenAI, Groq, Together AI) handle the infrastructure for you. Self-hosting with a serving stack such as vLLM, TGI, or Ollama gives you control over cost and data privacy but requires GPU infrastructure expertise.
