KV Cache

During generation, an LLM attends over every token in the context at each step. Without caching, the model would recompute the key and value tensors for the entire context at every step, so generating a 1,000-token response on a 10,000-token context would mean reprocessing at least 10,000 tokens a thousand times, on the order of 10 million redundant token computations. The KV (key-value) cache stores the attention key and value tensors for all processed tokens, so each step computes the query, key, and value only for the new token and attends over the cached keys and values rather than recomputing them from scratch.
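The mechanism described above can be sketched in a few lines. This is a minimal illustration, assuming a single attention head, a single layer, and tiny random weights; the names `step_cached`, `step_uncached`, and the dict-based cache are hypothetical and not taken from any particular framework. It checks that decoding with a cache gives the same output as recomputing every key and value at each step.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def step_uncached(xs, Wq, Wk, Wv):
    """Recompute keys and values for every token seen so far, then attend."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)

def step_cached(x, Wq, Wk, Wv, cache):
    """Project only the new token, append its K/V to the cache, then attend."""
    cache["k"].append(matvec(Wk, x))
    cache["v"].append(matvec(Wv, x))
    return attend(matvec(Wq, x), cache["k"], cache["v"])

# Toy setup: hidden size 4, six token vectors (illustrative values only).
random.seed(0)
D = 4

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]

Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(6)]

cache = {"k": [], "v": []}
max_diff = 0.0
for t in range(len(tokens)):
    cached = step_cached(tokens[t], Wq, Wk, Wv, cache)
    full = step_uncached(tokens[: t + 1], Wq, Wk, Wv)
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(cached, full)))

print("largest difference between cached and uncached outputs:", max_diff)
print("cache entries after decoding:", len(cache["k"]))
```

The uncached path performs one key and one value projection per context token at every step, while the cached path performs exactly one of each per step. The attention read over all cached entries remains in both cases.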

Why it matters

KV cache is essential for practical autoregressive generation with large context windows. Without it, every generation step on a 1M-token context would rerun the full forward pass over all 1M tokens, making inference prohibitively expensive. With the cache, only the newly generated token requires fresh key and value computation; each step still reads the cached entries, but per-token latency drops significantly. For a typical deployment serving multiple concurrent requests, KV cache memory management is often the primary bottleneck for throughput and cost.

Prefix caching

API providers including Anthropic, OpenAI, and Google offer prefix caching: when the same system prompt or document prefix appears across multiple requests, the KV cache for those tokens is computed once and reused. Depending on the provider and workload, this can reduce costs by 50-90% and latency by 50-80% for prompts with long shared prefixes, which is particularly valuable for applications with lengthy system prompts or fixed document contexts. Because each token's keys and values depend on every token before it, reuse requires the prefix to match exactly from the start of the prompt. Prefix caching has become standard across major inference platforms since 2024.
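One way to picture prefix reuse is a lookup keyed by hashes of token blocks. The sketch below is a toy model, not any provider's implementation: the block size, the `PrefixCache` class, and the placeholder strings standing in for stored key and value tensors are all assumptions made for illustration. Each block's hash is chained to the previous block's hash, so a hit on block N implies the whole prefix up to N matched.

```python
import hashlib

BLOCK = 4  # tokens per cache block (hypothetical; chosen small for the demo)

def block_hashes(tokens):
    """Chain-hash each full block so a hash identifies the entire prefix up to it."""
    hashes, prev = [], ""
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for start in range(0, full, BLOCK):
        block = tokens[start:start + BLOCK]
        payload = prev + "|" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # block hash -> placeholder for cached K/V tensors

    def match(self, tokens):
        """Return how many leading tokens already have cached K/V."""
        hit = 0
        for h in block_hashes(tokens):
            if h not in self.blocks:
                break
            hit += BLOCK
        return hit

    def insert(self, tokens):
        """Record every full block of this request as cached."""
        for i, h in enumerate(block_hashes(tokens)):
            self.blocks.setdefault(h, f"kv-block-{i}")

def process(cache, tokens):
    """Return (tokens served from cache, tokens that need fresh computation)."""
    hit = cache.match(tokens)
    cache.insert(tokens)
    return hit, len(tokens) - hit

# Two requests sharing a 12-token "system prompt" (made-up token ids).
system_prompt = list(range(100, 112))
cache = PrefixCache()
first = process(cache, system_prompt + [1, 2, 3])
second = process(cache, system_prompt + [7, 8, 9, 10, 11])

print("first request  (cached, computed):", first)   # (0, 15)
print("second request (cached, computed):", second)  # (12, 5)
```

The second request reuses the three blocks covering the shared prompt and computes only its own five tokens. Changing a single token early in the prompt would change every later chained hash, which is why a mismatch near the start forfeits the whole cached prefix in this scheme.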

Memory constraints and optimization

KV caches grow linearly with context length and batch size. For large context windows (1M+ tokens), the KV cache can exceed the model weights in size, which limits the number of concurrent requests a GPU cluster can handle at maximum context length. Techniques like multi-query attention (MQA) and grouped-query attention (GQA) reduce KV cache memory by sharing key and value heads across query heads. FlashAttention and related implementations improve computational efficiency without reducing cache size. KV cache quantization, which reduces key and value precision from float16 to int8 or lower, has emerged as another way to manage memory constraints, typically with little loss in quality.
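The linear growth can be made concrete with a back-of-the-envelope size estimate: two tensors (keys and values) per layer, each holding one vector of size head_dim per KV head per token. The configuration below (32 layers, 32 query heads, head dimension 128) is a hypothetical example rather than the specification of any particular model, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Estimate KV cache size: keys plus values, for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

GIB = 1024 ** 3
LAYERS, HEAD_DIM, SEQ_LEN, BATCH = 32, 128, 4096, 1  # hypothetical config

# Multi-head attention: one KV head per query head (32), float16 (2 bytes).
mha = kv_cache_bytes(LAYERS, 32, HEAD_DIM, SEQ_LEN, BATCH, 2)
# Grouped-query attention: 8 KV heads shared across the 32 query heads.
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN, BATCH, 2)
# Multi-query attention: a single KV head shared by all query heads.
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN, BATCH, 2)
# GQA combined with int8 quantization of keys and values (1 byte each).
gqa_int8 = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN, BATCH, 1)

for name, size in [("MHA fp16", mha), ("GQA fp16", gqa),
                   ("MQA fp16", mqa), ("GQA int8", gqa_int8)]:
    print(f"{name}: {size / GIB:.4f} GiB")
# MHA fp16: 2.0000 GiB
# GQA fp16: 0.5000 GiB
# MQA fp16: 0.0625 GiB
# GQA int8: 0.2500 GiB
```

Doubling either `seq_len` or `batch` doubles every figure. This estimate ignores allocator overhead and any per-block metadata a real serving system would add.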
