KV Cache
A performance optimization that stores the attention key and value tensors computed for previously processed tokens, so they are not recomputed at every inference step.
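During autoregressive generation, an LLM produces one token at a time, and each new token attends to every token already in the context. Without caching, each step would rerun the model over the whole sequence: generating a 1,000-token response on a 10,000-token context would recompute keys and values for more than 10,000 tokens a thousand times over, and the number of query-key comparisons per attention head would run into the tens of billions. The KV (key-value) cache stores the attention key and value tensors for every processed token in each layer. Because attention is causal, a token's keys and values never depend on later tokens, so they can be reused unchanged. Each step then computes a query, key, and value only for the newest token and attends over the cached entries, which brings the same example down to roughly 10 million comparisons per head after the initial pass over the prompt.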
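The mechanism can be sketched for a single attention head in plain Python. This is a toy illustration, not any framework's implementation: the names `KVCache`, `step_cached`, `step_uncached`, and `max_difference` are made up for the example, the weights are random, and a real model would repeat this per head and per layer on tensors rather than lists.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    # W is a list of rows, so this is the matrix-vector product W @ x.
    return [dot(row, x) for row in W]


def attend(q, keys, values):
    # Scaled dot-product attention of one query over all keys/values.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(len(values[0]))
    ]


class KVCache:
    """Hypothetical per-head cache: one key and one value per token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def step_cached(x, Wq, Wk, Wv, cache):
    # Project only the newest token, then attend over everything cached.
    cache.append(matvec(Wk, x), matvec(Wv, x))
    return attend(matvec(Wq, x), cache.keys, cache.values)


def step_uncached(xs, Wq, Wk, Wv):
    # Re-project every token seen so far before attending.
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)


def max_difference(num_tokens=6, dim=4, seed=0):
    # Largest gap between cached and uncached outputs across all steps.
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]

    Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
    xs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
    cache = KVCache()
    worst = 0.0
    for t in range(num_tokens):
        cached = step_cached(xs[t], Wq, Wk, Wv, cache)
        uncached = step_uncached(xs[: t + 1], Wq, Wk, Wv)
        worst = max(worst, max(abs(a - b) for a, b in zip(cached, uncached)))
    return worst
```

Calling `max_difference()` compares the two paths step by step. Both produce the same attention output, but the cached path performs two projections per step for keys and values instead of two per token in the sequence.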
Why it matters
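The KV cache is what makes interactive use of LLMs with large context windows practical. Without it, every generation step would reprocess the entire context, so a 1M-token context would mean a full forward pass over roughly 1M tokens for each new token, which is prohibitively expensive. With the cache, each step runs the model only on the newest token. Attention still reads the cached keys and values for the whole context, so per-step cost grows linearly with context length instead of quadratically.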
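A rough way to see the gap is to count query-key comparisons per attention head per layer. The sketch below makes two simplifying assumptions: the uncached version redoes full causal attention over the whole sequence at every step, and the cached count excludes the one-time pass over the prompt. The function names are illustrative only.

```python
def comparisons_with_cache(prompt_len, new_tokens):
    # Each decode step scores one new query against every position so far.
    return sum(prompt_len + i for i in range(1, new_tokens + 1))


def comparisons_without_cache(prompt_len, new_tokens):
    # Each step redoes causal attention for the whole sequence:
    # n * (n + 1) / 2 query-key pairs for a sequence of n tokens.
    total = 0
    for i in range(1, new_tokens + 1):
        n = prompt_len + i
        total += n * (n + 1) // 2
    return total
```

For a 10,000-token prompt and 1,000 generated tokens, this gives 10,500,500 comparisons with the cache and about 55 billion without it, a difference of more than three orders of magnitude.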
Prefix caching
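API providers (Anthropic, OpenAI, Google) offer prefix caching: if many requests begin with exactly the same tokens, such as a shared system prompt, the KV cache for that prefix is computed once and reused. Only an identical leading prefix can be reused, so a change early in the prompt invalidates everything after it. Savings can reach about 90% on the cost of cached input tokens and 50-80% on latency for prompts with long shared prefixes, though the exact figures vary by provider and model. This is useful for applications with lengthy system prompts or fixed document contexts.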
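The reuse rule can be illustrated with a toy lookup. This is a sketch under stated assumptions, not how any named provider implements caching: the class `ToyPrefixCache`, its fixed block size, and the choice to key each block on the full token prefix are all hypothetical, and the "cache" records only which prefixes have been seen rather than storing real key and value tensors.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPrefixCache:
    block_size: int = 4  # hypothetical granularity for cache entries
    seen: set = field(default_factory=set)

    def process(self, tokens):
        """Return (reused, computed) token counts for one request."""
        reused = 0
        still_matching = True
        # Only whole blocks are cacheable; a trailing partial block is not.
        full = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            # Keying on the entire prefix means a block is reusable only
            # if every token before it is identical too.
            key = tuple(tokens[:end])
            if still_matching and key in self.seen:
                reused = end
            else:
                still_matching = False
                self.seen.add(key)
        return reused, len(tokens) - reused
```

A short usage example, with integers standing in for token IDs:

```python
cache = ToyPrefixCache(block_size=4)
system_prompt = list(range(12))

first = cache.process(system_prompt + [100, 101])        # nothing cached yet
second = cache.process(system_prompt + [200, 201, 202])  # shared prefix reused
shifted = cache.process([99] + system_prompt)            # prefix differs at token 0
```

Here `second` reuses all 12 system-prompt tokens, while `shifted` reuses none, because inserting a single token at the start changes every prefix.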
Memory constraints
KV caches grow linearly with context length and batch size. For large context windows (1M+ tokens), the KV cache can be larger than the model weights themselves. This limits how many concurrent requests a given GPU cluster can handle at maximum context length. Attention variants that share key and value heads across query heads (multi-query attention, grouped-query attention) shrink the cache directly. FlashAttention reduces the memory used while computing attention but does not change the size of the cache itself.
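The linear growth follows from a simple size estimate: one key and one value vector per token, per layer, per key-value head. The sketch below uses an assumed model shape (32 layers, 32 heads of dimension 128, 16-bit values) chosen only for illustration. It ignores implementation overheads such as padding or paging, and includes grouped-query attention, a related variant that shares each key-value head among several query heads, for comparison.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # Factor of 2: one key vector plus one value vector per token.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, not tied to any specific released model.
LAYERS, HEADS, HEAD_DIM = 32, 32, 128

per_token = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, seq_len=1)
full_heads = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, seq_len=1_000_000)
grouped = kv_cache_bytes(LAYERS, 8, HEAD_DIM, seq_len=1_000_000)      # 8 KV heads
multi_query = kv_cache_bytes(LAYERS, 1, HEAD_DIM, seq_len=1_000_000)  # 1 KV head
```

With these assumed dimensions the cache costs 524,288 bytes (0.5 MiB) per token, so a single 1M-token request needs roughly 524 GB with full multi-head attention. Using 8 key-value heads cuts that by 4x, and multi-query attention by 32x.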