Context Window

The context window is the LLM's working memory. Everything the model processes during a single inference call - your system prompt, conversation history, retrieved documents, and tool outputs - must fit within this limit. The model generates output tokens until it completes the task or reaches the output length limit.

Input vs output tokens

Context window size typically refers to the total token budget, which input and output tokens share. A 200K context model could handle 190K input tokens (a large codebase) and up to 10K output tokens, or 100K input and 100K output, depending on the task. In practice, many APIs also impose a separate, smaller cap on output tokens, so the second split is not always available.
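The budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not any provider's API: the function name `max_output_tokens` and the optional `output_cap` parameter are assumptions introduced here for the example.

```python
from typing import Optional


def max_output_tokens(
    context_window: int,
    input_tokens: int,
    output_cap: Optional[int] = None,  # hypothetical per-request output limit
) -> int:
    """Return how many tokens are left for output once the input is counted."""
    if input_tokens > context_window:
        raise ValueError("input alone exceeds the context window")
    remaining = context_window - input_tokens
    # Some APIs cap output separately; apply that cap if one is given.
    if output_cap is not None:
        remaining = min(remaining, output_cap)
    return remaining


if __name__ == "__main__":
    print(max_output_tokens(200_000, 190_000))         # large codebase as input
    print(max_output_tokens(200_000, 100_000))         # even split
    print(max_output_tokens(200_000, 100_000, 8_192))  # with an assumed output cap
```

Checking the remaining budget before sending a request avoids truncated responses when the input is close to the limit.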

How models handle long contexts

At inference time, models use the KV cache to store the attention keys and values of tokens already processed, so they are not recomputed at every generation step. Despite this optimization, longer contexts cost more (most APIs price by token) and are slower to process. Some models also recall information placed in the middle of a very long context less reliably than information near the beginning or end, a phenomenon called the "lost in the middle" effect.
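To make the cost of long contexts concrete, here is a rough sketch of the usual back-of-the-envelope estimate for KV cache memory in a transformer: one key and one value vector per token, per layer, per key/value head. The function name `kv_cache_bytes` and the model dimensions in the example (32 layers, 8 key/value heads, head size 128, 16-bit values) are illustrative assumptions, not figures from this article.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumes 16-bit (fp16/bf16) cache entries
) -> int:
    """Estimate KV cache size in bytes for a single sequence."""
    # Factor of 2: each token stores both a key and a value vector.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model dimensions, chosen only for illustration.
    for tokens in (8_192, 32_768, 131_072):
        size_gib = kv_cache_bytes(tokens, 32, 8, 128) / 1024**3
        print(f"{tokens:>7} tokens -> {size_gib:.1f} GiB")
```

The estimate grows linearly with sequence length, which is one reason long-context requests are more expensive to serve even when the cache avoids recomputation.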

Current context window sizes

Claude 3.5 Sonnet: 200K tokens
Claude 3.7 Sonnet: 200K tokens
GPT-4o: 128K tokens
Gemini 2.0 Flash: 1M tokens
Llama 3.1 405B: 128K tokens
Most open-source 7-13B models: 8K to 32K tokens

Practical implications

Larger context windows don't eliminate the need for RAG (Retrieval-Augmented Generation). Stuffing a massive knowledge base into every prompt is expensive and often degrades quality. The better approach is using RAG to retrieve only relevant pieces, then assembling them into a focused, high-quality prompt. Very long contexts (above 100K tokens) are most useful for tasks like codebase Q&A, long document analysis, and conversation history preservation.
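The retrieve-then-assemble idea can be sketched as follows. This is a deliberately simple stand-in, not a production RAG pipeline: relevance is scored by word overlap rather than embeddings, token counts use a rough four-characters-per-token heuristic, and the names `estimate_tokens`, `overlap_score`, and `build_context` are hypothetical.

```python
import re
from typing import List


def estimate_tokens(text: str) -> int:
    """Rough token estimate; assumes about 4 characters per token."""
    return max(1, len(text) // 4)


def overlap_score(query: str, chunk: str) -> int:
    """Count distinct words shared by the query and the chunk."""
    query_words = set(re.findall(r"\w+", query.lower()))
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    return len(query_words & chunk_words)


def build_context(query: str, chunks: List[str], token_budget: int) -> List[str]:
    """Greedily pick the most relevant chunks that fit within the token budget."""
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    selected: List[str] = []
    used = 0
    for chunk in ranked:
        if overlap_score(query, chunk) == 0:
            break  # remaining chunks share no words with the query
        cost = estimate_tokens(chunk)
        if used + cost > token_budget:
            continue  # too large for the remaining budget; try smaller chunks
        selected.append(chunk)
        used += cost
    return selected


if __name__ == "__main__":
    docs = [
        "The context window limits input and output tokens.",
        "Bananas are rich in potassium.",
        "Output tokens count against the context window budget.",
    ]
    for piece in build_context("context window tokens", docs, token_budget=100):
        print(piece)
```

A real system would replace the overlap score with vector similarity and use the model's own tokenizer for counting, but the budget-aware selection step stays the same.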
