Context Window
The maximum amount of text an LLM can process at once - both your input and the model's output count against the limit.
The context window is the LLM's working memory. Everything the model "sees" during a single inference call - your system prompt, conversation history, retrieved documents, and tool outputs - must fit within this limit. The model generates output tokens until it completes the task or hits the output length limit.
Input vs output tokens
Context window size typically refers to the total token budget, and both input and output tokens draw from it. A 200K context model could handle 190K input tokens (a large codebase) and up to 10K output tokens, or 100K input and 100K output, depending on the task. In practice, many APIs also impose a separate, smaller cap on output tokens, so the usable output length is whichever limit is reached first.
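The budget arithmetic above can be sketched in a few lines. This is a minimal illustration, not any provider's API: the function name `max_output_tokens` and the optional `output_cap` parameter are hypothetical, and real token counts would come from the model's own tokenizer.

```python
from typing import Optional


def max_output_tokens(context_window: int,
                      input_tokens: int,
                      output_cap: Optional[int] = None) -> int:
    """Return how many output tokens remain once the input is counted.

    output_cap is a hypothetical separate limit on generated tokens;
    pass None if the model has no such limit.
    """
    if input_tokens > context_window:
        raise ValueError("input alone exceeds the context window")
    remaining = context_window - input_tokens
    if output_cap is not None:
        remaining = min(remaining, output_cap)
    return remaining


# The two splits from the text, for a 200K context model
print(max_output_tokens(200_000, 190_000))  # 10000
print(max_output_tokens(200_000, 100_000))  # 100000
```

Checking the budget before sending a request makes it possible to trim the input deliberately instead of discovering a truncated response afterwards.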
How models handle long contexts
At inference time, models use a KV cache to store the attention keys and values of tokens already processed, so they are not recomputed at every generation step. Despite this optimization, longer contexts cost more (most APIs price by token) and are slower to process. Some models also recall information placed in the middle of a very long context less reliably than information near the beginning or end - the so-called "lost in the middle" effect.
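To give a feel for why long contexts are expensive even with a KV cache, here is a rough sketch of how the cache grows with context length. It assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token. The model shape used (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) is hypothetical and not taken from any model named in this entry.

```python
def kv_cache_bytes(seq_len: int,
                   n_layers: int,
                   n_kv_heads: int,
                   head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape; cache size grows linearly with context length
for tokens in (8_192, 131_072, 1_048_576):
    gib = kv_cache_bytes(tokens, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:.0f} GiB")
```

Under these assumptions each token adds 128 KiB to the cache, so memory scales linearly with the number of tokens held in context.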
Current context window sizes (2025-2026)
- Gemini 2.5 Pro / Flash: 1M - 2M tokens
- Llama 4 Scout: 10M tokens (though quality at extreme lengths varies)
- Claude 4 family: 200K tokens
- GPT-4o: 128K tokens; o1: 200K tokens
- Most open-source 7-13B models: 8K - 32K tokens
Practical implications
Larger context windows don't eliminate the need for RAG - stuffing a massive knowledge base into every prompt is expensive. The sweet spot is using RAG to retrieve only the relevant pieces, then using the context window to assemble a precise, high-quality prompt. Very long contexts (>100K tokens) are most useful for tasks like codebase Q&A, long document analysis, and conversation history preservation.
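The "retrieve, then assemble" step can be sketched as a simple greedy packer that keeps the highest-scoring retrieved chunks that fit a token budget. This is one possible approach, not a prescribed one: `rough_token_count` and `pack_chunks` are hypothetical helpers, the relevance scores are assumed to come from whatever retriever is in use, and the 4-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
from typing import List, Tuple


def rough_token_count(text: str) -> int:
    # Assumption: roughly 4 characters per token; use a real tokenizer in practice
    return max(1, len(text) // 4)


def pack_chunks(chunks: List[Tuple[float, str]], budget: int) -> List[str]:
    """Pick the highest-scoring chunks whose estimated token total fits the budget."""
    selected: List[str] = []
    used = 0
    # Visit chunks from most to least relevant
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = rough_token_count(text)
        if used + cost > budget:
            continue  # too large for the remaining budget; a smaller chunk may still fit
        selected.append(text)
        used += cost
    return selected


# Illustrative (score, text) pairs as a retriever might return them
retrieved = [
    (0.9, "a" * 40),   # ~10 tokens
    (0.5, "b" * 400),  # ~100 tokens
    (0.7, "c" * 40),   # ~10 tokens
]
context = pack_chunks(retrieved, budget=25)
print(len(context))  # 2
```

The selected chunks can then be concatenated into the prompt, leaving the rest of the context window for instructions, conversation history, and the model's output.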
Models relevant to Context Window
- Gemini 2.5 Pro: Google's most capable model with a 1M token context and top science benchmark scores
- Claude Sonnet 4.6: Anthropic's best balance of speed, intelligence, and cost for production workloads
- Llama 4: Meta's multimodal open-weights model family with a 10M context window variant