Context Window

The maximum amount of text, measured in tokens, that an LLM can process in a single call. Both your input and the model's output count against the limit.

The context window is the LLM's working memory. Everything the model "sees" during a single inference call (your system prompt, conversation history, retrieved documents, and tool outputs) must fit within this limit. The model then generates output tokens until it completes the task, hits the output length limit, or runs out of room in the window.

Input vs output tokens

Context window size typically refers to the total token budget, which input and output tokens share. A model with a 200K-token window could take 190K input tokens (a large codebase) and leave up to 10K for output, or split the budget as 100K input and 100K output, depending on the task. Many providers also cap output length separately, so the usable output can be smaller than the space left in the window.
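The budget arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `max_output_tokens` and the optional `output_cap` parameter are hypothetical, and the cap models the assumption that a provider may limit output length separately from the total window.

```python
from typing import Optional


def max_output_tokens(context_window: int, input_tokens: int,
                      output_cap: Optional[int] = None) -> int:
    """Return how many output tokens remain once the input is counted.

    output_cap is a hypothetical, optional per-request output limit.
    """
    if input_tokens > context_window:
        raise ValueError("input alone exceeds the context window")
    remaining = context_window - input_tokens
    if output_cap is not None:
        remaining = min(remaining, output_cap)
    return remaining


# The two splits of a 200K window described in the text.
print(max_output_tokens(200_000, 190_000))
print(max_output_tokens(200_000, 100_000))
```

Passing `output_cap` simply takes the smaller of the two limits, which is why a large window does not guarantee a long response.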

How models handle long contexts

At inference time, models use a KV cache, which stores the attention keys and values already computed for earlier tokens so they are not recomputed at every generation step. Despite this optimization, longer contexts cost more (most APIs price by token) and are slower to process. Some models also lose accuracy on information placed in the middle of very long contexts, the so-called "lost in the middle" effect.
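A rough estimate of KV cache size helps show why long contexts are expensive to serve. The sketch below assumes a standard transformer that stores one key vector and one value vector per layer, per KV head, per token, in 16-bit precision. The model dimensions are hypothetical and do not describe any model listed in this entry.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Estimate the KV cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit (fp16/bf16) storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
for tokens in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(32, 8, 128, tokens) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:.2f} GiB")
```

Under these assumptions the cache grows linearly with context length, so memory per request scales directly with the number of tokens kept in the window.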

Current context window sizes (2025-2026)

  • Gemini 2.5 Pro / Flash: 1M - 2M tokens
  • Llama 4 Scout: 10M tokens (though quality at extreme lengths varies)
  • Claude 4 family: 200K tokens
  • GPT-4o: 128K tokens
  • o1: 200K tokens
  • Most open-source 7-13B models: 8K - 32K tokens

Practical implications

Larger context windows don't eliminate the need for RAG: stuffing a massive knowledge base into every prompt is expensive. The sweet spot is using RAG to retrieve only the relevant pieces, then using the context window to assemble a precise, high-quality prompt. Very long contexts (>100K tokens) are most useful for tasks like codebase Q&A, long-document analysis, and preserving long conversation histories.
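One way to picture "retrieve only the relevant pieces" is a greedy packer that fills a fixed token budget with the highest-scoring chunks. This is a minimal sketch, not a production retriever: `pack_chunks` and `estimate_tokens` are hypothetical names, the relevance scores are made up, and the word-count estimate is a crude stand-in for the model's real tokenizer.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def pack_chunks(chunks: List[Tuple[float, str]], budget: int) -> List[str]:
    """Greedily select the highest-scoring chunks that fit in the token budget.

    chunks: (relevance_score, text) pairs, e.g. from a retriever.
    budget: tokens left for retrieved context after the system prompt,
            the question, and the reserved output are subtracted.
    """
    selected: List[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            selected.append(text)
            used += cost
    return selected


# Illustrative data only; scores and texts are invented for this example.
retrieved = [
    (0.91, "The context window limits input plus output tokens."),
    (0.40, "Unrelated release notes from an older version of the product."),
    (0.78, "Retrieved documents count against the same budget."),
]
print(pack_chunks(retrieved, budget=16))
```

With a budget of 16 estimated tokens, the two most relevant chunks fit and the low-scoring one is left out, which keeps the prompt small even when the window could hold far more.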
