What Is a Context Window?

The maximum number of tokens an LLM can process in a single request, covering prompt, retrieved context, tool output, and generated response.

A context window is the maximum number of tokens a large language model can attend to in a single request: system prompt, user input, retrieved RAG chunks, tool observations, and the generated response combined. As of May 2026, the headline numbers have shifted hard: Claude Opus 4.7 ships with a 200k default and a 1M-token enterprise tier, GPT-5.1 sits at 400k native with a 1M opt-in, Gemini 3 Pro advertises a 2M window with active research toward 10M, and Llama 4 Scout open-weights ship at 10M tokens on the long-context branch. The window is a hard budget: anything beyond it is truncated by the API or the inference engine, and cost and latency grow with how much of it you actually fill. That makes the context window a retrieval-budgeting, chunking, and prompt-engineering concern, not just a model spec.

For a senior engineer in 2026 the right mental model is not “how big is the window?” but “how cheaply and reliably can I fill the first 30% and ignore the rest?” Long-context benchmarks like RULER, LongBench v2, and BABILong have made it clear that effective context (the region where models actually use information) almost never matches the advertised maximum.

Why context window matters in production LLM and agent systems

The context window is where retrieval, prompt design, agent state, and cost all collide. In a RAG pipeline you trade retrieved-chunk count and chunk size against the budget left for instructions and the answer. In an agent, every tool observation eats from the same budget; by step 8 of a 12-step trajectory, the prompt may already be 180k tokens on Claude Opus 4.7, leaving little room for the model to reason. Get the budget wrong and four failure modes show up in production traces.

The first is silent truncation. The API or inference server drops the oldest tokens, often the system prompt, without raising an error, so the model now answers without knowing the rules it was supposed to follow. The second is the “lost in the middle” effect: long-context models still under-weight middle-of-prompt tokens, so a critical fact buried at position 250,000 in a 1M-token prompt may simply be ignored even though the model formally “sees” it. RULER and LongBench v2 both show effective context falling 30–60% short of the advertised window for most 2026 models. The third is cost: input tokens are billed per call, so a 500k-token prompt costs roughly 50× a 10k-token prompt at the same per-token rate, and many providers charge a long-context premium above 200k. The fourth is latency: prefill time scales close to linearly with prompt length, and a 1M-token prompt can add 30–60 seconds to time-to-first-token on top of the per-token decode cost.
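The cost argument is simple arithmetic, and a rough sketch makes the scaling visible. The function below is illustrative only: `prompt_cost`, the per-million-token price, and the way the premium is applied (only to tokens above the threshold) are assumptions, and a real provider may bill the whole prompt at the premium rate instead.

```python
def prompt_cost(
    prompt_tokens: int,
    price_per_mtok: float,
    premium_threshold: int = 200_000,
    premium_multiplier: float = 1.0,
) -> float:
    """Input cost in dollars for one call.

    Assumption: tokens above `premium_threshold` are billed at
    `price_per_mtok * premium_multiplier`; tokens below it at the base price.
    """
    base_tokens = min(prompt_tokens, premium_threshold)
    premium_tokens = max(prompt_tokens - premium_threshold, 0)
    billed = base_tokens * price_per_mtok
    billed += premium_tokens * price_per_mtok * premium_multiplier
    return billed / 1_000_000


# Hypothetical price of $3 per million input tokens, no premium.
ratio = prompt_cost(500_000, 3.0) / prompt_cost(10_000, 3.0)
```

At a flat rate the ratio comes out at about 50, matching the figure quoted above; any `premium_multiplier` above 1.0 pushes it higher.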

The pain is concrete. A retrieval engineer watches ContextRecall scores collapse on a multi-hop dataset because a re-ranker is dropping the right chunk to fit a 32k window. An SRE watches p99 latency spike whenever an agent crosses the 200k-prompt boundary on Opus 4.7. A finance lead sees a 9× cost increase on the “summarize this 300-page PDF” feature after switching to a long-context model, with no quality lift to justify it. In 2026 agent stacks running across GPT-5.1, Claude Opus 4.7, Gemini 3 Pro, and Llama 4 (each with different windows, different long-context pricing, and different positional bias), context-window planning becomes a routing-policy problem, not a static config. This is exactly the surface Agent Command Center is built to manage.

The agent context problem in 2026

The 2026 shift to long-running agents using MCP tool servers and A2A protocols has changed where context budget goes. Single-turn RAG used to dominate budgets; today an agent calling 6-12 tools per task spends most of its window on tool observations, planner state, and prior subtask results. A typical SWE-Bench-Verified-style coding agent on Claude Opus 4.7 routinely hits 150-180k prompt tokens by step 10 because file contents, test output, and diffs accumulate. A τ-bench retail agent on GPT-5.1 hits 80-120k tokens by turn 8 because the database schema, prior tool calls, and user dialogue stack up. The window is not “the size of the user prompt plus a few chunks”; it is the entire trajectory state, and unless you actively manage it, the model runs out of room to plan before it runs out of steps.

Production-grade agent stacks now treat context as a managed resource: summarize older steps, evict tool outputs once their results are committed, route to a longer-context model only when the trajectory crosses a threshold, and fall back to compaction when truncation is imminent. FutureAGI’s TrajectoryScore and TaskCompletion evaluators measure whether that management is preserving the information the agent actually needed.
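A minimal sketch of the "summarize older steps" part of that policy follows. The `Step` type, the `compact` function, and the `summarize` callback are hypothetical names introduced for illustration; in practice the summarizer would be an LLM call, and eviction of committed tool outputs would be a separate pass not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    index: int
    text: str    # raw tool observation or planner note
    tokens: int  # token count of `text`


def compact(
    steps: List[Step],
    budget: int,
    keep_recent: int,
    summarize: Callable[[List[Step]], Step],
) -> List[Step]:
    """Collapse older steps into one summary step when over budget.

    The most recent `keep_recent` steps are kept verbatim; everything
    before them is replaced by whatever `summarize` returns.
    """
    total = sum(s.tokens for s in steps)
    cut = len(steps) - keep_recent
    if total <= budget or cut <= 0:
        return steps  # under budget, or nothing old enough to summarize
    older, recent = steps[:cut], steps[cut:]
    return [summarize(older)] + recent
```

Triggering on a token budget rather than a fixed step count keeps short trajectories untouched and only pays the summarization cost when truncation is actually a risk.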

How FutureAGI tracks context window in production

FutureAGI’s approach is to expose context-window utilization as a first-class production signal across traces, evaluators, and the gateway. Every traceAI integration stamps llm.token_count.prompt, llm.token_count.completion, llm.token_count.total, and gen_ai.request.model on each LLM span. From those four attributes a dashboard can compute utilization against the per-model max-context. A slice on prompt_tokens / max_context > 0.85 is your early-warning system for overflow; a slice on prompt_tokens > 100000 is your long-context cost alarm.
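As a rough illustration of those two slices, the sketch below derives both flags from a span's attribute dictionary. The attribute keys are the ones named above; the `MAX_CONTEXT` table, its model ID strings, and the `utilization_flags` helper are assumptions made for the example, not part of traceAI.

```python
from typing import Dict, Mapping

# Illustrative per-model limits; the model ID strings are hypothetical.
MAX_CONTEXT: Dict[str, int] = {
    "claude-opus-4-7": 200_000,
    "gemini-3-pro": 2_000_000,
}


def utilization_flags(
    attrs: Mapping[str, object],
    max_context: Mapping[str, int] = MAX_CONTEXT,
    overflow_ratio: float = 0.85,
    long_prompt_tokens: int = 100_000,
) -> Dict[str, object]:
    """Compute window utilization and the two alert flags for one LLM span."""
    prompt = int(attrs["llm.token_count.prompt"])
    limit = max_context[str(attrs["gen_ai.request.model"])]
    ratio = prompt / limit
    return {
        "utilization": ratio,
        "overflow_warning": ratio > overflow_ratio,      # early warning
        "long_context_cost": prompt > long_prompt_tokens,  # cost alarm
    }
```

The same 180k-token prompt trips the overflow warning on a 200k model but not on a 2M model, which is why the ratio and the absolute count are tracked as separate signals.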

On the evaluation side, ContextRelevance scores whether retrieved chunks are actually relevant to the query (so you do not pay for and reason over noise), ContextPrecision scores whether the relevant chunks made it into the top-k, and ContextUtilization scores whether the model actually used the chunks you sent; a high prompt-token count with low utilization means you are paying for context the model is ignoring. Groundedness then verifies that the answer is supported by the context that survived. All four anchor to the same trace, so a regression on ContextUtilization can be tied to a specific retriever change and a specific chunking strategy.

At Agent Command Center, the model fallback and routing policy primitives let you route to a longer-context model only when prompt token count crosses a threshold, instead of paying long-context prices on every call. A typical 2026 policy: send 95% of traffic to Claude Sonnet 4.6 (200k, fast), and only fall back to Gemini 3 Pro (2M) for the long-prompt 5%. Concretely, a knowledge-base agent ingests user-uploaded contracts, retrieves chunks, and routes to the long-context model only when retrieved-context tokens exceed 150k. The Command Center dashboard shows a 41% reduction in total cost vs. always-long-context, with no degradation in Faithfulness or AnswerRelevancy. Unlike a static model choice, the gateway makes context-window a per-request decision.
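The threshold policy itself is small enough to sketch in plain Python. This is not the Agent Command Center configuration format; `choose_model`, `blended_cost`, the model ID strings, and the prices used with them are hypothetical, and only the 150k threshold comes from the example above.

```python
from typing import Iterable


def choose_model(
    context_tokens: int,
    threshold: int = 150_000,
    default_model: str = "claude-sonnet-4-6",  # hypothetical ID string
    long_model: str = "gemini-3-pro",          # hypothetical ID string
) -> str:
    """Route to the long-context model only above the token threshold."""
    return long_model if context_tokens > threshold else default_model


def blended_cost(
    token_counts: Iterable[int],
    price_default: float,
    price_long: float,
    threshold: int = 150_000,
) -> float:
    """Total input cost in dollars under threshold routing.

    Prices are dollars per million input tokens (assumed values).
    """
    total = 0.0
    for tokens in token_counts:
        price = price_long if tokens > threshold else price_default
        total += tokens * price / 1_000_000
    return total
```

Comparing `blended_cost` against the same traffic priced entirely at `price_long` gives a first estimate of what threshold routing saves on a given workload.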

We’ve found in our 2026 evals that almost every “long-context model is amazing” demo collapses when you measure ContextUtilization against a real production golden dataset: effective context is usually 30-50% of advertised maximum on RULER-style multi-needle tests. The flagship LLM-benchmark story for 2026 is that single-needle “needle in a haystack” is saturated, and the benchmarks that actually move are RULER, LongBench v2, BABILong, and ZeroSCROLLS, which test multi-needle retrieval, multi-hop reasoning over long context, and long-document QA. A senior engineer should treat any vendor claim of “10M-token context” the way a 2021 paper claim of “state-of-the-art on GLUE” deserves to be treated: useful headline, useless without a downstream eval.

Wiring window utilization into release gates

A useful release gate for context-window-heavy systems combines three checks. First, p99 prompt-token utilization on a regression eval cohort: it must not climb more than 10% release-over-release without an explicit owner. Second, ContextUtilization and Faithfulness deltas on the same rows: if utilization drops while prompt tokens rise, you are paying for noise. Third, a cost-per-trace ceiling per cohort, since long-context regressions almost always announce themselves as a cost spike before they announce themselves as a quality drop. FutureAGI’s evaluate surface wires those three into a single CI check that either passes the build or links the engineer back to the failing rows. Unlike LangSmith’s run-comparison view, which centers on aggregate token counts, our flow ties every regression row to its trace, its retriever, and the evaluator reason, so the engineer fixes the cause, not the symptom.
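The three checks can be sketched as a plain function over per-cohort aggregates. This is an illustration, not FutureAGI's CI check: `CohortStats` and `release_gate` are hypothetical names, and the 10% limit is read here as relative growth over the previous release.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CohortStats:
    p99_utilization: float      # p99 of prompt_tokens / max_context
    mean_prompt_tokens: float
    context_utilization: float  # mean evaluator score, 0-1
    faithfulness: float         # mean evaluator score, 0-1
    cost_per_trace: float       # dollars


def release_gate(
    prev: CohortStats,
    curr: CohortStats,
    cost_ceiling: float,
    max_p99_growth: float = 0.10,
) -> List[str]:
    """Return the list of failed checks; an empty list means the gate passes."""
    failures: List[str] = []
    # Check 1: p99 utilization must not grow more than 10% over the last release.
    if curr.p99_utilization > prev.p99_utilization * (1 + max_p99_growth):
        failures.append("p99 prompt-token utilization grew too fast")
    # Check 2: more prompt tokens with lower utilization means paying for noise.
    if (curr.mean_prompt_tokens > prev.mean_prompt_tokens
            and curr.context_utilization < prev.context_utilization):
        failures.append("prompt tokens rose while ContextUtilization fell")
    # Check 3: hard cost ceiling per trace for this cohort.
    if curr.cost_per_trace > cost_ceiling:
        failures.append("cost per trace exceeds ceiling")
    return failures
```

Returning the failure reasons rather than a bare boolean lets the CI job print which check tripped; a Faithfulness delta check could be added alongside the second one with a tolerance of your choosing.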

How to measure context window utilization in 2026

Track the context window as utilization, not as a fixed property. A May 2026 measurement stack should cover four signals: per-span token counts, derived utilization, retrieval quality, and end-to-end groundedness.

| Signal | Source | What it tells you |
| --- | --- | --- |
| llm.token_count.prompt | OTel attribute on every LLM span | How much of the window your input took |
| llm.token_count.completion | OTel attribute | Whether the model had headroom to answer |
| Utilization ratio | prompt_tokens / model_max_context | Alert above 0.85 to catch overflow |
| ContextRelevance | fi.evals.ContextRelevance | 0-1 score on whether retrieved context matched the query |
| ContextPrecision | fi.evals.ContextPrecision | Whether relevant chunks ranked in the top-k |
| ContextUtilization | fi.evals.ContextUtilization | Whether the model actually used provided context |
| Faithfulness | fi.evals.Faithfulness | Whether the answer stuck to the provided context |
| Long-context p99 latency | Tracer p99 by prompt_tokens bucket | When prefill cost starts hurting users |
| Token cost per trace | Sum of tokens × per-model price | Honest unit-economics view |
| Truncation flag | Provider response field | Hard signal of overflow |

The honest production view treats prompt tokens as a budget, retrieval and tool output as a tax on that budget, and ContextUtilization as the only signal that tells you whether the spend is producing answers or paying for noise. Most teams discover when they first wire this up that 20-40% of their long-context spend is going to retrieved chunks the model ignores. That is a routing problem, a chunking problem, or a re-ranker problem, almost never a model problem.

A minimal fi.evals pairing:

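from fi.evals import ContextRelevance, ContextUtilization, Faithfulness

# One evaluator instance per metric, reused across rows.
ctx_rel = ContextRelevance()
ctx_use = ContextUtilization()
faithful = Faithfulness()

# `dataset` is assumed to be an iterable of rows exposing `input`,
# `retrieved_chunks`, `answer`, and an `attach_scores` method.
for row in dataset:
    # Did retrieval fetch context that matches the query?
    rel = ctx_rel.evaluate(query=row.input, context=row.retrieved_chunks)
    # Did the answer actually draw on that context?
    use = ctx_use.evaluate(response=row.answer, context=row.retrieved_chunks)
    # Is the answer supported by the context?
    faith = faithful.evaluate(response=row.answer, context=row.retrieved_chunks)
    row.attach_scores(relevance=rel, utilization=use, faithfulness=faith)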

For an online check wired to every traceAI LLM span, attach the evaluator to the prompt-token bucket directly so a long-context route fails the gate before the response ships:

from fi.evals import ContextUtilization, Faithfulness
from traceai import on_span

util = ContextUtilization()
faith = Faithfulness()

@on_span(kind="llm")
def gate(span):
    attrs = span.attributes
    # Treat a missing token count as a short prompt instead of raising KeyError.
    if attrs.get("llm.token_count.prompt", 0) < 100_000:
        return  # short-context path skips the gate
    response = attrs["llm.output"]
    context = attrs["retrieval.documents"]
    util_score = util.evaluate(response=response, context=context).score
    faith_score = faith.evaluate(response=response, context=context).score
    # Low utilization or low faithfulness on a long prompt marks a regression.
    if util_score < 0.35 or faith_score < 0.7:
        span.set_status("ERROR", "long-context regression")
        span.route("fallback:claude-opus-4-7-1M")

Three derived metrics make the data actionable. Effective context ratio: ContextUtilization × prompt_tokens / max_context. Below 0.2 you are renting a long-context model for nothing. Long-tail cost share: percent of total spend coming from traces above 200k prompt tokens. Above 30% you almost certainly need a routing-based fallback in Agent Command Center. Truncation incidence: percent of traces with the provider’s truncated: true flag set. Anything above 0% is a real bug.
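The three derived metrics reduce to a few lines each. In this sketch a trace is assumed to be a dictionary with `prompt_tokens`, `cost`, and `truncated` keys; that shape and the function names are illustrative, not an export format from any tracing tool.

```python
from typing import Dict, List


def effective_context_ratio(
    context_utilization: float, prompt_tokens: int, max_context: int
) -> float:
    """ContextUtilization × prompt_tokens / max_context."""
    return context_utilization * prompt_tokens / max_context


def long_tail_cost_share(traces: List[Dict], threshold: int = 200_000) -> float:
    """Fraction of total spend from traces above `threshold` prompt tokens."""
    total = sum(t["cost"] for t in traces)
    if total == 0:
        return 0.0
    long_spend = sum(t["cost"] for t in traces if t["prompt_tokens"] > threshold)
    return long_spend / total


def truncation_incidence(traces: List[Dict]) -> float:
    """Fraction of traces whose `truncated` flag is set."""
    if not traces:
        return 0.0
    return sum(1 for t in traces if t.get("truncated")) / len(traces)
```

Each function returns a fraction between 0 and 1, so the thresholds quoted above (0.2, 30%, 0%) can be applied directly as alert conditions.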

Model context windows worth knowing in May 2026

Here is the May 2026 landscape for the windows engineers actually deploy against. Numbers are per published model cards; verify before relying on them for capacity planning.

| Model | Max context (tokens) | Effective context (RULER, multi-needle) | Notes |
| --- | --- | --- | --- |
| Claude Opus 4.7 | 200k (1M enterprise) | ~160k usable | Best in-window reasoning; 1M tier is rate-limited |
| Claude Sonnet 4.6 | 200k | ~140k usable | Cheapest strong long-context option |
| GPT-5.1 | 400k (1M opt-in) | ~220k usable | Long-context premium pricing above 200k |
| GPT-5 mini | 256k | ~120k usable | Fast, cheap, weaker on multi-hop long context |
| Gemini 3 Pro | 2M | ~600k usable | Best raw window; cost premium and slow prefill |
| Gemini 3 Flash | 1M | ~300k usable | Strong cost/window tradeoff |
| Llama 4 Scout (open-weight) | 10M | ~500k usable | Open-weight ceiling; deployment cost dominates |
| Llama 4 Maverick (open-weight) | 1M | ~250k usable | Production-friendly open option |
| Mistral Large 3 | 256k | ~140k usable | Strong European-jurisdiction option |
| DeepSeek V3.2 | 256k | ~150k usable | Best cost-per-token in long-context tier |

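The honest takeaway: by the numbers in the table, advertised windows run from roughly 1.25× (Claude Opus 4.7 at its 200k default) to 20× (Llama 4 Scout) larger than effective windows, and the million-token-class models all sit at 3× or more. Plan retrieval and routing against effective context, then monitor ContextUtilization to keep the assumption honest.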
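The gap between advertised and effective windows can be computed directly from the table. The sketch below copies the table's default (non-opt-in) maximums and approximate effective figures into a dictionary; `WINDOWS` and `advertised_to_effective` are names made up for the example.

```python
from typing import Dict, Tuple

# (advertised max, approximate effective context) in tokens, from the table.
# Default tiers are used where an opt-in or enterprise tier also exists.
WINDOWS: Dict[str, Tuple[int, int]] = {
    "Claude Opus 4.7": (200_000, 160_000),
    "Claude Sonnet 4.6": (200_000, 140_000),
    "GPT-5.1": (400_000, 220_000),
    "GPT-5 mini": (256_000, 120_000),
    "Gemini 3 Pro": (2_000_000, 600_000),
    "Gemini 3 Flash": (1_000_000, 300_000),
    "Llama 4 Scout": (10_000_000, 500_000),
    "Llama 4 Maverick": (1_000_000, 250_000),
    "Mistral Large 3": (256_000, 140_000),
    "DeepSeek V3.2": (256_000, 150_000),
}


def advertised_to_effective(windows: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Ratio of advertised to effective context for each model."""
    return {name: adv / eff for name, (adv, eff) in windows.items()}


ratios = advertised_to_effective(WINDOWS)
worst = max(ratios, key=ratios.get)  # model with the largest gap
```

Sorting `ratios` is a quick way to see which models' headline numbers deserve the most skepticism before capacity planning.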

Patterns that beat brute-force long context

Three patterns consistently outperform “stuff the window and pray” on every 2026 evaluator we run. First, retrieval with re-ranking: a dense retriever followed by a cross-encoder re-ranker lifts top-3 ContextPrecision by 20-40 points on most domains, which means a 32k window with re-ranking beats a 1M window without it on cost-adjusted faithfulness. Second, trajectory summarization for agents: after every 4-5 steps, summarize the older steps into a compact state object and evict the raw tool output; TrajectoryScore stays flat while prompt tokens drop 60%. Third, structured tool-output compaction: instead of injecting raw SQL results, project them to a typed schema and inject only the fields the next step needs. All three are managed at the routing layer in Agent Command Center; none of them require the underlying model to change. Compared with the LangChain ConversationBufferWindowMemory pattern, these are evaluator-driven rather than turn-count-driven, so the agent retains the informative turns, not just the recent ones.
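The third pattern, tool-output compaction, is the easiest to show in isolation. The sketch below assumes SQL results arrive as a list of dictionaries; `project_rows` and the field names in the usage example are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Mapping, Sequence


def project_rows(
    rows: Iterable[Mapping[str, Any]],
    fields: Sequence[str],
    max_rows: int = 50,
) -> List[Dict[str, Any]]:
    """Keep only the fields the next step needs, capped at `max_rows` rows."""
    projected: List[Dict[str, Any]] = []
    for i, row in enumerate(rows):
        if i >= max_rows:
            break  # cap row count so one large result cannot flood the prompt
        projected.append({f: row[f] for f in fields if f in row})
    return projected


# Hypothetical raw result with columns the planner does not need.
raw = [
    {"order_id": 1, "status": "shipped", "internal_notes": "x" * 500},
    {"order_id": 2, "status": "pending", "internal_notes": "y" * 500},
]
compact_rows = project_rows(raw, fields=["order_id", "status"])
```

Only `order_id` and `status` reach the prompt here; the wide `internal_notes` column never consumes context tokens.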

Common mistakes

  • Treating “longer context = better answers.” Long-context models still under-weight middle-of-prompt tokens. RULER, LongBench v2, and BABILong all confirm the effective context is well below the advertised maximum on every frontier 2026 model; more chunks does not equal more recall.
  • Ignoring tool observations in the budget. Agent traces routinely overflow because tool outputs (SQL results, scraped pages, MCP server responses) inflate prompt tokens silently. Budget tool output separately and truncate or summarize before re-injecting.
  • Hard-coding chunk counts. A top_k=10 retrieval that fits an 8k window with small chunks can overflow even a 128k window once chunk size grows; budget by tokens, not by chunk count. Pair top-k with a token cap.
  • Not reserving response budget. If prompt_tokens + max_completion_tokens > context_window, the API errors or truncates. Always reserve headroom (typically 4-8k for normal answers, more for reasoning models doing chain-of-thought).
  • Comparing context windows across tokenizers. A 200k-token Llama 4 context and a 200k-token GPT-5 context hold different amounts of text; measure in characters or words for cross-model comparisons, not raw tokens.
  • Pricing long-context the same as short-context. Most 2026 providers charge a premium above 200k tokens. Always include the per-model long-context surcharge in your cost dashboard.
  • Treating the MCP server response as free. MCP tool calls return structured data that lives in the context window like any other tool observation. Long-context budgets must include MCP responses, especially in multi-step agent loops.
  • Skipping ContextUtilization. It is the only evaluator that catches “I paid for the long context, the model ignored it.” Without it, your effective-context number is wishful thinking.
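Two of the mistakes above, hard-coded chunk counts and missing response headroom, can be avoided with one selection function. This is a minimal sketch: `select_chunks` is a hypothetical helper, and `count_tokens` stands in for whatever tokenizer matches the target model (the whitespace split in the usage lines is only a crude placeholder).

```python
from typing import Callable, List, Sequence


def select_chunks(
    chunks: Sequence[str],
    count_tokens: Callable[[str], int],
    context_window: int,
    fixed_prompt_tokens: int,
    reserved_response_tokens: int = 8_000,
    top_k: int = 10,
) -> List[str]:
    """Take up to `top_k` ranked chunks without exceeding the token budget.

    The budget is what remains of the window after the fixed prompt
    (system prompt, user input) and the reserved response headroom.
    """
    budget = context_window - fixed_prompt_tokens - reserved_response_tokens
    selected: List[str] = []
    used = 0
    for chunk in chunks[:top_k]:
        n = count_tokens(chunk)
        if used + n > budget:
            break  # chunks are assumed ranked, so stop at the first overflow
        selected.append(chunk)
        used += n
    return selected


# Toy numbers: a 20-token window, 5 tokens of fixed prompt, 10 reserved.
picked = select_chunks(
    ["a b c", "d e", "f g h i"],
    count_tokens=lambda s: len(s.split()),
    context_window=20,
    fixed_prompt_tokens=5,
    reserved_response_tokens=10,
)
```

With a 5-token budget the first two chunks fit and the third is dropped, regardless of what `top_k` allows.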

Frequently Asked Questions

What is a context window?

A context window is the maximum total number of tokens (prompt, retrieved chunks, tool output, and generated response combined) that an LLM can process in one request.

How is a context window different from context length?

They are often used synonymously. Context length emphasizes the size of the input the model accepts; context window emphasizes the budget perspective: what fraction you have used and what is left for the response.

How do you measure context-window usage?

FutureAGI tags every LLM span with `llm.token_count.prompt` and `llm.token_count.total` via traceAI, so you can dashboard utilization against the model's max-context to catch silent truncations and overflow.