What Is a Context Window?

The maximum number of tokens an LLM can process in a single request, covering prompt, retrieved context, tool output, and generated response.

A context window is the maximum number of tokens a large language model can attend to in a single request — system prompt, user input, retrieved RAG chunks, tool observations, and the generated response combined. In 2026 the common range is 128k tokens (GPT-4o, Llama 3.1 128k) to 1M+ (Gemini 2.5 Pro, Claude Sonnet 4 long-context). The window is a hard budget: anything beyond it is truncated by the API or the inference engine, and cost and latency rise with how much of it you actually fill, so the context window is a retrieval-budgeting and prompt-engineering concern, not just a model spec.

Why It Matters in Production LLM and Agent Systems

The context window is where retrieval, prompt design, agent state, and cost all collide. In a RAG pipeline, you trade retrieved-chunk count and chunk size against the budget left for instructions and the answer. In an agent, every tool observation eats from the same budget — by step 8 of a 12-step trajectory, the prompt may already be 80k tokens, leaving little room for the model to reason. Get the budget wrong and three failure modes show up.
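
A quick back-of-the-envelope illustration of that squeeze, with hypothetical numbers in the spirit of the trajectory above:

CONTEXT_WINDOW = 128_000             # model's maximum context, in tokens
system_prompt = 2_000
retrieved_chunks = 15_000
tool_observations = 8 * 8_000        # eight tool calls at roughly 8k tokens each
prompt_so_far = system_prompt + retrieved_chunks + tool_observations
print(prompt_so_far)                     # 81000 tokens already committed by step 8
print(CONTEXT_WINDOW - prompt_so_far)    # 47000 tokens left for four more steps and the answer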

The first is silent truncation. The API or the client framework drops the oldest tokens, often the system prompt, without raising an error — the model now answers without knowing the rules it is supposed to follow. The second is positional bias: long-context models often under-weight tokens buried in the middle of a long prompt, so a critical fact deep inside a 100,000-token prompt may simply be ignored. The third is cost: input tokens are billed per call, so a 50k-token prompt costs 50× as much as a 1k-token prompt at the same per-token rate.

The pain is concrete. A retrieval engineer sees ContextRecall scores collapse on a multi-hop dataset because a re-ranker is dropping the right chunk to fit a 16k window. An SRE watches p99 latency spike whenever an agent crosses the 100k-prompt boundary. A finance lead sees a 9× cost increase on the “summarize this 200-page PDF” feature after switching to a long-context model, with no quality lift to justify it. And in 2026 agent stacks that run across gpt-4o, claude-sonnet-4, and gemini-2.5-pro, each with a different window, context-window planning becomes a routing-policy problem, not a static config.

How FutureAGI Handles the Context Window

FutureAGI’s approach is to expose context-window utilization as a first-class production signal across traces, evaluators, and the gateway. Every traceAI integration stamps llm.token_count.prompt, llm.token_count.completion, and llm.token_count.total on each LLM span, plus the gen_ai.request.model so you can compute utilization against the per-model max-context. A dashboard slice on prompt_tokens / max_context > 0.85 is your early-warning system for overflow.
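
A minimal sketch of that derived check, assuming the span attributes named above and an illustrative max-context table (real limits vary by provider and model snapshot):

MAX_CONTEXT = {                      # illustrative limits; confirm against your provider's docs
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "gemini-2.5-pro": 1_000_000,
}

def utilization(span_attributes: dict) -> float:
    """Prompt-token utilization for one LLM span, using the traceAI attribute names."""
    model = span_attributes["gen_ai.request.model"]
    return span_attributes["llm.token_count.prompt"] / MAX_CONTEXT[model]

span = {"gen_ai.request.model": "gpt-4o", "llm.token_count.prompt": 112_000}
if utilization(span) > 0.85:
    print(f"WARNING: {utilization(span):.0%} of the window used; truncation risk")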

On the evaluation side, ContextRelevance scores whether retrieved chunks are actually relevant to the query (so you don’t pay for and reason over noise), and ContextUtilization scores whether the model actually used the chunks you sent — a high prompt-token count with low utilization means you are paying for context the model is ignoring. Both anchor to the same trace.

In the Agent Command Center, the model fallback primitive lets you route to a longer-context model only when prompt token count crosses a threshold, instead of paying long-context prices on every call. A routing policy might send 95% of traffic to gpt-4o-mini (128k) and only fall back to gemini-2.5-pro (1M) for the long-prompt 5%. Concretely: a knowledge-base agent ingests user-uploaded contracts, retrieves chunks, and routes to the long-context model when retrieved-context tokens exceed 80k. The dashboard shows a 41% reduction in total cost vs. always-long-context, with no degradation in Faithfulness. Unlike a static model choice, the gateway makes the context-window decision per request.
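
The policy itself is configured in the gateway rather than in application code, but the decision it encodes boils down to something like this sketch (threshold and model names taken from the example above):

LONG_CONTEXT_THRESHOLD = 80_000          # retrieved-context tokens that trigger the fallback

def choose_model(prompt_tokens: int) -> str:
    """Route cheap by default; fall back to the long-context model only when needed."""
    if prompt_tokens > LONG_CONTEXT_THRESHOLD:
        return "gemini-2.5-pro"          # 1M-context fallback for the long-prompt minority
    return "gpt-4o-mini"                 # 128k default for the bulk of traffic

print(choose_model(12_000))   # gpt-4o-mini
print(choose_model(95_000))   # gemini-2.5-pro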

How to Measure or Detect It

Track the context window as utilization, not as a fixed property:

  • llm.token_count.prompt (OTel attribute): how much of the window your input took.
  • Context utilization ratio (derived metric): prompt_tokens / model_max_context; alert above 0.85.
  • ContextRelevance evaluator: returns 0–1 on whether retrieved context was relevant to the question.
  • ContextUtilization evaluator: returns whether the model actually USED the provided context — catches the “long-prompt, ignored-by-model” failure.
  • Truncation flags: some providers expose a truncated: true field on responses; surface it as an alert.
  • Token cost per trace: sum of tokens × $/token-by-model across the trace, the only honest unit-economics view for long-context calls (see the cost sketch below).

Minimal Python:

from fi.evals import ContextRelevance, ContextUtilization

# Placeholder inputs: q is the user query, retrieved_chunks the passages sent to the model,
# r the model's response
q = "What is the notice period in the vendor contract?"
retrieved_chunks = ["...chunk 1...", "...chunk 2..."]
r = "The notice period is 60 days."

ctx_rel = ContextRelevance()
ctx_use = ContextUtilization()

# Both evaluators score against the same retrieved context
print(ctx_rel.evaluate(query=q, context=retrieved_chunks).score)
print(ctx_use.evaluate(response=r, context=retrieved_chunks).score)
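
The token-cost-per-trace metric can be derived from the same span attributes; a minimal sketch, with placeholder $/1M-token prices rather than real list prices:

PRICE_PER_M = {                                      # placeholder rates, not real pricing
    "gpt-4o-mini":    {"prompt": 0.15, "completion": 0.60},
    "gemini-2.5-pro": {"prompt": 1.25, "completion": 10.00},
}

def trace_cost(spans: list[dict]) -> float:
    """Sum token cost across every LLM span in one trace."""
    total = 0.0
    for s in spans:
        price = PRICE_PER_M[s["gen_ai.request.model"]]
        total += s["llm.token_count.prompt"] / 1e6 * price["prompt"]
        total += s["llm.token_count.completion"] / 1e6 * price["completion"]
    return total

spans = [{"gen_ai.request.model": "gpt-4o-mini",
          "llm.token_count.prompt": 50_000,
          "llm.token_count.completion": 1_000}]
print(f"${trace_cost(spans):.4f}")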

Common Mistakes

  • Treating “longer context = better answers”. Long-context models often under-weight middle-of-prompt tokens; more chunks does not equal more recall.
  • Ignoring tool observations in the budget. Agent traces routinely overflow because tool outputs (SQL results, scraped pages) inflate prompt tokens silently.
  • Hard-coding chunk counts. A top_k=10 retrieval that fits an 8k window today breaks once chunk size grows, even with a 128k window; budget by tokens, not by count.
  • Not reserving response budget. If prompt_tokens + max_completion_tokens > context_window, the API errors or truncates — always reserve headroom (see the pre-flight check after this list).
  • Comparing context windows across tokenizers. A 128k-token Llama context and 128k-token GPT context hold different amounts of text — measure in characters or words for cross-model comparisons.
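
A minimal pre-flight check for the headroom mistake, assuming tiktoken's cl100k_base encoding as a stand-in tokenizer (each model family tokenizes differently, which is the point of the last bullet):

import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")   # stand-in tokenizer; real counts vary by model

def fits(prompt: str, max_completion_tokens: int, context_window: int) -> bool:
    """True only if the prompt plus the reserved response budget fits inside the window."""
    prompt_tokens = len(ENC.encode(prompt))
    return prompt_tokens + max_completion_tokens <= context_window

print(fits("..." * 1000, max_completion_tokens=4_096, context_window=128_000))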

Frequently Asked Questions

What is a context window?

A context window is the maximum total number of tokens — prompt, retrieved chunks, tool output, and generated response combined — that an LLM can process in one request.

How is a context window different from context length?

They are often used synonymously. Context length emphasizes the size of the input the model accepts; context window emphasizes the budget perspective — what fraction you have used and what is left for the response.

How do you measure context-window usage?

FutureAGI tags every LLM span with `llm.token_count.prompt` and `llm.token_count.total` via traceAI, so you can dashboard utilization against the model's max-context to catch silent truncations and overflow.