What Is Context Overflow?
An LLM failure where input plus requested output exceeds the model's context window, causing truncation, rejection, or degenerate responses.
What Is Context Overflow?
Context overflow is an LLM failure mode where the request — system prompt, retrieved context, conversation history, and the requested completion length — exceeds the model’s context window. Providers respond in three different ways: silently truncate the input (worst, because the model answers from incomplete context), return a 400 error (loud and recoverable), or generate degenerate output as the model approaches its limit. With million-token windows now standard, overflow is rarer per single call but more common in agent loops that accumulate history, tool outputs, and intermediate reasoning across dozens of steps.
Why It Matters in Production LLM and Agent Systems
On 2026-04-15 a research-agent product crashed every long-running session past hour two. Postmortem: the agent ran a ReAct loop that appended every tool output to the next prompt. Around step 80, the prompt hit the model’s 200K context limit. The provider returned a 400 with context_length_exceeded. The agent’s catch handler logged “model error” and retried — same overflow, same error. After three retries the session gave up. Customers saw “agent unavailable.” No token-budget alert was wired. The fix was a summarisation step every 20 turns plus a fallback to a 1M-context model, but the team only learned this after a week of incident triage.
That is the agent overflow shape. It hits the application engineer (mysterious crashes after hour two), the SRE (no clear log signal — looks like a provider error), the product team (sessions fail in user-perceivable ways), and the finance team (retry costs surge during overflow events).
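A minimal sketch of the summarisation-at-fixed-boundaries fix from that postmortem, assuming a generic ReAct-style loop; llm_call, run_tool, and summarise are hypothetical stand-ins, not any particular SDK:

# Compact the history every N steps so the prompt stops growing without bound.
SUMMARISE_EVERY = 20

def agent_loop(task: str, max_steps: int = 100):
    history = [f"Task: {task}"]
    for step in range(1, max_steps + 1):
        if step % SUMMARISE_EVERY == 0:
            # collapse the accumulated turns into one compact summary turn
            history = [summarise("\n".join(history))]
        action = llm_call("\n".join(history))   # hypothetical: returns a dict
        if action.get("final"):
            return action["answer"]
        observation = run_tool(action["tool"], action["input"])
        history.append(f"Step {step}: {action['tool']} -> {observation}")

The gateway-level fallback to a 1M-context model sits outside this loop, so the loop itself only has to own the summarisation boundary.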
In 2026 RAG stacks, overflow shows up differently — a too-permissive top-K retrieval pulls in 30 chunks of 4KB each, the prompt exceeds the model limit, and the provider truncates. The model answers from the first half of the retrieved context; the rest never reaches the model. This is the silent variant: no error, no alert, just lower-quality answers. The only telltale signal is a token-budget metric on the trace.
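The RAG-side fix is to cap retrieved context against an explicit token budget before prompt assembly instead of letting the provider truncate. A sketch, assuming chunks arrive ranked best-first and count_tokens is a placeholder for your tokenizer:

def cap_context(chunks: list[str], budget_tokens: int) -> list[str]:
    """Keep the highest-ranked chunks that fit the budget; drop the rest explicitly."""
    kept, used = [], 0
    for chunk in chunks:               # assumed sorted by relevance, best first
        cost = count_tokens(chunk)     # placeholder tokenizer
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept

Pairing this with a tighter top-K keeps the budget from being spent on low-relevance chunks in the first place.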
How FutureAGI Handles Context Overflow
FutureAGI’s approach is to track token counts on every span and let the Agent Command Center route around overflow before it happens. Every LLM span carries llm.token_count.prompt and llm.token_count.completion via traceAI integrations (traceAI-openai, traceAI-langchain, traceAI-langgraph). The Agent Command Center exposes a conditional routing policy that fires when prompt tokens cross a configured threshold — for example, “if llm.token_count.prompt > 150_000 route to claude-sonnet-4 (1M context) instead of gpt-4o (128K).” The companion fallback policy catches context_length_exceeded errors at the provider boundary and reroutes the call without the application code having to handle it.
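Stripped of any specific gateway, that routing rule reduces to a threshold check. The sketch below only mirrors the rule described above and is not the Command Center's actual API:

def choose_model(prompt_tokens: int, threshold: int = 150_000) -> str:
    # long prompts go to the 1M-context model, everything else to the default
    return "claude-sonnet-4" if prompt_tokens > threshold else "gpt-4o"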
Concretely: a research agent is wrapped behind the Agent Command Center with three policies — a pre-guardrail that estimates token count from the prompt, a conditional routing rule (gt: { llm.token_count.prompt: 150000 } → 1M-context model), and a fallback policy that retries with summarisation if the provider still rejects. The team also runs fi.evals.ContextRelevance over the prompt at evaluation time to catch the silent variant: when context is dominated by irrelevant chunks, the retrieval step needs tightening before the raw context size does. The dashboard plots prompt-token p99 by route — drift past 80% of the model limit is the alert threshold.
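The pre-guardrail only needs a cheap token estimate before the call leaves the application. A minimal sketch using tiktoken; the encoding name is an assumption and should match the target model:

import tiktoken

def estimate_prompt_tokens(prompt: str, encoding_name: str = "cl100k_base") -> int:
    """Fast local estimate; the provider's own usage count is the authoritative number."""
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(prompt))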
Unlike LangChain’s truncate-on-overflow utilities, FutureAGI’s gateway routes before the truncation happens, preserving answer quality.
How to Measure or Detect It
Signals to wire up:
- OTel attribute llm.token_count.prompt — primary signal; track it on every span.
- OTel attribute llm.token_count.completion — combined with the prompt count, gives total budget consumption.
- Provider error rate for context_length_exceeded — the loud-failure variant.
- Dashboard signal: prompt-token p99 by route — drift past 80% of the model limit means overflow is imminent.
- fi.evals.ContextRelevance — surfaces the silent overflow case where retrieved context is over-broad.
- Agent step count per session — leading indicator for long-loop overflow.
# Token-budget gate before model call
# (count_tokens, summariser, build_prompt, and MODEL_LIMIT are placeholders for
#  your own tokenizer, summarisation step, prompt assembly, and the target
#  model's context window in tokens)
prompt_tokens = count_tokens(prompt)
if prompt_tokens > 0.8 * MODEL_LIMIT:
    # route to a larger-context model or trigger summarisation
    summary = summariser.run(history)
    prompt = build_prompt(summary, current_step)
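If the traceAI integrations are not in place (they set these attributes automatically), the same signal can be recorded by hand with the OpenTelemetry SDK. A sketch assuming an OpenAI-style client; the client, model, and messages variables are illustrative:

from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

with tracer.start_as_current_span("llm.call") as span:
    # client, model, and messages are defined elsewhere (illustrative)
    response = client.chat.completions.create(model=model, messages=messages)
    span.set_attribute("llm.token_count.prompt", response.usage.prompt_tokens)
    span.set_attribute("llm.token_count.completion", response.usage.completion_tokens)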
Common Mistakes
- Conflating context overflow with runaway cost. Overflow is a structural per-call failure; runaway cost is unbounded token use across many calls. Different fixes.
- Trusting the provider to truncate gracefully. Silent truncation ships a worse answer; loud rejection at least surfaces in logs.
- Letting agent loops accumulate full history without summarisation. Every step’s tool output bloats the next prompt; insert summarisation at fixed step boundaries.
- Ignoring the response-length budget. max_tokens for the completion eats into the same context window — add it to the calculation.
- Setting a single global token threshold. Limits vary widely across models (8K to 1M); make the threshold model-aware — see the sketch after this list.
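A sketch of a model-aware budget check that also reserves the completion budget; the context-window figures are the ones named earlier in this article and should be verified against provider documentation:

# Illustrative limits; verify against your provider's documentation.
CONTEXT_LIMITS = {
    "gpt-4o": 128_000,
    "claude-sonnet-4": 1_000_000,
}

def fits_in_context(model: str, prompt_tokens: int, max_tokens: int, headroom: float = 0.8) -> bool:
    """max_tokens for the completion comes out of the same window, so reserve it up front."""
    limit = CONTEXT_LIMITS[model]
    return prompt_tokens + max_tokens <= headroom * limit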
Frequently Asked Questions
What is context overflow?
Context overflow is an LLM failure where the input plus the requested output exceeds the model's context window, causing truncation, request rejection, or degenerate output.
How is context overflow different from runaway cost?
Context overflow is a structural per-call failure — the request does not fit. Runaway cost is unbounded token consumption across many calls, often from agent loops or recursive tool calls, even when each individual call fits.
How do you prevent context overflow?
Track llm.token_count.prompt on every span via traceAI, set a token budget, and use the FutureAGI Agent Command Center's routing and fallback policies to move the call to a larger-context model or trigger summarisation before the limit is hit.