Evaluating LLM Context Window Management in 2026
Long-context support is marketing. Long-context fidelity is what you eval: NIAH at every position, lost-in-middle on your docs, attention-budget cost.
A support agent ships on Monday with 3k tokens of context and a 0.92 task-completion score. By Thursday, sessions stretch past 32k tokens and the same agent scores 0.71 on the same rubric. The prompt did not change. The model did not change. What changed is that the agent crossed a fidelity threshold no one was watching, on a model whose datasheet still says 200,000 tokens. Long-context support is marketing; long-context fidelity is what you eval. The methodology that earns the word “fidelity” is three tests run together: needle-in-haystack at multiple positions, lost-in-middle on your own documents, and attention-budget cost analysis. Without those three, a “200K context window” is a sticker, not a capability.
This post lays out the working pattern: the three tests, the production patterns that survive past 32k tokens, and the Future AGI (FAGI) surfaces that instrument and score them. The reader leaves with rubrics that pin failures to position, depth, or strategy, and with a cost-per-effective-token curve that decides which pattern ships.
Context length is not context fidelity
The number on the model card is the maximum input the model will accept, not the depth at which it still reasons reliably. Models advertised at 200k tokens cliff somewhere between 32k and 64k on production workloads, and the cliff is not a slow degradation: coherence sits at 0.9 at the limit and 0.6 a few thousand tokens past it. The drop is a step function whose location is per-model: Claude Sonnet 4.5 holds further than smaller models, GPT-5.1 has its own knee, Gemini 2.5 Pro has another, and the public needle-in-haystack score from any of them stays above 0.95 well past the point where actual task completion has collapsed.
Three properties make this a separate evaluation problem from a normal prompt change. First, production agents cross 32k by week two: a 5k-per-session chatbot crosses 32k by turn 12 once you add system prompts, tool definitions, retrieved documents, and a memory snapshot. Second, public needle-in-haystack lies: a 100k context with one sentence to retrieve is a different problem from a 100k context full of overlapping facts where the model has to keep three straight and forget the rest. Third, quality drops are invisible to single-turn rubrics: every rubric in the SDK assumes one prompt, one response, and the failure modes of long context fall through that frame.
The fix is structural: stop scoring context windows like prompts and start scoring them like systems. Three tests, run together.
Test 1: needle-in-haystack at multiple positions
The first test is the one most teams already think they run. The trap is that the public NIAH benchmark uses one position (usually random) and one shape of distractor (random text). A production-grade NIAH eval sweeps positions and uses shape-similar distractors so the model cannot win on lexical luck.
Sweep positions: start, 25 percent, middle, 75 percent, end. Sweep depths: 4k, 16k, 32k, 64k, 96k, 128k. Mix the needle with distractors that look like the needle but answer a different question — same entity types, same sentence structure, different facts. Score on a custom rubric that checks whether the response acts on the needle, not just whether it surfaces it.
from fi.evals.templates import CustomLLMJudge

NeedlePositionFidelity = CustomLLMJudge(
    name="NeedlePositionFidelity",
    rubric=(
        "Score 1-5 whether the response correctly used the target fact. "
        "5: response cites the exact fact and acts on it. "
        "3: response cites a related-but-wrong fact from elsewhere in context. "
        "1: response invents a fact, ignores the context, or picks a distractor."
    ),
    grading_prompt_template=(
        "Long context (truncated):\n{context}\n\n"
        "Fact position: {position_pct} percent through context.\n"
        "Question: {question}\nExpected answer: {expected}\n"
        "Model response: {response}\nScore (1-5):"
    ),
)
Run this across the 30-cell sweep (5 positions x 6 depths) for each model in your routing pool. The middle position is where the lost-in-middle effect lives, and it is where every cliff hides first. Pair the custom rubric with Groundedness (eval_id 47) and ContextAdherence (eval_id 5) from the ai-evaluation SDK so the same trace also gets the standard rubric family. The LLM evaluation playbook explains the rubric-family framing.
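As a rough sketch of how the 30 cells might be generated, the snippet below places one needle at each position inside filler built from shape-similar distractors. The helper names, the example sentences, and the 20-tokens-per-sentence estimate are assumptions for illustration; a real sweep would measure depth with the target model's tokenizer.

```python
from itertools import product

POSITIONS_PCT = [0, 25, 50, 75, 100]  # start, 25 percent, middle, 75 percent, end
DEPTHS_TOKENS = [4_000, 16_000, 32_000, 64_000, 96_000, 128_000]


def place_needle(filler_sentences, needle, position_pct):
    """Insert the needle position_pct percent of the way through the filler."""
    index = round(len(filler_sentences) * position_pct / 100)
    return filler_sentences[:index] + [needle] + filler_sentences[index:]


def build_sweep(distractors, needle, tokens_per_sentence=20):
    """Yield (position_pct, depth, context) for every position x depth cell."""
    for position_pct, depth in product(POSITIONS_PCT, DEPTHS_TOKENS):
        n_sentences = max(1, depth // tokens_per_sentence)
        repeats = -(-n_sentences // len(distractors))  # ceiling division
        # Repeat the distractors until the filler reaches the target depth.
        filler = (distractors * repeats)[:n_sentences]
        context = " ".join(place_needle(filler, needle, position_pct))
        yield position_pct, depth, context


# Hypothetical distractors: same shape as the needle, different facts.
distractors = [
    "Order 1182 ships to Lisbon on Tuesday.",
    "Order 1183 ships to Porto on Friday.",
]
needle = "Order 1184 ships to Faro on Monday."
cells = list(build_sweep(distractors, needle))
```

Each cell then becomes one judged example, with position_pct passed through to the grading prompt.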
The artifact is a heatmap: model on one axis, depth on another, position colour-coded. The cells that drop below your tolerance (typically 0.85 of the start-position score) are where the model loses fidelity. The cliff is not at the advertised window; it is wherever the heatmap goes cold.
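A minimal sketch of the tolerance check behind the heatmap follows. It assumes each cell is compared with the start-position score for the same model and depth; the scores in the example are made up for illustration.

```python
def cold_cells(scores, tolerance=0.85, start_position=0):
    """Return cells scoring below tolerance x the start-position score.

    scores maps (model, depth_tokens, position_pct) -> mean rubric score.
    The baseline is the start-position cell for the same model and depth.
    """
    flagged = []
    for (model, depth, position_pct), score in scores.items():
        baseline = scores[(model, depth, start_position)]
        if score < tolerance * baseline:
            flagged.append((model, depth, position_pct))
    return sorted(flagged)


# Made-up scores, for illustration only.
example_scores = {
    ("model-a", 32_000, 0): 0.90,
    ("model-a", 32_000, 50): 0.80,
    ("model-a", 64_000, 0): 0.88,
    ("model-a", 64_000, 50): 0.60,
}
flagged = cold_cells(example_scores)
```

Here only the 64k middle cell is flagged: 0.60 falls below 0.85 x 0.88, while 0.80 at 32k stays above 0.85 x 0.90.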
Test 2: lost-in-middle on your documents
Public NIAH uses synthetic essays. Your agent processes order histories, ticket threads, contract sections, and code reviews. The lost-in-middle effect on synthetic text does not predict the effect on your text, because the distractors in your text actually look like the needle.
Take 50-200 representative session transcripts from production traces. For each, identify a fact the agent must act on later — an order ID, a stated preference, an escalation threshold, a constraint set in turn 1 and asked about by turn 30. Inject that fact at three positions: 10 percent through the transcript, 50 percent, 90 percent. Pad each variant to the same total token count using shape-similar turns from other sessions. Run the agent on each variant.
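One way to build the three variants is sketched below. It pads the transcript once and then injects the fact, so every variant ends up with the same length. The helper names are hypothetical, the whitespace token count is a crude stand-in for a real tokenizer, and prepending the padding as older history is an assumption about where the filler should go.

```python
def approx_tokens(turn):
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(turn.split())


def pad_transcript(turns, padding_pool, target_tokens):
    """Prepend turns from other sessions until the transcript reaches target_tokens."""
    padded = list(turns)
    total = sum(approx_tokens(t) for t in padded)
    for filler in padding_pool:
        if total >= target_tokens:
            break
        padded.insert(0, filler)
        total += approx_tokens(filler)
    return padded


def inject_fact(turns, fact_turn, position_pct):
    """Return a copy of turns with fact_turn inserted position_pct percent through."""
    index = round(len(turns) * position_pct / 100)
    return turns[:index] + [fact_turn] + turns[index:]


def build_variants(turns, fact_turn, padding_pool, target_tokens, positions=(10, 50, 90)):
    """Pad once, then inject at each position so all variants have equal length."""
    base = pad_transcript(turns, padding_pool, target_tokens)
    return {pct: inject_fact(base, fact_turn, pct) for pct in positions}


session = [f"user: message {i}" for i in range(10)]
other_sessions = [f"user: filler {i}" for i in range(20)]
fact = "user: my order ID is 1184"
variants = build_variants(session, fact, other_sessions, target_tokens=60)
```

The user request that asks about the fact is kept outside the transcript here, matching the separate request field in the grading prompt.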
The rubric scores whether the response acted on the injected fact:
from fi.evals.templates import CustomLLMJudge

# Judges a single variant; the ratio itself is computed afterwards across positions.
LostInMiddleRatio = CustomLLMJudge(
    name="LostInMiddleRatio",
    rubric=(
        "Score 1-5 whether the response acts on the injected fact. "
        "5: agent applies the fact correctly to the user's request. "
        "3: agent acknowledges the fact but does not act on it. "
        "1: agent acts as if the fact were not in context."
    ),
    grading_prompt_template=(
        "Session transcript (truncated):\n{transcript}\n\n"
        "Injected fact: {fact}\nInjection position: {position_pct} percent.\n"
        "User request: {request}\nAgent response: {response}\n"
        "Score (1-5):"
    ),
)
The number that matters is the ratio: middle-position score divided by the average of the edge-position scores. A ratio above 0.95 means the model handles your documents at depth. A ratio between 0.85 and 0.95 means the middle is starting to fail and you need a recovery pattern (a goal-pinning step, a hierarchical strategy, a summary cascade). Below 0.85 means the middle is unreliable on your traffic and you should stop trusting the model to recall facts past the recent buffer without retrieval.
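The ratio and its three bands reduce to a few lines. This is a sketch with hypothetical function names; the inputs are mean judge scores at the 10, 50, and 90 percent injection positions.

```python
def lost_in_middle_ratio(score_10, score_50, score_90):
    """Middle-position score divided by the mean of the two edge-position scores."""
    edge_mean = (score_10 + score_90) / 2
    if edge_mean <= 0:
        raise ValueError("edge scores must be positive")
    return score_50 / edge_mean


def classify(ratio):
    """Map the ratio onto the three bands described above."""
    if ratio > 0.95:
        return "holds at depth"
    if ratio >= 0.85:
        return "middle degrading: add a recovery pattern"
    return "middle unreliable: require retrieval"


ratio = lost_in_middle_ratio(score_10=4.6, score_50=4.0, score_90=4.4)
verdict = classify(ratio)
```

With edge scores of 4.6 and 4.4 and a middle score of 4.0, the ratio is 4.0 / 4.5, roughly 0.89, which lands in the middle band.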
Promote failing scenarios into the golden set weekly. The LLM evaluation playbook covers the promotion mechanics; the addition here is that each scenario carries the injection position and the depth so the cliff number stays current as session lengths grow with usage.
Test 3: attention-budget cost analysis
The third test ties cost to fidelity. Long context costs more per call and runs slower per call. The eval question is not whether either is true; it is whether the quality gain at higher tokens earns the cost. The number that answers it is cost-per-effective-token: the cost of the call divided by the rubric score at that depth, with the score normalised to 0-1.
A 32k call that costs $0.10 and scores 0.85 has a cost-per-effective-token of about $0.118 per scored point. The same model at 96k that costs $0.30 and scores 0.62 has $0.484 per scored point — four times more expensive for worse output. The plot is a curve per model per strategy, and the Pareto knee names itself.
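The arithmetic above can be checked in a few lines. This sketch divides the cost of the call by the rubric score on a 0-1 scale; the function name is an assumption.

```python
def cost_per_effective_token(cost_usd, quality_score):
    """Cost of the call divided by its rubric score (score on a 0-1 scale)."""
    if quality_score <= 0:
        raise ValueError("quality_score must be positive")
    return cost_usd / quality_score


shallow = cost_per_effective_token(0.10, 0.85)  # the 32k call
deep = cost_per_effective_token(0.30, 0.62)     # the 96k call
print(f"{shallow:.3f} {deep:.3f} {deep / shallow:.1f}x")
# prints: 0.118 0.484 4.1x
```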
The Agent Command Center gateway returns cost and latency on every call. Attach both to the trace span:
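import os

import requests

# Assumption: the key is read from an environment variable; the variable name is illustrative.
API_KEY = os.environ.get("FAGI_API_KEY", "")


def chat_with_attention_budget(messages, model, strategy):
    """Call the gateway and return the response body plus cost and latency metadata."""
    r = requests.post(
        "https://gateway.futureagi.com/v1/chat/completions",
        json={"model": model, "messages": messages},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=60,
    )
    r.raise_for_status()
    # Cost, latency, and the routed model arrive as response headers.
    cost = float(r.headers.get("x-prism-cost", 0))
    latency_ms = float(r.headers.get("x-prism-latency-ms", 0))
    model_used = r.headers.get("x-prism-model-used", "unknown")
    meta = {"cost": cost, "latency_ms": latency_ms, "model": model_used, "strategy": strategy}
    return r.json(), meta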
Bin by llm.input_tokens and the curve emerges. For most workloads, naive context loses on cost-per-effective-token past 32k. Sliding window wins on cost but loses on quality on long-horizon tasks. Summarize-and-compact wins the median Pareto point at 32-64k. Hierarchical wins past 64k, at the cost of engineering complexity. The point is not to memorise that; the plot tells you the answer for your workload. The LLM cost optimisation post covers the gateway headers; the addition here is dividing cost by rubric score, not just reporting cost.
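A sketch of the binning step is shown below. It assumes spans have been exported as plain dicts with input_tokens, strategy, cost, and score keys (hypothetical names), and it uses a 16k bin width chosen only for illustration.

```python
from collections import defaultdict


def curve_by_depth(spans, bin_width=16_000):
    """Return {(strategy, bin_start): total cost / total score} per input-token bin."""
    sums = defaultdict(lambda: [0.0, 0.0])  # [total cost, total score]
    for span in spans:
        bin_start = (span["input_tokens"] // bin_width) * bin_width
        entry = sums[(span["strategy"], bin_start)]
        entry[0] += span["cost"]
        entry[1] += span["score"]
    return {
        key: cost / score
        for key, (cost, score) in sorted(sums.items())
        if score > 0
    }


# Made-up spans, for illustration only.
example_spans = [
    {"input_tokens": 20_000, "strategy": "naive", "cost": 0.10, "score": 0.85},
    {"input_tokens": 30_000, "strategy": "naive", "cost": 0.12, "score": 0.80},
    {"input_tokens": 100_000, "strategy": "naive", "cost": 0.30, "score": 0.62},
]
curve = curve_by_depth(example_spans)
```

Plotting the values against bin_start, one line per strategy, gives the curve described above.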
Compute attention-budget efficiency as a single rubric so the cliff lands in your dashboards:
AttentionBudgetEfficiency = CustomLLMJudge(
    name="AttentionBudgetEfficiency",
    rubric=(
        "Given (input_tokens, cost_usd, quality_score), score 1-5 the "
        "cost-per-effective-token quality. 5: cost_per_effective_token is "
        "within 10 percent of the best model+depth cell on this workload. "
        "1: more than 3x worse than the best cell."
    ),
    grading_prompt_template=(
        "Tokens: {tokens}\nCost: ${cost}\nQuality score: {score}\n"
        "Best cell cost-per-effective-token: ${best}\nScore (1-5):"
    ),
)
Production patterns that survive past 32k
The three tests tell you where each pattern breaks. Four patterns are worth running through them.
Naive context. Append turns until the window overflows. Cheapest to implement, the most common silent failure mode, and the right baseline for the Pareto plot. The agent crosses the per-model cliff in production and the single-turn rubric does not notice.
Chunking plus reranking. Retrieve top-k chunks from a vector store, rerank with a cross-encoder, feed the top three into a short context. The active window stays small, so fidelity holds well past 32k. The failure mode shifts to the retriever: ChunkAttribution and ChunkUtilization become the rubrics that matter, covered in the RAG metrics deep dive. Pair this pattern with a cross-encoder reranker; the rerankers post covers selection.
Summary cascade. Keep the last K turns verbatim. Summarise older blocks into a few hundred tokens each. Index summaries for selective recall when the agent needs an older fact. Better than naive sliding because the goal stated in turn 1 survives compaction; worse than hierarchical because the summary itself becomes the source of truth for everything past the recent buffer. The rubric to watch is compaction faithfulness: if the summary drops a fact, the agent drops the fact.
Hierarchical context. Combine a small recent buffer, a global compacted summary, and per-turn retrieval over a session index. The most expensive to engineer, the best Pareto point past 64k on hard workloads, and the pattern where failures cluster around the retrieval step rather than the compaction step.
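To make the summary cascade concrete, here is a minimal sketch of the context builder. The summarise argument stands in for an LLM summarisation call and is a placeholder; keep_last and block_size are illustrative defaults, not recommendations from the source.

```python
def build_cascade_context(turns, summarise, keep_last=6, block_size=10):
    """Keep the last keep_last turns verbatim and summarise older turns in blocks."""
    split = max(0, len(turns) - keep_last)
    older, recent = turns[:split], turns[split:]
    summaries = [
        summarise(older[i:i + block_size])
        for i in range(0, len(older), block_size)
    ]
    return summaries + recent


def stub_summarise(block):
    """Placeholder summariser: a real one would call a model and preserve key facts."""
    return f"[summary of {len(block)} turns starting at: {block[0]}]"


turns = [f"turn {i}" for i in range(25)]
context = build_cascade_context(turns, stub_summarise)
```

With 25 turns, the 19 older turns collapse into two summaries and the last six stay verbatim. Whatever the summariser drops is gone for the agent, which is why compaction faithfulness is the rubric to watch.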
The right comparison is each pattern crossed with each model crossed with each context depth, scored on NeedlePositionFidelity, LostInMiddleRatio, AttentionBudgetEfficiency, Groundedness, and TaskCompletion. That is a lot of cells, which is why the four distributed runners in the SDK (Celery, Ray, Temporal, Kubernetes) matter — the distributed runners post covers selection.
Instrumenting context dimensions with traceAI
Score nothing until the trace tree carries the context dimensions. Four attributes per LLM span: input-token count, context strategy, context position (when injecting a fact for the lost-in-middle test), and session ID for cross-turn grouping.
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType, SpanAttributes
from opentelemetry import trace

tracer_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="context-window-eval",
)
tracer = trace.get_tracer(__name__)


# count_tokens and client are assumed to be defined elsewhere:
# a tokenizer helper and an OpenAI-compatible chat client.
def llm_call_with_context_span(messages, model, strategy, session_id, position_pct=None):
    input_tokens = count_tokens(messages, model=model)
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute(SpanAttributes.FI_SPAN_KIND, "LLM")
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.input_tokens", input_tokens)
        span.set_attribute("llm.context_strategy", strategy)
        span.set_attribute("session.id", session_id)
        # Only set for lost-in-middle runs, where a fact was injected at a known position.
        if position_pct is not None:
            span.set_attribute("llm.context_position_pct", position_pct)
        response = client.chat.completions.create(model=model, messages=messages)
        span.set_attribute("llm.output_tokens", response.usage.completion_tokens)
        return response
With those attributes pinned, every downstream view is a group-by away: per-strategy quality by token count, per-model cliff curve, position fidelity heatmap, cost-per-effective-token by strategy. traceAI ships 50+ AI surfaces across Python, TypeScript, Java, and C#, with pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) so spans ingest into Phoenix or Traceloop without re-instrumenting. The instrumenting your AI agent post covers framework wiring; the addition here is the set of context attributes.
Cross-turn grouping is what makes session-level views possible. A 50-turn conversation has 50 spans sharing session.id. Per-session task completion is the rollup. Per-session position fidelity is the rollup. The single-turn view never shows the long-conversation failure mode.
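The session rollup is a plain group-by on session.id. The sketch below assumes spans have been exported as dicts carrying the attributes set above plus a per-span score key, which is a hypothetical name for whatever rubric score is attached.

```python
from collections import defaultdict
from statistics import mean


def rollup_by_session(spans):
    """Group spans by session.id and compute per-session summary metrics."""
    sessions = defaultdict(list)
    for span in spans:
        sessions[span["session.id"]].append(span)
    return {
        session_id: {
            "turns": len(items),
            "max_input_tokens": max(s["llm.input_tokens"] for s in items),
            "mean_score": mean(s["score"] for s in items),
        }
        for session_id, items in sessions.items()
    }


# Made-up spans, for illustration only.
example_spans = [
    {"session.id": "s1", "llm.input_tokens": 3_000, "score": 0.92},
    {"session.id": "s1", "llm.input_tokens": 34_000, "score": 0.71},
    {"session.id": "s2", "llm.input_tokens": 5_000, "score": 0.90},
]
rollup = rollup_by_session(example_spans)
```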
The 5-step setup
Five steps from “long-context shipped” to “scored per-axis on CI.”
Step 1: instrument. Add llm.input_tokens, llm.context_strategy, llm.context_position_pct, and session.id to every LLM span. Route calls through the Agent Command Center so x-prism-cost, x-prism-latency-ms, and x-prism-model-used ride on the same trace.
Step 2: build the golden set. 200-500 multi-turn scenarios that exercise context, not single-turn QA:
| Scenario type | Turns | Depths swept | What it tests |
|---|---|---|---|
| Needle injected at 5 positions | 1 | 4k-128k | Position fidelity heatmap |
| Production transcript with fact injected at 10/50/90 percent | 10-50 | 4k-96k | Lost-in-middle on your docs |
| Goal stated turn 1, asked turn 20 | 20 | 16k-64k | Compaction faithfulness |
| Same prompt run 10 times | 10 | 32k | Cache stability |
| 50-turn workflow, multiple critical facts | 50 | 64k+ | Task completion across compactions |
Promote failing production scenarios weekly; each scenario carries depth and injection position annotations so the cliff number stays current.
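A golden-set record that carries those annotations might look like the sketch below. The field names are assumptions rather than an SDK schema, and the shallowest failing depth is only one simple proxy for the cliff number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GoldenScenario:
    scenario_id: str
    scenario_type: str                       # e.g. "needle", "lost_in_middle"
    turns: int
    depth_tokens: int                        # context depth the scenario exercises
    injection_position_pct: Optional[int] = None
    passed: bool = True                      # result of the most recent run


def shallowest_failing_depth(scenarios):
    """Return the smallest depth with a failing scenario, or None if all pass."""
    failing = [s.depth_tokens for s in scenarios if not s.passed]
    return min(failing) if failing else None


golden_set_example = [
    GoldenScenario("g-001", "needle", 1, 32_000, 50, passed=True),
    GoldenScenario("g-002", "lost_in_middle", 30, 64_000, 50, passed=False),
    GoldenScenario("g-003", "lost_in_middle", 42, 96_000, 50, passed=False),
]
cliff = shallowest_failing_depth(golden_set_example)
```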
Step 3: run each pattern on each scenario. Score per-axis with the FAGI templates (Groundedness, ContextAdherence, Completeness, TaskCompletion) and the custom rubrics (NeedlePositionFidelity, LostInMiddleRatio, AttentionBudgetEfficiency):
from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextAdherence, Completeness, TaskCompletion,
)
from fi.testcases import TestCase

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

# golden_set and run_with_pattern are assumed to be defined elsewhere;
# the three custom judges are the CustomLLMJudge instances defined earlier.
scores_by_pattern = {}
for pattern in ["naive", "chunking_rerank", "summary_cascade", "hierarchical"]:
    # Keep one result per pattern instead of overwriting it on each iteration.
    scores_by_pattern[pattern] = evaluator.evaluate(
        eval_templates=[
            Groundedness(), ContextAdherence(),
            Completeness(), TaskCompletion(),
            NeedlePositionFidelity, LostInMiddleRatio,
            AttentionBudgetEfficiency,
        ],
        inputs=[
            TestCase(
                query=turn.user_text,
                response=run_with_pattern(turn, pattern),
                context=turn.context,
            )
            for turn in golden_set
        ],
    )
Parallelise per-pattern per-model per-depth on Celery, Ray, Temporal, or Kubernetes. A full sweep (4 patterns x 4 models x 6 depths x 500 scenarios = 48,000 cells) finishes overnight on a 64-worker cluster.
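The cell count is easy to verify, and the same enumeration can be sharded across workers. This sketch uses placeholder model names and a round-robin split; a real run would hand each shard to one of the distributed runners.

```python
from itertools import product

PATTERNS = ["naive", "chunking_rerank", "summary_cascade", "hierarchical"]
MODELS = ["model-a", "model-b", "model-c", "model-d"]  # placeholder names
DEPTHS = [4_000, 16_000, 32_000, 64_000, 96_000, 128_000]
SCENARIO_IDS = range(500)


def sweep_cells():
    """Enumerate every (pattern, model, depth, scenario_id) cell in the sweep."""
    return product(PATTERNS, MODELS, DEPTHS, SCENARIO_IDS)


def shard(cells, n_workers):
    """Distribute cells round-robin so each worker gets an even share."""
    shards = [[] for _ in range(n_workers)]
    for i, cell in enumerate(cells):
        shards[i % n_workers].append(cell)
    return shards


shards = shard(sweep_cells(), n_workers=64)
total_cells = sum(len(s) for s in shards)
```

On 64 workers that is 750 cells per worker.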
Step 4: plot attention-budget Pareto. Per-model, per-pattern: cost-per-effective-token versus depth. The plot is the artifact that decides which pattern ships; a single number per cell hides the tradeoff.
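Before plotting, the candidates can be reduced to the non-dominated set: cells for which no other cell is both cheaper and better. This is a minimal sketch; the first two rows reuse the numbers from the cost example earlier, and the other two are made up.

```python
def pareto_front(cells):
    """Return the non-dominated (label, cost_usd, quality_score) cells, sorted by cost."""
    front = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; keep a cell only if it beats
    # the quality of every cell that costs the same or less.
    for label, cost, score in sorted(cells, key=lambda c: (c[1], -c[2])):
        if score > best_score:
            front.append((label, cost, score))
            best_score = score
    return front


candidates = [
    ("naive@32k", 0.10, 0.85),
    ("naive@96k", 0.30, 0.62),
    ("summary_cascade@64k", 0.18, 0.88),   # made-up
    ("hierarchical@96k", 0.35, 0.91),      # made-up
]
front = pareto_front(candidates)
```

In this example naive@96k drops out because it costs more than two cells that score higher.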
Step 5: cluster failures. Error Feed inside the eval stack runs HDBSCAN soft-clustering over failing traces. A Sonnet 4.5 Judge agent writes an immediate_fix per cluster. Typical context-management clusters:
- “agent forgets fact pinned at 50 percent depth, 32k” → switch from naive to hierarchical, add a goal-pinning step
- “summary cascade drops customer’s stated preference” → tune the summariser prompt with LostInMiddleRatio as the optimisation metric
- “cost-per-effective-token cliff between 64k and 96k on smaller model” → route past the cliff to a model whose curve holds further via the gateway
Cluster fixes feed the self-improving evaluators on the Future AGI Platform. The summariser prompt itself is a prompt the agent-opt optimisers can tune against LostInMiddleRatio as the target metric (PROTEGI, GEPA, MetaPrompt, BayesianSearch, RandomSearch, PromptWizard — each with EarlyStoppingConfig).
How Future AGI ships the context-window eval stack
Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.
- ai-evaluation SDK (Apache 2.0): from fi.evals import Evaluator. Ready-to-use templates: Groundedness (eval_id 47), ContextAdherence (5), ContextRelevance (9), Completeness (10), TaskCompletion, AnswerRefusal (88). CustomLLMJudge for the position-aware rubrics (NeedlePositionFidelity, LostInMiddleRatio, AttentionBudgetEfficiency). Local NLI-based equivalents that run on DeBERTa with no API call. Four distributed runners (Celery, Ray, Temporal, Kubernetes) for the 48k-cell sweep.
- traceAI (Apache 2.0): 50+ AI surfaces across Python, TypeScript, Java, C#; 14 span kinds; pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY); 62 built-in evals via EvalTag for span-attached scoring on live traces.
- Future AGI Platform: self-improving evaluators tuned by production thumbs-up/down so LostInMiddleRatio stays aligned with what your agent authors flag as critical; in-product authoring agent generates position-aware rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- Agent Command Center gateway: cost, latency, and model-used headers per call so cost-per-effective-token computes on the same trace; 100+ providers; 18+ built-in guardrail scanners; ~29k req/s, P99 21 ms with guardrails on, t3.xlarge.
- Error Feed (inside the eval stack): HDBSCAN clustering on ClickHouse-stored span embeddings groups context-failure traces into named issues; Sonnet 4.5 Judge writes the immediate_fix; engineer-reviewed promotions feed the dataset and the Platform’s self-improving evaluators. Linear is wired today; Slack, GitHub, Jira, PagerDuty are on the roadmap.
Ready to wire the three tests into a CI gate? Bind NeedlePositionFidelity, LostInMiddleRatio, and AttentionBudgetEfficiency into a pytest fixture this afternoon against the ai-evaluation SDK, then add the traceAI instrumentor when production traces start asking questions the CI gate missed.
Honest scope
The trace-stream-to-agent-opt connector is roadmap. Today, eval-driven optimisation on the summariser prompt itself ships via the six agent-opt optimisers, with LostInMiddleRatio as the eval target. Direct ingestion of live traces into the optimiser dataset is the next connector; the bridge today is “promote failing traces into the golden set, rerun the optimiser.” Error Feed integrates with Linear; Slack, GitHub, Jira, PagerDuty are on the roadmap.
Context window management is the agent component most likely to be evaluated on vibes and the one where vibes fail worst once conversations cross 32k tokens. Pick the three tests, build the multi-turn golden set, instrument the context attributes on every LLM span, compare the four patterns on an attention-budget Pareto, and let the cluster pass turn failures into named issues. Long-context support is the marketing number; long-context fidelity is the one your CI gate has to defend.