
Evaluating LLM Context Window Management in 2026

Long-context support is marketing. Long-context fidelity is what you eval: NIAH at every position, lost-in-middle on your docs, attention-budget cost.


A support agent ships on Monday with 3k tokens of context and a 0.92 task-completion score. By Thursday, sessions stretch past 32k tokens and the same agent scores 0.71 on the same rubric. The prompt did not change. The model did not change. What changed is that the agent crossed a fidelity threshold no one was watching, on a model whose datasheet still says 200,000 tokens. Long-context support is marketing; long-context fidelity is what you eval. The methodology that earns the word “fidelity” is three tests run together: needle-in-haystack at multiple positions, lost-in-middle on your own documents, and attention-budget cost analysis. Without those three, a “200K context window” is a sticker, not a capability.

This post is the working pattern: the three tests, the production patterns that survive past 32k tokens, and the Future AGI (FAGI) surfaces that instrument and score them. The reader leaves with rubrics that pin failures to position, depth, or strategy, and with a cost-per-effective-token curve that decides which pattern ships.

Context length is not context fidelity

The number on the model card is the maximum input the model accepts, not the depth it can reason over. Models advertised at 200k tokens hit a cliff somewhere between 32k and 64k on production workloads, and the cliff is not a slow degradation: coherence sits at 0.9 at the limit and 0.6 a few thousand tokens past it. The drop is a step function whose location is per-model: Claude Sonnet 4.5 holds further than smaller models, GPT-5.1 has its own knee, Gemini 2.5 Pro has another, and the public needle-in-haystack score from any of them stays above 0.95 well past the point where actual task completion has collapsed.

Three properties make this a separate evaluation problem from a normal prompt change. Production agents cross 32k by week two — a 5k-per-session chatbot crosses 32k by turn 12 once you add system prompts, tool definitions, retrieved documents, and a memory snapshot. Public needle-in-haystack lies — a 100k context with one sentence to retrieve is a different problem than a 100k context full of overlapping facts where the model has to keep three straight and forget the rest. And quality drops are invisible to single-turn rubrics — every rubric in the SDK assumes one prompt, one response, and the failure modes of long context fall through that frame.

The fix is structural: stop scoring context windows like prompts and start scoring them like systems. Three tests, run together.

Test 1: needle-in-haystack at multiple positions

The first test is the one most teams already think they run. The trap is that the public NIAH benchmark uses one position (usually random) and one shape of distractor (random text). A production-grade NIAH eval sweeps positions and uses shape-similar distractors so the model cannot win on lexical luck.

Sweep positions: start, 25 percent, middle, 75 percent, end. Sweep depths: 4k, 16k, 32k, 64k, 96k, 128k. Mix the needle with distractors that look like the needle but answer a different question — same entity types, same sentence structure, different facts. Score on a custom rubric that checks whether the response acts on the needle, not just whether it surfaces it.

from fi.evals.templates import CustomLLMJudge

NeedlePositionFidelity = CustomLLMJudge(
    name="NeedlePositionFidelity",
    rubric=(
        "Score 1-5 whether the response correctly used the target fact. "
        "5: response cites the exact fact and acts on it. "
        "3: response cites a related-but-wrong fact from elsewhere in context. "
        "1: response invents a fact, ignores the context, or picks a distractor."
    ),
    grading_prompt_template=(
        "Long context (truncated):\n{context}\n\n"
        "Fact position: {position_pct} percent through context.\n"
        "Question: {question}\nExpected answer: {expected}\n"
        "Model response: {response}\nScore (1-5):"
    ),
)

Run this across the 30-cell sweep (5 positions x 6 depths) per model in your routing pool. The middle position is where the lost-in-middle effect lives, and it is where every cliff hides first. Pair the custom rubric with Groundedness (eval_id 47) and ContextAdherence (5) from the ai-evaluation SDK so the same trace also gets the standard rubric family. The LLM evaluation playbook explains the rubric-family framing.
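A minimal sketch of how the 30-cell grid and the needle placement could be generated. The helper names (`sweep_cells`, `place_needle`) and the choice to model the haystack as a list of sentences are illustrative assumptions, not part of the ai-evaluation SDK.

```python
from itertools import product

# Positions as percent through the context; depths in tokens (the sweep described above).
POSITIONS_PCT = [0, 25, 50, 75, 100]
DEPTHS = [4_000, 16_000, 32_000, 64_000, 96_000, 128_000]


def sweep_cells():
    """Return every (position_pct, depth) pair in the sweep."""
    return list(product(POSITIONS_PCT, DEPTHS))


def place_needle(haystack_sentences, needle, position_pct):
    """Insert the needle at position_pct percent through the haystack."""
    index = round(len(haystack_sentences) * position_pct / 100)
    return haystack_sentences[:index] + [needle] + haystack_sentences[index:]


cells = sweep_cells()
demo = place_needle(["d1", "d2", "d3", "d4"], "NEEDLE", 50)
```

Each cell then becomes one haystack per model: distractors trimmed to the target depth, needle placed at the target position.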

The artifact is a heatmap: model on one axis, depth on another, position colour-coded. The cells that drop below your tolerance (typically 0.85 of the start-position score) are where the model loses fidelity. The cliff is not at the advertised window; it is wherever the heatmap goes cold.
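One way to turn the heatmap into a list of failing cells, sketched under the assumption that scores are already averaged per (model, depth, position) cell. The function name `cold_cells` and the sample numbers are made up for illustration; the 0.85 tolerance is the one named above.

```python
def cold_cells(scores, tolerance=0.85, baseline_position=0):
    """scores maps (model, depth, position_pct) -> mean rubric score.

    Returns the cells whose score falls below tolerance times the
    start-position score for the same model and depth.
    """
    cold = []
    for (model, depth, position), score in scores.items():
        baseline = scores[(model, depth, baseline_position)]
        if score < tolerance * baseline:
            cold.append((model, depth, position))
    return sorted(cold)


# Invented sample scores, not measurements.
sample_scores = {
    ("model-a", 32_000, 0): 0.90,
    ("model-a", 32_000, 50): 0.80,
    ("model-a", 64_000, 0): 0.90,
    ("model-a", 64_000, 50): 0.60,
}
flagged = cold_cells(sample_scores)
```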

Test 2: lost-in-middle on your documents

Public NIAH uses synthetic essays. Your agent processes order histories, ticket threads, contract sections, and code reviews. The lost-in-middle effect on synthetic text does not predict the effect on your text, because the distractors in your text actually look like the needle.

Take 50-200 representative session transcripts from production traces. For each, identify a fact the agent must act on later — an order ID, a stated preference, an escalation threshold, a constraint set in turn 1 and asked about by turn 30. Inject that fact at three positions: 10 percent through the transcript, 50 percent, 90 percent. Pad each variant to the same total token count using shape-similar turns from other sessions. Run the agent on each variant.
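The variant construction can be sketched as follows. Everything here is a stand-in: `count_tokens` uses a whitespace split rather than a real tokenizer, `build_variant` is a hypothetical helper, and padding by appending filler turns before injecting the fact is one design choice among several.

```python
def count_tokens(text):
    # Stand-in tokenizer: whitespace word count. Swap in your model's tokenizer.
    return len(text.split())


def build_variant(turns, filler_turns, fact_turn, position_pct, target_tokens):
    """Pad the transcript with filler turns, then inject the fact turn.

    Padding happens first so every position variant has the same total length.
    """
    padded = list(turns)
    used = sum(count_tokens(t) for t in padded) + count_tokens(fact_turn)
    for filler in filler_turns:
        cost = count_tokens(filler)
        if used + cost > target_tokens:
            break
        padded.append(filler)
        used += cost
    index = round(len(padded) * position_pct / 100)
    return padded[:index] + [fact_turn] + padded[index:]


turns = ["user: hi there", "agent: hello again"]
fillers = ["user: where is my parcel"] * 10
fact = "user: order 1234 is urgent"
variants = {pct: build_variant(turns, fillers, fact, pct, 40) for pct in (10, 50, 90)}
```

Because all three variants share the same turns and the same padding, any score difference between them is attributable to the injection position.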

The rubric scores whether the response acted on the injected fact:

LostInMiddleRatio = CustomLLMJudge(
    name="LostInMiddleRatio",
    rubric=(
        "Score 1-5 whether the response acts on the injected fact. "
        "5: agent applies the fact correctly to the user's request. "
        "3: agent acknowledges the fact but does not act on it. "
        "1: agent acts as if the fact were not in context."
    ),
    grading_prompt_template=(
        "Session transcript (truncated):\n{transcript}\n\n"
        "Injected fact: {fact}\nInjection position: {position_pct} percent.\n"
        "User request: {request}\nAgent response: {response}\n"
        "Score (1-5):"
    ),
)

The number that matters is the ratio: middle-position score divided by the average of the edge-position scores. A ratio above 0.95 means the model handles your documents at depth. A ratio between 0.85 and 0.95 means the middle is starting to fail and you need a recovery pattern (a goal-pinning step, a hierarchical strategy, a summary cascade). Below 0.85 means the middle is unreliable on your traffic and you should stop trusting the model to recall facts past the recent buffer without retrieval.
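The ratio and its three bands are simple enough to compute outside the judge. A sketch, with hypothetical function names and band labels, using the thresholds stated above:

```python
from statistics import mean


def lost_in_middle_ratio(score_10, score_50, score_90):
    """Middle-position score divided by the average of the two edge scores."""
    return score_50 / mean([score_10, score_90])


def classify(ratio):
    """Map the ratio onto the three bands described above."""
    if ratio > 0.95:
        return "holds at depth"
    if ratio >= 0.85:
        return "needs a recovery pattern"
    return "middle unreliable"


# Invented example: edge scores 4.6 and 4.4, middle score 4.0 on the 1-5 rubric.
ratio = lost_in_middle_ratio(4.6, 4.0, 4.4)
band = classify(ratio)
```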

Promote failing scenarios into the golden set weekly. The LLM evaluation playbook covers the promotion mechanics; the addition here is that each scenario carries the injection position and the depth so the cliff number stays current as session lengths grow with usage.

Test 3: attention-budget cost analysis

The third test ties cost to fidelity. Long context costs more per call and runs slower per call. The eval question is not whether either is true; it is whether the quality gain at higher tokens earns the cost. The number that answers it is cost-per-effective-token: the dollar cost of the call divided by the rubric score at that depth.

A 32k call that costs $0.10 and scores 0.85 has a cost-per-effective-token of about $0.118 per scored point. The same model at 96k that costs $0.30 and scores 0.62 has $0.484 per scored point — four times more expensive for worse output. The plot is a curve per model per strategy, and the Pareto knee names itself.
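The arithmetic behind those two figures, as a small sketch. The function name is an assumption; the formula follows the worked numbers, which divide the per-call cost by the rubric score.

```python
def cost_per_effective_token(call_cost_usd, rubric_score):
    """Dollar cost of a call divided by its rubric score (0-1 scale)."""
    if rubric_score <= 0:
        raise ValueError("rubric_score must be positive")
    return call_cost_usd / rubric_score


at_32k = cost_per_effective_token(0.10, 0.85)  # about 0.118
at_96k = cost_per_effective_token(0.30, 0.62)  # about 0.484
ratio = at_96k / at_32k                        # a little over 4x
```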

The Agent Command Center gateway returns cost and latency on every call. Attach both to the trace span:

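import os

import requests

# Assumption: the key is read from an environment variable; the name is illustrative.
API_KEY = os.environ["FAGI_API_KEY"]

def chat_with_attention_budget(messages, model, strategy):
    r = requests.post(
        "https://gateway.futureagi.com/v1/chat/completions",
        json={"model": model, "messages": messages},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=60,
    )
    r.raise_for_status()  # surface gateway errors instead of parsing an error body
    # Cost, latency, and the routed model arrive as response headers.
    cost = float(r.headers.get("x-prism-cost", 0))
    latency_ms = float(r.headers.get("x-prism-latency-ms", 0))
    model_used = r.headers.get("x-prism-model-used", "unknown")
    meta = {"cost": cost, "latency_ms": latency_ms, "model": model_used, "strategy": strategy}
    return r.json(), meta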

Bin by llm.input_tokens and the curve emerges. For most workloads, naive context loses on cost-per-effective-token past 32k. Sliding window wins on cost but loses on quality on long-horizon tasks. Summarize-and-compact wins the median Pareto point at 32-64k. Hierarchical wins past 64k, at the cost of engineering complexity. The point is not to memorise that; the plot tells you the answer for your workload. The LLM cost optimisation post covers the gateway headers; the addition here is dividing cost by rubric score, not just reporting cost.
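A sketch of the binning step, assuming each trace has already been flattened into a dict with `strategy`, `input_tokens`, `cost`, and `score` keys. Those key names, the bin edges, and the sample records are illustrative, not a traceAI export format.

```python
from collections import defaultdict

BIN_EDGES = [4_000, 16_000, 32_000, 64_000, 96_000, 128_000]


def depth_bin(input_tokens):
    """Smallest bin edge that holds the call; overflow lands in the last bin."""
    for edge in BIN_EDGES:
        if input_tokens <= edge:
            return edge
    return BIN_EDGES[-1]


def cost_curve(records):
    """Return {(strategy, bin): total cost / total score} for each bin."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for rec in records:
        key = (rec["strategy"], depth_bin(rec["input_tokens"]))
        totals[key][0] += rec["cost"]
        totals[key][1] += rec["score"]
    return {
        key: (cost / score if score > 0 else float("inf"))
        for key, (cost, score) in totals.items()
    }


# Invented sample records.
records = [
    {"strategy": "naive", "input_tokens": 30_000, "cost": 0.10, "score": 0.85},
    {"strategy": "naive", "input_tokens": 31_500, "cost": 0.10, "score": 0.85},
    {"strategy": "naive", "input_tokens": 90_000, "cost": 0.30, "score": 0.62},
]
curve = cost_curve(records)
```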

Compute attention-budget efficiency as a single rubric so the cliff lands in your dashboards:

AttentionBudgetEfficiency = CustomLLMJudge(
    name="AttentionBudgetEfficiency",
    rubric=(
        "Given (input_tokens, cost_usd, quality_score), score 1-5 the "
        "cost-per-effective-token quality. 5: cost_per_effective_token is "
        "within 10 percent of the best model+depth cell on this workload. "
        "1: more than 3x worse than the best cell."
    ),
    grading_prompt_template=(
        "Tokens: {tokens}\nCost: ${cost}\nQuality score: {score}\n"
        "Best cell cost-per-effective-token: ${best}\nScore (1-5):"
    ),
)

Production patterns that survive past 32k

The three tests tell you where each pattern breaks. Four patterns are worth running through them.

Naive context. Append turns until the window overflows. Cheapest to implement, the most common silent failure mode, and the right baseline for the Pareto plot. The agent crosses the per-model cliff in production and the single-turn rubric does not notice.

Chunking plus reranking. Retrieve top-k chunks from a vector store, rerank with a cross-encoder, feed the top three into a short context. The active window stays small, so fidelity holds well past 32k. The failure mode shifts to the retriever: ChunkAttribution and ChunkUtilization become the rubrics that matter, covered in the RAG metrics deep dive. Pair this pattern with a cross-encoder reranker; the rerankers post covers selection.

Summary cascade. Keep the last K turns verbatim. Summarise older blocks into a few hundred tokens each. Index summaries for selective recall when the agent needs an older fact. Better than naive sliding because the goal stated in turn 1 survives compaction; worse than hierarchical because the summary itself becomes the source of truth for everything past the recent buffer. The rubric to watch is compaction faithfulness: if the summary drops a fact, the agent drops the fact.

Hierarchical context. Combine a small recent buffer, a global compacted summary, and per-turn retrieval over a session index. The most expensive to engineer, the best Pareto point past 64k on hard workloads, and the pattern where failures cluster around the retrieval step rather than the compaction step.
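To make the compaction step concrete, here is a minimal sketch of the summary-cascade split. The `summarise` argument is a hypothetical callable (in practice an LLM call); the stand-in below only counts turns. `keep_last` and `block_size` are assumed parameter names.

```python
def summary_cascade(turns, summarise, keep_last=4, block_size=6):
    """Split a session into compacted summaries plus a verbatim recent buffer.

    Older turns are summarised in blocks of block_size; the last keep_last
    turns are kept word for word.
    """
    split = max(len(turns) - keep_last, 0)
    older, recent = turns[:split], turns[split:]
    summaries = [
        summarise(older[i:i + block_size])
        for i in range(0, len(older), block_size)
    ]
    return summaries, recent


def fake_summarise(block):
    # Stand-in for a real summariser; it records only the block size.
    return f"[summary of {len(block)} turns]"


session = [f"turn {i}" for i in range(1, 21)]
summaries, recent = summary_cascade(session, fake_summarise)
```

A hierarchical variant would add a retrieval step over the summaries or a session index instead of placing every summary in the prompt.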

The right comparison is each pattern crossed with each model crossed with each context depth, scored on NeedlePositionFidelity, LostInMiddleRatio, AttentionBudgetEfficiency, Groundedness, and TaskCompletion. That is a lot of cells, which is why the four distributed runners in the SDK (Celery, Ray, Temporal, Kubernetes) matter — the distributed runners post covers selection.

Instrumenting context dimensions with traceAI

Score nothing until the trace tree carries the context dimensions. Three attributes per LLM span are required: input-token count, context position (when injecting a fact for the lost-in-middle test), and session ID for cross-turn grouping. A fourth, the context strategy, lets the same views group by pattern.

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType, SpanAttributes
from opentelemetry import trace

tracer_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="context-window-eval",
)
tracer = trace.get_tracer(__name__)

# `client` (an OpenAI-compatible chat client) and `count_tokens` (a tokenizer
# helper) are assumed to be defined elsewhere in your codebase.
def llm_call_with_context_span(messages, model, strategy, session_id, position_pct=None):
    input_tokens = count_tokens(messages, model=model)
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute(SpanAttributes.FI_SPAN_KIND, "LLM")
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.input_tokens", input_tokens)
        span.set_attribute("llm.context_strategy", strategy)
        span.set_attribute("session.id", session_id)
        # Position is only set on lost-in-middle test runs, not on live traffic.
        if position_pct is not None:
            span.set_attribute("llm.context_position_pct", position_pct)
        response = client.chat.completions.create(model=model, messages=messages)
        span.set_attribute("llm.output_tokens", response.usage.completion_tokens)
        return response

With those attributes pinned, every downstream view is a group-by away: per-strategy quality by token count, per-model cliff curve, position fidelity heatmap, cost-per-effective-token by strategy. traceAI ships 50+ AI surfaces across Python, TypeScript, Java, and C#, with pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) so spans ingest into Phoenix or Traceloop without re-instrumenting. The instrumenting your AI agent post covers framework wiring; the addition here is the three context attributes.

Cross-turn grouping is what makes session-level views possible. A 50-turn conversation has 50 spans sharing session.id. Per-session task completion is the rollup. Per-session position fidelity is the rollup. The single-turn view never shows the long-conversation failure mode.
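The rollup itself is a group-by on `session.id`. A sketch, assuming spans have been exported as flat dicts; the metric key `position_fidelity` is a hypothetical name for whichever score you attach.

```python
from collections import defaultdict
from statistics import mean


def rollup_by_session(spans, metric):
    """Average a per-span metric over all spans that share a session.id."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["session.id"]].append(span[metric])
    return {session_id: mean(values) for session_id, values in grouped.items()}


# Invented sample spans.
spans = [
    {"session.id": "s1", "position_fidelity": 1.0},
    {"session.id": "s1", "position_fidelity": 0.5},
    {"session.id": "s2", "position_fidelity": 0.8},
]
per_session = rollup_by_session(spans, "position_fidelity")
```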

The 5-step setup

Five steps from “long-context shipped” to “scored per-axis on CI.”

Step 1: instrument. Add llm.input_tokens, llm.context_strategy, llm.context_position_pct, and session.id to every LLM span. Route calls through the Agent Command Center so x-prism-cost, x-prism-latency-ms, and x-prism-model-used ride on the same trace.

Step 2: build the golden set. 200-500 multi-turn scenarios that exercise context, not single-turn QA:

| Scenario type | Turns | Depths swept | What it tests |
|---|---|---|---|
| Needle injected at 5 positions | 1 | 4k-128k | Position fidelity heatmap |
| Production transcript with fact injected at 10/50/90 percent | 10-50 | 4k-96k | Lost-in-middle on your docs |
| Goal stated turn 1, asked turn 20 | 20 | 16k-64k | Compaction faithfulness |
| Same prompt run 10 times | 10 | 32k | Cache stability |
| 50-turn workflow, multiple critical facts | 50 | 64k+ | Task completion across compactions |

Promote failing production scenarios weekly; each scenario carries depth and injection position annotations so the cliff number stays current.

Step 3: run each pattern on each scenario. Score per-axis with the FAGI templates (Groundedness, ContextAdherence, Completeness, TaskCompletion) and the custom rubrics (NeedlePositionFidelity, LostInMiddleRatio, AttentionBudgetEfficiency):

from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextAdherence, Completeness, TaskCompletion,
)
from fi.testcases import TestCase

# NeedlePositionFidelity, LostInMiddleRatio, and AttentionBudgetEfficiency are the
# CustomLLMJudge rubrics defined earlier; golden_set and run_with_pattern are your own.
evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

scores_by_pattern = {}  # keep every pattern's results instead of overwriting them
for pattern in ["naive", "chunking_rerank", "summary_cascade", "hierarchical"]:
    scores_by_pattern[pattern] = evaluator.evaluate(
        eval_templates=[
            Groundedness(), ContextAdherence(),
            Completeness(), TaskCompletion(),
            NeedlePositionFidelity, LostInMiddleRatio,
            AttentionBudgetEfficiency,
        ],
        inputs=[
            TestCase(
                query=turn.user_text,
                response=run_with_pattern(turn, pattern),
                context=turn.context,
            )
            for turn in golden_set
        ],
    )

Parallelise per-pattern per-model per-depth on Celery, Ray, Temporal, or Kubernetes. A full sweep (4 patterns x 4 models x 6 depths x 500 scenarios = 48,000 cells) finishes overnight on a 64-worker cluster.
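The cell count and a naive work split can be checked in a few lines. The model names are placeholders and the round-robin sharding is an illustrative assumption, not how any of the four runners schedules work.

```python
from itertools import product

patterns = ["naive", "chunking_rerank", "summary_cascade", "hierarchical"]
models = ["model-a", "model-b", "model-c", "model-d"]  # placeholder names
depths = [4_000, 16_000, 32_000, 64_000, 96_000, 128_000]
scenario_ids = range(500)

# One cell per (pattern, model, depth, scenario) combination.
all_cells = list(product(patterns, models, depths, scenario_ids))


def shard(cells, n_workers):
    """Round-robin split of cells across workers."""
    return [cells[i::n_workers] for i in range(n_workers)]


shards = shard(all_cells, 64)
```

On 64 workers that is 750 cells each, which is what makes the overnight budget plausible.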

Step 4: plot attention-budget Pareto. Per-model, per-pattern: cost-per-effective-token versus depth. The plot is the artifact that decides which pattern ships; a single number per cell hides the tradeoff.
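Before plotting, the same data can name a winner per depth. A sketch with a hypothetical helper and invented numbers; the input maps each (pattern, depth) cell to its mean call cost and mean rubric score.

```python
def best_pattern_per_depth(cells):
    """cells maps (pattern, depth) -> (cost_usd, score).

    Returns {depth: pattern with the lowest cost-per-effective-token}.
    """
    best = {}
    for (pattern, depth), (cost, score) in cells.items():
        cpet = cost / score
        if depth not in best or cpet < best[depth][1]:
            best[depth] = (pattern, cpet)
    return {depth: pattern for depth, (pattern, _) in best.items()}


# Invented sample cells, not measurements.
sample_cells = {
    ("naive", 32_000): (0.10, 0.85),
    ("summary_cascade", 32_000): (0.06, 0.80),
    ("naive", 96_000): (0.30, 0.62),
    ("hierarchical", 96_000): (0.12, 0.84),
}
winners = best_pattern_per_depth(sample_cells)
```

The per-depth winner is a summary, not a replacement for the plot: it hides how close the runner-up is.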

Step 5: cluster failures. Error Feed inside the eval stack runs HDBSCAN soft-clustering over failing traces. A Sonnet 4.5 Judge agent writes an immediate_fix per cluster. Typical context-management clusters:

  • “agent forgets fact pinned at 50 percent depth, 32k” → switch from naive to hierarchical, add a goal-pinning step
  • “summary cascade drops customer’s stated preference” → tune the summariser prompt with LostInMiddleRatio as the optimisation metric
  • “cost-per-effective-token cliff between 64k and 96k on smaller model” → route past the cliff to a model whose curve holds further via the gateway

Cluster fixes feed the self-improving evaluators on the Future AGI Platform. The summariser prompt itself is a prompt the agent-opt optimisers can tune against LostInMiddleRatio as the target metric (PROTEGI, GEPA, MetaPrompt, BayesianSearch, RandomSearch, PromptWizard — each with EarlyStoppingConfig).

How Future AGI ships the context-window eval stack

Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals. Graduate to the Platform when you want self-improving rubrics authored by an in-product agent.

  • ai-evaluation SDK (Apache 2.0): from fi.evals import Evaluator. Ready-to-use templates: Groundedness (eval_id 47), ContextAdherence (5), ContextRelevance (9), Completeness (10), TaskCompletion, AnswerRefusal (88). CustomLLMJudge for the position-aware rubrics (NeedlePositionFidelity, LostInMiddleRatio, AttentionBudgetEfficiency). Local NLI-based equivalents that run on DeBERTa with no API call. Four distributed runners (Celery, Ray, Temporal, Kubernetes) for the 48k-cell sweep.
  • traceAI (Apache 2.0): 50+ AI surfaces across Python, TypeScript, Java, C#; 14 span kinds; pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY); 62 built-in evals via EvalTag for span-attached scoring on live traces.
  • Future AGI Platform: self-improving evaluators tuned by production thumbs-up/down so LostInMiddleRatio stays aligned with what your agent authors flag as critical; in-product authoring agent generates position-aware rubrics from natural-language descriptions; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
  • Agent Command Center gateway: cost, latency, and model-used headers per call so cost-per-effective-token computes on the same trace; 100+ providers; 18+ built-in guardrail scanners; ~29k req/s, P99 21 ms with guardrails on, t3.xlarge.
  • Error Feed (inside the eval stack): HDBSCAN clustering on ClickHouse-stored span embeddings groups context-failure traces into named issues; Sonnet 4.5 Judge writes the immediate_fix; engineer-reviewed promotions feed the dataset and the Platform’s self-improving evaluators. Linear is wired today; Slack, GitHub, Jira, PagerDuty are on the roadmap.

Ready to wire the three tests into a CI gate? Bind NeedlePositionFidelity, LostInMiddleRatio, and AttentionBudgetEfficiency into a pytest fixture this afternoon against the ai-evaluation SDK, then add the traceAI instrumentor when production traces start asking questions the CI gate missed.

Honest scope

The trace-stream-to-agent-opt connector is roadmap. Today, eval-driven optimisation on the summariser prompt itself ships via the six agent-opt optimisers, with LostInMiddleRatio as the eval target. Direct ingestion of live traces into the optimiser dataset is the next connector; the bridge today is “promote failing traces into the golden set, rerun the optimiser.” Error Feed integrates with Linear; Slack, GitHub, Jira, PagerDuty are on the roadmap.

Context window management is the agent component most likely to be evaluated on vibes and the one where vibes fail worst once conversations cross 32k tokens. Pick the three tests, build the multi-turn golden set, instrument the context attributes on every LLM span, compare the four patterns on an attention-budget Pareto, and let the cluster pass turn failures into named issues. Long-context support is the marketing number; long-context fidelity is the one your CI gate has to defend.

Frequently asked questions

What is the right way to evaluate LLM context window management?
Three methodologies, run together. Needle-in-haystack at multiple positions (start, 25 percent, middle, 75 percent, end) and multiple depths (4k, 16k, 32k, 64k, 96k, 128k) with shape-similar distractors mixed in. Lost-in-middle on your documents, not the public benchmark — pin facts your agent actually queries to the middle of representative session transcripts and score whether the response acts on them. Attention-budget cost analysis that ties cost-per-effective-token (cost divided by the rubric score at that depth) to the rubric so you can see the cliff in dollars, not just in scores. A '200K context window' that fails position 50 percent at 64k is a sticker, not a capability.
Why does context length not equal context fidelity?
Vendor context-length numbers describe how much input the model accepts, not what it can reason over. Models advertised at 200k tokens routinely hit a cliff between 32k and 64k on production workloads: coherence sits at 0.9 at the limit and 0.6 a few thousand tokens past it. The cliff is per-model and per-position: Claude Sonnet 4.5, GPT-5.1, Gemini 2.5 Pro each have different curves, and the middle of the context consistently underperforms the ends. Needle-in-haystack at one position lies about real performance because it does not surface the lost-in-middle effect and does not pair quality with cost. Fidelity is the multi-position, multi-depth, cost-aware number; length is the marketing one.
How do I run a lost-in-middle test on my own documents?
Take 50-200 representative session transcripts from production traces. For each, identify a fact your agent must act on later (an order ID, a stated preference, an escalation threshold). Inject that fact at three positions in the transcript: 10 percent through, 50 percent through, 90 percent through. At each depth you sweep (4k, 16k, 32k, 64k, 96k), pad every variant to the same total token count with shape-similar but irrelevant turns. Run the agent on each variant and score with a custom rubric that checks whether the response acted on the injected fact. The middle-position score divided by the average of the two edge-position scores is your lost-in-middle ratio. Anything below 0.85 is a position-fidelity failure on your traffic.
What is attention-budget pricing and how do I measure it?
Attention-budget pricing is cost-per-effective-token: the dollar cost of the call divided by the rubric score at that depth. A 32k call that costs $0.10 and scores 0.85 has a cost-per-effective-token of about $0.118 per scored point. The same call at 96k that costs $0.30 and scores 0.62 has about $0.484 per scored point, roughly four times more expensive for worse output. Plot this curve per model per strategy and the Pareto knee names itself. Cost-per-effective-token is the number that justifies switching from naive to hierarchical context: not the absolute cost, not the absolute quality, the ratio that pays for engineering time.
Which production patterns survive past 32k tokens?
Three. Chunking plus reranking puts retrieved chunks (top-k after a cross-encoder reranker) into a short context instead of dumping a long one, and survives well past 32k because the active window stays small. Summary cascade keeps recent turns verbatim, summarises older blocks into a few hundred tokens each, and indexes summaries for selective recall — better than naive sliding because the goal stated in turn 1 survives compaction. Hierarchical context combines a small recent buffer, a global compacted summary, and per-turn retrieval over a session index — the most expensive to engineer and the best Pareto point past 64k. The eval question is not which pattern is best in general; it is which pattern has the lowest cost-per-effective-token on your traffic.
What does Future AGI ship for context-window evaluation?
Four things tied together. The ai-evaluation SDK ships Groundedness (eval_id 47), ContextAdherence (5), ContextRelevance (9), Completeness (10), and TaskCompletion as ready-to-use templates, plus CustomLLMJudge for the position-aware rubrics (NeedlePositionFidelity, LostInMiddleRatio, AttentionBudgetEfficiency). traceAI captures llm.input_tokens, llm.context_position_pct, and session.id on every span so the multi-turn rollups work. The Agent Command Center gateway returns x-prism-cost and x-prism-latency-ms headers per call so cost-per-effective-token computes on the same trace. Error Feed clusters position-fidelity failures and writes an immediate fix per cluster, so 'agent forgets fact pinned at 50 percent depth' becomes a named issue, not a vibe.