Evaluating Deep Research Agents in 2026: A Five-Stage Rubric for Long-Horizon Runs

Aggregate quality hides which research stage broke. Score plan, retrieve, source, claim, and synthesis independently or you cannot fix anything.

February 22, 2026

Updated May 20, 2026

12 min read

deep-research-agents agent-evaluation llm-evaluation citation-grounding long-horizon-agents agent-observability 2026

A deep-research agent is evaluated on the wrong axis in 2026. The dashboard shows one aggregate quality score, the brief looks fluent, the latency line stays green, and the run ships. Three weeks later a user clicks citation 7 and the cited paper says the opposite. The aggregate hid which stage broke. The fix is to stop aggregating. Score query planning, source retrieval, source evaluation, claim-evidence alignment, and synthesis coherence as five independent axes. An aggregate over five orthogonal signals barely moves when one of them collapses, and it cannot tell you which one to fix.

This guide is the per-stage rubric set. If you have read the research-assistant monitoring playbook, this post lifts those four monitoring metrics into a five-stage eval discipline and adds query planning and synthesis coherence as first-class axes.

Why aggregate quality hides the failure

A deep-research run is a five-stage pipeline that takes minutes to hours, fires tens to hundreds of retrievals, and emits a 3,000-to-10,000-word brief with an executive summary and a citation list. Averaging across five orthogonal stages produces a number that barely moves when one stage collapses: four stages at 0.95 and one at 0.40 still average 0.84.

The failure modes are orthogonal. A brief can score 0.92 on atomic-claim alignment and still fail because the plan never covered a critical sub-question. It can score 0.95 on citation validity and still fail because every cited source derives from the same upstream paper. It can be perfectly grounded and still answer a question adjacent to the one the user asked. The aggregate hides which case you are in.

Fluency is the camouflage. Research-agent training shapes prose that reads authoritative; a failed plan produces a confident brief on the wrong question, monocultured retrieval produces a confident brief on one perspective presented as consensus, and a bad synthesis produces a confident brief where the citations point sideways. Surface metrics never penetrate this layer. The per-stage rubric is the only thing that does. The unit of evaluation is the stage, not the answer.

The five stages of a deep-research run

Every deep-research system (OpenAI Deep Research, Anthropic Computer Use research, Perplexity Deep Research, in-house LangGraph or CrewAI planners) decomposes into the same five stages. The names vary; the shape does not.

Stage	What it does	Failure mode	Rubric anchor
1. Plan	Decompose the user query into sub-questions and allocate budget across them	Scope drift, missing sub-question, over-allocation to one leg	Sub-question coverage, budget allocation fit
2. Retrieve	Search and fetch, recursing into promising threads	Shallow stop, runaway recursion, retrieval-log gap	Recursion depth correctness, retrieval-log integrity
3. Evaluate source	Decide which retrieved sources are worth using	Tier-4 sources on a tier-1 question, monoculture	Source-tier stratification, primary-vs-secondary ratio
4. Align claim and evidence	Stitch each claim in the brief to a citation that supports it	Pointer error, paraphrase drift, fabricated citation	Per-claim entailment rate
5. Synthesize	Compose the final brief against the original question	Paragraph concatenation, off-by-a-question, format ignored	Plan coherence, synthesis quality, format fidelity

Score each row separately. Treat the per-stage scores as a 5-dim vector, not a scalar. The diagnostic value is in the vector, never in the mean.
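A minimal sketch of the vector-versus-mean point, using only the standard library. The StageScores class, the stage keys, and the floor values are illustrative assumptions, not part of any SDK mentioned in this post.

```python
from dataclasses import dataclass, asdict

# Hypothetical per-stage floors; tune them per agent and question type.
STAGE_FLOORS = {
    "plan": 0.7,
    "retrieve": 0.7,
    "evaluate_source": 0.7,
    "align": 0.8,
    "synthesize": 0.7,
}


@dataclass(frozen=True)
class StageScores:
    """One run's rubric vector, each score normalized to 0-1."""
    plan: float
    retrieve: float
    evaluate_source: float
    align: float
    synthesize: float

    def mean(self) -> float:
        values = list(asdict(self).values())
        return sum(values) / len(values)

    def failing_stages(self, floors=STAGE_FLOORS) -> list:
        # The diagnostic: which stages sit below their own floor.
        return [name for name, score in asdict(self).items() if score < floors[name]]


healthy = StageScores(0.90, 0.90, 0.90, 0.90, 0.90)
broken_plan = StageScores(0.40, 0.95, 0.95, 0.95, 0.95)

print(round(healthy.mean(), 2), healthy.failing_stages())
print(round(broken_plan.mean(), 2), broken_plan.failing_stages())
```

The second run's mean lands only a few points under the healthy one even though its plan score collapsed; the failing_stages list is what names the stage.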

Two of the five (source evaluation, claim-evidence alignment) are covered in depth in the research-assistant monitoring playbook; this post will not repeat that ground. The other three (plan, retrieve, synthesize) are the focus of the next sections, because they are where most teams’ eval setups have nothing on the dashboard.

Stage 1: Query planning

The planner reads the user question and writes a sub-question graph plus a budget per leg. It controls every downstream stage and has the least eval coverage in production today.

Three failure modes show up in almost every research-agent run.

Scope drop. The user asks for “the regulatory landscape for AI agents in healthcare in 2026.” The planner decomposes into FDA, HIPAA, and EU AI Act, and silently drops state-level laws. The retriever never sees state laws as a sub-question. The brief reads complete and is structurally short by one section.

Budget mis-allocation. The planner gives one sub-question 80% of the budget and four sub-questions 5% each. The 80% leg returns rich content; the synthesizer leans on it; the brief is 60% one leg with thin paragraphs on the rest.

Sub-question overlap. The planner writes six sub-questions and two of them are paraphrases. The retriever burns budget twice; the synthesizer stitches the same sources in twice; the brief is repetitive.

Score the plan with two judges. A structural judge counts planned sub-questions, verifies non-overlap (flagging any pair of sub-questions whose sentence-embedding cosine similarity exceeds a threshold), and checks budget allocation against a per-question-type prior. A semantic judge scores the plan against the user query for scope coverage on a calibrated 1-5 scale, with the executed sub-query graph attached as context.

from fi.evals import Evaluator
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider

plan_coverage_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "plan_scope_coverage",
        "model": "gpt-4o",
        "grading_criteria": (
            "Given the original user query and the planner's sub-question "
            "decomposition, score 1-5 on whether the sub-questions cover "
            "the full scope the user asked about. 5 = no missing sub-questions "
            "a domain expert would flag. 1 = a critical scope is missing."
        ),
    },
)
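The structural judge needs no LLM. A rough sketch under stated assumptions: the embeddings come from whatever sentence-embedding model you already run (not shown here), and the function names, the 0.85 similarity threshold, and the 0.25 budget tolerance are hypothetical defaults chosen for illustration.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def overlapping_pairs(embeddings, threshold=0.85):
    """Return (id_a, id_b, similarity) for sub-question pairs above the threshold.

    embeddings maps sub-question id -> vector from your embedding model.
    """
    flagged = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        similarity = cosine(vec_a, vec_b)
        if similarity > threshold:
            flagged.append((id_a, id_b, round(similarity, 3)))
    return flagged


def budget_outliers(allocation, prior=None, tolerance=0.25):
    """Return legs whose budget share deviates from the prior share by more than tolerance.

    With no prior for a leg, an even split across legs is assumed.
    """
    prior = prior or {}
    total = sum(allocation.values())
    if total <= 0:
        return {}
    outliers = {}
    for leg, spend in allocation.items():
        share = spend / total
        expected = prior.get(leg, 1 / len(allocation))
        if abs(share - expected) > tolerance:
            outliers[leg] = round(share - expected, 3)
    return outliers
```

Run against the 80/5/5/5/5 allocation described above, budget_outliers flags only the 80% leg; overlapping_pairs catches the paraphrased sub-questions before the retriever pays for them twice.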

The first time you run this rubric over a week of production traffic, the failure rate will surprise you. Plan-stage failures look like fluent answers to questions the user never asked.

Stage 2: Source retrieval

The retriever runs search calls, fetches pages, and decides when to recurse. Three numbers carry the eval.

Recursion depth correctness. Score the agent’s stop decision against a per-question ideal depth, derived from a human-curated reference run or from a structural property of the question. A regulatory landscape question wants depth two on most legs and depth three on the one citing a primary regulation. A factual lookup wants depth one. Without a depth attribute on every RETRIEVER span, this is unmeasurable.

Retrieval-log integrity. The citation list at the end of the brief is the agent’s claim about what it used. The retrieval log is the truth. Any citation in the brief that does not appear in the retrieval log is fabricated by definition; the agent invented the reference from its prior. This is the single highest-signal cheap check in deep-research eval. Run it inline as a guardrail; reject the brief at the gateway if it fails.

Per-leg cost. Aggregate the gateway cost headers per sub-question and plot cost-per-leg against the planner’s budget allocation. Legs that overrun signal a deepening loop the planner did not anticipate.
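Depth correctness and per-leg cost both reduce to a group-by over retriever spans. A sketch assuming the spans have already been exported as plain dicts; the keys leg_id, depth, and cost_usd are hypothetical names for the span attributes, and the ideal depths come from your own reference run.

```python
from collections import defaultdict


def per_leg_report(spans, ideal_depth, budget_usd):
    """Summarize each planned leg: depth reached vs ideal, cost vs budget.

    spans: iterable of dicts with hypothetical keys leg_id, depth, cost_usd.
    ideal_depth, budget_usd: dicts keyed by leg_id, taken from the reference
    run and the planner's allocation respectively.
    """
    observed = defaultdict(lambda: {"max_depth": 0, "cost_usd": 0.0})
    for span in spans:
        leg = observed[span["leg_id"]]
        leg["max_depth"] = max(leg["max_depth"], span["depth"])
        leg["cost_usd"] += span["cost_usd"]

    report = {}
    for leg_id, target in ideal_depth.items():
        stats = observed[leg_id]  # a planned leg with no spans reports depth 0
        report[leg_id] = {
            "depth_delta": stats["max_depth"] - target,  # <0 shallow stop, >0 overrun
            "over_budget": stats["cost_usd"] > budget_usd[leg_id],
            "cost_usd": round(stats["cost_usd"], 4),
        }
    return report


spans = [
    {"leg_id": "fda", "depth": 1, "cost_usd": 0.10},
    {"leg_id": "fda", "depth": 2, "cost_usd": 0.15},
    {"leg_id": "fda", "depth": 3, "cost_usd": 0.40},
    {"leg_id": "hipaa", "depth": 1, "cost_usd": 0.05},
]
report = per_leg_report(
    spans,
    ideal_depth={"fda": 3, "hipaa": 2, "state": 2},
    budget_usd={"fda": 0.50, "hipaa": 0.50, "state": 0.50},
)
```

Iterating over the planned legs rather than the observed ones means a sub-question the retriever never touched still shows up, with a negative depth delta.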

Attach the depth attribute on every retriever span as you instrument the run.

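from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType, FiSpanKindValues

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="deep-research-eval",
)
# register() is assumed here to return an OpenTelemetry-compatible TracerProvider.
tracer = trace_provider.get_tracer(__name__)

# inside the retriever loop, per call: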
with tracer.start_as_current_span(
    "retriever.search",
    attributes={
        "fi.span.kind": FiSpanKindValues.RETRIEVER.value,
        "session.id": research_run_id,
        "search.query": sub_query,
        "search.query.depth": current_depth,
        "search.query.parent_id": parent_subquery_id,
        "retrieval.source_type": source_type,
    },
):
    results = search_provider.search(sub_query)

The retrieval-log check is one SQL join against the span tree, and once teams wire it they do not go back.
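The same check works as an anti-join in plain Python when the cited URLs and the retrieval log are already in memory. A minimal sketch; the normalization rules (drop scheme, leading www, trailing slash, and fragment) are an assumption about what counts as "the same source" and may need tightening for your corpus.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to host + path (+ query) so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def fabricated_citations(cited_urls, retrieved_urls):
    """Return every cited URL that never appears in the retrieval log."""
    retrieved = {normalize_url(u) for u in retrieved_urls}
    return [u for u in cited_urls if normalize_url(u) not in retrieved]


cited = ["https://www.fda.gov/guidance/", "https://example.org/made-up"]
log = ["http://fda.gov/guidance#section-2"]
print(fabricated_citations(cited, log))
```

An empty list is the pass condition; anything else is grounds to reject the brief.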

Stage 3: Source evaluation (covered)

Source quality stratification and source diversity are stage three. Both are covered in the research-assistant monitoring playbook with the working rubrics: unique-domain count, primary-vs-secondary ratio, citation-chain depth, and per-question source-tier floor. The principle is the same in CI as on the live monitor; in CI you can attach a stricter ground-truth set of must-cite sources per golden-set question and score recall against that set.
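For the CI variant, three of these numbers are deterministic. A sketch assuming each source is a dict with a url and an is_primary flag set by your own tiering step; both field names are hypothetical, and the primary share is reported as primary over total to avoid dividing by zero when there are no secondary sources.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Lower-cased host with a leading www. stripped."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def source_metrics(sources, must_cite):
    """Unique-domain count, primary share, and recall against a must-cite set.

    sources: list of {"url": str, "is_primary": bool} (hypothetical shape).
    must_cite: URLs the golden-set question requires.
    """
    domains = {domain_of(s["url"]) for s in sources}
    primary = sum(1 for s in sources if s["is_primary"])
    used = {s["url"] for s in sources}
    required = set(must_cite)
    return {
        "unique_domains": len(domains),
        "primary_share": primary / len(sources) if sources else 0.0,
        "must_cite_recall": len(used & required) / len(required) if required else 1.0,
    }


sources = [
    {"url": "https://www.fda.gov/a", "is_primary": True},
    {"url": "https://fda.gov/b", "is_primary": False},
    {"url": "https://blog.example.com/c", "is_primary": False},
]
metrics = source_metrics(sources, ["https://www.fda.gov/a", "https://eur-lex.europa.eu/x"])
```

Exact-URL matching for the must-cite set is deliberately strict here; loosen it with URL normalization if your golden set and the agent disagree on trailing slashes.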

Three scanners run inline at this stage. MaliciousURLScanner rejects phishing and tracker domains the agent followed in a citation trail. SecretsScanner blocks leaked API keys from pastebins and gists from ending up in the final brief. InvisibleCharScanner flags zero-width and bidirectional control characters that adversaries embed in preprints to steer LLM interpretation. All three run in under ten milliseconds and slot in as INPUT rails on the retrieved-content boundary; the AI agent guardrails landscape covers the broader set.
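To make the invisible-character risk concrete, here is a rough illustration of the kind of check involved. It is not the InvisibleCharScanner implementation; it simply flags every code point in Unicode category Cf (format characters), which covers zero-width spaces and joiners as well as the bidirectional controls.

```python
import unicodedata


def find_invisible_chars(text):
    """Return (index, code point) for each Unicode format character in text.

    Category Cf includes zero-width characters (U+200B to U+200D, U+FEFF) and
    bidirectional controls (U+202A to U+202E, U+2066 to U+2069). It also
    includes the soft hyphen, so expect some benign hits.
    """
    hits = []
    for index, char in enumerate(text):
        if unicodedata.category(char) == "Cf":
            hits.append((index, f"U+{ord(char):04X}"))
    return hits


print(find_invisible_chars("ignore\u200bprevious"))
print(find_invisible_chars("plain abstract text"))
```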

Stage 4: Claim-evidence alignment (covered)

Atomic-claim decomposition is stage four. The brief gets split into sentence-level assertions, each tagged with the citation it carries, and a judge scores each (claim, cited passage) pair for entailment on a calibrated 1-5 scale. The aggregate is the percentage of claims above the threshold. Implementation lives in the research-assistant monitoring playbook and the hallucination detection guide for long-form outputs.

The rubric body is built-in. Use Groundedness for the per-claim entailment score, ChunkAttribution for the citation-to-passage map, and ContextAdherence for whether the brief stays inside the retrieved corpus rather than wandering into training data. Run all three at atomic-claim level. A brief can be 0.94 grounded as a whole and 0.61 aligned at the claim level; users only see the per-claim view.

from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ChunkAttribution, ContextAdherence
from fi.testcases import TestCase

atomic_claims = decompose_report(brief, citations)
cases = [
    TestCase(
        input=claim.text,
        output=claim.text,
        context=claim.cited_passage,
        metadata={"claim_id": claim.id, "source_url": claim.source_url},
    )
    for claim in atomic_claims
]

evaluator = Evaluator(fi_api_key=API_KEY, fi_secret_key=SECRET_KEY)
result = evaluator.evaluate(
    eval_templates=[Groundedness(), ChunkAttribution(), ContextAdherence()],
    inputs=cases,
)

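# Assumes metrics[0] carries the score you threshold on; guard against an empty claim list.
supported = sum(1 for r in result.eval_results if r.metrics[0].value >= 0.8)
alignment_rate = supported / len(result.eval_results) if result.eval_results else 0.0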

The aggregate is what you trend. The rows are what you fix.
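The decompose_report helper in the snippet above is not defined in this post. One hypothetical way to write it, assuming the brief uses numeric markers like [3] placed before the sentence-final punctuation and that citations maps each marker number to a url and a passage. Sentence splitting here is naive and will break on abbreviations; a production version would want a real sentence segmenter or an LLM decomposition step.

```python
import re
from dataclasses import dataclass


@dataclass
class AtomicClaim:
    id: str
    text: str
    source_url: str
    cited_passage: str


_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")
_MARKER = re.compile(r"\[(\d+)\]")


def decompose_report(brief, citations):
    """Split a brief into (sentence, citation) pairs.

    citations: dict of marker number -> {"url": str, "passage": str}.
    Sentences with no marker are skipped; markers missing from citations
    are skipped too and should be caught by the retrieval-log check.
    """
    claims = []
    for s_idx, sentence in enumerate(_SENTENCE_BREAK.split(brief.strip())):
        text = _MARKER.sub("", sentence)
        text = re.sub(r"\s+([.,;!?])", r"\1", text).strip()  # tidy "word ." to "word."
        for marker in _MARKER.findall(sentence):
            source = citations.get(int(marker))
            if source is None:
                continue
            claims.append(
                AtomicClaim(
                    id=f"s{s_idx}-c{marker}",
                    text=text,
                    source_url=source["url"],
                    cited_passage=source["passage"],
                )
            )
    return claims
```

A sentence carrying two markers yields two claims, one per cited passage, so a pointer error on either citation shows up as its own failing row.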

Stage 5: Synthesis coherence

The synthesizer takes the source set and writes the brief. Three rubrics cover the failure surface.

Synthesis-versus-concatenation. A grounded brief can still be useless if each section is one source’s voice paraphrased, with no cross-source reconciliation. The judge counts cross-source claims (claims supported by two or more independent sources) and contradictions surfaced (claims where the brief explicitly notes that source A and source B disagree). A brief with zero cross-source claims is paragraph concatenation, not synthesis.

Format fidelity. The user asked for a comparison table across four vendors and got prose. The user asked for an executive summary plus three sections and got one flat section. Format compliance is parseable; check it deterministically against the requested structure and regenerate the failing sections when it misses.

Plan coherence. The brief addresses the original question, not a reframed adjacent one. This is the connective tissue between stage one and stage five and gets its own treatment below.

synthesis_quality_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "synthesis_versus_concatenation",
        "model": "claude-sonnet-4-5",
        "grading_criteria": (
            "Given the source set and the brief, score 1-5 on synthesis quality. "
            "5 = the brief identifies cross-source patterns, surfaces contradictions "
            "between sources, and takes a defensible position with evidence. "
            "1 = each section paraphrases one source with no reconciliation."
        ),
    },
)

Synthesis-quality scores below 3 almost always trace back to a planner that never requested a synthesis step, or to a synthesizer prompt that asks for a brief without asking for cross-source patterns. The fix lives in stages one and five, not in the eval.
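Two of the synthesis checks above are deterministic and can run before any judge call. A sketch assuming a plain-text brief where section headings sit on their own line and table rows are pipe- or tab-separated; the function names and the claim_domains shape are hypothetical, and counting distinct domains is only a proxy for source independence.

```python
import re


def missing_sections(brief, required_headings):
    """Return requested headings that never appear as a standalone line."""
    present = {line.strip().lower() for line in brief.splitlines()}
    return [h for h in required_headings if h.strip().lower() not in present]


def has_table(brief, min_columns, min_rows=2):
    """True if at least min_rows lines split into min_columns or more cells."""
    rows = 0
    for line in brief.splitlines():
        cells = [c for c in re.split(r"[|\t]", line) if c.strip()]
        if len(cells) >= min_columns:
            rows += 1
    return rows >= min_rows


def cross_source_claim_count(claim_domains):
    """Count claims backed by two or more distinct domains.

    claim_domains: dict of claim id -> set of supporting domains.
    """
    return sum(1 for domains in claim_domains.values() if len(domains) >= 2)


brief = (
    "Executive Summary\n"
    "Short overview.\n\n"
    "Vendor\tPrice\tLatency\tSLA\n"
    "A\t$1\t10ms\t99.9%\n"
)
print(missing_sections(brief, ["Executive Summary", "Methodology"]))
print(has_table(brief, min_columns=4))
```

The list returned by missing_sections doubles as the regeneration work list: only the sections it names need another pass.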

The plan-coherence problem on long-horizon runs

The plan from minute one drifts by minute six. The retriever returns a tangent that looks relevant; the synthesizer leans on it; the final brief answers the tangent. The brief is internally coherent, fluent, well-cited, and structurally complete. It also does not answer the question the user asked.

Long-horizon runs make this harder, not easier. Replanning is a feature: the agent that revises its plan based on what retrieval surfaces is doing the right thing. The rubric has to credit useful replanning without absolving silent drift. The working compromise is to score the executed sub-question graph against the user query (not against the original plan) and score the final brief against the executed graph. The first score catches drift; the second catches synthesis errors that ignored the agent’s own work.
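The two scores separate cleanly into a diagnosis. A small sketch; the label strings and the floor of 3 on the 1-5 scale are illustrative assumptions.

```python
def diagnose_plan_coherence(graph_vs_query, brief_vs_graph, floor=3):
    """Map the two 1-5 judge scores to a failure label.

    graph_vs_query: executed sub-question graph scored against the user query.
    brief_vs_graph: final brief scored against the executed graph.
    """
    drifted = graph_vs_query < floor
    ignored_graph = brief_vs_graph < floor
    if drifted and ignored_graph:
        return "drift_and_synthesis_error"
    if drifted:
        return "scope_drift"
    if ignored_graph:
        return "synthesis_ignored_graph"
    return "coherent"


print(diagnose_plan_coherence(graph_vs_query=2, brief_vs_graph=5))
print(diagnose_plan_coherence(graph_vs_query=5, brief_vs_graph=2))
```

A run that replanned usefully scores high on both, because the executed graph, not the original plan, is the reference.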

Three operational requirements keep plan-coherence eval honest:

A small human-labelled hold-out per agent version, with a drift alarm when judge-vs-human disagreement exceeds the inter-rater baseline. Without this, the judge silently drifts.
A session.id on every span so the multi-hour run is one navigable trace. The planner span and the synthesizer span have to be joinable.
An agent.version and prompt.version tag on every span so a regression is attributable to a rollout, not to “last week vs this week.”
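The drift alarm in the first requirement is a few lines once the hold-out is labelled. A sketch assuming paired 1-5 scores from the judge and from a human on the same items; treating a gap of more than one point as a disagreement is an assumption, as is the baseline you pass in from your own inter-rater study.

```python
def disagreement_rate(judge_scores, human_scores, tolerance=1):
    """Fraction of hold-out items where judge and human differ by more than tolerance."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("judge and human score lists must be the same length")
    if not judge_scores:
        return 0.0
    disagreements = sum(
        1 for j, h in zip(judge_scores, human_scores) if abs(j - h) > tolerance
    )
    return disagreements / len(judge_scores)


def judge_drift_alarm(judge_scores, human_scores, inter_rater_baseline, tolerance=1):
    """True when judge-vs-human disagreement exceeds the human-vs-human baseline."""
    return disagreement_rate(judge_scores, human_scores, tolerance) > inter_rater_baseline


judge = [5, 4, 2, 1]
human = [5, 3, 4, 4]
print(disagreement_rate(judge, human))
print(judge_drift_alarm(judge, human, inter_rater_baseline=0.25))
```

Keying the hold-out by agent.version means the alarm fires against the rollout that moved the judge, not against a blended history.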

Production observability and the Error Feed

Per-stage scoring runs in two places: in CI against a golden set, and on the live span via traceAI. Same five rubrics in both places, same judge, same templates. That parity is what closes the gap between “passes in pytest” and “fails in production.”

The trace tree carries the stage labels.

agent.research.run        session.id, agent.version, prompt.version
  agent.plan              fi.span.kind=AGENT
    eval.plan_coverage
  agent.retrieve          fi.span.kind=CHAIN
    retriever.search.*    fi.span.kind=RETRIEVER, search.query.depth
      retriever.document.fetched
    eval.recursion_depth_correctness
    eval.retrieval_log_integrity
  agent.evaluate_source
    eval.source_diversity
    eval.source_tier_floor
    scanner.malicious_url
    scanner.invisible_char
  agent.synthesize        fi.span.kind=LLM
    claim.span (sentence + cited_passage)
    eval.claim_evidence_alignment
  agent.verify
    eval.synthesis_quality
    eval.plan_coherence
    eval.format_fidelity

Tail-sampling at the OTel collector keeps 100% of runs with any rubric below threshold, any guardrail trigger, any error, top-percentile latency or cost, and any experiment cohort; it samples 5-20% of the remaining clean runs uniformly. Head sampling at 1% drops the long-tail failures this rubric is designed to expose. The rare 40-source run is the one most likely to fabricate.
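The keep-or-drop decision can be written out as plain logic. This is a Python illustration of the policy, not collector configuration; the run dict and its keys are hypothetical, and hashing the session ID is one assumed way to get a stable uniform sample.

```python
import hashlib


def keep_trace(run, clean_sample_rate=0.10):
    """Decide whether a finished run's trace is retained.

    run: dict with hypothetical keys min_rubric_score, rubric_threshold,
    guardrail_triggered, had_error, top_percentile, experiment_cohort,
    session_id.
    """
    must_keep = (
        run["min_rubric_score"] < run["rubric_threshold"]
        or run["guardrail_triggered"]
        or run["had_error"]
        or run["top_percentile"]  # latency or cost outlier
        or run["experiment_cohort"] is not None
    )
    if must_keep:
        return True
    # Stable uniform sample: the same session always gets the same decision.
    digest = hashlib.sha256(run["session_id"].encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < clean_sample_rate


clean_run = {
    "min_rubric_score": 0.9,
    "rubric_threshold": 0.7,
    "guardrail_triggered": False,
    "had_error": False,
    "top_percentile": False,
    "experiment_cohort": None,
    "session_id": "run-001",
}
failing_run = dict(clean_run, min_rubric_score=0.4)
```

The decision needs the rubric scores, which only exist once the run has finished; that is why this policy has to live at the tail rather than the head.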

Failing runs land in the Error Feed. HDBSCAN soft-clustering groups failures over span embeddings in ClickHouse; a Claude Sonnet 4.5 Judge agent runs a 30-turn investigation across 8 span-tools, with a Haiku Chauffeur summarizing spans longer than 3000 characters and a roughly 90% prompt-cache hit ratio keeping the bill survivable. Per cluster, the Judge emits a 5-category 30-subtype taxonomy classification, a 4-dim trace score (factual grounding, privacy and safety, instruction adherence, optimal plan execution; 1-5 each), and an immediate_fix string. The cluster names the stage that broke: “stage one scope drop on healthcare regulatory queries,” “stage three tier-4 source reliance on legal questions,” “stage five paragraph concatenation under tight token budgets.” Promote representative traces into the offline regression set with one click; the next PR on that code path cannot pass CI until the new cases clear. Linear is wired today via OAuth; Slack, GitHub, Jira, and PagerDuty are on the roadmap.

The five-stage rubric vector is the input. A clustered failure with a named stage and an immediate_fix is the output. That is the loop.

Where Future AGI fits

Future AGI ships the five-stage eval stack as a package, not a single product.

ai-evaluation (Apache 2.0) carries the per-stage rubrics. Built-in templates handle stages three and four: Groundedness, ChunkAttribution, ContextAdherence, FactualAccuracy, Completeness, TaskCompletion, and DetectHallucination cover alignment and coverage; MaliciousURLScanner, SecretsScanner, and InvisibleCharScanner cover inline source safety. For stages one, two, and five, CustomLLMJudge accepts a grading_criteria config and runs against any LiteLLM-supported model. 20+ local heuristic metrics handle the deterministic structural checks (retrieval-log integrity, format fidelity, sub-question overlap) at sub-second latency.

traceAI (Apache 2.0) carries the same rubric as a span-attached score on live traffic. 50+ AI surfaces across Python, TypeScript, Java, and C#; auto-instrumentation for OpenAI, LangChain, Groq, Portkey, Gemini; 14 span kinds including AGENT, CHAIN, RETRIEVER, LLM, TOOL, EVALUATOR, GUARDRAIL, VECTOR_DB. Long-running spans treat the multi-hour research session as one navigable trace. Server-side EvalTag wires the rubric to the span at zero added inference latency, so the same template runs in pytest and on the live span.

The Future AGI Platform is the operational layer. Self-improving evaluators retune from thumbs feedback on the per-stage scores. An in-product authoring agent writes the four custom rubrics (plan coverage, recursion depth correctness, synthesis quality, format fidelity) from natural-language descriptions; each is novel enough that an off-the-shelf template needs domain tuning. Classifier-backed evals run at lower per-eval cost than Galileo Luna-2, which makes per-claim scoring on every sampled trace economically tractable for the hundreds of claims a deep-research brief contains. Error Feed sits inside the eval stack; the promoted regression test is the durable artifact that prevents the same stage from regressing on the next agent version.

For the gateway layer, the Agent Command Center fronts 100+ providers as a single Go binary (Apache 2.0). 18+ built-in guardrail scanners plus 15 third-party adapters at the same network hop; structural citation validity and the malicious-URL scanner run here on the outbound brief. Tag-scoped budgets cap the runaway research run before the dollar ceiling. ~29k req/s with P99 21 ms on a t3.xlarge with guardrails on, OpenAI-compatible drop-in.

Ready to evaluate your own deep-research agent? Start with the ai-evaluation quickstart: wire the five-stage rubric vector against a small golden set and check the per-stage scores before you trust any aggregate. Then attach the same rubrics as EvalTag spans on live traffic via traceAI. The vector is the dashboard. The stage that drops is the bug.

Frequently asked questions

Why does one aggregate quality score hide deep-research failures?

Because the run is structurally compound. A deep-research output sits at the end of a five-stage pipeline: query planning, source retrieval, source evaluation, claim-evidence alignment, and synthesis coherence. A failure in any one of those stages can produce a fluent, plausible brief that scores 0.9 on an answer-level rubric. You only see which stage broke if you score the stages independently. A 0.9 aggregate over a plan that drifted, retrieval that monocultured, and synthesis that stitched the wrong citations together is the most common failure pattern in production, and it looks healthy on every dashboard that averages.

What are the five stages of a deep-research run you should score independently?

Plan, retrieve, evaluate-source, align-claim-evidence, synthesize. Plan is the decomposition of the user query into sub-questions and the budget allocated across them. Retrieve is the search and fetch loop, including recursion depth. Evaluate-source is the per-source quality decision (tier, primary vs secondary, recency). Align-claim-evidence is the per-claim entailment against the cited passage. Synthesize is the coherence of the final brief against the original question and the executed sub-question graph. Each stage has its own rubric, its own failure modes, and its own diagnostic value. Aggregating them throws away the diagnostic.

How is per-stage scoring different from atomic-claim decomposition?

Atomic-claim decomposition is one stage's rubric, not the whole system. It scores stage four (claim-evidence alignment) by splitting the brief into sentence-level assertions and verifying each against its cited passage. Stage scoring is the layer above. A run that scores 0.92 on atomic claims can still fail because the plan never covered a critical sub-question (stage one) or because the retriever cited five domains that all reference one upstream source (stage three). Atomic claims tell you whether the brief is internally aligned. Stage scoring tells you which stage is breaking.

Why is plan coherence the hardest of the five stages to measure?

Because the plan is a moving target. A deep-research run revises its plan dynamically as retrieval surfaces new context. Score the original plan against the executed sub-query graph and you punish useful replanning. Score only the final answer against the user query and you miss scope drift that happened mid-run. The working rubric is a two-part judge: first, did the executed sub-question graph cover the scope the user asked about; second, did the final synthesis answer the asked question rather than a reframed adjacent one. Both are 1-5 calibrated against a small human-labelled reference set per agent version.

Which Future AGI evaluators map to each of the five research stages?

Stage one (plan) uses CustomLLMJudge with a grading_criteria targeting scope coverage and sub-question completeness. Stage two (retrieve) uses RETRIEVER span attributes plus CustomLLMJudge for recursion depth correctness. Stage three (evaluate-source) uses CustomLLMJudge for source-tier stratification, with MaliciousURLScanner and InvisibleCharScanner running inline. Stage four (align-claim-evidence) uses the built-in Groundedness, ChunkAttribution, and ContextAdherence templates at atomic-claim level. Stage five (synthesize) uses Completeness, TaskCompletion, and a CustomLLMJudge for synthesis-versus-concatenation. Each maps to a code-defined template in ai-evaluation and to a span score on the live trace.

How do you stop a runaway research run from billing five figures overnight?

Two layers. At the SDK layer, attach a per-run token-and-cost budget header on every gateway call, aggregate the response headers per session ID, and fail the run when the ceiling is hit. At the gateway layer, use Agent Command Center to set tag-scoped budgets per research session; once the run crosses the dollar ceiling the gateway refuses further calls. The gateway has no opinion about your application logic, which is why it is the durable defense. The SDK budget will fail open if your application catches the exception. The gateway budget fails closed by default.
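The fail-open risk at the SDK layer is easy to reproduce. A sketch with a hypothetical RunBudget class standing in for whatever per-run accounting your application does; it is not an API from any SDK or gateway named here.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a research run crosses its cost ceiling."""


class RunBudget:
    def __init__(self, ceiling_usd):
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0

    def record(self, cost_usd):
        """Add one call's cost; raise once the ceiling is reached."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.ceiling_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.2f} of {self.ceiling_usd:.2f} USD"
            )


def run_leg_fail_open(budget, cost_usd):
    # Anti-pattern: a broad except swallows the ceiling and the loop keeps going.
    try:
        budget.record(cost_usd)
    except Exception:
        pass
    return "continued"
```

Nothing in RunBudget can stop a caller that catches the exception, which is the argument for also enforcing the ceiling at a layer the application code cannot override.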

How does Error Feed handle the five-stage failure clusters?

Failing runs cluster via HDBSCAN soft-clustering over span embeddings in ClickHouse. A Claude Sonnet 4.5 Judge agent runs a 30-turn investigation across 8 span-tools, with a Haiku Chauffeur summarizing spans longer than 3000 characters at roughly 90% prompt-cache hit. Per cluster, the Judge writes a 5-category 30-subtype taxonomy classification, a 4-dimensional trace score (factual grounding, privacy and safety, instruction adherence, optimal plan execution; 1-5 each), and an immediate_fix string naming the stage that broke and the change to ship. Common research-agent clusters: planner-stage scope drift, retriever-stage monoculture, source-stage tier-4 reliance on tier-1 questions, alignment-stage citation pointer errors, synthesis-stage paragraph concatenation. Each cluster promotes into the offline regression set with one click.
