Evaluating Deep Research Agents in 2026: A Five-Stage Rubric for Long-Horizon Runs
Aggregate quality hides which research stage broke. Score plan, retrieve, source, claim, and synthesis independently or you cannot fix anything.
Most deep-research agents are evaluated on the wrong axis in 2026. The dashboard shows one aggregate quality score, the brief looks fluent, the latency line stays green, and the run ships. Three weeks later a user clicks citation 7 and the cited paper says the opposite. The aggregate hid which stage broke. The fix is to stop aggregating. Score query planning, source retrieval, source evaluation, claim-evidence alignment, and synthesis coherence as five independent axes. An aggregate over five orthogonal signals barely moves when one of them collapses, and it cannot tell you which one to fix.
This guide is the per-stage rubric set. If you have read the research-assistant monitoring playbook, this post lifts those four monitoring metrics into a five-stage eval discipline and adds query planning and synthesis coherence as first-class axes.
Why aggregate quality hides the failure
A deep-research run is a five-stage pipeline that takes minutes to hours, fires tens to hundreds of retrievals, and emits a 3,000-to-10,000-word brief with an executive summary and a citation list. Averaging across five orthogonal stages produces a number that barely moves when one stage collapses: if four stages hold at 0.9 and the fifth falls from 0.9 to 0.4, the mean slips from 0.90 to 0.80 and still reads as healthy.
The failure modes are orthogonal. A brief can score 0.92 on atomic-claim alignment and still fail because the plan never covered a critical sub-question. It can score 0.95 on citation validity and still fail because every cited source derives from the same upstream paper. It can be perfectly grounded and still answer a question adjacent to the one the user asked. The aggregate hides which case you are in.
Fluency is the camouflage. Research-agent training shapes prose that reads authoritative; a failed plan produces a confident brief on the wrong question, monocultured retrieval produces a confident brief on one perspective presented as consensus, and a bad synthesis produces a confident brief where the citations point sideways. Surface metrics never penetrate this layer. The per-stage rubric is the only thing that does. The unit of evaluation is the stage, not the answer.
The five stages of a deep-research run
Every deep-research system (OpenAI Deep Research, Anthropic Computer Use research, Perplexity Deep Research, in-house LangGraph or CrewAI planners) decomposes into the same five stages. The names vary; the shape does not.
| Stage | What it does | Failure mode | Rubric anchor |
|---|---|---|---|
| 1. Plan | Decompose the user query into sub-questions and allocate budget across them | Scope drift, missing sub-question, over-allocation to one leg | Sub-question coverage, budget allocation fit |
| 2. Retrieve | Search and fetch, recursing into promising threads | Shallow stop, runaway recursion, retrieval-log gap | Recursion depth correctness, retrieval-log integrity |
| 3. Evaluate source | Decide which retrieved sources are worth using | Tier-4 sources on a tier-1 question, monoculture | Source-tier stratification, primary-vs-secondary ratio |
| 4. Align claim and evidence | Stitch each claim in the brief to a citation that supports it | Pointer error, paraphrase drift, fabricated citation | Per-claim entailment rate |
| 5. Synthesize | Compose the final brief against the original question | Paragraph concatenation, off-by-a-question, format ignored | Plan coherence, synthesis quality, format fidelity |
Score each row separately. Treat the per-stage scores as a 5-dim vector, not a scalar. The diagnostic value is in the vector, never in the mean.
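As a minimal sketch of why the vector matters more than the mean, the snippet below compares a healthy run with one whose planner collapsed. The stage keys, the scores, and the 0.7 threshold are illustrative assumptions, not measurements from any real system.

```python
from statistics import mean

# Pipeline order; these keys are illustrative names for the five stages.
STAGES = ("plan", "retrieve", "evaluate_source", "align", "synthesize")

def failing_stages(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Return the stages whose score falls below the threshold, in pipeline order."""
    return [stage for stage in STAGES if scores[stage] < threshold]

# Made-up numbers: a healthy run and the same run with a collapsed plan stage.
healthy = {stage: 0.9 for stage in STAGES}
broken = {**healthy, "plan": 0.4}

print(round(mean(healthy.values()), 2))  # 0.9
print(round(mean(broken.values()), 2))   # 0.8
print(failing_stages(broken))            # ['plan']
```

The mean drops by a tenth and still looks acceptable; the vector names the stage to fix.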
Two of the five (source evaluation, claim-evidence alignment) are covered in depth in the research-assistant monitoring playbook; this post will not repeat that ground. The other three (plan, retrieve, synthesize) are the focus of the next sections, because they are where most teams’ eval setups have nothing on the dashboard.
Stage 1: Query planning
The planner reads the user question and writes a sub-question graph plus a budget per leg. It controls every downstream stage and has the least eval coverage in production today.
Three failure modes show up in almost every research-agent run.
Scope drop. The user asks for “the regulatory landscape for AI agents in healthcare in 2026.” The planner decomposes into FDA, HIPAA, and EU AI Act, and silently drops state-level laws. The retriever never sees state laws as a sub-question. The brief reads complete and is structurally short by one section.
Budget mis-allocation. The planner gives one sub-question 80% of the budget and four sub-questions 5% each. The 80% leg returns rich content; the synthesizer leans on it; the brief is 60% one leg with thin paragraphs on the rest.
Sub-question overlap. The planner writes six sub-questions and two of them are paraphrases. The retriever burns budget twice; the synthesizer stitches the same sources in twice; the brief is repetitive.
Score the plan with two judges. A structural judge counts planned sub-questions, flags overlap (any pair of sub-questions whose sentence-embedding cosine similarity exceeds a threshold), and checks budget allocation against a per-question-type prior. A semantic judge scores the plan against the user query for scope coverage on a calibrated 1-5 scale, with the executed sub-query graph attached as context.
```python
from fi.evals import Evaluator
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider

plan_coverage_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "plan_scope_coverage",
        "model": "gpt-4o",
        "grading_criteria": (
            "Given the original user query and the planner's sub-question "
            "decomposition, score 1-5 on whether the sub-questions cover "
            "the full scope the user asked about. 5 = no missing sub-questions "
            "a domain expert would flag. 1 = a critical scope is missing."
        ),
    },
)
```
The first time you run this rubric over a week of production traffic, the failure rate will surprise you. Plan-stage failures look like fluent answers to questions the user never asked.
Stage 2: Source retrieval
The retriever runs search calls, fetches pages, and decides when to recurse. Three numbers carry the eval.
Recursion depth correctness. Score the agent’s stop decision against a per-question ideal depth, derived from a human-curated reference run or from a structural property of the question. A regulatory landscape question wants depth two on most legs and depth three on the one citing a primary regulation. A factual lookup wants depth one. Without a depth attribute on every RETRIEVER span, this is unmeasurable.
Retrieval-log integrity. The citation list at the end of the brief is the agent’s claim about what it used. The retrieval log is the truth. Any citation in the brief that does not appear in the retrieval log is fabricated by definition; the agent invented the reference from its prior. This is the single highest-signal cheap check in deep-research eval. Run it inline as a guardrail; reject the brief at the gateway if it fails.
Per-leg cost. Aggregate the gateway cost headers per sub-question and plot cost-per-leg against the planner’s budget allocation. Legs that overrun signal a deepening loop the planner did not anticipate.
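Two of these numbers reduce to small comparisons once depth and cost are recorded per leg. The sketch below assumes exact-match scoring for depth and a 1.2x tolerance before a leg counts as overrun; both choices, and the function and leg names, are hypothetical.

```python
def depth_correctness(actual: dict[str, int], ideal: dict[str, int]) -> float:
    """Fraction of legs whose stop depth matches the per-question ideal depth."""
    if not ideal:
        return 1.0
    hits = sum(1 for leg, want in ideal.items() if actual.get(leg) == want)
    return hits / len(ideal)

def overrun_legs(cost: dict[str, float], budget: dict[str, float],
                 tolerance: float = 1.2) -> list[str]:
    """Legs whose spend exceeds the planned budget by more than the tolerance factor."""
    return sorted(leg for leg, spent in cost.items()
                  if spent > tolerance * budget.get(leg, 0.0))

# Illustrative values: the FDA leg stopped at depth 1 when the reference run went to 3.
ideal = {"fda": 3, "hipaa": 2, "eu_ai_act": 2}
actual = {"fda": 1, "hipaa": 2, "eu_ai_act": 2}
print(round(depth_correctness(actual, ideal), 2))  # 0.67

# Illustrative spend per leg against the planner's allocation.
print(overrun_legs({"fda": 4.0, "hipaa": 0.5}, {"fda": 2.0, "hipaa": 1.0}))  # ['fda']
```

A signed depth error (too shallow versus too deep) would separate shallow stops from runaway recursion; the exact-match fraction is only the simplest starting point.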
Attach the depth attribute on every retriever span as you instrument the run.
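```python
from opentelemetry import trace
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType, FiSpanKindValues

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="deep-research-eval",
)
# Assumes register() returns an OpenTelemetry-compatible tracer provider.
trace.set_tracer_provider(trace_provider)
tracer = trace.get_tracer(__name__)

# inside the retriever loop, per call
# (research_run_id, sub_query, current_depth, parent_subquery_id,
#  source_type, and search_provider come from the surrounding agent code):
with tracer.start_as_current_span(
    "retriever.search",
    attributes={
        "fi.span.kind": FiSpanKindValues.RETRIEVER.value,
        "session.id": research_run_id,
        "search.query": sub_query,
        "search.query.depth": current_depth,
        "search.query.parent_id": parent_subquery_id,
        "retrieval.source_type": source_type,
    },
):
    results = search_provider.search(sub_query)
```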
The retrieval-log check is one SQL join against the span tree, and once teams wire it they do not go back.
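Outside the database, the same check is a set difference. The sketch below assumes you can export the cited URLs from the brief and the fetched URLs from the retrieval log as plain lists; the normalization rules and function names are illustrative choices, and the URLs are placeholders.

```python
from urllib.parse import urlsplit

def normalize(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"

def fabricated_citations(cited_urls: list[str], retrieved_urls: list[str]) -> list[str]:
    """Citations in the brief that never appear in the retrieval log."""
    seen = {normalize(u) for u in retrieved_urls}
    return [u for u in cited_urls if normalize(u) not in seen]

# Placeholder URLs: one citation matches a fetched page, one was never fetched.
retrieved = ["https://example.com/fda/guidance/", "https://example.com/hipaa/summary"]
cited = ["https://Example.com/fda/guidance", "https://example.org/never-fetched"]
print(fabricated_citations(cited, retrieved))  # ['https://example.org/never-fetched']
```

An empty result is the pass condition; any non-empty result is grounds to reject the brief before it ships.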
Stage 3: Source evaluation (covered)
Source quality stratification and source diversity are stage three. Both are covered in the research-assistant monitoring playbook with the working rubrics: unique-domain count, primary-vs-secondary ratio, citation-chain depth, and per-question source-tier floor. The principle is the same in CI as on the live monitor; in CI you can attach a stricter ground-truth set of must-cite sources per golden-set question and score recall against that set.
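Three of these numbers are simple enough to sketch directly. The helpers below are an illustration rather than the playbook's implementation: the function names are hypothetical, and the "primary" and "secondary" labels are assumed to come from an upstream source classifier.

```python
from urllib.parse import urlsplit

def unique_domains(urls: list[str]) -> int:
    """Number of distinct hosts among the cited sources."""
    return len({urlsplit(u).netloc.lower() for u in urls})

def primary_ratio(source_types: list[str]) -> float:
    """Share of sources labelled 'primary' (labels assumed to be assigned upstream)."""
    return source_types.count("primary") / len(source_types) if source_types else 0.0

def must_cite_recall(cited: set[str], must_cite: set[str]) -> float:
    """Fraction of the golden-set must-cite sources that the brief actually cites."""
    return len(cited & must_cite) / len(must_cite) if must_cite else 1.0

print(unique_domains(["https://a.example/x", "https://a.example/y", "https://b.example/"]))  # 2
print(primary_ratio(["primary", "secondary", "primary", "secondary"]))  # 0.5
print(must_cite_recall({"doc-a", "doc-b"}, {"doc-a", "doc-c"}))  # 0.5
```

Note that this host comparison treats `www.example.com` and `example.com` as different domains; a production version would likely collapse hosts to a registrable domain first.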
Three scanners run inline at this stage. MaliciousURLScanner rejects phishing and tracker domains the agent followed in a citation trail. SecretsScanner blocks leaked API keys from pastebins and gists from ending up in the final brief. InvisibleCharScanner flags zero-width and bidirectional control characters that adversaries embed in preprints to steer LLM interpretation. All three run in under ten milliseconds and slot in as INPUT rails on the retrieved-content boundary; the AI agent guardrails landscape covers the broader set.
Stage 4: Claim-evidence alignment (covered)
Atomic-claim decomposition is stage four. The brief gets split into sentence-level assertions, each tagged with the citation it carries, and a judge scores each (claim, cited passage) pair for entailment on a calibrated 1-5 scale. The aggregate is the percentage of claims above the threshold. Implementation lives in the research-assistant monitoring playbook and the hallucination detection guide for long-form outputs.
The rubric body is built-in. Use Groundedness for the per-claim entailment score, ChunkAttribution for the citation-to-passage map, and ContextAdherence for whether the brief stays inside the retrieved corpus rather than wandering into training data. Run all three at atomic-claim level. A brief can be 0.94 grounded as a whole and 0.61 aligned at the claim level; users only see the per-claim view.
```python
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ChunkAttribution, ContextAdherence
from fi.testcases import TestCase

# decompose_report, brief, citations, API_KEY, and SECRET_KEY are assumed
# to be defined by the surrounding pipeline.
atomic_claims = decompose_report(brief, citations)

# One test case per atomic claim, paired with the passage it cites.
cases = [
    TestCase(
        input=claim.text,
        output=claim.text,
        context=claim.cited_passage,
        metadata={"claim_id": claim.id, "source_url": claim.source_url},
    )
    for claim in atomic_claims
]

evaluator = Evaluator(fi_api_key=API_KEY, fi_secret_key=SECRET_KEY)
result = evaluator.evaluate(
    eval_templates=[Groundedness(), ChunkAttribution(), ContextAdherence()],
    inputs=cases,
)

# Share of claims whose first metric clears the 0.8 support threshold.
supported = sum(1 for r in result.eval_results if r.metrics[0].value >= 0.8)
alignment_rate = supported / len(result.eval_results)
```
The aggregate is what you trend. The rows are what you fix.
Stage 5: Synthesis coherence
The synthesizer takes the source set and writes the brief. Three rubrics cover the failure surface.
Synthesis-versus-concatenation. A grounded brief can still be useless if each section is one source’s voice paraphrased, with no cross-source reconciliation. The judge counts cross-source claims (claims supported by two or more independent sources) and contradictions surfaced (claims where the brief explicitly notes that source A and source B disagree). A brief with zero cross-source claims is paragraph concatenation, not synthesis.
Format fidelity. The user asked for a comparison table across four vendors and got prose. The user asked for an executive summary plus three sections and got one flat section. Format compliance is parseable; check it deterministically against the requested structure and regenerate the failing sections when it misses.
Plan coherence. The brief addresses the original question, not a reframed adjacent one. This is the connective tissue between stage one and stage five and gets its own treatment below.
```python
synthesis_quality_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "synthesis_versus_concatenation",
        "model": "claude-sonnet-4-5",
        "grading_criteria": (
            "Given the source set and the brief, score 1-5 on synthesis quality. "
            "5 = the brief identifies cross-source patterns, surfaces contradictions "
            "between sources, and takes a defensible position with evidence. "
            "1 = each section paraphrases one source with no reconciliation."
        ),
    },
)
```
Synthesis-quality scores below 3 almost always trace back to a planner that never requested a synthesis step, or to a synthesizer prompt that asks for a brief without asking for cross-source patterns. The fix lives in stages one and five, not in the eval.
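Format fidelity, unlike synthesis quality, needs no judge. The sketch below assumes the brief is emitted as markdown and that the requested structure can be expressed as a list of required section titles plus a minimum number of table rows; the function names and that request shape are assumptions for illustration.

```python
import re

def section_headings(brief_md: str) -> list[str]:
    """Titles of markdown headings (any level) in document order."""
    return [m.group(1).strip()
            for m in re.finditer(r"^#{1,6}\s+(.+)$", brief_md, flags=re.MULTILINE)]

def _is_separator(line: str) -> bool:
    """True for a pipe-table separator row such as |---|:--:|."""
    return line.startswith("|") and "-" in line and set(line) <= set("|-: ")

def table_data_rows(brief_md: str) -> int:
    """Count pipe-table rows, excluding header rows and separator rows."""
    lines = [ln.strip() for ln in brief_md.splitlines()]
    rows = 0
    for i, ln in enumerate(lines):
        is_row = ln.startswith("|") and ln.endswith("|")
        header = i + 1 < len(lines) and _is_separator(lines[i + 1])
        if is_row and not _is_separator(ln) and not header:
            rows += 1
    return rows

def format_violations(brief_md: str, required_sections: list[str],
                      min_table_rows: int = 0) -> list[str]:
    """List every way the brief misses the requested structure; empty means pass."""
    found = {h.lower() for h in section_headings(brief_md)}
    problems = [f"missing section: {s}" for s in required_sections if s.lower() not in found]
    if table_data_rows(brief_md) < min_table_rows:
        problems.append(f"table has fewer than {min_table_rows} data rows")
    return problems

brief = """# Executive summary
Text.
## Vendor comparison
| Vendor | Price |
|---|---|
| A | 1 |
| B | 2 |
"""
print(format_violations(brief, ["Executive summary", "Vendor comparison", "Risks"], min_table_rows=4))
# ['missing section: Risks', 'table has fewer than 4 data rows']
```

Each violation string maps to a section that can be regenerated on its own, which keeps the retry cheaper than rerunning the whole synthesis.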
The plan-coherence problem on long-horizon runs
The plan from minute one drifts by minute six. The retriever returns a tangent that looks relevant; the synthesizer leans on it; the final brief answers the tangent. The brief is internally coherent, fluent, well-cited, and structurally complete. It also does not answer the question the user asked.
Long-horizon runs make this harder, not easier. Replanning is a feature: the agent that revises its plan based on what retrieval surfaces is doing the right thing. The rubric has to credit useful replanning without absolving silent drift. The working compromise is to score the executed sub-question graph against the user query (not against the original plan) and score the final brief against the executed graph. The first score catches drift; the second catches synthesis errors that ignored the agent’s own work.
Three operational requirements keep plan-coherence eval honest:
- A small human-labelled hold-out per agent version, with a drift alarm when judge-vs-human disagreement exceeds the inter-rater baseline. Without this, the judge silently drifts.
- A session.id on every span so the multi-hour run is one navigable trace. The planner span and the synthesizer span have to be joinable.
- An agent.version and prompt.version tag on every span so a regression is attributable to a rollout, not to “last week vs this week.”
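The drift alarm in the first requirement can be sketched as a comparison of two disagreement rates. The version below assumes 1-5 ratings from the judge and from two human annotators on the same hold-out items, and counts a disagreement when two ratings differ by more than one point; that tolerance and the function names are illustrative assumptions.

```python
def disagreement_rate(a: list[int], b: list[int], tolerance: int = 1) -> float:
    """Share of items where two 1-5 ratings differ by more than the tolerance."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and the same length")
    return sum(abs(x - y) > tolerance for x, y in zip(a, b)) / len(a)

def judge_drift_alarm(judge: list[int], human_a: list[int], human_b: list[int],
                      tolerance: int = 1) -> bool:
    """Alarm when judge-vs-human disagreement exceeds the human inter-rater baseline."""
    baseline = disagreement_rate(human_a, human_b, tolerance)
    observed = disagreement_rate(judge, human_a, tolerance)
    return observed > baseline

# Made-up ratings on a six-item hold-out.
human_a = [5, 4, 2, 3, 1, 4]
human_b = [5, 3, 2, 4, 1, 2]
judge = [5, 5, 4, 3, 3, 4]
print(judge_drift_alarm(judge, human_a, human_b))  # True
```

Here the humans disagree on one item in six and the judge disagrees with the first annotator on two in six, so the alarm fires. A chance-corrected agreement statistic would be a sturdier baseline on larger hold-outs.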
Production observability and the Error Feed
Per-stage scoring runs in two places: in CI against a golden set, and on the live span via traceAI. Same five rubrics in both places, same judge, same templates. That parity is what closes the gap between “passes in pytest” and “fails in production.”
The trace tree carries the stage labels.
```
agent.research.run                    session.id, agent.version, prompt.version
  agent.plan                          fi.span.kind=AGENT
    eval.plan_coverage
  agent.retrieve                      fi.span.kind=CHAIN
    retriever.search.*                fi.span.kind=RETRIEVER, search.query.depth
    retriever.document.fetched
    eval.recursion_depth_correctness
    eval.retrieval_log_integrity
  agent.evaluate_source
    eval.source_diversity
    eval.source_tier_floor
    scanner.malicious_url
    scanner.invisible_char
  agent.synthesize                    fi.span.kind=LLM
    claim.span                        (sentence + cited_passage)
    eval.claim_evidence_alignment
  agent.verify
    eval.synthesis_quality
    eval.plan_coherence
    eval.format_fidelity
```
Tail-sampling at the OTel collector keeps 100% of runs with any rubric below threshold, any guardrail trigger, any error, top-percentile latency or cost, and any experiment cohort; it samples 5-20% of the remaining clean runs uniformly. Head sampling at 1% drops the long-tail failures research agents are designed to expose. The rare 40-source run is the one most likely to fabricate.
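The keep-or-drop policy can be written down as a single predicate. The sketch below expresses the decision logic in Python for clarity; it is not an OTel collector configuration, and the `RunSummary` fields, the 0.7 rubric threshold, the 99th-percentile cutoff, and the 10% clean-run sample rate are all assumptions.

```python
import random
from dataclasses import dataclass

@dataclass
class RunSummary:
    min_rubric_score: float     # lowest per-stage score on the run
    guardrail_triggered: bool
    had_error: bool
    latency_percentile: float   # 0-100, relative to recent traffic
    cost_percentile: float      # 0-100, relative to recent traffic
    in_experiment: bool

def keep_trace(run: RunSummary, rng: random.Random, rubric_threshold: float = 0.7,
               top_percentile: float = 99.0, clean_sample_rate: float = 0.1) -> bool:
    """Keep every interesting run; sample the remaining clean runs uniformly."""
    always_keep = (
        run.min_rubric_score < rubric_threshold
        or run.guardrail_triggered
        or run.had_error
        or run.latency_percentile >= top_percentile
        or run.cost_percentile >= top_percentile
        or run.in_experiment
    )
    return always_keep or rng.random() < clean_sample_rate

clean = RunSummary(0.95, False, False, 50.0, 50.0, False)
failing = RunSummary(0.40, False, False, 50.0, 50.0, False)
rng = random.Random(0)
print(keep_trace(failing, rng))                       # True
print(keep_trace(clean, rng, clean_sample_rate=0.0))  # False
```

The decision has to run after the run completes, because the lowest rubric score and the cost percentile are only known at the end of the trace; that is what separates it from head sampling.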
Failing runs land in the Error Feed. HDBSCAN soft-clustering groups failures over span embeddings in ClickHouse; a Claude Sonnet 4.5 Judge agent runs a 30-turn investigation across 8 span-tools, with a Haiku Chauffeur summarizing spans longer than 3000 characters and a roughly 90% prompt-cache hit ratio keeping the bill survivable. Per cluster, the Judge emits a 5-category 30-subtype taxonomy classification, a 4-dim trace score (factual grounding, privacy and safety, instruction adherence, optimal plan execution; 1-5 each), and an immediate_fix string. The cluster names the stage that broke: “stage one scope drop on healthcare regulatory queries,” “stage three tier-4 source reliance on legal questions,” “stage five paragraph concatenation under tight token budgets.” Promote representative traces into the offline regression set with one click; the next PR on that code path cannot pass CI until the new cases clear. Linear is wired today via OAuth; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
The five-stage rubric vector is the input. A clustered failure with a named stage and an immediate_fix is the output. That is the loop.
Where Future AGI fits
Future AGI ships the five-stage eval stack as a package, not a single product.
ai-evaluation (Apache 2.0) carries the per-stage rubrics. Built-in templates handle stages three and four: Groundedness, ChunkAttribution, ContextAdherence, FactualAccuracy, Completeness, TaskCompletion, and DetectHallucination cover alignment and coverage; MaliciousURLScanner, SecretsScanner, and InvisibleCharScanner cover inline source safety. For stages one, two, and five, CustomLLMJudge accepts a grading_criteria config and runs against any LiteLLM-supported model. 20+ local heuristic metrics handle the deterministic structural checks (retrieval-log integrity, format fidelity, sub-question overlap) at sub-second latency.
traceAI (Apache 2.0) carries the same rubric as a span-attached score on live traffic. 50+ AI surfaces across Python, TypeScript, Java, and C#; auto-instrumentation for OpenAI, LangChain, Groq, Portkey, Gemini; 14 span kinds including AGENT, CHAIN, RETRIEVER, LLM, TOOL, EVALUATOR, GUARDRAIL, VECTOR_DB. Long-running spans treat the multi-hour research session as one navigable trace. Server-side EvalTag wires the rubric to the span at zero added inference latency, so the same template runs in pytest and on the live span.
The Future AGI Platform is the operational layer. Self-improving evaluators retune from thumbs feedback on the per-stage scores. An in-product authoring agent writes the four custom rubrics (plan coverage, recursion depth correctness, synthesis quality, format fidelity) from natural-language descriptions; each is novel enough that an off-the-shelf template needs domain tuning. Classifier-backed evals run at lower per-eval cost than Galileo Luna-2, which makes per-claim scoring on every sampled trace economically tractable for the hundreds of claims a deep-research brief contains. Error Feed sits inside the eval stack; the promoted regression test is the durable artifact that prevents the same stage from regressing on the next agent version.
For the gateway layer, the Agent Command Center fronts 100+ providers as a single Go binary (Apache 2.0). It runs 18+ built-in guardrail scanners plus 15 third-party adapters at the same network hop; structural citation validity and the malicious-URL scanner run here on the outbound brief. Tag-scoped budgets cap the runaway research run before the dollar ceiling. Quoted throughput is ~29k req/s with P99 21 ms on a t3.xlarge with guardrails on, and the API is an OpenAI-compatible drop-in.
Ready to evaluate your own deep-research agent? Start with the ai-evaluation quickstart: wire the five-stage rubric vector against a small golden set and check the per-stage scores before you trust any aggregate. Then attach the same rubrics as EvalTag spans on live traffic via traceAI. The vector is the dashboard. The stage that drops is the bug.