AI Research Assistant Monitoring: The 2026 Playbook for Citation, Source, Claim, and Plan

Generic monitoring misses how research assistants fail. Four metrics that actually catch citation invention, source collapse, and plan drift in production.

July 2, 2025

Updated May 20, 2026

13 min read

research-agents agent-monitoring citation-accuracy grounding hallucination-detection deep-research agent-observability 2026

A user asks the research assistant for “the regulatory landscape for AI agents in healthcare in 2026.” Ninety seconds later, a 1,400-word brief comes back. Five citations, fluent prose, confident tone. The latency dashboard shows green. The token-cost dashboard shows green. The faithfulness eval, which scored the synthesis against the agent’s own retrieval log, shows 0.93. The trace has 47 spans, all healthy. Three of those five citations point to claims the cited paper does not actually make. One cites a URL the agent never retrieved. The user has no way to know which citation is which without re-checking each one by hand.

This is the operational shape of research-agent failure in 2026, and it does not look like chat-agent failure. Generic agent monitoring measures whether the model returned text, whether the tool calls succeeded, and whether the synthesis is internally consistent. None of those checks catches the failure above. Research assistants fail differently. The unit of evaluation isn’t “did it answer.” It’s “did it cite verifiable sources, did it actually read them, and does the synthesis match the evidence.”

This guide is the monitoring framework that maps to that failure shape: four research-specific metrics (citation validity, source diversity, claim-evidence alignment, plan coherence), the three patterns for running them in production (live trace, sample-and-judge, Error Feed), and an honest map of where Future AGI’s stack fits.

Why generic agent monitoring misses research-assistant failures

Three properties make research agents their own monitoring discipline.

The output is structurally compound. A chat answer is one claim. A research brief is twelve claims stitched with five citations into a 1,400-word synthesis. The unit of correctness is the claim, not the answer. An answer-level rubric that scores 0.91 hides a claim-level distribution where two claims score zero. Users do not read averages; they click citation 7, find it does not support the sentence, and stop trusting the agent.

The output is fluent. A research agent’s training objective shapes prose that reads authoritative. Confabulated citations sound like real citations. Quotes that were never said appear in well-structured paragraphs. Fluency is the failure-camouflage layer that surface metrics never penetrate.

The output cites itself. Faithfulness as classically defined (“did the agent stay grounded in the retrieved context”) asks whether the synthesis is consistent with what the agent put in its own context window. If the agent retrieved a source but only read the abstract, then cited it for a methodological claim the abstract does not contain, faithfulness still passes. The agent grounded the synthesis in retrieved text. The text just was not the right text.

The classic LLM monitoring stack catches none of these. Toxicity, PII, latency, token cost, tool-call success: all green when the brief is fabricated. The metrics that catch research-agent failure live one layer up.

Citation validity: the URL resolves AND the cited claim is in it

Citation validity is two checks, not one. Most teams collapse them and lose half the failure signal.

The structural check asks three questions per citation. Does the URL or document ID resolve to a real source? Does the metadata (title, authors, year, publisher) match what the agent claims? Is the source present in the agent’s retrieval log, meaning the agent actually fetched it rather than confabulating a plausible reference? All three are cheap. The retrieval-log check alone catches the most common research-agent failure: citations to sources the agent never retrieved, generated from the LLM’s prior over what a citation should look like.
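The retrieval-log and metadata questions can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Citation` shape, the `retrieval_log` mapping (fetched URL to the title recorded at fetch time), and the failure labels are all assumptions made for the example. The "does the URL resolve" question is not re-checked here, on the assumption that a URL present in the retrieval log was fetched successfully.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Citation:
    # Hypothetical shape: what the agent claims about a source.
    url: str
    title: str


def normalize_url(url: str) -> str:
    """Lowercase the host; drop scheme, query, fragment and trailing slash."""
    parts = urlsplit(url.strip())
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"


def structural_validity(citations, retrieval_log):
    """Return (score, failures) for one brief.

    retrieval_log maps each fetched URL to the title seen at fetch time.
    score is the share of citations that pass both checks.
    """
    fetched = {normalize_url(u): t for u, t in retrieval_log.items()}
    failures = []
    for c in citations:
        key = normalize_url(c.url)
        if key not in fetched:
            # The agent cited something it never retrieved.
            failures.append((c.url, "not_in_retrieval_log"))
        elif fetched[key].strip().lower() != c.title.strip().lower():
            # The source exists, but the claimed metadata does not match.
            failures.append((c.url, "metadata_mismatch"))
    score = 1.0 - len(failures) / len(citations) if citations else 1.0
    return score, failures
```

Keeping the failure reason next to the URL lets a guardrail act differently on each: a citation missing from the retrieval log is a candidate for removal, while a metadata mismatch may only need a rewrite.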

The semantic check is harder. The agent retrieved the source, cited it for a specific claim, and the claim is not in the source. The claim might be in an adjacent paper, in the model’s training data, or invented whole. A judge model reads the claim and the cited passage and scores entailment on a calibrated scale. This is per-claim, not per-answer. Aggregate as the percentage of claims whose cited passage entails them. The full rubric design for this check lives in evaluating LLM citation and attribution.

The split matters operationally. Structural validity runs inline at the gateway as a guardrail; citations that fail the structural check get flagged or rewritten before the user sees them. Semantic validity runs offline on a sample because it costs a judge call per claim. A common production breakdown: 0.98 structural, 0.71 semantic. Nearly every URL resolves; roughly three in ten citations do not actually support the claim attached to them. Both numbers are needed; either alone misleads.

Source diversity: independent sources, not citation count

Counting citations measures effort, not quality. A research brief with five citations all pointing to the same content farm is one source presented as five. Source diversity counts independent evidence.

Three numbers do the work.

Unique-domain count is the floor. If five citations resolve to two unique domains, the brief depends on two publishers no matter how the citations are formatted. Plot this against the depth of the question; a regulatory landscape with two unique domains is structurally underdiversified.

Primary-vs-secondary ratio separates sources from sources-about-sources. A press release on the agency website is primary; a tech blog summarizing the press release is secondary; a Reddit thread linking to the tech blog is tertiary. Research agents are biased toward whatever ranks well in retrieval, which is usually secondary. Score the ratio per run. Questions of fact want primary-heavy mixes. Questions of interpretation tolerate more secondary.

Citation-chain depth flags monoculture. If three of five citations all derive from one upstream source (one paper, three press summaries of that paper), the agent has cited the same evidence three times under different headers. Trace the chain; collapse the chain in your diversity metric.
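The three numbers can be computed from the citation list once each source carries a tier label and, where known, a pointer to the source it derives from. The sketch below assumes both of those annotations already exist (the `Source` fields are hypothetical), and it treats the full hostname minus a leading `www.` as the domain. Collapsing subdomains of one publisher would need a public-suffix lookup, which is left out.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Source:
    url: str
    tier: str                           # "primary", "secondary" or "tertiary"
    derived_from: Optional[str] = None  # URL of the upstream source, if known


def domain(url: str) -> str:
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def root_of(url: str, by_url: dict) -> str:
    """Follow derived_from links upstream; stop on unknown URLs or cycles."""
    seen = set()
    while url in by_url and by_url[url].derived_from and url not in seen:
        seen.add(url)
        url = by_url[url].derived_from
    return url


def diversity(sources):
    by_url = {s.url: s for s in sources}
    primary = sum(1 for s in sources if s.tier == "primary")
    return {
        "unique_domains": len({domain(s.url) for s in sources}),
        "primary_ratio": primary / len(sources) if sources else 0.0,
        # Citations that share an upstream root count as one source.
        "independent_sources": len({root_of(s.url, by_url) for s in sources}),
    }
```

In the one-paper, three-summaries case described above, `unique_domains` can still look healthy while `independent_sources` collapses, which is the gap the chain check is meant to expose.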

Diversity is a quality signal because monoculture in sourcing is how research agents drift from the evidence into the consensus they were trained on. The agent retrieves what looks authoritative on the open web, and what looks authoritative on the open web is the small cluster of writers who got there first. Without a diversity check, the agent’s brief is a confidently written summary of the loudest existing summary.

Claim-evidence alignment: per-claim, not per-answer

Groundedness scores the synthesis against the retrieved corpus as a whole. That is the wrong unit for a research brief.

A brief is twelve claims plus a conclusion. Groundedness asks whether the synthesis stays inside the retrieved evidence space, averaged across the answer. Claim-evidence alignment asks the per-claim question: for each individual claim, does the cited passage entail it? The rubric runs once per claim, scores entailment on a calibrated 1-5 scale, and aggregates as the rate of claims with score >= 4.
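The aggregation step is small enough to show directly. In this sketch the judge is passed in as a plain callable that takes a (claim, passage) pair and returns an integer from 1 to 5; that signature is an assumption for illustration, and any real judge model would sit behind it. The function reports the mean alongside the rate so the two can be compared on the same brief.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: (claim, cited_passage) -> entailment score 1..5
Judge = Callable[[str, str], int]


def alignment_report(
    pairs: Sequence[Tuple[str, str]], judge: Judge, threshold: int = 4
) -> dict:
    """Score every (claim, passage) pair and aggregate per brief."""
    if not pairs:
        raise ValueError("a brief needs at least one cited claim")
    scores = [judge(claim, passage) for claim, passage in pairs]
    for s in scores:
        if not 1 <= s <= 5:
            raise ValueError(f"judge returned {s}, expected 1-5")
    failing = [i for i, s in enumerate(scores) if s < threshold]
    return {
        # Share of claims whose cited passage entails them (score >= threshold).
        "rate": (len(scores) - len(failing)) / len(scores),
        # The answer-level average, kept only for comparison.
        "mean": sum(scores) / len(scores),
        # Indices of the claims a reviewer should open first.
        "failing_claims": failing,
    }
```

Returning the failing indices matters as much as the rate: the number tells you a brief is weak, the indices tell you which sentences to fix.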

The two metrics diverge sharply on real production traffic. A brief can be 0.94 grounded and 0.61 aligned. Groundedness is high because the synthesis as a whole is consistent with the corpus the agent retrieved. Alignment is low because individual claims are stitched to citations that do not specifically support them; the corpus contains the supporting evidence elsewhere, but the citation pointer is wrong. Users only see the pointer.

Implementation. Tag claim spans during synthesis (one span per sentence-with-citation), attach the cited passage as a span attribute, run the entailment judge on the (claim, passage) pair, write the score back as a span attribute. The rubric definition lives in ai-evaluation as a code-first template; the same template runs in CI against a versioned dataset of (brief, citations, ground-truth-alignment) examples and on live spans as a sampled score. The diagnostic vocabulary is identical in both places.

Plan coherence over long-horizon tasks

A research run is a plan executed against a sub-query graph. The plan from minute one drifts by minute six. Without a plan-coherence metric, the drift is invisible.

Plan coherence is meant to catch three failure shapes.

Scope drift. The user asks for “regulatory landscape in healthcare.” The planner decomposes into “FDA,” “HIPAA,” “state laws,” “international comparison.” During retrieval, the international leg returns rich content, the synthesizer leans on it, and the final brief is 60% international with one paragraph on FDA. The scope shifted under the agent.

Sub-question abandonment. The plan lists six sub-questions. The retriever returns nothing for two of them. The synthesizer drops them silently. The brief reads complete; the trace shows two retrieval branches with zero usable spans. Without plan-coherence scoring, the user reads a partial answer presented as a full one.

Question reframing. The retrieval surfaces a tangent that looks relevant. The agent quietly answers the tangent instead of the question. The brief is internally coherent and externally off-topic.

The rubric is structural plus semantic. Structurally, count planned sub-questions and verify each has at least one supporting paragraph in the final brief. Semantically, a judge scores the final answer against the original user query for scope coverage and against the planned sub-query graph for execution fidelity. Aggregate to a single 1-5 score and alert on rolling-mean drift per agent version.
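The structural half of the rubric is a coverage count. The sketch below assumes each paragraph of the final brief has been tagged with the sub-question ids it addresses, which is a hypothetical annotation the synthesizer would have to emit. It reports abandoned sub-questions and the share of paragraphs each sub-question received, so scope drift shows up as one leg dominating.

```python
def plan_coverage(planned, paragraph_tags):
    """Compare planned sub-questions with what the brief actually covers.

    planned: ordered list of sub-question ids from the planner.
    paragraph_tags: one set of sub-question ids per paragraph of the brief.
    """
    counts = {q: 0 for q in planned}
    for tags in paragraph_tags:
        for q in tags:
            if q in counts:  # ignore tags that were never in the plan
                counts[q] += 1
    total = sum(counts.values())
    abandoned = [q for q in planned if counts[q] == 0]
    return {
        "coverage": (len(planned) - len(abandoned)) / len(planned) if planned else 1.0,
        "abandoned": abandoned,
        "share": {q: (counts[q] / total if total else 0.0) for q in planned},
    }
```

With the healthcare example above, a brief whose paragraphs are three-fifths international and never mention HIPAA would return a 0.6 share for the international leg and list HIPAA as abandoned. The semantic judge still has to decide whether that balance is acceptable for the question asked.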

Plan coherence is the metric that catches off-by-a-question failures. The other three score the brief against itself and against its sources. Plan coherence scores the brief against the question.
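Alerting on rolling-mean drift per agent version can be sketched with a fixed-size window per version. The window size, baseline, and allowed drop below are placeholder values chosen for the example, not recommendations, and the class name is hypothetical.

```python
from collections import defaultdict, deque


class DriftMonitor:
    """Track a rolling mean of 1-5 scores per agent version."""

    def __init__(self, window: int = 50, baseline: float = 4.0, max_drop: float = 0.5):
        self.baseline = baseline
        self.max_drop = max_drop
        # One bounded window per agent version; old scores fall off the left.
        self._scores = defaultdict(lambda: deque(maxlen=window))

    def record(self, agent_version: str, score: float) -> bool:
        """Add a score; return True when this version should alert.

        No alert fires until the window is full, so a single bad run
        right after a rollout does not page anyone.
        """
        window = self._scores[agent_version]
        window.append(score)
        if len(window) < window.maxlen:
            return False
        return sum(window) / len(window) < self.baseline - self.max_drop
```

Keying the window on the agent version is what makes a regression attributable to a rollout rather than to a calendar week.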

The trace tree of a deep-research agent

A research-agent trace is a tree, not a flat list. The shape matters because the four metrics above each attach at different depths. The companion piece, evaluating deep research agents, covers the long-horizon eval side of the same architecture.

agent.research.run
  agent.plan
    llm.chat (planner)
  agent.retrieve
    retriever.search.sub_query_1
      retriever.document.fetched
    retriever.search.sub_query_2
    retriever.search.sub_query_3
  agent.synthesize
    llm.chat.chunk_1
    llm.chat.chunk_2
    claim.span (sentence + cited_passage)
    claim.span
    claim.span
  agent.cite
    llm.chat.citation_extractor
  agent.verify
    eval.citation_validity_structural
    eval.citation_validity_semantic
    eval.claim_evidence_alignment
    eval.plan_coherence
    eval.source_diversity

What this gives you. Source count comes from retriever.document.fetched spans, not from the citation list at the end (the citation list is the agent’s claim about what it used; the retrieval log is the truth). Per-claim alignment runs on claim.span attributes. Plan coherence joins agent.plan against agent.synthesize. The eval.* spans carry per-rubric scores as attributes; drift alerts ride on them.
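To make the retrieval-log-versus-citation-list distinction concrete, here is a small sketch that walks a span tree and compares fetched documents with cited ones. The tree is represented as nested dictionaries, and the attribute keys `url` and `cited_url` are assumptions for the example; a real trace backend would expose spans through its own query API.

```python
def walk(span):
    """Yield a span and all of its descendants, depth first."""
    yield span
    for child in span.get("children", []):
        yield from walk(child)


def spans_named(root, name):
    return [s for s in walk(root) if s["name"] == name]


def fetched_source_count(root):
    """Count distinct documents the retriever actually fetched."""
    fetched = spans_named(root, "retriever.document.fetched")
    return len({s.get("attrs", {}).get("url") for s in fetched})


def unretrieved_citations(root):
    """Return cited URLs that never appear in a fetch span."""
    fetched = {
        s.get("attrs", {}).get("url")
        for s in spans_named(root, "retriever.document.fetched")
    }
    cited = {
        s.get("attrs", {}).get("cited_url")
        for s in spans_named(root, "claim.span")
    }
    return sorted(u for u in cited - fetched if u)
```

Both functions depend on the spans keeping their parent-child structure and their names; with a flat list that has lost the span names, neither question can be answered.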

A flat span list buries every signal. The investment in tree-structured tracing pays back the first time a research agent regresses; the alternative is grep over a log file at 3am.

Monitoring patterns: live trace, sample-and-judge, Error Feed

Three patterns compose into production research-agent monitoring. Most teams run two; the strongest run all three.

Live trace is the always-on layer. Every run emits a span tree with the structural metrics attached: structural citation validity, source diversity numbers, plan-coverage flags, per-stage latency, per-stage cost. Structural metrics are cheap (regex, lookups, hash joins). Run them inline. The gateway can enforce them as guardrails: a brief with structural citation validity below threshold is held for review or returned with a warning to the user. This is the layer that catches confabulated URLs before the user sees them.

Sample-and-judge is the offline layer. A tail-sampling policy at the OTel collector keeps 100% of runs with errors, guardrail triggers, low structural scores, top-percentile latency or cost, and any experiment cohort; it samples 5-20% of the remaining clean runs uniformly. The retained traces get the expensive semantic rubrics: claim-evidence alignment, semantic citation validity, plan-coherence semantic score. Use a frontier judge on a fraction of traffic and a calibrated lighter judge (or a distilled classifier) on the rest. Score per claim, aggregate per run, write back as span attributes so the trace tree carries the full picture.
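The keep-or-drop decision behind that policy can be written as a single function. This is a sketch of the decision logic only, not collector configuration: the trace field names and the `limits` keys are hypothetical, and the percentile cutoffs are assumed to be computed elsewhere. Clean runs are sampled by hashing the trace id, so the same trace always gets the same decision.

```python
import hashlib


def sample_bucket(trace_id: str) -> float:
    """Map a trace id to a stable value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def keep_trace(trace: dict, limits: dict, clean_rate: float = 0.10) -> bool:
    """Keep every interesting run; sample the clean remainder uniformly."""
    must_keep = (
        trace.get("error", False)
        or trace.get("guardrail_triggered", False)
        or trace.get("experiment_cohort") is not None
        or trace.get("structural_score", 1.0) < limits["min_structural_score"]
        or trace.get("latency_ms", 0) >= limits["latency_top_percentile_ms"]
        or trace.get("cost_usd", 0.0) >= limits["cost_top_percentile_usd"]
    )
    if must_keep:
        return True
    # clean_rate would sit between 0.05 and 0.20 for the 5-20% range above.
    return sample_bucket(trace["trace_id"]) < clean_rate
```

Hash-based sampling is a design choice rather than a requirement; it makes the policy reproducible when a trace is re-evaluated, which a random draw would not.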

Error Feed is the loop closer. Failing traces (any rubric below threshold) cluster into named issues. A judge agent investigates the cluster across span-tools, emits a 4-dimensional trace score (factual grounding, privacy and safety, instruction adherence, optimal plan execution, each 1-5) and an immediate_fix string naming the change to ship. The cluster promotes into the offline regression set; the next PR on that code path cannot pass CI until the new cases clear. Without this loop, the failing traces accumulate in a queue no one reads. With it, every production failure becomes a regression test the team never has to hand-write.

The three patterns layer in that order. Live trace catches the cheap and obvious. Sample-and-judge catches the expensive and subtle. Error Feed turns both into a ratchet that ages the eval set forward with production.

Common mistakes when monitoring research agents

Answer-level scoring instead of claim-level. Fluent-but-wrong briefs pass on aggregate rubrics. Move the unit of evaluation to the claim.
Citation count as a proxy for source diversity. Five citations to two domains is two sources. Count unique domains, primary-vs-secondary ratio, and chain depth.
Faithfulness without alignment. Scoring synthesis against the retrieval log misses cases where the agent retrieved good evidence and cited it for the wrong claim.
Flat span lists. The four metrics attach at different tree depths. Without tree structure, the diagnostic is impossible.
Head sampling at 1%. Research agents are long-tail by design; the rare 40-source run is also the run most likely to fabricate. Sample tail-aware.
No prompt.version and agent.version tags on spans. A regression has to be attributable to a rollout, not to “last week vs this week.”
Trusting the judge without calibration. Judges drift. Pin a small human-labelled hold-out and alarm when judge-vs-human disagreement exceeds the inter-rater baseline.
Skipping plan coherence. The other three rubrics score the brief against itself and its sources. Plan coherence is the one that scores the brief against the question.
No promote-back loop. A failing trace that doesn’t become a regression test is a failure the team will see again in three weeks.
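The judge-calibration check in the list above reduces to comparing two disagreement rates on the human-labelled hold-out. The sketch assumes two human raters and a judge have each scored the same items on the 1-5 scale, and it counts a disagreement when scores differ by more than one point; both the tolerance and the function names are illustrative choices.

```python
def disagreement_rate(a, b, tolerance: int = 1) -> float:
    """Share of items where two raters differ by more than `tolerance`."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must score the same non-empty set of items")
    return sum(1 for x, y in zip(a, b) if abs(x - y) > tolerance) / len(a)


def judge_has_drifted(judge, human_a, human_b, margin: float = 0.0) -> bool:
    """Alarm when the judge disagrees with humans more than humans disagree
    with each other (the inter-rater baseline), plus an optional margin."""
    baseline = disagreement_rate(human_a, human_b)
    judge_rate = max(
        disagreement_rate(judge, human_a),
        disagreement_rate(judge, human_b),
    )
    return judge_rate > baseline + margin
```

Using the inter-rater rate as the bar keeps the alarm honest: a judge is not expected to agree with people more often than people agree with each other.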

How Future AGI ships research-agent monitoring

Future AGI ships the research-monitoring stack as a package, not a single product. Start with the SDK for code-defined rubrics; graduate to the Platform when the loop needs self-improving evaluators, in-product authoring, and lower per-eval cost at sample-and-judge scale.

ai-evaluation (Apache 2.0) is the code-first surface. 50+ EvalTemplate classes including Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy, DetectHallucination, EvaluateFunctionCalling, and CustomLLMJudge for the research-specific rubrics this guide adds. Real Evaluator(fi_api_key=..., fi_secret_key=...).evaluate(...) API. 20+ local heuristic metrics for the structural checks. Error localization tells you which input field caused the failure.

traceAI (Apache 2.0) carries the same rubric as a span-attached score on live traffic. 50+ AI surfaces across Python, TypeScript, Java, and C#; auto-instrumentation for OpenAI, LangChain, Groq, Portkey, Gemini; 14 span kinds including TOOL, RETRIEVER, AGENT, EVALUATOR, GUARDRAIL, VECTOR_DB. Pluggable semantic conventions at register() time. Server-side EvalTag wires the rubric to the span at zero added inference latency, so the same eval template that runs in pytest runs on the live span.

The Future AGI Platform is the operational layer. Self-improving evaluators retune from thumbs feedback. An in-product authoring agent writes custom rubrics from natural-language descriptions, which is the realistic path for the four research-specific rubrics (each is novel enough that an off-the-shelf template needs domain tuning). Classifier-backed evals run at lower per-eval cost than Galileo Luna-2, which makes claim-level scoring on every sampled trace economically tractable.

Error Feed sits inside the eval stack. Failing research traces cluster via HDBSCAN soft-clustering over span embeddings in ClickHouse. A JudgeAgent on Claude Sonnet 4.5 runs a 30-turn investigation across 8 span-tools, with a Claude Haiku Chauffeur summarising spans over 3000 characters and a ~90% prompt cache hit ratio keeping the bill survivable. Per cluster, the Judge emits a 5-category 30-subtype taxonomy classification, the 4-dimensional trace score, and an immediate_fix string. The on-call engineer promotes representative traces into the offline regression set with one click; the next PR has to clear them. Linear is wired today via OAuth; Slack, GitHub, Jira, and PagerDuty are on the roadmap.

For the gateway layer, the Agent Command Center fronts 100+ providers as a single Go binary (Apache 2.0). 18+ built-in guardrail scanners plus 15 third-party adapters at the same network hop, exact and semantic caching, OTel and Prometheus observability, MCP and A2A protocol support. Structural citation validity and PII redaction run here as guardrails on the outbound response, so the user never sees a brief that fails the cheap checks. ~29k req/s with P99 21 ms with guardrails on, on a t3.xlarge.

Ready to monitor your own research agent? Start with the ai-evaluation quickstart: define the four research-specific rubrics as EvalTemplate instances, wire them in pytest against a small versioned dataset of (brief, citations) examples. Then attach the same rubrics as EvalTag spans on live traffic via traceAI. The same rubric running in both places is the diff that closes the citation-trust gap research agents structurally leak.

Frequently asked questions

Why doesn't generic LLM monitoring work for research assistants?

Because the failure shape is different. A chat agent fails when the answer is wrong. A research agent fails when the answer is fluent, the citations resolve, the trace shows 47 spans of green, and three of those citations point to claims the cited paper never actually makes. Faithfulness rubrics score the synthesis against the agent's own retrieval log, so a confidently wrong synthesis on retrieved-but-unread sources still passes. The unit of evaluation moves from "did it answer" to "did it cite verifiable sources, did it actually read them, and does the synthesis match the evidence in them." Answering those questions takes four metrics that generic agent monitoring does not measure: citation validity, source diversity, claim-evidence alignment, and plan coherence.

What is citation validity and how do you score it?

Citation validity has two layers most teams collapse into one. Layer one is structural: the cited URL or document ID resolves, the title and authors match, and the source was actually in the agent's retrieval log. Layer two is semantic: the specific claim attached to that citation is supported by the cited passage, not by the rest of the source or by the model's priors. Score both. Structural checks are cheap and run inline as guardrails. Semantic checks are expensive and run on a sample. A score of 1.0 structural and 0.62 semantic is the most common failure profile in 2026, and an agent with that profile looks identical to a working agent on every other dashboard.

How do you measure source diversity for a research agent?

Not by counting citations. Five citations to five subdomains of the same content farm is one source dressed as five. Source diversity counts independent sources: distinct publishers, distinct authors, distinct primary or secondary tiers, distinct evidence types. A question on a regulatory landscape that cites only the agency website is fine for a press release and inadequate for analysis. Track three numbers per run: unique-domain count, primary-vs-secondary ratio, and the longest chain of derivative citations (one source quoting another quoting another). Diversity is a quality signal because monoculture in sourcing is how research agents drift from the evidence into the consensus they were trained on.

What does claim-evidence alignment catch that groundedness misses?

Groundedness scores the synthesis against the retrieved corpus as a whole. Claim-evidence alignment scores each claim against its specific cited passage. A research brief can be 0.94 grounded as a synthesis and 0.61 aligned at the claim level: the corpus supports the overall narrative, but roughly five of twelve specific claims attach to citations that do not contain them. Users do not read averages. They click citation 7, see it does not support the sentence, and the trust is gone. The rubric is per-claim, not per-answer, and the aggregate is the percentage of claims whose cited passage entails them under a calibrated judge.

Why does plan coherence matter for long-horizon research runs?

A deep-research run is a tree of sub-queries that get scoped, dispatched, and synthesized. The trace is twenty to a hundred spans. The plan from minute one can drift by minute six: the agent retrieves on a sub-question, the result reframes the question, and the synthesis answers a different question than the user asked. Plan coherence scores the final answer against the original user query and the executed sub-query graph. A high score means the plan covered the question's scope, the executed retrievals matched the plan, and the synthesis answered the asked question, not a reframed one. The metric is structural plus semantic, and it is the one that catches off-by-a-question failures the other three miss.

How do you sample traces for sample-and-judge scoring without missing the failures?

Outcome-aware tail sampling. Keep one hundred percent of runs with any rubric below threshold, any guardrail trigger, any error, top-percentile latency, top-percentile cost, and any experiment cohort. Sample five to twenty percent of the remaining clean runs uniformly. Head sampling at one percent drops the long-tail failures the trace was meant to catch, and research agents are long-tail by design (the rare run that consults forty sources is the one most likely to fabricate one). Wire the policy at the OpenTelemetry collector tier so the agent never decides its own evidence.

How does Future AGI ship research-agent monitoring as a stack?

Three surfaces, one rubric vocabulary. traceAI auto-instruments the planner, retriever, synthesizer, citation extractor, and verifier across Python, TypeScript, Java, and C# with 14 span kinds; the trace is a tree, not a flat log. ai-evaluation runs the four research-specific rubrics (citation validity structural plus semantic, source diversity, claim-evidence alignment, plan coherence) as code-defined templates on offline sets in CI and as span-attached scores on live traffic. Error Feed clusters failing traces with HDBSCAN, a Sonnet 4.5 Judge writes a 4-dimensional trace score and an immediate_fix string per cluster, and the cluster promotes into the offline set as a regression test the next PR has to clear. The eval rubric in CI and on the live span is the same rubric, scored by the same judge.

