Evaluating Search-Augmented Agents in 2026: The Four Axes RAG Eval Forgets
Generic RAG eval misses what kills search agents: bad queries, stale sources, monoculture, and broken cites. A four-axis rubric you can ship this week.
Search-augmented agents look like RAG until the citation list breaks. The first user clicks through, lands on a 404, and the product's trust budget is spent in one impression. Three months of eval work (Groundedness, ContextAdherence, FactualAccuracy, all green) did not catch it.
The reason is structural. Generic RAG eval was built for a frozen vector store and a stable ground truth. Search-augmented agents have neither. The corpus is the live web. The agent issues a literal query string that drives what comes back. Sources span government filings and LinkedIn hot takes in a single result set. Users click the URLs. Skip the four axes that are unique to this shape — query quality, source freshness, source diversity, citation validity — and you ship an agent that confidently cites stale or fabricated URLs. Those four are the rubric. The rest of RAG eval still applies, but the rest of RAG eval was already passing.
If you are coming from the broader 2026 LLM evaluation playbook, this post is the search-augmented variant. The dataset, judge, and CI gate ideas carry over. The four axes do not.
Why generic RAG eval misses what kills search agents
Three structural differences pull search-augmented eval out of the RAG bucket.
The corpus is not yours. A vector store you own is a stable artifact: index version, document IDs, chunking strategy, all fixed. The live web is none of those things. Two runs of the same query, twelve hours apart, return different URLs. Without a per-test-case cache, your CI flakes for reasons unrelated to the agent. Reproducibility lives in the snapshot, not in the SERP.
The query is the lever. RAG retrieval is sensitive to embedding choice and chunking: the user’s prompt is embedded and matched against the index. Search retrieval is sensitive to the literal query string the agent writes. “Best laptop 2026 for ML research with battery life over 10 hours” gets rewritten into a query that either preserves the battery constraint or drops it. A bad rewrite returns a clean-looking result set that misses the constraint, and no synthesis prompt recovers from it. RAG eval has no metric for this because RAG retrieval has no rewrite step that the agent controls.
The user verifies. RAG citations point at internal document IDs the user usually cannot inspect. Search citations are URLs the user clicks. A fabricated URL is a one-click trust failure. A URL that resolves but points at a 2019 blog post on a 2026 question is the same. Citation validity becomes load-bearing in a way it never is for RAG. Groundedness against the retrieved set says nothing about whether the citation the agent emitted is real, whether it resolves, or whether the cited passage actually contains the claim.
If you have only evaluated RAG before, the closest mental model is agentic RAG systems — where the agent makes retrieval decisions on the fly. Search-augmented is that, with the corpus replaced by the live web and the user reading the citations.
The four axes generic RAG eval forgets
The metric set that survives a month of search-agent traffic in production.
| Axis | What it scores | Failure mode | Rubric anchor |
|---|---|---|---|
| 1. Query quality | The literal search string the agent sent | Dropped constraint, over-broad rewrite, entity loss | Constraint preservation, entity recall, rewrite calibration |
| 2. Source freshness | Per-source recency vs the question’s clock | Stale-fact bug, no freshness caveat on time-sensitive query | Recency floor by question class, freshness annotation present |
| 3. Source diversity | Domain spread across the retrieved set | Three URLs from one domain, citation monoculture | Unique-domain count, per-domain cap, primary-vs-secondary ratio |
| 4. Citation validity | URLs resolve and passages match | 404 citation, paraphrase drift, fabricated reference | URL resolves, cited passage contains the claim, anchor present |
Score them per query, not aggregated. A single number across the four hides which one collapsed, and the diagnostic value is in the vector. The standard RAG metrics (Groundedness, ContextAdherence, FactualAccuracy, Completeness) still run alongside — they are necessary, just not sufficient.
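To make the "vector, not aggregate" point concrete, here is a minimal sketch of a per-query score record. The `AxisScores` class, the 1-5 scale, and the 3.0 threshold are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AxisScores:
    """Per-query scores on the four axes (assumed 1-5 scale)."""
    query_quality: float
    source_freshness: float
    source_diversity: float
    citation_validity: float

    def mean(self) -> float:
        values = asdict(self).values()
        return sum(values) / len(values)

    def failing_axes(self, threshold: float = 3.0) -> list[str]:
        """Return the axes whose score falls below the threshold."""
        return [name for name, score in asdict(self).items() if score < threshold]


scores = AxisScores(query_quality=5.0, source_freshness=5.0,
                    source_diversity=5.0, citation_validity=1.0)
print(scores.mean())          # 4.0
print(scores.failing_axes())  # ['citation_validity']
```

A mean of 4.0 looks healthy, yet every citation in that answer could be broken. Only the per-axis view shows it.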
Axis 1: Query quality
The agent reads the user’s prompt and writes one or more search strings. That string is what the search engine literally reads. Three failure modes show up in almost every search-agent run.
Constraint drop. The user asks for laptops with battery life over 10 hours. The rewriter compresses to “best laptops 2026 for ML” and the search returns ten options that average five-hour batteries. The synthesizer ranks them and ships. Groundedness is fine because the brief grounded itself in the retrieved set. The constraint is gone.
Entity loss. The user asks about a specific company, regulation, or product version. The rewriter abstracts to a category. The search returns generic results. The brief is well-written and answers a different question.
Over-broad rewrite. The user asks a narrow factual lookup. The rewriter expands into a broader topic. The search returns a Wikipedia overview. The brief paraphrases the overview and misses the specific fact.
Score query quality with a CustomLLMJudge that takes the user’s original prompt and the agent’s rewrite as a pair and scores constraint preservation and entity recall on a 1-5 scale.
```python
from fi.evals import Evaluator
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider

query_quality_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "query_rewrite_quality",
        "model": "claude-sonnet-4-5",
        "grading_criteria": (
            "Given the user's original prompt and the agent's search query, "
            "score 1-5 on rewrite quality. 5 = every entity and constraint "
            "from the prompt is preserved in the search query; the query is "
            "specific enough to retrieve a narrow set. 1 = a constraint or "
            "entity is dropped, or the rewrite is so broad it would surface "
            "generic content. Penalize unnecessary expansion equally with drop."
        ),
    },
)
```
The first time you run this rubric over a week of production traffic, the failure cluster will be “rewriter dropped the year/version/region” on roughly one in eight queries. It is the cheapest fix in the loop and the one most teams skip.
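Dropped years, versions, and quantities can be caught before the judge runs. The sketch below is one possible deterministic pre-check. The helper name `dropped_numeric_constraints` and its regex are assumptions made for illustration, and it covers numeric tokens only, so entity loss and over-broad rewrites still need the judge.

```python
import re

# Hypothetical helper: matches integers and dotted versions such as 2026 or 3.11.
_NUMERIC = re.compile(r"\d+(?:\.\d+)*")


def dropped_numeric_constraints(prompt: str, rewrite: str) -> list[str]:
    """Return numeric tokens that appear in the prompt but not in the rewrite."""
    kept = set(_NUMERIC.findall(rewrite))
    dropped = []
    for token in _NUMERIC.findall(prompt):
        if token not in kept and token not in dropped:
            dropped.append(token)
    return dropped


prompt = "Best laptop 2026 for ML research with battery life over 10 hours"
print(dropped_numeric_constraints(prompt, "best laptops 2026 for ML"))               # ['10']
print(dropped_numeric_constraints(prompt, "ML laptop 2026 battery life 10+ hours"))  # []
```

A non-empty result is a cheap signal that the rewrite deserves a closer look. It does not prove the rewrite is wrong, since a constraint can be restated in words.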
Axis 2: Source freshness
Freshness is a function of the question, not a global setting. “Capital of France” is freshness-insensitive. “Current price of X” is freshness-sensitive on the minute. “FDA guidance on AI-as-a-medical-device” is freshness-sensitive on the quarter. The agent has to decide which regime it is in and behave accordingly: prefer recent sources, refuse stale ones, and annotate the answer with a recency caveat when the question depends on current data.
Two rubrics carry the axis.
A freshness-regime classifier. Per query, classify the freshness sensitivity (insensitive, slow, fast, real-time). This can be a small judge or a heuristic over the question type. Attach the classification as a span attribute so downstream rubrics can score against it.
A recency-floor check. Per retrieved source, attach retrieval.source_recency_days (from the publication date in the result metadata, or from a date-extraction pass on the page itself if the provider does not return it). Per query, compute the median recency of cited sources. Fail when the median exceeds the recency floor for the freshness regime.
```python
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider

freshness_handling_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "freshness_handling",
        "model": "claude-sonnet-4-5",
        "grading_criteria": (
            "Given the user's question, the freshness regime "
            "(insensitive | slow | fast | real-time), and the agent's answer "
            "with citations, score 1-5. 5 = the agent flagged the freshness "
            "regime, preferred recent sources, and refused or annotated when "
            "no recent source was available. 1 = the agent answered confidently "
            "from a source older than the recency floor with no caveat."
        ),
    },
)
```
The stale-fact bug is the failure mode users notice last and complain about loudest. It looks correct on the day the source was indexed and wrong every day after. Without an explicit freshness rubric, no other metric catches it.
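The regime classifier and the recency-floor check described above can both be approximated without a model call. The following is a rough sketch under stated assumptions: the keyword cues and the floor values in days are invented for illustration and would need tuning per domain, and the function names are hypothetical.

```python
from statistics import median

# Illustrative floors in days per regime; None means no recency requirement.
RECENCY_FLOOR_DAYS = {"insensitive": None, "slow": 365, "fast": 30, "real-time": 1}

# Hypothetical keyword cues; a small judge could replace this heuristic.
_REGIME_CUES = (
    ("real-time", ("current price", "right now", "today")),
    ("fast", ("latest", "this week", "this month", "recent")),
    ("slow", ("guidance", "regulation", "policy", "standard")),
)


def classify_regime(question: str) -> str:
    """Return the first regime whose cue appears in the question."""
    q = question.lower()
    for regime, cues in _REGIME_CUES:
        if any(cue in q for cue in cues):
            return regime
    return "insensitive"


def passes_recency_floor(regime: str, cited_recency_days: list[int]) -> bool:
    """Fail when the median age of cited sources exceeds the regime's floor."""
    floor = RECENCY_FLOOR_DAYS[regime]
    if floor is None:
        return True
    if not cited_recency_days:
        return False  # a time-sensitive answer with no dated sources cannot pass
    return median(cited_recency_days) <= floor


regime = classify_regime("FDA guidance on AI-as-a-medical-device")
print(regime)                                           # slow
print(passes_recency_floor(regime, [40, 2200, 90]))     # True  (median 90)
print(passes_recency_floor(regime, [400, 2200, 900]))   # False (median 900)
```

Using the median means one old background source does not fail an answer that otherwise leans on recent material.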
Axis 3: Source diversity
A search call against a general engine returns Wikipedia, vendor blogs, news, government docs, Reddit, LinkedIn, and SEO spam in one set. The agent picks three to five to cite. The diversity failure mode is monoculture: all three citations are from the same domain, or all five trace back to the same upstream source, or the brief reads like consensus when it summarizes one tab of one website.
Three checks cover the axis.
Unique-domain count per citation set. If the brief has six citations and they live on three domains, log the count and apply a per-question-class floor. Factual lookups can be tight (two domains is fine for a date check). Synthesis questions need four or more.
Per-domain cap on citations. No single domain contributes more than two citations to the same answer, except when the question is specifically about that domain (“what does the EU AI Act say about X” can cite eur-lex four times). The cap is a hard rail at the gateway boundary.
Primary-versus-secondary ratio. Per citation, classify as primary (original source: regulator filing, paper, vendor announcement, primary data) or secondary (commentary, summary, aggregator). A regulatory question with zero primary citations is a failure regardless of how grounded the brief is against the secondaries.
```python
from collections import Counter
from urllib.parse import urlparse


def evaluate_diversity(citations, per_domain_cap=2, min_unique_domains=3):
    domains = [urlparse(c.url).netloc for c in citations]
    counts = Counter(domains)
    # default=0 keeps an empty citation list from raising ValueError
    max_per_domain = max(counts.values(), default=0)
    return {
        "unique_domain_count": len(counts),
        "max_per_domain": max_per_domain,
        "monoculture": (
            len(counts) < min_unique_domains
            or max_per_domain > per_domain_cap
        ),
    }
```
The deterministic version runs in microseconds and slots in as a RailType.OUTPUT rail before the brief ships. The judge version handles the primary-versus-secondary classification on top.
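Two details from the per-domain cap are worth sketching separately: grouping `www.` and mixed-case hosts together, and exempting a domain when the question is about that domain. The helpers below are hypothetical names and a minimal sketch. They do not collapse subdomains such as `blog.example.com` into `example.com`, which would need a public-suffix list.

```python
from collections import Counter
from urllib.parse import urlparse


def normalize_domain(url: str) -> str:
    """Lowercase the host and strip a leading 'www.' so variants group together."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domains_over_cap(urls, per_domain_cap=2, exempt_domains=frozenset()):
    """Return {domain: count} for domains above the cap, skipping exempt ones."""
    counts = Counter(normalize_domain(u) for u in urls)
    return {
        domain: n
        for domain, n in counts.items()
        if n > per_domain_cap and domain not in exempt_domains
    }


# Paths are placeholders; only the hosts matter for the check.
urls = [
    "https://eur-lex.europa.eu/a",
    "https://eur-lex.europa.eu/b",
    "https://eur-lex.europa.eu/c",
    "https://www.example.com/x",
]
print(domains_over_cap(urls))                                        # {'eur-lex.europa.eu': 3}
print(domains_over_cap(urls, exempt_domains={"eur-lex.europa.eu"}))  # {}
```

The exemption set would come from whatever step decides the question is about a specific source. That step is outside this sketch.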
Axis 4: Citation validity
The highest-signal cheap check in the entire stack, and the one most teams discover only after the first user complaint. Three sub-checks compose it.
The URL resolves. HEAD-request every citation in the answer. A non-2xx response is a fail. A redirect to a different domain is a fail. Some servers reject HEAD outright, so retry those with a GET before failing the citation. The check is network-bound but cheap next to the synthesis call, and it rejects fabricated URLs and stale links before the user sees them.
The cited passage exists on the page. Fetch the page body, extract the text, and verify the passage the agent quoted (or paraphrased) is actually present. A paraphrase that drifts beyond an embedding-similarity threshold is a fail. This catches the case where the URL resolves to a real page but the agent invented the quote.
The claim is supported by the cited passage. This is where ChunkAttribution and Groundedness apply, but at the per-claim level — not the per-answer level. A brief can be 0.94 grounded as a whole and 0.61 aligned at the per-claim level; users only see the per-claim view.
Wire it as a two-stage rail. The deterministic stage (URL resolves, passage present) is a hard gate at the gateway. The judge stage (claim-passage entailment) runs as an EvalTag on the synthesis span. Failures from either route a rejection back to the synthesizer with the failed citation flagged.
```python
from urllib.parse import urlparse

import httpx
from fi.evals import Evaluator
from fi.evals.templates import ChunkAttribution, ContainsValidLink
from fi.testcases import TestCase


def url_resolves(url: str) -> bool:
    """True when the URL returns 2xx and any redirect stays on the same host."""
    try:
        r = httpx.head(url, follow_redirects=True, timeout=5.0)
    except httpx.HTTPError:
        return False
    same_host = urlparse(str(r.url)).netloc == urlparse(url).netloc
    return 200 <= r.status_code < 300 and same_host


def passage_present(citation, fetched_pages: dict) -> bool:
    body = fetched_pages.get(citation.url, "")
    return citation.quoted_passage.strip() in body  # plus paraphrase-tolerant pass


# `answer`, `fetched_pages`, API_KEY, and SECRET_KEY come from the calling code.
# claim-passage entailment on the surviving citations
cases = [
    TestCase(
        input=citation.claim_text,
        output=citation.claim_text,
        context=citation.cited_passage,
    )
    for citation in answer.citations
    if url_resolves(citation.url) and passage_present(citation, fetched_pages)
]

result = Evaluator(fi_api_key=API_KEY, fi_secret_key=SECRET_KEY).evaluate(
    eval_templates=[ContainsValidLink(), ChunkAttribution()],
    inputs=cases,
)
```
A citation that survives all three checks is a real citation. A citation that fails any of the three is a hard reject — log it, regenerate the answer with the failing URL excluded from the context, and re-run the gate. Citation validity is the single highest-signal cheap check in this category; teams that wire it stop seeing the “user clicked a 404” failure mode almost immediately.
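The `# plus paraphrase-tolerant pass` comment in `passage_present` leaves the fuzzy match open. The text above describes an embedding-similarity threshold. The sketch below swaps in a plain token-overlap window as a standard-library stand-in, so treat it as an approximation of that idea and not the same technique. The function names and the 0.8 threshold are assumptions.

```python
import re


def _tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def best_window_overlap(passage: str, body: str) -> float:
    """Highest fraction of passage tokens found in any body window of equal length."""
    p = _tokens(passage)
    b = _tokens(body)
    if not p or not b:
        return 0.0
    wanted = set(p)
    width = min(len(p), len(b))
    best = 0.0
    for start in range(len(b) - width + 1):
        window = set(b[start:start + width])
        best = max(best, len(wanted & window) / len(wanted))
    return best


def passage_present_fuzzy(passage: str, body: str, threshold: float = 0.8) -> bool:
    return best_window_overlap(passage, body) >= threshold


body = "The battery lasts up to 12 hours under a light workload, according to the vendor."
print(passage_present_fuzzy("battery lasts up to 12 hours", body))                      # True
print(passage_present_fuzzy("battery lasts up to 30 hours with fast charging", body))  # False
```

Token overlap weighs every token equally, so a short quote with a single swapped number can still clear the threshold. That gap is why the claim-passage entailment stage stays in the pipeline.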
Caching the snapshot so CI does not flake
Two runs against the same prompt twelve hours apart can return different URLs. Without a per-test-case cache, the CI gate flakes for reasons unrelated to the agent, and the team loses faith in the eval.
Cache at three layers. The search call (provider, query, response headers, full result JSON). The page fetch (URL, response, body, fetch timestamp). The page-to-passage extraction (cleaned text, date metadata, anchor IDs). Replay every test from cache. Re-snapshot deliberately on a schedule that fits your domain — weekly for news agents, monthly for stable how-to agents — and treat each re-snapshot as a new golden-set version with its own baseline.
This is the difference between an eval suite that runs and one that nobody trusts.
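As a rough illustration of the search-call layer only, the sketch below records a response on first use and replays it afterwards. `SnapshotCache`, its on-disk layout, and the `fetch` callable are hypothetical. The page-fetch and extraction layers would follow the same record-then-replay pattern with their own keys.

```python
import hashlib
import json
from pathlib import Path


class SnapshotCache:
    """Record search responses once, then replay them from disk in CI."""

    def __init__(self, root: Path, version: str):
        # One directory per golden-set version, so a re-snapshot never
        # overwrites the baseline it replaces.
        self.dir = Path(root) / version
        self.dir.mkdir(parents=True, exist_ok=True)

    def _path(self, provider: str, query: str) -> Path:
        key = hashlib.sha256(f"{provider}\n{query}".encode("utf-8")).hexdigest()
        return self.dir / f"{key}.json"

    def get_or_record(self, provider: str, query: str, fetch):
        """Return the cached response, calling fetch(query) only on a miss."""
        path = self._path(provider, query)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        response = fetch(query)  # must return JSON-serializable data
        path.write_text(json.dumps(response), encoding="utf-8")
        return response
```

In CI the `fetch` argument can be a function that raises, which turns any cache miss into a loud failure instead of a silent live call.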
Tracing the search-augmented agent
The eval scores attach to spans. The spans are how you debug a bad answer. Three span kinds carry the load.
RETRIEVER for the search call. One retriever span per search invocation. Attach retrieval.source_type="web", per-source retrieval.source_url, retrieval.source_recency_days, retrieval.source_authority_score, and the search.query the agent literally sent (the input for the query-quality judge).
TOOL for the HTTP call to Tavily, Brave, Bing, SerpAPI, Exa, or the Perplexity API. Splitting the wire call from the retriever span lets gateway-level latency and cost land on the right node.
LLM for the synthesis call. Standard LLM span, routed through the Agent Command Center gateway so x-prism-cost, x-prism-latency-ms, x-prism-model-used, and x-prism-fallback-used land on span attributes.
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
# swap the instrumentor for whatever runtime drives the agent
from traceai_openai import OpenAIInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="search-augmented-agent",
)

OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
```
If the agent runs on LangChain, LlamaIndex, CrewAI, Autogen, or the OpenAI Agents SDK, swap the instrumentor. traceAI ships instrumentors for 50+ AI surfaces across Python, TypeScript, Java (including Spring AI and LangChain4j), and C#. For the broader picture see the best AI agent observability tools writeup.
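The RETRIEVER span attributes listed earlier could be assembled as a flat dictionary before being attached to the span. This is a sketch under assumptions: the function name is hypothetical, the input `sources` shape is invented, and storing per-source values as parallel lists under one key is one possible encoding, not a documented traceAI convention.

```python
def retriever_span_attributes(query: str, sources: list[dict]) -> dict:
    """Build flat attributes for one RETRIEVER span from its search results."""
    return {
        "search.query": query,  # the literal string the agent sent
        "retrieval.source_type": "web",
        # Parallel lists: index i in each list describes the same source.
        "retrieval.source_url": [s["url"] for s in sources],
        "retrieval.source_recency_days": [s["recency_days"] for s in sources],
        "retrieval.source_authority_score": [s["authority_score"] for s in sources],
    }


attrs = retriever_span_attributes(
    "ML laptop 2026 battery life 10+ hours",
    [
        {"url": "https://example.com/review", "recency_days": 12, "authority_score": 0.7},
        {"url": "https://example.org/specs", "recency_days": 40, "authority_score": 0.9},
    ],
)
print(attrs["retrieval.source_recency_days"])  # [12, 40]
```

Keeping `search.query` on the same span as the per-source fields lets the query-quality judge and the freshness check read from one place.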
Where Future AGI fits
Future AGI ships the self-improving loop for search-augmented agents: generate the dataset, simulate the rewrite-search-synthesize trace, evaluate on the four axes plus the built-in templates, optimize the prompts against the eval, and loop the failures back into the dataset. Other tools give you the parts. Only Future AGI loops them into one product.
ai-evaluation (Apache 2.0) carries the rubrics. Built-in templates handle the standard RAG layer (Groundedness, ChunkAttribution, ContextAdherence, FactualAccuracy, Completeness, AnswerRefusal, ContainsValidLink). CustomLLMJudge with grading_criteria handles the four search-specific rubrics (QueryRewriteQuality, FreshnessHandling, SourceDiversity, CitationValidity). 20+ heuristic metrics handle the deterministic checks (URL resolves, unique-domain count, per-domain cap) at sub-second latency.
traceAI (Apache 2.0) carries the same rubric as a span-attached score on live traffic. 14 span kinds including RETRIEVER, TOOL, LLM, EVALUATOR, GUARDRAIL. Server-side EvalTag wires the rubric to the span at zero added inference latency, so the same template runs in pytest and on the live span.
Three scanners run inline on the retrieved-content boundary. MaliciousURLScanner rejects phishing and typo-squat domains in the result set. SecretsScanner blocks leaked API keys from pastebins and gists ending up in the cited brief. InvisibleCharScanner flags zero-width and bidi characters that adversaries embed in pages to steer LLM interpretation. All three run in under ten milliseconds and slot in as RailType.INPUT rails before the synthesis call reads the page.
The Agent Command Center is the operational gateway: 100+ providers as a single Go binary, OpenAI-compatible drop-in, 18+ built-in guardrail scanners plus 15 third-party adapters at the same network hop. Citation-validity rails and the malicious-URL scanner run here on the outbound brief; tag-scoped budgets cap runaway query rewriters before the dollar ceiling. SOC 2 Type II, HIPAA, GDPR, and CCPA certified.
Failing runs land in the Error Feed. HDBSCAN soft-clustering groups failures over span embeddings; a Sonnet 4.5 Judge agent runs the investigation across span-tools, with a Haiku Chauffeur summarizing long spans at roughly 90% prompt-cache hit. Per cluster, the Judge writes an immediate_fix naming the axis that broke and the change to ship. Typical clusters in this category: “rewriter dropped year qualifier on regulation queries,” “all citations from same domain on healthcare AI questions,” “freshness ignored on stock-price queries,” “cited URL resolves to a redirect domain.” Promoted clusters land in the next golden-set version.
The four-axis rubric is the input. A clustered failure with a named axis and an immediate_fix is the output. That is the loop.
Ready to evaluate your own search-augmented agent? Start with the ai-evaluation quickstart: wire the four-axis rubric against a 50-query stratified set and check the per-axis scores before you trust any aggregate. Then attach the same rubrics as EvalTag spans on live traffic via traceAI. The vector is the dashboard. The axis that drops is the bug.