Evaluating Search-Augmented Agents in 2026: The Four Axes RAG Eval Forgets

Generic RAG eval misses what kills search agents: bad queries, stale sources, monoculture, and broken cites. A four-axis rubric you can ship this week.


Search-augmented agents look like RAG until the citation list breaks. The first user clicks through, lands on a 404, and the product's trust budget is spent in one impression. Three months of eval work (Groundedness, ContextAdherence, FactualAccuracy, all green) did not catch it.

The reason is structural. Generic RAG eval was built for a frozen vector store and a stable ground truth. Search-augmented agents have neither. The corpus is the live web. The agent issues a literal query string that drives what comes back. Sources span government filings and LinkedIn hot takes in a single result set. Users click the URLs. Skip the four axes that are unique to this shape — query quality, source freshness, source diversity, citation validity — and you ship an agent that confidently cites stale or fabricated URLs. Those four are the rubric. The rest of RAG eval still applies, but the rest of RAG eval was already passing.

If you are coming from the broader 2026 LLM evaluation playbook, this post is the search-augmented variant. The dataset, judge, and CI gate ideas carry over. The four axes do not.

Why generic RAG eval misses what kills search agents

Three structural differences pull search-augmented eval out of the RAG bucket.

The corpus is not yours. A vector store you own is a stable artifact: index version, document IDs, chunking strategy, all fixed. The live web is none of those things. Two runs of the same query, twelve hours apart, return different URLs. Without a per-test-case cache, your CI flakes for reasons unrelated to the agent. Reproducibility lives in the snapshot, not in the SERP.

The query is the lever. RAG retrieval is sensitive to embedding choice and chunking: the user’s prompt is encoded and matched against the index. Search retrieval is sensitive to the literal query string the agent writes. “Best laptop 2026 for ML research with battery life over 10 hours” is rewritten into a query that either carries the battery constraint through or drops it. A bad rewrite returns a clean-looking result set that misses the constraint, and no synthesis prompt recovers from it. RAG eval has no metric for this because RAG retrieval does not have a rewrite step that the agent controls.

The user verifies. RAG citations point at internal document IDs the user usually cannot inspect. Search citations are URLs the user clicks. A fabricated URL is a one-click trust failure. A URL that resolves but points at a 2019 blog post on a 2026 question is the same. Citation validity becomes load-bearing in a way it never is for RAG. Groundedness against the retrieved set says nothing about whether the citation the agent emitted is real, whether it resolves, or whether the cited passage actually contains the claim.

If you have only evaluated RAG before, the closest mental model is agentic RAG systems — where the agent makes retrieval decisions on the fly. Search-augmented is that, with the corpus replaced by the live web and the user reading the citations.

The four axes generic RAG eval forgets

The metric set that survives a month of search-agent traffic in production.

| Axis | What it scores | Failure mode | Rubric anchor |
| --- | --- | --- | --- |
| 1. Query quality | The literal search string the agent sent | Dropped constraint, over-broad rewrite, entity loss | Constraint preservation, entity recall, rewrite calibration |
| 2. Source freshness | Per-source recency vs the question’s clock | Stale-fact bug, no freshness caveat on time-sensitive query | Recency floor by question class, freshness annotation present |
| 3. Source diversity | Domain spread across the retrieved set | Three URLs from one domain, citation monoculture | Unique-domain count, per-domain cap, primary-vs-secondary ratio |
| 4. Citation validity | URLs resolve and passages match | 404 citation, paraphrase drift, fabricated reference | URL resolves, cited passage contains the claim, anchor present |

Score them per query, not aggregated. A single number across the four hides which one collapsed, and the diagnostic value is in the vector. The standard RAG metrics (Groundedness, ContextAdherence, FactualAccuracy, Completeness) still run alongside — they are necessary, just not sufficient.
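To see why the aggregate hides the failure, here is a minimal sketch. The `AxisScores` name, the 0..1 scale, the 0.7 threshold, and the example scores are all illustrative assumptions, not values from any library or from production data.

```python
from dataclasses import dataclass, asdict
from statistics import mean


@dataclass(frozen=True)
class AxisScores:
    """Per-query scores on the four axes, each on an assumed 0..1 scale."""
    query_quality: float
    source_freshness: float
    source_diversity: float
    citation_validity: float


def failing_axes(scores: AxisScores, threshold: float = 0.7) -> list[str]:
    """Return the names of the axes that score below the threshold."""
    return [name for name, value in asdict(scores).items() if value < threshold]


# Hypothetical query: three healthy axes, one collapsed.
scores = AxisScores(
    query_quality=0.95,
    source_freshness=0.90,
    source_diversity=0.90,
    citation_validity=0.20,
)

# The mean clears the threshold even though one axis has collapsed.
aggregate = mean(list(asdict(scores).values()))
broken = failing_axes(scores)
```

Here the aggregate clears a 0.7 gate while `broken` names citation validity. Gating on the vector instead of the mean is what surfaces the collapsed axis.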

Axis 1: Query quality

The agent reads the user’s prompt and writes one or more search strings. Each string is what the search engine literally reads. Three failure modes show up in almost every search-agent run.

Constraint drop. The user asks for laptops with battery life over 10 hours. The rewriter compresses to “best laptops 2026 for ML” and the search returns ten options that average five-hour batteries. The synthesizer ranks them and ships. Groundedness is fine because the brief grounded itself in the retrieved set. The constraint is gone.

Entity loss. The user asks about a specific company, regulation, or product version. The rewriter abstracts to a category. The search returns generic results. The brief is well-written and answers a different question.

Over-broad rewrite. The user asks a narrow factual lookup. The rewriter expands into a broader topic. The search returns a Wikipedia overview. The brief paraphrases the overview and misses the specific fact.

Score query quality with a CustomLLMJudge that takes the user’s original prompt and the agent’s rewrite as a pair and scores constraint preservation and entity recall on a 1-5 scale.

from fi.evals import Evaluator
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider

query_quality_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "query_rewrite_quality",
        "model": "claude-sonnet-4-5",
        "grading_criteria": (
            "Given the user's original prompt and the agent's search query, "
            "score 1-5 on rewrite quality. 5 = every entity and constraint "
            "from the prompt is preserved in the search query; the query is "
            "specific enough to retrieve a narrow set. 1 = a constraint or "
            "entity is dropped, or the rewrite is so broad it would surface "
            "generic content. Penalize unnecessary expansion equally with drop."
        ),
    },
)

The first time you run this rubric over a week of production traffic, the failure cluster will be “rewriter dropped the year/version/region” on roughly one in eight queries. It is the cheapest fix in the loop and the one most teams skip.
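A deterministic pre-check can catch the dropped year or version before the judge runs. The sketch below is an assumption layered on top of the rubric, not part of any library: it only compares numeric tokens, so region names and other entities still need the judge. The function name `dropped_numeric_tokens` is hypothetical.

```python
import re

# Matches numeric tokens such as years ("2026"), quantities ("10"),
# and dotted versions ("3.5").
_NUMERIC = re.compile(r"\d+(?:\.\d+)*")


def dropped_numeric_tokens(prompt: str, rewrite: str) -> set[str]:
    """Numeric tokens present in the user's prompt but absent from the rewrite."""
    return set(_NUMERIC.findall(prompt)) - set(_NUMERIC.findall(rewrite))


prompt = "Best laptop 2026 for ML research with battery life over 10 hours"
rewrite = "best laptops 2026 for ML"

# The year survived the rewrite; the battery-life constraint did not.
dropped = dropped_numeric_tokens(prompt, rewrite)
```

A non-empty result is a cheap signal to log next to the judge score. It does not replace the judge: a rewrite can keep every number and still lose the entity.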

Axis 2: Source freshness

Freshness is a function of the question, not a global setting. “Capital of France” is freshness-insensitive. “Current price of X” is freshness-sensitive on the minute. “FDA guidance on AI-as-a-medical-device” is freshness-sensitive on the quarter. The agent has to decide which regime it is in and behave accordingly: prefer recent sources, refuse stale ones, and annotate the answer with a recency caveat when the question depends on current data.

Two rubrics carry the axis.

A freshness-regime classifier. Per query, classify the freshness sensitivity (insensitive, slow, fast, real-time). This can be a small judge or a heuristic over the question type. Attach the classification as a span attribute so downstream rubrics can score against it.

A recency-floor check. Per retrieved source, attach retrieval.source_recency_days (from the publication date in the result metadata, or from a date-extraction pass on the page itself if the provider does not return it). Per query, compute the median recency of cited sources. Fail when the median exceeds the recency floor for the freshness regime.

from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider

freshness_handling_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "freshness_handling",
        "model": "claude-sonnet-4-5",
        "grading_criteria": (
            "Given the user's question, the freshness regime "
            "(insensitive | slow | fast | real-time), and the agent's answer "
            "with citations, score 1-5. 5 = the agent flagged the freshness "
            "regime, preferred recent sources, and refused or annotated when "
            "no recent source was available. 1 = the agent answered confidently "
            "from a source older than the recency floor with no caveat."
        ),
    },
)

The stale-fact bug is the failure mode users notice last and complain about loudest. It looks correct on the day the source was indexed and wrong every day after. Without an explicit freshness rubric, no other metric catches it.
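The recency-floor check is deterministic and can run without a judge. The sketch below assumes the freshness regime has already been classified and that each cited source carries an age in days. The floor values in `RECENCY_FLOOR_DAYS` are hypothetical placeholders to tune per domain, and `recency_check` is an illustrative name.

```python
from statistics import median
from typing import Optional

# Hypothetical recency floors in days per freshness regime.
# None means the regime has no floor (freshness-insensitive questions).
RECENCY_FLOOR_DAYS: dict[str, Optional[float]] = {
    "insensitive": None,
    "slow": 365,
    "fast": 30,
    "real-time": 1,
}


def recency_check(regime: str, source_recency_days: list[float]) -> dict:
    """Fail when the median age of cited sources exceeds the regime's floor."""
    floor = RECENCY_FLOOR_DAYS[regime]
    if not source_recency_days:
        # No dated sources: only a freshness-insensitive question can pass.
        return {"median_recency_days": None, "floor_days": floor, "passed": floor is None}
    med = median(source_recency_days)
    return {
        "median_recency_days": med,
        "floor_days": floor,
        "passed": floor is None or med <= floor,
    }


fresh = recency_check("fast", [3, 10, 400])   # median is 10 days
stale = recency_check("fast", [45, 90, 400])  # median is 90 days
```

The median tolerates one old background source in an otherwise recent citation set. A stricter variant would gate on the oldest source instead.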

Axis 3: Source diversity

A search call against a general engine returns Wikipedia, vendor blogs, news, government docs, Reddit, LinkedIn, and SEO spam in one set. The agent picks three to five to cite. The diversity failure mode is monoculture: all three citations are from the same domain, or all five trace back to the same upstream source, or the brief reads like consensus when it summarizes one tab of one website.

Three checks cover the axis.

Unique-domain count per citation set. If the brief has six citations and they live on three domains, log the count and apply a per-question-class floor. Factual lookups can be tight (two domains is fine for a date check). Synthesis questions need four or more.

Per-domain cap on citations. No single domain contributes more than two citations to the same answer, except when the question is specifically about that domain (“what does the EU AI Act say about X” can cite eur-lex four times). The cap is a hard rail at the gateway boundary.

Primary-versus-secondary ratio. Per citation, classify as primary (original source: regulator filing, paper, vendor announcement, primary data) or secondary (commentary, summary, aggregator). A regulatory question with zero primary citations is a failure regardless of how grounded the brief is against the secondaries.

from collections import Counter
from urllib.parse import urlparse

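def evaluate_diversity(citations, per_domain_cap=2, min_unique_domains=3):
    # Count citations per host; lowercase so "Example.com" and "example.com" match.
    domains = [urlparse(c.url).netloc.lower() for c in citations]
    counts = Counter(domains)
    # default=0 avoids a ValueError when the answer has no citations.
    max_per_domain = max(counts.values(), default=0)
    return {
        "unique_domain_count": len(counts),
        "max_per_domain": max_per_domain,
        # Monoculture: too few distinct domains, or one domain over the cap.
        "monoculture": (
            len(counts) < min_unique_domains
            or max_per_domain > per_domain_cap
        ),
    }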

The deterministic version runs in microseconds and slots in as a RailType.OUTPUT rail before the brief ships. The judge version handles the primary-versus-secondary classification on top.
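The per-question-class floor and the cap exemption for questions about a specific domain can be layered onto the same deterministic check. This is a sketch under assumptions: the class names in `MIN_UNIQUE_DOMAINS`, the `diversity_gate` function, and the `exempt_domains` parameter are hypothetical, and the floors simply mirror the guidance above (two domains for factual lookups, four for synthesis).

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical floors per question class, following the guidance in this section.
MIN_UNIQUE_DOMAINS = {"factual_lookup": 2, "synthesis": 4}


def diversity_gate(urls, question_class, per_domain_cap=2, exempt_domains=frozenset()):
    """Apply a per-class unique-domain floor and a per-domain citation cap."""
    # Normalize hosts so "www.example.org" and "example.org" count as one domain.
    domains = [urlparse(u).netloc.lower().removeprefix("www.") for u in urls]
    counts = Counter(domains)
    floor = MIN_UNIQUE_DOMAINS[question_class]
    # Domains the question is specifically about are exempt from the cap.
    over_cap = sorted(
        d for d, n in counts.items() if n > per_domain_cap and d not in exempt_domains
    )
    return {
        "unique_domain_count": len(counts),
        "below_floor": len(counts) < floor,
        "over_cap_domains": over_cap,
        "passed": len(counts) >= floor and not over_cap,
    }


urls = [
    "https://eur-lex.europa.eu/a",
    "https://eur-lex.europa.eu/b",
    "https://eur-lex.europa.eu/c",
    "https://eur-lex.europa.eu/d",
    "https://www.example.org/commentary",
]
capped = diversity_gate(urls, "factual_lookup")
exempted = diversity_gate(urls, "factual_lookup", exempt_domains={"eur-lex.europa.eu"})
```

Stripping only the `www.` prefix is a simplification. Grouping subdomains under one registrable domain would need a public-suffix list, which this sketch leaves out.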

Axis 4: Citation validity

The highest-signal cheap check in the entire stack, and the one most teams discover only after the first user complaint. Three sub-checks compose it.

The URL resolves. HEAD-request every citation in the answer. A non-2xx response is a fail. A redirect to a different domain is a fail. The check runs in tens of milliseconds and rejects fabricated URLs and stale links before the user sees them.

The cited passage exists on the page. Fetch the page body, extract the text, and verify the passage the agent quoted (or paraphrased) is actually present. A paraphrase that drifts beyond an embedding-similarity threshold is a fail. This catches the case where the URL resolves to a real page but the agent invented the quote.

The claim is supported by the cited passage. This is where ChunkAttribution and Groundedness apply, but at the per-claim level — not the per-answer level. A brief can be 0.94 grounded as a whole and 0.61 aligned at the per-claim level; users only see the per-claim view.

Wire it as a two-stage rail. The deterministic stage (URL resolves, passage present) is a hard gate at the gateway. The judge stage (claim-passage entailment) runs as an EvalTag on the synthesis span. Failures from either route a rejection back to the synthesizer with the failed citation flagged.

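import httpx
from urllib.parse import urlparse

from fi.evals import Evaluator
from fi.evals.templates import ChunkAttribution, ContainsValidLink
from fi.testcases import TestCase

def url_resolves(url: str) -> bool:
    # Pass only if the final response is 2xx and redirects stayed on the original host.
    try:
        r = httpx.head(url, follow_redirects=True, timeout=5.0)
        return r.is_success and urlparse(str(r.url)).netloc == urlparse(url).netloc
    except (httpx.HTTPError, httpx.InvalidURL):
        # Network errors, timeouts, redirect loops, and malformed URLs count as unresolved.
        return False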

def passage_present(citation, fetched_pages: dict) -> bool:
    body = fetched_pages.get(citation.url, "")
    return citation.quoted_passage.strip() in body  # plus paraphrase-tolerant pass

# claim-passage entailment on the surviving citations
cases = [
    TestCase(
        input=citation.claim_text,
        output=citation.claim_text,
        context=citation.cited_passage,
    )
    for citation in answer.citations
    if url_resolves(citation.url) and passage_present(citation, fetched_pages)
]

result = Evaluator(fi_api_key=API_KEY, fi_secret_key=SECRET_KEY).evaluate(
    eval_templates=[ContainsValidLink(), ChunkAttribution()],
    inputs=cases,
)

A citation that survives all three checks is a real citation. A citation that fails any of the three is a hard reject — log it, regenerate the answer with the failing URL excluded from the context, and re-run the gate. Citation validity is the single highest-signal cheap check in this category; teams that wire it stop seeing the “user clicked a 404” failure mode almost immediately.
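An exact-substring passage check fails on differences in whitespace, case, or punctuation, so it usually needs a tolerant pass behind it. The sketch below uses the standard library's `difflib` as a stand-in for the embedding-similarity threshold described above. It is not an embedding model and will not recognize true paraphrases. The function name and the 0.85 threshold are assumptions.

```python
import re
from difflib import SequenceMatcher


def _words(text: str) -> list[str]:
    """Lowercase word tokens, ignoring whitespace and punctuation."""
    return re.findall(r"\w+", text.lower())


def passage_present_fuzzy(quoted: str, body: str, threshold: float = 0.85) -> bool:
    """True if some window of the page body closely matches the quoted passage."""
    q, b = _words(quoted), _words(body)
    if not q or not b:
        return False
    n = len(q)
    # Slide a window of the quote's length across the page, word by word.
    for start in range(max(1, len(b) - n + 1)):
        window = b[start:start + n]
        if SequenceMatcher(None, q, window, autojunk=False).ratio() >= threshold:
            return True
    return False


body = "In our tests, the battery lasts over ten hours on a single charge."
real_quote = passage_present_fuzzy("the  battery LASTS over\nten hours", body)
invented_quote = passage_present_fuzzy("The device costs nine hundred dollars", body)
```

The sliding window costs more as pages and quotes get longer, which is acceptable for a gate over a handful of citations. For long pages, an exact-match fast path before the fuzzy pass keeps the common case cheap.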

Caching the snapshot so CI does not flake

Two runs against the same prompt twelve hours apart can return different URLs. Without a per-test-case cache, the CI gate flakes for reasons unrelated to the agent, and the team loses faith in the eval.

Cache at three layers. The search call (provider, query, response headers, full result JSON). The page fetch (URL, response, body, fetch timestamp). The page-to-passage extraction (cleaned text, date metadata, anchor IDs). Replay every test from cache. Re-snapshot deliberately on a schedule that fits your domain — weekly for news agents, monthly for stable how-to agents — and treat each re-snapshot as a new golden-set version with its own baseline.

This is the difference between an eval suite that runs and one that nobody trusts.
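A record-then-replay cache for one layer can be sketched as follows. The `SnapshotCache` class, the on-disk JSON layout, and the key fields are all illustrative assumptions. The same pattern would be repeated for the page-fetch and extraction layers.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class SnapshotCache:
    """Record a layer's response on first use, then replay it from disk."""

    def __init__(self, root: Path, replay_only: bool = False):
        self.root = root
        self.replay_only = replay_only  # True in CI: never touch the live web
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, layer: str, key: dict) -> Path:
        # Stable digest of the request key, e.g. provider plus query string.
        digest = hashlib.sha256(json.dumps(key, sort_keys=True).encode("utf-8")).hexdigest()
        return self.root / layer / f"{digest}.json"

    def get_or_record(self, layer: str, key: dict, fetch: Callable[[], dict]) -> dict:
        path = self._path(layer, key)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        if self.replay_only:
            raise KeyError(f"no snapshot for layer={layer!r} key={key!r}")
        value = fetch()
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(value), encoding="utf-8")
        return value


calls = []

def fake_search() -> dict:
    # Stand-in for a live search call; counts how often the "web" is hit.
    calls.append(1)
    return {"results": [{"url": "https://example.com/a"}]}

cache = SnapshotCache(Path(tempfile.mkdtemp()))
key = {"provider": "example-provider", "query": "best laptops 2026"}
first = cache.get_or_record("search", key, fake_search)
second = cache.get_or_record("search", key, fake_search)  # served from disk
```

Running CI with `replay_only=True` turns a missing snapshot into a loud failure instead of a silent live fetch. A deliberate re-snapshot then amounts to recording into a new root directory.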

Tracing the search-augmented agent

The eval scores attach to spans. The spans are how you debug a bad answer. Three span kinds carry the load.

RETRIEVER for the search call. One retriever span per search invocation. Attach retrieval.source_type="web", per-source retrieval.source_url, retrieval.source_recency_days, retrieval.source_authority_score, and the search.query the agent literally sent (the input for the query-quality judge).

TOOL for the HTTP call to Tavily, Brave, Bing, SerpAPI, Exa, or the Perplexity API. Splitting the wire call from the retriever span lets gateway-level latency and cost land on the right node.

LLM for the synthesis call. Standard LLM span, routed through the Agent Command Center gateway so x-prism-cost, x-prism-latency-ms, x-prism-model-used, and x-prism-fallback-used land on span attributes.

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="search-augmented-agent",
)

# swap the instrumentor for whatever runtime drives the agent
from traceai_openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

If the agent runs on LangChain, LlamaIndex, CrewAI, Autogen, or the OpenAI Agents SDK, swap the instrumentor. traceAI covers 50+ AI surfaces across Python, TypeScript, Java (including Spring AI and LangChain4j), and C#. For the broader picture, see the best AI agent observability tools writeup.

Where Future AGI fits

Future AGI (FAGI) ships the self-improving loop for search-augmented agents: generate the dataset, simulate the rewrite-search-synthesize trace, evaluate on the four axes plus the built-in templates, optimize the prompts against the eval, and loop the failures back into the dataset. Other tools give you the parts. Only FAGI loops them into one product.

ai-evaluation (Apache 2.0) carries the rubrics. Built-in templates handle the standard RAG layer (Groundedness, ChunkAttribution, ContextAdherence, FactualAccuracy, Completeness, AnswerRefusal, ContainsValidLink). CustomLLMJudge with grading_criteria handles the four search-specific rubrics (QueryRewriteQuality, FreshnessHandling, SourceDiversity, CitationValidity). 20+ heuristic metrics handle the deterministic checks (URL resolves, unique-domain count, per-domain cap) at sub-second latency.

traceAI (Apache 2.0) carries the same rubric as a span-attached score on live traffic. 14 span kinds including RETRIEVER, TOOL, LLM, EVALUATOR, GUARDRAIL. Server-side EvalTag wires the rubric to the span at zero added inference latency, so the same template runs in pytest and on the live span.

Three scanners run inline on the retrieved-content boundary. MaliciousURLScanner rejects phishing and typo-squat domains in the result set. SecretsScanner blocks leaked API keys from pastebins and gists ending up in the cited brief. InvisibleCharScanner flags zero-width and bidi characters that adversaries embed in pages to steer LLM interpretation. All three run in under ten milliseconds and slot in as RailType.INPUT rails before the synthesis call reads the page.

The Agent Command Center is the operational gateway: 100+ providers as a single Go binary, OpenAI-compatible drop-in, 18+ built-in guardrail scanners plus 15 third-party adapters at the same network hop. Citation-validity rails and the malicious-URL scanner run here on the outbound brief; tag-scoped budgets cap runaway query rewriters before spend reaches the dollar ceiling. SOC 2 Type II, HIPAA, GDPR, and CCPA certified.

Failing runs land in the Error Feed. HDBSCAN soft-clustering groups failures over span embeddings; a Sonnet 4.5 Judge agent runs the investigation across span-tools, with a Haiku Chauffeur summarizing long spans at a roughly 90% prompt-cache hit rate. Per cluster, the Judge writes an immediate_fix naming the axis that broke and the change to ship. Typical clusters in this category: “rewriter dropped year qualifier on regulation queries,” “all citations from same domain on healthcare AI questions,” “freshness ignored on stock-price queries,” “cited URL resolves to a redirect domain.” Promoted clusters land in the next golden-set version.

The four-axis rubric is the input. A clustered failure with a named axis and an immediate_fix is the output. That is the loop.

Ready to evaluate your own search-augmented agent? Start with the ai-evaluation quickstart: wire the four-axis rubric against a 50-query stratified set and check the per-axis scores before you trust any aggregate. Then attach the same rubrics as EvalTag spans on live traffic via traceAI. The vector is the dashboard. The axis that drops is the bug.

Frequently asked questions

Why does generic RAG eval fail on search-augmented agents?
RAG eval was built for a frozen vector store with stable ground truth. Search-augmented agents have neither. The corpus is the live web. Sources change between the test run and the next one. The agent issues a query string that the search engine literally reads, so query quality is an upstream lever RAG eval has no metric for. Freshness is a first-class signal, not a footnote. Sources span government docs to LinkedIn hot takes in a single result set, so source diversity has to be measured. And citations matter because users click through, so a fabricated URL or one that returns 404 is a hard failure that Groundedness alone never catches. Score those four axes — query quality, source freshness, source diversity, citation validity — or you will ship an agent that confidently cites stale or non-existent URLs.
What are the four eval axes that matter for a search-augmented agent?
Query quality, source freshness, source diversity, citation validity. Query quality scores whether the agent reformulated the user's prompt into a search string that actually retrieves the right set. Source freshness scores whether the agent flagged time-sensitive questions and preferred recent sources. Source diversity scores whether the retrieved set spans independent domains rather than three pages from one site. Citation validity is the deterministic check that every URL in the answer resolves and the cited passage actually contains the claim. Generic RAG eval misses all four because none of them apply to a frozen corpus.
How do I make a search-agent eval reproducible when the web changes hourly?
Cache the retrieved snapshot per test case. The first run captures the full search results, the page snapshots, the response headers, and the timestamps. Every subsequent run replays from cache. Without this, your CI gate flakes because two runs against the same prompt return different URLs, and you cannot tell whether a regression is a real bug or just a different SERP. Re-snapshot deliberately on a schedule that fits your domain — weekly for fast-moving news agents, monthly for stable how-to agents — and treat each re-snapshot as a new golden-set version with its own baseline.
How is citation validity different from Groundedness?
Groundedness checks that a claim is supported by some passage in the provided context. It runs after retrieval and assumes the context is real. Citation validity is the upstream check that the citations the agent emits actually exist and actually contain the claim. An agent can be 0.94 grounded against the retrieved set and still fail citation validity because it cited URL number seven, but the agent never fetched URL number seven — it invented the reference. Run citation validity as a deterministic gate: HEAD-request every URL in the answer, fetch the body, and verify the cited passage appears. Anything that fails is a hard reject.
Which FAGI eval templates and custom rubrics cover the four axes?
Citation validity uses two built-ins together: ContainsValidLink as a deterministic URL-resolves check, plus ChunkAttribution to verify each claim maps to a real retrieved passage. Source freshness and source diversity use CustomLLMJudge with grading_criteria targeting recency and domain-spread. Query quality uses CustomLLMJudge with the user's original prompt and the agent's rewritten search string as the comparison. Groundedness, ContextAdherence, and FactualAccuracy still apply to the synthesis but they are not the failures unique to search. The four custom rubrics are the differentiator.
Where does FAGI fit in a search-agent eval stack?
FAGI ships the self-improving loop for search agents — generate the dataset, simulate the search-rewrite-synthesize trace, evaluate on the four axes plus the built-in templates, optimize the rewrite and synthesis prompts against the eval, and loop the failures back into the dataset. The pieces ship as Apache 2.0 OSS (ai-evaluation, traceAI, agent-opt) and as the Agent Command Center gateway for the runtime guardrails. Other tools give you the parts. Only FAGI loops them into one product so the same templates that ran in pytest run as span-attached scores on live traffic and feed back into the next agent version.