What Is Evidence in AI Evaluation?

The source material — retrieved documents, tool outputs, or citations — that supports a model's claim and can be checked by a human or evaluator.

Evidence in AI evaluation is the source material that supports a model’s claim: retrieved chunks behind a RAG answer, tool output behind an agent action, a citation to policy text, or provenance for a domain claim. A reviewer can check an answer with evidence; without it, the answer rests on parametric memory. In a FutureAGI trace, evidence appears as context for Groundedness, cited sources for SourceAttribution, and tool-output spans inside an agent trajectory.
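
At evaluation time, that evidence travels with the response. The record below is an illustrative sketch of what a single evaluated RAG response might carry; the field names are hypothetical, but they map onto the inputs the grounding and citation evaluators consume (query, response, retrieved context, emitted citations, and any tool observations).

# Illustrative evidence record for one RAG response; field names are hypothetical.
evidence_record = {
    "query": "What is the refund window for annual plans?",
    "response": "Annual plans are refundable within 30 days [refund-policy.md].",
    "context": [  # retrieved chunks: the evidence the answer should rest on
        "Refund policy: annual plans may be refunded within 30 days of purchase.",
    ],
    "citations": ["refund-policy.md"],  # explicit references the model emitted
    "tool_outputs": [],                 # agent observations also count as evidence
}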

Why evidence matters in production LLM and agent systems

An LLM that produces fluent, plausible answers without evidence is the most expensive failure mode an AI product can ship. The user reads the answer, accepts it, and acts on it, and there is no audit path back to whether it was right. The pain hits multiple roles. A support team handles refund disputes from customers who acted on a confident-but-wrong policy answer. A medical-information team rewrites the prompt after a regulator asks for the source behind a specific claim and there is no source. A platform engineer sees the thumbs-down rate climb without any model deploy because the retriever started indexing a stale corpus.

Common production symptoms are subtle: citations that point to documents the retriever did not return, citation-free assertions in domains where the policy requires a source, agent steps that “remember” a parameter no tool ever produced, summaries that include facts not present in the source bullets.

In 2026-era agent stacks, evidence is not just retrieved chunks. It is also tool observations, prior-step outputs, and external API responses. A planner that fires a tool call based on an unsupported assumption corrupts the rest of the trajectory. Multi-step pipelines need step-level evidence evaluation: does this claim trace back to a retrievable source, a tool output, or a prior trustworthy step?
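
As a sketch of step-level evidence evaluation, the loop below walks an agent trajectory and scores each step's claim against the evidence visible at that point: retrieved chunks plus the tool outputs of earlier steps. The step structure and field names are illustrative assumptions; the Groundedness call follows the same interface as the minimal Python example later in this article.

from fi.evals import Groundedness  # 0-1 score: is the output supported by the given context?

ground = Groundedness()

def score_trajectory(steps):
    """steps: list of dicts with 'goal', 'claim', 'retrieved', 'tool_output' (illustrative shape)."""
    evidence, scores = [], []
    for step in steps:
        context = "\n".join(evidence + step.get("retrieved", []))
        result = ground.evaluate(
            input=step["goal"],    # what this step was trying to establish
            output=step["claim"],  # the claim the step asserts
            context=context,       # evidence visible at this step
        )
        scores.append(result.score)
        if step.get("tool_output"):  # a tool observation becomes evidence for later steps
            evidence.append(step["tool_output"])
    return scores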

How FutureAGI handles evidence

FutureAGI’s approach is to make the evidence layer measurable per span and per trace. Grounding runs through Groundedness, Faithfulness, and RAGFaithfulness, each returning a 0–1 score for whether the response is supported by the provided context — a direct test of whether the answer used evidence. Citation quality is scored by CitationPresence (are citations there at all?) and SourceAttribution (do they actually support the claim?). Multi-hop reasoning uses MultiHopReasoning to score whether intermediate inferences are evidence-backed. For trajectories, ReasoningQuality scores whether each step’s claims are warranted by the observations available at that step.

A practical pattern: a legal-research RAG team using the llamaindex traceAI integration instruments their chain, samples 5% of production traces, and runs Groundedness, CitationPresence, and SourceAttribution on each. They dashboard the “ungrounded-but-cited” rate (responses that cite a source but are not actually supported by it) as a leading indicator of citation hallucination. When that rate spikes after a model swap, the trace view shows that the new model paraphrases citations in ways that no longer align with the retrieved chunks. They roll back via the Agent Command Center's model fallback and add a post-guardrail that drops responses below a Groundedness threshold. Unlike Ragas faithfulness, which focuses mainly on answer-context consistency, evidence evaluation separates support, citation presence, and source attribution, so an answer that carries citations can still fail if the sources do not back it.
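
A sketch of that dashboard signal, assuming the sampled traces have already been scored with the evaluators above; the score field names and the 0.7 cutoff are illustrative, not recommended values.

# Ungrounded-but-cited rate over a sample of scored traces (illustrative fields and threshold).
GROUNDED_MIN = 0.7  # below this, treat the response as unsupported by its context

def ungrounded_but_cited_rate(scored_traces):
    """scored_traces: list of dicts with 'groundedness' and 'citation_presence' scores."""
    flagged = [
        t for t in scored_traces
        if t["citation_presence"] >= 0.5 and t["groundedness"] < GROUNDED_MIN
    ]
    return len(flagged) / max(len(scored_traces), 1)

Plotted over time, a step change in this rate is exactly the model-swap regression described above.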

How to measure or detect evidence

  • Groundedness: returns a 0–1 score for whether the response is supported by provided context; the canonical evidence test.
  • Faithfulness and RAGFaithfulness: alternative grounding evaluators tuned for RAG.
  • CitationPresence: checks whether the response includes citations at all.
  • SourceAttribution: scores citation quality — do the cited sources actually support the claim?
  • Ungrounded-but-cited rate (dashboard signal): citations present but evidence does not support the claim.
  • MultiHopReasoning: scores intermediate-inference evidence chains.

Minimal Python:

from fi.evals import Groundedness, CitationPresence, SourceAttribution

# q, r, ctx are placeholders: the user query, the model's response, and the retrieved context.
q = "What is the refund window for annual plans?"
r = "Annual plans are refundable within 30 days of purchase [refund-policy.md]."
ctx = "Refund policy: annual plans may be refunded within 30 days of purchase."

ground = Groundedness()
cite = CitationPresence()
source = SourceAttribution()
print(ground.evaluate(input=q, output=r, context=ctx).score)  # is the answer supported by ctx?
print(cite.evaluate(output=r).score)                          # citations present? (call shape assumed, mirroring the others)
print(source.evaluate(output=r, context=ctx).score)           # do the citations support the claim?

Common mistakes

  • Trusting the citation as evidence. A citation that paraphrases something the source does not say is decorative; verify the support.
  • Evaluating only on generation, not retrieval. Evidence quality starts upstream — bad chunking or stale indices break grounding silently.
  • Skipping the no-evidence cohort. Prompts where retrieval returns nothing should hit a refusal, not a confident answer; alert if they don’t.
  • Treating tool outputs as untrusted text. They are evidence for downstream steps; score them as you would retrieved chunks.
  • No threshold for ungrounded responses. Without a guardrail, a model that scores 0.3 on Groundedness still ships its answer to the user; a minimal gate is sketched below.
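
A minimal gate, reusing the Groundedness interface from the example above; the 0.6 floor and the fallback message are illustrative choices to tune against labeled traces, not recommended defaults.

from fi.evals import Groundedness

ground = Groundedness()
GROUNDEDNESS_FLOOR = 0.6  # illustrative threshold; tune against labeled traces

def guarded_response(query, response, context):
    """Block responses whose claims are not supported by the retrieved context."""
    score = ground.evaluate(input=query, output=response, context=context).score
    if score < GROUNDEDNESS_FLOOR:
        return "I couldn't verify that against the available sources.", score
    return response, score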

Frequently Asked Questions

What is evidence in AI evaluation?

Evidence is the source material — retrieved documents, tool outputs, citations, or training-data provenance — that supports a model's claim and lets a human or evaluator check it.

How is evidence different from a citation?

A citation is the explicit reference the model emits. Evidence is the underlying source material; a citation without retrievable evidence is decorative, and evidence without a citation is unverifiable to the user.

How does FutureAGI evaluate evidence?

FutureAGI uses Groundedness for whether the response is supported by retrieved context, CitationPresence for whether citations are present, and SourceAttribution for citation quality.