Evaluation

What Is Citation Presence?

Citation presence is an LLM-evaluation metric that checks whether a model or agent response includes a citation, link, footnote, or source marker when the workflow requires evidence. It shows up in RAG eval pipelines, AI-search traces, legal-support agents, and compliance review queues. FutureAGI maps it to eval:CitationPresence and pairs it with eval:ContainsValidLink, so teams can distinguish “no source given” from “source link is present but broken” before judging source quality.

Why Citation Presence Matters in Production LLM and Agent Systems

Missing citations create an audit failure before they create a correctness failure. A RAG support bot can answer from the right document but omit the source. A legal assistant can summarize policy language without a footnote. An AI-search experience can cite nothing, leaving the end-user unable to inspect the evidence path. The answer may be true, but the system has lost traceability.

That pain lands on several teams. Product sees lower trust and more “where did this come from?” feedback. Compliance cannot prove that regulated advice came from approved material. SRE sees healthy latency and token metrics, but no obvious signal explaining why users distrust an answer. ML engineers see the issue only when a regression run shows citation-missing rate spiking after a prompt or formatter change.

Citation presence is especially important in 2026-era multi-step pipelines. One agent step may retrieve documents, another may summarize them, and a final step may reformat the answer for a channel such as chat, voice transcript, or AI search. If citations disappear between steps, later reviewers cannot tell whether the final answer came from retrieved evidence, tool output, or model memory. Common symptoms include llm.output with no source markers, answers that cite “the docs” generically, broken markdown references, and user escalations attached to otherwise successful traces.

How FutureAGI Handles Citation Presence

FutureAGI’s approach is to treat citation presence as the first gate in an evidence chain, not as proof that an answer is correct. In the eval layer, eval:CitationPresence maps to the CitationPresence local metric, which checks whether citations are present. The companion anchor eval:ContainsValidLink maps to ContainsValidLink, which checks whether text contains a link that returns a 2xx status code. Those two checks answer different questions: “did the model cite anything?” and “does the cited URL resolve?”
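
A minimal sketch running both gates on the same answer, assuming ContainsValidLink exposes the same evaluate(output=...) interface as the minimal example later in this article:

from fi.evals import CitationPresence, ContainsValidLink

answer = "Rotate the key within 24 hours. See https://example.com/security [1]"

# "Did the model cite anything?" - the syntactic presence gate.
presence = CitationPresence().evaluate(output=answer)

# "Does the cited URL resolve?" - ContainsValidLink issues a request
# and expects a 2xx status, so it is slower and network-dependent.
link = ContainsValidLink().evaluate(output=answer)

print(presence.score, link.score)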

A real workflow starts with a RAG agent instrumented through traceAI-langchain. Retrieval spans carry retrieval.documents; the answer span carries llm.output; the evaluation run scores the final output with CitationPresence and, for URL-based citations, ContainsValidLink. If CitationPresence fails, the engineer checks whether the prompt removed source-format instructions or whether an intermediate summarizer stripped references. If ContainsValidLink fails, the next action is usually corpus cleanup, URL canonicalization, or a regression eval over known citation formats.
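
A hedged sketch of that triage flow; the trace dict below is illustrative (its keys mirror the field names above, not an exact traceAI-langchain schema), and both evaluators are assumed to follow the evaluate(output=...) pattern shown later:

from fi.evals import CitationPresence, ContainsValidLink

def triage(trace):
    # "llm.output" mirrors this article's field naming; adapt to your schema.
    answer = trace["llm.output"]
    if not CitationPresence().evaluate(output=answer).score:
        # Citations vanished between retrieval and final formatting.
        return "check prompt source-format instructions and intermediate summarizers"
    if "http" in answer and not ContainsValidLink().evaluate(output=answer).score:
        # A URL is cited but does not return a 2xx status.
        return "corpus cleanup, URL canonicalization, or citation-format regression eval"
    return "presence gates pass; continue to attribution and grounding checks"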

Unlike Ragas faithfulness, which asks whether claims are supported by retrieved context, citation presence is a syntactic gate. It should be paired with SourceAttribution, Groundedness, or FactualAccuracy before a team says the citation is useful. In our 2026 evals, the best pattern is to fail fast on missing citations, then run slower attribution and grounding checks only on answers that actually cite something.
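
A fail-fast gating sketch along those lines; the context parameter passed to SourceAttribution is an assumed name, so check the fi.evals docs for its actual signature:

from fi.evals import CitationPresence, SourceAttribution

presence = CitationPresence()
attribution = SourceAttribution()

def gated_score(answer, context):
    # Fail fast: the cheap syntactic gate runs on every answer.
    gate = presence.evaluate(output=answer)
    if not gate.score:
        return {"citation_present": False, "attribution": None}
    # Run the slower semantic check only on answers that cite something.
    # NOTE: "context" is an assumed parameter name, not a documented one.
    deep = attribution.evaluate(output=answer, context=context)
    return {"citation_present": True, "attribution": deep.score}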

How to Measure or Detect Citation Presence

Measure citation presence at the output, trace, and cohort levels:

  • fi.evals.CitationPresence — local metric that checks whether citations are present in the response text.
  • fi.evals.ContainsValidLink — local metric that checks whether text contains a link returning a 2xx status code.
  • fi.evals.SourceAttribution — evaluates citation quality in RAG responses after a source marker exists.
  • Trace fields — store the generated answer in llm.output and the retrieved evidence in retrieval.documents so failures can be debugged.
  • Dashboard signals — citation-missing rate, valid-link fail rate, citation-format error count, thumbs-down rate with “uncited answer” reason, and escalation rate for evidence-required workflows; a cohort-level rate sketch follows the minimal example below.

Minimal Python:

from fi.evals import CitationPresence

# Syntactic gate: passes when the output carries a citation marker,
# link, or footnote; it does not judge whether the source is correct.
evaluator = CitationPresence()
result = evaluator.evaluate(
    output="Rotate the API key within 24 hours after exposure. [1]"
)
print(result.score, result.reason)
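
To feed a dashboard signal like citation-missing rate, run the same evaluator over a cohort of outputs. A minimal sketch, assuming result.score is truthy when a citation is present (the batch structure here is illustrative, not a FutureAGI API):

from fi.evals import CitationPresence

evaluator = CitationPresence()

def citation_missing_rate(outputs):
    # Fraction of outputs with no detectable citation marker.
    results = [evaluator.evaluate(output=text) for text in outputs]
    missing = sum(1 for r in results if not r.score)
    return missing / len(results) if results else 0.0

batch = [
    "Rotate the API key within 24 hours after exposure. [1]",
    "Rotate the API key within 24 hours after exposure.",
]
print(citation_missing_rate(batch))  # 0.5: only the first output cites

Tracking this rate per dataset, model, and release is what surfaces the prompt-change regressions described under Common Mistakes below.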

Common Mistakes

  • Counting any bracket as a citation. [todo], [citation needed], and model-invented reference IDs should fail unless they map to stored evidence or a real source (see the pre-filter sketch after this list).
  • Stopping at link validity. ContainsValidLink proves a URL returned 2xx; it does not prove the page supports the answer.
  • Letting citations appear only on final answers. In multi-step agents, uncited intermediate summaries can poison later steps before final formatting adds references.
  • Scoring citation presence on tasks that never require evidence. Keep it cohort-scoped; creative chat and classification outputs should not be penalized automatically.
  • Ignoring missing citation spikes after prompt changes. A rewrite that improves tone can accidentally remove source-format instructions.
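
As noted in the first bullet, placeholder brackets should not count. A plain-Python pre-filter sketch, not a FutureAGI evaluator, that accepts a bracketed marker only when it maps to stored evidence:

import re

PLACEHOLDERS = {"todo", "citation needed", "ref", "source"}
MARKER = re.compile(r"\[([^\[\]]+)\]")

def has_real_citation(text, sources):
    # `sources` maps marker IDs (e.g. "1") to stored document URIs;
    # placeholders like [todo] or [citation needed] never count.
    for match in MARKER.finditer(text):
        marker = match.group(1).strip().lower()
        if marker in PLACEHOLDERS:
            continue
        if marker in sources:
            return True
    return False

print(has_real_citation("Rotate the key. [1]", {"1": "docs/security.md"}))    # True
print(has_real_citation("Rotate the key. [todo]", {"1": "docs/security.md"}))  # False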

Frequently Asked Questions

What is citation presence?

Citation presence checks whether a model or agent answer includes citations, links, footnotes, or source markers where evidence is required. FutureAGI maps it to the `CitationPresence` evaluator.

How is citation presence different from source attribution?

Citation presence only asks whether a source marker exists. Source attribution asks whether that source is the right one and whether it supports the cited claim.

How do you measure citation presence?

Use FutureAGI's `CitationPresence` evaluator on the generated output, then pair it with `ContainsValidLink` when citations include URLs. Track citation-missing rate by dataset, trace, model, and release.