Evaluating LLM Citation & Attribution: The Three-Rubric Methodology for 2026
Citation eval is three rubrics, not one: did the model emit a citation, does it resolve, and does the source actually contain the claim. The methodology, with code.
A legal RAG bot returns a paragraph and drops in Smith v. Johnson, 412 F. Supp. 3d 891 (S.D.N.Y. 2019). The case does not exist. Groundedness scores 0.94 because the surrounding paragraph is consistent with the rest of the retrieval log. The structured-output check passes because the citation parses cleanly. The bot ships. Six weeks later the firm files a brief, opposing counsel runs the citation, and a sanctions motion follows. Citation eval is the rubric that was supposed to catch this. Most production stacks ship a rubric that cannot.
The reason is that teams treat citation evaluation as one number. It is three rubrics, not one. Structural: did the model emit a citation token in the right schema. Resolvability: does the citation point at a real, fetchable source. Semantic: does the cited source actually contain the claim attached to it. The fake-case failure above lives at the resolvability layer. The right-document-wrong-page failure lives at the semantic layer. Structural alone, which is what most teams ship, catches neither.
This guide is the three-rubric methodology: how each layer scores, the deterministic and judge code that runs it, per-domain calibration for legal, medical, research, and news, the traceAI wiring that joins retrieval spans to citation spans, and where Future AGI’s eval stack carries the rubrics from CI to live production traffic.
TL;DR: the three rubrics
| Rubric | Question | Implementation | Cost | Runs where |
|---|---|---|---|---|
| Structural | Did the model emit a citation token in the schema? | Pydantic, function-calling schema, regex for surface format | Free, sub-millisecond | Inline on every call |
| Resolvability | Does the citation point at a real, fetchable source? | Registry lookup (Westlaw, PubMed, DOI), URL resolve, retrieval-log join | Cheap, sub-100ms | Inline as guardrail |
| Semantic | Does the cited source actually contain the claim? | Atomic-claim decomposition + calibrated judge entailment | Expensive, judge call per claim | Sample 10-20% of traffic |
Three rules underwrite the methodology: every claim carries a source_id and a quoted span, every source_id joins against the retrieval log (not the model’s claim about what it cited), and the semantic rubric runs per claim, not per answer. Skip any of the three and the rubric does not catch the failure that gets the team sued or retracted.
Why one citation score hides the failure that hurts
A common production breakdown looks like this. Structural sits at 0.98 because the model emits the schema cleanly. Resolvability sits at 0.84 because roughly one in six citations does not resolve or does not appear in the retrieval log. Semantic sits at 0.63 because the model attaches real documents to claims those documents do not contain. Collapsed into one unweighted mean, the team sees 0.82 and ships. Split into three, the team sees resolvability well under its floor and semantic lower still, and ships nothing until both gaps close.
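The arithmetic behind that breakdown can be sketched in a few lines. The three scores come from the paragraph above; the per-rubric floors are hypothetical values chosen only to illustrate the gating idea, not recommended settings.

```python
from statistics import mean

# Per-rubric scores from the breakdown described above.
rubric_scores = {"structural": 0.98, "resolvability": 0.84, "semantic": 0.63}

# Hypothetical floors, chosen only for illustration.
rubric_floors = {"structural": 0.95, "resolvability": 0.99, "semantic": 0.80}

# One collapsed number: the unweighted mean of the three rubrics.
collapsed = round(mean(rubric_scores.values()), 2)

# Three separate numbers: every rubric that sits below its own floor.
failing = sorted(
    name for name, score in rubric_scores.items() if score < rubric_floors[name]
)

print(collapsed)  # 0.82
print(failing)    # ['resolvability', 'semantic']
```

The mean looks shippable; the per-rubric view names the layers that are not.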
Three failure modes recur in regulated production traces:
- The fake-but-plausible citation. The model emits Smith v. Johnson, 412 F. Supp. 3d 891 (S.D.N.Y. 2019). The string is well-formed Bluebook. Structural passes. The case does not exist. Resolvability catches it the moment the registry lookup runs.
- The right document, wrong page or wrong claim. The model cites a real CMS guideline for a recommendation it does not contain. The recommendation is in an adjacent guideline. Resolvability passes (URL live, document in the retrieval log). Semantic catches it.
- The doc the agent never retrieved. The model’s parametric prior produces a citation to a real-sounding source the retriever never returned. Resolvability fails on the retrieval-log join, which is the cheapest catch of the three and the one most stacks skip.
The first and third failures die at rubric two. The second dies only at rubric three. None dies at rubric one alone, which is why “we ship structured citations” is insufficient.
Rubric 1: structural — did the model emit a citation token
The structural rubric is the cheapest check in the stack. It runs inline on every call and catches the schema-drift failures that pollute every downstream layer.
Force the model into a citation schema and validate against it deterministically. Pydantic plus structured output mode on the LLM call is the working pattern. EvaluateFunctionCalling does the validation in the SDK; the schema check is free and sub-millisecond.
```python
from typing import Literal

from pydantic import BaseModel


class Citation(BaseModel):
    source_id: str  # join key against the retrieval log
    quoted_span: str  # verbatim text from the cited source
    page: int | None = None
    url: str | None = None


class Claim(BaseModel):
    text: str
    citations: list[Citation]


class AnswerWithCitations(BaseModel):
    claims: list[Claim]
    refusal: Literal["none", "ambiguous", "out_of_scope"] = "none"
```
Two surface-format checks sit alongside the schema check. A RegexScanner enforces the legal-Bluebook, medical-Vancouver, or news inline-link convention. An InvisibleCharScanner blocks zero-width-character injection in citation strings, which is rare but cheap to defend against. Both run sub-10ms via the SDK Scanners.
```python
from fi.evals.templates import EvaluateFunctionCalling

schema_check = EvaluateFunctionCalling(schema=AnswerWithCitations.model_json_schema())
```
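The two surface-format checks described above can be approximated with the standard library alone. This is a minimal sketch and not the SDK's RegexScanner or InvisibleCharScanner: the case-citation pattern is a deliberately simplified assumption that covers only one Bluebook form, and the function name is hypothetical.

```python
import re

# Simplified pattern for one federal-reporter style case citation, e.g.
# "Smith v. Johnson, 412 F. Supp. 3d 891 (S.D.N.Y. 2019)". Illustrative only:
# real Bluebook coverage needs many more reporter and court forms.
CASE_CITE = re.compile(
    r"^[A-Z][\w.'& -]* v\. [A-Z][\w.'& -]*, "  # party names
    r"\d+ [A-Z][\w. ]*? \d+ "                   # volume, reporter, first page
    r"\([A-Z][\w. ]* \d{4}\)$"                  # court and year
)

# Zero-width and byte-order-mark characters that can hide inside a string.
INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


def surface_format_ok(citation: str) -> bool:
    """Return True when the string has no hidden characters and matches the pattern."""
    if INVISIBLE.search(citation):
        return False
    return CASE_CITE.match(citation) is not None
```

The fabricated Smith v. Johnson string from the introduction passes this check, which is the point made above: surface format says nothing about whether the case exists.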
What structural buys you: the structured Pydantic output makes rubric two and rubric three implementable in the first place. Without source_id and quoted_span as explicit fields, the resolvability lookup has nothing to join on and the semantic judge has nothing to score against. The structural rubric is the data-shape contract the next two rubrics depend on. It is necessary, never sufficient.
Rubric 2: resolvability — does the citation point at a real source
A citation that parses cleanly and resolves to nothing is the legal-AI sanctions story. Resolvability is the rubric that catches it. Two deterministic checks run on every citation:
The retrieval-log join. Every source_id in the citation list must appear in the upstream retriever.documents log on the same trace. This is the cheapest catch and the one most teams skip because it requires joining the citation span to the retriever span, which means tree-structured tracing. The check is a set-membership lookup; cost is microseconds.
The registry lookup. External citations resolve against the registry of record. URL gets a HEAD request with a 200 response and matching canonical-URL metadata. DOI gets a doi.org resolve. Legal cases join against Westlaw or CourtListener; medical studies join against PubMed. The check is a single HTTP round-trip per citation, parallelized; the ai-evaluation SDK’s MaliciousURLScanner plus a per-domain registry adapter cover the common cases.
```python
from fi.evals.scanners import MaliciousURLScanner


def resolvability_score(answer: AnswerWithCitations, retrieval_log: set[str]) -> float:
    # registry_resolves is the per-domain registry adapter, defined elsewhere.
    total = 0
    resolved = 0
    for claim in answer.claims:
        for cite in claim.citations:
            total += 1
            if cite.source_id not in retrieval_log:
                continue  # citation from outside the retrieval pool
            if cite.url and not registry_resolves(cite.url):
                continue  # URL is dead or does not match the cited metadata
            resolved += 1
    return resolved / max(total, 1)
```
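The scoring function above calls registry_resolves without defining it. One possible shape for that helper, assuming a plain URL check with the standard library, is sketched below. It only verifies that a HEAD request ends in a 200 response; the canonical-URL metadata match and the Westlaw, CourtListener, or PubMed joins are out of scope here, and http_head_status is a hypothetical helper name.

```python
import urllib.error
import urllib.request
from typing import Callable
from urllib.parse import urlsplit


def http_head_status(url: str, timeout: float = 5.0) -> int:
    """Return the final HTTP status of a HEAD request, or 0 on network failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        # urlopen follows redirects, so a doi.org link reports the landing page status.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (OSError, ValueError):
        return 0


def registry_resolves(url: str, head: Callable[[str], int] = http_head_status) -> bool:
    """Treat a citation URL as resolved only when it is http(s) and answers 200."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    return head(url) == 200
```

Passing the HEAD function as a parameter keeps the check testable without network access and leaves room to swap in a cached or rate-limited client.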
Two operational rules earn their keep. Run resolvability inline at the gateway, not offline; there is no clean way to retract a hallucinated citation once it has been quoted in a brief. And alarm on a resolvability rate below the per-domain floor (typically 0.99 in legal and medical, 0.97 in news). Below the floor, the response gets held, rewritten with a stricter prompt, or refused, not handed to the user.
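The floor rule can be sketched as a small gate. The floors are the ones quoted above; the Action enum and the fallback to the strictest floor for unlisted domains are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    DELIVER = "deliver"
    HOLD = "hold"  # held for rewrite or refusal, never shown to the user as-is


# Floors quoted in the text. No floor is given for other domains.
RESOLVABILITY_FLOOR = {"legal": 0.99, "medical": 0.99, "news": 0.97}


def gate(domain: str, resolvability: float) -> Action:
    """Deliver only when the resolvability score meets the domain floor."""
    # Assumption: domains without a stated floor get the strictest one.
    floor = RESOLVABILITY_FLOOR.get(domain, max(RESOLVABILITY_FLOOR.values()))
    return Action.DELIVER if resolvability >= floor else Action.HOLD
```

For example, a score of 0.98 is delivered under the news floor and held under the legal floor.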
Resolvability is load-bearing because it catches the most newsworthy failure modes for cents per call, instead of dollars per claim like the semantic rubric.
Rubric 3: semantic — does the cited source actually contain the claim
A citation that resolves and appears in the retrieval log points at a real, retrieved document. That does not mean the document contains the claim attached to it. The semantic rubric is the per-claim entailment check that catches the right-doc-wrong-claim failure.
Atomic-claim decomposition is the unit of evaluation. Split the answer into one-fact-per-claim units, then score each (claim, cited_passage) pair against a calibrated judge. Aggregate per response only after every claim has a score. The aggregate that matters is the percentage of claims whose cited passage entails them at a calibrated threshold, not a mean.
```python
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

semantic_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "claim_entailment",
        "model": "claude-sonnet-4-5-20250929",
        "grading_criteria": """Score 1.0 if the cited passage entails the claim
(the claim is supported by the passage on a plain reading). Score 0.5 if
the passage is topically relevant but does not entail the specific claim.
Score 0.0 if the claim is not supported by the passage.""",
    },
)
```
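The difference between a mean and a pass rate is easy to see once the per-claim judge scores exist. The sketch below assumes the 1.0 / 0.5 / 0.0 scale from the grading criteria above; the function name and the example scores are hypothetical.

```python
from statistics import mean


def entailment_rate(claim_scores: list[float], threshold: float = 1.0) -> float:
    """Fraction of claims whose judge score meets the entailment threshold."""
    if not claim_scores:
        return 0.0
    passed = sum(1 for score in claim_scores if score >= threshold)
    return passed / len(claim_scores)


# Five claims: three entailed, one only topically relevant, one unsupported.
judge_scores = [1.0, 1.0, 1.0, 0.5, 0.0]

print(mean(judge_scores))             # 0.7
print(entailment_rate(judge_scores))  # 0.6
```

The mean gives partial credit to the topically relevant citation; the pass rate does not, which matches what a reader sees when they click through.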
Three calibration habits separate a working semantic rubric from theatre:
- Calibrate the judge against a human-labelled hold-out. Pin a 50-sample set with senior-domain-reviewer labels (lawyer, clinician, editor). Alarm when judge-vs-human disagreement exceeds the inter-rater baseline. Judges drift; calibration is the only way to know when.
- Score per claim, aggregate per response, report both. A 0.94 grounded response with a 0.61 alignment rate is the most common failure mode in 2026 production traffic. Users do not read averages; they click citation seven, see it does not support the sentence, and the trust goes.
- Run the judge on a sample, not on every claim. Tail-aware sampling at the OTel collector keeps 100% of runs with any structural or resolvability failure, top-percentile latency or cost, and any experiment cohort. Sample 10-20% of the remaining clean traffic for the semantic judge. See our LLM-as-judge best practices guide and deterministic vs LLM-judge evals comparison for the calibration playbook.
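The first habit, comparing the judge against the human-labelled hold-out, can be sketched as a simple disagreement check. It assumes judge and human labels use the same discrete 1.0 / 0.5 / 0.0 levels; the function names and the baseline values in the usage note are hypothetical.

```python
def disagreement_rate(judge: list[float], human: list[float]) -> float:
    """Share of hold-out samples where the judge label differs from the human label."""
    if len(judge) != len(human):
        raise ValueError("judge and human label lists must be the same length")
    if not judge:
        raise ValueError("hold-out set is empty")
    return sum(j != h for j, h in zip(judge, human)) / len(judge)


def judge_has_drifted(
    judge: list[float], human: list[float], inter_rater_baseline: float
) -> bool:
    """Alarm when judge-vs-human disagreement exceeds human-vs-human disagreement."""
    return disagreement_rate(judge, human) > inter_rater_baseline
```

With judge labels [1.0, 1.0, 0.5, 0.0] and human labels [1.0, 0.5, 0.5, 0.0], the disagreement rate is 0.25, which trips the alarm against a 0.10 baseline and stays quiet against a 0.30 one.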
ChunkAttribution and ChunkUtilization cover the attribution surface; CustomLLMJudge carries the per-domain entailment rubric. Platform classifier-backed evals run the same rubric at lower per-eval cost than Galileo Luna-2, which makes claim-level scoring economically tractable at sample-and-judge volume.
Per-domain calibration: same rubrics, different registries
The three rubrics do not change across domains. The schemas, registries, and thresholds do. Calibrating them per domain is what makes the rubric audit-grade rather than directional.
| Domain | Structural format | Resolvability registry | Semantic calibration |
|---|---|---|---|
| Legal | Bluebook, structured {case, reporter, page, year} | Westlaw, CourtListener, PACER | Holding vs dicta; fact-vs-law per claim |
| Medical | Vancouver, AMA, structured DOI list | PubMed, DOI.org, FDA orange book, NIH guidelines | Guideline-version aware; current vs withdrawn |
| Research | APA, inline link plus DOI | arXiv, DOI.org, publisher canonical | Per-claim entailment; primary vs secondary source |
| News / journalism | Inline link plus byline | URL resolve plus publisher metadata join | Direct quote vs paraphrase; source-tier weighting |
Three calibration moves recur across all four:
- The resolvability floor varies by stakes. Legal and medical sit at 0.99. News sits at 0.97 because legitimate dead-link rates exist. Below the floor, the response gets held or refused.
- The semantic threshold varies by claim type. A legal holding needs full entailment; a procedural sentence tolerates topical relevance. Run a per-claim-type entailment threshold rather than a single floor.
- The refusal path is calibrated per domain. The legal “I do not have a standard for this clause” abstention is non-negotiable. The news “I could not verify this source within the deadline window” abstention is best practice. Wire AnswerRefusal as its own rubric and alarm on drift in refusal rate alongside the three citation rubrics.

Our legal RAG evaluation guide, healthcare RAG evaluation guide, and contract review RAG playbook carry the domain-specific calibration tables.
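The per-claim-type threshold from the second calibration move can be sketched as a lookup over the judge's 1.0 / 0.5 / 0.0 scale. Only the two claim types named above are included; the dictionary, the function name, and the strict fallback for unknown claim types are assumptions.

```python
# From the text: a holding needs full entailment, a procedural sentence
# tolerates topical relevance. Other claim types are not specified.
ENTAILMENT_THRESHOLD = {"holding": 1.0, "procedural": 0.5}


def claim_passes(claim_type: str, judge_score: float) -> bool:
    """Apply the entailment threshold that matches the claim type."""
    # Assumption: unknown claim types fall back to the strictest threshold.
    strictest = max(ENTAILMENT_THRESHOLD.values())
    threshold = ENTAILMENT_THRESHOLD.get(claim_type, strictest)
    return judge_score >= threshold
```

A judge score of 0.5 therefore passes for a procedural sentence and fails for a holding.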
How traceAI carries the three rubrics from CI to production
The rubrics are only as good as the data they score against. traceAI (Apache 2.0) makes the retrieval and citation signal first-class so every rubric joins against the actual trace, not the model’s claim about what it cited.
The shape that earns its keep:
```
agent.run
  retrieval (fi.span.kind = RETRIEVER)
    retrieval.documents = [source_id_1, source_id_2, ...]
  llm.completion (fi.span.kind = LLM)
    llm.output.citations = [Citation(source_id, quoted_span, ...)]
  eval.structural
  eval.resolvability
  eval.semantic
```
The retrieval span carries retrieval.documents as the source-of-truth pool. The LLM span carries llm.output.citations as the structured citation list. The three eval spans attach per-rubric scores. The resolvability join is set(citations.source_id) <= set(retrieval.documents), computed on the trace tree, not on the answer text.
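The subset check can be sketched over a trace represented as plain dictionaries. The attribute names follow the tree above; the flat {"spans": [...]} layout and the "kind" key are hypothetical stand-ins for whatever the trace export actually provides.

```python
def unretrieved_citations(trace: dict) -> set[str]:
    """Return cited source_ids that never appeared in any retriever span."""
    retrieved: set[str] = set()
    cited: set[str] = set()
    for span in trace["spans"]:
        attrs = span["attributes"]
        if span["kind"] == "RETRIEVER":
            retrieved.update(attrs["retrieval.documents"])
        elif span["kind"] == "LLM":
            cited.update(c["source_id"] for c in attrs["llm.output.citations"])
    # An empty result means set(cited) <= set(retrieved): the join passes.
    return cited - retrieved


example_trace = {
    "spans": [
        {"kind": "RETRIEVER",
         "attributes": {"retrieval.documents": ["doc-1", "doc-2"]}},
        {"kind": "LLM",
         "attributes": {"llm.output.citations": [
             {"source_id": "doc-1", "quoted_span": "..."},
             {"source_id": "doc-9", "quoted_span": "..."},
         ]}},
    ]
}

print(unretrieved_citations(example_trace))  # {'doc-9'}
```

Returning the offending ids rather than a boolean makes the failure actionable: doc-9 is a citation the retriever never returned.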
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor
trace_provider = register(
project_type=ProjectType.OBSERVE,
project_name="legal_rag_citations",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
traceAI ships 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C#. Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) at register() time let the same trace ingest into Phoenix or Traceloop without re-instrumenting. The 14 span kinds include a first-class RETRIEVER, which is what makes the per-retrieval join practical. EvalTag wires the three rubrics as span-attached scorers at server side; the same template that runs in pytest CI runs on the live span at zero added inference latency.
The diagnostic value is the trace tree, not the score. A 0.61 semantic score on a clean retrieval is a generator problem; the same score on a degraded retrieval log is a retriever problem. The trace tree tells you which.
Where Future AGI ships the citation-eval stack
Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals. Graduate to the Platform when the loop needs self-improving rubrics and lower per-eval cost at sample-and-judge volume.
- ai-evaluation SDK (Apache 2.0). 50+ EvalTemplate classes including ChunkAttribution, ChunkUtilization, Groundedness, FactualAccuracy, AnswerRefusal, and EvaluateFunctionCalling for the structural schema check. CustomLLMJudge with Jinja2 grading carries the per-domain semantic rubrics. 20+ local heuristic metrics and 8 sub-10ms Scanners (MaliciousURLScanner, RegexScanner, InvisibleCharScanner, plus 5 others) cover the deterministic resolvability and format-drift checks.
- traceAI (Apache 2.0). Carries the same rubric as a span-attached score on live traffic. 50+ AI surfaces across four languages. EvalTag server-side wiring means zero extra inference hops for the production rubric.
- Future AGI Platform. Self-improving evaluators retune from senior-reviewer thumbs feedback. An in-product authoring agent writes the per-domain semantic rubric from a natural-language description (the realistic path for legal-holding, medical-guideline-version, and journalism-source-tier rubrics). Classifier-backed evals at lower per-eval cost than Galileo Luna-2 make claim-level scoring on every sampled trace affordable.
- Error Feed (inside the eval stack). HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups failing traces into named clusters. Typical citation clusters: “model invents plausible-but-fake case”, “right doc, wrong page”, “over-citation pads procedural sentences”, “freshness lapses on amended statute”. A Claude Sonnet 4.5 Judge (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars, ~90% prompt-cache hit) writes the RCA, an immediate_fix, and a 4-dimensional trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each). The cluster promotes into the offline regression set after reviewer sign-off. Linear is wired today via OAuth; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
- Agent Command Center. 100+ providers as a single Go binary (Apache 2.0). 18+ built-in guardrail scanners and 15 third-party adapters run at the same network hop. ~29k req/s with P99 21 ms with guardrails on, on a t3.xlarge. Per-call cost telemetry via x-prism-cost headers quantifies what structured-output and citation-strict routing cost on a per-route basis.
SOC 2 Type II, HIPAA, GDPR, and CCPA certified; ISO/IEC 27001 in active audit.
Three deliberate tradeoffs
- Structured-output citations cost output tokens. Roughly 5-15% premium on output tokens for the Pydantic schema. The Agent Command Center’s x-prism-cost headers quantify this exactly so the tradeoff is decided on a per-route basis rather than category-wide. The cost of getting citation wrong in legal or medical is two orders of magnitude higher.
- The semantic rubric is the expensive layer. A claim-level judge call per response gets pricey at scale. Tail-aware sampling at the collector plus a calibrated lighter judge (or a distilled classifier) on the bulk traffic keeps the bill bounded. Frontier judges run on the 5-10% tail that triggered any other rubric or guardrail.
- The refusal head is the operational lever. A bot that refuses on ambiguous claims is safer and slower. The calibration is per claim type, not per response. Get the senior reviewer to label the refusal set; engineer-only labels do not survive contact with production.
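The tail-aware sampling mentioned in the second tradeoff can be sketched as a per-trace keep decision. This is an illustration of the policy only, not an OTel collector configuration; the function name, parameters, and the hash-bucket scheme are assumptions.

```python
import hashlib


def keep_for_semantic_judge(
    trace_id: str,
    structural_ok: bool,
    resolvability_ok: bool,
    in_experiment: bool = False,
    is_latency_or_cost_tail: bool = False,
    sample_rate: float = 0.15,
) -> bool:
    """Keep every flagged trace; sample the remaining clean traffic."""
    if not structural_ok or not resolvability_ok:
        return True
    if in_experiment or is_latency_or_cost_tail:
        return True
    # Deterministic bucket in [0, 1) derived from the trace id, so the same
    # trace always gets the same decision across collectors and replays.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate
```

Hashing the trace id instead of calling a random number generator keeps the decision reproducible, which helps when a sampled trace needs to be re-scored later.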
Anti-patterns to avoid
- Scoring only the final answer. Pass the answer, ignore the cite. The audit artifact is wrong even when the answer is right.
- One score for citation eval. Three rubrics collapsed into a number hides which failure mode is on fire.
- Trusting the model’s own citation list as the source-of-truth pool. The citation list is the thing under evaluation. Join against the retrieval log, not against itself.
- Paragraph-level semantic rubric. A two-sentence response with one fake cite averages to a passing score. Move the unit of evaluation to the claim.
- No registry lookup on external citations. Live-URL-wrong-content is the failure that produced sanctions filings in 2024 and 2025. The lookup is one of the cheapest catches in the stack.
- One threshold across domains. The legal floor is not the news floor. Calibrate per domain or under-trigger on the regulated workload that needed the trigger most.
Ready to evaluate your first citation-bearing system? Wire ChunkAttribution, Groundedness, EvaluateFunctionCalling, and AnswerRefusal into a pytest fixture this afternoon against the ai-evaluation SDK. Add a CustomLLMJudge for the per-domain entailment rubric. Then attach the same rubrics as EvalTag spans on live traffic via traceAI so the rubric that gated CI runs on production traces too. Running the three rubrics in both places is what closes the citation-trust gap that regulated systems structurally leak.
Related reading
- The 2026 LLM Evaluation Playbook
- LLM-as-Judge Best Practices (2026)
- Deterministic vs LLM-Judge Evals (2026)
- Best Legal AI Evaluation Platforms (2026)
- Best Healthcare RAG Evaluation (2026)
- Best Fintech AI Evaluation Platforms (2026)
- How to Build (and Evaluate) a Contract Review RAG Agent
- AI Research Assistant Monitoring (2026)
- Evaluating LLM Structured Output Modes (2026)
- Automated Optimization for Agents (2026)