
Evaluating LLM Citation & Attribution: The Three-Rubric Methodology for 2026

Citation eval is three rubrics, not one: did the model emit a citation, does it resolve, and does the source actually contain the claim. The methodology, with code.


A legal RAG bot returns a paragraph and drops in Smith v. Johnson, 412 F. Supp. 3d 891 (S.D.N.Y. 2019). The case does not exist. Groundedness scores 0.94 because the surrounding paragraph is consistent with the rest of the retrieval log. The structured-output check passes because the citation parses cleanly. The bot ships. Six weeks later the firm files a brief, opposing counsel runs the citation, and a sanctions motion follows. Citation eval is the rubric that was supposed to catch this. Most production stacks ship a rubric that cannot.

The reason is that teams treat citation evaluation as one number. It is three rubrics, not one. Structural: did the model emit a citation token in the right schema. Resolvability: does the citation point at a real, fetchable source. Semantic: does the cited source actually contain the claim attached to it. The fake-case failure above lives at the resolvability layer. The right-document-wrong-page failure lives at the semantic layer. Structural alone, which is what most teams ship, catches neither.

This guide is the three-rubric methodology: how each layer scores, the deterministic and judge code that runs it, per-domain calibration for legal, medical, research, and news, the traceAI wiring that joins retrieval spans to citation spans, and where Future AGI’s eval stack carries the rubrics from CI to live production traffic.

TL;DR: the three rubrics

| Rubric | Question | Implementation | Cost | Runs where |
|---|---|---|---|---|
| Structural | Did the model emit a citation token in the schema? | Pydantic, function-calling schema, regex for surface format | Free, sub-millisecond | Inline on every call |
| Resolvability | Does the citation point at a real, fetchable source? | Registry lookup (Westlaw, PubMed, DOI), URL resolve, retrieval-log join | Cheap, sub-100ms | Inline as guardrail |
| Semantic | Does the cited source actually contain the claim? | Atomic-claim decomposition + calibrated judge entailment | Expensive, judge call per claim | Sample 10-20% of traffic |

Three rules underwrite the methodology: every claim carries a source_id and a quoted span, every source_id joins against the retrieval log (not the model’s claim about what it cited), and the semantic rubric runs per claim, not per answer. Skip any of the three and the rubric does not catch the failure that gets the team sued or retracted.

Why one citation score hides the failure that hurts

A common production breakdown looks like this. Structural sits at 0.98 because the model emits the schema cleanly. Resolvability sits at 0.84 because roughly one in six citations does not resolve or does not appear in the retrieval log. Semantic sits at 0.63 because the model attaches real documents to claims those documents do not contain. Collapsed into one number (the mean), the team sees 0.82 and ships. Split into three, the team sees two rubrics well below any reasonable floor and ships nothing until the resolvability and semantic gaps close.
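The arithmetic behind that breakdown is easy to check. The sketch below uses the three scores from the paragraph above; the per-rubric floors are illustrative assumptions, not values taken from any SDK.

```python
# Scores from the breakdown above.
scores = {"structural": 0.98, "resolvability": 0.84, "semantic": 0.63}

# Hypothetical per-rubric floors, chosen only for illustration.
floors = {"structural": 0.95, "resolvability": 0.99, "semantic": 0.90}

# Collapsed view: a single mean hides which rubric is failing.
collapsed = round(sum(scores.values()) / len(scores), 2)

# Split view: each rubric is gated against its own floor.
failing = sorted(name for name, score in scores.items() if score < floors[name])

print(collapsed)  # 0.82
print(failing)    # ['resolvability', 'semantic']
```

The mean looks shippable; the split view names the rubrics that need work.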

Three failure modes recur in regulated production traces:

  • The fake-but-plausible citation. The model emits Smith v. Johnson, 412 F. Supp. 3d 891 (S.D.N.Y. 2019). The string is well-formed Bluebook. Structural passes. The case does not exist. Resolvability catches it the moment the registry lookup runs.
  • The right document, wrong page or wrong claim. The model cites a real CMS guideline for a recommendation it does not contain. The recommendation is in an adjacent guideline. Resolvability passes (URL live, document in the retrieval log). Semantic catches it.
  • The doc the agent never retrieved. The model’s parametric prior produces a citation to a real-sounding source the retriever never returned. Resolvability fails on the retrieval-log join, which is the cheapest catch of the three and the one most stacks skip.

The first and third failures die at rubric two. The second dies only at rubric three. None dies at rubric one, which is what makes “we ship structured citations” insufficient.

Rubric 1: structural — did the model emit a citation token

The structural rubric is the cheapest check in the stack. It runs inline on every call and catches the schema-drift failures that pollute every downstream layer.

Force the model into a citation schema and validate against it deterministically. Pydantic plus structured output mode on the LLM call is the working pattern. EvaluateFunctionCalling does the validation in the SDK; the schema check is free and sub-millisecond.

from pydantic import BaseModel
from typing import Literal

class Citation(BaseModel):
    source_id: str          # join key against the retrieval log
    quoted_span: str        # verbatim text from the cited source
    page: int | None = None
    url: str | None = None

class Claim(BaseModel):
    text: str
    citations: list[Citation]

class AnswerWithCitations(BaseModel):
    claims: list[Claim]
    refusal: Literal["none", "ambiguous", "out_of_scope"] = "none"

Two surface-format checks sit alongside the schema check. A RegexScanner enforces the legal-Bluebook, medical-Vancouver, or news inline-link convention. An InvisibleCharScanner blocks zero-width-character injection in citation strings, which is rare but cheap to defend against. Both run sub-10ms via the SDK Scanners.

from fi.evals.templates import EvaluateFunctionCalling

schema_check = EvaluateFunctionCalling(schema=AnswerWithCitations.model_json_schema())
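For readers without the SDK at hand, the two surface-format checks can be approximated with the standard library. This is a minimal sketch, not the SDK's RegexScanner or InvisibleCharScanner: the pattern covers only one simplified case-citation shape, and the names `CASE_CITATION`, `has_invisible_chars`, and `surface_format_ok` are hypothetical.

```python
import re

# Simplified pattern for a citation shaped like
# "Smith v. Johnson, 412 F. Supp. 3d 891 (S.D.N.Y. 2019)".
# Real Bluebook validation covers far more forms; this is illustrative only.
CASE_CITATION = re.compile(
    r"(?P<plaintiff>.+?) v\. (?P<defendant>.+?), "
    r"(?P<volume>\d+) (?P<reporter>[A-Za-z0-9. ]+?) (?P<page>\d+) "
    r"\((?P<court>[^)]+) (?P<year>\d{4})\)"
)

# Zero-width and BOM code points that can hide content inside a string.
INVISIBLE_CHARS = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def has_invisible_chars(text: str) -> bool:
    """True when the text contains any of the listed invisible code points."""
    return any(ch in INVISIBLE_CHARS for ch in text)


def surface_format_ok(citation: str) -> bool:
    """True when the string has no invisible characters and matches the pattern."""
    if has_invisible_chars(citation):
        return False
    return CASE_CITATION.fullmatch(citation) is not None
```

The fabricated Smith v. Johnson citation passes this check, which is the point of the section: a well-formed string says nothing about whether the case exists.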

What structural buys you: the structured Pydantic output makes rubric two and rubric three implementable in the first place. Without source_id and quoted_span as explicit fields, the resolvability lookup has nothing to join on and the semantic judge has nothing to score against. The structural rubric is the data-shape contract the next two rubrics depend on. It is necessary, never sufficient.

Rubric 2: resolvability — does the citation point at a real source

A citation that parses cleanly and resolves to nothing is the legal-AI sanctions story. Resolvability is the rubric that catches it. Two deterministic checks run on every citation:

The retrieval-log join. Every source_id in the citation list must appear in the upstream retriever.documents log on the same trace. This is the cheapest catch and the one most teams skip because it requires joining the citation span to the retriever span, which means tree-structured tracing. The check is a set-membership lookup; cost is microseconds.

The registry lookup. External citations resolve against the registry of record. URL gets a HEAD request with a 200 response and matching canonical-URL metadata. DOI gets a doi.org resolve. Legal cases join against Westlaw or CourtListener; medical studies join against PubMed. The check is a single HTTP round-trip per citation, parallelized; the ai-evaluation SDK’s MaliciousURLScanner plus a per-domain registry adapter cover the common cases.

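from typing import Callable

from fi.evals.scanners import MaliciousURLScanner  # URL safety scanner; wiring not shown here

def resolvability_score(
    answer: AnswerWithCitations,
    retrieval_log: set[str],
    registry_resolves: Callable[[str], bool],  # per-domain registry adapter, supplied by the caller
) -> float:
    """Fraction of citations that are in the retrieval log and, when they carry a URL, resolve."""
    total = 0
    resolved = 0
    for claim in answer.claims:
        for cite in claim.citations:
            total += 1
            if cite.source_id not in retrieval_log:
                continue  # citation points outside the retrieval pool
            if cite.url and not registry_resolves(cite.url):
                continue  # URL did not resolve against the registry of record
            resolved += 1
    # An answer with no citations scores 0.0 rather than dividing by zero.
    return resolved / max(total, 1)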
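The `registry_resolves` helper is not defined in this guide. One way to sketch it for plain URLs, assuming a HEAD request and a 200 status are enough, is shown below with the standard library only. It does not perform the canonical-URL metadata match or any Westlaw, CourtListener, or PubMed lookup; `http_head_status` is a hypothetical name, and the `head` parameter exists so the network call can be swapped out in tests.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_head_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 when the request fails."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, ValueError, OSError):
        return 0  # DNS failure, timeout, or malformed URL


def registry_resolves(
    url: str,
    head: Callable[[str], int] = http_head_status,
) -> bool:
    """Hypothetical adapter: True when the URL answers a HEAD request with 200."""
    return head(url) == 200
```

In a test, pass a stub such as `head=lambda url: 404` so the check stays deterministic and offline.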

Two operational rules earn their keep. Run resolvability inline at the gateway, not offline; there is no clean way to retract a hallucinated citation once it has been quoted in a brief. And alarm on a resolvability rate below the per-domain floor (typically 0.99 in legal and medical, 0.97 in news). Below the floor, the response gets held, rewritten with a stricter prompt, or refused, not handed to the user.
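The floor rule can be written as a small gate. The floors below are the ones quoted above; the function name, the action labels, and the rewrite-once-then-refuse ordering are assumptions made for illustration.

```python
# Floors quoted in the text; any other domain would need its own value.
RESOLVABILITY_FLOORS = {"legal": 0.99, "medical": 0.99, "news": 0.97}


def next_action(
    domain: str,
    resolvability: float,
    rewrites_used: int = 0,
    max_rewrites: int = 1,
) -> str:
    """Decide what happens to a response given its resolvability score."""
    floor = RESOLVABILITY_FLOORS[domain]
    if resolvability >= floor:
        return "deliver"
    if rewrites_used < max_rewrites:
        return "rewrite"  # retry with a stricter prompt before giving up
    return "refuse"       # never hand a below-floor response to the user
```

Looking up an unknown domain raises `KeyError` on purpose: a missing floor should fail loudly rather than fall back to a permissive default.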

Resolvability is load-bearing because it catches the most newsworthy failure modes for cents per call, instead of dollars per claim like the semantic rubric.

Rubric 3: semantic — does the cited source actually contain the claim

A citation that resolves and appears in the retrieval log points at a real, retrieved document. That does not mean the document contains the claim attached to it. The semantic rubric is the per-claim entailment check that catches the right-doc-wrong-claim failure.

The atomic claim is the unit of evaluation. Split the answer into one-fact-per-claim units, then score each (claim, cited_passage) pair with a calibrated judge. Aggregate per response only after every claim has a score. The aggregate that matters is the percentage of claims whose cited passage entails them at a calibrated threshold, not a mean.

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

semantic_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "claim_entailment",
        "model": "claude-sonnet-4-5-20250929",
        "grading_criteria": """Score 1.0 if the cited passage entails the claim
        (the claim is supported by the passage on a plain reading). Score 0.5 if
        the passage is topically relevant but does not entail the specific claim.
        Score 0.0 if the claim is not supported by the passage.""",
    },
)
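The difference between a mean and the entailed-claim percentage is easy to see on a toy answer. The scores below are hypothetical judge outputs on the 1.0 / 0.5 / 0.0 scale from the grading criteria; `entailed_fraction` is an illustrative helper, not an SDK function.

```python
def entailed_fraction(claim_scores: list[float], threshold: float = 1.0) -> float:
    """Share of claims whose judge score meets the entailment threshold."""
    if not claim_scores:
        return 0.0
    passed = sum(1 for score in claim_scores if score >= threshold)
    return passed / len(claim_scores)


# Hypothetical judge scores for a six-claim answer.
claim_scores = [1.0, 1.0, 1.0, 0.5, 0.5, 0.0]

mean_score = sum(claim_scores) / len(claim_scores)  # 4.0 / 6, about 0.67
entailed = entailed_fraction(claim_scores)          # 3 / 6 = 0.5
```

The mean gives partial credit to topically relevant passages; the entailed fraction counts only the claims a reader could verify from the cited passage.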

Three calibration habits separate a working semantic rubric from theatre:

  • Calibrate the judge against a human-labelled hold-out. Pin a 50-sample set with senior-domain-reviewer labels (lawyer, clinician, editor). Alarm when judge-vs-human disagreement exceeds the inter-rater baseline. Judges drift; calibration is the only way to know when.
  • Score per claim, aggregate per response, report both. A response that scores 0.94 on groundedness while only 61% of its claims are entailed by their cited passages is a common failure mode in production traffic. Users do not read averages; they click citation seven, see it does not support the sentence, and the trust goes.
  • Run the judge on a sample, not on every claim. Tail-aware sampling at the OTel collector keeps 100% of runs with any structural or resolvability failure, top-percentile latency or cost, and any experiment cohort. Sample 10-20% of the remaining clean traffic for the semantic judge. See our LLM-as-judge best practices guide and deterministic vs LLM-judge evals comparison for the calibration playbook.
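The first habit, alarming when judge-vs-human disagreement exceeds the inter-rater baseline, can be sketched as follows. It assumes two human reviewers labelled the same hold-out set and that the first reviewer is the reference; the function names are hypothetical and the eight labels are toy data standing in for the 50-sample set.

```python
def disagreement_rate(labels_a: list[float], labels_b: list[float]) -> float:
    """Fraction of items on which two label lists differ."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    differing = sum(1 for a, b in zip(labels_a, labels_b) if a != b)
    return differing / len(labels_a)


def judge_has_drifted(
    judge: list[float],
    reviewer_1: list[float],
    reviewer_2: list[float],
) -> bool:
    """True when judge-vs-human disagreement exceeds the inter-rater baseline."""
    baseline = disagreement_rate(reviewer_1, reviewer_2)
    judge_vs_human = disagreement_rate(judge, reviewer_1)
    return judge_vs_human > baseline


# Toy hold-out labels on the 1.0 / 0.5 / 0.0 scale.
reviewer_1 = [1.0, 1.0, 0.5, 0.0, 1.0, 0.5, 1.0, 0.0]
reviewer_2 = [1.0, 1.0, 0.5, 0.0, 1.0, 1.0, 1.0, 0.0]  # differs on one item
judge_ok   = [1.0, 1.0, 0.5, 0.0, 1.0, 0.5, 1.0, 0.5]  # differs on one item
judge_bad  = [1.0, 0.5, 0.5, 0.5, 1.0, 0.5, 0.0, 0.0]  # differs on three items
```

Exact-match disagreement is the simplest choice; a chance-corrected statistic such as Cohen's kappa is a reasonable substitute when label distributions are skewed.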

ChunkAttribution and ChunkUtilization cover the attribution surface; CustomLLMJudge carries the per-domain entailment rubric. Platform classifier-backed evals run the same rubric at lower per-eval cost than Galileo Luna-2, which makes claim-level scoring economically tractable at sample-and-judge volume.

Per-domain calibration: same rubrics, different registries

The three rubrics do not change across domains. The schemas, registries, and thresholds do. Calibrating them per domain is what makes the rubric audit-grade rather than directional.

| Domain | Structural format | Resolvability registry | Semantic calibration |
|---|---|---|---|
| Legal | Bluebook, structured {case, reporter, page, year} | Westlaw, CourtListener, PACER | Holding vs dicta; fact-vs-law per claim |
| Medical | Vancouver, AMA, structured DOI list | PubMed, DOI.org, FDA Orange Book, NIH guidelines | Guideline-version aware; current vs withdrawn |
| Research | APA, inline link plus DOI | arXiv, DOI.org, publisher canonical | Per-claim entailment; primary vs secondary source |
| News / journalism | Inline link plus byline | URL resolve plus publisher metadata join | Direct quote vs paraphrase; source-tier weighting |

Three calibration moves recur across all four:

  • The resolvability floor varies by stakes. Legal and medical sit at 0.99. News sits at 0.97 because legitimate dead-link rates exist. Below the floor, the response gets held or refused.
  • The semantic threshold varies by claim type. A legal holding needs full entailment; a procedural sentence tolerates topical relevance. Run a per-claim-type entailment threshold rather than a single floor.
  • The refusal path is calibrated per domain. The legal “I do not have a standard for this clause” abstention is non-negotiable. The news “I could not verify this source within the deadline window” abstention is best practice. Wire AnswerRefusal as its own rubric and alarm on drift in refusal rate alongside the three citation rubrics. Our legal RAG evaluation guide, healthcare RAG evaluation guide, and contract review RAG playbook carry the domain-specific calibration tables.
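A per-claim-type threshold, as described in the second bullet above, might look like this. The two thresholds mirror the text (a holding needs full entailment, a procedural sentence tolerates topical relevance) on the 1.0 / 0.5 / 0.0 judge scale; the claim-type labels and the `failing_claims` helper are illustrative assumptions.

```python
# Minimum judge score per claim type; the labels are illustrative, not an SDK enum.
ENTAILMENT_THRESHOLDS = {
    "holding": 1.0,     # needs full entailment
    "procedural": 0.5,  # tolerates topical relevance
}


def failing_claims(scored_claims: list[tuple[str, str, float]]) -> list[str]:
    """Return the text of every claim scored below its claim-type threshold."""
    return [
        text
        for text, claim_type, score in scored_claims
        if score < ENTAILMENT_THRESHOLDS[claim_type]
    ]


# Hypothetical (claim text, claim type, judge score) triples.
scored = [
    ("The court held the clause unenforceable.", "holding", 0.5),
    ("The motion was filed in the Southern District.", "procedural", 0.5),
]
flagged = failing_claims(scored)
```

The same 0.5 score fails the holding and passes the procedural sentence, which is what a single floor cannot express.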

How traceAI carries the three rubrics from CI to production

The rubrics are only as good as the data they score against. traceAI (Apache 2.0) makes the retrieval and citation signal first-class so every rubric joins against the actual trace, not the model’s claim about what it cited.

The shape that earns its keep:

agent.run
  retrieval (fi.span.kind = RETRIEVER)
    retrieval.documents = [source_id_1, source_id_2, ...]
  llm.completion (fi.span.kind = LLM)
    llm.output.citations = [Citation(source_id, quoted_span, ...)]
  eval.structural
  eval.resolvability
  eval.semantic

The retrieval span carries retrieval.documents as the source-of-truth pool. The LLM span carries llm.output.citations as the structured citation list. The three eval spans attach per-rubric scores. The resolvability join is set(citations.source_id) <= set(retrieval.documents), computed on the trace tree, not on the answer text.
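The join itself is a few lines once the trace is available as a tree. The sketch below uses plain nested dictionaries as a stand-in for an exported trace; that shape, and the helper names, are assumptions for illustration and not the traceAI export format.

```python
def collect(span: dict, key: str) -> list:
    """Gather every value stored under `key` in a span and its descendants."""
    found = list(span.get("attributes", {}).get(key, []))
    for child in span.get("children", []):
        found.extend(collect(child, key))
    return found


def unretrieved_citations(root: dict) -> set[str]:
    """Cited source_ids that never appeared in any retriever span."""
    retrieved = set(collect(root, "retrieval.documents"))
    cited = {c["source_id"] for c in collect(root, "llm.output.citations")}
    return cited - retrieved


# Hypothetical trace: two retrieved documents, two citations, one unretrieved.
trace = {
    "name": "agent.run",
    "children": [
        {
            "name": "retrieval",
            "attributes": {
                "fi.span.kind": "RETRIEVER",
                "retrieval.documents": ["doc_1", "doc_2"],
            },
        },
        {
            "name": "llm.completion",
            "attributes": {
                "fi.span.kind": "LLM",
                "llm.output.citations": [
                    {"source_id": "doc_1", "quoted_span": "placeholder text"},
                    {"source_id": "doc_9", "quoted_span": "placeholder text"},
                ],
            },
        },
    ],
}

missing = unretrieved_citations(trace)  # {'doc_9'}
join_passes = not missing               # False
```

Returning the missing ids rather than a bare boolean gives the reviewer the offending citation directly.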

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="legal_rag_citations",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

traceAI ships 50+ AI surfaces across Python, TypeScript, Java (Spring Boot starter, Spring AI, LangChain4j, Semantic Kernel), and C#. Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) at register() time let the same trace ingest into Phoenix or Traceloop without re-instrumenting. The 14 span kinds include a first-class RETRIEVER, which is what makes the per-retrieval join practical. EvalTag wires the three rubrics as span-attached scorers at server side; the same template that runs in pytest CI runs on the live span at zero added inference latency.

The diagnostic value is the trace tree, not the score. A 0.61 semantic score on a clean retrieval is a generator problem; the same score on a degraded retrieval log is a retriever problem. The trace tree tells you which.

Where Future AGI ships the citation-eval stack

Future AGI ships the eval stack as a package. Start with the SDK for code-defined evals. Graduate to the Platform when the loop needs self-improving rubrics and lower per-eval cost at sample-and-judge volume.

  • ai-evaluation SDK (Apache 2.0). 50+ EvalTemplate classes including ChunkAttribution, ChunkUtilization, Groundedness, FactualAccuracy, AnswerRefusal, and EvaluateFunctionCalling for the structural schema check. CustomLLMJudge with Jinja2 grading carries the per-domain semantic rubrics. 20+ local heuristic metrics and 8 sub-10ms Scanners (MaliciousURLScanner, RegexScanner, InvisibleCharScanner, plus 5 others) cover the deterministic resolvability and format-drift checks.
  • traceAI (Apache 2.0). Carries the same rubric as a span-attached score on live traffic. 50+ AI surfaces across four languages. EvalTag server-side wiring means zero extra inference hops for the production rubric.
  • Future AGI Platform. Self-improving evaluators retune from senior-reviewer thumbs feedback. An in-product authoring agent writes the per-domain semantic rubric from a natural-language description (the realistic path for legal-holding, medical-guideline-version, and journalism-source-tier rubrics). Classifier-backed evals at lower per-eval cost than Galileo Luna-2 make claim-level scoring on every sampled trace affordable.
  • Error Feed (inside the eval stack). HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups failing traces into named clusters. Typical citation clusters: “model invents plausible-but-fake case”, “right doc, wrong page”, “over-citation pads procedural sentences”, “freshness lapses on amended statute”. A Claude Sonnet 4.5 Judge (30-turn budget, 8 span-tools, Haiku Chauffeur for spans over 3000 chars, ~90% prompt-cache hit) writes the RCA, an immediate_fix, and a 4-dimensional trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 each). The cluster promotes into the offline regression set after reviewer sign-off. Linear is wired today via OAuth; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
  • Agent Command Center. 100+ providers as a single Go binary (Apache 2.0). 18+ built-in guardrail scanners and 15 third-party adapters run at the same network hop. ~29k req/s with P99 21 ms with guardrails on, on a t3.xlarge. Per-call cost telemetry via x-prism-cost headers quantifies what structured-output and citation-strict routing cost on a per-route basis.

Future AGI is SOC 2 Type II, HIPAA, GDPR, and CCPA certified; ISO/IEC 27001 is in active audit.

Three deliberate tradeoffs

  • Structured-output citations cost output tokens. Expect roughly a 5-15% premium on output tokens for the Pydantic schema. The Agent Command Center’s x-prism-cost headers quantify this exactly, so the tradeoff is decided on a per-route basis rather than category-wide. The cost of getting a citation wrong in legal or medical is two orders of magnitude higher.
  • The semantic rubric is the expensive layer. A judge call per claim gets pricey at scale. Tail-aware sampling at the collector plus a calibrated lighter judge (or a distilled classifier) on the bulk traffic keeps the bill bounded. Frontier judges run on the 5-10% tail that triggered any other rubric or guardrail.
  • The refusal head is the operational lever. A bot that refuses on ambiguous claims is safer and slower. The calibration is per claim type, not per response. Get the senior reviewer to label the refusal set; engineer-only labels do not survive contact with production.
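The tail-aware sampling mentioned in the second tradeoff reduces to a keep-or-drop decision per trace. The sketch below is an application-level approximation, not an OTel collector configuration: flagged traces are always kept, and clean traces are sampled deterministically by hashing the trace id. The function and parameter names are hypothetical.

```python
import hashlib


def keep_for_semantic_judge(
    trace_id: str,
    *,
    rubric_failed: bool,
    in_latency_or_cost_tail: bool,
    in_experiment: bool,
    sample_rate: float = 0.1,
) -> bool:
    """Keep every flagged trace; sample the remaining clean traffic by hash."""
    if rubric_failed or in_latency_or_cost_tail or in_experiment:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # First 4 bytes as a fraction: uniform in [0, 1) and exact in a float.
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate
```

Hashing the trace id instead of calling a random generator means the same trace gets the same decision on every replay, which keeps CI and production samples comparable.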

Anti-patterns to avoid

  • Scoring only the final answer. Pass the answer, ignore the cite. The audit artifact is wrong even when the answer is right.
  • One score for citation eval. Three rubrics collapsed into a number hides which failure mode is on fire.
  • Trusting the model’s own citation list as the source-of-truth pool. The citation list is the thing under evaluation. Join against the retrieval log, not against itself.
  • Paragraph-level semantic rubric. A two-sentence response with one fake cite averages to a passing score. Move the unit of evaluation to the claim.
  • No registry lookup on external citations. A citation that parses cleanly but resolves to nothing, or to a live URL with the wrong content, is the failure that produced sanctions filings in 2024 and 2025. The lookup is one of the cheapest catches in the stack.
  • One threshold across domains. The legal floor is not the news floor. Calibrate per domain or under-trigger on the regulated workload that needed the trigger most.

Ready to evaluate your first citation-bearing system? Wire ChunkAttribution, Groundedness, EvaluateFunctionCalling, and AnswerRefusal into a pytest fixture this afternoon against the ai-evaluation SDK. Add a CustomLLMJudge for the per-domain entailment rubric. Then attach the same rubrics as EvalTag spans on live traffic via traceAI so the rubric that gated CI runs on production traces too. Running the three rubrics in both places is what closes the citation-trust gap that regulated systems otherwise leak.

Frequently asked questions

What are the three rubrics of citation evaluation?
Structural: did the model emit a citation token in the right schema (regex, Pydantic, function-call validator). Resolvability: does the citation point at a fetchable real source (URL returns 200, DOI resolves, source_id exists in the retrieval log). Semantic: does the cited passage actually contain the claim the model attached to it (judge model scores entailment per claim). Most teams ship rubric one and call the project done. The two failure modes that hurt in regulated production, plausible-looking citations that do not resolve and real citations attached to the wrong claim, only surface when you run all three.
Why isn't groundedness enough for citation evaluation?
Groundedness scores the answer against the retrieved context as a whole. Citation eval scores each claim against the specific source it was attributed to. A response can score 0.93 grounded and still cite a document the agent never retrieved, or attach a real document to a claim it does not contain. Groundedness is the synthesis-level check. The three citation rubrics are the audit-trail check, and in legal, medical, and journalism deployments the audit trail is the deliverable, not a side effect.
How do you catch a fabricated citation that looks plausible?
Three controls compound. The structural rubric forces the model into a Pydantic or function-calling schema where every claim carries a source_id and a quoted span. The resolvability rubric checks that the source_id is in the retrieval log and that any URL or DOI resolves to a live document with matching metadata. The semantic rubric runs a calibrated judge on the (claim, cited_passage) pair and scores entailment. The first two are deterministic and run inline. The third runs on a sample because it costs a judge call per claim. The famous fake-legal-case failure mode dies at the resolvability layer.
How do per-domain rubrics differ in legal, medical, and news?
The three rubrics are the same. The thresholds and registries are not. Legal AI calibrates structural to Bluebook, resolvability to a case-reporter registry like Westlaw or CourtListener, and semantic to a fact-vs-holding distinction. Medical AI calibrates structural to Vancouver or AMA, resolvability to PubMed and DOI registries, and semantic to a guideline-version-aware entailment check. News and research agents calibrate structural to inline link plus publisher metadata, resolvability to a URL-resolve plus retrieval-log join, and semantic to per-claim entailment with a source-tier weighting. Build the rubric once, calibrate the thresholds per domain.
How does traceAI attach citation evaluations to retrieval spans?
traceAI emits a RETRIEVER span carrying retrieval.documents and a downstream LLM span carrying llm.output.citations. The structured pair lets every citation rubric score against the actual retrieval log, not the citation list the model produced (which is the thing under evaluation). EvalTag wires the three rubrics as span-attached scorers at server side, so the same template that runs in pytest CI runs on live spans without a second inference hop. The 14 span kinds include first-class RETRIEVER, LLM, AGENT, and EVALUATOR, which is what makes per-claim attachment practical instead of grep over a log file.
What does Future AGI ship for citation eval today?
ai-evaluation (Apache 2.0) covers the three rubrics with stable EvalTemplate classes (ChunkAttribution and ChunkUtilization for attribution and utilization signal, Groundedness and ContextAdherence and FactualAccuracy for the substrate, EvaluateFunctionCalling for the structural schema check, AnswerRefusal for the calibrated abstention path) plus CustomLLMJudge for the per-domain semantic rubrics. traceAI carries the same rubric as a span-attached score on live traffic across 50+ AI surfaces. The Platform layers self-improving evaluators tuned by reviewer feedback, an in-product authoring agent that writes domain-specific rubrics from natural language, and classifier-backed evals at lower per-eval cost than Galileo Luna-2. Error Feed (HDBSCAN clustering plus a Sonnet 4.5 Judge) clusters failing traces and writes the immediate_fix string that promotes back into the offline regression set.
What are the most common citation-eval anti-patterns?
Five. Scoring only the final answer and skipping citation rubrics entirely, which loses the audit artifact even when the answer is right. Collapsing the three rubrics into a single score, which hides the failure mode that needs remediation. Running the semantic rubric at paragraph granularity, which lets a two-sentence answer with one fake cite average to a passing score. Trusting the model's own citation list as ground truth for what it retrieved (the citation list is the thing under evaluation; join against the retrieval log instead). Treating the rubric as static across domains; the calibration moves the registry, format, and entailment threshold, and the legal threshold is not the news threshold.