
LLM Summarization Evaluation: A 2026 Architectural Deep Dive

Summarization eval is four judge prompts, not four concepts. Groundedness, completeness, factuality, conciseness — each as a hardened prompt with a calibration set. The 2026 deep dive.


The companion post on this blog walks the seven steps of standing up a summarization eval. This one zooms in on the part that decides whether the whole thing works or quietly hallucinates: the prompt that runs inside the judge. Four rubrics, four prompts, four calibration corpora. The metrics are the easy half. The hard half is making each prompt do what its name promises.

The thesis: rubrics are prompts, not concepts

Summarization eval converges on four rubrics across every serious team: groundedness, completeness, factuality, conciseness. The companion post lays out why these four. This one assumes you’ve already agreed on the names.

The trap is treating the names as the work. A team drops a Groundedness() template into CI and ships. The number looks plausible. Six weeks later production users surface a hallucination the eval scored 0.92 on. The judge wasn’t broken. The prompt inside the judge was broken.

A rubric is a prompt plus a calibration corpus. The prompt is what the judge sees. The corpus is what proves the prompt works. Everything else — picking the model, the score range, wiring CI — is plumbing. The 80 percent of summarization eval that decides whether you catch failures or paper over them is prompt engineering on the four rubric judges.

The rest of this post is that loop, per rubric. Failure modes, the hardened prompt, the calibration set, and how CustomLLMJudge in the ai-evaluation SDK turns the iteration into a tight, versioned cycle.

Anatomy of a hardened judge prompt

Five constraints all four rubric prompts share.

Verdicts, not scores. Ask for supported / contradicted / missing / partial, not a 1-5 Likert. Categorical verdicts are easier to calibrate, aggregate, and argue about with annotators. The SDK collapses them to a float at the boundary; the judge thinks in verdicts.

Evidence first, verdict second. The prompt forces the judge to quote the source span that drove its decision before it writes the verdict. This is the single largest reliability lever. A judge that has to find evidence first hallucinates less, because the act of quoting is grounding.

Failure modes named. The prompt lists the failure modes by name. “Paraphrase is not entailment.” “Real-world truth is not source truth.” A judge that doesn’t know what failure looks like accepts everything.

One claim at a time. Each judge call scores one atomic unit against one source span. Summary-level prompts force the judge to do retrieval, decomposition, and verdict in one pass — three failure surfaces, one number out. Decompose first, then score atomic.

External knowledge disallowed. The prompt says so explicitly. “If the claim is true in the real world but not in the source, mark missing.” Without this line, the judge silently passes hallucinations that happen to be true.

The four rubric sections below apply these five.
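
The "verdicts, not scores" constraint still needs a number at the reporting boundary. A minimal sketch of that collapse follows; the weight table and the `collapse_verdicts` helper are assumptions for illustration, since this post does not show the SDK's actual mapping.

```python
from collections import Counter

# Hypothetical weights; the SDK's real verdict-to-float mapping may differ.
VERDICT_WEIGHTS = {
    "supported": 1.0,
    "partial": 0.5,
    "missing": 0.0,
    "contradicted": 0.0,
}


def collapse_verdicts(verdicts):
    """Return (score, counts) for a list of per-claim verdict strings."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    unknown = set(verdicts) - set(VERDICT_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown verdicts: {sorted(unknown)}")
    score = sum(VERDICT_WEIGHTS[v] for v in verdicts) / len(verdicts)
    return score, Counter(verdicts)


score, counts = collapse_verdicts(
    ["supported", "supported", "partial", "contradicted"]
)
print(round(score, 3), dict(counts))
```

Returning the counts next to the float keeps the categorical breakdown available for calibration and annotator review.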

Rubric 1: groundedness as a judge prompt

What it scores. Every claim in the summary is supported by content in the source.

Failure modes the corpus must cover.

  1. Pure invention. A claim with no source basis. Baseline judges catch this.
  2. Paraphrase confusion. Source says X; summary says near-synonym Y. Did the meaning drift?
  3. Causal embellishment. Source lists two events; summary adds “because of” between them.
  4. Quantification drift. “Many” becomes “most.” 18 percent becomes “nearly 20 percent.” The most-missed failure mode in vanilla groundedness prompts.
  5. Speculative aggregation. Source names three vendors; summary says “the industry.”

The vanilla prompt, “score groundedness from 1 to 5,” fails on every mode except pure invention. The judge has no taxonomy for what un-grounded looks like.

The hardened prompt.

from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge
from fi.evals.providers import LiteLLMProvider

groundedness_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "GroundednessAtomic",
        "grading_criteria": (
            "You are given a single atomic claim from a generated summary "
            "and a span of source text. Decide whether the claim is "
            "entailed by the span.\n\n"
            "VERDICTS:\n"
            "- supported: every fact in the claim is explicitly stated "
            "or directly entailed by the span.\n"
            "- contradicted: any fact in the claim disagrees with the span.\n"
            "- missing: any fact in the claim is absent from the span.\n"
            "- partial: some facts are supported, others are absent or wrong.\n\n"
            "RULES:\n"
            "1. Quote the source phrase that supports your verdict verbatim. "
            "If no such phrase exists, write '<no support found>'.\n"
            "2. Paraphrase is supported only if the meaning is preserved "
            "without adding facts, quantifiers, or causal links.\n"
            "3. External knowledge is disallowed. If the claim is true in "
            "the real world but not in the span, mark missing.\n"
            "4. Quantifier drift counts as contradicted. 'Many' to 'most' "
            "is not entailment.\n"
            "5. Causal additions ('because', 'due to', 'as a result') count "
            "as missing unless the span states the cause.\n\n"
            "Output JSON with keys: verdict, evidence_quote, "
            "unsupported_fact (when verdict is contradicted, missing, "
            "or partial)."
        ),
    },
)

The prompt covers the failure modes by rule. Rule 2 catches paraphrase confusion, and together with Rule 5 it catches causal embellishment. Rule 3 catches the real-world-truth leak. Rule 4 catches quantification drift. Speculative aggregation has no dedicated rule; it falls under Rule 2’s ban on added facts. The evidence_quote field forces grounding before verdict; unsupported_fact forces a named failure when the verdict isn’t supported.
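
The evidence-first contract is only useful if something checks it. One way to do that, sketched here under the assumption that the judge's raw JSON response is available as a string, is a post-hoc validator; `check_judge_output` is a hypothetical helper, not part of the SDK.

```python
import json

ALLOWED_VERDICTS = {"supported", "contradicted", "missing", "partial"}
NO_SUPPORT = "<no support found>"


def check_judge_output(raw_json, span):
    """Parse one judge response and list any contract violations.

    Returns (verdict, problems), where problems is a list of strings.
    """
    out = json.loads(raw_json)
    verdict = out.get("verdict")
    quote = out.get("evidence_quote", "")
    problems = []

    if verdict not in ALLOWED_VERDICTS:
        problems.append(f"verdict {verdict!r} is not allowed")
    if not quote:
        problems.append("evidence_quote is empty")
    elif quote != NO_SUPPORT and quote not in span:
        problems.append("evidence_quote is not verbatim from the span")
    if verdict == "supported" and quote == NO_SUPPORT:
        problems.append("supported verdict without evidence")
    if verdict in (ALLOWED_VERDICTS - {"supported"}) and not out.get("unsupported_fact"):
        problems.append("unsupported_fact is missing")
    return verdict, problems


span = "Revenue grew 18 percent year over year."
ok = '{"verdict": "contradicted", "evidence_quote": "18 percent", "unsupported_fact": "nearly 20 percent"}'
bad = '{"verdict": "supported", "evidence_quote": "grew nearly 20 percent"}'
print(check_judge_output(ok, span))
print(check_judge_output(bad, span))
```

A quote that does not appear verbatim in the span is a cheap signal that the judge paraphrased or invented its evidence, and those rows are worth routing to human review.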

The calibration corpus. Thirty to fifty pairs, stratified across the five failure modes. Each pair is claim plus span plus human verdict, scored by two annotators with disagreements resolved. Track Cohen’s kappa per failure-mode bucket — not just overall. A prompt at 0.8 kappa overall but 0.4 on quantification drift has a known gap. When a bucket regresses, the fix is usually a sharper rule, not a model swap.
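
Per-bucket kappa needs nothing beyond the standard library. The sketch below assumes each corpus row is a `(bucket, human_verdict, judge_verdict)` tuple; that row shape and the function names are illustrative choices.

```python
from collections import Counter, defaultdict


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty, equal-length label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two raters' marginal label rates.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def kappa_by_bucket(rows):
    """rows: iterable of (bucket, human_verdict, judge_verdict)."""
    grouped = defaultdict(lambda: ([], []))
    for bucket, human, judge in rows:
        grouped[bucket][0].append(human)
        grouped[bucket][1].append(judge)
    return {b: cohens_kappa(h, j) for b, (h, j) in grouped.items()}


rows = [
    ("quantification_drift", "contradicted", "supported"),
    ("quantification_drift", "contradicted", "contradicted"),
    ("quantification_drift", "supported", "supported"),
    ("quantification_drift", "supported", "supported"),
    ("pure_invention", "missing", "missing"),
    ("pure_invention", "supported", "supported"),
]
print(kappa_by_bucket(rows))
```

With buckets this small the estimate is noisy, so treat a per-bucket number as a pointer to which examples to read, not as a precise measurement.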

Rubric 2: completeness as a judge prompt

What it scores. The summary covers the important facts in the source.

The load-bearing word is “important.” A completeness judge without a notion of importance scores every fact equally and ships summaries that miss the headline. The vanilla prompt, “does the summary cover the source,” averages across trivia and lede, lands near 0.7 on almost any summary, and does not move when a summary fails.

Two prompt shapes work; the second is better. Shape one is holistic with a domain anchor: the prompt tells the judge what counts as important for this product. Shape two uses an expected-facts list per source — annotation produces, for each document, an explicit list of facts a correct summary must contain. The judge scores against the list, not the source. The importance question moves to annotation time (one-time cost) instead of inference time (relitigated every call).

The hardened prompt (shape two).

completeness_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "CompletenessExpectedFacts",
        "grading_criteria": (
            "You are given a generated summary and a list of expected facts "
            "the summary must cover. For each expected fact, decide whether "
            "the summary covers it.\n\n"
            "VERDICTS PER FACT:\n"
            "- present: the fact is stated in the summary with all named "
            "entities, quantities, and qualifiers preserved.\n"
            "- approximate: the fact is in the summary but a quantifier, "
            "name, or qualifier is missing or changed.\n"
            "- absent: the fact is not in the summary.\n\n"
            "RULES:\n"
            "1. Quote the summary phrase that covers each fact verbatim. "
            "If absent, write '<not found>'.\n"
            "2. Synonyms are acceptable. 'Q3' and 'third quarter' are the "
            "same fact.\n"
            "3. A fact is present only if every load-bearing entity is "
            "preserved. 'Acme grew' is approximate when the fact is "
            "'Acme grew 18 percent year over year'.\n"
            "4. Do not infer presence from related facts. 'Revenue grew' "
            "does not cover 'Operating margin declined'.\n\n"
            "Output JSON: array of {fact_id, verdict, evidence_quote}, "
            "plus aggregate {percent_present, percent_approximate, percent_absent}."
        ),
    },
)

Rule 3 is the largest reliability lever. Without it the judge accepts “Acme grew” as covering a precise quantitative claim and the rubric stops detecting precision loss. Rule 4 stops the judge from inferring coverage from adjacent facts.

The calibration corpus. Expected-facts annotation is the expensive piece — budget two hours per fifty documents at first; the cost amortizes. Pairs are (summary, expected-facts list, per-fact human verdict). Failure buckets to cover: missing headline fact, approximate quantifiers, present-but-buried, synonym ambiguity, inferred-not-stated.

The expected-facts list also becomes the failure surface engineers fix against. “Missed three of six expected facts on regulated content this week” is actionable; “completeness dropped from 0.81 to 0.74” is not.
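
To illustrate how per-fact verdicts turn into that kind of actionable report, here is a small sketch. The dict shapes and the `summarize_fact_verdicts` name are assumptions; the verdict labels and aggregate keys follow the prompt above.

```python
def summarize_fact_verdicts(expected_facts, verdicts):
    """Aggregate per-fact verdicts into percentages and a list of gaps.

    expected_facts: {fact_id: fact text}; verdicts: {fact_id: verdict}.
    """
    allowed = ("present", "approximate", "absent")
    if not expected_facts:
        raise ValueError("expected_facts is empty")
    counts = dict.fromkeys(allowed, 0)
    gaps = []
    for fact_id, fact in expected_facts.items():
        verdict = verdicts.get(fact_id)
        if verdict not in allowed:
            raise ValueError(f"fact {fact_id!r} has invalid verdict {verdict!r}")
        counts[verdict] += 1
        if verdict != "present":
            gaps.append((fact_id, verdict, fact))
    total = len(expected_facts)
    percents = {f"percent_{v}": 100.0 * counts[v] / total for v in allowed}
    return percents, gaps


facts = {
    "f1": "Acme grew 18 percent year over year",
    "f2": "Operating margin declined",
    "f3": "The board approved the buyback",
    "f4": "Guidance for Q3 was raised",
    "f5": "Headcount was flat",
    "f6": "The CFO resigned",
}
verdicts = {"f1": "approximate", "f2": "absent", "f3": "present",
            "f4": "present", "f5": "present", "f6": "absent"}
percents, gaps = summarize_fact_verdicts(facts, verdicts)
print(f"Missed {len(gaps)} of {len(facts)} expected facts")
for fact_id, verdict, fact in gaps:
    print(f"  {fact_id} [{verdict}]: {fact}")
```

Recomputing the aggregate in code rather than trusting the judge's own percentages removes one place where the model can make an arithmetic slip.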

Rubric 3: factuality as a judge prompt

What it scores. Statements in the summary that are checkable against the source are correct.

Distinct from groundedness. Groundedness asks “is this claim in the source.” Factuality asks “is this claim, which is in the source, stated correctly.” A summary can be perfectly grounded and wrong: every claim traces to the source, but dates are off by a year, percentages are transposed, two people got swapped. Factuality failures are the most embarrassing ones to ship, because near-misses on dates, numbers, and named entities slip past groundedness.

Failure modes.

  1. Numerical transposition. 18 percent becomes 81 percent. 2024 becomes 2042.
  2. Entity swap. Two people doing two things; the summary attributes the wrong action to the wrong person.
  3. Temporal drift. “Last quarter” becomes “this quarter.” “In 2023” becomes “recently.”
  4. Unit conversion error. 50 million dollars becomes 50 million euros. 5 megabytes becomes 5 gigabytes.
  5. Negation flip. “Did not approve” becomes “approved.”

A vanilla factuality prompt misses all five except gross negation flips, because the judge has no template for what “wrong but plausible” looks like.

The hardened prompt.

factuality_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "FactualityAtomic",
        "grading_criteria": (
            "You are given a single atomic claim from a summary and the "
            "source span it is grounded in. Decide whether the claim is "
            "stated correctly relative to the span.\n\n"
            "VERDICTS:\n"
            "- correct: every entity, quantity, and qualifier in the claim "
            "matches the span exactly.\n"
            "- numerically_wrong: a number, percentage, or date in the "
            "claim does not match the span.\n"
            "- entity_wrong: an actor, object, or attribute is misattributed.\n"
            "- polarity_wrong: a negation, comparison, or conditional has "
            "been flipped.\n"
            "- temporal_wrong: a tense, time reference, or duration is off.\n\n"
            "RULES:\n"
            "1. Quote the source phrase that should match the claim verbatim, "
            "then quote the claim phrase verbatim. Compare them character by "
            "character on entities and numbers.\n"
            "2. Synonyms for the same entity are correct. 'Acme Corp' and "
            "'Acme' are correct unless the source distinguishes them.\n"
            "3. Rounding within 1 unit of granularity is correct. 18 percent "
            "and 'about 20 percent' is correct; 18 percent and 'about 30 "
            "percent' is numerically_wrong.\n"
            "4. Tense shifts that change which event happened when are "
            "temporal_wrong. Tense shifts that keep the temporal ordering "
            "are correct.\n\n"
            "Output JSON: {verdict, source_phrase, claim_phrase, "
            "specific_mismatch (when verdict is not correct)}."
        ),
    },
)

Each named verdict maps to a failure mode in the corpus. Rule 1’s character-by-character comparison is the largest reliability gain — a judge asked to compare phrases as text catches transpositions that an overall-correctness score doesn’t.

The calibration corpus. This is the rubric where synthetic perturbation pays off the most. Take fifty correct summaries; programmatically flip one number, swap one entity, or invert one negation per summary. Human verdicts are trivial (you wrote the mutation), but the judge’s ability to detect each perturbation type is exactly what you’re measuring. Track per-mutation kappa. When the judge misses a class — usually rounding-adjacent numerical errors — sharpen Rule 3 with an explicit example.
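
A rough sketch of that perturbation step, using only the standard library. The mutation helpers, the negation table, and the `perturb` function are hypothetical; the expected verdict labels are the ones named in the factuality prompt above.

```python
import re

# Hypothetical negated/affirmative pairs; extend for your own domain.
NEGATION_PAIRS = [
    ("did not approve", "approved"),
    ("was not", "was"),
    ("will not", "will"),
]


def transpose_number(text):
    """Swap the last two digits of the first eligible number (18 -> 81)."""
    for match in re.finditer(r"\d{2,}", text):
        digits = match.group(0)
        if digits[-1] != digits[-2]:  # swapping equal digits changes nothing
            swapped = digits[:-2] + digits[-1] + digits[-2]
            return text[:match.start()] + swapped + text[match.end():]
    return None


def flip_negation(text):
    """Replace the first known negated phrase with its affirmative form."""
    for negated, affirmative in NEGATION_PAIRS:
        if negated in text:
            return text.replace(negated, affirmative, 1)
    return None


def swap_entities(text, first, second):
    """Exchange two entity names everywhere they appear."""
    if first not in text or second not in text:
        return None
    placeholder = "\x00"
    return (
        text.replace(first, placeholder)
        .replace(second, first)
        .replace(placeholder, second)
    )


def perturb(summary, entities):
    """Return (expected_verdict, mutated_summary) pairs for mutations that apply."""
    candidates = [
        ("numerically_wrong", transpose_number(summary)),
        ("polarity_wrong", flip_negation(summary)),
        ("entity_wrong", swap_entities(summary, *entities)),
    ]
    return [(verdict, text) for verdict, text in candidates if text is not None]


summary = "Alice approved the budget, but Bob did not approve the 18 percent raise."
for verdict, mutated in perturb(summary, ("Alice", "Bob")):
    print(verdict, "->", mutated)
```

Each mutated summary carries its expected verdict by construction, so the corpus row needs no annotation pass. Spot-check the output anyway, because string-level mutations occasionally produce text that is ungrammatical or accidentally still correct.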

Rubric 4: conciseness as a judge prompt

What it scores. No padding, no restatement, length appropriate to the use case.

The dimension everyone gets wrong. Conciseness judging without a target ratio is grading on vibes. A judge asked “is this summary concise” passes everything that isn’t grotesquely long, because the model’s prior for “concise” is “shorter than the source.” The target ratio has to be in the prompt.

Compression-band stratification. A TL;DR runs at 20:1. An abstract at 5:1. An executive summary at 10:1. An action recap at 3:1. One rubric tuned for an average produces noise across all four. Tag each example with its target band and pass the band into the prompt.

Failure modes.

  1. Length out of band. Right form, wrong ratio.
  2. Surface padding. Filler like “it is important to note that,” “in summary,” “the document explains that.”
  3. Verbatim restatement. Whole source sentences pasted into the summary. Length looks fine; the summary is doing no compression.
  4. List bloat. A single source sentence expanded into three bullets paraphrasing the same fact.
  5. Hedging inflation. “The board may have approved, although the exact details are not entirely clear” doing the work of “the board approved.”

The hardened prompt.

conciseness_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "ConcisenessBanded",
        "grading_criteria": (
            "You are given a source word count, a summary word count, a "
            "target compression band, and the summary text. Decide whether "
            "the summary is concise.\n\n"
            "BAND DEFINITIONS:\n"
            "- tldr: target ratio 15:1 to 25:1, two to four sentences.\n"
            "- exec_summary: target ratio 8:1 to 12:1, one to two paragraphs.\n"
            "- abstract: target ratio 4:1 to 6:1, one paragraph.\n"
            "- action_recap: target ratio 2:1 to 4:1, structured list.\n\n"
            "VERDICTS:\n"
            "- concise: ratio in band and no padding patterns present.\n"
            "- padded: ratio in band but contains filler phrases, verbatim "
            "restatements, or hedging inflation.\n"
            "- over_compressed: ratio above the upper bound (summary too "
            "short for the band).\n"
            "- under_compressed: ratio below the lower bound (summary too "
            "long for the band).\n\n"
            "RULES:\n"
            "1. Compute ratio = source_words / summary_words. Compare to "
            "the band bounds.\n"
            "2. Padding patterns to flag: filler ('it is important to note', "
            "'in summary'), restatement (any sentence overlapping a source "
            "sentence by more than 8 contiguous words), hedging inflation "
            "(three or more hedge words in one sentence), list bloat (two "
            "bullets paraphrasing the same fact).\n"
            "3. Quote the offending phrase verbatim when verdict is padded.\n\n"
            "Output JSON: {verdict, computed_ratio, target_band, "
            "padding_examples (list of verbatim quotes)}."
        ),
    },
)

Rule 2 names the four padding patterns by class. A judge that knows what padding looks like can find it.
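
The ratio in Rule 1 and the contiguous-overlap test in Rule 2 are mechanical, so they can be computed in code and handed to the judge as inputs. A minimal sketch, assuming whitespace word counts; the function names and the status labels `in_band`, `too_long`, and `too_short` are illustrative and are not the prompt's verdicts.

```python
# (lower, upper) bounds on source_words / summary_words, copied from the prompt.
BANDS = {
    "tldr": (15.0, 25.0),
    "exec_summary": (8.0, 12.0),
    "abstract": (4.0, 6.0),
    "action_recap": (2.0, 4.0),
}


def band_check(source, summary, band):
    """Return (ratio, status); status is 'in_band', 'too_long', or 'too_short'."""
    lower, upper = BANDS[band]
    source_words, summary_words = len(source.split()), len(summary.split())
    if summary_words == 0:
        raise ValueError("summary is empty")
    ratio = source_words / summary_words
    if ratio < lower:
        return ratio, "too_long"   # less compression than the band asks for
    if ratio > upper:
        return ratio, "too_short"  # more compression than the band allows
    return ratio, "in_band"


def has_verbatim_restatement(source, summary, max_run=8):
    """True if the summary copies more than max_run contiguous source words."""
    n = max_run + 1
    src = source.lower().split()
    summ = summary.lower().split()
    source_ngrams = {tuple(src[i:i + n]) for i in range(len(src) - n + 1)}
    return any(
        tuple(summ[i:i + n]) in source_ngrams
        for i in range(len(summ) - n + 1)
    )


source = " ".join(f"w{i}" for i in range(100))
summary = " ".join(f"s{i}" for i in range(20))
print(band_check(source, summary, "abstract"))
print(has_verbatim_restatement(source, " ".join(source.split()[:10])))
```

A ratio above the band's upper bound means the summary is shorter than the band allows, and a ratio below the lower bound means it is longer. The overlap check works on whole-text n-grams instead of sentence pairs, which is a simplification of the rule as written.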

The calibration corpus. Stratify across the four bands and the five failure modes. A useful trick: take one source document, produce four target-band versions, and mutate each version with each of the five failure modes, which yields twenty examples per source. Four source documents give twenty examples per band and an 80-example corpus that covers the full failure space.

How the corpora turn into a regression suite

Four rubrics, four corpora. Each is small (30 to 80 pairs), targeted, rubric-specific. Together they gate every judge-prompt edit.

  1. Edit the grading_criteria string for one rubric.
  2. Run the judge against that rubric’s corpus.
  3. Compute per-failure-mode kappa against human consensus.
  4. If overall kappa drops by 0.05 or any bucket drops by 0.10, the edit fails the gate.
  5. Promote the new prompt with a version tag.

The loop runs in seconds because the corpora are small. It catches regressions a bigger corpus would mask in averages. Per-bucket kappa is what tells the author which failure mode the edit just broke. The mistake teams make is skipping step 3 — a prompt edit moves the aggregate by a point and ships, but the aggregate moved because the prompt is now over-permissive on quantification drift in exchange for being stricter on causal additions. The net was a wash; the failure surface changed.
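
Step 4 of the list above can be sketched as a small gate function. The report shape and the `gate_prompt_edit` name are assumptions; the 0.05 and 0.10 thresholds come from the step itself.

```python
def gate_prompt_edit(baseline, candidate, overall_drop=0.05, bucket_drop=0.10):
    """Compare kappa reports shaped like {"overall": float, "buckets": {name: float}}.

    Returns (passed, reasons).
    """
    eps = 1e-9  # tolerate float noise right at the threshold
    reasons = []
    drop = baseline["overall"] - candidate["overall"]
    if drop >= overall_drop - eps:
        reasons.append(f"overall kappa dropped by {drop:.2f}")
    for name, before in baseline["buckets"].items():
        after = candidate["buckets"].get(name)
        if after is None:
            reasons.append(f"bucket {name!r} missing from candidate run")
        elif before - after >= bucket_drop - eps:
            reasons.append(f"bucket {name!r} dropped by {before - after:.2f}")
    return not reasons, reasons


baseline = {
    "overall": 0.80,
    "buckets": {"causal_embellishment": 0.70, "quantification_drift": 0.80},
}
candidate = {
    "overall": 0.79,
    "buckets": {"causal_embellishment": 0.85, "quantification_drift": 0.62},
}
passed, reasons = gate_prompt_edit(baseline, candidate)
print(passed, reasons)
```

The example mirrors the trade described above: the aggregate barely moves, yet the edit fails because one bucket gave up what another gained.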

CustomLLMJudge makes this loop concrete. The grading_criteria string is the unit of change. few_shot_examples teaches the judge a hard case without rewriting the rules. config["name"] is the version tag.

from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge
from fi.evals.providers import LiteLLMProvider

groundedness_v3 = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "GroundednessAtomic_v3",
        # HARDENED_GROUNDEDNESS_PROMPT holds the grading_criteria string from Rubric 1.
        "grading_criteria": HARDENED_GROUNDEDNESS_PROMPT,
        "few_shot_examples": [
            {
                "inputs": {
                    "claim": "Revenue grew nearly 20 percent.",
                    "span": "Revenue grew 18 percent year over year.",
                },
                "output": (
                    "{\"verdict\": \"contradicted\", \"evidence_quote\": "
                    "\"18 percent\", \"unsupported_fact\": \"quantifier drift "
                    "from 18 percent to nearly 20 percent\"}"
                ),
            },
        ],
    },
)

The example is one hard case lifted directly from the corpus. Three to five examples is the sweet spot — more crowds the model’s attention, fewer lets edge cases bleed back in.
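
Hand-escaping JSON inside a Python string is error-prone. One possible alternative, assuming corpus rows are stored as dicts with the hypothetical field names below, is to build each few-shot entry with `json.dumps`.

```python
import json


def to_few_shot(row):
    """Build one few_shot_examples entry from a corpus row (hypothetical fields)."""
    return {
        "inputs": {"claim": row["claim"], "span": row["span"]},
        "output": json.dumps(
            {
                "verdict": row["verdict"],
                "evidence_quote": row["evidence_quote"],
                "unsupported_fact": row["unsupported_fact"],
            }
        ),
    }


corpus_row = {
    "claim": "Revenue grew nearly 20 percent.",
    "span": "Revenue grew 18 percent year over year.",
    "verdict": "contradicted",
    "evidence_quote": "18 percent",
    "unsupported_fact": "quantifier drift from 18 percent to nearly 20 percent",
}
example = to_few_shot(corpus_row)
print(example["output"])
```

The resulting dict has the same `inputs` and `output` keys as the hand-written example above, so it can go straight into the `few_shot_examples` list.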

Wiring the four hardened judges into the eval pipeline

Once each rubric has a hardened prompt and a passing corpus, the pipeline is mechanical.

from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness,
    Completeness,
    FactualAccuracy,
    IsConcise,
)

ev = Evaluator(fi_api_key="...", fi_secret_key="...")

result = ev.evaluate(
    eval_templates=[
        Groundedness(),
        Completeness(),
        FactualAccuracy(),
        IsConcise(),
        groundedness_v3,
        completeness_judge,
        factuality_judge,
        conciseness_judge,
    ],
    # sources and summaries are parallel lists: source documents and their generated summaries.
    inputs=[
        {"input": source, "output": summary, "context": source}
        for source, summary in zip(sources, summaries)
    ],
    augment=True,
)

Built-in templates carry hardened defaults; custom judges carry product-specific sharpening. Both run through the same cascade with augment=True, which routes obvious cases through cheaper heuristics and reserves the LLM call for the ambiguous middle. The cascade is what makes per-claim atomic scoring economical — one judge call per uncertain claim, not per summary.
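
To make the cascade idea concrete, here is an illustrative sketch of a two-stage check. It is not the SDK's implementation: the post does not say which heuristics `augment=True` uses, so both cheap passes and the `cascade_verdict` helper are assumptions, and the LLM judge is stood in for by any callable.

```python
import re


def _tokens(text):
    """Lowercase alphanumeric tokens, punctuation dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cascade_verdict(claim, span, llm_judge):
    """Return (verdict, route); llm_judge(claim, span) runs only if heuristics are inconclusive."""
    claim_tokens, span_tokens = _tokens(claim), _tokens(span)
    if not claim_tokens:
        raise ValueError("claim has no tokens")
    # Cheap pass 1: the claim appears word for word in the span.
    if f" {' '.join(claim_tokens)} " in f" {' '.join(span_tokens)} ":
        return "supported", "heuristic"
    # Cheap pass 2: the claim shares no words at all with the span.
    if not set(claim_tokens) & set(span_tokens):
        return "missing", "heuristic"
    # Ambiguous middle: pay for the judge call.
    return llm_judge(claim, span), "llm"


calls = []


def fake_judge(claim, span):
    """Stand-in for a real judge call; records what it was asked."""
    calls.append(claim)
    return "contradicted"


span = "Revenue grew 18 percent year over year."
print(cascade_verdict("Revenue grew 18 percent.", span, fake_judge))
print(cascade_verdict("Revenue grew nearly 20 percent.", span, fake_judge))
print(cascade_verdict("Headcount doubled.", span, fake_judge))
print(len(calls))
```

Only the ambiguous claim reaches the judge. Real heuristics need to be far more conservative than these two, since a wrong cheap verdict never gets a second look.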

For regression sets over 5,000 documents, swap in a distributed runner: Celery, Ray, Temporal, or Kubernetes.

from fi.evals.runners import RayRunner

ev = Evaluator(
    fi_api_key="...",
    fi_secret_key="...",
    runner=RayRunner(num_workers=64),
)

A 50,000-document regression with four rubrics, cascade enabled, on 64 workers runs in under 30 minutes. That’s what decides whether the eval lives in CI or in someone’s quarterly tracker.
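
As a back-of-envelope check on what that figure implies per evaluation, the arithmetic below assumes summary-level scoring, evenly distributed work, and one evaluation at a time per worker. None of those assumptions is stated in this post, and per-claim atomic scoring would multiply the evaluation count.

```python
documents = 50_000
rubrics = 4
workers = 64
wall_clock_seconds = 30 * 60

evaluations = documents * rubrics        # total rubric runs
per_worker = evaluations / workers       # runs each worker must finish
budget_seconds = wall_clock_seconds / per_worker

print(f"{evaluations} evaluations, {per_worker:.0f} per worker")
print(f"{budget_seconds:.3f} s average budget per evaluation")
```

Under those assumptions each worker has a little over half a second per rubric evaluation on average, which suggests that either the cheap cascade path or concurrent calls within a worker carry much of the load.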

Closing the loop with Error Feed

Calibration corpora keep the judge prompts honest. Production traces keep the corpora honest. The bridge is failure clustering.

Error Feed inside Future AGI’s eval stack runs HDBSCAN soft-clustering over ClickHouse embeddings on failing summarization traces, then writes a named issue per cluster with an immediate fix. A Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur summariser, 90 percent prompt-cache hit ratio) reads each cluster and produces the description.

Typical clusters on summarization workloads:

  • “Quantifier creep on financial sources: 18 percent rendered as nearly 20 percent in 12 percent of Q1 summaries.”
  • “Causal addition between events not connected in the source.”
  • “Hedging inflation on regulated content: ‘may have’ instead of ‘did’ in 4 percent of healthcare summaries last week.”
  • “Entity swap on multi-party meeting transcripts after 8k context.”

Each cluster ships with the named issue, the immediate fix, and a representative example. Promote the example into the appropriate rubric’s corpus. The next prompt edit has to clear it. The corpora grow from real production misses, not synthetic guesses. Linear is the only Error Feed integration today; Slack, GitHub Issues, Jira, and PagerDuty are on the roadmap.

The wider FAGI loop on summarization eval

The four hardened judges are the local optimum. The closed self-improving loop is the global one.

traceAI (Apache 2.0) captures every summarization span with the right semantic attributes (fi.span.kind=LLM, llm.input_tokens, llm.output_tokens). The four rubric scores attach via EvalTag at span-write time, so the per-rubric verdict lives next to the trace.

The hosted Agent Command Center routes summarization traffic through a BYOK gateway with 100+ providers, captures the trace, runs the eval cascade, and pipes failures into Error Feed. The Future AGI Platform’s classifier-backed evaluators run at lower per-eval cost than Galileo Luna-2 — the difference between running the four hardened judges on every production summary versus sampling 10 percent and hoping the tail looks like the head.

The loop closes when agent-opt consumes the calibration corpus plus production failures, runs BayesianSearchOptimizer or GEPAOptimizer on the summarizer prompt, and ships the optimized prompt back through the gateway. Six optimizers plus EarlyStoppingConfig. The trace-stream-to-agent-opt connector that would consume live traces directly is on the roadmap; today the dataset is the calibration corpus plus promoted production examples.

Open-source SDK, hosted platform, both ends of the loop on summarization-grade rubrics. The companion post walks the seven-step workflow that wraps around the four-rubric prompt-engineering core covered here.

What to take from this

Four rubrics, four hardened prompts, four calibration corpora. The names are the easy half. The prompt inside each judge is where the eval lives or dies. Verdicts not scores, evidence before verdict, failure modes named, atomic units, external knowledge disallowed — these five constraints separate a judge that catches production failures from one that produces confident noise.

CustomLLMJudge is the unit of iteration. Edit the grading_criteria, run against the corpus, gate on per-bucket kappa, version-tag, ship. The loop is fast because the corpora are small. The corpora are small because they’re targeted at failure modes, not summary-length distributions.

Install ai-evaluation, instantiate the four configs above, run them against your fifty hardest production summaries this afternoon. The kappa numbers from that first run are the prompt-engineering backlog for the week.

Frequently asked questions

Why are judge prompts the bottleneck in summarization eval, not metric selection?
Metric selection is the easy part. Groundedness, completeness, factuality, conciseness — the four rubrics are obvious. What separates a working summarization eval from one that produces confident noise is the prompt that runs inside the judge. A groundedness prompt that asks 'is this summary faithful' gets 0.4 kappa with humans; the same rubric rewritten as 'for each claim below, mark supported if every fact is explicitly stated in the source span, contradicted if any fact disagrees with the span, missing if any fact is absent' gets 0.75. Same metric. Same model. The difference is prompt engineering on the judge.
What does a hardened groundedness judge prompt look like?
It binds the verdict to evidence quotes, forbids guessing, defines the failure modes by name, and outputs a structured object instead of a number. The prompt names supported, contradicted, missing, and partial as the four allowed verdicts. It requires the judge to quote the source span that supports each verdict verbatim. It explicitly disallows external knowledge: 'if the fact is true in the real world but not in the source, mark missing.' And it asks for the verdict per claim, not per summary. The full pattern lives in the groundedness section above.
How big does the calibration corpus need to be per rubric?
Thirty to fifty examples per rubric is enough to detect a broken prompt. Each example is one source-summary pair with a human verdict and one-sentence reasoning. The corpus is rubric-specific because the failure modes are rubric-specific — a groundedness calibration set needs hallucination cases, a completeness set needs omission cases, a conciseness set needs padding cases. Re-run the calibration after every prompt edit. Cohen's kappa above 0.6 against the human consensus is the bar to ship the prompt.
How does CustomLLMJudge in the ai-evaluation SDK help with this?
CustomLLMJudge wraps the four steps you'd otherwise stitch together — provider routing, structured output, evidence extraction, calibration tracking — into one EvalTemplate class. You pass a grading_criteria string and optional few-shot examples; the SDK enforces the JSON output schema, returns a score plus a reason field, and runs through the same cascade as the built-in templates. The grading_criteria string is where the prompt engineering lives. The SDK does the plumbing so you can iterate on the prompt without rebuilding the harness.
What are the failure modes a calibration corpus catches?
Each rubric has a signature set. Groundedness fails when the judge accepts loosely-related-but-not-stated claims, when it confuses paraphrase with entailment, and when it lets the model's prior leak in. Completeness fails when the judge has no defined notion of important — every fact looks equally weighted. Factuality fails when the judge can't distinguish a date error from a present-but-imprecise statement. Conciseness fails when the judge has no target length — it grades on a vibe. The calibration corpus surfaces all four by construction if you stratify by failure mode.
How does Future AGI run this at scale?
The ai-evaluation SDK ships Groundedness, Completeness, FactualAccuracy, and IsConcise as ready-to-use EvalTemplate classes with the hardened prompts baked in, plus CustomLLMJudge when you need to override. Four distributed runners (Celery, Ray, Temporal, Kubernetes) keep regression sets of 5k to 50k documents under an hour. The Future AGI Platform runs cascade evaluators at lower per-eval cost than Galileo Luna-2 and Error Feed clusters production failures into named issues with an immediate fix per cluster, so the calibration corpus grows from real misses, not synthetic guesses.