Evaluating LLM Summarization: A Step-by-Step Guide (2026)
Summarization eval is four rubrics, not one number: groundedness, completeness, factuality, conciseness. Scored independently, calibrated against humans, run in CI. The 2026 guide.
Table of Contents
Most summarization eval suites stop at ROUGE and call it shipping-ready. ROUGE catches some lexical drift. It catches nothing else. A modern abstractive model can score 0.45 ROUGE-L against the reference, read fluently, and quietly invent a fact that was never in the source document. The user sees the invented fact. The dashboard sees a passing number. The eval was the bug.
The thesis
Summarization eval is four rubrics, not one metric. ROUGE was designed in 2004 for extractive systems that copied spans from the source. Modern abstractive systems paraphrase, restructure, and synthesize, and they break every assumption ROUGE was calibrated against. What you actually care about reduces to four independent questions. Did the model only assert facts present in the source. Did it cover the important ones. Did it state them accurately. Did it stay inside the length and structure spec. Score those four with calibrated LLM-as-judge rubrics, run them in CI on a golden set sampled from production, and let lexical-overlap metrics ride along as a cheap regression signal. Never as the gate.
The rest of this guide is the working pattern: the four dimensions, the calibration step against human raters, the prompts and the cascade, the variant playbooks for news / meeting notes / ticket summaries / document QA, the CI wiring, and the production loop.
TL;DR: the playbook
| Step | What you do | Why it matters |
|---|---|---|
| 1. Frame | Score four rubrics independently, not one aggregate | ROUGE conflates groundedness, completeness, conciseness |
| 2. Golden set | Sample 100-200 source-summary pairs from production, stratify by type and length | Invented test sets reflect author assumptions, not user behavior |
| 3. Calibrate | Human-label 50 examples per dimension, track Cohen’s kappa with the judge | Uncalibrated judges produce confident garbage |
| 4. Cascade | Deterministic → NLI → LLM-as-judge → frontier adjudication | Don’t run a $0.04 judge on a check a regex answers |
| 5. Specialize | Per-domain weights and a custom rubric tied to the product brief | News, meetings, tickets, and document QA fail differently |
| 6. Ship the gate | Per-dimension thresholds on a 7-day trailing baseline | A frozen baseline drifts past you in a quarter |
| 7. Close the loop | Production failures auto-cluster, promote into the golden set | Static datasets stop being regression suites |
Steps 1 through 4 are the eval. Steps 5 through 7 are how the eval stays honest over months.
Step 1: drop ROUGE as the primary metric
The argument against ROUGE isn’t that it’s wrong, it’s that it’s the wrong shape. ROUGE-L gives one score for a phenomenon that has three independent failure modes:
- Hallucination. The summary asserts a fact not in the source. ROUGE doesn’t model the source; it only compares to a reference. A confidently invented sentence that happens to share n-grams with the reference scores high.
- Omission. The summary leaves out the headline fact. If the reference also leaves it out, ROUGE never notices. If the reference includes it and the summary paraphrases around it, ROUGE penalizes the paraphrase, not the omission.
- Padding. The summary is twice the spec length but contains the right tokens. ROUGE-L rewards the overlap and ignores the length violation.
The G-Eval paper (Liu et al. 2023, arXiv:2303.16634) measured this directly against SummEval. GPT-4 scored as a four-dimensional rubric (coherence, consistency, fluency, relevance) hit a Spearman correlation of 0.514 with human raters on summarization. ROUGE-L hit 0.128 on the same dimension set. BERTScore hit 0.190. The lexical-overlap metrics were not just worse on average; they ranked summary pairs the opposite of how humans ranked them on the dimensions humans cared about.
Use ROUGE-L and BERTScore as one column on the dashboard for backward compatibility and cheap regression signal. Never as the gate. The four rubrics below are the gate.
Step 2: the four rubrics that decide summary quality
Each rubric scores an independent question. Each is small enough that a judge can apply it consistently. Together they cover what a human reader actually evaluates when they say “this is a good summary.”
Groundedness. Every claim in the summary is supported by content present in the source document. Score per-claim and average, or score the summary holistically on a 1-5 scale. The Groundedness template in the ai-evaluation SDK (eval_id=47) does this directly against the context field; the local NLI variant runs a DeBERTa-NLI head per claim with no API call.
Completeness. The important facts in the source are present in the summary. “Important” is the load-bearing word. Define it per domain. For news, the lede plus the supporting facts. For meeting notes, every decision and every action item with owner and deadline. For ticket summaries, the issue, the customer’s last action, and the next step. Completeness (eval_id=10) scores this; for a deterministic backstop, annotate an expected-fact list per golden-set example and compute fact-recall.
Factuality. Statements in the summary that are checkable against the source are true. Distinct from groundedness because a claim can be present in spirit but wrong in detail (a date off by a year, a name swapped between people). FactualAccuracy (eval_id=66) catches this. For RAG-grounded summarization, ContextAdherence (eval_id=5) adds the retrieval-step check.
Conciseness. The summary respects the length spec and contains no padding or repetition. Score with IsConcise (eval_id=83) for the qualitative judgment, and pair it with a deterministic length check (token count, sentence count, paragraph count) bound as a hard rubric. A 500-token spec that returns 2000 tokens fails the gate regardless of the other three rubrics.
SummaryQuality (eval_id=64) and IsGoodSummary (eval_id=94) are holistic aggregators that fold all four into a single score. Useful as dashboard tiles. Not useful as the gate, because they hide which dimension regressed.
Step 3: calibrate the judge before you trust it
An uncalibrated LLM judge is a confident random number generator. The fix is the same protocol the G-Eval paper used: score against human-labeled summaries and track agreement.
Run the calibration loop once when you set the rubric up, then quarterly:
- Sample 50 source-summary pairs from production. Skew toward edge cases (long sources, mixed-language, contested facts) where rubric disagreement is most informative.
- Two annotators score each pair on all four dimensions. Use a 1-5 Likert per dimension, with one-sentence reasoning. Resolve disagreements by discussion or a tiebreaker annotator.
- Run the judge prompt against the same set. Same prompt, same judge model, same temperature you’ll use in production.
- Compute Cohen’s kappa per dimension between judge scores and human consensus. Anything below 0.6 means the rubric prompt needs sharpening, the judge model is wrong for the dimension, or both. The G-Eval paper hit Spearman 0.514 on SummEval with GPT-4; a 2026 production-grade summarization judge should hit 0.6 or better on the dimension your product cares about most.
- Pin the judge model and the rubric prompt as one contract. Swapping the judge model is an eval migration, not a config change. Track judge-human kappa drift over time. When it slides, the source distribution or the model has shifted.
Calibrated judges produce scores you can act on. Uncalibrated judges produce scores you can argue about. The thirty minutes of human annotation per quarter is the cheapest reliability investment in the eval stack.
Step 4: build the cascade, don’t pay for the frontier on every span
LLM-as-judge is the right tool for groundedness, completeness, factuality, and conciseness. It is the wrong tool for everything below it. The pattern that works:
- Deterministic first. Token count, sentence count, JSON schema for structured summaries, regex for refusal phrasings, expected-field presence. Microsecond cost, zero API spend, runs on every trace.
- Embedding or NLI second. BERTScore against a gold answer if you have one. A DeBERTa-NLI head for claim-by-claim entailment against the source. Milliseconds, no LLM call.
- Small classifier third. Toxicity (
eval_id=15), PII, prompt injection. Sub-100ms, no judge call. - LLM-as-judge fourth. The four rubrics above. Pin the model. Cache verdicts keyed on
(judge_model_id, rubric_version, source_hash, summary_hash). - Frontier adjudication last. When the small classifier and the standard judge disagree, escalate to a stronger judge or a human label. The
augment=Trueflag in theai-evaluationSDK encodes this cascade directly.
The cost discipline matters. Running a $0.04 frontier judge on every summary across millions of spans per day burns the eval budget before it scores anything useful. Reserve the expensive primitive for the cases the cheap primitives can’t decide.
Step 5: specialize for the document type
The four-rubric framework holds across domains. The weights and the custom rubric change.
News summarization. Factuality and groundedness carry the weight. A misattributed quote or a wrong company name lands the publisher in a correction. Add a named-entity-grounding check (every entity in the summary must appear in the source) as a hard rubric. Set the factuality floor at 0.9. Conciseness matters but rarely fails because newsroom briefs hew to the spec.
Meeting notes. Completeness on action items is the load-bearing dimension. Score with a custom rubric: “Every action item in the source transcript appears in the summary with the owner and the deadline.” The action-item rubric is binary per item; the summary either has all of them or it doesn’t. Groundedness still matters (no invented action items). Conciseness is secondary. A meeting note that misses the assignment is a workflow failure even if it’s the right length.
Support-ticket summaries. Conciseness and structured-field extraction dominate. The downstream consumer is usually a routing system or a tier-2 agent, not a human reader. Score the summary as a structured object (issue, customer-action, next-step, priority) and treat each field as its own deterministic rubric. The qualitative groundedness and factuality rubrics ride along.
Document QA summaries. This is RAG-grounded summarization. The summary is a synthesis over retrieved chunks, not a single source. ContextAdherence joins the rubric set. Score citation validity (every cited span exists in the retrieval context) as a hard deterministic rubric. The completeness rubric splits into two: did retrieval surface the right chunks (ContextRelevance), and did the summary use the right ones (ChunkUtilization).
The custom rubric is where the product signal lives. Use CustomLLMJudge with a free-text grading_criteria field and a few-shot example or two. Examples that worked in production: “Does the legal-clause summary preserve the indemnity cap language verbatim.” “Does the medical-discharge summary include the prescribed dose and the follow-up appointment.” “Does the SOC 2 audit summary list every Type II control deviation.”
Step 6: wire the eval suite into CI
The five-line minimum, against the SDK:
from fi.evals import Evaluator
from fi.evals.templates import (
Groundedness, Completeness, FactualAccuracy, IsConcise, Toxicity,
)
ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
eval_templates=[
Groundedness(), Completeness(), FactualAccuracy(), IsConcise(), Toxicity(),
],
inputs=[
{"input": doc, "output": summary, "context": doc}
for doc, summary in zip(source_docs, model_summaries)
],
)
The output is a BatchRunResult with per-input scores and reasons per rubric. Assert against per-dimension thresholds (groundedness above 0.85, completeness above 0.75, factuality above 0.90, conciseness above 0.80, toxicity below 0.05 are reasonable starting points to tune to your domain). Cache verdicts keyed on the eval tuple so a repeated CI run is free.
The custom rubric uses CustomLLMJudge:
from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge
from fi.evals.providers import LiteLLMProvider
judge = CustomLLMJudge(
provider=LiteLLMProvider(),
config={
"name": "captures_action_items",
"grading_criteria": (
"Score 1.0 if every action item in the source transcript appears "
"in the summary with the owner and the deadline. Score 0.0 if any "
"action item is missing the owner or the deadline."
),
},
)
CustomLLMJudge runs via LiteLLM so you can point it at any provider you already use. Drop it into a pytest case, a GitHub Actions step, or a scheduled job. The artifact engineers should review on a PR is the list of newly failing examples with rubric scores, reasons, and diffs against the trailing 7-day baseline. Not the aggregate score, which hides which dimension regressed.
For the deterministic length check, run a tokenizer-aware count against the spec in the same step. A summary that scores 0.92 on every LLM rubric and exceeds the 500-token budget is a product failure.
Step 7: close the loop with production observation
Offline eval gates regressions before deploy. Production eval catches drift, new failure modes, and the rare paths the golden set doesn’t cover. The same four rubrics run in both places.
The pattern that works:
- Sample production summarization traces uniformly or by signal (low rubric score, user thumbs-down, regenerated output). Score with the same rubrics used in CI.
- Attach scores to the OTel span so the eval result lives next to the trace. traceAI (Apache 2.0) makes the semantic convention (
FI,OTEL_GENAI,OPENINFERENCE,OPENLLMETRY) a registration-time choice, so traces flow into your existing OTel collector without re-instrumenting. Built-in evals wire to span attributes viaEvalTagfor zero added latency at trace-write time. - Alarm on rolling-mean drift. Per-route, per-rubric, per-prompt-version. A 2-5 point sustained drop over 15-60 minutes is the right detection threshold for most summarization products. A drift on factuality is a different incident from a drift on conciseness. Page on each separately.
- Cluster failures into named issues. Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering over ClickHouse-stored embeddings groups failing summaries; a Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur summariser for spans over 3000 chars, 90% prompt-cache hit ratio) writes the root cause, the evidence quotes, and an
immediate_fixper issue. - Promote representative failures into the golden set with rubric labels. The next PR that touches the summarization prompt has to clear the new entries.
The loop is what compounds. Without it, each incident produces a one-off fix and the team ships the same regression twice.
Common summarization eval mistakes
- ROUGE as the gate. Catches lexical drift, misses hallucination and omission and padding. Use it as a regression tile, never as the threshold.
- One aggregate score. “Summary quality 0.82” tells you nothing actionable. The four dimensions score independently; the gate is per-dimension.
- Reference summaries as the only ground truth. Reference summaries are one annotator’s opinion about what matters. Expected-fact lists are cheaper, more deterministic, and easier to maintain.
- No calibration step. An LLM judge with judge-human kappa below 0.6 is a coin flip with a paragraph of reasoning attached. Calibrate before you ship.
- Custom rubric is too vague. “Is this summary good” is not a rubric. “Does this summary list every action item with owner and deadline” is.
- Eval runs once a quarter. Summarization quality drifts with model updates, prompt edits, and source distribution shifts. Continuous CI plus weekly production sampling is the right cadence.
- No length budget. A summary that ignores the 500-token spec is a product failure even if every other dimension passes.
How Future AGI ships the summarization eval stack
A package, three layers.
- ai-evaluation SDK (Apache 2.0). Code-first.
from fi.evals import Evaluator, Protect. 60+EvalTemplateclasses includingGroundedness,Completeness,FactualAccuracy,ContextAdherence,IsConcise,SummaryQuality,IsGoodSummary,BleuScore,Toxicity,CustomLLMJudge. 13 guardrail backends (9 open-weight Llama Guard / Qwen3-Guard / Granite Guardian / WildGuard / ShieldGemma; 4 API). 8 sub-10ms local Scanners. Four distributed runners (Celery, Ray, Temporal, Kubernetes).RailType.INPUT/OUTPUT/RETRIEVALandAggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED. Multi-modalCustomLLMJudgevia LiteLLM. - Future AGI Platform. Self-improving evaluators that retune from thumbs up/down feedback on production summaries. An in-product authoring agent that turns a natural-language brief (“score this medical-discharge summary on dose-preservation and follow-up presence”) into a deployable rubric. Classifier-backed evals at lower per-eval cost than Galileo Luna-2: the economics that make weekly full-golden-set reruns the default, not a budget conversation.
- Error Feed (inside the eval stack). HDBSCAN soft-clustering over ClickHouse embeddings groups failing summaries into named issues. A Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur summariser, 90% prompt-cache hit ratio) writes the root cause and the
immediate_fixper issue. Four-dimensional trace scoring (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 scale). Linear ticketing today. Fixes feed back into the platform’s self-improving evaluators so the rubric ages with the product instead of against it.
traceAI (Apache 2.0) is the tracing layer: 50+ AI surfaces across Python, TypeScript, Java, and C#. Pluggable semantic conventions, 14 span kinds, 62 built-in evals via EvalTag. The hosted Agent Command Center is SOC 2 Type II, HIPAA, GDPR, and CCPA certified.
Ready to run the four-rubric suite against your own summaries? Install ai-evaluation, score your last 100 production summaries on Groundedness, Completeness, FactualAccuracy, and IsConcise this afternoon, then wire the same rubrics as EvalTag on live spans via traceAI tomorrow. The same four rubrics in CI and in production is what turns a summarization eval into a regression suite that holds for two years.
Related reading
Frequently asked questions
Why is ROUGE the wrong primary metric for modern summarization?
What are the four rubrics that actually decide summary quality?
How do you calibrate an LLM judge for summarization?
How should the eval differ for news versus meeting notes versus support tickets?
What's the right size for a summarization golden set?
How do you wire the eval suite into CI for summarization?
What does Future AGI ship for summarization evaluation?
Summarization eval is four judge prompts, not four concepts. Groundedness, completeness, factuality, conciseness — each as a hardened prompt with a calibration set. The 2026 deep dive.
Learn LLM evaluation from the inside out: the three primitives (deterministic, embedding, judge), offline vs online, and the starter workflow that holds up in production.
Contract review RAG in 2026: clause-level retrieval, citation enforcement, the eval suite in-house counsel will sign off, plus the LangGraph wiring to live OTel traces.