Guides

Evaluating LLM Summarization: A Step-by-Step Guide (2026)

Summarization eval is four rubrics, not one number: groundedness, completeness, factuality, conciseness, calibrated against humans in CI.

March 26, 2026

13 min read

llm-evaluation summarization ai-evaluation rag 2026

Table of Contents

Most summarization eval suites stop at ROUGE and call it shipping-ready. ROUGE catches some lexical drift. It catches nothing else. A modern abstractive model can score 0.45 ROUGE-L against the reference, read fluently, and quietly invent a fact that was never in the source document. The user sees the invented fact. The dashboard sees a passing number. The eval was the bug.

The thesis

Summarization eval is four rubrics, not one metric. ROUGE was designed in 2004 for extractive systems that copied spans from the source. Modern abstractive systems paraphrase, restructure, and synthesize, and they break every assumption ROUGE was calibrated against. What you actually care about reduces to four independent questions. Did the model only assert facts present in the source. Did it cover the important ones. Did it state them accurately. Did it stay inside the length and structure spec. Score those four with calibrated LLM-as-judge rubrics, run them in CI on a golden set sampled from production, and let lexical-overlap metrics ride along as a cheap regression signal. Never as the gate.

The rest of this guide is the working pattern: the four dimensions, the calibration step against human raters, the prompts and the cascade, the variant playbooks for news / meeting notes / ticket summaries / document QA, the CI wiring, and the production loop.

TL;DR: the playbook

Step	What you do	Why it matters
1. Frame	Score four rubrics independently, not one aggregate	ROUGE conflates groundedness, completeness, conciseness
2. Golden set	Sample 100-200 source-summary pairs from production, stratify by type and length	Invented test sets reflect author assumptions, not user behavior
3. Calibrate	Human-label 50 examples per dimension, track Cohen’s kappa with the judge	Uncalibrated judges produce confident garbage
4. Cascade	Deterministic → NLI → LLM-as-judge → frontier adjudication	Don’t run a $0.04 judge on a check a regex answers
5. Specialize	Per-domain weights and a custom rubric tied to the product brief	News, meetings, tickets, and document QA fail differently
6. Ship the gate	Per-dimension thresholds on a 7-day trailing baseline	A frozen baseline drifts past you in a quarter
7. Close the loop	Production failures auto-cluster, promote into the golden set	Static datasets stop being regression suites

Steps 1 through 4 are the eval. Steps 5 through 7 are how the eval stays honest over months.

Step 1: drop ROUGE as the primary metric

The argument against ROUGE isn’t that it’s wrong, it’s that it’s the wrong shape. ROUGE-L gives one score for a phenomenon that has three independent failure modes:

Hallucination. The summary asserts a fact not in the source. ROUGE doesn’t model the source; it only compares to a reference. A confidently invented sentence that happens to share n-grams with the reference scores high.
Omission. The summary leaves out the headline fact. If the reference also leaves it out, ROUGE never notices. If the reference includes it and the summary paraphrases around it, ROUGE penalizes the paraphrase, not the omission.
Padding. The summary is twice the spec length but contains the right tokens. ROUGE-L rewards the overlap and ignores the length violation.

The G-Eval paper (Liu et al. 2023, arXiv:2303.16634) measured this directly against SummEval. GPT-4 scored as a four-dimensional rubric (coherence, consistency, fluency, relevance) hit a Spearman correlation of 0.514 with human raters on summarization. ROUGE-L hit 0.128 on the same dimension set. BERTScore hit 0.190. The lexical-overlap metrics were not just worse on average; they ranked summary pairs the opposite of how humans ranked them on the dimensions humans cared about.

Use ROUGE-L and BERTScore as one column on the dashboard for backward compatibility and cheap regression signal. Never as the gate. For a deeper breakdown of BLEU, ROUGE, and BERTScore and where each one fits, see the dedicated guide. The four rubrics below are the gate.

Step 2: the four rubrics that decide summary quality

Each rubric scores an independent question. Each is small enough that a judge can apply it consistently. Together they cover what a human reader actually evaluates when they say “this is a good summary.”

Groundedness. Every claim in the summary is supported by content present in the source document. Score per-claim and average, or score the summary holistically on a 1-5 scale. The Groundedness template in the ai-evaluation SDK (eval_id=47) does this directly against the context field; the local NLI variant runs a DeBERTa-NLI head per claim with no API call.

Completeness. The important facts in the source are present in the summary. “Important” is the load-bearing word. Define it per domain. For news, the lede plus the supporting facts. For meeting notes, every decision and every action item with owner and deadline. For ticket summaries, the issue, the customer’s last action, and the next step. Completeness (eval_id=10) scores this; for a deterministic backstop, annotate an expected-fact list per golden-set example and compute fact-recall.

Factuality. Statements in the summary that are checkable against the source are true. Distinct from groundedness because a claim can be present in spirit but wrong in detail (a date off by a year, a name swapped between people). FactualAccuracy (eval_id=66) catches this. For RAG-grounded summarization, ContextAdherence (eval_id=5) adds the retrieval-step check.

Conciseness. The summary respects the length spec and contains no padding or repetition. Score with IsConcise (eval_id=83) for the qualitative judgment, and pair it with a deterministic length check (token count, sentence count, paragraph count) bound as a hard rubric. A 500-token spec that returns 2000 tokens fails the gate regardless of the other three rubrics.

SummaryQuality (eval_id=64) and IsGoodSummary (eval_id=94) are holistic aggregators that fold all four into a single score. Useful as dashboard tiles. Not useful as the gate, because they hide which dimension regressed.

Step 3: calibrate the judge before you trust it

An uncalibrated LLM judge is a confident random number generator. The fix is the same protocol the G-Eval paper used: score against human-labeled summaries and track agreement.

Run the calibration loop once when you set the rubric up, then quarterly:

Sample 50 source-summary pairs from production. Skew toward edge cases (long sources, mixed-language, contested facts) where rubric disagreement is most informative.
Two annotators score each pair on all four dimensions. Use a 1-5 Likert per dimension, with one-sentence reasoning. Resolve disagreements by discussion or a tiebreaker annotator.
Run the judge prompt against the same set. Same prompt, same judge model, same temperature you’ll use in production.
Compute Cohen’s kappa per dimension between judge scores and human consensus. Anything below 0.6 means the rubric prompt needs sharpening, the judge model is wrong for the dimension, or both. The G-Eval paper hit Spearman 0.514 on SummEval with GPT-4; a 2026 production-grade summarization judge should hit 0.6 or better on the dimension your product cares about most.
Pin the judge model and the rubric prompt as one contract. Swapping the judge model is an eval migration, not a config change. Track judge-human kappa drift over time. When it slides, the source distribution or the model has shifted.

Calibrated judges produce scores you can act on. Uncalibrated judges produce scores you can argue about. The thirty minutes of human annotation per quarter is the cheapest reliability investment in the eval stack.

Step 4: build the cascade, don’t pay for the frontier on every span

LLM-as-judge is the right tool for groundedness, completeness, factuality, and conciseness. It is the wrong tool for everything below it. The pattern that works:

Deterministic first. Token count, sentence count, JSON schema for structured summaries, regex for refusal phrasings, expected-field presence. Microsecond cost, zero API spend, runs on every trace.
Embedding or NLI second. BERTScore against a gold answer if you have one. A DeBERTa-NLI head for claim-by-claim entailment against the source. Milliseconds, no LLM call.
Small classifier third. Toxicity (eval_id=15), PII, prompt injection. Sub-100ms, no judge call.
LLM-as-judge fourth. The four rubrics above. Pin the model. Cache verdicts keyed on (judge_model_id, rubric_version, source_hash, summary_hash).
Frontier adjudication last. When the small classifier and the standard judge disagree, escalate to a stronger judge or a human label. The augment=True flag in the ai-evaluation SDK encodes this cascade directly.

The cost discipline matters. Running a $0.04 frontier judge on every summary across millions of spans per day burns the eval budget before it scores anything useful. Reserve the expensive primitive for the cases the cheap primitives can’t decide.

Step 5: specialize for the document type

The four-rubric framework holds across domains. The weights and the custom rubric change.

News summarization. Factuality and groundedness carry the weight. A misattributed quote or a wrong company name lands the publisher in a correction. Add a named-entity-grounding check (every entity in the summary must appear in the source) as a hard rubric. Set the factuality floor at 0.9. Conciseness matters but rarely fails because newsroom briefs hew to the spec.

Meeting notes. Completeness on action items is the load-bearing dimension. Score with a custom rubric: “Every action item in the source transcript appears in the summary with the owner and the deadline.” The action-item rubric is binary per item; the summary either has all of them or it doesn’t. Groundedness still matters (no invented action items). Conciseness is secondary. A meeting note that misses the assignment is a workflow failure even if it’s the right length.

Support-ticket summaries. Conciseness and structured-field extraction dominate. The downstream consumer is usually a routing system or a tier-2 agent, not a human reader. Score the summary as a structured object (issue, customer-action, next-step, priority) and treat each field as its own deterministic rubric. The qualitative groundedness and factuality rubrics ride along.

Document QA summaries. This is RAG-grounded summarization. The summary is a synthesis over retrieved chunks, not a single source. ContextAdherence joins the rubric set. Score citation validity (every cited span exists in the retrieval context) as a hard deterministic rubric. The completeness rubric splits into two: did retrieval surface the right chunks (ContextRelevance), and did the summary use the right ones (ChunkUtilization).

The custom rubric is where the product signal lives. Use CustomLLMJudge with a free-text grading_criteria field and a few-shot example or two. Examples that worked in production: “Does the legal-clause summary preserve the indemnity cap language verbatim.” “Does the medical-discharge summary include the prescribed dose and the follow-up appointment.” “Does the SOC 2 audit summary list every Type II control deviation.”

Step 6: wire the eval suite into CI

The five-line minimum, against the SDK:

from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, Completeness, FactualAccuracy, IsConcise, Toxicity,
)

ev = Evaluator(fi_api_key="...", fi_secret_key="...")
result = ev.evaluate(
    eval_templates=[
        Groundedness(), Completeness(), FactualAccuracy(), IsConcise(), Toxicity(),
    ],
    inputs=[
        {"input": doc, "output": summary, "context": doc}
        for doc, summary in zip(source_docs, model_summaries)
    ],
)

The output is a BatchRunResult with per-input scores and reasons per rubric. Assert against per-dimension thresholds (groundedness above 0.85, completeness above 0.75, factuality above 0.90, conciseness above 0.80, toxicity below 0.05 are reasonable starting points to tune to your domain). Cache verdicts keyed on the eval tuple so a repeated CI run is free.

The custom rubric uses CustomLLMJudge:

from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge
from fi.evals.providers import LiteLLMProvider

judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "captures_action_items",
        "grading_criteria": (
            "Score 1.0 if every action item in the source transcript appears "
            "in the summary with the owner and the deadline. Score 0.0 if any "
            "action item is missing the owner or the deadline."
        ),
    },
)

CustomLLMJudge runs via LiteLLM so you can point it at any provider you already use. Drop it into a pytest case, a GitHub Actions step, or a scheduled job. The artifact engineers should review on a PR is the list of newly failing examples with rubric scores, reasons, and diffs against the trailing 7-day baseline. Not the aggregate score, which hides which dimension regressed.

For the deterministic length check, run a tokenizer-aware count against the spec in the same step. A summary that scores 0.92 on every LLM rubric and exceeds the 500-token budget is a product failure.

Step 7: close the loop with production observation

Offline eval gates regressions before deploy. Production eval catches drift, new failure modes, and the rare paths the golden set doesn’t cover. The same four rubrics run in both places.

The pattern that works:

Sample production summarization traces uniformly or by signal (low rubric score, user thumbs-down, regenerated output). Score with the same rubrics used in CI.
Attach scores to the OTel span so the eval result lives next to the trace. traceAI (Apache 2.0) makes the semantic convention (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) a registration-time choice, so traces flow into your existing OTel collector without re-instrumenting. Built-in evals wire to span attributes via EvalTag for zero added latency at trace-write time.
Alarm on rolling-mean drift. Per-route, per-rubric, per-prompt-version. A 2-5 point sustained drop over 15-60 minutes is the right detection threshold for most summarization products. A drift on factuality is a different incident from a drift on conciseness. Page on each separately.
Cluster failures into named issues. Error Feed sits inside Future AGI’s eval stack. HDBSCAN soft-clustering over ClickHouse-stored embeddings groups failing summaries; a Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur summariser for spans over 3000 chars, 90% prompt-cache hit ratio) writes the root cause, the evidence quotes, and an immediate_fix per issue.
Promote representative failures into the golden set with rubric labels. The next PR that touches the summarization prompt has to clear the new entries.

The loop is what compounds. Without it, each incident produces a one-off fix and the team ships the same regression twice.

Common summarization eval mistakes

ROUGE as the gate. Catches lexical drift, misses hallucination and omission and padding. Use it as a regression tile, never as the threshold.
One aggregate score. “Summary quality 0.82” tells you nothing actionable. The four dimensions score independently; the gate is per-dimension.
Reference summaries as the only ground truth. Reference summaries are one annotator’s opinion about what matters. Expected-fact lists are cheaper, more deterministic, and easier to maintain.
No calibration step. An LLM judge with judge-human kappa below 0.6 is a coin flip with a paragraph of reasoning attached. Calibrate before you ship.
Custom rubric is too vague. “Is this summary good” is not a rubric. “Does this summary list every action item with owner and deadline” is.
Eval runs once a quarter. Summarization quality drifts with model updates, prompt edits, and source distribution shifts. Continuous CI plus weekly production sampling is the right cadence.
No length budget. A summary that ignores the 500-token spec is a product failure even if every other dimension passes.

How Future AGI ships the summarization eval stack

A package, three layers.

ai-evaluation SDK (Apache 2.0). Code-first. from fi.evals import Evaluator, Protect. 60+ EvalTemplate classes including Groundedness, Completeness, FactualAccuracy, ContextAdherence, IsConcise, SummaryQuality, IsGoodSummary, BleuScore, Toxicity, CustomLLMJudge. 13 guardrail backends (9 open-weight Llama Guard / Qwen3-Guard / Granite Guardian / WildGuard / ShieldGemma; 4 API). 8 sub-10ms local Scanners. Four distributed runners (Celery, Ray, Temporal, Kubernetes). RailType.INPUT/OUTPUT/RETRIEVAL and AggregationStrategy.ANY/ALL/MAJORITY/WEIGHTED. Multi-modal CustomLLMJudge via LiteLLM.
Future AGI Platform. Self-improving evaluators that retune from thumbs up/down feedback on production summaries. An in-product authoring agent that turns a natural-language brief (“score this medical-discharge summary on dose-preservation and follow-up presence”) into a deployable rubric. Classifier-backed evals at lower per-eval cost than Galileo Luna-2: the economics that make weekly full-golden-set reruns the default, not a budget conversation.
Error Feed (inside the eval stack). HDBSCAN soft-clustering over ClickHouse embeddings groups failing summaries into named issues. A Claude Sonnet 4.5 Judge agent (30-turn budget, 8 span-tools, Haiku Chauffeur summariser, 90% prompt-cache hit ratio) writes the root cause and the immediate_fix per issue. Four-dimensional trace scoring (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1-5 scale). Linear ticketing today. Fixes feed back into the platform’s self-improving evaluators so the rubric ages with the product instead of against it.

traceAI (Apache 2.0) is the tracing layer: 50+ AI surfaces across Python, TypeScript, Java, and C#. Pluggable semantic conventions, 14 span kinds, 62 built-in evals via EvalTag. The hosted Agent Command Center is SOC 2 Type II, HIPAA, GDPR, and CCPA certified.

Ready to run the four-rubric suite against your own summaries? Install ai-evaluation, score your last 100 production summaries on Groundedness, Completeness, FactualAccuracy, and IsConcise this afternoon, then wire the same rubrics as EvalTag on live spans via traceAI tomorrow. The same four rubrics in CI and in production is what turns a summarization eval into a regression suite that holds for two years.

Frequently asked questions

Why is ROUGE the wrong primary metric for modern summarization?

ROUGE measures n-gram overlap against a reference summary. That made sense for extractive systems in 2004, where the model copied spans and the reference was a known good copy. Modern abstractive models paraphrase the source, so a faithful summary can share zero n-grams with the reference and a hallucinated one can share most of them. The deeper problem is that ROUGE collapses three independent questions into one number. Is the summary grounded in the source? Did it cover the important facts? Is it free of invented claims? You need three rubrics scored independently, plus a fourth for conciseness, before the score correlates with what a human reader would call a good summary. ROUGE remains a cheap regression signal, but only after you score the four real dimensions first.

What are the four rubrics that actually decide summary quality?

Groundedness (every claim traceable to the source), completeness (the important things in the source are present), factuality (statements that are checkable against the source are true), and conciseness (no padding, no repetition, length within the spec). The four are independent. A summary can score 1.0 on groundedness by simply restating the input verbatim, then fail completeness by missing the headline fact and fail conciseness by exceeding the token budget. Scoring them as one aggregate hides the failure mode. Score each independently, set a per-dimension gate, and treat the average only as a dashboard tile.

How do you calibrate an LLM judge for summarization?

Pull fifty summaries from production, hand-label them on all four dimensions with two annotators, resolve disagreements, then run the same judge prompt against the same set. Track Cohen's kappa between judge and human per dimension. Anything under 0.6 means the rubric prompt needs sharpening or the judge model is the wrong fit. Re-run quarterly. The G-Eval paper hit a Spearman correlation of 0.514 with human raters on SummEval using GPT-4. A production-grade summarization judge in 2026 should hit 0.6 or better on the dimension your product cares about. Calibration drift is itself a signal: when the kappa drops month over month, the rubric or the source distribution has changed.

How should the eval differ for news versus meeting notes versus support tickets?

The four-dimension framework holds across all of them; the weights and the custom rubric change. News summarization weights factuality and groundedness highest because a wrong attribution lands the publisher in a correction. Meeting notes weight completeness on action items, owners, and deadlines because missing an owner breaks the workflow. Support-ticket summaries weight conciseness and structured-field extraction because the downstream consumer is a routing system, not a human reader. Document QA summaries fold in retrieval context, so context adherence joins the rubric set. Pick the dimension that maps to your product failure, set a strict gate on it, and treat the other three as guardrails.

What's the right size for a summarization golden set?

Start at 100 to 200 source-summary pairs sampled from production traffic. Stratify by document length and document type because failure modes cluster by both. Skew the sample toward the hardest 10 percent (long sources, mixed-language content, contested facts, technical jargon) where most of the eval signal lives. Grow the set weekly by promoting failing production traces with rubric labels. Beyond about 500 examples per route, judge cost dominates and sampling beats more data.

How do you wire the eval suite into CI for summarization?

Three layers. Deterministic checks first (token count against the length spec, structured-field presence for ticket summaries, schema validation for JSON summaries). Cheap NLI or classifier checks next (DeBERTa-based faithfulness scoring against the source, BLEU or BERTScore as a backward-compat regression signal). LLM-as-judge last, on the four rubrics, run with a pinned judge model and a versioned rubric prompt. Fail the PR if any dimension drops more than two points from the trailing 7-day baseline or falls below an agreed floor. Cache verdicts keyed on the eval tuple so re-runs are free.

What does Future AGI ship for summarization evaluation?

The ai-evaluation SDK (Apache 2.0) ships the full rubric set in one import. Groundedness (eval_id 47), Completeness (10), FactualAccuracy (66), ContextAdherence (5), IsConcise (83), SummaryQuality (64), IsGoodSummary (94), BleuScore (101), Toxicity (15), and CustomLLMJudge for the product-specific rubric. Sixty-plus EvalTemplate classes total, four distributed runners, multi-modal CustomLLMJudge via LiteLLM. Graduate to the Future AGI Platform when you want self-improving evaluators that retune from production thumbs feedback, and Error Feed when you want failing summaries auto-clustered into named issues with an immediate fix per issue.

View all

Guides

LLM Summarization Evaluation: A 2026 Architectural Deep Dive

Summarization eval is four judge prompts: groundedness, completeness, factuality, conciseness. Each a hardened prompt with a calibration set. 2026 guide.

Nikhil Pareek · Apr 27, 2026

12 min

Guides

A Gentle Introduction to LLM Evaluation (2026)

Learn LLM evaluation from the inside out: the three primitives (deterministic, embedding, judge), offline vs online, the starter workflow for production.

NVJK Kartik · Mar 25, 2026

12 min

Guides

Evaluating AWS Bedrock Agents in 2026

Bedrock's built-in eval is dev-loop only. Score action-group correctness, KB retrieval quality, and guardrail precision/recall on every release.

Rishav Hada · May 19, 2026

11 min

The thesis

TL;DR: the playbook

Step 1: drop ROUGE as the primary metric

Step 2: the four rubrics that decide summary quality

Step 3: calibrate the judge before you trust it

Step 4: build the cascade, don’t pay for the frontier on every span

Step 5: specialize for the document type

Step 6: wire the eval suite into CI

Step 7: close the loop with production observation

Common summarization eval mistakes

How Future AGI ships the summarization eval stack

Related reading

Frequently asked questions