What Is Summary Quality?
An LLM-evaluation metric for whether a generated summary preserves essential source facts while staying concise, coherent, and free of unsupported claims.
Summary quality is an LLM-evaluation metric for whether a generated summary preserves the source’s important facts, omits irrelevant detail, stays coherent, and avoids unsupported claims. It shows up in eval pipelines, production traces, and agent workflows that compress meetings, tickets, documents, or tool histories before another step uses them. FutureAGI maps this work to SummaryQuality and IsGoodSummary, so teams can gate summaries before bad compression becomes a downstream decision error.
Why Summary Quality Matters in Production LLM and Agent Systems
Bad summaries fail through compression, not a crash. A meeting summarizer drops the decision owner. A support-ticket digest omits the refund deadline. An agent compresses a 30-step tool history and loses the failed API call that explains the next retry. The output is fluent, short, and wrong enough to move the system in the wrong direction.
Developers feel it when regression tests pass on relevance but users still correct the same missing facts. SREs see long traces followed by tiny final summaries, repeated tool calls, and high escalation rates after summarization steps. Product teams see summaries that users cannot act on because they lack deadlines, owners, citations, or caveats. Compliance reviewers see audit records that read cleanly but omit the policy language that made the decision lawful.
Agentic systems make summary quality more important than single-turn chat because summaries often become state. A planner summarizes retrieved documents before choosing a tool. A customer-support agent summarizes conversation history before handoff. A workflow engine summarizes evidence before a human review queue. If the summary loses a constraint, the next step may never see it. The failure mode is lossy state transfer: a locally acceptable summary becomes a global task failure.
How FutureAGI Handles Summary Quality
FutureAGI’s approach is to evaluate summaries against the source and the downstream job they serve, not only against a reference sentence. In the FutureAGI inventory, SummaryQuality is the evaluator class behind the summary_quality cloud eval template, and IsGoodSummary is the evaluator class behind the is_good_summary template. Use SummaryQuality as the named metric when tracking summary quality over a dataset, and IsGoodSummary as the named gate when a workflow must decide whether to accept a summary.
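For illustration, a minimal gating sketch built on that split. It assumes IsGoodSummary can be imported from fi.evals alongside Evaluator, mirroring the SummaryQuality import in the Minimal Python example below, and it assumes the batch result exposes one entry per eval with a pass/fail style output; both are assumptions to verify against the SDK version in use.

from fi.evals import Evaluator, IsGoodSummary  # import path assumed to mirror SummaryQuality
from fi.testcases import TestCase

def accept_summary(source: str, summary: str) -> bool:
    """Gate one summary before it feeds a tool call, handoff, or report."""
    case = TestCase(input=source, output=summary)
    result = Evaluator().evaluate(
        eval_templates=[IsGoodSummary()],
        inputs=[case],
        model_name="turing_flash",
    )
    # Assumed result schema: one entry per eval with a pass/fail style output.
    # Adapt this extraction to whatever the installed SDK actually returns.
    first = result.eval_results[0]
    return str(getattr(first, "output", "")).lower() in {"pass", "passed", "true", "yes"}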
A real workflow starts with a dataset row containing the source text, the generated summary, and the summary’s purpose: incident handoff, sales-call recap, medical-note compression, or agent memory compaction. The engineer attaches SummaryQuality through the evaluation stack, then samples production summaries logged by fi.client.Client.log. If the app is instrumented with traceAI-langchain, the final summary lives beside the source-producing spans, and agent workflows can correlate failures with agent.trajectory.step events.
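Sketched as a batch job, that workflow can be as small as the loop below. The row fields (source, summary, purpose) come straight from the paragraph above; the imports mirror the Minimal Python example further down, while the per-purpose grouping, example rows, and model name are illustrative assumptions rather than a fixed FutureAGI schema.

from collections import defaultdict

from fi.evals import Evaluator, SummaryQuality
from fi.testcases import TestCase

# Hypothetical dataset rows: source text, generated summary, and the job the summary serves.
rows = [
    {"purpose": "incident_handoff",
     "source": "Outage began 09:10, owner SRE, fix pending.",
     "summary": "The outage began at 09:10 and is fixed."},
    {"purpose": "agent_memory",
     "source": "Step 12 failed: billing API returned 503; retry scheduled for 10:00.",
     "summary": "All tool calls completed; continue with the plan."},
]

# Group cases by purpose so each summary type can carry its own loss budget.
cases_by_purpose = defaultdict(list)
for row in rows:
    cases_by_purpose[row["purpose"]].append(
        TestCase(input=row["source"], output=row["summary"])
    )

evaluator = Evaluator()
for purpose, cases in cases_by_purpose.items():
    result = evaluator.evaluate(
        eval_templates=[SummaryQuality()],
        inputs=cases,
        model_name="turing_flash",
    )
    print(purpose, result)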
When the score drops after a prompt or model change, the engineer reviews the failing cohort, adds missing required facts to the rubric, and reruns a regression eval before rollout. Unlike ROUGE, which mostly rewards word overlap with a reference, FutureAGI checks whether the summary carries the facts that the next workflow step actually needs.
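One way to wire that regression check in plain Python: collect per-row SummaryQuality scores for the current prompt and for the candidate change (however your result objects expose them), then block rollout when the mean score drops beyond a tolerance. The tolerance and example numbers are illustrative.

from statistics import mean

def regression_gate(baseline_scores: list[float],
                    candidate_scores: list[float],
                    tolerance: float = 0.02) -> bool:
    """Allow rollout only if the candidate's mean score stays within `tolerance` of the baseline."""
    drop = mean(baseline_scores) - mean(candidate_scores)
    return drop <= tolerance

# Example: a roughly 0.08 mean drop on the same eval dataset blocks the change.
print(regression_gate([0.82, 0.78, 0.91], [0.74, 0.70, 0.83]))  # False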
How to Measure or Detect Summary Quality
Measure summary quality with paired source-summary data, not with the summary alone:
- SummaryQuality - FutureAGI cloud eval template summary_quality; use it as the main score for factual coverage, coherence, and useful compression.
- IsGoodSummary - FutureAGI cloud eval template is_good_summary; use it as a release gate for summaries feeding tools, handoffs, or reports.
- Coverage delta - count required facts in the source and check how many appear in the summary (a minimal sketch follows this list).
- Trace signal - inspect agent.trajectory.step and final llm.output spans when a downstream action contradicts the source.
- Dashboard signal - track eval-fail-rate-by-cohort, summary-quality regressions by prompt version, and escalation rate after handoff summaries.
- Human feedback proxy - compare reviewer edits, thumbs-down rate, and “missing detail” labels against evaluator failures.
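A minimal sketch of the coverage-delta check from the list above: keep a short list of facts the summary must retain and count how many survive. Substring matching and the example fact list are deliberate simplifications; paraphrased facts usually need an entity extractor or an LLM check.

def coverage_delta(summary: str, required_facts: list[str]) -> float:
    """Fraction of required facts that survive compression (0.0 to 1.0)."""
    if not required_facts:
        return 1.0
    text = summary.lower()
    present = sum(1 for fact in required_facts if fact.lower() in text)
    return present / len(required_facts)

# Facts mirror the incident example in the Minimal Python block below.
facts = ["09:10", "owner SRE", "fix pending"]
print(coverage_delta("The outage began at 09:10 and is fixed.", facts))  # ~0.33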
Minimal Python:
from fi.evals import Evaluator, SummaryQuality
from fi.testcases import TestCase
# input carries the source text; output carries the summary under test.
case = TestCase(
    input="Source: outage began 09:10, owner SRE, fix pending.",
    output="The outage began at 09:10 and is fixed.",
)

result = Evaluator().evaluate(
    eval_templates=[SummaryQuality()],
    inputs=[case],
    model_name="turing_flash",
)
print(result)
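In this example the summary drops the owner and turns “fix pending” into “is fixed”, exactly the kind of loss a source-aware evaluator is meant to flag. How the score and reasoning appear on the printed result depends on the installed SDK version, so inspect the result object rather than assuming a fixed schema.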
Common Mistakes
Most summary-eval failures come from treating summaries as shorter answers. A summary is a lossy transformation; the question is whether it lost the right information. Tie the metric to the consumer of the summary.
- Scoring summaries with ROUGE alone. ROUGE rewards lexical overlap; an abstractive summary can be correct while using different wording, or wrong while sharing many words (a toy illustration follows this list).
- Evaluating without the source. Fluency checks cannot detect omitted owners, dates, caveats, citations, or unsupported claims.
- Using one threshold for every summary type. Ticket digests, medical notes, and agent memory compression carry different loss budgets.
- Ignoring downstream use. A summary for search indexing can omit details that a planning agent must retain.
- Letting the summarizer judge itself. Use independent evaluators or human annotation samples for calibration on risky cohorts.
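To make the first mistake concrete, here is a toy unigram-recall score, a hand-rolled stand-in for ROUGE-1 recall so no external library is assumed: a summary that copies the source’s words but flips the status outscores a correct paraphrase.

def unigram_recall(reference: str, candidate: str) -> float:
    """Toy ROUGE-1-style recall: fraction of reference words found in the candidate."""
    ref_words = reference.lower().split()
    cand_words = set(candidate.lower().split())
    return sum(1 for w in ref_words if w in cand_words) / len(ref_words)

reference = "outage began 09:10 owner sre fix pending"
wrong_but_overlapping = "outage began 09:10 fix done owner unassigned"
right_but_paraphrased = "the incident started at 09:10 and the sre team owns remediation which is not finished"

print(unigram_recall(reference, wrong_but_overlapping))  # ~0.71, despite the wrong status and owner
print(unigram_recall(reference, right_but_paraphrased))  # ~0.29, despite being factually faithful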
Frequently Asked Questions
What is summary quality?
Summary quality is an LLM-evaluation metric that checks whether a generated summary preserves essential source facts, removes noise, stays coherent, and avoids unsupported claims.
How is summary quality different from ROUGE score?
ROUGE score measures lexical overlap with a reference. Summary quality asks whether the summary is faithful, complete, concise, and useful for the downstream task.
How do you measure summary quality?
In FutureAGI, use SummaryQuality for the named summary-quality evaluation and IsGoodSummary for accept/reject gates. Track failures beside final llm.output and agent.trajectory.step spans.