RAG Summarization 2026: Patterns, Code + Long-Context Tradeoffs
RAG summarization in 2026: stuff, map-reduce, refine, RAPTOR, GraphRAG. Long-context vs RAG decision matrix with thresholds plus faithfulness eval code.
TL;DR: RAG summarization in 2026
| Workload | Best pattern | When it wins | Eval focus |
|---|---|---|---|
| Single doc under 1M tokens, single summary | Long-context call (Claude Opus 4.7 / Gemini 2.5 Pro / GPT-5) | Document fits the window, freshness not needed | Faithfulness, completeness |
| Single doc, fits one chunk window (~16K tokens) | Stuff | Cheapest, lowest latency | Faithfulness |
| Single doc over the LLM context window | Map-reduce | Parallelizable, independent chunks | Faithfulness, dedup |
| Narrative, contract, time-series | Refine | Preserves chronological order | Coherence, citation gap |
| Multi-level summaries, multi-step reasoning | Hierarchical (RAPTOR-style) | Different abstraction levels matter | Multi-level faithfulness |
| Multi-doc, global “what are the themes” | GraphRAG community summaries | Cross-document synthesis | Comprehensiveness, diversity |
| Cost-sensitive hybrid | Self-Route (RAG-or-LC router) | Per-query routing cuts 35 to 60 percent of tokens | Faithfulness, router accuracy |
If you only read one row: pick stuff under the window, map-reduce above it, RAPTOR when multi-level matters, and GraphRAG when the question is global. Evaluate every output for faithfulness and completeness before you ship.
What changed since 2024
Four shifts moved RAG summarization from a single “retrieve, summarize” call to a proper pattern catalog.
First, context windows got long. Claude Opus 4.7 ships a 1M-token context window in beta at the same per-token price as Opus 4.6. Gemini 2.5 Pro has shipped at 1M tokens for over a year. GPT-5 sits at 400K. For many single-document summaries the right answer is now “send the whole document,” not “retrieve.”
Second, hybrid routing won. Li et al. 2024 (Retrieval Augmented Generation or Long-Context LLMs?, Google DeepMind) showed that RAG and long-context return identical predictions for over 60 percent of queries. Their Self-Route hybrid router cuts token cost by 35 to 60 percent at comparable quality. The 2026 default for cost-sensitive stacks is a router, not a single pattern.
Third, recursive and graph patterns matured. RAPTOR (Sarthi et al. 2024) showed that a tree of recursive summaries lifts the QuALITY benchmark by 20 percentage points absolute with GPT-4. GraphRAG (Edge et al. 2024) from Microsoft Research showed that community summaries over an entity graph beat naive RAG on global questions with a 70 to 80 percent win rate.
Fourth, eval got cheap and continuous. Summarization eval (faithfulness, completeness, coherence, citation gap) became prebuilt and fast enough to run on every commit and on every production trace.
Anatomy of a 2026 RAG summarization pipeline
A modern RAG summarization stack has five layers. Most production stacks ship all five.
- Ingestion and chunking: parse, normalize, semantic chunk with 512 to 1024 tokens per chunk and 50 to 100 token overlap. Preserve section boundaries.
- Indexing: BM25 (lexical), dense embeddings (semantic), optional graph (GraphRAG) or tree (RAPTOR) on top.
- Retrieval: hybrid BM25 plus dense with Reciprocal Rank Fusion, optional cross-encoder reranker on top-k.
- Summarization: stuff, map-reduce, refine, hierarchical, or community-summary pattern depending on workload.
- Evaluation and guardrails: faithfulness, completeness, summary_quality, hallucination detection at the boundary; production traces flow back to the eval golden set.
The shape of the corpus determines which summarization pattern wins. The eval contract is the same across all of them.
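Of these layers, the retrieval fusion step is the easiest to show concretely. A minimal Reciprocal Rank Fusion sketch, assuming each retriever returns a ranked list of chunk IDs (the k=60 constant follows the original RRF paper; tune it for your stack):

def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60, top_n: int = 10
) -> list[str]:
    """Fuse ranked chunk-ID lists from the BM25 and dense retrievers."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# fused_ids = reciprocal_rank_fusion([bm25_ranked_ids, dense_ranked_ids])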
Five RAG summarization patterns with code
All Python below was parsed on 2026-05-14 against the public SDK signatures (one block is a shell snippet, marked as bash). Replace the LLM and retriever wrappers with your own; the structural shape is what matters.
1. Stuff: one LLM call, one summary
Best for: single document or small corpus that fits in the LLM context window. Cheapest. Lowest latency. The default for under 16K tokens.
# tested 2026-05-14
from langchain.chains.combine_documents.stuff import create_stuff_documents_chain
from langchain.prompts import ChatPromptTemplate
from langchain_anthropic import ChatAnthropic
prompt = ChatPromptTemplate.from_messages([
("system", "Summarize the documents below in 200 words."),
("human", "{context}"),
])
llm = ChatAnthropic(model="claude-opus-4-7", max_tokens=2048)
chain = create_stuff_documents_chain(llm, prompt)
# `docs` is a list of langchain_core.documents.Document
summary = chain.invoke({"context": docs})
print(summary)
When stuff fails: the retrieved chunks exceed the context window or the cost per call becomes the bottleneck. Move to map-reduce.
2. Map-reduce: parallel chunk summarization
Best for: long documents or many documents that do not fit a single context window. Each chunk is summarized independently in the map step, then the partial summaries are combined in the reduce step. The pattern is parallelizable.
# tested 2026-05-14
from concurrent.futures import ThreadPoolExecutor
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-5", max_tokens=1024)
map_prompt = PromptTemplate.from_template(
"Summarize the passage below in 5 bullet points.\n\n{chunk}"
)
reduce_prompt = PromptTemplate.from_template(
"Combine these partial summaries into one cohesive 250-word summary."
" Preserve all unique facts.\n\n{partials}"
)
def map_step(chunk: str) -> str:
return llm.invoke(map_prompt.format(chunk=chunk)).content
def reduce_step(partials: list[str]) -> str:
joined = "\n\n".join(partials)
return llm.invoke(reduce_prompt.format(partials=joined)).content
def map_reduce_summarize(chunks: list[str]) -> str:
with ThreadPoolExecutor(max_workers=8) as pool:
partials = list(pool.map(map_step, chunks))
return reduce_step(partials)
When map-reduce fails: information is lost across the reduce boundary or the reduce step repeats facts. Switch to refine for narratives, or to hierarchical when the corpus is large enough to justify a tree.
3. Refine: sequential update
Best for: narratives, contracts, time-series documents where order matters. The refine pattern updates a running summary one chunk at a time, so chronology and dependency are preserved.
# tested 2026-05-14
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-5", max_tokens=1024)
initial_prompt = PromptTemplate.from_template(
"Write a 200-word summary of the passage below.\n\n{chunk}"
)
refine_prompt = PromptTemplate.from_template(
"Existing summary:\n{summary}\n\nNew passage:\n{chunk}\n\n"
"Refine the summary to incorporate the new passage. Keep order."
" Do not add facts not present in either input."
)
def refine_summarize(chunks: list[str]) -> str:
summary = llm.invoke(initial_prompt.format(chunk=chunks[0])).content
for chunk in chunks[1:]:
summary = llm.invoke(
refine_prompt.format(summary=summary, chunk=chunk)
).content
return summary
When refine fails: latency grows linearly with chunk count, and drift accumulates as the running summary forgets early content. Cap with a token budget or move to hierarchical.
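One way to implement the token-budget cap is to trim the running summary before each refine call. A sketch assuming tiktoken for counting; the 3,000-token budget is illustrative, not a recommendation:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def cap_tokens(text: str, budget: int = 3000) -> str:
    """Trim the running summary to a fixed token budget before the next refine call."""
    tokens = enc.encode(text)
    return text if len(tokens) <= budget else enc.decode(tokens[:budget])

Call cap_tokens(summary) before formatting refine_prompt in the loop above; if truncation loses too much early content, re-summarize the running summary instead of cutting it.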
4. Hierarchical (RAPTOR-style): tree of recursive summaries
Best for: workloads that need summaries at multiple abstraction levels, or multi-step reasoning over long documents. RAPTOR (Sarthi et al. 2024) clusters chunks by embedding, summarizes each cluster, then recurses up to a single root summary. At retrieval time you can pull leaves (detailed evidence), intermediate nodes (section-level), or the root (executive summary).
# tested 2026-05-14
from dataclasses import dataclass, field
from sklearn.cluster import KMeans
import numpy as np
@dataclass
class TreeNode:
text: str
embedding: np.ndarray
children: list["TreeNode"] = field(default_factory=list)
def build_raptor_tree(
chunks: list[str],
embed_fn,
summarize_fn,
branching_factor: int = 5,
max_depth: int = 4,
) -> TreeNode:
"""Build a tree of recursive summaries, bottom-up."""
nodes = [
TreeNode(text=chunk, embedding=embed_fn(chunk)) for chunk in chunks
]
for _ in range(max_depth):
if len(nodes) <= 1:
break
n_clusters = max(1, len(nodes) // branching_factor)
embeddings = np.stack([n.embedding for n in nodes])
labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
embeddings
)
parents = []
for cluster_id in range(n_clusters):
members = [nodes[i] for i, lbl in enumerate(labels) if lbl == cluster_id]
joined = "\n\n".join(m.text for m in members)
parent_text = summarize_fn(joined)
parents.append(
TreeNode(
text=parent_text,
embedding=embed_fn(parent_text),
children=members,
)
)
nodes = parents
if len(nodes) == 1:
return nodes[0]
root_text = summarize_fn("\n\n".join(n.text for n in nodes))
return TreeNode(
text=root_text,
embedding=embed_fn(root_text),
children=nodes,
)
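The build step above covers indexing only. RAPTOR's collapsed-tree retrieval (pool every node, leaf or summary, and rank by similarity to the query) is a few lines on top of the TreeNode structure; this is a sketch, not the reference implementation:

import numpy as np

def collect_nodes(root: TreeNode) -> list["TreeNode"]:
    """Flatten the tree into a single pool of leaf and summary nodes."""
    stack, pool = [root], []
    while stack:
        node = stack.pop()
        pool.append(node)
        stack.extend(node.children)
    return pool

def collapsed_tree_retrieve(
    root: TreeNode, query_embedding: np.ndarray, top_k: int = 5
) -> list[str]:
    """Rank every node by cosine similarity to the query embedding."""
    nodes = collect_nodes(root)
    embeddings = np.stack([n.embedding for n in nodes])
    sims = embeddings @ query_embedding / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding) + 1e-9
    )
    best = np.argsort(sims)[::-1][:top_k]
    return [nodes[i].text for i in best]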
The reference implementation lives at the official RAPTOR repo. When RAPTOR fails: build cost is high for small corpora, and the tree is static (re-embed and rebuild on every corpus update). Use it when the corpus is large, stable, and multi-level retrieval matters.
5. GraphRAG: community summaries for global questions
Best for: cross-document synthesis where the question is global, like “what are the main themes across this corpus” or “which entities recur most”. GraphRAG (Edge et al. 2024, MIT-licensed, microsoft.github.io/graphrag) extracts an entity-and-relationship graph from the corpus, partitions it into communities of densely connected entities (typically via the Leiden algorithm), and summarizes each community with an LLM. The community summaries become the retrieval surface for global questions.
# tested 2026-05-14
pip install graphrag
graphrag init --root ./project
graphrag index --root ./project
graphrag query --method global --query "What are the main themes?" --root ./project
The full Python API surface is in the graphrag package docs. When GraphRAG fails: entity extraction is expensive (LLM tokens for every chunk on ingest) and community summaries are static. Use it when global questions are the workload and the corpus updates on a weekly or longer cadence.
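The community-summary idea itself is small enough to illustrate outside the package. A toy sketch with networkx greedy modularity communities standing in for Leiden, a caller-supplied summarize_fn, and a plain edge list; this is not the graphrag API:

from networkx import Graph
from networkx.algorithms.community import greedy_modularity_communities

def community_summaries(edges: list[tuple[str, str]], summarize_fn) -> list[str]:
    """Partition an entity graph into communities and summarize each one."""
    graph = Graph()
    graph.add_edges_from(edges)
    summaries = []
    for community in greedy_modularity_communities(graph):
        entities = ", ".join(sorted(community))
        summaries.append(
            summarize_fn(f"Summarize what connects these entities: {entities}")
        )
    return summaries

A real pipeline would feed the extracted relationship descriptions into each community summary, not just the entity names.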
Long-context vs RAG: a decision matrix with concrete thresholds
The 2026 question is not “RAG or long-context” but “which pattern at this corpus size and query rate”. Use the table below.
| Corpus size | Update cadence | Query rate | Default pattern |
|---|---|---|---|
| Under 200K tokens | Static or rare update | Any | Long-context call (Opus 4.7, Gemini 2.5 Pro, GPT-5) |
| 200K to 1M tokens | Static or rare update | Low (under 100 queries/day) | Long-context call |
| 200K to 1M tokens | Frequent update | Any | Map-reduce RAG (rebuild is cheap per chunk) |
| 200K to 1M tokens | Static | High (over 1K queries/day) | Hybrid Self-Route (RAG default, LC fallback) |
| 1M to 10M tokens | Any | Any | Map-reduce or hierarchical RAG |
| 10M+ tokens | Any | Any | Hierarchical (RAPTOR) or GraphRAG |
| Multi-doc, global questions | Any | Any | GraphRAG community summaries |
| Multi-doc, local questions | Any | Any | Hybrid RAG with reranker |
Numbers based on Li et al. 2024 (2407.16833) and Li et al. 2025 (2501.01880). Validate against your own latency and cost targets; per-query cost varies by provider.
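The matrix translates directly into a routing function. A sketch with the same thresholds; treat them as starting points to validate against your own cost and latency numbers, not fixed constants:

def pick_pattern(
    corpus_tokens: int,
    updates_frequently: bool,
    queries_per_day: int,
    global_questions: bool = False,
) -> str:
    """Map corpus size, update cadence, and query rate to a default pattern."""
    if global_questions:
        return "graphrag_community_summaries"
    if corpus_tokens <= 200_000:
        return "long_context"
    if corpus_tokens <= 1_000_000:
        if updates_frequently:
            return "map_reduce"
        if queries_per_day > 1_000:
            return "self_route"
        return "long_context"
    if corpus_tokens <= 10_000_000:
        return "map_reduce_or_hierarchical"
    return "raptor_or_graphrag"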
Self-Route in practice
The Self-Route pattern (Li et al. 2024) routes each query to RAG first; the LLM answers with the retrieved chunks if it can, and the router falls back to a full long-context call only when the model self-reports that the retrieved evidence is insufficient. The original paper reported GPT-4o using 61 percent of the tokens and Gemini-1.5-Pro using 38.6 percent of the tokens compared to a pure long-context baseline, with comparable quality.
# tested 2026-05-14
from typing import Callable
ROUTER_PROMPT = (
"Below is a question and retrieved context."
" If the context contains enough information to answer fully, write the answer."
" If not, output exactly 'INSUFFICIENT_CONTEXT' and nothing else.\n\n"
"Question: {question}\nContext: {context}"
)
def self_route_summarize(
question: str,
rag_retrieve: Callable[[str], str],
rag_llm: Callable[[str], str],
long_context_llm: Callable[[str], str],
full_corpus: str,
) -> str:
"""Try RAG first, fall back to long-context only if RAG cannot answer."""
context = rag_retrieve(question)
rag_attempt = rag_llm(
ROUTER_PROMPT.format(question=question, context=context)
)
if rag_attempt.strip() == "INSUFFICIENT_CONTEXT":
return long_context_llm(
f"Question: {question}\n\nFull corpus:\n{full_corpus}"
)
return rag_attempt
The pattern needs an eval on the router decision itself. If the router falls back too often, RAG was the wrong default. If it falls back too rarely, the RAG step is hallucinating instead of admitting it cannot answer.
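A minimal way to watch the router decision itself, assuming you log each query's outcome plus a human or judge label for whether the retrieved context was actually sufficient (the field names are illustrative):

def router_report(decisions: list[dict]) -> dict:
    """decisions: [{"fell_back": bool, "retrieval_sufficient": bool}, ...]"""
    total = len(decisions)
    fallback_rate = sum(d["fell_back"] for d in decisions) / total
    # The router is right when it falls back exactly on the insufficient-context queries.
    accuracy = sum(d["fell_back"] != d["retrieval_sufficient"] for d in decisions) / total
    return {"fallback_rate": fallback_rate, "router_accuracy": accuracy}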
Production failure modes and mitigations
Five recurring failures show up in production summarization stacks.
- Hallucinated facts. The summary asserts claims that are not in the retrieved source. The mitigation is a faithfulness eval at the boundary that returns a pass or fail for every summary. Future AGI provides `groundedness` and `detect_hallucination` evaluators for this gate.
- Dropped sections. The summary skips a section because retrieval missed it. The mitigation is a completeness or coverage eval that compares the summary against the source and flags major gaps. For map-reduce stacks, measure recall in the map step explicitly.
- Repetition. Map-reduce reduce-step output repeats the same fact under multiple framings. The mitigation is a dedup step in the reduce prompt and an `is_good_summary`-style judge that flags repetition.
- Citation gap. The summary states a fact that cannot be traced back to a specific source passage. The mitigation is span-attached scoring (the trace stores which retrieved chunk every claim came from) plus a `citation_accuracy` custom rubric.
- Context gaslighting. The summary inverts or exaggerates source meaning while sounding plausible. This shows up most often in finance and legal domains where the source has dense quantitative claims. The mitigation is groundedness eval plus a domain-specific custom judge plus human review for high-risk content.
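Of the five failures, repetition is the one you can pre-screen without an LLM call. A minimal sketch using difflib to flag near-duplicate bullets before the judge runs; the 0.85 similarity threshold is an assumption to tune on your own data:

from difflib import SequenceMatcher

def find_repeated_bullets(summary: str, threshold: float = 0.85) -> list[tuple[str, str]]:
    """Flag pairs of bullets whose text overlap suggests the same fact restated."""
    bullets = [line.strip("-• ").strip() for line in summary.splitlines() if line.strip()]
    repeats = []
    for i, first in enumerate(bullets):
        for second in bullets[i + 1:]:
            if SequenceMatcher(None, first.lower(), second.lower()).ratio() >= threshold:
                repeats.append((first, second))
    return repeats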
The pattern across all five failures is the same: a strong faithfulness contract on every output, span-attached scoring on the trace, and a feedback loop from production traces back to the eval golden set.
Evaluating a RAG summarization pipeline
Four axes cover most summarization eval workloads. The framing comes from FineSurE (Song et al. 2024) and the broader summarization evaluation literature.
- Faithfulness. Every claim in the summary is supported by the retrieved source. Reference-free.
- Completeness (or coverage). The key information in the source appears in the summary. Reference-free when the source is the corpus; reference-based when a gold summary is available.
- Coherence. Adjacent sentences make sense together. Surface fluency plus inter-sentential logic.
- Citation accuracy. Every claim is traceable back to a source passage. Required for high-trust domains.
Sister-post link: see Custom LLM Eval Metrics: Best Practices in 2026 for the methodology of building a custom rubric, and RAG Evaluation Metrics for the broader RAG eval surface.
Future AGI eval code for summarization
The string-template API lets you call prebuilt summarization evaluators in one line. Faithfulness is the most common gate.
# tested 2026-05-14
import os
from fi.evals import evaluate
os.environ.setdefault("FI_API_KEY", "your-api-key")
os.environ.setdefault("FI_SECRET_KEY", "your-secret-key")
source = "Apple reported Q4 revenue of 94.9B USD, up 6 percent year over year."
summary = "Apple's Q4 revenue rose 6 percent to 94.9B USD."
result = evaluate(
"groundedness",
input=source,
output=summary,
context=source,
model_name="turing_large",
)
print(result.score, result.reason)
# Completeness check (no `context` arg, just input + output)
completeness = evaluate(
"completeness",
input=source,
output=summary,
model_name="turing_small",
)
print(completeness.score, completeness.reason)
# Detect hallucination (fast pre-flight)
hallu = evaluate(
"detect_hallucination",
input=source,
output=summary,
context=source,
model_name="turing_flash",
)
print(hallu.score, hallu.reason)
For a custom judge built on top of an LLM you control, use the local Evaluator and CustomLLMJudge wrappers.
# tested 2026-05-14
from fi.opt.base import Evaluator
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider
judge = CustomLLMJudge(
provider=LiteLLMProvider(model="claude-opus-4-7"),
rubric=(
"Return 1 if every claim in the summary is supported by the source"
" AND every named entity in the summary appears in the source."
" Otherwise return 0. Output JSON: {\"score\": int, \"reason\": str}."
),
)
evaluator = Evaluator(metric=judge)
score = evaluator.evaluate(
output=summary,
context=source,
)
print(score)
For continuous production scoring, instrument the summarization call with traceAI so every output carries a faithfulness and completeness score on the span.
# tested 2026-05-14
from fi.evals import evaluate
from fi_instrumentation import register, FITracer
register(project_name="rag-summarization-prod")
tracer = FITracer.get_tracer(__name__)
@tracer.chain
def summarize_with_eval(source: str, retrieve_fn, summarize_fn):
context = retrieve_fn(source)
summary = summarize_fn(context)
result = evaluate(
"groundedness",
input=source,
output=summary,
context=context,
model_name="turing_flash",
)
return {"summary": summary, "groundedness": result.score}
The result is a production trace that includes the retrieved chunks, the summary, the faithfulness score, and the latency for every call. Failures replay in pre-prod with the same scorer contract.
Inter-annotator agreement
When the rubric is subjective (executive briefs, marketing summaries), the judge correlates better with humans if you calibrate. Run the judge and your human annotators on the same 100 examples and compute Cohen's kappa between the judge and each annotator (Cohen's kappa is pairwise; use Fleiss' kappa if you want a single number across three or more raters).
# tested 2026-05-14
from sklearn.metrics import cohen_kappa_score
# Pretend these are 0/1 labels from a 100-example calibration set
judge_labels = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1] * 10
human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1] * 10
kappa = cohen_kappa_score(judge_labels, human_labels)
print(f"Cohen's kappa = {kappa:.2f}")
Kappa above 0.6 is substantial agreement. Below that the rubric needs sharpening or the judge needs a better backbone. Re-calibrate after every prompt or model change.
Worked example: 10K-document corpus, daily summaries
A realistic 2026 workload. You have 10,000 customer-support transcripts ingested per day. The product team wants a 200-word “what changed today” summary every morning.
The corpus is multi-document and updates daily, so neither long-context nor stuff fits. The question is global (“what changed today”), so GraphRAG is a candidate, but daily entity-graph rebuild is expensive. The right pattern is map-reduce with a daily refresh of partial summaries.
# tested 2026-05-14
from concurrent.futures import ThreadPoolExecutor
from fi.evals import evaluate
from fi_instrumentation import register, FITracer
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
register(project_name="daily-support-summary")
tracer = FITracer.get_tracer(__name__)
llm = ChatOpenAI(model="gpt-5", max_tokens=1024)
map_prompt = PromptTemplate.from_template(
"Summarize this support transcript in 3 bullets covering"
" (1) issue, (2) resolution, (3) sentiment.\n\n{transcript}"
)
reduce_prompt = PromptTemplate.from_template(
"Combine the partial summaries below into a single 200-word"
" 'what changed today' brief. Group by issue type. Preserve all"
" unique issues. Do not invent facts.\n\n{partials}"
)
@tracer.tool
def map_step(transcript: str) -> str:
return llm.invoke(map_prompt.format(transcript=transcript)).content
@tracer.chain
def reduce_step(partials: list[str]) -> str:
return llm.invoke(
reduce_prompt.format(partials="\n\n".join(partials))
).content
@tracer.agent
def daily_brief(transcripts: list[str]) -> dict:
with ThreadPoolExecutor(max_workers=20) as pool:
partials = list(pool.map(map_step, transcripts))
summary = reduce_step(partials)
# Gate at the boundary
faith = evaluate(
"groundedness",
input="\n\n".join(transcripts[:50]), # sample for the check
output=summary,
context="\n\n".join(partials),
model_name="turing_small",
)
completeness = evaluate(
"completeness",
input="\n\n".join(partials),
output=summary,
model_name="turing_flash",
)
return {
"summary": summary,
"faithfulness": faith.score,
"completeness": completeness.score,
"partial_count": len(partials),
}
Two things to note. First, faithfulness is checked against a sample of the source plus all partials, because the full corpus does not fit a single judge call. Second, completeness is checked against the partials, not the raw transcripts, because the reduce step is what the rubric should grade. Production traces flow back to the daily eval golden set; failures (low faithfulness, low completeness) trigger a rerun with a stronger reduce prompt.
RAG summarization vs fine-tuning vs prompt-only
Three alternatives to the patterns above. They are not exclusive.
- Prompt-only with long context. Send the document. The Claude Opus 4.7 1M window or Gemini 2.5 Pro 1M is the simplest path for single documents under the window. Cost dominates at high query volume.
- Fine-tuning for summarization style. Useful when the summary format is fixed (legal briefs, SEC filings) and the corpus is small enough to label. Combine with RAG retrieval for fresh data. See RAG vs Fine-Tuning for the tradeoff.
- Hybrid. Fine-tune the summarizer on your style; retrieve the source via RAG; gate the output with a faithfulness eval. This is the production default for high-stakes domains (finance, legal, healthcare).
How Future AGI fits
Future AGI sits beside the retriever and summarizer, not inside them. The vector DB (Pinecone, Weaviate, pgvector, Qdrant), the reranker (Cohere Rerank, Voyage Rerank-2, BGE-Reranker-v2), and the LLM (Claude Opus 4.7, GPT-5, Gemini 2.5 Pro) are picked from specialists. Future AGI provides:
- Prebuilt summarization evaluators in `fi.evals`: `groundedness`, `completeness`, `summary_quality`, `is_good_summary`, `detect_hallucination`, `is_factually_consistent`. All callable via `evaluate("metric_name", ...)`.
- Custom rubric judges via `CustomLLMJudge` and `LiteLLMProvider` for domain-specific rubrics (legal tone, financial precision, brand voice).
- Span-attached scoring via `fi_instrumentation` (Apache 2.0 traceAI) so every production summary carries a faithfulness score on the trace.
- Runtime guardrails via the Agent Command Center BYOK gateway so a low-groundedness summary is blocked or rewritten before it reaches the user.
- Simulation via `fi.simulate.TestRunner` so production failures replay against the same scorer contract in pre-prod.
For the broader eval surface and the metric catalog see What is RAG Evaluation in 2026 and RAG Hallucinations.
Sources
- Patrick Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020 (original RAG paper)
- Parth Sarthi et al., RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval, ICLR 2024
- Darren Edge et al., From Local to Global: A Graph RAG Approach to Query-Focused Summarization, Microsoft Research 2024
- Zhuowan Li et al., Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach, Google DeepMind 2024
- Xinze Li et al., Long Context vs. RAG for LLMs: An Evaluation and Revisits, 2025
- Hwanjun Song et al., FineSurE: Fine-grained Summarization Evaluation using LLMs, ACL 2024
- Claude Opus 4.7 release notes, Anthropic 2026
- LangChain `load_summarize_chain` API reference
- LlamaIndex DocumentSummaryIndex API reference
- Microsoft GraphRAG documentation
- Official RAPTOR implementation
- Future AGI cloud evaluators
- traceAI on GitHub (Apache 2.0)
- ai-evaluation on GitHub (Apache 2.0)
Frequently asked questions
What is RAG summarization in 2026?
Long-context vs RAG: when does each win for summarization?
What are the canonical RAG summarization patterns?
How do you evaluate a RAG summarization pipeline?
What is RAPTOR and when should I use it?
What is GraphRAG and when does it beat vector RAG for summarization?
How does Future AGI fit into a RAG summarization stack?
What are the most common failure modes in RAG summarization?