
RAG Summarization 2026: Patterns, Code + Long-Context Tradeoffs

RAG summarization in 2026: stuff, map-reduce, refine, RAPTOR, GraphRAG. Long-context vs RAG decision matrix with thresholds plus faithfulness eval code.


TL;DR: RAG summarization in 2026

| Workload | Best pattern | When it wins | Eval focus |
| --- | --- | --- | --- |
| Single doc under 1M tokens, single summary | Long-context call (Claude Opus 4.7 / Gemini 2.5 Pro / GPT-5) | Document fits the window, freshness not needed | Faithfulness, completeness |
| Single doc, fits one chunk window (~16K tokens) | Stuff | Cheapest, lowest latency | Faithfulness |
| Single doc over the LLM context window | Map-reduce | Parallelizable, independent chunks | Faithfulness, dedup |
| Narrative, contract, time-series | Refine | Preserves chronological order | Coherence, citation gap |
| Multi-level summaries, multi-step reasoning | Hierarchical (RAPTOR-style) | Different abstraction levels matter | Multi-level faithfulness |
| Multi-doc, global “what are the themes” | GraphRAG community summaries | Cross-document synthesis | Comprehensiveness, diversity |
| Cost-sensitive hybrid | Self-Route (RAG-or-LC router) | Per-query routing cuts 35 to 60 percent of tokens | Faithfulness, router accuracy |

If you only read one row: pick stuff under the window, map-reduce above it, RAPTOR when multi-level matters, and GraphRAG when the question is global. Evaluate every output for faithfulness and completeness before you ship.

What changed since 2024

Four shifts moved RAG summarization from a single “retrieve, summarize” call to a proper pattern catalog.

First, context windows got long. Claude Opus 4.7 ships a 1M-token context window in beta at the same per-token price as Opus 4.6. Gemini 2.5 Pro has shipped at 1M tokens for over a year. GPT-5 sits at 400K. For many single-document summaries the right answer is now “send the whole document,” not “retrieve.”

Second, hybrid routing won. Li et al. 2024 (Retrieval Augmented Generation or Long-Context LLMs?, Google DeepMind) showed that RAG and long-context return identical predictions for over 60 percent of queries. Their Self-Route hybrid router cuts token cost by 35 to 60 percent at comparable quality. The 2026 default for cost-sensitive stacks is a router, not a single pattern.

Third, recursive and graph patterns matured. RAPTOR (Sarthi et al. 2024) showed that a tree of recursive summaries lifts the QuALITY benchmark by 20 percentage points absolute with GPT-4. GraphRAG (Edge et al. 2024) from Microsoft Research showed that community summaries over an entity graph beat naive RAG on global questions with a 70 to 80 percent win rate.

Fourth, eval got cheap and continuous. Summarization eval (faithfulness, completeness, coherence, citation gap) became prebuilt and fast enough to run on every commit and on every production trace.

Anatomy of a 2026 RAG summarization pipeline

A modern RAG summarization stack has five layers. Most production stacks ship all five.

  1. Ingestion and chunking: parse, normalize, semantic chunk with 512 to 1024 tokens per chunk and 50 to 100 token overlap. Preserve section boundaries.
  2. Indexing: BM25 (lexical), dense embeddings (semantic), optional graph (GraphRAG) or tree (RAPTOR) on top.
  3. Retrieval: hybrid BM25 plus dense with Reciprocal Rank Fusion, optional cross-encoder reranker on top-k.
  4. Summarization: stuff, map-reduce, refine, hierarchical, or community-summary pattern depending on workload.
  5. Evaluation and guardrails: faithfulness, completeness, summary_quality, hallucination detection at the boundary; production traces flow back to the eval golden set.
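The layer-1 chunker can be sketched in a few lines. This is a minimal version that assumes pre-tokenized input (whitespace tokens stand in for real tokenizer tokens) and skips the section-boundary logic a production chunker would add:

```python
def chunk_tokens(
    tokens: list[str], size: int = 512, overlap: int = 64
) -> list[list[str]]:
    """Fixed-size chunks with overlap; a real chunker also respects section boundaries."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    step = size - overlap  # each chunk starts `overlap` tokens before the previous one ends
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

The overlap guarantees that a fact straddling a chunk boundary appears whole in at least one chunk, at the cost of some duplicated tokens per chunk.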

The shape of the corpus determines which summarization pattern wins. The eval contract is the same across all of them.
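The Reciprocal Rank Fusion step from layer 3 is small enough to inline. A sketch over ranked lists of document IDs, using k=60 as in the original RRF formulation:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs into one; higher score ranks first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # reciprocal-rank contribution
    return sorted(scores, key=scores.get, reverse=True)
```

Feed it one list from BM25 and one from the dense retriever; documents that rank well in both lists dominate the fused order.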

Five RAG summarization patterns with code

All Python below was checked on 2026-05-14 against the public SDK signatures (one block is a shell snippet, marked as bash). Replace the LLM and retriever wrappers with your own; the structural shape is what matters.

1. Stuff: one LLM call, one summary

Best for: single document or small corpus that fits in the LLM context window. Cheapest. Lowest latency. The default for under 16K tokens.

# tested 2026-05-14
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_anthropic import ChatAnthropic

prompt = ChatPromptTemplate.from_messages([
    ("system", "Summarize the documents below in 200 words."),
    ("human", "{context}"),
])

llm = ChatAnthropic(model="claude-opus-4-7", max_tokens=2048)
chain = create_stuff_documents_chain(llm, prompt)

# `docs` is a list of langchain_core.documents.Document
summary = chain.invoke({"context": docs})
print(summary)

When stuff fails: the retrieved chunks exceed the context window or the cost per call becomes the bottleneck. Move to map-reduce.
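A cheap pre-flight guard catches that failure before the call. A sketch using the rough four-characters-per-token heuristic (the heuristic and the `reserve` budget are assumptions; swap in your real tokenizer):

```python
def fits_stuff_window(
    docs: list[str], window_tokens: int = 16_000, reserve: int = 2_048
) -> bool:
    """True if the concatenated docs leave room for the prompt and the output."""
    approx_tokens = sum(len(d) // 4 for d in docs)  # ~4 chars per token heuristic
    return approx_tokens <= window_tokens - reserve
```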

2. Map-reduce: parallel chunk summarization

Best for: long documents or many documents that do not fit a single context window. Each chunk is summarized independently in the map step, then the partial summaries are combined in the reduce step. The pattern is parallelizable.

# tested 2026-05-14
from concurrent.futures import ThreadPoolExecutor
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-5", max_tokens=1024)

map_prompt = PromptTemplate.from_template(
    "Summarize the passage below in 5 bullet points.\n\n{chunk}"
)
reduce_prompt = PromptTemplate.from_template(
    "Combine these partial summaries into one cohesive 250-word summary."
    " Preserve all unique facts.\n\n{partials}"
)


def map_step(chunk: str) -> str:
    return llm.invoke(map_prompt.format(chunk=chunk)).content


def reduce_step(partials: list[str]) -> str:
    joined = "\n\n".join(partials)
    return llm.invoke(reduce_prompt.format(partials=joined)).content


def map_reduce_summarize(chunks: list[str]) -> str:
    with ThreadPoolExecutor(max_workers=8) as pool:
        partials = list(pool.map(map_step, chunks))
    return reduce_step(partials)

When map-reduce fails: information is lost across the reduce boundary or the reduce step repeats facts. Switch to refine for narratives, or to hierarchical when the corpus is large enough to justify a tree.
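Before switching patterns, one cheap mitigation for reduce-step repetition is a lexical dedup over the map-step bullets. A sketch using Jaccard overlap on token sets (a production version would use embedding similarity; the 0.8 threshold is an assumption to tune):

```python
def dedup_bullets(partials: list[str], threshold: float = 0.8) -> list[str]:
    """Drop bullets whose token overlap with an earlier bullet exceeds threshold."""
    kept: list[str] = []
    kept_sets: list[set[str]] = []
    for partial in partials:
        for line in partial.splitlines():
            tokens = set(line.lower().split())
            if not tokens:
                continue
            # Jaccard similarity against every bullet already kept
            if any(
                len(tokens & seen) / len(tokens | seen) >= threshold
                for seen in kept_sets
            ):
                continue
            kept.append(line)
            kept_sets.append(tokens)
    return kept
```

Run it between the map and reduce steps so the reduce prompt never sees the same fact twice.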

3. Refine: sequential update

Best for: narratives, contracts, time-series documents where order matters. The refine pattern updates a running summary one chunk at a time, so chronology and dependency are preserved.

# tested 2026-05-14
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-5", max_tokens=1024)

initial_prompt = PromptTemplate.from_template(
    "Write a 200-word summary of the passage below.\n\n{chunk}"
)
refine_prompt = PromptTemplate.from_template(
    "Existing summary:\n{summary}\n\nNew passage:\n{chunk}\n\n"
    "Refine the summary to incorporate the new passage. Keep order."
    " Do not add facts not present in either input."
)


def refine_summarize(chunks: list[str]) -> str:
    summary = llm.invoke(initial_prompt.format(chunk=chunks[0])).content
    for chunk in chunks[1:]:
        summary = llm.invoke(
            refine_prompt.format(summary=summary, chunk=chunk)
        ).content
    return summary

When refine fails: latency grows linearly with chunk count, and drift accumulates as the running summary forgets early content. Cap with a token budget or move to hierarchical.
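Capping drift with a token budget can be as simple as re-compressing the running summary whenever it grows past the cap. A sketch where `llm_refine`, `llm_compress`, and `count_tokens` are hypothetical callables standing in for your LLM wrappers and tokenizer:

```python
from typing import Callable


def refine_with_budget(
    chunks: list[str],
    llm_refine: Callable[[str, str], str],      # (summary, chunk) -> new summary
    llm_compress: Callable[[str], str],         # summary -> shorter summary
    count_tokens: Callable[[str], int],
    budget: int = 800,
) -> str:
    """Refine chunk by chunk; compress the running summary when it exceeds budget."""
    summary = ""
    for chunk in chunks:
        summary = llm_refine(summary, chunk)
        if count_tokens(summary) > budget:
            summary = llm_compress(summary)
    return summary
```

The compress call is itself an LLM call, so the cap trades a little extra cost for bounded latency and less drift.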

4. Hierarchical (RAPTOR-style): tree of recursive summaries

Best for: workloads that need summaries at multiple abstraction levels, or multi-step reasoning over long documents. RAPTOR (Sarthi et al. 2024) clusters chunks by embedding, summarizes each cluster, then recurses up to a single root summary. At retrieval time you can pull leaves (detailed evidence), intermediate nodes (section-level), or the root (executive summary).

# tested 2026-05-14
from dataclasses import dataclass, field
from sklearn.cluster import KMeans
import numpy as np


@dataclass
class TreeNode:
    text: str
    embedding: np.ndarray
    children: list["TreeNode"] = field(default_factory=list)


def build_raptor_tree(
    chunks: list[str],
    embed_fn,
    summarize_fn,
    branching_factor: int = 5,
    max_depth: int = 4,
) -> TreeNode:
    """Build a tree of recursive summaries, bottom-up."""
    nodes = [
        TreeNode(text=chunk, embedding=embed_fn(chunk)) for chunk in chunks
    ]
    for _ in range(max_depth):
        if len(nodes) <= 1:
            break
        n_clusters = max(1, len(nodes) // branching_factor)
        embeddings = np.stack([n.embedding for n in nodes])
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
            embeddings
        )
        parents = []
        for cluster_id in range(n_clusters):
            members = [nodes[i] for i, lbl in enumerate(labels) if lbl == cluster_id]
            joined = "\n\n".join(m.text for m in members)
            parent_text = summarize_fn(joined)
            parents.append(
                TreeNode(
                    text=parent_text,
                    embedding=embed_fn(parent_text),
                    children=members,
                )
            )
        nodes = parents
    if len(nodes) == 1:
        return nodes[0]
    root_text = summarize_fn("\n\n".join(n.text for n in nodes))
    return TreeNode(
        text=root_text,
        embedding=embed_fn(root_text),
        children=nodes,
    )

The reference implementation lives at the official RAPTOR repo. When RAPTOR fails: build cost is high for small corpora, and the tree is static (re-embed and rebuild on every corpus update). Use it when the corpus is large, stable, and multi-level retrieval matters.
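Retrieval over the finished tree, in the collapsed-tree style from the paper, is a flatten-then-rank pass. A sketch that reuses the TreeNode shape above and ranks every node (leaves, intermediates, root) by cosine similarity:

```python
import numpy as np


def flatten_tree(root) -> list:
    """Collect every node in the tree: the collapsed-tree retrieval surface."""
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(node.children)
    return nodes


def retrieve(root, query_embedding: np.ndarray, top_k: int = 5) -> list:
    """Rank all tree nodes by cosine similarity to the query embedding."""
    nodes = flatten_tree(root)
    embs = np.stack([n.embedding for n in nodes])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    order = np.argsort(embs @ q)[::-1][:top_k]
    return [nodes[i] for i in order]
```

Because every level is in the same pool, an executive-level question naturally surfaces upper nodes while a detail question surfaces leaves.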

5. GraphRAG: community summaries for global questions

Best for: cross-document synthesis where the question is global, like “what are the main themes across this corpus” or “which entities recur most”. GraphRAG (Edge et al. 2024, MIT-licensed, microsoft.github.io/graphrag) extracts an entity-and-relationship graph from the corpus, partitions it into communities of densely connected entities (typically via the Leiden algorithm), and summarizes each community with an LLM. The community summaries become the retrieval surface for global questions.

# tested 2026-05-14
pip install graphrag
graphrag init --root ./project
graphrag index --root ./project
graphrag query --method global --query "What are the main themes?" --root ./project

The full Python API surface is in the graphrag package docs. When GraphRAG fails: entity extraction is expensive (LLM tokens for every chunk on ingest) and community summaries are static. Use it when global questions are the workload and the corpus updates on a weekly or longer cadence.

Long-context vs RAG: a decision matrix with concrete thresholds

The 2026 question is not “RAG or long-context” but “which pattern at this corpus size and query rate”. Use the table below.

| Corpus size | Update cadence | Query rate | Default pattern |
| --- | --- | --- | --- |
| Under 200K tokens | Static or rare update | Any | Long-context call (Opus 4.7, Gemini 2.5 Pro, GPT-5) |
| 200K to 1M tokens | Static or rare update | Low (under 100 queries/day) | Long-context call |
| 200K to 1M tokens | Frequent update | Any | Map-reduce RAG (rebuild is cheap per chunk) |
| 200K to 1M tokens | Static | High (over 1K queries/day) | Hybrid Self-Route (RAG default, LC fallback) |
| 1M to 10M tokens | Any | Any | Map-reduce or hierarchical RAG |
| 10M+ tokens | Any | Any | Hierarchical (RAPTOR) or GraphRAG |
| Multi-doc, global questions | Any | Any | GraphRAG community summaries |
| Multi-doc, local questions | Any | Any | Hybrid RAG with reranker |
Numbers based on Li et al. 2024 (2407.16833) and Li et al. 2025 (2501.01880). Validate against your own latency and cost targets; per-query cost varies by provider.
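The matrix collapses into a small routing function. A sketch with the thresholds taken from the table above (they are starting points to validate against your own costs, not universal constants):

```python
def pick_pattern(
    corpus_tokens: int,
    updates_per_week: float,
    queries_per_day: float,
    global_question: bool = False,
    multi_doc: bool = False,
) -> str:
    """First-cut router encoding the decision matrix."""
    if multi_doc:
        return "graphrag" if global_question else "hybrid_rag_rerank"
    if corpus_tokens >= 10_000_000:
        return "raptor_or_graphrag"
    if corpus_tokens >= 1_000_000:
        return "map_reduce_or_hierarchical"
    if corpus_tokens >= 200_000:
        if updates_per_week >= 1:
            return "map_reduce"          # frequent updates: rebuild per chunk is cheap
        if queries_per_day > 1_000:
            return "self_route"          # static + high QPS: amortize with a router
        return "long_context"
    return "long_context"                # small corpus: send the whole thing
```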

Self-Route in practice

The Self-Route pattern (Li et al. 2024) routes each query to RAG first; the LLM answers with the retrieved chunks if it can, and the router falls back to a full long-context call only when the model self-reports that the retrieved evidence is insufficient. The original paper reported GPT-4o using 61 percent of the tokens and Gemini-1.5-Pro using 38.6 percent of the tokens compared to a pure long-context baseline, with comparable quality.

# tested 2026-05-14
from typing import Callable


ROUTER_PROMPT = (
    "Below is a question and retrieved context."
    " If the context contains enough information to answer fully, write the answer."
    " If not, output exactly 'INSUFFICIENT_CONTEXT' and nothing else.\n\n"
    "Question: {question}\nContext: {context}"
)


def self_route_summarize(
    question: str,
    rag_retrieve: Callable[[str], str],
    rag_llm: Callable[[str], str],
    long_context_llm: Callable[[str], str],
    full_corpus: str,
) -> str:
    """Try RAG first, fall back to long-context only if RAG cannot answer."""
    context = rag_retrieve(question)
    rag_attempt = rag_llm(
        ROUTER_PROMPT.format(question=question, context=context)
    )
    if rag_attempt.strip() == "INSUFFICIENT_CONTEXT":
        return long_context_llm(
            f"Question: {question}\n\nFull corpus:\n{full_corpus}"
        )
    return rag_attempt

The pattern needs an eval on the router decision itself. If the router falls back too often, RAG was the wrong default. If it falls back too rarely, the RAG step is hallucinating instead of admitting it cannot answer.
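A minimal router-health check over logged decisions might look like this (the 5 and 60 percent bounds are assumptions to tune against your own traffic, not published thresholds):

```python
def router_health(
    decisions: list[str], low: float = 0.05, high: float = 0.60
) -> dict:
    """decisions: per-query 'rag' or 'long_context' choices from the Self-Route log."""
    total = len(decisions)
    fallbacks = sum(1 for d in decisions if d == "long_context")
    rate = fallbacks / total if total else 0.0
    return {
        "fallback_rate": rate,
        "too_many_fallbacks": rate > high,  # RAG was probably the wrong default
        "too_few_fallbacks": rate < low,    # RAG step may be hallucinating answers
    }
```

Pair the fallback rate with a faithfulness score on the RAG-answered queries: a low fallback rate is only good news if those answers are also grounded.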

Production failure modes and mitigations

Five recurring failures show up in production summarization stacks.

  1. Hallucinated facts. The summary asserts claims that are not in the retrieved source. The mitigation is a faithfulness eval at the boundary that returns a pass or fail for every summary. Future AGI provides groundedness and detect_hallucination evaluators for this gate.
  2. Dropped sections. The summary skips a section because retrieval missed it. The mitigation is a completeness or coverage eval that compares the summary against the source and flags major gaps. For map-reduce stacks measure recall in the map step explicitly.
  3. Repetition. Map-reduce reduce-step output repeats the same fact under multiple framings. The mitigation is a dedup step in the reduce prompt and a is_good_summary-style judge that flags repetition.
  4. Citation gap. The summary states a fact that cannot be traced back to a specific source passage. The mitigation is span-attached scoring (the trace stores which retrieved chunk every claim came from) plus a citation_accuracy custom rubric.
  5. Context gaslighting. The summary inverts or exaggerates source meaning while sounding plausible. This shows up most often in finance and legal domains where the source has dense quantitative claims. The mitigation is groundedness eval plus a domain-specific custom judge plus human review for high-risk content.

The pattern across all five failures is the same: a strong faithfulness contract on every output, span-attached scoring on the trace, and a feedback loop from production traces back to the eval golden set.

Evaluating a RAG summarization pipeline

Four axes cover most summarization eval workloads. The framing comes from FineSurE (Song et al. 2024) and the broader summarization evaluation literature.

  1. Faithfulness. Every claim in the summary is supported by the retrieved source. Reference-free.
  2. Completeness (or coverage). The key information in the source appears in the summary. Reference-free when the source is the corpus; reference-based when a gold summary is available.
  3. Coherence. Adjacent sentences make sense together. Surface fluency plus inter-sentential logic.
  4. Citation accuracy. Every claim is traceable back to a source passage. Required for high-trust domains.

For the methodology of building a custom rubric, see Custom LLM Eval Metrics: Best Practices in 2026; for the broader RAG eval surface, see RAG Evaluation Metrics.

Future AGI eval code for summarization

The string-template API lets you call prebuilt summarization evaluators in one line. Faithfulness is the most common gate.

# tested 2026-05-14
import os
from fi.evals import evaluate

os.environ.setdefault("FI_API_KEY", "your-api-key")
os.environ.setdefault("FI_SECRET_KEY", "your-secret-key")

source = "Apple reported Q4 revenue of 94.9B USD, up 6 percent year over year."
summary = "Apple's Q4 revenue rose 6 percent to 94.9B USD."

result = evaluate(
    "groundedness",
    input=source,
    output=summary,
    context=source,
    model_name="turing_large",
)
print(result.score, result.reason)

# Completeness check (no `context` arg, just input + output)
completeness = evaluate(
    "completeness",
    input=source,
    output=summary,
    model_name="turing_small",
)
print(completeness.score, completeness.reason)

# Detect hallucination (fast pre-flight)
hallu = evaluate(
    "detect_hallucination",
    input=source,
    output=summary,
    context=source,
    model_name="turing_flash",
)
print(hallu.score, hallu.reason)

For a custom judge built on top of an LLM you control, use the local Evaluator and CustomLLMJudge wrappers.

# tested 2026-05-14
from fi.opt.base import Evaluator
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

judge = CustomLLMJudge(
    provider=LiteLLMProvider(model="claude-opus-4-7"),
    rubric=(
        "Return 1 if every claim in the summary is supported by the source"
        " AND every named entity in the summary appears in the source."
        " Otherwise return 0. Output JSON: {\"score\": int, \"reason\": str}."
    ),
)

evaluator = Evaluator(metric=judge)
score = evaluator.evaluate(
    output=summary,
    context=source,
)
print(score)

For continuous production scoring, instrument the summarization call with traceAI so every output carries a faithfulness and completeness score on the span.

# tested 2026-05-14
from fi.evals import evaluate
from fi_instrumentation import register, FITracer

register(project_name="rag-summarization-prod")
tracer = FITracer.get_tracer(__name__)


@tracer.chain
def summarize_with_eval(source: str, retrieve_fn, summarize_fn):
    context = retrieve_fn(source)
    summary = summarize_fn(context)

    result = evaluate(
        "groundedness",
        input=source,
        output=summary,
        context=context,
        model_name="turing_flash",
    )
    return {"summary": summary, "groundedness": result.score}

The result is a production trace that includes the retrieved chunks, the summary, the faithfulness score, and the latency for every call. Failures replay in pre-prod with the same scorer contract.

Inter-annotator agreement

When the rubric is subjective (executive briefs, marketing summaries) the judge correlates better with humans if you calibrate. Run the judge and three humans on the same 100 examples and compute Cohen’s kappa.

# tested 2026-05-14
from sklearn.metrics import cohen_kappa_score

# Pretend these are 0/1 labels from a 100-example calibration set
judge_labels = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1] * 10
human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1] * 10

kappa = cohen_kappa_score(judge_labels, human_labels)
print(f"Cohen's kappa = {kappa:.2f}")

Kappa above 0.6 is substantial agreement. Below that the rubric needs sharpening or the judge needs a better backbone. Re-calibrate after every prompt or model change.

Worked example: 10K-document corpus, daily summaries

A realistic 2026 workload. You have 10,000 customer-support transcripts ingested per day. The product team wants a 200-word “what changed today” summary every morning.

The corpus is multi-document and updates daily, so neither long-context nor stuff fits. The question is global (“what changed today”), so GraphRAG is a candidate, but daily entity-graph rebuild is expensive. The right pattern is map-reduce with a daily refresh of partial summaries.

# tested 2026-05-14
from concurrent.futures import ThreadPoolExecutor
from fi.evals import evaluate
from fi_instrumentation import register, FITracer
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

register(project_name="daily-support-summary")
tracer = FITracer.get_tracer(__name__)

llm = ChatOpenAI(model="gpt-5", max_tokens=1024)

map_prompt = PromptTemplate.from_template(
    "Summarize this support transcript in 3 bullets covering"
    " (1) issue, (2) resolution, (3) sentiment.\n\n{transcript}"
)
reduce_prompt = PromptTemplate.from_template(
    "Combine the partial summaries below into a single 200-word"
    " 'what changed today' brief. Group by issue type. Preserve all"
    " unique issues. Do not invent facts.\n\n{partials}"
)


@tracer.tool
def map_step(transcript: str) -> str:
    return llm.invoke(map_prompt.format(transcript=transcript)).content


@tracer.chain
def reduce_step(partials: list[str]) -> str:
    return llm.invoke(
        reduce_prompt.format(partials="\n\n".join(partials))
    ).content


@tracer.agent
def daily_brief(transcripts: list[str]) -> dict:
    with ThreadPoolExecutor(max_workers=20) as pool:
        partials = list(pool.map(map_step, transcripts))
    summary = reduce_step(partials)

    # Gate at the boundary
    faith = evaluate(
        "groundedness",
        input="\n\n".join(transcripts[:50]),  # sample for the check
        output=summary,
        context="\n\n".join(partials),
        model_name="turing_small",
    )
    completeness = evaluate(
        "completeness",
        input="\n\n".join(partials),
        output=summary,
        model_name="turing_flash",
    )
    return {
        "summary": summary,
        "faithfulness": faith.score,
        "completeness": completeness.score,
        "partial_count": len(partials),
    }

Two things to note. First, faithfulness is checked against a sample of the source plus all partials, because the full corpus does not fit a single judge call. Second, completeness is checked against the partials, not the raw transcripts, because the reduce step is what the rubric should grade. Production traces flow back to the daily eval golden set; failures (low faithfulness, low completeness) trigger a rerun with a stronger reduce prompt.

RAG summarization vs fine-tuning vs prompt-only

Three alternatives to the patterns above. They are not exclusive.

  • Prompt-only with long context. Send the document. The Claude Opus 4.7 1M window or Gemini 2.5 Pro 1M is the simplest path for single documents under the window. Cost dominates at high query volume.
  • Fine-tuning for summarization style. Useful when the summary format is fixed (legal briefs, SEC filings) and the corpus is small enough to label. Combine with RAG retrieval for fresh data. See RAG vs Fine-Tuning for the tradeoff.
  • Hybrid. Fine-tune the summarizer on your style; retrieve the source via RAG; gate the output with a faithfulness eval. This is the production default for high-stakes domains (finance, legal, healthcare).

How Future AGI fits

Future AGI sits beside the retriever and summarizer, not inside them. The vector DB (Pinecone, Weaviate, pgvector, Qdrant), the reranker (Cohere Rerank, Voyage Rerank-2, BGE-Reranker-v2), and the LLM (Claude Opus 4.7, GPT-5, Gemini 2.5 Pro) are picked from specialists. Future AGI provides:

  • Prebuilt summarization evaluators in fi.evals: groundedness, completeness, summary_quality, is_good_summary, detect_hallucination, is_factually_consistent. All callable via evaluate("metric_name", ...).
  • Custom rubric judges via CustomLLMJudge and LiteLLMProvider for domain-specific rubrics (legal tone, financial precision, brand voice).
  • Span-attached scoring via fi_instrumentation (Apache 2.0 traceAI) so every production summary carries a faithfulness score on the trace.
  • Runtime guardrails via the Agent Command Center BYOK gateway so a low-groundedness summary is blocked or rewritten before it reaches the user.
  • Simulation via fi.simulate.TestRunner so production failures replay against the same scorer contract in pre-prod.

For the broader eval surface and the metric catalog see What is RAG Evaluation in 2026 and RAG Hallucinations.


Frequently asked questions

What is RAG summarization in 2026?
RAG summarization is the family of patterns that pair retrieval (BM25, dense, hybrid, graph) with a generation step that produces a summary, an outline, or a brief. In 2026 the canonical patterns are stuff, map-reduce, refine, hierarchical (RAPTOR-style recursive summarization), and GraphRAG community summaries. The choice depends on document length, freshness, cost budget, and whether the corpus is single-document or multi-document. Production stacks pair the chosen pattern with span-attached eval for faithfulness, completeness, and coherence.
Long-context vs RAG: when does each win for summarization?
Long-context models (Claude Opus 4.7 with 1M tokens in beta, Gemini 2.5 Pro at 1M, GPT-5 at 400K) handle single-document summaries cleanly when the document fits the window. RAG summarization wins when the corpus exceeds the window, when the corpus updates (a single 1M-token call repeats work), when cost dominates (RAG can use 35 to 60 percent fewer tokens per Self-Route, Li et al. 2024), and when multi-document synthesis is required. The hybrid Self-Route router is a 2026 default for cost-sensitive stacks.
What are the canonical RAG summarization patterns?
Five patterns. Stuff: concatenate retrieved chunks and call the LLM once. Map-reduce: summarize each chunk in parallel, then reduce the partial summaries into one. Refine: sequential pass that updates a running summary one chunk at a time. Hierarchical (RAPTOR, Sarthi et al. 2024): cluster chunks, summarize each cluster, recurse to build a tree, retrieve from any level at query time. GraphRAG (Edge et al. 2024): build an entity graph, partition into communities, summarize each community, then answer global questions over the community summaries.
How do you evaluate a RAG summarization pipeline?
Four axes per FineSurE and the broader summarization eval literature. Faithfulness: every claim in the summary is supported by the retrieved source. Completeness (or coverage): the key information in the source appears in the summary. Coherence: adjacent sentences make sense together. Citation accuracy: every claim is traceable back to a passage. Future AGI ships prebuilt evaluators (groundedness, completeness, summary_quality, detect_hallucination, is_good_summary) on the turing_flash (~1-2s), turing_small (~2-3s), and turing_large (~3-5s) judge tiers.
What is RAPTOR and when should I use it?
RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval, Sarthi et al. 2024) clusters chunks by embedding, summarizes each cluster, and repeats the process up the tree until a single root summary remains. At query time it retrieves from any level of the tree. The original paper reported a 20 percentage point absolute accuracy gain on the QuALITY multi-step reasoning benchmark with GPT-4. RAPTOR is the right pattern when summaries at different abstraction levels matter (top of tree for executive briefs, lower levels for detailed evidence).
What is GraphRAG and when does it beat vector RAG for summarization?
GraphRAG (Edge et al. 2024, MIT-licensed, by Microsoft Research) extracts an entity-and-relationship graph from the corpus, partitions it into communities of densely connected entities, and summarizes each community with an LLM. For global questions ("what are the main themes across this corpus") it outperforms naive RAG with a 70 to 80 percent win rate on comprehensiveness and diversity. For local single-document questions vector RAG is comparable and cheaper. Use GraphRAG when cross-document synthesis is the workload.
How does Future AGI fit into a RAG summarization stack?
Future AGI sits beside the retriever and generator, not inside them. The retriever (BM25, dense embeddings, hybrid, graph) and the summarizer LLM are picked from specialists. Future AGI ships prebuilt evaluators for faithfulness, completeness, summary_quality, and hallucination detection via `fi.evals.evaluate(...)`, traces every summarization call through `fi_instrumentation` (Apache 2.0 traceAI), and runs runtime guardrails through the Agent Command Center BYOK gateway. Production failures captured in traceAI flow back into the eval golden set for the next release.
What are the most common failure modes in RAG summarization?
Five recurring failures. Hallucinated facts: the summary asserts claims not in the source (mitigation: faithfulness eval at the boundary). Dropped sections: the summary skips a chunk because retrieval missed it (mitigation: completeness eval, recall measurement). Repetition: the map-reduce reduce step repeats facts (mitigation: dedup, refine pattern). Citation gap: the summary cannot be traced back to a chunk (mitigation: span-attached scoring). Context gaslighting: the summary inverts or exaggerates source meaning while sounding plausible, most common in dense finance and legal content (mitigation: groundedness eval plus human review).