Guides

What is Retrieval Augmented Generation? The 2026 Definition

RAG is four pipeline stages and three failure modes per stage. A methodology reference for picking each stage and measuring what it broke.

·
Updated
·
13 min read
rag retrieval-augmented-generation embeddings vector-search llm rag-evaluation groundedness chunking reranking 2026
Editorial cover image for What is Retrieval Augmented Generation
Table of Contents

A support agent built on a RAG stack answers a customer with confidence: “Per Section 7.3 of your enterprise agreement, refunds are processed within 14 business days.” The trace shows the retriever returned five chunks. None of them contain Section 7.3, and none of them mention refunds. The model invented the citation because the prompt told it to cite sources, and a string-match eval marked the response correct because it named the right entities. The customer escalates. The system looked healthy on every dashboard.

This is RAG in 2026: a pipeline that is easy to demo, hard to make reliable, and indistinguishable from a working system until somebody reads the trace. Most explainers run five thousand words covering eight stages, six advanced patterns, and a vendor comparison without telling an engineer what to pick or what each stage breaks on. The opinion this post earns: RAG is four pipeline stages and three failure modes per stage, and every 2026 RAG explainer that does not give the methodology for picking and measuring each stage is theater.

The four stages are chunking, indexing, retrieval, generation. Older explainers split out embedding, reranking, and prompt assembly as separate stages; those are sub-stages of indexing, retrieval, and generation respectively. Four is the right grain because each stage shares a failure surface and an evaluation rubric. Audience: ML engineers and tech leads who want one reference page on RAG methodology — what to pick at each stage, what breaks, and the rubric that catches it.

TL;DR: the four stages, the failures, the rubrics

StagePick (2026 default)Top failure modeRubric that catches it
ChunkingStructure-aware + header carryBoundary breaks across clausesContextRelevance, ContextRecall
IndexingHybrid (dense + BM25), per-tenant metadataEmbedding-corpus mismatchContextRecall, embedding probe
RetrievalHybrid + cross-encoder rerankerIdentifier blindness, lost-in-the-middleContextRelevance, ChunkUtilization
GenerationStructured output + cited spansFabricated citations, grounded-but-wrongGroundedness, ChunkAttribution, FactualAccuracy

RAG works when each stage is picked deliberately and measured separately. RAG fails when the engineer ships the LangChain quickstart, calls the dense top-5 a retriever, and reports one aggregate quality score that hides which stage moved.

Why RAG, in one section

Four forces made RAG the default knowledge pattern for production LLM applications.

Knowledge freshness. Foundation models freeze at a training cutoff. RAG indexes the corpus on a separate cadence (nightly, hourly, or on document change). A regulatory team updates a policy at 2 PM and the assistant cites the new version at 2:01 PM. Fine-tuning would take a training run, an eval cycle, and a deploy.

Factual grounding. A model asked from its weights alone will sometimes invent a plausible answer. The same model handed a relevant passage and instructed to answer only from the passage hallucinates far less. RAG does not eliminate hallucination; it shifts the failure mode from “made up the answer” to “ignored the context” or “cited a chunk that did not say that.” Both are catchable with the right evals.

Domain expertise without retraining. A general-purpose model paired with a vector index of legal contracts, medical guidelines, or internal engineering docs behaves like a domain specialist at a fraction of fine-tuning cost.

Cost economics. Fine-tuning a frontier model on a million documents is expensive and the resulting weights are locked to the training snapshot. RAG keeps the index outside the model; the marginal cost of a new document is one embedding call and a row in the vector database.

The tradeoff: every query runs an extra retrieval round-trip and a longer prompt. Done well, that is 50 to 200 ms and a few thousand tokens. Done badly, that is a second of added latency, two thousand wasted tokens, and a hallucination.

Stage 1: chunking

Documents are split into retrievable units. The chunker decides what the unit of retrieval and the unit of attribution will be; if the boundary cuts a clause in half, the citation will too.

The picks. Fixed-size character chunking (cheap, deterministic, fine baseline for unstructured text). Recursive character splitting (RecursiveCharacterTextSplitter and friends; splits on paragraph, then sentence, then word boundaries). Semantic chunking (embed sentences, split when adjacent embeddings diverge). Document-structure-aware chunking (parse Markdown headings, HTML elements, PDF section structure, code blocks). Late chunking (embed the full document first, then derive chunk embeddings from token-level outputs; preserves cross-chunk context and beats fixed-size on most retrieval benchmarks).

Pick rule. Default to structure-aware chunking with header carry — every chunk is prefixed with its section heading so the embedding encodes “this is from the Liability section,” not just the body text. Cap chunk size at the embedding model’s window minus headroom for the header prefix. Keep tables and figure captions as their own chunks and embed them on the caption plus header row, not the full table flattened into a wall of cells. Fixed-character chunking costs 5 to 15 points on retrieval rubrics in the domains we have measured; the advanced chunking guide covers the diagnostic checks.

The three failures. Boundary breaks: the chunker cuts a clause across two chunks and neither carries the full claim. Context loss: the chunk arrives without its section heading and embeds on body text alone. Table flattening: a multi-row table dissolves into interleaved cells and the embedding loses the relationship between header and value.

What the rubric sees. ContextRecall drops because the answer-bearing chunk is mangled or missing. ContextRelevance drops because adjacent chunks now look closer to the query than the (now broken) right one. The chunking evaluation post covers the diagnostic.

Stage 2: indexing

Chunks become vectors and rows in a store. The embedding model and the store choice are real picks, but they are less load-bearing than the retrieval and reranking choices that come next.

Embedding picks. OpenAI text-embedding-3-small (1536 dim, cheap baseline), text-embedding-3-large (3072 dim, higher quality). Cohere embed-english-v3.0 / embed-multilingual-v3.0 (1024 dim). BAAI bge-small, bge-large, bge-m3 (open-weight, strong MTEB scores, multilingual). Nomic nomic-embed-text-v2 and Microsoft e5-large-v2 for on-prem. Matryoshka embeddings (Nomic, OpenAI v3) let you truncate to a smaller dimension at query time without retraining. Most teams over-index on embedding choice; once a reranker is in the pipeline, the marginal lift from upgrading embeddings is usually under 5 percent. The embedding model landscape covers benchmark deltas.

Vector store picks. pgvector when Postgres is already in production (no separate database to operate). Qdrant or Pinecone when retrieval is the workload (strong hybrid search, managed options). Weaviate, Milvus, Chroma, LanceDB for specific shapes (hybrid built-in, scale-out, embedded). The vector database comparison covers the full trade space.

Pick rule. Index BM25 alongside vectors from day one. Carry metadata fields the retriever can filter on — tenant_id, document type, date, source path. Cross-tenant leakage is a configuration class, not a model class, and the only fix is to filter before similarity search rather than after.

The three failures. Embedding-corpus mismatch: a general-purpose embedding model on a specialised corpus (legal, biomedical, code) underperforms a domain-tuned model by 10 to 20 points on recall. Metadata gaps: no tenant filter ships and a customer reads another customer’s document. Dense-only indexing: BM25 is not indexed, identifier-heavy queries (contract numbers, error codes, drug names) miss, and nobody notices until a regulated workload audits the trace.

What the rubric sees. ContextRecall drops on the held-out probe set. A cheap pre-deployment check: 50 to 100 hand-labeled (query, expected_chunk) pairs, measure recall@k against your own corpus before committing the embedding choice. MTEB is a leaderboard, not a procurement document.

Stage 3: retrieval

Given a query, pull the candidates and reorder them. This is where the most production lift lives and where the most “we ship the LangChain default” damage compounds.

Picks. Dense retrieval embeds the query and finds top-k by cosine similarity — captures semantic similarity, misses exact-token matches. Sparse retrieval (BM25) catches exact matches the dense retriever misses (names, error codes, SKUs). Hybrid retrieval runs both and fuses the result lists with Reciprocal Rank Fusion (1 / (60 + rank) per result list, sum the contributions). Reranking reorders the top-50 or top-100 hybrid output with a cross-encoder (Cohere Rerank v3, BGE Reranker v2-m3, Mixedbread mxbai-rerank-large) to top-5 that ship into the prompt.

Three query-side patterns lift retrieval on hard queries. Query rewrite: an LLM expands the user’s query into a richer search query before embedding (“How do I cancel?” becomes “How do I cancel my subscription, refund policy, account termination process”). HyDE (Hypothetical Document Embeddings): generate a hypothetical answer with an LLM and embed that, beating raw-query embedding when queries are short and documents are long. Multi-query: generate several reformulations and fuse the results.

Pick rule. Hybrid + cross-encoder reranker is the floor. A 50 ms rerank step typically lifts answer quality by 10 to 25 percent on production traffic. Most “our embeddings are great, no reranker needed” arguments end in a postmortem six months later. The reranker landscape covers latency, cost, and accuracy tradeoffs.

The three failures. Identifier blindness: dense-only retrieval surfaces the paraphrase and misses the exact-match clause ID. Missing reranker: top-5 has the right chunk at rank 4 and the generator latches onto rank 1, the relevant content gets ignored. Lost-in-the-middle: top-10 has the right chunk at rank 5, the generator attends to the start and end of the prompt and ignores the middle.

What the rubric sees. ContextRelevance drops because the surfaced chunks do not answer the query. ChunkUtilization drops because the generator uses 2 of 10 retrieved chunks and the rest were noise. The retrieval quality monitoring post covers the production diagnostics.

Stage 4: generation

The retrieved chunks get stitched into a prompt with instructions, citations, and the query, the LLM runs, and the answer ships. Generation is the most expensive stage and the one engineers spend the most time on; it is rarely where the bug actually lives.

Picks. Model: a frontier model for complex reasoning, a cheaper model for high-volume Q&A, a fine-tuned model when the domain rewards style. Prompt template: instructions, context with chunk IDs, query, refusal path when retrieval is empty or weak. Citation contract: structured output with per-claim chunk_id and quoted span; deterministic validation that every cited span exists verbatim in the retrieval context before the response leaves the server.

Pick rule. Structured output with citation validation is the floor for any production workload where wrong citations are a compliance problem. The validator is a string match plus a fuzzy-tolerance check; a failed validation triggers retry-with-stricter-prompt or refusal, not a hand-back of an invalid answer. The cheaper the check, the more often you can run it — citation validity runs on 100 percent of production responses for negligible cost.

The three failures. Context truncation: retrieved chunks plus instructions plus query overflow the model’s context window, the silent truncation drops the most-relevant chunk (often the last one), and the unit test misses it because the unit test runs on a frozen prompt. Fabricated citations: the model invents a chunk_id that looks like the template but does not exist in the context, or cites a real chunk with a quoted span the chunk does not contain. Grounded-but-wrong: the model faithfully grounds its answer in a chunk that is stale or incorrect, Groundedness passes, FactualAccuracy fails, and the source corpus is the bug.

What the rubric sees. Groundedness catches fabrication from parametric memory. ChunkAttribution catches the fabricated-citation case. FactualAccuracy catches the grounded-but-wrong case where the source itself is the regression. The PDF QA chatbot post covers the citation validator pattern end-to-end.

The eval that catches each stage

The opinion the deep-dive earns: RAG evaluation is a bisection problem. Tag every rubric by stage in the eval store, report layer-level aggregates (retrieval_health = mean(ContextRelevance, ChunkAttribution, ChunkUtilization), generation_health = mean(Groundedness, AnswerRelevance, Completeness)), and a regression bisect collapses to five minutes.

StageRubricCatches
ChunkingContextRecallThe answer-bearing chunk is missing or mangled
IndexingContextRecall + recall@k probeEmbedding-corpus mismatch, missing BM25
RetrievalContextRelevance, ChunkUtilizationWrong chunks, over-fetch, lost-in-the-middle
GenerationGroundedness, ChunkAttributionFabrication from memory, fabricated citations
Cross-cutFactualAccuracy, citation validityStale source, hallucinated quote on real chunk

The diagnostic decision tree: ContextRelevance dropped means retrieval regressed (bisect the chunker, embedder, or reranker change). ContextRelevance steady and Groundedness dropped means the generator regressed (bisect the prompt, model, or decoding params). Groundedness steady and FactualAccuracy dropped means the source data regressed (audit the corpus). The RAG evaluation metrics deep-dive walks the tree.

When RAG is the wrong tool

RAG is not a default. Four signals say use something else.

Small static corpus. A 50-page handbook in a 200K-context model needs no vector database. Paste the whole thing into the system prompt and skip the pipeline.

Real-time data. Stock prices, sensor feeds, live weather. Call a tool or API at query time. Indexing real-time data into a vector store guarantees staleness.

Conversational state. “What did I just ask?” is memory, not retrieval. Store turns in a session store and skip vector search.

Structured queries over structured data. “Show me all customers in Q3 with churn risk above 0.7” is SQL. A Text-to-SQL agent or a knowledge graph beats vector retrieval on multi-step reasoning over structured fields.

The deeper choice is RAG vs fine-tuning. Fine-tuning bakes facts into weights and locks them to a training snapshot; RAG keeps facts outside the model and refreshes by reindexing. Fine-tuning wins for style, formatting, and reasoning patterns the model must internalise. RAG wins for facts that change slowly and live in documents. Production stacks run both: fine-tune for behaviour, RAG for current facts. The RAG vs fine-tuning decision framework and LLM eval vs fine-tuning post cover the trade in depth.

How Future AGI grounds and measures a production RAG stack

Future AGI ships the RAG eval stack as a package. Start with the SDK for code-defined evals; graduate to the Platform when you want self-improving rubrics authored by an in-product agent.

  • ai-evaluation SDK (Apache 2.0): every canonical RAG rubric as a typed EvalTemplate class — Groundedness (eval_id 47), ContextAdherence (5), ContextRelevance (9), Completeness (10), ChunkAttribution (11), ChunkUtilization (12), FactualAccuracy (66), AnswerRefusal (88). Local NLI-backed equivalents (groundedness, faithfulness, claim_support, factual_consistency, context_recall, context_precision, context_entity_recall, multi_hop_reasoning, source_attribution) run on a DeBERTa model with no API call for on-prem deployments. Thirteen guardrail backends, eight sub-10 ms Scanners, four distributed runners (Celery, Ray, Temporal, Kubernetes).
  • traceAI (Apache 2.0): 50+ AI surfaces across Python, TypeScript, Java, C#. Fourteen span kinds including first-class RETRIEVER, RERANKER, EMBEDDING. Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE for Arize Phoenix compatibility, OPENLLMETRY for Traceloop). One instrumentation layer, four downstream collectors. Sixty-two built-in evals via EvalTag for span-attached scoring on live RAG traces.
  • Future AGI Platform: self-improving evaluators tuned by thumbs-up/down feedback through a FeedbackRetriever that injects similar past corrections as few-shot examples in the judge prompt; a ThresholdCalibrator sweeps thresholds in [0.3, 0.9] over feedback entries and picks the threshold maximising F1. Classifier-backed evals at lower per-eval cost than Galileo Luna-2 for high-volume RAG-trace scoring.
  • Error Feed (inside the eval stack): HDBSCAN soft-clustering over ClickHouse-stored span embeddings groups failing RAG traces into named issues. A Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, eight span-tools including read_span, get_children, submit_finding) reads the failing trace, writes the RCA, evidence quotes, an immediate_fix, and a four-dimensional score. Fixes feed the Platform’s self-improving evaluators. Common RAG clusters: “wrong chunk retrieved, top-1 not in relevant set,” “context truncated at prompt assembly,” “citation hallucinated, chunk_id not in retrieved set,” “judge marked grounded but answer contradicts chunk content.”
  • Agent Command Center: 17 MB Go binary self-hosts in your VPC. Twenty-plus providers via six native adapters plus OpenAI-compatible presets. RBAC and SOC 2 Type II, HIPAA, GDPR, CCPA, ISO 27001 certified.

The cheap version: wire ContextRelevance, ChunkAttribution, Groundedness, and citation validity into a pytest fixture this afternoon against the SDK, then add the traceAI instrumentor when production traces start asking questions the CI gate missed. The what is RAG evaluation post and agent passes evals, fails production post cover the offline-to-online bridge.

Frequently asked questions

What is Retrieval Augmented Generation in one sentence?
Retrieval Augmented Generation is a four-stage pipeline (chunking, indexing, retrieval, generation) that pulls relevant context from a knowledge store at query time and injects it into the prompt, so the model answers from retrieved evidence rather than parametric memory alone. The stages are real failure surfaces — each one has its own choices, its own failure modes, and its own evaluation rubric — which is why a 2026 RAG explainer that does not give the methodology for picking and measuring each stage is theater.
What are the four stages of a RAG pipeline?
Chunking splits documents into retrievable units. Indexing stores those units as vectors plus metadata in a vector store, with an optional BM25 index alongside for keyword retrieval. Retrieval pulls top-k candidates for a query and reorders them with a stronger relevance model. Generation assembles the candidates into a prompt and calls the LLM. Older explainers count six to eight stages by breaking out embedding, reranking, and prompt assembly as separate items; those are sub-stages of indexing, retrieval, and generation respectively. Four is the right grain because each stage shares a failure surface and an evaluation rubric.
What are the three failure modes per stage?
Chunking fails on boundary breaks (clauses split across chunks), context loss (no section headers), and table flattening (tabular data dissolved into noise). Indexing fails on embedding-corpus mismatch (general-purpose embeddings on specialised text), metadata gaps (no tenant or date filter), and dense-only retrieval (BM25 not indexed alongside). Retrieval fails on identifier blindness (paraphrase wins, exact match loses), missing reranker (relevance order is wrong), and lost-in-the-middle (the right chunk is in the prompt but the model ignores it). Generation fails on context truncation, fabricated citations, and grounded-but-wrong (the model faithfully grounds its answer in a stale or incorrect chunk).
How is RAG different from fine-tuning?
Fine-tuning bakes facts into model weights and locks them to a training snapshot. RAG keeps facts outside the model and injects them at query time. The cost of refreshing knowledge under RAG is one embedding call per new document; under fine-tuning it is a training run plus an evaluation cycle plus a deploy. Fine-tuning still wins for style, formatting, and reasoning patterns the model must internalise. Production stacks run both: fine-tune for behaviour, RAG for current facts. The decision is not which to use but what the model is missing.
Which evaluation rubric catches which RAG failure?
Each stage has a rubric. ContextRelevance catches chunking and retrieval failures — wrong chunks surfaced or chunked at the wrong boundary. ContextRecall catches missing-chunk failures the retriever never returned. ChunkAttribution catches fabricated citations pinned to real chunk IDs. ChunkUtilization catches retrieval over-fetch when most of the context window goes unused. Groundedness catches generation failures where the model fabricates from parametric memory. ContextAdherence catches off-policy elaboration. FactualAccuracy is the cross-cutting rubric that fires when retrieval grounds the response in a stale or incorrect source. Run them tagged by stage and a regression bisect collapses to five minutes.
When should I not use RAG?
Four signals say use something else. Small static corpus: paste the whole thing into the system prompt and skip the pipeline. Real-time data (stock prices, sensor feeds, live weather): call a tool or API; indexing real-time data guarantees staleness. Conversational state: store turns in a session store, not a vector index. Structured queries over structured data: a Text-to-SQL agent or a knowledge graph beats vector retrieval. The deeper question is what the model is missing — facts in documents that change slowly is RAG's home; everything else has a better tool.
Why do RAG systems pass eval and fail in production?
Because the eval set went stale or measured the wrong layer. The corpus shifts faster than the dataset; new product names, new error codes, and new document types arrive that the eval never saw, the reranker degrades silently on out-of-distribution queries, and a generation-only rubric (Groundedness on the final answer) cannot tell a retrieval regression from a generation regression. The fix is layered: per-stage rubrics tagged by stage, span-level tracing on production traffic, weekly promotion of failing traces into the dataset, and clustering on failing traces so the named issue (not the score drop) is the unit of triage.
Related Articles
View all
Evaluating RAG Faithfulness: A 2026 Deep Dive
Guides

Why answer-level Groundedness hides RAG hallucinations, and how claim-level decomposition, cherry-pick detection, and sycophantic-restatement scoring fix it. Methodology for senior ML engineers.

NVJK Kartik
NVJK Kartik ·
11 min