
What Is RAG (Retrieval-Augmented Generation) for LLMs? A 2026 Guide

Retrieval-Augmented Generation (RAG) for LLMs in 2026: how it works, the hybrid plus reranker stack, the evaluation metrics that matter, and Future AGI as the evaluation companion for production.


Updated May 14, 2026. Retrieval-Augmented Generation is now the default architecture for any LLM application that needs to answer questions about private, fresh, or large bodies of documents. Here is how RAG actually works in 2026, the modern hybrid plus reranker stack, the evaluation metrics that predict production quality, and where Future AGI fits as the evaluation companion.

[Figure: RAG (Retrieval-Augmented Generation) architecture in 2026: chunker, embedder, vector store, retriever, reranker, generator, evaluator loop.]

TL;DR: RAG in May 2026

Component | Purpose | 2026 picks
Chunker | Split documents into retrievable units | Unstructured, LlamaIndex parsers, recursive splitter, structure-aware splitter
Embedder | Turn chunks and queries into vectors | text-embedding-3-large, Cohere Embed v4, Voyage-3-large, bge-m3 (open)
Vector store | Index and search vectors at scale | Qdrant, Pinecone, Weaviate, Milvus, pgvector
Retriever | Fetch top K candidates | Hybrid (BM25 + dense) with filters, RRF fusion
Reranker | Refine top K to top N | Cohere Rerank 3.5, Voyage Rerank-2, BGE Reranker v2
Generator | Compose the answer from context | gpt-5-2025-08-07, claude-opus-4-7, gemini-3.x, llama-4.x
Evaluator | Score retrieval + generation | Future AGI fi.evals (faithfulness, context recall, hallucination)
Tracing | See which doc produced which answer | traceAI (Apache 2.0) OTel spans

If you only read one row: the core RAG pipeline has six components (chunker, embedder, vector store, retriever, reranker, generator), with evaluator and tracing as the two production layers on top. Future AGI sits in the evaluator and tracing slots; the rest of the stack is yours to pick.

What is RAG in plain terms

Retrieval-Augmented Generation is the pattern where the LLM does not answer from its weights alone. At query time, the system retrieves relevant documents from an external store and conditions the LLM on those documents. The model then composes an answer that cites or references the retrieved content.

Three concrete benefits drive its adoption.

  • Lower hallucination. Grounding the model on retrieved facts reduces fabricated content. The fix is not perfect, and the magnitude of improvement varies heavily by domain, query type, and retrieval quality, but well-designed RAG systems consistently lower factual hallucination rates versus a base model answering from weights alone.
  • Fresh knowledge without retraining. The model can answer questions about content created after its training cutoff because the content lives in the retrieval store, not the weights.
  • Citable, auditable answers. Each answer carries a pointer to the source chunks, which is what makes RAG viable in regulated industries.

The contract is simple: the retriever’s job is to put the right document chunk in front of the model. The model’s job is to compose an answer that follows from that chunk. RAG quality is the product of both halves, and both have their own evaluation metrics.

The six components of a 2026 RAG system

1. Chunker

The chunker splits documents into retrievable units. Bad chunking is the most common reason RAG systems underperform.

  • Recursive character splitter. Default for most prose. Configure overlap to roughly 10 percent of chunk size.
  • Structure-aware splitter. Splits by Markdown headers, code blocks, table boundaries. Best for technical content.
  • Unstructured.io. Handles PDFs, HTML, DOCX with layout awareness; preserves table and section context.
  • LlamaIndex parsers. Strong for hierarchical document structures.

Chunk size is a hyperparameter: 200 to 800 tokens is the standard range. Larger chunks improve generation quality at the cost of recall. Smaller chunks improve recall but force the LLM to reason over more fragments.
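If you want the recursive splitter in code, here is a minimal sketch using LangChain's text splitter. The 512/50 numbers and the policy.md path are illustrative, and sizing here is in characters unless you supply a token-based length function.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Illustrative numbers: 512-character chunks with roughly 10 percent overlap.
document_text = open("policy.md").read()  # hypothetical input document

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
chunks = splitter.split_text(document_text)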

2. Embedder

The embedder maps chunks and queries into the same vector space. The 2026 picks:

  • OpenAI text-embedding-3-large. 3072 dimensions; strong on English; closed-source; pay-per-token.
  • Cohere Embed v4. Multilingual, with multimodal options; closed-source.
  • Voyage AI Voyage-3-large. Strong on Artificial Analysis embedding benchmarks; closed-source.
  • bge-m3 (open source). Apache 2.0; self-hostable; multilingual; competitive on MTEB.
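The key contract: chunks and queries must pass through the same model. A minimal sketch with the OpenAI embedder above; the input strings are toy values, and OPENAI_API_KEY is assumed to be set in the environment.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Chunks and the query must share one embedding space
# for cosine similarity to be meaningful.
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=["chunk one text", "chunk two text", "what is the refund window?"],
)
vectors = [item.embedding for item in response.data]  # one 3072-dim vector each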

For the deeper survey see our best embedding models 2025 and agentic RAG systems 2025 guides.

3. Vector store

The vector store indexes embeddings and serves nearest-neighbor queries at scale.

Vector store | License | Self-host | Best for
Qdrant | Apache 2.0 | Yes | Open source with a clean managed offering
Pinecone | Closed | No | Pure managed, fastest production setup
Weaviate | BSD 3-Clause | Yes | Multimodal, GraphQL API
Milvus | Apache 2.0 | Yes | Largest-scale enterprise deployments
pgvector | PostgreSQL | Yes | Already-on-Postgres stacks
Chroma | Apache 2.0 | Yes | Local-first development, prototyping
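As a local-first sketch with Chroma from the table above: the ids, documents, and metadata are toy values, and Chroma's built-in default embedder handles the vectorizing.

import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient for disk
collection = client.create_collection("docs")

# Toy corpus with metadata for filtering.
collection.add(
    ids=["c1", "c2"],
    documents=["Refunds are issued within 14 days.", "Shipping takes 3 to 5 days."],
    metadatas=[{"type": "policy"}, {"type": "shipping"}],
)

hits = collection.query(
    query_texts=["what is the refund window?"],
    n_results=1,
    where={"type": "policy"},  # metadata pre-filter
)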

For the deeper survey see our best vector databases for RAG 2026 guide.

4. Retriever (hybrid + filters)

The retriever takes a query, embeds it, and fetches the top K candidate chunks. In 2026 the retriever is almost always hybrid.

  • Dense retrieval. Cosine similarity on embeddings. Wins on paraphrase, conceptual match.
  • Sparse retrieval. BM25 or BM42. Wins on exact terms, product codes, named entities, rare vocabulary.
  • Fusion. Reciprocal Rank Fusion (RRF) combines the two ranked lists into one. Dominant 2026 default.
  • Metadata filters. Pre-filter or post-filter on document attributes (tenant, date, source) to enforce access control and recency.

Hybrid retrieval often improves context recall over dense-only retrieval, especially on corpora with proper nouns, product codes, or technical terms. The exact lift is dataset-dependent. The cost is one extra index per corpus.
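RRF itself is tiny: each document scores the sum of 1/(k + rank) over every ranked list it appears in, with k conventionally set to 60. A minimal sketch over toy doc-ID rankings:

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc IDs."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_hits = ["c3", "c1", "c7"]  # toy BM25 ranking
dense_hits = ["c1", "c5", "c3"]   # toy dense ranking
print(rrf_fuse([sparse_hits, dense_hits]))  # ['c1', 'c3', 'c5', 'c7']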

5. Reranker

The reranker takes the top 50 to 100 retrieved chunks and refines them to the top 5 to 10 that actually go into the prompt. Rerankers are slower per document than vector search, but they only score the small candidate set that retrieval returns.

  • Cohere Rerank 3.5. Closed-source, hosted.
  • Voyage Rerank-2. Closed-source, hosted.
  • BGE Reranker v2 (open source). Apache 2.0, self-hostable.
  • Cross-encoders. Custom-trained for domain quality; slower; common for high-stakes use cases.
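As a self-hostable sketch, the open BGE reranker above runs through sentence-transformers' cross-encoder interface; the query and candidates here are toy values.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

query = "what is the refund window?"
candidates = [
    "Refunds are issued within 14 days.",
    "Shipping takes 3 to 5 days.",
    "Returns require the original receipt.",
]

# A cross-encoder scores each (query, candidate) pair jointly.
scores = reranker.predict([(query, doc) for doc in candidates])
top = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)][:2]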

For the deeper survey see our best rerankers for RAG 2026 guide.

6. Generator (the LLM)

The generator takes the reranked top N and the original query, then composes the answer. The 2026 picks:

  • Closed source. gpt-5-2025-08-07 (OpenAI), claude-opus-4-7 (Anthropic), gemini-3.x (Google).
  • Open source. llama-4.x (Meta), Mistral large, Qwen 3.

The prompt template is its own art form. The 2026 conventions: pass the chunks with explicit boundaries, instruct the model to cite chunk IDs, and add a refusal clause for when the context does not contain the answer.
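A minimal sketch of such a template; the chunk-boundary markers and citation format are illustrative conventions, not a standard. The tracing example later in this article calls a build_prompt like this one.

def build_prompt(question, chunks):
    # Explicit boundaries plus chunk IDs so the model can cite sources.
    context = "\n".join(
        f"<chunk id={i}>\n{text}\n</chunk>" for i, text in enumerate(chunks)
    )
    return (
        "Answer using only the context below. Cite chunk IDs like [0].\n"
        "If the context does not contain the answer, say so and stop.\n\n"
        f"{context}\n\nQuestion: {question}"
    )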

Hybrid search, end to end

Hybrid retrieval is now table stakes. Here is the typical 2026 flow.

  1. Query in. “What is the refund policy for orders shipped after the policy update on 2026-02-15?”
  2. Sparse index. BM25 retrieves the top 100 chunks that mention “refund,” “policy,” or “2026-02-15.”
  3. Dense index. Embedding retrieves the top 100 chunks that are semantically close to the query.
  4. RRF fusion. Reciprocal Rank Fusion combines the two lists into a single ranked list.
  5. Metadata filter. Filter by document type (policy doc), tenant ID, and recency.
  6. Reranker. Cohere Rerank 3.5 (or BGE Reranker v2) scores the top 50 to 100 and returns the top 5 to 10.
  7. Generator. The LLM composes the answer with citations to the reranked chunks.
  8. Evaluator. Faithfulness, context recall, and hallucination scores run inline (turing_flash) or async (turing_large).

The reranker step is what separates serious RAG systems from prototypes. Skipping it inflates the noise in the LLM’s context window and is the most common cause of “the right answer was retrieved but the LLM still got it wrong.”

How to evaluate a RAG system

Score the two halves separately. A single end-to-end score hides which half is broken.

Retrieval-side metrics

  • Context recall. Did at least one of the gold-standard chunks make it into the top K? Most important retrieval metric.
  • Context precision. What fraction of the retrieved context is actually relevant? Drives signal-to-noise.
  • MRR (Mean Reciprocal Rank). How early does the first relevant chunk appear?
  • NDCG. Position-weighted relevance, common in IR benchmarks.
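Context recall at K and reciprocal rank are each a few lines over a labeled eval set. A minimal sketch on toy doc IDs; averaging over queries (the "mean" in MRR) is left out.

def context_recall_at_k(retrieved, gold, k):
    """1.0 if any gold chunk appears in the top K, else 0.0."""
    return float(any(doc in gold for doc in retrieved[:k]))

def reciprocal_rank(retrieved, gold):
    """Reciprocal rank of the first relevant chunk; 0.0 if none retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0

print(context_recall_at_k(["c4", "c1", "c9"], gold={"c1"}, k=3))  # 1.0
print(reciprocal_rank(["c4", "c1", "c9"], gold={"c1"}))           # 0.5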

Generation-side metrics

  • Faithfulness. Does the answer follow from the retrieved context? The single most important RAG quality metric in 2026.
  • Answer relevance. Does the answer address the question?
  • Hallucination. Does the answer contain facts not in the context?
  • Citation accuracy. If the system claims a citation, does the citation actually support the claim?

Future AGI supports these evaluation patterns through fi.evals templates plus the CustomLLMJudge rubric pattern for domain-specific quality. The full RAG evaluation example:

from fi.evals import evaluate

# Retrieval-side: did we get the right context?
context_recall = evaluate(
    "context_recall",
    output=retrieved_chunks,
    expected=gold_chunks,
)

# Generation-side: did the answer follow from the context?
faithfulness = evaluate(
    "faithfulness",
    output=llm_answer,
    context=retrieved_chunks,
)

# Hallucination: any facts not in the context?
hallucination = evaluate(
    "hallucination",
    output=llm_answer,
    context=retrieved_chunks,
)

For the deeper evaluation pattern see our RAG evaluation metrics 2025 and best RAG evaluation tools 2026 guides.

Tracing RAG with traceAI

The other half of the production story is tracing. Without spans that tie the answer to the retrieved chunks, debugging a low score takes hours of manual log inspection.

from fi_instrumentation import register, FITracer
from fi_instrumentation.fi_types import ProjectType

tracer_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="rag-prod",
)
tracer = FITracer(tracer_provider.get_tracer(__name__))


def rag_query(question):
    with tracer.start_as_current_span("rag.query") as span:
        span.set_attribute("input.value", question)

        with tracer.start_as_current_span("rag.retrieve"):
            chunks = retriever.search(question, k=100)

        with tracer.start_as_current_span("rag.rerank"):
            top = reranker.rerank(question, chunks, top_n=8)

        with tracer.start_as_current_span("rag.generate"):
            answer = llm.complete(build_prompt(question, top))

        span.set_attribute("output.value", answer)
        return answer

The FI_API_KEY and FI_SECRET_KEY environment variables ship the spans to the Future AGI dashboard, where retrieval, rerank, and generation appear as separate spans tied to the answer. When the faithfulness score drops, the dashboard surfaces the retrieved chunks for that exact trace so you can see whether the right context was there.

Common RAG failure modes

Four failure modes account for most production RAG incidents.

  • Chunking artifacts. The right answer sits across two chunks and neither retrieves cleanly. Fix: increase overlap, switch to structure-aware splitting.
  • Embedding domain mismatch. A general-purpose embedder fails on legal or medical jargon. Fix: switch embedder or fine-tune.
  • Reranker not deployed. The top 100 BM25 + dense hits go directly into the prompt; the LLM drowns in noise. Fix: add a reranker on top 50 to 100.
  • Retrieval failure masked by generator fluency. The answer looks fluent but is hallucinated because the right chunk was never retrieved. Fix: separate retrieval-side metrics from generation-side metrics in your eval suite.

For the deeper debugging pattern see our best RAG debugging tools 2026 guide.

When to skip RAG

RAG is not always the right answer.

  • Pure reasoning workloads. Math, code, common-sense reasoning. The model’s weights are the bottleneck, not the document store.
  • Behavior change. A new tone, format, or language. Fine-tuning or prompt engineering, not retrieval.
  • Small corpus that fits the context window. Some legal contracts, codebases, or specs fit in a 1M-token context. The cost and latency of a full retrieval pipeline may not pay for itself there.

For everything else (enterprise QA, customer support, document chat, research assistants), RAG is the 2026 default.

Closing: pick the stack, then add the evaluator

In 2026 the RAG stack itself is largely solved. The picks are well known. The work that separates good RAG from great RAG is the evaluation and tracing loop on top.

Future AGI is the evaluation and observability companion. fi.evals ships RAG-specific evaluators (faithfulness, context recall, context precision, hallucination, answer relevance). traceAI (Apache 2.0) instruments the retrieval and generation hops so the answer is tied to the chunks. The Agent Command Center at /platform/monitor/command-center surfaces low-score traces so the next round of chunker, retriever, or reranker tuning has a real failure set to work against.

Book a Future AGI demo to see RAG evaluation and observability in action.

Frequently asked questions

What is Retrieval-Augmented Generation (RAG)?
RAG is the pattern where a large language model is conditioned on documents fetched from an external knowledge store at query time, rather than relying solely on its pre-trained weights. The two halves are the retriever (chunker plus embedder plus vector index plus optional reranker) and the generator (the LLM). RAG reduces hallucination, makes outputs traceable to source documents, and lets the system answer questions about content that did not exist at training time. The pattern is the dominant 2026 architecture for enterprise search, customer support, and document QA.
How is RAG different from fine-tuning?
Fine-tuning updates the model weights so the model memorizes new behavior. RAG keeps the weights static and changes which documents the model sees at inference. Fine-tuning is the right pick when you need a new style, format, or capability the base model lacks. RAG is the right pick when you need fresh facts, citable sources, or per-tenant access control. Most production stacks use both: fine-tuning or instruction tuning for behavior, RAG for facts. RAG also avoids the cost and rollback risk of retraining.
What is the standard RAG stack in May 2026?
Chunker (Unstructured, LlamaIndex parsers, or recursive splitter), embedder (OpenAI text-embedding-3-large, Cohere Embed v4, Voyage AI Voyage-3-large, or self-hosted bge-m3), vector store (Qdrant, Pinecone, Weaviate, pgvector, Milvus), retriever with hybrid BM25 plus dense plus filters, reranker (Cohere Rerank 3.5, Voyage Rerank-2, BGE Reranker v2), and an LLM generator (gpt-5-2025-08-07, claude-opus-4-7, gemini-3.x, llama-4.x). Pair with a RAG evaluator (faithfulness, context recall, context precision) running continuously on production traces.
How do you evaluate a RAG system?
Score two halves separately. On retrieval: context recall (did the right chunk make it into the top K?), context precision (how much of the retrieved context is actually relevant?), MRR, and NDCG. On generation: faithfulness (does the answer follow from the retrieved context?), answer relevance, hallucination (does the answer contain facts not in the context?), and citation accuracy. Future AGI supports these evaluation patterns through fi.evals templates plus the CustomLLMJudge rubric pattern for domain-specific quality.
What is hybrid search and why does it matter for RAG?
Hybrid search combines dense vector search (semantic similarity via embeddings) with sparse keyword search (BM25 or BM42). Dense search wins on paraphrase and conceptual match. Sparse search wins on exact terms, product codes, names, and rare vocabulary. Most production RAG systems use a weighted fusion (Reciprocal Rank Fusion is the dominant pattern) plus a reranker on the top 50 to 100 hits. Hybrid retrieval often improves context recall over dense-only retrieval, especially on enterprise corpora with proper nouns, product codes, or technical terms, though the exact lift is dataset-dependent.
What changed in RAG between 2024 and 2026?
Five shifts. First, hybrid retrieval became default; dense-only is now a smell. Second, rerankers consolidated to a small set (Cohere Rerank 3.5, Voyage Rerank-2, BGE Reranker v2). Third, long-context models (1M token windows) made some RAG cases optional, but the cost and latency math still favors retrieval. Fourth, agentic RAG (the model decides when and what to retrieve) replaced single-shot retrieval for complex questions. Fifth, continuous RAG evaluation in production replaced one-off offline benchmarks as the QA backbone.
Does Future AGI sell a vector database, embedder, or reranker?
No. Future AGI does not sell embedding models, vector databases, or rerankers. Future AGI is the evaluation and observability companion for whichever retrieval stack you pick. fi.evals ships RAG-specific evaluators (faithfulness, context recall, context precision, hallucination, answer relevance). traceAI (Apache 2.0) instruments the retrieval and generation hops so you can see which document the answer came from. The Agent Command Center at /platform/monitor/command-center surfaces low-score traces for the next round of chunker, retriever, or reranker tuning.
When should you not use RAG?
Three cases. First, when the answer depends on the model's reasoning over general knowledge rather than your private documents (math, code, common-sense reasoning). Second, when the full corpus fits in the model's context window cheaply (long-context models with 1M tokens make some cases tractable without a retriever). Third, when you need a behavior change (a new tone, format, language), in which case fine-tuning or prompt engineering is the right lever. For everything else (enterprise QA, customer support, document chat, research assistants), RAG is the 2026 default.