RAG Architecture in 2026: Patterns, Code, and How to Evaluate
TL;DR: RAG Architecture in 2026
| Layer | 2023 default | 2026 default |
|---|---|---|
| Retrieval | Dense embeddings only | Hybrid (BM25 plus dense) with RRF |
| Reranking | None | Cross-encoder reranker on top-k |
| Query handling | Pass raw | Rewrite (HyDE or decomposition) when intent is ambiguous |
| Orchestration | Single retrieve-then-generate | Agentic loop or multi-hop chain when needed |
| Knowledge structure | Flat chunk store | Hybrid plus a graph layer for global questions |
| Evaluation | Manual spot check | Continuous groundedness + context adherence + answer relevance |
| Runtime safety | None | Inline hallucination guardrail at the boundary |
What Changed Since 2023
Four shifts moved RAG from “retrieve, generate” to a proper architecture discipline.
First, dense-only retrieval lost. Public 2024-2025 results (the BEIR and MTEB benchmarks, plus Anthropic's Contextual Retrieval post) consistently show that BM25 plus dense embeddings, fused with Reciprocal Rank Fusion, beats either method alone. A cross-encoder reranker on top adds another 5 to 15 points of MRR on hard sets.
Second, agentic RAG became real. Self-RAG (Asai et al., 2023), FLARE (Jiang et al., 2023), and the production patterns that came after move retrieval inside the agent loop instead of in front of it. The model can ask for more evidence, rewrite its query, or stop early.
Third, graph RAG showed up at scale. Microsoft Research published GraphRAG (Edge et al., 2024) with code under MIT. The technique builds an entity and relationship graph from the corpus and uses graph traversal at retrieval time. On global “what are the main themes in this corpus” queries, GraphRAG outperforms vector-only RAG by a wide margin.
Fourth, a smaller shift: evaluation got cheap. RAG-specific evaluators (groundedness, context adherence, faithfulness, answer relevance) became prebuilt and fast enough to run on every commit.
Why RAG Architecture Is the Solution to LLM Knowledge Cutoffs and Hallucinations
Three problems push teams to RAG.
- Knowledge cutoff: foundation models freeze on a training-data date. RAG fetches what is fresh.
- Hallucinations: an LLM with no evidence will confidently make things up. RAG grounds the answer in retrieved passages and lets you measure that grounding.
- Context window: even with 1M-token context windows, putting the entire corpus into the prompt is wasteful and slow. RAG keeps the prompt focused.
The result is an answer that cites its sources, can be re-verified, and updates the moment the underlying corpus updates.
Core Components of a 2026 RAG Stack
A modern RAG pipeline has six layers. Most production stacks ship all six.
- Ingestion: parse documents, chunk, normalize, deduplicate.
- Indexing: BM25 (lexical), dense embeddings (semantic), optional graph.
- Query processing: classify intent, rewrite if needed, decompose if multi-hop.
- Retrieval: hybrid search with RRF or weighted scoring across BM25 and dense.
- Reranking: cross-encoder reranker (Cohere Rerank, Voyage Rerank-2, BGE-M3) on the top-k.
- Generation + evaluation: LLM consumes context, eval scores groundedness and answer relevance, guardrail blocks unsafe outputs at the boundary.
Ingestion and chunking
Chunk size is one of the most important tuning knobs. The 2026 default is 512 to 1024 tokens with 50 to 100 tokens of overlap, then a semantic re-chunker that respects section boundaries. Anthropic’s contextual-retrieval trick (prefix every chunk with a 1-sentence summary of its parent document) lifts retrieval by 35 to 50 percent on their benchmark.
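As a concrete sketch of the default: a fixed-size chunker with overlap. This is a minimal sketch, assuming tiktoken for tokenization; the 768/64 numbers are illustrative choices, not requirements, and the semantic re-chunker and contextual prefix run after this step.

```python
import tiktoken

def chunk_text(text: str, chunk_tokens: int = 768, overlap: int = 64) -> list[str]:
    """Slide a fixed-size token window with overlap across the text."""
    enc = tiktoken.get_encoding("cl100k_base")  # match your embedding model's tokenizer
    tokens = enc.encode(text)
    chunks: list[str] = []
    step = chunk_tokens - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start : start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```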
Indexing
Run both indexes in parallel.
- BM25 via Pyserini or Tantivy.
- Dense embeddings via the MTEB leaderboard top picks (Voyage-3, OpenAI text-embedding-3-large, BGE-M3, Cohere Embed v4).
- Optional graph index via Microsoft GraphRAG (MIT) or Neo4j LLM Knowledge Graph Builder.
Hybrid retrieval with RRF
The Reciprocal Rank Fusion helper is small enough to inline; treat the bm25_search and dense_search calls as pseudocode for whatever indexes you use (Pyserini, Elasticsearch, Weaviate, pgvector).
```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse N ranked lists of doc_ids into a single score-per-doc."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused

# Pseudo-usage (replace bm25_search / dense_search with your real callers).
# bm25_top = bm25_search(query, top_k=50)
# dense_top = dense_search(query, top_k=50)
# fused = reciprocal_rank_fusion([bm25_top, dense_top])
# top_ids = sorted(fused, key=fused.get, reverse=True)[:20]
```
Cross-encoder reranker
After hybrid retrieval, rerank with a cross-encoder. Cohere Rerank v3, Voyage Rerank-2, Jina Reranker v2, and the open-source BGE-Reranker-v2 are the production-grade options. A reranker reads the query and each candidate together, so it produces a much more accurate relevance score than dense cosine similarity alone.
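A hedged sketch using the open-source option via sentence-transformers (the hosted APIs follow the same query-passage scoring shape; the model name and top_n here are illustrative):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    # Score each (query, passage) pair jointly, then keep the best top_n.
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]
```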
RAG Patterns in 2026
Classic RAG
Retrieve once, generate once. Still the right default for FAQ-style and single-document QA where the answer sits in one passage.
Hybrid + reranker
BM25 plus dense, fused, then reranked. The 2026 production baseline. Cost: one extra reranker call per query (cheap with Cohere or Voyage, ~50ms).
Query rewriting
Three flavours.
- HyDE (Gao et al., 2022): generate a hypothetical answer, embed the answer, retrieve against it.
- Step-back (Zheng et al., 2023): rewrite to a more general query, retrieve, then specialise.
- Decomposition: split a multi-part question into sub-queries, retrieve per sub-query, then combine.
```python
from openai import OpenAI

REWRITE_MODEL = "gpt-4.1"  # any current chat model

def step_back(query: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=REWRITE_MODEL,
        messages=[
            {"role": "system", "content": "Rewrite the question to a more general form that captures the underlying concept."},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content.strip()

rewrite = step_back("What was the GDP growth of the country hosting the 2024 Olympics?")
# example output: "What was the economic performance of the country that hosted the 2024 Olympics?"
```
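Decomposition follows the same shape. A hedged sketch, assuming the model returns a clean JSON array (production code should guard the parse):

```python
import json

from openai import OpenAI

def decompose(query: str, model: str = "gpt-4.1") -> list[str]:
    # Split a compound question into standalone sub-queries; retrieve per
    # sub-query, then combine the evidence before generation.
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Split the question into the minimal list of standalone sub-questions needed to answer it. Reply with a JSON array of strings only."},
            {"role": "user", "content": query},
        ],
    )
    return json.loads(response.choices[0].message.content)

# decompose("What was the GDP growth of the country hosting the 2024 Olympics?")
# plausible output: ["Which country hosted the 2024 Olympics?",
#                    "What was that country's GDP growth in 2024?"]
```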
Multi-hop retrieval
Sequential retrieval calls for compositional questions. Implementations include Self-Ask, IRCoT, and the agentic patterns below.
Agentic RAG
Retrieval is a tool the LLM can call. The agent inspects, decides, re-queries, and stops. The pattern is well covered in the agentic RAG systems guide.
Skeleton (pseudo-code; llm_decide, retrieve, and llm_answer are stubs to wire to your own model and retriever):

```python
def agentic_rag(question: str, max_hops: int = 3) -> str:
    state = {"question": question, "evidence": [], "hops": 0}
    while state["hops"] < max_hops:
        decision = llm_decide(state)       # model inspects the evidence so far
        if decision.action == "answer":    # enough evidence: stop early
            return llm_answer(state)
        chunks = retrieve(decision.query)  # otherwise fetch more
        state["evidence"].extend(chunks)
        state["hops"] += 1
    return llm_answer(state)               # hop budget exhausted
```
Graph RAG
Build an entity + relationship graph from the corpus once, traverse at retrieval time. Microsoft’s GraphRAG repo is the reference implementation. The technique shines on global “summarise the themes” queries; for local “what did the report say about X” queries, hybrid plus reranker is usually enough.
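As a toy sketch of the local-search half of the idea, using networkx (entity extraction and graph construction happen offline; Microsoft's GraphRAG adds community detection and pre-computed summaries on top, so this is illustrative, not the reference implementation):

```python
import networkx as nx

def graph_retrieve(graph: nx.Graph, query_entities: list[str], hops: int = 1) -> list[str]:
    """Pull the chunks attached to query entities and their graph neighbourhood."""
    nodes: set[str] = set()
    for entity in query_entities:
        if entity in graph:
            # Collect every node within `hops` edges (includes the entity itself).
            nodes.update(nx.single_source_shortest_path_length(graph, entity, cutoff=hops))
    chunks: list[str] = []
    for node in nodes:
        chunks.extend(graph.nodes[node].get("chunks", []))
    return chunks
```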
Evaluation: The Part Most Teams Skip
A RAG pipeline without measurement is a hallucination machine. Score four axes per release.
| Axis | Question it answers | Measure |
|---|---|---|
| Context recall | Did the retriever surface the gold passages? | Fraction of gold passages in top-k |
| Context precision | Are retrieved passages relevant? | Mean per-passage relevance score |
| Groundedness | Is every claim supported by retrieved context? | LLM judge or NLI model |
| Answer relevance | Does the answer address the question? | LLM judge |
Run all four on a labelled golden set on every commit. Future AGI ships prebuilt evaluators for each.
```python
import os

from fi.evals import evaluate

os.environ["FI_API_KEY"] = "fi-..."
os.environ["FI_SECRET_KEY"] = "..."

question = "What was France's GDP in 2024?"
retrieved_chunks = [
    "France's nominal GDP in 2024 was approximately 2.92 trillion USD per the IMF.",
    "The 2024 Summer Olympics were held in Paris, France.",
]
answer = "France's nominal GDP in 2024 was about 2.92 trillion USD."

groundedness = evaluate(
    "groundedness",
    output=answer,
    context="\n".join(retrieved_chunks),
    model="turing_flash",
)
relevance = evaluate(
    "answer_relevance",
    output=answer,
    input=question,
    model="turing_flash",
)
print(groundedness.score, relevance.score)
```
For a custom rubric (for example, “the answer must cite at least two sources”), wrap a judge:
```python
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

judge = CustomLLMJudge(
    provider=LiteLLMProvider(model="gpt-4.1"),
    name="rag-citation-judge",
    grading_criteria=(
        "Score 1 if the answer cites at least two distinct passages "
        "from the provided context; score 0 otherwise."
    ),
)

verdict = judge.evaluate({
    "input": question,
    "output": answer,
    "context": "\n".join(retrieved_chunks),
})
print(verdict.score, verdict.reason)
```
Trace every call
Wire traceAI (Apache 2.0) so every retrieval and generation lands in a dashboard. Promote production failures into the golden set on a weekly cadence.
```python
from fi_instrumentation import register, FITracer

tracer = FITracer(register(project_name="rag-prod"))

def retrieve(q: str) -> list[str]:
    """Wire your hybrid retriever here; returns the top-k passages."""
    return ["..."]

def generate(q: str, ctx: list[str]) -> str:
    """Wire your LLM call here; returns the final answer string."""
    return "..."

@tracer.chain
def rag_answer(question: str) -> dict:
    chunks = retrieve(question)
    answer = generate(question, chunks)
    return {"question": question, "answer": answer, "chunks": chunks}
```
Design Considerations
Retriever selection
| Pick | When to use |
|---|---|
| BM25 only | Code identifiers, exact-name lookup, very small corpora |
| Dense only | Paraphrase-heavy queries, multilingual concept matching |
| Hybrid + reranker | Most production cases |
| Hybrid + reranker + graph | Cross-document, global, or compositional queries |
Data preprocessing
Three checks before indexing.
- Deduplicate at the chunk level (exact and near-duplicate). Duplicates inflate retrieval scores.
- Strip boilerplate (footers, navigation) so chunks contain real content.
- Add a context prefix to each chunk (Anthropic-style contextual retrieval) when the corpus has many short, ambiguous chunks.
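A hedged sketch of that last step, assuming an OpenAI-style chat client (the prompt wording is illustrative; Anthropic's post has the canonical version):

```python
from openai import OpenAI

def contextualize_chunk(document: str, chunk: str, model: str = "gpt-4.1") -> str:
    # Prefix the chunk with a one-sentence summary situating it in its
    # parent document, so it retrieves well even out of context.
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Given a document and one chunk from it, write a single sentence situating the chunk within the document. Reply with that sentence only."},
            {"role": "user", "content": f"<document>\n{document}\n</document>\n\n<chunk>\n{chunk}\n</chunk>"},
        ],
    )
    return response.choices[0].message.content.strip() + " " + chunk
```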
Latency management
Production latency budget for a chat RAG:
| Stage | Budget |
|---|---|
| Query rewrite (optional) | 200-400 ms |
| Hybrid retrieval | 50-150 ms |
| Cross-encoder rerank | 50-150 ms |
| Generation (first token) | 200-500 ms |
| Generation (full) | 500-1500 ms |
| Groundedness eval (turing_flash) | 1-2 s (out-of-band) |
For streaming UIs, run the groundedness eval asynchronously after the first response streams, and gate the next turn if it fails.
Applications
Customer support
Hybrid retrieval over FAQs, support tickets, and product guides. The reranker is the difference between “we have docs” and “the bot quotes the right line”. Run groundedness as a post-response check; if it fires, fall back to escalation.
Content workflows
Marketing and research teams use RAG over internal knowledge plus recent web data (Bing, Tavily, You.com). The evaluation focus is answer relevance and citation accuracy. Set a “must cite source” guardrail.
Education
Personalised tutoring over textbook PDFs and class notes. Multi-hop helps when the student asks compound questions (“explain how this theorem relates to the chapter on…”).
Healthcare
Graph RAG over medical literature for clinical decision support. Groundedness is non-negotiable; a hallucinated drug dose is a patient-safety event. Wire a hallucination guardrail at the response boundary.
Production Examples
- ChatGPT search combines generation with live web retrieval; the pricing and search docs describe the architecture.
- Perplexity ships an agentic RAG product over the open web.
- Bing Chat and Google AI Overviews are large-scale RAG products with reranking and citation surfaces.
- Enterprise teams build on Azure AI Search plus a reranker, Vertex AI Search, AWS Bedrock Knowledge Bases, or self-hosted Weaviate / Qdrant / pgvector.
Challenges and Mitigations
Compute overhead
Cross-encoder reranking adds 50 to 150 ms; query rewriting adds 200 to 400 ms. Mitigate with caching (query rewrites are cheap to cache), GPU-accelerated rerankers (Voyage and Cohere are managed), and ANN indexes (HNSW, IVF-PQ).
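Rewrite caching is one line in the simple case. A sketch, assuming the step_back helper from earlier and exact-match keys (a semantic, embedding-keyed cache also catches near-duplicate phrasings):

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_step_back(query: str) -> str:
    # Exact-match memoization of the rewrite call: repeated queries
    # skip the extra 200-400 ms LLM round trip entirely.
    return step_back(query)
```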
Data quality
Garbage in, hallucinations out. Curate the corpus, version it, and run a periodic offline drift check on retrieval recall. The eval golden set should track the data version too.
Bias and stale information
Tag every chunk with a publication date and a source. The retriever can score recency; the generator can be instructed to flag stale evidence.
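One hedged way to fold recency in is to blend the fused retrieval score with an exponential decay; the half-life and blend weight below are illustrative knobs, not recommendations:

```python
import time

def recency_weight(published_ts: float, half_life_days: float = 180.0) -> float:
    # Exponential decay: a chunk exactly half_life_days old scores 0.5.
    age_days = (time.time() - published_ts) / 86_400
    return 0.5 ** (age_days / half_life_days)

def blended_score(fused_score: float, published_ts: float, alpha: float = 0.2) -> float:
    # alpha trades relevance for freshness; tune it on the golden set.
    return (1 - alpha) * fused_score + alpha * recency_weight(published_ts)
```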
How Future AGI Fits
Future AGI is the evaluation and observability companion to your RAG stack. Pick your vector DB, reranker, and embedding model from the specialists; Future AGI scores every generation on groundedness, context adherence, faithfulness, and answer relevance with prebuilt evaluators in fi.evals, traces every call through fi_instrumentation (traceAI), and runs runtime guardrails for hallucination and PII through the Agent Command Center at /platform/monitor/command-center. Both the eval SDK (ai-evaluation) and traceAI are Apache 2.0 licensed.
Closing Thought
A 2026 RAG pipeline is six layers, three orchestration patterns, and a continuous eval. The components have settled (hybrid plus reranker, optional graph, optional agentic loop). The wins now come from picking the right pattern for the workload and instrumenting it tightly. For deeper builds, see agentic RAG systems, RAG evaluation metrics, RAG hallucinations, embedding model picks, and hallucination detection tools.
Frequently asked questions
What is RAG architecture in 2026?
What is agentic RAG and how is it different from classic RAG?
Hybrid search vs dense retrieval: which should I use?
What is graph RAG and when is it worth the cost?
How do you evaluate a RAG pipeline in 2026?
What does multi-hop retrieval look like?
What is query rewriting in RAG?
How does Future AGI fit into a RAG stack?