
RAG Architecture in 2026: Patterns, Code, and How to Evaluate

RAG architecture in 2026: agentic RAG, multi-hop, query rewriting, hybrid search, reranking, graph RAG. Real code plus Context Adherence and Groundedness eval.


TL;DR: RAG Architecture in 2026

| Layer | 2023 default | 2026 default |
|---|---|---|
| Retrieval | Dense embeddings only | Hybrid (BM25 plus dense) with RRF |
| Reranking | None | Cross-encoder reranker on top-k |
| Query handling | Pass raw | Rewrite (HyDE or decomposition) when intent is ambiguous |
| Orchestration | Single retrieve-then-generate | Agentic loop or multi-hop chain when needed |
| Knowledge structure | Flat chunk store | Hybrid plus a graph layer for global questions |
| Evaluation | Manual spot check | Continuous groundedness + context adherence + answer relevance |
| Runtime safety | None | Inline hallucination guardrail at the boundary |

What Changed Since 2023

Four shifts moved RAG from “retrieve, generate” to a proper architecture discipline.

First, dense-only retrieval lost. Almost every public 2024-2025 benchmark (BEIR, MTEB, the Anthropic Contextual Retrieval post) shows that BM25 plus dense embeddings, fused with Reciprocal Rank Fusion, beats either alone. A cross-encoder reranker on top adds another 5 to 15 points of MRR on hard sets.

Second, agentic RAG became real. Self-RAG (Asai et al., 2023), FLARE (Jiang et al., 2023), and the production patterns that came after move retrieval inside the agent loop instead of in front of it. The model can ask for more evidence, rewrite its query, or stop early.

Third, graph RAG showed up at scale. Microsoft Research published GraphRAG (Edge et al., 2024) with code under MIT. The technique builds an entity and relationship graph from the corpus and uses graph traversal at retrieval time. On global “what are the main themes in this corpus” queries, GraphRAG outperforms vector-only RAG by a wide margin.

Fourth (smaller), evaluation got cheap. RAG-specific evaluators (groundedness, context adherence, faithfulness, answer relevance) became prebuilt and fast enough to run on every commit.

Why RAG Architecture Is the Solution to LLM Knowledge Cutoffs and Hallucinations

Three problems push teams to RAG.

  • Knowledge cutoff: foundation models freeze on a training-data date. RAG fetches what is fresh.
  • Hallucinations: an LLM with no evidence will confidently make up claims. RAG grounds the answer on retrieved passages and lets you measure the grounding.
  • Context window: even with 1M-token context windows, putting the entire corpus into the prompt is wasteful and slow. RAG keeps the prompt focused.

The result is an answer that cites its sources, can be re-verified, and updates the moment the underlying corpus updates.

Core Components of a 2026 RAG Stack

A modern RAG pipeline has six layers. Most production stacks ship all six.

  1. Ingestion: parse documents, chunk, normalize, deduplicate.
  2. Indexing: BM25 (lexical), dense embeddings (semantic), optional graph.
  3. Query processing: classify intent, rewrite if needed, decompose if multi-hop.
  4. Retrieval: hybrid search with RRF or weighted scoring across BM25 and dense.
  5. Reranking: cross-encoder reranker (Cohere Rerank, Voyage Rerank-2, BGE-M3) on the top-k.
  6. Generation + evaluation: LLM consumes context, eval scores groundedness and answer relevance, guardrail blocks unsafe outputs at the boundary.

Ingestion and chunking

Chunk size is one of the most important tuning knobs. The 2026 default is 512 to 1024 tokens with 50 to 100 tokens of overlap, then a semantic re-chunker that respects section boundaries. Anthropic’s contextual-retrieval trick (prefix every chunk with a short, chunk-specific context sentence drawn from its parent document) cuts retrieval failure rate by roughly 35 to 49 percent on their benchmark.
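
A minimal sketch of that default, with whitespace words standing in for tokens (a production chunker would use a real tokenizer such as tiktoken and respect section boundaries):

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 80) -> list[str]:
    """Fixed-size chunking with overlap; word counts stand in for token counts."""
    words = text.split()
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks

# Contextual-retrieval variant: prepend a one-sentence, document-specific context
# line to each chunk before indexing (summary generation not shown here).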

Indexing

Run both indexes in parallel over the same chunks: a lexical BM25 index and a dense embedding index.
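
A minimal sketch, assuming rank_bm25 and sentence-transformers (the model name and sample chunks are illustrative); these helpers return chunk indices and can stand in for the bm25_search and dense_search calls used in the fusion snippet below.

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

chunks = [
    "France's nominal GDP in 2024 was approximately 2.92 trillion USD per the IMF.",
    "The 2024 Summer Olympics were held in Paris, France.",
]

# Lexical index: BM25 over whitespace-tokenized chunks.
bm25 = BM25Okapi([c.lower().split() for c in chunks])

# Semantic index: dense embeddings over the same chunks.
encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
chunk_vecs = encoder.encode(chunks, normalize_embeddings=True)


def bm25_search(query: str, top_k: int = 50) -> list[int]:
    scores = bm25.get_scores(query.lower().split())
    return list(np.argsort(scores)[::-1][:top_k])


def dense_search(query: str, top_k: int = 50) -> list[int]:
    q_vec = encoder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec
    return list(np.argsort(scores)[::-1][:top_k])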

Hybrid retrieval with RRF

The Reciprocal Rank Fusion helper is small enough to inline; treat the bm25_search and dense_search calls as pseudocode for whatever indexes you use (Pyserini, Elasticsearch, Weaviate, pgvector).

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse N ranked lists of doc_ids into a single score-per-doc."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused


# Pseudo-usage (replace bm25_search / dense_search with your real callers).
# bm25_top = bm25_search(query, top_k=50)
# dense_top = dense_search(query, top_k=50)
# fused = reciprocal_rank_fusion([bm25_top, dense_top])
# top_ids = sorted(fused, key=fused.get, reverse=True)[:20]

Cross-encoder reranker

After hybrid retrieval, rerank with a cross-encoder. Cohere Rerank v3, Voyage Rerank-2, Jina Reranker v2, and the open-source BGE-Reranker-v2 are the production-grade options. A reranker reads the query and each candidate together, so it produces a much more accurate relevance score than dense cosine similarity alone.
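
A minimal sketch with the open-source option, assuming sentence-transformers' CrossEncoder and the BGE reranker checkpoint; hosted rerankers expose the same query-plus-passages shape.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")  # model name is illustrative


def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    # The cross-encoder reads each (query, passage) pair jointly, unlike bi-encoder cosine.
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]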

RAG Patterns in 2026

Classic RAG

Retrieve once, generate once. Still the right default for FAQ-style and single-document QA where the answer sits in one passage.

Hybrid + reranker

BM25 plus dense, fused, then reranked. The 2026 production baseline. Cost: one extra reranker call per query (cheap with Cohere or Voyage, ~50ms).

Query rewriting

Three flavours.

  • HyDE (Gao et al., 2022): generate a hypothetical answer, embed the answer, retrieve against it.
  • Step-back (Zheng et al., 2023): rewrite to a more general query, retrieve, then specialise.
  • Decomposition: split a multi-part question into sub-queries, retrieve per sub-query, then combine.
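
A minimal step-back rewrite with the OpenAI SDK (the system prompt and model name are illustrative):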
from openai import OpenAI

REWRITE_MODEL = "gpt-4.1"  # any current chat model


def step_back(query: str) -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=REWRITE_MODEL,
        messages=[
            {"role": "system", "content": "Rewrite the question to a more general form that captures the underlying concept."},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content.strip()


rewrite = step_back("What was the GDP growth of the country hosting the 2024 Olympics?")
# example output: "How is a country's GDP growth for a given year measured and reported?"

Multi-hop retrieval

Sequential retrieval calls for compositional questions. Implementations include Self-Ask, IRCoT, and the agentic patterns below.
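
A minimal fixed two-hop sketch, where llm and retrieve are hypothetical stand-ins for a chat-model call and the hybrid retriever; the agentic loop below replaces this fixed plan with model-driven decisions.

def two_hop_answer(question: str) -> str:
    # Hop 1: resolve the bridge fact (e.g. "which country hosted the 2024 Olympics?").
    sub_query = llm(f"State the single sub-question that must be answered first: {question}")
    hop1_context = "\n".join(retrieve(sub_query))
    bridge = llm(f"Context:\n{hop1_context}\n\nAnswer briefly: {sub_query}")

    # Hop 2: re-ask the original question with the bridge fact attached.
    hop2_context = "\n".join(retrieve(f"{question} (known: {bridge})"))
    return llm(f"Context:\n{hop2_context}\n\nAnswer: {question}")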

Agentic RAG

Retrieval is a tool the LLM can call. The agent inspects, decides, re-queries, and stops. The pattern is well covered in the agentic RAG systems guide.

Skeleton (pseudo-code):

def agentic_rag(question: str, max_hops: int = 3) -> str:
    # llm_decide, retrieve, and llm_answer are stand-ins for your planner, retriever, and generator calls.
    state = {"question": question, "evidence": [], "hops": 0}
    while state["hops"] < max_hops:
        decision = llm_decide(state)
        if decision.action == "answer":
            return llm_answer(state)
        chunks = retrieve(decision.query)
        state["evidence"].extend(chunks)
        state["hops"] += 1
    return llm_answer(state)

Graph RAG

Build an entity + relationship graph from the corpus once, traverse at retrieval time. Microsoft’s GraphRAG repo is the reference implementation. The technique shines on global “summarise the themes” queries; for local “what did the report say about X” queries, hybrid plus reranker is usually enough.
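
As a toy illustration of the traversal idea only (not Microsoft's GraphRAG pipeline, which adds LLM entity extraction and community summaries), a small networkx graph can already assemble cross-document context:

import networkx as nx

G = nx.Graph()
# The triples would normally come from an LLM entity-extraction pass over the corpus.
G.add_edge("2024 Summer Olympics", "Paris", relation="hosted_in")
G.add_edge("Paris", "France", relation="capital_of")
G.add_edge("France", "GDP 2024: ~2.92 trillion USD", relation="has_statistic")


def graph_context(entity: str, hops: int = 2) -> list[str]:
    # Pull the entity's neighborhood and serialize the edges as context lines.
    neighborhood = nx.ego_graph(G, entity, radius=hops)
    return [f"{u} -[{data['relation']}]-> {v}" for u, v, data in neighborhood.edges(data=True)]


print(graph_context("2024 Summer Olympics", hops=3))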

Evaluation: The Part Most Teams Skip

A RAG pipeline without measurement is a hallucination machine. Score four axes per release.

| Axis | Question it answers | Measure |
|---|---|---|
| Context recall | Did the retriever surface the gold passages? | Fraction of gold passages in top-k |
| Context precision | Are retrieved passages relevant? | Mean per-passage relevance score |
| Groundedness | Is every claim supported by retrieved context? | LLM judge or NLI model |
| Answer relevance | Does the answer address the question? | LLM judge |

Run all four on a labelled golden set on every commit. Future AGI ships prebuilt evaluators for each.

import os
from fi.evals import evaluate

os.environ["FI_API_KEY"] = "fi-..."
os.environ["FI_SECRET_KEY"] = "..."

question = "What was France's GDP in 2024?"
retrieved_chunks = [
    "France's nominal GDP in 2024 was approximately 2.92 trillion USD per the IMF.",
    "The 2024 Summer Olympics were held in Paris, France.",
]
answer = "France's nominal GDP in 2024 was about 2.92 trillion USD."

groundedness = evaluate(
    "groundedness",
    output=answer,
    context="\n".join(retrieved_chunks),
    model="turing_flash",
)
relevance = evaluate(
    "answer_relevance",
    output=answer,
    input=question,
    model="turing_flash",
)
print(groundedness.score, relevance.score)

For a custom rubric (for example, “the answer must cite at least two sources”), wrap a judge:

from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

judge = CustomLLMJudge(
    provider=LiteLLMProvider(model="gpt-4.1"),
    name="rag-citation-judge",
    grading_criteria=(
        "Score 1 if the answer cites at least two distinct passages "
        "from the provided context; score 0 otherwise."
    ),
)

verdict = judge.evaluate({
    "input": question,
    "output": answer,
    "context": "\n".join(retrieved_chunks),
})
print(verdict.score, verdict.reason)

Trace every call

Wire traceAI (Apache 2.0) so every retrieval and generation lands in a dashboard. Promote production failures into the golden set on a weekly cadence.

from fi_instrumentation import register, FITracer

tracer = FITracer(register(project_name="rag-prod"))


def retrieve(q: str) -> list[str]:
    """Wire your hybrid retriever here; returns the top-k passages."""
    return ["..."]


def generate(q: str, ctx: list[str]) -> str:
    """Wire your LLM call here; returns the final answer string."""
    return "..."


@tracer.chain
def rag_answer(question: str) -> dict:
    chunks = retrieve(question)
    answer = generate(question, chunks)
    return {"question": question, "answer": answer, "chunks": chunks}

Design Considerations

Retriever selection

| Pick | When to use |
|---|---|
| BM25 only | Code identifiers, exact-name lookup, very small corpora |
| Dense only | Paraphrase-heavy queries, multilingual concept matching |
| Hybrid + reranker | Most production cases |
| Hybrid + reranker + graph | Cross-document, global, or compositional queries |

Data preprocessing

Three checks before indexing.

  • Deduplicate at the chunk level (exact and near-duplicate); duplicates inflate retrieval scores. A minimal hashing sketch follows this list.
  • Strip boilerplate (footers, navigation) so chunks contain real content.
  • Add a context prefix to each chunk (Anthropic-style contextual retrieval) when the corpus has many short, ambiguous chunks.
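
A minimal exact-duplicate filter via normalized hashing; near-duplicate detection would add MinHash, SimHash, or embedding-similarity thresholds on top.

import hashlib
import re


def dedupe_chunks(chunks: list[str]) -> list[str]:
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        # Normalize case and whitespace so trivial variants hash identically.
        key = hashlib.sha1(re.sub(r"\s+", " ", chunk.lower()).strip().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique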

Latency management

Production latency budget for a chat RAG:

| Stage | Budget |
|---|---|
| Query rewrite (optional) | 200-400 ms |
| Hybrid retrieval | 50-150 ms |
| Cross-encoder rerank | 50-150 ms |
| Generation (first token) | 200-500 ms |
| Generation (full) | 500-1500 ms |
| Groundedness eval (turing_flash) | 1-2 s (out-of-band) |

For streaming UIs, run groundedness asynchronously after the first stream and gate the next turn if it fails.
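
A minimal sketch of that out-of-band pattern with asyncio; score_groundedness is a hypothetical wrapper around the evaluator call shown earlier, and the 0.7 threshold is an assumption to tune.

import asyncio


async def answer_turn(question: str, session: dict) -> str:
    chunks = retrieve(question)          # your hybrid retriever
    answer = generate(question, chunks)  # stream this to the user immediately

    async def score_in_background() -> None:
        # Run the evaluator off the hot path in a worker thread.
        score = await asyncio.to_thread(score_groundedness, answer, chunks)
        session["blocked"] = score < 0.7  # gate the *next* turn, not this one

    asyncio.create_task(score_in_background())
    return answer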

Applications

Customer support

Hybrid retrieval over FAQs, support tickets, and product guides. The reranker is the difference between “we have docs” and “the bot quotes the right line”. Run groundedness as a post-response check; if it fires, fall back to escalation.

Content workflows

Marketing and research teams use RAG over internal knowledge plus recent web data (Bing, Tavily, You.com). The evaluation focus is answer relevance and citation accuracy. Set a “must cite source” guardrail.

Education

Personalised tutoring over textbook PDFs and class notes. Multi-hop helps when the student asks compound questions (“explain how this theorem relates to the chapter on…”).

Healthcare

Graph RAG over medical literature for clinical decision support. Groundedness is non-negotiable; a hallucinated drug dose is a patient-safety event. Wire a hallucination guardrail at the response boundary.

Challenges and Mitigations

Compute overhead

Cross-encoder reranking adds 50 to 150 ms; query rewriting adds 200 to 400 ms. Mitigate with caching (rewrites are cheap to cache), GPU-accelerated or managed rerankers (Voyage and Cohere are hosted), and ANN indexes (HNSW, IVF-PQ).
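
A minimal sketch of the cheapest mitigation, memoizing the rewrite call from the step_back example above:

from functools import lru_cache


@lru_cache(maxsize=10_000)
def cached_step_back(query: str) -> str:
    # Identical queries skip the extra LLM round-trip entirely.
    return step_back(query)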

Data quality

Garbage in, hallucinations out. Curate the corpus, version it, and run a periodic offline drift check on retrieval recall. The eval golden set should track the data version too.

Bias and stale information

Tag every chunk with a publication date and a source. The retriever can score recency; the generator can be instructed to flag stale evidence.
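
A minimal sketch of recency scoring, decaying the fused retrieval score by document age (the one-year half-life is a tuning assumption):

from datetime import date


def recency_weighted(score: float, published: date, half_life_days: float = 365.0) -> float:
    # Halve the retrieval score for every half_life_days of document age.
    age_days = (date.today() - published).days
    return score * 0.5 ** (age_days / half_life_days)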

How Future AGI Fits

Future AGI is the evaluation and observability companion to your RAG stack. Pick your vector DB, reranker, and embedding model from the specialists; Future AGI scores every generation on groundedness, context adherence, faithfulness, and answer relevance with prebuilt evaluators in fi.evals, traces every call through fi_instrumentation (traceAI), and runs runtime guardrails for hallucination and PII through the Agent Command Center at /platform/monitor/command-center. Both the eval SDK (ai-evaluation) and traceAI are Apache 2.0 licensed.

Closing Thought

A 2026 RAG pipeline is six layers, three orchestration patterns, and a continuous eval. The components have settled (hybrid plus reranker, optional graph, optional agentic loop). The wins now come from picking the right pattern for the workload and instrumenting it tightly. For deeper builds, see agentic RAG systems, RAG evaluation metrics, RAG hallucinations, embedding model picks, and hallucination detection tools.

Frequently asked questions

What is RAG architecture in 2026?
RAG (retrieval-augmented generation) in 2026 is a family of patterns that fetch external knowledge at inference time and ground the LLM's answer on it. The 2026 default is no longer the 2023 retrieve-once-then-generate pattern. Production stacks combine query rewriting, hybrid search (BM25 plus dense embeddings), a cross-encoder reranker, and a generator that can re-ask the retriever (agentic RAG). For complex queries, multi-hop retrieval and graph RAG sit on top. Evaluation tracks groundedness, context adherence, and retrieval recall at every release.
What is agentic RAG and how is it different from classic RAG?
Classic RAG runs a single retrieval call, then one generation. Agentic RAG treats retrieval as a tool the LLM can call multiple times within one turn. The agent inspects the question, decides whether it needs more evidence, issues a new search with a rewritten query, and only generates the final answer when it has enough context. The cost is more tokens and higher latency; the benefit is fewer hallucinations on multi-hop questions and the ability to handle queries the original retriever cannot answer in one pass.
Hybrid search vs dense retrieval: which should I use?
Hybrid search (BM25 plus dense embeddings, fused with Reciprocal Rank Fusion or weighted scoring) outperforms pure dense retrieval on most production benchmarks. Dense retrieval wins on paraphrase and concept matching. BM25 wins on rare terms, code identifiers, and exact-name lookups. Fusing the two captures both. The standard implementation is BM25 plus an off-the-shelf dense encoder (Cohere, OpenAI, BGE) plus RRF or weighted scoring, followed by a cross-encoder reranker. The reranker delivers most of the precision gain in practice.
What is graph RAG and when is it worth the cost?
Graph RAG builds an entity and relationship graph from the corpus and uses graph traversal at retrieval time to assemble context that a flat vector store cannot. Microsoft's GraphRAG paper (Edge et al., 2024) showed that local-question performance is comparable to vector RAG, but global questions ("summarise the themes across the corpus") are materially better. The cost is graph construction (LLM tokens for entity extraction, plus storage) and slower retrieval. Use graph RAG when global, cross-document questions matter; otherwise hybrid plus reranker is enough.
How do you evaluate a RAG pipeline in 2026?
Score four axes per release. Context recall: did the retriever surface the gold passages? Context precision: how many retrieved passages are actually relevant? Groundedness: is every claim in the answer supported by retrieved context? Answer relevance: does the answer address the question? Future AGI ships prebuilt evaluators for groundedness, context adherence, faithfulness, and answer relevance on the `turing_flash` (~1-2s), `turing_small` (~2-3s), and `turing_large` (~3-5s) judge tiers. Run them on a labelled golden set on every commit; promote production failures into the golden set with traceAI.
What does multi-hop retrieval look like?
Multi-hop retrieval issues multiple sequential retrieval calls, each one informed by the previous result. The pattern works for questions like 'what is the GDP of the country where the 2024 Olympics were held'. Hop 1 retrieves '2024 Olympics host country' (Paris, France). Hop 2 retrieves 'France GDP 2024'. The orchestrator can be a fixed pipeline (Self-RAG, FLARE) or an LLM agent that decides whether to hop. Eval is harder because retrieval recall has to be measured across the full chain, not a single call.
What is query rewriting in RAG?
Query rewriting expands or rephrases the user's question before retrieval so the retriever matches better. Common patterns are HyDE (generate a hypothetical answer, embed that), step-back prompting (rewrite to a more general query, retrieve, then specialise), and decomposition (split a multi-part question into sub-queries). Rewriting can lift recall by 5 to 15 percentage points on hard benchmarks but adds latency (one extra LLM call) and a new failure mode (the rewrite drifts from intent). Gate it with an eval that checks the rewrite preserves intent.
How does Future AGI fit into a RAG stack?
Future AGI sits beside the retrieval stack, not inside it. The vector DB, reranker, and embedding model are picked from specialists (Pinecone, Weaviate, Cohere, BGE, Voyage). Future AGI scores every generation on groundedness, context adherence, faithfulness, and answer relevance with prebuilt evaluators in `fi.evals`, traces every call through `fi_instrumentation` (Apache 2.0 traceAI), and runs runtime guardrails through the Agent Command Center. Production failures from traceAI flow back into the eval golden set for the next release.