RAG

What Is Retrieval-Augmented Generation?

An LLM pattern that retrieves external text at query time and conditions generation on it, grounding answers in source data instead of model weights.

Retrieval-Augmented Generation (RAG) is an LLM design pattern that fetches relevant passages from an external corpus at query time and injects them into the prompt before the model generates a response. A retriever, typically vector search over chunked documents, selects top-k passages, and the generator (the LLM) conditions its answer on that context. RAG lets a frozen model cite private docs, recent events, or domain-specific knowledge without retraining. It is still the dominant pattern for production LLM apps that must answer accurately over data the base model never saw, but in May 2026 the shape of “RAG” has shifted toward agentic retrieval, long-context hybrid stacks, and graph-aware retrievers, and away from the 2023 “embed-and-lookup” template most tutorials still describe.
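
The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration, not a production retriever: the corpus, the bag-of-words cosine scoring, and the helper names (`retrieve`, `build_prompt`) are all assumptions made for the example, and the final LLM call is left out.

```python
import math
import re
from collections import Counter

# Toy corpus; in a real system these would be chunks of your own documents.
CORPUS = [
    "Refunds are issued within 14 days of a return request.",
    "The premium plan includes priority support and SSO.",
    "Password resets require access to the registered email.",
]

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank every passage against the query and keep the top k.
    q = Counter(tokenize(query))
    ranked = sorted(corpus, key=lambda doc: cosine(q, Counter(tokenize(doc))), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    # Inject the retrieved passages ahead of the question, numbered for citation.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below and cite passage numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

passages = retrieve("How long do refunds take?", CORPUS, k=2)
prompt = build_prompt("How long do refunds take?", passages)
```

The resulting `prompt` is what would be sent to the generator; swapping the toy scorer for an embedding model changes `retrieve` but not the overall shape.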

Why Retrieval-Augmented Generation matters in production LLM and agent systems

A bare LLM has two structural weaknesses: its weights are frozen at a training cutoff, and it has no concept of “I don’t know, let me look it up.” Both produce confident hallucinations on anything outside its parametric memory: yesterday’s pricing change, your customer’s refund policy, the latest CVE, the patch that landed last night. RAG (originally proposed in the Lewis et al. 2020 paper) closes that gap by routing every question through a retrieval step, so the model is reasoning over text it can quote rather than guessing from priors. By May 2026, even frontier models with 1M-2M token context windows (Gemini 3.x, Claude Opus 4.7) still degrade on RULER and LongBench v2 past the 128K mark, which is why long-context did not kill RAG the way Twitter predicted in 2024; it just changed the chunk-size math.

The pain of skipping RAG hits engineers, support teams, and end users in different ways. Engineers see eval-fail rates spike on private-domain test sets. Support teams field tickets about wrong policy quotes. End users lose trust the first time an answer cites a feature that does not exist. In 2026 agent stacks the stakes compound: an agent that plans, retrieves, and acts over multiple tool calls inherits every retrieval error downstream. A missed chunk at step one becomes a wrong invoice at step five, and a trajectory check passes the final answer despite the silent corruption.

RAG is also where most LLM apps fail silently. The model still produces fluent prose even when retrieval surfaces irrelevant passages, so failures look like normal answers; the only signal is a quality drop you cannot see without traceAI instrumentation. That is why RAG and RAG evaluation ship together; one without the other is a liability. We’ve found in our 2026 evals that the most common production failure is not bad embeddings; it is chunk overlap misconfigured against chunking strategy, producing top-k results that all repeat the same fragment and miss the answer.
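
To make the chunk-overlap failure concrete, here is a small sketch that chunks a token stream with a sliding window and measures how redundant a set of retrieved chunks is. The `redundancy` measure (mean pairwise Jaccard overlap) is an assumption chosen for illustration, not a FutureAGI metric, and the example treats adjacent chunks as the top-k result.

```python
def chunk(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    # Sliding-window chunking: each chunk shares `overlap` tokens with the previous one.
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    if not tokens:
        return []
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def jaccard(a: list[str], b: list[str]) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def redundancy(top_k: list[list[str]]) -> float:
    # Mean pairwise Jaccard overlap: 0 means disjoint chunks, 1 means identical chunks.
    pairs = [(i, j) for i in range(len(top_k)) for j in range(i + 1, len(top_k))]
    if not pairs:
        return 0.0
    return sum(jaccard(top_k[i], top_k[j]) for i, j in pairs) / len(pairs)

tokens = [f"t{i}" for i in range(100)]
heavy = chunk(tokens, size=20, overlap=18)  # almost every token is repeated
light = chunk(tokens, size=20, overlap=2)   # small overlap at chunk boundaries

heavy_redundancy = redundancy(heavy[:3])
light_redundancy = redundancy(light[:3])
```

A high redundancy score on live top-k results is a hint that the retriever is spending its slots on the same fragment, which is the pattern described above.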

What changed about RAG between 2024 and May 2026

The 2023 reference architecture (naive chunk → embed → cosine top-k → stuff prompt) still works for FAQ bots but fails everywhere else. The shift shows up at every layer of the stack:

| Layer | 2023-2024 default | May 2026 production default | Why the shift |
| --- | --- | --- | --- |
| Chunking | Fixed 512-token windows, no overlap | Agentic chunking, late chunking, recursive chunking with semantic boundaries | Fixed windows shred entities and break context precision |
| Retriever | Single-vector dense search | Hybrid (BM25 + dense + ColBERT-style late interaction), often with multi-vector retrieval | Sparse beats dense on rare terms, IDs, codes; hybrid wins on BEIR + private corpora |
| Re-ranking | Optional | Mandatory; Cohere Rerank 3.5, Voyage Rerank-2, BGE-Reranker-v2-m3 | Top-k=20 plus rerank-to-3 beats top-k=3 from any single retriever |
| Generator | GPT-4 / Claude 2 stuff prompt | GPT-5.x, Claude Opus 4.7, Gemini 3.x with structured citation requirements | Frontier models cite spans natively; citation skipping is a hallucination signal |
| Eval | LLM-as-judge over final answer | Component-level: ContextRelevance, ContextPrecision, ContextRecall, Groundedness, Faithfulness, AnswerRelevancy | Single-score evals hid which side broke; component scores localize the regression |
| Orchestration | LangChain RetrievalQA | Agentic RAG: planner decides when to retrieve, what query to rewrite, how many hops | One-shot retrieval misses multi-hop questions; agents trade latency for recall |
| Memory | Stateless per-query | Adaptive knowledge-graph memory, agent workflow memory | Production agents need entity continuity across sessions |

The headline: by 2026 “RAG” is closer to “an agent with a retriever tool” than to “a vector store with a prompt template.” Anyone still benchmarking RAG with naive top-3 cosine is measuring 2023 infrastructure.
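
As a rough sketch of the hybrid-plus-rerank pattern, the snippet below fuses a sparse and a dense ranking and then reranks the fused candidates down to a short list. Reciprocal rank fusion is swapped in here as one common way to combine rankings; the source does not say which fusion method any given stack uses. The `rerank_score` callable is a hypothetical stand-in for a real reranker model.

```python
from collections import defaultdict
from typing import Callable

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document earns 1 / (k + rank) from every ranking it appears in.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def retrieve_then_rerank(
    sparse: list[str],
    dense: list[str],
    rerank_score: Callable[[str], float],
    candidates: int = 20,
    final: int = 3,
) -> list[str]:
    # Wide candidate pool from fusion, then a narrow cut by the reranker.
    fused = reciprocal_rank_fusion([sparse, dense])[:candidates]
    return sorted(fused, key=rerank_score, reverse=True)[:final]

sparse_hits = ["doc-sku-123", "doc-a", "doc-b"]  # e.g. BM25 order
dense_hits = ["doc-a", "doc-c", "doc-b"]         # e.g. embedding order
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])

# Hypothetical reranker scores, hard-coded for the example.
toy_scores = {"doc-sku-123": 0.9, "doc-a": 0.5}
top = retrieve_then_rerank(sparse_hits, dense_hits, lambda d: toy_scores.get(d, 0.1), final=2)
```

Note how `doc-sku-123`, found only by the sparse ranking, survives fusion and is promoted by the reranker; a dense-only top-3 would never have seen it.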

Agentic RAG and multi-hop retrieval

The biggest shift in 2025-2026 is moving the retrieval decision into the agent loop. Instead of one shot of vector search followed by generation, an agentic RAG workflow lets the planner decide whether to retrieve, what query to issue, when to refine, and when to call a second tool entirely. In our 2026 evals on enterprise support corpora, agentic RAG lifted ContextRecall from 0.71 to 0.88 on multi-hop questions where the answer required joining two documents, at the cost of 2-3x latency. That tradeoff is the live design decision teams are making: a routing policy inside Agent Command Center can pick “fast single-shot” for clear queries and “agentic multi-hop” for ambiguous ones, with queries scored by a query classifier and the route bound to a model fallback chain when latency exceeds budget.
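
A routing policy of this kind might look like the sketch below. Everything in it is hypothetical: the keyword-based `classify_ambiguity` stands in for a trained query classifier, the model names are placeholders, and the latency figure is an assumed p95 for the agentic path. It is not the Agent Command Center API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    mode: str   # "single_shot" or "agentic_multi_hop"
    model: str  # placeholder model name

def classify_ambiguity(query: str) -> float:
    # Hypothetical stand-in for a query classifier; returns a score in [0, 1].
    cues = ("compare", " and ", "versus", "both", "across", "between")
    padded = f" {query.lower()} "
    hits = sum(cue in padded for cue in cues)
    return min(1.0, hits / 2)

def choose_route(
    query: str,
    latency_budget_ms: int,
    ambiguity_threshold: float = 0.5,
    agentic_p95_ms: int = 6000,  # assumed latency of the multi-hop path
) -> Route:
    ambiguous = classify_ambiguity(query) >= ambiguity_threshold
    if ambiguous and agentic_p95_ms <= latency_budget_ms:
        return Route("agentic_multi_hop", "primary-model")
    if ambiguous:
        # Multi-hop would exceed the budget: take the fast path on the fallback model.
        return Route("single_shot", "fallback-model")
    return Route("single_shot", "primary-model")

clear = choose_route("What is the refund window?", latency_budget_ms=2000)
multi = choose_route("Compare the refund policy and the warranty terms", latency_budget_ms=10000)
tight = choose_route("Compare the refund policy and the warranty terms", latency_budget_ms=2000)
```

The design choice worth noting is that the latency budget, not just the classifier score, decides the route, which is the 2-3x latency tradeoff expressed as code.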

The agentic-RAG pattern also makes trajectory evaluation mandatory. A two-hop retrieval that gets the answer right after a wrong first query still wastes tokens and confuses downstream memory; TrajectoryScore plus step efficiency catch the over-retrieval pattern before it shows up in your token bill.

How FutureAGI handles RAG

FutureAGI’s approach is to instrument every layer of the RAG path, not just the final answer. On the trace side, the traceAI-llamaindex and traceAI-langchain integrations (built on the OpenTelemetry GenAI semantic conventions) capture retriever spans (query, top-k, scores, chunk text), generator spans (prompt, completion, tokens), and the wiring between them, exposing OpenTelemetry attributes like retrieval.documents, retrieval.score, and embedding.text for every request. Spans are linked through agent.trajectory.step so multi-hop agentic RAG trajectories stay readable.
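
To show what a retriever span carrying those attributes might hold, here is a plain-Python sketch. The attribute names come from the text above; the `Span` class, the `record_retrieval` helper, and the value shapes (lists of chunk texts and scores) are assumptions for illustration and are not the traceAI or OpenTelemetry API.

```python
import time
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Span:
    # Hypothetical in-memory span record, not an OpenTelemetry object.
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)
    start: float = field(default_factory=time.time)

def record_retrieval(
    trace: list[Span], step: int, query: str, hits: list[tuple[str, float]]
) -> Span:
    # One span per retrieval call; the step index keeps multi-hop trajectories ordered.
    span = Span(
        "retriever",
        {
            "agent.trajectory.step": step,
            "embedding.text": query,
            "retrieval.documents": [text for text, _ in hits],
            "retrieval.score": [score for _, score in hits],
        },
    )
    trace.append(span)
    return span

trace: list[Span] = []
record_retrieval(trace, 1, "refund window", [("Refunds are issued within 14 days.", 0.82)])
record_retrieval(trace, 2, "refund window for annual plans", [("Annual plans refund pro rata.", 0.77)])
steps = [s.attributes["agent.trajectory.step"] for s in trace]
```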

On the eval side, we score RAG along five orthogonal axes: ContextRelevance asks “did the retriever return useful passages?”; ContextPrecision asks “how much of the retrieved context was actually used?”; ContextRecall asks “did the retriever find every passage needed to answer?”; Groundedness asks “did the generator stay inside that context?”; Faithfulness asks “do all claims trace back to a cited source?”. AnswerRelevancy closes the loop on whether the response addressed the user. Unlike Ragas’ faithfulness, which collapses claims and context into one number, the FutureAGI split lets a release gate fail on ContextRecall while Groundedness still passes, which is the canonical “retriever broke, generator covered for it” failure pattern.

A typical FutureAGI workflow: a team running a Pinecone-backed LlamaIndex chain instruments with traceAI-llamaindex, samples 5% of production traces into an evaluation cohort, runs the component evaluators on each, and dashboards per-component scores. When ContextRelevance drops, the retriever is the suspect: chunking, embedding model, or top-k. When Groundedness drops while ContextRelevance holds, the generator is hallucinating despite good context, and the prompt or model is at fault. When ContextRecall drops but ContextPrecision is high, the index is missing documents, not noisy. That separation is the difference between fixing RAG in an hour and fixing it in a week. We pair this with LLM-as-a-judge CustomEvaluation rubrics for tone, refusal scope, and policy adherence, the company-specific layer no off-the-shelf evaluator can encode.
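
The triage rules in that workflow translate directly into a small decision function. This is a sketch under assumptions: scores are on a 0-1 scale, a "drop" means falling more than `tolerance` below baseline, and "high" precision means at or above `precision_floor`. The function name and thresholds are illustrative.

```python
def diagnose(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,       # assumed noise band before a change counts as a drop
    precision_floor: float = 0.8,  # assumed cutoff for "high" ContextPrecision
) -> str:
    def dropped(metric: str) -> bool:
        return baseline[metric] - current[metric] > tolerance

    if dropped("ContextRelevance"):
        return "retriever: check chunking, embedding model, or top-k"
    if dropped("Groundedness"):
        # Only reached when ContextRelevance held, so the context was good.
        return "generator: check prompt or model"
    if dropped("ContextRecall") and current["ContextPrecision"] >= precision_floor:
        return "index: documents are missing"
    return "no component regression detected"

baseline = {"ContextRelevance": 0.90, "Groundedness": 0.95, "ContextRecall": 0.88, "ContextPrecision": 0.85}
generator_case = diagnose(baseline, {**baseline, "Groundedness": 0.80})
index_case = diagnose(baseline, {**baseline, "ContextRecall": 0.70})
retriever_case = diagnose(baseline, {**baseline, "ContextRelevance": 0.70, "Groundedness": 0.80})
```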

For runtime control, the same evaluators run as post-guardrail checks inside Agent Command Center, so a low-Groundedness completion can be auto-routed to a fallback model or escalated to a human before reaching the user; see the platform/guard surface for the policy editor. The same gateway exposes semantic-cache over normalized queries (not raw prompts), pre-guardrail checks for prompt injection on the retrieved context, and traffic-mirroring so a candidate retriever can run shadow against production for a week before any switch.
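
A post-guardrail of this shape can be sketched as a simple chain: score the primary completion, try a fallback if it fails, and escalate if both fail. The generator callables, the token-overlap scorer, and the 0.8 threshold are all hypothetical stand-ins; a real deployment would plug in actual models and a real Groundedness evaluator, and might escalate without trying a fallback first.

```python
from typing import Callable, Optional

Generator = Callable[[str, str], str]   # (question, context) -> answer
Scorer = Callable[[str, str], float]    # (answer, context) -> score in [0, 1]

def guarded_answer(
    question: str,
    context: str,
    generate_primary: Generator,
    generate_fallback: Generator,
    groundedness: Scorer,
    threshold: float = 0.8,
) -> tuple[str, Optional[str]]:
    answer = generate_primary(question, context)
    if groundedness(answer, context) >= threshold:
        return ("primary", answer)
    answer = generate_fallback(question, context)
    if groundedness(answer, context) >= threshold:
        return ("fallback", answer)
    # Neither completion is supported by the context: hand off to a human.
    return ("escalate", None)

def token_overlap(answer: str, context: str) -> float:
    # Toy scorer: share of answer tokens that also appear in the context.
    answer_tokens = answer.lower().split()
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return sum(t in context_tokens for t in answer_tokens) / len(answer_tokens)

context = "refunds are issued within 14 days"
routed = guarded_answer(
    "How long do refunds take?",
    context,
    generate_primary=lambda q, c: "refunds take 90 days minimum",
    generate_fallback=lambda q, c: "refunds are issued within 14 days",
    groundedness=token_overlap,
)
escalated = guarded_answer(
    "How long do refunds take?",
    context,
    generate_primary=lambda q, c: "ask someone else",
    generate_fallback=lambda q, c: "no idea",
    groundedness=token_overlap,
)
```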

Release-gate wiring for RAG

A RAG release gate inside FutureAGI has four inputs: a baseline dataset (the last shipped index’s scores), a delta threshold per component (Groundedness may not drop more than 2 points; ContextRecall may not drop at all on safety-critical cohorts), a cohort filter (billing, policy, healthcare, refund), and a traceAI attribute filter so traces stay scoped to the release surface. The CI job runs the evaluators, posts scores back to the dataset, and either passes the build or blocks the deploy with a per-row diff. Engineers see which rows failed, which component evaluator fired, and which trace span shows the regression, not just an aggregate that moved. That is the closed loop public RAG benchmarks like BEIR cannot run for you, because they do not know your prompts, your policies, or your refusal rubric.
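
The gate logic itself is small enough to sketch. The version below is an illustration, not the FutureAGI implementation: it assumes scores on a 0-1 scale (so "2 points" becomes 0.02), rows keyed by id with a `cohort` field, and it omits the trace attribute filter.

```python
def release_gate(
    baseline: dict[str, dict],
    candidate: dict[str, dict],
    max_drop: dict[str, float],
    cohorts: set[str],
) -> list[dict]:
    # Returns one entry per (row, metric) that regressed beyond its allowed drop.
    # An empty list means the gate passes.
    failures = []
    for row_id, base in baseline.items():
        if base["cohort"] not in cohorts:
            continue
        cand = candidate[row_id]
        for metric, allowed in max_drop.items():
            drop = base[metric] - cand[metric]
            if drop > allowed + 1e-9:  # small epsilon for float noise
                failures.append(
                    {"row": row_id, "metric": metric,
                     "baseline": base[metric], "candidate": cand[metric]}
                )
    return failures

baseline = {
    "r1": {"cohort": "billing", "Groundedness": 0.95, "ContextRecall": 0.90},
    "r2": {"cohort": "marketing", "Groundedness": 0.90, "ContextRecall": 0.90},
}
candidate = {
    "r1": {"cohort": "billing", "Groundedness": 0.94, "ContextRecall": 0.85},
    "r2": {"cohort": "marketing", "Groundedness": 0.50, "ContextRecall": 0.50},
}
failures = release_gate(
    baseline,
    candidate,
    max_drop={"Groundedness": 0.02, "ContextRecall": 0.0},
    cohorts={"billing", "policy", "healthcare", "refund"},
)
```

In this example the Groundedness dip on `r1` is inside its allowance, the ContextRecall dip is not, and `r2` is ignored because its cohort is outside the filter, so a CI job would block on exactly one row.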

How to measure RAG quality

RAG is measurable end-to-end and per-component:

  • Groundedness: pass/fail on whether the answer is supported by the retrieved context; the canonical hallucination guard.
  • ContextRelevance: 0–1 score on whether the retrieved passages actually answer the input; catches retriever issues.
  • ContextPrecision: share of retrieved chunks that contributed signal; high precision with low recall means the retriever is too narrow.
  • ContextRecall: coverage of all chunks needed for a correct answer, scored against golden dataset ground truth.
  • AnswerRelevancy: whether the response addresses the user’s question semantically, not just lexically.
  • Faithfulness: claim-level decomposition with source attribution; stronger than Groundedness for long answers.
  • OTel attributes: retrieval.documents, retrieval.score, embedding.text, and agent.trajectory.step, captured by traceAI-llamaindex and traceAI-langchain for trace-level inspection.
  • Eval-fail-rate by cohort: percentage of traces where a component score falls below threshold, sliced by tenant, route, or document set.
  • Token-cost-per-trace and p99 latency: retrieval and rerank add cost; a precision win that doubles latency is a partial regression.

The component evaluators can be run row by row over a dataset:

from fi.evals import Groundedness, ContextRelevance, ContextPrecision, ContextRecall, AnswerRelevancy

# One evaluator instance per component, reused across rows.
groundedness = Groundedness()
ctx_rel = ContextRelevance()
ctx_prec = ContextPrecision()
ctx_rec = ContextRecall()
ans_rel = AnswerRelevancy()

# `dataset` is assumed to be an iterable of rows exposing
# prompt, answer, context, and expected fields.
for row in dataset:
    g = groundedness.evaluate(response=row.answer, context=row.context)
    cr = ctx_rel.evaluate(input=row.prompt, context=row.context)
    cp = ctx_prec.evaluate(context=row.context, expected=row.expected)
    rc = ctx_rec.evaluate(context=row.context, expected=row.expected)
    ar = ans_rel.evaluate(input=row.prompt, output=row.answer)
    # Store all component scores on the row so regressions can be localized later.
    row.attach_scores(groundedness=g, ctx_rel=cr, ctx_prec=cp, ctx_rec=rc, ans_rel=ar)
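
For intuition about what the retrieval-side numbers mean, here is a simplified, ID-based version of precision, recall, and eval-fail-rate by cohort. These are textbook set definitions assumed for illustration; the fi.evals evaluators may compute their scores differently (for example with an LLM judge over chunk text).

```python
from collections import defaultdict

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    # Share of retrieved chunk ids that are in the golden set.
    return sum(c in relevant for c in retrieved) / len(retrieved) if retrieved else 0.0

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    # Share of golden chunk ids that the retriever actually returned.
    return len(set(retrieved) & relevant) / len(relevant) if relevant else 1.0

def fail_rate_by_cohort(rows: list[dict], metric: str, threshold: float) -> dict[str, float]:
    # Fraction of rows per cohort whose score falls below the threshold.
    totals: dict[str, int] = defaultdict(int)
    fails: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["cohort"]] += 1
        fails[row["cohort"]] += row[metric] < threshold
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}

golden = {"c1", "c2", "c3"}
retrieved = ["c1", "c2", "c9", "c7"]
precision = context_precision(retrieved, golden)
recall = context_recall(retrieved, golden)

rows = [
    {"cohort": "billing", "ctx_rec": 0.9},
    {"cohort": "billing", "ctx_rec": 0.5},
    {"cohort": "policy", "ctx_rec": 1.0},
]
rates = fail_rate_by_cohort(rows, "ctx_rec", threshold=0.8)
```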

For continuous coverage, wire the same evaluators into a cohort-filtered regression eval over a versioned Dataset and post results back to the trace store:

from fi.datasets import Dataset
from fi.evals import Groundedness, ContextRecall, Faithfulness, AggregatedMetric

golden = Dataset.get("rag-golden", version="v34")
suite = AggregatedMetric(
    [Groundedness(threshold=0.95), ContextRecall(), Faithfulness()],
    weights=[0.4, 0.3, 0.3],
)
result = golden.add_evaluation(
    suite,
    cohorts=["billing", "policy", "kyc"],
    trace_filter={"index.version": "2026-05-11"},
    threshold=0.85,
)
assert result.passed, result.per_cohort_report

Healthy RAG: reruns are reproducible, failures are explainable per component, and score movement matches trace and feedback signals. In our 2026 evals across 14 enterprise RAG stacks, the strongest predictor of production health was not a single score; it was whether component scores remained stable across a daily replay of a 1,000-row golden dataset sampled to mirror live cohort proportions. Teams that lacked that daily replay caught regressions ~5 days late on average; teams with it caught the same regressions on the same day they shipped.

What public RAG benchmarks miss

BEIR, MS MARCO, Natural Questions, HotpotQA, and the more recent BRIGHT benchmark are all useful for retriever model selection, but every one of them was published before 2025 and is now contaminated because it sits inside embedding models’ training data. The 2026 advice: use public benchmarks to shortlist embedding and reranker models, then validate against a domain golden dataset before committing. Treat any pre-2024 retrieval suite as a tier filter, not a release gate. The FrontierMath-style “private holdout” pattern that worked for reasoning evals applies here too: keep a 500-row retrieval test set that never leaves your VPC.

Common mistakes (May 2026 edition)

  • Equating “RAG works” with “the model returned text”. Fluency is not faithfulness. Score every response with Groundedness and Faithfulness, not eyeballs.
  • Tuning chunk size without measuring ContextRelevance and ContextRecall. Teams swap 512 tokens for 256 because a 2023 blog said so, with no metric to confirm relevance improved. The 2026 default is agentic chunking or recursive chunking at semantic boundaries, not a fixed token window.
  • Treating retrieval and generation as one problem. When quality drops, you need component-level scores to know which side broke. A single end-to-end score wastes a day of debugging.
  • Skipping the reranker because “dense retrieval is good enough”. In 2026 production stacks, hybrid + rerank is table stakes; pure dense single-vector retrieval is a 2023 baseline that loses on rare terms, codes, and acronyms.
  • Long-context-only stacks (“just stuff 1M tokens”). Frontier models still degrade past 128K on RULER and LongBench v2; long context complements retrieval, it does not replace it. The 2024 hot-take “RAG is dead because of long context” did not survive 2025 production load.
  • Caching by exact prompt hash on a RAG pipeline. Retrieved context changes the prompt from one request to the next, so exact-hash keys rarely hit; use semantic-cache via Agent Command Center instead.
  • Skipping retriever evaluation entirely. Most RAG failures originate in retrieval, not the LLM, but teams over-invest in prompt tuning and never measure recall.
  • Benchmarking with public BEIR or MS MARCO only. Public retrieval benchmarks are saturated and contaminated; private golden datasets over a representative sample of production queries are the only honest test.
  • No regression eval when the embedding model upgrades. Swapping text-embedding-3-large for a newer embedding model rewrites the entire index; doing that without re-checking scores is a guaranteed silent regression.
  • Ignoring prompt injection inside retrieved context. Untrusted documents in the index can carry instructions that hijack the generator at runtime. A pre-guardrail for indirect injection on the retrieved chunks is now table stakes for any RAG over user-contributed content. The 2025 wave of indirect prompt injection incidents made this a compliance issue, not just a security one.
  • Treating every query as needing retrieval. Greeting messages, clarifications, and tool-call confirmations do not need a vector lookup; route them past the retriever with a lightweight classifier, save the token cost, and keep your trace volume down.
  • Building one retriever for every cohort. Legal, support, code, and product-doc queries have wildly different recall and precision tradeoffs; in 2026 production stacks the standard pattern is per-cohort embedding indexes routed by a query-intent classifier, not a single monolith.
  • Forgetting to version the index. When a re-ingestion pipeline runs, last week’s traces no longer reproduce. Tag every retrieval span with index.version and gate releases on version-aware regression evals.
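
Two of the mistakes above, exact-prompt caching and unversioned indexes, can be illustrated together with a cache-key sketch. This is a simplification made for the example: a real semantic cache would match queries by embedding similarity, whereas this version only normalizes the query text. The function names and the `index_version` parameter are hypothetical.

```python
import hashlib
import re

def exact_prompt_key(prompt: str) -> str:
    # Hashes the full prompt, including whatever context retrieval injected.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def normalized_query_key(query: str, index_version: str) -> str:
    # Hashes only the normalized user query, scoped to the index version so a
    # re-ingested index never serves answers cached against the old one.
    normalized = " ".join(re.findall(r"[a-z0-9]+", query.lower()))
    return hashlib.sha256(f"{index_version}|{normalized}".encode("utf-8")).hexdigest()

query_a = "What is the refund window?"
query_b = "what is the  refund window"

# Same question, but retrieval returned the chunks in a different order.
prompt_a = f"Context:\n[1] chunk-17\n[2] chunk-42\n\nQuestion: {query_a}"
prompt_b = f"Context:\n[1] chunk-42\n[2] chunk-17\n\nQuestion: {query_b}"

exact_hit = exact_prompt_key(prompt_a) == exact_prompt_key(prompt_b)
normalized_hit = (
    normalized_query_key(query_a, "2026-05-11") == normalized_query_key(query_b, "2026-05-11")
)
cross_version_hit = (
    normalized_query_key(query_a, "2026-05-11") == normalized_query_key(query_a, "2026-05-18")
)
```

Keying on the normalized query rather than the assembled prompt is what lets the cache hit at all; including the index version is what keeps those hits reproducible.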

Frequently Asked Questions

What is Retrieval-Augmented Generation (RAG)?

RAG is an LLM pattern that retrieves relevant text from an external corpus at query time and conditions generation on it, so answers are grounded in source data rather than the model's frozen training weights.

How is RAG different from fine-tuning?

Fine-tuning bakes new knowledge into model weights and requires retraining whenever the corpus changes. RAG keeps the model frozen and edits the corpus instead, which is cheaper, traceable, and updatable in seconds rather than days.

How do you measure RAG quality?

FutureAGI scores RAG end-to-end with Groundedness, AnswerRelevancy, ContextRelevance, ContextPrecision, and ContextRecall, captured on traceAI spans for every request.