RAG

What Is RAG?

RAG retrieves external context before generation so LLM answers can be grounded, current, and traceable.

RAG (retrieval-augmented generation) is a production architecture that retrieves external context before an LLM generates an answer, then asks the model to ground its response in that context. It shows up in production as retriever spans, retrieved chunks, context passed to the model, and final answer traces. FutureAGI evaluates RAG with surfaces such as RAGScore, ContextRelevance, Groundedness, and ChunkAttribution, so teams can tell whether a failure came from retrieval, generation, or unsupported claims. As of May 2026, RAG is the dominant pattern for grounding frontier models like GPT-5.x, Claude Opus 4.7, and Gemini 3.x against private or rapidly changing data. Long context windows did not kill RAG; they shifted its shape.

Why RAG matters in production LLM and agent systems

RAG fails quietly when the retriever fetches weak evidence and the generator writes fluent prose anyway. The user sees a confident answer. The trace may show a normal latency profile. The business impact appears later as wrong policy guidance, support escalations, bad citations, or an agent taking an action based on a stale document.

The pain splits across teams. Application engineers need to know whether to fix chunking, embeddings, reranking, prompt instructions, or the model. SREs watch p99 latency, token-cost-per-trace, retrieval timeout rate, and sudden changes in the number of retrieved chunks per request. Product and compliance teams care about the downstream symptom: answers that cite irrelevant documents, omit the source of a claim, or blend two policies that should never be mixed.

In 2026 multi-step pipelines, RAG is rarely a single search call before a chat response. Agentic RAG, the dominant 2026 pattern, retrieves context for planning, tool selection, customer policy checks, code execution, and human handoff summaries. One retrieval miss can become a wrong tool call several steps later. Unlike a standalone Ragas faithfulness check, production RAG evaluation needs component signals that separate retrieval quality from generation grounding and final answer usefulness, traced through every step.

The compliance dimension matters too. RAG is also the dominant mechanism for grounding LLM answers in regulated content (legal, medical, financial), and an ungrounded answer in those domains is a compliance event, not a quality event. The EU AI Act treats hallucinated grounding in a high-risk system as a documented control failure under Art. 15 robustness requirements. That moves Groundedness from “nice to have” to “release-blocking signal” for regulated RAG.

The 1M+ token context windows now standard across frontier models (Gemini 3.x at 2M, Claude Opus 4.7 at 1M, GPT-5.x at 1M) changed the operational question. Pre-2025, the question was “does my retriever recall enough?” Post-2025, it is “does my retriever return relevant and concise context, or am I paying full input-token rates to send 800K junk tokens on every request?” Long context is not a substitute for retrieval; it amplifies the cost of a bad retriever.

The 2026 cost-quality frontier for RAG

A naive 2026 RAG application can quietly cost 5–10x what a well-tuned one costs for the same quality. The drivers:

  • Embedding compute: a Cohere embed-v3 or OpenAI text-embedding-3-large call per query is cheap individually, but the cost adds up once it is multiplied across the rerank set.
  • Reranker calls: Cohere Rerank, Voyage reranker, or a Mixedbread model adds a second model call per query.
  • Context length: every retrieved chunk is paid for at input-token rates by the generator. At Claude Opus 4.7 pricing (May 2026), 50K tokens of context is about $0.75; 800K tokens is $12.
  • Long-context degradation: models lose focus past ~200K tokens for most tasks. Pay-and-degrade is the worst trade.
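
The context-length arithmetic above can be sketched in a few lines. The rate of $15 per million input tokens is not stated in the text; it is an assumption inferred from the quoted figure of about $0.75 for 50K tokens, and the function name is purely illustrative.

```python
# Assumed rate, inferred from the quoted "$0.75 for 50K tokens" figure.
ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 15.00


def context_cost_usd(
    context_tokens: int,
    usd_per_million: float = ASSUMED_USD_PER_MILLION_INPUT_TOKENS,
) -> float:
    """Input-token cost of the retrieved context for a single request."""
    if context_tokens < 0:
        raise ValueError("context_tokens must be non-negative")
    return context_tokens * usd_per_million / 1_000_000


if __name__ == "__main__":
    for tokens in (50_000, 800_000):
        print(f"{tokens:>7} tokens -> ${context_cost_usd(tokens):.2f}")
```

Under that assumed rate, the same function reproduces both quoted figures, which makes it easy to plug in a different model's price and see how quickly a loose top-k inflates per-request cost.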

In our 2026 evals, a well-tuned hybrid retriever with a tight reranker delivers 0.92 RAGScore on customer-support traffic at ~$0.04/request; a naive top-50 dump with no reranker delivers 0.86 RAGScore at ~$0.31/request. Retrieval quality is a cost lever, not just a quality lever.

Public benchmarks worth pinning to every RAG release gate: RAGTruth (18K labeled chunks; the median frontier model fails Groundedness on 5-8% of answers), RAGBench (12 RAG tasks across 6 domains, 100K+ examples), CRAG (Meta; 4400 questions stratified by difficulty and noise), and MultiHop-RAG (2556 multi-hop questions over news, where naive RAG typically scores 30-45% vs 65-75% for graph-augmented or sub-question-decomposed pipelines). On RULER (NVIDIA, 4K-128K context), frontier models drop 15-30 points between 4K and 128K on multi-hop variable tracking, which is the cleanest evidence that long context is not a substitute for retrieval.

How FutureAGI handles RAG

FutureAGI’s approach is to treat RAG as an evaluable trace, not just a prompt pattern. The anchor surface is the RAGScore evaluator from fi.evals: it combines retrieval relevance, answer grounding, and response quality into one production score. Engineers can pair it with ContextRelevance, Groundedness, ChunkAttribution, and NoiseSensitivity when they need to isolate the failing layer.

Consider an internal support assistant built with LangChain and a vector database. The application instruments the pipeline with traceAI-langchain, so each request records the user input, retriever span, retrieved chunks, generator call, model output, token usage, and latency. FutureAGI samples those traces into a Dataset and runs RAGScore on every candidate answer. If the score drops after a knowledge-base migration, the team opens the trace cohort and checks the component metrics.

A low ContextRelevance score points to search: chunk size, embedding model, top-k, filters, or reranker settings. A low Groundedness score with strong context points to generation: prompt instructions, citation formatting, or model choice. A missing ChunkAttribution signal means the answer cannot be tied back to a retrieved source. A high NoiseSensitivity score means irrelevant retrieved context is degrading reasoning; the fix is a tighter reranker, not a smarter model. The engineer then sets a metric threshold, creates a regression eval from the failing traces, and routes high-risk cases to a fallback answer or human review through Agent Command Center.
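
The triage rules in the paragraph above can be written down as a small decision function. This is a minimal sketch, not FutureAGI's implementation: the `ComponentScores` container, the `triage` function, and both cutoffs (0.80 and 0.30) are hypothetical and would need tuning against your own baselines.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComponentScores:
    context_relevance: float
    groundedness: float
    chunk_attribution: Optional[float]  # None means the signal is missing
    noise_sensitivity: float


def triage(
    scores: ComponentScores,
    low: float = 0.80,         # hypothetical "low score" cutoff
    noise_high: float = 0.30,  # hypothetical "high noise" cutoff
) -> List[str]:
    """Map component scores to the pipeline layers worth inspecting first."""
    findings: List[str] = []
    if scores.context_relevance < low:
        # Weak evidence: look at chunking, embeddings, top-k, filters, reranker.
        findings.append("retrieval")
    elif scores.groundedness < low:
        # Context is strong but the answer is not supported by it.
        findings.append("generation")
    if scores.chunk_attribution is None:
        # The answer cannot be tied back to a retrieved source.
        findings.append("attribution")
    if scores.noise_sensitivity > noise_high:
        # Irrelevant context is degrading reasoning: tighten the reranker.
        findings.append("reranker")
    return findings
```

Groundedness is only checked when ContextRelevance is healthy, mirroring the "low Groundedness with strong context" wording: when retrieval is already weak, a grounding failure is usually a symptom rather than a separate cause.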

Unlike Ragas, which scores RAG quality offline on a fixed dataset, FutureAGI ties every evaluator call back to a traceAI span, so a Groundedness regression is debuggable down to the prompt, retrieved chunks, and model version that produced it. In our 2026 evals across enterprise RAG deployments, ~58% of “RAG quality regressions” actually localised to retrieval (embedding model swap, index rebuild, chunk-size change), ~24% to generation, and the rest to prompt or model upgrades. Only component-level scoring tells those apart.

A worked example: debugging a RAG regression

A customer-support team running a Claude Sonnet 4.6-powered RAG assistant sees RAGScore drop from 0.89 to 0.78 over a weekend. No code changed. The component view tells the story:

  • ContextRelevance dropped from 0.91 to 0.71: retrieval is suspect.
  • Groundedness held at 0.93: generation is fine.
  • ChunkAttribution held at 0.95: the model is still citing retrieved chunks.
  • NoiseSensitivity rose from 0.12 to 0.34: distractors are degrading reasoning.

The fingerprint is “retriever brought back the wrong chunks.” Investigation reveals an embedding-index rebuild that ran Saturday night with a new chunking strategy (1024 tokens, no overlap). Rolling back to the previous embedding index restores ContextRelevance. The team adds a regression eval that pins ContextRelevance >= 0.85 on every index rebuild via LLM regression testing and an eval-driven gate on the index pipeline.

The total debugging time was 45 minutes. Without component-level scoring, the same incident took the same team three days to diagnose six months earlier.

2026 RAG architectures and the evaluators they need

| RAG pattern | Description | Primary evaluators | When to use |
| --- | --- | --- | --- |
| Naive RAG | Single retrieve → generate | ContextRelevance, Groundedness | Prototypes, small KBs |
| Hybrid retrieval | Dense + sparse + reranker | + ChunkAttribution, ChunkUtilization | Production search at scale |
| Agentic RAG | Agent decides when/what to retrieve | + ToolSelectionAccuracy, TaskCompletion | Multi-step assistants |
| Corrective RAG (CRAG) | Retrieve → score → re-retrieve | + NoiseSensitivity, Faithfulness | Low-tolerance domains (legal, medical) |
| Self-RAG | Model emits retrieve tokens + reflection | + ReasoningQuality, Faithfulness | Long-form generation |
| GraphRAG | KG-augmented retrieval | + ContextEntityRecall, MultiHopReasoning | Connected-data domains |
| Long-context RAG | 1M-token dump, no chunking | ContextUtilization, NoiseSensitivity | When recall > cost |
| Multimodal RAG | Image + text retrieval | + CaptionHallucination, ImageInstructionAdherence | Visual KBs, product catalogs |

Wiring RAG evaluation into release gates

A RAG release gate has the same shape as an LLM evaluation release gate: a baseline dataset, a delta threshold per component evaluator, and a cohort filter. The difference is the evaluators. For RAG we recommend ContextRelevance >= baseline - 0.03, Groundedness >= baseline - 0.02, NoiseSensitivity <= baseline + 0.05, and RAGScore >= baseline - 0.02. Each evaluator runs per cohort (billing, policy, legal, multilingual) and the gate fails the deploy if any cohort breaches threshold. The retriever index, the reranker config, the chunking strategy, and the generator prompt are all versioned in the trace store, so a regression diff can be reconstructed end-to-end.
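
The gate described above can be sketched as a pure function over per-cohort scores. The tolerances come from the recommendations in the preceding paragraph; the data layout (cohort to metric to score) and the `gate_breaches` name are assumptions made for illustration, not a FutureAGI API.

```python
from typing import Dict, List

# metric -> (allowed drift from baseline, True if higher is better)
TOLERANCES = {
    "ContextRelevance": (0.03, True),
    "Groundedness": (0.02, True),
    "NoiseSensitivity": (0.05, False),
    "RAGScore": (0.02, True),
}

Scores = Dict[str, Dict[str, float]]  # cohort -> metric -> score


def gate_breaches(baseline: Scores, candidate: Scores) -> List[str]:
    """Return one message per cohort/metric pair that breaches its tolerance."""
    breaches: List[str] = []
    for cohort, base_metrics in baseline.items():
        for metric, (tolerance, higher_is_better) in TOLERANCES.items():
            base = base_metrics[metric]
            cand = candidate[cohort][metric]
            if higher_is_better:
                ok = cand >= base - tolerance
            else:
                ok = cand <= base + tolerance
            if not ok:
                breaches.append(
                    f"{cohort}/{metric}: {cand:.2f} vs baseline {base:.2f}"
                )
    return breaches
```

A deploy script would fail the release whenever the returned list is non-empty, so a single cohort such as billing can block a change even if the global average looks healthy.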

How to measure or detect RAG quality

Measure RAG at retrieval, generation, and answer layers:

  • RAGScore returns a combined RAG quality score for a query, retrieved context, and generated answer.
  • ContextRelevance detects whether retrieved chunks answer the user’s query before generation happens.
  • Groundedness checks whether the final answer is supported by the provided context.
  • Faithfulness scores multi-claim grounding across the response.
  • ChunkAttribution verifies the answer actually cites a retrieved chunk.
  • ChunkUtilization measures how much of the retrieved chunk the model actually used.
  • NoiseSensitivity measures degradation when irrelevant context is added; a high score means your reranker is failing.
  • ContextPrecision and ContextRecall measure retrieval ranking quality and completeness.
  • MultiHopReasoning scores queries that need to chain across multiple retrieved chunks.
  • Trace signals include retriever latency, retrieved chunk count, token-cost-per-trace, citation-missing rate, and eval-fail-rate-by-cohort.
  • User proxies include thumbs-down rate on sourced answers, escalation rate after knowledge-base answers, and citation click-through.

Also separate retrieval absence from poor retrieval. If no documents are returned, alert on empty-context rate; if documents are returned but irrelevant, alert on low ContextRelevance. This prevents the dashboard mistake of mixing outages, ranking regressions, and generation hallucinations into one quality metric.
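
The distinction between retrieval absence and poor retrieval can be encoded directly in the alerting path. This is a minimal sketch: the `classify_retrieval` name, the label strings, and the 0.70 cutoff are hypothetical.

```python
from typing import Optional, Sequence


def classify_retrieval(
    chunks: Sequence[str],
    context_relevance: Optional[float],
    min_relevance: float = 0.70,  # hypothetical alert threshold
) -> str:
    """Label a request so outages and ranking regressions alert separately."""
    if len(chunks) == 0:
        # Nothing came back: count toward the empty-context rate.
        return "empty_context"
    if context_relevance is not None and context_relevance < min_relevance:
        # Documents came back but do not answer the query.
        return "low_context_relevance"
    return "ok"
```

Emitting the label as its own metric dimension keeps an index outage from showing up as a sudden drop in a blended quality score.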

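from fi.evals import RAGScore, NoiseSensitivity

# Illustrative placeholders; in production these come from the traced request.
question = "What is our refund policy?"
retrieved_chunks = ["<text of a retrieved chunk>"]
answer = "<generated answer>"

# Combined RAG quality score for the query, retrieved context, and answer.
score = RAGScore().evaluate(
    input=question,
    output=answer,
    context=retrieved_chunks,
)
# Degradation caused by irrelevant retrieved context.
noise = NoiseSensitivity().evaluate(
    input=question,
    output=answer,
    context=retrieved_chunks,
)
print(score.score, score.reason)
# Assumes NoiseSensitivity returns the same result shape as RAGScore.
print(noise.score, noise.reason)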

The important detection pattern is not one global score. Track RAGScore by dataset, retriever version, document collection, model, and prompt version so a release can fail only the cohort it actually changed.
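
One way to get per-cohort numbers is to group stored evaluation records before averaging. The sketch below assumes each record is a plain mapping with a `rag_score` field plus the cohort fields; those field names and the `rag_score_by_cohort` helper are hypothetical, not part of any SDK.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, Iterable, List, Mapping, Tuple

# Hypothetical record fields, mirroring the cohort dimensions named above.
COHORT_KEYS = ("dataset", "retriever_version", "collection", "model", "prompt_version")


def rag_score_by_cohort(
    records: Iterable[Mapping[str, object]],
    keys: Tuple[str, ...] = COHORT_KEYS,
) -> Dict[Tuple[object, ...], float]:
    """Average rag_score per cohort, where a cohort is the tuple of key values."""
    buckets: Dict[Tuple[object, ...], List[float]] = defaultdict(list)
    for record in records:
        cohort = tuple(record[k] for k in keys)
        buckets[cohort].append(float(record["rag_score"]))
    return {cohort: fmean(values) for cohort, values in buckets.items()}
```

Passing a narrower `keys` tuple, for example `("retriever_version",)`, gives the view needed to compare two index builds side by side.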

Common mistakes (May 2026 edition)

RAG problems usually come from treating the whole pipeline as one model call:

  • Scoring only the final answer. A single score hides whether retrieval, context packing, or generation failed. Run RAG evaluation at the component level.
  • Optimizing top-k without relevance labels. More chunks can increase distractors and cost while lowering Groundedness. Use NoiseSensitivity to find the optimal k for your reranker.
  • Running evals without storing retrieved chunks. The evaluator needs context; answer-only traces cannot explain the source of a failure. Pin retrieval.documents as a span attribute in traceAI.
  • Treating stale content as hallucination. The model may be grounded in an outdated document. Version the corpus and trace document timestamps. This is the leading cause of false-positive hallucination flags.
  • Skipping regression evals after reindexing. Embedding, chunking, and metadata-filter changes can alter answers even when prompts stay fixed. Pin a fixed golden dataset and run on every index rebuild.
  • Dumping a 1M-token context to skip retrieval. Long context is not a license to be lazy. It can cost 100x more per request and lowers Groundedness because the model loses focus. Retrieve first, then expand if needed.
  • Mixing the retrieval index for PII and public data. A retrieved CRM ticket can land in a public-facing answer. Filter and audit by document sensitivity at retrieval time.
  • Self-evaluating with the generator model. Self-evaluation inflates scores. Pin the judge to a different model family. See the llm-as-a-judge entry for details.
  • Skipping chunk overlap tuning. Zero-overlap chunking drops information at boundaries; too much overlap inflates the index and hurts retrieval precision. Sweep overlap and score with ChunkUtilization.
  • Choosing an embedding model by leaderboard rank alone. MTEB and BEIR rankings are useful filters; the right embedding for your domain is the one with the highest ContextRelevance on your golden dataset. Run a head-to-head on 200 rows before you commit.
  • Treating RAG architecture as final. Agentic-RAG, GraphRAG, and corrective-RAG patterns each unlock new performance ceilings; re-evaluate quarterly as the patterns mature.
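
To make the chunk-overlap point from the list above concrete, here is a minimal sliding-window chunker over a pre-tokenized sequence. It is a sketch under simple assumptions (fixed window size, tokens already split); the `chunk_with_overlap` name is hypothetical, and real pipelines often split on sentence or section boundaries instead.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_with_overlap(tokens: Sequence[T], size: int, overlap: int) -> List[List[T]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the sequence
    return chunks
```

A sweep would call this with several `overlap` values, rebuild a test index for each, and compare the resulting evaluator scores on the golden dataset; that scoring step is not shown here.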

A useful review question is simple: can you name the exact retrieved chunk that made the answer pass or fail? If not, your tracing is the bug.

2026 RAG architectural choices that pay off

A handful of architectural choices consistently differentiate good 2026 RAG systems from mediocre ones:

  • Hybrid retrieval (dense + BM25 + reranker) beats dense-only on every benchmark we’ve run since late 2024. BM25 catches the exact-keyword queries that embeddings miss.
  • Late-interaction models (such as ColBERT) help when your corpus has high lexical overlap and the dense embedding loses discriminative power.
  • Query rewriting with a small model (a Llama 4 8B or Claude Haiku 4.5 call) before retrieval improves recall by 8–15 points on multi-clause user queries.
  • Per-chunk metadata filters (document type, recency, access scope) cut down the candidate set faster than reranking and prevent stale-doc grounding.
  • Reranker tuned to your domain beats a generic reranker by ~10 points on ContextPrecision. The investment pays off within a quarter of running it in production.
  • Caching at the question-embedding layer (with EmbeddingSimilarity-based semantic cache) cuts retrieval latency by 30–50% on repeat queries.
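
As an illustration of the hybrid retrieval item in the list above, the sketch below merges a dense ranking and a BM25 ranking with reciprocal rank fusion (RRF). RRF is one common fusion method and is an assumption here; the text does not say which fusion step is used. The input rankings are hypothetical lists of document IDs, and the reranker that would follow is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k = 60 is the commonly used default.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    dense_ranking = ["a", "b", "c"]  # hypothetical dense-retriever output
    bm25_ranking = ["b", "d", "a"]   # hypothetical BM25 output
    print(reciprocal_rank_fusion([dense_ranking, bm25_ranking]))
```

Documents that appear in both lists rise to the top, which is how an exact-keyword hit from BM25 can rescue a query the embedding model ranked poorly.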

These are not new ideas. What is new is that you can now measure each one’s contribution with FutureAGI’s component evaluators and ship only the ones that move your specific eval cohort.

Multimodal RAG and the 2026 expansion

RAG in 2026 has expanded beyond text. Multimodal RAG retrieves images, tables, video keyframes, audio transcripts, and code snippets alongside text. The evaluator stack expands to match: CaptionHallucination for image-to-text answers, ImageInstructionAdherence for image-conditioned generation, OCREvaluation for document parsing, and TextToSQL for retrieval-grounded SQL generation. The same component-level discipline applies: score the retriever and the generator separately, and trace every modality boundary.

GraphRAG, which augments retrieval with knowledge-graph traversal, is the other 2026 frontier. It is the right pattern for connected-data domains (biomedical, legal precedent, supply-chain) where the answer requires walking from entity to entity, not just keyword similarity. ContextEntityRecall and MultiHopReasoning are the evaluators that surface graph-RAG failures.

Long-context “RAG-less” approaches (dumping the entire knowledge base into a 1M-token context) remain a tempting shortcut for small KBs. We use them sparingly because they fail two ways: cost scales linearly with context size, and NoiseSensitivity rises as the model loses focus past ~200K tokens. The right tool is still retrieval; long context is the fallback when the KB genuinely fits and retrieval engineering is not worth the team’s time.

Frequently Asked Questions

What is RAG?

RAG is a production pattern that retrieves external context before LLM generation, so answers can be grounded in current source data rather than only model weights. FutureAGI measures it with RAGScore, ContextRelevance, and Groundedness.

How is RAG different from fine-tuning?

RAG adds context at request time from a knowledge base or search layer. Fine-tuning changes model behavior during training, but it does not automatically keep answers current with changing documents.

How do you measure RAG?

Measure RAG with FutureAGI's RAGScore for an aggregate signal, plus ContextRelevance for retrieval quality and Groundedness for answer support. Trace the retriever, context, and generator spans together.