What Is Corrective RAG?

A RAG pattern that scores each retrieved chunk and triggers fallback strategies (web search, decomposition, rewriting) when retrieval is judged insufficient.

Corrective RAG (CRAG) is a retrieval-augmented generation pattern that inserts a retrieval evaluator between the retrieve and generate steps. After retrieval, each chunk is graded — typically as correct, ambiguous, or incorrect. The system then conditionally branches: correct chunks pass straight to generation, ambiguous chunks trigger query refinement or chunk decomposition, and incorrect chunks trigger a fallback retriever (often web search or a different vector store). The pattern self-corrects when the primary retriever fails, which is a structural improvement over naive RAG. Originally proposed by Yan et al. (2024), CRAG has become a baseline for production RAG that needs higher reliability than single-shot retrieval offers.
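The grade-then-branch control flow can be sketched in a few lines. This is a toy illustration, not the Yan et al. implementation: the keyword-overlap grader below is a deliberately simple stand-in for a real retrieval evaluator, and the function names are invented for this sketch.

```python
# Labels from the CRAG pattern: correct / ambiguous / incorrect.
CORRECT, AMBIGUOUS, INCORRECT = "correct", "ambiguous", "incorrect"

def grade(chunk: str, query: str) -> str:
    """Stand-in retrieval evaluator. A real system would use an
    LLM- or model-based grader; this toy version counts keyword overlap."""
    overlap = len(set(chunk.lower().split()) & set(query.lower().split()))
    if overlap >= 2:
        return CORRECT
    return AMBIGUOUS if overlap == 1 else INCORRECT

def corrective_rag(query: str, chunks: list[str]) -> dict:
    """Route each retrieved chunk down the branch its grade selects."""
    passed, refine, fallback = [], [], []
    for chunk in chunks:
        label = grade(chunk, query)
        if label == CORRECT:
            passed.append(chunk)      # straight to generation
        elif label == AMBIGUOUS:
            refine.append(chunk)      # query refinement / decomposition
        else:
            fallback.append(chunk)    # escalate to fallback retriever
    return {"generate": passed, "refine": refine, "fallback": fallback}
```

The point of the structure is that every chunk takes exactly one of three deterministic paths, which is what makes the pattern auditable.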

Why It Matters in Production LLM and Agent Systems

Naive RAG has one fatal property: it trusts retrieval. If the top-k passages are irrelevant, the LLM still produces a fluent answer — just a wrong one. CRAG breaks that failure mode by gating generation on a relevance check. When the gate fails, the system rewrites the query, decomposes it, or escalates to a different corpus. The user sees a slightly slower correct answer instead of a fast wrong one.

The pain shows up across roles. Engineers see hallucinations on long-tail queries where retrieval missed. Product managers see “the bot doesn’t know about [recent feature]” tickets because the static index was stale. Compliance leads cannot justify a RAG answer that cites no source — CRAG’s corrective branch logs which fallback fired and why. SREs see runaway costs when fallbacks are too aggressive — CRAG demands threshold tuning to balance reliability and spend.

In 2026 production stacks, CRAG sits between naive RAG (cheap, fragile) and full agentic RAG (capable, expensive). For high-stakes Q&A — medical, legal, financial — CRAG’s prescribed corrective loop is more auditable than a general agent: every fallback fires for a logged reason, and the trace shows whether the corrective layer activated. Multi-tenant SaaS RAG systems use CRAG to upgrade quality on enterprise tiers without paying agent-loop latency on every request.

How FutureAGI Handles Corrective RAG

FutureAGI’s approach is to provide the evaluator that drives the CRAG corrective branch and the trace surface that proves it fired. The retrieval evaluator at the heart of CRAG is conceptually identical to fi.evals.ContextRelevance and fi.evals.RAGScoreDetailed — both grade retrieved chunks against the input query. Teams building CRAG plug these evaluators in as the gating function: if ContextRelevance < threshold, branch to the fallback retriever. fi.evals.NoiseSensitivity measures how robust the downstream answer is when the corrective layer lets through some noise — the canonical tuning signal.
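The gating function itself is small. A minimal sketch, assuming a ContextRelevance-style evaluator exposed as a plain scoring callable returning 0–1 per chunk (the exact fi.evals call signature may differ):

```python
from typing import Callable

def gated_retrieve(
    query: str,
    primary_retrieve: Callable[[str], list[str]],
    fallback_retrieve: Callable[[str], list[str]],
    score_chunk: Callable[[str, str], float],  # e.g. a ContextRelevance-style evaluator
    threshold: float = 0.7,
) -> tuple[list[str], bool]:
    """Return (context, fallback_fired). Chunks scoring below the
    threshold are dropped; if nothing survives, try the fallback retriever."""
    chunks = primary_retrieve(query)
    kept = [c for c in chunks if score_chunk(query, c) >= threshold]
    if kept:
        return kept, False
    return fallback_retrieve(query), True
```

Because the evaluator is injected as a callable, the same gate works whether the score comes from fi.evals, a reranker, or a heuristic.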

On the trace side, traceAI integrations capture both the primary retrieval span and any fallback retrieval span as a typed pair. A CRAG trace shows: query → retrieve → evaluator decision → (optional) fallback retrieve → generate. Every branch is observable, every fallback is logged with its trigger reason.
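The trace shape described above can be sketched with plain dicts standing in for real spans. This is not the traceAI API; the attribute names mirror the OTel-style attributes listed later in this article, and the 0.7 threshold is the example value from the workflow below.

```python
import time

def traced_crag(query, retrieve, grade_fn, fallback, generate):
    """Run the CRAG pipeline while recording each step as a span-like
    dict: query -> retrieve -> evaluator decision -> (optional)
    fallback retrieve -> generate."""
    trace = []

    def span(name, **attrs):
        trace.append({"name": name, "ts": time.time(), **attrs})

    chunks = retrieve(query)
    span("retrieve", **{"retrieval.documents": len(chunks)})

    score = grade_fn(query, chunks)
    fired = score < 0.7  # example threshold; tune per route
    span("evaluator", **{"retrieval.score": score, "retrieval.fallback": fired})

    if fired:
        chunks = fallback(query)
        span("fallback_retrieve", **{"retrieval.documents": len(chunks)})

    answer = generate(query, chunks)
    span("generate", answer_chars=len(answer))
    return answer, trace
```

Each fallback span carries the evaluator score that triggered it, which is the "logged reason" the pattern depends on.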

A typical FutureAGI workflow: a healthcare-knowledge RAG team implements CRAG with ContextRelevance as the corrective evaluator, threshold 0.7. They run RAGScoreDetailed on the final answer to confirm corrections actually improved quality, and NoiseSensitivity to confirm the threshold isn’t letting too much noise through. The trace dashboard shows fallback-fire rate per route — when it spikes on a new product line, the team knows to refresh the corpus. That feedback loop — corrective layer fires, eval confirms quality, trace shows pattern — is what makes CRAG operable rather than experimental.

How to Measure or Detect It

CRAG-specific signals layer on top of standard RAG signals:

  • fi.evals.ContextRelevance: the canonical retrieval evaluator that drives the corrective branch — 0–1 score per chunk.
  • fi.evals.RAGScoreDetailed: end-to-end after corrections applied — confirms the corrective loop improved quality.
  • fi.evals.NoiseSensitivity: measures how the answer degrades when irrelevant chunks slip past the corrective layer.
  • Fallback-fire rate: percentage of requests where the corrective branch triggered — a tuning signal.
  • fi.evals.Groundedness on the final answer: pass/fail check that synthesis stayed inside corrected context.
  • OTel attributes: retrieval.score, retrieval.fallback, retrieval.documents — emitted by traceAI integrations.

from fi.evals import RAGScoreDetailed

# Score the final answer against the corrected context.
scorer = RAGScoreDetailed()
result = scorer.evaluate(
    input="What is the dose for amoxicillin in adults?",
    output="500 mg every 8 hours for adult mild infections.",
    context=["...amoxicillin adult dose: 500 mg q8h..."],
)
print(result.score, result.reason)
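The fallback-fire rate listed above is straightforward to compute from trace records. A sketch, assuming each record carries a boolean retrieval.fallback attribute as emitted by the trace layer:

```python
def fallback_fire_rate(trace_records: list[dict]) -> float:
    """Fraction of requests where the corrective branch triggered."""
    if not trace_records:
        return 0.0
    fired = sum(1 for r in trace_records if r.get("retrieval.fallback"))
    return fired / len(trace_records)
```

Tracking this per route (as in the healthcare workflow above) turns a raw trace attribute into the tuning signal CRAG needs.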

Common Mistakes

  • Conflating CRAG with agentic RAG. CRAG is a fixed corrective pattern; agentic RAG is general agent-driven retrieval. CRAG is narrower and more predictable.
  • Using one threshold for all routes. Threshold for medical Q&A should be stricter than for FAQ — tune ContextRelevance threshold per cohort.
  • Logging only the final answer span. Without the corrective-evaluator span and fallback span, you cannot prove CRAG actually fired or diagnose when it should have but didn’t.
  • Ignoring fallback latency. Web-search fallbacks add seconds; budget for them or surface a “searching” UX.
  • Skipping NoiseSensitivity. A CRAG threshold that is too loose lets noise through; without measuring sensitivity, you will not catch degradation until users do.

Frequently Asked Questions

What is corrective RAG?

Corrective RAG (CRAG) is a RAG pattern that scores retrieved chunks before generation, labels them as correct, ambiguous, or incorrect, and triggers fallback strategies — web search, decomposition, query rewriting — when retrieval is judged insufficient.

How is corrective RAG different from agentic RAG?

Agentic RAG uses a general LLM-driven agent loop where the model decides retrieval strategy. Corrective RAG is a specific prescribed pattern: a retrieval evaluator runs after retrieve and triggers a fixed set of fallback strategies based on its label. Narrower and more debuggable.

How do you evaluate corrective RAG?

FutureAGI's fi.evals.RAGScoreDetailed scores the final answer after each retrieve-and-correct loop, while fi.evals.NoiseSensitivity measures whether the answer degrades when irrelevant chunks slip past the corrective evaluator.