How to Cut RAG Hallucinations in 2026: An Evaluation-Driven Playbook with Future AGI
RAG hallucinations destroy user trust faster than any other LLM failure mode. The answer sounds confident, the citation looks plausible, and the underlying fact is wrong. This guide shows how to detect and reduce RAG hallucinations in 2026 using Future AGI’s Context Adherence and Groundedness metrics, with runnable code, a three-axis tuning loop, and the SLO thresholds we use day to day.
TL;DR
| Question | Short answer |
|---|---|
| Top RAG eval metrics in 2026 | Context Adherence, Groundedness, Context Retrieval Quality (all in the Future AGI catalog). |
| Best detection workflow | Tag every production RAG span with EvalTag(eval_name=EvalName.CONTEXT_ADHERENCE, ...) and EvalTag(eval_name=EvalName.GROUNDEDNESS, ...). |
| Biggest hallucination drivers | Thin or noisy retrieval, oversized chunks, fluency bias in the generator. |
| Biggest single fix | Add a reranker on top of the initial retriever. |
| Future AGI integration | evaluate(eval_templates="context_adherence", inputs={"output": ..., "context": ...}) |
| Cloud judge latency | turing_flash 1-2s, turing_small 2-3s, turing_large 3-5s. |
| SDK license | ai-evaluation Apache 2.0 (github.com/future-agi/ai-evaluation/blob/main/LICENSE). |
Why RAG hallucinations happen
Three causes drive most hallucinations:
- Insufficient context. The retriever returns a short or weak passage, and the generator fills the gap with plausible text.
- Retriever failure. The search step returns passages that look similar but are off-topic, especially with similarity-only retrieval and no reranker.
- Fluency bias. The generator prefers a smooth sentence over a faithful one, especially with weak system prompts and no inline groundedness check.
Each pipeline weakness maps onto a measurable signal:
- Chunking issues drop the Context Retrieval Quality score.
- Retriever failure (irrelevant passages) drops Context Retrieval Quality and often Groundedness; if the model still tries to answer it will produce ungrounded claims.
- Fluency bias drops Groundedness even when Context Adherence stays high.
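The mapping above can be turned into a simple triage routine. The sketch below is illustrative only: the `diagnose` helper and the 0.6 threshold are assumptions, not part of the Future AGI SDK, and the cutoffs should be tuned on your own eval set.

```python
def diagnose(adherence: float, groundedness: float,
             retrieval_quality: float, low: float = 0.6) -> str:
    """Heuristic triage from metric scores to the most likely pipeline weakness.

    Thresholds are illustrative; tune `low` on your own data.
    """
    if retrieval_quality < low:
        # Upstream problem: fix chunker or retriever before touching the prompt.
        return "retrieval/chunking: fix chunker or retriever first"
    if groundedness < low and adherence >= low:
        # Classic fluency bias: the answer stays on-topic but invents claims.
        return "fluency bias: tighten prompt, add groundedness check"
    if groundedness < low:
        return "ungrounded claims: add reranker, enforce context-only answers"
    return "healthy"
```

Feed it the three per-call scores and route the failing samples to the matching fix from the sections below.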
Real-world cost of skipping eval
Two public incidents illustrate the stakes. Cursor’s AI support agent told a user a non-existent “single-device policy” was forcing logouts; the co-founder issued a public apology after the story trended (eweek.com). New York City’s MyCity small-business chatbot gave illegal advice on tipping, terminations, and zoning even when the correct rules were available in its knowledge base (apnews.com). Continuous evaluation against the retrieved context would have surfaced the unsupported answers before users saw them.
The three Future AGI metrics for RAG
| Metric | Question it answers | Best for |
|---|---|---|
| Context Adherence | Did the answer stay inside the retrieved passages? | Detecting drift outside the source context. |
| Groundedness | Is every claim explicitly supported by evidence? | Catching invented facts even when adherence is high. |
| Context Retrieval Quality | Were the passages themselves relevant and sufficient? | Diagnosing retriever or chunker problems upstream. |
All three are model-based, so no labeled ground truth is required.
Three-axis tuning loop
Run a sweep over three axes and score every configuration with the metrics above.
| Axis | Options to try |
|---|---|
| Chunking | RecursiveCharacterTextSplitter, CharacterTextSplitter, semantic chunker |
| Retrieval | FAISS similarity, MMR (Maximal Marginal Relevance), hybrid BM25 + dense, reranker on top |
| Chain | stuff, map_reduce, refine, map_rerank |
The combination with the highest joint Context Adherence + Groundedness score on your dev set wins. In our internal run on a public benchmark, the winning configuration was CharacterTextSplitter for chunking, MMR for retrieval, and map_rerank for generation. Your dataset may pick a different winner; the right combination is the one your eval set chooses.
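The sweep itself is a small grid search. A minimal sketch, assuming a `scorer` callable that you implement to build the pipeline for one configuration and return its mean Context Adherence + Groundedness (for example via `fi.evals.evaluate`); the axis lists and the callable signature are hypothetical placeholders, not SDK names.

```python
from itertools import product

# Hypothetical axis labels; swap in your real chunkers, retrievers, and chains.
CHUNKERS = ["recursive", "character", "semantic"]
RETRIEVERS = ["similarity", "mmr", "hybrid_bm25"]
CHAINS = ["stuff", "map_reduce", "refine", "map_rerank"]


def run_sweep(eval_set, scorer):
    """Score every (chunker, retriever, chain) combination and return the best.

    `scorer(chunker, retriever, chain, eval_set)` must return the joint
    Context Adherence + Groundedness score for that configuration.
    """
    results = [
        ((chunker, retriever, chain), scorer(chunker, retriever, chain, eval_set))
        for chunker, retriever, chain in product(CHUNKERS, RETRIEVERS, CHAINS)
    ]
    # Highest joint score on the dev set wins.
    return max(results, key=lambda item: item[1])
```

With 3 chunkers, 3 retrievers, and 4 chains this is 36 configurations, so the sweep is cheap relative to the cost of shipping a hallucinating pipeline.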
Code: score a RAG answer with fi.evals
The cloud context_adherence template scores any answer against any context.
```python
import os

from fi.evals import evaluate

os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

retrieved_context = (
    "Future AGI's ai-evaluation SDK is Apache 2.0 licensed "
    "and hosted at github.com/future-agi/ai-evaluation."
)
answer = (
    "The Future AGI ai-evaluation SDK is Apache 2.0 and "
    "available on GitHub at github.com/future-agi/ai-evaluation."
)

result = evaluate(
    eval_templates="context_adherence",
    inputs={
        "output": answer,
        "context": retrieved_context,
    },
    model_name="turing_flash",
)
print(result.eval_results[0].metrics[0].value)
```
Swap context_adherence for groundedness or context_retrieval_quality to score the other two RAG metrics. turing_flash returns in 1-2 seconds; switch to turing_small (2-3 seconds) or turing_large (3-5 seconds) for deeper judgment (docs.futureagi.com).
Code: continuous online scoring with EvalTag
For production traffic, tag spans at instrumentation time so every RAG call is scored automatically without extra application code.
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType,
    EvalTag,
    EvalTagType,
    EvalSpanKind,
    EvalName,
)
from traceai_langchain import LangChainInstrumentor

eval_tags = [
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.GROUNDEDNESS,
        config={},
        mapping={
            "input": "llm.input_messages.1.message.content",
            "output": "llm.output_messages.0.message.content",
            "context": "llm.input_messages.0.message.content",
        },
        custom_eval_name="Groundedness",
    ),
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.CONTEXT_ADHERENCE,
        config={},
        mapping={
            "context": "llm.input_messages.0.message.content",
            "output": "llm.output_messages.0.message.content",
        },
        custom_eval_name="Context_Adherence",
    ),
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.EVAL_CONTEXT_RETRIEVAL_QUALITY,
        config={
            "criteria": (
                "Evaluate if the context is relevant and "
                "sufficient to support the output."
            ),
        },
        mapping={
            "input": "llm.input_messages.1.message.content",
            "output": "llm.output_messages.0.message.content",
            "context": "llm.input_messages.0.message.content",
        },
        custom_eval_name="Context_Retrieval_Quality",
    ),
]

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="rag_support_assistant",
    eval_tags=eval_tags,
)
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
```
Once the tracer is registered, every LangChain call produces a span with the three RAG scores attached. Swap LangChainInstrumentor for LlamaIndexInstrumentor, HaystackInstrumentor, or DSPyInstrumentor to instrument other stacks.
Code: a custom RAG rubric with CustomLLMJudge
When the catalog metrics do not fully capture the rubric (regulated industries, brand voice, JSON schema validation on retrieved citations), define a local judge.
```python
import os

from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider
from fi.opt.base import Evaluator

citation_judge = CustomLLMJudge(
    name="citation_format_judge",
    grading_criteria=(
        "Score 1 if every numbered claim in the answer is "
        "followed by a bracketed citation like [1] or [2] "
        "and the cited source appears verbatim in the context. "
        "Otherwise score 0."
    ),
    provider=LiteLLMProvider(
        model=os.getenv("JUDGE_MODEL", "gpt-4o-mini"),
    ),
)

evaluator = Evaluator(metric=citation_judge)
score = evaluator.evaluate(
    output=(
        "Future AGI ships Context Adherence [1] and "
        "Groundedness [2] metrics."
    ),
    context=(
        "[1] Context Adherence checks whether the answer "
        "stays inside the retrieved context. "
        "[2] Groundedness checks whether each claim is "
        "supported by evidence."
    ),
)
print(score)
```
The judge runs locally, hits the configured provider through LiteLLM, and returns a rubric score. Pin the provider temperature to 0 and pin the model version in the env var if you need reproducible scores across runs.
How to ship the loop
- Pin an eval set. 50 to 200 representative RAG queries with retrieved context. Refresh quarterly.
- Score the baseline. Run the eval set through your current pipeline; record Context Adherence, Groundedness, Context Retrieval Quality.
- Run the three-axis sweep. Vary chunker, retriever, and chain.
- Pick the winner. Highest joint score on Context Adherence + Groundedness wins. Confirm on a held-out test set.
- Wire production tagging. Use the EvalTag snippet above so every prod call is scored.
- Alert on regressions. Track the rolling mean Context Adherence and the rate of low-score samples (e.g., samples below 0.6); alert when the failure rate exceeds your SLO (commonly 5-10%, tune on your data) and route those samples to a review queue.
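The regression-alert step can be sketched as a small rolling-window monitor. The class below is a minimal assumption-laden sketch, not part of the Future AGI SDK: the window size, the 0.6 low-score cutoff, and the 5% SLO are the illustrative defaults from the checklist and should be tuned on your data.

```python
from collections import deque


class AdherenceMonitor:
    """Rolling window over per-call Context Adherence scores."""

    def __init__(self, window: int = 500, low_score: float = 0.6,
                 slo_failure_rate: float = 0.05):
        self.scores = deque(maxlen=window)  # only the most recent `window` calls
        self.low_score = low_score
        self.slo = slo_failure_rate

    def record(self, score: float) -> None:
        self.scores.append(score)

    def rolling_mean(self):
        return sum(self.scores) / len(self.scores) if self.scores else None

    def failure_rate(self) -> float:
        if not self.scores:
            return 0.0
        return sum(s < self.low_score for s in self.scores) / len(self.scores)

    def breached(self) -> bool:
        # Fire the alert (and route samples to review) when the SLO is exceeded.
        return self.failure_rate() > self.slo
```

Call `record()` from whatever consumes your eval results, and page on `breached()`.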
Most common RAG mistakes and how the metrics surface them
| Mistake | What you will see | Fix |
|---|---|---|
| Chunks too long | Retrieved passages contain relevant text plus noise; Groundedness drops | Shorten chunks, add overlap |
| Chunks too short | Context Retrieval Quality low | Increase chunk size, switch to recursive splitter |
| Similarity-only retrieval | Context Adherence high, Groundedness low | Add MMR and a cross-encoder reranker |
| Fluency-biased generator | Groundedness drops; Adherence stays high | Tighten system prompt; switch chain to map-rerank or refine |
| Prompt asks for “your knowledge” | Adherence drops on questions outside context | Add explicit “only answer from the context” instruction |
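For the last row, the fix is a context-only system prompt. One minimal sketch along these lines (the exact wording and the `build_messages` helper are illustrative, not a prescribed template):

```python
# Illustrative context-only system prompt; adapt the wording to your domain.
GROUNDED_SYSTEM_PROMPT = (
    "You are a support assistant. Answer ONLY from the context below. "
    "If the context does not contain the answer, say "
    "\"I don't have that information.\" Do not use outside knowledge, "
    "and cite the passage number for every claim.\n\n"
    "Context:\n{context}"
)


def build_messages(context: str, question: str) -> list[dict]:
    """Assemble chat messages with the retrieved context pinned in the system turn."""
    return [
        {"role": "system", "content": GROUNDED_SYSTEM_PROMPT.format(context=context)},
        {"role": "user", "content": question},
    ]
```

Putting the context in the system turn (message 0) also matches the span mappings used in the EvalTag snippet above, where `context` reads `llm.input_messages.0.message.content`.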
When to combine RAG with prompt-opt
Once chunker, retriever, and chain are tuned, run a prompt-opt loop on the generator prompt with the same Future AGI eval as the search target. Future AGI’s prompt-opt loop uses the same evaluate() calls in the search loop and in production, so the optimized prompt is judged against the production rubric, not a proxy.
Where Future AGI fits
Future AGI is the eval-and-observability layer for RAG. The Apache 2.0 SDK ships the three RAG metrics, the trace instrumentors for every major framework, and the prompt-opt loop fed by the same metrics. The Agent Command Center BYOK gateway at /platform/monitor/command-center adds multi-provider routing and guardrails for the generator side. Together they form a continuous evaluation loop on every RAG call without rewriting the application.
Wrap-up
Hallucinations are not a model problem. They are a measurement problem. Score every RAG call against Context Adherence and Groundedness, run the three-axis sweep, and tag production traces so the loop never stops. The integration is a few lines of Python, the SDK is Apache 2.0, and the same playbook works whether you ship on LangChain, LlamaIndex, Haystack, or DSPy.
For deeper reading see the RAG evaluation metrics guide, agentic RAG systems, and the 2026 best RAG evaluation tools comparison.
Frequently asked questions
What is a RAG hallucination?
Why is Future AGI a fit for RAG hallucination detection?
Do I need labeled ground truth data?
Which combination of chunker, retriever, and chain wins in practice?
Will the eval pipeline plug into LangChain, LlamaIndex, Haystack, or DSPy?
How fast are the cloud evaluators?
What about reranking and query rewriting?
How do I run online evaluations on production traffic?