
How to Cut RAG Hallucinations in 2026: An Evaluation-Driven Playbook with Future AGI

Cut RAG hallucinations in 2026 with the Future AGI eval loop: Context Adherence and Groundedness metrics, runnable `fi.evals` code, and chunk, retriever, and reranker tuning.


RAG hallucinations destroy user trust faster than any other LLM failure mode. The answer sounds confident, the citation looks plausible, and the underlying fact is wrong. This guide shows how to detect and reduce RAG hallucinations in 2026 using Future AGI’s Context Adherence and Groundedness metrics, with runnable code, a three-axis tuning loop, and the SLO thresholds we use day to day.

TL;DR

| Question | Short answer |
| --- | --- |
| Top RAG eval metrics in 2026 | Context Adherence, Groundedness, Context Retrieval Quality (all in the Future AGI catalog) |
| Best detection workflow | Tag every production RAG span with `EvalTag(eval_name=EvalName.CONTEXT_ADHERENCE, ...)` and `EvalTag(eval_name=EvalName.GROUNDEDNESS, ...)` |
| Biggest hallucination drivers | Thin or noisy retrieval, oversized chunks, fluency bias in the generator |
| Biggest single fix | Add a reranker on top of the initial retriever |
| Future AGI integration | `evaluate(eval_templates="context_adherence", inputs={"output": ..., "context": ...})` |
| Cloud judge latency | `turing_flash` 1-2 s, `turing_small` 2-3 s, `turing_large` 3-5 s |
| SDK license | `ai-evaluation` is Apache 2.0 (github.com/future-agi/ai-evaluation/blob/main/LICENSE) |

Why RAG hallucinations happen

Three causes drive most hallucinations:

  1. Insufficient context. The retriever returns a short or weak passage, and the generator fills the gap with plausible text.
  2. Retriever failure. The search step returns passages that look similar but are off-topic, especially with similarity-only retrieval and no reranker.
  3. Fluency bias. The generator prefers a smooth sentence over a faithful one, especially with weak system prompts and no inline groundedness check.

Each pipeline weakness maps onto a measurable signal:

  • Chunking issues drop the Context Retrieval Quality score.
  • Retriever failure (irrelevant passages) drops Context Retrieval Quality and often Groundedness; if the model still tries to answer it will produce ungrounded claims.
  • Fluency bias drops Groundedness even when Context Adherence stays high.

Real-world cost of skipping eval

Two public incidents illustrate the stakes. Cursor’s AI support agent told a user a non-existent “single-device policy” was forcing logouts; the co-founder issued a public apology after the story trended (eweek.com). New York City’s MyCity small-business chatbot gave illegal advice on tipping, terminations, and zoning even when the correct rules were available in its knowledge base (apnews.com). Continuous evaluation against the retrieved context would have surfaced the unsupported answers before users saw them.

The three Future AGI metrics for RAG

| Metric | Question it answers | Best for |
| --- | --- | --- |
| Context Adherence | Did the answer stay inside the retrieved passages? | Detecting drift outside the source context |
| Groundedness | Is every claim explicitly supported by evidence? | Catching invented facts even when adherence is high |
| Context Retrieval Quality | Were the passages themselves relevant and sufficient? | Diagnosing retriever or chunker problems upstream |

All three are model-based, so no labeled ground truth is required.

Three-axis tuning loop

Run a sweep over three axes and score every configuration with the metrics above.

| Axis | Options to try |
| --- | --- |
| Chunking | `RecursiveCharacterTextSplitter`, `CharacterTextSplitter`, semantic chunker |
| Retrieval | FAISS similarity, MMR (Maximal Marginal Relevance), hybrid BM25 + dense, reranker on top |
| Chain | `stuff`, `map_reduce`, `refine`, `map_rerank` |

The combination with the highest joint Context Adherence + Groundedness score on your dev set wins. In an internal sweep on a public benchmark, the winner was `CharacterTextSplitter` for chunking, MMR for retrieval, and `map_rerank` for generation. Your dataset may pick a different winner; the right combination is the one your eval set chooses.
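
Here is a minimal sketch of the sweep. `build_chain`, `answer_query`, and `EVAL_SET` are hypothetical stand-ins for your own pipeline factory, query runner, and dev set; the scoring calls are the same `evaluate()` API shown in the next section, and the sketch assumes the `groundedness` template accepts the same output/context inputs (the docs describe swapping templates this way).

from itertools import product

from fi.evals import evaluate

CHUNKERS = ["recursive", "character", "semantic"]
RETRIEVERS = ["similarity", "mmr", "hybrid_bm25"]
CHAINS = ["stuff", "map_reduce", "refine", "map_rerank"]

def joint_score(answer, context):
    # Joint score = Context Adherence + Groundedness for one sample.
    total = 0.0
    for template in ("context_adherence", "groundedness"):
        result = evaluate(
            eval_templates=template,
            inputs={"output": answer, "context": context},
            model_name="turing_flash",
        )
        total += result.eval_results[0].metrics[0].value
    return total

def score_config(chunker, retriever, chain):
    pipeline = build_chain(chunker, retriever, chain)  # hypothetical factory
    scores = [
        joint_score(*answer_query(pipeline, q))  # hypothetical; returns (answer, context)
        for q in EVAL_SET  # hypothetical dev-set queries
    ]
    return sum(scores) / len(scores)

best = max(product(CHUNKERS, RETRIEVERS, CHAINS), key=lambda c: score_config(*c))
print("winning configuration:", best)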

Code: score a RAG answer with fi.evals

The cloud context_adherence template scores any answer against any context.

import os

from fi.evals import evaluate

os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

retrieved_context = (
    "Future AGI's ai-evaluation SDK is Apache 2.0 licensed "
    "and hosted at github.com/future-agi/ai-evaluation."
)

answer = (
    "The Future AGI ai-evaluation SDK is Apache 2.0 and "
    "available on GitHub at github.com/future-agi/ai-evaluation."
)

result = evaluate(
    eval_templates="context_adherence",
    inputs={
        "output": answer,
        "context": retrieved_context,
    },
    model_name="turing_flash",
)

print(result.eval_results[0].metrics[0].value)

Swap context_adherence for groundedness or context_retrieval_quality to score the other two RAG metrics. turing_flash returns in 1-2 seconds; switch to turing_small (2-3 seconds) or turing_large (3-5 seconds) for deeper judgment (docs.futureagi.com).

Code: continuous online scoring with EvalTag

For production traffic, tag spans at instrumentation time so every RAG call is scored automatically without extra application code.

from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType,
    EvalTag,
    EvalTagType,
    EvalSpanKind,
    EvalName,
)
from traceai_langchain import LangChainInstrumentor

eval_tags = [
    # Groundedness: is every claim in the answer backed by the context?
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.GROUNDEDNESS,
        config={},
        # Span attribute paths: message 0 is the system prompt carrying
        # the retrieved context, message 1 is the user question.
        mapping={
            "input": "llm.input_messages.1.message.content",
            "output": "llm.output_messages.0.message.content",
            "context": "llm.input_messages.0.message.content",
        },
        custom_eval_name="Groundedness",
    ),
    # Context Adherence: does the answer stay inside the retrieved passages?
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.CONTEXT_ADHERENCE,
        config={},
        mapping={
            "context": "llm.input_messages.0.message.content",
            "output": "llm.output_messages.0.message.content",
        },
        custom_eval_name="Context_Adherence",
    ),
    # Context Retrieval Quality: were the retrieved passages relevant
    # and sufficient? This eval takes an explicit criteria string.
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.EVAL_CONTEXT_RETRIEVAL_QUALITY,
        config={
            "criteria": (
                "Evaluate if the context is relevant and "
                "sufficient to support the output."
            ),
        },
        mapping={
            "input": "llm.input_messages.1.message.content",
            "output": "llm.output_messages.0.message.content",
            "context": "llm.input_messages.0.message.content",
        },
        custom_eval_name="Context_Retrieval_Quality",
    ),
]

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="rag_support_assistant",
    eval_tags=eval_tags,
)

LangChainInstrumentor().instrument(tracer_provider=trace_provider)

Once the tracer is registered, every LangChain call produces a span with the three RAG scores attached. Swap LangChainInstrumentor for LlamaIndexInstrumentor, HaystackInstrumentor, or DSPyInstrumentor to instrument other stacks.
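
As a quick smoke test (a sketch, assuming `langchain-openai` is installed and `OPENAI_API_KEY` is set), any ordinary LangChain call made after instrumentation emits a scored span. Note that the mapping above expects the retrieved context in the system message and the question in the user message:

from langchain_openai import ChatOpenAI

# An ordinary LangChain call; the instrumentor attaches the three RAG
# eval scores to the resulting span, no extra application code needed.
llm = ChatOpenAI(model="gpt-4o-mini")
response = llm.invoke(
    [
        ("system", "Answer only from this context: <retrieved passages>"),
        ("human", "What license is the ai-evaluation SDK under?"),
    ]
)
print(response.content)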

Code: a custom RAG rubric with CustomLLMJudge

When the catalog metrics do not fully capture the rubric (regulated industries, brand voice, JSON schema validation on retrieved citations), define a local judge.

import os

from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider
from fi.opt.base import Evaluator

citation_judge = CustomLLMJudge(
    name="citation_format_judge",
    grading_criteria=(
        "Score 1 if every numbered claim in the answer is "
        "followed by a bracketed citation like [1] or [2] "
        "and the cited source appears verbatim in the context. "
        "Otherwise score 0."
    ),
    provider=LiteLLMProvider(
        model=os.getenv("JUDGE_MODEL", "gpt-4o-mini"),
    ),
)

evaluator = Evaluator(metric=citation_judge)
score = evaluator.evaluate(
    output=(
        "Future AGI ships Context Adherence [1] and "
        "Groundedness [2] metrics."
    ),
    context=(
        "[1] Context Adherence checks whether the answer "
        "stays inside the retrieved context. "
        "[2] Groundedness checks whether each claim is "
        "supported by evidence."
    ),
)
print(score)

The judge runs locally, hits the configured provider through LiteLLM, and returns a rubric score. Pin the provider temperature to 0 and pin the model version in the env var if you need reproducible scores across runs.

How to ship the loop

  1. Pin an eval set. 50 to 200 representative RAG queries with retrieved context. Refresh quarterly.
  2. Score the baseline. Run the eval set through your current pipeline; record Context Adherence, Groundedness, Context Retrieval Quality.
  3. Run the three-axis sweep. Vary chunker, retriever, and chain.
  4. Pick the winner. Highest joint score on Context Adherence + Groundedness wins. Confirm on a held-out test set.
  5. Wire production tagging. Use the EvalTag snippet above so every prod call is scored.
  6. Alert on regressions. Track the rolling mean Context Adherence and the rate of low-score samples (e.g., samples below 0.6); alert when the failure rate exceeds your SLO (commonly 5-10%, tune on your data) and route those samples to a review queue. A minimal monitor sketch follows this list.
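
The alerting step is a few lines of Python. A minimal sketch, assuming adherence scores arrive as a stream of floats (exported from your tracing backend or log pipeline) and that `page_on_call()` is a hypothetical stand-in for your alert hook:

from collections import deque

class AdherenceMonitor:
    """Rolling failure-rate alert over recent Context Adherence scores."""

    def __init__(self, window=500, low_score=0.6, slo_failure_rate=0.05):
        self.scores = deque(maxlen=window)
        self.low_score = low_score
        self.slo_failure_rate = slo_failure_rate

    def record(self, score: float) -> bool:
        """Add one score; return True when the SLO is breached."""
        self.scores.append(score)
        failures = sum(s < self.low_score for s in self.scores)
        return failures / len(self.scores) > self.slo_failure_rate

monitor = AdherenceMonitor()
for score in adherence_scores:  # hypothetical score stream
    if monitor.record(score):
        page_on_call()  # hypothetical: alert + route sample to review queue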

Most common RAG mistakes and how the metrics surface them

| Mistake | What you will see | Fix |
| --- | --- | --- |
| Chunks too long | Retrieved passages contain relevant text plus noise; Groundedness drops | Shorten chunks, add overlap |
| Chunks too short | Context Retrieval Quality low | Increase chunk size; switch to a recursive splitter |
| Similarity-only retrieval | Context Adherence high, Groundedness low | Add MMR and a cross-encoder reranker |
| Fluency-biased generator | Groundedness drops; Adherence stays high | Tighten the system prompt; switch chain to `map_rerank` or `refine` |
| Prompt asks for "your knowledge" | Adherence drops on questions outside the context | Add an explicit "only answer from the context" instruction |
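
Two of the fixes above are near one-liners in LangChain. A sketch, assuming an existing FAISS `vectorstore`, loaded `docs`, and the `langchain-text-splitters` package:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Fix for "chunks too long": shorter chunks with overlap, so each
# retrieved passage is dense in relevant text.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)  # docs: your loaded documents

# Fix for "similarity-only retrieval": switch the retriever to MMR so
# the top-k passages are relevant and diverse, not near-duplicates.
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20},
)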

When to combine RAG with prompt-opt

Once chunker, retriever, and chain are tuned, run a prompt-opt loop on the generator prompt, using the same Future AGI eval as the search target. Because the search loop and production scoring share the same `evaluate()` calls, the optimized prompt is judged against the production rubric, not a proxy.
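
A minimal sketch of that search loop, assuming a hypothetical `generate_answer(prompt, query, context)` wrapper around your generator and an `EVAL_SET` of (query, context) pairs:

from fi.evals import evaluate

candidate_prompts = [
    "Answer strictly from the context. Cite a passage for every claim.",
    "Answer from the context only; reply 'not in the context' when unsure.",
]

def prompt_score(prompt):
    total = 0.0
    for query, context in EVAL_SET:  # hypothetical dev-set pairs
        answer = generate_answer(prompt, query, context)  # hypothetical
        result = evaluate(
            eval_templates="context_adherence",
            inputs={"output": answer, "context": context},
            model_name="turing_flash",
        )
        total += result.eval_results[0].metrics[0].value
    return total / len(EVAL_SET)

# The prompt with the best average score against the production rubric wins.
best_prompt = max(candidate_prompts, key=prompt_score)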

Where Future AGI fits

Future AGI is the eval-and-observability layer for RAG. The Apache 2.0 SDK ships the three RAG metrics, the trace instrumentors for every major framework, and the prompt-opt loop fed by the same metrics. The Agent Command Center BYOK gateway at /platform/monitor/command-center adds multi-provider routing and guardrails for the generator side. Together they form a continuous evaluation loop on every RAG call without rewriting the application.

Wrap-up

Hallucinations are not a model problem. They are a measurement problem. Score every RAG call against Context Adherence and Groundedness, run the three-axis sweep, and tag production traces so the loop never stops. The integration is a few lines of Python, the SDK is Apache 2.0, and the same playbook works whether you ship on LangChain, LlamaIndex, Haystack, or DSPy.

For deeper reading see the RAG evaluation metrics guide, agentic RAG systems, and the 2026 best RAG evaluation tools comparison.

Frequently asked questions

What is a RAG hallucination?
A RAG hallucination happens when the generator inserts information that is not supported by the retrieved context. The model sounds confident, but the claim is not grounded in the source passages. This is the most common failure mode in RAG production systems, and it usually traces back to thin retrieval, weak chunking, or a generator that prefers fluent text over evidence.
Why is Future AGI a fit for RAG hallucination detection?
Future AGI exposes three purpose-built RAG metrics: Context Adherence (does the answer stay inside the retrieved passages?), Groundedness (is every claim supported by evidence?), and Context Retrieval Quality (were the retrieved passages relevant and sufficient?). All three run as cloud evals through `evaluate(eval_templates="context_adherence", ...)` or as instrumentation-time tags via `EvalTag(eval_name=EvalName.CONTEXT_ADHERENCE, ...)` so every production trace is scored automatically. The `ai-evaluation` SDK is Apache 2.0 on GitHub.
Do I need labeled ground truth data?
No. Future AGI scores RAG outputs with model-based evaluators that compare the answer against the retrieved context, not against a labeled answer key. This lets you score live production traffic, run sweeps over chunkers and retrievers without re-labeling, and catch drift the moment a new model checkpoint changes behavior. For final acceptance, a small labeled set of 50 to 200 examples is still recommended.
Which combination of chunker, retriever, and chain wins in practice?
In our internal sweep on a public RAG benchmark, the strongest combination was CharacterTextSplitter for chunking, MMR for retrieval, and map-rerank for generation. The exact winner is dataset-dependent. Run the same three-axis sweep on your own data: chunk strategy, retrieval strategy, chain strategy, scored by Context Adherence and Groundedness, and pick the configuration with the highest joint score.
Will the eval pipeline plug into LangChain, LlamaIndex, Haystack, or DSPy?
Yes. The Future AGI SDK ships traceAI OpenTelemetry instrumentors for LangChain, LlamaIndex, Haystack, DSPy, CrewAI, and more. Once `register()` is called with a project name and eval tags, every RAG span gets scored automatically. The same `fi.evals` cloud API also works standalone for ad-hoc evaluation runs outside any framework.
How fast are the cloud evaluators?
Three judge models are available. `turing_flash` returns in roughly 1-2 seconds, `turing_small` in 2-3 seconds, and `turing_large` in 3-5 seconds (docs.futureagi.com/docs/sdk/evals/cloud-evals). Most teams run `turing_flash` for high-volume online scoring and bump up to `turing_large` for offline experiments where deeper judgment matters more than latency.
What about reranking and query rewriting?
Both help. A cross-encoder reranker on top of the initial retriever cuts irrelevant passages, and a query-rewriting step before retrieval handles vague prompts. Score the impact of each change with the same Context Adherence + Groundedness metrics. Most teams see the biggest jump from adding a reranker, the second-biggest from query rewriting, and only then from chunk-size tuning.
How do I run online evaluations on production traffic?
Register a tracer at the start of the service with `project_type=ProjectType.OBSERVE` and attach `EvalTag` entries for `EvalName.CONTEXT_ADHERENCE` and `EvalName.GROUNDEDNESS`. Every RAG call produces a trace with the eval score attached. Alert when the rolling failure rate (the share of samples scoring below your threshold) exceeds your SLO. Send drifting samples to a review queue for prompt-opt or retrieval tuning.