How to Cut RAG Hallucinations in 2026: An Evaluation-Driven Playbook with Future AGI
RAG hallucinations destroy user trust faster than any other LLM failure mode. The answer sounds confident, the citation looks plausible, and the underlying fact is wrong. This guide shows how to detect and reduce RAG hallucinations in 2026 using Future AGI’s Context Adherence and Groundedness metrics, with runnable code, a three-axis tuning loop, and the SLO thresholds we use day to day.
TL;DR
| Question | Short answer |
|---|---|
| Top RAG eval metrics in 2026 | Context Adherence, Groundedness, Context Retrieval Quality (all in the Future AGI catalog). |
| Best detection workflow | Tag every production RAG span with EvalTag(eval_name=EvalName.CONTEXT_ADHERENCE, ...) and EvalTag(eval_name=EvalName.GROUNDEDNESS, ...). |
| Biggest hallucination drivers | Thin or noisy retrieval, oversized chunks, fluency bias in the generator. |
| Biggest single fix | Add a reranker on top of the initial retriever. |
| Future AGI integration | evaluate(eval_templates="context_adherence", inputs={"output": ..., "context": ...}) |
| Cloud judge latency | turing_flash 1-2s, turing_small 2-3s, turing_large 3-5s. |
| SDK license | ai-evaluation Apache 2.0 (github.com/future-agi/ai-evaluation/blob/main/LICENSE). |
Why RAG hallucinations happen
Three causes drive most hallucinations:
- Insufficient context. The retriever returns a short or weak passage, and the generator fills the gap with plausible text.
- Retriever failure. The search step returns passages that look similar but are off-topic, especially with similarity-only retrieval and no reranker.
- Fluency bias. The generator prefers a smooth sentence over a faithful one, especially with weak system prompts and no inline groundedness check.
Each pipeline weakness maps onto a measurable signal:
- Chunking issues drop the Context Retrieval Quality score.
- Retriever failure (irrelevant passages) drops Context Retrieval Quality and often Groundedness; if the model still tries to answer it will produce ungrounded claims.
- Fluency bias drops Groundedness even when Context Adherence stays high.
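The mapping above can be turned into a simple triage routine. The sketch below is illustrative only: the `diagnose` helper and the 0.6 threshold are assumptions, not part of the Future AGI SDK, and the cutoffs should be tuned on your own eval set.

```python
def diagnose(adherence: float, groundedness: float,
             retrieval_quality: float, low: float = 0.6) -> str:
    """Heuristic triage from metric scores to the most likely pipeline weakness.

    Thresholds are illustrative; tune `low` on your own data.
    """
    if retrieval_quality < low:
        # Upstream problem: fix chunker or retriever before touching the prompt.
        return "retrieval/chunking: fix chunker or retriever first"
    if groundedness < low and adherence >= low:
        # Classic fluency bias: the answer stays on-topic but invents claims.
        return "fluency bias: tighten prompt, add groundedness check"
    if groundedness < low:
        return "ungrounded claims: add reranker, enforce context-only answers"
    return "healthy"
```

Feed it the three per-call scores and route the failing samples to the matching fix from the sections below.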
Real-world cost of skipping eval
Two public incidents illustrate the stakes. Cursor’s AI support agent told a user a non-existent “single-device policy” was forcing logouts; the co-founder issued a public apology after the story trended (eweek.com). New York City’s MyCity small-business chatbot gave illegal advice on tipping, terminations, and zoning even when the correct rules were available in its knowledge base (apnews.com). Continuous evaluation against the retrieved context would have surfaced the unsupported answers before users saw them.
The three Future AGI metrics for RAG
| Metric | Question it answers | Best for |
|---|---|---|
| Context Adherence | Did the answer stay inside the retrieved passages? | Detecting drift outside the source context. |
| Groundedness | Is every claim explicitly supported by evidence? | Catching invented facts even when adherence is high. |
| Context Retrieval Quality | Were the passages themselves relevant and sufficient? | Diagnosing retriever or chunker problems upstream. |
All three are model-based, so no labeled ground truth is required.
Three-axis tuning loop
Run a sweep over three axes and score every configuration with the metrics above.
| Axis | Options to try |
|---|---|
| Chunking | RecursiveCharacterTextSplitter, CharacterTextSplitter, semantic chunker |
| Retrieval | FAISS similarity, MMR (Maximal Marginal Relevance), hybrid BM25 + dense, reranker on top |
| Chain | stuff, map_reduce, refine, map_rerank |
The combination with the highest joint Context Adherence + Groundedness score on your dev set wins. In our internal run on a public benchmark, the winning configuration was CharacterTextSplitter for chunking, MMR for retrieval, and map_rerank for generation. Your dataset may pick a different winner; the right combination is the one your eval set chooses.
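The sweep itself is a small grid search. A minimal sketch, assuming a `scorer` callable that you implement to build the pipeline for one configuration and return its mean Context Adherence + Groundedness (for example via `fi.evals.evaluate`); the axis lists and the callable signature are hypothetical placeholders, not SDK names.

```python
from itertools import product

# Hypothetical axis labels; swap in your real chunkers, retrievers, and chains.
CHUNKERS = ["recursive", "character", "semantic"]
RETRIEVERS = ["similarity", "mmr", "hybrid_bm25"]
CHAINS = ["stuff", "map_reduce", "refine", "map_rerank"]


def run_sweep(eval_set, scorer):
    """Score every (chunker, retriever, chain) combination and return the best.

    `scorer(chunker, retriever, chain, eval_set)` must return the joint
    Context Adherence + Groundedness score for that configuration.
    """
    results = [
        ((chunker, retriever, chain), scorer(chunker, retriever, chain, eval_set))
        for chunker, retriever, chain in product(CHUNKERS, RETRIEVERS, CHAINS)
    ]
    # Highest joint score on the dev set wins.
    return max(results, key=lambda item: item[1])
```

With 3 chunkers, 3 retrievers, and 4 chains this is 36 configurations, so the sweep is cheap relative to the cost of shipping a hallucinating pipeline.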
Code: score a RAG answer with fi.evals
The cloud context_adherence template scores any answer against any context.
```python
import os

from fi.evals import evaluate

os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

retrieved_context = (
    "Future AGI's ai-evaluation SDK is Apache 2.0 licensed "
    "and hosted at github.com/future-agi/ai-evaluation."
)
answer = (
    "The Future AGI ai-evaluation SDK is Apache 2.0 and "
    "available on GitHub at github.com/future-agi/ai-evaluation."
)

result = evaluate(
    eval_templates="context_adherence",
    inputs={
        "output": answer,
        "context": retrieved_context,
    },
    model_name="turing_flash",
)
print(result.eval_results[0].metrics[0].value)
```
Swap context_adherence for groundedness or context_retrieval_quality to score the other two RAG metrics. turing_flash returns in 1-2 seconds; switch to turing_small (2-3 seconds) or turing_large (3-5 seconds) for deeper judgment (docs.futureagi.com).
Code: continuous online scoring with EvalTag
For production traffic, tag spans at instrumentation time so every RAG call is scored automatically without extra application code.
```python
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    ProjectType,
    EvalTag,
    EvalTagType,
    EvalSpanKind,
    EvalName,
)
from traceai_langchain import LangChainInstrumentor

eval_tags = [
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.GROUNDEDNESS,
        config={},
        mapping={
            "input": "llm.input_messages.1.message.content",
            "output": "llm.output_messages.0.message.content",
            "context": "llm.input_messages.0.message.content",
        },
        custom_eval_name="Groundedness",
    ),
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.CONTEXT_ADHERENCE,
        config={},
        mapping={
            "context": "llm.input_messages.0.message.content",
            "output": "llm.output_messages.0.message.content",
        },
        custom_eval_name="Context_Adherence",
    ),
    EvalTag(
        type=EvalTagType.OBSERVATION_SPAN,
        value=EvalSpanKind.LLM,
        eval_name=EvalName.EVAL_CONTEXT_RETRIEVAL_QUALITY,
        config={
            "criteria": (
                "Evaluate if the context is relevant and "
                "sufficient to support the output."
            ),
        },
        mapping={
            "input": "llm.input_messages.1.message.content",
            "output": "llm.output_messages.0.message.content",
            "context": "llm.input_messages.0.message.content",
        },
        custom_eval_name="Context_Retrieval_Quality",
    ),
]

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="rag_support_assistant",
    eval_tags=eval_tags,
)
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
```
Once the tracer is registered, every LangChain call produces a span with the three RAG scores attached. Swap LangChainInstrumentor for LlamaIndexInstrumentor, HaystackInstrumentor, or DSPyInstrumentor to instrument other stacks.
Code: a custom RAG rubric with CustomLLMJudge
When the catalog metrics do not fully capture the rubric (regulated industries, brand voice, JSON schema validation on retrieved citations), define a local judge.
```python
import os

from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider
from fi.opt.base import Evaluator

citation_judge = CustomLLMJudge(
    name="citation_format_judge",
    grading_criteria=(
        "Score 1 if every numbered claim in the answer is "
        "followed by a bracketed citation like [1] or [2] "
        "and the cited source appears verbatim in the context. "
        "Otherwise score 0."
    ),
    provider=LiteLLMProvider(
        model=os.getenv("JUDGE_MODEL", "gpt-4o-mini"),
    ),
)

evaluator = Evaluator(metric=citation_judge)
score = evaluator.evaluate(
    output=(
        "Future AGI ships Context Adherence [1] and "
        "Groundedness [2] metrics."
    ),
    context=(
        "[1] Context Adherence checks whether the answer "
        "stays inside the retrieved context. "
        "[2] Groundedness checks whether each claim is "
        "supported by evidence."
    ),
)
print(score)
```
The judge runs locally, hits the configured provider through LiteLLM, and returns a rubric score. Pin the provider temperature to 0 and pin the model version in the env var if you need reproducible scores across runs.
How to ship the loop
- Pin an eval set. 50 to 200 representative RAG queries with retrieved context. Refresh quarterly.
- Score the baseline. Run the eval set through your current pipeline; record Context Adherence, Groundedness, Context Retrieval Quality.
- Run the three-axis sweep. Vary chunker, retriever, and chain.
- Pick the winner. Highest joint score on Context Adherence + Groundedness wins. Confirm on a held-out test set.
- Wire production tagging. Use the EvalTag snippet above so every prod call is scored.
- Alert on regressions. Track the rolling mean Context Adherence and the rate of low-score samples (e.g., samples below 0.6); alert when the failure rate exceeds your SLO (commonly 5-10%, tune on your data) and route those samples to a review queue.
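The regression-alert step can be sketched as a small rolling-window monitor. The class below is a minimal assumption-laden sketch, not part of the Future AGI SDK: the window size, the 0.6 low-score cutoff, and the 5% SLO are the illustrative defaults from the checklist and should be tuned on your data.

```python
from collections import deque


class AdherenceMonitor:
    """Rolling window over per-call Context Adherence scores."""

    def __init__(self, window: int = 500, low_score: float = 0.6,
                 slo_failure_rate: float = 0.05):
        self.scores = deque(maxlen=window)  # only the most recent `window` calls
        self.low_score = low_score
        self.slo = slo_failure_rate

    def record(self, score: float) -> None:
        self.scores.append(score)

    def rolling_mean(self):
        return sum(self.scores) / len(self.scores) if self.scores else None

    def failure_rate(self) -> float:
        if not self.scores:
            return 0.0
        return sum(s < self.low_score for s in self.scores) / len(self.scores)

    def breached(self) -> bool:
        # Fire the alert (and route samples to review) when the SLO is exceeded.
        return self.failure_rate() > self.slo
```

Call `record()` from whatever consumes your eval results, and page on `breached()`.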
Most common RAG mistakes and how the metrics surface them
| Mistake | What you will see | Fix |
|---|---|---|
| Chunks too long | Retrieved passages contain relevant text plus noise; Groundedness drops | Shorten chunks, add overlap |
| Chunks too short | Context Retrieval Quality low | Increase chunk size, switch to recursive splitter |
| Similarity-only retrieval | Context Adherence high, Groundedness low | Add MMR and a cross-encoder reranker |
| Fluency-biased generator | Groundedness drops; Adherence stays high | Tighten system prompt; switch chain to map-rerank or refine |
| Prompt asks for “your knowledge” | Adherence drops on questions outside context | Add explicit “only answer from the context” instruction |
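For the last row, the fix is a context-only system prompt. One minimal sketch along these lines (the exact wording and the `build_messages` helper are illustrative, not a prescribed template):

```python
# Illustrative context-only system prompt; adapt the wording to your domain.
GROUNDED_SYSTEM_PROMPT = (
    "You are a support assistant. Answer ONLY from the context below. "
    "If the context does not contain the answer, say "
    "\"I don't have that information.\" Do not use outside knowledge, "
    "and cite the passage number for every claim.\n\n"
    "Context:\n{context}"
)


def build_messages(context: str, question: str) -> list[dict]:
    """Assemble chat messages with the retrieved context pinned in the system turn."""
    return [
        {"role": "system", "content": GROUNDED_SYSTEM_PROMPT.format(context=context)},
        {"role": "user", "content": question},
    ]
```

Putting the context in the system turn (message 0) also matches the span mappings used in the EvalTag snippet above, where `context` reads `llm.input_messages.0.message.content`.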
When to combine RAG with prompt-opt
Once chunker, retriever, and chain are tuned, run a prompt-opt loop on the generator prompt with the same Future AGI eval as the search target. Future AGI’s prompt-opt loop uses the same evaluate() calls in the search loop and in production, so the optimized prompt is judged against the production rubric, not a proxy.
Where Future AGI fits
Future AGI is the eval-and-observability layer for RAG. The Apache 2.0 SDK ships the three RAG metrics, the trace instrumentors for every major framework, and the prompt-opt loop fed by the same metrics. The Agent Command Center BYOK gateway at /platform/monitor/command-center adds multi-provider routing and guardrails for the generator side. Together they form a continuous evaluation loop on every RAG call without rewriting the application.
Wrap-up
Hallucinations are not a model problem. They are a measurement problem. Score every RAG call against Context Adherence and Groundedness, run the three-axis sweep, and tag production traces so the loop never stops. The integration is a few lines of Python, the SDK is Apache 2.0, and the same playbook works whether you ship on LangChain, LlamaIndex, Haystack, or DSPy.
For deeper reading see the RAG evaluation metrics guide, agentic RAG systems, and the 2026 best RAG evaluation tools comparison.
Frequently asked questions
What is a RAG hallucination?
Why is Future AGI a fit for RAG hallucination detection?
Do I need labeled ground truth data?
Which combination of chunker, retriever, and chain wins in practice?
Will the eval pipeline plug into LangChain, LlamaIndex, Haystack, or DSPy?
How fast are the cloud evaluators?
What about reranking and query rewriting?
How do I run online evaluations on production traffic?