Introduction
Retrieval-Augmented Generation (RAG) is a hybrid architecture that combines a retriever and a generator to improve the factual accuracy and relevance of a language model’s responses. Instead of generating answers purely from the model’s internal knowledge, a retriever first fetches relevant documents or passages from an external knowledge base (such as a vector store or database) using similarity search; the language model then takes the retrieved context and the user query as input and generates a grounded, context-aware response.
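To make the retrieve-then-generate flow concrete, here is a minimal sketch. It uses a toy keyword-overlap retriever and only assembles the grounded prompt; a real pipeline would use embeddings plus a vector store and send the prompt to a chat model.

```python
# Minimal retrieve-then-generate sketch with a toy keyword-overlap retriever.
# A production system would use embeddings + a vector store and call an LLM with the prompt.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    # Toy relevance score: number of query words that appear in the document.
    scored = [(sum(w in doc.lower() for w in query.lower().split()), doc) for doc in documents]
    return [doc for score, doc in sorted(scored, key=lambda s: s[0], reverse=True)[:k] if score > 0]

def build_prompt(query: str, documents: list[str]) -> str:
    context = "\n\n".join(retrieve(query, documents))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

docs = [
    "Future AGI provides automated evaluation for LLM applications.",
    "FAISS is a library for efficient similarity search over dense vectors.",
]
# This prompt is what the generator (LLM) would receive.
print(build_prompt("What is FAISS used for?", docs))
```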
Although RAG systems are designed to generate grounded responses by incorporating external knowledge sources, they remain susceptible to hallucination. Hallucination in LLMs is the generation of content that sounds plausible but is factually incorrect or unsupported by any source. In RAG, hallucination occurs when the model generates information that is unsupported by, or irrelevant to, the retrieved context. These failures are increasingly appearing in production environments, leading to misinformation and to reputational and financial damage.
This blog outlines a structured, evaluation-driven methodology for diagnosing and mitigating hallucination in RAG pipelines using Future AGI. The resulting evaluation pipeline provides a reproducible and scalable framework for testing, comparing, and refining RAG systems to ensure factual consistency, improved reliability, and readiness for deployment in high-stakes applications.
Cost of Neglecting RAG Hallucinations
One of the most recent high-profile cases of AI hallucination involves Cursor, a well-known AI-powered coding assistant IDE. When a developer contacted its AI-powered support agent to complain about being logged out when using the IDE on multiple devices, the chatbot responded with a hallucinated company policy stating that a Cursor account could only be used on one device per subscription. Cursor’s co-founder Michael Truell had to step in personally to clarify that no such policy exists. Despite the swift response, the incident was a major embarrassment for a company selling AI tools, and it shows that even an AI-focused firm can be burned by its own AI’s hallucinations [1].
Another high-profile incident involves MyCity, a Microsoft-powered chatbot launched by the New York City government to help small business owners. Soon after launch, it began giving illegal advice and guidance that was not part of the official guidelines. Despite having access to the city’s regulatory guidelines, it failed to provide grounded responses [2].
Across these examples, it is evident that access to an external knowledge base does not guarantee that the AI won’t hallucinate. The consequences can range from public ridicule to legal action and financial losses, prompting organisations to strengthen their oversight.
What Causes Hallucination in RAG Applications?
Hallucination in RAG systems is typically caused by three root issues:
The model receives insufficient context, forcing it to rely on its pre-training instead of the provided context.
The retriever fails to return relevant documents, leaving the model without factual grounding.
The LLM prioritises fluency over accuracy, generating responses that “sound right” but are disconnected from the context, resulting in hallucinated responses.
Potential Issues in a Typical RAG Pipeline Leading to Hallucination
While RAG aims to reduce hallucination by grounding model outputs in retrieved external knowledge, it may still produce hallucinated responses due to issues with the components of its pipeline (a short chunking sketch follows this list):
Chunking issues: Badly split documents can leave out relevant content or cut passages in two.
Retrieval failures: The passages retrieved could be irrelevant or not sufficiently similar to the query.
Weaknesses in chain logic: The final response-generation step may fail to leverage the retrieved context properly.
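As an illustration of the first failure mode, a splitter configured with too small a chunk size and no overlap can cut a fact-bearing sentence in half, so neither chunk is retrievable on its own. A rough sketch using LangChain’s text splitter (the chunk sizes are deliberately extreme, and the import path depends on your LangChain version):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "The refund policy allows returns within 30 days of purchase. "
    "Returns after 30 days are only accepted for defective items."
)

# Deliberately small chunks with no overlap: sentences get cut mid-fact.
bad_splitter = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=0)
print(bad_splitter.split_text(text))

# Larger chunks with overlap keep each fact intact across chunk boundaries.
better_splitter = RecursiveCharacterTextSplitter(chunk_size=120, chunk_overlap=30)
print(better_splitter.split_text(text))
```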
How Can Future AGI Help in Tuning a RAG Pipeline to Reduce Hallucination?
To mitigate hallucinations in RAG workflows, this blog adopts an evaluation-driven pipeline supported by Future AGI’s instrumentation framework. The methodology is organised around three phases: configuration-driven RAG setup, model response generation, and automated evaluation of factual alignment and context adherence.
Configuration-Driven RAG Setup: The RAG system is parameterised in a configuration file, which enables reproducible experimentation across different strategies. The key components include (a minimal build sketch follows this list):
Chunking Strategy: The input documents are chunked using either RecursiveCharacterTextSplitter or CharacterTextSplitter.
Retrieval Strategy: FAISS-based vector stores perform document retrieval via either similarity or mmr (Maximal Marginal Relevance) search modes.
Chain Strategy: The retrieved documents and the input query are fed into a LangChain-based chain (stuff, map_reduce, refine, or map_rerank) to produce the final response via OpenAI’s GPT-4o-mini.
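To make the configuration-driven setup concrete, here is a minimal sketch of how such a config could drive pipeline assembly with LangChain. The config keys and the build_rag_pipeline helper are illustrative rather than the exact code used in the experiment, and it assumes the langchain, langchain-openai, langchain-community, langchain-text-splitters, and faiss-cpu packages plus an OpenAI API key.

```python
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter

# Illustrative configuration; in the experiment this lives in a config file.
config = {
    "chunking": "recursive",      # "recursive" or "character"
    "chunk_size": 500,
    "chunk_overlap": 50,
    "search_type": "mmr",         # "similarity" or "mmr"
    "chain_type": "map_rerank",   # "stuff", "map_reduce", "refine", or "map_rerank"
}

def build_rag_pipeline(raw_text: str, config: dict) -> RetrievalQA:
    # 1. Chunking: pick the splitter class named in the config.
    splitter_cls = (
        RecursiveCharacterTextSplitter if config["chunking"] == "recursive" else CharacterTextSplitter
    )
    splitter = splitter_cls(chunk_size=config["chunk_size"], chunk_overlap=config["chunk_overlap"])
    chunks = splitter.split_text(raw_text)

    # 2. Retrieval: FAISS vector store with the configured search mode.
    vector_store = FAISS.from_texts(chunks, OpenAIEmbeddings())
    retriever = vector_store.as_retriever(search_type=config["search_type"], search_kwargs={"k": 4})

    # 3. Generation: LangChain chain of the configured type over GPT-4o-mini.
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    return RetrievalQA.from_chain_type(llm=llm, chain_type=config["chain_type"], retriever=retriever)
```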
Instrumentation: Future AGI provides evaluation through the fi_instrumentation SDK. This allows evaluation in real time across the following metrics:
Groundedness: Evaluates if a response is firmly based on the provided context. (Learn more)
Context Adherence: Evaluates how well a response stays within the given context. (Learn more)
Context Retrieval Quality: Evaluates the quality of the context retrieved for generating a response. (Learn more)
Click here to learn how to set up the trace provider in Future AGI
Automated Evaluation Execution: A predefined set of queries is executed against each RAG configuration. For each query (a loop sketch follows this list):
The RAG pipeline generates a response based on the configured setup.
Evaluation spans are automatically captured and sent to Future AGI.
Scores for groundedness, context adherence, and retrieval quality are logged and analysed.
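A rough sketch of that evaluation loop is shown below. It assumes the hypothetical build_rag_pipeline helper from the earlier sketch, a placeholder knowledge-base file, and that Future AGI’s instrumentation is already registered so spans are captured automatically when each chain runs.

```python
# Illustrative evaluation loop; the document path, configs, and queries are placeholders.
raw_text = open("knowledge_base.txt").read()   # hypothetical source document

configs = {
    "RecursiveCharacterTextSplitter_similarity_stuff": {
        "chunking": "recursive", "chunk_size": 500, "chunk_overlap": 50,
        "search_type": "similarity", "chain_type": "stuff",
    },
    "CharacterTextSplitter_mmr_map_rerank": {
        "chunking": "character", "chunk_size": 500, "chunk_overlap": 50,
        "search_type": "mmr", "chain_type": "map_rerank",
    },
}
queries = ["What does the warranty cover?", "Are returns accepted after 30 days?"]

for name, config in configs.items():
    pipeline = build_rag_pipeline(raw_text, config)  # helper from the earlier sketch
    for query in queries:
        # With instrumentation registered, each invocation emits spans that Future AGI
        # scores for groundedness, context adherence, and retrieval quality.
        result = pipeline.invoke({"query": query})
        print(name, "|", query, "->", result["result"][:80])
```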
About The Dataset
We use a benchmark dataset designed to evaluate response alignment in RAG workflows. It allows us to measure how well models use retrieved context to generate relevant responses. The dataset contains the following columns (a short loading sketch follows the list):
question: The user query that was asked to the language model.
context: The retrieved text provided to the model to help answer the query.
answer: The response generated by the model using the given context and question.
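A quick way to inspect the dataset structure, assuming it is available as a CSV file (the filename here is hypothetical):

```python
import pandas as pd

# Hypothetical filename; the benchmark provides question, context, and answer columns.
df = pd.read_csv("rag_benchmark.csv")
print(df[["question", "context", "answer"]].head())
```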
Below are a few sample rows from the dataset:

Experimentation
To detect these hallucinations, we use the following eval metrics provided by Future AGI:
Groundedness: measures how well the generated response is substantiated by the retrieved context. It helps identify when the model is "making things up" rather than using the provided evidence. This directly surfaces hallucinations caused by over-reliance on pre-training or under-utilization of context.
Context Adherence: assesses whether the response stays within the bounds of the retrieved information. Even if a response is factually correct, content that goes beyond the retrieved context can be misleading. This metric captures subtle forms of hallucination where content is plausible but contextually disconnected.
Context Retrieval Quality: evaluates whether the retrieved context was relevant and sufficient in the first place.
Click here to learn about these evals in detail
Below is a code snippet for configuring instrumentation for systematic evaluation of a typical RAG application. It also defines the evaluation metrics used to assess the quality of each generated response.
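The sketch below illustrates what such an instrumentation setup can look like. The module paths, enum members, and register() signature are assumptions drawn from the general pattern of Future AGI’s traceAI tooling, not the exact snippet from the cookbook; check them against the documentation linked below before use.

```python
# Rough sketch only: module paths, enum members, and signatures are assumptions based on
# Future AGI's traceAI pattern; verify against the official docs before running.
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalName, EvalSpanKind, EvalTag, EvalTagType, ProjectType,
)
from traceai_langchain import LangChainInstrumentor

trace_provider = register(
    project_type=ProjectType.EXPERIMENT,
    project_name="rag-hallucination-eval",  # hypothetical project name
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.GROUNDEDNESS,
            # Field mapping keys depend on your span schema; shown here as placeholders.
            mapping={"input": "raw.input", "output": "raw.output"},
        ),
        # Context Adherence and Context Retrieval Quality tags follow the same shape.
    ],
)

# Auto-instrument LangChain so every chain run emits evaluation spans to Future AGI.
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
```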
Click here to read complete experimentation details in the cookbook.
Click here to learn more about setting up instrumentation in Future AGI
Result
Future AGI’s scoring framework for selecting the best run was used to assess each experimental run and establish which RAG configuration was most effective. The evaluation included quality metrics (groundedness, context adherence, and retrieval quality) as well as system metrics such as cost and latency. A weighted preference model was employed to rank the outputs, reflecting real-world trade-offs between performance and efficiency.
Using the ‘Choose Winner’ option in the top-right corner of the All Runs view, the evaluation sliders were positioned to place higher value on model accuracy than on operational efficiency. Weights were assigned as follows:

Choose Winner section used to select the best-performing run
This setup prioritizes accuracy and context alignment at a reasonable cost in time and responsiveness.

Comparison of all runs executed during the experiment
The winning configuration was CharacterTextSplitter_mmr_map_rerank, which combines character-based chunking, MMR (Maximal Marginal Relevance) retrieval, and map-rerank generation. This approach provides a solid trade-off between reliability and resource efficiency, making it a good fit for production-level RAG pipelines where minimising hallucination is a concern.
Understanding the Result
Before examining why CharacterTextSplitter_mmr_map_rerank performed best, let’s review the RAG pipeline in brief:
The first step is chunking, which divides a large document into smaller pieces before ingesting it into the RAG pipeline.
After the chunks are stored in a vector database, the next step is retrieval, which finds the most relevant chunks for a given query.
The last stage is generation, which combines the retrieved chunks and the user query to form a response.
Each of these stages, chunking, retrieval, and generation, plays a critical role in influencing the final output, and the performance of different strategies at each step can significantly affect hallucination rates.
Reasons why the CharacterTextSplitter_mmr_map_rerank configuration performed best (a sketch of this configuration follows the list):
MMR improves on basic similarity by balancing relevance and diversity. It selects chunks that are not only close to the query but also different from each other.
The CharacterTextSplitter approach performed better in our tests when combined with MMR retrieval because the resulting diversity in chunk content allowed MMR to pick from a wider range of non-redundant text passages.
The map_rerank chain strategy treats each chunk independently and selects only the strongest answer; this ensured high factual consistency and minimised hallucinated reasoning.
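For reference, the winning combination maps onto LangChain roughly as follows. This is a sketch under the same assumptions as the earlier configuration example (hypothetical knowledge-base file, standard LangChain/FAISS/OpenAI packages), with the MMR knobs (fetch_k, lambda_mult) shown explicitly since they control the relevance-versus-diversity balance discussed above.

```python
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

raw_text = open("knowledge_base.txt").read()  # hypothetical source document

# Character-based chunking.
chunks = CharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_text(raw_text)

# MMR retrieval: fetch a wider candidate pool (fetch_k), then keep k chunks that balance
# relevance to the query against diversity among themselves (lambda_mult).
retriever = FAISS.from_texts(chunks, OpenAIEmbeddings()).as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "lambda_mult": 0.5},
)

# map_rerank: answer and score each retrieved chunk independently, return only the best answer.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    chain_type="map_rerank",
    retriever=retriever,
)
```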
Ready to Reduce Hallucinations in Your RAG Applications?
Start evaluating your RAG workflows with confidence using Future AGI’s automated, no-label-required evaluation framework. Future AGI provides the tools you need to systematically reduce hallucination.
References
[1] https://www.eweek.com/news/cursor-ai-chatbot-hallucination-fake-policy/
[2] https://apnews.com/article/new-york-city-chatbot-misinformation-6ebc71db5b770b9969c906a7ee4fae21