April 29, 2025

How to Decrease RAG Hallucinations with Future AGI

Methods to decrease hallucinations in RAG models using Future AGI
  1. What Is Hallucination?

In Large Language Models (LLMs), hallucination occurs when the model generates content that sounds plausible but is factually incorrect or unsupported by any source. In RAG, hallucination happens when the model produces information that is unsupported by, or irrelevant to, the retrieved context.

  2. How Can a RAG Application Hallucinate?

While RAG aims to reduce hallucination by grounding model outputs in retrieved external knowledge, it can still produce hallucinated responses due to issues in its pipeline components:

  • Chunking issues: Badly split documents can omit relevant content or cut passages in two.

  • Retrieval failures: The passages retrieved could be irrelevant or not sufficiently similar to the query.

  • Chain-logic weaknesses: The final response-generation step may fail to use the retrieved context accurately.

All these issues undermine the factual accuracy of RAG outputs, which is critical for most production systems that need verifiable and context-based responses.

  3. How Can Future AGI Help Tune the RAG Pipeline to Reduce Hallucination?

To systematically reduce hallucinations in RAG workflows, this blog adopts a structured evaluation pipeline driven by Future AGI’s automated instrumentation framework. The methodology is centered around three phases: configuration-driven RAG setup, model response generation, and automated evaluation of factual alignment and context adherence.

  • Configuration-Driven RAG Setup: The RAG system is parameterized in a configuration file, which enables reproducible experimentation across different strategies (a sample configuration sketch appears below). The key components are:

    • Chunking Strategy: The input documents are chunked using either RecursiveCharacterTextSplitter or CharacterTextSplitter.

    • Retrieval Strategy: FAISS-based vector stores perform document retrieval via either similarity or mmr (Maximal Marginal Relevance) search modes.

    • Chain Strategy: Retrieved documents and the input query are fed into a LangChain-based chain (stuff, map_reduce, refine, or map_rerank) to produce the final response via OpenAI’s GPT-4o-mini.

  • Instrumentation: Future AGI’s evaluation is provided through the fi_instrumentation package. This setup allows real-time evaluation across the following metrics:

    • Groundedness: Evaluates whether a response is firmly based on the provided context. (Learn more)

    • Context Adherence: Evaluates how well responses stay within the provided context. (Learn more)

    • Context Retrieval Quality: Evaluates the quality of the context retrieved for generating a response. (Learn more)

Click here to learn how to set up a trace provider in Future AGI
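
For illustration, the configuration for one such run might look like the Python sketch below. All keys and values other than the future_agi block referenced later in this post are assumptions for this example, not the cookbook's exact schema:

# Hypothetical configuration dict (could equally be loaded from a YAML file).
# Keys outside "future_agi" are assumptions, not the cookbook's actual schema.
config = {
    "future_agi": {
        "project_name": "rag-hallucination-study",
        "project_version": "CharacterTextSplitter_mmr_map_rerank",
    },
    "chunking": {"splitter": "CharacterTextSplitter", "chunk_size": 1000, "chunk_overlap": 100},
    "retrieval": {"vector_store": "FAISS", "search_type": "mmr", "k": 4},
    "chain": {"type": "map_rerank", "model": "gpt-4o-mini"},
}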

  • Automated Evaluation Execution: The system executes a predefined set of queries against each RAG configuration (see the driver-loop sketch after this list). Each query is evaluated as follows:

    • The RAG pipeline generates a response based on the configured setup.

    • Evaluation spans are automatically captured and sent to Future AGI.

    • Scores for groundedness, context adherence, and retrieval quality are logged and analyzed.
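
A minimal sketch of this driver loop is shown below; build_rag_chain and the query list are hypothetical placeholders, and setup_instrumentation is the helper shown later in this post:

# Hypothetical driver loop: run every query against every configuration.
def run_all(configs: list, queries: list) -> None:
    for config in configs:
        setup_instrumentation(config)      # register Future AGI eval tags for this run
        chain = build_rag_chain(config)    # hypothetical helper: builds the configured pipeline
        for query in queries:
            # Each invocation is traced; spans and eval scores stream to Future AGI.
            response = chain.invoke({"query": query})
            print(config["future_agi"]["project_version"], "->", response["result"])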

  4. About The Dataset

Here we use a benchmark dataset to evaluate response alignment in RAG workflows. This allows us to measure how models use retrieved context to generate relevant responses. The dataset contains the following columns:

  • question: The user query that was asked of the language model.

  • context: The retrieved text provided to the model to help answer the query.

  • answer: The response generated by the model using the given context and question.

Below are a few sample rows from the dataset:

Sample context, question, and answer examples used for evaluating RAG hallucination and groundedness in Future AGI's automated pipeline
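
Assuming the benchmark is available as a CSV file (the file name below is hypothetical; the cookbook may load it differently), inspecting it takes a few lines:

import pandas as pd

# "rag_benchmark.csv" is a hypothetical file name for the benchmark dataset.
df = pd.read_csv("rag_benchmark.csv")
print(df[["question", "context", "answer"]].head())
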
  5. Experimentation

Hallucination in RAG systems typically stems from three root causes:

  • The model receives insufficient context, thus forcing it to rely on its pre-training instead of the provided context.

  • The retriever fails to return relevant documents, leaving the model without factual grounding.

  • The model prioritizes fluency over accuracy, generating outputs that “sound right” but are disconnected from the context, resulting in hallucinated responses.

To detect these hallucinations, we use the following evaluation metrics provided by Future AGI:

  • Groundedness: Measures how well the generated response is substantiated by the retrieved context. It helps identify when the model is "making things up" rather than using the provided evidence. This feature directly surfaces hallucinations caused by over-reliance on pre-training or under-utilization of context.

  • Context Adherence: Assesses whether the response stays within the bounds of the retrieved information. Even if a response is factually correct, answering outside the retrieved context can be misleading. This metric captures subtle forms of hallucination where content is plausible but contextually disconnected.

  • Context Retrieval Quality: Evaluates whether the retrieved context was relevant and sufficient in the first place.

Click here to learn about these evals in detail

Below is the code snippet for configuring instrumentation for systematic evaluation of a typical RAG application. It also defines the evaluation metrics used to assess the quality of each generated response.

Click here to read complete experimentation details in the cookbook.

Click here to learn more about setting up instrumentation in Future AGI

# Import paths below assume Future AGI's fi_instrumentation SDK and its
# LangChain instrumentor; adjust them to match your installed package versions.
from fi_instrumentation import register
from fi_instrumentation.fi_types import (
    EvalName,
    EvalSpanKind,
    EvalTag,
    EvalTagType,
    ProjectType,
)
from traceai_langchain import LangChainInstrumentor

def setup_instrumentation(config: dict):
    eval_tags=[
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.GROUNDEDNESS,
            config={},
            mapping={
                "input": "llm.input_messages.1.message.content",
                "output": "llm.output_messages.0.message.content"
            },
            custom_eval_name="Groundedness"
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.CONTEXT_ADHERENCE,
            config={},
            mapping={
                "context": "llm.input_messages.0.message.content",
                "output": "llm.output_messages.0.message.content"
            },
            custom_eval_name="Context_Adherence"
        ),
        EvalTag(
            type=EvalTagType.OBSERVATION_SPAN,
            value=EvalSpanKind.LLM,
            eval_name=EvalName.EVAL_CONTEXT_RETRIEVAL_QUALITY,
            config={
                "criteria": "Evaluate if the context is relevant and sufficient to support the output."
            },
            mapping={
                "input": "llm.input_messages.1.message.content",
                "output": "llm.output_messages.0.message.content",
                "context": "llm.input_messages.0.message.content"
            },
            custom_eval_name="Context_Retrieval_Quality"
        )
    ]

    trace_provider = register(
        project_type=ProjectType.EXPERIMENT,
        project_name=config['future_agi']['project_name'],
        project_version_name=config['future_agi']['project_version'],
        eval_tags=eval_tags
    )
    LangChainInstrumentor().instrument(tracer_provider=trace_provider)
    print(f"FutureAGI instrumentation setup for Project: {config['future_agi']['project_name']}, Version: {config['future_agi']['project_version']}")
  6. Result

We used Future AGI's automated scoring framework to evaluate each experimental run and determine the most effective RAG configuration. The evaluation covered quality metrics (groundedness, context adherence, and retrieval quality) as well as system metrics such as cost and latency. A weighted preference model was used to rank the runs, reflecting real-world trade-offs between performance and efficiency.

Inside the ‘Choose Winner’ option in the top-right corner of the All Runs view, the evaluation sliders were positioned to place higher value on model accuracy than on operational efficiency. Weights were assigned as follows:

‘Choose Winner’ settings in Future AGI used to select the best-performing run, with sliders for groundedness, context adherence, retrieval quality, cost, and latency

This setup prioritizes accuracy and context alignment while keeping cost and response time reasonable.
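
One way to think about the ranking is as a weighted sum over normalized metrics. The weights below are hypothetical placeholders for illustration, not the slider values used in the experiment:

# Hypothetical weights for illustration only; the real values were set via the
# 'Choose Winner' sliders in the Future AGI UI.
WEIGHTS = {
    "groundedness": 0.30,
    "context_adherence": 0.30,
    "context_retrieval_quality": 0.20,
    "cost": -0.10,     # lower is better, hence negative weight
    "latency": -0.10,
}

def weighted_score(run_metrics: dict) -> float:
    # run_metrics holds each metric normalized to the 0-1 range.
    return sum(WEIGHTS[name] * run_metrics[name] for name in WEIGHTS)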

Comparison of all runs executed during the experiment, evaluating chunking, retrieval, and generation strategies on groundedness, context adherence, and retrieval quality

The winning configuration was CharacterTextSplitter_mmr_map_rerank, which combines character-based chunking, MMR (Maximal Marginal Relevance) retrieval, and map_rerank generation. This approach provides a solid trade-off between reliability and resource efficiency, making it a good fit for production-level RAG pipelines where minimizing hallucination is a primary concern.

  7. Understanding the Result

Before examining why CharacterTextSplitter_mmr_map_rerank performed best, let's briefly review the RAG pipeline:

  • The first step is chunking, the process of dividing a large document into smaller pieces before ingesting it into a RAG pipeline.

  • After chunking documents and storing them in a vector database, the next step is retrieval, which is the process of finding the most relevant chunks given a query.

  • The last stage of RAG is to put together the retrieved chunks and the user query to form a response.

Each of these stages (chunking, retrieval, and generation) plays a critical role in shaping the final output, and the performance of different strategies at each step can significantly affect hallucination rates.

Why the CharacterTextSplitter_mmr_map_rerank configuration performed best (a minimal sketch of this configuration follows the list below):

  • MMR improves on basic similarity by balancing relevance and diversity. It selects chunks that are not only close to the query but also different from each other.

  • The CharacterTextSplitter method worked better in our tests when used with MMR retrieval because it created more varied chunks of content, giving MMR a larger selection of unique text passages to choose from.

  • The map_rerank chain strategy performed best by treating each chunk independently and selecting only the strongest answer; this method ensured high factual consistency and minimized hallucinated reasoning.
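
As a rough sketch (not the cookbook's exact code), the winning configuration can be assembled from standard LangChain components; import paths, the document file name, and parameter values below are assumptions that may differ by LangChain version:

# Sketch of the CharacterTextSplitter + MMR + map_rerank pipeline.
# Import paths, "docs.txt", and chunk sizes are assumptions; adjust to your setup.
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

documents = TextLoader("docs.txt").load()                      # hypothetical source document
splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 4})

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    chain_type="map_rerank",   # scores each chunk's answer and keeps the strongest one
    retriever=retriever,
)
answer = qa_chain.invoke({"query": "Your question here"})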

FAQs


Will I be able to re-use this evaluation setup for other RAG use cases or datasets?

Will I require labeled data in order to evaluate the hallucinations when using Future AGI?

I am using a different framework for my RAG application. Can I still use Future AGI for evaluation purposes?

Can I create custom evaluations tailored to my RAG use case in Future AGI?


Ready to deploy Accurate AI?

Book a Demo