Introduction
Building on our previous blog on advanced chunking strategies to enhance RAG performance, this edition delves into RAG Prompting to Reduce Hallucination, highlighting techniques that enhance factual accuracy and ensure well-grounded responses.
We know that RAG combines generative pre-trained models with a real-time retrieval mechanism, ensuring that outputs are grounded and up to date with current knowledge. Unlike traditional models, which rely solely on the data they were trained on, RAG fetches relevant information from external sources at inference time. For instance, a traditional language model such as OpenAI’s GPT-4, used without RAG, might answer a query like "Why does the Meeseeks Box require maintenance?" with a generic or speculative response. Here's how it might look:
However, when using RAG, the model can retrieve relevant documents from an external knowledge base in real time to provide a more accurate and grounded response. The choice of chain type plays a pivotal role in how the retrieved information is processed. For example, the stuff chain type combines all retrieved documents into a single input, making it effective for scenarios with concise or limited data. The map_reduce chain type, on the other hand, processes each document individually to summarise it and then aggregates the results, making it suitable for larger and more complex datasets.
Here’s how RAG works with different chain types:
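As a rough illustration, the two chain types might be wired up as follows using LangChain's RetrievalQA. This is a minimal sketch: the model choice, embedding model, and knowledge-base snippets below are placeholders standing in for the real setup.

```python
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Hypothetical knowledge-base snippets standing in for the real documents.
docs = [
    "The Meeseeks Box requires periodic maintenance of its summoning circuitry.",
    "Worn actuator buttons on the Meeseeks Box should be replaced to avoid misfires.",
]

vector_store = FAISS.from_texts(docs, OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_kwargs={"k": 2})
llm = ChatOpenAI(model="gpt-4", temperature=0)

# "stuff": concatenate every retrieved document into a single prompt.
stuff_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever
)

# "map_reduce": summarise each document separately, then combine the summaries.
map_reduce_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type="map_reduce", retriever=retriever
)

query = "Why does the Meeseeks Box require maintenance?"
print(stuff_chain.invoke({"query": query})["result"])
print(map_reduce_chain.invoke({"query": query})["result"])
```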
RAG’s ability to provide accurate and relevant responses depends significantly on the quality of the prompts used. Effective prompt engineering helps guide the model to better utilise the retrieved information and reduces the likelihood of irrelevant or speculative answers. Additionally, the choice of chain type further impacts response quality.
For example, consider the following two prompts for the same query:
Generic Prompt: "Why does the Meeseeks Box require maintenance?"
Specific Prompt: "Based on retrieved documents, explain the maintenance requirements of the Meeseeks Box."
The second prompt is more precise, explicitly directing the model to ground its response in the retrieved content, which improves factual accuracy and coherence. When combined with the map_reduce chain, the process becomes even more robust: the chain processes each document individually, summarises its key points, and then combines these summaries into a cohesive, comprehensive answer. This approach minimises the risk of information overload and lets the model handle larger datasets more effectively than the stuff chain.
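To make this concrete, here is a sketch of pairing a grounded, specific prompt with the map_reduce chain, reusing the llm and retriever from the earlier snippet. The template wording is illustrative, not the exact prompt used in these experiments.

```python
from langchain.prompts import PromptTemplate

# Map step: summarise each retrieved document with respect to the question.
question_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Based on the retrieved document below, extract only the facts that are "
        "relevant to the question.\n\nDocument:\n{context}\n\nQuestion: {question}"
    ),
)

# Reduce step: combine the per-document summaries into one grounded answer.
combine_prompt = PromptTemplate(
    input_variables=["summaries", "question"],
    template=(
        "Based on the retrieved documents, explain the answer to the question.\n\n"
        "Summaries:\n{summaries}\n\nQuestion: {question}"
    ),
)

grounded_map_reduce = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=retriever,
    chain_type_kwargs={
        "question_prompt": question_prompt,
        "combine_prompt": combine_prompt,
    },
)
```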
Let’s now explore different types of prompts to improve RAG’s performance, using different chain types to evaluate how effectively each prompt guides the model’s responses.
Types of Prompting Techniques
To demonstrate how different prompting techniques enhance RAG’s performance and reduce hallucinations, we will explore them with an example query and document:
Documents retrieved using the above query:
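As a minimal sketch (reusing the retriever from the first snippet and the running Meeseeks Box query), the documents above would be fetched like this:

```python
query = "Why does the Meeseeks Box require maintenance?"
retrieved_docs = retriever.get_relevant_documents(query)
for doc in retrieved_docs:
    print(doc.page_content)
```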
a. Baseline prompt
This is the simplest type of prompt where the model is directly asked to answer a question without any additional guidance or examples.
Without explicit context or guidance, the model may produce irrelevant or inaccurate answers, increasing the risk of hallucinations.
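An illustrative baseline prompt might look like the following; the wording is an assumption rather than the original template, and it is plugged into the stuff chain via chain_type_kwargs. The later prompt variants are wired in the same way.

```python
from langchain.prompts import PromptTemplate

# No grounding instructions: the context and question are passed through as-is.
baseline_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="{context}\n\nQuestion: {question}\nAnswer:",
)

baseline_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever,
    chain_type_kwargs={"prompt": baseline_prompt},
)
```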
b. Context Highlighting
This prompt explicitly instructs the model to rely only on the provided context to respond.
It can sometimes restrict the model’s flexibility, making it less capable of exploring ideas beyond the provided context.
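An assumed wording for a context-highlighting prompt, reusing PromptTemplate from the baseline sketch:

```python
# The model is told to answer ONLY from the supplied context.
context_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer the question using ONLY the context below. If the context does "
        "not contain the answer, say that you don't know.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    ),
)
```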
c. Step-by-Step Reasoning
This method encourages the model to break down its thought process into logical steps before answering.
Effective for complex questions but might feel redundant or overly verbose for simple queries.
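An assumed wording for a step-by-step reasoning prompt:

```python
# The model is asked to reason through the context before answering.
step_by_step_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Using the context below, reason through the question step by step, "
        "then state the final answer.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\n"
        "Step-by-step reasoning and answer:"
    ),
)
```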
d. Fact Verification
This prompt ensures that the model validates its response against the provided context for accuracy.
This reduces errors and helps build confidence in the reliability of the output.
Its effectiveness depends on the quality and completeness of the provided context.
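An assumed wording for a fact-verification prompt:

```python
# The model is asked to check every claim against the context before responding.
fact_check_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer the question from the context below. Before finalising, verify "
        "that every statement in your answer is supported by the context and "
        "remove anything that is not.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nVerified answer:"
    ),
)
```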
e. Role-Based Prompting
This approach frames the model as a specific persona, allowing it to respond with specialized knowledge and insights.
Ideal for domain-specific queries or tailored responses.
Can sometimes overly narrow the focus, limiting consideration of broader or alternative perspectives.
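An assumed wording for a role-based prompt, with a persona chosen purely for illustration:

```python
# The model answers as a domain-expert persona.
role_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are a senior maintenance engineer for Meeseeks Boxes. Using the "
        "context below, answer the question as that expert.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    ),
)
```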
Evaluating Prompts
To determine the most effective prompting technique for reducing hallucination in a RAG system, we evaluate outputs using a consistent query and analyze their alignment with retrieved context to ensure accuracy and coherence.
a. BLEU
Measures n-gram overlap between the response and the source documents to evaluate how closely the response matches the original text.
Combines all retrieved documents into a single text and then calculates n-gram overlap.
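A sketch of this check using NLTK's sentence-level BLEU (the library choice and smoothing method are assumptions):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_vs_sources(response: str, source_docs: list[str]) -> float:
    # Aggregate all retrieved documents into a single reference text.
    reference = " ".join(source_docs).split()
    candidate = response.split()
    return sentence_bleu(
        [reference], candidate, smoothing_function=SmoothingFunction().method1
    )
```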
b. ROUGE-L
Evaluates the lexical overlap between the response and the source documents by measuring the longest common subsequence (LCS).
Combines all retrieved documents into a single text and then compares this aggregated text with the response to compute the LCS-based ROUGE-L score.
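A sketch using the rouge-score package (an assumed choice), reporting the LCS-based F1:

```python
from rouge_score import rouge_scorer

def rouge_l_vs_sources(response: str, source_docs: list[str]) -> float:
    # The aggregated sources act as the reference; the response is the prediction.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(" ".join(source_docs), response)["rougeL"].fmeasure
```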
c. BERT Score
Measures token-level semantic similarity using contextual embeddings.
Aggregates the source content, computes contextual embeddings for both the response and the source, and then calculates precision, recall, and F1 scores between the response and the source content.
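A sketch using the bert-score package (an assumed choice):

```python
from bert_score import score as bert_score

def bertscore_vs_sources(response: str, source_docs: list[str]):
    # Compare the response against the aggregated source content.
    P, R, F1 = bert_score([response], [" ".join(source_docs)], lang="en")
    return P.item(), R.item(), F1.item()
```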
d. Embedding Similarity
Measures the semantic alignment between the generated response and the retrieved documents.
Converts the response and source documents into vector representations. Then, it computes cosine similarity for each pair and selects the maximum similarity score.
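A sketch using sentence-transformers (the model name is an assumption), keeping the best per-document cosine similarity:

```python
from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def max_embedding_similarity(response: str, source_docs: list[str]) -> float:
    # Embed the response and each source document, then take the best match.
    resp_emb = _embedder.encode(response, convert_to_tensor=True)
    doc_embs = _embedder.encode(source_docs, convert_to_tensor=True)
    return util.cos_sim(resp_emb, doc_embs).max().item()
```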
Result
The following Python code generates responses for each prompt, calculates the metrics, and prepares the results for easy comparison:
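The original listing is not reproduced here; the sketch below shows one way the loop could be structured, reusing the chains, prompts, and metric helpers defined above (names and structure are illustrative).

```python
import pandas as pd

prompts = {
    "baseline": baseline_prompt,
    "context_highlighting": context_prompt,
    "step_by_step": step_by_step_prompt,
    "fact_verification": fact_check_prompt,
    "role_based": role_prompt,
}

source_texts = [doc.page_content for doc in retrieved_docs]
rows = []
for name, prompt in prompts.items():
    # Build a stuff chain with the current prompt and generate a response.
    chain = RetrievalQA.from_chain_type(
        llm=llm, chain_type="stuff", retriever=retriever,
        chain_type_kwargs={"prompt": prompt},
    )
    response = chain.invoke({"query": query})["result"]
    rows.append({
        "prompt": name,
        "bleu": bleu_vs_sources(response, source_texts),
        "rouge_l": rouge_l_vs_sources(response, source_texts),
        "bert_f1": bertscore_vs_sources(response, source_texts)[2],
        "embedding_sim": max_embedding_similarity(response, source_texts),
    })

print(pd.DataFrame(rows))
```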
Below is the evaluation of different prompts using the stuff chain type:
Below is the evaluation of different prompts using the map_reduce chain type:
Conclusion
The stuff chain shows variability across prompts because it processes all retrieved documents as a single input, making it more susceptible to token limits and the relative quality of individual prompts.
The map_reduce chain processes documents individually, ensuring each contributes equally to the final response, thus resulting in more consistent performance across the prompts.
Context Highlighting excels in both chains, showcasing its adaptability and effectiveness in grounding responses to retrieved content.
Baseline Prompt delivers steady but average performance due to its lack of specificity.
Step-by-step reasoning ensures logical flow but shows minimal improvement due to its broad approach.
Fact verification performs well in the stuff chain due to its focus on specific details, boosting BLEU scores, but struggles with semantic alignment due to weaker contextual grounding.
Role-based prompting underperforms across chains as it lacks a clear focus on retrieved content, leading to weaker grounding and coherence.