RAG Prompting to Reduce Hallucination: 6 Techniques and How to Measure Them

Six RAG prompting patterns that reduce hallucination, with example prompts, retrieval grounding, and Context Adherence + Groundedness eval code.

TL;DR

Technique | When to use | Effect on hallucination
Context Highlighting | Default for every RAG prompt | Large reduction
Citations Required | Compliance, medical, legal, finance | Large, makes failures visible
Step-by-Step Reasoning | Multi-hop questions, comparisons | Medium, adds latency
Fact Verification Loop | High-stakes outputs, second pass on retrieved evidence | Medium-large
Role-Based Prompting | Domain-specific phrasing | Small on its own
Refusal When Empty | Always (paired with Context Highlighting) | Eliminates fabricated answers when retrieval fails

The single best prompt change is Context Highlighting plus a refusal clause. Pair it with mandatory citations, score every production response with Context Adherence and Groundedness, and treat refusals as a feature, not a failure.

The base setup we will prompt against

We use Rick and Morty’s Meeseeks Box for examples because the queries are made-up enough that a non-RAG model has no choice but to guess. Any hallucination is therefore visible.

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.schema import Document
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-5", temperature=0)
embedding_model = OpenAIEmbeddings()

documents = [
    Document(page_content="The Meeseeks Box begins its creation process by harvesting proto-Meeseeks from a quantum foam field."),
    Document(page_content="The harvested proto-Meeseeks are condensed into small energy packets and stored in temporal stasis."),
    Document(page_content="A neural imprinting laser programs each Meeseeks with a single objective, ensuring they are task-oriented."),
    Document(page_content="The Meeseeks Box has an internal logic circuit that randomly assigns objectives, such as opening jars or solving math problems."),
    Document(page_content="When the button is pressed, the Box releases a fully-formed Meeseeks, temporarily stabilized by an anti-decay field."),
    Document(page_content="After completing their task, Meeseeks are designed to disintegrate into harmless particles of joy-energy."),
    Document(page_content="The Meeseeks Box requires periodic maintenance to recharge its quantum foam reservoir, which can run dry if overused."),
]

vectorstore = FAISS.from_documents(documents, embedding_model)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

query = "Why does the Meeseeks Box require maintenance?"

Without RAG, the model would speculate. With retrieval but a bad prompt, it can still speculate beyond the retrieved chunks. The six techniques below are the prompt-side levers that close that gap.

Technique 1: Context Highlighting

The single most impactful change. Tell the model to answer using only the provided context, and to refuse when the context does not contain the answer.

context_highlight_prompt = """
You are answering a question for an internal team.

Use ONLY the context below to answer the question. Do not use prior knowledge.
If the context does not contain the answer, say "I cannot answer from the
provided context" and stop.

Context:
{context}

Question: {question}

Answer:
"""

This prompt removes two failure modes at once: drift into pretraining knowledge, and hallucination when retrieval misses. The reduction in hallucination rate depends heavily on retriever quality, model, and dataset; in practice, adding an explicit “only use this context” instruction with a refusal clause commonly helps and should be validated with evals against your own traffic.
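
To wire this template into the LangChain setup above, pass it as the stuff chain's prompt. A minimal sketch; the template's {context} and {question} placeholders match the variables RetrievalQA feeds the stuff chain.

from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    template=context_highlight_prompt,
    input_variables=["context", "question"],
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},  # replace the default stuff prompt
)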

Technique 2: Citations Required

Force the model to cite each claim. The model can no longer make a confident assertion without pointing at a source.

citation_prompt = """
Answer the question using ONLY the numbered context below.

Rules:
- Every sentence in your answer must end with a citation like [1] or [2, 3].
- A claim with no supporting context must not appear in the answer.
- If no context supports an answer, say "Insufficient context" and stop.

Context:
{numbered_context}

Question: {question}

Answer (with citations):
"""

To use this prompt, format the retrieved chunks as [1] chunk text\n[2] chunk text\n.... Citations transform implicit failures into visible ones: a downstream check can confirm that each cited chunk id actually exists and contains the cited claim.
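
Both steps fit in a few lines. A sketch with two illustrative helpers (format_numbered_context and cited_ids_exist are not library functions); note that the check below only verifies that cited chunk ids exist, not that the cited chunk actually supports the claim, which is what the Groundedness evaluator covers later.

import re

def format_numbered_context(docs):
    # Render retrieved chunks as "[1] text\n[2] text\n...".
    return "\n".join(f"[{i}] {d.page_content}" for i, d in enumerate(docs, start=1))

def cited_ids_exist(answer, num_chunks):
    # Collect every id inside citations like [1] or [2, 3] ...
    cited = {int(n) for grp in re.findall(r"\[(\d[\d,\s]*)\]", answer)
             for n in grp.split(",")}
    # ... and require at least one citation, all pointing at real chunks.
    return bool(cited) and cited <= set(range(1, num_chunks + 1))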

Technique 3: Step-by-Step Reasoning

For multi-hop questions (“what happens when X interacts with Y, given the rules in Z”), explicit step-by-step reasoning over the retrieved evidence improves answers.

step_by_step_prompt = """
Use the context to answer. Reason in numbered steps before giving the final answer.

Steps:
1. List the facts from the context that are relevant to the question.
2. Identify what the question is actually asking.
3. Combine the relevant facts to derive the answer.
4. State the final answer in one or two sentences.

If at any step you find the context lacks a required fact, stop and say
"Insufficient context."

Context:
{context}

Question: {question}
"""

Reasoning models like GPT-5 and Claude Opus 4.7 already perform chain-of-thought reasoning internally, so the marginal lift on simple questions is small. The structural value remains: the numbered steps make the failure mode visible in the trace.

Technique 4: Fact Verification Loop

After the model produces a draft answer, prompt it again to verify each claim against the retrieved context. Run this either inline as a two-step chain or as a second LLM call that receives the original context plus the draft.

verify_prompt = """
You wrote the following draft answer to a user question.

Question: {question}
Retrieved context:
{context}

Draft answer:
{draft_answer}

Now check the draft. For each claim in the draft:
1. Quote the supporting passage from the context.
2. If no passage supports the claim, mark it UNSUPPORTED.

Then produce a revised answer that contains only the supported claims, or
"Insufficient context" if no supported claim remains.
"""

This is more expensive (two LLM calls) but can improve accuracy on long, multi-claim answers. To control cost, run it conditionally: only when the Context Adherence score of the first draft falls below a threshold, as in the sketch below.
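
A sketch of the conditional loop, assuming the evaluate helper covered in the eval section further down; the 0.8 threshold is illustrative and should be tuned on your own traffic.

from fi.evals import evaluate

docs = retriever.invoke(query)
context = "\n".join(d.page_content for d in docs)

draft = llm.invoke(
    context_highlight_prompt.format(context=context, question=query)
).content

# Only pay for the second call when the draft scores poorly.
if evaluate("context_adherence", output=draft, context=context).score < 0.8:
    answer = llm.invoke(
        verify_prompt.format(question=query, context=context, draft_answer=draft)
    ).content
else:
    answer = draft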

Technique 5: Role-Based Prompting

A small but consistent improvement on domain-specific queries, especially when paired with an explicit honesty clause.

role_prompt = """
You are a careful technical writer summarizing internal product documentation.
You never speculate. You only state what is directly supported by the docs.

Documentation:
{context}

User question: {question}

Write a concise, accurate answer based only on the documentation above. If the
documentation does not contain the answer, say so explicitly.
"""

Role framing on its own is the weakest of the techniques. Stack it on top of Context Highlighting plus citations; do not rely on it alone.

Technique 6: Refusal When Empty

Always include a refusal clause; it is the most underrated lever in RAG. Most production hallucinations occur when retrieval returns nothing useful and the model fills the void with plausible-sounding fiction.

refusal_clause = """
Important: if the retrieved context contains no information directly answering
the question, respond with "I do not have enough information to answer."
Do not guess. Do not use general knowledge.
"""

Treat refusal rate as a first-class production metric, not a bug. A RAG system that confidently refuses 5 percent of the time is more trustworthy than one that confidently fabricates 5 percent of the time.
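
Measuring it is cheap. A minimal sketch; refusal_rate is an illustrative helper that string-matches the exact refusal wording from the clause above.

REFUSAL_TEXT = "I do not have enough information to answer."

def refusal_rate(responses):
    # Fraction of production responses that refused instead of answering.
    if not responses:
        return 0.0
    return sum(REFUSAL_TEXT in r for r in responses) / len(responses)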

Stitching it together: a production-grade prompt

production_prompt = """
You are a technical assistant answering questions for engineers using internal
documentation.

RULES:
1. Use ONLY the numbered context below. Do not use prior knowledge.
2. Every sentence in your answer must end with a citation like [1] or [2, 3].
3. If the context does not contain the answer, respond with exactly:
   "I do not have enough information to answer."
4. Do not summarize the question. Do not preface the answer with hedges.
5. Be concise: one to four sentences unless the question explicitly asks for detail.

Context:
{numbered_context}

Question: {question}

Answer (with citations):
"""

This single template combines Context Highlighting, mandatory Citations, and an explicit Refusal clause. It is the recommended default for production RAG in 2026.
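
End to end, the template slots in behind the retriever. A sketch reusing the illustrative format_numbered_context helper from Technique 2; returning the refusal string on empty retrieval keeps the behavior consistent with rule 3.

def answer_question(question):
    docs = retriever.invoke(question)
    if not docs:  # retrieval failed: refuse instead of letting the model guess
        return "I do not have enough information to answer."
    return llm.invoke(
        production_prompt.format(
            numbered_context=format_numbered_context(docs),
            question=question,
        )
    ).content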

Evaluating RAG outputs in 2026

Lexical metrics (BLEU, ROUGE-L) and embedding similarity are useful as cheap signals but are not credible primary metrics for RAG hallucination. They reward verbatim copying and miss confident fabrications that paraphrase well.

The 2026 standard is LLM-judge evaluators run on every production trace.

Context Adherence with Future AGI

Context Adherence scores whether the answer stays within the retrieved context.

from fi.evals import evaluate

result = evaluate(
    "context_adherence",
    output="The Meeseeks Box requires periodic maintenance to recharge its quantum foam reservoir, which runs dry if overused.",
    context="The Meeseeks Box requires periodic maintenance to recharge its quantum foam reservoir, which can run dry if overused.",
)

print(result.score)    # 0.0 to 1.0
print(result.passed)   # True / False
print(result.reason)   # Explanation string

Groundedness for retrieved-evidence anchoring

Groundedness asks whether each claim in the answer is supported by the retrieved evidence.

# `answer` and `retrieved_context` come from your RAG chain: the generated
# response and the concatenated retrieved chunks.
result = evaluate(
    "groundedness",
    output=answer,
    context=retrieved_context,
)

if not result.passed:
    print("Ungrounded claims:", result.reason)

Tracing the full RAG pipeline

For production, pair scoring with span-level tracing through traceAI, so each retrieval call and each chunk is visible alongside the final answer.

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_langchain import LangChainInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="meeseeks-rag",
)

LangChainInstrumentor().instrument(tracer_provider=trace_provider)

# All LangChain RetrievalQA calls now emit RETRIEVER, LLM, and CHAIN spans.

In the Future AGI dashboard you see the full nested trace: query → retriever → ranked chunks → LLM → answer → evaluator scores. When a hallucination is reported, the trace shows whether retrieval returned the wrong chunks, the prompt failed to constrain the model, or the answer drifted from the cited evidence.

A reference pattern: prompt + retrieval + eval + guardrail

A production-grade RAG loop in 2026 has four parts:

  1. Retriever (vector + reranker) with strict relevance threshold; return zero chunks rather than weak chunks.
  2. Prompt with Context Highlighting + Citations + Refusal clause.
  3. Evaluator gate (Context Adherence + Groundedness) scored before the answer is returned to the user.
  4. Guardrail layer (Future AGI Agent Command Center or NeMo Guardrails) that blocks or rewrites responses that fail the gate.

Future AGI’s ai-evaluation (Apache 2.0) and traceAI (Apache 2.0) are the open-source pieces that make steps 3 and 4 easier to add to an existing LangChain or LlamaIndex stack.
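
Steps 3 and 4 reduce to a small gate in front of the user. A minimal sketch, assuming the evaluate API shown above; a real guardrail layer would escalate or rewrite rather than substituting a blanket refusal.

from fi.evals import evaluate

def gated_answer(answer, retrieved_context):
    # Step 3: score the answer before it reaches the user.
    adherence = evaluate("context_adherence", output=answer, context=retrieved_context)
    grounded = evaluate("groundedness", output=answer, context=retrieved_context)
    # Step 4: block responses that fail either evaluator.
    if adherence.passed and grounded.passed:
        return answer
    return "I do not have enough information to answer."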

Common failure modes and what to look for in a trace

Failure pattern | Where to look in the trace
Confident wrong answer when retrieval was empty | RETRIEVER span returned 0 chunks; missing refusal clause
Confident wrong answer with retrieved chunks | LLM span shows entities not in any chunk; missing Context Adherence gate
Right facts, wrong combination (multi-hop fail) | LLM span shows a skipped step; missing Step-by-Step prompt
Stale answer despite fresh index | RETRIEVER span shows old chunk ids; reindex
Answer drifts toward LLM’s pretraining | Prompt missing “ONLY use context” clause
Hallucinated citation | Cited chunk id not in numbered context list; rerun with mandatory citation check

The short version of 2026 RAG prompting: prompt the model to use only the retrieved context, cite every claim, refuse when retrieval fails, and score every response with Context Adherence and Groundedness in production. Skip any of those four and you ship a system that confidently makes things up.

Frequently asked questions

What is RAG and why does it reduce hallucination?
Retrieval-Augmented Generation is a pattern where a retriever pulls relevant documents from a knowledge base at inference time, and the LLM is prompted to answer using that retrieved context. Compared with a pure LLM, RAG reduces hallucination because the model is given the facts it needs in the prompt, rather than relying only on what was memorized during pretraining. RAG also keeps the system current with information that did not exist at training time. The reduction in hallucination is real but not automatic; it depends on the quality of the retrieval, the chunking, and especially the prompt that tells the model how to use the retrieved context.

Which RAG prompting technique reduces hallucination the most?
On the kinds of factual question-answering tasks most teams run, Context Highlighting with an explicit refusal clause is the single most impactful change. Adding mandatory citations on top of that catches a further class of subtle hallucinations because the model is forced to point at the chunk it is drawing from. Step-by-step reasoning helps on multi-hop questions but adds latency and cost. The best combination for production is Context Highlighting plus mandatory citations plus a refusal clause, scored end-to-end with a Context Adherence evaluator.

What is the difference between Context Adherence, Faithfulness, and Groundedness?
These three metrics overlap but each emphasizes a different dimension. Context Adherence measures whether the answer stays inside the provided context, penalizing claims the model invents on its own. Faithfulness measures whether each claim in the answer is supported by the source material, often by decomposing the answer into atomic facts and checking each one. Groundedness is similar to faithfulness but explicitly asks whether the answer is anchored in the retrieved evidence. Future AGI ships all three as separate evaluators in the fi.evals library so you can score the same RAG output along all three axes.

How do I evaluate RAG outputs in production?
Combine span-level tracing with metric-level scoring. Use traceAI to capture each retrieval call, each chunk, and the final LLM output as a structured span. Run Context Adherence, Faithfulness, and Groundedness evaluators on every trace, with thresholds that block or escalate failed responses. Track the failure rate over time, and replay every failed trace through the same evaluators after each prompt or retriever change. The minimum production setup is traceAI + ai-evaluation + a dashboard that surfaces drift.

Why does my RAG system still hallucinate after I added retrieval?
The most common causes are chunking that splits the relevant fact across two chunks (so the retrieved context is incomplete), a retriever that returns superficially similar but irrelevant chunks, a prompt that lets the model fill gaps from its own knowledge, and no refusal path when retrieval returns nothing. Each of these failure modes shows up clearly in a trace: an empty chunk list, a low-relevance reranker score, or an answer that contains entities not present in the retrieved context. Fix retrieval first, then prompting, then add evaluator-based guardrails.

Should I use BLEU and ROUGE to evaluate RAG?
Lexical overlap metrics like BLEU and ROUGE-L are weak for RAG. They reward responses that copy the retrieved chunk verbatim and punish equally correct but paraphrased answers. They also cannot detect that a fluent answer is factually wrong. In 2026 the standard for RAG evaluation is LLM-judge metrics like Context Adherence and Groundedness, supplemented with embedding similarity for relevance and human review for the long tail. BLEU and ROUGE are still useful as cheap signals for regression testing but not as primary metrics.

How does refusal differ from hallucination?
A refusal is the model saying it does not have enough information to answer. A hallucination is the model making something up to fill the gap. The right design treats refusal as a feature, not a failure. A RAG system that confidently refuses 5 percent of the time is much more trustworthy than one that confidently fabricates 5 percent of the time. Build refusal into the prompt explicitly: 'If the retrieved context does not contain the answer, say so and stop.' Then score refusal rate as a separate metric alongside hallucination rate.

What changed in RAG best practices between 2025 and 2026?
Three things. Reasoning models like GPT-5, Claude Opus 4.7, and Gemini 3 Pro made step-by-step prompting cheaper because the reasoning is amortized in the model's internal chain of thought. Long-context windows up to 1M tokens reduced pressure on aggressive chunking, although chunking still helps retrieval quality. And the eval bar moved from lexical metrics to LLM-judge Context Adherence and Groundedness scoring on every production trace, with span-level tracing through traceAI as the default observability layer.