RAG Prompting to Reduce Hallucination: 6 Techniques and How to Measure Them
Six RAG prompting patterns that reduce hallucination, with example prompts, retrieval grounding, and Context Adherence + Groundedness eval code.
TL;DR
| Technique | When to use | Effect on hallucination |
|---|---|---|
| Context Highlighting | Default for every RAG prompt | Large reduction |
| Citations Required | Compliance, medical, legal, finance | Large, makes failures visible |
| Step-by-Step Reasoning | Multi-hop questions, comparisons | Medium, adds latency |
| Fact Verification Loop | High-stakes outputs, second pass on retrieved evidence | Medium-large |
| Role-Based Prompting | Domain-specific phrasing | Small on its own |
| Refusal When Empty | Always (paired with Context Highlighting) | Eliminates fabricated answers when retrieval fails |
The single best prompt change is Context Highlighting plus a refusal clause. Pair it with mandatory citations, score every production response with Context Adherence and Groundedness, and treat refusals as a feature, not a failure.
The base setup we will prompt against
We use Rick and Morty’s Meeseeks Box for examples because the queries are made-up enough that a non-RAG model has no choice but to guess. Any hallucination is therefore visible.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.schema import Document
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Deterministic model and embeddings for every example in this post.
llm = ChatOpenAI(model="gpt-5", temperature=0)
embedding_model = OpenAIEmbeddings()

# Fictional "documentation" chunks: a non-RAG model cannot know these facts.
documents = [
    Document(page_content="The Meeseeks Box begins its creation process by harvesting proto-Meeseeks from a quantum foam field."),
    Document(page_content="The harvested proto-Meeseeks are condensed into small energy packets and stored in temporal stasis."),
    Document(page_content="A neural imprinting laser programs each Meeseeks with a single objective, ensuring they are task-oriented."),
    Document(page_content="The Meeseeks Box has an internal logic circuit that randomly assigns objectives, such as opening jars or solving math problems."),
    Document(page_content="When the button is pressed, the Box releases a fully-formed Meeseeks, temporarily stabilized by an anti-decay field."),
    Document(page_content="After completing their task, Meeseeks are designed to disintegrate into harmless particles of joy-energy."),
    Document(page_content="The Meeseeks Box requires periodic maintenance to recharge its quantum foam reservoir, which can run dry if overused."),
]

# Index the chunks and retrieve the top 4 per query.
vectorstore = FAISS.from_documents(documents, embedding_model)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

query = "Why does the Meeseeks Box require maintenance?"
Without RAG, the model would speculate. With retrieval but a bad prompt, it can still speculate beyond the retrieved chunks. The six techniques below are the prompt-side levers that close that gap.
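For a concrete baseline, here is a stock RetrievalQA chain with no prompt constraints, a minimal sketch of the setup we are about to harden:

# Baseline: default RetrievalQA with no grounding constraints in the prompt.
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
print(qa_chain.invoke({"query": query})["result"])  # may drift beyond the chunks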
Technique 1: Context Highlighting
The single most impactful change. Tell the model to answer using only the provided context, and to refuse when the context does not contain the answer.
context_highlight_prompt = """
You are answering a question for an internal team.
Use ONLY the context below to answer the question. Do not use prior knowledge.
If the context does not contain the answer, say "I cannot answer from the
provided context" and stop.
Context:
{context}
Question: {question}
Answer:
"""
This prompt removes two failure modes at once: drift into pretraining knowledge, and hallucination when retrieval misses. The reduction in hallucination rate depends heavily on retriever quality, model, and dataset; in practice, adding an explicit “only use this context” instruction with a refusal clause commonly helps and should be validated with evals against your own traffic.
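A minimal sketch of wiring it into the RetrievalQA chain from the setup; the default "stuff" chain accepts a custom prompt with context and question variables:

from langchain.prompts import PromptTemplate

grounded_prompt = PromptTemplate(
    template=context_highlight_prompt,
    input_variables=["context", "question"],
)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",  # concatenates all retrieved chunks into {context}
    chain_type_kwargs={"prompt": grounded_prompt},
)
print(qa_chain.invoke({"query": query})["result"])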
Technique 2: Citations Required
Force the model to cite each claim. The model can no longer make a confident assertion without pointing at a source.
citation_prompt = """
Answer the question using ONLY the numbered context below.
Rules:
- Every sentence in your answer must end with a citation like [1] or [2, 3].
- A claim with no supporting context must not appear in the answer.
- If no context supports an answer, say "Insufficient context" and stop.
Context:
{numbered_context}
Question: {question}
Answer (with citations):
"""
To use this prompt, format the retrieved chunks as [1] chunk text\n[2] chunk text\n.... Citations transform implicit failures into visible ones: a downstream check can confirm that each cited chunk id actually exists and contains the cited claim.
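A sketch of both halves; number_chunks and cited_ids_exist are hypothetical helpers, and the regex confirms only that cited ids exist (claim-level support is what the Groundedness evaluator below scores):

import re

def number_chunks(chunks):
    # Render retrieved chunks as "[1] text\n[2] text\n..." for the prompt.
    return "\n".join(f"[{i}] {c.page_content}" for i, c in enumerate(chunks, start=1))

def cited_ids_exist(answer, num_chunks):
    # Pull every citation like [1] or [2, 3] and confirm the ids are real.
    cited = {int(n) for group in re.findall(r"\[([\d,\s]+)\]", answer)
             for n in group.split(",")}
    return bool(cited) and cited.issubset(range(1, num_chunks + 1))

chunks = retriever.invoke(query)
numbered_context = number_chunks(chunks)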
Technique 3: Step-by-Step Reasoning
For multi-hop questions (“what happens when X interacts with Y, given the rules in Z”), explicit step-by-step reasoning over the retrieved evidence improves answers.
step_by_step_prompt = """
Use the context to answer. Reason in numbered steps before giving the final answer.
Steps:
1. List the facts from the context that are relevant to the question.
2. Identify what the question is actually asking.
3. Combine the relevant facts to derive the answer.
4. State the final answer in one or two sentences.
If at any step you find the context lacks a required fact, stop and say
"Insufficient context."
Context:
{context}
Question: {question}
"""
Reasoning models like GPT-5 and Claude Opus 4.7 already perform chain-of-thought reasoning internally, so the marginal lift on simple questions is small. The structural value remains: the numbered steps make the failure mode visible in the trace.
Technique 4: Fact Verification Loop
After the model produces a draft answer, prompt it again to verify each claim against the retrieved context, either inline as a two-step chain or as a second LLM call that receives the original context plus the draft.
verify_prompt = """
You wrote the following draft answer to a user question.
Question: {question}
Retrieved context:
{context}
Draft answer:
{draft_answer}
Now check the draft. For each claim in the draft:
1. Quote the supporting passage from the context.
2. If no passage supports the claim, mark it UNSUPPORTED.
Then produce a revised answer that contains only the supported claims, or
"Insufficient context" if no supported claim remains.
"""
This is more expensive (two LLM calls) but can improve accuracy on long, multi-claim answers. To contain the cost, run it conditionally: only when the Context Adherence score of the first draft falls below a threshold, as in the sketch below.
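A minimal two-call sketch, assuming retrieved_context holds the concatenated chunks for the query; the 0.8 threshold is a placeholder to tune, and scoring uses the fi.evals Context Adherence evaluator covered in the evaluation section below:

from fi.evals import evaluate

# Draft pass: answer with the grounded prompt from Technique 1.
draft = llm.invoke(context_highlight_prompt.format(
    context=retrieved_context, question=query)).content

# Verify pass, gated on the draft's Context Adherence score.
adherence = evaluate("context_adherence", output=draft, context=retrieved_context)
if adherence.score < 0.8:  # placeholder threshold; tune against your own traffic
    answer = llm.invoke(verify_prompt.format(
        question=query, context=retrieved_context, draft_answer=draft)).content
else:
    answer = draft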
Technique 5: Role-Based Prompting
A small but consistent improvement on domain-specific queries, especially when paired with an explicit honesty clause.
role_prompt = """
You are a careful technical writer summarizing internal product documentation.
You never speculate. You only state what is directly supported by the docs.
Documentation:
{context}
User question: {question}
Write a concise, accurate answer based only on the documentation above. If the
documentation does not contain the answer, say so explicitly.
"""
Role framing on its own is the weakest of the techniques. Stack it on top of Context Highlighting plus citations; do not rely on it alone.
Technique 6: Refusal When Empty
Always include a refusal clause; it is the most underrated lever in RAG. Most production hallucinations occur when retrieval returns nothing useful and the model fills the void with plausible-sounding fiction.
refusal_clause = """
Important: if the retrieved context contains no information directly answering
the question, respond with "I do not have enough information to answer."
Do not guess. Do not use general knowledge.
"""
Treat refusals as a feature: track refusal rate as a first-class production metric, not a bug. A RAG system that confidently refuses 5 percent of the time is more trustworthy than one that confidently fabricates 5 percent of the time.
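Tracking it can be as simple as counting exact refusals over logged answers; a minimal sketch, assuming responses is a list of answer strings pulled from your traces:

REFUSAL = "I do not have enough information to answer."

def refusal_rate(responses):
    # Share of logged answers that are exact refusals.
    return sum(r.strip() == REFUSAL for r in responses) / max(len(responses), 1)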
Stitching it together: a production-grade prompt
production_prompt = """
You are a technical assistant answering questions for engineers using internal
documentation.
RULES:
1. Use ONLY the numbered context below. Do not use prior knowledge.
2. Every sentence in your answer must end with a citation like [1] or [2, 3].
3. If the context does not contain the answer, respond with exactly:
"I do not have enough information to answer."
4. Do not summarize the question. Do not preface the answer with hedges.
5. Be concise: one to four sentences unless the question explicitly asks for detail.
Context:
{numbered_context}
Question: {question}
Answer (with citations):
"""
This single template combines Context Highlighting, mandatory Citations, and an explicit Refusal clause. It is the recommended default for production RAG in 2026.
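Assembled end to end, a minimal sketch; answer_query is a hypothetical helper and number_chunks is the formatter from Technique 2:

def answer_query(question):
    chunks = retriever.invoke(question)
    if not chunks:  # retrieval failed: refuse rather than guess
        return "I do not have enough information to answer."
    return llm.invoke(production_prompt.format(
        numbered_context=number_chunks(chunks),
        question=question,
    )).content

print(answer_query(query))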
Evaluating RAG outputs in 2026
Lexical metrics (BLEU, ROUGE-L) and embedding similarity are useful as cheap signals but are not credible primary metrics for RAG hallucination. They reward verbatim copying and miss confident fabrications that paraphrase well.
The 2026 standard is LLM-judge evaluators run on every production trace.
Context Adherence with Future AGI
Context Adherence scores whether the answer stays within the retrieved context.
from fi.evals import evaluate

result = evaluate(
    "context_adherence",
    output="The Meeseeks Box requires periodic maintenance to recharge its quantum foam reservoir, which runs dry if overused.",
    context="The Meeseeks Box requires periodic maintenance to recharge its quantum foam reservoir, which can run dry if overused.",
)
print(result.score)   # 0.0 to 1.0
print(result.passed)  # True / False
print(result.reason)  # Explanation string
Groundedness for retrieved-evidence anchoring
Groundedness asks whether each claim in the answer is supported by the retrieved evidence.
# answer and retrieved_context come from the RAG pipeline above.
result = evaluate(
    "groundedness",
    output=answer,
    context=retrieved_context,
)
if not result.passed:
    print("Ungrounded claims:", result.reason)
Tracing the full RAG pipeline
For production, pair scoring with span-level tracing through traceAI, so each retrieval call and each chunk is visible alongside the final answer.
from fi_instrumentation import register, FITracer
from fi_instrumentation.fi_types import ProjectType
from traceai_langchain import LangChainInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="meeseeks-rag",
)
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
# All LangChain RetrievalQA calls now emit RETRIEVER, LLM, and CHAIN spans.
In the Future AGI dashboard you see the full nested trace: query → retriever → ranked chunks → LLM → answer → evaluator scores. When a hallucination is reported, the trace shows whether retrieval returned the wrong chunks, the prompt failed to constrain the model, or the answer drifted from the cited evidence.
A reference pattern: prompt + retrieval + eval + guardrail
A production-grade RAG loop in 2026 has four parts:
- Retriever (vector + reranker) with strict relevance threshold; return zero chunks rather than weak chunks.
- Prompt with Context Highlighting + Citations + Refusal clause.
- Evaluator gate (Context Adherence + Groundedness) scored before the answer is returned to the user, sketched below.
- Guardrail layer (Future AGI Agent Command Center or NeMo Guardrails) that blocks or rewrites responses that fail the gate.
Future AGI’s ai-evaluation (Apache 2.0) and traceAI (Apache 2.0) are the open-source pieces that make steps 3 and 4 easier to add to an existing LangChain or LlamaIndex stack.
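A sketch of the gate (step 3) plus the simplest possible version of step 4, blocking on failure; gated_answer is a hypothetical helper, and a real deployment would route blocked responses through Agent Command Center or NeMo Guardrails instead of refusing outright:

from fi.evals import evaluate

REFUSAL = "I do not have enough information to answer."

def gated_answer(question):
    chunks = retriever.invoke(question)
    if not chunks:  # strict retriever: zero chunks -> refuse, never guess
        return REFUSAL
    context = number_chunks(chunks)
    answer = llm.invoke(production_prompt.format(
        numbered_context=context, question=question)).content
    # Evaluator gate: both scores must pass before the answer ships.
    for metric in ("context_adherence", "groundedness"):
        if not evaluate(metric, output=answer, context=context).passed:
            return REFUSAL  # simplest guardrail: block on failure
    return answer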
Common failure modes and what to look for in a trace
| Failure pattern | Where to look in the trace |
|---|---|
| Confident wrong answer when retrieval was empty | RETRIEVER span returned 0 chunks; missing refusal clause |
| Confident wrong answer with retrieved chunks | LLM span shows entities not in any chunk; missing Context Adherence gate |
| Right facts, wrong combination (multi-hop fail) | LLM span shows skipped step; missing Step-by-Step prompt |
| Stale answer despite fresh index | RETRIEVER span shows old chunk ids; reindex |
| Answer drifts toward LLM’s pretraining | Prompt missing “ONLY use context” clause |
| Hallucinated citation | Cited chunk id not in numbered context list; rerun with mandatory citation check |
Recommended reading
- Advanced chunking techniques for RAG covers chunk size, overlap, and structural chunking.
- RAG evaluation metrics details Context Adherence, Groundedness, and Faithfulness scoring.
- Agentic RAG systems walks through retrieval-then-tool-use patterns.
- traceAI + OpenTelemetry for LLM tracing covers production tracing for RAG pipelines.
- Taming the hallucination beast: strategies for reliable LLMs for non-RAG hallucination mitigation patterns.
The short version of 2026 RAG prompting: prompt the model to use only the retrieved context, cite every claim, refuse when retrieval fails, and score every response with Context Adherence and Groundedness in production. Skip any of those four and you ship a system that confidently makes things up.
Frequently asked questions
What is RAG and why does it reduce hallucination?
Which RAG prompting technique reduces hallucination the most?
What is the difference between Context Adherence, Faithfulness, and Groundedness?
How do I evaluate RAG outputs in production?
Why does my RAG system still hallucinate after I added retrieval?
Should I use BLEU and ROUGE to evaluate RAG?
How does refusal differ from hallucination?
What changed in RAG best practices between 2025 and 2026?