What Is a RAGET Complex Question Hallucination Attack?
A RAG-evaluation test case that probes hallucination on multi-hop, compositional, or reasoning-heavy questions requiring synthesis across multiple retrieved documents.
What Is a RAGET Complex Question Hallucination Attack?
A RAGET complex-question hallucination attack is a structured test case in the RAGET evaluation taxonomy (popularised by Giskard’s RAG testing tool) that probes a retrieval-augmented generation system with multi-hop, compositional, or reasoning-heavy questions. Where a simple lookup question asks “what is the refund period?”, a complex question asks “given the refund period for enterprise customers and the SLA on processing, how long will a refund take if I file on Friday?” The attack is constructed so that no single retrieved chunk contains the answer — the model has to retrieve multiple chunks and synthesise across them. It is a release-gate eval pattern, not a runtime exploit, and it surfaces the failure mode where RAG systems pass simple tests and fabricate confident answers on harder ones.
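A minimal sketch of what one such test case can look like, using the return-window scenario discussed in the next section. The field names here are illustrative assumptions, not a fixed RAGET or Giskard schema; the point is that the reference answer needs both chunks plus a date calculation.
# Illustrative RAGET-style complex-question test case (field names are
# assumptions for this sketch, not a fixed schema).
complex_case = {
    "question": (
        "I ordered last week and shipping takes 3 days. "
        "When do I need to file to stay within the return window?"
    ),
    "reference_context": [
        "Returns are accepted within 30 days of delivery.",        # policy chunk
        "Standard shipping takes 3 business days from dispatch.",  # shipping chunk
    ],
    "reference_answer": (
        "Within 30 days of the delivery date, i.e. roughly 33 days "
        "after the order was placed."
    ),
    "question_type": "complex",  # cohort tag used later to slice eval results
}
No single chunk answers the question; a response graded correct has to combine both chunks and do the arithmetic.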
Why It Matters in Production LLM and Agent Systems
RAG systems pass single-fact benchmarks easily and fail in production on the question shapes users actually ask. A support bot answers “what is your return policy?” perfectly because the answer is one paragraph in one document. The same bot fabricates an answer to “given that I ordered last week and shipping takes 3 days, when do I need to file by to be within the return window?” because the answer requires combining the return-policy chunk, the shipping-time chunk, and a date calculation. The retrieval may pull the right chunks; the model synthesises them wrong; the user gets a confident wrong answer.
The pain falls first on RAG team owners. A bot that scores 0.91 on the simple-question eval scores 0.62 on the complex-question eval, and production traffic skews toward complex questions because users answer the simple ones themselves via search. Compliance teams care because complex questions in regulated industries (medical, financial, legal) are exactly the ones where confabulation is most dangerous. Product teams see CSAT drop when the bot starts giving subtly wrong answers: wrong dates, wrong amounts, wrong sequences of steps.
In 2026, with reasoning models and long-context LLMs ostensibly able to handle multi-hop synthesis, the attack shape becomes more important, not less. Better synthesis capability means more confident-looking confabulations.
How FutureAGI Handles RAGET Complex-Question Attacks
FutureAGI’s approach is to bake RAGET-style complex-question test cases into the standard regression eval pipeline and score them with multi-hop-aware evaluators. Teams synthesise complex-question test cases from their corpus using synthetic-data-generation workflows (or hand-curate a golden set), load them into a Dataset, and attach Faithfulness, MultiHopReasoning, HallucinationScore, ChunkAttribution, and ContextRelevance via Dataset.add_evaluation(). The complex-question cohort is run on every release; regression deltas gate the deploy.
Concretely: a financial-services RAG team running on traceAI-langchain builds a 200-question RAGET complex-question dataset covering compliance-relevant compositional questions. Pre-deployment, every model swap, prompt change, or chunking-strategy update runs against the dataset. MultiHopReasoning returns 0.78 on the baseline and 0.61 on a candidate prompt; Faithfulness stays steady but HallucinationScore doubles. The regression blocks the release. Production traffic on the complex-question route runs through a post-guardrail that fires Faithfulness on every response and routes low-faithfulness answers to a fallback message rather than the user. FutureAGI surfaces both the offline regression and the runtime guard signal.
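The release gate itself can be plain CI logic over the cohort scores. A minimal sketch, assuming aggregate scores like the ones above; the thresholds and the non-quoted numbers are illustrative, not FutureAGI defaults.
# Minimal regression-gate sketch for the complex-question cohort. Scores would
# come from running the evaluators over the 200-question dataset; the 0.05 and
# 0.02 tolerances are illustrative thresholds, not FutureAGI defaults.
import sys

baseline  = {"MultiHopReasoning": 0.78, "Faithfulness": 0.84, "HallucinationScore": 0.06}
candidate = {"MultiHopReasoning": 0.61, "Faithfulness": 0.83, "HallucinationScore": 0.12}

MAX_DROP = 0.05                 # allowed degradation on 0-1 "higher is better" metrics
MAX_HALLUCINATION_RISE = 0.02   # allowed increase on the hallucination signal

failures = []
for metric in ("MultiHopReasoning", "Faithfulness"):
    if baseline[metric] - candidate[metric] > MAX_DROP:
        failures.append(f"{metric} regressed: {baseline[metric]:.2f} -> {candidate[metric]:.2f}")
if candidate["HallucinationScore"] - baseline["HallucinationScore"] > MAX_HALLUCINATION_RISE:
    failures.append("HallucinationScore rose: "
                    f"{baseline['HallucinationScore']:.2f} -> {candidate['HallucinationScore']:.2f}")

if failures:
    print("Release blocked:\n  " + "\n  ".join(failures))
    sys.exit(1)  # fail the CI job so the deploy is gated
print("Complex-question cohort within tolerance; release may proceed.")
In the scenario above, the MultiHopReasoning drop from 0.78 to 0.61 trips the gate even though Faithfulness barely moves.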
How to Measure or Detect It
Complex-question hallucination needs more than single-shot Faithfulness:
- MultiHopReasoning: returns a 0–1 score plus a reason for whether the response correctly synthesises across multiple retrieved chunks.
- Faithfulness: 0–1 score for response groundedness in the retrieved context; fires on confabulated synthesis.
- HallucinationScore: composite hallucination-detection signal; useful when reference answers exist.
- ChunkAttribution: surfaces which retrieved chunks the response claimed to use vs. what was actually retrieved.
- ContextRelevance and ContextRecall: catch the upstream cause (retrieval pulled only partial chunks).
- Per-question-shape eval-fail-rate: slice eval-fail-rate-by-cohort by question complexity to see where complex-question regressions live.
from fi.evals import Faithfulness, MultiHopReasoning, HallucinationScore

# Evaluators used on the complex-question cohort.
faith = Faithfulness()
multi = MultiHopReasoning()
hallu = HallucinationScore()

# complex_question, generated_answer, and retrieved_chunks come from the RAG
# pipeline under test: the user question, the model's response, and the chunks
# the retriever returned for that question.
result = multi.evaluate(
    input=complex_question,
    output=generated_answer,
    context=retrieved_chunks,
)
print(result.score, result.reason)  # 0-1 synthesis score plus the evaluator's reasoning

# faith and hallu cover groundedness and the composite hallucination signal;
# they are run over the same inputs, assuming the same call pattern as above.
Common Mistakes
- Building the eval dataset only from simple lookup questions. Production traffic skews complex; eval has to mirror it.
- Using a single Faithfulness score. It catches contradiction but misses missing-link hallucination; pair it with MultiHopReasoning.
- Skipping the regression gate on prompt changes. Prompt rewrites are the most common source of complex-question regressions.
- Treating the long-context window as a fix. Bigger context can make retrieval lazier and synthesis less precise.
- No runtime fallback. Even a perfect offline pass-rate leaves a tail; route low-Faithfulness responses to a refusal in production.
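A minimal sketch of that runtime fallback, assuming the faith evaluator from the earlier snippet and that its evaluate call mirrors the one shown there; the 0.7 threshold and the fallback wording are illustrative choices, not FutureAGI defaults.
# Post-response guardrail sketch: low-faithfulness answers never reach the user.
FALLBACK = ("I couldn't verify that answer against our documentation. "
            "Please contact support for the exact figures.")

def guarded_answer(question: str, answer: str, chunks: list[str]) -> str:
    # Score groundedness of the generated answer in the retrieved chunks
    # (call signature assumed to mirror the multi.evaluate example above).
    result = faith.evaluate(input=question, output=answer, context=chunks)
    if result.score < 0.7:  # illustrative threshold
        return FALLBACK     # route to the refusal instead of the user
    return answer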
Frequently Asked Questions
What is a RAGET complex-question hallucination attack?
It is a RAG-evaluation test case from the RAGET taxonomy that probes whether a RAG system hallucinates when answering multi-hop, compositional, or reasoning-heavy questions that require synthesis across multiple retrieved documents.
How is it different from a simple-question attack?
Simple-question attacks test single-fact lookup. Complex-question attacks chain multiple sub-questions into one prompt, forcing multi-document synthesis where retrieval gaps and reasoning errors compound into hallucination.
How does FutureAGI defend against RAGET complex-question attacks?
FutureAGI runs Faithfulness, MultiHopReasoning, and HallucinationScore on every RAG response, and includes RAGET-style test cases in the regression eval that gates each release.