What Is Noise Sensitivity (RAG Eval)?

A RAG evaluation metric that tests whether irrelevant retrieved context changes, corrupts, or distracts the model's answer.

Noise sensitivity is a RAG evaluation metric that measures whether a model can answer correctly when irrelevant or misleading context is mixed into retrieved evidence. It applies to both the eval pipeline and production traces, and it focuses on the generation step under retrieval noise. FutureAGI surfaces it through eval:NoiseSensitivity, an evaluator that catches distractor-driven hallucination, over-attention to stale chunks, and brittle prompts that look fine when every retrieved passage is clean. In our 2026 evals, GPT-5.x and Claude Opus 4.7 both lose 5-12 points of accuracy when two plausible distractor chunks are added to clean policy context.

Why Noise Sensitivity Matters in Production LLM and Agent Systems

Noise sensitivity matters because many RAG failures are not caused by empty retrieval. They come from mixed-quality retrieval: one correct chunk, three weakly related chunks, and one stale policy page that looks authoritative. The visible failure is a confident answer that cites context but chooses the wrong context. A support assistant applies an old refund rule. A legal-search agent summarizes a distractor clause. A research copilot answers the easy adjacent question instead of the user’s specific question.

Developers feel this first when offline evals pass on clean fixtures but production traces fail on long-tail queries. SREs see repeat calls, higher token spend, and p99 latency growth as teams increase top-k to compensate. Product teams see thumbs-down clusters around broad queries. Compliance teams care because noisy context can pull a model toward outdated policy, private notes, or jurisdiction-specific exceptions.

The log pattern is specific: high retrieval recall, acceptable ContextRelevance for at least one chunk, but low answer quality when irrelevant chunks are present. In 2026 multi-step pipelines, this compounds. An agent may retrieve, plan, call a tool, retrieve again, and synthesize. Noise at each step becomes path dependence: the model follows one bad chunk into a bad tool call, then treats the tool result as confirmation. Noise sensitivity gives that failure mode a measurable boundary.

How FutureAGI Handles Noise Sensitivity

FutureAGI’s approach is to treat noise sensitivity as a regression eval for RAG behavior under distractor pressure. The anchor eval:NoiseSensitivity maps to fi.evals.NoiseSensitivity on /platform/evaluate, listed in the inventory as a local metric for RAG system resilience to irrelevant context. The evaluator belongs next to ContextRelevance, ContextPrecision, ContextRecall, and Groundedness: relevance and ranking explain the retrieval side; NoiseSensitivity explains whether the model ignored the junk it was given.

A real workflow: a documentation assistant runs on traceAI-langchain. Each answer trace stores the query, retrieval.documents in final ranked order, llm.output, retriever.version, and prompt.version. The team builds an eval dataset with paired cases: the same user question with clean context, then the same question with two plausible distractor chunks added. NoiseSensitivity runs over those pairs before a retriever or prompt release. If the answer changes from “audit logs export from Admin Settings” to “billing reports cannot be exported,” the eval flags the run because the distractor won.
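The paired cases in that workflow can be sketched in plain Python. This is a minimal illustration, not FutureAGI's dataset format: `EvalCase`, `make_pair`, and `insert_at` are hypothetical names, and the second distractor string is invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvalCase:
    """One eval row (hypothetical shape, not a FutureAGI type)."""
    question: str
    expected_answer: str
    context: List[str]
    variant: str  # "clean" or "noisy"


def make_pair(
    question: str,
    expected_answer: str,
    clean_context: List[str],
    distractors: List[str],
    insert_at: int = 0,
) -> Tuple[EvalCase, EvalCase]:
    """Return (clean, noisy) cases that differ only in the added distractors."""
    noisy_context = list(clean_context)
    # Splice distractors in at a chosen rank so their position is controlled.
    noisy_context[insert_at:insert_at] = distractors
    clean = EvalCase(question, expected_answer, list(clean_context), "clean")
    noisy = EvalCase(question, expected_answer, noisy_context, "noisy")
    return clean, noisy


clean_case, noisy_case = make_pair(
    question="Can admins export audit logs?",
    expected_answer="Admins can export audit logs from Admin Settings.",
    clean_context=["Admin audit logs are exportable from Admin Settings."],
    distractors=[
        "Free plans cannot export billing reports.",
        "Audit log retention is 30 days on legacy plans.",  # illustrative only
    ],
    insert_at=1,
)
```

Holding the question and expected answer fixed across both variants means any change in the answer can be attributed to the added chunks.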

The engineer does not stop at the aggregate score. They open the failing trace, check whether the distractor was ranked too high, then choose the fix: tighten retrieval filters, add a reranker, lower top-k, rewrite the answer prompt to cite only directly relevant chunks, or add a regression gate. Unlike a one-off Ragas noise-sensitivity benchmark, FutureAGI keeps the score attached to trace cohorts, versions, and alert thresholds, so a release can fail only on the retrieval slice it actually broke.

How to Measure or Detect Noise Sensitivity

Measure noise sensitivity by comparing answer quality with and without irrelevant context while keeping the user question and expected answer stable. Useful signals include:

  • fi.evals.NoiseSensitivity - returns a RAG eval signal for whether distractor context changed or corrupted the answer.
  • Clean-versus-noisy pair delta - score difference between the clean fixture and the same fixture with distractor chunks.
  • retrieval.documents and llm.output - trace fields needed to inspect what the model saw and what it produced.
  • Eval-fail-rate by cohort - dashboard signal split by retriever version, prompt version, tenant, and query type.
  • User-feedback proxy - thumbs-down rate or escalation rate on broad queries with many near-matching documents.
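As a rough sketch of the eval-fail-rate-by-cohort signal, the snippet below groups per-example results by cohort keys and computes the share of failures in each group. The record fields (`retriever_version`, `query_type`, `passed`) and the function name are assumptions made for illustration; they are not a FutureAGI export schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def fail_rate_by_cohort(
    results: Iterable[Mapping[str, object]],
    keys: Tuple[str, ...] = ("retriever_version", "query_type"),
) -> Dict[tuple, float]:
    """Group eval results by the given keys and return the fail rate per group."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for row in results:
        cohort = tuple(row[k] for k in keys)
        totals[cohort] += 1
        if not row["passed"]:
            fails[cohort] += 1
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}


# Hypothetical per-example results from a noisy-context eval run.
results = [
    {"retriever_version": "v1", "query_type": "policy", "passed": True},
    {"retriever_version": "v2", "query_type": "policy", "passed": False},
    {"retriever_version": "v2", "query_type": "policy", "passed": True},
    {"retriever_version": "v2", "query_type": "navigational", "passed": True},
]
rates = fail_rate_by_cohort(results)
```

Splitting by cohort keeps a regression in one slice (here, policy queries on retriever v2) from being averaged away by healthy slices.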

Minimal Python:

```python
from fi.evals import NoiseSensitivity

evaluator = NoiseSensitivity()

# The context mixes one relevant chunk with one distractor about billing reports.
result = evaluator.evaluate(
    input="Can admins export audit logs?",
    output="Admins can export audit logs from Admin Settings.",
    context=[
        "Admin audit logs are exportable.",
        "Free plans cannot export billing reports.",
    ],
)

# Score plus the evaluator's explanation for it.
print(result.score, result.reason)
```

| RAG metric | What it scores | Catches | FAGI evaluator |
| --- | --- | --- | --- |
| ContextRelevance | Retrieval usefulness | Off-topic chunks | ContextRelevance |
| ContextPrecision | Top-K ordering | Distractors above relevant | ContextPrecision |
| ContextRecall | Coverage | Missing evidence | ContextRecall |
| Noise sensitivity | Model resistance to junk | Distractor-driven hallucination | NoiseSensitivity |
| Faithfulness | Claim support | Unsupported sentences | Faithfulness |
| Groundedness | End-to-end anchoring | Drift from context | Groundedness |

For external calibration, RAGTruth (18K labeled response chunks, hallucinated-span granularity across QA, summarization, and data-to-text) shows frontier models hallucinate on 5-8% of answers with mixed retrieval. CRAG (Meta’s Comprehensive RAG benchmark, 4,409 QA pairs across five domains and eight question categories) explicitly mixes noisy and corrupted retrieval into its test set. Frontier accuracy drops 15-25 points between clean and noisy retrieval splits, a useful upper bound for what a paired-cohort NoiseSensitivity test should reveal.

Building distractor cohorts that actually test noise resistance

Random noise is not interesting. Inserting a paragraph about basketball into a refund-policy retrieval set tells you nothing useful, because any frontier 2026 model will obviously ignore it. The interesting distractors are near-misses: they share entities, dates, product names, or wording with the correct answer but say something different or apply to a different case.

A working distractor cohort has three categories. Adjacent-policy distractors: chunks from the same document family but the wrong section (refund policy versus cancellation policy). Outdated-version distractors: a previous version of the same policy that was once correct. Same-entity distractors: chunks about a related product, tenant, or jurisdiction that share keywords but apply different rules. We’ve found that GPT-5.x and Claude Opus 4.7 both fail most often on the outdated-version cohort: the model treats the older policy as authoritative because the wording is fluent and confident.
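One way to keep the three categories separable is to tag each distractor and build one noisy context per category, so a failure can be traced to a single kind of near-miss. The sketch below assumes that design; `DistractorKind`, `Distractor`, `contexts_by_kind`, and all sample strings are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class DistractorKind(Enum):
    ADJACENT_POLICY = "adjacent_policy"    # same document family, wrong section
    OUTDATED_VERSION = "outdated_version"  # earlier version of the same policy
    SAME_ENTITY = "same_entity"            # related product, tenant, or jurisdiction


@dataclass(frozen=True)
class Distractor:
    text: str
    kind: DistractorKind


def contexts_by_kind(
    clean_context: List[str], distractors: List[Distractor]
) -> Dict[DistractorKind, List[str]]:
    """Build one noisy context per category that has at least one distractor."""
    out: Dict[DistractorKind, List[str]] = {}
    for kind in DistractorKind:
        extra = [d.text for d in distractors if d.kind is kind]
        if extra:
            out[kind] = list(clean_context) + extra
    return out


# Illustrative chunks only; real cohorts would come from the document store.
clean = ["Refunds are available within 30 days of purchase."]
pool = [
    Distractor("Cancellations take effect at the end of the billing cycle.",
               DistractorKind.ADJACENT_POLICY),
    Distractor("Refunds are available within 14 days of purchase.",
               DistractorKind.OUTDATED_VERSION),
]
noisy_contexts = contexts_by_kind(clean, pool)
```

Each resulting context can then be scored against the same question, which makes it visible when, for example, only the outdated-version variant flips the answer.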

The release gate we recommend uses paired evaluation: same question, clean context vs. distractor-contaminated context, with NoiseSensitivity scoring the delta. A drop greater than 5 points blocks the release. Compared to a Ragas noise-sensitivity benchmark run once during model selection, the paired cohort runs on every release and catches prompts that work in clean conditions but break when retrieval gets messy.
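A minimal sketch of that gate follows. It assumes scores are on a 0-100 scale and that the gate compares mean clean and mean noisy scores; the text does not specify the aggregation, so both are assumptions, as are the function name `release_gate` and the sample scores.

```python
from typing import List, Tuple


def release_gate(
    pairs: List[Tuple[float, float]], max_drop: float = 5.0
) -> Tuple[bool, float]:
    """Return (blocked, drop) for paired (clean_score, noisy_score) results.

    Assumes scores on a 0-100 scale and a gate on the mean score drop.
    """
    if not pairs:
        raise ValueError("at least one clean/noisy pair is required")
    clean_mean = sum(clean for clean, _ in pairs) / len(pairs)
    noisy_mean = sum(noisy for _, noisy in pairs) / len(pairs)
    drop = clean_mean - noisy_mean
    # Block only when the drop is strictly greater than the threshold.
    return drop > max_drop, drop


# Hypothetical scores for three paired cases.
blocked, drop = release_gate([(90.0, 88.0), (95.0, 80.0), (85.0, 84.0)])
```

In this example the mean drops from 90 to 84, a 6-point delta, so the gate blocks the release even though two of the three pairs barely moved.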

Common Mistakes

Noise sensitivity is easy to misread because the retriever, prompt, and model all contribute to the result. Watch for these implementation mistakes:

  • Testing only clean context. Clean fixtures prove the model can answer; they do not prove it can ignore plausible distractors.
  • Treating recall as enough. High recall can coexist with high noise sensitivity when irrelevant chunks outrank or distract from relevant chunks.
  • Mixing unrelated distractors. Random noise is too easy; use near-miss documents that share entities, dates, or product names.
  • Blaming the retriever every time. If the relevant chunk is present and ranked high, the prompt or model may be over-attending to weak evidence.
  • Averaging every query together. Split policy, troubleshooting, navigational, and multi-hop queries; broad averages hide the slice with noisy retrieval.

Frequently Asked Questions

What is noise sensitivity in RAG evaluation?

Noise sensitivity measures whether a RAG answer stays correct when irrelevant or misleading retrieved context is present. It tests generation behavior under retrieval noise, not just retriever ranking.

How is noise sensitivity different from context relevance?

Context relevance scores whether retrieved chunks are useful for the query. Noise sensitivity checks whether the model can ignore chunks that are not useful.

How do you measure noise sensitivity?

FutureAGI measures it with `fi.evals.NoiseSensitivity` on clean and distractor-containing RAG examples. Teams usually track the failure rate by retriever version, prompt version, and query cohort.