What is noise sensitivity in RAG evaluation?

Noise sensitivity measures whether a RAG answer stays correct when irrelevant or misleading retrieved context is present. It tests generation behavior under retrieval noise, not just retriever ranking.

How is noise sensitivity different from context relevance?

Context relevance scores whether retrieved chunks are useful for the query. Noise sensitivity checks whether the model can ignore chunks that are not useful.

How do you measure noise sensitivity?

FutureAGI measures it with `fi.evals.NoiseSensitivity` on clean and distractor-containing RAG examples. Teams usually track the failure rate by retriever version, prompt version, and query cohort.

What Is Noise Sensitivity? FutureAGI Guide (2026)

What Is Noise Sensitivity (RAG Eval)?

Noise sensitivity is a RAG evaluation metric that measures whether a model can answer correctly when irrelevant or misleading context is mixed into retrieved evidence. It applies to both offline eval pipelines and production traces, and it targets the generation step under retrieval noise rather than the retriever's ranking. FutureAGI surfaces it through eval:NoiseSensitivity, an evaluator that catches distractor-driven hallucination, over-attention to stale chunks, and brittle prompts that look fine only when every retrieved passage is clean.

Why Noise Sensitivity Matters in Production LLM and Agent Systems

Noise sensitivity matters because many RAG failures are not caused by empty retrieval. They come from mixed-quality retrieval: one correct chunk, three weakly related chunks, and one stale policy page that looks authoritative. The visible failure is a confident answer that cites context but chooses the wrong context. A support assistant applies an old refund rule. A legal-search agent summarizes a distractor clause. A research copilot answers the easy adjacent question instead of the user’s specific question.

Developers feel this first when offline evals pass on clean fixtures but production traces fail on long-tail queries. SREs see repeat calls, higher token spend, and p99 latency growth as teams increase top-k to compensate. Product teams see thumbs-down clusters around broad queries. Compliance teams care because noisy context can pull a model toward outdated policy, private notes, or jurisdiction-specific exceptions.

The log pattern is specific: high retrieval recall, acceptable ContextRelevance for at least one chunk, but low answer quality when irrelevant chunks are present. In 2026 multi-step pipelines, this compounds. An agent may retrieve, plan, call a tool, retrieve again, and synthesize. Noise at each step becomes path dependence: the model follows one bad chunk into a bad tool call, then treats the tool result as confirmation. Noise sensitivity gives that failure mode a measurable boundary.
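The log pattern above can be approximated as a filter over trace records. This is a minimal sketch, not a FutureAGI API: the record fields (`chunk_relevance`, `answer_quality`) and both thresholds are hypothetical, and the scores are assumed to lie in [0, 1] with higher meaning better.

```python
def matches_noise_pattern(trace: dict,
                          relevance_ok: float = 0.7,
                          quality_floor: float = 0.5) -> bool:
    """Flag traces where a relevant chunk was retrieved alongside
    irrelevant ones and the answer still came out poor."""
    scores = trace["chunk_relevance"]  # hypothetical per-chunk relevance scores
    has_relevant = any(s >= relevance_ok for s in scores)
    has_noise = any(s < relevance_ok for s in scores)
    low_quality = trace["answer_quality"] < quality_floor
    return has_relevant and has_noise and low_quality


# Made-up traces for illustration only.
traces = [
    {"trace_id": "t1", "chunk_relevance": [0.9, 0.2, 0.1], "answer_quality": 0.3},
    {"trace_id": "t2", "chunk_relevance": [0.9, 0.8], "answer_quality": 0.9},
    {"trace_id": "t3", "chunk_relevance": [0.1, 0.2], "answer_quality": 0.2},
]
flagged = [t["trace_id"] for t in traces if matches_noise_pattern(t)]
print(flagged)
```

Trace `t3` is deliberately not flagged: with no relevant chunk present, the failure is a retrieval miss rather than a noise-sensitivity problem.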

How FutureAGI Handles Noise Sensitivity

FutureAGI’s approach is to treat noise sensitivity as a regression eval for RAG behavior under distractor pressure. The anchor eval:NoiseSensitivity maps to fi.evals.NoiseSensitivity, listed in the inventory as a local metric for RAG system resilience to irrelevant context. The evaluator belongs next to ContextRelevance, ContextPrecision, ContextRecall, and Groundedness: relevance and ranking explain the retrieval side; NoiseSensitivity explains whether the model ignored the junk it was given.

A real workflow: a documentation assistant runs on traceAI-langchain. Each answer trace stores the query, retrieval.documents in final ranked order, llm.output, retriever.version, and prompt.version. The team builds an eval dataset with paired cases: the same user question with clean context, then the same question with two plausible distractor chunks added. NoiseSensitivity runs over those pairs before a retriever or prompt release. If the answer changes from “audit logs export from Admin Settings” to “billing reports cannot be exported,” the eval flags the run because the distractor won.
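The paired dataset in this workflow could be built along the following lines. This is a sketch under stated assumptions: `RagCase`, `with_distractors`, and `build_pairs` are illustrative names rather than a FutureAGI schema, and the distractor sentences are invented for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RagCase:
    """One eval row. Field names are illustrative, not a FutureAGI schema."""
    case_id: str
    query: str
    expected_answer: str
    context: tuple[str, ...]
    variant: str = "clean"


def with_distractors(case: RagCase, distractors: list[str]) -> RagCase:
    """Return the noisy twin of a clean case: same query and expected
    answer, with distractor chunks appended to the clean context."""
    return replace(case,
                   context=case.context + tuple(distractors),
                   variant="noisy")


def build_pairs(clean_cases, distractors_by_id):
    """Pair every clean case with its noisy twin."""
    return [
        (case, with_distractors(case, distractors_by_id.get(case.case_id, [])))
        for case in clean_cases
    ]


clean = RagCase(
    case_id="audit-export",
    query="Can admins export audit logs?",
    expected_answer="Admins can export audit logs from Admin Settings.",
    context=("Admin audit logs are exportable from Admin Settings.",),
)
pairs = build_pairs(
    [clean],
    {"audit-export": [
        "Free plans cannot export billing reports.",   # plausible distractor
        "Billing reports are generated monthly.",      # plausible distractor
    ]},
)
```

Keeping the query and expected answer identical across the pair means any change in the answer can be attributed to the added chunks.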

The engineer does not stop at the aggregate score. They open the failing trace, check whether the distractor was ranked too high, then choose the fix: tighten retrieval filters, add a reranker, lower top-k, rewrite the answer prompt to cite only directly relevant chunks, or add a regression gate. Unlike a one-off Ragas noise-sensitivity benchmark, FutureAGI keeps the score attached to trace cohorts, versions, and alert thresholds, so a release gate can fail on exactly the retrieval slice that regressed instead of on a blended average.
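A per-slice regression gate of this kind might look like the sketch below. It is not FutureAGI's gating logic: `regressed_slices`, the slice names, the fail rates, and the tolerance are all hypothetical, and fail rates are assumed to be fractions in [0, 1].

```python
def regressed_slices(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.02) -> list[str]:
    """Return the slices whose eval fail rate rose by more than `tolerance`
    compared with the baseline release. A slice missing from the baseline
    is compared against a fail rate of 0.0."""
    return sorted(
        name for name, rate in candidate.items()
        if rate > baseline.get(name, 0.0) + tolerance
    )


# Made-up fail rates per query slice, for illustration only.
baseline = {"policy": 0.10, "navigational": 0.05}
candidate = {"policy": 0.25, "navigational": 0.05}

broken = regressed_slices(baseline, candidate)
if broken:
    print("Release blocked on slices:", broken)
```

Gating per slice keeps a regression on policy queries from being hidden by stable navigational queries.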

How to Measure or Detect Noise Sensitivity

Measure noise sensitivity by comparing answer quality with and without irrelevant context while keeping the user question and expected answer stable. Useful signals include:

fi.evals.NoiseSensitivity - returns a RAG eval signal for whether distractor context changed or corrupted the answer.
Clean-versus-noisy pair delta - score difference between the clean fixture and the same fixture with distractor chunks.
retrieval.documents and llm.output - trace fields needed to inspect what the model saw and what it produced.
Eval-fail-rate by cohort - dashboard signal split by retriever version, prompt version, tenant, and query type.
User-feedback proxy - thumbs-down rate or escalation rate on broad queries with many near-matching documents.
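Two of the signals above, the clean-versus-noisy pair delta and the fail rate by cohort, can be computed with plain Python once each pair has been scored. The sketch below assumes a generic answer-quality score in [0, 1] where higher is better. That scale, the `max_drop` threshold, and the row fields are assumptions made for illustration, not the output format of `fi.evals.NoiseSensitivity`.

```python
from collections import defaultdict


def pair_delta(clean_score: float, noisy_score: float) -> float:
    """Quality lost when distractors are added (positive means noise hurt)."""
    return clean_score - noisy_score


def fail_rate_by_cohort(rows, cohort_key: str, max_drop: float = 0.2):
    """Share of pairs per cohort whose quality dropped by more than max_drop."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for row in rows:
        cohort = row[cohort_key]
        totals[cohort] += 1
        if pair_delta(row["clean_score"], row["noisy_score"]) > max_drop:
            fails[cohort] += 1
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}


# Made-up scored pairs for illustration only.
rows = [
    {"retriever_version": "v1", "query_type": "policy", "clean_score": 0.9, "noisy_score": 0.85},
    {"retriever_version": "v1", "query_type": "policy", "clean_score": 1.0, "noisy_score": 0.4},
    {"retriever_version": "v2", "query_type": "navigational", "clean_score": 0.9, "noisy_score": 0.9},
    {"retriever_version": "v2", "query_type": "policy", "clean_score": 0.8, "noisy_score": 0.3},
]

by_retriever = fail_rate_by_cohort(rows, "retriever_version")
by_query_type = fail_rate_by_cohort(rows, "query_type")
print(by_retriever)
print(by_query_type)
```

In this toy data both retriever versions show the same fail rate, while the split by query type shows that all failures sit in the policy slice.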

Minimal Python:

from fi.evals import NoiseSensitivity

evaluator = NoiseSensitivity()
result = evaluator.evaluate(
    input="Can admins export audit logs?",  # user question
    output="Admins can export audit logs from Admin Settings.",  # answer under test
    # Retrieved context: the first chunk is relevant; the second is a
    # distractor about billing reports, not audit logs.
    context=["Admin audit logs are exportable.", "Free plans cannot export billing reports."]
)
print(result.score, result.reason)

Common Mistakes

Noise sensitivity is easy to misread because the retriever, prompt, and model all contribute to the result. Watch for these implementation mistakes:

Testing only clean context. Clean fixtures prove the model can answer; they do not prove it can ignore plausible distractors.
Treating recall as enough. High recall can coexist with high noise sensitivity when irrelevant chunks outrank or distract from relevant chunks.
Mixing unrelated distractors. Random noise is too easy; use near-miss documents that share entities, dates, or product names.
Blaming the retriever every time. If the relevant chunk is present and ranked high, the prompt or model may be over-attending to weak evidence.
Averaging every query together. Split policy, troubleshooting, navigational, and multi-hop queries; broad averages hide the slice with noisy retrieval.
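To avoid the random-noise mistake, distractors can be chosen for how closely they resemble the gold chunk. The sketch below uses token-level Jaccard overlap as a crude stand-in for "shares entities, dates, or product names". The helper names, thresholds, and sample sentences are hypothetical, and a real pipeline might prefer embedding similarity or the retriever's own near-miss results.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a chunk."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_near_miss(gold_chunk: str, candidates: list[str],
                   k: int = 2, min_overlap: float = 0.05) -> list[str]:
    """Return up to k candidates that overlap most with the gold chunk,
    dropping candidates that share too few tokens to be plausible."""
    gold = tokens(gold_chunk)
    scored = [(jaccard(gold, tokens(c)), c) for c in candidates if c != gold_chunk]
    scored = [(score, c) for score, c in scored if score >= min_overlap]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


# Made-up chunks for illustration only.
gold = "Admins can export audit logs from Admin Settings."
candidates = [
    "Free plans cannot export billing reports.",
    "Admins can export billing reports from Admin Settings.",
    "The office cafeteria opens at 9 am.",
]
near_misses = pick_near_miss(gold, candidates)
print(near_misses)
```

The unrelated cafeteria sentence shares no tokens with the gold chunk and is filtered out, which is the point: it would be too easy for the model to ignore.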