What Is Sensitivity?

Sensitivity in machine learning is the proportion of actual positives a model correctly identifies — true positives divided by the sum of true positives and false negatives. It is mathematically identical to recall and to true-positive rate; the term comes from biostatistics, where missing a positive case (a sick patient classified as healthy) is the costly error. Sensitivity pairs with specificity to give a complete picture of binary classifier behavior and is the y-axis of every ROC curve. In LLM evaluation, sensitivity reappears as recall on retrieval, tool selection, and harmful-content detection.
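
In code terms, a plain-Python sketch (no particular library assumed):

def sensitivity(y_true, y_pred, positive=1):
    # TP: actual positives the model caught; FN: actual positives it missed.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

sensitivity([1, 1, 0, 1, 0], [1, 0, 0, 1, 0])  # caught 2 of 3 positives -> 0.67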

Why It Matters in Production LLM and Agent Systems

Sensitivity matters most in asymmetric-cost domains, and a lot of LLM applications are asymmetric-cost. A content-safety classifier that misses a harmful prompt has a public-incident cost that dwarfs the cost of an extra false positive. A fraud detector that misses a fraudulent transaction loses real money. A medical-coding model that misses a diagnosis can cause harm directly. In each case, sensitivity — not accuracy — is the metric that tracks the failure that hurts.

The pain shows up when teams optimise for accuracy and ship a high-accuracy, low-sensitivity model. On a 95%-negative dataset, a classifier that always predicts “negative” has 95% accuracy and 0% sensitivity. The dashboard looks great. The harmful prompts pass through. The compliance team is asked, mid-incident, why nothing fired and has no answer.
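
A minimal sketch of that failure mode on a synthetic 95%-negative label set:

y_true = [1] * 5 + [0] * 95   # 5 real positives in 100 examples
y_pred = [0] * 100            # the classifier that always says "negative"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
print(accuracy)               # 0.95 -- the dashboard number
print(caught / 5)             # 0.0  -- the sensitivity number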

In 2026-era LLM stacks, sensitivity reappears in less obvious places. RAG retrievers have sensitivity-on-retrieval (recall@k): how many of the genuinely relevant documents were surfaced. Tool-calling agents have sensitivity-on-tool-detection: did the agent recognise that this prompt required a tool call at all. Guardrail evaluators have sensitivity-on-attack-detection: how many prompt-injection attempts were caught. Treating these as separate problems, rather than as the same sensitivity question in different form, is how regressions hide.
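
Seen as code, all three are the same computation. A sketch of the retrieval case (recall_at_k here is an illustrative helper, not a library call):

def recall_at_k(retrieved_ids, relevant_ids, k):
    # Sensitivity on retrieval: what fraction of the genuinely relevant
    # documents made it into the top-k results.
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / len(relevant)

recall_at_k(["d3", "d7", "d1"], ["d1", "d9"], k=3)  # 1 of 2 relevant -> 0.5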

How FutureAGI Handles Sensitivity in Machine Learning

FutureAGI’s approach is to expose sensitivity (recall) as a first-class evaluator and surface it everywhere a positive-vs-negative decision matters. For binary classifiers, fi.evals.RecallScore returns the true-positive rate against a labeled dataset, with per-class breakdowns when the labels are categorical. For retrieval, fi.evals.RecallAtK returns the fraction of relevant documents retrieved in the top-K, paired with fi.evals.PrecisionAtK to make the trade-off explicit.

For LLM-specific sensitivity questions, the dedicated evaluators take over. fi.evals.PromptInjection and fi.evals.ProtectFlash measure attack-detection recall on red-team datasets. fi.evals.PII measures PII-leakage recall on synthetic-leakage tests. Each of these is a sensitivity number wearing a domain-specific name.

Concretely: a healthcare LLM team running on traceAI-openai builds a labeled dataset of “should escalate to human” vs “safe to auto-respond” examples, attaches RecallScore via Dataset.add_evaluation(), and tracks sensitivity-by-cohort. When a model swap drops sensitivity in the geriatric cohort from 0.91 to 0.74 while overall accuracy is unchanged, the regression eval surfaces the cohort gap directly. The team rolls back, retrains a calibration layer, and re-runs against the dataset before redeploying — the cohort gap is visible because sensitivity, not accuracy, is the headline metric.
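
A sketch of that wiring. RecallScore and Dataset.add_evaluation() are named above; the import path for Dataset, its constructor arguments, and the cohort handling are assumptions for illustration, not confirmed API:

from fi.evals import RecallScore
from fi.datasets import Dataset   # import path is an assumption

# Illustrative: a labeled "should escalate" vs "safe to auto-respond" dataset.
dataset = Dataset(name="escalation-triage")   # constructor args are an assumption
dataset.add_evaluation(RecallScore())         # attach sensitivity as the headline metric

# Score each cohort's slice separately, not just the aggregate, so a drop in
# one segment (geriatric: 0.91 -> 0.74) cannot hide behind unchanged accuracy.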

How to Measure or Detect It

Sensitivity is mechanically simple to measure; the trick is choosing the right denominator and slicing:

  • RecallScore: returns sensitivity for binary or per-class classification against ground truth.
  • RecallAtK: retrieval sensitivity at rank K — the canonical metric for the retrieval layer of RAG.
  • PrecisionAtK: paired metric; lets you draw the precision-recall curve.
  • Confusion matrix: the raw counts behind sensitivity — verify TP, FN are what you think they are.
  • ROC curve and AUC: sensitivity vs (1 - specificity) at every threshold; the standard threshold-tuning surface (see the sketch after this list).
  • Per-cohort sensitivity (dashboard signal): sensitivity sliced by user segment, route, or model version; surfaces fairness gaps.
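
For the threshold-tuning step, a scikit-learn sketch (the 0.95 sensitivity floor is an illustrative requirement, not a recommendation):

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                     # ground truth
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]   # positive-class scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # tpr is sensitivity at each threshold
print(roc_auc_score(y_true, y_score))

# First threshold that clears the sensitivity floor; report specificity
# (1 - fpr) at that same operating point.
i = next(i for i, s in enumerate(tpr) if s >= 0.95)
print(thresholds[i], tpr[i], 1 - fpr[i])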

Minimal Python:

from fi.evals import RecallScore

recall = RecallScore()

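# batch_inputs, batch_predictions, batch_labels are parallel lists:
# the prompts, the model's outputs, and the ground-truth labels.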
result = recall.evaluate(
    input=batch_inputs,
    output=batch_predictions,
    expected_response=batch_labels,
)
print(result.score, result.reason)

Common Mistakes

  • Reporting accuracy on imbalanced data. A 95%-negative dataset gives 95% accuracy to a useless classifier; sensitivity catches it.
  • Optimising threshold against F1 by default. F1 weights precision and recall equally — wrong choice when the cost asymmetry is 10:1 (see the F-beta sketch after this list).
  • Comparing sensitivity across thresholds. Always pin the operating threshold when comparing models.
  • Ignoring per-cohort sensitivity. Aggregate sensitivity can hide a single cohort where the model fails.
  • Conflating retrieval sensitivity with end-to-end answer recall. A high recall@k retriever still produces wrong answers if generation drops the relevant chunks.
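
On the F1 point, a scikit-learn sketch of the recall-weighted alternative (the beta values are illustrative starting points, not a rule):

from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]   # precision 0.50, recall 0.67

# beta > 1 shifts the weight toward recall; how far to push it depends on
# the actual cost asymmetry, which is a judgment call per application.
print(fbeta_score(y_true, y_pred, beta=2.0))  # recall-weighted, ~0.63
print(fbeta_score(y_true, y_pred, beta=1.0))  # plain F1 for comparison, ~0.57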

Frequently Asked Questions

What is sensitivity in machine learning?

Sensitivity is the proportion of actual positives a classifier correctly identifies — true positives over the sum of true positives and false negatives. It is identical to recall.

How is sensitivity different from specificity?

Sensitivity is the true-positive rate (correctly catching positives). Specificity is the true-negative rate (correctly rejecting negatives). They trade off — high sensitivity often comes with lower specificity.

How do you compute sensitivity in FutureAGI?

Use the RecallScore evaluator for binary classification or RecallAtK for ranked retrieval; both attach to a Dataset for per-cohort tracking and regression evals.