
What Is Sensitivity and Specificity in ML?

Paired binary-classifier metrics — sensitivity (true-positive rate) and specificity (true-negative rate) — that describe behavior on positives and negatives separately.

What Is Sensitivity and Specificity in ML?

Sensitivity and specificity in ML are the paired metrics that describe a binary classifier's behavior on positive and negative cases separately. Sensitivity (true-positive rate, identical to recall) is the fraction of actual positives the model catches. Specificity (true-negative rate) is the fraction of actual negatives it correctly rejects. They trade off: lowering the decision threshold to catch more positives usually misclassifies more negatives. Sensitivity is the ROC curve's y-axis and 1 - specificity is its x-axis, and together the pair is the canonical way to report classifier quality on imbalanced data, where overall accuracy hides the failure mode.
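
In confusion-matrix terms, both metrics are simple ratios. A tiny illustration with made-up counts:

# Confusion-matrix counts from one evaluation run (made-up numbers)
tp, fn = 90, 10     # actual positives: caught vs. missed
tn, fp = 950, 50    # actual negatives: rejected vs. falsely flagged

sensitivity = tp / (tp + fn)   # 0.90, recall / true-positive rate
specificity = tn / (tn + fp)   # 0.95, true-negative rate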

Why It Matters in Production LLM and Agent Systems

A single accuracy number is a lie on imbalanced data. The classic example: when only 0.5% of transactions are fraudulent, a fraud-detection model can hit 99.5% accuracy by predicting "not fraud" on every input, while scoring 0% sensitivity on the only thing the model exists to catch. Engineering teams that report only accuracy miss this completely. The dashboard says green; the fraud goes through.
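
The arithmetic is worth seeing once (illustrative numbers only):

# 10,000 transactions, 0.5% fraud, model predicts "not fraud" on every input
total, fraud = 10_000, 50
caught = 0                                  # the all-negative model flags nothing
accuracy = (total - fraud) / total          # 0.995, looks green on a dashboard
sensitivity = caught / fraud                # 0.0, every fraud case slips through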

The reverse problem is just as common. A content-safety classifier tuned for 99% sensitivity on harmful content will fire on enormous volumes of legitimate content (low specificity), which floods the moderation queue and degrades user experience. Both numbers have to be reported, both need explicit operating thresholds, and both have to be sliced by cohort.

The pain falls across roles. ML engineers tune thresholds with no visibility into the production cost asymmetry. Compliance leaders are asked, mid-audit, what fraction of regulated content the system rejects correctly — and have only an aggregate accuracy number to point to. Product managers ship a “more accurate” model that quietly traded sensitivity for specificity in the wrong direction.

In 2026-era LLM stacks the same pair shows up in less obvious places. PII detectors have a sensitivity-vs-specificity trade-off — catch more PII at the cost of more false redactions. Prompt-injection guardrails do too — flag more attacks at the cost of more false blocks on legitimate prompts. Treating these as binary “does it work?” questions instead of as a sensitivity/specificity surface is how regressions hide.

How FutureAGI Handles Sensitivity and Specificity in ML

FutureAGI’s approach is to expose both numbers and their trade-off curve at every binary-decision surface in the stack. For classical classifiers, fi.evals.RecallScore (sensitivity) and a paired specificity computation on the confusion matrix produce both numbers per evaluation run, with per-class slicing for multi-class settings. For retrieval, fi.evals.RecallAtK and fi.evals.PrecisionAtK give the analogous pair at rank K — high recall@k buys you sensitivity at the cost of more irrelevant chunks reaching the model.
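
For reference, the rank-K pair reduces to two ratios over the top-K retrieved items. A plain-Python illustration of the underlying math (not the fi.evals API):

def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision@K and Recall@K for a single query."""
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    return hits / k, hits / len(relevant_ids)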

For LLM guardrails, fi.evals.PromptInjection, fi.evals.ProtectFlash, fi.evals.PII, and fi.evals.ContentSafety each have an inherent sensitivity/specificity trade-off, and FutureAGI’s dataset workflow lets you compute both numbers against red-team and clean-traffic datasets to confirm where the threshold should sit.
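
A plain-Python sketch of that measurement, where guardrail_flags is a hypothetical stand-in for whichever guardrail evaluator is under test:

# guardrail_flags(prompt) -> bool is a hypothetical stand-in for the guardrail under test
flagged_attacks = sum(guardrail_flags(p) for p in red_team_prompts)
flagged_clean = sum(guardrail_flags(p) for p in clean_traffic_prompts)

sensitivity = flagged_attacks / len(red_team_prompts)          # attacks caught
specificity = 1 - flagged_clean / len(clean_traffic_prompts)   # legitimate prompts passed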

Concretely: a fintech compliance team running a transaction-classification LLM on traceAI-openai builds two datasets — confirmed-fraud cases and confirmed-clean cases — and runs Dataset.add_evaluation() with RecallScore (sensitivity on the fraud set) and a custom specificity evaluator on the clean set. The team picks an operating threshold of 0.62 by reading the ROC curve for the operating point that holds sensitivity above 0.93 and specificity above 0.97. After a model swap they re-run the same eval before deploying; if either number drops, deploy is blocked. Without the pair, the team would pick a threshold by accuracy and ship a sensitivity regression.
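
A sketch of the deploy gate that workflow implies, assuming the two evaluation runs have already produced the pair of scores; run_fraud_eval and run_clean_eval are hypothetical stand-ins for the Dataset.add_evaluation() calls described above:

# Hypothetical helpers standing in for the two Dataset.add_evaluation() runs
sensitivity = run_fraud_eval(fraud_dataset, threshold=0.62)   # RecallScore on confirmed fraud
specificity = run_clean_eval(clean_dataset, threshold=0.62)   # custom specificity evaluator on confirmed clean

# Block the deploy if either floor is violated after a model swap
SENS_FLOOR, SPEC_FLOOR = 0.93, 0.97
if sensitivity < SENS_FLOOR or specificity < SPEC_FLOOR:
    raise RuntimeError(
        f"Deploy blocked: sensitivity={sensitivity:.3f}, "
        f"specificity={specificity:.3f} (floors {SENS_FLOOR}/{SPEC_FLOOR})"
    )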

How to Measure or Detect It

Always report both numbers, with the threshold pinned:

  • RecallScore: sensitivity / true-positive rate against a labeled dataset.
  • Specificity (custom evaluator): true-negative rate computed from confusion-matrix counts; wrap as a CustomEvaluation.
  • ROC curve and AUC: sensitivity vs (1 - specificity) across all thresholds; pick the operating point from the curve.
  • Confusion matrix: the raw counts behind both metrics — verify TP, FN, TN, FP are what you expect.
  • PrecisionAtK + RecallAtK: the retrieval analog of the precision/sensitivity pair.
  • Per-cohort breakdown (dashboard signal): sensitivity and specificity sliced by user segment; surfaces fairness gaps.

Minimal Python:

from fi.evals import RecallScore
from sklearn.metrics import confusion_matrix

recall = RecallScore()

# Sensitivity (true-positive rate): fraction of actual positives caught
sens = recall.evaluate(
    input=positive_inputs,
    output=positive_predictions,
    expected_response=positive_labels,
).score

# Specificity (TN / (TN + FP)): count from the confusion matrix over
# the full labeled set of 0/1 labels and thresholded predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
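
To pick the operating point named in the ROC bullet above, a standard scikit-learn sketch (y_true are 0/1 labels, y_score are the model's raw scores on the same labeled set):

import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # tpr is sensitivity
specificity = 1 - fpr

# Highest threshold that still keeps sensitivity above the floor,
# i.e. the best specificity available at that sensitivity
SENS_FLOOR = 0.93
idx = np.flatnonzero(tpr >= SENS_FLOOR)[0]          # first index meeting the floor
print(thresholds[idx], tpr[idx], specificity[idx])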

Common Mistakes

  • Reporting one number, not both. Sensitivity alone hides high false-alarm rate; specificity alone hides missed positives.
  • Comparing models at different thresholds. Pin a threshold or compare full ROC curves; otherwise the comparison is meaningless.
  • Optimizing F1 when costs are asymmetric. F1 assumes equal cost; weight the loss to your domain.
  • Ignoring per-cohort breakdowns. Aggregate sensitivity can be 0.9 while a minority cohort is at 0.5.
  • Treating retrieval recall@k as separate from sensitivity. Same concept; same trade-off; same threshold-tuning discipline applies.

Frequently Asked Questions

What is sensitivity and specificity in ML?

Sensitivity and specificity are paired binary-classifier metrics: sensitivity is the true-positive rate (correctly catching positives) and specificity is the true-negative rate (correctly rejecting negatives).

How do sensitivity and specificity trade off?

Lowering the decision threshold to catch more positives usually misclassifies more negatives, so sensitivity goes up and specificity goes down. The ROC curve traces this trade-off across thresholds.

When should you report sensitivity and specificity instead of accuracy?

Whenever the data is imbalanced or the cost of a missed positive differs from the cost of a false alarm — most medical, fraud, and content-safety classifiers.