
What Is Sensitivity and Specificity of Machine Learning?

Paired binary-classifier metrics from biostatistics — true-positive rate and true-negative rate — that describe behavior on positives and negatives separately.

Sensitivity and specificity in machine learning are two paired metrics — borrowed from biostatistics — that describe a binary classifier’s behavior on positive and negative cases separately. Sensitivity (true-positive rate, equal to recall) is the share of actual positives the model catches; specificity (true-negative rate) is the share of actual negatives correctly rejected. They quantify the trade-off behind every threshold choice and underpin the ROC curve. They are the right pair to report when the dataset is imbalanced or the cost of a missed positive differs from a false alarm — which covers most LLM safety, fraud, and medical applications.
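
In confusion-matrix terms (TP = true positives, FN = false negatives, TN = true negatives, FP = false positives), the standard definitions are:

sensitivity = TP / (TP + FN)   # true-positive rate, identical to recall
specificity = TN / (TN + FP)   # true-negative rate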

Why It Matters in Production LLM and Agent Systems

Reporting only accuracy on imbalanced data hides the failure mode the model exists to address. A prompt-injection guardrail evaluator on traffic where genuine attacks make up 1% of requests can hit 99% accuracy by saying “not an attack” to everything — and 0% sensitivity on the only thing that matters. The dashboard looks fine. The attacks land. Compliance has nothing to point to in the postmortem.
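
A toy calculation with the 1% attack rate above (numbers are illustrative) makes the gap explicit:

# 10,000 requests, 100 of them genuine attacks (1%)
# An always-"not an attack" classifier:
tp, fn = 0, 100       # misses every attack
tn, fp = 9_900, 0     # passes every benign request
accuracy = (tp + tn) / (tp + fn + tn + fp)   # 0.99
sensitivity = tp / (tp + fn)                 # 0.0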

Specificity matters in the opposite direction. A jailbreak detector with 99.9% sensitivity but 80% specificity will drown the moderation queue in false positives. Engineering teams raise the flagging threshold to bring specificity back up — and quietly accept a sensitivity drop they never measured.
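
To see how that plays out in queue volume, assume for illustration the same 1% base rate of real jailbreak attempts across 100,000 requests:

# Illustrative volumes only; assumes 1% of traffic is a genuine jailbreak attempt
attacks, benign = 1_000, 99_000
true_positives = 0.999 * attacks        # ~999 real detections
false_positives = (1 - 0.80) * benign   # ~19,800 benign prompts flagged
# roughly 20 false alarms for every real jailbreak in the moderation queue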

The pain falls everywhere binary decisions live in an LLM stack: PII detectors, content-safety classifiers, refusal evaluators, prompt-injection guardrails, RAG retrievers (sensitivity = recall@k), tool-call decisions (sensitivity = “did the agent decide to call a tool when it should have”). Treating each of those as “does it work?” rather than as a sensitivity/specificity surface obscures the cost asymmetry of every miss.

In 2026-era stacks the trade-off shows up in routing too. The Agent Command Center’s pre-guardrail layer is a sensitivity/specificity decision: more aggressive blocking raises sensitivity on attacks at the cost of specificity on legitimate prompts. Tuning it without measuring both is how teams ship guardrails that either let attacks through or block real users.

How FutureAGI Handles Sensitivity and Specificity of Machine Learning

FutureAGI’s approach is to compute both numbers at every binary-decision surface and pin the operating threshold so comparisons across models are meaningful. For classifiers, fi.evals.RecallScore returns sensitivity, and a paired specificity computation on the confusion matrix completes the picture. For retrieval, fi.evals.RecallAtK and fi.evals.PrecisionAtK give the rank-K analogue.

For LLM-specific binary decisions — PII presence, prompt injection, content safety — the dedicated evaluators (fi.evals.PII, fi.evals.PromptInjection, fi.evals.ProtectFlash, fi.evals.ContentSafety) each have an underlying sensitivity/specificity trade-off, and FutureAGI’s dataset workflow lets you compute both against curated red-team and clean-traffic datasets. The Agent Command Center’s pre-guardrail and post-guardrail policies expose threshold knobs so the operating point can be moved without code changes.

Concretely: a healthcare AI team running a triage classifier on traceAI-anthropic builds a confirmed-positive dataset (escalation cases) and a confirmed-negative dataset (routine cases). Dataset.add_evaluation() runs RecallScore against the positive set (sensitivity) and a custom specificity evaluator against the negative set. From the ROC curve the team picks an operating point at sensitivity 0.94 / specificity 0.91, the point where missing one more positive costs about as much as ten extra false alarms, and the threshold encodes that choice. After every model release the same datasets are re-run, and the deploy is gated if either number regresses by more than 1 point.
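
A minimal sketch of that release gate in plain Python; the baseline numbers come from the example above, and the function name and structure are illustrative rather than a FutureAGI API:

# Hypothetical release gate: block the deploy if either metric regresses
# by more than 1 percentage point against the pinned baseline.
BASELINE = {"sensitivity": 0.94, "specificity": 0.91}
MAX_REGRESSION = 0.01

def release_gate(sensitivity: float, specificity: float) -> bool:
    """Return True only if the new model is allowed to ship."""
    return (BASELINE["sensitivity"] - sensitivity <= MAX_REGRESSION
            and BASELINE["specificity"] - specificity <= MAX_REGRESSION)

assert release_gate(0.935, 0.915)      # 0.5-point sensitivity dip: ships
assert not release_gate(0.92, 0.93)    # 2-point sensitivity drop: blocked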

How to Measure or Detect It

Compute both, pin the threshold, slice by cohort:

  • RecallScore: sensitivity / true-positive rate against a labeled dataset.
  • Specificity (custom evaluator): true-negative rate from the confusion matrix; wrap as a CustomEvaluation.
  • RecallAtK + PrecisionAtK: the retrieval analog of the trade-off pair.
  • ROC curve / AUC: full sensitivity vs (1 - specificity) sweep; the right tool for picking thresholds.
  • Per-cohort sensitivity and specificity (dashboard signal): both metrics sliced by demographic, route, or model version; surfaces fairness gaps.
  • Cost-weighted score: business-weighted combination of FN and FP costs; the metric that should actually drive threshold choice.

Minimal Python:

from fi.evals import RecallScore

# Sensitivity = recall measured on the confirmed-positive dataset
recall = RecallScore()
sensitivity = recall.evaluate(
    input=positive_inputs,
    output=positive_preds,
    expected_response=positive_labels,
).score

# Specificity from confusion-matrix counts on the confirmed-negative dataset
# (assumes binary 0/1 predictions)
tn = sum(1 for p in negative_preds if p == 0)   # negatives correctly rejected
fp = sum(1 for p in negative_preds if p == 1)   # negatives wrongly flagged
specificity = tn / (tn + fp)
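
The cost-weighted score from the measurement list above can drive the threshold choice directly. A minimal sketch in plain Python, assuming you have per-example classifier scores, binary ground-truth labels, and estimated costs for each error type (names are illustrative, not a FutureAGI API):

# Hypothetical cost-weighted threshold sweep
def pick_threshold(scores, labels, cost_fn=10.0, cost_fp=1.0):
    """Return the decision threshold that minimizes total expected cost."""
    best_t, best_cost = 0.5, float("inf")
    for t in sorted(set(scores)):
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t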

Common Mistakes

  • Reporting accuracy on imbalanced data. An always-negative classifier hits 99% accuracy on 1%-positive data with zero sensitivity.
  • Treating specificity and precision as the same. They are not — specificity uses the true-negative count; precision uses predicted positives.
  • Comparing models at different thresholds. Either fix the threshold or compare full ROC curves.
  • Optimizing for F1 when costs are asymmetric. F1 assumes equal cost of FP and FN; tune to your actual cost ratio.
  • Ignoring per-cohort fairness. A model with 0.9 aggregate sensitivity can be 0.5 on a minority cohort.

Frequently Asked Questions

What is sensitivity and specificity of machine learning?

Sensitivity and specificity are paired binary-classifier metrics: sensitivity is the true-positive rate and specificity is the true-negative rate. Together they describe how a classifier handles positives and negatives separately.

Are sensitivity and specificity the same as recall and precision?

Sensitivity equals recall. Specificity is not precision — specificity is the true-negative rate, while precision is true positives over all predicted positives. Reported together, they give a far more complete picture of a classifier than any single metric.
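
The standard formulas make the difference concrete:

precision   = TP / (TP + FP)   # of everything the model flagged positive, how much was right
specificity = TN / (TN + FP)   # of everything actually negative, how much was correctly rejected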

When should ML engineers care about sensitivity and specificity?

Whenever the data is imbalanced (one class is rare) or the cost of a missed positive is unequal to a false alarm — most healthcare, fraud, prompt-injection, and content-safety classifiers.