What Is Specificity?

Specificity is a binary-classification metric defined as TN / (TN + FP) — the proportion of actual negatives that the classifier correctly labels as negative. It is also called the true negative rate (TNR), and it complements sensitivity (recall, TPR) to fully characterize class-conditional performance. Together they describe the classifier independently of class prevalence, which makes them more robust than raw accuracy on imbalanced data. Specificity matters anywhere false alarms are costly: medical screening, jailbreak detection, PII filters, hallucination detectors. The right specificity target depends on the operational cost of a false positive versus a missed detection (a false negative).

Why It Matters in Production LLM and Agent Systems

LLM safety stacks are full of binary detectors. Is this prompt an injection? Yes/no. Does this output contain PII? Yes/no. Is this a jailbreak attempt? Yes/no. Each detector has a sensitivity-specificity tradeoff. A jailbreak detector with sensitivity 0.99 and specificity 0.80 catches almost every attack — but also blocks 20% of legitimate traffic. That is unacceptable for most products. Specificity is the metric that defines false-alarm rate, and false-alarm rate is the metric that defines whether a guardrail can run inline.

The pain of skipping specificity tracking shows up across roles. A safety engineer ships a prompt-injection detector tuned only on attack samples; specificity drops to 0.71 in production and one-third of customer support queries get blocked as “suspicious.” A product team rolls out a hallucination filter that achieves 95% sensitivity and a 12% over-block rate on legitimate factual answers. A platform engineer chases an alert storm caused by a content-safety classifier with specificity 0.60 — most of the alerts are noise.

In 2026 agent stacks where multiple guardrails run in series — pre-prompt injection check, post-output PII check, post-output content-safety check — specificity compounds. Three detectors at 0.97 specificity each give 91.3% combined specificity per request (assuming their false alarms are independent), meaning ~9% of innocent traffic gets flagged somewhere. Multi-guardrail design requires tracking specificity at each stage, not just the final block rate.
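
A quick sketch of that arithmetic (independence of the detectors' false alarms is an assumption; correlated detectors compound less):

import math

# Per-stage specificities for guardrails run in series on each request
stage_specificities = [0.97, 0.97, 0.97]

# A request passes cleanly only if every stage clears it, so specificities multiply
combined = math.prod(stage_specificities)

print(f"combined specificity: {combined:.3f}")          # 0.913
print(f"innocent traffic flagged: {1 - combined:.1%}")  # 8.7%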

How FutureAGI Handles Specificity

FutureAGI’s safety evaluators report specificity alongside sensitivity, with full confusion matrices on regression suites.

  • Evaluator level — classifiers like PromptInjection, ContentSafety, Toxicity, BiasDetection, and PII run against a labeled Dataset and emit per-class true-positive, false-positive, true-negative, false-negative counts; specificity falls out of the math.
  • Threshold-tuning level — you sweep the classifier threshold across an operating curve and pick the (sensitivity, specificity) point that matches your operational cost profile, for example “specificity ≥ 0.99 even if sensitivity drops to 0.90” for a customer-facing product where false blocks hurt more than missed attacks (see the sketch below).
  • Production level — traceAI captures classifier scores as span attributes; sampling production traffic into a periodically relabeled Dataset lets you re-estimate specificity over real distributions, not just lab samples.
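
A minimal sketch of that threshold sweep, assuming you already have per-example classifier scores and ground-truth labels (the synthetic `scores` and `labels` here are placeholders for your evaluator's outputs on a labeled Dataset):

import numpy as np
from sklearn.metrics import roc_curve

# Placeholder data: replace with real evaluator scores and human labels (1 = attack)
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)
scores = np.clip(0.35 * labels + rng.normal(0.3, 0.15, 1000), 0, 1)

fpr, tpr, thresholds = roc_curve(labels, scores)
specificity = 1 - fpr  # by definition: specificity = 1 - false-positive rate

# Operating constraint: specificity >= 0.99, then take the highest sensitivity left
feasible = specificity >= 0.99
best = np.argmax(np.where(feasible, tpr, -1.0))  # -1 marks infeasible points

print(f"threshold={thresholds[best]:.3f}  "
      f"sensitivity={tpr[best]:.3f}  specificity={specificity[best]:.3f}")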

Concretely: a team operating a customer-support agent runs PromptInjection as a pre-guardrail. They build a labeled Dataset of 1,000 examples (200 actual attacks, 800 legitimate support queries), sweep the classifier’s confidence threshold, and pick the threshold where specificity = 0.995 (5 false alarms per 1,000 legitimate queries) at sensitivity 0.91. They re-sample 500 production queries weekly into the eval Dataset, relabel via an AnnotationQueue, and re-validate specificity. When a model upgrade drops specificity to 0.97, the gateway switches to a fallback model and the team re-tunes. FutureAGI made the false-alarm cost a number, not an anecdote.
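
That weekly loop, as a sketch; `sample_production_queries`, `fetch_annotations`, and `switch_to_fallback` are hypothetical stand-ins for your traffic sampler, AnnotationQueue labels, and gateway hook:

from sklearn.metrics import confusion_matrix

TARGET_SPECIFICITY = 0.995

def weekly_revalidation(evaluator, threshold=0.5):
    queries = sample_production_queries(n=500)  # hypothetical production sampler
    labels = fetch_annotations(queries)         # hypothetical: 1 = attack, 0 = legitimate
    preds = [evaluator.evaluate(input=q).score > threshold for q in queries]
    tn, fp, fn, tp = confusion_matrix(labels, preds, labels=[0, 1]).ravel()
    specificity = tn / (tn + fp)
    if specificity < TARGET_SPECIFICITY:
        switch_to_fallback()                    # hypothetical gateway hook; re-tune, then switch back
    return specificity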

How to Measure or Detect It

Specificity-related signals to wire into evaluation and production:

  • Per-evaluator specificity — TN / (TN + FP) on a labeled regression set; primary number to track per classifier.
  • ROC curve — sensitivity vs. (1 − specificity) across threshold; pick operating point by cost.
  • PromptInjection, ContentSafety, Toxicity, BiasDetection — all expose enough data on a labeled set to compute specificity directly.
  • False-alarm rate — 1 − specificity, the operational form most product teams care about.
  • Per-cohort specificity — slice by user demographic, language, intent; uneven specificity is a fairness flag (see the per-cohort sketch after the code below).

Minimal Python — measure specificity by aggregating evaluator outputs against labels:

from fi.evals import PromptInjection
from sklearn.metrics import confusion_matrix

# Assumed in scope: inputs (list of prompts) and labels (ground truth, 1 = attack, 0 = benign)
evaluator = PromptInjection()

# Positive class = "attack"; flag whenever the score clears the 0.5 threshold
predictions = [evaluator.evaluate(input=x).score > 0.5 for x in inputs]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(labels, predictions, labels=[0, 1]).ravel()
specificity = tn / (tn + fp)  # true-negative rate: fraction of benign inputs let through
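
Per-cohort specificity (the fairness flag from the signal list above) takes one more step once each example carries a cohort tag; `cohorts` here is a hypothetical per-example language or dialect label, continuing from the block above:

import pandas as pd

df = pd.DataFrame({"label": labels, "pred": predictions, "cohort": cohorts})

# Among actual negatives, mean(pred) is the false-alarm rate, so 1 - mean is specificity
negatives = df[df["label"] == 0]
per_cohort = 1 - negatives.groupby("cohort")["pred"].mean()
print(per_cohort.sort_values())  # lowest-specificity cohorts first: the fairness flags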

Common Mistakes

  • Reporting accuracy on imbalanced data. Accuracy looks great when 95% of inputs are negatives. Specificity and sensitivity show what’s actually happening per class.
  • Tuning sensitivity without checking specificity. Sensitivity 0.99 sounds good until you discover specificity is 0.62 and false alarms drown the signal.
  • One-time threshold tuning. Production distributions shift; a threshold tuned at launch will drift. Re-tune on rolling samples.
  • Ignoring per-cohort specificity. A model that has 0.99 specificity overall but 0.85 on a specific dialect or accent is a fairness incident waiting to happen.
  • Confusing specificity with precision. Both involve false positives, but they condition on different totals. Specificity divides FP by actual negatives; precision divides FP by predicted positives.

Frequently Asked Questions

What is specificity?

Specificity is the true-negative rate of a binary classifier: TN / (TN + FP). It measures the fraction of actual negatives the model correctly identifies as negative, and complements sensitivity (recall) to fully describe a classifier's class-conditional behavior.

How is specificity different from precision?

Precision is TP / (TP + FP) — fraction of predicted positives that are correct. Specificity is TN / (TN + FP) — fraction of actual negatives identified correctly. They share the same FP term but condition on different totals.

How does FutureAGI use specificity?

FutureAGI reports specificity for safety classifiers like PromptInjection and ContentSafety alongside sensitivity, so teams can pick thresholds that balance false-alarm rate against missed-detection rate for their risk profile.