What Is a True Negative?

A classifier prediction that correctly labels a negative example as negative; the lower-right cell of a binary confusion matrix.

A true negative (TN) is a binary-classifier outcome where the predicted label is negative and the ground-truth label is also negative. In an LLM-safety context, that is a benign prompt the guardrail correctly allowed through, or a non-toxic response the toxicity classifier correctly cleared. True negatives sit in the bottom-right of the confusion matrix and feed the derived metrics accuracy and specificity (the true-negative rate, which is simply the complement of the false-positive rate). In FutureAGI’s evaluation surface, they show up implicitly inside any cohort-level confusion matrix produced by classifier-style evaluators.
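
As a quick illustration with made-up counts, this is how true negatives flow into those derived metrics:

# Illustrative confusion-matrix counts for a guardrail classifier (not real data)
tp, fp, tn, fn = 480, 80, 2420, 20

accuracy    = (tp + tn) / (tp + fp + tn + fn)  # share of all predictions that are correct
specificity = tn / (tn + fp)                   # true-negative rate
fpr         = 1 - specificity                  # false-positive rate
recall      = tp / (tp + fn)                   # true-positive rate, shown for contrast

print(f"accuracy={accuracy:.3f} specificity={specificity:.3f} fpr={fpr:.3f} recall={recall:.3f}")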

Why It Matters in Production LLM and Agent Systems

True negatives are the invisible half of guardrail performance. Engineering teams obsess over false negatives (the harmful prompt that slipped through) and false positives (the benign user blocked at the door), but the true-negative count is what determines whether a guardrail is shippable. If specificity drops below ~95%, the false-positive rate is high enough to drown the support queue in complaints from legitimate users.

The pain shows up cleanly in production logs. A new prompt-injection classifier goes live with 99% recall on attacks; great. But its true-negative rate is 92%, meaning 8% of benign queries get blocked. On a chatbot handling 50K mostly benign queries a day, that is roughly 4,000 false rejections daily — enough to break product trust by week two. The SRE sees a spike in 4xx responses; the product manager sees churn; nobody connects it back to the guardrail until eval-fail-rate-by-cohort surfaces the imbalance.

For 2026-era agent systems, the stakes compound. A multi-step agent calls a guardrail per step. With a 92% specificity, a 5-step trajectory has a 34% chance of at least one false rejection — and the agent now has to recover from a guardrail block on a perfectly legitimate tool call. Trajectory completion rates collapse. Confusion-matrix discipline — tracking true negatives explicitly, not just hiding them inside accuracy — is the only honest way to ship classifier-driven safety.
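
Both the 4,000-a-day figure and the 34% trajectory figure fall out of the same back-of-the-envelope arithmetic; a minimal sketch, assuming each step's check is independent:

# Probability of at least one false rejection across an n-step trajectory,
# assuming every step's guardrail check has the same, independent specificity
specificity = 0.92
steps = 5
print(f"{1 - specificity ** steps:.0%}")         # ~34% of 5-step trajectories hit a false block

# Daily false rejections on a mostly benign, 50K-query-per-day surface
daily_queries = 50_000
print(round(daily_queries * (1 - specificity)))  # ~4,000 blocked benign queries per day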

How FutureAGI Handles True Negatives

FutureAGI’s approach is to expose every cell of the confusion matrix, not just headline accuracy. When you run a classifier evaluator (PromptInjection, ContentSafety, Toxicity, IsHarmfulAdvice, BiasDetection) against a labelled Dataset, the platform stores per-row predictions alongside ground-truth labels. The evaluation report aggregates these into TP, FP, TN, and FN counts and computes specificity, recall, precision, and F1 from them.

Concretely: a team running a guardrail in front of an LLM uses fi.evals.PromptInjection against a 5,000-row dataset where 2,500 are labelled benign. The evaluator scores each row; the report shows TN = 2,420 out of 2,500 — a 96.8% true-negative rate. That number is the input to the deployment decision: the team can compare it against the prior version’s TN count and only ship if specificity is non-regressive. In production, the same evaluator runs on live traces as the inline guardrail; the Agent Command Center records every allowed request, and benign-traffic sampling lets the team estimate the real-world TN rate continuously.
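
A specificity non-regression gate is a few lines of Python; the counts and the tolerance below are hypothetical, not values taken from the platform:

# Hypothetical TN/FP counts from two evaluation runs over the same 2,500 benign rows
prior   = {"tn": 2435, "fp": 65}
current = {"tn": 2420, "fp": 80}

def specificity(counts):
    return counts["tn"] / (counts["tn"] + counts["fp"])

tolerance = 0.005  # tolerate at most a half-point drop in specificity before blocking the release
regressed = specificity(current) < specificity(prior) - tolerance
print(f"prior={specificity(prior):.3f} current={specificity(current):.3f} ship={not regressed}")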

The discipline matters: a 1% drop in true negatives on a high-traffic surface is more noticeable to users than a 5% drop in recall on rare attacks.
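
To put numbers on that claim, here is a rough comparison under an assumed traffic mix (the attack volume is purely illustrative):

# Assumed mix: 50K queries/day, of which ~100 are actual attacks
benign_per_day, attacks_per_day = 49_900, 100

extra_false_rejections = benign_per_day * 0.01   # a 1% drop in true-negative rate
extra_missed_attacks   = attacks_per_day * 0.05  # a 5% drop in recall

print(round(extra_false_rejections))  # ~499 more legitimate users blocked per day
print(round(extra_missed_attacks))    # ~5 more attacks missed per day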

How to Measure or Detect It

True negatives are computed from a labelled evaluation set:

  • Confusion-matrix evaluator output: classifier evals in FutureAGI return TP, FP, TN, FN counts plus derived metrics.
  • Specificity (TN rate): TN / (TN + FP) — track this alongside recall.
  • False-positive rate: 1 - specificity; the product-trust signal.
  • Per-cohort breakdown: split by user segment, language, or route to find sub-populations where TN rate collapses.

Minimal Python:

from fi.evals import PromptInjection

evaluator = PromptInjection()
result = evaluator.evaluate(
    input="What's the weather in Berlin tomorrow?",
)
# Benign input; expect a "safe" prediction → contributes to TN
print(result.score, result.reason)

Aggregate over a benign-only cohort to estimate true-negative rate.
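
A minimal aggregation sketch, reusing the evaluator call above; the benign prompts, the threshold, and its direction are assumptions for illustration, so check the evaluator's documented score semantics before relying on them:

from fi.evals import PromptInjection

# A small benign-only cohort; in practice, sample benign-labelled production traffic
benign_prompts = [
    "What's the weather in Berlin tomorrow?",
    "Summarise this meeting transcript in three bullet points.",
    "Translate 'good morning' into Spanish.",
]

evaluator = PromptInjection()
tn, fp = 0, 0
for prompt in benign_prompts:
    result = evaluator.evaluate(input=prompt)
    # Assumption: a passing score means the prompt was allowed through (a true negative);
    # on a benign-only cohort, every flagged prompt is a false positive.
    if result.score >= 0.5:  # illustrative threshold
        tn += 1
    else:
        fp += 1

print(f"estimated true-negative rate: {tn / (tn + fp):.1%}")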

Common Mistakes

  • Reporting accuracy without specificity. With class imbalance (mostly benign traffic), a high accuracy can hide a poor TN rate.
  • Forgetting to label benign examples. Datasets that only contain attacks let you measure recall but not TN — always include negative ground truth.
  • Tuning thresholds on TPR alone. Sliding the threshold to catch more attacks tanks the TN rate; tune both together on an ROC curve (TPR against FPR), or a precision-recall curve when traffic is heavily imbalanced.
  • Ignoring TN drift over time. As input distribution changes, specificity decays silently — re-measure monthly on sampled production data.
  • Conflating “true negative” with “no action.” Allowing a request through is a prediction; track it as one.

Frequently Asked Questions

What is a true negative?

A true negative is a classifier prediction that correctly labels an actual negative example as negative — for example, a safety classifier correctly marking a benign prompt as safe.

How is a true negative different from a true positive?

A true positive is a correct prediction on an actual positive example; a true negative is a correct prediction on an actual negative example. Both contribute to accuracy, but true positives drive recall while true negatives drive specificity.

How do you measure true-negative rate?

Run a classifier evaluator like FutureAGI's `PromptInjection` over a labelled dataset and compute true-negative rate as `TN / (TN + FP)`. The platform exposes this in confusion-matrix dashboards.