What Is a False Positive?

A false positive is a classifier prediction that says “yes” when the correct answer is “no”: a wrongly flagged negative. In a binary confusion matrix (true labels on rows, negatives first) it sits in the top-right cell; its rate, the false-positive rate (FPR = FP / (FP + TN)), is the complement of specificity and a key driver of precision (precision = TP / (TP + FP)). False positives create user-visible friction: a benign prompt blocked by a safety filter, a clean reply flagged as toxic, a normal user locked out by a fraud detector. In LLM safety stacks, FPR is the cost you pay for recall; tuning the detection threshold trades one against the other, and the right balance depends on the relative cost of each error type.
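
As a toy numeric illustration (counts invented for the example), the same definitions in code:

# Invented confusion-matrix counts for one detector on one eval set.
tp, fp, tn, fn = 90, 30, 870, 10

precision   = tp / (tp + fp)   # 90 / 120  = 0.75
fpr         = fp / (fp + tn)   # 30 / 900  ≈ 0.033
specificity = tn / (tn + fp)   # 870 / 900 ≈ 0.967, i.e. 1 - FPR

print(f"precision={precision:.3f} fpr={fpr:.3f} specificity={specificity:.3f}")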

Why It Matters in Production LLM and Agent Systems

A false positive is loud. Users notice immediately when a legitimate prompt is blocked, when a harmless output is rewritten, or when a normal action triggers an “unsafe content detected” wall. They file tickets. They tweet. They churn. A high-FPR safety stack rarely survives long before product pressure forces a threshold relaxation, at which point the false-negative rate rises and silent failures begin.

The pain falls on different roles. Customer-success teams escalate user complaints about over-blocking. Product leads see drop-off concentrated on prompts that look like protected categories. ML engineers reluctantly raise thresholds, watch FPR fall, watch FNR climb, and try to explain why the safety stack now misses real attacks. Compliance teams hold the original threshold in their risk register and discover six weeks later that nobody is honouring it.

In 2026-era stacks the FPR/FNR trade is bigger because attack surfaces are bigger. A 5% FPR on a prompt-injection detector means 1 in 20 legitimate user prompts gets blocked; at scale that is unacceptable. A 1% FPR sounds tolerable until you serve 1M prompts a day and 10K of them hit a wall. FutureAGI’s standard pattern is layered evaluators with cascading thresholds: ProtectFlash runs first as a low-FPR fast filter, PromptInjection runs second on borderline cases, and a judge-model rubric runs third on prompts the first two disagree on. The cascade keeps end-user FPR low while preserving recall.
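
A minimal sketch of that cascade, assuming ProtectFlash and PromptInjection are importable and expose the same evaluate(...).score interface as the Toxicity example later in this section; the input= keyword, the thresholds, and the judge_model_rubric helper are all assumptions for illustration, not FutureAGI defaults:

from fi.evals import ProtectFlash, PromptInjection  # assumed importable like Toxicity

fast = ProtectFlash()      # stage 1: low-FPR fast filter
deep = PromptInjection()   # stage 2: heavier detector for borderline cases

def judge_model_rubric(prompt: str) -> bool:
    # Hypothetical stand-in for the stage-3 judge-model rubric.
    raise NotImplementedError

def is_blocked(prompt: str) -> bool:
    s1 = fast.evaluate(input=prompt).score
    if s1 < 0.2:
        return False   # confidently benign: never reaches the slower stages
    if s1 > 0.9:
        return True    # confidently malicious
    s2 = deep.evaluate(input=prompt).score
    if (s1 > 0.5) == (s2 > 0.5):
        return s2 > 0.5   # the two stages agree on the borderline case
    return judge_model_rubric(prompt)   # stages disagree: escalate to the judge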

How FutureAGI Handles False Positives

FutureAGI’s approach is to track precision and FPR per evaluator on every release, not as an afterthought when complaints arrive. Each evaluator runs against a labelled Dataset containing both positives and benign negatives, and RegressionEval produces per-evaluator FPR and precision deltas across versions. The team sets thresholds in a risk register and the eval gate blocks deploys that breach them.
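
In CI terms the gate can be plain logic over those per-evaluator numbers. Here get_release_fpr is a hypothetical helper that reads an evaluator’s FPR from the latest eval run (not an fi SDK call), and the register limits are invented:

# Risk-register FPR ceilings per evaluator (numbers invented for illustration).
MAX_FPR = {"Toxicity": 0.02, "PromptInjection": 0.05}

def gate_release(get_release_fpr) -> bool:
    # get_release_fpr(name) -> FPR of that evaluator on the labelled Dataset
    # for the candidate release (hypothetical helper).
    ok = True
    for name, limit in MAX_FPR.items():
        fpr = get_release_fpr(name)
        if fpr > limit:
            print(f"BLOCK: {name} FPR {fpr:.2%} exceeds register limit {limit:.2%}")
            ok = False
    return ok   # deploy only if no evaluator breaches its limit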

A concrete workflow: a moderation team running Toxicity and ContentSafety notices precision dropping after a model upgrade. The eval dashboard shows FPR rose from 1.2% to 4.8% — too many clean outputs flagged as toxic. They open the failure cohort, sort by reason, and find the new model is over-flagging code blocks that contain words like “kill” in technical contexts. They tune the threshold, add a pre-guardrail rule that exempts code blocks, and run RegressionEval to confirm FPR returns to baseline without harming recall. The Agent Command Center can also run a post-guardrail cascade — ProtectFlash first, then Toxicity only on borderline cases — keeping the median FPR low while preserving the high-recall fallback for adversarial inputs. Unlike OpenAI’s moderation API, which exposes one threshold, FutureAGI surfaces FPR and FNR independently per evaluator and per cohort so engineers tune each axis on its own.
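
The code-block exemption from that workflow might look like the following pre-guardrail rule. The regex, the threshold, and the overall shape are illustrative; only the Toxicity evaluate(output=...).score call is taken from the measurement snippet below:

import re

from fi.evals import Toxicity

detector = Toxicity()
CODE_FENCE = re.compile(r"```.*?```", re.DOTALL)

def flag_toxic(output: str, threshold: float = 0.5) -> bool:
    # Pre-guardrail rule: strip fenced code blocks before scoring, so that
    # technical vocabulary inside code ("kill", "abort", ...) is exempt.
    prose_only = CODE_FENCE.sub("", output)
    if not prose_only.strip():
        return False   # output was entirely code; nothing left to flag
    return detector.evaluate(output=prose_only).score > threshold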

How to Measure or Detect It

False positives are measured through a labelled benign dataset:

  • False-positive rate (FPR): FP / (FP + TN); the percentage of benign cases wrongly flagged.
  • Precision: TP / (TP + FP); high FPR pulls precision down.
  • Toxicity and ContentSafety over benign golden sets: flagged-but-clean outputs are FPs.
  • PromptInjection and ProtectFlash over benign prompt sets: blocked benign prompts are FPs.
  • Per-cohort precision: a global FPR can hide cohort-level over-blocking.
  • User-feedback proxy: thumbs-down on filter-rewritten outputs correlates with FP rate.
  • RegressionEval deltas: precision drops between releases are FPR regressions.
For example, over an all-benign set where every flagged output is a false positive:

from fi.evals import Toxicity

detector = Toxicity()

# Labelled benign outputs; any item flagged here is a false positive.
benign_outputs = [...]

fp = sum(1 for o in benign_outputs if detector.evaluate(output=o).score > 0.5)
fpr = fp / len(benign_outputs)   # FP / (FP + TN): the whole set is benign
print("False positives:", fp, "FPR:", fpr)

Common Mistakes

  • Reporting accuracy instead of precision. Accuracy hides class imbalance; precision exposes FPR-driven harm.
  • Single threshold across cohorts. What is balanced globally can over-flag specific cohorts; see the per-cohort sketch after this list.
  • Skipping the labelled benign set. Without negative-class ground truth, FPR cannot be computed.
  • Optimising precision at the expense of recall. A detector tuned to zero FPR usually gets there by flagging almost nothing, which means missing the real attacks.
  • No user-feedback loop. User reports on over-blocking are the canary; track them per evaluator.
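
A minimal per-cohort computation, with invented counts that show how a benign-looking global rate can hide one badly over-flagged cohort:

from collections import defaultdict

def fpr_per_cohort(records):
    # records: (cohort, flagged) pairs for labelled *benign* items only,
    # so every flagged item is a false positive.
    fp, total = defaultdict(int), defaultdict(int)
    for cohort, flagged in records:
        total[cohort] += 1
        fp[cohort] += flagged
    return {c: fp[c] / total[c] for c in total}

# 1000 chat outputs with 2 FPs, 50 code outputs with 10 FPs (invented):
records = [("chat", i < 2) for i in range(1000)] + [("code", i < 10) for i in range(50)]
print(fpr_per_cohort(records))   # chat 0.002, code 0.200; global is only ~0.011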

Frequently Asked Questions

What is a false positive?

A false positive is a classifier prediction of "yes" when the truth is "no". It is a wrongly flagged negative and reduces precision. False positives create user friction — benign prompts blocked, clean outputs flagged.

How is a false positive different from a false negative?

A false positive flags a real negative as positive (says yes when it should say no). A false negative misses a real positive. Lowering one usually raises the other; the threshold sets the trade.

How do you measure false positives in an LLM safety stack?

Run the detector on a labelled benign set. Each flagged benign output is a false positive. FutureAGI's `RegressionEval` over a `Dataset` reports FPR per evaluator across releases.