What Is a Type 2 Error?

A statistical or classification error where a false null hypothesis is not rejected — equivalently, a false negative in a binary classifier.

A type 2 error, also called a false negative or beta error, is the failure to reject a false null hypothesis. In a binary classifier the null hypothesis is “this example is negative”, so a type 2 error is labelling a positive example as negative. For an LLM safety guardrail, every real prompt-injection attack that gets through is a type 2 error. Type 2 error rate is FN / (FN + TP), equivalent to one minus recall (also called sensitivity or true-positive rate). On safety-critical surfaces — PII leakage, harmful-content filtering, prompt-injection blocking — the type 2 error rate is the headline metric for security and compliance reviewers.
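The rate formula can be sanity-checked in a few lines of Python; the confusion counts here are illustrative, not from any real eval:

```python
# Confusion counts from a hypothetical guardrail eval (illustrative numbers).
tp = 900   # real attacks correctly flagged
fn = 100   # real attacks missed -> type 2 errors

type2_rate = fn / (fn + tp)   # FN / (FN + TP)
recall = tp / (tp + fn)       # sensitivity / true-positive rate

print(type2_rate)  # 0.1
print(recall)      # 0.9
```

The two numbers always sum to 1: driving recall up is the same operation as driving the type 2 error rate down.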

Why It Matters in Production LLM and Agent Systems

Type 2 errors are the source of every “model leaked something it shouldn’t have” incident. A safety classifier with 90% recall has a 10% type 2 error rate — meaning one in ten real attacks gets through. On a chatbot with 1% adversarial traffic across 100K daily requests, that is 100 successful attacks per day. Most go unnoticed until one becomes a public incident.
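The back-of-envelope arithmetic above, using the same numbers from the text:

```python
# Daily impact of a 10% type 2 error rate at the stated traffic levels.
daily_requests = 100_000
adversarial_fraction = 0.01   # 1% of traffic is adversarial
type2_rate = 0.10             # i.e. 90% recall

attacks_per_day = daily_requests * adversarial_fraction   # 1,000 real attacks
missed_per_day = round(attacks_per_day * type2_rate)      # attacks that get through
print(missed_per_day)  # 100
```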

The pain shows up unevenly across roles. The user rarely complains about type 2 errors — they get the answer they wanted. The security engineer finds out from a red-team test or a customer escalation. The compliance lead is asked, “what is your false-negative rate on PII leakage?” and either has the number from a recent eval or has nothing useful to say. The cost of a single high-profile type 2 error (a leaked secret, a harmful-advice response that gets screenshotted) is usually orders of magnitude higher than the cost of an equivalent type 1 error — which is why most safety classifiers err on the side of recall.

For 2026-era agent stacks, type 2 errors are particularly dangerous because the agent acts on the leaked content. An indirect-prompt-injection that a pre-guardrail misses can take over a tool-call chain — the type 2 error becomes a type 2 cascade. Per-attack-class type 2 measurement is essential; aggregate recall hides the catastrophic gap on indirect injection or rare attack vectors.

How FutureAGI Handles Type 2 Errors

FutureAGI’s approach is to make type 2 error rate a first-class metric on every classifier evaluator, decomposed by attack class. The platform’s standard evaluation report computes TP, FP, TN, FN per cohort so the type 1 / type 2 trade-off is explicit.

Concretely: a team runs `fi.evals.PromptInjection` against a 6,000-row dataset of 3,000 attacks split across direct injection, indirect injection, and multi-turn jailbreaks. The report shows an aggregate type 2 error rate of 6%, but the per-class breakdown reveals 1% on direct injection, 19% on indirect injection, and 26% on multi-turn jailbreaks. The team prioritises retraining on the two weak cohorts and uses regression-eval to confirm the per-class type 2 error dropped before shipping.
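The per-class decomposition is a straightforward fold over cohort counts. A minimal sketch — the cohort sizes and miss counts below are made-up, equal-size cohorts chosen for round numbers, not the exact figures from the example above:

```python
# Per-attack-class type 2 breakdown (illustrative counts, 1,000 attacks per cohort).
cohorts = {
    "direct_injection":     {"fn": 10,  "tp": 990},
    "indirect_injection":   {"fn": 190, "tp": 810},
    "multi_turn_jailbreak": {"fn": 260, "tp": 740},
}

total_fn = total_pos = 0
for name, c in cohorts.items():
    rate = c["fn"] / (c["fn"] + c["tp"])
    total_fn += c["fn"]
    total_pos += c["fn"] + c["tp"]
    print(f"{name}: type 2 rate = {rate:.2f}")

aggregate = total_fn / total_pos
print(f"aggregate: {aggregate:.3f}")  # a single number that masks the per-class spread
```

The aggregate looks like one moderate number while the worst cohort misses a quarter of its attacks — exactly the gap the per-class report is designed to expose.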

In production, the same evaluator runs as a pre-guardrail in the Agent Command Center. Continuous red-team replays over weekly traffic estimate the live type 2 rate and feed eval-fail-rate-by-cohort dashboards. Compared to point-in-time benchmarking, this surface keeps the type 2 measurement alive against adapting adversaries — which is the entire point of a guardrail.

How to Measure or Detect It

Type 2 error is computed against positive-labelled ground truth:

  • FN / (FN + TP): the canonical type 2 error rate.
  • Recall (TPR): 1 - type 2 error rate; the same number flipped.
  • Per-attack-class breakdown: aggregate type 2 hides class-specific failures.
  • Red-team replay: weekly adversarial traffic estimates live type 2 rate.
  • Incident-triggered audit: every escalated security incident is a known type 2 — close the loop.

Minimal Python:

```python
from fi.evals import PromptInjection

evaluator = PromptInjection()
result = evaluator.evaluate(
    input="(System prompt embedded inside retrieved doc, asks model to exfil keys.)",
)
# Real attack; if the evaluator fails to flag it → type 2 error
print(result.score, result.reason)
```

Aggregate over an attack-only cohort and divide FN count by total positives.
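That aggregation step can be sketched as follows. The `flagged` list is a hypothetical stand-in for per-row evaluator verdicts over an attack-only cohort — in practice each entry would come from an evaluator call as shown above:

```python
# Estimate type 2 rate over an attack-only cohort of known positives.
# True = the guardrail flagged the attack; False = it got through.
flagged = [True, True, False, True, False, True, True, True, True, True]

fn = sum(1 for f in flagged if not f)   # misses on known attacks
type2_rate = fn / len(flagged)          # every row is a positive, so FN + TP = N
print(type2_rate)  # 0.2
```

Because the cohort contains only positives, no confusion matrix is needed: the miss fraction is the type 2 error rate directly.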

Common Mistakes

  • Tracking only aggregate recall. A 0.94 global recall with 0.74 on indirect injection is unsafe.
  • Optimising for type 2 minimisation alone. Driving type 2 to zero usually inflates type 1 error past usability — tune both.
  • Treating type 2 errors found in red-teaming as edge cases. Red-team finds are real attacks; treat them as mainline test data.
  • Stopping type 2 measurement after launch. Adversaries adapt; the type 2 rate decays — re-evaluate weekly on fresh adversarial samples.
  • Confusing low type 2 error with safety. A safe-looking aggregate can hide a 0.30 type 2 rate on the most damaging attack class.

Frequently Asked Questions

What is a type 2 error?

A type 2 error is the failure to reject a false null hypothesis — equivalent to a false negative. In LLM safety, it is a real attack the guardrail misses.

How is a type 2 error different from a type 1 error?

Type 1 rejects a true null (false positive). Type 2 fails to reject a false null (false negative). Type 1 hurts user trust; type 2 hurts safety. Both must be tracked together.

How do I measure type 2 error rate?

FutureAGI's classifier evaluators report per-row predictions; type 2 error rate is `FN / (FN + TP)` — equivalent to 1 minus recall — and is reported per cohort in the evaluation report.