What Is PR AUC?
The area under the precision-recall curve, a threshold-free summary metric for binary classifiers that emphasises the positive class and stays informative under heavy class imbalance.
What Is PR AUC?
PR AUC is the area under the precision-recall curve. You sweep the classifier’s decision threshold from 0 to 1, plot precision against recall at each threshold, and integrate. The result is a single number between 0 and 1 — higher is better — that summarises performance across thresholds. PR AUC is most useful when the positive class is rare, because it ignores the (typically huge) true-negative pool that ROC AUC implicitly rewards. Hallucination detectors, prompt-injection classifiers, fraud flags, and PII detectors all live in this regime.
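A minimal sketch of that computation with scikit-learn, using toy labels and scores (both illustrative, not drawn from any real evaluator):
from sklearn.metrics import precision_recall_curve, auc

# Toy ground-truth labels and classifier scores — illustrative values only
labels = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.10, 0.40, 0.85, 0.30, 0.70, 0.20, 0.05, 0.80, 0.90, 0.15]

# One (precision, recall) point per candidate threshold
precision, recall, thresholds = precision_recall_curve(labels, scores)

# Integrate precision over recall to get the area under the PR curve
print(auc(recall, precision))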
Why It Matters in Production LLM and Agent Systems
LLM safety classifiers are class-imbalanced almost by definition: hallucinations, injections, PII leaks, and policy breaches are minority outcomes. ROC AUC on these tasks can read 0.95 while precision at the operating threshold is 0.40 — meaning more than half the alerts are noise.
The pain shows up across roles. A platform engineer ships a hallucination detector with ROC AUC 0.92, then watches the review queue fill with false positives because precision at threshold is 0.30. A trust-and-safety lead picks a jailbreak threshold using F1, then is surprised when production attack-pass-rate stays high — F1 was tuned on a balanced eval set, not the production-prevalence distribution. A compliance lead is asked, “what is the false-positive rate on the PII detector?” — the answer needs PR AUC plus an explicit operating-point precision, not a ROC curve.
For 2026 LLM stacks, every safety evaluator is implicitly a binary classifier — pass/fail, safe/unsafe, hallucinated/grounded. Treating evaluators as classifiers and tracking PR AUC against a labelled audit set is the discipline that separates a real safety stack from a vibe-based one.
How FutureAGI Uses PR AUC for Evaluators
FutureAGI does not ship a PR AUC evaluator class, because PR AUC is a property of an evaluator-plus-dataset pair, not of a single response. The platform exposes the inputs you need to compute it yourself.
Concretely: a team builds a hallucination guardrail using HallucinationScore. They label a 2,000-row audit dataset (5% positives — true hallucinations). For every row, the evaluator returns a continuous score and a binary verdict at the default threshold. Sweeping the threshold from 0 to 1 produces the precision-recall curve; PR AUC is 0.78. They compare against Faithfulness (PR AUC 0.71) and IsFactuallyConsistent (PR AUC 0.66). HallucinationScore is the best classifier on this distribution, so it becomes the production guardrail.
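A sketch of that comparison, assuming the per-row scores for each candidate evaluator have already been collected into lists alongside the ground-truth labels (all variable names here are hypothetical):
from sklearn.metrics import average_precision_score

# Hypothetical: one list of per-row scores per candidate evaluator, plus ground-truth labels
candidates = {
    "HallucinationScore": hallucination_scores,
    "Faithfulness": faithfulness_scores,
    "IsFactuallyConsistent": consistency_scores,
}
for name, scores in candidates.items():
    print(name, round(average_precision_score(labels, scores), 2))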
The team then picks the operating threshold where precision = 0.85 — the value the review-queue budget can absorb. That threshold becomes a metric-threshold config; the post-guardrail fires above it. PR AUC is recorded against the Dataset snapshot in the audit log and re-computed on every audit refresh. Drops in PR AUC page the team before precision-at-threshold degrades, giving early warning when the distribution shifts.
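A sketch of that threshold pick, assuming per-row labels and scores have already been collected from the audit set (as in the code example in the next section) and that the evaluator reaches the 0.85 precision budget somewhere on the curve:
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(labels, scores)

# thresholds has one fewer entry than precision/recall; drop the final (recall = 0) point
points = zip(thresholds, precision[:-1], recall[:-1])

# Keep thresholds that clear the precision budget, then take the lowest one to preserve recall
eligible = [(t, p, r) for t, p, r in points if p >= 0.85]
threshold, p_at_t, r_at_t = min(eligible, key=lambda x: x[0])
print(f"threshold={threshold:.3f} precision={p_at_t:.3f} recall={r_at_t:.3f}")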
How to Measure or Detect It
PR AUC depends on a labelled set and a continuous-score evaluator:
- Labelled audit dataset: the precision-recall curve only exists if you have ground-truth labels per row.
- Evaluator that returns a continuous score: HallucinationScore, Faithfulness, Groundedness, and EmbeddingSimilarity all return floats sweepable into a curve.
- Threshold sweep: compute precision and recall at 100+ thresholds; integrate with the trapezoidal rule.
- PR AUC over time: track it as a daily/weekly time series against the audit dataset; drops indicate evaluator drift, not just data drift (a tracking sketch follows the code example below).
- Operating-point precision: PR AUC is a summary; pair it with precision at the chosen threshold for the actually-deployed signal.
from sklearn.metrics import average_precision_score
from fi.evals import HallucinationScore

scorer = HallucinationScore()

# audit_set: the labelled audit dataset — each row carries a question, answer, context, and ground-truth label
scores = [scorer.evaluate(input=row.q, output=row.a, context=row.ctx).score
          for row in audit_set]
labels = [row.label for row in audit_set]

# average_precision_score is sklearn's step-wise estimate of PR AUC
print(average_precision_score(labels, scores))
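To turn that one-off number into the daily/weekly time series recommended above, a minimal tracking sketch (the history store and alert hook are hypothetical) might look like:
from datetime import date
from sklearn.metrics import average_precision_score

# pr_auc_history: a hypothetical list of (date, PR AUC) points kept per evaluator
pr_auc_today = average_precision_score(labels, scores)
pr_auc_history.append((date.today(), pr_auc_today))

# Compare against the trailing week and alert on a material drop (0.05 is an illustrative tolerance)
trailing = [v for _, v in pr_auc_history[-8:-1]]
if trailing and pr_auc_today < sum(trailing) / len(trailing) - 0.05:
    alert("PR AUC drift on hallucination guardrail")  # alert() is a hypothetical pager hook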
Common Mistakes
- Using ROC AUC on imbalanced tasks. It hides false-positive noise; switch to PR AUC for rare positives.
- Reporting PR AUC without operating-point precision. PR AUC summarises the curve; you ship a single threshold — quote both.
- Tuning the threshold on the eval set. Threshold tuning belongs on a held-out set or PR AUC-based search, otherwise you over-fit.
- Ignoring the prevalence floor. Random PR AUC equals the positive-class prevalence; an evaluator scoring 0.10 on a 10%-positive set is no better than chance (see the sketch after this list).
- Comparing PR AUC across datasets with different prevalence. The metric is not directly comparable; report alongside positive-class fraction.
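The prevalence-floor baseline is easy to verify: score a 10%-positive set with pure noise and the average precision lands near 0.10.
import random
from sklearn.metrics import average_precision_score

random.seed(0)
labels = [1] * 100 + [0] * 900                  # 10% positives
scores = [random.random() for _ in labels]      # a classifier with no signal at all
print(average_precision_score(labels, scores))  # ≈ 0.10 — the positive-class prevalence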
Frequently Asked Questions
What is PR AUC?
PR AUC is the area under the precision-recall curve, a threshold-free summary of a binary classifier's performance. It is preferred over ROC AUC when positives are rare, because it emphasises the positive class.
How is PR AUC different from ROC AUC?
ROC AUC plots true-positive rate vs false-positive rate; PR AUC plots precision vs recall. On imbalanced datasets, ROC AUC can look high while precision stays poor — PR AUC will not be fooled.
When should I use PR AUC for LLM evaluators?
Use PR AUC whenever the positive class is rare — hallucination flags, jailbreak detection, prompt-injection classifiers, fraud or PII detection. Track it alongside the FutureAGI evaluator's threshold-versus-precision curve.