What Is Precision-Recall Area Under the Curve (PR AUC)?
A threshold-free summary metric for binary classifiers obtained by integrating precision over recall across all decision thresholds.
What Is Precision-Recall Area Under the Curve (PR AUC)?
PR AUC is the area under a binary classifier’s precision-recall curve. Sweep the decision threshold, plot precision (TP / (TP + FP)) against recall (TP / (TP + FN)), and integrate. The resulting number sits between the positive-class prevalence (random baseline) and 1 (perfect). PR AUC ignores true negatives entirely, which is why it is preferred over ROC AUC for class-imbalanced problems — fraud, hallucination, jailbreak, PII detection. When integrated via the rectangular Riemann sum, PR AUC equals average precision; the values are functionally interchangeable for ranking comparisons.
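A minimal sketch of the two integration conventions in scikit-learn, using toy labels and scores (illustrative values only, not output from a real evaluator):

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy labels and classifier scores (illustrative values only).
labels = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.20, 0.05, 0.30, 0.90, 0.50])

# Precision and recall at every threshold induced by the scores.
precision, recall, thresholds = precision_recall_curve(labels, scores)

# Rectangular (step-wise) integration: this is average precision.
ap = average_precision_score(labels, scores)

# Trapezoidal integration of the same curve gives a slightly different number.
pr_auc_trap = auc(recall, precision)

print(f"average precision (rectangular): {ap:.3f}")
print(f"trapezoidal PR AUC:              {pr_auc_trap:.3f}")
```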
Why It Matters in Production LLM and Agent Systems
LLM safety classifiers are imbalanced almost by construction: hallucinations, injections, PII leaks, and policy breaches are minority outcomes. Reporting ROC AUC on these tasks creates the illusion of a strong classifier — the metric reads 0.93 while precision at the operating threshold is 0.35. PR AUC tells you the truth: if the curve sags toward the prevalence baseline, the classifier is not actually good at the positive class.
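To see the gap concretely, here is a sketch with synthetic scores on a roughly 2%-positive task; exact numbers depend on the seed, but ROC AUC typically lands near 0.9 while average precision sits far lower:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Simulate a rare-positive task (~2% positives) where the classifier separates
# the classes only partially: overlapping score distributions.
n_neg, n_pos = 9_800, 200
labels = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
scores = np.concatenate([
    rng.normal(0.30, 0.15, n_neg),  # negatives
    rng.normal(0.60, 0.15, n_pos),  # positives
])

# Expect ROC AUC around 0.9 and average precision well below it (exact values
# vary with the seed); the ROC number flatters the rare-positive classifier.
print("ROC AUC:", round(roc_auc_score(labels, scores), 3))
print("PR AUC :", round(average_precision_score(labels, scores), 3))
```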
The pain shows up across roles. A platform engineer ships a hallucination guardrail with ROC AUC 0.94 and watches it block legitimate responses in production because precision at the operating threshold is 0.40. A trust-and-safety lead picks the “best” injection detector by ROC AUC and discovers the chosen model is worse on PR AUC than the runner-up. A compliance lead is asked, “how does your PII detector compare to a competitor’s?” — without PR AUC, the comparison is unfair on imbalanced production data.
For 2026 LLM stacks, PR AUC is the safety-evaluator quality metric. It captures classifier quality across thresholds: you ship one threshold, but you measure the whole curve. PR AUC is also more sensitive to evaluator drift than ROC AUC, which means monitoring built on it pages earlier when something changes upstream.
How FutureAGI Tracks PR AUC for Evaluators
FutureAGI’s approach is to make PR AUC a property you can compute for every evaluator-plus-Dataset pair. The platform persists every evaluator’s continuous score against each dataset row; the labelled audit set provides ground truth; the user computes PR AUC as a release-gate metric.
Concretely: a fintech LLM team builds a 1,800-row audit set for prompt-injection detection (3% positives — closer to production prevalence than typical balanced eval sets). They compare three evaluators: PromptInjection (PR AUC 0.81), ProtectFlash (PR AUC 0.74, but 30× lower latency), and IsCompliant against an injection clause (PR AUC 0.62). They deploy ProtectFlash as the pre-guardrail because the latency budget matters; PromptInjection runs as a heavier-weight check on borderline cases above a calibrated threshold. PR AUC for both is a release-gate metric: a drop below 0.70 blocks the deploy.
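A sketch of what that release gate could look like in CI; the `pr_auc_gate` helper and the placeholder data are hypothetical, and the 0.70 floor mirrors the example above:

```python
import sys
from sklearn.metrics import average_precision_score

PR_AUC_FLOOR = 0.70  # release-gate threshold from the example above

def pr_auc_gate(labels, scores, floor=PR_AUC_FLOOR):
    """Return True when the evaluator's PR AUC on the audit set clears the floor."""
    pr_auc = average_precision_score(labels, scores)
    print(f"PR AUC on audit set: {pr_auc:.3f} (floor {floor:.2f})")
    return pr_auc >= floor

# Placeholder data; in CI, labels and scores come from scoring the labelled
# audit set with the candidate evaluator (see the measurement snippet below).
labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.7, 0.1, 0.3, 0.2, 0.8, 0.5, 0.1]

if not pr_auc_gate(labels, scores):
    sys.exit(1)  # block the deploy
```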
Because PR AUC is more sensitive to drift than ROC AUC, the FutureAGI dashboard pages on PR AUC drift before precision-at-threshold drift. That gives the team a head start on localising whether the regression is data drift, prompt drift, or model drift.
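The comparison behind that paging reduces to a simple rule. A rough sketch with a hypothetical trailing-baseline check, not FutureAGI's actual alerting logic:

```python
def pr_auc_drift(history, current, tolerance=0.05):
    """Flag drift when current PR AUC falls more than `tolerance` below the trailing mean."""
    baseline = sum(history) / len(history)
    return current < baseline - tolerance, baseline

# Hypothetical per-release PR AUC values for one evaluator on the same audit set.
history = [0.81, 0.80, 0.82, 0.79]
drifted, baseline = pr_auc_drift(history, current=0.72)
print(drifted, round(baseline, 3))  # True: page before precision at the threshold moves
```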
How to Measure or Detect It
PR AUC needs labelled data plus continuous scores:
- Labelled audit dataset: 500–2,000 rows minimum; positive prevalence should match production.
- Evaluator returning continuous scores: `HallucinationScore`, `Faithfulness`, `Groundedness`, `PromptInjection`, `PII`.
- Integration method: rectangular sum (`average_precision_score` in scikit-learn) is the standard; trapezoidal rule (`auc` over `precision_recall_curve`) is the alternative.
- Prevalence-floor reporting: PR AUC at the positive-class prevalence is the random baseline; quote PR AUC alongside the positive-class fraction.
- PR AUC time series: track per evaluator across releases; drops indicate evaluator or input drift.
```python
from sklearn.metrics import average_precision_score
from fi.evals import PromptInjection

detector = PromptInjection()

# Score every row of the labelled audit set with the evaluator's continuous output.
scores = [detector.evaluate(input=r.q, context=r.ctx).score for r in audit_set]
labels = [r.label for r in audit_set]

# Rectangular-sum PR AUC (average precision) is the release-gate number.
print(average_precision_score(labels, scores))
```
Common Mistakes
- Comparing PR AUC across datasets with different prevalence. The metric is not directly comparable; report alongside positive-class fraction.
- Using ROC AUC on rare-positive tasks. It hides false-positive noise; PR AUC reveals it.
- Reporting PR AUC without an operating point. PR AUC summarises the curve; the deployed system uses one threshold — quote precision at that point too.
- Tuning thresholds on the same set used for PR AUC. Use a held-out test set to avoid optimistic estimates.
- Ignoring trapezoidal vs rectangular sum mismatch. Different libraries implement different integration; pick one and stick with it.
- Treating PR AUC drops as model regressions only. Most PR AUC drops trace to upstream input drift, not the evaluator; correlate with PSI on input distributions before retraining.
- Skipping cohort-level PR AUC. Aggregate PR AUC of 0.85 routinely hides a cohort at 0.50; segment by language, intent, and retrieval source for actionable signal (see the sketch after this list).
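A minimal sketch of that cohort-level check, assuming each audit row carries a cohort tag such as language or retrieval source (the field names here are hypothetical):

```python
from collections import defaultdict
from sklearn.metrics import average_precision_score

def pr_auc_by_cohort(rows):
    """Compute PR AUC per cohort; rows are (cohort, label, score) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for cohort, label, score in rows:
        grouped[cohort][0].append(label)
        grouped[cohort][1].append(score)
    return {
        cohort: average_precision_score(labels, scores)
        for cohort, (labels, scores) in grouped.items()
        if any(labels)  # PR AUC is undefined for cohorts with no positives
    }

# Illustrative rows: (cohort, ground-truth label, evaluator score).
rows = [
    ("en", 1, 0.92), ("en", 0, 0.15), ("en", 1, 0.80), ("en", 0, 0.40),
    ("de", 1, 0.35), ("de", 0, 0.55), ("de", 1, 0.30), ("de", 0, 0.20),
]
print(pr_auc_by_cohort(rows))  # a weak cohort stands out even when the aggregate looks fine
```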
Frequently Asked Questions
What is precision-recall area under the curve?
PR AUC is the area under a binary classifier's precision-recall curve, summarising performance across all decision thresholds in a single number that ranges from the positive-class prevalence to 1.
Is PR AUC the same as average precision?
When computed via the rectangular Riemann-sum approximation across recall steps, PR AUC equals average precision. Trapezoidal-rule integration of the same curve gives a slightly different number; both are commonly called PR AUC.
When should I use PR AUC instead of ROC AUC?
Use PR AUC whenever the positive class is rare. ROC AUC implicitly rewards true negatives, which dominate imbalanced sets and inflate the metric; PR AUC is sensitive to false positives in a way that matches the production cost.