What Is a Precision-Recall (PR) Curve?
A diagnostic plot of a binary classifier's precision against recall across all decision thresholds, used to pick an operating point and to summarise performance via PR AUC.
A precision-recall curve is a diagnostic plot for a binary classifier. Precision (TP / (TP + FP)) is on the y-axis, recall (TP / (TP + FN)) is on the x-axis, and each point on the curve corresponds to a different decision threshold. Sweep the threshold from low to high: recall starts at 1 and falls, while precision starts near the positive-class prevalence and rises. The shape of the curve describes the classifier's behaviour; the area under it (PR AUC) summarises it in one number. For class-imbalanced LLM safety tasks, the PR curve is the right diagnostic: ROC curves flatter the same classifiers because the huge true-negative pool keeps the false-positive rate low.
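A minimal sketch of that sweep, using scikit-learn's precision_recall_curve on a small synthetic set of scores and labels (the numbers are illustrative, not from any particular evaluator):

import numpy as np
from sklearn.metrics import precision_recall_curve

labels = np.array([0, 0, 1, 0, 1, 1, 0, 1])                           # ground truth
scores = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])   # classifier scores

precision, recall, thresholds = precision_recall_curve(labels, scores)

# The first point has recall 1; as the threshold rises, recall falls and
# precision generally rises. Each printed row is one operating point.
for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")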
Why It Matters in Production LLM and Agent Systems
The PR curve makes threshold choice explicit. Without it, teams pick the model's default threshold or the F1-maximising threshold and discover, in production, that neither matches the real cost of false alarms versus missed detections.
The pain shows up across roles. A platform engineer ships a hallucination filter at the F1-maximising threshold; the production review-queue budget cannot absorb the resulting fire rate, so the threshold is quietly pushed up to 0.95 and the miss rate explodes. A trust-and-safety lead reports "the model is 90% accurate" without showing the PR curve, when the precision-recall pair at the deployed threshold is 0.45 / 0.40. A compliance lead chooses a PII detector by aggregate accuracy, even though on the PR curve a different evaluator dominates at every operating point worth shipping.
For 2026 agent stacks, PR curves are the way to compare safety evaluators objectively. Two competing detectors may have similar PR AUC but different curve shapes — one strong at high recall, the other strong at high precision. Visualising the curves before picking is what separates an informed choice from a benchmark-leaderboard choice.
How FutureAGI Builds PR Curves Per Evaluator
FutureAGI’s approach is to make every evaluator’s continuous score available so the PR curve is reconstructable on any labelled Dataset. Evaluators like HallucinationScore, PromptInjection, Faithfulness, PII, ContentSafety return a 0–1 score per response; the audit dataset stores ground-truth labels; sweeping the threshold and plotting yields the curve.
Concretely: a healthcare assistant team labels a 2,500-row audit set for grounded vs hallucinated answers. They run two evaluators — HallucinationScore and Faithfulness — and plot both PR curves. HallucinationScore dominates at the high-precision end (precision 0.92 at recall 0.50) — the operating zone the team needs given their review-queue budget. Faithfulness dominates at higher recall but lower precision. The team picks HallucinationScore, sets the threshold at the 0.92-precision point, and persists the choice as a metric-threshold config tied to the audit dataset hash.
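A sketch of how that operating point could be located programmatically, assuming the evaluator scores and ground-truth labels are available as arrays; the synthetic data, the 0.92 target, and the variable names are illustrative, not a FutureAGI API:

import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 2500)                                       # ground truth
scores = np.clip(0.55 * labels + rng.normal(0.25, 0.18, 2500), 0, 1)    # evaluator scores

precision, recall, thresholds = precision_recall_curve(labels, scores)

target_precision = 0.92                          # the review-budget constraint
meets = np.where(precision[:-1] >= target_precision)[0]
best = meets[np.argmax(recall[:-1][meets])]      # highest recall that still meets the target
print(f"threshold={thresholds[best]:.3f}  "
      f"precision={precision[best]:.2f}  recall={recall[best]:.2f}")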
Curves are recomputed weekly against the same dataset. Curve-shape changes (not just PR AUC drops) signal evaluator drift earlier than aggregate metrics. The Agent Command Center routing policy uses the chosen threshold to fire the post-guardrail.
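One way such a curve-shape check might look in code: compare precision at a fixed grid of recall points between two weekly runs and alert on a drop. The recall grid, the 0.05 margin, and the synthetic week-over-week scores are all illustrative assumptions:

import numpy as np
from sklearn.metrics import precision_recall_curve

def precision_at_recall(labels, scores, recall_grid):
    # Interpolate precision at fixed recall points; recall from
    # precision_recall_curve is decreasing, so reverse both arrays for np.interp.
    precision, recall, _ = precision_recall_curve(labels, scores)
    return np.interp(recall_grid, recall[::-1], precision[::-1])

# Synthetic stand-ins for last week's and this week's evaluator scores.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 2000)
last_week = np.clip(0.50 * labels + rng.normal(0.30, 0.20, 2000), 0, 1)
this_week = np.clip(0.40 * labels + rng.normal(0.30, 0.25, 2000), 0, 1)   # drifted

recall_grid = np.linspace(0.1, 0.9, 9)
drift = (precision_at_recall(labels, last_week, recall_grid)
         - precision_at_recall(labels, this_week, recall_grid))

# A precision drop at any fixed recall point flags drift even if PR AUC looks stable.
if np.any(drift > 0.05):
    print("evaluator drift:", dict(zip(recall_grid.round(1), drift.round(3))))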
How to Measure or Detect It
A PR curve is a sweep, not a single number:
- Labelled audit set: ground-truth labels per row; the curve is meaningless without them.
- Continuous-score evaluator: HallucinationScore, PromptInjection, PII all return floats; sweep the threshold from min to max.
- Curve-comparison plotting: overlay the curves for competing evaluators; pick the one that dominates in the operating zone you care about, not by PR AUC alone (see the overlay sketch after the snippet below).
- Cohort curves: plot a separate PR curve per language, per intent, per model version; cohort failures hide in the aggregate.
- Curve-shape regression: track curve change over time; shifts at fixed recall are the canonical evaluator-drift alarm.
from sklearn.metrics import precision_recall_curve
from fi.evals import HallucinationScore

scorer = HallucinationScore()

# Score every row of the labelled audit set with the evaluator's continuous output
scores = [scorer.evaluate(input=r.q, output=r.a, context=r.ctx).score
          for r in audit_set]
labels = [r.label for r in audit_set]  # ground-truth labels per row

# Sweep all thresholds in one call; each (precision, recall) pair is one threshold
precision, recall, thresholds = precision_recall_curve(labels, scores)
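To compare evaluators or cohorts, overlay the resulting curves. A sketch continuing from the snippet above, assuming the second evaluator's scores have been collected the same way into faithfulness_scores (an illustrative name, not an existing variable); grouping the rows by language or intent before the same loop yields the per-cohort curves:

import matplotlib.pyplot as plt
from sklearn.metrics import auc

# faithfulness_scores: the second evaluator's scores on the same audit rows (assumed)
for name, s in [("HallucinationScore", scores), ("Faithfulness", faithfulness_scores)]:
    p, r, _ = precision_recall_curve(labels, s)
    plt.plot(r, p, label=f"{name} (PR AUC={auc(r, p):.3f})")   # auc accepts decreasing recall

plt.xlabel("recall")
plt.ylabel("precision")
plt.legend()
plt.title("PR curves on the labelled audit set")
plt.show()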
Common Mistakes
- Reporting PR AUC without showing the curve. The shape matters: two evaluators can share PR AUC and have wildly different operating-point performance.
- Picking the F1-max threshold by default. F1-max is rarely the right business operating point; tie the threshold to your real cost (see the cost sketch after this list).
- Computing on the train set. Train-set curves are memorisation; always use a held-out audit set.
- No cohort curves. Aggregate curves hide cohort failures; segment by language, intent, model.
- Treating the curve as static. Production prevalence and distribution shift over time; refresh the audit set quarterly.
- Picking a curve by area alone. Two evaluators with the same PR AUC can be unequal at the operating point you ship; visualise both curves and choose by overlap in the relevant zone.
- Ignoring curve smoothness. A jagged curve usually means the audit set is too small or the score has too few unique values; collect more rows or check for ties.
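Continuing from the audit-set snippet above (labels and scores per row), a sketch of tying the threshold to explicit costs rather than to F1; the cost values are illustrative assumptions:

import numpy as np
from sklearn.metrics import precision_recall_curve

cost_fp = 1.0    # assumed cost of sending a clean response to human review
cost_fn = 20.0   # assumed cost of a missed detection

scores_arr = np.asarray(scores)
labels_arr = np.asarray(labels)
precision, recall, thresholds = precision_recall_curve(labels_arr, scores_arr)

# Expected cost at each candidate threshold (predict positive when score >= t).
costs = [cost_fp * np.sum((scores_arr >= t) & (labels_arr == 0)) +
         cost_fn * np.sum((scores_arr < t) & (labels_arr == 1))
         for t in thresholds]
best = int(np.argmin(costs))
print(f"threshold={thresholds[best]:.3f}  "
      f"precision={precision[best]:.2f}  recall={recall[best]:.2f}")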
Frequently Asked Questions
What is a precision-recall curve?
A PR curve plots precision (y-axis) against recall (x-axis) for a binary classifier across all decision thresholds. Each point on the curve is one threshold.
How is a PR curve different from an ROC curve?
An ROC curve plots true-positive rate against false-positive rate; a PR curve plots precision against recall. PR curves are more informative on class-imbalanced tasks because they ignore the (usually huge) true-negative pool.
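A small illustration of that gap on a synthetic task with roughly 1% positives (all numbers are illustrative):

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
labels = (rng.random(100_000) < 0.01).astype(int)                        # ~1% positives
scores = np.clip(0.35 * labels + rng.normal(0.30, 0.15, 100_000), 0, 1)  # decent scorer

print("ROC AUC:", round(roc_auc_score(labels, scores), 3))            # high: the true-negative pool keeps FPR low
print("PR AUC :", round(average_precision_score(labels, scores), 3))  # far lower: precision exposes the false-positive load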
How do I pick the operating point on a PR curve?
Pick the threshold whose precision-recall pair matches your business cost — review-queue budget, blocked-request tolerance, missed-detection tolerance — not the F1-maximising point by default.