Evaluation

What Is Precision-Recall?

Precision-recall is the joint analysis of two classification metrics that together describe a binary classifier’s positive-class behaviour. Precision is the share of predicted positives that are correct (TP / (TP + FP)); recall is the share of actual positives the classifier caught (TP / (TP + FN)). They trade off as the decision threshold moves: lower threshold raises recall and lowers precision; higher threshold does the reverse. Reporting one without the other hides half the story. For class-imbalanced LLM tasks — hallucination, prompt injection, PII detection — precision-recall is the canonical frame.
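
A minimal sketch of the trade-off, using made-up scores and labels (every value here is illustrative):

import numpy as np

# Illustrative evaluator scores and ground-truth labels (1 = positive class).
scores = np.array([0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])

def precision_recall_at(threshold):
    preds = scores >= threshold
    tp = int(np.sum(preds & (labels == 1)))
    fp = int(np.sum(preds & (labels == 0)))
    fn = int(np.sum(~preds & (labels == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(precision_recall_at(0.70))  # strict threshold: precision 1.00, recall 0.50
print(precision_recall_at(0.25))  # loose threshold: precision 0.67, recall 1.00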

Why It Matters in Production LLM and Agent Systems

Most LLM safety and quality classifiers operate on highly imbalanced data: the positive class (a hallucination, an injection, a PII leak) is rare. Accuracy on these tasks is misleading — predicting “negative” for everything yields 99% accuracy and zero useful behaviour. Precision-recall tells you what is actually happening on the positive class.
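
A quick arithmetic illustration with made-up counts (1,000 rows, 10 true positives):

# A classifier that predicts "negative" for every one of 1,000 rows (10 true positives).
tp, fp, fn, tn = 0, 0, 10, 990

accuracy = (tp + tn) / (tp + fp + fn + tn)        # 0.99 -- looks excellent
recall = tp / (tp + fn)                           # 0.00 -- catches no positives
precision = tp / (tp + fp) if (tp + fp) else 0.0  # no positive predictions at all
print(accuracy, recall, precision)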

The pain shows up across roles. A trust-and-safety lead deploys a hallucination filter and reports “95% accuracy” to leadership; the precision is 0.30 and the recall is 0.45, so more than half of real hallucinations slip through and most alerts are noise. A platform engineer tunes a prompt-injection detector by F1; production prevalence is far lower than in the eval set, so precision craters. A compliance lead tracks PII-redaction “performance” with one number and cannot answer whether legitimate names are being wrongly redacted.

For 2026 agent stacks, every evaluator-as-classifier needs both metrics tracked: trajectory-step ActionSafety, response-level Groundedness, output-level PII. The precision-recall curve, plus a deliberately chosen operating point, is the difference between a safety stack that works and one that only looks like it does.

How FutureAGI Operationalises Precision-Recall

FutureAGI’s approach is to make every evaluator’s continuous score visible so precision-recall analysis is a property of evaluator-plus-dataset, not a one-off notebook. Evaluators like HallucinationScore, PromptInjection, PII, ContentSafety, and Faithfulness return a 0–1 score per response; an audit Dataset carries the ground-truth label per row; sweeping the threshold produces the precision-recall curve.

Concretely: a healthcare assistant team labels a 2,800-row audit set for grounded vs hallucinated answers (8% positives). They run HallucinationScore and Faithfulness on every row, sweep the threshold, plot precision-recall curves for each. At the team’s review-queue budget (1% fire-rate), HallucinationScore operates at precision 0.86 / recall 0.62, while Faithfulness hits 0.79 / 0.55. They pick HallucinationScore, set the post-guardrail threshold, and persist the choice as a metric-threshold config tied to the audit dataset hash. The deploy gate is: precision-recall on every release must not regress below the prior release on the same audit set.
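
A rough sketch of persisting that choice and gating the release; the JSON layout, file name, and helper are illustrative, not a FutureAGI API:

import json

# Illustrative metric-threshold config: the operating point chosen at the
# 1% fire-rate budget, tied to the hash of the audit dataset it came from.
config = {
    "evaluator": "HallucinationScore",
    "audit_dataset_hash": "<sha256 of the versioned audit set>",
    "threshold": 0.72,       # illustrative value
    "precision": 0.86,
    "recall": 0.62,
}
with open("pr_operating_point.json", "w") as f:
    json.dump(config, f, indent=2)

# Deploy gate: the candidate release must not regress below the prior release
# on the same audit set.
def passes_gate(candidate_precision, candidate_recall, prior=config):
    return (candidate_precision >= prior["precision"]
            and candidate_recall >= prior["recall"])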

Cohort segmentation is the key unlock — a global precision of 0.85 can hide a non-English cohort at 0.50.
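
A minimal sketch of the cohort breakdown, assuming each audit row carries a language tag, an evaluator score, and a ground-truth label (attribute names are illustrative):

from sklearn.metrics import precision_score, recall_score

THRESHOLD = 0.72  # illustrative operating point
cohorts = {}
for row in audit_set:  # audit_set: labelled rows with .language, .score, .label
    cohorts.setdefault(row.language, []).append(row)

for language, rows in cohorts.items():
    preds = [int(r.score >= THRESHOLD) for r in rows]
    gold = [r.label for r in rows]
    print(language,
          precision_score(gold, preds, zero_division=0),
          recall_score(gold, preds, zero_division=0))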

How to Measure or Detect It

Precision-recall analysis needs five things:

  • Labelled audit set: ground-truth labels per row; build it once, version it, refresh quarterly.
  • Evaluator with a continuous score: HallucinationScore, Faithfulness, PromptInjection, PII all return floats sweepable into a curve.
  • Threshold sweep: sweep 100+ thresholds; integrate the resulting curve into PR AUC (average precision) for a single-number summary.
  • Operating-point quotation: pick a threshold tied to the review-queue or fire-rate budget, and quote both precision and recall at that point.
  • Cohort segmentation: precision-recall by language, intent, model version, retrieval source.

A minimal sweep, assuming a labelled audit_set of rows (question, answer, context, label):

from sklearn.metrics import precision_recall_curve
from fi.evals import HallucinationScore

scorer = HallucinationScore()
# One continuous evaluator score per audit row.
scores = [scorer.evaluate(input=r.q, output=r.a, context=r.ctx).score
          for r in audit_set]
# Ground-truth labels for the positive class (e.g. 1 = hallucinated).
labels = [r.label for r in audit_set]
# Sweep every distinct score as a decision threshold.
precision, recall, thresholds = precision_recall_curve(labels, scores)
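
From the same scores and labels you can add the single-number summary and pick the operating point; the 1% fire-rate budget below is illustrative:

import numpy as np
from sklearn.metrics import average_precision_score

# PR AUC (average precision) as the single-number summary of the curve.
pr_auc = average_precision_score(labels, scores)

# Choose the threshold from the fire-rate budget (flag at most 1% of traffic),
# then quote precision AND recall at that operating point.
FIRE_RATE_BUDGET = 0.01
operating_threshold = float(np.quantile(scores, 1 - FIRE_RATE_BUDGET))
idx = np.searchsorted(thresholds, operating_threshold)  # nearest swept threshold
print(pr_auc, operating_threshold, precision[idx], recall[idx])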

Common Mistakes

  • Picking a threshold by F1 only. F1 is one trade-off; pick the threshold by your real business cost (review-queue budget, blocked-request tolerance).
  • Reporting one metric in isolation. Precision without recall is half a story; quote both at the operating point.
  • Using the eval-set prevalence as production prevalence. Production positives are usually rarer; precision drops, and you find out late.
  • Computing on the train set. Train-set precision-recall is memorisation; always use held-out audit data.
  • No cohort breakdown. Aggregate scores hide where the classifier fails; segment by every meaningful axis.
  • Comparing precision-recall across datasets with different prevalence. Precision shifts with the positive-class fraction; never compare a 5%-positive eval set against a 50%-positive one (see the sketch after this list).
  • Treating the operating point as set-and-forget. Threshold drift over time is real; recompute the curve and revisit the operating point after every model swap or major prompt change.
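
To see why prevalence matters, here is a small illustration with made-up rates: the same classifier behaviour (fixed true-positive and false-positive rates) lands at very different precision as the positive-class fraction changes.

# Same classifier behaviour (TPR = 0.80, FPR = 0.05), different prevalence.
def precision_at_prevalence(prevalence, tpr=0.80, fpr=0.05):
    tp = prevalence * tpr
    fp = (1 - prevalence) * fpr
    return tp / (tp + fp)

print(precision_at_prevalence(0.50))  # ~0.94 on a 50%-positive eval set
print(precision_at_prevalence(0.05))  # ~0.46 at 5% positives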

Frequently Asked Questions

What is precision-recall?

Precision-recall is the joint analysis of two classification metrics — precision (positive-prediction trustworthiness) and recall (positive-class completeness) — that together describe a binary classifier's positive-class behaviour.

How does precision-recall trade off?

Lowering the decision threshold catches more positives (higher recall) but lets in more false alarms (lower precision). The full trade-off is captured by the precision-recall curve; one operating point is what you actually deploy.

When should I use precision-recall over accuracy?

Use precision-recall whenever the positive class is rare — fraud, hallucination, jailbreak, PII detection. Accuracy on imbalanced sets is dominated by true negatives and hides positive-class failures.