What Is Precision in Machine Learning?
A classification metric that measures the share of predicted positive cases that are truly positive, used to quantify how well a system controls false positives.
What Is Precision in Machine Learning?
Precision is a classification metric: the share of predicted positives that are actually correct. Formally, TP / (TP + FP). A precision of 0.90 means 90% of the items the model flagged as positive really were positive; the remaining 10% are false alarms. It is the right metric when false positives are expensive — flagged emails that are not spam, blocked queries that were safe, hallucination alerts that were grounded. Precision alone does not tell you what the classifier missed; pair it with recall and F1 to see the whole picture.
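As a minimal sketch of the arithmetic, with made-up counts rather than real model output:
tp, fp, fn = 90, 10, 25                   # hypothetical confusion counts from a labelled audit set
precision = tp / (tp + fp)                # 90 / 100 = 0.90: share of flagged items that were truly positive
recall = tp / (tp + fn)                   # 90 / 115 ≈ 0.78: share of actual positives the model caught
print(f"precision={precision:.2f}, recall={recall:.2f}")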
Why It Matters in Production LLM and Agent Systems
In LLM systems, every guardrail, evaluator, and content filter is implicitly a binary classifier — and precision is what determines whether the alert stream is signal or noise. Low precision floods review queues, erodes trust, and trains operators to ignore alerts. High precision with low recall hides failures.
The pain shows up across roles. A trust-and-safety lead deploys a hallucination flagger that fires on 8% of traffic. Precision is 0.30 — 70% are false positives. Reviewers stop checking the queue within a week. A platform engineer ships a prompt-injection detector with 99% recall and 40% precision; the post-guardrail blocks legitimate queries five times an hour. A compliance lead is asked, “how often does the PII filter wrongly redact customer names?” — the answer is “we have not measured precision”.
For 2026 LLM stacks, precision matters most for safety-critical evaluators where rare positives meet expensive false alarms. Hallucination scoring on a long-tail RAG system, prompt-injection detection on a customer-support gateway, PII scanning on outbound emails — all of these need precision tracked against a labelled audit set, with a threshold tuned to the review-queue budget.
How FutureAGI Tracks Precision Per Evaluator
FutureAGI’s approach is to expose every evaluator’s continuous score so you can compute precision at any operating threshold against a labelled audit dataset. Evaluators like HallucinationScore, PromptInjection, PII, ContentSafety, and Faithfulness return a 0–1 score; the audit dataset stores the ground-truth label per row; the platform persists evaluator results against the dataset so precision is reproducible.
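As a sketch of that computation, assuming evaluator scores and ground-truth labels have already been exported from the audit dataset into two parallel arrays (the values below are illustrative, not real platform output):
import numpy as np
from sklearn.metrics import precision_score, recall_score

scores = np.array([0.12, 0.91, 0.66, 0.55, 0.83, 0.30, 0.60])   # continuous evaluator scores (0-1)
labels = np.array([0, 1, 1, 0, 1, 0, 0])                        # ground truth from the audit dataset

def operating_point(threshold):
    # Binarise the continuous score at the chosen threshold, then score against ground truth.
    preds = (scores >= threshold).astype(int)
    return (precision_score(labels, preds, zero_division=0),
            recall_score(labels, preds, zero_division=0))

for t in (0.5, 0.72):
    p, r = operating_point(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")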
Concretely: a customer-support team runs HallucinationScore on a 1,500-row audit set with 12% positives. At threshold 0.5, precision is 0.62, recall is 0.88. Review-queue budget allows 0.5% of traffic to be flagged. The team picks a stricter threshold (0.72) where precision is 0.85 but recall drops to 0.61. They accept the trade-off: missed hallucinations are caught downstream by a sampled human review, but false-positive review cost would have broken the team’s bandwidth.
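A sketch of that kind of threshold search, assuming the evaluator's scores and ground-truth labels for the audit set are available as arrays; the synthetic data below only stands in for a real 1,500-row audit set, and the numbers it prints will not match the ones above:
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
labels = (rng.random(1500) < 0.12).astype(int)                    # ~12% positives, as in the audit set
scores = np.clip(rng.normal(0.35 + 0.35 * labels, 0.15), 0, 1)    # synthetic evaluator scores

precisions, recalls, thresholds = precision_recall_curve(labels, scores)

# Loosest threshold that meets the precision target, and the recall it costs.
target = 0.85
meets = precisions[:-1] >= target            # precisions[i] corresponds to thresholds[i]
if meets.any():
    i = int(np.argmax(meets))
    print(f"threshold={thresholds[i]:.2f} precision={precisions[i]:.2f} recall={recalls[i]:.2f}")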
Precision is tracked as a daily time series alongside recall, F1, PR AUC, and operating-point fire-rate. A drop in precision pages before a drop in F1, because precision degradation usually indicates an upstream distribution shift the team needs to localise.
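A minimal sketch of the daily series, assuming flagged production items come back from human review with a day stamp and a confirmed label (the column names are illustrative):
import pandas as pd

reviews = pd.DataFrame({
    "day":   ["2026-01-05", "2026-01-05", "2026-01-06", "2026-01-06", "2026-01-06"],
    "label": [1, 0, 1, 1, 0],            # 1 = reviewer confirmed the flag was a true positive
})

# Every reviewed row was a predicted positive, so daily precision is simply the
# share of confirmed flags per day.
print(reviews.groupby("day")["label"].mean())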
How to Measure or Detect It
Precision is straightforward to compute but takes discipline to interpret:
- Labelled audit dataset: precision requires ground truth; budget for labelling 500–2,000 rows per safety evaluator.
- Operating-point precision: pick the threshold the post-guardrail will use; quote precision at that point, not at the default.
- PR AUC: the threshold-free summary; tracks whether the evaluator itself improved or regressed.
- Cohort precision: segment by language, intent, model version; aggregate hides cohort failures.
- Daily precision time series: precision drift indicates upstream data shift even when the evaluator is unchanged.
from sklearn.metrics import precision_score
from fi.evals import HallucinationScore
# audit_set: held-out labelled audit rows, each carrying the query (q), the model
# answer (a), the retrieved context (ctx), and a ground-truth label.
scorer = HallucinationScore()
scores = [scorer.evaluate(input=r.q, output=r.a, context=r.ctx).score for r in audit_set]
preds = [int(s > 0.7) for s in scores]    # binarise at the chosen operating threshold
labels = [r.label for r in audit_set]
print(precision_score(labels, preds))     # operating-point precision
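The snippet above gives operating-point precision only. A follow-on sketch for the PR AUC and cohort views from the checklist, reusing scores, preds, and labels from above and assuming a hypothetical language field on each audit row:
from collections import defaultdict
from sklearn.metrics import average_precision_score

# Threshold-free summary: did the evaluator itself improve or regress?
print("PR AUC:", round(average_precision_score(labels, scores), 3))

# Cohort precision at the same operating threshold; r.language is hypothetical,
# substitute whatever cohort keys (intent, model version) the audit rows carry.
flags = defaultdict(lambda: {"tp": 0, "fp": 0})
for r, p in zip(audit_set, preds):
    if p:
        flags[r.language]["tp" if r.label else "fp"] += 1
for cohort, c in flags.items():
    print(cohort, round(c["tp"] / (c["tp"] + c["fp"]), 2))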
Common Mistakes
- Reporting precision without recall. A precision of 0.99 with recall 0.05 catches almost nothing; report both.
- Using the default threshold. The default is rarely tuned to your review budget; set the threshold deliberately.
- Aggregate-only reporting. A 0.85 average precision can hide a 0.40 cohort. Segment by language, intent, and model.
- Computing precision on the train set. Use a held-out audit set; otherwise you are measuring memorisation.
- Ignoring base rate. Precision is sensitive to positive-class prevalence; report it alongside the prevalence figure for the audit set.
- Mistaking high precision for safety coverage. A precision of 0.99 with recall of 0.10 still misses 90% of positives; pair with recall before signing off.
- Skipping stratified sampling. Random samples under-represent rare cohorts; stratify the audit set so per-cohort precision is statistically meaningful.
Frequently Asked Questions
What is precision in machine learning?
Precision is the share of predicted positives that are actually correct — true positives divided by true positives plus false positives. It quantifies how trustworthy a positive prediction is.
How is precision different from accuracy?
Accuracy is the share of all predictions (positive and negative) that are correct. Precision focuses only on the positive class — it ignores true negatives, which makes it more informative on imbalanced tasks.
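A hedged illustration with made-up counts shows how the two diverge on an imbalanced task:
# 1,000 items, 2% positive: a noisy flagger still scores well on accuracy.
tp, fp, fn, tn = 10, 30, 10, 950
print((tp + tn) / (tp + fp + fn + tn))   # accuracy  = 0.96, dominated by true negatives
print(tp / (tp + fp))                    # precision = 0.25, most flags are false alarms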
How do you measure precision for an LLM evaluator?
Run the evaluator over a labelled audit dataset, count true positives and false positives at the operating threshold, and compute TP / (TP + FP). FutureAGI persists scores against the dataset so precision is reproducible across releases.