What Is Precision (ML Metric)?
A classification metric that measures the share of predicted positive cases that are true positives, emphasizing false-positive control.
What Is Precision (ML Metric)?
Precision is a machine-learning evaluation metric that measures how many predicted positives are actually correct: true positives divided by true positives plus false positives. In LLM and agent eval pipelines, it shows whether a classifier, retriever, guardrail, or tool-selection check is over-flagging good behavior as bad. High precision means fewer false alarms; low precision creates noisy review queues, blocked responses, and mistrusted automation. FutureAGI teams pair precision with recall and F1 because precision alone can hide missed failures.
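As a quick arithmetic check, the counts map directly onto precision, recall, and F1. The helper below is a minimal plain-Python sketch with made-up counts; it is not part of any FutureAGI API.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    # Precision: how many predicted positives were correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: how many real positives were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 45 correct flags, 5 false alarms, 20 missed failures
print(precision_recall_f1(tp=45, fp=5, fn=20))  # (0.9, ~0.69, ~0.78)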
Why Precision Matters in Production LLM and Agent Systems
False positives are not harmless. A low-precision safety classifier marks safe requests as unsafe, a low-precision PII detector floods reviewers with benign text, and a low-precision tool-selection check says the agent chose the wrong tool when it actually completed the task. The system still looks “careful” from a distance, but the people operating it stop trusting the signal.
The pain lands on different teams. Developers waste time debugging non-bugs. SREs get alert fatigue from eval-fail-rate spikes that do not match user impact. Compliance teams receive noisy review queues and struggle to separate real risk from threshold artifacts. Product teams see lower conversion when good answers are blocked by a post-response filter.
Precision is especially important in multi-step LLM and agent pipelines because one noisy positive can trigger a cascade. A false prompt-injection flag may block the user request. A false hallucination flag may force a slower model fallback. A false tool-error label may send the trace into a human queue. In logs, low precision appears as high block-rate with low confirmed-incident rate, many reviewer overrides, or a widening gap between evaluator failures and user complaints. In 2026 production stacks, those gaps matter because evals now drive runtime controls, not only dashboards.
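One lightweight way to surface that gap is to compare flag rate against reviewer-confirmed rate per route. The sketch below is a monitoring heuristic, not a FutureAGI API; it assumes each trace record is a dict with illustrative route, flagged, and confirmed fields.
from collections import defaultdict

def flag_precision_by_route(traces):
    # traces: iterable of dicts like {"route": "billing", "flagged": True, "confirmed": False}
    stats = defaultdict(lambda: {"flags": 0, "confirmed": 0, "total": 0})
    for t in traces:
        s = stats[t["route"]]
        s["total"] += 1
        s["flags"] += int(t["flagged"])
        s["confirmed"] += int(t["flagged"] and t["confirmed"])
    return {
        route: {
            "flag_rate": s["flags"] / s["total"],
            # Share of flags a reviewer confirmed; a falling value signals falling precision
            "confirmed_share": s["confirmed"] / s["flags"] if s["flags"] else 0.0,
        }
        for route, s in stats.items()
    }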
How FutureAGI Handles Precision
FutureAGI’s approach is to treat plain precision as a cohort-level metric derived from per-row evaluator outputs. There is no standalone Precision evaluator class for binary classification; precision is meaningful after you count true positives and false positives across a dataset, trace cohort, route, or release. The practical surface is Dataset.add_evaluation(): run a classifier-style evaluator such as GroundTruthMatch, Equals, FuzzyMatch, PromptInjection, or ToolSelectionAccuracy on each row, then aggregate predictions into precision by class and cohort.
Real example: a support automation team labels 5,000 traces where the expected class is needs_human_review or safe_to_send. They run GroundTruthMatch against the agent’s predicted class and compute precision for the needs_human_review label. A precision drop from 0.91 to 0.68 means the agent is sending many safe conversations to review. The next action is not a model rollback by default; the engineer opens the false-positive rows, finds that refund-policy questions were mislabeled after a prompt change, and adds a regression slice for that intent.
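A minimal aggregation sketch for that kind of analysis, assuming each row already carries the predicted label, the per-row evaluator verdict, and a cohort tag (the field names below are illustrative, not a fixed FutureAGI schema):
from collections import defaultdict

def precision_by_cohort(rows, positive_label="needs_human_review"):
    counts = defaultdict(lambda: {"tp": 0, "fp": 0})
    for row in rows:
        # Only rows predicted as the positive class count toward precision
        if row["predicted_label"] != positive_label:
            continue
        if row["matches_gold"]:  # per-row evaluator verdict, e.g. GroundTruthMatch score == 1.0
            counts[row["cohort"]]["tp"] += 1
        else:
            counts[row["cohort"]]["fp"] += 1
    return {
        cohort: c["tp"] / (c["tp"] + c["fp"])
        for cohort, c in counts.items()
    }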
For retrieval, FutureAGI exposes dedicated metric surfaces: PrecisionAtK returns the fraction of top-K results that are relevant, while ContextPrecision scores whether relevant chunks are ranked above irrelevant ones. Compared with sklearn’s precision_score(), which is useful offline but detached from traces, FutureAGI keeps the per-row evidence next to datasets and traceAI spans, so a precision regression points to the examples that caused it.
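For intuition, precision@K on a single query can be computed directly from relevance judgments. The snippet below is a plain-Python sketch of the idea, not the PrecisionAtK evaluator itself:
def precision_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the top-K retrieved items judged relevant
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k if k else 0.0

# Example: 3 of the top 5 retrieved chunks are relevant -> 0.6
print(precision_at_k(["d1", "d7", "d3", "d9", "d4"], {"d1", "d3", "d4"}, k=5))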
How to Measure or Detect Precision
Use precision only after defining the positive class. For a moderation classifier, “positive” may mean unsafe; for escalation routing, it may mean needs review.
- Confusion-matrix precision: TP / (TP + FP). Track by class, route, model version, and customer cohort.
- GroundTruthMatch aggregation: returns per-row agreement against gold labels; aggregate predicted positives into TP and FP.
- PrecisionAtK: returns the fraction of top-K retrieved items that are relevant for ranked retrieval.
- ContextPrecision: scores ranking order when retrieved context quality, not binary classification, is the target.
- Dashboard proxy: compare eval-fail-rate-by-cohort with reviewer-confirmed incident rate; a widening gap usually means precision is falling.
Minimal Python:
from fi.evals import GroundTruthMatch

metric = GroundTruthMatch()
tp = fp = 0
for row in dataset:
    # Per-row agreement between the predicted label and the gold label
    result = metric.evaluate(response=row.predicted_label, expected_response=row.gold_label)
    # "unsafe" is the positive class for this moderation example
    predicted_positive = row.predicted_label == "unsafe"
    tp += int(predicted_positive and result.score == 1.0)
    fp += int(predicted_positive and result.score != 1.0)
print(tp / (tp + fp) if tp + fp else 0.0)
Common Mistakes
Precision is easy to misuse because the formula is simple but the operating choice is not.
- Reporting precision without recall. A classifier that predicts one positive and gets it right has perfect precision while missing most failures.
- Changing the positive class mid-analysis. Precision for “unsafe” and precision for “safe” answer different operational questions.
- Using accuracy as a precision substitute. Accuracy includes true negatives, so it hides false-positive noise when the positive class is rare (worked through in the sketch after this list).
- Setting one threshold for every route. A billing-support guardrail and a medical-advice guardrail need different false-positive budgets.
- Mixing ranking and classification precision. Use PrecisionAtK or ContextPrecision for ranked lists; use confusion-matrix precision for labels.
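A small worked example of the accuracy-versus-precision trap above, using hypothetical counts for a rare positive class:
# Hypothetical guardrail results on 1,000 requests, only 20 of which are truly unsafe
tp, fp, fn, tn = 10, 90, 10, 890

accuracy = (tp + tn) / (tp + fp + fn + tn)  # 0.90, looks healthy
precision = tp / (tp + fp)                  # 0.10, nine of every ten flags are false alarms
recall = tp / (tp + fn)                     # 0.50, half the real failures are still missed
print(accuracy, precision, recall)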
Frequently Asked Questions
What is precision in machine learning?
Precision is the share of predicted positive cases that are actually correct: true positives divided by true positives plus false positives. In LLM evals, it measures false-positive noise in classifiers, retrievers, guardrails, and tool checks.
How is precision different from recall?
Precision asks whether the positives you predicted were correct; recall asks whether you found all real positives. High precision reduces false alarms, while high recall reduces missed failures.
How do you measure precision?
Compute TP / (TP + FP) from a confusion matrix, often using per-row FutureAGI evaluator results such as GroundTruthMatch. For ranked retrieval, use PrecisionAtK or ContextPrecision.