What Is Recall in Machine Learning?
An ML metric measuring the fraction of actual positive cases a classifier correctly identifies, computed as true positives divided by true positives plus false negatives.
Recall in machine learning is the share of actual positive cases a classifier correctly identifies — true positives divided by true positives plus false negatives. It answers the operational question, “of everything that should have been flagged, routed, retrieved, or extracted, how much did we catch?” Recall is the metric to watch whenever missing a positive case is more expensive than raising a false alarm. FutureAGI computes recall over evaluator outputs, labelled datasets, and trace-linked rows using RecallScore, RecallAtK, and aggregated GroundTruthMatch.
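The arithmetic is small enough to spell out. A minimal plain-Python sketch (not the FutureAGI SDK; the counts are invented for illustration):

# Recall = TP / (TP + FN): the share of actual positives that were caught.
def recall(true_positives: int, false_negatives: int) -> float:
    denominator = true_positives + false_negatives
    return true_positives / denominator if denominator else 0.0

# 90 fraud cases caught, 10 missed -> recall = 0.9
print(recall(true_positives=90, false_negatives=10))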
Why Recall Matters in Production LLM and Agent Systems
Low recall produces silent failures. A classifier with 99% accuracy can still miss most positives when the positive class is rare — and rare classes are often the ones that matter: a fraud signal, a churn-risk prompt, a safety violation, an entity buried in a long answer. Aggregate accuracy looks healthy; the system is missing the cases that justify the model. Users notice only later, in escalations, audit findings, or incident reviews.
The pain hits multiple roles. ML engineers see balanced datasets in training and unbalanced reality in production, with recall masked by macro averages. SREs see fewer alerts than expected and find later that the safety classifier never tripped. Compliance teams cannot prove that a control catches the population it is meant to catch — they need recall by class, not aggregate accuracy. Product teams see escalations whose transcripts look normal until someone asks what the model failed to notice.
In 2026 multi-step agent pipelines, recall is often the upper bound for everything downstream. A retriever that misses the decisive chunk caps answer quality. A tool router that misses the right tool forces the trajectory off the rails. A prompt-injection detector with poor recall leaks attacks to the model. Tracking recall by class and cohort is the only way to keep these stacks honest.
How FutureAGI Handles Recall in Machine Learning
FutureAGI’s approach is to treat recall as a cohort-level reliability signal stored against rows, not a single number on a slide. For ranked retrieval, fi.evals.RecallAtK reports the fraction of all relevant items appearing in the top K results. For item retrieval, fi.evals.RecallScore calculates recall against ground-truth item lists. For classifiers, teams derive recall from per-row GroundTruthMatch, Equals, or custom evaluator labels, then aggregate by class, cohort, and model version.
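To pin down the ranked-retrieval variant, here is a plain-Python sketch of recall@K; it mirrors the quantity RecallAtK reports but is an illustration, not the SDK's implementation:

def recall_at_k(ranked_ids: list, relevant_ids: set, k: int) -> float:
    # Fraction of all relevant items that appear in the top-k ranked results.
    if not relevant_ids:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)

# 2 of the 3 relevant chunks appear in the top 5 -> recall@5 = 0.67
print(recall_at_k(["c1", "c7", "c3", "c9", "c2"], {"c1", "c2", "c4"}, k=5))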
A real workflow: a fraud-detection team instruments LangChain with traceAI-langchain, samples 10,000 production traces into a FutureAGI Dataset, and labels whether each transaction was actually fraudulent. GroundTruthMatch scores the predicted label against the reference. The team computes recall by transaction type, with a release gate requiring fraud-class recall above 0.95. When recall drops to 0.88 after a feature pipeline change, every false-negative row links back to the original trace, including features, model output, and rule-engine decision. The engineer adds targeted training examples, reruns the regression eval, and blocks the deploy until recall recovers.
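A sketch of that release gate, assuming the labelled rows have already been exported as (transaction_type, reference_label, predicted_label) tuples; the 0.95 threshold comes from the workflow above, while the row shape and function names are hypothetical:

from collections import defaultdict

FRAUD_RECALL_GATE = 0.95  # release threshold from the workflow above

def recall_by_transaction_type(rows):
    # rows: (transaction_type, reference_label, predicted_label) tuples.
    tp, fn = defaultdict(int), defaultdict(int)
    for cohort, reference, predicted in rows:
        if reference == "fraud":  # actual positives only
            if predicted == "fraud":
                tp[cohort] += 1   # caught
            else:
                fn[cohort] += 1   # missed: a false negative
    return {c: tp[c] / (tp[c] + fn[c]) for c in set(tp) | set(fn)}

def gate_release(rows):
    # Block the deploy if any transaction type falls below the gate.
    scores = recall_by_transaction_type(rows)
    return all(score >= FRAUD_RECALL_GATE for score in scores.values())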
Unlike a one-shot Scikit-learn recall_score over a holdout set, FutureAGI keeps recall row-linked, version-tagged, and cohort-sliced — so the metric points at a fix, not just at a number.
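For contrast, the one-shot computation looks like this in standard Scikit-learn (toy labels, shown only to mark the difference: it returns a number with no link back to the failing rows):

from sklearn.metrics import recall_score

y_true = ["fraud", "ok", "fraud", "ok", "fraud"]
y_pred = ["fraud", "ok", "ok", "ok", "fraud"]

# One aggregate number per class, with no pointer to the rows that failed.
print(recall_score(y_true, y_pred, labels=["fraud"], average=None))  # ~0.667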
How to Measure or Detect It
Recall is computed from the true-positive and false-negative cells of a confusion matrix, but the production view is per-class recall sliced by version, route, and cohort:
- RecallScore: item-level recall for retrieved items against ground-truth lists.
- RecallAtK: recall over the top K ranked results.
- GroundTruthMatch aggregation: derive classifier recall from per-row labelled outputs.
- Per-class and macro recall: show minority-class failures that micro averages bury.
- Trace-linked false negatives: store row IDs with trace_id and span_id so misses point back to the exact production run.
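The snippet below shows the item-retrieval path with RecallScore; the two input lists are illustrative placeholders for one row's predicted and expected items: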
from fi.evals import RecallScore

# Illustrative inputs: one row's predicted items vs. its ground-truth items.
predicted_labels = ["fraud", "refund_abuse"]
expected_labels = ["fraud", "refund_abuse", "account_takeover"]

recall = RecallScore()
result = recall.evaluate(
    retrieved_items=predicted_labels,
    ground_truth_items=expected_labels,
)
print(result.score)  # fraction of ground-truth items recovered
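Each result.score is a per-row value; cohort-level recall comes from aggregating those scores across the labelled Dataset by class, cohort, and model version.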
Common Mistakes
- Reporting accuracy instead of recall. With a rare positive class, 99% accuracy is compatible with missing most positives.
- Optimizing recall alone. Perfect recall can mean flagging everything; gate on F1 or pair with precision.
- Hiding minority classes inside one aggregate. Macro and per-class recall expose failures that micro averages bury; see the sketch after this list.
- Mixing retrieval recall with classifier recall. RecallAtK measures ranked retrieval coverage; classifier recall measures missed labels. They require different fixes.
- Measuring on stale labels. A 2026 policy classifier needs current references; old labels turn legitimate predictions into false negatives.
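A minimal illustration of the aggregate-hiding mistake, using standard Scikit-learn on invented labels:

from sklearn.metrics import recall_score

# 9 "ok" rows, 1 "fraud" row; the classifier misses the only fraud case.
y_true = ["ok"] * 9 + ["fraud"]
y_pred = ["ok"] * 10

print(recall_score(y_true, y_pred, average="micro"))  # 0.9 -> looks healthy
print(recall_score(y_true, y_pred, average="macro"))  # 0.5 -> exposes the missed class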
Frequently Asked Questions
What is recall in machine learning?
Recall in machine learning is the share of actual positive cases a classifier correctly identifies, computed as true positives divided by true positives plus false negatives. It is the metric to watch when missing a positive is more expensive than raising a false alarm.
How is recall different from precision in ML?
Recall measures missed positives; precision measures false alarms among predicted positives. A high-recall, low-precision classifier flags many cases, including the real ones; a high-precision, low-recall classifier flags few cases, but most of them are real.
How do you measure recall on production traffic?
FutureAGI computes recall by sampling production traces into a labelled Dataset, scoring each row with GroundTruthMatch or RecallScore, and aggregating to per-class recall sliced by cohort, model version, and route.