Evaluation

What Is Recall (ML Metric)?

An ML metric measuring the fraction of actual positive cases a model correctly identifies.

Recall is an ML evaluation metric that measures the share of actual positive cases a model successfully finds: true positives divided by true positives plus false negatives. In an eval pipeline, it answers, “Of the cases that should have been flagged, routed, retrieved, or extracted, how many did we catch?” FutureAGI teams use recall on labelled datasets, production trace samples, and regression evals when missing a positive case is costlier than sending an extra false alarm.
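
A quick worked sketch of that formula, with invented counts:

# Toy confusion-matrix counts (invented for illustration).
true_positives = 45    # actual positives the model flagged
false_negatives = 15   # actual positives the model missed

recall = true_positives / (true_positives + false_negatives)
print(recall)  # 0.75 -> the model caught 75% of the actual positives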

Why Recall Matters in Production LLM and Agent Systems

Low recall creates quiet failures. The system says nothing, skips a route, misses a document, or fails to escalate a risky prompt. Users see the absence only later: an unflagged PII leak, a support ticket routed to the wrong queue, a RAG answer missing the policy exception that would have changed the answer, or an agent that never calls the refund tool because the intent detector missed the refund intent.

The operational pain lands on different teams at once. Developers see false negatives in labelled eval sets but may not see them in aggregate accuracy. SREs see fewer alerts than expected, then find incidents that never triggered a guardrail. Compliance teams cannot prove that a control catches the full population it is supposed to catch. Product teams see customer escalations with transcripts that look normal until someone asks what the system failed to notice.

In 2026 multi-step pipelines, recall is often the upper bound for everything downstream. A retriever that misses the decisive chunk caps answer quality. A tool router that misses the right tool forces the agent down the wrong trajectory. A safety classifier with low recall lets harmful prompts pass to the model. Accuracy can look high when positives are rare; recall names the failure mode directly.

How FutureAGI Handles Recall

FutureAGI’s approach is to treat recall as a cohort-level reliability signal, not a single-row score. The anchor for this entry is conceptual rather than a dedicated product page: recall is the confusion-matrix view engineers compute over evaluator outputs, dataset rows, and trace-linked examples. For item retrieval, fi.evals.RecallScore calculates recall for retrieved items against ground truth items. For ranked results, fi.evals.RecallAtK reports the fraction of all relevant items that appear in the top K. For classifiers, teams usually derive recall from per-row GroundTruthMatch, Equals, or custom evaluator labels.
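
The calculation behind RecallAtK is easy to sanity-check by hand. The plain-Python sketch below illustrates the metric only, with invented IDs; it is not the fi.evals implementation:

def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of all relevant items that appear in the top K ranked results.
    top_k = set(ranked_ids[:k])
    relevant = set(relevant_ids)
    return len(top_k & relevant) / len(relevant) if relevant else 0.0

# Invented example: 3 relevant chunks, 2 of them surface in the top 5.
ranked = ["c9", "c2", "c7", "c4", "c1", "c5"]
relevant = ["c2", "c1", "c8"]
print(recall_at_k(ranked, relevant, k=5))  # 2/3, roughly 0.67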

A real workflow: an agent team instruments LangChain with traceAI-langchain, samples 5,000 production traces into a FutureAGI Dataset, and labels whether each trace should have triggered an escalation. GroundTruthMatch scores the predicted escalation label against the reference. The team then computes recall by class, with a release gate requiring escalation recall above 0.92 for billing, safety, and account-deletion intents. When recall drops to 0.84 after a prompt update, the failing rows link back to the original trace, including the user message, route decision, and evaluator reason. The engineer adds targeted examples, reruns the regression eval, and blocks deploy until false negatives fall.
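
A minimal sketch of that release gate, assuming the per-row results have already been exported as (intent, should_escalate, predicted_escalate) tuples; the data shape is hypothetical, and only the gated intents and the 0.92 threshold come from the workflow above:

from collections import defaultdict

# Hypothetical per-row export: (intent, should_escalate, predicted_escalate).
rows = [
    ("billing", True, True),
    ("billing", True, False),          # a false negative
    ("safety", True, True),
    ("account-deletion", True, True),
    ("faq", False, False),
]

tp = defaultdict(int)
fn = defaultdict(int)
for intent, actual, predicted in rows:
    if actual:                         # recall only counts actual positives
        if predicted:
            tp[intent] += 1
        else:
            fn[intent] += 1

GATED_INTENTS = ["billing", "safety", "account-deletion"]
for intent in GATED_INTENTS:
    total = tp[intent] + fn[intent]
    recall = tp[intent] / total if total else 1.0
    status = "PASS" if recall >= 0.92 else "FAIL: block deploy"
    print(f"{intent}: recall={recall:.2f} ({status})")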

Unlike Scikit-learn’s recall_score, which returns a batch scalar after the fact, FutureAGI keeps row-level examples, trace links, and cohort tags attached to the aggregate.

How to Measure or Detect Recall

Measure recall from the true-positive and false-negative cells of a confusion matrix. The minimum useful view is per-class recall; the production view is per-class recall by cohort, model version, route, and prompt version.

  • RecallScore — calculates item-level recall for retrieved items against ground truth items.
  • RecallAtK — reports the fraction of all relevant items that appear in the top K ranked results.
  • Classifier recall from GroundTruthMatch — aggregate per-row correct/incorrect labels into true positives and false negatives; a small aggregation sketch follows the minimal Python shape below.
  • Dashboard signal — alert on recall-by-class, false-negative-rate-by-cohort, and eval-fail-rate-by-cohort.
  • Trace-linked review — store row IDs with trace_id and span_id so false negatives point back to the exact production run.

Minimal Python shape:

from fi.evals import RecallScore

# Hypothetical IDs for illustration: what the system returned vs. what was expected.
predicted_ids = ["doc_3", "doc_7", "doc_9"]
expected_ids = ["doc_3", "doc_7", "doc_12"]

recall = RecallScore()
result = recall.evaluate(
    retrieved_items=predicted_ids,
    ground_truth_items=expected_ids,
)
print(result.score)  # 2 of 3 expected items retrieved -> roughly 0.67
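
For the classifier path in the list above, a hedged sketch of the aggregation side; the row fields are hypothetical, and only the idea of keeping trace_id attached to each false negative comes from the trace-linked review bullet:

# Hypothetical per-row results, e.g. exported after a GroundTruthMatch run.
rows = [
    {"cohort": "enterprise", "actual": True,  "predicted": True,  "trace_id": "tr-101"},
    {"cohort": "enterprise", "actual": True,  "predicted": False, "trace_id": "tr-102"},
    {"cohort": "free",       "actual": True,  "predicted": False, "trace_id": "tr-103"},
    {"cohort": "free",       "actual": False, "predicted": False, "trace_id": "tr-104"},
]

by_cohort = {}
for row in rows:
    if not row["actual"]:
        continue                                   # recall only looks at actual positives
    stats = by_cohort.setdefault(row["cohort"], {"tp": 0, "misses": []})
    if row["predicted"]:
        stats["tp"] += 1
    else:
        stats["misses"].append(row["trace_id"])    # false negative -> keep the trace link

for cohort, stats in by_cohort.items():
    total = stats["tp"] + len(stats["misses"])
    print(f"{cohort}: recall={stats['tp'] / total:.2f}, false negatives={stats['misses']}")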

Common Mistakes

Recall is simple to define and easy to misuse once the dataset becomes imbalanced or multi-class.

  • Reporting accuracy instead of recall. A 99% accurate classifier can still miss most positives when the positive class is rare; see the sketch after this list.
  • Optimizing recall alone. Perfect recall can mean flagging everything; pair it with precision or F1 before setting a release gate.
  • Hiding minority classes inside one aggregate. Macro recall or per-class recall shows failures that micro averages bury.
  • Mixing retrieval recall with classifier recall. RecallAtK measures ranked retrieval coverage; classifier recall measures missed positive labels.
  • Measuring on stale labels. A 2026 policy classifier needs current references; old labels make valid positives look like false positives or false negatives.
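
A quick illustration of the first and third bullets, using scikit-learn's recall_score on an invented imbalanced label set:

from sklearn.metrics import accuracy_score, recall_score

# Invented labels: 400 examples, only 4 actual positives; the model misses 3 of them.
y_true = [1, 1, 1, 1] + [0] * 396
y_pred = [1, 0, 0, 0] + [0] * 396

print(accuracy_score(y_true, y_pred))                 # ~0.99, looks fine
print(recall_score(y_true, y_pred))                   # 0.25, 3 of 4 positives missed
print(recall_score(y_true, y_pred, average="macro"))  # 0.625, per-class recall averaged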

Frequently Asked Questions

What is recall in machine learning?

Recall is the fraction of actual positive cases that a model correctly finds: true positives divided by true positives plus false negatives. It is the metric to watch when missed positives are expensive.

How is recall different from precision?

Recall measures missed positives: of everything that should have been found, how much did the model catch? Precision measures false alarms: of everything the model predicted positive, how much was actually positive?

How do you measure recall?

FutureAGI can measure item-level recall with RecallScore or RecallAtK, and classifier recall by aggregating GroundTruthMatch or Equals outputs into true positives and false negatives. Track it by class and cohort, not only as one aggregate.