What Is Recall Parity?
A fairness metric measuring whether a model achieves equal recall across protected groups, often called equal opportunity.
Recall parity is a fairness metric that asks whether a model produces equal recall across protected groups — the same true-positive rate for every cohort defined by race, gender, language, accent, region, or another sensitive attribute. Equivalently, it is “equal opportunity”: among cases that should have been flagged, the model catches them at the same rate regardless of group membership. It is evaluated alongside precision parity and demographic parity in fairness evaluation. FutureAGI computes per-cohort recall via RecallScore and exposes the disparity as a first-class cohort signal.
Why Recall Parity Matters in Production LLM and Agent Systems
A classifier with strong aggregate recall can still fail one cohort badly. A safety classifier that catches 95% of harmful prompts overall but only 78% on Spanish-language prompts produces real-world harm in that cohort. A fraud detector that catches 90% of fraud overall but only 71% in a specific region underprotects that region. Aggregate recall hides this; recall parity surfaces it. The same risk applies to retrieval, intent classification, PII detection, and any binary or multi-class task with sensitive cohorts.
The pain hits multiple roles. Compliance and risk teams in regulated industries face explicit fairness obligations under the EU AI Act, NYC AEDT, and equivalent regimes — they need quantitative cohort metrics, not narrative claims. Engineers face the practical problem that minority cohorts are often the ones with sparse training data, where recall is naturally weaker. Product teams face customer trust impacts if a feature works less well for one segment.
In 2026 LLM and agent systems, cohort definitions multiply: language, accent, region, channel, persona type, dialect, code-switching pattern. A useful recall-parity practice picks the cohorts that matter for the task and tracks recall per cohort with the same rigor as the aggregate. FutureAGI exposes cohort-tagged evaluator outputs so this is one query rather than a one-off study.
How FutureAGI Handles Recall Parity
FutureAGI’s approach is to treat recall parity as a query over per-row evaluator outputs, with cohort tags carried through every layer. Engineers tag rows with cohort attributes when ingesting a Dataset (locale, accent, persona, region) or capture them as span attributes through traceAI. Per-row recall is derived from RecallScore for retrieval tasks or GroundTruthMatch for classifiers. Cohort-level recall is then a simple aggregation over those rows.
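The aggregation step can be sketched directly. This is a minimal illustration, assuming per-row records with hypothetical "cohort", "actual", and "predicted" fields rather than FutureAGI's actual row schema:

```python
from collections import defaultdict

def cohort_recall(rows):
    """Recall per cohort from per-row binary labels.

    Assumes each row is a dict with hypothetical "cohort", "actual",
    and "predicted" keys; an illustration, not FutureAGI's schema.
    """
    hits = defaultdict(int)       # true positives per cohort
    positives = defaultdict(int)  # actual positives per cohort
    for row in rows:
        if row["actual"] == 1:
            positives[row["cohort"]] += 1
            if row["predicted"] == 1:
                hits[row["cohort"]] += 1
    return {c: hits[c] / positives[c] for c in positives}

rows = [
    {"cohort": "en", "actual": 1, "predicted": 1},
    {"cohort": "en", "actual": 1, "predicted": 0},
    {"cohort": "es", "actual": 1, "predicted": 1},
]
# en catches 1 of 2 actual positives; es catches 1 of 1
```

Rows with a negative actual label never enter the denominator, which is what distinguishes recall parity from selection-rate metrics.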
A real workflow: a content-safety team running a multi-language toxicity classifier tags each labelled row with its language. Nightly evaluation runs GroundTruthMatch on 8,000 traces sampled across English, Spanish, Portuguese, Hindi, and Arabic. The dashboard shows recall per language. When Arabic recall lags English by 9 points, the team treats that gap as a release blocker, augments training data, and reruns the eval before deploy. The release gate is parameterized: “no language pair with recall gap above 5 points”.
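A gate like that can be sketched as a small check; the function name and the 5-point threshold here are illustrative, not a FutureAGI API:

```python
def parity_gate(per_cohort_recall, max_gap=0.05):
    """Pass only if the largest pairwise recall gap is within max_gap.

    Illustrative helper, not a FutureAGI API; the gap is measured in
    absolute recall points (0.05 = 5 points).
    """
    values = per_cohort_recall.values()
    gap = max(values) - min(values)
    return gap <= max_gap, gap

ok, gap = parity_gate({"en": 0.95, "es": 0.93, "ar": 0.86})
# Arabic trails English by ~9 points, so the gate fails (ok is False)
```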
Unlike a one-shot fairness audit, FutureAGI keeps cohort-tagged evaluator outputs available across releases, so recall parity becomes a continuous gate rather than a quarterly study.
How to Measure or Detect It
Recall parity is a derived metric over per-cohort recall:
- RecallScore — item-level recall computed per cohort.
- GroundTruthMatch aggregation — derive classifier recall per cohort from per-row labels.
- Recall gap — absolute or relative difference between the highest and lowest cohort recall.
- Recall ratio — minimum cohort recall divided by maximum cohort recall (the “80% rule” tradition uses 0.8 as a flag).
- Trace-linked false negatives by cohort — store row IDs with trace_id so misses point back to the production run.
```python
from fi.evals import RecallScore

recall = RecallScore()

# Rows grouped by cohort tag; each cohort holds predicted and
# ground-truth item lists.
cohorts = {"en": rows_en, "es": rows_es, "ar": rows_ar}

# Item-level recall per cohort.
per_cohort = {
    name: recall.evaluate(retrieved_items=r["pred"], ground_truth_items=r["gt"]).score
    for name, r in cohorts.items()
}

# Recall gap: absolute difference between best and worst cohort.
gap = max(per_cohort.values()) - min(per_cohort.values())
```
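The recall-ratio variant from the list above is computed the same way; the 0.8 threshold follows the informal “80% rule” convention, and the scores here are invented for illustration:

```python
# Invented per-cohort recall scores; in practice these come from a
# per-cohort aggregation like the one above.
per_cohort = {"en": 0.95, "es": 0.90, "ar": 0.70}

# "80% rule" style check: flag if the worst cohort's recall falls
# below 80% of the best cohort's recall.
ratio = min(per_cohort.values()) / max(per_cohort.values())
flagged = ratio < 0.8
worst = min(per_cohort, key=per_cohort.get)  # cohort to prioritize fixing
```

The ratio form is scale-aware: a 5-point gap matters more when the best cohort sits at 60% recall than at 95%.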
Common Mistakes
- Reporting aggregate recall only. Aggregate hides cohort failures. Always slice by sensitive attribute.
- Optimizing recall parity by lowering the strong cohort. Equalize by lifting the weak cohort, not by harming the strong one.
- Using too few cohorts. Coarse buckets (“English” vs “non-English”) miss real disparities. Slice by language and dialect.
- Forgetting precision parity. A model with equal recall but wildly different false-positive rates is still unfair.
- Treating one parity study as compliance evidence. Fairness drifts; track parity continuously, not annually.
Frequently Asked Questions
What is recall parity?
Recall parity is a fairness metric that measures whether a model achieves equal recall across protected groups. It is also called equal opportunity. Wide recall gaps between cohorts indicate a fairness failure even when aggregate accuracy looks fine.
How is recall parity different from demographic parity?
Demographic parity measures whether positive prediction rates are equal across groups, regardless of ground truth. Recall parity measures whether true-positive rates are equal across groups, conditioned on the actual positive label.
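The distinction is easy to see on an invented two-group confusion table:

```python
# Invented counts per group: tp = true positives, fn = false negatives,
# pred_pos = all positive predictions, n = group size.
groups = {
    "A": {"tp": 40, "fn": 10, "pred_pos": 60, "n": 100},
    "B": {"tp": 30, "fn": 30, "pred_pos": 60, "n": 100},
}

selection_rate = {g: c["pred_pos"] / c["n"] for g, c in groups.items()}
recall = {g: c["tp"] / (c["tp"] + c["fn"]) for g, c in groups.items()}
# Both groups are selected at 0.60, so demographic parity holds,
# yet recall is 0.80 for A and 0.50 for B: recall parity fails.
```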
How do you measure recall parity?
FutureAGI computes per-cohort recall by sampling traces into a labelled Dataset, scoring rows with RecallScore or GroundTruthMatch, then taking the absolute or relative difference in recall between protected groups.