Best HR AI Evaluation Platforms in 2026

HR AI eval in 2026: five platforms scored on demographic bias detection, per-decision audit, impact-ratio reporting. FAGI, Galileo, Braintrust, Holistic.

May 12, 2026

Updated May 20, 2026

17 min read

hr talent-acquisition evaluation ai-evaluation llm-evaluation regulated-industries

A resume-screening model at a 5,000-employee employer quietly drifted onto a name-and-school proxy for race during campus recruiting season. Every output passed the gateway guardrail. Every output also produced a higher reject rate for one cohort. By the time the TA team traced which retrieval chunk and prompt segment produced the disparity, the employer had an EEOC charge on the calendar, an NYC AEDT audit due, and no audit-grade evidence to walk into either with. Mobley v. Workday — certified as a nationwide collective action in May 2025 — said the quiet part out loud: an AI vendor’s screening tool can be sued under the same disparate-impact theory as a human screener, and the employer’s evidence surface decides the case.

That is why an HR AI evaluation platform is not interchangeable with a generic LLM eval tool. HR AI eval is fairness eval. NYC Local Law 144 (AEDT) requires an independent bias audit. EEOC Title VII technical assistance requires impact-ratio analysis on protected classes under the 4/5ths rule. Colorado SB 24-205 and California AB 2930 require per-decision adverse-action explanation. EU AI Act Annex III(3) names employment classification high-risk and triggers Article 14 human oversight from August 2026. GDPR Article 22 layers a separate right to the logic of an automated decision for EU candidates. The platform that ships demographic bias detection, per-decision audit, and impact-ratio reporting is the platform that gets through your General Counsel. The platform that ships one of the three is a vendor pitch.

This guide compares the five platforms HR ML and compliance engineers should consider in 2026, scored on those three controls. The ranking weights what shows up in an EEOC charge response, an AEDT audit, a Colorado adverse-action review, and a Mobley-pattern discovery request, not what shows up in vendor decks.

TL;DR: the five-platform shortlist

| # | Platform | Bias detection | Per-decision audit | Impact-ratio reporting | Best for |
| --- | --- | --- | --- | --- | --- |
| 1 | Future AGI | `BiasDetection` + `NoRacialBias` + `NoGenderBias` + `NoAgeBias` + `Sexist` as named `EvalTemplate` primitives | OTel spans, `span_id`-linked; tamper-evident | Cohort scoring with span linkage | Mid-market employers, ATS vendors, recruiter-copilot teams |
| 2 | Galileo Luna-2 | Enterprise-tier; cohort scoring author-it-yourself | Closed cloud audit store; OTel partial | Custom dashboards | Tier-1 employers with MSA-first procurement |
| 3 | Braintrust | BYO bias rubrics; SDK-first | Sandboxed eval store; OTel via integration | Eval-as-code, library is yours | Engineering-led HR-tech and ATS |
| 4 | Holistic AI / fairness specialists | AEDT-named bias-audit product; Pymetrics-style cohort scoring | Annual-audit format export | Audit-grade at snapshot cadence | Multi-jurisdiction employers anchored to the annual audit |
| 5 | Custom DIY | You own it; taxonomy = you | What IAM + retention builds | What ML platform builds | Residency-mandated or cohort-exotic employers |

Future AGI is the only platform here that combines all three controls: named bias primitives across race, gender, age, and sexism; a span-linked audit trail; and cohort 4/5ths-rule scoring. The others are credible second picks when one constraint dominates.

Why generic LLM eval falls short for HR AI

HR teams ship AI faster than they evaluate it, and the failure mode is class-action-shaped, not user-experience-shaped. A screener drifting onto a name proxy is a Title VII disparate-impact story. An interview-AI scoring accented speech down is an EEOC finding and a Colorado adverse-action violation. A recruiter copilot drafting JDs with gender-coded language is an EEOC-cognisable disparate-treatment signal before any candidate applies.

Generic LLM eval breaks on three HR-specific axes. The audience is regulators and counsel, not users — the score needs a reason a Title VII reviewer can use and a 4/5ths-rule-defensible evidence surface. Failures are silent at the candidate level — disparate-impact drift, adverse-action gaps, and gender-coded JDs only show up at the span level. The audit aperture is fragmented: AEDT, EEOC, Colorado, California, Mobley discovery, EU AI Act Annex III(3), and GDPR Article 22 each want different artifacts from the same model.

Most listicles either pitch HR an AI gateway (catches inputs, misses output drift) or a bias-audit service (annual snapshot, not continuous). Evaluation platforms decide whether a disparate-impact drift pattern is caught at runtime or after a class action, and whether the adverse-action rationale on file is the one a human reviewer can actually read.

The three-control HR eval scorecard

Most listicles compare on features. HR needs a sharper rubric. The three controls come from an EEOC charge response, an NYC AEDT audit, and a Colorado AI Act adverse-action review.

| Control | Pass criteria | Why it matters |
| --- | --- | --- |
| Demographic bias detection | Named evaluator primitives for race, gender, age, plus general disparate-impact and a sexism evaluator for JD-generation and interview prompts; cohort-grouped scoring on every production trace, not just the annual snapshot | Title VII, AEDT, Colorado SB 24-205, California AB 2930, and EU AI Act Annex III(3) all expect protected-class testing on outcomes |
| Per-decision audit trail | Per-decision linkage of input span, retrieved candidate-data field, output, evaluator score, reason, model version, and reviewer override; tamper-evident; per-tenant retention | EEOC charge response, AEDT auditor evidence, Colorado adverse-action notice, GDPR Article 22 logic export, and Mobley-pattern discovery all expect this artifact |
| Impact-ratio (4/5ths-rule) reporting | Cohort-grouped pass-rate ratios with drift alarms when the ratio crosses 0.80, on rolling-mean and per-model-version windows | EEOC treats the 4/5ths rule as the disparate-impact floor; the AEDT auditor reads the same number; the Colorado annual impact assessment cites it |

Pass all three and the platform is production-ready. Pass two and it is a candidate. Pass one and it is a pitch.
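The impact-ratio control in the scorecard reduces to simple arithmetic, so a minimal sketch may help. It assumes decisions arrive as `(cohort, selected)` pairs; the function names and the toy data are illustrative and not part of any vendor SDK.

```python
from collections import Counter

FOUR_FIFTHS = 0.80  # the 4/5ths threshold named in the scorecard


def selection_rates(decisions):
    """decisions: iterable of (cohort, selected) pairs -> {cohort: selection rate}."""
    totals, selected = Counter(), Counter()
    for cohort, was_selected in decisions:
        totals[cohort] += 1
        if was_selected:
            selected[cohort] += 1
    return {cohort: selected[cohort] / totals[cohort] for cohort in totals}


def impact_ratios(decisions):
    """Each cohort's selection rate divided by the highest cohort's rate."""
    rates = selection_rates(decisions)
    best = max(rates.values())
    if best == 0:
        # Nobody was selected, so the ratio is undefined for every cohort.
        return {cohort: None for cohort in rates}
    return {cohort: rate / best for cohort, rate in rates.items()}


def flagged_cohorts(decisions, threshold=FOUR_FIFTHS):
    """Cohorts whose impact ratio falls below the threshold."""
    ratios = impact_ratios(decisions)
    return sorted(c for c, r in ratios.items() if r is not None and r < threshold)


# Hypothetical toy data: cohort A selects 50 of 100, cohort B selects 30 of 100.
sample = ([("A", True)] * 50 + [("A", False)] * 50
          + [("B", True)] * 30 + [("B", False)] * 70)
ratios = impact_ratios(sample)
print(flagged_cohorts(sample))  # ['B'], because 0.30 / 0.50 = 0.60 < 0.80
```

A ratio below 0.80 is a screening signal, not a legal conclusion; whether the gap is statistically meaningful is a separate question for the statistician.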

The 2026 HR regulatory pressure stack

| Rule | What it covers | What your eval platform has to produce |
| --- | --- | --- |
| NYC Local Law 144 (AEDT) | Independent bias audit on every automated employment-decision tool in NYC | Cohort-grouped pass-rate ratios; per-decision audit trail; selection-rate summary in audit-vendor format |
| EEOC Title VII technical assistance (May 2023; Sep 2024 update) | Federal disparate-impact analysis under the 4/5ths rule | Impact-ratio reporting against protected-class cohorts; mitigation plan when ratio < 0.80 |
| Colorado SB 24-205 + California AB 2930 | Adverse-action notice + annual impact assessment on high-risk employment AI; Feb / Jan 2026 | Per-decision adverse-action reasoning; cohort bias telemetry feeding annual assessment |
| Illinois AI Video Interview Act + BIPA | Notice, consent, cohort demographic reporting, biometric privacy on interview AI | Cohort-tagged retention; deletion-on-request; biometric-data path inside the boundary |
| *Mobley v. Workday* (N.D. Cal., collective certified May 2025) | Disparate-impact and ADEA claims against an HR-AI vendor; employer co-defendant exposure | Discovery-readable per-decision trail across cohorts and model versions |
| EU AI Act Annex III(3) + GDPR Article 22 | Employment AI high-risk; Article 14 human oversight from Aug 2026; right to decision logic | Per-decision human-readable reasoning; interrupt mechanism; decision-logic export per data subject |

The eval layer ships cohort-grouped bias scoring, produces a per-decision record linking score, reason, retrieved candidate-data field, and model version to the trace, and surfaces impact-ratio reporting an AEDT auditor and an EEOC investigator read from the same numbers.
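To make "per-decision record" and "tamper-evident" concrete, here is a minimal sketch of one common approach: an append-only log in which each entry's SHA-256 hash covers the record plus the previous entry's hash. The `DecisionRecord` fields mirror the linkage described above; the class, the helper names, and the sample values are hypothetical, and this is not a description of how any vendor implements its audit store.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional

GENESIS = "0" * 64  # placeholder hash that anchors the first entry


@dataclass(frozen=True)
class DecisionRecord:
    span_id: str
    input_span: str
    candidate_field: str
    output: str
    evaluator_score: float
    reason: str
    model_version: str
    reviewer_override: Optional[str] = None


def record_hash(record, prev_hash):
    """Hash the record together with the previous entry's hash."""
    payload = json.dumps(asdict(record), sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log, record):
    prev_hash = log[-1][1] if log else GENESIS
    log.append((record, record_hash(record, prev_hash)))


def verify(log):
    """Recompute the chain; any edited record breaks every later hash."""
    prev_hash = GENESIS
    for record, stored in log:
        if record_hash(record, prev_hash) != stored:
            return False
        prev_hash = stored
    return True


log = []
append(log, DecisionRecord("span-1", "resume summary", "education", "reject",
                           0.42, "school-name proxy suspected", "screener-v3"))
append(log, DecisionRecord("span-2", "resume summary", "work_history", "advance",
                           0.91, "meets posted requirements", "screener-v3"))
print(verify(log))  # True

# Editing a stored record without recomputing the chain is detected.
tampered = [(replace(log[0][0], reason="edited later"), log[0][1])] + log[1:]
print(verify(tampered))  # False
```

A hash chain only detects edits; it does not prevent them, so production systems typically also anchor the latest hash somewhere the writer cannot alter.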

#1 Future AGI: named bias primitives, span-linked audit trail, cohort-grouped impact-ratio

Future AGI is the production-grade pick when you want all three controls in one platform. The vendor lists SOC 2 Type II, HIPAA, GDPR, and CCPA compliance; ISO/IEC 27001 is in active audit. The ai-evaluation SDK ships BiasDetection, NoRacialBias, NoGenderBias, NoAgeBias, and Sexist as named EvalTemplate classes (eval_id 69, 77, 78, 79, 17 in templates.py), with PII, DataPrivacyCompliance, Toxicity, and IsPolite for boundary and message hygiene. The OTel-native trace layer links every score to its originating span, so an EEOC investigator or AEDT auditor walks from “rejected candidate” to the prompt segment, retrieved resume field, and evaluator reason inside your boundary.

Best for: mid-market employers, ATS and HRIS vendors, recruiter-copilot teams, interview-AI vendors, and engineering-led TA teams on OpenTelemetry.

Key strengths:

Bias detection ships as named primitives. BiasDetection, NoRacialBias, NoGenderBias, NoAgeBias, and Sexist are EvalTemplate classes in ai-evaluation (Apache 2.0). Cohort-grouped scoring surfaces disparities on every production decision; a statistician or IO-psychologist owns the 4/5ths significance test. The Sexist primitive catches gender-coded language in JD generation before a candidate sees it — the failure mode generic eval misses.
Per-decision audit trail that survives an EEOC charge. traceAI (Apache 2.0) auto-instruments 50+ AI surfaces at import time. Spans carry prompt, retrieved candidate-data field, and output as attributes; eval scores link via span_id. Per-tenant retention, RBAC, and tamper-evident logs ship in Agent Command Center. When opposing counsel in a Mobley-pattern matter asks which candidate-data field drove a rejection, the answer assembles in one query. Field-level error localization names the retrieved-resume field behind any cohort regression.
Impact-ratio reporting on continuous traffic. Cohort-grouped scoring produces selection-rate and pass-rate ratios on every trace. Drift alarms fire when the rolling-mean ratio crosses 0.80 — the canonical Mobley-pattern failure mode, caught between annual audits.
Candidate-data boundary integrity at two layers. Protect data_privacy_compliance Gemma 3n LoRA adapter runs inline at 65 ms median time-to-label per arXiv 2510.13351; deterministic fallback covers 18 entity types. Hybrid execution keeps GDPR Article 22 and Illinois BIPA scope tight: 20+ heuristic metrics run local at zero API cost.
Error Feed and closed-loop optimisation. HDBSCAN soft-clustering over ClickHouse span embeddings groups failures into named issues; a Sonnet 4.5 Judge writes root cause and immediate_fix. agent-opt ships six optimisers (PROTEGI, GEPA, MetaPrompt, PromptWizard, BayesianSearch, RandomSearch) that tune a bias-labelled rubric against live trace data, not a synthetic set.
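The rolling-window drift alarm described in the strengths above can be sketched generically. The following is an illustrative, stdlib-only monitor, not Future AGI's implementation: the class name, the window size, and the minimum cohort size are all assumptions chosen for the example.

```python
from collections import Counter, deque


class ImpactRatioMonitor:
    """Rolling-window 4/5ths check over the most recent decisions."""

    def __init__(self, window=100, threshold=0.80, min_per_cohort=20):
        self.decisions = deque(maxlen=window)  # oldest decisions fall off automatically
        self.threshold = threshold
        self.min_per_cohort = min_per_cohort   # skip cohorts too small to compare

    def observe(self, cohort, selected):
        """Record one decision; return (cohort, ratio) if the alarm fires, else None."""
        self.decisions.append((cohort, bool(selected)))
        totals, hits = Counter(), Counter()
        for c, s in self.decisions:
            totals[c] += 1
            hits[c] += s
        rates = {c: hits[c] / n for c, n in totals.items() if n >= self.min_per_cohort}
        if len(rates) < 2:
            return None
        best = max(rates.values())
        if best == 0:
            return None
        worst = min(rates, key=rates.get)
        ratio = rates[worst] / best
        return (worst, ratio) if ratio < self.threshold else None


def replay(monitor, steps, b_selectable):
    """Feed synthetic traffic: cohorts alternate, A is selected half the time."""
    alarms = []
    for i in range(steps):
        cohort = "A" if i % 2 == 0 else "B"
        selected = (i % 4 == 0) if cohort == "A" else (b_selectable and i % 4 == 1)
        alarm = monitor.observe(cohort, selected)
        if alarm:
            alarms.append(alarm)
    return alarms


monitor = ImpactRatioMonitor()
baseline_alarms = replay(monitor, 100, b_selectable=True)   # both cohorts at 50%
drift_alarms = replay(monitor, 100, b_selectable=False)     # cohort B stops passing
print(len(baseline_alarms), drift_alarms[0][0])  # 0 B
```

The minimum-cohort guard matters in practice: without it, the first handful of decisions for a small cohort can trip the alarm on noise alone.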

Limitations:

Real-time voice-agent eval is out of scope; AI video interviews need post-recording evaluation. Pair with the end-to-end voice AI evaluation reference.
A statistician or AEDT auditor still owns the 4/5ths significance test and audit sign-off. The trade-off is that the evidence trail arrives already regulator-readable.
Newer than Galileo on enterprise-procurement reflex; smaller named-HR-customer footprint than the fairness specialists at #4.
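For readers wondering what the "significance test" a statistician owns might look like, one common choice is a two-proportion z-test on cohort selection rates. The sketch below is illustrative only: the function name and sample counts are hypothetical, and the choice of test (and whether it suits small samples) is precisely the judgment the statistician or IO psychologist is there to make.

```python
import math


def two_proportion_z_test(selected_a, total_a, selected_b, total_b):
    """Two-sided z-test for a difference between two selection rates."""
    p_a = selected_a / total_a
    p_b = selected_b / total_b
    pooled = (selected_a + selected_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0  # identical all-or-nothing rates: no evidence of a gap
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts: cohort A selects 50 of 100, cohort B selects 30 of 100.
z, p_value = two_proportion_z_test(50, 100, 30, 100)
print(round(z, 2), p_value < 0.01)  # 2.89 True
```

The impact ratio and the p-value answer different questions: the first measures the size of the gap, the second how likely a gap that large is under chance alone.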

Use-case fit: resume screening, AI video/audio interviews (post-recording), recruiter-copilot drafting, internal-mobility, performance-review summarisation, JD hygiene.

Pricing & deployment: cloud + OSS self-host (Apache 2.0). Start free; usage-based. SOC 2 Type II, HIPAA BAA, SAML SSO, SCIM on Scale tier. AWS Marketplace. Air-gapped self-host via BYOC for EU residency. See pricing.

Verdict: the only platform in this shortlist that passes the three-control scorecard out of the box. Choose Future AGI when you need named bias primitives across race, gender, age, and sexism; one audit trail an EEOC investigator, AEDT auditor, Colorado reviewer, and GDPR Article 22 request can each read; and cohort-grouped 4/5ths reporting that surfaces drift between annual audits.

#2 Galileo Luna-2: enterprise procurement and Tier-1 HR Legal & Compliance reflex

Galileo is the strongest pick if your employer is large enough that procurement, SSO, and a Tier-1 MSA matter more than open-source flexibility or named bias primitives. The enterprise tier ships bias-detection evaluators, and a Luna-2 hallucination-scoring deal tends to close with HR Legal & Compliance faster than one with a newer entrant.

Best for: Tier-1 employers, multinational HR functions, Fortune 500 TA teams with mature Legal & Compliance procurement and an MSA-first vendor approach.

Key strengths:

Bias-detection evaluators on the enterprise tier; cohort drift detection in custom dashboards.
Luna-2 hallucination scoring with published benchmark numbers; mature on factuality for recruiter copilots citing internal policy.
Enterprise security clears Fortune 500 InfoSec and HR Legal & Compliance quickly; SOC 2 Type II by default.
Named enterprise customers in regulated industries; shorter MSA hop than newer entrants.

Limitations:

Cohort bias detection isn’t named primitives. No NoRacialBias, NoGenderBias, NoAgeBias, or Sexist evaluator an AEDT auditor can map to a Title VII class without your own rubric layer.
Closed-source. Extending evaluators with HR-specific rubrics (ADA proxies, GINA, ADEA age-coded language) is a vendor request.
Optimises for managed cloud; self-hosted retention (Illinois BIPA, EU residency) is on you.
Per-eval cost at scale runs higher than Future AGI’s classifier-backed evaluators.

Use-case fit: Fortune 500 TA with mature procurement; workloads where MSA is the binding constraint.

Pricing & deployment: enterprise contract, fully-managed cloud. SOC 2 Type II by default.

Verdict: the procurement-safe pick. Choose Galileo when procurement is the binding constraint; choose Future AGI when the AEDT auditor or EEOC charge response needs named-primitive mapping in the audit trail.

#3 Braintrust: SDK-first eval workflow for engineering-led HR-tech

Braintrust is the engineering-led pick for HR-tech and ATS teams that want a code-first, sandboxed eval workflow with a polished developer surface. SOC 2 Type II by default; enterprise tier carries the broader compliance conversation for candidate-data-touching workloads.

Best for: engineering-led HR-tech vendors, ATS and HRIS engineering teams, ML platform teams at TA-tech companies that want eval datasets and prompts versioned alongside code.

Key strengths:

Strong SDK ergonomics. Eval datasets, prompts, and scoring functions live in the same repo as application code. CI gates on every PR — the bias rubric runs before the prompt merges.
Sandboxed agent eval; useful for tool-using recruiter copilots and interview-AI on synthetic candidates without real data.
SOC 2 Type II by default; clean trace store with eval scores per row.
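A CI-gated bias rubric on synthetic candidates can take many forms; one simple, well-known pattern is a counterfactual name-swap test, where only the candidate name changes and the score must not. The sketch below does not use Braintrust's SDK. `screen` is a hypothetical stand-in for the model under test, and the placeholder names would be replaced by a curated, cohort-associated name list in a real suite.

```python
def screen(resume_text: str) -> float:
    """Hypothetical stand-in for the screener under test: keyword overlap score."""
    keywords = {"python", "sql", "recruiting"}
    words = {w.strip(".,").lower() for w in resume_text.split()}
    return len(keywords & words) / len(keywords)


TEMPLATE = "{name} has five years of Python and SQL experience in recruiting operations."
NAME_VARIANTS = ["Candidate A", "Candidate B", "Candidate C"]  # placeholders only


def max_score_gap(template, names, scorer):
    """Largest score difference caused purely by swapping the name."""
    scores = [scorer(template.format(name=name)) for name in names]
    return max(scores) - min(scores)


def test_name_swap_invariance():
    # pytest collects this function; a gap above the tolerance fails the build.
    assert max_score_gap(TEMPLATE, NAME_VARIANTS, screen) <= 0.01


if __name__ == "__main__":
    test_name_swap_invariance()
    print("name-swap gate passed")
```

Name-swap invariance catches direct name proxies only; it says nothing about proxies such as school or postcode, which need their own perturbation sets.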

Limitations:

Bias rubrics are author-it-yourself. NoRacialBias, NoGenderBias, NoAgeBias, Sexist, and BiasDetection don’t ship as named primitives — your team designs the cohort taxonomy, picks the judge, calibrates against an HR-labelled gold set, and re-calibrates quarterly.
The audit-trace surface is engineering-shaped, not regulator-shaped. A per-decision artifact an AEDT auditor can read in 30 seconds takes wiring.
Candidate-data path is an enterprise-tier conversation.
Newer to HR-AI relative to Holistic AI on the vertical-anchored audit story.

Use-case fit: HR-tech startups building hiring AI; ATS/HRIS engineering teams running CI-gated eval on synthetic data.

Pricing & deployment: SaaS with free and paid tiers; enterprise tier carries broader compliance terms.

Verdict: an engineering-pleasant workflow that clears SOC 2 by default. Choose Braintrust when the ML platform team is the buyer; choose Future AGI when HR Legal needs named-primitive mapping in the audit trail.

#4 Holistic AI and the fairness specialists

A category, not a single vendor. Holistic AI, Pymetrics-style assessment specialists, Arthur, Credo AI, and the boutique AEDT independent-auditor firms split this slot. Strongest pick when your binding constraint is the annual AEDT audit cycle itself, not continuous monitoring between audits.

Best for: multi-jurisdiction employers whose primary 2026 obligation is annual AEDT bias audits plus state impact assessments at scale, paired with a continuous-monitoring layer underneath.

Key strengths:

Holistic AI ships a named AEDT bias-audit product branded for NYC Local Law 144 — the only category in the eval space with vertical-anchored HR-AI positioning. UCL spinout academic backbone with methodology in the public record.
Pymetrics-style vendors built cohort-grouped scoring on gamified assessments before AEDT existed; data-science vocabulary maps cleanly to EEOC 4/5ths reporting.
Audit-format export purpose-built for an independent auditor.
Multi-jurisdiction impact-assessment workflows (NYC AEDT + California AB 2930 + Colorado SB 24-205).

Limitations:

Annual-snapshot positioning by design, not continuous monitoring. Real-time drift alerts require layering a continuous platform on top.
Less mature on OTel-native tracing than eval-platform incumbents; the per-decision trail is engineered for the audit report, not for an EEOC discovery response or Mobley-pattern matter.
Less established Fortune 500 procurement footprint for the non-audit eval workload; Tier-1 employers usually add this category for the AEDT cycle on top of a general-purpose eval vendor.
Named-primitive coverage on Sexism, ADA proxies, and GINA varies by vendor — check the audit-template menu before signing.

Use-case fit: the AEDT compliance workflow; Colorado SB 24-205 impact-assessment artifact.

Pricing & deployment: managed cloud; tiered audit-only and audit + platform options.

Verdict: the vertical-anchored pick. If AEDT is your binding constraint, this is the cleanest answer for the audit cycle itself. Pair with Future AGI underneath for continuous monitoring — the gap where Mobley-pattern failure modes show up.

#5 Custom DIY stack: full ownership for the residency-mandated and the cohort-exotic

Some employers won’t ship candidate data to any third party: EU employers under GDPR residency, federal contractors with OFCCP audit cadences, and large unionised employers with works-council sovereignty agreements all carry residency mandates. Some HR research teams have unusual cohort taxonomies (intersectional Title VII categories, ADA proxies, state-extended classes) that no off-the-shelf menu covers. The custom path is honest about the trade: full ownership of the eval stack, trace store, audit pipeline, and rubric library.

Best for: federal contractors under OFCCP; EU employers under hard GDPR residency; large unionised employers with works-council obligations; HR research teams with exotic cohort taxonomies.

Key strengths:

No data leaves your boundary. Title VII, ADA, ADEA, GINA, GDPR, and Illinois BIPA scope collapses to your own org.
Full control over bias rubric definitions, evaluator versions, drift thresholds, audit retention, and state-by-state filing-cadence integration.
Apache 2.0 primitives self-host in your VPC or air-gapped: ai-evaluation, traceAI, Agent Command Center. Custom operationalisation, not custom primitives — wire NoRacialBias into your retention store, don’t reinvent it.
Cohort taxonomy can carry intersectional categories the SaaS menu doesn’t ship.

Limitations:

You own the upgrade path, rubric curation, judge drift, storage scaling, and dashboard work.
Bias-rubric authoring is a research workload. Cohort design needs a compliance lead, an HR-labelled gold set, and quarterly judge-calibration. IO-psychologist sign-off is not optional.
TCO rarely beats a SOC 2-certified vendor unless platform engineering exists as a team and the residency mandate is genuine.

Use-case fit: OFCCP-audited federal contractors; EU works-council on-prem settlements; HR research labs.

Pricing & deployment: infrastructure plus engineering headcount; budget accordingly.

Verdict: right answer when residency is a hard mandate and the platform org is already there. Wrong answer when the cost narrative is “we’ll save vendor fees.” Use Future AGI’s Apache 2.0 primitives so you don’t reinvent the EvalTemplate library.

Decision matrix: which platform fits which HR buyer

| If you are a… | Pick | Why |
| --- | --- | --- |
| Mid-market employer running screening, interview, or recruiter-copilot AI on OpenTelemetry | Future AGI | All three controls pass; named bias primitives; span-linked audit trail |
| Tier-1 employer or Fortune 500 TA function with full procurement, MSA, SSO | Galileo Luna-2 | Enterprise procurement reflex matches the cycle; bias detection on enterprise tier |
| Engineering-led HR-tech or ATS vendor with SDK-first eval workflow | Braintrust | SOC 2; eval-as-code ergonomics; bias rubric library is yours to author |
| Multi-jurisdiction employer whose binding constraint is the annual AEDT + state impact-assessment cycle | Holistic AI / fairness specialists | Vertical-anchored audit product; pair with continuous-monitoring layer underneath |
| Federal contractor (OFCCP), EU employer (GDPR residency), or unionised employer (works-council) | Custom DIY | Full ownership; use OSS primitives so you don’t reinvent the `EvalTemplate` library |
| NYC AEDT compliance window | Future AGI + AEDT independent auditor | Named primitives produce the audit-vendor evidence trail; auditor signs off |
| Colorado SB 24-205 impact assessment + California AB 2930 adverse-action | Future AGI + statistician/IO psychologist | Cohort 4/5ths reporting + per-decision reasoning; reviewer owns the significance test |
| Illinois employer running AI video interviews | Future AGI + BIPA-compliant retention | Cohort-tagged retention; deletion-on-request; aggregate demographic reporting |

Closing: the three-control ship gate

HR AI in 2026 has two production failure modes. The first is obvious: a bad input arrives and the gateway catches it. The second is silent: a screening decision biased on a name proxy, an interview AI scoring accented speech down, or a recruiter-copilot draft with gender-coded language goes unscored until the next AEDT audit, EEOC charge, or Mobley-pattern discovery surfaces it. Observability dashboards log the second failure. Annual bias-audit services snapshot it once a year. Evaluation platforms catch it continuously, with the trail an investigator, auditor, or magistrate can read from one record.

Run any shortlist through the three controls before procurement signs.

Demographic bias detection. Named evaluator primitives for race, gender, age, plus a sexism evaluator and general disparate-impact. Not a Faithfulness score with a Title VII line bolted on.
Per-decision audit trail. Linkage of input span, retrieved candidate-data field, output, evaluator score, reason, model version, and reviewer override. Tamper-evident. Readable by EEOC, AEDT, Colorado, California, GDPR Article 22, and Mobley-pattern discovery from one record.
Impact-ratio (4/5ths-rule) reporting. Cohort-grouped pass-rate ratios on continuous traffic, with drift alarms when the ratio crosses 0.80.

Of the five platforms above, Future AGI is the only one that passes all three today. Galileo Luna-2 wins for Tier-1 MSA. Braintrust is the engineering-led pick. Holistic AI and the fairness specialists anchor the AEDT cycle. Custom DIY is honest for federal contractors and EU residency-mandated employers.

Ready to evaluate your first HR AI agent? Wire BiasDetection, NoRacialBias, NoGenderBias, NoAgeBias, Sexist, and DataPrivacyCompliance into a pytest fixture against the ai-evaluation SDK, then add traceAI span attribution through Agent Command Center. Get started with Future AGI.

Frequently asked questions

What makes an HR AI evaluation platform different from a generic one?

Three controls a generic eval platform doesn't ship. First, demographic bias detection on the protected classes an EEOC investigator and an NYC AEDT auditor read on — race, gender, age — as named evaluator primitives, not a Faithfulness score with a Title VII line. Second, per-decision audit linkage of input span, retrieved candidate-data field, output, evaluator score, reason, model version, and reviewer override, so an EEOC charge response or a Mobley-pattern discovery request can be answered from one record. Third, impact-ratio reporting against the EEOC 4/5ths rule on continuous production traffic, not just on an annual audit-vendor snapshot. Miss any of the three and you ship a regulator gap dressed up as a feature gap.

Which bias-detection rubrics should an HR team gate releases on?

Four at the floor. BiasDetection catches the general disparate-impact signal across protected-class cohorts. NoRacialBias, NoGenderBias, and NoAgeBias are named evaluator primitives for the three classes an EEOC charge response, an NYC AEDT auditor, and a Colorado AI Act adverse-action review read first. Future AGI's ai-evaluation SDK ships all four as EvalTemplate classes (eval_id 69, 77, 78, 79 in templates.py), plus the Sexist primitive (eval_id 17) for gender-coded language in job-description generation and interview prompts. Pair with the Toxicity, IsPolite, and DataPrivacyCompliance templates so a candidate-facing screening message or recruiter-copilot draft is checked before it leaves the LLM. A statistician or industrial-organizational psychologist still owns the impact-ratio significance test and the AEDT audit sign-off; the eval platform owns the evidence trail.

How do I meet NYC Local Law 144 AEDT bias-audit requirements for an LLM-driven hiring tool?

AEDT compliance requires three artifacts the employer ships, not the vendor: an independent third-party bias audit performed within the prior year, a public summary of audit results, and a candidate notice plus accommodation pathway. The eval platform's job is to produce the underlying evidence surface — per-decision bias scores against protected-class cohorts, drift telemetry between audits, and impact-ratio reporting on the 4/5ths rule — that the independent auditor consumes when they score your tool. A platform that ships BiasDetection, NoRacialBias, NoGenderBias, NoAgeBias, and span-linked audit logs hands your auditor a defensible record; the auditor still has to do the audit.

Can I evaluate an HR AI for compliance without exposing candidate data to a third-party model?

For heuristic checks that don't require an LLM judge — regex, JSON schema, BLEU/ROUGE, semantic similarity, deterministic PII detection — data stays local. Future AGI ships 20+ local heuristic metrics so structural validation never leaves the boundary. The Protect data_privacy_compliance Gemma 3n LoRA adapter runs inline at 65 ms median time-to-label per arXiv 2510.13351, with deterministic fallback covering 18 entity types including SSN, government ID, and direct-identifier fields a resume might carry. LLM-judge calls stay opt-in and scoped to non-PII fields under EEOC technical assistance and GDPR Article 22 minimisation. Air-gapped self-host is available via BYOC for EU-domiciled employers under hard residency mandates.
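As an illustration of what a deterministic, local PII check looks like, here is a minimal regex sketch. It is not the Protect adapter or its fallback: the three patterns are simplified US-centric assumptions, and a production detector would cover many more entity types and formats.

```python
import re

# Simplified, illustrative patterns; real detectors handle far more variants.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text):
    """Replace matches with a [LABEL] token; return the text and labels found."""
    found = []
    for label, pattern in PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, found


clean, labels = redact("Reach Jo at jo@example.com or 555-010-4477. SSN 123-45-6789.")
print(clean)   # Reach Jo at [EMAIL] or [PHONE]. SSN [SSN].
print(labels)  # ['SSN', 'EMAIL', 'PHONE']
```

Because nothing here calls a network service, the check can run before any text is handed to an LLM judge.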

How does the Colorado AI Act affect HR AI evaluation in 2026?

Colorado SB 24-205, effective February 2026, classifies employment-decision AI as a high-risk system and triggers two obligations the eval platform has to support. First, an adverse-action notice with the principal reasons for the decision and the candidate-data fields the model relied on — the eval platform produces the per-decision reasoning a human can act on, not a 0-to-1 score. Second, an annual impact assessment covering known and reasonably foreseeable risks of algorithmic discrimination — the eval platform produces the cohort-grouped bias telemetry the assessment reads from. California AB 2930 (Jan 2026) layers a parallel adverse-action notice obligation; treat the two together as a hard floor on per-decision audit cadence.
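The difference between "per-decision reasoning" and "a 0-to-1 score" is easiest to see in a rendered notice. The sketch below assumes a hypothetical record schema (`decision`, `model_version`, and a list of `reasons` with `text`, `field`, and `weight`); it shows the shape of the artifact only and is not a template for legally sufficient notice wording.

```python
def adverse_action_notice(record, max_reasons=3):
    """Render a plain-text notice from a per-decision record (hypothetical schema)."""
    ranked = sorted(record["reasons"], key=lambda r: r["weight"], reverse=True)
    lines = [
        f"Decision: {record['decision']} (model {record['model_version']})",
        "Principal reasons:",
    ]
    for i, reason in enumerate(ranked[:max_reasons], start=1):
        lines.append(f"  {i}. {reason['text']} [field: {reason['field']}]")
    return "\n".join(lines)


# Hypothetical record, as an eval platform might assemble it from one trace.
record = {
    "decision": "not advanced",
    "model_version": "screener-v3",
    "reasons": [
        {"text": "Less than the posted minimum experience",
         "field": "work_history", "weight": 0.3},
        {"text": "Required certification not listed",
         "field": "certifications", "weight": 0.6},
    ],
}
notice = adverse_action_notice(record)
print(notice)
```

Each reason names the candidate-data field it rests on, which is what lets a candidate correct the underlying data or a reviewer override the decision.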

How often should HR teams re-evaluate production AI tools?

Three cadences. Continuous: drift detection on every production trace with the four-dimension trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution) watched for regressions on protected-class cohorts. Quarterly: a fixed evaluation suite against a held-out, cohort-tagged dataset that catches model-version and prompt regressions on the impact-ratio rubric. Annually: a full NYC AEDT independent bias audit, mandatory under Local Law 144. EU AI Act Annex III(3) lifts the cadence to ongoing for high-risk employment-classification systems from August 2026; the Colorado AI Act and California AB 2930 adverse-action notice obligations push monitoring closer to real-time. State laws are not optional layers on top of EEOC — they are the layer that decides whether your eval cadence is defensible at charge time.
