Best Education AI Evaluation Platforms in 2026
Education AI eval in 2026: five platforms scored on COPPA + FERPA + pedagogical-correctness rubrics. Future AGI, Galileo Luna-2, Braintrust, Khanmigo/Duolingo internal, custom on-prem.
A K-12 district rolled out an AI Socratic tutor in September 2026. By December, parents had flagged a hallucinated treaty date the tutor taught to four schools of 8th-graders before any teacher caught it. The first signal anyone got was a state AG inquiry citing the FTC’s Edmodo COPPA settlement as precedent, plus an IDEA due-process complaint where the same tutor had quietly skipped a 504 accommodation a parent had filed two months earlier. The trace store had the inputs. The observability dashboard had the latency. Nobody had a per-decision audit record linking the hallucination to the retrieved curriculum chunk and the prompt segment that produced it.
That story is the reason an education AI evaluation platform is not interchangeable with a generic LLM eval tool. Education AI eval requires three controls generic platforms don’t ship: a COPPA-compliant data path for under-13 student traffic, a FERPA-aligned audit trail under 34 CFR Part 99, and pedagogical-correctness rubrics (age-appropriate language, on-curriculum factuality, hint-not-answer scaffolding, accessibility against ADA / WCAG 2.2 AA). Pick by all three or you’ll ship a procurement-killer.
This guide compares five platforms education teams should consider in 2026, scored on those three controls. The ranking weights what shows up in a state AG inquiry, an IDEA due-process hearing, and an FTC COPPA enforcement letter.
TL;DR: the five-platform shortlist
| # | Platform | COPPA data path | FERPA audit trail | Pedagogical rubrics | Best for |
|---|---|---|---|---|---|
| 1 | Future AGI | Hybrid local mode keeps free-text student work off third-party judges; Protect data_privacy_compliance Gemma 3n adapter at 65 ms inline | OTel spans with span_id-linked eval scores; tamper-evident audit log; per-tenant retention | Age-appropriate, on-curriculum, hint-not-answer, accessibility, configurable as EvalTemplate runs on ai-evaluation | K-12 district CTOs, tutoring SaaS, LMS-vendor AI teams, accessibility-led EdTech |
| 2 | Galileo Luna-2 | Cloud-only by default; under-13 data path is an enterprise negotiation | Closed cloud audit store; trace export to OTel partial | Luna-2 hallucination scoring; pedagogical rubrics author-it-yourself | Higher-ed IT, standardized-testing procurement |
| 3 | Braintrust | SDK-first; under-13 data path is whatever the team wires inside their VPC | Sandboxed eval store; OTel export via integration | Strong eval primitives; pedagogical rubrics not built-in | Engineering-led tutoring SaaS, LMS-vendor copilots |
| 4 | Khanmigo / Duolingo internal eval | Closed; in-house only | In-house only | Vertical-anchored, in-house only | Reference benchmark, not a buyable platform |
| 5 | Custom on-prem | You own it; consent scope = you | What your storage + IAM team builds | What your ML platform team builds | Districts and EdTech firms with a real ML platform org and a hard data-residency mandate |
Future AGI is the only option here that combines all three controls today: COPPA-aware local execution, FERPA-grade score-to-span audit linkage, and configurable pedagogical-correctness rubrics in a single Apache 2.0 SDK plus managed platform. Galileo and Braintrust are credible second picks when one of the three controls dominates. Khanmigo and Duolingo’s stacks are reference designs, not products; custom on-prem is honest about cost.
Why generic LLM eval falls short for education AI
Education teams ship AI faster than they evaluate it, and the exposure is twofold: private litigation and federal enforcement. Title VI, IDEA, and the ADA carry a private cause of action, and the FTC has shown it will enforce COPPA against edtech vendors. The Edmodo COPPA settlement (May 2023) was the first FTC action against an edtech vendor for AI-era under-13 data violations, with a permanent ban on monetizing under-13 data. The DOJ ADA enforcement pattern against inaccessible online learning platforms is the second precedent layer.
Generic LLM eval breaks on three education-specific axes. First, education outputs are read by regulators, parents, and counsel, not just users. The score has to come with a reason an IDEA hearing officer or a state AG investigator can read; a single 0-to-1 number is the wrong artifact. Second, FERPA at 34 CFR Part 99 governs disclosure and retention of student education records; eval pipelines have to keep student-identifier signal out of third-party LLM judges unless the evaluator operates as a school-official-equivalent under the district’s contract. Third, the failure modes are silent at the student level: a tutoring hallucination propagates across schools before a teacher catches it, an IEP copilot skips a 504 accommodation, and an auto-grader drifts on Title VI protected-class cohorts after a model upgrade. These surface as span-level signals, not UX signals, and read more like a bias detection workload than a usability issue.
Gateways control inputs. AI-content detectors catch student AI use point-in-time. Evaluation platforms determine whether a hallucinated treaty date reaches four schools of 8th-graders before a teacher catches it.
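The auto-grader drift failure described above can be sketched as a per-cohort comparison of evaluator scores before and after a model upgrade. This is a minimal illustration, not any vendor's method: the cohort labels, the `cohort_drift` helper, and the 0.05 threshold are all assumptions, and a production check would likely use a proper statistical test on a much larger gold set.

```python
from statistics import mean


def cohort_drift(before, after, threshold=0.05):
    """Return cohorts whose mean score moved by more than `threshold`.

    `before` and `after` map a cohort label to a list of grader scores
    in [0, 1], collected on the same gold set under two model versions.
    The threshold is an illustrative assumption, not a regulatory value.
    """
    flagged = {}
    # Only compare cohorts that were scored under both model versions.
    for cohort in sorted(before.keys() & after.keys()):
        delta = mean(after[cohort]) - mean(before[cohort])
        if abs(delta) > threshold:
            flagged[cohort] = round(delta, 3)
    return flagged


# Hypothetical scores for two cohorts across a model upgrade.
before = {"cohort_a": [0.80, 0.82, 0.78], "cohort_b": [0.81, 0.79, 0.80]}
after = {"cohort_a": [0.80, 0.81, 0.79], "cohort_b": [0.70, 0.72, 0.68]}

print(cohort_drift(before, after))
```

Running the comparison on every release, rather than once at procurement, is what turns it into a span-level signal instead of a complaint-driven one.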
The three-control scorecard
Most listicles compare on features and call it a day. Education needs a sharper rubric. The three controls below come from a state AG inquiry, an IDEA hearing transcript, and an FTC COPPA file.
| Control | Pass criteria | Why it matters |
|---|---|---|
| COPPA-compliant data path | Documented local heuristic mode for free-text student work; inline PII redaction before any third-party LLM hop; explicit handling for under-13 surfaces consistent with 16 CFR Part 312 | Edmodo set the precedent; the FTC is looking for the second one |
| FERPA-aligned audit trail | Per-decision audit linking input, retrieved curriculum chunk, evaluator score, reason, and educator override; tamper-evident; per-tenant retention controls under 34 CFR Part 99 | District counsel needs an evidence surface that maps to the obligation, not a JSON log dump |
| Pedagogical-correctness rubrics | Pre-built or first-class configurable rubrics for age-appropriate language, on-curriculum factuality, hint-not-answer scaffolding, and accessibility against ADA Title II / WCAG 2.2 AA | Education failures are pedagogy and accessibility failures more than they are factuality failures alone |
A platform that passes all three is a production pick. Two of three is a candidate. One of three is a vendor pitch.
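The pass-count rule above is simple enough to encode directly. The sketch below is one way to do it; the `Scorecard` class and its field names are illustrative, and the "not shortlisted" label for zero passes is an assumption, since the rule above only names the outcomes for one, two, and three passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Three yes/no controls from the scorecard table."""
    coppa_data_path: bool
    ferpa_audit_trail: bool
    pedagogical_rubrics: bool

    def verdict(self) -> str:
        passed = sum(
            [self.coppa_data_path, self.ferpa_audit_trail, self.pedagogical_rubrics]
        )
        labels = {3: "production pick", 2: "candidate", 1: "vendor pitch"}
        # Zero passes is not named in the article; this label is an assumption.
        return labels.get(passed, "not shortlisted")


print(Scorecard(True, True, False).verdict())
```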
The 2026 education regulatory pressure stack
| Rule | What your eval platform has to produce |
|---|---|
| FERPA, 34 CFR Part 99 | Tamper-evident per-decision records linking student-identifier signal, output, and evaluator score, scoped to the school-official-equivalent contract |
| COPPA, 16 CFR Part 312 | A data path that keeps under-13 free-text work out of third-party LLM judges by default |
| IDEA, 20 USC §1400 et seq. | Per-IEP score-and-reason records for any AI-assisted accommodation generation; due-process-ready evidence |
| ADA Title II / III + WCAG 2.2 AA | Accessibility rubric runs that catch missing alt-text, reading-level mismatch, and screen-reader-unsafe markup before deploy |
| ED OET AI Report | Documented evaluator rubrics plus human-oversight evidence for tutoring and assessment AI |
| State laws (CO HB 24-1130, FL HB 7027, TX SB 1893) | Release-cadence eval evidence; per-cohort drift detection on protected classes (Title VI) |
| EU AI Act Annex III(3) | Conformity-assessment-grade eval records; human-readable reasoning per decision (Article 6 high-risk, enforcement August 2026) |
Two practical implications: the eval layer has to integrate with the district’s existing FERPA-retention infrastructure, and at least some evaluators have to run inside that boundary so under-13 student work never reaches a third-party model by default.
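To make "tamper-evident per-decision record" concrete, here is a minimal hash-chain sketch using only the Python standard library. It is not any platform's implementation: the field names (`span_id`, `curriculum_chunk_id`, `educator_override`, and so on) are hypothetical, chosen to mirror the linkage the scorecard asks for, and a real deployment would also need durable storage, access control, and retention handling.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry in the chain


def _digest(prev_hash, record):
    # Canonical JSON so the same record always hashes the same way.
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(log, record):
    """Append a decision record, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"record": record, "prev_hash": prev_hash, "hash": _digest(prev_hash, record)}
    log.append(entry)
    return entry


def verify(log):
    """Return True only if no entry was edited, removed, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_record(log, {
    "span_id": "span-001",                        # hypothetical identifiers
    "input_ref": "redacted-prompt-17",
    "curriculum_chunk_id": "grade8-history-unit4-chunk12",
    "evaluator": "on_curriculum_factuality",
    "score": 0.2,
    "reason": "Treaty date contradicts the retrieved chunk.",
    "educator_override": None,
})
print(verify(log))
```

Because each hash covers the previous one, editing a stored score after the fact invalidates that entry and every entry after it, which is the property counsel is asking for when they say "tamper-evident".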
#1 Future AGI: COPPA-aware local execution, FERPA-grade span linkage, pedagogical rubrics as code
Future AGI is the production-grade pick for education teams that want all three controls in one platform. SOC 2 Type II, HIPAA, GDPR, and CCPA are certified per the trust page; ISO/IEC 27001 is in active audit. The ai-evaluation SDK (Apache 2.0) ships 50+ pre-built evaluators plus 20+ local heuristic metrics; pedagogical-correctness rubrics layer on top via the in-product custom-evaluator agent. The OTel-native trace layer links every evaluator score back to the span that produced it, so a state AG investigator can walk from “wrong answer” to “the prompt segment plus retrieved curriculum chunk that produced it” inside the district’s retention boundary.
Best for: K-12 district CTO offices running continuous tutoring and IEP-AI in production; tutoring SaaS engineering; LMS-platform AI teams (Canvas, Blackboard, Moodle, Schoology); accessibility-led EdTech vendors.
Key strengths:
- Pedagogical rubrics configurable as code. `ai-evaluation` ships `Factual Accuracy`, `Groundedness`, `Toxicity`, `Tone`, `PII Detection`, `Completeness`, `Context Adherence`, and `Chunk Attribution` as EvalTemplate classes. Configure `AgeAppropriateLanguage`, `OnCurriculumFactuality`, `HintNotAnswer`, and `AccessibilityCompliance` as custom evaluators via the in-product agent. Classifier-backed evaluators run at lower per-eval cost than Galileo Luna-2.
- COPPA-compliant data path at two layers. Hybrid local mode routes 20+ heuristic metrics (regex, JSON schema, BLEU/ROUGE, semantic similarity, reading-level scoring) to local execution at zero API cost, keeping free-text student work off third-party LLM judges by default. The Protect `data_privacy_compliance` Gemma 3n LoRA adapter runs inline at 65 ms median time-to-label for text and 107 ms for image per the Protect paper (arXiv 2510.13351). The same adapter doubles as the offline `DataPrivacyCompliance` rubric so CI gate and inline guardrail share a model.
- FERPA-grade audit retention. `traceAI` (Apache 2.0) auto-instruments OpenAI, LangChain, Groq, Portkey, Gemini, and 50+ AI surfaces at import time. Span-layer PII redaction strips student-identifier signal before export. Eval scores link to spans via `span_id`; per-tenant retention, RBAC, and tamper-evident logs ship in Agent Command Center.
- Error Localization for educator-flagged outputs. Field-level error localization attributes a failed eval to a specific input key (prompt segment, retrieved curriculum chunk, or student-context field). That’s the score-and-reason record an IDEA hearing officer needs.
- Closed loop with optimization. `agent-opt` ships PROTEGI, GEPA, and MetaPrompt optimizers that improve a teacher-labeled rubric against live trace data. The hint-not-answer rubric stays calibrated across model upgrades.
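The idea behind a local heuristic path can be illustrated without any vendor SDK. The sketch below is an assumption-laden stand-in, not Future AGI's code: it redacts two identifier patterns with regexes (the `S` plus six digits student-ID format is hypothetical) and scores reading level with the standard Flesch-Kincaid grade formula, using a crude vowel-group syllable counter. Everything runs in-process, so no student text leaves the machine.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
STUDENT_ID = re.compile(r"\bS\d{6}\b")  # hypothetical district ID format


def redact(text):
    """Replace identifier-like substrings before any third-party hop."""
    text = EMAIL.sub("[EMAIL]", text)
    return STUDENT_ID.sub("[STUDENT_ID]", text)


def count_syllables(word):
    # Rough heuristic: one syllable per run of vowels, minimum one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    """Approximate U.S. grade level of `text` (Flesch-Kincaid formula)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59


student_text = "Email jo@example.org about S123456 and the treaty homework."
safe_text = redact(student_text)
print(safe_text)
print(round(flesch_kincaid_grade(safe_text), 1))
```

Regex redaction misses names and free-form identifiers, which is why the article pairs the heuristic layer with a model-based privacy check rather than relying on patterns alone.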
Limitations:
- Newer platform than Galileo; smaller higher-ed Tier-1 reference base than enterprise incumbents.
- No documented air-gapped on-prem release. The Apache 2.0 SDK plus traceAI plus Agent Command Center self-host inside the district’s VPC; a single-binary on-prem distribution is not the marketed path.
- Knowledge Base API surface is still v0 for deep curriculum-RAG eval workloads.
- Real-time voice-tutor evaluation is post-recording, not mid-conversation.
Use-case fit: AI tutoring (K-12 and adult learning); auto-grading with Title VI cohort drift detection; IEP / 504-plan generation with per-IEP audit records; curriculum content generation with on-curriculum factuality gates; LMS-platform AI features; standardized-testing scoring AI.
Pricing & deployment. Cloud plus OSS self-host (Apache 2.0). Start free; usage-based billing scales with volume. SOC 2 Type II, a HIPAA BAA, SAML SSO plus SCIM, and dedicated support are added as you scale. See pricing.
Verdict: the only platform that passes the three-control scorecard out of the box today. Pick Future AGI when tutoring hallucination drift between model releases is the binding constraint, when the FERPA audit trail has to survive a state AG inquiry or an IDEA hearing, and when the same model has to gate CI on accessibility compliance and hint-not-answer behavior.
Pair this with the LLM-as-a-judge best practices guide, the LLM evaluation architecture deep dive, and the best AI gateways for education comparison.
#2 Galileo Luna-2: higher-ed and standardized-testing procurement
Galileo is the strongest pick if your education organization is large enough that procurement, SSO, and an MSA-first vendor approach matter more than open-source flexibility. Luna-2 is Galileo’s named hallucination model. The higher-ed reflex is “if Legal already cleared them for fintech or healthcare, the education extension is straightforward.”
Best for: university IT, state Department of Education procurement, standardized-testing organizations (ETS, College Board, ACT), and large LMS-platform vendors with an MSA-first buying cycle.
Key strengths:
- Luna-2 hallucination scoring with public benchmark numbers; mature on the factuality axis, which is where the tutoring failure mode bites first.
- Runtime guardrails that block outputs at inference time, increasingly relevant on K-12 tutoring surfaces.
- Enterprise security posture clears university IT InfoSec quickly: SSO, SAML, audit log, RBAC.
- Drift detection on student-outcome cohorts via custom dashboards; pairs with Title VI auto-grading audit requirements when configured.
Limitations:
- COPPA data path for under-13 student traffic is an enterprise negotiation, not a default. Cloud-only means free-text K-5 / K-8 work flows through Galileo’s own infrastructure.
- Pedagogical-correctness rubrics aren’t named primitives. AgeAppropriateLanguage, HintNotAnswer, OnCurriculumFactuality, and AccessibilityCompliance are rubrics you author.
- Closed-source. Extending evaluators with custom rubrics is a vendor request.
- Pricing skews toward Tier-1 budgets.
Use-case fit: standardized-testing scoring AI; university-grade tutoring at research-institution scale; LMS-vendor AI features where the buyer is a Tier-1 procurement function.
Pricing & deployment: enterprise contract, managed cloud. Custom pricing; under-13 data-path terms confirmed at sales.
Verdict: the procurement-safe pick for Tier-1 higher-ed and standardized-testing MSA processes. Less flexible than Future AGI on the COPPA data path and pedagogical-rubric extensibility. Future AGI’s classifier-backed evaluators run at lower per-eval cost than Galileo Luna-2 when cost is the deciding factor.
#3 Braintrust: SDK-first eval for engineering-led tutoring SaaS
Braintrust is the engineering-led pick for tutoring SaaS and LMS-vendor copilot teams that want a code-first, sandboxed eval workflow. Eval datasets, prompts, and scoring functions live alongside application code in the same repo. Compliance posture is acceptable on enterprise terms; pedagogical rubrics are yours to author.
Best for: engineering-led tutoring SaaS; ML platform teams inside larger EdTech vendors; LMS-vendor copilot teams that want eval datasets versioned alongside code.
Key strengths:
- Strong SDK ergonomics. Engineers stay in their existing tooling; eval datasets and scoring functions sit next to application code.
- Sandboxed agent eval execution, useful for testing tool-using tutoring agents on synthetic student scenarios.
- Clean trace store with eval scores per row; works for engineering postmortems and CI gates.
Limitations:
- The under-13 COPPA data path is whatever the engineering team wires inside their VPC. No local-mode equivalent to Future AGI’s hybrid execution.
- Pedagogical-correctness rubrics are author-it-yourself. AgeAppropriateLanguage, HintNotAnswer, OnCurriculumFactuality, and AccessibilityCompliance don’t ship as named primitives.
- Audit-trace surface is engineering-shaped, not regulator-shaped. Producing a per-decision artifact an IDEA hearing officer can read in 30 seconds takes additional wiring.
- Newer to education relative to Galileo; procurement at a Tier-1 district is a longer conversation.
Use-case fit: tutoring SaaS with strong engineering teams; LMS-vendor copilots; ambient classroom-assistant vendors with eval-as-code workflows.
Pricing & deployment: SaaS with free and paid tiers; enterprise tier carries the additional compliance terms most education contracts need.
Verdict: an engineering-pleasant eval workflow. Pedagogical-rubric library is yours to build; the FERPA audit-trail surface needs additional wiring before counsel signs off. Pick Braintrust when the ML platform team is the buyer; pick Future AGI when the compliance lead has a seat at the table.
#4 Khanmigo and Duolingo internal eval: reference designs, not products
Khan Academy’s Khanmigo and Duolingo’s internal eval pipeline are the two best-known vertical-anchored education AI eval stacks in 2026. Khanmigo’s hint-not-answer scaffolding is the closest public reference design for Socratic tutoring evaluation; Duolingo’s per-skill regression suite is the closest reference for adaptive-learning eval. Neither is a buyable platform.
Best for: benchmarking against. If you’re building a tutoring or adaptive-learning copilot, the internal eval pipelines inside Khanmigo and Duolingo are the bar to clear.
Key strengths (as reference designs):
- Vertical-anchored rubrics. Khanmigo’s hint-not-answer prompt-shape rubric and Duolingo’s per-skill mastery regression are canonical examples of pedagogical-correctness rubrics in production at scale.
- Curriculum-grounded retrieval. Both anchor LLM outputs to curriculum chunks the publisher maintains internally; that grounding is what most generic eval stacks miss.
- Public scholarly work. Khanmigo team posts and Duolingo’s research releases give engineering teams enough surface area to reverse-engineer the rubric shape.
Limitations:
- Not for sale. No SDK, no API, no SaaS tier.
- No third-party visibility into rubric weights, gold-set composition, or calibration cadence.
- Not portable. A K-12 district can’t buy the Khanmigo eval pipeline.
Use-case fit: reference-design study. Lift the rubric shape, implement the primitives on Future AGI’s ai-evaluation SDK or a custom on-prem stack.
Verdict: the right benchmark for vertical anchoring on tutoring and adaptive learning; the wrong answer to “which platform do we buy.”
#5 Custom on-prem stack: full ownership for orgs with a real ML platform team
Some districts and EdTech firms won’t ship student work to any third party. Some state DoE deployments have residency mandates a school-official-equivalent contract can’t satisfy. Some federal-contractor research institutions live under GDPR Article 22 for international students. The custom path is honest about the trade: full ownership of the eval stack, trace store, audit pipeline, and rubric library.
Best for: state DoE procurement with hard residency mandates; federal-contractor research universities; EU-domiciled higher-ed institutions under GDPR Article 22.
Key strengths:
- No student work leaves the district’s boundary. COPPA under-13 fan-out risk is gone by construction.
- Full control over rubric definitions, evaluator versions, drift thresholds, and audit retention.
- Open-source primitives are real. `ai-evaluation`, `traceAI`, and Agent Command Center (all Apache 2.0) self-host inside the district’s VPC. The custom path is custom operationalization, not custom primitives.
Limitations:
- You own the upgrade path, rubric curation, judge drift, storage scaling, and dashboard customization.
- Pedagogical-rubric authoring is a research workload, not a sprint. AgeAppropriateLanguage and HintNotAnswer need a teacher lead, a labeled gold set, and a quarterly calibration review.
- Total cost of ownership rarely beats a vendor with named pedagogical rubrics unless platform engineering exists as a team.
- The audit-trace artifact is whatever you build it to be.
Use-case fit: state DoE-tier deployments under hard residency mandates; EU-domiciled universities under GDPR Article 22; academic AI labs with dedicated ML platform engineering.
Pricing & deployment: infrastructure plus engineering headcount. Pair with ai-evaluation and traceAI so the primitives match what FERPA-aligned vendors run.
Verdict: the right answer when data residency is a hard mandate and the platform org is already there. The wrong answer when the cost narrative is “we’ll save vendor fees.” The headcount math rarely works at district scale.
Decision matrix: which platform fits which education buyer
| If you are a… | Pick | Why |
|---|---|---|
| K-12 district CTO with continuous tutoring or IEP-AI in production | Future AGI | All three controls pass out of the box; local heuristic path keeps under-13 work off third-party judges |
| Higher-ed IT or state DoE procurement with mature MSA cycle | Galileo Luna-2 | Enterprise procurement reflex matches your buying cycle; Luna-2 hallucination scoring |
| Tutoring SaaS engineering team, eval-as-code workflow | Braintrust | SDK-first ergonomics; pedagogical-rubric library is yours to build |
| LMS-platform vendor shipping AI features | Future AGI (engineering-led) or Galileo (procurement-led) | Choose by buyer profile: engineering owns the eval pipeline vs. compliance owns the MSA |
| Standardized-testing organization (ETS, College Board, ACT) | Galileo Luna-2 or Future AGI | Galileo for the Tier-1 MSA reflex; Future AGI when continuous Title VI cohort drift is binding |
| Adaptive-learning team benchmarking against the field | Khanmigo / Duolingo reference + Future AGI | Lift the rubric shape from public reference designs; implement on a buyable platform |
| State DoE deployment with hard data-residency mandate and a real ML platform org | Custom on-prem | Full ownership; use OSS primitives so you’re not reinventing rubrics or trace formats |
| Accessibility-led EdTech vendor with ADA / WCAG 2.2 AA gates on every release | Future AGI | AccessibilityCompliance configurable as a custom evaluator; CI and inline guardrail share a model |
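One narrow slice of an accessibility gate, catching images with no `alt` attribute in generated lesson markup, can be sketched with the standard-library HTML parser. This is an illustrative check only, not a WCAG 2.2 AA conformance test and not the `AccessibilityCompliance` evaluator itself; the class and function names are made up for the example. An empty `alt=""` is deliberately allowed, since that is how decorative images are marked.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> that has no alt attribute at all."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            # alt="" is valid for decorative images, so only flag absence.
            if "alt" not in attr_map:
                self.missing.append(attr_map.get("src", "<no src>"))


def images_missing_alt(html):
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing


lesson = '<p><img src="map.png" alt="Map of Europe in 1815"><img src="chart.png"></p>'
print(images_missing_alt(lesson))
```

A release gate could fail the build whenever this list is non-empty, with reading-level and screen-reader checks added as separate rubric runs.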
Closing: the three-control ship gate
Education AI in 2026 has two production failure modes. The first is obvious: a bad input gets through. Gateways handle that. The second is silent: a confident-sounding tutoring answer is wrong, ungrounded in the curriculum, or age-inappropriate, or it skips an IDEA accommodation, and nobody scored it before it landed in front of four schools of 8th-graders. Observability dashboards log the second failure. Evaluation platforms catch it continuously.
Run any shortlist through the three-control scorecard before procurement signs:
- COPPA-compliant data path. Documented local heuristic mode, inline PII redaction, explicit handling for under-13 surfaces. Not a privacy policy on a website.
- FERPA-aligned audit trail. Per-decision linkage between input, retrieved curriculum chunk, evaluator score, reason, and educator override. Tamper-evident. Per-tenant retention. Not a JSON log file.
- Pedagogical-correctness rubrics. Age-appropriate language, on-curriculum factuality, hint-not-answer scaffolding, accessibility against ADA Title II / WCAG 2.2 AA as named primitives. Not a generic Faithfulness score with an education slide.
Of the five options above, Future AGI is the only one that passes all three out of the box. Galileo Luna-2 wins for Tier-1 higher-ed and standardized-testing MSA processes. Braintrust is the engineering-led pick for tutoring SaaS that owns the eval pipeline in code. Khanmigo and Duolingo are reference designs, not products. Custom on-prem is the honest pick for state DoE deployments with hard residency mandates.
Ready to evaluate your first education AI agent? Wire Factual Accuracy, Tone, PII Detection, and a custom HintNotAnswer rubric into a pytest fixture against the ai-evaluation SDK, then add traceAI span_id attribution when production traces start asking questions the CI gate missed. Explore the Future AGI evaluation platform and follow the LLM evaluation playbook.
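As a starting point for that kind of gate, the sketch below shows the shape of a pytest-discoverable check for hint-not-answer behavior. It deliberately does not call the ai-evaluation SDK, whose API is not documented in this article; `hint_not_answer` is a hypothetical local heuristic standing in for a real rubric, and the tutor replies are invented. Swapping the heuristic for an SDK evaluator would keep the test structure the same.

```python
import re


def hint_not_answer(reply: str, final_answer: str) -> dict:
    """Hypothetical stand-in rubric: pass if the reply asks a guiding
    question and does not state the final answer verbatim."""
    pattern = rf"(?<!\w){re.escape(final_answer)}(?!\w)"
    reveals = re.search(pattern, reply, re.IGNORECASE) is not None
    asks = "?" in reply
    if reveals:
        reason = "reply states the final answer"
    elif not asks:
        reason = "reply contains no guiding question"
    else:
        reason = "reply guides without revealing the answer"
    return {"passed": (not reveals) and asks, "reason": reason}


# pytest collects functions named test_*; no pytest import is needed here.
def test_hint_is_accepted():
    result = hint_not_answer("What happens if you subtract 3 from both sides?", "x = 4")
    assert result["passed"], result["reason"]


def test_direct_answer_is_rejected():
    result = hint_not_answer("The answer is x = 4.", "x = 4")
    assert not result["passed"], result["reason"]
```

Returning a reason alongside the pass/fail flag keeps the CI output readable by someone who is not an engineer, which is the same requirement the audit trail has.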