Best Legal AI Evaluation Platforms in 2026: The Four-Test Scorecard
Five legal AI eval platforms scored on four tests: clause-citation validity, privilege, jurisdiction-correct authority, refusal calibration. May 2026.
An associate at an AmLaw 200 firm filed an opposition brief citing four cases. Three were real. One was a confident-sounding artefact the research copilot fabricated through a retrieval gap, and no eval pass caught it before the brief left the building. Rule 11 sanctions issued against the partner of record. The firm’s AI stack had a gateway controlling inputs, a benchmark score against LegalBench, and a paid LLM observability dashboard. What it did not have was a per-output evaluator score, with reasoning the partner could inspect, that ran before the brief got filed.
The eval platform that would have caught the fabricated citation has to clear four tests, not one. Clause-level citation validity: the claim is in the cited clause, the quoted span matches the contract text verbatim. Privilege awareness: attorney-client material doesn’t leave the firm boundary. Jurisdiction-correct authority: a Delaware case applied to a California question scores down. Refusal calibration: better silent than wrong. The platform that ships all four is the one your firm’s risk partner will sign off on. The platform that ships three is a research toy.
This guide compares five evaluation platforms against that scorecard: Future AGI, Galileo Luna-2, Braintrust, the legal-AI-specialist vendors (Harvey, Legora, and the verticals), and a custom DIY stack. We weight what bar associations and courts care about, not what vendors want to sell. Each platform gets an honest verdict, including where it isn’t the right pick.
TL;DR: the four-test scorecard
| # | Platform | Best for | Where it falls short |
|---|---|---|---|
| 1 | Future AGI | clause-citation + privilege + jurisdiction + refusal in one Apache 2.0 stack, open-source SDK + Platform + Agent Command Center gateway, SOC 2 / HIPAA / GDPR / CCPA | smaller community than Langfuse; in-product authoring agent is on the Platform tier, not OSS |
| 2 | Galileo Luna-2 | AmLaw 100/200 procurement, runtime hallucination gate, Luna-2 model in third-party benchmarks | custom evaluators are a vendor request; closed source; per-eval cost higher than Future AGI at scale |
| 3 | Braintrust | engineering-led teams that want the eval primitives and write the legal-specific rubrics themselves | thinner built-in template catalogue for legal workloads; you wire most of the privilege story |
| 4 | Harvey / Legora / vertical specialists | firms that want a turnkey legal AI product with its own internal eval bench | the eval bench is the vendor’s; you don’t get the rubrics, the scores, or the trace store as auditable artefacts |
| 5 | Custom DIY (pytest + LiteLLM + OTel) | research teams who own the platform and want zero vendor coupling | every rubric, drift monitor, and trace store is your team’s responsibility; labour cost climbs faster than the SaaS bill |
The non-negotiables across the five: per-tenant isolation, a lawyer-reviewed eval dataset, a refusal path on ambiguous clauses, and deterministic citation validation held to a near-100% pass-rate floor.
Why legal AI eval is not generic LLM eval
Three things shift the moment client-confidential data is in the pipeline.
The unit of error is jurisdictional and citational. A hallucinated case is a Rule 11 issue and a Mata v. Avianca-style sanction risk. The 2nd Circuit followed with Park v. Kim in 2024. A real case mis-cited is malpractice. A real case applied to the wrong jurisdiction is a competence breach under ABA Model Rule 1.1. Generic groundedness scores miss that: they score whether the text is supported by the retrieved context, not whether the retrieved context was the right authority for the question.
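The gap between "supported by the retrieved context" and "the right authority for the question" can be made concrete. The sketch below is a minimal illustration, not any platform's API: `Authority`, `jurisdiction_score`, and the `BINDING` map are hypothetical names, and the map itself is a toy stand-in for the authority rules a lawyer would curate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Authority:
    citation: str      # placeholder label, not a real case
    jurisdiction: str  # tag carried through retrieval, e.g. "CA" or "DE"


def jurisdiction_score(question_jurisdiction: str,
                       authorities: list[Authority],
                       binding: dict[str, set[str]]) -> float:
    """Fraction of cited authorities whose tag is acceptable for the question."""
    if not authorities:
        # Nothing cited, nothing misapplied; other rubrics cover missing support.
        return 1.0
    allowed = binding.get(question_jurisdiction, {question_jurisdiction})
    hits = sum(1 for a in authorities if a.jurisdiction in allowed)
    return hits / len(authorities)


# Hypothetical map of which jurisdiction tags count for each question jurisdiction.
BINDING = {
    "CA": {"CA", "US-9th-Cir", "US-SCOTUS"},
    "DE": {"DE", "US-3d-Cir", "US-SCOTUS"},
}

cited = [Authority("Example A", "CA"), Authority("Example B", "DE")]
score = jurisdiction_score("CA", cited, BINDING)  # one of two authorities fits
```

A groundedness rubric would pass both citations if the text matched the retrieved context; this check scores the Delaware authority down because its tag does not fit a California question. Real practice distinguishes binding from persuasive authority, which a flat set like this does not capture.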
The data path is constrained. Privileged client communications and confidential matter data should not leave the firm boundary, so subjective LLM-as-judge calls have to either run inside that boundary or be scoped away from confidential fields. ABA Formal Opinion 512 made the same point at the national level. Cross-tenant retrieval is a malpractice-grade failure, and it is a configuration problem, not a model problem.
The supervision record has to survive bar inquiry. ABA Model Rule 5.3 expects the supervising attorney to make reasonable efforts to ensure AI output is reviewed; “reasonable” in 2026 means a documented eval pass with a per-output score the partner can inspect. FRCP Rule 11 and Rule 26(g) bake the same reasonable-inquiry expectation into civil procedure.
Two practical implications: the platform has to produce reviewer-friendly traces a partner can audit, and at least some of the evaluators have to run inside the firm boundary.
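One way to picture "scoped away from confidential fields" is a filter that runs before any record leaves the boundary for a remote judge. This is a rough sketch under stated assumptions: the field names in `CONFIDENTIAL_FIELDS` and the function `scope_for_remote_judge` are hypothetical, and a single email regex is nowhere near a complete redaction layer.

```python
import re

# Hypothetical field names; a real deployment would derive these from the matter schema.
CONFIDENTIAL_FIELDS = {"client_name", "matter_notes", "privileged_thread"}

# Deliberately simple pattern for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scope_for_remote_judge(record: dict) -> dict:
    """Return a copy with confidential fields dropped and emails masked."""
    scoped = {}
    for key, value in record.items():
        if key in CONFIDENTIAL_FIELDS:
            continue  # never leaves the firm boundary
        if isinstance(value, str):
            value = EMAIL.sub("[REDACTED_EMAIL]", value)
        scoped[key] = value
    return scoped


record = {
    "client_name": "Acme",
    "question": "Contact jane@example.com about clause 4",
    "score": 1,
}
safe = scope_for_remote_judge(record)
```

The deterministic checks can still run on the full record in-process; only the scoped copy goes to a judge outside the boundary.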
The four-test scorecard
Most listicles compare platforms on feature counts and pricing tiers. Legal needs a sharper rubric.
| Test | What it measures | Why it matters in legal practice |
|---|---|---|
| Clause-level citation validity | Each cited clause ID exists in the retrieval context; each quoted span matches the clause text verbatim (Levenshtein tolerance for OCR drift) | Mata v. Avianca; Park v. Kim; Rule 11 reasonable inquiry; Rule 3.3 candor toward the tribunal |
| Privilege awareness | Local-only execution path for confidentiality-bearing rubrics; per-tenant namespaces; PII redaction on trace logs; self-host gateway and trace store | Model Rule 1.6 client confidentiality; attorney work product protection; the platform’s local-only paths run inside the firm’s existing privilege-protection workflow |
| Jurisdiction-correct authority | Cross-jurisdiction misapplication scores down; not just substring matching on case names; jurisdiction tag survives round-trip through trace and eval store | Model Rule 1.1 competence; per-state bar opinions in CA, NY, FL, DC, TX; a real rule applied in the wrong jurisdiction is a competence breach, not a citation hallucination |
| Refusal calibration | AnswerRefusal scores high on ambiguous or out-of-playbook clauses; per-clause-type, per-jurisdiction thresholds; low refused-when-easy and answered-when-hard rates that hold in production | Better silent than wrong; the bot that pattern-matches to the closest plausible answer when the playbook is silent is a malpractice surface |
Citation validity is the deterministic floor. Privilege and jurisdiction shape the data path. Refusal calibration is the test most platforms quietly fail in production: once latency targets tighten, refusal heads over-confidently answer questions they should hand back.
A platform that scores high on three of four is a strong candidate for pilots and research. The platform that ships all four is the production pick.
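The citation-validity row is the easiest of the four to sketch in plain code. The example below assumes the retrieval context arrives as a dict of clause ID to clause text; `min_edits_to_substring`, `validate_citation`, and the `max_edits` tolerance are illustrative names and defaults, not part of any SDK discussed here.

```python
def min_edits_to_substring(pattern: str, text: str) -> int:
    """Smallest Levenshtein distance between pattern and any substring of text."""
    prev = [0] * (len(text) + 1)  # an empty pattern matches anywhere at zero cost
    for i, pc in enumerate(pattern, start=1):
        curr = [i]  # pattern[:i] against an empty stretch of text
        for j, tc in enumerate(text, start=1):
            cost = 0 if pc == tc else 1
            curr.append(min(prev[j] + 1,          # extra character in the quote
                            curr[j - 1] + 1,      # extra character in the clause
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return min(prev)


def normalize(text: str) -> str:
    """Collapse whitespace so line wraps do not count as edits."""
    return " ".join(text.split())


def validate_citation(clause_id: str, quote: str,
                      retrieved: dict[str, str], max_edits: int = 2) -> dict:
    """Check that the clause ID exists and the quote appears near-verbatim in it."""
    if clause_id not in retrieved:
        return {"valid": False, "reason": "clause_id_not_in_context"}
    q = normalize(quote)
    if not q:
        return {"valid": False, "reason": "empty_quote"}
    if min_edits_to_substring(q, normalize(retrieved[clause_id])) > max_edits:
        return {"valid": False, "reason": "quote_mismatch"}
    return {"valid": True, "reason": "ok"}


retrieved = {
    "7.2": "The Supplier shall indemnify the Customer against all third-party claims."
}
exact = validate_citation("7.2", "shall indemnify the Customer", retrieved)
ocr_drift = validate_citation("7.2", "shall indernnify the Customer", retrieved)
fabricated = validate_citation("7.2", "shall not be liable for any claims", retrieved)
missing = validate_citation("9.9", "shall indemnify the Customer", retrieved)
```

The tolerance is a fixed edit count here for simplicity; scaling it with quote length is a reasonable variation. Because the check is deterministic and needs no model call, it can run inside the firm boundary on every output.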
How we ranked these five
Three filters. The platform had to support legal-relevant rubrics out of the box (Groundedness, Factual Accuracy, citation grounding, AnswerRefusal, plus a structural validator). It had to expose a trace format that survived round-tripping through an OpenTelemetry-compatible audit store. It had to support a local execution path for confidentiality-relevant checks so confidential matter data did not have to leave the firm boundary. Pricing was not weighted.
#1 Future AGI: the four-test pick
Future AGI clears all four tests in one Apache 2.0 stack. The ai-evaluation SDK ships 50+ EvalTemplate classes, including Groundedness, ContextAdherence, ContextRelevance, FactualAccuracy, Completeness, ChunkAttribution, ChunkUtilization, and AnswerRefusal. That’s the exact rubric set a lawyer-graded probe set exercises. CustomLLMJudge with Jinja2 grading criteria authors the legal-specific bits (clause-extraction precision, redline quality, jurisdiction-correct authority). Field-level error localization tells the supervising partner which prompt segment, retrieved authority, or matter-context chunk produced a contested citation.
Best for: mid-market legal-tech vendors, in-house legal AI engineering teams, and AmLaw firms that want one platform covering all four tests. Especially strong when the team runs OpenTelemetry and wants eval and trace joined.
The four tests:
- Citation validity. `Groundedness` plus a deterministic clause-ID-and-quoted-span check ride as floor rubrics; the contract review RAG guide shows the wiring. Field-level error localization pinpoints retrieval vs generation failure.
- Privilege awareness. Hybrid mode routes 20+ local heuristic metrics (regex, JSON schema, BLEU, ROUGE, semantic similarity) to in-process execution so structural validations never leave the firm boundary. Agent Command Center self-hosts as a single Go binary in your VPC. Protect’s `data_privacy_compliance` adapter redacts PII at 65 ms median time-to-label, with a deterministic 18-entity fallback. SOC 2 Type II, HIPAA, GDPR, and CCPA certified; ISO/IEC 27001 in active audit.
- Jurisdiction-correct authority. `CustomLLMJudge` with grading criteria authored against a jurisdiction-tagged probe set scores cross-jurisdiction misapplication down; the in-product authoring agent on the Platform tier maintains the rubric as it ages.
- Refusal calibration. `AnswerRefusal` as a first-class template; per-clause-type and per-jurisdiction thresholds; production-side refusal-rate alarms via Error Feed.
Key strengths:
- 50+ pre-built templates plus 20+ local heuristic metrics in `ai-evaluation` (Apache 2.0). The full Evaluator API runs locally or against the Turing models.
- traceAI auto-instruments 50+ AI surfaces across Python, TypeScript, Java, and C#, including a first-class `RETRIEVER` span kind for per-retrieval rubric attachment. Span-layer PII redaction strips confidential signal before export.
- Error Feed sits inside the eval stack. HDBSCAN clustering over ClickHouse-stored span embeddings groups failing traces into named issues; a Sonnet 4.5 Judge writes RCA, evidence quotes, an `immediate_fix`, and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution). Lawyer-reviewed promotions feed the dataset.
- Lower per-eval cost than Galileo Luna-2 at scale; the legal RAG evaluation deep dive covers the cost math.
Limitations:
- Smaller community than Langfuse and Phoenix today; the contributor base reading the evaluator code is younger.
- The in-product authoring agent and self-improving evaluators sit on the Platform tier. The OSS SDK gives you the templates and the Evaluator API.
- The platform does not confer attorney-client privilege. Privilege is a deployment, workflow, and jurisdictional property; the verifiable claim is that local-only paths run inside the firm’s existing privilege-protection workflow.
Use-case fit: contract review, legal research, e-discovery document review (text), brief drafting, deposition prep, compliance monitoring.
Pricing & deployment: cloud + OSS self-host (ai-evaluation, traceAI, agent-opt, Agent Command Center, all Apache 2.0). Free to start; pay-as-you-go as usage grows. Compliance and enterprise add-ons (HIPAA BAA, SAML SSO + SCIM, dedicated CSM) layer on as needed. Multi-region hosted, AWS Marketplace, 100+ provider integrations via Agent Command Center.
Verdict: the production-grade pick when data-path constraints make the local heuristic path mandatory and you want citation grounding, tracing, redaction, and refusal calibration in one stack. On all four tests together, Future AGI wins.
#2 Galileo Luna-2: runtime guardrail and AmLaw procurement
Galileo’s Luna-2 is the production hallucination model behind the platform’s managed eval and runtime guardrail surface. Third-party benchmarks rank Luna-2 well on hallucination detection, and runtime guardrails can block outputs at inference time. That’s useful on client-facing chatbot or self-service legal surfaces where a wrong answer is costlier than a slow one. Procurement tends to clear AmLaw InfoSec quickly.
Best for: AmLaw 100/200 firms, Fortune 500 in-house legal departments, and large legal-tech vendors where SSO, MSA, and named enterprise references matter more than open-source flexibility.
The four tests:
- Citation validity. Luna-2 hallucination scoring is strong; deterministic citation validity (clause ID + verbatim quoted span) sits on top as your own check.
- Privilege awareness. Fully-managed cloud is the default; privileged matter data inside a self-hosted boundary needs procurement negotiation.
- Jurisdiction-correct authority. Custom jurisdiction-tagged rubrics are a vendor request, not a code change. The closed-source posture means a feature request and a delivery date.
- Refusal calibration. Runtime guardrails can gate inference-time outputs; the AnswerRefusal calibration loop is less first-class than Future AGI’s, but the inference-time block fills a different niche.
Key strengths: Luna-2 hallucination scoring with third-party benchmark support; runtime guardrails at inference time; mature enterprise security posture; named regulated-industry customers.
Limitations: optimized for managed cloud; built-in catalogue narrower than Future AGI’s, especially for clause-extraction and citation-grounding structural checks; custom rubrics are vendor requests; per-eval cost materially higher at scale.
Use-case fit: large-firm contract review at scale, in-house compliance monitoring, client-facing legal AI products.
Pricing & deployment: enterprise contract, fully-managed cloud.
Verdict: the safest procurement story for AmLaw InfoSec; less flexible than Future AGI or Braintrust on the data-path and custom-rubric questions. The Future AGI vs Galileo comparison covers the head-to-head; Galileo alternatives covers the broader landscape.
#3 Braintrust: engineering-led eval primitives
Braintrust is the eval-and-experimentation platform engineering teams reach for when they want the primitives and intend to write the rubrics themselves. Datasets, experiments, prompt versioning, and logging sit in one workflow; the legal-specific bits are your team’s to author.
Best for: engineering-led legal-tech vendors and in-house teams with senior engineers who want to read the evaluator code, write the rubrics, and own the eval surface end-to-end.
The four tests:
- Citation validity. You wire the deterministic clause-ID-and-quoted-span check yourself. The platform gives you the experiment harness; the rubric is yours.
- Privilege awareness. Cloud-first by default; local execution for confidentiality-bearing rubrics is on you to set up.
- Jurisdiction-correct authority. You author the jurisdiction-tagged rubric in code. No template catalogue equivalent to Future AGI’s; the trade is the rubric is exactly what you wrote.
- Refusal calibration. You write the AnswerRefusal scorer. The platform’s experiment workflow makes the calibration loop tractable; rubric content is your responsibility.
Key strengths: clean experiment workflow; comparison-friendly UI; strong prompt versioning; pleasant dataset tooling for engineering teams.
Limitations: built-in template catalogue thinner for legal workloads than Future AGI’s 50+ templates; privilege story is the engineering team’s to wire; no equivalent to Error Feed clustering, in-product rubric authoring, or self-improving evaluators tuned by senior-lawyer thumbs.
Use-case fit: engineering-led legal-tech startups, in-house teams that want to own the eval primitives, research workloads where vendor coupling is the larger risk than build effort.
Pricing & deployment: cloud SaaS with usage-based pricing.
Verdict: the right pick when the engineering team has headcount and appetite to write the rubrics and operate the data path. The Braintrust alternatives overview and Braintrust vs Datadog comparison cover where the workflow fits outside legal; for legal, plan to author most of the four tests in your own code.
#4 Harvey / Legora / legal-AI specialists: the turnkey path
Harvey, Legora, and the vertical legal-AI specialists ship turnkey legal AI products with their own internal eval bench. The product covers contract review, legal research, and brief drafting out of the box; the firm signs a SaaS contract and the vendor’s eval suite runs behind the curtain. The trade is that the eval bench is the vendor’s, not yours.
Best for: firms that want a turnkey legal AI product and are comfortable with the vendor running the eval surface; firms without engineering headcount to build the eval stack themselves.
The four tests:
- Citation validity. Vendor runs citation grounding internally. You see the verdict, not the per-rubric score or the trace store as an auditable artefact you control.
- Privilege awareness. Varies by vendor and contract; typically managed SaaS with per-tenant isolation. The verifiable claim is whatever certifications and contractual terms say.
- Jurisdiction-correct authority. The vendor’s internal eval may or may not score this. You don’t see the rubric.
- Refusal calibration. Vendor’s calibration loop, vendor’s thresholds, vendor’s silent-vs-wrong tradeoffs.
Key strengths: turnkey for the workloads the vendor targets; legal-domain training data, prompts, and review workflows generic platforms don’t replicate; procurement and InfoSec built for AmLaw firms.
Limitations: the eval bench is the vendor’s. No rubrics, scores, or trace store as auditable artefacts you control; vendor coupling on evaluator logic, prompt library, and model choices; the per-output score with reasoning is often summarised rather than exposed at the rubric level.
Use-case fit: AmLaw firms and corporate legal departments that want a productized workflow more than they want the eval stack in their boundary.
Pricing & deployment: enterprise SaaS, vendor-managed.
Verdict: the right pick when the firm wants a productized workflow and trusts the vendor’s eval bench; the wrong pick when the supervision record needs to be a first-class artefact the firm owns. If opposing counsel asks “show us the per-output score and the trace store you used to validate this brief,” you need the rubric and the trace store in your boundary.
#5 Custom DIY: pytest + LiteLLM + OpenTelemetry
A handful of in-house legal AI teams ship the DIY path: pytest for the CI gate, LiteLLM for the model abstraction, OpenTelemetry for the trace store, and a hand-rolled rubric library. The win is zero vendor coupling; the cost is every rubric, drift monitor, and trace store is your team’s responsibility.
Best for: research-grade in-house teams that own the platform top-to-bottom and have explicit constraints against any SaaS vendor in the eval path. Government counsel offices and sovereign-cloud deployments fall in this bucket.
The four tests:
- Citation validity. Your team writes the deterministic check. Cheap; runs locally; no vendor required.
- Privilege awareness. Strongest by default. Nothing leaves the firm boundary by design. The price is the team owns the gateway, trace store, redaction layer, and every operational drift the platforms hide.
- Jurisdiction-correct authority. Your team writes the jurisdiction-tagged rubric, curates the probe set, owns the drift loop.
- Refusal calibration. Your team owns calibration, threshold tuning, and production-side rate alarms.
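For the DIY path, the refusal-calibration loop can start as a small stdlib script before it grows into a monitor. The sketch below assumes each probe example carries a lawyer label (`should_refuse`) and the model's behaviour (`refused`); the function names, field names, and the 5% default threshold are hypothetical.

```python
from collections import defaultdict


def refusal_rates(examples):
    """Per (clause_type, jurisdiction): refused-when-easy and answered-when-hard rates."""
    counts = defaultdict(lambda: {"easy": 0, "refused_when_easy": 0,
                                  "hard": 0, "answered_when_hard": 0})
    for ex in examples:
        c = counts[(ex["clause_type"], ex["jurisdiction"])]
        if ex["should_refuse"]:  # lawyer says the playbook is silent or ambiguous
            c["hard"] += 1
            c["answered_when_hard"] += int(not ex["refused"])
        else:
            c["easy"] += 1
            c["refused_when_easy"] += int(ex["refused"])
    return {
        key: {
            "refused_when_easy": c["refused_when_easy"] / c["easy"] if c["easy"] else 0.0,
            "answered_when_hard": c["answered_when_hard"] / c["hard"] if c["hard"] else 0.0,
        }
        for key, c in counts.items()
    }


def breaches(rates, max_answered_when_hard, default=0.05):
    """Keys whose answered-when-hard rate exceeds their threshold."""
    return sorted(key for key, r in rates.items()
                  if r["answered_when_hard"] > max_answered_when_hard.get(key, default))


EXAMPLES = [
    {"clause_type": "indemnity", "jurisdiction": "CA", "should_refuse": True, "refused": True},
    {"clause_type": "indemnity", "jurisdiction": "CA", "should_refuse": True, "refused": False},
    {"clause_type": "indemnity", "jurisdiction": "CA", "should_refuse": False, "refused": False},
    {"clause_type": "non-compete", "jurisdiction": "DE", "should_refuse": True, "refused": True},
]
rates = refusal_rates(EXAMPLES)
alarms = breaches(rates, {("indemnity", "CA"): 0.10})
```

Keying thresholds by clause type and jurisdiction lets a strict limit on one pair coexist with a looser one elsewhere. Running the same function over sampled production traces gives the rate alarm.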
Key strengths: zero vendor coupling; data path satisfies privilege and sovereignty constraints by construction; no per-eval billing surprises.
Limitations: labour cost climbs faster than the SaaS bill once the work outgrows a small engineering team. The 50+ EvalTemplate classes Future AGI ships as a library become 50+ rubrics your team writes, maintains, and tunes. No Error Feed equivalent, no in-product authoring agent, no senior-lawyer-thumbs feedback loop unless your team builds it. Six months in, the maintenance load is the eval team’s full-time job.
Use-case fit: sovereign deployments, government counsel offices, research-grade in-house teams.
Pricing & deployment: the cost is engineering time. Plan for 1.5 to 3 FTEs steady-state at non-trivial production volume.
Verdict: the right pick when zero vendor coupling is a hard constraint. For most firms the labour cost overtakes the SaaS bill within a year.
Decision matrix: which platform fits which legal buyer
| If you are a… | Pick | Why |
|---|---|---|
| Mid-market legal-tech vendor running production agents on OTel, confidential data must stay inside boundary | Future AGI | Eval + trace + confidentiality-safe local mode + error localization in one; covers the four tests out of the box |
| AmLaw 100 firm or Fortune 500 in-house legal with full procurement, MSA, SSO requirements | Galileo Luna-2 | Enterprise security posture clears AmLaw InfoSec fastest; Luna-2 runtime gate on client-facing surfaces |
| Engineering-led legal-tech startup that wants to write the rubrics themselves | Braintrust | Eval primitives plus experiment workflow; legal-specific rubric content is your team’s |
| AmLaw firm that wants a productized workflow more than an audit-grade eval stack | Harvey / Legora / vertical specialist | Turnkey legal AI; the eval bench is the vendor’s, with all that implies |
| Government counsel office or sovereign deployment with zero-vendor-in-eval-path constraint | Custom DIY (pytest + LiteLLM + OTel) | Strongest privilege story by construction; price is 1.5–3 FTEs steady-state |
| Defense-in-depth play targeting hallucinated-citation prevention (post-Mata failure mode) | Future AGI + Galileo Luna-2 | Future AGI as primary eval and trace; Galileo Luna-2 as runtime hallucination gate on inference |
| E-discovery vendor running document-review copilots at scale | Future AGI | 50+ built-in evaluators including Groundedness + custom rubric support; span-level retention for matter-level audit |
Closing: the gap gateways and benchmarks don’t fill
Legal AI in 2026 has two production failure modes. The obvious one: a bad input gets through. Gateways are good at that, and the legal AI gateway shortlist covers the surface. The silent one: a confident-sounding output cites authorities that don’t exist, applies the right rule to the wrong jurisdiction, or leaks privileged context, and nobody scored it before it landed in a brief. Benchmarks like LegalBench tell you which model reasons better on academic tasks. They don’t tell you whether the citation in the brief on the partner’s desk this morning is real.
Evaluation platforms catch the second failure mode. Of the five above, Future AGI is the production-grade pick when you want citation grounding, tracing, span-layer redaction, and refusal calibration in one product. Galileo Luna-2 wins for AmLaw firms with established procurement. Braintrust fits engineering-led teams that want the primitives. Harvey / Legora / verticals ship turnkey if you accept the vendor’s eval bench. Custom DIY is the right answer when zero vendor coupling is a hard constraint.
Ready to evaluate your first legal AI agent against the four tests? Wire Groundedness, ContextAdherence, AnswerRefusal, and a deterministic citation-validity check into a pytest fixture against the ai-evaluation SDK, then add the traceAI instrumentor when production traces start asking questions the CI gate missed. The contract review RAG guide is the working playbook.
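As a starting shape for that CI gate, the sketch below aggregates per-output rubric verdicts into pass rates and fails the build when any rate drops under its floor. It does not call the ai-evaluation SDK; the verdicts would come from whichever evaluators you wire in. The rubric keys, the `FLOORS` values, and the function names are assumptions for illustration.

```python
def pass_rate(results: list[dict], rubric: str) -> float:
    """Share of outputs that passed the given rubric."""
    scored = [r[rubric] for r in results if rubric in r]
    if not scored:
        raise ValueError(f"no results carry rubric {rubric!r}")
    return sum(1 for passed in scored if passed) / len(scored)


def gate(results: list[dict], floors: dict[str, float]) -> dict[str, float]:
    """Return the rubrics whose pass rate falls below its floor, with the rate."""
    failures = {}
    for rubric, floor in floors.items():
        rate = pass_rate(results, rubric)
        if rate < floor:
            failures[rubric] = rate
    return failures


# Hypothetical floors: citation validity is the deterministic floor, so it sits at 1.0.
FLOORS = {"citation_valid": 1.0, "refusal_correct": 0.75}

# Stand-in verdicts; in practice these come from the evaluator run over the probe set.
SAMPLE_RESULTS = [
    {"citation_valid": True, "refusal_correct": True},
    {"citation_valid": True, "refusal_correct": True},
    {"citation_valid": True, "refusal_correct": False},
    {"citation_valid": True, "refusal_correct": True},
]


def test_release_gate():
    # pytest collects functions named test_*; an empty dict means every floor held.
    assert gate(SAMPLE_RESULTS, FLOORS) == {}
```

Returning the failing rubrics with their rates, rather than a bare boolean, gives the supervising partner something to inspect when the gate blocks a release.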
Related reading
- How to Build (and Evaluate) a Contract Review RAG Agent in 2026
- Best Legal RAG Evaluation (2026)
- Best Legal AI Guardrails (2026)
- Best Legal AI Observability (2026)
- Best AI Gateways for Legal (2026)
- Evaluating LLM Citation Attribution (2026)
- Deterministic LLM Evaluation Metrics (2026)
- The 2026 LLM Evaluation Playbook
Frequently asked questions
What are the four tests every legal AI evaluation platform has to pass?
Which platform is best for catching hallucinated citations in briefs?
How do I keep client-confidential data out of a third-party LLM judge?
Does any AI evaluation platform satisfy ABA Model Rule 5.3 supervision obligations?
Can the same eval platform score a contract review agent and a legal research copilot?
How often should law firms re-evaluate production AI tools?
How does Future AGI compare to Galileo Luna-2 on legal-specific evaluators?