What Is LegalBench (Domain-Specific Benchmark)?

A collaboratively built benchmark of 160+ tasks for measuring large language model reasoning across legal rule application, interpretation, and issue spotting.

LegalBench is a collaboratively built benchmark for measuring legal reasoning in large language models. It comprises over 160 tasks across five categories (rule application, rule conclusion, interpretation, rhetorical understanding, and issue spotting), designed by legal scholars working with ML researchers. Tasks mirror activities lawyers actually perform: clause classification, contract review, statutory interpretation, rule extraction. It is one of the canonical domain-specific LLM benchmarks alongside MedQA (medicine), SWE-Bench Verified (code), and FinanceBench (finance), and a useful directional signal for whether an LLM can serve in legal workflows. As of May 2026, frontier models including GPT-5.x, Claude Opus 4.7, and Gemini 3 Ultra cluster within 4-6 points of one another on the original LegalBench split, so the benchmark no longer discriminates between top-tier systems on its own and must be paired with a domain golden dataset.

Why LegalBench matters in production LLM and agent systems

Legal AI applications carry asymmetric risk. A wrong contract interpretation can cost a client millions; a hallucinated case citation can sanction a lawyer. Generic LLM evaluation tells you a model is fluent and broadly capable; it does not tell you whether it can read a force-majeure clause without missing a subordinate exception. LegalBench was built precisely for that gap.

The pain shows up across roles. ML engineers picking a model for a contract-review product see strong LLM leaderboard ranks and ship, only to find the model misclassifies obvious indemnity clauses on the firm’s actual contracts. Product leads at legal-tech companies promise “trained on legal data” without any benchmark evidence and lose enterprise deals to competitors who publish LegalBench numbers. Compliance leads at a law firm need responsible-AI evidence that a model’s output quality is at least at the level of a junior associate before letting that model touch client work.

In 2026, agentic legal workflows raise the bar further. A multi-step contract analyser may extract clauses, classify them, retrieve relevant precedent through a RAG pipeline, and write a summary. Each step needs domain-specific evaluator coverage, and end-to-end LegalBench tasks miss step-level failure modes. Trajectory-level evaluation tied to LegalBench-style task slices is the right shape.
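The gap between end-to-end and step-level scoring can be illustrated with a small sketch. Everything here is hypothetical: the step names, the sample outputs, and the string-matching checks stand in for real domain evaluators and are not part of LegalBench or any FutureAGI API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    step: str
    output: str
    passed: bool


def evaluate_trajectory(
    steps: list[tuple[str, str]],
    checks: dict[str, Callable[[str], bool]],
) -> list[StepResult]:
    """Score every step of a trajectory with the check registered for that step."""
    return [StepResult(name, out, checks[name](out)) for name, out in steps]


def first_failure(results: list[StepResult]) -> Optional[str]:
    """Return the name of the earliest failing step, or None if all passed."""
    for result in results:
        if not result.passed:
            return result.step
    return None


# Hypothetical trajectory of a contract analyser (illustrative outputs only).
trajectory = [
    ("extract", "Section 14.2: Seller shall indemnify Buyer..."),
    ("classify", "limitation_of_liability"),  # wrong label for this clause
    ("summarise", "The contract contains an indemnity in Section 14.2."),
]

# Toy checks; a real system would plug in domain-specific evaluators here.
checks = {
    "extract": lambda out: "Section 14.2" in out,
    "classify": lambda out: out == "indemnification",
    "summarise": lambda out: "14.2" in out,
}

results = evaluate_trajectory(trajectory, checks)
print(first_failure(results))
```

In this toy run the final summary passes its check even though the classification step is wrong, which is the kind of failure an end-to-end score alone would hide.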

FutureAGI runs LegalBench (and similar domain benchmarks) as one input to a broader LLM evaluation framework. FutureAGI’s approach is to treat LegalBench as a calibration set, not a release gate by itself. The connection runs through versioned Dataset artefacts: a team imports the LegalBench task subset relevant to their product, attaches Groundedness, AnswerRelevancy, and TaskCompletion evaluators, and runs Dataset.add_evaluation(...) against each candidate model. Results show LegalBench performance per model, but the dashboard pairs them with the team’s own internal evaluation cohort and traceAI spans.
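One way to read "calibration set, not a release gate by itself" is a gate that requires every cohort to clear its own floor. The sketch below is plain Python and does not use the FutureAGI SDK; the cohort names, floors, and scores are made-up placeholders.

```python
def release_gate(
    scores: dict[str, float], floors: dict[str, float]
) -> tuple[bool, list[str]]:
    """Pass only if every cohort meets its floor; missing cohorts count as 0.0."""
    failures = [
        f"{cohort}: {scores.get(cohort, 0.0):.2f} < {floor:.2f}"
        for cohort, floor in floors.items()
        if scores.get(cohort, 0.0) < floor
    ]
    return (len(failures) == 0, failures)


# Hypothetical floors and candidate scores (placeholders, not measured results).
floors = {"legalbench_subset": 0.80, "internal_cohort": 0.85}
candidate = {"legalbench_subset": 0.88, "internal_cohort": 0.79}

ok, failures = release_gate(candidate, floors)
print(ok, failures)
```

Here a candidate with a strong LegalBench score is still blocked because it misses the internal-cohort floor.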

A concrete workflow: a contract-review startup runs three candidate LLMs (claude-opus-4.7, gpt-5.1, llama-4-70b fine-tuned on legal data via LoRA) against a 600-task LegalBench subset plus a 1,200-clause internal Dataset v5. The fine-tuned Llama 4 wins on the internal dataset by 11 points but trails Claude Opus 4.7 on LegalBench. The team picks Claude as the primary model with the fine-tune as fallback for the internal-corpus-heavy route, configured via Agent Command Center routing policy. Every production trace logs the model version, retrieved precedents through the LlamaIndex integration, and post-hoc evaluator scores.
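The selection logic in this workflow can be sketched in a few lines. The scores are illustrative placeholders chosen only to match the 11-point internal gap described above, the model key `llama-4-70b-legal-lora` and the `internal_corpus_heavy` tag are hypothetical, and the `route` function is a stand-in for a routing policy rather than the Agent Command Center configuration format.

```python
# Placeholder scores per cohort (illustrative, not measured results).
scores = {
    "claude-opus-4.7": {"legalbench": 0.86, "internal": 0.74},
    "gpt-5.1": {"legalbench": 0.84, "internal": 0.72},
    "llama-4-70b-legal-lora": {"legalbench": 0.79, "internal": 0.85},
}


def best_on(cohort: str) -> str:
    """Return the model with the highest score on the given cohort."""
    return max(scores, key=lambda model: scores[model][cohort])


primary = best_on("legalbench")   # strongest on the public benchmark subset
fallback = best_on("internal")    # strongest on the firm's own clauses


def route(request_tags: set[str]) -> str:
    """Send internal-corpus-heavy requests to the fine-tune, everything else to primary."""
    return fallback if "internal_corpus_heavy" in request_tags else primary


print(route({"internal_corpus_heavy"}), route({"nda_review"}))
```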

For ongoing regression detection, the team runs a regression eval on both LegalBench and the internal cohort weekly. When an upstream provider model update silently shifts behaviour, the next scheduled run surfaces it, rather than a client complaint. Unlike a leaderboard such as the one Stanford CRFM publishes, FutureAGI keeps every row connected to evaluator reasons, the trace span, and the production cohort that originally surfaced the case.
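A regression check of this kind reduces to comparing each slice against a stored baseline. The following is a minimal sketch, assuming scores are kept as fractions per slice; the slice names, the 0.02 tolerance, and the numbers are hypothetical.

```python
def detect_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return slices whose score dropped by more than `tolerance`, with the drop size."""
    regressions = {}
    for slice_name, base_score in baseline.items():
        if slice_name not in current:
            continue  # slice not evaluated this run; nothing to compare
        drop = base_score - current[slice_name]
        if drop > tolerance:
            regressions[slice_name] = round(drop, 4)
    return regressions


# Hypothetical weekly scores (placeholders, not measured results).
baseline = {"legalbench/interpretation": 0.80, "internal/indemnity": 0.90}
this_week = {"legalbench/interpretation": 0.79, "internal/indemnity": 0.84}

print(detect_regressions(baseline, this_week))
```

The tolerance keeps ordinary run-to-run noise from raising alerts; in practice it would be tuned per slice from the variance observed across repeated runs.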

LegalBench at a glance: 2026 view

| LegalBench dimension | What it tests | 2026 frontier band | FutureAGI evaluator pairing |
|---|---|---|---|
| Rule application | Apply a stated rule to facts | 88-93% | Groundedness, AnswerRelevancy |
| Rule conclusion | Output the correct outcome | 84-90% | TaskCompletion, CustomEvaluation |
| Interpretation | Read clauses with ambiguity | 75-83% | Faithfulness, Groundedness |
| Rhetorical understanding | Parse legal argument style | 70-79% | CustomEvaluation rubric |
| Issue spotting | Identify legal issues from facts | 65-74% | TaskCompletion, CustomEvaluation |

Domain-benchmark quality combines benchmark slices with task-specific evaluators:

  • LegalBench task accuracy: per-task and aggregated scores across the 5 categories.
  • Groundedness: for citation-heavy outputs, alignment with the provided source statute and case text.
  • AnswerRelevancy: does the response address the legal question?
  • TaskCompletion: for multi-step legal agents, did the trajectory reach the goal?
  • Faithfulness: for RAG-over-precedent, support against retrieved context.
  • Per-category eval-fail-rate: surfaces uneven performance across LegalBench subdomains.
  • Reviewer-disagreement rate: a proxy for ambiguity in legal-practice tasks.

# Score one contract-review answer with two FutureAGI evaluators.
from fi.evals import Groundedness, TaskCompletion

g = Groundedness()      # is the answer supported by the supplied context?
tc = TaskCompletion()   # did the answer accomplish the requested task?

input_q = "Identify the indemnification clauses in the attached contract."
output = "Section 14.2 contains the indemnification clause..."
context = "Section 14.2 (indemnification by seller)..."

# Groundedness compares the response with the source text.
print(g.evaluate(response=output, context=context))
# TaskCompletion compares the response with the original request.
print(tc.evaluate(input=input_q, output=output))
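The last two metrics in the list above (per-category eval-fail-rate and reviewer-disagreement rate) need no special tooling. A minimal sketch, assuming each evaluation row carries a `category` and a boolean `passed`, and each reviewed item carries one label per reviewer; these field names and the sample rows are hypothetical.

```python
from collections import defaultdict


def fail_rate_by_category(rows: list[dict]) -> dict[str, float]:
    """Fraction of failed evaluations within each benchmark category."""
    totals: dict[str, int] = defaultdict(int)
    fails: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["category"]] += 1
        if not row["passed"]:
            fails[row["category"]] += 1
    return {category: fails[category] / totals[category] for category in totals}


def reviewer_disagreement_rate(labels_per_item: list[tuple[str, ...]]) -> float:
    """Share of items on which the reviewers did not all give the same label."""
    if not labels_per_item:
        return 0.0
    disagreements = sum(1 for labels in labels_per_item if len(set(labels)) > 1)
    return disagreements / len(labels_per_item)


# Hypothetical evaluation rows and reviewer labels (illustrative only).
rows = [
    {"category": "interpretation", "passed": True},
    {"category": "interpretation", "passed": False},
    {"category": "issue_spotting", "passed": True},
    {"category": "issue_spotting", "passed": True},
]
reviews = [
    ("indemnity", "indemnity"),
    ("indemnity", "warranty"),
    ("warranty", "warranty"),
    ("other", "warranty"),
]

print(fail_rate_by_category(rows))
print(reviewer_disagreement_rate(reviews))
```

Reporting the fail rate per category rather than as one pooled number keeps a weak category from being averaged away by a strong one.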

Common mistakes

  • Treating LegalBench score as production readiness. Public benchmarks rarely match a firm’s specific document distribution; pair with internal evaluation.
  • Aggregating across all 160 tasks into one number. The five categories test different abilities; report per-category scores.
  • Skipping a citation-faithfulness check. Legal hallucinations often appear as fabricated case citations; run Groundedness and chunk attribution on every output.
  • Leaderboard-only model selection. A model that wins LegalBench may be too expensive or slow for your product; weigh quality, latency, and inference cost together.
  • No regression eval on provider updates. Provider model updates can silently shift legal-task performance; schedule weekly LegalBench runs and pair with contamination checks.
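Weighing quality, latency, and cost together can be as simple as treating latency and cost as hard limits and then picking the most accurate model that remains. This is one possible sketch, not a prescribed method; the candidate names, metrics, and limits are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float              # benchmark or internal-cohort accuracy, 0..1
    p95_latency_s: float         # 95th-percentile latency in seconds
    cost_per_1k_docs_usd: float  # inference cost per 1,000 documents


def select(
    candidates: list[Candidate], max_latency_s: float, max_cost_usd: float
) -> Optional[Candidate]:
    """Most accurate candidate that fits both budgets, or None if none fits."""
    eligible = [
        c for c in candidates
        if c.p95_latency_s <= max_latency_s and c.cost_per_1k_docs_usd <= max_cost_usd
    ]
    return max(eligible, key=lambda c: c.accuracy, default=None)


# Hypothetical candidates (placeholder numbers, not measured results).
candidates = [
    Candidate("model-a", accuracy=0.88, p95_latency_s=9.0, cost_per_1k_docs_usd=40.0),
    Candidate("model-b", accuracy=0.85, p95_latency_s=3.0, cost_per_1k_docs_usd=12.0),
    Candidate("model-c", accuracy=0.80, p95_latency_s=1.5, cost_per_1k_docs_usd=4.0),
]

choice = select(candidates, max_latency_s=5.0, max_cost_usd=20.0)
print(choice.name if choice else "no candidate fits the budget")
```

In this example the leaderboard winner (model-a) is excluded by the latency and cost limits, so the second-most-accurate model is chosen.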

Frequently Asked Questions

What is LegalBench?

LegalBench is a benchmark of more than 160 tasks for measuring legal reasoning in LLMs, spanning rule application, rule conclusion, interpretation, rhetorical understanding, and issue spotting. It was built collaboratively by legal scholars and ML researchers.

How is LegalBench different from MMLU?

MMLU tests broad multi-domain knowledge with multiple-choice questions, including some law content. LegalBench focuses on the legal-practice tasks lawyers perform (clause classification, contract review, statutory interpretation), with formats and rubrics designed by legal experts.

How do you use LegalBench in production legal AI evaluation?

LegalBench scores are directional, not sufficient. FutureAGI runs LegalBench-style task slices alongside a versioned production dataset, scoring with FactualAccuracy, AnswerRelevancy, and TaskCompletion to catch regressions specific to your legal corpus.