What Is MMLU?
MMLU is a multiple-choice benchmark that tests language models across 57 academic subjects using exam-style questions.
MMLU, or Massive Multitask Language Understanding, is an LLM-evaluation benchmark that tests a model’s multiple-choice accuracy across 57 academic subjects. It shipped in 2020 and was the headline number on every model card through 2023. In May 2026, MMLU is saturated: GPT-5.x, Claude Opus 4.7, Gemini 3.x, and Llama 4 all sit above 92%, the dataset carries documented label errors that cap headroom near 95%, and the score delta between a frontier model and a six-month-old model is statistical noise. MMLU still appears in 2026 model cards as an appendix table for continuity, but it does not appear in the headline.
If you are reading this in May 2026, the practical questions are: what should I run instead, when does MMLU still earn its place in an eval doc, and how do I read a model-card MMLU number without being misled? This page is an opinionated walk through that 2026 reality and how FutureAGI ties MMLU-style scores to the task-specific evals and production traces that actually drive release decisions.
Why MMLU matters in production LLM and agent systems
A model can improve on MMLU while getting worse for your users. The common failure is benchmark substitution: a team treats a public score as proof that a model can answer its own support, legal, healthcare, or coding questions. The second failure is benchmark overfitting or contamination: the candidate has seen exam-style patterns and grade-school subject material in training, so the headline accuracy reflects memorization rather than reasoning.
The pain lands across the stack. Developers inherit regressions that were invisible in the benchmark table. SREs see p99 latency, retry rate, or token cost rise after a larger model replaces a smaller one. Product teams see task-completion rate fall for a niche workflow that MMLU never covered. Compliance reviewers see a model with strong academic accuracy cite policy text incorrectly or answer outside an approved scope.
Symptoms often appear as disagreement between public and private signals: MMLU accuracy rises, but eval-fail-rate-by-cohort rises too; schema failures increase after a model swap; tool calls become less precise; escalation rate rises for complex tickets. Unlike Chatbot Arena, which measures pairwise preference and has its own verbosity bias, MMLU is fixed-answer multiple choice. That makes it reproducible but narrow, and in 2026 the narrowness matters more than ever.
Agentic systems widen the gap. A 2026 pipeline may plan, retrieve, call tools via MCP, revise, and hand off between agents in a multi-agent system. MMLU says nothing about agent.trajectory.step, retrieval groundedness, or whether the final action was safe.
MMLU saturation: the headline number nobody reads anymore
Saturation is not a future risk for MMLU; it has already happened. By Q1 2026, every frontier system reports MMLU above 92%. Frontier labs have moved on to MMLU-Pro (10 answer choices, chain-of-thought pressure), GPQA Diamond (Google-proof PhD-level), HLE (Humanity’s Last Exam, 3,000 expert-authored questions), and FrontierMath (research-level math). Open the most recent OpenAI, Anthropic, or Google DeepMind model card and you will find HLE, FrontierMath, GPQA Diamond, AIME 2025, ARC-AGI 2, SWE-Bench Verified, Aider Polyglot, τ-bench, MMMU-Pro, and RULER in the headline table. MMLU sits below in the appendix. If your team’s eval doc still treats MMLU as the headline number, the doc is three years out of date.
What replaced MMLU in 2026
This is the swap table to internalize. The first column is the older benchmark and the second is its status today. The Replacement column is what frontier labs reach for in May 2026, and the last column says why.
| 2022-era benchmark | 2026 status | Replacement | Why the swap |
|---|---|---|---|
| MMLU (57 subjects, MCQ) | Saturated (92–95% frontier) | MMLU-Pro, GPQA Diamond, HLE | Label-noise cap near 95%; PhD-level + private holdout still discriminate |
| HellaSwag | Saturated (97%+) | HLE, MUSR | Commonsense completion no longer separates models |
| GSM8K | Saturated (98%+), contaminated | FrontierMath, AIME 2025, MATH-500 | Grade-school math memorized; competition + research math still moves |
| HumanEval | Saturated, contaminated | SWE-Bench Verified, Aider Polyglot, LiveCodeBench | 164 toy problems memorized; real GitHub patches discriminate |
| MT-Bench | Saturated + judge bias | Arena-Hard-Auto, WildBench | Verbosity bias + judge-model leakage broke MT-Bench |
| Chatbot Arena (vanilla) | Live but verbosity-biased | Style-Controlled Arena, Arena-Hard-Auto v2 | Style-controlled variant strips length and formatting effects |
| Needle-in-a-Haystack | Saturated to 1M tokens | RULER, LongBench v2, BABILong | Single-needle retrieval no longer measures real long-context reasoning |
| Single-turn QA generally | Mostly obsolete as frontier signal | τ-bench, SWE-Bench Verified, OSWorld, GAIA L3 | Production work needs trajectory, tool state, multi-turn user simulation |
The interesting 2026 benchmarks are the agentic ones: τ-bench (Sierra’s multi-turn customer-support benchmark with database state and simulated users), SWE-Bench Verified (OpenAI’s 500-issue subset of real GitHub bugs), GAIA Level 3, OSWorld (real OS-level desktop tasks; frontier still under 40% in May 2026), WebArena, BFCL v3, MLE-Bench, and Aider Polyglot. These benchmarks share three properties MMLU lacks: state across turns, tool effects, and a pass criterion that requires the model to actually accomplish a goal.
Contamination is the second reason to discount MMLU
Public benchmark splits leak into web crawls, GitHub mirrors, Discord transcripts, and synthetic training pipelines. A model that memorized MMLU can report 94% accuracy without generalizing the underlying knowledge. The community response is contamination-resistant suites: LiveBench refreshes monthly with fresh problems; LiveCodeBench keeps only problems released after a model’s training cutoff; HLE keeps a private holdout; ARC-AGI 2 keeps a private eval set; FrontierMath problems are expert-authored and never published. Treat every pre-2024 public score, including MMLU, as contaminated by default.
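If you have a sample of the training corpus, one simple leakage probe is verbatim n-gram overlap. The sketch below illustrates that swapped-in technique; it is not the method used by any of the suites named above. The helper names, the 8-token window, and the toy strings are all assumptions made for the example.

```python
import re


def ngrams(text: str, n: int) -> set:
    """Lowercased word n-grams of a text (hypothetical helper)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the question's n-grams found verbatim in the corpus sample."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(question_grams & corpus_grams) / len(question_grams)


# Toy data, invented for illustration only.
leaked = "Which of the following best describes the function of the mitochondria in a cell"
fresh = "Name the organelle that a plant cell uses to capture light for sugar synthesis"
corpus_sample = [
    "quiz dump: " + leaked + " answer b",
    "unrelated forum post about gardening",
]

leaked_score = overlap_ratio(leaked, corpus_sample)
fresh_score = overlap_ratio(fresh, corpus_sample)
print(leaked_score, fresh_score)
```

A high ratio on a benchmark question is a reason to discount that item. A low ratio proves little, because paraphrased or translated leaks do not show up as verbatim overlap.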
How FutureAGI uses MMLU
FutureAGI’s approach is to keep MMLU in the model-selection layer, then test the candidate model against the product’s own failure budget. MMLU has no FutureAGI anchor: there is no dedicated MMLU evaluator class in fi.evals. Teams should not invent one or pretend the benchmark measures behaviors it does not cover.
A practical 2026 workflow starts with a model comparison dataset. The engineer imports the external benchmark result as a custom field such as mmlu_accuracy, records model name, prompt format, benchmark date, answer-extraction rule, and contamination notes, then replays recent production traces through the same candidate. FutureAGI attaches CustomEvaluation scores for the imported benchmark scalar, GroundTruthMatch for closed-form internal questions, and TaskCompletion for agent workflows where the right outcome matters more than a single letter answer.
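The imported benchmark record described above can be sketched as a small data structure. This is plain Python, not the FutureAGI dataset schema; the class name, field names, and sample values are assumptions chosen to mirror the fields listed in the paragraph.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ImportedBenchmark:
    """One externally reported score plus the run details needed to compare it."""
    model_name: str
    mmlu_accuracy: float          # fraction in [0, 1]
    prompt_format: str            # e.g. "5-shot" or "0-shot-cot"
    benchmark_date: date
    answer_extraction_rule: str   # e.g. "first-letter-regex"
    contamination_notes: str = ""


def comparable(a: ImportedBenchmark, b: ImportedBenchmark) -> bool:
    """Treat two scores as comparable only when prompt format and extraction rule match."""
    return (a.prompt_format == b.prompt_format
            and a.answer_extraction_rule == b.answer_extraction_rule)


# Hypothetical records; the numbers are placeholders, not reported results.
incumbent = ImportedBenchmark(
    "model-a", 0.921, "5-shot", date(2026, 1, 15), "first-letter-regex")
candidate = ImportedBenchmark(
    "model-b", 0.934, "0-shot-cot", date(2026, 4, 2), "llm-judge",
    contamination_notes="vendor-reported, no decontamination details")

print(comparable(incumbent, candidate))
```

Keeping the run details next to the scalar makes it obvious when two MMLU numbers were produced under different conditions and should not be subtracted from each other.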
The trace layer supplies the counterweight. With traceAI-openai or traceAI-langchain instrumentation, the engineer watches llm.token_count.prompt, llm.token_count.completion, route, prompt version, latency, and eval reason. If MMLU improves but TaskCompletion or GroundTruthMatch falls on private traces, the next step is not a global rollout. The engineer can keep the incumbent model, open a regression eval, route only low-risk traffic, or configure model fallback in the Agent Command Center.
Why a higher MMLU does not guarantee a better product
A model that gains 1.5 points on MMLU may lose ground on retrieval groundedness (after a context-window expansion changed how it weighs retrieved chunks), on tool selection (after a refusal-tuning rollout made it hesitate on legitimate write tools), on latency (a larger model with the same accuracy), or on cost (more output tokens for the same answer because the model now reasons longer). Unlike LM Evaluation Harness reports that usually end at a benchmark table, FutureAGI ties the benchmark result to trace cohorts and evaluator outcomes. The decision becomes: did the higher-MMLU model improve this product workflow?
When MMLU still earns a row in your eval doc
MMLU is still useful for two narrow purposes. First, as a continuity check: if a new release’s MMLU is unexpectedly lower than its predecessor, something regressed and a deeper investigation is warranted. Second, as a tier filter for new small-model releases (7B–30B Llama 4 derivatives, fine-tuned 8B models for edge deployment) where the field has not saturated. For frontier-tier model selection in 2026, MMLU adds no signal beyond what HLE, GPQA Diamond, and SWE-Bench Verified already provide.
Per-subject view: where MMLU still discriminates
Even at the saturated headline, MMLU’s 57 subject splits sometimes carry signal. Professional Law, College Medicine, Virology, and Formal Logic remain a few points below the global average for most frontier models. If your product lives in one of those domains, the relevant MMLU slice is more informative than the headline, but it should still be paired with a domain golden dataset and a task-specific evaluator before any release decision.
How to measure MMLU
Measure MMLU as an offline benchmark, then check whether it predicts production quality:
- Official MMLU accuracy: correct multiple-choice answers divided by total questions, with prompt format and answer parser fixed across runs.
- Category accuracy: subject-level score for weak areas such as law, medicine, math, or professional knowledge.
- CustomEvaluation: stores the imported MMLU scalar beside model, prompt, and dataset metadata for comparison.
- GroundTruthMatch: checks closed-form internal tasks against expected answers when your private dataset has a known target.
- TaskCompletion: for agent workflows, scores whether the candidate model actually finishes the task end-to-end.
- Dashboard signals: compare MMLU delta with eval-fail-rate-by-cohort, p99 latency, token-cost-per-successful-trace, and escalation rate.
- Release correlation: require the candidate’s private eval pass rate to move with MMLU before promoting it.
- Contamination probe: before trusting a public MMLU result, run a canary check or compare perplexity between published and held-out splits.
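The first two items in the list can be sketched in plain Python. This is a minimal illustration, not an official MMLU scorer: the strict regex parser, the function names, and the toy rows are assumptions. Any output the parser cannot read counts as wrong.

```python
import re
from collections import defaultdict
from typing import Optional

# Accept only outputs that are a bare choice such as "B", "(D)", "a." or "Answer: C".
ANSWER_RE = re.compile(
    r"^\s*(?:answer\s*[:\-]?\s*)?\(?([ABCD])\)?\s*[.)]?\s*$", re.IGNORECASE)


def extract_choice(raw_output: str) -> Optional[str]:
    """Deterministic answer parser: returns A/B/C/D, or None if nothing parses."""
    match = ANSWER_RE.match(raw_output)
    return match.group(1).upper() if match else None


def score(rows):
    """rows: iterable of (subject, gold_letter, raw_model_output) tuples."""
    totals = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, gold, raw in rows:
        totals[subject][1] += 1
        if extract_choice(raw) == gold:
            totals[subject][0] += 1
    total = sum(t for _, t in totals.values())
    if total == 0:
        raise ValueError("no rows to score")
    correct = sum(c for c, _ in totals.values())
    per_subject = {s: c / t for s, (c, t) in totals.items()}
    return correct / total, per_subject


# Toy rows, invented for illustration.
rows = [
    ("professional_law", "B", "Answer: B"),
    ("professional_law", "C", "I think it is A"),  # unparseable, counted as wrong
    ("college_medicine", "D", "(D)"),
    ("college_medicine", "A", "a."),
]
overall, per_subject = score(rows)
print(overall, per_subject)
```

Fixing the parser in code, rather than relying on an LLM judge, keeps the number reproducible across runs and models.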
Minimal pairing snippet:
```python
from fi.evals import GroundTruthMatch, TaskCompletion, CustomEvaluation

ground = GroundTruthMatch()
task = TaskCompletion()
custom = CustomEvaluation(
    name="mmlu_correlation",
    instruction="Score whether the model's MMLU delta predicts production lift",
)

# candidate_dataset: rows replayed from production traces, defined elsewhere
for row in candidate_dataset:
    g = ground.evaluate(response=row.answer, expected_response=row.expected)
    t = task.evaluate(input=row.user_goal, trajectory=row.trace)
    c = custom.evaluate(input=row.context, output=row.answer)
    row.attach(ground=g, task=t, custom=c)
```
The useful signal in 2026 is not “MMLU went up.” It is whether the model that went up on MMLU also passes the private tasks, latency budget, safety checks, and cost ceiling that define release readiness.
A second snippet shows a cohort-filtered regression eval over a captured Dataset, which is what actually gates a model swap in production:
```python
from fi.evals import GroundTruthMatch, TaskCompletion, Groundedness, ToolSelectionAccuracy
from fi.datasets import Dataset

# Build the eval set from the last 14 days of billing-cohort production traces
ds = Dataset.from_traces(project="prod-support-agent", cohort="billing", days=14)
evaluators = [GroundTruthMatch(), TaskCompletion(), Groundedness(), ToolSelectionAccuracy()]

report = ds.evaluate(
    evaluators=evaluators,
    baseline_model="gpt-5",
    candidate_model="claude-opus-4-7",
    sample_strategy="failure_biased",
)

# One row per cohort and model, exported for the swap audit
report.compare(group_by=["cohort.name", "gen_ai.request.model"]).to_csv("mmlu_swap_audit.csv")
```
MMLU is one row in a bigger eval doc
The 2026 eval doc for a production LLM application looks like a matrix, not a row. Columns: incumbent model, candidate model A, candidate model B. Rows: MMLU (continuity), HLE (frontier reasoning), GPQA Diamond (PhD-level), SWE-Bench Verified (coding), τ-bench (agent), domain golden dataset score (TaskCompletion, GroundTruthMatch, Groundedness, Faithfulness, AnswerRelevancy, ToolSelectionAccuracy), safety (PromptInjection, PII, BiasDetection, Toxicity), latency p99, and cost per successful trace. The release gate fires on the domain rows; MMLU is a sanity-check cell, not a decision cell.
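The matrix and its gate can be sketched as data plus one rule. This is an illustrative structure in plain Python, not a FutureAGI API; the row names follow the paragraph above, while the scores, tolerances, and function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Row:
    name: str
    incumbent: float
    candidate: float
    higher_is_better: bool = True
    gates_release: bool = False   # True for domain, safety, latency, and cost rows
    tolerance: float = 0.0        # regression allowed before the row counts as failed


def regressed(row: Row) -> bool:
    """True when the candidate is worse than the incumbent by more than the tolerance."""
    delta = row.candidate - row.incumbent
    if not row.higher_is_better:
        delta = -delta
    return delta < -row.tolerance


def release_decision(rows):
    """Only gating rows can block; context rows such as MMLU only prompt investigation."""
    blockers = [r.name for r in rows if r.gates_release and regressed(r)]
    investigate = [r.name for r in rows if not r.gates_release and regressed(r)]
    return {"promote": not blockers, "blockers": blockers, "investigate": investigate}


# Placeholder numbers for one incumbent and one candidate.
matrix = [
    Row("MMLU (continuity)", 0.921, 0.934),
    Row("TaskCompletion (domain golden set)", 0.88, 0.84, gates_release=True, tolerance=0.01),
    Row("Groundedness (domain golden set)", 0.91, 0.92, gates_release=True, tolerance=0.01),
    Row("latency p99 (s)", 4.2, 4.0, higher_is_better=False, gates_release=True, tolerance=0.3),
]
decision = release_decision(matrix)
print(decision)
```

In this toy matrix the candidate gains on MMLU and is still blocked, because TaskCompletion on the domain golden set dropped by more than its tolerance.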
A note on MMLU-Pro and the path forward
MMLU-Pro is MMLU’s direct successor: 10 answer choices instead of 4, chain-of-thought encouraged, 12,000 reasoning-focused questions across 14 subjects, designed to resist the saturation that consumed the original. As of May 2026, frontier scores on MMLU-Pro sit in the high 70s to mid-80s: discriminating but tightening. The honest read is that MMLU-Pro will follow MMLU into saturation within a year or two, and the long-term replacements are HLE (3,000 expert questions across 100+ domains with a private holdout) and the agent benchmarks. Treat MMLU-Pro as a useful 2025–2026 bridge, not the final answer. The teams we see making the best model-selection decisions in our 2026 evals have already moved their headline number to HLE for reasoning, SWE-Bench Verified for coding, and τ-bench for agent work, and they read MMLU and MMLU-Pro as appendix data.
Reading a 2026 model card without being misled
Vendor model cards in 2026 are crowded. The reliable reading order: skip MMLU, MMLU-Pro, and HellaSwag first; read HLE and GPQA Diamond for reasoning tier; read SWE-Bench Verified and Aider Polyglot for coding tier; read τ-bench and BFCL v3 for agent tier; read RULER and LongBench v2 for long-context; read MMMU-Pro and OSWorld for multimodal/desktop. Then check whether the vendor reported with chain-of-thought, with extended thinking, with tool use, or without; these toggles can shift scores by 10+ points. Finally, run the model against your own golden dataset before believing any of the above. We’ve found that vendor selection decisions made on model-card numbers alone reverse roughly 30% of the time once a private eval runs; the practical rule is “shortlist public, decide private.”
How MMLU interacts with fine-tuning and small models
A 7B–30B open-weight model that fine-tunes hard on MMLU-style data can post a headline number that looks competitive with frontier closed-source models, then fall over on real tasks. This is one of the most common patterns we audit for in 2026. The defense is straightforward: pair every fine-tuned model’s MMLU with a held-out golden dataset run, a regression eval on production-shaped tasks, and a contamination probe on the training corpus. The LM Evaluation Harness makes MMLU runs reproducible, but reproducibility is not validity: a high MMLU on a contaminated fine-tune is just a measurement of memorization.
Common mistakes (May 2026 edition)
- Using MMLU as a release gate. It samples academic knowledge, not proprietary workflows, tool policies, or conversational recovery, and it is saturated above 90% on every frontier model. The decision goes to your domain golden dataset.
- Comparing scores without run details. Few-shot format, chain-of-thought allowance, contamination filters, and answer extraction can move MMLU accuracy by 2–4 points without changing the model.
- Averaging away category weakness. A high global MMLU can hide failures in law, medicine, math, or business-critical domains; always look at subject splits.
- Treating tiny deltas as meaningful. A 0.2-point MMLU gain is noise. Frontier-model deltas in 2026 are within run variance.
- Ignoring cost and latency. A larger model may score higher on MMLU while breaking p95 latency or doubling token-cost-per-successful-trace.
- Skipping contamination checks. Every pre-2024 public benchmark is contaminated by default in 2026. Hold out a fresh slice and compare.
- Self-judging with the same model family. When MMLU pairs with chain-of-thought, the answer extractor sometimes uses an LLM judge. Pin the judge to a different family or use deterministic extraction.
- Reporting a single number when your traffic has six different intents. Global MMLU is meaningless when your traffic is 60% billing, 20% legal, 10% medical, 10% other.
- Citing MMLU in 2026 as if it were 2022. If a vendor pitch leads with MMLU as a top-line, treat it the way you would treat a 2021 paper leading with BLEU score: useful for continuity, not for choosing a model.
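The "tiny deltas" item can be checked with a back-of-the-envelope calculation. The sketch below treats each question as an independent trial and the two models as independent samples. Both are simplifying assumptions: a paired test on the same questions would be tighter. The 14,042 figure is the size of the standard MMLU test split; the accuracies are placeholders.

```python
import math

MMLU_TEST_QUESTIONS = 14042  # size of the standard MMLU test split


def standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def delta_z(acc_a: float, acc_b: float, n: int) -> float:
    """z-score for the gap between two accuracies, each measured on n questions."""
    se_diff = math.sqrt(standard_error(acc_a, n) ** 2 + standard_error(acc_b, n) ** 2)
    return (acc_b - acc_a) / se_diff


se = standard_error(0.93, MMLU_TEST_QUESTIONS)
z = delta_z(0.930, 0.932, MMLU_TEST_QUESTIONS)
print(f"standard error at 93%: {100 * se:.2f} points")
print(f"z for a 0.2-point gain: {z:.2f}")
```

At 93% accuracy the standard error is roughly 0.2 points, so a 0.2-point gain gives a z-score well under the usual 1.96 threshold. That is before counting run-to-run variance from prompt format or decoding settings.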
Frequently Asked Questions
What is MMLU?
MMLU is a multiple-choice LLM benchmark that measures accuracy across 57 academic subjects. It is saturated above 90% for every frontier model in 2026, so it is useful only as continuity context, not as a release gate.
How is MMLU different from an LLM leaderboard?
MMLU is a benchmark: a fixed question set with scored answers. An LLM leaderboard is a ranked table that may combine MMLU with other benchmarks, preference votes, or hidden scoring policies.
How do you measure MMLU?
Measure MMLU as multiple-choice accuracy on the official question set, then record it beside FutureAGI CustomEvaluation or GroundTruthMatch results, llm.token_count.prompt, latency, and private eval pass rate. Treat it as model-selection evidence, not a release gate.