What Is GSM8K?

A grade-school math benchmark used to test multi-step arithmetic word-problem solving in language models.

GSM8K is a grade-school math benchmark released by OpenAI in 2021 that tests whether language models can solve multi-step arithmetic word problems. It belongs to LLM evaluation and historically showed up in benchmark suites, model-selection reports, regression evals, and release gates. A GSM8K row contains a natural-language question, a worked solution, and a final numeric answer marked with a #### delimiter. As of May 2026, GSM8K is saturated above 98% on every frontier model and is widely contaminated in training data. The interesting question for a senior engineer is not “what does my model score on GSM8K?” but “which 2026 successor benchmark should I use instead, and how do I keep GSM8K-style arithmetic regression coverage in my own golden dataset?”
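To make the row format concrete, here is a minimal sketch of pulling the gold answer out of a worked solution. The row is an illustrative example written in the GSM8K style, not a row copied from the dataset, and `extract_gold_answer` is a hypothetical helper name.

```python
import re
from decimal import Decimal

# GSM8K reference solutions end with a line of the form "#### <number>".
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_gold_answer(solution: str) -> Decimal:
    """Return the number that follows the last '####' marker in a solution."""
    matches = FINAL_ANSWER.findall(solution)
    if not matches:
        raise ValueError("no '#### <number>' marker found")
    # Strip thousands separators before converting.
    return Decimal(matches[-1].replace(",", ""))


# Illustrative row in the GSM8K style (assumption: not taken from the dataset).
row = {
    "question": "A crate holds 12 jars. How many jars are left from 4 crates if 5 jars break?",
    "answer": "4 crates hold 4 * 12 = 48 jars.\nAfter breakage, 48 - 5 = 43 jars remain.\n#### 43",
}

gold = extract_gold_answer(row["answer"])
print(gold)
```

Using `Decimal` rather than `float` keeps later comparisons exact, which matters once the same parser is reused on currency-style answers.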

Why GSM8K matters in production LLM and agent systems

Misreading GSM8K scores creates a false sense of reasoning capability. A model can score 98% because its prompt resembles training examples or because it memorized answer patterns, then fail when a product asks for invoice prorations, tax estimates, unit conversions, subscription credits, or tool-backed calculations. The failure mode is not only “bad math.” It is silent arithmetic drift inside an answer that looks fluent enough for a human to trust, and in 2026 it is exactly the class of failure that public benchmarks no longer surface.

Developers feel this first when eval rows pass on casual review but fail exact numeric checks. Product teams see refunds, quotes, or support answers that are off by one step. SREs see longer completions, higher retry rates, calculator-tool loops, and p99 latency spikes after teams add chain-of-thought prompting to fix math. Compliance teams inherit audit risk when financial or healthcare workflows produce unsupported calculations without a clear final-answer field.

For 2026-era agentic pipelines, GSM8K-style problems are useful because arithmetic reasoning often sits in the middle of a longer workflow. An agent may retrieve a policy, call a calculator, convert units, and write a customer-facing answer. If the math step is weak, later steps can make the wrong number look official. Symptoms include low exact-match rate on math cohorts, high self-correction loops, final answers buried inside rationales, and failures concentrated around prompts with multiple quantities or distractor facts.

GSM8K is saturated: what to use instead in May 2026

This is the table to internalize. GSM8K saturated in 2024 and is now reported as a footnote, not a headline, on every frontier model card. The 2026 math benchmarks that actually discriminate between frontier systems look different.

| Benchmark | Difficulty | Status (May 2026) | Notes |
| --- | --- | --- | --- |
| GSM8K | Grade school | Saturated (>98% frontier), contaminated | Footnote only; useful as a regression smoke test, not a release signal |
| MATH (Hendrycks) | High school competition | Mostly saturated on levels 1-4 (>90%) | Level 5 still moves; subset usable |
| MATH-500 | Hand-curated subset of MATH | Above 90% frontier; still cited | Compact replacement for full MATH |
| AIME 2025 | High school olympiad | Discriminating: 60-90% frontier range | Reported by OpenAI, Anthropic, Google in 2026 model cards |
| Putnam-AXIOM | College olympiad | Discriminating: 20-50% frontier | Stretch goal for reasoning models |
| FrontierMath | Research-grade, expert-authored | Below 20% frontier; private holdout | The new ceiling; contamination-resistant |
| Omni-MATH | Olympiad, broad coverage | 50-70% frontier | Useful for multi-domain math |
| GPQA Diamond (math subset) | PhD-level science with math | 70-85% frontier | Reasoning-heavy, not pure math |

In our 2026 evals, we treat GSM8K the way we treat HumanEval and HellaSwag: useful for continuity dashboards, useless as a release gate. The math regression coverage moved into private golden datasets that mirror your product’s actual arithmetic surface (proration math, tax math, dosing math, financial conversion math), scored with NumericSimilarity and ReasoningQuality.

How FutureAGI uses GSM8K-style benchmarks

GSM8K has no dedicated FutureAGI anchor, so the clean FutureAGI surface is the evaluation dataset workflow: a Dataset, reference columns, evaluator attachments, and traceAI instrumentation from the model or agent runner. A team imports GSM8K rows, or (more usefully in 2026) private rows written in the same style that mirror real product arithmetic, with columns such as question, expected_response, model_answer, rationale, prompt_version, model_route, cohort, and tool_used.

FutureAGI’s approach is to treat GSM8K as a reasoning smoke test and arithmetic-regression smoke test, then connect each row to product release criteria. NumericSimilarity checks whether the extracted number matches the gold answer closely enough for the task. GroundTruthMatch handles canonical final-answer checks when an exact reference exists. ReasoningQuality flags brittle rationales, missing intermediate steps, and trajectories that arrive at the right number with unusable reasoning. A CustomEvaluation named unit_check ensures the number carries the right unit (dollars vs days vs percentages), a class of error that pure numeric similarity misses.
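A rubric-based unit check relies on a judge model. As a cheaper first pass, a deterministic pattern check can catch the obvious unit mismatches before the judge runs. The sketch below is that swapped-in technique, not FutureAGI's CustomEvaluation; the pattern table and the name `has_expected_unit` are assumptions made for illustration.

```python
import re

# Hypothetical pattern table: one regex per unit the product cares about.
UNIT_PATTERNS = {
    "USD": re.compile(r"\$\s*\d|\d\s*(?:USD|dollars?)\b", re.IGNORECASE),
    "days": re.compile(r"\d\s*days?\b", re.IGNORECASE),
    "percent": re.compile(r"\d\s*(?:%|percent\b)", re.IGNORECASE),
}


def has_expected_unit(answer: str, expected_unit: str) -> bool:
    """True when a number in the answer is attached to the expected unit."""
    return bool(UNIT_PATTERNS[expected_unit].search(answer))


# The same number, 12.5, is right or wrong depending on the unit it carries.
print(has_expected_unit("Your refund is $12.50.", "USD"))
print(has_expected_unit("Your refund is 12.5 days.", "USD"))
```

A check like this only confirms that the unit is present; it does not confirm that the number is right, so it complements numeric scoring rather than replacing it.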

A real workflow: a billing-support agent uses an LLM to calculate prorated refunds before drafting a reply. Engineers run a GSM8K-style regression eval suite before changing the prompt or model. The traceAI openai integration records prompt version, model name, tool calls, latency, and llm.token_count.prompt. If exact answer accuracy drops below 97% on refund rows, the release blocks. If ReasoningQuality drops while accuracy stays flat, the engineer samples traces, tightens the prompt, or routes math-heavy cases to a model with a better eval history via Agent Command Center conditional routing.
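The gate logic in this workflow can be sketched in a few lines. The 97% accuracy floor comes from the example above; the `EvalSummary` type, the `release_decision` name, and the 0.05 reasoning-drop threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    exact_accuracy: float     # fraction of refund rows with a correct final answer
    reasoning_quality: float  # mean rationale score in [0, 1]


def release_decision(baseline: EvalSummary, candidate: EvalSummary,
                     min_accuracy: float = 0.97,
                     max_reasoning_drop: float = 0.05) -> str:
    """Return 'block', 'review', or 'ship' for a candidate prompt or model."""
    # Hard gate: final-answer accuracy below the floor blocks the release.
    if candidate.exact_accuracy < min_accuracy:
        return "block"
    # Soft gate: accuracy held, but the rationales got worse; sample traces.
    if baseline.reasoning_quality - candidate.reasoning_quality > max_reasoning_drop:
        return "review"
    return "ship"


baseline = EvalSummary(exact_accuracy=0.98, reasoning_quality=0.90)
print(release_decision(baseline, EvalSummary(0.96, 0.90)))
print(release_decision(baseline, EvalSummary(0.98, 0.80)))
```

Keeping the hard and soft gates separate means a reasoning regression triggers trace sampling instead of silently passing on flat accuracy.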

Unlike EleutherAI LM Evaluation Harness runs that often stop at aggregate accuracy, this workflow keeps each GSM8K-style failure tied to evaluator reasons, trace fields, prompt versions, and release thresholds. Unlike Inspect AI, which focuses on safety evaluations, FutureAGI ties the arithmetic eval directly to production trace gates and Agent Command Center routing. We’ve found that the highest-value addition is the per-row tool_used column: it separates “the model did the math directly” from “the model called a calculator and trusted the result”, two completely different failure surfaces that look identical in aggregate accuracy.

When to keep GSM8K and when to retire it

Keep a 200-row GSM8K subset in your continuity dashboard as a smoke test: if a fine-tuned model suddenly drops to 80% on GSM8K, something is broken in the base behavior. Retire it as a model-selection signal: in 2026 every frontier model is statistically tied here. For real arithmetic capability, use AIME 2025 or FrontierMath subsets. For product reliability, use your own golden dataset with NumericSimilarity plus unit and tolerance checks.

Catching contamination, the GSM8K way

GSM8K is the textbook example of benchmark contamination. The dataset has been in public crawls since 2021, mirrored in countless GitHub repos and Hugging Face datasets, and quoted in thousands of blog posts. Frontier models in 2026 have functionally memorized it. The 2026 contamination-test pattern that works on GSM8K specifically:

  1. Entity-rename probe. Replace names and objects in 200 GSM8K rows (“Janet” → “Priya”, “duck eggs” → “watch batteries”). If accuracy drops more than 3 points, the model is pattern-matching on memorized surface features.
  2. Quantity-renumber probe. Replace all numbers in a row with values that preserve the structure but produce a different answer. Frontier models that memorized solutions often re-output the original answer despite the different inputs, which is a clear contamination signal.
  3. Held-out timestamp probe. Use private rows authored after the model’s training cutoff and never published online. A model that scores 98% on public GSM8K but 78% on your private post-cutoff math is memorizing, not reasoning.
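The first two probes above can be sketched with the standard library alone. Everything named here is an assumption for illustration: the templated row, the `solve` function that recomputes the gold answer, and the helper names. Real GSM8K rows are free text, so a renumber probe needs a per-row way to recompute the answer; a template is the simplest stand-in.

```python
import random
import re


def rename_entities(question: str, mapping: dict[str, str]) -> str:
    """Swap names and objects in one pass, leaving every quantity untouched."""
    if not mapping:
        return question
    # Longest keys first so 'duck eggs' wins over a shorter overlapping key.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(rf"\b{re.escape(k)}\b" for k in keys))
    return pattern.sub(lambda m: mapping[m.group()], question)


# Hypothetical templated row: fixed structure, variable quantities.
TEMPLATE = ("{name} packs {per_day} {item} a day for {days} days, "
            "then gives away {given}. How many {item} remain?")


def solve(per_day: int, days: int, given: int) -> int:
    return per_day * days - given


def renumbered_variant(original: dict[str, int], rng: random.Random) -> dict[str, int]:
    """Draw new quantities until the gold answer differs from the original."""
    while True:
        values = {"per_day": rng.randint(2, 20),
                  "days": rng.randint(2, 10),
                  "given": rng.randint(1, 3)}
        if solve(**values) != solve(**original):
            return values


def is_memorized_output(model_answer: int, original_gold: int, variant_gold: int) -> bool:
    """True when the model repeats the original answer on changed inputs."""
    return model_answer == original_gold and model_answer != variant_gold


original = {"per_day": 16, "days": 3, "given": 2}
question = TEMPLATE.format(name="Janet", item="duck eggs", **original)
renamed = rename_entities(question, {"Janet": "Priya", "duck eggs": "watch batteries"})
variant = renumbered_variant(original, random.Random(0))
print(renamed)
print(TEMPLATE.format(name="Janet", item="duck eggs", **variant), "->", solve(**variant))
```

Run the model on the original, renamed, and renumbered versions, then compare accuracy across the three sets; the 3-point threshold from the rename probe applies to that difference.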

This pattern generalizes to every saturated public benchmark. Treat any pre-2024 public eval as contaminated by default in 2026 and run the rename + renumber probes before trusting headline scores. We’ve found that the contamination-corrected gap between frontier models on GSM8K-style math is often double the gap reported on the public number.

Mapping GSM8K-style coverage to production arithmetic

Most products do not need GSM8K math; they need their own arithmetic surface scored well. The mapping pattern:

  • Refund math → multi-step proration with units, tax, and currency. Score with NumericSimilarity plus unit_check CustomEvaluation.
  • Dosing math (healthcare) → unit conversion, weight-based calculation, frequency. Add a hard ContentSafety post-guardrail because errors here are catastrophic.
  • Tax/quote math (B2B SaaS) → tiered pricing, discounts, region-based rates. Score with tolerance bands; exact match is too strict.
  • Inventory math (retail) → SKU aggregation, available-stock calculation. Pair NumericSimilarity with ToolSelectionAccuracy because most of this should be a calculator call, not LLM math.
  • Financial-statement math (fintech) → ledger arithmetic with strict equality. Use GroundTruthMatch and reject tolerance entirely.

Build the per-product math golden dataset once, score it on every release, and treat GSM8K as the smoke test before you ever look at the gold set’s results.
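The difference between tolerance bands and strict equality in the mapping above can be made concrete with a small scoring sketch. The cohort names and tolerance values are placeholders, not recommended settings, and this is plain Python rather than the FutureAGI evaluators.

```python
from decimal import Decimal

# Hypothetical per-cohort policy: relative tolerance allowed on the final number.
POLICY = {
    "quote": Decimal("0.01"),  # tiered pricing: accept answers within 1%
    "ledger": Decimal("0"),    # ledger arithmetic: strict equality only
}


def within_tolerance(answer: Decimal, gold: Decimal, rel_tol: Decimal) -> bool:
    """Compare using relative error; a zero gold value requires an exact zero."""
    if gold == 0:
        return answer == 0
    return abs(answer - gold) / abs(gold) <= rel_tol


def score(cohort: str, answer: str, gold: str) -> bool:
    # Parse from strings so currency values are never rounded through floats.
    return within_tolerance(Decimal(answer), Decimal(gold), POLICY[cohort])


print(score("quote", "100.50", "100.00"))   # inside the 1% band
print(score("ledger", "100.01", "100.00"))  # off by one cent, strict cohort
```

Keeping the tolerance in a per-cohort table makes the policy reviewable: a ledger cohort can never inherit a band by accident.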

Why models stopped improving on GSM8K but kept getting better at math

A puzzle for a senior engineer reading 2026 model cards: GSM8K is flat at 98%+ across frontier models, but the same models score wildly differently on AIME 2025 (range: 35-92%) and FrontierMath (range: 2-25%). What changed?

Three things happened between 2022 and 2026:

  1. Pretraining math coverage exploded. Frontier base models in 2026 saw 100-1000x more math tokens than 2022 baselines: math papers, competition archives, formalized proofs, code-tagged math, and synthetic chain-of-thought traces. GSM8K-level arithmetic became base-model competence, not a frontier skill.
  2. Chain-of-thought became default. 2022 models had to be prompted for CoT; 2026 models default to it, often as an extended-thinking phase invisible to the user. That alone closed most of the GSM8K gap.
  3. Verifier-trained reasoning models emerged. GPT-5.x deep reasoning, Claude’s extended thinking, and Gemini 3 Pro’s reasoning mode all use verifier-trained inference to check intermediate steps. GSM8K-style problems are well within the verifier’s range; FrontierMath’s expert-authored problems are not.

The takeaway: a 2026 model that does poorly on GSM8K is broken at the base level. A model that does well on GSM8K is showing you the floor of its math capability, not the ceiling.

Tool-augmented math: the production reality

The most common 2026 pattern for production math is not “ask the model to do it” but “have the model call a calculator or code interpreter and trust the result.” That changes what GSM8K-style evaluation measures:

  • Direct-mode runs test whether the model can do the math in chain-of-thought. Useful as a capability measurement.
  • Tool-mode runs test whether the model knows when to call the tool and what arguments to pass. The math itself is no longer the question; ToolSelectionAccuracy and FunctionCallAccuracy matter more than NumericSimilarity.
  • Verified-mode runs add a post-tool verifier (usually a second model or a deterministic check) that confirms the tool’s output before the agent commits. This is the pattern in regulated stacks.

A 2026 evaluation suite for arithmetic in agents should run all three modes and report separately. We’ve found that direct-mode accuracy correlates only weakly with tool-mode reliability: strong math models can be bad at calling a calculator, and vice versa.
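Reporting the three modes separately is a small grouping step. The sketch below uses made-up rows purely to show the mechanics; the `mode` and `correct` field names are assumptions, and the numbers are not measurements.

```python
from collections import defaultdict


def accuracy_by_mode(rows: list[dict]) -> dict[str, float]:
    """Return the pass rate for each run mode instead of one blended number."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["mode"]] += 1
        passed[row["mode"]] += int(row["correct"])
    return {mode: passed[mode] / totals[mode] for mode in totals}


# Illustrative rows only (fabricated for the example, not real results).
rows = [
    {"mode": "direct", "correct": True},
    {"mode": "direct", "correct": True},
    {"mode": "direct", "correct": True},
    {"mode": "direct", "correct": False},
    {"mode": "tool", "correct": True},
    {"mode": "tool", "correct": False},
    {"mode": "verified", "correct": True},
    {"mode": "verified", "correct": True},
]

report = accuracy_by_mode(rows)
blended = sum(row["correct"] for row in rows) / len(rows)
for mode, accuracy in report.items():
    print(f"{mode}: {accuracy:.0%}")
print(f"blended: {blended:.0%}")
```

In this toy data the blended figure equals the direct-mode figure and hides that tool mode passes only half its rows, which is the reason to keep the split.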

How to measure or detect GSM8K performance

Measure GSM8K (and its 2026 successors) as a final-answer benchmark with supporting reasoning and trajectory signals:

  • Exact answer accuracy: parse the final numeric answer and compare it with the gold answer; report pass rate by prompt, model, and cohort.
  • fi.evals.NumericSimilarity: calculates similarity between the numbers extracted from response and expected_response; useful when formatting differs.
  • fi.evals.GroundTruthMatch: checks the model answer against the canonical reference for rows with one accepted answer.
  • fi.evals.ReasoningQuality: scores whether the rationale or agent trajectory shows coherent intermediate reasoning.
  • fi.evals.FactualConsistency: guards against the failure mode where the rationale and the final answer disagree.
  • fi.evals.ToolSelectionAccuracy: for agent stacks, checks whether the model called the calculator or did the math directly; then route math to whichever path your evals show is more reliable.
  • fi.evals.CustomEvaluation with a unit_check rubric: catches unit mismatches (USD vs days vs percentages) that numeric similarity misses.
  • Trace fields: segment failures by llm.token_count.prompt, model route, calculator-tool usage, latency p99, and retry count.
  • Contamination probe: for any GSM8K row, run a held-out variant with renamed entities and renumbered quantities; large accuracy gaps indicate memorization.
  • User-feedback proxy: monitor correction rate, support escalation rate, and thumbs-down rate on production math workflows.

Minimal pairing snippet:

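from fi.evals import NumericSimilarity, GroundTruthMatch, ReasoningQuality

# model_answer, gold_answer, and run are assumed to come from your eval runner.
# Final-answer checks: numeric closeness, then match against the canonical reference.
num = NumericSimilarity().evaluate(response=model_answer, expected_response=gold_answer)
gt = GroundTruthMatch().evaluate(response=model_answer, expected_response=gold_answer)
# Rationale check: score the recorded trajectory rather than the final number.
reasoning = ReasoningQuality().evaluate(trajectory=run.trajectory)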

The benchmark is healthy when reruns are reproducible, final-answer failures are explainable, and score movement lines up with trace and user-feedback signals.
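The exact-answer check from the list above can also be sketched without any evaluator library, which is handy for verifying parser behavior in isolation. The helper names are hypothetical, and the "last number in the response" heuristic is an assumption; a production parser should prefer an explicit final-answer field.

```python
import re
from decimal import Decimal
from typing import Optional

# Matches integers and decimals, with optional sign and thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text: str) -> Optional[Decimal]:
    """Return the last number in the text, or None when there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return Decimal(matches[-1].replace(",", ""))


def exact_match(response: str, gold: str) -> bool:
    """Compare numerically so '$1,234.50' and '1234.5' count as the same answer."""
    answer = final_number(response)
    return answer is not None and answer == Decimal(gold)


print(exact_match("The refund is $1,234.50.", "1234.5"))
print(exact_match("About 46 eggs remain.", "47"))
```

Comparing parsed values instead of strings removes formatting noise from the pass rate, so a drop in accuracy reflects arithmetic rather than punctuation.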

For a cohort-filtered regression run on a private math Dataset with a unit-check CustomEvaluation, wire the gate explicitly:

from fi.datasets import Dataset
from fi.evals import NumericSimilarity, CustomEvaluation, AggregatedMetric

# Score only the proration cohort of the private refund-math golden dataset.
math_gold = Dataset.load("refund-math-v3").filter(cohort="proration")

# Judge-model rubric that passes only when the answer carries the right unit.
unit_check = CustomEvaluation(
    name="unit_check",
    rubric="Return Pass if the model's answer carries the correct unit (USD, days, percent) for the question; Fail otherwise.",
    judge_model="claude-opus-4-7",
)

# Weighted blend: numeric closeness dominates, the unit check contributes the rest.
agg = AggregatedMetric(
    metrics=[NumericSimilarity(tolerance=0.01), unit_check],
    weights=[0.7, 0.3],
)

# Same model, two prompt versions; block the release on a drop of more than 0.01.
baseline = agg.run_dataset(math_gold, model="gpt-5.1", prompt_version="v18")
candidate = agg.run_dataset(math_gold, model="gpt-5.1", prompt_version="v19")
assert candidate.score >= baseline.score - 0.01, "Refund-math regression blocks release"

The cross-benchmark math leaderboard, May 2026

A useful frame for choosing benchmarks: where do the frontier models actually separate, and where do they cluster? This is the picture as of May 2026:

| Model | GSM8K | MATH-500 | AIME 2025 | FrontierMath | Notes |
| --- | --- | --- | --- | --- | --- |
| GPT-5.x (deep reasoning) | ~99% | ~95% | ~88% | ~22% | Top of the AIME / FrontierMath stack |
| Claude Opus 4.7 (extended) | ~99% | ~94% | ~84% | ~18% | Closest to GPT-5 on math reasoning |
| Gemini 3 Pro (thinking) | ~99% | ~93% | ~80% | ~15% | Strong on long-form derivations |
| Llama 4 405B (open-weight) | ~98% | ~85% | ~62% | ~7% | Best open-weight, still a gap on hard math |
| Qwen 3 (open-weight) | ~98% | ~87% | ~68% | ~9% | Best open-weight on math specifically |
| Mistral Large 3 | ~97% | ~80% | ~55% | ~5% | Trails frontier on competition math |

The numbers tell two stories. First, every model in the table is at the ceiling on GSM8K, so the benchmark can no longer separate them. Second, the spread on AIME 2025 (55-88%) and FrontierMath (5-22%) is wide and meaningful; those are the benchmarks that should drive 2026 math-model decisions. Pair them with your private golden dataset to cover your product-specific arithmetic surface.

Reasoning-model trade-offs for math

A 2026 specific: when you turn on extended thinking, deep reasoning, or thinking mode, math accuracy improves, but cost and latency rise. For GSM8K-style smoke tests, default mode is fine because the problems are easy enough. For AIME-style problems, reasoning mode is required. For production math in agents, the decision depends on the route: high-stakes math (refunds, healthcare dosing) gets reasoning mode plus a tool-calculator fallback; low-stakes math (rough quote estimates) uses default mode with a verifier.

Score every route separately. A blended average across reasoning and default modes is meaningless.

Common mistakes (May 2026 edition)

  • Treating GSM8K as a 2026 model-selection signal. It is saturated above 98% across frontier models. Use AIME 2025, MATH-500, or FrontierMath for discrimination. GSM8K is a smoke test, not a gate.
  • Scoring the full rationale with exact match. Exact match belongs on the final numeric answer; score reasoning with ReasoningQuality, not string match.
  • Accepting numerically close answers without unit checks. A value can be arithmetically close and still wrong for dollars, days, percentages, or inventory counts. Add a unit_check rubric.
  • Reporting one GSM8K accuracy without version fields. Store prompt, model, sampling settings, and parser version, or regression deltas become untraceable.
  • Treating GSM8K as full agent readiness. It lacks real API delays, permissions, memory, retrieval failures, and tool schemas. For agent math reliability, use τ-bench-style trajectories with arithmetic embedded.
  • Ignoring benchmark contamination. Public benchmark exposure inflates scores; keep a private math golden dataset with rewritten quantities for real release gates.
  • No tool-vs-direct split. “Model did the math in its head” vs “model called a calculator” are different reliability regimes; track them separately.
  • Trusting chain-of-thought self-correction without verification. A model that loops three times to land on the right number costs 3x tokens; track retry count alongside accuracy.
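Two of the mistakes above, missing version fields and untracked retries, come down to what gets stored with each run. The record below is a minimal sketch; the `EvalRunRecord` fields and the fingerprint helper are assumptions for illustration, not a FutureAGI schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRunRecord:
    prompt_version: str
    model: str
    temperature: float
    parser_version: str
    accuracy: float      # result field
    mean_retries: float  # result field: self-correction loops per row


RESULT_FIELDS = ("accuracy", "mean_retries")


def config_fingerprint(record: EvalRunRecord) -> str:
    """Hash only the configuration, so runs are comparable when it matches."""
    config = {k: v for k, v in asdict(record).items() if k not in RESULT_FIELDS}
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


run_a = EvalRunRecord("v18", "gpt-5.1", 0.0, "parser-2", accuracy=0.98, mean_retries=1.1)
run_b = EvalRunRecord("v18", "gpt-5.1", 0.0, "parser-2", accuracy=0.97, mean_retries=2.4)
run_c = EvalRunRecord("v19", "gpt-5.1", 0.0, "parser-2", accuracy=0.98, mean_retries=1.0)

# Same fingerprint: the accuracy and retry changes are rerun noise or drift.
print(config_fingerprint(run_a) == config_fingerprint(run_b))
# Different fingerprint: the delta is attributable to the prompt change.
print(config_fingerprint(run_a) == config_fingerprint(run_c))
```

Storing `mean_retries` next to accuracy makes the token cost of self-correction visible in the same row as the score it bought.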

Frequently Asked Questions

What is GSM8K?

GSM8K is a grade-school math benchmark for testing whether language models can solve multi-step arithmetic word problems. FutureAGI teams can adapt GSM8K-style rows into golden datasets, then score final-answer correctness, reasoning quality, and trace failures.

How is GSM8K different from MATH?

GSM8K focuses on grade-school arithmetic word problems with short natural-language rationales and final numeric answers. MATH uses harder competition-style problems that often require algebra, geometry, proofs, or advanced notation.

How do you measure GSM8K performance?

Use FutureAGI evaluators such as NumericSimilarity or GroundTruthMatch for the final numeric answer, then add ReasoningQuality for the rationale. Segment results by prompt version, model route, and trace fields such as llm.token_count.prompt.