Evaluation

What Is the GSM8K Math Benchmark?

An OpenAI benchmark of 8.5K grade-school math word problems, used to measure multi-step arithmetic reasoning in language models.

GSM8K — Grade School Math 8K — is an OpenAI benchmark of 8,500 high-quality math word problems published by Cobbe et al. in 2021. Problems require two to eight steps of basic arithmetic (addition, subtraction, multiplication, division), reasoned through in natural language, and each has a single numeric ground-truth answer. The benchmark was designed to stress multi-step reasoning rather than computational complexity, making it a clean test of chain-of-thought capability. The standard split is 7.5K train and 1K test. By 2026, frontier models clear 95% on GSM8K, so it is increasingly used as a regression guard rather than a discriminator.
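
For reference, the dataset is public and loads in a few lines with the Hugging Face datasets library; ground-truth answers embed the final number after a "####" marker:

from datasets import load_dataset

# GSM8K ships with a 7.5K-problem train split and a 1K-problem test split.
gsm8k = load_dataset("gsm8k", "main")

example = gsm8k["test"][0]
print(example["question"])
# The reference answer is a worked solution; the final number follows "####".
print(example["answer"].split("####")[-1].strip())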

Why It Matters in Production LLM and Agent Systems

GSM8K’s value for production teams is not the headline number; it is the diagnostic. Models that fail on GSM8K tend to fail on any task that requires committing to a multi-step plan and following it without losing state. That includes spreadsheet automation, financial reasoning agents, tool-call argument computation, and any retrieval-augmented workflow where the model must combine numeric facts from different chunks.

The pain shows up in three places. Fine-tuning teams who distill a small model from a large one routinely find that the student passes single-step QA but loses 8–15 points on GSM8K because the chain-of-thought capability did not transfer. Quantization teams find that aggressive 4-bit quantization sometimes preserves perplexity but degrades GSM8K by 3–10 points — a silent reasoning regression that perplexity does not surface. RLHF teams sometimes optimize for helpfulness in a way that shortens chains-of-thought, also costing GSM8K points.

For 2026 agent stacks the relevance is sharper. An agent that reasons over tool outputs, computes an order quantity, or summarizes financial reports needs the GSM8K-style multi-step capability intact. A capable chat model that lost its GSM8K score after a tone-focused fine-tune will fail in unpredictable ways the first time a planner asks it to add three numbers pulled from a CRM. The benchmark is a cheap, fast canary for that capability.

How FutureAGI Handles GSM8K-Style Regression Evaluation

FutureAGI does not redistribute the GSM8K dataset; teams load it themselves. Once loaded into a Dataset, the standard evaluation pattern is Dataset.add_evaluation(ExactMatch()) on the final numeric answer and Dataset.add_evaluation(ReasoningQuality()) on the chain-of-thought trace, so reasoning quality is graded separately from final-answer correctness. That separation is what makes GSM8K useful as a regression test: a model that drops from 92% to 85% with a clean reasoning trace is a different bug than one that drops to 85% with broken intermediate steps.
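
A minimal sketch of that pattern, using the evaluator names above; the Dataset import path and loading call are assumptions, and exact signatures may differ by SDK version:

from fi.evals import ExactMatch, ReasoningQuality
from fi.datasets import Dataset  # assumed import path

# Load a GSM8K split prepared as JSONL (the loading call is illustrative).
ds = Dataset.from_file("gsm8k_test.jsonl")

# Grade final-answer correctness and reasoning quality as separate signals.
ds.add_evaluation(ExactMatch())        # boolean: final number vs. ground truth
ds.add_evaluation(ReasoningQuality())  # 0-1 score over the chain-of-thought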

A real workflow: a quantization team has a 7B model at FP16 with a 91.4% GSM8K score. They release an INT4 candidate and run the same Dataset through fi.evals.ExactMatch and fi.evals.ReasoningQuality. ExactMatch drops to 86.2%; ReasoningQuality drops from 0.88 to 0.71. The dashboard slice by problem-step-count shows the regression concentrated on 5+ step problems. The team rejects the deploy, tries weight-aware quantization, and the score recovers to 90.1%. Without GSM8K wired into a regression eval, the same drop would have shipped and surfaced weeks later as customer reports of “the agent gets math wrong now.”
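
Wiring that accept/reject decision into a release pipeline takes a few lines. A sketch using the run scores above, with hand-picked tolerances (the thresholds are illustrative, not recommendations):

# Hypothetical aggregate scores from the FP16 baseline and INT4 candidate runs.
baseline = {"exact_match": 0.914, "reasoning_quality": 0.88}
candidate = {"exact_match": 0.862, "reasoning_quality": 0.71}

# Reject the deploy if either metric regresses past its tolerance.
MAX_EM_DROP = 0.02
MAX_RQ_DROP = 0.05

em_drop = baseline["exact_match"] - candidate["exact_match"]
rq_drop = baseline["reasoning_quality"] - candidate["reasoning_quality"]

if em_drop > MAX_EM_DROP or rq_drop > MAX_RQ_DROP:
    raise SystemExit(f"GSM8K regression: EM -{em_drop:.3f}, RQ -{rq_drop:.2f}")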

FutureAGI’s approach is to treat GSM8K as one row in a broader reasoning eval surface alongside MATH, MMLU-STEM, and ARC. In contrast to ad-hoc notebook runs, every run is versioned against a Dataset.commit() with model, prompt, and decoding parameters logged, so regressions are traceable to the change that caused them.
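
Continuing the Dataset sketch above, the versioning step might look like the following; the metadata fields are hypothetical, and the commit signature is an assumption based on the Dataset.commit() named above:

# Pin everything that could explain a future score change (fields are illustrative).
ds.commit(
    message="GSM8K regression run: INT4 candidate",
    metadata={
        "model": "acme-7b-int4",    # hypothetical model id
        "prompt_version": "cot-v3",
        "temperature": 0.0,
        "top_p": 1.0,
        "max_tokens": 512,
    },
)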

How to Measure or Detect It

Score GSM8K against both the final answer and the reasoning trace:

  • fi.evals.ExactMatch — returns boolean correctness of the final numeric answer against ground truth.
  • fi.evals.ReasoningQuality — returns 0–1 reasoning quality across the chain-of-thought trace.
  • fi.evals.TaskCompletion — returns whether the model committed to and finished the multi-step plan.
  • Step-count cohort slicing (dashboard signal) — accuracy by problem step count (2–8); regressions concentrate on long chains.
  • Decoding-parameter slice — temperature, top-p, max-tokens; chain truncation at low max-tokens silently kills GSM8K.
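
Putting the first two evaluators together on a single problem:
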
from fi.evals import ExactMatch, ReasoningQuality

em = ExactMatch()        # boolean correctness of the final numeric answer
rq = ReasoningQuality()  # 0-1 quality score over the chain-of-thought trace

problem = "Maria has 3 boxes with 8 apples each. She gives away 7. How many remain?"
answer = "17"
trace = "3 * 8 = 24. 24 - 7 = 17."

# Grade the final answer and the reasoning trace as separate signals.
print(em.evaluate(input=problem, output=answer, expected="17"))
print(rq.evaluate(input=problem, output=trace))
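
In practice the model's raw completion contains the whole chain of thought, so the final number has to be extracted before exact-match scoring. A minimal sketch that also honors the "####" marker GSM8K ground truth uses:

import re

def extract_final_number(completion: str) -> str | None:
    """Pull the final number from a completion, honoring GSM8K's '####' marker."""
    # Ground-truth answers place the final number after "####".
    if "####" in completion:
        completion = completion.split("####")[-1]
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return matches[-1].replace(",", "") if matches else None

assert extract_final_number("3 * 8 = 24. 24 - 7 = 17.") == "17"
assert extract_final_number("... so the answer is #### 1,234") == "1234"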

Common Mistakes

  • Reporting GSM8K accuracy without a reasoning-quality score. A model can guess final answers via memorization; the trace tells you whether reasoning is real.
  • Running with greedy decoding only. GSM8K self-consistency results vary 3–6 points by sampling strategy; pick one and pin it (see the decoding-config sketch after this list).
  • Truncating max-tokens. 256-token caps cut chains-of-thought mid-step and produce false negatives. Allow 512+ for reasoning runs.
  • Treating 95% as ceiling. The remaining 5% is concentrated on the hardest, most multi-step problems — the ones that matter for production agents.
  • Skipping translated and code-switching variants. GSM8K is English-only; multilingual reasoning needs MGSM or local translations.
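
For the decoding-pinning point above, a sketch of a frozen configuration checked into the eval harness (the values are illustrative defaults, not recommendations):

# Frozen decoding parameters for every GSM8K regression run.
# Changing any of these is itself a change that must be versioned.
GSM8K_DECODING = {
    "temperature": 0.0,  # greedy; switch to sampling only with self-consistency
    "top_p": 1.0,
    "max_tokens": 512,   # 256-token caps truncate long chains mid-step
    "seed": 1234,
}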

Frequently Asked Questions

What is GSM8K?

GSM8K is a benchmark of 8,500 grade-school math word problems built by OpenAI in 2021 to evaluate multi-step arithmetic reasoning in language models. The 7.5K train and 1K test split is widely used for fine-tuning and evaluation.

How is GSM8K different from MATH?

GSM8K covers grade-school arithmetic with two-to-eight-step problems. The MATH benchmark covers high-school competition mathematics including algebra, geometry, calculus, and number theory and is much harder.

How does FutureAGI evaluate GSM8K-style problems?

FutureAGI runs GSM8K problems through Dataset.add_evaluation with ExactMatch on the final numeric answer and ReasoningQuality on the chain-of-thought trace, so reasoning regressions surface separately from final-answer regressions.