What Is the SVAMP Math Benchmark?
A 1,000-problem math word-problem benchmark with structural variations designed to expose shallow pattern matching in language models.
SVAMP (Simple Variations on Arithmetic Math word Problems) is a 1,000-problem benchmark introduced by Patel et al. (2021) to expose pattern-matching in language models on arithmetic word problems. Each item starts from an existing elementary-school problem and applies a small variation (swapping the question type, reordering objects, adding an irrelevant sentence) that should not make the problem harder but often breaks the model's answer. Run as a robustness probe next to GSM8K, it gives a simple signal: a high GSM8K score paired with a much lower SVAMP score points to shallow pattern matching rather than arithmetic skill.
By May 2026, GSM8K is saturated (frontier models routinely score above 98%) and SVAMP itself is mostly saturated at the top of the leaderboard (95%+). The benchmark is still useful as a regression probe for fine-tuned variants, distilled models, quantized inference paths, and open-weights models in the 7B-70B range, where pattern-matching shortcuts still leak through.
Why SVAMP matters in production LLM and agent systems
The production motivation is fragile reasoning. A model that scores 98% on GSM8K and 75% on SVAMP is fluent on the canonical problem shape but breaks the moment a real user phrases the same question differently. That is exactly what production traffic looks like: a financial-summary agent given a sentence with one extra clause, a tutoring bot given a question with the order of operations restated, a workflow agent reading a poorly worded support ticket. The user does not apologize for variation; the model has to reason through it.
The pain shows up across roles. Fine-tuning teams find that a model heavily trained on GSM8K-style data overfits to that surface form and degrades on SVAMP, a lesson learned only after the eval was added. Agent teams shipping math-adjacent workflows find the agent passing internal tests and failing the first time a user paraphrases a request. Compliance teams in regulated industries care because fragile reasoning under variation translates to inconsistent answers to the same underlying question.
For 2026-era agent stacks, SVAMP-style robustness is a planner concern. A planner reads tool outputs that vary in phrasing, then must reason about quantities. If the underlying model fails the SVAMP probe, the planner will fail wherever a tool returns a slightly different format than the training set used. Robustness is not a luxury here; it is a precondition for reliable orchestration.
How FutureAGI handles SVAMP-style robustness evaluation
FutureAGI does not host the SVAMP dataset; teams load it themselves into a Dataset. Once loaded, the standard pattern is Dataset.add_evaluation with a numeric-answer check (CustomEvaluation rubric) on the final answer plus a reasoning rubric on the chain-of-thought trace, both sliced by SVAMP variation type. The variation slice is what gives SVAMP its diagnostic power.
| Variation type | Example transformation | Common failure |
|---|---|---|
| Question-type swap | “How many left” → “How many total” | Model gives the old answer |
| Object reorder | Swap object positions in the prompt | Model averages over canonical order |
| Irrelevant info added | Extra sentence about a distractor | Model includes distractor in arithmetic |
| Tense change | “had” → “has” | Model misreads state |
| Numerical reorder | Swap two numbers | Model uses fixed slot positions |
| Sign change | Add → subtract | Model picks the wrong operator |
A model can hold steady on “question-type swap” but collapse on “irrelevant information added,” and that asymmetry tells you which pattern the model is actually relying on. FutureAGI’s dataset versioning keeps the variation metadata attached to each row, so the slice is a single dashboard filter rather than a one-off script.
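As a rough illustration of the per-variation slice, the sketch below computes accuracy per variation type in plain Python. It does not use the FutureAGI SDK; the row layout (`variation`, `correct`), the variation labels, and the sample results are assumptions made up for the example.

```python
from collections import defaultdict


def accuracy_by_variation(rows):
    """Return {variation: accuracy} for rows carrying 'variation' and 'correct' keys."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row["variation"]] += 1
        hits[row["variation"]] += int(row["correct"])
    return {v: hits[v] / totals[v] for v in totals}


# Hypothetical, hand-written results used only to show the shape of the output.
results = [
    {"variation": "question_type_swap", "correct": True},
    {"variation": "question_type_swap", "correct": True},
    {"variation": "irrelevant_info_added", "correct": True},
    {"variation": "irrelevant_info_added", "correct": False},
]

by_variation = accuracy_by_variation(results)

# Print the weakest slice first so the brittle variation is easy to spot.
for variation, acc in sorted(by_variation.items(), key=lambda kv: kv[1]):
    print(f"{variation}: {acc:.0%}")
```

Sorting by accuracy puts the most brittle variation at the top, which is usually the slice worth inspecting first.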
Concretely: a team fine-tunes a Llama 4 8B reasoning model and runs both GSM8K and SVAMP through fi.evals after every checkpoint. GSM8K is steady at 96%; SVAMP drops from 84% to 78% between two checkpoints. The slice-by-variation view shows the regression concentrated on “irrelevant information added” cases: the model has learned to incorporate every sentence in the prompt. The team adjusts the fine-tune mix to include adversarial distractors and recovers SVAMP to 86%. None of that signal would have surfaced if only GSM8K had been wired into the regression eval.
Unlike a one-time leaderboard run, this workflow keeps SVAMP attached to model versions, checkpoints, and prompt templates so any regression is attributable.
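The checkpoint-over-checkpoint comparison can be sketched as a small report that tracks the GSM8K-vs-SVAMP gap and flags a drop in SVAMP accuracy. This is plain Python, not a FutureAGI API; the checkpoint names and the 3-point `max_drop` threshold are assumptions, and the scores are the ones from the scenario above.

```python
def robustness_report(history, max_drop=0.03):
    """Summarize robustness per checkpoint.

    history: list of (checkpoint_name, gsm8k_acc, svamp_acc) in training order.
    max_drop: hypothetical tolerance for a SVAMP drop between checkpoints.
    """
    report = []
    prev_svamp = None
    for name, gsm8k, svamp in history:
        drop = 0.0 if prev_svamp is None else prev_svamp - svamp
        report.append({
            "checkpoint": name,
            "gap": round(gsm8k - svamp, 4),    # GSM8K-vs-SVAMP delta
            "svamp_drop": round(drop, 4),      # change since previous checkpoint
            "regressed": drop > max_drop,
        })
        prev_svamp = svamp
    return report


# Scores from the scenario above; checkpoint names are placeholders.
history = [("ckpt-a", 0.96, 0.84), ("ckpt-b", 0.96, 0.78)]
report = robustness_report(history)
for entry in report:
    print(entry)
```

In this run GSM8K is flat, so only the SVAMP drop and the widening gap reveal the regression at the second checkpoint.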
How to measure or detect it
Pick signals that distinguish surface success from robust reasoning:
- Numeric-answer check (CustomEvaluation): returns whether the final numeric answer matches the gold answer; the canonical SVAMP score.
- Reasoning rubric (CustomEvaluation): scores whether the chain-of-thought trace is logically valid; surfaces correct-answer-with-broken-reasoning cases.
- TaskCompletion: useful when SVAMP is folded into an agent trajectory rather than a single-turn QA.
- Variation-type slice (dashboard signal): SVAMP score broken out by variation category to see where the model is brittle.
- GSM8K-vs-SVAMP delta: the gap between the standard set and the variation set, tracked per checkpoint as a robustness regression metric.
Minimal Python:
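```python
from fi.evals import CustomEvaluation

# Binary check on the final numeric answer.
acc = CustomEvaluation(
    name="svamp_numeric_answer",
    rubric="Score 1 if final numeric answer matches the gold answer, else 0.",
)

# Graded rubric on the chain-of-thought trace.
reasoning = CustomEvaluation(
    name="svamp_reasoning",
    rubric=(
        "Score 1-5 on whether each step in the chain-of-thought is "
        "arithmetically valid and follows from the previous step."
    ),
)

# Assumes svamp_dataset is an iterable of dicts with "question", "answer",
# and "variation", plus the model's reply under "response" (a hypothetical key).
for row in svamp_dataset:
    response = row["response"]
    a = acc.evaluate(input=row["question"], output=response, expected_response=row["answer"])
    r = reasoning.evaluate(input=row["question"], output=response)
    print(row["variation"], a.score, r.score)
```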
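For teams that want a deterministic cross-check next to the rubric-based numeric check, a minimal sketch in plain Python follows. It assumes the final answer is the last number in the model output, which is a heuristic rather than anything SVAMP or FutureAGI prescribes; the function names and tolerance are illustrative.

```python
import re

# Optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def numeric_match(model_output, gold, tol=1e-6):
    """True if the last number in model_output equals gold within tol."""
    predicted = extract_final_number(model_output)
    return predicted is not None and abs(predicted - float(gold)) <= tol


print(numeric_match("She has 8 - 3 = 5 apples left.", 5))
print(numeric_match("The total is 1,250.", "1250"))
print(numeric_match("I am not sure.", 5))
```

The last-number heuristic misfires when the model appends extra figures after its answer, so it works best with a prompt template that asks for the final answer on the last line.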
Common mistakes
- Reporting only the GSM8K score. A model can clear GSM8K at 98% and SVAMP at 78%; shipping the GSM8K number alone hides the brittleness.
- Not slicing by variation type. SVAMP’s diagnostic value is in the per-variation breakdown; a single mean buries the failure mode.
- Assuming chain-of-thought prompting fixes it. It helps, but a model that pattern-matches will pattern-match its own reasoning chain too. Compare reasoning quality before vs after variation.
- Treating SVAMP as a frontier leaderboard discriminator. It is mostly saturated at the top; use it as a regression guard on retrains and smaller open-weights models.
- Using SVAMP without controlling for prompt template. Different prompts move SVAMP scores 5-10 points. Pin the template before comparing checkpoints.
- Skipping FrontierMath, AIME 2025, and Putnam-AXIOM for frontier models. SVAMP cannot discriminate between frontier reasoning models in 2026; escalate to these harder benchmarks instead.
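One way to pin the prompt template before comparing checkpoints is to store a fingerprint of the template with every score and refuse to compare runs whose fingerprints differ. The sketch below is plain Python under that assumption; the template text, run records, and field names are hypothetical.

```python
import hashlib


def template_fingerprint(template):
    """Short, stable identifier for an exact prompt template string."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def comparable(run_a, run_b):
    """Two runs are comparable only if they used the same template."""
    return run_a["template_id"] == run_b["template_id"]


# Hypothetical templates and run records, for illustration only.
TEMPLATE = "Solve the problem and give the final number.\n\nQuestion: {question}\nAnswer:"
OTHER_TEMPLATE = "Think step by step.\n\nQuestion: {question}\nAnswer:"

run_a = {"checkpoint": "ckpt-a", "template_id": template_fingerprint(TEMPLATE), "svamp": 0.84}
run_b = {"checkpoint": "ckpt-b", "template_id": template_fingerprint(TEMPLATE), "svamp": 0.78}
run_c = {"checkpoint": "ckpt-b", "template_id": template_fingerprint(OTHER_TEMPLATE), "svamp": 0.81}

print(comparable(run_a, run_b))
print(comparable(run_a, run_c))
```

Hashing the exact string means even a whitespace change produces a new fingerprint, which is the intended behavior when small prompt edits can move scores by several points.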
Frequently Asked Questions
What is SVAMP?
SVAMP (Simple Variations on Arithmetic Math word Problems) is a 1,000-problem benchmark introduced by Patel et al. in 2021 to test whether language models actually reason about math word problems or merely match surface patterns.
How is SVAMP different from GSM8K?
GSM8K measures multi-step arithmetic over a fixed problem set. SVAMP applies adversarial variations to existing problems (swapping question types, reordering, adding distractors) to expose models that pattern-match instead of reason.
How does FutureAGI evaluate SVAMP-style problems?
FutureAGI runs SVAMP problems through Dataset.add_evaluation with an exact-match check on the final numeric answer plus a CustomEvaluation reasoning rubric on the chain-of-thought, sliced by variation type.