Evaluation

What Is the SVAMP Math Benchmark?

A 1,000-problem math word-problem benchmark with structural variations designed to expose shallow pattern matching in language models.

SVAMP — Simple Variations on Arithmetic Math word Problems — is a 1,000-problem benchmark introduced by Patel et al. (2021) to expose pattern matching in language models on arithmetic word problems. Each item starts from an existing elementary-school problem and applies a small variation — swapping the question type, reordering objects, adding an irrelevant sentence — that a model which genuinely reasons handles easily but a model that pattern-matches often does not. SVAMP is typically run next to GSM8K as a robustness probe: a high standard score paired with a much lower SVAMP score signals shallow pattern matching rather than arithmetic skill.
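
To make the variation idea concrete, here is a hypothetical pair in the style of the benchmark (illustrative rows, not drawn from the actual dataset):

original = {
    "question": "Jack had 8 pens and bought 5 more. How many pens does he have now?",
    "answer": 13,
}
# "Irrelevant information added": the distractor sentence must not change
# the answer, but a model that incorporates every sentence often slips.
variant = {
    "question": (
        "Jack had 8 pens. His sister had 3 pencils. "
        "Jack bought 5 more pens. How many pens does he have now?"
    ),
    "answer": 13,
    "variation_type": "irrelevant info added",
}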

Why It Matters in Production LLM and Agent Systems

The production motivation is fragile reasoning. A model that scores 92% on GSM8K and 71% on SVAMP is fluent on the canonical problem shape but breaks the moment a real user phrases the same question differently. That is exactly what production traffic looks like: a financial-summary agent given a sentence with one extra clause, a tutoring bot given a question with the order of operations restated, a workflow agent reading a poorly worded support ticket. The user does not apologize for variation; the model has to reason through it.

The pain shows up across roles. Fine-tuning teams find that a model heavily trained on GSM8K-style data overfits to that surface form and degrades on SVAMP — a lesson learned only after the eval was added. Agent teams shipping math-adjacent workflows find the agent passing internal tests and failing the first time a user paraphrases a request. Compliance teams in regulated industries care because a fragile reasoning signal under variation translates to inconsistency in answers to the same underlying question.

For 2026-era agent stacks, SVAMP-style robustness is a planner concern. A planner reads tool outputs that vary in phrasing, then must reason about quantities. If the underlying model fails the SVAMP probe, the planner will fail wherever a tool returns a slightly different format than the training set used. Robustness is not a luxury here; it is a precondition for reliable orchestration.

How FutureAGI Handles SVAMP-Style Robustness Evaluation

FutureAGI does not host the SVAMP dataset; teams load it themselves into a Dataset. Once loaded, the standard pattern is Dataset.add_evaluation(FactualAccuracy()) on the final numeric answer plus Dataset.add_evaluation(ReasoningQuality()) on the chain-of-thought trace, both sliced by SVAMP variation type. The variation slice is what gives SVAMP its diagnostic power: a model can hold steady on “Type A: change of question” but collapse on “Type B: change of object order,” and that asymmetry tells you which pattern the model is actually relying on. FutureAGI’s dataset versioning keeps the variation metadata attached to each row, so the slice is a single dashboard filter rather than a one-off script.
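
A minimal sketch of that pattern. Only add_evaluation, FactualAccuracy, and ReasoningQuality come from the pattern described above; the import path, the loader call, and the column names are assumptions, not the SDK's confirmed API:

from fi.evals import FactualAccuracy, ReasoningQuality
from fi.datasets import Dataset  # import path assumed

# Load the SVAMP rows yourself (hypothetical loader; schema assumed to be
# question / answer / variation_type).
svamp = Dataset.from_file("svamp.json")

# The standard pattern: score the final answer and the reasoning trace.
svamp.add_evaluation(FactualAccuracy())   # final numeric answer vs. gold
svamp.add_evaluation(ReasoningQuality())  # logical validity of the trace

# Dataset versioning keeps variation_type attached to each row, so the
# per-variation slice is a dashboard filter rather than a one-off script.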

Concretely: a team fine-tunes a 7B reasoning model and runs both GSM8K and SVAMP through fi.evals after every checkpoint. GSM8K is steady at 89%; SVAMP drops from 74% to 68% between two checkpoints. The slice-by-variation view shows the regression concentrated on “irrelevant information added” cases — the model has learned to incorporate every sentence in the prompt. The team adjusts the fine-tune mix to include adversarial distractors and recovers SVAMP to 76%. None of that signal would have surfaced if only GSM8K had been wired into the regression eval.

How to Measure or Detect It

Pick signals that distinguish surface success from robust reasoning:

  • FactualAccuracy: returns whether the final numeric answer matches the gold; the canonical SVAMP score.
  • ReasoningQuality: scores whether the chain-of-thought trace is logically valid; surfaces correct-answer-with-broken-reasoning cases.
  • TaskCompletion: useful when SVAMP is folded into an agent trajectory rather than a single-turn QA.
  • Variation-type slice (dashboard signal): SVAMP score broken out by variation category — question-type swap, object reorder, irrelevant info — to see where the model is brittle.
  • GSM8K-vs-SVAMP delta: the gap between the standard set and the variation set, tracked per checkpoint as a robustness regression metric (see the sketch after this list).
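
A minimal sketch of the last two signals in plain Python, assuming per-row eval results have already been collected as dicts carrying a score and the variation metadata (field names are illustrative):

from collections import defaultdict

def slice_by_variation(rows):
    """Mean SVAMP score per variation category."""
    buckets = defaultdict(list)
    for r in rows:
        buckets[r["variation_type"]].append(r["score"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

def robustness_delta(gsm8k_score, svamp_rows):
    """GSM8K-vs-SVAMP gap; track this per checkpoint."""
    svamp_score = sum(r["score"] for r in svamp_rows) / len(svamp_rows)
    return gsm8k_score - svamp_score

# slice_by_variation might return something like
# {"question-type swap": 0.81, "object reorder": 0.74, "irrelevant info added": 0.62}
# — a widening delta between checkpoints is the regression signal.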

Minimal Python (generate_answer is a placeholder for your model call, and the ReasoningQuality signature is assumed to mirror FactualAccuracy):

from fi.evals import FactualAccuracy, ReasoningQuality

acc = FactualAccuracy()
reasoning = ReasoningQuality()

for row in svamp_dataset:
    # Placeholder for your model call.
    model_response = generate_answer(row["question"])

    # Canonical SVAMP score: final numeric answer vs. gold.
    result = acc.evaluate(
        input=row["question"],
        output=model_response,
        expected_response=row["answer"],
    )

    # Chain-of-thought validity; signature assumed to mirror FactualAccuracy.
    trace = reasoning.evaluate(
        input=row["question"],
        output=model_response,
    )
    print(result.score, trace.score)

Common Mistakes

  • Reporting only the GSM8K score. A model can clear GSM8K at 92% and SVAMP at 70% — shipping the GSM8K number alone hides the brittleness.
  • Not slicing by variation type. SVAMP’s diagnostic value is in the per-variation breakdown; a single mean buries the failure mode.
  • Assuming chain-of-thought prompting fixes it. It helps, but a model that pattern-matches will pattern-match its own reasoning chain too. Compare reasoning quality before vs. after variation.
  • Treating SVAMP as a leaderboard discriminator. It is more useful as a regression guard on retrains; don’t pick a model on SVAMP score alone.
  • Using SVAMP without controlling for prompt template. Different prompts move SVAMP scores 5–10 points. Pin the template before comparing checkpoints (see the sketch below).
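
One way to pin the template, as a sketch — the wording here is illustrative; what matters is that the same versioned constant is used for every checkpoint:

# Version the template alongside the eval config; never compare checkpoints
# rendered with different templates.
SVAMP_PROMPT_V1 = (
    "Solve the math word problem. Think step by step, then give the final "
    "numeric answer on the last line.\n\nProblem: {question}\nAnswer:"
)

def render_prompt(question: str) -> str:
    return SVAMP_PROMPT_V1.format(question=question)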

Frequently Asked Questions

What is SVAMP?

SVAMP — Simple Variations on Arithmetic Math word Problems — is a 1,000-problem benchmark introduced by Patel et al. in 2021 to test whether language models actually reason about math word problems or merely match surface patterns.

How is SVAMP different from GSM8K?

GSM8K measures multi-step arithmetic over a fixed problem set. SVAMP applies adversarial variations to existing problems — swapping question types, reordering, adding distractors — to expose models that pattern-match instead of reason.

How does FutureAGI evaluate SVAMP-style problems?

FutureAGI runs SVAMP problems through Dataset.add_evaluation with FactualAccuracy on the final numeric answer plus ReasoningQuality on the chain-of-thought, sliced by variation type to expose pattern-matching regressions.