What Is the StrategyQA Reasoning Benchmark?

An open-domain question-answering benchmark requiring implicit multi-step reasoning, with yes/no answers, decomposition steps, and per-step Wikipedia evidence.

StrategyQA, introduced by Geva et al. (2021), is an open-domain question-answering benchmark of 2,780 yes/no questions whose answers require multi-step reasoning over implicit strategies. Example: “Did Aristotle use a laptop?” requires reasoning that laptops were invented well after Aristotle lived. Each question carries a human-authored decomposition into atomic sub-questions plus Wikipedia evidence for each sub-question, allowing fine-grained scoring beyond final-answer accuracy.
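The shape of a single item, as described above, can be sketched as a small record. The class name `StrategyQAItem`, its field names, and the `is_well_formed` helper are hypothetical and chosen for illustration; they are not the dataset's official schema, and the evidence entries are placeholder article titles.

```python
from dataclasses import dataclass, field


@dataclass
class StrategyQAItem:
    """Hypothetical container for one StrategyQA-style question."""

    question: str
    answer: bool                  # gold yes/no answer
    decomposition: list[str]      # atomic sub-questions, in order
    # One list of evidence snippets per decomposition step (may be empty for a step).
    evidence: list[list[str]] = field(default_factory=list)

    def is_well_formed(self) -> bool:
        # Needs at least one step; evidence, when present, must align one-to-one with steps.
        return bool(self.decomposition) and (
            not self.evidence or len(self.evidence) == len(self.decomposition)
        )


item = StrategyQAItem(
    question="Did Aristotle use a laptop?",
    answer=False,
    decomposition=[
        "When did Aristotle live?",
        "When was the laptop invented?",
        "Is #2 before #1?",
    ],
    evidence=[["Aristotle"], ["Laptop"], []],  # placeholder titles, not real annotations
)
print(item.is_well_formed())
```

Keeping evidence aligned step by step is what makes per-step scoring possible; a flat list of evidence would only support final-answer checks.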

By May 2026, StrategyQA’s headline final-answer accuracy has saturated at the top end (frontier models such as GPT-5.x, Claude Opus 4.7, and Gemini 3 sit above 90%), so it is no longer used as a primary discriminator between frontier systems. It survives as a regression probe and as a per-strategy diagnostic when comparing fine-tuned variants, smaller open-weights models, or agent planners. For 2026 frontier-model comparisons, escalate to HLE, FrontierMath, GPQA Diamond, or τ-bench, which still have headroom.

Why StrategyQA matters in production LLM and agent systems

A model that aces MMLU can still fail StrategyQA-shaped questions because the implicit-strategy step is not in the training distribution as a single fact. In production, this matters every time a user asks a question whose answer requires “reasoning across two facts the model knows separately”: comparison shopping, eligibility checks, retrieval-augmented synthesis, and agent tasks with implicit subgoals.

The pain shows up across roles. A RAG team finds their pipeline answers single-fact lookups well but degrades on StrategyQA-shaped questions such as “given X, would Y also satisfy Z?”. An ML engineer running a chain-of-thought variant sees accuracy improve on MMLU but drop on StrategyQA, indicating the CoT prompt is over-fitting to surface patterns. A product lead onboarding a new model finds the demo questions all pass but real user questions, which carry implicit comparisons, silently fail.

In 2026 agent stacks, StrategyQA-shaped reasoning is the default. A planner that decides to call tool A, then tool B, then synthesize, without being told to, is doing exactly what StrategyQA measures: implicit decomposition. Agent benchmarks like τ-bench, GAIA, and SWE-Bench Verified inherit StrategyQA’s emphasis on intermediate reasoning, and trajectory evaluators are built to score the same property at the trace level.

How FutureAGI handles StrategyQA evaluation

FutureAGI’s approach is to treat StrategyQA as one entry in a regression-eval portfolio rather than a one-shot leaderboard run. The 2,780 questions load into a Dataset via Hugging Face import. Dataset.add_evaluation attaches a binary answer check (CustomEvaluation with a yes/no rubric) for accuracy and a chain-of-thought rubric (also CustomEvaluation) for per-step inference soundness. The platform stores per-question scores so a release diff highlights which strategy types degraded: temporal reasoning, comparison, counting, or set operations.

Strategy type | Example | Common failure mode
Temporal | “Could Caesar have used WhatsApp?” | Model ignores dates
Comparison | “Is X heavier than Y?” | Model picks the more famous entity
Counting | “Does Z have more than three legs?” | Model agrees with the surface form
Set operation | “Is A a subset of B?” | Model lacks membership grounding
Existence | “Has anyone done X?” | Model overgeneralizes a single example
Causality | “Does A cause B?” | Model confuses correlation with cause

Concretely: a research-agent team running on traceAI-openai-agents evaluates a candidate model swap. They load StrategyQA into a versioned Dataset, run the answer check and reasoning rubric, and produce a comparison report against the prior production model. Final accuracy is unchanged at 91%, but the temporal-reasoning slice drops 4 points, surfacing that the new model gets the right answer through worse reasoning. The team gates the swap on this regression signal, not just final-answer accuracy.
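The per-slice comparison in that scenario can be sketched in plain Python, independent of any platform. The function names, the `min_drop` threshold, and the toy records below are assumptions made for illustration; they do not reproduce FutureAGI's report format or the numbers in the example above.

```python
from collections import defaultdict


def slice_accuracy(records):
    """records: iterable of (strategy_type, is_correct) -> {strategy_type: accuracy}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for strategy, correct in records:
        totals[strategy] += 1
        hits[strategy] += int(correct)
    return {s: hits[s] / totals[s] for s in totals}


def regressed_slices(baseline, candidate, min_drop=0.02):
    """Return {strategy_type: drop} for slices whose accuracy fell by at least min_drop."""
    base, cand = slice_accuracy(baseline), slice_accuracy(candidate)
    return {
        s: round(base[s] - cand[s], 4)
        for s in base
        if s in cand and base[s] - cand[s] >= min_drop
    }


# Toy data: both models score 17/20 overall, but the slices move in opposite directions.
baseline = (
    [("temporal", True)] * 9 + [("temporal", False)] * 1
    + [("comparison", True)] * 8 + [("comparison", False)] * 2
)
candidate = (
    [("temporal", True)] * 8 + [("temporal", False)] * 2
    + [("comparison", True)] * 9 + [("comparison", False)] * 1
)

print(regressed_slices(baseline, candidate))
```

Because the aggregate is identical for both toy models, only the sliced view exposes the temporal regression, which is the signal a release gate would key on.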

Unlike the original HELM-style harness that scored StrategyQA as a single aggregate against a static gold split, the FutureAGI workflow treats the dataset as a living set of cohorts: each strategy type is its own slice, and the dataset can be extended with team-authored questions in the same shape (yes/no, decomposition, evidence). That extension is the practical way to keep StrategyQA useful after the public split saturates.

For agent settings, StrategyQA-style multi-hop questions are wrapped as test cases; the agent’s full trajectory (every state transition between tool calls) is captured and scored with a trajectory rubric.

How to measure or detect it

  • Binary answer check (CustomEvaluation): returns yes/no accuracy against the StrategyQA gold answer.
  • Reasoning rubric (CustomEvaluation): scores whether the chain of intermediate inferences is supported and complete.
  • Faithfulness: for chain-of-thought traces grounded against retrieved Wikipedia evidence.
  • TaskCompletion: when StrategyQA is folded into an agent trajectory.
  • Per-strategy slice accuracy (dashboard signal): accuracy bucketed by strategy type surfaces which sub-skill regressed.
  • Self-consistency vote rate: percentage of N samples that agree on the final yes/no; a low vote rate hints at reasoning instability.

from fi.evals import CustomEvaluation, Faithfulness

# Binary check: 1 if the final yes/no answer matches the gold label.
answer_check = CustomEvaluation(
    name="strategyqa_binary",
    rubric="Score 1 if final answer matches yes/no gold, else 0.",
)
# Rubric-scored check on the chain of thought.
reasoning_check = CustomEvaluation(
    name="strategyqa_reasoning",
    rubric=(
        "Score 1-5 on whether each step in the chain-of-thought is supported "
        "by world knowledge and follows from the previous step."
    ),
)
faith = Faithfulness()

question = "Did Aristotle use a laptop?"
chain_of_thought = (
    "Aristotle lived around 384-322 BC. Laptops were invented in the 1980s. "
    "Therefore no."
)
# Placeholder evidence for illustration; in practice, pass the Wikipedia
# paragraphs attached to each sub-question in the dataset.
wikipedia_evidence = (
    "Aristotle (384-322 BC) was an ancient Greek philosopher. "
    "The first laptop computers appeared in the early 1980s."
)

a = answer_check.evaluate(output="No", expected_response="No")
b = reasoning_check.evaluate(input=question, output=chain_of_thought)
# Score the reasoning text itself (not the result object b) against the evidence.
f = faith.evaluate(output=chain_of_thought, context=wikipedia_evidence)
print(a.score, b.score, f.score)
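The self-consistency vote rate listed among the signals above needs no evaluator at all; a minimal sketch, assuming the N sampled final answers have already been collected as strings (the `vote_rate` helper and the sample list are hypothetical):

```python
from collections import Counter


def vote_rate(samples):
    """Return (majority_answer, share of samples that agree with it)."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    # Normalize case and whitespace so "No", "no " and "NO" count as the same vote.
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


samples = ["No", "no", "No ", "Yes", "no"]
answer, rate = vote_rate(samples)
print(answer, rate)
```

A question where the rate sits near 0.5 is one where the model's reasoning path is unstable, even if the majority answer happens to match the gold label.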

Common mistakes

  • Reporting only final-answer accuracy. Two models can hit 91% with very different reasoning quality; the one with weaker reasoning will degrade on the next domain.
  • Mixing StrategyQA with HLE or MMLU into a single average. They measure different skills; aggregate drowns the signal.
  • Skipping per-strategy slicing. Temporal-reasoning regressions hide behind unchanged comparison accuracy.
  • Letting the judge model and the candidate share a base. Self-evaluation of chain-of-thought inflates reasoning scores; pin the judge to a different family.
  • Using the dataset as a fine-tuning corpus and an eval set. Contamination invalidates the leaderboard score; hold out the test split.
  • Treating StrategyQA as a frontier discriminator in 2026. It saturated; use it for regression, not for picking between frontier models.
  • Skipping the evidence-grounded variant. StrategyQA ships with Wikipedia evidence per sub-question; evaluating the model’s reasoning against that evidence is a stronger signal than just the binary answer.
  • Reporting StrategyQA in isolation. It works best as one of three reasoning probes (StrategyQA + GPQA Diamond + τ-bench); a single number is rarely actionable on its own.
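The contamination point in the list above can be checked mechanically before a run. The sketch below is a deliberately simple overlap test on normalized question text; the function names and example questions are hypothetical, and it only catches near-verbatim copies, so paraphrased leakage would need a fuzzier comparison.

```python
import re


def normalize(question: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial edits still match."""
    return re.sub(r"[^a-z0-9]+", " ", question.lower()).strip()


def contaminated(train_questions, eval_questions):
    """Return eval questions whose normalized text also appears in the training corpus."""
    seen = {normalize(q) for q in train_questions}
    return [q for q in eval_questions if normalize(q) in seen]


train = ["Did Aristotle use a laptop?", "Is the Moon larger than Mars?"]
held_out = ["did aristotle use a laptop", "Could Caesar have used WhatsApp?"]

print(contaminated(train, held_out))
```

Any non-empty result means the held-out split is no longer held out, and scores on those items should be excluded from the report.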

Frequently Asked Questions

What is StrategyQA?

StrategyQA is a question-answering benchmark with yes/no questions that require implicit multi-step reasoning. Each question is paired with a decomposition into intermediate steps and Wikipedia evidence for each step.

How is StrategyQA different from MMLU or HellaSwag?

MMLU tests knowledge recall across many domains. HellaSwag tests commonsense completion. StrategyQA tests whether the model can plan and decompose an implicit reasoning chain that the question itself never spells out.

How do you run StrategyQA evaluations on your model?

Load the StrategyQA dataset into FutureAGI's Dataset object, run Dataset.add_evaluation with a binary answer check on the yes/no answer, and pair a CustomEvaluation rubric for reasoning quality on the chain-of-thought trace.