What Is the StrategyQA Reasoning Benchmark?
An open-domain question-answering benchmark requiring implicit multi-step reasoning, with yes/no answers, decomposition steps, and per-step Wikipedia evidence.
StrategyQA, introduced by Geva et al. (2021), is an open-domain question-answering benchmark of 2,780 yes/no questions whose answers require multi-step reasoning over implicit strategies. Example: “Did Aristotle use a laptop?” requires reasoning that laptops were invented well after Aristotle lived. Each question carries a human-authored decomposition into atomic sub-questions plus Wikipedia evidence for each sub-question, allowing fine-grained scoring beyond final-answer accuracy. It is a canonical eval for chain-of-thought, self-consistency, and multi-hop RAG, and it surfaces a different failure mode than knowledge-recall benchmarks like MMLU.
Why It Matters in Production LLM and Agent Systems
A model that aces MMLU can still fail StrategyQA because the implicit-strategy step is not in the training distribution as a single fact. In production, this matters every time a user asks a question whose answer requires “reasoning across two facts the model knows separately” — comparison shopping, eligibility checks, retrieval-augmented synthesis, agent tasks with implicit subgoals.
The pain shows up across roles. A RAG team finds their pipeline answers single-fact lookups well but degrades on “given X, would Y also satisfy Z?” — StrategyQA-shaped questions. An ML engineer running a chain-of-thought variant sees accuracy improve on MMLU but actually drop on StrategyQA, indicating the CoT prompt is over-fitting to surface patterns. A product lead onboarding a new model finds the demo questions all pass but real user questions — which carry implicit comparisons — silently fail.
In 2026 agent stacks, StrategyQA-shaped reasoning is the default. A planner that decides to call tool A, then tool B, then synthesise — without being told to — is doing exactly what StrategyQA measures: implicit decomposition. Agent benchmarks like AgentBench and GAIA inherit StrategyQA’s emphasis on intermediate reasoning, and trajectory evaluators like MultiHopReasoning and ReasoningQuality are built to score the same property at the trace level.
How FutureAGI Handles StrategyQA Evaluation
FutureAGI’s approach is to treat StrategyQA as one entry in a regression-eval portfolio rather than a one-shot leaderboard run. The 2,780 questions load into a Dataset via Dataset.import_from_huggingface. Dataset.add_evaluation attaches GroundTruthMatch for yes/no accuracy, MultiHopReasoning for chain-of-thought quality, and ReasoningQuality for per-step inference soundness. The platform stores per-question scores so a release diff highlights which strategy types degraded — temporal reasoning, comparison, counting, set operations.
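A minimal sketch of that flow, assuming the Dataset class lives in fi.datasets, that import_from_huggingface takes a Hugging Face dataset id, and that add_evaluation accepts an evaluator instance; the module path, signatures, and dataset id here are assumptions, not confirmed API:

from fi.datasets import Dataset  # module path assumed
from fi.evals import GroundTruthMatch, MultiHopReasoning, ReasoningQuality

# Load the 2,780 StrategyQA questions (Hugging Face dataset id assumed).
strategyqa = Dataset.import_from_huggingface("wics/strategy-qa")

# Attach all three evaluators so per-question scores are stored for release diffs.
for evaluator in (GroundTruthMatch(), MultiHopReasoning(), ReasoningQuality()):
    strategyqa.add_evaluation(evaluator)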
Concretely: a research-agent team running on traceAI-openai-agents evaluates a candidate model swap. They load StrategyQA into Dataset v8, run all three evaluators, and produce a comparison report against the prior production model. Final accuracy is unchanged at 73%, but ReasoningQuality drops 4 points on temporal-reasoning questions — surfacing that the new model gets the right answer through worse reasoning. The team gates the swap on this regression signal, not just final-answer accuracy.
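That gating logic is simple to express. A toy version in plain Python, with hypothetical per-slice ReasoningQuality scores (the helper and threshold are illustrative, not a platform API):

# Block the swap when any per-slice score regresses beyond a tolerance,
# even if final-answer accuracy is flat.
TOLERANCE = 0.02

def passes_gate(prior: dict, candidate: dict) -> bool:
    return all(candidate[s] >= prior[s] - TOLERANCE for s in prior)

prior = {"temporal": 0.78, "comparison": 0.81}
candidate = {"temporal": 0.74, "comparison": 0.82}   # 4-point temporal drop
print(passes_gate(prior, candidate))                 # False: gate the swap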
For agent settings, the simulate-sdk’s Scenario wraps StrategyQA-style multi-hop questions as Persona test cases; the agent’s full trajectory — every state transition between tool calls — is captured and scored against TrajectoryScore plus ReasoningQuality.
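A rough sketch of that setup, with the caveat that only the names Scenario, Persona, TrajectoryScore, and ReasoningQuality come from the description above; the import paths, constructor arguments, and run/evaluate signatures are all assumptions:

from fi.simulate import Scenario, Persona               # import path assumed
from fi.evals import ReasoningQuality, TrajectoryScore  # TrajectoryScore location assumed

my_agent = ...  # placeholder for your traceAI-instrumented agent

scenario = Scenario(                                    # constructor arguments assumed
    name="strategyqa-multihop",
    persona=Persona(goal="asks yes/no questions needing implicit multi-step reasoning"),
    prompts=["Did Aristotle use a laptop?"],
)
run = scenario.run(agent=my_agent)                      # signature assumed

# Score the captured trajectory: every state transition between tool calls.
for evaluator in (TrajectoryScore(), ReasoningQuality()):
    evaluator.evaluate(trajectory=run.trajectory)       # signature assumed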
How to Measure or Detect It
- GroundTruthMatch: returns binary yes/no accuracy against the StrategyQA gold answer.
- MultiHopReasoning: scores whether the chain of intermediate inferences is supported and complete.
- ReasoningQuality: returns a 0–1 score for soundness of intermediate steps; complementary to MultiHopReasoning.
- Per-strategy slice accuracy (dashboard signal): accuracy bucketed by the strategy type — temporal, comparison, counting, set ops — surfaces which sub-skill regressed.
- Self-consistency vote rate: percentage of N samples that agree on the final yes/no; a low vote rate hints at reasoning instability. Both dashboard signals are sketched below.
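The two dashboard signals reduce to a few lines of plain Python. This sketch assumes a simple per-question result schema (strategy, predicted, gold), which is illustrative rather than a platform format:

from collections import Counter

def slice_accuracy(results):
    # `results`: list of dicts with 'strategy', 'predicted', 'gold' keys
    # (illustrative schema, not a platform format).
    by_strategy = {}
    for r in results:
        correct, total = by_strategy.get(r["strategy"], (0, 0))
        by_strategy[r["strategy"]] = (correct + int(r["predicted"] == r["gold"]), total + 1)
    return {s: c / t for s, (c, t) in by_strategy.items()}

def vote_rate(samples):
    # Share of N sampled answers agreeing with the majority yes/no.
    _, count = Counter(samples).most_common(1)[0]
    return count / len(samples)

print(slice_accuracy([
    {"strategy": "temporal", "predicted": "no", "gold": "no"},
    {"strategy": "temporal", "predicted": "yes", "gold": "no"},
]))                                                   # {'temporal': 0.5}
print(vote_rate(["yes", "yes", "no", "yes", "yes"]))  # 0.8

The fi.evals evaluators themselves are then invoked per question: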
from fi.evals import GroundTruthMatch, MultiHopReasoning

# Binary yes/no accuracy against the gold answer.
answer_check = GroundTruthMatch()
# Support and completeness of the chain of intermediate inferences.
hop_check = MultiHopReasoning()

result_a = answer_check.evaluate(output="No", expected_response="No")
result_b = hop_check.evaluate(
    input="Did Aristotle use a laptop?",
    output="Aristotle lived around 384-322 BC. Laptops were invented in the 1980s. Therefore no.",
)
print(result_a.score, result_b.score)
Common Mistakes
- Reporting only final-answer accuracy. Two models can hit 75% with very different reasoning quality — one will degrade on the next domain.
- Mixing StrategyQA with MMLU into a single average. They measure different skills; aggregate drowns the signal.
- Skipping per-strategy slicing. Temporal-reasoning regressions hide behind unchanged comparison accuracy.
- Letting the judge model and the candidate share a base. Self-evaluation of chain-of-thought inflates ReasoningQuality; pin the judge to a different family.
- Using the dataset as a fine-tuning corpus and an eval set. Contamination invalidates the leaderboard score; hold out the test split.
Frequently Asked Questions
What is StrategyQA?
StrategyQA is a question-answering benchmark with yes/no questions that require implicit multi-step reasoning. Each question is paired with a decomposition into intermediate steps and Wikipedia evidence for each step.
How is StrategyQA different from MMLU or HellaSwag?
MMLU tests knowledge recall across many domains. HellaSwag tests commonsense completion. StrategyQA tests whether the model can plan and decompose an implicit reasoning chain that the question itself never spells out.
How do you run StrategyQA evaluations on your model?
Load the StrategyQA dataset into FutureAGI's Dataset object, run Dataset.add_evaluation with GroundTruthMatch on the yes/no answer, and pair MultiHopReasoning with ReasoningQuality on the chain-of-thought trace.