Evaluation

What Is the HellaSwag Reasoning Benchmark?

A commonsense-reasoning benchmark of roughly 70K multiple-choice items with adversarially filtered distractors, in which the model must pick the most plausible continuation of a short scenario.

HellaSwag (Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) is a commonsense-reasoning benchmark introduced by Zellers et al. in 2019. Each item presents a short scenario (a sentence or two drawn from ActivityNet captions or WikiHow) followed by four candidate continuations, and the model must pick the most plausible one. Distractors come from an Adversarial Filtering pipeline that selects machine-generated endings that are grammatical and topically relevant yet incoherent as commonsense, so they fool models while humans reject them easily. The benchmark contains roughly 70,000 items split across train, validation, and test. By 2026, frontier LLMs exceed 95% accuracy, so HellaSwag now serves mainly as a regression probe.
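
The raw items are straightforward to inspect. A minimal sketch, assuming the Hugging Face datasets library and the public Rowan/hellaswag hub id, whose items carry ctx, endings, and label fields:

from datasets import load_dataset

# Load the labeled validation split (test labels are withheld).
ds = load_dataset("Rowan/hellaswag", split="validation")

item = ds[0]
print(item["ctx"])                 # the scenario prefix
for i, ending in enumerate(item["endings"]):
    print(f"  {i}) {ending}")      # four candidate continuations
print("gold:", item["label"])      # index of the correct ending, as a string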

Why It Matters in Production LLM and Agent Systems

HellaSwag’s value is not the headline accuracy; it is the diagnostic. Models that fail HellaSwag tend to fail any task that requires picking the more plausible of multiple plausible continuations — which describes a large fraction of conversational chat, ranking, judge-model rubrics, and multiple-choice tool selection. A model that drops 5 points on HellaSwag after a fine-tune is announcing that its commonsense prior moved.

The pain shows up in three places. Distillation teams find that a small student model passes single-step QA but loses 8 points on HellaSwag — the hard reasoning circuits are precisely the ones that distill poorly. Quantization teams find that aggressive 4-bit quantization sometimes preserves perplexity but degrades HellaSwag by 3–7 points, a silent commonsense regression. Persona/character-card fine-tunes sometimes shift the model’s commonsense prior in ways that show up first as HellaSwag regressions before customers notice “the assistant feels weird.”

For 2026 agent stacks, HellaSwag is most useful as a canary. An agent that loses commonsense ranking ability will fail subtly — picking the wrong tool when two are plausible, choosing the wrong document when several are relevant, or selecting an unsafe action because it ranks “more interesting” higher than “more correct.” HellaSwag is cheap to run and exposes that capability before customers do.

How FutureAGI Handles HellaSwag-Style Evaluation

FutureAGI does not redistribute HellaSwag — teams load it themselves from the public dataset. The standard pattern is to load the validation split into a versioned Dataset and call Dataset.add_evaluation(ExactMatch()) on the selected option index against ground truth, plus Dataset.add_evaluation(ReasoningQuality()) on the model’s chain-of-thought when present. The two scores together separate “the model gets the answer right” from “the model gets the answer right for the right reason.”
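
A minimal sketch of that wiring, assuming the fi SDK exposes the Dataset and evaluator names used above (the import path and constructor arguments are illustrative):

from fi.evals import ExactMatch, ReasoningQuality
from fi.datasets import Dataset   # illustrative import path

# Version the validation split so scores are comparable across releases.
hellaswag = Dataset(name="hellaswag-validation", version="v8")  # illustrative args

# "Did the model pick the gold option index?"
hellaswag.add_evaluation(ExactMatch())
# "Does the chain-of-thought actually support that pick?" (when CoT is enabled)
hellaswag.add_evaluation(ReasoningQuality())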

A real workflow: a fine-tuning team ships a candidate model. They run HellaSwag as one of seven reasoning probes (alongside CommonsenseQA, ARC, GSM8K, BBH, TruthfulQA, MMLU). The Dataset is versioned at v8. ExactMatch on the candidate is 0.943; the prior release was 0.951. Slicing by category (sports, cooking, work) shows the regression is concentrated in the cooking cohort — a 4-point drop. The team traces it to a system-prompt change that biased the model toward shorter completions and reverts it. Without HellaSwag wired into a regression eval, the same drop would have shipped silently and surfaced weeks later as user reports.
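
That slicing step is ordinary dataframe work. A sketch, assuming per-item results stored with release, category, and correct columns (the file path and release tags are illustrative):

import pandas as pd

# One row per item per release: category, correct (0/1), release tag.
results = pd.read_parquet("hellaswag_results.parquet")  # illustrative path

acc = (results.groupby(["release", "category"])["correct"]
              .mean()
              .unstack("release"))
acc["delta"] = acc["candidate"] - acc["prior"]   # illustrative release tags
print(acc.sort_values("delta").head())           # worst-regressing categories first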

FutureAGI’s approach is to treat HellaSwag as one lane in a broader reasoning eval surface. Unlike single-benchmark scoreboards, the dashboard shows accuracy alongside ReasoningQuality, sliced by category, and gated against the prior release’s score. The engineer can require Pareto improvement (no category regressing by more than 1 point) as a release condition.
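
The gate itself is then one comparison per category. A sketch continuing from the acc frame above, using the 1-point threshold named in the release condition:

# Pareto release gate: fail if any category regresses by more than 1 point.
THRESHOLD = -0.01  # 1 accuracy point, on a 0-1 scale
regressions = acc[acc["delta"] < THRESHOLD]

if not regressions.empty:
    raise SystemExit(f"Release blocked; regressing categories:\n{regressions}")
print("Pareto gate passed: no category regressed by more than 1 point.")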

How to Measure or Detect It

Score HellaSwag runs on accuracy and reasoning quality together:

  • fi.evals.ExactMatch — boolean correctness of the chosen option index against ground truth.
  • fi.evals.ReasoningQuality — 0–1 reasoning quality across the chain-of-thought trace, when CoT is enabled.
  • fi.evals.TaskCompletion — whether the model committed to and finished the multi-step decision (relevant when running CoT-augmented HellaSwag).
  • Category cohort slicing — accuracy by HellaSwag category (sports, cooking, work, home maintenance); regressions concentrate in specific categories.
  • Decoding-parameter slice — temperature and top-p; greedy decoding and high-temperature sampling produce different scores.
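
The two core evaluators run per item. The snippet below scores a single HellaSwag-style example against the gold option (the item text is illustrative):
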
from fi.evals import ExactMatch, ReasoningQuality

# A HellaSwag-style item: a short scenario plus four candidate endings.
problem = "A woman is outside with a bucket and a dog. The dog runs around in circles trying to avoid a bath. She..."
choices = ["A) puts the bucket on the dog's head", "B) gets soaked while bathing the dog",
           "C) sells the bucket", "D) flies away"]
# (a full harness would append these choices to the prompt)
answer = "B"  # the option the model under test selected

# ExactMatch: boolean correctness of the chosen option against ground truth.
print(ExactMatch().evaluate(input=problem, output=answer, expected="B"))
# ReasoningQuality: scores the chain-of-thought trace when CoT is enabled.
print(ReasoningQuality().evaluate(input=problem, output="The dog won't sit still..."))

Common Mistakes

  • Reporting only accuracy. A late-2026 model at 95% accuracy might still pick the right answer for the wrong reason; require ReasoningQuality alongside.
  • Skipping category slicing. Average accuracy hides a 4-point regression in one cohort.
  • Truncating max-tokens. CoT-augmented HellaSwag needs 256+ tokens; aggressive caps clip the reasoning.
  • Treating HellaSwag as a discriminator. It saturates near the ceiling for frontier models; pair with HarmBench, GPQA, or domain-specific evals for live discrimination.
  • Ignoring training-set contamination. HellaSwag has been in pretraining corpora for years; the test set may already be memorized in some models, inflating scores. A lightweight n-gram overlap check (sketched below) can flag suspect items.
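
A sketch of that overlap check, assuming a local plain-text sample of the pretraining corpus (corpus_sample.txt is an illustrative path, and the 13-token window follows common contamination-audit practice):

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    # All n-token windows in the text, lowercased for matching.
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

# Illustrative local dump of pretraining text.
with open("corpus_sample.txt") as f:
    corpus_grams = ngrams(f.read())

def contaminated(item_text: str) -> bool:
    # Flag the item if any 13-gram appears verbatim in the corpus sample.
    return bool(ngrams(item_text) & corpus_grams)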

Frequently Asked Questions

What is the HellaSwag benchmark?

HellaSwag is a benchmark of roughly 70,000 four-way multiple-choice commonsense-reasoning questions, where each item asks the model to pick the most plausible continuation of a short scenario from adversarially-generated distractors.

How is HellaSwag different from MMLU?

MMLU tests knowledge across 57 academic subjects with multiple-choice questions. HellaSwag tests commonsense reasoning about everyday physical and social situations and uses Adversarial Filtering to make distractors hard.

How do you evaluate HellaSwag in production workflows?

FutureAGI runs HellaSwag through Dataset.add_evaluation with ExactMatch on the chosen option index and ReasoningQuality on the model's chain-of-thought, so reasoning regressions surface even when accuracy looks flat.