What Is ARC-AGI-2?
The second-generation Abstraction and Reasoning Corpus benchmark, measuring an AI system's ability to infer novel abstract rules from a handful of grid examples.
What Is ARC-AGI-2?
ARC-AGI-2 is the second generation of the Abstraction and Reasoning Corpus, a benchmark created to measure fluid reasoning in AI systems. Each task gives the model a handful of input-output grid examples and asks it to predict the output grid for a new input: the model has to infer the underlying rule, then apply it. Unlike pretraining-saturated benchmarks, ARC-AGI-2 tasks are designed to be novel to every solver, so memorization does not help. In 2026 it is the de facto public reasoning proxy for frontier models and the benchmark behind the ARC Prize.
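To make the task shape concrete, here is a minimal sketch of an ARC-style task in the grid-as-nested-lists convention used by the public ARC releases; the toy rule (a horizontal mirror) and the cell values are invented for illustration.
# Illustrative only: an ARC-style task as nested lists (cell values 0-9).
arc_task = {
    "train": [  # few-shot demonstrations of the hidden rule
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [  # the model sees only "input" and must produce "output"
        {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]},
    ],
}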
Why It Matters in Production LLM and Agent Systems
Most benchmarks reward models that have seen similar problems during pretraining. ARC-AGI-2 deliberately doesn’t; that’s the point. For a production team, the question is not “does our model score 85% on MMLU?” but “does it actually reason when the user’s request doesn’t pattern-match anything it saw in training?” ARC-AGI-2 is the cheapest public proxy for that question.
The pain shows up where reasoning load is high and pretraining coverage is thin. A research-assistant agent breaks on a novel multi-step instruction that combines two operations the model has seen separately but never together. A workflow agent fails when the user describes a custom transform in their own vocabulary. A planner subagent confidently generates a 7-step plan whose step 3 contradicts step 1, a logic error a strong reasoner would have caught.
In 2026-era agent stacks, ARC-AGI-2 scores correlate with how well a model handles real planning under novelty. Frontier models like GPT-5, Claude 4.x, and Gemini 3 publish ARC-AGI-2 scores in their release notes, and the gap between models on this benchmark predicts production-agent quality more reliably than HumanEval or MMLU does. Teams shipping agents that touch business-specific workflows should track ARC-AGI-2 movement when picking a base model — it surfaces reasoning regressions that knowledge benchmarks miss.
How FutureAGI Handles ARC-AGI-2
FutureAGI does not run the public ARC-AGI-2 leaderboard, but the same machinery applies inside production. At dataset level, an engineering team loads ARC-style grid tasks into a Dataset, with each row carrying the few-shot examples, the test input, and the gold output grid. At evaluation level, an exact-match evaluator scores whether the model’s output grid matches the gold grid cell-for-cell, and a ReasoningQuality evaluator scores whether the chain-of-thought leading to the answer is logically valid. At regression level, Dataset.add_evaluation() versions the result so when a team swaps from gpt-4o to gpt-5-mini they can see whether ARC-tier accuracy moved before shipping.
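As a rough sketch of the data shape (not the exact FutureAGI SDK surface, which varies by version), each row bundles the demonstrations, the test input, and the gold grid, and cell-for-cell exact match reduces to an equality check on the parsed grids. The field names below are assumptions for illustration, not the FutureAGI Dataset schema.
# Illustrative row shape and exact-match check; field names are assumptions.
row = {
    "train_pairs": arc_task["train"],           # reuses the sketch above
    "test_input": arc_task["test"][0]["input"],
    "gold_grid": arc_task["test"][0]["output"],
}

def exact_match(predicted, gold):
    # Cell-for-cell equality: same dimensions, same value in every cell.
    return predicted == gold  # nested lists compare element-wise in Python

print(exact_match([[0, 3], [3, 0]], row["gold_grid"]))  # True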
Concretely: a research team building an analytical agent on traceAI-langgraph constructs an internal “novel-reasoning” eval cohort modeled after ARC-AGI-2 — small grid puzzles, custom-rule transforms — and runs it nightly via FutureAGI’s evaluation pipeline. When their planner step regresses 9 points after a model swap, the trace view shows that the model started picking obvious-but-wrong rules without exploring alternatives, and the fix routes hard problems through a chain-of-thought-prompted variant via the Agent Command Center. ARC-AGI-2 isn’t the only signal, but it is the canary that fires before user complaints do.
How to Measure or Detect It
Useful signals when running ARC-AGI-2-style evaluation (a small scoring sketch follows the list):
- Exact-match accuracy: fraction of test grids reproduced exactly; the canonical ARC score.
- fi.evals.ReasoningQuality: returns 0-1 plus a reason for whether the chain-of-thought is logically valid given the few-shot examples.
- fi.evals.TaskCompletion: scores end-to-end whether the model arrived at the right grid given the rule it was asked to infer.
- Strict-format compliance: fraction of outputs that even parse as a grid before correctness is checked; a collapsing strict-format rate is the first warning.
- Per-puzzle latency p99: hard ARC tasks blow up token usage; track latency and cost per puzzle as part of the eval.
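A minimal sketch of how these signals might be aggregated over a batch of eval records; the record fields (raw_output, gold_grid, latency_s) and the whitespace-separated grid format are assumptions for illustration, not a fixed schema.
import math

def parse_grid(text):
    # Strict-format check: the output must parse into a rectangular grid of ints.
    try:
        grid = [[int(c) for c in line.split()] for line in text.strip().splitlines()]
    except ValueError:
        return None
    return grid if grid and len({len(r) for r in grid}) == 1 else None

def summarize(records):
    parsed = [(r, parse_grid(r["raw_output"])) for r in records]
    strict_format = sum(g is not None for _, g in parsed) / len(records)
    exact = sum(g == r["gold_grid"] for r, g in parsed) / len(records)
    latencies = sorted(r["latency_s"] for r in records)
    p99 = latencies[min(len(latencies) - 1, math.ceil(0.99 * len(latencies)) - 1)]
    return {"exact_match": exact, "strict_format": strict_format, "latency_p99_s": p99}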
Minimal Python, using the FutureAGI evaluators:
from fi.evals import ReasoningQuality, TaskCompletion

# arc_task_prompt holds the few-shot examples plus the test input;
# model_grid is the model's answer and gold is the expected output grid.
reasoning = ReasoningQuality()
task = TaskCompletion()

result = task.evaluate(
    input=arc_task_prompt,
    output=model_grid,
    context={"gold_grid": gold},
)
print(result.score, result.reason)
# ReasoningQuality follows the same evaluate() pattern, scoring the
# chain-of-thought instead of the final grid.
Common Mistakes
- Treating ARC-AGI-2 as a single number. Different task families test different reasoning skills; aggregating hides which class of rule the model fails on.
- Confusing ARC-AGI-2 with the older AI2 ARC benchmark. They are unrelated — AI2 ARC is multiple-choice science questions; ARC-AGI is grid puzzles.
- Letting the model see the test grid in the few-shot block. Easy to do by accident when constructing prompts (see the prompt-builder sketch after this list); ARC scores inflate to nonsense.
- Optimizing on the public set. The public ARC-AGI-2 set is small; if you train or prompt-tune against it, you have a memorized leaderboard score, not a reasoning measurement.
- Reading ARC-AGI-2 in isolation. It is one signal in a portfolio. Pair it with a domain-specific reasoning eval that reflects your users’ actual tasks.
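One way to guard against the test-grid leak called out above is to build the prompt only from the train pairs plus the bare test input, never the test output. A minimal sketch, with an invented formatting convention:
def format_grid(grid):
    return "\n".join(" ".join(str(c) for c in row) for row in grid)

def build_prompt(task):
    # Only train pairs contribute outputs; the test pair contributes its input alone.
    parts = []
    for i, pair in enumerate(task["train"], 1):
        parts.append(f"Example {i} input:\n{format_grid(pair['input'])}")
        parts.append(f"Example {i} output:\n{format_grid(pair['output'])}")
    parts.append(f"Test input:\n{format_grid(task['test'][0]['input'])}")
    parts.append("Test output:")
    return "\n\n".join(parts)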
Frequently Asked Questions
What is ARC-AGI-2?
ARC-AGI-2 is the second-generation Abstraction and Reasoning Corpus, a benchmark of grid-puzzle tasks designed to test fluid reasoning over novel rules rather than recall of pretraining data.
How is ARC-AGI-2 different from MMLU?
MMLU tests stored academic knowledge across 57 subjects; ARC-AGI-2 tests rule-induction on tasks the model has never seen, so memorization-heavy strategies fail. The two measure different abilities.
How do you measure ARC-AGI-2 results?
The canonical score is pass rate on the held-out evaluation set, where a task counts as passed only if every test grid is reproduced exactly. FutureAGI runs ARC-style tasks as a Dataset with an exact-match evaluator and tracks per-cohort accuracy.