What Is ARC-AGI?
ARC-AGI is a benchmark that tests abstract reasoning by asking AI systems to solve novel grid-transformation tasks from few examples.
ARC-AGI is a benchmark for measuring abstract reasoning and skill-acquisition efficiency in AI systems. It belongs to the evaluation family and appears in benchmark suites, model-selection studies, agent research, and regression pipelines where teams test whether a system can infer grid-transformation rules from a few examples. Unlike factual QA benchmarks, ARC-AGI stresses out-of-distribution generalization: the model must form a new rule, apply it, and produce the correct output grid. FutureAGI treats it as conceptual benchmark evidence, not a dedicated product evaluator.
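To make the task format concrete, here is a minimal sketch in Python. Public ARC tasks are distributed as JSON with `train` and `test` lists of input/output grid pairs, where each grid is a list of rows of integers from 0 to 9. The toy task and the three candidate rules below are invented for illustration; real tasks need a far richer hypothesis space than a fixed list of transformations.

```python
from typing import Callable, Optional

Grid = list[list[int]]

# Toy ARC-style task (invented for illustration): the hidden rule mirrors each row.
task = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[3, 4], [0, 5]], "output": [[4, 3], [5, 0]]},
    ],
    "test": [{"input": [[7, 0, 8]], "output": [[8, 0, 7]]}],
}

def flip_vertical(grid: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [list(row) for row in reversed(grid)]

def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]

def flip_horizontal(grid: Grid) -> Grid:
    """Reverse each row."""
    return [list(reversed(row)) for row in grid]

# Hypothetical, deliberately tiny hypothesis space.
CANDIDATE_RULES: dict[str, Callable[[Grid], Grid]] = {
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "flip_horizontal": flip_horizontal,
}

def induce_rule(train_pairs: list[dict]) -> Optional[str]:
    """Return the first candidate rule consistent with every training pair."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(pair["input"]) == pair["output"] for pair in train_pairs):
            return name
    return None

rule_name = induce_rule(task["train"])
test_case = task["test"][0]
predicted = CANDIDATE_RULES[rule_name](test_case["input"]) if rule_name else None
solved = predicted == test_case["output"]  # exact grid match, no partial credit
print(rule_name, predicted, solved)
```

The point of the sketch is the shape of the problem: form a rule from the training pairs, apply it to the held-out input, and score by exact match on the output grid.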
The 2026 reality: ARC-AGI 1 (original, public) is saturated. ARC-AGI 2 (private holdout, harder transformations, smaller training sample, paid prize pool) is the live frontier benchmark. As of May 2026, the top public ARC-AGI 2 scores sit around 30-40% for frontier systems (Claude Opus 4.7, GPT-5.x, Gemini 3 Ultra plus chain-of-thought scaffolding), with substantial headroom remaining. It is one of the few benchmarks where frontier progress is still measurable month over month.
Why ARC-AGI Matters in Production LLM and Agent Systems
ARC-AGI matters because many production systems appear to reason well until they meet a truly novel case. A support agent may answer common refund questions well, then fail when a customer combines policy, billing state, and tool constraints in a pattern missing from the eval set. A coding agent may pass common unit-test repair tasks, then loop on a new abstraction because it cannot infer the rule behind the failing behavior.
Ignoring ARC-AGI-style evaluation creates two failure modes:
- Overfitting to static knowledge tests. Teams ship models that recall facts but do not adapt.
- Rewarding unlimited test-time search. A system “solves” a task only by spending too many tokens, tool calls, or retries for production use.
The pain lands on developers, SREs, and product teams:
- Developers see brittle prompt fixes that improve one task and break another.
- SREs see p99 latency and token-cost-per-trace rise when agents keep trying alternate hypotheses.
- Product teams see erratic user outcomes: correct answers on familiar flows, then confident nonsense on edge cases.
In traces, symptoms include repeated plan revisions, high agent.trajectory.step counts, rising llm.token_count.prompt, and low pass rate on tasks with small input changes.
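The symptoms above can be turned into a simple trace filter. This is a sketch only: the one-dict-per-trace shape, the `trace_id` key, and the threshold values are assumptions for illustration, not a FutureAGI or traceAI schema. Only the attribute names `agent.trajectory.step` and `llm.token_count.prompt` come from the text.

```python
def flag_suspect_traces(traces, max_steps=12, max_prompt_tokens=20_000):
    """Return (trace_id, reasons) for traces that exceed either threshold.

    Thresholds are illustrative defaults; tune them against your own baseline.
    """
    flagged = []
    for trace in traces:
        reasons = []
        if trace.get("agent.trajectory.step", 0) > max_steps:
            reasons.append("steps")
        if trace.get("llm.token_count.prompt", 0) > max_prompt_tokens:
            reasons.append("prompt_tokens")
        if reasons:
            flagged.append((trace["trace_id"], reasons))
    return flagged

# Hypothetical trace summaries, one dict per trace.
traces = [
    {"trace_id": "t1", "agent.trajectory.step": 4, "llm.token_count.prompt": 3_000},
    {"trace_id": "t2", "agent.trajectory.step": 31, "llm.token_count.prompt": 5_000},
    {"trace_id": "t3", "agent.trajectory.step": 9, "llm.token_count.prompt": 48_000},
]

flagged = flag_suspect_traces(traces)
print(flagged)
```

Flagged traces are candidates for manual review; a high step count alone does not prove the agent was looping.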
For 2026-era multi-step systems, ARC-AGI is useful as a stress test for abstraction, not as a direct proxy for customer value. Unlike MMLU (saturated, contaminated), GSM8K (saturated), or HumanEval (saturated, contaminated), ARC-AGI 2 asks whether a system can acquire a new skill from tiny evidence, then apply it under tight constraints.
How FutureAGI Handles ARC-AGI
FutureAGI has no dedicated anchor for ARC-AGI: there is no ARC-AGI-specific fi.evals class or dedicated product surface. Instead, FutureAGI treats ARC-AGI as conceptual benchmark evidence and translates the same reliability question into product evals: can the model infer the task rule, complete the output, and do so within trace and cost limits?
A practical workflow starts with a Dataset of ARC-style rows. Each row stores training input-output grids, the held-out input grid, the expected output grid, model route, prompt version, and a structured predicted_grid. The team attaches the appropriate evaluator stack:
| Evaluator | What it checks | Use |
|---|---|---|
| GroundTruthMatch | Exact output-grid match | Pass/fail gate |
| TaskCompletion | Agent finished the transformation | End-to-end signal |
| CustomEvaluation | Per-rule rubric (e.g., “applies symmetry”) | Diagnostic |
| Cost-per-solved-task | Tokens spent on success | Efficiency gate |
| Retry rate | Failed-then-recovered attempts | Brittleness signal |
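As a rough illustration of the row layout and the two efficiency signals in the table, the sketch below models a row as a plain dataclass. The class name `ArcRow` and the `tokens_used` and `attempts` fields are hypothetical, not part of any FutureAGI SDK. The metric definitions are also assumptions: cost-per-solved-task is taken as total tokens (including failed rows) divided by solved tasks, and retry rate as the share of solved tasks that needed more than one attempt.

```python
from dataclasses import dataclass
from typing import Optional

Grid = list[list[int]]

@dataclass
class ArcRow:
    train_pairs: list[tuple[Grid, Grid]]  # demonstration input/output grids
    test_input: Grid                      # held-out input grid
    expected_grid: Grid
    predicted_grid: Optional[Grid]        # None when the model produced no grid
    model_route: str
    prompt_version: str
    tokens_used: int                      # hypothetical bookkeeping field
    attempts: int                         # hypothetical bookkeeping field

    @property
    def solved(self) -> bool:
        return self.predicted_grid == self.expected_grid

def cost_per_solved_task(rows: list[ArcRow]) -> Optional[float]:
    """Total tokens across all rows divided by the number of solved rows."""
    solved = sum(row.solved for row in rows)
    if solved == 0:
        return None
    return sum(row.tokens_used for row in rows) / solved

def retry_rate(rows: list[ArcRow]) -> Optional[float]:
    """Share of solved rows that needed more than one attempt."""
    solved_rows = [row for row in rows if row.solved]
    if not solved_rows:
        return None
    return sum(row.attempts > 1 for row in solved_rows) / len(solved_rows)

pair = ([[1, 0]], [[0, 1]])
rows = [
    ArcRow([pair], [[2, 0]], [[0, 2]], [[0, 2]], "route-a", "v1", 1000, 1),
    ArcRow([pair], [[3, 0]], [[0, 3]], [[0, 3]], "route-a", "v1", 3000, 3),
    ArcRow([pair], [[4, 0]], [[0, 4]], [[4, 0]], "route-a", "v1", 2000, 2),
]

cost = cost_per_solved_task(rows)
retries = retry_rate(rows)
print(cost, retries)
```

Counting tokens from failed rows in the numerator makes brute-force search visible in the metric; dividing only the tokens of solved rows would hide it.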
If the system uses a LangChain or custom agent harness, traceAI-langchain records spans with agent.trajectory.step, llm.token_count.prompt, model name, tool calls, latency, and retry count.
The engineer then acts on disagreement patterns:
- Exact accuracy drops while custom rubric stays high → the final grid serializer or schema is broken.
- TaskCompletion improves only when token cost triples → the system relies on brute-force search rather than efficient rule induction.
- Newer model solves more tasks but creates longer traces → set a release gate on arc_task_pass_rate, cost-per-solved-task, and p99 latency before routing production traffic.
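A release gate over those three signals might look like the sketch below. The result-dict keys (`solved`, `tokens`, `latency_s`), the threshold values, and the nearest-rank percentile are all assumptions chosen for illustration; substitute whatever your pipeline already records.

```python
import math

def p99(values: list[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

def release_gate(results, min_pass_rate, max_cost_per_solved, max_p99_latency_s):
    """Return (ok, checks), where checks maps each gate name to pass/fail."""
    if not results:
        raise ValueError("no benchmark results to gate on")
    solved = sum(1 for r in results if r["solved"])
    pass_rate = solved / len(results)
    # With nothing solved, cost per solved task is unbounded and fails the gate.
    cost = sum(r["tokens"] for r in results) / solved if solved else math.inf
    latency = p99([r["latency_s"] for r in results])
    checks = {
        "arc_task_pass_rate": pass_rate >= min_pass_rate,
        "cost_per_solved_task": cost <= max_cost_per_solved,
        "p99_latency": latency <= max_p99_latency_s,
    }
    return all(checks.values()), checks

# Hypothetical candidate-model run: good accuracy, one very slow task.
candidate = [
    {"solved": True, "tokens": 1000, "latency_s": 2.0},
    {"solved": True, "tokens": 1200, "latency_s": 2.5},
    {"solved": True, "tokens": 900, "latency_s": 1.8},
    {"solved": False, "tokens": 5000, "latency_s": 30.0},
]

ok, checks = release_gate(
    candidate, min_pass_rate=0.7, max_cost_per_solved=3000, max_p99_latency_s=20.0
)
print(ok, checks)
```

Here the candidate clears the accuracy and cost gates but is blocked on tail latency, which is the situation the last bullet describes.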
Unlike a public leaderboard that reports a single score, FutureAGI keeps benchmark attempts tied to evaluator reasons and traces. That makes ARC-AGI useful as a diagnostic lens instead of a trophy metric. In our 2026 evals, cost-per-solved-task is the metric that most cleanly differentiates models on ARC-AGI 2: Gemini 3 Ultra solves a similar fraction of tasks to Claude Opus 4.7 but at roughly 1.5x the token cost on long-grid problems.
How to Measure or Detect ARC-AGI Performance
Measure ARC-AGI as a held-out benchmark with strict task-level scoring, then add diagnostic metrics:
- Exact task accuracy: count a task correct only when the generated output grid matches the expected grid.
- GroundTruthMatch: returns whether the predicted grid matches the reference output; use it as the pass/fail gate.
- TaskCompletion: checks whether the agent completed the requested transformation instead of stopping early or returning analysis only.
- CustomEvaluation: encodes rule-specific rubrics where intermediate reasoning is logged.
- Trace efficiency: monitor agent.trajectory.step, llm.token_count.prompt, retries, p99 latency, and token-cost-per-solved-task.
- Cohort drift: segment by grid size, transformation type, prompt version, and model route to avoid hiding brittle behavior inside an aggregate score.
Minimal pairing snippet:
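```python
# Assumes predicted_grid, expected_grid, arc_prompt, and trace_spans
# are already defined by the surrounding evaluation harness.
from fi.evals import GroundTruthMatch, TaskCompletion

match = GroundTruthMatch()
task = TaskCompletion()

# Pass/fail gate: does the predicted grid equal the reference grid?
match_result = match.evaluate(
    response=predicted_grid,
    expected_response=expected_grid,
)

# End-to-end signal: did the agent finish the requested transformation?
task_result = task.evaluate(
    input=arc_prompt,
    trajectory=trace_spans,
)

print(match_result.score, task_result.score)
```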
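The cohort-drift point from the list above can be checked without any SDK. The sketch below groups rows by an arbitrary key and computes exact-match pass rate per cohort; the row dicts and the `grid_size` labels are hypothetical.

```python
from collections import defaultdict

def pass_rate_by_cohort(rows: list[dict], key: str) -> dict[str, float]:
    """Exact-match pass rate per cohort, where the cohort is rows[i][key]."""
    totals: dict[str, int] = defaultdict(int)
    solved: dict[str, int] = defaultdict(int)
    for row in rows:
        cohort = row[key]
        totals[cohort] += 1
        if row["predicted_grid"] == row["expected_grid"]:
            solved[cohort] += 1
    return {cohort: solved[cohort] / totals[cohort] for cohort in totals}

# Hypothetical results: small grids are all solved, large grids only half.
rows = [
    {"grid_size": "small", "predicted_grid": [[1]], "expected_grid": [[1]]},
    {"grid_size": "small", "predicted_grid": [[2]], "expected_grid": [[2]]},
    {"grid_size": "large", "predicted_grid": [[3, 3]], "expected_grid": [[3, 3]]},
    {"grid_size": "large", "predicted_grid": [[0, 4]], "expected_grid": [[4, 0]]},
]

overall = sum(r["predicted_grid"] == r["expected_grid"] for r in rows) / len(rows)
by_size = pass_rate_by_cohort(rows, "grid_size")
print(overall, by_size)
```

The same function works for transformation type, prompt version, or model route by changing `key`. An aggregate of 0.75 here hides that large grids pass only half the time.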
Treat a high ARC-AGI score as one signal. It should agree with trace efficiency, regression stability, and product-specific evals before it influences a release decision.
Common Mistakes
- Treating ARC-AGI as an AGI certificate. A high benchmark score is evidence about abstraction, not proof of general production reliability.
- Comparing ARC-AGI 1 and ARC-AGI 2 scores. They are different benchmarks; ARC-AGI 1 is saturated, ARC-AGI 2 has a private holdout. Never mix the numbers.
- Rewarding unlimited search. Accuracy without cost, latency, and attempt limits can hide systems that are unusable in production.
- Converting grids to verbose text too early. Text serialization can add parser failures that are separate from the reasoning task.
- Ignoring failed trajectories. Final wrong grids are less useful than the span where the agent chose a bad rule and never recovered.
- Skipping efficiency gates. Two models with the same pass rate but 3x cost difference are not equivalent production options.
- Stopping at the benchmark. Pair ARC-AGI with your own golden dataset.
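One way to keep serialization problems separate from reasoning failures, as the grid-to-text item above warns, is to classify each raw model output before scoring it. The sketch assumes the model is asked to return the grid as a JSON array of arrays; the function names and outcome labels are hypothetical.

```python
import json
from typing import Optional

Grid = list[list[int]]

def parse_grid(raw: str) -> Optional[Grid]:
    """Parse raw text as a non-empty rectangular grid of ints 0..9, else None."""
    try:
        grid = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(grid, list) or not grid:
        return None
    if not all(isinstance(row, list) and row for row in grid):
        return None
    width = len(grid[0])
    for row in grid:
        if len(row) != width:
            return None  # ragged rows are a format error, not a wrong answer
        for cell in row:
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(cell, bool) or not isinstance(cell, int) or not 0 <= cell <= 9:
                return None
    return grid

def classify(raw: str, expected: Grid) -> str:
    """Label an attempt as 'parse_error', 'wrong_grid', or 'correct'."""
    grid = parse_grid(raw)
    if grid is None:
        return "parse_error"
    return "correct" if grid == expected else "wrong_grid"

expected = [[1, 2], [3, 4]]
outcomes = [
    classify("[[1, 2], [3, 4]]", expected),
    classify("[[1, 2], [4, 3]]", expected),
    classify("The answer is [[1, 2], [3, 4]]", expected),
    classify("[[1, 2], [3]]", expected),
]
print(outcomes)
```

Tracking `parse_error` separately from `wrong_grid` shows whether a drop in exact accuracy comes from the output format or from the rule the model inferred.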
Frequently Asked Questions
What is ARC-AGI?
ARC-AGI is an evaluation benchmark for abstract reasoning and skill-acquisition efficiency. It asks AI systems to infer hidden grid-transformation rules from a few examples, then apply those rules to novel test inputs. ARC-AGI 2 is the 2026 frontier version.
How is ARC-AGI different from MMLU?
MMLU primarily measures knowledge and exam-style reasoning across known domains, and is saturated in 2026. ARC-AGI is designed around novel visual tasks, so memorized facts and web-scale pretraining help less than rule induction and efficient adaptation.
How do you measure ARC-AGI?
Measure exact task accuracy on held-out grid outputs, then pair the score with FutureAGI evaluators such as GroundTruthMatch and TaskCompletion. Trace fields like agent.trajectory.step and llm.token_count.prompt show how costly each solved task was.