What Is ARC-AGI?

ARC-AGI is a benchmark that tests abstract reasoning by asking AI systems to solve novel grid-transformation tasks from few examples.

ARC-AGI is a benchmark for measuring abstract reasoning and skill-acquisition efficiency in AI systems. It belongs to the evaluation family and appears in benchmark suites, model-selection studies, agent research, and regression pipelines where teams test whether a system can infer grid-transformation rules from a few examples. Unlike factual QA benchmarks, ARC-AGI stresses out-of-distribution generalization: the model must form a new rule, apply it, and produce the correct output grid. FutureAGI treats it as conceptual benchmark evidence, not a dedicated product evaluator.
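
To make the task shape concrete, here is a minimal sketch of an ARC-style task in Python. The grids, the "mirror each row" rule, and every function name are hypothetical illustrations, not actual ARC-AGI data or tooling; real tasks are far less obvious.

```python
from typing import Callable, List, Tuple

Grid = List[List[int]]  # each cell holds a small integer color code

# Hypothetical toy task: the hidden rule is "mirror each row left to right".
train_pairs: List[Tuple[Grid, Grid]] = [
    ([[1, 0, 0], [0, 2, 0]], [[0, 0, 1], [0, 2, 0]]),
    ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]]),
]
test_input: Grid = [[5, 0, 0], [0, 0, 6]]
expected_output: Grid = [[0, 0, 5], [6, 0, 0]]


def mirror_rows(grid: Grid) -> Grid:
    """Candidate rule: reverse every row."""
    return [list(reversed(row)) for row in grid]


def rule_fits_examples(rule: Callable[[Grid], Grid]) -> bool:
    """A candidate rule is plausible only if it reproduces every training pair."""
    return all(rule(inp) == out for inp, out in train_pairs)


def exact_match(predicted: Grid, expected: Grid) -> bool:
    """Task-level scoring: grid shape and every cell must match."""
    return predicted == expected


predicted = mirror_rows(test_input)
solved = rule_fits_examples(mirror_rows) and exact_match(predicted, expected_output)
```

The point of the sketch is the scoring contract: a nearly right grid still counts as a failed task.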

The 2026 reality: ARC-AGI 1 (original, public) is saturated. ARC-AGI 2 (private holdout, harder transformations, smaller training sample, paid prize pool) is the live frontier benchmark. As of May 2026 the top public ARC-AGI 2 scores sit around 30-40% for frontier systems (Claude Opus 4.7, GPT-5.x, Gemini 3 Ultra plus chain-of-thought scaffolding), with substantial headroom remaining. That makes it one of the few benchmarks where frontier progress is still measurable month over month.

Why ARC-AGI Matters in Production LLM and Agent Systems

ARC-AGI matters because many production systems look like they are reasoning until they meet a truly novel case. A support agent may answer common refund questions well, then fail when a customer combines policy, billing state, and tool constraints in a pattern missing from the eval set. A coding agent may pass common unit-test repair tasks, then loop on a new abstraction because it cannot infer the rule behind the failing behavior.

Ignoring ARC-AGI-style evaluation creates two failure modes:

  1. Overfitting to static knowledge tests. Teams ship models that recall facts but do not adapt.
  2. Rewarding unlimited test-time search. A system “solves” a task only by spending more tokens, tool calls, or retries than production use can afford.

The pain lands on developers, SREs, and product teams:

  • Developers see brittle prompt fixes that improve one task and break another.
  • SREs see p99 latency and token-cost-per-trace rise when agents keep trying alternate hypotheses.
  • Product teams see erratic user outcomes: correct answers on familiar flows, then confident nonsense on edge cases.

In traces, symptoms include repeated plan revisions, high agent.trajectory.step counts, rising llm.token_count.prompt, and low pass rate on tasks with small input changes.

For 2026-era multi-step systems, ARC-AGI is useful as a stress test for abstraction, not as a direct proxy for customer value. Unlike MMLU (saturated, contaminated), GSM8K (saturated), or HumanEval (saturated, contaminated), ARC-AGI 2 asks whether a system can acquire a new skill from tiny evidence, then apply it under tight constraints.

How FutureAGI Handles ARC-AGI

FutureAGI has no dedicated anchor for ARC-AGI: there is no ARC-AGI-specific fi.evals class or product surface. FutureAGI’s approach is to treat ARC-AGI as conceptual benchmark evidence, then translate the same reliability question into product evals: can the model infer the task rule, complete the output, and do so within trace and cost limits?

A practical workflow starts with a Dataset of ARC-style rows. Each row stores training input-output grids, the held-out input grid, the expected output grid, model route, prompt version, and a structured predicted_grid. The team attaches the appropriate evaluator stack:
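
One way to picture such a row is as a small typed record. The sketch below is an assumption for illustration only: the class name `ArcRow`, its field names, and the `is_well_formed` helper are hypothetical and are not a FutureAGI Dataset schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Grid = List[List[int]]


@dataclass
class ArcRow:
    """One ARC-style dataset row; field names are illustrative assumptions."""
    task_id: str
    train_pairs: List[Tuple[Grid, Grid]]   # training input-output grids
    test_input: Grid                       # held-out input grid
    expected_grid: Grid                    # expected output grid
    model_route: str
    prompt_version: str
    predicted_grid: Optional[Grid] = None  # structured prediction, None until the model runs

    def is_well_formed(self) -> bool:
        """True when the prediction is a non-empty rectangular grid of ints."""
        grid = self.predicted_grid
        if not grid or not all(isinstance(row, list) and row for row in grid):
            return False
        width = len(grid[0])
        return all(
            len(row) == width and all(isinstance(cell, int) for cell in row)
            for row in grid
        )


row = ArcRow(
    task_id="toy-001",
    train_pairs=[([[1, 0]], [[0, 1]])],
    test_input=[[2, 0]],
    expected_grid=[[0, 2]],
    model_route="model-a",
    prompt_version="v1",
    predicted_grid=[[0, 2]],
)
```

Keeping predicted_grid structured, rather than as free text, lets a schema check run before any evaluator does.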

Evaluator             | What it checks                               | Use
GroundTruthMatch      | Exact output-grid match                      | Pass/fail gate
TaskCompletion        | Agent finished the transformation            | End-to-end signal
CustomEvaluation      | Per-rule rubric (e.g., “applies symmetry”)   | Diagnostic
Cost-per-solved-task  | Tokens spent on success                      | Efficiency gate
Retry rate            | Failed-then-recovered attempts               | Brittleness signal
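
The last two rows are plain aggregates rather than evaluator classes. Here is a rough sketch of how they might be computed, under stated assumptions: the `Attempt` record and its fields are hypothetical, cost-per-solved-task is read as tokens spent on solved tasks divided by the number of solved tasks, and retry rate is read as the share of all tasks that failed at least once and then recovered. Other definitions (for example, charging failed tasks to the numerator) are equally defensible.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    """Hypothetical per-task record; field names are assumptions."""
    task_id: str
    solved: bool       # final grid matched exactly
    total_tokens: int  # tokens spent across all tries for this task
    tries: int         # 1 means the outcome was reached on the first try


def cost_per_solved_task(attempts: List[Attempt]) -> Optional[float]:
    """Mean tokens spent on solved tasks; None when nothing was solved."""
    solved = [a for a in attempts if a.solved]
    if not solved:
        return None
    return sum(a.total_tokens for a in solved) / len(solved)


def retry_rate(attempts: List[Attempt]) -> float:
    """Share of all tasks that were solved only after more than one try."""
    if not attempts:
        return 0.0
    recovered = sum(1 for a in attempts if a.solved and a.tries > 1)
    return recovered / len(attempts)


attempts = [
    Attempt("t1", solved=True, total_tokens=1000, tries=1),
    Attempt("t2", solved=True, total_tokens=3000, tries=3),
    Attempt("t3", solved=False, total_tokens=5000, tries=4),
    Attempt("t4", solved=True, total_tokens=2000, tries=1),
]
cost = cost_per_solved_task(attempts)  # (1000 + 3000 + 2000) / 3
retries = retry_rate(attempts)         # 1 recovered task out of 4
```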

If the system uses a LangChain or custom agent harness, traceAI-langchain records spans with agent.trajectory.step, llm.token_count.prompt, model name, tool calls, latency, and retry count.

The engineer then acts on disagreement patterns:

  • Exact accuracy drops while custom rubric stays high → the final grid serializer or schema is broken.
  • TaskCompletion improves only when token cost triples → the system relies on brute-force search rather than efficient rule induction.
  • Newer model solves more tasks but creates longer traces → set a release gate on arc_task_pass_rate, cost-per-solved-task, and p99 latency before routing production traffic.
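
The release gate in the last bullet could be sketched as a simple threshold check. Everything here is illustrative: the `RunSummary` and `Gate` types, the unit choices, and the threshold values are placeholders, not recommended limits or a FutureAGI feature.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunSummary:
    arc_task_pass_rate: float    # fraction of tasks solved, in [0, 1]
    cost_per_solved_task: float  # tokens per solved task
    p99_latency_s: float         # seconds (unit is an assumption)


@dataclass
class Gate:
    min_pass_rate: float
    max_cost_per_solved_task: float
    max_p99_latency_s: float


def gate_failures(candidate: RunSummary, gate: Gate) -> List[str]:
    """Return the names of the metrics that violate the gate; empty means pass."""
    failures = []
    if candidate.arc_task_pass_rate < gate.min_pass_rate:
        failures.append("arc_task_pass_rate")
    if candidate.cost_per_solved_task > gate.max_cost_per_solved_task:
        failures.append("cost_per_solved_task")
    if candidate.p99_latency_s > gate.max_p99_latency_s:
        failures.append("p99_latency_s")
    return failures


# Placeholder thresholds; choose real ones from your own baseline runs.
gate = Gate(min_pass_rate=0.30, max_cost_per_solved_task=6000, max_p99_latency_s=60)
candidate = RunSummary(arc_task_pass_rate=0.34, cost_per_solved_task=9000, p99_latency_s=42)
failed = gate_failures(candidate, gate)
```

In this example the candidate clears the accuracy and latency limits but is blocked on cost, which is the "solves more tasks but creates longer traces" case described above.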

Unlike a public leaderboard that reports a single score, FutureAGI keeps benchmark attempts tied to evaluator reasons and traces. That makes ARC-AGI useful as a diagnostic lens instead of a trophy metric. In our 2026 evals, cost-per-solved-task is the metric that differentiates models on ARC-AGI 2 most cleanly: Gemini 3 Ultra solves a similar fraction of tasks to Claude Opus 4.7, but at roughly 1.5x the token cost on long-grid problems.

How to Measure or Detect ARC-AGI Performance

Measure ARC-AGI as a held-out benchmark with strict task-level scoring, then add diagnostic metrics:

  • Exact task accuracy: count a task correct only when the generated output grid matches the expected grid.
  • GroundTruthMatch: returns whether the predicted grid matches the reference output; use it as the pass/fail gate.
  • TaskCompletion: checks whether the agent completed the requested transformation instead of stopping early or returning analysis only.
  • CustomEvaluation: encodes rule-specific rubrics where intermediate reasoning is logged.
  • Trace efficiency: monitor agent.trajectory.step, llm.token_count.prompt, retries, p99 latency, and token-cost-per-solved-task.
  • Cohort drift: segment by grid size, transformation type, prompt version, and model route to avoid hiding brittle behavior inside an aggregate score.
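
The cohort-drift point is easy to demonstrate with a few lines of Python. This is a minimal sketch over hypothetical result rows; the keys `transformation`, `grid_size`, and `solved` are assumptions about how results might be exported, not a fixed format.

```python
from collections import defaultdict
from typing import Dict, List


def pass_rate_by_cohort(results: List[dict], key: str) -> Dict[str, float]:
    """Group result rows by one field and compute the pass rate per group."""
    solved = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        cohort = str(r[key])
        total[cohort] += 1
        solved[cohort] += 1 if r["solved"] else 0
    return {cohort: solved[cohort] / total[cohort] for cohort in total}


# Hypothetical rows: every symmetry task passes, every counting task fails.
results = [
    {"transformation": "symmetry", "grid_size": "3x3", "solved": True},
    {"transformation": "symmetry", "grid_size": "10x10", "solved": True},
    {"transformation": "counting", "grid_size": "3x3", "solved": False},
    {"transformation": "counting", "grid_size": "10x10", "solved": False},
]

overall = sum(r["solved"] for r in results) / len(results)
by_transformation = pass_rate_by_cohort(results, "transformation")
by_grid_size = pass_rate_by_cohort(results, "grid_size")
```

Here the aggregate pass rate and the per-grid-size rates all look like a coin flip, while the per-transformation view shows one cohort failing completely.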

Minimal pairing snippet:

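from fi.evals import GroundTruthMatch, TaskCompletion

# predicted_grid, expected_grid, arc_prompt, and trace_spans are placeholders;
# they come from your own dataset row and trace export.
match = GroundTruthMatch()
task = TaskCompletion()

# Pass/fail gate: does the predicted grid equal the reference grid?
match_result = match.evaluate(
    response=predicted_grid,
    expected_response=expected_grid,
)
# End-to-end signal: did the agent finish the requested transformation?
task_result = task.evaluate(
    input=arc_prompt,
    trajectory=trace_spans,
)
print(match_result.score, task_result.score)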

Treat a high ARC-AGI score as one signal. It should agree with trace efficiency, regression stability, and product-specific evals before it influences a release decision.

Common Mistakes

  • Treating ARC-AGI as an AGI certificate. A high benchmark score is evidence about abstraction, not proof of general production reliability.
  • Comparing ARC-AGI 1 and ARC-AGI 2 scores. They are different benchmarks; ARC-AGI 1 is saturated, ARC-AGI 2 has a private holdout. Never mix the numbers.
  • Rewarding unlimited search. Accuracy without cost, latency, and attempt limits can hide systems that are unusable in production.
  • Converting grids to verbose text too early. Text serialization can add parser failures that are separate from the reasoning task.
  • Ignoring failed trajectories. Final wrong grids are less useful than the span where the agent chose a bad rule and never recovered.
  • Skipping efficiency gates. Two models with the same pass rate but 3x cost difference are not equivalent production options.
  • Stopping at the benchmark. Pair ARC-AGI with your own golden dataset.
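
To keep serialization problems separate from reasoning problems, one option is a strict parser that labels each outcome. This is a sketch under assumptions: it presumes the model returns its grid as a JSON nested list, and the function names and the three labels are hypothetical.

```python
import json
from typing import List, Optional

Grid = List[List[int]]


def parse_grid(raw: str) -> Optional[Grid]:
    """Parse a JSON nested list; return None unless it is a rectangular int grid."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(value, list) or not value:
        return None
    if not all(isinstance(row, list) and row for row in value):
        return None
    width = len(value[0])
    for row in value:
        # type(...) is int rejects booleans and floats, which JSON can also produce.
        if len(row) != width or not all(type(cell) is int for cell in row):
            return None
    return value


def classify(raw: str, expected: Grid) -> str:
    """Label an attempt as a parser problem or a reasoning problem."""
    grid = parse_grid(raw)
    if grid is None:
        return "parse_failure"
    return "correct" if grid == expected else "wrong_grid"


expected = [[0, 2]]
labels = [
    classify("[[0, 2]]", expected),
    classify("[[2, 0]]", expected),
    classify("The answer is [[0, 2]]", expected),
    classify("[[0, 2], [1]]", expected),
]
```

Counting parse_failure separately from wrong_grid shows whether a drop in exact accuracy comes from the output format or from the rule the model chose.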

Frequently Asked Questions

What is ARC-AGI?

ARC-AGI is an evaluation benchmark for abstract reasoning and skill-acquisition efficiency. It asks AI systems to infer hidden grid-transformation rules from a few examples, then apply those rules to novel test inputs. ARC-AGI 2 is the 2026 frontier version.

How is ARC-AGI different from MMLU?

MMLU primarily measures knowledge and exam-style reasoning across known domains, and is saturated in 2026. ARC-AGI is designed around novel visual tasks, so memorized facts and web-scale pretraining help less than rule induction and efficient adaptation.

How do you measure ARC-AGI?

Measure exact task accuracy on held-out grid outputs, then pair the score with FutureAGI evaluators such as GroundTruthMatch and TaskCompletion. Trace fields like agent.trajectory.step and llm.token_count.prompt show how costly each solved task was.