What Is AgentBench Agent Benchmark?
An open-source benchmark that evaluates LLMs as autonomous agents across eight environments including OS, database, web browsing, and household tasks.
The AgentBench agent benchmark is an open-source evaluation suite that tests whether large language models can operate as autonomous agents across eight interactive environments. Instead of scoring a single answer, it grades multi-turn planning, tool use, observation handling, and final task success across operating-system, database, knowledge-graph, card-game, lateral-thinking-puzzle, household, web-shopping, and web-browsing tasks. It appears in offline eval pipelines, benchmark comparisons, and production regression suites for agentic systems; in FutureAGI, teams pair that view with TrajectoryScore on real traces.
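To make the object under test concrete: what gets graded is a full trajectory, not a single completion. A minimal sketch of what such a record might look like (field names here are illustrative, not AgentBench's actual schema):

# Illustrative trajectory record; field names are hypothetical, not AgentBench's schema.
trajectory = {
    "task": "Find the cheapest red kettle and add it to the cart",
    "steps": [
        {"action": "search[red kettle]", "observation": "12 results, prices $14-$39"},
        {"action": "click[item_3]", "observation": "Product page: $14, red, in stock"},
        {"action": "click[add to cart]", "observation": "Added to cart"},
    ],
    "success": True,  # end-to-end goal achievement, the headline metric
}
# Grading covers each action given its observation plus the final outcome,
# which is why single-shot accuracy does not predict the score.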
Why AgentBench Agent Benchmark matters in production LLM and agent systems
Most pre-2024 LLM benchmarks measured static knowledge or one-shot reasoning. They could not tell you whether a model would actually run a coherent agent loop in production. AgentBench was an early answer to that gap — it threw the model into an environment, let it call tools, observe outcomes, and graded the trajectory. Production teams saw immediately that high-MMLU models could rank below average-MMLU models on AgentBench because the failure modes were different: planning, tool selection, observation parsing, and stop conditions matter for agents in ways they do not for closed-book QA.
The pain it surfaces in production lines up with what teams see daily. A backend engineer reads “Model X scores 91% on MMLU” and assumes it will perform well as an agent — then sees AgentBench scores in the 30s and reverses course. A product reviewer comparing candidate base models for a planner-LLM finds the AgentBench leaderboard is more predictive of agent quality than any single-shot benchmark. A research lead uses AgentBench-style harnesses to validate that fine-tuning actually improved agent behavior, not just isolated answers.
In 2026, AgentBench has been joined by GAIA, ARC-AGI, WebArena, OSWorld, and SWE-Bench, each carving out a sharper slice of agent capability. AgentBench remains the broad-spectrum benchmark — useful precisely because it spans eight domains rather than one. That breadth makes it a sanity check before any production agent rollout.
How FutureAGI handles AgentBench Agent Benchmark
FutureAGI does not host AgentBench — that is THUDM’s project, and the canonical leaderboard lives there. What FutureAGI provides is the evaluation infrastructure that maps cleanly onto AgentBench-style trajectory data and lets you run equivalent cohorts on your own production agent. The evaluator stack — TaskCompletion, TrajectoryScore, ToolSelectionAccuracy, StepEfficiency, ReasoningQuality — operates on multi-step trajectories captured via traceAI integrations like openai-agents, crewai, google-adk, and langchain.
The simulate SDK is the closest parallel to the AgentBench harness. Scenario and Persona objects let you define multi-turn test cases; CloudEngine orchestrates the runs against your agent callback; TestReport aggregates trajectory-level metrics. FutureAGI’s approach is to treat AgentBench as public coverage evidence, then convert the same failure types into domain-specific regression rows. That is functionally an AgentBench-style harness scoped to your domain — you bring the environments, FutureAGI brings the trajectory evaluators.
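A rough sketch of that flow using the object names above; the import path, constructor arguments, and method names here are illustrative assumptions, not the SDK's exact signatures:

# Hypothetical usage sketch; argument and method names are assumptions.
from fi.simulate import Scenario, Persona, CloudEngine, TestReport  # import path assumed

scenario = Scenario(
    name="refund_request",
    goal="Issue a refund for order #1234 within policy",  # hypothetical fields
)
persona = Persona(name="impatient_customer", traits=["terse", "pushes back once"])

engine = CloudEngine(scenarios=[scenario], personas=[persona])
report: TestReport = engine.run(agent_callback=my_agent)  # my_agent: your agent entry point (placeholder)
print(report.summary())  # trajectory-level metrics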
Concrete example: a team picking a planner-LLM for a coding agent runs AgentBench publicly to shortlist models, then runs a private domain-specific scenario cohort through FutureAGI’s simulate SDK to verify the shortlist holds on their tools and policies. AgentBench surfaces that Model A and Model B are both above 50%; FutureAGI’s TaskCompletion plus ToolSelectionAccuracy on the team’s 250 internal scenarios show Model B wins on tool-selection in their tool registry. The two together produce a defensible decision; either alone would have been a guess.
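The private verification step reduces to aggregating per-scenario evaluator scores; a minimal sketch, assuming you have already captured one trajectory per scenario for each candidate model (the run-collection variables are placeholders):

from statistics import mean
from fi.evals import TaskCompletion, ToolSelectionAccuracy

def score_cohort(runs):
    # runs: list of (task, trajectory) pairs captured for one candidate model (placeholder)
    completion = mean(TaskCompletion().evaluate(input=task, trajectory=traj).score for task, traj in runs)
    tool_acc = mean(ToolSelectionAccuracy().evaluate(input=task, trajectory=traj).score for task, traj in runs)
    return completion, tool_acc

# model_a_runs / model_b_runs: the 250 internal scenarios, run once per shortlisted model
print("Model A:", score_cohort(model_a_runs))
print("Model B:", score_cohort(model_b_runs))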
For research-leaning teams, AgentBench’s published trajectories can be ingested as a Dataset and re-scored with FutureAGI evaluators alongside the original metrics — useful for comparing evaluator sensitivities across the same trajectory set.
How to measure or detect AgentBench Agent Benchmark performance
AgentBench produces multi-step trajectory data; the FutureAGI evaluators that map onto it are:
- TaskCompletion: 0–1 score for end-to-end goal achievement on each AgentBench environment task.
- TrajectoryScore: aggregates per-step scoring across the trajectory; useful for environments with partial-credit grading.
- ToolSelectionAccuracy: scores whether the agent picked the right tool/action at each step in the environment.
- StepEfficiency: counts wasted steps; AgentBench’s web-browsing and household environments penalize inefficient trajectories.
- ReasoningQuality: scores chain-of-thought given observations — relevant for the lateral-thinking-puzzle subset.
- agent.trajectory.step (OTel attribute): the canonical span attribute used to align evaluator outputs to trajectory steps.
from fi.evals import TaskCompletion, TrajectoryScore, ToolSelectionAccuracy

# task: the AgentBench-style task description; agent_spans: the captured
# multi-step trajectory for that task. Both are assumed to exist already.
t = TaskCompletion().evaluate(input=task, trajectory=agent_spans)
tr = TrajectoryScore().evaluate(input=task, trajectory=agent_spans)
ts = ToolSelectionAccuracy().evaluate(input=task, trajectory=agent_spans)
print(t.score, tr.score, ts.score)
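To align those evaluator outputs to individual steps, the agent.trajectory.step span attribute can be set with standard OpenTelemetry calls as the agent runs; a minimal sketch, assuming your agent already loops over planned actions (plan and execute are placeholders):

from opentelemetry import trace

tracer = trace.get_tracer("my-agent")  # tracer name is illustrative

for step_index, action in enumerate(plan):  # plan: your agent's pending actions (placeholder)
    with tracer.start_as_current_span("agent_step") as span:
        span.set_attribute("agent.trajectory.step", step_index)  # lets evaluators line up with this step
        execute(action)  # placeholder for the actual tool call / environment action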
Common mistakes
- Treating AgentBench score as a model-ranking shortcut. AgentBench measures generic agent capability; your production agent depends on your tool registry and prompt. Re-score on a domain cohort.
- Comparing AgentBench numbers across paper versions. The benchmark has been updated; pin the version when comparing.
- Using AgentBench alone. Pair with GAIA, WebArena, or domain-specific cohorts; one benchmark is one slice.
- Conflating AgentBench with single-shot benchmarks. A model can win MMLU and lose AgentBench; do not average them into a single “capability” number.
- Skipping per-environment breakdown. Aggregate AgentBench scores hide that a model is strong on database tasks and weak on web browsing — your agent may only need one of the two.
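Computing that breakdown takes only a few lines once each task's score is keyed by environment; a minimal sketch, assuming results is a list of (environment, score) pairs from your run:

from collections import defaultdict
from statistics import mean

# results: list of (environment, score) pairs, e.g. [("database", 0.62), ("web_browsing", 0.21), ...]
by_env = defaultdict(list)
for env, score in results:
    by_env[env].append(score)

for env, scores in sorted(by_env.items()):
    print(f"{env}: {mean(scores):.2f}")  # a healthy aggregate can still hide a weak environment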
Frequently Asked Questions
What is AgentBench?
AgentBench is an open-source benchmark from THUDM that evaluates LLMs acting as autonomous agents across eight environments — operating system, database, knowledge graph, card game, lateral thinking puzzles, household tasks, web shopping, and web browsing.
How is AgentBench different from MMLU?
MMLU evaluates single-shot question answering. AgentBench evaluates multi-turn decision-making in interactive environments — the model must call tools, observe results, and adapt across many steps. AgentBench measures agent capability; MMLU measures knowledge.
How does FutureAGI relate to AgentBench?
FutureAGI does not host AgentBench, but its evaluator stack — TaskCompletion, TrajectoryScore, ToolSelectionAccuracy — is designed for the same multi-step trajectory data AgentBench produces. You can run AgentBench-style cohorts against your own agents using the simulate SDK.