What Is AgentBench Agent Benchmark?
An open-source benchmark that evaluates LLMs as autonomous agents across eight environments including OS, database, web browsing, and household tasks.
The AgentBench agent benchmark is an open-source evaluation suite that tests whether large language models can operate as autonomous agents across eight interactive environments. Instead of scoring a single answer, it grades multi-turn planning, tool use, observation handling, and final task success across operating-system, database, knowledge-graph, card-game, lateral-thinking-puzzle, household, web-shopping, and web-browsing tasks. It appears in offline eval pipelines, benchmark comparisons, and production regression suites for agentic systems; in FutureAGI, teams pair that view with TrajectoryScore on real traces.
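To make the object under test concrete: what gets graded is a full trajectory, not a single completion. A minimal sketch of what such a record might look like (field names here are illustrative, not AgentBench's actual schema):

# Illustrative trajectory record; field names are hypothetical, not AgentBench's schema.
trajectory = {
    "task": "Find the cheapest red kettle and add it to the cart",
    "steps": [
        {"action": "search[red kettle]", "observation": "12 results, prices $14-$39"},
        {"action": "click[item_3]", "observation": "Product page: $14, red, in stock"},
        {"action": "click[add to cart]", "observation": "Added to cart"},
    ],
    "success": True,  # end-to-end goal achievement, the headline metric
}
# Grading covers each action given its observation plus the final outcome,
# which is why single-shot accuracy does not predict the score.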
Why AgentBench Agent Benchmark matters in production LLM and agent systems
Most pre-2024 LLM benchmarks measured static knowledge or one-shot reasoning. They could not tell you whether a model would actually run a coherent agent loop in production. AgentBench was an early answer to that gap — it threw the model into an environment, let it call tools, observe outcomes, and graded the trajectory. Production teams saw immediately that high-MMLU models could rank below average-MMLU models on AgentBench because the failure modes were different: planning, tool selection, observation parsing, and stop conditions matter for agents in ways they do not for closed-book QA.
The pain it surfaces in production lines up with what teams see daily. A backend engineer reads “Model X scores 91% on MMLU” and assumes it will perform well as an agent — then sees AgentBench scores in the 30s and reverses course. A product reviewer comparing candidate base models for a planner-LLM finds the AgentBench leaderboard is more predictive of agent quality than any single-shot benchmark. A research lead uses AgentBench-style harnesses to validate that fine-tuning actually improved agent behavior, not just isolated answers.
In 2026, AgentBench has been joined by GAIA, ARC-AGI, WebArena, OSWorld, and SWE-Bench, each carving out a sharper slice of agent capability. AgentBench remains the broad-spectrum benchmark — useful precisely because it spans eight domains rather than one. That breadth makes it a sanity check before any production agent rollout.
How FutureAGI handles AgentBench Agent Benchmark
FutureAGI does not host AgentBench — that is THUDM’s project, and the canonical leaderboard lives there. What FutureAGI provides is the evaluation infrastructure that maps cleanly onto AgentBench-style trajectory data and lets you run equivalent cohorts on your own production agent. The evaluator stack — TaskCompletion, TrajectoryScore, ToolSelectionAccuracy, StepEfficiency, ReasoningQuality — operates on multi-step trajectories captured via traceAI integrations like openai-agents, crewai, google-adk, and langchain.
The simulate SDK is the closest parallel to the AgentBench harness. Scenario and Persona objects let you define multi-turn test cases; CloudEngine orchestrates the runs against your agent callback; TestReport aggregates trajectory-level metrics. FutureAGI’s approach is to treat AgentBench as public coverage evidence, then convert the same failure types into domain-specific regression rows. That is functionally an AgentBench-style harness scoped to your domain — you bring the environments, FutureAGI brings the trajectory evaluators.
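A rough sketch of that flow using the object names above; the import path, constructor arguments, and method names here are illustrative assumptions, not the SDK's exact signatures:

# Hypothetical usage sketch; argument and method names are assumptions.
from fi.simulate import Scenario, Persona, CloudEngine, TestReport  # import path assumed

scenario = Scenario(
    name="refund_request",
    goal="Issue a refund for order #1234 within policy",  # hypothetical fields
)
persona = Persona(name="impatient_customer", traits=["terse", "pushes back once"])

engine = CloudEngine(scenarios=[scenario], personas=[persona])
report: TestReport = engine.run(agent_callback=my_agent)  # my_agent: your agent entry point (placeholder)
print(report.summary())  # trajectory-level metrics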
Concrete example: a team picking a planner-LLM for a coding agent runs AgentBench publicly to shortlist models, then runs a private domain-specific scenario cohort through FutureAGI’s simulate SDK to verify the shortlist holds on their tools and policies. AgentBench surfaces that Model A and Model B are both above 50%; FutureAGI’s TaskCompletion plus ToolSelectionAccuracy on the team’s 250 internal scenarios show Model B wins on tool-selection in their tool registry. The two together produce a defensible decision; either alone would have been a guess.
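The private verification step reduces to aggregating per-scenario evaluator scores; a minimal sketch, assuming you have already captured one trajectory per scenario for each candidate model (the run-collection variables are placeholders):

from statistics import mean
from fi.evals import TaskCompletion, ToolSelectionAccuracy

def score_cohort(runs):
    # runs: list of (task, trajectory) pairs captured for one candidate model (placeholder)
    completion = mean(TaskCompletion().evaluate(input=task, trajectory=traj).score for task, traj in runs)
    tool_acc = mean(ToolSelectionAccuracy().evaluate(input=task, trajectory=traj).score for task, traj in runs)
    return completion, tool_acc

# model_a_runs / model_b_runs: the 250 internal scenarios, run once per shortlisted model
print("Model A:", score_cohort(model_a_runs))
print("Model B:", score_cohort(model_b_runs))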
For research-leaning teams, AgentBench’s published trajectories can be ingested as a Dataset and re-scored with FutureAGI evaluators alongside the original metrics — useful for comparing evaluator sensitivities across the same trajectory set.
How to measure or detect AgentBench Agent Benchmark performance
AgentBench produces multi-step trajectory data; the FutureAGI evaluators that map onto it are:
- TaskCompletion: 0–1 score for end-to-end goal achievement on each AgentBench environment task.
- TrajectoryScore: aggregates per-step scoring across the trajectory; useful for environments with partial-credit grading.
- ToolSelectionAccuracy: scores whether the agent picked the right tool/action at each step in the environment.
- StepEfficiency: counts wasted steps; AgentBench’s web-browsing and household environments penalize inefficient trajectories.
- ReasoningQuality: scores chain-of-thought given observations — relevant for the lateral-thinking-puzzle subset.
- agent.trajectory.step (OTel attribute): the canonical span attribute used to align evaluator outputs to trajectory steps.
from fi.evals import TaskCompletion, TrajectoryScore, ToolSelectionAccuracy

# task: the AgentBench-style task description; agent_spans: the captured
# multi-step trajectory for that task. Both are assumed to exist already.
t = TaskCompletion().evaluate(input=task, trajectory=agent_spans)
tr = TrajectoryScore().evaluate(input=task, trajectory=agent_spans)
ts = ToolSelectionAccuracy().evaluate(input=task, trajectory=agent_spans)
print(t.score, tr.score, ts.score)
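To align those evaluator outputs to individual steps, the agent.trajectory.step span attribute can be set with standard OpenTelemetry calls as the agent runs; a minimal sketch, assuming your agent already loops over planned actions (plan and execute are placeholders):

from opentelemetry import trace

tracer = trace.get_tracer("my-agent")  # tracer name is illustrative

for step_index, action in enumerate(plan):  # plan: your agent's pending actions (placeholder)
    with tracer.start_as_current_span("agent_step") as span:
        span.set_attribute("agent.trajectory.step", step_index)  # lets evaluators line up with this step
        execute(action)  # placeholder for the actual tool call / environment action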
Common mistakes
- Treating AgentBench score as a model-ranking shortcut. AgentBench measures generic agent capability; your production agent depends on your tool registry and prompt. Re-score on a domain cohort.
- Comparing AgentBench numbers across paper versions. The benchmark has been updated; pin the version when comparing.
- Using AgentBench alone. Pair with GAIA, WebArena, or domain-specific cohorts; one benchmark is one slice.
- Conflating AgentBench with single-shot benchmarks. A model can win MMLU and lose AgentBench; do not average them into a single “capability” number.
- Skipping per-environment breakdown. Aggregate AgentBench scores hide that a model is strong on database tasks and weak on web browsing — your agent may only need one of the two.
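Computing that breakdown takes only a few lines once each task's score is keyed by environment; a minimal sketch, assuming results is a list of (environment, score) pairs from your run:

from collections import defaultdict
from statistics import mean

# results: list of (environment, score) pairs, e.g. [("database", 0.62), ("web_browsing", 0.21), ...]
by_env = defaultdict(list)
for env, score in results:
    by_env[env].append(score)

for env, scores in sorted(by_env.items()):
    print(f"{env}: {mean(scores):.2f}")  # a healthy aggregate can still hide a weak environment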
Frequently Asked Questions
What is AgentBench?
AgentBench is an open-source benchmark from THUDM that evaluates LLMs acting as autonomous agents across eight environments — operating system, database, knowledge graph, card game, lateral thinking puzzles, household tasks, web shopping, and web browsing.
How is AgentBench different from MMLU?
MMLU evaluates single-shot question answering. AgentBench evaluates multi-turn decision-making in interactive environments — the model must call tools, observe results, and adapt across many steps. AgentBench measures agent capability; MMLU measures knowledge.
How does FutureAGI relate to AgentBench?
FutureAGI does not host AgentBench, but its evaluator stack — TaskCompletion, TrajectoryScore, ToolSelectionAccuracy — is designed for the same multi-step trajectory data AgentBench produces. You can run AgentBench-style cohorts against your own agents using the simulate SDK.