The Definitive Guide to AI Agent Evaluation (2026)
The 2026 working pattern for AI agent evaluation. Six dimensions, six rubrics, a 4-D trajectory score, the CI gate that beats aggregate scoring, and the loop production needs.
An agent is not a model. Evaluating one as if it were is the most common reason production agents fail. The unit is the trajectory: tool selection, argument extraction, result utilization, error recovery, plan coherence, task completion. Six dimensions, six rubrics, scored independently. Test fewer than four and you are shipping a probability, not a system. This guide is the working pattern from agent deployments that have shipped and stayed shipped: six dimensions, the rubrics that catch them, the public benchmarks that anchor the floor, the private eval set that gates production, the 4-D trajectory score that runs in CI and on live spans, and the loop that turns production failures into the next regression test.
TL;DR: the six dimensions of agent evaluation
| Dimension | What it measures | Failure mode if missing |
|---|---|---|
| Tool selection | Did the agent pick the right tool, or correctly call none | Wrong tool, fabricated tool, no tool when one was needed |
| Argument extraction | Schema-valid and semantically correct arguments | Right tool, wrong date format; right tool, missing required field |
| Result utilization | Did the agent use the tool payload or substitute model knowledge | Number flipped, entity swapped, payload ignored |
| Error recovery | Did the agent retry, fall back, or escalate on tool failure | Crash, hallucinate success, retry the same broken input |
| Plan coherence | Loop-free, dead-end-free, right depth | Sub-tree explosion, premature finalisation, infinite loops |
| Task completion | Did the trajectory deliver the user goal end-to-end | Per-step green, end-to-end failure |
Aggregate task-completion alone hides which dimension regressed. Per-dimension scoring tells you what to fix this afternoon.
Why agent eval is not LLM eval with extra steps
An LLM eval is a function from (input, output) to a score. An agent eval is a function from trajectory to a score, where a trajectory is the full ordered sequence of system prompt, user input, agent reasoning, tool calls (name plus arguments plus return value), retrieval results, intermediate LLM calls, final response, and outcome metadata.
You cannot score this from the response alone. A response that looks right can come from the wrong tool with the wrong arguments by luck. A response that looks wrong can come from a correct trajectory the rubric did not anticipate. The trace is the truth.
The math agrees. End-to-end success on a k-step agent is roughly the product of per-step success rates. A 95-percent per-step agent over eight steps lands near 66 percent. A 99-percent per-step agent over eight steps lands near 92 percent. One third of sessions ending structurally wrong while every individual step scores green is the default math of compound error, and it is why teams ship agents that pass per-turn eval and tank production. The per-step rubric is the gate; the trajectory metric is the truth.
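The compound-error arithmetic above can be checked in a few lines. This is a minimal sketch that assumes every step fails independently and shares one success rate; `end_to_end_success` is an illustrative name, not part of any SDK.

```python
def end_to_end_success(per_step_rate: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_rate ** steps


for rate in (0.95, 0.99):
    # Prints 0.663 for the 0.95 agent and 0.923 for the 0.99 agent.
    print(f"{rate:.2f} per step over 8 steps -> {end_to_end_success(rate, 8):.3f}")
```

Real agents rarely have uniform or independent step reliability, so treat the product as a rough estimate rather than a forecast.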
Dimension 1: tool selection
Pull the model’s chosen tool name, compare to the gold label, aggregate as F1 per tool so a registry of 28 tools does not hide a regression on one rare endpoint behind a strong global mean. Three failure modes show up:
- Wrong tool. Calls web_search when the answer was in knowledge_base.
- No tool. Answers from parametric memory when the spec required grounded retrieval.
- Fabricated tool. Invokes a tool that does not exist (rare in production, common in prototype).
The piece most posts drop is the irrelevance bucket: cases where the gold answer is no tool call. Greeting, clarification request, in-model factual question, refusal-worthy ask. Without those cases, you cannot detect the regression where a new prompt revision makes the model bolder about calling search on every input. BFCL added the bucket for exactly this reason; build it into your private set the same way. The ai-evaluation SDK ships LLMFunctionCalling (cloud, alias EvaluateFunctionCalling) for the rubric case plus deterministic function_name_match, parameter_validation, function_call_accuracy, and function_call_exact_match (sub-millisecond, local).
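A minimal sketch of per-tool F1 with the irrelevance bucket treated as a label of its own. It does not use the SDK metrics named above; `NO_TOOL` and `per_tool_f1` are hypothetical names, and the five cases are made up for illustration.

```python
from collections import Counter

NO_TOOL = "<none>"  # hypothetical label for "the gold answer is no tool call"


def per_tool_f1(gold: list[str], predicted: list[str]) -> dict[str, float]:
    """F1 per label, counting 'no tool call' as a label in its own right."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label was wrong for this case
            fn[g] += 1  # the gold label was missed
    scores = {}
    for label in set(gold) | set(predicted):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


gold = ["web_search", "knowledge_base", NO_TOOL, NO_TOOL, "knowledge_base"]
pred = ["web_search", "web_search", "web_search", NO_TOOL, "knowledge_base"]
for tool, f1 in sorted(per_tool_f1(gold, pred).items()):
    print(f"{tool}: {f1:.2f}")
```

In this toy run the over-calling shows up as a low web_search score and a dented `<none>` score, which a single global accuracy number would blur together.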
Dimension 2: argument extraction
Right tool with wrong arguments is the most common production agent bug. The agent decides to call create_calendar_event, then passes the date in the wrong format, omits the timezone, or hallucinates an attendee. Argument failures fall into three buckets:
- Schema mismatch. Wrong type, missing required field. Pydantic and JSON Schema catch this deterministically.
- Semantic mismatch. Right schema, wrong value. departure_date="2026-01-01" validates and is wrong if the user said “next Friday.” LLM-judge with a few-shot rubric handles this.
- Edge-case handling. Null on optional fields, empty array, unicode in identifiers, timezone on date fields, currency on monetary fields. These are the failures BFCL cannot see because they are private to your tool registry.
Run schema validation first, gate CI on it, then send the semantically suspect cases to a CustomLLMJudge that scores whether the argument captures the user intent. Building a regression suite of edge cases per tool (null on optional, empty array, special characters, type coercion, the timezone case on every date field) is what separates a BFCL-equivalent eval set from one that survives production.
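The deterministic first pass might look like the sketch below. It uses only the standard library so it runs anywhere; in practice Pydantic or JSON Schema would replace the hand-rolled check. The schema for create_calendar_event, the field names, and the edge-case rows are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical schema: field -> (expected type, required?)
SCHEMA = {
    "title": (str, True),
    "start_date": (str, True),  # expected as an ISO calendar date
    "timezone": (str, True),
    "attendees": (list, False),
}


def schema_errors(args: dict) -> list[str]:
    """Deterministic checks only; semantic correctness is left to a judge."""
    errors = []
    for field, (expected_type, required) in SCHEMA.items():
        if field not in args or args[field] is None:
            if required:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(args[field], expected_type):
            errors.append(f"wrong type for {field}")
    if isinstance(args.get("start_date"), str):
        try:
            date.fromisoformat(args["start_date"])
        except ValueError:
            errors.append("start_date is not an ISO date")
    return errors


base = {"title": "Sync", "start_date": "2026-01-02", "timezone": "UTC"}
EDGE_CASES = {
    "valid": base,
    "null_optional": {**base, "attendees": None},
    "empty_array": {**base, "attendees": []},
    "missing_timezone": {"title": "Sync", "start_date": "2026-01-02"},
    "bad_date_format": {**base, "start_date": "01/02/2026"},
}
for name, args in EDGE_CASES.items():
    print(name, schema_errors(args))
```

Rows that come back with an empty error list are the ones worth sending on to the semantic judge; the rest fail CI before any judge call is made.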
Dimension 3: result utilization
The tool returned. The agent has the payload. Three failure patterns surface most often, and almost every public post on agent eval skips this layer:
The agent paraphrases the payload with a number flipped. Tool returns {"refund_status": "pending", "amount_cents": 4500}; agent says “your refund of $54.00 is processing.” Schema-correct call, clean response, two digits transposed and nine dollars off.
The agent substitutes prior model knowledge. get_account_balance returns {"balance_cents": 12400}, which is $124.00. The model “knows” the user has a standard $200 minimum and replies “your balance is above the $200 threshold.” The tool result was never read.
The agent uses the result on turn 1 and drifts off it by turn 3. The flight-booking agent quotes the right itinerary on turn 1, then invents a baggage policy on turn 3 that contradicts the airline_policy tool result from two turns ago.
The rubric is Groundedness, with the context slot pointed at the tool return payload rather than the retrieved corpus. ContextAdherence and ChunkAttribution work the same way: chunk the tool result into JSON fields, score whether each claim in the response maps to one. The Platform’s classifier-backed cascade runs Groundedness at lower per-eval cost than Galileo Luna-2.
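Before any judge runs, a cheap deterministic pre-check can catch the flipped-number case. The sketch below is not the Groundedness rubric; it only compares dollar amounts quoted in the response against payload fields, and it assumes monetary fields end in `_cents`. Both function names are hypothetical.

```python
import re


def payload_amounts_cents(payload: dict) -> set[int]:
    """Collect integer values from fields named *_cents (a naming assumption)."""
    return {v for k, v in payload.items() if k.endswith("_cents") and isinstance(v, int)}


def ungrounded_dollar_claims(response: str, payload: dict) -> list[str]:
    """Return dollar amounts quoted in the response that no payload field supports."""
    allowed = payload_amounts_cents(payload)
    claims = re.findall(r"\$(\d+(?:\.\d{2})?)", response)
    bad = []
    for claim in claims:
        cents = round(float(claim) * 100)
        if cents not in allowed:
            bad.append(f"${claim}")
    return bad


payload = {"refund_status": "pending", "amount_cents": 4500}
print(ungrounded_dollar_claims("Your refund of $54.00 is processing.", payload))  # ['$54.00']
print(ungrounded_dollar_claims("Your refund of $45.00 is pending.", payload))     # []
```

A check like this only covers literal amounts. Swapped entities and substituted prior knowledge still need the rubric pointed at the tool payload.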
Dimension 4: error recovery
Real tools fail. APIs time out, return 429s, return malformed JSON, return empty results. The agent’s behavior in these cases is a separate eval axis from happy-path behavior. The patterns to grade:
- Did the agent read the error body and route to a corrected retry, a fallback tool, a clarification question, or a graceful escalation?
- Did it retry with corrected arguments on a 400, or send the same broken string again?
- Did it stop at a sensible retry cap? (3 is a common floor; 6 usually means the loop guard is missing.)
- Did it communicate the failure clearly instead of fabricating success?
This is a trajectory-level concern. Build a stratified test set by replaying production traces with synthetic tool failures injected: one bucket per tool, one row per error code the endpoint returns (400, 401, 403, 404, 408, 429, 5xx), plus empty-result and partial-result rows. Gate CI on per-bucket recovery rates. ActionSafety and TrajectoryScore from the agent-trajectory suite cover the deterministic side; a CustomLLMJudge wrapped around the trajectory handles the qualitative side.
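The stratified set and the deterministic side of recovery grading can be sketched as follows. This is an illustration under stated assumptions, not the ActionSafety or TrajectoryScore implementation: the two-tool registry, the `ToolCall` record, and the flag names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product

TOOLS = ["get_account_balance", "create_calendar_event"]  # example registry
FAILURES = ["400", "401", "403", "404", "408", "429", "5xx",
            "empty_result", "partial_result"]


def failure_matrix() -> list[tuple[str, str]]:
    """One (tool, injected failure) row per bucket in the stratified set."""
    return list(product(TOOLS, FAILURES))


@dataclass
class ToolCall:
    tool: str
    args: dict
    status: str  # "ok" or one of FAILURES


def recovery_flags(calls: list[ToolCall], retry_cap: int = 3) -> dict[str, bool]:
    """Deterministic recovery checks over one replayed trajectory."""
    # A 400 followed by the identical call means the error body was ignored.
    repeated_400_input = any(
        prev.status == "400" and prev.tool == cur.tool and prev.args == cur.args
        for prev, cur in zip(calls, calls[1:])
    )
    failures = sum(1 for call in calls if call.status != "ok")
    return {
        "repeated_400_input": repeated_400_input,
        "exceeded_retry_cap": failures > retry_cap,
    }


calls = [
    ToolCall("create_calendar_event", {"start_date": "01/02/2026"}, "400"),
    ToolCall("create_calendar_event", {"start_date": "01/02/2026"}, "400"),
    ToolCall("create_calendar_event", {"start_date": "2026-01-02"}, "ok"),
]
print(len(failure_matrix()))  # 18 buckets for this two-tool registry
print(recovery_flags(calls))
```

Retrying an identical call after a 429 or a timeout is often correct, which is why the repeat check here is limited to 400s.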
Dimension 5: plan coherence
For agents that take multiple steps before finalising, the shape of the trajectory matters. Three patterns to score:
- No loops. The agent does not re-call the same tool with the same arguments more than once.
- No dead-ends. Every branch eventually returns to the main goal or terminates with a clean refusal.
- Right depth. A two-step task takes roughly two steps. A ten-step task takes roughly ten. Sub-tree explosion is a regression.
StepEfficiency, TrajectoryScore, and GoalProgress from the SDK’s agent-trajectory suite score these directly on AgentTrajectoryInput. For richer plan critique, a CustomLLMJudge with a rubric like “score 1.0 if the trajectory is the shortest correct path; 0.5 if it is correct but inefficient; 0.0 if it loops or dead-ends” works well. Treat any trajectory longer than five steps as suspect; force the planner to decompose into shorter sub-agents. Long flat trajectories are where compound-error pain lives.
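A deterministic approximation of the 1.0 / 0.5 / 0.0 rubric quoted above, as a sketch only. It assumes each case carries a gold shortest-path length and a goal-reached flag, and it treats "did not reach the goal" as a stand-in for dead-ends; `has_loop` and `plan_score` are hypothetical names, not SDK metrics.

```python
import json


def has_loop(steps: list[tuple[str, dict]]) -> bool:
    """True if any (tool, arguments) pair is called more than once."""
    seen = set()
    for tool, args in steps:
        key = (tool, json.dumps(args, sort_keys=True))
        if key in seen:
            return True
        seen.add(key)
    return False


def plan_score(steps: list[tuple[str, dict]], shortest_len: int, reached_goal: bool) -> float:
    """1.0 shortest correct path, 0.5 correct but inefficient, 0.0 loop or dead-end."""
    if has_loop(steps) or not reached_goal:
        return 0.0
    return 1.0 if len(steps) <= shortest_len else 0.5


direct = [("search_flights", {"to": "LIS"}), ("book_flight", {"id": "F1"})]
padded = direct[:1] + [("get_weather", {"city": "LIS"})] + direct[1:]
looped = direct[:1] + direct
print(plan_score(direct, 2, True), plan_score(padded, 2, True), plan_score(looped, 2, True))
# 1.0 0.5 0.0
```

An LLM judge is still needed when more than one path is acceptable or when the shortest path is not known in advance.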
Dimension 6: task completion
End-to-end success on the user goal, scored on the full trajectory rather than the final turn. TaskCompletion (cloud eval_id=99) handles the rubric case across trajectory plus expected goal. For multi-turn conversations, layer in ConversationCoherence and ConversationResolution so per-turn rubrics that look fine in isolation cannot hide a session that talked itself in circles. For customer-support agents, the SDK ships 11 CustomerAgent* templates (ClarificationSeeking, ContextRetention, ConversationQuality, HumanEscalation, InterruptionHandling, LanguageHandling, LoopDetection, ObjectionHandling, PromptConformance, QueryHandling, TerminationHandling) for the named failure modes in that vertical.
Reserve a consistency slice. Pick 30 hard cases and run them k times each; the fraction that succeed on all k is your pass^k in τ-bench’s sense. When pass^8 moves, the planner regressed, not the tools.
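The consistency slice reduces to a few lines once each case has k recorded outcomes. This sketch follows the simple reading in the paragraph above (a case counts only if all of its first k runs succeed); the case names and outcomes are made up, and `pass_hat_k` is a hypothetical helper.

```python
def pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Fraction of cases whose first k runs all succeeded."""
    if not results:
        return 0.0
    for case, runs in results.items():
        if len(runs) < k:
            raise ValueError(f"{case} has only {len(runs)} runs, need {k}")
    return sum(all(runs[:k]) for runs in results.values()) / len(results)


results = {
    "refund_partial": [True] * 8,
    "rebook_with_policy": [True, True, False, True, True, True, True, True],
    "cancel_after_cutoff": [True] * 8,
    "multi_leg_change": [False] * 8,
}
print(pass_hat_k(results, 1), pass_hat_k(results, 8))  # 0.75 0.5
```

The gap between the k=1 and k=8 numbers is the nondeterminism a single-run eval cannot see.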
Public benchmarks: the floor, not the ceiling
Three public benchmarks anchor the floor in 2026. Use them; do not gate production on them.
BFCL (Berkeley Function Calling Leaderboard) breaks tool calling into an AST track (syntactic correctness), an executable track (the call actually runs on a real endpoint), and an irrelevance-detection bucket. A model that aces AST and tanks irrelevance overcalls on your registry; a model that aces AST and tanks executable generates plausible but non-running calls.
τ-bench evaluates multi-turn agents in airline and retail with an LLM-simulated user, a domain policy, and tool access. The headline metric is pass^k across k independent rollouts. Even strong models land below 25 percent at pass^8 on retail; multi-turn tool-using agents are nondeterminism amplifiers, and the consistency metric is the cleanest exposure of that fact.
ToolBench tests across thousands of real APIs with a focus on instruction-following and tool composition.
Public benchmarks tell you whether the underlying model can call tools at all. They tell you nothing about your registry, argument schemas, error codes, or business policy. The private eval set is the one that gates production. Build it stratified by tool, argument-edge-case bucket, and error code; promote failing production traces into it weekly.
The 4-D trajectory score
Per-rubric scoring across the six dimensions tells you what regressed. The 4-D trajectory score tells you the shape of the regression on every trace, with the same vocabulary in CI and production. Four axes, scored 1 to 5 by the same judge:
- Factual grounding. Did the agent stay anchored in retrieved or tool context, or confabulate. Catches result-utilization failures and retrieval drift.
- Privacy and safety. Did the agent leak PII, cross a tenant boundary, comply with a jailbreak. Catches refusal regressions and permission failures.
- Instruction adherence. Did the agent obey the system prompt and refuse what should have been refused. Catches prompt drift directly.
- Optimal plan execution. Did the agent pick the right tool, in the right order, without redundant calls, retries, or unreachable branches. Catches tool-selection and plan-coherence regressions on the call graph.
Four axes, four kinds of regression, one composite. When the composite drops, the axes are the diagnosis. The same judge runs against the offline dataset in CI and against live spans in production. Same vocabulary in both places, same calibration set, same threshold.
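How the composite is computed is not specified here, so the sketch below makes two explicit assumptions: the composite is the unweighted mean of the four 1-to-5 axis scores, and the diagnosis is the axis that dropped most against a baseline. The axis keys and both function names are hypothetical.

```python
from statistics import mean

AXES = ("factual_grounding", "privacy_safety", "instruction_adherence", "plan_execution")


def composite(scores: dict[str, int]) -> float:
    """Unweighted mean of the four axis scores (the weighting is an assumption)."""
    for axis in AXES:
        if not 1 <= scores[axis] <= 5:
            raise ValueError(f"{axis} must be in 1..5")
    return mean(scores[axis] for axis in AXES)


def diagnose(baseline: dict[str, int], current: dict[str, int]) -> str:
    """Name the axis that dropped the most between two scored traces."""
    return max(AXES, key=lambda axis: baseline[axis] - current[axis])


baseline = {"factual_grounding": 5, "privacy_safety": 5,
            "instruction_adherence": 5, "plan_execution": 4}
current = {**baseline, "factual_grounding": 3}
print(composite(baseline), composite(current), diagnose(baseline, current))
# 4.75 4.25 factual_grounding
```

Keeping the four axis scores alongside the composite is what makes the drop actionable; the mean alone only says that something moved.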
The CI gate: per-dimension thresholds, not an aggregate
The bug is treating one aggregate agent_score as a ship gate. An aggregate 0.85 hides a 0.62 on argument extraction behind a 0.97 on tool selection, and the production failure rides on the argument layer. Wire six assertions in the CI fixture, one per dimension, with thresholds calibrated against historical pass rates:
# config.yaml for `fi run`
assertions:
- "tool_selection_f1.score >= 0.95 for at_least 95% of cases"
- "argument_validation.score >= 0.90 for at_least 90% of cases"
- "argument_semantics.score >= 0.85 for at_least 85% of cases"
- "result_groundedness.score >= 0.90 for at_least 90% of cases"
- "recovery_score.score >= 0.80 for at_least 85% of cases"
- "task_completion.score >= 0.85 for at_least 90% of cases"
When the gate fails, the failing assertion name is the root cause. One bisect instead of three days. Distributed runners (Celery, Ray, Temporal, Kubernetes) handle the case where six rubrics across a 200-case suite outgrow a single-process budget.
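To make the gate logic concrete, here is a minimal sketch of the "score >= threshold for at_least N% of cases" check, evaluated by hand rather than by `fi run`. The per-case scores are made up, and `assertion_passes` is a hypothetical helper, not how the CLI parses or evaluates the assertion strings.

```python
from statistics import mean


def assertion_passes(scores: list[float], threshold: float, at_least_pct: float) -> bool:
    """True when at least `at_least_pct` percent of cases score >= threshold."""
    hits = sum(score >= threshold for score in scores)
    return 100 * hits / len(scores) >= at_least_pct


# dimension -> (per-case scores, threshold, required percentage); scores are invented
suite = {
    "tool_selection_f1": ([0.98] * 10, 0.95, 95),
    "argument_semantics": ([0.9] * 6 + [0.4] * 4, 0.85, 85),
}
failed = [name for name, (scores, threshold, pct) in suite.items()
          if not assertion_passes(scores, threshold, pct)]
aggregate = mean(s for scores, _, _ in suite.values() for s in scores)
print(f"aggregate={aggregate:.2f} failed={failed}")
# aggregate=0.84 failed=['argument_semantics']
```

The aggregate of 0.84 would clear many single-number gates, while the per-dimension check names argument_semantics as the layer that regressed.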
Tracing is not optional
The trajectory is the unit of evaluation. The trace is the trajectory. Without spans, agent eval is response-only scoring with extra words.
traceAI (Apache 2.0) ships 14 span kinds (TOOL, CHAIN, LLM, RETRIEVER, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR, CONVERSATION, VECTOR_DB, A2A_CLIENT, A2A_SERVER, UNKNOWN). The A2A_CLIENT and A2A_SERVER kinds capture agent-to-agent relationships for multi-agent systems. 50+ AI surfaces across Python, TypeScript, Java, and C#. Pluggable semantic conventions at register() time (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) mean the spans flow into whatever OTel collector you already run. The LangGraphInstrumentor surfaces node_count and conditional-edge topology so multi-agent graphs are introspectable from the trace alone.
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai_agents import OpenAIAgentsInstrumentor

# Register a tracer provider for this project.
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="my-agent",
)

# Instrument the OpenAI Agents SDK so its runs emit spans to that provider.
OpenAIAgentsInstrumentor().instrument(tracer_provider=trace_provider)
That is the whole tracing setup. The spans surface tool calls, retrievals, agent reasoning, and tool returns: the inputs every agent-eval rubric needs. Eval scores attach to spans via EvalTag; the collector runs evals server-side post-export at zero inline latency.
Production observability and Error Feed
Six dimensions in CI is necessary, not sufficient. The eval set is a snapshot; production is a river. Score the live trace stream with the same rubrics and you get a regression signal the offline set cannot have, because the offline set was frozen before users found the failure mode.
Error Feed is the loop closer inside the eval stack. Failing traces flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering groups them into named issues. Each cluster fires a JudgeAgent on Claude Sonnet 4.5 for a 30-turn investigation across 8 span-tools (including read_span, get_children, get_spans_by_type, and search_spans, plus a Haiku Chauffeur for spans over 3000 characters). Prompt-cache hit ratio sits around 90 percent.
Per cluster, the Judge emits three artifacts engineers actually read: a 5-category, 30-subtype taxonomy, the 4-D trace score above, and an immediate_fix naming the change to ship today (rubric edit, prompt patch, tool-call guard, retrieval-filter tweak). The fix feeds the Platform’s self-improving evaluators. The cluster becomes a candidate dataset entry; the on-call engineer promotes representative traces into the offline set. The next PR touching that path has to clear them.
Common agent eval mistakes
- Response-only scoring. Misses every failure whose root cause is a bad tool call or a bad plan. The trajectory is the unit.
- Aggregate task-completion alone. Hides which dimension regressed. Per-dimension scoring is the only diagnostic that works.
- No irrelevance bucket. Tool selection only scored on cases where a tool was expected. The over-call regression is invisible.
- Mocked tools, no error-recovery coverage. Happy-path eval at 0.95. Production 429 storm at 0.30.
- Frozen test set. Promote failing traces into the offline set weekly or the set ages off the product.
- Eval and trace in different tools. Attach scores to the OTel span; no engineer cross-references two dashboards under pressure.
How Future AGI ships the full agent eval stack
Future AGI ships the eval stack as a package, not a single product. Start with the SDK for code-defined per-dimension scoring. Graduate to the Platform when the loop needs self-improving rubrics, in-product authoring, and classifier-backed cost economics.
ai-evaluation SDK (Apache 2.0). 70+ EvalTemplate classes including LLMFunctionCalling, TaskCompletion, AnswerRefusal, ConversationCoherence, ConversationResolution, Groundedness, ContextAdherence, ChunkAttribution, and 11 CustomerAgent* templates for vertical-specific failure modes. Deterministic function-call metrics: function_name_match, parameter_validation, function_call_accuracy, function_call_exact_match (sub-millisecond). Seven AgentTrajectoryInput metrics: TaskCompletion, StepEfficiency, ToolSelectionAccuracy, TrajectoryScore, GoalProgress, ActionSafety, ReasoningQuality. 13 guardrail backends (9 open-weight). Four distributed runners (Celery, Ray, Temporal, Kubernetes). Multi-modal CustomLLMJudge.
traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, C#. 14 span kinds including TOOL, AGENT, RETRIEVER, GUARDRAIL, A2A_CLIENT, A2A_SERVER. Pluggable semantic conventions at register() time. LangGraphInstrumentor exposes graph topology. EvalTag wires rubric to span at zero inference latency.
Future AGI Platform. Self-improving evaluators tuned by feedback, in-product agent-authored custom rubrics, classifier-backed evals at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the stack as the clustering and what-to-fix layer.
agent-opt. Six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer Optuna-backed, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) consume Evaluator scores as the objective. Shared EarlyStoppingConfig. Eval-driven optimization ships today; direct trace-stream ingestion is on the active roadmap.
Ready to evaluate your first agent? Wire function_name_match, parameter_validation, Groundedness against the tool result, TaskCompletion on AgentTrajectoryInput, and the 4-D TrajectoryScore into a pytest fixture this afternoon against the ai-evaluation SDK, then attach the same templates as EvalTag scorers via traceAI when production traces start asking questions the CI gate missed.
Related reading
Evaluate Pydantic AI agents that call MCP tools in 2026: per-typed-output rubrics, tool-call argument fidelity, MCP security checks, dependency invariants.
Evaluating AutoGen agents in 2026: the handoff is the eval unit. Three failure modes, three rubrics, per-pair spans, and the production loop.
Evaluating Claude Code tool use in 2026: per-tool selection F1, argument fidelity, irreversibility awareness, recovery on error, on traceAI traces.