
The Definitive Guide to AI Agent Evaluation (2026)

The 2026 working pattern for AI agent evaluation. Six dimensions, six rubrics, a 4-D trajectory score, the CI gate that beats aggregate scoring, and the loop production needs.


An agent is not a model. Evaluating one as if it were is the most common reason production agents fail. The unit is the trajectory: tool selection, argument extraction, result utilization, error recovery, plan coherence, task completion. Six dimensions, six rubrics, scored independently. Test fewer than four and you are shipping a probability, not a system. This guide is the working pattern from agent deployments that have shipped and stayed shipped: six dimensions, the rubrics that catch them, the public benchmarks that anchor the floor, the private eval set that gates production, the 4-D trajectory score that runs in CI and on live spans, and the loop that turns production failures into the next regression test.

TL;DR: the six dimensions of agent evaluation

| Dimension | What it measures | Failure mode if missing |
| --- | --- | --- |
| Tool selection | Did the agent pick the right tool, or correctly call none | Wrong tool, fabricated tool, no tool when one was needed |
| Argument extraction | Schema-valid and semantically correct arguments | Right tool, wrong date format; right tool, missing required field |
| Result utilization | Did the agent use the tool payload or substitute model knowledge | Number flipped, entity swapped, payload ignored |
| Error recovery | Did the agent retry, fall back, or escalate on tool failure | Crash, hallucinate success, retry the same broken input |
| Plan coherence | Loop-free, dead-end-free, right depth | Sub-tree explosion, premature finalisation, infinite loops |
| Task completion | Did the trajectory deliver the user goal end-to-end | Per-step green, end-to-end failure |

Aggregate task-completion alone hides which dimension regressed. Per-dimension scoring tells you what to fix this afternoon.

Why agent eval is not LLM eval with extra steps

An LLM eval is a function from (input, output) to a score. An agent eval is a function from trajectory to a score, where a trajectory is the full ordered sequence of system prompt, user input, agent reasoning, tool calls (name plus arguments plus return value), retrieval results, intermediate LLM calls, final response, and outcome metadata.

You cannot score this from the response alone. A response that looks right can come from the wrong tool with the wrong arguments by luck. A response that looks wrong can come from a correct trajectory the rubric did not anticipate. The trace is the truth.

The math agrees. End-to-end success on a k-step agent is roughly the product of per-step success rates. A 95-percent per-step agent over eight steps lands near 66 percent. A 99-percent per-step agent over eight steps lands near 92 percent. One third of sessions ending structurally wrong while every individual step scores green is the default math of compound error, and it is why teams ship agents that pass per-turn eval and tank production. The per-step rubric is the gate; the trajectory metric is the truth.
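The compounding is easy to reproduce. A minimal sketch, assuming every step succeeds independently with the same probability (a simplification; the function name is illustrative):

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


for p in (0.95, 0.99):
    # Prints 0.66 for 0.95 and 0.92 for 0.99.
    print(f"{p:.2f} per step over 8 steps -> {end_to_end_success(p, 8):.2f}")
```

Real steps are rarely independent, so treat the product as a rough estimate rather than a prediction.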

Dimension 1: tool selection

Pull the model’s chosen tool name, compare to the gold label, aggregate as F1 per tool so a registry of 28 tools does not hide a regression on one rare endpoint behind a strong global mean. Three failure modes show up:

  • Wrong tool. Calls web_search when the answer was in knowledge_base.
  • No tool. Answers from parametric memory when the spec required grounded retrieval.
  • Fabricated tool. Invokes a tool that does not exist (rare in production, common in prototype).

The piece most posts drop is the irrelevance bucket: cases where the gold answer is no tool call. Greeting, clarification request, in-model factual question, refusal-worthy ask. Without those cases, you cannot detect the regression where a new prompt revision makes the model bolder about calling search on every input. BFCL added the bucket for exactly this reason; build it into your private set the same way. The ai-evaluation SDK ships LLMFunctionCalling (cloud, alias EvaluateFunctionCalling) for the rubric case plus deterministic function_name_match, parameter_validation, function_call_accuracy, and function_call_exact_match (sub-millisecond, local).
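Per-tool F1 with an irrelevance bucket can be sketched in plain Python. This is an illustration, not the SDK's implementation: the `NO_TOOL` label and the function name are hypothetical, and `None` stands for "no tool call" in both gold and predicted lists.

```python
from collections import Counter

NO_TOOL = "<none>"  # hypothetical label for the irrelevance bucket


def per_tool_f1(gold, predicted):
    """Return {tool: F1}, treating 'no tool call' as its own class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        g, p = g or NO_TOOL, p or NO_TOOL
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the tool that was wrongly chosen
            fn[g] += 1  # the tool that should have been chosen
    scores = {}
    for tool in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[tool] + fp[tool] + fn[tool]
        scores[tool] = 2 * tp[tool] / denom if denom else 0.0
    return scores


gold = ["knowledge_base", "web_search", None, None]
pred = ["web_search", "web_search", "web_search", None]
print(per_tool_f1(gold, pred))
```

In this toy run the over-calling of `web_search` drags down both its own F1 (false positives) and the `<none>` bucket (false negatives), which is the regression a set without irrelevance cases cannot show. A fabricated tool name would appear as its own class with an F1 of zero.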

Dimension 2: argument extraction

Right tool with wrong arguments is the most common production agent bug. The agent decides to call create_calendar_event, then passes the date in the wrong format, omits the timezone, or hallucinates an attendee. Argument failures fall into three buckets:

  • Schema mismatch. Wrong type, missing required field. Pydantic and JSON Schema catch this deterministically.
  • Semantic mismatch. Right schema, wrong value. departure_date="2026-01-01" validates and is wrong if the user said “next Friday.” LLM-judge with a few-shot rubric handles this.
  • Edge-case handling. Null on optional fields, empty array, unicode in identifiers, timezone on date fields, currency on monetary fields. These are the failures BFCL cannot see because they are private to your tool registry.

Run schema validation first, gate CI on it, then send the semantically suspect cases to a CustomLLMJudge that scores whether the argument captures the user intent. Building a regression suite of edge cases per tool (null on optional, empty array, special characters, type coercion, the timezone case on every date field) is what separates an eval set that merely matches BFCL from one that survives production.
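The two-stage order can be illustrated with the standard library alone. A sketch under stated assumptions: the `REQUIRED` schema is hypothetical, "next Friday" is read as the first Friday strictly after the request date, and a deterministic date check stands in for the LLM judge on this one narrow case.

```python
from datetime import date, timedelta

# Hypothetical required fields for a flight-search style tool.
REQUIRED = {"destination": str, "departure_date": str}


def schema_errors(args: dict) -> list:
    """Stage 1: deterministic checks (presence, type, ISO date format)."""
    errors = []
    for field, expected in REQUIRED.items():
        if field not in args:
            errors.append(f"missing required field: {field}")
        elif not isinstance(args[field], expected):
            errors.append(f"wrong type for {field}")
    if isinstance(args.get("departure_date"), str):
        try:
            date.fromisoformat(args["departure_date"])
        except ValueError:
            errors.append("departure_date is not YYYY-MM-DD")
    return errors


def next_friday(today: date) -> date:
    """First Friday strictly after `today` (one reading of 'next Friday')."""
    days_ahead = (4 - today.weekday()) % 7 or 7  # Monday is 0, Friday is 4
    return today + timedelta(days=days_ahead)


def matches_next_friday(args: dict, today: date) -> bool:
    """Stage 2: semantic check, run only on schema-valid arguments."""
    return date.fromisoformat(args["departure_date"]) == next_friday(today)


args = {"destination": "LIS", "departure_date": "2026-01-01"}
today = date(2025, 12, 29)  # a Monday
print(schema_errors(args))               # [] -> schema-valid
print(matches_next_friday(args, today))  # False -> expected 2026-01-02
```

The same arguments pass the first stage and fail the second, which is why schema validity alone cannot gate the argument layer.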

Dimension 3: result utilization

The tool returned. The agent has the payload. Three failure patterns surface most often, and almost every public post on agent eval skips this layer:

The agent paraphrases the payload with a number flipped. Tool returns {"refund_status": "pending", "amount_cents": 4500}; agent says “your refund of $54.00 is processing.” Schema-correct call, clean response, wrong amount: 4500 cents is $45.00, not $54.00.

The agent substitutes prior model knowledge. get_account_balance returns {"balance_cents": 12400}. The model “knows” the user has a standard $200 minimum and replies “your balance is above the $200 threshold.” The payload says $124.00; the tool result was never read.

The agent uses the result on turn 1 and drifts off it by turn 3. The flight-booking agent quotes the right itinerary on turn 1, then invents a baggage policy on turn 3 that contradicts the airline_policy tool result from two turns ago.

The rubric is Groundedness, with the context slot pointed at the tool return payload rather than the retrieved corpus. ContextAdherence and ChunkAttribution work the same way: chunk the tool result into JSON fields, score whether each claim in the response maps to one. The Platform’s classifier-backed cascade runs Groundedness at lower per-eval cost than Galileo Luna-2.
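For numeric claims, a cheap deterministic pre-check can run before any judge. The sketch below is an assumption-heavy illustration rather than the Groundedness rubric itself: it only handles flat payloads, only fields whose names end in `_cents`, and only dollar amounts written with a `$` sign. All function names are hypothetical.

```python
import re


def payload_amounts_cents(payload: dict) -> set:
    """Collect integer values from fields whose name ends in '_cents'."""
    return {v for k, v in payload.items()
            if k.endswith("_cents") and isinstance(v, int)}


def response_amounts_cents(text: str) -> set:
    """Parse amounts such as $45, $45.00 or $1,234.50 into cents."""
    amounts = set()
    for whole, frac in re.findall(r"\$(\d[\d,]*)(?:\.(\d{2}))?", text):
        amounts.add(int(whole.replace(",", "")) * 100 + int(frac or 0))
    return amounts


def ungrounded_amounts(response: str, payload: dict) -> set:
    """Amounts stated in the response that appear nowhere in the payload."""
    return response_amounts_cents(response) - payload_amounts_cents(payload)


payload = {"refund_status": "pending", "amount_cents": 4500}
print(ungrounded_amounts("your refund of $54.00 is processing", payload))  # {5400}
print(ungrounded_amounts("your refund of $45.00 is processing", payload))  # set()
```

A non-empty result is a strong signal worth routing to the rubric; an empty result proves little, since entity swaps and ignored fields need the claim-level scoring described above.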

Dimension 4: error recovery

Real tools fail. APIs time out, return 429s, return malformed JSON, return empty results. The agent’s behavior in these cases is a separate eval axis from happy-path behavior. The patterns to grade: did the agent read the error body and route to a corrected retry, a fallback tool, a clarification question, or a graceful escalation; did it retry with corrected arguments on a 400 or send the same broken string again; did it stop at a sensible retry cap (3 is a common ceiling; 6 usually means the loop guard is missing); did it communicate the failure clearly instead of fabricating success.
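Two of these patterns, the identical retry after a client error and the missing retry cap, can be checked deterministically on a recorded trajectory. A minimal sketch: the `ToolCall` shape, the treatment of 408 and 429 as transient, and counting failed attempts per tool against the cap are all assumptions made for illustration.

```python
from dataclasses import dataclass

TRANSIENT = {408, 429}  # assumed safe to retry with unchanged arguments


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: tuple   # hashable form, e.g. tuple(sorted(arg_dict.items()))
    status: int   # HTTP-style status code returned by the tool


def recovery_findings(calls, retry_cap=3):
    """Flag identical retries after 4xx errors and retry-cap violations."""
    findings, failures, prev = [], {}, None
    for call in calls:
        same_input = prev is not None and (call.tool, call.args) == (prev.tool, prev.args)
        if same_input and 400 <= prev.status < 500 and prev.status not in TRANSIENT:
            findings.append(f"{call.tool}: identical retry after {prev.status}")
        if call.status >= 400:
            failures[call.tool] = failures.get(call.tool, 0) + 1
        prev = call
    for tool, count in failures.items():
        if count > retry_cap:
            findings.append(f"{tool}: {count} failed attempts exceeds cap of {retry_cap}")
    return findings


bad = (("date", "01/02"),)
good = (("date", "2026-01-02"),)
trace = [ToolCall("create_event", bad, 400),
         ToolCall("create_event", bad, 400),
         ToolCall("create_event", good, 200)]
print(recovery_findings(trace))  # ['create_event: identical retry after 400']
```

Whether the agent communicated the failure honestly is not something this check can see; that part stays with the judge.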

This is a trajectory-level concern. Build a stratified test set by replaying production traces with synthetic tool failures injected: one bucket per tool, one row per error code the endpoint returns (400, 401, 403, 404, 408, 429, 5xx), plus empty-result and partial-result rows. Gate CI on per-bucket recovery rates. ActionSafety and TrajectoryScore from the agent-trajectory suite cover the deterministic side; a CustomLLMJudge wrapped around the trajectory handles the qualitative side.
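The stratification described here is a small cross product. A sketch assuming a hypothetical row format (`tool`, `injected_failure`, and later a `recovered` flag filled in by whatever replays the trace); the error codes are the ones listed above.

```python
from itertools import product

ERROR_CODES = ["400", "401", "403", "404", "408", "429", "5xx"]
RESULT_FAILURES = ["empty_result", "partial_result"]


def failure_matrix(tools):
    """One row per (tool, injected failure) pair."""
    return [{"tool": tool, "injected_failure": failure}
            for tool, failure in product(tools, ERROR_CODES + RESULT_FAILURES)]


def recovery_rate_by_tool(rows):
    """Per-bucket recovery rate from rows carrying a boolean 'recovered'."""
    totals, recovered = {}, {}
    for row in rows:
        totals[row["tool"]] = totals.get(row["tool"], 0) + 1
        recovered[row["tool"]] = recovered.get(row["tool"], 0) + bool(row["recovered"])
    return {tool: recovered[tool] / totals[tool] for tool in totals}


matrix = failure_matrix(["get_account_balance", "create_calendar_event"])
print(len(matrix))  # 18 rows: 2 tools x 9 injected failures
```

Restricting each tool's rows to the codes its endpoint can actually return keeps the suite honest; the full cross product is only a starting point.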

Dimension 5: plan coherence

For agents that take multiple steps before finalising, the shape of the trajectory matters. Three patterns to score:

  • No loops. The agent does not re-call the same tool with the same arguments more than once.
  • No dead-ends. Every branch eventually returns to the main goal or terminates with a clean refusal.
  • Right depth. A two-step task takes roughly two steps. A ten-step task takes roughly ten. Sub-tree explosion is a regression.

StepEfficiency, TrajectoryScore, and GoalProgress from the SDK’s agent-trajectory suite score these directly on AgentTrajectoryInput. For richer plan critique, a CustomLLMJudge with a rubric like “score 1.0 if the trajectory is the shortest correct path; 0.5 if it is correct but inefficient; 0.0 if it loops or dead-ends” works well. Treat any agent longer than five steps as suspect; force the planner to decompose into shorter sub-agents. Long flat trajectories are where compound-error pain lives.
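The loop and depth checks also have a deterministic approximation that can run before the judge. This sketch is not the SDK's StepEfficiency or TrajectoryScore; the step format, the `max_calls` threshold, and the use of a `reached_goal` flag as a stand-in for dead-end detection are assumptions.

```python
import json
from collections import Counter


def repeated_calls(steps, max_calls=1):
    """(tool, args) pairs issued more than `max_calls` times."""
    keys = [(s["tool"], json.dumps(s["args"], sort_keys=True)) for s in steps]
    return [key for key, n in Counter(keys).items() if n > max_calls]


def plan_score(steps, expected_steps, reached_goal, max_calls=1):
    """Rough proxy for the rubric: 1.0 shortest, 0.5 inefficient, 0.0 broken."""
    if not reached_goal or repeated_calls(steps, max_calls):
        return 0.0
    return 1.0 if len(steps) <= expected_steps else 0.5


looping = [{"tool": "search", "args": {"q": "x"}},
           {"tool": "search", "args": {"q": "x"}},
           {"tool": "book", "args": {"id": 1}}]
print(repeated_calls(looping))       # [('search', '{"q": "x"}')]
print(plan_score(looping, 2, True))  # 0.0
```

Serialising arguments with sorted keys makes two calls with the same arguments in a different key order count as the same call.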

Dimension 6: task completion

End-to-end success on the user goal, scored on the full trajectory rather than the final turn. TaskCompletion (cloud eval_id=99) handles the rubric case across trajectory plus expected goal. For multi-turn conversations, layer in ConversationCoherence and ConversationResolution so per-turn rubrics that look fine in isolation cannot hide a session that talked itself in circles. For customer-support agents, the SDK ships 11 CustomerAgent* templates (ClarificationSeeking, ContextRetention, ConversationQuality, HumanEscalation, InterruptionHandling, LanguageHandling, LoopDetection, ObjectionHandling, PromptConformance, QueryHandling, TerminationHandling) for the named failure modes in that vertical.

Reserve a consistency slice. Pick 30 hard cases and run them k times each; the fraction that succeed on all k is your pass^k in τ-bench’s sense. When pass^8 moves, the planner regressed, not the tools.
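The consistency slice reduces to a few lines. A sketch of the definition given here (a case counts only if all k runs succeed); the input shape, a mapping from case id to a list of per-run booleans, is an assumption.

```python
def pass_hat_k(results):
    """Fraction of cases whose k runs all succeeded.

    `results` maps case id -> list of booleans, one per run.
    """
    if not results:
        raise ValueError("results is empty")
    if len({len(runs) for runs in results.values()}) != 1:
        raise ValueError("every case needs the same number of runs")
    return sum(all(runs) for runs in results.values()) / len(results)


runs = {
    "refund-edge-01": [True, True, True],
    "refund-edge-02": [True, False, True],   # one flaky run fails the case
    "booking-03": [True, True, True],
    "booking-04": [False, False, False],
}
print(pass_hat_k(runs))  # 0.5
```

Because one failed run disqualifies a case, the metric can only fall or hold as k grows, which is what makes it a consistency measure rather than a capability measure.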

Public benchmarks: the floor, not the ceiling

Three public benchmarks anchor the floor in 2026. Use them; do not gate production on them.

BFCL (Berkeley Function Calling Leaderboard) breaks tool calling into an AST track (syntactic correctness), an executable track (the call actually runs on a real endpoint), and an irrelevance-detection bucket. A model that aces AST and tanks irrelevance overcalls on your registry; a model that aces AST and tanks executable generates plausible but non-running calls.

τ-bench evaluates multi-turn agents in airline and retail with an LLM-simulated user, a domain policy, and tool access. The headline metric is pass^k across k independent rollouts. Even strong models land below 25 percent at pass^8 on retail; multi-turn tool-using agents are nondeterminism amplifiers, and the consistency metric is the cleanest exposure of that fact.

ToolBench tests across thousands of real APIs with a focus on instruction-following and tool composition.

Public benchmarks tell you whether the underlying model can call tools at all. They tell you nothing about your registry, argument schemas, error codes, or business policy. The private eval set is the one that gates production. Build it stratified by tool, argument-edge-case bucket, and error code; promote failing production traces into it weekly.

The 4-D trajectory score

Per-rubric scoring across the six dimensions tells you what regressed. The 4-D trajectory score tells you the shape of the regression on every trace, with the same vocabulary in CI and production. Four axes, scored 1 to 5 by the same judge:

  • Factual grounding. Did the agent stay anchored in retrieved or tool context, or confabulate. Catches result-utilization failures and retrieval drift.
  • Privacy and safety. Did the agent leak PII, cross a tenant boundary, comply with a jailbreak. Catches refusal regressions and permission failures.
  • Instruction adherence. Did the agent obey the system prompt and refuse what should have been refused. Catches prompt drift directly.
  • Optimal plan execution. Did the agent pick the right tool, in the right order, without redundant calls, retries, or unreachable branches. Catches tool-selection and plan-coherence regressions on the call graph.

Four axes, four kinds of regression, one composite. When the composite drops, the axes are the diagnosis. The same judge runs against the offline dataset in CI and against live spans in production. Same vocabulary in both places, same calibration set, same threshold.
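One way to carry the four axes and a composite through code is a small value type. The text does not say how the composite is computed, so the unweighted mean below is an assumption, as are the class and field names.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TraceScore:
    factual_grounding: int
    privacy_safety: int
    instruction_adherence: int
    plan_execution: int

    def __post_init__(self):
        # Each axis is scored on the 1-to-5 scale.
        for name, value in asdict(self).items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def composite(self) -> float:
        """Unweighted mean of the four axes (an assumed aggregation)."""
        values = list(asdict(self).values())
        return sum(values) / len(values)

    def weakest_axis(self) -> str:
        """The axis to look at first when the composite drops."""
        scores = asdict(self)
        return min(scores, key=scores.get)


score = TraceScore(factual_grounding=5, privacy_safety=5,
                   instruction_adherence=4, plan_execution=2)
print(score.composite, score.weakest_axis())  # 4.0 plan_execution
```

A mean can mask a single failing axis the same way any aggregate does, so alerting on the minimum axis as well as the composite is a reasonable variation.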

The CI gate: per-dimension thresholds, not an aggregate

The bug is treating one aggregate agent_score as a ship gate. An aggregate 0.85 hides a 0.62 on argument extraction behind a 0.97 on tool selection, and the production failure rides on the argument layer. Wire six assertions in the CI fixture, one per dimension, with thresholds calibrated against historical pass rates:

# config.yaml for `fi run`
assertions:
  - "tool_selection_f1.score >= 0.95 for at_least 95% of cases"
  - "argument_validation.score >= 0.90 for at_least 90% of cases"
  - "argument_semantics.score >= 0.85 for at_least 85% of cases"
  - "result_groundedness.score >= 0.90 for at_least 90% of cases"
  - "recovery_score.score >= 0.80 for at_least 85% of cases"
  - "task_completion.score >= 0.85 for at_least 90% of cases"

When the gate fails, the failing assertion name is the root cause. One bisect instead of three days. Distributed runners (Celery, Ray, Temporal, Kubernetes) handle the case where six rubrics across a 200-case suite outgrow a single-process budget.
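The assertion wording can be mirrored in plain Python to see why the failing name doubles as the diagnosis. This is an illustration of one reading of "score >= t for at_least p% of cases", not the fi CLI's implementation; the `GATES` table copies two thresholds from the config above and the function names are hypothetical.

```python
def assertion_passes(scores, threshold, min_fraction):
    """True if at least `min_fraction` of cases score >= `threshold`."""
    if not scores:
        return False  # no evidence is treated as a failed gate
    hits = sum(score >= threshold for score in scores)
    return hits / len(scores) >= min_fraction


# dimension -> (score threshold, required fraction of cases)
GATES = {
    "tool_selection_f1": (0.95, 0.95),
    "argument_validation": (0.90, 0.90),
}


def failing_dimensions(per_case_scores, gates=GATES):
    """Names of the gates that fail, in config order."""
    return [name for name, (threshold, fraction) in gates.items()
            if not assertion_passes(per_case_scores.get(name, []), threshold, fraction)]


scores = {
    "tool_selection_f1": [0.97] * 10,
    "argument_validation": [0.95] * 8 + [0.50] * 2,  # 80% of cases clear 0.90
}
print(failing_dimensions(scores))  # ['argument_validation']
```

Averaged together, these two dimensions would still look healthy; evaluated separately, the second one blocks the merge and names the layer to fix.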

Tracing is not optional

The trajectory is the unit of evaluation. The trace is the trajectory. Without spans, agent eval is response-only scoring with extra words.

traceAI (Apache 2.0) ships 14 span kinds (TOOL, CHAIN, LLM, RETRIEVER, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR, CONVERSATION, VECTOR_DB, A2A_CLIENT, A2A_SERVER, UNKNOWN). The A2A_CLIENT and A2A_SERVER kinds capture agent-to-agent relationships for multi-agent systems. 50+ AI surfaces across Python, TypeScript, Java, and C#. Pluggable semantic conventions at register() time (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) mean the spans flow into whatever OTel collector you already run. The LangGraphInstrumentor surfaces node_count and conditional-edge topology so multi-agent graphs are introspectable from the trace alone.

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai_agents import OpenAIAgentsInstrumentor

# Register a tracer provider for the project.
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="my-agent",
)
# Instrument the agent framework so its calls emit spans to that provider.
OpenAIAgentsInstrumentor().instrument(tracer_provider=trace_provider)

That is the whole tracing setup. The spans surface tool calls, retrievals, agent reasoning, and tool returns: the inputs every agent-eval rubric needs. Eval scores attach to spans via EvalTag; the collector runs evals server-side post-export at zero inline latency.

Production observability and Error Feed

Six dimensions in CI is necessary, not sufficient. The eval set is a snapshot; production is a river. Score the live trace stream with the same rubrics and you get a regression signal the offline set cannot have, because the offline set was frozen before users found the failure mode.

Error Feed is the loop closer inside the eval stack. Failing traces flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering groups them into named issues. Each cluster fires a JudgeAgent on Claude Sonnet 4.5 for a 30-turn investigation across 8 span-tools (read_span, get_children, get_spans_by_type, search_spans, plus a Haiku Chauffeur for spans over 3000 characters). Prompt-cache hit ratio sits around 90 percent.

Per cluster, the Judge emits three artifacts engineers actually read: a 5-category, 30-subtype taxonomy, the 4-D trace score above, and an immediate_fix naming the change to ship today (rubric edit, prompt patch, tool-call guard, retrieval-filter tweak). The fix feeds the Platform’s self-improving evaluators. The cluster becomes a candidate dataset entry; the on-call engineer promotes representative traces into the offline set. The next PR touching that path has to clear them.

Common agent eval mistakes

  • Response-only scoring. Misses every failure whose root cause is a bad tool call or a bad plan. The trajectory is the unit.
  • Aggregate task-completion alone. Hides which dimension regressed. Per-dimension scoring is the only diagnostic that works.
  • No irrelevance bucket. Tool selection only scored on cases where a tool was expected. The over-call regression is invisible.
  • Mocked tools, no error-recovery coverage. Happy-path eval at 0.95. Production 429 storm at 0.30.
  • Frozen test set. Promote failing traces into the offline set weekly or the set ages off the product.
  • Eval and trace in different tools. Attach scores to the OTel span; no engineer cross-references two dashboards under pressure.

How Future AGI ships the full agent eval stack

Future AGI ships the eval stack as a package, not a single product. Start with the SDK for code-defined per-dimension scoring. Graduate to the Platform when the loop needs self-improving rubrics, in-product authoring, and classifier-backed cost economics.

ai-evaluation SDK (Apache 2.0). 70+ EvalTemplate classes including LLMFunctionCalling, TaskCompletion, AnswerRefusal, ConversationCoherence, ConversationResolution, Groundedness, ContextAdherence, ChunkAttribution, and 11 CustomerAgent* templates for vertical-specific failure modes. Deterministic function-call metrics: function_name_match, parameter_validation, function_call_accuracy, function_call_exact_match (sub-millisecond). Seven AgentTrajectoryInput metrics: TaskCompletion, StepEfficiency, ToolSelectionAccuracy, TrajectoryScore, GoalProgress, ActionSafety, ReasoningQuality. 13 guardrail backends (9 open-weight). Four distributed runners (Celery, Ray, Temporal, Kubernetes). Multi-modal CustomLLMJudge.

traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, C#. 14 span kinds including TOOL, AGENT, RETRIEVER, GUARDRAIL, A2A_CLIENT, A2A_SERVER. Pluggable semantic conventions at register() time. LangGraphInstrumentor exposes graph topology. EvalTag wires rubric to span at zero inference latency.

Future AGI Platform. Self-improving evaluators tuned by feedback, in-product agent-authored custom rubrics, classifier-backed evals at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the stack as the clustering and what-to-fix layer.

agent-opt. Six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer Optuna-backed, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) consume Evaluator scores as the objective. Shared EarlyStoppingConfig. Eval-driven optimization ships today; direct trace-stream ingestion is on the active roadmap.

Ready to evaluate your first agent? Wire function_name_match, parameter_validation, Groundedness against the tool result, TaskCompletion on AgentTrajectoryInput, and the 4-D TrajectoryScore into a pytest fixture this afternoon against the ai-evaluation SDK, then attach the same templates as EvalTag scorers via traceAI when production traces start asking questions the CI gate missed.

Frequently asked questions

Why is agent evaluation different from LLM evaluation?
An LLM eval scores a single input-output pair. An agent eval scores a trajectory: the ordered sequence of plans, tool selections, tool arguments, tool returns, intermediate reasoning, error recoveries, and the final response. The unit of evaluation is the trace, not the response. A trajectory that produces a correct final answer can still be structurally broken (wrong tool, lucky arguments, no recovery), and a trajectory the rubric did not anticipate can still be correct. Score the trajectory or you are grading luck. The Future AGI ai-evaluation SDK ships AgentTrajectoryInput with seven trajectory metrics (TaskCompletion, StepEfficiency, ToolSelectionAccuracy, TrajectoryScore, GoalProgress, ActionSafety, ReasoningQuality) for exactly this reason.
What are the six dimensions of agent evaluation?
Six dimensions that map cleanly onto trace shape. Tool selection: did the agent pick the right tool from the registry, or correctly call none. Argument extraction: are the arguments schema-valid and semantically correct. Result utilization: did the agent use the tool payload or substitute model knowledge. Error recovery: did the agent retry, fall back, or escalate on tool failure. Plan coherence: is the multi-step shape efficient, loop-free, and within reasonable depth. Task completion: did the trajectory actually deliver the user goal end-to-end. Score the six separately. Aggregate task-completion hides which one regressed; per-dimension scoring tells you what to fix this afternoon.
How do public benchmarks like τ-bench and BFCL fit into agent eval?
Public benchmarks anchor the floor; your private eval set gates production. BFCL (Berkeley Function Calling Leaderboard) breaks tool-calling into AST correctness, executable correctness, and an irrelevance bucket on a public registry. τ-bench runs multi-turn agents in airline and retail with an LLM-simulated user and reports pass^k across k independent rollouts, exposing the nondeterminism cost of multi-step trajectories (GPT-4o lands below 25 percent at pass^8 on retail). ToolBench tests across thousands of real APIs. Use them as model-selection signals. They tell you nothing about your tool registry, argument schemas, error codes, or business policy, which are the layers that actually break in production.
What is a 4-dimensional trajectory score and why does it beat one aggregate score?
Four axes scored 1 to 5 by the same judge on every trace. Factual grounding (did the agent stay anchored in retrieved or tool context, or confabulate). Privacy and safety (did it leak PII, cross a tenant boundary, follow a jailbreak). Instruction adherence (did it obey the system prompt and refuse what it should have refused). Optimal plan execution (did it pick the right tool, in the right order, without redundant calls or loops). The composite is the trace score; the four axes are the diagnosis when the composite drops. The same judge runs against the offline dataset in CI and against live spans in production, so the vocabulary is identical across both surfaces.
Does agent eval require tracing?
Effectively yes. The trace is the trajectory. Without spans for each tool call, retrieval, model call, and handoff, you cannot score the intermediate steps that separate correct trajectories from lucky ones. traceAI ships 14 span kinds (AGENT, TOOL, RETRIEVER, LLM, CHAIN, RERANKER, EMBEDDING, GUARDRAIL, EVALUATOR, VECTOR_DB, CONVERSATION, A2A_CLIENT, A2A_SERVER, UNKNOWN) across 50+ AI surfaces in Python, TypeScript, Java, and C#. Pluggable semantic conventions at register() time (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) mean the spans flow into whatever OTel collector you already run. The LangGraphInstrumentor surfaces node_count and conditional-edge topology; the @tracer.tool decorator auto-infers schema from type hints.
Should the CI gate use one aggregate threshold or per-dimension thresholds?
Per-dimension, always. An aggregate 0.85 hides a 0.62 on argument extraction behind a 0.97 on tool selection, and the production failure rides on the argument layer. Wire six assertions in the CI fixture, one per dimension, with thresholds calibrated against historical pass rates. When the gate fails, the failing dimension is the root cause. The Future AGI fi CLI ships per-eval assertions natively. Distributed runners (Celery, Ray, Temporal, Kubernetes) handle the case where six rubrics across a 200-case suite outgrow a single-process budget.
How does Future AGI ship the full agent eval stack?
The eval stack ships as a package, not a single product. ai-evaluation SDK (Apache 2.0) is the code-first surface: 70+ EvalTemplate classes including LLMFunctionCalling, TaskCompletion, AnswerRefusal, ConversationCoherence, ConversationResolution, and 11 CustomerAgent templates; seven AgentTrajectoryInput metrics; 13 guardrail backends; four distributed runners. traceAI (Apache 2.0) carries the same rubrics as span-attached scores on live traces. The Future AGI Platform adds self-improving evaluators tuned by feedback, in-product agent-authored custom rubrics, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the stack: HDBSCAN clusters failing trajectories, a Sonnet 4.5 Judge writes the 5-category 30-subtype taxonomy, the 4-D trace score, and an immediate_fix. agent-opt closes the optimization loop with six optimizers consuming Evaluator scores as the objective.