Evaluating Tool-Calling Agents in 2026: The Four-Layer Eval Stack
Tool-calling eval is four eval problems stacked: tool selection, argument extraction, result utilization, and error recovery. Most posts grade the first one and call it done.
An agent fails to book a flight. The trace shows the model called search_flights with departure_date="next Friday". The endpoint returned 400; it expected an ISO date. The agent retried four times with the same string, then apologized to the user. Tool selection was correct, the model picked the right function from a registry of 28, and tool-selection accuracy logs a 1.0. Neither that score nor an aggregate task-completion 0 tells you which of three things downstream broke: the argument was wrong, the model never read the 400 body, the retry policy looped on the same input.
If you only eval “did the agent call the right tool,” you’re testing intent, not execution. Tool selection is necessary, not sufficient. The opinion this post earns: tool-calling eval is four eval problems stacked, not one. Layer 1, tool selection. Layer 2, argument extraction. Layer 3, result utilization. Layer 4, error recovery. Per-layer scoring tells you what to fix this afternoon. This guide walks each layer, the rubric that catches it, the compound-error math on multi-step agents, where BFCL and τ-bench fit, and how the Future AGI eval stack wires it end-to-end.
TL;DR: the four-layer eval stack
| Layer | What you measure | Deterministic rubric | LLM-judge rubric |
|---|---|---|---|
| 1. Tool selection | Right tool (or correctly no tool) | function_name_match, F1 + irrelevance bucket | EvaluateFunctionCalling |
| 2. Argument extraction | Schema-valid + semantically correct | parameter_validation, function_call_exact_match | LLMFunctionCalling on argument semantics |
| 3. Result utilization | Did the agent use what the tool returned | function_call_accuracy on the call sequence | Groundedness + ContextAdherence with tool result as context |
| 4. Error recovery | Did the agent retry, fall back, or escalate | Retry-count, max-loops, error-tier guards | TaskCompletion + recovery rubric on AgentTrajectoryInput |
Non-negotiables: per-layer scoring rather than aggregate task_completion alone, an irrelevance bucket on the test set, schema validation as a deterministic gate before the LLM-judge runs, groundedness on the tool output as a first-class rubric, and a trajectory rubric so the compound-error problem stops hiding in per-turn averages.
Why tool-calling eval is four eval problems stacked
Four failure modes show up in postmortem, and they map cleanly onto four layers.
Selection. The agent picked the wrong tool, called a tool when the model knew the answer directly, or did not call one when it should have. F1 on the tool name plus an irrelevance bucket catches it; the irrelevance bucket is the piece most posts drop.
Argument. Schema right, types right, values wrong. departure_date="next Friday" schema-validates and fails the user. customer_id="me" returns someone else’s account. amount_cents=5000000 drains the refund budget. Schema validation catches the type class; the semantic class needs a rubric.
Result-utilization. The tool returned correctly; the agent ignored the payload, paraphrased with a number flipped, substituted prior model knowledge, or used the result on turn 1 and drifted off it by turn 3. Almost every public post on tool-call eval skips this layer.
Error-recovery. The tool 4xx-ed, the model did not read the error body, the retry sent the same broken arguments, the loop hit the max-step ceiling, or the agent fabricated a “successful” response to hide the failure. Per-call rubrics never see this; the trajectory metric does.
Score the four layers separately and the diagnostic vocabulary collapses from “the agent failed” to “the argument extractor regressed on date strings on the flight-booking path.” One bisect instead of three days.
Layer 1: tool selection (F1 on the tool name + the irrelevance bucket)
Pull the model’s chosen tool name, compare to the gold label, aggregate as F1 per tool so a registry of 28 tools does not hide a regression on one rare endpoint behind a strong global mean.
```python
from fi.evals import evaluate

# predicted_tool / ground_truth_tool: the tool names for one test case
result = evaluate(
    "function_name_match",
    output={"function_name": predicted_tool},
    expected={"function_name": ground_truth_tool},
)
# result.score = 1.0 exact match, 0.0 otherwise
```
The SDK ships four deterministic function-call metrics in fi.evals.metrics.function_calling: function_name_match (name only), parameter_validation (name plus argument shape), function_call_accuracy (the full call against the gold), and function_call_exact_match (strict equality including parallel-call ordering). All sub-millisecond per call.
The piece most posts drop is the irrelevance bucket. The test set has to include cases where the gold answer is “no tool call”: a greeting, a clarification request, an in-model factual question, a refusal-worthy ask. Without those cases, you cannot detect the regression where a new prompt revision makes the model bolder about calling search on every input. BFCL added the bucket for exactly this reason; build it into your private set the same way.
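To make the per-tool aggregation concrete, here is a minimal standard-library sketch of F1 per tool label with the irrelevance bucket treated as its own label. The `NO_TOOL` sentinel and the `per_tool_f1` helper are hypothetical names for illustration; they are not part of the SDK.

```python
from collections import Counter

NO_TOOL = "<no_tool>"  # hypothetical sentinel: the gold answer is "no tool call"

def per_tool_f1(gold, predicted):
    """Return {label: F1} for every label seen in gold or predicted."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label was wrong for this case
            fn[g] += 1  # the gold label was missed
    scores = {}
    for label in set(gold) | set(predicted):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

gold      = ["search_flights", "search_flights", NO_TOOL, NO_TOOL, "book_flight"]
predicted = ["search_flights", "search_flights", "search_flights", NO_TOOL, "book_flight"]
f1 = per_tool_f1(gold, predicted)
for label, score in sorted(f1.items()):
    print(f"{label}: {score:.2f}")
```

In this toy set the single overcall on a greeting-style case pulls the `NO_TOOL` score down to about 0.67 while `book_flight` stays at 1.0, which is the kind of regression a global mean would blur.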
For the rubric case (no single correct tool, or two reasonable tools that differ only on edge cases), the cloud EvaluateFunctionCalling template (alias LLMFunctionCalling) handles semantic correctness via the Evaluator API or CustomLLMJudge.
Layer 2: argument extraction (schema validation + semantic rubric)
Argument failures fall into three buckets: schema mismatch (wrong type, missing required field), semantic mismatch (right schema, wrong value), and edge-case handling (null, empty array, special characters, type coercion).
Schema validation runs first and is deterministic. Pydantic on the model’s output is the cheapest possible gate.
```python
from pydantic import BaseModel, Field, ValidationError

class SearchFlightsArgs(BaseModel):
    departure_airport: str = Field(pattern=r"^[A-Z]{3}$")
    arrival_airport: str = Field(pattern=r"^[A-Z]{3}$")
    departure_date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")
    cabin: str = Field(pattern=r"^(economy|premium|business|first)$")

# predicted_args: the argument dict the model emitted for this call
try:
    args = SearchFlightsArgs.model_validate(predicted_args)
except ValidationError as e:
    schema_errors = e.errors()  # emit as span attribute, gate CI
```
Once the schema check passes, the SDK’s parameter_validation metric matches argument shape and values against the gold call. Semantic correctness needs the LLM-judge: departure_date="2026-01-01" schema-validates and is wrong if the user said “next Friday.” A CustomLLMJudge scores whether the argument captures the user’s intent: 1.0 if it captures the correct dates, entities, identifiers, and units; 0.5 on a minor misinterpretation; 0.0 if the values are clearly wrong or cannot be derived from the user input.
Build a regression suite of edge cases per tool: null on optional fields, an empty array that the schema permits but that makes the tool return 500, unicode in identifiers, the time-zone case on every date field, the currency case on every monetary field. These are the failures BFCL cannot see because they are private to your tool registry.
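Some semantic cases can be pinned deterministically before the LLM-judge runs. The sketch below resolves the “next Friday” example against the date of the user turn and compares it with the predicted argument. The helper names are hypothetical, and reading “next Friday” as “the first Friday strictly after the conversation date” is an assumption; pick whichever convention your product uses and encode it in the gold label.

```python
from datetime import date, timedelta

def next_weekday(reference: date, weekday: int) -> date:
    """First date strictly after `reference` that falls on `weekday` (Mon=0 ... Sun=6)."""
    days_ahead = (weekday - reference.weekday() - 1) % 7 + 1
    return reference + timedelta(days=days_ahead)

def date_argument_matches(predicted: str, reference: date, weekday: int) -> bool:
    """True if the predicted ISO date equals the resolved gold date."""
    try:
        return date.fromisoformat(predicted) == next_weekday(reference, weekday)
    except ValueError:
        # The model passed through an unresolved string such as "next Friday".
        return False

FRIDAY = 4
conversation_date = date(2026, 1, 5)  # hypothetical timestamp of the user turn (a Monday)

for candidate in ("2026-01-09", "2026-01-01", "next Friday"):
    print(candidate, date_argument_matches(candidate, conversation_date, FRIDAY))
```

Only the first candidate passes; the second is schema-valid but semantically wrong, and the third would already have failed the schema gate.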
Layer 3: result utilization (the layer most posts skip)
The tool returned. The agent has the payload. Three failure patterns show up.
The agent paraphrases the payload with a number flipped. Tool returns {"refund_status": "pending", "amount_cents": 4500}, agent says “your refund of $54.00 is processing.” Schema-correct call, clean response, wrong amount: the payload says $45.00 and the digits came out transposed.
The agent substitutes prior model knowledge. get_account_balance returns {"balance_cents": 12_400}. The model “knows” the user has a standard $200 minimum and replies “your balance is above the $200 threshold.” The tool result was never read.
The agent uses the result on turn 1 and drifts off it by turn 3. The flight-booking agent quotes the right itinerary on turn 1, then invents a baggage policy on turn 3 that contradicts the airline_policy tool result from two turns ago.
The rubric is Groundedness, with the context slot pointed at the tool’s return payload rather than the retrieved corpus. ContextAdherence and ChunkAttribution work the same way: chunk the tool result into JSON fields, score whether each claim in the response maps to one.
```python
import json

from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextAdherence, ChunkAttribution
from fi.testcases import TestCase

evaluator = Evaluator()

# result: the agent run under test; ex: the eval example that produced it
for tool_call in result.tool_calls:
    # context = the actual tool payload, not the retrieved corpus
    tc = TestCase(
        input=ex.user_message,
        output=result.response,
        context=json.dumps(tool_call.result),
    )
    scores = evaluator.evaluate(
        eval_templates=[Groundedness(), ContextAdherence(), ChunkAttribution()],
        inputs=tc,
    )
```
Score this layer on every multi-turn agent where the tool feeds the response. The Platform’s classifier-backed cascade runs Groundedness at lower per-eval cost than Galileo Luna-2.
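A cheap deterministic pre-check can sit in front of the Groundedness rubric for the flipped-number pattern. The sketch below is an illustration, not an SDK metric: `ungrounded_dollar_amounts` is a hypothetical helper, and it assumes a flat payload whose monetary fields end in `_cents`, as in the examples above.

```python
import json
import re

def ungrounded_dollar_amounts(response: str, tool_result: dict) -> list:
    """Return dollar amounts quoted in the response that match no *_cents field."""
    grounded_cents = {
        value for key, value in tool_result.items()
        if key.endswith("_cents") and isinstance(value, int)
    }
    flagged = []
    for match in re.finditer(r"\$(\d+(?:,\d{3})*)(?:\.(\d{2}))?", response):
        dollars = int(match.group(1).replace(",", ""))
        cents = int(match.group(2) or 0)
        if dollars * 100 + cents not in grounded_cents:
            flagged.append(match.group(0))
    return flagged

payload = json.loads('{"refund_status": "pending", "amount_cents": 4500}')
print(ungrounded_dollar_amounts("your refund of $54.00 is processing.", payload))
print(ungrounded_dollar_amounts("your refund of $45.00 is processing.", payload))
```

The first response is flagged and the second is not. The same check flags the “$200 threshold” reply against a payload of {"balance_cents": 12400}, since no field in the tool result supports that figure. Nested payloads and non-dollar currencies would need more than this.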
Layer 4: error recovery (the trajectory rubric)
When the tool 4xx-es, times out, or returns an empty or partial result, the agent’s next move is the eval surface. The patterns to grade: did the agent read the error body and route to a corrected retry, a fallback tool, a clarification question, or a graceful escalation; did it retry with corrected arguments on a 400 or send the same broken string again; did it fall back to an alternative tool when the primary was down; did it stop at a sensible cap on retries (3 is a common floor; 6 usually means the loop guard is missing); did it communicate the failure clearly instead of fabricating success.
This is a trajectory-level concern, not per-call. The SDK exposes AgentTrajectoryInput with seven trajectory metrics in fi.evals.metrics.agents: TaskCompletion, StepEfficiency, ToolSelectionAccuracy, TrajectoryScore, GoalProgress, ActionSafety, ReasoningQuality. Each takes the full step list, the available tools, and the expected goal.
```python
from fi.evals.metrics.agents import TrajectoryScore, AgentTrajectoryInput
from fi.evals.metrics.agents.types import AgentStep, TaskDefinition

# agent_steps, registered_tools, expected_goal, user_request and agent_response
# come from the run under test
trajectory = AgentTrajectoryInput(
    trajectory=[
        AgentStep(action=s.action, tool_used=s.tool,
                  tool_args=s.args, tool_result=s.result,
                  error=s.error)
        for s in agent_steps
    ],
    task=TaskDefinition(goal=expected_goal, description=user_request),
    available_tools=[t.name for t in registered_tools],
    final_result=agent_response,
)
score = TrajectoryScore().compute_one(trajectory)
```
For a recovery-specific rubric, wrap a CustomLLMJudge against the trajectory: 1.0 if the agent read the error and corrected, 0.5 on partial recovery, 0.0 if it looped on the same broken input or hid the failure.
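The deterministic side of this layer (retry count and loop guards) can be sketched without a judge. The snippet assumes each step is a plain dict with `tool`, `args`, and `error` keys; `recovery_flags` and the cap of three retries are illustrative choices, not SDK names or defaults.

```python
import json

def recovery_flags(steps, max_retries=3):
    """Flag identical retries after an error, and error runs longer than the cap.

    Each step is a dict with "tool", "args", and "error" (None on success).
    """
    identical_retries = 0
    longest_error_run = 0
    error_run = 0
    previous_failed_call = None
    for step in steps:
        call = (step["tool"], json.dumps(step["args"], sort_keys=True))
        if previous_failed_call is not None and call == previous_failed_call:
            identical_retries += 1  # same tool, same arguments, right after a failure
        if step["error"] is not None:
            error_run += 1
            previous_failed_call = call
        else:
            error_run = 0
            previous_failed_call = None
        longest_error_run = max(longest_error_run, error_run)
    return {
        "identical_retries": identical_retries,
        # one initial attempt plus max_retries retries is the longest allowed run
        "exceeded_retry_cap": longest_error_run > max_retries + 1,
    }

bad_call = {"tool": "search_flights",
            "args": {"departure_date": "next Friday"}, "error": "400"}
looping = [bad_call] * 5  # one call plus four identical retries
recovered = [bad_call,
             {"tool": "search_flights",
              "args": {"departure_date": "2026-01-09"}, "error": None}]
print(recovery_flags(looping))
print(recovery_flags(recovered))
```

The looping trace from the opening example reports four identical retries and a breached cap; the recovered trace reports neither. Whether the corrected retry was actually the right correction is still a question for the judge rubric.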
Build a stratified test set: one bucket per tool, one row per error code the endpoint actually returns (400, 401, 403, 404, 408, 429, 5xx), plus empty-result and partial-result rows. Gate CI on per-bucket recovery rates. A regression on 429 recovery for a single tool is one of the cheapest failures to ship and the hardest to debug after the fact.
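A per-bucket gate might look like the following sketch. The row shape `(tool, error_code, recovered)` and the 0.8 threshold are assumptions chosen for illustration; in practice the `recovered` flag would come from the recovery rubric.

```python
from collections import defaultdict

def recovery_rate_by_bucket(rows):
    """rows: iterable of (tool, error_code, recovered) -> {(tool, error_code): rate}."""
    totals = defaultdict(int)
    recovered = defaultdict(int)
    for tool, error_code, ok in rows:
        totals[(tool, error_code)] += 1
        recovered[(tool, error_code)] += int(ok)
    return {bucket: recovered[bucket] / totals[bucket] for bucket in totals}

def failing_buckets(rows, min_rate=0.8):
    """Buckets whose recovery rate falls below the gate, sorted for stable CI output."""
    rates = recovery_rate_by_bucket(rows)
    return sorted(bucket for bucket, rate in rates.items() if rate < min_rate)

rows = [
    ("search_flights", "400", True), ("search_flights", "400", True),
    ("search_flights", "429", True), ("search_flights", "429", False),
    ("book_flight", "429", True),
]
print(failing_buckets(rows))
```

Here the aggregate recovery rate is 80 percent, which would pass a single global gate, while the `search_flights` / 429 bucket sits at 50 percent and fails its own.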
The compound-error problem on multi-step agents
End-to-end success on a k-step agent is roughly the product of per-step success rates. A 95-percent per-step agent over eight steps lands near 66 percent. A 99-percent per-step agent over eight steps lands near 92 percent. One third of sessions ending structurally wrong while every individual step scores green is not a hypothetical; it is the default math, and it is the most common reason teams ship agents that pass eval and tank production.
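The arithmetic is short enough to check directly. This sketch assumes steps fail independently with the same per-step rate, which is the simplification behind “roughly the product”; the function names are illustrative.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps

def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to reach `target` end-to-end over `steps` steps."""
    return target ** (1 / steps)

for p in (0.95, 0.99):
    print(f"{p:.2f} per step over 8 steps -> {end_to_end_success(p, 8):.3f}")
# 0.95 per step over 8 steps -> 0.663
# 0.99 per step over 8 steps -> 0.923
```

`required_per_step` runs the same relation backwards: it gives the per-step rate an eight-step agent needs to hit a chosen end-to-end target, which is usually a more sobering number than the target itself.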
Three habits fix this.
Score the trajectory as a unit. Add TaskCompletion, TrajectoryScore, GoalProgress on the full step list. The per-step rubric is the gate; the trajectory metric is the truth.
Treat any agent longer than five steps as suspect. Force the planner to decompose into shorter sub-agents. Long flat trajectories are where compound-error pain lives.
Reserve a “consistency” eval slice. Pick 30 hard cases and run them k times each; the fraction that succeed on all k is your pass^k in τ-bench’s sense. A pass^8 < 25 percent on a 4o-class model in retail is the cost of nondeterminism stacked across eight steps. When it moves, the planner regressed, not the tools.
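The consistency slice reduces to a small calculation once the rollouts are recorded. The sketch below implements the simple reading used above (the fraction of cases whose k rollouts all succeed) and contrasts it with the per-rollout average; `pass_hat_k` and the case names are hypothetical.

```python
def pass_hat_k(rollouts_by_case):
    """Fraction of cases whose rollouts all succeeded.

    rollouts_by_case: {case_id: [bool, ...]} with the same number k of rollouts per case.
    """
    if not rollouts_by_case:
        raise ValueError("need at least one case")
    lengths = {len(r) for r in rollouts_by_case.values()}
    if len(lengths) != 1:
        raise ValueError("every case needs the same number of rollouts")
    all_passed = sum(all(r) for r in rollouts_by_case.values())
    return all_passed / len(rollouts_by_case)

def mean_pass_rate(rollouts_by_case):
    """Plain average over every individual rollout, for comparison."""
    flat = [ok for r in rollouts_by_case.values() for ok in r]
    return sum(flat) / len(flat)

results = {
    "refund_partial":       [True, True, True, True],
    "rebook_after_cancel":  [True, False, True, True],
    "upgrade_with_voucher": [True, True, True, True],
    "multi_city":           [False, False, True, False],
}
print(pass_hat_k(results), mean_pass_rate(results))
```

With four rollouts per case the average rollout passes 75 percent of the time, yet only half of the cases pass every rollout. That gap is what the consistency slice is there to expose.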
Public benchmarks (BFCL, τ-bench) vs your tool registry
Two public benchmarks anchor the floor in 2026. Use them; do not gate production on them.
BFCL (Berkeley Function Calling Leaderboard) evaluates function calling across an AST track (syntactic correctness), an executable track (the call actually runs on a real endpoint), and an irrelevance-detection bucket. The breakdown is the value: a model that aces AST and tanks irrelevance overcalls on your registry; a model that aces AST and tanks executable generates plausible but non-running calls. Treat BFCL as a model-selection signal, not a production gate.
τ-bench evaluates multi-turn agent behavior in airline and retail environments. The user is LLM-simulated, the agent has tools and a domain policy, and the headline metric is pass^k measuring reliability across k independent rollouts (not k retry attempts). Even GPT-4o lands below 25 percent at pass^8 on retail; multi-turn tool-using agents are nondeterminism amplifiers, and the consistency metric exposes how much.
Public benchmarks tell you whether the underlying model can call tools at all. They tell you nothing about your registry, argument schemas, error codes, or business policy. The private eval set is the one that gates production. Build it stratified by tool, argument-edge-case bucket, and error code; promote failing production traces into it weekly. The MCP (Model Context Protocol) angle is the same: the protocol surface is generic; your MCP server’s tool registry is private, and the eval has to know that registry.
```python
# Stratified shape for a private tool-calling eval set
{
    "by_tool": {"search_flights": 80, "book_flight": 40, ...},  # ≥ 30 per tool
    "by_layer": {"selection_correct": 0.40, "selection_irrelevant": 0.10,
                 "argument_edge": 0.20, "result_utilization": 0.15,
                 "error_recovery": 0.15},
    "by_difficulty": {"easy": 0.4, "medium": 0.4, "hard": 0.2},
}
```
Wiring the stack: traceAI spans + EvalTag + CI
The @tracer.tool decorator wraps a Python function and auto-infers the tool description from the docstring and the parameter schema from type annotations via _get_jsonschema_type. No manual span creation.
```python
from fi_instrumentation import register, get_tracer
from fi_instrumentation.fi_types import ProjectType

register(project_name="travel_agent", project_type=ProjectType.OBSERVE)
tracer = get_tracer(__name__)

@tracer.tool
def search_flights(departure: str, arrival: str, date: str) -> list:
    """Search flights between two airports on a given date."""
    ...
```
Every tool call lands as a span with fi.span.kind=TOOL plus the GenAI attributes (gen_ai.tool.name, gen_ai.tool.call.id, gen_ai.tool.call.arguments, gen_ai.tool.call.result). Latency rides on the standard OpenTelemetry duration attribute, so per-tool p50, p95, p99 are a Grafana query away. Eval scores attach to spans via EvalTag; the collector runs evals server-side post-export at zero inline latency. Wire one EvalTag per layer (EVALUATE_LLM_FUNCTION_CALLING on TOOL spans for layers 1-2, GROUNDEDNESS on LLM spans for layer 3, TASK_COMPLETION on AGENT spans for layer 4) and the same rubric runs in CI and on live traces.
Per-layer CI gate as one fixture:
```yaml
# config.yaml for `fi run`
assertions:
  - "tool_name_f1.score >= 0.95 for at_least 95% of cases"
  - "argument_validation.score >= 0.90 for at_least 90% of cases"
  - "argument_semantics.score >= 0.85 for at_least 85% of cases"
  - "result_groundedness.score >= 0.90 for at_least 90% of cases"
  - "trajectory_recovery.score >= 0.80 for at_least 85% of cases"
```
Distributed batch evaluation runs across four backends (Celery, Ray, Temporal, Kubernetes) when the CI fixture grows past a single-runner budget.
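For readers who want the gate semantics spelled out, here is one plausible reading of the assertion strings in the CI fixture above: a layer passes when at least the stated fraction of cases reach the stated score. This is a sketch of that reading in plain Python, not the `fi run` implementation; `passes_gate`, `failing_layers`, and the `GATES` table are hypothetical names.

```python
def passes_gate(scores, min_score, min_fraction):
    """True if at least `min_fraction` of cases score `min_score` or higher."""
    if not scores:
        return False  # an empty layer fails the gate instead of passing silently
    passing = sum(score >= min_score for score in scores)
    return passing / len(scores) >= min_fraction

# layer -> (min_score, min_fraction), copied from the thresholds in the fixture
GATES = {
    "tool_name_f1": (0.95, 0.95),
    "argument_validation": (0.90, 0.90),
    "argument_semantics": (0.85, 0.85),
    "result_groundedness": (0.90, 0.90),
    "trajectory_recovery": (0.80, 0.85),
}

def failing_layers(scores_by_layer):
    """Names of layers whose scores miss their gate, in GATES order."""
    return [
        layer for layer, (min_score, min_fraction) in GATES.items()
        if not passes_gate(scores_by_layer.get(layer, []), min_score, min_fraction)
    ]

run = {layer: [1.0] * 10 for layer in GATES}
run["trajectory_recovery"] = [1.0] * 8 + [0.0] * 2  # 80% of cases clear the bar
print(failing_layers(run))
```

The run above fails only on `trajectory_recovery`, so the CI output names the layer to look at first.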
Tool-call security as a first-class layer
The four-layer stack is the quality story. Tool-calling agents need a parallel security story because they can do things. Three defenses ride alongside.
Per-virtual-key allow/deny lists at the gateway. The Agent Command Center carries AllowedTools and DeniedTools on every API key (plus AllowedModels, AllowedProviders, AllowedIPs, RateLimitRPM, RateLimitTPM). A key scoped to ["search_flights", "check_status"] cannot call cancel_booking under a coerced prompt.
MCP security at the protocol boundary. The gateway’s mcpsec plugin enforces allowed_servers, blocked_tools, validate_inputs / validate_outputs (injection patterns in tool arguments and results), max_calls_per_request (default 25), and per-tool rate limits. Default patterns catch exec(), eval(), shell escapes, SQL drop, script tags.
Prompt injection on every tool input and output. Future AGI Protect’s prompt_injection adapter is one of four Gemma 3n LoRA adapters running at 65 ms text / 107 ms image median time-to-label per the Protect paper, with the agentcc-gateway Go plugin carrying a deterministic regex fallback. The same adapters can be reused as eval rubrics for batch scoring of historical tool-call traces, so the production policy and the regression rubric stay in sync. Future AGI is SOC 2 Type II, HIPAA, GDPR, and CCPA certified per futureagi.com/trust; ISO/IEC 27001 is in active audit.
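The per-key scoping in the first defense is easy to reason about with a toy model. The sketch below is not the gateway's code: `KeyPolicy` and `tool_call_permitted` are hypothetical, and the precedence rules (an explicit deny wins, a non-empty allow list is exhaustive, an empty allow list means no allow-list restriction) are assumptions to verify against the gateway's documented behavior.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class KeyPolicy:
    """Toy stand-in for the AllowedTools / DeniedTools fields on an API key."""
    allowed_tools: frozenset = field(default_factory=frozenset)
    denied_tools: frozenset = field(default_factory=frozenset)

def tool_call_permitted(policy: KeyPolicy, tool_name: str) -> bool:
    if tool_name in policy.denied_tools:
        return False  # assumption: an explicit deny always wins
    if policy.allowed_tools:
        return tool_name in policy.allowed_tools  # assumption: allow list is exhaustive
    return True  # assumption: an empty allow list imposes no restriction

support_key = KeyPolicy(allowed_tools=frozenset({"search_flights", "check_status"}))
for tool in ("search_flights", "cancel_booking"):
    print(tool, tool_call_permitted(support_key, tool))
```

The point of enforcing this at the gateway instead of in the prompt is that the check does not depend on what the model was talked into: `cancel_booking` is refused for this key no matter what the conversation says.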
How Future AGI ships the tool-calling eval stack
Future AGI ships the eval stack as a package. Start with the SDK for code-defined per-layer scoring. Graduate to the Platform when the loop needs self-improving rubrics and classifier-backed cost economics.
- ai-evaluation SDK (Apache 2.0): deterministic function-call metrics (function_name_match, parameter_validation, function_call_accuracy, function_call_exact_match) for layers 1-2; EvaluateFunctionCalling (alias LLMFunctionCalling) for the rubric case; Groundedness, ContextAdherence, ChunkAttribution, ChunkUtilization for layer 3; seven agent-trajectory metrics on AgentTrajectoryInput for layer 4; fi run CLI with CI assertions.
- Future AGI Platform: self-improving evaluators tuned by thumbs feedback; in-product authoring agent for custom rubrics; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- traceAI (Apache 2.0): @tracer.tool decorator with auto-inferred schemas; 14 span kinds across 50+ AI surfaces in Python, TypeScript, Java, C#; EvalTag wires rubric to span at zero inference latency.
- Error Feed: HDBSCAN clusters tool-call failures into named issues; a Sonnet 4.5 Judge agent writes the 5-category 30-subtype taxonomy (Decision Errors: wrong-tool, invalid-params, missing-tool-call), the 4-D trace score, and an immediate_fix.
- Agent Command Center: 20+ providers; per-virtual-key AllowedTools / DeniedTools and an MCP security plugin enforce scope at the gateway; SOC 2 Type II, HIPAA, GDPR, CCPA certified (ISO/IEC 27001 in active audit).
Three honest tradeoffs
- Per-layer scoring costs more than aggregate task-completion. Five rubrics per case, not one. Payoff: when CI fails, the failing layer name is the root cause. New deployments can ship the deterministic layers first and turn on the LLM-judge layers once trace volume justifies it.
- Groundedness on tool output is noisier than groundedness on retrieved corpus. Tool payloads are JSON; the rubric reasons over fields, not prose. Pin a small human-labelled calibration set and re-tune monthly.
- The pass^k consistency slice is expensive. 30 cases × 8 rollouts per gate is 240 agent runs. Run it on release candidates, not every PR; the planner-vs-tool signal is worth the cadence.
Ready to evaluate your first tool-calling agent? Wire function_name_match, parameter_validation, Groundedness against the tool result, and TaskCompletion on AgentTrajectoryInput into a pytest fixture this afternoon against the ai-evaluation SDK, then attach the same templates as EvalTag scorers via traceAI when production traces start asking questions the CI gate missed.
Related reading
- Your Agent Passes Evals and Fails in Production. Here’s Why. (2026)
- How to Build and Evaluate a Customer Support Chatbot in 2026
- Evaluating Multi-Turn Conversations: A Deep Dive (2026)
- Agent Evaluation Frameworks (2026)
- Agent Observability vs Evaluation vs Benchmarking (2026)
- LLM Evaluation Playbook (2026)