Evaluating Tool-Calling Agents in 2026: The Four-Layer Eval Stack

Tool-calling eval is four eval problems stacked: tool selection, argument extraction, result utilization, and error recovery. Most posts grade the first one and call it done.

An agent fails to book a flight. The trace shows the model called search_flights with departure_date="next Friday". The endpoint returned 400; it expected an ISO date. The agent retried four times with the same string, then apologized to the user. Tool selection was correct: the model picked the right function from a registry of 28, and tool-selection accuracy logs a 1.0. Neither that score nor an aggregate task-completion score of 0 tells you which of three things downstream broke: the argument was wrong, the model never read the 400 body, or the retry policy looped on the same input.

If you only eval “did the agent call the right tool,” you’re testing intent, not execution. Tool selection is necessary, not sufficient. The opinion this post earns: tool-calling eval is four eval problems stacked, not one. Layer 1, tool selection. Layer 2, argument extraction. Layer 3, result utilization. Layer 4, error recovery. Per-layer scoring tells you what to fix this afternoon. This guide walks each layer, the rubric that catches it, the compound-error math on multi-step agents, where BFCL and τ-bench fit, and how the Future AGI eval stack wires it end-to-end.

TL;DR: the four-layer eval stack

| Layer | What you measure | Deterministic rubric | LLM-judge rubric |
| --- | --- | --- | --- |
| 1. Tool selection | Right tool (or correctly no tool) | function_name_match, F1 + irrelevance bucket | EvaluateFunctionCalling |
| 2. Argument extraction | Schema-valid + semantically correct | parameter_validation, function_call_exact_match | LLMFunctionCalling on argument semantics |
| 3. Result utilization | Did the agent use what the tool returned | function_call_accuracy on the call sequence | Groundedness + ContextAdherence with tool result as context |
| 4. Error recovery | Did the agent retry, fall back, or escalate | Retry-count, max-loops, error-tier guards | TaskCompletion + recovery rubric on AgentTrajectoryInput |

Non-negotiables: per-layer scoring rather than aggregate task_completion alone, an irrelevance bucket on the test set, schema validation as a deterministic gate before the LLM-judge runs, groundedness on the tool output as a first-class rubric, and a trajectory rubric so the compound-error problem stops hiding in per-turn averages.

Why tool-calling eval is four eval problems stacked

Four failure modes show up in postmortem, and they map cleanly onto four layers.

Selection. The agent picked the wrong tool, called a tool when the model knew the answer directly, or did not call one when it should have. F1 on the tool name plus an irrelevance bucket catches it; the irrelevance bucket is the piece most posts drop.

Argument. Schema right, types right, values wrong. departure_date="next Friday" schema-validates and fails the user. customer_id="me" returns someone else’s account. amount_cents=5000000 drains the refund budget. Schema validation catches the type class; the semantic class needs a rubric.

Result-utilization. The tool returned correctly; the agent ignored the payload, paraphrased with a number flipped, substituted prior model knowledge, or used the result on turn 1 and drifted off it by turn 3. Almost every public post on tool-call eval skips this layer.

Error-recovery. The tool 4xx-ed, the model did not read the error body, the retry sent the same broken arguments, the loop hit the max-step ceiling, or the agent fabricated a “successful” response to hide the failure. Per-call rubrics never see this; the trajectory metric does.

Score the four layers separately and the diagnostic vocabulary collapses from “the agent failed” to “the argument extractor regressed on date strings on the flight-booking path.” One bisect instead of three days.

Layer 1: tool selection (F1 on the tool name + the irrelevance bucket)

Pull the model’s chosen tool name, compare it to the gold label, and aggregate as F1 per tool so that a registry of 28 tools does not hide a regression on one rare endpoint behind a strong global mean.

from fi.evals import evaluate

result = evaluate("function_name_match",
    output={"function_name": predicted_tool},
    expected={"function_name": ground_truth_tool})
# result.score = 1.0 exact match, 0.0 otherwise

The SDK ships four deterministic function-call metrics in fi.evals.metrics.function_calling: function_name_match (name only), parameter_validation (name plus argument shape), function_call_accuracy (the full call against the gold), and function_call_exact_match (strict equality including parallel-call ordering). All sub-millisecond per call.

The piece most posts drop is the irrelevance bucket. The test set has to include cases where the gold answer is “no tool call”: a greeting, a clarification request, an in-model factual question, a refusal-worthy ask. Without those cases, you cannot detect the regression where a new prompt revision makes the model bolder about calling search on every input. BFCL added the bucket for exactly this reason; build it into your private set the same way.
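
As a rough illustration of per-tool F1 with an irrelevance bucket, the sketch below treats "no tool call" as its own class. The `NO_TOOL` label and the `per_tool_f1` helper are hypothetical names for this example, not part of the SDK.

```python
from collections import Counter

NO_TOOL = "<no_tool>"  # hypothetical label for the irrelevance bucket


def per_tool_f1(gold, predicted):
    """Return {tool: (precision, recall, f1)}, treating NO_TOOL as its own class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but was not the gold answer
            fn[g] += 1  # the gold answer g was missed
    scores = {}
    for tool in set(gold) | set(predicted):
        precision = tp[tool] / (tp[tool] + fp[tool]) if tp[tool] + fp[tool] else 0.0
        recall = tp[tool] / (tp[tool] + fn[tool]) if tp[tool] + fn[tool] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[tool] = (precision, recall, f1)
    return scores


gold = ["search_flights", NO_TOOL, "book_flight", NO_TOOL]
pred = ["search_flights", "search_flights", "book_flight", NO_TOOL]
scores = per_tool_f1(gold, pred)
```

Here the second row is an overcall: the gold answer was no tool, the model called search_flights. That single row lowers precision on search_flights and recall on the irrelevance class, which is the regression a name-match average over tool-only rows would miss.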

For the rubric case (no single correct tool, or two reasonable tools differ on edge), the cloud EvaluateFunctionCalling template (alias LLMFunctionCalling) handles semantic correctness via the Evaluator API or CustomLLMJudge.

Layer 2: argument extraction (schema validation + semantic rubric)

Argument failures fall into three buckets: schema mismatch (wrong type, missing required field), semantic mismatch (right schema, wrong value), and edge-case handling (null, empty array, special characters, type coercion).

Schema validation runs first and is deterministic. Pydantic on the model’s output is the cheapest possible gate.

from pydantic import BaseModel, Field, ValidationError

class SearchFlightsArgs(BaseModel):
    departure_airport: str = Field(pattern=r"^[A-Z]{3}$")
    arrival_airport: str = Field(pattern=r"^[A-Z]{3}$")
    departure_date: str = Field(pattern=r"^\d{4}-\d{2}-\d{2}$")
    cabin: str = Field(pattern=r"^(economy|premium|business|first)$")

try:
    args = SearchFlightsArgs.model_validate(predicted_args)
    schema_errors = []  # defined on the success path too, so the gate can read it
except ValidationError as e:
    schema_errors = e.errors()  # emit as span attribute, gate CI

Once schema passes, the SDK’s parameter_validation metric matches argument shape and values against the gold call. Semantic correctness needs the LLM-judge: departure_date="2026-01-01" schema-validates and is wrong if the user said “next Friday.” A CustomLLMJudge scores whether the argument captures the user’s intent — 1.0 if it captures correct dates, entities, identifiers, units; 0.5 on minor interpretation; 0.0 if values are clearly wrong or unobtainable from the user input.
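
For date arguments specifically, part of the semantic check can be made deterministic when the test case records the user's phrase and a reference date. The sketch below is one way to do that, assuming "next Friday" means the first Friday strictly after the reference date; that convention, and the helper names, are assumptions for illustration and would need to match how your gold labels were written.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_next_weekday(phrase, today):
    """Resolve 'next <weekday>' to the first such weekday strictly after `today`."""
    name = phrase.lower().removeprefix("next ").strip()
    target = WEEKDAYS.index(name)  # raises ValueError on an unknown weekday
    days_ahead = (target - today.weekday() - 1) % 7 + 1  # always in 1..7
    return today + timedelta(days=days_ahead)


def date_argument_matches(predicted_iso, phrase, today):
    """Return 1.0 if the predicted ISO date equals the resolved date, else 0.0."""
    try:
        predicted = date.fromisoformat(predicted_iso)
    except ValueError:
        return 0.0  # e.g. the raw string "next Friday" was passed through
    return 1.0 if predicted == resolve_next_weekday(phrase, today) else 0.0


monday = date(2026, 1, 5)  # a Monday
print(date_argument_matches("2026-01-01", "next Friday", monday))  # 0.0
print(date_argument_matches("2026-01-09", "next Friday", monday))  # 1.0
```

A check like this does not replace the judge; it covers the narrow slice where the intended value can be computed, and leaves ambiguous phrasing to the rubric.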

Build a regression suite of edge cases per tool: null on optional fields, empty array where the schema permits but the tool returns 500, unicode in identifiers, the time-zone case on every date field, the currency case on every monetary field. These are the failures BFCL cannot see because they are private to your tool registry.

Layer 3: result utilization (the layer most posts skip)

The tool returned. The agent has the payload. Three failure patterns show up.

The agent paraphrases the payload with a number flipped. Tool returns {"refund_status": "pending", "amount_cents": 4500}, agent says “your refund of $54.00 is processing.” Schema-correct call, clean response, and the amount is wrong by $9.00 because two digits were transposed.

The agent substitutes prior model knowledge. get_account_balance returns {"balance_cents": 12_400}, which is $124.00. The model “knows” the user has a standard $200 minimum and replies “your balance is above the $200 threshold.” The tool result was never read.

The agent uses the result on turn 1 and drifts off it by turn 3. The flight-booking agent quotes the right itinerary on turn 1, then invents a baggage policy on turn 3 that contradicts the airline_policy tool result from two turns ago.

The rubric is Groundedness, with the context slot pointed at the tool’s return payload rather than the retrieved corpus. ContextAdherence and ChunkAttribution work the same way: chunk the tool result into JSON fields, score whether each claim in the response maps to one.

import json

from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextAdherence, ChunkAttribution
from fi.testcases import TestCase

evaluator = Evaluator()
for tool_call in result.tool_calls:
    # context = the actual tool payload, not the retrieved corpus
    tc = TestCase(input=ex.user_message, output=result.response,
                  context=json.dumps(tool_call.result))
    scores = evaluator.evaluate(
        eval_templates=[Groundedness(), ContextAdherence(), ChunkAttribution()],
        inputs=tc)

Score this layer on every multi-turn agent where the tool feeds the response. The Platform’s classifier-backed cascade runs Groundedness at lower per-eval cost than Galileo Luna-2.
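
The flipped-number pattern can also be screened deterministically before any judge runs. The sketch below is a minimal example that assumes monetary fields in the payload end in `_cents` and that the response quotes amounts as `$X` or `$X.YY`; both conventions and the helper name are assumptions for illustration.

```python
import re


def ungrounded_dollar_amounts(response, payload):
    """Return amounts (in cents) quoted in `response` that match no *_cents field."""
    grounded_cents = {value for key, value in payload.items()
                      if key.endswith("_cents") and isinstance(value, int)}
    quoted = re.findall(r"\$(\d+(?:,\d{3})*)(?:\.(\d{2}))?", response)
    missing = []
    for dollars, cents in quoted:
        value = int(dollars.replace(",", "")) * 100 + int(cents or 0)
        if value not in grounded_cents:
            missing.append(value)
    return missing


payload = {"refund_status": "pending", "amount_cents": 4500}
print(ungrounded_dollar_amounts("your refund of $54.00 is processing", payload))  # [5400]
print(ungrounded_dollar_amounts("your refund of $45.00 is processing", payload))  # []
```

A non-empty result is a cheap signal to fail the case outright or to route it to the Groundedness rubric with the mismatch attached.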

Layer 4: error recovery (the trajectory rubric)

When the tool 4xx-es, times out, or returns an empty or partial result, the agent’s next move is the eval surface. The patterns to grade: did the agent read the error body and route to a corrected retry, a fallback tool, a clarification question, or a graceful escalation; did it retry with corrected arguments on a 400 or send the same broken string again; did it fall back to an alternative tool when the primary was down; did it stop at a sensible cap on retries (3 is a common limit; 6 usually means the loop guard is missing); did it communicate the failure clearly instead of fabricating success.

This is a trajectory-level concern, not per-call. The SDK exposes AgentTrajectoryInput with seven trajectory metrics in fi.evals.metrics.agents: TaskCompletion, StepEfficiency, ToolSelectionAccuracy, TrajectoryScore, GoalProgress, ActionSafety, ReasoningQuality. Each takes the full step list, the available tools, and the expected goal.

from fi.evals.metrics.agents import TrajectoryScore, AgentTrajectoryInput
from fi.evals.metrics.agents.types import AgentStep, TaskDefinition

trajectory = AgentTrajectoryInput(
    trajectory=[AgentStep(action=s.action, tool_used=s.tool,
                          tool_args=s.args, tool_result=s.result,
                          error=s.error) for s in agent_steps],
    task=TaskDefinition(goal=expected_goal, description=user_request),
    available_tools=[t.name for t in registered_tools],
    final_result=agent_response)
score = TrajectoryScore().compute_one(trajectory)

For a recovery-specific rubric, wrap a CustomLLMJudge against the trajectory: 1.0 if the agent read the error and corrected, 0.5 on partial recovery, 0.0 if it looped on the same broken input or hid the failure.
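
Part of that 1.0 / 0.5 / 0.0 rubric can be approximated without a judge. The sketch below assumes each step is a plain dict with `tool`, `args`, and `error` keys (a hypothetical shape, not the SDK's AgentStep) and only detects blind retries and final success; it cannot tell whether the agent hid a failure in its reply, which still needs the judge.

```python
import json


def recovery_score(steps):
    """Score a trajectory 1.0 / 0.5 / 0.0 from a list of step dicts.

    0.0: a failing call was repeated with identical arguments.
    1.0: no failures, or no blind repeat and the final step succeeded.
    0.5: failures without a blind repeat, but the final step still failed.
    """
    failed_calls = set()
    for step in steps:
        if step["error"] is None:
            continue
        key = (step["tool"], json.dumps(step["args"], sort_keys=True))
        if key in failed_calls:
            return 0.0  # same broken input sent again
        failed_calls.add(key)
    if not failed_calls:
        return 1.0
    return 1.0 if steps[-1]["error"] is None else 0.5


bad_call = {"tool": "search_flights",
            "args": {"departure_date": "next Friday"}, "error": "400"}
good_call = {"tool": "search_flights",
             "args": {"departure_date": "2026-01-09"}, "error": None}
print(recovery_score([bad_call, bad_call]))   # 0.0
print(recovery_score([bad_call, good_call]))  # 1.0
print(recovery_score([bad_call]))             # 0.5
```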

Build a stratified test set: one bucket per tool, one row per error code the endpoint actually returns (400, 401, 403, 404, 408, 429, 5xx), plus empty-result and partial-result rows. Gate CI on per-bucket recovery rates. A regression on 429 recovery for a single tool is one of the cheapest failures to ship and the hardest to debug after the fact.
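
A per-bucket gate can be as small as the sketch below. It assumes each test row has already been reduced to a `(tool, error_code, recovered)` tuple with error codes as strings, and the 0.8 threshold is an arbitrary placeholder rather than a recommended value.

```python
from collections import defaultdict


def recovery_rates(rows):
    """Map each (tool, error_code) bucket to its recovery rate."""
    totals, wins = defaultdict(int), defaultdict(int)
    for tool, code, recovered in rows:
        totals[(tool, code)] += 1
        wins[(tool, code)] += int(recovered)
    return {bucket: wins[bucket] / totals[bucket] for bucket in totals}


def failing_buckets(rows, threshold=0.8):
    """Return the sorted buckets whose recovery rate is below `threshold`."""
    return sorted(bucket for bucket, rate in recovery_rates(rows).items()
                  if rate < threshold)


rows = [
    ("search_flights", "429", True),
    ("search_flights", "429", False),
    ("search_flights", "400", True),
    ("book_flight", "429", True),
]
print(failing_buckets(rows))  # [('search_flights', '429')]
```

Reporting the failing bucket by name keeps a single-tool 429 regression from averaging out against healthy buckets.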

The compound-error problem on multi-step agents

End-to-end success on a k-step agent is roughly the product of per-step success rates. A 95-percent per-step agent over eight steps lands near 66 percent. A 99-percent per-step agent over eight steps lands near 92 percent. One third of sessions ending structurally wrong while every individual step scores green is not a hypothetical; it is the default math, and it is the most common reason teams ship agents that pass eval and tank production.
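
The arithmetic is easy to reproduce. The snippet assumes per-step outcomes are independent, which is a simplification; correlated failures would change the numbers.

```python
def end_to_end_success(per_step, steps):
    """Probability that all `steps` succeed, assuming independent steps."""
    return per_step ** steps


for p in (0.95, 0.99):
    print(f"per-step {p:.2f} over 8 steps -> {end_to_end_success(p, 8):.3f}")
# per-step 0.95 over 8 steps -> 0.663
# per-step 0.99 over 8 steps -> 0.923
```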

Three habits fix this.

Score the trajectory as a unit. Add TaskCompletion, TrajectoryScore, GoalProgress on the full step list. The per-step rubric is the gate; the trajectory metric is the truth.

Treat any agent longer than five steps as suspect. Force the planner to decompose into shorter sub-agents. Long flat trajectories are where compound-error pain lives.

Reserve a “consistency” eval slice. Pick 30 hard cases and run them k times each; the fraction that succeed on all k is your pass^k in τ-bench’s sense. A pass^8 < 25 percent on a 4o-class model in retail is the cost of nondeterminism stacked across eight steps. When it moves, the planner regressed, not the tools.
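
One common way to estimate pass^k from n rollouts per case is the combinatorial form below: the chance that k rollouts drawn without replacement from the n are all successes, averaged over cases. This is a sketch of that estimator, not a copy of any benchmark's reference code. When n equals k it reduces to the fraction of cases that succeed on every rollout.

```python
from math import comb


def pass_hat_k(successes, n, k):
    """Chance that k rollouts drawn from n (without replacement) all succeeded."""
    if k > n:
        raise ValueError("k cannot exceed the number of rollouts n")
    return comb(successes, k) / comb(n, k)


def mean_pass_hat_k(success_counts, n, k):
    """Average pass^k over cases; success_counts[i] is successes out of n."""
    return sum(pass_hat_k(c, n, k) for c in success_counts) / len(success_counts)


# Four cases, eight rollouts each: two always pass, one fails once, one never passes.
print(mean_pass_hat_k([8, 8, 7, 0], n=8, k=8))  # 0.5
print(mean_pass_hat_k([8, 8, 7, 0], n=8, k=1))  # 0.71875
```

The gap between k=1 and k=8 on the same rollouts is the consistency signal: the case that passes seven times out of eight counts almost fully at k=1 and not at all at k=8.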

Public benchmarks (BFCL, τ-bench) vs your tool registry

Two public benchmarks anchor the floor in 2026. Use them; do not gate production on them.

BFCL (Berkeley Function Calling Leaderboard) evaluates function calling across an AST track (syntactic correctness), an executable track (the call actually runs on a real endpoint), and an irrelevance-detection bucket. The breakdown is the value: a model that aces AST and tanks irrelevance overcalls on your registry; a model that aces AST and tanks executable generates plausible but non-running calls. Treat BFCL as a model-selection signal, not a production gate.

τ-bench evaluates multi-turn agent behavior in airline and retail environments. The user is LLM-simulated, the agent has tools and a domain policy, and the headline metric is pass^k measuring reliability across k independent rollouts (not k retry attempts). Even GPT-4o lands below 25 percent at pass^8 on retail; multi-turn tool-using agents are nondeterminism amplifiers, and the consistency metric exposes how much.

Public benchmarks tell you whether the underlying model can call tools at all. They tell you nothing about your registry, argument schemas, error codes, or business policy. The private eval set is the one that gates production. Build it stratified by tool, argument-edge-case bucket, and error code; promote failing production traces into it weekly. The MCP (Model Context Protocol) angle is the same: the protocol surface is generic; your MCP server’s tool registry is private, and the eval has to know that registry.

# Stratified shape for a private tool-calling eval set
{
  "by_tool":  {"search_flights": 80, "book_flight": 40, ...},   # ≥ 30 per tool
  "by_layer": {"selection_correct": 0.40, "selection_irrelevant": 0.10,
               "argument_edge": 0.20, "result_utilization": 0.15,
               "error_recovery": 0.15},
  "by_difficulty": {"easy": 0.4, "medium": 0.4, "hard": 0.2},
}
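
A small validator keeps a spec like this honest as the set grows. The sketch below assumes the spec is a plain dict in the shape shown above and lists only the two tools named in the text; the `validate_eval_set_spec` helper is hypothetical.

```python
import math


def validate_eval_set_spec(spec, min_per_tool=30):
    """Return a list of problems; an empty list means the spec is well-formed."""
    problems = []
    for tool, count in spec["by_tool"].items():
        if count < min_per_tool:
            problems.append(f"{tool}: {count} cases, need >= {min_per_tool}")
    for key in ("by_layer", "by_difficulty"):
        total = sum(spec[key].values())
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            problems.append(f"{key} fractions sum to {total:.2f}, expected 1.00")
    return problems


spec = {
    "by_tool": {"search_flights": 80, "book_flight": 40},
    "by_layer": {"selection_correct": 0.40, "selection_irrelevant": 0.10,
                 "argument_edge": 0.20, "result_utilization": 0.15,
                 "error_recovery": 0.15},
    "by_difficulty": {"easy": 0.4, "medium": 0.4, "hard": 0.2},
}
print(validate_eval_set_spec(spec))  # []
```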

Wiring the stack: traceAI spans + EvalTag + CI

The @tracer.tool decorator wraps a Python function and auto-infers the tool description from the docstring and the parameter schema from type annotations via _get_jsonschema_type. No manual span creation.

from fi_instrumentation import register, get_tracer
from fi_instrumentation.fi_types import ProjectType

register(project_name="travel_agent", project_type=ProjectType.OBSERVE)
tracer = get_tracer(__name__)

@tracer.tool
def search_flights(departure: str, arrival: str, date: str) -> list:
    """Search flights between two airports on a given date."""
    ...
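
To make the inference step concrete, here is a standard-library sketch of deriving a tool description and parameter schema from a docstring and type annotations. It illustrates the general technique only; it is not traceAI's implementation, and the `infer_tool_schema` helper, the `JSON_TYPES` mapping, and the extra `max_results` parameter are assumptions for this example.

```python
import inspect
from typing import get_type_hints

JSON_TYPES = {str: "string", int: "integer", float: "number",
              bool: "boolean", list: "array", dict: "object"}


def infer_tool_schema(func):
    """Build a JSON-Schema-like dict from a function's docstring and annotations."""
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        annotation = hints.get(name, str)  # unannotated params fall back to string
        properties[name] = {"type": JSON_TYPES.get(annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default means the argument is required
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties,
                       "required": required},
    }


def search_flights(departure: str, arrival: str, date: str,
                   max_results: int = 10) -> list:
    """Search flights between two airports on a given date."""
    return []


schema = infer_tool_schema(search_flights)
```

The practical consequence is that the docstring and annotations become part of the tool contract: a vague docstring or a missing annotation degrades what the model sees.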

Every tool call lands as a span with fi.span.kind=TOOL plus the GenAI attributes (gen_ai.tool.name, gen_ai.tool.call.id, gen_ai.tool.call.arguments, gen_ai.tool.call.result). Latency rides on the standard OpenTelemetry duration attribute, so per-tool p50, p95, p99 are a Grafana query away. Eval scores attach to spans via EvalTag; the collector runs evals server-side post-export at zero inline latency. Wire one EvalTag per layer (EVALUATE_LLM_FUNCTION_CALLING on TOOL spans for layers 1-2, GROUNDEDNESS on LLM spans for layer 3, TASK_COMPLETION on AGENT spans for layer 4) and the same rubric runs in CI and on live traces.

Per-layer CI gate as one fixture:

# config.yaml for `fi run`
assertions:
  - "tool_name_f1.score >= 0.95 for at_least 95% of cases"
  - "argument_validation.score >= 0.90 for at_least 90% of cases"
  - "argument_semantics.score >= 0.85 for at_least 85% of cases"
  - "result_groundedness.score >= 0.90 for at_least 90% of cases"
  - "trajectory_recovery.score >= 0.80 for at_least 85% of cases"
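
Each assertion combines two thresholds: a per-case score floor and a fraction of cases that must clear it. The sketch below shows one reading of that semantics in plain Python; it is an interpretation of how the assertion strings read, not the `fi run` parser.

```python
def assertion_holds(scores, min_score, min_fraction):
    """True if at least `min_fraction` of cases score >= `min_score`."""
    if not scores:
        return False  # an empty run should not pass the gate
    passing = sum(1 for s in scores if s >= min_score)
    return passing / len(scores) >= min_fraction


groundedness = [1.0] * 19 + [0.0]
print(assertion_holds(groundedness, min_score=0.90, min_fraction=0.90))  # True
print(assertion_holds([1.0] * 8 + [0.0] * 2, 0.90, 0.90))                # False
```

The two-threshold form tolerates a few outliers without letting a mean hide a cluster of zero scores.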

Distributed batch evaluation runs across four backends (Celery, Ray, Temporal, Kubernetes) when the CI fixture grows past a single-runner budget.

Tool-call security as a first-class layer

The four-layer stack is the quality story. Tool-calling agents need a parallel security story because they can do things. Three defenses ride alongside.

Per-virtual-key allow/deny lists at the gateway. The Agent Command Center carries AllowedTools and DeniedTools on every API key (plus AllowedModels, AllowedProviders, AllowedIPs, RateLimitRPM, RateLimitTPM). A key scoped to ["search_flights", "check_status"] cannot call cancel_booking under a coerced prompt.
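
The enforcement logic behind an allow/deny list is small. The sketch below is a hypothetical model of it, with field names that mirror AllowedTools and DeniedTools; the precedence rules (deny wins, empty allow list permits nothing) are assumptions and may differ from the gateway's actual behavior.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class KeyPolicy:
    """Hypothetical per-key policy mirroring AllowedTools / DeniedTools."""
    allowed_tools: frozenset = field(default_factory=frozenset)
    denied_tools: frozenset = field(default_factory=frozenset)


def is_tool_permitted(policy, tool_name):
    """Deny wins; a tool absent from the allow list is rejected (assumption)."""
    if tool_name in policy.denied_tools:
        return False
    return tool_name in policy.allowed_tools


policy = KeyPolicy(allowed_tools=frozenset({"search_flights", "check_status"}))
print(is_tool_permitted(policy, "search_flights"))  # True
print(is_tool_permitted(policy, "cancel_booking"))  # False
```

Because the check runs outside the model, a coerced prompt can change what the agent asks for but not what the key is allowed to execute.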

MCP security at the protocol boundary. The gateway’s mcpsec plugin enforces allowed_servers, blocked_tools, validate_inputs / validate_outputs (injection patterns in tool arguments and results), max_calls_per_request (default 25), and per-tool rate limits. Default patterns catch exec(), eval(), shell escapes, SQL drop, script tags.
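
Input validation of this kind is typically a pattern scan over string arguments. The patterns below are illustrative stand-ins for the categories named above, not the plugin's actual defaults, and a regex screen like this is a coarse first filter rather than a complete defense.

```python
import re

# Illustrative patterns only; a real deployment would maintain a vetted list.
SUSPICIOUS_PATTERNS = {
    "python_exec": re.compile(r"\b(?:exec|eval)\s*\(", re.IGNORECASE),
    "shell_escape": re.compile(r"`|\$\(|&&|\|\|"),
    "sql_drop": re.compile(r"\bdrop\s+(?:table|database)\b", re.IGNORECASE),
    "script_tag": re.compile(r"<\s*script\b", re.IGNORECASE),
}


def scan_tool_arguments(arguments):
    """Return the sorted names of patterns matched by any string argument value."""
    hits = set()
    for value in arguments.values():
        if not isinstance(value, str):
            continue  # only string values are scanned in this sketch
        for name, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(value):
                hits.add(name)
    return sorted(hits)


print(scan_tool_arguments({"query": "x'; DROP TABLE users; --"}))       # ['sql_drop']
print(scan_tool_arguments({"departure_airport": "SFO", "seats": 2}))    # []
```

The same function can run over tool results before they re-enter the model's context, which is where indirect injection tends to arrive.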

Prompt injection on every tool input and output. Future AGI Protect’s prompt_injection adapter is one of four Gemma 3n LoRA adapters running at 65 ms text / 107 ms image median time-to-label per the Protect paper, with the agentcc-gateway Go plugin carrying deterministic regex fallback. The same adapters reuse as eval rubrics for batch scoring of historical tool-call traces, so the production policy and the regression rubric stay in sync. SOC 2 Type II, HIPAA, GDPR, and CCPA certified per futureagi.com/trust; ISO/IEC 27001 in active audit.

How Future AGI ships the tool-calling eval stack

Future AGI ships the eval stack as a package. Start with the SDK for code-defined per-layer scoring. Graduate to the Platform when the loop needs self-improving rubrics and classifier-backed cost economics.

  • ai-evaluation SDK (Apache 2.0): deterministic function-call metrics (function_name_match, parameter_validation, function_call_accuracy, function_call_exact_match) for layers 1-2; EvaluateFunctionCalling (alias LLMFunctionCalling) for the rubric case; Groundedness, ContextAdherence, ChunkAttribution, ChunkUtilization for layer 3; seven agent-trajectory metrics on AgentTrajectoryInput for layer 4; fi run CLI with CI assertions.
  • Future AGI Platform: self-improving evaluators tuned by thumbs feedback; in-product authoring agent for custom rubrics; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
  • traceAI (Apache 2.0): @tracer.tool decorator with auto-inferred schemas; 14 span kinds across 50+ AI surfaces in Python, TypeScript, Java, C#; EvalTag wires rubric to span at zero inference latency.
  • Error Feed: HDBSCAN clusters tool-call failures into named issues; a Sonnet 4.5 Judge agent writes the 5-category 30-subtype taxonomy (Decision Errors: wrong-tool, invalid-params, missing-tool-call), the 4-D trace score, and an immediate_fix.
  • Agent Command Center: 20+ providers; per-virtual-key AllowedTools / DeniedTools and an MCP security plugin enforce scope at the gateway; SOC 2 Type II, HIPAA, GDPR, CCPA certified (ISO/IEC 27001 in active audit).

Three honest tradeoffs

  • Per-layer scoring costs more than aggregate task-completion. Five rubrics per case, not one. Payoff: when CI fails, the failing layer name is the root cause. New deployments can ship the deterministic layers first and turn on the LLM-judge layers once trace volume justifies it.
  • Groundedness on tool output is noisier than groundedness on retrieved corpus. Tool payloads are JSON; the rubric reasons over fields, not prose. Pin a small human-labelled calibration set and re-tune monthly.
  • The pass^k consistency slice is expensive. 30 cases × 8 rollouts per gate is 240 agent runs. Run it on release candidates, not every PR; the planner-vs-tool signal is worth the cadence.

Ready to evaluate your first tool-calling agent? Wire function_name_match, parameter_validation, Groundedness against the tool result, and TaskCompletion on AgentTrajectoryInput into a pytest fixture this afternoon against the ai-evaluation SDK, then attach the same templates as EvalTag scorers via traceAI when production traces start asking questions the CI gate missed.

Frequently asked questions

What are the four layers of tool-calling agent evaluation?
Tool-calling eval is four eval problems stacked. Layer 1, tool selection: did the agent pick the right tool from the registry (or correctly pick none). Score with F1 on the tool name plus an explicit irrelevance bucket so 'called search when it shouldn't have' is a graded failure. Layer 2, argument extraction: did the agent produce a JSON object that schema-validates and semantically maps to the user's request. Two rubrics, deterministic schema validation plus an LLM-judge on argument semantics. Layer 3, result utilization: when the tool returned, did the agent use the payload in its answer or ignore it and hallucinate. Groundedness on the tool output, the layer almost every public post skips. Layer 4, error recovery: when the tool 4xx-ed, timed out, or returned an empty result, did the agent retry with corrected arguments, fall back to another tool, or escalate. Aggregate task-completion hides which layer broke; per-layer scoring tells you what to fix this afternoon.
Why isn't tool selection accuracy enough on its own?
Because picking the right tool is testing intent, not execution. An agent that picks 'search_flights' and then passes departure_date='next Friday' is graded 1.0 by a name-match metric and 0.0 by your users. Tool selection is necessary, not sufficient. The four-layer stack exists because each downstream layer (arguments, result use, error recovery) is its own independent failure surface with its own root cause. Treating tool-call eval as 'did it pick the right tool' is the same mistake as treating RAG eval as 'did it retrieve the right chunk', and it produces the same class of trace-eval gap in production.
How do I score result utilization (the layer most posts skip)?
Run a Groundedness rubric where the context is the tool's return payload, not the retrieved corpus. The question to grade: does the agent's response derive cleanly from what the tool returned, or did it ignore the payload and substitute prior model knowledge. Three failure patterns surface most often. One, agent paraphrases the tool result with a number flipped or an entity swapped. Two, agent ignores the tool result entirely because it does not match what the model already believed. Three, agent uses the tool result correctly on the first turn and then drifts away from it on follow-up turns. The Future AGI ai-evaluation SDK ships Groundedness and ContextAdherence as the rubrics; point the context slot at the tool output JSON and the rubric works against tool calls without modification.
What does the compound-error problem mean for multi-step agents?
End-to-end success on a k-step agent is roughly the product of per-step success rates. A 95 percent per-step agent over eight steps lands near 66 percent. A 99 percent per-step agent over eight steps lands near 92 percent. The math is unforgiving and it is why teams ship agents with great per-turn rubrics that miss conversation-level metrics. The fix is to score the trajectory as a unit alongside the per-step rubric. The ai-evaluation SDK exposes AgentTrajectoryInput with seven trajectory metrics (task_completion, step_efficiency, tool_selection_accuracy, trajectory_score, goal_progress, action_safety, reasoning_quality), all of which take the full step list, the available tools, and the expected goal as input. The per-step rubric is the gate. The trajectory metric is the truth.
Should I rely on BFCL and tau-bench for tool-calling eval?
Use them as the public floor, not the private ceiling. BFCL (Berkeley Function Calling Leaderboard) tests function calling across AST checks, executable checks, and an irrelevance-detection bucket on a public tool registry. tau-bench tests multi-turn agent interactions in airline and retail environments, with pass^k measuring reliability across k independent rollouts (not k retry attempts). Both are useful as a baseline indicator that your underlying model can call tools at all. Neither tells you whether the agent handles your tool registry, your argument schemas, your error codes, or your business policy. The private eval set is the one that gates production. Build one stratified by tool, by argument-edge-case bucket, and by error code, and gate CI on its per-layer scores.
How does traceAI instrument tool calls without boilerplate?
The @tracer.tool decorator wraps a Python function and emits a span with fi.span.kind=TOOL plus the GenAI tool attributes (gen_ai.tool.name, gen_ai.tool.call.id, gen_ai.tool.call.arguments, gen_ai.tool.call.result, gen_ai.tool.description, gen_ai.tool.type). The decorator auto-infers the tool description from the Python docstring and the parameter schema from type annotations via _get_jsonschema_type. No manual span creation, no manual attribute set. traceAI ships across 50+ AI surfaces (4 languages: Python, TypeScript, Java, C#) and 14 span kinds including first-class TOOL, AGENT, RETRIEVER, GUARDRAIL, A2A_CLIENT, and A2A_SERVER. The same tool span carries latency in OpenTelemetry-standard duration attributes, so per-tool p50, p95, and p99 are a Grafana query away.
How does Future AGI ship the tool-calling eval stack?
Future AGI ships the eval stack as a package. The ai-evaluation SDK (Apache 2.0) is the code-first surface: deterministic function-call metrics (function_name_match, parameter_validation, function_call_accuracy, function_call_exact_match) run sub-millisecond locally; LLM-judge templates (EvaluateFunctionCalling alias LLMFunctionCalling, TaskCompletion) cover the semantic and trajectory layers; the agent-trajectory suite (TaskCompletion, StepEfficiency, ToolSelectionAccuracy, TrajectoryScore, GoalProgress, ActionSafety, ReasoningQuality) operates on AgentTrajectoryInput for multi-step trajectories. The Future AGI Platform adds self-improving evaluators tuned by thumbs feedback, an in-product authoring agent for custom rubrics, and classifier-backed evals at lower per-eval cost than Galileo Luna-2. traceAI carries the same rubrics as span-attached scores on live traces via EvalTag. Error Feed clusters tool-call regressions and writes the immediate_fix. The Agent Command Center (20+ providers) enforces per-virtual-key AllowedTools and DeniedTools at the gateway boundary, with an MCP security plugin and Protect's prompt_injection adapter on every tool input and output. SOC 2 Type II, HIPAA, GDPR, and CCPA certified per the trust page (ISO/IEC 27001 in active audit).