Evaluating LLM Tool Use in 2026: The Four-Step Contract
Evaluating LLM tool use is a four-step contract: decide to call, pick the tool, build the arguments, integrate the result. Score each step independently or you'll never know which one broke.
A customer support model picks refund_order, fills the schema, validates clean, and pushes $54 to the right account. The user asked about $5,400. Tool selection scored 1.0. Argument schema validated. The audit-side number is off by two orders of magnitude and the only signal that something broke was a support ticket forty minutes later.
LLM tool use is a four-step contract: the model has to (1) decide whether to call any tool at all, (2) pick the right tool from the catalog, (3) construct correct arguments against the schema and the user’s intent, and (4) integrate the result into a response without paraphrasing the payload into nonsense. Get one of the four wrong and the demo passes anyway. Score it as one number and you’ll never know which step broke. This guide walks each step, the rubric that catches it, where BFCL and τ-bench fit (and where they don’t), and the production observability wiring that keeps the contract honest after release. Sibling read for the full-agent angle: Evaluating Tool-Calling Agents in 2026.
TL;DR — the four-step contract
| Step | What you measure | Deterministic gate | LLM-judge rubric |
|---|---|---|---|
| 1. Call or no-call | Did the model fire a tool at all (or correctly stay quiet) | Irrelevance precision/recall on a 10-15% no-call slice | EvaluateFunctionCalling with irrelevance examples |
| 2. Tool selection | Right tool from the catalog | function_name_match, F1 per tool | EvaluateFunctionCalling on semantic name choice |
| 3. Argument construction | Schema valid AND semantically grounded | parameter_validation, Pydantic on the input schema | CustomLLMJudge on argument plausibility |
| 4. Result integration | Did the model use the payload truthfully | None | Groundedness + ContextAdherence with tool.result as context |
Non-negotiables: every eval set carries a no-call slice, schema validation runs before any LLM-judge does, the Groundedness rubric on tool output is a first-class score (not a sidecar), and per-step CI gates publish failing-step names rather than aggregate red/green.
Why one number hides three failures
Four failure modes show up in postmortems, and they collapse into a single “the agent broke” line when you score one number.
Over-call. The model fires search_orders on a greeting because the prompt was tuned harder on selection. Tool-selection metrics never see it; the gold was always a tool.
Wrong tool, right family. get_order_status vs get_customer_status. Same schema shape, different downstream system. F1 catches this if the eval set has both endpoints; lump them with the irrelevance bucket and the regression hides.
Right tool, wrong arguments. departure_date="next Friday" schema-validates if the field accepts strings. customer_id="me" returns someone else’s account. Schema gates the type class; the semantic class needs a judge that reads the conversation.
Right call, wrong narration. Tool returned correctly. Model paraphrased amount_cents=4500 as $54.00 and the user trusted the answer. Or it substituted prior knowledge — get_account_balance returned $124 and the model said “above the $200 threshold.” Almost every public eval skips this step.
Score the four separately and the vocabulary collapses from “the agent failed” to “step 3 regressed on the date-string semantic judge for the flight-booking path.” One bisect instead of three days. The LLM evaluation playbook covers per-dimension scoring more broadly.
Step 1: decide whether to call any tool
Most eval sets only contain rows where the gold is a tool call, so the over-call regression — a new prompt revision makes the model bolder about firing search on every input — ships untested.
Reserve 10-15 percent of the eval set for no-call cases: greetings, clarifications, in-model factual questions, refusal-worthy asks, off-topic chatter. Score precision and recall on the binary call/no-call decision. Gate CI on per-class accuracy.
# A no-call eval row
{
    "input": "Hi, how does this work?",
    "expected_tools_called": [],
    "available_tools": ["refund_order", "search_orders", "create_ticket"],
    "rationale": "Greeting. Model should answer in-line, no tool fire.",
}
BFCL (Berkeley Function Calling Leaderboard) added an explicit irrelevance-detection bucket for exactly this reason. A model that scores 0.95 on tool-name F1 and 0.40 on irrelevance over-calls; without the irrelevance score, you only learn in production.
The deterministic rubric is the easy half: count tool calls, compare to the gold count, aggregate. The judge case is harder: it covers inputs where both “no call” and “call” are defensible. EvaluateFunctionCalling (alias LLMFunctionCalling) handles the rubric case; for sharper control, wrap a CustomLLMJudge that scores the decision against the available tools and the conversation.
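The deterministic half of step 1 can be sketched in plain Python. This is a minimal illustration, not part of any SDK: the `CallDecision` row type and the `call_decision_metrics` helper are hypothetical names, and each eval row is assumed to be already reduced to two booleans (did the gold call a tool, did the model call a tool).

```python
from dataclasses import dataclass


@dataclass
class CallDecision:
    """One eval row reduced to the binary step-1 decision (hypothetical shape)."""
    gold_called: bool       # True if the gold label contains at least one tool call
    predicted_called: bool  # True if the model emitted at least one tool call


def call_decision_metrics(rows: list[CallDecision]) -> dict[str, float]:
    """Precision and recall on the 'call' class, plus accuracy on the no-call slice."""
    tp = sum(r.gold_called and r.predicted_called for r in rows)
    fp = sum((not r.gold_called) and r.predicted_called for r in rows)  # over-call
    fn = sum(r.gold_called and (not r.predicted_called) for r in rows)  # under-call
    tn = sum((not r.gold_called) and (not r.predicted_called) for r in rows)

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    return {
        "call_precision": ratio(tp, tp + fp),
        "call_recall": ratio(tp, tp + fn),      # same as accuracy on gold-call rows
        "no_call_accuracy": ratio(tn, tn + fp),  # accuracy on the no-call slice
    }
```

With eight correct calls and one over-call out of two no-call rows, recall stays at 1.0 while no-call accuracy drops to 0.5, which is the regression a selection-only metric would miss.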
Step 2: pick the right tool
Once the model has decided to call, the question is which one. Pull the chosen tool name, compare to the gold label, aggregate F1 per tool so a catalog of 28 endpoints does not hide a regression on one rare endpoint behind the global mean.
from fi.evals import evaluate

result = evaluate(
    "function_name_match",
    output={"function_name": predicted_tool},
    expected={"function_name": ground_truth_tool},
)
# result.score = 1.0 exact match, 0.0 otherwise
The Future AGI ai-evaluation SDK ships four deterministic function-call metrics in fi.evals.metrics.function_calling: function_name_match, parameter_validation, function_call_accuracy, and function_call_exact_match. All sub-millisecond per call, all Apache 2.0.
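Aggregating per-tool F1 from the exact-match results might look like the sketch below. It is plain Python and independent of the SDK; `per_tool_f1` is a hypothetical helper that treats tool selection as single-label multi-class classification, one gold tool and one predicted tool per row.

```python
from collections import Counter


def per_tool_f1(gold: list[str], predicted: list[str]) -> dict[str, float]:
    """F1 for each tool name, treating selection as multi-class classification."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1  # the gold tool was missed
            fp[p] += 1  # the predicted tool was a false alarm
    scores = {}
    for tool in set(gold) | set(predicted):
        denom = 2 * tp[tool] + fp[tool] + fn[tool]
        scores[tool] = 2 * tp[tool] / denom if denom else 0.0
    return scores
```

If three refund_order rows are correct and the single void_charge row is predicted as refund_order, overall accuracy is 0.75, but void_charge scores 0.0 and refund_order drops to about 0.86. The per-tool view is what surfaces the rare-endpoint regression.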
The failure that hides at this step is the near-miss family. get_user vs get_customer. cancel_order vs cancel_booking. refund_order vs void_charge. The judge has to know that two tools with similar names cover different downstream systems and that the model’s pick has to map to the user’s stated entity, not just keyword overlap. For ambiguous cases, fall back to a rubric judge. Future AGI’s EvaluateFunctionCalling accepts the tool catalog with descriptions and scores semantic correctness directly:
from fi.evals import Evaluator
from fi.evals.templates import EvaluateFunctionCalling
from fi.testcases import TestCase

evaluator = Evaluator()
tc = TestCase(
    input="Refund order 8821 to the original card",
    expected_tool_call={"name": "refund_order", "arguments": {"order_id": "8821"}},
    actual_tool_call={"name": "refund_order",
                      "arguments": {"order_id": "8821", "reason": "customer request"}},
    available_tools=[
        {"name": "refund_order", "description": "Refund an order to the original payment method"},
        {"name": "cancel_order", "description": "Cancel an open order before it ships"},
        {"name": "void_charge", "description": "Reverse a charge before settlement"},
    ],
)
score = evaluator.evaluate(eval_templates=[EvaluateFunctionCalling()], inputs=tc)
Step 3: construct correct arguments
Argument failures fall into three buckets: schema mismatch (wrong type, missing required field), semantic mismatch (right schema, wrong value), and edge-case handling (null, empty array, unicode, type coercion across the model-to-tool boundary). Schema gates the first; a judge with conversation context gates the second; a regression suite catches the third.
Schema validation runs first and is deterministic. Pydantic on the model’s output is the cheapest possible gate.
from pydantic import BaseModel, Field, ValidationError

class RefundOrderArgs(BaseModel):
    order_id: str = Field(pattern=r"^\d{6,}$")
    amount_cents: int = Field(gt=0, le=10_000_00)  # 1,000,000 cents = $10,000 cap
    reason: str = Field(min_length=3, max_length=200)

schema_errors = []  # stays empty when validation passes
try:
    args = RefundOrderArgs.model_validate(predicted_args)
except ValidationError as e:
    schema_errors = e.errors()  # emit as span attribute, gate CI
Run this on every call. Zero LLM cost. The deterministic gate catches type errors, missing fields, pattern mismatches, and range violations.
Semantic correctness needs the LLM-judge. departure_date="2026-01-01" schema-validates and is wrong if the user said “next Friday.” customer_id="42" is a valid string and refunds the wrong account if customer_id="841" was the one mentioned two turns ago. Wrap a CustomLLMJudge that gets the last K turns plus the proposed call and scores whether each argument is grounded in prior turns:
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge

argument_plausibility = CustomLLMJudge(
    name="ArgumentPlausibility",
    instructions=(
        "Score whether the arguments in the proposed tool call are grounded in the conversation. "
        "Identifiers (customer_id, order_id) must appear in prior turns. Amounts must be within "
        "an order of magnitude of any user-stated value. Dates must align with user intent."
    ),
    grading_criteria={
        "5": "All arguments grounded; no hallucinated identifiers; amounts and dates match intent.",
        "3": "Arguments parse, but at least one value is weakly grounded.",
        "0": "At least one argument is hallucinated or contradicts the conversation.",
    },
)
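Two of the rules in that judge prompt (identifiers must appear in prior turns, amounts must sit within an order of magnitude of a user-stated value) can also be approximated with a cheap deterministic pre-check that runs before the judge. The sketch below is an assumption, not an SDK feature: the function names are hypothetical, it only recognizes dollar amounts written with a `$` sign, and it is meant to complement the judge rather than replace it.

```python
import re


def identifier_grounded(value: str, turns: list[str]) -> bool:
    """True if the identifier appears as a whole token in any prior turn."""
    pattern = re.compile(rf"(?<!\w){re.escape(value)}(?!\w)")
    return any(pattern.search(turn) for turn in turns)


def stated_dollar_amounts(turns: list[str]) -> list[float]:
    """Extract dollar amounts such as $5,400 or $54.00 from the conversation."""
    found = []
    for turn in turns:
        for match in re.findall(r"\$(\d[\d,]*(?:\.\d+)?)", turn):
            found.append(float(match.replace(",", "")))
    return found


def amount_within_order_of_magnitude(amount_cents: int, turns: list[str]) -> bool:
    """True if the amount is within 10x of some user-stated dollar value."""
    dollars = amount_cents / 100
    stated = stated_dollar_amounts(turns)
    if not stated:
        return True  # nothing to compare against; defer to the judge
    return any(s / 10 <= dollars <= s * 10 for s in stated if s > 0)
```

Against a turn that mentions $5,400, a proposed amount_cents of 5400 ($54.00) fails the check and 540000 passes. Anything the pre-check cannot decide (relative dates, amounts stated in words) still goes to the judge.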
Build a private edge-case suite per tool. Null on optional fields, empty array where the schema permits but the tool returns 500, unicode in identifiers, the time-zone case on every date field, the currency case on every monetary field. These are the failures BFCL cannot see — they are private to your registry. The LLM dataset management tools post covers tooling for versioning these sets alongside CI.
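One way to seed such a suite is to expand a known-good argument set into one variant per edge value. The sketch below is illustrative only: `EDGE_VALUES`, the field-kind labels, and `edge_case_variants` are hypothetical, and the edge values are placeholders to be replaced with cases from your own incidents.

```python
from typing import Any

# Hypothetical edge values per field kind; extend with cases from your own incidents.
EDGE_VALUES: dict[str, list[Any]] = {
    "optional": [None],
    "array": [[]],
    "identifier": ["ordér-8821", "８８２１"],  # accented and full-width characters
    # The same instant expressed in two time zones.
    "date": ["2026-01-01T23:30:00-08:00", "2026-01-02T07:30:00Z"],
    "currency": ["USD", "JPY"],
}


def edge_case_variants(base_args: dict[str, Any],
                       field_kinds: dict[str, str]) -> list[dict[str, Any]]:
    """Return one variant of base_args per (field, edge value) pair."""
    variants = []
    for field, kind in field_kinds.items():
        for value in EDGE_VALUES.get(kind, []):
            variant = dict(base_args)  # shallow copy; only one field changes
            variant[field] = value
            variants.append(variant)
    return variants
```

Each variant changes exactly one field, so a failure points at a single field and edge value rather than at the whole call.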
Step 4: integrate the result
The tool returned. The model has the payload. Three failure patterns show up.
The model paraphrases the payload with a number changed. Tool returns {"amount_cents": 540_000}, model says “your refund of $54.00 is processing.” Off by two orders of magnitude, schema-clean, the user trusts the answer.
The model substitutes prior knowledge. get_account_balance returns {"balance_cents": 12_400}. The model “knows” the user has a standard $200 minimum and replies “your balance is above the $200 threshold.” The tool result was never read.
The model drifts off the result across turns. The flight-booking model quotes the right itinerary on turn 1, then invents a baggage policy on turn 3 that contradicts the airline_policy payload from two turns ago. Per-call rubrics never see this; multi-turn Groundedness does.
The rubric is Groundedness with the tool payload as the context slot, not a retrieved corpus. ContextAdherence and ChunkAttribution work the same way: chunk the result into JSON fields and score whether each claim in the response maps to one.
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextAdherence, ChunkAttribution
from fi.testcases import TestCase
import json

evaluator = Evaluator()
for tool_call in result.tool_calls:
    tc = TestCase(
        input=ex.user_message,
        output=result.response,
        context=json.dumps(tool_call.result),  # the tool payload, not corpus
    )
    scores = evaluator.evaluate(
        eval_templates=[Groundedness(), ContextAdherence(), ChunkAttribution()],
        inputs=tc,
    )
Score this step on every response where a tool result feeds the answer. The Platform’s classifier-backed cascade runs Groundedness at lower per-eval cost than Galileo Luna-2, which matters once the rubric is on every TOOL span at scale.
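Chunking a tool result into JSON fields, as described for ChunkAttribution, could be done with a small flattening helper. This is a sketch under assumptions: `flatten_payload` and `payload_chunks` are hypothetical names, and the `path: value` string format is one arbitrary choice for how a field is presented to the rubric.

```python
import json
from typing import Any, Iterator


def flatten_payload(payload: Any, prefix: str = "") -> Iterator[tuple[str, Any]]:
    """Yield (json_path, leaf_value) pairs for every leaf in a tool payload."""
    if isinstance(payload, dict):
        for key, value in payload.items():
            path = f"{prefix}.{key}" if prefix else key
            yield from flatten_payload(value, path)
    elif isinstance(payload, list):
        for index, value in enumerate(payload):
            yield from flatten_payload(value, f"{prefix}[{index}]")
    else:
        yield prefix, payload


def payload_chunks(payload: Any) -> list[str]:
    """Render each leaf as a 'path: value' string, one chunk per field."""
    return [f"{path}: {json.dumps(value)}" for path, value in flatten_payload(payload)]
```

A payload like `{"refund": {"amount_cents": 540000}}` becomes the single chunk `refund.amount_cents: 540000`, so a claim in the response can be attributed to one named field. Empty objects and arrays produce no chunks in this sketch.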
BFCL and τ-bench: the public floor
Two public benchmarks anchor model selection in 2026. Use them. Do not gate production on them.
BFCL (Berkeley Function Calling Leaderboard) measures function calling across an AST track (syntactic correctness), an executable track (the call actually runs on a real endpoint), and the irrelevance-detection bucket that steps 1 and 2 depend on. The breakdown matters more than the headline number. A model that aces AST and tanks irrelevance over-calls; a model that aces AST and tanks executable generates plausible but non-running calls. Treat BFCL as a model-selection signal, not a release gate.
τ-bench measures multi-turn agent behavior in airline and retail environments. The user is LLM-simulated, the model has tools plus a domain policy, and the headline metric is pass^k — the fraction of independent k-rollouts that all succeed (not k retries on the same input). The metric exposes how much consistency degrades when nondeterminism stacks. Even GPT-4o lands below 25 percent at pass^8 on retail. Multi-turn tool-using LLMs are nondeterminism amplifiers and pass^k is one of the few public signals that quantifies the cost.
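For a private eval set, pass^k can be estimated from n recorded rollouts per task. The sketch below assumes the combinatorial estimator C(c, k) / C(n, k) for c successes in n trials, averaged over tasks, which is how the τ-bench paper is generally read; the function names are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k independent rollouts of one task all succeed.

    Uses the combinatorial estimator C(c, k) / C(n, k) for c successes in n trials.
    """
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must satisfy 0 <= successes <= trials")
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(per_task: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in per_task) / len(per_task)
```

A task that succeeds on 6 of 8 rollouts scores 0.75 at k=1, about 0.54 at k=2, and 0.0 at k=8, which shows how quickly a respectable single-run success rate erodes once every rollout has to pass.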
Public benchmarks tell you whether the underlying model can call tools at all. They tell you nothing about your registry, schemas, error codes, or business policy. The private eval set is the one that gates production. Build it stratified by tool, argument edge-case bucket, and call/no-call gold. Promote failing production traces into it weekly.
# Stratified shape for a private LLM tool-use eval set
{
    "by_step": {
        "step_1_no_call": 0.12,             # call vs no-call regression slice
        "step_2_selection": 0.38,           # tool-name F1, with near-miss families
        "step_3_arguments": 0.30,           # schema + semantic + edge cases
        "step_4_result_integration": 0.20,  # Groundedness on tool payload
    },
    "by_tool": {"refund_order": 80, "get_order_status": 60, ...},  # >= 30 per tool
    "by_difficulty": {"easy": 0.4, "medium": 0.4, "hard": 0.2},
}
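Turning the by_step fractions into concrete case counts needs a rounding rule that still sums to the target size. A minimal sketch using largest-remainder rounding follows; `allocate_cases` is a hypothetical helper, not part of any SDK.

```python
def allocate_cases(total: int, fractions: dict[str, float]) -> dict[str, int]:
    """Split `total` eval cases across strata using largest-remainder rounding."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1.0")
    exact = {name: total * share for name, share in fractions.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining cases to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts
```

For a 200-case set, the fractions above come out to 24 no-call, 76 selection, 60 argument, and 40 result-integration cases.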
Production observability for the contract
The four-step contract has to hold in production, not just in CI. The wiring is one trace provider, one decorator, and four EvalTags. The @tracer.tool decorator wraps a Python function and auto-infers the tool description from the docstring and the parameter schema from type annotations via _get_jsonschema_type.
from fi_instrumentation import register, get_tracer
from fi_instrumentation.fi_types import ProjectType

register(project_name="support_agent", project_type=ProjectType.OBSERVE)
tracer = get_tracer(__name__)

@tracer.tool
def refund_order(order_id: str, amount_cents: int, reason: str) -> dict:
    """Refund an order to the original payment method."""
    ...
Every call lands as a span with fi.span.kind=TOOL plus the GenAI attributes — gen_ai.tool.name, gen_ai.tool.call.arguments, gen_ai.tool.call.result, gen_ai.tool.description. Latency rides on the standard OpenTelemetry duration attribute, so per-tool p50, p95, p99 are one Grafana query away. traceAI ships across 50+ AI surfaces in Python, TypeScript, Java, and C# with 14 span kinds.
EvalTag attaches the same rubric you ran in CI as a span-attached scorer. The collector runs evals server-side post-export at zero inline latency, which is what makes this viable on every TOOL span at production scale. One EvalTag per step:
- EVALUATE_LLM_FUNCTION_CALLING on TOOL spans for steps 1-3 (the no-call slice rides on the irrelevance examples in the rubric input)
- GROUNDEDNESS on the parent LLM span with tool result wired in as context for step 4
A per-step CI gate as one fixture, then the same fixture shape rides on live traces via EvalTag:
# config.yaml for `fi run`
assertions:
  - "call_or_no_call_precision.score >= 0.95 for at_least 95% of cases"
  - "tool_name_f1.score >= 0.95 for at_least 95% of cases"
  - "argument_validation.score >= 0.90 for at_least 90% of cases"
  - "argument_plausibility.score >= 0.85 for at_least 85% of cases"
  - "result_groundedness.score >= 0.90 for at_least 90% of cases"
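To make the gate semantics concrete, here is one plausible reading of an assertion such as "score >= 0.90 for at_least 90% of cases", written as plain Python. This is an assumption about how the thresholds behave, not the `fi run` implementation; `check_gate` and `failing_steps` are hypothetical names.

```python
def check_gate(scores: list[float], min_score: float, min_fraction: float) -> bool:
    """True if at least `min_fraction` of cases score `min_score` or higher."""
    if not scores:
        return False  # an empty slice should never pass silently
    passing = sum(score >= min_score for score in scores)
    return passing / len(scores) >= min_fraction


def failing_steps(results: dict[str, list[float]],
                  gates: dict[str, tuple[float, float]]) -> list[str]:
    """Return the names of the gates that failed, in the order they were declared."""
    return [
        name for name, (min_score, min_fraction) in gates.items()
        if not check_gate(results.get(name, []), min_score, min_fraction)
    ]
```

Publishing the list from `failing_steps` instead of a single red or green status is what lets a CI failure name the step that regressed.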
The Agent Command Center gateway returns x-prism-cost, x-prism-latency-ms, and x-prism-model-used on every call, so cost and latency per step land in the trace without extra plumbing. The AI agent cost optimization guide covers per-step budgeting; for runtime detection of contract violations the AI agent failure detection tools post compares patterns.
How Future AGI ships the four-step eval stack
Future AGI ships the stack as a package. Start with the SDK for code-defined per-step scoring. Graduate to the Platform when the loop needs self-improving rubrics and classifier-backed cost economics.
- ai-evaluation SDK (Apache 2.0): deterministic metrics (function_name_match, parameter_validation, function_call_accuracy, function_call_exact_match) for the cheap layer of steps 1-3; EvaluateFunctionCalling (alias LLMFunctionCalling) for selection and argument rubrics; CustomLLMJudge for argument plausibility; Groundedness, ContextAdherence, ChunkAttribution for step 4 with tool payload as context; fi run CLI with CI assertions.
- Future AGI Platform: self-improving evaluators tuned by thumbs feedback on production traces; in-product authoring agent for custom rubrics; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
- traceAI (Apache 2.0): @tracer.tool decorator with auto-inferred schemas; 14 span kinds across 50+ AI surfaces in Python, TypeScript, Java, C#; EvalTag wires the rubric to the span at zero inference latency.
- Error Feed: HDBSCAN clusters tool-use regressions over span embeddings; a Sonnet 4.5 Judge writes the 5-category 30-subtype taxonomy (Decision Errors: wrong-tool, invalid-params, missing-tool-call), the 4-D trace score, and an immediate_fix.
- Agent Command Center: 100+ providers; per-virtual-key AllowedTools and DeniedTools at the gateway boundary; MCP security plugin with allowed_servers, blocked_tools, per-tool rate limits; Protect’s prompt_injection adapter on every tool input and output. SOC 2 Type II, HIPAA, GDPR, and CCPA certified per futureagi.com/trust; ISO/IEC 27001 in active audit.
MCP is worth one line. Tool descriptions arrive at runtime from servers you don’t own. The gateway’s mcpsec plugin scans descriptions for injection before the model sees them; Protect’s prompt_injection adapter runs on tool outputs before the model integrates them — the step 4 attack surface specifically. The AI gateway and MCP tool observability writeup covers the dual-scanner architecture.
Three honest tradeoffs
Per-step scoring costs more than aggregate task-completion. Five rubrics per case, not one. The payoff: when CI fails, the failing step name is the root cause, not the start of a triage session. Ship the deterministic layers (schema validation, function_name_match, call-vs-no-call counts) first and turn on the LLM-judge layers once trace volume justifies the eval bill.
Groundedness on tool payloads is noisier than Groundedness on retrieved corpus. JSON has fewer surface cues than prose, and the rubric reasons over fields rather than passages. Pin a small human-labeled calibration set and re-tune monthly; the LLM-as-a-judge deep-dive covers judge calibration.
The no-call slice feels expensive to build. Greetings and clarifications are cheap to generate but a curator has to label “what should the model do here,” and the labels are softer than tool-call golds. Build it once, treat it as a fixture, refresh quarterly from production traces where the model fired a tool on a turn that didn’t need one.
Ready to wire the contract? Pick one tool with a write side effect, build a 60-case eval set across the four steps (10 no-call, 20 selection with at least one near-miss family, 20 argument cases including 6 semantic, 10 result integration), wire function_name_match, parameter_validation, a CustomLLMJudge on argument plausibility, and Groundedness against the tool payload into a pytest fixture against the ai-evaluation SDK, then attach the same templates as EvalTag scorers via traceAI so the same rubric rides on every production call.