Evaluating LLM Tool Use in 2026: The Four-Step Contract

Evaluating LLM tool use is a four-step contract: decide to call, pick the tool, build the arguments, integrate the result. Score each step independently or you'll never know which one broke.

A customer support model picks refund_order, fills the schema, validates clean, and pushes $54 to the right account. The user asked about $5,400. Tool selection scored 1.0. Argument schema validated. The audit-side number is off by two orders of magnitude, and the only signal that something broke was a support ticket forty minutes later.

LLM tool use is a four-step contract: the model has to (1) decide whether to call any tool at all, (2) pick the right tool from the catalog, (3) construct correct arguments against the schema and the user’s intent, and (4) integrate the result into a response without paraphrasing the payload into nonsense. Get one of the four wrong and the demo passes anyway. Score it as one number and you’ll never know which step broke. This guide walks each step, the rubric that catches it, where BFCL and τ-bench fit (and where they don’t), and the production observability wiring that keeps the contract honest after release. Sibling read for the full-agent angle: Evaluating Tool-Calling Agents in 2026.

TL;DR — the four-step contract

| Step | What you measure | Deterministic gate | LLM-judge rubric |
|---|---|---|---|
| 1. Call or no-call | Did the model fire a tool at all (or correctly stay quiet) | Irrelevance precision/recall on a 10-15% no-call slice | EvaluateFunctionCalling with irrelevance examples |
| 2. Tool selection | Right tool from the catalog | function_name_match, F1 per tool | EvaluateFunctionCalling on semantic name choice |
| 3. Argument construction | Schema valid AND semantically grounded | parameter_validation, Pydantic on the input schema | CustomLLMJudge on argument plausibility |
| 4. Result integration | Did the model use the payload truthfully | None | Groundedness + ContextAdherence with tool.result as context |

Non-negotiables: every eval set carries a no-call slice, schema validation runs before any LLM-judge does, the Groundedness rubric on tool output is a first-class score (not a sidecar), and per-step CI gates publish failing-step names rather than aggregate red/green.

Why one number hides four failures

Four failure modes show up in postmortem and they collapse into a single “the agent broke” line when you score one number.

Over-call. The model fires search_orders on a greeting because the prompt was tuned harder on selection. Tool-selection metrics never see it; the gold was always a tool.

Wrong tool, right family. get_order_status vs get_customer_status. Same schema shape, different downstream system. F1 catches this if the eval set has both endpoints; lump them with the irrelevance bucket and the regression hides.

Right tool, wrong arguments. departure_date="next Friday" schema-validates if the field accepts strings. customer_id="me" returns someone else’s account. Schema gates the type class; the semantic class needs a judge that reads the conversation.

Right call, wrong narration. Tool returned correctly. Model paraphrased amount_cents=4500 as $54.00 and the user trusted the answer. Or it substituted prior knowledge — get_account_balance returned $124 and the model said “above the $200 threshold.” Almost every public eval skips this step.

Score the four separately and the vocabulary collapses from “the agent failed” to “step 3 regressed on the date-string semantic judge for the flight-booking path.” One bisect instead of three days. The LLM evaluation playbook covers per-dimension scoring more broadly.
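
One way to make that vocabulary concrete is a small report that names the failing steps instead of returning a single pass/fail. The sketch below is illustrative only: the step names, the score dictionary shape, and the thresholds are assumptions made for this example, not part of any SDK.

```python
from dataclasses import dataclass

# Hypothetical per-step thresholds; tune them to your own eval set.
STEP_THRESHOLDS = {
    "step_1_call_or_no_call": 0.95,
    "step_2_tool_selection": 0.95,
    "step_3_arguments": 0.90,
    "step_4_result_integration": 0.90,
}


@dataclass
class StepReport:
    passed: bool
    failing_steps: list[str]


def report_failing_steps(step_scores: dict[str, float]) -> StepReport:
    """Compare each step's mean score to its threshold and name the failures."""
    failing = [
        step
        for step, threshold in STEP_THRESHOLDS.items()
        if step_scores.get(step, 0.0) < threshold  # a missing score counts as a failure
    ]
    return StepReport(passed=not failing, failing_steps=failing)


report = report_failing_steps({
    "step_1_call_or_no_call": 0.97,
    "step_2_tool_selection": 0.96,
    "step_3_arguments": 0.81,
    "step_4_result_integration": 0.93,
})
print(report.failing_steps)  # ['step_3_arguments']
```

Publishing `failing_steps` in the CI output, rather than only `passed`, is what turns a red build into a starting point for the bisect.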

Step 1: decide whether to call any tool

Most eval sets only contain rows where the gold is a tool call, so the over-call regression — a new prompt revision makes the model bolder about firing search on every input — ships untested.

Reserve 10-15 percent of the eval set for no-call cases: greetings, clarifications, in-model factual questions, refusal-worthy asks, off-topic chatter. Score precision and recall on the binary call/no-call decision. Gate CI on per-class accuracy.

# A no-call eval row
{
  "input": "Hi, how does this work?",
  "expected_tools_called": [],
  "available_tools": ["refund_order", "search_orders", "create_ticket"],
  "rationale": "Greeting. Model should answer in-line, no tool fire.",
}

BFCL (Berkeley Function Calling Leaderboard) added an explicit irrelevance-detection bucket for exactly this reason. A model that scores 0.95 on tool-name F1 and 0.40 on irrelevance over-calls; without the irrelevance score, you only learn in production.

The deterministic rubric is the easy half: count tool calls, compare to the gold count, aggregate. The judged half is harder: it covers turns where both “no call” and “call” are reasonable. EvaluateFunctionCalling (alias LLMFunctionCalling) handles the rubric case; for sharper control, wrap a CustomLLMJudge that scores the decision against the available tools and the conversation.
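
The deterministic half can be sketched in a few lines of plain Python. This is a minimal example under stated assumptions: `expected_tools_called` follows the eval row shown above, while `actual_tools_called` is a hypothetical key for the calls the model actually made.

```python
def call_decision_metrics(rows: list[dict]) -> dict[str, float]:
    """Precision and recall for the binary 'did the model call a tool' decision."""
    tp = fp = fn = tn = 0
    for row in rows:
        expected_call = bool(row["expected_tools_called"])
        actual_call = bool(row["actual_tools_called"])  # assumed key name
        if expected_call and actual_call:
            tp += 1
        elif actual_call:
            fp += 1  # over-call: tool fired on a no-call row
        elif expected_call:
            fn += 1  # under-call: model stayed quiet when a tool was needed
        else:
            tn += 1
    return {
        "call_precision": tp / (tp + fp) if tp + fp else 0.0,
        "call_recall": tp / (tp + fn) if tp + fn else 0.0,
        "no_call_accuracy": tn / (tn + fp) if tn + fp else 0.0,
    }


rows = [
    {"expected_tools_called": [], "actual_tools_called": ["search_orders"]},
    {"expected_tools_called": [], "actual_tools_called": []},
    {"expected_tools_called": ["refund_order"], "actual_tools_called": ["refund_order"]},
    {"expected_tools_called": ["search_orders"], "actual_tools_called": []},
]
metrics = call_decision_metrics(rows)
```

This only scores whether a call happened, not which tool was chosen; that separation is deliberate so an over-call regression cannot hide inside a selection metric.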

Step 2: pick the right tool

Once the model has decided to call, the question is which one. Pull the chosen tool name, compare to the gold label, aggregate F1 per tool so a catalog of 28 endpoints does not hide a regression on one rare endpoint behind the global mean.

from fi.evals import evaluate

result = evaluate("function_name_match",
    output={"function_name": predicted_tool},
    expected={"function_name": ground_truth_tool})
# result.score = 1.0 exact match, 0.0 otherwise

The Future AGI ai-evaluation SDK ships four deterministic function-call metrics in fi.evals.metrics.function_calling: function_name_match, parameter_validation, function_call_accuracy, and function_call_exact_match. All sub-millisecond per call, all Apache 2.0.
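
The per-tool aggregation itself needs no SDK. A minimal sketch in plain Python, assuming the eval harness can produce (gold tool, predicted tool) pairs; the function name and the sample data are made up for illustration:

```python
from collections import Counter


def per_tool_f1(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """F1 per tool from (gold_tool, predicted_tool) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, predicted in pairs:
        if gold == predicted:
            tp[gold] += 1
        else:
            fn[gold] += 1        # the gold tool was missed
            fp[predicted] += 1   # the predicted tool fired wrongly
    scores = {}
    for tool in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[tool] + fp[tool] + fn[tool]
        scores[tool] = 2 * tp[tool] / denom if denom else 0.0
    return scores


pairs = [
    ("refund_order", "refund_order"),
    ("refund_order", "refund_order"),
    ("refund_order", "refund_order"),
    ("void_charge", "refund_order"),   # near-miss on a rare endpoint
    ("cancel_order", "cancel_order"),
]
f1 = per_tool_f1(pairs)
accuracy = sum(gold == pred for gold, pred in pairs) / len(pairs)
```

In this toy set the global accuracy is 0.8 while `void_charge` scores an F1 of 0.0, which is the kind of regression a global mean hides.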

The failure that hides at this step is the near-miss family. get_user vs get_customer. cancel_order vs cancel_booking. refund_order vs void_charge. The judge has to know that two tools with similar names cover different downstream systems and that the model’s pick has to map to the user’s stated entity, not just keyword overlap. For ambiguous cases, fall back to a rubric judge. Future AGI’s EvaluateFunctionCalling accepts the tool catalog with descriptions and scores semantic correctness directly:

from fi.evals import Evaluator
from fi.evals.templates import EvaluateFunctionCalling
from fi.testcases import TestCase

evaluator = Evaluator()
tc = TestCase(
    input="Refund order 8821 to the original card",
    expected_tool_call={"name": "refund_order", "arguments": {"order_id": "8821"}},
    actual_tool_call={"name": "refund_order",
                       "arguments": {"order_id": "8821", "reason": "customer request"}},
    available_tools=[
        {"name": "refund_order", "description": "Refund an order to the original payment method"},
        {"name": "cancel_order", "description": "Cancel an open order before it ships"},
        {"name": "void_charge", "description": "Reverse a charge before settlement"},
    ],
)
score = evaluator.evaluate(eval_templates=[EvaluateFunctionCalling()], inputs=tc)

Step 3: construct correct arguments

Argument failures fall into three buckets: schema mismatch (wrong type, missing required field), semantic mismatch (right schema, wrong value), and edge-case handling (null, empty array, unicode, type coercion across the model-to-tool boundary). Schema gates the first; a judge with conversation context gates the second; a regression suite catches the third.

Schema validation runs first and is deterministic. Pydantic on the model’s output is the cheapest possible gate.

from pydantic import BaseModel, Field, ValidationError

class RefundOrderArgs(BaseModel):
    order_id: str = Field(pattern=r"^\d{6,}$")
    amount_cents: int = Field(gt=0, le=10_000_00)  # $10,000 cap
    reason: str = Field(min_length=3, max_length=200)

try:
    args = RefundOrderArgs.model_validate(predicted_args)
except ValidationError as e:
    schema_errors = e.errors()  # emit as span attribute, gate CI

Run this on every call. Zero LLM cost. The deterministic gate catches type errors, missing fields, pattern mismatches, and range violations.

Semantic correctness needs the LLM-judge. departure_date="2026-01-01" schema-validates and is wrong if the user said “next Friday.” customer_id="42" is a valid string and refunds the wrong account if customer_id="841" was the one mentioned two turns ago. Wrap a CustomLLMJudge that gets the last K turns plus the proposed call and scores whether each argument is grounded in prior turns:

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge

argument_plausibility = CustomLLMJudge(
    name="ArgumentPlausibility",
    instructions=(
        "Score whether the arguments in the proposed tool call are grounded in the conversation. "
        "Identifiers (customer_id, order_id) must appear in prior turns. Amounts must be within "
        "an order of magnitude of any user-stated value. Dates must align with user intent."
    ),
    grading_criteria={
        "5": "All arguments grounded; no hallucinated identifiers; amounts and dates match intent.",
        "3": "Arguments parse, but at least one value is weakly grounded.",
        "0": "At least one argument is hallucinated or contradicts the conversation.",
    },
)
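
Before paying for the judge, a cheap deterministic pre-check can flag the most blatant case: an identifier that never appears anywhere in the conversation. The sketch below is an assumption-laden illustration (the function name, the field list, and the token rule are all hypothetical), and it complements the judge rather than replacing it.

```python
import re


def ungrounded_identifiers(
    args: dict,
    turns: list[str],
    id_fields: tuple[str, ...] = ("customer_id", "order_id"),
) -> list[str]:
    """Return identifier fields whose value never appears as a whole token in the turns."""
    # Whole-token matching keeps "42" from matching inside "842".
    tokens = set(re.findall(r"[A-Za-z0-9_-]+", " ".join(turns)))
    return [
        field
        for field in id_fields
        if field in args and str(args[field]) not in tokens
    ]


turns = [
    "I'm customer 841 and I need help with order 882100.",
    "Sure, what would you like to do with it?",
    "Refund it please.",
]
flagged = ungrounded_identifiers({"customer_id": "42", "order_id": "882100"}, turns)
```

Here `flagged` contains `customer_id`, because "42" was never said. Identifiers that legitimately come from session state or an earlier tool result would need to be added to the searched text, so treat a hit as a signal to route to the judge, not as a verdict.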

Build a private edge-case suite per tool. Null on optional fields, empty array where the schema permits but the tool returns 500, unicode in identifiers, the time-zone case on every date field, the currency case on every monetary field. These are the failures BFCL cannot see — they are private to your registry. The LLM dataset management tools post covers tooling for versioning these sets alongside CI.
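
A starting point for such a suite is to derive variants mechanically from one known-good call per tool. The helper below is a rough sketch; its name, the variant labels, and the sample arguments are invented for illustration, and it covers only the null, empty-array, and unicode cases from the list above.

```python
import copy


def edge_case_variants(base_args: dict, optional_fields: set[str]) -> dict[str, dict]:
    """Derive labeled edge-case argument sets from one known-good call."""
    variants: dict[str, dict] = {}

    def with_value(field: str, value) -> dict:
        mutated = copy.deepcopy(base_args)  # never mutate the known-good call
        mutated[field] = value
        return mutated

    for field, value in base_args.items():
        if field in optional_fields:
            variants[f"{field}:null"] = with_value(field, None)
        if isinstance(value, list):
            variants[f"{field}:empty_list"] = with_value(field, [])
        if isinstance(value, str):
            variants[f"{field}:unicode"] = with_value(field, value + "\u00e9\u4e2d")
    return variants


base = {"order_id": "882100", "line_items": ["sku-1"], "reason": "damaged"}
variants = edge_case_variants(base, optional_fields={"reason"})
```

Each variant still needs a hand-written expected outcome (accept, reject, or a specific error code); the generator only guarantees that no field is forgotten.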

Step 4: integrate the result

The tool returned. The model has the payload. Three failure patterns show up.

The model paraphrases the payload and gets a number wrong. Tool returns {"amount_cents": 540_000}, model says “your refund of $54.00 is processing.” Off by two orders of magnitude, schema-clean, the user trusts the answer.

The model substitutes prior knowledge. get_account_balance returns {"balance_cents": 12_400}. The model “knows” the user has a standard $200 minimum and replies “your balance is above the $200 threshold.” The tool result was never read.

The model drifts off the result across turns. The flight-booking model quotes the right itinerary on turn 1, then invents a baggage policy on turn 3 that contradicts the airline_policy payload from two turns ago. Per-call rubrics never see this; multi-turn Groundedness does.

The rubric is Groundedness with the tool payload as the context slot, not a retrieved corpus. ContextAdherence and ChunkAttribution work the same way: chunk the result into JSON fields and score whether each claim in the response maps to one.

from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextAdherence, ChunkAttribution
from fi.testcases import TestCase
import json

evaluator = Evaluator()
# `ex` (the eval example) and `result` (the model run) come from the surrounding eval loop
for tool_call in result.tool_calls:
    tc = TestCase(
        input=ex.user_message,
        output=result.response,
        context=json.dumps(tool_call.result),  # the tool payload, not corpus
    )
    scores = evaluator.evaluate(
        eval_templates=[Groundedness(), ContextAdherence(), ChunkAttribution()],
        inputs=tc,
    )

Score this step on every response where a tool feeds the answer. The Platform’s classifier-backed cascade runs Groundedness at lower per-eval cost than Galileo Luna-2, which matters once the rubric is on every TOOL span at scale.
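
For monetary fields specifically, a deterministic check can sit in front of the rubric and catch misreported amounts outright. The sketch below is a hypothetical helper, not an SDK feature: it assumes payload amounts live in integer fields ending in `_cents` and that the response quotes amounts in dollars with a `$` prefix.

```python
import re


def misreported_amounts(response: str, payload: dict) -> list[str]:
    """Dollar amounts quoted in the response that match no *_cents field in the payload."""
    payload_cents = {
        value
        for key, value in payload.items()
        if key.endswith("_cents") and isinstance(value, int)
    }
    flagged = []
    for match in re.findall(r"\$(\d[\d,]*(?:\.\d{2})?)", response):
        dollars, _, cents = match.replace(",", "").partition(".")
        total_cents = int(dollars) * 100 + int(cents or 0)
        if total_cents not in payload_cents:
            flagged.append(f"${match}")
    return flagged


payload = {"order_id": "882100", "amount_cents": 540_000}
bad = misreported_amounts("Your refund of $54.00 is processing.", payload)
good = misreported_amounts("Your refund of $5,400.00 is processing.", payload)
```

This flags the $54.00 paraphrase and passes the $5,400.00 one. It cannot see substituted prior knowledge or multi-turn drift, which is why the Groundedness rubric still has to run.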

BFCL and τ-bench: the public floor

Two public benchmarks anchor model selection in 2026. Use them. Do not gate production on them.

BFCL (Berkeley Function Calling Leaderboard) measures function calling across an AST track (syntactic correctness), an executable track (the call actually runs on a real endpoint), and the irrelevance-detection bucket that steps 1 and 2 depend on. The breakdown matters more than the headline number. A model that aces AST and tanks irrelevance over-calls; a model that aces AST and tanks executable generates plausible but non-running calls. Treat BFCL as a model-selection signal, not a release gate.

τ-bench measures multi-turn agent behavior in airline and retail environments. The user is LLM-simulated, the model has tools plus a domain policy, and the headline metric is pass^k: the probability that all k independent rollouts of the same task succeed, as opposed to pass@k, where a single success in k attempts is enough. The metric exposes how much consistency degrades when nondeterminism stacks. Even GPT-4o lands below 25 percent at pass^8 on retail. Multi-turn tool-using LLMs are nondeterminism amplifiers and pass^k is one of the few public signals that quantifies the cost.

Public benchmarks tell you whether the underlying model can call tools at all. They tell you nothing about your registry, schemas, error codes, or business policy. The private eval set is the one that gates production. Build it stratified by tool, argument edge-case bucket, and call/no-call gold. Promote failing production traces into it weekly.

# Stratified shape for a private LLM tool-use eval set
{
  "by_step": {
    "step_1_no_call":           0.12,   # call vs no-call regression slice
    "step_2_selection":         0.38,   # tool-name F1, with near-miss families
    "step_3_arguments":         0.30,   # schema + semantic + edge cases
    "step_4_result_integration":0.20,   # Groundedness on tool payload
  },
  "by_tool":  {"refund_order": 80, "get_order_status": 60, ...},  # >= 30 per tool
  "by_difficulty": {"easy": 0.4, "medium": 0.4, "hard": 0.2},
}
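
A short sketch of how that shape could be turned into concrete case counts and checked against the per-tool floor. The helper names are hypothetical; the fractions and the 30-case minimum are taken from the shape above.

```python
import math

BY_STEP = {
    "step_1_no_call": 0.12,
    "step_2_selection": 0.38,
    "step_3_arguments": 0.30,
    "step_4_result_integration": 0.20,
}


def allocate_cases(total: int, by_step: dict[str, float]) -> dict[str, int]:
    """Turn step fractions into integer case counts that sum to `total`."""
    if abs(sum(by_step.values()) - 1.0) > 1e-9:
        raise ValueError("step fractions must sum to 1.0")
    # The small epsilon guards against float error such as 500 * 0.12 landing just under 60.
    counts = {step: math.floor(total * frac + 1e-9) for step, frac in by_step.items()}
    remainder = total - sum(counts.values())
    # Hand any rounding remainder to the largest strata first.
    for step in sorted(by_step, key=by_step.get, reverse=True)[:remainder]:
        counts[step] += 1
    return counts


def under_sampled_tools(by_tool: dict[str, int], minimum: int = 30) -> list[str]:
    """Tools with fewer eval cases than the per-tool floor."""
    return sorted(tool for tool, n in by_tool.items() if n < minimum)


counts = allocate_cases(500, BY_STEP)
thin = under_sampled_tools({"refund_order": 80, "get_order_status": 60, "void_charge": 12})
```

Running the under-sampling check whenever production traces are promoted into the set keeps rare endpoints from silently dropping below the floor.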

Production observability for the contract

The four-step contract has to hold in production, not just in CI. The wiring is one trace provider, one decorator, and an EvalTag for each rubric. The @tracer.tool decorator wraps a Python function and auto-infers the tool description from the docstring and the parameter schema from type annotations via _get_jsonschema_type.

from fi_instrumentation import register, get_tracer
from fi_instrumentation.fi_types import ProjectType

register(project_name="support_agent", project_type=ProjectType.OBSERVE)
tracer = get_tracer(__name__)

@tracer.tool
def refund_order(order_id: str, amount_cents: int, reason: str) -> dict:
    """Refund an order to the original payment method."""
    ...

Every call lands as a span with fi.span.kind=TOOL plus the GenAI attributes — gen_ai.tool.name, gen_ai.tool.call.arguments, gen_ai.tool.call.result, gen_ai.tool.description. Latency rides on the standard OpenTelemetry duration attribute, so per-tool p50, p95, p99 are one Grafana query away. traceAI ships across 50+ AI surfaces in Python, TypeScript, Java, and C# with 14 span kinds.
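
Outside a dashboard, the same percentiles can be sanity-checked offline from exported spans. This is a sketch using only the standard library; it assumes each span has been flattened to a dict with a `gen_ai.tool.name` key and a `duration_ms` key, the latter being a hypothetical name for the exported duration.

```python
from collections import defaultdict
from statistics import quantiles


def latency_percentiles(spans: list[dict]) -> dict[str, dict[str, float]]:
    """Per-tool p50/p95/p99 latency from flattened TOOL spans."""
    by_tool: dict[str, list[float]] = defaultdict(list)
    for span in spans:
        by_tool[span["gen_ai.tool.name"]].append(span["duration_ms"])
    report = {}
    for tool, durations in by_tool.items():
        if len(durations) < 2:
            continue  # too few samples to estimate percentiles
        cuts = quantiles(durations, n=100, method="inclusive")  # 99 cut points
        report[tool] = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
    return report


spans = [
    {"gen_ai.tool.name": "refund_order", "duration_ms": float(ms)}
    for ms in range(1, 102)  # 1 ms .. 101 ms
]
latency = latency_percentiles(spans)
```

Tools with a single recorded call are skipped rather than reported, since a percentile over one sample says nothing about the tail.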

EvalTag attaches the same rubric you ran in CI as a span-attached scorer. The collector runs evals server-side post-export at zero inline latency, which is what makes this viable on every TOOL span at production scale. Two EvalTags cover the four steps:

  • EVALUATE_LLM_FUNCTION_CALLING on TOOL spans for steps 1-3 (the no-call slice rides on the irrelevance examples in the rubric input)
  • GROUNDEDNESS on the parent LLM span with tool result wired in as context for step 4

A per-step CI gate as one fixture, then the same fixture shape rides on live traces via EvalTag:

# config.yaml for `fi run`
assertions:
  - "call_or_no_call_precision.score >= 0.95 for at_least 95% of cases"
  - "tool_name_f1.score              >= 0.95 for at_least 95% of cases"
  - "argument_validation.score        >= 0.90 for at_least 90% of cases"
  - "argument_plausibility.score      >= 0.85 for at_least 85% of cases"
  - "result_groundedness.score        >= 0.90 for at_least 90% of cases"
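
Read literally, each assertion has two knobs: a per-case score threshold and the fraction of cases that must clear it. The function below is one plausible reading of that rule, written as a sketch in plain Python; it is an assumption about the semantics, not the `fi run` implementation.

```python
def assertion_holds(scores: list[float], threshold: float, min_fraction: float) -> bool:
    """True when at least `min_fraction` of cases score at or above `threshold`."""
    if not scores:
        return False  # an empty slice should fail loudly, not pass silently
    passing = sum(score >= threshold for score in scores)
    return passing / len(scores) >= min_fraction


groundedness_scores = [0.95, 0.92, 0.97, 0.91, 0.60, 0.93, 0.99, 0.94, 0.96, 0.90]
ok = assertion_holds(groundedness_scores, threshold=0.90, min_fraction=0.90)
```

Under this reading, one badly grounded case out of ten still passes the 0.90 / 90% gate, while a second one would fail it. Whether a slice with no cases should pass or fail is a design choice; failing is the safer default for a release gate.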

The Agent Command Center gateway returns x-prism-cost, x-prism-latency-ms, and x-prism-model-used on every call, so cost and latency per step land in the trace without extra plumbing. The AI agent cost optimization guide covers per-step budgeting; for runtime detection of contract violations the AI agent failure detection tools post compares patterns.

How Future AGI ships the four-step eval stack

Future AGI ships the stack as a package. Start with the SDK for code-defined per-step scoring. Graduate to the Platform when the loop needs self-improving rubrics and classifier-backed cost economics.

  • ai-evaluation SDK (Apache 2.0): deterministic metrics (function_name_match, parameter_validation, function_call_accuracy, function_call_exact_match) for the cheap layer of steps 1-3; EvaluateFunctionCalling (alias LLMFunctionCalling) for selection and argument rubrics; CustomLLMJudge for argument plausibility; Groundedness, ContextAdherence, ChunkAttribution for step 4 with tool payload as context; fi run CLI with CI assertions.
  • Future AGI Platform: self-improving evaluators tuned by thumbs feedback on production traces; in-product authoring agent for custom rubrics; classifier-backed evals at lower per-eval cost than Galileo Luna-2.
  • traceAI (Apache 2.0): @tracer.tool decorator with auto-inferred schemas; 14 span kinds across 50+ AI surfaces in Python, TypeScript, Java, C#; EvalTag wires the rubric to the span at zero inference latency.
  • Error Feed: HDBSCAN clusters tool-use regressions over span embeddings; a Sonnet 4.5 Judge writes the 5-category 30-subtype taxonomy (Decision Errors: wrong-tool, invalid-params, missing-tool-call), the 4-D trace score, and an immediate_fix.
  • Agent Command Center: 100+ providers; per-virtual-key AllowedTools and DeniedTools at the gateway boundary; MCP security plugin with allowed_servers, blocked_tools, per-tool rate limits; Protect’s prompt_injection adapter on every tool input and output. SOC 2 Type II, HIPAA, GDPR, and CCPA certified per futureagi.com/trust; ISO/IEC 27001 in active audit.

MCP is worth one line. Tool descriptions arrive at runtime from servers you don’t own. The gateway’s mcpsec plugin scans descriptions for injection before the model sees them; Protect’s prompt_injection adapter runs on tool outputs before the model integrates them — the step 4 attack surface specifically. The AI gateway and MCP tool observability writeup covers the dual-scanner architecture.

Three honest tradeoffs

Per-step scoring costs more than aggregate task-completion. Five rubrics per case, not one. The payoff: when CI fails, the failing step name is the root cause, not the start of a triage session. Ship the deterministic layers (schema validation, function_name_match, call-vs-no-call counts) first and turn on the LLM-judge layers once trace volume justifies the eval bill.

Groundedness on tool payloads is noisier than Groundedness on retrieved corpus. JSON has fewer surface cues than prose, and the rubric reasons over fields rather than passages. Pin a small human-labeled calibration set and re-tune monthly; the LLM-as-a-judge deep-dive covers judge calibration.

The no-call slice feels expensive to build. Greetings and clarifications are cheap to generate but a curator has to label “what should the model do here,” and the labels are softer than tool-call golds. Build it once, treat it as a fixture, refresh quarterly from production traces where the model fired a tool on a turn that didn’t need one.

Ready to wire the contract? Pick one tool with a write side effect and build a 60-case eval set across the four steps: 10 no-call, 20 selection with at least one near-miss family, 20 argument cases including 6 semantic, and 10 result integration. Wire function_name_match, parameter_validation, a CustomLLMJudge on argument plausibility, and Groundedness against the tool payload into a pytest fixture against the ai-evaluation SDK. Then attach the same templates as EvalTag scorers via traceAI so the same rubric rides on every production call.

Frequently asked questions

What are the four steps of LLM tool use evaluation?
LLM tool use is a four-step contract: (1) decide whether to call any tool at all, (2) pick the right tool from the catalog, (3) construct correct arguments against the tool's schema and the user's intent, (4) integrate the tool's result into the final response without paraphrasing or substituting prior knowledge. The model can get one of the four wrong and the demo still passes — the wrong refund goes to the right customer, or the right tool fires when the user only said hello. Score each step independently with a dedicated rubric (irrelevance bucket, F1 on tool name, schema validation plus a semantic judge on arguments, Groundedness with the tool payload as context) and the diagnostic vocabulary collapses from 'the model failed' to 'step 2 regressed on the refund_order vs. cancel_order pair.' One bisect instead of three days.
Why is step 1 (decide whether to call any tool) the step everyone misses?
Most eval sets only contain cases where a tool call is the gold answer. The model gets graded on selection accuracy across the catalog, which is step 2 in this contract, and the over-call regression where a new prompt revision makes the model fire search() on every greeting goes untested until production. BFCL added an explicit irrelevance-detection bucket for this reason: a model that scores 0.95 on tool-name F1 and 0.40 on irrelevance is a model that over-calls. The fix is to reserve 10-15 percent of the eval set for cases where the gold answer is no tool call (greetings, clarifications, in-model factual questions, refusal-worthy asks) and gate CI on per-bucket precision.
How do I score argument construction (step 3) beyond schema validation?
Schema validation is the deterministic gate and it runs on every call. Pydantic on the tool's input schema catches type errors, missing required fields, and pattern mismatches at zero LLM cost. The class of failure schema can never see is semantic: departure_date='2026-01-01' matches pattern=r'^\d{4}-\d{2}-\d{2}$' and is still wrong if the user said 'next Friday', and customer_id='please refund the angry one' is a valid string wherever the schema accepts any string. Pair the deterministic layer with an LLM-judge that reads the conversation history and scores whether each argument is grounded in prior turns and plausible given the user's intent. Together, the schema gate and Future AGI's EvaluateFunctionCalling (alias LLMFunctionCalling) plus a CustomLLMJudge with conversation context cover both classes: schema checked cheaply, semantics judged.
How do you score step 4 (integrate the result)?
Run a Groundedness rubric where the context slot is the tool's return payload, not a retrieved corpus. The question to grade: does the model's response derive cleanly from what the tool returned, or did the model paraphrase with a number flipped, ignore the payload entirely, or substitute prior model knowledge. Three failure patterns show up in production: tool returns amount_cents=4500 and the model says '$54.00 is processing'; tool returns balance below the threshold and the model 'knows' it's above; tool result used correctly on turn 1, contradicted on turn 3. The Future AGI ai-evaluation SDK ships Groundedness and ContextAdherence as the rubrics; point the context slot at the JSON of tool.result and the rubric works against tool calls without modification.
Should I gate production releases on BFCL or τ-bench?
Use them as the public floor, not the private ceiling. BFCL (Berkeley Function Calling Leaderboard) measures function calling across AST checks, executable checks, and an irrelevance bucket on a public tool registry. τ-bench measures multi-turn behavior in airline and retail environments with pass^k as the consistency metric. Both tell you whether the underlying model can call tools at all. Neither tells you whether the model handles your registry, your schemas, your error codes, or your business policy. Treat public benchmarks as a model-selection signal: they help you choose between GPT-5, Claude Opus 4.7, and Gemini for a new build. Build a private eval set stratified by tool, by argument edge-case bucket, and by call/no-call gold, and gate CI on that. Promote failing production traces into the set weekly.
What does production observability for tool use actually look like?
Every tool call lands as a span with fi.span.kind=TOOL plus the GenAI attributes — gen_ai.tool.name, gen_ai.tool.call.arguments, gen_ai.tool.call.result, gen_ai.tool.description, gen_ai.tool.type — emitted by traceAI across 50+ AI surfaces in Python, TypeScript, Java, and C#. Latency rides on the standard OpenTelemetry duration attribute, so per-tool p50, p95, p99 are one Grafana query away. EvalTag attaches the same EvaluateFunctionCalling rubric you ran in CI as a span-attached scorer that the collector runs server-side at zero inline latency. The Agent Command Center gateway returns x-prism-cost and x-prism-latency-ms per call so per-step economics surface in the trace without extra plumbing.
How does Future AGI ship the four-step eval stack?
As a package. The ai-evaluation SDK (Apache 2.0, OSS) ships deterministic function-call metrics (function_name_match, parameter_validation, function_call_accuracy, function_call_exact_match) for the cheap layer of steps 1-3; EvaluateFunctionCalling (alias LLMFunctionCalling) and CustomLLMJudge for the semantic rubric layer; Groundedness, ContextAdherence, and ChunkAttribution for step 4 with the tool payload as context. The Future AGI Platform adds self-improving evaluators tuned by thumbs feedback, an in-product authoring agent for custom rubrics, and classifier-backed evals at lower per-eval cost than Galileo Luna-2. traceAI carries the same rubrics as span-attached scorers via EvalTag. Error Feed clusters tool-use regressions over span embeddings and writes the immediate_fix. The Agent Command Center enforces per-virtual-key AllowedTools and DeniedTools at the gateway boundary, with an MCP security plugin and Protect's prompt_injection adapter on every tool input and output. SOC 2 Type II, HIPAA, GDPR, and CCPA certified per the trust page; ISO/IEC 27001 in active audit.