What Is the Tool Correctness Metric?

An LLM-evaluation metric that scores whether an agent picked the right tool, called it with valid arguments, and used the result correctly.

The tool correctness metric is an LLM evaluation metric that scores whether an agent invoked the right tool with the right arguments, and whether the tool’s observation was used correctly to advance the user’s goal. It grades a single tool-call span in an agent trajectory (tool name match, argument schema, argument values, and observation handling) rather than the final answer. Engineers wire it into agent traces so they can isolate planner errors from tool errors. FutureAGI exposes the metric through ToolSelectionAccuracy, FunctionCallAccuracy, and step-level dashboards. In 2026 this is the single most important agent-reliability signal: with MCP servers exposing dozens of tools dynamically and BFCL v3 showing wide gaps between frontier models on parallel and irrelevance calls, end-to-end success rate alone hides where the agent broke.

Why tool correctness matters in production LLM and agent systems

A single end-to-end success rate hides where the agent broke. An agent can finish a 12-step trajectory with the wrong tool selected at step three, the wrong arguments at step seven, and a salvaged final answer that masks both errors. By the time a user complains, the trace has 40 spans and the team is guessing. Tool correctness gives you the per-step signal that converts “the agent is broken” into “the agent picks the wrong tool 8% of the time on the refund route.”
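A minimal sketch of how per-step verdicts turn a salvaged trajectory into a located failure. The StepVerdict record and both helper functions are hypothetical names chosen for illustration, not part of any FutureAGI API, and the verdicts are hard-coded rather than produced by an evaluator.

```python
from dataclasses import dataclass


@dataclass
class StepVerdict:
    step: int            # position in the trajectory
    tool: str            # tool the agent called
    selection_ok: bool   # did the agent pick the right tool?
    arguments_ok: bool   # were the arguments valid and plausible?


def failing_steps(verdicts):
    """Return (step, reason) pairs for every step that failed a check."""
    failures = []
    for v in verdicts:
        if not v.selection_ok:
            failures.append((v.step, "wrong tool"))
        elif not v.arguments_ok:
            failures.append((v.step, "wrong arguments"))
    return failures


def wrong_tool_rate(verdicts):
    """Share of steps where the agent selected the wrong tool."""
    if not verdicts:
        return 0.0
    return sum(not v.selection_ok for v in verdicts) / len(verdicts)


# Hypothetical 12-step trajectory whose final answer was salvaged.
trajectory = [StepVerdict(i, "lookup_order", True, True) for i in range(1, 13)]
trajectory[2] = StepVerdict(3, "lookup_order", False, True)    # wrong tool at step 3
trajectory[6] = StepVerdict(7, "process_refund", True, False)  # bad arguments at step 7

print(failing_steps(trajectory))   # [(3, 'wrong tool'), (7, 'wrong arguments')]
print(round(wrong_tool_rate(trajectory), 3))
```

An end-to-end check would mark this run as a success; the per-step view names the two spans to open first.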

The pain is concrete. Backend engineers chase runaway cost when an agent retries the wrong tool five times before giving up. SREs see latency spikes when an agent calls a heavy database tool instead of a cached lookup. Product managers see hallucinated outputs when the model fabricates arguments because the schema description was ambiguous. Compliance teams need to prove an agent never called a write-capable tool on a read-only request; a step-level evaluator is the only artifact that answers that question cleanly.

In 2026-era multi-agent stacks built on the OpenAI Agents SDK, LangGraph, CrewAI, Google ADK, or MCP-connected toolchains, the tool registry is large and the model swap rate is high. A prompt that worked on Claude Sonnet 4.5 may select the wrong tool 12% more often on Sonnet 4.6 after a system-prompt change. The τ-bench retail variant shows frontier scores between 55% and 70%, a 15-point spread driven almost entirely by tool selection and argument fidelity, not by raw reasoning. Tool correctness as a continuous metric, not a one-off eval, is what catches that regression before the support queue does.

How FutureAGI Handles Tool Correctness

FutureAGI’s approach is to grade tool calls at three resolutions and tie all of them to the same span. At the selection layer, the ToolSelectionAccuracy evaluator scores whether the agent picked the correct tool from the registry given the input state; it returns a 0/1 verdict plus a reason citing the expected tool. At the call layer, FunctionCallAccuracy validates argument schema, required fields, and value plausibility against a JSON Schema or a reference call, returning per-argument diagnostics. At the trajectory layer, TaskCompletion and TrajectoryScore aggregate per-step verdicts so a single failing tool step shows up against the goal.
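To make the three resolutions concrete, here is a rough plain-Python sketch of what each layer checks. It is not the FutureAGI implementation: the function names are hypothetical, the argument check is a hand-rolled stand-in for real JSON Schema validation, and the trajectory score is assumed to be a simple mean of step verdicts.

```python
def grade_selection(called_tool, expected_tool):
    """Selection layer: 0/1 verdict plus a reason citing the expected tool."""
    ok = called_tool == expected_tool
    reason = "matched" if ok else f"expected {expected_tool}, got {called_tool}"
    return (1 if ok else 0), reason


def grade_arguments(args, required, types):
    """Call layer: per-argument diagnostics for missing, unexpected, or mistyped fields."""
    diagnostics = {}
    for name in required:
        if name not in args:
            diagnostics[name] = "missing required field"
    for name, value in args.items():
        if name not in types:
            diagnostics[name] = "unexpected field"
        elif not isinstance(value, types[name]):
            diagnostics[name] = f"expected {types[name].__name__}"
    return diagnostics


def grade_trajectory(step_scores):
    """Trajectory layer: mean of per-step 0/1 verdicts (an assumed aggregation)."""
    return sum(step_scores) / len(step_scores) if step_scores else 0.0


# Hypothetical call: wrong tool, integer id instead of string, amount missing.
call = {"tool": "lookup_order", "args": {"order_id": 12345}}
print(grade_selection(call["tool"], "process_refund"))
print(grade_arguments(call["args"],
                      required=["order_id", "amount"],
                      types={"order_id": str, "amount": float}))
print(grade_trajectory([1, 0, 1, 1]))
```

Keeping the three functions separate mirrors the point of the section: a step can pass selection and still fail on arguments, and both verdicts should land on the same span.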

A real workflow: a refunds agent built on the OpenAI Agents SDK is instrumented with traceAI-openai-agents. Every tool span carries agent.trajectory.step, the tool name, and the arguments. A 5% production sample is mirrored into an eval cohort where ToolSelectionAccuracy and FunctionCallAccuracy run on each step. When the tool-correctness rate dips below 92% on the refunds route, the dashboard pivots by tool name and flags a model swap that started returning the lookup_order tool when process_refund was correct. The team rolls back the model in the Agent Command Center, locks the previous variant via routing-policy: cost-optimized with a fallback, and adds the failing trace to a regression eval.
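The sampling and alerting steps in this workflow could be approximated as follows. This is a sketch under assumptions: hash-based sampling is one common way to get a stable 5% cohort (the article does not say how the sample is drawn), and the step dictionaries with route, expected_tool, and called_tool keys are a hypothetical shape, not the traceAI span format.

```python
import hashlib
from collections import Counter

SAMPLE_PERCENT = 5       # mirror roughly 5% of production traces
ALERT_THRESHOLD = 0.92   # alert when a route drops below 92% correctness


def in_eval_sample(trace_id, percent=SAMPLE_PERCENT):
    """Deterministic sampling: the same trace id always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def route_report(steps, route, threshold=ALERT_THRESHOLD):
    """Correctness rate for one route, plus failures pivoted by tool name."""
    on_route = [s for s in steps if s["route"] == route]
    if not on_route:
        return None
    correct = sum(s["called_tool"] == s["expected_tool"] for s in on_route)
    rate = correct / len(on_route)
    # Count (expected, called) pairs to see which tool is being confused with which.
    confusions = Counter(
        (s["expected_tool"], s["called_tool"])
        for s in on_route
        if s["called_tool"] != s["expected_tool"]
    )
    return {"rate": rate, "alert": rate < threshold,
            "confusions": confusions.most_common()}


# Hypothetical cohort: 8 correct steps, 2 where lookup_order replaced process_refund.
good = {"route": "refunds", "expected_tool": "process_refund", "called_tool": "process_refund"}
bad = {"route": "refunds", "expected_tool": "process_refund", "called_tool": "lookup_order"}
steps = [good] * 8 + [bad] * 2

report = route_report(steps, "refunds")
print(report["rate"], report["alert"], report["confusions"])
```

The confusion pairs are what make the alert actionable: they show that process_refund is being replaced by lookup_order, which points at the model swap rather than at the tool itself.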

Unlike single-shot benchmarks such as BFCL v3 or τ-bench, which report one aggregate score across a public test set, the FutureAGI workflow keeps the failing trajectory in a regression eval dataset so the same step is scored on every prompt and model change. We’ve found in our 2026 agent evals that 60% of model-swap regressions show up first on argument plausibility (wrong order id, wrong date format), not on tool selection, which is why FunctionCallAccuracy is the more sensitive of the two.
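A regression eval dataset of this kind can be pictured as a list of stored steps that is replayed against each candidate. The sketch below assumes a hypothetical agent_step callable that maps input text to a tool name and arguments; the stub models and the order_date argument are invented for the example.

```python
def add_regression_case(dataset, user_input, expected_tool, expected_args):
    """Store a failing step so it is re-scored on every prompt or model change."""
    dataset.append({
        "input": user_input,
        "expected_tool": expected_tool,
        "expected_args": expected_args,
    })


def replay(dataset, agent_step):
    """Re-score every stored case against a candidate agent.

    agent_step is a hypothetical callable: input text -> (tool_name, args dict).
    """
    results = []
    for case in dataset:
        tool, args = agent_step(case["input"])
        results.append({
            "input": case["input"],
            "selection_ok": tool == case["expected_tool"],
            "arguments_ok": args == case["expected_args"],
        })
    return results


def previous_model(text):
    # Stub standing in for the model variant that behaved correctly.
    return "process_refund", {"order_id": "12345", "order_date": "2026-01-05"}


def candidate_model(text):
    # Stub standing in for a swapped model: same tool, different date format.
    return "process_refund", {"order_id": "12345", "order_date": "01/05/2026"}


dataset = []
add_regression_case(
    dataset,
    "Refund order 12345 placed on 2026-01-05",
    "process_refund",
    {"order_id": "12345", "order_date": "2026-01-05"},
)

print(replay(dataset, previous_model))
print(replay(dataset, candidate_model))
```

The candidate passes selection and fails only on arguments, which is the pattern the paragraph above describes. Exact equality on arguments is the strictest possible comparison; a real setup would likely normalize values or score plausibility instead.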

Three resolutions for the same span

Resolution | Evaluator | What it grades | Typical failure mode
Selection | ToolSelectionAccuracy | Did the agent pick the right tool name? | Two similar tools, ambiguous descriptions
Argument | FunctionCallAccuracy | Are arguments valid and plausible? | Fabricated ids, wrong types, dropped required fields
Trajectory | TaskCompletion / TrajectoryScore | Did the chain of steps reach the goal? | Wrong-order steps, infinite retries
Step OTel attr | agent.trajectory.step | Span tag to group scores by step index | Missing on async sub-agents

How to Measure or Detect It

Tool correctness is multi-signal; pick the signals that match your agent surface:

  • ToolSelectionAccuracy: returns 0/1 plus a reason on whether the chosen tool name matches the expected tool given the input state.
  • FunctionCallAccuracy: validates argument JSON Schema, required fields, types, and value plausibility against a reference.
  • TaskCompletion: aggregates step-level signals into a 0–1 trajectory score so a single bad tool call shows up against the goal.
  • agent.trajectory.step (OTel attribute): the canonical span tag to filter and group tool-correctness scores by step number.
  • Dashboard signal: tool-correctness-rate-by-route, sliced by tool name, model id, and prompt version; alert when it drops more than 3 points week-over-week.
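The week-over-week alert in the last bullet can be sketched as a comparison of two rate tables. The slice keys (route, tool, model id, prompt version) follow the bullet above, but the function name, the data layout, and the model-a and v3 labels are assumptions made for the example.

```python
def wow_alerts(last_week, this_week, max_drop_points=3.0):
    """Flag slices whose correctness rate fell by more than max_drop_points.

    Both arguments map a (route, tool, model_id, prompt_version) tuple
    to a correctness rate between 0 and 1.
    """
    alerts = []
    for key, previous in last_week.items():
        current = this_week.get(key)
        if current is None:
            continue  # slice had no scored traffic this week
        drop = (previous - current) * 100  # difference in percentage points
        if drop > max_drop_points:
            alerts.append((key, round(drop, 1)))
    # Largest drop first so the worst slice is at the top.
    return sorted(alerts, key=lambda alert: alert[1], reverse=True)


# Hypothetical rates for two slices on the refunds route.
refund_slice = ("refunds", "process_refund", "model-a", "v3")
lookup_slice = ("refunds", "lookup_order", "model-a", "v3")

last_week = {refund_slice: 0.95, lookup_slice: 0.97}
this_week = {refund_slice: 0.90, lookup_slice: 0.96}

print(wow_alerts(last_week, this_week))
```

Comparing per-slice rather than per-route keeps a regression in one tool from being averaged away by healthy traffic on the others.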

Minimal Python:

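# Evaluators are imported from FutureAGI's fi.evals module.
from fi.evals import ToolSelectionAccuracy, FunctionCallAccuracy

selection = ToolSelectionAccuracy()   # grades the chosen tool name
arguments = FunctionCallAccuracy()    # grades argument schema and values (not called below)

# Score one tool-call step against the tool that should have been picked.
result = selection.evaluate(
    input="Refund order 12345",
    output={"tool": "process_refund", "args": {"order_id": "12345"}},
    expected_tool="process_refund",
)
print(result.score, result.reason)  # 0/1 verdict plus the reason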

Common mistakes

  • Grading only the final answer. A salvaged correct answer hides a bad tool call earlier in the trajectory; score every step, not just the last.
  • Ignoring argument plausibility. A tool name match is not enough: fabricated order ids or wrong dates pass selection but fail the call.
  • Using a single reference per query. Many tasks have multiple correct tool sequences; allow a set of acceptable trajectories rather than one canonical path.
  • Skipping the schema test. A model that returns extra or missing fields will pass ToolSelectionAccuracy but break the runtime; pair both evaluators.
  • Letting the judge model and the agent model be the same. Self-grading inflates correctness; pin the judge to a different model family.
  • Treating MCP and native tool calls differently. A unified score across both transports is what reveals registry-level confusion.
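For the single-reference mistake in the list above, one simple alternative is to compare the observed tool sequence against a set of acceptable sequences. This is a sketch: trajectory_matches is a hypothetical helper, and check_policy is an invented tool name used only to create two valid orderings.

```python
def trajectory_matches(actual, acceptable):
    """True if the observed tool sequence equals any acceptable sequence."""
    return tuple(actual) in {tuple(path) for path in acceptable}


# Two orderings that both count as a correct refund trajectory (hypothetical).
acceptable_refund_paths = [
    ["lookup_order", "check_policy", "process_refund"],
    ["check_policy", "lookup_order", "process_refund"],
]

print(trajectory_matches(
    ["check_policy", "lookup_order", "process_refund"], acceptable_refund_paths))
print(trajectory_matches(
    ["process_refund", "lookup_order"], acceptable_refund_paths))
```

Exact sequence matching is deliberately strict; it accepts the second ordering that a single canonical path would reject, while still failing a trajectory that refunds before looking up the order.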

Frequently Asked Questions

What is the tool correctness metric?

It is a step-level evaluation metric that scores whether an agent picked the right tool, called it with valid arguments, and used the observation correctly to advance the goal.

How is tool correctness different from task completion?

Tool correctness is per-step: each tool call is graded individually. Task completion is end-to-end: it asks whether the agent reached the goal across the full trajectory.

How do you measure tool correctness?

FutureAGI's ToolSelectionAccuracy evaluator returns a 0/1 verdict on tool name choice; FunctionCallAccuracy validates argument schema and values; both attach to the agent.trajectory.step span.