How is tool correctness different from task completion?

Tool correctness is per-step — each tool call is graded individually. Task completion is end-to-end — it asks whether the agent reached the goal across the full trajectory.

How do you measure tool correctness?

FutureAGI's ToolSelectionAccuracy evaluator returns a 0/1 verdict on tool name choice; FunctionCallAccuracy validates argument schema and values. Both attach to the tool span tagged with agent.trajectory.step.

What Is the Tool Correctness Metric? Definition & FutureAGI Guide (2026)

Q: What is the tool correctness metric?

It is a step-level evaluation metric that scores whether an agent picked the right tool, called it with valid arguments, and used the observation correctly to advance the goal.

What Is the Tool Correctness Metric?

The tool correctness metric is an LLM-evaluation metric that scores whether an agent invoked the right tool with the right arguments, and whether the tool’s observation was used correctly to advance the user’s goal. It grades a single tool-call span in an agent trajectory — tool name match, argument schema, argument values, and observation handling — rather than the final answer. Engineers wire it into agent traces so they can isolate planner errors from tool errors. FutureAGI exposes the metric through ToolSelectionAccuracy, FunctionCallAccuracy, and step-level dashboards.

Why It Matters in Production LLM and Agent Systems

A single end-to-end success rate hides where the agent broke. An agent can finish a 12-step trajectory with the wrong tool selected at step three, the wrong arguments at step seven, and a salvaged final answer that masks both errors. By the time a user complains, the trace has 40 spans and the team is guessing. Tool correctness gives you the per-step signal that converts “the agent is broken” into “the agent picks the wrong tool 8% of the time on the refund route.”
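The gap between the two signals can be illustrated with a small sketch. The step records, trajectory ids, and outcome table below are hypothetical data invented for illustration; they are not drawn from any FutureAGI API.

```python
from collections import defaultdict

# Hypothetical step records: (trajectory_id, route, expected_tool, chosen_tool)
STEPS = [
    ("t1", "refund", "lookup_order", "lookup_order"),
    ("t1", "refund", "process_refund", "lookup_order"),    # wrong tool
    ("t1", "refund", "process_refund", "process_refund"),  # retried correctly
    ("t2", "refund", "lookup_order", "lookup_order"),
    ("t2", "refund", "process_refund", "process_refund"),
]

# Hypothetical end-to-end outcomes: both trajectories reached the goal.
TASK_SUCCESS = {"t1": True, "t2": True}


def wrong_tool_rate_by_route(steps):
    """Fraction of steps per route where the chosen tool differs from the expected one."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for _trajectory_id, route, expected, chosen in steps:
        totals[route] += 1
        if chosen != expected:
            wrong[route] += 1
    return {route: wrong[route] / totals[route] for route in totals}


def end_to_end_success_rate(outcomes):
    """Fraction of trajectories that reached the goal."""
    return sum(outcomes.values()) / len(outcomes)
```

On this toy data the end-to-end success rate is 1.0, while the refund route still shows a 0.2 wrong-tool rate: the salvaged trajectory hides the bad step unless steps are scored individually.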

The pain is concrete. Backend engineers chase runaway cost when an agent retries the wrong tool five times before giving up. SREs see latency spikes when an agent calls a heavy database tool instead of a cached lookup. Product managers see hallucinated outputs when the model fabricates arguments because the schema description was ambiguous. Compliance teams need to prove an agent never called a write-capable tool on a read-only request; step-level evaluation records are the artifact that answers that question cleanly.

In 2026-era multi-agent stacks built on the OpenAI Agents SDK, LangGraph, CrewAI, or Google ADK, the tool registry is large and the model swap rate is high. A prompt that worked on gpt-4o may select the wrong tool noticeably more often on gpt-4o-mini, for instance 12% more often. Tracking tool correctness as a continuous metric, rather than as a one-off eval, is what catches that regression before the support queue does.

How FutureAGI Handles Tool Correctness

FutureAGI’s approach is to grade tool calls at three resolutions and tie all of them to the same span. At the selection layer, the ToolSelectionAccuracy evaluator scores whether the agent picked the correct tool from the registry given the input state — it returns a 0/1 verdict plus a reason citing the expected tool. At the call layer, FunctionCallAccuracy validates argument schema, required fields, and value plausibility against a JSON Schema or a reference call, returning per-argument diagnostics. At the trajectory layer, TaskCompletion and TrajectoryScore aggregate per-step verdicts so a single failing tool step shows up against the goal.
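The three resolutions can be sketched in plain Python to show what each layer checks. This is a minimal illustration under stated assumptions, not FutureAGI's implementation: `ToolCall`, `selection_verdict`, `argument_diagnostics`, and `trajectory_score` are hypothetical names, the argument check uses a simple name-to-type map instead of a full JSON Schema, and the trajectory aggregate is a plain mean.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    args: dict


def selection_verdict(call, expected_tool):
    """Selection layer: 1 if the tool name matches the expected tool, else 0."""
    return 1 if call.tool == expected_tool else 0


def argument_diagnostics(call, required_types):
    """Call layer: per-argument diagnostics against required fields and types.

    `required_types` maps each required argument name to its Python type.
    Returns a dict of problems keyed by argument name; empty means the call passed.
    """
    problems = {}
    for name, expected_type in required_types.items():
        if name not in call.args:
            problems[name] = "missing required field"
        elif not isinstance(call.args[name], expected_type):
            problems[name] = f"expected {expected_type.__name__}"
    for name in call.args:
        if name not in required_types:
            problems[name] = "unexpected field"
    return problems


def trajectory_score(step_verdicts):
    """Trajectory layer: mean of per-step 0/1 verdicts (one possible aggregate)."""
    return sum(step_verdicts) / len(step_verdicts) if step_verdicts else 0.0
```

A call such as `ToolCall("process_refund", {"order_id": 12345})` passes the selection check for `process_refund` but is flagged at the call layer when `order_id` is required to be a string, which is why the two layers are scored separately.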

A real workflow: a refunds agent built on the OpenAI Agents SDK is instrumented with traceAI-openai-agents. Every tool span carries agent.trajectory.step, the tool name, and the arguments. A 5% production sample is mirrored into an eval cohort where ToolSelectionAccuracy and FunctionCallAccuracy run on each step. When the tool-correctness rate dips below 92% on the refunds route, the dashboard pivots by tool name and flags a model swap after which the agent started calling lookup_order where process_refund was correct. The team rolls back the model in the Agent Command Center, locks the previous variant via routing-policy: cost-optimized with a fallback, and adds the failing trace to a regression eval.
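The sampling, threshold, and pivot steps of this workflow can be approximated offline. The sketch below is an assumption-laden illustration rather than how the platform does it: the hash-based sampler and the helper names `in_eval_cohort`, `correctness_rate`, and `confusions_by_tool` are hypothetical, and each step is reduced to an (expected_tool, chosen_tool) pair.

```python
import hashlib
from collections import Counter

SAMPLE_RATE = 0.05      # 5% of production traces, as in the workflow above
ALERT_THRESHOLD = 0.92  # alert when the tool-correctness rate falls below 92%


def in_eval_cohort(trace_id, rate=SAMPLE_RATE):
    """Deterministically sample a trace by hashing its id into a 64-bit bucket."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


def correctness_rate(steps):
    """Share of (expected_tool, chosen_tool) pairs that match; `steps` must be non-empty."""
    return sum(1 for expected, chosen in steps if expected == chosen) / len(steps)


def confusions_by_tool(steps):
    """Count mismatched (expected, chosen) pairs, most common first."""
    return Counter((e, c) for e, c in steps if e != c).most_common()
```

Hashing the trace id keeps the cohort stable across reruns, so the same trace is either always or never in the sample. When `correctness_rate` falls below `ALERT_THRESHOLD`, the top entry from `confusions_by_tool` points at the substitution to investigate, such as `lookup_order` chosen where `process_refund` was expected.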

Unlike single-shot benchmarks such as AgentBench, the FutureAGI workflow keeps the failing trajectory in a regression dataset so the same step is scored on every prompt and model change.

How to Measure or Detect It

Tool correctness is multi-signal; pick the signals that match your agent surface:

ToolSelectionAccuracy — returns 0/1 plus reason on whether the chosen tool name matches the expected tool given the input state.
FunctionCallAccuracy — validates argument JSON Schema, required fields, types, and value plausibility against a reference.
TaskCompletion — aggregates step-level signals into a 0–1 trajectory score so a single bad tool call shows up against the goal.
agent.trajectory.step (OTel attribute) — the canonical span tag to filter and group tool-correctness scores by step number.
Dashboard signal — tool-correctness-rate-by-route, sliced by tool name, model id, and prompt version; alert when it drops more than 3 points week-over-week.

Minimal Python:

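from fi.evals import ToolSelectionAccuracy, FunctionCallAccuracy

# Step-level evaluators: one for the tool name, one for the call arguments.
selection = ToolSelectionAccuracy()
arguments = FunctionCallAccuracy()  # not invoked in this minimal example

# Score a single tool-call step against the expected tool.
result = selection.evaluate(
    input="Refund order 12345",
    output={"tool": "process_refund", "args": {"order_id": "12345"}},
    expected_tool="process_refund",
)
print(result.score, result.reason)  # 0/1 verdict and the evaluator's reason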
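The week-over-week alert from the dashboard signal can be sketched independently of any dashboard. The function below is a hypothetical helper, not part of the FutureAGI SDK; it assumes rates are already aggregated per slice and expressed in percent, and it reads "3 points" as percentage points.

```python
def week_over_week_alerts(last_week, this_week, max_drop_points=3.0):
    """Return slices whose tool-correctness rate dropped by more than `max_drop_points`.

    Both inputs map a slice key, e.g. (route, tool_name), to a rate in percent (0-100).
    Slices with no data this week are skipped rather than treated as a drop.
    """
    alerts = {}
    for key, previous in last_week.items():
        current = this_week.get(key)
        if current is None:
            continue  # no traffic this week for that slice
        drop = previous - current
        if drop > max_drop_points:
            alerts[key] = round(drop, 2)
    return alerts
```

Keying the rates by route and tool name (and, where available, model id and prompt version) keeps a regression on one tool from being averaged away by healthy traffic elsewhere.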

Common Mistakes

Grading only the final answer. A salvaged correct answer hides a bad tool call earlier in the trajectory; score every step, not just the last.
Ignoring argument plausibility. A tool name match is not enough: fabricated order IDs or wrong dates pass selection but fail the call.
Using a single reference per query. Many tasks have multiple correct tool sequences; allow a set of acceptable trajectories rather than one canonical path.
Skipping the schema test. A model that returns extra or missing fields will pass ToolSelectionAccuracy but break the runtime; pair it with FunctionCallAccuracy.
Letting the judge model and the agent model be the same. Self-grading inflates correctness; pin the judge to a different model family.
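The point about multiple correct tool sequences can be made concrete with a small sketch. It assumes references are stored as tuples of tool names; the helper names and the `lookup_customer` tool are hypothetical and used only for illustration.

```python
def matches_any_reference(tool_sequence, acceptable_sequences):
    """True if the observed tool-name sequence equals any acceptable sequence."""
    observed = tuple(tool_sequence)
    return any(observed == tuple(ref) for ref in acceptable_sequences)


def acceptable_next_tools(prefix, acceptable_sequences):
    """Tools that keep the trajectory on at least one acceptable path.

    `prefix` is the list of tool names called so far. The result can serve as
    the expected-tool set when scoring the next step.
    """
    prefix = tuple(prefix)
    n = len(prefix)
    return {
        tuple(ref)[n]
        for ref in acceptable_sequences
        if len(ref) > n and tuple(ref)[:n] == prefix
    }


# Hypothetical references: two valid ways to complete a refund.
ACCEPTABLE = [
    ("lookup_order", "process_refund"),
    ("lookup_customer", "lookup_order", "process_refund"),
]
```

With these references, a first step of either `lookup_order` or `lookup_customer` counts as correct, so an agent is not penalized for taking the longer but still valid path.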