How is tool selection accuracy different from function call accuracy?

Tool selection accuracy grades which tools the agent picked. Function call accuracy grades the call itself — name plus argument values. An agent can have perfect tool selection and still fail FunctionCallAccuracy because it filled a parameter wrong.

How do you measure tool selection accuracy?

FutureAGI's fi.evals.ToolSelectionAccuracy consumes the agent trajectory, a list of available_tools, and required_tools from the task definition, then returns a 0–1 score weighted across coverage, validity, and call success.

What Is Tool Selection Accuracy? Agent Eval (2026)

Q: What is tool selection accuracy?

Tool selection accuracy is an agent-eval metric that scores whether the agent chose the right tools across its trajectory — measuring required-tool coverage, validity against the available toolset, and the success rate of the actual calls.

What Is Tool Selection Accuracy?

Tool selection accuracy is an agent-evaluation metric that measures whether an agent picked the right tools to solve a task, not whether each call’s arguments were filled correctly. It scores three signals across the trajectory: coverage of required tools listed in the task definition, validity of the chosen tools against the available toolset, and the success rate of the actual calls. The metric returns a 0–1 score with a reason. In FutureAGI it is the ToolSelectionAccuracy class in fi.evals, used as one component of TrajectoryScore in agent regression suites.

Why It Matters in Production LLM and Agent Systems

Picking the wrong tool is the most expensive class of agent failure: cheap in tokens, ruinous in side effects. An agent that calls cancel_subscription when the user asked to upgrade has high answer-relevancy and zero business value. The same is true of an agent that hallucinates a tool name that does not exist (silent failure with a polite “I’ve handled that”), or one that invokes a deprecated tool the runtime quietly stubs.

The pain shows up in three roles. SREs see retry storms when the agent keeps calling a tool that always errors, a signal ToolSelectionAccuracy would have caught offline. Product owners watch task-completion rate stay flat while latency and cost climb because the agent is using slower, less-targeted tools. Compliance teams hit a wall when the agent calls delete_record from a write-allowed scope even though mark_inactive was the documented path: the action is logged but not authorised by policy.

In 2026, the surface area exploded with MCP. An MCP-connected agent can see hundreds of tools at once across servers, and the ones it should use change weekly as MCP servers are added. Without a tool-selection metric scored against available_tools and required_tools, you cannot tell whether an agent regression was a model regression, a tool-catalogue change, or an MCP server returning stale schemas. Multi-agent systems compound this — each handoff introduces a new tool surface and a new opportunity to pick wrong.

How FutureAGI Handles Tool Selection Accuracy

FutureAGI’s approach treats tool selection as a deterministic measurement over the trajectory, with weighted components that auto-normalise to whatever evaluation context you have. The fi.evals.ToolSelectionAccuracy class consumes an AgentTrajectoryInput containing the trajectory and a task with required_tools, plus an available_tools list. It collects every tool name actually called, then weights three signals: 40% for required-tool coverage (used_required / required), 30% for validity (penalised when the agent calls tools outside available_tools), and 30% for the success rate of every individual ToolCall. If you only supply some of those inputs, the weights re-normalise — pass nothing but the trajectory and you still get a usable success-rate signal.
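
The weighting and re-normalisation described above can be sketched in plain Python. This is an illustrative approximation, not FutureAGI's implementation: the ToolCall dataclass, the tool_selection_score function, and the exact validity formula (share of distinct tools used that appear in available_tools) are assumptions made for the example. Only the 40/30/30 weights and the used_required / required coverage ratio come from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ToolCall:
    """Hypothetical stand-in for one tool call in a trajectory."""
    name: str
    success: bool


# Weights as described in the text: coverage 40%, validity 30%, success 30%.
WEIGHTS = {"coverage": 0.4, "validity": 0.3, "success": 0.3}


def tool_selection_score(
    calls: Sequence[ToolCall],
    available_tools: Optional[Sequence[str]] = None,
    required_tools: Optional[Sequence[str]] = None,
) -> float:
    """Weighted 0-1 score over whichever components can be computed."""
    used = {call.name for call in calls}
    components = {}

    if required_tools:
        required = set(required_tools)
        # used_required / required
        components["coverage"] = len(used & required) / len(required)

    if available_tools and used:
        # Assumed formula: fraction of distinct tools used that exist in the catalogue.
        components["validity"] = len(used & set(available_tools)) / len(used)

    if calls:
        components["success"] = sum(call.success for call in calls) / len(calls)

    if not components:
        return 0.0

    # Re-normalise so the weights of the available components sum to 1.
    total_weight = sum(WEIGHTS[name] for name in components)
    return sum(WEIGHTS[name] * value for name, value in components.items()) / total_weight


example_calls = [
    ToolCall("read_file", True),
    ToolCall("edit_file", True),
    ToolCall("run_tests", False),
]
score = tool_selection_score(
    example_calls,
    available_tools=["read_file", "edit_file", "run_tests", "grep"],
    required_tools=["read_file", "run_tests"],
)
print(round(score, 2))  # 0.9
```

In this example coverage and validity are both 1.0 and two of three calls succeed, so the score is 0.4 + 0.3 + 0.3 × 2/3 = 0.9. Calling the function with only the trajectory returns the success rate alone, which mirrors the re-normalisation behaviour described above.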

Concretely: a coding-agent team on the OpenAI Agents SDK uses traceAI-openai-agents to stream every ToolCall into an agent.trajectory.step span. Their offline regression suite attaches ToolSelectionAccuracy to a 200-task golden set with required_tools=["read_file", "edit_file", "run_tests"] per task. When a prompt update causes the model to skip run_tests on 18% of tasks, the metric drops from 0.94 to 0.81 — well before that change ships. Compared with Galileo’s “Tool Selection Quality” judge-model approach, the FutureAGI implementation runs in milliseconds because it is rule-based over the trajectory, with an optional LLM-judge mode for ambiguous calls. The team pairs the metric with FunctionCallAccuracy so they catch both wrong-tool and wrong-argument failures from the same dashboard.

How to Measure or Detect It

Signals to wire to a ToolSelectionAccuracy eval:

fi.evals.ToolSelectionAccuracy: returns a 0–1 score, the list of tools_used, and a successful_calls / total_calls ratio. Threshold at 0.85 in regression and 0.7 in production sampling.
agent.trajectory.step OTel span: every ToolCall is emitted as a span carrying tool.name and tool.success attributes; the eval reads from these directly.
Invalid-tool-rate dashboard signal: the count of tools that appear in tools_used but not in available_tools (the set difference) over a 24h window; any non-zero count means the agent called a hallucinated tool name and should page.
Required-tool coverage cohort: grouping by user-intent label exposes which task types are under-served by the model’s tool choice.
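
The last two signals in the list above reduce to a set difference and a group-by. A minimal sketch, assuming each run has already been exported as a dict with intent, tools_used, and required_tools keys (a hypothetical record shape, as are both function names):

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set


def invalid_tools(tools_used: Sequence[str], available_tools: Sequence[str]) -> Set[str]:
    """Tools the agent called that are not in the registered catalogue."""
    return set(tools_used) - set(available_tools)


def coverage_by_intent(runs: Iterable[dict]) -> Dict[str, float]:
    """Mean required-tool coverage per user-intent label.

    Each run is a dict with hypothetical keys:
    'intent', 'tools_used', 'required_tools'.
    """
    per_intent = defaultdict(list)
    for run in runs:
        required = set(run["required_tools"])
        if not required:
            continue  # coverage is undefined without required tools
        covered = len(required & set(run["tools_used"])) / len(required)
        per_intent[run["intent"]].append(covered)
    return {intent: sum(vals) / len(vals) for intent, vals in per_intent.items()}
```

Any non-empty result from invalid_tools is the paging condition; the per-intent means are what a cohort chart would plot.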

Minimal Python:

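from fi.evals import ToolSelectionAccuracy

# `run` is assumed to be a completed agent run that exposes its trajectory.
metric = ToolSelectionAccuracy()
result = metric.evaluate(
    trajectory=run.trajectory,
    # Full registered tool catalogue, used for the validity check.
    available_tools=["read_file", "edit_file", "run_tests", "grep"],
    # Tools the task definition says must be called, used for coverage.
    task={"required_tools": ["read_file", "run_tests"]},
)
print(result.score, result.tools_used)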
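
The suggested thresholds (0.85 in regression, 0.7 in production sampling) can be applied as a simple gate over collected scores. The sketch below is independent of the fi.evals API: the below_threshold helper and the task-id-to-score mapping are hypothetical, and applying the threshold per task rather than to a mean is an assumption.

```python
from typing import Dict, List

# Thresholds taken from the text; the stage names are illustrative.
THRESHOLDS = {"regression": 0.85, "production": 0.70}


def below_threshold(scores_by_task: Dict[str, float], stage: str) -> List[str]:
    """Return the task ids whose score falls below the stage threshold."""
    limit = THRESHOLDS[stage]
    return sorted(task for task, score in scores_by_task.items() if score < limit)


scores = {"task-1": 0.90, "task-2": 0.80, "task-3": 0.60}
print(below_threshold(scores, "regression"))  # ['task-2', 'task-3']
print(below_threshold(scores, "production"))  # ['task-3']
```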

Common Mistakes

Confusing tool selection with function call accuracy. ToolSelectionAccuracy grades which tool. FunctionCallAccuracy grades the call’s name plus argument values. You need both.
Empty available_tools. Without it, the validity check skips and you cannot detect hallucinated tool names. Always pass the registered tool catalogue.
No required_tools on the task. The eval falls back to call-success-rate alone, which an agent can game by calling one safe tool repeatedly.
Scoring only the final tool call. Trajectory-level evaluation is the point — a wrong middle call corrupts every downstream step even if the final call is correct.
Ignoring tool-call success in production. High coverage and validity with a low call-success rate usually means the right tool is being called with the wrong inputs; route those failures to FunctionCallAccuracy.
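
The missing required_tools mistake is easy to reproduce with a toy trajectory. In this sketch, calls are plain (tool_name, succeeded) tuples and both helper functions are hypothetical; it only illustrates why success rate alone can be gamed.

```python
from typing import List, Sequence, Tuple

Call = Tuple[str, bool]  # (tool_name, succeeded)


def success_rate(calls: Sequence[Call]) -> float:
    """Fraction of calls that succeeded; 0.0 for an empty trajectory."""
    return sum(ok for _, ok in calls) / len(calls) if calls else 0.0


def required_coverage(calls: Sequence[Call], required_tools: Sequence[str]) -> float:
    """Fraction of required tools that were called at least once."""
    required = set(required_tools)
    used = {name for name, _ in calls}
    return len(used & required) / len(required) if required else 0.0


# An agent that calls one safe tool five times and nothing else.
gamed: List[Call] = [("grep", True)] * 5

print(success_rate(gamed))                                  # 1.0
print(required_coverage(gamed, ["read_file", "run_tests"]))  # 0.0
```

Without required_tools the trajectory looks perfect; with them, the coverage component exposes that none of the required work was attempted.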