What Is Tool Selection Accuracy?

An agent-eval metric that scores whether the agent chose the right tools by measuring required-tool coverage, tool validity, and call success rate.

Tool selection accuracy is an agent-evaluation metric that measures whether an agent picked the right tools to solve a task, not whether each call’s arguments were filled correctly. It scores three signals across the trajectory: coverage of required tools listed in the task definition, validity of the chosen tools against the available toolset, and the success rate of the actual calls. The metric returns a 0–1 score with a reason. In FutureAGI it is the ToolSelectionAccuracy class in fi.evals, used as one component of TrajectoryScore in agent regression suites. By May 2026, with MCP servers exposing hundreds of dynamic tools per agent and A2A handoffs adding sub-agent invocations, it has become the load-bearing diagnostic for tool-driven agent regressions.

Why tool selection accuracy matters in production LLM and agent systems

Picking the wrong tool is the most expensive class of agent failure: cheap in tokens, ruinous in side effects. An agent that calls cancel_subscription when the user asked to upgrade has high answer-relevancy and zero business value. The same is true of an agent that hallucinates a tool name that does not exist (silent failure with a polite “I’ve handled that”), or one that invokes a deprecated tool the runtime quietly stubs. None of these surface in TaskCompletion alone: the agent reports success, the trace shows a tool span, and the regression goes undetected until production tickets arrive.

The pain shows up in three roles. SREs see retry storms when the agent keeps calling a tool that always errors, a signal ToolSelectionAccuracy would have caught offline. Product owners watch task-completion rate stay flat while latency and cost climb because the agent is using slower, less-targeted tools. Compliance hits a wall when the agent uses delete_record from a write-allowed scope when mark_inactive was the documented path: the action is logged but not authorised by policy. The accountability and audit chain breaks the moment the wrong tool fires.

In 2026, the surface area exploded with MCP. An MCP-connected agent can see hundreds of tools at once across servers, and the ones it should use change weekly as MCP servers are added. Without a tool-selection metric scored against available_tools and required_tools, you cannot tell whether an agent regression was a model regression, a tool-catalogue change, or an MCP server returning stale schemas. Multi-agent systems compound this: each handoff introduces a new tool surface and a new opportunity to pick wrong. We’ve found that ~40% of post-model-swap regressions trace back to a tool-selection drop, not a reasoning or generation drop, which is why this metric sits next to TaskCompletion on every agent dashboard we ship.

How frontier-model behavior changed tool selection in 2025-2026

The combination of (a) parallel tool use becoming reliable on frontier models, (b) reasoning post-training (RLVR, GRPO, extended thinking) producing models that reason about which tool to call, and (c) MCP standardizing the tool catalogue interface changed what tool-selection eval has to measure. Until 2024, selection eval mostly checked “did the agent pick a sensible tool from a fixed list of five.” In 2026 it checks “did the agent, given a catalogue of 200 tools across 8 MCP servers, refuse to call when no tool fits, pick parallel calls when appropriate, sequence calls in a useful order, and avoid hallucinating tools that look plausible but were not in the snapshot.” The metric had to scale with the surface, which is why the FutureAGI implementation weights three component signals rather than collapsing to one number.

The five tool-selection failure modes

| Failure mode | What it looks like in the trace | Detected by | Fix |
| --- | --- | --- | --- |
| Wrong tool chosen | Right intent, valid tool, wrong choice (e.g., search instead of db_lookup) | ToolSelectionAccuracy coverage signal | Prompt clarification, tool-description tightening, TrajectoryScore regression cohort |
| Hallucinated tool name | Call to a tool not in available_tools | ToolSelectionAccuracy validity signal; invalid-tool rate alert | MCP catalogue snapshot per request; refusal training |
| Right tool, persistent failure | Tool spans show repeated errors on the right name | ToolSelectionAccuracy success-rate signal | Tool fix, retry policy, function-call accuracy check on args |
| Missing required tool | Required tool never called in the trajectory | ToolSelectionAccuracy coverage signal | Tighter success criteria, planner-prompt fix |
| Sequencing error | All required tools called but in wrong order | TrajectoryScore step-efficiency component | Explicit ordering constraints in prompt or planner |

The metric is rule-based and runs in milliseconds; no LLM judge is required for the headline signal. That matters: in regression suites of 500-5,000 rows, an LLM-judge-only tool-selection score blows past the CI time budget within a few releases. Versus competitors that route every tool-selection check through a judge model (Galileo’s “Tool Selection Quality” is the canonical example), the FutureAGI implementation defaults to rule-based with an optional LLM-judge fallback for ambiguous calls: fast by default, accurate when needed.

How FutureAGI handles tool selection accuracy

FutureAGI’s approach treats tool selection as a deterministic measurement over the trajectory, with weighted components that auto-normalise to whatever evaluation context you have. The fi.evals.ToolSelectionAccuracy class consumes an AgentTrajectoryInput containing the trajectory and a task with required_tools, plus an available_tools list. It collects every tool name actually called, then weights three signals: 40% for required-tool coverage (used_required / required), 30% for validity (penalised when the agent calls tools outside available_tools), and 30% for the success rate of every individual ToolCall. If you only supply some of those inputs, the weights re-normalise: pass nothing but the trajectory and you still get a usable success-rate signal.
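The weighting and re-normalisation described above can be illustrated with a small standalone sketch. This is not the fi.evals implementation: the function name selection_score, the (tool_name, succeeded) tuple format, and the exact validity formula (share of distinct called tools that exist in the catalogue) are assumptions made for illustration; only the 40/30/30 weights come from the text.

```python
from typing import Iterable, Optional, Sequence, Tuple

# Weights come from the description above; the signal formulas are assumptions.
WEIGHTS = {"coverage": 0.4, "validity": 0.3, "success": 0.3}


def selection_score(
    calls: Sequence[Tuple[str, bool]],
    required_tools: Optional[Iterable[str]] = None,
    available_tools: Optional[Iterable[str]] = None,
) -> float:
    """Score a trajectory given as (tool_name, succeeded) pairs (hypothetical format)."""
    used = {name for name, _ in calls}
    signals = {}

    # Coverage: share of required tools that were actually called.
    required = set(required_tools or [])
    if required:
        signals["coverage"] = len(used & required) / len(required)

    # Validity (assumed formula): share of distinct called tools found in the catalogue.
    available = set(available_tools or [])
    if available and used:
        signals["validity"] = len(used & available) / len(used)

    # Success: share of individual calls that succeeded.
    if calls:
        signals["success"] = sum(1 for _, ok in calls if ok) / len(calls)

    if not signals:
        return 0.0
    # Re-normalise over whichever signals could be computed.
    total_weight = sum(WEIGHTS[name] for name in signals)
    return sum(WEIGHTS[name] * value for name, value in signals.items()) / total_weight


calls = [("read_file", True), ("edit_file", True), ("made_up_tool", False), ("read_file", True)]
full = selection_score(
    calls,
    required_tools=["read_file", "edit_file", "run_tests"],
    available_tools=["read_file", "edit_file", "run_tests", "grep"],
)
partial = selection_score(calls)  # trajectory only: falls back to success rate
```

With only the trajectory supplied, the single remaining signal carries the full weight, which is the fallback behaviour the paragraph describes.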

Concretely: a coding-agent team on the OpenAI Agents SDK uses traceAI-openai-agents to stream every ToolCall into an agent.trajectory.step span. Their offline regression suite attaches ToolSelectionAccuracy to a 200-task golden set with required_tools=["read_file", "edit_file", "run_tests"] per task. When a prompt update causes the model to skip run_tests on 18% of tasks, the metric drops from 0.94 to 0.81, well before that change ships. Compared with Galileo’s “Tool Selection Quality” judge-model approach, the FutureAGI implementation runs in milliseconds because it is rule-based over the trajectory, with an optional LLM-judge mode for ambiguous calls. The team pairs the metric with FunctionCallAccuracy so they catch both wrong-tool and wrong-argument failures from the same dashboard, and with TaskCompletion so they see the outcome impact.

Wiring into release gates and runtime guardrails

At release time, ToolSelectionAccuracy is one of the four headline gates in the agent release pipeline (alongside TaskCompletion, TrajectoryScore, and FunctionCallAccuracy). The CI job runs the agent over a golden dataset, scores selection, and blocks the deploy if any cohort’s score drops more than 2 points or any safety-critical cohort drops at all.
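The gate logic can be sketched in a few lines. This is a hedged illustration, not the FutureAGI pipeline: blocked_cohorts, the cohort names, and the scores are hypothetical, and "2 points" is read here as 0.02 on the 0–1 scale.

```python
from typing import Dict, List, Set, Tuple


def blocked_cohorts(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    safety_critical: Set[str],
    max_drop: float = 0.02,  # assumption: "2 points" means 0.02 on a 0-1 score
) -> List[Tuple[str, str]]:
    """Return (cohort, reason) pairs that should block the deploy."""
    blockers = []
    for cohort, base in baseline.items():
        if cohort not in candidate:
            blockers.append((cohort, "missing from candidate run"))
            continue
        # Round to avoid float noise deciding a borderline case.
        drop = round(base - candidate[cohort], 6)
        if cohort in safety_critical and drop > 0:
            blockers.append((cohort, f"safety-critical drop of {drop:.3f}"))
        elif drop > max_drop:
            blockers.append((cohort, f"drop of {drop:.3f} exceeds {max_drop:.3f}"))
    return blockers


# Made-up scores for illustration only.
baseline = {"read_only": 0.94, "billing": 0.90, "destructive_action": 0.97}
candidate = {"read_only": 0.91, "billing": 0.89, "destructive_action": 0.96}
blockers = blocked_cohorts(baseline, candidate, safety_critical={"destructive_action"})
deploy_allowed = not blockers
```

Treating a cohort that is missing from the candidate run as a blocker is a design choice in this sketch: a silently dropped cohort would otherwise pass the gate by default.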

At runtime, the same evaluator runs on a sampled fraction of live trajectories via traceAI. When invalid-tool rate exceeds zero over a rolling window, Agent Command Center can shift that route to a fallback model, escalate to human review via AnnotationQueue, or activate a stricter pre-guardrail that requires explicit confirmation for destructive tools. The pattern is the same as for completion or hallucination: eval-time and runtime share the evaluator, so there is no drift between dev expectation and live behavior.
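A rolling-window invalid-tool check of the kind described could look roughly like the sketch below. The class InvalidToolMonitor and its methods are hypothetical and stand in for whatever the runtime actually uses; the window here counts trajectories rather than wall-clock time, which is a simplifying assumption.

```python
from collections import deque
from typing import Deque, Iterable, List


class InvalidToolMonitor:
    """Tracks whether recent trajectories called tools outside their catalogue."""

    def __init__(self, window: int = 100) -> None:
        # Each entry is True when that trajectory contained an invalid tool name.
        self._window: Deque[bool] = deque(maxlen=window)

    def record(self, tools_used: Iterable[str], available_tools: Iterable[str]) -> List[str]:
        """Record one sampled trajectory and return its invalid tool names."""
        invalid = sorted(set(tools_used) - set(available_tools))
        self._window.append(bool(invalid))
        return invalid

    def rate(self) -> float:
        """Share of trajectories in the window with at least one invalid tool."""
        return sum(self._window) / len(self._window) if self._window else 0.0

    def should_escalate(self) -> bool:
        # Any non-zero rate in the window triggers the fallback or review path.
        return self.rate() > 0


monitor = InvalidToolMonitor(window=50)
monitor.record(["lookup_account"], ["lookup_account", "issue_refund"])
monitor.record(["lookup_account", "close_acount"], ["lookup_account", "issue_refund"])
escalate = monitor.should_escalate()
```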

Tool selection in voice and multimodal agents

Voice agents add another dimension: the model is reading an ASR transcript, which carries error noise into the selection decision. A misheard customer name can route the agent to a search tool instead of an account-lookup tool; in that case the regression is in the ASR layer, not the model’s selection capability. The voice-agent-evaluation pattern is to score ToolSelectionAccuracy and ASRAccuracy together; selection drops correlated with ASR drops point to the speech path, while selection drops without ASR drops point to the model.

Multimodal agents (Anthropic computer-use, GPT-5.x with vision, Gemini 3.x) carry a similar gotcha: the model selects tools based partly on what it sees in a screenshot, and visual perception errors propagate into selection errors. Run a cohort breakdown by input_modality on every selection eval.
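A cohort breakdown by input_modality needs nothing more than a group-by over per-trajectory scores. The sketch below assumes each result is a plain dict with a score and an input_modality label; the row shape, the helper cohort_means, and the numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def cohort_means(rows: List[dict], key: str) -> Dict[str, float]:
    """Average the 'score' field per distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["score"])
    return {cohort: mean(scores) for cohort, scores in groups.items()}


# Illustrative scores only: text tasks look healthy, screenshot tasks do not.
rows = [
    {"input_modality": "text", "score": 0.95},
    {"input_modality": "text", "score": 0.93},
    {"input_modality": "screenshot", "score": 0.70},
    {"input_modality": "screenshot", "score": 0.66},
]
by_modality = cohort_means(rows, "input_modality")
overall = mean(row["score"] for row in rows)
```

In this toy data the overall mean sits between the two cohorts and hides how far the screenshot cohort has fallen, which is the reason to slice by modality rather than read the headline number.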

MCP and A2A: the dynamic-catalogue case

MCP changed the tool-selection problem in two ways. First, available_tools is now per-request, not per-deploy: the catalogue can change between two consecutive trajectories. The fix is to snapshot the catalogue on each request and pass it into the evaluator, not a static list. Second, A2A makes other agents callable as tools, so tools_used may include sub-agent invocations whose own trajectories matter. We’ve found that the strongest leading indicator of an MCP-induced regression is invalid_tool_rate ticking off zero; alert on that the moment it does, because a model that starts hallucinating tool names has usually drifted on its tool-discovery prompt or hit a catalogue version mismatch.
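To see why the per-request snapshot matters, compare validity against a static list with validity against the catalogue each request actually saw. The sketch below is a toy illustration: the tool names, the request records, and the invalid_tools helper are all hypothetical.

```python
from typing import Iterable, List


def invalid_tools(tools_used: Iterable[str], catalogue: Iterable[str]) -> List[str]:
    """Tool names that were called but are absent from the given catalogue."""
    return sorted(set(tools_used) - set(catalogue))


# A deploy-time list that nobody updated after the MCP server changed.
STATIC_CATALOGUE = ["lookup_account", "issue_refund"]

requests = [
    {
        "snapshot": ["lookup_account", "issue_refund"],
        "tools_used": ["lookup_account", "issue_refund"],
    },
    {
        # The server replaced issue_refund with issue_refund_v2 before this request,
        # but the agent still called the old name.
        "snapshot": ["lookup_account", "issue_refund_v2"],
        "tools_used": ["lookup_account", "issue_refund"],
    },
]

for request in requests:
    request["invalid_vs_snapshot"] = invalid_tools(request["tools_used"], request["snapshot"])
    request["invalid_vs_static"] = invalid_tools(request["tools_used"], STATIC_CATALOGUE)
```

Against the static list the second request looks clean; against its own snapshot the stale call is flagged, so only the snapshot version moves the invalid-tool rate off zero.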

How to measure tool selection accuracy

Measurement signals to wire to a ToolSelectionAccuracy eval:

  • fi.evals.ToolSelectionAccuracy: returns a 0–1 score, the list of tools_used, and a successful_calls / total_calls ratio. Threshold at 0.85 in regression and 0.7 in production sampling.
  • agent.trajectory.step OTel attribute: every ToolCall is emitted as a span with tool.name and tool.success; the eval reads from these directly. Spans are captured across traceAI-openai-agents, traceAI-langchain, traceAI-mcp, traceAI-crewai, traceAI-pydantic-ai, traceAI-google-adk, and traceAI-anthropic.
  • Invalid-tool-rate dashboard signal: the count of tools that appear in tools_used but not in available_tools over a 24h window; any non-zero value indicates a hallucinated tool name and should page.
  • Required-tool coverage cohort: grouping by user-intent label exposes which task types are under-served by the model’s tool choice.
  • Per-tool error rate: paired with selection accuracy, it lets you separate “right tool, bad tool” from “wrong tool entirely.”
  • Selection vs success gap: when selection holds at 0.95 but success rate drops, the agent is picking right and the tool is broken; route to infra, not to ML.
  • Cohort breakdown by model + prompt version: drift in one cohort after a model upgrade isolates the cause to the model, not the prompt.
  • BFCL v3 reference score: keep the public benchmark visible as a tier indicator; if your internal selection accuracy is 0.85 while the model’s published BFCL v3 is 0.93, the gap is your prompts or your tool descriptions, not the model.
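The selection-versus-success gap in the list above lends itself to a simple routing rule. The sketch below is one possible reading, with assumptions stated plainly: "selection" means the coverage and validity side of the score, the function triage is hypothetical, and both floor values are placeholders to tune on your own data.

```python
def triage(selection: float, success_rate: float,
           selection_floor: float = 0.85, success_floor: float = 0.90) -> str:
    """Suggest an owner for a regression; floors are illustrative, not recommended values."""
    if selection < selection_floor:
        # The agent is choosing the wrong tools: a model or prompt problem.
        return "ml"
    if success_rate < success_floor:
        # Tool choice is fine but calls fail: look at the tool or its infrastructure.
        return "infra"
    return "healthy"


owner = triage(selection=0.95, success_rate=0.60)
```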

Minimal Python:

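from fi.evals import ToolSelectionAccuracy

# `run` is assumed to be a completed agent run whose trajectory was captured upstream.
metric = ToolSelectionAccuracy()
result = metric.evaluate(
    trajectory=run.trajectory,
    # Catalogue the agent could see; enables the validity check.
    available_tools=["read_file", "edit_file", "run_tests", "grep"],
    # Tools the task definition requires; enables the coverage check.
    task={"required_tools": ["read_file", "run_tests"]},
)
print(result.score, result.tools_used)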

For cohort-filtered regression eval against a persisted Dataset (the workflow most teams adopt once they have an MCP catalogue plus a release gate to defend), chain selection, args, and outcome together:

from fi.evals import Dataset, ToolSelectionAccuracy, FunctionCallAccuracy, TaskCompletion

ds = Dataset.load("agent_golden_v7")

results = ds.evaluate(
    evaluators=[
        ToolSelectionAccuracy(threshold=0.90, alert_on_invalid_tool=True),
        FunctionCallAccuracy(threshold=0.88),
        TaskCompletion(threshold=0.75),
    ],
    cohort_filter={"task_type": "destructive_action", "mcp_server": "billing"},
)
results.gate(baseline="release_v4.7_baseline", max_delta=-0.02)

Healthy selection: thresholded scores hold across cohorts, invalid-tool rate stays at zero, success gap stays small, and the dashboard surfaces the failure-mode breakdown (coverage / validity / success), not just the headline score. As a reference, BFCL v3 frontier scores cluster at 88-94% on the headline in May 2026; the meaningful gap to track on your own catalogue is the irrelevance and missing-tool sub-scores, where most production regressions land.

Pairing ToolSelectionAccuracy with the rest of the eval panel

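Selection is one of four metrics that together explain any agent-tool regression. The other three are the remaining headline release gates: TaskCompletion, TrajectoryScore, and FunctionCallAccuracy.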

Add BiasDetection and PII for safety-critical surfaces, and a CustomEvaluation for product-specific tool policies (e.g., “must always call the policy-lookup tool before the refund tool”). The panel runs as one regression job; the dashboard renders the worst-mover first.

Common mistakes

  • Confusing tool selection with function call accuracy. ToolSelectionAccuracy grades which tool. FunctionCallAccuracy grades the call’s name plus argument values. You need both.
  • Empty available_tools. Without it, the validity check skips and you cannot detect hallucinated tool names. Always pass the registered tool catalogue, and snapshot it per request if you are on MCP.
  • No required_tools on the task. The eval falls back to call-success-rate alone, which an agent can game by calling one safe tool repeatedly.
  • Scoring only the final tool call. Trajectory-level evaluation is the point: a wrong middle call corrupts every downstream step even if the final call is correct.
  • Ignoring tool-call-success in production. A high tool-selection score with a low success rate means the right tool is being called but the call fails, either because the inputs are wrong or because the tool itself is broken; route those failures to FunctionCallAccuracy and to the tool owner.
  • No regression eval after a tool registry change. Adding, renaming, or removing a tool changes the agent’s selection distribution silently; rerun the regression set on every catalogue change, not just on every model change.
  • Same-family judge in LLM-judge mode. Pin the optional LLM-judge to a different model family from the agent’s model; same-family judging inflates the score by a measurable margin in our 2026 evals.
  • Treating selection accuracy as a global average. A 0.92 global average can hide a 0.62 cohort; the cohort breakdown is the only honest aggregate.
  • No alert on invalid-tool rate ticking off zero. It is the cleanest leading indicator of an MCP catalogue drift or a model regression; treat it as a paging condition.
  • Forgetting the planner is part of the system. Selection failures often originate in the planner’s tool-selection prompt, not in the model’s tool-calling capability. Track planner-prompt versions on each span.
  • No coverage for refusal scenarios. A correctly-behaving agent should refuse to call any tool when the user’s intent is out of scope; if your eval set has no “no tool should fire” cohort, the agent learns to always call something.
  • Skipping the audit-log integration. Every selection result should land in the trace record with model, prompt version, snapshot of available_tools, and reason; without it, post-incident analysis turns into archaeology.
  • No alerting on per-cohort score drift. A 0.92 → 0.90 global drop is invisible; the same global drop sourced from a 0.84 → 0.62 cohort drop is a release blocker. Slice every score by cohort by default; raise the global only as a summary.
  • Treating BFCL v3 as a production gate. The Berkeley leaderboard is a tier filter, not a release contract. Your own golden dataset (with your tool catalogue, your prompts, your refusal rubric) is the only gate that should block a deploy.
  • Forgetting is_optional tools. Some available_tools are optional helpers; coverage should only penalise missing required tools. Set the required vs optional split in the task definition or your eval will over-flag.
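The refusal-scenario point in the list above can be checked with a dedicated cohort score. The sketch below is a minimal illustration under assumed names: refusal_pass_rate, the expect_no_tool flag, and the task records are hypothetical and not part of fi.evals.

```python
from typing import List, Optional


def refusal_pass_rate(results: List[dict]) -> Optional[float]:
    """Share of 'no tool should fire' tasks where the agent called nothing.

    Returns None when the eval set has no refusal tasks, so the missing
    cohort is visible instead of silently scoring as perfect.
    """
    refusal_runs = [r for r in results if r["expect_no_tool"]]
    if not refusal_runs:
        return None
    passed = sum(1 for r in refusal_runs if not r["tools_used"])
    return passed / len(refusal_runs)


# Made-up results: two out-of-scope requests, one handled correctly.
results = [
    {"task": "refund order", "expect_no_tool": False, "tools_used": ["issue_refund"]},
    {"task": "write me a poem", "expect_no_tool": True, "tools_used": []},
    {"task": "what is the weather", "expect_no_tool": True, "tools_used": ["search"]},
]
rate = refusal_pass_rate(results)
```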

Frequently Asked Questions

What is tool selection accuracy?

Tool selection accuracy is an agent-eval metric that scores whether the agent chose the right tools across its trajectory, measuring required-tool coverage, validity against the available toolset, and the success rate of the actual calls.

How is tool selection accuracy different from function call accuracy?

Tool selection accuracy grades which tools the agent picked. Function call accuracy grades the call itself: name plus argument values. An agent can have perfect tool selection and still fail FunctionCallAccuracy because it filled a parameter wrong.

How do you measure tool selection accuracy?

FutureAGI's fi.evals.ToolSelectionAccuracy consumes the agent trajectory, a list of available_tools, and required_tools from the task definition, then returns a 0–1 score weighted across coverage, validity, and call success.