What Is Tool Use (LLM)?
LLM tool use is an agent capability where a model selects and invokes external tools such as APIs, retrievers, functions, or code executors.
Tool use in an LLM is the agent capability that lets a model choose and invoke external systems such as APIs, search, databases, code executors, or other agents. It belongs to the agent family because reliability depends on choosing the right tool, filling the right arguments, and using the result at the correct step. In production, FutureAGI records tool use as trace spans and evaluates selection with ToolSelectionAccuracy, which catches wrong-tool calls before they trigger bad actions or user-facing failures.
Why Tool Use Matters in Production LLM and Agent Systems
Tool use turns a language model from a text generator into an actor. A wrong answer is bad; a wrong tool call can refund money, delete data, expose private records, send the wrong notification, or start an expensive downstream job. The core failure modes are wrong-tool execution, argument drift, stale tool descriptions, and side-effect escalation. They often look fine in a transcript because the model explains the action confidently after the runtime already executed it.
Different teams feel the pain in different places. Developers see 4xx errors when an agent passes a customer email to a tool that expects an internal account ID. SREs see p99 latency move after the agent retries a slow search tool instead of using a cached lookup. Compliance teams see agents call write-path tools without approval. Product teams see users report that the assistant “did the wrong thing,” even when the final text was polite.
This matters more in 2026-era agent pipelines than in single-turn LLM calls because tools are chained across planning, retrieval, memory, code execution, and agent-to-agent handoffs. MCP servers can expose dozens of tools dynamically. A model upgrade that slightly changes selection behavior can break only one branch of a workflow, so aggregate task success may hide the regression. You need per-step visibility, not just final-answer scoring.
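The argument-drift failure described above (a customer email passed to a tool that expects an internal account ID) can be caught before execution with a lightweight argument check. A minimal sketch, assuming a hypothetical schema format and tool names; this is not a FutureAGI API:

```python
import re

# Hypothetical per-tool argument schemas: field name -> validation regex.
TOOL_SCHEMAS = {
    "lookup_customer": {"account_id": r"^ACC-\d{6}$"},  # internal ID, not an email
}

def validate_args(tool: str, args: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the call may proceed."""
    problems = []
    schema = TOOL_SCHEMAS.get(tool, {})
    for field, pattern in schema.items():
        value = args.get(field)
        if value is None:
            problems.append(f"{tool}: missing argument '{field}'")
        elif not re.match(pattern, str(value)):
            problems.append(f"{tool}: '{field}' value {value!r} has the wrong format")
    return problems

# An agent passing an email where an account ID belongs is flagged, not executed.
issues = validate_args("lookup_customer", {"account_id": "jane@example.com"})
```

Running this kind of check in the tool runtime turns a confident-looking 4xx failure into a structured, loggable rejection.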
How FutureAGI Handles Tool Use
FutureAGI’s approach is to treat tool use as a scored decision inside the agent trajectory, not as a loose log message. With traceAI integrations such as traceAI-langchain, traceAI-openai-agents, and traceAI-mcp, each tool execution is captured as an OpenTelemetry span. The span is tied to agent.trajectory.step and carries the tool name, argument payload, result status, latency, and surrounding reasoning context.
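The span fields described above can be pictured as a flat attribute map attached to one tool execution. A sketch of the shape such a record might take; the attribute names here are illustrative, not the exact traceAI schema:

```python
import time

def tool_span(step: int, tool: str, args: dict, status: str, started: float) -> dict:
    """Build a span-like record for one tool execution, keyed the way a
    trajectory evaluator would consume it (field names illustrative)."""
    return {
        "agent.trajectory.step": step,
        "tool.name": tool,
        "tool.arguments": args,
        "tool.status": status,  # e.g. "ok" or "error"
        "tool.latency_ms": round((time.monotonic() - started) * 1000, 1),
    }

started = time.monotonic()
span = tool_span(2, "search_policy", {"query": "refund window"}, "ok", started)
```

Because each record carries the step index alongside the tool name and arguments, an evaluator can score the decision in context instead of reading a loose log line.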
The anchor evaluator for this glossary entry is ToolSelectionAccuracy, a FutureAGI local metric that evaluates whether the agent selected the correct tool. FunctionCallAccuracy is the companion check for name-and-argument correctness after the selection decision has been made. Unlike raw trace review in a tool such as LangSmith, the goal is not only to see that a tool call happened; the goal is to score whether that tool was the right action for the user’s objective and the current state.
Example: a support agent has lookup_customer, search_policy, and create_refund. After a model swap, traces show create_refund appearing before search_policy on edge cases. FutureAGI evaluates those steps with ToolSelectionAccuracy, alerts when the score drops below the release threshold, and lets the engineer inspect the failing spans. The fix might be a prompt regression test, a stricter tool description, or an Agent Command Center post-guardrail that blocks write-path tools until the policy lookup has succeeded.
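The post-guardrail idea in that example can be sketched as a simple trajectory check. This assumes the example's tool names and a minimal trace format; the real Agent Command Center configuration will differ:

```python
WRITE_PATH_TOOLS = {"create_refund"}     # tools with side effects
REQUIRED_PRECONDITION = "search_policy"  # must succeed before any write

def allow_tool_call(trace: list[dict], tool: str) -> bool:
    """Block write-path tools until the policy lookup has succeeded earlier
    in the same trajectory (hypothetical guardrail logic)."""
    if tool not in WRITE_PATH_TOOLS:
        return True
    return any(
        step["tool"] == REQUIRED_PRECONDITION and step.get("status") == "ok"
        for step in trace
    )

trace = [{"tool": "lookup_customer", "status": "ok"}]
blocked = not allow_tool_call(trace, "create_refund")  # no policy check yet
trace.append({"tool": "search_policy", "status": "ok"})
allowed = allow_tool_call(trace, "create_refund")      # precondition satisfied
```

The guardrail enforces ordering at runtime, while ToolSelectionAccuracy catches the same regression offline in evals.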
How to Measure or Detect Tool Use
Measure tool use at the decision, argument, runtime, and outcome layers:
- ToolSelectionAccuracy: scores whether the selected tool matches the expected tool for the task and trajectory step.
- FunctionCallAccuracy: checks whether the chosen callable has the right function name, argument structure, and semantic values.
- agent.trajectory.step: groups tool spans by step so failures can be traced to planning, retrieval, execution, or handoff.
- Dashboard signals: track wrong-tool rate, tool-error-rate-by-name, p99 tool latency, token-cost-per-trace, and eval-fail-rate-by-cohort.
- User feedback proxies: monitor thumbs-down rate, escalation rate, refund reversals, and manual override frequency after tool calls.
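The dashboard signals above can be derived directly from tool spans. A minimal sketch that computes wrong-tool rate and error-rate-by-name from a list of span dicts; the field names are assumed for illustration:

```python
from collections import Counter

def tool_metrics(spans: list[dict]) -> dict:
    """Aggregate tool spans into wrong-tool rate and error-rate-by-name
    (span fields are illustrative, not a fixed schema)."""
    total = len(spans)
    wrong = sum(1 for s in spans if s["tool"] != s["expected_tool"])
    errors = Counter(s["tool"] for s in spans if s["status"] == "error")
    calls = Counter(s["tool"] for s in spans)
    return {
        "wrong_tool_rate": wrong / total if total else 0.0,
        "error_rate_by_tool": {t: errors[t] / calls[t] for t in calls},
    }

spans = [
    {"tool": "create_refund", "expected_tool": "search_policy", "status": "ok"},
    {"tool": "search_policy", "expected_tool": "search_policy", "status": "error"},
]
metrics = tool_metrics(spans)
```

Tracking these per tool name, rather than in aggregate, is what surfaces a regression that only breaks one branch of a workflow.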
Minimal evaluator call:
from fi.evals import ToolSelectionAccuracy

# The goal the agent was given and the observed trajectory to score.
user_goal = "Check policy before refunding order 123"
agent_trace = [{"tool": "create_refund", "step": 1}]

# Score whether the selected tool was the right action for this goal.
result = ToolSelectionAccuracy().evaluate(
    input=user_goal,
    trajectory=agent_trace,
)
print(result.score, result.reason)
Common mistakes
Common mistakes are usually measurement mistakes:
- Scoring only the final answer. A fluent summary can hide that the agent called the wrong API two steps earlier.
- Treating all tools as equally safe. Read-only search, account lookup, and write-path actions need different thresholds and approval gates.
- Logging tool names without arguments. Tool selection can be correct while the arguments quietly point at the wrong customer, file, or time range.
- Testing only the happy path. Tool choice often fails on ambiguous requests, missing permissions, stale memory, or two similar tools.
- Confusing availability with suitability. Just because a tool is in the registry does not mean the agent should call it for every related request.
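The "not all tools are equally safe" point can be made concrete with per-tool risk tiers that map to different eval thresholds and approval gates. The tier names and threshold values below are illustrative policy, not a FutureAGI default:

```python
# Illustrative risk tiers: stricter selection-accuracy thresholds and
# human approval for tools with irreversible side effects.
RISK_POLICY = {
    "search_policy":   {"tier": "read_only",  "min_selection_score": 0.80, "needs_approval": False},
    "lookup_customer": {"tier": "read_only",  "min_selection_score": 0.85, "needs_approval": False},
    "create_refund":   {"tier": "write_path", "min_selection_score": 0.98, "needs_approval": True},
}

def release_gate(tool: str, selection_score: float) -> str:
    """Decide what happens to a tool call given its eval score and risk tier."""
    policy = RISK_POLICY[tool]
    if selection_score < policy["min_selection_score"]:
        return "block"
    return "require_approval" if policy["needs_approval"] else "allow"

decision = release_gate("create_refund", 0.99)  # high score, but still gated
```

A write-path tool that clears its threshold still routes through approval, while a read-only tool with the same score is allowed outright.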
Frequently Asked Questions
What is tool use in an LLM?
Tool use is the agent capability that lets an LLM choose and invoke external tools such as APIs, retrievers, functions, code executors, or other agents during a run.
How is tool use different from tool calling?
Tool use is the broader behavior: deciding when and why an external capability is needed. Tool calling is the structured invocation mechanism that executes that choice.
How do you measure tool use?
FutureAGI uses ToolSelectionAccuracy to score whether the agent selected the right tool at each step, and traceAI spans use agent.trajectory.step to place the call in context.