What Is Tool Calling?
The capability that lets an LLM-driven agent invoke external functions, APIs, retrievers, or sub-agents during a run by emitting a structured request.
Tool calling is the capability that lets an LLM-driven agent invoke external functions, APIs, retrievers, code interpreters, or sub-agents during a run. The agent’s model is given a registry of tools, each with a name, description, and argument schema, and emits a structured request when it decides to act. The runtime executes the request, captures the result, and feeds it back as the next observation. Tool calling is the umbrella concept; function calling is the specific OpenAI-style implementation. In a FutureAGI trace, each tool invocation is a span with the tool name, arguments, latency, success flag, and result. By May 2026, tool calling is no longer the optional add-on it was in 2023: every frontier model (GPT-5.x, Claude Opus 4.7, Gemini 3.x, Llama 4) ships first-class tool support, BFCL v3 is the canonical capability benchmark, and MCP plus A2A have standardized the interfaces across providers.
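The registry, structured request, and observation loop described above can be sketched in a few lines. This is a minimal illustration, not any vendor's runtime: the `Tool` dataclass, the `get_weather` tool, and the shape of the request and observation dictionaries are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    description: str
    schema: dict[str, type]  # argument name -> expected Python type
    fn: Callable[..., Any]


def get_weather(city: str) -> str:
    # Hypothetical stand-in for a real API call.
    return f"Sunny in {city}"


REGISTRY = {
    "get_weather": Tool(
        name="get_weather",
        description="Look up the current weather for a city.",
        schema={"city": str},
        fn=get_weather,
    )
}


def execute_tool_call(raw_request: str, registry: dict[str, Tool]) -> dict:
    """Run one structured request and return the observation for the model."""
    request = json.loads(raw_request)
    name = request["name"]
    args = request.get("arguments", {})
    tool = registry.get(name)
    if tool is None:
        # The model named a tool the registry does not expose.
        return {"tool": name, "success": False, "result": f"unknown tool: {name}"}
    if set(args) != set(tool.schema):
        return {"tool": name, "success": False, "result": "argument names do not match schema"}
    for arg, expected_type in tool.schema.items():
        if not isinstance(args[arg], expected_type):
            return {"tool": name, "success": False, "result": f"wrong type for: {arg}"}
    return {"tool": name, "success": True, "result": tool.fn(**args)}


observation = execute_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}', REGISTRY
)
print(observation)
```

Returning failures as observations, rather than raising, lets the model see the error on its next step and gives the trace a success flag to record.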
Why tool calling matters in production LLM and agent systems
Tool calling is the second most common agent failure surface after planning. The model can pick the wrong tool: search when it should call the database. It can pick the right tool with wrong arguments: query="customer" instead of query="customer ID 12345". It can pick the right tool, get a transient failure, and then loop or give up. It can pick a tool that doesn’t exist because the registry changed and the prompt didn’t. None of these get caught by answer-relevancy checks; the model’s reply still reads as helpful.
Different roles see different failure shapes. A backend engineer sees 4xx error rates spike on a downstream API because the agent passes wrong types. An SRE sees latency double when the agent retries a slow tool that should be cached. A product reviewer sees an agent confidently report success on a tool call that returned an error string the model misread as data. Compliance worries about agents calling tools that take real-world actions (bookings, payments, deletions) without human approval. The accountability chain breaks the moment a tool span lacks an audit log entry.
In 2026, the tool-calling surface is multiplying. The OpenAI Agents SDK ships first-class tool definitions; LangGraph nodes call tools via decorators; the Model Context Protocol (MCP) standardizes tool discovery across servers; A2A treats other agents as tools; Anthropic’s computer-use tools allow agents to drive arbitrary GUIs. Every one of these emits the same structural shape, a tool span with name, args, and result, which is what makes universal evaluation possible. We’ve found in our 2026 evals that the dominant tool-calling regression after a model swap is not selection accuracy on familiar tools; it is hallucinated tool names when the new model’s training distribution included tools your registry doesn’t expose.
2026 tool-calling surface: what shipped and what evaluates it
The protocol layer split into three live standards plus dozens of vendor surfaces:
| Surface | What it is | Evaluator (FutureAGI) | Notable in 2026 |
|---|---|---|---|
| OpenAI function calling | JSON-schema function args via the chat-completions API | FunctionCallAccuracy | Still the most common in the wild; GPT-5.x parallel-call quality has lifted |
| OpenAI Agents SDK | First-class agent + tool framework | ToolSelectionAccuracy, TaskCompletion | The reference orchestrator for OpenAI-stack agents |
| Anthropic tool use | Native tool blocks with structured I/O | ToolSelectionAccuracy, FunctionCallAccuracy | Claude Opus 4.7 leads BFCL v3 multi-step categories |
| Model Context Protocol (MCP) | Open standard for cross-vendor tool servers | ToolSelectionAccuracy over MCP spans | Hundreds of public MCP servers by 2026; schema-drift is the main risk |
| A2A Protocol | Open standard for agent-to-agent calls | TrajectoryScore end-to-end | Sub-agent calls evaluated as tool calls |
| LangGraph tools | Decorator-based tool nodes in a graph | traceAI-langchain spans + selection eval | Graph-shape evaluation, not just trajectory |
| CrewAI / Pydantic-AI / LlamaIndex Agents | Higher-level orchestrators with tool support | traceAI-{crewai,pydantic-ai,llamaindex} | Each ships its own decorator-style tool API |
| Anthropic computer-use | Tool calls that drive GUIs (screenshot + click + type) | TaskCompletion against OSWorld-shaped golden sets | OSWorld scores rose 25 points across the 2025-2026 frontier; still <40% headroom |
| Code-execution sandboxes | Tools that run model-emitted code (Python, JS, shell) | FunctionCallAccuracy + sandbox security checks | The biggest excessive-agency risk surface |
The headline: by 2026 tool calling spans far beyond “function calling.” Any evaluator that only scores OpenAI-style function calls is measuring a fraction of the surface.
Multi-tool, parallel-tool, and agentic-tool patterns
A 2026 production agent rarely calls one tool per step. The three patterns that matter:
- Parallel calls: the model emits N tool calls in one step, the runtime fans them out, and the agent reasons over all results at once. GPT-5.x and Claude Opus 4.7 are reliable in parallel; smaller models tend to fall back to sequential. Parallel correctness is its own benchmark category in BFCL v3, and the FutureAGI trace capture preserves parallel-call ordering so evaluation reflects the actual fan-out behavior.
- Tool chaining: the model calls tool A, reads the result, then calls tool B with arguments derived from A’s output. This is where Faithfulness on tool outputs matters: if the agent fabricates an argument it claims came from tool A, the trajectory looks fine but the downstream call corrupts state. Run a faithfulness eval over tool-result-derived arguments, not just over the final answer.
- Recursive sub-agent calls: under A2A, a tool call can itself be another agent. TrajectoryScore at the parent level must include the sub-agent’s trajectory; otherwise a failed sub-agent that returned “Done” hides the issue.
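The parallel pattern from the list above can be sketched with the standard library. This is an illustrative runtime fan-out, not how any particular SDK implements it; the two tools and the call dictionaries are hypothetical. The point it shows is that results stay in the order the model emitted the calls, so a per-step "did all calls succeed?" check lines up with the trace.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup_order(order_id: str) -> dict:
    # Hypothetical read-only tool.
    return {"order_id": order_id, "status": "shipped"}


def lookup_customer(customer_id: str) -> dict:
    # Hypothetical tool that rejects malformed arguments.
    if not customer_id.isdigit():
        raise ValueError("customer_id must be numeric")
    return {"customer_id": customer_id, "tier": "gold"}


TOOLS = {"lookup_order": lookup_order, "lookup_customer": lookup_customer}


def run_one(call: dict) -> dict:
    """Execute one call; turn any exception into a failed result."""
    try:
        result = TOOLS[call["name"]](**call["arguments"])
        return {"name": call["name"], "success": True, "result": result}
    except Exception as exc:
        return {"name": call["name"], "success": False, "result": str(exc)}


def fan_out(calls: list[dict]) -> list[dict]:
    """Run every call from one model step concurrently, in emission order."""
    if not calls:
        return []
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(run_one, calls))


step_calls = [
    {"name": "lookup_order", "arguments": {"order_id": "A-17"}},
    {"name": "lookup_customer", "arguments": {"customer_id": "abc"}},
]
results = fan_out(step_calls)
all_succeeded = all(r["success"] for r in results)
print([r["success"] for r in results], all_succeeded)
```

Here the second call fails on its argument while the first succeeds, which is the case a parallel-call success metric is meant to surface.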
BFCL v3: the canonical tool-calling benchmark
The Berkeley Function Calling Leaderboard v3 is the closest the field has to a saturated capability benchmark for tool calling. It scores across single calls, parallel calls, multiple calls, missing-tool detection, and irrelevance detection; the last two matter most for production safety. In May 2026 frontier scores cluster 88-94% on the headline; the meaningful gaps are in the irrelevance and missing-tool categories (a model that calls a wrong tool when none of the available tools fits the request is the production nightmare). Vendors that publish a BFCL v3 number without the irrelevance subcategory are skipping the failure mode that matters.
How FutureAGI handles tool calling
FutureAGI’s approach is to instrument every tool call as an OpenTelemetry span and evaluate selection plus arguments separately. The traceAI integrations for openai-agents, langchain, mcp, crewai, pydantic-ai, llamaindex, google-adk, and anthropic wrap the tool-execution path so each call lands as a span tagged with agent.trajectory.step, the tool name, the argument JSON, the result, and the duration. That gives engineers a per-tool dashboard across frameworks. The same spans surface in Agent Command Center, where a pre-guardrail can intercept destructive tool calls (deletions, payments, bookings) and route them through a confirmation step or a human-in-the-loop gate.
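To make the pre-guardrail idea concrete, here is a small sketch of a gate that sits in front of the tool executor. It is not the Agent Command Center API; the tool names, the `approve` callback, and the audit-log shape are assumptions for illustration. Read-only tools pass through, destructive tools need an explicit approval, and every decision is logged.

```python
from typing import Callable

# Hypothetical tool names tagged as having real-world side effects.
DESTRUCTIVE_TOOLS = {"delete_record", "issue_payment", "create_booking"}

audit_log: list[dict] = []


def pre_guardrail(call: dict, approve: Callable[[dict], bool]) -> bool:
    """Return True if the call may execute; record every decision."""
    if call["name"] not in DESTRUCTIVE_TOOLS:
        decision = "allowed"
    elif approve(call):
        decision = "approved"
    else:
        decision = "blocked"
    audit_log.append(
        {"tool": call["name"], "args": call["arguments"], "decision": decision}
    )
    return decision != "blocked"


def deny_all(call: dict) -> bool:
    # Stand-in for a human reviewer or confirmation step that says no.
    return False


read_ok = pre_guardrail({"name": "read_file", "arguments": {"path": "a.txt"}}, deny_all)
pay_ok = pre_guardrail({"name": "issue_payment", "arguments": {"amount": 50}}, deny_all)
print(read_ok, pay_ok, [entry["decision"] for entry in audit_log])
```

Logging the allowed calls as well as the blocked ones keeps the audit trail complete for every tool span, not only the ones that were intercepted.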
Evaluation runs at two levels. Selection: ToolSelectionAccuracy scores whether the agent chose the right tool given the input, which is the most common regression after a model swap or prompt change. It scores three signals (required-tool coverage, validity against available_tools, and call success rate) and returns a 0–1 score with tools_used and a hit/miss breakdown. Argument validity: FunctionCallAccuracy and the cloud-template EvaluateFunctionCalling validate that the arguments match the schema and the semantics of what the user wanted. Together they tell you whether failure is “wrong tool” or “right tool, wrong args.” TaskCompletion closes the loop on outcome, and TrajectoryScore provides the weighted composite that catches “got there the wrong way.”
Concretely: a coding-assistant agent on the OpenAI Agents SDK exposes read_file, run_tests, and git_commit as tools. After a model upgrade, TaskCompletion drops from 84% to 71%. The FutureAGI span dashboard shows git_commit calls jumped 3x; ToolSelectionAccuracy flags those as wrong-tool selections: the new model commits before running tests. The team adds one line to the system prompt to enforce ordering, ToolSelectionAccuracy recovers, and TaskCompletion climbs back to 86%. Without per-tool spans and a selection evaluator, this would have been a multi-day investigation. Unlike Arize Phoenix’s tool-call view (a list of spans without a selection score) or LangSmith’s per-tool error rate, the FutureAGI surface ties selection, argument validity, and outcome to the same trace, so a single dashboard answers “wrong tool? bad args? bad outcome?” without a manual join.
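The three selection signals named above can be computed by hand to see what each one measures. The sketch below is not FutureAGI's scoring formula: the equal-weight average, the function name, and the call dictionaries are assumptions chosen only to make the signals concrete.

```python
def selection_signals(calls: list[dict], required_tools: set[str],
                      available_tools: set[str]) -> dict:
    """Compute coverage, validity, and success rate for one trajectory."""
    used = [c["name"] for c in calls]
    # Share of required tools that were actually called.
    coverage = (len(required_tools & set(used)) / len(required_tools)
                if required_tools else 1.0)
    # Share of calls that named a tool present in the catalogue.
    validity = (sum(name in available_tools for name in used) / len(used)
                if used else 1.0)
    # Share of calls that returned without error.
    success = (sum(bool(c["success"]) for c in calls) / len(calls)
               if calls else 1.0)
    return {
        "coverage": coverage,
        "validity": validity,
        "success_rate": success,
        "score": (coverage + validity + success) / 3,  # assumed equal weights
        "tools_used": used,
    }


calls = [
    {"name": "search", "success": True},
    {"name": "git_push", "success": False},  # not in the catalogue
]
signals = selection_signals(
    calls,
    required_tools={"search", "run_tests"},
    available_tools={"search", "run_tests", "read_file"},
)
print(signals)
```

In this example each signal is 0.5: one of two required tools was used, one of two calls named a valid tool, and one of two calls succeeded.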
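The ordering failure in this example (committing before running tests) is easy to express as a deterministic check over the sequence of tool names in a trajectory. The sketch below is an illustration only; it assumes a stricter policy than the text states, namely that every commit needs a test run since the previous commit.

```python
def commits_without_tests(tool_names: list[str]) -> int:
    """Count git_commit calls with no run_tests call since the previous commit."""
    tests_ran = False
    violations = 0
    for name in tool_names:
        if name == "run_tests":
            tests_ran = True
        elif name == "git_commit":
            if not tests_ran:
                violations += 1
            tests_ran = False  # assumed policy: the next commit needs a fresh run
    return violations


good = ["read_file", "run_tests", "git_commit"]
bad = ["read_file", "git_commit", "run_tests", "git_commit", "git_commit"]
print(commits_without_tests(good), commits_without_tests(bad))
```

A check like this is cheap enough to run on every trajectory and complements a model-graded selection score with a rule that cannot drift.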
Where tool calling sits in the routing stack
Inside Agent Command Center, tool calls are first-class objects in the routing policy. A team can route by tool type (cheap LLM for retrieval-only trajectories, frontier model for tool-heavy trajectories), apply model fallback when a tool budget is exceeded, semantic-cache on tool-result-shaped sub-prompts, and traffic-mirror a candidate model against production to validate tool-calling parity before any rollout. The same gateway exposes per-tool quota, rate-limiting, and a kill-switch for destructive tools when an incident requires fast shutoff.
MCP-specific evaluation patterns
MCP changed the eval surface in a quiet but important way. An MCP-connected agent in 2026 can see hundreds of tools across servers, and the set is dynamic: servers come up, schemas drift, capabilities change. The FutureAGI pattern for MCP is to (a) snapshot the active tool catalogue per request as available_tools on the span, (b) run ToolSelectionAccuracy against that snapshot rather than a static list, and (c) alert on invalid_tool_rate (calls to tools not in the snapshot) as a leading indicator of catalogue drift or model hallucination. Pair this with a pre-guardrail for any tool tagged “destructive” inside Agent Command Center, and a post-guardrail for any tool result that triggers a PII or PromptInjection match.
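The snapshot-then-check pattern in (a) through (c) can be sketched as follows. The span dictionaries and field names (`kind`, `tool_name`, `available_tools`) are assumptions for the example rather than the traceAI span schema; the point is that each call is judged against the catalogue that was live for its own request.

```python
def invalid_tool_rate(spans: list[dict]) -> float:
    """Fraction of tool spans whose tool was absent from that request's snapshot."""
    tool_spans = [s for s in spans if s["kind"] == "tool"]
    if not tool_spans:
        return 0.0
    invalid = [s for s in tool_spans if s["tool_name"] not in s["available_tools"]]
    return len(invalid) / len(tool_spans)


catalogue_v1 = {"search_docs", "get_ticket"}
catalogue_v2 = {"search_docs", "fetch_ticket"}  # get_ticket was renamed upstream

spans = [
    {"kind": "tool", "tool_name": "search_docs", "available_tools": catalogue_v1},
    {"kind": "tool", "tool_name": "get_ticket", "available_tools": catalogue_v1},
    {"kind": "llm"},  # non-tool spans are ignored
    {"kind": "tool", "tool_name": "get_ticket", "available_tools": catalogue_v2},
    {"kind": "tool", "tool_name": "search_docs", "available_tools": catalogue_v2},
]
rate = invalid_tool_rate(spans)
print(rate)
```

The same `get_ticket` call is valid under the first snapshot and invalid under the second, which a static tool list would score incorrectly in one direction or the other.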
How to measure tool calling
Tool calling fails in distinct ways; measure each:
- ToolSelectionAccuracy: returns 0–1 for whether the right tool was picked at each step, with tools_used and per-call success.
- FunctionCallAccuracy: comprehensive accuracy on function-style tool calls (name + args + types).
- EvaluateFunctionCalling: cloud-template eval for end-to-end function-call quality.
- TaskCompletion: end-to-end check; bad tool calls usually surface as TaskCompletion regressions.
- TrajectoryScore: composite that catches “wrong way to a right answer.”
- Per-tool error rate (dashboard signal): % of calls per tool name that returned an error. Slice by tool, model, prompt version, and tenant.
- Invalid-tool rate (dashboard signal): count of calls to tools not in available_tools; non-zero is either model hallucination or MCP catalogue drift.
- agent.trajectory.step (OTel attribute): paired with span kind = tool, gives you the per-tool slice and the per-step view.
- Parallel-call success: when the model emits multiple tool calls in one step, do all succeed? BFCL v3 measures this explicitly; track it for any agent using GPT-5.x or Claude Opus 4.7 parallel tools.
- Latency p99 per tool: slow tools become retry sources and inflate trajectory length; alert on p99, not just average.
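The two dashboard signals in the list above, per-tool error rate and per-tool p99 latency, can be derived directly from span records. The sketch below assumes a simple span dictionary (`tool`, `duration_ms`, `success`) and uses the nearest-rank percentile definition; a real dashboard may interpolate differently.

```python
import math
from collections import defaultdict


def per_tool_stats(spans: list[dict]) -> dict:
    """Aggregate error rate, mean, and nearest-rank p99 latency per tool name."""
    durations = defaultdict(list)
    errors = defaultdict(int)
    for span in spans:
        durations[span["tool"]].append(span["duration_ms"])
        if not span["success"]:
            errors[span["tool"]] += 1
    stats = {}
    for tool, values in durations.items():
        ordered = sorted(values)
        rank = math.ceil(0.99 * len(ordered))  # nearest-rank percentile
        stats[tool] = {
            "calls": len(ordered),
            "error_rate": errors[tool] / len(ordered),
            "mean_ms": sum(ordered) / len(ordered),
            "p99_ms": ordered[rank - 1],
        }
    return stats


spans = [
    {"tool": "search", "duration_ms": 80, "success": True},
    {"tool": "search", "duration_ms": 90, "success": True},
    {"tool": "search", "duration_ms": 100, "success": True},
    {"tool": "search", "duration_ms": 120, "success": True},
    {"tool": "search", "duration_ms": 2000, "success": False},
    {"tool": "db_query", "duration_ms": 40, "success": True},
    {"tool": "db_query", "duration_ms": 60, "success": True},
]
stats = per_tool_stats(spans)
print(stats["search"])
```

For the `search` tool the mean is 478 ms while the p99 is 2000 ms, which is why alerting on the average alone would understate the slow call that drives retries.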
Minimal Python:
from fi.evals import ToolSelectionAccuracy, FunctionCallAccuracy, TaskCompletion

# user_query, spans, catalogue, expected, result, and criteria are placeholders
# for your own request, captured tool spans, tool catalogue, and golden data.
selection = ToolSelectionAccuracy().evaluate(
    input=user_query,
    trajectory=spans,
    available_tools=catalogue,
)
args = FunctionCallAccuracy().evaluate(
    trajectory=spans,
    expected_calls=expected,
)
outcome = TaskCompletion().evaluate(
    trajectory=spans,
    final_result=result,
    task={"success_criteria": criteria},
)
print(selection.score, args.score, outcome.score)
For MCP-connected agents, wire the same evaluator chain to live traceAI spans so tool calls are scored as they happen, and snapshot the catalogue per request to catch hallucinated tool names (the dominant regression mode after a model swap, per our 2026 customer data):
from fi.traceAI.mcp import MCPInstrumentor
from fi.evals import ToolSelectionAccuracy, FunctionCallAccuracy, TaskCompletion

MCPInstrumentor().instrument()

# Thresholds below are example values; tune them per agent.
eval_chain = [
    ToolSelectionAccuracy(threshold=0.85, alert_on_invalid_tool=True),
    FunctionCallAccuracy(threshold=0.90),
    TaskCompletion(threshold=0.70),
]

# my_agent is a placeholder for your own agent object.
@MCPInstrumentor.online(evaluators=eval_chain, sample_rate=0.05)
def agent_step(query, available_tools):
    return my_agent.invoke(query, tools=available_tools)
That single decorator drives BFCL v3-shaped scoring (selection, parallel, missing-tool, irrelevance) against your production traffic at 5% sample rate, posts the scores back to each span, and fires an alert when invalid_tool_rate crosses zero. Healthy tool calling: every span has the four required tags (tool.name, tool.args, tool.success, tool.duration), invalid-tool rate is zero, per-tool error rate has alerting thresholds, parallel-call success holds across model releases, and TaskCompletion is wired to the per-tool diagnostic dashboard.
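The "four required tags" condition is simple to check mechanically. The sketch below treats a span's attributes as a plain dictionary, which is an assumption for illustration rather than the traceAI data model, and reports which of the tags named above are missing.

```python
REQUIRED_TAGS = ("tool.name", "tool.args", "tool.success", "tool.duration")


def missing_tags(span_attributes: dict) -> list[str]:
    """Return the required tool tags that a span does not carry."""
    return [tag for tag in REQUIRED_TAGS if tag not in span_attributes]


complete = {
    "tool.name": "search",
    "tool.args": '{"query": "refund policy"}',
    "tool.success": True,
    "tool.duration": 0.12,
}
partial = {"tool.name": "search", "tool.success": False}

print(missing_tags(complete), missing_tags(partial))
```

Running a check like this in CI against a sample of captured spans catches instrumentation gaps before they show up as holes in the per-tool dashboard.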
Tool-calling regression cohorts to track
Set up at least these regression cohorts in your tool-call evaluation:
- Single-tool, single-call: the baseline; failures here mean the model is struggling with basics.
- Parallel-tool: N tool calls in one step; failures here mean the model lost parallelism.
- Multi-step chained: tool A → tool B → tool C; failures here mean the model lost argument provenance.
- Missing-tool: a request for which no available tool fits; the model should refuse or ask, not hallucinate. This is the most common silent failure.
- Irrelevant-tool: a tool exists that looks plausible but is wrong; the model should not pick it.
- Destructive-tool: bookings, payments, deletions; should always trigger the guardrail confirmation path.
- MCP-catalogue-drift: a synthesized scenario where one expected tool is renamed; the model should detect or escalate.
Run them on every model swap, prompt change, and MCP server update.
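One way to turn these cohorts into a release gate is to compute a pass rate per cohort and compare it with a per-cohort threshold. The sketch below is an assumption-laden illustration: the result records, cohort labels, and threshold values are hypothetical, and how a single case is judged as passed is left to whichever evaluator you use.

```python
from collections import defaultdict


def cohort_pass_rates(results: list[dict]) -> dict[str, float]:
    """Pass rate per cohort from a list of per-case results."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for r in results:
        totals[r["cohort"]] += 1
        passed[r["cohort"]] += int(r["passed"])
    return {cohort: passed[cohort] / totals[cohort] for cohort in totals}


def failing_cohorts(rates: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Cohorts whose pass rate is below threshold (unlisted cohorts must be perfect)."""
    return sorted(c for c, rate in rates.items() if rate < thresholds.get(c, 1.0))


results = [
    {"cohort": "single-tool", "passed": True},
    {"cohort": "single-tool", "passed": True},
    {"cohort": "missing-tool", "passed": True},
    {"cohort": "missing-tool", "passed": False},
    {"cohort": "destructive-tool", "passed": True},
]
# Example thresholds only; destructive-tool is unlisted and so must pass every case.
thresholds = {"single-tool": 0.95, "missing-tool": 0.90}

rates = cohort_pass_rates(results)
blocked_by = failing_cohorts(rates, thresholds)
print(rates, blocked_by)
```

Reporting per cohort rather than as one aggregate keeps a regression in the missing-tool cases from being averaged away by a healthy baseline cohort.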
Common mistakes
- Conflating tool calling with function calling. Function calling is the OpenAI-API mechanism; tool calling is broader and covers retrieval, code execution, sub-agents, MCP servers, computer-use, A2A. Plan evaluation for the full surface.
- No per-tool failure dashboard. A single overall tool-fail rate hides which one tool is broken; always slice by tool name.
- Skipping argument validation. ToolSelectionAccuracy alone misses cases where the agent picked the right tool with broken args; pair with FunctionCallAccuracy.
- Letting agents call destructive tools without confirmation. Bookings, payments, deletions need a human-in-the-loop gate or a guardrail; tool calls with side effects are not the same as read-only ones. The 2025 OWASP LLM Top 10 lists excessive agency as the canonical risk.
- Tool description rot. When a tool’s underlying behavior changes but its description doesn’t, the agent calls it on the wrong inputs; version your tool specs and audit them on every release.
- Ignoring prompt injection via tool outputs. A tool result that contains adversarial instructions can hijack the next step of the trajectory; run a post-guardrail PromptInjection check on tool outputs before the agent reads them.
- Treating MCP servers as static. MCP catalogues drift; snapshot available_tools per request and gate releases on invalid-tool rate.
- Self-judging tool calls. When LLM-judge mode is used for fuzzy tool-output validation, pin the judge to a different model family; same-family judging inflates the score, especially on tool-format conventions.
- No release gate on per-tool regression eval. Tools change far more often than models; every tool registry change should run the agent against the golden trajectory set before deploy.
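For the description-rot and versioning points in the list above, a small sketch shows one way to version tool specs and flag specs that need an audit. Everything here is an assumption for illustration: the spec fields (including `impl_version`, which stands for whatever identifier your team bumps when a tool's behavior changes) and the audit rule itself.

```python
import hashlib
import json


def spec_fingerprint(spec: dict) -> str:
    """Stable hash of a tool spec, independent of key order."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_description_audit(previous: dict, current: dict) -> bool:
    """Flag a tool whose behavior version changed while its description did not."""
    behavior_changed = previous["impl_version"] != current["impl_version"]
    description_changed = previous["description"] != current["description"]
    return behavior_changed and not description_changed


old_spec = {
    "name": "search_orders",
    "description": "Search orders by customer ID.",
    "impl_version": "1.4.0",
}
new_spec = {
    "impl_version": "2.0.0",  # behavior changed in this release
    "name": "search_orders",
    "description": "Search orders by customer ID.",
}

registry_changed = spec_fingerprint(old_spec) != spec_fingerprint(new_spec)
audit = needs_description_audit(old_spec, new_spec)
print(registry_changed, audit)
```

A changed fingerprint is a natural trigger for re-running the golden trajectory set, and the audit flag points a reviewer at descriptions that may no longer match what the tool does.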
Frequently Asked Questions
What is tool calling?
Tool calling is the capability that lets an LLM agent invoke external functions, APIs, retrievers, or sub-agents during a run by emitting a structured request the runtime executes.
How is tool calling different from function calling?
Function calling is the specific OpenAI-style mechanism: strict JSON schema, single function-call API. Tool calling is the broader concept covering any external action: search, code execution, RAG, sub-agents, MCP servers. Function calling is one implementation of tool calling.
How do you measure tool calling quality?
FutureAGI's ToolSelectionAccuracy scores whether the agent picked the right tool; FunctionCallAccuracy validates argument structure; both run on traceAI spans for every tool invocation, paired with TaskCompletion for end-to-end outcome.