How is tool calling different from function calling?

Function calling is the specific OpenAI-style mechanism — strict JSON schema, single function-call API. Tool calling is the broader concept covering any external action: search, code execution, RAG, sub-agents. Function calling is one implementation of tool calling.

How do you measure tool calling quality?

FutureAGI's ToolSelectionAccuracy scores whether the agent picked the right tool; FunctionCallAccuracy validates argument structure; both run on traceAI spans for every tool invocation.

Tool Calling: Definition, Examples & FutureAGI Guide (2026)

Q: What is tool calling?

Tool calling is the capability that lets an LLM agent invoke external functions, APIs, retrievers, or sub-agents during a run by emitting a structured request the runtime executes.

What Is Tool Calling?

Tool calling is the capability that lets an LLM-driven agent invoke external functions, APIs, retrievers, code interpreters, or sub-agents during a run. The agent’s model is given a registry of tools, each with a name, description, and argument schema, and emits a structured request when it decides to act. The runtime executes the request, captures the result, and feeds it back as the next observation. Tool calling is the umbrella concept; function calling is the specific OpenAI-style implementation. In a FutureAGI trace, each tool invocation is a span with the tool name, arguments, latency, success flag, and result.
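The registry-and-runtime loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not any framework's API: the Tool dataclass, the get_weather tool, and the JSON request shape are all hypothetical, and the request string stands in for what a model would emit.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Tool:
    name: str
    description: str
    schema: Dict[str, type]  # argument name -> expected Python type
    fn: Callable[..., Any]


def get_weather(city: str) -> str:
    # Hypothetical tool body; a real one would call an external API.
    return f"Sunny in {city}"


# The registry the model is shown: name, description, argument schema.
REGISTRY = {
    "get_weather": Tool(
        name="get_weather",
        description="Return the current weather for a city.",
        schema={"city": str},
        fn=get_weather,
    )
}


def execute_tool_call(raw_request: str) -> Dict[str, Any]:
    """Parse the structured request, run the tool, and return the observation."""
    request = json.loads(raw_request)
    tool = REGISTRY[request["name"]]
    result = tool.fn(**request["arguments"])
    return {"tool": tool.name, "result": result}


# This string stands in for the structured request a model would emit.
observation = execute_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
)
print(observation)
```

In a real agent loop the observation would be appended to the conversation and the model would decide whether to call another tool or answer.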

Why tool calling matters in production LLM and agent systems

Tool calling is the second most common agent failure surface after planning. The model can pick the wrong tool — search when it should call the database. It can pick the right tool with wrong arguments — query="customer" instead of query="customer ID 12345". It can pick the right tool, get a transient failure, and then loop or give up. It can pick a tool that doesn’t exist because the registry changed and the prompt didn’t.
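Three of these failure modes (a tool that is not in the registry, arguments that do not fit the schema, and a transient failure that triggers a loop) can be caught before or around execution. The sketch below is one possible guard, assuming a hypothetical lookup_customer tool, a simple name-to-type schema, and ConnectionError as the transient failure class.

```python
from typing import Any, Callable, Dict, Tuple


def lookup_customer(customer_id: int) -> str:
    # Hypothetical tool used only for illustration.
    return f"customer-{customer_id}"


# Hypothetical registry: tool name -> (argument schema, callable).
TOOLS: Dict[str, Tuple[Dict[str, type], Callable[..., Any]]] = {
    "lookup_customer": ({"customer_id": int}, lookup_customer),
}


def check_call(name: str, args: Dict[str, Any]) -> str:
    """Classify a proposed call before executing it."""
    if name not in TOOLS:
        return "unknown_tool"
    schema, _ = TOOLS[name]
    if set(args) != set(schema):
        return "bad_arguments"
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            return "bad_arguments"
    return "ok"


def call_with_retry(
    fn: Callable[..., Any], args: Dict[str, Any], max_attempts: int = 3
) -> Any:
    """Retry transient failures a bounded number of times instead of looping."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return fn(**args)
        except ConnectionError as exc:  # assumed transient failure class
            last_error = exc
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Note that a schema check like this only catches structural problems; a well-typed but semantically wrong argument, such as the query="customer" case above, still needs an evaluator.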

Different roles see different failure shapes. A backend engineer sees 4xx error rates spike on a downstream API because the agent passes wrong types. An SRE sees latency double when the agent retries a slow tool that should be cached. A product reviewer sees an agent confidently report success on a tool call that returned an error string the model misread as data. A compliance reviewer worries about agents calling tools that take real-world actions (bookings, payments, deletions) without human approval.

In 2026, the tool-calling surface is multiplying. The OpenAI Agents SDK ships first-class tool definitions; LangGraph nodes call tools via decorators; the Model Context Protocol (MCP) standardises tool discovery across servers; A2A treats other agents as tools. Every one of these emits the same structural shape — a tool span with name, args, result — which is what makes universal evaluation possible.

How FutureAGI handles tool calling

FutureAGI’s approach is to instrument every tool call as an OpenTelemetry span and evaluate selection plus arguments separately. The traceAI integrations for openai-agents, langchain, mcp, crewai, pydantic-ai, and llamaindex wrap the tool-execution path so each call lands as a span tagged with agent.trajectory.step, the tool name, the argument JSON, the result, and the duration. That gives engineers a per-tool dashboard across frameworks.
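To make the span shape concrete, here is a standard-library sketch of a wrapper that records one span-like dict per tool call. It is not the traceAI or OpenTelemetry API; a real setup would emit proper OTel spans through an SDK and exporter. Only agent.trajectory.step comes from the text above; the other attribute names and the read_file stub are assumptions.

```python
import json
import time
from typing import Any, Callable, Dict, List

SPANS: List[Dict[str, Any]] = []  # stand-in for a span exporter


def traced_tool_call(
    step: int, name: str, fn: Callable[..., Any], args: Dict[str, Any]
) -> Any:
    """Run a tool and record one span-like dict for the invocation."""
    start = time.perf_counter()
    span: Dict[str, Any] = {
        "agent.trajectory.step": step,  # attribute name taken from the text
        "tool.name": name,              # remaining keys are illustrative
        "tool.arguments": json.dumps(args, sort_keys=True),
    }
    try:
        result = fn(**args)
        span.update({"tool.success": True, "tool.result": repr(result)})
        return result
    except Exception as exc:
        span.update({"tool.success": False, "tool.result": repr(exc)})
        raise
    finally:
        # Recorded on both the success and the failure path.
        span["tool.duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(span)


def read_file(path: str) -> str:
    # Hypothetical tool; returns canned content instead of touching disk.
    return f"contents of {path}"


traced_tool_call(1, "read_file", read_file, {"path": "README.md"})
```

Recording the span in a finally block means failed calls are captured too, which is what a per-tool error view depends on.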

Evaluation runs at two levels. Selection: ToolSelectionAccuracy scores whether the agent chose the right tool given the input — the most common regression after a model swap or prompt change. Argument validity: FunctionCallAccuracy and the cloud-template EvaluateFunctionCalling validate that the arguments match the schema and the semantics of what the user wanted. Together they tell you whether failure is “wrong tool” or “right tool, wrong args.”
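The value of scoring the two levels separately shows up in a toy example. The function below is not how ToolSelectionAccuracy or FunctionCallAccuracy are implemented; it is a simplified stand-in that assumes step-aligned expected and actual calls and uses exact matching, with hypothetical argument names.

```python
from typing import Any, Dict, List


def score_calls(
    expected: List[Dict[str, Any]], actual: List[Dict[str, Any]]
) -> Dict[str, float]:
    """Compare step-aligned expected and actual tool calls.

    selection: fraction of steps where the tool name matches.
    arguments: among right-tool steps, fraction whose arguments match exactly.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must be aligned step by step")
    right_tool = [e["name"] == a["name"] for e, a in zip(expected, actual)]
    right_args = [
        e["arguments"] == a["arguments"]
        for e, a, ok in zip(expected, actual, right_tool)
        if ok
    ]
    return {
        "selection": sum(right_tool) / len(right_tool) if right_tool else 0.0,
        "arguments": sum(right_args) / len(right_args) if right_args else 0.0,
    }


expected = [
    {"name": "read_file", "arguments": {"path": "app.py"}},
    {"name": "run_tests", "arguments": {"target": "tests/"}},
    {"name": "git_commit", "arguments": {"message": "fix bug"}},
]
actual = [
    {"name": "read_file", "arguments": {"path": "app.py"}},
    {"name": "git_commit", "arguments": {"message": "fix bug"}},  # wrong tool
    {"name": "git_commit", "arguments": {"message": "wip"}},      # wrong args
]
scores = score_calls(expected, actual)
print(scores)
```

Here the second step lowers only the selection score and the third lowers only the argument score, so the two numbers point at different fixes.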

Concretely: a coding-assistant agent on the OpenAI Agents SDK exposes read_file, run_tests, and git_commit as tools. After a model upgrade, TaskCompletion drops from 84% to 71%. The FutureAGI span dashboard shows git_commit calls jumped 3x; ToolSelectionAccuracy flags those as wrong-tool selections — the new model commits before running tests. The team adds one line to the system prompt to enforce ordering, ToolSelectionAccuracy recovers, and TaskCompletion climbs back to 86%. Without per-tool spans and a selection evaluator, this would have been a multi-day investigation.
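The ordering regression in this example is also easy to check deterministically once tool names are available per step. The sketch below assumes the trajectory has already been reduced to an ordered list of tool names; the function name and the rule that every commit needs a fresh test run are illustrative choices, not part of FutureAGI.

```python
from typing import List


def commits_without_tests(tool_names: List[str]) -> List[int]:
    """Return indices of git_commit calls with no run_tests since the last commit."""
    flagged = []
    tested = False
    for index, name in enumerate(tool_names):
        if name == "run_tests":
            tested = True
        elif name == "git_commit":
            if not tested:
                flagged.append(index)
            tested = False  # the next commit needs a fresh test run
    return flagged


good = ["read_file", "run_tests", "git_commit"]
bad = ["read_file", "git_commit", "run_tests", "git_commit", "git_commit"]
print(commits_without_tests(good), commits_without_tests(bad))
```

A rule like this complements a selection evaluator: it is cheap to run on every trace, but it only covers orderings someone has thought to encode.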

How to measure or detect tool calling

Tool calling fails in two distinct ways — measure each:

ToolSelectionAccuracy: returns 0–1 for whether the right tool was picked at each step.
FunctionCallAccuracy: comprehensive accuracy on function-style tool calls (name + args + types).
EvaluateFunctionCalling: cloud-template eval for end-to-end function-call quality.
TaskCompletion: end-to-end check; bad tool calls usually surface as TaskCompletion regressions.
per-tool error rate (dashboard signal): % of calls per tool name that returned an error.
agent.trajectory.step (OTel attribute): paired with span kind = tool, gives you the per-tool slice.
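The per-tool error rate in the list above is a simple aggregation once spans are exported. A minimal sketch, assuming each span is a dict with hypothetical kind, tool, and success keys (real span attribute names will differ by integration):

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def per_tool_error_rate(spans: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Return the fraction of failed calls for each tool name."""
    totals: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for span in spans:
        if span.get("kind") != "tool":
            continue  # ignore LLM and other non-tool spans
        totals[span["tool"]] += 1
        if not span["success"]:
            errors[span["tool"]] += 1
    return {tool: errors[tool] / totals[tool] for tool in totals}


spans = [
    {"kind": "tool", "tool": "search", "success": True},
    {"kind": "tool", "tool": "search", "success": True},
    {"kind": "tool", "tool": "db_query", "success": False},
    {"kind": "tool", "tool": "db_query", "success": True},
    {"kind": "llm", "tool": None, "success": True},
]
rates = per_tool_error_rate(spans)
print(rates)
```

Slicing this way shows that db_query fails half the time while search never does, which a single overall rate of 25% would hide.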

Minimal Python:

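from fi.evals import ToolSelectionAccuracy, FunctionCallAccuracy

# Placeholders: replace with the user's request and the tool spans traced for that run.
user_query = "Fix the failing test in app.py"
spans = []

# Score whether the agent picked the right tool at each step.
selection = ToolSelectionAccuracy().evaluate(
    input=user_query,
    trajectory=spans,
)
print(selection.score, selection.reason)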

Common mistakes

Conflating tool calling with function calling. Function calling is the OpenAI-API mechanism; tool calling is broader and covers retrieval, code execution, sub-agents, MCP servers. Plan for both.
No per-tool failure dashboard. A single overall tool-failure rate hides which tool is broken; always slice by tool name.
Skipping argument validation. ToolSelectionAccuracy alone misses cases where the agent picked the right tool with broken args; pair with FunctionCallAccuracy.
Letting agents call destructive tools without confirmation. Bookings, payments, deletions need a human-in-the-loop gate or a guardrail; tool calls with side effects are not the same as read-only ones.
Tool description rot. When a tool’s underlying behavior changes but its description doesn’t, the agent calls it on the wrong inputs — version your tool specs.
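For the destructive-tools mistake above, one possible shape for a confirmation gate is sketched below. Everything here is hypothetical: the DESTRUCTIVE_TOOLS set, the tool stubs, and the approve callback, which in practice might open a review ticket or prompt an operator.

```python
from typing import Any, Callable, Dict

# Hypothetical classification: tools listed here have side effects.
DESTRUCTIVE_TOOLS = {"delete_record", "charge_card"}


class ApprovalRequired(Exception):
    """Raised when a side-effecting tool is called without approval."""


def gated_call(
    name: str,
    fn: Callable[..., Any],
    args: Dict[str, Any],
    approve: Callable[[str, Dict[str, Any]], bool],
) -> Any:
    """Run read-only tools directly; require approval for destructive ones."""
    if name in DESTRUCTIVE_TOOLS and not approve(name, args):
        raise ApprovalRequired(f"{name} needs human approval")
    return fn(**args)


def get_record(record_id: int) -> str:
    # Read-only stub.
    return f"record {record_id}"


def delete_record(record_id: int) -> str:
    # Side-effecting stub; a real tool would delete data.
    return f"deleted {record_id}"


def deny_all(name: str, args: Dict[str, Any]) -> bool:
    # Stand-in approver that rejects every request.
    return False


print(gated_call("get_record", get_record, {"record_id": 7}, deny_all))
```

Keeping the destructive list outside the prompt means the gate still holds when the model ignores an instruction.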