What Is Function Calling in LLMs?
The capability where a model emits a structured JSON call to a registered external function instead of free text, enabling tool use, agents, and MCP servers.
Function calling in an LLM is the capability where the model produces a structured call to a registered external function, with a name and JSON-shaped arguments, instead of only returning free text. The application supplies a JSON Schema; the model decides when to call it and what to pass; the runtime executes the function and returns the result so the model can continue. It is the primitive behind agents, MCP servers, and retrieval-augmented chatbots. In FutureAGI traces, each call appears as a tool span with arguments and results.
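To make the loop concrete, here is a minimal sketch against the OpenAI Chat Completions API. The refund_order schema and the issue_refund stub are illustrative, not part of FutureAGI; the loop also shows why parallel calls must be handled as separate entries.

import json
from openai import OpenAI

client = OpenAI()

# Illustrative tool schema the application registers with the model.
tools = [{
    "type": "function",
    "function": {
        "name": "refund_order",
        "description": "Refund a customer order by ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def issue_refund(order_id):
    # Stub executor standing in for the real refund service.
    return {"status": "refunded", "order_id": order_id}

messages = [{"role": "user", "content": "Refund order 12345"}]
response = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=tools
)

msg = response.choices[0].message
if msg.tool_calls:  # the model chose to call a tool instead of answering
    messages.append(msg)
    for call in msg.tool_calls:  # parallel calls arrive as separate entries
        args = json.loads(call.function.arguments)  # JSON-shaped arguments
        result = issue_refund(**args)
        # Return the result so the model can continue from it.
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(result)})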
Why It Matters in Production LLM and Agent Systems
Function calling is one of the most consequential capabilities in the 2026 stack and one of the leakiest. Every step depends on the model picking the right function, passing well-typed arguments, and not hallucinating a function that doesn’t exist. The failure modes compound: a wrong tool selection wastes a step and corrupts memory; a malformed argument crashes the tool and forces a retry; a hallucinated function silently no-ops; an over-eager retry spirals into runaway cost.
The pain is felt unevenly. A backend engineer pages on a 2 a.m. spike where a single agent loop spent $40 retrying a hallucinated tool name. A platform engineer finds half the JSON outputs missing a required field after a model upgrade. A product owner watches the agent take twelve steps to do what should take three because the planner picked tools in the wrong order. End users see a “thinking…” spinner that never resolves.
In 2026, with MCP-served tools, multi-modal function calls (image and audio inputs to functions), and parallel function calling exposed in the OpenAI, Anthropic, and Gemini APIs, the function-calling failure surface is wider than ever. Step-level evaluation — not just final-answer evaluation — is mandatory. A single end-to-end success score will not tell you that step three picked the wrong tool 8% of the time.
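A toy illustration of the difference: grade every tool span in a trajectory, not just the outcome. The span dicts below are illustrative, not a traceAI schema.

trajectory = [
    {"step": 1, "tool": "lookup_order", "expected": "lookup_order"},
    {"step": 2, "tool": "refund_order", "expected": "check_eligibility"},
    {"step": 3, "tool": "refund_order", "expected": "refund_order"},
]

# The run can still end in the right place while step 2 picked the
# wrong tool; per-step grading surfaces what end-to-end grading hides.
failures = [s for s in trajectory if s["tool"] != s["expected"]]
print(f"step-level selection failure rate: {len(failures) / len(trajectory):.0%}")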
How FutureAGI Handles Function Calling
FutureAGI’s approach is to treat function calling as a three-layer evaluation surface tied to traces:
- Argument fidelity: FunctionCallAccuracy runs comprehensive checks on every call (name match, parameter completeness, type compliance, value validity). FunctionCallExactMatch does a strict AST-level comparison for high-stakes flows. FunctionNameMatch and ParameterValidation decompose the failure when something is wrong.
- Tool selection: ToolSelectionAccuracy returns whether the model chose the right tool given the input and prior trajectory state.
- Schema compliance: JSONValidation validates the emitted arguments against the registered JSON Schema; IsJson and JSONSyntaxOnly catch lower-level malformedness.
All three layers run as fi.evals evaluators on offline datasets and live traces.
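For a quick local spot check of the schema-compliance layer, the open-source jsonschema package validates emitted arguments against the same registered schema. This is a generic sketch, not a FutureAGI API; the schema is the illustrative one from above.

import json
from jsonschema import ValidationError, validate

# Illustrative registered schema for a refund_order tool.
schema = {
    "type": "object",
    "properties": {"order_id": {"type": "string"}},
    "required": ["order_id"],
}

emitted = '{"order_id": 12345}'  # arguments the model emitted (wrong type here)
try:
    validate(instance=json.loads(emitted), schema=schema)
except (ValidationError, json.JSONDecodeError) as e:
    print("schema violation:", e)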
A real workflow: an agent team running on the OpenAI Agents SDK instruments calls with traceAI-openai-agents. Every tool span carries agent.trajectory.step, the function name, arguments, and result. ToolSelectionAccuracy and FunctionCallAccuracy run on a sampled cohort of production traces; eval-fail-rate-by-cohort dashboards slice by route, model, and tool. After a model swap from gpt-4o to gpt-4o-mini, FutureAGI surfaces a 12% drop in ToolSelectionAccuracy localized to one tool category. The team rolls back that route, keeps the cheaper model on the others, and adds the failing cases to a regression dataset. Unlike public benchmarks such as the Berkeley Function Calling Leaderboard, this is your tool registry, your prompts, your traffic — function-calling quality measured where it actually breaks.
How to Measure or Detect It
Pick at least one evaluator per failure axis (name, args, schema, selection):
- FunctionCallAccuracy — comprehensive call evaluation; the canonical signal for argument correctness.
- FunctionCallExactMatch — strict AST-level match against a reference call; for high-stakes flows.
- ToolSelectionAccuracy — did the model pick the right tool given the prior state?
- JSONValidation — does the emitted argument JSON validate against the registered schema?
- ParameterValidation — schema-aware check on parameter values.
- Tool-call retry rate (dashboard signal) — high retry rates correlate with malformed JSON or wrong-tool selection.
from fi.evals import FunctionCallAccuracy

# Grade a single tool call against a reference call: name match,
# parameter completeness, type compliance, value validity.
result = FunctionCallAccuracy().evaluate(
    input="Refund order 12345",
    output={"name": "refund_order", "args": {"order_id": "12345"}},
    expected_response={"name": "refund_order", "args": {"order_id": "12345"}},
)
print(result.score, result.reason)
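The other axes follow the same pattern. Assuming ToolSelectionAccuracy exposes the same evaluate interface as the snippet above (an assumption; check your fi.evals version), a selection check looks like:

from fi.evals import ToolSelectionAccuracy

# Did the model pick the right tool for this input and state?
# The evaluate signature mirrors the example above and is an
# assumption, not confirmed API.
selection = ToolSelectionAccuracy().evaluate(
    input="Refund order 12345",
    output={"name": "refund_order", "args": {"order_id": "12345"}},
    expected_response={"name": "refund_order", "args": {"order_id": "12345"}},
)
print(selection.score, selection.reason)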
Common Mistakes
- Trusting that “it returned valid JSON” means the call was right. Schema validity does not imply tool correctness; grade selection and arguments separately.
- Skipping evaluator coverage on rarely-called tools. Long-tail tools have the worst selection accuracy and the highest blast radius when picked wrong.
- Handling parallel function calls as one call. They are independent failures; trace and grade each.
- Using the model’s reasoning text as the only audit trail. The reasoning may explain a call that never happened; the trace is the ground truth.
- Letting the agent retry without bound. A bounded retry policy is the difference between a 200 ms blip and a $40 incident; see the sketch after this list.
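A minimal sketch of a bounded retry policy, where the registry dict and request_tool_call callable are hypothetical stand-ins for your tool registry and model client:

MAX_RETRIES = 2

def run_tool_call(registry, request_tool_call):
    # Retry at most MAX_RETRIES times, then fail loudly instead of
    # letting a hallucinated tool name spiral into runaway cost.
    for attempt in range(MAX_RETRIES + 1):
        call = request_tool_call()       # hypothetical: asks the model for a call
        fn = registry.get(call["name"])
        if fn is None:
            continue                     # hallucinated tool name; bounded retry
        try:
            return fn(**call["args"])
        except (TypeError, ValueError):
            continue                     # malformed arguments; bounded retry
    raise RuntimeError(f"tool call failed after {MAX_RETRIES + 1} attempts")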
Frequently Asked Questions
What is function calling in an LLM?
Function calling is a capability where an LLM emits a structured JSON object naming a registered tool and its arguments instead of free text. The runtime executes the function and feeds the result back so the model can continue.
How is function calling different from tool calling?
Function calling is the underlying model capability — emit valid JSON for a registered schema. Tool calling is the application-level pattern that uses function calling to invoke real tools, often inside an agent loop with multiple turns.
How do you measure function-calling quality?
FutureAGI grades it with FunctionCallAccuracy for argument correctness, ToolSelectionAccuracy for picking the right tool, and JSONValidation for schema compliance, all wired through traceAI spans.