What Is Function Calling?
The OpenAI-style mechanism by which an LLM emits a strict JSON object naming a function and its schema-validated arguments for runtime execution.
Function calling is the OpenAI-style mechanism that lets a language model emit a strict JSON object naming a function and its arguments for runtime execution. The model receives each function’s JSON Schema in the request, and the response is constrained to a matching JSON payload. OpenAI’s tools parameter popularized the pattern; Anthropic, Google, Meta, Mistral, and every serious model API now ship equivalents. Function calling is one implementation of broader tool-calling, and in 2026 it sits next to MCP and A2A as one of three primary ways an agent reaches the outside world. In a FutureAGI trace, each call appears as a span with the function name, validated arguments, and returned result.
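To make the request/response shape concrete, here is a minimal sketch of the runtime side. The `get_order` tool, its schema, and the `dispatch` helper are hypothetical names invented for illustration, and the `model_call` dictionary is an assumed shape for a call after the provider SDK has unwrapped it; real SDK response objects differ by vendor.

```python
import json

# Hypothetical tool definition in the OpenAI-style "tools" shape.
get_order_tool = {
    "type": "function",
    "function": {
        "name": "get_order",
        "description": "Look up one order by its numeric ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "integer"}},
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}

# Hypothetical runtime-side implementation; the model only ever sees the schema.
def get_order(order_id: int) -> dict:
    return {"order_id": order_id, "status": "shipped"}

REGISTRY = {"get_order": get_order}

def dispatch(tool_call: dict) -> dict:
    """Execute one model-emitted call: look up the function, parse arguments, run it."""
    name = tool_call["name"]
    if name not in REGISTRY:
        raise KeyError(f"model named an unknown function: {name}")
    arguments = json.loads(tool_call["arguments"])  # arguments arrive as a JSON string
    return REGISTRY[name](**arguments)

# Assumed shape of the model's emitted call (illustrative only).
model_call = {"name": "get_order", "arguments": '{"order_id": 42}'}
result = dispatch(model_call)
```

The model never runs `get_order` itself; it only names the function and supplies arguments, and the runtime decides whether and how to execute.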
Why function calling matters in production LLM and agent systems
Function calling is the boundary between LLM creativity and deterministic execution. The model decides what to do; the function decides how. When that boundary is clean, agents are reliable. When it’s not, you get the canonical 2026 production bugs: argument names that almost match but don’t, types that almost match but don’t ("42" instead of 42), nested fields that go missing under load, and enum values the model invented that the function rejects. The new failure mode of 2026 is silent truncation of long argument lists when the model hits its tool-call token budget mid-payload.
Different roles see different surfaces. A backend engineer sees 422 errors on a downstream service because the LLM produced customer_id as a string when the API expects an int. An SRE watches retry storms after the model started emitting an extra unexpected field that the schema rejects. A QA engineer sees agent demos pass on Monday and fail on Friday because the model’s argument distribution shifted across deploys. A product reviewer signs off on a payment-handling function only to find the agent occasionally skips the confirmation_required field.
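Several of these failures (a string where an int is expected, an invented enum value, an unexpected extra field) can be caught before the downstream service returns a 422. The sketch below is a deliberately small hand-rolled checker covering only a subset of JSON Schema; `check_arguments` and `refund_schema` are hypothetical names, and a production system would more likely use a full JSON Schema validator.

```python
def check_arguments(args: dict, schema: dict) -> list[str]:
    """Return human-readable problems; an empty list means the call passed."""
    problems = []
    props = schema.get("properties", {})
    py_types = {"integer": int, "number": (int, float), "string": str, "boolean": bool}

    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required field: {name}")

    for name, value in args.items():
        if name not in props:
            problems.append(f"unexpected field: {name}")
            continue
        spec = props[name]
        expected = py_types.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly for numeric types.
        is_numeric = spec.get("type") in ("integer", "number")
        if expected and (not isinstance(value, expected) or (is_numeric and isinstance(value, bool))):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"invalid enum value for {name}: {value!r}")
    return problems

# Hypothetical schema and a call exhibiting three of the bugs described above.
refund_schema = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "integer"},
        "reason": {"type": "string", "enum": ["damaged", "late", "other"]},
    },
    "required": ["customer_id", "reason"],
}
bad_call = {"customer_id": "42", "reason": "annoyed", "note": "please hurry"}
problems = check_arguments(bad_call, refund_schema)
```

Returning a list of problems rather than raising on the first one makes it easier to log every violation on the span for a given call.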
In May 2026, structured-output and JSON-Schema-constrained decoding have made function calling more reliable than it was in 2023. JSONValidation pass rates above 99.5% are routine on frontier models. But “JSON is valid” is the easy half. Models still hallucinate function names when they see similar tools (search_orders vs search_order), swap argument orders on function variants with paired fields (from/to, source/target), fail JSON Schema’s stricter constraints (regex patterns, conditional if/then clauses, oneOf discriminators), and pick the wrong tool entirely when two tools have overlapping descriptions. The BFCL v3 leaderboard tracks all four classes separately for exactly this reason.
The agent-era version of this problem is worse. In a multi-step trajectory, one bad argument compounds. A planner picks transfer_funds instead of quote_transfer, and the tool actually moves money. A research agent hands query="*" to a destructive deletion endpoint. Function-calling guardrails are no longer a UX nicety; they sit on the critical safety path.
Function calling vs tool calling vs MCP
These terms keep getting used interchangeably in 2026 vendor docs, and they are not the same. The table also lists A2A for comparison.
| Surface | What it is | Where it runs | Failure to watch |
|---|---|---|---|
| Function calling | OpenAI-style structured-output API | Inside one model provider call | Bad arguments, hallucinated function name |
| Tool calling | Conceptual umbrella for any model-invoked external action | Anywhere: function call, RAG retriever, code interpreter, sub-agent | Wrong tool selected for the goal |
| MCP (Model Context Protocol) | Open protocol for discovering and invoking tools across vendors | Out-of-process MCP server, often shared across agents | Tool-list drift, schema mismatch across server versions |
| A2A (Agent-to-Agent) | Open protocol for one agent to call another as a tool | Across process boundaries between agents | Trust, schema, and context propagation across agents |
In a 2026 stack, you usually see function calling as the model-facing API, MCP as the tool-discovery layer, and A2A as the multi-agent transport. They compose. They do not substitute for each other.
How FutureAGI handles function calling
FutureAGI’s approach is to evaluate function calls at three layers (schema validity, argument accuracy, and end-to-end intent) and to instrument every call as a first-class span. The traceAI integrations (traceAI-openai, traceAI-anthropic, traceAI-google-genai, traceAI-vertexai, traceAI-bedrock, traceAI-openai-agents, traceAI-langchain, traceAI-mcp, traceAI-strands) capture every function-call attempt as an OpenTelemetry GenAI span with the function name, raw arguments JSON, parsed arguments, and validation outcome. Each span carries agent.trajectory.step so the call sits inside the broader trajectory, and gen_ai.request.model so you can slice by model variant.
On the eval side, three classes cover the surface. JSONValidation returns a boolean check against the function’s JSON Schema and surfaces invalid-JSON rate and schema-violation rate immediately. FunctionCallAccuracy is the comprehensive evaluator: name match, argument structure, type compliance, and semantic correctness against the user’s intent. Unlike JSON Schema validation alone, FunctionCallAccuracy checks whether the model chose the right function and values, not just whether the payload parses. EvaluateFunctionCalling is the cloud-template equivalent for online evaluation against live traces. Used together, they tell you whether failure is “model produced bad JSON” (rare in 2026), “model produced valid JSON with wrong values” (the common case), or “model picked the wrong function entirely” (a tool-selection bug).
Concretely: a fintech agent calls a transfer_funds(from_account, to_account, amount, currency) function. After a prompt change, JSONValidation stays at 100% (the JSON is always well-formed), but FunctionCallAccuracy drops from 0.94 to 0.79. Drilling in, FutureAGI shows the model is now occasionally swapping from_account and to_account on cross-currency transfers. The team rolls back the prompt edit and adds an EvaluateFunctionCalling regression eval to catch this on every future change. Without separating schema validity from argument accuracy, the team would have seen “function calls look fine” while sending money the wrong way.
In our 2026 evals across customer fintech and healthcare agents, the distribution we keep seeing is roughly 0.5% structural failures, 4-7% semantic-argument failures, and 1-3% wrong-tool failures on frontier models, and the second bucket is where money actually moves wrong.
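The three buckets can be separated with a few lines of triage logic. The sketch below is an assumption-laden illustration, not how FunctionCallAccuracy works internally: `classify_call` is a hypothetical helper, the call shape (`name` plus `arguments`) is assumed, and exact equality against a reference call is a crude stand-in for real semantic scoring.

```python
import json

def classify_call(raw_output: str, expected: dict) -> str:
    """Bucket one model call as 'structural', 'wrong_tool', 'semantic', or 'ok'."""
    try:
        call = json.loads(raw_output)
        name, args = call["name"], call["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "structural"      # unparseable or missing the expected envelope
    if not isinstance(args, dict):
        return "structural"
    if name != expected["name"]:
        return "wrong_tool"      # valid JSON, but the wrong function
    if args != expected["arguments"]:
        return "semantic"        # right function, wrong values
    return "ok"

expected = {
    "name": "transfer_funds",
    "arguments": {"from_account": "A", "to_account": "B", "amount": 100, "currency": "USD"},
}
# Well-formed JSON with from_account and to_account swapped.
swapped = json.dumps({
    "name": "transfer_funds",
    "arguments": {"from_account": "B", "to_account": "A", "amount": 100, "currency": "USD"},
})
truncated = '{"name": "transfer_funds", "arguments": {'
wrong_tool = json.dumps({"name": "quote_transfer", "arguments": expected["arguments"]})

buckets = [classify_call(c, expected) for c in (swapped, truncated, wrong_tool)]
```

The swapped-accounts call passes any schema check and still lands in the semantic bucket, which is the case the fintech example above describes.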
Schema design is half the battle
The most underrated lever for function-calling reliability in 2026 is the JSON Schema you hand the model. A few patterns that consistently lift FunctionCallAccuracy by 5-15 points in our evals:
- Use enums, not free-form strings, wherever possible. `status: "approved" | "rejected" | "pending"` beats `status: string` even with description text.
- Constrain numerics with `minimum`/`maximum`. Models respect numeric bounds when they are in the schema. Without them, you get `amount: -50` on a refund function.
- Name fields unambiguously. `from_account_id` and `to_account_id` are swapped less often than `from` and `to`. Models pattern-match on field-name similarity.
- Use `oneOf` for variant shapes. Models handle discriminated unions better than they handle “this field is sometimes a string and sometimes an object.”
- Add a single-sentence description per parameter. Descriptions ride into the model context; “in ISO-8601 UTC” beats “datetime” on date parameters.
- Order arguments by importance. Models occasionally truncate; put the most-important arguments first.
- Limit tool catalog size. Above 30-50 tools, model selection accuracy starts dropping linearly. Route by group, don’t expose everything to every call.
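A schema that applies most of these patterns might look like the sketch below. Everything in it is illustrative: the field names, bounds, and currency list are assumptions, and support for keywords such as `exclusiveMinimum`, `maximum`, or `format` varies by provider, so check what your model API accepts. The `undocumented_params` helper is a hypothetical lint for the one-description-per-parameter rule.

```python
# Illustrative schema; names, bounds, and enum values are assumptions.
transfer_schema = {
    "name": "transfer_funds",
    "description": "Move money between two accounts owned by the same customer.",
    "parameters": {
        "type": "object",
        "properties": {
            # Most important arguments first; unambiguous names instead of from/to.
            "from_account_id": {"type": "string",
                                "description": "Account ID debited by the transfer."},
            "to_account_id": {"type": "string",
                              "description": "Account ID credited by the transfer."},
            # Numeric bounds rule out values like amount: -50.
            "amount": {"type": "number", "exclusiveMinimum": 0, "maximum": 10000,
                       "description": "Transfer amount in the given currency."},
            # Enum instead of a free-form string.
            "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"],
                         "description": "ISO-4217 currency code."},
            "scheduled_at": {"type": "string", "format": "date-time",
                             "description": "Execution time in ISO-8601 UTC."},
        },
        "required": ["from_account_id", "to_account_id", "amount", "currency"],
        "additionalProperties": False,
    },
}

def undocumented_params(schema: dict) -> list[str]:
    """Lint helper: list parameters that lack a description."""
    props = schema["parameters"]["properties"]
    return [name for name, spec in props.items() if not spec.get("description")]

missing_docs = undocumented_params(transfer_schema)
```

Running a lint like this in CI is one cheap way to keep descriptions from silently disappearing as the catalog grows.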
Frontier model function-calling status, May 2026
A senior engineer should know which models pass which slice of the call surface. Numbers below are from public BFCL v3 and τ-bench reports plus FutureAGI internal evals against support and fintech agents as of May 2026.
| Model | BFCL v3 overall | Parallel calls | Irrelevance detection | τ-bench retail | Notes |
|---|---|---|---|---|---|
| GPT-5.x | ~89% | Strong | Strong | ~68% | Default for high-stakes tool calls |
| Claude Opus 4.7 | ~87% | Strong | Strongest | ~70% | Best refusal behavior on ambiguous tools |
| Gemini 3 Pro | ~85% | Strong | Mixed | ~63% | Long-context tool lists handled best |
| Llama 4 (open-weight) | ~78% | Mixed | Weak | ~52% | Best open-weight, still trails frontier |
| Mistral Large 3 | ~75% | Mixed | Weak | ~48% | Improving but irrelevance detection trails |
The 2023 mental model (“use GPT-4, function calling just works”) does not survive contact with parallel calls, irrelevance detection, or trajectory state. Pick by the subset you actually need.
How to measure or detect function calling
Function calling has three independent failure modes; instrument each:
- `JSONValidation`: returns a boolean for schema conformance against a JSON Schema; catches invalid JSON and structural violations.
- `FunctionCallAccuracy`: returns 0-1 for comprehensive function-call quality (name, structure, types, semantics).
- `EvaluateFunctionCalling`: cloud-template eval for live trace assessment; runs as a post-guardrail on the response if you want pre-execution blocking.
- `SchemaCompliance`: returns 0-1 for structured-output schema compliance with optional partial credit.
- `FunctionCallExactMatch`: AST-based exact match when you have a reference call to compare against.
- `ParameterValidation`: validates function call parameters against a schema; useful as a cheap pre-execution guard.
- `ToolSelectionAccuracy`: catches the “right JSON, wrong tool” class that `FunctionCallAccuracy` alone can miss in tool-rich environments.
- Schema-violation rate (dashboard signal): percentage of function calls that fail JSON Schema validation per function name; slice by `function.name`, not just in aggregate.
- `agent.trajectory.step`: OTel attribute paired with span kind = `function-call`; gives per-step slicing inside a trajectory.
Minimal Python:
```python
from fi.evals import FunctionCallAccuracy, JSONValidation, ToolSelectionAccuracy

# user_query, model_function_call, ground_truth_call, function_schema,
# trajectory, and expected_tool are placeholders supplied by your own harness.
func_acc = FunctionCallAccuracy().evaluate(
    input=user_query,
    output=model_function_call,
    expected=ground_truth_call,
)
schema = JSONValidation().evaluate(output=model_function_call, schema=function_schema)
tool = ToolSelectionAccuracy().evaluate(trajectory=trajectory, expected_tool=expected_tool)
```
Wire all three onto the same span via traceAI so the failure cluster is visible at one URL, not three dashboards.
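The advice to slice schema-violation rate by function name, not just in aggregate, is easy to illustrate offline. The sketch below assumes you have exported call records with a `function_name` and a `schema_valid` flag; both field names and the sample data are hypothetical.

```python
from collections import defaultdict

def violation_rate_by_function(records: list[dict]) -> dict[str, float]:
    """Compute the share of schema-invalid calls for each function name."""
    totals, failures = defaultdict(int), defaultdict(int)
    for record in records:
        totals[record["function_name"]] += 1
        if not record["schema_valid"]:
            failures[record["function_name"]] += 1
    return {name: failures[name] / totals[name] for name in totals}

# Hypothetical exported records: one low-traffic function is failing every time.
records = [
    {"function_name": "search_orders", "schema_valid": True},
    {"function_name": "search_orders", "schema_valid": True},
    {"function_name": "search_orders", "schema_valid": True},
    {"function_name": "refund_order", "schema_valid": False},
]
rates = violation_rate_by_function(records)
aggregate = sum(not r["schema_valid"] for r in records) / len(records)
```

Here the aggregate rate looks moderate while `refund_order` fails on every call, which is exactly the regression an aggregate dashboard hides.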
Parallel and dependent calls in 2026
The 2026 frontier shifted to parallel and dependent function calls as defaults. GPT-5.x, Claude Opus 4.7, and Gemini 3 all emit multiple tool calls per turn when the task is decomposable. That introduces two new failure surfaces:
- Parallel-call ordering errors. When three calls go in parallel, the model must not assume one’s result before it lands. Measure with `TrajectoryScore` and `StepEfficiency`; agents that bottleneck a parallel call into a serial chain waste time and tokens.
- Dependent-call leakage. When call 2 depends on call 1’s result, the model must wait. Some models occasionally emit both in parallel and use a hallucinated value for the dependency. `FunctionCallAccuracy` with the trajectory as `expected` catches this; aggregate accuracy hides it.
A 2026 evaluation suite that does not separate parallel from serial trajectories is missing the most actively-evolving part of the function-calling surface. Pin your eval to BFCL v3’s parallel and parallel-multiple subsets at minimum.
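Dependent-call leakage can also be flagged with a simple structural check, assuming you can declare which arguments must come from another tool's result. The `DEPENDS_ON` map, the tool names, and the batch shape below are all hypothetical; this is a heuristic sketch, not a replacement for trajectory-level evaluation.

```python
# Hypothetical dependency map: (consumer tool, argument) -> producer tool
# whose result is the only legitimate source of that argument.
DEPENDS_ON = {("refund_order", "order_id"): "search_orders"}

def leaked_dependencies(batch: list[dict]) -> list[str]:
    """Flag calls emitted in the same parallel batch as the producer they depend on."""
    names_in_batch = {call["name"] for call in batch}
    leaks = []
    for call in batch:
        for arg in call["arguments"]:
            producer = DEPENDS_ON.get((call["name"], arg))
            if producer in names_in_batch:
                leaks.append(f"{call['name']}.{arg} was filled before {producer} returned")
    return leaks

# One model turn that emits producer and consumer together: order_id cannot be real yet.
leaky_batch = [
    {"name": "search_orders", "arguments": {"email": "a@example.com"}},
    {"name": "refund_order", "arguments": {"order_id": "ord_123", "reason": "late"}},
]
# Two independent lookups are fine to run in parallel.
clean_batch = [
    {"name": "search_orders", "arguments": {"email": "a@example.com"}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
]
leaks = leaked_dependencies(leaky_batch)
```

A check like this only catches dependencies you have declared up front, so it complements trajectory scoring instead of replacing it.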
Pre-execution guardrails on function calls
Once transfer_funds is callable from the model, the guardrail surface matters as much as the eval surface. Use Agent Command Center pre-guardrail chains on the input and a post-guardrail on the model’s function-call output before the runtime executes it. A CustomEvaluation policy check on the parsed arguments (“amount must be ≤ $10,000 without manager approval,” “destination account must belong to the same customer”) runs as a pre-execution gate and blocks the runtime call if the policy fails. This is the difference between an agent that occasionally tries something stupid and an agent that occasionally executes something stupid.
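The two example policies can be sketched as a plain pre-execution gate. This is a generic illustration and not the CustomEvaluation API: `policy_violations`, `guarded_execute`, and `fake_transfer` are hypothetical names, and the argument names are assumed.

```python
def policy_violations(args: dict, customer_accounts: set[str],
                      manager_approved: bool = False) -> list[str]:
    """Check parsed arguments against the two example policies."""
    violations = []
    if args["amount"] > 10_000 and not manager_approved:
        violations.append("amount exceeds $10,000 without manager approval")
    if args["to_account"] not in customer_accounts:
        violations.append("destination account does not belong to the same customer")
    return violations

def guarded_execute(args: dict, customer_accounts: set[str], execute,
                    manager_approved: bool = False) -> dict:
    """Run the tool only if every policy passes; otherwise block and report why."""
    violations = policy_violations(args, customer_accounts, manager_approved)
    if violations:
        return {"executed": False, "violations": violations}
    return {"executed": True, "result": execute(**args)}

executed = []  # records what the fake runtime actually ran

def fake_transfer(from_account, to_account, amount):
    executed.append((from_account, to_account, amount))
    return "transferred"

owned = {"acct-1", "acct-2"}
blocked = guarded_execute(
    {"from_account": "acct-1", "to_account": "acct-9", "amount": 25_000}, owned, fake_transfer)
allowed = guarded_execute(
    {"from_account": "acct-1", "to_account": "acct-2", "amount": 500}, owned, fake_transfer)
```

The key property is that the blocked call never reaches `fake_transfer`: the model attempted the transfer, but nothing was executed.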
Function-calling under streaming
In 2026 most production agents use streaming responses for latency. Streaming function calls introduce a class of bugs that batched evaluation cannot catch:
- Premature execution. The runtime executes a function call before the full payload streams in, getting truncated arguments. Fix: only execute after the model emits a complete tool-call block.
- Mid-stream cancellation. The stream is canceled (user navigated away, timeout) after a tool was already triggered. Fix: idempotent tool design plus a cancellation handshake.
- Partial JSON repair. Some frameworks “repair” partial JSON on the fly, occasionally fabricating fields. Fix: never auto-repair tool-call JSON; reject and re-prompt instead.
Score streaming agents with JSONValidation on the fully buffered call payload and a separate stream-health metric (premature-execution rate, mid-stream cancellation rate).
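The first and third fixes (execute only on a complete block, never auto-repair) can be sketched with a small buffer. `ToolCallBuffer` is a hypothetical class; how fragments arrive and how the end of a tool-call block is signaled depends on the provider SDK, so `feed` and `close` stand in for those events.

```python
import json

class ToolCallBuffer:
    """Accumulate streamed argument fragments; release them only when complete."""

    def __init__(self, name: str):
        self.name = name
        self._chunks: list[str] = []
        self._closed = False

    def feed(self, fragment: str) -> None:
        self._chunks.append(fragment)

    def close(self) -> None:
        """Call when the provider signals the tool-call block is finished."""
        self._closed = True

    def arguments(self):
        """Return parsed arguments, or None if the call must not be executed."""
        if not self._closed:
            return None  # still streaming: executing now would be premature
        try:
            parsed = json.loads("".join(self._chunks))
        except json.JSONDecodeError:
            return None  # truncated or malformed: reject, do not repair
        return parsed if isinstance(parsed, dict) else None

buffer = ToolCallBuffer("get_order")
buffer.feed('{"order_id": ')
mid_stream = buffer.arguments()   # nothing to execute yet
buffer.feed("42}")
buffer.close()
complete = buffer.arguments()

cut_off = ToolCallBuffer("get_order")
cut_off.feed('{"order_id": ')
cut_off.close()                   # stream ended early
rejected = cut_off.arguments()
```

Returning `None` for a truncated payload leaves the decision to re-prompt with the caller, instead of guessing at the missing fields.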
The cost dimension
Function calling is not free. A 2026 frontier agent makes 4-12 function calls per task on average across our customer stacks, and each call adds to token cost (the function schema in context), latency, and observability load. Track three signals alongside accuracy:
- Tokens spent on tool schemas as a fraction of total prompt tokens. Above 30%, prune your catalog.
- Average tool-call count per task. Sudden increases usually mean a model regression where the agent loops more.
- Tool-call cost-per-trace. Token cost of the schema + arguments + result, per task. The fastest path to lower agent cost is fewer tools in context, not a cheaper model.
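The first signal can be estimated before a request is ever sent. In the sketch below, `rough_token_count` is a crude stand-in (about four characters per token is an assumption, not a tokenizer), and `schema_token_fraction` is a hypothetical helper; pass your provider's real token counter as `count` for meaningful numbers.

```python
import json

def rough_token_count(text: str) -> int:
    # Crude stand-in (roughly 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)

def schema_token_fraction(tool_schemas: list[dict], prompt_text: str,
                          count=rough_token_count) -> float:
    """Estimate the share of prompt tokens spent on tool schemas."""
    schema_tokens = sum(count(json.dumps(schema)) for schema in tool_schemas)
    total = schema_tokens + count(prompt_text)
    return schema_tokens / total

# Deterministic demo: count characters instead of tokens.
tools = [{"name": "a"}]
fraction = schema_token_fraction(tools, "x" * 13, count=len)
needs_pruning = fraction > 0.30  # the 30% threshold from the list above
```

Tracking this fraction per deploy makes catalog growth visible before it shows up as a cost regression.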
Common mistakes
- Conflating function calling with tool calling. Function calling is one implementation; tool calling covers retrievers, code execution, sub-agents, MCP. Plan and evaluate both.
- Schema validation as the only check. Valid JSON with wrong values still ships bugs to production; pair `JSONValidation` with `FunctionCallAccuracy`.
- No per-function dashboards. Aggregate accuracy hides which one function is regressing; slice by name.
- Letting models invent function names. Without structured-output mode or function-name enums, the model occasionally hallucinates names; guard explicitly with `ToolSelectionAccuracy`.
- Ignoring argument-order swaps on similar functions. Pair-named fields (`from`/`to`, `source`/`target`) regularly get swapped; add semantic evals, not just structural ones.
- No irrelevance test in the eval suite. A 2026 agent must know when not to call a tool; without BFCL-style irrelevance rows, you cannot measure that.
- Re-prompting on schema failure instead of fixing the prompt. Retry loops are an availability hack, not a quality fix; they hide regressions and inflate token cost.
- Letting MCP tool lists drift unobserved. When an MCP server adds or renames a tool, the model sees a new surface; pin schema versions and re-run regression evals on every MCP catalog change.
The 2026 protocol shake-up
Two protocol shifts changed function calling between 2024 and 2026. First, MCP standardized tool discovery and invocation across model providers. Anthropic, OpenAI, Google, and most major frameworks now consume MCP servers as a primary tool surface. Second, A2A made one agent callable from another as a tool, which means a “function” in 2026 might be another agent’s entire trajectory wrapped behind a JSON Schema. Both protocols ride on the same function-calling JSON layer underneath, but they introduce three new failure surfaces an evaluation suite should cover:
- Schema drift across MCP server versions. When the MCP server updates a tool’s schema, the model sees a new surface; pin schema versions and re-run regression evals on every catalog change. `JSONValidation` against a frozen schema catches drift.
- Cross-agent trust. When agent A calls agent B via A2A, the response is untrusted content; treat it with the same `PromptInjection` post-retrieval check you’d apply to a RAG chunk.
- Tool-description injection. Malicious MCP servers can include injection payloads inside their tool descriptions, hijacking the model at discovery time. A `ProtectFlash` check on the catalog before registration is the defense.
For teams building agents that consume MCP catalogs, the eval workflow we recommend in 2026: register MCP servers in a staging environment, run the full golden dataset including adversarial rows against the new catalog, only promote to production after FunctionCallAccuracy, ToolSelectionAccuracy, and PromptInjection all clear thresholds. Skipping this lets a malicious or misconfigured MCP server poison every downstream call.
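Pinning schema versions can be as simple as fingerprinting the catalog in staging and diffing it before promotion. The sketch below uses only the standard library; `catalog_fingerprint` and `catalog_drift` are hypothetical helpers, and the two catalogs are made-up data shaped like MCP tool listings (a `name` plus an `inputSchema`).

```python
import hashlib
import json

def catalog_fingerprint(tools: list[dict]) -> dict[str, str]:
    """Map each tool name to a stable hash of its full definition."""
    return {
        tool["name"]: hashlib.sha256(
            json.dumps(tool, sort_keys=True).encode("utf-8")
        ).hexdigest()
        for tool in tools
    }

def catalog_drift(pinned: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare a pinned fingerprint with the current one."""
    return {
        "added": sorted(set(current) - set(pinned)),
        "removed": sorted(set(pinned) - set(current)),
        "changed": sorted(n for n in set(pinned) & set(current) if pinned[n] != current[n]),
    }

# Made-up catalogs: v2 renames a parameter, drops one tool, and adds another.
catalog_v1 = [
    {"name": "search_orders",
     "inputSchema": {"type": "object", "properties": {"query": {"type": "string"}}}},
    {"name": "refund_order",
     "inputSchema": {"type": "object", "properties": {"order_id": {"type": "string"}}}},
]
catalog_v2 = [
    {"name": "search_orders",
     "inputSchema": {"type": "object", "properties": {"q": {"type": "string"}}}},
    {"name": "cancel_order",
     "inputSchema": {"type": "object", "properties": {"order_id": {"type": "string"}}}},
]
drift = catalog_drift(catalog_fingerprint(catalog_v1), catalog_fingerprint(catalog_v2))
```

Because descriptions are part of each hashed definition, an edited tool description also shows up under `changed`, which gives the tool-description injection check a natural trigger.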
Frequently Asked Questions
What is function calling?
Function calling is the OpenAI-style mechanism by which an LLM emits a strict JSON object (a function name plus schema-validated arguments) that a runtime then executes.
How is function calling different from tool calling?
Function calling is the specific structured-output API; tool calling is the broader concept that includes retrievers, code execution, sub-agents, and MCP tools. Function calling is one implementation of tool calling.
How do you measure function-call quality?
FutureAGI's FunctionCallAccuracy scores name plus argument correctness; EvaluateFunctionCalling runs the cloud-template equivalent; JSONValidation guards the schema-compliance edge.