
LLM Function Calling in 2026: Tool Use Across OpenAI, Anthropic, and Beyond

How LLM function calling works in 2026. JSON Schema, OpenAI tools, Anthropic tools, structured outputs, parallel tool calls, and how to eval function calls.


What is LLM function calling

LLM function calling (also called tool use) is how a large language model talks to the outside world. You describe a set of available functions using JSON Schema. The model receives the user prompt and the tool catalog, decides whether to call a tool, then emits a structured tool call with typed arguments. Your application runs the tool and returns the result. The model uses the result to continue the conversation or call another tool.

In 2026 function calling is the foundation of agents, retrieval-augmented generation, structured extraction, and any product that needs an LLM to do more than chat. This post covers the JSON Schema contract, provider-specific APIs (OpenAI, Anthropic, Google, Mistral), structured outputs, parallel tool calls, common failure modes, and how to evaluate function-calling accuracy.

TL;DR

| Capability | What it means | Where to use it |
| --- | --- | --- |
| Tool definition | JSON Schema describing a function the model can call | All providers |
| Tool choice | auto, required, none, or a specific tool | Force a deterministic action |
| Parallel tool calls | Multiple tool requests in one response | Independent actions, faster latency |
| Structured outputs | Entire response constrained to a schema | Extraction, form filling |
| Tool result | Function output fed back as a new turn | Multi-turn agent loops |
| Function-call eval | Score actual vs expected tool name and arguments | Regression testing across model snapshots |

The JSON Schema contract

Every modern function-calling API uses JSON Schema to describe a tool:

  • name: the function identifier the model will return.
  • description: a short natural language hint about when to call it.
  • parameters or input_schema: a JSON Schema describing the typed arguments.
  • required: an array of parameter names that must be present.

Example tool definition (provider neutral):

get_order_status = {
    "name": "get_order_status",
    "description": "Look up the current status of a customer order by order ID.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "The unique order identifier in our system.",
            },
            "include_history": {
                "type": "boolean",
                "description": "Whether to include the status history events.",
                "default": False,
            },
        },
        "required": ["order_id"],
        "additionalProperties": False,
    },
}

Three rules of thumb for tool schemas:

  1. Keep descriptions short and concrete. The model uses them to decide when to call the tool.
  2. Mark every truly required field in required; fields left optional invite the model to omit them or invent values.
  3. Set additionalProperties: False to reject unknown fields.
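
For example, a tighter schema for a hypothetical search_orders tool (a sketch, not a tool defined elsewhere in this post) applies all three rules at once: an enum constrains the status field, bounds constrain the result count, and unknown fields are rejected.

search_orders = {
    "name": "search_orders",
    "description": "Search customer orders filtered by status.",
    "parameters": {
        "type": "object",
        "properties": {
            "status_filter": {
                "type": "string",
                "enum": ["pending", "shipped", "delivered", "cancelled"],
                "description": "Only return orders currently in this status.",
            },
            "limit": {
                "type": "integer",
                "minimum": 1,
                "maximum": 50,
                "default": 10,
                "description": "Maximum number of orders to return.",
            },
        },
        "required": ["status_filter"],
        "additionalProperties": False,
    },
}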

OpenAI tool use in 2026

OpenAI’s Responses API is the recommended path. A minimal tool-use call:

from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "name": "get_order_status",
        "description": "Look up the current status of a customer order by order ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
            },
            "required": ["order_id"],
            "additionalProperties": False,
        },
    }
]

response = client.responses.create(
    model="gpt-4o-mini",
    input="What is the status of order A-94821?",
    tools=tools,
    tool_choice="auto",
)

for item in response.output:
    if item.type == "function_call":
        print(item.name, item.arguments)

Set tool_choice to auto (model decides), required (model must call something), none (no tools this turn), or {"type": "function", "name": "..."} to force a specific tool.
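
After your code executes the tool, the result goes back to the model as a function_call_output item and you call the API again. A minimal sketch of that round trip, assuming a local get_order_status implementation; the exact item shapes can shift across SDK versions, so verify against the current Responses API reference.

import json

# Start from the original user turn, then append the model's call and your result.
input_items = [{"role": "user", "content": "What is the status of order A-94821?"}]

for item in response.output:
    if item.type == "function_call":
        args = json.loads(item.arguments)
        result = get_order_status(**args)  # your own implementation (assumed)
        input_items.append(item)  # echo the model's function_call back
        input_items.append({
            "type": "function_call_output",
            "call_id": item.call_id,
            "output": json.dumps(result),
        })

followup = client.responses.create(
    model="gpt-4o-mini",
    input=input_items,
    tools=tools,
)
print(followup.output_text)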

For guaranteed-parseable responses, use Structured Outputs. The exact SDK method (for example responses.parse, responses.create with text.format, or beta.chat.completions.parse) shifts across SDK versions, so always check the OpenAI Python SDK release notes for the snapshot you target. The conceptual pattern is the same: define a Pydantic model or JSON Schema and pass it as the response schema.

from pydantic import BaseModel
from openai import OpenAI

class OrderQuery(BaseModel):
    order_id: str
    include_history: bool = False

client = OpenAI()
# Check the current OpenAI Python SDK docs for the exact structured-output
# helper that matches your installed version, then pass OrderQuery to it.

Anthropic Claude tool use

Anthropic’s Messages API uses an almost-identical shape:

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_order_status",
        "description": "Look up the current status of a customer order by order ID.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
            },
            "required": ["order_id"],
        },
    }
]

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Status of order A-94821?"}],
)

for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)

Claude tool_choice options: {"type": "auto"}, {"type": "any"}, {"type": "none"}, or {"type": "tool", "name": "..."}. To force serial execution, set disable_parallel_tool_use: true inside the tool_choice object.

After running the tool, send the result back in the next user message:

followup = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[
        {"role": "user", "content": "Status of order A-94821?"},
        {"role": "assistant", "content": response.content},
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": "the-tool-use-id-from-the-block",
                    "content": "Order A-94821 shipped on 2026-05-12.",
                }
            ],
        },
    ],
)
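
In practice you read the id and arguments off the tool_use block rather than hard-coding them. A small sketch, assuming a single tool call came back and a local get_order_status implementation:

# Pull the first tool_use block out of the previous response.
tool_block = next(b for b in response.content if b.type == "tool_use")

result_text = str(get_order_status(**tool_block.input))  # your own implementation (assumed)

tool_result_block = {
    "type": "tool_result",
    "tool_use_id": tool_block.id,
    "content": result_text,
}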

Google Gemini and other providers

  • Gemini exposes tools with function_declarations and supports response_schema for structured outputs.
  • Mistral Large 2 and Mixtral follow the OpenAI tools shape.
  • Llama 4 (via Bedrock, Groq, Together, or self-hosted) supports tool use through chat templates that match the OpenAI format.
  • Bedrock’s Converse API normalizes tool definitions across providers so the same code path works for Claude, Llama, Mistral, and Nova.

If you build an agent that mixes providers, route through a normalization layer (LiteLLM, LangChain, or your own adapter) so the tool schema and parsing logic live in a single code path.
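
With LiteLLM, for example, one function can cover several providers. This is a sketch that assumes Chat Completions-style tool definitions (the nested {"type": "function", "function": {...}} shape); the model strings below are illustrative.

from litellm import completion

def call_with_tools(model: str, prompt: str, tools: list) -> list:
    # e.g. "gpt-4o-mini", "anthropic/claude-sonnet-4-5", "gemini/gemini-2.0-flash"
    response = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        tools=tools,
        tool_choice="auto",
    )
    # LiteLLM normalizes every provider to the OpenAI tool_calls shape.
    return response.choices[0].message.tool_calls or []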

Parallel tool calls

A single assistant turn can request many tools at once. Example flow:

  1. User: “Compare the prices of order A and order B and pick the cheaper one.”
  2. Model emits two function_call items: get_order_status(order_id="A") and get_order_status(order_id="B").
  3. Caller runs both in parallel.
  4. Caller sends both tool results back in one batched message.
  5. Model writes the final answer.

Parallel tool calls drop latency for independent actions. Disable them when calls share side effects or must execute in order.
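
A provider-neutral sketch of step 3, fanning out independent calls with a thread pool; registry maps tool names to your local implementations:

from concurrent.futures import ThreadPoolExecutor

def run_parallel_calls(calls: list[tuple[str, dict]], registry: dict) -> list:
    # calls is whatever list of (tool_name, arguments) pairs the model emitted.
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [pool.submit(registry[name], **args) for name, args in calls]
        return [f.result() for f in futures]

# results = run_parallel_calls(
#     [("get_order_status", {"order_id": "A"}), ("get_order_status", {"order_id": "B"})],
#     {"get_order_status": get_order_status},
# )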

Structured outputs

Structured outputs lock the entire response to a JSON Schema or Pydantic model. The model never returns free-form text outside the schema. This is the strongest guarantee a provider offers for parse-ability.

Use cases:

  • Document field extraction (invoices, contracts, forms).
  • Database row generation.
  • Agent planners that emit a typed plan.
  • Eval datasets that require structured judge output.

Costs: structured outputs sometimes raise latency slightly and may reject inputs that cannot match the schema. Always validate the result against the schema even when the provider claims strict mode, since edge cases happen.
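
A minimal validation-and-retry sketch using the OrderQuery Pydantic model from earlier; repair_with_model is a hypothetical helper that re-prompts the model with the validation errors.

from pydantic import ValidationError

def parse_or_retry(raw_json: str, retries: int = 2) -> OrderQuery | None:
    for _ in range(retries + 1):
        try:
            return OrderQuery.model_validate_json(raw_json)
        except ValidationError as err:
            # Hypothetical helper: re-prompt the model with the schema errors.
            raw_json = repair_with_model(raw_json, err)
    return None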

Common failure modes

| Failure | Cause | Fix |
| --- | --- | --- |
| Invalid JSON arguments | Model output does not parse | Enforce strict mode, retry with a corrective prompt |
| Hallucinated tool name | Model invents a tool not in the catalog | Validate name against the registry, reject and retry |
| Missing required parameter | Schema not strict enough | Mark fields required, set additionalProperties false |
| Mis-typed parameter | Loose schema, ambiguous description | Use enums and constrained types, tighten descriptions |
| Prompt injection | Untrusted input in user prompt or tool result | Sanitize tool results, add guardrails, scope tools to read-only |
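
The first three failures can be caught before execution with a registry check plus JSON Schema validation. A sketch using the jsonschema package, where get_order_status is the tool definition dict from the schema section above:

from jsonschema import validate

# Map every exposed tool name to the JSON Schema for its parameters.
TOOL_REGISTRY = {"get_order_status": get_order_status["parameters"]}

def check_call(name: str, arguments: dict) -> dict:
    if name not in TOOL_REGISTRY:
        raise ValueError(f"Hallucinated tool name: {name}")
    # Raises jsonschema.ValidationError on missing or mis-typed parameters.
    validate(instance=arguments, schema=TOOL_REGISTRY[name])
    return arguments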

Function-call evaluation with Future AGI

A regression eval set is the only way to catch silent drift across model snapshots. Future AGI’s open source ai-evaluation library (Apache 2.0) exposes function-call accuracy and related agent metrics.

from fi.evals import evaluate

# Score whether the actual tool call matches the expected one.
result = evaluate(
    "function_call_accuracy",
    output={"name": "get_order_status", "arguments": {"order_id": "A-94821"}},
    expected={"name": "get_order_status", "arguments": {"order_id": "A-94821"}},
)

print(result)

For a custom rubric on more nuanced agent behavior, wrap an LLM judge:

from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

judge = CustomLLMJudge(
    name="agent_tool_choice_quality",
    grading_criteria="Did the model pick the most appropriate tool, with correct arguments, for the user request?",
    provider=LiteLLMProvider(model="gpt-4o-mini"),
)

verdict = judge.evaluate(
    output="Called get_order_status with order_id A-94821",
    context="User asked: 'What is the status of A-94821?'",
)

print(verdict)

Future AGI's turing eval models trade speed for depth: turing_flash returns scores in roughly 1 to 2 seconds in the cloud, turing_small in 2 to 3 seconds, and turing_large in 3 to 5 seconds. Pick the smallest model that meets your accuracy bar.

Tracing agent tool calls with traceAI

Tracing every tool call as an OpenTelemetry span makes failures replayable. Future AGI’s traceAI (Apache 2.0) provides annotated decorators for agents, tools, and chains.

from fi_instrumentation import register, FITracer

register(project_name="agent-prod")
tracer = FITracer(__name__)

@tracer.tool
def get_order_status(order_id: str, include_history: bool = False) -> dict:
    # Real lookup goes here.
    return {"order_id": order_id, "status": "shipped"}

@tracer.agent
def handle_customer_request(prompt: str) -> str:
    # Agent loop: call LLM, dispatch tool calls, return final response.
    ...

Every prompt, tool call, argument set, latency, and cost lands in your Future AGI project for replay and regression testing.

Production safety checklist

  • Define every tool with a strict JSON Schema. Set additionalProperties: False.
  • Default to read-only tools. Gate any write or destructive action behind an explicit allowlist (see the dispatch sketch after this checklist).
  • Validate tool arguments before execution, even if the provider claims strict mode.
  • Treat the result of any tool that touches user input as untrusted. Sanitize before re-injecting.
  • Add a human-in-the-loop step before any irreversible action: payments, deletes, external communications.
  • Run a regression eval (function_call_accuracy + custom LLM judge) on every model snapshot bump.
  • Capture every call with traceAI and review failures weekly.
  • Route traffic through the Future AGI Agent Command Center at /platform/monitor/command-center for centralized prompt versioning, model fallback, and BYOK gateway control.
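
A minimal dispatch gate illustrating the allowlist and human-approval items above; the tool names and the approved_by_human flag are illustrative, not part of any SDK.

READ_ONLY_TOOLS = {"get_order_status", "search_orders"}
DESTRUCTIVE_TOOLS = {"refund_order", "delete_order"}

def dispatch(name: str, arguments: dict, registry: dict, approved_by_human: bool = False):
    # Reject anything that is not explicitly allowlisted.
    if name not in READ_ONLY_TOOLS | DESTRUCTIVE_TOOLS:
        raise ValueError(f"{name} is not on the tool allowlist")
    # Destructive actions wait for an explicit human approval flag.
    if name in DESTRUCTIVE_TOOLS and not approved_by_human:
        raise PermissionError(f"{name} requires human approval")
    return registry[name](**arguments)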

Closing

LLM function calling is the bridge from words to actions, and in 2026 it is the foundation of every serious LLM product. Master the JSON Schema contract, pick the provider features that match your needs (structured outputs, parallel calls, tool_choice), and lock in continuous evaluation and tracing so the agent that worked last week still works this week.

Frequently asked questions

What is LLM function calling in 2026?
Function calling (also called tool use) is the contract between an LLM and your application. You declare functions or tools using JSON Schema. The model receives the user prompt and the tool catalog, decides whether a tool is needed, then returns a structured tool call with typed arguments. Your code runs the tool and feeds the result back. Function calling powers agents, RAG, structured extraction, and any workflow that needs the model to act on the world.
How does OpenAI tool use work in 2026?
OpenAI's Chat Completions and Responses APIs accept a `tools` array of JSON-Schema-typed function definitions. Chat Completions returns a `tool_calls` field on the assistant message when the model wants to call one; the Responses API returns `function_call` output items. Set `tool_choice` to `auto`, `required`, `none`, or a specific tool name to control behavior. The Responses API also supports parallel tool calls, native web search, code interpreter, and a structured-output mode that forces the response to match a Pydantic or JSON Schema.
How does Anthropic tool use work?
Anthropic Claude accepts a `tools` array in the Messages API. Each tool has a name, description, and `input_schema` in JSON Schema. The model returns content blocks of type `tool_use` with the tool name and structured input. You execute the tool, then send a `tool_result` content block in the next user turn. Claude supports parallel tool use, tool_choice (any, auto, none, or a specific tool), and disable_parallel_tool_use for strict serial execution.
What are structured outputs?
Structured outputs are a hard-typed extension of function calling. Instead of letting the model decide which tool to call, you constrain the entire response to match a JSON Schema or Pydantic model. OpenAI calls this Structured Outputs and supports it on Responses and Chat Completions. Gemini calls it response_schema. Anthropic supports structured JSON-style outputs through prompting and tool schemas, so check the current Messages API docs for the strict schema guarantees on your snapshot. Use structured outputs when you want zero free-form text and guaranteed parse-ability.
What are parallel tool calls?
Parallel tool calls let the model emit multiple tool requests in a single response, useful when the requested actions are independent. OpenAI and Anthropic both expose this. The caller fans out the tool executions, waits for all results, then sends them back as a batched tool_results message. Disable parallel calls when ordering matters or when the tools share side effects.
How do I evaluate function calling accuracy?
Build a labeled set of (prompt, expected tool, expected arguments) tuples. Run the agent, compare the actual tool name and arguments to the expected values. Future AGI's ai-evaluation library exposes function_call_accuracy and related agent evals you can run as `evaluate('function_call_accuracy', output=..., expected=...)`. Pair with traceAI to capture every tool call as an OpenTelemetry span so you can debug failures with the exact prompt, schema, and model output.
What are the biggest function-calling failure modes?
Five recurring failures: invalid JSON arguments that fail schema validation, hallucinated tool names that do not exist, missing required parameters, mis-typed parameters (string instead of integer), and prompt injection that hijacks the model into calling a destructive tool. Mitigations: enforce strict JSON Schema validation, use structured outputs, expose only read-only tools by default, log every call, and run a regression eval on each model snapshot.
How do I keep function calling safe in production?
Treat every tool call as untrusted. Validate arguments against the schema before execution. Scope tools to the minimum required permissions (read-only by default). Use IAM or RBAC on the tool layer, not just the model layer. Log every call to an audit trail. Add a human-in-the-loop step before any destructive action like sending money, mutating production data, or contacting external parties. Apply guardrails like Future AGI's eval-based policy filters before and after each call.