What Is Function Call Accuracy?
An LLM-evaluation metric that scores a tool call against an expected call across function name, parameter presence, and parameter values.
Function call accuracy is an LLM-evaluation metric that scores whether a model’s tool or function call matches an expected target call. It grades three components: function name match, parameter presence (which expected keys appear in the actual arguments), and parameter value correctness — weighted by default at 40/30/30. The metric handles single calls, parallel-call lists, ordered sequences, and mixed input formats (OpenAI-style dicts, JSON strings, Python AST). In FutureAGI it is the FunctionCallAccuracy class in fi.evals, paired with FunctionCallExactMatch for strict identity comparison.
Why It Matters in Production LLM and Agent Systems
A function call that looks right and behaves wrong is the signature failure mode of every function-calling agent. Models confidently emit transfer_funds(amount=100, from="checking", to="savings") when the user asked to move money the other way; the call is structurally perfect, semantically wrong. Other failure modes are subtler: a model picks a string "100" for an integer amount, drops an optional-but-required parameter, or calls the right function with the right values but in the wrong order in a parallel-call setting.
The pain crosses every team that ships function calls. ML engineers see a 6% regression in agent task-completion after a prompt edit and cannot tell whether tool selection or argument filling broke. Backend SREs see schema-validation rejections at the function gateway with no offline signal that warned them. Product managers running a regression dashboard see a flat “agent works” metric while one specific tool’s argument-error rate quietly tripled.
In 2026 stacks where MCP, parallel tool calls, and multi-tool plans are routine, scalar metrics are not enough. You need decomposable signals — name, presence, values — and the ability to compare to expected calls as a set rather than a sequence. Without that, every regression hunt starts by reading raw spans. Function call accuracy is the metric that turns “the agent broke” into “the book_flight call’s cabin_class parameter regressed from 1.0 to 0.4 after the prompt update.”
How FutureAGI Handles Function Call Accuracy
FutureAGI’s approach is AST-based and runs in milliseconds without an LLM-judge. The fi.evals.FunctionCallAccuracy class accepts a response and expected_response in any common format — an OpenAI tool-call dict, a JSON string, a list of calls, or a Python-style call string like get_weather(city="NYC"). It parses both into normalised FunctionCall objects and scores three components: name match (40% of weight), parameter presence (30%, the fraction of expected keys that appear in the actual call), and value match (30%, the fraction of expected values that equal the actual values, with int/float flexibility). For parallel calls, set order_matters=True to grade as a sequence or order_matters=False to grade as a set, with coverage penalties when expected calls have no match.
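The 40/30/30 weighting can be sketched in plain Python. This is an illustrative reimplementation of the scoring logic described above, not FutureAGI's actual code; score_call is a hypothetical helper.

```python
def score_call(actual: dict, expected: dict) -> float:
    """Score an actual tool call against an expected one (0-1),
    weighting name match 40%, parameter presence 30%, value match 30%."""
    name_match = 1.0 if actual["name"] == expected["name"] else 0.0

    exp_args = expected.get("arguments", {})
    act_args = actual.get("arguments", {})

    if exp_args:
        # Presence: fraction of expected keys that appear in the actual call.
        presence = sum(k in act_args for k in exp_args) / len(exp_args)
        # Values: fraction of expected values that equal the actual values.
        values = sum(act_args.get(k) == v for k, v in exp_args.items()) / len(exp_args)
    else:
        presence = values = 1.0

    return 0.4 * name_match + 0.3 * presence + 0.3 * values

actual = {"name": "get_weather", "arguments": {"city": "NYC", "units": "metric"}}
expected = {"name": "get_weather", "arguments": {"city": "NYC", "units": "imperial"}}
print(round(score_call(actual, expected), 2))  # 0.85: name and presence full, one value wrong
```

The partial score makes the failure legible: a wrong argument value costs 15 points here, while a wrong function name would cost 40.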
When you need a strict gate — for example, a JSON-schema-driven router that should refuse anything but the canonical call shape — fi.evals.FunctionCallExactMatch returns 1.0 only on full identity and 0.0 otherwise, with a structured reason string (“missing: {amount}”, “Value mismatch for ‘currency’: got ‘USD’, expected ‘EUR’”). Concretely: a banking-agent team using traceAI-openai runs FunctionCallAccuracy on every nightly regression of their 1,200-call golden set, alerts on any function whose per-tool score drops more than 5 points, and routes regressions through Agent Command Center with a pre-guardrail that blocks the deploy until the eval recovers. Compared with DeepEval’s ToolCorrectnessMetric (LLM-judge based, slower, less deterministic), FunctionCallAccuracy is reproducible across runs and embeds cleanly into CI.
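As a rough sketch, a strict gate of this kind reduces to an all-or-nothing comparison that returns a reason string on the first discrepancy. The exact_match function below is a hypothetical stand-in, not the FunctionCallExactMatch implementation.

```python
def exact_match(actual: dict, expected: dict) -> tuple[float, str]:
    """Return (1.0, reason) only on full identity; (0.0, reason) otherwise."""
    if actual["name"] != expected["name"]:
        return 0.0, f"Name mismatch: got '{actual['name']}', expected '{expected['name']}'"
    exp_args = expected.get("arguments", {})
    act_args = actual.get("arguments", {})
    missing = set(exp_args) - set(act_args)
    if missing:
        return 0.0, f"missing: {missing}"
    extra = set(act_args) - set(exp_args)
    if extra:
        return 0.0, f"unexpected: {extra}"
    for key, val in exp_args.items():
        if act_args[key] != val:
            return 0.0, f"Value mismatch for '{key}': got '{act_args[key]}', expected '{val}'"
    return 1.0, "exact match"

score, reason = exact_match(
    {"name": "convert", "arguments": {"currency": "USD"}},
    {"name": "convert", "arguments": {"currency": "EUR"}},
)
print(score, reason)  # 0.0 Value mismatch for 'currency': got 'USD', expected 'EUR'
```

The binary score is what makes this usable as a deploy gate: any nonzero count of gate failures blocks the release, and the reason string goes straight into the CI log.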
How to Measure or Detect It
Measurement signals tied to function-call evals:
- fi.evals.FunctionCallAccuracy — returns a 0–1 weighted score with a name / params / values breakdown in the reason string. Threshold per-function rather than globally.
- fi.evals.FunctionCallExactMatch — returns 1.0/0.0 for strict regression gates; useful when downstream systems require canonical call shapes.
- fi.evals.ParameterValidation — schema-side check against the function definition; surfaces type and enum violations the accuracy metric collapses into a single value-mismatch.
- tool.name OTel attribute — span tag emitted by traceAI integrations; pivot eval failures by tool.name to find the specific tool regressing.
Minimal Python:
from fi.evals import FunctionCallAccuracy

metric = FunctionCallAccuracy()
result = metric.evaluate(
    response={"name": "get_weather",
              "arguments": {"city": "NYC", "units": "metric"}},
    expected_response={"name": "get_weather",
                       "arguments": {"city": "NYC", "units": "imperial"}},
)
print(result.score, result.reason)
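For parallel calls graded as a set (order_matters=False), the matching idea can be sketched as a greedy best-match assignment with a coverage penalty for unmatched expected calls. Both score_parallel and the simple scorer below are illustrative, not the library's algorithm.

```python
def score_parallel(actual_calls, expected_calls, single_score):
    """Greedily match each expected call to its best-scoring actual call.
    Expected calls left unmatched contribute 0 (coverage penalty)."""
    remaining = list(actual_calls)
    total = 0.0
    for exp in expected_calls:
        if not remaining:
            break  # no actual call left to match this expected call
        best = max(remaining, key=lambda act: single_score(act, exp))
        total += single_score(best, exp)
        remaining.remove(best)
    return total / len(expected_calls)

def simple(actual, expected):
    """Toy single-call scorer: full credit only on identity."""
    return 1.0 if actual == expected else 0.0

# Reordered parallel calls still score 1.0 when graded as a set.
calls = [{"name": "book_hotel"}, {"name": "book_flight"}]
target = [{"name": "book_flight"}, {"name": "book_hotel"}]
print(score_parallel(calls, target, simple))  # 1.0
```

With order_matters=True the same reordering would be graded position by position and read as a regression, which is exactly the mistake the Common Mistakes section warns about.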
Common Mistakes
- Treating it as exact match. FunctionCallAccuracy returns partial credit; if you need strict identity for a deploy gate, use FunctionCallExactMatch.
- Grading only on the function name. Most regressions live in arguments, not in tool choice — wire all three components.
- Grading parallel calls as a sequence by default. When the agent is allowed to call tools concurrently, set order_matters=False; otherwise reordering looks like a regression.
- Ignoring strict-type checks for numeric APIs. Set strict_type_check=True when an int field cannot tolerate a float value, or you ship the bug to a backend that rejects it.
- No golden set per function. A global mean hides which tool regressed; build a per-tool eval cohort with at least 30 calls each.
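The strict-type trade-off from the list above fits in a few lines. The values_equal function is a hypothetical helper that mirrors the behaviour of a strict_type_check flag, not library code.

```python
def values_equal(actual, expected, strict_type_check=False):
    """Compare two argument values, optionally rejecting type coercion."""
    if strict_type_check and type(actual) is not type(expected):
        return False  # e.g. int vs float: fail even though values compare equal
    return actual == expected

print(values_equal(100, 100.0))                          # True: lenient numeric match
print(values_equal(100, 100.0, strict_type_check=True))  # False: int vs float rejected
```

Lenient matching is the right default for regression scoring; the strict flag belongs on calls headed to backends that validate types before execution.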
Frequently Asked Questions
What is function call accuracy?
Function call accuracy is an LLM-eval metric that compares a model's function/tool call against an expected call across three components — function name match, parameter presence, and parameter value correctness — returning a weighted 0–1 score.
How is function call accuracy different from exact match?
FunctionCallAccuracy is graded — partial credit if the name and most parameters match. FunctionCallExactMatch is binary — 1.0 only if name, parameter set, and every value are identical. Use FunctionCallAccuracy for regression scoring and FunctionCallExactMatch for strict gates.
How do you measure function call accuracy?
FutureAGI's fi.evals.FunctionCallAccuracy takes a response (the actual call) and an expected_response (the target call) as dicts or JSON, then returns a weighted 0–1 score. For parallel-call settings, set order_matters=False to score as a set.