What Is Exact Match?
Exact match is an LLM-evaluation metric that returns 1.0 when a model’s output is byte-identical to an expected reference and 0.0 otherwise — usually with optional case-insensitivity and whitespace normalisation. It is the simplest possible correctness signal and the right one for closed-form tasks: classification labels, canonical IDs, structured-output keys, function names, single-token answers. In FutureAGI it is the Equals class in fi.evals for text and FunctionCallExactMatch for structured tool calls. Both run sub-millisecond and are cheap enough to score on every trace.
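Stripped of the library wrapper, the metric is a single comparison. Here is a minimal sketch of the normalise-then-compare logic (an illustrative helper, not the fi.evals implementation):

```python
def exact_match(response: str, expected: str, case_sensitive: bool = True) -> float:
    """Binary exact-match score with optional case and whitespace normalisation."""
    a, b = response.strip(), expected.strip()
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    return 1.0 if a == b else 0.0

print(exact_match("USD", "usd", case_sensitive=False))  # 1.0
print(exact_match("USD", "usd"))                        # 0.0
```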
Why It Matters in Production LLM and Agent Systems
The metric people use first is the metric people abuse most. Exact match is the right tool when the reference answer is canonical — a class label, a country code, a regex match, a JSON key — and a disastrous tool when it is not. A QA team grading an open-ended chat assistant with exact match will see 12% accuracy and conclude the model is broken; the same outputs against EmbeddingSimilarity or AnswerRelevancy will show 84%. The model is fine; the metric was wrong.
The pain shows up when teams use exact match where it does not belong. ML engineers tune a prompt against an exact-match-only eval and end up with a model that parrots the gold answer’s exact phrasing — fragile, low-coverage, and prone to rewarding training-data contamination. RAG teams grade a retrieval-augmented chatbot with exact match and miss every factually correct answer that paraphrased the gold reference. Structured-output teams use exact match on whole JSON blobs and fail any output where keys appear in a different order — a serialisation issue, not a correctness issue.
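To see why whole-blob exact match punishes serialisation rather than correctness, compare two JSON strings that encode the same object (plain Python, nothing library-specific):

```python
import json

gold = '{"currency": "USD", "amount": 100}'
out = '{"amount": 100, "currency": "USD"}'  # same data, different key order

print(gold == out)                           # False: byte comparison fails
print(json.loads(gold) == json.loads(out))   # True: parsed objects are equal
```

Exact match sees two different byte strings; a structural check sees one object.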
In 2026-era agent and tool-calling stacks, exact match’s right home is structured: the function-name field of a FunctionCall, the boolean output of a classifier, the integer count from a SQL agent, the canonical product SKU from an order-lookup tool. There the binary signal is exactly what you want: it cannot be gamed, it is auditable, and it is fast enough to put on every span.
How FutureAGI Handles Exact Match
FutureAGI’s approach is to keep exact match honest and direct, while routing everything else to the right metric. The text variant is fi.evals.Equals — it consumes a TextMetricInput with response and expected_response, normalises case (configurable via case_sensitive), and returns 1.0 if and only if the two strings are equal, with a one-line reason. There is no fuzzy fallback, no embedding back-off; if you want graded similarity, the metric is FuzzyMatch or EmbeddingSimilarity. The structured variant is fi.evals.FunctionCallExactMatch — it parses both call sides into a normalised FunctionCall, compares names, then compares the parameter set, then compares each value with int/float numeric flexibility. It returns 1.0 only on full identity, 0.0 with a structured reason (“missing: {amount}”, “Value mismatch for ‘currency’: got ‘USD’, expected ‘EUR’”) on any mismatch.
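For the structured variant, a sketch helps. The response/expected_response keywords below mirror the Equals example later in this piece; the exact input shape FunctionCallExactMatch accepts is an assumption, so treat this as pseudocode against your installed fi.evals version:

```python
from fi.evals import FunctionCallExactMatch

# Assumed input shape: a dict carrying the call name and its parameters.
metric = FunctionCallExactMatch()
result = metric.evaluate(
    response={"name": "convert_currency",
              "parameters": {"amount": 100, "currency": "USD"}},
    expected_response={"name": "convert_currency",
                       "parameters": {"amount": 100.0, "currency": "EUR"}},
)
# amount passes under int/float numeric flexibility; currency does not,
# so the score is 0.0 with a structured reason along the lines of
# "Value mismatch for 'currency': got 'USD', expected 'EUR'"
print(result.score, result.reason)
```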
For classification-style data — labels, ground-truth tags, expected-class IDs — the cloud-template GroundTruthMatch evaluator is the right anchor; it includes the same identity check plus normalisation conventions. Concretely: a fintech team running traceAI-openai on a transaction-categorisation model uses Equals to score 4.2M classifier traces a day, alerting on any cohort where exact-match accuracy drops below 0.97. A coding-agent team uses FunctionCallExactMatch as the strict gate in their pre-deploy regression and FunctionCallAccuracy as the regression-tracking metric. Compared with Ragas’s answer_correctness (LLM-judge based, 200ms, $0.001 per call), Equals runs on every single trace at zero cost — but only where the task actually has a canonical answer.
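The cohort alert itself needs nothing beyond per-trace scores. A minimal sketch of the exact-match-rate-by-cohort check (cohorts and data are illustrative; 0.97 is the alert floor from the fintech example above):

```python
from collections import defaultdict

# (cohort, exact-match score) pairs collected from per-trace Equals results
scores = [("emea", 1.0), ("emea", 1.0), ("apac", 0.0), ("apac", 1.0)]

by_cohort = defaultdict(list)
for cohort, score in scores:
    by_cohort[cohort].append(score)

for cohort, vals in by_cohort.items():
    rate = sum(vals) / len(vals)
    if rate < 0.97:  # no soft floor: any drop is a real regression
        print(f"ALERT: exact-match rate for {cohort} is {rate:.3f}")
```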
How to Measure or Detect It
Measurement signals tied to exact-match metrics:
- fi.evals.Equals — returns 1.0 or 0.0 against a string reference, with configurable case_sensitive. Use for labels, IDs, single-token answers.
- fi.evals.FunctionCallExactMatch — AST-based identity comparison for tool/function calls, with int/float numeric flexibility.
- fi.evals.GroundTruthMatch — the classifier-style evaluator for label equality, with cohort-friendly normalisation defaults.
- Exact-match-rate-by-cohort dashboard — the canonical regression signal for closed-form tasks; a 1-point drop is a real regression because the metric has no soft floor.
Minimal Python:

```python
from fi.evals import Equals

# Case-insensitive comparison: "USD" and "usd" normalise to the same string
metric = Equals(config={"case_sensitive": False})
result = metric.evaluate(response="USD", expected_response="usd")
print(result.score, result.reason)  # 1.0, with a one-line reason
```
Common Mistakes
- Using exact match on open-ended generation. A correct paraphrase scores 0.0; the metric is for canonical-answer tasks only. Use EmbeddingSimilarity or judge-model rubrics instead.
- Confusing exact match with function call accuracy. Equals is text-on-text; FunctionCallAccuracy grades a call across name, parameter presence, and values with partial credit. They are not interchangeable.
- Whole-JSON exact match. Serialisation order, whitespace, and trailing commas all flip the score; use JSONValidation or SchemaCompliance for structured outputs.
- Treating a low score as “model bad” when it is “metric wrong”. If your task has more than one correct phrasing, exact match is unsuitable — switch metric, not model.
- Skipping case-normalisation by default. A case_sensitive=True exact match against natural-language references is almost always too strict; enable strict casing only for IDs and codes.
Frequently Asked Questions
What is exact match in LLM evaluation?
Exact match is a binary metric returning 1.0 when a model's output is byte-identical to an expected reference and 0.0 otherwise, with optional case-insensitivity. It is the canonical correctness signal for closed-form tasks like classification labels and canonical IDs.
How is exact match different from fuzzy match?
Exact match is binary — only an identical string scores. Fuzzy match returns a graded similarity score that tolerates whitespace, punctuation, and minor edits. Use exact match for canonical labels and IDs; use fuzzy match when the reference and output are both natural language.
How do you measure exact match?
FutureAGI's fi.evals.Equals takes a response and an expected_response string and returns 1.0 or 0.0 with case_sensitive configurable. For function/tool calls, fi.evals.FunctionCallExactMatch performs AST-based identity comparison across the call's name and every argument value.