What Is Accuracy (ML / LLM Evaluation)?
The fraction of model outputs that exactly match the ground-truth label, computed as correct predictions divided by total predictions.
What Is Accuracy (ML / LLM Evaluation)?
Accuracy is the proportion of model outputs that exactly match the ground-truth label, computed as correct / total. It is the oldest evaluation metric in ML and the easiest to explain, but it makes two strong assumptions: classes are roughly balanced and the right answer is canonical. Both assumptions hold for closed-form tasks. multiple choice, sentiment classification, JSON validation, function-name matching. and break for open-ended generation, where two correct answers can differ word-for-word. In practice, LLM evaluation teams use accuracy as one signal in a larger suite, not the headline number. By May 2026, frontier models like GPT-5.x, Claude Opus 4.7, and Gemini 3 Ultra all clear 90%+ on legacy accuracy benchmarks such as MMLU and HumanEval. meaning raw accuracy stopped discriminating models on those suites years ago, and the action shifted to harder cohort-sliced evals.
Why It Matters in Production LLM and Agent Systems
Reporting a single accuracy number on an LLM benchmark is the fastest way to ship a confident, broken system. Two failure modes drive this. The first is class imbalance: if 95% of your customer-support intents are “billing question”, a model that always predicts “billing question” hits 95% accuracy and helps no one with the other 5%. The second is the canonical-answer trap: an LLM that generates “Yes, you can return the item within 30 days” gets zero accuracy against a label of “Returns are accepted up to 30 days post-purchase”. same meaning, different surface form. This is why exact match accuracy collapses on free-text generation and teams reach for embedding similarity or judge rubrics instead.
The pain is felt across roles. An ML engineer celebrates a 2-point accuracy gain on a static benchmark and then watches CSAT drop because the model now refuses ambiguous queries it used to handle. A compliance reviewer is told a moderation classifier is “98% accurate” and discovers it never sees the 1% of inputs it would mislabel because the test set was clean. A product lead asks “is the new model better?” and gets back a single number that hides three regressions in different cohorts.
In 2026 agent stacks, accuracy is even less informative on its own. A trajectory-level “did the agent finish the task” question doesn’t reduce to one comparison; you need step-level evaluators across tool use, function calls, and final answer. On 2026 trajectory benchmarks like τ-bench and SWE-Bench Verified, two models can post the same final-answer accuracy while spending wildly different token budgets and tool counts to get there. Accuracy is still useful. but as one row in the dashboard, not the whole dashboard.
How FutureAGI Handles Accuracy
FutureAGI’s approach is to keep accuracy where it works (classification, function-call matching, exact-output tasks) and route open-ended cases to richer evaluators. The Equals and GroundTruthMatch evaluators in fi.evals handle exact-match scoring against a label. For function calls the FunctionCallExactMatch evaluator returns AST-level accuracy on the call, and FunctionNameMatch reports the lighter “did you pick the right tool” signal. For free-text generation, FutureAGI replaces accuracy with EmbeddingSimilarity, Faithfulness, AnswerRelevancy, or judge-model rubrics scored through CustomEvaluation.
Concretely: a SQL-generation team running on traceAI-langchain tags every production call. They run Equals against the canonical query for the closed-form benchmark cohort and EmbeddingSimilarity against an executable-equivalence reference for the long-tail cohort. Aggregate accuracy on the closed cohort drops from 87% to 82% after a Claude Sonnet 4.6 → Sonnet 4.7 swap; FutureAGI’s regression eval triggers, the trace view points to a system-prompt change merged the same day, and the team rolls it back. Without the cohort split, the long-tail eval (still at 79% similarity) would have masked the regression.
Unlike Ragas, which scores faithfulness only against retrieved context, accuracy in FutureAGI is paired with confusion matrix style cohort breakdowns so engineers see which class regressed, not just that the global mean did. In our 2026 evals, the cohort accuracy table catches more regressions than any single aggregate.
When raw accuracy still wins
There is a short list of tasks where exact-match accuracy is still the right primary signal in 2026:
| Task type | Why accuracy works | FutureAGI evaluator |
|---|---|---|
| Intent classification | Closed label set, balanced cohorts | Equals, GroundTruthMatch |
| Tool / function selection | Discrete tool names with one correct choice | ToolSelectionAccuracy |
| JSON schema compliance | Pass/fail on structural validity | SchemaCompliance |
| SQL canonical query | Executable equivalence is checkable | Equals + execution check |
| Multiple-choice eval (GPQA Diamond, MMLU-Pro) | Single-letter answer | Equals |
| Numeric extraction | Canonical numeric output | GroundTruthMatch |
Outside this list, default to similarity, judge rubrics, or trajectory metrics like TaskCompletion.
How to Measure or Detect It
Pick the matcher that fits the task. exact-match where canonical, similarity-based where not:
Equals: returns boolean exact match against expected output; cheapest, brittlest.GroundTruthMatch: handles minor normalisation (whitespace, casing) before exact comparison.FuzzyMatch: token-level fuzzy comparison; useful when the label has minor variations.accuracy_by_cohort(dashboard signal): accuracy sliced by user segment, route, or model variant. the regression alarm that single-number accuracy never fires.confusion_matrix(dashboard signal): per-class accuracy that exposes class-imbalance traps.
from fi.evals import GroundTruthMatch, Equals
exact = Equals()
gt = GroundTruthMatch()
result = gt.evaluate(
output="paris",
expected_output="Paris",
)
print(result.score, result.reason)
Common Mistakes
- Reporting global accuracy on imbalanced data. A 95% number on a 95/5 split is no signal. Always slice by class or cohort.
- Using exact-match accuracy on open-ended generation. Two correct answers will rarely match word-for-word; use
EmbeddingSimilarityor judge-model rubrics. - Skipping a confusion matrix. Accuracy hides which class is regressing; a confusion matrix surfaces it in seconds.
- Treating benchmark accuracy as production fitness. Static benchmarks miss distribution shift, jargon, and tool outputs the test never saw. and on saturated 2026 suites the score has nothing left to say.
- Letting one accuracy number gate releases. Pair it with
FaithfulnessorTaskCompletionso an open-ended regression cannot pass under a closed-form pass. - Quoting MMLU or HumanEval accuracy as a 2026 release signal. Both are saturated above 90% by every frontier model. use GPQA Diamond, HLE, and your own golden dataset instead.
Frequently Asked Questions
What is accuracy in machine learning?
Accuracy is the share of predictions that match the ground-truth label. It is the simplest evaluation metric, computed as correct predictions divided by total predictions, and is most useful when classes are balanced and outputs are canonical.
Is accuracy a good metric for LLM evaluation?
Only for closed-form tasks like classification, multiple choice, or JSON validation. For open-ended generation, exact-match accuracy is too brittle. use embedding similarity, judge-model rubrics, or task-completion scores instead.
How do you compute accuracy in FutureAGI?
Use Equals or GroundTruthMatch from fi.evals for exact match against a label. For LLM tasks where canonical answers exist (e.g. SQL generation, JSON output), pair it with TypeCompliance or SchemaCompliance for richer signal.