Evaluation

What Is Accuracy (ML / LLM Evaluation)?

Accuracy is the proportion of model outputs that exactly match the ground-truth label, computed as correct / total. It is the oldest evaluation metric in ML and the easiest to explain, but it makes two strong assumptions: classes are roughly balanced and the right answer is canonical. Both assumptions hold for closed-form tasks — multiple choice, sentiment classification, JSON validation, function-name matching — and break for open-ended generation, where two correct answers can differ word-for-word. In practice, LLM teams use accuracy as one signal in a larger evaluation suite, not as the headline number.
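
As a quick illustration in plain Python (toy data, no FutureAGI API involved), the whole computation is one division:

labels      = ["billing", "returns", "billing", "shipping"]
predictions = ["billing", "returns", "shipping", "shipping"]

# correct predictions divided by total predictions
correct = sum(pred == label for pred, label in zip(predictions, labels))
accuracy = correct / len(labels)
print(accuracy)  # 0.75: three of the four predictions match exactly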

Why It Matters in Production LLM and Agent Systems

Reporting a single accuracy number on an LLM benchmark is the fastest way to ship a confident, broken system. Two failure modes drive this. The first is class imbalance: if 95% of your customer-support intents are “billing question”, a model that always predicts “billing question” hits 95% accuracy and helps no one with the other 5%. The second is the canonical-answer trap: an LLM that generates “Yes, you can return the item within 30 days” gets zero accuracy against a label of “Returns are accepted up to 30 days post-purchase” — same meaning, different surface form.
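
Both traps are easy to reproduce in a few lines of plain Python (illustrative data only):

# Class-imbalance trap: a model that always predicts the majority intent.
labels = ["billing question"] * 95 + ["cancel account"] * 5
predictions = ["billing question"] * 100
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.95, yet every "cancel account" request is mishandled

# Canonical-answer trap: same meaning, different surface form, zero credit.
label = "Returns are accepted up to 30 days post-purchase"
output = "Yes, you can return the item within 30 days"
print(output == label)  # False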

The pain is felt across roles. An ML engineer celebrates a 2-point accuracy gain on a static benchmark and then watches CSAT drop because the model now refuses ambiguous queries it used to handle. A compliance reviewer is told a moderation classifier is “98% accurate” and discovers it never sees the 1% of inputs it would mislabel because the test set was clean. A product lead asks “is the new model better?” and gets back a single number that hides three regressions in different cohorts.

In 2026 agent stacks, accuracy is even less informative on its own. A trajectory-level “did the agent finish the task” question doesn’t reduce to one comparison; you need step-level evaluators across tool selection, function calls, and final answer. Accuracy is still useful — but as one row in the dashboard, not the whole dashboard.
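
As a rough sketch of that "one row in the dashboard" idea, the snippet below places hypothetical step-level scores next to a trajectory-level pass/fail; the field names are illustrative, not a FutureAGI schema:

# Hypothetical per-step scores for one agent trajectory.
steps = [
    {"step": "tool_selection", "score": 1.0},  # picked the right tool
    {"step": "function_call",  "score": 0.0},  # wrong argument in the call
    {"step": "final_answer",   "score": 1.0},  # answer still came out right
]

task_completed = True  # trajectory-level "did the agent finish" signal
step_accuracy = sum(s["score"] for s in steps) / len(steps)

# The trajectory signal alone says "pass"; the step breakdown shows where it nearly failed.
print(task_completed, round(step_accuracy, 2))  # True 0.67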

How FutureAGI Handles Accuracy

FutureAGI’s approach is to keep accuracy where it works (classification, function-call matching, exact-output tasks) and route open-ended cases to richer evaluators. The Equals and GroundTruthMatch evaluators in fi.evals handle exact-match scoring against a label. For function calls the FunctionCallExactMatch evaluator returns AST-level accuracy on the call, and FunctionNameMatch reports the lighter “did you pick the right tool” signal. For free-text generation, FutureAGI replaces accuracy with EmbeddingSimilarity, Faithfulness, AnswerRelevancy, or judge-model rubrics scored through CustomEvaluation.

Concretely: a SQL-generation team running on traceAI-langchain tags every production call. They run Equals against the canonical query for the closed-form benchmark cohort and EmbeddingSimilarity against an executable-equivalence reference for the long-tail cohort. Aggregate accuracy on the closed cohort drops from 87% to 82% after a model swap; FutureAGI’s regression eval triggers, the trace view points to a system-prompt change merged the same day, and the team rolls it back. Without the cohort split, the long-tail eval (still at 79% similarity) would have masked the regression.
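
A minimal version of that cohort split, written against hypothetical trace records rather than the FutureAGI SDK, looks like this:

from collections import defaultdict

# One record per evaluated trace: cohort tag plus a boolean eval outcome.
records = [
    {"cohort": "closed_form", "passed": True},
    {"cohort": "closed_form", "passed": False},
    {"cohort": "long_tail",   "passed": True},
]

by_cohort = defaultdict(list)
for record in records:
    by_cohort[record["cohort"]].append(record["passed"])

# Per-cohort accuracy: a regression in one cohort stays visible even when the
# global number barely moves.
for cohort, outcomes in by_cohort.items():
    print(cohort, sum(outcomes) / len(outcomes))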

Unlike Ragas faithfulness, which only checks whether each claim is grounded, accuracy in FutureAGI is paired with confusion-matrix-style cohort breakdowns, so engineers see which class regressed, not just that the global mean did.

How to Measure or Detect It

Pick the matcher that fits the task — exact-match where canonical, similarity-based where not:

  • Equals: returns boolean exact match against expected output; cheapest, brittlest.
  • GroundTruthMatch: handles minor normalisation (whitespace, casing) before exact comparison.
  • FuzzyMatch: token-level fuzzy comparison; useful when the label has minor variations.
  • accuracy_by_cohort (dashboard signal): accuracy sliced by user segment, route, or model variant — the regression alarm that single-number accuracy never fires.
  • confusion_matrix (dashboard signal): per-class accuracy that exposes class-imbalance traps.

from fi.evals import GroundTruthMatch, Equals

# Strict exact match: "paris" vs "Paris" fails on casing alone (assuming
# Equals shares the evaluate() signature shown for GroundTruthMatch below).
exact = Equals()
exact_result = exact.evaluate(output="paris", expected_output="Paris")

# GroundTruthMatch normalises whitespace and casing before comparing.
gt = GroundTruthMatch()
result = gt.evaluate(
    output="paris",
    expected_output="Paris",
)

print(exact_result.score, exact_result.reason)
print(result.score, result.reason)

Common Mistakes

  • Reporting global accuracy on imbalanced data. A 95% number on a 95/5 split is no signal. Always slice by class or cohort.
  • Using exact-match accuracy on open-ended generation. Two correct answers will rarely match word-for-word; use EmbeddingSimilarity or judge-model rubrics.
  • Skipping a confusion matrix. Accuracy hides which class is regressing; a confusion matrix surfaces it in seconds (a minimal sketch follows this list).
  • Treating benchmark accuracy as production fitness. Static benchmarks miss distribution shift, jargon, and tool outputs the test never saw.
  • Letting one accuracy number gate releases. Pair it with Faithfulness or TaskCompletion so an open-ended regression cannot slip through behind a passing closed-form score.
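
A minimal per-class breakdown in plain Python (made-up labels) shows what a single global number hides:

from collections import Counter

labels      = ["billing", "billing", "billing", "refund", "refund"]
predictions = ["billing", "billing", "refund",  "billing", "refund"]

# Tiny confusion matrix: counts of (true label, predicted label) pairs.
confusion = Counter(zip(labels, predictions))
for (true_label, predicted), count in sorted(confusion.items()):
    print(true_label, "->", predicted, count)

# Per-class accuracy exposes the weak class behind the 0.6 global accuracy.
for cls in set(labels):
    rows = [i for i, y in enumerate(labels) if y == cls]
    print(cls, sum(predictions[i] == labels[i] for i in rows) / len(rows))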

Frequently Asked Questions

What is accuracy in machine learning?

Accuracy is the share of predictions that match the ground-truth label. It is the simplest evaluation metric, computed as correct predictions divided by total predictions, and is most useful when classes are balanced and outputs are canonical.

Is accuracy a good metric for LLM evaluation?

Only for closed-form tasks like classification, multiple choice, or JSON validation. For open-ended generation, exact-match accuracy is too brittle — use embedding similarity, judge-model rubrics, or task-completion scores instead.

How do you compute accuracy in FutureAGI?

Use Equals or GroundTruthMatch from fi.evals for exact match against a label. For LLM tasks where canonical answers exist (e.g. SQL generation, JSON output), pair it with TypeCompliance or SchemaCompliance for a richer signal.