What Is a Confusion Matrix in Machine Learning?
A tabular summary of classifier predictions versus ground-truth labels, with one row per actual class and one column per predicted class.
A confusion matrix is a tabular summary of classifier predictions versus ground-truth labels. For an N-class problem, it is an N×N table where each row is an actual class and each column is a predicted class; each cell counts how often that pair occurred. From the matrix you derive accuracy, per-class precision and recall, F1, and specific error patterns. For LLM classification tasks — intent classification, content moderation, PII detection, sentiment — confusion matrices remain the canonical diagnostic. FutureAGI surfaces confusion-matrix-style breakdowns over evaluator outputs so engineers see which classes a classifier confuses, not just global accuracy.
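As a minimal, framework-agnostic sketch (the class names and pairs below are illustrative, not from a real dataset), the matrix and the per-class metrics derived from it can be computed directly from (actual, predicted) label pairs:

from collections import Counter

# Illustrative (actual, predicted) pairs; in practice these come from a
# labeled dataset and the classifier's outputs.
pairs = [
    ("billing", "billing"), ("billing", "support"), ("support", "support"),
    ("support", "support"), ("refund", "billing"), ("refund", "refund"),
]

classes = sorted({a for a, _ in pairs} | {p for _, p in pairs})
cell = Counter(pairs)  # cell[(actual, predicted)] = count for that matrix cell

accuracy = sum(cell[(c, c)] for c in classes) / len(pairs)
for c in classes:
    actual_total = sum(cell[(c, p)] for p in classes)     # row sum
    predicted_total = sum(cell[(a, c)] for a in classes)  # column sum
    recall = cell[(c, c)] / actual_total if actual_total else 0.0
    precision = cell[(c, c)] / predicted_total if predicted_total else 0.0
    print(f"{c}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")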
Why Confusion Matrices Matter in Production LLM Systems
A single accuracy number is rarely enough to ship. The confusion matrix shows where the errors are — which translates directly to product impact. A 92% accurate intent classifier sounds good until the matrix reveals that “billing” is misclassified as “support” 18% of the time, sending angry customers to the wrong queue.
Pain shows up across roles. A product manager looking at flat user satisfaction sees that one specific intent has a 40% error rate hidden inside a 92% global accuracy. An applied engineer iterating on a content-moderation classifier sees that mild-toxicity messages get classified as severe — leading to over-moderation that hurts engagement. A compliance lead reviewing a PII detector cannot defend its deployment without per-class recall and false-negative rates by data type. A finance lead notices the cost of human review rising because the classifier’s false positive rate quietly drifted on one category.
In 2026 LLM stacks, confusion matrices apply not only to traditional classifiers but to any LLM step that produces a discrete label: intent extraction inside an agent, a route classifier feeding a routing policy, schema-class detection on structured outputs. The right diagnostic posture is to score every classification step with GroundTruthMatch against a labeled set, group results by actual vs. predicted, and surface the matrix in the dashboard. Multi-step agent systems benefit even more, because a matrix at each classification step localizes failure to the right component.
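A sketch of that per-step grouping, assuming each scored classification step is available as a record with a step name plus actual and predicted labels (the field names here are illustrative, not a FutureAGI schema):

from collections import Counter, defaultdict

# Hypothetical scored records, one per classification step per trace.
records = [
    {"step": "intent_extraction", "actual": "billing", "predicted": "support"},
    {"step": "intent_extraction", "actual": "billing", "predicted": "billing"},
    {"step": "route_classifier",  "actual": "faq",     "predicted": "faq"},
]

# One confusion matrix per classification step, so a failure localizes to
# the component that made the wrong dispatch decision.
matrices = defaultdict(Counter)
for r in records:
    matrices[r["step"]][(r["actual"], r["predicted"])] += 1

for step, cells in matrices.items():
    print(step)
    for (actual, predicted), count in sorted(cells.items()):
        flag = "" if actual == predicted else "  <- off-diagonal"
        print(f"  {actual} -> {predicted}: {count}{flag}")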
How FutureAGI Handles Confusion Matrices
FutureAGI builds confusion-matrix views from evaluator output rather than from raw model predictions. The pattern is Dataset.add_evaluation() with GroundTruthMatch (or a domain-specific evaluator like Toxicity or BiasDetection configured with class labels) on a labeled Dataset. The evaluator returns per-row score, predicted class, and reason; the FutureAGI dashboard groups results by actual × predicted to render the heatmap with cell-level drill-down to failing rows.
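A sketch of the grouping step, assuming the per-row evaluator results have been exported with their ground-truth and predicted labels (column names here are illustrative, not the exact export schema):

import pandas as pd

# Assumed shape of exported evaluator results: one row per dataset item.
results = pd.DataFrame({
    "actual":    ["safe", "mild", "severe", "severe", "safe"],
    "predicted": ["safe", "mild", "mild",   "severe", "mild"],
})

# actual x predicted counts: the same grouping the dashboard heatmap renders.
matrix = pd.crosstab(results["actual"], results["predicted"])
print(matrix)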
A real workflow: a content-moderation team running a classifier through traceAI-openai builds a 4,000-row labeled dataset with classes [safe, mild, severe, illegal]. They run Toxicity configured with these labels via Dataset.add_evaluation() and inspect the resulting confusion matrix. Off-diagonal cells show that 12% of “severe” content is being labeled “mild” and 8% of “safe” content is being labeled “mild.” The team uses PromptWizard to refine the system prompt with sharpened class definitions and reruns the regression eval. The new matrix shifts errors back onto the diagonal and reduces severe-misses to 3%.
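Off-diagonal percentages like these are read straight off the row-normalized matrix. A sketch with illustrative counts chosen to mirror the rates above (not real moderation data):

# Illustrative counts for a 4,000-row labeled set.
counts = {
    "safe":    {"safe": 1840, "mild": 160, "severe": 0,   "illegal": 0},
    "mild":    {"safe": 90,   "mild": 870, "severe": 40,  "illegal": 0},
    "severe":  {"safe": 5,    "mild": 84,  "severe": 610, "illegal": 1},
    "illegal": {"safe": 0,    "mild": 2,   "severe": 18,  "illegal": 280},
}

THRESHOLD = 0.05  # flag any confusion above 5% of an actual class

for actual, row in counts.items():
    total = sum(row.values())
    for predicted, n in row.items():
        rate = n / total
        if predicted != actual and rate >= THRESHOLD:
            print(f"{rate:.0%} of '{actual}' predicted as '{predicted}' -- investigate")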
For agent systems, confusion-matrix views over agent.trajectory.step classification (which step the agent took) or tool selection (ToolSelectionAccuracy) localize where the agent is making the wrong dispatch decision. FutureAGI’s approach is to surface error structure — not just error counts. Unlike a single accuracy bar in a Hugging Face leaderboard, the confusion matrix tells the engineer where to invest: which class pair to disambiguate, which prompt clause to sharpen, which retrieved context to filter.
How to Measure or Detect It
Useful signals to derive from a confusion matrix:
- Per-class precision and recall: column-wise and row-wise summaries; flag any class below release threshold.
- Off-diagonal hot cells: the specific predicted-vs-actual confusions; each one is an investigation target.
- GroundTruthMatch: returns a per-row label match; the canonical input to a confusion matrix.
- Toxicity and BiasDetection: classifier-style evaluators that output class labels suitable for matrix rendering.
- Class imbalance check: if one class dominates the dataset, accuracy can be misleading; per-class recall from the confusion matrix is the corrective.
- Drift signal: rerun the matrix per release and compare cell-level rates; a cell rate shifting by more than 5 percentage points is a regression to investigate (see the sketch after the minimal Python example below).
Minimal Python:
from fi.evals import GroundTruthMatch
# predicted_label comes from the classifier; actual_label from the labeled dataset row
result = GroundTruthMatch().evaluate(
    output=predicted_label,
    expected_response=actual_label,
)
# Aggregate per (actual, predicted) cell across the dataset
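For the per-release drift check, a minimal sketch that assumes each release's matrix is held as a nested {actual: {predicted: count}} dict and flags any cell whose row-normalized rate shifts by more than five percentage points:

def row_rates(counts):
    """Convert {actual: {predicted: count}} into row-normalized rates."""
    return {
        actual: {p: n / max(sum(row.values()), 1) for p, n in row.items()}
        for actual, row in counts.items()
    }

def drifted_cells(prev_counts, curr_counts, threshold=0.05):
    """Return (actual, predicted, delta) for cells whose rate shifted past the threshold."""
    prev, curr = row_rates(prev_counts), row_rates(curr_counts)
    flagged = []
    for actual, row in curr.items():
        for predicted, rate in row.items():
            delta = rate - prev.get(actual, {}).get(predicted, 0.0)
            if abs(delta) > threshold:
                flagged.append((actual, predicted, delta))
    return flagged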
Common Mistakes
- Reporting only accuracy. Aggregate accuracy hides cell-level errors; always pair with the confusion matrix.
- Ignoring class imbalance. A model predicting the majority class on every row scores high accuracy and zero recall on minority classes; recall per class catches it.
- Not labeling per use case. A generic toxicity matrix does not reflect your platform’s class definitions; build a domain-specific labeled set.
- Skipping per-release diffing. A small drop in one cell can indicate a major regression in a high-stakes class; compare matrices across releases.
- Mixing class taxonomies between releases. If labels change, the matrix is no longer comparable; version both labels and predictions.
Frequently Asked Questions
What is a confusion matrix?
A confusion matrix is a tabular summary of classifier predictions versus ground-truth labels. Each row is the actual class, each column is the predicted class, and each cell counts how often that pairing occurred.
How is a confusion matrix used in LLM evaluation?
For LLM classification tasks — intent classification, content moderation, PII detection — the confusion matrix shows which classes the model confuses. It surfaces per-class precision, recall, and the off-diagonal errors that aggregate accuracy hides.
How does FutureAGI compute confusion matrices?
FutureAGI runs evaluators like GroundTruthMatch and Toxicity on a labeled Dataset, then groups results by actual and predicted class to render the matrix. The dashboard view is a per-class heatmap with cell-level drill-down to failing rows.