Evaluation

What Is a Confusion Matrix?

A confusion matrix is an evaluation table that counts prediction outcomes against ground-truth labels: true positives, false positives, false negatives, and true negatives. It is a structure for classification evaluation, not a single score. In LLM and agent eval pipelines, it is built from per-row judgments produced by evaluators, human annotations, or production traces. FutureAGI teams use it to diagnose which intents, safety labels, retrieval relevance classes, or tool choices are being confused before they trust aggregate accuracy, precision, recall, or F1.
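
A minimal sketch, with invented labels and "unsafe" as the positive class, shows how the four cells fall out of paired expected and predicted values:

# Hypothetical binary safety labels; counts are invented for illustration.
expected  = ["unsafe", "unsafe", "safe", "safe", "safe", "unsafe"]
predicted = ["unsafe", "safe",   "safe", "unsafe", "safe", "unsafe"]

tp = sum(e == "unsafe" and p == "unsafe" for e, p in zip(expected, predicted))
fn = sum(e == "unsafe" and p == "safe"   for e, p in zip(expected, predicted))
fp = sum(e == "safe"   and p == "unsafe" for e, p in zip(expected, predicted))
tn = sum(e == "safe"   and p == "safe"   for e, p in zip(expected, predicted))
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=2 FP=1 FN=1 TN=2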

Why It Matters in Production LLM and Agent Systems

Accuracy hides the shape of mistakes. A support-intent classifier can report 96% accuracy while routing nearly every “cancel subscription” request into “general question.” A safety classifier can look stable while false negatives rise in one language cohort. A RAG relevance classifier can approve irrelevant chunks because the negative class dominates the dataset. The confusion matrix exposes those class-level swaps directly.
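
A toy calculation with invented counts shows how a 96% headline can coexist with a class that is never routed correctly:

from sklearn.metrics import confusion_matrix

# Hypothetical skew: 960 "general question" rows, 40 "cancel subscription" rows.
y_true = ["general"] * 960 + ["cancel"] * 40
y_pred = ["general"] * 1000  # the classifier routes everything to "general"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.96 -- looks healthy
print(confusion_matrix(y_true, y_pred, labels=["cancel", "general"]))
# [[  0  40]   <- every "cancel" request lands in the wrong cell
#  [  0 960]]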

The pain lands on different teams at once. Developers see regression tests passing because the aggregate metric moved only 0.3 points. SREs see escalation volume rise with no latency or error-rate signal. Product teams see users repeat themselves because the first wrong route sends the agent down the wrong tool path. Compliance teams care about the off-diagonal cells where “contains PII” was predicted as “safe.”

The production symptoms are specific: drift in one row of the matrix, false-negative spikes by customer tier, model-version changes that shift one intent into a neighboring intent, or trace samples where expected_tool=refund_lookup but the agent called order_status. This matters more in 2026 multi-step systems because one early misclassification can select the wrong retriever, tool, guardrail, or fallback path. The matrix gives engineers a compact error map before those downstream failures become tickets.

How FutureAGI Handles Confusion Matrices

FutureAGI’s approach is to treat a confusion matrix as a derived cohort artifact over row-level evaluations, not as a standalone ConfusionMatrix evaluator. The practical surface is Dataset.add_evaluation() plus trace-backed eval results. A team first runs GroundTruthMatch or Equals on labelled dataset rows, or ToolSelectionAccuracy on agent traces captured through traceAI-langchain. Each row keeps the expected label, predicted label, evaluator score, cohort fields, and, for traced runs, the span or step context such as agent.trajectory.step and gen_ai.evaluation.score.value.

Concretely: an agent team has 14 possible tools. Nightly regression runs score 8,000 conversations with ToolSelectionAccuracy, then group predicted tool versus expected tool into a 14-by-14 matrix. The diagonal shows correct tool choices. Off-diagonal cells show systematic errors: refund_lookup being confused with order_status, or human_escalation being confused with knowledge_base_search. The engineer opens the traces for that cell, sees that the planner prompt describes refunds after order status, and ships a prompt fix plus a regression threshold: false negatives for human_escalation must stay below 2%.
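
A sketch of that regression gate, using an invented 3-tool slice in place of the full 14-by-14 matrix (counts and the 2% budget are illustrative, taken from the example above):

import numpy as np

# Rows = expected tool, columns = selected tool; counts are invented.
tools = ["refund_lookup", "order_status", "human_escalation"]
cm = np.array([
    [180,  15,   5],  # expected refund_lookup
    [  8, 190,   2],  # expected order_status
    [  0,   1,  99],  # expected human_escalation
])

# Per-class false-negative rate: everything off the diagonal in that row.
fn_rate = 1 - np.diag(cm) / cm.sum(axis=1)
for tool, rate in zip(tools, fn_rate):
    print(f"{tool}: FN rate {rate:.1%}")

# Regression threshold from the paragraph above: escalation FN below 2%.
assert fn_rate[tools.index("human_escalation")] < 0.02, "escalation FN budget exceeded"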

Compared with sklearn.metrics.confusion_matrix, which is useful offline but disconnected from traces, the FutureAGI workflow keeps the matrix tied to datasets, spans, model versions, prompts, and cohorts. We’ve found that the most useful matrix is rarely the global one; it is the matrix filtered to one prompt version, one model route, or one customer segment.
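
A sketch of that cohort-first habit, assuming the per-row results can be exported with their cohort fields attached (pandas here is illustrative, not part of the FutureAGI SDK):

import pandas as pd

# Hypothetical export of per-row eval results with a cohort column.
rows = pd.DataFrame({
    "prompt_version": ["v3", "v3", "v4", "v4", "v4"],
    "expected":       ["refund", "status", "refund", "refund", "status"],
    "predicted":      ["refund", "status", "status", "refund", "status"],
})

# One matrix per prompt version instead of a single global matrix.
for version, group in rows.groupby("prompt_version"):
    print(f"--- prompt {version} ---")
    print(pd.crosstab(group["expected"], group["predicted"]))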

How to Measure or Detect It

Measurement starts with per-row expected and predicted labels, then aggregates counts by class:

  • GroundTruthMatch — returns a row-level match signal for predicted labels against expected labels; aggregate those rows into TP, FP, FN, and TN.
  • Equals — returns strict 1.0 or 0.0 equality for canonical labels, IDs, and closed-form outputs.
  • ToolSelectionAccuracy — scores whether an agent chose the expected tool; group expected versus selected tool names into a tool-confusion matrix.
  • Dashboard signal — monitor false-negative-rate-by-class, false-positive-rate-by-class, and eval-fail-rate-by-cohort, not only accuracy.
  • User-feedback proxy — compare high off-diagonal cells with thumbs-down rate, escalation rate, or recontact rate.

Minimal Python:

from fi.evals import GroundTruthMatch
from sklearn.metrics import confusion_matrix

metric = GroundTruthMatch()

# Row-level match signals; each dataset row carries an expected and a
# predicted label. Keep the results if you want per-row drill-down later.
results = [
    metric.evaluate(response=row.predicted_label,
                    expected_response=row.expected_label)
    for row in dataset
]

# Aggregate the same rows into a matrix. Labels must cover both sides,
# or predicted-only classes silently drop out of the matrix.
y_true = [row.expected_label for row in dataset]
y_pred = [row.predicted_label for row in dataset]
labels = sorted(set(y_true) | set(y_pred))
print(confusion_matrix(y_true, y_pred, labels=labels))

Common Mistakes

Common traps when reading a confusion matrix:

  • Reading only diagonal totals. A high diagonal can still hide one rare but high-risk class collapsing into a safe-looking majority class.
  • Treating false positives and false negatives as symmetric. In PII, FN leaks data; in moderation, FP blocks users. Costs differ by label.
  • Mixing cohorts before building the matrix. Separate by prompt version, model, language, customer tier, and tool route before drawing conclusions.
  • Using soft judge scores as labels without threshold audit. A 0.79 versus 0.81 boundary can move many examples between classes.
  • Ignoring multi-class normalization. Raw counts favor common classes; row-normalized and column-normalized matrices reveal recall and precision failures, as the sketch after this list shows.
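
A short sketch of the normalization point, using sklearn's normalize parameter on invented counts:

from sklearn.metrics import confusion_matrix

y_true = ["a"] * 90 + ["b"] * 10
y_pred = ["a"] * 85 + ["b"] * 5 + ["a"] * 8 + ["b"] * 2

# Raw counts: the majority class dominates the picture.
print(confusion_matrix(y_true, y_pred, labels=["a", "b"]))
# Row-normalized: per-class recall ("b" is recovered only 20% of the time).
print(confusion_matrix(y_true, y_pred, labels=["a", "b"], normalize="true"))
# Column-normalized: per-class precision of each predicted label.
print(confusion_matrix(y_true, y_pred, labels=["a", "b"], normalize="pred"))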

Frequently Asked Questions

What is a confusion matrix?

A confusion matrix is an evaluation table that compares predicted labels with ground-truth labels and counts true positives, false positives, false negatives, and true negatives. It shows which classes an LLM, classifier, or agent policy confuses.

How is a confusion matrix different from accuracy?

Accuracy is one aggregate number: correct predictions divided by all predictions. A confusion matrix shows the underlying error pattern by class, so a high-accuracy model can still reveal severe false-negative or false-positive failures.

How do you measure a confusion matrix?

Run a row-level evaluator such as FutureAGI's GroundTruthMatch, Equals, or ToolSelectionAccuracy against expected labels, then aggregate predicted-versus-expected pairs into matrix cells. From those cells, compute precision, recall, F1, and accuracy by cohort.
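
A minimal sketch of that last step, deriving the aggregate metrics from the four binary cells (counts are invented):

# Hypothetical binary cell counts pulled from one cohort's matrix.
tp, fp, fn, tn = 42, 8, 6, 144

precision = tp / (tp + fp)                          # 0.840
recall    = tp / (tp + fn)                          # 0.875
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / (tp + fp + fn + tn)
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")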