
Evaluating LLM Classifiers in 2026: The Eval That Ships

Evaluating LLM classifiers in 2026: per-class precision-recall, macro vs weighted F1, production-distribution calibration, confusion-matrix debugging.

The LLM classifier passes every eval the team runs. Macro accuracy 0.94 on the balanced test set. F1 reads 0.91. The model ships. Two weeks in, the on-call engineer notices that refund tickets are getting routed to billing, escalation requests are closing as resolved, and the minority “fraud_suspected” class is sitting at recall 0.31 in production while the eval set still says 0.88. Nothing changed. The model is doing exactly what the eval said it would. The eval was reading the wrong distribution.

LLM-as-classifier eval is two problems that look like one. The model classifies well on balanced test sets and falls apart on imbalanced production traffic because the eval set hid the imbalance. The eval that actually matters reports per-class precision and recall on the production distribution, macro F1 alongside weighted F1, a calibration curve per class, and a confusion matrix you read before you touch the prompt. F1 on a balanced set ships theater.

TL;DR

| Metric | What it answers | Where it lies |
|---|---|---|
| Aggregate accuracy | “Did we get most calls right?” | Hides minority-class collapse on imbalanced traffic |
| Per-class precision | “When we predict class C, are we right?” | Doesn’t say if we ever predict C at all |
| Per-class recall | “When class C exists, do we catch it?” | Says nothing about confusion with other classes |
| Macro F1 | “Does the model behave on every class?” | Penalizes models tuned for the common case |
| Weighted F1 | “Does the model behave on most traffic?” | Hides minority-class disasters under common-class success |
| Calibration curve | “Are confidence scores meaningful?” | Off-the-shelf LLM scores cluster at extremes |
| Confusion matrix | “Which classes does the model swap?” | Has to be read per-class; the aggregate hides everything |

Score on a production-distribution set, not a balanced one. Report macro and weighted F1 side by side. Calibrate per class before you set a threshold. Read the confusion matrix before you change the prompt. Run a deterministic + custom-judge cascade and let the cheap layer handle the easy cases.

Why balanced-set accuracy misleads

The classic eval setup looks fair: 100 examples per class, stratified, hand-labeled, accuracy reported as one number. Production traffic doesn’t look like that. For intent routing, sentiment classification, content tagging, and category-routing workloads, the real distribution is heavy-tailed. A handful of classes carry 70 to 90 percent of volume; a long tail of minority classes is where escalations and revenue events live.

The balanced eval set teaches the model to behave well on every class equally. Production then drops 80 percent of its weight on three buckets, the overall accuracy looks fine because those three are easy, and the minority classes that drove the project quietly collapse. Nobody notices until a downstream team starts seeing tickets in the wrong queue.

The fix is two eval sets, scored separately. The production-distribution set mirrors live traffic frequency and is what you report to the team. The per-class oversampled set holds 100 to 200 examples per class regardless of frequency and is what you debug against. Macro F1 on the oversampled set tells you whether the model can do every class. Weighted F1 on the production set tells you what users actually experience. The gap between them is the calibration drift waiting to happen.
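The two sets described above can be drawn from the same labeled pool. The following is a minimal sketch, assuming `labeled` is a list of `(text, label)` pairs sampled from live traffic; the function name and parameters are illustrative, not part of any library.

```python
import random
from collections import defaultdict

def build_eval_sets(labeled, prod_size, per_class, seed=0):
    """labeled: list of (text, label) pairs drawn from production traffic.
    Returns (production_set, oversampled_set)."""
    rng = random.Random(seed)

    # Production-distribution set: a uniform random sample, so class
    # frequencies mirror live traffic up to sampling noise.
    production_set = rng.sample(labeled, min(prod_size, len(labeled)))

    # Oversampled set: up to `per_class` examples of every class,
    # regardless of how rare the class is in traffic.
    by_class = defaultdict(list)
    for item in labeled:
        by_class[item[1]].append(item)
    oversampled_set = []
    for label in sorted(by_class):
        items = by_class[label]
        oversampled_set.extend(rng.sample(items, min(per_class, len(items))))
    return production_set, oversampled_set

# Hypothetical pool: 900 billing tickets, 100 fraud tickets
pool = [(f"billing {i}", "billing") for i in range(900)]
pool += [(f"fraud {i}", "fraud") for i in range(100)]
production_set, oversampled_set = build_eval_sets(pool, prod_size=200, per_class=50)
```

The two sets may share examples in this sketch. That is acceptable when both are used only for scoring, but neither should overlap with few-shot examples in the prompt.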

Per-class precision-recall: the imbalanced reality

Aggregate numbers hide the failure. Per-class numbers expose it. Compute four things per class on the production-distribution set: precision (when the model predicts class C, how often is it correct), recall (when class C exists, how often does the model predict it), support (how many examples of class C are in the eval set), and F1 (the harmonic mean of precision and recall).

The pattern that surfaces almost every time: the common classes (high support) carry strong precision and recall (0.90+ both); a couple of minority classes look fine because they get predicted rarely but accurately (high precision, low recall); the rest of the long tail collapses (low recall and low precision both, because the model defaults to a common class when it isn’t sure).

# Per-class precision, recall, F1 from a list of (predicted, actual) pairs
from collections import Counter

def per_class_metrics(pairs):
    # One pass: count true positives, predictions, and actual occurrences per class
    tp, predicted, actual = Counter(), Counter(), Counter()
    for p, y in pairs:
        predicted[p] += 1
        actual[y] += 1
        if p == y:
            tp[y] += 1
    report = {}
    for c in sorted(set(predicted) | set(actual)):
        # A class that is never predicted (or never present) scores 0, not an error
        precision = tp[c] / predicted[c] if predicted[c] else 0.0
        recall    = tp[c] / actual[c] if actual[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        report[c] = {
            "precision": round(precision, 3),
            "recall":    round(recall, 3),
            "f1":        round(f1, 3),
            "support":   actual[c],
        }
    return report

Three reading rules:

A low-precision class is over-predicted. Usually a prompt or rubric problem: the description is too broad, the few-shot examples are too lenient, or the model is using C as a fallback when it doesn’t know.

A low-recall class is under-predicted. Usually a coverage problem in the prompt: the description doesn’t surface the actual signal users send, or the model confuses C with a more familiar adjacent class.

Low precision and low recall together is the danger zone. Either the prompt is ambiguous, the label is poorly defined, or the model genuinely cannot distinguish the class from neighbors. The confusion matrix tells you which.

The team that ships intent routers on aggregate accuracy alone never sees this. The team that reads the per-class table sees the broken classes on day one.
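Support matters when reading the per-class table because a recall estimated from a few dozen examples is noisy. One way to make that visible is a confidence interval on each per-class rate. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not something the workflow above prescribes; `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% coverage)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# Same observed recall (about 0.42) at two support levels
low_small, high_small = wilson_interval(11, 26)    # rare class, 26 examples
low_big, high_big = wilson_interval(110, 260)      # ten times the support
```

At 26 examples the interval spans roughly 0.26 to 0.61, so a move of ten points between eval runs is indistinguishable from noise. At 260 examples the interval is much narrower.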

Macro vs weighted F1 (and when each lies)

F1 is per-class. You need a single number for dashboards, CI gates, and release decisions. There are two ways to summarize, and both are useful, and both lie when used alone.

Macro F1 averages the per-class F1s with equal weight. Minority-class failure drags the number down regardless of volume. Use macro F1 when you care about every class: safety-adjacent labels, escalation routing, anything where a minority class drives a tail outcome that matters.

Weighted F1 averages the per-class F1s weighted by support. Common classes dominate. Use weighted F1 when the production cost is proportional to traffic volume: generic content tagging where common classes are most of the work, sentiment buckets where the long tail is decorative.

The lie pattern is symmetric:

  • Macro F1 lies when classes are intentionally rare. If a minority class is a known-rare event (fraud, escalation, opt-out) and you score it on a balanced set, macro F1 punishes the model for being honest about how rare it sees those events in training.
  • Weighted F1 lies when minority-class collapse is invisible. A model with 0.97 F1 on common classes and 0.21 F1 on three minority classes can post weighted F1 of 0.92, which looks like a healthy model. It is not.

Report both. The shape of the (macro, weighted) pair is the signal: macro ≈ weighted means the model behaves consistently across classes; weighted far above macro means common classes are propping up a model that is broken on the tail. The first time you see weighted - macro > 0.10, treat it as a regression flag, not a “ship it” signal.
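The arithmetic behind the (macro, weighted) pair is short enough to sketch. The example below assumes a per-class report shaped like `{class: {"f1": ..., "support": ...}}`; the class names and supports are hypothetical numbers chosen to reproduce the 0.97 versus 0.21 case.

```python
def macro_and_weighted_f1(report):
    """report: {class: {"f1": float, "support": int}}. Returns (macro, weighted)."""
    total = sum(r["support"] for r in report.values())
    if not report or total == 0:
        raise ValueError("report must contain at least one class with support")
    # Macro: every class counts once, regardless of volume
    macro = sum(r["f1"] for r in report.values()) / len(report)
    # Weighted: every class counts in proportion to its support
    weighted = sum(r["f1"] * r["support"] for r in report.values()) / total
    return macro, weighted

# Hypothetical traffic: three common classes at F1 0.97, three rare classes at F1 0.21
report = {
    "common_a": {"f1": 0.97, "support": 4000},
    "common_b": {"f1": 0.97, "support": 3000},
    "common_c": {"f1": 0.97, "support": 2340},
    "rare_a":   {"f1": 0.21, "support": 300},
    "rare_b":   {"f1": 0.21, "support": 200},
    "rare_c":   {"f1": 0.21, "support": 160},
}
macro, weighted = macro_and_weighted_f1(report)
gap_flag = (weighted - macro) > 0.10  # the regression flag described above
```

With the rare classes at 6.6 percent of traffic, weighted F1 comes out near 0.92 while macro F1 is 0.59, so the gap check fires.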

See the F1 score primer for the underlying math.

Calibration on production distribution

A classifier’s confidence score is meaningful only when it’s calibrated: a score of 0.8 should mean 80 percent of predictions at that band are correct. LLM classifiers are usually miscalibrated by default in two shapes.

Bimodal extremes. The model returns 0.95 or above, or 0.05 or below, on almost every example, skipping the 0.3 to 0.8 band entirely. There’s no middle to threshold against. Every abstain-or-route decision happens at the same effective cutoff.

Class-dependent skew. The model is under-confident on common classes (a 0.7 on the common class is right 95 percent of the time) and over-confident on rare classes (a 0.7 on the rare class is right 55 percent of the time). One threshold across classes ships a different precision profile per class.

Calibration matters when you do anything threshold-driven: route to a fallback model on low-confidence, abstain and ask for human review, gate a downstream action on confidence above 0.7. Without calibration those thresholds are arbitrary.

Measure calibration with a reliability diagram on the production-distribution set:

# Reliability diagram bins predictions by confidence and checks empirical accuracy
def reliability(pairs_with_conf, num_bins=10):
    bins = [[] for _ in range(num_bins)]
    for pred, actual, conf in pairs_with_conf:
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append(1 if pred == actual else 0)
    return [
        {
            "band": f"{i/num_bins:.1f}-{(i+1)/num_bins:.1f}",
            "empirical_accuracy": sum(b) / len(b) if b else None,
            "n": len(b),
        }
        for i, b in enumerate(bins)
    ]

If empirical accuracy at the 0.7-0.8 band is 0.55, your “high-confidence” routing logic is firing on cases the model gets wrong nearly half the time. The fix is per-class threshold tuning (set the cutoff that matches your precision target on each class) or a calibration transform like Platt scaling or isotonic regression fit on a labeled sample of production traffic. Don’t trust the raw score until you’ve measured the curve.
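Per-class threshold tuning can be sketched as a sweep over confidence cutoffs. The code below assumes rows of `(predicted, actual, confidence)` from a labeled production sample; `pick_threshold` and `per_class_thresholds` are hypothetical helper names, and the rule used (the lowest cutoff that still meets the precision target) is one reasonable choice among several.

```python
def pick_threshold(scored, target_precision):
    """scored: list of (confidence, is_correct) for predictions of ONE class.
    Returns the lowest cutoff whose accepted predictions (confidence >= cutoff)
    meet the precision target, or None if no cutoff does."""
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    best = None
    correct = 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        correct += 1 if ok else 0
        # Evaluate only at the end of a run of tied confidences
        if i < len(ranked) and ranked[i][0] == conf:
            continue
        if correct / i >= target_precision:
            best = conf  # keep lowering the cutoff while the target holds
    return best

def per_class_thresholds(rows, targets):
    """rows: list of (predicted, actual, confidence).
    targets: {class: precision target}."""
    by_class = {}
    for pred, actual, conf in rows:
        by_class.setdefault(pred, []).append((conf, pred == actual))
    return {c: pick_threshold(by_class.get(c, []), t) for c, t in targets.items()}

# Hypothetical labeled sample
rows = [
    ("fraud", "fraud", 0.9),
    ("fraud", "fraud", 0.8),
    ("fraud", "refund", 0.7),
    ("fraud", "fraud", 0.6),
]
loose = per_class_thresholds(rows, {"fraud": 0.75})
strict = per_class_thresholds(rows, {"fraud": 0.90, "cancel": 0.90})
```

A `None` result means the class has no cutoff that meets the target on this sample, which is itself a signal to fix the prompt before relying on confidence for that class.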

Confusion matrix-driven debugging

The confusion matrix is the debugger. Rows are actual class, columns are predicted. The diagonal is correct calls; everything off-diagonal is a confusion pair that maps directly to a prompt or rubric fix.

                 PREDICTED
                 refund  billing  shipping  cancel  fraud
ACTUAL refund     412       38        2       6       0
       billing     19      287        4       2       0
       shipping     1        3       198      0       0
       cancel       4        7        1     112       0
       fraud        7        3        0       5      11

Read it as a checklist:

Biggest off-diagonal cells. Here, (refund, billing) = 38 and (billing, refund) = 19 are the dominant confusion pair. The fix is rubric clarification: a sharper definition of refund vs billing in the prompt, with two contrastive few-shot examples that pin the distinction.

Asymmetric confusions. (fraud, refund) = 7, (fraud, cancel) = 5, and (fraud, billing) = 3 are tiny in absolute terms, but fraud’s support is only 26. Those 15 misses leave 11 correct calls out of 26, a recall of 0.42. The model under-predicts fraud because the prompt doesn’t surface the actual signals (account-takeover language, suspicious-pattern phrases). The fix is example coverage, not a model upgrade.

Silent classes. If a column sums to zero (a class is never predicted), the model has decided the class doesn’t exist. That’s a rubric problem: either the description doesn’t match the inputs you see, or there’s no positive example in the few-shot set.

The matrix tells you what to fix and in what order. Three or four prompt edits aimed at the top off-diagonal cells typically lift macro F1 more than a frontier model swap. Read the matrix first; swap models only after the prompt is doing what it can.
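The checklist can be automated as a small post-processor over `(predicted, actual)` pairs. The sketch below ranks off-diagonal cells and lists silent classes; `confusion_report` is a hypothetical helper, and the pairs are rebuilt from the example matrix above.

```python
from collections import Counter

def confusion_report(pairs, top_k=3):
    """pairs: list of (predicted, actual).
    Returns (largest off-diagonal cells as ((actual, predicted), count),
             classes that occur but are never predicted)."""
    cells = Counter((actual, pred) for pred, actual in pairs)
    off_diagonal = [(cell, n) for cell, n in cells.items() if cell[0] != cell[1]]
    # Largest count first; ties broken alphabetically for a stable order
    off_diagonal.sort(key=lambda item: (-item[1], item[0]))
    actual_classes = {actual for _, actual in pairs}
    predicted_classes = {pred for pred, _ in pairs}
    silent = sorted(actual_classes - predicted_classes)
    return off_diagonal[:top_k], silent

labels = ["refund", "billing", "shipping", "cancel", "fraud"]
matrix = [  # rows = actual, columns = predicted
    [412, 38, 2, 6, 0],
    [19, 287, 4, 2, 0],
    [1, 3, 198, 0, 0],
    [4, 7, 1, 112, 0],
    [7, 3, 0, 5, 11],
]
pairs = [
    (labels[j], labels[i])
    for i, row in enumerate(matrix)
    for j, n in enumerate(row)
    for _ in range(n)
]
top_cells, silent = confusion_report(pairs)
```

Ranking by raw count surfaces the refund and billing swap first. It does not surface fraud, whose misses are small in count but large relative to support, so the ranking is a complement to the per-class recall table and not a replacement for it.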

For the broader debugging discipline, see your agent passes evals and fails in production, which applies the same pattern to multi-turn agents.

The deterministic + custom-judge cascade

Most LLM classifier workloads don’t need a frontier model on every call. The production pattern that holds up is a cascade: a deterministic floor that handles easy cases at sub-millisecond cost, a calibrated custom judge that handles the residual.

Layer 1: deterministic floor. Regex, contains-checks, exact-match against a canonical keyword set, JSON-schema validation on the output shape. Future AGI ships Regex, Contains, ContainsAll, ContainsAny, Equals, StartsWith, and EndsWith as local heuristic metrics in fi.evals.metrics.heuristics. For intent and sentiment workloads, a regex against high-signal keywords routes 30 to 60 percent of traffic without ever calling an LLM. The deterministic layer handles the obvious cases and frees the judge for everything else.
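The routing logic of a deterministic floor can be sketched with the standard library alone. This example does not use the Future AGI metrics named above; it uses plain `re`, the patterns are hypothetical, and `judge` stands in for whatever LLM layer handles the residual.

```python
import re

# Hypothetical high-signal patterns; real ones should be mined from your own traffic.
RULES = [
    ("fraud_suspected", re.compile(r"\b(unauthori[sz]ed|hacked|account takeover)\b", re.I)),
    ("shipping_inquiry", re.compile(r"\b(tracking number|where is my (order|package))\b", re.I)),
    ("refund_request", re.compile(r"\b(refund|money back)\b", re.I)),
]

def classify(text, judge):
    """Return (label, layer). `judge` is any callable text -> label.
    It is called only when no rule fires or when rules disagree."""
    hits = {label for label, pattern in RULES if pattern.search(text)}
    if len(hits) == 1:
        return hits.pop(), "deterministic"
    return judge(text), "judge"

# Stub judge for illustration only
stub_judge = lambda text: "billing_question"

easy = classify("I want a refund please", stub_judge)
residual = classify("Why was I charged twice?", stub_judge)
conflict = classify("My account was hacked and I want a refund", stub_judge)
```

Deferring to the judge when more than one rule fires keeps the cheap layer conservative: it only answers when the signal is unambiguous.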

Layer 2: calibrated custom judge. For the residual that the deterministic floor can’t decide, run a CustomLLMJudge configured with your class taxonomy, calibrated against your eval set.

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.llm.litellm_provider import LiteLLMProvider

provider = LiteLLMProvider(model="gpt-4o", api_key="...")

intent_classifier = CustomLLMJudge(
    provider=provider,
    config={
        "name": "intent_router",
        "grading_criteria": """
            Classify the user's message into exactly one of:
              - refund_request: user asks for money back on a completed order
              - billing_question: user asks about charges or invoices on their account
              - shipping_inquiry: user asks about delivery status or address
              - cancel_order: user wants to stop a pending order
              - fraud_suspected: user reports unauthorized activity or account-takeover signals
            Return the class string and a confidence score 0.0 to 1.0.
        """,
        "few_shot_examples": [
            {"input": "I want my $80 back from yesterday's order", "label": "refund_request"},
            {"input": "Why was I charged twice on May 2?",          "label": "billing_question"},
            {"input": "Someone logged in from Russia, not me",      "label": "fraud_suspected"},
        ],
    },
)

result = intent_classifier.evaluate({"input": "Please reverse the charge"})

Three discipline rules for the judge layer:

  1. Calibrate before you ship. Run the judge on the labeled eval set, compute per-class precision-recall, set per-class thresholds based on the production cost profile (high precision for fraud-routing, high recall for refund-routing).
  2. Pin the judge model and rubric text in version control. A rubric change is a model change. Versioned together, scored against the same dataset version, the comparison is honest.
  3. Audit off-diagonal cells monthly. A new product feature shifts the input distribution. The confusion matrix shifts with it. The classes that were clean last month become the new confusion pair.
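One lightweight way to enforce the second rule is to record a fingerprint of the judge definition next to every eval result, so two runs are compared only when the fingerprints match. This is a sketch under that assumption; `judge_fingerprint` is a hypothetical helper and not part of any SDK.

```python
import hashlib
import json

def judge_fingerprint(model, grading_criteria, few_shot_examples):
    """Short stable hash over everything that defines the judge's behavior."""
    payload = json.dumps(
        {
            "model": model,
            "grading_criteria": grading_criteria,
            "few_shot_examples": few_shot_examples,
        },
        sort_keys=True,       # key order must not change the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

examples = [{"input": "Why was I charged twice on May 2?", "label": "billing_question"}]
v1 = judge_fingerprint("gpt-4o", "Classify into one of five intents.", examples)
v1_again = judge_fingerprint("gpt-4o", "Classify into one of five intents.", examples)
v2 = judge_fingerprint("gpt-4o", "Classify into one of six intents.", examples)
```

Any edit to the rubric text, the few-shot examples, or the model name produces a new fingerprint, which makes an accidental apples-to-oranges comparison easy to detect.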

For the cost-stable layering the cascade sits inside, see deterministic LLM evaluation metrics, which applies the same pyramid to general eval and not just classification.

What Future AGI ships for LLM classifier evaluation

The classifier eval workflow has four moving parts: dataset, metric implementation, judge implementation, and the calibration that ties them together. Future AGI ships all four as one stack.

Dataset and stratification. The Platform stores eval datasets with intent, persona, and class metadata as first-class fields, so the production-distribution sample and the per-class oversampled set are two views over the same versioned dataset. The futureagi-sdk Client uploads samples programmatically and tags them for cohort analysis.

Deterministic metric implementation. ai-evaluation (Apache 2.0) ships 20+ local heuristic metrics including Regex, Contains, Equals, StartsWith, and full JSON-schema validation (JSONValidation, JSONSchema, SchemaCompliance). Per-class precision-recall and confusion-matrix computation run as deterministic post-processors over a list of (predicted, actual) pairs. Sub-millisecond, zero API cost.

LLM-as-judge implementation. 50+ pre-built EvalTemplate classes cover the common classification axes: Tone, BiasDetection, ContentModeration, AnswerRefusal, TaskCompletion, IsCompliant, Sexist, NoRacialBias, IsHelpful, PromptAdherence. For custom label spaces, CustomLLMJudge takes a grading_criteria string and optional few-shot examples and produces a calibrated judge with a versionable rubric.

Calibration and the contract. The Platform stores per-class operating points alongside the dataset version, so the eval contract follows the rubric, not the run. The same metric definition that scored your CI suite scores your production sample via traceAI’s EvalTag attachments on live spans. A regression in macro F1 from the rolling baseline routes to an annotation queue automatically. Error Feed clusters failing predictions with HDBSCAN over ClickHouse and a Sonnet 4.5 Judge writes an immediate_fix against a 5-category taxonomy, so the off-diagonal confusion cells get a named root cause and a recommended prompt edit on each refresh.

The Platform also layers self-improving evaluators (thumbs feedback retunes the threshold), in-product authoring agents (a domain expert writes grading_criteria in natural language), and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. The Agent Command Center carries the deterministic floor inline at p99 of 21 ms with guardrails on (t3.xlarge, ~29k req/s per the github.com/future-agi/future-agi README), across 100+ providers, with SOC 2 Type II, HIPAA, GDPR, and CCPA certification per futureagi.com/trust, ISO/IEC 27001 in active audit.

Ready to score your classifier on the real distribution? Start with the ai-evaluation SDK quickstart, wire CustomLLMJudge against a 100-per-class stratified sample, and report macro and weighted F1 side by side from day one. The confusion matrix is where the next prompt edit lives.

Three takeaways for 2026

  1. F1 on balanced sets ships theater. Score on a production-distribution sample, debug on an oversampled per-class set, and report macro and weighted F1 together. The gap is the regression flag.
  2. Calibrate per class, not globally. LLM confidence scores are miscalibrated by default and class-skewed by structure. Reliability diagrams on production data are the only honest threshold-setting input.
  3. Read the confusion matrix before you change the model. Off-diagonal cells point at prompt and rubric fixes that lift macro F1 more than any frontier-model swap.

Frequently asked questions

Why does accuracy fail as an LLM classifier metric?
Accuracy weights every test case equally, so on a balanced eval set it weights every class equally, which is the opposite of how production traffic distributes. Most LLM classifier workloads are skewed: 80 percent of intents are a few common buckets, 20 percent are the long tail you care about. A model that scores 0.94 accuracy on a balanced eval set can land at 0.42 recall on the minority class that drives every escalation. The eval that matters reports per-class precision and recall on the real distribution, not aggregate accuracy on a clean balanced set.
What is the difference between macro and weighted F1?
Macro F1 averages the per-class F1 scores treating each class equally, so it punishes minority-class failure even when most volume is one or two classes. Weighted F1 averages by support, so common classes dominate the number. Macro F1 is the right metric when minority-class behavior matters (refund routing, escalation detection, safety-adjacent labels). Weighted F1 is the right metric when the production cost is proportional to volume. Both are summaries; neither replaces the per-class precision-recall view that tells you which classes are actually broken.
How big should the classifier eval set be?
Keep two sets. The production-distribution set is a random sample that mirrors live traffic frequency; the oversampled set holds 100 to 200 examples per class regardless of frequency, so rare classes are covered. Below 100 per class the per-class confidence intervals are too wide to detect drift. The production-distribution set is what you score against; the oversampled set is what you debug against. Refresh monthly by promoting borderline production traces and re-labeling, with a focus on the confusion matrix's off-diagonal cells.
What is calibration for a classifier and why does it matter?
Calibration means the classifier's confidence scores match its empirical accuracy. If it returns 0.8 confidence, it should be right about 80 percent of the time at that band. LLM classifiers using log-probabilities or judge-style scoring are usually miscalibrated out of the box: they cluster scores at the extremes (0.05 or 0.95) and skip the middle. Without calibration, thresholds are not portable across classes and the abstain logic that routes hard cases to a fallback model fires at the wrong cutoff. Measure with a reliability diagram on production data, not the balanced eval set.
How do confusion matrices help debug LLM classifiers?
The confusion matrix is where the failure modes live. Off-diagonal cells tell you which class the model confuses with which other class, and that mapping points straight at the prompt or rubric ambiguity that caused the error. A common pattern: refund_request gets predicted as billing_question because the rubric describes both as money topics without distinguishing intent. Three or four prompt edits aimed at the worst confusion pairs typically lift macro F1 more than any model swap. Read the matrix before you change the model.
When is an LLM classifier the wrong choice?
When you have plenty of labeled data and the label space is fixed. A fine-tuned encoder (DistilBERT, DeBERTa, or a domain-specific model) trains in an afternoon, runs at sub-10ms per call, and routinely beats an LLM-judge classifier on accuracy and cost. LLMs win on cold-start (no labeled data), open-ended label spaces, or when the classification rationale needs to be human-readable. They lose on high-throughput fixed-class workloads where every millisecond and every cent matters. Use the cheaper layer when it can do the job.
What does Future AGI ship for LLM classifier evaluation?
The ai-evaluation SDK (Apache 2.0) ships deterministic confusion-matrix and per-class precision-recall computation as local heuristic metrics, plus 50+ pre-built LLM-as-judge evaluators (Tone, BiasDetection, ContentModeration, IsCompliant, IsHelpful, Sexist, NoRacialBias, AnswerRefusal, TaskCompletion) you can use as classifier components or judges. CustomLLMJudge lets you wire a grading_criteria string into a calibrated judge for any custom label space. The Platform stores per-class operating points alongside the dataset version so the eval contract follows the rubric, not the run. Same definition runs in CI and on production spans via traceAI.