Evaluating LLM Classifiers in 2026: The Eval That Ships
Evaluating LLM classifiers in 2026: per-class precision-recall, macro vs weighted F1, production-distribution calibration, confusion-matrix debugging.
The LLM classifier passes every eval the team runs. Macro accuracy 0.94 on the balanced test set. F1 reads 0.91. The model ships. Two weeks in, the on-call engineer notices that refund tickets are getting routed to billing, escalation requests are closing as resolved, and the minority “fraud_suspected” class is sitting at recall 0.31 in production while the eval set still says 0.88. Nothing changed. The model is doing exactly what the eval said it would. The eval was reading the wrong distribution.
LLM-as-classifier eval is two problems that look like one. The model classifies well on balanced test sets and falls apart on imbalanced production traffic because the eval set hid the imbalance. The eval that actually matters reports per-class precision and recall on the production distribution, macro F1 alongside weighted F1, a calibration curve per class, and a confusion matrix you read before you touch the prompt. F1 on a balanced set ships theater.
TL;DR
| Metric | What it answers | Where it lies |
|---|---|---|
| Aggregate accuracy | “Did we get most calls right?” | Hides minority-class collapse on imbalanced traffic |
| Per-class precision | “When we predict class C, are we right?” | Doesn’t say if we ever predict C at all |
| Per-class recall | “When class C exists, do we catch it?” | Says nothing about confusion with other classes |
| Macro F1 | “Does the model behave on every class?” | Penalizes models tuned for the common case |
| Weighted F1 | “Does the model behave on most traffic?” | Hides minority-class disasters under common-class success |
| Calibration curve | “Are confidence scores meaningful?” | Off-the-shelf LLM scores cluster at extremes |
| Confusion matrix | “Which classes does the model swap?” | Has to be read per-class; the aggregate hides everything |
Score on a production-distribution set, not a balanced one. Report macro and weighted F1 side by side. Calibrate per class before you set a threshold. Read the confusion matrix before you change the prompt. Run a deterministic + custom-judge cascade and let the cheap layer handle the easy cases.
Why balanced-set accuracy misleads
The classic eval setup looks fair: 100 examples per class, stratified, hand-labeled, accuracy reported as one number. Production traffic doesn’t look like that. For intent routing, sentiment classification, content tagging, and category-routing workloads, the real distribution is heavy-tailed. A handful of classes carry 70 to 90 percent of volume; a long tail of minority classes is where escalations and revenue events live.
The balanced eval set teaches the model to behave well on every class equally. Production then drops 80 percent of its weight on three buckets, the overall accuracy looks fine because those three are easy, and the minority classes that drove the project quietly collapse. Nobody notices until a downstream team starts seeing tickets in the wrong queue.
The fix is two eval sets, scored separately. The production-distribution set mirrors live traffic frequency and is what you report to the team. The per-class oversampled set holds 100 to 200 examples per class regardless of frequency and is what you debug against. Macro F1 on the oversampled set tells you whether the model can do every class. Weighted F1 on the production set tells you what users actually experience. A wide gap between them means common-class volume is masking minority-class failures.
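One way to build the two sets is sketched below. It assumes you already have a pool of hand-labeled `(text, label)` pairs drawn from live traffic; the function name `build_eval_sets` and its parameters are hypothetical, not part of any library.

```python
import random
from collections import defaultdict

def build_eval_sets(labeled, prod_size=1000, per_class=100, seed=0):
    """Split labeled traffic into a production-distribution set and an oversampled set.

    `labeled` is a list of (text, label) pairs sampled from live traffic (assumption).
    """
    rng = random.Random(seed)

    # Production-distribution set: a uniform random sample of traffic,
    # so class frequencies mirror what users actually send.
    production_set = rng.sample(labeled, min(prod_size, len(labeled)))

    # Oversampled set: up to `per_class` examples of every class,
    # regardless of how rare the class is in traffic.
    by_class = defaultdict(list)
    for text, label in labeled:
        by_class[label].append((text, label))

    oversampled_set = []
    for label in sorted(by_class):
        rows = by_class[label]
        oversampled_set.extend(rng.sample(rows, min(per_class, len(rows))))

    return production_set, oversampled_set
```

Fixing the seed keeps both sets stable across runs, so score changes reflect the model or prompt and not a reshuffled sample.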
Per-class precision-recall: the imbalanced reality
Aggregate numbers hide the failure. Per-class numbers expose it. Compute four things per class on the production-distribution set: precision (when the model predicts class C, how often is it correct), recall (when class C exists, how often does the model predict it), support (how many examples of class C are in the eval set), and F1 (the harmonic mean of precision and recall).
The pattern that surfaces almost every time: the common classes (high support) carry strong precision and recall (0.90+ both); a couple of minority classes look fine because they get predicted rarely but accurately (high precision, low recall); the rest of the long tail collapses (low recall and low precision both, because the model defaults to a common class when it isn’t sure).
```python
# Per-class precision, recall, F1 from a list of (predicted, actual) pairs
from collections import Counter

def per_class_metrics(pairs):
    classes = sorted({y for _, y in pairs} | {p for p, _ in pairs})
    # Support: number of examples whose actual label is each class
    support = Counter(y for _, y in pairs)
    report = {}
    for c in classes:
        tp = sum(1 for p, y in pairs if p == c and y == c)
        fp = sum(1 for p, y in pairs if p == c and y != c)
        fn = sum(1 for p, y in pairs if p != c and y == c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        report[c] = {
            "precision": round(precision, 3),
            "recall": round(recall, 3),
            "f1": round(f1, 3),
            "support": support[c],
        }
    return report
```
Three reading rules:
A low-precision class is over-predicted. Usually a prompt or rubric problem: the description is too broad, the few-shot examples are too lenient, or the model is using C as a fallback when it doesn’t know.
A low-recall class is under-predicted. Usually a coverage problem in the prompt: the description doesn’t surface the actual signal users send, or the model confuses C with a more familiar adjacent class.
Low precision and low recall together is the danger zone. Either the prompt is ambiguous, the label is poorly defined, or the model genuinely cannot distinguish the class from neighbors. The confusion matrix tells you which.
The team that ships intent routers on aggregate accuracy alone never sees this. The team that reads the per-class table sees the broken classes on day one.
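The three reading rules can be turned into a first-pass triage over the per-class table. The sketch below is illustrative only: the `triage` helper, the 0.8 cutoffs, and the sample numbers are assumptions, and the input is any mapping of class name to a dict with `precision` and `recall` keys.

```python
def triage(report, min_precision=0.8, min_recall=0.8):
    """Tag each class using the three reading rules.

    `report` maps class -> {"precision": float, "recall": float}.
    The 0.8 cutoffs are illustrative assumptions; pick your own per class.
    """
    findings = {}
    for cls, m in report.items():
        low_p = m["precision"] < min_precision
        low_r = m["recall"] < min_recall
        if low_p and low_r:
            findings[cls] = "danger zone: read the confusion matrix"
        elif low_p:
            findings[cls] = "over-predicted: tighten description or few-shot examples"
        elif low_r:
            findings[cls] = "under-predicted: add coverage for the real signals"
        else:
            findings[cls] = "ok"
    return findings

# Hypothetical per-class numbers, for illustration only
sample_report = {
    "refund_request": {"precision": 0.93, "recall": 0.90},
    "billing_question": {"precision": 0.62, "recall": 0.95},
    "fraud_suspected": {"precision": 1.00, "recall": 0.42},
    "cancel_order": {"precision": 0.55, "recall": 0.48},
}
for cls, verdict in triage(sample_report).items():
    print(f"{cls}: {verdict}")
```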
Macro vs weighted F1 (and when each lies)
F1 is per-class. You need a single number for dashboards, CI gates, and release decisions. There are two ways to summarize, and both are useful, and both lie when used alone.
Macro F1 averages the per-class F1s with equal weight. Minority-class failure drags the number down regardless of volume. Use macro F1 when you care about every class: safety-adjacent labels, escalation routing, anything where a minority class drives a tail outcome that matters.
Weighted F1 averages the per-class F1s weighted by support. Common classes dominate. Use weighted F1 when the production cost is proportional to traffic volume: generic content tagging where common classes are most of the work, sentiment buckets where the long tail is decorative.
The lie pattern is symmetric:
- Macro F1 lies when classes are intentionally rare. If a minority class is a known-rare event (fraud, escalation, opt-out) and you score it on a balanced set, macro F1 punishes the model for predicting that class about as rarely as it actually occurs in production.
- Weighted F1 lies when minority-class collapse is invisible. A model with 0.97 F1 on common classes and 0.21 F1 on three minority classes can post weighted F1 of 0.92, which looks like a healthy model. It is not.
Report both. The shape of the (macro, weighted) pair is the signal: macro ≈ weighted means the model behaves consistently across classes; weighted far above macro means common classes are propping up a model that is broken on the tail. The first time you see weighted - macro > 0.10, treat it as a regression flag, not a “ship it” signal.
See the F1 score primer for the underlying math.
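To make the (macro, weighted) pair concrete, here is a small sketch that computes both from a per-class table and flags the gap. The class names and support counts are hypothetical; the F1 values echo the 0.97 and 0.21 figures above, and the 0.10 gap threshold is the one quoted in the text.

```python
def summarize_f1(report, gap_threshold=0.10):
    """Macro and weighted F1 from a per-class table.

    `report` maps class -> {"f1": float, "support": int}.
    """
    total = sum(m["support"] for m in report.values())
    macro = sum(m["f1"] for m in report.values()) / len(report)
    weighted = sum(m["f1"] * m["support"] for m in report.values()) / total
    gap = weighted - macro
    return {
        "macro_f1": macro,
        "weighted_f1": weighted,
        "gap": gap,
        "regression_flag": gap > gap_threshold,
    }

# Hypothetical supports: three common classes carry 93.4% of traffic
report = {
    "refund": {"f1": 0.97, "support": 500},
    "billing": {"f1": 0.97, "support": 300},
    "shipping": {"f1": 0.97, "support": 134},
    "cancel": {"f1": 0.21, "support": 30},
    "fraud": {"f1": 0.21, "support": 20},
    "opt_out": {"f1": 0.21, "support": 16},
}
summary = summarize_f1(report)
print({k: round(v, 2) if isinstance(v, float) else v for k, v in summary.items()})
```

With these numbers weighted F1 comes out near 0.92 while macro F1 is 0.59, so the flag fires even though the weighted number alone looks healthy.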
Calibration on production distribution
A classifier’s confidence score is meaningful only when it’s calibrated: a score of 0.8 should mean 80 percent of predictions at that band are correct. LLM classifiers are usually miscalibrated by default in two shapes.
Bimodal extremes. The model returns 0.95+ or 0.05- on almost every example, skipping the 0.3 to 0.8 band entirely. There’s no middle to threshold against. Every abstain-or-route decision happens at the same effective cutoff.
Class-dependent skew. The model is under-confident on common classes (a 0.7 on the common class is right 95 percent of the time) and over-confident on rare classes (a 0.7 on the rare class is right 55 percent of the time). A single threshold applied across classes therefore yields a different precision profile for each class.
Calibration matters when you do anything threshold-driven: route to a fallback model on low-confidence, abstain and ask for human review, gate a downstream action on confidence above 0.7. Without calibration those thresholds are arbitrary.
Measure calibration with a reliability diagram on the production-distribution set:
```python
# Reliability diagram bins predictions by confidence and checks empirical accuracy
def reliability(pairs_with_conf, num_bins=10):
    bins = [[] for _ in range(num_bins)]
    for pred, actual, conf in pairs_with_conf:
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append(1 if pred == actual else 0)
    return [
        {
            "band": f"{i/num_bins:.1f}-{(i+1)/num_bins:.1f}",
            "empirical_accuracy": sum(b) / len(b) if b else None,
            "n": len(b),
        }
        for i, b in enumerate(bins)
    ]
```
If empirical accuracy in the 0.7-0.8 band is 0.55, your “high-confidence” routing logic is firing on cases the model gets wrong almost half the time. The fix is per-class threshold tuning (set the cutoff that matches your precision target on each class) or a calibration transform such as Platt scaling or isotonic regression fitted on labeled production samples. Don’t trust the raw score until you’ve measured the curve.
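Per-class threshold tuning can be sketched as a scan over the confidence scores for one predicted class. This is a minimal illustration, not a library API: `threshold_for_precision` is a hypothetical helper, and it assumes rows of `(predicted, actual, confidence)` from the production-distribution set.

```python
def threshold_for_precision(rows, target_class, target_precision=0.9):
    """Lowest confidence cutoff at which predictions of `target_class`
    reach `target_precision`. Returns None if no cutoff does.

    `rows` is a list of (predicted, actual, confidence) triples.
    """
    # Keep only predictions of the target class, highest confidence first
    scored = sorted(
        ((conf, pred == actual) for pred, actual, conf in rows if pred == target_class),
        reverse=True,
    )
    best = None
    correct = 0
    for i, (conf, is_correct) in enumerate(scored, start=1):
        correct += is_correct
        # With tied confidences, evaluate the cutoff only once the whole tie is counted
        if i < len(scored) and scored[i][0] == conf:
            continue
        # Precision of everything kept when the cutoff is `conf`
        if correct / i >= target_precision:
            best = conf
    return best

# Hypothetical scored predictions for one rare class
rows = [
    ("fraud", "fraud", 0.95),
    ("fraud", "fraud", 0.90),
    ("fraud", "refund", 0.80),
    ("fraud", "fraud", 0.70),
    ("fraud", "billing", 0.60),
]
print(threshold_for_precision(rows, "fraud", target_precision=0.75))
```

Precision is not monotone in the cutoff, so the scan checks every distinct confidence value and keeps the lowest one that still meets the target, which maximizes how much traffic passes the gate.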
Confusion matrix-driven debugging
The confusion matrix is the debugger. Rows are actual class, columns are predicted. The diagonal is correct calls; everything off-diagonal is a confusion pair that maps directly to a prompt or rubric fix.
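| Actual ↓ / Predicted → | refund | billing | shipping | cancel | fraud |
|---|---|---|---|---|---|
| refund | 412 | 38 | 2 | 6 | 0 |
| billing | 19 | 287 | 4 | 2 | 0 |
| shipping | 1 | 3 | 198 | 0 | 0 |
| cancel | 4 | 7 | 1 | 112 | 0 |
| fraud | 7 | 3 | 0 | 5 | 11 |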
Read it as a checklist:
Biggest off-diagonal cells. Here, (refund, billing) = 38 and (billing, refund) = 19 is the dominant confusion. The fix is rubric clarification: a sharper definition of refund vs billing in the prompt, with two contrastive few-shot examples that pin the distinction.
Asymmetric confusions. (fraud, refund) = 7, (fraud, cancel) = 5, and (fraud, billing) = 3 look tiny, but fraud’s support is only 26: 15 misses out of 26 leave recall at 11/26 ≈ 0.42. The model under-predicts fraud because the prompt doesn’t surface the actual signals (account-takeover language, suspicious-pattern phrases). The fix is example coverage, not a model upgrade.
Silent classes. If a column sums to zero (a class is never predicted), the model has decided the class doesn’t exist. That’s a rubric problem: either the description doesn’t match the inputs you see, or there’s no positive example in the few-shot set.
The matrix tells you what to fix and in what order. Three or four prompt edits aimed at the top off-diagonal cells typically lift macro F1 more than a frontier model swap. Read the matrix first; swap models only after the prompt is doing what it can.
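The checklist can be automated in a few lines. The sketch below uses the five-class matrix shown earlier in this section (rows are actual, columns are predicted); the helper names are hypothetical.

```python
LABELS = ["refund", "billing", "shipping", "cancel", "fraud"]
MATRIX = [  # rows: actual class, columns: predicted class
    [412, 38, 2, 6, 0],
    [19, 287, 4, 2, 0],
    [1, 3, 198, 0, 0],
    [4, 7, 1, 112, 0],
    [7, 3, 0, 5, 11],
]

def top_confusions(labels, matrix, k=3):
    """Largest off-diagonal cells as (count, actual, predicted)."""
    cells = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(cells, reverse=True)[:k]

def recall_by_class(labels, matrix):
    """Diagonal count divided by row total (support) for each actual class."""
    return {
        label: (matrix[i][i] / sum(matrix[i]) if sum(matrix[i]) else 0.0)
        for i, label in enumerate(labels)
    }

def silent_classes(labels, matrix):
    """Classes whose predicted column sums to zero."""
    return [
        label for j, label in enumerate(labels)
        if sum(row[j] for row in matrix) == 0
    ]

print(top_confusions(LABELS, MATRIX, k=2))
print({c: round(r, 2) for c, r in recall_by_class(LABELS, MATRIX).items()})
print(silent_classes(LABELS, MATRIX))
```

Sorting by raw count surfaces the refund and billing pair first, while the recall view surfaces fraud, so it is worth reading both.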
For the broader debugging discipline, see your agent passes evals and fails in production, which applies the same pattern to multi-turn agents.
The deterministic + custom-judge cascade
Most LLM classifier workloads don’t need a frontier model on every call. The production pattern that holds up is a cascade: a deterministic floor that handles easy cases at sub-millisecond cost, a calibrated custom judge that handles the residual.
Layer 1: deterministic floor. Regex, contains-checks, exact-match against a canonical keyword set, JSON-schema validation on the output shape. Future AGI ships Regex, Contains, ContainsAll, ContainsAny, Equals, StartsWith, and EndsWith as local heuristic metrics in fi.evals.metrics.heuristics. For intent and sentiment workloads, a regex against high-signal keywords routes 30 to 60 percent of traffic without ever calling an LLM. The deterministic layer handles the obvious cases and frees the judge for everything else.
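As a library-agnostic illustration of the floor, the sketch below uses only Python’s standard `re` module and does not call the Future AGI metrics. The keyword patterns, the `deterministic_floor` and `classify` helpers, and the `judge` callable are all hypothetical; real rules should come from your own traffic.

```python
import re

# Hypothetical high-signal patterns, for illustration only
RULES = [
    ("fraud_suspected", re.compile(r"\b(unauthori[sz]ed|hacked|not me)\b", re.I)),
    ("refund_request", re.compile(r"\b(refund|money back)\b", re.I)),
    ("shipping_inquiry", re.compile(r"\b(tracking|delivery|shipped)\b", re.I)),
]

def deterministic_floor(text):
    """Return a label only when exactly one rule matches; otherwise None."""
    hits = [label for label, pattern in RULES if pattern.search(text)]
    return hits[0] if len(hits) == 1 else None

def classify(text, judge):
    """Try the cheap layer first; fall through to `judge` (any callable) on no decision."""
    label = deterministic_floor(text)
    if label is not None:
        return label, "deterministic"
    return judge(text), "judge"

# Stand-in judge for the example; in practice this is the LLM layer
print(classify("I want my money back", judge=lambda t: "billing_question"))
print(classify("Please reverse the charge", judge=lambda t: "billing_question"))
```

Abstaining when zero or several rules match keeps the floor conservative: ambiguous messages go to the judge instead of being routed on the first keyword that happens to hit.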
Layer 2: calibrated custom judge. For the residual that the deterministic floor can’t decide, run a CustomLLMJudge configured with your class taxonomy, calibrated against your eval set.
```python
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.llm.litellm_provider import LiteLLMProvider

provider = LiteLLMProvider(model="gpt-4o", api_key="...")

intent_classifier = CustomLLMJudge(
    provider=provider,
    config={
        "name": "intent_router",
        "grading_criteria": """
        Classify the user's message into exactly one of:
        - refund_request: user asks for money back on a completed order
        - billing_question: user asks about charges or invoices on their account
        - shipping_inquiry: user asks about delivery status or address
        - cancel_order: user wants to stop a pending order
        - fraud_suspected: user reports unauthorized activity or account-takeover signals
        Return the class string and a confidence score 0.0 to 1.0.
        """,
        "few_shot_examples": [
            {"input": "I want my $80 back from yesterday's order", "label": "refund_request"},
            {"input": "Why was I charged twice on May 2?", "label": "billing_question"},
            {"input": "Someone logged in from Russia, not me", "label": "fraud_suspected"},
        ],
    },
)

result = intent_classifier.evaluate({"input": "Please reverse the charge"})
```
Three discipline rules for the judge layer:
- Calibrate before you ship. Run the judge on the labeled eval set, compute per-class precision-recall, set per-class thresholds based on the production cost profile (high precision for fraud-routing, high recall for refund-routing).
- Pin the judge model and rubric text in version control. A rubric change is a model change. Versioned together, scored against the same dataset version, the comparison is honest.
- Audit off-diagonal cells monthly. A new product feature shifts the input distribution. The confusion matrix shifts with it. The classes that were clean last month become the new confusion pair.
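The monthly audit can be approximated by comparing off-diagonal rates between two labeled samples. This is a sketch under assumptions: both inputs are lists of `(predicted, actual)` pairs, the helper names are hypothetical, and the 0.05 minimum increase is an arbitrary illustrative cutoff.

```python
from collections import Counter

def confusion_rates(pairs):
    """Map (actual, predicted) -> share of that actual class, off-diagonal cells only.

    `pairs` is a list of (predicted, actual) tuples.
    """
    support = Counter(actual for _, actual in pairs)
    cells = Counter((actual, pred) for pred, actual in pairs if pred != actual)
    return {cell: n / support[cell[0]] for cell, n in cells.items()}

def new_confusions(last_month, this_month, min_increase=0.05):
    """Off-diagonal cells whose rate grew by at least `min_increase`, largest growth first."""
    before = confusion_rates(last_month)
    after = confusion_rates(this_month)
    grown = {
        cell: (before.get(cell, 0.0), rate)
        for cell, rate in after.items()
        if rate - before.get(cell, 0.0) >= min_increase
    }
    return dict(sorted(grown.items(), key=lambda kv: kv[1][1] - kv[1][0], reverse=True))

# Hypothetical samples: refund -> billing confusion grows from 5% to 20%
last_month = [("refund", "refund")] * 95 + [("billing", "refund")] * 5
this_month = [("refund", "refund")] * 80 + [("billing", "refund")] * 20
print(new_confusions(last_month, this_month))
```

Normalizing by the actual class’s support keeps a traffic spike in one class from looking like a new confusion.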
For the cost-stable layering the cascade sits inside, see deterministic LLM evaluation metrics, which applies the same pyramid to general eval and not just classification.
What Future AGI ships for LLM classifier evaluation
The classifier eval workflow has four moving parts: dataset, metric implementation, judge implementation, and the calibration that ties them together. Future AGI ships all four as one stack.
Dataset and stratification. The Platform stores eval datasets with intent, persona, and class metadata as first-class fields, so the production-distribution sample and the per-class oversampled set are two views over the same versioned dataset. The futureagi-sdk Client uploads samples programmatically and tags them for cohort analysis.
Deterministic metric implementation. ai-evaluation (Apache 2.0) ships 20+ local heuristic metrics including Regex, Contains, Equals, StartsWith, and full JSON-schema validation (JSONValidation, JSONSchema, SchemaCompliance). Per-class precision-recall and confusion-matrix computation run as deterministic post-processors over a list of (predicted, actual) pairs. Sub-millisecond, zero API cost.
LLM-as-judge implementation. 50+ pre-built EvalTemplate classes cover the common classification axes: Tone, BiasDetection, ContentModeration, AnswerRefusal, TaskCompletion, IsCompliant, Sexist, NoRacialBias, IsHelpful, PromptAdherence. For custom label spaces, CustomLLMJudge takes a grading_criteria string and optional few-shot examples and produces a calibrated judge with a versionable rubric.
Calibration and the contract. The Platform stores per-class operating points alongside the dataset version, so the eval contract follows the rubric, not the run. The same metric definition that scored your CI suite scores your production sample via traceAI’s EvalTag attachments on live spans. A regression in macro F1 from the rolling baseline routes to an annotation queue automatically. Error Feed clusters failing predictions with HDBSCAN over ClickHouse and a Sonnet 4.5 Judge writes an immediate_fix against a 5-category taxonomy, so the off-diagonal confusion cells get a named root cause and a recommended prompt edit on each refresh.
The Platform also layers self-improving evaluators (thumbs feedback retunes the threshold), in-product authoring agents (a domain expert writes grading_criteria in natural language), and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. The Agent Command Center carries the deterministic floor inline at p99 of 21 ms with guardrails on (t3.xlarge, ~29k req/s per the github.com/future-agi/future-agi README), across 100+ providers, with SOC 2 Type II, HIPAA, GDPR, and CCPA certification per futureagi.com/trust, ISO/IEC 27001 in active audit.
Ready to score your classifier on the real distribution? Start with the ai-evaluation SDK quickstart, wire CustomLLMJudge against a 100-per-class stratified sample, and report macro and weighted F1 side by side from day one. The confusion matrix is where the next prompt edit lives.
Three takeaways for 2026
- F1 on balanced sets ships theater. Score on a production-distribution sample, debug on an oversampled per-class set, and report macro and weighted F1 together. The gap is the regression flag.
- Calibrate per class, not globally. LLM confidence scores are miscalibrated by default and class-skewed by structure. Reliability diagrams on production data are the only honest threshold-setting input.
- Read the confusion matrix before you change the model. Off-diagonal cells point at prompt and rubric fixes that lift macro F1 more than any frontier-model swap.
Related reading
- Deterministic LLM Evaluation Metrics (2026)
- F1 Score for Evaluating Classifiers
- Why LLM-as-a-Judge (2026)
- Your Agent Passes Evals and Fails in Production (2026)
- The 2026 LLM Evaluation Playbook