What Is Classification (Machine Learning)?
The supervised-learning task of assigning inputs to one of a finite set of discrete categories, evaluated against ground-truth labels.
Classification is the supervised-learning task of assigning an input to one of a finite set of discrete categories. Inputs can be text, images, audio, tabular rows, or embeddings; outputs are either a single class label (hard classification) or a probability distribution over classes (soft classification). Models are trained on labeled examples and graded against held-out ground truth using accuracy, precision, recall, F1, or ROC-AUC. In LLM systems, classification appears as routing decisions, intent detection, content moderation, toxicity scoring, and as the head of judge-models that grade open-ended outputs against a fixed rubric.
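A minimal sketch of the hard-vs-soft distinction, using a hypothetical three-class moderation label set (the class names and probabilities are invented for illustration):

```python
# Hard vs. soft classification on a hypothetical three-class label set.
probs = {"safe": 0.82, "borderline": 0.13, "unsafe": 0.05}  # soft: distribution over classes

hard_label = max(probs, key=probs.get)  # hard: argmax over the distribution
print(hard_label)       # -> safe
print(probs["unsafe"])  # -> 0.05, the mass assigned to the minority class
```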
Why It Matters in Production LLM and Agent Systems
Classification is everywhere in an LLM stack, even when the surface is generative. A router LLM that picks “general-purpose vs. coding vs. legal” is classifying. A guardrail that emits “safe vs. unsafe” is classifying. A judge model that returns “passes vs. fails” against a rubric is classifying. The quality of these classifiers directly determines whether downstream generation runs on the right model, with the right prompt, on the right data.
The pain shows up across roles. An ML engineer ships a router with 92% accuracy on a balanced validation set; in production the class distribution is 95% general-purpose, so the router defaults to “general” on borderline cases and the legal queries that need a stricter prompt slip through. A product lead sees 8% of toxicity flags fire on benign content because the threshold was tuned on a different cohort. A compliance lead is asked which fraction of moderation decisions overrode a human, and has no per-class confusion matrix to point to.
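To make the imbalance failure concrete, here is an illustrative calculation with synthetic router labels; the 95/5 split mirrors the scenario above, the miss counts are invented, and scikit-learn is assumed available:

```python
# Synthetic illustration of the 95/5 routing split described above.
from sklearn.metrics import accuracy_score, recall_score

y_true = ["general"] * 95 + ["legal"] * 5
# Router defaults to "general" on borderline cases, missing 3 of 5 legal queries.
y_pred = ["general"] * 95 + ["general"] * 3 + ["legal"] * 2

print(accuracy_score(y_true, y_pred))                   # 0.97, looks healthy
print(recall_score(y_true, y_pred, pos_label="legal"))  # 0.40, legal queries slip through
```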
In 2026 agent stacks, classifications cascade. An intent-classification miss at step one routes the agent to the wrong tool at step two, which misformats output at step three, which fails a downstream JSON schema at step four. Step-level evaluation of every classification decision in a trajectory is what catches the cascade.
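One way to localize the root of a cascade is to walk the trajectory and flag the first classification miss; the trajectory structure and step names below are hypothetical, not a FutureAGI schema:

```python
# Walking a trajectory to find the first misclassified step; the
# structure and step names are hypothetical, for illustration only.
trajectory = [
    {"step": "intent",      "predicted": "code_search", "expected": "legal_lookup"},
    {"step": "tool_choice", "predicted": "grep_tool",   "expected": "contracts_db"},
    {"step": "formatting",  "predicted": "markdown",    "expected": "json"},
]

# Later failures are usually downstream of the first miss, not independent bugs.
first_miss = next((s for s in trajectory if s["predicted"] != s["expected"]), None)
if first_miss is not None:
    print(f"cascade root: step {first_miss['step']!r} predicted "
          f"{first_miss['predicted']!r}, expected {first_miss['expected']!r}")
```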
How FutureAGI Handles Classification
FutureAGI’s approach is to grade LLM-driven classification the same way you grade any other LLM output: against ground truth, in a versioned Dataset, with thresholds wired to a regression eval. The GroundTruthMatch evaluator is purpose-built for this: it compares the model’s class label to the dataset’s labeled column and returns hit/miss plus aggregate accuracy, precision, recall, and F1. For LLM-as-judge classification rubrics, CustomEvaluation lets you wrap a judge prompt and threshold its categorical output.
Concretely: a content-moderation team uses an LLM to classify user posts into safe / borderline / unsafe. They build a 2,000-row golden Dataset with human-labeled ground truth, attach GroundTruthMatch, and run it on every prompt-template PR. The CI artifact shows per-class precision and recall plus a confusion matrix; the team blocks merge when the unsafe-class recall drops below 0.93. In production, the same evaluator runs on a sampled cohort of trace outputs (those reviewed by humans through the AnnotationQueue) and the dashboard tracks eval-fail-rate-by-class, so a sudden recall drop on the unsafe class fires an alert.
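A sketch of what such a CI gate might look like, using scikit-learn for the per-class metrics; the synthetic labels below stand in for the golden Dataset’s human-labeled column and the candidate template’s outputs:

```python
# Synthetic stand-ins for the golden Dataset's human labels and the
# candidate prompt template's outputs.
import sys
from sklearn.metrics import confusion_matrix, recall_score

labels = ["safe", "borderline", "unsafe"]
y_true = ["safe"] * 6 + ["borderline"] * 2 + ["unsafe"] * 4
y_pred = ["safe"] * 6 + ["borderline", "safe"] + ["unsafe"] * 3 + ["borderline"]

print(confusion_matrix(y_true, y_pred, labels=labels))  # the CI artifact's matrix
recalls = recall_score(y_true, y_pred, labels=labels, average=None)

unsafe_recall = recalls[labels.index("unsafe")]
if unsafe_recall < 0.93:
    sys.exit(f"blocking merge: unsafe-class recall {unsafe_recall:.2f} < 0.93")
```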
For routing classification, the gateway exposes the router’s chosen route as a span attribute; FutureAGI traces let you measure router accuracy against a labeled subset and tune the classification threshold per class.
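As an illustration of recording the route on a span, here is a sketch using the generic OpenTelemetry Python API; the attribute names and the classify helper are assumptions, not a documented FutureAGI gateway schema:

```python
from opentelemetry import trace

tracer = trace.get_tracer("router")

def classify(query: str) -> tuple[str, float]:
    # Stand-in for the real router model; returns (route, confidence).
    return ("general", 0.90)

def route(query: str) -> str:
    with tracer.start_as_current_span("route_query") as span:
        chosen, confidence = classify(query)
        # Hypothetical attribute names; align with whatever your
        # tracing backend expects.
        span.set_attribute("router.route", chosen)
        span.set_attribute("router.confidence", confidence)
        return chosen
```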
How to Measure or Detect It
Classification quality is measured by per-class metrics, not just overall accuracy (a sketch computing these follows the evaluator example below):
- GroundTruthMatch (fi.evals): returns hit/miss against the labeled column and aggregates accuracy, precision, recall, and F1.
- Confusion matrix (dashboard panel): shows which classes get confused with which; it is the diagnostic that reveals threshold or label-leakage problems.
- ROC-AUC: threshold-independent score for binary or one-vs-rest classification.
- Per-class recall: critical when class imbalance hides the minority-class regression behind a high overall accuracy.
- Calibration error: does P(class) = 0.7 actually mean 70% empirical hit rate? Miscalibrated classifiers break downstream confidence-based routing.
from fi.evals import GroundTruthMatch

# Compare one predicted class label to its ground-truth value.
gt = GroundTruthMatch()
result = gt.evaluate(
    output="unsafe",             # the model's predicted class
    expected_response="unsafe",  # the labeled column's ground truth
)
print(result.score, result.reason)  # hit/miss score plus the evaluator's rationale
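Beyond the built-in evaluator, the remaining metrics from the list above can be computed with scikit-learn; the scores below are synthetic and the 0.5 decision threshold is illustrative:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score

y_true  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 1 = unsafe (minority class)
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.2, 0.6, 0.7, 0.45])
y_pred  = (y_score >= 0.5).astype(int)               # hard labels at a 0.5 threshold

print(confusion_matrix(y_true, y_pred))
print(recall_score(y_true, y_pred, average=None))    # per-class recall
print(roc_auc_score(y_true, y_score))                # threshold-independent

# Crude calibration check: does the 0.6-0.8 score bucket hit ~70% empirically?
bucket = (y_score >= 0.6) & (y_score < 0.8)
print(y_true[bucket].mean())
```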
Common Mistakes
- Reporting overall accuracy on imbalanced data. A 95%-accurate classifier on a 95/5 split could simply predict the majority class every time; track per-class recall.
- Tuning the threshold on the same data used for training. Always tune on a held-out validation slice to avoid optimistic numbers.
- Treating soft probabilities as calibrated. Most LLM judge-model outputs are not calibrated; isotonic or Platt scaling is required for confidence-based routing (see the calibration sketch after this list).
- No confusion-matrix review. A high F1 hides which two classes the model flips between; the matrix is the diagnostic, not the score.
- Skipping per-cohort evaluation. Production cohorts (free vs. paid, region, language) often have different class distributions; aggregate scores hide cohort-specific regressions.
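For the calibration mistake above, a minimal post-hoc fix is to fit isotonic regression on a held-out slice and route on the calibrated probabilities; the confidence values and hit labels below are synthetic:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

raw_conf = np.array([0.55, 0.60, 0.65, 0.70, 0.80, 0.90, 0.92, 0.95])  # judge's stated confidence
was_hit  = np.array([0,    0,    1,    0,    1,    1,    1,    1])     # held-out ground truth

# Fit on the held-out slice only, never on the data used to tune the judge.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_conf, was_hit)

print(iso.predict([0.70, 0.90]))  # calibrated probabilities to route on
```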
Frequently Asked Questions
What is classification in machine learning?
Classification is the supervised-learning task of mapping an input to one of a finite set of discrete classes, like spam vs. ham or one of N intent labels, evaluated against ground-truth labels.
How is classification different from regression?
Regression predicts a continuous value (price, temperature); classification predicts a discrete category. The evaluation metrics differ accordingly: regression uses MSE or MAE, classification uses accuracy, F1, and ROC-AUC.
How do you measure classification quality in an LLM system?
FutureAGI's GroundTruthMatch evaluator compares an LLM's classification output to a labeled Dataset and aggregates accuracy, precision, recall, and F1. Wire it into a regression eval for every prompt or model change.