What Is Topic Classification (Eval)?

An eval pattern that checks whether model outputs, user inputs, or agent traces are assigned to the correct subject label.

Topic classification is an eval pattern that checks whether an LLM, router, or agent assigns text to the correct subject label. It applies to LLM evaluation pipelines, production traces, retrieval routing, support queues, and safety review flows. FutureAGI teams use it to compare predicted topics against ground truth, inspect class-level confusion, and stop releases when a high-risk topic such as refunds, legal advice, or data privacy is sent to the wrong downstream path. Router-style architectures dominate 2026 agent stacks: a small classifier model decides which specialist (Claude Opus 4.7 for legal, GPT-5.x for code, Gemini 3.x for multimodal) sees each request. That makes topic accuracy the single most upstream lever in the trace.

Why topic classification matters in production LLM and agent systems

Topic errors rarely look like model errors at first. A support assistant may answer fluently but file a “billing dispute” conversation under “general account question.” A RAG system may route a privacy-policy query to product docs because both mention “data.” A compliance classifier may label a medical-advice request as wellness content and skip the review queue. The failure is not language quality; it is the wrong subject label at the control point.

Developers feel the pain when regression tests pass but route-level incidents rise. SREs see unexplained retries, tool-call spikes, or queue imbalance because one topic sends traffic to the wrong retriever or workflow. Product teams see users repeat themselves after the assistant enters the wrong resolution path. Compliance teams care when protected classes, financial claims, health advice, or PII topics are under-counted.

The symptoms are visible if you log labels: topic distribution drift after a prompt change, a rare class losing recall, one model version over-predicting “other,” or trace clusters where predicted_topic=refunds but the selected tool is shipping_status. In 2026 multi-step systems, one bad topic label can choose the wrong knowledge base, policy, guardrail, escalation path, or model route. Topic classification gives engineers a measurable checkpoint before that early routing mistake spreads through the agent trajectory. In our 2026 evals on a 12-topic enterprise router, raising topic-stage accuracy from 87% to 94% improved end-to-end task completion by ~9 points without touching the downstream specialist agents.
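Two of the symptoms above, distribution drift and a predicted topic that disagrees with the selected tool, can be checked directly from logged labels. The following is a minimal sketch, assuming traces are plain dicts with `predicted_topic` and `selected_tool` keys; the topic-to-tool mapping, the sample data, and the choice of total variation distance as the drift measure are illustrative assumptions, not FutureAGI functionality.

```python
from collections import Counter

def topic_distribution(labels):
    """Return {topic: share} for a list of predicted topic labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}

def total_variation_distance(p, q):
    """Half the L1 distance between two topic distributions (0 = identical, 1 = disjoint)."""
    topics = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in topics)

def route_mismatches(traces, expected_tool):
    """Return traces whose selected tool is not the one mapped to the predicted topic.

    Topics missing from the mapping are flagged too, since no route is defined for them.
    """
    return [t for t in traces if expected_tool.get(t["predicted_topic"]) != t["selected_tool"]]

# Hypothetical predicted labels before and after a prompt change.
before = ["refunds"] * 50 + ["privacy"] * 10 + ["other"] * 40
after = ["refunds"] * 40 + ["privacy"] * 5 + ["other"] * 55
drift = total_variation_distance(topic_distribution(before), topic_distribution(after))

# Hypothetical traces and topic-to-tool mapping.
traces = [
    {"predicted_topic": "refunds", "selected_tool": "shipping_status"},
    {"predicted_topic": "refunds", "selected_tool": "refund_lookup"},
]
mismatched = route_mismatches(traces, {"refunds": "refund_lookup"})
```

A drift threshold is a judgment call; a reasonable starting point is to compare the value against the drift observed between two runs of the same prompt version.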

How FutureAGI handles topic classification

FutureAGI’s approach is to treat topic classification as a labeled-output eval composed from existing primitives, not as a dedicated TopicClassification class. The practical workflow is to store expected topic labels in a golden dataset, run a row-level evaluator such as GroundTruthMatch, Equals, or FuzzyMatch, and aggregate the results by class, prompt version, model route, and customer cohort. For label sets that change frequently or include semantic clusters (“data deletion” vs “data export”), CustomEvaluation with a judge rubric is the right escape hatch.
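The aggregation step described above can be sketched in plain Python once each row carries a row-level score. This sketch assumes rows are dicts with a numeric `topic_match` field (1.0 for a match, 0.0 for a miss) plus whatever grouping fields you log; the field names `expected_topic` and `prompt_version` and the helper `aggregate_match_rate` are hypothetical.

```python
from collections import defaultdict

def aggregate_match_rate(rows, keys):
    """Group row-level match scores by the given keys and return the mean score per group."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of scores, row count]
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group][0] += row["topic_match"]
        totals[group][1] += 1
    return {group: score_sum / n for group, (score_sum, n) in totals.items()}

# Hypothetical scored rows.
rows = [
    {"expected_topic": "privacy", "prompt_version": "v2", "topic_match": 1.0},
    {"expected_topic": "privacy", "prompt_version": "v2", "topic_match": 0.0},
    {"expected_topic": "billing", "prompt_version": "v2", "topic_match": 1.0},
    {"expected_topic": "billing", "prompt_version": "v1", "topic_match": 1.0},
]
by_class_and_version = aggregate_match_rate(rows, ["expected_topic", "prompt_version"])
```

Grouping by the expected topic alone yields per-class recall, because each group then contains every row whose true label is that topic.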

A concrete workflow: a platform team has 18 support topics, including refunds, cancellations, account security, privacy, and technical troubleshooting. Their LangChain agent is instrumented with traceAI-langchain, so each run keeps the input, predicted topic, selected tool, and agent.trajectory.step context. A nightly eval runs GroundTruthMatch against a labeled dataset and writes the score beside the trace. If privacy recall drops below 0.92 or macro-F1 falls by more than 0.03, the release is blocked. The engineer opens false-negative traces, finds that “data deletion” was being grouped with “account settings,” and updates the router prompt plus the golden dataset.
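The release gate in this workflow can be sketched as follows. The thresholds (privacy recall floor of 0.92, maximum macro-F1 drop of 0.03) come from the example above; the function names, the `"privacy"` label string, and the sample labels are assumptions for illustration, and macro-F1 here is the unweighted mean of per-class F1 over every label seen in either list.

```python
def per_class_scores(expected, predicted):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    pairs = list(zip(expected, predicted))
    scores = {}
    for label in set(expected) | set(predicted):
        tp = sum(1 for e, p in pairs if e == label and p == label)
        fp = sum(1 for e, p in pairs if e != label and p == label)
        fn = sum(1 for e, p in pairs if e == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

def macro_f1(scores):
    """Unweighted mean of per-class F1."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

def release_blocked(expected, predicted, baseline_macro_f1,
                    privacy_floor=0.92, max_f1_drop=0.03):
    """Block when privacy recall is below the floor or macro-F1 fell by more than max_f1_drop."""
    scores = per_class_scores(expected, predicted)
    privacy_recall = scores.get("privacy", (0.0, 0.0, 0.0))[1]
    f1_drop = baseline_macro_f1 - macro_f1(scores)
    return privacy_recall < privacy_floor or f1_drop > max_f1_drop

# Hypothetical nightly run: two privacy rows were grouped under account_settings.
expected = ["privacy"] * 10 + ["account_settings"] * 10
predicted = ["privacy"] * 8 + ["account_settings"] * 12
blocked = release_blocked(expected, predicted, baseline_macro_f1=0.95)
```

In this sample, privacy recall is 8/10, so the gate blocks on the recall floor regardless of the macro-F1 comparison.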

Unlike Galileo’s classification module, which reports an aggregate accuracy and a confusion matrix, FutureAGI keeps the eval close to traces so a failed label links back to the prompt, retriever, tool decision, and downstream action. You can pivot directly from a misclassified row to the user prompt, the chosen model, and the production cost of that error. The public anchors most teams benchmark topic classifiers against are MMLU-Pro (about 12K questions across 14 domains; the standard subject-routing benchmark in 2026) and BBH (BIG-Bench Hard) for cross-domain confusion. For safety-relevant labels, HarmBench and AgentHarm cover the high-risk topic categories where recall matters most.

Topic classification cohorts for a typical support workflow

| Topic | Volume share | Recall floor | Why the floor |
| --- | --- | --- | --- |
| Account / login | 32% | 0.95 | High volume; missed cases overflow other queues |
| Billing | 18% | 0.93 | Refund SLAs depend on the right routing |
| Technical troubleshooting | 22% | 0.90 | Multiple correct routes; some tolerance |
| Privacy / data deletion | 4% | 0.97 | GDPR / DSAR compliance risk |
| Security incident | 2% | 0.99 | Must reach incident-response queue |
| Health / medical | 1% | 0.99 | Regulated clinical handling |
| Other / off-topic | 9% | 0.85 | Catch-all; tolerance is fine |
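The recall floors in this table can be kept as configuration and checked against observed per-topic recall. This is a minimal sketch: the floor values are copied from the table, while the helper `floor_violations` and the observed recall numbers are hypothetical.

```python
# Recall floors per topic, copied from the cohort table.
RECALL_FLOORS = {
    "Account / login": 0.95,
    "Billing": 0.93,
    "Technical troubleshooting": 0.90,
    "Privacy / data deletion": 0.97,
    "Security incident": 0.99,
    "Health / medical": 0.99,
    "Other / off-topic": 0.85,
}

def floor_violations(observed_recall, floors=RECALL_FLOORS):
    """Return [(topic, observed, floor)] for topics below their floor, largest shortfall first.

    A topic with no observed recall is treated as 0.0 so that it cannot pass silently.
    """
    violations = [
        (topic, observed_recall.get(topic, 0.0), floor)
        for topic, floor in floors.items()
        if observed_recall.get(topic, 0.0) < floor
    ]
    return sorted(violations, key=lambda v: v[1] - v[2])

# Hypothetical per-topic recall from one eval run.
observed = {
    "Account / login": 0.96,
    "Billing": 0.94,
    "Technical troubleshooting": 0.91,
    "Privacy / data deletion": 0.95,
    "Security incident": 0.99,
    "Health / medical": 1.0,
    "Other / off-topic": 0.80,
}
violations = floor_violations(observed)
```

Sorting by shortfall is one option; a team that gates on compliance risk may prefer to order violations by the floor itself so regulated topics surface first.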

How to measure or detect topic classification

Measurement starts with a closed label set and expected labels for representative rows:

  • GroundTruthMatch: returns a row-level match signal between the predicted topic and the expected topic; aggregate it into accuracy, precision, recall, and F1.
  • Equals: use it when topic labels are canonical strings or IDs and any mismatch is a failure.
  • FuzzyMatch: use it when historical labels contain naming variation, then review borderline matches before release gating.
  • Confusion matrix: show which topic pairs are confused, especially rare or regulated classes.
  • Dashboard signals: monitor macro-F1, eval-fail-rate-by-cohort, topic-distribution drift, escalation rate, and thumbs-down rate by predicted topic.
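The confusion-matrix signal in the list above can be sketched with the standard library alone. The helper names and the sample labels are hypothetical; the idea is to count (expected, predicted) pairs and surface the most frequent off-diagonal ones.

```python
from collections import Counter

def confusion_counts(expected, predicted):
    """Count how often each (expected, predicted) label pair occurs."""
    return Counter(zip(expected, predicted))

def top_confusions(expected, predicted, n=5):
    """Return the n most frequent misclassified (expected, predicted) pairs with their counts."""
    counts = confusion_counts(expected, predicted)
    errors = Counter({pair: c for pair, c in counts.items() if pair[0] != pair[1]})
    return errors.most_common(n)

# Hypothetical labels.
expected = ["privacy", "privacy", "privacy", "billing", "billing", "security"]
predicted = ["account", "account", "privacy", "billing", "refunds", "security"]
worst = top_confusions(expected, predicted, n=1)
```

Reading the pairs in (expected, predicted) order keeps the direction of the error visible: privacy rows landing in account is a different problem from account rows landing in privacy.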

Minimal Python:

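from fi.evals import GroundTruthMatch

# Assumes `dataset` is an iterable of dict rows that each carry
# "predicted_topic" and "expected_topic" keys.
metric = GroundTruthMatch()
for row in dataset:
    # Compare the predicted topic with the labeled ground truth.
    result = metric.evaluate(
        response=row["predicted_topic"],
        expected_response=row["expected_topic"],
    )
    # Store the row-level score for later per-class aggregation.
    row["topic_match"] = result.score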

Do not stop at one aggregate. A topic classifier with 95% accuracy can still miss every privacy request if privacy is a small class. Track per-class recall for high-risk labels.
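A small worked example makes the point concrete. The numbers are hypothetical: 1,000 rows, of which 40 (4%) are privacy requests, and a classifier that never predicts privacy but gets 950 of the remaining 960 rows right.

```python
# Hypothetical labeled set: 40 privacy rows and 960 rows from other topics.
expected = ["privacy"] * 40 + ["other_topics"] * 960
# Every privacy row is mislabeled as "account"; 10 other rows are also wrong.
predicted = ["account"] * 40 + ["other_topics"] * 950 + ["billing"] * 10

pairs = list(zip(expected, predicted))
accuracy = sum(e == p for e, p in pairs) / len(pairs)

# Recall for privacy: correct privacy predictions over all true privacy rows.
privacy_hits = sum(e == p == "privacy" for e, p in pairs)
privacy_recall = privacy_hits / expected.count("privacy")
```

Overall accuracy is 950/1000 = 0.95 while privacy recall is 0/40 = 0.0, which is why the aggregate alone cannot gate a release.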

Common mistakes

The common failures are operational, not mathematical:

  • Collapsing topic and intent. “Refunds” is a topic; “start a refund” is an intent. Mixing them makes routing rules ambiguous.
  • Reporting only accuracy. Common topics dominate the score while rare legal, privacy, or safety topics lose recall.
  • Letting the label set drift. If product teams add new topics without versioning labels, old evals become incomparable.
  • Evaluating after tool execution only. Score the topic before downstream tools run, or the wrong route can hide behind a decent final answer.
  • Ignoring multi-label cases. Some inputs are both “billing” and “security”; force single-label eval only when the product workflow requires one route.
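For the multi-label case in the last item, one option is to score topic sets rather than single strings. This is a sketch under the assumption that each row stores its expected and predicted topics as collections; the helper name and the choice of exact-set match plus Jaccard overlap are illustrative, not a FutureAGI evaluator.

```python
def multilabel_scores(expected_topics, predicted_topics):
    """Return exact-set match and Jaccard overlap for one row's topic sets."""
    expected, predicted = set(expected_topics), set(predicted_topics)
    union = expected | predicted
    # Two empty sets agree completely, so treat that case as full overlap.
    jaccard = len(expected & predicted) / len(union) if union else 1.0
    return {"exact_match": expected == predicted, "jaccard": jaccard}

# Hypothetical row: the input is about both billing and security,
# but the classifier only returned billing.
scores = multilabel_scores(["billing", "security"], ["billing"])
```

Exact-set match is the strict gate; the Jaccard value shows how close a partial prediction came, which helps when triaging rows where one of two topics was missed.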

Frequently Asked Questions

What is topic classification?

Topic classification checks whether an LLM, router, or agent assigns text to the correct subject label. FutureAGI treats it as a labeled-output eval that compares predicted topics with ground truth and highlights class-level mistakes.

How is topic classification different from intent classification?

Topic classification labels what the content is about, such as billing, refunds, legal, or privacy. Intent classification labels what the user wants to do, such as cancel, escalate, compare, or troubleshoot.

How do you measure topic classification?

Run FutureAGI evaluators such as GroundTruthMatch, Equals, or FuzzyMatch against expected topic labels, then aggregate results into a confusion matrix, per-class precision, recall, and macro-F1. Track failures by cohort and prompt version.