Evaluation

What Is Intent Classification?

Intent classification assigns inputs to predefined user or task intent labels so systems can evaluate, route, and act correctly.

Intent classification is an LLM-evaluation task that maps a request, response, or agent step to a predefined intent label such as refund_request, password_reset, or escalate_to_human. It shows up in eval pipelines, production traces, and routing logic before a model selects a prompt, tool, policy, or workflow. FutureAGI teams measure it with labeled datasets and evaluator outputs so a misunderstood request does not trigger the wrong downstream action.

Why intent classification matters in production LLM and agent systems

Bad intent labels cause wrong work, not just wrong analytics. A support agent that reads “cancel my trial” as billing_question may answer politely while failing to stop renewal. A banking assistant that reads “card was stolen” as lost_card_info instead of fraud_report may send the user to the wrong workflow. In agent systems, one bad intent label can choose the wrong prompt, tool, knowledge base, guardrail, escalation path, or model route.

The pain is shared across teams. Developers see test failures that look unrelated because the classifier error happens before the visible model response. SREs see longer conversations, retries, and tool calls that do not match the user’s goal. Product teams see lower task completion and more “the bot ignored me” feedback. Compliance teams care when the wrong intent bypasses review, consent capture, or escalation.

Useful symptoms appear in logs before they appear in dashboards: sudden changes in class distribution, a high fallback_response rate for one intent, repeated handoff after a specific predicted label, or a rising eval-fail-rate-by-cohort after a prompt release. In 2026's multi-step pipelines, intent classification is often the first decision in a chain. If that first label is wrong, later groundedness, tool-selection, and task-completion scores degrade for reasons that look downstream but started at routing.
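As a concrete monitor for the distribution-drift symptom above, here is a minimal standard-library sketch. The function name and the choice of total-variation distance are illustrative, not a FutureAGI API:

```python
from collections import Counter

def intent_distribution_drift(baseline_labels, current_labels):
    """Total-variation distance between two predicted-intent distributions.

    Returns a value in [0, 1]; 0 means the distributions are identical,
    1 means they share no mass. Alert when this jumps after a release.
    """
    base = Counter(baseline_labels)
    curr = Counter(current_labels)
    n_base = sum(base.values()) or 1
    n_curr = sum(curr.values()) or 1
    intents = set(base) | set(curr)
    return 0.5 * sum(abs(base[i] / n_base - curr[i] / n_curr) for i in intents)

# Hypothetical windows: refund_request share drops from 50% to 20%.
baseline = ["refund_request"] * 50 + ["billing_question"] * 50
current = ["refund_request"] * 20 + ["billing_question"] * 80
print(round(intent_distribution_drift(baseline, current), 3))  # 0.3
```

A threshold on this number per prompt version gives a cheap early-warning signal before per-class metrics are recomputed.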

Teams should audit minority intents separately because production harm is rarely distributed evenly across labels.

How FutureAGI handles intent classification

FutureAGI's evaluator inventory does not include a single-purpose IntentClassification evaluator. The reliable pattern is to model intent classification as a labeled eval task: each row stores input, expected_intent, predicted_intent, model version, prompt version, and production cohort. Engineers can attach CustomEvaluation when they need domain-specific intent rules, or use Equals and GroundTruthMatch when the predicted label should match the gold label exactly.
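The labeled-row pattern can be sketched as a plain dataclass. The field names follow the description above, but the class itself is hypothetical, not a FutureAGI type:

```python
from dataclasses import dataclass

@dataclass
class IntentEvalRow:
    # Hypothetical schema mirroring the fields described above.
    input: str
    expected_intent: str
    predicted_intent: str
    model_version: str
    prompt_version: str
    cohort: str

row = IntentEvalRow(
    input="I want my money back for last month's charge",
    expected_intent="refund_request",
    predicted_intent="billing_question",  # a routing miss
    model_version="model-2026-01",
    prompt_version="router-v12",
    cohort="emea-support",
)
print(row.expected_intent == row.predicted_intent)  # False
```

Keeping model and prompt versions on every row is what makes slicing failures by release possible later.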

A real workflow starts with production traces from a support agent. The team samples recent conversations, labels expected intents such as cancel_subscription, upgrade_plan, refund_request, technical_issue, and needs_human. The classifier writes predicted_intent into the eval row and, when tracing is enabled, the decision is attached near the routing span with fields such as agent.trajectory.step and gen_ai.evaluation.score.value. When the refund_request recall drops below 0.85, the engineer opens the false-negative rows, sees that “chargeback” language was absent from the prompt examples, and adds a regression slice before the next release.
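The recall check in this workflow can be sketched with the standard library. Function and variable names are illustrative; 0.85 is the threshold from the example above:

```python
from collections import defaultdict

def per_class_recall(rows):
    """Recall per expected intent: of rows labeled X, how many were predicted X.

    rows: iterable of (expected_intent, predicted_intent) pairs.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for expected, predicted in rows:
        total[expected] += 1
        hits[expected] += (expected == predicted)
    return {intent: hits[intent] / total[intent] for intent in total}

rows = [
    ("refund_request", "refund_request"),
    ("refund_request", "billing_question"),  # false negative for refund_request
    ("cancel_subscription", "cancel_subscription"),
]
recall = per_class_recall(rows)
if recall["refund_request"] < 0.85:
    # Pull the false-negative rows for inspection, as in the workflow above.
    false_negatives = [r for r in rows
                       if r[0] == "refund_request" and r[1] != "refund_request"]
    print(false_negatives)
```

The false-negative rows are exactly where missing vocabulary (like the "chargeback" language above) tends to show up.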

FutureAGI’s approach is to keep the label, evidence, evaluator result, and trace together. Compared with a standalone sklearn classification_report(), which gives useful aggregate numbers but little production context, FutureAGI keeps the failed rows next to prompt version, model, trace, and cohort metadata. The next action can be precise: add examples, split an overloaded label, change a router threshold, or block release when macro-F1 drops.

How to measure or detect intent classification

Measure intent classification at both row and cohort level. A single example can be right or wrong; quality emerges from class-level patterns.

  • Equals returns whether predicted_intent exactly matches expected_intent; use it for deterministic labels.
  • GroundTruthMatch checks agreement against a labeled expected answer when label normalization or cloud scoring is preferred.
  • Confusion matrix shows which intents are confused, such as refund_request predicted as billing_question.
  • Per-class precision, recall, and F1 reveal minority-intent regressions that overall accuracy hides.
  • Dashboard signals include eval-fail-rate-by-cohort, fallback_response rate, escalation-rate, and intent-distribution drift by prompt version.

from fi.evals import Equals

metric = Equals()
correct = 0
for row in dataset:
    # Equals scores 1.0 when the predicted label exactly matches the gold label.
    result = metric.evaluate(response=row.predicted_intent,
                             expected_response=row.expected_intent)
    correct += result.score == 1.0
print(correct / len(dataset))  # overall exact-match accuracy
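The confusion matrix and per-class metrics in the list above can be computed without any external dependency. A minimal sketch (the function name is illustrative):

```python
from collections import Counter

def classification_breakdown(pairs):
    """pairs: iterable of (expected_intent, predicted_intent) tuples.

    Returns (confusion, report): a Counter keyed by (expected, predicted)
    and a dict of per-class precision, recall, and F1.
    """
    confusion = Counter(pairs)
    labels = {label for pair in confusion for label in pair}
    report = {}
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(c for (e, p), c in confusion.items() if p == label and e != label)
        fn = sum(c for (e, p), c in confusion.items() if e == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return confusion, report

pairs = [
    ("refund_request", "billing_question"),  # the confusion called out above
    ("refund_request", "refund_request"),
    ("billing_question", "billing_question"),
    ("billing_question", "billing_question"),
]
confusion, report = classification_breakdown(pairs)
print(confusion[("refund_request", "billing_question")])  # 1
print(report["refund_request"]["recall"])  # 0.5
```

The off-diagonal cells of `confusion` name the exact label pairs to fix with examples or a label split.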

For safety or escalation labels, set class-specific thresholds. Missing a needs_human intent is usually worse than over-escalating one harmless billing question.
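A class-specific release gate can be sketched as follows. The threshold values and names are illustrative assumptions, not FutureAGI defaults:

```python
# Hypothetical per-class recall thresholds; stricter for high-risk intents.
DEFAULT_RECALL_THRESHOLD = 0.85
CLASS_THRESHOLDS = {
    "needs_human": 0.98,   # missing an escalation is the worst failure mode
    "fraud_report": 0.95,
}

def release_gate(per_class_recall):
    """Return the intents that fail their class-specific recall threshold."""
    return [
        intent
        for intent, recall in per_class_recall.items()
        if recall < CLASS_THRESHOLDS.get(intent, DEFAULT_RECALL_THRESHOLD)
    ]

# needs_human at 0.97 fails its stricter 0.98 bar even though it would
# pass the default threshold.
print(release_gate({"needs_human": 0.97, "billing_question": 0.90}))
```

An empty return value means the release can proceed; a non-empty one names the intents that need regression slices first.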

Common mistakes

  • Treating every intent as equally costly; false negatives on fraud, safety, cancellation, or escalation intents need stricter thresholds than low-risk account questions.
  • Reporting only global accuracy, which hides rare intents that drive escalations, policy risk, refunds, and costly human review.
  • Mixing topic labels and intent labels, then wondering why billing questions, cancellation requests, and upgrade requests collide in the same class.
  • Letting product teams rename intents without versioning the dataset, classifier prompt, saved eval runs, and routing rules together.
  • Evaluating only first-turn messages even though agents often reveal the real intent after clarification, tool failure, or user correction.

Frequently Asked Questions

What is intent classification?

Intent classification assigns a request, output, or agent step to a predefined intent label. In LLM eval pipelines, it checks whether routing and downstream actions start from the right interpretation.

How is intent classification different from topic classification?

Intent classification predicts what the user or agent is trying to do, such as cancel_order or escalate_to_human. Topic classification predicts what the text is about, such as billing, shipping, or security.

How do you measure intent classification in FutureAGI?

Use FutureAGI evaluators such as CustomEvaluation, Equals, or GroundTruthMatch on labeled examples, then inspect accuracy, precision, recall, F1, and the confusion matrix by intent.