Intent Classification Evaluation Pipeline (2026)

Per-intent precision-recall, escalation accuracy, OOD detection, and drift gates for the LLM router that decides which pipeline runs.


The intent classifier passes every eval the team runs. Macro accuracy 0.92 on the balanced golden set. F1 reads 0.89. The router ships. Two weeks in, the on-call engineer notices that cancellation requests are routing to billing, the fraud-report queue is empty while fraud complaints sit in product-feedback, and the long-tail migration-question class — new since the platform launch — is silently classified as billing-question 80 percent of the time. The model didn’t break. The eval set hid the failure modes that actually run in production.

Intent classification eval is three problems wearing one mask: per-intent precision and recall on the real distribution, escalation accuracy on the intents that hand off to humans, and out-of-distribution detection on the inputs no class fits. A balanced-set accuracy score answers none of them. This guide walks the pipeline that does: taxonomy, per-class CI gates, calibrated escalation thresholds, OOD-as-a-class, drift detection on production traffic, and the Future AGI surfaces (ai-evaluation SDK, traceAI, Error Feed) that wire it together.

Why aggregate accuracy lies for intent classification

Production intent traffic is heavy-tailed. Five or six common intents (product-question, pricing, account-access, shipping-status, refund) carry 70 to 90 percent of volume. The long tail — cancel-subscription, fraud-report, policy-dispute, migration-question, legal-escalation, self-harm — is where the revenue events, on-call pages, and trust-and-safety incidents live.

The classic eval setup hides this: 100 examples per class, stratified, hand-labeled, with macro accuracy reported as one number. The classifier gets tuned to behave well on every class equally. Production then puts 85 percent of its weight on a handful of buckets, overall accuracy looks fine because those few are easy, and the long-tail classes that drove the project quietly collapse. Nobody notices until a downstream team starts seeing tickets in the wrong queue or a customer escalates on Twitter about a cancellation that never processed.

A 50-class classifier has 2,450 confusion pairs (50 × 49 off-diagonal cells). A traffic-weighted 0.92 can decompose into 0.97 on the common five and 0.31 on the long tail of 45 when the common five carry about 92 percent of volume, and the headline barely moves if the tail gets worse. Every downstream component conditions on the predicted intent: RAG corpus, tool pool, escalation policy. A misclassified router corrupts everything that runs after it. Our deeper take lives in evaluating LLM classifiers.

The fix is two eval sets. The production-distribution set mirrors live traffic and is what you report to the team. The per-class oversampled set holds 100 to 200 examples per class regardless of frequency and is what you debug against. Weighted F1 on the production set tells you what users experience; macro F1 on the oversampled set tells you whether the model can do every class. The gap is the calibration drift waiting to happen.
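To make the two headline numbers concrete, here is a minimal sketch of macro versus support-weighted F1 computed from a per-class report. The function name and the example figures are illustrative assumptions, not measured results; the report shape is simply a mapping from intent to its F1 and support.

```python
def macro_and_weighted_f1(report):
    """report maps intent -> {"f1": float, "support": int}.

    Macro F1 averages classes equally; weighted F1 averages them by support,
    so it tracks what users see when the report comes from production traffic.
    """
    total_support = sum(m["support"] for m in report.values())
    macro = sum(m["f1"] for m in report.values()) / len(report)
    weighted = sum(m["f1"] * m["support"] for m in report.values()) / total_support
    return macro, weighted


# Hypothetical numbers, for illustration only.
example_report = {
    "pricing":             {"f1": 0.96, "support": 900},
    "cancel-subscription": {"f1": 0.40, "support": 100},
}
macro, weighted = macro_and_weighted_f1(example_report)
print(f"macro={macro:.3f} weighted={weighted:.3f}")
```

With these made-up numbers the weighted score stays near 0.90 while the macro score sits at 0.68, which is the gap the two-set design is meant to surface.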

Per-intent precision-recall is the diff signal

Aggregate numbers hide failure. Per-intent numbers expose it. Compute four things per class on the production-distribution set: precision, recall, support, and F1. Render the confusion matrix as the primary artifact, not as an appendix.

from collections import defaultdict

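def per_intent_metrics(pairs):
    """Per-class precision, recall, F1, and support.

    `pairs` is a list of (predicted, actual) intent labels. It is iterated
    several times, so pass a list rather than a generator.
    """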
    classes = sorted({y for _, y in pairs} | {p for p, _ in pairs})
    report = {}
    for c in classes:
        tp = sum(1 for p, y in pairs if p == c and y == c)
        fp = sum(1 for p, y in pairs if p == c and y != c)
        fn = sum(1 for p, y in pairs if p != c and y == c)
        precision = tp / (tp + fp) if (tp + fp) else 0
        recall    = tp / (tp + fn) if (tp + fn) else 0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0)
        report[c] = {
            "precision": round(precision, 3),
            "recall":    round(recall, 3),
            "f1":        round(f1, 3),
            "support":   sum(1 for _, y in pairs if y == c),
        }
    return report

Three reading rules earn their keep here.

A low-precision intent is over-predicted. The classifier is using it as a fallback when it doesn’t know. Usually a rubric problem: the description is too broad, the few-shot pool is too lenient. Fix: tighten the definition and add contrastive few-shot examples that pin the boundary.

A low-recall intent is under-predicted. The classifier is defaulting to a more familiar adjacent class. Usually a coverage problem: the prompt doesn’t surface the actual vocabulary users send. Fix: add few-shot examples that cover the real signal, not the canonical phrasing.

Low precision and low recall together is the danger zone. Either the prompt is ambiguous, the label is poorly defined, or the model genuinely can’t distinguish the class from its neighbors. The confusion matrix tells you which. Read the cell, write the fix. Three or four prompt edits aimed at the worst confusion pairs typically lift macro F1 more than any model swap. The F1 score primer covers the per-class math in full.
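Reading the confusion matrix usually starts with the largest off-diagonal cells. A small sketch of that ranking, assuming the same (predicted, actual) pair format used for the per-intent metrics; the helper name `top_confusions` is made up for this example.

```python
from collections import Counter


def top_confusions(pairs, k=5):
    """Return the k most frequent misroutes as ((actual, predicted), count).

    pairs: iterable of (predicted, actual) intent labels.
    Correct predictions (the diagonal) are skipped.
    """
    cells = Counter(
        (actual, predicted) for predicted, actual in pairs if predicted != actual
    )
    return cells.most_common(k)


# Hypothetical predictions, for illustration only.
sample = [
    ("billing", "cancel-subscription"),
    ("billing", "cancel-subscription"),
    ("billing", "billing"),
    ("product-feedback", "fraud-report"),
]
worst = top_confusions(sample, k=1)
```

Each returned cell names one confusion pair, which is the unit a contrastive few-shot example or a tightened definition is aimed at.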

On the SDK side, CustomLLMJudge takes a Jinja2 grading prompt and grading_criteria. That’s the right shape for an intent-accuracy rubric, since the answer is “did predicted match expected, and if not, which sibling did it pick.”

from fi.evals import CustomLLMJudge

intent_judge = CustomLLMJudge(
    name="IntentClassificationAccuracy",
    grading_criteria=(
        "1.0 if predicted_intent exactly matches expected_intent. "
        "0.7 if predicted is a semantic sibling (billing vs subscription). "
        "0.3 if same top-level family. 0.0 otherwise. "
        "Name the confusion pair that fired."
    ),
    grading_prompt=(
        "Expected: {{ expected_intent }}\n"
        "Predicted: {{ predicted_intent }}\n"
        "User input: {{ user_input }}\nScore and explain."
    ),
)

Score the full golden set this way and you get per-example reasons, not just a scalar. The reasons cluster into named failure modes when you read them in bulk.

Confidence calibration decides when to escalate

The classifier emits a confidence score. The routing logic gates an action on it: fall back on low confidence, abstain for human review, fire the downstream tool only above 0.7. Those thresholds are meaningless until the classifier is calibrated.

LLM intent classifiers are usually miscalibrated in two shapes. Bimodal extremes: scores cluster at 0.95+ or 0.05-, skipping the 0.3 to 0.8 band entirely. No middle to threshold against. Class-dependent skew: a 0.7 on product-question is right 95 percent of the time; a 0.7 on fraud-report is right 55 percent. One global escalation threshold ships different precision profiles per class.

Measure calibration with a reliability diagram per intent on production data, not the balanced eval set:

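def reliability_per_intent(triples, num_bins=10):
    """Bin (predicted, actual, confidence) triples by confidence, per predicted intent.

    Assumes confidence lies in [0, 1]; relies on the `defaultdict` import above.
    """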
    bins = defaultdict(lambda: defaultdict(list))
    for pred, actual, conf in triples:
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[pred][idx].append(1 if pred == actual else 0)
    return {
        intent: [
            {
                "band": f"{i/num_bins:.1f}-{(i+1)/num_bins:.1f}",
                "empirical_accuracy": sum(b) / len(b) if b else None,
                "n": len(b),
            }
            for i, b in enumerate([bins[intent][i] for i in range(num_bins)])
        ]
        for intent in bins
    }

Set escalation thresholds per intent against the cost profile. Irreversible-action intents (cancel-subscription, fraud-report, legal-escalation) need high-precision floors. Set the threshold where empirical accuracy clears 0.95 and route everything below to a clarifying-question or human queue. Catch-net intents (human-escalation, out_of_distribution) need high-recall floors. Set them low so the system over-routes to safe paths rather than guessing. Apply Platt scaling or isotonic regression if the raw scores skew. The deeper note is in evaluating LLM confidence and uncertainty.

One discipline rule: per-intent thresholds get versioned with the rubric. A threshold change is a model change.
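One way to turn a precision target into a per-intent cutoff is to sweep the dev-slice predictions from most to least confident and keep the lowest confidence at which cumulative precision still clears the target. The sketch below assumes (predicted, actual, confidence) triples; the name `pick_threshold` and the `min_support` guard are illustrative choices, not part of any SDK.

```python
def pick_threshold(triples, intent, target_precision=0.95, min_support=20):
    """Lowest confidence cutoff at which predictions of `intent` reach
    target_precision, or None if no cutoff with enough support does.

    triples: iterable of (predicted, actual, confidence).
    """
    scored = sorted(
        ((conf, pred == actual) for pred, actual, conf in triples if pred == intent),
        reverse=True,
    )
    best = None
    correct = 0
    for n, (conf, hit) in enumerate(scored, start=1):
        correct += hit
        # Evaluate only after every prediction tied at this confidence is counted.
        if n < len(scored) and scored[n][0] == conf:
            continue
        if n >= min_support and correct / n >= target_precision:
            best = conf
    return best


# Hypothetical dev-slice triples, for illustration only.
dev = [
    ("cancel-subscription", "cancel-subscription", 0.9),
    ("cancel-subscription", "cancel-subscription", 0.8),
    ("cancel-subscription", "billing", 0.6),
    ("cancel-subscription", "cancel-subscription", 0.5),
]
strict = pick_threshold(dev, "cancel-subscription", 0.95, min_support=1)
loose = pick_threshold(dev, "cancel-subscription", 0.70, min_support=1)
```

A `None` result is itself a signal: the intent cannot hit its precision floor at any cutoff, so it should route to a clarifying question or a human until the rubric improves.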

Out-of-distribution detection is a first-class intent

Without an explicit out_of_distribution class, the model assigns every input the highest-probability label it knows. Adversarial inputs, malformed text, off-topic questions, and pure noise all get routed into real pipelines. A jailbreak attempt becomes a product-feedback ticket. A typo-laden migration question gets billed as a refund.

Treat OOD as a first-class intent. Train it on three example types: adversarial prompts (injection, role-play breaks, off-topic provocations), ambiguous inputs that genuinely fit no other class (vague complaints, multi-intent messages), and structured nonsense (gibberish, malformed input). Gate CI on OOD recall, and treat that gate as a ratchet: the recall floor never loosens. A PR that drops OOD recall by even one point blocks the merge.

Two inference signals route to OOD. Explicit prediction: the classifier picks out_of_distribution because the input matches the trained pattern. Implicit signal: no class clears its calibrated threshold, so the system falls through to OOD rather than picking the argmax. The implicit path catches genuinely novel inputs the trained OOD class hasn’t seen yet.
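The two paths can be sketched as a single routing function. This is a minimal illustration assuming calibrated per-class scores arrive as a dict; `route`, the `default_threshold` value, and the returned reason strings are hypothetical names for this example.

```python
OOD = "out_of_distribution"


def route(scores, thresholds, default_threshold=0.9):
    """Return (intent, reason) for one input.

    scores: intent -> calibrated confidence for this input.
    thresholds: intent -> per-intent cutoff; missing intents use default_threshold.
    """
    top = max(scores, key=scores.get)
    if top == OOD:
        return OOD, "explicit"  # the classifier itself picked the OOD class
    cleared = [
        intent for intent, score in scores.items()
        if intent != OOD and score >= thresholds.get(intent, default_threshold)
    ]
    if not cleared:
        return OOD, "implicit"  # no class cleared its cutoff, so do not guess
    return max(cleared, key=scores.get), "accepted"


# Hypothetical scores and cutoffs, for illustration only.
cutoffs = {"billing": 0.8, "refund": 0.8}
implicit = route({"billing": 0.6, "refund": 0.3, OOD: 0.1}, cutoffs)
accepted = route({"billing": 0.92, OOD: 0.08}, cutoffs)
explicit = route({"billing": 0.2, OOD: 0.8}, cutoffs)
```

Logging the reason alongside the intent keeps the explicit and implicit OOD volumes separable, which matters when the implicit share starts to grow.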

OOD doesn’t dead-end the conversation. It routes to a clarifying-question agent (“I can help with billing, returns, technical support, or cancellation. Which matches what you need?”) or a human queue. Routing an OOD input into a real pipeline costs more than sending a classifiable input through a clarifying turn it didn’t strictly need: the clarifying turn costs one extra exchange, while the wrong-pipeline answer is confidently wrong.

Production drift surfaces new intents

The taxonomy you ship is not the taxonomy production needs three months later. User vocabulary shifts every quarter. A product launch generates a new intent (migration-question after a consolidation). A policy change creates a new complaint shape (fee-dispute). Seasonal traffic introduces new asks (holiday-shipping, tax-season-question). A regulation introduces a new escalation path (data-deletion-request).

Two drift signals catch this. Existing-intent drift: precision or recall on a known class moves more than 2 points week-over-week. The class hasn’t broken; the input distribution shifted under it. Flag, refresh the few-shot pool with recent examples, re-baseline. New-intent emergence: the OOD bucket grows. Cluster OOD predictions and low-confidence failures weekly with HDBSCAN and look for clusters that stay above a volume threshold for two weeks running.

Each persistent cluster is a candidate intent. Decision tree: if it’s a genuine new ask the system should handle, split it into its own class — label 50 examples, add a definition, run the eval suite. If it’s an adversarial pattern, expand the OOD training set and rerun calibration. If it’s a transient (one news event, one campaign), monitor without changing the taxonomy.
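The existing-intent drift check is simple enough to sketch without any clustering library. The example below assumes two weekly per-intent reports with precision and recall as fractions; `drift_flags` and the small float tolerance are illustrative choices.

```python
def drift_flags(last_week, this_week, max_delta=0.02):
    """Flag intents whose precision or recall moved more than max_delta.

    Both arguments map intent -> {"precision": float, "recall": float}.
    Returns intent -> {metric: signed change} for the metrics that moved.
    """
    eps = 1e-9  # keeps float noise from flagging an exact 2-point move
    flags = {}
    for intent in sorted(last_week.keys() & this_week.keys()):
        moved = {}
        for metric in ("precision", "recall"):
            delta = this_week[intent][metric] - last_week[intent][metric]
            if abs(delta) > max_delta + eps:
                moved[metric] = round(delta, 3)
        if moved:
            flags[intent] = moved
    return flags


# Hypothetical weekly reports, for illustration only.
previous = {
    "refund":  {"precision": 0.90, "recall": 0.88},
    "pricing": {"precision": 0.95, "recall": 0.94},
}
current = {
    "refund":  {"precision": 0.85, "recall": 0.88},
    "pricing": {"precision": 0.94, "recall": 0.95},
}
flags = drift_flags(previous, current)
```

Intents present in only one of the two weeks are skipped here on purpose; a class that appears or disappears is a taxonomy change, not drift.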

Error Feed runs the HDBSCAN clustering over failing production predictions and a Sonnet 4.5 Judge writes an immediate_fix per cluster. A typical output: “47 misroutes. Pattern: billing-question intents misrouted to product-feedback when the input contains ‘subscription’ alongside a complaint verb. Immediate fix: add three few-shot examples covering subscription-complaint pattern; consider splitting subscription-billing as its own class.” Today the cluster summaries push to Linear; other ticketing integrations are on the roadmap.

traceAI on the router emits the OpenTelemetry spans that feed both drift loops:

from fi_instrumentation import register, ProjectType
from fi_instrumentation.fi_types import FiSpanKindValues
from opentelemetry import trace

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="support-router",
)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("intent_router") as span:
    span.set_attribute("fi.span.kind", FiSpanKindValues.CHAIN.value)
    span.set_attribute("session.id", session_id)
    # Classify first: the tag below needs the prediction and its confidence.
    predicted, conf = classify(user_input)
    span.set_attribute("tag.tags",
        f"intent:{predicted},confidence:{conf:.2f}")
    span.set_attribute("output.value", predicted)

fi.span.kind=CHAIN makes the router filterable as a routing node. The intent and confidence in tag.tags let you slice the dashboard and watch per-intent latency, accuracy, and confidence drift in one view. For the broader picture, our guide on best AI agent observability tools covers how traceAI sits next to the rest of the stack.

The eval pipeline, end to end

The build, in five steps.

Step 1: define the taxonomy. 5 to 15 top-level intents; expand to 30 to 50 only if the data demands it. Each intent gets a one-sentence definition, three to five canonical examples, and one explicit anti-example. Reserve a slot for out_of_distribution. Validate with annotator agreement before you build anything else — three labelers, 100 examples, Cohen’s kappa per pair. Any class under 0.7 kappa is a taxonomy bug, not a model bug. The common mistake: defining intents by what the agent does (“run-refund-flow”) instead of what the user wants (“user-wants-refund”). The first is a tool; the second is the intent.
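For the agreement check in Step 1, Cohen’s kappa between two labelers can be computed directly from their label lists. This is a plain-Python sketch of the standard formula; the function name and the sample labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same examples in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels, for illustration only.
annotator_1 = ["refund", "refund", "cancel", "cancel"]
annotator_2 = ["refund", "refund", "cancel", "refund"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

To get a per-class figure, one option is to binarize both lists to `label == target_class` and pass the resulting booleans to the same function, once per class and per labeler pair.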

Step 2: build the golden set. 200 to 500 labeled examples, no fewer than 20 per class, weighted toward the hardest 10 percent of observed failures. Split three ways: train slice for prompt tuning, dev slice for threshold calibration, held-out test slice that never leaks. Refresh the test slice quarterly from new production samples.
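A three-way split that keeps every class represented in each slice can be sketched with the standard library. The split fractions and the seed below are assumptions for illustration; the guide does not prescribe them.

```python
import random
from collections import defaultdict


def stratified_split(examples, fractions=(0.5, 0.25, 0.25), seed=13):
    """Split (text, intent) examples into train, dev, and test per class.

    fractions[0] and fractions[1] set the train and dev shares; whatever
    remains in each class goes to the held-out test slice.
    """
    by_intent = defaultdict(list)
    for example in examples:
        by_intent[example[1]].append(example)

    rng = random.Random(seed)  # fixed seed keeps the test slice stable across runs
    train, dev, test = [], [], []
    for intent in sorted(by_intent):
        items = by_intent[intent]
        rng.shuffle(items)
        n_train = int(len(items) * fractions[0])
        n_dev = int(len(items) * fractions[1])
        train += items[:n_train]
        dev += items[n_train:n_train + n_dev]
        test += items[n_train + n_dev:]
    return train, dev, test


# Hypothetical golden set: 20 examples for each of two intents.
golden = [(f"refund text {i}", "refund") for i in range(20)]
golden += [(f"cancel text {i}", "cancel") for i in range(20)]
train, dev, test = stratified_split(golden)
```

Splitting per class rather than over the whole pool is what guarantees the rare intents show up in the held-out slice at all.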

Step 3: instrument and shadow-score. Wire traceAI into the router. Run the eval suite on every prediction in shadow mode for a week: the router serves its live prediction as usual, and the eval suite scores that prediction against ground truth from a delayed feedback loop (ticket-resolution-category, user thumbs, escalation-completion). This gives you the baseline confusion matrix you’ll diff against.

Step 4: gate deploys on per-intent thresholds. The CI gate runs the eval suite on the held-out test slice on every PR. Fail the merge if any class drops more than 2 points of precision below the production baseline, if absolute per-intent precision falls under the floor (0.85 high-volume, 0.7 long-tail, 0.95 safety-overlap), or if OOD recall drops at all. Render the confusion-matrix delta in the PR comment. The diff is the signal. See agent evaluation frameworks for the CI shape.
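The Step 4 rules translate into a short check that returns the reasons a PR should be blocked. The floors are the ones quoted in this step; the tier assignment per intent, the function name, and the float tolerance are assumptions for this sketch.

```python
# Precision floors from Step 4; the tier names are labels chosen for this example.
FLOORS = {"high-volume": 0.85, "long-tail": 0.70, "safety-overlap": 0.95}
OOD = "out_of_distribution"
EPS = 1e-9  # tolerance so float noise does not trip an exact-boundary check


def gate(baseline, candidate, tiers, max_drop=0.02):
    """Return a list of failure messages; an empty list means the PR may merge.

    baseline, candidate: intent -> {"precision": float, "recall": float}.
    tiers: intent -> key of FLOORS; unlisted intents are treated as long-tail.
    """
    failures = []
    for intent, cand in sorted(candidate.items()):
        base = baseline.get(intent)
        if base and base["precision"] - cand["precision"] > max_drop + EPS:
            failures.append(f"{intent}: precision fell more than {max_drop:.2f} below baseline")
        floor = FLOORS[tiers.get(intent, "long-tail")]
        if cand["precision"] < floor - EPS:
            failures.append(f"{intent}: precision {cand['precision']:.2f} is under the {floor:.2f} floor")
    if OOD in baseline and OOD in candidate:
        if candidate[OOD]["recall"] < baseline[OOD]["recall"] - EPS:
            failures.append(f"{OOD}: recall dropped below baseline")
    return failures


# Hypothetical reports, for illustration only.
baseline = {
    "pricing": {"precision": 0.93, "recall": 0.90},
    OOD:       {"precision": 0.80, "recall": 0.90},
}
regressed = {
    "pricing": {"precision": 0.84, "recall": 0.90},
    OOD:       {"precision": 0.80, "recall": 0.88},
}
tiers = {"pricing": "high-volume"}
failures = gate(baseline, regressed, tiers)
```

Returning messages rather than a boolean makes it easy to paste the reasons into the same PR comment as the confusion-matrix delta.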

Step 5: close the loop with Error Feed. Failing predictions stream in, HDBSCAN clusters them, the Sonnet 4.5 Judge writes an immediate_fix per cluster, and the fixes flow back into the few-shot pool, the taxonomy, or the per-intent thresholds. The loop runs from production failure to cluster, to fix recommendation, to retuned threshold, to ship, to measure. If you’ve read agent passes evals fails production, this is the routing-layer version of the same loop.

One anti-pattern: treating the classifier as separate from the agent. The intent classifier is the first hop of the agent graph, not a service to evaluate in isolation. Score it on its own metrics, then score end-to-end task completion conditional on routing correctness. If task completion is 0.85 overall but 0.45 on the misrouted slice, the headline number is answering the wrong question.
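Conditioning task completion on routing correctness needs only two booleans per conversation. A minimal sketch, assuming each record says whether the router was right and whether the task finished; the function and field names are illustrative.

```python
def completion_by_routing(records):
    """Task-completion rate overall and split by routing correctness.

    records: iterable of (routed_correctly, task_completed) booleans.
    A rate is None when its slice is empty.
    """
    done = {True: 0, False: 0}
    total = {True: 0, False: 0}
    for routed_ok, completed in records:
        total[bool(routed_ok)] += 1
        done[bool(routed_ok)] += int(bool(completed))

    def rate(finished, count):
        return finished / count if count else None

    return {
        "overall": rate(done[True] + done[False], total[True] + total[False]),
        "routed_correctly": rate(done[True], total[True]),
        "misrouted": rate(done[False], total[False]),
    }


# Hypothetical outcomes, for illustration only.
outcomes = [(True, True)] * 8 + [(False, True), (False, False)]
summary = completion_by_routing(outcomes)
```

Reporting the misrouted slice next to the overall rate shows how much of the end-to-end loss belongs to the router rather than to the pipelines behind it.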

What FAGI ships for intent classification

The intent-eval workflow has five moving parts. Future AGI ships all five as one stack.

Dataset and taxonomy. The Platform stores eval datasets with intent, persona, and class metadata as first-class fields, so the production-distribution sample and the per-class oversampled set are two views over the same versioned dataset. The futureagi-sdk Client uploads samples programmatically and tags them for cohort analysis.

Deterministic floor. RegexScanner runs sub-10ms and short-circuits the LLM hop on guaranteed routes — order ID patterns, slash-commands like /escalate, refund-policy URL detection. Stack it with Contains, Equals, StartsWith, and JSONSchema as a local heuristic layer. For most intent workloads the deterministic floor routes 30 to 60 percent of traffic without an LLM call.
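The idea behind a deterministic floor can be shown with the standard library alone. The sketch below uses Python’s `re` module as a stand-in and is not the SDK’s RegexScanner API; the patterns and the intents they map to are hypothetical.

```python
import re

# Hypothetical rules: (intent, compiled pattern). Order matters; first match wins.
RULES = [
    ("human-escalation", re.compile(r"^\s*/escalate\b", re.IGNORECASE)),
    ("shipping-status", re.compile(r"\border\s*#?\s*\d{6,}\b", re.IGNORECASE)),
]


def deterministic_route(text):
    """Return an intent when a rule matches, else None to fall through to the LLM."""
    for intent, pattern in RULES:
        if pattern.search(text):
            return intent
    return None


escalated = deterministic_route("/escalate please")
order_lookup = deterministic_route("Where is order #1234567?")
fall_through = deterministic_route("I am unhappy with my plan")
```

Returning `None` rather than a default label keeps the fall-through explicit, so the LLM classifier only sees traffic the rules could not settle.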

Calibrated LLM judge. CustomLLMJudge takes the IntentClassificationAccuracy rubric with a grading_criteria string and optional few-shot examples. 13 guardrail backends double as classifier backbones, including TURING_FLASH for high-QPS routers, LLAMAGUARD_3_8B/3_1B for safety-overlap intents, QWEN3GUARD_8B/4B/0.6B for multilingual traffic, plus GRANITE_GUARDIAN_8B, WILDGUARD_7B, SHIELDGEMMA_2B, OPENAI_MODERATION, AZURE_CONTENT_SAFETY. The Guardrails ensemble runs multiple backends with RailType.INPUT and AggregationStrategy.WEIGHTED so you blend a fast backend with a heavier safety check.

Routing observability. traceAI (50+ AI surfaces across Python, TypeScript, Java, C#) emits spans on every router call with fi.span.kind=CHAIN, the predicted intent in tag.tags, and session/user IDs for cohort drift. The Platform’s EvalTag attachments let the same eval definition run in CI and on production spans, so the contract follows the rubric, not the run.

Closed loop. Error Feed runs HDBSCAN over ClickHouse on failing predictions and the Sonnet 4.5 Judge writes an immediate_fix per cluster. Fixes flow into the Platform’s self-improving evaluators, which retune per-intent thresholds against fresh labels. The in-product authoring agent lets non-engineer reviewers tune rubrics in natural language. Eval-driven prompt optimization ships today through agent-opt, with six optimizers including BayesianSearchOptimizer (teacher-inferred few-shot, resumable Optuna trials) plus EarlyStoppingConfig to cap budget. The Platform comes in at a lower per-eval cost than Galileo Luna-2.

Ready to wire the pipeline? Start with the ai-evaluation SDK, build a 100-per-class golden set, configure CustomLLMJudge against an IntentClassificationAccuracy rubric, and gate CI on per-intent precision floors from day one. The confusion matrix is where the next prompt edit lives.

Three takeaways for 2026

  1. Score per intent on the real distribution. Production-distribution set for the headline, per-class oversampled set for debugging, confusion matrix as the primary diff. Aggregate accuracy hides the long-tail collapse that kills the routing layer.
  2. Calibrate per intent, not globally. Reliability diagrams per class, per-intent thresholds against the cost profile, abstention to OOD or human when no class clears the floor. One global escalation threshold ships different precision profiles per class.
  3. OOD is a first-class intent and drift is a weekly job. Train it explicitly, gate CI on its recall, cluster the OOD bucket weekly with HDBSCAN to surface emergent intents, and refresh the taxonomy on a quarterly cadence.

Intent classification is the bedrock of every agent decision after it. Evaluate it that way.

Frequently asked questions

Why does aggregate accuracy lie for intent classification?
Production intent traffic is heavy-tailed. Five or six common intents carry 70 to 90 percent of volume; the long tail of escalation, cancel, fraud, and policy-question intents carries the revenue and the on-call pages. A model that scores 0.92 accuracy on a balanced eval set can ship at 0.35 recall on `cancel-subscription` because the eval weights every class equally and production does not. The signal that matters is per-intent precision and recall on the live distribution, plus escalation-accuracy on the intents that trigger a human handoff. Aggregate accuracy hides exactly the failures that route the wrong conversation into the wrong pipeline.
What does per-intent precision-recall actually catch?
Precision per class tells you whether predictions for that intent are trustworthy. Recall per class tells you whether real instances of that intent are getting caught. Together they expose the two failure shapes a scalar score hides: the over-predicted class that gobbles ambiguous traffic (low precision, high recall) and the under-predicted class that the model defaults away from when it's unsure (high precision, low recall). The confusion matrix names the pair: when `billing-question` predictions are 60 percent `product-feedback` actuals, the prompt or rubric on those two intents is what needs the edit, not the model.
How do I calibrate confidence so escalation fires at the right cutoff?
LLM intent classifiers are usually miscalibrated. They cluster scores at 0.05 and 0.95 with no middle, and the same 0.7 score means 95 percent accuracy on common classes and 55 percent on rare ones. Measure a reliability diagram per intent on production data, set per-class thresholds against your precision target (high precision floor for irreversible actions like `cancel-subscription` or `fraud-report`, high recall floor for catch-nets like `human-escalation`), and apply Platt scaling or isotonic regression if the raw scores skew. Don't ship a single global escalation threshold; the cost profile is different per intent.
How does out-of-distribution detection work in an intent pipeline?
OOD is a first-class intent, not noise. Train an explicit `out_of_distribution` class with adversarial, off-topic, and ambiguous examples, and gate CI on its recall — it never gets weaker. At inference time, treat low max-probability as an OOD signal: if no class clears its calibrated threshold, route to OOD rather than picking the highest-probability label. The fallback path goes to a clarifying-question agent or human queue, not a real pipeline. Without an OOD class the model assigns every input a label, and garbage routes into real flows with confident wrong answers.
How do I detect new intents appearing in production?
Cluster the OOD bucket weekly. HDBSCAN soft-clustering over the failures and low-confidence predictions surfaces emergent patterns — a new product launch generates a new intent (`migration-question`), a policy change creates a new complaint shape (`fee-dispute`), seasonal traffic introduces a new ask (`holiday-shipping`). Each cluster is a candidate intent: if it sits above a volume threshold for two weeks, add it to the taxonomy, label 50 examples, retrain or refresh the few-shot pool, and re-baseline the CI gate. Drift detection on the existing classes runs in parallel — flag any class whose precision moves more than 2 points week-over-week.
What's in the FAGI eval stack for intent classification?
Three surfaces compose. The `ai-evaluation` SDK (Apache 2.0) ships `CustomLLMJudge` with a Jinja2 grading prompt and `grading_criteria` — the right shape for an `IntentClassificationAccuracy` rubric — plus deterministic `RegexScanner` for guaranteed routes that should never hit an LLM (order IDs, `/escalate`, explicit URLs). 13 guardrail backends double as classifier backbones with `Guardrails` ensemble support across `RailType.INPUT` and `AggregationStrategy.WEIGHTED`. traceAI emits OpenTelemetry spans with the predicted intent and confidence on every request, which is what feeds the per-cohort drift dashboards. Error Feed clusters failing predictions with HDBSCAN and a Sonnet 4.5 Judge writes an `immediate_fix` per cluster.
How do I gate deploys on intent classifier changes?
Score the held-out test slice on every PR. Block the merge if any class drops more than 2 points of precision below the production baseline, if absolute per-class precision falls under a floor (0.85 high-volume, 0.7 long-tail, 0.95 for safety-overlap), or if the `out_of_distribution` recall drops at all. Per-class floors matter because a global 0.92 accuracy can hide a `billing-question` collapse that ships a refund-policy regression to 10 percent of users. Render the confusion-matrix delta in the PR comment — the diff is the signal, not the scalar.