
Intent Classification LLM Pipeline: 2026 Best Practices

A vendor-neutral 2026 intent classification pipeline. Data, judge prompt, eval, and deploy. Runs end-to-end on OpenAI + traceAI without proprietary SDKs.


A customer-support agent at a fintech ships a single-prompt approach: one big system message that handles refunds, churn, billing questions, and escalations. By month three, response quality is uneven, the prompt is 4,000 tokens, and adding a new intent breaks two existing ones. The fix is structural: split the work. A small classifier picks the intent, and the downstream agent gets a tighter prompt for that intent only. Refund quality climbs, prompt tokens drop, and adding a new intent is a single dataset update plus a small prompt change.

This guide is a vendor-neutral, code-complete walkthrough of how to build a production intent classification pipeline in 2026. The reference implementation uses OpenAI’s SDK plus traceAI (Apache 2.0) for instrumentation. The same pipeline works with Anthropic, Google, or OSS models. No proprietary SDK is required.

TL;DR: The 4-stage pipeline

Stage | What it does | Tools
1. Ingest | Receive query, attach trace_id, redact PII | OTel collector, traceAI
2. Retrieve exemplars | Embed query, pull 3-5 similar labeled examples | OpenAI embeddings, pgvector
3. Classify | Small LLM call with structured output | gpt-5-nano or Claude Haiku 4.5
4. Validate and score | Schema validation, deterministic fallback, span-attached judge | OpenAI judge or Turing-Flash

If you only read one row: split the work. A small classifier with structured output and an exemplar-based prompt routes more accurately than a big monolithic agent prompt.

[Diagram: INTENT CLASSIFICATION PIPELINE. Four stages (INGEST USER QUERY, EMBED + RETRIEVE EXEMPLARS, LLM CLASSIFIER, VALIDATE + SCORE) feed a ROUTE BY INTENT? decision diamond with arrows to REFUND_FLOW, FAQ_FLOW, and ESCALATE_HUMAN.]

Stage 1: Ingest

The ingest stage is where the trace begins. Three responsibilities.

Attach trace_id. Every downstream span shares this id. Without it, the four stages are four disconnected calls.

Redact PII. Before the query hits the classifier, strip credit card numbers, social security numbers, email addresses, phone numbers, and any other regulated identifiers. The illustrative regex below covers only two narrow patterns; production PII redaction needs a real library like Presidio with named-entity detection plus rule-based recognizers. The trace store should never see raw PII.

Normalize. Lowercase, collapse whitespace, strip HTML and emoji. The classifier sees more uniform input; the embedding model behaves better.

import re
from opentelemetry import trace

tracer = trace.get_tracer("intent.pipeline")

def ingest(raw_query: str) -> str:
    with tracer.start_as_current_span("intent.ingest") as span:
        # Redact PII (use Presidio in production)
        query = re.sub(r"\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}", "[CARD]", raw_query)
        query = re.sub(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", "[EMAIL]", query)
        query = " ".join(query.lower().split())
        span.set_attribute("query.length", len(query))
        return query

Stage 2: Retrieve exemplars

Few-shot prompting outperforms zero-shot on classification by 10-30 percent in most workloads. The pattern: embed the incoming query, retrieve the 3-5 most similar labeled examples from the dataset, include them in the classifier prompt.

The retrieval store should be one of: pgvector, Chroma, Weaviate, Qdrant, or LanceDB. The store choice is downstream of your existing infra; pgvector is the lowest-friction pick if you already run Postgres.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def retrieve_exemplars(query: str, k: int = 5) -> list[dict]:
    with tracer.start_as_current_span("intent.retrieve_exemplars") as span:
        # 1. Embed the query
        emb = client.embeddings.create(
            model="text-embedding-3-small",
            input=query,
        ).data[0].embedding
        # 2. Vector search against your labeled dataset (pgvector etc.)
        rows = pgvector_search(emb, k=k)  # returns [{"query": ..., "intent": ..., "distance": ...}]
        span.set_attribute("exemplars.count", len(rows))
        return rows
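
pgvector_search above is left as a stub. A minimal sketch of one way to implement it, assuming psycopg plus the pgvector Python adapter, an exemplars table of (query text, intent text, embedding vector(1536)), and a PG_DSN environment variable; swap in your own schema:

import os

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def pgvector_search(embedding: list[float], k: int = 5) -> list[dict]:
    # <=> is pgvector's cosine-distance operator; back it with an hnsw index
    with psycopg.connect(os.environ["PG_DSN"]) as conn:
        register_vector(conn)
        rows = conn.execute(
            "SELECT query, intent, embedding <=> %s AS distance "
            "FROM exemplars ORDER BY distance LIMIT %s",
            (np.array(embedding), k),
        ).fetchall()
    return [{"query": q, "intent": i, "distance": d} for q, i, d in rows]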

The dataset behind this retrieval is the heart of the pipeline. Three sources:

  • Hand-labeled production queries. 200-500 rows is the floor. Label each row with the intent and a confidence (high, medium, low).
  • Synthetic generation. A frontier model (GPT-5.5, Claude Opus 4.7) generates synthetic queries conditioned on each intent label. Usual yield: 1,500-3,000 rows.
  • Negative feedback expansion. Rows where production users gave thumbs-down or escalated. The regression layer.

Stratify across difficulty: easy (clear single intent), medium (ambiguous between two), adversarial (intentionally misleading or short). Without stratification, the eval hides the failure modes that matter.
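
The synthetic layer deserves a sketch. One way to generate it, reusing the OpenAI client from Stage 2 and conditioning on both intent and difficulty; the model name and prompt wording here are illustrative, not prescriptive:

from pydantic import BaseModel

class SynthBatch(BaseModel):
    queries: list[str]

def synth_queries(intent: str, difficulty: str, n: int = 20) -> list[str]:
    resp = client.responses.parse(
        model="gpt-5.5",  # any frontier model works; the conditioning is what matters
        input=[{"role": "user", "content": (
            f"Write {n} distinct customer-support queries with intent '{intent}' "
            f"and difficulty '{difficulty}' (easy = clear single intent, "
            f"medium = ambiguous between two intents, adversarial = misleading or very short)."
        )}],
        text_format=SynthBatch,
    )
    return resp.output_parsed.queries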

Stage 3: LLM classifier

The classifier is a small LLM with structured output. Four design choices.

Model. Distilled or small-tier (gpt-5-nano, Claude Haiku 4.5, Llama 4 Scout or Llama 3.1 8B for self-host). A frontier model is overkill and adds 200-500 ms latency.

Structured output. JSON schema with intent (one of N) and confidence (high, medium, low). Both OpenAI and Anthropic support strict structured output as of 2026. Do not parse free-form text.

Prompt. Short system prompt naming the intents and rules. User message includes the few-shot exemplars and the query.

Temperature. 0 for classification. The task is deterministic.

from pydantic import BaseModel
from typing import Literal

INTENTS = Literal["refund", "billing", "account", "faq", "escalate", "churn_risk", "unknown"]

class IntentResult(BaseModel):
    intent: INTENTS
    confidence: Literal["high", "medium", "low"]
    reasoning: str

def classify(query: str, exemplars: list[dict]) -> IntentResult:
    with tracer.start_as_current_span("intent.classify") as span:
        examples_text = "\n".join(
            f"- Query: {e['query']}\n  Intent: {e['intent']}" for e in exemplars
        )
        system = (
            "You classify customer-support queries into one of: "
            "refund, billing, account, faq, escalate, churn_risk, unknown. "
            "Use 'unknown' if you cannot decide between two."
        )
        user = f"Examples:\n{examples_text}\n\nQuery: {query}\n\nClassify:"
        resp = client.responses.parse(
            model="gpt-5-nano",
            input=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
            text_format=IntentResult,
            temperature=0,
        )
        result = resp.output_parsed
        span.set_attribute("intent.predicted", result.intent)
        span.set_attribute("intent.confidence", result.confidence)
        span.set_attribute("gen_ai.usage.input_tokens", resp.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", resp.usage.output_tokens)
        return result

The reasoning field is the audit trail. It is not used downstream but is invaluable when debugging misclassifications.

Stage 4: Validate and score

The validate-and-score stage has three jobs.

Schema validation. The model occasionally returns invalid JSON or an unknown intent. Pydantic’s parse plus a try/except is the floor.

Deterministic fallback. Low-confidence predictions go through a deterministic check (regex, allow-list, keyword match) before falling back to “escalate” or “unknown”. This cuts judge spend by 60-80 percent in most workloads.

Span-attached judge score. A judge model (frontier offline, distilled online) verifies the classification on a sample. The judge runs asynchronously; the classification ships immediately, the judge attaches its score to the same span_id post-hoc.

def validate_and_score(query: str, result: IntentResult) -> dict:
    with tracer.start_as_current_span("intent.validate") as span:
        # Deterministic fallback for low-confidence
        if result.confidence == "low":
            if re.search(r"\b(refund|money back|charged twice)\b", query):
                result = IntentResult(intent="refund", confidence="medium",
                                      reasoning="deterministic fallback: refund keywords")
            else:
                result = IntentResult(intent="escalate", confidence="medium",
                                      reasoning="deterministic fallback: low-confidence escalate")
        # Async judge call (fire-and-forget; correlate via the W3C traceparent carrier
        # so the worker can resume the trace context and emit a child span).
        from opentelemetry.propagate import inject
        carrier = {}
        inject(carrier)  # populates 'traceparent' (and 'tracestate' if set)
        enqueue_judge_call(traceparent=carrier.get("traceparent"),
                           query=query, predicted=result.intent)
        span.set_attribute("intent.final", result.intent)
        return {"intent": result.intent, "confidence": result.confidence}

The worker consuming the queue extracts the traceparent and starts a remote child span, so the judge verdict appears as a child of the original intent.validate span with a judge.verdict attribute.
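
A minimal sketch of that worker, assuming jobs arrive as dicts carrying the traceparent string and that run_judge stands in for whatever judge call you make (both names are placeholders):

from opentelemetry.propagate import extract

def judge_worker(job: dict) -> None:
    # Rebuild the producer's trace context from the serialized traceparent
    ctx = extract({"traceparent": job["traceparent"]})
    with tracer.start_as_current_span("intent.judge", context=ctx) as span:
        verdict = run_judge(job["query"], job["predicted"])  # placeholder judge call
        span.set_attribute("judge.verdict", verdict)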

Eval and CI gating

The eval suite is a separate pytest harness that runs against the dataset on every PR touching the classifier prompt or model.

# tests/test_intent_eval.py
import pytest
from intent_pipeline import classify, retrieve_exemplars
from sklearn.metrics import precision_recall_fscore_support

@pytest.mark.parametrize("split", ["easy", "medium", "adversarial"])
def test_intent_accuracy(split):
    dataset = load_dataset(f"intent_eval_{split}.jsonl")
    preds, golds = [], []
    for row in dataset:
        exemplars = retrieve_exemplars(row["query"], k=5)
        result = classify(row["query"], exemplars)
        preds.append(result.intent)
        golds.append(row["intent"])
    precision, recall, f1, _ = precision_recall_fscore_support(
        golds, preds, average="weighted", zero_division=0)
    # Per-split thresholds; calibrate against the incumbent
    thresholds = {"easy": 0.78, "medium": 0.70, "adversarial": 0.60}
    assert f1 >= thresholds[split], f"{split} regression: f1={f1:.3f} < {thresholds[split]}"
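
load_dataset above is assumed; a minimal JSONL reader is enough:

import json

def load_dataset(path: str) -> list[dict]:
    # One labeled row per line: {"query": ..., "intent": ..., "difficulty": ...}
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]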

The thresholds are calibrated against the incumbent. A regression on any split blocks the merge.

Production deployment

Five operational details matter.

Per-intent monitoring. Rolling-mean precision, recall, F1 per intent. Alert on 3-5 percent moves.
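
A sketch of that alert, assuming a windowed buffer of (gold, predicted) pairs where gold comes from judge verdicts or human review, and per-intent baselines from calibration; the 0.04 tolerance mirrors the 3-5 percent band:

from sklearn.metrics import f1_score

def drift_alerts(golds: list[str], preds: list[str],
                 baseline: dict[str, float], tol: float = 0.04):
    # Yield every intent whose windowed F1 fell more than `tol` below baseline
    for intent, base in baseline.items():
        gold_bin = [int(g == intent) for g in golds]
        pred_bin = [int(p == intent) for p in preds]
        f1 = f1_score(gold_bin, pred_bin, zero_division=0)
        if f1 < base - tol:
            yield intent, f1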

Confusion matrix. Track which intents get confused with which. A spike in “refund-confused-with-billing” tells you which exemplars are missing.
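
Surfacing the worst pair from the same window takes a few lines; the label list mirrors the classifier's intent set:

import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = ["refund", "billing", "account", "faq", "escalate", "churn_risk", "unknown"]

def worst_confusion(golds: list[str], preds: list[str]) -> tuple[str, str, int]:
    cm = confusion_matrix(golds, preds, labels=LABELS)
    off_diag = cm - np.diag(np.diag(cm))  # zero the diagonal, keep only confusions
    i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
    return LABELS[i], LABELS[j], int(off_diag[i, j])  # (true, predicted, count)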

Per-cohort A/B with eval-gated rollback. Ship a new classifier prompt to 5 percent of traffic. Monitor per-cohort precision over a 1-hour window. Rollback if any intent regresses below threshold.
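
For the cohort split itself, stable hash-based bucketing keeps the same user in the same cohort across requests; a sketch (the 5 percent default mirrors the rollout above):

import hashlib

def cohort(user_id: str, rollout_pct: int = 5) -> str:
    # Deterministic bucket in [0, 100); no state to store, no flicker between requests
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < rollout_pct else "incumbent"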

Dataset auto-build. Misclassifications (judge says wrong, user escalates, deterministic fallback fires) flow into a regression dataset for next week’s eval.

Annotation queue. Low-confidence rows and judge-disagreement rows flow into a human review queue. Reviewers correct the labels; corrected rows feed the dataset.

Common mistakes when building intent pipelines

  • Too many intents. Above 12-15, classifier accuracy drops. Group near-duplicates; spawn sub-classifiers if needed.
  • No synthetic data. Hand-labeled alone is too small to defend.
  • No calibration set. Without 200 human labels, judge agreement is unverified.
  • No per-intent monitoring. A single intent regresses and the aggregate metric hides it.
  • No escape hatch. Low-confidence should escalate, not guess.
  • Frontier model for the classifier. Adds 200-500 ms latency for marginal accuracy gain. Use a small or distilled model.
  • Ignoring the confusion matrix. The matrix tells you which exemplars are missing; the aggregate F1 does not.
  • Routing on the model’s first token. Use structured output. Free-form text parsers fail in production.

What changed in 2026 for intent pipelines

Date | Event | Why it matters
2026 | GPT-5.5 family with strict structured output | JSON schema enforcement removed parser failure modes.
2026 | Claude 4.x JSON mode hardened | Structured output became reliable across both major frontier vendors.
Mar 2026 | FutureAGI shipped Agent Command Center | Intent-aware routing and guardrails moved into the gateway.
2026 | OTel GenAI semconv broad adoption | Cross-vendor classifier spans use the same schema.
Dec 2025 | DeepEval v3.9.x agent metrics | Built-in metrics for tool-call accuracy and intent classification quality.


Frequently asked questions

What is an intent classification pipeline for LLM apps?
An intent classification pipeline is the layer that takes a user query, identifies which downstream flow to route to (refund, FAQ, escalate, support, billing, churn-risk), and feeds that decision into the downstream agent. It sits before the main LLM call and is usually a smaller LLM or a deterministic classifier. In 2026 production stacks it has four stages: ingest, retrieve exemplars, classify, validate-and-score.
Why does intent classification matter in 2026?
Three reasons. First, agents that try to handle every intent in one prompt struggle with prompt-bloat and confused context. Second, downstream tools and rubrics are intent-specific (a refund flow has different judges from a FAQ flow). Third, escalation policies require explicit intent labels (you do not escalate a billing question to a refund agent). Without intent classification, the agent acts as a do-everything monolith and observability becomes harder.
How do I build the dataset for intent classification?
Three sources. First, hand-label 200-500 production queries across the intents you care about (the floor). Second, generate synthetic queries with a frontier model conditioned on intent labels (the volume layer). Third, expand from negative-feedback rows in production (the regression layer). The dataset has to be stratified across difficulty and adversarial cases; a clean dataset hides the failure modes that matter.
What's the right judge for intent classification?
A two-judge ensemble. A frontier model (GPT-5.5 or Claude 4.x) for offline calibration; a distilled judge (Galileo Luna, FutureAGI Turing-Flash, custom small model) for online scoring at production scale. Calibrate the distilled judge against the frontier judge plus 200 human labels. Track per-intent precision, recall, F1; track confusion matrix; alert on rolling-mean drift.
Should I use LLM-as-judge or a deterministic check?
Both. Deterministic checks (regex, schema, allow-list) catch the easy cases at zero cost. LLM judges catch the ambiguous cases where two intents look similar. Layer them: deterministic first, LLM judge if deterministic returns 'unknown' or 'low-confidence'. The deterministic layer cuts judge token spend by 60-80 percent in most production workloads.
How do I deploy and monitor the pipeline in production?
Three layers. Span-attached scoring on every classification with the predicted intent and the judge verdict. Drift alerts on per-intent rolling-mean precision (page on 3-5 percent moves). Per-cohort A/B with eval-gated rollback when shipping a new classifier prompt. Dataset auto-build from misclassified spans; the regression set grows from production every week.
Does this pipeline depend on a proprietary SDK?
No. The reference implementation in this guide uses OpenAI's Python SDK plus traceAI (Apache 2.0) for instrumentation. The same pipeline works with Anthropic's SDK, the Google Vertex SDK, and OSS models via vLLM. The eval store can be FutureAGI cloud, Phoenix self-host, or a Postgres table. The judge can be any LLM you can call. The pipeline is vendor-neutral by design.
What are the common mistakes when building this pipeline?
Five. First, picking too many intents (above 12-15, classifier accuracy drops). Second, no synthetic data (the dataset is too small to defend). Third, no calibration set (the judge is unverified). Fourth, no per-intent monitoring (an intent regresses and nobody notices). Fifth, no escape hatch (when confidence is low, the system should escalate to a human or a fallback flow rather than guess).