Models

What Are Categorical Variables?

Categorical variables are features that take values from a discrete, unordered or ordered set, and they must be encoded numerically before ML models consume them.

Categorical variables are features that take values from a discrete set — sometimes unordered (nominal: country, language, intent label) and sometimes ordered (ordinal: tier, satisfaction band). Unlike numerical variables, the raw value of “USA” or “premium” carries no arithmetic meaning, so the variable is encoded before a model consumes it. Common encodings include one-hot, target/mean, ordinal, frequency, hashing, and learned embeddings. In LLM stacks categorical fields surface everywhere: as routing keys, classifier outputs, eval labels, span attributes, and dataset cohorts. FutureAGI evaluates categorical slices in production traces because wrong encodings can leak labels, hide cohorts, and cause silent drift.
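The nominal/ordinal split above maps to two different encodings. A minimal plain-Python sketch; the `one_hot` and `ordinal` helpers and the vocabularies are illustrative, not part of any FutureAGI API:

```python
def one_hot(value, vocabulary):
    """Nominal encoding: one binary column per known category."""
    return [1 if value == v else 0 for v in vocabulary]

def ordinal(value, ordered_levels):
    """Ordinal encoding: the position in the ordered set carries meaning."""
    return ordered_levels.index(value)

countries = ["US", "DE", "IN"]          # nominal: no inherent order
tiers = ["free", "plus", "premium"]     # ordinal: order matters

print(one_hot("DE", countries))   # [0, 1, 0]
print(ordinal("premium", tiers))  # 2
```

Note that applying `ordinal` to a nominal field would invent an arithmetic meaning ("US" < "DE") that the data does not have, which is exactly why the distinction matters.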

Why Categorical Variables Matter in Production LLM and Agent Systems

Categorical encoding is rarely the headline of a project, but it shapes how a model behaves under shift. One-hot encoding is safe but can explode dimensionality on high-cardinality columns like user_id or product_sku. Target encoding is compact but leaks labels if implemented naively. Hashing is stable but loses interpretability. Learned embeddings work well at scale but require enough data per category.
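The hashing trade-off described above, stable but uninterpretable, fits in a few lines; `hashed_bucket` is a hypothetical helper illustrating the hashing trick, not a library function:

```python
import hashlib

def hashed_bucket(value: str, n_buckets: int = 32) -> int:
    """Hashing trick: a stable bucket for any category, including ones
    never seen at training time, at the cost of collisions and of losing
    the ability to say which category a column represents."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# High-cardinality field: every SKU maps to one of n_buckets columns,
# and the mapping never changes between training and inference.
print(hashed_bucket("sku-9048171"))
print(hashed_bucket("sku-never-seen-in-training"))  # still gets a bucket
```

Dimensionality is fixed at `n_buckets` regardless of cardinality, which is the property that makes hashing attractive for fields like user_id or product_sku.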

The pain shows up across roles. A data engineer trains a routing classifier on a one-hot encoded country feature; a new market launches; the encoder has no column for it, so production silently routes to a stale default class. An ML engineer implements target encoding without a proper fold-out scheme; offline AUC is 0.94, but production AUC collapses to 0.62. A platform engineer slices eval_fail_rate by intent and notices that one rare category accounts for 40% of failures, yet it had been lumped into “other” by an aggressive minimum-frequency filter.
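The fold-out scheme that prevents the target-encoding leak can be sketched as follows; `out_of_fold_target_encode` is a hypothetical helper shown only to illustrate the idea:

```python
import random

def out_of_fold_target_encode(categories, labels, n_folds=5, seed=0):
    """Target-encode each row using label statistics from OTHER folds only,
    so a row's own label never leaks into its feature."""
    rng = random.Random(seed)
    folds = [rng.randrange(n_folds) for _ in categories]
    global_mean = sum(labels) / len(labels)
    encoded = []
    for i, cat in enumerate(categories):
        # Mean label of the same category, excluding this row's fold.
        vals = [y for j, (c, y) in enumerate(zip(categories, labels))
                if c == cat and folds[j] != folds[i]]
        encoded.append(sum(vals) / len(vals) if vals else global_mean)
    return encoded

cats = ["billing", "billing", "shipping", "shipping"]
labels = [1, 0, 1, 1]
print(out_of_fold_target_encode(cats, labels))
```

Naive target encoding would compute the category mean over all rows, including the row being encoded, which is the leak that inflates offline metrics.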

In LLM pipelines, the encoded categorical often feeds a downstream prompt or routing rule. A wrong intent label routes a billing call to a tech-support agent; a stale country code retrieves the wrong knowledge base. The signal that catches this is not the classifier’s ROC-AUC but downstream LLM quality: answer relevance, task completion, escalation rate, and failure rate by category.

How FutureAGI Measures Categorical Variables in LLM Systems

FutureAGI does not engineer features or encode variables. FutureAGI’s approach is to treat categorical variables as observable cohorts that must explain downstream LLM behavior, not just upstream classifier accuracy.

Concretely: an agent stack uses an intent classifier whose output (a categorical label) drives prompt selection. In a traceAI-langchain trace, the classifier label is logged as a span_attribute on the agent’s root span. FutureAGI then runs AnswerRelevancy and TaskCompletion on the LLM’s response. The dashboard slices both scores by the categorical label, exposing which intent buckets are healthy and which degrade. When the team adds three new product categories to the classifier and re-deploys, FutureAGI’s regression-eval workflow reruns the same dataset against the new categorical encoding and compares aggregate TaskCompletion to the previous run — surfacing whether the new categories help or hurt downstream LLM behavior.
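The slicing step can be sketched in plain Python; the trace records and field names below are illustrative stand-ins, not the traceAI-langchain schema:

```python
from collections import defaultdict

# Hypothetical trace records: each carries the classifier's intent label
# (logged as a span attribute) and a downstream eval score.
traces = [
    {"intent": "billing", "task_completion": 0.91},
    {"intent": "billing", "task_completion": 0.88},
    {"intent": "tech_support", "task_completion": 0.42},
]

# Group downstream scores by the categorical label.
by_intent = defaultdict(list)
for t in traces:
    by_intent[t["intent"]].append(t["task_completion"])

# Per-cohort mean exposes which intent buckets are healthy.
for intent, scoresores in sorted(by_intent.items()):
    pass
for intent, scores in sorted(by_intent.items()):
    print(intent, round(sum(scores) / len(scores), 2))
```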

For dataset construction, the FutureAGI Dataset.add_evaluation() workflow lets you attach categorical cohort columns to any sample and aggregate eval scores within each cohort. A “billing-dispute” cohort with low groundedness and an “address-change” cohort with high groundedness tells the team which intents need a different prompt or a different model — a categorical-variable view of model quality, but at the LLM level, not the classifier level.

How to Measure Categorical Variables in Production

Categorical-variable health combines encoding correctness and downstream drift:

  • Cardinality and coverage: track the number of distinct values seen at training vs. inference; new values are a drift signal.
  • Per-category sample size: rare categories with low support overfit; track minimum count per category.
  • Feature drift via PSI/KL: classical drift signals on the categorical’s distribution; flag deltas beyond a tuned tolerance.
  • fi.evals.AnswerRelevancy sliced by category: catches downstream LLM degradation tied to specific cohorts.
  • fi.evals.TaskCompletion sliced by category: trajectory-level success rate per cohort.
  • Annotation-queue agreement: agreement rate between predicted category and human label sampled into the FutureAGI annotation queue.

A quick spot-check of one response with the AnswerRelevancy eval listed above:

from fi.evals import AnswerRelevancy

# Score a single input/response pair; in practice, slice these
# scores by the categorical label to get per-cohort health.
rel = AnswerRelevancy()
result = rel.evaluate(
    input="Reschedule my delivery to next Tuesday.",
    output="Your delivery is rescheduled to May 13th between 1–4 PM."
)
print(result.score, result.reason)
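The PSI bullet above can be computed directly from category counts; `categorical_psi` is a hypothetical helper using the standard PSI formula, with a small epsilon mass for categories unseen in one window:

```python
import math

def categorical_psi(train_counts, live_counts, eps=1e-6):
    """Population Stability Index over a categorical distribution:
    sum of (live_share - train_share) * ln(live_share / train_share)."""
    cats = set(train_counts) | set(live_counts)
    t_total = sum(train_counts.values())
    l_total = sum(live_counts.values())
    psi = 0.0
    for c in cats:
        p = max(train_counts.get(c, 0) / t_total, eps)
        q = max(live_counts.get(c, 0) / l_total, eps)
        psi += (q - p) * math.log(q / p)
    return psi

train = {"billing": 500, "shipping": 300, "returns": 200}
live = {"billing": 350, "shipping": 250, "returns": 150, "warranty": 250}
print(round(categorical_psi(train, live), 3))  # new category drives PSI up
```

A common rule of thumb treats PSI above roughly 0.25 as significant drift; a brand-new category with real volume will blow past that on its own.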

Common mistakes

  • Using target encoding without a fold-out scheme. This leaks the label into the training feature and inflates offline metrics that collapse in production.
  • Lumping rare categories into “other” too early. The “other” bucket often contains your worst-performing cohorts; track them separately.
  • Forgetting the unseen-category fallback. New categories appear in production; the encoder needs a deterministic default.
  • One-hot encoding high-cardinality fields like user IDs. Use hashing or embeddings instead.
  • Ignoring categorical drift in monitoring. Distribution shift on categorical columns is one of the silent causes of LLM-pipeline regressions.
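The unseen-category fallback from the list above can be made explicit in the encoder itself; `SafeOneHotEncoder` is a hypothetical sketch, not a FutureAGI or scikit-learn class:

```python
class SafeOneHotEncoder:
    """One-hot encoder with a deterministic fallback column for
    categories never seen at training time."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def encode(self, value):
        row = [0] * (len(self.vocabulary) + 1)  # last slot = unknown
        try:
            row[self.vocabulary.index(value)] = 1
        except ValueError:
            row[-1] = 1  # unseen category: explicit, not a silent default
        return row

enc = SafeOneHotEncoder(["US", "DE", "IN"])
print(enc.encode("US"))  # [1, 0, 0, 0]
print(enc.encode("BR"))  # [0, 0, 0, 1] -> new market lands in "unknown"
```

Because the unknown bucket is its own column, monitoring can alert on its volume instead of letting new categories disappear into a stale class.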

Frequently Asked Questions

What are categorical variables?

Categorical variables are features that take values from a discrete set — like country, status, or product tier — and must be encoded numerically before ML models can use them.

What is the difference between nominal and ordinal categorical variables?

Nominal variables have unordered categories (country, intent label). Ordinal variables have ordered categories (size: S/M/L, satisfaction: low/medium/high) where the order itself carries information.

How does FutureAGI relate to categorical variables?

FutureAGI doesn't engineer features. We evaluate the LLM applications that consume categorical signals — classifier outputs feeding routing, intent labels feeding prompts — using TaskCompletion and AnswerRelevancy on the downstream LLM.