What Is a Category (ML / Data Context)?

A single value of a categorical variable, used as a label, slice, or class in ML and LLM pipelines.

In ML and data systems, a category is a single discrete value of a categorical variable, such as “billing” in an intent field or “pass” in an eval-result field. Categories are the labels classifiers predict, the cohorts dashboards compare, and the branches agent workflows use for routing or escalation. In FutureAGI, categories attach to datasets, eval runs, and production traces so teams can compare reliability by intent, persona, model variant, or error class instead of trusting one global average.

Why categories matter in production LLM and agent systems

The category surface is where production debuggability lives or dies. A dashboard that shows global task_completion = 0.78 is almost useless. A dashboard that shows task_completion = 0.92 for “billing” and 0.41 for “tech-support” tells you exactly where to look. Categories let you slice.
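
To make the slice concrete, here is a minimal sketch in plain Python; the records and scores are hypothetical, standing in for per-interaction eval results:

from collections import defaultdict
from statistics import mean

# Hypothetical per-interaction eval results, each tagged with an intent category.
records = [
    {"intent": "billing", "task_completion": 0.95},
    {"intent": "billing", "task_completion": 0.89},
    {"intent": "tech-support", "task_completion": 0.44},
    {"intent": "tech-support", "task_completion": 0.38},
]

by_intent = defaultdict(list)
for r in records:
    by_intent[r["intent"]].append(r["task_completion"])

for intent, scores in sorted(by_intent.items()):
    print(f"{intent}: {mean(scores):.2f}")  # the per-cohort view a global average hides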

Unlike a loose LangSmith tag, which may be enough for filtering within a single project, a production category taxonomy becomes a cross-system contract: dataset rows, trace spans, eval cohorts, routing rules, and human review queues all need to reference the same ID.

The pain shows up across roles. A product manager looks at a flat eval-pass-rate of 84% and ships, while one category — “international shipping” — silently runs at 38% and accounts for the angriest tickets. An ML engineer renames a category from “complaint” to “issue” without a migration; six months of historical comparisons break and nobody notices until the next QBR. A compliance lead is asked which categories of conversation are subject to a new policy and discovers the category taxonomy has drifted across three teams.

In agent stacks the problem multiplies. Categories surface in every layer: trace span attributes, prompt-template variables, routing rules, eval cohorts. Treating them as throwaway strings rather than versioned IDs leads to silent breakage every time a category is renamed, retired, or split. Stable category IDs, with display names that can change but IDs that don’t, are the boring infrastructure that prevents an entire class of incidents.
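
One way to get that stability is a small registry that separates the immutable ID from the mutable display name. This is a sketch, not FutureAGI's internal schema, and every name in it is hypothetical:

from dataclasses import dataclass

@dataclass(frozen=True)
class Category:
    id: str            # stable key used in traces, datasets, and routing rules
    display_name: str  # safe to rename for dashboards and review queues

# Renaming "Complaint" to "Issue" changes only the display name;
# every historical join on the ID keeps working.
REGISTRY = {
    "cat_complaint": Category(id="cat_complaint", display_name="Issue"),
}

print(REGISTRY["cat_complaint"].display_name)  # "Issue"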

How FutureAGI handles categories

FutureAGI uses categories as the primary slicing dimension in eval pipelines. Every score — AnswerRelevancy, TaskCompletion, Groundedness, JSONValidation — can be aggregated and filtered by any category attached to the input row, the trace span, or the dataset.

FutureAGI’s approach is to treat category IDs as observability keys, not UI labels. With the traceAI LangChain integration, a team logs intent, persona_id, agent_version, and model_variant as span_attributes on each agent run. FutureAGI’s eval dashboard pivots scores across any combination — pass-rate by intent × agent_version exposes regressions before global averages move. The annotation queue feeds the same categorical taxonomy: human reviewers label samples with category IDs that match the production schema, so eval-vs-human agreement is computed on the same axes.
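
The attribute-setting pattern itself is plain OpenTelemetry, which the traceAI integrations build on; a minimal sketch, with all attribute values hypothetical:

from opentelemetry import trace

tracer = trace.get_tracer("agent")

# Attach category attributes to the span for one agent run; the eval
# dashboard can then pivot scores on any combination of them.
with tracer.start_as_current_span("agent_run") as span:
    span.set_attribute("intent", "billing")
    span.set_attribute("persona_id", "smb_owner")
    span.set_attribute("agent_version", "v2.3.1")
    span.set_attribute("model_variant", "gpt-4o-mini")
    # ... invoke the agent here ...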

For dataset construction, Dataset rows carry category columns; Dataset.add_evaluation() runs evaluators per row, and aggregates can be filtered to any category. When a category is renamed, FutureAGI keeps the underlying ID stable; display names can be updated without breaking historical eval comparisons. If the “tech-support” pass rate drops after a release, the engineer can open the traces, inspect agent_version, pause a semantic-cache rule, or run a regression eval before changing prompts. That turns the category from reporting metadata into a rollback handle: alerts, sampled traces, and human review all point at the same failing cohort.

How to measure category health

Category health is measured at three layers — coverage, drift, and downstream eval:

  • Per-category sample count: distribution of samples per category; rare categories need minimum-support thresholds before they enter dashboards.
  • Category drift via PSI: distribution shift on the categorical column over time; flags new intents, retired labels, and seasonal cohorts (see the PSI sketch after this list).
  • fi.evals.AnswerRelevancy sliced by category: catches per-cohort relevance regressions that a global average hides.
  • fi.evals.TaskCompletion sliced by category: trajectory-level success rate per cohort, especially useful for agent workflows.
  • Trace field completeness: percentage of spans with required category attributes such as intent, persona_id, and model_variant.
  • Eval-fail-rate-by-cohort: dashboard signal that connects category changes to release, prompt, or model-version changes.
  • Annotation-queue agreement per category: agreement rate between human and model on each category, useful for finding label drift.
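
The PSI check from the list above fits in a few lines. A minimal sketch, with smoothing for categories that appear in only one window; the 0.2 review threshold is a common rule of thumb, not a FutureAGI default:

import math
from collections import Counter

def categorical_psi(baseline, current, eps=1e-4):
    # PSI = sum_i (p_i - q_i) * ln(p_i / q_i), where q_i and p_i are the
    # category shares in the baseline and current windows.
    cats = set(baseline) | set(current)
    b_counts, c_counts = Counter(baseline), Counter(current)
    psi = 0.0
    for cat in cats:
        q = max(b_counts[cat] / len(baseline), eps)  # eps smooths zero counts
        p = max(c_counts[cat] / len(current), eps)
        psi += (p - q) * math.log(p / q)
    return psi

baseline = ["billing"] * 60 + ["tech-support"] * 30 + ["returns"] * 10
current = ["billing"] * 45 + ["tech-support"] * 30 + ["returns"] * 10 + ["intl-shipping"] * 15
print(round(categorical_psi(baseline, current), 3))  # > 0.2 usually warrants review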

On the downstream-eval layer, the snippet below scores a single interaction with the TaskCompletion evaluator; per-category views come from aggregating such results over cohorts:

from fi.evals import TaskCompletion

# Score one interaction: did the output actually complete the user's task?
task = TaskCompletion()
result = task.evaluate(
    input="Refund my last order #ORD-2284.",
    output="Refund of $48.20 issued to your card ending in 4421."
)
# result carries a numeric score and a natural-language justification.
print(result.score, result.reason)

Common mistakes

  • Renaming display text without keeping a stable ID. Dashboards still render, but historical comparisons silently switch meaning across releases while charts look continuous.
  • Lumping low-volume categories into “other”. You lose cohorts most likely to contain policy edge cases, rare tool failures, expensive escalations, or new abuse patterns.
  • Letting teams maintain divergent taxonomies. Routing, eval, analytics, and annotation drift apart; cross-team incident review turns into schema archaeology and broken joins.
  • Tracking only global averages. A flat top-line score hides the category that accounts for most failed tasks or support tickets; set cohort thresholds.
  • Treating eval pass/fail as binary without an “abstain” category. When the evaluator cannot decide, forcing “pass” inflates reliability scores and hides ambiguous cases (see the sketch below).
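
A minimal sketch of abstain-aware aggregation, with hypothetical verdicts:

# Report abstentions separately instead of coercing them into "pass".
verdicts = ["pass", "pass", "abstain", "fail", "pass"]

decided = [v for v in verdicts if v != "abstain"]
pass_rate = sum(v == "pass" for v in decided) / len(decided)
abstain_rate = verdicts.count("abstain") / len(verdicts)
print(f"pass_rate={pass_rate:.2f} abstain_rate={abstain_rate:.2f}")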

Frequently Asked Questions

What is a category in ML?

A category is a single discrete value of a categorical variable — for example, 'billing' inside the 'intent' variable. Classifiers predict categories, and dashboards usually slice metrics by category.

How is a category different from a class?

In supervised ML the two are often used interchangeably — a class label is a category. 'Class' tends to imply the prediction target; 'category' is the broader bucket usable as label, slice, tag, or feature value.

How does FutureAGI use categories?

FutureAGI slices eval scores like AnswerRelevancy and TaskCompletion by category — intent, persona, model variant — so engineers can debug cohorts rather than only chasing global averages.