What is a taxonomy in AI?

A taxonomy is a hierarchical classification scheme that organizes concepts, intents, errors, or content categories into parent-child relationships. AI systems use taxonomies for intent classification, content-safety policies, error catalogs, and structured label sets.

How is a taxonomy different from an ontology?

A taxonomy captures only hierarchical (is-a) relationships. An ontology captures hierarchies plus arbitrary typed relationships, properties, and constraints; it is the richer structure used by description-logic reasoners and knowledge graphs.

How does FutureAGI use taxonomies?

FutureAGI uses taxonomies in content-safety policies driving ContentModeration and Toxicity evaluators, in failure-mode catalogs that drive eval-fail-rate-by-cohort dashboards, and in label sets attached to dataset rows for evaluation slicing.

What Is a Taxonomy? Definition & FutureAGI Guide (2026)

What Is a Taxonomy?

A taxonomy is a hierarchical classification scheme that organizes a set of concepts into parent-child relationships. The classic example is the biological taxonomy (kingdom > phylum > class > order). In AI it appears as intent taxonomies for chatbots, content-safety taxonomies for moderation policies, error taxonomies for failure-mode tracking, and label taxonomies for supervised training. The hierarchy is what makes a taxonomy useful: a parent category can match when no child does, policy decisions inherit downward, and evaluation cohorts can be sliced at any level. A flat enum is not a taxonomy; the structure is the point.
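The parent-child structure described above can be sketched in a few lines of Python. This is a minimal illustration, not a FutureAGI API: the `PARENT` mapping, the node names, and the `ancestors` and `is_a` helpers are all assumptions made for the example.

```python
from __future__ import annotations

# Hypothetical taxonomy stored as child -> parent; root nodes map to None.
PARENT: dict[str, str | None] = {
    "billing": None,
    "refund": "billing",
    "partial-refund": "refund",
    "violence": None,
    "graphic-violence": "violence",
}


def ancestors(node: str) -> list[str]:
    """Return the path from the root down to `node`, inclusive."""
    path = []
    current: str | None = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path[::-1]


def is_a(node: str, category: str) -> bool:
    """True if `node` equals `category` or sits anywhere beneath it."""
    return category in ancestors(node)


print(" > ".join(ancestors("partial-refund")))
print(is_a("partial-refund", "billing"))
print(is_a("partial-refund", "violence"))
```

Because every node knows its parent, a policy or a dashboard filter written against `billing` also covers `partial-refund`, which is the inheritance a flat enum cannot express.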

Why It Matters in Production LLM and Agent Systems

Production LLM systems run on label spaces. A content-moderation policy decides what to block based on a category tree (violence > graphic violence > weapon-specific violence). A customer-support agent routes based on an intent taxonomy (billing > refund > partial-refund). A failure-mode dashboard groups eval failures into a tree (hallucination > citation-fabrication > sycophancy-induced). When the taxonomy is well-structured, every dashboard, eval, and policy decision becomes interpretable. When it is flat or ad-hoc, every team builds its own labels and dashboards diverge.

The pain shows up across roles. ML engineers see content moderators and prompt engineers using subtly different category names for the same concept; the result is policy gaps and double-counting. Product leads cannot answer “what fraction of failures are in the billing intent” because intent labels were never structured hierarchically; the answer requires manual joins. Compliance leads in regulated industries need an auditable error taxonomy mapped to regulatory categories; an unstructured failure log does not satisfy that.

For 2026 agent stacks the relevance is sharper. A multi-agent system where each agent has its own label set produces traces that are hard to aggregate. A shared taxonomy (for intents, errors, and content categories) turns those traces into a unified dashboard. The taxonomy becomes the lingua franca of the AI platform.
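One way to picture the shared-taxonomy idea is a mapping from each agent's local label to a single shared path. The sketch below is illustrative only; the agent names, local labels, and trace records are hypothetical and do not come from any real system.

```python
from collections import Counter

# Hypothetical mapping from (agent, agent-local label) to a shared taxonomy path.
SHARED_PATHS = {
    ("support_agent", "refund_partial"): "billing > refund > partial-refund",
    ("billing_agent", "PARTIAL_REFUND"): "billing > refund > partial-refund",
    ("support_agent", "made_up_source"): "hallucination > citation-fabrication",
}

# Hypothetical trace records: (agent name, agent-local label).
traces = [
    ("support_agent", "refund_partial"),
    ("billing_agent", "PARTIAL_REFUND"),
    ("support_agent", "made_up_source"),
    ("billing_agent", "unknown_label"),
]

counts = Counter()
unmapped = []
for agent, label in traces:
    path = SHARED_PATHS.get((agent, label))
    if path is None:
        unmapped.append((agent, label))  # surface gaps instead of guessing
    else:
        counts[path] += 1

print(counts.most_common())
print(unmapped)
```

Two agents that name the same concept differently now land in one bucket, and labels with no shared home are listed explicitly rather than silently dropped.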

How FutureAGI Handles Taxonomy

FutureAGI does not impose a global taxonomy on your domain; that is your team’s design problem. What FutureAGI does is consume the taxonomy you define and turn it into evaluator policies, eval-cohort slices, and dashboard structure. This shows up in three places. First, content-safety taxonomies feed the eval:ContentModeration, eval:ContentSafety, eval:Toxicity, and eval:BiasDetection evaluators; each can be configured with the policy’s category tree and return scores per category. Second, eval-failure taxonomies are attached as tags on Dataset rows so the dashboard can slice eval-fail-rate by category at any level of the tree. Third, intent and topic taxonomies feed routing and intent-classification evaluators that score whether the agent’s label assignment matches the gold taxonomy node.

Concretely: a content-moderation team defines a 3-level taxonomy with 8 top-level categories and 47 leaves. They configure ContentModeration to score each response against the taxonomy and run it on production traces sampled at 5%. The dashboard shows fail rate at the top level (where most are zero) and at leaf level (where the actionable signal lives). When a model swap regresses on harm > self-harm > indirect-references, the trace cohort is one click away. None of that workflow needs a custom dashboard; the taxonomy becomes the dashboard structure. FutureAGI’s job is to make the taxonomy a first-class citizen of the evaluation surface.
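The "fail rate at any level" idea can be approximated outside any dashboard by counting each result once for every prefix of its taxonomy path. The following is a rough sketch in plain Python, not FutureAGI's implementation; the `results` rows and the `fail_rates` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical eval results: (taxonomy leaf path, passed?).
results = [
    (("harm", "self-harm", "indirect-references"), False),
    (("harm", "self-harm", "indirect-references"), True),
    (("harm", "self-harm", "explicit-instructions"), True),
    (("violence", "graphic-violence", "weapon-specific"), True),
]


def fail_rates(rows):
    """Fail rate for every node on every row's path (all tree levels)."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for path, passed in rows:
        # Count the row once for each prefix of its path.
        for depth in range(1, len(path) + 1):
            node = path[:depth]
            totals[node] += 1
            if not passed:
                fails[node] += 1
    return {node: fails[node] / totals[node] for node in totals}


rates = fail_rates(results)
for node in sorted(rates):
    print(" > ".join(node), f"{rates[node]:.2f}")
```

In this toy data the top-level `harm` rate is diluted by passing siblings, while the `indirect-references` leaf shows the regression clearly, which mirrors the point that the actionable signal tends to live at leaf level.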

How to Measure or Detect It

Taxonomy-aware evaluation is a per-category breakdown, not a single number:

ContentModeration: scores each response against a content-policy taxonomy and returns per-category labels, surfacing which leaf was violated.
ContentSafety: returns a content-safety violation flag with category attribution; pairs with policy taxonomy.
Toxicity: returns a toxicity score; can be sliced by domain taxonomy in dashboards.
BiasDetection: returns a bias category and score; useful when the bias taxonomy is hierarchical (e.g., demographic > age, gender, race).
Per-leaf eval-fail-rate (dashboard signal): slice eval-fail-rate-by-cohort using taxonomy leaves as the cohort dimension.

Minimal Python:

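from fi.evals import ContentModeration

# Placeholder inputs; replace with a real request/response pair from a trace.
user_request = "How do I request a partial refund?"
model_response = "You can request a partial refund from the Billing page."

# Score the response against the configured content policy.
mod = ContentModeration()
result = mod.evaluate(
    input=user_request,
    output=model_response,
)
print(result.score, result.reason)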

Common Mistakes

Treating a flat enum as a taxonomy. Without parent-child structure, you cannot inherit policies or aggregate dashboards meaningfully. Build the tree before you build the eval.
Letting two teams maintain divergent taxonomies. Customer-support intents and content-moderation categories can share a backbone; pin a shared root and version it.
Versionless taxonomies. When a category is added, removed, or renamed, every historical dashboard becomes ambiguous. Version the taxonomy as a first-class artifact.
Tagging only at the leaf. If a row only has a leaf tag, you cannot answer questions at the parent level without the tree; tag at every level the evaluation will slice on.
Forgetting that LLMs hallucinate categories. A model asked to classify into a 50-leaf taxonomy can invent labels that are not in the tree; constrain output to the enumerated set with SchemaCompliance.
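Two of the mistakes above, versionless taxonomies and invented labels, can be guarded against with a small check before a label is stored. The sketch below is generic Python and does not use the SchemaCompliance evaluator's API; `TAXONOMY_VERSION`, `ALLOWED_LEAVES`, and `validate_label` are hypothetical names chosen for illustration.

```python
# Hypothetical versioned taxonomy: the version travels with every label.
TAXONOMY_VERSION = "2026-01"
ALLOWED_LEAVES = {
    "billing > refund > partial-refund",
    "billing > refund > full-refund",
    "billing > invoice > missing-invoice",
}


def validate_label(raw_label: str) -> dict:
    """Accept only enumerated leaves; flag anything else for review."""
    label = raw_label.strip()
    if label in ALLOWED_LEAVES:
        return {"label": label, "version": TAXONOMY_VERSION, "valid": True}
    # Do not coerce an invented label to the nearest match silently.
    return {
        "label": None,
        "version": TAXONOMY_VERSION,
        "valid": False,
        "rejected": label,
    }


print(validate_label("billing > refund > partial-refund "))
print(validate_label("billing > refund > goodwill-credit"))
```

Storing the version next to each label keeps historical rows interpretable after a rename, and keeping the rejected string makes it possible to review whether the taxonomy is missing a real category.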