What Is a Tag?
Tags are short free-form labels attached to AI artifacts — dataset rows, traces, spans, evaluations, prompts, or model runs — that let teams group, filter, and slice data without changing its shape. A trace tagged env:prod, customer_tier:enterprise, agent:support_v2, experiment:cot_prompt_b becomes findable in a dashboard by any of those dimensions. Tags are the cheapest production telemetry primitive: they require no schema migration, no ETL, no upfront design. They are not typed metadata, not a substitute for structured fields, and not a metric, but they are how teams turn unstructured event streams into queryable cohorts. Most LLM observability platforms, including FutureAGI, treat tags as a first-class field.
Why It Matters in Production LLM and Agent Systems
A model deployed without tags produces an undifferentiated stream of traces. The team sees average latency, average cost, and average evaluation score, but cannot answer “what’s the eval score on enterprise traffic this week” or “did the new prompt regress only on Spanish-language requests.” Without a slice, every alert is a global alert and every regression looks the same. Without tags, the slice is impossible.
The pain shows up across roles. SREs investigating a latency spike cannot tell whether it is region-specific without a region tag. ML engineers running an A/B between two prompts cannot compare without an experiment tag on each request. Product leads asked “is the agent better for our top 10 customers” need a customer-tier tag. Compliance leads need an environment tag (prod vs. staging) and a data-residency tag to scope their audits. None of these questions need a schema; they all need a string label propagated on the request.
For 2026 agent stacks, tags propagate across multi-step trajectories. The same experiment:cot_prompt_b tag on the user request appears on the planner span, every tool call, the reranker, and the final response — turning an entire trajectory into a coherent cohort. Without that propagation, evaluating an experiment over a multi-step pipeline degenerates into manual joins.
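Propagation can be sketched in plain Python. This is an illustrative example, not the FutureAGI SDK: the names `start_request` and `log_span` are hypothetical, standing in for whatever sets tags at the request boundary and emits spans inside it.

```python
from contextvars import ContextVar

# Tags are set once at the request boundary and inherited by every
# step (planner, tool calls, reranker, response) of the trajectory.
REQUEST_TAGS: ContextVar[tuple] = ContextVar("request_tags", default=())

def start_request(tags):
    """Pin the tag cohort once, when the user request arrives."""
    REQUEST_TAGS.set(tuple(tags))

def log_span(name):
    """Every span emitted inside the request carries the same tags."""
    return {"span": name, "tags": list(REQUEST_TAGS.get())}

start_request(["env:prod", "experiment:cot_prompt_b"])
trajectory = [log_span(s) for s in ("planner", "tool:search", "reranker", "response")]

# Every span shares the experiment tag, so the whole trajectory
# filters as one cohort instead of requiring manual joins.
assert all("experiment:cot_prompt_b" in s["tags"] for s in trajectory)
```

Using a `ContextVar` (rather than a global) keeps the tags scoped per request even when requests are handled concurrently, which is the property that makes trajectory-level cohorts possible.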
How FutureAGI Handles Tags
FutureAGI’s approach is to make tags a first-class field on every artifact in the SDK. The fi.client.Client.log() method accepts tags on every logged event — model inputs, outputs, conversations — so a tag set at request time follows the trace through traceAI. The fi.datasets.Dataset primitive carries tags on rows, so a row marked domain:medical, complexity:hard can be filtered before evaluation. Prompts in fi.prompt.Prompt carry tags via labels and versions, so a production label on prompt v3 distinguishes it from experiment:beta on prompt v4. Annotation queues from fi.queues.AnnotationQueue carry tags on items so reviewers can prioritize — urgent, revisit, policy_review.
Concretely: an agent team running an A/B between a base prompt and a Chain-of-Thought variant tags every request with experiment:base or experiment:cot. After 48 hours, the team filters the FutureAGI dashboard by tag and runs eval-fail-rate-by-cohort with experiment as the cohort dimension. TaskCompletion is 0.79 on base and 0.84 on CoT — but latency-p99 is 1.4s on base and 3.1s on CoT. The decision is data-driven because both metrics share the tag dimension. None of that workflow needs a schema migration; the only contract is that the tag string is set consistently at the call site.
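The grouping behind that dashboard view can be sketched with stdlib Python. The trace records and field names below are made up for illustration; they are not the FutureAGI schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical traces, each carrying the experiment tag set at the call site.
traces = [
    {"tags": ["experiment:base"], "task_completed": True,  "latency_s": 1.1},
    {"tags": ["experiment:base"], "task_completed": False, "latency_s": 1.4},
    {"tags": ["experiment:cot"],  "task_completed": True,  "latency_s": 2.9},
    {"tags": ["experiment:cot"],  "task_completed": True,  "latency_s": 3.2},
]

def by_cohort(traces, prefix="experiment:"):
    """Group traces by the value of one tag dimension."""
    cohorts = defaultdict(list)
    for t in traces:
        for tag in t["tags"]:
            if tag.startswith(prefix):
                cohorts[tag].append(t)
    return cohorts

for tag, group in sorted(by_cohort(traces).items()):
    rate = mean(t["task_completed"] for t in group)
    worst = max(t["latency_s"] for t in group)
    print(f"{tag}: completion={rate:.2f} max_latency={worst:.1f}s")
```

Because completion rate and latency are computed over the same tag dimension, the quality-versus-latency trade-off is visible in one pass; no join is needed.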
How to Measure or Detect It
Tags are a dimension, not a metric — but they enable every metric to be sliced:
- `session.tags` (OTel attribute): the canonical span attribute that carries tag strings through traceAI; filter dashboards on it.
- `eval-fail-rate-by-cohort` (dashboard signal): the canonical regression view, with tag as the cohort dimension.
- `AggregatedMetric`: combines per-tag-cohort metrics into a single weighted score; useful when one tag dominates volume.
- Tag-cardinality monitoring: very high cardinality on a tag (e.g., user ID) makes it useless as a cohort dimension and expensive to index; alert on cardinality blowups.
- Per-tag evaluation runs: filter the `Dataset` by tag and run the eval suite on that subset; the per-cohort number is what teams act on.
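A cardinality check is easy to run offline over collected tags. This is a sketch; the threshold of 1,000 distinct values is an assumption for illustration, not a platform default.

```python
MAX_CARDINALITY = 1000  # assumed alert threshold, tune per backend

def tag_cardinality(tag_lists):
    """Count distinct values observed per tag key across many events."""
    values = {}
    for tags in tag_lists:
        for tag in tags:
            key, _, value = tag.partition(":")
            values.setdefault(key, set()).add(value)
    return {key: len(vals) for key, vals in values.items()}

# A user-id tag blows up cardinality; an env tag stays tiny.
seen = [["env:prod", f"user:{i}"] for i in range(5000)]
card = tag_cardinality(seen)
blown = [k for k, n in card.items() if n > MAX_CARDINALITY]
assert blown == ["user"]  # user IDs belong in their own indexed field
```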
Minimal Python:

```python
from fi.client import Client

client = Client()
client.log(
    inputs={"prompt": user_input},
    outputs={"response": model_output},
    tags=["env:prod", "experiment:cot", f"customer_tier:{tier}"],
)
```
Common Mistakes
- Using high-cardinality fields as tags. User IDs, request IDs, and timestamps belong in their own indexed fields, not in tags. Tag cardinality should stay in the hundreds or low thousands.
- Adding tags reactively after an incident. Tags are most valuable when applied uniformly from day one; retroactive tagging only works on what is still in the trace store.
- Free-form spelling. `customer_tier:enterprise` and `customerTier:Enterprise` look like the same tag but segment your dashboard separately. Pin a tag vocabulary, or you'll silently split cohorts.
- No tag for environment. Mixing prod and staging traces in one dashboard is the most common source of misleading metrics; an `env:` tag is non-negotiable.
- Tags as a substitute for structured metadata. A customer subscription tier is structured; a churn date is structured. Don't model them as tags or you'll build queries on top of fragile string parsing.
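Pinning a vocabulary can be enforced at the call site. The allowed keys and the regex below are illustrative choices, not a FutureAGI requirement; the point is that a misspelled key fails fast instead of silently splitting a cohort.

```python
import re

# Assumed project vocabulary; extend deliberately, in one place.
ALLOWED_KEYS = {"env", "experiment", "customer_tier", "agent"}
TAG_RE = re.compile(r"^[a-z][a-z0-9_]*:[a-z0-9_.-]+$")

def validate_tags(tags):
    """Reject malformed or out-of-vocabulary tags before logging."""
    for tag in tags:
        if not TAG_RE.match(tag):
            raise ValueError(f"malformed tag (use lower_snake key:value): {tag!r}")
        key = tag.split(":", 1)[0]
        if key not in ALLOWED_KEYS:
            raise ValueError(f"unknown tag key {key!r}; extend the vocabulary deliberately")
    return tags

validate_tags(["env:prod", "experiment:cot"])   # passes
# validate_tags(["customerTier:Enterprise"])    # would raise ValueError
```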
Frequently Asked Questions
What are tags in AI workflows?
Tags are short, free-form labels attached to dataset rows, traces, evaluations, prompts, or model runs that allow filtering and cohort slicing without restructuring the data.
How are tags different from metadata?
Metadata is typed structured data (string fields, numeric fields, timestamps). Tags are free-form, low-cost labels — often a string or list of strings — meant for ad-hoc grouping rather than enforcing structure.
Where does FutureAGI use tags?
FutureAGI's Client.log accepts tags on every logged event, datasets carry tags on rows, and the dashboard slices eval-fail-rate-by-cohort with tags as the cohort dimension. Use tags to split production traffic by environment, customer, or experiment.