What Is Token Usage Tracking?

Token usage tracking is the per-call capture of input, output, cached, and reasoning token counts on every LLM span. It is the unit metric of LLM economics, the substrate that cost attribution, model-comparison ROI, prompt optimization, and rate-limit alerting all depend on. The OpenTelemetry GenAI conventions standardize the attribute names: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.usage.total_tokens, plus detail attributes for cache reads (gen_ai.usage.cache_read_tokens), cache writes, audio tokens, and reasoning tokens. FutureAGI’s traceAI emits these on every span automatically.
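
For a concrete picture of the convention, here is roughly what those attributes look like when set by hand with the OpenTelemetry SDK. The span name and token numbers are made up; this is a sketch of the shape of the data, not traceAI's implementation:

from opentelemetry import trace

tracer = trace.get_tracer("llm-demo")

# Hypothetical usage numbers standing in for a provider's returned usage block.
with tracer.start_as_current_span("chat claude-sonnet") as span:
    span.set_attribute("gen_ai.usage.input_tokens", 4112)
    span.set_attribute("gen_ai.usage.output_tokens", 487)
    span.set_attribute("gen_ai.usage.total_tokens", 4599)
    # Detail attributes, set only when the provider reports them:
    span.set_attribute("gen_ai.usage.cache_read_tokens", 3800)
    span.set_attribute("gen_ai.usage.output_tokens.reasoning", 0)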

Why It Matters in Production LLM and Agent Systems

Tokens are the unit in which providers bill, rate limits are enforced, and latency scales. A reasoning model burning 40K reasoning tokens before producing 500 output tokens is priced and budgeted differently from a non-reasoning chat call, and you cannot tell them apart without gen_ai.usage.output_tokens.reasoning. A prompt that quietly grew from 4K to 12K tokens (because someone added more retrieved context) tripled prefill latency and doubled cost, and that regression is invisible without gen_ai.usage.input_tokens per span.
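
To make that pricing gap concrete, here is the arithmetic with placeholder per-token rates (illustrative numbers, not any provider's actual price sheet):

# Placeholder prices: dollars per token (illustrative only).
INPUT_PRICE = 3.00 / 1_000_000    # $3 per 1M input tokens
OUTPUT_PRICE = 15.00 / 1_000_000  # $15 per 1M output tokens (reasoning bills as output)

# Non-reasoning chat call: 2,000 tokens in, 500 out.
chat_cost = 2_000 * INPUT_PRICE + 500 * OUTPUT_PRICE

# Reasoning call: the same visible output, plus 40,000 hidden reasoning tokens.
reasoning_cost = 2_000 * INPUT_PRICE + (500 + 40_000) * OUTPUT_PRICE

print(f"chat: ${chat_cost:.4f}  reasoning: ${reasoning_cost:.4f}")
# chat: $0.0135  reasoning: $0.6135 -- roughly 45x the cost for the same visible answer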

The pain hits four roles. Engineers cannot debug latency spikes without correlating to input token size. Product cannot answer “what does this feature cost per user?” without per-span tokens aggregated by user.id. SREs hit provider rate limits — most enforce TPM (tokens per minute) — and cannot tell which route or tenant is burning the budget. Finance cannot reconcile invoices without provider-grade token counts on every call.

In agent stacks, token usage compounds. A multi-step agent with a planner, three tool calls, two sub-agent dispatches, and a critic can fan one user request out into 30K input tokens and 8K output tokens across 12 spans. Aggregating to a per-trace total lets you set per-trace budgets: kill the trace if it crosses N tokens before a runaway loop takes off. Without per-span tracking, you discover the problem on the next provider invoice.
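
A minimal sketch of such a per-trace budget guard. The class and its wiring are hypothetical; where you call record() depends on your agent framework:

class TraceTokenBudget:
    """Accumulates token usage across the spans of one trace; trips at a cap."""

    def __init__(self, max_total_tokens: int = 50_000):
        self.max_total_tokens = max_total_tokens
        self.used = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.used += input_tokens + output_tokens
        if self.used > self.max_total_tokens:
            # Abort the agent run now instead of discovering this on the invoice.
            raise RuntimeError(
                f"trace token budget exceeded: {self.used} > {self.max_total_tokens}"
            )

budget = TraceTokenBudget(max_total_tokens=50_000)
# Call budget.record(...) after each LLM span with the provider's usage counts.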

The 2026 wrinkle is cache and reasoning tokens. Anthropic and OpenAI both ship prompt caching at heavy discounts (cache-read tokens up to 90% cheaper). Tracking cache_read separately from input shows the actual ROI of the cache layer. Reasoning tokens (o1, o3, Claude extended thinking) are billed as output but often not surfaced to users; tracking them per span tells you when a reasoning model is over-thinking a simple query.
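
The cache arithmetic, again with a placeholder rate and assuming the 90% cache-read discount mentioned above (actual discounts vary by provider and cache tier):

INPUT_PRICE = 3.00 / 1_000_000          # illustrative $/input token
CACHE_READ_PRICE = INPUT_PRICE * 0.10   # 90% discount on cache reads

input_tokens = 12_000       # gen_ai.usage.input_tokens
cache_read_tokens = 10_000  # gen_ai.usage.cache_read_tokens

cache_hit_ratio = cache_read_tokens / input_tokens
naive_cost = input_tokens * INPUT_PRICE  # what you'd book without the cache breakdown
actual_cost = (
    (input_tokens - cache_read_tokens) * INPUT_PRICE
    + cache_read_tokens * CACHE_READ_PRICE
)
print(f"hit ratio {cache_hit_ratio:.0%}, naive ${naive_cost:.4f}, actual ${actual_cost:.4f}")
# hit ratio 83%, naive $0.0360, actual $0.0090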

How FutureAGI Handles Token Usage

FutureAGI captures token usage on every LLM span via traceAI. Each integration (traceAI-openai, traceAI-anthropic, traceAI-bedrock, traceAI-vertexai, traceAI-mistral, traceAI-cohere, traceAI-groq, traceAI-together, etc.) reads the provider’s usage block from the response and writes it into the OTel span as gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.usage.total_tokens. Cache and reasoning details land in the detailed attributes (gen_ai.usage.input_tokens.cache_read, gen_ai.usage.output_tokens.reasoning, gen_ai.usage.input_tokens.audio). Legacy traceAI integrations also emit llm.token_count.prompt, llm.token_count.completion, and llm.token_count.total for backward compatibility with older dashboards.
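
Conceptually, each integration does something like the following with the provider's usage block. This is an illustrative sketch of the mapping, not traceAI's actual source, and the usage field names here mirror Anthropic's response shape:

def set_usage_attributes(span, usage: dict) -> None:
    # Core counts, taken straight from the provider's usage block.
    span.set_attribute("gen_ai.usage.input_tokens", usage["input_tokens"])
    span.set_attribute("gen_ai.usage.output_tokens", usage["output_tokens"])
    span.set_attribute(
        "gen_ai.usage.total_tokens",
        usage["input_tokens"] + usage["output_tokens"],
    )
    # Optional details, only when the provider reports them.
    if "cache_read_input_tokens" in usage:
        span.set_attribute(
            "gen_ai.usage.input_tokens.cache_read", usage["cache_read_input_tokens"]
        )
    if "reasoning_tokens" in usage:
        span.set_attribute(
            "gen_ai.usage.output_tokens.reasoning", usage["reasoning_tokens"]
        )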

The platform exposes per-trace total tokens, p99 input tokens by model, and a token-usage-by-cohort breakdown sliced by user.id, session.id, gen_ai.request.model, and gen_ai.prompt.template.version. The same data feeds the Agent Command Center cost-optimized routing policy: when a route’s average input tokens grows past a threshold, traffic shifts to a cheaper model variant.
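 
Those aggregations are straightforward to reproduce over an export of spans. A sketch assuming a hypothetical DataFrame with one row per LLM span and columns named after the OTel attributes:

import pandas as pd

# Hypothetical span export: one row per LLM span.
spans = pd.DataFrame({
    "trace_id": ["t1", "t1", "t2"],
    "gen_ai.request.model": ["claude-sonnet", "claude-sonnet", "claude-haiku"],
    "gen_ai.usage.input_tokens": [9000, 4000, 1200],
    "gen_ai.usage.output_tokens": [700, 300, 150],
})

# Tokens per trace: the number a per-trace budget would enforce.
spans["total_tokens"] = (
    spans["gen_ai.usage.input_tokens"] + spans["gen_ai.usage.output_tokens"]
)
tokens_per_trace = spans.groupby("trace_id")["total_tokens"].sum()

# p99 input tokens by model: catches prompt bloat on a single route.
p99_input_by_model = (
    spans.groupby("gen_ai.request.model")["gen_ai.usage.input_tokens"].quantile(0.99)
)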

The differentiator vs. gateway-only tools (Helicone, Portkey) is per-span granularity inside agent runs. A gateway sees the API call; FutureAGI sees the agent step that triggered it, the prompt version that was rendered, and the eval verdict on the result. A Dataset.add_evaluation() run can slice token usage by eval-fail vs eval-pass cohorts — quantifying whether failures are correlated with input length, which is a common pattern for RAG drift.
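
The cohort slice itself is a small computation once spans carry both token counts and eval scores; a sketch over the same kind of hypothetical span export:

import pandas as pd

spans = pd.DataFrame({
    "gen_ai.usage.input_tokens": [9200, 2100, 11500, 1800],
    "gen_ai.evaluation.score.value": [0.42, 0.91, 0.35, 0.88],
})
spans["eval_pass"] = spans["gen_ai.evaluation.score.value"] >= 0.6
print(spans.groupby("eval_pass")["gen_ai.usage.input_tokens"].mean())
# Failing spans averaging far more input tokens points at context bloat / RAG drift.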

In a typical incident, an engineer filters spans to gen_ai.usage.input_tokens > 8000 AND gen_ai.evaluation.score.value < 0.6 and finds the prompt template that started loading too much retrieved context.

How to Measure or Detect It

Wire these signals on every LLM span:

  • Core counts: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.total_tokens.
  • Cache breakdown: gen_ai.usage.cache_read_tokens, gen_ai.usage.cache_write_tokens, gen_ai.usage.input_tokens.cache_read.
  • Reasoning: gen_ai.usage.output_tokens.reasoning for o1/o3/thinking-mode models.
  • Multimodal: gen_ai.usage.input_tokens.audio, gen_ai.usage.output_tokens.audio.
  • Aggregations: tokens-per-trace, p99 input tokens by model, cache hit ratio (cache_read / input_tokens).
  • Cost: gen_ai.cost.input, gen_ai.cost.output, gen_ai.cost.total derived from token counts.
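
Instrumentation itself is two lines: register a trace provider, then instrument the provider's client library.
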
from fi_instrumentation import register
from traceai_anthropic import AnthropicInstrumentor

trace_provider = register(project_name="prod-claude")
AnthropicInstrumentor().instrument(tracer_provider=trace_provider)
# every Anthropic call now emits gen_ai.usage.input_tokens,
# gen_ai.usage.output_tokens, and cache token details on the span

Common Mistakes

  • Trusting client-side estimates. Counting tokens with a local tokenizer drifts from provider billing. Always use the provider’s returned usage block.
  • Aggregating without slicing. Total tokens hide the cohort that is burning budget. Slice by user.id, gen_ai.request.model, and prompt version.
  • Ignoring cache tokens. Treating cache_read as input_tokens at full price overstates cost by up to 10× on cached prefixes. Track and discount separately.
  • Missing reasoning tokens. Reasoning models bill for hidden tokens. A response with 500 visible output tokens may carry 30K reasoning tokens — invisible without the dedicated attribute.
  • No per-trace budget. Without one, an agent loop or runaway tool call can burn $50 in tokens before timing out.

Frequently Asked Questions

What is token usage tracking?

Token usage tracking is the per-call capture of input, output, cached, and reasoning token counts on every LLM span. It feeds cost attribution, model-comparison ROI, and provider rate-limit observability.

What is the difference between input and output tokens?

Input tokens are the prompt the model received (system prompt, user message, retrieved context, tool definitions). Output tokens are what the model generated. Cached input tokens are prompt prefixes the provider has hashed, stored, and reused from an earlier call, usually billed at a discount.

How do you track token usage in production?

Instrument with traceAI; every LLM span auto-populates gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.total_tokens, plus gen_ai.usage.cache_read_tokens and gen_ai.usage.output_tokens.reasoning where the provider returns them.