
LLM Cost Tracking Best Practices in 2026: Per-User, Per-Prompt, Per-Route

LLM cost tracking in 2026: token-level attribution, per-user spend caps, reasoning vs cache, gateway aggregation, drift detection. The practices that actually scale.

9 min read
llm-cost-tracking token-attribution cost-observability llm-budgets reasoning-tokens gateway-cost llmops 2026
[Cover image: a wireframe accounting ledger, five line items above a glowing TOTAL row, beside the headline LLM COST TRACKING.]

The bill arrives on the first of the month. Forty-three thousand dollars. The team’s CFO emails the eng lead. The eng lead pulls up the cost dashboard. The dashboard shows infra cost, total LLM cost, and a pie chart by provider. The slice labeled “OpenAI” grew 38% month-over-month; the slice labeled “Anthropic” held flat. There the dashboard ends. The lead spends the next four hours asking the team: which product? Which feature? Which prompt? Which user? The answer arrives in fragments stitched from gateway logs and Slack threads. By the time they pinpoint the cause (a reasoning-model rollout on the support agent that ten-x’d cost per session for a small power-user cohort), the next month’s bill is already accruing.

This is what generic cost tracking looks like applied to LLM workloads. The dimensions that matter (per-user, per-prompt-version, per-route, reasoning vs regular output, cache hit) are not in the dashboard. The fix is not better dashboards; it is the discipline to capture, attribute, and enforce on the right dimensions from the start. This piece walks through what those practices are and how teams in 2026 are wiring them.

TL;DR: Per-span capture, per-dimension attribution, gateway enforcement

LLM cost tracking that scales has three components. Capture: every LLM span carries token counts (input, output, cache_read, cache_creation, reasoning_output) and a per-span cost computed on the way in from a maintained price book. Attribute: every span tagged with user, tenant, prompt version, route, feature flag. Enforce: the gateway reads per-user and per-tenant caps and short-circuits when caps are hit. Drift detection runs on rolling cost per dimension. The bill arrives, and you already know what drove every dollar.

If you only read one paragraph: aggregate cost dashboards lie. Per-dimension cost dashboards tell the truth. Tag everything with the dimensions you care about at week one, because adding the tags after the cost spike is the part that hurts.

Why LLM cost tracking matters in 2026

Three forces converged.

First, reasoning models changed the cost curve. A non-reasoning chat call runs $0.001-$0.01. A reasoning call with 30K reasoning tokens at $15 per 1M output tokens runs $0.45 per call. The same workload on a reasoning model can cost tens to hundreds of times more than on a chat model, depending on the reasoning-token mix. Without reasoning-aware cost tracking, the bill arrives and the team rediscovers the per-call cost.

Second, agent loops are uncapped by default. An agent that retries tool calls 8 times when stuck consumes roughly 8x the tokens of a clean run. A handful of stuck sessions can dominate the daily cost. Per-user p99 tracking surfaces these; aggregate tracking does not.

Third, B2B workloads ship to enterprise customers with cost expectations. Enterprise customers want per-tenant invoicing, capped tenants, and the ability to charge back. A platform that cannot answer “how much did tenant X cost this month, broken down by feature?” is a platform that cannot price.

Standards like the OpenTelemetry GenAI semantic conventions name the token-usage attributes, including the cache and reasoning sub-attributes. Capturing these on every span is the foundation everything else rides on.

The 10 cost tracking practices

1. Capture token counts on every span

Capture the OTel GenAI usage attributes on every LLM call (Recommended where applicable; note the GenAI conventions remain in Development status):

  • gen_ai.usage.input_tokens (prompt tokens)
  • gen_ai.usage.output_tokens (total output tokens; OTel GenAI conventions include reasoning tokens within this total)
  • gen_ai.usage.cache_creation.input_tokens (tokens written to provider cache)
  • gen_ai.usage.cache_read.input_tokens (tokens served from provider cache)
  • gen_ai.usage.reasoning.output_tokens (sub-attribute breakdown of output_tokens for reasoning models)

Capturing only input_tokens and output_tokens is the most common failure mode. Reasoning models look cheap on naive output counts and are not.
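
A minimal capture sketch using the OpenTelemetry Python API; the normalized usage dict and its keys are assumptions standing in for your own provider-response adapter:

from opentelemetry import trace

def record_usage(usage: dict) -> None:
    # usage: the provider response already normalized to the five counts
    # (that normalization is your adapter's job, not shown here).
    span = trace.get_current_span()
    span.set_attribute("gen_ai.usage.input_tokens", usage["input_tokens"])
    span.set_attribute("gen_ai.usage.output_tokens", usage["output_tokens"])
    span.set_attribute("gen_ai.usage.cache_read.input_tokens", usage.get("cache_read", 0))
    span.set_attribute("gen_ai.usage.cache_creation.input_tokens", usage.get("cache_creation", 0))
    span.set_attribute("gen_ai.usage.reasoning.output_tokens", usage.get("reasoning_output", 0))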

2. Compute cost per span at ingest

Multiply token counts by current prices on the way in and tag the result as cost.usd on the span. This avoids a recompute pass at query time and preserves the price snapshot used at ingest, which is only as accurate as the price book was at that moment.

A reasonable cost computation:

from dataclasses import dataclass

@dataclass(frozen=True)
class Price:
    input: float           # USD per uncached input token
    output: float          # USD per output token (reasoning included)
    cache_read: float      # USD per token served from provider cache
    cache_creation: float  # USD per token written to provider cache

def usage_includes_cache_in_input(model: str) -> bool:
    # OpenAI-style usage counts cached tokens inside prompt_tokens; Anthropic-style
    # usage reports cache tokens alongside, not inside, input_tokens.
    return model.startswith("gpt")  # simplification: key this off your provider adapter

def span_cost(model, input_tokens, output_tokens, cache_read,
              cache_creation, reasoning_output):
    p = price_book[model]  # price_book: dict[str, Price]; the maintained book (practice 3)
    # Provider usage formats differ. Normalize first.
    # OTel GenAI conventions place cache_read/cache_creation under input_tokens and
    # reasoning_output under output_tokens. Anthropic exposes cache_creation_input_tokens
    # and cache_read_input_tokens separately from input_tokens. OpenAI exposes cached_tokens
    # under prompt_tokens_details. Validate the invariant for your provider before subtracting.
    if usage_includes_cache_in_input(model):
        uncached_input = max(input_tokens - cache_read - cache_creation, 0)
    else:
        uncached_input = input_tokens
    return (
        uncached_input * p.input
        + cache_read * p.cache_read          # often ~10x cheaper than base input
        + cache_creation * p.cache_creation  # often ~1.25x input on Anthropic
        + output_tokens * p.output           # output_tokens already includes reasoning
    )
# Track reasoning_output separately as an attribution dimension, not a billing addend.

The cost.usd attribute becomes a first-class number you sum, group, and alert on.

3. Maintain a price book

A price book is a mapping of (provider, model, tier) to per-token cost. Provider price changes land anywhere from monthly to quarterly, and cached and reasoning prices differ from base prices. Stale price books produce wrong numbers.

Two viable patterns:

  • Vendor-provided. Your gateway or observability platform maintains the book. You consume it. Saves engineering time but binds you to the platform’s update cadence.
  • In-repo. A YAML or JSON file in the same repo as the instrumentation, updated by a watcher script that polls provider pricing pages. More work but no vendor coupling.

Either way: when provider X changes pricing, the book updates within a business day. A price book six months stale produces materially wrong dashboards; provider price changes have ranged from minor to substantial.
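
A minimal in-repo sketch of the second pattern, reusing the Price type from practice 2; the model names and rates below are illustrative placeholders, not current prices:

from datetime import date

price_book_updated = date(2026, 3, 1)  # bump on every edit, by hand or by the watcher
price_book: dict[str, Price] = {
    # hypothetical models; rates in USD per token
    "example/chat-small": Price(input=1.0e-7, output=4.0e-7,
                                cache_read=1.0e-8, cache_creation=1.25e-7),
    "example/reasoning-large": Price(input=2.0e-6, output=1.5e-5,
                                     cache_read=2.0e-7, cache_creation=2.5e-6),
}

def assert_book_fresh(max_age_days: int = 45) -> None:
    # Fail loudly in CI rather than feed stale prices to the ingest path.
    age = (date.today() - price_book_updated).days
    if age > max_age_days:
        raise RuntimeError(f"price book is {age} days old; update it")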

4. Per-user, per-tenant attribution

Tag every span with the user id (or a hashed user id, depending on PII policy) and the tenant id. The cost dashboard pivots on these.

The reason: averages lie. A workload with $0.05 mean cost per user can have a 99th percentile of $5 per user. The 99th percentile is where the budget alerts and the abuse signals live. Without per-user tags, you cannot compute the percentile.

For B2B, per-tenant attribution is also the basis for chargeback and invoicing.
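
The pivot is small once the tags exist. A sketch assuming spans exported from the backend as a pandas DataFrame with user_id and cost_usd columns (both column names are assumptions):

import pandas as pd

spans = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],   # toy rows; in practice, a backend export
    "cost_usd": [0.02, 0.03, 0.01, 4.80],
})
per_user = spans.groupby("user_id")["cost_usd"].sum()
print(f"mean per user: ${per_user.mean():.2f}")          # the number that lies
print(f"p99 per user:  ${per_user.quantile(0.99):.2f}")  # where abuse and stuck sessions live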

5. Per-prompt-version attribution

Tag every LLM span with the prompt version id (resolved from the registry; see Prompt Versioning). The cost dashboard alerts on the per-version cost delta.

A common failure: a prompt change adds a few hundred tokens of system context for accuracy reasons; cost per call rises 12%; nobody notices for a month because aggregate cost is flat-ish (other things changed). Per-version delta surfaces this within a day.
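
The delta check, as a sketch on the same assumed DataFrame (now carrying a prompt_version column); the version ids are hypothetical:

per_version = spans.groupby("prompt_version")["cost_usd"].mean()
delta = per_version["support-v13"] / per_version["support-v12"] - 1  # hypothetical ids
if delta > 0.10:
    print(f"ALERT: support-v13 costs {delta:.0%} more per call than support-v12")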

[Figure: a ledger breaking one month of LLM spend into seven line items: input_tokens (gpt-4o) $124.20, output_tokens (gpt-4o) $87.10, cache_read $12.40, reasoning_tokens (o3) $402.66, embeddings $14.80, judge_tokens $61.30, tool_overhead $7.10; TOTAL $709.56.]

6. Gateway-level budget enforcement

The gateway is on the request path. When a per-user or per-tenant cap is exceeded, the gateway short-circuits the request before it reaches the provider. Without enforcement at the gateway, caps are aspirational.

Seven viable gateways in 2026:

  • Future AGI Agent Command Center. Apache 2.0, Go-based, ships budget enforcement plus routing, caching, fallback.
  • Helicone Gateway. Apache 2.0, OpenAI-compatible. After the Mintlify acquisition, the gateway is in maintenance mode; verify currency before adopting.
  • OpenRouter. Closed SaaS, multi-model unified API.
  • Portkey. MIT open-source AI Gateway plus hosted and enterprise platform; gateway-first observability.
  • LiteLLM. MIT, lightweight provider proxy with budget hooks.
  • Cloudflare AI Gateway. Closed, free tier with paid scaling.
  • Vercel AI Gateway. Closed, integrates with Vercel deployments.

Pick by infra alignment. Do not write a gateway from scratch unless your constraints rule out all seven.
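
The seven expose different config surfaces, so what follows is only the shape of the check: a hypothetical in-process hook, not any specific gateway’s API.

class BudgetExceeded(Exception):
    pass

tenant_caps_usd = {"acme": 500.0}    # hypothetical monthly caps
tenant_spend_usd = {"acme": 512.3}   # rolling spend, fed by the ingest pipeline

def pre_request_hook(tenant_id: str) -> None:
    # Runs on the request path, before the provider call; this placement is
    # what makes the cap enforcement rather than aspiration.
    if tenant_spend_usd.get(tenant_id, 0.0) >= tenant_caps_usd.get(tenant_id, float("inf")):
        raise BudgetExceeded(f"tenant {tenant_id} over monthly cap; request short-circuited")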

7. Drift detection on rolling cost

Four signals worth alerting on:

  • Rolling daily total. Alert on 20%+ deviation from baseline.
  • Per-user p99 cost. Alert when long-tail user cost grows.
  • Per-prompt-version cost delta. Alert when a new version costs 10%+ more than incumbent.
  • Per-route cost. Alert on routes with no corresponding traffic growth.

Drift detection is the early signal. The bill at the end of the month is the trailing indicator. By the time the bill arrives, the cost spike has been running for weeks.
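
A sketch of the first signal, assuming a daily total-cost series exported from the backend (toy numbers):

import pandas as pd

daily = pd.Series(
    [41.0, 39.5, 40.2, 42.1, 40.8, 41.5, 40.9, 55.3],  # USD per day
    index=pd.date_range("2026-03-01", periods=8),
)
baseline = daily.rolling(7).median().shift(1)  # trailing 7-day median, excluding today
deviation = daily / baseline - 1
print(deviation[deviation > 0.20])             # flags the 55.3 day, ~35% over baseline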

8. Reasoning tokens as a first-class dimension

Reasoning tokens are billed as output by most providers but are invisible in the response payload. A reasoning call that produces 30K reasoning tokens before 500 visible tokens is 60x more expensive than the visible output suggests.

Capture gen_ai.usage.reasoning.output_tokens separately. Display it as its own line in cost dashboards. Alert when reasoning cost grows faster than non-reasoning cost.
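
The 60x figure above is just a ratio of span attributes; a minimal sketch:

visible_tokens = 500
reasoning_tokens = 30_000
# Both bill at the output rate; the payload only shows the visible 500.
print((visible_tokens + reasoning_tokens) / visible_tokens)  # 61.0, ~60x the visible output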

9. Cache hit-rate tracking

Cache attributes (cache_read.input_tokens, cache_creation.input_tokens) tell you how much you saved. Track cache hit rate per route, per prompt version. A route with 80% hit rate is doing it right; a route with 5% hit rate has a cache strategy problem (cache key churn, prompt instability, TTL too short).

Cache savings are real. Anthropic’s cached input pricing is roughly 10x cheaper than uncached. OpenAI’s prompt caching saves similarly. A workload that consistently hits cache materially reduces input-token spend; the bill impact depends on cacheable share, hit rate, and write frequency.
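
A hit-rate sketch; whether cache writes belong in the denominator is a convention to pin down per provider (they are included here):

def cache_hit_rate(cache_read: int, cache_creation: int, uncached_input: int) -> float:
    # Share of input-side tokens served from cache for a route/version slice.
    total_input = cache_read + cache_creation + uncached_input
    return cache_read / total_input if total_input else 0.0

print(f"{cache_hit_rate(80_000, 5_000, 15_000):.0%}")  # 80% -- doing it right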

10. Cost per evaluator

Online scoring is itself an LLM call. A judge running on every production span is its own cost line, and at scale it can rival production cost. Track judge cost separately from production cost; budget judge cost against a sampling rate.

A typical configuration: production calls run on a chat or reasoning model; judges run on a distilled small model (Galileo Luna, Future AGI Turing-flash) at 5-20% sample rate. The judge cost stays under 10% of production cost. If it exceeds 25%, the sample rate is too high or the judge model is overpowered.
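
The budget math as a sketch; all three inputs are numbers measured from your own spans, and the example rates are assumptions:

def judge_cost_share(prod_cost_per_call: float, judge_cost_per_call: float,
                     sample_rate: float) -> float:
    # Expected judge spend per production call, as a share of production spend.
    return (judge_cost_per_call * sample_rate) / prod_cost_per_call

# A distilled judge at ~1/10th the per-call cost, sampled on 20% of traffic:
print(f"{judge_cost_share(0.010, 0.001, 0.20):.1%}")  # 2.0% -- well under the 10% line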

Common mistakes when tracking LLM cost

  • Tracking only aggregate cost. Aggregate dashboards do not surface the dimensions that drive cost.
  • Forgetting reasoning tokens. Reasoning models look cheap on naive output counts.
  • Stale price book. Pricing changes land monthly to quarterly; a six-month-stale book can be materially wrong.
  • No per-user attribution. P99 user cost is where abuse and stuck sessions live.
  • No per-version delta alerts. A prompt change that adds 12% per call hides in aggregate cost.
  • No gateway enforcement. Caps without enforcement are aspirational.
  • Ignoring cache. A route with 5% cache hit rate is leaving a large share of its input-token budget on the floor.
  • Cost recompute at query time. Dashboards built on raw token counts plus a stale price book are slow and wrong.
  • No judge cost line. Online scoring at the wrong sample rate can rival production cost.
  • Treating tier discounts as a discount on the bill but not in tracking. Provider tier discounts (Anthropic Tier 4, OpenAI usage tiers) compound over the month; track the effective rate, not the rack rate.

What changed in LLM cost tracking in 2026

Date | Event | Why it matters
2026 | Reasoning-token attributes named in OTel GenAI conventions | Cost dashboards stopped under-attributing reasoning models
2026 | Distilled judge models hit production scale | Online judge cost dropped 10-50x; sampling math changed
2026 | Provider cached-input pricing standardized across providers | Cache hit rate became a first-class cost lever
Mar 2026 | Future AGI shipped Agent Command Center with budget enforcement | Gateway-level per-user, per-tenant caps moved out of custom code
Mar 3, 2026 | Helicone joined Mintlify, gateway in maintenance mode | Gateway choice list shifted; verify currency before adopting
2026 | OpenAI prompt caching expanded across model lineup | Cache attribute capture became universally relevant

How to actually track LLM cost in 2026

  1. Tag every span with the dimensions: user, tenant, prompt version, route, feature flag, model.
  2. Capture all five token sub-attributes: input, output, cache_read, cache_creation, reasoning_output.
  3. Maintain a price book in repo. Update within a business day of provider pricing changes.
  4. Compute cost at ingest: cost.usd as a first-class span attribute.
  5. Build per-dimension dashboards: user percentiles, prompt-version delta, route, model.
  6. Wire the gateway for enforcement: per-user and per-tenant caps.
  7. Alert on drift: rolling daily total, p99 user cost, per-version delta, per-route cost.
  8. Audit monthly. Compare the gateway sum, the observability backend sum, and the provider invoice; reconcile gaps (see the sketch below).
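
A reconciliation sketch for step 8; the three totals come from the gateway, the observability backend, and the provider invoice, and the 2% tolerance is an assumption to tune:

def reconcile(gateway_total: float, backend_total: float, invoice_total: float,
              tol: float = 0.02) -> list[str]:
    gaps = []
    for name, (a, b) in {
        "gateway vs backend": (gateway_total, backend_total),
        "backend vs invoice": (backend_total, invoice_total),
    }.items():
        if abs(a - b) / max(a, b) > tol:
            gaps.append(f"{name}: ${a:,.2f} vs ${b:,.2f}")
    return gaps

print(reconcile(43_112.40, 42_980.10, 43_250.00))  # [] -> totals agree within 2%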


Related: LLM Tracing Best Practices, What is LLM Tracing?, Production LLM Monitoring Checklist, LLM Cost Optimization

Frequently asked questions

Why does LLM cost tracking need its own discipline in 2026?
Cost left the footnote and joined the budget. A reasoning model burning 40K output tokens at $15 per 1M tokens turns a single user turn into 60 cents. Multiply by retries, tool calls, judge evals, and a single feature can cost more than the user's monthly subscription. Generic infra cost tracking does not capture token usage per prompt version, per user, per route. Without the per-dimension attribution, cost spikes are mysteries and budget gates are aspirational.
What attribution dimensions should I track?
Five at minimum. Per-user (or per-tenant for B2B): some users burn 100x the average. Per-prompt-version: a prompt change can quietly double cost. Per-route or per-feature: a feature on a path nobody monitors can drift unbounded. Per-model: model swaps shift the cost curve. Per-evaluator: judge calls can rival the production calls themselves. The five compose into a cost cube you can pivot.
How do I capture cost on a span level?
Compute cost on the way in. Capture token counts as OTel GenAI attributes (gen_ai.usage.input_tokens, output_tokens, cache_read.input_tokens, cache_creation.input_tokens, reasoning.output_tokens), multiply by current prices, tag the result as a span attribute. Avoids the recompute pass at query time. The cache and reasoning attributes matter; collapsing them into a single token count under-attributes reasoning models and over-attributes cached calls.
Where should the cost ledger live, the gateway or the observability backend?
Both, for different uses. The gateway (Future AGI Agent Command Center, LiteLLM, Helicone, Portkey, OpenRouter, Cloudflare AI Gateway, Vercel AI Gateway) tracks cost on the request path and enforces budgets in real time. The observability backend stores per-span cost for analytics and drift detection. The gateway's cost view is for enforcement; the observability backend's cost view is for understanding.
What are reasonable per-user cost caps?
Workload-specific. Healthy ranges: a chat product at $0.01-$0.10 per active user per day; an agent product at $0.10-$1.00 per active user per day; a research-assistant product at $1-$10 per session. Cap enforcement at 5x the median is a reasonable starting heuristic; below that, false positives hurt; above that, runaway sessions hurt. Tune against your retention curve.
How do I detect cost drift before the bill arrives?
Watch four signals. Rolling daily total (alert on 20% jump). Per-user p99 cost (alert when long-tail users grow). Per-prompt-version cost delta (alert when a new version costs 10%+ more than incumbent). Per-route cost (alert on routes with no traffic-growth justification). Drift detection on cost is the early signal; the bill at the end of the month is the trailing indicator.
How do reasoning tokens change cost tracking?
Reasoning models (OpenAI o-class, Claude extended thinking, Gemini deep-think) emit reasoning tokens that are billed as output tokens but not visible in the response. A reasoning call that produces 30K reasoning tokens before 500 visible tokens looks cheap on a naive output-token count and is in fact 60x more expensive. Capture gen_ai.usage.reasoning.output_tokens separately. Without it, your dashboards lie about your most expensive calls.
What does cost tracking infrastructure cost in operational complexity?
At minimum: token-count capture on every LLM span, a price book mapping model to per-token cost, a multiplier function, and a dashboard. The harder costs: keeping the price book current as providers update pricing, handling provider-specific cost quirks (tier discounts, batch API rebates, cached input pricing), and surfacing budget breaches before they become incidents. Tools that ship a maintained price book and enforce budgets at the gateway save the most engineering time.