Best Tools for Token Cost Tracking in LLMs in 2026: 7 Platforms Compared
Helicone, Langfuse cost panels, Datadog LLM cost, Braintrust cost panels, Phoenix token costs, Portkey, and FutureAGI compared on per-tenant, per-feature, and per-agent token attribution.
Token cost tracking in 2026 is the FinOps layer beneath LLM cost tracking. Aggregate spend tells you the bill. Token attribution tells you which tenant, feature, agent, prompt version, or experiment burned which tokens. The seven tools below cover gateway-first attribution, observability platforms with cost dashboards, OSS metering, and platform-bundled token tracking. The differences that matter are tag depth, price table freshness, multi-dimensional slicing, and how the tool handles cached input pricing. This guide is the honest shortlist; see Best LLM Cost Tracking Tools for the broader cost-tracking category.
TL;DR: Best token cost tracking tool per use case
| Use case | Best pick | Why (one phrase) | Pricing | OSS |
|---|---|---|---|---|
| Unified token attribution + eval + observe + simulate + gate + optimize loop | FutureAGI | One plane for cost, eval, runtime guardrails, simulation, gateway routing | Free + usage from $5/100K reqs | Apache 2.0 |
| Gateway-attached per-user attribution | Helicone | Mature property-based tagging | Hobby free, Pro $79/mo | Apache 2.0 |
| Self-hosted custom cost aggregations | Langfuse | Tag any dimension, query freely | Hobby free, Core $29/mo | MIT core |
| LLM cost correlated with infra cost | Datadog LLM cost | One dashboard for LLM and infra | Billed by monitored LLM span/request volume; verify on pricing page | Closed |
| Cost tied to experiments and CI | Braintrust cost panels | Per-experiment, per-score cost | Starter free, Pro $249/mo | Closed |
| OpenTelemetry-native token cost | Phoenix token costs | OTel span-attached cost attributes | Phoenix free, AX Pro $50/mo | ELv2 |
| Gateway routing + cost dashboard | Portkey | Routing rule attribution | Free + paid from $49/mo | MIT gateway |
If you only read one row: pick FutureAGI when token attribution must close back into evals, runtime guardrails, simulation, and gateway routing on the same plane; pick Helicone for property-based per-user attribution; pick Langfuse for self-hosted custom aggregations.
What token attribution actually requires
A working token cost tracking layer covers six functions:
- Tag instrumentation. Every LLM request emits tenant_id, user_id, feature_id, agent_role, prompt_version, and any other dimension you want to slice (see the sketch after this list).
- Maintained price table. Per-model, per-token-type cost lookup updated within 30 days of vendor pricing changes.
- Cached vs uncached input separation. OpenAI, Anthropic, Bedrock all support prompt caching with reduced cached-input pricing.
- Multi-dimensional slicing. Group by any tag combination. Per-tenant per-feature, per-agent per-prompt-version, per-experiment per-model.
- Daily and forecast dashboards. Daily spend trend, weekly delta, and forecast based on recent slope.
- Spike alerts. Alert when a tenant or feature crosses a threshold; rollback gateway routing or notify product owner.
Without all six, the dashboard is a graph, not a FinOps tool.
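As referenced in the first list item, here is a minimal Python sketch of the first two functions: tag emission plus a price-table lookup. The PRICE_TABLE figures and the RequestCost/record names are illustrative, not any vendor's API.

```python
from dataclasses import dataclass

# Illustrative per-1M-token price table; a real one must be maintained and
# reconciled against provider invoices (function 2 in the list above).
PRICE_TABLE = {
    # model: (uncached input, cached input, output) in USD per 1M tokens
    "gpt-4o-mini": (0.15, 0.075, 0.60),
}

@dataclass
class RequestCost:
    tenant_id: str
    feature_id: str
    model: str
    uncached_input: int
    cached_input: int
    output: int

    @property
    def usd(self) -> float:
        uncached, cached, out = PRICE_TABLE[self.model]
        return (self.uncached_input * uncached
                + self.cached_input * cached
                + self.output * out) / 1_000_000

def record(cost: RequestCost) -> None:
    # Emit one row per request; the dashboard then groups by any tag combination.
    print(f"{cost.tenant_id}/{cost.feature_id}/{cost.model}: ${cost.usd:.6f}")

record(RequestCost("tenant-42", "search-summarize", "gpt-4o-mini",
                   uncached_input=800, cached_input=3200, output=250))
```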
The 7 token cost tracking tools compared
1. FutureAGI: The leading token cost tracking platform with attribution + evals + gateway in one runtime
Apache 2.0. Self-hostable. Hosted cloud option.
FutureAGI ranks #1 here for teams whose token attribution must close back into evals, runtime guardrails, simulation, and gateway routing on the same plane. The platform captures token and cost metadata on LLM, gateway, and eval spans via traceAI or SDK instrumentation, attaches Turing eval scores, and routes via the Agent Command Center BYOK gateway across 100+ providers. It supports per-tenant, per-feature, and per-agent slicing on the same data, alongside 50+ eval metrics, 18+ runtime guardrails, simulation, and 6 prompt-optimization algorithms.
Use case: Production token attribution where the cost dashboard sits next to evals and gateway routing in the same product, especially for RAG agents, voice agents, and copilots where token cost should drive gateway routing decisions (route expensive tenants to cheaper models) and the cost panel must show eval scores side-by-side.
Pricing: Free plus usage from $5 per 100,000 gateway requests, $2/GB storage, $10 per 1,000 AI credits. Boost $250/mo, Scale $750/mo (HIPAA), Enterprise from $2,000/mo (SOC 2).
OSS status: Apache 2.0, more permissive than Datadog and Braintrust (both closed source) and Phoenix’s ELv2.
Performance: turing_flash runs guardrail screening at 50-70 ms p95 and full eval templates run in roughly 1-2 seconds.
Best for: Teams that want one runtime where token attribution, eval scoring, and gateway gating close on each other.
Worth flagging: Helicone is genuinely the lowest-friction path from base-URL change to a per-tenant property-tagged cost dashboard, but FutureAGI delivers the same gateway-attached attribution plus eval, simulation, and CI gates in one platform.
2. Helicone: Best for gateway-attached per-user attribution
Apache 2.0. Self-hostable. Hosted cloud option.
Use case: Token attribution by tagging requests at the gateway. Helicone uses request headers (Helicone-Property-tenant_id, Helicone-Property-feature_id) to attach metadata to every LLM call; the dashboard slices by any property combination. Mature attribution model with one of the longer histories in the OSS LLM observability category.
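A minimal sketch of that pattern with the OpenAI Python SDK, assuming Helicone's documented gateway URL and Helicone-Property-* header convention (verify both against current docs; the property values are examples):

```python
from openai import OpenAI

client = OpenAI(
    api_key="<OPENAI_API_KEY>",
    base_url="https://oai.helicone.ai/v1",  # route calls through Helicone
    default_headers={
        "Helicone-Auth": "Bearer <HELICONE_API_KEY>",
        # Custom properties attach to every request for dashboard slicing.
        "Helicone-Property-tenant_id": "tenant-42",
        "Helicone-Property-feature_id": "search-summarize",
    },
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize today's usage."}],
)
```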
Pricing: Hobby includes 10,000 free requests; Pro starts at $79/mo and usage-based pricing applies. Team adds SOC-2 / HIPAA posture and a dedicated Slack/private channel; Enterprise adds SAML SSO, on-prem deployment, custom terms, and dedicated support.
OSS status: Apache 2.0. Verify current GitHub star count on the repo page.
Best for: Teams that want low-code gateway attribution by switching the OpenAI base URL to Helicone’s gateway; per-tenant attribution still requires Helicone-Property headers or metadata. Property-based tagging is the lowest-friction path to per-tenant cost.
Worth flagging: Following the March 2026 Mintlify announcement, Helicone says the service stays live in maintenance mode with security updates, new model support, and bug and performance fixes; treat roadmap depth as a diligence item. Span depth is shallower than Phoenix or Langfuse for multi-step agents. See Helicone Alternatives.
3. Langfuse cost panels: Best for self-hosted custom cost aggregations
MIT core. Self-hostable.
Use case: Self-hosted token attribution with custom dashboards. Langfuse stores cost per trace and per span; the dashboard supports custom queries (group by metadata, filter by tag, aggregate over windows). Strong fit when the team needs aggregations the vendor dashboard does not ship out of the box.
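A hedged sketch using Langfuse's drop-in OpenAI wrapper, which traces the call and attaches metadata for later aggregation; the metadata keys are examples, and the supported kwargs should be verified against the current Langfuse SDK docs:

```python
from langfuse.openai import openai  # drop-in replacement that traces each call

# Langfuse stores the metadata on the trace, so custom dashboards can group
# cost by any of these dimensions.
completion = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize today's usage."}],
    metadata={"tenant_id": "tenant-42", "feature_id": "search-summarize"},
)
```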
Pricing: Hobby free with 50K units per month, 30 days data access, 2 users. Core $29/mo with 100K units. Pro $199/mo. Enterprise $2,499/mo.
OSS status: MIT core. Enterprise directories handled separately.
Best for: Platform teams that want trace data in their own infrastructure with first-party cost panels and custom aggregations.
Worth flagging: Cached input pricing requires explicit instrumentation; verify the cost table treats cached and uncached separately. See Langfuse Alternatives.
4. Datadog LLM cost: Best for LLM cost correlated with infra cost
Closed platform. SaaS only.
Use case: Teams that already run Datadog APM and want LLM token cost in the same dashboard as infra spend, network cost, and database cost. Datadog LLM Observability surfaces tokens, cost, latency, and error rate alongside the rest of the APM stack.
Pricing: LLM Observability is billed by monitored LLM span/request volume; verify current minimums, included spans, retention, and add-ons on Datadog’s pricing page. Total cost depends on span volume, indexed logs, hosts, and retention.
OSS status: Closed.
Best for: Engineering organizations standardized on Datadog where infra correlation matters more than open instrumentation.
Worth flagging: Total contract size depends on monitored span volume, retention, and any add-ons. Per-tenant attribution requires explicit tenant tags or metadata on every request. Eval depth is shallower than dedicated LLM platforms.
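A minimal sketch of that explicit tagging with plain ddtrace APM spans; Datadog's LLM Observability SDK adds dedicated annotation APIs (verify against current ddtrace docs), and run_model here is a hypothetical stand-in:

```python
from ddtrace import tracer

def run_model(prompt: str) -> str:
    return "..."  # hypothetical stand-in for the real LLM call

def call_llm(tenant_id: str, feature_id: str, prompt: str) -> str:
    # Explicit tags on every request are what make per-tenant slicing possible.
    with tracer.trace("llm.request") as span:
        span.set_tag("tenant_id", tenant_id)
        span.set_tag("feature_id", feature_id)
        return run_model(prompt)
```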
5. Braintrust cost panels: Best for cost tied to experiments and CI
Closed platform. Hosted SaaS or enterprise self-host.
Use case: Token cost tied to experiments, datasets, and CI gates. Braintrust shows cost per experiment run, cost per scorer call, and cost per benchmark across model variants. Strong fit when the FinOps question is “did experiment X overspend, and which prompt change caused it.”
Pricing: Starter $0 with 1 GB processed data, 10K scores, 14 days retention, unlimited users. Pro $249/mo with 5 GB, 50K scores, 30 days retention. Enterprise custom.
OSS status: Closed.
Best for: Teams that already use Braintrust for experiments and want cost as a first-class CI metric. The CI gate pattern (block PR on cost regression) is canonical here.
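A tool-agnostic sketch of that CI gate: compare a candidate experiment's cost summary against the baseline and fail the build on regression. The file names and JSON fields are hypothetical; wire them to whatever export your eval platform provides.

```python
import json
import sys

THRESHOLD = 1.10  # fail the gate if total cost grows more than 10%

with open("baseline_cost.json") as f:
    baseline = json.load(f)["total_usd"]
with open("candidate_cost.json") as f:
    candidate = json.load(f)["total_usd"]

if candidate > baseline * THRESHOLD:
    print(f"Cost regression: ${candidate:.2f} vs baseline ${baseline:.2f}")
    sys.exit(1)  # non-zero exit blocks the PR
print(f"Cost OK: ${candidate:.2f} (baseline ${baseline:.2f})")
```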
Worth flagging: No first-party gateway. Per-tenant attribution requires metadata tagging on every score; less seamless than Helicone’s gateway-attached property model. See Braintrust Alternatives.
6. Phoenix token costs: Best for OpenTelemetry-native token attribution
Source available (ELv2). Self-hostable.
Use case: Token cost as OpenInference span attributes. Phoenix accepts OTLP traces with attributes such as llm.token_count.prompt, llm.token_count.completion, llm.token_count.prompt_details.cache_read, llm.token_count.prompt_details.cache_write, and llm.cost.total; the dashboard aggregates by any span attribute including model, project, environment, and custom tags. Verify current attribute names against the OpenInference semantic conventions.
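A minimal sketch of emitting those attributes on an OpenTelemetry span. It assumes an OTLP exporter pointed at Phoenix is already configured; the attribute names come from the list above and should be verified against the current OpenInference spec.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

with tracer.start_as_current_span("llm.call") as span:
    span.set_attribute("llm.token_count.prompt", 1200)
    span.set_attribute("llm.token_count.completion", 240)
    span.set_attribute("llm.token_count.prompt_details.cache_read", 900)
    span.set_attribute("tenant_id", "tenant-42")  # custom tag for slicing
```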
Pricing: Phoenix free for self-hosting. AX Pro $50/mo for the hosted path with deeper retention.
OSS status: ELv2. Source available with restrictions on offering as a managed service.
Best for: Engineers who want OpenInference adherence and a self-hosted OTel workbench for token attribution. The attribute schema is portable to any OTel-compatible backend.
Worth flagging: Phoenix is not a gateway. Tag instrumentation lives at the SDK or OTel collector; verify your client emits the right attributes. ELv2 license matters for legal teams that follow OSI definitions strictly.
7. Portkey: Best for gateway routing + cost dashboard
MIT gateway. Closed paid surface.
Use case: LLM gateway with first-class cost tracking tied to routing rules. Portkey routes requests across providers (OpenAI, Anthropic, Bedrock, Together, custom), captures cost per request, and surfaces dashboards by virtual key, by route, and by metadata.
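A hedged sketch of routing through Portkey via the OpenAI SDK, with metadata attached for attribution. Header names follow Portkey's x-portkey-* convention; verify the exact gateway URL and metadata header against current Portkey docs.

```python
import json

from openai import OpenAI

client = OpenAI(
    api_key="<PROVIDER_KEY>",  # or rely on a Portkey virtual key instead
    base_url="https://api.portkey.ai/v1",
    default_headers={
        "x-portkey-api-key": "<PORTKEY_API_KEY>",
        "x-portkey-virtual-key": "<VIRTUAL_KEY>",  # per-tenant provider key
        "x-portkey-metadata": json.dumps({"tenant_id": "tenant-42"}),
    },
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Route me."}],
)
```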
Pricing: Free tier covers 10K requests/mo. Paid tiers from $49/mo with higher request volume, SSO, and team features.
OSS status: MIT gateway. Closed paid platform.
Best for: Teams that want one product for LLM gateway + cost tracking + routing rules + virtual keys for tenant attribution.
Worth flagging: Smaller eval surface than Braintrust or FutureAGI; pair with an eval framework if scoring is the goal. Gateway-only architecture; no first-party simulation.
Decision framework: pick by constraint
- Gateway-first per-user attribution: Helicone.
- Self-hosted custom aggregations: Langfuse.
- Already on Datadog: Datadog LLM cost.
- Cost tied to experiments and CI: Braintrust.
- OpenTelemetry adherence: Phoenix token costs.
- Gateway routing rules + cost: Portkey.
- Bundled with evals + gateway: FutureAGI.
- Multi-cloud finance allocation on top: Vantage or CloudZero with any of the above as the LLM source.
Common mistakes when tracking token cost
- Not tagging every request. Un-tagged requests aggregate to “unknown” and the per-tenant dashboard breaks. Make tag emission an SDK middleware concern, not a per-call one.
- Stale price tables. Vendor pricing changes regularly. A stale table can materially miscalculate cost when new models or new caching tiers ship. Verify the dashboard’s pricing against the provider invoice monthly and audit a sample of recent invoices.
- Ignoring cached input pricing. OpenAI cached input is around 50% of uncached. Anthropic prompt caching is around 10% of uncached for cached reads. A tool that lumps cached and uncached over-counts cost; see the worked example after this list.
- Single-dimension slicing. “Cost by model” is a starting point. The interesting question is “cost by tenant by feature by model”; verify multi-dimensional slicing works before committing.
- No spike alerts. A tenant on autopilot can 10x your bill in a day. Set per-tenant and per-feature thresholds with alert and (optionally) gateway-side rate limits.
- Skipping retention. 90 days of trace data on a high-volume agent stack is 200 GB to 2 TB. Storage retention dominates the bill at scale; verify the math on your data plane.
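The cached-pricing worked example referenced above, with illustrative rates (an Anthropic-style ~10% cached-read factor; swap in your maintained price table):

```python
UNCACHED = 3.00           # USD per 1M input tokens (illustrative)
CACHE_READ_FACTOR = 0.10  # cached reads at ~10% of uncached (Anthropic-style)

def input_cost(uncached_tokens: int, cached_tokens: int) -> float:
    return (uncached_tokens * UNCACHED
            + cached_tokens * UNCACHED * CACHE_READ_FACTOR) / 1_000_000

# 1M input tokens with 80% served from cache:
naive = input_cost(1_000_000, 0)       # $3.00 if caching is ignored
actual = input_cost(200_000, 800_000)  # $0.60 + $0.24 = $0.84
print(f"naive ${naive:.2f} vs actual ${actual:.2f}")  # ~3.6x over-count
```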
What changed in token cost tracking in 2026
| Date | Event | Why it matters |
|---|---|---|
| 2024-2025 | OpenAI prompt caching went GA | Cached input pricing became a first-class FinOps dimension. |
| 2025 | OpenInference semantic conventions cover token and cost attributes | Cross-platform schema for token cost on spans (verify the current spec in the linked repo). |
| Mar 2026 | FutureAGI shipped Agent Command Center with token attribution | Cost panels closed into eval and gateway routing. |
| Mar 2026 | Helicone joined Mintlify | Roadmap risk became part of vendor diligence. |
| 2025 | Langfuse shipped custom aggregations on cost panels | Self-hosted custom slicing matured. |
| 2025 | Anthropic prompt caching expanded | More cached pricing models needed in the price table. |
| 2025 | Portkey gateway open sourced | MIT gateway became a first-class OSS option. |
How to actually evaluate this for production
- Inventory the dimensions you need. List the dimensions you actually want to slice by (tenant, feature, agent role, prompt version, experiment, model). Most teams need 4-6.
- Tag a sample. Pick the busiest LLM call site, instrument it with tags via SDK or gateway, and verify the dashboard slices on each tag.
- Audit price accuracy. Run the dashboard for a billing cycle. Reconcile against the OpenAI / Anthropic / Bedrock invoice. Acceptable variance is ±5%; anything higher means stale price tables or missing dimensions (a reconciliation sketch follows this list).
- Set up spike alerts. Pick the top 3-5 tenants, the top 2-3 features, and add per-dimension thresholds. Verify alerts fire on a synthetic spike.
- Cost-adjust the platform itself. Real cost equals platform price + storage retention + alert backend + the engineering hours to operate the tool. Self-hosted is cheaper at scale; managed wins for small teams.
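The reconciliation sketch referenced in the audit step, with sample figures and the ±5% band from the checklist:

```python
def variance(dashboard_usd: float, invoice_usd: float) -> float:
    return (dashboard_usd - invoice_usd) / invoice_usd

v = variance(dashboard_usd=4_812.40, invoice_usd=4_997.10)  # sample figures
if abs(v) > 0.05:  # outside the ±5% acceptance band
    print(f"Variance {v:+.1%}: check price-table freshness and tag coverage")
else:
    print(f"Variance {v:+.1%}: within tolerance")
```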
Sources
- Helicone pricing
- Helicone GitHub
- Langfuse pricing
- Langfuse GitHub
- Datadog pricing
- Braintrust pricing
- Phoenix GitHub
- Arize pricing
- Portkey pricing
- Portkey gateway GitHub
- FutureAGI pricing
- OpenInference conventions
Series cross-link
Read next: Best LLM Cost Tracking Tools, LLM Cost Tracking Best Practices, LLM Cost Optimization, AI Agent Cost Optimization
Frequently asked questions
How is token cost tracking different from LLM cost tracking?
What are the best tools for token cost tracking in 2026?
What dimensions matter for token cost tracking?
Are any of these token cost tracking tools open source?
How do these tools attribute cost to a specific tenant or user?
How does pricing compare across token cost tracking tools?
How accurate are these tools' price tables?
Can I integrate token cost tracking with cloud cost (AWS, GCP, Azure)?
Related reading
- Helicone, FutureAGI, Langfuse, OpenMeter, Datadog, Vantage, and Portkey compared on per-token, per-route, per-user, and per-provider cost attribution.
- Best LLMs May 2026: compare GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and DeepSeek V4 across coding, agents, multimodal, cost, and open weights.
- Best Voice AI May 2026: compare Deepgram, Cartesia, ElevenLabs, Retell, and Vapi for STT, TTS, latency budgets, and production voice agents.