Best LLM Cost Tracking Tools in 2026: 8 Platforms Compared
Future AGI, Helicone, Langfuse, OpenRouter, Portkey, LangSmith, Datadog, and CloudZero compared on per-trace, per-developer LLM cost attribution.
LLM cost tracking in 2026 is an observability problem, not a billing problem. The tool worth picking attributes every cent to a trace, an outcome, and a developer, not just a model and a token count. Token dashboards lie. They hide retries, tool-call loops, judge cost, and cache misses inside one aggregate finance sees a month late. Cost per resolved conversation tells the truth. The shortlist below ranks eight platforms (Future AGI Agent Command Center, Helicone, Langfuse, OpenRouter, Portkey, LangSmith, Datadog LLM, and CloudZero or Vantage) on whether they can answer “how much did this trace cost” without a custom pipeline.
The failure mode this guide is written against: the Cursor bill came in at twelve thousand dollars. Half was a single feature flag calling Claude Sonnet on every keystroke for two weeks. Nobody capped the key. Nobody knew which route held the spend. By the time finance asked, the trail was cold.
This shortlist is opinionated. The platforms higher in the ranking set cost on the trace, attach it to an outcome, and let you query the join in production. The ones further down do one slice well and stop. As of May 2026.
TL;DR: best LLM cost tracking tool per use case
| Use case | Best pick | Why | Pricing | License |
|---|---|---|---|---|
| Per-trace cost with evals, gateway routing, and 5-level budgets in one runtime | Future AGI Agent Command Center | x-agentcc-cost on every response, span-attached eval scores, hard caps at org/team/user/key/tag | Free + usage | Apache 2.0 |
| Gateway-attached cost dashboard, minutes to first chart | Helicone | Base URL change captures every call as a span with cost | Hobby free, Pro $79/mo | Apache 2.0 |
| Self-hosted cost panels next to traces and prompts | Langfuse | Mature traces, prompts, cost views; MIT core | Hobby free, Core $29/mo | MIT |
| Provider arbitration with cost shown inline | OpenRouter | One key, 100+ providers, cost per call visible at routing time | Pay-per-use + 5% margin | Closed |
| Gateway with virtual keys and cost panels | Portkey | Cost + provider failover + budgets in MIT gateway | Free + paid from $49/mo | MIT gateway |
| LangChain or LangGraph shop, end-to-end | LangSmith | Cost lives next to LangChain trace tree | Free + paid from $39/mo | Closed |
| Already paying for Datadog, want LLM in APM | Datadog LLM Observability | LLM cost correlated with infra cost | Custom; from $31/host/mo APM | Closed |
| Finance-led multi-source cost allocation | CloudZero / Vantage | Cloud + SaaS + LLM cost in one allocation view | Quote-based | Closed |
If you only read one row: Future AGI is the only platform on this list where per-trace cost lives on the same span as the eval score, the developer key, and the budget cap. The rest pick a slice and ship it cleanly. The slice you pick depends on the workload, not the marketing page.
The opinionated thesis
The bill is the lagging indicator. The trace is the leading one. A platform that can show you the line item but cannot show you the trace that produced it is a finance tool, not an engineering tool. Three patterns follow:
- Cost belongs on the span. Same row as latency, model name, prompt version, and the eval score. Set by the gateway handler before the response body returns. Every dashboard query starts there.
- The denominator is an outcome, not a token. Cost per resolved conversation, accepted PR, completed task. Token-shaped metrics flatter cheap models that loop more. Outcome-shaped metrics catch the regression.
- Budgets enforce at the gateway, not in a wiki page. A 429 with a structured payload telling the caller which level blocked is the only cap that holds when the on-call engineer is asleep.
The tools below get ranked on how cleanly they support these three patterns. The further down, the more of the loop you build yourself.
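The second pattern is easiest to see with numbers. The following is a minimal sketch with made-up figures; `ModelRun`, `cost_per_1k_tokens`, and `cost_per_resolved` are hypothetical names, not part of any tool on this list. It shows how a model can win on token price and still lose on cost per resolved conversation.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    """Aggregate numbers for one model on one route (illustrative values only)."""
    name: str
    total_cost_usd: float   # provider cost, including retries and tool-call loops
    total_tokens: int
    resolved: int           # conversations that reached the desired outcome


def cost_per_1k_tokens(run: ModelRun) -> float:
    return run.total_cost_usd / run.total_tokens * 1000


def cost_per_resolved(run: ModelRun) -> float:
    # No resolved conversations means the outcome cost is unbounded.
    if run.resolved == 0:
        return float("inf")
    return run.total_cost_usd / run.resolved


# Hypothetical data: the cheap model loops more and resolves fewer conversations.
big = ModelRun("big-model", total_cost_usd=300.0, total_tokens=20_000_000, resolved=800)
small = ModelRun("small-model", total_cost_usd=200.0, total_tokens=40_000_000, resolved=400)

for run in (big, small):
    print(run.name, cost_per_1k_tokens(run), cost_per_resolved(run))
```

With these assumed figures the small model is cheaper per thousand tokens and more expensive per resolved conversation, which is the regression a token dashboard hides.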
What an LLM cost tracking tool actually has to do
A working cost layer answers six questions on production data, without a custom pipeline:
- Per provider. OpenAI vs Anthropic vs Google vs Bedrock vs Mistral. Daily spend with model-mix breakdown.
- Per model. gpt-4o-2024-11 vs gpt-4o-mini vs claude-3-5-sonnet. The substitution alert (“model swapped, eval score dropped 18 points”) catches a real class of regressions.
- Per route or feature. /chat vs /rag-search vs /agent-action vs the autocomplete flag.
- Per developer or team. One virtual key per developer in Cursor, Codex, Claude Code, or Cline. Caps with a warn-threshold before the block fires.
- Per tenant. Customer-level spend tagged at the request layer. Difference between flat-rate gross margin and per-tenant contribution margin in a B2B product.
- Per outcome. Cost per resolved conversation, accepted PR, booked meeting. The join that catches the cheap-model regression.
Anything less and your team rebuilds the slicing in a spreadsheet, loses fidelity to the 3x spike that should have paged, and stops trusting the dashboard within a quarter.
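The six questions reduce to one operation: group span-level cost by a dimension, then divide by an outcome count. Here is a rough sketch, assuming each span is exported as a flat record. The field names and the sample rows are illustrative and do not reflect any vendor's schema.

```python
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical span records; every field name here is an assumption.
SPANS = [
    {"provider": "openai", "model": "gpt-4o-mini", "route": "/chat",
     "developer": "alice", "tenant": "acme", "resolved": True, "cost_usd": 0.002},
    {"provider": "anthropic", "model": "claude-3-5-sonnet", "route": "/agent-action",
     "developer": "bob", "tenant": "acme", "resolved": True, "cost_usd": 0.010},
    {"provider": "openai", "model": "gpt-4o-mini", "route": "/rag-search",
     "developer": "alice", "tenant": "globex", "resolved": False, "cost_usd": 0.004},
    {"provider": "anthropic", "model": "claude-3-5-sonnet", "route": "/agent-action",
     "developer": "bob", "tenant": "globex", "resolved": True, "cost_usd": 0.020},
]


def spend_by(spans: Iterable[dict], dimension: str) -> Dict[str, float]:
    """Sum cost per value of one dimension (provider, model, route, ...)."""
    totals: Dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span[dimension]] += span["cost_usd"]
    return dict(totals)


def cost_per_outcome(spans: Iterable[dict]) -> float:
    """Total spend divided by the number of resolved spans."""
    spans = list(spans)
    resolved = sum(1 for s in spans if s["resolved"])
    total = sum(s["cost_usd"] for s in spans)
    return total / resolved if resolved else float("inf")


for dim in ("provider", "model", "route", "developer", "tenant"):
    print(dim, spend_by(SPANS, dim))
print("cost per resolved:", cost_per_outcome(SPANS))
```

If a candidate tool cannot produce each of these slices from its own store, this is the code your team ends up maintaining.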
The 8 LLM cost tracking tools compared
1. Future AGI Agent Command Center: per-trace cost wired to evals, gateway routing, and 5-level budgets
Apache 2.0. Self-hostable as a single Go binary. Managed cloud at gateway.futureagi.com.
Future AGI ships cost tracking the way the thesis demands. The Agent Command Center is an OpenAI-compatible gateway delivered as a single Go binary. Every response carries five cost-related headers set in the gateway handler before the body returns:
```text
x-agentcc-cost: 0.000075
x-agentcc-latency-ms: 612
x-agentcc-model-used: anthropic/claude-3-5-sonnet
x-agentcc-provider: anthropic
x-agentcc-cache: miss
```
The dollar number lands on the trace span the same way latency and model name do, captured by traceAI, the OpenTelemetry-native SDK for Python, TypeScript, Java, and C# with auto-instrumentation for OpenAI, LangChain, Groq, Portkey, and Gemini. Prometheus on /-/metrics surfaces agentcc_cost_total, agentcc_tokens_total, agentcc_cache_hits_total/misses_total, and agentcc_requests_total. OTLP traces export to any OTel collector. Routing config lives in YAML, not deploys.
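```python
# OpenAI SDK drop-in. Cost arrives on the response headers.
from openai import OpenAI

client = OpenAI(
    api_key="sk-agentcc-...",
    base_url="https://gateway.futureagi.com/v1",
)

# with_raw_response exposes the HTTP headers; a plain create() call returns
# only the parsed completion object, which has no .headers attribute.
raw = client.chat.completions.with_raw_response.create(
    model="anthropic/claude-3-5-sonnet",
    messages=[{"role": "user", "content": "..."}],
)
response = raw.parse()  # the usual ChatCompletion
cost = raw.headers.get("x-agentcc-cost")  # e.g. "0.000075"
```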
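Once the headers are in hand, they still have to become typed values before they are useful on a span. Below is a minimal sketch of that step, using only the five header names listed above. The `GatewayCost` type, the span attribute keys, and the assumption that a cache hit is reported as `hit` are illustrative and not confirmed by the source.

```python
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass(frozen=True)
class GatewayCost:
    cost_usd: float
    latency_ms: int
    model_used: str
    provider: str
    cache_hit: bool


def parse_agentcc_headers(headers: Mapping[str, str]) -> GatewayCost:
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    return GatewayCost(
        cost_usd=float(h["x-agentcc-cost"]),
        latency_ms=int(h["x-agentcc-latency-ms"]),
        model_used=h["x-agentcc-model-used"],
        provider=h["x-agentcc-provider"],
        # Assumption: the header value is "hit" on a cache hit.
        cache_hit=h["x-agentcc-cache"].strip().lower() == "hit",
    )


def as_span_attributes(cost: GatewayCost) -> Dict[str, object]:
    # Hypothetical attribute keys; use whatever your trace schema defines.
    return {
        "llm.cost_usd": cost.cost_usd,
        "llm.latency_ms": cost.latency_ms,
        "llm.model": cost.model_used,
        "llm.provider": cost.provider,
        "llm.cache_hit": cost.cache_hit,
    }


sample = {
    "x-agentcc-cost": "0.000075",
    "x-agentcc-latency-ms": "612",
    "x-agentcc-model-used": "anthropic/claude-3-5-sonnet",
    "x-agentcc-provider": "anthropic",
    "x-agentcc-cache": "miss",
}
parsed = parse_agentcc_headers(sample)
print(as_span_attributes(parsed))
```

The returned dictionary can be passed to whichever tracing client you use; the parsing itself needs nothing beyond the standard library.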
The 5-level budget hierarchy makes Cursor and Codex spend predictable across a fifty-engineer team. Levels: org, team, user, key, tag. A single request inherits the lowest applicable ceiling. Org cap stops the runaway script. User cap separates the power user from the occasional one. Key cap pins CI to a hard daily ceiling that returns 429 when blown. Tag cap catches the experiment that ran a week longer than planned. Each level supports warn_threshold (default 0.8) and a hard-or-soft mode.
```yaml
budgets:
  enabled: true
  default_period: monthly
  warn_threshold: 0.8
  org: { limit: 50000, hard: false }
  teams:
    platform: { limit: 12000, hard: false }
    support-cx: { limit: 8000, hard: true }
  keys:
    ci-tests: { limit: 200, hard: true, period: daily }
```
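To make the enforcement semantics concrete, here is a simplified sketch of how a multi-level check could behave. It is not the gateway's implementation. The `Budget` type, the `check_budgets` function, and the 429 payload shape are all assumptions for illustration. Every applicable level is checked, a hard cap that would be exceeded blocks the request, and a level past its warn threshold only adds a warning.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Budget:
    level: str                 # "org", "team", "user", "key", or "tag"
    name: str
    limit: float               # ceiling in USD for the current period
    spent: float               # USD already spent in the current period
    hard: bool                 # hard caps block; soft caps only warn
    warn_threshold: float = 0.8


def check_budgets(budgets: List[Budget], request_cost: float) -> Tuple[int, dict]:
    """Return (HTTP status, payload) for a request expected to cost request_cost."""
    warnings = []
    for b in budgets:
        projected = b.spent + request_cost
        if projected > b.limit and b.hard:
            # Hypothetical payload shape: name the level that blocked.
            return 429, {
                "error": "budget_exceeded",
                "level": b.level,
                "name": b.name,
                "limit": b.limit,
                "spent": b.spent,
            }
        if projected >= b.limit * b.warn_threshold:
            warnings.append({"level": b.level, "name": b.name})
    return 200, {"warnings": warnings}


budgets = [
    Budget("org", "acme", limit=50000, spent=10000, hard=False),
    Budget("team", "support-cx", limit=8000, spent=7999.5, hard=True),
    Budget("key", "ci-tests", limit=200, spent=150, hard=True),
]
status, payload = check_budgets(budgets, request_cost=1.0)
print(status, payload)
```

In this example the team cap blocks first, so the caller learns from the payload that the team ceiling, not the org or key ceiling, is the one to raise or respect.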
The other half of the pitch is the eval loop. The same trace span that carries x-agentcc-cost also carries the ai-evaluation score from the rubric. Cost per resolved conversation is a query against the trace store, not a separate billing pipeline. When the model swap drops cost 40 percent and the eval score drops 18 points on the same route, the alert fires before the rollout completes. That is the loop dashboard-only tools cannot close.
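The rollout gate described above can be sketched in a few lines. This is an illustration under stated assumptions, not the product's alerting logic. `RouteStats`, the 0-100 score scale, and the five-point tolerance are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RouteStats:
    avg_cost_usd: float      # mean cost per trace on the route
    avg_eval_score: float    # mean eval score, assumed to be on a 0-100 scale


def cost_change_pct(before: RouteStats, after: RouteStats) -> float:
    return (after.avg_cost_usd - before.avg_cost_usd) / before.avg_cost_usd * 100


def should_block_rollout(before: RouteStats, after: RouteStats,
                         max_score_drop: float = 5.0) -> bool:
    """Block when quality falls more than the tolerance, whatever the savings."""
    return (before.avg_eval_score - after.avg_eval_score) > max_score_drop


# The scenario from the text: cost down 40 percent, eval score down 18 points.
before = RouteStats(avg_cost_usd=0.010, avg_eval_score=90.0)
after = RouteStats(avg_cost_usd=0.006, avg_eval_score=72.0)

print(cost_change_pct(before, after), should_block_rollout(before, after))
```

The check only works when both numbers come from the same route and the same time window, which is why the join has to live in the trace store.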
Best for: Engineering and platform teams running production agents (Cursor, Codex, Claude Code internal builds, Cline workflows, customer-support copilots) where a cost spike has to trace back to the route, model swap, or developer that caused it.
Honest tradeoff: Span-attached cost adds operational surface. Gateway, trace processor, and dashboard all have to agree on the schema. Teams already running OpenTelemetry pay the smaller version of this cost.
Pricing: Free to start (50 GB tracing, 100K gateway requests, 100K cache hits, 30-day retention); usage-based as you grow. SOC 2 Type II, HIPAA BAA, and SAML SSO are add-ons.
2. Helicone: proxy-based, cost-dashboard-first
Apache 2.0. Self-hostable. Hosted cloud option.
Helicone is the lowest-friction path from cold start to a per-provider cost dashboard. Change the OpenAI base URL, add a header, and every request becomes a span with cost attached. The cost view sits at the front of the product: per provider, per model, per user, cost-per-request, cache hit rate.
The tradeoff is depth. Helicone optimizes for the dashboard, not the loop. Eval surface is shallower than dedicated LLM platforms. Per-trace eval correlation exists but lacks the rubric library and CI gating Future AGI ships.
Use case: A team that wants a working cost dashboard by Friday and accepts the loop is a separate problem.
Pricing: Hobby free with 10K logs/mo. Pro $79/mo.
Worth flagging: The March 2026 Mintlify acquisition introduced roadmap risk. The platform remains usable; feature velocity has slowed. See Helicone alternatives.
3. Langfuse: self-hosted cost panels next to traces and prompts
MIT core. Self-hostable. Hosted cloud option.
Langfuse is the strongest self-hosted cost panel for LangChain-adjacent shops. Cost lives next to traces and prompts in the same UI, with per-provider, per-model, per-route, per-user, and per-tenant breakdowns. The price table covers major providers and updates on a reasonable cadence.
The reason Langfuse falls short for the per-trace loop is that cost is a dashboard slice rather than a routing input. Cost arrives after the call, as an analytic surface. Future AGI sets cost on the response header at the gateway layer, before the body returns, so routing decisions can read it inline. Langfuse forecasting is also lighter than dedicated FinOps tools; pair with Vantage or CloudZero when finance owns the conversation.
Use case: A platform team that operates the data plane, wants traces and cost in their own infrastructure, and treats the cost panel as a reporting view.
Pricing: Hobby free with 50K units/mo. Core $29/mo. Pro $199/mo.
Worth flagging: Langfuse Experiments shipped CI/CD integration in May 2026. See Langfuse alternatives.
4. OpenRouter: provider arbitration with cost shown inline
Closed platform. Pay-per-use with a 5% margin over upstream provider prices.
OpenRouter is the cheapest path when “show me the cost for this exact model on this exact prompt” is the question. One API key, 100+ providers, every model with a unit price visible in the same surface. Spend per model, per app, per key. Good enough for a small team running provider arbitration as the primary cost-control lever.
OpenRouter does not climb higher because of the loop. No eval surface, no rubric-gated substitution, no virtual-key budgets with warn-thresholds, no OTel export. Cost is shown, not stored. It does not arrive on your trace store unless you wire it yourself.
Use case: Pre-production prototypes and teams whose cost-control story is “pick the cheapest model that passes the smoke test”. Stops being enough once the same prompt runs on 41,000 traces.
Pricing: Pay-per-use. 5% margin over upstream provider price.
Worth flagging: The 5% margin is acceptable on small workloads. It becomes meaningful on six-figure monthly spend, where direct provider contracts pay back.
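The arithmetic behind that flag is short. Here is a small sketch, assuming the margin applies as a flat 5% on top of upstream spend as stated above; the spend levels are illustrative.

```python
def routing_margin_usd(upstream_spend_usd: float, margin_rate: float = 0.05) -> float:
    """Monthly margin paid on top of upstream provider prices."""
    return upstream_spend_usd * margin_rate


# Illustrative monthly upstream spend levels in USD.
for spend in (500, 20_000, 150_000):
    print(f"${spend:,} upstream -> ${routing_margin_usd(spend):,.2f} margin per month")
```

At a few hundred dollars a month the margin is a rounding error; at six figures it is a line item worth comparing against a direct contract.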
5. Portkey: gateway with virtual keys and cost panels
MIT gateway. Closed managed tier on top.
Portkey is the closest peer to the Agent Command Center on the gateway dimension. MIT-licensed gateway, multi-provider routing, virtual keys, fallbacks, retries, cost panels per provider and per key. The managed tier adds audit logs, prompt management, and a guardrail layer.
Portkey falls short relative to Future AGI at the loop edges. The eval surface, rubric library, simulation product, and optimizer live elsewhere. The 5-level budget hierarchy (org, team, user, key, tag) is wider than Portkey’s per-key model. The Future AGI Go runtime (29k req/s, P99 21 ms with guardrails, t3.xlarge) is a different performance envelope from Portkey’s Node-based gateway.
Use case: A team standardized on Portkey, comfortable stitching a separate eval product to the trace store.
Pricing: Free OSS gateway. Hosted from $49/mo.
Worth flagging: Eval depth is shallower than on dedicated LLM platforms. See Portkey alternatives.
6. LangSmith: end-to-end for LangChain shops
Closed platform. SaaS plus self-hosted on the higher tier.
LangSmith is the cost surface a LangChain or LangGraph shop already has. Cost lives next to the LangChain trace tree, broken down by run, by chain, by tool call. For teams using LangSmith as the trace store anyway, adding cost tracking is one toggle.
LangSmith does not climb higher because of scope. Excellent inside the LangChain world; shallower at the edges where production traffic mixes frameworks. The 5-level budget hierarchy is not there. The gateway header pattern is not there.
Use case: A team that lives in LangChain or LangGraph end-to-end and uses LangSmith as the source of truth for traces.
Pricing: Free tier. Plus from $39/mo.
Worth flagging: If half the production traffic comes from non-LangChain agents, LangSmith covers half the bill.
7. Datadog LLM Observability: only worth it when Datadog is already the standard
Closed platform. SaaS only.
Datadog LLM Observability is the right pick when Datadog is already the platform. Cost shows up inside the same dashboard as CPU, memory, p99 latency, and Redis hit rate. For a team that already pays the Datadog bill and wants one place to look during an incident, the case is straightforward.
Datadog falls short on the LLM-specific surface. Eval depth is shallower than on dedicated LLM platforms. The cost panel does not natively join to a rubric score the way Future AGI or Langfuse does. Per-span ingest and per-log indexing pricing can cross into five-figure monthly contracts at LLM scale.
Use case: Engineering organizations standardized on Datadog where infra correlation matters more than open instrumentation.
Pricing: Custom. From $31/host/mo APM plus the LLM Observability add-on.
Worth flagging: Right answer for “we want one platform”. Wrong answer for “we want the deepest LLM eval loop”; those workloads end up running Future AGI or Langfuse alongside.
8. CloudZero, Vantage, and the FinOps tools
Closed platforms. SaaS only.
CloudZero and Vantage are the cost-allocation tools finance teams use to attribute AWS, GCP, and Azure spend. Both ingest LLM cost as one more line item alongside cloud and SaaS bills. Every dollar mapped to a project, team, environment, and customer in one view.
They sit at the bottom of this list because they answer a finance question, not an engineering one. No per-trace surface, no eval correlation, no per-route slicing. Pair with Future AGI, Helicone, or Langfuse for the LLM detail. FinOps tool for the rollup, per-trace tool for the diagnostic.
Use case: Engineering finance or FinOps teams with multi-cloud, multi-provider spend that needs unified attribution.
Pricing: Vantage free tier; paid quote-based. CloudZero quote-based.
Worth flagging: Never the sole tool. No eval correlation, no per-route slicing, no virtual-key budgets.
Decision framework: pick by the constraint that actually holds
Walk down this list and stop at the first constraint that matches.
- Per-trace cost on the same span as the eval score. Future AGI Agent Command Center.
- Working cost dashboard by Friday; the loop is a separate problem. Helicone.
- Self-hosted, cost next to traces and prompts. Langfuse.
- Provider arbitration with cost shown inline. OpenRouter.
- Gateway-first with virtual keys. Portkey for the MIT gateway, Future AGI for the 5-level budgets plus eval loop.
- LangChain end-to-end. LangSmith.
- Already pay for Datadog at scale. Datadog LLM Observability.
- Finance-led cost conversation across cloud, SaaS, and LLM. CloudZero or Vantage. Pair with an LLM-native tool above.
- All of the above on one runtime. Future AGI.
Common mistakes teams make picking a cost tracking tool
- Trusting stale price tables. OpenAI, Anthropic, and Google updated pricing four-plus times in 2024-2025. A table older than 30 days can miscalculate by 20 to 40 percent on the providers that re-tiered.
- Skipping per-tenant attribution. A B2B product without tenant tagging cannot model contribution margin. Flat-rate gross margin obscures the customer running the entire support agent for free.
- Tracking only the provider fee. Real cost equals the provider fee plus retries (including retries on timeout), wasted speculative-decoding tokens, and judge tokens. The provider fee in isolation undercounts.
- Picking on demo dashboards. Demos use clean cost data with idealized routes. Run a domain reproduction with your real route mix for two weeks before signing.
- Ignoring forecasting and spike alerts. Daily spend without a forecast leaves the team caught by a 3x spike before the alert fires.
- Optimizing token price instead of cost per outcome. The cheap-model swap that drops 40 percent of token spend and 18 points of resolution rate is a regression. The denominator is the lever that catches it.
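The undercounting mistake in the list above is easy to quantify. This is a minimal sketch with made-up per-call figures; the `CallCost` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CallCost:
    first_attempt_usd: float          # what a naive provider-fee view records
    retry_usd: float = 0.0            # retries, including retries on timeout
    wasted_tokens_usd: float = 0.0    # e.g. discarded speculative-decoding output
    judge_usd: float = 0.0            # LLM-as-judge evaluation calls

    def total(self) -> float:
        return (self.first_attempt_usd + self.retry_usd
                + self.wasted_tokens_usd + self.judge_usd)

    def undercount_ratio(self) -> float:
        """Share of real cost that the first-attempt fee alone misses."""
        total = self.total()
        return 0.0 if total == 0 else 1 - self.first_attempt_usd / total


# Illustrative numbers only.
call = CallCost(first_attempt_usd=0.010, retry_usd=0.004,
                wasted_tokens_usd=0.001, judge_usd=0.005)
print(call.total(), call.undercount_ratio())
```

With these assumed figures the first-attempt fee captures only half of what the call actually cost.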
Recent LLM cost tracking updates
| Date | Event | Why it matters |
|---|---|---|
| May 2026 | Langfuse shipped Experiments CI/CD integration | OSS-first teams can gate experiments by cost as well as eval pass rate. |
| Mar 9, 2026 | Future AGI shipped Agent Command Center with ClickHouse trace storage | Per-trace cost moved into the gateway layer with span-attached eval correlation and 5-level budgets. |
| Mar 3, 2026 | Helicone joined Mintlify | Helicone remains usable; roadmap risk became part of vendor diligence. |
| 2024-2025 | Major providers updated pricing 4+ times | Stale price tables became a real source of cost-tracking error. The 30-day freshness rule got teeth. |
How to actually evaluate this for production
Three weeks, three steps.
- Run a domain reproduction. Tag your real route mix (/chat, /rag-search, /agent-action, the autocomplete flag) and compare per-route, per-provider, per-model spend across two candidate tools for two weeks. Verify cost shows up on the trace span in both.
- Test the alert path. Trigger a 3x query volume bump on one route and verify the platform pages on the right channel within 5 minutes. The alert that fires the next morning is the alert that didn’t fire.
- Cost-adjust the comparison. Real cost equals platform price plus the engineer-hours to maintain price tables, build dashboards, and run substitution experiments. A tool that ships these out of the box is often cheaper than a cheaper line item that doesn’t.
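When testing the alert path, it helps to have your own reference check to compare the platform's behavior against. Here is a rough sketch, assuming spend is bucketed into fixed windows (for example, five minutes per route). The `is_spike` function and the trailing-mean baseline are illustrative choices, not any vendor's algorithm.

```python
from statistics import mean
from typing import Sequence


def is_spike(history: Sequence[float], current: float, factor: float = 3.0) -> bool:
    """True when the current window's spend is at least factor x the trailing mean."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    return baseline > 0 and current >= factor * baseline


# Illustrative spend per window in USD for one route.
trailing = [1.0, 1.2, 0.8, 1.0]
print(is_spike(trailing, 3.5))  # a window well above three times the baseline
print(is_spike(trailing, 2.0))  # elevated, but under the threshold
```

A trailing mean is the simplest baseline; routes with strong daily seasonality would need a same-hour comparison instead.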
How Future AGI closes the cost loop where the other tools leave gaps
Future AGI is the only tool on this list where a cost spike feeds back into prompts and routing on the same runtime. The other seven ship cost as a dashboard or a finance allocation view. Future AGI ships cost as the input to the next deploy.
- Capture. The Agent Command Center gateway fronts 100+ providers, sets the five x-agentcc-* headers on every response, and benchmarks at ~29k req/s with P99 21 ms on t3.xlarge with guardrails on. Apache 2.0, self-host or cloud at gateway.futureagi.com/v1.
- Trace. traceAI carries cost to the span store as an OpenTelemetry attribute. Python, TypeScript, Java, and C#.
- Score. ai-evaluation runs 50+ built-in evaluators as the same templates in pytest, in CI, against the shadow set, and against live traces.
- Enforce. Five-level budgets with warn_threshold and hard or soft modes. A request that would blow the cap returns a structured 429 naming the level that blocked.
- Optimize. agent-opt consumes failing or expensive trajectories and ships a shorter prompt the gateway routes next time. PROTEGI, GEPA, and four more optimizers, all wired to the eval score on the trace.
Best open source and best enterprise-grade in the same product. The Apache 2.0 stack is the entire platform, not a stripped preview. The managed tier adds SOC 2 Type II, HIPAA on Scale, RBAC, AWS Marketplace billing, dedicated VPC, and SLAs without changing APIs.
Most teams comparing cost-tracking tools in 2026 end up running three or four products in production. Cost data sits in tool A. Eval scores sit in tool B. When something goes wrong, an engineer joins them by hand. Future AGI ships the join.
Ready to attribute your agent’s bill to specific traces? Point your OpenAI SDK at https://gateway.futureagi.com/v1, read x-agentcc-cost on the response, and instrument the outcome event. Start with the Agent Command Center quickstart and the traceAI integration guide.
Sources
Future AGI docs · Future AGI GitHub · Future AGI pricing · Helicone GitHub · Helicone pricing · Helicone Mintlify announcement · Langfuse pricing · OpenRouter · Portkey gateway · Portkey pricing · LangSmith pricing · Datadog pricing · Vantage pricing
Related reading
- AI Agent Cost Optimization and Observability
- Best LLM Monitoring Tools
- Best LLM Gateways
- Best AI Agent Observability Tools
- LLM Cost Optimization: How Product-Engineering Collaboration Can Reduce AI Infrastructure Spend by 30%
- Best AI Gateway for Claude Code Cost Management
- Best AI Gateways for Codex CLI Token Spend
Frequently asked questions
What are the best LLM cost tracking tools in 2026?
What is the right unit of LLM cost tracking?
Why is LLM cost tracking an observability problem, not a billing problem?
How do these tools compute per-trace cost?
Should I use a gateway or an SDK for LLM cost tracking?
How does Future AGI's Agent Command Center expose cost per trace?
How do I attribute LLM cost to a specific developer, feature, or tenant?