Best LLM Cost Tracking Tools in 2026: 8 Platforms Compared

Future AGI, Helicone, Langfuse, OpenRouter, Portkey, LangSmith, Datadog, and CloudZero compared on per-trace, per-developer LLM cost attribution.


LLM cost tracking in 2026 is an observability problem, not a billing problem. The tool worth picking attributes every cent to a trace, an outcome, and a developer, not just a model and a token count. Token dashboards lie. They hide retries, tool-call loops, judge cost, and cache misses inside one aggregate finance sees a month late. Cost per resolved conversation tells the truth. The shortlist below ranks eight platforms (Future AGI Agent Command Center, Helicone, Langfuse, OpenRouter, Portkey, LangSmith, Datadog LLM, and CloudZero or Vantage) on whether they can answer “how much did this trace cost” without a custom pipeline.

The failure mode this guide is written against: the Cursor bill came in at twelve thousand dollars. Half was a single feature flag calling Claude Sonnet on every keystroke for two weeks. Nobody capped the key. Nobody knew which route held the spend. By the time finance asked, the trail was cold.

This shortlist is opinionated and reflects the state of each platform as of May 2026. The platforms higher in the ranking set cost on the trace, attach it to an outcome, and let you query the join in production. The ones further down do one slice well and stop.

TL;DR: best LLM cost tracking tool per use case

| Use case | Best pick | Why | Pricing | License |
| --- | --- | --- | --- | --- |
| Per-trace cost with evals, gateway routing, and 5-level budgets in one runtime | Future AGI Agent Command Center | x-agentcc-cost on every response, span-attached eval scores, hard caps at org/team/user/key/tag | Free + usage | Apache 2.0 |
| Gateway-attached cost dashboard, minutes to first chart | Helicone | Base URL change captures every call as a span with cost | Hobby free, Pro $79/mo | Apache 2.0 |
| Self-hosted cost panels next to traces and prompts | Langfuse | Mature traces, prompts, cost views; MIT core | Hobby free, Core $29/mo | MIT |
| Provider arbitration with cost shown inline | OpenRouter | One key, 100+ providers, cost per call visible at routing time | Pay-per-use + 5% margin | Closed |
| Gateway with virtual keys and cost panels | Portkey | Cost + provider failover + budgets in MIT gateway | Free + paid from $49/mo | MIT gateway |
| LangChain or LangGraph shop, end-to-end | LangSmith | Cost lives next to LangChain trace tree | Free + paid from $39/mo | Closed |
| Already paying for Datadog, want LLM in APM | Datadog LLM Observability | LLM cost correlated with infra cost | Custom; from $31/host/mo APM | Closed |
| Finance-led multi-source cost allocation | CloudZero / Vantage | Cloud + SaaS + LLM cost in one allocation view | Quote-based | Closed |

If you only read one row: Future AGI is the only platform on this list where per-trace cost lives on the same span as the eval score, the developer key, and the budget cap. The rest pick a slice and ship it cleanly. The slice you pick depends on the workload, not the marketing page.

The opinionated thesis

The bill is the lagging indicator. The trace is the leading one. A platform that can show you the line item but cannot show you the trace that produced it is a finance tool, not an engineering tool. Three patterns follow:

  • Cost belongs on the span. Same row as latency, model name, prompt version, and the eval score. Set by the gateway handler before the response body returns. Every dashboard query starts there.
  • The denominator is an outcome, not a token. Cost per resolved conversation, accepted PR, completed task. Token-shaped metrics flatter cheap models that loop more. Outcome-shaped metrics catch the regression.
  • Budgets enforce at the gateway, not in a wiki page. A 429 with a structured payload telling the caller which level blocked is the only cap that holds when the on-call engineer is asleep.

The tools below get ranked on how cleanly they support these three patterns. The further down, the more of the loop you build yourself.
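To make the second pattern concrete, here is a minimal sketch of why the denominator matters. All numbers and names (`RouteStats`, `big`, `cheap`) are hypothetical and chosen only to illustrate the arithmetic; they are not measurements from any platform.

```python
from dataclasses import dataclass


@dataclass
class RouteStats:
    """Aggregate numbers for one model on one route (all values hypothetical)."""
    total_cost_usd: float
    total_tokens: int
    resolved_conversations: int


def cost_per_million_tokens(s: RouteStats) -> float:
    # Token-shaped metric: what a token dashboard reports.
    return s.total_cost_usd * 1_000_000 / s.total_tokens


def cost_per_resolved(s: RouteStats) -> float:
    # Outcome-shaped metric: dollars per resolved conversation.
    if s.resolved_conversations == 0:
        return float("inf")
    return s.total_cost_usd / s.resolved_conversations


# The cheap model costs less per token but loops more and resolves less.
big = RouteStats(total_cost_usd=500.0, total_tokens=50_000_000, resolved_conversations=1_000)
cheap = RouteStats(total_cost_usd=300.0, total_tokens=150_000_000, resolved_conversations=500)

for name, stats in (("big", big), ("cheap", cheap)):
    print(name, cost_per_million_tokens(stats), cost_per_resolved(stats))
```

In this made-up scenario the cheap model wins on price per million tokens (2.0 vs 10.0 USD) and on total spend, yet each resolved conversation costs more (0.60 vs 0.50 USD). A token dashboard would report the swap as a win.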

What an LLM cost tracking tool actually has to do

A working cost layer answers six questions on production data, without a custom pipeline:

  1. Per provider. OpenAI vs Anthropic vs Google vs Bedrock vs Mistral. Daily spend with model-mix breakdown.
  2. Per model. gpt-4o-2024-11 vs gpt-4o-mini vs claude-3-5-sonnet. The substitution alert (“model swapped, eval score dropped 18 points”) catches a real class of regressions.
  3. Per route or feature. /chat vs /rag-search vs /agent-action vs the autocomplete flag.
  4. Per developer or team. One virtual key per developer in Cursor, Codex, Claude Code, or Cline. Caps with a warn-threshold before the block fires.
  5. Per tenant. Customer-level spend tagged at the request layer. Difference between flat-rate gross margin and per-tenant contribution margin in a B2B product.
  6. Per outcome. Cost per resolved conversation, accepted PR, booked meeting. The join that catches the cheap-model regression.

Anything less and your team rebuilds the slicing in a spreadsheet, misses the 3x spike that should have paged, and stops trusting the dashboard within a quarter.
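As a rough illustration of the six questions, the sketch below slices a handful of span records by dimension and then computes cost per outcome. The record fields (`provider`, `route`, `tenant`, `resolved`, and so on) are assumptions made for this example, not any vendor's schema.

```python
from collections import defaultdict

# Hypothetical span records; field names and values are illustrative only.
SPANS = [
    {"provider": "openai", "model": "gpt-4o-mini", "route": "/chat",
     "developer": "alice", "tenant": "acme", "cost_usd": 0.25, "resolved": True},
    {"provider": "anthropic", "model": "claude-3-5-sonnet", "route": "/agent-action",
     "developer": "bob", "tenant": "acme", "cost_usd": 1.5, "resolved": True},
    {"provider": "openai", "model": "gpt-4o-mini", "route": "/chat",
     "developer": "alice", "tenant": "globex", "cost_usd": 0.25, "resolved": False},
    {"provider": "anthropic", "model": "claude-3-5-sonnet", "route": "/rag-search",
     "developer": "bob", "tenant": "globex", "cost_usd": 1.0, "resolved": True},
]


def spend_by(spans, dimension):
    """Sum cost per value of one dimension (provider, model, route, developer, tenant)."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[dimension]] += span["cost_usd"]
    return dict(totals)


def cost_per_outcome(spans):
    """Total spend divided by the number of resolved outcomes."""
    total = sum(s["cost_usd"] for s in spans)
    resolved = sum(1 for s in spans if s["resolved"])
    return total / resolved if resolved else float("inf")


for dim in ("provider", "model", "route", "developer", "tenant"):
    print(dim, spend_by(SPANS, dim))
print("cost per resolved outcome:", cost_per_outcome(SPANS))
```

The point of the sketch is that all six answers come from the same rows. If cost, tags, and outcome live in different systems, each of these one-line aggregations turns into a join someone has to maintain.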

The 8 LLM cost tracking tools compared

1. Future AGI Agent Command Center: per-trace cost wired to evals, gateway routing, and 5-level budgets

Apache 2.0. Self-hostable as a single Go binary. Managed cloud at gateway.futureagi.com.

Future AGI ships cost tracking the way the thesis demands. The Agent Command Center is an OpenAI-compatible gateway delivered as a single Go binary. Every response carries five cost-related headers set in the gateway handler before the body returns:

x-agentcc-cost: 0.000075
x-agentcc-latency-ms: 612
x-agentcc-model-used: anthropic/claude-3-5-sonnet
x-agentcc-provider: anthropic
x-agentcc-cache: miss

The dollar number lands on the trace span the same way latency and model name do, captured by traceAI, the OpenTelemetry-native SDK for Python, TypeScript, Java, and C# with auto-instrumentation for OpenAI, LangChain, Groq, Portkey, and Gemini. Prometheus on /-/metrics surfaces agentcc_cost_total, agentcc_tokens_total, agentcc_cache_hits_total/misses_total, and agentcc_requests_total. OTLP traces export to any OTel collector. Routing config lives in YAML, not deploys.

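# OpenAI SDK drop-in. Cost arrives on the response headers.
from openai import OpenAI

client = OpenAI(
    api_key="sk-agentcc-...",
    base_url="https://gateway.futureagi.com/v1",
)
# A plain create() returns only the parsed body; with_raw_response
# also exposes the HTTP headers the gateway sets.
raw = client.chat.completions.with_raw_response.create(
    model="anthropic/claude-3-5-sonnet",
    messages=[{"role": "user", "content": "..."}],
)
response = raw.parse()                    # the usual ChatCompletion object
cost = raw.headers.get("x-agentcc-cost")  # e.g. "0.000075"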
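For teams that scrape the Prometheus endpoint directly, a small sketch of rolling `agentcc_cost_total` up by provider could look like the following. The metric name comes from the text above; the sample payload, the label values, and the assumption that the metric is a counter labelled by `provider` and `status` are hand-written for illustration and should be checked against real gateway output.

```python
import re
from collections import defaultdict

# Hand-written sample in Prometheus text exposition format (values are hypothetical).
SAMPLE = """\
# TYPE agentcc_cost_total counter
agentcc_cost_total{provider="anthropic",status="200"} 12.5
agentcc_cost_total{provider="openai",status="200"} 7.25
agentcc_cost_total{provider="openai",status="429"} 0
agentcc_requests_total{provider="openai",status="200"} 1400
"""

# Matches labelled samples only: name{label="value",...} value
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def cost_by_provider(text, metric="agentcc_cost_total"):
    """Sum one metric across all label sets, grouped by the provider label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels")))
        totals[labels.get("provider", "unknown")] += float(match.group("value"))
    return dict(totals)


print(cost_by_provider(SAMPLE))
```

This parser is deliberately narrow: it ignores unlabelled samples and optional timestamps. In practice a PromQL query such as a sum grouped by provider would do the same job inside Prometheus itself.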

The 5-level budget hierarchy makes Cursor and Codex spend predictable across a fifty-engineer team. Levels: org, team, user, key, tag. A single request inherits the lowest applicable ceiling. Org cap stops the runaway script. User cap separates the power user from the occasional one. Key cap pins CI to a hard daily ceiling that returns 429 when blown. Tag cap catches the experiment that ran a week longer than planned. Each level supports warn_threshold (default 0.8) and a hard-or-soft mode.

budgets:
  enabled: true
  default_period: monthly
  warn_threshold: 0.8
  org: { limit: 50000, hard: false }
  teams:
    platform:   { limit: 12000, hard: false }
    support-cx: { limit: 8000,  hard: true  }
  keys:
    ci-tests:   { limit: 200, hard: true, period: daily }

The other half of the pitch is the eval loop. The same trace span that carries x-agentcc-cost also carries the ai-evaluation score from the rubric. Cost per resolved conversation is a query against the trace store, not a separate billing pipeline. When the model swap drops cost 40 percent and the eval score drops 18 points on the same route, the alert fires before the rollout completes. That is the loop dashboard-only tools cannot close.
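A minimal sketch of the substitution alert described above: compare cost and eval score for a baseline and a candidate model on the same route, and flag the swap when quality drops more than an allowed margin. The function name, the input fields, and the 5-point tolerance are assumptions for illustration.

```python
def swap_regression(baseline: dict, candidate: dict, max_score_drop: float = 5.0) -> dict:
    """Compare two model configurations on one route.

    Each dict holds 'cost_per_trace' (USD) and 'eval_score' (0-100).
    """
    cost_change_pct = (
        (candidate["cost_per_trace"] - baseline["cost_per_trace"])
        / baseline["cost_per_trace"] * 100
    )
    score_drop = baseline["eval_score"] - candidate["eval_score"]
    return {
        "cost_change_pct": round(cost_change_pct, 1),
        "score_drop": round(score_drop, 1),
        # Cheaper is only a win if quality holds within tolerance.
        "alert": score_drop > max_score_drop,
    }


# Hypothetical numbers matching the scenario in the text: 40 percent cheaper, 18 points worse.
result = swap_regression(
    baseline={"cost_per_trace": 0.010, "eval_score": 86.0},
    candidate={"cost_per_trace": 0.006, "eval_score": 68.0},
)
print(result)
```

The check only works if both numbers are recorded against the same trace or route. That is the practical argument for keeping cost and eval score on one span.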

Best for: Engineering and platform teams running production agents (Cursor, Codex, Claude Code internal builds, Cline workflows, customer-support copilots) where a cost spike has to trace back to the route, model swap, or developer that caused it.

Honest tradeoff: Span-attached cost adds operational surface. Gateway, trace processor, and dashboard all have to agree on the schema. Teams already running OpenTelemetry pay the smaller version of this cost.

Pricing: Free to start (50 GB tracing, 100K gateway requests, 100K cache hits, 30-day retention); usage-based as you grow. SOC 2 Type II, HIPAA BAA, and SAML SSO are available as add-ons.

2. Helicone: proxy-based, cost-dashboard-first

Apache 2.0. Self-hostable. Hosted cloud option.

Helicone is the lowest-friction path from cold start to a per-provider cost dashboard. Change the OpenAI base URL, add a header, and every request becomes a span with cost attached. The cost view sits at the front of the product: per provider, per model, per user, cost-per-request, cache hit rate.

The tradeoff is depth. Helicone optimizes for the dashboard, not the loop. Eval surface is shallower than dedicated LLM platforms. Per-trace eval correlation exists but lacks the rubric library and CI gating Future AGI ships.

Use case: A team that wants a working cost dashboard by Friday and accepts the loop is a separate problem.

Pricing: Hobby free with 10K logs/mo. Pro $79/mo.

Worth flagging: The March 2026 Mintlify acquisition introduced roadmap risk. The platform remains usable; feature velocity has slowed. See Helicone alternatives.

3. Langfuse: self-hosted cost panels next to traces and prompts

MIT core. Self-hostable. Hosted cloud option.

Langfuse is the strongest self-hosted cost panel for LangChain-adjacent shops. Cost lives next to traces and prompts in the same UI, with per-provider, per-model, per-route, per-user, and per-tenant breakdowns. The price table covers major providers and updates on a reasonable cadence.

The reason Langfuse falls short for the per-trace loop is that cost is a dashboard slice rather than a routing input. Cost arrives after the call, as an analytic surface. Future AGI sets cost on the response header at the gateway layer, before the body returns, so routing decisions can read it inline. Langfuse forecasting is also lighter than dedicated FinOps tools; pair with Vantage or CloudZero when finance owns the conversation.

Use case: A platform team that operates the data plane, wants traces and cost in their own infrastructure, and treats the cost panel as a reporting view.

Pricing: Hobby free with 50K units/mo. Core $29/mo. Pro $199/mo.

Worth flagging: Langfuse Experiments shipped CI/CD integration in May 2026. See Langfuse alternatives.

4. OpenRouter: provider arbitration with cost shown inline

Closed platform. Pay-per-use with a 5% margin over upstream provider prices.

OpenRouter is the cheapest path when “show me the cost for this exact model on this exact prompt” is the question. One API key, 100+ providers, every model with a unit price visible in the same surface. Spend per model, per app, per key. Good enough for a small team running provider arbitration as the primary cost-control lever.

OpenRouter does not climb higher because of the loop. No eval surface, no rubric-gated substitution, no virtual-key budgets with warn-thresholds, no OTel export. Cost is shown, not stored. It does not arrive on your trace store unless you wire it yourself.

Use case: Pre-production prototypes and teams whose cost-control story is “pick the cheapest model that passes the smoke test”. Stops being enough once the same prompt runs on 41,000 traces.

Pricing: Pay-per-use. 5% margin over upstream provider price.

Worth flagging: The 5% margin is acceptable on small workloads. It becomes meaningful on six-figure monthly spend, where direct provider contracts pay back.
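The margin arithmetic is simple enough to sketch. The spend figures below are hypothetical; only the 5% rate comes from the text.

```python
def router_margin_usd(upstream_usd: float, margin_pct: float = 5.0) -> float:
    """Margin paid on top of upstream provider prices for one month."""
    return upstream_usd * margin_pct / 100


small = router_margin_usd(2_000)    # 100.0 USD per month
large = router_margin_usd(150_000)  # 7500.0 USD per month
print(small, large, large * 12)     # the large case adds up to 90000.0 USD per year
```

At the small end the margin is a rounding error next to engineering time. At the large end it is roughly the size of a negotiated volume discount, which is why the break-even question only comes up at scale.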

5. Portkey: gateway with virtual keys and cost panels

MIT gateway. Closed managed tier on top.

Portkey is the closest peer to the Agent Command Center on the gateway dimension. MIT-licensed gateway, multi-provider routing, virtual keys, fallbacks, retries, cost panels per provider and per key. The managed tier adds audit logs, prompt management, and a guardrail layer.

Portkey falls short relative to Future AGI at the loop edges. The eval surface, rubric library, simulation product, and optimizer live elsewhere. The 5-level budget hierarchy (org, team, user, key, tag) is wider than Portkey’s per-key model. The Future AGI Go runtime (29k req/s, P99 21 ms with guardrails, t3.xlarge) is a different performance envelope from Portkey’s Node-based gateway.

Use case: A team standardized on Portkey, comfortable stitching a separate eval product to the trace store.

Pricing: Free OSS gateway. Hosted from $49/mo.

Worth flagging: Eval depth is shallower than on dedicated LLM platforms. See Portkey alternatives.

6. LangSmith: end-to-end for LangChain shops

Closed platform. SaaS plus self-hosted on the higher tier.

LangSmith is the cost surface a LangChain or LangGraph shop already has. Cost lives next to the LangChain trace tree, broken down by run, by chain, by tool call. For teams using LangSmith as the trace store anyway, adding cost tracking is one toggle.

LangSmith does not climb higher because of scope. Excellent inside the LangChain world; shallower at the edges where production traffic mixes frameworks. The 5-level budget hierarchy is not there. The gateway header pattern is not there.

Use case: A team that lives in LangChain or LangGraph end-to-end and uses LangSmith as the source of truth for traces.

Pricing: Free tier. Plus from $39/mo.

Worth flagging: If half the production traffic comes from non-LangChain agents, LangSmith covers half the bill.

7. Datadog LLM Observability: only worth it when Datadog is already the standard

Closed platform. SaaS only.

Datadog LLM Observability is the right pick when Datadog is already the platform. Cost shows up inside the same dashboard as CPU, memory, p99 latency, and Redis hit rate. For a team that already pays the Datadog bill and wants one place to look during an incident, the case is straightforward.

Datadog falls short on the LLM-specific surface. Eval depth is shallower than dedicated LLM platforms. The cost panel does not natively join to a rubric score the way Future AGI or Langfuse do. Per-span ingest and per-log indexing pricing crosses into five-figure monthly contracts at LLM scale.

Use case: Engineering organizations standardized on Datadog where infra correlation matters more than open instrumentation.

Pricing: Custom. From $31/host/mo APM plus the LLM Observability add-on.

Worth flagging: Right answer for “we want one platform”. Wrong answer for “we want the deepest LLM eval loop”; those workloads end up running Future AGI or Langfuse alongside.

8. CloudZero, Vantage, and the FinOps tools

Closed platforms. SaaS only.

CloudZero and Vantage are the cost-allocation tools finance teams use to attribute AWS, GCP, and Azure spend. Both ingest LLM cost as one more line item alongside cloud and SaaS bills. Every dollar mapped to a project, team, environment, and customer in one view.

They sit at the bottom of this list because they answer a finance question, not an engineering one. No per-trace surface, no eval correlation, no per-route slicing. Pair with Future AGI, Helicone, or Langfuse for the LLM detail. FinOps tool for the rollup, per-trace tool for the diagnostic.

Use case: Engineering finance or FinOps teams with multi-cloud, multi-provider spend that needs unified attribution.

Pricing: Vantage free tier; paid quote-based. CloudZero quote-based.

Worth flagging: Never the sole tool. No eval correlation, no per-route slicing, no virtual-key budgets.

[Figure: Future AGI product showcase in four panels. Cost dashboard: daily spend over 30 days with KPI tiles (total $8,412, average $280 per day, +24% vs last month). Per-provider breakdown: OpenAI, Anthropic, Google, Mistral, Together. Per-route spend table: /chat, /rag-search, /agent-action, /summarize. Alert rules: daily spend threshold, per-user threshold, per-route spike, tokens per call, with one rule fired.]

Decision framework: pick by the constraint that actually holds

Walk down this list and stop at the first constraint that matches.

  • Per-trace cost on the same span as the eval score. Future AGI Agent Command Center.
  • Working cost dashboard by Friday; the loop is a separate problem. Helicone.
  • Self-hosted, cost next to traces and prompts. Langfuse.
  • Provider arbitration with cost shown inline. OpenRouter.
  • Gateway-first with virtual keys. Portkey for the MIT gateway, Future AGI for the 5-level budgets plus eval loop.
  • LangChain end-to-end. LangSmith.
  • Already pay for Datadog at scale. Datadog LLM Observability.
  • Finance-led cost conversation across cloud, SaaS, and LLM. CloudZero or Vantage. Pair with an LLM-native tool above.
  • All of the above on one runtime. Future AGI.

Common mistakes teams make picking a cost tracking tool

  • Trusting stale price tables. OpenAI, Anthropic, and Google updated pricing four-plus times in 2024-2025. A table older than 30 days miscalculates by 20 to 40 percent on the providers that re-tiered.
  • Skipping per-tenant attribution. A B2B product without tenant tagging cannot model contribution margin. Flat-rate gross margin obscures the customer running the entire support agent for free.
  • Tracking only the provider fee. Real cost equals provider fee plus retries plus retries-on-timeout plus speculative-decoding wasted tokens plus judge tokens. Provider fee in isolation undercounts.
  • Picking on demo dashboards. Demos use clean cost data with idealized routes. Run a domain reproduction with your real route mix for two weeks before signing.
  • Ignoring forecasting and spike alerts. Daily spend without a forecast leaves the team caught by a 3x spike before the alert fires.
  • Optimizing token price instead of cost per outcome. The cheap-model swap that drops 40 percent of token spend and 18 points of resolution rate is a regression. The denominator is the lever that catches it.
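Two of the mistakes above, stale price tables and counting only the primary call, can be sketched together. The price table entry, the model name `example-model`, and the per-call `kind` labels are placeholders invented for this example; the prices are not current list prices for any provider.

```python
from datetime import date

# Hypothetical price table: USD per 1M tokens, with the date it was last refreshed.
PRICES = {
    "example-model": {"input": 3.00, "output": 15.00, "updated": date(2026, 4, 20)},
}


def is_stale(entry: dict, today: date, max_age_days: int = 30) -> bool:
    """Flag a price entry that has not been refreshed within max_age_days."""
    return (today - entry["updated"]).days > max_age_days


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def trace_cost(model: str, calls) -> dict:
    """calls: iterable of (input_tokens, output_tokens, kind),
    where kind is 'primary', 'retry', or 'judge'."""
    breakdown = {}
    for input_tokens, output_tokens, kind in calls:
        breakdown[kind] = breakdown.get(kind, 0.0) + call_cost(model, input_tokens, output_tokens)
    breakdown["total"] = sum(breakdown.values())
    return breakdown


calls = [(1000, 200, "primary"), (1000, 200, "retry"), (1500, 100, "judge")]
print(trace_cost("example-model", calls))
print(is_stale(PRICES["example-model"], today=date(2026, 5, 30)))
```

In this made-up trace the primary call accounts for one third of the total, so a tracker that prices only the first successful call would undercount by a factor of three. The freshness check is the cheap guard to run before trusting any of the multiplication.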

Recent LLM cost tracking updates

| Date | Event | Why it matters |
| --- | --- | --- |
| May 2026 | Langfuse shipped Experiments CI/CD integration | OSS-first teams can gate experiments by cost as well as eval pass rate. |
| Mar 9, 2026 | Future AGI shipped Agent Command Center with ClickHouse trace storage | Per-trace cost moved into the gateway layer with span-attached eval correlation and 5-level budgets. |
| Mar 3, 2026 | Helicone joined Mintlify | Helicone remains usable; roadmap risk became part of vendor diligence. |
| 2024-2025 | Major providers updated pricing 4+ times | Stale price tables became a real source of cost-tracking error. The 30-day freshness rule got teeth. |

How to actually evaluate this for production

Three weeks, three steps.

  1. Run a domain reproduction. Tag your real route mix (/chat, /rag-search, /agent-action, the autocomplete flag) and compare per-route, per-provider, per-model spend across two candidate tools for two weeks. Verify cost shows up on the trace span in both.
  2. Test the alert path. Trigger a 3x query volume bump on one route and verify the platform pages on the right channel within 5 minutes. The alert that fires the next morning is the alert that didn’t fire.
  3. Cost-adjust the comparison. Real cost equals platform price plus the engineer-hours to maintain price tables, build dashboards, and run substitution experiments. A tool that ships these out of the box is often cheaper than a cheaper line item that doesn’t.
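For the alert-path test in step 2, it helps to know what rule you expect the platform to apply. A minimal sketch of a 3x spike rule against a trailing baseline follows; the window size, the `factor` default, and the sample spend values are assumptions.

```python
from statistics import mean


def is_spike(history, current, factor=3.0):
    """Return True when the latest window is at least `factor` times the trailing average.

    history: spend (or request volume) per window for the trailing baseline.
    current: spend (or request volume) in the latest window.
    """
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = mean(history)
    return baseline > 0 and current >= factor * baseline


# Hypothetical hourly spend on one route, in USD.
trailing = [10.0, 12.0, 11.0, 9.0]  # average 10.5
print(is_spike(trailing, 33.0))     # True: 33.0 >= 31.5
print(is_spike(trailing, 20.0))     # False: elevated, but under 3x
```

Running the same synthetic bump through each candidate tool and timing how long the page takes gives a like-for-like comparison that a demo dashboard cannot.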

How Future AGI closes the cost loop where the other tools leave gaps

Future AGI is the only tool on this list where a cost spike feeds back into prompts and routing on the same runtime. The other seven ship cost as a dashboard or a finance allocation view. Future AGI ships cost as the input to the next deploy.

  • Capture. The Agent Command Center gateway fronts 100+ providers, sets the five x-agentcc-* headers on every response, and benchmarks at ~29k req/s with P99 21 ms on t3.xlarge with guardrails on. Apache 2.0, self-host or cloud at gateway.futureagi.com/v1.
  • Trace. traceAI carries cost to the span store as an OpenTelemetry attribute. Python, TypeScript, Java, and C#.
  • Score. ai-evaluation runs 50+ built-in evaluators as the same templates in pytest, in CI, against the shadow set, and against live traces.
  • Enforce. Five-level budgets with warn_threshold and hard or soft modes. A request that would blow the cap returns a structured 429 naming the level that blocked.
  • Optimize. agent-opt consumes failing or expensive trajectories and ships a shorter prompt the gateway routes next time. PROTEGI, GEPA, and four more optimizers, all wired to the eval score on the trace.

Best open source and best enterprise-grade in the same product. The Apache 2.0 stack is the entire platform, not a stripped preview. The managed tier adds SOC 2 Type II, HIPAA on Scale, RBAC, AWS Marketplace billing, dedicated VPC, and SLAs without changing APIs.

Most teams comparing cost-tracking tools in 2026 end up running three or four products in production. Cost data sits in tool A. Eval scores sit in tool B. When something goes wrong, an engineer joins them by hand. Future AGI ships the join.

Ready to attribute your agent’s bill to specific traces? Point your OpenAI SDK at https://gateway.futureagi.com/v1, read x-agentcc-cost on the response, and instrument the outcome event. Start with the Agent Command Center quickstart and the traceAI integration guide.

Sources

Future AGI docs · Future AGI GitHub · Future AGI pricing · Helicone GitHub · Helicone pricing · Helicone Mintlify announcement · Langfuse pricing · OpenRouter · Portkey gateway · Portkey pricing · LangSmith pricing · Datadog pricing · Vantage pricing

Frequently asked questions

What are the best LLM cost tracking tools in 2026?
The shortlist is Future AGI Agent Command Center, Helicone, Langfuse, OpenRouter, Portkey, LangSmith, Datadog LLM, and CloudZero or Vantage. Future AGI leads on per-trace cost attribution wired to evals, gateway routing, and 5-level budgets in one runtime. Helicone is the lowest-friction gateway-attached cost dashboard. Langfuse is the strongest self-hosted cost panel for LangChain-adjacent shops. OpenRouter is the cheapest path when you want provider arbitration and a usable cost view in the same product. Portkey ties cost to gateway routing with virtual keys. LangSmith covers LangChain shops end-to-end. Datadog LLM correlates LLM cost with infra cost for teams already paying the Datadog bill. CloudZero or Vantage handle finance-led, multi-source FinOps.
What is the right unit of LLM cost tracking?
Cost per trace, cost per developer, cost per feature, and cost per resolved outcome. Cost per token tells you the bill got smaller. It tells you nothing about whether each dollar still bought a resolved conversation, an accepted pull request, or a completed task. The tool worth picking attributes every cent to a trace, a developer, a feature, and an outcome, then lets you slice on all four axes without a custom pipeline. Token dashboards lie because they hide retries, tool-call loops, judge cost, and cache misses inside a single aggregate. Cost per resolved conversation tells the truth because the denominator catches the wrong kind of optimization win.
Why is LLM cost tracking an observability problem, not a billing problem?
Because the cost number is useless without the trace context that produced it. A bill says the team spent $12,000 in March. A trace says half of that spend sits on a single feature flag calling Claude Sonnet on every keystroke for two weeks, attributed to one developer key, spread across 41,000 traces with a 23 percent cache hit rate. The bill is a finance artifact. The trace is the input to a fix. Billing tools sum line items. Observability tools attach cost to a span next to latency, model name, and eval score, and let you join cost to outcome on the same trace store. Without that attribution, every cost decision is a guess against an aggregate.
How do these tools compute per-trace cost?
Most multiply token counts by a maintained price table per model. Future AGI Agent Command Center, Helicone, Langfuse, Portkey, OpenRouter, and Datadog LLM all ship maintained price tables for OpenAI, Anthropic, Google, Bedrock, Mistral, and others. CloudZero and Vantage take usage events as input and let you supply the price table. The thing to verify is freshness. A price table older than 30 days miscalculates by 20 to 40 percent on the providers that re-tiered. Future AGI sets the dollar cost on the response header (x-agentcc-cost) before the body returns, so the number on the trace is the number the gateway charged, not a downstream reconstruction.
Should I use a gateway or an SDK for LLM cost tracking?
Use a gateway when you want zero code change and cost attached to every provider call automatically, including the providers you forgot you were calling. Use an SDK when adding a network hop is unacceptable for latency or when the LLM client is buried in a framework you don't control. Future AGI ships both: the Agent Command Center as an OpenAI-compatible gateway at gateway.futureagi.com/v1 and traceAI as the OpenTelemetry-native SDK. Most production teams run the gateway for cost capture plus the SDK for in-process spans, and cost lands on the same trace either way.
How does Future AGI's Agent Command Center expose cost per trace?
Every response carries x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used, x-agentcc-provider, and x-agentcc-cache as response headers, set in the gateway handler before the body returns. Prometheus metrics surface agentcc_cost_total, agentcc_tokens_total, agentcc_cache_hits_total and misses_total, and agentcc_requests_total, labelled by provider and status. OTLP traces export to any OTel collector, so cost shows up on the same span as the agent trajectory. Budgets are five-level (org, team, user, key, tag) with warn-threshold and hard or soft semantics. The Go runtime hits roughly 29k requests per second with P99 at 21 ms on a t3.xlarge with guardrails on.
How do I attribute LLM cost to a specific developer, feature, or tenant?
Tag every request with the dimensions you care about, then enforce caps at the gateway. Future AGI uses virtual keys plus a five-level hierarchy: org, team, user, key, and tag. A single request inherits the lowest applicable ceiling. One key per developer for Cursor, Codex, or Claude Code workflows. One key per feature flag so the prototype that ran a week longer than planned cannot drain the quarter. One tag per tenant so a B2B product can model contribution margin per customer. Helicone, Langfuse, Portkey, and OpenRouter all support tag-based attribution at varying granularity. Future AGI is the one that ties the cap to a hard 429 plus a Prometheus alert in the same loop.