LLM Pricing and Cost Comparison Guide for 2026

Sticker per-token price is the wrong unit. Use effective cost per call: sticker rates adjusted for cache hit rate, prompt-token ratio, and thinking tokens. The 2026 methodology, five discount lanes, and per-provider knobs.


A FinOps lead I spoke with last month walked in with a pricing spreadsheet that ranked GPT-5.1 as 4x more expensive than Claude Sonnet 4.5 on their workload. The audited bill, pulled from gateway traces, showed the two models within 8 percent of each other. The spreadsheet counted headline input and output rates and nothing else. It missed the 75 percent cached-input discount on the OpenAI side, the prompt-cache write fee on the Anthropic side, the batch share running every night, and the reasoning-token bill on the queries that hit extended thinking. Comparing 2026 LLM prices on sticker tokens alone is the FinOps equivalent of comparing flights by base fare. The unit that matters is effective cost on your real traffic, and this guide is the methodology for finding it.

Sticker per-token price is the wrong unit

The number that lands at the top of every provider pricing page is the price per million input tokens and per million output tokens on synchronous, non-cached, on-demand calls. It is the worst predictor of your actual bill in 2026 because most production workloads do not look like that call profile.

The unit you want is effective cost per call. The shape of the formula:

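effective_cost_per_call
  = input_tokens x (1 - cache_hit_rate) x input_rate
  + input_tokens x cache_hit_rate x input_rate x cache_discount
  + output_tokens x output_rate
  + thinking_tokens x output_rate

Here cache_discount is the fraction of the input rate billed on cached reads (roughly 0.10-0.25), so uncached input pays the full rate and only the cached share is discounted.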

Three variables pull the effective number away from sticker. The cache hit rate, which can move the input line by up to 90 percent. The output-to-input ratio, which determines how much the output rate (usually 3-5x input) matters. The thinking-token share, which applies only to reasoning models but dominates their bill when it does.

A worked example. Two candidates for a coding agent. Model A lists at $0.20 input and $0.80 output per million tokens. Model B lists at $3 input and $15 output. On a sticker comparison, Model A looks at least 15x cheaper. On the real workload (8K-token system prompt cached at 70 percent hit rate, 400-token user messages, 600-token outputs, 1-hour session TTL), Model A has no cache support and pays the full rate every call. Model B reads the cached system prompt at 10 percent of sticker. Run the numbers and the per-call gap narrows to roughly 9x (about $0.019 versus $0.002), and it narrows further as the hit rate climbs. Add Model A’s higher retry rate from weaker instruction following, then divide by the success rate, and the gap shrinks again. Add a 50 percent batch share on overnight backfills and the workload-shaped winner is rarely the sticker winner.
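The arithmetic in the worked example can be checked with a short sketch. It assumes the hit rate applies only to the system-prompt prefix and ignores cache write fees, retries, and batch share. The function and variable names are illustrative, and rates are in dollars per million tokens.

```python
def effective_cost_per_call(
    prefix_tokens: int,
    fresh_tokens: int,
    output_tokens: int,
    input_rate: float,
    output_rate: float,
    cache_hit_rate: float = 0.0,
    cache_discount: float = 1.0,
    thinking_tokens: int = 0,
) -> float:
    """Dollar cost of one call. Rates are USD per million tokens.

    cache_discount is the fraction of the input rate billed on cached reads
    (0.10 means cached tokens cost 10 percent of sticker).
    """
    cached = prefix_tokens * cache_hit_rate
    uncached = (prefix_tokens - cached) + fresh_tokens
    total = (
        uncached * input_rate
        + cached * input_rate * cache_discount
        + (output_tokens + thinking_tokens) * output_rate
    )
    return total / 1_000_000


# Token counts from the worked example: 8K prefix, 400-token message, 600-token output.
workload = {"prefix_tokens": 8_000, "fresh_tokens": 400, "output_tokens": 600}

model_a = effective_cost_per_call(**workload, input_rate=0.20, output_rate=0.80)
model_b_sticker = effective_cost_per_call(**workload, input_rate=3.00, output_rate=15.00)
model_b_cached = effective_cost_per_call(
    **workload,
    input_rate=3.00,
    output_rate=15.00,
    cache_hit_rate=0.70,
    cache_discount=0.10,
)

print(f"Model A:            ${model_a:.5f} per call")
print(f"Model B, no cache:  ${model_b_sticker:.5f} ({model_b_sticker / model_a:.1f}x A)")
print(f"Model B, 70% cache: ${model_b_cached:.5f} ({model_b_cached / model_a:.1f}x A)")
```

Under these assumptions the cache alone narrows the per-call gap from roughly 16x to roughly 9x. Retries, batch share, and evaluator pass rates are what a replay adds on top.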

Premium models do not always win. The point is that sticker rankings get the answer wrong on most production workloads, and the only way to find the real winner is to replay your traffic and measure.

The five discount lanes

Five mechanisms move the effective rate away from sticker. Score each candidate model on the lanes that apply to your workload, not on the headline number.

Lane one: prompt cache. Anthropic prompt caching, OpenAI cached input, and Gemini context caching keep a stable prefix warm in the inference cache for a short TTL. Calls within the TTL pay 10-25 percent of headline input on the cached portion. Anthropic ships two TTLs (5 min and 1 hour) with a small write fee on first use. OpenAI cached input kicks in automatically above the 1024-token prefix threshold. Gemini context caching is opt-in with a per-hour storage fee plus discounted reads. For long-prefix workloads (chat, retrieval agents, system-prompt-heavy routes) this lane usually moves the bill more than any other knob.

Lane two: batch APIs. OpenAI Batch, Anthropic Message Batches, and Google Vertex Batch each price input and output at roughly 50 percent of sync rates with a 24-hour SLA. Anything that can wait overnight (offline scoring, dataset generation, eval backfills) belongs on the batch path. The question is how much of your traffic is sync-required versus async-tolerant.

Lane three: prepaid commits. Enterprise contracts and committed-use discounts knock 10-40 percent off list for sustained spend. OpenAI scale tier, Anthropic enterprise commits, Google Vertex committed-use, AWS Bedrock private offers, and Azure enterprise agreements all live here. Commits are negotiated, not published. Two anti-patterns: locking the commit at a single model (workload mix shifts faster than the contract) and committing without a routing layer that can shift traffic when a cheaper model ships.

Lane four: prompt-token ratio. The hidden discount. Output rates run 3-5x input rates across every frontier provider in 2026, so a 90-percent-input workload (RAG, classification, short-output summarization) is dominated by the input rate, while a 50-50 workload (code generation, long drafts) is dominated by the output rate. Two models with similar headline input rates can have very different effective costs depending on their output rates and on which side your workload sits.
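To make the ratio effect concrete, the sketch below computes a blended rate per million tokens for two hypothetical models at a 90-percent-input mix and a 50-50 mix. The rates and model names are illustrative assumptions, not provider prices.

```python
def blended_rate(input_share: float, input_rate: float, output_rate: float) -> float:
    """Average USD per million tokens for a given input/output token mix."""
    return input_share * input_rate + (1.0 - input_share) * output_rate


# Hypothetical models: X has the cheaper input rate, Y has the cheaper output rate.
model_x = {"input_rate": 1.00, "output_rate": 8.00}
model_y = {"input_rate": 2.00, "output_rate": 5.00}

winners = {}
for share in (0.90, 0.50):
    x = blended_rate(share, **model_x)
    y = blended_rate(share, **model_y)
    winners[share] = "X" if x < y else "Y"
    print(f"input share {share:.0%}: X=${x:.2f}/M  Y=${y:.2f}/M  cheaper={winners[share]}")
```

With these made-up rates, X is cheaper on the input-heavy mix and Y is cheaper on the 50-50 mix, which is why the same sticker table can rank models differently per route.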

Lane five: provisioned throughput. Azure PTU, AWS Bedrock Provisioned Throughput, and Google Vertex Provisioned Throughput price a fixed capacity reservation by the hour. Break-even sits around 60-80 percent sustained utilization. Above that, PTU saves 20-40 percent against on-demand and adds latency predictability. Below that, on-demand wins. Right candidates: steady high-volume workloads with predictable load. Wrong candidates: spiky tools where unused capacity wastes the discount.
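The break-even is simple utilization arithmetic. The sketch below uses a hypothetical reservation ($70 per hour for 20M tokens per hour of capacity) against a hypothetical $5 per million blended on-demand rate. None of these figures come from a provider price list.

```python
def break_even_utilization(
    reserved_cost_per_hour: float,
    capacity_tokens_per_hour: float,
    on_demand_rate: float,
) -> float:
    """Fraction of reserved capacity you must use before the reservation beats on-demand.

    on_demand_rate is a blended USD-per-million-token rate for the same traffic.
    """
    full_use_cost = capacity_tokens_per_hour / 1_000_000 * on_demand_rate
    return reserved_cost_per_hour / full_use_cost


def savings_vs_on_demand(
    reserved_cost_per_hour: float,
    capacity_tokens_per_hour: float,
    on_demand_rate: float,
    utilization: float,
) -> float:
    """Fractional saving from reserving; negative means on-demand would be cheaper."""
    on_demand_cost = capacity_tokens_per_hour * utilization / 1_000_000 * on_demand_rate
    return 1.0 - reserved_cost_per_hour / on_demand_cost


reservation = {
    "reserved_cost_per_hour": 70.0,
    "capacity_tokens_per_hour": 20_000_000,
    "on_demand_rate": 5.0,
}

print(f"break-even utilization: {break_even_utilization(**reservation):.0%}")
for utilization in (0.50, 0.90):
    saving = savings_vs_on_demand(**reservation, utilization=utilization)
    print(f"utilization {utilization:.0%}: saving vs on-demand = {saving:+.0%}")
```

Swap in your own reservation quote and the blended on-demand rate from your traces; the shape of the calculation stays the same.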

A scorecard built on these five lanes ranks candidates against your traffic mix, not against the marketing comparison. For the FinOps view that wires the same cost data into per-team and per-tenant attribution, see LLM spend and cost tracking and AI agent cost optimization and observability.

Per-provider knobs (as of mid-2026)

Prices age fast, so this section names the levers rather than the cents. Every figure below is approximate as of mid-2026 and should be verified against the current pricing page before it lands in a financial model.

OpenAI. GPT-5 (frontier), GPT-5.1 and GPT-5.1-mini (workhorse), GPT-5-nano (light), o3 and o3-pro (reasoning). Approximate sticker bands: GPT-5.1 in the low single digits per million input, GPT-5.1-mini around $0.15-0.30, GPT-5-nano below $0.10. Cached input applies automatically above the 1024-token threshold at roughly 25 percent of headline. Batch API at 50 percent of sync. Reasoning models bill thinking tokens at the output rate, which is the largest hidden cost line in the OpenAI stack. Scale tier offers committed-spend discounts through enterprise sales.

Anthropic. Three tiers: Claude Opus 4.7 (frontier), Sonnet 4.5 (workhorse), Haiku 4.5 (light). Approximate bands: Opus in the high single digits per million input, Sonnet in the low single digits, Haiku below $1. Prompt caching is the standout knob: two TTLs (5 min and 1 hour), small write fee, cached reads at roughly 10 percent of headline input. The discount only applies to a stable prefix: rebuild the system prefix on every call and you get no cache discount. Message Batches API at 50 percent of sync. Vertex and Bedrock resell Claude with their own provisioned-throughput options.

Google Gemini. Gemini 3 Pro (frontier), 3 Flash (workhorse), 2.5 Flash still in heavy production. Approximate bands: 3 Pro in the low single digits, 3 Flash below $0.30, 2.5 Flash below $0.15. Context caching is opt-in with a per-hour storage fee plus discounted reads (different shape from Anthropic and OpenAI). Vertex Batch at 50 percent of sync. Gemini wins on long-context routes (1M-2M token windows) where the per-token rate stays flat across the full window.

Open-weight and challenger tier. DeepSeek V3 hosted below $0.30 per million input, open weights free to self-host. Meta Llama 4 Maverick on Together, Fireworks, and Bedrock around $0.20-0.50. Mistral Large 3 in the workhorse band. Cohere Command R+ targets RAG. xAI Grok 4 in the frontier band. Self-hosted economics replace token pricing with GPU-hour pricing: an H100 serving Llama 4 Scout under vLLM at $2.50 per hour spot lands around $0.05-0.10 per million output tokens. Break-even against hosted: somewhere between 5K and 50K sustained tokens per second depending on the comparison.
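The GPU-hour conversion is worth sketching, because the per-token figure depends entirely on throughput. The 10,000 tokens per second used here is an assumed aggregate batched throughput for illustration, not a measured benchmark. Only the $2.50 per hour figure comes from the text above.

```python
def self_hosted_cost_per_million(
    gpu_cost_per_hour: float,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """USD per million generated tokens for a GPU billed by the hour.

    utilization is the fraction of the hour the GPU spends serving at
    tokens_per_second; idle time still costs money.
    """
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


fully_loaded = self_hosted_cost_per_million(2.50, tokens_per_second=10_000)
half_idle = self_hosted_cost_per_million(2.50, tokens_per_second=10_000, utilization=0.5)

print(f"100% utilized: ${fully_loaded:.3f} per million tokens")
print(f" 50% utilized: ${half_idle:.3f} per million tokens")
```

Halving utilization doubles the effective per-token price, which is why the break-even against hosted APIs is stated in sustained tokens per second rather than in list price.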

Bedrock and Azure provisioned. AWS Bedrock Provisioned Throughput, Azure PTU, and Vertex Provisioned Throughput all price by reserved capacity rather than tokens. The decision is utilization math, not sticker math: build the spreadsheet on your sustained throughput and the break-even is mechanical.

For a deeper read on benchmark behavior across providers, see LLM benchmarking state in 2026. For the gateway routing pattern that handles multi-provider in one config, see best AI gateways for cost optimization.

The replay methodology

Sticker tables are useful for screening candidates. Replay is required for picking the winner. The pattern works in four steps.

Step one: pull a representative traffic slice. A thousand production calls is usually enough. Sample across times of day, route types, and customer tiers if those vary. Strip PII before replay if compliance requires it. Store input prompts, expected output shape, and the production model’s response as ground truth.

Step two: route the same prompts through each candidate. Configure the AI gateway with multiple provider routes and point them at the slice. The gateway handles API contract differences, so you do not rewrite client code per provider. Shadow and mirror routing run this against live traffic without affecting users.

Step three: capture the per-call cost. The gateway exposes canonical dollar cost via the x-prism-cost header, computed after cache, routing, and fallback. Capture this alongside x-prism-model-used, x-prism-latency-ms, and token counts. After the replay, you have a per-call cost distribution per provider on your real traffic shape.

Step four: measure cost per successful answer, not per token. Apply an evaluator (LLM-as-judge, exact match, embedding similarity, or a domain rubric) to the replay outputs and divide cost by success rate. A model that costs 30 percent less per token but answers correctly 40 percent less often is a worse deal. Cost-per-successful-task is the metric that survives the next round of price changes.

Replay captures the variables sticker pricing ignores: your actual cache hit rate, your prompt-to-output ratio, your retry rate on bad outputs, and the difference in instruction following between candidates. A replay that takes a Tuesday afternoon to run gives a defensible answer that survives a CFO review.
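Steps three and four reduce to a small aggregation once the replay records are in hand. The sketch below assumes each record carries a candidate name, the per-call dollar cost, and a pass/fail verdict from your evaluator. The record shape and field names are hypothetical, and the sample data mirrors the "30 percent cheaper, 40 percent less often correct" case.

```python
from collections import defaultdict


def cost_per_successful_answer(records):
    """Total spend divided by passing answers, per candidate.

    Failed calls still count toward spend, because they were paid for.
    """
    totals = defaultdict(lambda: {"cost": 0.0, "successes": 0})
    for record in records:
        bucket = totals[record["candidate"]]
        bucket["cost"] += record["cost_usd"]
        bucket["successes"] += 1 if record["passed"] else 0
    return {
        candidate: (t["cost"] / t["successes"] if t["successes"] else float("inf"))
        for candidate, t in totals.items()
    }


# Illustrative replay: "premium" passes 10 of 10 at $0.010 per call,
# "cheap" costs 30 percent less per call but passes only 6 of 10.
records = (
    [{"candidate": "premium", "cost_usd": 0.010, "passed": True} for _ in range(10)]
    + [{"candidate": "cheap", "cost_usd": 0.007, "passed": i < 6} for i in range(10)]
)

for candidate, cost in cost_per_successful_answer(records).items():
    print(f"{candidate}: ${cost:.4f} per successful answer")
```

On this made-up data the cheaper-per-call candidate comes out more expensive per successful answer, which is the comparison the replay is meant to surface.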

Reasoning mode breaks the math

The newest provider knob is the most expensive failure mode. Reasoning models (OpenAI o3 and o3-pro, Claude Sonnet 4.5 with extended thinking, DeepSeek R1, Gemini 3 Pro deep think) emit thinking tokens that bill at the output rate. A single hard reasoning call can spend 30K thinking tokens before producing a 500-token answer, which puts the effective per-answer cost 10-30x the non-reasoning sibling.

Three patterns mitigate the bill. First, gate reasoning models behind a confidence-threshold router that only invokes the reasoning tier when the workhorse model flags low confidence. Most production traffic does not need reasoning; route it to the workhorse and escalate selectively. Second, set a thinking-token budget per call. Claude exposes budget_tokens, OpenAI exposes reasoning_effort. Third, monitor the thinking-to-output token ratio per route. When the ratio spikes above baseline, the reasoning model is grinding on prompts it cannot solve, and the right move is to surface those queries for prompt revision rather than keep paying for thinking.
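The third pattern, monitoring the thinking-to-output ratio per route, can be sketched as a simple threshold check. The route names, token totals, baseline, and tolerance below are all hypothetical; in practice the totals would come from your trace store.

```python
def flag_grinding_routes(route_stats, baseline_ratio, tolerance=2.0):
    """Return routes whose thinking-to-output token ratio exceeds tolerance x baseline."""
    flagged = {}
    for route, stats in route_stats.items():
        if stats["output_tokens"] == 0:
            continue  # no answers produced; handle separately rather than divide by zero
        ratio = stats["thinking_tokens"] / stats["output_tokens"]
        if ratio > baseline_ratio * tolerance:
            flagged[route] = ratio
    return flagged


# Hypothetical per-route token totals over one day.
route_stats = {
    "sql-agent": {"thinking_tokens": 3_000_000, "output_tokens": 50_000},
    "support-chat": {"thinking_tokens": 400_000, "output_tokens": 100_000},
}

flagged = flag_grinding_routes(route_stats, baseline_ratio=10.0)
for route, ratio in flagged.items():
    print(f"{route}: thinking/output ratio {ratio:.0f} exceeds threshold, review prompts")
```

A flagged route is a candidate for prompt revision or a tighter thinking budget, not an automatic downgrade.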

A subtle compounding effect: thinking tokens are not cacheable in most current implementations. The prompt cache discount that saves 80 percent on input does nothing for the reasoning output line, so the cache-shaped math that makes a premium non-reasoning model competitive does not save the reasoning bill the same way. Score reasoning models on cost per successfully answered hard query, separately from the workhorse comparison.

How Future AGI grounds the comparison

The methodology above needs three primitives: per-call dollar cost, shadow or mirror routing against live traffic, and per-span cost attribution to roll up the replay. Agent Command Center ships all three today.

Per-call cost headers. Every gateway response includes x-prism-cost (canonical dollar value after cache, routing, and fallback), x-prism-model-used (the model that actually served the request), x-prism-latency-ms, x-prism-fallback-used, x-prism-routing-strategy, and x-prism-guardrail-triggered. Capture these into your warehouse and you have a per-call ledger ready for replay analysis:

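import os

import requests

# Assumption: the gateway key is read from an environment variable; the name is illustrative.
gateway_api_key = os.environ["GATEWAY_API_KEY"]

response = requests.post(
    "https://gateway.futureagi.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {gateway_api_key}",
        # Tag the call so replay runs can be grouped per candidate later.
        "x-fagi-tag-replay": "candidate=claude-sonnet-4-5,run=may-2026",
    },
    json={
        "model": "claude-sonnet-4-5",
        "messages": [{"role": "user", "content": "..."}],
    },
    timeout=60,
)
response.raise_for_status()  # fail loudly instead of reading headers off an error response

# Per-call ledger fields from the gateway response headers.
cost_usd = float(response.headers["x-prism-cost"])
model_fired = response.headers["x-prism-model-used"]
latency_ms = float(response.headers["x-prism-latency-ms"])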

Shadow and mirror routing. Shadow sends the request to a candidate and discards the response. Mirror returns the primary response to the user and runs the candidate in parallel. Race returns whichever responds first. All three modes capture the per-call cost on both sides for fair comparison, so the replay runs against live traffic without disrupting users.

Per-span cost attribution. Register the instrumentor once and every LLM call inside the trace inherits the cost attribute, so the per-call ledger aggregates to per-route and per-tenant views:

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="cost-replay",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

Token-cost calibration. The gateway carries 100+ token-cost calibration entries per provider and model. When a new model ships, the table updates and the dollar figure stays accurate without code changes. The cost number on every gateway call is canonical, not a downstream warehouse calculation that goes stale when prices change.

Five-level hierarchical budgets. Org, team, user, key, and tag (verified in core/internal/server/server.go). Tag-level budgets let you cap a replay run at a fixed dollar amount so a misconfigured test does not blow through your committed-use discount.

The FAGI Platform’s per-eval cost lands lower than Galileo Luna-2 on equivalent online scoring workloads, which frames the eval bill as a cost-optimization target alongside the inference bill. For the eval-cost angle, see LLM evaluation playbook 2026.

Four anti-patterns that quietly burn budget

Compare sticker rates only. Sticker rates are the marketing number. Real bills are shaped by cache hit rate, prompt-token ratio, batch share, and reasoning surcharges. A spreadsheet that ranks models on sticker alone routinely picks an option that costs 3-5x more after the discount stack is counted.

Lock to a single provider. Single-provider deployments kill negotiating room on the enterprise contract, remove the failover path when a provider has an outage, and prevent shadow comparisons against a candidate. Multi-provider through a gateway is the default in 2026.

Serve one model for every route. Shallow traffic pays the frontier rate when it could run on a Haiku-class or Flash-class model. Configure per-route model choice with a router (gate Opus or o3-pro behind a confidence threshold, default to Sonnet or GPT-5.1, route shallow to Haiku or Flash) and the bill drops without any application code change.

Skip the provisioned-throughput evaluation. For steady high-volume workloads above the 60-80 percent sustained-utilization threshold, PTU saves 20-40 percent against on-demand. Teams skip the evaluation because the contract feels heavy. The contract is heavy. The savings are real. Run the math on your top three routes by spend before discarding it.

A note on prices and dates

Every cents-and-tenths figure in this post is approximate as of mid-2026 and will be stale within a quarter. Frontier sticker prices have collapsed roughly 5-10x since 2024 and continue to trend down. Provider-specific discount programs (enterprise commits, startup credits, regional promotions) routinely take another 10-30 percent off list prices, and those numbers are negotiated rather than published. Use this guide for the methodology and the discount-lane structure, which are durable. Use the current pricing page from each provider for the actual numbers before they land in a financial model.

Two roadmap caveats on the FAGI side. The trace-stream-to-agent-opt connector that auto-promotes failing traces into optimization datasets is on the roadmap, not shipped today. The optimization loop is currently eval-driven: failing traces become labeled examples that feed the agent-opt optimizers (BayesianSearch, GEPA, ProTeGi, PromptWizard, MetaPrompt, RandomSearch) once you push the dataset through. Linear is the only ticket integration on the Error Feed today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.

The five-level hierarchical budgets, the per-call cost headers, the shadow and mirror gateway modes, the per-span cost attributes on traceAI, and the 100+ token-cost calibration entries all ship today. They are the parts of this post you can deploy this week.

Next steps

Three concrete moves. First, send one route through the Agent Command Center and capture the x-prism-cost header into your warehouse for a week. You will see your real effective cost on that route, not a sticker estimate. Second, pick the top route by spend and run a replay against two candidate models. Mirror routing makes this safe against live traffic, and the per-call cost headers give you the comparison numbers without spreadsheet math. Third, score the result on cost per successfully answered query, not cost per million tokens. The winner on that metric is the model you should be using.

For the FinOps attribution layer that turns the same gateway data into per-team and per-tenant chargeback, see LLM spend and cost tracking. For the full agent cost picture across inference and eval, see AI agent cost optimization and observability. For the routing-and-caching gateway pattern in isolation, see best AI gateways for Claude Code caching.

Frequently asked questions

What is the right unit for comparing LLM prices in 2026?
Not the per-million-token sticker rate. Use effective cost per call: uncached input tokens at the full input rate, cached input tokens at the discounted read rate, and output plus thinking tokens at the output rate. The sticker rate ignores the five discount lanes that move the actual bill: prompt caching, batch APIs, prepaid commits, the input-to-output token ratio that varies wildly between workloads, and provisioned throughput. A model that looks 10x cheaper on sticker but gets zero cache hits can be far less of a bargain at scale than that headline suggests when the 'premium' alternative runs at 70 percent cache hits and a 50 percent batch share. The right comparison replays your actual workload through each candidate and measures effective cost per successfully answered query.
What are the five discount lanes that actually move the bill?
First, prompt cache: Anthropic prompt caching, OpenAI cached input, and Gemini context caching cut the input rate to 10-25 percent of sticker on cached prefixes. Second, batch APIs: OpenAI Batch, Anthropic Message Batches, and Vertex Batch run at roughly 50 percent of synchronous rates with a 24-hour SLA. Third, prepaid commits: enterprise contracts and committed-use discounts knock 10-40 percent off list for sustained spend. Fourth, prompt-token ratio: a workload that is 95 percent input and 5 percent output costs very differently from a 50-50 workload, because output rates run 3-5x input rates. Fifth, provisioned throughput: Azure PTU, Bedrock PT, and Vertex PT price by reserved capacity instead of tokens and win above 60-80 percent sustained utilization.
How does prompt caching change the effective cost calculation?
Prompt caching is the single largest knob in 2026 LLM pricing. Anthropic prompt caching exposes two TTLs (5 minutes and 1 hour) with a small write fee on first use and cached reads at roughly 10 percent of headline input. OpenAI cached input kicks in automatically once prefixes exceed 1024 tokens, with cached portions billed at around 25 percent of headline. Gemini context caching is opt-in with a per-hour storage fee plus discounted reads. If your system prompt sits above 1K tokens and your traffic reuses that prefix, your effective input rate is one-tenth to one-quarter of the sticker rate. Models with no cache support pay the full rate every call, which makes cheaper sticker prices misleading.
When does provisioned throughput beat token-metered pricing?
Provisioned throughput (Azure PTU, AWS Bedrock Provisioned Throughput, Google Vertex Provisioned Throughput) reserves a fixed inference capacity priced by the hour, day, or month rather than by token. Break-even sits around 60-80 percent sustained utilization of the reserved capacity. Below that, on-demand wins on cost. Above that, PTU saves 20-40 percent and adds latency predictability plus quota guarantees. The right candidates are steady weekday chat products, voice agents with predictable session counts, and regulated workloads that need region-locked capacity. The wrong candidates are spiky internal tools and experimentation traffic, where unused reserved capacity wastes the discount.
Why does reasoning-mode pricing break the cache math?
Reasoning models (OpenAI o3 and o3-pro, Claude Sonnet 4.5 extended thinking, DeepSeek R1, Gemini 3 Pro deep think) emit thinking tokens that bill at the output rate. A single hard reasoning call can spend 30K thinking tokens to produce a 500-token answer, which makes the effective per-answer cost 10-30x the non-reasoning sibling. Worse, thinking tokens are not cacheable in most current implementations, so the prompt cache discount that saves you 80 percent on input does nothing for the reasoning output line. The right way to use reasoning models is to gate them behind a confidence-threshold router and set a thinking-token budget per call (Claude's budget_tokens parameter, OpenAI's reasoning_effort). The right way to compare them is cost per successfully answered hard query, not cost per million tokens.
How do I actually compare providers without trusting a sticker table?
Replay your workload. Pull a representative slice of production traffic (a thousand calls is usually enough), route the same prompts through each candidate behind an AI gateway, capture the per-call dollar cost from response headers, and aggregate. The Agent Command Center exposes the canonical per-call cost via the x-prism-cost header after factoring cache, routing, and fallback. Shadow and mirror routing modes let you run the comparison without affecting user traffic. The output is a real effective-cost number per provider on your traffic shape, not a theoretical sticker comparison. Sticker tables are useful for screening candidates; replay is required for picking the winner.
How does Future AGI help compare and optimize LLM cost?
Three surfaces. First, the Agent Command Center gateway exposes per-call dollar cost via the x-prism-cost header plus the actual model fired via x-prism-model-used, which removes spreadsheet math from the comparison loop. Second, shadow and mirror routing modes let you A/B cost-test candidate providers against production traffic without disrupting user experience. Third, traceAI emits per-span cost attributes that aggregate to per-team, per-product, and per-tenant chargeback views, so the comparison artifact is the same artifact your finance team uses for monthly attribution. The gateway carries 100+ token-cost calibration entries per provider so the dollar figure stays accurate as models change.