Best AI Gateway to Manage Codex CLI Token Spend in 2026

Five AI gateways for Codex CLI token-spend management in 2026: per-session attribution, per-dev caps, alerts, model downgrade, cache observability.

March 17, 2026

20 min read

ai-gateway 2026 codex-cli

The first time Codex CLI broke a monthly budget at our company, finance pinged engineering on the 19th: “We’re at 142% of the OpenAI line item. Why is Codex CLI $38,000 ahead of plan?” The engineering director opened the OpenAI usage dashboard. Aggregate spend, no per-developer split, no per-repo split, no way to tell whether the overrun came from the platform team running multi-file refactors or one engineer running a nightly cron that re-summarised the monorepo every six hours.

That was the day the team understood the difference between token tracking and spend management. Tracking tells finance what happened after the bill arrives. Spend management is the layer that prevents the breach: per-developer caps with auto-pause, alerts routed to the manager who can act, requests downgraded from gpt-5.1 to gpt-5.1-mini when a team is over budget, cache-hit-rate dashboards, and a monthly burndown forecast that surfaces the overrun on the 11th.

Codex CLI is OpenAI’s terminal coding agent. It reads OPENAI_API_KEY, hits the Responses API on api.openai.com, and produces no native chargeback view, no native cap, no native downgrade. Everything comes from the gateway in front. This post scores the 2026 cohort on the seven axes that matter when Codex CLI is rolled out across 30+ engineers and the FinOps lead’s job is to stop the bleeding without turning developers against the tool.

TL;DR

Future AGI Agent Command Center is the strongest pick for an AI gateway for Codex CLI token spend management because it ships per-developer virtual keys with three-stage budget thresholds, an automatic gpt-5.1 → gpt-5.1-mini downgrade on breach, per-session attribution that maps each Codex CLI session to one chargeback line item, and OpenAI / Bedrock / Anthropic all reachable behind one OpenAI-compatible Responses-API base URL. The other four picks below win on specific edges.

Future AGI Agent Command Center — Best overall. Three-stage budget thresholds, per-session attribution, virtual-key fan-out that preserves OpenAI tiered-pricing discounts, and dry-run policy testing.
Portkey — Best for cleanest virtual-key budgets with Slack/Teams alerts out of the box. Mature hosted-only product (verify the Palo Alto Networks acquisition timeline before signing multi-year).
Helicone — Best for 10-developer Codex CLI rollouts where minimal infra wins. Lightweight per-key rate limiting (treat as planned migration after the March 3, 2026 Mintlify acquisition).
LiteLLM — Best when Codex CLI traffic cannot leave the VPC. Self-hosted Python proxy with team-level budgets and webhooks; pin commits after the March 24, 2026 PyPI compromise.
OpenRouter — Best for 3-5 person teams A/B-testing Codex CLI against many models before the FinOps lead gets involved. Pay-per-token directory with caller-side model selection.

Why Codex CLI spend management is harder than spend tracking

Tracking is read-only: dump cost rows into a spreadsheet, finance produces the chargeback table, and the conversation ends until next month. Spend management is write-side: it decides what happens next, in real time, when a budget is at risk. Three properties of Codex CLI make managing spend harder than tracking it.

First, Codex CLI sessions are long and lumpy. A single multi-file refactor can run 80 to 120 turns, with input spiking to 180K tokens on the merge-conflict turn near the end. A requests-per-minute rate limit is the wrong tool: one $54 session was three requests. The cap that matters is dollar-denominated, at the developer or cost-center level.

Second, the cost-quality mismatch is concrete and large. In our Q1 2026 data across 18 engineering teams, 62% of Codex CLI turns had input contexts under 8K tokens: file renames, lint fixes, completion-style turns. They run fine on gpt-5.1-mini at ~1/8th the per-token cost of gpt-5.1. The remaining 38% are multi-file refactors where gpt-5.1 earns its keep. Routing every turn to the flagship is the second-most-common reason a Codex CLI rollout blows its budget. The most common is not capping spend at all.

Third, auto-pause mid-session destroys developer trust faster than over-spend destroys the budget. The pattern that survives is three-stage: soft alert at 80%, automatic downgrade at 95%, structured 429 with a “your EM has been alerted” body at 110%. Anything cruder generates a Slack thread the FinOps lead loses.
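The three-stage pattern reduces to a small decision function. The sketch below is illustrative only and is not any gateway's actual API: the function name `budget_stage` and the returned labels are assumptions, and the thresholds are the 80 / 95 / 110 percentages described above.

```python
def budget_stage(spent_usd: float, cap_usd: float) -> str:
    """Map spend against a dollar cap to one of four stages (hypothetical labels)."""
    if cap_usd <= 0:
        raise ValueError("cap_usd must be positive")
    ratio = spent_usd / cap_usd
    if ratio >= 1.10:
        return "pause"      # structured 429; the EM has already been alerted
    if ratio >= 0.95:
        return "downgrade"  # session keeps working on the cheaper model
    if ratio >= 0.80:
        return "alert"      # soft alert only, no change in behaviour
    return "ok"
```

With a $40 daily cap, `budget_stage(34.0, 40.0)` lands in the alert stage and `budget_stage(39.0, 40.0)` in the downgrade stage, so the developer is warned and then slowed down well before anything is paused.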

All five picks below sit in front of Codex CLI via OPENAI_BASE_URL.
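In practice, "sitting in front" means the developer's shell hands Codex CLI a gateway URL and a virtual key in place of the real OpenAI credentials. The helper below is a minimal sketch of that substitution; the function name, the example gateway URL, and the `vk-` key format are hypothetical, and only the two environment variable names come from the text above.

```python
import os


def codex_gateway_env(gateway_base_url: str, virtual_key: str, base_env=None) -> dict:
    """Return a copy of the environment with Codex CLI pointed at a gateway."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENAI_BASE_URL"] = gateway_base_url  # gateway instead of api.openai.com
    env["OPENAI_API_KEY"] = virtual_key        # per-developer virtual key, not the real key
    return env


# Hypothetical values for illustration only.
env = codex_gateway_env("https://gateway.example.com/v1", "vk-dev-123", base_env={"PATH": "/usr/bin"})
# A wrapper would then launch the CLI with this environment, for example:
# subprocess.run(["codex"], env=env)
```

Because the real OpenAI key never reaches the developer's machine, the gateway is the only place where caps, alerts, and downgrades need to be enforced.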

The 7 axes we score on

The generic “best AI gateway” axis list is the wrong shape for a FinOps lead with a Codex CLI overrun. Seven axes specifically about spend management for the OpenAI terminal coding agent:

Axis	What it measures
1. Per-session token attribution for Codex CLI	Can it group token spend by Codex CLI session ID so one long refactor is a single line item?
2. Per-developer budget caps	Can it enforce a hard dollar cap per developer per day, with virtual keys that fan out to one underlying OpenAI key?
3. Alert routing	When budget is at 80% / 95% / 110%, can it page the right person — the EM, not #general?
4. Model downgrade on budget breach (gpt-5.1 → gpt-5.1-mini)	When a wallet is exhausted, can the gateway auto-route easy turns to a cheaper model without code changes?
5. Cache hit-rate observability	Does it surface OpenAI prompt-cache hit rate per developer / per repo?
6. Retry-budget interaction	When tool-use retries fail, does it count retries against budget and cap retries-per-session?
7. Monthly burndown forecasting	Does it project end-of-month spend from the first 10 days, so overrun surfaces on the 11th instead of the 19th?

The score line at the end of each pick covers all seven.

How we picked

We started from public AI gateways with an OpenAI-compatible endpoint as of May 2026. Four cuts: gateways that only do RPM rate limiting (no dollar caps); gateways without per-virtual-key budget enforcement; gateways whose auto-pause returns a 500 instead of a structured 429 (Codex CLI hangs on 500s); gateways with no surfacing of OpenAI’s prompt-cache hit rate. Five survived.

A note on the 2026 trust cohort: Portkey is mid-acquisition by Palo Alto Networks (announced April 30, 2026, close expected PANW fiscal Q4 2026); LiteLLM had a PyPI supply-chain compromise on 1.82.7 / 1.82.8 (March 24, 2026, remediated past 1.83.7 per Datadog Security Labs); Helicone was acquired by Mintlify on March 3, 2026 with the roadmap pivoting toward documentation-platform-first. All three remain in the cohort; the procurement story is now different. Flagged per pick.

1. Future AGI Agent Command Center: Best for per-developer Codex CLI budget caps with auto-downgrade routing

Verdict: Future AGI ships three-stage budget thresholds (soft alert at 80%, automatic downgrade at 95%, hard pause at 110%), per-session attribution that maps every Codex CLI session to one chargeback line item, virtual keys that preserve OpenAI tiered-pricing discounts via fan-out to one underlying key, and Bedrock alongside OpenAI both reachable behind one OpenAI-compatible Responses-API base URL so a budget breach downgrades from gpt-5.1 to gpt-5.1-mini mid-session without an SDK change.

What it does for Codex CLI token spend management:

Per-session attribution through fi.attributes.session.id, set when Codex CLI’s session header is forwarded. Each session is one chargeback line item; each turn a child span. The 180K-token turn that broke the budget is one click away.
Per-developer budget caps through virtual keys with native three-stage thresholds (80% soft alert, 95% downgrade, 110% hard pause with structured 429). Each virtual key fans out to one underlying OpenAI key, preserving tiered-pricing discounts.
Alert routing through fi.alerts with per-cost-center rules. Platform squad’s breach DMs the platform EM; data squad’s DMs the data EM. FinOps gets the end-of-day digest, not a 9pm firehose.
Model downgrade (gpt-5.1 → gpt-5.1-mini) through the budget-aware routing policy. At 95%, turns under 10K input tokens auto-route to gpt-5.1-mini; flagship stays for hard turns until 110%. Past 110%, both drop to mini until midnight UTC.
Cache hit-rate observability through the prompt-cache span attribute. OpenAI’s cached-input pricing (~50% of full input cost on hit) surfaces per developer, per repo, per session, no SQL needed.
Retry-budget interaction through the retry-counter attribute. Each retry counts against budget and a per-session cap (default 3). A session looping 14 times on a bash error pauses after the third with a structured 429.
Monthly burndown forecasting through the spend-projection view. On day 11, three bands (P10 / P50 / P90). If P50 crosses budget, FinOps has nine days to act.
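Of the mechanisms listed above, the per-session retry cap is the easiest to picture in code. The sketch below is an assumption-laden illustration rather than Future AGI's implementation: the class name, the response shape, and the `retry_budget_exhausted` error type are all hypothetical, and only the default cap of 3 comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class SessionRetryBudget:
    """Count retries per session and refuse further retries past a cap."""
    max_retries: int = 3
    used: dict = field(default_factory=dict)

    def record_retry(self, session_id: str):
        """Return None if the retry is allowed, else a 429-style body (hypothetical shape)."""
        count = self.used.get(session_id, 0)
        if count >= self.max_retries:
            return {
                "status": 429,
                "error": {
                    "type": "retry_budget_exhausted",
                    "session_id": session_id,
                    "retries_used": count,
                },
            }
        self.used[session_id] = count + 1
        return None
```

Keying the counter on the session rather than the request is the point: a loop of individually cheap retries is invisible to a per-request limit but trips this cap on the fourth attempt.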

The loop. Every turn produces a span via traceAI (Apache 2.0). fi.evals (Apache 2.0) scores tool-use accuracy, code correctness, and task completion. On a breach, fi.opt.optimizers reads breach, trace, and eval history and emits a policy diff: “for platform-squad, route turns under 8K input tokens to gpt-5.1-mini between 9am and 5pm, regression 0.4%, under 2% tolerance. Estimated monthly saving: $3,840.” Math shown. Six optimizers are available (RandomSearchOptimizer; BayesianSearchOptimizer, Optuna-backed with teacher-inferred few-shot templates and resumable studies; MetaPromptOptimizer; ProTeGi; GEPAOptimizer; PromptWizardOptimizer). All six share an EarlyStoppingConfig (patience, min_delta, threshold, max_evaluations) and the same unified Evaluator over 60+ FAGI rubrics, and all are Apache 2.0 in agent-opt. Protect guardrails (~65 ms text, 107 ms image per arXiv 2510.13351) add no perceptible latency.

Where it falls short:

The three-stage threshold (80 / 95 / 110) is opinionated. Teams wanting a single hard cap will find the default over-engineered. Configurable.
The “learn which turns to downgrade” loop needs ~three weeks of traffic before stable policy diffs emerge. Day-one uses static rules.

Pricing: Apache 2.0 single Go binary; cloud or self-host. Free tier 100K traces / month. Scale from $99 / month. Enterprise custom, SOC 2 Type II, HIPAA, GDPR, and CCPA certified, BAA available, AWS Marketplace listing.

Score: 7 / 7 axes.

2. Portkey: Best for hosted Codex CLI spend management with Slack-native alerts

Verdict: Portkey is the most polished hosted product for the enforce-budgets-and-alert slice. Virtual-key budgets are the cleanest in the cohort, Slack and Teams integrations are first-class, the dashboard is the one FinOps leads log into. It doesn’t close the loop on routing policy. Verify the PANW acquisition timeline before signing multi-year.

What it does for Codex CLI token spend management:

Per-session attribution through Portkey’s trace_id header. The Codex CLI wrapper has to set it; without it, sessions blend. A 12-line shell wrapper is the standard pattern.
Per-developer budget caps through virtual keys with daily or monthly resets. Soft-alert thresholds configurable per key.
Alert routing through native Slack and Teams. Per-key alerts target channels or DM specific users. “Platform-EM versus #general” is first-class.
Model downgrade on budget breach is partial. Metadata-conditional routing exists, but “at 95%, route easy turns to gpt-5.1-mini” requires a YAML rule. Wireable, not turnkey. Plan a sprint for the three-stage pattern.
Cache hit-rate observability is partial. Portkey surfaces its own semantic cache hit rate; OpenAI’s prompt-cache hit rate is recoverable from raw trace, not from the dashboard.
Retry-budget interaction isn’t natively modeled. Each retry counts as a separate request with no per-session cap. The “agent loops on a bash error and burns $80 in 4 minutes” failure mode happens.
Monthly burndown forecasting isn’t a dashboard feature.

Where it falls short:

Palo Alto Networks announced intent to acquire Portkey on April 30, 2026, close expected PANW fiscal Q4 2026. The gateway is slated to become the AI Gateway for Prisma AIRS. Verify standalone-product continuity before signing multi-year.
No optimizer. This month’s breach pattern doesn’t influence next month’s policy unless the EM manually re-tunes.
Route-by-budget is wireable but not turnkey. Day-one is caps + alerts; the downgrade rule is a follow-on sprint.

Pricing: Open-source core (MIT) + commercial cloud control plane. Free tier 10K requests / day. Scale from $99 / month. Enterprise custom with SOC 2 Type II.

Score: 5 / 7 axes (missing: turnkey downgrade-on-breach, retry-budget cap, burndown forecast, self-improving policy).

3. Helicone: Best for lightweight Codex CLI spend management on small teams

Verdict: Helicone is the right pick when the story is “10 developers on Codex CLI, $5K / month, daily cap per developer plus a Slack ping when it trips.” Drop in the proxy, set the rate-limit policies, and the overhead matches the budget. Past 20 developers, or once FinOps asks for downgrade-on-breach, the cracks show. Acquired by Mintlify in March 2026 with the roadmap pivoting toward documentation-platform-first.

What it does for Codex CLI token spend management:

Per-session attribution through Helicone-Session-Id. Wrapper has to set it.
Per-developer budget caps through rate-limit policies. Helicone is RPM-first; dollar caps work through usage alerts plus a webhook that flips the key off. Less clean than Portkey’s virtual-key budgets.
Alert routing through usage alerts; Slack hook is standard. Per-EM routing requires per-policy channel config.
Model downgrade on budget breach isn’t present out of the box. Failover fires on errors, not budget proximity. The downgrade has to be coded in a wrapper upstream.
Cache hit-rate observability through Helicone’s own cache. OpenAI’s native prompt-cache hit rate isn’t first-class.
Retry-budget interaction isn’t modeled. Each retry counts as a request; no per-session cap.
Monthly burndown forecasting isn’t present.

Where it falls short:

Acquired by Mintlify (March 3, 2026) with the public roadmap shifting toward documentation-platform-first. Treat 2026 as a migration-evaluation window.
No optimizer.
Policy expressiveness is below Portkey and Future AGI. The three-stage pattern is hand-wired through webhooks.
Self-hosting beyond a few hundred RPS becomes an operational burden.

Pricing: Free tier 10K requests / month. Pro from $25 / month. Enterprise custom.

Score: 3.5 / 7 axes (missing: downgrade-on-breach, retry-budget cap, burndown forecast, self-improving policy).

4. LiteLLM: Best for self-hosted Codex CLI spend management inside the VPC

Verdict: LiteLLM is the pick when Codex CLI traffic can’t leave the VPC and security wants source code they can read. Budget primitives are real: team budgets, user budgets, virtual keys, webhook alerts. Polish is below the hosted alternatives; the downgrade story is Python, not a toggle. Pin commit hashes past 1.83.7 per the March 2026 PyPI compromise.

What it does for Codex CLI token spend management:

Per-session attribution through metadata pass-through. Wire metadata.session_id from the Codex CLI header into the proxy config.
Per-developer budget caps through team and user budgets; hard cap returns 429. team_id maps to the SSO claim if you wire your IdP.
Alert routing through webhook hooks. PagerDuty, Opsgenie, Slack: bring your own destination.
Model downgrade on budget breach through pre_call_check hooks. A 25-line Python hook that checks remaining budget and rewrites model="gpt-5.1" to model="gpt-5.1-mini". Real engineering time, not a toggle.
Cache hit-rate observability is thin. LiteLLM forwards OpenAI’s cache headers but doesn’t aggregate them. Pair with traceAI or build the dashboard yourself.
Retry-budget interaction through num_retries (per-request only). Per-session caps live in the pre-call hook.
Monthly burndown forecasting isn’t present. Export the spend table to your analytics warehouse.
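The downgrade hook mentioned in the list comes down to one check and one rewrite. The sketch below shows only that logic as a plain function; it deliberately does not reproduce LiteLLM's hook interface, and the function name, the request-dict shape, and the way spend and cap are passed in are assumptions made for illustration.

```python
FLAGSHIP = "gpt-5.1"
CHEAPER = "gpt-5.1-mini"


def downgrade_if_over_budget(request: dict, spent_usd: float, cap_usd: float,
                             threshold: float = 0.95) -> dict:
    """Return a copy of the request, rewriting the model once spend crosses the threshold."""
    over = spent_usd >= threshold * cap_usd
    if over and request.get("model") == FLAGSHIP:
        # Copy with only the model swapped; every other field is passed through.
        return {**request, "model": CHEAPER}
    return dict(request)
```

Returning a copy rather than mutating the incoming request keeps the original payload available for logging, which helps when an engineer asks why a turn ran on the smaller model.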

Where it falls short:

March 24, 2026 PyPI supply-chain compromise. Versions 1.82.7 and 1.82.8 were published by an attacker with the maintainer’s PyPI token; the package exfiltrated SSH keys, cloud credentials, and Kubernetes configs (Datadog Security Labs TeamPCP writeup). Remediated past 1.83.7. Pin commit hashes and rotate touched credentials.
No optimizer.
Observability is thinner than Portkey or Future AGI. Most enterprises pair LiteLLM with traceAI for breach forensics.
Day-one setup is heavier than hosted alternatives. Plan a platform-team sprint.

Pricing: Open source under MIT (enterprise dir licensed separately). Enterprise from ~$250 / month for small teams.

Score: 4 / 7 axes (missing: turnkey downgrade-on-breach, cache hit-rate UI, retry-session cap, burndown forecast, self-improving policy).

5. OpenRouter: Best for early-stage Codex CLI A/B testing before FinOps gets involved

Verdict: OpenRouter is the lowest-friction way to route Codex CLI across many models. One API key, one base URL, 200+ models, per-token markup. It answers “three engineers A/B-testing models without operating a gateway.” It doesn’t answer “cap each developer at $40 / day and downgrade easy turns on breach.” Past 5 developers, the wrong shape.

What it does for Codex CLI token spend management:

Per-session attribution is shallow. Aggregates by API key. Session ID requires CSV export.
Per-developer budget caps are account-level only. One developer’s spend isn’t isolated. Bulk-pricing discounts aren’t preserved since OpenRouter is the marketplace.
Alert routing is account-level usage alerts. No per-EM routing.
Model downgrade on budget breach is caller-side. No budget-aware routing rule.
Cache hit-rate observability isn’t present.
Retry-budget interaction isn’t modeled.
Monthly burndown forecasting isn’t present.

Where it falls short:

No per-virtual-key budget enforcement. Cost control is a single account billing limit.
No semantic or exact cache at the gateway layer.
Per-token markup adds up at 50M+ tokens / month.
Closed source, no self-host.

Pricing: Per-token markup on provider rates; cloud only.

Score: 2 / 7 axes (missing: per-virtual-key caps, alert routing, downgrade-on-breach, cache UI, retry cap, burndown forecast, self-improving policy).

Capability matrix

Axis	Future AGI	Portkey	Helicone	LiteLLM	OpenRouter
Per-session token attribution for Codex CLI	Native via session-ID span attr	Header (`trace_id`)	Header (`Helicone-Session-Id`)	Metadata pass-through	CSV export only
Per-developer budget caps	Native virtual key, 3-stage	Virtual key, hard cap	Rate limit + webhook	Team + user budgets	Account-level only
Alert routing	Per-cost-center routing UI	Slack/Teams native	Webhook (Slack standard)	Webhook (BYO destination)	Account-level alerts
Model downgrade on budget breach (gpt-5.1 → gpt-5.1-mini)	Native budget-aware policy	Wireable conditional (sprint)	Not present	Wireable via `pre_call_check` hook	Caller-side only
Cache hit-rate observability	First-class span attr, dashboard view	Partial (own cache only)	Partial (own cache only)	Header pass-through; bring own dashboard	Not present
Retry-budget interaction	Per-session retry cap + budget counter	Per-request only	Per-request only	Router `num_retries` only	Not modeled
Monthly burndown forecasting	P10 / P50 / P90 projection on day 10	Not present	Not present	Not present	Not present
Self-improving budget policy	`fi.opt` auto-tunes thresholds	Not present	Not present	Not present	Not present

Decision framework: Choose X if

Choose Future AGI Agent Command Center if budget breaches are a recurring monthly conversation, the FinOps lead is tired of paging the EM at 9pm on the 23rd, and the goal is to make next month’s policy automatically tighter than this month’s. agent-opt is opt-in: turn it on once Codex CLI has eval baselines and live traces flowing, and the cost curve compounds downward from there.

Choose Portkey if you want hosted polish, virtual-key budgets are enough for the year-one story, you can wire the downgrade-on-breach rules yourself, and you have a read on the Palo Alto Networks acquisition timeline.

Choose Helicone if the team is under 10 developers, the story is “daily cap + Slack ping,” policy gets re-tuned manually in standup, and the Mintlify acquisition timeline doesn’t bother you.

Choose LiteLLM if compliance forbids Codex CLI traffic leaving the VPC, Python is acceptable as a runtime, you can pin commit hashes past 1.83.7, and the platform team will write the downgrade-on-breach hook.

Choose OpenRouter if you’re a 3-5 person team A/B-testing Codex CLI against many models and per-developer budgets aren’t yet a procurement issue. Re-evaluate the moment a FinOps lead joins the conversation.

Common mistakes when wiring Codex CLI spend management

Mistake	What goes wrong	Fix
Single hard cap at 100%	Codex CLI pauses mid-conversation; engineers route around by pasting prompts into ChatGPT.com	Three-stage cap: alert 80%, downgrade 95%, hard pause 110%
Capping at the OpenAI-key level instead of per-developer	One heavy engineer pauses the entire team’s key	Issue virtual keys per developer; underlying OpenAI key keeps the bulk discount
Alerts to a shared channel	Notifications get muted; the next breach goes unnoticed until day 19	Per-cost-center routing to the EM responsible; FinOps gets the daily digest, not the per-event firehose
Not surfacing cache hit rate per session	Sessions with drifted cache keys pay full price while aggregate looks fine	Surface hit rate per session and per repo; flag any session under 30% for prompt-shape review
Not capping retries per session	Agent loops on a bash error and burns $80 in 4 minutes; per-request budget never sees it	Cap retries per session (default 3) with structured 429 “retry budget exhausted” body
Forecasting burndown linearly	Spend curves are not linear; weekends pull the average down; forecast undershoots	Forecast in P10 / P50 / P90 bands; act when P50 crosses budget
Static downgrade list	Fallback model gets deprecated; gateway errors at 95% threshold	Wire the cascade as a list (`gpt-5.1` → `gpt-5.1-mini` → `gpt-4.1-mini`); update quarterly
Skipping the dry-run before shipping a new cap	First production day pauses three sessions the EM did not predict; trust drops	Run the proposed cap against the previous 30 days of traffic; iterate offline
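The banded forecast recommended in the table can be approximated with a bootstrap over observed daily spend. This is a swapped-in illustration, not the method any gateway in this post is documented to use; the function name, the simulation count, and the fixed seed are assumptions.

```python
import random
import statistics


def burndown_bands(daily_spend, days_in_month, n_sims=2000, seed=0):
    """Project end-of-month spend as (P10, P50, P90) by resampling observed days."""
    if not daily_spend:
        raise ValueError("need at least one observed day")
    rng = random.Random(seed)  # fixed seed so the forecast is reproducible
    spent = sum(daily_spend)
    remaining = days_in_month - len(daily_spend)
    # Each simulation fills the remaining days by drawing from the days already seen,
    # so quiet weekends and heavy refactor days are both represented.
    totals = [
        spent + sum(rng.choice(daily_spend) for _ in range(remaining))
        for _ in range(n_sims)
    ]
    cuts = statistics.quantiles(totals, n=10)  # nine cut points: P10 through P90
    return cuts[0], cuts[4], cuts[8]
```

Comparing the middle band against the monthly budget gives the "act when P50 crosses budget" rule, while the spread between the outer bands shows how lumpy the team's spend is.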

How Future AGI closes the loop on Codex CLI spend management

The other four gateways treat spend management as a static policy: write the cap, enforce it, re-tune when reality drifts. Future AGI treats the cap as the input to a self-improving policy. Six stages:

Trace. Every Codex CLI turn produces a span via traceAI (Apache 2.0) capturing tokens, cost, model, session ID, cache state, retry count, and budget state.
Evaluate. fi.evals scores every turn for code-correctness, faithfulness, and tool-use accuracy. The gateway knows which turns were “gpt-5.1 called when gpt-5.1-mini would have done it.”
Cluster. High-cost-low-difficulty sessions cluster by failure mode. The spend-management cluster is the routine work that ate the budget: small inputs, simple tool calls, trivial diffs at flagship prices.
Optimize. fi.opt.optimizers reads breach, trace, and eval history and emits a policy diff: “for platform-squad, route turns under 8K input tokens to gpt-5.1-mini between 9am and 5pm, regression 0.4%, under 2% tolerance. Estimated monthly saving: $3,840.” Six optimizers are available (RandomSearchOptimizer; BayesianSearchOptimizer, Optuna-backed with teacher-inferred few-shot templates and resumable studies; MetaPromptOptimizer; ProTeGi; GEPAOptimizer; PromptWizardOptimizer), all sharing an EarlyStoppingConfig (patience, min_delta, threshold, max_evaluations) and the same unified Evaluator over 60+ FAGI rubrics.
Route. Gateway applies the policy on the next request, versioned. The squad goes from “we breach by 40% every month” to “87% with headroom.”
Re-deploy. New rule is signed; if eval scores regress, automatic rollback. Each deploy is an audit event tied to the approver’s IdP claim.

Net effect: a 30-engineer team starting at $28,000 / month with monthly breaches typically settles into a steady-state where the cap holds, monthly spend trends down 22-30% within four weeks, and the FinOps lead stops getting paged at 9pm on the 23rd.

Building blocks are open source: traceAI, ai-evaluation, agent-opt (all Apache 2.0 at github.com/future-agi). Agent Command Center adds the budget-policy UI, dry-run mode, per-cost-center alert routing, cache-hit dashboard, retry-budget primitives, burndown forecast, Protect guardrails (~65 ms text, 107 ms image per arXiv 2510.13351), RBAC, SOC 2 Type II certification, BAA availability, and an AWS Marketplace listing.

What we did not include

Cloudflare AI Gateway. Strong edge primitives but budget-cap UX is rate-limit-first; dollar-denominated caps with automatic downgrade still require Worker code as of May 2026. Better as an edge-routing pick than a spend-management pick.
Kong AI Gateway. Strong if you already run Kong for REST APIs, but AI-specific spend-management is plugin-driven and requires AI Proxy plugin 3.6+ plus a state-store wiring sprint. For teams not already on Kong, the cohort above is faster to ship.
TrueFoundry. Solid ML-platform gateway with tenant-level cost centers, but the actuator side (auto-pause, downgrade-on-breach) is less mature than the tracking side. Better as a tracking pick than a management pick.

All three are worth a second look in Q3 2026.

Sources

OpenAI Codex CLI documentation, github.com/openai/codex
OpenAI Responses API + prompt caching, platform.openai.com/docs/guides/prompt-caching
Future AGI Agent Command Center, futureagi.com/platform/monitor/command-center
Future AGI traceAI, github.com/future-agi/traceAI
Future AGI ai-evaluation, github.com/future-agi/ai-evaluation
Future AGI agent-opt, github.com/future-agi/agent-opt
Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (65 ms text, 107 ms image)
Portkey AI gateway, portkey.ai
Palo Alto Networks press release on Portkey acquisition (April 30, 2026), paloaltonetworks.com/company/press/2026/palo-alto-networks-to-acquire-portkey-to-secure-the-rise-of-ai-agents
Helicone proxy, helicone.ai
Mintlify acquisition of Helicone (March 3, 2026), mintlify.com/blog/helicone-joins-mintlify
LiteLLM proxy, github.com/BerriAI/litellm
Datadog Security Labs writeup on LiteLLM PyPI compromise (TeamPCP campaign, March 24, 2026), securitylabs.datadoghq.com/articles/litellm-compromised-pypi-teampcp-supply-chain-campaign
OpenRouter models directory, openrouter.ai/models

Frequently asked questions

How is Codex CLI token spend management different from tracking?

Tracking is read-only: dashboards, per-developer spend, monthly invoices. Management is write-side: caps with auto-pause, alert routing, model downgrade on breach, retry-budget caps, burndown forecasting. Tracking tells you what happened on the 19th. Management surfaces the overrun on the 11th.

How do I set a budget cap on Codex CLI without breaking developer flow?

Three-stage threshold: soft alert at 80%, downgrade at 95% (Codex CLI keeps working on `gpt-5.1-mini`), hard pause at 110% (structured 429). Native in Future AGI. Portkey through metadata-conditional rules; LiteLLM through `pre_call_check` hooks.

Can the gateway automatically downgrade Codex CLI from gpt-5.1 to gpt-5.1-mini when over budget?

Yes. Future AGI ships it natively with optimizer-tuned thresholds. Portkey through YAML rules (wireable, not turnkey). LiteLLM through `pre_call_check` Python hooks. Helicone and OpenRouter need caller-side logic in the wrapper.

How do I track Codex CLI cost per developer when everyone shares one OpenAI key?

Use a gateway with virtual keys (Future AGI, Portkey, LiteLLM). Each developer gets a virtual key that fans out to the team's OpenAI key, preserving tiered-pricing discounts. OpenRouter does not support per-developer virtual keys.
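Once every request carries a virtual key and a session ID, the chargeback table is a simple rollup. The sketch below assumes the gateway can export cost rows as dictionaries; the field names `virtual_key`, `session_id`, and `cost_usd` are hypothetical and will differ per gateway.

```python
from collections import defaultdict


def chargeback(rows):
    """Sum cost per developer and per (developer, session) from exported cost rows."""
    per_developer = defaultdict(float)
    per_session = defaultdict(float)
    for row in rows:
        key = row["virtual_key"]
        per_developer[key] += row["cost_usd"]
        per_session[(key, row["session_id"])] += row["cost_usd"]
    return dict(per_developer), dict(per_session)


rows = [
    {"virtual_key": "vk-alice", "session_id": "s1", "cost_usd": 1.5},
    {"virtual_key": "vk-alice", "session_id": "s1", "cost_usd": 2.25},
    {"virtual_key": "vk-bob", "session_id": "s9", "cost_usd": 0.5},
]
per_developer, per_session = chargeback(rows)
```

The per-session totals are what turn one long refactor into a single line item instead of dozens of unrelated requests.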

What is OpenAI's prompt-cache hit rate, and why does it matter?

OpenAI prices cached input tokens at roughly 50% of full input cost. Codex CLI sessions on the same repo share large chunks of system prompt across turns. At that discount, a team at a 30% hit rate pays about 85% of list input price while a team at 60% pays about 70%, so the lower-hit-rate team spends roughly 21% more on input. Future AGI surfaces hit rate per developer, per repo, per session; the others surface their own caches or nothing.
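The arithmetic behind the hit-rate comparison is short enough to check directly. The sketch below assumes the roughly 50% cached-input price quoted above and treats hit rate as the share of input tokens served from cache; the function name and parameter names are illustrative.

```python
def effective_input_cost(hit_rate: float, cached_price_ratio: float = 0.5) -> float:
    """Input cost as a fraction of list price for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Uncached tokens pay full price; cached tokens pay the discounted ratio.
    return (1.0 - hit_rate) + hit_rate * cached_price_ratio


low = effective_input_cost(0.30)   # about 0.85 of list price
high = effective_input_cost(0.60)  # about 0.70 of list price
relative_gap = low / high - 1.0    # about 0.21
```

The gap between a 30% and a 60% hit rate is 15 points of list input price, which works out to about 21% more input spend in relative terms for the lower-hit-rate team.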

Does the gateway count Codex CLI's tool-use retries against the budget?

Yes on Future AGI (per-session retry cap, default 3). The other four count retries against the per-request budget but lack a per-session cap — a Codex CLI agent that loops 14 times on a bash error can burn $80 in 4 minutes before the daily cap sees it.

Is it safe to send Codex CLI source code through an AI gateway?

For hosted gateways, the data flow is gateway → OpenAI; both endpoints already see the code. If compliance forbids both, the safe picks are self-hosted LiteLLM or Future AGI BYOC. Helicone and OpenRouter are cloud-only.

How is Future AGI different from Portkey for Codex CLI spend management?

Portkey is a hosted enforcement layer with mature virtual-key budgets and Slack-native alerts. Future AGI adds the self-improving optimizer, dry-run mode, three-stage cap as a native primitive, first-class cache-hit surfacing, per-session retry caps, and the burndown forecast. Portkey enforces the policy you write. Future AGI enforces the policy you write *and* writes the next policy for you.
