Engineering

Running Claude Code with OpenAI Models in 2026: A Gateway Setup Guide

Run Claude Code against OpenAI GPT-5 and GPT-4 via a translation gateway in 2026: setup, ENV vars, config, then five gateways scored.

May 15, 2026

16 min read

ai-gateway 2026 claude-code

Table of Contents

Claude Code is the best coding-agent UX shipped to date, and it speaks Anthropic. Point the CLI at api.openai.com and you get an authentication error on turn one, the binary issues POST /v1/messages with x-api-key headers and expects Anthropic-shaped streaming events back. OpenAI’s API answers POST /v1/chat/completions and streams a different event schema. The two protocols overlap in shape but disagree on every detail that matters: tool calls live in different fields, system prompts go in different places, cache control means different things, streaming event types don’t match.

That mismatch is the gap an AI gateway closes. The gateway accepts Claude Code’s Anthropic-shaped request, translates to OpenAI’s chat-completion format, forwards to GPT-5 or GPT-4o, then translates the streaming response back into Anthropic’s content_block_* events so Claude Code’s progress UI keeps moving. Done well, the CLI doesn’t know. Done poorly, the terminal freezes mid-refactor or silently drops every parallel tool call after the first.

This guide is in two parts. First: the implementation walkthrough, prereqs, translation-layer mechanics, ENV vars, and a gateway config that routes the four common Claude Code patterns to GPT-5 variants. Second: a scored shortlist of five gateways that ship this translation in production, named honestly with what each breaks on.

Why anyone runs Claude Code on OpenAI models

Three reasons keep coming up in the field.

Cost arbitrage on the easy turns. A typical Claude Code workload is bimodal: roughly 60-70% of turns are boilerplate edits and small refactors under 8K input tokens, the other 30-40% are architecture and multi-file work needing 60K-200K context. Sending the easy turns to GPT-5-mini instead of Claude Sonnet saves 60-75% on those calls without measurable quality loss. Hard turns stay on Opus.

Capability mix. GPT-5 is genuinely better than Claude Opus on some workloads, structured JSON extraction with strict schemas, certain numerical reasoning chains, SQL-with-aggregations generation. The point isn’t “OpenAI is better”; it’s “different models win on different turn-shapes, and you want both.”

Rate-limit and outage hedging. Anthropic occasionally rate-limits or has regional incidents. A gateway that routes to OpenAI as deterministic fallback keeps the team coding through the outage.

Shared tradeoff across all three: Claude Code was designed around Claude’s tool-use protocol and prompt caching. Running it on OpenAI through a gateway works, but the gateway has to do real translation work on every turn.

Prereqs

Before wiring anything, confirm:

Claude Code CLI installed. The binary honors ANTHROPIC_BASE_URL since 1.6; older versions ignore it.
OpenAI API key with GPT-5 or GPT-4o access (needs responses:write scope).
A gateway choice. This guide uses Future AGI Agent Command Center as the worked example because it ships the translation layer end-to-end. Section two covers four alternatives.
Shell with persistent env vars. Set them in .zshrc or .bash_profile so both your IDE terminal and direct CLI usage hit the gateway. The biggest setup mistake we see is wiring the IDE only, leaking half the traffic to api.anthropic.com direct.
A test repo with five to ten files. Translation regressions show up first on real-shaped sessions, not on hello world.

How the translation layer actually works

Four conversions matter between Claude Code’s outbound request and OpenAI’s inbound endpoint.

1. Endpoint mapping. Claude Code calls POST /v1/messages. OpenAI exposes POST /v1/chat/completions (and the newer POST /v1/responses). The gateway listens on /v1/messages, parses the Anthropic body, and reissues to the OpenAI endpoint. The reverse mapping, converting OpenAI’s response back into an Anthropic Message, happens before the stream returns to the CLI.

2. System prompt placement. Anthropic accepts system as a top-level string. OpenAI expects a messages array where the first element has role: "system" (or role: "developer" on newer endpoints). The translator must lift Claude Code’s system block to the first message. If the gateway forgets, GPT-5 answers but ignores tool-use instructions, and parallel tool calls collapse into sequential ones.

3. Tool-use block conversion. This is where most translations break. Claude Code uses Anthropic’s tool_use and tool_result content blocks aggressively, every bash invocation, every file edit, every grep is a structured JSON block inside the content array. OpenAI splits this across tool_calls on the assistant message and a tool role for results. Five parallel tool calls in Claude’s protocol become five entries inside a single tool_calls array. A naive translator that maps one-block-to-one-message flattens parallel calls into sequential, and Claude Code’s five-file-edit pattern breaks silently. The gateway must round-trip parallel arrays in both directions, matching tool_use_id to tool_call_id.

4. Streaming SSE bridging. Anthropic streams typed events (content_block_start, content_block_delta, content_block_stop, message_delta, message_stop). OpenAI streams loose delta chunks with finish_reason and tool_calls deltas threaded in. The gateway re-emits Anthropic-typed events in real time. Buffering the OpenAI stream and replaying at the end makes Claude Code’s progress UI hang for the full turn, functionally frozen. The right pattern is event-by-event translation with no buffering beyond the SSE chunk boundary.

Cache control deserves a brief mention. Claude Code sets cache_control on long system prompts. OpenAI caches automatically with no header. The translator should drop cache_control on OpenAI-bound requests; passing it through is a no-op but makes the trace misleading.

Step 1: Set the environment variables

Open your shell profile and add the following:

# ~/.zshrc or ~/.bash_profile

# Point Claude Code at the gateway instead of api.anthropic.com
export ANTHROPIC_BASE_URL="https://gateway.futureagi.com/v1"

# The API key Claude Code sends. The gateway maps this to your OpenAI key
# server-side; the CLI never sees the OpenAI credential.
export ANTHROPIC_API_KEY="fi_live_xxxxxxxxxxxxxxxx"

# Pin the protocol version so tool-use behavior is deterministic
export ANTHROPIC_VERSION="2025-09-01"

# Optional: set the default model alias the gateway routes from.
# This is what Claude Code's --model flag will resolve to if you do not
# override it on the command line.
export ANTHROPIC_MODEL="gpt-5-via-gateway"

# Reload
source ~/.zshrc

Two checks before moving on. First, echo $ANTHROPIC_BASE_URL in a fresh terminal should return the gateway URL. Second, the CLI honors the override: run claude with a trivial prompt and watch the gateway’s request log to confirm the call landed there and not at Anthropic.

Step 2: Configure the model aliases in the gateway

The gateway needs to know that when Claude Code asks for gpt-5-via-gateway, it should translate to OpenAI’s GPT-5 endpoint. Most production gateways express this as a config file or dashboard rule. Below is a representative shape, the exact YAML differs across vendors but the model is the same.

# gateway-config.yaml

routes:
  - name: gpt-5-via-gateway
    inbound_protocol: anthropic
    upstream:
      provider: openai
      model: gpt-5
      endpoint: https://api.openai.com/v1/chat/completions
      api_key_ref: openai_prod
    translation:
      system_placement: first_message
      tool_use: openai_tool_calls
      streaming: anthropic_events
      drop_headers: [cache_control]

  - name: gpt-5-mini-via-gateway
    inbound_protocol: anthropic
    upstream:
      provider: openai
      model: gpt-5-mini
      endpoint: https://api.openai.com/v1/chat/completions
      api_key_ref: openai_prod
    translation:
      system_placement: first_message
      tool_use: openai_tool_calls
      streaming: anthropic_events
      drop_headers: [cache_control]

  - name: claude-opus-fallback
    inbound_protocol: anthropic
    upstream:
      provider: anthropic
      model: claude-opus-4-7
      endpoint: https://api.anthropic.com/v1/messages
      api_key_ref: anthropic_prod
    translation:
      passthrough: true

policies:
  - name: cost_aware_routing
    rule: |
      if input_tokens < 8000 and tool_call_count <= 2:
        route gpt-5-mini-via-gateway
      elif input_tokens < 60000:
        route gpt-5-via-gateway
      else:
        route claude-opus-fallback

The policies block is what makes the multi-provider story useful. Without a routing policy, every Claude Code call goes to one upstream and you haven’t improved over a single-provider setup. The example routes short turns with simple tool use to GPT-5-mini, mid-range turns to full GPT-5, and reserves Opus for long-context architecture work where Claude’s tool-use training still beats GPT-5.

After saving the config, restart the gateway, then verify by issuing one turn against each alias and inspecting the trace to confirm the upstream model matches the policy choice.

Step 3: Run a real session and watch the trace

The verification step that catches everything is a single multi-turn session against a real repo. Ask Claude Code for a change that requires four to six parallel file edits (something like “rename getUserById to findUserById across the codebase”) and confirm:

The trace shows the request hitting gpt-5-via-gateway (or whichever alias the policy resolved).
The CLI’s progress UI streams tokens in real time, not freeze-then-dump.
The tool-call count in the trace matches the number of edits performed. If you asked for six edits and the trace shows two, parallel tool-call translation is broken.
The final response lands in the CLI with the diff displayed normally.

If all four hold, the wiring is correct. If streaming freezes, the gateway is buffering instead of bridging event-by-event. If parallel tool calls collapsed, the translation is mapping content blocks one-to-one. Both are gateway-side fixes.

Step 4: Production checklist

Before declaring victory, walk through the operational concerns that bite once a team uses this daily.

Concern	What to check
Latency overhead	Measure p50 and p95 of the gateway hop. Translation alone should add 5-15ms; anything higher suggests buffering or a JSON re-parse that is not necessary.
Failure isolation	If the gateway is down, does Claude Code surface a clean error or hang? Wire a deterministic fallback to Anthropic-direct as a degraded mode.
Cost attribution	Tag every request with developer ID and repo. Without this, the gateway has saved cost but lost the chargeback story finance needs.
Audit log	Every gateway decision (which model, which policy fired, what the input-token estimate was) should be queryable. This becomes the trail you need when a session looks anomalous.
Cold-start	First request after a config push should not take 5+ seconds. If it does, the gateway is recompiling the route map per request.
Rollback	You should be able to disable the gateway hop in under a minute by unsetting `ANTHROPIC_BASE_URL` in the team’s shell template, and have everyone fall back to direct Anthropic without code changes.

The walkthrough above gets Claude Code talking to OpenAI through a single gateway. The next question is which gateway. Translation correctness isn’t uniform, and the failure modes differ in ways that show up only in production. Below are five gateways that ship Anthropic-to-OpenAI translation today, scored on the axes that matter for Claude Code.

The 5 axes we score on

Axis	What it measures
1. Translation fidelity	Does the gateway correctly map system prompts, tool-use blocks, and cache headers in both directions?
2. Parallel tool-call survival	Does Claude Code’s five-file-edit pattern round-trip without flattening into sequential calls?
3. Streaming event bridging	Does SSE arrive at the CLI with Anthropic event types intact, or does the gateway buffer-and-batch?
4. Translation latency overhead	How many milliseconds does the translation step add per turn?
5. Loop on correctness	Does the gateway score translation correctness and adjust routes when an upstream regresses, or does the operator chase issues by hand?

1. Future AGI Agent Command Center: Best for closing the loop on translation

Verdict: Future AGI is the only gateway here that captures tool-use correctness per translated call and feeds it back into routing. The other four are static translation layers that depend on the operator to notice when GPT-5 misbehaves after a model update.

What it does for Claude Code on OpenAI:

Translation fidelity uses an intermediate-representation step. Inbound Anthropic requests parse to a typed IR, then re-serialize to OpenAI’s chat/completions or responses shape. System-prompt placement, tool-use conversion, and cache-header handling are explicit decisions in code.
Parallel tool-call survival verified for Claude Code’s bash, file edit, glob, and grep tools against GPT-5 and GPT-4o, including six-file-edit patterns.
Streaming event bridging rebuilds Anthropic content_block_* events from OpenAI’s delta chunks in flight. No buffering.
Translation latency overhead runs 6-9 ms p50 non-streaming and 4 ms per chunk streaming. Optional Protect guardrail adds 65 ms text median time-to-label per arXiv 2510.13351.
Loop on correctness is the wedge. fi.evals scores every translated call; failures cluster by shape; fi.opt.optimizers adjusts per-route system-prompt prefix or shifts traffic until the regression clears.

The honest tradeoff: GPT-5’s caching is automatic, so cache-control hints from Claude Code are dropped on this route. The trace records this. Mixing upstreams in one session is the common pattern.

Where it falls short:

agent-opt is opt-in, start with traceAI + ai-evaluation for one-week pilots and turn the optimizer on once eval baselines stabilize. The loop compounds value over weeks rather than at day one.
Prompt library is opinionated, fewer review-and-collaboration knobs than Portkey’s prompt hub, which keeps the daily workflow tight; teams running large multi-author prompt libraries should preview the workflow before standardizing.

Pricing: Free tier with 100K traces/month. Scale tier starts at $99/month. Enterprise is custom with SOC 2 Type II, HIPAA, GDPR, and CCPA certifications, a BAA, and AWS Marketplace listing.

Score: 5/5 axes.

2. Portkey: Best for hosted gateway with mature RBAC

Verdict: Portkey is the most polished hosted-only product if the priority is virtual-key controls and RBAC on top of working Anthropic-to-OpenAI translation. Routes don’t get optimized back.

What it does for Claude Code on OpenAI:

Translation fidelity is solid for standard tool-use patterns. System-prompt placement and tool-call conversion work end-to-end for GPT-5 and GPT-4o as of May 2026.
Parallel tool-call survival confirmed for OpenAI upstreams. OpenAI is the more reliable non-Anthropic path through Portkey.
Streaming event bridging works. SSE pass-through with correct Anthropic event-type rebuild.
Translation latency overhead runs around 8-12 ms p50 per Portkey’s published numbers.
Loop on correctness isn’t part of the product.

The honest tradeoff: Portkey’s metadata-header model for per-developer attribution needs the Claude Code wrapper to set x-portkey-trace-id and similar headers. Without that wiring, the gateway sees one shared key and developer aggregation is impossible.

Where it falls short:

No optimizer.
Metadata-header model needs client-side wiring; otherwise developer-level attribution collapses.
Pricing escalates above 5M requests/month faster than open-source alternatives.

Pricing: Free tier with 10K requests/day. Scale tier starts at $99/month. Enterprise is custom with SOC 2 Type II.

Score: 4/5 axes (missing: feedback loop on correctness).

3. LiteLLM: Best for self-hosted multi-provider translation

Verdict: LiteLLM is the pick when Claude Code traffic can’t leave your VPC and the security team needs to read every line of the translator. Source-available, Python-native, proxies on your infrastructure.

What it does for Claude Code on OpenAI:

Translation fidelity is broad and source-readable. The anthropic to openai adapter handles system-prompt placement and tool-use conversion in code you can audit. Corner cases are patchable.
Parallel tool-call survival is good on OpenAI. Occasional regressions on edge cases like nested JSON in tool arguments; the community typically lands a fix within days.
Streaming event bridging works on OpenAI.
Translation latency overhead is typically 10-18 ms p50 in our tests. Python is the bottleneck at high RPS.
Loop on correctness isn’t in the product.

The honest tradeoff: Observability is thinner than the hosted offerings. Plan to wire fi.evals or another sink behind LiteLLM. Slicing per-provider tool-use success rate means SQL.

Where it falls short:

No optimizer.
High-RPS deployments need explicit horizontal scaling.
Dashboard is functional, not polished.

Pricing: Open source under MIT. LiteLLM Enterprise tier (SLA + SSO + audit) starts around $250/month for small teams.

Score: 3.5/5 axes (missing: feedback loop, polished native dashboard).

4. OpenRouter: Best for breadth of upstream catalog

Verdict: OpenRouter is the pick when the goal is “every OpenAI variant plus 300 other models from one endpoint” and enterprise governance is secondary.

What it does for Claude Code on OpenAI:

Translation fidelity is correct for the standard Claude Code tool set against GPT-5, GPT-4o, and OpenAI variants. The long tail of community providers is thinner pass-through.
Parallel tool-call survival is upstream-dependent. OpenRouter’s docs flag which OpenAI variants support parallel calls reliably.
Streaming event bridging works for SSE on the major upstreams.
Translation latency overhead sits in the 5-10 ms range.
Loop on correctness isn’t in the product.

The honest tradeoff: OpenRouter is consumer-facing in shape. Chargeback for a 30-developer team is light; SOC 2 evidence and team-scoped audit logs mean custom work.

Where it falls short:

Enterprise governance is light.
No optimizer.
Per-request markup on upstream cost; verify against direct-OpenAI pricing.

Pricing: Pay-as-you-go markup. No free tier for sustained workloads.

Score: 3/5 axes (missing: enterprise governance, feedback loop).

5. Maxim Bifrost: Best for explicit Claude-Code-with-any-provider runtime

Verdict: Maxim Bifrost ships an explicit Claude Code adapter with first-class non-Anthropic support as an open-source runtime tuned for coding-agent workloads, both the strength and the limitation.

What it does for Claude Code on OpenAI:

Translation fidelity is the explicit product goal. Anthropic-protocol inbound maps to OpenAI (plus Bedrock, Vertex, OSS) with parallel tool calls, long file diffs, and multi-turn sessions called out in docs.
Parallel tool-call survival is what Bifrost is benchmarked on. The team publishes per-provider correctness numbers; treat them as directional since they’re vendor-reported.
Streaming event bridging is implemented for OpenAI; event-type rebuild is part of the test suite.
Translation latency overhead is published in the project’s benchmarks; Bifrost is newer and the perf story is still moving.
Loop on correctness is partial, tool-use correctness shows up as a metric but doesn’t yet rewrite routes.

The honest tradeoff: Younger project, smaller community, smaller bug-surface coverage at high RPS. Enterprise controls lag the hosted alternatives.

Where it falls short:

Younger ecosystem.
Enterprise controls less mature than hosted alternatives.
Loop is metric-only, not closed.

Pricing: Open source. Maxim AI’s hosted Bifrost is a separate commercial product; pricing on inquiry.

Score: 3/5 axes (missing: closed loop on correctness, mature enterprise controls).

Capability matrix

Axis	Future AGI	Portkey	LiteLLM	OpenRouter	Bifrost
Translation fidelity (system + tools)	IR-based	Solid	Source-readable	Major upstreams	Coding-agent tuned
Parallel tool-call survival (OpenAI)	Yes	Yes	Yes	Yes (upstream-dependent)	Yes
Streaming event bridging	Yes	Yes	Yes	Yes	Yes
Translation latency p50	6-9 ms	8-12 ms	10-18 ms	5-10 ms	varies
Loop on translation correctness	`fi.opt`	No	No	No	Metric only
Self-host posture	BYOC	BYOC	OSS	Hosted-only	OSS

Decision framework: Choose X if

Choose Future AGI if you want the gateway to learn which upstream is reliable for which turn-shape over time. Pick this when Claude Code on OpenAI is becoming a meaningful line item ($10K+/month).

Choose Portkey if you want a hosted gateway with mature RBAC and virtual keys, and you don’t need the optimizer yet. Pick this when procurement matters and OpenAI is the primary non-Anthropic upstream.

Choose LiteLLM if Claude Code traffic must stay inside your VPC and the security team needs to read every line of the translator. Pick this when source-availability beats hosted polish.

Choose OpenRouter if the constraint is access to a long tail of OpenAI variants and community providers, and enterprise governance is secondary. Pick this for individual developers and small teams.

Choose Maxim Bifrost if the team is explicitly building around coding-agent + multi-provider workloads and wants an open-source runtime tuned for it.

How Future AGI closes the loop on translation correctness

The four other picks treat Anthropic-to-OpenAI translation as a one-shot engineering problem: ship the adapter, fix bugs as they come in. Future AGI treats translation correctness as the input to a feedback loop.

traceAI (Apache 2.0) captures each turn’s span tree, inbound Anthropic request, chosen OpenAI model, translated body, upstream stream, and Anthropic-shaped stream rebuilt back to the CLI. fi.evals scores each turn on task-completion and tool-use correctness; a regression like GPT-5 returning tool-call JSON in a different order shows up as a sudden score drop. Low-scoring sessions cluster by failure shape, and fi.opt.optimizers reacts two ways: rewrite the per-route system-prompt prefix so OpenAI receives a Claude-Code-aware framing, or adjust routing weight so the offending model drops out of the candidate set until reliability recovers. Policies are versioned with automatic rollback. Protect runs alongside at 65 ms text median time-to-label per arXiv 2510.13351, to catch prompt-injection content.

The three building blocks are open source under Apache 2.0: traceAI, ai-evaluation, agent-opt. The hosted Agent Command Center adds the failure-cluster view, live Protect, RBAC, SOC 2 Type II certified, and AWS Marketplace.

What we did not include

Three gateways show up in other 2026 listicles that we deliberately left out:

Helicone. Strong native-Anthropic observability, but Anthropic-to-OpenAI translation depth is thinner than the picks above.
Kong AI Gateway. Solid API-gateway SLA, but Anthropic-inbound-with-OpenAI-upstream translation lags.
Cloudflare AI Gateway. Strong primitives and edge latency, but the Anthropic-protocol-inbound story is still developing as of May 2026.

All three are worth a re-look later in 2026.

Sources

Anthropic Messages API protocol, docs.anthropic.com/en/api/messages
OpenAI chat completions and responses APIs, platform.openai.com/docs/api-reference
Anthropic prompt caching, docs.anthropic.com/en/docs/build-with-claude/prompt-caching
Claude Code documentation, claude.ai/docs/claude-code
Future AGI Agent Command Center, futureagi.com/platform/monitor/command-center
Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (65 ms text / 107 ms image median time-to-label)
Portkey AI gateway, portkey.ai
LiteLLM proxy, github.com/BerriAI/litellm
OpenRouter, openrouter.ai
Maxim Bifrost, github.com/maximhq/bifrost

Frequently asked questions

Can Claude Code actually run on OpenAI GPT-5?

Not natively. Through a translation gateway speaking Anthropic on the inbound side and OpenAI on the outbound, you route Claude Code per-turn to GPT-5, GPT-5-mini, or GPT-4o. The fidelity of that translation is what this post is about.

Which OpenAI model best replaces Claude Sonnet in Claude Code?

GPT-5-mini is the closest cost-performance match for short-turn boilerplate. GPT-5 (non-mini) is closer to Sonnet 4.6 for mid-range work. For long-context architecture above 60K tokens, Claude Opus 4.7 is still safer — route those back to Anthropic.

Will prompt caching still work when Claude Code runs on OpenAI?

Not in the Anthropic sense. OpenAI caches automatically server-side; `cache_control` blocks are dropped on OpenAI-bound requests. Savings show up as automatic cache hits in OpenAI billing. For sessions where explicit caching matters (very long system prompts, repeated 100K+ context), keep those on Anthropic upstream.

Is it safe to send source code through a translation gateway to OpenAI?

For hosted gateways, the data flow is gateway to OpenAI; OpenAI already sees the code. If compliance forbids the code reaching OpenAI at all, that decision is upstream of the gateway choice. If the concern is the gateway itself, use self-hosted LiteLLM or Future AGI BYOC inside your VPC.

How is Future AGI different from Portkey for this workload?

Portkey is a hosted translation and observation layer. Future AGI adds an optimization layer — traces feed back into routing-policy updates so the gateway gets better at picking the right OpenAI variant for each turn-shape over time.

View all

Engineering

How to Reduce Claude Code Token Costs by Up to 90 Percent in 2026

Cut Claude Code token spend with 5 stackable levers: cache_control, MCP-tool compilation, semantic caching, model right-sizing, pruning. Honest 90% read.

NVJK Kartik · Apr 11, 2026

13 min

Engineering

How to Reduce MCP Token Costs for Claude Code at Scale in 2026

Practical 2026 how-to for cutting MCP token spend on Claude Code at fleet scale: five levers, the mcp.json + gateway config, metrics that prove the cut.

Rishav Hada · Mar 24, 2026

12 min

Engineering

How to Connect Claude Code to an MCP Gateway in 2026

Wiring Claude Code to an MCP gateway 2026: mcp.json config, routing rules, per-server auth scoping, verification. Production checklist and gateway picks.

Vrinda Damani · Mar 5, 2026

11 min

Why anyone runs Claude Code on OpenAI models

Prereqs

How the translation layer actually works

Step 1: Set the environment variables

Step 2: Configure the model aliases in the gateway

Step 3: Run a real session and watch the trace

Step 4: Production checklist

The 5 axes we score on

1. Future AGI Agent Command Center: Best for closing the loop on translation

2. Portkey: Best for hosted gateway with mature RBAC

3. LiteLLM: Best for self-hosted multi-provider translation

4. OpenRouter: Best for breadth of upstream catalog

5. Maxim Bifrost: Best for explicit Claude-Code-with-any-provider runtime

Capability matrix

Decision framework: Choose X if

How Future AGI closes the loop on translation correctness

What we did not include

Related reading

Sources

Frequently asked questions