Best AI Gateways for A/B Testing LLM Models and Prompts in 2026

Five AI gateways for A/B testing LLM models and prompts in 2026, scored on shadow traffic, sample-size enforcement, and outcome-attached scoring at the gateway hop.


A team rolls a one-line prompt tweak to the support agent. Offline lift on a 30-example suite reads +0.04 Groundedness. The change ships to 100 percent at 09:00. By 09:05, refusal rate on legitimate refund queries is up 14 points and on-call is rolling back from a Slack thread.

The change wasn’t the problem. The gateway was. It split by percentage, logged a cost number per route, and called that A/B testing. It never shadowed the candidate. It never enforced a sample floor. It never scored the response. The dashboard showed the cheaper arm, not the better one.

Gateway-level A/B testing requires three things most gateways don’t ship: shadow/mirror traffic, statistical sample-size enforcement, and outcome-attached scoring (not just latency and cost). The gateway that gives you all three is the gateway you ship the experiment on. The five below are scored on exactly that. One ships all three. The other four leave one or more pieces as your problem.

TL;DR: the 2026 cohort, ranked on the three things that matter

| Rank | Gateway | Shadow traffic | Sample-size enforcement | Outcome scoring at the hop |
| --- | --- | --- | --- | --- |
| 1 | Future AGI Agent Command Center | Native (shadow + race + mirror) | Declarative floor blocks promote | 50+ EvalTemplate classes per span |
| 2 | Portkey | Native (mirror) | None (BYO notebook) | External (log to Langfuse/FAGI) |
| 3 | Helicone | None | None | Per-request Scores API |
| 4 | LiteLLM | Hook-based | None | External |
| 5 | Cloudflare AI Gateway | None | None | Logs only |

Pick number one if you want the gateway to run the experiment end-to-end. Pick two through five if you accept that your CI pipeline or data team will own the statistics, scoring, and rollback automation the gateway will not.

Why gateway-level A/B testing is different

A classic web A/B splits users between checkout pages and waits for conversion to clear a p-value. LLM A/B at the gateway breaks that playbook on four fronts.

Outcomes are multidimensional. A prompt that lifts task completion 4 points can raise hallucination 2 points, cost 38 percent, and refusal rate 1.2 points. No single conversion number; the dashboard needs every dimension per arm.

Arms are expensive. A naive 50/50 split at 2M req/day for two weeks is roughly $180K of arm overhead at 2026 list prices. Shadow plus bandit allocation cuts that 60 to 80 percent.

Statistical power requires sticky routing. If the same user bounces between arms, each user’s outcomes blend both treatments, the measured difference shrinks toward zero, and the test may never reach significance.

Rollback is measured in seconds. A bad prompt at 100 percent at 09:00 costs thousands by 09:05 unless the gateway flips back automatically. The “merge, wait for CI, redeploy” cycle doesn’t apply.

The gateway is the right place to handle all four. The five below all claim to. Only one ships the three primitives that decide whether the claim survives a post-mortem.

The 7 axes we score on

The first three are the thesis. The next four are the productionization tax.

| Axis | What it measures |
| --- | --- |
| 1. Shadow / mirror traffic | Candidate runs against live inputs without serving its output; both responses scored |
| 2. Sample-size enforcement | Declarative per-arm floor; gateway blocks promotion until met |
| 3. Outcome scoring at the hop | Eval scores attach to the same span as latency and cost |
| 4. Traffic split granularity | Split by %, segment, header, SSO claim, tenant, span attribute |
| 5. Sticky routing | Same user, same arm across the experiment window |
| 6. Rollback under 60 seconds | Human or alert flips routing back to control within one minute |
| 7. Eval-gated promotion + bandits | Winning arm clears thresholds; Thompson / epsilon-greedy supported |

Pass on all seven, the gateway runs the experiment. Pass the first three, the test is defensible. Fail the first three, you have a load balancer with an A/B button.

How we picked the cohort

The pool was every AI gateway whose May 2026 docs advertised A/B testing, experimentation, mirror traffic, or split routing. We removed gateways that ship request-percentage splits without sticky routing (the unit of randomization drifts within a session). We removed gateways with no per-route observability (load balancers wearing an A/B label). The five remaining are below.

1. Future AGI Agent Command Center: best for end-to-end experimentation

Verdict. The only gateway in this list that ships all three load-bearing primitives. The split, the score, the significance check, and the promotion gate live in one product.

Shadow, the load-bearing primitive. Configure the candidate as a shadow arm and the gateway routes a copy of the live request to it without serving the output to the client. Both responses get scored on the same rubric the canary will use. The shadow window is where you validate that offline lift survives the production distribution before any user sees the candidate.
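The shadow pattern itself fits in a few lines. The sketch below is a generic illustration, not Agent Command Center’s implementation: `handle_with_shadow`, the `primary` and `candidate` callables, and the `score` function are all hypothetical stand-ins for model calls and an eval rubric.

```python
import asyncio
from typing import Awaitable, Callable

ModelFn = Callable[[str], Awaitable[str]]


async def handle_with_shadow(
    prompt: str,
    primary: ModelFn,
    candidate: ModelFn,
    score: Callable[[str, str], float],
    log: list,
) -> str:
    """Serve the primary response; run the candidate on a copy of the request."""
    primary_task = asyncio.create_task(primary(prompt))
    shadow_task = asyncio.create_task(candidate(prompt))

    served = await primary_task  # only this output is returned to the caller

    # Assumption: a real gateway would score off the request path. The sketch
    # awaits the shadow inline to stay short and deterministic.
    try:
        shadowed = await shadow_task
        log.append({
            "prompt": prompt,
            "primary_score": score(prompt, served),
            "shadow_score": score(prompt, shadowed),
        })
    except Exception as exc:  # a failing candidate must never break serving
        log.append({"prompt": prompt, "shadow_error": repr(exc)})

    return served
```

The design point is the asymmetry: a candidate failure is logged and swallowed, while the served response never depends on the shadow arm.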

Sample-size enforcement, the wedge. The experiment config declares a per-arm floor: min_samples_per_arm: 1200, min_window_hours: 72, primary_metric: groundedness. Agent Command Center refuses to fire the promotion until both the sample floor and the minimum window clear. The CI/CD pipeline cannot bypass it because the gate lives in the gateway, not the deploy script. This kills the “we promoted at hour 4 because the early signal looked great” class of mistake.
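To make the gate concrete, here is a minimal sketch of a floor check. The three field names and their values come from the config above; the `ExperimentFloor` class and `promotion_allowed` function are hypothetical and say nothing about how the gateway implements the check internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentFloor:
    # Field names and defaults mirror the example config in the text.
    min_samples_per_arm: int = 1200
    min_window_hours: float = 72.0
    primary_metric: str = "groundedness"


def promotion_allowed(
    floor: ExperimentFloor,
    samples_per_arm: dict[str, int],
    elapsed_hours: float,
) -> tuple[bool, list[str]]:
    """Return (allowed, reasons_blocked). Every arm must meet the floor."""
    reasons = []
    if not samples_per_arm:
        reasons.append("no arms reported")
    for arm, n in samples_per_arm.items():
        if n < floor.min_samples_per_arm:
            reasons.append(f"{arm}: {n} < {floor.min_samples_per_arm} samples")
    if elapsed_hours < floor.min_window_hours:
        reasons.append(f"window: {elapsed_hours}h < {floor.min_window_hours}h")
    return (not reasons, reasons)
```

With these defaults, a promotion attempt at hour 4 with 900 candidate samples is blocked on both the sample floor and the window, and the returned reasons say which.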

Outcome scoring. ai-evaluation (Apache 2.0) ships 50+ EvalTemplate classes covering task completion, faithfulness, tool-use, structured-output, hallucination, groundedness, context relevance, and instruction-following, plus unlimited custom evaluators and self-improving evaluators that learn from live traces. The Future AGI Platform runs classifier-backed evals at lower per-eval cost than Galileo Luna-2. Scores attach to the same span as cost and latency, so the per-arm table is one query.

Split and stickiness. Six routing strategies (round robin, weighted, least latency, cost optimized, adaptive, race) plus six reliability features (failover, retries, circuit breaking, model fallbacks, complexity-based, provider lock). Split by percentage, fi.attributes.user.id, tenant ID, span attribute, or header. Compose dimensions inside the config without writing routing code. Sticky routing hashes the configured key to a 64-bit space; the hash function is exposed for offline reproduction.
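Hash-based stickiness can be sketched as follows. This is an assumption-laden illustration: the text does not name the hash function the gateway uses, so SHA-256 truncated to 64 bits stands in for it, and `assign_arm`, `experiment_id`, and the weight format are hypothetical.

```python
import hashlib


def assign_arm(sticky_key: str, experiment_id: str, weights: dict[str, float]) -> str:
    """Deterministically map a stable identity (user, tenant) to an arm."""
    if not weights:
        raise ValueError("at least one arm is required")

    # Salting with the experiment ID keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment_id}:{sticky_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # 64-bit unsigned integer
    point = bucket / 2**64 * sum(weights.values())  # position on the weight line

    cumulative = 0.0
    for arm, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return arm
    return list(weights)[-1]  # floating-point edge case: fall back to the last arm
```

Because the function is pure, the same key and experiment ID always yield the same arm, which is what lets an assignment be reproduced offline from logs.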

Significance, promotion, rollback. Per-arm sample size, mean, 95% CI, and frequentist p-value refresh every five minutes; Bayesian credible intervals are also available. The promotion gate is declarative: clear the primary lift, hold every guardrail in bounds, hit the sample floor, hit the window. Routing policies are immutable; revert is one click, typically under 20 seconds. Auto-rollback fires when a guardrail stays in breach for 15 consecutive minutes.
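For readers who want to see what a per-arm comparison involves, the sketch below computes a difference in means with a normal-approximation confidence interval and a two-sided p-value. It is an unpaired z-test chosen for brevity; the text does not say which test the gateway runs, and `compare_arms` is a hypothetical helper.

```python
from statistics import NormalDist, mean, variance


def compare_arms(control: list[float], candidate: list[float], confidence: float = 0.95) -> dict:
    """Candidate-minus-control difference with a CI and two-sided p-value.

    Uses the normal approximation, which is reasonable once each arm holds
    hundreds of samples. Each arm needs at least two scores.
    """
    diff = mean(candidate) - mean(control)
    se = (variance(control) / len(control) + variance(candidate) / len(candidate)) ** 0.5
    if se == 0:
        raise ValueError("both arms have zero variance; no test is possible")

    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return {
        "diff": diff,
        "ci": (diff - z_crit * se, diff + z_crit * se),
        "p_value": p_value,
    }
```

A promotion rule built on this would compare `diff` against the required lift, which is a separate check from `p_value` falling below alpha.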

Bandits and the loop. Epsilon-greedy and Thompson sampling with live fi.evals as the reward signal. No manual reward instrumentation. Losing arms become training data for agent-opt (six optimizers: ProTeGi, GEPA, MetaPrompt, BayesianSearch, RandomSearch, PromptWizard). Error Feed auto-clusters failing per-arm traces via HDBSCAN and writes the immediate_fix, so regressions surface like exceptions. Inline guardrails run through Future AGI Protect at 65 ms p50 text and 107 ms p50 image (arXiv 2510.13351).

Performance and deployment. Roughly 29K req/s, P99 21 ms with guardrails on, on t3.xlarge. Cloud at gateway.futureagi.com/v1 (OpenAI SDK drop-in), self-hosted Go binary, Docker, Kubernetes, air-gapped. Apache 2.0 core.

Where it falls short. Heavier than a team needs for a one-off “is GPT-5.2 better than Claude Sonnet 4.6” check; Portkey is faster to spin up for a single comparison. Contextual bandits are in beta as of May 2026.

Pricing. Free: 100K traces, one active experiment. Scale from $99 a month. Enterprise custom with SOC 2 Type II, HIPAA, BAA, AWS Marketplace.

Score: 7/7 axes.

2. Portkey: best for hosted split UX with mirror traffic

Verdict. Portkey ships the cleanest traffic-split primitive of any hosted gateway. Configs are versioned, the dashboard is polished, virtual keys make per-tenant splits straightforward, and mirror traffic is a first-class config strategy. The right pick when the experimentation team is comfortable owning sample-size math and outcome capture in tools that live next to the gateway.

Split and shadow. Strategies include loadbalance, fallback, single, and mirror. Mirror sends the request to a primary plus mirror targets, returns the primary response to the client, and logs the mirror response separately. That is shadow in everything but name; the mirror row shares trace_id with the primary, so the per-arm join is straightforward. Split by percentage and metadata header; per-segment splits use conditional config (verbose past two segments).

Sample-size enforcement. None. The gateway promotes a config the moment you publish it. Teams enforce the floor in CI or in the deployment pipeline.

Outcome scoring + significance. Not native. Every request gets logged with the arm tag; per-arm outcomes require a separate eval pipeline. Teams typically wire Future AGI traceAI or Langfuse behind Portkey and join on trace_id. The Feedback API accepts post-hoc scores; the scoring engine is on you. P-value and CI live in a notebook.

Sticky routing. Through the Portkey-Sticky-Session-Id header.

Rollback. Edit live config in UI or API; typical 5 to 15 seconds. No eval-gated rollback.

Bandits. None as of May 2026. Roadmap mentions a bandit router for late 2026.

Where it falls short. No native eval pipeline (the dashboard tells you which arm was cheaper, not which was better). No sample-size enforcement. No bandits. Procurement note: verify the Palo Alto Networks acquisition timeline before signing multi-year.

Pricing. Free: 10K requests a day. Scale from $99 a month. Enterprise custom with SOC 2 Type II.

Score: 3/7. Strong on split and mirror; weak on enforcement, native scoring, significance, bandits.

3. Helicone: best per-request scoring API for proxy-mode experiments

Verdict. Helicone treats the per-request scoring API as a first-class primitive. The proxy logs every request with an arm tag and exposes properties and feedback APIs for attaching scores, metadata, and prompt template versions. What it doesn’t ship is shadow traffic or routing-policy-driven experimentation; the proxy is observability-first.

Split. Proxy-style, by user-property header (Helicone-Property-User, Helicone-Property-Cohort) and prompt template version. No native percentage-based split. The application picks the model per request; Helicone logs the choice.

Shadow + sample floor. Neither. The proxy logs what the app sends and doesn’t gate deploys.

Outcome scoring. The wedge: POST a score keyed on request_id and Helicone joins it to the log row. Scores group by prompt template version and user property, which is the per-arm view. Source can be LLM judge, deterministic check, or human review. Experiments view shows per-arm sample size and mean; SQL exposes the table for p-value and CI work in a notebook.

Rollback. One-click prompt template revert. The gateway doesn’t block deploys.

Bandits, sticky routing. Neither. Helicone isn’t a router; pair with Portkey, LiteLLM, or Future AGI if model-level A/B requires gateway-side split.

Pricing. Free 10K requests a month; Pro from $20 a month; Team and Enterprise custom.

Score: 2/7. Strong on the scoring API; weak everywhere routing or experiment automation matters.

4. LiteLLM: best self-hosted Python-native proxy

Verdict. LiteLLM fits when A/B traffic cannot leave the VPC and security wants source-availability over polish. It ships the routing primitive (percentage splits, weighted load balancing, fallback chains, retries with backoff) and leaves analytics to whatever stack you already run.

Split. Weighted model groups in config.yaml. Virtual keys for per-team and per-user routing; a small Python pre-call hook for per-header routing. Strategies include usage-based-routing-v2, least-busy, and latency-based. Sticky-by-user-identity is an extension teams write.

Shadow. Not config-declarative. Teams implement shadow with a Python hook that duplicates the request to a candidate model and discards the response.

Sample-size floor + outcome scoring + significance. None native. Teams run Future AGI traceAI, Langfuse, or Helicone behind LiteLLM and join on litellm_call_id. The proxy logs spend and latency; experiment-grade scores live elsewhere.

Rollback. Edit the YAML and reload. Typical rollback 30 seconds.

Bandits. None in the statistical sense. least-busy and latency-based optimize for latency, not for an eval reward.

Where it falls short. Experimentation story is the thinnest of the five. Plan to bolt three other tools on top. Pin commits after the March 24, 2026 PyPI compromise.

Pricing. MIT for the proxy. Enterprise (SLA, SSO, audit, SOC 2 Type II) starts around $250 a month.

Score: 1/7. Strong as a routing proxy; thin as an experimentation platform.

5. Cloudflare AI Gateway: best for teams already on Cloudflare Workers

Verdict. A free-to-start observability and caching layer in front of model providers. Fits when the workload already runs on Cloudflare Workers and the team wants one dashboard for logs, caching, and rate limits. Experimentation is not the product surface.

Split. Universal Endpoint accepts an array of provider configs and falls back through them on error. No percentage split, no header split; the split has to live in the Worker that calls the gateway.

Shadow, sample floor, outcome scoring, significance, bandits. None of these five capabilities is available. The dashboard shows request count, cost, latency, and cached vs uncached. No scoring API, no eval pipeline, no routing-policy versioning.

Sticky routing. Application-layer (Worker code keys the decision).

Rollback. Edit the Worker and redeploy; edge propagation under 30 seconds.

Where it falls short. Not an experimentation platform. Honest framing: a caching, logging, rate-limiting front door for model providers, useful for cost control and basic observability, not for A/B testing. Pair with a real routing layer (LiteLLM in a Worker, or Agent Command Center / Portkey above) if you need split, score, or shadow.

Pricing. Free tier covers most teams; paid tiers add longer log retention and higher rate limits.

Score: 0/7 for A/B testing. Useful for adjacent jobs, not the one this list scores on.

Capability matrix

| Axis | Future AGI | Portkey | Helicone | LiteLLM | Cloudflare AI Gateway |
| --- | --- | --- | --- | --- | --- |
| Shadow / mirror traffic | Native (shadow + race + mirror) | Native (mirror) | None | Hook-based | None |
| Sample-size enforcement | Declarative floor blocks promote | None | None | None | None |
| Outcome scoring at the hop | 50+ EvalTemplate classes | External (BYO) | Scores API | External | None |
| Traffic split granularity | %, segment, header, tenant, attr | %, header | Property-based | %, model group, team | App-layer |
| Sticky routing | Configurable key | Header | App-layer | Extension | App-layer |
| Rollback < 60s | ~20s | 5-15s | 1-click prompt | ~30s | ~30s |
| Eval-gated promotion + bandits | Eval gate + Thompson + epsilon-greedy | None | None | None | None |
| Score | 7/7 | 3/7 | 2/7 | 1/7 | 0/7 |

Decision framework: choose X if

Choose Future AGI Agent Command Center if you want the gateway to run the experiment end-to-end: shadow, score, enforce the sample floor, gate the promotion, run bandits. Pick this when experimentation cadence is the constraint on shipping LLM quality and arm-overhead cost ($100K+ per cycle) makes bandits and shadow worth the investment.

Choose Portkey if you want a hosted gateway with the cleanest split UX and you accept that significance, sample-size enforcement, and scoring live in tools next to the gateway. The mirror strategy buys you shadow; the rest is on the team.

Choose Helicone if the experiment is logging-first and the team has discipline about posting scores through the Scores API. Pair with a router for model-level A/B between providers.

Choose LiteLLM if traffic cannot leave the VPC, the team is Python-native, and the routing decision matters more than the experimentation analytics. Plan to bolt evals, significance, and shadow on top.

Choose Cloudflare AI Gateway for cost telemetry, caching, and rate limits in a Cloudflare-native stack. Do not pick it as an A/B testing surface.

Common mistakes the gateway should kill

| Mistake | Fix |
| --- | --- |
| Splitting per-request instead of per-user (signal mixes both arms) | Hash a stable identity (user_id, tenant_id, SSO subject) and route the hash |
| Unequal sample sizes (one arm wide CI) | Equalize the split or use a bandit that reports CI per arm |
| Reading eval before the window (easy queries route first) | Enforce minimum window (24-72h) and sample floor at the gateway |
| Eval gate only on primary (PII leak slips through) | Gate must require primary and all guardrail metrics to clear |
| No rollback automation | Auto-rollback on guardrail breach; humans review after |
| Stickiness on IP address (NAT clusters users) | Stickiness is a user identity, not a network identity |
| Floating the judge mid-test | Pin and version the judge for the experiment window |

A worked example: 600K req/day, shadow first, bandit second

A SaaS support workload runs 600K requests a day through claude-sonnet-4-6. Task-completion sits at 76. The team wants a +3 lift before promoting.

Days 1-2, shadow. The candidate enters as a shadow arm; the gateway mirrors live requests and scores both responses on Groundedness, Completeness, tool-use accuracy, and refusal correctness. Offline lift read +4.1; shadow lift on production distribution reads +3.6 (95% CI [+2.9, +4.3]). Offline slightly overestimated, but the production point estimate clears the +3 bar; the CI’s lower bound of +2.9 sits just under it, which is why the canary still has to confirm the lift.

Day 4, canary. Candidate enters at 5 percent of cohort traffic, keyed on tenant ID. Eval-gated rollback wired on refusal rate (max +1.5), PII leak (zero, Protect gate), latency p95, and cost. Sample floor: 1,200 paired examples per arm; minimum window 72 hours.

Hours 12-48, hold. CI tightens around 78.6 vs 75.9 (+2.7, p=0.02). Below the +3 threshold; the gateway refuses to promote.

Hour 60, signal. A tenant cluster hits a PII guardrail at 0.15% vs 0.02% on control. Below the 0.5% breach threshold so no auto-rollback, but Error Feed flags the pattern and the optimizer ingests the 47 flagged spans, proposing a second-iteration prompt with an explicit PII-redaction clause.

Days 5-9, multi-arm. Treatment-B enters. Thompson-sampling bandit starts at 60/20/20 and shifts traffic on the live fi.evals reward signal. By day 9 the bandit has 78 percent on treatment-B. Per-arm scores: control 75.9, treatment-A 78.6, treatment-B 80.1. All guardrails in bound. Sample floor and window cleared; the promotion gate fires.

Net result over two weeks: task-completion 76 to 80.1 (+4.1); refusal rate +0.4; PII leak -0.01; cost unchanged. By month three the team has run 14 experiments with the optimizer authoring 9. The cadence is no longer human-bound.

The point is not the numbers; it is the gate at hour 48 where the gateway said no when the team would have said yes. That is what gateway-level A/B testing is for.

Where Future AGI fits in the loop

The other four gateways treat A/B testing as a routing primitive: split traffic, log the result, hand the rest to the team. Future AGI Agent Command Center treats it as the visible part of a loop that compounds across six stages.

  1. Declare. An experiment is a declarative config: arms, primary metric via fi.evals, guardrail metrics, stickiness key, shadow window, sample-size floor, traffic split, promotion threshold, rollback condition. Versioned in Git, applied through Agent Command Center.
  2. Shadow. Before any user sees the candidate, the gateway mirrors live traffic and scores both responses against the same rubric the canary will use.
  3. Trace. Every request produces a traceAI span with arm tag, model, prompt version, inputs, outputs, tool calls, latency, and cost. The span is the unit of join across the loop.
  4. Evaluate. ai-evaluation scores every span and attaches scores as span attributes. Protect runs on the request path so a PII-leaking arm fails its guardrail at request time, not at end-of-day analysis.
  5. Decide. Per-arm CI and p-value refresh every five minutes. Promotion fires only when primary lift, guardrails, sample floor, and window all clear.
  6. Optimize. Losing arms become training data for agent-opt. The six optimizers ingest failures and propose rewrites that become next-experiment arms. The team’s role shifts from drafting variants to reviewing optimizer proposals.

Three Apache 2.0 building blocks (traceAI, ai-evaluation, agent-opt). Hosted Agent Command Center adds the experiment view, shadow primitives, Protect, bandit routers, RBAC, SOC 2 Type II, HIPAA, and AWS Marketplace procurement.

Ready to ship an A/B that survives a post-mortem? Drop in the OpenAI SDK with base_url="https://gateway.futureagi.com/v1", declare min_samples_per_arm and shadow_window_hours, score every span with Evaluator.evaluate, and let the gateway gate the promotion. The first experiment is the one you would have shipped manually; the tenth is the one the optimizer drafted.

Frequently asked questions

What makes A/B testing at an AI gateway different from a feature-flag A/B test?
A feature flag tests a deterministic branch. An LLM A/B tests a stochastic system where outputs vary per call, costs vary by token, and quality is a multi-dimensional rubric (faithfulness, tool-use, refusal, format) rather than a single conversion event. The gateway has to do three things a flag service never has to: shadow the candidate against live traffic without serving its output, enforce a per-arm sample-size floor before declaring a winner, and attach outcome scores to every request so the per-arm table is one query. Anything less is a load balancer with a dashboard.
Why is shadow traffic the gating capability for gateway-level A/B testing?
Offline evaluation tells you a candidate prompt beats the incumbent on a fixed dataset. Production tells you whether the dataset still represents production. Shadow traffic routes a copy of the live request to the candidate, scores both responses, and never serves the candidate output. The user-facing risk is zero; the data is real. A gateway that ships shadow as a first-class primitive lets you validate offline lift on production distribution before any user sees the change. Without it you are jumping straight from notebook to canary, which is how a +0.04 offline lift becomes a 14-point refusal-rate regression at 09:05.
What sample size should the gateway enforce before promoting an arm?
For a continuous rubric (a 0-1 Groundedness, a 1-5 helpfulness), the working formula is n_per_arm = 16 * sigma_squared / MDE_squared at alpha 0.05 and 80 percent power, where MDE is the minimum detectable effect. Sigma 0.18 with MDE 0.04 needs about 324 paired examples per arm. For a binary metric (PII leak, format adherence) at a 1 percent base rate and an MDE of 0.5 points absolute, sigma_squared is p(1 - p) = 0.0099 and the floor jumps to roughly 6,300 per arm. The gateway should not let a promotion fire below the configured floor; the floor should be declarative in the experiment config, not implicit in the CI/CD pipeline.
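The arithmetic is easy to check. The constant 16 is the usual rounding of 2 * (1.96 + 0.84)^2, which is about 15.7, for a two-sided alpha of 0.05 and 80 percent power. The helper names below are hypothetical; the sketch reproduces the 324 figure and gives 6,336 for the binary case.

```python
import math


def n_per_arm_continuous(sigma: float, mde: float) -> int:
    """n = 16 * sigma^2 / MDE^2 (two-sided alpha 0.05, 80 percent power)."""
    # round() absorbs floating-point noise before taking the ceiling.
    return math.ceil(round(16 * sigma**2 / mde**2, 6))


def n_per_arm_binary(base_rate: float, mde_abs: float) -> int:
    """Same rule of thumb with the Bernoulli variance p * (1 - p)."""
    return math.ceil(round(16 * base_rate * (1 - base_rate) / mde_abs**2, 6))


print(n_per_arm_continuous(0.18, 0.04))  # 324
print(n_per_arm_binary(0.01, 0.005))     # 6336
```

Halving the MDE quadruples the floor, which is why rare-event guardrails such as PII leaks need far more traffic than a continuous quality rubric.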
Should I use fixed splits or multi-arm bandits at the gateway?
Fixed splits for two-arm decisions you want to defend with a clean point estimate of B minus A. Bandits when you have three or more arms, the regret of serving a worse arm is bounded, and you cannot wait for fixed-N. Thompson Sampling on a Beta-Bernoulli model is the default for binary metrics; Gaussian Thompson works for continuous rubrics. Bandits give a winner faster but a less clean effect-size estimate; fixed A/B gives a defensible delta but burns more arm overhead. Most teams should ship the first three experiments as fixed A/B, then move to bandits once the eval metric is calibrated and trusted.
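A Beta-Bernoulli Thompson sampler is small enough to sketch in full. The class below is a generic textbook version with a uniform Beta(1, 1) prior, not any gateway’s router; `BetaBernoulliBandit` and its method names are hypothetical.

```python
import random
from typing import Optional


class BetaBernoulliBandit:
    """Thompson sampling for a binary reward (pass/fail eval) per arm."""

    def __init__(self, arms: list[str], seed: Optional[int] = None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per arm, stored as [alpha, beta].
        self.posterior = {arm: [1, 1] for arm in arms}

    def choose(self) -> str:
        """Draw one sample from each arm's posterior and route to the largest."""
        samples = {
            arm: self.rng.betavariate(alpha, beta)
            for arm, (alpha, beta) in self.posterior.items()
        }
        return max(samples, key=samples.get)

    def update(self, arm: str, success: bool) -> None:
        """Record one scored response: a pass bumps alpha, a fail bumps beta."""
        self.posterior[arm][0 if success else 1] += 1
```

Each request calls `choose()`, and each scored response calls `update()`. Traffic drifts toward the better arm as its posterior sharpens, which is also why the losing arm ends up with fewer samples and a wider interval.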
Can I A/B test models from different providers on the same workload?
Yes, but the request and response contracts have to match across arms. A prompt that depends on Anthropic tool-use schema will not return the same shape from OpenAI without a translation layer. Run a calibration pass: same input, both arms, diff the structured fields the downstream consumer reads. If the diff is non-empty, fix the contract before the experiment starts. The OpenAI-compatible gateways in this list (Agent Command Center, Portkey, LiteLLM) handle the surface translation; you still own the schema.
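The calibration pass can be as simple as a shape diff over the fields the consumer reads. The sketch below compares presence and type rather than values, since values legitimately differ between stochastic arms; `contract_diff` and the sample field names are hypothetical.

```python
from typing import Any


def contract_diff(
    resp_a: dict[str, Any],
    resp_b: dict[str, Any],
    fields: list[str],
) -> dict[str, tuple[str, str]]:
    """Report fields whose presence or type differs between two arms."""
    missing = object()  # sentinel, so an explicit None still counts as present
    diff = {}
    for field in fields:
        a = resp_a.get(field, missing)
        b = resp_b.get(field, missing)
        shape_a = "missing" if a is missing else type(a).__name__
        shape_b = "missing" if b is missing else type(b).__name__
        if shape_a != shape_b:
            diff[field] = (shape_a, shape_b)
    return diff


arm_a = {"tool": "refund", "args": {"order_id": 1}, "text": "ok"}
arm_b = {"tool": "refund", "args": '{"order_id": 1}'}
print(contract_diff(arm_a, arm_b, ["tool", "args", "text"]))
```

Here the second arm returns `args` as a JSON string instead of an object and omits `text`, so the diff is non-empty and the contract needs fixing before the experiment starts.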
What is the safe production rollout sequence after the gateway A/B?
Shadow, then canary, then ramp. Shadow runs the candidate against live inputs without serving its output; both responses get scored on the same rubric the offline test used. Canary serves the candidate to one to five percent of cohort traffic with eval-gated rollback wired on per-rubric rolling pass rates. Ramp grows the cohort (5 to 25 to 50 to 100 percent) only if the canary deltas stay inside the offline confidence interval. Every stage uses the same rubric definition. Attach the rubric to live spans (Future AGI traceAI EvalTag, OTel attribute) so the score writes as a span attribute next to latency.
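The ramp rule reduces to one guard. In the sketch below the step list comes from the sequence above; `next_traffic_percent` is a hypothetical helper, and returning 0 (all traffic back to control) when the live delta leaves the offline interval is an assumption about what rollback means here.

```python
RAMP_STEPS = [5, 25, 50, 100]  # percent of cohort traffic, from the sequence above


def next_traffic_percent(
    current: int,
    live_delta: float,
    offline_ci: tuple[float, float],
) -> int:
    """Advance one ramp step only while the live delta stays inside the offline CI."""
    low, high = offline_ci
    if not (low <= live_delta <= high):
        return 0  # assumption: leaving the interval sends all traffic to control
    later = [step for step in RAMP_STEPS if step > current]
    return later[0] if later else current
```

A delta above the interval also halts the ramp in this sketch, following the rule as stated: a result far better than offline predicted is a reason to inspect the rubric before giving the candidate more traffic.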
How does Future AGI Agent Command Center fit prompt and model A/B testing?
Future AGI ships the three primitives most gateways do not: shadow traffic with paired scoring, a declarative per-arm sample-size floor that blocks promotion until met, and outcome scoring from 50+ EvalTemplate classes attached to the same span as cost and latency. Bandits (epsilon-greedy + Thompson) and eval-gated promotion sit on top. The eval stack is Apache 2.0 (ai-evaluation, traceAI, agent-opt); the hosted Agent Command Center adds the experiment view, live Protect guardrails at 65 ms p50 text, and SOC 2 Type II procurement. The optimizer loop turns losing arms into next-iteration candidates, so the experimentation cadence stops being limited by humans drafting variants.