Best AI Gateways for A/B Testing LLM Models and Prompts in 2026
Five AI gateways for A/B testing LLM models and prompts in 2026, scored on shadow traffic, sample-size enforcement, and outcome-attached scoring at the gateway hop.
A team rolls a one-line prompt tweak to the support agent. Offline lift on a 30-example suite reads +0.04 Groundedness. The change ships to 100 percent at 09:00. By 09:05, refusal rate on legitimate refund queries is up 14 points and on-call is rolling back from a Slack thread.
The change wasn’t the problem. The gateway was. It split by percentage, logged a cost number per route, and called that A/B testing. It never shadowed the candidate. It never enforced a sample floor. It never scored the response. The dashboard showed the cheaper arm, not the better one.
Gateway-level A/B testing requires three things most gateways don’t ship: shadow/mirror traffic, statistical sample-size enforcement, and outcome-attached scoring (not just latency and cost). The gateway that gives you all three is the gateway you ship the experiment on. The five below are scored on exactly that. One ships all three. The other four ship the split and leave one or more pieces as your problem.
TL;DR: the 2026 cohort, ranked on the three things that matter
| Rank | Gateway | Shadow traffic | Sample-size enforcement | Outcome scoring at the hop |
|---|---|---|---|---|
| 1 | Future AGI Agent Command Center | Native (shadow + race + mirror) | Declarative floor blocks promote | 50+ EvalTemplate classes per span |
| 2 | Portkey | Native (mirror) | None (BYO notebook) | External (log to Langfuse/FAGI) |
| 3 | Helicone | None | None | Per-request Scores API |
| 4 | LiteLLM | Hook-based | None | External |
| 5 | Cloudflare AI Gateway | None | None | Logs only |
Pick number one if you want the gateway to run the experiment end-to-end. Pick two through five if you accept that your CI pipeline or data team will own the statistics, scoring, and rollback automation the gateway will not.
Why gateway-level A/B testing is different
A classic web A/B splits users between checkout pages and waits for conversion to clear a p-value. LLM A/B at the gateway breaks that playbook on four fronts.
Outcomes are multidimensional. A prompt that lifts task completion 4 points can raise hallucination 2 points, cost 38 percent, and refusal rate 1.2 points. No single conversion number; the dashboard needs every dimension per arm.
Arms are expensive. A naive 50/50 split at 2M req/day for two weeks is roughly $180K of arm overhead at 2026 list prices. Shadow plus bandit allocation cuts that 60 to 80 percent.
Statistical power requires sticky routing. If the same user bounces between arms, each user’s outcomes blend both treatments, the measured difference shrinks toward zero, and the test may never reach significance.
Rollback is measured in seconds. A bad prompt at 100 percent at 09:00 costs thousands by 09:05 unless the gateway flips back automatically. The “merge, wait for CI, redeploy” cycle doesn’t apply.
The gateway is the right place to handle all four. The five below all claim to. Only one ships the three primitives that decide whether the claim survives a post-mortem.
The 7 axes we score on
The first three are the thesis. The next four are the productionization tax.
| Axis | What it measures |
|---|---|
| 1. Shadow / mirror traffic | Candidate runs against live inputs without serving its output; both responses scored |
| 2. Sample-size enforcement | Declarative per-arm floor; gateway blocks promotion until met |
| 3. Outcome scoring at the hop | Eval scores attach to the same span as latency and cost |
| 4. Traffic split granularity | Split by %, segment, header, SSO claim, tenant, span attribute |
| 5. Sticky routing | Same user, same arm across the experiment window |
| 6. Rollback under 60 seconds | Human or alert flips routing back to control within one minute |
| 7. Eval-gated promotion + bandits | Winning arm clears thresholds; Thompson / epsilon-greedy supported |
Pass on all seven, the gateway runs the experiment. Pass the first three, the test is defensible. Fail the first three, you have a load balancer with an A/B button.
How we picked the cohort
The pool was every AI gateway whose May 2026 docs advertised A/B testing, experimentation, mirror traffic, or split routing. We removed gateways that ship request-percentage splits without sticky routing (the unit of randomization drifts within a session). We removed gateways with no per-route observability (load balancers wearing an A/B label). The five remaining are below.
1. Future AGI Agent Command Center: best for end-to-end experimentation
Verdict. The only gateway in this list that ships all three load-bearing primitives. The split, the score, the significance check, and the promotion gate live in one product.
Shadow, the load-bearing primitive. Configure the candidate as a shadow arm and the gateway routes a copy of the live request to it without serving the output to the client. Both responses get scored on the same rubric the canary will use. The shadow window is where you validate that offline lift survives the production distribution before any user sees the candidate.
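The shadow pattern described above can be sketched in a few lines. This is a minimal illustration, not Agent Command Center's implementation: `handle_with_shadow`, the model callables, the `score` function, and the in-memory `sink` are all hypothetical names, and a real gateway would write to a trace store rather than a list.

```python
import asyncio
from typing import Awaitable, Callable

ModelFn = Callable[[str], Awaitable[str]]   # hypothetical: prompt in, completion out
ScoreFn = Callable[[str, str], float]       # hypothetical: (prompt, response) -> score


async def _score_shadow(prompt, served, candidate, score, sink):
    """Call the candidate, score both responses, and append one paired record."""
    record = {"prompt": prompt, "control_score": score(prompt, served)}
    try:
        shadow = await candidate(prompt)
        record["shadow_score"] = score(prompt, shadow)
    except Exception as exc:  # a failing candidate must never reach the client
        record["shadow_error"] = repr(exc)
    sink.append(record)


async def handle_with_shadow(prompt, primary, candidate, score, sink):
    """Return the primary response plus the background task scoring the shadow arm."""
    served = await primary(prompt)
    # The candidate runs off the client path; only `served` is ever returned to the user.
    task = asyncio.create_task(_score_shadow(prompt, served, candidate, score, sink))
    return served, task
```

The caller keeps the returned task so the event loop does not garbage-collect it before the shadow record is written. The design point is that candidate errors and candidate latency land in the record, never in the user's response.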
Sample-size enforcement, the wedge. The experiment config declares a per-arm floor: min_samples_per_arm: 1200, min_window_hours: 72, primary_metric: groundedness. Agent Command Center refuses to fire the promotion until both clear. The CI/CD pipeline cannot bypass it because the gate lives in the gateway, not the deploy script. This kills “we promoted at hour 4 because the early signal looked great” mistakes.
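As a rough sketch of what such a gate checks, assuming the two floor fields named above (`min_samples_per_arm`, `min_window_hours`): the `Floor` class and `promotion_blockers` function are hypothetical names for illustration, not the product's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Floor:
    min_samples_per_arm: int = 1200
    min_window_hours: float = 72.0


def promotion_blockers(floor: Floor, samples_per_arm: dict, window_hours: float) -> list:
    """Return human-readable reasons promotion is blocked; an empty list means clear."""
    blockers = []
    for arm, n in sorted(samples_per_arm.items()):
        if n < floor.min_samples_per_arm:
            blockers.append(f"{arm}: {n}/{floor.min_samples_per_arm} samples")
    if window_hours < floor.min_window_hours:
        blockers.append(f"window: {window_hours}h/{floor.min_window_hours}h")
    return blockers
```

Returning reasons rather than a bare boolean means a blocked promotion at hour 4 explains itself: one arm is short of samples and the window is short of 72 hours, no matter how good the early signal looks.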
Outcome scoring. ai-evaluation (Apache 2.0) ships 50+ EvalTemplate classes covering task completion, faithfulness, tool-use, structured-output, hallucination, groundedness, context relevance, and instruction-following, plus unlimited custom evaluators and self-improving evaluators that learn from live traces. The Future AGI Platform runs classifier-backed evals at lower per-eval cost than Galileo Luna-2. Scores attach to the same span as cost and latency, so the per-arm table is one query.
Split and stickiness. Six routing strategies (round robin, weighted, least latency, cost optimized, adaptive, race) plus six reliability features (failover, retries, circuit breaking, model fallbacks, complexity-based, provider lock). Split by percentage, fi.attributes.user.id, tenant ID, span attribute, or header. Compose dimensions inside the config without writing routing code. Sticky routing hashes the configured key to a 64-bit space; the hash function is exposed for offline reproduction.
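Sticky assignment by hashing a stable key can be sketched as follows. The choice of SHA-256 and the way the 64-bit value maps onto weights are assumptions made for this illustration; the gateway's actual hash function is whatever its docs expose, and `bucket64` and `assign_arm` are hypothetical names.

```python
import hashlib


def bucket64(experiment_id: str, sticky_key: str) -> int:
    """Map (experiment, key) to a stable integer in the 64-bit space."""
    digest = hashlib.sha256(f"{experiment_id}:{sticky_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign_arm(experiment_id: str, sticky_key: str, weights: dict) -> str:
    """Pick an arm from relative weights; the same key always lands on the same arm."""
    if not weights:
        raise ValueError("at least one arm is required")
    point = bucket64(experiment_id, sticky_key) / 2**64  # uniform in [0, 1)
    total = sum(weights.values())
    cumulative = 0.0
    for arm, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return arm
    return arm  # guard against float rounding: fall back to the last arm
```

Salting the hash with the experiment ID keeps assignments independent across experiments, so a user in the treatment arm of one test is not systematically in the treatment arm of the next. Because the function is pure, an analyst can recompute any user's arm offline from the key alone.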
Significance, promotion, rollback. Per-arm sample size, mean, 95% CI, and frequentist p-value refresh every five minutes; Bayesian credible intervals available. The promotion gate is declarative: clear the primary lift, hold every guardrail in bounds, hit the sample floor, hit the window. Routing policies are immutable; revert is one click, typically under 20 seconds. Auto-rollback on guardrail breach for 15 consecutive minutes.
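For readers who want to reproduce a per-arm readout offline, the frequentist part can be approximated with the standard library. This is a sketch using a large-sample normal approximation for the difference in means; it is not a description of how the gateway computes its numbers, and `compare_arms` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def compare_arms(control: list, treatment: list, confidence: float = 0.95) -> dict:
    """Lift, confidence interval, and two-sided p-value for treatment minus control."""
    lift = mean(treatment) - mean(control)
    # Standard error of the difference, allowing unequal variances per arm.
    se = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    if se == 0:
        raise ValueError("both arms have zero variance; no test is possible")
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(lift / se)))
    return {"lift": lift, "ci": (lift - z_crit * se, lift + z_crit * se), "p_value": p_value}
```

The normal approximation is reasonable at the sample floors discussed here (over a thousand scored requests per arm); for small arms a t-distribution or a bootstrap would be the safer choice.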
Bandits and the loop. Epsilon-greedy and Thompson sampling with live fi.evals as the reward signal. No manual reward instrumentation. Losing arms become training data for agent-opt (six optimizers: ProTeGi, GEPA, MetaPrompt, BayesianSearch, RandomSearch, PromptWizard). Error Feed auto-clusters failing per-arm traces via HDBSCAN and writes the immediate_fix, so regressions surface like exceptions. Inline guardrails run through Future AGI Protect at 65 ms p50 text and 107 ms p50 image (arXiv 2510.13351).
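To make the bandit idea concrete, here is a minimal Beta-Bernoulli Thompson sampler. It assumes the eval reward is reduced to pass/fail per request, which is a simplification chosen for this sketch; a continuous eval score would need a different posterior. `ThompsonRouter` is a hypothetical class, not the gateway's router.

```python
import random


class ThompsonRouter:
    """Route each request to the arm whose sampled pass rate is highest."""

    def __init__(self, arms, seed=None):
        self._rng = random.Random(seed)
        # Beta(1, 1) prior per arm, stored as [passes + 1, failures + 1].
        self._posterior = {arm: [1.0, 1.0] for arm in arms}

    def choose(self) -> str:
        draws = {
            arm: self._rng.betavariate(alpha, beta)
            for arm, (alpha, beta) in self._posterior.items()
        }
        return max(draws, key=draws.get)

    def update(self, arm: str, passed: bool) -> None:
        """Fold one eval outcome into the chosen arm's posterior."""
        self._posterior[arm][0 if passed else 1] += 1.0
```

Each call samples a plausible pass rate per arm and routes to the best draw, so uncertain arms still get explored while traffic drifts toward the arm that keeps passing evals. That drift is what cuts arm overhead relative to a fixed split.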
Performance and deployment. Roughly 29K req/s, P99 21 ms with guardrails on, on t3.xlarge. Cloud at gateway.futureagi.com/v1 (OpenAI SDK drop-in), self-hosted Go binary, Docker, Kubernetes, air-gapped. Apache 2.0 core.
Where it falls short. Heavier than a team needs for a one-off “is GPT-5.2 better than Claude Sonnet 4.6” check; Portkey is faster to spin up for a single comparison. Contextual bandits are in beta as of May 2026.
Pricing. Free: 100K traces, one active experiment. Scale from $99 a month. Enterprise custom with SOC 2 Type II, HIPAA, BAA, AWS Marketplace.
Score: 7/7 axes.
2. Portkey: best for hosted split UX with mirror traffic
Verdict. Portkey ships the cleanest traffic-split primitive of any hosted gateway. Configs are versioned, the dashboard is polished, virtual keys make per-tenant splits straightforward, and mirror traffic is a first-class config strategy. The right pick when the experimentation team is comfortable owning sample-size math and outcome capture in tools that live next to the gateway.
Split and shadow. Strategies include loadbalance, fallback, single, and mirror. Mirror sends the request to a primary plus mirror targets, returns the primary response to the client, and logs the mirror response separately. That is shadow in everything but name; the mirror row shares trace_id with the primary, so the per-arm join is straightforward. Split by percentage and metadata header; per-segment splits use conditional config (verbose past two segments).
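The per-arm join on trace_id might look like the sketch below once scores from an external eval pipeline are attached to each row. The row shape (plain dicts carrying a `trace_id` plus metric fields) is an assumption for illustration and is not Portkey's log schema; `paired_scores` and `mean_paired_lift` are hypothetical helpers.

```python
def paired_scores(primary_rows: list, mirror_rows: list, metric: str) -> list:
    """Join primary and mirror rows on trace_id; return (trace_id, primary, mirror) tuples."""
    mirror_by_trace = {row["trace_id"]: row for row in mirror_rows}
    pairs = []
    for row in primary_rows:
        twin = mirror_by_trace.get(row["trace_id"])
        # Skip requests where either side is missing or was never scored.
        if twin is None or metric not in row or metric not in twin:
            continue
        pairs.append((row["trace_id"], row[metric], twin[metric]))
    return pairs


def mean_paired_lift(pairs: list) -> float:
    """Average of (mirror - primary) across paired requests."""
    if not pairs:
        raise ValueError("no paired rows to compare")
    return sum(mirror - primary for _, primary, mirror in pairs) / len(pairs)
```

Pairing on the same request removes input-mix differences between arms, which is the main statistical advantage of mirror traffic over a percentage split.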
Sample-size enforcement. None. The gateway promotes a config the moment you publish it. Teams enforce the floor in CI or in the deployment pipeline.
Outcome scoring + significance. Not native. Every request gets logged with the arm tag; per-arm outcomes require a separate eval pipeline. Teams typically wire Future AGI traceAI or Langfuse behind Portkey and join on trace_id. The Feedback API accepts post-hoc scores; the scoring engine is on you. P-value and CI live in a notebook.
Sticky routing. Through the Portkey-Sticky-Session-Id header.
Rollback. Edit live config in UI or API; typical 5 to 15 seconds. No eval-gated rollback.
Bandits. None as of May 2026. Roadmap mentions a bandit router for late 2026.
Where it falls short. No native eval pipeline (the dashboard tells you which arm was cheaper, not which was better). No sample-size enforcement. No bandits. Procurement note: verify the Palo Alto Networks acquisition timeline before signing multi-year.
Pricing. Free: 10K requests a day. Scale from $99 a month. Enterprise custom with SOC 2 Type II.
Score: 3/7. Strong on split and mirror; weak on enforcement, native scoring, significance, bandits.
3. Helicone: best per-request scoring API for proxy-mode experiments
Verdict. Helicone treats the per-request scoring API as a first-class primitive. The proxy logs every request with an arm tag and exposes properties and feedback APIs for attaching scores, metadata, and prompt template versions. What it doesn’t ship is shadow traffic or routing-policy-driven experimentation; the proxy is observability-first.
Split. Proxy-style, by user-property header (Helicone-Property-User, Helicone-Property-Cohort) and prompt template version. No native percentage-based split. The application picks the model per request; Helicone logs the choice.
Shadow + sample floor. Neither. The proxy logs what the app sends and doesn’t gate deploys.
Outcome scoring. The wedge: POST a score keyed on request_id and Helicone joins it to the log row. Scores group by prompt template version and user property, which is the per-arm view. Source can be LLM judge, deterministic check, or human review. Experiments view shows per-arm sample size and mean; SQL exposes the table for p-value and CI work in a notebook.
Rollback. One-click prompt template revert. The gateway doesn’t block deploys.
Bandits, sticky routing. Neither. Helicone isn’t a router; pair with Portkey, LiteLLM, or Future AGI if model-level A/B requires gateway-side split.
Pricing. Free 10K requests a month; Pro from $20 a month; Team and Enterprise custom.
Score: 2/7. Strong on the scoring API; weak everywhere routing or experiment automation matters.
4. LiteLLM: best self-hosted Python-native proxy
Verdict. LiteLLM fits when A/B traffic cannot leave the VPC and security wants source-availability over polish. It ships the routing primitive (percentage splits, weighted load balancing, fallback chains, retries with backoff) and leaves analytics to whatever stack you already run.
Split. Weighted model groups in config.yaml. Virtual keys for per-team and per-user routing; a small Python pre-call hook for per-header routing. Strategies include usage-based-routing-v2, least-busy, and latency-based. Sticky-by-user-identity is an extension teams write.
Shadow. Not config-declarative. Teams implement shadow with a Python hook that duplicates the request to a candidate model and discards the response.
Sample-size floor + outcome scoring + significance. None native. Teams run Future AGI traceAI, Langfuse, or Helicone behind LiteLLM and join on litellm_call_id. The proxy logs spend and latency; experiment-grade scores live elsewhere.
Rollback. Edit the YAML and reload. Typical rollback 30 seconds.
Bandits. None in the statistical sense. least-busy and latency-based optimize for latency, not for an eval reward.
Where it falls short. Experimentation story is the thinnest of the five. Plan to bolt three other tools on top. Pin commits after the March 24, 2026 PyPI compromise.
Pricing. MIT for the proxy. Enterprise (SLA, SSO, audit, SOC 2 Type II) starts around $250 a month.
Score: 1/7. Strong as a routing proxy; thin as an experimentation platform.
5. Cloudflare AI Gateway: best for teams already on Cloudflare Workers
Verdict. A free-to-start observability and caching layer in front of model providers. Fits when the workload already runs on Cloudflare Workers and the team wants one dashboard for logs, caching, and rate limits. Experimentation is not the product surface.
Split. Universal Endpoint accepts an array of provider configs and falls back through them on error. No percentage split, no header split; the split has to live in the Worker that calls the gateway.
Shadow, sample floor, outcome scoring, significance, bandits. None of these five capabilities ships. The dashboard shows request count, cost, latency, and cached vs uncached. No scoring API, no eval pipeline, no routing-policy versioning.
Sticky routing. Application-layer (Worker code keys the decision).
Rollback. Edit the Worker and redeploy; edge propagation under 30 seconds.
Where it falls short. Not an experimentation platform. Honest framing: a caching, logging, rate-limiting front door for model providers, useful for cost control and basic observability, not for A/B testing. Pair with a real routing layer (LiteLLM in a Worker, or Agent Command Center / Portkey above) if you need split, score, or shadow.
Pricing. Free tier covers most teams; paid tiers add longer log retention and higher rate limits.
Score: 0/7 for A/B testing. Useful for adjacent jobs, not the one this list scores on.
Capability matrix
| Axis | Future AGI | Portkey | Helicone | LiteLLM | Cloudflare AI Gateway |
|---|---|---|---|---|---|
| Shadow / mirror traffic | Native (shadow + race + mirror) | Native (mirror) | None | Hook-based | None |
| Sample-size enforcement | Declarative floor blocks promote | None | None | None | None |
| Outcome scoring at the hop | 50+ EvalTemplate classes | External (BYO) | Scores API | External | None |
| Traffic split granularity | %, segment, header, tenant, attr | %, header | Property-based | %, model group, team | App-layer |
| Sticky routing | Configurable key | Header | App-layer | Extension | App-layer |
| Rollback < 60s | ~20s | 5-15s | 1-click prompt | ~30s | ~30s |
| Eval-gated promotion + bandits | Eval gate + Thompson + epsilon-greedy | None | None | None | None |
| Score | 7/7 | 3/7 | 2/7 | 1/7 | 0/7 |
Decision framework: choose X if
Choose Future AGI Agent Command Center if you want the gateway to run the experiment end-to-end: shadow, score, enforce the sample floor, gate the promotion, run bandits. Pick this when experimentation cadence is the constraint on shipping LLM quality and arm-overhead cost ($100K+ per cycle) makes bandits and shadow worth the investment.
Choose Portkey if you want a hosted gateway with the cleanest split UX and you accept that significance, sample-size enforcement, and scoring live in tools next to the gateway. The mirror strategy buys you shadow; the rest is on the team.
Choose Helicone if the experiment is logging-first and the team has discipline about posting scores through the Scores API. Pair with a router for model-level A/B between providers.
Choose LiteLLM if traffic cannot leave the VPC, the team is Python-native, and the routing decision matters more than the experimentation analytics. Plan to bolt evals, significance, and shadow on top.
Choose Cloudflare AI Gateway for cost telemetry, caching, and rate limits in a Cloudflare-native stack. Do not pick it as an A/B testing surface.
Common mistakes the gateway should kill
| Mistake | Fix |
|---|---|
| Splitting per-request instead of per-user (signal mixes both arms) | Hash a stable identity (user_id, tenant_id, SSO subject) and route the hash |
| Unequal sample sizes (the smaller arm ends up with a wide CI) | Equalize the split or use a bandit that reports CI per arm |
| Reading eval before the window (easy queries route first) | Enforce minimum window (24-72h) and sample floor at the gateway |
| Eval gate only on primary (PII leak slips through) | Gate must require primary and all guardrail metrics to clear |
| No rollback automation | Auto-rollback on guardrail breach; humans review after |
| Stickiness on IP address (NAT clusters users) | Stickiness is a user identity, not a network identity |
| Floating the judge mid-test | Pin and version the judge for the experiment window |
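The auto-rollback fix in the table depends on a breach persisting, not on a single bad reading. A minimal sketch of that rule, assuming one guardrail reading per minute and the 15-consecutive-minute condition mentioned earlier; `BreachTracker` is a hypothetical name:

```python
from dataclasses import dataclass


@dataclass
class BreachTracker:
    limit: float                           # guardrail ceiling, e.g. max refusal-rate delta
    consecutive_minutes_required: int = 15
    streak: int = 0

    def observe_minute(self, value: float) -> bool:
        """Record one per-minute reading; return True when rollback should fire."""
        # Any in-bounds minute resets the streak, so transient spikes do not trigger.
        self.streak = self.streak + 1 if value > self.limit else 0
        return self.streak >= self.consecutive_minutes_required
```

Requiring a streak trades a few minutes of exposure for fewer false rollbacks; a zero-tolerance guardrail such as a PII leak would set the requirement to one minute or block at request time instead.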
A worked example: 600K req/day, shadow first, bandit second
A SaaS support workload runs 600K requests a day through claude-sonnet-4-6. Task-completion sits at 76. The team wants a +3 lift before promoting.
Days 1-2, shadow. The candidate enters as a shadow arm; the gateway mirrors live requests and scores both responses on Groundedness, Completeness, tool-use accuracy, and refusal correctness. Offline lift read +4.1; shadow lift on the production distribution reads +3.6 (95% CI [+2.9, +4.3]). Offline slightly overestimated. The production point estimate clears the +3 bar, though the lower bound of the interval sits just under it, which is why the candidate still has to earn promotion in the canary.
Day 4, canary. Candidate enters at 5 percent of cohort traffic, keyed on tenant ID. Eval-gated rollback wired on refusal rate (max +1.5), PII leak (zero, Protect gate), latency p95, and cost. Sample floor: 1,200 paired examples per arm; minimum window 72 hours.
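A sample floor is normally derived from a power calculation rather than picked by feel. The sketch below uses the textbook two-sample formula, n = 2(z_alpha + z_power)^2 * sigma^2 / delta^2 per arm. It is not a claim about how the 1,200 figure above was chosen: the standard deviation passed in is a hypothetical input, and a paired design would use the variance of per-request differences instead.

```python
from math import ceil
from statistics import NormalDist


def samples_per_arm(min_lift: float, std_dev: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size to detect `min_lift` with a two-sided test at the given power."""
    if min_lift <= 0 or std_dev <= 0:
        raise ValueError("min_lift and std_dev must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * std_dev ** 2 / min_lift ** 2)
```

With a +3 point target and an assumed per-request standard deviation of 25 points, `samples_per_arm(3.0, 25.0)` comes out a little under 1,100 per arm. Halving the detectable lift roughly quadruples the requirement, which is why small lifts need long windows.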
Hours 12-48, hold. CI tightens around 78.6 vs 75.9 (+2.7, p=0.02). Below the +3 threshold; the gateway refuses to promote.
Hour 60, signal. A tenant cluster hits a PII guardrail at 0.15% vs 0.02% on control. Below the 0.5% breach threshold so no auto-rollback, but Error Feed flags the pattern and the optimizer ingests the 47 flagged spans, proposing a second-iteration prompt with an explicit PII-redaction clause.
Days 5-9, multi-arm. Treatment-B enters. Thompson-sampling bandit starts at 60/20/20 and shifts traffic on the live fi.evals reward signal. By day 9 the bandit has 78 percent on treatment-B. Per-arm scores: control 75.9, treatment-A 78.6, treatment-B 80.1. All guardrails in bound. Sample floor and window cleared; the promotion gate fires.
Net result over two weeks: task-completion 76 to 80.1 (+4.1); refusal rate +0.4; PII leak -0.01; cost unchanged. By month three the team has run 14 experiments with the optimizer authoring 9. The cadence is no longer human-bound.
The point is not the numbers; it is the gate at hour 48 where the gateway said no when the team would have said yes. That is what gateway-level A/B testing is for.
Where Future AGI fits in the loop
The other four gateways treat A/B testing as a routing primitive: split traffic, log the result, hand the rest to the team. Future AGI Agent Command Center treats it as the visible part of a loop that compounds across six stages.
- Declare. An experiment is a declarative config: arms, primary metric via fi.evals, guardrail metrics, stickiness key, shadow window, sample-size floor, traffic split, promotion threshold, rollback condition. Versioned in Git, applied through Agent Command Center.
- Shadow. Before any user sees the candidate, the gateway mirrors live traffic and scores both responses against the same rubric the canary will use.
- Trace. Every request produces a traceAI span with arm tag, model, prompt version, inputs, outputs, tool calls, latency, and cost. The span is the unit of join across the loop.
- Evaluate. ai-evaluation scores every span and attaches scores as span attributes. Protect runs on the request path so a PII-leaking arm fails its guardrail at request time, not at end-of-day analysis.
- Decide. Per-arm CI and p-value refresh every five minutes. Promotion fires only when primary lift, guardrails, sample floor, and window all clear.
- Optimize. Losing arms become training data for agent-opt. The six optimizers ingest failures and propose rewrites that become next-experiment arms. The team’s role shifts from drafting variants to reviewing optimizer proposals.
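The fields listed under Declare can be pictured as a typed config object. The sketch below is an illustration of the shape only: the field names are paraphrased from the list above, the validation rules are assumptions, and none of it is the Agent Command Center schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    arms: dict                  # arm name -> traffic share in percent
    primary_metric: str         # e.g. an eval name such as "task_completion"
    guardrail_limits: dict      # guardrail metric -> maximum tolerated value
    stickiness_key: str         # identity to hash, e.g. "tenant_id"
    shadow_window_hours: int
    min_samples_per_arm: int
    min_window_hours: int
    promotion_threshold: float  # minimum primary-metric lift to promote
    rollback_after_breach_minutes: int

    def __post_init__(self):
        # Fail at declaration time, before any traffic is routed.
        if "control" not in self.arms:
            raise ValueError("an arm named 'control' is required")
        if abs(sum(self.arms.values()) - 100.0) > 1e-9:
            raise ValueError("traffic split must sum to 100 percent")
        if self.min_samples_per_arm <= 0 or self.min_window_hours <= 0:
            raise ValueError("sample floor and window must be positive")
```

Validating at construction means a malformed experiment is rejected in code review or CI, not discovered after the canary has started serving traffic.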
Three Apache 2.0 building blocks (traceAI, ai-evaluation, agent-opt). Hosted Agent Command Center adds the experiment view, shadow primitives, Protect, bandit routers, RBAC, SOC 2 Type II, HIPAA, and AWS Marketplace procurement.
Ready to ship an A/B that survives a post-mortem? Drop in the OpenAI SDK with base_url="https://gateway.futureagi.com/v1", declare min_samples_per_arm and shadow_window_hours, score every span with Evaluator.evaluate, and let the gateway gate the promotion. The first experiment is the one you would have shipped manually; the tenth is the one the optimizer drafted.
Related reading
- A/B Testing LLM Prompts: The Statistical Playbook (2026)
- Best 5 AI Gateways for Prompt Management in 2026
- Best AI Gateways for LLM Observability and Tracing in 2026
- Best AI Gateways for Model Routing in 2026
- Best AI Gateways for LLM Failover and Fallback in 2026
- LLM as Judge Best Practices in 2026
Sources
- Future AGI Agent Command Center, futureagi.com/platform/monitor/command-center
- Future AGI Protect latency benchmarks, arXiv 2510.13351 (65 ms text, 107 ms image)
- Future AGI open-source libraries (Apache 2.0): traceAI, ai-evaluation, agent-opt
- Portkey AI gateway, portkey.ai
- Helicone, helicone.ai
- LiteLLM proxy, github.com/BerriAI/litellm
- Cloudflare AI Gateway, developers.cloudflare.com/ai-gateway
Frequently asked questions
What makes A/B testing at an AI gateway different from a feature-flag A/B test?
Why is shadow traffic the gating capability for gateway-level A/B testing?
What sample size should the gateway enforce before promoting an arm?
Should I use fixed splits or multi-arm bandits at the gateway?
Can I A/B test models from different providers on the same workload?
What is the safe production rollout sequence after the gateway A/B?
How does Future AGI Agent Command Center fit prompt and model A/B testing?