Best AI Gateway for Bolt.new Coding Workflows in 2026
Five AI gateways scored on Bolt.new in-browser coding workflows in 2026: per-project cost attribution, iteration-tree observability, error recovery telemetry, B2B2C budget caps, and what each gateway misses.
A SaaS company offering Bolt.new-style “build your own internal tool” features can watch one power user generate $180 of LLM cost in a single afternoon and have no idea which project, which iteration, or which prompt did it. Bolt.new looks like a single product to the end user, but underneath every “build me a CRM” turn is a fan-out of model calls that scaffold a project, write files, debug errors, re-prompt when WebContainer barfs, and stream the result back into a browser tab that may or may not still be open by the time the build is done.
Whether you’re running Bolt.new directly or embedding a Bolt-style agent into your own product (B2B2C), the operational shape is the same: a tree of iterations, an unpredictable per-project token count, an in-browser runtime that constrains telemetry, and a finance team that wants per-customer attribution.
An AI gateway fixes the visibility problem. It doesn’t fix the WebContainer runtime limits or answer “did this user ship a working app”. But the right gateway makes the workload operable. This post scores five gateways on the seven axes that matter when Bolt.new is the workload. Only one of them closes the loop from trace to optimizer to route.
TL;DR
Future AGI Agent Command Center is the strongest pick for an AI gateway in front of Bolt.new coding workflows because it ships per-project virtual keys with hard budget caps for B2B2C resale, iteration-tree traces rooted at the project (not flat lists), WebContainer-friendly SSE pass-through, and OpenTelemetry-native cost attribution per iteration. The other four picks below win on specific edges.
- Future AGI Agent Command Center: Best overall. Iteration-tree observability, per-end-customer span attributes, Stripe-metered budgets, and Anthropic / OpenAI / Bedrock all behind one OpenAI-compatible base URL.
- Portkey: Best when you re-sell a Bolt-style app-builder inside your product. Cleanest virtual-key + budget-cap UX (verify the Palo Alto Networks acquisition timeline before signing a multi-year contract).
- Helicone: Best when a small team uses Bolt.new directly and only needs per-iteration cost numbers. Drop-in proxy with minimal infra (treat it as a planned migration after the March 3, 2026 Mintlify acquisition).
- LiteLLM: Best when customer source code must stay in your network. Source-available Python proxy that runs in your VPC; pin commits after the March 24, 2026 PyPI compromise.
- OpenRouter: Best for pay-as-you-go prototyping with many providers. Useful for early-stage Bolt-style products before chargeback matters.
Why Bolt.new specifically needs a gateway in front of it
Bolt.new is StackBlitz’s in-browser AI coding tool. A user types “build me a Stripe-checkout SaaS landing page with Postgres,” and the agent scaffolds a full-stack project that boots inside a WebContainer, a browser-native Node.js runtime. The agent iterates with the user: install a dependency, fix a build error, rename a component, re-run, retry. Four properties make the workload distinctive to monitor:
- Per-project cost is unbounded. A “build me a Trello clone” prompt can produce 4 turns or 40, depending on how many TypeScript errors the model has to chase. In Future AGI’s internal benchmark across 60 Bolt-style sessions in Q1 2026, p50 project cost was 84K tokens and p95 was 410K. Average is meaningless; you have to look at the distribution by project.
- Iteration trees are deep, not linear. Each user prompt spawns multiple model calls (planner, file-writer, error-fixer), each tool call can spawn a new code-edit cycle, and users often fork when one path fails. A flat “list of API calls” trace is useless. You need a tree rooted at the project, with each iteration as a sub-tree.
- Errors are first-class. WebContainer compile errors, npm install failures, and type errors aren’t exceptional cases; they’re the steady state of an AI-driven app generator. The gateway has to capture the recovery path: which error was hit, which prompt got sent next, and whether it succeeded.
- The runtime constrains the client. Bolt.new runs entirely in the browser’s WebContainer. You can’t drop a daemon next to the runtime; any observability has to share the same fetch() as model traffic. This eliminates “agent SDK” patterns that assume a server-side process.
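To make the “average is meaningless” point concrete, here is a minimal sketch that summarises per-project token totals by mean, p50, and p95. The function name and the sample totals are made up for illustration; they are not the benchmark data quoted above.

```python
from statistics import mean, quantiles

def project_cost_summary(tokens_by_project: dict[str, int]) -> dict[str, float]:
    """Summarise per-project token totals by mean, p50, and p95."""
    totals = sorted(tokens_by_project.values())
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = quantiles(totals, n=100, method="inclusive")
    return {"mean": mean(totals), "p50": cuts[49], "p95": cuts[94]}

# Hypothetical token totals for five projects (illustrative only).
sample = {
    "crm": 60_000,
    "todo": 70_000,
    "blog": 84_000,
    "shop": 90_000,
    "trello-clone": 410_000,
}
summary = project_cost_summary(sample)
```

With one long-running project in the sample, the mean lands well above the median, which is why a per-project distribution is more useful than a single average.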
A gateway sits between the Bolt.new client (or your B2B2C proxy) and the model providers. It tags each request with project + iteration + customer metadata, streams SSE without buffering (the in-browser progress UI breaks if batched), and caps spend so a runaway customer can’t drain the budget. All five picks below speak both Anthropic and OpenAI API surfaces, since Bolt-style products typically route across Claude, GPT-4-class, and open-weights models depending on complexity.
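The “streams SSE without buffering” requirement can be sketched as a relay that forwards each event the moment its terminator arrives. This is an illustration of the pass-through idea under simple assumptions (events separated by `\n\n`, byte chunks from some upstream iterator), not any gateway’s actual implementation; `relay_sse` is a hypothetical name.

```python
from typing import Iterable, Iterator

def relay_sse(upstream: Iterable[bytes]) -> Iterator[bytes]:
    """Forward each complete SSE event as soon as its blank-line terminator arrives."""
    pending = b""
    for chunk in upstream:
        pending += chunk
        # Assumption: events are separated by a blank line ("\n\n").
        while b"\n\n" in pending:
            event, pending = pending.split(b"\n\n", 1)
            yield event + b"\n\n"
    if pending:
        # Flush a trailing partial event when the upstream closes.
        yield pending
```

Because the relay is a generator, the first event reaches the browser after the first upstream chunk rather than after the whole response, which is the property the in-browser progress UI depends on.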
The 7 axes we score on
The default “best AI gateway” axes (provider breadth, routing, fallback, observability, cost, security, deployment) are too generic. We scored each pick on seven axes that specifically affect Bolt.new and Bolt.new-style in-browser AI coding workflows.
| Axis | What it measures |
|---|---|
| 1. Per-project cost attribution | Can the gateway group cost by Bolt.new project ID, not just by API key? |
| 2. Iteration-tree observability | Can the trace render a tree of user-prompt → planner → file-writer → error-fixer, not a flat list? |
| 3. Error-recovery telemetry | Does the gateway capture WebContainer compile / npm errors and the prompts that followed them? |
| 4. Iteration-phase model selection | Can routing differ between scaffolding, polish, and error-recovery phases? |
| 5. Per-customer budget caps for B2B2C | Can you cap spend per end-customer (not just per developer) when you re-sell Bolt? |
| 6. WebContainer-friendly streaming | Does SSE pass-through work without buffering that breaks the in-browser UI? |
| 7. Audit trail for user-generated apps | Can compliance trace which prompt produced which file in which user’s project? |
The score line at the end of each pick covers all seven.
How we picked
We started from the universe of public AI gateways that ship OpenAI-compatible and Anthropic-compatible endpoints as of May 2026. We dropped two early-stage proxies that buffered SSE responses (they break Bolt.new’s in-browser progress UI on the first turn). We dropped one major API-management gateway because its AI-extension plugin doesn’t preserve Anthropic tool-use blocks across the proxy hop without a custom Lua rewrite. The remaining five are the cohort below.
We tested each gateway by routing 50 Bolt-style projects through it across three days, instrumenting per-project cost, iteration-tree shape, and time-to-first-token. Numbers cited inline come from that test run.
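For readers who want to reproduce a time-to-first-token comparison, a minimal sketch of the measurement follows. It assumes the response is exposed as an iterator of byte chunks and that reading the iterator is what triggers the request; `time_to_first_chunk` is a hypothetical helper, not the harness used for the numbers in this post.

```python
import time
from typing import Iterable, Optional

def time_to_first_chunk(stream: Iterable[bytes]) -> Optional[float]:
    """Seconds from starting to read the stream until the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in stream:
        if chunk:
            return time.perf_counter() - start
    return None  # the stream ended without producing data
```

Running it against the direct provider endpoint and against the gateway with the same prompt gives the two figures to compare; repeat across many requests before reading anything into a p50.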
1. Future AGI Agent Command Center: Best for per-project / per-customer Bolt attribution
Verdict: Future AGI captures every Bolt.new project as an OpenTelemetry tree rooted at the project (not a flat list), with per-iteration sub-trees, WebContainer compile and npm error spans, per-end-customer span attributes for B2B2C deployments, and Stripe-metered budget caps that stop a runaway customer from draining the budget. Anthropic, OpenAI, and Bedrock all sit behind one OpenAI-compatible base URL so the same project can switch model providers per iteration without re-instrumenting.
What it does for Bolt.new coding workflows:
- Per-project cost attribution through the `fi.attributes.project.id` span attribute. The dashboard groups iterations into projects natively.
- Iteration-tree observability is the wedge here. Each user prompt becomes a root span, each planner call a child, each file-write tool call a grandchild, each error-recovery loop a sibling sub-tree. Click any node, see the prompt, model, tokens, and diff.
- Error-recovery telemetry because `traceAI` (Apache 2.0) instruments successful and failed tool calls. WebContainer errors surface as span events with the error text; the next model call is linked as the recovery span.
- Iteration-phase model selection through routing keyed on a `phase` span attribute: scaffolding turns to `claude-sonnet-4-6`, polish to `claude-haiku-4-5`, error-recovery to whichever model your evals show recovers fastest.
- Per-customer budget caps through `fi.budgets` keyed on `customer.id`. Soft alert at 80%, hard pause at 110%.
- WebContainer-friendly streaming. SSE pass-through doesn’t buffer. We measured time-to-first-token at 412ms p50 vs 387ms p50 direct, a 25ms overhead that doesn’t break the in-browser progress UI.
- Audit trail through immutable span persistence plus `fi.attributes.user.email` from SSO. Every file the agent wrote is traceable back to its prompt and model version.
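The phase routing and budget thresholds described in the list above can be sketched in plain Python. The model names and the 80% / 110% thresholds come from the text; the function names, the phase labels as dictionary keys, and the return values are assumptions for illustration and are not the `fi.budgets` or routing API.

```python
# Phase -> model mapping from the text. Error recovery is chosen by the caller,
# since the text leaves it to "whichever model your evals show recovers fastest".
PHASE_ROUTES = {
    "scaffolding": "claude-sonnet-4-6",
    "polish": "claude-haiku-4-5",
}

def pick_model(phase: str, recovery_model: str) -> str:
    """Return the model for a request's phase label."""
    if phase == "error-recovery":
        return recovery_model
    # A KeyError on an unlabeled phase is deliberate: fail loudly, not silently.
    return PHASE_ROUTES[phase]

def budget_action(spent_usd: float, cap_usd: float) -> str:
    """Map a customer's spend against their cap to allow / alert / pause."""
    ratio = spent_usd / cap_usd
    if ratio >= 1.10:
        return "pause"  # hard pause at 110% of the cap
    if ratio >= 0.80:
        return "alert"  # soft alert at 80% of the cap
    return "allow"
```

In a B2B2C setup the check would run per `customer.id` before each request is forwarded, so one runaway customer is paused without affecting the others.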
The loop. Every iteration gets scored by fi.evals against code-correctness, build-success, and tool-use rubrics. Low-scoring iterations cluster by failure mode; a common one is “model spent 18K tokens scaffolding a project that fails npm install.” That cluster feeds fi.opt.optimizers, which rewrites the scaffolding prompt or adjusts the routing policy. It offers six optimizers: RandomSearchOptimizer, BayesianSearchOptimizer (Optuna-backed, with teacher-inferred few-shot templates and resumable studies), MetaPromptOptimizer, ProTeGi, GEPAOptimizer, and PromptWizardOptimizer. All six share an EarlyStoppingConfig (patience, min_delta, threshold, max_evaluations) and the same unified Evaluator over 60+ FAGI rubrics. Next deploy, the gateway uses the updated route. Trace data also feeds the Protect guardrail, which screens for prompt-injection at ~65 ms text-mode latency (arXiv 2510.13351), low enough to keep in the hot path of an interactive in-browser tool.
Where it falls short:
- agent-opt is opt-in. For a one-week prototype or hobby Bolt clone, start with traceAI + ai-evaluation and light up the optimizer once eval baselines stabilize.
- The phase-attribute model assumes you can label which phase a request belongs to. For vanilla Bolt.new, where you don’t own the client, you have to infer phase from request shape, which is doable but adds setup time.
- The project-tree view assumes one parent span per project. Bolt-style products that run parallel agents on the same project get a noisy tree.
Pricing: Free tier with 100K traces / month. Scale starts at $99/month. Enterprise custom with SOC 2 Type II, HIPAA, GDPR, and CCPA certifications. Listed on AWS Marketplace.
Score: 7/7 axes.
2. Portkey: Best for hosted virtual keys in B2B2C Bolt-style products
Verdict: Portkey is the strongest hosted-only product when you re-sell a Bolt-style app-builder to your customers and each customer needs a virtual key with its own budget cap. The virtual-key + RBAC UX is the most polished in this list. It observes, routes, and enforces, but doesn’t optimize prompts back.
What it does for Bolt.new coding workflows:
- Per-project cost attribution through `metadata.project_id`. The wrapper must set the header; without it, aggregation collapses to virtual-key level.
- Iteration-tree observability through Portkey’s trace dashboard. Requests link via `trace_id`. The view is a flat list with parent-child markers: workable, but you have to mentally build the tree.
- Error-recovery telemetry at request level; filter by `metadata.error_type` if your wrapper sets it.
- Iteration-phase model selection through Portkey’s routing config keyed on metadata. Works well, less integrated with trace data than Future AGI.
- Per-customer budget caps through the virtual-key system: each customer gets a key fanned out to your provider key, with per-key $X/day caps. This is Portkey’s strongest feature for B2B2C.
- WebContainer-friendly streaming works; we measured 432ms p50 time-to-first-token vs 387ms direct.
- Audit trail through request log plus virtual-key attribution. SOC 2 Type II GA. Per-customer slicing is clean.
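When a dashboard shows a flat list with parent links, the tree can be rebuilt from an export. A minimal sketch follows; the record fields (`id`, `parent_id`, `role`) are hypothetical and do not reflect Portkey’s actual export schema.

```python
from collections import defaultdict

def build_tree(requests: list[dict]) -> dict:
    """Nest flat request records under their parents using parent_id links."""
    children = defaultdict(list)
    for req in requests:
        # Records without a parent_id are treated as roots (user prompts).
        children[req.get("parent_id")].append(req)

    def attach(node: dict) -> dict:
        return {
            "id": node["id"],
            "role": node["role"],
            "children": [attach(child) for child in children[node["id"]]],
        }

    return {"roots": [attach(root) for root in children[None]]}
```

Each root corresponds to one user prompt; planner, file-writer, and error-fixer calls hang beneath it in the order they were logged.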
Where it falls short:
- No optimizer. Traces inform humans, not the gateway.
- Project-tree visualization is flat-with-links, not a real tree. For deep iteration trees (a single Bolt prompt can fan out 20 sub-calls), readability suffers.
- Metadata-header model assumes you control the Bolt request flow. Vanilla Bolt.new without a wrapper sends what it sends.
Pricing: Free tier with 10K requests/day. Scale starts at $99/month. Enterprise custom with SOC 2 Type II.
Score: 6/7 axes (missing: feedback loop / optimization).
3. Helicone: Best for lightweight observability on small Bolt teams
Verdict: Helicone is the right pick when a small team runs Bolt.new directly (not B2B2C), wants per-iteration cost numbers, and doesn’t need budgets, routing, or guardrails. The drop-in is genuinely a drop-in: change the base URL, get a request log.
What it does for Bolt.new coding workflows:
- Per-project cost attribution through `Helicone-Property-ProjectId`. The dashboard slices by it.
- Iteration-tree observability through `Helicone-Session-Id` and `Helicone-Session-Path`. Helicone renders a real tree of nested sessions, a step up from Portkey’s flat list, less polished than Future AGI’s view.
- Error-recovery telemetry is shallow. Non-200 responses surface; semantic errors (compile failures, npm errors) require custom properties.
- Iteration-phase model selection isn’t native. Helicone’s routing layer is basic (failover, retries). Phase decisions have to happen in your client.
- Per-customer budget caps are limited. Usage alerts and rate-limit policies, not hard spend caps. For B2B2C, the biggest gap.
- WebContainer-friendly streaming works.
- Audit trail through the request log; less mature than Portkey.
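The three headers named in the list above can be assembled per request with a small helper. This is a sketch: the header names are the ones mentioned in the text, while the helper name and the path layout (iteration, then step) are assumptions, so check Helicone’s session documentation before relying on the exact path format.

```python
def helicone_headers(project_id: str, session_id: str, path_parts: list[str]) -> dict[str, str]:
    """Build per-request tagging headers for project and session-tree attribution."""
    return {
        "Helicone-Property-ProjectId": project_id,
        "Helicone-Session-Id": session_id,
        # Assumed layout: one path segment per level of the iteration tree.
        "Helicone-Session-Path": "/" + "/".join(path_parts),
    }

# Hypothetical IDs: third iteration of a project, error-fixer step.
headers = helicone_headers("proj-42", "sess-abc", ["iteration-3", "error-fixer"])
```

The same dictionary would be merged into the headers of each model call, so every request in an iteration carries the same session ID and a path that places it in the tree.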
Where it falls short:
- No optimizer.
- No hard budget caps with auto-pause, only alerts. A runaway B2B2C iteration loop still bleeds the $180 from the intro before anyone gets paged.
- No iteration-phase model selection out of the box.
Pricing: Free tier with 10K requests/month. Pro starts at $25/month. Enterprise custom.
Score: 5/7 axes (missing: hard budget caps, optimizer).
4. LiteLLM: Best for self-hosted Bolt-style products that cannot send code to a hosted gateway
Verdict: LiteLLM is the pick when you run a Bolt-style product in a regulated environment (fintech, health, government) where customer source code can’t leave your VPC. Source-available, Python-native, runs as a proxy inside your infra. Less observability out of the box, but the source is yours.
What it does for Bolt.new coding workflows:
- Per-project cost attribution through metadata pass-through: wire `metadata.project_id`; LiteLLM persists it to its spend tracker.
- Iteration-tree observability is LiteLLM’s weakest area. The proxy logs requests with parent-id metadata, but visualization is “go to your SQL warehouse.” Most teams pair LiteLLM with `traceAI` as the OTel sink and use Agent Command Center for the tree view.
- Error-recovery telemetry through exception hooks: wire a callback on failure; functional, not pretty.
- Iteration-phase model selection through routing config keyed on metadata. Python-based, so you can write arbitrary phase logic.
- Per-customer budget caps through team_id / user_id budgets, webhook-based alerts. You wire the pager yourself.
- WebContainer-friendly streaming works; we measured 405ms p50 time-to-first-token through a local LiteLLM proxy.
- Audit trail is your job. LiteLLM writes to its spend DB, you set retention and access controls.
Where it falls short:
- No optimizer.
- Observability is thin out of the box. Plan to wire `traceAI` or another OTel sink for tree depth.
- UI is functional, not polished. Finance reviews mean a SQL dashboard.
- No native guardrails; pair with `ai-evaluation` or a sidecar for prompt-injection screening.
Pricing: Open source under MIT. Enterprise tier starts around $250/month for small teams.
Score: 5.5/7 axes (missing: polished dashboard, optimizer).
5. OpenRouter: Best for pay-as-you-go model multiplexing in early-stage Bolt-style products
Verdict: OpenRouter is the pick when prototyping a Bolt-style product, you want to swap across 30 models without signing 30 contracts, and you haven’t yet built per-customer chargeback. Once revenue and SLAs arrive, OpenRouter is the wrong shape. Early on, it removes friction better than anyone.
What it does for Bolt.new coding workflows:
- Per-project cost attribution through per-request metadata. The dashboard groups by API key; per-project slicing requires bringing your own analytics or using the `metadata` field.
- Iteration-tree observability isn’t a strength. The log is flat; you ship to your warehouse and build the tree there.
- Error-recovery telemetry is shallow. Failed requests surface; linking the recovery prompt is your job.
- Iteration-phase model selection is genuinely strong; model selection is the whole product. Switch per request with one field. The downside: the routing decision lives in your client, not gateway config.
- Per-customer budget caps aren’t the shape. Billing is per-API-key. Mapping customers to keys means either provisioning an OpenRouter key per customer (operationally painful) or wrapping OpenRouter behind your own proxy (defeats the point).
- WebContainer-friendly streaming works; OpenRouter is the multiplexer of choice for several OSS Bolt clones because SSE pass-through is reliable.
- Audit trail is shallow. Event log; SOC 2 Type II, HIPAA, GDPR, and CCPA certified.
Where it falls short:
- No optimizer.
- Per-customer chargeback isn’t the model. Fine for prototypes, wrong for paid B2B2C.
- No self-host, hosted multiplexer by design. If compliance requires code to stay in your VPC, OpenRouter is out.
- No iteration-tree view, no native guardrails.
Pricing: Pay-as-you-go on top of provider pricing. No fixed monthly fee, tokens plus a thin routing markup.
Score: 4/7 axes (missing: budget caps, audit depth, optimizer).
Capability matrix
| Axis | Future AGI | Portkey | Helicone | LiteLLM | OpenRouter |
|---|---|---|---|---|---|
| Per-project cost attribution | ✅ Native | ✅ Metadata header | ✅ Custom property | ✅ Metadata | ⚠️ Bring-your-own |
| Iteration-tree observability | ✅ True tree | ⚠️ Flat + parent links | ✅ Session tree | ⚠️ SQL warehouse | ❌ |
| Error-recovery telemetry | ✅ Span events | ⚠️ Request-level | ⚠️ Wire custom | ✅ Hook callback | ⚠️ Manual |
| Iteration-phase model selection | ✅ Span-attr routing | ✅ Metadata routing | ❌ Client-side | ✅ Python config | ✅ Per-request |
| Per-customer budget caps (B2B2C) | ✅ + auto-pause | ✅ Virtual key + cap | ⚠️ Alerts only | ✅ Team/user budget | ❌ Wrong model |
| WebContainer-friendly streaming | ✅ 412ms p50 | ✅ 432ms p50 | ✅ | ✅ 405ms p50 | ✅ |
| Audit trail | ✅ SOC 2 + HIPAA + GDPR + CCPA | ✅ SOC 2 GA | ⚠️ Basic | ✅ DIY retention | ⚠️ SOC 2 in progress |
| Feedback loop / optimizer | ✅ fi.opt | ❌ | ❌ | ❌ | ❌ |
Decision framework: Choose X if
Choose Future AGI if you want the gateway to drive prompt and routing optimization over time, not monitor alone. Pick this when Bolt-style spend is a meaningful cost line (over $5K/month) and you want the cost curve to bend down, and when iteration-tree visibility matters because failure modes hide inside the tree.
Choose Portkey if you re-sell a Bolt-style app-builder and per-customer virtual-key + budget-cap UX is what you need shipped this quarter. Pick this when monitoring as a one-time setup is enough.
Choose Helicone if your team is under 10 developers running Bolt.new directly (not embedded in a product) and the simplest possible drop-in is the right fit.
Choose LiteLLM if compliance requires customer source code to never leave your VPC. Plan to pair it with traceAI for observability depth.
Choose OpenRouter if you’re prototyping, want to swap across 30 providers with one API, and haven’t yet built per-customer chargeback. Use for the first 6 months, then migrate once revenue makes per-customer attribution real money.
Common mistakes when wiring Bolt.new through a gateway
| Mistake | What goes wrong | Fix |
|---|---|---|
| Buffering SSE in the gateway | Bolt.new’s in-browser progress UI freezes mid-build; user thinks it hung; user refreshes; project state is lost | Confirm the gateway forwards SSE without buffer-and-batch; measure time-to-first-token before going live |
| Tagging by user but not by project | One developer with five active projects looks like one row; per-project cost is impossible to surface | Tag both user and project; nest user under project in the dashboard |
| No iteration-tree view, just a flat request log | A 40-call iteration tree looks like 40 unrelated requests; debugging “why did this project cost $40” takes hours | Pick a gateway that renders the tree (Future AGI, Helicone) or pair LiteLLM with a tree-aware OTel sink |
| Treating compile errors as just non-200 responses | The recovery prompt that consumes the error is not linked back; you cannot tell which errors the model recovers from cleanly and which it loops on forever | Capture WebContainer errors as span events with the recovery prompt as a linked span |
| Setting B2B2C budget caps at the API-key level | One runaway customer can still consume the team’s entire daily budget before the cap trips | Set per-customer (per-virtual-key) caps, not just team caps; aim for a soft alert at 80% and hard pause at 110% |
| Routing the same model for all iteration phases | Scaffolding turns and polish turns get the same model; you over-pay for polish and under-spec for scaffolding | Route by phase: heavier models for scaffolding, lighter for polish, hand-tuned route for error recovery |
| Sending in-browser telemetry through the same fetch as model traffic | The browser’s connection limit kicks in; model streaming chokes; UX degrades | Pipe telemetry through the gateway sidecar so it never competes with the model SSE channel |
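The compile-error row above comes down to linking each failed call to the call that consumed its error. A minimal sketch, assuming an ordered list of call records for one project with hypothetical `id` and `error` fields:

```python
def link_recoveries(calls: list[dict]) -> list[dict]:
    """Pair each failed call with the call that followed it and whether that one succeeded."""
    links = []
    for failed, following in zip(calls, calls[1:]):
        if failed.get("error"):
            links.append({
                "error": failed["error"],
                "failed_call": failed["id"],
                "recovery_call": following["id"],
                # The recovery counts as successful only if the next call had no error.
                "recovered": not following.get("error"),
            })
    return links
```

A run of links with the same error and `recovered` set to false is the “loops on the same error” pattern; a gateway that records span links gives you this without post-processing.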
How Future AGI closes the loop on Bolt.new spend
The other four gateways treat per-project cost as an end state: capture the tree, show it on a dashboard, alert on threshold trips. Future AGI treats it as the input to a feedback loop with six stages:
- Trace. Every iteration produces a span tree via `traceAI` (Apache 2.0). Root = project, children = user prompts, grandchildren = planner, file-writer, error-fixer calls. Spans capture inputs, outputs, tool calls, model, errors, and customer ID.
- Evaluate. `fi.evals` scores each iteration against task-completion, code-correctness (did it compile), and tool-use accuracy. Scores sit alongside cost; a high-cost, low-score iteration is the most expensive failure mode to find without this loop.
- Cluster. Low-scoring iterations cluster by failure mode. Common Bolt-style clusters: “scaffolded a project that fails npm install on first run” and “model loops on the same TypeScript error for 6 turns.”
- Optimize. `fi.opt.optimizers` rewrites the scaffolding prompt or adjusts routing against the clusters. It offers six optimizers: RandomSearchOptimizer, BayesianSearchOptimizer (Optuna-backed, with teacher-inferred few-shot templates and resumable studies), MetaPromptOptimizer, ProTeGi, GEPAOptimizer, and PromptWizardOptimizer. All six share an EarlyStoppingConfig (patience, min_delta, threshold, max_evaluations) and the same unified Evaluator over 60+ FAGI rubrics. Typical Bolt optimization: a phase-aware routing rule plus a tightened scaffolding prompt that stops the model from generating dependencies WebContainer can’t run.
- Route. Agent Command Center applies the updated policy on the next request. Scaffolding goes to the optimizer-tuned model, polish to a cheaper one, error-recovery to whatever has the best recovery rate.
- Re-deploy. New prompts + routes are versioned. If a score regresses, the gateway rolls back automatically and sends a Slack alert.
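The Evaluate and Cluster stages can be approximated offline to see what the loop is looking for. This is a sketch only: it assumes each iteration record already carries a `score`, a `cost_usd`, and a `failure_mode` label, and the thresholds are arbitrary placeholders. It does not use or represent the fi.evals or fi.opt APIs.

```python
from collections import defaultdict

def expensive_failures(iterations: list[dict], max_score: float = 0.5,
                       min_cost_usd: float = 1.0) -> list[tuple[str, dict]]:
    """Group high-cost, low-score iterations by failure mode, largest total cost first."""
    clusters = defaultdict(lambda: {"count": 0, "cost_usd": 0.0})
    for it in iterations:
        if it["score"] <= max_score and it["cost_usd"] >= min_cost_usd:
            cluster = clusters[it["failure_mode"]]
            cluster["count"] += 1
            cluster["cost_usd"] += it["cost_usd"]
    return sorted(clusters.items(), key=lambda kv: kv[1]["cost_usd"], reverse=True)
```

The cluster at the top of the list is the one where a prompt or routing change would save the most money, which is the ordering the Optimize stage cares about.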
Net effect: a B2B2C team starting at $25,000/month on Bolt-style spend typically sees costs trend down 18 to 28% within five weeks without changing the customer-facing UI.
Adjacent capabilities round it out. Protect screens user prompts for prompt-injection at ~65 ms text-mode latency (arXiv 2510.13351), low enough for the hot path of an in-browser tool, relevant because Bolt-style products give end-users a raw prompt input, a real injection vector. The dashboard surfaces the cost-quality matrix so you can see which customers are most expensive and which have the lowest task-completion rate; usually the same customers, and that overlap is where prompt + route updates land first.
Three building blocks are Apache 2.0 open source:
- traceAI, github.com/future-agi/traceAI
- ai-evaluation, github.com/future-agi/ai-evaluation
- agent-opt, github.com/future-agi/agent-opt
The hosted Agent Command Center adds the iteration-tree view, failure-cluster UI, live Protect guardrails, RBAC, SOC 2 Type II certification, BYOC for regulated workloads, and an AWS Marketplace listing. The guardrails come from the Future AGI Protect model family: Gemma 3n fine-tuned adapters across Content Moderation, Bias Detection, Security, and Data Privacy Compliance, covering text, image, and audio.
What we did not include
We deliberately left out three gateways that show up in other 2026 listicles:
- Cloudflare AI Gateway. Strong primitives and excellent latency at the edge, but per-customer slicing in a B2B2C Bolt-style product still requires custom workers for chargeback. Worth a re-check in Q3 2026.
- Kong AI Gateway. Solid if you already run Kong for REST, but AI-specific observability is plugin-driven rather than native; for a Bolt-style workload you would spend the first two weeks wiring plugins before the iteration tree is legible.
- TrueFoundry. Capable MLOps gateway, but the Bolt-specific integration (especially the WebContainer SSE pass-through) wasn’t stable in our May 2026 testing window.
If your situation is different, all three are worth a second look in Q3 2026.
Related reading
- Best 5 AI Gateways to Monitor Claude Code Token Usage in 2026
- What Is an AI Gateway? The 2026 Definition
- Best LLM Gateways in 2026
- Best AI Gateways for Agentic AI in 2026
- Best LLM Cost Tracking Tools in 2026
Sources
- StackBlitz Bolt.new documentation, bolt.new and stackblitz.com/docs/bolt
- Future AGI Agent Command Center, futureagi.com/platform/monitor/command-center
- Portkey AI gateway, portkey.ai
- Helicone proxy, helicone.ai
- LiteLLM proxy, github.com/BerriAI/litellm
- OpenRouter, openrouter.ai
- Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (65 ms text, 107 ms image)