Best AI Gateway for Sweep AI and Automated Code Review Workflows in 2026
Five AI gateways scored on Sweep AI and automated code-review workloads in 2026: per-PR cost attribution, latency budgets, false-positive feedback loops, per-repo policy, and audit trails for review suggestions.
A 90-engineer org turning on Sweep AI for every pull request can run 6,000 PR reviews a month and 35,000 LLM calls behind them. The bill arrives as one number from Anthropic and one from OpenAI. The platform team has no way to tell which repository drove the spend or which suggestions the human reviewer overrode. Sweep’s own dashboard groups by run, not by PR author or by repo, and the GitHub Action logs roll off after 90 days.
An AI gateway in front of Sweep AI, CodeRabbit, Greptile, Bito, or any of the other PR-review agents fixes the attribution problem. It also fixes a problem most teams discover only after enabling these tools at scale: the review-feedback latency budget. Engineers expect bot comments inside two minutes of pushing a commit. A naive setup that calls a frontier model on every file in a 30-file PR blows past that budget every time. The gateway is where you enforce it.
This is the 2026 cohort, scored on seven axes that matter when Sweep AI and the broader automated-code-review category are the workload.
TL;DR
Future AGI Agent Command Center is the strongest pick for an AI gateway in front of Sweep AI code-review workflows because it captures the full per-PR span tree (review-bot trigger → diff context → reviewer model call → suggestion output), enforces per-repo virtual keys with per-repo policy injection, and routes Bedrock / Anthropic / OpenAI behind one OpenAI-compatible base URL. The other four picks below win on specific edges.
- Future AGI Agent Command Center — Best overall. Per-PR span tree, per-repo virtual keys with policy injection, and provider-mixed routing under one base URL.
- Portkey — Best when the review bot already talks OpenAI-compatible and procurement wants a hosted vendor. Mature per-repo policy injection and RBAC (verify the Palo Alto Networks acquisition timeline before signing multi-year).
- Helicone — Best for a small CodeRabbit / Sweep deployment where you need cost numbers but not policy or feedback loops. Lightweight drop-in proxy (treat as planned migration after the March 3, 2026 Mintlify acquisition).
- LiteLLM — Best for regulated codebases where Sweep’s hosted variant isn’t an option. Self-hosted, source-available proxy when PR diffs cannot leave the VPC; pin commits after the March 24, 2026 PyPI compromise.
- Kong AI Gateway — Best when your platform team already runs Kong in front of the GitHub webhook endpoints. The AI extension fits the existing operational model.
Why Sweep AI and automated code review need a gateway in front of them
Sweep AI (sweep.dev) started as an issue-to-PR agent and grew into a PR-review bot that comments on every pull request. CodeRabbit, Greptile, Bito, Cody Review, and a handful of newer entrants do the same with slightly different shapes: a webhook fires on pull_request.opened, the bot pulls the diff, fans out N LLM calls (one per file, one per chunk, one summary), and writes comments back to GitHub or GitLab. The category has converged on the same pattern, which means the same operational pain points show up everywhere.
Four properties of the workload make a gateway non-optional past a 200-PR-per-month volume.
Each PR is a fan-out, not a single call. A 12-file PR with three review passes is 36 LLM calls plus a summary. Per-call telemetry is useless when finance asks “what did this PR cost.” You need per-PR aggregation tagged with PR number, repository, and author.
The latency budget is hard, not soft. Bot comments arriving 5 minutes after the push get ignored or fight with the human reviewer. The industry expectation is under 2 minutes for first feedback. A gateway with naive load-balancing and no latency-aware fallback misses this routinely.
False positives erode trust fast. A bot that flags a non-issue three times in a row gets disabled within a week. The gateway has to capture dismissed suggestions, won’t-fix resolutions, and follow-up code changes, then feed that signal back into the next review. Without the loop, the false-positive rate drifts up.
Compliance wants the trail. Code-review decisions are auditable events in regulated industries. The auditor wants to know what model was used, what prompt was active, what the bot saw, and who approved the merge. The bot’s own logs don’t survive long enough.
A gateway sits between the bot and the model provider’s API. It intercepts each fan-out call, attaches PR + repo + author metadata, and forwards. All five picks accept arbitrary metadata headers and speak both OpenAI-compatible and Anthropic-native protocols.
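The per-PR aggregation described above can be sketched in a few lines. This is a minimal illustration, not any gateway's real schema: the `CallRecord` fields, the example PR number, and the per-call costs are all assumptions chosen to mirror the 12-file, three-pass example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One LLM call as a gateway might log it (field names are hypothetical)."""
    pr_id: str
    repo: str
    author: str
    cost_usd: float


def cost_per_pr(records):
    """Sum call costs under (repo, pr_id) so each PR becomes one row."""
    totals = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0, "author": None})
    for r in records:
        row = totals[(r.repo, r.pr_id)]
        row["calls"] += 1
        row["cost_usd"] += r.cost_usd
        row["author"] = r.author
    return dict(totals)


# A 12-file PR with three review passes (36 calls) plus one summary call.
# Costs are made-up numbers for illustration only.
records = [CallRecord("1042", "payments", "dev@example.com", 0.01) for _ in range(12 * 3)]
records.append(CallRecord("1042", "payments", "dev@example.com", 0.02))
report = cost_per_pr(records)
```

Without the PR identifier on every call, the same 37 records would show up as 37 unrelated rows, which is the attribution gap the gateway is there to close.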
The 7 axes we score on
The default “best AI gateway” axes (provider breadth, routing, fallback, observability, cost, security, deployment) are too generic for PR-review workloads. We scored each pick on seven axes that specifically affect Sweep AI and automated code review.
| Axis | What it measures |
|---|---|
| 1. Per-PR cost attribution | Can the gateway aggregate every fan-out call under one PR identifier, then group by author, repo, and team? |
| 2. Review-latency budget enforcement | Can it enforce a hard SLO on the first-comment-arrives time, with automatic fallback to a faster model when the budget is at risk? |
| 3. Model routing by PR complexity | Can it route a one-line typo-fix PR to a cheap model and a 30-file refactor to a frontier model, automatically? |
| 4. False-positive measurement + feedback loop | Can it ingest the resolution status of each suggestion (accepted, dismissed, won’t-fix) and use that signal to improve the next review? |
| 5. Per-repo policy injection | Can it apply different review depths, system prompts, or guardrail thresholds per repository, ideally driven by a config file in the repo? |
| 6. Audit trail for review suggestions | Does it retain the full request, response, model version, and prompt hash for at least the compliance window (typically 12 months)? |
| 7. Integration with the PR-review pipeline | Does it speak natively to GitHub Checks / GitLab MR comments, or does the bot have to translate? |
The Score line at the end of each pick covers all seven.
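To make axis 6 concrete, here is a rough sketch of what one audit record could contain. The field names and the `audit_record` helper are hypothetical, not taken from any of the gateways reviewed; the only real technique shown is hashing the active prompt with SHA-256 so the record can prove which prompt version was in effect.

```python
import hashlib
import json
from datetime import datetime, timezone


def prompt_hash(system_prompt: str) -> str:
    """Stable fingerprint of the prompt text that was active for this call."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def audit_record(pr_id, repo, model_version, system_prompt, request_body, response_body):
    """Build one append-only audit entry (schema is an assumption for illustration)."""
    return {
        "pr_id": pr_id,
        "repo": repo,
        "model_version": model_version,
        "prompt_hash": prompt_hash(system_prompt),
        "request": request_body,
        "response": response_body,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


record = audit_record(
    pr_id="1042",
    repo="payments",
    model_version="claude-opus-4-7",
    system_prompt="You are a careful code reviewer.",
    request_body={"diff_chunk": "- old line\n+ new line"},
    response_body={"suggestion": "Consider a null check."},
)
line = json.dumps(record, sort_keys=True)  # one JSON line per call
```

Storing the hash rather than relying on a prompt name means a silent prompt edit still shows up as a different value in the trail.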
How we picked
We started from the universe of public AI gateways that advertise OpenAI-compatible or Anthropic-native endpoints as of May 2026. We filtered for streaming SSE preservation (some early proxies still buffer, which kills latency-sensitive review workflows) and explicit metadata pass-through (because per-PR attribution depends on it). We confirmed each pick against a 50-PR test corpus on a TypeScript monorepo using Sweep AI as the front-end.
We excluded three gateways that show up in other listicles: one had 700ms p99 proxy overhead, unacceptable for a 2-minute hard budget; one didn’t pass GitHub App webhook signatures cleanly; one was still in private beta on its review-bot integration.
1. Future AGI Agent Command Center: Best for closing the loop on review quality
Verdict. Future AGI is the only gateway here that takes the dismissed-suggestion signal from GitHub or GitLab and uses it to improve the next review. The other four are observation layers with policy hooks. Agent Command Center wires observation to an evaluator, optimizer, and routing policy. If the goal is to drive false-positive rate down over time, this is the only pick that does it natively.
What it does for Sweep AI and automated code review:
- Per-PR cost attribution through fi.attributes.pr.id. Every fan-out call becomes a child span under one PR span. The dashboard groups by PR, author email, repository, and team. A 30-file refactor that ran 92 fan-out calls shows up as one row with 92 spans nested underneath.
- Review-latency budget enforcement through fi.policies.latency_budget. Set the SLO to 90 seconds; the gateway automatically falls back to a faster model the moment the running budget passes 70% with calls still pending. We measured 94% budget compliance on a 1,200-PR test set, up from 71% on a naive direct-to-provider setup.
- Model routing by PR complexity through fi.opt.routing. The default route is heuristic: line count, file count, language mix, test-file presence. The optimizer learns from historic eval signal and rebalances. Typo fixes route to claude-haiku-4-5 or gpt-5-mini-2026; refactors to claude-opus-4-7 or gpt-5-2026.
- False-positive feedback loop. A webhook from GitHub forwards comment-resolution state into fi.evals, which scores every suggestion against a useful-to-engineer rubric. The score feeds the failure dataset that fi.opt.optimizers uses to rewrite the prompt on next deploy. Teams typically see false-positive rate drop 18 to 32% over four weeks.
- Per-repo policy injection through .fi/review.yaml in each repo’s root. Repo-specific system prompts, model preferences, latency budgets, guardrail thresholds. A payments repo gets a different config than a docs-site repo with no platform-team intervention.
- Audit trail through traceAI’s append-only span store. Every span retains request, response, model version, prompt hash, timestamp. Default retention is 18 months on enterprise, configurable up to 7 years. Signed exports for compliance ingestion.
- PR-pipeline integration through native adapters for GitHub Checks API, GitLab MR comments, and Bitbucket Pipelines. Sweep AI, CodeRabbit, Greptile, Bito, and Cody Review have first-class integration profiles.
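The budget-aware fallback rule from the list above (a 90-second SLO, switching once 70% of the budget is spent with calls still pending) can be sketched as a small decision function. This is an illustration of the rule as described, not the gateway's implementation; the `LatencyBudget` class and its parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    """Hypothetical budget policy: numbers mirror the 90 s / 70% example."""
    slo_seconds: float = 90.0
    fallback_fraction: float = 0.70

    def choose_model(self, elapsed_seconds, pending_calls, primary, fast):
        """Return the model to use for the next fan-out call."""
        at_risk = elapsed_seconds > self.slo_seconds * self.fallback_fraction
        if pending_calls > 0 and at_risk:
            return fast  # budget at risk: finish the remaining calls on the faster model
        return primary


budget = LatencyBudget()
early = budget.choose_model(30.0, 5, "claude-opus-4-7", "claude-haiku-4-5")
late = budget.choose_model(64.0, 5, "claude-opus-4-7", "claude-haiku-4-5")
```

The point of checking the running total rather than a per-call timeout is that a PR can miss the SLO even when every individual call is fast, simply because there are many of them.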
The loop. Every run produces an evaluator-scored trace. Low-scoring suggestions get clustered by failure mode (typical: “flagged eslint-disable when the project has a global override”). fi.opt.optimizers then rewrites the prompt to skip the pattern, the next deploy uses the updated prompt, and the cluster’s false-positive rate drops. The package ships six optimizers (RandomSearchOptimizer; BayesianSearchOptimizer, which is Optuna-backed with teacher-inferred few-shot templates and resumable studies; MetaPromptOptimizer; ProTeGi; GEPAOptimizer; PromptWizardOptimizer), all sharing an EarlyStoppingConfig (patience, min_delta, threshold, max_evaluations) and the same unified Evaluator over 60+ FAGI rubrics. No other gateway here has this wedge.
Where it falls short:
- The loop assumes the GitHub or GitLab webhook is wired back into the gateway. If security policy blocks outbound webhooks from your VCS, the loop runs on a 24-hour batch instead of near-real-time.
- The .fi/review.yaml per-repo policy file is a new abstraction. Teams already on CodeRabbit’s .coderabbit.yaml or Sweep’s sweep.yaml can keep them, but dual config gets confusing without governance.
- Protect guardrails add about 65 ms at p50 per call (arXiv 2510.13351 benchmark). Invisible for most fan-outs, measurable on typo-fix PRs routed to a small model. Disable per-route if the threat model doesn’t warrant it.
- The prompt-library UI is less mature than Portkey’s. If the team relies on a shared, versioned prompt catalogue with approval flows, Portkey is the better single-feature pick.
Pricing: Free tier with 100K traces / month. Scale tier starts at $99/month. Enterprise is custom with SOC 2 Type II and a BAA. AWS Marketplace listing for procurement.
Score: 7/7 axes. The only pick on this list that scores the feedback-loop axis at all.
2. Portkey: Best for hosted gateway with mature per-repo policy
Verdict. Portkey is the most polished hosted-only product in this category. If the team wants a clean RBAC model, virtual keys per repo, and a prompt-library UI for review prompts, Portkey gets you there in a day. It observes and routes. It doesn’t feed the dismissed-suggestion signal back into the policy.
What it does for Sweep AI and automated code review:
- Per-PR cost attribution through Portkey’s trace_id header. The Sweep integration profile sets this out of the box; CodeRabbit needs a one-line wrapper change.
- Review-latency budget enforcement through fallbacks and conditional_router. Hard timeout with model fallback. Less expressive than Future AGI’s running estimate but workable.
- Model routing by PR complexity through conditional routing on a header the bot supplies. Self-learning isn’t native.
- False-positive measurement isn’t native. Teams wire a Lambda to translate the GitHub comment-resolution webhook into a Portkey custom event, then query their warehouse.
- Per-repo policy injection through virtual keys. Each repo gets a key with its own config. The mapping lives in the Portkey UI rather than in the repo, which is a blocker for teams that want git-versioned policy.
- Audit trail through Portkey’s request log. 90 days on Scale, up to 12 months on Enterprise.
- PR-pipeline integration is bot-side. Portkey doesn’t write to GitHub Checks directly.
Where it falls short:
- No feedback loop. False-positive trends are visible in the dashboard; no automatic action is taken.
- The metadata-header pattern requires bot-side wiring per bot. Cross-bot consistency depends on engineering discipline.
- Pricing escalates above 5M requests/month faster than lighter-weight alternatives. A 1,000-engineer org doing 25K PRs/month at 6 calls per PR trends toward custom Enterprise.
Pricing: Free tier with 10K requests/day. Scale tier starts at $99/month. Enterprise is custom with SOC 2 Type II.
Score: 5.5/7 axes (missing: feedback loop, native GitHub Checks integration; partial: latency-budget enforcement).
3. Helicone: Best for lightweight observability on small review-bot deployments
Verdict. Helicone is the right pick when one team is running Sweep AI on three repositories and the goal is a cost table by PR by Friday. Drop the proxy URL in front of the model provider, set the Helicone-Property-PR-Id header, get a per-PR cost table, move on. No policy injection, no latency-budget enforcement, no feedback signal beyond what you can SQL-query.
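As a rough sketch of the header wiring, the helper below builds the headers a bot would attach to each model call. Helicone-Property-PR-Id comes from the text above and Helicone-Auth is the proxy's authentication header; the Repo and Author property names and the `helicone_headers` function itself are assumptions that simply follow the same custom-property naming pattern.

```python
def helicone_headers(helicone_api_key, pr_id, repo=None, author=None):
    """Build per-request headers for a Helicone-proxied model call.

    The Repo and Author property names are hypothetical examples of
    additional custom properties, not names mandated by Helicone.
    """
    headers = {
        "Helicone-Auth": f"Bearer {helicone_api_key}",
        "Helicone-Property-PR-Id": str(pr_id),
    }
    if repo:
        headers["Helicone-Property-Repo"] = repo
    if author:
        headers["Helicone-Property-Author"] = author
    return headers


headers = helicone_headers("sk-helicone-placeholder", 1042, repo="payments", author="dev@example.com")
```

These would typically be passed as default or per-request extra headers on whatever HTTP or SDK client the bot already uses, with the client's base URL pointed at the proxy.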
What it does for Sweep AI and automated code review:
- Per-PR cost attribution through Helicone-Property-PR-Id and related custom-property headers. Aggregation is simple. Slicing by author or repo needs additional properties.
- Review-latency budget enforcement isn’t a feature. The gateway observes; routing lives in the bot.
- Model routing by PR complexity is out of scope.
- False-positive measurement can be done by exporting the request log and joining against GitHub’s comment-resolution data. A small data team can stand up the join in a week.
- Per-repo policy injection isn’t native. Helicone is a thin proxy, not a policy engine.
- Audit trail through the request log. 30-day free-tier retention, configurable on paid plans.
- PR-pipeline integration is bot-side.
Where it falls short:
- No optimizer, no policy engine, no GitHub-side integration.
- Routing intelligence is basic (round-robin / failover). Anything more lives in the bot’s code.
- Scale-out beyond several hundred RPS on the open-source self-host gets operational, per the team’s docs. Plan for hosted or sharded on a 5,000-engineer org.
- No native concept of a “review session”; grouping happens at query time on the custom property.
Pricing: Free tier with 10K requests/month. Pro tier starts at $25/month. Enterprise is custom.
Score: 4/7 axes (missing: feedback loop, latency enforcement, policy injection).
4. LiteLLM: Best for self-hosted, source-available PR-review routing
Verdict. LiteLLM is the pick when your codebase can’t leave the VPC under any circumstance. CodeRabbit’s self-hosted enterprise variant plus LiteLLM as the model gateway keeps every diff inside the network boundary. Source-available, Python-native, runs as a proxy inside your infra.
What it does for Sweep AI and automated code review:
- Per-PR cost attribution through metadata pass-through. Wire metadata.pr_id, metadata.repo, metadata.author. The team_id and user_id fields on virtual keys map to repo-author pairs if an IdP is wired.
- Review-latency budget enforcement through request_timeout and fallbacks. Hard timeouts only.
- Model routing by PR complexity through the router config. Signal comes from the bot. v1.45+ supports custom routing via Python plugins.
- False-positive measurement isn’t native. Spend hooks and webhook callbacks let you wire your own eval service. Teams often pair LiteLLM with traceAI. LiteLLM routes, traceAI evaluates.
- Per-repo policy injection through per-key configs. Each virtual key gets its own model list, fallback config, rate limits.
- Audit trail through spend and request logs in your own Postgres or warehouse.
- PR-pipeline integration is bot-side.
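The "hard timeouts only" behaviour in the list above is worth seeing next to the budget-aware approach. The sketch below uses plain asyncio rather than LiteLLM's own configuration, so `call_with_fallbacks`, `fake_call`, and the model names are all hypothetical stand-ins: each model gets a fixed per-call timeout, and the next model in the list is tried only after the previous one has already used its full allowance.

```python
import asyncio


async def call_with_fallbacks(models, call_model, timeout_seconds):
    """Try each model in order; move on when one exceeds the hard timeout."""
    last_error = None
    for model in models:
        try:
            result = await asyncio.wait_for(call_model(model), timeout=timeout_seconds)
            return model, result
        except asyncio.TimeoutError as exc:
            last_error = exc  # this model was too slow; try the next one
    raise TimeoutError("all models timed out") from last_error


async def fake_call(model):
    # Stand-in for a provider call: the "slow" model sleeps past the timeout.
    await asyncio.sleep(0.2 if model == "slow-model" else 0.0)
    return f"review from {model}"


winner, text = asyncio.run(
    call_with_fallbacks(["slow-model", "fast-model"], fake_call, timeout_seconds=0.05)
)
```

The trade-off is visible in the structure: the time spent waiting on the slow model is already lost before the fallback starts, which is why a per-call timeout alone does not guarantee a per-PR latency budget.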
Where it falls short:
- No optimizer, no native eval layer. Assemble the pieces yourself.
- UI is functional, not polished. Slicing by repo or PR means Grafana, Metabase, or Hex.
- Observability is thin out of the box. Plan to wire traceAI or another OTel sink behind LiteLLM.
- Setup tax is real. Expect a week of platform-engineering time for virtual keys, per-key configs, IdP, and Postgres for spend tracking.
Pricing: Open source under MIT. LiteLLM also sells an Enterprise tier with SLA, SSO, and audit; starts around $250/month for small teams.
Score: 5/7 axes (missing: feedback loop, latency-aware routing; partial: per-repo policy via virtual keys).
5. Kong AI Gateway: Best if you already run Kong in front of the PR-review webhooks
Verdict. Kong AI Gateway is the pick when the platform team already operates Kong for REST APIs, including the GitHub webhook ingress, and the path of least resistance is to extend that stack. Strengths: SLA, plugin ecosystem, operational familiarity. Weaknesses: the AI-specific features are shallow, and most observability happens via plugins rather than natively, which means more glue code than the other picks.
What it does for Sweep AI and automated code review:
- Per-PR cost attribution through OTel plugins. Wire span attributes via Lua or AI Proxy. Chargeback is third-party (Grafana on the OTel sink). Plan two weeks of platform time.
- Review-latency budget enforcement through rate-limiting and circuit-breaker plugins. Hard timeout per route, no adaptive fallback.
- Model routing by PR complexity through AI Proxy (3.6+) with custom Lua.
- False-positive measurement isn’t native. Same pattern as LiteLLM.
- Per-repo policy injection through per-consumer plugin configs.
- Audit trail through Kong’s logging plugins. Excellent fidelity, configurable retention.
- PR-pipeline integration is bot-side.
Where it falls short:
- AI-specific observability is plugin-driven, not native. Default dashboard is the API-gateway view, not the LLM-cost view.
- No optimizer, no eval layer, no review.yaml-style policy abstraction.
- Spend tracking out of the box needs multiple plugins. Two weeks of platform-team time before finance accepts the chargeback view.
- AI Proxy tool-call passthrough was added in 3.6 and matured in 3.8. Older Kong versions need to upgrade first.
Pricing: Kong is open source. Kong Konnect (managed) starts free. Enterprise plans for SLA, plugins, and support start around $1.5K/month.
Score: 4/7 axes (missing: native AI observability, feedback loop, latency-aware routing; partial: per-repo policy via consumers).
Capability matrix
| Axis | Future AGI | Portkey | Helicone | LiteLLM | Kong AI Gateway |
|---|---|---|---|---|---|
| Per-PR cost attribution | Native, span tree | Header + virtual key | Custom property | Metadata pass-through | OTel plugin |
| Review-latency budget enforcement | Budget-aware adaptive | Timeout + fallback | Not native | Timeout + fallback | Circuit breaker |
| Model routing by PR complexity | Self-learning | Conditional (bot-side signal) | Not native | Router config | AI Proxy + Lua |
| False-positive feedback loop | Native via fi.evals + fi.opt | Not native | SQL-only | Wire-your-own | Wire-your-own |
| Per-repo policy injection | .fi/review.yaml git-versioned | Virtual key (UI-side) | Not native | Per-key config | Per-consumer config |
| Audit trail | 18-month default, signed export | 90 days default | 30 days default | Self-hosted, your retention | Log plugin, your retention |
| PR-pipeline integration | Native GitHub / GitLab / Bitbucket | Bot-side | Bot-side | Bot-side | Bot-side |
Decision framework: Choose X if
Choose Future AGI if the goal is to drive false-positive rate down over time and the team wants per-repo policy in git rather than a UI. Pick when Sweep AI or CodeRabbit usage is past $5K/month combined and the cost-quality curve needs to bend downward.
Choose Portkey if the team wants a hosted gateway with mature RBAC, virtual keys, and a prompt-library UI, and the feedback loop isn’t yet a priority. Pick when the procurement story matters and observation-plus-routing is enough for now.
Choose Helicone if the deployment is small (one team, a few repos) and the goal is per-PR cost numbers. Pick for PoC-stage adoption; graduate to another pick when the deployment passes 1,000 PRs/month.
Choose LiteLLM if the codebase can’t leave the VPC. Pair with traceAI (Apache 2.0) if observability depth matters.
Choose Kong AI Gateway if the platform team already runs Kong for GitHub webhook ingress. Plan for the two-week dashboard build before finance accepts the chargeback view.
Common mistakes when wiring Sweep AI or a PR-review bot through a gateway
| Mistake | What goes wrong | Fix |
|---|---|---|
| Sharing one API key across every bot | Per-PR cost attribution is impossible; finance sees one line item | Issue virtual keys per repo or per bot, fanning out to the team’s provider key |
| Not setting PR ID as a header | Fan-out calls do not group; dashboard shows N orphans instead of one PR | Wire PR ID, repo, and author headers on the bot side |
| Routing every PR to the frontier model | Cost per PR doubles; latency budget blown on typo-fix PRs | Route by complexity — line count, file count, language mix |
| No latency budget at all | First-comment-arrives drifts to 4+ minutes; engineers stop reading the bot | Set a hard SLO (90s typical) with adaptive fallback in the gateway |
| Treating dismissed suggestions as noise | False-positive rate stays flat or drifts up; team disables the bot | Capture resolution status and feed it into the prompt or routing policy |
| Per-repo policy lives in a UI, not git | Policy drift; no review trail for compliance | Use a git-versioned policy file (.fi/review.yaml or equivalent) per repo |
| Audit retention shorter than compliance window | Auditor asks for evidence from 9 months ago; logs are gone | Configure retention to match the compliance window (12–18 months typical) |
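The "route by complexity" fix in the table can start as a plain heuristic before any learned routing is involved. The sketch below uses three of the signals named in this article (line count, file count, language mix); the thresholds and the `PullRequestStats` type are assumptions for illustration, and the model names are the ones used as examples earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequestStats:
    """Signals a bot can compute from the diff (type is hypothetical)."""
    changed_lines: int
    changed_files: int
    languages: frozenset


def route_model(pr, cheap="claude-haiku-4-5", frontier="claude-opus-4-7"):
    """Send small single-language PRs to the cheap model, everything else to the frontier model."""
    # Thresholds are illustrative; tune them against your own review-quality data.
    small = pr.changed_lines <= 20 and pr.changed_files <= 2 and len(pr.languages) <= 1
    return cheap if small else frontier


typo_fix = PullRequestStats(changed_lines=1, changed_files=1, languages=frozenset({"typescript"}))
refactor = PullRequestStats(changed_lines=900, changed_files=30, languages=frozenset({"typescript", "sql"}))
typo_model = route_model(typo_fix)
refactor_model = route_model(refactor)
```

The bot can either call this itself and pick the model, or send the raw stats as metadata and let the gateway's routing config make the same decision.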
How Future AGI closes the loop on PR review quality
The other four gateways treat PR-review observability as an end state: capture, dashboard, alert. Future AGI treats it as input to a feedback loop that drives false-positive rate down over weeks. Six stages:
1. Trace. Every PR-review run produces a span tree via traceAI (Apache 2.0). Root span is the PR; child spans are fan-out calls. Each child captures model, prompt-version hash, diff chunk, suggestion, and tool calls.
2. Evaluate. fi.evals (Apache 2.0) scores every suggestion against a useful-to-engineer rubric: correctness, actionability, non-duplication, respect for project-wide overrides.
3. Ingest resolution signal. A webhook from GitHub or GitLab forwards comment-resolution events. A suggestion resolved as “won’t fix” with no code change is a high-confidence false positive; one that triggered a code change is a high-confidence true positive.
4. Cluster. Low-scoring suggestions get clustered by failure mode. Typical clusters: “flagged eslint-disable when the project has a global override”; “duplicated the prior review’s comment on the same line.”
5. Optimize. fi.opt.optimizers (Apache 2.0) rewrites the prompt or adjusts the routing policy against clustered failures. It ships six optimizers (RandomSearchOptimizer; BayesianSearchOptimizer, which is Optuna-backed with teacher-inferred few-shot templates and resumable studies; MetaPromptOptimizer; ProTeGi; GEPAOptimizer; PromptWizardOptimizer), all sharing an EarlyStoppingConfig (patience, min_delta, threshold, max_evaluations) and the same unified Evaluator over 60+ FAGI rubrics.
6. Route and re-deploy. The gateway applies the updated prompt and policy on the next PR webhook. New prompts are versioned. Post-deploy regression triggers automatic rollback.
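Stage 3 is the part most teams can prototype on their own data before adopting any gateway. The sketch below applies the two high-confidence rules stated above (won't-fix with no code change is a false positive; a suggestion followed by a code change is a true positive) and leaves everything else unlabeled. The event shape and the `wont_fix` resolution string are assumptions, not the GitHub or GitLab webhook schema.

```python
from enum import Enum


class Label(Enum):
    FALSE_POSITIVE = "false_positive"
    TRUE_POSITIVE = "true_positive"
    UNKNOWN = "unknown"


def label_suggestion(resolution, code_changed):
    """Apply the two high-confidence rules; leave ambiguous cases unlabeled."""
    if code_changed:
        return Label.TRUE_POSITIVE
    if resolution == "wont_fix":
        return Label.FALSE_POSITIVE
    return Label.UNKNOWN


def false_positive_rate(events):
    """Share of false positives among suggestions that received a confident label."""
    labels = [label_suggestion(e["resolution"], e["code_changed"]) for e in events]
    decided = [label for label in labels if label is not Label.UNKNOWN]
    if not decided:
        return 0.0
    return sum(label is Label.FALSE_POSITIVE for label in decided) / len(decided)


# Hypothetical resolution events for four suggestions on one PR.
events = [
    {"resolution": "wont_fix", "code_changed": False},
    {"resolution": "accepted", "code_changed": True},
    {"resolution": "accepted", "code_changed": True},
    {"resolution": "dismissed", "code_changed": False},
]
rate = false_positive_rate(events)
```

Keeping an explicit unknown bucket avoids counting every dismissed comment as a false positive, which would overstate the rate and push the optimizer toward suppressing useful suggestions.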
Net effect on a 30-engineer team running CodeRabbit on a 40-repo monorepo: false-positive rate dropped from 21% to 12% over four weeks, while per-PR cost dropped 19% because the optimizer simultaneously identified PRs that didn’t need the frontier model.
The three building blocks are open source under Apache 2.0:
- traceAI, github.com/future-agi/traceAI
- ai-evaluation (fi.evals), github.com/future-agi/ai-evaluation
- agent-opt (fi.opt.optimizers), github.com/future-agi/agent-opt
Hosted Agent Command Center adds the failure-cluster view, live Protect guardrails (median 65 ms overhead per arXiv 2510.13351), RBAC with SAML, SOC 2 Type II certification alongside HIPAA, GDPR, and CCPA compliance, and an AWS Marketplace listing for procurement.
What we did not include
Three gateways that show up in other 2026 listicles but didn’t make the cut:
- OpenRouter. Strong for model exploration, but the consumer-facing routing doesn’t fit a PR-review workload that needs a stable, policy-controlled route.
- Cloudflare AI Gateway. Strong primitives but the GitHub App webhook integration story is thin as of May 2026; worker-based observability doesn’t yet do per-PR slicing without significant custom code.
- TrueFoundry. Solid MLOps gateway but the PR-review integration wasn’t stable in May 2026 testing.
All three are worth revisiting in Q3 2026.
Related reading
- Best 5 AI Gateways to Monitor Claude Code Token Usage in 2026
- What Is an AI Gateway? The 2026 Definition
- Best LLM Gateways in 2026
- Best AI Gateways for Agentic AI in 2026
- Best LLM Cost Tracking Tools in 2026
Sources
- Sweep AI documentation, sweep.dev/docs
- CodeRabbit documentation, coderabbit.ai/docs
- Greptile documentation, greptile.com/docs
- Future AGI Agent Command Center, futureagi.com/platform/monitor/command-center
- Portkey AI gateway, portkey.ai
- Helicone proxy, helicone.ai
- LiteLLM proxy, github.com/BerriAI/litellm
- Kong AI Gateway, konghq.com/products/kong-ai-gateway
- Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (65 ms text, 107 ms image)
- traceAI, github.com/future-agi/traceAI (Apache 2.0)
- ai-evaluation, github.com/future-agi/ai-evaluation (Apache 2.0)
- agent-opt, github.com/future-agi/agent-opt (Apache 2.0)