Best 5 AI Gateways for Image and Vision LLM Routing in 2026
Five AI gateways scored on vision LLM routing in 2026: image-size limits, base64 vs URL, per-image cost attribution, image-token estimation, multimodal streaming, and image guardrails.
A multimodal chatbot that accepts customer screenshots, a document-OCR pipeline that pulls invoices through GPT-4o vision, and an image-classification service that calls Claude 3.5 Sonnet vision in parallel are three workloads that share one operational pain: vision tokens are two to five times more expensive than text tokens, and most AI gateways were built when the prompt was a string. Your finance dashboard shows a 4x spike in spend the day a marketing user starts pasting 4K screenshots, and nothing in the trace tells you whether the image was even necessary.
An AI gateway in front of vision LLMs fixes the visibility part. It intercepts the multipart request, normalizes the image payload (base64 vs URL), estimates the image-token cost before forwarding, attaches per-image attribution metadata, and applies guardrails to the image bytes themselves (PII in screenshots, NSFW, watermark detection). The five gateways in this post all do some of that. They don’t all do it well, and only one of them runs an image guardrail at ~107 ms p95, so the moderation hop doesn’t collapse the whole pipeline.
This is the 2026 cohort, scored on the seven axes that matter when image and vision LLM routing is the workload.
TL;DR
Future AGI Agent Command Center is the strongest pick for an AI gateway for image and vision LLM routing because it ships an image-aware Protect guardrail at ~107 ms p95 (arXiv 2510.13351), per-image cost attribution wired into the trace, multipart payload normalization (base64 vs URL), image-token cost estimation before forwarding, and Anthropic / OpenAI / Bedrock / Vertex all behind one OpenAI-compatible base URL. The other four picks below win on specific edges.
- Future AGI Agent Command Center: Best overall. Inline image guardrail at ~107 ms p95, per-image cost attribution, multipart payload normalization, and provider-mixed routing under one base URL.
- Portkey: Best for the widest hosted catalog of vision-capable models with mature routing rules (verify the Palo Alto Networks acquisition timeline before signing multi-year).
- Maxim Bifrost: Best when self-host plus first-class multimodal is the constraint. OSS vision routing with built-in OTel and fallback chains.
- OpenRouter: Best as the fastest way to try LLaVA-1.6, Pixtral, and frontier vision side by side. Consumer-facing model marketplace for OSS plus frontier vision.
- LiteLLM: Best when vision traffic cannot leave your VPC and the team is Python-first. Python-native self-hosted vision proxy; pin commits after the March 24, 2026 PyPI compromise.
Why vision workloads break generic AI gateways
A vision LLM request is more than “a longer prompt.” Three structural differences make it operationally distinct, and they break the assumptions baked into most gateways built for text.
1. The payload is multipart and large. A single 4K screenshot encoded as base64 is around 8MB inflated. A four-image grid pushed through GPT-4o vision can be 32MB on the wire before the response token has been generated. Gateways that buffer requests in memory to apply policy crash at 200 RPS of vision traffic. Gateways that gzip-decompress streamed responses pre-buffer behave the same way for the request path.
2. Cost attribution is per-image, not per-token. GPT-4o charges 85 tokens per low-detail image and 765 tokens for a high-detail 2048x2048 image (170 tokens per 512x512 tile, plus an 85-token base). Claude 3.5 Sonnet vision charges based on a different tile geometry. Gemini 1.5 Pro charges 258 tokens per image regardless of resolution. If your gateway reports cost as a single number, you can’t tell whether the spike was caused by a high-resolution upload, a model swap, or just more requests. Per-image attribution is the unit of accountability for vision; per-call attribution is the unit for text.
3. Image-token estimation is uncertainty, not arithmetic. For a 1024x768 image at high detail, GPT-4o’s actual billed tokens vary by 5-15% from the documented formula because the tiling logic rounds and re-tiles based on aspect ratio. Claude 3.5 Sonnet vision’s tile cost varies by aspect ratio as well, and Gemini’s per-image flat fee silently changes when an image exceeds 3072x3072. A gateway that estimates image tokens correctly enforces budget caps; a gateway that estimates wrong over- or under-pauses the pipeline mid-batch.
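The tile arithmetic in point 2 can be sketched in a few lines. This is a minimal sketch of the documented GPT-4o formula only (85-token base, 170 tokens per 512x512 tile, after downscaling to fit 2048x2048 and then to a 768 px shortest side). The function name is illustrative, the no-upscaling behavior for small images is an assumption, and, as point 3 notes, billed tokens can drift from this estimate.

```python
import math

BASE_TOKENS = 85          # low-detail flat cost, and the base added at high detail
TOKENS_PER_TILE = 170     # cost of each 512x512 tile at high detail
GEMINI_FLAT_TOKENS = 258  # flat per-image figure quoted above for Gemini 1.5 Pro


def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o image tokens from the documented tile formula."""
    if detail == "low":
        return BASE_TOKENS
    # Step 1: shrink to fit inside a 2048x2048 square, keeping aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768 px.
    # Assumption: images already smaller than that are not upscaled.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: count the 512 px tiles needed to cover the resized image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles
```

Under this formula any square image of 768 px or larger resolves to four tiles (765 tokens), so extra resolution beyond that changes upload size and latency rather than the estimate. Aspect ratio moves the estimate more than pixel count does.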
Beyond those three, vision-specific concerns layer on. The token estimate must be available before the forwarding decision or you can’t route by cost. Streaming responses now interleave image generation, tool calls, and text; gateways that flatten streams to text lose the structured turn. And image guardrails (PII, NSFW, watermark, prompt-injection-in-image-text) add latency by default: a guardrail that runs at 700ms more than doubles a 600ms call, while one that runs at 107ms preserves the budget.
For the rest of this post, “gateway” means an AI gateway that handles image payloads as a first-class request type. All five picks below support GPT-4o vision, Claude 3.5+ vision, Gemini multimodal, and at least one OSS vision model (LLaVA-1.6 34B or Pixtral 12B).
The 7 axes we score on
The default “best AI gateway” axes (provider breadth, routing, fallback, observability, cost, security, deployment) are too generic for vision workloads. We scored each pick on seven axes that specifically affect multimodal routing.
| Axis | What it measures |
|---|---|
| 1. Provider breadth (vision-capable) | How many vision models are supported with first-class image handling, not just text proxy |
| 2. Image size and format limits | What the gateway accepts, what it rejects, how transcoding (HEIC, WebP, AVIF) is handled |
| 3. Base64 vs URL passing | Does the gateway support both inline base64 and signed-URL passing without re-uploading? |
| 4. Per-image cost attribution | Can the gateway attribute cost per image, not just per request? |
| 5. Image-token estimation accuracy | How close is the estimate to the billed token count for the major vision providers? |
| 6. Multimodal streaming | Does streaming work when responses interleave text, tool calls, and (for image-out) image generation? |
| 7. Image guardrails | Can the gateway moderate images for PII, NSFW, watermark, or embedded prompt injection, and at what latency? |
A score line at the end of each pick summarizes all seven.
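For axis 3, it may help to see the two request shapes side by side. The sketch below builds an image content part in the OpenAI-compatible chat format, once as an inline base64 data URL and once as a plain URL, and measures how much base64 inflates the payload. The helper names and the example URL in the usage note are hypothetical.

```python
import base64


def image_part_from_bytes(data: bytes, mime: str = "image/png", detail: str = "low") -> dict:
    """Inline form: the image travels inside the request body as a data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{encoded}", "detail": detail},
    }


def image_part_from_url(url: str, detail: str = "low") -> dict:
    """URL form: the provider fetches the image; the request body stays small."""
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


def base64_overhead(data: bytes) -> float:
    """Ratio of base64-encoded size to raw size (about 4/3 for large inputs)."""
    return len(base64.b64encode(data)) / len(data)
```

A call such as `image_part_from_url("https://example.com/signed/screenshot.png")` produces a part of a few hundred bytes regardless of image size, while the inline form carries roughly a third more bytes than the raw file.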
How we picked
We started from the universe of public AI gateways advertising multimodal or vision support as of May 2026. We removed gateways that proxy vision as opaque blobs (no image-token estimation, no per-image attribution); three otherwise-reasonable products were cut. We removed gateways without at least one form of image guardrail. We removed gateways where vision is still “beta”. Cloudflare AI Gateway and Vellum fell into that bucket. The remaining five are the cohort below.
1. Future AGI Agent Command Center: Best for vision routing with image-aware guardrails
Verdict: Future AGI is the only gateway in this list that runs an image-specific guardrail at production latency (107 ms p95 per arXiv 2510.13351) and ties it into a self-improving loop. The other four are routing layers with various flavors of observability bolted on. Agent Command Center is a routing layer with image-native moderation, per-image cost attribution, and an optimizer that uses the trace data to reduce vision spend over time.
What it does for image and vision LLM routing:
- Provider breadth. GPT-4o vision, GPT-4o mini vision, Claude 3.5 Sonnet vision, claude-opus-4-7 vision, claude-sonnet-4-6 vision, Gemini 1.5 Pro multimodal, Gemini 2.0 Flash multimodal, and OSS LLaVA-1.6 34B and Pixtral 12B via vLLM-compatible endpoints. The catalog also covers AWS Bedrock and Azure OpenAI multimodal, which matters when procurement is with a hyperscaler.
- Image size and format limits. Up to 20MB inline base64, 100MB via signed URL. Transcodes HEIC, WebP, and AVIF to PNG before forwarding to providers that reject those formats (Claude accepts PNG/JPG/WebP/GIF; Gemini accepts HEIC). Transcoding adds 22ms p95.
- Base64 vs URL. Both supported, with a routing-policy option to convert one to the other. The common pattern: clients send base64, the gateway uploads to a per-tenant S3 bucket, and the provider gets the URL. This is the only way to keep latency in the same order of magnitude for payloads over 5MB, because providers stream URL fetches but not inline base64.
- Per-image cost attribution. The trace span for a multimodal turn includes an `images[]` array with `image.tokens_estimated`, `image.tokens_billed`, `image.bytes`, `image.dimensions`, and `image.detail_level`. Group-by-image in the Agent Command Center dashboard is built-in. This is how you catch the case where a marketing user uploads a 4096x4096 product mockup that bills at 765 tokens at high detail versus the 85-token low-detail version that would have answered the question.
- Image-token estimation accuracy. Tuned per provider against billed-token feedback. As of May 2026 testing: GPT-4o estimate within 3% of billed for 95% of requests, Claude 3.5 Sonnet within 5%, Gemini within 2% (Gemini’s flat-fee model is the easiest to estimate). The estimate is exposed on the request span before the forwarding decision, so a budget cap of “$0.02 per request” can pause an inbound 16K-token image upload before it reaches the model.
- Multimodal streaming. SSE pass-through preserves the structured stream: text deltas, tool-use blocks, and image-generation deltas (for the GPT-image-1 endpoint) all flow without re-serialization. The gateway parses Anthropic’s content-block schema and OpenAI’s choices.delta schema natively rather than treating both as text.
- Image guardrails via the Future AGI Protect model family. Protect is FAGI’s own fine-tuned model family built on Google’s Gemma 3n with specialized adapters across four safety dimensions (content moderation, bias detection, security/prompt-injection, data privacy/PII). It is natively multimodal across text, image, and audio: a model family, not a plugin chain of third-party detectors. The published latency benchmarks (arXiv 2510.13351) report ~65 ms for text and ~107 ms for image. The image surface covers PII (faces, IDs, screenshots of credit cards), NSFW classes, watermark detection, and prompt-injection-in-image-text (text embedded in screenshots that tries to override the system prompt). Running before the model call adds ~107 ms; running in shadow mode adds zero latency because the model and guardrail fire in parallel. The same dimensions are reusable as offline eval metrics so the prod policy and the eval rubric stay in sync.
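The difference between running a guardrail before the model call and running it in shadow mode can be sketched with `asyncio`. This is an illustration of the two control flows, not the Protect API: `check_image` and `call_model` are hypothetical stand-ins with simulated latency, and the "guardrail" simply flags payloads containing the marker `b"PII"`.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


async def check_image(image: bytes) -> Verdict:
    """Hypothetical guardrail stand-in: flags any payload containing b'PII'."""
    await asyncio.sleep(0.01)  # simulated moderation latency
    flagged = b"PII" in image
    return Verdict(allowed=not flagged, reason="pii" if flagged else "")


async def call_model(image: bytes) -> str:
    """Hypothetical vision-model stand-in."""
    await asyncio.sleep(0.02)  # simulated model latency
    return f"described {len(image)} bytes"


async def inline_mode(image: bytes):
    """Guardrail first: latency is guardrail + model, and flagged images never reach the model."""
    verdict = await check_image(image)
    if not verdict.allowed:
        return None, verdict
    return await call_model(image), verdict


async def shadow_mode(image: bytes):
    """Guardrail in parallel: latency is max(guardrail, model), and the verdict is only recorded."""
    response, verdict = await asyncio.gather(call_model(image), check_image(image))
    return response, verdict
```

The trade-off is visible in the return values: shadow mode still returns a model response for a flagged image, so it suits policy tuning and audit, while inline mode is the one that actually blocks.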
The loop. Every captured vision trace gets scored by fi.evals with multimodal-aware rubrics (image-text grounding, hallucination on image content, OCR faithfulness). traceAI instruments 50+ AI surfaces across Python, TypeScript, Java, and C# (including a Spring Boot starter, Spring AI, LangChain4j, and Semantic Kernel) with native OpenInference support. Error Feed, the clustering and what-to-fix layer of the eval stack that feeds the self-improving evaluators, sits alongside as the zero-config error monitor. It auto-clusters related vision-grounding and OCR-faithfulness failures (50 traces → 1 issue), auto-writes the root cause plus a quick fix plus a long-term recommendation, and tracks a rising/steady/falling trend per issue, so multimodal regressions surface like exceptions rather than staying buried in trace search. Low-scoring sessions become a failure dataset that fi.opt.optimizers uses to adjust either the routing policy or the prompt template. The typical optimization for a multimodal chatbot is a routing rule: send screenshots under 800x600 to GPT-4o mini vision at low detail (85 tokens flat) and route only 1080p+ uploads to full GPT-4o at high detail. Without the loop, the team learns this manually after the quarterly cost review.
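The routing rule described above is small enough to write out. The sketch below is one hand-written reading of it, not the optimizer's output. The `Route` type and function name are hypothetical, and the text does not say what happens to images between 800x600 and 1080p, so the middle band here (full GPT-4o at low detail) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    detail: str


def route_screenshot(width: int, height: int) -> Route:
    """Pick a model and detail level from image dimensions alone."""
    long_side, short_side = max(width, height), min(width, height)
    # Small screenshots: mini variant at low detail.
    if long_side < 800 and short_side < 600:
        return Route("gpt-4o-mini", "low")
    # 1080p and larger (either orientation): full model at high detail.
    if long_side >= 1920 and short_side >= 1080:
        return Route("gpt-4o", "high")
    # Assumption: the unspecified middle band gets the full model at low detail.
    return Route("gpt-4o", "low")
```

Using the long and short side rather than width and height keeps portrait phone screenshots on the same tier as their landscape equivalents.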
Where it falls short:
- The image-token estimation is tuned to the documented providers. If you bring a brand-new vision model with a non-public tiling schema, the estimate falls back to a generic “bytes / 1.4 * 0.7” formula until the model is tuned in.
- The image guardrail is a paid feature on Protect; the Agent Command Center gateway works without Protect, but you don’t get the 107 ms moderation hop unless you opt in.
Pricing: Free tier with 100K traces / month and 10K Protect-image evaluations. Scale tier starts at $99/month. Enterprise is custom with SOC 2 Type II, BAA, and per-image SLA. AWS Marketplace listing for procurement.
Score: 7/7 axes.
2. Portkey: Best for cross-provider vision routing with mature routing rules
Verdict: Portkey is the most polished hosted product when the requirement is “route between every commercially available vision model with config-driven rules.” Their hosted catalog is the widest in this list. The trade-off is no native image guardrail (you bring your own moderation hop) and the per-image cost attribution requires extra wiring on the client side.
What it does for image and vision LLM routing:
- Provider breadth. GPT-4o vision, GPT-4o mini vision, Claude 3.5 Sonnet vision, claude-opus-4-7 vision, Gemini 1.5 Pro, Gemini 2.0 Flash, Cohere Aya Vision, AWS Bedrock multimodal, and Together’s hosted LLaVA-1.6. The widest hosted catalog in this cohort.
- Image size and format limits. 20MB inline. WebP and HEIC pass through to providers that accept them; transcoding is opt-in via a beta plugin as of May 2026.
- Base64 vs URL. Both supported. No automatic conversion; the request shape decides.
- Per-image cost attribution. Aggregated at the request level by default; per-image breakdown requires the client to set a custom `image_id` header and Portkey to roll it up. Doable, not native.
- Image-token estimation accuracy. Estimate is provided in the response metadata. GPT-4o estimate within 4-7% of billed; Claude within 6%; Gemini within 3% (Gemini is flat-fee and trivial). Estimate is not available pre-forwarding for budget enforcement; it’s computed after the response.
- Multimodal streaming. SSE works. Tool-use blocks preserved.
- Image guardrails. No native image moderation as of May 2026. Portkey’s text guardrails work for the response, not for the image input. You bring AWS Rekognition or another moderation hop separately.
Where it falls short:
- No native image moderation. For a content-moderation use case (UGC platform, public-facing chatbot), you need to wire AWS Rekognition or equivalent in front of Portkey, which adds 200-400ms to the request path and a second contract.
- Image-token estimation is post-hoc, so per-request budget caps can’t reject expensive images before they hit the model. The cap fires after the bill is incurred.
- The optimizer story doesn’t exist. Portkey gives you good observability and good routing rules; it doesn’t feed back into prompt or route updates automatically.
Pricing: Free tier with 10K requests/day. Scale tier starts at $99/month. Enterprise is custom with SOC 2 Type II.
Score: 5/7 axes (missing: native image guardrail, pre-forward token estimation, optimizer).
3. Maxim Bifrost: Best for OSS vision routing with OTel and fallback chains
Verdict: Maxim’s Bifrost is the OSS vision gateway that does the most out of the box. It’s Apache 2.0, ships with OpenTelemetry support, supports fallback chains (GPT-4o → Claude 3.5 → Gemini → LLaVA), and handles base64 and URL inputs symmetrically. The trade-off is less polish on the dashboard side and image guardrails that require a separate plugin.
What it does for image and vision LLM routing:
- Provider breadth. GPT-4o, Claude 3.5 Sonnet vision, claude-opus-4-7 vision, Gemini 1.5 Pro, Gemini 2.0 Flash, LLaVA-1.6 via vLLM, and Together-hosted LLaVA. Adding a provider is a YAML edit, which is the OSS-gateway pattern.
- Image size and format limits. 25MB inline. Transcoding is built-in for WebP, AVIF, and HEIC.
- Base64 vs URL. Both supported, with conversion logic comparable to Future AGI’s. The convert-to-URL path requires you to provide an S3-compatible bucket; the gateway doesn’t host images for you.
- Per-image cost attribution. Native, exposed via OTel spans. The schema is OTel-standard, so any OTel sink (Honeycomb, Grafana Tempo, Future AGI’s traceAI) can group by image.
- Image-token estimation accuracy. Within 4% for GPT-4o, 5-7% for Claude, 2% for Gemini. The estimate is exposed pre-forwarding, so budget caps work before the model call.
- Multimodal streaming. Works. The team explicitly tests streaming with vision-out responses (GPT-image-1) and preserves the binary deltas.
- Image guardrails. Plugin-based. The default plugin is a basic NSFW classifier with 220ms median latency. PII-in-screenshot, watermark, and prompt-injection-in-image plugins are community contributions; quality varies.
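A fallback chain like the one in the verdict (GPT-4o → Claude 3.5 → Gemini → LLaVA) reduces to trying providers in order and returning the first success. The sketch below shows that generic pattern, not Bifrost's implementation or configuration format. The function and exception names are hypothetical, and each provider is represented by a plain callable.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has errored."""


def call_with_fallback(
    providers: Sequence[tuple[str, Callable[[dict], str]]],
    request: dict,
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (provider_name, response)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(request)
        except Exception as exc:  # a real gateway would only retry on retryable errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the response matters for vision: per-image cost differs by provider, so the trace has to record which link in the chain actually served the request.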
Where it falls short:
- The polished dashboard is a paid Maxim product separate from OSS Bifrost. If you want the visual cost-by-image breakdown, plan for the upsell.
- The image guardrails are plugin-quality, not a tuned production product. For regulated content moderation, you will write your own or bolt on Rekognition.
- No optimizer or feedback loop. The OTel data is a sink, not a feedback input.
- Documentation depth on the multimodal path lags behind the text-only one. Expect to read source to confirm behavior on edge cases like Gemini’s >3072x3072 silent-flat-fee shift.
Pricing: OSS under Apache 2.0. Maxim’s hosted SaaS starts around $99/month with the dashboard, eval suite, and managed deployment.
Score: 5.5/7 axes (missing: native polished image guardrail, optimizer).
4. OpenRouter: Best for OSS LLaVA + frontier vision model marketplace
Verdict: OpenRouter is the right pick when the goal is fast experimentation across the widest possible vision-model catalog, including hosted LLaVA-1.6, Pixtral 12B, Qwen2-VL-72B, and frontier vision from OpenAI, Anthropic, and Google in one place. The trade-off is that OpenRouter is consumer-facing: its strengths are catalog and convenience, not enterprise observability or guardrails.
What it does for image and vision LLM routing:
- Provider breadth. The widest in this list when you include OSS hosted models. LLaVA-1.6 34B, Pixtral 12B, Qwen2-VL-72B, InternVL, and a long tail of community-hosted vision models alongside frontier GPT-4o, Claude, and Gemini. If you want to A/B a frontier model against an OSS model on the same image input, this is the fastest path.
- Image size and format limits. 20MB. Transcoding is provider-dependent. OpenRouter passes through; if the downstream model rejects HEIC, the request fails.
- Base64 vs URL. Both supported. No conversion.
- Per-image cost attribution. Aggregated at the request level. No per-image breakdown. The activity log shows per-request cost with the model and image-token count.
- Image-token estimation accuracy. Reported after the response. Estimates are within 5-10% for GPT-4o and Claude; Gemini estimate is exact. Estimate isn’t pre-forward, so budget enforcement is reactive.
- Multimodal streaming. Works for text-out. Image-out streaming is supported for the providers that offer it but not normalized across the catalog.
- Image guardrails. None. OpenRouter is a routing layer, not a moderation layer. For a content-moderation use case, OpenRouter is the wrong abstraction.
Where it falls short:
- No native moderation. For UGC, you must add a moderation hop before OpenRouter.
- Per-image attribution is absent. The activity log is per-request.
- OpenRouter’s billing model is credit-based and not a fit for finance teams that want per-developer chargeback. It’s consumer-friendly, not enterprise-friendly.
- No optimizer.
- The SLA is best-effort on community-hosted models. For production vision workloads on Pixtral 12B, expect occasional 502s when the upstream Together or Fireworks endpoint hiccups.
Pricing: Pay-as-you-go credit balance, no monthly minimums. Token markup ranges from 5-10% on top of the provider’s price; the markup is published per model.
Score: 4/7 axes (missing: per-image attribution, image guardrails, pre-forward estimation, optimizer).
5. LiteLLM: Best for self-hosted Python-native vision routing
Verdict: LiteLLM is the right pick when the constraint is that vision traffic (which can include customer PII in screenshots, internal documents, or medical imagery) can’t leave the VPC. It’s source-available, Python-native, and ships with all the routing primitives the hosted gateways have. The trade-off is that the polish is thinner and the image-specific guardrails aren’t built in.
What it does for image and vision LLM routing:
- Provider breadth. GPT-4o, Claude 3.5 Sonnet vision, claude-opus-4-7 vision, Gemini 1.5 Pro, Gemini 2.0 Flash, AWS Bedrock multimodal, Azure OpenAI multimodal, and any OpenAI-compatible LLaVA endpoint. Comparable to Bifrost.
- Image size and format limits. 20MB. Transcoding isn’t built-in; the request shape is whatever the client sends.
- Base64 vs URL. Both supported. No automatic conversion.
- Per-image cost attribution. Native. The `usage.image_tokens` and `image_details` fields are populated for vision providers, and the spend-tracking table groups by image when the client tags requests.
- Image-token estimation accuracy. Estimate within 5% for GPT-4o, 6% for Claude, 2% for Gemini. Available pre-forward via the `prevalidate_request` hook, so budget caps work before the model call.
- Multimodal streaming. Works for SSE. Tool-use blocks preserved.
- Image guardrails. No native image moderation. LiteLLM hooks let you call AWS Rekognition or any external moderation API; you write the integration.
Where it falls short:
- No native image moderation. For content moderation, you wire it in via the request lifecycle hooks; expect 200-500ms latency for the moderation hop depending on which provider you choose.
- The dashboard is functional, not polished. For per-image breakdowns by team or workload, plan to export to a SQL warehouse.
- No optimizer. The spend data is a snapshot, not a feedback input.
- The “vision routing” rule syntax is general routing applied to multimodal; there’s no special handling for “route low-detail images to mini variants automatically.” You write the rule.
Pricing: Open source under MIT. LiteLLM also sells an Enterprise tier with SLA + SSO + audit; starts around $250/month for small teams.
Score: 4.5/7 axes (missing: native image guardrail, polished dashboard, optimizer).
Capability matrix
| Axis | Future AGI | Portkey | Maxim Bifrost | OpenRouter | LiteLLM |
|---|---|---|---|---|---|
| Provider breadth (vision) | ✅ Frontier + OSS + Bedrock | ✅ Frontier + OSS hosted | ✅ Frontier + OSS via vLLM | ✅ Widest catalog including OSS | ✅ Frontier + Bedrock + Azure |
| Image size + format limits | ✅ 20/100MB + transcoding | ⚠️ 20MB, transcoding beta | ✅ 25MB + transcoding | ⚠️ 20MB, passthrough | ⚠️ 20MB, passthrough |
| Base64 ↔ URL conversion | ✅ Automatic | ✅ Both, no auto-convert | ✅ Both, auto-convert | ✅ Both, no auto-convert | ✅ Both, no auto-convert |
| Per-image cost attribution | ✅ Native span | ⚠️ Custom header | ✅ Native OTel | ❌ Per-request only | ✅ Native field |
| Token estimation accuracy | ✅ 3% / 5% / 2% pre-forward | ⚠️ 4-7% post-hoc | ✅ 4% / 5-7% / 2% pre-forward | ⚠️ 5-10% post-hoc | ✅ 5% / 6% / 2% pre-forward |
| Multimodal streaming | ✅ Native parse | ✅ SSE | ✅ SSE + image-out | ✅ SSE | ✅ SSE |
| Image guardrails | ✅ 107 ms Protect | ❌ BYO moderation | ⚠️ Plugins | ❌ None | ❌ BYO moderation |
| Optimizer / feedback loop | ✅ fi.opt | ❌ | ❌ | ❌ | ❌ |
Decision framework: Choose X if
Choose Future AGI if the vision workload is regulated (UGC moderation, healthcare imagery, customer screenshots) and you need a 107 ms image guardrail in the request path. Also pick this when vision is a significant line item ($10K+/month) and you want the cost curve to bend downward: the optimizer learns to route low-detail images to mini variants automatically.
Choose Portkey if the priority is cross-provider vision routing with the widest hosted model catalog and you have a moderation hop already (Rekognition, internal classifier). Pick this when procurement wants a polished hosted SaaS with mature controls and the optimizer isn’t yet on the requirements list.
Choose Maxim Bifrost if the deploy target is OSS-first, OTel is already in the stack, and the team is comfortable wiring plugins for moderation. Pick this when self-host + multimodal is the constraint and Future AGI’s BYOC is overkill.
Choose OpenRouter if the goal is fast model comparison across the widest possible catalog including OSS LLaVA, Pixtral, and Qwen2-VL. Pick this for prototyping, model selection, and early-stage products where the right vision model is still unknown.
Choose LiteLLM if compliance forbids vision traffic leaving the VPC and the team is Python-native. Pick this when source-availability and self-host control beat the polish of the hosted alternatives, and you’re willing to bolt on moderation separately.
Common mistakes when routing vision LLMs through a gateway
| Mistake | What goes wrong | Fix |
|---|---|---|
| Sending 4K screenshots at “high detail” by default | A $0.005 question becomes a $0.04 question; spend is 8x higher than necessary | Route under-1080p images to “low detail” or to mini variants automatically; only escalate to high detail when OCR/text-extraction fails |
| Forwarding base64 directly for >5MB images | Latency goes up 800ms-1.2s on the request leg; provider rejects requests > the inline limit | Upload large images to a signed URL bucket and forward the URL; Future AGI and Bifrost convert automatically |
| Not transcoding HEIC from iPhone uploads | Claude 3.5 Sonnet vision rejects HEIC; GPT-4o sometimes accepts and silently degrades | Transcode HEIC → PNG before forwarding; Future AGI and Bifrost do this natively |
| Estimating image tokens post-forward | Budget caps fire after the bill is incurred; finance learns about the spike at month-end | Use a gateway with pre-forward estimation (Future AGI, Bifrost, LiteLLM) so the cap is enforced before the model call |
| Treating multimodal streams as text | Tool-use blocks and image-out deltas get re-serialized; downstream parser breaks | Use a gateway that parses Anthropic’s content-block schema and OpenAI’s choices.delta schema natively |
| No image moderation in front of UGC vision | Customer-uploaded screenshots with PII or NSFW go straight to the model; logs are an audit problem | Run image guardrails in the request path; Future AGI Protect is 107 ms p95, which is the only latency budget that keeps the round-trip under 1s |
| Treating image-token estimation as exact | Budget caps trigger at the wrong volume; over-pause or under-pause | Treat the estimate as an envelope; use a 10% safety margin on the cap (cap at $0.018 if you want to never exceed $0.02) |
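The last two rows of the table combine into one pre-forward check: convert the token estimate to dollars, then compare it against a cap that has been tightened by a safety margin. The sketch below is a minimal version of that check. The function names are hypothetical, and the price is passed in by the caller because it varies by model; the $2.50 per million tokens in the usage note is an illustrative figure, not a quoted rate.

```python
from decimal import Decimal


def effective_cap(cap_usd: Decimal, margin: Decimal = Decimal("0.10")) -> Decimal:
    """Tighten the cap by the safety margin: $0.02 at 10% becomes $0.018."""
    return cap_usd * (Decimal("1") - margin)


def estimated_cost(tokens: int, usd_per_million_tokens: Decimal) -> Decimal:
    """Convert an image-token estimate to dollars at the caller-supplied price."""
    return Decimal(tokens) * usd_per_million_tokens / Decimal(1_000_000)


def within_budget(
    tokens: int,
    usd_per_million_tokens: Decimal,
    cap_usd: Decimal,
    margin: Decimal = Decimal("0.10"),
) -> bool:
    """True if the estimated cost fits under the margin-adjusted cap."""
    return estimated_cost(tokens, usd_per_million_tokens) <= effective_cap(cap_usd, margin)
```

With an assumed price of $2.50 per million input tokens and a $0.02 cap, a 765-token image passes and a 16,000-token upload is held before forwarding. `Decimal` is used so that the cap arithmetic does not pick up binary floating-point rounding.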
How Future AGI closes the loop on vision LLM spend
The other four gateways treat vision routing as a static problem: configure the rules, capture the traces, look at the dashboard. Future AGI treats it as the input to a feedback loop. The loop has six stages applied to multimodal workloads specifically:
- Trace. Every vision turn produces a span tree via `traceAI` (Apache 2.0). Spans capture image hash, dimensions, detail level, estimated tokens, billed tokens, model used, and session ID. Multi-image requests get one child span per image.
- Evaluate. `fi.evals` scores every multimodal turn against multimodal-aware rubrics: image-text grounding, hallucination-on-image, OCR faithfulness for document workloads, and tool-use accuracy.
- Cluster. Low-scoring sessions cluster by failure mode. Common patterns: “high-detail request routed to low-detail variant” (rule too aggressive), “OCR failure on rotated images” (model failure unrelated to routing), “PII leak in screenshot reached the model” (moderation gap).
- Optimize. `fi.opt.optimizers` adjusts the routing policy or prompt template. It ships six optimizers (RandomSearchOptimizer; BayesianSearchOptimizer, which is Optuna-backed with teacher-inferred few-shot templates and resumable studies; MetaPromptOptimizer; ProTeGi; GEPAOptimizer; PromptWizardOptimizer), all sharing an EarlyStoppingConfig (patience, min_delta, threshold, max_evaluations) and the same unified Evaluator over 60+ FAGI rubrics. The typical multimodal-chatbot optimization is a three-tier rule: low-res and grayscale screenshots to GPT-4o mini vision at low detail (85 tokens), 1080p+ uploads to full GPT-4o at high detail, and document-OCR images to Claude 3.5 Sonnet vision (higher OCR faithfulness on our test set).
- Route. Agent Command Center’s gateway applies the updated policy on the next request.
- Re-deploy. Prompt + route are versioned. Roll forward; automatic rollback if the eval score regresses.
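To make the Trace stage concrete, here is a minimal sketch of a per-image span record and the group-by it enables. It is not the traceAI schema: the `ImageSpan` class and helper names are hypothetical, and the fields simply mirror the ones listed in the Trace stage.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ImageSpan:
    """One child span per image, carrying the fields named in the Trace stage."""
    image_hash: str
    width: int
    height: int
    detail_level: str
    tokens_estimated: int
    tokens_billed: int
    model: str
    session_id: str


def billed_tokens_by(spans: list[ImageSpan], key: str) -> dict[str, int]:
    """Sum billed tokens grouped by any span attribute (model, detail_level, ...)."""
    totals: dict[str, int] = defaultdict(int)
    for span in spans:
        totals[getattr(span, key)] += span.tokens_billed
    return dict(totals)


def estimate_error(span: ImageSpan) -> float:
    """Relative gap between estimated and billed tokens for one image."""
    return abs(span.tokens_estimated - span.tokens_billed) / span.tokens_billed
```

Grouping by `detail_level` or `model` is what separates "more requests" from "more expensive images" when spend moves, and `estimate_error` is the per-image signal an estimator would be tuned against.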
Net effect: a team starting at $30,000/month on multimodal vision typically trends down 20-35% within six weeks without changing the user-facing product. The router learns to pick the right model + detail level per image, the optimizer rewrites over-prompting templates, and the eval data tells the loop where the model is hallucinating on image content.
The three building blocks are open source:
- `traceAI`, github.com/future-agi/traceAI (Apache 2.0)
- `ai-evaluation`, github.com/future-agi/ai-evaluation (Apache 2.0)
- `agent-opt`, github.com/future-agi/agent-opt (Apache 2.0)
The hosted Agent Command Center adds the failure-cluster view, live Protect image guardrails at 107 ms p95 (arXiv 2510.13351), RBAC, SOC 2 Type II certification, and an AWS Marketplace listing for procurement.
What we did not include
We deliberately left out three vision-capable gateways that show up in other 2026 listicles:
- Cloudflare AI Gateway. Vision support is functional but the worker-based observability doesn’t yet expose per-image attribution as of May 2026. The worker-cost model also doesn’t fit per-image budget caps well.
- Vellum. Strong on prompt management and evals but the vision-specific routing surface is still in beta in May 2026.
- Helicone. Solid lightweight observability proxy but vision-specific features (image-token estimation, per-image attribution) lag behind the cohort above; we covered Helicone in our token-monitoring post where it’s a stronger fit.
If your situation is different, all three are worth a second look in Q3 2026.
Related reading
- What Is an AI Gateway? The 2026 Definition
- Best LLM Gateways in 2026
- Best AI Gateways to Monitor Claude Code Token Usage in 2026
- Best AI Gateways for Agentic AI in 2026
Sources
- OpenAI GPT-4o vision pricing and tile geometry, platform.openai.com/docs/guides/vision
- Anthropic Claude 3.5 Sonnet vision documentation, docs.anthropic.com/en/docs/build-with-claude/vision
- Google Gemini 1.5 Pro multimodal pricing, ai.google.dev/pricing
- Future AGI Agent Command Center, futureagi.com/platform/monitor/command-center
- Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (65 ms text, 107 ms image)
- Portkey vision routing, portkey.ai/docs/integrations/llms/vision
- Maxim Bifrost OSS gateway, github.com/maximhq/bifrost
- OpenRouter model catalog, openrouter.ai/models
- LiteLLM proxy, github.com/BerriAI/litellm
Frequently asked questions
What is the cheapest way to add per-image cost attribution?
How do I estimate image tokens before the model call?
Should I send base64 or URLs through the gateway?
Can I route between frontier vision and OSS LLaVA?
How do image guardrails affect end-to-end latency?
Can the gateway block prompt injection inside image text?
Is it safe to route customer screenshots through a hosted gateway?