What is an AI Gateway? Governance, Routing, and Observability in 2026
An AI gateway sits between applications and LLM providers to handle governance, routing, and observability. What it is, how it differs from an API gateway, and why teams adopt it in 2026.
A finance team’s first AI rollout was a single bot calling OpenAI from a Python service. Six months later there are fourteen services, three providers, two clouds, and one CFO who wants to know exactly how much the company spent on AI last quarter, broken down by team and use case. Engineering finds: each service has its own keys, the keys are not rotated, three of the keys are over their monthly budget, the cost dashboard is a manual SQL export into a spreadsheet, and one of the services started leaking PII into prompts last week without anyone noticing. Every one of these problems has a name; collectively they are what an AI gateway exists to solve.
This piece walks through what an AI gateway is, what it does that an API gateway does not, the architecture, the major products in 2026, and the patterns teams use when they adopt one.
TL;DR: What an AI gateway is
An AI gateway is a service that sits between applications and LLM providers and centralizes three concerns: governance (auth, budgets, guardrails, PII redaction, audit logs), routing (provider translation, fallback, load balancing, semantic caching), and observability (tracing, cost attribution, eval scores, dashboards). The gateway exposes a single OpenAI-compatible endpoint; applications call it instead of provider SDKs. The major OSS self-hostable gateways are LiteLLM Proxy (MIT for non-enterprise code), Helicone Gateway, and Portkey AI Gateway (open source plus hosted/enterprise); major hosted gateways are OpenRouter and Cloudflare AI Gateway; TrueFoundry covers self-hostable enterprise / on-prem; plugin-style products include Kong AI Gateway and Apigee. Datadog LLM Observability and AWS Bedrock Guardrails are adjacent observability and guardrail tools that pair with a router rather than ship as gateways themselves; FutureAGI’s Agent Command Center bundles gateway capabilities with deeper LLM observability and eval. The line between “AI gateway” and “LLM gateway” is fuzzy; both terms point at the same architecture, with “AI gateway” usually emphasizing the governance and observability layers on top of routing.
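In code, “exposes a single OpenAI-compatible endpoint” means the application keeps the standard OpenAI SDK and changes only the base URL and key. A minimal sketch, assuming a hypothetical in-VPC gateway address and a virtual key it issued:

```python
# Minimal sketch: point the standard OpenAI SDK at the gateway instead of
# api.openai.com. The base_url and virtual key below are hypothetical.
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.internal:4000/v1",  # the gateway, not the provider
    api_key="vk-team-payments-001",               # virtual key issued by the gateway
)

# The gateway resolves "model" to a provider, applies guardrails and budgets,
# and records the call; the application code stays provider-agnostic.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this invoice dispute."}],
)
print(response.choices[0].message.content)
```

Because the only gateway-specific parts are the base URL and the key, swapping gateways (or moving from hosted to self-hosted) is a config change rather than a code change.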
Why AI gateways exist
Three forces converged.
First, the LLM provider list grew past one. By 2026 a serious provider list includes OpenAI, Anthropic, Bedrock, Vertex, Azure OpenAI, Mistral, Cohere, Groq, Together, Fireworks, DeepSeek, plus open-weight models served by vLLM or Ollama. Each provider has its own SDK, its own auth, its own retry semantics. Application code that targets one provider does not move to another without rewrites. A gateway absorbs the heterogeneity.
Second, governance moved from “nice to have” to “compliance prerequisite.” PII redaction, audit logs, per-team budgets, key rotation, and prompt-injection protection are now table stakes for enterprise AI rollouts. Building those at the application layer is duplicated cost across services. Centralizing them in one gateway is the obvious play.
Third, the cost story got real. Frontier-model spend per active user (which varies with workload, model, and token volume) is high enough that uncapped usage routinely produces five- to six-figure monthly invoices for small teams. Teams need per-user, per-team, per-feature budgets that are token-aware (a budget that counts requests is meaningless when one request can cost dollars). An AI gateway is where those budgets live.
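For intuition, here is the token-aware arithmetic with illustrative prices (not any provider’s real rates): one long-context call can cost dollars while a thousand short calls cost a few, which is why request-count budgets say nothing about spend.

```python
# Toy token-aware budget arithmetic. Prices are illustrative, not any
# provider's real rates.
PRICE_PER_1M_INPUT = 3.00    # dollars per million input tokens (hypothetical)
PRICE_PER_1M_OUTPUT = 15.00  # dollars per million output tokens (hypothetical)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_1M_INPUT
            + output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000

print(f"${call_cost(400_000, 8_000):.2f}")    # one long-context call: $1.32
print(f"${1_000 * call_cost(300, 150):.2f}")  # a thousand short calls: $3.15
```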
The architecture is the one API gateways established in 2010-2015. Applications used to talk directly to backends; the gateway pattern centralized cross-cutting concerns. AI gateways are the LLM-shaped variant of the same pattern.
What an AI gateway does
Three lanes.

Governance lane
The governance lane handles policy enforcement before the call goes to a provider and after the response comes back; a per-call sketch follows the list.
- Auth and key management. Virtual keys per team, per service, or per user; central rotation; revocation without redeploys.
- Budgets. Token-aware dollar caps. Per virtual key, per team, per feature flag. Soft warnings and hard cuts.
- Guardrails. Pre-call: PII detection, prompt-injection detection, banned-content checks. Post-call: content moderation, schema validation, citation checks. Some gateways pipe through specialist services (Aporia, Lakera, Presidio); others ship native rules.
- Audit logs. Every call recorded with key id, model id, prompt hash, response hash, tokens, cost, latency, status, applied policies.
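A minimal sketch of the lane as per-call middleware. The guardrail checks, budget store, and log sink below are hypothetical stand-ins for whatever engines a given gateway actually uses:

```python
# Minimal sketch of the governance lane as per-call middleware. detect_pii,
# detect_injection, BUDGETS, and AUDIT_LOG are hypothetical stand-ins for a
# gateway's real policy engines, budget store, and log sink.
import hashlib
import time

def detect_pii(text: str) -> bool:        # stand-in for a real PII detector
    return "ssn" in text.lower()

def detect_injection(text: str) -> bool:  # stand-in for a real injection detector
    return "ignore previous" in text.lower()

BUDGETS = {"vk-team-payments-001": {"cap_usd": 500.0, "spent_usd": 120.0}}
AUDIT_LOG = []

def govern(virtual_key: str, prompt: str, call_provider):
    budget = BUDGETS[virtual_key]
    if budget["spent_usd"] >= budget["cap_usd"]:
        raise PermissionError("hard budget cap reached for this virtual key")
    if detect_pii(prompt) or detect_injection(prompt):
        raise ValueError("request blocked by pre-call guardrail")

    start = time.monotonic()
    response, cost_usd = call_provider(prompt)  # routing lane takes over here
    budget["spent_usd"] += cost_usd

    AUDIT_LOG.append({                          # post-call: the audit record
        "key_id": virtual_key,
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_hash": hashlib.sha256(response.encode()).hexdigest(),
        "cost_usd": cost_usd,
        "latency_s": time.monotonic() - start,
    })
    return response
```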
Routing lane
The routing lane decides which provider serves each call; a fallback sketch follows the list.
- Provider translation. One OpenAI-compatible API in; many provider APIs out. Translate the request shape, dispatch, translate the response back.
- Fallback chains. If the primary provider rate-limits or errors, retry on the next provider in the chain. Application code does not see the failover.
- Load balancing. Latency-based, cost-based, weighted, or round-robin across multiple deployments of the same model.
- Semantic caching. Hash the embedding of the prompt; on cache hit, return the cached response without hitting a provider. Hit rates can be high on repetitive workloads (FAQ-style support, classification); the actual rate depends on similarity threshold, cache policy, and traffic distribution, so measure it per workload before quoting a number.
- Speculative execution. Send the same call to two providers in parallel, return the first response, cancel the other. Cuts tail latency at the cost of doubling token spend on the duplicated calls.
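A minimal sketch of a fallback chain under these assumptions: each provider is a callable that raises on rate limits or server errors, and the first success wins.

```python
# Minimal sketch of a fallback chain. Each entry in `chain` is a callable
# (e.g. call_openai, call_anthropic, call_vllm) that raises ProviderError
# on rate limits or server errors; the application never sees the failover.
import time

class ProviderError(Exception):
    """Stand-in for a 429/5xx-style provider failure."""

def with_fallback(chain, prompt, retries_per_provider=2, backoff_s=0.5):
    last_error = None
    for call_provider in chain:
        for attempt in range(retries_per_provider):
            try:
                return call_provider(prompt)  # first success wins
            except ProviderError as err:
                last_error = err
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    # every provider in the chain failed
    raise last_error if last_error else ValueError("empty provider chain")
```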
Observability lane
The observability lane captures what happened.
- Tracing. OpenTelemetry spans with gen_ai.* attributes per the OTel GenAI semantic conventions: gen_ai.request.model, gen_ai.response.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, prompt, completion, tool calls (a sketch follows the list).
- Cost attribution. Per-call cost computed from the model price table. Aggregated per virtual key, team, model, feature.
- Eval scoring. Online evaluators score each call (relevance, groundedness, content policy). Scores attach to the span.
- Dashboards. Latency p50/p95/p99, error rate, cost per user, cost per feature, eval-score distribution.
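A sketch of the tracing bullet in practice, using the OpenTelemetry Python SDK and the gen_ai.* attribute names from the OTel GenAI semantic conventions. Token counts and model names are illustrative, and the cost attribute name is an assumption (per-call cost is not part of the gen_ai.* conventions):

```python
# Sketch of a gateway span with OTel GenAI attributes. Assumes the
# opentelemetry-api and opentelemetry-sdk packages are installed.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-gateway")

with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.response.model", "gpt-4o-2024-08-06")
    span.set_attribute("gen_ai.usage.input_tokens", 412)
    span.set_attribute("gen_ai.usage.output_tokens", 183)
    span.set_attribute("gateway.cost_usd", 0.0042)  # hypothetical cost attribute
```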
The three lanes compose: a single call traverses governance, routing, and observability in one pass.
How an AI gateway differs from an API gateway
API gateways solve HTTP-shaped problems: auth, rate limits per request, mTLS, request transformation, schema validation against OpenAPI specs. AI gateways solve LLM-shaped problems on top: token-aware budgets, prompt-level guardrails, semantic caching, provider failover, per-call cost attribution, eval scoring.
Some AI gateways are built as plugins on top of API gateways (Kong AI Gateway is a Kong plugin; Apigee adds AI policies on top of Apigee). Others are LLM-native services that do not assume an underlying API gateway. The choice depends on whether your platform team already runs an API gateway and prefers to extend it.
Architectural patterns
Four common shapes.
1. Self-hosted singleton in front of providers
LiteLLM Proxy, Helicone Gateway, Portkey OSS Gateway, or TrueFoundry runs as a service in your VPC. Every internal LLM call goes through it. Provider keys live in the gateway, not in services. Budgets, guardrails, and observability are centralized.
2. Hosted multi-tenant gateway
OpenRouter, Portkey, or Cloudflare AI Gateway runs in the vendor’s cloud. Applications call it with credentials. Faster to set up; data transits the vendor.
3. Hybrid: hosted with BYOK
Hosted gateway with bring-your-own-keys to the providers. The gateway handles routing and observability; the actual model spend bills directly to your provider account. Mixes the operational ease of hosted with the contract advantage of direct provider deals.
4. Embedded as a library
For teams not ready to run a separate service, the LiteLLM Python SDK gives you the routing and observability features as a library. Less governance surface than the proxy mode, but a lighter rollout.
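A rough sketch of the embedded pattern with the LiteLLM Python SDK: one call shape, provider chosen by the model string. Assumes provider keys are set in the environment (OPENAI_API_KEY, ANTHROPIC_API_KEY); model names are illustrative.

```python
# Rough sketch of the embedded pattern with the LiteLLM Python SDK.
from litellm import completion

messages = [{"role": "user", "content": "Classify this ticket: refund not received."}]

# Same function against two providers; LiteLLM translates request and
# response shapes, so both return OpenAI-format responses.
openai_resp = completion(model="gpt-4o", messages=messages)
claude_resp = completion(model="anthropic/claude-sonnet-4-20250514", messages=messages)

print(openai_resp.choices[0].message.content)
print(claude_resp.choices[0].message.content)
```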
Major AI gateways in 2026
A practical map.
- FutureAGI Agent Command Center. BYOK gateway across 100+ providers with semantic caching, fallback, virtual keys, per-tenant budgets, plus 18+ runtime guardrails and span-attached evals on the same Apache 2.0 self-hostable plane. Recommended pick when gateway, observability, eval, and guardrails should live in one runtime instead of four vendors.
- LiteLLM Proxy. OSS, MIT-licensed for non-enterprise code with separately licensed Enterprise features; broad provider surface (100+), full self-host. Common pick for self-hosted with broad coverage when an external observability backend is already in place.
- OpenRouter. Hosted-only, broad curated marketplace, public per-model pricing, BYOK option. Common pick for hosted with one bill.
- Portkey. Open-source self-hostable gateway plus hosted/enterprise platform. Strong guardrails, prompt management, RBAC. Common pick for governance-heavy enterprise.
- Helicone Gateway. OSS Rust proxy, OpenAI-compatible. Newer; smaller community than LiteLLM.
- TrueFoundry AI Gateway. Kubernetes-native, self-hostable enterprise / on-prem, governance-heavy. Common pick for K8s-mandated enterprise.
- Cloudflare AI Gateway. Hosted-only, free with usage caps, integrated with Cloudflare Workers, smaller catalog. Common pick for Cloudflare-native stacks.
- Kong AI Gateway. Kong plugin. Common pick if you already run Kong.
- AWS Bedrock Guardrails plus a custom router. Adjacent guardrail surface; not a gateway by itself.
- Datadog LLM Observability. Adjacent observability and policy surface; not a routing gateway by itself.
The choice is workload-shape-dependent: hosting model, provider list, governance depth, observability depth, integration with the rest of the platform.
Production patterns
Three that show up.
1. Centralized gateway with virtual keys
Every service in the company has a virtual key issued by the gateway. Each key has a budget, an allowed-models list, and applied guardrails. Cost data flows to the finance team in one report. Key rotation is a gateway admin action.
2. Smart-path / cheap-path split
Gateway routes “easy” requests (classification, summarization of short text) to a cheap open-weights model served by a self-hosted vLLM or by Together. “Hard” requests go to a frontier model. The split is decided by a fast classifier in the guardrail lane or by an explicit user-tier flag. Cost can drop materially when evals confirm quality parity on the cheap path; the multiplier depends on workload, router accuracy, and threshold, so verify with your own eval rubric before quoting a number.
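A toy version of the split, with a trivial length heuristic standing in for the fast classifier; thresholds and model names are illustrative:

```python
# Toy smart-path / cheap-path split. The length heuristic stands in for a
# real fast classifier or user-tier flag.
CHEAP_MODEL = "llama-3.1-8b"  # self-hosted vLLM or Together (illustrative)
FRONTIER_MODEL = "gpt-4o"

def pick_model(prompt: str, user_tier: str = "free") -> str:
    if user_tier == "pro":  # explicit tier flag wins
        return FRONTIER_MODEL
    easy = len(prompt) < 2_000 and "step by step" not in prompt.lower()
    return CHEAP_MODEL if easy else FRONTIER_MODEL

assert pick_model("Summarize: meeting moved to 3pm") == CHEAP_MODEL
assert pick_model("Prove this step by step: ...") == FRONTIER_MODEL
```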
3. Multi-region failover
Primary OpenAI deployment in us-east; fallback to OpenAI in eu-west; fallback to Anthropic; fallback to a self-hosted Llama. Gateway handles all four transitions; application code sees one base URL.
Common mistakes
- Skipping budgets. A virtual key without a budget is a runaway-cost incident waiting to happen.
- Treating guardrails as optional. PII redaction is non-negotiable for regulated workloads; injection detection is non-negotiable for prompt-attack-prone surfaces.
- No semantic cache. A cache that misses every call is fine for unique prompts; for repetitive workloads (support FAQ, classification), it is leaving cost on the table.
- Over-aggressive caching. A semantic cache with a low similarity threshold returns wrong answers to subtly different questions. Tune the threshold; review hits in eval (see the sketch after this list).
- Hard-coding the gateway URL in code. Put it in config; the URL changes when you migrate from hosted to self-hosted or vice versa.
- No observability beyond the gateway. The gateway sees the LLM call. Application traces (the agent loop, the retriever, the tool calls) need their own instrumentation.
- Single-node self-host. A singleton gateway is a single point of failure. Run two in active-active with shared state.
- Ignoring audit logs. Audit logs are the artifact compliance reviews run on; not querying them means you cannot answer “who called what when” until the auditor asks.
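On the caching threshold point above, a toy sketch of why the threshold matters. The 3-d vectors stand in for real embedding vectors; the numbers are chosen so the effect is visible.

```python
# Toy sketch of the similarity-threshold tradeoff in a semantic cache.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cached_q = np.array([0.9, 0.1, 0.2])       # "What is your refund policy?"
paraphrase = np.array([0.88, 0.12, 0.22])  # "What's the refund policy?"
related = np.array([0.6, 0.7, 0.3])        # "What is your shipping policy?"

def cache_hit(query_vec, threshold):
    return cosine(query_vec, cached_q) >= threshold

print(cache_hit(paraphrase, 0.95))  # True: strict threshold still catches rephrasings
print(cache_hit(related, 0.95))     # False: correctly a miss, goes to a provider
print(cache_hit(related, 0.70))     # True: loose threshold serves the wrong answer
```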
How FutureAGI implements an AI gateway
FutureAGI is the production-grade AI gateway plus evaluation plus observability stack built around the closed reliability loop that other gateway picks stitch together by hand. The full stack runs on one Apache 2.0 self-hostable plane:
- Gateway, the Agent Command Center is a BYOK gateway across 100+ providers with semantic caching, fallback, latency-based load balancing, virtual keys, and per-tenant budgets; it also ingests OTLP from external gateways (LiteLLM, Helicone, Portkey, Cloudflare AI Gateway, Kong AI Gateway) so teams keep their existing gateway and use FutureAGI as the consolidated backend.
- Tracing, traceAI (Apache 2.0) auto-instruments 35+ frameworks across Python, TypeScript, Java, and C#, with surrounding agent and retriever spans rendered alongside gateway spans in one trace tree.
- Evals on the gateway path, 50+ first-party metrics attach as span attributes; BYOK lets any LLM serve as the judge at zero platform fee, and turing_flash runs the same rubrics at 50 to 70 ms p95 with full templates at about 1 to 2 seconds, so per-key cost and per-rubric quality sit side by side.
- Guardrails and simulation, 18+ runtime guardrails (PII, prompt injection, jailbreak, tool-call enforcement) enforce policy inline at the gateway, and persona-driven simulation exercises the gateway path before live traffic with the same scorer contract that judges production.
Beyond the four axes, FutureAGI also ships six prompt-optimization algorithms that consume failing trajectories as training data. Pricing starts free with a 50 GB tracing tier, 100,000 gateway requests, and 100,000 cache hits; Boost is $250 per month, Scale is $750 per month with HIPAA, and Enterprise from $2,000 per month with SOC 2 Type II.
Most teams comparing AI gateways end up running three or four products in production: one for the gateway, one for traces, one for evals, one for guardrails. FutureAGI is the recommended pick because gateway, tracing, evals, simulation, and guardrails all live on one self-hostable runtime; the loop closes without stitching.
Sources
- LiteLLM Proxy docs
- OpenRouter docs
- Portkey docs
- Cloudflare AI Gateway docs
- Kong AI Gateway
- Helicone AI Gateway on GitHub
- Portkey AI Gateway on GitHub
- TrueFoundry AI Gateway
- AWS Bedrock Guardrails
- OpenTelemetry GenAI semantic conventions
- traceAI on GitHub (Apache 2.0)
- FutureAGI documentation
Series cross-link
Related: Best LLM Gateways in 2026, What is LiteLLM?, What is OpenRouter?, What is LLM Tracing?
Frequently asked questions
What is an AI gateway in plain terms?
A service that sits between your applications and LLM providers. Applications call its single OpenAI-compatible endpoint; the gateway handles governance (keys, budgets, guardrails, audit logs), routing (translation, fallback, caching), and observability (tracing, cost attribution, evals).
What is the difference between an AI gateway and an LLM gateway?
Mostly terminology. Both point at the same architecture; “AI gateway” usually emphasizes the governance and observability layers on top of routing.
What does an AI gateway do that an API gateway does not?
LLM-shaped concerns: token-aware budgets, prompt-level guardrails, semantic caching, provider failover, per-call cost attribution, and eval scoring. API gateways handle HTTP-shaped concerns like per-request rate limits and schema validation.
Why do teams adopt an AI gateway?
Three forces: the provider list grew past one, governance became a compliance prerequisite, and token-aware cost control became unavoidable. Centralizing all three in one gateway beats duplicating them across services.
What are the major AI gateways in 2026?
Self-hostable OSS: LiteLLM Proxy, Helicone Gateway, Portkey AI Gateway. Hosted: OpenRouter, Cloudflare AI Gateway. Enterprise/on-prem: TrueFoundry. Plugin-style: Kong AI Gateway, Apigee. FutureAGI’s Agent Command Center bundles the gateway with observability, evals, and guardrails.
Where does an AI gateway fit in the architecture?
Between application services and providers: services call the gateway with virtual keys, provider keys live in the gateway, and every call traverses the governance, routing, and observability lanes in one pass.
Does an AI gateway slow things down?
It adds one proxy hop in front of calls the model itself takes hundreds of milliseconds to seconds to serve, so the overhead is usually marginal; semantic caching and speculative execution can make the gateway path faster than a direct call on many workloads.
Should I build my own AI gateway or buy one?
The cross-cutting surface (translation, fallback, budgets, guardrails, tracing) already exists in the OSS and hosted gateways above; rebuilding it in-house recreates the duplicated-cost problem the gateway pattern exists to remove. Build only if your routing or compliance requirements genuinely fit none of them.