
What is LiteLLM? The Universal LLM API Translator in 2026

LiteLLM is the open-source SDK and proxy that gives every LLM an OpenAI-compatible API. What it is, how the SDK and proxy differ, and how teams use it in 2026.


A team supports six services: a Python backend on AWS that uses Bedrock, a Next.js app on Vercel that hits OpenAI directly, a data pipeline on Azure that calls Azure OpenAI, an internal eval runner that wants to score with Claude, an experimental agent on Google Cloud that prefers Gemini, and a Slack bot a hackathon engineer wired to a friend’s Together account. Six services, six different SDKs, six places where keys live, six ways the cost data shows up, and exactly zero ways to swap providers without a refactor. The pattern is so common that an OSS gateway emerged to solve it, and that gateway is LiteLLM.

This piece walks through what LiteLLM is, the SDK and proxy distinction, the call shape, the routing and observability stories, the enterprise tier, and how it compares with OpenRouter, Portkey, and Cloudflare AI Gateway in 2026.

TL;DR: What LiteLLM is

LiteLLM is an open-source SDK and proxy server, maintained by BerriAI, that gives every LLM an OpenAI-compatible API. The Python SDK (MIT) lets you call any of 100+ providers with litellm.completion(model='...', messages=[...]) and get the OpenAI response shape back. The Docker-distributed proxy server, also MIT for the non-enterprise code (with enterprise/ features under a separate license), exposes the same translation as a REST service: any OpenAI client (Python, Node, Go, curl) points at the proxy URL and gets fallback, retries, caching, per-team budgets, guardrail hooks, and OpenTelemetry tracing via the otel callback. There is also a paid Enterprise tier for SSO, audit, and managed hosting. The repo sits at tens of thousands of GitHub stars with continued weekly release cadence and is a popular self-hostable AI gateway in production.

Why LiteLLM exists

Three forces converged.

First, the LLM provider list grew. In 2023 OpenAI was effectively the API. By 2026 a serious provider list includes OpenAI, Anthropic, Bedrock, Vertex, Azure OpenAI, Mistral, Cohere, Groq, Together, Fireworks, DeepSeek, Perplexity, Cloudflare Workers AI, NVIDIA NIM, IBM watsonx, Databricks, plus open-weights served by vLLM or Ollama. Each provider has its own SDK, its own request shape, its own response shape, its own retry semantics. Code that targets one provider does not move to another without rewrites.

Second, OpenAI’s Chat Completions shape became the lingua franca. Anthropic, Bedrock, and Vertex all ship OpenAI-compatible endpoints alongside their native APIs. Azure OpenAI is OpenAI Chat Completions with a different auth header. The convergence on one shape made a translator viable.

Third, governance moved from “nice to have” to “required.” Per-team budgets, key rotation, guardrails, observability, and audit logs are table stakes for enterprise AI rollouts. Building those at the application layer is a duplicated cost across services. Centralizing them in one gateway is the obvious play.

The architecture

LiteLLM in production has the shape of a translator with sidecar features.

Figure: the LiteLLM architecture. Client apps connect down to a central LiteLLM proxy bar (one unified OpenAI-compatible API), which fans out to the underlying providers: OpenAI, Anthropic, Bedrock, Vertex, Azure, Ollama.

  • Clients (your services) speak the OpenAI Chat Completions or Responses API.
  • LiteLLM parses the call, looks up the routing rule, picks a provider, translates the request to that provider’s shape, dispatches it, translates the response back to the OpenAI shape.
  • Providers are the underlying LLM endpoints (OpenAI, Anthropic, Bedrock, Vertex, Mistral, etc.).
  • Sidecars include the Postgres-backed proxy database for keys and budgets, an optional Redis for caching, and OTel exporters for observability.

The two surfaces (SDK and proxy) cover the two integration patterns.

LiteLLM SDK

The SDK is pip install litellm. The shape:

from litellm import completion

# The provider is encoded in the model id; the rest of the call is OpenAI-shaped.
response = completion(
    model="anthropic/claude-3-7-sonnet-latest",
    messages=[{"role": "user", "content": "Summarize the OTel GenAI spec."}],
)
print(response.choices[0].message.content)

The same call works for openai/gpt-4o, bedrock/anthropic.claude-3-sonnet-20240229-v1:0, vertex_ai/gemini-2.5-flash, azure/gpt-4o-deployment, groq/llama-3.3-70b-versatile, ollama/llama3.2, etc. The model id encodes the provider; the rest of the call is OpenAI-shaped.

Streaming, function calling, tool use, structured outputs, and embeddings all work through the same surface. The SDK is the right choice for Python services where embedding LiteLLM directly is acceptable.
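
Streaming, for example, is the same call with stream=True. A minimal sketch (the model id is illustrative):

from litellm import completion

# Chunks arrive in the OpenAI streaming shape; delta.content can be None
# on role or finish chunks, so guard before printing.
response = completion(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain fallback routing in one paragraph."}],
    stream=True,
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)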

LiteLLM Proxy

The proxy is a FastAPI service distributed as a Docker image (ghcr.io/berriai/litellm). You configure it via a config.yaml listing the model deployments and the routing rules, run it, and any OpenAI-compatible client points at the proxy URL.

# config.yaml fragment
model_list:
  - model_name: claude-fast
    litellm_params:
      model: anthropic/claude-3-5-haiku-latest
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-fast
    litellm_params:
      model: bedrock/anthropic.claude-3-haiku-20240307-v1:0
      aws_region_name: us-east-1
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
router_settings:
  routing_strategy: latency-based-routing
  fallbacks:
    - claude-fast: ["gpt-4o-mini"]

Two claude-fast deployments, one on Anthropic and one on Bedrock, with latency-based routing between them and fallback to a separate gpt-4o-mini deployment if both fail. Clients call claude-fast; the proxy decides which underlying deployment serves the call.
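
From the client side, the proxy is indistinguishable from OpenAI. A minimal sketch with the official OpenAI Python client, assuming a local proxy on the default port 4000 and a virtual key:

from openai import OpenAI

# base_url points at the LiteLLM proxy, not api.openai.com; the key is a
# LiteLLM virtual key, not a provider key (both values are deployment-specific).
client = OpenAI(base_url="http://localhost:4000", api_key="sk-team-virtual-key")

resp = client.chat.completions.create(
    model="claude-fast",  # the model_name from config.yaml, not a provider model id
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)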

Beyond routing, the proxy provides:

  • Virtual keys. Per-team or per-user keys with budgets, rate limits, and allowed-model lists. Rotated through the admin UI or API.
  • Caching. Redis-backed exact-match or semantic cache, with cache hit metrics on every call (a config sketch follows this list).
  • Guardrails. Hook integrations for Aporia, Lakera, Presidio, plus custom Python callbacks for pre-call and post-call validation.
  • Cost tracking. Per-call cost computed from the model price table; aggregated per virtual key, per team, per model.
  • Audit logs. Every call recorded with key id, model id, latency, token counts, cost.
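
Caching is a config switch rather than client code. A minimal config.yaml sketch, assuming a reachable Redis whose connection details live in the standard REDIS_* environment variables (the TTL value is illustrative):

litellm_settings:
  cache: true
  cache_params:
    type: redis   # reads REDIS_HOST, REDIS_PORT, REDIS_PASSWORD from the environment
    ttl: 600      # entries expire after 10 minutes (illustrative value)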

Observability

LiteLLM exposes OpenTelemetry tracing via the otel callback. Configure it on the SDK with litellm.callbacks = ["otel"] or in the proxy’s general_settings, set the standard OTEL_EXPORTER_OTLP_* environment variables, and spans flow to your collector. Spans carry gen_ai.operation.name, gen_ai.provider.name, gen_ai.request.model, gen_ai.response.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, plus the prompt and completion when content capture is enabled.
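
Wiring the SDK to a collector takes a few lines. A minimal sketch, assuming an OTLP endpoint on localhost:4318 (both environment values are deployment-specific):

import os
import litellm

# Standard OTel exporter configuration; set before the first instrumented call.
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://localhost:4318"
os.environ["OTEL_SERVICE_NAME"] = "llm-service"

litellm.callbacks = ["otel"]  # every completion call now emits gen_ai.* spans

response = litellm.completion(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)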

The proxy also supports callbacks for vendor-specific tracing: Langfuse, Helicone, Datadog, Phoenix, Promptlayer, Lunary, OpenLIT. Teams using FutureAGI typically wrap LiteLLM with the traceAI-litellm instrumentation or send the otel callback’s spans to FutureAGI’s OTLP ingest. Callbacks run in addition to the OTel emission, so you can dual-write to a vendor backend and a generic OTel collector during a migration.

For teams that want a single trace tree across LLM calls, retriever queries, and tool invocations, LiteLLM is the LLM-call instrumentation; OpenInference or traceAI instruments the framework wrapping it (LangChain, LlamaIndex, CrewAI). Both emit OTLP; one OTel collector consumes both.

How LiteLLM compares with other gateways

A practical map.

  • FutureAGI Agent Command Center. BYOK gateway across 100+ providers with semantic caching, fallback, virtual keys, and per-tenant budgets, plus 18+ runtime guardrails and span-attached evals on the same self-hostable plane. Recommended pick when teams want gateway plus observability plus eval plus guardrails in one runtime instead of stitching LiteLLM with three other vendors.
  • OpenRouter. Hosted marketplace, BYOK on the user side, no self-host. LiteLLM Proxy is the self-hosted alternative for teams that need on-prem or VPC isolation.
  • Portkey. Open-source self-hostable AI gateway plus a hosted/enterprise platform. Stronger guardrail and prompt-management surface; provider counts are comparable.
  • Cloudflare AI Gateway. Hosted-only, free with usage caps, integrated with Cloudflare Workers. Limited provider list compared to LiteLLM.
  • TrueFoundry AI Gateway. Kubernetes-native, on-prem-first, governance-heavy. Less Python-SDK ergonomic than LiteLLM.
  • Helicone Gateway. OSS, Rust-built, OpenAI-compatible proxy. Newer; smaller community than LiteLLM.
  • Kong AI Gateway. Plugin on top of Kong API gateway. For teams already running Kong, the integration is natural; otherwise LiteLLM Proxy is lighter weight.

LiteLLM’s distinct posture: very broad provider surface, free OSS (MIT) for both SDK and the non-enterprise proxy, strong Python-first ergonomics, paid Enterprise tier for governance. FutureAGI’s Agent Command Center is the closest unified-runtime alternative when LiteLLM’s gateway scope is not enough on its own.

Production patterns

Three that show up repeatedly.

1. Provider-portable application code

A team writes all model calls through litellm.completion(model=cfg.model, ...). The model id lives in config. Migrating from Anthropic to Bedrock is a config change, not a code change. The pattern compounds across services as the team grows.
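
A minimal sketch of the pattern; the environment variable name and default are hypothetical:

import os
from litellm import completion

# The model id lives in config (here, an env var), never in the call site.
MODEL_ID = os.environ.get("LLM_MODEL", "anthropic/claude-3-5-haiku-latest")

def summarize(text: str) -> str:
    response = completion(
        model=MODEL_ID,
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.choices[0].message.content

Swapping providers then means changing LLM_MODEL in the deployment config, not touching the call site.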

2. Centralized gateway with virtual keys

The proxy runs as a singleton (or HA pair) in the cluster. Every service that needs an LLM gets a virtual key with a budget. Cost and audit data flow to one dashboard. Key rotation is a proxy admin action, not a per-service deploy.
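
Key issuance goes through the proxy's /key/generate endpoint. A minimal sketch, assuming a local proxy and a master key; the budget, duration, and team id values are illustrative:

import requests

resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": "Bearer sk-master-key"},  # proxy master key
    json={
        "models": ["claude-fast"],  # allowed-model list for this key
        "max_budget": 50.0,         # USD spend cap before the key is blocked
        "duration": "30d",          # key lifetime
        "team_id": "search-team",   # illustrative team identifier
    },
    timeout=10,
)
print(resp.json()["key"])  # hand this virtual key to the service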

3. Fallback chain for reliability

Primary on a frontier model, fallback to a cheaper model on rate limit, fallback to a different provider on outage. The proxy handles all three transparently. Application code does not see the fallback; the response shape is identical regardless of which provider served it.
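
In config terms the chain is a fallbacks list on the model group. A minimal sketch, assuming gpt-4o, gpt-4o-mini, and claude-fast are all defined in model_list:

router_settings:
  num_retries: 2          # retry the same deployment before falling back
  fallbacks:
    - gpt-4o: ["gpt-4o-mini", "claude-fast"]   # tried in order on failure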

Common mistakes

  • Embedding the SDK without a config layer. Hard-coding model ids inside services defeats the portability story. Put the model id in config.
  • Running the proxy without budgets. Without per-key budgets, one runaway service can burn through provider quota. Always set budgets.
  • Skipping cache configuration. A semantic cache catches the long tail of repeated and near-identical questions. The hit rate is workload-dependent but is rarely zero on production traffic.
  • Treating LiteLLM as the only observability layer. LiteLLM gives you LLM-call traces. Application traces (the agent loop, the retriever, the tool calls) need their own instrumentation. Use OpenInference or traceAI for the framework layer.
  • Forgetting key rotation discipline. Virtual keys make rotation easy, but only if the rotation actually happens. Schedule it.
  • Mismatched model id strings. anthropic/claude-3-5-sonnet-latest vs claude-3-5-sonnet-20240620 matters. LiteLLM supports both; standardize on one across the codebase.
  • Self-hosting without HA. A single-node proxy is a single point of failure. Run two in active-active with a Postgres backend; failover is a load-balancer flip.
  • Skipping security review of guardrail callbacks. A custom Python callback runs inside the proxy process, so a bug there has the same blast radius as a bug in the proxy itself (a minimal hook sketch follows this list).
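
A custom callback is ordinary Python registered with LiteLLM, which is exactly why it deserves review. A minimal SDK-side sketch of the hook surface (the proxy loads the same kind of handler from its config):

import litellm
from litellm.integrations.custom_logger import CustomLogger

class AuditHandler(CustomLogger):
    def log_pre_api_call(self, model, messages, kwargs):
        # Pre-call hook: runs in-process before the provider is hit.
        print(f"calling {model}")

    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        # Post-call hook: runs after a successful response.
        print(f"{kwargs.get('model')} took {(end_time - start_time).total_seconds():.2f}s")

litellm.callbacks = [AuditHandler()]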

How FutureAGI implements LiteLLM observability and gateway routing

FutureAGI is the production-grade gateway-plus-observability platform for LiteLLM built around the closed reliability loop that other LiteLLM stacks stitch together by hand. The full stack runs on one Apache 2.0 self-hostable plane:

  • LiteLLM tracing. traceAI (Apache 2.0) wraps LiteLLM call sites with the traceAI-litellm instrumentation; LiteLLM’s otel callback also emits gen_ai.* spans that the FutureAGI collector ingests over OTLP, and both paths merge with the surrounding agent and retriever spans across Python, TypeScript, Java, and C#.
  • Evals on the gateway path. 50+ first-party metrics (Faithfulness, Hallucination, Tool Correctness, Task Completion) attach as span attributes; BYOK lets any LLM serve as the judge at zero platform fee, and turing_flash runs the same rubrics at 50 to 70 ms p95.
  • Simulation. Persona-driven scenarios exercise the gateway path before live traffic, so cost spikes and quality regressions are caught in pre-prod with the same scorer contract that judges production.
  • Gateway and guardrails. The Agent Command Center is an alternative or complement to the LiteLLM proxy; it fronts 100+ providers with BYOK routing, fallback, latency-based load balancing, virtual keys, and budgets, and 18+ runtime guardrails (PII, prompt injection, jailbreak, tool-call enforcement) enforce policy on the same plane.

Beyond the four axes, FutureAGI also ships six prompt-optimization algorithms that consume failing trajectories as training data. Pricing starts free with a 50 GB tracing tier, 100,000 gateway requests, and 100,000 cache hits; Boost is $250 per month, Scale is $750 per month with HIPAA, and Enterprise from $2,000 per month with SOC 2 Type II.

Most teams running LiteLLM at scale end up running three or four tools alongside it: one for traces, one for evals, one for guardrails, one for governance. FutureAGI is the recommended pick because tracing, evals, simulation, gateway, and guardrails all live on one self-hostable runtime; the loop closes without splicing four vendors together.


Frequently asked questions

What is LiteLLM in plain terms?
LiteLLM is an open-source Python SDK and proxy server, maintained by BerriAI, that exposes 100+ LLM providers behind one OpenAI-compatible API. You write code as if you are calling OpenAI; LiteLLM translates the call to the actual provider (Anthropic, Bedrock, Vertex, Azure, Together, Mistral, Groq, Ollama, and so on). The non-enterprise SDK and proxy code in BerriAI/litellm is MIT-licensed; the `enterprise/` directory and the paid Enterprise tier (SSO, audit logs, SLA-backed support) are separately licensed.
What is the difference between the LiteLLM SDK and the LiteLLM proxy?
The SDK (`litellm`) is a Python library you import and call directly: `litellm.completion(model='anthropic/claude-3-5-sonnet-latest', messages=[...])`. The proxy is a self-hosted service (Docker, Kubernetes) that exposes a REST endpoint speaking the OpenAI Chat Completions and Responses APIs; any OpenAI client points at the proxy URL and gets unified access to every provider. Use the SDK for tight Python coupling, the proxy when multiple services or non-Python apps need shared access, key rotation, budgets, and routing.
Is LiteLLM free and open source?
Yes for the OSS surface. Non-enterprise code in BerriAI/litellm (SDK and proxy) is MIT-licensed; the `enterprise/` directory ships under a separate enterprise license. You can self-host the OSS pieces at zero cost. BerriAI sells an Enterprise tier with SSO, RBAC, audit logs, dedicated support, and managed hosting on top. The OSS feature surface is large enough that most teams start there and only upgrade if they need enterprise compliance or a managed plane.
How many providers does LiteLLM support?
Over a hundred as of late 2026. The README and docs list OpenAI, Anthropic, Google (Gemini, Vertex), AWS Bedrock, Azure OpenAI, Cohere, Mistral, Groq, Together, Replicate, Fireworks, DeepSeek, Perplexity, NVIDIA NIM, OpenRouter, OctoAI, AI21, Databricks, IBM watsonx, Voyage, Cloudflare Workers AI, Ollama, vLLM, plus many more. It also supports embeddings, image generation, audio, and rerank endpoints across a comparable set.
What does LiteLLM solve that the OpenAI SDK does not?
Three things. Provider portability: you can swap Anthropic for OpenAI for Bedrock without changing call sites. Cost and routing: the proxy handles fallback, load balancing, retry, caching, and per-team budgets. Observability: every call is logged with token counts, cost, and latency in a unified shape, and the proxy emits OpenTelemetry traces. The OpenAI SDK does only the first hop to OpenAI; LiteLLM is the translation, routing, and governance layer for every provider.
How does LiteLLM compare with OpenRouter, Portkey, and Cloudflare AI Gateway?
LiteLLM is self-hostable open source with a Python-first SDK and a free proxy. OpenRouter is hosted-only, BYOK on a marketplace, no self-host. Portkey is an open-source self-hostable AI gateway plus a hosted/enterprise platform, focused on guardrails plus routing plus prompt management. Cloudflare AI Gateway is hosted-only, integrated with Cloudflare Workers, free with usage caps. For teams that want a Python-first self-hosted gateway with a very broad provider surface, LiteLLM is a popular pick. For hosted-only with curated provider deals, OpenRouter. For governance-heavy enterprise, Portkey or LiteLLM Enterprise. For Cloudflare-native stacks, Cloudflare AI Gateway.
How does LiteLLM emit traces and metrics?
LiteLLM supports OpenTelemetry via the `otel` callback (set `litellm.callbacks = ['otel']` or configure the proxy `general_settings`). Spans carry `gen_ai.*` attributes per the OpenTelemetry GenAI semantic conventions. The SDK and proxy also support callback integrations to Langfuse, Helicone, Datadog, Phoenix, Promptlayer, and others; teams using FutureAGI typically wrap LiteLLM with the `traceAI-litellm` instrumentation or send the OTel callback's spans to FutureAGI's OTLP ingest. Token usage and cost are computed per call against the model price table; the same data flows to span attributes and to the proxy's own usage dashboard.
What is the latest LiteLLM release?
LiteLLM ships frequent releases on GitHub under BerriAI/litellm, with tens of thousands of stars and a count that grows weekly. Recent additions through 2026 include Responses API parity for non-OpenAI providers, richer guardrail integrations, expanded MCP gateway support, and continued growth in the supported-provider list. Check the GitHub releases page for the current state.