
Best 5 AI Gateways for Aider with Local Models in 2026

Five AI gateways scored on Aider with local models in 2026: OpenAI-compatible passthrough to Ollama and vLLM, hosted fallback, GPU-aware routing, gaps.

March 11, 2026

Aider is a CLI coding agent that pair-programs with you from a terminal. Point it at hosted Claude or GPT and it works on day one. Point it at a 14B local model on a single H100 and the picture changes: 32K context instead of 200K, shakier tool calling, fast on small diffs and brutally slow on multi-file refactors, and the repo never leaves the machine. The local-model setup is what makes Aider attractive to teams that can’t send code through a hosted API.

“Aider plus a local model” is a configuration, not a product. You have to decide which model handles which turn, which turns spill out to a hosted backup, how observability survives when half the traffic never hits a managed dashboard, and what stops the small local model from confidently writing the wrong patch. A gateway sits between Aider and the mix of ollama serve, vllm serve, llama-server, and api.anthropic.com that backs it. (If you are new to running models locally, Ollama is the local LLM runtime most teams reach for first.) It turns the configuration into a workflow.

This post scores the five gateways usable for Aider plus local models in May 2026. Only one turns the local-vs-hosted traces into a feedback loop that gets the routing decision right more often each week.

TL;DR

Future AGI Agent Command Center is the strongest pick for an AI gateway in front of Aider with local models because it exposes Ollama, vLLM, llama.cpp, and LM Studio alongside Anthropic, Bedrock, and Vertex behind one OpenAI-compatible OPENAI_API_BASE, with deterministic hosted-spillover triggers (GPU OOM, context overflow, latency-budget breach), GPU-aware health checks against each local replica, per-developer virtual keys, and OpenTelemetry-native cost telemetry on local-and-hosted traces in the same dashboard. The other four picks below win on specific edges.

Future AGI Agent Command Center — Best overall. Local-plus-hosted routing under one base URL, deterministic spillover triggers, GPU-aware health checks, and unified cost telemetry.
LiteLLM — Best self-hosted Python proxy that fronts Ollama, vLLM, and llama.cpp under one OpenAI URL. Python-native, open source under MIT, every local backend has a working adapter; pin commits after the March 24, 2026 PyPI compromise.
Portkey — Best when the local model is the primary and you only need a clean spillover to hosted. Hosted gateway with virtual keys (verify the Palo Alto Networks acquisition timeline before signing multi-year).
vLLM with a proxy front — Best raw throughput on a single GPU, paired with a minimal compatibility shim. GPU-native serving for the lowest steady-state latency.
Kong AI Gateway — Best if you already run Kong and want the AI-specific policies inside the same control plane. API-gateway-grade plugin stack on top of your existing platform.

Why Aider with local models needs a gateway

Aider speaks the OpenAI chat-completions API and accepts OPENAI_API_BASE set to Ollama, vLLM, or any OpenAI-compatible endpoint. That config works for one developer, one model, one machine. It doesn’t survive the second engineer, the second model, or the first time the GPU is saturated.

Three properties make this a routing problem.

Local models aren’t interchangeable with hosted ones. A qwen2.5-coder-14b at 4-bit quantization runs at roughly 60-90 tokens per second on a single H100 SXM with a 32K context window. claude-sonnet-4-6 runs at hundreds of tokens per second under load with 200K context. Not substitutable on hard turns. The gateway has to know which turn goes where.
Local inference fails differently from hosted inference. Hosted returns a 429 you can retry. Local returns a CUDA OOM, a model-not-loaded error, a context-overflow truncation, or a process death because someone kicked off a fine-tune on the same node. The gateway has to translate those into a clean fallback, not propagate the GPU error to Aider’s terminal.
The point of running local is that the prompt never leaves the box. If the gateway ships traces to a hosted observability backend, the data-flow argument collapses. Either the gateway speaks to a self-hosted sink, or the local-model story doesn’t hold.
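
The second property, translating local failures into a clean fallback, can be sketched as a small classifier. This is a minimal illustration, not any gateway's actual logic: the `Failure` categories, the substring patterns, and both function names are hypothetical, and real backends word their errors differently.

```python
from enum import Enum


class Failure(Enum):
    GPU_OOM = "gpu_oom"
    MODEL_NOT_LOADED = "model_not_loaded"
    CONTEXT_OVERFLOW = "context_overflow"
    BACKEND_DOWN = "backend_down"
    UNKNOWN = "unknown"


# Hypothetical substrings for illustration; check your backend's real messages.
_PATTERNS = [
    ("out of memory", Failure.GPU_OOM),
    ("model not loaded", Failure.MODEL_NOT_LOADED),
    ("context length", Failure.CONTEXT_OVERFLOW),
    ("connection refused", Failure.BACKEND_DOWN),
]


def classify(error_text: str) -> Failure:
    """Map a raw local-backend error string to a failure category."""
    lowered = error_text.lower()
    for needle, failure in _PATTERNS:
        if needle in lowered:
            return failure
    return Failure.UNKNOWN


def should_fall_back(failure: Failure) -> bool:
    """Recognized local failures go to the hosted model; unknown ones surface."""
    return failure is not Failure.UNKNOWN
```

The design choice here is that an unrecognized error is surfaced rather than silently retried on a hosted model, so a new failure mode gets noticed instead of quietly raising the hosted bill.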

For the rest of this post, “Aider plus local models” means Aider with OPENAI_API_BASE pointing at a gateway, the gateway fronting one or more local backends (Ollama, vLLM, llama.cpp, LM Studio), and an optional hosted fallback.
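
As a rough sketch of that setup, the helper below builds the command line and environment for an Aider session that talks to a gateway. The gateway URL, the placeholder key, and the `aider_launch` name are assumptions for illustration; the `openai/` model prefix follows Aider's documented convention for OpenAI-compatible endpoints, but check it against the Aider version you run.

```python
import os


def aider_launch(gateway_url: str, model: str, api_key: str) -> tuple[list[str], dict[str, str]]:
    """Return (argv, env) for an Aider session routed through a gateway."""
    env = dict(os.environ)
    env["OPENAI_API_BASE"] = gateway_url  # every request goes to the gateway
    env["OPENAI_API_KEY"] = api_key       # e.g. a per-developer virtual key
    argv = ["aider", "--model", f"openai/{model}"]
    return argv, env


# Hypothetical local gateway address and key, for illustration only.
argv, env = aider_launch(
    "http://localhost:4000/v1", "qwen2.5-coder-14b", "sk-local-placeholder"
)
# subprocess.run(argv, env=env) would start the session.
```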

The 7 axes we score on

Generic “best AI gateway” axes are too coarse for the local-model variant. We scored each pick on seven axes specific to Aider plus local models.

Axis	What it measures
1. Local backend adapter coverage	Does the gateway have working adapters for Ollama, vLLM, llama.cpp, and LM Studio without per-team Python?
2. Hosted spillover policy	Can it deterministically fall back to a hosted model on GPU OOM, context overflow, or a latency-budget breach, without breaking the Aider session?
3. Turn-routing by complexity	Can it route easy turns (small diffs, single-file edits) to the local model and hard turns to a hosted one, by a rule a non-ML engineer can read?
4. Streaming + tool-call fidelity	Aider streams tokens and (with `--auto-commits`, `--lint`) issues tool-shaped calls; do those pass through without buffering or re-serialization?
5. Local-only observability sink	Can traces stay inside the perimeter, in a self-hosted Phoenix / Loki / ClickHouse / Postgres?
6. GPU-aware health checks	Does the gateway distinguish a stuck Ollama process from a slow one, and route around the dead replica?
7. Self-improving loop	Do captured traces drive next-week’s routing and prompts, or do they just sit in a dashboard?

The verdict line at the end of each pick scores all seven.

How we picked

We started from the public AI gateways that ship an OpenAI-compatible endpoint and document a local-backend adapter as of May 2026. We removed gateways without an adapter for at least Ollama and vLLM. We removed gateways that broke tool calls on translation. We removed two consumer-facing model directories whose self-host story isn’t real. The remaining five are below.

A note on the 2026 trust cohort: LiteLLM had a PyPI supply-chain compromise on versions 1.82.7 and 1.82.8 (March 24, 2026), remediated past 1.83.7. Portkey is mid-acquisition by Palo Alto Networks (announced April 30, 2026). Both are still in this list, both still production-ready. But the procurement-independence question is now real.

1. Future AGI Agent Command Center: Best for local-plus-hosted Aider routing

Verdict: Future AGI exposes Ollama, vLLM, llama.cpp, and LM Studio alongside Anthropic, Bedrock, and Vertex behind one OpenAI-compatible OPENAI_API_BASE, with deterministic hosted-spillover triggers (GPU OOM, context overflow, latency-budget breach), per-developer virtual keys, cross-developer cache, and GPU-aware health checks against each local replica. Per-turn cost and quality sit in the same OpenTelemetry span tree, so finance and engineering both read off the same local-and-hosted dashboard rather than reconciling two systems.

What it does for Aider with local models:

Local backend adapter coverage. Native adapters for Ollama, vLLM, llama.cpp, and LM Studio. Each normalizes the OpenAI tool-call JSON so Aider’s apply_patch and shell-exec survive the local-to-hosted hop. Config entry, not a wrapper.
Hosted spillover with three deterministic triggers: GPU OOM, context overflow above the local window, and a latency-budget breach (default P95 over 8 seconds for turns under 4K input). Default targets are claude-sonnet-4-6 for OOM and overflow, claude-haiku-4-5 for latency.
Turn-routing by complexity via a YAML route policy that reads the request body. Default: under 8K input tokens to qwen2.5-coder-14b, 8K to 32K to claude-sonnet-4-6, over 32K to claude-opus-4-7. The optimizer rewrites it weekly from eval scores.
Streaming + tool-call fidelity. SSE pass-through on both legs; tool-use JSON is parsed and re-emitted, not re-serialized as text.
Local-only observability sink via Agent Command Center BYOC plus the Apache 2.0 traceAI library. Traces, evals, spans stay in your VPC.
GPU-aware health checks via periodic dummy completions against each Ollama and vLLM replica. Stuck processes get pulled in under 30 seconds at defaults.
Self-improving loop. Every Aider turn becomes a traceAI span. traceAI is OpenInference-native and covers 50+ AI surfaces across Python, TypeScript, Java, and C#, including a Spring Boot starter, Spring AI, LangChain4j, and Semantic Kernel. fi.evals scores each span, low-scoring turns get clustered, and fi.opt.optimizers rewrites the prompt or the routing rule. The six optimizers are RandomSearchOptimizer, BayesianSearchOptimizer (Optuna-backed, with teacher-inferred few-shot templates and resumable studies), MetaPromptOptimizer, ProTeGi, GEPAOptimizer, and PromptWizardOptimizer; all share an EarlyStoppingConfig (patience, min_delta, threshold, max_evaluations) and the same unified Evaluator over 60+ FAGI rubrics. Error Feed, the clustering and what-to-fix layer of the eval stack that feeds the self-improving evaluators, sits alongside as the zero-config error monitor: it auto-clusters related per-model failures into named issues (50 traces → 1 issue), auto-writes a root cause, a quick fix, and a long-term recommendation per issue, and tracks a rising, steady, or falling trend per issue so emerging local-model regressions surface like exceptions. Typical week-one discovery: the local model is handling turns where its failure rate is 4x the hosted model’s, because the rule was sending them local on token count alone. The optimizer adds a language and task-type heuristic and the cluster-failure rate drops.
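
The default routing and spillover rules described in the list above can be sketched in a few lines. This illustrates the logic only, not Future AGI's policy format: the function names and trigger keys are hypothetical, and "8K" and "32K" are read as 8,000 and 32,000 tokens, which is an assumption.

```python
from typing import Optional

LOCAL_MODEL = "qwen2.5-coder-14b"

# Spillover targets per trigger, as listed in the defaults above.
SPILLOVER_TARGET = {
    "gpu_oom": "claude-sonnet-4-6",
    "context_overflow": "claude-sonnet-4-6",
    "latency_budget": "claude-haiku-4-5",
}


def route_by_input_tokens(input_tokens: int) -> str:
    """Under 8K local, 8K to 32K to Sonnet, over 32K to Opus."""
    if input_tokens < 8_000:
        return LOCAL_MODEL
    if input_tokens <= 32_000:
        return "claude-sonnet-4-6"
    return "claude-opus-4-7"


def resolve(input_tokens: int, trigger: Optional[str] = None) -> str:
    """A fired spillover trigger overrides the size-based route."""
    if trigger is not None:
        return SPILLOVER_TARGET[trigger]
    return route_by_input_tokens(input_tokens)
```

For example, `resolve(2_000)` stays on the local model, while `resolve(2_000, "latency_budget")` returns `claude-haiku-4-5`.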

Where it falls short:

agent-opt is opt-in. Start with traceAI + ai-evaluation for one-week pilots and turn the optimizer on once eval baselines stabilize. If the goal is just “front Ollama with one URL,” LiteLLM is a smaller surface area.
The Protect guardrail layer is gated behind the enterprise tier; the free tier exposes routing and traces but not realtime guardrails (Protect runs at ~65 ms text latency per arXiv 2510.13351).

Pricing: Free tier with 100K traces / month. Scale tier starts at $99/month. Enterprise is custom with SOC 2 Type II and a BAA. AWS Marketplace listing for procurement.

Score: 7/7 axes.

2. LiteLLM: Best for self-hosted Python proxy fronting Ollama, vLLM, and llama.cpp

Verdict: LiteLLM is the gateway most teams reach for first when “Aider plus local models” is the workload. It runs as a Python proxy inside your VPC, has adapters for every local backend that matters, and exposes the unified OpenAI URL Aider expects. The strongest fit for the day-one configuration. It doesn’t optimize back, the dashboard slicing is shallow, and the March 2026 supply-chain incident is now part of the procurement story.

What it does for Aider with local models:

Local backend adapter coverage for Ollama, vLLM, llama.cpp, LM Studio, Together, HuggingFace TGI, and Replicate. Each maps to a model entry in the proxy config.
Hosted spillover via the fallbacks config. Declare an ordered list (qwen2.5-coder-14b → claude-sonnet-4-6 → claude-opus-4-7) and LiteLLM walks it on configured error types (context-overflow, 429, 5xx).
Turn-routing by complexity via the Router class. A custom routing function in Python inspects the request body and picks the model; “small turns local, big turns hosted” is ten lines.
Streaming + tool-call fidelity confirmed for Aider’s diff-edit and shell-exec on local backends. Tool calls translate correctly between Anthropic and OpenAI shapes.
Local-only observability sink via built-in Postgres logging plus optional OTel exporter. Wire Phoenix, Langfuse self-hosted, or Future AGI’s BYOC traceAI behind LiteLLM and traces stay in the perimeter.
GPU-aware health checks via the proxy’s health-check endpoint. Request-roundtrip check, not a GPU-utilization read.
Self-improving loop. Not built in.
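
The fallback walk described above behaves roughly like the loop below. This is a plain-Python sketch of the semantics, not LiteLLM's code or API: `BackendError`, `RETRYABLE`, `complete_with_fallbacks`, and the `call` callback are all hypothetical names.

```python
from typing import Callable


class BackendError(Exception):
    """Hypothetical error carrying a short kind such as '429' or '5xx'."""

    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind


# Error kinds that move the request to the next model in the chain.
RETRYABLE = {"context_overflow", "429", "5xx"}


def complete_with_fallbacks(chain: list[str], call: Callable[[str], str]) -> tuple[str, str]:
    """Try each model in order; return (model, reply) from the first success."""
    last_error = None
    for model in chain:
        try:
            return model, call(model)
        except BackendError as err:
            if err.kind not in RETRYABLE:
                raise  # e.g. an auth failure should not fan out to other models
            last_error = err
    raise RuntimeError("every model in the fallback chain failed") from last_error


def fake_call(model: str) -> str:
    """Stand-in backend: the local model overflows, hosted models succeed."""
    if model == "qwen2.5-coder-14b":
        raise BackendError("context_overflow")
    return "patched"


used, reply = complete_with_fallbacks(
    ["qwen2.5-coder-14b", "claude-sonnet-4-6", "claude-opus-4-7"], fake_call
)
```

Restricting the walk to a fixed set of error kinds keeps the fallback deterministic: a misconfigured key fails loudly instead of being retried on a more expensive model.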

Where it falls short:

The March 24, 2026 supply-chain compromise on PyPI versions 1.82.7 and 1.82.8 (remediated past 1.83.7) is the procurement conversation now. Pin the version, vendor the dependency, scan with Sigstore.
The UI is functional, not polished. Slicing by developer or repo means a SQL dashboard.
The Router policy is Python: fine for ML engineers, friction for a platform team that wants YAML.
No native optimizer; the gateway is as smart as the last commit a human made.

Pricing: Open source under MIT. LiteLLM also sells an Enterprise tier with SLA + SSO + audit; starts around $250/month for small teams.

Score: 5.5/7 axes (missing: native polished dashboard, optimizer).

3. Portkey: Best for hosted gateway with virtual keys when local is primary and hosted is the fallback

Verdict: Portkey is the pick when the local model handles 80%+ of Aider turns and the gateway’s main job is to clean up the spillover. The local-only-observability angle is weaker than LiteLLM or Future AGI BYOC. Portkey’s traces live in the hosted environment by default. But the BYOC option closes most of that gap. The Palo Alto Networks acquisition (announced April 30, 2026) is the new procurement context.

What it does for Aider with local models:

Local backend adapter coverage. 250+ provider integrations advertised; local-backends include Ollama, vLLM, and llama.cpp via custom_host. Register the Ollama URL as a Portkey provider, route from a virtual key.
Hosted spillover via fallback and loadbalance configs in declarative YAML / JSON.
Turn-routing by complexity via conditional-routing config matching request-body. Config, not code.
Streaming + tool-call fidelity confirmed with Aider on claude-sonnet-4-6 and Ollama qwen2.5-coder-14b. SSE solid; gRPC roadmap.
Local-only observability sink. The default is Portkey’s hosted dashboard, which is the wrong fit if “no prompt leaves the box” is the rule. BYOC deploys the gateway in your VPC with traces in your own stack.
GPU-aware health checks via request-roundtrip provider pings. These pull a dead Ollama replica out of rotation, but not a slow one.
Self-improving loop. Not built in.

Where it falls short:

Default hosted-trace path is the wrong fit for the strict-local case. Inside-perimeter traces require BYOC, which is enterprise-tier.
No optimizer.
Palo Alto Networks acquisition (announced April 30, 2026, close expected in PANW fiscal Q4) is the new procurement question for shops that prefer vendor independence.

Pricing: Free tier with 10K requests/day. Scale tier starts at $99/month. Enterprise is custom with SOC 2 Type II.

Score: 6/7 axes (missing: feedback loop / optimization).

4. vLLM with a proxy front: Best for GPU-native serving with a thin OpenAI-compatibility shim

Verdict: vLLM isn’t a gateway on its own. It’s a GPU-native LLM server with the highest throughput per H100 of any open-source runtime as of May 2026; published benchmarks show roughly 2 to 4x the throughput of llama.cpp on the same hardware for a 14B-class model. Pair vLLM with a thin proxy (LiteLLM, an Envoy filter, or a 200-line FastAPI shim) and you get a gateway for the team whose priority is “serve the local model as fast as possible and route on top.”

What it does for Aider with local models:

Local backend adapter. vLLM serves OpenAI-compatible chat-completions and Responses API natively via vllm serve <model>. Aider points at it directly. The “proxy front” (LiteLLM, Envoy, Kong) is what you add for routing and observability.
Spillover and turn-routing are the proxy’s job. vLLM’s contribution is to make the local path fast enough that more turns stay local.
Streaming + tool-call fidelity. Excellent SSE. Tool calling on local models is model-dependent: qwen2.5-coder-32b-instruct and llama-3.3-70b-instruct work reliably; smaller qwen2.5-coder-14b is fine on simple schemas, flakier on multi-tool dispatch.
Local-only observability sink. Native Prometheus metrics (per-request latency, KV-cache hit rate, GPU utilization, queue depth) feed cleanly into self-hosted Grafana.
GPU-aware health checks. This is the pick where GPU-awareness is real. Prometheus exposes actual GPU memory, KV pressure, and queue depth, so a proxy in front can load-shed on signal rather than a heartbeat.
Self-improving loop. Not built in. The loop is whatever you build on top.
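
To make "load-shed on signal rather than a heartbeat" concrete, here is a minimal sketch of a proxy-side check over the text a `/metrics` endpoint returns. The metric names `vllm:num_requests_waiting` and `vllm:gpu_cache_usage_perc` are assumptions that vary by vLLM version, the thresholds are arbitrary, and the parser assumes one series per metric with no trailing timestamps.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {metric_name: value}, dropping labels.

    If a metric has several labelled series, the last one wins.
    """
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values


def should_shed(metrics: dict[str, float], max_waiting: float = 4, max_kv_usage: float = 0.95) -> bool:
    """Shed to another backend when the queue or the KV cache is saturated."""
    waiting = metrics.get("vllm:num_requests_waiting", 0.0)
    kv_usage = metrics.get("vllm:gpu_cache_usage_perc", 0.0)
    return waiting > max_waiting or kv_usage > max_kv_usage


# Hand-written sample in the exposition format, not captured from a real server.
SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="qwen2.5-coder-14b"} 7.0
vllm:gpu_cache_usage_perc{model_name="qwen2.5-coder-14b"} 0.42
"""

shed = should_shed(parse_metrics(SAMPLE))
```

In this sample the queue depth of 7 exceeds the threshold of 4, so the check says to shed even though the replica would still answer a heartbeat.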

Where it falls short:

“vLLM with a proxy front” is two products held together by your platform team. If the proxy goes down, vLLM is exposed; if vLLM goes down, the proxy has to spillover correctly.
Local-model tool-calling is still model-quality dependent. Run a tool-use eval on your specific model before production.
The OpenAI-compatibility surface is narrower than a real gateway’s. Edge cases (logprobs, response_format, beta features) are sometimes missing or behind a flag.
No native multi-provider translation. Claude routing needs a real gateway underneath; vLLM alone won’t do it.

Pricing: vLLM is Apache 2.0. The proxy front is whatever you choose (LiteLLM MIT, Kong open-source / enterprise). Hardware is the meaningful line item.

Score: 4.5/7 axes (missing: native multi-provider, native optimizer, native turn-routing).

5. Kong AI Gateway: Best for plugin-stack control plane on top of your existing Kong

Verdict: Kong AI Gateway is the pick when the platform team already runs Kong for the rest of the company’s APIs and the path of least resistance is to extend that stack with the AI Proxy plugin. Strengths: plugin ecosystem, operational familiarity, single control plane. Weaknesses: AI-specific shallowness, the local-backend adapters are improving in 3.6+ but less mature than LiteLLM’s, and most observability and routing is plugin composition rather than first-class.

What it does for Aider with local models:

Local backend adapter coverage via the AI Proxy plugin’s custom-LLM provider option. Kong 3.6+ documents Ollama and OpenAI-compatible custom endpoints; vLLM and llama.cpp work through the OpenAI path. Functional, not breadth-first.
Hosted spillover and turn-routing via the AI Proxy fallback config plus request-transformer plugin, plus expression-based routing on body fields. “Local first, hosted on error” is a multi-plugin composition, half a day for a platform engineer.
Streaming + tool-call fidelity supported in 3.6+. SSE works; tool-call JSON survives.
Local-only observability sink through Kong’s plugin ecosystem. OpenTelemetry to a self-hosted collector, Prometheus to your scraper, Splunk / Datadog / Loki for the log line.
GPU-aware health checks through Kong’s upstream health check, which is a heartbeat, not a GPU-load read. A custom Lua plugin can ping vLLM’s /metrics, but that’s code you write.
Self-improving loop. Not built in.

Where it falls short:

AI-specific features lag dedicated AI gateways by typically two quarters.
The “five plugins glued together” pattern is fine for an existing Kong team and miserable for a team setting up a control plane from scratch for one workload.
No optimizer.
The AI Proxy plugin’s local-backend adapters are documented but not as battle-tested as LiteLLM’s; expect upstream bugs in the first month.

Pricing: Kong is open source. Kong Konnect (managed) starts free. Enterprise plans for SLA, plugins, and support start around $1.5K/month.

Score: 4.5/7 axes (missing: native AI observability depth, optimizer, mature local-backend adapters).

Capability matrix

Axis	Future AGI	LiteLLM	Portkey	vLLM + proxy	Kong AI Gateway
Local backend adapter coverage	Ollama, vLLM, llama.cpp, LM Studio	Ollama, vLLM, llama.cpp, LM Studio, TGI	Ollama, vLLM, llama.cpp (custom_host)	vLLM only; proxy adds rest	Ollama + OpenAI-compatible custom (3.6+)
Hosted spillover policy	Three triggers, declarative YAML	Fallbacks list, declarative	Fallback config, declarative	Proxy’s job	Plugin composition
Turn-routing by complexity	YAML route policy, optimizer-tuned	Python Router class	Conditional-routing config	Proxy’s job	Expression-based routing
Streaming + tool-call fidelity	Yes	Yes	Yes	Yes (vLLM 0.10+)	Yes (3.6+)
Local-only observability sink	BYOC + traceAI Apache 2.0	Postgres + OTel	BYOC tier	Prometheus + your stack	Plugins to your sink
GPU-aware health checks	Heartbeat + latency budget	Heartbeat	Heartbeat	Prometheus metrics native	Heartbeat + custom Lua
Self-improving loop	fi.opt + fi.evals + traceAI	None	None	None	None

Decision framework: Choose X if

Choose Future AGI if the goal is “Aider plus local models gets better at the routing decision every week without a human in the loop.” Pick this when the cost-quality curve is the metric leadership cares about. The loop is the wedge; BYOC is what makes it acceptable to security.

Choose LiteLLM if the goal is “front Ollama, vLLM, and llama.cpp with one OpenAI URL inside the VPC, with deterministic fallback to hosted, and the team is comfortable in Python.” Pin past 1.83.7, vendor the dependency, ship. This is where most teams that get to production start.

Choose Portkey if the local model handles most turns and the gateway’s main job is to clean up the hosted spillover with mature virtual keys and observability. Pick the BYOC tier if “no prompt leaves the box” is non-negotiable.

Choose vLLM with a proxy front if the GPU utilization story dominates the routing story: every percentage point of better throughput means more turns stay local. Pair with LiteLLM for routing, Future AGI traceAI for observability, and own the integration.

Choose Kong AI Gateway if Kong is already the control plane for the rest of the company’s APIs and the platform team would rather extend a known stack than introduce a new one.

Common mistakes when wiring Aider through a gateway for local models

Mistake	What goes wrong	Fix
Pointing Aider only at the local model with no fallback	A 32K-context overflow on a multi-file refactor truncates input; Aider commits the wrong diff	Always configure a hosted fallback for context overflow
`--auto-commits` with a flaky-tool-call local model	Aider commits hallucinated function calls	Use `--no-auto-commits` until the local model passes a tool-use eval on your repo
Routing to local by token count alone	A 500-token rename across 12 files fails where a 500-token bug-fix succeeds	Add a language and task-type heuristic; let the optimizer learn the rest
One Ollama replica shared across developers	One long context overflows the KV cache and stalls everyone	Run multiple replicas behind the gateway and load-balance
Hosted observability while running local models	Defeats the point	Use a self-hosted sink: traceAI, Phoenix, Langfuse self-hosted, Postgres
Stuck on LiteLLM `1.82.7` / `1.82.8`	Known-bad version	Pin past `1.83.7`, vendor the dep, scan with Sigstore
Treating vLLM throughput numbers as a routing signal	Throughput is per-batch; tail latency is what Aider’s UX feels	Add a P95 latency budget to the routing rule
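
The last fix, a P95 latency budget, is simple enough to sketch. The example below uses the nearest-rank percentile over a window of recent latencies; the 8-second default mirrors the budget quoted earlier in this post, while the function names and the windowing are assumptions.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def over_budget(latencies_s: list[float], budget_s: float = 8.0) -> bool:
    """True when tail latency, not average throughput, breaches the budget."""
    return p95(latencies_s) > budget_s
```

With twenty samples, a single slow outlier does not trip the budget, but two do: the 19th-ranked sample is what gets compared against the budget.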

How Future AGI closes the loop on Aider with local models

The other four gateways treat local-vs-hosted routing as a one-time configuration: declare the rule, ship, hope it ages. Future AGI treats it as the input to a feedback loop. Six stages:

Trace. Every Aider turn, local or hosted, produces a span tree via traceAI (Apache 2.0). Spans capture inputs, outputs, tool calls, model used, latency, GPU replica, and the file paths Aider was operating on. Local traces stay inside the perimeter when running BYOC.
Evaluate. fi.evals scores every turn against task-completion, faithfulness, and code-correctness rubrics. Wire your CI’s unit-test-pass-rate as an additional signal; it’s the single most predictive eval for an Aider workload.
Cluster. Low-scoring sessions get clustered by failure mode. Two common week-one patterns: “local model is consistently failing tool-call dispatch on the same kind of refactor,” and “the routing rule is sending high-context turns to the local model and the context is silently truncating.”
Optimize. fi.opt.optimizers rewrites the system prompt or the routing policy against the clustered failures. The six optimizers are RandomSearchOptimizer, BayesianSearchOptimizer (Optuna-backed, with teacher-inferred few-shot templates and resumable studies), MetaPromptOptimizer, ProTeGi, GEPAOptimizer, and PromptWizardOptimizer; all share an EarlyStoppingConfig (patience, min_delta, threshold, max_evaluations) and the same unified Evaluator over 60+ FAGI rubrics. Typical edits: route diffs spanning more than three files to claude-sonnet-4-6, route Python turns local and Rust turns hosted, route turns after a tool-call failure to hosted.
Route. The gateway applies the updated policy on the next request. The local-replica health check and the optimizer-tuned rule cooperate: the gateway won’t route to a stuck replica even if the rule says it should.
Re-deploy. New prompt and route are versioned. Roll forward; on eval regression, automatic rollback.
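
The "typical edits" from the Optimize stage and the health-check override from the Route stage combine into a rule like the one below. It is a hand-written sketch of what such a policy might look like, not output from the optimizer; the `Turn` fields and the `route` function are hypothetical.

```python
from dataclasses import dataclass

LOCAL = "qwen2.5-coder-14b"
HOSTED = "claude-sonnet-4-6"


@dataclass
class Turn:
    language: str                      # e.g. "python" or "rust"
    files_touched: int                 # number of files in the proposed diff
    previous_tool_call_failed: bool = False


def route(turn: Turn, local_healthy: bool) -> str:
    """Apply the health check first, then the task-type rules."""
    if not local_healthy:
        return HOSTED  # never route to a stuck replica, whatever the rule says
    if turn.previous_tool_call_failed:
        return HOSTED
    if turn.files_touched > 3:
        return HOSTED
    if turn.language == "rust":
        return HOSTED
    return LOCAL
```

Ordering matters: the health check runs before any rule, which is the cooperation described in the Route stage.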

Net effect: a team starting with Aider plus a local 14B-coder typically sees the “wrong-routing” rate drop from roughly one in four turns to one in twelve within four weeks, without any developer changing their workflow.

The three building blocks are open source:

traceAI, github.com/future-agi/traceAI (Apache 2.0)
ai-evaluation, github.com/future-agi/ai-evaluation (Apache 2.0)
agent-opt, github.com/future-agi/agent-opt (Apache 2.0)

The hosted Agent Command Center adds the failure-cluster view and the Future AGI Protect model family as the inline guardrail layer, at ~65 ms p50 text and ~107 ms p50 image latency (arXiv 2510.13351). Protect is FAGI’s own set of fine-tuned Gemma 3n adapters covering content moderation, bias detection, security/prompt-injection, and data privacy/PII, multi-modal across text, image, and audio; it is a model family rather than a plugin chain. The hosted product also adds RBAC, SOC 2 Type II certification, and an AWS Marketplace listing for procurement.

What we did not include

We deliberately left out three options that show up in other 2026 listicles:

Helicone. Strong observability layer but the local-backend story is shallower than LiteLLM’s, and routing is round-robin / failover rather than complexity-aware.
OpenRouter. Consumer-facing model directory; no local-model story.
Cloudflare AI Gateway. Strong edge-deployment story but the local-backend adapter path requires a Workers-side custom backend, which is workable but not a published feature.

All three are worth a second look in Q3 2026.

Sources

Aider documentation, aider.chat
Aider repository, github.com/Aider-AI/aider
Ollama, ollama.com
vLLM, github.com/vllm-project/vllm
Future AGI Agent Command Center, futureagi.com/platform/monitor/command-center
LiteLLM proxy, github.com/BerriAI/litellm
Portkey AI gateway, portkey.ai
Kong AI Gateway, konghq.com/products/kong-ai-gateway
Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (65 ms text, 107 ms image)

Frequently asked questions

Why run Aider with a local model in the first place?

Cost (the marginal cost of a turn on a paid-for H100 is electricity, not API tokens), privacy (the prompt and the repo never leave the machine), and latency (a small local model on a warm GPU often beats a hosted call on first-token latency for short turns).

Does Aider support OpenAI-compatible endpoints?

Yes. Set `OPENAI_API_BASE` (or `--openai-api-base`) to the gateway's URL. All five gateways here support this.

Can I route Aider through multiple local models on the same machine?

Yes — that is the routing-by-complexity case. The gateway exposes a single URL to Aider and routes to the right local model based on the request body. Ollama and vLLM both handle multi-model hosting; the gateway handles dispatch.

How do I track cost per developer when the model is local?

Cost per turn on a local model is 'GPU-hours allocated divided by turns served.' Future AGI and LiteLLM both compute this if you wire the per-replica GPU-hours into the gateway config. Per-developer attribution requires tagging the request with a developer ID.
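
As a worked version of that arithmetic, the sketch below prices a turn and splits the bill by developer tag. The hourly rate is a made-up figure needed to turn GPU-hours into dollars, and both function names are hypothetical; neither gateway's actual accounting is shown here.

```python
def cost_per_turn(gpu_hours: float, hourly_rate_usd: float, turns_served: int) -> float:
    """GPU-hours allocated, priced at an hourly rate, divided by turns served."""
    if turns_served <= 0:
        raise ValueError("turns_served must be positive")
    return gpu_hours * hourly_rate_usd / turns_served


def cost_by_developer(
    gpu_hours: float, hourly_rate_usd: float, turns_by_dev: dict[str, int]
) -> dict[str, float]:
    """Split the GPU bill across developers in proportion to tagged turns."""
    per_turn = cost_per_turn(gpu_hours, hourly_rate_usd, sum(turns_by_dev.values()))
    return {dev: turns * per_turn for dev, turns in turns_by_dev.items()}


# Hypothetical day: 8 GPU-hours at $2.50/hour, 400 turns tagged by developer ID.
shares = cost_by_developer(8, 2.50, {"alice": 300, "bob": 100})
```

This attributes cost purely by turn count; a long-context turn costs more GPU time than a short one, so weighting by tokens would be a reasonable refinement.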

What happens to Aider's tool calls when the gateway routes to a local model?

All five gateways pass tool calls through intact on the local leg. The catch is the model's tool-calling reliability — a `qwen2.5-coder-14b` is correct on simple schemas but flakier on multi-tool dispatch. Run a tool-use eval on your model before production.

Is it safe to send source code through an AI gateway?

For local-only deployments, the data flow is Aider → gateway → local model, all inside the VPC. No prompt leaves the perimeter. Future AGI BYOC, LiteLLM, Portkey BYOC, vLLM + proxy, and Kong all support this. Run the trace sink on the same network and the property holds end-to-end.

How is Future AGI Agent Command Center different from LiteLLM for Aider with local models?

LiteLLM is the right gateway when the goal is to ship the day-one configuration: front Ollama and vLLM under one URL, fall back to hosted on error, ship. Future AGI does the same, plus scores every turn, clusters the failures, and rewrites the routing rule on the next deploy. LiteLLM is a static gateway you tune by hand; Agent Command Center is a self-improving gateway you tune by reviewing the optimizer's suggestions.
