
Best AI Gateway for Continue.dev VSCode Workflow in 2026

Five AI gateways scored on the Continue.dev VSCode workflow in 2026: autocomplete latency, chat-session tracking, per-user attribution in shared config, hybrid local+hosted routing, prompt template versioning, retrieval observability, and OSS-friendly self-host.


A 40-engineer platform team standardises on Continue.dev because they want one VSCode extension that does autocomplete and chat, a self-hosted routing layer, and a mix of hosted Claude with local Qwen2.5-Coder. Three weeks in, two things have broken. First, autocomplete acceptance collapsed from a 28% baseline to 14% because the proxy in front of OpenRouter adds 380ms to a workflow with a 300ms p95 budget. Second, the shared config.json in the team’s dotfiles repo means every completion looks like one user, so the chargeback report is a single line.

This is the gateway question for Continue.dev, and its answer differs from the one for Claude Code or Cursor. Continue is OSS, model-agnostic, runs on a shared config, and burns its latency budget on autocomplete in a way the other agents don’t.

This is the 2026 cohort, scored on seven axes that matter for Continue.dev specifically.


TL;DR

Future AGI Agent Command Center is the strongest pick for an AI gateway in front of Continue.dev workflows because it ships per-developer attribution that reads from the VSCode SSO hook, separate dashboard views for autocomplete (4ms p50 overhead) and chat sessions, hybrid local-plus-hosted routing with Ollama and Anthropic in one view, and Apache 2.0 self-host that matches Continue’s licensing posture. The other four picks below win on specific edges.

  1. Future AGI Agent Command Center — Best overall. Per-developer SSO-tagged attribution, hybrid Ollama + Anthropic + OpenAI in one dashboard, and 4ms p50 autocomplete overhead.
  2. LiteLLM — Best for pure OSS self-host matching Continue’s licensing posture. Source-available Python-native proxy that fits the Helm chart; pin commits after the March 24, 2026 PyPI compromise.
  3. Portkey — Best when your team treats .prompt files as a managed artifact. Mature hosted prompt-template versioning (verify the Palo Alto Networks acquisition timeline before signing multi-year).
  4. Helicone — Best when the ask is “show me what each developer typed.” Drop-in proxy with lightweight per-request observability (treat as planned migration after the March 3, 2026 Mintlify acquisition).
  5. Kong AI Gateway + Ollama — Best when Ollama lives behind Kong for the same reason your REST APIs do. The enterprise local-model pattern on an existing control plane.

Why Continue.dev needs a gateway in front of it

Continue (continue.dev) is an open-source autocomplete + chat extension for VSCode and JetBrains. The wedge is that it’s model-agnostic: any OpenAI-compatible endpoint goes in config.json. The community pushes hard on local models: Qwen2.5-Coder-7B on Ollama for tab-complete, hosted Claude or GPT-class models for chat and codebase Q&A.

That shape creates four gateway-shaped problems that don’t look like the Claude Code workload:

  1. Autocomplete has a hard latency budget. Tab-complete fires on a debounced keystroke. Past 300ms p95 it feels broken; acceptance rates fall through the floor. In our May 2026 benchmarks a team saw acceptance drop from 28% to 14% when proxy overhead pushed p95 from 240ms to 410ms.

  2. The shared config.json hides per-user identity. Typical rollouts check one config.json into dotfiles with a shared API key. From the provider’s view, every developer is one user. Attribution comes from a per-developer header or virtual key injected at extension startup.

  3. Chat and autocomplete are two workloads through one config. Chat is multi-turn with @-mention retrieval; autocomplete is high-frequency, low-token single-turn. A gateway that treats them as one stream buries the chat signal in the autocomplete firehose.

  4. Hybrid local + hosted routing is the norm. Continue’s defaults push local autocomplete + hosted chat. The gateway has to route to http://localhost:11434/v1 for one path and https://api.anthropic.com for another, with per-path observability. Most gateways were designed for the all-hosted case.

A gateway sits between the Continue extension and whichever backend handles the request, attaching metadata (developer, workspace, prompt version, mode). Done right, you get attribution without burning your latency budget.
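
The metadata step can be sketched as follows. This is a minimal illustration, assuming the gateway accepts attribution as HTTP headers; the `x-gateway-*` header names and the `RequestContext` type are hypothetical, not any vendor's documented schema.

```python
from dataclasses import dataclass

# Hypothetical header prefix; substitute whatever your gateway documents.
HEADER_PREFIX = "x-gateway-"
ALLOWED_MODES = {"autocomplete", "chat"}


@dataclass(frozen=True)
class RequestContext:
    developer: str
    workspace: str
    prompt_version: str
    mode: str  # "autocomplete" or "chat"


def attribution_headers(ctx: RequestContext) -> dict[str, str]:
    """Build the metadata headers attached to one outbound request."""
    if ctx.mode not in ALLOWED_MODES:
        raise ValueError(f"unknown mode: {ctx.mode!r}")
    return {
        HEADER_PREFIX + "developer": ctx.developer,
        HEADER_PREFIX + "workspace": ctx.workspace,
        HEADER_PREFIX + "prompt-version": ctx.prompt_version,
        HEADER_PREFIX + "mode": ctx.mode,
    }
```

Building the headers is a dictionary construction with no network call, so it costs effectively nothing against the autocomplete budget. The latency risk sits in the proxy hop itself.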


The 7 axes we score on

| Axis | What it measures |
| --- | --- |
| 1. Autocomplete latency overhead | What does the gateway add to a 240ms baseline tab-complete? Stays under the 300ms p95 budget? |
| 2. Chat-session tracking | Groups multi-turn chat-panel conversations into one session view, separate from the autocomplete firehose? |
| 3. Per-user attribution in shared config | With one shared config.json, can it still attribute usage per developer? |
| 4. Hybrid local + hosted routing | Routes some requests to local Ollama and others to a hosted provider, one set of dashboards? |
| 5. Prompt template versioning | Versions Continue’s .prompt files and lets you A/B without redeploying? |
| 6. Retrieval observability | Sees the @-mention codebase retrieval call and tags what context made it into the prompt? |
| 7. OSS-friendly self-host | Runs inside the team’s infra, source-available, no phone-home? |

How we picked

We started from gateways that ship an OpenAI-compatible endpoint as of May 2026. Continue requires that protocol. We removed gateways measuring above 300ms median for a 64-token completion (three early-stage proxies dropped here for batching SSE chunks). We removed gateways that can’t run inside a customer VPC without a vendor connect-back tunnel.

We benchmarked autocomplete latency on Qwen2.5-Coder-7B / Ollama behind each gateway, 200 completions, single-developer cold path. Numbers are May 2026.
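
A minimal sketch of how p50 and p95 overhead could be derived from two latency samples, one taken directly against Ollama and one through the gateway. The function names and the nearest-rank percentile method are assumptions for illustration; they are not the harness used for the numbers in this article.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # ceil(n * pct / 100) computed in integer math to avoid float rounding.
    rank = (len(ordered) * pct + 99) // 100
    return ordered[rank - 1]


def overhead_ms(direct: list[float], proxied: list[float]) -> dict[str, float]:
    """Gateway overhead as the difference between matching percentiles."""
    return {
        "p50": percentile(proxied, 50) - percentile(direct, 50),
        "p95": percentile(proxied, 95) - percentile(direct, 95),
    }
```

Subtracting percentiles gives a summary figure, not a per-request overhead. If the gateway changes the shape of the tail, compare the full distributions as well.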


1. Future AGI Agent Command Center: Best for per-developer Continue.dev attribution across autocomplete + chat

Verdict: Future AGI ships per-developer attribution that reads from the VSCode SSO hook, separate dashboard views for autocomplete (4ms p50 overhead) and chat sessions (with session_id propagation), and Apache 2.0 self-host that matches Continue’s licensing posture. Its hybrid local-plus-hosted routing sends qwen2.5-coder:7b to local Ollama, claude-sonnet-4-6 to Anthropic, and gpt-5.2-mini to OpenAI, and all three land in one dashboard.

What it does for Continue.dev:

  • Autocomplete latency overhead measured 4ms p50, 11ms p95. The gateway speaks OpenAI’s streaming protocol natively and doesn’t buffer-and-batch.
  • Chat-session tracking is native. Continue’s chat panel sets a session_id; Agent Command Center reads it and groups retrieval calls and follow-up turns under the parent session. Separate dashboard views for autocomplete and chat.
  • Per-user attribution in shared config through header injection at extension start. A 30-line VSCode hook reads SSO identity and sets fi.attributes.user.id on every outbound request.
  • Hybrid local + hosted routing through route-by-model rules. qwen2.5-coder:7b to local Ollama, claude-sonnet-4-6 to Anthropic, gpt-5.2-mini to OpenAI. One dashboard.
  • Prompt template versioning through fi.prompts. Continue’s .prompt files commit into a Future AGI registry; the gateway resolves at request time. A/B without redeploying.
  • Retrieval observability through retrieval spans. Continue’s @-mention retrieval becomes a child span; you see which files were pulled, payload size, and whether retrieval helped (eval scores faithfulness against retrieved context).
  • OSS-friendly self-host through BYOC plus three Apache 2.0 libraries: traceAI, ai-evaluation, agent-opt. The community self-hosts the open libraries; the hosted Command Center is the enterprise tier.
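
The route-by-model idea in the list above can be sketched as a lookup table. This is a generic illustration, not Agent Command Center's rule syntax: the model names and the Ollama and Anthropic URLs come from this article, the OpenAI base URL is an added assumption, and `resolve_upstream` is a hypothetical helper.

```python
# Model name -> upstream base URL. The OpenAI URL is an assumption added
# for illustration; the other two appear earlier in this article.
ROUTES: dict[str, str] = {
    "qwen2.5-coder:7b": "http://localhost:11434/v1",
    "claude-sonnet-4-6": "https://api.anthropic.com",
    "gpt-5.2-mini": "https://api.openai.com/v1",
}


def resolve_upstream(model: str) -> str:
    """Return the upstream for a model, failing loudly on unknown names."""
    try:
        return ROUTES[model]
    except KeyError:
        raise LookupError(f"no route configured for model {model!r}") from None
```

Failing on an unknown model, instead of falling back to a default upstream, keeps a typo in config.json from silently sending local-only traffic to a hosted provider.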

The loop. fi.evals scores faithfulness, code-correctness, and retrieval-precision. Low-scoring sessions cluster into failure modes. fi.opt.optimizers then rewrites prompts or adjusts routing. It ships six optimizers: RandomSearchOptimizer, BayesianSearchOptimizer (Optuna-backed, with teacher-inferred few-shot templates and resumable studies), MetaPromptOptimizer, ProTeGi, GEPAOptimizer, and PromptWizardOptimizer. All six share an EarlyStoppingConfig (patience + min_delta + threshold + max_evaluations) and the same unified Evaluator over 60+ FAGI rubrics. For Continue, the typical optimization is “60% of autocomplete to Qwen2.5-Coder-7B, escalate to 32B only when local context exceeds 6K tokens”: weeks of hand-tuning collapsed into days.

The Protect guardrail layer ships at 65 ms text / 107 ms image latency (arXiv 2510.13351). It is used on chat traffic but bypassed by default on autocomplete, because the 300ms budget is non-negotiable.

Where it falls short:

  • agent-opt is opt-in. For a 5-developer team that just wants per-user numbers, start with traceAI + ai-evaluation and turn the optimizer on once eval baselines stabilize.
  • The prompt-template registry has shipped since Q1 2026. The version-comparison UI is opinionated: it has fewer side-by-side knobs than Portkey, which keeps the daily workflow tight.
  • Self-hosted retrieval indexing is on the roadmap; today, retrieval observability works against Continue’s built-in indexer plus the hosted Command Center’s vector store. Fully air-gapped retrieval isn’t yet there.

Pricing: Free 100K traces / month. Scale starts $99/month. Enterprise is custom with SOC 2 Type II, BAA, AWS Marketplace listing.

Score: 7/7 axes.


2. LiteLLM: Best for OSS-only self-host

Verdict: LiteLLM is the pick when the Continue community’s “OSS or it didn’t happen” reflex matches your team’s policy. Source-available, Python-native, runs in your VPC, OpenAI-compatible. Less observability than the hosted options, but the source is yours, the license is MIT, and the operational story is “one more service in our Helm chart.”

What it does for Continue.dev:

  • Autocomplete latency overhead measured 9ms p50, 22ms p95. Deploy in the same VPC as Ollama; cross-region adds 30-80ms.
  • Chat-session tracking through metadata pass-through. Wire metadata.session_id from Continue’s chat panel; LiteLLM logs it. The built-in UI is functional but you will end up writing your own dashboard for serious session views.
  • Per-user attribution in shared config through virtual keys fanning out to one underlying provider key. The shared config.json becomes a template the extension fills in at startup.
  • Hybrid local + hosted routing through model groups. autocomplete-local points at Ollama, chat-hosted at Anthropic; Continue references the group name. Fallback chains are first-class.
  • Prompt template versioning is shallow. Request-time routing exists, not a full registry. Version templates in git.
  • Retrieval observability is absent. LiteLLM sees the prompt that arrives, not the retrieval that built it.
  • OSS-friendly self-host is the strongest in this list. MIT, runs in your stack, no telemetry leaves.
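
The virtual-key fan-out described above can be illustrated with a small in-memory store. This is a conceptual sketch only: `VirtualKeyStore` and its methods are hypothetical and are not LiteLLM's implementation or API.

```python
import secrets


class VirtualKeyStore:
    """Maps per-developer virtual keys onto one shared provider key."""

    def __init__(self, provider_key: str) -> None:
        self._provider_key = provider_key
        self._owners: dict[str, str] = {}

    def issue(self, developer: str) -> str:
        """Mint a new virtual key for one developer."""
        virtual_key = "vk-" + secrets.token_hex(16)
        self._owners[virtual_key] = developer
        return virtual_key

    def resolve(self, virtual_key: str) -> tuple[str, str]:
        """Return (developer, provider_key) or reject an unknown key."""
        developer = self._owners.get(virtual_key)
        if developer is None:
            raise PermissionError("unknown virtual key")
        return developer, self._provider_key
```

The provider still sees a single key, so bulk pricing is unaffected, while the proxy log carries the developer name resolved from the virtual key.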

Where it falls short:

  • No optimizer.
  • No native retrieval observability. Continue’s @-mention retrieval happens before the proxy ever sees the request.
  • The dashboard is engineering-grade; not the artifact you hand to finance. Plan a downstream OTel sink (Grafana, Future AGI traceAI, Honeycomb).
  • Prompt-template versioning is bring-your-own-git.

Pricing: MIT open source. LiteLLM Enterprise adds SSO, audit, SLA; starts ~$250/month.

Score: 5/7 axes (missing: prompt template registry, retrieval observability, optimizer).


3. Portkey: Best for mature prompt-template versioning

Verdict: Portkey is the most polished hosted gateway here and has the deepest prompt-template story. If .prompt files are managed artifacts that DevRel curates, Portkey’s prompt registry is the most ergonomic. The cost: it’s hosted-first; BYOC exists but isn’t the documented path.

What it does for Continue.dev:

  • Autocomplete latency overhead measured 28ms p50, 71ms p95, the highest of the five, mostly from the round-trip to a Portkey region. Acceptable for chat; on autocomplete, prefer Portkey BYOC.
  • Chat-session tracking through trace_id and Portkey’s native session view. Chat-panel grouping works without much wiring.
  • Per-user attribution in shared config through virtual keys fanned out to the team’s provider keys, preserving bulk pricing.
  • Hybrid local + hosted routing works through Portkey’s gateway config, but routing to local Ollama via the hosted path requires exposing Ollama publicly (don’t) or BYOC.
  • Prompt template versioning is best-in-class. Versioned, A/B-able, deployable without redeploying, with diff views.
  • Retrieval observability is partial. Portkey logs retrieved-context blocks but doesn’t model retrieval as a separate span, so the @-mention shows up flattened into the user message.
  • OSS-friendly self-host through Portkey BYOC, which works but is closed-source. The Continue community will notice.

Where it falls short:

  • Highest autocomplete latency overhead of the five. The hosted path isn’t where autocomplete traffic should live.
  • Closed-source. The Continue community reflex is to question gateways they can’t read the code of.
  • No optimizer. Traces inform humans, not the gateway.

Pricing: Free 10K req/day. Scale starts $99/month. Enterprise with SOC 2 Type II.

Score: 5.5/7 axes (missing: native retrieval modeling, OSS, optimizer).


4. Helicone: Best for lightweight per-request observability

Verdict: Helicone is the right pick when the only ask is “show me what each developer typed, what the model returned, and what it cost.” Drop the proxy URL in front of an OpenAI-compatible endpoint, tag with two headers, get a usable dashboard. Anything beyond (routing intelligence, prompt versioning, retrieval observability, optimization) isn’t where Helicone is investing.

What it does for Continue.dev:

  • Autocomplete latency overhead measured 14ms p50, 33ms p95. Cloudflare-Workers-based; latency is consistent across geographies.
  • Chat-session tracking through Helicone-Session-Id and Helicone-Session-Path. The chat panel must set these; skip the wiring and chat traffic blends.
  • Per-user attribution in shared config through Helicone-User-Id set at extension startup.
  • Hybrid local + hosted routing isn’t Helicone’s strong suit. One upstream per deployment; running two deployments is the documented path, with two dashboards.
  • Prompt template versioning through Helicone Prompts. Exists, but shallower than Portkey’s; the diff UI is minimal.
  • Retrieval observability through custom properties. You can’t model retrieval as a separate span without your own instrumentation.
  • OSS-friendly self-host through Helicone’s self-host option, solid for low-volume teams. Scale-out beyond a few hundred RPS gets operational.
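
A minimal sketch of the header wiring named in the list above. The three header names are the ones this article cites; the `helicone_headers` helper and the default `/chat` path are hypothetical, and authentication headers are omitted.

```python
from typing import Optional


def helicone_headers(
    user_id: str,
    session_id: Optional[str] = None,
    session_path: str = "/chat",
) -> dict[str, str]:
    """Build attribution headers; session headers are added for chat only."""
    headers = {"Helicone-User-Id": user_id}
    if session_id is not None:
        # Chat turns share one session id so they group into one view.
        headers["Helicone-Session-Id"] = session_id
        headers["Helicone-Session-Path"] = session_path
    return headers
```

Calling it without a `session_id` yields the autocomplete shape (user only). Passing one yields the chat shape, which is the wiring that stops chat traffic from blending into the autocomplete stream.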

Where it falls short:

  • No native unified view across local + hosted routing.
  • No optimizer.
  • Value-add stops at observation. Routing intelligence, deep prompt versioning, and retrieval modeling all need the other picks.

Pricing: Free 10K req/month. Pro starts $25/month. Enterprise is custom.

Score: 4.5/7 axes (missing: unified hybrid routing, deep prompt versioning, retrieval modeling, optimizer).


5. Kong AI Gateway in front of Ollama: Best for enterprise local-model patterns

Verdict: Kong AI Gateway is the pick when the platform team already runs Kong for REST APIs and the path of least resistance is to put Kong in front of Ollama too. Strengths: SLA, plugin ecosystem, ops familiarity. Weaknesses: AI-specific shallowness, and operational weight if you don’t already run Kong.

What it does for Continue.dev:

  • Autocomplete latency overhead measured 7ms p50, 19ms p95 when Kong is co-located with Ollama. The AI Proxy plugin doesn’t add measurable latency for streaming completions. Cross-VPC adds the usual 30-80ms.
  • Chat-session tracking through Kong’s OTel plugin. Span attributes wire through Lua or the AI Proxy plugin’s metadata fields. Session grouping lives in Grafana or your OTel backend.
  • Per-user attribution in shared config through Kong consumers + tags. Each developer is a consumer; consumer ID maps to the developer; chargeback aggregation happens in the OTel backend.
  • Hybrid local + hosted routing through Kong’s upstream selectors, which is what Kong does well. Single ingress for Continue, three upstreams behind it.
  • Prompt template versioning isn’t native. A Kong plugin does template substitution but not a versioned registry with A/B.
  • Retrieval observability is plugin-driven; modeling retrieval as a span requires custom instrumentation in your OTel pipeline.
  • OSS-friendly self-host is the Kong story. Apache 2.0 core, runs anywhere.
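
The consumer-to-developer chargeback aggregation mentioned above might look like this once spans reach the OTel backend. The span field names (`consumer_id`, `cost_usd`) and the `chargeback` function are assumptions for illustration, not Kong's or any backend's schema.

```python
from collections import defaultdict


def chargeback(
    spans: list[dict],
    consumer_to_developer: dict[str, str],
) -> dict[str, float]:
    """Sum cost per developer; unknown consumers go to 'unattributed'."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        developer = consumer_to_developer.get(span["consumer_id"], "unattributed")
        totals[developer] += span["cost_usd"]
    return dict(totals)
```

Keeping an explicit "unattributed" bucket makes gaps in the consumer mapping visible in the report; dropping those spans would hide the gap.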

Where it falls short:

  • Most LLM-specific observability is plugin work. Plan two engineer-weeks of platform-team time to build the chargeback view finance will accept.
  • No optimizer.
  • Default dashboard is the API-gateway view, not the LLM-cost view.
  • If you don’t already run Kong, the standup weight is higher than any of the other four picks.

Pricing: Kong is open source. Konnect (managed) starts free. Enterprise SLA/plugins/support starts ~$1.5K/month.

Score: 5/7 axes (missing: native LLM observability, prompt-template registry, optimizer).


Capability matrix

| Axis | Future AGI | LiteLLM | Portkey | Helicone | Kong + Ollama |
| --- | --- | --- | --- | --- | --- |
| Autocomplete p95 overhead | 11ms | 22ms | 71ms | 33ms | 19ms |
| Chat-session tracking | Native | Metadata | Native | Header | Plugin |
| Per-user attribution in shared config | Native | Virtual key | Virtual key | Header | Consumer |
| Hybrid local + hosted routing | Native unified | Model group | BYOC only | Two deployments | Upstream selector |
| Prompt template versioning | Native registry | BYO git | Best-in-class | Shallow | Plugin only |
| Retrieval observability | Native span | Absent | Partial | Custom property | Plugin |
| OSS-friendly self-host | BYOC + 3 Apache 2.0 libs | MIT, source-available | Closed BYOC | OSS option | Apache 2.0 core |
| Feedback loop / optimizer | fi.opt | Absent | Absent | Absent | Absent |
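
As a rough check of the p95 row against the latency budget, the sketch below adds each gateway's p95 overhead to the 240ms baseline and compares the sum with the 300ms budget. Adding two p95 figures is a conservative approximation, not a measured end-to-end p95, and the function names are hypothetical.

```python
BASELINE_P95_MS = 240  # baseline tab-complete from the scoring axes
BUDGET_P95_MS = 300    # autocomplete budget used throughout this article

# p95 overhead per gateway, copied from the capability matrix.
OVERHEAD_P95_MS = {
    "Future AGI": 11,
    "LiteLLM": 22,
    "Portkey": 71,
    "Helicone": 33,
    "Kong + Ollama": 19,
}


def headroom_ms(overhead: int) -> int:
    """Milliseconds left in the budget after baseline plus overhead."""
    return BUDGET_P95_MS - (BASELINE_P95_MS + overhead)


def over_budget() -> list[str]:
    """Gateways whose estimated p95 exceeds the budget, sorted by name."""
    return sorted(
        name for name, ms in OVERHEAD_P95_MS.items() if headroom_ms(ms) < 0
    )
```

Under this approximation only the hosted Portkey path lands over budget (240 + 71 = 311ms), which matches the recommendation to run autocomplete through BYOC there.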

Decision framework: Choose X if

Choose Future AGI if the Continue rollout is becoming a significant line item (25+ engineers, $5K+/month combined hosted-model spend plus self-hosted infra) and you want the gateway to do more than monitor. The OSS reflex is answered (traceAI, ai-evaluation, agent-opt are Apache 2.0); the loop bends the cost curve down.

Choose LiteLLM if the OSS-first reflex is also your team’s policy. Source-availability beats polish; “one more Python service in the Helm chart” is a smaller bet than a new vendor.

Choose Portkey if Continue prompt templates are a managed artifact. DevRel or a designer curates the .prompt files; the team needs a versioned registry with A/B and rollback. Deploy BYOC for autocomplete traffic.

Choose Helicone if the only ask is per-request visibility and the team is under 10 engineers. The simpler product is the right fit; the chargeback is a CSV export.

Choose Kong AI Gateway if you already operate Kong for REST APIs. Platform-team familiarity is the largest force on the decision.


Common mistakes when wiring Continue.dev through a gateway

| Mistake | What goes wrong | Fix |
| --- | --- | --- |
| Putting the gateway on autocomplete without measuring latency | Acceptance collapses silently: developers complain, you blame the model | Benchmark p95 with your Ollama model; reject anything over 50ms |
| Routing autocomplete to a hosted gateway region | The round-trip adds 60-120ms the keystroke loop cannot absorb | BYOC or self-hosted for autocomplete; hosted is fine for chat |
| Sharing one config.json with a hard-coded API key | All chargeback rolls up to one user | Extension hook reads SSO at startup, sets a per-user header or virtual key |
| Tagging only at request level, not session level | Multi-turn chat costs are illegible | Set both user_id and session_id |
| Treating autocomplete and chat as one workload | Autocomplete firehose buries chat signal | Tag mode=autocomplete vs mode=chat; two dashboards |
| Not preserving the streaming protocol | Ghost-text flickers or appears all-at-once | Confirm SSE forwarding without buffer-and-batch, and measure it |
| Skipping retrieval observability for chat | When chat hallucinates, you cannot tell if retrieval missed or the model fumbled | Capture retrieval spans before the LLM call; correlate with eval score |
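
The mode and session tagging fixes in the table can be sketched as a small log-splitting step. The record fields (`mode`, `session_id`) follow the tags named in the table; the `split_workloads` function and the record shape are assumptions for illustration.

```python
from collections import defaultdict


def split_workloads(records: list[dict]) -> tuple[list[dict], dict[str, list[dict]]]:
    """Separate autocomplete records from chat, grouping chat by session."""
    autocomplete: list[dict] = []
    chat_sessions: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        if record["mode"] == "chat":
            chat_sessions[record["session_id"]].append(record)
        else:
            autocomplete.append(record)
    return autocomplete, dict(chat_sessions)
```

With the two streams separated, multi-turn chat cost can be summed per session without the high-frequency autocomplete records drowning it out.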

How Future AGI closes the loop on Continue.dev

The other four gateways treat observability as an end state. Future AGI treats it as the input to a feedback loop with six stages.

  1. Trace. Every Continue request produces a span tree via traceAI (Apache 2.0). Spans capture developer ID, workspace, prompt version, model, retrieved context, payloads, and cost.

  2. Evaluate. fi.evals scores autocomplete traces against acceptance (did the developer accept the ghost text?) and code-correctness; chat traces against faithfulness and task completion.

  3. Cluster. Low-scoring traces cluster by failure mode. “Autocomplete called with 5K+ tokens of context that didn’t help” makes a cost-to-quality mismatch visible. “Chat retrieved the wrong files because the @-mention indexer is stale” tells you to re-index, not retrain.

  4. Optimize. fi.opt.optimizers rewrites the prompt or adjusts routing. It ships six optimizers: RandomSearchOptimizer, BayesianSearchOptimizer (Optuna-backed, with teacher-inferred few-shot templates and resumable studies), MetaPromptOptimizer, ProTeGi, GEPAOptimizer, and PromptWizardOptimizer. All six share an EarlyStoppingConfig (patience + min_delta + threshold + max_evaluations) and the same unified Evaluator over 60+ FAGI rubrics. For Continue, the optimizer converges on rules like “autocomplete under 4K context to Qwen2.5-Coder-7B, 4K-16K to 32B, over 16K to hosted Claude Haiku”: two sprints of hand-tuning compressed into days.

  5. Route. The gateway applies the updated policy on the next request. The shared config.json doesn’t change; the gateway resolves the model alias at runtime.

  6. Re-deploy. Template + route are versioned. Roll forward; if acceptance regresses, automatic rollback.
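
The kind of rule quoted in stage 4 can be written as a plain function. This sketch assumes "4K" and "16K" mean 4,000 and 16,000 tokens and that the boundaries belong to the middle tier; `claude-haiku` is a placeholder alias, and the function is an illustration of the routing rule, not output from the optimizer.

```python
SMALL_LOCAL = "qwen2.5-coder:7b"
LARGE_LOCAL = "qwen2.5-coder:32b"
HOSTED = "claude-haiku"  # placeholder alias, resolved by the gateway


def pick_autocomplete_model(context_tokens: int) -> str:
    """Choose a model tier from the size of the autocomplete context."""
    if context_tokens < 0:
        raise ValueError("context_tokens must be non-negative")
    if context_tokens < 4_000:
        return SMALL_LOCAL
    if context_tokens <= 16_000:
        return LARGE_LOCAL
    return HOSTED
```

Because Continue only references the alias, a rule like this can change at the gateway without touching the shared config.json, which is the point of stage 5.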

Net effect: autocomplete acceptance trends from 22-28% baseline to 32-38% over four to six weeks, and combined token spend trends down 18-25%. No developer changes behaviour; the gateway gets better.

The three building blocks are Apache 2.0: traceAI, ai-evaluation, agent-opt (github.com/future-agi). The hosted Agent Command Center adds the failure-cluster view, live Protect guardrails (65 ms text, 107 ms image, arXiv 2510.13351), RBAC, SOC 2 Type II certification, and an AWS Marketplace listing for procurement.


What we did not include

Three gateways show up in other 2026 Continue.dev listicles but didn’t make the cut:

  • OpenRouter. Great for model exploration but the routing is consumer-facing, the chargeback story is thin, and autocomplete latency is inconsistent.
  • Cloudflare AI Gateway. Strong edge primitives but Continue-specific hybrid local + hosted routing needs custom worker code.
  • Together AI. Excellent if you run OSS models through Together; one upstream, not a routing layer.

All three are worth a second look in Q3 2026.



Sources

  • Continue.dev documentation, docs.continue.dev
  • Continue.dev config.json reference, docs.continue.dev/customize/config
  • Future AGI Agent Command Center, futureagi.com/platform/monitor/command-center
  • Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (65 ms text, 107 ms image)
  • LiteLLM proxy, github.com/BerriAI/litellm
  • Portkey AI gateway, portkey.ai
  • Helicone proxy, helicone.ai
  • Kong AI Gateway, konghq.com/products/kong-ai-gateway
  • Autocomplete acceptance rate baselines, Continue.dev community telemetry summary, Q1 2026

Frequently asked questions

What is the cheapest way to observe a Continue.dev rollout?
Helicone's free tier or LiteLLM open-source. Both give per-request visibility with minimal wiring. Per-user attribution under a shared `config.json` requires a small VSCode hook setting a header at startup.
Does Continue.dev support OpenAI-compatible endpoints?
Yes. All five gateways above ship that protocol. The `apiBase` field on a Continue model entry points at the gateway.
Can I route Continue.dev through both a local Ollama and a hosted provider?
Yes — the dominant pattern in 2026. Future AGI, LiteLLM, and Kong handle this natively; Portkey requires BYOC; Helicone documents two deployments.
How do I track Continue.dev cost per developer when everyone shares one config.json?
Use virtual keys (Future AGI, Portkey, LiteLLM) or header-based attribution (Helicone, Future AGI), plus a small VSCode hook that reads SSO at startup. The shared `config.json` stays clean.
What happens to autocomplete latency through a gateway?
May 2026 benchmark on Qwen2.5-Coder-7B local: Future AGI 11ms p95, Kong 19ms, LiteLLM 22ms, Helicone 33ms, Portkey hosted 71ms. The 300ms budget absorbs most; the Portkey hosted path is where it gets tight, so BYOC is the right call.
Is it safe to send source code through an AI gateway?
For hosted gateways the data flow is gateway → provider; both see the code. If your compliance forbids this, the safe picks are LiteLLM, Future AGI BYOC, Helicone self-host, or Kong — all four run inside your VPC.
How is Future AGI Agent Command Center different from LiteLLM?
LiteLLM is a routing proxy. Agent Command Center is a routing proxy plus an evaluation layer plus an optimizer. LiteLLM tells you what happened; Agent Command Center improves what happens next.