
How to Optimize Vapi Voice Agent Latency in 2026: 12 Techniques with Real Config

Optimize Vapi voice agent latency to sub-500ms p95 in 2026. 12 techniques with real Vapi config: streaming STT, partial TTS, prompt caching, regional routing, async eval.

May 20, 2026


How to Optimize Vapi Voice Agent Latency in 2026

To optimize Vapi voice agent latency, work through 12 techniques, most of which map to a knob in the assistant config: pick a streaming transcriber (Deepgram, with the Flux model for low-latency end-of-turn detection), tune startSpeakingPlan.smartEndpointingPlan for fast turn-taking, anchor the model.messages system prompt so provider prefix caching engages, route short turns to a smaller model.model, set firstMessageMode to assistant-speaks-first for a prebuffered greeting, fire tool calls with async: true, tune stopSpeakingPlan for barge-in, and forward the end-of-call-report from server.url to Future AGI ai-evaluation for async scoring. Stacked, these techniques drop a 1200ms Vapi turn to roughly 500ms p95 on short turns and 700-800ms on tool-heavy turns.

TL;DR: pick by Vapi config knob

Technique	Vapi config knob	Expected p95 win
Streaming STT	`transcriber.provider = "deepgram"` (flux for low-latency EOT)	200-400ms
Partial LLM into TTS	Managed by Vapi runtime (provider streaming)	200-500ms
Prompt prefix caching	`model.messages` byte-stable, `model.provider`	200-400ms
Edge model routing	`model.provider` + `model.model` per assistant	60-150ms
Prefetch tool calls	`tools[].async = true`, `server.url`	200-400ms
Audio prebuffering	`firstMessageMode = "assistant-speaks-first"`	80-200ms
Async evaluation	`server.url` end-of-call-report to FAGI	100-300ms
TTS warm-up	Vapi-managed connection reuse	50-150ms
Smaller models for short turns	`model.model` per intent route	100-300ms
Semantic cache for common intents	Agent Command Center gateway	400-800ms on hits
KV-cache reuse across turns	Provider session caching via stable prefix	100-300ms
Regional routing	Per-component (deepgram, 11labs, cartesia)	30-80ms

How to read this guide

The parent methodology hub covers the 12 techniques in framework form. This post maps each one to the Vapi assistant config. The pattern is consistent: Vapi handles the streaming wiring out of the box. Where you need control over what gets cached, prefetched, routed, or measured, you reach into the config knobs below. Where Vapi abstracts a technique entirely, this post says so and shows the lower-level equivalent via traceAI.

Spans matter. Every “we shipped X technique” claim is hand-wavy without per-stage timing. traceAI emits OpenInference spans for STT, LLM, TTS, and tool calls in one trace per Vapi conversation. 30+ documented integrations across Python and TypeScript cover the voice stack. Apache 2.0. For Vapi dashboards, no SDK is needed: native voice observability ingests via Vapi API key plus Assistant ID through a Future AGI Agent Definition.

1. Streaming STT with first-partial routing

What it does. Switch from a batch transcriber to a streaming one that emits partial transcripts every 100-200ms while the user is still speaking. Feed the latest partial to the LLM the moment intent confidence crosses 0.85. The parent post covers the theory in §1.

Vapi config. Set transcriber.provider to deepgram. For the lowest-latency turn-taking, pick the Flux model and tune eotThreshold plus eotTimeoutMs. For broad-language transcription, pick nova-3 and let startSpeakingPlan.smartEndpointingPlan decide when the user is done.

{
  "transcriber": {
    "provider": "deepgram",
    "model": "flux-general-en",
    "language": "en",
    "eotThreshold": 0.7,
    "eotTimeoutMs": 600,
    "keyterm": ["refund", "balance", "appointment"]
  }
}

Common mistake. Leaving eotTimeoutMs at the default 1000+ms on a conversational agent. That single field adds 400ms to every turn end. Push it to 500-700ms for short turns, higher only when callers tend to pause mid-thought.

Where Vapi abstracts. Vapi auto-feeds the first stable partial to the model and runs end-of-turn detection through startSpeakingPlan.smartEndpointingPlan when numFastTurns patterns apply. If you want to see when the first partial actually fires, traceAI captures gen_ai.voice.latency.stt_first_partial_ms and gen_ai.voice.latency.stt_final_ms on the STT span. The gap between those two is the parallel window the LLM gets for free.
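To read that parallel window off a trace, subtract the two STT attributes named above. The sketch below assumes the span attributes arrive as a plain dict; the sample values are made up for illustration, not measurements.

```python
# Minimal sketch: the attribute names come from the text above; the sample
# values and the dict-shaped span are assumptions for illustration.
FIRST_PARTIAL = "gen_ai.voice.latency.stt_first_partial_ms"
FINAL = "gen_ai.voice.latency.stt_final_ms"


def parallel_window_ms(span_attributes: dict) -> int:
    """Time between the first partial and the final transcript, in ms."""
    first = span_attributes[FIRST_PARTIAL]
    final = span_attributes[FINAL]
    # Clamp at zero in case the attributes arrive out of order.
    return max(0, final - first)


stt_span = {FIRST_PARTIAL: 180, FINAL: 640}  # hypothetical values
print(parallel_window_ms(stt_span))  # 460
```

A window that trends toward zero suggests the first partial is firing late, so the LLM gets little head start.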

2. Partial LLM tokens piped into TTS

What it does. Stream LLM tokens. The moment the first sentence boundary lands, fire that sentence to TTS. The user hears the first word before the LLM finishes the response. The parent post covers the theory in §2.

Vapi config. Vapi runs this pipe automatically when model.provider is one that supports streaming (openai, anthropic, groq, together-ai). You do not toggle it. You verify it by watching the tts_first_audio_ms attribute on the TTS span.

# fi-instrumentation==0.4.2
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType

tracer_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="vapi-prod",
    set_global_tracer_provider=True,
)
# traceAI now emits gen_ai.evaluation.* spans plus gen_ai.voice.latency.* attributes
# Output:
# Registered project: vapi-prod
# Tracer provider attached. Voice attrs: gen_ai.voice.latency.ttfb_ms, gen_ai.voice.latency.tts_first_audio_ms

Common mistake. Picking a model.provider that does not stream tokens (a few self-hosted backends still default to blocking responses). The turn latency doubles because TTS waits on the full response.

Where Vapi abstracts. Fully managed. The lower-level view is the gap between gen_ai.voice.latency.llm_ttft_ms, gen_ai.voice.latency.llm_first_sentence_ms, and gen_ai.voice.latency.tts_first_audio_ms on traceAI spans. A healthy Vapi turn shows TTS starting within 80-150ms of the first sentence boundary.
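Vapi does not expose its internal pipe, but the sentence-boundary idea is easy to picture. The sketch below is an illustration only, not Vapi's implementation: it buffers streamed tokens and yields each sentence as soon as terminal punctuation is followed by whitespace. The function name and the boundary rule are assumptions.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace. Requiring the trailing
# whitespace avoids splitting decimals such as "3.5" mid-stream.
SENTENCE_END = re.compile(r"[.!?]\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = SENTENCE_END.search(buffer)
        while match is not None:
            sentence = buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence  # hand this to TTS immediately
            match = SENTENCE_END.search(buffer)
    # Flush whatever is left when the stream closes.
    tail = buffer.strip()
    if tail:
        yield tail


stream = ["Sure", ".", " I", " can", " help", " with", " that", "!",
          " What", " is", " the", " order", " number", "?"]
for sentence in sentences_from_tokens(stream):
    print(sentence)
```

The first sentence is emitted as soon as the token after the period arrives, which is the point at which TTS can start while the LLM is still generating.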

3. LLM prompt prefix caching

What it does. Anchor the system prompt at the top of model.messages. Keep it byte-identical across turns so the provider’s cache lookup hits. The parent post covers the theory in §3.

Vapi config. Vapi passes model.messages straight to the upstream provider. For Anthropic, cache headers attach automatically when the prefix qualifies. For OpenAI, automatic prompt caching engages on prompts above 1024 tokens when the prefix is byte-stable.

{
  "model": {
    "provider": "anthropic",
    "model": "claude-sonnet-4-5",
    "temperature": 0.3,
    "messages": [
      {
        "role": "system",
        "content": "You are a refund support agent for Acme. Resolve refunds within policy. Escalate edge cases. Speak in 1-2 sentences."
      }
    ]
  }
}

Common mistake. Interpolating Today is {date} or User ID: {uid} near the top of the system prompt. That breaks the prefix. Put dynamic content at the end, or pass it as the first user message.

Where Vapi abstracts. Vapi does not surface a cache toggle. The provider handles it. To verify it engaged, look at the provider’s response and check gen_ai.usage.cached_input_tokens on the LLM span emitted by traceAI. A healthy production Vapi agent should show 80%+ cache hit rate on the system prompt after warm-up.
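A quick way to guard the prefix is to keep the system prompt constant and push per-call values into a later message. The sketch below assumes you assemble messages yourself before handing them to the assistant config; `build_messages`, `prefix_fingerprint`, and `cached_fraction` are hypothetical helper names, and the token counts are illustrative.

```python
import hashlib

SYSTEM_PROMPT = (
    "You are a refund support agent for Acme. Resolve refunds within policy. "
    "Escalate edge cases. Speak in 1-2 sentences."
)


def build_messages(date: str, user_id: str, user_text: str) -> list:
    """Keep the system prompt byte-stable; dynamic values go in a user message."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context: today is {date}; user ID is {user_id}."},
        {"role": "user", "content": user_text},
    ]


def prefix_fingerprint(messages: list) -> str:
    """Hash of the system prompt bytes; it should not change between calls."""
    return hashlib.sha256(messages[0]["content"].encode("utf-8")).hexdigest()


def cached_fraction(cached_input_tokens: int, input_tokens: int) -> float:
    """Share of input tokens served from the provider cache."""
    return cached_input_tokens / input_tokens if input_tokens else 0.0


call_a = build_messages("2026-05-20", "u-1", "Where is my refund?")
call_b = build_messages("2026-05-21", "u-2", "I want to cancel.")
print(prefix_fingerprint(call_a) == prefix_fingerprint(call_b))  # True
print(cached_fraction(1200, 1500))  # 0.8
```

If the fingerprint changes between deploys or calls, something is interpolating into the system prompt and the cache will miss.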

4. Edge model routing

What it does. Route the LLM call to the provider region with the freshest prefix cache for your system prompt. The parent post covers the theory in §4.

Vapi config. Vapi routes through the provider’s regional endpoint when you pick the right model.provider. For multi-region agents, run separate Vapi assistants per region with the same prompt, and route at the dialer level (Twilio number, SIP origination) to the regional Assistant ID.

{
  "model": {
    "provider": "groq",
    "model": "llama-3.3-70b",
    "temperature": 0.2,
    "maxTokens": 200
  }
}

Common mistake. Picking a provider with a single region for a global voice agent. The transcontinental RTT adds 60-150ms to every turn.

Where Vapi abstracts. Vapi does not expose a generic region knob on the assistant. The control point is the provider you pick, the regional credentials you configure, and the custom-server URL hosted close to the user. For US, EU, and APAC traffic, the practical pattern is three assistants with provider/credential choices that pin each call to a regional edge. Capture gen_ai.voice.region and gen_ai.voice.edge_pop as traceAI span attributes.
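The per-region assistant pattern needs a lookup somewhere in the dialer layer. The sketch below is one way to do it under stated assumptions: the assistant IDs are placeholders, and mapping by E.164 country prefix is a simplification (a real deployment might route by SIP origin or carrier metadata instead).

```python
# Hypothetical assistant IDs and a deliberately small prefix table; both are
# placeholders, not real Vapi values.
REGIONAL_ASSISTANT_IDS = {
    "us": "assistant-id-us",
    "eu": "assistant-id-eu",
    "apac": "assistant-id-apac",
}
COUNTRY_PREFIX_TO_REGION = {
    "+1": "us",
    "+44": "eu",
    "+49": "eu",
    "+33": "eu",
    "+61": "apac",
    "+65": "apac",
    "+81": "apac",
}


def assistant_for_caller(e164_number: str, default_region: str = "us") -> str:
    """Pick the regional assistant for a caller number in E.164 format."""
    # Check longer prefixes first so a more specific code wins.
    for prefix in sorted(COUNTRY_PREFIX_TO_REGION, key=len, reverse=True):
        if e164_number.startswith(prefix):
            return REGIONAL_ASSISTANT_IDS[COUNTRY_PREFIX_TO_REGION[prefix]]
    return REGIONAL_ASSISTANT_IDS[default_region]


print(assistant_for_caller("+4915112345678"))  # assistant-id-eu
print(assistant_for_caller("+14155550100"))    # assistant-id-us
```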

5. Prefetch tool calls on high-confidence intent

What it does. When intent confidence on the first stable transcript partial is above 0.85, fire the tool call in parallel with the LLM call. If intent changes in later partials, cancel the prefetched call. The parent post covers the theory in §5.

Vapi config. Set async: true on the tool. Vapi fires the tool to server.url without blocking the agent’s spoken response. Combine that with the transcript event on the server URL to start work on the partial transcript before the user finishes the sentence.

{
  "tools": [
    {
      "type": "function",
      "async": true,
      "function": {
        "name": "lookup_order_status",
        "description": "Fetch order status by order ID. Call as soon as the user mentions an order number.",
        "parameters": {
          "type": "object",
          "properties": {"order_id": {"type": "string"}},
          "required": ["order_id"]
        }
      },
      "server": {
        "url": "https://api.acme.com/vapi/tools/order-status",
        "timeoutSeconds": 8
      },
      "messages": [
        {"type": "request-start", "content": "Looking that up now."},
        {"type": "request-response-delayed", "content": "Still pulling that, one moment."}
      ]
    }
  ]
}

Common mistake. Setting timeoutSeconds to the default 20 on a turn-critical tool. A 20-second hang ruins the conversation. Set it to 6-10 seconds and use the request-response-delayed message as the verbal stall.

Where Vapi abstracts. Vapi runs the request to your server.url and handles the spoken stall through messages. The cancellation logic on intent change runs in your server, not in Vapi. Track gen_ai.voice.tool_prefetched and gen_ai.voice.tool_call_cancelled as traceAI attributes. A prefetch success rate above 90% means the strategy is paying.
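Since cancellation lives on your server, the prefetch logic looks roughly like the asyncio sketch below. It is a sketch under assumptions: `Prefetcher`, the intent labels, and the stubbed `lookup_order_status` backend are hypothetical, and the intent classifier that produces the confidence score is out of scope.

```python
import asyncio

PREFETCH_THRESHOLD = 0.85  # confidence threshold quoted in the text above


async def lookup_order_status(order_id: str) -> dict:
    """Stand-in for the real backend call."""
    await asyncio.sleep(0.05)
    return {"order_id": order_id, "status": "shipped"}


class Prefetcher:
    """Start a tool call early on a confident partial; cancel if intent changes."""

    def __init__(self) -> None:
        self.task = None
        self.intent = None
        self.cancelled = 0

    def on_partial(self, intent: str, confidence: float, order_id=None) -> None:
        # Intent changed after we started work: drop the in-flight call.
        if self.task is not None and intent != self.intent:
            self.task.cancel()
            self.task = None
            self.cancelled += 1
        # Confident enough and nothing in flight: fire the prefetch.
        if (self.task is None and intent == "order_status"
                and confidence >= PREFETCH_THRESHOLD):
            self.intent = intent
            self.task = asyncio.create_task(lookup_order_status(order_id))

    async def result(self):
        return await self.task if self.task is not None else None


async def demo():
    p = Prefetcher()
    p.on_partial("order_status", 0.91, "A123")  # prefetch starts here
    p.on_partial("order_status", 0.95, "A123")  # same intent, task is reused
    return await p.result()


async def demo_cancel():
    p = Prefetcher()
    p.on_partial("order_status", 0.90, "A123")
    p.on_partial("refund", 0.90)  # intent changed, prefetch is cancelled
    return p.cancelled, await p.result()


print(asyncio.run(demo()))
print(asyncio.run(demo_cancel()))
```

Counting `cancelled` against total prefetches gives the wasted-call rate to compare with the latency win.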

6. Audio prebuffering

What it does. Open the TTS connection and synthesize the first audio frame before the user expects sound. The buffer absorbs network jitter so playback is smooth. The parent post covers the theory in §6.

Vapi config. Set firstMessageMode to assistant-speaks-first. Vapi pre-synthesizes the greeting and starts playback the moment the call connects. Tune silenceTimeoutSeconds to 20-30 to avoid hanging up on slow users.

{
  "firstMessage": "Hi, this is the Acme support line. How can I help?",
  "firstMessageMode": "assistant-speaks-first",
  "silenceTimeoutSeconds": 25,
  "maxDurationSeconds": 600,
  "backgroundDenoisingEnabled": true
}

Common mistake. Using firstMessageMode: "assistant-waits-for-user" on outbound calls. The agent waits 200-400ms before its first word, which on an outbound dial feels broken. Reserve assistant-waits-for-user for inbound calls where the user has a question ready.

Where Vapi abstracts. Vapi pre-synthesizes the first message and warms the TTS WebSocket before the call connects. The lower-level view is gen_ai.voice.latency.tts_prebuffer_ms and gen_ai.voice.tts_playback_underrun_count on the TTS span. Underrun count should be zero on a properly tuned buffer.

7. Async evaluation

What it does. Run scoring after the turn commits. Never block the critical path on an LLM judge. The parent post covers the theory in §7.

Vapi config. Configure server.url on the assistant. Vapi posts the end-of-call-report to that URL when the call ends. Your server forwards it to Future AGI ai-evaluation. None of this fires inline.

# vapi-python==1.4.0, ai-evaluation==0.6.1
from fi.evals import Evaluator
from fi.evals.templates import ConversationCoherence, ConversationResolution, AudioQuality

async def on_end_of_call_report(payload):
    msg = payload["message"]
    artifact = msg.get("artifact", {})
    evaluator = Evaluator()
    result = await evaluator.run_async(
        templates=[
            ConversationCoherence(),
            ConversationResolution(),
            AudioQuality(),
        ],
        inputs={
            "transcript": artifact.get("transcript", ""),
            "recording_url": artifact.get("recordingUrl"),
            "call_id": msg["call"]["id"],
        },
    )
    return result
# Output:
# {"conversation_coherence": 0.91, "conversation_resolution": 0.84, "audio_quality": 0.88}

Common mistake. Running an inline LLM-judge eval on every turn. A 200ms judge inside a 500ms turn breaks the budget. Reserve inline scoring for the routes where it is genuinely safety-critical, and use a classifier-grade model when you do.

Where Vapi abstracts. Vapi delivers the end-of-call-report wrapped under message, with message.artifact.transcript, message.artifact.recordingUrl, message.call.id, message.summary, message.cost, and message.durationSeconds. The 70+ pre-built eval templates in ai-evaluation cover the voice surface, including audio_transcription, audio_quality, conversation_coherence, conversation_resolution, and task_completion, plus faithfulness, tool-use accuracy, and groundedness. Unlimited custom evaluators are authored by an in-product agent that reads your traces. Apache 2.0.
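To keep the webhook itself fast, one option is to acknowledge immediately and score from a background queue. The sketch below uses only the standard library and a fake scorer; it assumes the payload carries a message.type of "end-of-call-report" alongside the fields listed above, and `handle_webhook` and `scoring_worker` are hypothetical names.

```python
import asyncio

SAMPLE_PAYLOAD = {  # shape follows the fields listed above; values are made up
    "message": {
        "type": "end-of-call-report",  # assumed discriminator field
        "call": {"id": "call-123"},
        "artifact": {"transcript": "User: hi\nAI: hello", "recordingUrl": None},
    }
}


def extract_eval_inputs(payload: dict):
    """Pull the eval inputs out of an end-of-call-report; ignore other events."""
    msg = payload.get("message", {})
    if msg.get("type") != "end-of-call-report":
        return None
    artifact = msg.get("artifact", {})
    return {
        "call_id": msg.get("call", {}).get("id"),
        "transcript": artifact.get("transcript", ""),
        "recording_url": artifact.get("recordingUrl"),
    }


async def handle_webhook(payload: dict, queue: asyncio.Queue) -> dict:
    """Enqueue and return at once; scoring never runs inside the request."""
    inputs = extract_eval_inputs(payload)
    if inputs is not None:
        queue.put_nowait(inputs)
    return {"ok": True}


async def scoring_worker(queue: asyncio.Queue, score, results: list) -> None:
    """Drain the queue in the background and record each score."""
    while True:
        inputs = await queue.get()
        try:
            results.append(await score(inputs))
        finally:
            queue.task_done()


async def demo():
    queue, results = asyncio.Queue(), []

    async def fake_score(inputs):  # stand-in for the real evaluator call
        return {"call_id": inputs["call_id"], "scored": True}

    worker = asyncio.create_task(scoring_worker(queue, fake_score, results))
    ack = await handle_webhook(SAMPLE_PAYLOAD, queue)
    await queue.join()  # only the demo waits; a real handler would not
    worker.cancel()
    return ack, results


print(asyncio.run(demo()))
```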

8. Parallel TTS warm-up

What it does. Keep a warm TTS connection open per session so the first audio frame arrives 50-150ms faster than a cold start. The parent post covers the theory in §8.

Vapi config. Vapi manages the TTS WebSocket per session. The connection is opened at call start and reused for every turn. You pick the voice and provider, Vapi handles the connection reuse.

{
  "voice": {
    "provider": "11labs",
    "voiceId": "21m00Tcm4TlvDq8ikWAM",
    "model": "eleven_turbo_v2_5",
    "speed": 1.0,
    "optimizeStreamingLatency": 3,
    "useSpeakerBoost": false
  }
}

Common mistake. Picking eleven_multilingual_v2 for an English-only agent. Multilingual models add 80-150ms TTFT versus Turbo. If the agent speaks one language, use the Turbo variant.

Where Vapi abstracts. Vapi-managed. To verify the WebSocket actually stayed warm, traceAI captures gen_ai.voice.tts_connection_state (cold, warm, reused) and gen_ai.voice.latency.tts_first_audio_ms. Warm should beat cold by the 50-150ms saving.

9. Smaller models for short turns

What it does. Route short conversational turns (“yes”, “no”, “can you repeat that”) to a smaller and faster model. Route complex tool turns to the larger model. The parent post covers the theory in §9.

Vapi config. Three paths. Static: set model.model to the smaller model and accept the quality tradeoff. Per-call: handle the assistant-request event on server.url for inbound calls without a fixed assistantId, and return the right assistant config (and model.model) based on caller signal. Per-turn: point model.provider at a custom-llm gateway and let the gateway pick the model.

{
  "model": {
    "provider": "openai",
    "model": "gpt-4o-mini",
    "temperature": 0.3,
    "maxTokens": 100,
    "messages": [
      {"role": "system", "content": "You are a refund support agent. Respond in 1-2 sentences."}
    ]
  }
}

Common mistake. Routing every turn to gpt-4o or claude-sonnet-4-5 “for quality”. TTFT on those models sits 200-400ms above the small-model alternatives. The mix matters more than the peak.

Where Vapi abstracts. Vapi does not run a router for you. For multi-model routing at the gateway level, the Agent Command Center covers 15+ providers through a single OpenAI-compatible endpoint with per-route policy. Hosted with RBAC, AWS Marketplace, multi-region, SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page. Capture gen_ai.request.model and gen_ai.voice.route_reason as traceAI attributes.
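Inside a custom-llm gateway, the per-turn decision can start as a simple heuristic. The sketch below is an assumption-heavy illustration: the six-word cutoff and the `needs_tool` flag are hypothetical, and the returned reason is meant to feed the gen_ai.voice.route_reason attribute.

```python
SMALL_MODEL = "gpt-4o-mini"
LARGE_MODEL = "claude-sonnet-4-5"
SHORT_TURN_MAX_WORDS = 6  # assumed cutoff; tune it against your own traffic


def pick_model(user_text: str, needs_tool: bool) -> tuple:
    """Return (model, route_reason) for a single turn."""
    if needs_tool:
        return LARGE_MODEL, "tool_turn"
    if len(user_text.split()) <= SHORT_TURN_MAX_WORDS:
        return SMALL_MODEL, "short_turn"
    return LARGE_MODEL, "long_turn"


print(pick_model("yes", needs_tool=False))
print(pick_model("can you repeat that", needs_tool=False))
print(pick_model("check order 4417", needs_tool=True))
```

Word count is a blunt signal; it is a reasonable first cut because it costs nothing to compute on the turn path.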

10. Semantic cache for common intents

What it does. Embed the user’s transcript on the live partial. Search a cache of recently-answered queries by embedding similarity. On a hit above threshold, return the cached audio answer in 30-80ms. The parent post covers the theory in §10.

Vapi config. Vapi does not ship a semantic cache. The gateway-level approach goes through the Agent Command Center: route Vapi’s model calls to the ACC OpenAI-compatible endpoint, configure semantic cache there, and Vapi gets cache hits without code changes.

# fi-instrumentation==0.4.2
# Point Vapi's model call at the Agent Command Center gateway
vapi_assistant = {
    "model": {
        "provider": "custom-llm",
        "url": "https://gateway.futureagi.com/v1",
        "model": "gpt-4o-mini",
        "headers": {"X-FAGI-Cache": "semantic", "X-FAGI-Cache-Threshold": "0.92"},
    },
}
# Output:
# Vapi calls now go through ACC. Cache hits return cached completion in 30-80ms.
# Semantic cache hit-rate of 15-30% is realistic on support agents.

Common mistake. Setting the similarity threshold too low (0.80 or below). Below 0.90, false positives ship the wrong answer. Start at 0.92, scope the cache by tenant ID, and tune downward only on traffic that proves safe.

Where Vapi abstracts. Vapi does not run the cache. The ACC layer does. Capture gen_ai.voice.semantic_cache_hit and gen_ai.voice.semantic_cache_similarity as traceAI attributes so you see hit rate over time.
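The lookup itself is a nearest-neighbor search with a threshold. The sketch below shows the idea with tenant scoping and the 0.92 starting threshold from above; it is not the Agent Command Center implementation, the `SemanticCache` class is hypothetical, and the three-dimensional vectors stand in for real embeddings.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Per-tenant cache that returns an answer only above the threshold."""

    def __init__(self, threshold: float = 0.92) -> None:
        self.threshold = threshold
        self.entries = {}  # tenant_id -> list of (embedding, answer)

    def put(self, tenant_id: str, embedding, answer: str) -> None:
        self.entries.setdefault(tenant_id, []).append((embedding, answer))

    def get(self, tenant_id: str, embedding):
        """Return (answer or None, best similarity) within one tenant."""
        best_score, best_answer = 0.0, None
        for stored, answer in self.entries.get(tenant_id, []):
            score = cosine(stored, embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_score >= self.threshold:
            return best_answer, best_score
        return None, best_score


cache = SemanticCache()
cache.put("acme", [1.0, 0.0, 0.0], "Refunds take 3-5 business days.")
print(cache.get("acme", [0.98, 0.1, 0.0]))   # close paraphrase: hit
print(cache.get("acme", [0.6, 0.8, 0.0]))    # different question: miss
print(cache.get("other", [1.0, 0.0, 0.0]))   # other tenant: miss
```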

11. KV-cache reuse across turns

What it does. Provider session caching can reduce repeated prefix processing on multi-turn calls. The model skips reprocessing the conversation history that is already cached. The parent post covers the theory in §11.

Vapi config. Anchor the system prompt at the top of model.messages and let Vapi append the conversation transcript turn by turn. Pick a provider that supports prompt or session caching (Anthropic, OpenAI on 4o-class models, Groq with prompt cache). The cache works as long as the prefix stays byte-stable.

{
  "model": {
    "provider": "anthropic",
    "model": "claude-sonnet-4-5",
    "messages": [
      {"role": "system", "content": "You are the Acme support agent. Stay concise. Escalate when uncertain."}
    ],
    "temperature": 0.2,
    "maxTokens": 150
  }
}

Common mistake. Sending the entire conversation history as a single concatenated string instead of message objects. Providers cache on the message structure, not raw string equivalence. Keep the structure consistent across turns.

Where Vapi abstracts. Vapi formats the messages array per provider. The lower-level view is gen_ai.usage.cached_input_tokens plus gen_ai.voice.latency.llm_ttft_ms on the LLM span. TTFT on turn N+1 should be 100-300ms lower than turn 1 because the conversation prefix is cached.
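If you run your own gateway or log the outgoing messages, a cheap check is that each turn's message list extends the previous one without rewriting it. The sketch below assumes messages are plain dicts; `is_append_only` is a hypothetical helper name.

```python
def shared_prefix_len(prev: list, curr: list) -> int:
    """Number of leading messages that are identical in both lists."""
    count = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        count += 1
    return count


def is_append_only(prev: list, curr: list) -> bool:
    """True when curr keeps every message of prev unchanged and only appends."""
    return len(curr) >= len(prev) and shared_prefix_len(prev, curr) == len(prev)


system = {"role": "system", "content": "You are the Acme support agent."}
turn_1 = [system, {"role": "user", "content": "Where is my order?"}]
turn_2 = turn_1 + [
    {"role": "assistant", "content": "Can I get the order number?"},
    {"role": "user", "content": "It is 4417."},
]
# A rewritten system prompt breaks the prefix even if the history matches.
turn_2_bad = [{"role": "system", "content": "You are the Acme agent (v2)."}] + turn_2[1:]

print(is_append_only(turn_1, turn_2))      # True
print(is_append_only(turn_1, turn_2_bad))  # False
```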

12. Regional routing for STT and TTS

What it does. Pin STT and TTS to the closest regional endpoint of the provider. Many voice providers route based on the gateway’s region by default. The parent post covers the theory in §12.

Vapi config. Region routing is per-component on Vapi. Deepgram, ElevenLabs, and Cartesia each pick their closest edge POP automatically based on the origin region of the call. Where the provider exposes regional credentials or custom endpoints, configure them and pin a server.url close to the user.

{
  "transcriber": {
    "provider": "deepgram",
    "model": "nova-3",
    "language": "en"
  },
  "voice": {
    "provider": "cartesia",
    "voiceId": "248be419-c632-4f23-adf1-5324ed7dbf1d",
    "model": "sonic-3",
    "speed": 1.0
  }
}

Common mistake. Running one assistant config and a single server.url for users across continents. The transcontinental RTT adds 60-120ms to every turn. The fix is per-region assistants with regional server.url endpoints and provider credentials that route to the nearest edge.

Where Vapi abstracts. Provider-managed at the edge. Capture gen_ai.voice.stt_region and gen_ai.voice.tts_region as traceAI span attributes so misrouted-region traffic surfaces on the p95 heatmap.
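Once those region attributes are on the spans, flagging misrouted calls is a small filter. The sketch below assumes spans are exported as dicts and that your dialer stamps a `caller_region` value on each one; that attribute and the sample records are hypothetical.

```python
def misrouted_calls(spans: list) -> list:
    """Return call IDs whose STT or TTS region differs from the caller's region."""
    flagged = []
    for span in spans:
        expected = span["caller_region"]  # hypothetical attribute from your dialer
        stt_region = span.get("gen_ai.voice.stt_region")
        tts_region = span.get("gen_ai.voice.tts_region")
        if stt_region != expected or tts_region != expected:
            flagged.append(span["call_id"])
    return flagged


sample_spans = [  # made-up records for illustration
    {"call_id": "c1", "caller_region": "eu",
     "gen_ai.voice.stt_region": "eu", "gen_ai.voice.tts_region": "eu"},
    {"call_id": "c2", "caller_region": "eu",
     "gen_ai.voice.stt_region": "us", "gen_ai.voice.tts_region": "eu"},
    {"call_id": "c3", "caller_region": "apac",
     "gen_ai.voice.stt_region": "apac", "gen_ai.voice.tts_region": "us"},
]
print(misrouted_calls(sample_spans))  # ['c2', 'c3']
```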

Bonus: barge-in latency on Vapi

Barge-in is the case the 12 techniques above all forget. When the user interrupts the agent mid-sentence, three things happen: the in-flight TTS has to flush, the in-flight LLM has to cancel, and STT has to start fresh on the new audio. Naive implementations pay 200-400ms on the barge-in turn.

Vapi gives you two knobs: startSpeakingPlan and stopSpeakingPlan. Tune both.

{
  "startSpeakingPlan": {
    "waitSeconds": 0.4,
    "smartEndpointingPlan": {"provider": "vapi"}
  },
  "stopSpeakingPlan": {
    "numWords": 0,
    "voiceSeconds": 0.2,
    "backoffSeconds": 1.0
  }
}

With numWords: 0, Vapi runs the fastest VAD-driven interruption, and voiceSeconds: 0.2 is the amount of detected voice that counts as a barge-in. backoffSeconds: 1.0 gives a one-second cooldown before the agent can speak again. If you need transcription-based interruption (more accurate, slower), set numWords to 2 or 3 so that two or three confirmed words trigger the cut. The transcription path adds 100-200ms versus VAD.

Capture gen_ai.voice.barge_in_count, gen_ai.voice.latency.barge_in_flush_ms, and gen_ai.voice.latency.barge_in_recovery_ms as traceAI span attributes. A healthy Vapi agent has barge-in rates of 5-15% on conversational turns and 1-3% on scripted turns. Higher rates indicate the agent is talking too long or filling silence with unhelpful holds.
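Those healthy bands translate into a simple check over aggregated counts. The sketch below takes the bands from the paragraph above; the function names and the "low" label for rates under the band are assumptions.

```python
# Healthy barge-in bands from the text above, as fractions of turns.
HEALTHY_BANDS = {
    "conversational": (0.05, 0.15),
    "scripted": (0.01, 0.03),
}


def barge_in_rate(barge_in_count: int, turn_count: int) -> float:
    """Fraction of turns on which the user interrupted the agent."""
    return barge_in_count / turn_count if turn_count else 0.0


def assess(rate: float, turn_kind: str) -> str:
    """Classify a rate as 'low', 'healthy', or 'high' for the given turn kind."""
    low, high = HEALTHY_BANDS[turn_kind]
    if rate > high:
        return "high"  # agent is likely talking too long or stalling
    if rate < low:
        return "low"
    return "healthy"


print(assess(barge_in_rate(12, 100), "conversational"))  # healthy
print(assess(barge_in_rate(25, 100), "conversational"))  # high
print(assess(barge_in_rate(2, 100), "scripted"))         # healthy
```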

Stacking the techniques: a Vapi config that hits sub-500ms

The assistant config below stacks the assistant-level knobs from the sections above; tool definitions and gateway routing are not included here. POST it to https://api.vapi.ai/assistant as a starting point for sub-500ms p95 on short turns.

{
  "name": "acme-support-prod-us",
  "transcriber": {
    "provider": "deepgram",
    "model": "flux-general-en",
    "language": "en",
    "eotThreshold": 0.7,
    "eotTimeoutMs": 600,
    "keyterm": ["refund", "balance", "appointment"]
  },
  "model": {
    "provider": "anthropic",
    "model": "claude-sonnet-4-5",
    "temperature": 0.3,
    "maxTokens": 150,
    "messages": [
      {"role": "system", "content": "You are an Acme refund support agent. Respond in 1-2 sentences. Escalate when uncertain."}
    ]
  },
  "voice": {
    "provider": "11labs",
    "voiceId": "21m00Tcm4TlvDq8ikWAM",
    "model": "eleven_turbo_v2_5",
    "speed": 1.0,
    "optimizeStreamingLatency": 3
  },
  "firstMessage": "Hi, Acme support. How can I help?",
  "firstMessageMode": "assistant-speaks-first",
  "silenceTimeoutSeconds": 25,
  "maxDurationSeconds": 600,
  "backgroundDenoisingEnabled": true,
  "hipaaEnabled": false,
  "startSpeakingPlan": {"waitSeconds": 0.4, "smartEndpointingPlan": {"provider": "vapi"}},
  "stopSpeakingPlan": {"numWords": 0, "voiceSeconds": 0.2, "backoffSeconds": 1.0},
  "server": {
    "url": "https://api.acme.com/vapi/events",
    "timeoutSeconds": 10
  }
}

Realistic p95 budget for this config on short turns, measured on traceAI spans:

Stage	Budget	Vapi knob
STT end-of-turn	120ms	`transcriber.eotTimeoutMs: 600` (Flux)
LLM TTFT	220ms	prefix cache via stable `model.messages`
Tool prefetch	0ms (parallel)	`tools[].async: true`
TTS first-audio	130ms	`voice.model: "eleven_turbo_v2_5"`
Network RTT	50ms	regional `server.url` + provider edge
Total p95	470-550ms	stacked

The exact numbers depend on the provider, the prompt length, and the region, but the pattern holds: stacking these knobs drops a 1200ms default Vapi turn to around 500ms on short turns and 700-800ms on tool-heavy turns.
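The budget table is plain addition, which makes it easy to check and to compare against measured spans. The sketch below uses the stage budgets from the table; the stage keys and the measured values are hypothetical.

```python
TARGET_P95_MS = 500

# Stage budgets copied from the table above, in milliseconds.
STAGE_BUDGET_MS = {
    "stt_end_of_turn": 120,
    "llm_ttft": 220,
    "tool_prefetch": 0,   # runs in parallel, so it adds nothing to the turn
    "tts_first_audio": 130,
    "network_rtt": 50,
}

total_ms = sum(STAGE_BUDGET_MS.values())


def over_budget(measured_ms: dict) -> dict:
    """Return the overshoot in ms for each stage that exceeds its budget."""
    return {
        stage: measured_ms[stage] - budget
        for stage, budget in STAGE_BUDGET_MS.items()
        if measured_ms.get(stage, 0) > budget
    }


print(total_ms)                  # 520
print(total_ms - TARGET_P95_MS)  # 20
print(over_budget({"llm_ttft": 300, "tts_first_audio": 120}))  # {'llm_ttft': 80}
```

The stage budgets sum to 520ms, which sits inside the 470-550ms total and 20ms over a strict 500ms target, so any single stage that overshoots is worth chasing first.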

Future AGI for Vapi monitoring

Three ways Future AGI plugs into Vapi. Pick the one that matches the team’s stack.

Path one: native voice observability. Create a Future AGI Agent Definition. Paste in the Vapi API key plus Assistant ID. Future AGI captures every call through the Vapi API. No SDK. End-of-call payloads, transcripts, recordings, and per-stage timing land in Observe automatically. This is the lowest-friction path and the right default for Vapi-only teams.

Path two: traceAI for cross-provider visibility. Install fi-instrumentation, call register() with ProjectType.OBSERVE, and the SDK emits OpenInference spans for STT, LLM, TTS, and tool calls. Real gen_ai.voice.* and gen_ai.evaluation.* namespaces. 30+ documented integrations across Python and TypeScript plus dedicated traceAI-pipecat and traceai-livekit packages. Apache 2.0. Use this when the team runs more than one voice runtime and wants one trace model.

Path three: server URL webhook to ai-evaluation. Configure server.url on the assistant. Forward end-of-call-report payloads to Future AGI ai-evaluation. 70+ built-in eval templates including audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion, plus faithfulness, tool-use accuracy, and groundedness. Unlimited custom evaluators are authored by an in-product agent that reads your code and conversation traces. The programmatic eval API lets you configure and re-run evals as traces accumulate. Async. Never on the turn budget.

For inline safety, Future AGI Protect runs sub-100ms per arXiv 2510.13351. Built on Gemma 3n with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance). Multi-modal across text, image, and audio.

# fi-protect==0.3.0
from fi.protect import Protector

protector = Protector()
result = protector.protect(
    inputs="I want a full refund for my last 6 orders",
    protect_rules=["content_moderation", "data_privacy_compliance"],
    action="block",
    reason=True,
    timeout=25000,
)
# Output:
# {"passed": True, "reason": "no violation detected", "latency_ms": 78}

For audio-aware evals on the recorded call, pass the recording URL as MLLMAudio:

# ai-evaluation==0.6.1
from fi.evals import Evaluator
from fi.evals.templates import AudioQuality
from fi.evals.types import MLLMAudio

# Same Evaluator class as in the end-of-call-report handler above.
evaluator = Evaluator()

result = evaluator.run(
    template=AudioQuality(),
    inputs={"audio": MLLMAudio(url="https://storage.vapi.ai/recordings/abc123.mp3", local=False)},
)
# Output:
# {"audio_quality": 0.87, "rationale": "Low background noise, consistent volume"}

When evaluation scores plateau, agent-opt closes the loop. Six prompt optimizers ship: Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, and PromptWizard. Point an optimizer at the Vapi system prompt, score against the dataset of past calls, and ship the candidate that scores best on conversation_resolution plus task_completion.

Error Feed is the clustering layer over traces and evals. Zero-config auto-clustering turns latency outliers and failure patterns into named issues with auto-written root cause, quick fix, and long-term recommendation. Instead of 10,000 raw Vapi calls, you see 12 named issues with a fix path.

Sources and references

Vapi assistant create reference: docs.vapi.ai/api-reference/assistants/create
Vapi tools API: docs.vapi.ai/tools
Vapi server URL events: docs.vapi.ai/server-url
Future AGI Protect benchmarks: arXiv 2510.13351
GEPA optimizer: arXiv 2507.19457
Meta-Prompt: arXiv 2505.09666
Future AGI trust and compliance: futureagi.com/trust
OpenInference span specification: github.com/Arize-ai/openinference

Frequently asked questions

What's a realistic p95 latency target for a Vapi voice agent in 2026?

Sub-500ms p95 is the target for short conversational turns and 700-800ms p95 for tool-heavy turns. The way Vapi's managed runtime lays out STT, LLM, and TTS, a default assistant config sits around 900-1200ms p95 turn latency. To pull it under 500ms, configure the Deepgram Flux transcriber with eotThreshold around 0.7 and eotTimeoutMs in the 500-700ms range, pick a streaming-capable model provider, attach a low-latency voice on ElevenLabs Turbo or Cartesia Sonic, and keep the system prompt byte-stable so provider-side prefix caching engages. Track p50, p95, and p99 separately. p95 is the number that determines whether users feel the agent is fast.
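Tracking the three percentiles separately takes only a few lines once turn latencies are exported. The sketch below uses the nearest-rank method, which is one of several percentile definitions; the sample latencies are synthetic.

```python
import math


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


turn_latencies_ms = list(range(1, 101))  # synthetic: 1ms to 100ms
for pct in (50, 95, 99):
    print(pct, percentile(turn_latencies_ms, pct))
```

On this synthetic data the three values are 50, 95, and 99; on real traffic the gap between p50 and p95 is usually where the tuning work shows up.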

Does Vapi support prompt prefix caching out of the box?

Vapi passes the model call through to the upstream provider, so prompt caching works whenever the underlying provider supports it. With model.provider set to anthropic, anchor the system prompt at the top and rely on Anthropic's cache_control behavior. With openai, automatic prompt caching engages on prompts above 1024 tokens when the prefix is byte-stable. The mistake teams make is interpolating per-turn timestamps or user IDs near the top of the system prompt. That defeats caching. Put dynamic content near the end. Monitor gen_ai.usage.cached_input_tokens on the traceAI LLM span to confirm hit rate.

How do I prefetch tool calls in Vapi without breaking the conversation?

Vapi tools accept an async: true flag. When async is true, Vapi fires the tool call to your server URL without blocking the agent's spoken response. You return acknowledgement text immediately, then complete the actual work in the background. For prefetch-on-high-confidence-intent, classify intent on the live transcript event from the server URL webhook and start the work before the user finishes speaking. If intent changes, cancel the in-flight request server-side. Net win is 200-400ms on prefetched turns. The cost is roughly 2-5% wasted calls per intent change.

Can I route different Vapi turns to different LLMs based on intent?

Yes. The model field on a Vapi assistant accepts a provider and model, and the practical pattern is to pick the route at call setup. For inbound calls without a fixed assistantId, the assistant-request event on the server URL returns the assistant config to use, so you can route by caller number, customer tier, or any signal available at setup. For per-turn switching, point model.provider at a custom-llm gateway (Agent Command Center) and route inside the gateway. Short conversational turns hit gpt-4o-mini or claude haiku for 100-300ms lower TTFT. Tool-heavy turns stay on the larger model.

What does Vapi abstract away that I'd otherwise have to build?

Vapi handles the streaming pipe between STT and the LLM, the streaming pipe between the LLM and TTS, the WebSocket connection management on the voice provider, and the barge-in plumbing through startSpeakingPlan and stopSpeakingPlan. Where teams need lower-level control, traceAI emits OpenInference spans for STT, LLM, TTS, and tool calls so you see per-stage timing inside the managed runtime. The native voice observability path on Future AGI ingests Vapi calls through an Agent Definition tied to a Vapi API key plus Assistant ID. No SDK required for ingestion.

How do I run evaluation on Vapi calls without adding latency to the turn?

Configure server.url on the assistant. Vapi posts an end-of-call-report when the call ends. Forward that payload to Future AGI ai-evaluation and run the eval async, after the call commits. The 70+ pre-built eval templates include audio_transcription, audio_quality, conversation_coherence, conversation_resolution, and task_completion. None of that fires inline, so none of it touches the turn budget. For inline safety, run Future AGI Protect on the user transcript at the start of each turn. Protect runs sub-100ms per arXiv 2510.13351 and fits inside the orchestration slice.

Can I pin Vapi STT and TTS to a specific region?

Region routing on Vapi is per-component, not a single assistant-region knob. Deepgram, ElevenLabs, and Cartesia each pick their nearest edge POP automatically based on call origin. Where the provider exposes regional credentials or a custom endpoint, configure them. Host the server.url close to the user too so call events do not round-trip across continents. For multi-region agents, run separate assistants per region with regional server URLs. Capture gen_ai.voice.stt_region and gen_ai.voice.tts_region as traceAI attributes so misrouted traffic surfaces on the p95 dashboard.

How does Future AGI plug into a Vapi pipeline?

Three paths. First, native voice observability: create a Future AGI Agent Definition, paste in the Vapi API key and Assistant ID, and Future AGI captures call events through the Vapi API. No SDK needed. Second, traceAI: install fi-instrumentation, call register() with the project type, and the SDK emits gen_ai.evaluation.* spans plus gen_ai.voice.latency.* attributes for STT, LLM, TTS, and tool calls. Third, server URL webhook: configure server.url on the assistant and forward end-of-call-report payloads to ai-evaluation for async scoring across the 70+ templates.

