How to Optimize Retell Voice Agent Latency in 2026: 12 Techniques with Real Config

Optimize Retell AI voice agent latency to sub-500ms p95 in 2026. 12 techniques with real Retell config: STT, response_engine, backchannel, async eval.

March 1, 2026

Updated May 20, 2026

To optimize Retell voice agent latency, tune the knobs behind 12 techniques across the agent and the underlying retell-llm or conversation-flow resource. Keep stt_mode on fast, or switch to custom with a tuned endpointing_ms. Pick a streaming voice_model like eleven_turbo_v2 or sonic-3. Anchor a byte-stable general_prompt on the retell-llm resource so provider prefix caching engages. Point response_engine.type at conversation-flow so common intents resolve in lighter subagent nodes. Tune interruption_sensitivity and backchannel_frequency for fast turn-taking. Set model_high_priority on the LLM resource for consistent TTFT. Fire custom function tools from flow edges to prefetch lookups. Forward call_analyzed from webhook_url to Future AGI ai-evaluation for async scoring. Stacked, these knobs drop a 1200ms default Retell turn into the sub-500ms zone at p95 on short turns and 700-800ms on tool-heavy turns.

TL;DR pick by Retell config knob

Technique	Retell config knob	Expected p95 win
Streaming STT first-partial	`stt_mode: "fast"` or `custom_stt_config.endpointing_ms`	200-400ms
Partial LLM into TTS	Managed by Retell runtime (response_engine streaming)	200-500ms
Prompt prefix caching	`general_prompt` byte-stable, `model` selection	200-400ms
Edge model routing	`model` per state node in conversation-flow	60-150ms
Prefetch tool calls	Custom function tool on state edge, `webhook_url`	200-400ms
Audio prebuffering	`begin_message`, `begin_message_delay_ms`, `ambient_sound`	80-200ms
Async evaluation	`webhook_url` + `call_analyzed` event to Future AGI	100-300ms
TTS warm-up	Retell-managed connection reuse, `fallback_voice_ids`	50-150ms
Smaller models for short turns	`model` per state in conversation-flow	100-300ms
Semantic cache for common intents	Conversation-flow states or Agent Command Center gateway	400-800ms on hits
KV-cache reuse across turns	Stable `general_prompt` + state framing	100-300ms
Regional routing	Provider-managed via voice_model + `custom_stt_config`	30-80ms

How to read this guide

The parent methodology hub covers the 12 techniques as a framework. This post maps each one to the Retell agent config. The pattern is consistent. Retell handles the streaming wiring out of the box. Where teams need control over what gets cached, prefetched, routed, or measured, you reach into the knobs below. Where Retell abstracts a technique away, this post says so and shows the lower-level equivalent through traceAI.

Spans matter. Every “we shipped X technique” claim is hand-wavy without per-stage timing. traceAI emits OpenInference spans for STT, LLM, TTS, and tool calls in one trace per Retell call. 30+ documented integrations across Python and TypeScript cover the voice stack, including dedicated traceAI-pipecat and traceai-livekit packages plus OpenAI Realtime, Anthropic, LiteLLM, and Vertex AI. Apache 2.0. For Retell dashboards, no SDK is needed: native voice observability ingests via Retell API key plus Agent ID through a Future AGI Agent Definition.

1. Streaming STT with first-partial routing

What it does. Switch from batch STT to streaming STT that emits partial transcripts every 100-200ms while the user is still speaking. Feed the latest partial to the LLM the moment intent confidence crosses 0.85. The parent post covers the theory in §1.

Retell config. Set stt_mode to fast to bias the transcriber toward latency over accuracy. For more control, set stt_mode to custom and pass custom_stt_config with the provider (azure, deepgram, or soniox) plus an endpointing_ms tuned to your traffic. Push boosted_keywords for the proper nouns the model will mishear.

{
  "stt_mode": "custom",
  "custom_stt_config": {
    "provider": "deepgram",
    "endpointing_ms": 250
  },
  "boosted_keywords": ["refund", "balance", "appointment"],
  "vocab_specialization": "general",
  "denoising_mode": "noise-cancellation"
}

Common mistake. Switching stt_mode to accurate on a conversational agent for marginal accuracy. That choice trades 100-300ms of extra confirmation latency per turn. The Retell default fast (or custom with endpointing_ms around 200-300ms) is the right starting point for short conversational turns.

Where Retell abstracts. Retell auto-feeds stable partials into the response engine. If you want to see when the first partial actually fires, traceAI captures gen_ai.voice.latency.stt_first_partial_ms and gen_ai.voice.latency.stt_final_ms on the STT span. The gap between those two is the parallel window the LLM gets for free.
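To make that window concrete, here is a minimal Python sketch that computes it from the two attributes named above. The span is represented as a plain dict, parallel_window_ms is a hypothetical helper, and the millisecond values are illustrative rather than measured.

```python
def parallel_window_ms(stt_span_attributes: dict) -> float:
    """Return the time between the first partial and the final transcript."""
    first = stt_span_attributes["gen_ai.voice.latency.stt_first_partial_ms"]
    final = stt_span_attributes["gen_ai.voice.latency.stt_final_ms"]
    if final < first:
        raise ValueError("final transcript cannot precede the first partial")
    return final - first


# Hypothetical attribute values for one turn, for illustration only.
example_span = {
    "gen_ai.voice.latency.stt_first_partial_ms": 180,
    "gen_ai.voice.latency.stt_final_ms": 620,
}
print(parallel_window_ms(example_span))  # 440
```

A window that shrinks toward zero across calls suggests the partials are arriving too late to give the LLM a head start.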

2. Partial LLM tokens piped into TTS

What it does. Stream LLM tokens. The moment the first sentence boundary lands, fire that sentence to TTS. The user hears the first word before the LLM finishes the response. The parent post covers the theory in §2.

Retell config. Managed by the Retell runtime. When response_engine.type is retell-llm or conversation-flow, Retell pipes tokens to TTS at sentence boundaries with no extra config. The only verification path is the per-stage timing on traceAI spans.

# fi-instrumentation==0.4.2
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType

tracer_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="retell-prod",
    set_global_tracer_provider=True,
)
# traceAI now emits gen_ai.evaluation.* spans plus gen_ai.voice.latency.* attributes
# Output:
# Registered project: retell-prod
# Tracer provider attached. Voice attrs: gen_ai.voice.latency.ttfb_ms, gen_ai.voice.latency.tts_first_audio_ms

Common mistake. Setting response_engine.type to custom-llm and pointing it at a webhook that buffers the full LLM response before returning. That doubles turn latency because TTS waits on the complete answer. Stream tokens from your custom LLM webhook the same way OpenAI streams them.

Where Retell abstracts. Fully managed for retell-llm and conversation-flow response engines. The lower-level view is the gap between gen_ai.voice.latency.llm_ttft_ms, gen_ai.voice.latency.llm_first_sentence_ms, and gen_ai.voice.latency.tts_first_audio_ms on traceAI spans. A healthy Retell turn shows TTS starting within 80-150ms of the first sentence boundary.
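For teams on custom-llm who have to reproduce this behavior themselves, sentence-boundary chunking might look like the sketch below. It is an illustration under a simple assumption (punctuation followed by whitespace ends a sentence), not Retell's internal implementation, and sentences_from_tokens is a hypothetical helper.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace. Requiring the
# whitespace avoids splitting "3.50" when it arrives as "3." then "50".
_SENTENCE_END = re.compile(r"[.!?]\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as streamed tokens close them."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = _SENTENCE_END.search(buffer)
        while match is not None:
            sentence = buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence  # hand this to TTS immediately
            match = _SENTENCE_END.search(buffer)
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


streamed = ["Sure", ",", " I", " can", " help", ".", " What", " is",
            " your", " order", " number", "?"]
print(list(sentences_from_tokens(streamed)))
# ['Sure, I can help.', 'What is your order number?']
```

The first sentence is released one token after its closing period, which is the trade made here to avoid false splits inside numbers.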

3. LLM prompt prefix caching

What it does. Anchor the system prompt at the top of the assembled prompt. Keep it byte-identical across turns so the provider’s cache lookup hits. The parent post covers the theory in §3.

Retell config. On the retell-llm resource (created separately via create-retell-llm and referenced from the agent by llm_id), general_prompt is appended to every system prompt and stays stable as the agent walks through states. Place the durable instructions there. Push any dynamic content into default_dynamic_variables placeholders consumed late in the prompt so the prefix bytes do not shift turn to turn.

// POST /create-retell-llm
{
  "model": "claude-4.5-sonnet",
  "model_temperature": 0.2,
  "model_high_priority": true,
  "general_prompt": "You are an Acme refund support agent. Resolve refunds within policy. Escalate edge cases. Speak in 1-2 sentences.",
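  "default_dynamic_variables": {
    "customer_tier": "standard",
    "today": "unknown"
  }
}
// default_dynamic_variables holds fallback values (hypothetical here); reference them as {{customer_tier}} and {{today}} from a state prompt
// Then attach via response_engine: {type: "retell-llm", llm_id: "<id>"} on create-agent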

Common mistake. Interpolating today’s date or the user ID directly into general_prompt. That breaks the prefix. Push the variable into default_dynamic_variables and reference it from a state prompt near the end of the assembly so the front of the prompt stays cache-stable.

Where Retell abstracts. Retell does not surface a cache toggle. The provider behind model handles it. Confirm by checking gen_ai.usage.cached_input_tokens on the LLM span in traceAI. A healthy production Retell agent should show a cache hit rate of 80% or higher on the system prompt after warm-up.
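The effect of where a dynamic value lands can be checked before anything ships. The sketch below compares two ways of assembling a prompt and counts how many leading bytes survive from one day to the next. The assembly functions are hypothetical stand-ins for illustration, not how Retell builds the prompt internally.

```python
GENERAL_PROMPT = "You are an Acme refund support agent. Speak in 1-2 sentences."


def shared_prefix_bytes(a: str, b: str) -> int:
    """Count the leading bytes two prompts have in common."""
    count = 0
    for x, y in zip(a.encode("utf-8"), b.encode("utf-8")):
        if x != y:
            break
        count += 1
    return count


def dynamic_last(today: str) -> str:
    # Durable instructions first, dynamic value at the end.
    return f"{GENERAL_PROMPT}\n\nToday is {today}."


def dynamic_first(today: str) -> str:
    # The common mistake: the date shifts every byte that follows it.
    return f"Today is {today}. {GENERAL_PROMPT}"


stable_prefix = shared_prefix_bytes(dynamic_last("2026-05-19"), dynamic_last("2026-05-20"))
broken_prefix = shared_prefix_bytes(dynamic_first("2026-05-19"), dynamic_first("2026-05-20"))
print(stable_prefix >= len(GENERAL_PROMPT), broken_prefix)  # True 17
```

With the date up front, only the 17 bytes before the changed digit stay identical, so the provider has almost nothing to reuse.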

4. Edge model routing

What it does. Route short conversational turns to a smaller, faster model. Route complex tool turns to the larger model. The parent post covers the theory in §4.

Retell config. Switch from a single-prompt retell-llm to a conversation-flow response engine (created via create-conversation-flow). Each subagent node carries its own model selection, so the greeting node and acknowledgment nodes pick gpt-4.1-mini while the lookup nodes pick gpt-4.1 or claude-4.5-sonnet. The Retell pricing model rewards this directly because each node is billed on call time multiplied by that node's model price.

// POST /create-conversation-flow (excerpt)
{
  "model_choice": {"model": "gpt-4.1-mini", "high_priority": true},
  "nodes": [
    {"id": "greet", "type": "conversation",
     "instruction": {"type": "prompt", "text": "Greet the caller in one sentence."},
     "model_choice": {"model": "gpt-4.1-mini", "high_priority": true}},
    {"id": "lookup", "type": "conversation",
     "instruction": {"type": "prompt", "text": "Resolve the refund using order context. Call get_order_status."},
     "model_choice": {"model": "claude-4.5-sonnet", "high_priority": true}},
    {"id": "wrap_up", "type": "conversation",
     "instruction": {"type": "prompt", "text": "Confirm next steps in one sentence and end the call."},
     "model_choice": {"model": "gpt-4.1-mini", "high_priority": true}}
  ]
}

Common mistake. Picking the largest model on every node “for quality”. TTFT on claude-4.5-sonnet or gpt-4.1 sits 100-300ms above gpt-4.1-mini on the same prompt. The mix matters more than the peak.

Where Retell abstracts. Retell evaluates the active node, calls the matching model, and rolls cost into the call. For lower-level multi-model routing across providers, the Agent Command Center covers 15+ providers through a single OpenAI-compatible endpoint with per-route policy. It is hosted with RBAC, AWS Marketplace availability, and multi-region deployment, and the trust page lists SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001. Capture gen_ai.request.model and gen_ai.voice.route_reason as traceAI attributes.

5. Prefetch tool calls on high-confidence intent

What it does. When intent confidence on the first stable transcript partial is above 0.85, fire the tool call in parallel with the LLM call. If intent changes in later partials, cancel the prefetched call. The parent post covers the theory in §5.

Retell config. Two-step. Model the high-confidence intent as a dedicated state in a conversation-flow agent. Attach a custom function node next to a conversation node. When the edge condition matches (the user mentions an order number, an appointment date, a refund), Retell fires the custom function to your tool URL before the next conversation node commits the verbal response. Return the tool response shape Retell expects so the next node has the answer ready.

// POST /create-conversation-flow (excerpt)
{
  "nodes": [
    {
      "id": "collect_order_id",
      "type": "conversation",
      "instruction": {"type": "prompt", "text": "If the user mentions an order number, capture it as order_id."},
      "edges": [
        {"transition_condition": "user shared an order id",
         "destination_node_id": "lookup_order_status"}
      ]
    },
    {
      "id": "lookup_order_status",
      "type": "function",
      "tool_id": "tool_get_order_status",
      "tool_type": "custom",
      "edges": [
        {"transition_condition": "default",
         "destination_node_id": "share_status"}
      ]
    }
  ]
}
// Custom function tool (created separately) carries url + timeout_ms + response variables

Common mistake. Setting the tool timeout_ms to 20 seconds on a turn-critical lookup. A 20-second hang ruins the conversation. Keep it at 6-10 seconds and let the next state read the response or fall back to a verbal stall.

Where Retell abstracts. Retell runs the function call, parses response_variables, and feeds them into the next state’s prompt. Cancellation on stale intent happens in your server. Track gen_ai.voice.tool_prefetched and gen_ai.voice.tool_call_cancelled as traceAI attributes. A prefetch success rate above 90% means the strategy is paying off.
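Since cancellation on stale intent lives in your own server, here is a minimal asyncio sketch of that side. Prefetcher, fetch_order_status, and the intent labels are hypothetical names, and the sleep stands in for a real lookup. The 0.85 threshold is the one used in this section.

```python
import asyncio
from typing import Optional

INTENT_THRESHOLD = 0.85


async def fetch_order_status(order_id: str) -> dict:
    await asyncio.sleep(0.05)  # stands in for a real HTTP lookup
    return {"order_id": order_id, "status": "shipped"}


class Prefetcher:
    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None
        self._intent: Optional[str] = None

    def on_partial(self, intent: str, confidence: float, order_id: str) -> None:
        if confidence < INTENT_THRESHOLD:
            return  # not confident enough to act on this partial
        if intent != self._intent and self._task is not None:
            self._task.cancel()  # intent changed: drop the stale lookup
            self._task = None
        self._intent = intent
        if intent == "order_status" and self._task is None:
            self._task = asyncio.create_task(fetch_order_status(order_id))

    async def result(self) -> Optional[dict]:
        return await self._task if self._task is not None else None


async def demo():
    kept = Prefetcher()
    kept.on_partial("order_status", 0.91, "A123")
    dropped = Prefetcher()
    dropped.on_partial("order_status", 0.91, "A123")
    dropped.on_partial("cancel_order", 0.95, "A123")  # later partial changes intent
    return await kept.result(), await dropped.result()


kept_result, dropped_result = asyncio.run(demo())
print(kept_result, dropped_result)
# {'order_id': 'A123', 'status': 'shipped'} None
```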

6. Audio prebuffering

What it does. Open the TTS path and synthesize the first audio frame before the user expects sound. The buffer absorbs network jitter so playback is smooth. The parent post covers the theory in §6.

Retell config. Set begin_message on the retell-llm resource (or the start node of a conversation flow) so Retell pre-synthesizes the greeting. Tune begin_message_delay_ms on the agent between 0 and 200ms depending on whether you want immediate speech (inbound) or a brief pause (outbound). Add ambient_sound to mask hand-off micro-gaps and enable speech normalization via the handbook_config preset so numbers and currency render cleanly without re-synthesis stalls.

// On create-retell-llm (or the start node of a conversation flow)
{
  "begin_message": "Hi, Acme support. How can I help?"
}

// On create-agent
{
  "begin_message_delay_ms": 0,
  "ambient_sound": "call-center",
  "ambient_sound_volume": 0.4,
  "handbook_config": {
    "speech_normalization": true,
    "natural_filler_words": true,
    "default_personality": true
  }
}

Common mistake. Leaving begin_message empty on an outbound dial. Retell waits for the user to speak first, which on a cold dial feels broken. Reserve the empty begin_message for inbound flows where the caller has a question ready.

Where Retell abstracts. Retell pre-synthesizes the begin message and warms the TTS WebSocket as soon as the agent attaches to the call. The lower-level view is gen_ai.voice.latency.tts_prebuffer_ms and gen_ai.voice.tts_playback_underrun_count on the TTS span. Underrun count should sit at zero on a properly tuned buffer.

7. Async evaluation

What it does. Run scoring after the turn commits. Never block the critical path on an LLM judge. The parent post covers the theory in §7.

Retell config. Configure webhook_url on the agent and subscribe to call_analyzed. Retell posts the transcript, recording URL, and a call_analysis block (with call_summary, user_sentiment, call_successful, plus any custom_analysis_data you configured via post_call_analysis_data on the agent) once analysis completes. Forward that payload to Future AGI ai-evaluation. None of this fires inline.

# retell-sdk==4.21.0, ai-evaluation==0.6.1
from fi.evals import Evaluator
from fi.evals.templates import ConversationCoherence, ConversationResolution, AudioQuality

async def on_retell_webhook(payload):
    event = payload.get("event")
    if event != "call_analyzed":
        return
    call = payload["call"]
    analysis = call.get("call_analysis", {})
    evaluator = Evaluator()
    result = await evaluator.run_async(
        templates=[
            ConversationCoherence(),
            ConversationResolution(),
            AudioQuality(),
        ],
        inputs={
            "transcript": call.get("transcript", ""),
            "recording_url": call.get("recording_url"),
            "call_id": call["call_id"],
            "call_summary": analysis.get("call_summary"),
            "user_sentiment": analysis.get("user_sentiment"),
            "call_successful": analysis.get("call_successful"),
            "custom_analysis_data": analysis.get("custom_analysis_data", {}),
        },
    )
    return result
# Output:
# {"conversation_coherence": 0.91, "conversation_resolution": 0.84, "audio_quality": 0.88}

Common mistake. Running an LLM-judge eval inside a transcript_updated webhook handler. That fires every turn and adds 200ms or more to the turn budget. Use transcript_updated only for state mirroring; gate evals on call_analyzed.
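One way to keep scoring off the request path is to acknowledge the webhook immediately and hand the payload to a background worker. The sketch below shows that shape with an asyncio queue. handle_webhook and eval_worker are hypothetical names, and the sleep is a placeholder for the slow evaluator call.

```python
import asyncio
from typing import List


async def handle_webhook(payload: dict, queue: asyncio.Queue) -> dict:
    """Acknowledge fast; only call_analyzed payloads are queued for scoring."""
    if payload.get("event") == "call_analyzed":
        queue.put_nowait(payload["call"])
    return {"status": "ok"}


async def eval_worker(queue: asyncio.Queue, scored: List[str]) -> None:
    while True:
        call = await queue.get()
        await asyncio.sleep(0.01)  # placeholder for the slow evaluator call
        scored.append(call["call_id"])
        queue.task_done()


async def demo() -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    scored: List[str] = []
    worker = asyncio.create_task(eval_worker(queue, scored))
    # A per-turn event is acknowledged but never scored.
    await handle_webhook({"event": "transcript_updated", "call": {"call_id": "c1"}}, queue)
    await handle_webhook({"event": "call_analyzed", "call": {"call_id": "c1"}}, queue)
    await queue.join()  # wait for background scoring to drain
    worker.cancel()
    return scored


scored_calls = asyncio.run(demo())
print(scored_calls)  # ['c1']
```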

Where Retell abstracts. Retell delivers call_analyzed with call.transcript, call.recording_url, and a call.call_analysis block carrying call_summary, user_sentiment, call_successful, plus the custom_analysis_data extractions configured via post_call_analysis_data on the agent. 70+ pre-built eval templates in ai-evaluation cover the voice surface, including audio_transcription, audio_quality, conversation_coherence, conversation_resolution, and task_completion, plus faithfulness, tool-use accuracy, and groundedness. Unlimited custom evaluators are authored by an in-product agent that reads your traces. Apache 2.0.

8. Parallel TTS warm-up

What it does. Keep a warm TTS connection open per session so the first audio frame arrives 50-150ms faster than a cold start. The parent post covers the theory in §8.

Retell config. Retell-managed. The TTS WebSocket is opened when the call attaches and reused across turns. You pick the voice and model. Retell handles the connection reuse. Add fallback_voice_ids so a provider blip does not cold-start mid-call.

{
  "voice_id": "11labs-Adrian",
  "voice_model": "eleven_turbo_v2",
  "voice_temperature": 0.6,
  "voice_speed": 1.0,
  "fallback_voice_ids": ["cartesia-Brooke", "openai-alloy"]
}

Common mistake. Picking eleven_multilingual_v2 or a heavy sonic-3 variant for an English-only agent. Multilingual and large voice models add 80-150ms to time-to-first-audio versus eleven_turbo_v2, eleven_flash_v2, or the lighter sonic-3 tiers. If the agent speaks one language, pick the turbo or flash variant.

Where Retell abstracts. Retell-managed. To verify the WebSocket stays warm, traceAI captures gen_ai.voice.tts_connection_state (cold, warm, reused) and gen_ai.voice.latency.tts_first_audio_ms. Warm should beat cold by the 50-150ms saving.

9. Smaller models for short turns

What it does. Route short conversational turns (“yes”, “no”, “can you repeat that”) to a smaller and faster model. Route complex tool turns to the larger model. The parent post covers the theory in §9.

Retell config. Three paths. Static: set model on a single-prompt create-retell-llm payload to the lighter option and accept the quality tradeoff for the whole agent. Per-node: switch to conversation-flow and set model_choice per conversation node so acknowledgment nodes pick the lighter model and lookup nodes pick the heavier one. Per-turn dynamic: set response_engine.type to custom-llm and route inside the custom LLM WebSocket service behind llm_websocket_url.

// POST /create-conversation-flow (excerpt)
{
  "model_choice": {"model": "gpt-4.1-mini", "high_priority": true},
  "nodes": [
    {"id": "ack", "type": "conversation",
     "instruction": {"type": "prompt", "text": "Acknowledge in 5 words or less."},
     "model_choice": {"model": "gpt-4.1-mini", "high_priority": true},
     "model_temperature": 0.1},
    {"id": "resolve", "type": "conversation",
     "instruction": {"type": "prompt", "text": "Resolve the refund using context. Stay concise."},
     "model_choice": {"model": "claude-4.5-sonnet", "high_priority": true},
     "model_temperature": 0.3}
  ]
}

Common mistake. Routing every node to gpt-5 or claude-4.5-sonnet for “quality”. TTFT on those models sits 100-300ms above the smaller alternatives. The mix matters more than the peak.

Where Retell abstracts. Retell evaluates the current node and dispatches the matching model. For provider-agnostic routing across 15+ models, point response_engine.type at custom-llm and route inside the Agent Command Center gateway. Capture gen_ai.request.model and gen_ai.voice.route_reason as traceAI attributes.
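For the per-turn dynamic path, the routing decision inside a custom LLM service can be very small. The sketch below is one hedged way to write it: pick_model, the four-word cutoff, and the reason labels are assumptions for illustration, and the model names are the ones used in the examples in this post.

```python
from typing import Tuple

SMALL_MODEL = "gpt-4.1-mini"
LARGE_MODEL = "claude-4.5-sonnet"
SHORT_TURN_MAX_WORDS = 4  # assumed cutoff; tune on real traffic


def pick_model(utterance: str, needs_tool: bool) -> Tuple[str, str]:
    """Return (model, route_reason) for one user turn."""
    if needs_tool:
        return LARGE_MODEL, "tool_turn"
    if len(utterance.split()) <= SHORT_TURN_MAX_WORDS:
        return SMALL_MODEL, "short_turn"
    return LARGE_MODEL, "long_turn"


print(pick_model("yes", needs_tool=False))
# ('gpt-4.1-mini', 'short_turn')
print(pick_model("I need a refund for order A123", needs_tool=True))
# ('claude-4.5-sonnet', 'tool_turn')
```

The second tuple element is the kind of value that could be recorded as gen_ai.voice.route_reason so routing decisions show up next to the latency they produced.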

10. Semantic cache for common intents

What it does. Embed the user’s transcript on the live partial. Search a cache of recently-answered queries by embedding similarity. On a hit above threshold, return the cached audio answer in 30-80ms. The parent post covers the theory in §10.

Retell config. Two patterns. In-flow: model the top FAQ intents as dedicated states in a conversation-flow agent so the lookup short-circuits the heavy LLM with a fixed response from the state prompt. Gateway: switch response_engine.type to custom-llm, point the URL at the Agent Command Center, and configure semantic cache there so cache hits return in 30-80ms without code changes inside Retell.

# Point Retell's custom LLM at the Agent Command Center gateway.
# Retell only carries llm_websocket_url; configure model + cache policy
# on the WebSocket service behind that URL (ACC handles it for you).
retell_agent_patch = {
    "response_engine": {
        "type": "custom-llm",
        "llm_websocket_url": "wss://gateway.futureagi.com/v1/retell/acme-support",
    }
}
# Effect once the patch is applied to the agent (nothing is printed):
# Retell calls now flow through ACC. Cache hits return cached completion in 30-80ms.
# Semantic cache hit-rate of 15-30% is realistic on support agents.

Common mistake. Setting the similarity threshold too low (0.80 or below). Below 0.90, false positives ship the wrong answer to the caller. Start at 0.92, scope the cache by tenant ID, and tune downward only on traffic that proves safe.

Where Retell abstracts. Retell does not run the cache. The conversation-flow states short-circuit common intents, and the ACC gateway covers the embedding plus similarity lookup. Capture gen_ai.voice.semantic_cache_hit and gen_ai.voice.semantic_cache_similarity as traceAI attributes so hit rate trends over time.
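To show what the lookup does under the hood, here is a toy semantic cache in plain Python. It is a sketch, not the gateway's implementation: SemanticCache is a hypothetical class, the three-dimensional vectors stand in for real embeddings, and the cached answer is made up. The 0.92 threshold and tenant scoping follow the advice above.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.92) -> None:
        self.threshold = threshold
        # tenant_id -> list of (embedding, answer); tenants never share entries
        self._entries: Dict[str, List[Tuple[Sequence[float], str]]] = {}

    def put(self, tenant_id: str, embedding: Sequence[float], answer: str) -> None:
        self._entries.setdefault(tenant_id, []).append((embedding, answer))

    def get(self, tenant_id: str, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self._entries.get(tenant_id, []):
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache()
cache.put("acme", [1.0, 0.0, 0.0], "Refunds post within five business days.")
print(cache.get("acme", [0.98, 0.1, 0.0]))   # close match: cached answer
print(cache.get("acme", [0.7, 0.7, 0.0]))    # similarity ~0.71: None
print(cache.get("other", [1.0, 0.0, 0.0]))   # different tenant: None
```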

11. KV-cache reuse across turns

What it does. Provider session caching can reduce repeated prefix processing on multi-turn calls. The model skips reprocessing the conversation history that is already cached. The parent post covers the theory in §11.

Retell config. Anchor general_prompt at the top of the assembled prompt and let Retell append the conversation transcript turn by turn. Pick a model whose provider supports prompt or session caching (Anthropic, OpenAI 4o-class, Gemini). The cache works as long as the prefix stays byte-stable. State transitions on a multi-state retell-llm modify the appended state_prompt, not the general_prompt, so prefix caching survives.

// POST /create-retell-llm
{
  "model": "claude-4.5-sonnet",
  "model_high_priority": true,
  "general_prompt": "You are an Acme support agent. Stay concise. Escalate when uncertain.",
  "model_temperature": 0.2
}
// Then attach via response_engine: {type: "retell-llm", llm_id: "<id>"} on create-agent

Common mistake. Rewriting general_prompt per turn with a templated string that interpolates dynamic values. That changes the byte string and busts the cache. Keep general_prompt literal, push variables into default_dynamic_variables, and use placeholders only in state prompts.

Where Retell abstracts. Retell assembles the system prompt as general_prompt plus the active state_prompt. The provider behind model handles the cache. The lower-level view is gen_ai.usage.cached_input_tokens plus gen_ai.voice.latency.llm_ttft_ms on the LLM span. TTFT from turn 2 onward should sit 100-300ms below turn 1 because the prefix is cached.
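A quick way to read those span attributes is to turn them into a cached fraction per turn. The sketch below assumes each LLM span also carries gen_ai.usage.input_tokens alongside the cached count; the token numbers are hypothetical and cached_fraction is an illustrative helper.

```python
def cached_fraction(llm_span: dict) -> float:
    """Share of input tokens served from the provider cache on one turn."""
    total = llm_span.get("gen_ai.usage.input_tokens", 0)
    cached = llm_span.get("gen_ai.usage.cached_input_tokens", 0)
    return cached / total if total else 0.0


# Hypothetical spans for three consecutive turns of one call.
turns = [
    {"gen_ai.usage.input_tokens": 1500, "gen_ai.usage.cached_input_tokens": 0},
    {"gen_ai.usage.input_tokens": 1620, "gen_ai.usage.cached_input_tokens": 1408},
    {"gen_ai.usage.input_tokens": 1750, "gen_ai.usage.cached_input_tokens": 1536},
]
fractions = [cached_fraction(span) for span in turns]
warm = all(fraction > 0.8 for fraction in fractions[1:])
print(fractions[0], warm)  # 0.0 True
```

Turn 1 is expected to be cold. If later turns also report a fraction near zero, something in the prefix is changing between turns.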

12. Regional routing for STT and TTS

What it does. Pin STT and TTS to the closest regional endpoint of the provider. Many voice providers route based on the gateway’s region by default. The parent post covers the theory in §12.

Retell config. Region routing is per-component on Retell. The STT provider behind stt_mode (or custom_stt_config) picks its closest edge POP based on call origin. The TTS provider behind voice_model does the same. For multi-region agents, run separate Retell agents per region with regional webhook_url endpoints and provider choices that route to the nearest edge.

{
  "agent_name": "acme-support-eu",
  "stt_mode": "custom",
  "custom_stt_config": {"provider": "deepgram", "endpointing_ms": 250},
  "voice_id": "11labs-Adrian",
  "voice_model": "eleven_turbo_v2",
  "webhook_url": "https://eu.api.acme.com/retell/events",
  "timezone": "Europe/Amsterdam"
}

Common mistake. Running one Retell agent and a single webhook_url for users across continents. The transcontinental round-trip adds 60-120ms to every turn. The fix is per-region agents with regional webhook_url endpoints and provider choices that hit the nearest edge.

Where Retell abstracts. Provider-managed at the edge. Capture gen_ai.voice.stt_region and gen_ai.voice.tts_region as traceAI span attributes so misrouted-region traffic surfaces on the p95 heatmap.

Bonus: backchanneling latency on Retell

Backchannel words (“yeah”, “uh-huh”, “got it”) soften the conversation and reduce perceived latency on long-running tool calls. Retell ships three knobs: enable_backchannel to switch the behavior on, backchannel_frequency between 0 and 1 to control how often the agent interjects, and backchannel_words to set phrases that match your brand voice.

{
  "enable_backchannel": true,
  "backchannel_frequency": 0.6,
  "backchannel_words": ["got it", "okay", "right"],
  "interruption_sensitivity": 0.85,
  "enable_dynamic_responsiveness": true,
  "responsiveness": 0.9
}

interruption_sensitivity at 0.85 lets the user barge in cleanly without the agent stepping on the next 200ms of audio. enable_dynamic_responsiveness lets Retell match the caller’s pace, which on a fast talker shaves 100-200ms off perceived turn lag. Capture gen_ai.voice.barge_in_count and gen_ai.voice.latency.barge_in_flush_ms as traceAI attributes so you see whether the barge-in tax is fair. A healthy Retell agent has barge-in rates of 5-15% on conversational turns and 1-3% on scripted turns. Higher rates indicate the agent is talking too long or backchannel_frequency is annoying the caller.
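The healthy bands quoted above are easy to encode as a check over aggregated span counts. This is a small sketch with hypothetical helper names and made-up counts; only the 5-15% and 1-3% bands come from the text.

```python
HEALTHY_BANDS = {
    "conversational": (0.05, 0.15),
    "scripted": (0.01, 0.03),
}


def barge_in_rate(barge_in_count: int, turn_count: int) -> float:
    return barge_in_count / turn_count if turn_count else 0.0


def classify(rate: float, turn_kind: str) -> str:
    """Label a barge-in rate against the band for that kind of turn."""
    low, high = HEALTHY_BANDS[turn_kind]
    if rate > high:
        return "high"
    if rate < low:
        return "low"
    return "healthy"


print(classify(barge_in_rate(12, 100), "conversational"))  # healthy
print(classify(barge_in_rate(22, 100), "conversational"))  # high
```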

Stacking the techniques: a Retell agent config that hits sub-500ms

Two calls compose the production stack: create-conversation-flow (or create-retell-llm) first, then create-agent referencing it. Together they stack every technique above and land a starting point for sub-500ms p95 on short turns.

// 1. POST /create-conversation-flow (excerpt)
{
  "model_choice": {"model": "gpt-4.1-mini", "high_priority": true},
  "default_dynamic_variables": {"customer_tier": "standard"},
  "nodes": [
    {"id": "start", "type": "conversation",
     "instruction": {"type": "prompt", "text": "Greet the caller in one sentence."},
     "begin_message": "Hi, Acme support. How can I help?"},
    {"id": "resolve", "type": "conversation",
     "instruction": {"type": "prompt", "text": "Resolve the refund using order context. Stay concise."},
     "model_choice": {"model": "claude-4.5-sonnet", "high_priority": true}}
  ]
}

// 2. POST /create-agent
{
  "agent_name": "acme-support-prod-us",
  "response_engine": {
    "type": "conversation-flow",
    "conversation_flow_id": "conversation_flow_acme_support_v3"
  },
  "voice_id": "11labs-Adrian",
  "voice_model": "eleven_turbo_v2",
  "voice_temperature": 0.6,
  "voice_speed": 1.0,
  "fallback_voice_ids": ["cartesia-Brooke"],
  "language": "en-US",
  "stt_mode": "custom",
  "custom_stt_config": {"provider": "deepgram", "endpointing_ms": 250},
  "boosted_keywords": ["refund", "balance", "appointment"],
  "denoising_mode": "noise-cancellation",
  "interruption_sensitivity": 0.85,
  "enable_backchannel": true,
  "backchannel_frequency": 0.5,
  "backchannel_words": ["got it", "right"],
  "begin_message_delay_ms": 0,
  "ambient_sound": "call-center",
  "ambient_sound_volume": 0.3,
  "end_call_after_silence_ms": 30000,
  "max_call_duration_ms": 600000,
  "voicemail_detection_timeout_ms": 25000,
  "voicemail_message": "Hi, please call us back at your convenience.",
  "post_call_analysis_data": [
    {"type": "call_summary"},
    {"type": "call_successful"},
    {"type": "user_sentiment"}
  ],
  "post_call_analysis_model": "gpt-4.1-mini",
  "webhook_url": "https://api.acme.com/retell/events",
  "webhook_events": ["call_started", "call_ended", "call_analyzed"],
  "handbook_config": {
    "speech_normalization": true,
    "natural_filler_words": true,
    "default_personality": true
  },
  "opt_in_signed_url": true
}

Realistic p95 budget for this config on short turns, measured on traceAI spans:

Stage	Budget	Retell knob
STT end-of-turn	130ms	`stt_mode: "custom"`, `endpointing_ms: 250`
LLM TTFT	220ms	prefix cache via stable `general_prompt`, `model_high_priority: true`
Tool prefetch	0ms (parallel)	custom function state on edge
TTS first-audio	130ms	`voice_model: "eleven_turbo_v2"`, warm connection
Network RTT	50ms	regional `webhook_url` + provider edge
Total p95	480-550ms	stacked

Exact numbers depend on the provider, the prompt length, and the region. The pattern holds. Stacking these knobs drops a 1200ms default Retell turn into the sub-500ms zone on short turns and 700-800ms on tool-heavy turns.
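As a worked check on the table, the stage budgets can be summed and compared against a measured turn. The budget values below are copied from the table. The measured turn is hypothetical, and the helper names are illustrative.

```python
from typing import Dict, List

STAGE_BUDGET_MS: Dict[str, int] = {
    "stt_end_of_turn": 130,
    "llm_ttft": 220,
    "tool_prefetch": 0,  # runs in parallel, so it adds nothing to the turn
    "tts_first_audio": 130,
    "network_rtt": 50,
}


def total_budget_ms(budget: Dict[str, int]) -> int:
    return sum(budget.values())


def stages_over_budget(measured: Dict[str, int], budget: Dict[str, int]) -> List[str]:
    """List the stages whose measured latency exceeds the budget."""
    return [stage for stage, limit in budget.items() if measured.get(stage, 0) > limit]


# Hypothetical per-stage timings for one turn.
measured_turn = {
    "stt_end_of_turn": 120,
    "llm_ttft": 310,
    "tool_prefetch": 0,
    "tts_first_audio": 125,
    "network_rtt": 45,
}
print(total_budget_ms(STAGE_BUDGET_MS))                    # 530
print(stages_over_budget(measured_turn, STAGE_BUDGET_MS))  # ['llm_ttft']
```

The stage budgets add up to 530ms, which sits inside the 480-550ms total in the table, so a turn that lands under 500ms needs at least one stage to come in below its budget.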

Future AGI for Retell monitoring

Three paths plug Future AGI into a Retell pipeline. Pick the one that matches the team’s stack.

Path one: native voice observability. Create a Future AGI Agent Definition. Paste the Retell API key plus Agent ID. Future AGI captures every call through the Retell API. No SDK. Call payloads, transcripts, recordings, post_call_analysis_data extractions, and per-stage timing land in Observe automatically. This is the lowest-friction path and the right default for Retell-only teams.

Path two: traceAI for cross-provider visibility. Install fi-instrumentation, call register() with ProjectType.OBSERVE, and the SDK emits OpenInference spans for STT, LLM, TTS, and tool calls. Real gen_ai.voice.* and gen_ai.evaluation.* namespaces. 30+ documented integrations across Python and TypeScript plus dedicated traceAI-pipecat and traceai-livekit packages. Apache 2.0. Use this when the team runs more than one voice runtime and wants one trace model.

Path three: webhook to ai-evaluation. Configure webhook_url on the agent. Forward call_analyzed payloads to Future AGI ai-evaluation. 70+ built-in eval templates including audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion, plus faithfulness, tool-use accuracy, and groundedness. Unlimited custom evaluators are authored by an in-product agent that reads your code and conversation traces. The programmatic eval API lets you configure and re-run evals as traces accumulate. Async. Never on the turn budget.

For inline safety on the user utterance, Future AGI Protect runs sub-100ms per arXiv 2510.13351. Built on Gemma 3n with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance). Multi-modal across text, image, and audio.

# fi-protect==0.3.0
from fi.protect import Protector

protector = Protector()
result = protector.protect(
    inputs="I want a full refund for my last 6 orders",
    protect_rules=["content_moderation", "data_privacy_compliance"],
    action="block",
    reason=True,
    timeout=25000,
)
# Output:
# {"passed": True, "reason": "no violation detected", "latency_ms": 78}

For audio-aware evals on the recorded call, pass the Retell recording_url as MLLMAudio:

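# ai-evaluation==0.6.1
from fi.evals import Evaluator
from fi.evals.templates import AudioQuality
from fi.evals.types import MLLMAudio

evaluator = Evaluator()  # same client as in the webhook handler in section 7
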
result = evaluator.run(
    template=AudioQuality(),
    inputs={"audio": MLLMAudio(url="https://recordings.retellai.com/call_abc123.wav", local=True)},
)
# Output:
# {"audio_quality": 0.87, "rationale": "Low background noise, consistent volume"}

When evaluation scores plateau, agent-opt closes the loop. Six prompt optimizers ship: Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, and PromptWizard. Point an optimizer at the Retell general_prompt, score against the dataset of past calls, and ship the candidate that scores best on conversation_resolution plus task_completion.

Error Feed is the clustering layer over traces and evals. Zero-config auto-clustering turns latency outliers and failure patterns into named issues with auto-written root cause, quick fix, and long-term recommendation. Instead of 10,000 raw Retell calls, you see 12 named issues with a fix path. Pricing on the Future AGI side: free to get started, pay-as-you-go scales with usage, compliance and enterprise add-ons layer on as you need them. See pricing.

Sources and references

Retell create-agent reference: docs.retellai.com/api-references/create-agent
Retell create-retell-llm reference: docs.retellai.com/api-references/create-retell-llm
Retell conversation flow: docs.retellai.com/build/conversation-flow
Future AGI Protect benchmarks: arXiv 2510.13351
GEPA optimizer: arXiv 2507.19457
Meta-Prompt: arXiv 2505.09666
Random Search optimizer: arXiv 2311.09569
Future AGI trust and compliance: futureagi.com/trust
OpenInference span specification: github.com/Arize-ai/openinference

Frequently asked questions

What is a realistic p95 latency target for a Retell agent in 2026?

The target is sub-500ms p95 for short conversational turns and 700-800ms p95 for tool-heavy turns. A default Retell agent on a single-prompt retell-llm response engine, with stt_mode left at fast but an untuned endpointing_ms and a heavy voice_model, sits around 900-1200ms p95 turn latency. To pull it under 500ms, keep stt_mode at fast (or switch to custom with a 200-300ms endpointing_ms), pick a streaming voice_model like eleven_turbo_v2 or sonic-3, route response_engine to a conversation-flow agent so common intents short-circuit the heavy LLM, tune interruption_sensitivity for fast barge-in, and forward call_analyzed payloads from webhook_url to Future AGI ai-evaluation for async scoring. Track p50, p95, and p99 separately; p95 is the number callers feel.
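Tracking the three percentiles separately takes only a few lines. The sketch below uses the nearest-rank definition of a percentile; the latency samples are hypothetical, and in practice they would come from per-turn timings in your traces.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(turn_latencies_ms):
    """Report p50, p95, and p99 side by side so a tail regression is not hidden by the median."""
    return {f"p{p}": percentile(turn_latencies_ms, p) for p in (50, 95, 99)}

# Hypothetical per-turn latencies in milliseconds: nine healthy turns, one slow outlier.
turns = [420, 450, 470, 480, 510, 530, 560, 610, 700, 1250]
summary = latency_summary(turns)
print(summary)
```

With only ten samples, the single 1250ms outlier sets both p95 and p99 while p50 stays near 500ms, which is why the median alone is a poor health signal.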

Does Retell support prompt prefix caching out of the box?

Retell does not expose a cache toggle. The retell-llm response engine passes the assembled prompt (general_prompt + active state_prompt) to the upstream provider, so caching engages whenever the provider supports it and the prefix is byte-stable. With model set to claude-4.5-sonnet, Anthropic's ephemeral cache fires when the prefix qualifies. With gpt-4.1 or gpt-5, OpenAI auto-caches above 1024 tokens on a byte-stable prefix. The mistake teams make is interpolating dynamic variables near the top of general_prompt. Put dynamic_variables placeholders near the end so the prefix stays stable across turns. Use traceAI to confirm cached_input_tokens climb after warm-up.
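The effect of placeholder position can be checked offline before touching the agent. The sketch below is illustrative only: the prompt text is made up, the double-curly placeholder syntax is assumed, and `render` is a hypothetical stand-in for Retell's own substitution. It measures the shared prefix in characters as a rough proxy for the token prefix that providers actually cache.

```python
import os

# Static instructions: identical bytes on every call, so they can form a cacheable prefix.
STATIC_RULES = (
    "You are a support agent for a hypothetical store.\n"
    "Confirm the order ID before discussing refunds.\n"
)

# Placeholder first: the rendered prompt diverges within the first few characters.
UNSTABLE_TEMPLATE = "Caller: {{customer_name}}\n" + STATIC_RULES
# Placeholder last: every character of STATIC_RULES is shared across calls.
STABLE_TEMPLATE = STATIC_RULES + "Caller: {{customer_name}}\n"

def render(template, variables):
    """Substitute {{name}} placeholders (hypothetical stand-in for Retell's substitution)."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", value)
    return template

def shared_prefix_len(template, calls):
    """Length in characters of the prefix common to all rendered prompts."""
    return len(os.path.commonprefix([render(template, c) for c in calls]))

calls = [{"customer_name": "Ada"}, {"customer_name": "Grace"}]
unstable = shared_prefix_len(UNSTABLE_TEMPLATE, calls)
stable = shared_prefix_len(STABLE_TEMPLATE, calls)
print(unstable, stable)
```

A shared prefix that collapses to a handful of characters is a sign that a dynamic value has drifted toward the top of the prompt.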

How do I prefetch tool calls in a Retell agent without breaking the conversation?

Retell does not ship an async tool flag. The pattern is two-step. First, model the high-confidence intent as a dedicated state in a conversation-flow agent. When the user mentions the intent (order ID, account number), the state edge fires the custom function tool through webhook_url before the LLM commits a verbal response. Second, return the response_variables shape Retell expects so the next state has the answer ready. Net win is 200-400ms on prefetched turns. The cost is roughly 2-5 percent wasted calls when intent changes. Cancel server-side on the call_ended webhook event for stale work.
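On the server side, the prefetch and cancel halves of this pattern can be sketched with asyncio tasks keyed by call ID. This is a minimal sketch under stated assumptions: `lookup_order` and `PrefetchRegistry` are hypothetical names, the lookup is simulated with a short sleep, and the wiring to Retell's function-call and call_ended requests is left out.

```python
import asyncio

async def lookup_order(order_id):
    """Stand-in for a real backend lookup (hypothetical; simulated with a short sleep)."""
    await asyncio.sleep(0.05)
    return {"order_id": order_id, "status": "shipped"}

class PrefetchRegistry:
    """Tracks in-flight prefetches per call so stale work can be cancelled."""

    def __init__(self):
        self._tasks = {}

    def start(self, call_id, order_id):
        # Fire the lookup as soon as the intent is detected; do not await it here.
        self._tasks[call_id] = asyncio.create_task(lookup_order(order_id))

    async def result(self, call_id):
        # Called when the next state needs the answer; often already resolved.
        return await self._tasks.pop(call_id)

    def cancel(self, call_id):
        # Called from the call_ended handler to drop work nobody will read.
        task = self._tasks.pop(call_id, None)
        if task is not None and not task.done():
            task.cancel()
            return True
        return False

async def demo():
    registry = PrefetchRegistry()
    registry.start("call_1", "A-100")
    registry.start("call_2", "B-200")
    cancelled = registry.cancel("call_2")  # call_2 hung up before the answer was used
    answer = await registry.result("call_1")
    return answer, cancelled

answer, cancelled = asyncio.run(demo())
print(answer, cancelled)
```

Keying by call ID keeps the cancel path cheap: the call_ended handler only needs the ID it already receives.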

Can I route different Retell turns to different LLMs based on intent?

Yes, through a conversation-flow agent. Each subagent node has its own model selection on the retell-llm response engine. Short conversational nodes pick gpt-4.1-mini or claude haiku for 100-300ms lower TTFT. Tool-heavy nodes stay on gpt-4.1 or claude-4.5-sonnet. The pricing model rewards this pattern directly because each node bills for the time spent on it multiplied by that model's per-second price. For more dynamic routing, set response_engine.type to custom-llm, point its llm_websocket_url at a gateway such as Agent Command Center, and route inside the gateway across 15+ providers behind one endpoint.

What does Retell abstract away that I would otherwise have to build?

Retell handles the streaming pipe between STT and the LLM, the streaming pipe between LLM tokens and TTS, the WebSocket connection management on the voice provider, the barge-in plumbing through interruption_sensitivity, and the voicemail and IVR detection paths. Where teams need lower-level control, traceAI emits OpenInference spans for STT, LLM, TTS, and tool calls. Native voice observability on Future AGI ingests Retell calls through an Agent Definition tied to a Retell API key plus Agent ID. No SDK required for ingestion. The code path adds span-level depth on top of the native dashboard.

How do I run evaluation on Retell calls without adding latency to the turn?

Configure webhook_url on the agent and subscribe to call_analyzed. Retell posts the full transcript, recording URL, call summary, and post_call_analysis_data fields once the call ends. Forward that payload to Future AGI ai-evaluation and run scoring async, after the call commits. The 70+ pre-built eval templates include audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion, faithfulness, tool-use accuracy, and groundedness. None of that fires inline, so none of it touches the turn budget. For inline safety on the user utterance, Future AGI Protect runs sub-100ms per arXiv 2510.13351 and fits inside the orchestration slice.
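The handler side of this pattern can be sketched with the standard library alone. The sketch assumes the webhook body is JSON with an `event` field and a `call` object, which is how the payload is described above; `score_call` is a hypothetical placeholder for the ai-evaluation request, and HTTP framing and signature verification are omitted.

```python
import json
import queue
import threading

eval_queue = queue.Queue()
scored_call_ids = []

def score_call(call):
    """Placeholder for the ai-evaluation request (hypothetical); records the call ID only."""
    scored_call_ids.append(call["call_id"])

def worker():
    """Drains the queue in the background so scoring never blocks the webhook response."""
    while True:
        call = eval_queue.get()
        if call is None:  # sentinel: stop the worker
            break
        score_call(call)
        eval_queue.task_done()

def handle_webhook(raw_body):
    """Acknowledge quickly: enqueue call_analyzed payloads, ignore every other event."""
    payload = json.loads(raw_body)
    if payload.get("event") != "call_analyzed":
        return 204
    eval_queue.put(payload["call"])
    return 200

threading.Thread(target=worker, daemon=True).start()

status_analyzed = handle_webhook(json.dumps(
    {"event": "call_analyzed", "call": {"call_id": "call_abc123", "transcript": "Hi, I need a refund."}}
))
status_started = handle_webhook(json.dumps(
    {"event": "call_started", "call": {"call_id": "call_abc123"}}
))

eval_queue.join()     # wait for queued scoring to finish (demo only)
eval_queue.put(None)  # stop the worker
print(status_analyzed, status_started, scored_call_ids)
```

An in-process queue is enough to show the shape; a production deployment would more likely hand the payload to a durable queue so scoring survives restarts.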

Can I pin Retell STT and TTS to a specific region?

Retell does not surface a single region knob on the agent. The control points are stt_mode (fast versus accurate versus custom), custom_stt_config when you bring your own STT, and the voice_model selection which determines the TTS provider edge. For multi-region agents, run separate agents per region with regional webhook_url endpoints close to your users and pick voice_models whose providers serve your traffic from the right POPs. Capture gen_ai.voice.stt_region and gen_ai.voice.tts_region as traceAI attributes so misrouted traffic surfaces on the p95 heatmap.

How does Future AGI plug into a Retell pipeline?

Three paths. First, native voice observability: create a Future AGI Agent Definition, paste the Retell API key and Agent ID, and Future AGI captures call events through the Retell API. No SDK. Second, traceAI: install fi-instrumentation, call register with ProjectType.OBSERVE, and the SDK emits gen_ai.voice.* and gen_ai.evaluation.* spans for STT, LLM, TTS, and tool calls. Third, webhook to ai-evaluation: configure webhook_url on the agent and forward call_analyzed payloads to ai-evaluation for async scoring across the 70+ templates. All three land in the same Observe project.

