How to Optimize Retell Voice Agent Latency in 2026: 12 Techniques with Real Config
Optimize Retell AI voice agent latency to sub-500ms p95 in 2026. 12 techniques with real Retell agent config: STT, response_engine, backchannel, states, async eval.
How to Optimize Retell Voice Agent Latency in 2026
To optimize Retell voice agent latency, tune 12 knobs across the agent and the underlying retell-llm or conversation-flow resource. Keep stt_mode on fast (or switch to custom with a tuned endpointing_ms). Pick a streaming voice_model such as eleven_turbo_v2 or sonic-3. Anchor a byte-stable general_prompt on the retell-llm resource so provider prefix caching engages. Point response_engine.type at conversation-flow so common intents resolve in lighter subagent nodes. Tune interruption_sensitivity and backchannel_frequency for fast turn-taking. Set model_high_priority on the LLM resource for consistent time to first token (TTFT). Fire custom function tools from flow edges to prefetch lookups. Forward call_analyzed from webhook_url to Future AGI ai-evaluation for async scoring. Stacked, these knobs drop a 1200ms default Retell turn to roughly 500ms p95 on short turns and 700-800ms on tool-heavy turns.
TL;DR: pick by Retell config knob
| Technique | Retell config knob | Expected p95 win |
|---|---|---|
| Streaming STT first-partial | stt_mode: "fast" or custom_stt_config.endpointing_ms | 200-400ms |
| Partial LLM into TTS | Managed by Retell runtime (response_engine streaming) | 200-500ms |
| Prompt prefix caching | general_prompt byte-stable, model selection | 200-400ms |
| Edge model routing | model per state node in conversation-flow | 60-150ms |
| Prefetch tool calls | Custom function tool on state edge, webhook_url | 200-400ms |
| Audio prebuffering | begin_message, begin_message_delay_ms, ambient_sound | 80-200ms |
| Async evaluation | webhook_url + call_analyzed event to FAGI | 100-300ms |
| TTS warm-up | Retell-managed connection reuse, fallback_voice_ids | 50-150ms |
| Smaller models for short turns | model per state in conversation-flow | 100-300ms |
| Semantic cache for common intents | Conversation-flow states or Agent Command Center gateway | 400-800ms on hits |
| KV-cache reuse across turns | Stable general_prompt + state framing | 100-300ms |
| Regional routing | Provider-managed via voice_model + custom_stt_config | 30-80ms |
How to read this guide
The parent methodology hub covers the 12 techniques as a framework. This post maps each one to the Retell agent config. The pattern is consistent. Retell handles the streaming wiring out of the box. Where you need control over what gets cached, prefetched, routed, or measured, reach into the knobs below. Where Retell abstracts a technique away, this post says so and shows the lower-level equivalent through traceAI.
Spans matter. Every “we shipped X technique” claim is hand-wavy without per-stage timing. traceAI emits OpenInference spans for STT, LLM, TTS, and tool calls in one trace per Retell call. 30+ documented integrations across Python and TypeScript cover the voice stack, including dedicated traceAI-pipecat and traceai-livekit packages plus OpenAI Realtime, Anthropic, LiteLLM, and Vertex AI. Apache 2.0. For Retell dashboards, no SDK is needed: native voice observability ingests via Retell API key plus Agent ID through a Future AGI Agent Definition.
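As a rough illustration of per-stage timing, the sketch below computes a nearest-rank p95 per stage from exported span records. The record shape (`stage`, `duration_ms`) and the sample values are assumptions made for this example, not the traceAI export format.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # ceil(0.95 * n) using integer arithmetic, as a 1-based rank
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_by_stage(spans):
    """Group span durations by stage name and return the p95 of each group."""
    by_stage = defaultdict(list)
    for span in spans:
        by_stage[span["stage"]].append(span["duration_ms"])
    return {stage: p95(durations) for stage, durations in by_stage.items()}


# Hypothetical spans: 20 LLM timings and 20 TTS timings.
spans = [{"stage": "llm", "duration_ms": 200 + 10 * i} for i in range(20)]
spans += [{"stage": "tts", "duration_ms": 100 + 5 * i} for i in range(20)]

stage_p95 = p95_by_stage(spans)
print(stage_p95)
```

Tracking the p95 per stage rather than one end-to-end number makes it easier to tell which of the 12 techniques actually moved the needle.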
1. Streaming STT with first-partial routing
What it does. Switch from batch STT to streaming STT that emits partial transcripts every 100-200ms while the user is still speaking. Feed the latest partial to the LLM the moment intent confidence crosses 0.85. The parent post covers the theory in §1.
Retell config. Set stt_mode to fast to bias the transcriber toward latency over accuracy. For more control, set stt_mode to custom and pass custom_stt_config with the provider (azure, deepgram, or soniox) plus an endpointing_ms tuned to your traffic. Push boosted_keywords for the proper nouns the model will mishear.
{
"stt_mode": "custom",
"custom_stt_config": {
"provider": "deepgram",
"endpointing_ms": 250
},
"boosted_keywords": ["refund", "balance", "appointment"],
"vocab_specialization": "general",
"denoising_mode": "noise-cancellation"
}
Common mistake. Switching stt_mode to accurate on a conversational agent for marginal accuracy. That choice trades 100-300ms of extra confirmation latency per turn. The Retell default fast (or custom with endpointing_ms around 200-300ms) is the right starting point for short conversational turns.
Where Retell abstracts. Retell auto-feeds stable partials into the response engine. If you want to see when the first partial actually fires, traceAI captures gen_ai.voice.latency.stt_first_partial_ms and gen_ai.voice.latency.stt_final_ms on the STT span. The gap between those two is the parallel window the LLM gets for free.
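For teams building the lower-level equivalent themselves, the routing gate can be sketched as follows. The `Partial` type and the `intent_confidence` score are hypothetical and would come from your own intent classifier; Retell does not expose them.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # intent-confidence gate from the technique description


@dataclass
class Partial:
    text: str
    intent_confidence: float  # hypothetical score from your own classifier
    at_ms: int                # milliseconds since the user started speaking


def first_routable_partial(partials):
    """Return the first partial whose intent confidence crosses the gate, else None."""
    for partial in partials:
        if partial.intent_confidence > CONFIDENCE_THRESHOLD:
            return partial
    return None


def parallel_window_ms(stt_first_partial_ms, stt_final_ms):
    """Time the LLM can work while the user is still finishing the utterance."""
    return max(0, stt_final_ms - stt_first_partial_ms)


partials = [
    Partial("I want", 0.30, 200),
    Partial("I want a refund", 0.91, 450),
    Partial("I want a refund please", 0.97, 900),
]
hit = first_routable_partial(partials)
window = parallel_window_ms(hit.at_ms, partials[-1].at_ms)
print(hit.text, window)
```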
2. Partial LLM tokens piped into TTS
What it does. Stream LLM tokens. The moment the first sentence boundary lands, fire that sentence to TTS. The user hears the first word before the LLM finishes the response. The parent post covers the theory in §2.
Retell config. Managed by the Retell runtime. When response_engine.type is retell-llm or conversation-flow, Retell pipes tokens to TTS at sentence boundaries with no extra config. The only verification path is the per-stage timing on traceAI spans.
# fi-instrumentation==0.4.2
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
tracer_provider = register(
project_type=ProjectType.OBSERVE,
project_name="retell-prod",
set_global_tracer_provider=True,
)
# traceAI now emits gen_ai.evaluation.* spans plus gen_ai.voice.latency.* attributes
# Output:
# Registered project: retell-prod
# Tracer provider attached. Voice attrs: gen_ai.voice.latency.ttfb_ms, gen_ai.voice.latency.tts_first_audio_ms
Common mistake. Setting response_engine.type to custom-llm and pointing it at a webhook that buffers the full LLM response before returning. That doubles turn latency because TTS waits on the complete answer. Stream tokens from your custom LLM webhook the same way OpenAI streams them.
Where Retell abstracts. Fully managed for retell-llm and conversation-flow response engines. The lower-level view is the gap between gen_ai.voice.latency.llm_ttft_ms, gen_ai.voice.latency.llm_first_sentence_ms, and gen_ai.voice.latency.tts_first_audio_ms on traceAI spans. A healthy Retell turn shows TTS starting within 80-150ms of the first sentence boundary.
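If you run a custom LLM service, sentence-boundary streaming can be approximated with a small generator like the one below. This is a minimal sketch, not Retell's internal logic: it treats `.`, `!`, or `?` followed by whitespace as a boundary, which is an assumption that will mis-split some abbreviations.

```python
import re

# A sentence ends at ., ! or ? followed by whitespace. Requiring the whitespace
# avoids splitting decimals such as "3.5" in the middle.
SENTENCE_END = re.compile(r"[.!?]\s")


def sentences_from_tokens(tokens):
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever is left when the stream ends


tokens = ["Sure", ".", " Your", " refund", " is", " approved", ".", " Anything", " else", "?"]
chunks = list(sentences_from_tokens(tokens))
print(chunks)
```

Each yielded sentence would be handed to TTS immediately, so the first audio starts while later sentences are still being generated.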
3. LLM prompt prefix caching
What it does. Anchor the system prompt at the top of the assembled prompt. Keep it byte-identical across turns so the provider’s cache lookup hits. The parent post covers the theory in §3.
Retell config. On the retell-llm resource (created separately via create-retell-llm and referenced from the agent by llm_id), general_prompt is appended to every system prompt and stays stable as the agent walks through states. Place the durable instructions there. Push any dynamic content into default_dynamic_variables placeholders consumed late in the prompt so the prefix bytes do not shift turn to turn.
// POST /create-retell-llm
{
"model": "claude-4.5-sonnet",
"model_temperature": 0.2,
"model_high_priority": true,
"general_prompt": "You are an Acme refund support agent. Resolve refunds within policy. Escalate edge cases. Speak in 1-2 sentences.",
"default_dynamic_variables": {
"customer_tier": "{{customer_tier}}",
"today": "{{today}}"
}
}
// Then attach via response_engine: {type: "retell-llm", llm_id: "<id>"} on create-agent
Common mistake. Interpolating today’s date or the user ID directly into general_prompt. That breaks the prefix. Push the variable into default_dynamic_variables and reference it from a state prompt near the end of the assembly so the front of the prompt stays cache-stable.
Where Retell abstracts. Retell does not surface a cache toggle. The provider behind model handles it. Confirm by checking gen_ai.usage.cached_input_tokens on the LLM span in traceAI. A healthy production Retell agent should show a cache hit rate of 80% or higher on the system prompt after warm-up.
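A quick way to reason about prefix stability is to compare the assembled prompts of two turns. The sketch below uses hypothetical assembly functions (Retell's actual prompt assembly is internal) and a character-level common prefix as a proxy for what a token-level provider cache would see.

```python
import os

GENERAL_PROMPT = "You are an Acme refund support agent. Resolve refunds within policy."


def assemble_stable(state_prompt, today):
    """Durable instructions first, dynamic values last."""
    return f"{GENERAL_PROMPT}\n{state_prompt}\nToday is {today}."


def assemble_unstable(state_prompt, today):
    """Anti-pattern: a dynamic value at the very front shifts every later byte."""
    return f"Today is {today}. {GENERAL_PROMPT}\n{state_prompt}"


def shared_prefix_len(a, b):
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


state = "Collect the order id."
stable = shared_prefix_len(
    assemble_stable(state, "2026-01-05"), assemble_stable(state, "2026-01-06")
)
unstable = shared_prefix_len(
    assemble_unstable(state, "2026-01-05"), assemble_unstable(state, "2026-01-06")
)
print(stable, unstable, len(GENERAL_PROMPT))
```

With the stable layout the shared prefix covers the whole general prompt; with the unstable layout it ends inside the date, before the general prompt even starts.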
4. Edge model routing
What it does. Route short conversational turns to a smaller, faster model. Route complex tool turns to the larger model. The parent post covers the theory in §4.
Retell config. Switch from a single-prompt retell-llm to a conversation-flow response engine (created via create-conversation-flow). Each subagent node carries its own model selection, so the greeting node and acknowledgment nodes pick gpt-4.1-mini while the lookup nodes pick gpt-4.1 or claude-4.5-sonnet. The Retell pricing model rewards this directly because each node is billed as time spent multiplied by the price of its model.
// POST /create-conversation-flow (excerpt)
{
"model_choice": {"model": "gpt-4.1-mini", "high_priority": true},
"nodes": [
{"id": "greet", "type": "conversation",
"instruction": {"type": "prompt", "text": "Greet the caller in one sentence."},
"model_choice": {"model": "gpt-4.1-mini", "high_priority": true}},
{"id": "lookup", "type": "conversation",
"instruction": {"type": "prompt", "text": "Resolve the refund using order context. Call get_order_status."},
"model_choice": {"model": "claude-4.5-sonnet", "high_priority": true}},
{"id": "wrap_up", "type": "conversation",
"instruction": {"type": "prompt", "text": "Confirm next steps in one sentence and end the call."},
"model_choice": {"model": "gpt-4.1-mini", "high_priority": true}}
]
}
Common mistake. Picking the largest model on every node “for quality”. TTFT on claude-4.5-sonnet or gpt-4.1 sits 100-300ms above gpt-4.1-mini on the same prompt. The mix matters more than the peak.
Where Retell abstracts. Retell evaluates the active node, calls the matching model, and rolls cost into the call. For lower-level multi-model routing across providers, the Agent Command Center covers 15+ providers through a single OpenAI-compatible endpoint with per-route policy. It is hosted with RBAC, AWS Marketplace availability, and multi-region support, and is SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page. Capture gen_ai.request.model and gen_ai.voice.route_reason as traceAI attributes.
5. Prefetch tool calls on high-confidence intent
What it does. When intent confidence on the first stable transcript partial is above 0.85, fire the tool call in parallel with the LLM call. If intent changes in later partials, cancel the prefetched call. The parent post covers the theory in §5.
Retell config. Two-step. Model the high-confidence intent as a dedicated state in a conversation-flow agent. Attach a custom function node next to a conversation node. When the edge condition matches (the user mentions an order number, an appointment date, a refund), Retell fires the custom function to your tool URL before the next conversation node commits the verbal response. Return the tool response shape Retell expects so the next node has the answer ready.
// POST /create-conversation-flow (excerpt)
{
"nodes": [
{
"id": "collect_order_id",
"type": "conversation",
"instruction": {"type": "prompt", "text": "If the user mentions an order number, capture it as order_id."},
"edges": [
{"transition_condition": "user shared an order id",
"destination_node_id": "lookup_order_status"}
]
},
{
"id": "lookup_order_status",
"type": "function",
"tool_id": "tool_get_order_status",
"tool_type": "custom",
"edges": [
{"transition_condition": "default",
"destination_node_id": "share_status"}
]
}
]
}
// Custom function tool (created separately) carries url + timeout_ms + response variables
Common mistake. Setting the tool timeout_ms to 20 seconds on a turn-critical lookup. A 20-second hang ruins the conversation. Keep it at 6-10 seconds and let the next state read the response or fall back to a verbal stall.
Where Retell abstracts. Retell runs the function call, parses response_variables, and feeds them into the next state’s prompt. Cancellation on stale intent happens in your server. Track gen_ai.voice.tool_prefetched and gen_ai.voice.tool_call_cancelled as traceAI attributes. A prefetch success rate above 90% means the strategy is paying.
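Because cancellation on stale intent lives in your server, a minimal asyncio sketch of prefetch-then-cancel may help. The `lookup_order` coroutine and the `(intent, confidence, order_id)` tuples are stand-ins invented for this example; a real server would call its own tool backend and intent classifier.

```python
import asyncio

CONFIDENCE_THRESHOLD = 0.85


async def lookup_order(order_id):
    """Stand-in for a real tool call; the sleep simulates network time."""
    await asyncio.sleep(0.05)
    return {"order_id": order_id, "status": "shipped"}


async def handle_partials(partials):
    """Prefetch on a high-confidence intent; cancel if a later partial changes intent.

    partials: (intent, confidence, order_id) tuples in arrival order.
    Returns the tool result, or None if no prefetch survived.
    """
    task, task_intent = None, None
    for intent, confidence, order_id in partials:
        if task is not None and intent != task_intent:
            task.cancel()  # stale intent: drop the in-flight lookup
            task, task_intent = None, None
        if task is None and intent == "order_status" and confidence > CONFIDENCE_THRESHOLD:
            task = asyncio.create_task(lookup_order(order_id))
            task_intent = intent
        await asyncio.sleep(0)  # yield so the prefetch can make progress
    return await task if task is not None else None


result = asyncio.run(
    handle_partials([("order_status", 0.90, "A1"), ("order_status", 0.95, "A1")])
)
print(result)
```

The lookup runs while later partials are still arriving, which is where the saving comes from; the cancel path keeps a changed intent from feeding a wrong answer into the next state.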
6. Audio prebuffering
What it does. Open the TTS path and synthesize the first audio frame before the user expects sound. The buffer absorbs network jitter so playback is smooth. The parent post covers the theory in §6.
Retell config. Set begin_message on the retell-llm resource (or the start node of a conversation flow) so Retell pre-synthesizes the greeting. Tune begin_message_delay_ms on the agent between 0 and 200ms depending on whether you want immediate speech (inbound) or a brief pause (outbound). Add ambient_sound to mask hand-off micro-gaps and enable speech normalization via the handbook_config preset so numbers and currency render cleanly without re-synthesis stalls.
// On create-retell-llm (or the start node of a conversation flow)
{
"begin_message": "Hi, Acme support. How can I help?"
}
// On create-agent
{
"begin_message_delay_ms": 0,
"ambient_sound": "call-center",
"ambient_sound_volume": 0.4,
"handbook_config": {
"speech_normalization": true,
"natural_filler_words": true,
"default_personality": true
}
}
Common mistake. Leaving begin_message empty on an outbound dial. Retell waits for the user to speak first, which on a cold dial feels broken. Reserve the empty begin_message for inbound flows where the caller has a question ready.
Where Retell abstracts. Retell pre-synthesizes the begin message and warms the TTS WebSocket as soon as the agent attaches to the call. The lower-level view is gen_ai.voice.latency.tts_prebuffer_ms and gen_ai.voice.tts_playback_underrun_count on the TTS span. Underrun count should sit at zero on a properly tuned buffer.
7. Async evaluation
What it does. Run scoring after the turn commits. Never block the critical path on an LLM judge. The parent post covers the theory in §7.
Retell config. Configure webhook_url on the agent and subscribe to call_analyzed. Retell posts the transcript, recording URL, and a call_analysis block (with call_summary, user_sentiment, call_successful, plus any custom_analysis_data you configured via post_call_analysis_data on the agent) once analysis completes. Forward that payload to Future AGI ai-evaluation. None of this fires inline.
# retell-sdk==4.21.0, ai-evaluation==0.6.1
from fi.evals import Evaluator
from fi.evals.templates import ConversationCoherence, ConversationResolution, AudioQuality

async def on_retell_webhook(payload):
    # Only score once Retell has finished post-call analysis
    event = payload.get("event")
    if event != "call_analyzed":
        return
    call = payload["call"]
    analysis = call.get("call_analysis", {})
    evaluator = Evaluator()
    result = await evaluator.run_async(
        templates=[
            ConversationCoherence(),
            ConversationResolution(),
            AudioQuality(),
        ],
        inputs={
            "transcript": call.get("transcript", ""),
            "recording_url": call.get("recording_url"),
            "call_id": call["call_id"],
            "call_summary": analysis.get("call_summary"),
            "user_sentiment": analysis.get("user_sentiment"),
            "call_successful": analysis.get("call_successful"),
            "custom_analysis_data": analysis.get("custom_analysis_data", {}),
        },
    )
    return result

# Output:
# {"conversation_coherence": 0.91, "conversation_resolution": 0.84, "audio_quality": 0.88}
Common mistake. Running an LLM-judge eval inside a transcript_updated webhook handler. That fires every turn and adds 200ms or more to the turn budget. Use transcript_updated only for state mirroring; gate evals on call_analyzed.
Where Retell abstracts. Retell delivers call_analyzed with call.transcript, call.recording_url, and a call.call_analysis block carrying call_summary, user_sentiment, call_successful, plus the custom_analysis_data extractions configured via post_call_analysis_data on the agent. The 70+ pre-built eval templates in ai-evaluation cover the voice surface, including audio_transcription, audio_quality, conversation_coherence, conversation_resolution, and task_completion, plus faithfulness, tool-use accuracy, and groundedness. Unlimited custom evaluators are authored by an in-product agent that reads your traces. Apache 2.0.
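One way to keep the webhook handler itself cheap is to acknowledge immediately and hand scoring to a background worker. The sketch below uses an in-process asyncio.Queue and a placeholder `fake_score` coroutine; both are assumptions for illustration, and a production setup would more likely use a durable queue and the real evaluation client.

```python
import asyncio


async def handle_webhook(payload, queue):
    """Acknowledge right away; only call_analyzed events are queued for scoring."""
    if payload.get("event") == "call_analyzed":
        queue.put_nowait(payload["call"])
    return {"status": 200}


async def eval_worker(queue, score_call, results):
    """Drain the queue in the background, off the request path."""
    while True:
        call = await queue.get()
        try:
            results[call["call_id"]] = await score_call(call)
        finally:
            queue.task_done()


async def fake_score(call):
    """Placeholder scorer; swap in your evaluation client here."""
    await asyncio.sleep(0.01)
    return {"transcript_chars": len(call.get("transcript", ""))}


async def main():
    queue, results = asyncio.Queue(), {}
    worker = asyncio.create_task(eval_worker(queue, fake_score, results))
    ack_skipped = await handle_webhook(
        {"event": "transcript_updated", "call": {"call_id": "c0"}}, queue
    )
    ack_queued = await handle_webhook(
        {"event": "call_analyzed", "call": {"call_id": "c1", "transcript": "hello"}}, queue
    )
    await queue.join()  # wait for the demo item to be scored
    worker.cancel()
    return ack_skipped, ack_queued, results


acks_and_results = asyncio.run(main())
print(acks_and_results)
```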
8. Parallel TTS warm-up
What it does. Keep a warm TTS connection open per session so the first audio frame arrives 50-150ms faster than a cold start. The parent post covers the theory in §8.
Retell config. Retell-managed. The TTS WebSocket is opened when the call attaches and reused across turns. You pick the voice and model. Retell handles the connection reuse. Add fallback_voice_ids so a provider blip does not cold-start mid-call.
{
"voice_id": "11labs-Adrian",
"voice_model": "eleven_turbo_v2",
"voice_temperature": 0.6,
"voice_speed": 1.0,
"fallback_voice_ids": ["cartesia-Brooke", "openai-alloy"]
}
Common mistake. Picking eleven_multilingual_v2 or a heavy sonic-3 variant for an English-only agent. Multilingual and large voice models add 80-150ms TTFT versus eleven_turbo_v2, eleven_flash_v2, or the lighter sonic-3 tiers. If the agent speaks one language, pick the turbo or flash variant.
Where Retell abstracts. Retell-managed. To verify the WebSocket stays warm, traceAI captures gen_ai.voice.tts_connection_state (cold, warm, reused) and gen_ai.voice.latency.tts_first_audio_ms. Warm should beat cold by the 50-150ms saving.
9. Smaller models for short turns
What it does. Route short conversational turns (“yes”, “no”, “can you repeat that”) to a smaller and faster model. Route complex tool turns to the larger model. The parent post covers the theory in §9.
Retell config. Two paths. Static: set model on a single-prompt create-retell-llm payload to the lighter option and accept the quality tradeoff for the whole agent. Per-node: switch to conversation-flow and set model_choice per conversation node so acknowledgment nodes pick the lighter model and lookup nodes pick the heavier one. Per-turn dynamic routing happens by setting response_engine.type to custom-llm and routing inside the custom LLM WebSocket service behind llm_websocket_url.
// POST /create-conversation-flow (excerpt)
{
"model_choice": {"model": "gpt-4.1-mini", "high_priority": true},
"nodes": [
{"id": "ack", "type": "conversation",
"instruction": {"type": "prompt", "text": "Acknowledge in 5 words or less."},
"model_choice": {"model": "gpt-4.1-mini", "high_priority": true},
"model_temperature": 0.1},
{"id": "resolve", "type": "conversation",
"instruction": {"type": "prompt", "text": "Resolve the refund using context. Stay concise."},
"model_choice": {"model": "claude-4.5-sonnet", "high_priority": true},
"model_temperature": 0.3}
]
}
Common mistake. Routing every node to gpt-5 or claude-4.5-sonnet for “quality”. TTFT on those models sits 100-300ms above the smaller alternatives. The mix matters more than the peak.
Where Retell abstracts. Retell evaluates the current node and dispatches the matching model. For provider-agnostic routing across 15+ models, point response_engine.type at custom-llm and route inside the Agent Command Center gateway. Capture gen_ai.request.model and gen_ai.voice.route_reason as traceAI attributes.
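Inside a custom LLM service, per-turn routing can start as a simple heuristic. The sketch below is one possible rule set, not a recommended policy: the keyword list, the six-word cutoff, and the reason labels are all assumptions, and the model names are the ones used in the examples above.

```python
SMALL_MODEL = "gpt-4.1-mini"
LARGE_MODEL = "claude-4.5-sonnet"
TOOL_HINTS = ("order", "refund", "appointment", "balance")  # hypothetical keyword list


def pick_model(user_turn, max_short_words=6):
    """Return (model, route_reason) for one user turn."""
    words = user_turn.lower().split()
    if any(hint in word for word in words for hint in TOOL_HINTS):
        return LARGE_MODEL, "tool_hint"   # likely needs a lookup or policy reasoning
    if len(words) <= max_short_words:
        return SMALL_MODEL, "short_turn"  # "yes", "no", "can you repeat that"
    return LARGE_MODEL, "long_turn"


for turn in ["yes", "can you repeat that", "where is my order 4412"]:
    print(turn, "->", pick_model(turn))
```

Returning the reason alongside the model gives you a value to record as gen_ai.voice.route_reason, so routing decisions can be audited against TTFT later.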
10. Semantic cache for common intents
What it does. Embed the user’s transcript on the live partial. Search a cache of recently-answered queries by embedding similarity. On a hit above threshold, return the cached audio answer in 30-80ms. The parent post covers the theory in §10.
Retell config. Two patterns. In-flow: model the top FAQ intents as dedicated states in a conversation-flow agent so the lookup short-circuits the heavy LLM with a fixed response from the state prompt. Gateway: switch response_engine.type to custom-llm, point the URL at the Agent Command Center, and configure semantic cache there so cache hits return in 30-80ms without code changes inside Retell.
# fi-instrumentation==0.4.2
# Point Retell's custom LLM at the Agent Command Center gateway.
# Retell only carries llm_websocket_url; configure model + cache policy
# on the WebSocket service behind that URL (ACC handles it for you).
retell_agent_patch = {
"response_engine": {
"type": "custom-llm",
"llm_websocket_url": "wss://gateway.futureagi.com/v1/retell/acme-support",
}
}
# Output:
# Retell calls now flow through ACC. Cache hits return cached completion in 30-80ms.
# Semantic cache hit-rate of 15-30% is realistic on support agents.
Common mistake. Setting the similarity threshold too low (0.80 or below). Below 0.90, false positives ship the wrong answer to the caller. Start at 0.92, scope the cache by tenant ID, and tune downward only on traffic that proves safe.
Where Retell abstracts. Retell does not run the cache. The conversation-flow states short-circuit common intents, and the ACC gateway covers the embedding plus similarity lookup. Capture gen_ai.voice.semantic_cache_hit and gen_ai.voice.semantic_cache_similarity as traceAI attributes so hit rate trends over time.
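To make the threshold and tenant scoping concrete, here is a toy in-memory semantic cache. It is a sketch only: the three-dimensional vectors stand in for real embeddings, the class is hypothetical rather than the gateway's implementation, and a production cache would use a vector index instead of a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self._entries = {}  # tenant_id -> list of (embedding, answer)

    def put(self, tenant_id, embedding, answer):
        self._entries.setdefault(tenant_id, []).append((embedding, answer))

    def get(self, tenant_id, embedding):
        """Return (answer, score); answer is None when the best match is below threshold."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self._entries.get(tenant_id, []):
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_score >= self.threshold:
            return best_answer, best_score
        return None, best_score


cache = SemanticCache(threshold=0.92)
cache.put("acme", [1.0, 0.0, 0.0], "Refunds take 5 business days.")

hit_answer, hit_score = cache.get("acme", [0.98, 0.1, 0.0])   # near-duplicate query
miss_answer, miss_score = cache.get("acme", [0.6, 0.8, 0.0])  # different intent
other_answer, _ = cache.get("globex", [1.0, 0.0, 0.0])        # other tenant, no entries
print(hit_answer, round(hit_score, 3), miss_answer, other_answer)
```

Keying entries by tenant means one customer's cached answer can never be served to another, regardless of how similar the queries are.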
11. KV-cache reuse across turns
What it does. Provider session caching can reduce repeated prefix processing on multi-turn calls. The model skips reprocessing the conversation history that is already cached. The parent post covers the theory in §11.
Retell config. Anchor general_prompt at the top of the assembled prompt and let Retell append the conversation transcript turn by turn. Pick a model whose provider supports prompt or session caching (Anthropic, OpenAI 4o-class, Gemini). The cache works as long as the prefix stays byte-stable. State transitions in conversation-flow modify the appended state_prompt, not the general_prompt, so prefix caching survives.
// POST /create-retell-llm
{
"model": "claude-4.5-sonnet",
"model_high_priority": true,
"general_prompt": "You are an Acme support agent. Stay concise. Escalate when uncertain.",
"model_temperature": 0.2
}
// Then attach via response_engine: {type: "retell-llm", llm_id: "<id>"} on create-agent
Common mistake. Rewriting general_prompt per turn with a templated string that interpolates dynamic values. That changes the byte string and busts the cache. Keep general_prompt literal, push variables into default_dynamic_variables, and use placeholders only in state prompts.
Where Retell abstracts. Retell assembles the system prompt as general_prompt plus the active state_prompt. The provider behind model handles the cache. The lower-level view is gen_ai.usage.cached_input_tokens plus gen_ai.voice.latency.llm_ttft_ms on the LLM span. TTFT on turn N+1 should sit 100-300ms below turn 1 because the prefix is cached.
12. Regional routing for STT and TTS
What it does. Pin STT and TTS to the closest regional endpoint of the provider. Many voice providers route based on the gateway’s region by default. The parent post covers the theory in §12.
Retell config. Region routing is per-component on Retell. The STT provider behind stt_mode (or custom_stt_config) picks its closest edge POP based on call origin. The TTS provider behind voice_model does the same. For multi-region agents, run separate Retell agents per region with regional webhook_url endpoints and provider choices that route to the nearest edge.
{
"agent_name": "acme-support-eu",
"stt_mode": "custom",
"custom_stt_config": {"provider": "deepgram", "endpointing_ms": 250},
"voice_id": "11labs-Adrian",
"voice_model": "eleven_turbo_v2",
"webhook_url": "https://eu.api.acme.com/retell/events",
"timezone": "Europe/Amsterdam"
}
Common mistake. Running one Retell agent and a single webhook_url for users across continents. The transcontinental round-trip adds 60-120ms to every turn. The fix is per-region agents with regional webhook_url endpoints and provider choices that hit the nearest edge.
Where Retell abstracts. Provider-managed at the edge. Capture gen_ai.voice.stt_region and gen_ai.voice.tts_region as traceAI span attributes so misrouted-region traffic surfaces on the p95 heatmap.
Bonus: backchanneling latency on Retell
Backchannel words (“yeah”, “uh-huh”, “got it”) soften the conversation and reduce perceived latency on long-running tool calls. Retell ships two knobs: enable_backchannel to switch the behavior on, and backchannel_frequency between 0 and 1 to control how often the agent interjects. Set backchannel_words to phrases that match your brand voice.
{
"enable_backchannel": true,
"backchannel_frequency": 0.6,
"backchannel_words": ["got it", "okay", "right"],
"interruption_sensitivity": 0.85,
"enable_dynamic_responsiveness": true,
"responsiveness": 0.9
}
interruption_sensitivity at 0.85 lets the user barge in cleanly without the agent stepping on the next 200ms of audio. enable_dynamic_responsiveness lets Retell match the caller’s pace, which on a fast talker shaves 100-200ms off perceived turn lag. Capture gen_ai.voice.barge_in_count and gen_ai.voice.latency.barge_in_flush_ms as traceAI attributes so you can see how much latency barge-in handling adds. A healthy Retell agent has barge-in rates of 5-15% on conversational turns and 1-3% on scripted turns. Higher rates indicate the agent is talking too long or backchannel_frequency is set high enough to annoy the caller.
Stacking the techniques: a Retell agent config that hits sub-500ms
Two calls compose the production stack: create-conversation-flow (or create-retell-llm) first, then create-agent referencing it. Together they stack every technique above and land a starting point for sub-500ms p95 on short turns.
// 1. POST /create-conversation-flow (excerpt)
{
"model_choice": {"model": "gpt-4.1-mini", "high_priority": true},
"default_dynamic_variables": {"customer_tier": "{{customer_tier}}"},
"nodes": [
{"id": "start", "type": "conversation",
"instruction": {"type": "prompt", "text": "Greet the caller in one sentence."},
"begin_message": "Hi, Acme support. How can I help?"},
{"id": "resolve", "type": "conversation",
"instruction": {"type": "prompt", "text": "Resolve the refund using order context. Stay concise."},
"model_choice": {"model": "claude-4.5-sonnet", "high_priority": true}}
]
}
// 2. POST /create-agent
{
"agent_name": "acme-support-prod-us",
"response_engine": {
"type": "conversation-flow",
"conversation_flow_id": "conversation_flow_acme_support_v3"
},
"voice_id": "11labs-Adrian",
"voice_model": "eleven_turbo_v2",
"voice_temperature": 0.6,
"voice_speed": 1.0,
"fallback_voice_ids": ["cartesia-Brooke"],
"language": "en-US",
"stt_mode": "custom",
"custom_stt_config": {"provider": "deepgram", "endpointing_ms": 250},
"boosted_keywords": ["refund", "balance", "appointment"],
"denoising_mode": "noise-cancellation",
"interruption_sensitivity": 0.85,
"enable_backchannel": true,
"backchannel_frequency": 0.5,
"backchannel_words": ["got it", "right"],
"begin_message_delay_ms": 0,
"ambient_sound": "call-center",
"ambient_sound_volume": 0.3,
"end_call_after_silence_ms": 30000,
"max_call_duration_ms": 600000,
"voicemail_detection_timeout_ms": 25000,
"voicemail_message": "Hi, please call us back at your convenience.",
"post_call_analysis_data": [
{"type": "call_summary"},
{"type": "call_successful"},
{"type": "user_sentiment"}
],
"post_call_analysis_model": "gpt-4.1-mini",
"webhook_url": "https://api.acme.com/retell/events",
"webhook_events": ["call_started", "call_ended", "call_analyzed"],
"handbook_config": {
"speech_normalization": true,
"natural_filler_words": true,
"default_personality": true
},
"opt_in_signed_url": true
}
Realistic p95 budget for this config on short turns, measured on traceAI spans:
| Stage | Budget | Retell knob |
|---|---|---|
| STT end-of-turn | 130ms | stt_mode: "custom", endpointing_ms: 250 |
| LLM TTFT | 220ms | prefix cache via stable general_prompt, model_high_priority: true |
| Tool prefetch | 0ms (parallel) | custom function state on edge |
| TTS first-audio | 130ms | voice_model: "eleven_turbo_v2", warm connection |
| Network RTT | 50ms | regional webhook_url + provider edge |
| Total p95 | 480-550ms | stacked |
Exact numbers depend on the provider, the prompt length, and the region. The pattern holds. Stacking these knobs drops a 1200ms default Retell turn to roughly 500ms p95 on short turns (the 480-550ms total above) and 700-800ms on tool-heavy turns.
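The stage budgets in the table can be checked with simple arithmetic, and the same structure works as a regression gate against measured spans. The stage keys and the `measured` values below are illustrative assumptions; note that adding per-stage figures only approximates an end-to-end p95.

```python
# Stage budgets from the table above, in milliseconds.
STAGE_BUDGET_MS = {
    "stt_end_of_turn": 130,
    "llm_ttft": 220,
    "tool_prefetch": 0,  # runs in parallel, so it adds nothing to the critical path
    "tts_first_audio": 130,
    "network_rtt": 50,
}

total_ms = sum(STAGE_BUDGET_MS.values())


def over_budget(measured_ms, budget_ms=STAGE_BUDGET_MS):
    """Return {stage: overshoot_ms} for every stage that exceeds its budget."""
    return {
        stage: measured_ms[stage] - budget
        for stage, budget in budget_ms.items()
        if measured_ms.get(stage, 0) > budget
    }


# Hypothetical measured p95 values for one deployment.
measured = {"stt_end_of_turn": 120, "llm_ttft": 310, "tts_first_audio": 130, "network_rtt": 65}
print(total_ms, over_budget(measured))
```

The budgets add up to 530ms, inside the 480-550ms total, and the overshoot report points at the stage to tune first.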
Future AGI for Retell monitoring
Three paths plug Future AGI into a Retell pipeline. Pick the one that matches the team’s stack.
Path one: native voice observability. Create a Future AGI Agent Definition. Paste the Retell API key plus Agent ID. Future AGI captures every call through the Retell API. No SDK. Call payloads, transcripts, recordings, post_call_analysis_data extractions, and per-stage timing land in Observe automatically. This is the lowest-friction path and the right default for Retell-only teams.
Path two: traceAI for cross-provider visibility. Install fi-instrumentation, call register() with ProjectType.OBSERVE, and the SDK emits OpenInference spans for STT, LLM, TTS, and tool calls. Real gen_ai.voice.* and gen_ai.evaluation.* namespaces. 30+ documented integrations across Python and TypeScript plus dedicated traceAI-pipecat and traceai-livekit packages. Apache 2.0. Use this when the team runs more than one voice runtime and wants one trace model.
Path three: webhook to ai-evaluation. Configure webhook_url on the agent. Forward call_analyzed payloads to Future AGI ai-evaluation. 70+ built-in eval templates including audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion, plus faithfulness, tool-use accuracy, and groundedness. Unlimited custom evaluators are authored by an in-product agent that reads your code and conversation traces. The programmatic eval API lets you configure and re-run evals as traces accumulate. Async. Never on the turn budget.
For inline safety on the user utterance, Future AGI Protect runs sub-100ms per arXiv 2510.13351. Built on Gemma 3n with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance). Multi-modal across text, image, and audio.
# fi-protect==0.3.0
from fi.protect import Protector
protector = Protector()
result = protector.protect(
inputs="I want a full refund for my last 6 orders",
protect_rules=["content_moderation", "data_privacy_compliance"],
action="block",
reason=True,
timeout=25000,
)
# Output:
# {"passed": True, "reason": "no violation detected", "latency_ms": 78}
For audio-aware evals on the recorded call, pass the Retell recording_url as MLLMAudio:
# ai-evaluation==0.6.1
from fi.evals import Evaluator
from fi.evals.templates import AudioQuality
from fi.evals.types import MLLMAudio

evaluator = Evaluator()  # same client as in the webhook example
result = evaluator.run(
    template=AudioQuality(),
    inputs={"audio": MLLMAudio(url="https://recordings.retellai.com/call_abc123.wav", local=True)},
)
# Output:
# {"audio_quality": 0.87, "rationale": "Low background noise, consistent volume"}
When evaluation scores plateau, agent-opt closes the loop. Six prompt optimizers ship: Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, and PromptWizard. Point an optimizer at the Retell general_prompt, score against the dataset of past calls, and ship the candidate that scores best on conversation_resolution plus task_completion.
Error Feed is the clustering layer over traces and evals. Zero-config auto-clustering turns latency outliers and failure patterns into named issues with auto-written root cause, quick fix, and long-term recommendation. Instead of 10,000 raw Retell calls, you see 12 named issues with a fix path. Pricing on the Future AGI side: free to get started, pay-as-you-go scales with usage, compliance and enterprise add-ons layer on as you need them. See pricing.
Related reading
- How to Optimize Voice Agent Latency: 12 Techniques for 2026
- How to Optimize Vapi Voice Agent Latency in 2026: 12 Techniques + Code
- How to Optimize LiveKit Voice Agent Latency in 2026: 12 Techniques + Code
- Sub-500ms Voice AI: The Complete Latency Budget Guide for 2026
- How to Measure Voice AI Latency: The Complete 2026 Guide
- How to Implement Voice AI Observability in 2026
Sources and references
- Retell create-agent reference: docs.retellai.com/api-references/create-agent
- Retell create-retell-llm reference: docs.retellai.com/api-references/create-retell-llm
- Retell conversation flow: docs.retellai.com/build/conversation-flow
- Future AGI Protect benchmarks: arXiv 2510.13351
- GEPA optimizer: arXiv 2507.19457
- Meta-Prompt: arXiv 2505.09666
- Random Search optimizer: arXiv 2311.09569
- Future AGI trust and compliance: futureagi.com/trust
- OpenInference span specification: github.com/Arize-ai/openinference
Frequently asked questions
What is a realistic p95 latency target for a Retell agent in 2026?
Does Retell support prompt prefix caching out of the box?
How do I prefetch tool calls in a Retell agent without breaking the conversation?
Can I route different Retell turns to different LLMs based on intent?
What does Retell abstract away that I would otherwise have to build?
How do I run evaluation on Retell calls without adding latency to the turn?
Can I pin Retell STT and TTS to a specific region?
How does Future AGI plug into a Retell pipeline?