How to Optimize LiveKit Voice Agent Latency in 2026: 12 Techniques with Real Code
Optimize LiveKit Agents voice latency to sub-500ms p95 in 2026. 12 techniques with real AgentSession code: streaming STT, partial TTS, prefix caching, regional routing, async eval.
Table of Contents
LiveKit Agents ships the orchestration. You ship the tuning. A vanilla AgentSession with default STT, LLM, and TTS plugins lands around 1.2-1.4 seconds p95 turn latency. The same session with 12 specific optimizations lands around 500-650ms p95 on the same hardware and the same providers. This guide walks each of the 12 techniques from the parent voice latency methodology and shows the exact LiveKit Agents knob, plugin, or callback that implements it, with Python code you can paste into a livekit-agents service today.
TL;DR: LiveKit plugin or config to expected win
| # | Technique | LiveKit surface | Expected p95 saving |
|---|---|---|---|
| 1 | Streaming STT first-partial routing | inference.STT() or deepgram.STT(interim_results=True) plus preemptive_generation=True | 200-400ms |
| 2 | Partial LLM tokens into TTS | Built-in via AgentSession; pipe streaming LLM into streaming TTS | 200-500ms |
| 3 | LLM prompt prefix caching | Anchor chat_ctx system prompt, set cache_control on Anthropic plugin | 200-400ms |
| 4 | Edge model routing | Per-Agent llm= swap inside on_user_turn_completed | 100-300ms |
| 5 | Prefetch tool calls on high-confidence intent | Classify in on_user_turn_completed, fire asyncio.create_task for @function_tool | 200-400ms |
| 6 | Audio prebuffering | aec_warmup_duration on session plus withPreConnectAudio on the client | 80-200ms perceived |
| 7 | Async evaluation | traceAI-livekit spans exported async to Future AGI; eval scoring runs off the critical path | 100-300ms |
| 8 | Parallel TTS warm-up | Pre-instantiate TTS plugin in prewarm; first turn skips the cold-connect tax | 50-150ms |
| 9 | Smaller models for short turns | Swap llm= per turn in on_user_turn_completed based on intent length | 100-300ms |
| 10 | Semantic cache for common intents | Custom check inside on_user_turn_completed; short-circuit with session.say | 400-800ms on hits |
| 11 | KV-cache reuse across turns | Stable chat_ctx ordering plus provider session caching | 100-300ms |
| 12 | Regional routing for STT and TTS | region= on Deepgram and Cartesia plugins; pin LiveKit Cloud region | 30-80ms |
Stacked, the techniques drop a 1.2-1.4 second LiveKit turn into the 500-650ms p95 band on most workloads. Short turns hit sub-500ms.
How to read this guide
The parent post covers each technique’s theory and the per-stage latency budget. This post is the LiveKit-specific implementation map. Each section below answers four questions:
- What it does in LiveKit terms. Which plugin, which
AgentSessionparameter, whichAgentcallback. - Code. A 5-15 line Python snippet that runs against
livekit-agents>=1.5. - Where LiveKit does it for you. If the framework handles the technique automatically.
- Common mistake. What people break when they wire it the first time.
Pin SDK versions on every install. The examples here are written against livekit-agents==1.5.8, the May 2026 release. Older versions ship older AgentSession signatures.
pip install "livekit-agents[deepgram,openai,cartesia,silero,turn-detector]==1.5.8"
pip install traceai-livekit
1. Streaming STT with first-partial routing
What it does. Switch from batch STT to streaming STT that emits partial transcripts every 100-200ms while the user is still speaking. Feed the latest partial to the LLM before the turn detector commits.
LiveKit surface. Streaming is the default on every modern STT plugin (Deepgram, AssemblyAI, OpenAI Whisper streaming). The win comes from turning on preemptive_generation=True on AgentSession, which fires the LLM call on the STT partial instead of waiting for the final transcript.
# livekit-agents==1.5.8
from livekit.agents import AgentSession, inference
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel
session = AgentSession(
stt=inference.STT("deepgram/nova-3", language="multi"),
llm=inference.LLM("openai/gpt-4o-mini"),
tts=inference.TTS("cartesia/sonic-3"),
vad=silero.VAD.load(),
turn_detection=MultilingualModel(),
preemptive_generation=True,
)
Where LiveKit does it for you. STT plugins already stream by default. You do not write the partial loop.
Common mistake. Leaving preemptive_generation at its default. Without it the LLM waits for the turn-detector commit, which adds 150-350ms to every turn.
2. Partial LLM tokens piped into TTS
What it does. Stream LLM tokens. When the first sentence boundary lands, fire that sentence to TTS. The user hears the first word before the LLM has finished writing the response.
LiveKit surface. Fully automatic inside AgentSession. When the LLM plugin streams (which all the major plugins do) and the TTS plugin streams (Cartesia Sonic, ElevenLabs Turbo v2.5, OpenAI TTS streaming), the llm_node to tts_node pipe flushes at sentence boundaries with no extra config.
# livekit-agents==1.5.8
# Streaming LLM + streaming TTS is the default
session = AgentSession(
stt=inference.STT("deepgram/nova-3"),
llm=inference.LLM("openai/gpt-4o-mini"), # streams by default
tts=inference.TTS("cartesia/sonic-3"), # streams by default
vad=silero.VAD.load(),
turn_detection=MultilingualModel(),
)
Where LiveKit does it for you. Everything. The tts_node consumes the LLM token stream as an AsyncIterable[str] and flushes audio at sentence breaks.
Common mistake. Wrapping the LLM plugin in a custom llm_node that buffers the full response before returning. That kills the stream. If you need to override llm_node, return an AsyncIterable[llm.ChatChunk] and yield chunks as they arrive.
3. LLM prompt prefix caching
What it does. Anchor the system prompt at the top of the chat context. Keep it byte-identical across turns. Anthropic, OpenAI, and Google all cache prompt prefixes server-side, which slashes TTFT on cache hits.
LiveKit surface. chat_ctx on the Agent class. The Anthropic plugin accepts a cache_control block when you author the system prompt. OpenAI auto-caches whenever the prefix is byte-stable. Build the prompt once, attach it to the Agent, and resist the urge to interpolate per-turn timestamps near the top.
# livekit-agents==1.5.8
from livekit.agents import Agent, llm
from livekit.plugins import anthropic
SYSTEM_PROMPT = """You are a customer support agent for Acme Corp.
You answer questions about orders, returns, and shipping.
You speak in short sentences.""" # 1500+ tokens in production
class Assistant(Agent):
def __init__(self) -> None:
super().__init__(
instructions=SYSTEM_PROMPT,
llm=anthropic.LLM(
model="claude-haiku-4-5",
cache_control={"type": "ephemeral"}, # cache the prefix
),
)
Where LiveKit does it for you. chat_ctx preserves order across turns. The plugin passes cache_control through.
Common mistake. Putting the current timestamp or the user ID at the top of the system prompt for “personalization”. That changes the prefix byte-string every turn and defeats caching. Put dynamic content near the end of chat_ctx.
4. Edge model routing
What it does. Route short conversational turns to a smaller, faster model. Route complex tool turns to the larger model. Both LLMs hit the same chat_ctx so context is shared.
LiveKit surface. Override on_user_turn_completed on the Agent. Inspect the new message. Swap the llm= plugin on the session before the LLM node runs.
# livekit-agents==1.5.8
from livekit.agents import Agent, llm
from livekit.plugins import openai, anthropic
LIGHT_LLM = openai.LLM(model="gpt-4o-mini")
HEAVY_LLM = anthropic.LLM(model="claude-sonnet-4-5")
class Assistant(Agent):
async def on_user_turn_completed(
self, turn_ctx: llm.ChatContext, new_message: llm.ChatMessage
) -> None:
text = new_message.text_content or ""
if len(text) < 60 and "?" not in text:
self.llm = LIGHT_LLM
else:
self.llm = HEAVY_LLM
Where LiveKit does it for you. Nothing. Routing is your call.
Common mistake. Routing every turn to the largest model “for quality”. A Sonnet-class model is 200-400ms slower to TTFT than a Haiku-class model on the same prompt. Most conversational turns (“yes”, “thanks”, “can you repeat that”) do not need the larger model.
5. Prefetch tool calls on high-confidence intent
What it does. When the STT partial commits, classify intent. If confidence is above 0.85 and the intent maps to a known @function_tool, fire the tool call in parallel with the LLM call. If the LLM picks a different tool, cancel the prefetched future.
LiveKit surface. on_user_turn_completed runs once per turn after the turn detector commits. Fire the tool as an asyncio task. Stash the future on the Agent instance. When the LLM later requests the tool inside the same turn, await the prefetched future instead of running it cold.
# livekit-agents==1.5.8
import asyncio
from livekit.agents import Agent, function_tool, llm, RunContext
class Assistant(Agent):
def __init__(self) -> None:
super().__init__(instructions="...")
self._prefetch: asyncio.Task | None = None
async def on_user_turn_completed(
self, turn_ctx: llm.ChatContext, new_message: llm.ChatMessage
) -> None:
text = (new_message.text_content or "").lower()
if "order" in text and "status" in text:
self._prefetch = asyncio.create_task(self._lookup_order(text))
@function_tool
async def get_order_status(self, ctx: RunContext, order_id: str) -> str:
if self._prefetch is not None:
try:
return await self._prefetch
finally:
self._prefetch = None
return await self._lookup_order(order_id)
Where LiveKit does it for you. Nothing. The decorator wires the function tool to the LLM. You wire the prefetch.
Common mistake. Forgetting to cancel the prefetched future when the LLM picks a different tool. Leaked tasks pile up at scale. Wrap the prefetch in a try/finally that cancels on exit if the future was not awaited.
6. Audio prebuffering
What it does. Open the audio path before the user speaks. Buffer the first 80-200ms of audio so the STT plugin starts processing from frame zero instead of paying a connection-setup tax on the first frame.
LiveKit surface. Two knobs. On the agent side, aec_warmup_duration on AgentSession pre-runs the acoustic echo canceller so it is warm when the first user audio arrives. On the client side, the LiveKit SDKs support withPreConnectAudio() (Swift, Android, Flutter) and preConnectBuffer: true (web JS), which queues the user’s first audio frames into LiveKit before the SFU handshake finishes.
# livekit-agents==1.5.8
from livekit.agents import AgentSession, inference
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel
session = AgentSession(
stt=inference.STT("deepgram/nova-3"),
llm=inference.LLM("openai/gpt-4o-mini"),
tts=inference.TTS("cartesia/sonic-3"),
vad=silero.VAD.load(),
turn_detection=MultilingualModel(),
preemptive_generation=True,
aec_warmup_duration=3.0,
)
Where LiveKit does it for you. The audio path is managed by the SFU. You configure the warmup window.
Common mistake. Skipping the client-side pre-connect option. Without it, the user’s first 100-200ms of audio sits in the client buffer while the SFU handshake completes. That adds straight to first-turn latency.
7. Async evaluation
What it does. Run scoring after the turn commits. Never block the critical path on an LLM judge. Use a classifier model for inline rubrics if you absolutely need inline scoring.
LiveKit surface. traceai-livekit emits OpenInference spans through the OpenTelemetry batch processor. Span export is async. Scoring inside Future AGI’s Observe project runs off the critical path on the captured trace. Inline scoring only fires on the routes that need it.
# traceai-livekit, livekit-agents==1.5.8
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_livekit import enable_http_attribute_mapping
register(
project_name="livekit-voice-agent",
project_type=ProjectType.OBSERVE,
set_global_tracer_provider=True,
)
enable_http_attribute_mapping()
Register inside the worker entrypoint, not in the module top level that the LiveKit job-runner pickles across processes. Otherwise you can hit pickling errors when LiveKit forks the agent worker.
Where Future AGI does it for you. 70+ pre-built eval templates (including audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion) run async on captured traces by default. You opt into inline scoring per route.
Common mistake. Running an LLM judge synchronously inside on_user_turn_completed “to gate the response”. That adds 200-500ms to every turn. The classifier-based ai-evaluation models are tuned for the LLM-as-judge cost and latency tradeoff so the async path stays affordable at scale.
8. Parallel TTS warm-up
What it does. Keep a warm TTS connection open per session. When the LLM emits the first sentence, the connection is already authenticated and the voice is already preloaded, so the first audio frame arrives 50-150ms faster than a cold start.
LiveKit surface. Pre-instantiate the TTS plugin inside the worker prewarm function. The same TTS object is reused across the session. Cartesia Sonic, ElevenLabs Turbo, and OpenAI TTS streaming all keep the WebSocket warm between turns.
# livekit-agents==1.5.8
from livekit.agents import JobProcess, WorkerOptions, cli
from livekit.plugins import silero, cartesia
def prewarm(proc: JobProcess) -> None:
proc.userdata["vad"] = silero.VAD.load()
proc.userdata["tts"] = cartesia.TTS(
model="sonic-3",
voice="9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
)
async def entrypoint(ctx):
session = AgentSession(
stt=inference.STT("deepgram/nova-3"),
llm=inference.LLM("openai/gpt-4o-mini"),
tts=ctx.proc.userdata["tts"],
vad=ctx.proc.userdata["vad"],
turn_detection=MultilingualModel(),
)
cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))
Where LiveKit does it for you. prewarm is the hook. The plugin’s WebSocket reuse is the saving.
Common mistake. Constructing the TTS plugin inside entrypoint instead of prewarm. That defers the cold-connect tax to the first turn, where users notice it.
9. Smaller models for short turns
What it does. Acknowledgments (“yes”, “okay”, “go on”) do not need a frontier model. Route them to a smaller LLM. Route tool turns and reasoning turns to the larger one.
LiveKit surface. Same hook as technique 4: on_user_turn_completed. Inspect the turn. Swap self.llm. The next llm_node call uses the new plugin.
# livekit-agents==1.5.8
from livekit.agents import Agent, llm
from livekit.plugins import openai
NANO = openai.LLM(model="gpt-4o-mini")
FULL = openai.LLM(model="gpt-4.1")
class Assistant(Agent):
async def on_user_turn_completed(
self, turn_ctx: llm.ChatContext, new_message: llm.ChatMessage
) -> None:
text = (new_message.text_content or "").strip()
if text.split() and len(text.split()) <= 4:
self.llm = NANO
else:
self.llm = FULL
Where LiveKit does it for you. Nothing. Routing is yours.
Common mistake. Routing based on token count rather than intent class. A four-word turn (“cancel my entire order”) may still need the heavier model. Use a tiny classifier or a hard list of acknowledgment phrases instead of raw length when stakes are high.
10. Semantic cache for common intents
What it does. Embed the user’s partial transcript. Search a cache of recently-answered queries by embedding similarity. On a hit above threshold, return the cached audio answer directly via session.say and skip the LLM and TTS pipeline.
LiveKit surface. on_user_turn_completed runs before the LLM. Check the cache there. On a hit, call session.say to play the cached answer and session.interrupt to suppress the LLM response that would otherwise queue.
# livekit-agents==1.5.8
from livekit.agents import Agent, llm
class Assistant(Agent):
async def on_user_turn_completed(
self, turn_ctx: llm.ChatContext, new_message: llm.ChatMessage
) -> None:
text = new_message.text_content or ""
hit = await self._semantic_cache.lookup(
text,
tenant_id=self._tenant_id,
threshold=0.92,
)
if hit is not None:
self.session.interrupt()
await self.session.say(hit.answer, allow_interruptions=False)
Where Future AGI does it for you. For teams that already route through the Agent Command Center, the gateway covers semantic cache, prompt cache, model fallback, and per-route routing across 15+ providers behind one endpoint.
Common mistake. Caching without a tenant ID. Customer A’s account-balance question must not return Customer B’s account balance. Always key the cache on tenant_id plus the intent embedding.
11. KV-cache reuse across turns
What it does. Multi-turn conversations on Anthropic, OpenAI, and Google all benefit from session or prefix caching. The model skips reprocessing the conversation history that is already cached.
LiveKit surface. chat_ctx is preserved across turns by AgentSession. The plugin passes the same chat context to the LLM on every call. As long as the order is stable and the prefix has not changed, the provider hits the cache.
# livekit-agents==1.5.8
from livekit.agents import Agent
from livekit.plugins import anthropic
class Assistant(Agent):
def __init__(self) -> None:
super().__init__(
instructions=SYSTEM_PROMPT,
llm=anthropic.LLM(
model="claude-sonnet-4-5",
cache_control={"type": "ephemeral"},
),
)
# AgentSession appends each turn to chat_ctx automatically.
# Stable order = cache hits on turns 2+.
Where LiveKit does it for you. chat_ctx ordering is stable by default. You break it if you rewrite history mid-conversation.
Common mistake. Mutating chat_ctx in on_user_turn_completed to “tidy up” prior turns. Every mutation invalidates the cache from that point on. If you must rewrite history, do it at session start, not mid-conversation.
12. Regional routing for STT and TTS
What it does. Pin STT and TTS to the closest regional endpoint of the provider. Many providers route based on the gateway region by default, but the explicit region= or model-string suffix can shave 30-80ms.
LiveKit surface. Plugin-level region parameters. Deepgram supports a base_url override for self-hosted or regional clusters. Cartesia and ElevenLabs accept region routing through their API. LiveKit Cloud rooms run in the closest region by default; you can pin per project in the dashboard.
# livekit-agents==1.5.8
from livekit.plugins import deepgram, cartesia
stt = deepgram.STT(
model="nova-3",
interim_results=True,
base_url="https://api.deepgram.com", # swap for the regional cluster
)
tts = cartesia.TTS(
model="sonic-3",
voice="9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
)
Where LiveKit does it for you. LiveKit’s SFU routes media via the closest POP. Provider-side region is on the plugin.
Common mistake. Running EU customers through a us-east STT cluster because the LiveKit Cloud project defaults to us-east. The audio crosses the Atlantic twice. Pin the LiveKit project region to eu for EU traffic and pin the Deepgram and Cartesia regions to their EU clusters.
Bonus: turn-taking latency on LiveKit
LiveKit’s turn detector is one of the strongest in the open-source voice stack. The MultilingualModel and EnglishModel from livekit-plugins-turn-detector use a small open-weights model layered on top of VAD and STT endpoint data to decide when the user has finished speaking. The defaults are tuned for naturalness; the latency-tuning levers are on TurnHandlingOptions.
# livekit-agents==1.5.8
from livekit.agents import AgentSession
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel
session = AgentSession(
stt=inference.STT("deepgram/nova-3", language="multi"),
llm=inference.LLM("openai/gpt-4o-mini"),
tts=inference.TTS("cartesia/sonic-3"),
vad=silero.VAD.load(
min_silence_duration=0.4,
prefix_padding_duration=0.2,
),
turn_detection=MultilingualModel(),
min_endpointing_delay=0.4,
max_endpointing_delay=2.0,
preemptive_generation=True,
)
min_endpointing_delay is the floor. Lower it from the default to push response time down at the cost of more false turn boundaries on noisy audio. max_endpointing_delay is the ceiling so the agent never sits silent for too long if the model is uncertain. The Silero VAD min_silence_duration controls how long a silence has to last before the VAD flips to “not speaking”. A 0.4-second silence is a reasonable balance; 0.2 seconds is aggressive and trades naturalness for speed.
For push-to-talk workloads, set turn_detection="manual" and drive turns with session.commit_user_turn and session.clear_user_turn. The user controls the boundaries; latency on turn handover drops to under 50ms because there is no endpointing model to wait for.
Stacking the techniques: a typical LiveKit Agent that hits sub-500ms
The full reference agent. Roughly 50 lines. Combines streaming STT, partial TTS, prefix caching, edge routing, prefetch, audio prebuffering, async eval, parallel TTS warm-up, smaller models for short turns, KV-cache reuse, and regional routing.
# livekit-agents==1.5.8, traceai-livekit, ai-evaluation
import asyncio
from livekit.agents import (
Agent, AgentSession, JobContext, JobProcess, WorkerOptions, cli,
inference, llm, function_tool, RunContext,
)
from livekit.plugins import silero, deepgram, cartesia, openai
from livekit.plugins.turn_detector.multilingual import MultilingualModel
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_livekit import enable_http_attribute_mapping
SYSTEM_PROMPT = "You are an Acme support agent. Speak in short sentences."
LIGHT_LLM = openai.LLM(model="gpt-4o-mini")
HEAVY_LLM = openai.LLM(model="gpt-4.1")
class Assistant(Agent):
def __init__(self) -> None:
super().__init__(instructions=SYSTEM_PROMPT, llm=LIGHT_LLM)
self._prefetch: asyncio.Task | None = None
async def on_user_turn_completed(
self, turn_ctx: llm.ChatContext, new_message: llm.ChatMessage
) -> None:
text = (new_message.text_content or "").lower()
self.llm = LIGHT_LLM if len(text.split()) <= 6 else HEAVY_LLM
if "order" in text and "status" in text:
self._prefetch = asyncio.create_task(self._lookup_order(text))
@function_tool
async def get_order_status(self, ctx: RunContext, order_id: str) -> str:
if self._prefetch is not None:
try:
return await self._prefetch
finally:
self._prefetch = None
return await self._lookup_order(order_id)
async def _lookup_order(self, text: str) -> str:
# your real lookup
return "Order shipped on Monday."
def prewarm(proc: JobProcess) -> None:
proc.userdata["vad"] = silero.VAD.load(min_silence_duration=0.4)
proc.userdata["tts"] = cartesia.TTS(model="sonic-3")
async def entrypoint(ctx: JobContext) -> None:
register(
project_name="acme-livekit-voice",
project_type=ProjectType.OBSERVE,
set_global_tracer_provider=True,
)
enable_http_attribute_mapping()
session = AgentSession(
stt=deepgram.STT(model="nova-3", interim_results=True),
llm=LIGHT_LLM,
tts=ctx.proc.userdata["tts"],
vad=ctx.proc.userdata["vad"],
turn_detection=MultilingualModel(),
min_endpointing_delay=0.4,
preemptive_generation=True,
aec_warmup_duration=3.0,
)
await session.start(room=ctx.room, agent=Assistant())
cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))
Expected output on a clean run, measured against captured Future AGI spans:
stt_first_partial_ms = 140
llm_ttft_ms = 220 (cache hit)
tts_first_audio_ms = 90
turn_total_ms = 520 (p95 across a 1000-turn capture)
Vanilla AgentSession on the same prompt and the same providers clocks 1.2-1.4 seconds p95. The 12-technique stack drops it to 500-650ms p95 on most workloads. Short conversational turns (“yes”, “thanks”) sit comfortably under 400ms.
Future AGI for LiveKit monitoring
LiveKit Telemetry covers the media layer (WebRTC jitter, packet loss, codec stats). It does not score every call, auto-cluster failures, or run inline guardrails on the LLM response. That gap is where Future AGI sits. Two paths compose cleanly.
Native voice obs path. No SDK. In the Future AGI dashboard, create an Agent Definition. Select LiveKit as the provider, paste your LiveKit API key, paste the Assistant ID, enable observability. Every call lands in the Observe project with separate assistant and customer audio downloads, an auto transcript, and the 70+ pre-built eval template engine. The provider-API-key surface natively covers Vapi, Retell, and LiveKit.
Code-driven path. traceai-livekit emits OpenInference spans through the OpenTelemetry batch processor. The package is part of the traceAI family alongside dedicated traceAI-pipecat and OpenAI Realtime integrations.
# traceai-livekit, run inside the worker entrypoint
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_livekit import enable_http_attribute_mapping
register(
project_name="livekit-voice-agent",
project_type=ProjectType.OBSERVE,
set_global_tracer_provider=True,
)
enable_http_attribute_mapping()
Register in-process inside the LiveKit worker entrypoint, not at module top level. The LiveKit job-runner forks workers and you can hit pickling issues if the tracer provider is captured before fork.
The two paths land in the same Observe project. The native path gives you call-level audio and transcript. The code path gives you span-level depth (STT, LLM, tool call, TTS as nested children of a voice-turn root span). For Error Feed clustering, common LiveKit clusters surface as WebRTC packet loss correlated with audio quality drops, STT confidence drops on jitter, late barge-in detection after framework bumps, and tool argument schema mismatch on @function_tool calls.
For inline guardrails on the LLM response inside the voice budget, Future AGI Protect runs sub-100ms per arXiv 2510.13351. The model family is Gemma 3n with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance), multi-modal across text, image, and audio. ProtectFlash gives a single-call binary classifier path for the absolute lowest-latency surface.
For multi-modal audio scoring inside an evaluation pipeline:
# ai-evaluation
from fi.testcases import MLLMTestCase, MLLMAudio
from fi.evals import Evaluator, AudioQualityEvaluator
ev = Evaluator()
audio = MLLMAudio(url="path/to/livekit_assistant_leg.wav", local=True)
result = ev.evaluate(
eval_templates=[AudioQualityEvaluator()],
inputs=[MLLMTestCase(input=audio, query="Score TTS clarity")],
)
Six prompt optimizers ship in agent-opt (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard). Once your LiveKit traces accumulate failure patterns in Future AGI, point an optimization run at the dataset, pick an evaluator, pick one of the six optimizers, and review candidate prompts in the dashboard. Apache 2.0 across the SDK family. The Agent Command Center hosts the whole stack with RBAC, BYOC, multi-region, and the cert set on futureagi.com/trust (SOC 2 Type II, HIPAA, GDPR, CCPA, ISO 27001).
Related reading
- How to Optimize Voice Agent Latency: 12 Techniques That Work in 2026
- Voice AI Observability for LiveKit Agents: A 2026 Guide
- Sub-500ms Voice AI: The Complete Latency Budget Guide for 2026
- How to Measure Voice AI Latency: The Complete 2026 Guide
- Audio Caching for Voice AI: A Developer’s Guide to Latency Reduction in 2026
- Voice AI Barge-In and Turn-Taking in 2026
Sources and references
- LiveKit Agents docs: docs.livekit.io/agents
- LiveKit Agents repo: github.com/livekit/agents
- traceAI on GitHub: github.com/future-agi/traceAI
- ai-evaluation on GitHub: github.com/future-agi/ai-evaluation
- agent-opt on GitHub: github.com/future-agi/agent-opt
- Future AGI Protect benchmarks: arXiv 2510.13351
- GEPA optimizer paper: arXiv 2507.19457
- Meta-Prompt optimizer paper: arXiv 2505.09666
- OpenInference spec: github.com/Arize-ai/openinference
- Future AGI trust and compliance: futureagi.com/trust
Frequently asked questions
What is the fastest path to sub-500ms p95 on LiveKit Agents in 2026?
Where do partial LLM tokens stream into TTS inside AgentSession?
Does LiveKit support prompt prefix caching?
How do I prefetch tool calls on a LiveKit Agent?
What is preemptive_generation on AgentSession and how much does it save?
Does traceai-livekit add latency on the critical path?
Should I use LiveKit Inference or per-provider plugins for latency?
How does Future AGI Protect fit inside a LiveKit Agent loop?
Optimize Retell AI voice agent latency to sub-500ms p95 in 2026. 12 techniques with real Retell agent config: STT, response_engine, backchannel, states, async eval.
Optimize Vapi voice agent latency to sub-500ms p95 in 2026. 12 techniques with real Vapi config: streaming STT, partial TTS, prompt caching, regional routing, async eval.
Optimize Pipecat voice agent latency to sub-500ms p95 in 2026. 12 techniques with real pipeline code: streaming STT, partial TTS, prefix caching, regional routing, async eval.