Engineering

How to Optimize LiveKit Voice Agent Latency in 2026: 12 Techniques with Real Code

Cut LiveKit Agents voice latency to sub-500ms p95 in 2026. 12 techniques with real AgentSession code: streaming STT, partial TTS, prefix caching, regional.

March 6, 2026

Updated May 20, 2026

13 min read

voice-ai 2026 livekit latency optimization how-to

Table of Contents

LiveKit Agents ships the orchestration. You ship the tuning. A vanilla AgentSession with default STT, LLM, and TTS plugins lands around 1.2-1.4 seconds p95 turn latency. The same session with 12 specific optimizations lands around 500-650ms p95 on the same hardware and the same providers. This guide walks each of the 12 techniques from the parent voice latency methodology and shows the exact LiveKit Agents knob, plugin, or callback that implements it, with Python code you can paste into a livekit-agents service today.

TL;DR: LiveKit plugin or config to expected win

#	Technique	LiveKit surface	Expected p95 saving
1	Streaming STT first-partial routing	`inference.STT()` or `deepgram.STT(interim_results=True)` plus `preemptive_generation=True`	200-400ms
2	Partial LLM tokens into TTS	Built-in via `AgentSession`; pipe streaming LLM into streaming TTS	200-500ms
3	LLM prompt prefix caching	Anchor `chat_ctx` system prompt, set `cache_control` on Anthropic plugin	200-400ms
4	Edge model routing	Per-`Agent` `llm=` swap inside `on_user_turn_completed`	100-300ms
5	Prefetch tool calls on high-confidence intent	Classify in `on_user_turn_completed`, fire `asyncio.create_task` for `@function_tool`	200-400ms
6	Audio prebuffering	`aec_warmup_duration` on session plus `withPreConnectAudio` on the client	80-200ms perceived
7	Async evaluation	traceAI-livekit spans exported async to Future AGI; eval scoring runs off the critical path	100-300ms
8	Parallel TTS warm-up	Pre-instantiate TTS plugin in `prewarm`; first turn skips the cold-connect tax	50-150ms
9	Smaller models for short turns	Swap `llm=` per turn in `on_user_turn_completed` based on intent length	100-300ms
10	Semantic cache for common intents	Custom check inside `on_user_turn_completed`; short-circuit with `session.say`	400-800ms on hits
11	KV-cache reuse across turns	Stable `chat_ctx` ordering plus provider session caching	100-300ms
12	Regional routing for STT and TTS	`region=` on Deepgram and Cartesia plugins; pin LiveKit Cloud region	30-80ms

Stacked, the techniques drop a 1.2-1.4 second LiveKit turn into the 500-650ms p95 band on most workloads. Short turns hit sub-500ms.

How to read this guide

The parent post covers each technique’s theory and the per-stage latency budget. This post is the LiveKit-specific implementation map. Each section below answers four questions:

What it does in LiveKit terms. Which plugin, which AgentSession parameter, which Agent callback.
Code. A 5-15 line Python snippet that runs against livekit-agents>=1.5.
Where LiveKit does it for you. If the framework handles the technique automatically.
Common mistake. What people break when they wire it the first time.

Pin SDK versions on every install. The examples here are written against livekit-agents==1.5.8, the May 2026 release. Older versions ship older AgentSession signatures.

pip install "livekit-agents[deepgram,openai,cartesia,silero,turn-detector]==1.5.8"
pip install traceai-livekit

1. Streaming STT with first-partial routing

What it does. Switch from batch STT to streaming STT that emits partial transcripts every 100-200ms while the user is still speaking. Feed the latest partial to the LLM before the turn detector commits.

LiveKit surface. Streaming is the default on every modern STT plugin (Deepgram, AssemblyAI, OpenAI Whisper streaming). The win comes from turning on preemptive_generation=True on AgentSession, which fires the LLM call on the STT partial instead of waiting for the final transcript.

# livekit-agents==1.5.8
from livekit.agents import AgentSession, inference
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    stt=inference.STT("deepgram/nova-3", language="multi"),
    llm=inference.LLM("openai/gpt-4o-mini"),
    tts=inference.TTS("cartesia/sonic-3"),
    vad=silero.VAD.load(),
    turn_detection=MultilingualModel(),
    preemptive_generation=True,
)

Where LiveKit does it for you. STT plugins already stream by default. You do not write the partial loop.

Common mistake. Leaving preemptive_generation at its default. Without it the LLM waits for the turn-detector commit, which adds 150-350ms to every turn.

2. Partial LLM tokens piped into TTS

What it does. Stream LLM tokens. When the first sentence boundary lands, fire that sentence to TTS. The user hears the first word before the LLM has finished writing the response.

LiveKit surface. Fully automatic inside AgentSession. When the LLM plugin streams (which all the major plugins do) and the TTS plugin streams (Cartesia Sonic, ElevenLabs Turbo v2.5, OpenAI TTS streaming), the llm_node to tts_node pipe flushes at sentence boundaries with no extra config.

# livekit-agents==1.5.8
# Streaming LLM + streaming TTS is the default
session = AgentSession(
    stt=inference.STT("deepgram/nova-3"),
    llm=inference.LLM("openai/gpt-4o-mini"),  # streams by default
    tts=inference.TTS("cartesia/sonic-3"),    # streams by default
    vad=silero.VAD.load(),
    turn_detection=MultilingualModel(),
)

Where LiveKit does it for you. Everything. The tts_node consumes the LLM token stream as an AsyncIterable[str] and flushes audio at sentence breaks.

Common mistake. Wrapping the LLM plugin in a custom llm_node that buffers the full response before returning. That kills the stream. If you need to override llm_node, return an AsyncIterable[llm.ChatChunk] and yield chunks as they arrive.

3. LLM prompt prefix caching

What it does. Anchor the system prompt at the top of the chat context. Keep it byte-identical across turns. Anthropic, OpenAI, and Google all cache prompt prefixes server-side, which slashes TTFT on cache hits.

LiveKit surface. chat_ctx on the Agent class. The Anthropic plugin accepts a cache_control block when you author the system prompt. OpenAI auto-caches whenever the prefix is byte-stable. Build the prompt once, attach it to the Agent, and resist the urge to interpolate per-turn timestamps near the top.

# livekit-agents==1.5.8
from livekit.agents import Agent, llm
from livekit.plugins import anthropic

SYSTEM_PROMPT = """You are a customer support agent for Acme Corp.
You answer questions about orders, returns, and shipping.
You speak in short sentences."""  # 1500+ tokens in production

class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions=SYSTEM_PROMPT,
            llm=anthropic.LLM(
                model="claude-haiku-4-5",
                cache_control={"type": "ephemeral"},  # cache the prefix
            ),
        )

Where LiveKit does it for you. chat_ctx preserves order across turns. The plugin passes cache_control through.

Common mistake. Putting the current timestamp or the user ID at the top of the system prompt for “personalization”. That changes the prefix byte-string every turn and defeats caching. Put dynamic content near the end of chat_ctx.

4. Edge model routing

What it does. Route short conversational turns to a smaller, faster model. Route complex tool turns to the larger model. Both LLMs hit the same chat_ctx so context is shared.

LiveKit surface. Override on_user_turn_completed on the Agent. Inspect the new message. Swap the llm= plugin on the session before the LLM node runs.

# livekit-agents==1.5.8
from livekit.agents import Agent, llm
from livekit.plugins import openai, anthropic

LIGHT_LLM = openai.LLM(model="gpt-4o-mini")
HEAVY_LLM = anthropic.LLM(model="claude-sonnet-4-5")

class Assistant(Agent):
    async def on_user_turn_completed(
        self, turn_ctx: llm.ChatContext, new_message: llm.ChatMessage
    ) -> None:
        text = new_message.text_content or ""
        if len(text) < 60 and "?" not in text:
            self.llm = LIGHT_LLM
        else:
            self.llm = HEAVY_LLM

Where LiveKit does it for you. Nothing. Routing is your call.

Common mistake. Routing every turn to the largest model “for quality”. A Sonnet-class model is 200-400ms slower to TTFT than a Haiku-class model on the same prompt. Most conversational turns (“yes”, “thanks”, “can you repeat that”) do not need the larger model.

5. Prefetch tool calls on high-confidence intent

What it does. When the STT partial commits, classify intent. If confidence is above 0.85 and the intent maps to a known @function_tool, fire the tool call in parallel with the LLM call. If the LLM picks a different tool, cancel the prefetched future.

LiveKit surface. on_user_turn_completed runs once per turn after the turn detector commits. Fire the tool as an asyncio task. Stash the future on the Agent instance. When the LLM later requests the tool inside the same turn, await the prefetched future instead of running it cold.

# livekit-agents==1.5.8
import asyncio
from livekit.agents import Agent, function_tool, llm, RunContext

class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="...")
        self._prefetch: asyncio.Task | None = None

    async def on_user_turn_completed(
        self, turn_ctx: llm.ChatContext, new_message: llm.ChatMessage
    ) -> None:
        text = (new_message.text_content or "").lower()
        if "order" in text and "status" in text:
            self._prefetch = asyncio.create_task(self._lookup_order(text))

    @function_tool
    async def get_order_status(self, ctx: RunContext, order_id: str) -> str:
        if self._prefetch is not None:
            try:
                return await self._prefetch
            finally:
                self._prefetch = None
        return await self._lookup_order(order_id)

Where LiveKit does it for you. Nothing. The decorator wires the function tool to the LLM. You wire the prefetch.

Common mistake. Forgetting to cancel the prefetched future when the LLM picks a different tool. Leaked tasks pile up at scale. Wrap the prefetch in a try/finally that cancels on exit if the future was not awaited.

6. Audio prebuffering

What it does. Open the audio path before the user speaks. Buffer the first 80-200ms of audio so the STT plugin starts processing from frame zero instead of paying a connection-setup tax on the first frame.

LiveKit surface. Two knobs. On the agent side, aec_warmup_duration on AgentSession pre-runs the acoustic echo canceller so it is warm when the first user audio arrives. On the client side, the LiveKit SDKs support withPreConnectAudio() (Swift, Android, Flutter) and preConnectBuffer: true (web JS), which queues the user’s first audio frames into LiveKit before the SFU handshake finishes.

# livekit-agents==1.5.8
from livekit.agents import AgentSession, inference
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    stt=inference.STT("deepgram/nova-3"),
    llm=inference.LLM("openai/gpt-4o-mini"),
    tts=inference.TTS("cartesia/sonic-3"),
    vad=silero.VAD.load(),
    turn_detection=MultilingualModel(),
    preemptive_generation=True,
    aec_warmup_duration=3.0,
)

Where LiveKit does it for you. The audio path is managed by the SFU. You configure the warmup window.

Common mistake. Skipping the client-side pre-connect option. Without it, the user’s first 100-200ms of audio sits in the client buffer while the SFU handshake completes. That adds straight to first-turn latency.

7. Async evaluation

What it does. Run scoring after the turn commits. Never block the critical path on an LLM judge. Use a classifier model for inline rubrics if you absolutely need inline scoring.

LiveKit surface. traceai-livekit emits OpenInference spans through the OpenTelemetry batch processor. Span export is async. Scoring inside Future AGI’s Observe project runs off the critical path on the captured trace. Inline scoring only fires on the routes that need it.

# traceai-livekit, livekit-agents==1.5.8
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_livekit import enable_http_attribute_mapping

register(
    project_name="livekit-voice-agent",
    project_type=ProjectType.OBSERVE,
    set_global_tracer_provider=True,
)
enable_http_attribute_mapping()

Register inside the worker entrypoint, not in the module top level that the LiveKit job-runner pickles across processes. Otherwise you can hit pickling errors when LiveKit forks the agent worker.

Where Future AGI does it for you. 70+ pre-built eval templates (including audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion) run async on captured traces by default. You opt into inline scoring per route.

Common mistake. Running an LLM judge synchronously inside on_user_turn_completed “to gate the response”. That adds 200-500ms to every turn. The classifier-based ai-evaluation models are tuned for the LLM-as-judge cost and latency tradeoff so the async path stays affordable at scale.

8. Parallel TTS warm-up

What it does. Keep a warm TTS connection open per session. When the LLM emits the first sentence, the connection is already authenticated and the voice is already preloaded, so the first audio frame arrives 50-150ms faster than a cold start.

LiveKit surface. Pre-instantiate the TTS plugin inside the worker prewarm function. The same TTS object is reused across the session. Cartesia Sonic, ElevenLabs Turbo, and OpenAI TTS streaming all keep the WebSocket warm between turns.

# livekit-agents==1.5.8
from livekit.agents import JobProcess, WorkerOptions, cli
from livekit.plugins import silero, cartesia

def prewarm(proc: JobProcess) -> None:
    proc.userdata["vad"] = silero.VAD.load()
    proc.userdata["tts"] = cartesia.TTS(
        model="sonic-3",
        voice="9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
    )

async def entrypoint(ctx):
    session = AgentSession(
        stt=inference.STT("deepgram/nova-3"),
        llm=inference.LLM("openai/gpt-4o-mini"),
        tts=ctx.proc.userdata["tts"],
        vad=ctx.proc.userdata["vad"],
        turn_detection=MultilingualModel(),
    )

cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))

Where LiveKit does it for you. prewarm is the hook. The plugin’s WebSocket reuse is the saving.

Common mistake. Constructing the TTS plugin inside entrypoint instead of prewarm. That defers the cold-connect tax to the first turn, where users notice it.

9. Smaller models for short turns

What it does. Acknowledgments (“yes”, “okay”, “go on”) do not need a frontier model. Route them to a smaller LLM. Route tool turns and reasoning turns to the larger one.

LiveKit surface. Same hook as technique 4: on_user_turn_completed. Inspect the turn. Swap self.llm. The next llm_node call uses the new plugin.

# livekit-agents==1.5.8
from livekit.agents import Agent, llm
from livekit.plugins import openai

NANO = openai.LLM(model="gpt-4o-mini")
FULL = openai.LLM(model="gpt-4.1")

class Assistant(Agent):
    async def on_user_turn_completed(
        self, turn_ctx: llm.ChatContext, new_message: llm.ChatMessage
    ) -> None:
        text = (new_message.text_content or "").strip()
        if text.split() and len(text.split()) <= 4:
            self.llm = NANO
        else:
            self.llm = FULL

Where LiveKit does it for you. Nothing. Routing is yours.

Common mistake. Routing based on token count rather than intent class. A four-word turn (“cancel my entire order”) may still need the heavier model. Use a tiny classifier or a hard list of acknowledgment phrases instead of raw length when stakes are high.

10. Semantic cache for common intents

What it does. Embed the user’s partial transcript. Search a cache of recently-answered queries by embedding similarity. On a hit above threshold, return the cached audio answer directly via session.say and skip the LLM and TTS pipeline.

LiveKit surface. on_user_turn_completed runs before the LLM. Check the cache there. On a hit, call session.say to play the cached answer and session.interrupt to suppress the LLM response that would otherwise queue.

# livekit-agents==1.5.8
from livekit.agents import Agent, llm

class Assistant(Agent):
    async def on_user_turn_completed(
        self, turn_ctx: llm.ChatContext, new_message: llm.ChatMessage
    ) -> None:
        text = new_message.text_content or ""
        hit = await self._semantic_cache.lookup(
            text,
            tenant_id=self._tenant_id,
            threshold=0.92,
        )
        if hit is not None:
            self.session.interrupt()
            await self.session.say(hit.answer, allow_interruptions=False)

Where Future AGI does it for you. For teams that already route through the Agent Command Center, the gateway covers semantic cache, prompt cache, model fallback, and per-route routing across 15+ providers behind one endpoint.

Common mistake. Caching without a tenant ID. Customer A’s account-balance question must not return Customer B’s account balance. Always key the cache on tenant_id plus the intent embedding.

11. KV-cache reuse across turns

What it does. Multi-turn conversations on Anthropic, OpenAI, and Google all benefit from session or prefix caching. The model skips reprocessing the conversation history that is already cached.

LiveKit surface. chat_ctx is preserved across turns by AgentSession. The plugin passes the same chat context to the LLM on every call. As long as the order is stable and the prefix has not changed, the provider hits the cache.

# livekit-agents==1.5.8
from livekit.agents import Agent
from livekit.plugins import anthropic

class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions=SYSTEM_PROMPT,
            llm=anthropic.LLM(
                model="claude-sonnet-4-5",
                cache_control={"type": "ephemeral"},
            ),
        )
    # AgentSession appends each turn to chat_ctx automatically.
    # Stable order = cache hits on turns 2+.

Where LiveKit does it for you. chat_ctx ordering is stable by default. You break it if you rewrite history mid-conversation.

Common mistake. Mutating chat_ctx in on_user_turn_completed to “tidy up” prior turns. Every mutation invalidates the cache from that point on. If you must rewrite history, do it at session start, not mid-conversation.

12. Regional routing for STT and TTS

What it does. Pin STT and TTS to the closest regional endpoint of the provider. Many providers route based on the gateway region by default, but the explicit region= or model-string suffix can shave 30-80ms.

LiveKit surface. Plugin-level region parameters. Deepgram supports a base_url override for self-hosted or regional clusters. Cartesia and ElevenLabs accept region routing through their API. LiveKit Cloud rooms run in the closest region by default; you can pin per project in the dashboard.

# livekit-agents==1.5.8
from livekit.plugins import deepgram, cartesia

stt = deepgram.STT(
    model="nova-3",
    interim_results=True,
    base_url="https://api.deepgram.com",  # swap for the regional cluster
)

tts = cartesia.TTS(
    model="sonic-3",
    voice="9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
)

Where LiveKit does it for you. LiveKit’s SFU routes media via the closest POP. Provider-side region is on the plugin.

Common mistake. Running EU customers through a us-east STT cluster because the LiveKit Cloud project defaults to us-east. The audio crosses the Atlantic twice. Pin the LiveKit project region to eu for EU traffic and pin the Deepgram and Cartesia regions to their EU clusters.

Bonus: turn-taking latency on LiveKit

LiveKit’s turn detector is one of the strongest in the open-source voice stack. The MultilingualModel and EnglishModel from livekit-plugins-turn-detector use a small open-weights model layered on top of VAD and STT endpoint data to decide when the user has finished speaking. The defaults are tuned for naturalness; the latency-tuning levers are on TurnHandlingOptions.

# livekit-agents==1.5.8
from livekit.agents import AgentSession
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

session = AgentSession(
    stt=inference.STT("deepgram/nova-3", language="multi"),
    llm=inference.LLM("openai/gpt-4o-mini"),
    tts=inference.TTS("cartesia/sonic-3"),
    vad=silero.VAD.load(
        min_silence_duration=0.4,
        prefix_padding_duration=0.2,
    ),
    turn_detection=MultilingualModel(),
    min_endpointing_delay=0.4,
    max_endpointing_delay=2.0,
    preemptive_generation=True,
)

min_endpointing_delay is the floor. Lower it from the default to push response time down at the cost of more false turn boundaries on noisy audio. max_endpointing_delay is the ceiling so the agent never sits silent for too long if the model is uncertain. The Silero VAD min_silence_duration controls how long a silence has to last before the VAD flips to “not speaking”. A 0.4-second silence is a reasonable balance; 0.2 seconds is aggressive and trades naturalness for speed.

For push-to-talk workloads, set turn_detection="manual" and drive turns with session.commit_user_turn and session.clear_user_turn. The user controls the boundaries; latency on turn handover drops to under 50ms because there is no endpointing model to wait for.

Stacking the techniques: a typical LiveKit Agent that hits sub-500ms

The full reference agent. Roughly 50 lines. Combines streaming STT, partial TTS, prefix caching, edge routing, prefetch, audio prebuffering, async eval, parallel TTS warm-up, smaller models for short turns, KV-cache reuse, and regional routing.

# livekit-agents==1.5.8, traceai-livekit, ai-evaluation
import asyncio
from livekit.agents import (
    Agent, AgentSession, JobContext, JobProcess, WorkerOptions, cli,
    inference, llm, function_tool, RunContext,
)
from livekit.plugins import silero, deepgram, cartesia, openai
from livekit.plugins.turn_detector.multilingual import MultilingualModel
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_livekit import enable_http_attribute_mapping

SYSTEM_PROMPT = "You are an Acme support agent. Speak in short sentences."
LIGHT_LLM = openai.LLM(model="gpt-4o-mini")
HEAVY_LLM = openai.LLM(model="gpt-4.1")

class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions=SYSTEM_PROMPT, llm=LIGHT_LLM)
        self._prefetch: asyncio.Task | None = None

    async def on_user_turn_completed(
        self, turn_ctx: llm.ChatContext, new_message: llm.ChatMessage
    ) -> None:
        text = (new_message.text_content or "").lower()
        self.llm = LIGHT_LLM if len(text.split()) <= 6 else HEAVY_LLM
        if "order" in text and "status" in text:
            self._prefetch = asyncio.create_task(self._lookup_order(text))

    @function_tool
    async def get_order_status(self, ctx: RunContext, order_id: str) -> str:
        if self._prefetch is not None:
            try:
                return await self._prefetch
            finally:
                self._prefetch = None
        return await self._lookup_order(order_id)

    async def _lookup_order(self, text: str) -> str:
        # your real lookup
        return "Order shipped on Monday."

def prewarm(proc: JobProcess) -> None:
    proc.userdata["vad"] = silero.VAD.load(min_silence_duration=0.4)
    proc.userdata["tts"] = cartesia.TTS(model="sonic-3")

async def entrypoint(ctx: JobContext) -> None:
    register(
        project_name="acme-livekit-voice",
        project_type=ProjectType.OBSERVE,
        set_global_tracer_provider=True,
    )
    enable_http_attribute_mapping()

    session = AgentSession(
        stt=deepgram.STT(model="nova-3", interim_results=True),
        llm=LIGHT_LLM,
        tts=ctx.proc.userdata["tts"],
        vad=ctx.proc.userdata["vad"],
        turn_detection=MultilingualModel(),
        min_endpointing_delay=0.4,
        preemptive_generation=True,
        aec_warmup_duration=3.0,
    )
    await session.start(room=ctx.room, agent=Assistant())

cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))

Expected output on a clean run, measured against captured Future AGI spans:

stt_first_partial_ms = 140
llm_ttft_ms          = 220   (cache hit)
tts_first_audio_ms   = 90
turn_total_ms        = 520   (p95 across a 1000-turn capture)

Vanilla AgentSession on the same prompt and the same providers clocks 1.2-1.4 seconds p95. The 12-technique stack drops it to 500-650ms p95 on most workloads. Short conversational turns (“yes”, “thanks”) sit comfortably under 400ms.

Future AGI for LiveKit monitoring

LiveKit Telemetry covers the media layer (WebRTC jitter, packet loss, codec stats). It does not score every call, auto-cluster failures, or run inline guardrails on the LLM response. That gap is where Future AGI sits. Two paths compose cleanly.

Native voice obs path. No SDK. In the Future AGI dashboard, create an Agent Definition. Select LiveKit as the provider, paste your LiveKit API key, paste the Assistant ID, enable observability. Every call lands in the Observe project with separate assistant and customer audio downloads, an auto transcript, and the 70+ pre-built eval template engine. The provider-API-key surface natively covers Vapi, Retell, and LiveKit.

Code-driven path. traceai-livekit emits OpenInference spans through the OpenTelemetry batch processor. The package is part of the traceAI family alongside dedicated traceAI-pipecat and OpenAI Realtime integrations.

# traceai-livekit, run inside the worker entrypoint
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_livekit import enable_http_attribute_mapping

register(
    project_name="livekit-voice-agent",
    project_type=ProjectType.OBSERVE,
    set_global_tracer_provider=True,
)
enable_http_attribute_mapping()

Register in-process inside the LiveKit worker entrypoint, not at module top level. The LiveKit job-runner forks workers and you can hit pickling issues if the tracer provider is captured before fork.

The two paths land in the same Observe project. The native path gives you call-level audio and transcript. The code path gives you span-level depth (STT, LLM, tool call, TTS as nested children of a voice-turn root span). For Error Feed clustering, common LiveKit clusters surface as WebRTC packet loss correlated with audio quality drops, STT confidence drops on jitter, late barge-in detection after framework bumps, and tool argument schema mismatch on @function_tool calls.

For inline guardrails on the LLM response inside the voice budget, Future AGI Protect runs sub-100ms per arXiv 2510.13351. The model family is Gemma 3n with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance), multi-modal across text, image, and audio. ProtectFlash gives a single-call binary classifier path for the absolute lowest-latency surface.

For multi-modal audio scoring inside an evaluation pipeline:

# ai-evaluation
from fi.testcases import MLLMTestCase, MLLMAudio
from fi.evals import Evaluator, AudioQualityEvaluator

ev = Evaluator()
audio = MLLMAudio(url="path/to/livekit_assistant_leg.wav", local=True)
result = ev.evaluate(
    eval_templates=[AudioQualityEvaluator()],
    inputs=[MLLMTestCase(input=audio, query="Score TTS clarity")],
)

Six prompt optimizers ship in agent-opt (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard). Once your LiveKit traces accumulate failure patterns in Future AGI, point an optimization run at the dataset, pick an evaluator, pick one of the six optimizers, and review candidate prompts in the dashboard. Apache 2.0 across the SDK family. The Agent Command Center hosts the whole stack with RBAC, BYOC, multi-region, and the cert set on futureagi.com/trust (SOC 2 Type II, HIPAA, GDPR, CCPA, ISO 27001).

Sources and references

LiveKit Agents docs: docs.livekit.io/agents
LiveKit Agents repo: github.com/livekit/agents
traceAI on GitHub: github.com/future-agi/traceAI
ai-evaluation on GitHub: github.com/future-agi/ai-evaluation
agent-opt on GitHub: github.com/future-agi/agent-opt
Future AGI Protect benchmarks: arXiv 2510.13351
GEPA optimizer paper: arXiv 2507.19457
Meta-Prompt optimizer paper: arXiv 2505.09666
OpenInference spec: github.com/Arize-ai/openinference
Future AGI trust and compliance: futureagi.com/trust

Frequently asked questions

What is the fastest path to sub-500ms p95 on LiveKit Agents in 2026?

Wire AgentSession with streaming STT (Deepgram Nova-3 via the livekit-plugins-deepgram package), stream LLM tokens straight into the TTS node, and turn on preemptive_generation=True on the session constructor. Add the MultilingualModel turn detector with min_endpointing_delay tuned to 0.3-0.5 seconds, pre-warm Silero VAD inside prewarm, and pin STT and TTS plugins to the closest provider region. That combination drops a vanilla LiveKit Agents loop from 1.2-1.4 seconds p95 to 500-650ms p95 on most workloads.

Where do partial LLM tokens stream into TTS inside AgentSession?

AgentSession does this for you. When you wire a streaming LLM plugin (OpenAI, Anthropic, Google) and a streaming TTS plugin (Cartesia, ElevenLabs, OpenAI) into AgentSession, the session piles tokens through tts_node as they arrive. Sentence boundaries are detected automatically. You do not write the buffering loop. What you do control: preemptive_generation=True on the session, which starts LLM inference on STT partials before the user finishes the turn.

Does LiveKit support prompt prefix caching?

Yes, via the LLM plugin. The Anthropic plugin accepts cache_control blocks in the chat context, and the OpenAI plugin uses auto-prefix caching whenever the system prompt prefix is byte-stable. To get the win, anchor your system prompt at the top of chat_ctx, keep it byte-identical across turns, and put any dynamic per-turn content (timestamps, user IDs) near the end. On a 1500-token system prompt with caching enabled, TTFT drops from 500-800ms to 200-300ms.

How do I prefetch tool calls on a LiveKit Agent?

Override on_user_turn_completed on your Agent subclass. The callback fires when the turn detector commits the partial transcript, before the LLM node runs. Inside the callback, classify intent against the new_message content. If confidence is above 0.85 and the intent maps to a known function tool, fire the tool call as an asyncio.create_task and stash the future on the Agent instance. When the LLM later requests the tool, await the prefetched future instead of running it cold. Cancel it if the LLM picks a different tool.

What is preemptive_generation on AgentSession and how much does it save?

preemptive_generation=True tells AgentSession to start LLM inference on the STT partial transcript while the user is still speaking, instead of waiting for the turn detector to commit the final transcript. If the partial is stable, the LLM is already mid-response when the turn closes. The user-perceived TTFT drops by 150-350ms. The cost is occasional discarded LLM calls when the partial changes meaningfully, which usually sits below 5 percent of turns. Net positive on every voice agent we have measured.

Does traceai-livekit add latency on the critical path?

No. The traceai-livekit package emits OpenInference spans asynchronously through the OpenTelemetry exporter. Span generation runs in a background batch processor with no blocking writes on the voice turn. Inside Future AGI, those spans drop into the Observe project alongside the native LiveKit dashboard integration. Eval scoring runs async by default. The only place you take latency on the critical path is if you wire inline Future AGI Protect, which is sub-100ms per arXiv 2510.13351 and fits inside the orchestration slice of a sub-500ms turn.

Should I use LiveKit Inference or per-provider plugins for latency?

Per-provider plugins win on latency tuning. LiveKit Inference is a managed routing layer that simplifies billing and provider swap, but adds a small hop versus calling Deepgram, OpenAI, and Cartesia directly from the livekit-plugins-* packages. For a sub-500ms target with regional pinning, run direct plugins and pin each to the closest provider region. Use LiveKit Inference when ease of switching providers matters more than the 20-40ms hop saving.

How does Future AGI Protect fit inside a LiveKit Agent loop?

Protect runs sub-100ms inline as a callable inside the agent loop, between the LLM response and the tts_node. The model family is Gemma 3n with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance), multi-modal across text, image, and audio per arXiv 2510.13351. ProtectFlash gives a single-call binary verdict path for the absolute lowest-latency surface. Either fits inside the orchestration slice of a sub-500ms voice turn.

View all

Engineering

How to Optimize Vapi Voice Agent Latency in 2026: 12 Techniques + Code

Optimize Vapi voice agent latency to sub-500ms p95 in 2026. 12 techniques with real Vapi config: streaming STT, partial TTS, prompt caching, regional.

Nikhil Pareek · Apr 29, 2026

14 min

Engineering

How to Optimize Pipecat Voice Agent Latency in 2026: 12 Techniques + Code

Cut Pipecat voice agent latency to sub-500ms p95 in 2026. 12 techniques with real pipeline code: streaming STT, partial TTS, prefix caching, routing.

Vrinda Damani · Mar 30, 2026

13 min

Engineering

How to Optimize Retell Voice Agent Latency in 2026: 12 Techniques + Code

Optimize Retell AI voice agent latency to sub-500ms p95 in 2026. 12 techniques with real Retell config: STT, response_engine, backchannel, async eval.

Nikhil Pareek · Mar 1, 2026

15 min

TL;DR: LiveKit plugin or config to expected win

How to read this guide

1. Streaming STT with first-partial routing

2. Partial LLM tokens piped into TTS

3. LLM prompt prefix caching

4. Edge model routing

5. Prefetch tool calls on high-confidence intent

6. Audio prebuffering

7. Async evaluation

8. Parallel TTS warm-up

9. Smaller models for short turns

10. Semantic cache for common intents

11. KV-cache reuse across turns

12. Regional routing for STT and TTS

Bonus: turn-taking latency on LiveKit

Stacking the techniques: a typical LiveKit Agent that hits sub-500ms

Future AGI for LiveKit monitoring

Related reading

Sources and references

Frequently asked questions