What Is Time to First Word (TTFW)?
The voice-agent latency from finalized user turn to the first intelligible word of the assistant's spoken reply.
Time to First Word (TTFW) is the latency from the finalized user turn to the first intelligible word of the agent’s spoken reply. It is a voice observability metric inside production traces, not an LLM quality score. TTFW spans turn detection, ASR finalization, LLM first-token latency, TTS startup, audio transport, and playback. In FutureAGI traces from traceAI:livekit, engineers use it to separate a slow model from a slow speech or network path.
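As a rough sketch, TTFW reduces to the difference between two turn timestamps. All names below are illustrative, not the traceAI:livekit API:

```python
from dataclasses import dataclass

@dataclass
class VoiceTurn:
    """Key timestamps (epoch seconds) for one user -> agent voice turn."""
    user_turn_finalized: float   # endpointing/ASR marks the user turn done
    first_audio_byte: float      # first TTS audio byte reaches playout
    first_word_boundary: float   # first intelligible word is heard

def ttfw_ms(turn: VoiceTurn) -> float:
    """Time to First Word: finalized user turn -> first intelligible word."""
    return (turn.first_word_boundary - turn.user_turn_finalized) * 1000.0

def ttfb_ms(turn: VoiceTurn) -> float:
    """Time to first audio byte, for comparison against TTFW."""
    return (turn.first_audio_byte - turn.user_turn_finalized) * 1000.0

turn = VoiceTurn(user_turn_finalized=100.00,
                 first_audio_byte=100.65,
                 first_word_boundary=100.92)
# The TTFW/TTFB gap here is the audio that played before a word was heard.
print(round(ttfw_ms(turn)), round(ttfb_ms(turn)))
```

The two functions differ only in their end timestamp; the start anchor, the finalized user turn, is the same, which is what makes the gap between them attributable to the speech path alone.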
Why It Matters in Production LLM and Agent Systems
TTFW is the pause that makes a voice agent feel alive or broken. If the user finishes speaking and hears silence for 1.8 seconds, they start repeating themselves, interrupting the bot, or abandoning the call. That creates false barge-in events, duplicate intents, and higher escalation rates even when the eventual answer is correct.
The hard part is that silence can come from several places. A noisy microphone can delay voice activity detection. ASR can wait too long to finalize the turn. The LLM can sit in provider queueing or prefill a long prompt. TTS can wait for enough text before it starts synthesis. The WebRTC leg can add jitter before audio reaches the listener. A single end-to-end latency number hides all of that.
Developers feel this as non-reproducible “voice is slow” reports. SREs see p95 and p99 latency spikes on rooms, regions, or provider routes. Product teams see lower task completion and more “hello?” transcripts. Compliance teams care when delayed agents fail required disclosures or emergency handoff timing. In 2026-era agent pipelines, the voice turn often includes retrieval, tool calls, guardrails, and a second model pass before speech. Each step can hold the first word hostage unless the trace shows where the wait occurred.
How FutureAGI Handles Time to First Word
FutureAGI treats TTFW as a derived turn-level metric in the traceAI:livekit integration. A LiveKit room span owns the call; child spans capture turn detection, STT/ASR, LLM, TTS, and audio playback. The TTFW calculation uses two timestamps: the finalized user turn event and the first agent word boundary observed in transcript or TTS metadata. The trace metric is gen_ai.voice.latency.ttfw_ms, while gen_ai.voice.latency.ttfb_ms records first audio byte and gen_ai.server.time_to_first_token records the LLM’s first text token.
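Assuming a turn span exposes those timestamps as named events (the event names below are hypothetical; only the metric keys come from the text), the derivation looks like:

```python
def derive_turn_latency(turn_span: dict) -> dict:
    """Derive TTFW and its sibling latency metrics from one turn span."""
    events = {e["name"]: e["ts_ms"] for e in turn_span["events"]}
    t0 = events["user.turn.finalized"]          # finalized user turn
    return {
        "gen_ai.voice.latency.ttfw_ms": events["agent.first_word_boundary"] - t0,
        "gen_ai.voice.latency.ttfb_ms": events["agent.first_audio_byte"] - t0,
        "gen_ai.server.time_to_first_token": events["llm.first_token"] - t0,
    }

span = {"events": [
    {"name": "user.turn.finalized",       "ts_ms": 0},
    {"name": "llm.first_token",           "ts_ms": 380},
    {"name": "agent.first_audio_byte",    "ts_ms": 610},
    {"name": "agent.first_word_boundary", "ts_ms": 870},
]}
print(derive_turn_latency(span))
```

Because all three metrics share the same start anchor, subtracting them pairwise attributes the wait: 0–380ms to the LLM, 380–610ms to TTS startup and transport, 610–870ms to audio that preceded the first word.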
A real debugging workflow: p99 gen_ai.voice.latency.ttfw_ms jumps from 900ms to 2300ms for support calls in us-east. In the FutureAGI trace, gen_ai.server.queue_time is flat, but the TTS child spans show a delayed first word boundary relative to the first audio byte. The engineer keeps the LLM route unchanged, lowers the TTS chunk threshold, and adds an alert when TTFW exceeds 1500ms for five minutes.
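The alert at the end of that workflow, fire when TTFW exceeds 1500ms for five minutes, can be sketched as a sustained-breach check. This is an illustrative stand-in, not FutureAGI's alerting API:

```python
class TTFWAlert:
    """Fire once TTFW samples have breached the threshold continuously
    for at least `hold_s` seconds."""

    def __init__(self, threshold_ms: float = 1500.0, hold_s: float = 300.0):
        self.threshold_ms = threshold_ms
        self.hold_s = hold_s
        self.breach_start = None  # timestamp when the current breach run began

    def observe(self, ts_s: float, ttfw_ms: float) -> bool:
        if ttfw_ms > self.threshold_ms:
            if self.breach_start is None:
                self.breach_start = ts_s       # breach run starts now
            return ts_s - self.breach_start >= self.hold_s
        self.breach_start = None               # any healthy sample resets the run
        return False

alert = TTFWAlert()
# 1600ms samples arriving every 60s: the alert fires once 300s have elapsed.
fired = [alert.observe(t, 1600.0) for t in range(0, 361, 60)]
print(fired)
```

Requiring a sustained breach rather than a single bad sample keeps one jittery room from paging anyone, which matters for a tail-heavy metric like TTFW.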
FutureAGI’s approach is to keep word-level voice latency connected to quality and task outcome. Unlike generic APM tools such as Datadog, which usually see room duration or websocket timing, FutureAGI connects traceAI:livekit spans to ASRAccuracy, TTSAccuracy, and TaskCompletion results on the same conversation. If a latency fix improves TTFW but hurts ASRAccuracy, the release gate catches it before rollout. For pre-production testing, LiveKitEngine simulations replay the same timing path before traffic moves.
How to Measure or Detect It
Measure TTFW as a turn-level stopwatch, then break it into components:
- TTFW p50/p95/p99: time from finalized user turn to first recognized assistant word, grouped by room, language, model, voice, region, and tenant.
- First audio versus first word: compare gen_ai.voice.latency.ttfb_ms with gen_ai.voice.latency.ttfw_ms; a large gap points to TTS buffering or silence padding.
- LLM contribution: check gen_ai.server.time_to_first_token, gen_ai.server.queue_time, and input tokens to separate model prefill from provider queueing.
- Turn boundary health: track endpointing timeout, false-barge-in rate, and repeated user utterances within three seconds.
- Quality guard: ASRAccuracy returns a speech-to-text accuracy eval; falling accuracy means a faster endpointing policy may be cutting users off.
- User proxy: watch thumbs-down rate, escalation rate, and “are you there?” transcripts on the same cohort as the TTFW spike.
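The first measurement above, percentiles grouped by a dimension, can be sketched with the standard library. The sample shape is an assumption, not a FutureAGI export format:

```python
import statistics
from collections import defaultdict

def ttfw_percentiles(samples: list[dict], by: str = "region") -> dict:
    """p50/p95/p99 of ttfw_ms per group; needs a healthy sample count
    per group for the tail percentiles to mean anything."""
    groups: dict[str, list[float]] = defaultdict(list)
    for s in samples:
        groups[s[by]].append(s["ttfw_ms"])
    out = {}
    for key, vals in groups.items():
        q = statistics.quantiles(vals, n=100)  # 99 cut points
        out[key] = {"p50": q[49], "p95": q[94], "p99": q[98]}
    return out

samples = [{"region": "us-east", "ttfw_ms": 800.0 + i} for i in range(100)]
print(ttfw_percentiles(samples))
```

The same grouping key can be swapped for room, language, model, voice, or tenant; the point is that p95 and p99 are computed per group, because a healthy global p95 can hide one region's tail.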
Common Mistakes
- Starting the timer at microphone open. That mixes user speaking time with agent response delay; start at finalized user turn.
- Treating first audio byte as first word. Early silence, codec priming, or partial phonemes can make audio start before language is intelligible.
- Fixing the LLM route first. Many TTFW spikes come from TTS buffering, endpointing, or WebRTC jitter, not the model.
- Optimizing p50 only. Voice agents fail in the tail; p95 and p99 catch room-level jitter and provider queue spikes.
- Lowering endpointing timeouts globally. Fast cutoff reduces TTFW but truncates hesitant speakers and hurts ASRAccuracy or TaskCompletion.
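The first two mistakes are easy to encode as guards; the event names here are hypothetical:

```python
def turn_response_delay_ms(events: dict) -> float:
    """TTFW anchored correctly at the finalized user turn.
    Anchoring at events["mic.open"] instead would fold user speaking
    time into the agent's response delay."""
    return events["agent.first_word_boundary"] - events["user.turn.finalized"]

def first_audio_is_not_first_word(events: dict, gap_ms: float = 200.0) -> bool:
    """Flag turns where audio starts well before an intelligible word:
    silence padding, codec priming, or partial phonemes."""
    gap = events["agent.first_word_boundary"] - events["agent.first_audio_byte"]
    return gap > gap_ms

events = {"mic.open": 0, "user.turn.finalized": 2400,
          "agent.first_audio_byte": 2900, "agent.first_word_boundary": 3300}
print(turn_response_delay_ms(events), first_audio_is_not_first_word(events))
```

In this example the mic-open anchor would report 3300ms instead of 900ms, and the 400ms audio-to-word gap is exactly the buffering a first-audio metric would miss.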
Frequently Asked Questions
What is Time to First Word (TTFW)?
TTFW is the voice-agent latency from a finalized user turn to the first intelligible word of the agent's spoken reply. It spans turn detection, ASR, LLM first-token delay, TTS startup, audio transport, and playback.
How is TTFW different from time to first audio?
Time to first audio fires when audio bytes or sound begin. TTFW waits until the user can hear a recognizable word, so it catches silence padding, codec priming, and TTS buffering that first-audio metrics can miss.
How do you measure TTFW?
Use FutureAGI's traceAI:livekit spans to derive gen_ai.voice.latency.ttfw_ms, then compare it with gen_ai.voice.latency.ttfb_ms and gen_ai.server.time_to_first_token. Pair it with ASRAccuracy so faster turn timing does not hide transcript damage.