Voice AI

What Is Time to First Audio?

The latency from a user turn to the first audible byte of the voice agent's response, spanning ASR, LLM, TTS, and network stages.

Time to first audio (TTFA) is the latency from the moment a voice agent receives a user turn to the first audible byte of the agent’s spoken response. It is an end-to-end metric spanning four stages: ASR end-of-utterance detection, LLM inference, TTS synthesis, and network delivery to the caller. TTFA determines whether a conversation feels real-time — under ~700ms p99 it sounds human, above ~1.5s it sounds broken. In a FutureAGI trace, TTFA is the wall-clock delta between the user-turn-end span and the first TTS-output frame on a traceAI-livekit span.
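As a sketch, that delta is just a timestamp subtraction between two trace events. The timestamps below are illustrative, not the actual traceAI-livekit span schema:

```python
from datetime import datetime, timezone

def ttfa_ms(user_turn_end: datetime, first_tts_frame: datetime) -> float:
    # Wall-clock delta between the user-turn-end event and the first
    # TTS-output frame, in milliseconds.
    return (first_tts_frame - user_turn_end).total_seconds() * 1000.0

# Illustrative timestamps: the agent's first audio frame lands 720ms
# after the user stops speaking.
turn_end = datetime(2026, 1, 9, 12, 0, 0, 0, tzinfo=timezone.utc)
first_frame = datetime(2026, 1, 9, 12, 0, 0, 720_000, tzinfo=timezone.utc)
print(ttfa_ms(turn_end, first_frame))  # 720.0
```

The same subtraction works on any pair of monotonic or wall-clock timestamps your tracing layer records for those two events.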

Why It Matters in Production LLM and Agent Systems

Voice users do not tolerate latency the way chat users do. A 2-second pause between “hello, I need help” and the agent’s first word is a near-instant trust collapse — the user starts speaking again, the agent barges in on its own response, the conversation derails. Worse, TTFA is the user’s primary signal for “is this thing alive” — a long wait reads as a hung connection, not as careful reasoning, regardless of how good the eventual answer is.

Voice teams feel TTFA when call-completion rates drop after a model swap. SREs feel it when a TTS provider regresses streaming-first-byte latency overnight and the on-call paging tree has no per-stage breakdown. Application engineers feel it when the LLM is fast in chat and slow in voice — because the agent is calling four tools serially before any token reaches TTS. Compliance teams feel it when audit reviews flag conversations where the agent took 4 seconds to respond to “are you a human.”

For 2026 voice-agent stacks, TTFA is the latency budget every architectural choice must respect. Agent multi-step planning is wonderful for chat; in voice, every extra step adds ~250ms to TTFA. Streaming TTS with token-level handoff cuts TTFA by 400ms but couples LLM streaming order to TTS prosody. Choosing model size, runtime, and tool-call topology is no longer abstract; each decision moves p99 TTFA by 100–500ms. FutureAGI’s job is to expose the per-stage breakdown so teams optimize the actual bottleneck.

How FutureAGI Handles Time to First Audio

FutureAGI does not implement TTFA — that is a measurement on the traces emitted by your voice runtime (LiveKit, Pipecat, Vapi, custom). FutureAGI’s role is to surface TTFA per stage, correlate it with quality, and run regression tests against the latency budget. The traceAI-livekit integration emits OpenTelemetry spans for ASR, LLM, TTS, and network layers; TTFA is the wall-clock delta between user-turn-end and first-TTS-frame. The simulate-sdk’s LiveKitEngine runs scripted personas to capture both audio and per-stage latency.

A real workflow: a customer-service voice team sees p99 TTFA spike from 820ms to 1.4s after a Friday deploy. They open the FutureAGI dashboard, slice TTFA by stage, and see ASR and TTS held flat — the LLM stage went from 380ms to 920ms. A model swap from gpt-4o to gpt-4o-mini was supposed to be faster; in fact it triggered a third tool call that the larger model had been answering directly. They roll back via Agent Command Center model fallback, and TTFA returns to budget within minutes. Along with TTSAccuracy and AudioQualityEvaluator, the team also uses eval-fail-rate-by-cohort to confirm no quality regression accompanied the rollback.

Unlike a generic APM tool that reports HTTP latency, FutureAGI ties latency to evaluator scores so teams optimize for fast and correct, not just fast.

How to Measure or Detect It

TTFA is best measured per stage, then aggregated; the per-stage view is what catches regressions:

  • TTFA p50, p90, p99 (dashboard signal): the canonical UX metric; alert on p99 above 1.2s for conversational voice.
  • ASR end-of-turn latency: time from speech-end to ASR final transcript; usually 100–400ms.
  • LLM time-to-first-token (TTFT): time from prompt submission to first token; depends on model size and runtime.
  • TTS time-to-first-byte: time from first LLM token to first audio frame; streaming TTS is essential here.
  • Network transit p99: last-mile delivery latency; varies by codec, region, and carrier.
  • AudioQualityEvaluator + TTSAccuracy: pair every TTFA optimization with quality scores so you do not trade correctness for speed.
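The per-stage percentile view above can be sketched as a small aggregation over raw latency samples (nearest-rank percentiles; the stage name and sample data are illustrative):

```python
import math

def percentile(samples: list, q: float) -> float:
    # Nearest-rank percentile: smallest sample covering fraction q
    # of the sorted data.
    xs = sorted(samples)
    k = max(0, min(len(xs) - 1, math.ceil(q * len(xs)) - 1))
    return xs[k]

def stage_percentiles(latencies_ms: dict) -> dict:
    # Aggregate raw per-stage latency samples into the p50/p90/p99 view.
    return {
        stage: {f"p{int(q * 100)}": percentile(samples, q) for q in (0.50, 0.90, 0.99)}
        for stage, samples in latencies_ms.items()
    }

# Illustrative samples: 100 LLM time-to-first-token measurements in ms.
llm_ttft = [float(x) for x in range(300, 400)]
print(stage_percentiles({"llm_ttft": llm_ttft}))
# {'llm_ttft': {'p50': 349.0, 'p90': 389.0, 'p99': 398.0}}
```

Aggregating per stage first, then summing, is what lets you see that an end-to-end regression came from exactly one stage.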

Minimal Python (evaluator classes from FutureAGI's fi.evals; tts_audio_bytes and expected_text are placeholders for your synthesized audio and the text it should speak):

from fi.evals import AudioQualityEvaluator, TTSAccuracy

audio_q = AudioQualityEvaluator()
tts_acc = TTSAccuracy()

# tts_audio_bytes: raw audio produced by your TTS stage
# expected_text: the source text the audio was synthesized from
quality = audio_q.evaluate(audio=tts_audio_bytes)
accuracy = tts_acc.evaluate(prediction=tts_audio_bytes, reference=expected_text)
print(quality.score, accuracy.score)

Common Mistakes

  • Optimizing TTFA without re-running quality evals. Switching to a faster model can save 300ms and cost 4 points of TaskCompletion; never optimize one without the other.
  • Reporting only mean TTFA. The median can be 600ms while p99 is 2.5s — voice users feel the tail, not the middle.
  • Treating TTFA as one number. Per-stage breakdown is what makes TTFA actionable; a single end-to-end number hides which stage regressed.
  • Ignoring agent multi-step inflation. Each additional tool call before TTS adds round-trip latency; cap pre-TTS steps for voice.
  • Skipping streaming TTS. Non-streaming TTS waits for full LLM completion before synthesizing; streaming cuts TTFA by 400–800ms on long responses.
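The streaming gain in the last point is simple arithmetic: non-streaming TTS waits for the full LLM completion before synthesizing, while streaming TTS starts at the first token. A minimal sketch with illustrative latencies:

```python
def ttfa_estimate(llm_ttft_ms: float, llm_total_ms: float,
                  tts_first_byte_ms: float, streaming: bool) -> float:
    # Streaming TTS begins synthesis at the first LLM token;
    # non-streaming waits for the entire completion.
    llm_wait = llm_ttft_ms if streaming else llm_total_ms
    return llm_wait + tts_first_byte_ms

# Illustrative numbers: 300ms LLM TTFT, 1100ms full completion,
# 200ms TTS first byte.
print(ttfa_estimate(300, 1100, 200, streaming=True))   # 500
print(ttfa_estimate(300, 1100, 200, streaming=False))  # 1300
```

The gap scales with response length: the longer the LLM completion, the more streaming TTS saves.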

Frequently Asked Questions

What is time to first audio?

Time to first audio (TTFA) is the latency from when a voice agent receives a user turn to the first audible byte of the agent's reply. It spans ASR, LLM, TTS, and network delivery.

How is TTFA different from time to first token?

Time to first token (TTFT) measures only the LLM stage. TTFA covers the full voice-agent stack — ASR end-of-turn detection, LLM TTFT, TTS first-audio latency, and network transit — so TTFA is always larger.

How do you measure TTFA?

FutureAGI surfaces TTFA on traceAI-livekit voice spans, runs LiveKitEngine simulations across personas, and pairs latency tracking with TTSAccuracy and AudioQualityEvaluator to catch quality regressions on faster paths.