Models

What Is Time to First Word?

The latency from a user request to the first complete user-perceivable word, used in chat and voice systems as the human-readable warmup metric.

What Is Time to First Word?

Time to first word (TTFW) is the latency from when a user submits a request to when the first complete user-perceivable word reaches them. In chat, TTFW equals time-to-first-token plus the time it takes the streaming decoder to emit a complete word — meaningful because tokenizers split words across multiple tokens and users do not perceive sub-word fragments. In voice, TTFW parallels time-to-first-audio but counts the first audible word rather than the first audio byte. It is the human-readable warmup metric, used when tokens or bytes are too low-level to communicate UX.

Why It Matters in Production LLM and Agent Systems

TTFW matters because users perceive words, not tokens. A streaming LLM that emits the first token at 220ms but completes the first word at 380ms feels sub-second; a backend that reports TTFT alone misses the actual perceptual gap. In voice, the difference is sharper — ah or um audio at 600ms is not the same as a real word at 900ms. Optimizing for TTFT alone can produce a system that hits a great latency dashboard but feels laggy to users.

Application engineers feel this when product teams complain about perceived slowness despite green latency dashboards. SREs feel it when a runtime upgrade emits more sub-word fragments per request, raising the time-from-first-token to first-word delta. Voice teams feel it when streaming TTS produces long phonetic warmups before the first lexical word lands. Product managers feel it when user research reports “the agent feels slow” but no metric shows a regression — the metric being measured is wrong.

For 2026 stacks TTFW is becoming the headline UX latency. Reasoning models prepend hidden chain-of-thought tokens that count for TTFT but not for TTFW; tool-using agents emit tool-call JSON that does not count as user-perceivable words. Reporting TTFW pulls those distortions out and gives a clean signal for what the user actually waits for. FutureAGI’s role is to capture both TTFT and TTFW on every span so teams can choose the right view per context.

How FutureAGI Handles Time to First Word

FutureAGI does not generate words — that is the LLM and TTS layer. FutureAGI captures the timing through traceAI: every LLM span carries start, first-token, and first-word events; every voice span via traceAI-livekit carries first-audio-byte and first-audio-word events. Users open the trace dashboard, slice by llm.model_name, prompt-length cohort, and user persona, and see per-stage TTFW.

A real workflow: a writing-assistant team measures TTFT at 180ms p99, but user research flags “feels slow.” They check TTFW and find it is 720ms p99 because the model often opens with reasoning tokens like <thinking> that get filtered before display. They route reasoning-heavy prompts through a non-reasoning model variant via Agent Command Center routing, restore TTFW to 320ms p99, and rerun a TaskCompletion regression eval to confirm no quality loss. The cohort affected by reasoning-token warmup is recoverable without sacrificing the path where reasoning helps.

For voice agents, the team uses LiveKitEngine simulations to measure TTFW across personas — the first audible word matters more than the first audible byte for naturalness. Pair with AudioQualityEvaluator and TTSAccuracy to ensure TTFW optimizations do not regress audio quality. Unlike a generic APM tool that reports HTTP latency, FutureAGI pairs latency with output quality so teams optimize for fast and correct.

How to Measure or Detect It

TTFW lives on the same span TTFT lives on, but with word-boundary parsing applied:

  • TTFW p50, p90, p99 (dashboard signal): the user-perceivable warmup metric; alert on p99 above 1s for chat, 1.2s for voice.
  • TTFW-minus-TTFT delta: time spent emitting partial sub-word fragments; rising deltas suggest tokenizer or runtime regressions.
  • llm.token_count.prompt (OTel attribute): correlate TTFW with prompt length per model.
  • Reasoning-token filter rate: percentage of completions whose first user-visible word lands after hidden chain-of-thought tokens.
  • Eval-fail-rate-by-cohort paired with TTFW: ensures latency wins do not come paired with quality losses.

Minimal Python:

from fi.evals import TaskCompletion

# traceAI auto-instruments TTFT and TTFW on every LLM span
result = TaskCompletion().evaluate(input=user_goal, trajectory=trace_spans)
print(result.score, result.reason)

Common Mistakes

  • Reporting only TTFT. It hides the sub-word gap users actually perceive; ship dashboards with both TTFT and TTFW.
  • Counting reasoning tokens as user-visible words. Hidden chain-of-thought emits early but contributes nothing to TTFW; filter before measuring.
  • Optimizing TTFW and skipping quality evals. A smaller model can land the first word faster and lose 3 points on TaskCompletion; pair every change.
  • Aggregating TTFW across models. Tokenizer differences (Llama-3 versus GPT-4 versus Claude) produce different first-token-to-first-word gaps; group by llm.model_name.
  • Treating TTFW the same in chat and voice. Chat measures lexical word-end; voice measures audible word-end after TTS synthesis; they are not directly comparable.

Frequently Asked Questions

What is time to first word?

Time to first word (TTFW) is the latency from when a user submits a request to when the first complete word becomes perceivable. For chat it is TTFT plus word-boundary detection; for voice it parallels time to first audio.

How is TTFW different from TTFT?

TTFT measures the first token, which may be a sub-word or partial fragment. TTFW measures the first complete word — the user's actual perceptual unit. TTFW is always greater than or equal to TTFT.

How do you measure TTFW?

FutureAGI captures word-boundary timing on streamed LLM spans via traceAI and on voice spans via traceAI-livekit, then pairs the latency view with evaluator scores like TaskCompletion to catch quality regressions on faster paths.