What Is Time to First Token (TTFT)?
The latency from sending an LLM request until the first generated token reaches the client, measured for streaming completions.
Time to First Token (TTFT) is the latency from sending an LLM request to the first generated token arriving at the client. It is the streaming-completion latency users actually feel — the gap between hitting Send and seeing words appear. TTFT is governed by network round-trip, provider queue time, prompt processing (prefill), and the model’s first decoder step. It is distinct from time-to-last-token (end-to-end completion latency) and tokens-per-second (throughput after the first token). In OpenTelemetry GenAI conventions it is gen_ai.server.time_to_first_token and is the primary streaming latency SLO.
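To make the definition concrete, here is a minimal hand-rolled sketch of what TTFT measures on the client, assuming the OpenAI Python SDK's streaming interface; the model, messages, and timer are illustrative only and this is not the traceAI instrumentation (that is shown later).
# Illustrative only: hand-rolled TTFT timer around a streaming call
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o",  # assumed model for illustration
    messages=[{"role": "user", "content": "Summarize this ticket"}],
    stream=True,
)
ttft = None
for chunk in stream:
    if ttft is None:
        # first chunk arrived: the gap from send to here is TTFT
        ttft = time.perf_counter() - start
    # the remaining chunks determine time-to-last-token and tokens-per-second
print(f"TTFT: {ttft:.3f}s")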
Why It Matters in Production LLM and Agent Systems
For any streaming UX — chat, voice agents, code completion, AI search — TTFT is the latency budget the user feels. A 4-second time-to-last-token is acceptable if TTFT is 300ms (the answer streams visibly); the same 4 seconds with a 4-second TTFT and instant flush feels broken. Product teams see this in retention curves and answer success rate metrics; SREs see it in p95 latency dashboards.
Three things drive TTFT regressions in production. First, provider queueing: a frontier model under peak load can push p95 queue time from 200ms to 2s overnight. Second, prompt size growth: a prompt that grew from 4K to 12K tokens (because someone added more context) doubles or triples prefill time. Third, routing decisions: a router that sends traffic to a cheaper but more loaded model spikes TTFT for cost-optimized routes.
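As a back-of-the-envelope check on the prompt-size effect, the sketch below assumes prefill time grows roughly linearly with input tokens; the tokens-per-millisecond rate is a hypothetical placeholder, not a measured provider number.
# Rough prefill estimate: the compute portion of TTFT scales ~linearly with input tokens
def estimated_prefill_ms(input_tokens: int, prefill_tokens_per_ms: float = 10.0) -> float:
    # prefill_tokens_per_ms is a hypothetical rate; measure your own provider
    return input_tokens / prefill_tokens_per_ms

print(estimated_prefill_ms(4_000))   # ~400 ms at the assumed rate
print(estimated_prefill_ms(12_000))  # ~1200 ms for the 4K -> 12K prompt growth above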
For voice agents the stakes are higher. Voice TTFT (often called time-to-first-audio) needs to stay under 800ms end-to-end across speech recognition (ASR), the LLM, and text-to-speech (TTS) to feel conversational. The LLM's TTFT is only one component of that budget; if it hits 1.2s the user perceives the agent as broken regardless of how good the response eventually is.
In agent stacks, TTFT compounds. A planner-executor-critic loop with three sequential LLM calls stacks three TTFTs onto the user's perceived latency. Streaming the planner's tokens to the user (when safe) hides this; not streaming makes it fully user-visible.
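A toy model of that compounding, with an assumed per-call TTFT of 400ms; the numbers are illustrative, not benchmarks.
# Toy model: wait before the user sees anything, per agent design
ttft_s = 0.4  # assumed per-call TTFT

# Planner -> executor -> critic with nothing streamed until the end:
blocking_wait_s = 3 * ttft_s   # 1.2 s of dead air before the first visible token

# Same loop, but the planner's tokens stream to the user as they arrive:
streaming_wait_s = ttft_s      # 0.4 s; later calls overlap with the user reading
print(blocking_wait_s, streaming_wait_s)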
How FutureAGI Handles TTFT
FutureAGI captures TTFT per LLM span via traceAI. The OpenAI, Anthropic, Bedrock, Vertex AI, Mistral, and Cohere integrations all start a streaming-aware timer when the request is sent and stop it on the first chunk callback, writing gen_ai.server.time_to_first_token (in seconds) onto the span. The same span carries gen_ai.server.time_per_output_token (post-TTFT throughput) and gen_ai.server.queue_time (provider-side queueing), so engineers can decompose where the latency went.
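The integrations write these attributes for you; the snippet below only illustrates the shape of the resulting span data using the OpenTelemetry Python API, with made-up values, and is not the traceAI implementation.
# Illustration of the span attributes described above (values are made up)
from opentelemetry import trace

tracer = trace.get_tracer("ttft-example")
with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.server.time_to_first_token", 0.31)    # seconds
    span.set_attribute("gen_ai.server.time_per_output_token", 0.02)  # seconds per token
    span.set_attribute("gen_ai.server.queue_time", 0.12)             # seconds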
The platform exposes this on the trace view as a per-span TTFT histogram; on the monitoring side, every LLM span feeds a p50/p95/p99 TTFT metric sliced by model, route, region, and tenant. Alerts fire on rolling-window TTFT regressions: if p95 jumps by more than 30% over a 15-minute window for a given route, the on-call gets paged.
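A sketch of that alert condition, assuming you already have per-span TTFT samples bucketed into 15-minute windows per route; the 30% threshold mirrors the rule above.
# Sketch: p95 TTFT regression check for one route (assumes numpy is available)
import numpy as np

def ttft_regression(current_window: list[float], previous_window: list[float],
                    threshold: float = 0.30) -> bool:
    # each window holds per-span TTFT samples (seconds) from a 15-minute interval
    p95_now = np.percentile(current_window, 95)
    p95_before = np.percentile(previous_window, 95)
    return p95_now > p95_before * (1 + threshold)

# page the on-call if this returns True for a given route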
For voice agents, the related attribute is gen_ai.voice.latency.ttfb_ms — time-to-first-byte for the voice stack. FutureAGI’s traceAI-livekit integration emits this alongside the LLM gen_ai.server.time_to_first_token so the voice latency budget is visible in one trace.
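A simple budget check for the voice path, assuming the three component latencies come from one trace as described above; the values are illustrative.
# Voice budget check: ASR + LLM TTFT + TTS first byte against the 800 ms target
asr_ms, llm_ttft_ms, tts_first_byte_ms = 180, 350, 220  # illustrative per-trace values
time_to_first_audio_ms = asr_ms + llm_ttft_ms + tts_first_byte_ms
print(time_to_first_audio_ms, time_to_first_audio_ms <= 800)  # 750 True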
The differentiator vs. classic APM is the model-aware decomposition. Datadog will show you HTTP request latency. FutureAGI separates queue, prefill (input-token-count-driven), and first decode, sliced by gen_ai.request.model. The Agent Command Center routing layer can then act on this: a least-latency routing policy reads recent TTFT per model and shifts traffic away from a degraded provider, with model fallback triggering when TTFT exceeds a threshold for N seconds.
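An illustrative version of such a policy, not the Agent Command Center implementation; the model names and threshold are assumptions, and the sustained-for-N-seconds condition is omitted for brevity.
# Illustrative least-latency routing with a TTFT-based fallback
def pick_model(recent_p95_ttft: dict[str, float],
               primary: str = "gpt-4o", fallback: str = "gpt-4o-mini",
               fallback_threshold_s: float = 2.0) -> str:
    # recent_p95_ttft maps model name -> rolling p95 TTFT in seconds
    if recent_p95_ttft.get(primary, 0.0) > fallback_threshold_s:
        return fallback  # primary is degraded; shift traffic away
    return min(recent_p95_ttft, key=recent_p95_ttft.get)  # otherwise take least latency

print(pick_model({"gpt-4o": 2.4, "gpt-4o-mini": 0.5}))  # -> gpt-4o-mini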
How to Measure or Detect It
Wire these signals:
- Per-span TTFT: gen_ai.server.time_to_first_token on every streaming LLM span.
- Queueing decomposition: gen_ai.server.queue_time to isolate provider-side queue from compute time.
- Throughput: gen_ai.server.time_per_output_token after the first token.
- End-to-end: gen_ai.client.operation.duration for total span time.
- Token context: gen_ai.usage.input_tokens to correlate prefill cost; long prompts → long TTFT.
- Aggregations: p50, p95, p99 TTFT by model, route, region; rolling-mean drift detector (sketched after the snippet below).
# traceAI auto-emits TTFT for streaming OpenAI calls
from openai import OpenAI

client = OpenAI()
msgs = [{"role": "user", "content": "Hello"}]  # any chat messages
stream = client.chat.completions.create(
    model="gpt-4o", messages=msgs, stream=True
)
# gen_ai.server.time_to_first_token is on the span automatically
for chunk in stream:
    ...
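The aggregation bullet above can be sketched like this, assuming per-span TTFT samples have been pulled from the trace store; the records and numbers are illustrative.
# Sketch: aggregate per-span TTFT into p50/p95/p99 by (model, route)
import numpy as np
from collections import defaultdict

spans = [  # illustrative per-span records from your trace store
    {"model": "gpt-4o", "route": "chat", "ttft": 0.28},
    {"model": "gpt-4o", "route": "chat", "ttft": 0.95},
    {"model": "gpt-4o-mini", "route": "search", "ttft": 0.19},
]
by_key = defaultdict(list)
for s in spans:
    by_key[(s["model"], s["route"])].append(s["ttft"])
for key, values in by_key.items():
    p50, p95, p99 = np.percentile(values, [50, 95, 99])
    print(key, round(p50, 3), round(p95, 3), round(p99, 3))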
Common Mistakes
- Reporting average TTFT instead of p95. TTFT distributions are heavy-tailed; the average hides the bad tail your users feel. Track p95 and p99.
- Conflating TTFT with end-to-end latency. A 200ms TTFT with a 3s tail is a great UX; a 3s TTFT with a 200ms tail is a broken one. Use both metrics.
- Forgetting prompt-size effects. TTFT grows roughly linearly with input tokens because of prefill. Long-context bots have inherently slower TTFT — budget for it.
- Ignoring queue time. A spike that looks like model slowdown is often provider-side queueing. gen_ai.server.queue_time separates the two.
- Treating non-streaming calls as having TTFT. TTFT is only meaningful for streaming. For non-streaming requests, use gen_ai.client.operation.duration.
Frequently Asked Questions
What is Time to First Token (TTFT)?
TTFT is the latency from sending an LLM request to the first generated token arriving at the client. It is the streaming-completion latency users actually feel and is captured per span as gen_ai.server.time_to_first_token.
How is TTFT different from time-to-last-token and tokens-per-second?
TTFT is the gap until the first token; time-to-last-token (TTLT or end-to-end latency) is the gap until the response completes. Tokens-per-second is the streaming throughput between first and last token. TTFT is what users perceive as responsiveness; throughput drives total wait.
How do you measure TTFT in production?
Instrument with traceAI; every LLM span carries gen_ai.server.time_to_first_token. Aggregate p50, p95, p99 by model, route, and tenant. Pair with gen_ai.server.queue_time to separate provider queueing from first-decode latency.