Voice AI

What Are Word-Level Timestamps?

Word-level timestamps map each transcript word to start and end offsets in the source audio.

Word-level timestamps are start and end time offsets for each recognized word in an audio transcript. They are a voice AI observability field, not a language-model metric: they show up in ASR output, production traces, subtitle files, and voice-agent simulation records before the LLM reasons over text. FutureAGI uses them beside ASRAccuracy, LiveKitEngine runs, and traceAI-livekit spans to locate where transcript timing, turn handling, or downstream agent behavior first went wrong.
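
To make the shape of this data concrete, here is a minimal sketch of per-word timestamp records as they might appear in an ASR result or a voice-agent trace row. The field names (`word`, `start_ms`, `end_ms`, `confidence`, `speaker_id`) follow the fields discussed later in this article; the exact structure varies by provider and is an illustrative assumption here.

```python
# Hypothetical per-word timestamp records from an ASR result
# (field names illustrative; providers vary).
words = [
    {"word": "cancel", "start_ms": 4120, "end_ms": 4490, "confidence": 0.97, "speaker_id": "caller"},
    {"word": "my",     "start_ms": 4510, "end_ms": 4620, "confidence": 0.99, "speaker_id": "caller"},
    {"word": "policy", "start_ms": 4640, "end_ms": 5110, "confidence": 0.95, "speaker_id": "caller"},
]

# Each word can be tied back to the exact audio span where it was spoken.
for w in words:
    print(f'{w["word"]!r}: {w["start_ms"]}-{w["end_ms"]} ms ({w["speaker_id"]})')
```

Because every word carries its own offsets, downstream systems can replay the exact audio second behind any transcript word instead of scrubbing through the whole recording.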

Why It Matters in Production LLM and Agent Systems

Bad word timing turns a readable transcript into weak operational evidence. A support voice agent may transcribe the right words, but if timestamps drift by two seconds, the system can attach “cancel my policy” to the wrong speaker turn, miss a barge-in, or replay the wrong audio segment during a quality review. The failure rarely looks like an ASR outage. It looks like silent turn drift, incorrect captions, broken interruption handling, or an agent that answers before the caller finished the phrase.

Developers feel this when a bug report includes audio but no reliable alignment between the transcript, the ASR segment, and the LLM span. SREs see rising p99 time-to-first-audio or more repeated turns without an obvious model regression. Product teams see users talk over the agent, get ignored, or repeat themselves. Compliance teams lose auditability when consent words exist in the transcript but cannot be tied to the audio second where the user said them.

For 2026 voice-agent pipelines, word-level timestamps matter because ASR output is no longer just a caption. It feeds turn detection, retrieval queries, tool arguments, summaries, coaching, and call analytics. Unlike a clean Whisper demo or a provider dashboard that reports aggregate transcript accuracy, production reliability depends on matching words to time, speaker, confidence, and downstream action.

How FutureAGI Handles Word-Level Timestamps

FutureAGI’s approach is to treat word-level timestamps as trace evidence around the ASR boundary, not as a standalone score. Because there is no dedicated timestamp evaluator, the practical FutureAGI surface is the voice simulation and trace record: LiveKitEngine captures spoken interactions, traceAI-livekit keeps voice stages close to LLM and tool spans, and ASRAccuracy checks whether the transcript text matches a reference. The timestamp fields then explain where an otherwise acceptable transcript became operationally unsafe.

A real workflow looks like this. A travel-booking voice agent runs nightly simulations over delayed-flight, noisy-airport, and caller-interruption scenarios. Each row stores audio_path, asr_transcript, reference_transcript, word, start_ms, end_ms, confidence, speaker_id, and the final task outcome. FutureAGI evaluates the transcript with ASRAccuracy, then the engineer checks timestamp coverage and alignment drift for words near booking dates, cancellation phrases, and payment authorization.
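
The per-row record above can be sketched as a simple data structure. The field names come from the workflow description; grouping the per-word fields into a list of dicts and the exact class shape are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class SimulationRow:
    """One stored voice-simulation record (schema sketched from the fields above)."""
    audio_path: str
    asr_transcript: str
    reference_transcript: str
    words: list   # dicts with word, start_ms, end_ms, confidence, speaker_id
    outcome: str  # final task outcome, e.g. "rebooked", "escalated"

row = SimulationRow(
    audio_path="sims/2026-01-15/delayed_flight_042.wav",
    asr_transcript="I need to rebook the delayed flight",
    reference_transcript="I need to rebook the delayed flight",
    words=[{"word": "rebook", "start_ms": 1810, "end_ms": 2330,
            "confidence": 0.93, "speaker_id": "caller"}],
    outcome="rebooked",
)
print(row.audio_path, len(row.words), row.outcome)
```

Keeping transcript text, per-word timing, and the task outcome in one row is what lets a reviewer jump from a failed booking straight to the audio span where the failure began.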

The next action is concrete. If ASRAccuracy remains high but timestamp drift exceeds 500 ms around interruptions, the team does not swap the LLM. They tune endpointing, change the ASR provider route for noisy calls, or add a regression scenario before release. If the drift clusters in one codec or mobile network, the alert is routed to the voice-infrastructure owner. Word-level timestamps become the bridge between audio evidence and agent behavior.
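
The triage rule described above can be sketched as a small routing function: high ASRAccuracy plus large timestamp drift points at the voice stack, not the LLM. The thresholds, owner names, and function shape are illustrative assumptions, not a FutureAGI API.

```python
# Sketch of the triage rule: accurate text + drifting timestamps means
# the timing layer, not the model, needs attention. Thresholds assumed.
DRIFT_BUDGET_MS = 500
ACCURACY_FLOOR = 0.95

def route_alert(asr_accuracy: float, max_drift_ms: int, cohort: str) -> str:
    if asr_accuracy >= ACCURACY_FLOOR and max_drift_ms > DRIFT_BUDGET_MS:
        # Transcript text is fine; the timing layer is not.
        return f"voice-infrastructure ({cohort})"
    if asr_accuracy < ACCURACY_FLOOR:
        return "asr-provider-review"
    return "no-action"

print(route_alert(0.98, 820, "opus-mobile"))  # drift over budget -> voice infra
```

Slicing `cohort` by codec or network, as the text suggests, is what turns a generic alert into one the right owner can act on.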

How to Measure or Detect Word-Level Timestamp Issues

Measure timestamps as alignment quality plus downstream impact:

  • Timestamp coverage: percentage of transcript words with non-null start_ms and end_ms; missing values should be sliced by provider, language, channel, and audio codec.
  • Alignment drift: absolute difference between ASR word offsets and a trusted manual or forced-alignment reference, especially around names, dates, prices, and consent phrases.
  • Duration anomalies: zero-length words, negative durations, overlapping words from one speaker, long gaps inside a phrase, or words that cross turn boundaries.
  • ASRAccuracy: FutureAGI evaluator for transcript correctness; use it with timestamp checks so timing bugs do not hide behind accurate text.
  • Trace and outcome signals: rising barge-in failures, repeated clarification turns, subtitle complaints, escalation rate, or eval-fail-rate-by-cohort.
# ASRAccuracy validates transcript text; pair it with the stored timing
# fields so alignment bugs cannot hide behind accurate words.
from fi.evals import ASRAccuracy

evaluator = ASRAccuracy()
result = evaluator.evaluate(
    prediction=asr_transcript,    # transcript produced by the ASR stage
    reference=human_transcript,   # trusted human reference transcript
)
print(result.score, word_timestamps[0]["start_ms"])
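
The alignment metrics in the list above (coverage, drift, duration anomalies) can be computed directly from per-word records. This is a minimal sketch assuming the record shape shown earlier and a forced-alignment reference matched by word position; real pipelines need a proper word-alignment step.

```python
# Sketch of the alignment metrics listed above, computed from per-word
# records against a forced-alignment reference (record shape assumed).
def coverage(words):
    """Fraction of words with non-null start_ms and end_ms."""
    timed = [w for w in words
             if w.get("start_ms") is not None and w.get("end_ms") is not None]
    return len(timed) / len(words) if words else 0.0

def drifts_ms(asr_words, ref_words):
    """Absolute start-offset drift for words matched by position."""
    return [abs(a["start_ms"] - r["start_ms"])
            for a, r in zip(asr_words, ref_words)]

def duration_anomalies(words):
    """Zero-length or negative-duration words."""
    return [w for w in words if w["end_ms"] - w["start_ms"] <= 0]

asr = [{"word": "cancel", "start_ms": 4320, "end_ms": 4690},
       {"word": "policy", "start_ms": 5340, "end_ms": 5340}]
ref = [{"word": "cancel", "start_ms": 4120, "end_ms": 4490},
       {"word": "policy", "start_ms": 4640, "end_ms": 5110}]

print(coverage(asr), max(drifts_ms(asr, ref)), len(duration_anomalies(asr)))
# -> 1.0 700 1  (full coverage, 700 ms worst drift, one zero-length word)
```

Tracking the worst-case drift per call, not just the mean, matters for the tail-latency failure modes described in the next section.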

Common Mistakes

The common mistake is assuming correct words imply usable timing. That holds only when a human reads the transcript offline, not when a voice agent acts on a live stream.

  • Scoring transcript text only. Accurate words can still be attached to the wrong audio second, speaker, or turn boundary.
  • Ignoring provider timestamp semantics. Some providers timestamp final words, partial words, subwords, or normalized text differently.
  • Using average drift only. A 120 ms mean can hide 900 ms drift around barge-ins, names, and payment confirmations.
  • Dropping low-confidence timing rows. Those rows often explain the exact user cohorts where the voice agent fails.
  • Treating captions and agent traces separately. The same timestamp error can break subtitles, retrieval context, and audit replay.

Frequently Asked Questions

What are word-level timestamps?

Word-level timestamps are start and end time offsets attached to each recognized word in an audio transcript. They help engineers align speech, ASR output, captions, and voice-agent behavior.

How are word-level timestamps different from transcription accuracy?

Transcription accuracy measures whether the words are correct. Word-level timestamps measure where those words occur in the audio, so a transcript can be accurate but poorly aligned.

How do you measure word-level timestamps?

Use FutureAGI's ASRAccuracy to validate transcript content, then inspect timestamp coverage, alignment drift, word duration anomalies, and traceAI-livekit spans for timing regressions.