What Is Turn Detection?

Voice-AI timing logic that decides when a speaker has yielded the floor and when the agent should respond.

Turn detection is the voice-AI process of deciding when a user has finished speaking, paused, interrupted, or yielded the floor so an agent can respond. It is a voice-agent timing signal that appears in audio runtimes, ASR pipelines, production traces, and simulation runs. FutureAGI teams test it with LiveKitEngine scenarios, inspect turn-event traces, and connect timing failures to ASRAccuracy, CustomerAgentInterruptionHandling, and TaskCompletion scores.

Why Turn Detection Matters in Production LLM and Agent Systems

Bad turn detection makes a correct model feel broken. If the agent responds before the user finishes, it reasons over a partial request and may call the wrong tool. If it waits too long, the caller hears dead air and repeats themselves. If it misses barge-in, the agent keeps speaking while the user is trying to correct the call.

The core failure modes are premature endpointing, missed endpointing, double-talk, and failed interruption handling. Developers see transcripts that look truncated or duplicated. SREs see p99 turn-to-first-audio drift after an ASR, VAD, or TTS provider change. Product teams see lower completion on mobile, noisy-channel, or accent cohorts. Compliance teams see a weaker audit trail because the final transcript hides whether the agent talked over the user.

This matters more in 2026-era agentic voice systems because a single bad boundary can start a multi-step pipeline: ASR finalizes a partial utterance, the LLM infers the wrong intent, a tool call updates the wrong record, and TTS speaks a confident answer. Common log symptoms include repeated clarification loops, high barge-in rate, long silence windows, user speech detected during agent playback, low transcription confidence near the endpoint, and calls marked resolved that later reopen. Turn detection is the control point between human conversation rhythm and machine action.

How FutureAGI Handles Turn Detection

FutureAGI’s approach is to treat turn detection as a conversation-boundary signal inside a scored voice workflow, not as a standalone transcript label. The current inventory does not expose a dedicated TurnDetection evaluator, so teams model it through simulation, runtime events, and adjacent evaluators.

A practical setup starts with simulate-sdk Persona and Scenario definitions for callers who pause mid-sentence, interrupt the agent, change their mind, speak over background noise, or give multi-part requests. LiveKitEngine runs those scenarios against the live voice agent and captures the audio, transcript, and per-call test result. The engineer preserves turn events such as user speech start, user speech stop, endpoint decision time, agent response start, agent playback stop, and barge-in detection. Those fields explain whether the agent answered at the right conversational boundary.
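The turn events listed above can be preserved as one record per user turn. A minimal sketch follows; the `TurnEvents` structure and its field names are illustrative, not a simulate-sdk or LiveKitEngine type:

```python
from dataclasses import dataclass

@dataclass
class TurnEvents:
    """One user turn's timing evidence, in seconds from call start.
    Illustrative structure; not a real simulate-sdk or LiveKitEngine schema."""
    user_speech_start: float
    user_speech_stop: float
    endpoint_decision: float      # when the pipeline declared the turn over
    agent_response_start: float   # first agent audio after the endpoint
    agent_playback_stop: float
    barge_in_detected: bool = False

    def turn_to_first_audio(self) -> float:
        # latency the caller experiences between finishing and hearing a reply
        return self.agent_response_start - self.user_speech_stop

    def endpointed_early(self) -> bool:
        # the endpoint fired before the user actually stopped speaking
        return self.endpoint_decision < self.user_speech_stop

turn = TurnEvents(
    user_speech_start=12.0, user_speech_stop=14.3,
    endpoint_decision=14.9, agent_response_start=15.1,
    agent_playback_stop=19.4,
)
print(round(turn.turn_to_first_audio(), 2))  # 0.8
print(turn.endpointed_early())               # False
```

Keeping these six timestamps per turn is enough to reconstruct most of the timing metrics discussed later without re-listening to audio.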

FutureAGI then attaches adjacent scores. ASRAccuracy checks whether the transcript after the boundary matches what was spoken. CustomerAgentInterruptionHandling is the closest evaluator for whether the agent handled an interruption correctly. TaskCompletion catches cases where timing looked acceptable but the call outcome failed. If a traceAI livekit integration is used, the same run can be inspected alongside ASR, LLM, tool, and TTS spans.

Unlike transcript review in a Vapi queue, this workflow keeps timing, audio, transcript, and outcome evidence together. An engineer can set a release gate: fail if missed barge-ins exceed 2% on noisy mobile calls, if p99 turn-to-first-audio exceeds 900ms, or if interruption-handling scores fall below the last approved build.
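A gate like that reduces to a small check over per-call results. The sketch below is hedged: the dict keys, cohort name, and thresholds are made up for illustration, and a real gate would read LiveKitEngine results and evaluator scores instead:

```python
def release_gate(calls, max_missed_bargein_pct=2.0, max_p99_ms=900):
    """Fail the build if turn timing regresses on the scenario cohort.
    `calls` is a list of dicts with illustrative keys, not a real schema."""
    noisy = [c for c in calls if c["cohort"] == "noisy_mobile"]
    missed = sum(c["missed_barge_ins"] for c in noisy)
    total = sum(c["barge_in_attempts"] for c in noisy) or 1
    missed_pct = 100.0 * missed / total

    # nearest-rank p99 over all calls' turn-to-first-audio latencies
    latencies = sorted(c["turn_to_first_audio_ms"] for c in calls)
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]

    failures = []
    if missed_pct > max_missed_bargein_pct:
        failures.append(f"missed barge-ins {missed_pct:.1f}% > {max_missed_bargein_pct}%")
    if p99 > max_p99_ms:
        failures.append(f"p99 turn-to-first-audio {p99}ms > {max_p99_ms}ms")
    return failures  # empty list means the gate passes

calls = [
    {"cohort": "noisy_mobile", "missed_barge_ins": 1, "barge_in_attempts": 40,
     "turn_to_first_audio_ms": 620},
    {"cohort": "clean", "missed_barge_ins": 0, "barge_in_attempts": 10,
     "turn_to_first_audio_ms": 540},
]
print(release_gate(calls))  # ['missed barge-ins 2.5% > 2.0%']
```

Returning a list of named failures, rather than a boolean, keeps the gate's output useful in CI logs.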

How to Measure or Detect Turn Detection

Measure turn detection as timing plus outcome, not silence duration alone.

  • Premature endpoint rate: percentage of turns where the agent started before the user’s semantic request was complete.
  • Missed endpoint rate: percentage of turns where the user finished but the agent waited past the target response window.
  • Barge-in handling rate: share of interruptions where playback stopped, the new user utterance was captured, and the agent revised its state.
  • p99 turn-to-first-audio: elapsed time from user turn end to agent audio start, sliced by channel, locale, provider, and scenario.
  • CustomerAgentInterruptionHandling: evaluator signal for whether the customer-agent workflow handled user interruptions appropriately.
  • ASRAccuracy: transcript fidelity around the boundary; a bad endpoint often creates missing words, repeated words, or unstable partials.
  • User proxies: repeated correction rate, hang-up rate after silence, thumbs-down rate, transfer-to-human rate, and reopened ticket rate.
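Several of these metrics fall out of the same turn-event log. A minimal sketch, computing premature endpoint rate and p99 turn-to-first-audio from illustrative per-turn records (the keys are assumptions, not a real trace schema):

```python
def premature_endpoint_rate(turns):
    """Share of turns where the endpoint fired before user speech ended."""
    early = sum(1 for t in turns if t["endpoint_s"] < t["user_stop_s"])
    return early / len(turns)

def p99_latency_ms(turns):
    """Nearest-rank p99 of agent-audio-start minus user-speech-stop, in ms."""
    lat = sorted(1000 * (t["agent_start_s"] - t["user_stop_s"]) for t in turns)
    return lat[min(len(lat) - 1, int(0.99 * len(lat)))]

turns = [
    {"user_stop_s": 4.0,  "endpoint_s": 3.6,  "agent_start_s": 4.2},  # cut early
    {"user_stop_s": 9.0,  "endpoint_s": 9.4,  "agent_start_s": 9.7},
    {"user_stop_s": 15.0, "endpoint_s": 15.3, "agent_start_s": 16.1},
]
print(premature_endpoint_rate(turns))  # 1 of 3 turns was cut early
print(p99_latency_ms(turns))           # worst-case response latency in ms
```

In practice these would be sliced by channel, locale, provider, and scenario cohort before comparing against thresholds.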

Do not accept one global threshold. A healthcare scheduler, a debt-collection agent, and a sales qualifier have different tolerances for silence, overlap, and interruption.

Common Mistakes

  • Equating VAD with turn detection. Speech/non-speech is only one input; the decision also needs semantic completion and conversation state.
  • Tuning silence timeout globally. A 500ms pause can be natural in one language and a completed turn in another.
  • Ignoring barge-in during TTS. Users often correct agents mid-playback; missed interruption handling turns a recoverable error into a failed call.
  • Testing only clean audio. Endpoint thresholds that pass studio audio often fail with carrier noise, crosstalk, Bluetooth delay, or far-field microphones.
  • Averaging across cohorts. Overall latency can look stable while mobile, accent, or noisy-room calls regress sharply.

Frequently Asked Questions

What is turn detection?

Turn detection is the voice-AI process of deciding when a speaker has finished, paused, interrupted, or yielded the floor so the agent can respond at the right moment.

How is turn detection different from voice activity detection?

Voice activity detection identifies speech versus non-speech in the audio stream. Turn detection uses that signal plus timing, interruption, and conversation context to decide whether the user has actually handed the turn to the agent.
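That layering can be illustrated with a toy decision function. Everything here, including the thresholds and the semantic-completion flag, is a simplified assumption for illustration, not a production endpointing algorithm:

```python
def should_take_turn(silence_ms: int, vad_says_silent: bool,
                     semantically_complete: bool, agent_is_speaking: bool) -> bool:
    """Toy turn decision: VAD silence alone is not enough to end a turn."""
    if agent_is_speaking:
        return False   # overlap during playback is barge-in, not a turn end
    if not vad_says_silent:
        return False   # user is still audibly speaking
    # A short pause after a complete request ends the turn; an incomplete
    # request ("I'd like to book a...") needs a much longer pause.
    threshold = 400 if semantically_complete else 1500
    return silence_ms >= threshold

print(should_take_turn(600, True, True, False))   # True: done, short pause
print(should_take_turn(600, True, False, False))  # False: mid-thought pause
```

Even this toy version shows why tuning a single silence timeout globally fails: the right threshold depends on what was said, not just how long the silence lasted.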

How do you measure turn detection?

In FutureAGI, measure it with LiveKitEngine simulations, turn-event traces, p99 turn-to-first-audio, barge-in handling, CustomerAgentInterruptionHandling, ASRAccuracy, and TaskCompletion by scenario cohort.