What Is Turn-Taking?
The timing logic that governs listening, waiting, interruption, acknowledgment, and speech in real-time voice agents.
What Is Turn-Taking?
Turn-taking in voice AI is the timing logic that decides when an agent listens, waits, interrupts, acknowledges, or speaks. It is a voice-agent reliability concept that appears in streaming audio, turn detection, ASR, LLM, and TTS traces. In FutureAGI, teams evaluate turn-taking with CustomerAgentInterruptionHandling, LiveKitEngine simulations, and call-level timing signals so missed barge-ins, premature endpointing, long silence, and awkward overlap are caught before callers hear them.
Why Turn-Taking Matters in Production LLM and Agent Systems
Bad turn-taking makes a correct voice agent feel broken. If the agent starts speaking while the caller is still correcting a date, the LLM may reason over the wrong intent and trigger the wrong tool call. If endpointing fires too early, “I want to cancel… the second card” becomes a partial request. If the agent waits too long, users repeat themselves, abandon the call, or ask for a human.
The main failure modes are missed barge-in, premature endpointing, double-talk, dead air, and false interruption recovery. Developers see traces where the model output is technically right but attached to the wrong user turn. SREs see p99 time-to-first-audio, silence duration, or audio queue time drift after a provider change. Product teams see lower task completion for noisy mobile calls. Compliance teams lose auditability when a regulated correction is spoken but not handled.
This is especially hard in 2026-era agentic voice pipelines because a turn boundary can drive retrieval, identity checks, payment updates, and escalation policies. A single bad boundary can propagate through several agent steps. Unlike transcript-only QA in Vapi review queues or raw LiveKit logs, production turn-taking needs audio timing, transcript content, interruption events, and call outcome scored together.
How FutureAGI Handles Turn-Taking in Voice AI
FutureAGI’s approach is to treat turn-taking as a scored conversation behavior, not as a raw VAD setting. The specific anchor for this glossary entry is CustomerAgentInterruptionHandling, the FutureAGI evaluator surface for customer-agent interruption handling. The voice simulation surface is LiveKitEngine, which the inventory defines as the simulate-sdk engine for voice simulations with transcript and audio capture.
A practical workflow starts with a support agent that handles billing corrections. The engineer creates Persona and Scenario records for hurried callers, noisy speakers, mid-sentence corrections, and deliberate barge-ins. LiveKitEngine runs those calls against the voice agent and stores the transcript, audio path, and eval scores in TestReport or TestCaseResult artifacts. The traceAI livekit integration keeps the call attached to the same observability trail as the ASR, LLM, tool, and TTS stages.
The metric that matters is not “did the user stop talking?” It is whether the agent handled the interruption well enough to preserve the user’s goal. A release gate can fail when CustomerAgentInterruptionHandling drops on noisy mobile calls, when silence after a user correction exceeds the approved p99 threshold, or when barge-in recovery increases escalations. The engineer then replays failed audio, checks the relevant trace span, adjusts endpointing or response-start policy, and reruns the regression eval before rollout.
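The release-gate step can be sketched as plain threshold logic. This is a hedged illustration only: the cohort fields and threshold values below (`interruption_score`, `p99_silence_ms`, `escalation_rate`) are assumed names for the signals described above, not a FutureAGI API.

```python
# Sketch of a per-cohort release gate over turn-taking signals.
# Field names and thresholds are illustrative, not a real SDK surface.
from dataclasses import dataclass

@dataclass
class CohortResult:
    cohort: str
    interruption_score: float  # CustomerAgentInterruptionHandling, 0..1
    p99_silence_ms: float      # silence after a user correction, p99
    escalation_rate: float     # transfer-to-human rate

def gate(results, min_score=0.85, max_silence_ms=1200.0, max_escalation=0.08):
    """Return the list of regressions; an empty list means the gate passes."""
    failures = []
    for r in results:
        if r.interruption_score < min_score:
            failures.append(f"{r.cohort}: interruption score {r.interruption_score:.2f}")
        if r.p99_silence_ms > max_silence_ms:
            failures.append(f"{r.cohort}: p99 silence {r.p99_silence_ms:.0f} ms")
        if r.escalation_rate > max_escalation:
            failures.append(f"{r.cohort}: escalation rate {r.escalation_rate:.1%}")
    return failures

results = [
    CohortResult("quiet-desktop", 0.93, 800.0, 0.03),
    CohortResult("noisy-mobile", 0.81, 1500.0, 0.10),
]
print(gate(results))  # three regressions, all on the noisy-mobile cohort
```

The point of gating per cohort rather than on the overall average is that noisy-mobile regressions stay visible even when quiet-desktop calls keep the aggregate score stable.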
How to Measure or Detect Turn-Taking
Measure turn-taking as a timing-and-behavior scorecard:
- CustomerAgentInterruptionHandling: FutureAGI evaluator for customer-agent interruption handling; use it as the primary release-gate score for barge-in and recovery quality.
- Overlap duration: milliseconds where user and agent speech overlap after the caller starts speaking.
- Silence after user turn: p50, p90, and p99 delay between user end-of-turn and first agent audio.
- Premature endpointing rate: user turns cut before the intent is complete, often followed by corrections.
- Barge-in recovery rate: percent of interruptions where the agent stops, acknowledges, and continues with the corrected intent.
- User proxies: repeat utterance rate, hang-up rate, transfer-to-human rate, and reopened ticket rate.
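The timing metrics in this scorecard can be computed directly from timestamped turn spans. A minimal sketch, assuming each turn is a `(start, end)` pair in seconds rather than any specific FutureAGI or LiveKit trace schema:

```python
# Sketch: overlap, post-turn silence, and a nearest-rank percentile
# computed from (start_s, end_s) turn spans. The span shape is assumed.
def overlap_ms(user, agent):
    """Total milliseconds where user and agent speech overlap."""
    total = 0.0
    for us, ue in user:
        for as_, ae in agent:
            total += max(0.0, min(ue, ae) - max(us, as_))
    return total * 1000.0

def silence_after_user_ms(user, agent):
    """Gap between each user end-of-turn and the next agent audio start."""
    gaps = []
    for _, ue in user:
        starts = [as_ for as_, _ in agent if as_ >= ue]
        if starts:
            gaps.append((min(starts) - ue) * 1000.0)
    return gaps

def percentile(values, p):
    """Nearest-rank percentile; enough for a scorecard sketch."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100.0 * (len(s) - 1))))
    return s[k]

user = [(0.0, 2.0), (5.0, 7.5)]    # user turns (seconds)
agent = [(1.8, 4.0), (8.0, 9.0)]   # agent turns (seconds)

print(overlap_ms(user, agent))     # ~200 ms of double-talk
print(percentile(silence_after_user_ms(user, agent), 90))
```

Premature endpointing and barge-in recovery need transcript content as well as timing, which is why the scorecard pairs these numbers with the evaluator score instead of relying on either alone.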
Minimal fi.evals shape:
```python
from fi.evals import CustomerAgentInterruptionHandling

call_transcript = "Customer: no, I meant tomorrow. Agent: sorry, tomorrow at 3?"
agent_turns = "Agent stopped, acknowledged the correction, and updated the date."

result = CustomerAgentInterruptionHandling().evaluate(
    input=call_transcript,
    output=agent_turns,
)
print(result.score)
```
Common Mistakes
- Treating turn detection as the whole problem. Detection predicts a boundary; turn-taking decides how the agent behaves after that boundary.
- Optimizing for silence only. Reducing pauses can increase double-talk if the agent responds before the caller completes a correction.
- Scoring transcripts without audio timing. Clean text hides clipped speech, overlapping speakers, late stop signals, and awkward recovery.
- Averaging all calls together. Overall interruption handling can look stable while noisy mobile, accented, or elderly-caller cohorts regress.
- Ignoring downstream tools. A bad turn boundary can still trigger a correct-looking tool call with the wrong user intent.
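The first mistake is worth making concrete: detection emits boundary events, and a separate policy decides the agent's behavior after each one. A toy sketch with illustrative event and action names, not a real turn-detection SDK:

```python
# Sketch: turn detection produces events; the turn-taking policy maps
# each event plus agent state to a behavior. All names are illustrative.
def policy(event, agent_speaking):
    """Map a detection event to an agent action."""
    if event == "user_started" and agent_speaking:
        return "stop_and_listen"          # barge-in: yield the floor
    if event == "user_stopped" and not agent_speaking:
        return "wait_briefly_then_speak"  # guard against premature endpointing
    if event == "user_corrected":
        return "acknowledge_and_update"   # recover with the corrected intent
    return "keep_listening"

print(policy("user_started", agent_speaking=True))   # stop_and_listen
print(policy("user_stopped", agent_speaking=False))  # wait_briefly_then_speak
```

Tuning the detector alone changes which events fire; the quality callers experience also depends on what this policy layer does with them, which is what interruption-handling evals score.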
Frequently Asked Questions
What is turn-taking in voice AI?
Turn-taking is the timing logic that decides when a voice AI agent should listen, wait, interrupt, acknowledge, or speak. It shapes whether the conversation feels responsive instead of clipped, delayed, or overlapping.
How is turn-taking different from turn detection?
Turn detection predicts whether a speaker has started, stopped, or yielded the floor. Turn-taking is the wider conversation policy that uses those signals to decide when the agent should respond, pause, or recover from an interruption.
How do you measure turn-taking?
FutureAGI uses CustomerAgentInterruptionHandling with LiveKitEngine simulations and traceAI LiveKit traces. Track overlap duration, barge-in recovery, silence after user turns, time-to-first-audio, and escalation rate by cohort.