Voice AI Barge-In and Turn-Taking: A 2026 Implementation Guide
Implement barge-in and turn-taking that feels human. VAD tuning, false-barge-in defense, context preservation, and per-stage latency telemetry for 2026.
A voice agent that talks over the caller is a voice agent that loses the call. Barge-in is the engineering surface that fixes it. Turn-taking is the conversational policy that decides when each speaker holds the floor. Both have a strict latency budget, both fail in distinct ways, and both can be measured. This guide walks the production-grade implementation in 2026: VAD pipeline, false-barge-in defense, context preservation across interrupts, and the per-stage telemetry you need to keep all of it tuned.
TL;DR: the barge-in pipeline
A working barge-in implementation has five components.
- Voice Activity Detection (VAD) continuously scores the incoming audio stream while the agent is speaking. Energy threshold plus a voice classifier plus a minimum-duration guard.
- Barge-in trigger fires when VAD confidence stays above the threshold for the minimum window. Below the threshold, the agent keeps speaking.
- TTS flush stops the audio output stream in under 60ms. Queued audio is dropped, the WebSocket closes, and the playback device returns to a listening state.
- LLM cancel terminates the in-flight LLM generation in under 40ms. Any partial response is discarded or stashed depending on the policy.
- Context preservation records the agent’s interrupted utterance, the in-flight tool calls, and the conversational state so the next turn can use them.
The combined budget for a natural-feeling barge-in is under 150ms from the barge-in trigger to TTS flush. The combined turn-taking gap from end-of-user-speech to first-audio-of-next-turn is 200-450ms depending on use case.
Why barge-in is the hardest part of a voice agent
Most voice agent failures don’t show up in the LLM response quality. They show up in the conversational flow. The agent finishes a sentence the caller already interrupted. The agent’s TTS keeps playing while the caller is asking a new question. The caller has to repeat themselves. The conversation feels stilted.
Three reasons barge-in is hard:
The signal is noisy. Real phone calls have hold music, side conversations, coughs, paper rustling, and call-center background hum. A VAD that fires on any audio above an energy threshold will false-trigger constantly.
The latency budget is unforgiving. A 200ms TTS flush feels broken. The agent has to detect the barge-in, stop the audio, cancel the LLM, and yield the floor in under 150ms total. Each component has 30-50ms.
The state is in flight. When the user barges in, the agent has audio in the playback buffer, an LLM call still generating tokens, and possibly a tool call in progress. All three have to be handled cleanly.
The combination is what makes barge-in the failure mode users notice. They might not remember whether the agent answered the question correctly. They will remember that the agent kept talking over them.
The VAD pipeline
The Voice Activity Detection pipeline is the entry point. It runs continuously while the agent is speaking and produces a per-frame confidence score that the incoming audio contains a user voice.
Note: pure VAD treats backchanneling (the “uh-huh / mhm” signals listeners emit to show engagement without taking the turn) as either silence or a full barge-in attempt. Neither is right. The 2026 production stack is migrating toward dedicated turn-taking models that classify backchannel vs. barge-in vs. continued silence as a learned signal instead of an energy threshold. Pipecat’s SmartTurnAnalyzer, LiveKit’s TurnDetector, and Vapi’s endpointing controls are the production examples. The VAD pipeline below is still the dominant pattern; the turn-taking-model layer is the migration target.
A production-grade VAD pipeline has three layers.
Layer 1: energy threshold
The first cut is an audio energy threshold. Frames below -45 dBFS are treated as silence and never trigger barge-in. Frames above -35 dBFS are candidates. The exact threshold depends on the audio codec, the microphone, and the call leg. A telephony-grade VAD typically tunes between -50 and -30 dBFS based on the codec.
The energy threshold is cheap and fast (microseconds per frame) but coarse. It does not distinguish speech from noise. That’s the next layer’s job.
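The energy gate can be sketched in a few lines. This is a minimal illustration, assuming 16-bit PCM frames and an RMS-based level; the function names and the "uncertain" band between the two thresholds are hypothetical and not taken from any specific VAD library.

```python
import math
from typing import Sequence

SILENCE_DBFS = -45.0    # frames below this never trigger barge-in
CANDIDATE_DBFS = -35.0  # frames above this go on to the voice classifier


def frame_dbfs(samples: Sequence[int], full_scale: int = 32768) -> float:
    """RMS level of one frame of 16-bit PCM samples, in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 10.0 * math.log10(mean_square / (full_scale * full_scale))


def energy_gate(samples: Sequence[int]) -> str:
    """Classify a frame as 'silence', 'uncertain', or 'candidate'."""
    level = frame_dbfs(samples)
    if level < SILENCE_DBFS:
        return "silence"
    if level > CANDIDATE_DBFS:
        return "candidate"
    # Assumption: the band between the two thresholds is left to per-deployment tuning.
    return "uncertain"
```

In practice the two constants would be tuned per codec and call leg, as described above.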
Layer 2: voice classification
Above the energy threshold, a voice classifier decides whether the audio is human speech. Three common options:
- Silero VAD. A small neural network trained on multilingual speech. Streaming-friendly, low CPU cost, false-positive rate around 5-8% on noisy phone audio.
- WebRTC VAD. Classic GMM-based classifier shipped with WebRTC. Faster than Silero but higher false-positive rate, especially on music or call-center hum.
- Custom CNN. A small convolutional network trained on your own call audio. The right choice when you have a niche audio profile (e.g., heavy regional dialect or specific background noise).
Most production voice teams use Silero VAD as the default and switch to a custom CNN when the false-barge-in rate stays above the target. The classifier produces a confidence score (0-1) per frame.
Layer 3: minimum-duration guard
A 50ms cough above the energy threshold should not trigger barge-in. The minimum-duration guard requires sustained voice across multiple frames before firing. Typical values: 200-300ms of sustained voice with classifier confidence above 0.7.
The guard adds latency to the barge-in path (200ms more to detect intent). The tradeoff is worth it: the false-barge-in rate drops 60-80%.
The full VAD pipeline output is a binary signal: barge-in YES or NO at each frame. The trigger is conservative by design. False-barge-in is a worse failure than a slightly slow barge-in.
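A minimal sketch of how the three layers might combine into that binary signal. The class name, the 20ms frame size, and the reset-on-any-non-voice-frame rule are assumptions for illustration; a production guard may tolerate brief dips in confidence.

```python
from dataclasses import dataclass


@dataclass
class BargeInGuard:
    """Fires only after sustained voice; a sketch of the minimum-duration guard."""
    frame_ms: int = 20                 # assumed frame size
    min_duration_ms: int = 250         # within the 200-300ms range above
    confidence_threshold: float = 0.7
    energy_gate_dbfs: float = -35.0
    _sustained_ms: int = 0

    def update(self, energy_dbfs: float, voice_confidence: float) -> bool:
        """Feed one frame; return True once the barge-in should fire."""
        is_voice = (
            energy_dbfs > self.energy_gate_dbfs
            and voice_confidence > self.confidence_threshold
        )
        # Any non-voice frame resets the run, so short bursts never accumulate.
        self._sustained_ms = self._sustained_ms + self.frame_ms if is_voice else 0
        return self._sustained_ms >= self.min_duration_ms
```

With these defaults, a 60ms cough (three frames) never fires, while continuous speech fires on the thirteenth frame, 260ms in.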
TTS flush: the 60ms target
When barge-in fires, the TTS audio has to stop within 60ms. Anything slower feels like the agent ignored the interruption.
Three engineering details matter.
Streaming TTS providers must support cancellation. Cartesia Sonic and ElevenLabs Turbo both support mid-stream cancellation via WebSocket close. PlayHT and OpenAI TTS have varying support. Test cancellation latency before picking a provider.
The audio buffer must drain fast. Most voice gateways buffer 200-400ms of TTS audio for jitter resilience. On barge-in, the buffer has to flush, which means dropping the queued audio. Implement a flush() method on the playback path that drops pending packets instead of waiting for them to drain.
The playback device has to release fast. On WebRTC the audio track has to stop or pause. On telephony the codec frame queue has to clear. Either way, the device-level state change has its own 10-20ms cost.
A 60ms TTS flush target breaks down as: 10ms VAD-to-flush dispatch, 20ms buffer drain, 20ms WebSocket close, 10ms device release. Each component is independently measurable. If the total slips above 80ms, profile each component to find the offender.
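The flush() idea and the per-component budget can be sketched as follows. This is an illustrative in-memory buffer, not a real gateway API; the class, the component names, and the helper function are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Optional

# Per-component shares of the 60ms flush budget described above.
FLUSH_BUDGET_MS: Dict[str, int] = {
    "dispatch": 10,
    "buffer_drain": 20,
    "websocket_close": 20,
    "device_release": 10,
}


class PlaybackBuffer:
    """Jitter buffer for TTS audio packets with a drop-everything flush."""

    def __init__(self) -> None:
        self._packets: Deque[bytes] = deque()

    def push(self, packet: bytes) -> None:
        self._packets.append(packet)

    def pop(self) -> Optional[bytes]:
        """Next packet for the playback device, or None if the buffer is empty."""
        return self._packets.popleft() if self._packets else None

    def flush(self) -> int:
        """Drop all queued audio immediately; return how many packets were dropped."""
        dropped = len(self._packets)
        self._packets.clear()
        return dropped


def flush_offenders(measured_ms: Dict[str, float]) -> List[str]:
    """Components whose measured time exceeds their share of the budget."""
    return [
        name for name, budget in FLUSH_BUDGET_MS.items()
        if measured_ms.get(name, 0) > budget
    ]
```

The point of flush() is that it never waits on playback: queued packets are discarded, so the cost is independent of how much audio was buffered.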
LLM cancel: the 40ms target
While TTS is flushing, the LLM that produced the in-flight audio is still generating tokens. Letting it finish costs money and wastes compute. Canceling it cleanly is a 40ms problem.
LLM cancellation works at the HTTP layer. The client SDKs for most major providers (OpenAI, Anthropic, Google, Bedrock) accept an AbortController signal or an equivalent cancel mechanism. The cancel call closes the stream and the server-side generation terminates.
Three caveats:
- Prefix-cached generations sometimes complete anyway. When the LLM is using a cached prefix, the first 100-200 tokens come back nearly instantly. Cancellation after those tokens have shipped is a no-op. Accept that some short generations will complete despite cancel.
- Tool calls in the partial response need handling. If the LLM emitted a tool call before cancellation, decide: do you execute it, ignore it, or stash it? The right choice depends on the tool. Idempotent reads (balance lookup, account status) are safe to execute. Mutations (refund, transfer) should be canceled.
- Track cancel latency as a metric. Network plus server-side processing adds 20-30ms to the cancel call. Add a span attribute for llm_cancel_latency_ms and plot the P95. Anything above 60ms is a problem.
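In Python, the equivalent of aborting the request is cancelling the asyncio task that consumes the stream. The sketch below uses a fake token stream as a stand-in for a provider call (an assumption; no real SDK is invoked) and measures the value you would record as llm_cancel_latency_ms.

```python
import asyncio
import time
from typing import List, Tuple


async def fake_llm_stream(tokens_out: List[str]) -> None:
    """Stand-in for a streaming LLM call; appends one token every 10ms."""
    for i in range(1000):
        await asyncio.sleep(0.01)
        tokens_out.append(f"tok{i}")


async def cancel_generation(task: "asyncio.Task[None]") -> float:
    """Cancel an in-flight generation task and return the cancel latency in ms."""
    started = time.perf_counter()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the generation was cancelled mid-stream
    return (time.perf_counter() - started) * 1000.0


async def demo() -> Tuple[List[str], float, bool]:
    tokens: List[str] = []
    task = asyncio.create_task(fake_llm_stream(tokens))
    await asyncio.sleep(0.05)  # the agent is mid-response when barge-in fires
    latency_ms = await cancel_generation(task)
    return tokens, latency_ms, task.cancelled()
```

Against a real provider the measured latency also includes the network round trip, which is why it is worth recording per call rather than assuming.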
Context preservation across interrupts
The hardest part of barge-in is what happens after. The user interrupted; now what does the agent know?
Three patterns work.
Pattern 1: stash the partial utterance
The agent’s interrupted utterance gets stashed in conversation state with a flag. The next LLM turn sees previous_agent_utterance_interrupted: "Your account balance is...". The LLM can decide whether to repeat, continue, or start fresh based on the new user input.
For most short turns the LLM correctly ignores the stashed utterance. For longer interruptions (e.g., the agent was reading a multi-step procedure), the stash lets the agent resume from the right place if asked.
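A minimal sketch of the stash, assuming the playback path can report how many words the caller actually heard before the flush. The function name and the truncated_at_word key are hypothetical; only previous_agent_utterance_interrupted comes from the description above.

```python
from typing import Optional


def stash_interrupted_utterance(full_text: str, words_spoken: int) -> Optional[dict]:
    """Return a state entry for the part of the utterance the caller heard."""
    words = full_text.split()
    if words_spoken >= len(words):
        return None  # the utterance finished; nothing was interrupted
    heard = " ".join(words[:words_spoken])
    return {
        "previous_agent_utterance_interrupted": heard + "...",
        "truncated_at_word": words_spoken,
    }
```

Stashing only what was heard, instead of the full generated text, keeps the next LLM turn from assuming the caller received information that was never played.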
Pattern 2: handle in-flight tool calls
If a tool call fired before the barge-in, three options:
- Cancel the tool call. Right when the new intent contradicts the in-flight tool. User asked for balance, agent fired get_balance(), user interrupts with “actually, I want to transfer money.” Cancel the balance lookup.
- Finish the tool call in the background. When the tool result might still be useful. User asked for balance, agent fired get_balance(), user interrupts with “also, what’s my last transaction?” Finish the balance lookup and pass both results to the next LLM call.
- Pause and resume. Rare, but useful when the tool is expensive and the interrupt is brief. User asked for full statement, agent fired generate_pdf(), user coughs and pauses. Pause the PDF generation, resume after the user resumes.
The default policy should be option 2 for read-only tools and option 1 for mutations. Idempotent reads can complete and the result feeds the next turn. Mutations must wait for explicit confirmation.
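The default policy is small enough to encode as data plus one function. This is a sketch under the assumption that each tool is tagged as read-only or mutating at registration time; the type and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class ToolAction(Enum):
    CANCEL = "cancel"                              # option 1
    FINISH_IN_BACKGROUND = "finish_in_background"  # option 2


@dataclass(frozen=True)
class InFlightTool:
    name: str
    read_only: bool


def default_action(tool: InFlightTool) -> ToolAction:
    """Read-only tools finish in the background; mutations are cancelled."""
    return ToolAction.FINISH_IN_BACKGROUND if tool.read_only else ToolAction.CANCEL


def plan_on_barge_in(tools: Iterable[InFlightTool]) -> Dict[str, ToolAction]:
    """Map each in-flight tool name to the action the default policy prescribes."""
    return {tool.name: default_action(tool) for tool in tools}
```

Keeping the read-only flag on the tool definition means the policy is visible in code review instead of being decided ad hoc inside the interrupt handler.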
Pattern 3: track conversational state
Conversation state should include:
- agent_speaking: bool. True while TTS is active.
- agent_last_utterance: string. The most recent agent message (interrupted or complete).
- in_flight_tool_calls: list. Tool calls fired but not yet resolved.
- interrupt_count: int. Number of barge-ins so far in the conversation.
- last_interrupt_turn: int. The turn index of the most recent interruption.
The LLM gets this state as part of every turn’s context. A high interrupt count signals the agent is being long-winded. A repeated interrupt at the same point signals a prompt problem.
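One way to hold this state is a small dataclass. The field names follow the list above; current_turn, record_interrupt, and to_context are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConversationState:
    agent_speaking: bool = False
    agent_last_utterance: str = ""
    in_flight_tool_calls: List[str] = field(default_factory=list)
    interrupt_count: int = 0
    last_interrupt_turn: int = -1   # -1 means no interruption yet
    current_turn: int = 0           # bookkeeping only; not sent to the LLM

    def record_interrupt(self) -> None:
        """Update counters when a barge-in is confirmed."""
        self.interrupt_count += 1
        self.last_interrupt_turn = self.current_turn
        self.agent_speaking = False

    def to_context(self) -> dict:
        """Fields passed to the LLM as part of every turn's context."""
        return {
            "agent_speaking": self.agent_speaking,
            "agent_last_utterance": self.agent_last_utterance,
            "in_flight_tool_calls": list(self.in_flight_tool_calls),
            "interrupt_count": self.interrupt_count,
            "last_interrupt_turn": self.last_interrupt_turn,
        }
```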
Turn-taking: the policy layer
Barge-in is the mechanism. Turn-taking is the policy. The policy decides:
- When is the agent’s turn over? End-of-sentence? End-of-thought? End-of-paragraph?
- How long should the agent wait before re-speaking? Immediately? After a 300ms pause?
- What counts as the user’s turn ending? Silence threshold? Sentence boundary?
- Can the agent interject for clarification? Or does it always wait for the user to finish?
Three turn-taking policies work in 2026 voice agents.
Policy A: strict end-of-turn
The agent waits for the user to finish completely before responding. End-of-turn detection uses a silence threshold (typically 800-1200ms of silence after the user’s last word) combined with semantic completeness (the LLM scoring whether the partial transcript looks complete).
Strict end-of-turn produces composed, considered conversations. The latency is higher (you pay the silence threshold on every turn). The fit is clinical, financial, legal, and complex troubleshooting.
Policy B: progressive turn-taking
The agent starts thinking on user speech partials, but only commits to speaking when the user pauses. STT first-partial fires the LLM with the partial transcript. The LLM streams a response in the background. If the user pauses (300-500ms silence), the agent speaks the buffered response. If the user keeps going, the agent cancels and restarts.
Progressive turn-taking is the right fit for sales, support, and IVR. Latency drops 200-400ms compared to strict end-of-turn. The cancel-and-restart rate stays below 5% if the silence threshold is tuned correctly.
Policy C: aggressive interjection
The agent interjects mid-user-turn for confirmations or backchannels (“ok”, “got it”, “I see”). The interjection is short (100-200ms of audio) and the user can talk over it. This pattern feels human in some contexts and rude in others.
Aggressive interjection works well for narrative or therapy-style agents where backchannels signal active listening. It works poorly for transactional agents where the user is conveying specific information and the interjection interrupts.
Pick the policy per use case. The right policy is enforceable in code: silence threshold, partial-commit threshold, interjection threshold are all tunable parameters.
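A sketch of what "enforceable in code" might look like for Policies A and B. The dataclass and its field names are hypothetical, and the thresholds are midpoints of the ranges quoted above; Policy C would need an additional interjection threshold that is omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnPolicy:
    name: str
    silence_threshold_ms: int            # silence required before the agent speaks
    speculate_on_partials: bool          # start the LLM on STT partials
    require_semantic_completeness: bool  # also require a complete-looking transcript


STRICT = TurnPolicy("strict_end_of_turn", 1000, False, True)
PROGRESSIVE = TurnPolicy("progressive", 400, True, False)


def should_agent_speak(policy: TurnPolicy, silence_ms: int,
                       transcript_looks_complete: bool) -> bool:
    """Decide whether the agent may take the floor under the given policy."""
    if silence_ms < policy.silence_threshold_ms:
        return False
    if policy.require_semantic_completeness:
        return transcript_looks_complete
    return True
```

For example, 400ms of silence is enough for the progressive policy to commit the buffered response, while the strict policy keeps waiting.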
False-barge-in: the production failure mode
The most common production complaint about barge-in is the agent cutting itself off when the user wasn’t actually trying to interrupt. Three failure patterns dominate.
1. Background noise triggers VAD. The caller is in a noisy environment (coffee shop, open office, busy household). Ambient noise pushes the energy above threshold and the classifier marks it as speech. The agent cuts off. Fix: tune the energy threshold for the deployment environment, or add a custom CNN trained on background-noise patterns from your call audio.
2. Side conversations trigger VAD. The caller is talking to someone else (handing the phone over, asking a partner a question). The agent cuts off mid-sentence and the original caller is confused. Fix: use a speaker-diarization signal if available (some voice gateways expose it), or accept some false-barge-in on side conversations as a deliberate tradeoff.
3. Codec artifacts trigger VAD. Some telephony codecs produce artifacts that look like speech to the classifier. G.711 is fine; some compressed codecs (G.729, Opus at low bitrates) introduce more noise. Fix: prefer higher-bitrate codecs where possible, or add codec-specific tuning to the VAD.
The combined false-barge-in rate target is under 2%. Above 5% the agent feels broken. Below 1% the VAD is probably too conservative and is missing real interruptions.
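These bands are easy to check automatically once events have been reviewed. A minimal sketch, assuming each reviewed barge-in carries a boolean false-barge-in flag; the function names and band labels are hypothetical.

```python
from typing import Iterable


def false_barge_in_rate(reviewed_flags: Iterable[bool]) -> float:
    """Fraction of reviewed barge-in events that were flagged as false."""
    flags = list(reviewed_flags)
    if not flags:
        raise ValueError("no reviewed barge-in events")
    return sum(flags) / len(flags)


def assess_rate(rate: float) -> str:
    """Map a false-barge-in rate onto the bands described above."""
    if rate > 0.05:
        return "broken"
    if rate >= 0.02:
        return "above_target"
    if rate < 0.01:
        return "possibly_too_conservative"
    return "on_target"
```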
Mid-tool-call interruption: the worst case
The worst barge-in scenario is when the user interrupts mid-tool-call. The agent is silent (still waiting for the tool to return), the user gets impatient or distracted, and the user starts talking. The VAD fires barge-in but there’s no TTS to flush. What now?
Three rules.
Rule 1: respect the tool result. If the tool call is idempotent and read-only, let it finish. The result might be useful for the next turn.
Rule 2: surface the partial state. If the user’s new utterance is “are you still there?”, the agent should respond with what it’s doing: “I’m pulling your account, give me a moment.” This is the visible thinking signal pattern.
Rule 3: cancel on contradiction. If the user’s new utterance contradicts the in-flight tool (different account number, different request), cancel the tool. Don’t waste compute and don’t apply stale results.
The pattern in code:
```python
async def handle_barge_in(state, new_partial):
    if state.in_flight_tools:
        for tool in state.in_flight_tools:
            if tool.is_idempotent_read:
                # let it finish in background
                continue
            if intent_contradicts(new_partial, tool):
                await tool.cancel()
            else:
                # keep it running; result might be useful
                continue
    if state.agent_speaking:
        await flush_tts()
        await cancel_llm()
    state.interrupt_count += 1
    state.last_interrupt_turn = state.current_turn
```
The handler runs on every confirmed barge-in. Idempotent reads complete in the background. Mutations are canceled if the new intent contradicts. TTS and LLM are stopped if the agent was speaking.
Telemetry: what to capture per barge-in
Every barge-in event should produce a typed span with these attributes.
| Attribute | Type | Notes |
|---|---|---|
| barge_in_event_id | string | UUID for the event |
| turn_id | string | The agent turn that was interrupted |
| vad_confidence | float | 0-1, classifier output |
| vad_duration_ms | int | Length of the sustained voice before trigger |
| energy_dbfs | float | Peak energy during the trigger window |
| tts_flush_ms | int | Time from trigger to TTS stopped |
| llm_cancel_ms | int | Time from trigger to LLM canceled |
| total_handle_ms | int | End-to-end barge-in handling |
| agent_utterance_truncated_at | int | Word index where the agent was cut off |
| in_flight_tools_canceled | list | Tool names that were canceled |
| in_flight_tools_completed | list | Tool names that finished anyway |
| false_barge_in | bool | Set after manual or automated review |
The full event lets you slice by VAD confidence, by latency component, by tool, by turn type. Plot the weekly P95 of each latency attribute alongside the false-barge-in rate. Investigate any regression above 30ms in a latency attribute or 1% in the rate.
```python
from fi_instrumentation import FITracer

# tracer_provider is assumed to be configured elsewhere at application startup.
tracer = FITracer(tracer_provider.get_tracer(__name__))


def record_barge_in(state, event):
    """Emit one span per barge-in event with its latency breakdown."""
    with tracer.start_as_current_span(
        "barge_in",
        attributes={
            "barge_in_event_id": event.id,
            "turn_id": state.current_turn,
            "vad_confidence": event.vad_confidence,
            "tts_flush_ms": event.tts_flush_ms,
            "llm_cancel_ms": event.llm_cancel_ms,
            "total_handle_ms": event.total_ms,
            "in_flight_tools_canceled": event.canceled_tools,
        },
    ) as span:
        # Flag events that exceed the 150ms end-to-end budget.
        if event.total_ms > 150:
            span.set_attribute("budget_exceeded", True)
```
The span data feeds the observability stack. The Error Feed surface auto-clusters barge-in events into named failure modes (false barge-in on background noise, slow TTS flush on specific provider, repeated barge-in on a specific intent). The cluster gets a root cause and a quick fix written automatically.
Turn-taking gap: the visible metric
The metric the user actually perceives is the turn-taking gap: the time from the end of the user’s speech to the beginning of the agent’s audio. Plot it as P95.
```python
def record_turn_taking_gap(state):
    # Timestamps are assumed to be in seconds; convert the difference to milliseconds.
    gap_ms = (state.agent_first_audio_ts - state.user_end_of_speech_ts) * 1000
    with tracer.start_as_current_span(
        "turn_taking_gap",
        attributes={
            "turn_id": state.current_turn,
            "gap_ms": gap_ms,
            "use_case": state.use_case,
        },
    ):
        pass
```
Targets by use case:
| Use case | Turn-taking gap P95 target |
|---|---|
| Sales outbound | 250-350ms |
| Support deflection | 350-450ms |
| Receptionist | 300-400ms |
| IVR replacement | 250-350ms |
| Clinical intake | 500-700ms with thinking signal |
| Financial advice | 500-800ms with thinking signal |
| Legal triage | 500-800ms with thinking signal |
| Casual conversation | 300-500ms |
Outside the target range, the conversation feels wrong. Above the range it feels slow. Below the range it feels rushed.
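Checking a batch of recorded gaps against these targets is a short computation. The sketch below uses a nearest-rank P95 and copies three rows from the table; the dictionary keys and function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

# (low, high) P95 targets in milliseconds, taken from the table above.
GAP_TARGETS_MS: Dict[str, Tuple[int, int]] = {
    "sales_outbound": (250, 350),
    "support_deflection": (350, 450),
    "clinical_intake": (500, 700),
}


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def assess_gap(use_case: str, gaps_ms: Sequence[float]) -> str:
    """Return 'slow', 'rushed', or 'in_range' for a batch of gap measurements."""
    low, high = GAP_TARGETS_MS[use_case]
    observed = p95(gaps_ms)
    if observed > high:
        return "slow"
    if observed < low:
        return "rushed"
    return "in_range"
```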
A reference barge-in pipeline
A production-grade barge-in pipeline for a US sales voice agent:
- VAD: Silero VAD with energy gate at -40 dBFS, classifier confidence threshold 0.75, minimum duration 250ms.
- TTS provider: Cartesia Sonic with WebSocket cancellation, 200ms buffer.
- LLM provider: GPT-4o-mini with AbortController support, 60ms average cancel latency.
- State: in-memory conversation state with stash for interrupted utterance and in-flight tools.
- Telemetry: traceAI-pipecat instrumenting the Pipecat pipeline, span per barge-in event.
- Eval: conversation_coherence and conversation_resolution scoring multi-turn flow.
- Guardrails: Future AGI Protect model family inline, sub-100ms inline on Gemma 3n plus LoRA-trained adapters.
Resulting metrics:
- Barge-in success rate: 97.8%.
- False-barge-in rate: 1.4%.
- TTS flush P95: 54ms.
- LLM cancel P95: 38ms.
- Total barge-in handle P95: 142ms.
- Turn-taking gap P95: 310ms.
The pipeline was tuned over 4-6 weeks of production traffic. The biggest wins came from the minimum-duration guard (cut false-barge-in by 60%), the WebSocket cancellation (cut TTS flush from 120ms to 55ms), and the AbortController upgrade (cut LLM cancel from 90ms to 38ms).
Common failure modes and fixes
Six failure modes show up in production barge-in systems.
1. The whisper problem. Some users speak softly. Energy stays below the threshold and barge-in never fires. The user is talking, the agent keeps speaking, and the call goes off the rails. Fix: lower the energy threshold (try -50 dBFS) for users where the audio profile is consistently quiet. Auto-tune on session start by sampling the user’s typical energy.
2. The TV problem. Background TV audio that sounds like speech triggers VAD. The agent cuts itself off repeatedly. Fix: train a custom CNN on call audio containing TV background. Or accept the false-barge-in tradeoff and surface it to the operations team for follow-up tuning.
3. The codec switch problem. When the call transfers between codecs (e.g., from Opus to G.711 mid-call), the VAD has to re-tune. Fix: monitor for codec changes via the gateway, recalibrate the energy threshold on transition.
4. The cancel-and-restart loop. When the cancel-and-restart rate exceeds 10%, the agent feels twitchy. Each barge-in is followed by another barge-in 200ms later as the agent restarts. Fix: extend the minimum-duration guard to 350ms, raise the classifier threshold to 0.85, and require a hard silence (300ms) before the agent re-speaks.
5. The mid-tool interrupt regression. A backend tool gets slower and the in-flight tool window grows. Users start interrupting mid-tool more often. The agent’s response quality degrades because tools are getting canceled. Fix: cap tool budgets, prefetch tools on high-confidence intent, and surface the “still working on it” thinking signal proactively.
6. The accent drift. The VAD was tuned on American English speakers. Indian English speakers, with different prosody, trigger different patterns. False-barge-in rises on the Indian cohort. Fix: regional VAD tuning. Train per-region classifiers or per-region threshold tables.
Each failure mode has a clean fix once you can measure it. The measurement infrastructure is the prerequisite. Without span-level telemetry on every barge-in event, the failures stay invisible.
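As one example, the session-start auto-tune suggested for the whisper problem could look like the sketch below. The 10 dB margin, the -40 dBFS fallback, and the function name are assumptions; the clamp reuses the -50 to -30 dBFS telephony range quoted earlier.

```python
import statistics
from typing import Sequence


def tune_energy_threshold(
    speech_levels_dbfs: Sequence[float],
    margin_db: float = 10.0,      # assumed headroom below the caller's typical level
    floor_dbfs: float = -50.0,
    ceiling_dbfs: float = -30.0,
    default_dbfs: float = -40.0,  # fallback when no speech has been sampled yet
) -> float:
    """Pick an energy gate a fixed margin below the caller's median speech level."""
    if not speech_levels_dbfs:
        return default_dbfs
    typical = statistics.median(speech_levels_dbfs)
    # Clamp so a very quiet or very loud caller cannot push the gate out of range.
    return max(floor_dbfs, min(ceiling_dbfs, typical - margin_db))
```

A soft-spoken caller whose speech sits around -41 dBFS ends up with the gate at the -50 dBFS floor instead of a default that would never fire.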
Eval rubrics for turn-taking quality
Beyond the per-event metrics, run multi-turn eval rubrics on conversation flow.
- conversation_coherence scores whether turns build on each other coherently. Frequent barge-ins lower coherence when context preservation breaks.
- conversation_resolution scores whether the conversation reached resolution. Slow barge-in and dropped tool calls lower resolution.
- task_completion scores whether the agent completed the task. Tool cancellation on mutations can drop task completion.
- is_concise scores whether responses are appropriately concise. Long agent utterances increase barge-in surface area.
- is_polite scores politeness. Cutting off the user (failed barge-in) or talking over them lowers this score.
All five rubrics ship in ai-evaluation as part of the 70+ built-in eval templates. Apache 2.0. Run them on every simulation batch and on a sampled fraction of production calls.
Future AGI on barge-in and turn-taking
traceAI captures every barge-in event as a typed span with VAD confidence, energy, TTS flush time, LLM cancel time, and in-flight tool state. 30+ documented integrations across Python and TypeScript, including the dedicated traceAI-pipecat and traceai-livekit packages that instrument the voice frameworks teams actually use in production. OpenInference-compatible spans. Apache 2.0.
ai-evaluation ships 70+ built-in eval templates including conversation_coherence, conversation_resolution, task_completion, is_polite, and is_concise that measure the conversational quality barge-in affects. Custom evaluators authored by an in-product agent. Per-route eval gating so async eval never blocks the critical voice path. Programmatic eval API for configure plus re-run. Apache 2.0.
Future AGI Protect runs sub-100ms inline on Gemma 3n foundation with LoRA-trained adapters per safety dimension per arXiv 2510.13351. Multi-modal across text, image, and audio. ProtectFlash gives a single-call binary classifier path for the absolute lowest-latency surface. Inline safety fits inside the same turn budget as the rest of the pipeline.
Error Feed auto-clusters barge-in failures into named issues with auto-written root cause, quick fix, and long-term recommendation. False-barge-in on background noise, slow TTS flush on a specific provider, repeated barge-in on a specific intent each get their own cluster instead of drowning in 10,000 raw spans.
Agent Command Center hosts the whole stack with RBAC, SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page. AWS Marketplace, multi-region hosted, BYOC for regulated workloads. Native voice observability for Vapi, Retell, and LiveKit via provider API key plus Assistant ID, no SDK required.
The simulation surface for barge-in policy validation is Simulate: 18 pre-built personas plus unlimited custom personas with controls for gender, age, location, accent, background noise, and multilingual. Auto-generated branching scenarios in Workflow Builder produce hundreds of conversation paths from a single description. Error Localization pinpoints the exact turn where a barge-in policy broke. The combination is what lets a team test barge-in across thousands of synthetic calls before launch.
Where this still falls short
VAD tuning is partly hand-rolled. The energy threshold, classifier threshold, and minimum-duration guard all need tuning per deployment environment. There’s no fully-automatic tuner that hits production targets on day one. We surface the metrics and the cluster patterns so the tuning cycle is days, not months.
Custom voice classifiers need labeled data. Training a custom CNN for your audio profile requires labeled call audio. Smaller teams might not have the data volume to justify it. The Silero VAD default works for most cases; the custom CNN is an upgrade path for the cohorts that need it.
Mid-tool interrupt policy is opinionated. The default policy (idempotent reads complete, mutations cancel on contradiction) is what most teams want. Teams with unusual tool semantics need to override it. The override surface is in code, not in config, which is friction. That friction is intentional: tool semantics should be in code review.
Related reading
- Sub-500ms Voice AI: The Complete Latency Budget Guide for 2026
- How to Measure Voice AI Latency: The Complete 2026 Guide
- How to Optimize Voice Agent Latency: 12 Techniques That Work in 2026
- How to Implement Voice AI Observability in 2026
Sources and references
- Future AGI Protect: arXiv 2510.13351
- OpenInference span specification: github.com/Arize-ai/openinference
- Future AGI trust and compliance: futureagi.com/trust
- WebRTC VAD reference: WebRTC project documentation
- Silero VAD: Silero AI Team open-source release notes
- Anthropic streaming and cancellation: anthropic.com/docs
- OpenAI streaming cancellation: platform.openai.com/docs