Voice AI Barge-In and Turn-Taking: A 2026 Implementation Guide

Implement barge-in and turn-taking that feels human. VAD tuning, false-barge-in defense, context preservation, and per-stage latency telemetry for 2026.


A voice agent that talks over the caller is a voice agent that loses the call. Barge-in is the engineering surface that fixes it. Turn-taking is the conversational policy that decides when each speaker holds the floor. Both have a strict latency budget, both fail in distinct ways, and both can be measured. This guide walks through the production-grade implementation in 2026: VAD pipeline, false-barge-in defense, context preservation across interrupts, and the per-stage telemetry you need to keep all of it tuned.

TL;DR: the barge-in pipeline

A working barge-in implementation has five components.

  1. Voice Activity Detection (VAD) continuously scores the incoming audio stream while the agent is speaking. Energy threshold plus a voice classifier plus a minimum-duration guard.
  2. Barge-in trigger fires when VAD confidence stays above the threshold for the minimum window. Below the threshold, the agent keeps speaking.
  3. TTS flush stops the audio output stream in under 60ms. Queued audio is dropped from the buffer, the TTS WebSocket closes, and the playback device returns to a listening state.
  4. LLM cancel terminates the in-flight LLM generation in under 40ms. Any partial response is discarded or stashed depending on the policy.
  5. Context preservation records the agent’s interrupted utterance, the in-flight tool calls, and the conversational state so the next turn can use them.

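The combined budget for a natural-feeling barge-in is under 150ms from the barge-in trigger to the agent yielding the floor (TTS flushed, LLM canceled). The turn-taking gap, measured from end-of-user-speech to the first audio of the agent’s next turn, is 200-450ms for most use cases and longer for deliberate-pace domains such as clinical, financial, and legal calls.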

Why barge-in is the hardest part of a voice agent

Most voice agent failures don’t show up in the LLM response quality. They show up in the conversational flow. The agent finishes a sentence the caller already interrupted. The agent’s TTS keeps playing while the caller is asking a new question. The caller has to repeat themselves. The conversation feels stilted.

Three reasons barge-in is hard:

The signal is noisy. Real phone calls have hold music, side conversations, coughs, paper rustling, and call-center background hum. A VAD that fires on any audio above an energy threshold will false-trigger constantly.

The latency budget is unforgiving. A 200ms TTS flush feels broken. The agent has to detect the barge-in, stop the audio, cancel the LLM, and yield the floor in under 150ms total. Each component has 30-50ms.

The state is in flight. When the user barges in, the agent has audio in the playback buffer, an LLM call still generating tokens, and possibly a tool call in progress. All three have to be handled cleanly.

The combination is what makes barge-in the failure mode users notice. They might not remember whether the agent answered the question correctly. They will remember that the agent kept talking over them.

The VAD pipeline

The Voice Activity Detection pipeline is the entry point. It runs continuously while the agent is speaking and produces a per-frame confidence score that the incoming audio contains a user voice.

A production-grade VAD pipeline has three layers: an energy threshold, a voice classifier, and a minimum-duration guard.

One caveat before the layers: pure VAD handles backchanneling badly. Backchannels are the “uh-huh / mhm” signals listeners emit to show engagement without taking the turn, and a pure VAD treats them as either silence or a full barge-in attempt. Neither is right. The 2026 production stack is migrating toward dedicated turn-taking models that classify backchannel vs. barge-in vs. continued silence as a learned signal instead of an energy threshold. Pipecat’s SmartTurnAnalyzer, LiveKit’s TurnDetector, and Vapi’s endpointing controls are the production examples. The VAD pipeline below is still the dominant pattern; the turn-taking-model layer is the migration target.

Layer 1: energy threshold

The first cut is an audio energy threshold. Frames below -45 dBFS are treated as silence and never trigger barge-in. Frames above -35 dBFS are candidates. The exact threshold depends on the audio codec, the microphone, and the call leg. A telephony-grade VAD typically tunes between -50 and -30 dBFS based on the codec.

The energy threshold is cheap and fast (microseconds per frame) but coarse. It does not distinguish speech from noise. That’s the next layer’s job.

Layer 2: voice classification

Above the energy threshold, a voice classifier decides whether the audio is human speech. Three common options:

  • Silero VAD. A small neural network trained on multilingual speech. Streaming-friendly, low CPU cost, false-positive rate around 5-8% on noisy phone audio.
  • WebRTC VAD. Classic GMM-based classifier shipped with WebRTC. Faster than Silero but higher false-positive rate, especially on music or call-center hum.
  • Custom CNN. A small convolutional network trained on your own call audio. The right choice when you have a niche audio profile (e.g., heavy regional dialect or specific background noise).

Most production voice teams use Silero VAD as the default and switch to a custom CNN when the false-barge-in rate stays above the target. The classifier produces a confidence score (0-1) per frame.

Layer 3: minimum-duration guard

A 50ms cough above the energy threshold should not trigger barge-in. The minimum-duration guard requires sustained voice across multiple frames before firing. Typical values: 200-300ms of sustained voice with classifier confidence above 0.7.

The guard adds latency to the barge-in path (an extra 200-300ms before the trigger can fire). The tradeoff is worth it: the false-barge-in rate drops 60-80%.

The full VAD pipeline output is a binary signal: barge-in YES or NO at each frame. The trigger is conservative by design. False-barge-in is a worse failure than a slightly slow barge-in.
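The three layers can be sketched as a single gate object. This is a minimal illustration, not a reference implementation: the `BargeInGate` class, the `classifier` callable (standing in for Silero VAD, WebRTC VAD, or a custom CNN), the 20ms frame size, and the single -40 dBFS energy floor are all assumptions chosen from the ranges above.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


def frame_dbfs(samples: Sequence[int], full_scale: int = 32768) -> float:
    """RMS level of one 16-bit PCM frame in dBFS (0 dBFS = full scale)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / full_scale) if rms > 0 else float("-inf")


@dataclass
class BargeInGate:
    # Hypothetical classifier hook: takes a frame, returns 0-1 speech confidence.
    classifier: Callable[[Sequence[int]], float]
    energy_floor_dbfs: float = -40.0
    confidence_threshold: float = 0.7
    min_duration_ms: int = 250
    frame_ms: int = 20
    _voiced_ms: int = 0

    def process(self, samples: Sequence[int]) -> bool:
        """Return True once voiced audio has been sustained for min_duration_ms."""
        voiced = (
            frame_dbfs(samples) >= self.energy_floor_dbfs  # layer 1: energy gate
            and self.classifier(samples) >= self.confidence_threshold  # layer 2
        )
        # Layer 3: any unvoiced frame resets the sustained-voice counter.
        self._voiced_ms = self._voiced_ms + self.frame_ms if voiced else 0
        return self._voiced_ms >= self.min_duration_ms
```

The classifier is only called when the energy gate passes, which keeps the cheap check in front of the expensive one. A short cough raises the counter for a few frames and then resets it, so it never reaches the trigger.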

TTS flush: the 60ms target

When barge-in fires, the TTS audio has to stop within 60ms. Anything slower feels like the agent ignored the interruption.

Three engineering details matter.

Streaming TTS providers must support cancellation. Cartesia Sonic and ElevenLabs Turbo both support mid-stream cancellation via WebSocket close. PlayHT and OpenAI TTS have varying support. Test cancellation latency before picking a provider.

The audio buffer must drain fast. Most voice gateways buffer 200-400ms of TTS audio for jitter resilience. On barge-in, the buffer has to flush, which means dropping the queued audio. Implement a flush() method on the playback path that drops pending packets instead of waiting for them to drain.

The playback device has to release fast. On WebRTC the audio track has to stop or pause. On telephony the codec frame queue has to clear. Either way, the device-level state change has its own 10-20ms cost.

A 60ms TTS flush target breaks down as: 10ms VAD-to-flush dispatch, 20ms buffer drain, 20ms WebSocket close, 10ms device release. Each component is independently measurable. If the total slips above 80ms, profile each component to find the offender.
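The buffer-drain step is the one most often implemented as "wait for the queue to empty". A minimal sketch of the drop-on-flush alternative is below. `PlaybackBuffer` and `timed_flush` are hypothetical names, and a real gateway would hook this into its own packet scheduler.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class PlaybackBuffer:
    """Jitter buffer for TTS audio packets with a drop-everything flush."""

    def __init__(self) -> None:
        self._packets: Deque[bytes] = deque()
        self.accepting = True

    def push(self, packet: bytes) -> None:
        # Packets that arrive after a flush (late TTS frames) are discarded too.
        if self.accepting:
            self._packets.append(packet)

    def pop(self) -> Optional[bytes]:
        """Next packet for the playback device, or None if nothing is queued."""
        return self._packets.popleft() if self._packets else None

    def flush(self) -> int:
        """Drop all queued audio immediately; return the number of packets dropped."""
        dropped = len(self._packets)
        self._packets.clear()
        self.accepting = False
        return dropped

    def resume(self) -> None:
        """Accept audio again once the next agent turn starts."""
        self.accepting = True


def timed_flush(buffer: PlaybackBuffer) -> Tuple[int, float]:
    """Flush and report elapsed milliseconds, for the buffer-drain span attribute."""
    start = time.perf_counter()
    dropped = buffer.flush()
    return dropped, (time.perf_counter() - start) * 1000
```

Rejecting pushes after a flush matters because the TTS provider may keep sending frames for a few milliseconds after the cancel request goes out.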

LLM cancel: the 40ms target

While TTS is flushing, the LLM that produced the in-flight audio is still generating tokens. Letting it finish costs money and wastes compute. Canceling it cleanly is a 40ms problem.

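LLM cancellation works at the HTTP layer: the client aborts the streaming request. Client libraries for most major providers (OpenAI, Anthropic, Google, Bedrock) expose this through AbortController in JavaScript or an equivalent stream-close or task-cancel call in other languages. Aborting closes the stream, and the server-side generation typically terminates shortly after.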

Three caveats:

  • Prefix-cached generations sometimes complete anyway. When the LLM is using a cached prefix, the first 100-200 tokens come back nearly instantly. Cancellation after those tokens have shipped is a no-op. Accept that some short generations will complete despite cancel.
  • Tool calls in the partial response need handling. If the LLM emitted a tool call before cancellation, decide: do you execute it, ignore it, or stash it? The right choice depends on the tool. Idempotent reads (balance lookup, account status) are safe to execute. Mutations (refund, transfer) should be canceled.
  • Track cancel latency as a metric. Network plus server-side processing adds 20-30ms to the cancel call. Add a span attribute for llm_cancel_ms and plot the P95. Anything above 60ms is a problem.
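In an asyncio-based pipeline, the cancel path and its latency measurement might look like the sketch below. `fake_token_stream` is a stand-in for a real streaming LLM call, and `cancel_generation` and `demo` are hypothetical helpers; with a real provider SDK the cancellation would also need to close the underlying HTTP stream.

```python
import asyncio
import time


async def fake_token_stream(out: list, delay_s: float = 0.01) -> None:
    """Stand-in for a streaming LLM call; appends a token every delay_s seconds."""
    i = 0
    while True:
        await asyncio.sleep(delay_s)
        out.append(f"tok{i}")
        i += 1


async def cancel_generation(task: asyncio.Task) -> float:
    """Cancel an in-flight generation task and return the cancel latency in ms."""
    start = time.perf_counter()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the stream was torn down
    return (time.perf_counter() - start) * 1000


async def demo() -> tuple:
    tokens: list = []
    task = asyncio.create_task(fake_token_stream(tokens))
    await asyncio.sleep(0.05)  # let a few tokens stream
    llm_cancel_ms = await cancel_generation(task)
    partial = list(tokens)  # stash or discard, depending on policy
    await asyncio.sleep(0.03)  # confirm nothing arrives after the cancel
    return partial, tokens, llm_cancel_ms, task.cancelled()
```

The returned latency is the value that would be recorded on the barge-in span. The snapshot taken right after the cancel is the partial response that the policy either discards or stashes.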

Context preservation across interrupts

The hardest part of barge-in is what happens after. The user interrupted; now what does the agent know?

Three patterns work.

Pattern 1: stash the partial utterance

The agent’s interrupted utterance gets stashed in conversation state with a flag. The next LLM turn sees previous_agent_utterance_interrupted: "Your account balance is...". The LLM can decide whether to repeat, continue, or start fresh based on the new user input.

For most short turns the LLM correctly ignores the stashed utterance. For longer interruptions (e.g., the agent was reading a multi-step procedure), the stash lets the agent resume from the right place if asked.

Pattern 2: handle in-flight tool calls

If a tool call fired before the barge-in, three options:

  1. Cancel the tool call. Right when the new intent contradicts the in-flight tool. User asked for balance, agent fired get_balance(), user interrupts with “actually, I want to transfer money.” Cancel the balance lookup.
  2. Finish the tool call in the background. When the tool result might still be useful. User asked for balance, agent fired get_balance(), user interrupts with “also, what’s my last transaction?” Finish the balance lookup and pass both results to the next LLM call.
  3. Pause and resume. Rare, but useful when the tool is expensive and the interrupt is brief. User asked for full statement, agent fired generate_pdf(), user coughs and pauses. Pause the PDF generation, resume after the user resumes.

The default policy should be option 2 for read-only tools and option 1 for mutations. Idempotent reads can complete and the result feeds the next turn. Mutations must wait for explicit confirmation.

Pattern 3: track conversational state

Conversation state should include:

  • agent_speaking: bool. True while TTS is active.
  • agent_last_utterance: string. The most recent agent message (interrupted or complete).
  • in_flight_tool_calls: list. Tool calls fired but not yet resolved.
  • interrupt_count: int. Number of barge-ins so far in the conversation.
  • last_interrupt_turn: int. The turn index of the most recent interruption.

The LLM gets this state as part of every turn’s context. A high interrupt count signals the agent is being long-winded. A repeated interrupt at the same point signals a prompt problem.
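The state fields above, together with the stash from Pattern 1, fit in one small dataclass. This is a sketch under assumptions: the `ConversationState` class, the `record_interrupt` and `to_prompt_context` method names, and the `current_turn` counter are illustrative, not part of any specific framework.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ConversationState:
    agent_speaking: bool = False
    agent_last_utterance: str = ""
    in_flight_tool_calls: List[str] = field(default_factory=list)
    interrupt_count: int = 0
    last_interrupt_turn: Optional[int] = None
    current_turn: int = 0
    # Pattern 1 stash: what the agent had actually said when it was cut off.
    previous_agent_utterance_interrupted: Optional[str] = None

    def record_interrupt(self, spoken_so_far: str) -> None:
        """Update state when a barge-in is confirmed."""
        self.agent_speaking = False
        self.agent_last_utterance = spoken_so_far
        self.previous_agent_utterance_interrupted = spoken_so_far
        self.interrupt_count += 1
        self.last_interrupt_turn = self.current_turn

    def to_prompt_context(self) -> dict:
        """Plain dict handed to the LLM as part of the next turn's context."""
        return asdict(self)
```

For example, calling `record_interrupt("Your account balance is...")` on turn 3 yields a context where `interrupt_count` is 1 and `last_interrupt_turn` is 3, which is enough for the next LLM call to decide whether to resume or start fresh.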

Turn-taking: the policy layer

Barge-in is the mechanism. Turn-taking is the policy. The policy decides:

  • When is the agent’s turn over? End-of-sentence? End-of-thought? End-of-paragraph?
  • How long should the agent wait before re-speaking? Immediately? After a 300ms pause?
  • What counts as the user’s turn ending? Silence threshold? Sentence boundary?
  • Can the agent interject for clarification? Or does it always wait for the user to finish?

Three turn-taking policies work in 2026 voice agents.

Policy A: strict end-of-turn

The agent waits for the user to finish completely before responding. End-of-turn detection uses a silence threshold (typically 800-1200ms of silence after the user’s last word) combined with semantic completeness (the LLM scoring whether the partial transcript looks complete).

Strict end-of-turn produces composed, considered conversations. The latency is higher (you pay the silence threshold on every turn). The fit is clinical, financial, legal, and complex troubleshooting.

Policy B: progressive turn-taking

The agent starts thinking on user speech partials, but only commits to speaking when the user pauses. STT first-partial fires the LLM with the partial transcript. The LLM streams a response in the background. If the user pauses (300-500ms silence), the agent speaks the buffered response. If the user keeps going, the agent cancels and restarts.

Progressive turn-taking is the right fit for sales, support, and IVR. Latency drops 200-400ms compared to strict end-of-turn. The cancel-and-restart rate stays below 5% if the silence threshold is tuned correctly.

Policy C: aggressive interjection

The agent interjects mid-user-turn for confirmations or backchannels (“ok”, “got it”, “I see”). The interjection is short (100-200ms of audio) and the user can talk over it. This pattern feels human in some contexts and rude in others.

Aggressive interjection works well for narrative or therapy-style agents where backchannels signal active listening. It works poorly for transactional agents where the user is conveying specific information and the interjection interrupts.

Pick the policy per use case. The right policy is enforceable in code: the silence threshold, partial-commit threshold, and interjection threshold are all tunable parameters.
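As a rough illustration of what "enforceable in code" can mean, policies A and B reduce to a handful of parameters and one decision function. The `TurnPolicy` type, its field names, and the specific values (1000ms and 400ms, picked from inside the ranges above) are assumptions; Policy C's interjection threshold is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnPolicy:
    name: str
    silence_threshold_ms: int  # silence required before the agent may speak
    start_llm_on_partials: bool  # Policy B: start generating on STT partials
    require_semantic_complete: bool  # Policy A: transcript must also look complete


STRICT = TurnPolicy("strict", 1000, False, True)
PROGRESSIVE = TurnPolicy("progressive", 400, True, False)


def should_speak(policy: TurnPolicy, silence_ms: int, transcript_complete: bool) -> bool:
    """Decide whether the agent may take the floor right now."""
    if silence_ms < policy.silence_threshold_ms:
        return False
    return transcript_complete or not policy.require_semantic_complete
```

Under these values, 450ms of silence is enough for the progressive policy to commit the buffered response, while the strict policy keeps waiting for both a longer pause and a complete-looking transcript.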

False-barge-in: the production failure mode

The most common production complaint about barge-in is the agent cutting itself off when the user wasn’t actually trying to interrupt. Three failure patterns dominate.

1. Background noise triggers VAD. The caller is in a noisy environment (coffee shop, open office, busy household). Ambient noise pushes the energy above threshold and the classifier marks it as speech. The agent cuts off. Fix: tune the energy threshold for the deployment environment, or add a custom CNN trained on background-noise patterns from your call audio.

2. Side conversations trigger VAD. The caller is talking to someone else (handing the phone over, asking a partner a question). The agent cuts off mid-sentence and the original caller is confused. Fix: use a speaker-diarization signal if available (some voice gateways expose it), or accept some false-barge-in on side conversations as a deliberate tradeoff.

3. Codec artifacts trigger VAD. Some telephony codecs produce artifacts that look like speech to the classifier. G.711 is fine; some compressed codecs (G.729, Opus at low bitrates) introduce more noise. Fix: prefer higher-bitrate codecs where possible, or add codec-specific tuning to the VAD.

The combined false-barge-in rate target is under 2%. Above 5% the agent feels broken. Below 1% the VAD is probably too conservative and is missing real interruptions.
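Those bands are easy to turn into an automated check over reviewed events. The function names and band labels below are hypothetical; the input is assumed to be the `false_barge_in` flag from each reviewed barge-in event.

```python
from typing import Sequence


def false_barge_in_rate(reviewed_flags: Sequence[bool]) -> float:
    """Fraction of reviewed barge-in events that were flagged as false."""
    if not reviewed_flags:
        raise ValueError("need at least one reviewed event")
    return sum(1 for flag in reviewed_flags if flag) / len(reviewed_flags)


def classify_rate(rate: float) -> str:
    """Map a false-barge-in rate onto the bands described above."""
    if rate > 0.05:
        return "broken"
    if rate >= 0.02:
        return "above_target"
    if rate < 0.01:
        return "possibly_too_conservative"
    return "on_target"
```

A rate below 1% is flagged instead of celebrated because it often means real interruptions are being missed, so it is worth cross-checking against the barge-in success rate.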

Mid-tool-call interruption: the worst case

The worst barge-in scenario is when the user interrupts mid-tool-call. The agent is silent (still waiting for the tool to return), the user gets impatient or distracted, and the user starts talking. The VAD fires barge-in but there’s no TTS to flush. What now?

Three rules.

Rule 1: respect the tool result. If the tool call is idempotent and read-only, let it finish. The result might be useful for the next turn.

Rule 2: surface the partial state. If the user’s new utterance is “are you still there?”, the agent should respond with what it’s doing: “I’m pulling your account, give me a moment.” This is the visible thinking signal pattern.

Rule 3: cancel on contradiction. If the user’s new utterance contradicts the in-flight tool (different account number, different request), cancel the tool. Don’t waste compute and don’t apply stale results.

The pattern in code:

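async def handle_barge_in(state, new_partial):
    # intent_contradicts, flush_tts, and cancel_llm are application-specific helpers.
    for tool in state.in_flight_tool_calls:
        if tool.is_idempotent_read:
            # Read-only and idempotent: let it finish in the background.
            continue
        if intent_contradicts(new_partial, tool):
            # The new intent invalidates this call: cancel it.
            await tool.cancel()
        # Otherwise keep it running; the result might still be useful.

    if state.agent_speaking:
        # Stop the audio first (the user hears it), then stop token generation.
        await flush_tts()
        await cancel_llm()

    state.interrupt_count += 1
    state.last_interrupt_turn = state.current_turn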

The handler runs on every confirmed barge-in. Idempotent reads complete in the background. Mutations are canceled if the new intent contradicts. TTS and LLM are stopped if the agent was speaking.

Telemetry: what to capture per barge-in

Every barge-in event should produce a typed span with these attributes.

Attribute | Type | Notes
--- | --- | ---
barge_in_event_id | string | UUID for the event
turn_id | string | The agent turn that was interrupted
vad_confidence | float | 0-1, classifier output
vad_duration_ms | int | Length of the sustained voice before trigger
energy_dbfs | float | Peak energy during the trigger window
tts_flush_ms | int | Time from trigger to TTS stopped
llm_cancel_ms | int | Time from trigger to LLM canceled
total_handle_ms | int | End-to-end barge-in handling
agent_utterance_truncated_at | int | Word index where the agent was cut off
in_flight_tools_canceled | list | Tool names that were canceled
in_flight_tools_completed | list | Tool names that finished anyway
false_barge_in | bool | Set after manual or automated review

The full event lets you slice by VAD confidence, by latency component, by tool, by turn type. Plot P95 weekly per attribute. Investigate any regression above 30ms or 1%.

from fi_instrumentation import FITracer

# tracer_provider is assumed to be configured elsewhere at application startup.
tracer = FITracer(tracer_provider.get_tracer(__name__))

def record_barge_in(state, event):
    # One span per confirmed barge-in; attribute names match the table above.
    with tracer.start_as_current_span(
        "barge_in",
        attributes={
            "barge_in_event_id": event.id,
            "turn_id": state.current_turn,
            "vad_confidence": event.vad_confidence,
            "tts_flush_ms": event.tts_flush_ms,
            "llm_cancel_ms": event.llm_cancel_ms,
            "total_handle_ms": event.total_ms,
            "in_flight_tools_canceled": event.canceled_tools,
        },
    ) as span:
        # Flag events that blew the 150ms end-to-end handling budget.
        if event.total_ms > 150:
            span.set_attribute("budget_exceeded", True)

The span data feeds the observability stack. The Error Feed surface auto-clusters barge-in events into named failure modes (false barge-in on background noise, slow TTS flush on specific provider, repeated barge-in on a specific intent). The cluster gets a root cause and a quick fix written automatically.
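The weekly P95 review and the 30ms regression rule mentioned above need very little code once the span attributes are exported. The sketch below assumes the latency attributes have already been pulled into plain lists keyed by attribute name; `p95` uses the nearest-rank method, and `latency_regressions` is a hypothetical helper.

```python
from typing import Dict, Sequence


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("p95 of an empty sample is undefined")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def latency_regressions(
    last_week: Dict[str, Sequence[float]],
    this_week: Dict[str, Sequence[float]],
    threshold_ms: float = 30.0,
) -> Dict[str, float]:
    """Attributes whose P95 grew by more than threshold_ms week over week."""
    regressions: Dict[str, float] = {}
    for attr, current in this_week.items():
        previous = last_week.get(attr)
        if not previous or not current:
            continue  # nothing to compare against
        delta = p95(current) - p95(previous)
        if delta > threshold_ms:
            regressions[attr] = delta
    return regressions
```

Integer arithmetic is used for the rank so the percentile index does not depend on floating-point rounding.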

Turn-taking gap: the visible metric

The metric the user actually perceives is the turn-taking gap: the time from the end of the user’s speech to the beginning of the agent’s audio. Plot it as P95.

def record_turn_taking_gap(state):
    gap_ms = (state.agent_first_audio_ts - state.user_end_of_speech_ts) * 1000
    with tracer.start_as_current_span(
        "turn_taking_gap",
        attributes={
            "turn_id": state.current_turn,
            "gap_ms": gap_ms,
            "use_case": state.use_case,
        },
    ):
        pass

Targets by use case:

Use case | Turn-taking gap P95 target
--- | ---
Sales outbound | 250-350ms
Support deflection | 350-450ms
Receptionist | 300-400ms
IVR replacement | 250-350ms
Clinical intake | 500-700ms with thinking signal
Financial advice | 500-800ms with thinking signal
Legal triage | 500-800ms with thinking signal
Casual conversation | 300-500ms

Outside the target range, the conversation feels wrong. Above the range it feels slow. Below the range it feels rushed.
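The targets translate directly into a lookup that a dashboard or alert can use. The dictionary keys and the `classify_gap` function are hypothetical names; the ranges are the ones listed above.

```python
from typing import Dict, Tuple

# (low_ms, high_ms) P95 targets per use case, taken from the list above.
GAP_TARGETS_MS: Dict[str, Tuple[int, int]] = {
    "sales_outbound": (250, 350),
    "support_deflection": (350, 450),
    "receptionist": (300, 400),
    "ivr_replacement": (250, 350),
    "clinical_intake": (500, 700),
    "financial_advice": (500, 800),
    "legal_triage": (500, 800),
    "casual_conversation": (300, 500),
}


def classify_gap(use_case: str, gap_p95_ms: float) -> str:
    """Return 'slow', 'rushed', or 'on_target' for a measured P95 gap."""
    low, high = GAP_TARGETS_MS[use_case]
    if gap_p95_ms > high:
        return "slow"
    if gap_p95_ms < low:
        return "rushed"
    return "on_target"
```

Note that the same 300ms gap is on target for a receptionist agent and rushed for clinical intake, which is why the check has to be keyed by use case.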

A reference barge-in pipeline

A production-grade barge-in pipeline for a US sales voice agent:

  • VAD: Silero VAD with energy gate at -40 dBFS, classifier confidence threshold 0.75, minimum duration 250ms.
  • TTS provider: Cartesia Sonic with WebSocket cancellation, 200ms buffer.
  • LLM provider: GPT-4o-mini with AbortController support, 60ms average cancel latency.
  • State: in-memory conversation state with stash for interrupted utterance and in-flight tools.
  • Telemetry: traceAI-pipecat instrumenting the Pipecat pipeline, span per barge-in event.
  • Eval: conversation_coherence and conversation_resolution scoring multi-turn flow.
  • Guardrails: Future AGI Protect model family inline, sub-100ms inline on Gemma 3n plus LoRA-trained adapters.

Resulting metrics:

  • Barge-in success rate: 97.8%.
  • False-barge-in rate: 1.4%.
  • TTS flush P95: 54ms.
  • LLM cancel P95: 38ms.
  • Total barge-in handle P95: 142ms.
  • Turn-taking gap P95: 310ms.

The pipeline was tuned over 4-6 weeks of production traffic. The biggest wins came from the minimum-duration guard (cut false-barge-in by 60%), the WebSocket cancellation (cut TTS flush from 120ms to 55ms), and the AbortController upgrade (cut LLM cancel from 90ms to 38ms).

Common failure modes and fixes

Six failure modes show up in production barge-in systems.

1. The whisper problem. Some users speak softly. Energy stays below the threshold and barge-in never fires. The user is talking, the agent keeps speaking, and the call goes off the rails. Fix: lower the energy threshold (try -50 dBFS) for users where the audio profile is consistently quiet. Auto-tune on session start by sampling the user’s typical energy.

2. The TV problem. Background TV audio that sounds like speech triggers VAD. The agent cuts itself off repeatedly. Fix: train a custom CNN on call audio containing TV background. Or accept the false-barge-in tradeoff and surface it to the operations team for follow-up tuning.

3. The codec switch problem. When the call transfers between codecs (e.g., from Opus to G.711 mid-call), the VAD has to re-tune. Fix: monitor for codec changes via the gateway, recalibrate the energy threshold on transition.

4. The cancel-and-restart loop. When the cancel-and-restart rate exceeds 10%, the agent feels twitchy. Each barge-in is followed by another barge-in 200ms later as the agent restarts. Fix: extend the minimum-duration guard to 350ms, raise the classifier threshold to 0.85, and require a hard silence (300ms) before the agent re-speaks.

5. The mid-tool interrupt regression. A backend tool gets slower and the in-flight tool window grows. Users start interrupting mid-tool more often. The agent’s response quality degrades because tools are getting canceled. Fix: cap tool budgets, prefetch tools on high-confidence intent, and surface the “still working on it” thinking signal proactively.

6. The accent drift. The VAD was tuned on American English speakers. Indian English speakers, with different prosody, trigger different patterns. False-barge-in rises on the Indian cohort. Fix: regional VAD tuning. Train per-region classifiers or per-region threshold tables.

Each failure mode has a clean fix once you can measure it. The measurement infrastructure is the prerequisite. Without span-level telemetry on every barge-in event, the failures stay invisible.
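For the whisper problem, the session-start auto-tune could be as simple as the sketch below. It assumes the first few voiced frames of the caller have been measured in dBFS; the function name, the 6 dB margin, and the use of the median are illustrative choices, and the clamp uses the -50 to -30 dBFS telephony range mentioned earlier.

```python
import statistics
from typing import Sequence


def autotune_energy_threshold(
    speech_levels_dbfs: Sequence[float],
    default_dbfs: float = -40.0,
    margin_db: float = 6.0,
    floor_dbfs: float = -50.0,
    ceiling_dbfs: float = -30.0,
) -> float:
    """Pick an energy threshold a fixed margin below the caller's typical level."""
    if not speech_levels_dbfs:
        return default_dbfs  # no sample yet: keep the deployment default
    typical = statistics.median(speech_levels_dbfs)
    # Clamp so one unusual sample cannot push the gate out of the usable range.
    return max(floor_dbfs, min(ceiling_dbfs, typical - margin_db))
```

A soft-spoken caller whose speech sits around -43 dBFS ends up with a -49 dBFS threshold, so their interruptions clear the gate, while a loud caller is capped at -30 dBFS.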

Eval rubrics for turn-taking quality

Beyond the per-event metrics, run multi-turn eval rubrics on conversation flow.

  • conversation_coherence scores whether turns build on each other coherently. Frequent barge-ins lower coherence when context preservation breaks.
  • conversation_resolution scores whether the conversation reached resolution. Slow barge-in and dropped tool calls lower resolution.
  • task_completion scores whether the agent completed the task. Tool cancellation on mutations can drop task completion.
  • is_concise scores whether responses are appropriately concise. Long agent utterances increase barge-in surface area.
  • is_polite scores politeness. Cutting off the user (failed barge-in) or talking over them lowers this score.

All five rubrics ship in ai-evaluation as part of the 70+ built-in eval templates. Apache 2.0. Run them on every simulation batch and on a sampled fraction of production calls.

Future AGI on barge-in and turn-taking

traceAI captures every barge-in event as a typed span with VAD confidence, energy, TTS flush time, LLM cancel time, and in-flight tool state. 30+ documented integrations across Python and TypeScript, including the dedicated traceAI-pipecat and traceai-livekit packages that instrument the voice frameworks teams actually use in production. OpenInference-compatible spans. Apache 2.0.

ai-evaluation ships 70+ built-in eval templates including conversation_coherence, conversation_resolution, task_completion, is_polite, and is_concise that measure the conversational quality barge-in affects. Custom evaluators authored by an in-product agent. Per-route eval gating so async eval never blocks the critical voice path. Programmatic eval API for configure plus re-run. Apache 2.0.

Future AGI Protect runs sub-100ms inline on Gemma 3n foundation with LoRA-trained adapters per safety dimension per arXiv 2510.13351. Multi-modal across text, image, and audio. ProtectFlash gives a single-call binary classifier path for the absolute lowest-latency surface. Inline safety fits inside the same turn budget as the rest of the pipeline.

Error Feed auto-clusters barge-in failures into named issues with auto-written root cause, quick fix, and long-term recommendation. False-barge-in on background noise, slow TTS flush on a specific provider, repeated barge-in on a specific intent each get their own cluster instead of drowning in 10,000 raw spans.

Agent Command Center hosts the whole stack with RBAC, SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page. AWS Marketplace, multi-region hosted, BYOC for regulated workloads. Native voice observability for Vapi, Retell, and LiveKit via provider API key plus Assistant ID, no SDK required.

The simulation surface for barge-in policy validation is Simulate: 18 pre-built personas plus unlimited custom personas with controls for gender, age, location, accent, background noise, and multilingual. Auto-generated branching scenarios in Workflow Builder produce hundreds of conversation paths from a single description. Error Localization pinpoints the exact turn where a barge-in policy broke. The combination is what lets a team test barge-in across thousands of synthetic calls before launch.

Where this still falls short

VAD tuning is partly hand-rolled. The energy threshold, classifier threshold, and minimum-duration guard all need tuning per deployment environment. There’s no fully-automatic tuner that hits production targets on day one. We surface the metrics and the cluster patterns so the tuning cycle is days, not months.

Custom voice classifiers need labeled data. Training a custom CNN for your audio profile requires labeled call audio. Smaller teams might not have the data volume to justify it. The Silero VAD default works for most cases; the custom CNN is an upgrade path for the cohorts that need it.

Mid-tool interrupt policy is opinionated. The default policy (idempotent reads complete, mutations cancel on contradiction) is what most teams want. Teams with unusual tool semantics need to override it. The override surface is in code, not in config, which is friction. That friction is intentional: tool semantics should be in code review.

Sources and references

  • Future AGI Protect: arXiv 2510.13351
  • OpenInference span specification: github.com/Arize-ai/openinference
  • Future AGI trust and compliance: futureagi.com/trust
  • WebRTC VAD reference: WebRTC project documentation
  • Silero VAD: Silero AI Team open-source release notes
  • Anthropic streaming and cancellation: anthropic.com/docs
  • OpenAI streaming cancellation: platform.openai.com/docs

Frequently asked questions

What is barge-in in voice AI and why does it matter?
Barge-in is the agent's ability to stop speaking when the user starts speaking. Without it the agent talks over the caller and the conversation feels robotic. With it the agent yields the floor the same way a human does. In 2026 the production bar for barge-in is a 200-450ms turn-taking gap (depending on use case), a false-barge-in rate under 2%, and a TTS flush time below 60ms. Most production voice failures trace back to one of those three numbers being out of range.
What's the difference between barge-in and turn-taking?
Barge-in is the mechanism. Turn-taking is the policy. Barge-in covers the act of stopping mid-utterance when interrupted. Turn-taking covers when each speaker should hold the floor, when to yield, and when a pause counts as end-of-turn versus a thinking pause. A working voice agent needs both: barge-in handles interruption, turn-taking handles the natural gap between turns. The combination is what makes voice agents feel human in 2026.
What causes false-barge-in and how do you prevent it?
False-barge-in fires when the VAD treats background noise, music, side conversations, or coughs as a real user turn. The agent cuts itself off, the caller is confused. Prevention combines three signals: energy threshold tuning (typically -45 to -35 dBFS), voice-classification VAD (Silero, WebRTC VAD, or a custom CNN that distinguishes speech from noise), and a minimum-duration guard (200-300ms of sustained voice before triggering barge-in). The combined false-barge-in rate target is under 2%.
What does the turn-taking gap look like for a natural voice agent?
Natural human conversation has 200-400ms between turns. Voice agents that target sub-300ms feel snappy but interruptive. Agents that target sub-500ms feel composed. Agents that exceed 600ms feel slow. The right target depends on the use case. Sales and IVR want 250-350ms. Support wants 350-450ms. Clinical wants 500-700ms and financial wants 500-800ms, both with a visible thinking signal. Plot the gap as a P95 metric, not an average; outliers are what users remember.
How do you preserve context when the user interrupts mid-tool-call?
The tool call is in flight when the user barges in. Three options: cancel the tool call and restart from the new intent, finish the tool call in the background and let the next turn use the result, or pause the tool call and resume after the interrupt resolves. The right pattern is usually option two: never lose work, but never block the next turn on the previous turn's tool. Track the in-flight tool result in the conversation state and surface it when relevant. Cancel only when the new intent contradicts the in-flight tool.
Which metrics matter for a barge-in implementation?
Five metrics: barge-in success rate (target above 96%), false-barge-in rate (target under 2%), TTS flush latency (target under 60ms), LLM cancel latency (target under 40ms), turn-taking gap P95 (target 200-450ms by use case). Future AGI traceAI captures all five as span attributes via the dedicated traceAI-pipecat and traceai-livekit packages. Plot the P95 weekly. Any 30ms regression triggers an investigation.
How does Future AGI help debug barge-in failures?
Instrument each barge-in event as a traceAI/OpenInference-compatible span with VAD confidence, energy level, TTS flush time, and LLM cancel time on the attributes. The 70+ built-in eval templates in ai-evaluation include conversation_coherence and conversation_resolution that score multi-turn flow. Error Feed auto-clusters false-barge-in events into named issues with auto-written root cause and quick fix. Future AGI Protect runs sub-100ms inline on Gemma 3n plus LoRA-trained adapters per arXiv 2510.13351, so safety scanning fits inside the same budget the rest of the turn lives in.