What Is Endpointing?

Endpointing is the voice AI process that decides when a speaker has finished a turn so a voice agent can stop listening, send audio to ASR, and start responding. It appears in production traces between voice activity detection, transcription, the LLM turn, and text-to-speech. In agent systems, endpointing controls perceived latency and conversational correctness: too early clips intent, too late creates dead air. FutureAGI tracks endpointing through voice simulations, trace timing, interruption events, and downstream task outcomes.

Why It Matters in Production LLM and Agent Systems

Endpointing failures make voice agents sound slow, rude, or dangerously confident about partial input. If the endpoint fires too early, the agent may hear “I want to cancel…” before the user says “…the appointment, not the policy” and then route the wrong cancellation. If it waits too long, time-to-first-audio p99 grows, callers repeat themselves, and the agent may double-process silence as a new turn.

Developers experience this as non-reproducible LLM behavior, because the transcript only looks complete after the fact. SREs see latency regressions clustered around ASR or TTS even when the LLM is healthy. Product teams see lower completion rates for older callers, mobile networks, noisy rooms, and languages with longer pauses. Compliance teams may find consent, payment-authorization, or escalation language missing because the captured transcript starts after the words that carried the real intent.

The logs usually show early-cutoff transcripts, repeated “sorry, can you repeat that?” turns, high barge-in counts, empty audio frames, or a sawtooth pattern where the user speaks over the first TTS token. Endpointing is especially important in 2026 voice-agent pipelines because each turn can trigger retrieval, tool calls, verification, and follow-up speech. Unlike raw WebRTC VAD thresholding, production endpointing has to respect task risk, latency budgets, and user speaking style.
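
These signatures can be flagged mechanically before any tuning begins. A minimal sketch, assuming hypothetical per-turn log fields (transcript, barge_in_count, user_speech_overlapped_tts, voiced_frames) that you would map onto your own trace schema:

def flag_endpointing_signatures(turn: dict) -> list[str]:
    """Flag the log patterns that usually accompany endpointing bugs."""
    flags = []
    transcript = turn.get("transcript", "").lower()
    # Repeated repair turns suggest the agent acted on a clipped transcript.
    if "can you repeat that" in transcript:
        flags.append("repair_turn")
    # Frequent barge-in suggests the agent started speaking too early.
    if turn.get("barge_in_count", 0) >= 2:
        flags.append("high_barge_in")
    # User speech over the first TTS token is the sawtooth pattern.
    if turn.get("user_speech_overlapped_tts", False):
        flags.append("sawtooth_overlap")
    # A turn with no voiced frames usually means double-processed silence.
    if turn.get("voiced_frames", 1) == 0:
        flags.append("empty_audio")
    return flags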

How FutureAGI Handles Endpointing

There is no dedicated FutureAGI endpointing evaluator class. Instead, FutureAGI treats endpointing as a voice-runtime decision that must be visible beside audio, transcript, and task outcome. The closest product surfaces are simulate-sdk LiveKitEngine and the traceAI livekit integration: the simulation supplies repeatable call scenarios, while the trace records turn timing, ASR input/output, TTS start time, interruptions, and call-level outcome.

A support team might define Scenario cases where a caller pauses before a policy number, changes intent mid-sentence, or interrupts TTS with “wait, that’s not right.” LiveKitEngine runs those calls against the production voice agent. Each trace should keep fields such as endpoint_delay_ms, speech_final_at, asr_started_at, tts_first_audio_at, barge_in_count, and the final transcript.
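
A minimal sketch of what such a per-turn record can look like; the dataclass below illustrates the fields named above and is not a simulate-sdk or traceAI type:

from dataclasses import dataclass

@dataclass
class TurnTrace:
    endpoint_delay_ms: float   # last voiced frame -> final endpoint decision
    speech_final_at: float     # ms since call start when the turn was closed
    asr_started_at: float      # ms since call start when ASR began
    tts_first_audio_at: float  # ms since call start when the agent first spoke
    barge_in_count: int        # times the caller spoke over agent TTS
    transcript: str            # transcript as the agent actually saw it

    def time_to_first_audio_ms(self) -> float:
        # Perceived latency: gap between the turn ending and the agent speaking.
        return self.tts_first_audio_at - self.speech_final_at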

Rather than tuning a single silence timeout, the engineer reviews examples where endpoint delay exceeded 800 ms, early cutoff occurred before an entity, or barge-in preceded task failure. AudioQualityEvaluator helps separate endpointing mistakes from clipped or noisy audio, while ASRAccuracy catches transcript corruption after the endpoint. FutureAGI turns the fix into a regression eval: rerun the same scenarios, compare time-to-first-audio, task completion, and clarification rate, then release only if speed and correctness improve together.
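
A sketch of that review filter, reusing the hypothetical TurnTrace record above; the 800 ms threshold and the entity check mirror the criteria in the paragraph:

def needs_review(trace: TurnTrace, expected_entity: str, task_ok: bool) -> bool:
    # Slow endpoint: the agent waited too long after the caller stopped.
    if trace.endpoint_delay_ms > 800:
        return True
    # Early cutoff: an expected entity never made it into the transcript.
    if expected_entity and expected_entity not in trace.transcript:
        return True
    # Interruption followed by task failure is the highest-signal case.
    if trace.barge_in_count > 0 and not task_ok:
        return True
    return False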

In our 2026 evals, the useful threshold is rarely universal; it changes by route, language, and risk.

How to Measure or Detect Endpointing Issues

Measure endpointing as timing plus outcome, not a binary audio event:

  • endpoint_delay_ms: elapsed time from the last voiced frame to the final endpoint; track p50, p95, and p99 by route (see the rollup sketch after this list).
  • Early-cutoff rate: share of turns where the transcript misses trailing intent, negation, numbers, or entity values.
  • Late-turn rate: share of turns where silence exceeds the conversation budget before ASR or the LLM starts.
  • Barge-in and interruption rate: count user speech over TTS, premature agent speech, and repeated repair turns.
  • AudioQualityEvaluator: FutureAGI audio-quality evaluator; use it to separate endpointing bugs from bad audio capture.
  • ASRAccuracy: FutureAGI speech-to-text evaluator; use it when endpointing changes alter transcript fidelity.
  • User-feedback proxy: escalation rate, repeat-contact rate, thumbs-down labels, and “agent interrupted me” annotations.
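
A sketch of the percentile rollup for endpoint_delay_ms, grouped by route; it assumes a list of per-turn dicts with route and endpoint_delay_ms keys and uses only the standard library:

from collections import defaultdict
from statistics import quantiles

def delay_percentiles_by_route(turns: list[dict]) -> dict[str, dict[str, float]]:
    by_route = defaultdict(list)
    for t in turns:
        by_route[t["route"]].append(t["endpoint_delay_ms"])
    report = {}
    for route, delays in by_route.items():
        # quantiles(n=100) returns 99 cut points; index k-1 approximates pk.
        # Needs at least two samples per route.
        qs = quantiles(delays, n=100)
        report[route] = {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
    return report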

Minimal Python:

from fi.evals import ASRAccuracy

# Compare the transcript the agent acted on against the caller's full
# utterance; a low score on trailing content is a sign of early cutoff.
asr = ASRAccuracy()
result = asr.evaluate(
    output="cancel my appointment",  # transcript captured after the endpoint fired
    ground_truth="cancel my appointment, not my policy",  # what was actually said
)
print(result.score)

Review failures with the raw audio and the next agent action.

Common Mistakes

Common endpointing bugs often come from treating silence as universal, even though people pause differently by language, age, device, and task risk.

  • Setting one silence timeout for every route. Billing disputes, medical names, and identity checks need different patience than simple menu commands (see the route-aware sketch after this list).
  • Tuning only for latency. A 200 ms faster response is a regression if early cutoff doubles clarification turns.
  • Ignoring barge-in semantics. User interruption can mean correction, impatience, or background speech; do not collapse all three into one metric.
  • Testing with synthetic clean audio only. Carrier noise, far-field microphones, crosstalk, and packet loss change endpoint timing.
  • Measuring endpointing outside task outcome. The right endpoint preserves intent and lets the agent finish safely and reliably.
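
A sketch of route-aware patience, following the first bullet above; the silence budgets are illustrative placeholders, not recommended values:

# Hypothetical per-route silence budgets in milliseconds; tune from traces.
ENDPOINT_SILENCE_MS = {
    "menu_command": 300,     # short, formulaic utterances
    "billing_dispute": 900,  # callers pause to read statements
    "identity_check": 1200,  # spelled names and digit strings have long gaps
}
DEFAULT_SILENCE_MS = 600

def silence_budget_ms(route: str) -> int:
    # Fall back to a conservative default for unmapped routes.
    return ENDPOINT_SILENCE_MS.get(route, DEFAULT_SILENCE_MS)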

Frequently Asked Questions

What is endpointing in voice AI?

Endpointing decides when a speaker has finished a turn so a voice agent can stop listening, send audio to ASR, and respond. It directly affects latency, interruptions, and whether the agent reasons over the full user intent.

How is endpointing different from voice activity detection?

Voice activity detection detects whether speech is present in the audio. Endpointing decides that the current user turn is complete enough for the agent to act, which requires pause timing, interruption behavior, and task context.
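
A sketch of the distinction, with illustrative names and thresholds; VAD contributes one boolean per frame, while the endpointer folds that signal into a turn-level decision:

def turn_complete(vad_is_speech: bool, silence_ms: float,
                  budget_ms: float, transcript: str) -> bool:
    # VAD answers "is speech present right now?"; endpointing answers
    # "is this turn complete enough to act on?" using timing plus context.
    if vad_is_speech:
        return False  # the caller is still talking
    if silence_ms < budget_ms:
        return False  # the pause is still within budget
    # A trailing conjunction hints the caller is mid-thought; keep listening.
    if transcript.rstrip().lower().endswith((" and", " but", " or")):
        return False
    return True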

How do you measure endpointing?

In FutureAGI, inspect LiveKitEngine simulation traces for endpoint delay, early cutoff, barge-in rate, and time-to-first-audio. Pair those signals with AudioQualityEvaluator and ASRAccuracy to separate timing bugs from audio or transcript errors.