Evaluating Voice AI Agents in 2026: End-Task, Pipeline-Stage, Conversation-Coherence

Voice agent eval is end-task scoring plus pipeline-stage attribution plus conversation coherence. WER scores the ASR component, not the agent.

April 13, 2026

Updated May 20, 2026

Voice agent eval is end-task scoring plus pipeline-stage attribution plus conversation-coherence scoring. WER is for the ASR component, not the agent. The agent’s job is task completion at sub-second turn latency with conversational stability across N turns, and a 5% WER on Common Voice can sit alongside a 32% task-failure rate in your live stack without contradiction, because the two numbers measure different things on different layers. This post is the production eval methodology: how to score the task as the launch gate, how to attribute failures back to ASR, LLM, TTS, or turn-taking, and how to score the conversation across turns instead of one turn at a time. Built for engineers running Pipecat, LiveKit, VAPI, Bland, Retell, or any real-time audio pipeline with ASR, an LLM brain, TTS, and a turn-taking layer.

The thesis

Eval the pipeline; score the task. Three scoring layers, one gate, two diagnostics.

End-task scoring is the launch gate. conversation_resolution and task_completion answer the question the caller cares about: did the agent book the appointment, refill the prescription, route the call to the right department, finish the refund? This is the only metric that matters for a pass-fail decision.
Pipeline-stage attribution is the diagnostic. When end-task fails, the failure happened at ASR, LLM, TTS, or the turn-taking layer. Each stage has its own per-stage rubric and its own latency budget. Stage-level scores explain end-task scores; they don’t replace them.
Conversation-coherence scoring is the second diagnostic. Single-turn rubrics miss the failures that compound across a call: persona drift, context loss, recovery from interruption, end-of-call cleanup. Score per turn for telemetry; aggregate per call before you make any decision.

Two ways teams get this wrong. Some score only WER and assume a 95% transcript means a 95% agent. They ship a broken product. Others score only end-task and watch the dashboard flatten without any signal on where to look. They debug by playing back recordings one at a time. The fix is to keep all three layers live and route them at the right decision point.

Why WER is for the ASR, not the agent

Word Error Rate is edit distance between a transcript and a reference. It’s a fine number for the ASR component. It’s the wrong gate for the agent for four reasons.

The ASR is one stage of four. A 92% accurate transcript that drops the “re-” from “reschedule” hands the LLM the wrong intent and produces a 100% wrong end task. Conversely, a transcript that misses a filler word but preserves the slot value scores worse on WER while the agent still books the right appointment. WER moves in directions that don’t track agent success.

Curated corpora are not your traffic. Mozilla Common Voice is read speech, balanced phonemes, quiet recording, no telephony codec. Production callers think out loud, stumble through proper nouns, switch languages mid-utterance, and arrive over a G.711 codec. Public WER on the curated corpus tells you nothing about how your ASR handles “Mounjaro” pronounced by a Tamil-substrate caller in a moving vehicle.

LLM and TTS failures aren’t ASR failures. A monolingual STT model that resets context on a code-switch is an STT-architecture failure. A US-trained LLM that defaults “pavement repair” to road resurfacing instead of sidewalk repair is an LLM-interpretation failure. A TTS that pronounces a confirmation number like a phone number is a TTS failure. WER measures none of them.

Latency is invisible to WER. A perfect transcript that arrives 1.4 seconds late breaks the call. Callers start talking over the agent, barge-in fires on the response, the conversation spirals. WER on the final transcript looks great. The call still failed.

The fix isn’t to throw out WER. WER is the right number for the ASR component, kept as a per-stage diagnostic and stratified by accent group and background-noise bucket. It just isn’t the agent’s gate.
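To make the point concrete, here is a minimal sketch of WER as word-level edit distance over the reference length. The function name and the sample utterances are illustrative assumptions, not part of any SDK, and real scorers normalize text (punctuation, numerals) before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical utterances, chosen to mirror the "reschedule" example above.
reference = "um so i need to reschedule my appointment for next tuesday"
# One substitution that flips the intent from reschedule to schedule.
intent_broken = "um so i need to schedule my appointment for next tuesday"
# Two dropped fillers; every slot value survives.
fillers_dropped = "i need to reschedule my appointment for next tuesday"

wer_intent_broken = word_error_rate(reference, intent_broken)
wer_fillers_dropped = word_error_rate(reference, fillers_dropped)
print(f"{wer_intent_broken:.3f} {wer_fillers_dropped:.3f}")
```

The intent-breaking transcript scores about 0.09 and the harmless one about 0.18. Ranked by WER alone, the transcript that wrecks the call looks twice as good as the one that books it.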

End-task scoring: the launch gate

The first eval that runs on every call is end-task. The whole point of the agent is whether the call resolved. Two rubrics from the ai-evaluation SDK cover the ground.

conversation_resolution. Did the agent resolve the caller’s intent? Multi-turn, scored against the goal list of the scenario.
task_completion. Did the agent finish the specific task (booking, refill, refund, transfer)? Partial completions and unnecessary escalations both count against the score.

Both are launch gates. A bot that posts 91% on ASR/STT_accuracy and 64% on conversation_resolution is still shipping a broken product. The transcript was fine; the agent failed the call. Don’t invert the priority.

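from fi.evals import Evaluator
from fi.evals.templates import ConversationResolution, TaskCompletion

# Load the keys from your secret store rather than hard-coding them.
evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

# conversation_test_case is the multi-turn test case for one call, built elsewhere.
result = evaluator.evaluate(
    eval_templates=[ConversationResolution(), TaskCompletion()],
    inputs=[conversation_test_case],
)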

Define the end task precisely in the scenario goal list. Vague task statements make conversation_resolution an LLM-judge whim. Tight goal statements (“confirm caller identity; locate booking; offer next available slot; send confirmation SMS”) make the gate stable.
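Turning the two rubric scores into a pass-fail decision might look like the sketch below. It assumes scores on a 0 to 1 scale; the `CallResult` type, the 0.8 per-call threshold, and the 90% pass-rate bar are placeholders for illustration, not SDK defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallResult:
    call_id: str
    conversation_resolution: float  # assumed 0.0-1.0 scale
    task_completion: float          # assumed 0.0-1.0 scale


def launch_gate(results, per_call_threshold=0.8, min_pass_rate=0.9):
    """Return (pass_rate, gate_open). A call passes only if both rubrics clear the bar."""
    if not results:
        raise ValueError("no calls to gate on")
    passed = [
        r for r in results
        if r.conversation_resolution >= per_call_threshold
        and r.task_completion >= per_call_threshold
    ]
    pass_rate = len(passed) / len(results)
    return pass_rate, pass_rate >= min_pass_rate


# Hypothetical scores for four simulated calls.
results = [
    CallResult("call-001", 0.92, 0.95),
    CallResult("call-002", 0.88, 0.90),
    CallResult("call-003", 0.64, 0.91),  # resolution miss fails the call
    CallResult("call-004", 0.85, 0.83),
]
pass_rate, gate_open = launch_gate(results)
print(f"pass rate {pass_rate:.0%}, gate open: {gate_open}")
```

Requiring both rubrics per call, instead of averaging them, keeps a strong task_completion score from masking a call the caller would describe as unresolved.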

Pipeline-stage attribution: which stage failed

When the end-task gate fails, the next question is which stage caused it. Each pipeline stage has its own per-stage rubric and its own latency budget. The stages map cleanly to the audio path.

Stage	What it does	Per-stage rubric	Latency budget
ASR	Audio in, transcript out	`ASRAccuracy` (eval_name: `ASR/STT_accuracy`), `audio_quality`	50-150 ms first-hypothesis
LLM brain	Transcript in, text response out	`TaskCompletion`, `Groundedness`, `PromptAdherence`	300-500 ms first-token
TTS	Text in, audio out	`TTSAccuracy`, `audio_quality` on the synthesized output	100-200 ms first-byte
Turn-taking	VAD, barge-in, end-of-turn	`CustomerAgentInterruptionHandling`, `CustomerAgentTerminationHandling`	50-150 ms decision latency

Stage attribution lets one failed call answer one question instead of four. A call that fails conversation_resolution with ASR/STT_accuracy at 0.62 on a Tamil-substrate proper noun is an ASR vocabulary gap. A call that fails conversation_resolution with ASR/STT_accuracy at 0.94 and Groundedness at 0.41 is an LLM grounding failure. Same end-task miss, different fix.
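One way to mechanize that triage, as a rough sketch: map each stage to the rubric scores that speak for it, then rank the stages whose worst score falls under a threshold. The mapping, the dictionary keys, and the 0.7 cutoff are assumptions for illustration; `audio_quality` is left out because the table lists it under both ASR and TTS.

```python
# Hypothetical stage-to-rubric mapping, following the stage table above.
STAGE_RUBRICS = {
    "ASR": ["ASR/STT_accuracy"],
    "LLM": ["TaskCompletion", "Groundedness", "PromptAdherence"],
    "TTS": ["TTSAccuracy"],
    "turn-taking": [
        "CustomerAgentInterruptionHandling",
        "CustomerAgentTerminationHandling",
    ],
}


def attribute_failure(scores: dict[str, float], threshold: float = 0.7) -> list[str]:
    """Return suspect stages for a failed call, worst-scoring stage first."""
    suspects = []
    for stage, rubrics in STAGE_RUBRICS.items():
        observed = [scores[name] for name in rubrics if name in scores]
        if observed and min(observed) < threshold:
            suspects.append((min(observed), stage))
    return [stage for _, stage in sorted(suspects)]


# The two failed calls described above, with a filler score added to each.
vocabulary_gap = {"ASR/STT_accuracy": 0.62, "Groundedness": 0.88}
grounding_miss = {"ASR/STT_accuracy": 0.94, "Groundedness": 0.41}

print(attribute_failure(vocabulary_gap))
print(attribute_failure(grounding_miss))
```

An empty list on a failed call is itself a signal: no per-stage rubric explains the miss, so the coherence layer or the goal statement is the next place to look.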

Latency attribution is the second axis. Route the LLM brain through the Future AGI gateway and every response returns x-prism-latency-ms, x-prism-model-used, x-prism-fallback-used, and x-prism-routing-strategy. The LLM portion of the per-turn budget is one header lookup. ASR and TTS latencies come from your voice framework’s own spans, which traceAI captures with gen_ai.voice.* attributes.

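import litellm

# Route completions through the gateway instead of calling the provider directly.
litellm.api_base = "https://gateway.futureagi.com/v1"
# SYSTEM_PROMPT (str) and turns (list of prior chat messages) come from your agent state.
response = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "system", "content": SYSTEM_PROMPT}, *turns],
    extra_headers={"x-prism-routing-strategy": "lowest-latency"},
)
# The gateway reports the LLM latency it measured in a response header.
llm_latency_ms = int(
    response._hidden_params["additional_headers"]["x-prism-latency-ms"]
)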

Sum the ASR, LLM, and TTS latencies into a per-turn end-to-end figure. Alert when the rolling p95 crosses one second. Without gateway headers, voice agents degrade silently as upstream model latency drifts; the first sign of trouble is a customer complaint three weeks later.
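A rolling p95 alert can be as small as the sketch below. The class name, the 200-turn window, and the nearest-rank percentile are assumptions; in production you would more likely compute this in your metrics backend than in process.

```python
from collections import deque


class RollingP95:
    """Tracks per-turn end-to-end latency over a sliding window of turns."""

    def __init__(self, window: int = 200, threshold_ms: float = 1000.0):
        self.samples = deque(maxlen=window)
        self.threshold_ms = threshold_ms

    def add_turn(self, asr_ms: float, llm_ms: float, tts_ms: float) -> float:
        total = asr_ms + llm_ms + tts_ms
        self.samples.append(total)
        return total

    def p95(self) -> float:
        if not self.samples:
            raise ValueError("no turns recorded yet")
        ordered = sorted(self.samples)
        # Nearest-rank percentile: ceil(0.95 * n), computed in integer arithmetic.
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]

    def should_alert(self) -> bool:
        return bool(self.samples) and self.p95() > self.threshold_ms


tracker = RollingP95()
for _ in range(19):
    tracker.add_turn(asr_ms=100, llm_ms=350, tts_ms=150)  # 600 ms turns
tracker.add_turn(asr_ms=100, llm_ms=1150, tts_ms=150)     # one 1400 ms outlier
alert_after_one_outlier = tracker.should_alert()
tracker.add_turn(asr_ms=100, llm_ms=1150, tts_ms=150)     # a second slow turn
alert_after_two_outliers = tracker.should_alert()
print(alert_after_one_outlier, alert_after_two_outliers)
```

One slow turn in twenty stays under the p95; the second one trips the alert. That is the behavior you want from a tail metric: it ignores a single blip and fires as soon as slowness becomes a pattern.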

Turn-taking latency: the sub-second p95 target

Turn latency is the difference between a usable voice agent and an unusable one. Three latency bands matter, none of them aspirational.

Under 800 ms end-to-end, the call feels human. Callers wait for the agent.
Between 800 ms and 1.5 seconds, callers start talking over the agent. Barge-in fires. The conversation gets choppy.
Above 1.5 seconds, the call breaks. Users hang up or escalate.

The p95 target for production is under one second; under 800 ms is the goal. ASR is hard to push below 100 ms first-hypothesis. TTS first-byte is hard to push below 150 ms with a high-quality voice. The LLM portion is where 300-700 ms of variance lives, which is why routing the brain through a low-latency gateway moves the budget more than any TTS optimization. The gateway’s x-prism-latency-ms header is the signal you act on.
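As a quick arithmetic check on the budget, the sketch below adds up the per-stage ranges from the stage table and classifies the result into the three bands. The band names are made up here, and the sum assumes the four stages run strictly in sequence; streaming pipelines overlap them, and whether the turn-taking decision sits on the critical path depends on your framework, so treat this as the pessimistic case.

```python
# Per-stage budgets in ms as (low, high), copied from the stage table above.
STAGE_BUDGET_MS = {
    "asr_first_hypothesis": (50, 150),
    "llm_first_token": (300, 500),
    "tts_first_byte": (100, 200),
    "turn_taking_decision": (50, 150),
}


def latency_band(turn_ms: float) -> str:
    """Map an end-to-end turn latency onto the three bands described above."""
    if turn_ms < 800:
        return "natural"   # callers wait for the agent
    if turn_ms <= 1500:
        return "choppy"    # callers talk over the agent, barge-in fires
    return "broken"        # hang-ups and escalations


best_case_ms = sum(low for low, _ in STAGE_BUDGET_MS.values())
worst_case_ms = sum(high for _, high in STAGE_BUDGET_MS.values())

print(best_case_ms, latency_band(best_case_ms))
print(worst_case_ms, latency_band(worst_case_ms))
```

The low ends sum to 500 ms and the high ends to 1,000 ms. A turn where every stage spends its full budget lands exactly on the one-second line and already sits in the choppy band, so there is no slack for LLM variance on top.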

Turn-taking correctness is a separate failure mode. The agent has to decide when the user’s turn ended, when to barge in, and when to wait through a filler. Three rubrics from the customer-agent template family cover the surface.

CustomerAgentInterruptionHandling. Score barge-in correctness. Penalize false positives on background noise (cough, dog, traffic) and false negatives on actual user interruptions.
CustomerAgentTerminationHandling. Score end-of-call cleanup. Did the agent close the call appropriately, log the outcome, send the follow-up?
CustomerAgentLoopDetection. Score whether the agent recognized it was stuck in a loop and escalated or recovered instead of grinding forward.

False positives on barge-in destroy calls; false negatives destroy trust. Treat turn-taking as an eval axis, not a UX setting.
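If you have ground-truth labels for interruption events, the two error rates fall out of a simple tally. This is a sketch under that assumption; the `BargeInEvent` type and the labels are hypothetical and separate from how the `CustomerAgentInterruptionHandling` rubric scores a call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BargeInEvent:
    barge_in_fired: bool    # what the turn-taking layer did
    user_interrupted: bool  # ground-truth label from a reviewer or the simulator


def barge_in_error_rates(events):
    """False-positive rate over noise events, false-negative rate over real interruptions."""
    noise = [e for e in events if not e.user_interrupted]
    real = [e for e in events if e.user_interrupted]
    false_positives = sum(e.barge_in_fired for e in noise)
    false_negatives = sum(not e.barge_in_fired for e in real)
    return {
        "false_positive_rate": false_positives / len(noise) if noise else 0.0,
        "false_negative_rate": false_negatives / len(real) if real else 0.0,
    }


# Hypothetical labeled events from one batch of calls.
events = (
    [BargeInEvent(True, True)] * 4       # real interruptions, handled
    + [BargeInEvent(False, True)]        # real interruption, missed
    + [BargeInEvent(False, False)] * 3   # cough, dog, traffic: correctly ignored
    + [BargeInEvent(True, False)]        # background noise cut the agent off
)
rates = barge_in_error_rates(events)
print(rates)
```

Reporting the two rates separately matters because a single VAD threshold trades one against the other; a combined accuracy figure hides which way the setting is wrong.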

Conversation coherence across N turns

Persona drift, context loss, and recovery from interruption all happen at the call level, not the turn level. Score per turn for telemetry; aggregate per call before any decision. The drift you care about typically shows up around turn six, which is invisible to any single-turn rubric.

Three rubrics carry the coherence layer.

ConversationCoherence. Does the agent’s sequence of turns hang together as a single coherent thread? Penalizes contradictions, dropped threads, and shifts in tone or persona.
CustomerAgentContextRetention. Did the agent retain facts from earlier turns? The caller said their name in turn one; did the agent still know it in turn five?
CustomerAgentConversationQuality. Aggregate call-quality scoring across turns, weighted by failure severity.

The pattern we see most often: a call starts as “Maya from billing support” and ends as a generic assistant. The system prompt got crowded out by retrieved context by turn seven, the persona dropped, and the customer noticed even if the answer was technically correct. The call-level rubric catches this; the turn-level rubric never will.

from fi.evals.templates import (
    ConversationCoherence,
    CustomerAgentContextRetention,
    CustomerAgentInterruptionHandling,
    CustomerAgentTerminationHandling,
)

# Reuses the evaluator and conversation_test_case from the end-task example.
coherence_score = evaluator.evaluate(
    eval_templates=[
        ConversationCoherence(),
        CustomerAgentContextRetention(),
        CustomerAgentInterruptionHandling(),
        CustomerAgentTerminationHandling(),
    ],
    inputs=[conversation_test_case],
)

Roll up per-call. Alert on calls below threshold. Track per-turn for the trend line that shows you when drift starts.
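The roll-up and the trend line can be sketched in a few lines. The mean as the call-level aggregate, the 0.7 threshold, and the two-turn persistence rule are all assumptions; pick whatever aggregation matches how your rubrics are calibrated.

```python
from statistics import mean


def call_score(turn_scores):
    """Call-level aggregate; a plain mean is the simplest possible choice."""
    return mean(turn_scores)


def drift_onset(turn_scores, threshold=0.7, sustain=2):
    """Return the 1-based turn where scores first stay below threshold
    for `sustain` consecutive turns, or None if they never do."""
    run = 0
    for turn, score in enumerate(turn_scores, start=1):
        run = run + 1 if score < threshold else 0
        if run == sustain:
            return turn - sustain + 1
    return None


# Hypothetical per-turn coherence scores for one eight-turn call.
turn_scores = [0.95, 0.93, 0.94, 0.90, 0.88, 0.62, 0.58, 0.55]

aggregate = call_score(turn_scores)
onset = drift_onset(turn_scores)
print(f"call score {aggregate:.2f}, drift starts at turn {onset}")
```

The call-level mean comes out near 0.79, which would clear a 0.7 bar, while the per-turn series shows the agent falling apart from turn six onward. That gap is the reason to keep the per-turn telemetry next to the aggregate.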

Persona-driven test generation with simulate-sdk

A static recording corpus rots the moment your call distribution shifts. Persona-driven generation runs against your live stack, regenerates when intents change, and produces end-to-end calls instead of transcript proxies. simulate-sdk ships the primitive directly.

A Persona carries voice traits (accent, age range, speed, background noise, multilingual toggle). A Scenario carries the intent path and the goal list. TestRunner drives the persona through the scenario against an AgentWrapper that wraps your actual agent (OpenAI, LangChain, Gemini, or Anthropic-backed). The run produces a TestReport with per-call transcripts, audio, and a span tree.

from fi.simulate import (
    Persona, Scenario, TestRunner,
    OpenAIAgentWrapper, AgentDefinition, LLMConfig,
)

personas = [
    Persona(name="tamil_substrate_f_28", traits={
        "accent": "South Indian Tamil-influenced English",
        "speed": "fast", "background_noise": "moderate",
        "multilingual": True,
    }),
    Persona(name="us_southern_m_60", traits={
        "accent": "US Southern", "speed": "slow",
        "background_noise": "quiet",
    }),
    # ... weighted by your actual traffic distribution
]

scenarios = [
    Scenario(
        description="Caller wants to reschedule an existing appointment.",
        goals=["confirm caller identity", "locate booking", "offer next slot"],
    ),
    # ... top intents from production traffic
]

agent = OpenAIAgentWrapper(AgentDefinition(
    name="reschedule-bot",
    llm_config=LLMConfig(model="gpt-4o", temperature=0.4),
    system_prompt="You are a reschedule assistant for a healthcare clinic.",
))

runner = TestRunner(agent_wrapper=agent, personas=personas, scenarios=scenarios)
report = runner.run()
print(f"Pass rate: {report.pass_rate:.0%}")

Twelve personas times five scenarios at 100 rows per pair is 6,000 fresh calls. Run time is roughly 14 hours of parallel simulation against the live agent. Each call carries TTS-rendered audio with the accent prosody, gets transcribed by your real STT, hits your real LLM, and returns through your real TTS. It is end-to-end, not a proxy. For accent-coverage strategy in detail, see accent and dialect testing for voice AI.
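The matrix arithmetic is easy to sanity-check before you commit simulation time. The sketch below uses plain Python with placeholder names; it does not call simulate-sdk.

```python
from itertools import product

# Placeholder names; substitute your real personas and top production intents.
persona_names = [f"persona_{i:02d}" for i in range(1, 13)]
scenario_names = ["reschedule", "refill", "refund", "transfer", "new_booking"]
ROWS_PER_PAIR = 100

# Every persona meets every scenario once per row.
pairs = list(product(persona_names, scenario_names))
total_calls = len(pairs) * ROWS_PER_PAIR

print(f"{len(pairs)} persona-scenario pairs, {total_calls} calls")
```

Adding one persona adds 500 calls at this row count, so the persona list is the knob that drives run time.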

Production observability with traceAI

The same evaluator pool that scored the launch matrix scores live traffic. traceAI auto-instruments OpenAI, LangChain, Groq, Gemini, Portkey, and the voice-framework-specific instrumentors. Spans land in the same backend the launch suite wrote to.

For voice agents, tag the spans with a gen_ai.voice.* namespace so the platform renders them with the audio-specific UI. A typical per-turn span carries:

gen_ai.voice.asr.provider = "whisper"
gen_ai.voice.asr.transcript = "can you check the one for next tuesday"
gen_ai.voice.asr.confidence = 0.87
gen_ai.voice.asr.audio_duration_ms = 2140
gen_ai.voice.tts.provider = "elevenlabs"
gen_ai.voice.tts.text_source = "your appointment for tuesday is confirmed"
gen_ai.voice.tts.audio_duration_ms = 1840
gen_ai.voice.turn.event = "end_of_turn"
gen_ai.voice.turn.barge_in_fired = false
gen_ai.voice.accent_class = "tamil_substrate_english"
fi.span.kind = "AGENT"

The dashboard slices conversation_resolution by accent_class and by intent. A regression on Tamil-substrate English in the reschedule intent shows up as one cell turning red, not as a customer-support escalation three weeks later.
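The slice behind that dashboard view is a group-by on two keys. A minimal sketch, assuming each call record has already been flattened to a dict with `accent_class`, `intent`, and a boolean `resolved`; the record shape and the 0.7 cutoff are illustrative.

```python
from collections import defaultdict


def resolution_by_slice(calls):
    """Resolution rate per (accent_class, intent) cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [resolved count, call count]
    for call in calls:
        cell = (call["accent_class"], call["intent"])
        totals[cell][0] += int(call["resolved"])
        totals[cell][1] += 1
    return {cell: resolved / count for cell, (resolved, count) in totals.items()}


def make_calls(accent_class, intent, outcomes):
    return [
        {"accent_class": accent_class, "intent": intent, "resolved": outcome}
        for outcome in outcomes
    ]


# Hypothetical flattened call records.
calls = (
    make_calls("tamil_substrate_english", "reschedule", [True, False, False, False])
    + make_calls("us_southern", "reschedule", [True, True, True, False])
    + make_calls("tamil_substrate_english", "refill", [True, True])
)

rates = resolution_by_slice(calls)
red_cells = sorted(cell for cell, rate in rates.items() if rate < 0.7)
print(red_cells)
```

The overall resolution rate here is 60%, which says something is wrong but not where. The sliced view puts the whole regression in one cell.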

Failures cluster in Error Feed. HDBSCAN soft-clustering groups failed traces into recurring patterns; a Sonnet 4.5 Judge agent writes the immediate_fix field per cluster on a 30-turn budget. The clusters we see most often on voice deployments:

Cluster	Pipeline stage	Typical immediate_fix
ASR drops disfluencies, breaks intent on turn two	ASR	Tune VAD silence threshold; add filler tokens to language model
Proper-noun mistranscription on a slot value	ASR	Add phonetic variants to custom vocabulary
Code-switch context reset	ASR	Swap to a multilingual STT for affected locales
Barge-in on background noise	turn-taking	Raise VAD noise floor for affected personas
Persona drift after turn six	LLM	System-prompt rewrite; reduce retrieved-context window
PII not redacted from call recording	compliance	Run `SecretsScanner` and `RegexScanner` before cold storage
LLM p95 latency drift past 700 ms	LLM	Switch routing strategy to `lowest-latency`; check fallback rate

Each cluster carries a trend signal (rising, steady, falling). The cluster view is the weekly work queue. Linear is the only ticketing destination wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
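How Error Feed derives the trend signal isn’t described here. As an illustration only, one simple rule compares the latest week’s failure count for a cluster against the average of the preceding weeks; the function and the 20% tolerance are assumptions.

```python
def trend_signal(weekly_counts, tolerance=0.2):
    """Classify a cluster as rising, steady, or falling.

    Compares the latest week against the mean of the earlier weeks;
    changes within +/- tolerance count as steady.
    """
    if len(weekly_counts) < 2:
        raise ValueError("need at least two weeks of counts")
    *history, latest = weekly_counts
    baseline = sum(history) / len(history)
    if baseline == 0:
        return "rising" if latest > 0 else "steady"
    change = (latest - baseline) / baseline
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "falling"
    return "steady"


# Hypothetical weekly failure counts for three clusters.
print(trend_signal([10, 12, 11, 20]))  # proper-noun mistranscription
print(trend_signal([10, 10, 10, 10]))  # barge-in on background noise
print(trend_signal([20, 18, 22, 8]))   # persona drift after a prompt rewrite
```

A rule this crude is enough to order the weekly work queue: rising clusters first, falling clusters as confirmation that last week’s fix held.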

Where the FAGI stack fits

Four product surfaces carry the voice eval loop without glue code.

simulate-sdk generates the persona-driven calls. Persona carries voice traits; Scenario carries the goal list; TestRunner drives the matrix against your wrapped agent (OpenAI, LangChain, Gemini, Anthropic) and returns a TestReport. The matrix scales: twelve personas, five intents, 100 rows per pair, 6,000 calls run in hours.

ai-evaluation scores every call. The Apache 2.0 SDK ships conversation_resolution, task_completion, ConversationCoherence, ASR/STT_accuracy, TTSAccuracy, audio_quality, plus the customer-agent template family (CustomerAgentInterruptionHandling, CustomerAgentTerminationHandling, CustomerAgentContextRetention, CustomerAgentLoopDetection, CustomerAgentLanguageHandling) as built-in templates. Error Localization pinpoints the failing turn so triage is one click, not one playback. Pricing runs lower per-eval than Galileo Luna-2 on equivalent throughput.

traceAI extends the same scoring to production. The OpenTelemetry-based SDK auto-instruments the wrappers, emits spans with PII redaction, and ships the gen_ai.voice.* attribute namespace unique to traceAI. Dashboards slice live conversation_resolution by accent and by intent the same way the launch matrix did.

Agent Command Center runs Protect on the voice path and gives you the latency telemetry to hold a sub-second p95. Four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) plus Protect Flash score transcribed speech, with a median time-to-label of 65 ms for text and 107 ms for images per arXiv 2510.13351. Prompt-injection attempts that come in through voice channels get caught before they reach the LLM; PII in transcripts gets masked before it lands in logs. The gateway is a single Go binary, Apache 2.0, benchmarked at 29k req/s and a 21 ms P99 with guardrails on, on a t3.xlarge. Every response carries x-prism-latency-ms so per-turn p95 tracking is automatic.

The reason the loop closes inside a sprint: the four surfaces share datasets, evaluators, and a trend line. Personas feed the launch matrix. The launch matrix produces a baseline. The baseline becomes the regression suite. Production spans feed the same evaluator pool. Error Feed clusters write immediate_fix back into the work queue. Compliance: SOC 2 Type II, HIPAA, GDPR, CCPA certified per futureagi.com/trust; ISO/IEC 27001 in active audit.

Honest tradeoffs

Three calls worth naming.

Persona-driven generation is only as good as the TTS engine. Accent prosody is rendered, not recorded. For accents where the TTS layer doesn’t carry enough fidelity (some sub-regional dialects), pair the suite with a small recording dataset for the long tail. Voice cloning from ElevenLabs or Cartesia fills most of the gap; the residual is real.

End-task scoring requires precise goal statements. Vague goals make conversation_resolution an LLM-judge whim. Spend time on tight goal statements per scenario. The investment pays back the first time the suite catches a regression that WER would have missed.

Linear is the only ticketing destination wired today. Slack, GitHub, Jira, and PagerDuty are on the roadmap. If your incident workflow lives elsewhere, you’ll bridge through webhooks until those land.

Related reading

The Voice AI Evaluation Infrastructure: Developer’s Guide: the rubric architecture this methodology sits on top of.
Accent and Dialect Testing for Voice AI: persona matrix design for accent coverage.
WER for Voice Agents: Beyond the 2026 Baseline: why WER is a diagnostic, not a gate.
How to Measure Voice AI Latency: per-stage latency methodology.

Sources and references

ai-evaluation source: github.com/future-agi/ai-evaluation (templates: ConversationResolution, TaskCompletion, ConversationCoherence, ASRAccuracy, TTSAccuracy, AudioQualityEvaluator, CustomerAgentInterruptionHandling, CustomerAgentTerminationHandling, CustomerAgentContextRetention, CustomerAgentLoopDetection, CustomerAgentLanguageHandling, CustomerAgentConversationQuality)
simulate-sdk source: docs.futureagi.com/docs/simulation
traceAI source: github.com/future-agi/traceAI
Agent Command Center docs: docs.futureagi.com/docs/command-center
Future AGI trust page: futureagi.com/trust
Protect model family: arxiv.org/abs/2510.13351

Frequently asked questions

Why is WER a bad eval for a voice agent?

WER scores the ASR component, not the agent. An ASR layer can hit 5% WER on a public read-speech corpus while the agent it feeds posts a 32% task-failure rate in production, because WER doesn't see proper-noun substitutions on the slot that matters, dropped negations, telephony codec degradation, or any of the LLM and TTS failures that come after the transcript. WER is fine as a per-stage diagnostic for the ASR component. It's the wrong gate for the agent. The agent's job is task completion at sub-second turn latency with conversational stability across N turns. Score the task, then attribute failures back to the pipeline stage that caused them.

What are the three scoring layers a voice agent eval needs?

End-task scoring asks whether the call resolved the user's intent. Pipeline-stage scoring attributes a failed call to a specific stage: ASR, LLM, TTS, or turn-taking. Conversation-coherence scoring asks whether the agent stays consistent across turns and recovers from interruption. The three layers answer three different questions and require three different scoring paths. End-task uses `conversation_resolution` and `task_completion`. Pipeline-stage uses `ASR/STT_accuracy`, `TTSAccuracy`, `audio_quality`, plus the gateway's per-turn latency header. Conversation-coherence uses `ConversationCoherence`, `CustomerAgentContextRetention`, and `CustomerAgentInterruptionHandling`. Use end-task as the launch gate; use the other two to explain why end-task fails when it does.

What turn latency should a voice agent target?

p95 turn latency under one second from user end-of-speech to first agent audio byte, with the LLM portion of that budget under 500 ms. Past about 800 ms total, callers start talking over the agent, barge-in fires incorrectly, and the call falls apart even when every utterance is semantically correct. Past 1.5 seconds the call feels broken. Track per-turn end-to-end latency from day one. Route the LLM through the Future AGI gateway and the response carries `x-prism-latency-ms`, `x-prism-model-used`, and `x-prism-fallback-used` so you can attribute the LLM-portion drift to a specific provider before the rolling p95 crosses the threshold.

How do you score conversation coherence across multiple turns?

Single-turn rubrics miss persona drift, context loss, and recovery from interruption. Use call-level rubrics. The `ConversationCoherence` template scores whether the agent's turns hang together as a single coherent thread. `CustomerAgentContextRetention` scores whether the agent kept earlier-turn facts in working memory. `CustomerAgentInterruptionHandling` scores barge-in recovery. `CustomerAgentTerminationHandling` scores end-of-call cleanup. Score per turn for telemetry; aggregate per call before you make a pass-fail decision. The drift you care about typically shows up around turn six, which is invisible to any single-turn rubric.

How does FAGI's simulate-sdk generate voice test cases?

`simulate-sdk` ships a Persona plus Scenario plus TestRunner primitive. A Persona carries voice traits (accent, speed, background noise, multilingual toggle); a Scenario carries the intent path and goal list; `TestRunner` drives the persona through the scenario against your wrapped agent (OpenAI, LangChain, Gemini, or Anthropic) and returns a TestReport with per-call transcripts, audio, and a span tree. Twelve personas times five scenarios at 100 rows per pair gives 6,000 fresh calls against the live stack, end-to-end, not a transcript proxy. The matrix regenerates when intents shift, which a static recording corpus can't do.

How does FAGI cluster voice failures in production?

Error Feed runs HDBSCAN soft-clustering over failed traces. For voice the clusters typically look like 'ASR drops disfluencies which breaks intent detection on turn two', 'barge-in fires on background noise during a quiet stretch', 'persona drift after turn six once retrieved context crowds out the system prompt', and 'PII not redacted when the caller reads a card number aloud'. A Sonnet 4.5 Judge agent writes the `immediate_fix` field per cluster. The platform raises the confidence weight on future calls that match the cluster shape so the rubrics don't drift as the failure mix evolves. Linear is the only ticketing destination wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.

Where does the FAGI eval stack fit in the voice loop?

`simulate-sdk` generates persona-driven calls and runs them through your live agent. `ai-evaluation` (Apache 2.0) scores each call with `conversation_resolution`, `task_completion`, `ConversationCoherence`, `ASR/STT_accuracy`, `TTSAccuracy`, `audio_quality`, and the customer-agent rubrics. `traceAI` extends the same scoring to production with PII redaction and a `gen_ai.voice.*` attribute namespace. The Agent Command Center runs Protect on the voice path: four Gemma 3n LoRA adapters (toxicity, bias, prompt injection, data privacy) score transcribed speech, with a median time-to-label of 65 ms for text and 107 ms for images per arXiv 2510.13351. The gateway returns `x-prism-latency-ms` on every response so per-turn p95 tracking is one header lookup, not a custom timer.
