AI Conversation Monitoring for Voice Agents: 6 Metrics That Matter in 2026
Monitor voice agent conversations with 6 metrics in 2026: turn coherence, intent confidence, completion, sentiment, escalation, and repeat-question signal.
A voice agent can pass health checks, answer fast, and still ship a bad customer experience. The metrics that catch that gap don’t live in HTTP latency dashboards. They live in the transcript and the multi-turn trace. This post walks through six conversation-level metrics that matter for production voice agents in 2026, with the eval rubric each one maps to and the failure mode it catches that simpler observability misses.
TL;DR: the six metrics
| Metric | What it catches | Future AGI rubric |
|---|---|---|
| Turn coherence | Contradictions across turns, lost context after a tool call | conversation_coherence |
| Intent confidence | Misclassification at the entry point, ambiguous routing | Custom + llm_function_calling |
| Completion rate | Whether the call resolved the customer’s stated goal | conversation_resolution, task_completion |
| Sentiment trend | Frustration building before a hang-up or escalation | Tone family + custom sentiment rubric |
| Escalation triggers | Policy or scope boundaries hit, transfer-to-human moments | Custom + AnswerRefusal |
| Repeat-question signal | Customer rephrases because the answer wasn’t useful | Custom signal from transcript pattern matching |
The rest of the post explains how each metric works, why it matters, and how to wire it on top of a voice agent observability stack.
Why six and not three
The standard voice agent dashboards in 2024 stopped at three: latency, completion rate, and sentiment. Those three catch the obvious failures: calls that time out, calls that don’t resolve, calls where the customer is openly angry. What they miss is the long-tail pattern where everything looks green but the customer experience is quietly degrading.
The case for conversation-level monitoring rests on a single trace view: component-level latency (STT, LLM, TTS scored separately as spans) joined with repetition, sentiment, and interruption metrics on the same trace. Most voice tooling forces you to correlate three or four dashboards by hand. Future AGI (FAGI) surfaces the six metrics below as columns on one trace, with the same rubric layer scoring every captured call.
The six in this post catch the long tail. Two of them (turn coherence, intent confidence) sit at the start of the conversation lifecycle. Two (completion, sentiment trend) sit at the end. Two (escalation triggers, repeat-question signal) sit in the middle and catch the failure modes that don’t show up at either end alone.
You don’t need all six on day one. Start with completion rate and turn coherence. Add the others as your call volume grows and you start seeing patterns the first two don’t explain.
Metric 1: turn coherence
What it measures: does the assistant maintain context and consistency across multiple turns of the same conversation.
The failure modes it catches:
- The assistant confirms a fact in turn 3 and contradicts it in turn 7.
- A tool call returns successfully but the assistant doesn’t use the result on the next turn.
- The customer refers to “the second option you mentioned” and the assistant has forgotten which options it offered.
- A long-running RAG retrieval pulls in fresh context that conflicts with what the assistant said earlier, and the assistant doesn’t reconcile the two.
These failures look fine in any single-turn evaluation. The transcript reads correctly turn-by-turn. The failure is in the connective tissue between turns.
The rubric: Future AGI’s ai-evaluation ships ConversationCoherence as a built-in. It scores a multi-turn conversation against criteria for cross-turn consistency, context retention, and reference resolution. The input is a ConversationalTestCase with the full message history; the output is a coherence score plus reasoning.
from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals import Evaluator, ConversationCoherence

ev = Evaluator(
    fi_api_key="your-future-agi-api-key",
    fi_secret_key="your-future-agi-secret-key",
)

conv = ConversationalTestCase(messages=[
    LLMTestCase(query="What's my account balance?", response="It's $1,240."),
    LLMTestCase(query="Can I transfer $500 to checking?", response="Sure, that brings your balance to $740."),
    LLMTestCase(query="What was the original balance again?", response="It was $1,500."),
])

result = ev.evaluate(
    eval_templates=[ConversationCoherence()],
    inputs=[conv],
)
In the example above, turn 3 ($1,500) contradicts turn 1 ($1,240). The coherence rubric catches it. An evaluation that scores each turn in isolation wouldn’t.
In production, the rubric runs on every captured call automatically when attached to a Future AGI Observe project. The Error Feed clusters low-coherence calls into named issues like “Balance contradiction after transfer flow” or “Tool result not used on follow-up turn”, so the patterns surface as failure clusters rather than one-off scores.
Metric 2: intent confidence
What it measures: did the system correctly identify what the customer asked for, at the entry point of the conversation.
The failure modes it catches:
- A customer asks for “the refund thing” and the assistant routes to general billing instead of refund-specific tooling.
- An ambiguous opening (“I have a problem with my order”) routes to a default flow that doesn’t fit any of the actual problem types.
- A multilingual or accented customer says something the STT mistranscribes, and the intent classifier picks the wrong path off the garbled transcript.
- A customer asks two things in one turn (“I want to cancel my subscription and also get a refund”) and the system addresses only one.
The rubric: this one is usually custom, because intent taxonomy is org-specific. The pattern: define an LLMTestCase with the customer query as input and the chosen intent as output, then run a custom evaluator that checks whether the chosen intent matches the ground-truth intent for a held-out set of calls. For function-calling agents, llm_function_calling scores the function-call structure including the chosen intent.
For accent-sensitive deployments, this metric pairs with the audio_transcription rubric. If STT drift is silently degrading intent classification on hard accents, scoring both rubrics together points at the root cause: the intent classifier is fine but the input to it is garbled.
Error Feed clusters low-intent-confidence calls into named issues by intent category and entry-point pattern. The clusters often surface ambiguous opening phrases that should be added to the assistant’s clarification logic.
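As a rough illustration of the held-out check described above, here is a minimal, SDK-independent sketch. The intent labels, the `intent_accuracy` helper, and the `(predicted, truth)` record shape are all assumptions made for the example, not part of the Future AGI API.

```python
from collections import Counter, defaultdict

def intent_accuracy(records):
    """Score (predicted_intent, true_intent) pairs from a labeled held-out set."""
    total = 0
    correct = 0
    confusions = Counter()                    # (true, predicted) -> count, misroutes only
    per_intent = defaultdict(lambda: [0, 0])  # true intent -> [correct, total]
    for predicted, truth in records:
        total += 1
        per_intent[truth][1] += 1
        if predicted == truth:
            correct += 1
            per_intent[truth][0] += 1
        else:
            confusions[(truth, predicted)] += 1
    return {
        "accuracy": correct / total if total else 0.0,
        "per_intent": {k: c / n for k, (c, n) in per_intent.items()},
        "top_confusions": confusions.most_common(3),
    }

# Hypothetical held-out calls: (intent the agent chose, ground-truth intent)
held_out = [
    ("refund", "refund"),
    ("billing_general", "refund"),
    ("billing_general", "refund"),
    ("cancel", "cancel"),
    ("order_status", "order_status"),
]
report = intent_accuracy(held_out)
```

The per-intent breakdown and the confusion pairs matter more than the headline accuracy here: two refund calls misrouted to general billing is exactly the "refund thing" failure described earlier in this section.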
Metric 3: completion rate
What it measures: did the call resolve the customer’s stated goal.
This is your CSAT proxy. Every voice agent monitoring stack tracks completion in some form. The variants:
- Customer-perspective completion: did the customer get what they came for. Rubric: ConversationResolution.
- Agent-perspective completion: did the assistant complete the task it was supposed to complete, regardless of whether the customer was satisfied. Rubric: TaskCompletion.
- Business-perspective completion: did the call meet the business outcome (booking made, refund issued, escalation closed). Usually a custom rubric tied to a CRM or downstream system.
The variants split when policy and customer satisfaction diverge. A customer asks for an out-of-policy refund. The assistant correctly refuses. task_completion is high (the agent did the right thing), conversation_resolution is low (the customer didn’t get what they wanted). That split is signal, not noise. It tells you which failures are policy-induced versus capability-induced.
from fi.testcases import ConversationalTestCase, LLMTestCase
from fi.evals import Evaluator, ConversationResolution, TaskCompletion

ev = Evaluator(
    fi_api_key="your-future-agi-api-key",
    fi_secret_key="your-future-agi-secret-key",
)

conv = ConversationalTestCase(messages=[
    LLMTestCase(query="I want to cancel my premium plan", response="I can help with that. Can you confirm your account email?"),
    LLMTestCase(query="user@example.com", response="Confirmed. Your plan will be canceled at the end of the current billing cycle on Feb 28."),
])

result = ev.evaluate(
    eval_templates=[ConversationResolution(), TaskCompletion()],
    inputs=[conv],
)
Track both completion variants on every captured call. Their delta is more informative than either alone.
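One way to turn that delta into a label is sketched below. It assumes both rubric scores have already been extracted as numbers in the 0 to 1 range; the 0.5 threshold, the label names, and the `classify_completion_split` helper are illustrative choices, not values the SDK provides.

```python
def classify_completion_split(task_score, resolution_score, threshold=0.5):
    """Label a call by comparing agent-side and customer-side completion.

    Both scores are assumed to be floats in [0, 1]; the threshold is arbitrary.
    """
    agent_ok = task_score >= threshold
    customer_ok = resolution_score >= threshold
    if agent_ok and customer_ok:
        return "resolved"
    if agent_ok:
        # Agent did its job, customer still left unsatisfied: likely a policy limit.
        return "policy_induced"
    if customer_ok:
        # Customer got what they wanted although the task rubric failed: worth a manual look.
        return "needs_review"
    return "capability_induced"

# The out-of-policy refund example: correct refusal, unhappy customer.
refund_refusal = classify_completion_split(task_score=0.9, resolution_score=0.2)
```

Counting calls per label over a week gives a quick read on whether the failure backlog is mostly a prompt problem or mostly a policy conversation.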
Metric 4: sentiment trend
What it measures: how customer sentiment evolves across the call, not the static end-of-call sentiment.
The failure modes it catches:
- A customer enters the call neutral, gets frustrated by turn 3, and hangs up before the agent realizes anything is wrong.
- A customer enters frustrated, the agent de-escalates well, and the call ends positive. (You want to credit the agent for that recovery, not just measure the final state.)
- An agent unintentionally escalates customer frustration with a phrasing pattern that recurs across many calls.
The rubric: this is usually custom because sentiment ontology varies by industry. The Future AGI tone family (IsPolite, IsHelpful, IsConcise) plus a custom sentiment classifier covers the surface. The pattern: score sentiment per turn, plot the slope across turns, alert on negative slopes that cross a threshold.
In production, this metric matters most for outbound campaigns where escalating frustration predicts a hang-up two turns before it happens. If you can detect the slope and trigger a transfer-to-human at that point, you save the call.
Error Feed clusters declining-sentiment calls by the turn at which the slope inflected, so you can find the assistant phrasing patterns that consistently push customers from neutral into frustrated.
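The per-turn slope described above can be sketched in a few lines of plain Python. This assumes a separate classifier has already produced one sentiment score per customer turn on a -1 to 1 scale; the scale, the -0.15 alert threshold, and the helper names are assumptions for illustration.

```python
def sentiment_slope(scores):
    """Least-squares slope of per-turn sentiment against turn index."""
    n = len(scores)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    cov = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(scores))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var

def steepest_drop_turn(scores):
    """Index of the turn that follows the largest single-turn drop, or None if none dropped."""
    drops = [(scores[i] - scores[i - 1], i) for i in range(1, len(scores))]
    if not drops:
        return None
    delta, turn = min(drops)
    return turn if delta < 0 else None

def should_alert(scores, slope_threshold=-0.15):
    """True when sentiment is falling at least as fast as the threshold per turn."""
    return sentiment_slope(scores) <= slope_threshold

# Hypothetical call: neutral opening, frustration building from the third turn.
declining = [0.2, 0.1, -0.3, -0.6]
# Hypothetical call: frustrated opening, agent de-escalates.
recovering = [-0.5, -0.2, 0.1, 0.4]

alert = should_alert(declining)
inflection = steepest_drop_turn(declining)
```

The slope separates the two calls even though a single end-of-call score would only show where each one landed, and the steepest-drop index is a simple stand-in for the inflection turn used to group declining calls.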
Metric 5: escalation triggers
What it measures: when and why a conversation hits a policy or scope boundary that requires a human.
The failure modes it catches:
- A customer asks for something the agent can do, but the agent escalates anyway out of excessive caution.
- A customer asks for something the agent can’t do, and the agent attempts to handle it instead of escalating.
- An agent escalates without giving the human enough context, forcing the human to re-collect basic information.
The rubric: AnswerRefusal scores whether a refusal was justified. Pair it with a custom escalation-context rubric that scores the handoff payload quality (does it summarize the call, name the customer’s actual issue, list what was already tried).
The reason this metric matters: escalation-out-of-caution is one of the largest hidden costs in production voice. Every unnecessary escalation costs human agent time and customer wait time. Escalation-when-needed-but-skipped is the other direction: the agent over-promises and the customer ends up worse off than if they’d been handed off at turn 2.
Error Feed clusters escalation patterns into named issues like “Refund requests above policy threshold handed off correctly” (good cluster, want to keep) versus “Account locked errors handled in-bot when they require human review” (bad cluster, fix the prompt).
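A custom escalation-context rubric can start as a simple completeness check on the handoff payload before graduating to an LLM judge. The sketch below is hypothetical: the field names (`summary`, `customer_issue`, `steps_tried`) and the payload shape are assumptions that mirror the three questions listed above, not a schema defined by any voice provider.

```python
# Hypothetical handoff fields: call summary, the customer's actual issue, what was already tried.
REQUIRED_FIELDS = ("summary", "customer_issue", "steps_tried")

def handoff_context_score(payload):
    """Fraction of required handoff fields that are present and non-empty."""
    missing = [field for field in REQUIRED_FIELDS if not payload.get(field)]
    score = (len(REQUIRED_FIELDS) - len(missing)) / len(REQUIRED_FIELDS)
    return {"score": score, "missing": missing}

# Example payload where the agent never recorded what it already attempted.
payload = {
    "summary": "Customer locked out after three failed logins.",
    "customer_issue": "Account locked",
    "steps_tried": [],
}
handoff = handoff_context_score(payload)
```

A presence check like this only catches empty handoffs. Judging whether the summary is accurate still needs a model-graded or human-graded rubric.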
Metric 6: repeat-question signal
What it measures: did the customer ask the same thing more than once because the previous answer wasn’t useful.
The failure modes it catches:
- The assistant gives a technically accurate but unhelpful answer, and the customer rephrases the same question.
- The assistant answers a related question but not the one asked, and the customer pulls the conversation back to the original.
- The customer asks for confirmation, the assistant gives a partial confirmation, and the customer asks again to verify.
This is the metric that catches the failure mode where the dashboard looks fine but CSAT is quietly tanking. Intent classification is correct. Turn coherence is high. Completion fires positive. But the customer asked three times to get to the answer they wanted.
The rubric: this is a custom signal that compares semantic similarity of customer turns within a single call. A custom Future AGI evaluator computes pairwise similarity between customer turns, flags pairs above a threshold as “repeat questions”, and emits the count and the pair indices as the score.
# Sketch of a custom repeat-question rubric (standalone, not wired into the Future AGI SDK)
from sentence_transformers import SentenceTransformer

# Load the encoder once at import time instead of on every call.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def repeat_question_score(conversation, threshold=0.85):
    customer_turns = [m.query for m in conversation.messages]
    if len(customer_turns) < 2:
        return {"repeat_count": 0, "pairs": []}
    # Unit-length embeddings make the dot product equal to cosine similarity.
    embeddings = encoder.encode(customer_turns, normalize_embeddings=True)
    pairs_above_threshold = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            similarity = float(embeddings[i] @ embeddings[j])
            if similarity > threshold:
                pairs_above_threshold.append((i, j, similarity))
    return {
        "repeat_count": len(pairs_above_threshold),
        "pairs": pairs_above_threshold,
    }
The in-product evaluator authoring agent in Future AGI can draft this kind of custom rubric directly from production traces. You point it at a corpus of calls with high repeat-question patterns, it proposes a rubric, and you tune it before deploying.
Error Feed clusters repeat-question calls by the turn pair that fired. The clusters tend to converge on a handful of phrasing patterns the assistant uses that customers find unhelpful. Each cluster carries a quick-fix recommendation like “Add explicit confirmation phrasing after refund approval” or “Replace generic timeout language with specific next-step language”.
How the metrics compose
The six metrics overlap intentionally. A high-coherence call with low completion is a different failure pattern than a low-coherence call with high completion. A call with rising sentiment and one escalation trigger is different from a call with flat sentiment and zero triggers.
The Error Feed view that matters: group calls by the combination of metric verdicts, not by single-metric scores. A common cluster: low completion + high repeat-question + neutral sentiment. That’s the silent-degradation pattern. Another common cluster: high completion + declining sentiment + one escalation trigger. That’s the recovered-after-friction pattern, which usually doesn’t need fixing.
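To make the grouping concrete, here is a small offline sketch that buckets calls by their combination of verdicts. The per-call dictionary shape, the thresholds, and the bucket names are assumptions for the example; a hosted clustering layer would work from traces rather than a flat list like this.

```python
from collections import Counter

def sentiment_bucket(slope, band=0.1):
    """Map a per-call sentiment slope to declining / neutral / rising."""
    if slope <= -band:
        return "declining"
    if slope >= band:
        return "rising"
    return "neutral"

def verdict_signature(call):
    """Collapse a call's metric scores into one hashable combination of verdicts."""
    return (
        "completion=" + ("high" if call["completion"] >= 0.5 else "low"),
        "repeat=" + ("high" if call["repeat_count"] >= 1 else "low"),
        "sentiment=" + sentiment_bucket(call["sentiment_slope"]),
    )

# Hypothetical scored calls.
calls = [
    {"id": "a", "completion": 0.2, "repeat_count": 3, "sentiment_slope": 0.0},
    {"id": "b", "completion": 0.3, "repeat_count": 2, "sentiment_slope": 0.05},
    {"id": "c", "completion": 0.9, "repeat_count": 0, "sentiment_slope": -0.3},
    {"id": "d", "completion": 0.9, "repeat_count": 0, "sentiment_slope": 0.2},
]
clusters = Counter(verdict_signature(c) for c in calls)
top_signature, top_count = clusters.most_common(1)[0]
```

In this toy data the largest group is the silent-degradation combination (low completion, high repeat-question, neutral sentiment), which no single-metric dashboard would rank first.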
Future AGI’s Error Feed handles the clustering automatically. It groups failed calls by trace pattern and writes the named issue with auto-extracted root cause, supporting evidence from spans, a quick fix to ship today, and a long-term recommendation. The output looks like:
Refund timeout language is causing repeat questions
47 calls in the last week showed the “refund timeout” failure pattern. Customers asked when their refund would post, the assistant said “within 7 business days”, and 31 of those customers asked again in the same call. Repeat-question fired on 31 of 47 traces (66%).
Quick fix: replace “within 7 business days” with “by [specific date]” in the refund response template.
Long-term: extend the refund tool return to include the projected post date as a structured field the assistant can quote directly.
That’s the kind of clustering output that turns six metrics into actionable engineering work.
Native voice observability without an SDK
For Vapi, Retell AI, and LiveKit, the path is dashboard-driven. Add your provider API key + Assistant ID to a Future AGI Agent Definition, enable observability, and every call streams in with auto call log capture, separate assistant + customer audio downloads, an auto transcript, and the full eval engine running. All six metrics in this post run on the captured calls automatically once the rubrics are attached.
For LiveKit and Pipecat code-driven setups, the traceai-livekit and traceAI-pipecat pip packages emit OpenInference-compatible spans. The same six rubrics run on the spans. No additional wiring.
For other voice providers, the Enable Others mode supports any provider through mobile-number simulation. Indian phone number support landed in the 2025-11-25 release.
Code: the full six-rubric pipeline
A working setup looks like this:
import os

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from fi.testcases import ConversationalTestCase, LLMTestCase, MLLMAudio, MLLMTestCase
from fi.evals import (
    Evaluator,
    ConversationCoherence,
    ConversationResolution,
    TaskCompletion,
    AnswerRefusal,
    IsPolite,
    IsHelpful,
    AudioTranscriptionEvaluator,
)

os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

# Register tracing for the LLM service behind your voice agent
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="voice_conversation_monitoring",
    set_global_tracer_provider=True,
)

ev = Evaluator(
    fi_api_key=os.environ["FI_API_KEY"],
    fi_secret_key=os.environ["FI_SECRET_KEY"],
)

# A captured call from the dashboard
conv = ConversationalTestCase(messages=[
    LLMTestCase(query="Hi, I need to update my address", response="Sure, what's the new address?"),
    LLMTestCase(query="123 Oak Street, Springfield IL 62701", response="Got it. Updating now."),
    LLMTestCase(query="Did you also update my billing?", response="Billing uses the same address, so yes."),
])

# Score the conversation against the core rubrics
result = ev.evaluate(
    eval_templates=[
        ConversationCoherence(),
        ConversationResolution(),
        TaskCompletion(),
        AnswerRefusal(),
        IsPolite(),
        IsHelpful(),
    ],
    inputs=[conv],
)

# Score the customer audio for STT drift
customer_audio = MLLMAudio(url="https://fagi.example.com/calls/abc/customer.wav")
audio_case = MLLMTestCase(input=customer_audio, query="Score ASR accuracy")
audio_result = ev.evaluate(
    eval_templates=[AudioTranscriptionEvaluator()],
    inputs=[audio_case],
)
The conversation rubrics score the transcript. The audio rubric scores the underlying audio against the rendered transcript. Together they cover the substance of the call.
For intent confidence and repeat-question signal, add custom evaluators authored either in code or via the in-product agent.
Where Future AGI fits
The full conversation monitoring loop on Future AGI:
- Native voice observability for Vapi, Retell, and LiveKit. No SDK required. Add provider API key + Assistant ID, get call logs, separate audio downloads, transcripts, and eval scoring on every call.
- 70+ built-in eval templates in ai-evaluation, Apache 2.0. The rubrics named in this post (conversation_coherence, conversation_resolution, task_completion, is_polite, is_helpful, is_concise, answer_refusal, audio_transcription, audio_quality) all ship as built-ins. Unlimited custom rubrics for intent confidence, repeat-question signal, and vertical-specific compliance checks.
- Error Feed auto-clusters failures into named issues with auto-written root cause, evidence, quick fix, and long-term recommendation. The cluster output is the actionable engineering backlog.
- Error Localization in Simulate (release 2025-11-25) pinpoints the exact failing turn when a scenario breaks. Programmatic eval API for configure + re-run lets you wire eval automation into CI.
- 18 pre-built personas plus unlimited custom in the simulation product. Gender, age range, location, accent, communication style, conversation speed, background noise, and multilingual controls per persona. Workflow Builder auto-generates branching scenarios with branch visibility.
- Future AGI Protect for inline guardrails. Gemma 3n foundation with LoRA-trained adapters per safety dimension per arXiv 2510.13351, sub-100ms inline. ProtectFlash binary classifier for the lowest-latency surface. In-house classifier models tuned for the LLM-as-judge cost/latency tradeoff on high-volume scoring.
- Agent Command Center for hosted, multi-region, or BYOC self-host. SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per futureagi.com/trust.
- agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) and is available both as a UI workflow inside the Dataset surface and as a Python SDK. Pick an optimizer + a dataset + an evaluator and run.
That’s the unified surface. Every metric in this post runs on the same platform that captures the trace, clusters the failure, and optimizes the prompt against the corrected version through a human-gated loop. Custom evaluators in the in-product agent calibrate from human review feedback so the rubrics get sharper as the team triages more clusters.
Two deliberate tradeoffs
Optimization is an explicit, gated run. The six-optimizer agent-opt surface (UI + SDK) never auto-rewrites prompts in production. Every optimization run is started by a human, gated by an evaluator, and surfaces candidate prompts for approval before they ship. That’s a deliberate process choice: production prompt changes go through human review.
Native voice observability ships for Vapi, Retell, and LiveKit out of the box. The dashboard path covers the three runtimes most teams are on with no SDK required. For any other runtime (Bland, ElevenLabs Agents, Pipecat, or a custom stack on Twilio/Plivo/Telnyx), Enable Others mode + traceAI SDK + webhook covers ingest. Between the native path and Enable Others, the voice stacks most teams run in production in 2026 are in scope.
Common production pitfalls
Running all six metrics on day one. Start with completion and turn coherence. Add the others after you have enough call volume for each score to be statistically stable. Six metrics on 50 calls is noise.
Treating sentiment trend as binary. A single end-of-call sentiment is too coarse. The slope is what carries the signal. If your tool doesn’t compute per-turn sentiment and the slope across turns, the metric isn’t usable.
Conflating escalation triggers with escalation outcomes. A trigger is “the agent decided to hand off”. An outcome is “the human resolved the issue”. Track both; the failure mode where the trigger fires but the outcome doesn’t follow is its own cluster.
Ignoring repeat-question signal as too soft. It’s the highest-signal metric for silent CSAT degradation. The reason most teams skip it is that it requires a custom evaluator. The in-product evaluator authoring agent in Future AGI removes most of that friction.
Skipping the Error Feed clustering layer. Six metrics on raw calls is six dashboards to scan. Six metrics with Error Feed clustering is a named-issue backlog. The clustering is what makes the metrics actionable.
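The first pitfall above, that scores on a small sample are too noisy to act on, can be checked with a confidence interval on any rate metric. The sketch below uses the standard Wilson score interval for a proportion; the call counts are made-up numbers chosen only to contrast a small and a large sample.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# Same observed 80% completion rate, two very different sample sizes (hypothetical counts).
small_low, small_high = wilson_interval(40, 50)
large_low, large_high = wilson_interval(4000, 5000)

small_width = small_high - small_low
large_width = large_high - large_low
```

At 50 calls the interval spans roughly 0.67 to 0.89, so a week-over-week move of several points is indistinguishable from noise. At 5,000 calls the interval narrows to about two points, which is tight enough to alert on.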
When you’ve outgrown this setup
Once the six metrics are running cleanly, the next move is to close the loop into simulation and optimization. Future AGI’s simulation product runs the same rubrics against synthetic conversations driven by 18 pre-built personas plus unlimited custom-authored personas (gender, age range, location, accent, communication style, conversation speed, background noise, multilingual, custom properties, free-form instructions). The Workflow Builder (Conversation / End Call / Transfer Call nodes) auto-generates branching scenarios at 20/50/100 rows with branch visibility, so you cover the long tail before production traffic does. Error Localization pinpoints the failing turn when a scenario breaks. agent-opt’s six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) optimize the assistant’s prompt against the corrected examples through the human-gated UI or programmatic SDK.
The unified surface is the point: production monitoring, simulation, eval, optimization, and inline guardrails on one platform with the same Agent Definition wired across all of them.
Related reading
- 12 metrics for AI conversation monitoring in 2026
- Voice AI Observability for Vapi: 2026 implementation guide
- 7 best voice agent monitoring platforms in 2026
- How to implement voice AI observability in 2026
Sources and references
- ai-evaluation on GitHub: github.com/future-agi/ai-evaluation
- traceAI on GitHub: github.com/future-agi/traceAI
- Error Feed docs: docs.futureagi.com/docs/observe
- Future AGI Protect docs: docs.futureagi.com/docs/protect
- Agent Command Center docs: docs.futureagi.com/docs/command-center
- arXiv 2510.13351 (Protect): arxiv.org/abs/2510.13351
- arXiv 2507.19457 (GEPA): arxiv.org/abs/2507.19457
- arXiv 2505.09666 (Meta-Prompt): arxiv.org/abs/2505.09666
- arXiv 2311.09569 (Random Search baseline): arxiv.org/abs/2311.09569
- Trust page: futureagi.com/trust
- OpenInference spec: github.com/Arize-ai/openinference