How to Use Voice Agent Analytics to Improve CSAT in 2026

Use voice agent analytics to lift CSAT in 2026. Instrument calls with traceAI, score with CSAT-proxy rubrics, cluster failures with Error Feed, fix.

April 9, 2026

Updated May 19, 2026

Voice agents are now answering inbound calls at companies that used to staff call centers around the clock. The bar moved fast: callers don’t reward a voice agent for being a voice agent. They reward it for resolving the call the first time, talking like a person, and not making them repeat themselves. CSAT is the customer-facing scoreboard. This guide walks through the operational loop we use to lift the CSAT proxy 8 to 22 percentage points in the first month: instrument every call with traceAI, score continuously with CSAT-proxy rubrics, cluster dissatisfaction causes with Error Feed, ship the fix, and measure the delta.

TL;DR: the four-step loop

Instrument every call with traceAI. OpenInference spans across ASR, LLM, tool, TTS. Tag each trace with conversation_id, customer_segment, call_intent.
Score every call with ai-evaluation rubrics for empathy, first-resolution rate, customer-sentiment turn-by-turn.
Cluster failures with Error Feed. Auto-named issues with auto-written root cause and quick fix.
Ship the fix, measure delta. Roll out to a percentage of traffic. Compare CSAT proxy on the new cohort vs baseline.

The loop closes because every fix lands back as trace data, which re-scores against the same rubric. Custom evaluators can be refined from corrections and re-run through the programmatic eval API as traffic flows.

The CSAT measurement problem

Real CSAT comes from post-call surveys: a 1-5 scale, a thumbs-up, a free-text comment. The problem is response rate. Production voice deployments see survey response between 5% and 12%, which means a 10,000-call month produces 500 to 1,200 surveys. That’s enough for a monthly board number, not enough to drive day-to-day decisions.
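To make the sample-size problem concrete, the sketch below estimates the 95% margin of error on a satisfied-caller share at the survey volumes quoted above. It assumes a normal approximation and a worst-case proportion of 0.5; the function name, the ten-intent split, and the four-week month are illustrative assumptions, not part of any Future AGI API.

```python
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

MONTHLY_CALLS = 10_000
INTENTS = 10  # hypothetical number of call intents you want to slice by

for response_rate in (0.05, 0.12):
    surveys = round(MONTHLY_CALLS * response_rate)
    monthly = margin_of_error(surveys)
    # A weekly, per-intent slice sees roughly 1 / (4 * INTENTS) of the surveys.
    per_slice = margin_of_error(max(surveys // (4 * INTENTS), 1))
    print(f"{surveys} surveys: ±{monthly:.1%} monthly, ±{per_slice:.1%} per weekly intent slice")
```

Under these assumptions the monthly figure is accurate to within a few points, while a weekly per-intent slice is uncertain by well over ten points, which is why surveys alone cannot steer day-to-day work.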

The fix is a CSAT proxy scored on every call. The proxy doesn’t have to predict CSAT perfectly; it has to predict CSAT with enough correlation that improvements to the proxy move CSAT, and with enough volume that you can act fast.

Three signals do most of the work:

Empathy score. Per-turn 0-1 score on whether the agent acknowledged emotion, validated frustration, offered a path forward.
First-resolution rate. Did the call resolve the caller’s intent without escalation or callback.
Turn-by-turn customer sentiment. Sentiment trajectory across the call. A drop from neutral to negative mid-call is a clear failure signal even if the call eventually resolves.

Run these as ai-evaluation rubrics on every call. The proxy is a weighted combination tuned to your survey baseline.

Step 1: Instrument every call

Two paths, pick the one that fits your runtime and your team. Both ship in the box.

Path A: UI-driven, no-SDK (Vapi, Retell, LiveKit)

If your voice runtime is Vapi, Retell, or LiveKit, you don’t write a line of code to start capturing calls. Create a Future AGI (FAGI) Agent Definition, paste the provider API key and Assistant ID, and enable observability. Every call streams in with assistant and customer audio on separate channels, an automatic transcript, and span attribution per stage. Add the conversation tags (customer_segment, call_intent, language, agent_version) in the Agent Definition UI; they propagate to every captured trace. For providers outside that list, Enable Others mode supports any voice stack via mobile-number simulation. Indian phone numbers ship as a configurable region.

Ops, support, and CX leads can run this path end-to-end. No engineering ticket.

Path B: SDK-driven (Pipecat, custom LiveKit, code-first stacks)

When you own the voice runtime in code (Pipecat, custom LiveKit Agents, or any in-house framework), wire traceAI into the LLM provider. The pattern is the same across runtimes: instrument the underlying LLM provider (OpenAI, Anthropic, LiteLLM), wrap each conversation in a session span, attach conversation-level tags.

# pip install traceAI-openai ai-evaluation fi-instrumentation
import os
from fi_instrumentation import register, FITracer
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="voice_csat_agent",
)

OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
tracer = FITracer(trace_provider.get_tracer(__name__))

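def handle_call(call_id, caller_phone, customer_segment, call_intent):
    # Open the conversation root span before the agent runs so every tag is
    # present for the whole call.
    with tracer.start_as_current_span(
        "voice_conversation",
        attributes={
            "conversation_id": call_id,
            # Raw phone numbers are PII; hash or drop this attribute if traces
            # leave your own infrastructure.
            "caller_phone": caller_phone,
            "customer_segment": customer_segment,
            "call_intent": call_intent,
            "channel": "voice",
        },
    ):
        # run_voice_loop is your own ASR -> LLM -> TTS loop (not shown here).
        run_voice_loop(call_id)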

The tags customer_segment and call_intent are what make tag-based attribution work. Every dashboard slice (CSAT by segment, AHT by intent, FCR by language) reads off these tags. Setting them once at the conversation root span propagates to every child span (ASR, LLM, tool, TTS).

traceAI ships 30+ documented integrations across Python + TypeScript under Apache 2.0, OpenInference-compatible, including dedicated traceAI-pipecat and traceai-livekit packages for voice frameworks. The same spans work across hosted runtimes and OSS frameworks.

Step 2: Score every call with CSAT-proxy rubrics

Same shape as Step 1: two paths, both ship in the box. Pick by who’s authoring rubrics.

Path A: UI-driven (in-product evaluator agent)

For non-engineering reviewers (CX leads, QA managers, ops), the in-product evaluator-authoring agent drafts custom rubrics from your production traces. Point it at a failing cluster, the agent proposes a rubric in plain English with example pass/fail cases pulled from your data, you accept or edit, and the rubric joins the library. It runs on every future call through the same API as the 70+ built-in templates. Acceptances become positive calibration signal, rejections negative; the next round of proposals incorporates the calibration. Every rubric change is human-approved.

For the three CSAT-proxy rubrics specifically:

First-resolution: pick the built-in task_completion template from the eval library. No authoring needed.
Empathy: author through the UI agent. Drop in 8-12 acceptance examples and 8-12 rejection examples from real calls. The agent proposes the rubric prompt.
Sentiment trajectory: same path. Pull 10-20 sample transcripts where sentiment improved, 10-20 where it degraded; the agent drafts the rubric.

The Dataset UI runs eval batches on demand. Pick a dataset, pick evaluators (the three rubrics above), run. The dashboard renders results without you writing code.

Path B: SDK-driven (code-first)

If your team prefers config files in version control, author the same three rubrics in code:

from fi.evals import evaluate
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider
from fi.opt.base import Evaluator

def score_csat_proxy(call_transcript, agent_turns, user_turns):
    # 1. First-resolution rate (task completion against the original intent)
    resolution = evaluate(
        eval_templates="task_completion",
        inputs={
            "input": user_turns[0],
            "output": agent_turns[-1],
            "expected_output": "Caller intent resolved without escalation.",
        },
    )

    # 2. Empathy score (per-turn average)
    empathy_judge = CustomLLMJudge(
        name="empathy_judge",
        grading_criteria=(
            "Score 1 if the agent's turn acknowledged the caller's emotion, "
            "validated frustration when present, and offered a path forward. "
            "Score 0 if the agent ignored emotion, dismissed the concern, or "
            "responded mechanically. Score 0.5 if partial."
        ),
        provider=LiteLLMProvider(model="gpt-4o-mini"),
    )

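    # Pair each agent turn with the caller turn it answers. zip stops at the
    # shorter list, so an unmatched trailing turn cannot raise IndexError.
    # Assumes the judge returns a numeric score per turn.
    empathy_scores = [
        Evaluator(metric=empathy_judge).evaluate(
            output=agent_turn,
            context=f"Caller said: {user_turn}",
        )
        for user_turn, agent_turn in zip(user_turns, agent_turns)
    ]
    empathy = sum(empathy_scores) / max(len(empathy_scores), 1)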

    # 3. Customer sentiment trajectory (custom evaluator)
    sentiment_judge = CustomLLMJudge(
        name="sentiment_trajectory",
        grading_criteria=(
            "Score the change in caller sentiment across the transcript on "
            "a -1 to +1 scale; positive means sentiment improved by the end."
        ),
        provider=LiteLLMProvider(model="gpt-4o-mini"),
    )
    sentiment = Evaluator(metric=sentiment_judge).evaluate(
        output=call_transcript,
    )

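    # Assumes the judge returns a numeric score in [-1, 1].
    sentiment_delta = float(sentiment)
    # Rescale to [0, 1] so all three components share a range in the composite.
    sentiment_component = (sentiment_delta + 1) / 2
    first_resolution = resolution.eval_results[0].metrics[0].value
    return {
        "first_resolution": first_resolution,
        "empathy": empathy,
        "sentiment_delta": sentiment_delta,
        "csat_proxy": (
            0.5 * first_resolution
            + 0.3 * empathy
            + 0.2 * sentiment_component
        ),
    }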

A few notes on the setup:

Run scoring asynchronously, off the critical path. The CSAT proxy doesn’t need to come back while the caller is waiting. Score after the call ends so latency stays clean.
Tune the weights against survey ground truth. The 0.5 / 0.3 / 0.2 weights above are a reasonable starting point; recompute them by regressing your CSAT-proxy components against the 5-12% of calls that produced a real survey.
Use the cheap tier for continuous scoring, the expensive tier for sampling. Smaller judge models score quickly at low cost; reserve the larger judge models for sampled deep-dives. Reserve ProtectFlash for harmful/not-harmful binary safety classification, not generic CSAT judging.
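The weight-tuning note above can be sketched as an ordinary least-squares fit of survey scores on the three proxy components. This is a minimal pure-Python illustration, assuming all components and survey scores are rescaled to 0-1; `fit_weights` is a hypothetical helper, and the synthetic data stands in for the surveyed subset of calls.

```python
import random

def fit_weights(components, survey_scores):
    """Least-squares weights (no intercept) via the normal equations."""
    k = len(components[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(row[i] * row[j] for row in components) for j in range(k)]
        + [sum(row[i] * y for row, y in zip(components, survey_scores))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    raw = [aug[i][k] / aug[i][i] for i in range(k)]
    total = sum(raw)
    return [w / total for w in raw]  # normalise so the weights sum to 1

# Synthetic stand-in for surveyed calls: (first_resolution, empathy, sentiment).
random.seed(0)
components = [[random.random() for _ in range(3)] for _ in range(300)]
survey = [0.5 * r + 0.3 * e + 0.2 * s for r, e, s in components]

weights = fit_weights(components, survey)
print([round(w, 3) for w in weights])
```

Real survey data is noisy, so fitted weights will be less clean than this noise-free example; a negative or near-zero weight is a sign the component needs rework rather than a number to ship.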

Both paths produce the same artifact: a versioned rubric in your library that runs on every call. The UI path is faster for the empathy and sentiment rubrics because the agent reads your traces and proposes the rubric prompt; the code path is preferred when you want the rubric definition in git alongside the rest of the agent config.

ai-evaluation ships 70+ built-in eval templates, including conversation_resolution, task_completion, is_polite, is_helpful, is_concise as the CSAT-proxy rubrics, plus conversation_coherence and audio_quality for the underlying conversation health. The library also includes unlimited custom evaluators authored by the in-product agent or written in code. Custom evaluators calibrate from human review feedback over time. In-house classifier models are tuned for the LLM-as-judge cost/latency tradeoff so scoring 1M+ calls per month stays affordable. The programmatic eval API lets you configure + re-run scores against historical traces from either path.

Step 3: Cluster dissatisfaction causes with Error Feed

The thing that blocks most CSAT improvement programs isn’t the lack of data; it’s the impossibility of triaging it by hand. A 10,000-call month produces thousands of failing traces. Triaging them one at a time is a full-time job that doesn’t compound.

Error Feed auto-clusters trace failures into named issues and writes root cause, quick fix, and long-term recommendation. It runs zero-config the moment traces hit an Observe project, so 50 traces with the same underlying problem show up as one issue.

For voice CSAT, these are the clusters that recur most often:

Mistranscription cluster. ASR drops a specific accent or jargon term. The root cause names the term; the quick fix adds it to the STT custom vocabulary list.
Intent misclassification cluster. A user phrase routes to the wrong tool. Root cause names the phrase; quick fix adds it to the few-shot examples.
Tool hallucination cluster. The agent passes wrong arguments to a tool. Root cause names the schema drift; quick fix updates the tool description.
Refusal misfire cluster. The agent refused a benign request. Root cause names the policy phrase that overfired; quick fix relaxes it.
Brand-voice drift cluster. The agent’s tone drifted from the style guide. Root cause names the prompt section that drifted; quick fix re-anchors it.

Each issue carries a trend signal (rising, steady, falling) so you can prioritize the clusters whose CSAT impact is compounding.
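As an illustration of what a rising / steady / falling label can mean, the sketch below compares a cluster's recent weekly counts against its earlier ones. This is not Error Feed's actual algorithm, which the article does not describe; `classify_trend` and the 20% tolerance are assumptions made for the example.

```python
def classify_trend(weekly_counts, tolerance=0.2):
    """Label a cluster by comparing its recent half against its earlier half."""
    half = len(weekly_counts) // 2
    if half == 0:
        return "steady"  # not enough history to call a direction
    earlier = sum(weekly_counts[:half]) / half
    recent = sum(weekly_counts[-half:]) / half
    if earlier == 0:
        return "rising" if recent > 0 else "steady"
    change = (recent - earlier) / earlier
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "falling"
    return "steady"

# Hypothetical weekly failing-trace counts per cluster.
clusters = {
    "mistranscription": [10, 12, 20, 26],
    "refusal_misfire": [30, 28, 12, 10],
    "brand_voice_drift": [20, 21, 19, 20],
}
for name, counts in clusters.items():
    print(name, classify_trend(counts))
```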

Step 4: Ship the fix, measure the delta

The loop only closes when the fix lands back in production and the delta is measurable. The pattern:

Roll out to a percentage of traffic. 10% is a reasonable starting cohort; smaller for high-stakes intents, larger for routine ones.
Compare CSAT-proxy distributions. Run a two-sample test on the new cohort vs baseline. Look at the components separately (resolution, empathy, sentiment) rather than only the composite.
Promote when delta clears a confidence threshold. A 2-3 point lift on the composite with p < 0.05 across 1,000+ calls is a clean signal.
Roll back if degradation appears. Tag-based attribution makes the rollback surgical: roll back only the cohort that’s degrading, not the whole prompt.
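The compare-and-promote steps above can be sketched with a two-sample z-test, which is a reasonable approximation at 1,000+ calls per cohort. The function names, the 0-100 proxy scale, and the synthetic cohorts are assumptions for illustration; the thresholds mirror the figures in the list.

```python
import math
from statistics import mean, variance

def compare_cohorts(baseline, candidate):
    """Return (lift, two-sided p-value) using a normal approximation."""
    lift = mean(candidate) - mean(baseline)
    std_err = math.sqrt(
        variance(baseline) / len(baseline) + variance(candidate) / len(candidate)
    )
    if std_err == 0:
        return lift, (1.0 if lift == 0 else 0.0)
    z = lift / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return lift, p_value

def should_promote(baseline, candidate, min_lift=2.0, alpha=0.05, min_calls=1000):
    """Promote only when the lift is large enough, significant, and well sampled."""
    lift, p_value = compare_cohorts(baseline, candidate)
    return len(candidate) >= min_calls and lift >= min_lift and p_value < alpha

# Synthetic cohorts on a 0-100 proxy scale: the candidate is 3 points higher.
baseline = [70 + (i % 21) for i in range(1000)]
candidate = [score + 3 for score in baseline]

print(compare_cohorts(baseline, candidate))
print(should_promote(baseline, candidate))
```

Running the same comparison on each component (resolution, empathy, sentiment) shows which one moved, as the list recommends.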

For prompt-tuning work specifically, agent-opt ships six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard). Same two-path pattern as the rest of the loop:

UI path: inside the Dataset workflow, select a failing cluster (promoted from Error Feed), pick an evaluator (your CSAT-proxy composite), pick one of the six optimizers, run. The dashboard surfaces optimizer iterations, candidate prompts, and final scores. CX or PM leads can run this without code.
SDK path: agent-opt Python library exposes the same six optimizers. Drop in a config file, run the optimizer in CI against a dataset built from Error Feed clusters, and ship the winning variant through your normal deploy.

Both paths read failing clusters from Error Feed and propose prompt variants whose expected CSAT-proxy is higher. FAGI never auto-rewrites prompts without an explicit run and a human approval gate.

Tag-based attribution: the dashboard layer

Tag-based attribution is what makes the CSAT loop legible to the rest of the business. Every trace carries customer_segment, call_intent, language, agent_version, model_name. The dashboard slices CSAT by every combination:

CSAT by intent (booking vs FAQ vs complaint).
CSAT by language (English vs Spanish vs Tamil).
CSAT by customer segment (enterprise vs SMB vs consumer).
CSAT by agent version (the rollout cohort vs baseline).
CSAT by model (GPT-4o vs Claude 3.7 vs Gemini 2.0).

The slicing surfaces the clusters that matter. If CSAT is fine overall but drops 15 points on Spanish-language calls, the dashboard surfaces that immediately; without tag-based attribution the signal gets lost in the average.
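A minimal sketch of the slicing idea, assuming each scored call is a dict carrying its tags and a 0-100 `csat_proxy` value. The records and the `csat_by_tag` helper are hypothetical; the dashboard does this for you, and the code only shows why the average hides the drop.

```python
from collections import defaultdict
from statistics import mean

def csat_by_tag(calls, tag):
    """Average CSAT proxy per distinct value of one tag."""
    groups = defaultdict(list)
    for call in calls:
        groups[call[tag]].append(call["csat_proxy"])
    return {value: mean(scores) for value, scores in groups.items()}

# Hypothetical scored calls.
calls = [
    {"language": "en", "call_intent": "booking", "csat_proxy": 82},
    {"language": "en", "call_intent": "faq", "csat_proxy": 84},
    {"language": "en", "call_intent": "complaint", "csat_proxy": 80},
    {"language": "es", "call_intent": "booking", "csat_proxy": 66},
    {"language": "es", "call_intent": "faq", "csat_proxy": 68},
    {"language": "es", "call_intent": "complaint", "csat_proxy": 67},
]

overall = mean(call["csat_proxy"] for call in calls)
by_language = csat_by_tag(calls, "language")
print(overall, by_language)
```

Here the overall average looks acceptable while the Spanish-language slice sits 15 points below the English one.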

What about AHT and FCR?

AHT (average handle time) and FCR (first-call resolution) are the operational siblings of CSAT. The right framing is:

CSAT is the customer-perception scoreboard. Optimize for this.
FCR is the resolution scoreboard. A leading indicator of CSAT.
AHT is the cost scoreboard. A constraint, not a target. Set a ceiling.

A common failure mode is optimizing AHT and watching CSAT crash. Faster calls feel rushed. The fix is to set an AHT ceiling (e.g., 3 minutes for FAQ, 5 minutes for booking) and optimize CSAT and FCR within that ceiling.
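Treating AHT as a constraint rather than a target can be sketched as a filter followed by a maximization. The ceilings come from the examples above; the variant records and `pick_variant` are hypothetical.

```python
# Ceilings from the examples above: 3 minutes for FAQ, 5 minutes for booking.
AHT_CEILING_SECONDS = {"faq": 180, "booking": 300}

def pick_variant(variants, intent):
    """Highest-CSAT variant among those that respect the intent's AHT ceiling."""
    ceiling = AHT_CEILING_SECONDS[intent]
    eligible = [v for v in variants if v["aht_seconds"] <= ceiling]
    if not eligible:
        return None  # nothing fits the ceiling; revisit the ceiling or the prompts
    return max(eligible, key=lambda v: v["csat_proxy"])

# Hypothetical FAQ prompt variants.
faq_variants = [
    {"name": "A", "aht_seconds": 150, "csat_proxy": 80},
    {"name": "B", "aht_seconds": 120, "csat_proxy": 74},  # fastest, but feels rushed
    {"name": "C", "aht_seconds": 210, "csat_proxy": 86},  # best CSAT, over the ceiling
]

chosen = pick_variant(faq_variants, "faq")
fastest = min(faq_variants, key=lambda v: v["aht_seconds"])
print(chosen["name"], fastest["name"])
```

Minimizing AHT alone would pick variant B; the ceiling-constrained choice is variant A.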

All three roll off the same trace data. AHT comes from trace duration and span timing; ai-evaluation handles resolution, task completion, tone and helpfulness, audio quality, and custom CSAT rubrics. For voice-specific QA, also score audio_transcription and audio_quality, feeding MLLMAudio inputs in .mp3, .wav, .ogg, .m4a, .aac, .flac, and .wma formats. The dashboard plots them together.

The Future AGI stack on this loop

The CSAT loop has five products doing five jobs:

traceAI + native voice observability: 30+ documented integrations across Python + TypeScript (including traceAI-pipecat, traceai-livekit), OpenInference-compat, Apache 2.0. For Vapi/Retell/LiveKit, no SDK is needed; native dashboard ingestion handles it. Every call becomes a structured trace.
ai-evaluation: 70+ built-in eval templates plus unlimited custom evaluators authored by an in-product agent that calibrate from human review feedback, in-house classifier models tuned for the LLM-as-judge cost/latency tradeoff. Apache 2.0. Every call scored on the CSAT-proxy rubrics (conversation_resolution, task_completion, is_polite, is_helpful, is_concise).
Error Feed: the clustering and what-to-fix layer over your traces and evals. Zero-config auto-clusters failures into named issues with auto-written root cause, quick fix, and long-term recommendation. Error Localization pinpoints the exact failing turn for simulation-driven debugging.
agent-opt: six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) exposed via both Dataset UI and SDK. Tunes the CSAT-driving prompt sections on explicit run with human approval.
Agent Command Center: RBAC, SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 certified, AWS Marketplace, multi-region hosted, 15+ provider routing. Hosts the whole stack with per-team RBAC.

The closed loop (trace, eval, cluster, optimize, re-deploy) is the differentiator. Most analytics products give you the dashboard. Future AGI closes the loop so the dashboard’s signals turn into shipped fixes without humans having to glue the pieces together.

Three deliberate tradeoffs

Prompt rewrites are gated on purpose. agent-opt requires an explicit run plus a human approval gate before any prompt rewrite ships, so FAGI never auto-rewrites prompts in production without human approval.

Native voice obs ships for Vapi, Retell, and LiveKit out of the box. Enable Others mode covers the rest via traceAI SDK or webhook, which covers 90%+ of production stacks. The dashboards are actively iterated every release. Recent shipped work includes multi-step Agent Definition UX, Prompt Workbench Revamp, redesigned Run Test performance metrics, Show Reasoning column in Simulate, sticky filters in Observe, and Error Localization that pinpoints the failing turn.

Survey ground truth is your job. FAGI scores the CSAT proxy. The 5-12% of real survey responses are still your data and still your ingest. The proxy is what runs at scale; the survey is what calibrates the proxy weights. That separation is intentional because survey infra is usually a separate vendor for compliance and PII reasons.

Operational dashboard design

The dashboard layer is where the loop becomes visible to operators. Three views handle most of the day-to-day work.

View 1: CSAT proxy over time. Time-series of the composite CSAT proxy with annotations for every deploy. The “before vs after” delta for each prompt change shows up immediately. Add the components (empathy, first-resolution, sentiment) as overlay lines so you can see which one moved.

View 2: CSAT by intent. Bar chart of CSAT proxy sliced by call_intent tag. The intents at the bottom of the chart are your improvement targets. Drill into any bar to see the underlying traces and the Error Feed clusters that pulled it down.

View 3: Cluster trend. Error Feed clusters sorted by trend (rising clusters at the top). Rising clusters are the work for this week. Falling clusters validate that last week’s fixes are landing. Steady clusters are background noise.

These three views replace the typical call-center QA dashboard (random call sampling, manual scorecard) with something that runs on every call automatically. The manual QA team’s time shifts from scoring random samples to investigating the clusters at the top of view 3.

Sentiment scoring choices

Sentiment scoring per turn has three implementation choices that affect cost and quality:

Open-source classifier. A small model like DistilBERT fine-tuned on a customer-service sentiment dataset. Fast and free; sentiment quality is adequate for trajectory tracking but weak on sarcasm and frustration that’s expressed indirectly.

LLM-as-judge with a small judge model. A small judge like gpt-4o-mini scoring sentiment per turn. Better quality; the latency is fine for async scoring after the call. Cost is the main tradeoff at very high volume.

In-house classifier model. Future AGI ships a sentiment classifier in the same family as the empathy and faithfulness scorers, tuned for the LLM-as-judge cost/latency tradeoff. Pick this when you’re scoring 1M+ calls per month and the per-call eval cost has to stay under a tenth of a cent.

The right pick depends on volume. For most mid-market deployments the LLM-as-judge option is the right starting point; the in-house classifier becomes the right pick once volume approaches the 1M+ calls per month range, where per-call eval cost dominates the decision.

How tag attribution actually catches regressions

A worked example. A team rolls out a new system prompt that’s shorter and faster. Overall CSAT proxy stays flat. Without tag attribution the team marks the change as neutral and moves on. With tag attribution the dashboard surfaces that CSAT proxy dropped 9 points on the “complaint” intent specifically, because the shorter prompt removed an empathy preamble that mattered for that intent only.

The fix is intent-conditional: re-add the empathy preamble for complaint calls only. The composite CSAT proxy lifts 3 points after the second deploy. Without tag attribution the second deploy never happens because the first deploy looked neutral.

The lesson is that customer_segment, call_intent, language, and agent_version tags need to be set on the conversation root span before the agent runs, not after. Setting them after means the rollout cohort is incomplete and the comparison is biased.
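The worked example can be reproduced with a per-intent comparison between the baseline and rollout cohorts. The call counts and scores below are made up to match the story (flat overall, 9 points down on complaints), and both helper names are hypothetical.

```python
def weighted_mean(cohort):
    """Overall CSAT proxy from {intent: (call_count, mean_score)}."""
    total_calls = sum(count for count, _ in cohort.values())
    return sum(count * score for count, score in cohort.values()) / total_calls

def regressions(baseline, candidate, max_drop=5):
    """Intents whose mean score fell by at least max_drop points."""
    deltas = {
        intent: candidate[intent][1] - baseline[intent][1] for intent in baseline
    }
    return {intent: delta for intent, delta in deltas.items() if delta <= -max_drop}

# Hypothetical cohorts: {intent: (call_count, mean CSAT proxy on a 0-100 scale)}.
baseline = {"booking": (500, 80), "faq": (400, 82), "complaint": (100, 75)}
candidate = {"booking": (500, 81), "faq": (400, 83), "complaint": (100, 66)}

print(weighted_mean(baseline), weighted_mean(candidate))
print(regressions(baseline, candidate))
```

Small gains on the high-volume intents cancel the complaint drop in the overall mean, so only the sliced view flags the regression.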

Industry-specific rubric tuning

The CSAT-proxy rubric weights are not the same across industries. The base 0.5 / 0.3 / 0.2 weighting (first-resolution, empathy, sentiment) is a reasonable starting point for general support workloads but should be re-fitted against survey ground truth per vertical.

Healthcare and clinical workloads. Empathy weighting moves up to 0.4 because callers are often anxious or distressed and the agent’s tone has outsized impact on CSAT. First-resolution stays important because callers don’t want to repeat their condition to a second person.

E-commerce and retail. First-resolution weighting moves up to 0.6 because callers are transactional and want the answer fast. Empathy still matters but the bar is lower; the caller will tolerate a brusque agent if the resolution lands quickly.

Financial services and lending. Faithfulness joins the proxy as a fourth component with 0.25 weighting. Wrong information about loan terms or account balances is more damaging than slow service. The base proxy gets reweighted to 0.4 / 0.2 / 0.15 / 0.25.

Travel and hospitality. Sentiment trajectory matters more because the call often starts with a frustrated caller (cancelled flight, lost luggage) and CSAT depends on whether the agent moved them from negative to neutral. Weighting becomes 0.3 / 0.3 / 0.4.

ai-evaluation supports per-industry custom evaluators authored by an in-product agent. The agent reads your existing trace data and survey data and proposes initial weight settings; you refine the weights as more survey data lands.
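The per-vertical weightings can be kept as plain configuration. The sketch below includes only the profiles whose weights are fully stated above; healthcare and e-commerce are left out because the text gives a single weight for each. Component scores are assumed to be on a 0-1 scale, and the names are illustrative.

```python
WEIGHT_PROFILES = {
    "general": {"first_resolution": 0.5, "empathy": 0.3, "sentiment": 0.2},
    "financial_services": {
        "first_resolution": 0.4,
        "empathy": 0.2,
        "sentiment": 0.15,
        "faithfulness": 0.25,
    },
    "travel": {"first_resolution": 0.3, "empathy": 0.3, "sentiment": 0.4},
}

def csat_proxy(scores, vertical="general"):
    """Weighted composite of 0-1 component scores for one vertical."""
    weights = WEIGHT_PROFILES[vertical]
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError(f"weights for {vertical} must sum to 1")
    return sum(weight * scores[name] for name, weight in weights.items())

# Hypothetical component scores for one call.
scores = {"first_resolution": 1.0, "empathy": 0.5, "sentiment": 0.8, "faithfulness": 1.0}
for vertical in WEIGHT_PROFILES:
    print(vertical, round(csat_proxy(scores, vertical), 2))
```

The same call scores differently per vertical, which is the point of re-fitting the weights against each vertical's survey data.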

When to escalate to a human

CSAT analytics doesn’t just optimize the AI’s responses. It also tunes the escalation policy. The right escalation policy lifts CSAT directly because a frustrated caller routed to a human at the right moment ends the call satisfied; the same caller forced to keep talking to an agent that can’t help leaves furious.

Three signals drive escalation:

Sentiment trajectory. Caller sentiment drops from neutral to negative across three turns. Escalate immediately. The longer the agent stays on the call after sentiment crashes, the lower the CSAT.

Repeat-question signal. Caller repeats a variant of the same question within the call, which signals that the agent’s earlier answer didn’t land. Escalate after the second repeat unless the agent has a high-confidence answer to deliver.

Refusal-handling failure. Caller pushes back on a refusal (“but I really need this”) more than twice. Escalate; the policy boundary the agent is enforcing is one the human supervisor should adjudicate.

Each of the three signals is computed on the trace data with rubrics from ai-evaluation. The escalation policy is a function of the three signals plus the call_intent tag (some intents have aggressive escalation, some have conservative). Tag-based attribution lets you A/B-test the escalation thresholds per intent without changing the underlying agent.
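A minimal sketch of the escalation policy as a function of the three signals. It assumes per-turn caller sentiment on a -1 to +1 scale; the numeric cutoffs for "neutral" and "negative" and the function name are illustrative choices, not values from ai-evaluation, and per-intent thresholds would be passed in the same way.

```python
def escalation_reasons(
    caller_sentiment,
    repeat_count,
    refusal_pushbacks,
    high_confidence_answer=False,
    neutral_floor=-0.2,
    negative_ceiling=-0.5,
):
    """Return the list of signals that currently call for a human handoff."""
    reasons = []
    # Signal 1: neutral-or-better three turns ago, clearly negative now.
    window = caller_sentiment[-3:]
    if len(window) == 3 and window[0] >= neutral_floor and window[-1] <= negative_ceiling:
        reasons.append("sentiment_drop")
    # Signal 2: second repeat of the same question, with no confident answer ready.
    if repeat_count >= 2 and not high_confidence_answer:
        reasons.append("repeat_question")
    # Signal 3: caller pushed back on a refusal more than twice.
    if refusal_pushbacks > 2:
        reasons.append("refusal_pushback")
    return reasons

print(escalation_reasons([0.1, -0.2, -0.6], repeat_count=0, refusal_pushbacks=0))
print(escalation_reasons([0.2, 0.1, 0.0], repeat_count=2, refusal_pushbacks=3))
```

Returning the reasons rather than a bare boolean keeps the handoff attributable, so each threshold can be A/B-tested per intent.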

A worked example: 12-point CSAT lift in one month

A mid-market dental SaaS customer ran this loop on a voice receptionist + booking deployment. The baseline CSAT proxy sat at 78. The team enabled traceAI plus the three CSAT-proxy rubrics plus Error Feed in week one.

Week 2 top clusters:

Mistranscription on “Dr. last name” sequences. ASR was dropping the doctor’s surname on 18% of calls. Quick fix: STT custom vocabulary plus a confirmation turn. CSAT-proxy impact: +4.
Intent misroute on “I need to reschedule but I don’t have my booking ID.” Agent escalated unnecessarily on 11% of reschedule calls. Quick fix: few-shot example for partial-context reschedules. CSAT-proxy impact: +3.
Refusal misfire on “Can you confirm my coverage.” Agent refused as a HIPAA-adjacent request when it was a benign coverage check. Quick fix: relaxed policy phrase plus a tighter PHI scanner. CSAT-proxy impact: +5.

By end of month one the CSAT proxy hit 90. The composite lift was 12 points, weighted heaviest on the refusal-misfire fix. The next month’s clusters were thinner because the loop closed: the residual was in the survey sample-size noise.

Related reading

How to Implement Voice AI Observability in 2026: the underlying instrumentation pattern.
Voice AI Evaluation Infrastructure: Developer’s Guide: the rubric library and judge model tradeoffs.
Agent Metrics Frameworks: the broader metrics surface for agents.
Voice Agent Scenarios Without Manual QA: pre-launch simulation patterns that complement production analytics.

Sources and references

arXiv 2510.13351: Future AGI Protect model family (arxiv.org/abs/2510.13351)
OpenInference specification: OpenTelemetry GenAI semantic conventions
Future AGI trust page: futureagi.com/trust
traceAI repository: github.com/future-agi/traceAI
ai-evaluation repository: github.com/future-agi/ai-evaluation
agent-opt repository: github.com/future-agi/agent-opt
Error Feed docs: docs.futureagi.com/docs/observe

Frequently asked questions

What is CSAT for a voice agent?

CSAT for a voice agent is the share of callers who leave the call satisfied. The ground-truth measurement is a post-call survey scored 1 to 5 or thumbs up / down. The problem is sample size: response rates land around 5-12% in practice, which is too sparse to drive day-to-day decisions. The fix is a CSAT proxy built from three measurable signals scored on every call: empathy, first-call resolution, and turn-by-turn customer sentiment.

What metrics drive CSAT in a voice agent?

Four families. Resolution metrics (first-call resolution, escalation rate, repeat-call rate). Conversational metrics (turn count, agent talk-time ratio, interruption count, dead-air seconds). Quality metrics (intent-classification accuracy, faithfulness, refusal handling). Brand metrics (empathy score, tone adherence, brand-voice consistency). Each maps to a rubric in ai-evaluation. Score every call on every rubric and the CSAT proxy emerges.

How do I score empathy on every call?

Use an LLM-as-judge rubric or an in-house classifier model. ai-evaluation ships `is_polite`, `is_helpful`, `is_concise`, `conversation_resolution`, and `task_completion` as named templates; author empathy as a custom evaluator via the in-product agent if you need a 0-1 per-turn signal. For high-volume deployments the in-house classifier models are tuned for the LLM-as-judge cost/latency tradeoff so scoring 1M+ calls per month stays under budget.

What's the difference between AHT and CSAT?

AHT (average handle time) is operational; CSAT is customer-perception. They often trade against each other: a faster call can leave the caller feeling rushed; a thorough call can resolve everything but burn minutes. The right framing is to optimize CSAT subject to an AHT ceiling, not to minimize AHT. Tag-based attribution lets you slice the two against each other in the dashboard.

How does Error Feed help with CSAT?

Error Feed auto-clusters failing traces into named issues with auto-written root cause, quick fix, and long-term recommendation. For CSAT work that means 50 calls where the caller hung up after a misrouted intent show up as one issue with the root cause (intent classifier weak on a specific phrase) and a quick fix (add the phrase to the few-shot examples) rather than 50 separate alerts you have to triage by hand.

Can I run CSAT analytics without sending data to a third party?

Yes. Future AGI's full stack (traceAI, ai-evaluation, agent-opt) is Apache 2.0 and runs in your own infra. The Agent Command Center hosted tier is SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page if you prefer hosted. Self-hosted means no PII or PHI ever leaves your VPC.

What CSAT lift is realistic from analytics-driven improvement?

Customer-reported lifts on production voice deployments after one full trace-eval-cluster-fix loop land between 8 and 22 percentage points on the CSAT proxy, with the largest lifts coming from clustering ASR errors that traditional QA misses. The lift compounds: month one fixes the top three clusters, month two fixes the next three, and so on until the residual is in the noise of survey sample size.
