AI Call Center QA Software: 8 Tools Compared for Transcript Scoring and Live Coaching (2026)

AI call center QA software in 2026 splits into transcript scoring and live coaching. Eight tools compared on rubric depth, real-time latency, deployment, and honest fit.

April 9, 2026

Updated May 20, 2026

voice-ai 2026 call-center qa quality-monitoring live-coaching speech-analytics

Call center QA in 2026 splits into two jobs that buyers keep conflating. The first is transcript scoring after the call: every recording lands in ASR, the transcript runs through a rubric set, failures cluster into named issues, and the result feeds coaching and audit. The second is live coaching during the call: a low-latency model watches the stream and surfaces a whisper, a knowledge-base card, or a supervisor barge-in cue while the agent is still on the line. Most teams need both, but the cadence is different. Transcript scoring runs hourly or daily and lives on coverage and calibration. Live coaching runs every turn and lives on latency and signal-to-noise. The mistake to avoid is buying a transcript-scorer and expecting it to coach in real time, or buying a live-coaching product and expecting an honest historical audit to fall out of it.

This guide compares eight AI call center QA tools across both jobs. As of May 2026 the named tools are Future AGI, Cresta, Observe.AI, Level AI, CallMiner, Salesforce Service Cloud Einstein, Zoom Revenue Accelerator, and Tethr. Each card covers what it’s best at, the honest tradeoff, the deployment surface, and the rubric coverage.

TL;DR: which tool for which job

Job to be done	Best fit	Honest tradeoff
Pre-launch voice-agent simulation + post-call scoring + live guardrails	Future AGI	Newer in the CCaaS recording market; integrations with Five9/Genesys ship through standard recording exports rather than a deep native adapter
Live coaching for sales, retention, collections	Cresta	Strong on real-time; post-call audit is improving but less deep than Observe.AI
End-to-end agent assist + post-call QA on traditional CCaaS	Observe.AI	Heavier rollout; pricing leans enterprise
Mid-market automated QA with no-code rubric authoring	Level AI	Less depth on real-time coaching than Cresta
Speech analytics on the largest call corpora	CallMiner	Origin in keyword spotting still shows in the UI
Salesforce-native shops	Salesforce Service Cloud Einstein	Locked to Service Cloud Voice; rubric library is smaller
Zoom-native shops	Zoom Revenue Accelerator	Sales-revenue lens, not a QA-first product
Mid-market post-call audit on Genesys/Talkdesk	Tethr	Smaller eval-template catalog; weaker on voice-agent runtimes

Non-negotiables to look for in any vendor: an honest separation between live and post-call scoring, a calibration cycle, a clustering layer (not just per-call alerts), and a recording-ingestion surface that covers both AI voice agents (Vapi, Retell, ElevenLabs, LiveKit, Pipecat) and the traditional CCaaS stacks (Five9, Genesys, Amazon Connect, NICE CXone, Talkdesk) without forcing a re-platforming project up front.

How we evaluated

Every card below covers the same six axes. The scorecard is portable: take it to any vendor demo and ask the same questions.

Job split. Does the product do transcript scoring, live coaching, or both? If both, where does the architecture compromise?
Rubric depth. How many built-in rubrics ship, how custom rubrics get authored, and whether the library covers the regulated verticals you operate in.
Real-time latency. For live coaching, the median time-to-whisper. Anything above 200ms reads as a hesitation cue to the agent.
Recording-ingestion surface. Native adapters for Vapi, Retell, ElevenLabs, LiveKit, Pipecat (voice-agent runtimes) and Five9, Genesys, Amazon Connect, NICE CXone, Talkdesk (traditional CCaaS).
Clustering and Error Localization. Does the product cluster failing calls into named issues with auto-written root cause and fix, or surface a flat per-call alert feed?
Compliance posture. Which of SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 the vendor holds, and whether redaction is PCI-aware.
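The latency axis can be checked in a demo with nothing more than paired timestamps. A minimal sketch, assuming you can export, per whisper, the time the triggering utterance ended and the time the whisper rendered; the tuple layout and the sample values are hypothetical, not any vendor's export format.

```python
from statistics import median

# Budget from the scorecard: a median time-to-whisper above 200 ms reads as hesitation.
WHISPER_BUDGET_MS = 200.0


def median_time_to_whisper_ms(events):
    """events: iterable of (utterance_end_ms, whisper_shown_ms) pairs."""
    latencies = [shown - ended for ended, shown in events]
    if not latencies:
        raise ValueError("no whisper events to measure")
    return median(latencies)


# Hypothetical sample captured during a vendor demo.
sample_events = [
    (1000, 1140),
    (5200, 5385),
    (9100, 9260),
    (12050, 12400),
    (15000, 15150),
]
p50 = median_time_to_whisper_ms(sample_events)
within_budget = p50 <= WHISPER_BUDGET_MS
print(f"median time-to-whisper: {p50:.0f} ms, within budget: {within_budget}")
```

The median is used rather than the mean so that a single slow whisper (the 350 ms outlier in the sample) does not hide an otherwise healthy path; ask for the tail percentile separately.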

Cards are sized to the depth of public detail, not the size of the vendor. This is a buyer’s comparison written from named primary sources and our own simulation work on the voice-agent runtimes Future AGI integrates with directly; it isn’t a hands-on bake-off of all eight in the same week.

1. Future AGI: voice-agent simulation, post-call scoring, and inline guardrails in one stack

Best for: teams running an AI voice-agent stack (Vapi, Retell, ElevenLabs, LiveKit, Pipecat) who need pre-launch simulation, post-call scoring at 100 percent coverage, and inline safety guardrails inside the same workflow.

Future AGI ships the eval stack as a package. The voice-relevant surface is simulate-sdk for pre-launch persona+scenario testing, ai-evaluation for post-call rubric scoring on 100 percent of traffic, traceAI for OpenTelemetry-native observability, agent-command-center for inline routing and guardrails, and the Error Feed for HDBSCAN clustering plus a Claude Sonnet 4.5 Judge agent that writes the root cause and the immediate_fix per cluster.

Rubric depth. The ai-evaluation SDK exposes 50+ pre-built evaluators including the eleven CustomerAgent templates (CustomerAgentConversationQuality, CustomerAgentLoopDetection, CustomerAgentTerminationHandling, CustomerAgentHumanEscalation, CustomerAgentQueryHandling, CustomerAgentLanguageHandling, CustomerAgentClarificationSeeking, CustomerAgentContextRetention, CustomerAgentObjectionHandling, CustomerAgentInterruptionHandling, CustomerAgentPromptConformance) plus Groundedness, ContextAdherence, AnswerRefusal, TaskCompletion, ConversationCoherence, ConversationResolution, content moderation, and audio rubrics that score against MLLMAudio for mp3, wav, ogg, m4a, aac, flac, and wma.

Voice-agent simulation. The simulate-sdk covers the pre-launch QA bar with Persona+Scenario test cases, supports OpenAI, LangChain, Gemini, and Anthropic agent wrappers, and includes voice-aware TTS/STT configuration. The TestRunner returns a TestReport with pass rate, failed scenarios, and the traces needed for Error Localization. Most teams use this to set the launch gate before any production call hits the line.

# Pre-launch simulation gate: run persona and scenario test cases against the agent.
from fi.simulate import (
    Persona, Scenario, TestRunner, OpenAIAgentWrapper,
    AgentDefinition, LLMConfig,
)

# Agent under test: model settings plus the system prompt the scenarios probe.
agent_def = AgentDefinition(
    name="collections-voice-agent",
    llm_config=LLMConfig(model="gpt-4", temperature=0.7),
    system_prompt="You are a regulated collections agent. Open with the Mini-Miranda.",
)
# Simulated callers.
personas = [Persona(name="distressed_debtor", traits={"tone": "anxious"})]
# Each scenario pairs a situation with the goals the agent has to meet.
scenarios = [
    Scenario(description="Mini-Miranda required",
             goals=["state disclosure", "verify identity"]),
    Scenario(description="Cease-and-desist request",
             goals=["acknowledge", "stop contact attempts"]),
]
wrapper = OpenAIAgentWrapper(agent_def)
# Run the simulated calls and print the pass rate used as the launch gate.
report = TestRunner(agent_wrapper=wrapper, personas=personas, scenarios=scenarios).run()
print(f"Pass rate: {report.pass_rate:.0%}")

Live guardrails. Protect Flash, documented in arXiv 2510.13351, runs at 65 ms text and 107 ms image median time-to-label. Four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) cover the regulated floors for live coaching; fi.evals.guardrails.scanners ships sub-10ms pre-filters (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner). The combined layer makes live whispers safe to surface every turn instead of every Nth.

Clustering. The Error Feed uses HDBSCAN soft-clustering over ClickHouse-stored span embeddings. A Sonnet 4.5 Judge agent investigates each cluster (30-turn budget, 8 span-tools, Haiku Chauffeur for large spans, ~90 percent prompt-cache hit) and writes a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution; 1 to 5 each), an evidence quote, an immediate_fix, and a long-term recommendation. 200 failing calls become 5 to 20 named clusters per week.
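The per-cluster output described above can be pictured as a small record. A minimal sketch, assuming a hypothetical ClusterVerdict class: only the four dimension names and the 1 to 5 range come from the description, while the class, its field layout, and the sample values are illustrative and not Future AGI's actual schema.

```python
from dataclasses import dataclass

# The four scoring dimensions named in the text, each scored 1 to 5.
SCORE_DIMENSIONS = (
    "factual_grounding",
    "privacy_and_safety",
    "instruction_adherence",
    "optimal_plan_execution",
)


@dataclass(frozen=True)
class ClusterVerdict:
    """Hypothetical record for one named cluster of failing calls."""
    cluster_name: str
    scores: dict  # dimension name -> integer score in 1..5
    evidence_quote: str
    immediate_fix: str
    long_term_recommendation: str

    def __post_init__(self):
        if set(self.scores) != set(SCORE_DIMENSIONS):
            raise ValueError("scores must cover exactly the four dimensions")
        for dimension, value in self.scores.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{dimension} must be between 1 and 5")

    def weakest_dimension(self):
        # The lowest-scoring dimension is the natural place to start triage.
        return min(SCORE_DIMENSIONS, key=lambda d: self.scores[d])


# Illustrative values only.
verdict = ClusterVerdict(
    cluster_name="disclosure skipped after transfer",
    scores={
        "factual_grounding": 4,
        "privacy_and_safety": 5,
        "instruction_adherence": 2,
        "optimal_plan_execution": 3,
    },
    evidence_quote="Let's get straight to your balance.",
    immediate_fix="Repeat the disclosure whenever a call is transferred.",
    long_term_recommendation="Add a transfer scenario to the pre-launch test set.",
)
print(verdict.cluster_name, "->", verdict.weakest_dimension())
```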

Compliance and deployment. SOC 2 Type II, HIPAA, GDPR, and CCPA per futureagi.com/trust; ISO/IEC 27001 in active audit. Agent Command Center ships as a single Go binary (Apache 2.0) for self-host, with the ML hop to api.futureagi.com for the closed-weights Protect models. Cloud at app.futureagi.com; gateway at gateway.futureagi.com/v1; self-hostable in VPC; AWS Marketplace.

Honest tradeoffs. Future AGI is newer in the CCaaS recording-export market than CallMiner or Observe.AI, so Five9, Genesys, and NICE CXone integrations go through standard exports rather than a deep native connector. The Protect Flash hop adds 65 ms to live paths, so very high-volume real-time inference runs the deterministic scanners inline and the LoRA adapters on a sampled subset (the audit log records the sampling rate). Salesforce Service Cloud Voice capture runs through traceAI auto-instrumentation rather than a Service-Cloud-native adapter.
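The split between inline deterministic scanners and sampled model checks can be sketched as a small router. This is an illustration under stated assumptions, not Future AGI's implementation: the 10 percent rate, the card-number regex, and the function names are all hypothetical. Hash-based sampling is used so the same call always lands in the same bucket, and the audit entry carries the sampling rate as the text describes.

```python
import hashlib
import re

SAMPLE_RATE = 0.10  # hypothetical share of calls sent to the slower model check

# Hypothetical deterministic pre-filter: 13 to 16 digits with optional separators.
CARD_LIKE = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def deterministic_scan(text):
    """Cheap check that runs inline on every turn."""
    return bool(CARD_LIKE.search(text))


def in_sample(call_id, rate=SAMPLE_RATE):
    """Stable sampling: hash the call id into a 64-bit bucket and compare to the rate."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


def route_turn(call_id, text, audit_log):
    flagged = deterministic_scan(text)
    sampled = in_sample(call_id)
    # Record the sampling rate so an auditor can see how much traffic was model-checked.
    audit_log.append({
        "call_id": call_id,
        "flagged_inline": flagged,
        "sent_to_model": sampled,
        "sample_rate": SAMPLE_RATE,
    })
    return flagged, sampled


audit_log = []
print(route_turn("call-001", "my card is 4111 1111 1111 1111", audit_log))
print(audit_log[0])
```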

Best fit signal. The stack already includes Vapi, Retell, ElevenLabs, LiveKit, or Pipecat and the QA program needs both the pre-launch simulation gate and post-call scoring with native voice observability.

2. Cresta: built for live coaching

Best for: sales, retention, and collections programs that need real-time agent assist on every call, with measurable conversion or compliance lift attached.

Cresta is the canonical live-coaching product. The architecture watches the transcript stream, runs a behavior-detection model, and surfaces a whisper, a knowledge-base card, or a supervisor barge-in cue while the agent is still on the line. The product is judged on real-time signal-to-noise: how often the whisper is right, how often it’s ignored, and how the conversion or compliance number moves on the cohort that gets the suggestion versus the holdout that doesn’t.
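The cohort-versus-holdout comparison is simple enough to compute independently of any vendor dashboard. A minimal sketch, assuming a plain two-proportion z-test; the function name and the sample counts are hypothetical and this is not a description of how Cresta reports lift.

```python
from math import sqrt


def conversion_lift(treated_conv, treated_n, holdout_conv, holdout_n):
    """Compare the cohort that saw whispers against the holdout that did not."""
    p_treated = treated_conv / treated_n
    p_holdout = holdout_conv / holdout_n
    # Pooled rate under the null hypothesis that the whisper changes nothing.
    pooled = (treated_conv + holdout_conv) / (treated_n + holdout_n)
    std_err = sqrt(pooled * (1 - pooled) * (1 / treated_n + 1 / holdout_n))
    return {
        "absolute_lift": p_treated - p_holdout,
        "relative_lift": (p_treated - p_holdout) / p_holdout,
        "z_score": (p_treated - p_holdout) / std_err,
    }


# Hypothetical month: 5,000 calls in each arm.
result = conversion_lift(treated_conv=660, treated_n=5000,
                         holdout_conv=600, holdout_n=5000)
print({name: round(value, 3) for name, value in result.items()})
```

A z-score below roughly 2 means the observed lift is not yet distinguishable from noise at the usual 95 percent level, which is a reason to keep the holdout running rather than declare the whisper a win.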

Rubric depth. Centered on real-time intent and behavior models tuned to the customer’s playbook. Custom behaviors get authored during onboarding, then refined by the in-product agent. The product ships fewer plug-and-play rubrics than the post-call-first vendors because the design priority is the live model.

Real-time latency. Sub-300ms median for the whisper path on most CCaaS audio paths.

Recording-ingestion surface. Strong on traditional CCaaS (Five9, Genesys, Amazon Connect, NICE CXone). AI voice-agent runtime coverage is improving but is not the architectural priority.

Clustering and compliance. Named-issue clustering is part of the product; the post-call audit trail is real but leaner than Observe.AI’s. SOC 2 Type II, HIPAA, and PCI redaction support.

Honest tradeoff. If the program needs the audit trail to defend in a quarterly compliance review, the post-call surface falls short of Observe.AI or CallMiner. Cresta sits on top of an existing CCaaS recording pipeline rather than replacing it.

Best fit signal. A specific revenue or compliance number the live coaching is meant to move. If the answer to “what’s the metric” is “QA score went up,” Cresta’s leverage is wasted; the right buyer has a conversion rate, a save rate, or a compliance disclosure rate the live whisper is supposed to lift on a measurable cohort.

3. Observe.AI: end-to-end agent assist plus post-call QA

Best for: mid-to-large contact centers on traditional CCaaS that want one product for live coaching, post-call scoring, and the audit dashboard.

Observe.AI sits in the middle of the live-versus-post-call split. The product does both, with a deep enough audit trail that compliance-driven buyers treat it as a primary QA platform, and a real-time agent-assist surface that competes with Cresta for the live whisper use case. Customer logos lean enterprise.

Rubric depth. Auto-QA covers the standard families (resolution, tone, compliance, AHT) with a rubric-authoring agent for brand-specific policy. The library is well-stocked on the regulated verticals (healthcare, financial services, insurance, retail).

Real-time latency. Sub-500ms median for the live agent assist. Not the fastest, but consistent enough that the live coaching feels usable on the same call.

Recording-ingestion surface. Native adapters for Five9, Genesys, Amazon Connect, NICE CXone, Talkdesk. AI voice-agent runtimes are not the design center.

Clustering and compliance. Named-issue clustering with an auto-written root-cause column. SOC 2 Type II, HIPAA, GDPR, PCI, ISO 27001.

Honest tradeoff. Heavier rollout than Level AI or Tethr. Pricing leans enterprise. If the team is already on Talkdesk Copilot or NICE Enlighten and looking for incremental coaching value rather than a full QA replacement, Observe.AI is overscoped.

Best fit signal. A QA team that today runs 2 percent manual sampling on a traditional CCaaS stack and wants one vendor to take it to 100 percent post-call coverage with live agent assist on the same workflow.

4. Level AI: automated QA for the mid-market

Best for: mid-market QA programs that want no-code rubric authoring on top of an existing CCaaS stack, with measurable time-to-value in weeks rather than quarters.

Level AI’s wedge is the in-product agent that authors rubrics from natural-language descriptions of the brand’s policy. The QA lead types “the agent must mention the 60-day money-back guarantee on any retention call before offering a discount” and the agent produces a runnable evaluator the team can ship that afternoon.
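To make "runnable evaluator" concrete, here is what a deterministic version of that example policy could look like. This is a hand-written sketch, not Level AI's generated output: the regexes and the function name are hypothetical, and a production evaluator would more likely be a model-scored rubric than a pattern match.

```python
import re

# Hypothetical patterns for the policy in the example sentence.
GUARANTEE = re.compile(r"60[- ]day money[- ]back guarantee", re.IGNORECASE)
DISCOUNT = re.compile(r"\bdiscount\b|\bpercent off\b|% off\b", re.IGNORECASE)


def guarantee_before_discount(agent_turns):
    """Pass when no discount is offered, or the guarantee is mentioned first."""
    for turn in agent_turns:
        guarantee = GUARANTEE.search(turn)
        discount = DISCOUNT.search(turn)
        if guarantee and (not discount or guarantee.start() < discount.start()):
            return True
        if discount:
            # A discount was offered before any mention of the guarantee.
            return False
    return True


compliant_call = [
    "I understand. You're still covered by our 60-day money-back guarantee.",
    "If it helps, I can also apply a 20% off discount.",
]
violating_call = [
    "I can offer a discount today.",
    "There is also a 60-day money-back guarantee.",
]
print(guarantee_before_discount(compliant_call))
print(guarantee_before_discount(violating_call))
```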

Rubric depth. Strong on the universal rubrics (resolution, tone, compliance, audio quality) with a fast custom-evaluator authoring path. The library is narrower than Observe.AI’s or CallMiner’s, but the authoring agent closes the gap quickly for brand-specific policies.

Real-time latency. Real-time scoring is supported but less of an architectural priority than the post-call audit. Cresta is the better fit for live whisper-first programs.

Recording-ingestion surface. Five9, Genesys, Amazon Connect, NICE CXone, Talkdesk through standard CCaaS recording exports.

Clustering and compliance. Named-issue clustering with a coaching-backlog workflow that pairs the cluster with the agents who triggered it. SOC 2 Type II, HIPAA, GDPR, PCI-aware redaction.

Honest tradeoff. Less depth on real-time agent assist than Cresta or Observe.AI. The fast authoring is a starting point, not a calibration substitute.

Best fit signal. A mid-market QA program that needs to ship the first AI scorecard in eight weeks, with a small QA team taking ownership of the rubric set without engineering support.

5. CallMiner: the speech-analytics incumbent

Best for: large enterprises with the biggest call corpora who need both speech analytics (keyword and phrase queries) and modern AI-judge scoring on the same dataset.

CallMiner is the speech-analytics incumbent. The origin is keyword spotting and phrase-cloud analytics over very large call corpora, and the product still has the deepest query interface in the category. The 2024-2026 work has been to add LLM-judge scoring on top.

Rubric depth. Classic phrase libraries (compliance scripts, mandatory disclosures, banned phrases) are deep and well-maintained for the regulated verticals. The LLM-judge rubric library is growing.

Real-time latency. Real-time scoring exists. The design center is post-call analytics over very large corpora.

Recording-ingestion surface. Strong on every traditional CCaaS recording source. AI voice-agent runtimes are not part of the picture.

Clustering and compliance. Pattern clustering is part of the analytics surface; named-issue cluster cards with auto-written root cause are newer territory. SOC 2 Type II, HIPAA, GDPR, PCI; long track record with regulated buyers.

Honest tradeoff. The UI shows its origin. Teams hiring a new QA lead in 2026 sometimes prefer the newer scoring-first products because the learning curve is shorter.

Best fit signal. A regulated enterprise with one of the largest call corpora in the industry and a compliance team that already runs phrase queries every week. Extend the deployment with the AI-judge layer; don’t tear it out.

6. Salesforce Service Cloud Einstein: for Salesforce shops

Best for: contact centers already on Salesforce Service Cloud Voice that want call summaries, post-call scoring, and supervisor insights inside the same console the agents already live in.

Salesforce’s Einstein layer covers Service Cloud Voice with call summaries, post-call scoring, next-best-action prompts, and a supervisor view that sits alongside the omnichannel queues. The wedge is the console. Agents don’t switch tools. Supervisors don’t switch tools. QA data lives next to the case data.

Rubric depth. Einstein’s library covers resolution, sentiment, compliance, and AHT-adjacent rubrics with custom evaluators authored in Prompt Builder and Agentforce. Smaller than the QA-first vendors but tightly integrated with the rest of the Salesforce data model.

Real-time latency. Real-time prompts in the agent console, tuned for the embedded experience rather than the absolute floor.

Recording-ingestion surface. Service Cloud Voice is the design center. Other CCaaS stacks need an explicit recording export.

Clustering and compliance. Tableau Pulse-class insights surface clusters in the supervisor view. Salesforce platform-level certifications; HIPAA-eligible deployment available; Shield encryption for PCI scopes.

Honest tradeoff. Locked to the Salesforce ecosystem. A shop running a Vapi or Retell voice-agent front end into a non-Salesforce backend cannot use Einstein as the QA layer without rewriting the call path.

Best fit signal. Service Cloud Voice is the deployed stack. The QA spend stays inside the Salesforce contract and the data stays inside the customer record.

7. Zoom Revenue Accelerator: for Zoom-native shops

Best for: Zoom Phone-native sales and customer-success teams that want call summaries, deal-stage scoring, and coaching tied to the revenue funnel.

Zoom Revenue Accelerator (formerly Zoom IQ for Sales) is the conversational-intelligence layer for Zoom Phone and Zoom Meetings. The product is built around the sales motion: deal scoring, talk-time ratio, next-step capture, and the coaching surface that ties it back to the rep. Classical QA (compliance, mandatory disclosures, FCR) is a secondary use case.

Rubric depth. Strong on sales-relevant rubrics (talk ratio, monologue length, next-step extraction, deal-risk scoring). Growing on classic QA rubrics. Custom evaluators through Zoom’s AI Companion.
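Talk ratio and monologue length are the easiest of these rubrics to reproduce from diarized segments. A minimal sketch, assuming non-overlapping, time-ordered segments; the definitions here are a common-sense reading of the terms, not Zoom's documented formulas, and the segment data is hypothetical.

```python
def talk_metrics(segments, rep="rep"):
    """segments: time-ordered list of (speaker, start_s, end_s) tuples."""
    total = sum(end - start for _, start, end in segments)
    rep_time = sum(end - start for speaker, start, end in segments if speaker == rep)
    longest = 0.0
    run = 0.0
    for speaker, start, end in segments:
        # Extend the current monologue while the rep keeps talking, else reset it.
        run = run + (end - start) if speaker == rep else 0.0
        longest = max(longest, run)
    return {
        "talk_ratio": rep_time / total if total else 0.0,
        "longest_monologue_s": longest,
    }


# Hypothetical two-minute call.
segments = [
    ("rep", 0, 30),
    ("customer", 30, 50),
    ("rep", 50, 80),
    ("rep", 80, 110),
    ("customer", 110, 120),
]
metrics = talk_metrics(segments)
print(metrics)
```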

Real-time latency. Real-time meeting insights work well inside Zoom Meetings. Zoom Phone latency is adequate for post-call use cases.

Recording-ingestion surface. Zoom Phone and Zoom Meetings are the design center. Other call sources require export.

Clustering and compliance. Coaching insights cluster around the rep and the deal stage. QA-failure clustering by rubric is less developed. SOC 2, HIPAA-eligible deployment, regional data residency.

Honest tradeoff. This is a revenue product, not a QA-first product. Buyers looking for FDCPA or HIPAA-grade audit will find the library narrower than CallMiner, Observe.AI, or Level AI.

Best fit signal. A Zoom-native sales or customer-success team where the call coaching needs to plug into Zoom-resident deal data plus existing CRM integrations.

8. Tethr: post-call analytics with an effort-score angle

Best for: mid-market post-call audit programs that want a customer-effort-score lens on every call without standing up the full enterprise speech-analytics suite.

Tethr’s wedge is the effort-score frame. Every call gets scored on how hard the customer worked to get the outcome, with a coaching backlog tied to the effort drivers. Lighter than CallMiner, narrower than Observe.AI, with a clear point of view on what to measure first.

Rubric depth. Effort-score and CES-adjacent rubrics are deep. The general-purpose QA library is narrower; custom evaluators are authored against the effort framework.

Real-time latency. Post-call is the design center. Live agent assist is not the primary surface.

Recording-ingestion surface. Genesys, Talkdesk, NICE CXone through standard CCaaS recording exports.

Clustering and compliance. Effort-driver clustering with named-issue cards. SOC 2 Type II, HIPAA-eligible deployment available.

Honest tradeoff. Smaller eval-template catalog than CallMiner, Observe.AI, or Level AI. AI voice-agent runtimes (Vapi, Retell, LiveKit) are not part of the design center.

Best fit signal. A mid-market QA program already aligned on customer effort as the leading indicator, with a recording-only audit cadence and no immediate need for live coaching.

The buying decision in five questions

Walk every vendor through these five before signing.

Live or post-call first? Pick the job that drives the bigger metric this quarter. If it’s compliance disclosure presence on collections, live. If it’s coaching backlog and audit defensibility, post-call. The cadence and the metric decide which vendor leads the rollout.
What’s the recording surface? Inventory every call source: traditional CCaaS (Five9, Genesys, Amazon Connect, NICE CXone, Talkdesk), AI voice-agent runtimes (Vapi, Retell, ElevenLabs, LiveKit, Pipecat), Zoom Phone, Service Cloud Voice. The shortlist drops to whichever vendors have native adapters for the dominant surface.
What rubric set already exists? A QA team that has been scoring calls manually for years has a printed rubric. The right vendor is the one that replicates it in a week, not the one that asks the team to start over.
What’s the calibration cadence? A vendor that ships a rubric library without a named calibration cycle is selling theater. Ask how rubric drift gets detected, who owns the cycle, and what the disagreement-rate threshold is.
Where does live data go? Ask where PII redaction happens (recording, transcript, score, storage), what the compliance posture is for the regulated verticals, and what audit trail gets produced when a regulator asks for it. The answer is a docs URL plus a named officer, not a slide.

Common rollout failures

Three failure modes to call out before the first launch.

Calibration theater. The team ships the vendor’s library on day one without a calibration sample, the scorecard moves fast, and three months in the QA lead realizes the rubric is over-confident on tone and under-confident on resolution. Fix: pull 50 to 100 calls per month for human review, compare against the AI score per rubric, treat any disagreement above 5 percent as a rubric defect. Above 15 percent means the calibration cycle has to compress.
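The calibration fix reduces to a per-rubric disagreement rate checked against the two thresholds. A minimal sketch, assuming each human review is recorded as a pass/fail pair; the tuple layout, function names, and sample data are hypothetical.

```python
from collections import defaultdict

DEFECT_THRESHOLD = 0.05    # above this, treat the rubric as defective
COMPRESS_THRESHOLD = 0.15  # above this, shorten the calibration cycle


def disagreement_by_rubric(reviews):
    """reviews: iterable of (rubric, human_pass, ai_pass) tuples."""
    totals = defaultdict(int)
    disagreements = defaultdict(int)
    for rubric, human_pass, ai_pass in reviews:
        totals[rubric] += 1
        if human_pass != ai_pass:
            disagreements[rubric] += 1
    return {rubric: disagreements[rubric] / totals[rubric] for rubric in totals}


def triage(rate):
    if rate > COMPRESS_THRESHOLD:
        return "compress calibration cycle"
    if rate > DEFECT_THRESHOLD:
        return "rubric defect"
    return "ok"


# Hypothetical monthly sample: 50 human-reviewed calls per rubric.
reviews = (
    [("tone", True, True)] * 40 + [("tone", True, False)] * 10
    + [("resolution", True, True)] * 46 + [("resolution", False, True)] * 4
    + [("compliance", True, True)] * 50
)
rates = disagreement_by_rubric(reviews)
for rubric, rate in rates.items():
    print(f"{rubric}: {rate:.0%} disagreement -> {triage(rate)}")
```

Scoring disagreement per rubric rather than per call matters: an overall rate can look healthy while one rubric, such as tone in the sample, is badly off.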

Cluster numbness. 200 failures become 20 clusters, then 50, then 100, and the QA team treats them like an alert feed instead of a backlog. Fix: cap active clusters at 10 to 15, retire a cluster when its volume drops below threshold for two consecutive weeks, treat new clusters as a prioritized queue.
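The cap-and-retire rule can be written down as a weekly backlog update. A minimal sketch: the cap of 15 and the two-week rule come from the fix above, while the volume threshold of 5, the function name, and the cluster names are hypothetical.

```python
MAX_ACTIVE = 15          # upper end of the 10 to 15 cap
RETIRE_BELOW = 5         # hypothetical weekly volume threshold
RETIRE_AFTER_WEEKS = 2   # consecutive quiet weeks before retiring


def update_backlog(weekly_volumes):
    """weekly_volumes: cluster name -> list of weekly failing-call counts, oldest first."""
    active = {}
    retired = []
    for name, volumes in weekly_volumes.items():
        recent = volumes[-RETIRE_AFTER_WEEKS:]
        quiet = len(recent) == RETIRE_AFTER_WEEKS and all(v < RETIRE_BELOW for v in recent)
        if quiet:
            retired.append(name)
        else:
            active[name] = volumes[-1]
    # Prioritize by the latest weekly volume; anything past the cap waits in overflow.
    ranked = sorted(active, key=active.get, reverse=True)
    return ranked[:MAX_ACTIVE], ranked[MAX_ACTIVE:], retired


weekly_volumes = {
    "missed_disclosure": [30, 42],
    "hold_time_apology": [6, 3, 2],
    "wrong_refund_policy": [12, 4],
    "new_cluster": [9],
}
queue, overflow, retired = update_backlog(weekly_volumes)
print("queue:", queue)
print("overflow:", overflow)
print("retired:", retired)
```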

Live coaching as noise. The whisper fires on every call, agents start ignoring it, conversion lift never shows up in the cohort analysis. Fix: start with a tight set of deterministic floors (mandatory disclosure presence, banned phrases, prompt injection, PII) running live, run the full rubric suite post-call on 100 percent of traffic, widen the live surface only after the deterministic floor has earned the agents’ trust.
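A deterministic floor is deliberately boring: pattern checks that are either right or obviously wrong. A minimal sketch covering two of the floors named above (mandatory disclosure presence and banned phrases); the phrases, the two-turn deadline, and the function name are illustrative assumptions, not legal guidance or any vendor's rule set.

```python
import re

# Illustrative patterns only; a real program takes these from its compliance team.
DISCLOSURE = re.compile(r"attempt to collect a debt", re.IGNORECASE)
BANNED = [
    re.compile(r"\bguaranteed approval\b", re.IGNORECASE),
    re.compile(r"\bwe will sue\b", re.IGNORECASE),
]
DISCLOSURE_DEADLINE_TURNS = 2  # hypothetical: disclosure due within two agent turns


def live_floor_alerts(agent_turns):
    """Return the alerts to whisper after the latest agent turn."""
    alerts = []
    opening = agent_turns[:DISCLOSURE_DEADLINE_TURNS]
    deadline_passed = len(agent_turns) >= DISCLOSURE_DEADLINE_TURNS
    if deadline_passed and not any(DISCLOSURE.search(turn) for turn in opening):
        alerts.append("missing_disclosure")
    latest = agent_turns[-1] if agent_turns else ""
    if any(pattern.search(latest) for pattern in BANNED):
        alerts.append("banned_phrase")
    return alerts


print(live_floor_alerts(["Hi, this is Sam from Acme.", "Let's talk about your balance."]))
print(live_floor_alerts(["This is an attempt to collect a debt.", "Pay today or we will sue you."]))
```

Because these checks fire only on unambiguous conditions, agents see few false whispers, which is the trust the wider live surface later depends on.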

How Future AGI fits when both jobs are in scope

The thesis at the top of this guide is that transcript scoring and live coaching are two jobs with different cadences and different metrics. Future AGI’s design choice is to ship both inside one workflow. The simulate-sdk sets the pre-launch QA bar. Protect Flash and the inline fi.evals.guardrails.scanners handle the live-coaching floor at 65ms median time-to-label per arXiv 2510.13351. ai-evaluation scores 100 percent of post-call traffic against the CustomerAgent templates plus resolution, groundedness, and content-moderation rubrics. traceAI captures the spans across Vapi, Retell, ElevenLabs, LiveKit, and Pipecat. The Error Feed clusters failing calls with HDBSCAN plus a Sonnet 4.5 Judge agent that writes the immediate_fix. Agent Command Center enforces per-virtual-key tool permissions and audit-log spans on every write action.

Ready to score your first call? Wire the ai-evaluation SDK against an MLLMAudio test case this afternoon, run the eleven CustomerAgent templates plus ConversationResolution and TaskCompletion on a 100-call backfill, then add traceAI when production starts asking questions the CI gate missed.

Related reading

7 Best AI Voice Agent Platforms for Inbound Customer Support in 2026: the runtime field the QA pipeline sits on top of.
IVR Modernization: Migrate Legacy IVR to AI Voice Agents in 2026: the cutover playbook the QA pipeline scores.
How to Implement Voice AI Observability in 2026: the trace layer behind every QA score.
Best Voice Agent Monitoring Platforms in 2026: the monitoring side of the same trace stream.
Voice AI Simulation: Cekura, Hamming, Bluejay, Coval: the pre-launch simulation market this guide complements.
Custom Voice Evaluator Authoring in 2026: writing brand-specific rubrics the way Level AI’s and Future AGI’s authoring agents do it.

Sources and references

arXiv 2510.13351, Future AGI Protect model family (arxiv.org/abs/2510.13351)
Future AGI trust page (futureagi.com/trust)
traceAI repository (github.com/future-agi/traceAI)
ai-evaluation repository (github.com/future-agi/ai-evaluation)
Future AGI documentation (docs.futureagi.com/docs/command-center)
Cresta, Observe.AI, Level AI, CallMiner, Salesforce Service Cloud Einstein, Zoom Revenue Accelerator, Tethr: vendor documentation and customer case studies (referenced in plain text per editorial policy).

Frequently asked questions

What is AI call center QA software in 2026?

AI call center QA software does two different jobs that buyers keep conflating. The first job is transcript scoring after the call: every recording lands in an ASR pipeline, the transcript runs through a rubric set (resolution, tone, compliance, audio quality), failures cluster into named issues, and the result feeds coaching and audit. The second job is live coaching during the call: a low-latency model watches the transcript stream and surfaces a whisper, a knowledge-base card, or a supervisor barge-in cue while the agent is still on the line. Most teams need both, but the cadence is different. Transcript scoring runs hourly or daily and is judged on coverage and calibration. Live coaching runs every turn and is judged on latency and signal-to-noise. The mistake is buying a transcript-scorer and expecting it to coach in real time, or buying a live-coaching product and expecting the historical audit trail to fall out of it.

Why does 2 percent manual sampling still break in 2026?

Three reasons that compound. Volume: a mid-sized inbound contact center handles 200,000 to 500,000 calls a month, so even 2 percent is 4,000 to 10,000 reviews, which at six minutes per call is 400 to 1,000 review hours a month, or roughly four full-time QA reviewers at the midpoint. Coverage: random sampling at 2 percent misses entire failure modes that occur in the 98 percent unreviewed slice, so a compliance violation that hits 0.5 percent of traffic produces 1,000 to 2,500 incidents a month, about 98 percent of them never reviewed. Drift: two QA reviewers scoring the same agent over the same month can produce reports that recommend opposite coaching directions, and calibration sessions slow the drift but never close it. AI scoring at 100 percent coverage removes the sampling problem, the coverage problem, and the inter-rater drift problem in one pass, in exchange for a different set of problems (rubric drift, false agreement, and cluster noise) that this guide treats explicitly.
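The volume and coverage arithmetic can be reproduced directly. A minimal sketch, assuming 160 working hours per full-time reviewer per month (an assumption, not a figure from this guide) and that a random sample catches violations at the sampling rate; integer arithmetic is used to keep the counts exact.

```python
REVIEW_MINUTES_PER_CALL = 6
SAMPLE_PERCENT = 2          # manual sampling rate
VIOLATION_PER_MILLE = 5     # 0.5 percent of traffic
FTE_HOURS_PER_MONTH = 160   # assumption: one full-time reviewer


def sampling_load(calls_per_month):
    reviews = calls_per_month * SAMPLE_PERCENT // 100
    review_hours = reviews * REVIEW_MINUTES_PER_CALL / 60
    violations = calls_per_month * VIOLATION_PER_MILLE // 1000
    # Violations that fall in the unsampled 98 percent are never heard.
    unreviewed = violations * (100 - SAMPLE_PERCENT) // 100
    return {
        "reviews": reviews,
        "review_hours": review_hours,
        "reviewer_fte": review_hours / FTE_HOURS_PER_MONTH,
        "violations": violations,
        "violations_unreviewed": unreviewed,
    }


low = sampling_load(200_000)
high = sampling_load(500_000)
print("200k calls:", low)
print("500k calls:", high)
```

Under these assumptions the reviewer load spans roughly 2.5 to 6 full-time equivalents across the volume range, and nearly every violation goes unheard.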

What's the difference between speech analytics and AI call center QA?

Speech analytics was a 2010s category built on keyword spotting and phrase clouds. The output was a heatmap of words like 'cancel' and 'manager' over a corpus of calls. The QA team still had to listen to the calls. AI call center QA in 2026 takes the transcript and scores it against named rubrics the way an LLM-as-judge scores a chatbot output. Resolution, tone, compliance, and audio quality each get a score with a reason. Failures cluster into named issues with auto-written root cause and fix. Speech analytics is a query interface over calls; AI QA is a scoring and clustering pipeline over calls. CallMiner and NICE were the kings of speech analytics and are now adding LLM scoring on top; the newer entrants (Cresta, Level AI, Observe.AI, Future AGI) start from scoring and clustering and bolt on the query interface.

Which rubrics map to traditional QA scorecards?

Seven rubric families cover most QA scorecards. Resolution: conversation_resolution and task_completion map to first-call resolution and containment. Tone: is_polite, is_helpful, is_concise map to CSAT and average handle time. Compliance: content_moderation and is_compliant map to script adherence and regulated-language coverage (FDCPA, TCPA, Reg E for collections; HIPAA-eligible phrasing for healthcare). Audio: audio_transcription scores ASR fidelity (WER-class), audio_quality scores TTS or recording quality. Brand: a custom evaluator scores against the brand-voice fingerprint and the approved-claims list. Safety: PII detection and prompt-injection screen on inputs. Coaching: a clarification-handling and loop-detection rubric for AI agents and a turn-taking rubric for human agents. The Future AGI ai-evaluation SDK ships all of these as Apache 2.0 EvalTemplate classes; the named eleven CustomerAgent templates plus Groundedness, AnswerRefusal, TaskCompletion, ConversationResolution, ContentModeration, and the audio rubrics map one-to-one against this list.

When does live coaching beat post-call scoring, and when does it lose?

Live coaching wins when the cost of getting one call wrong is high and the intervention is cheap. Collections and fraud are the canonical examples: missing a Mini-Miranda disclosure on the first turn of a collections call costs more than the whisper that prevents it, and the supervisor barge-in costs almost nothing. Healthcare scheduling, lending pre-qualification, and any insurance complaint flow have the same shape. Live coaching loses when the rubric needs the whole conversation to score honestly. Resolution, completeness, and most empathy rubrics need the closing turn before the score is meaningful, so trying to whisper 'be more empathetic' at turn three reads as noise. The practical pattern in 2026: run a tight set of deterministic floors live (PII, prompt injection, banned phrases, mandatory disclosure presence) and run the full rubric suite post-call on 100 percent of traffic. Cresta is built for the live path. CallMiner and Tethr are built for the post-call path. Observe.AI, Level AI, and Future AGI ship both paths in one product.

Which tool ships native voice observability for Vapi, Retell, and LiveKit?

Future AGI ships native voice observability for Vapi, Retell, and LiveKit with no SDK swap: provider API key plus Assistant ID gives auto call log capture, separate assistant and customer audio download, auto transcripts, and the full eval engine; the same surface covers ElevenLabs Agents and Pipecat through traceAI auto-instrumentation. Observe.AI and Level AI focus on traditional contact-center recording sources (Five9, Genesys, Amazon Connect, NICE CXone) and require integration work for the voice-agent runtimes. Cresta is built for traditional CCaaS with growing voice-agent coverage. Salesforce Einstein scores calls that land in Service Cloud Voice; Zoom Revenue Accelerator scores Zoom Phone calls. For a stack that mixes a Vapi or Retell front-end with a legacy contact center, the eval engine has to bridge both call surfaces, and Future AGI is the cleanest path in 2026 because the same MLLMAudio test case ingests Vapi traces and Five9 recordings into the same rubric set.

Can I run AI QA on existing human-agent recordings?

Yes, and most teams should start there because the human-agent volume is still the larger share. Future AGI's MLLMAudio test case accepts mp3, wav, ogg, m4a, aac, flac, and wma from local paths or signed URLs, transcribes through the platform's ASR, and runs the full rubric suite against the transcript. CallMiner, Observe.AI, Level AI, and Tethr all ingest recording exports from the major CCaaS platforms (Five9, Genesys, Amazon Connect, NICE CXone, Talkdesk) and apply the same scoring pipeline. The architecture is identical to AI-agent QA: trace (recording metadata plus transcript), score (named rubrics), cluster (named issues), coach (cluster-driven plus score-driven plus calibration). Run a 30-day backfill of historical calls before going live so the rubric calibration is honest before any agent is scored on it.
