Anatomy of a Voice Agent Analytics Dashboard: A 2026 Engineering Walkthrough

Engineering walkthrough of a voice agent analytics dashboard: per-call detail drawer with 5 panels, aggregate SLO grid with 3 tiers, span/eval/tag data flow, and the production-to-simulation closed loop.

May 7, 2026

voice-ai 2026 observability dashboards

A voice agent analytics dashboard is two surfaces, not one. There’s the per-call detail view an engineer opens to debug a specific failure, and there’s the aggregate analytics view a reliability lead opens to see how the fleet is doing this week. Most teams build both, often in the wrong order, and end up with a fleet view that nobody trusts because the underlying per-call story is incomplete. This post is the engineering walkthrough we use for production voice dashboards in 2026. It covers the actual panel structure, the actual span attributes, the actual eval IDs, the actual data flow, and the production-to-simulation loop that closes around the dashboard.

Treat it as a build spec. Every component referenced here is shipped surface in Future AGI Observe, ai-evaluation, the Agent Command Center, or the simulation product. Where you build vs. where you configure is called out explicitly in each section.

What “voice agent analytics dashboard” actually means

Most posts on this topic conflate two distinct surfaces and end up describing neither well. Pull them apart first.

Per-call detail view. Opens when an engineer clicks one row in the call log. The entire context for one conversation lives here: the audio (assistant and customer as separate channels), the transcript, the span tree, the eval scores, and a link out to any Error Feed cluster the call belongs to. This is the debugging surface. It is rendered on demand for one trace at a time.

Aggregate analytics view. Opens as the landing page of the project. Tells the reliability lead whether the fleet is healthy this hour, this day, this week. SLO gauges across the top, SLI trends in the middle, business KPIs at the bottom, and the Error Feed sidebar listing the active failure clusters by severity. This is the prioritization surface. It is rendered from aggregations across thousands of traces.

The detail view answers “why did this call fail.” The aggregate view answers “which failure pattern is worth my time today.” Build them separately, render them from the same underlying data, and link the two views at every appropriate jump-point. The rest of this post walks through each view in panel-level detail.

The per-call detail view: five panels

The per-call view is the Call Details Drawer that opens when you click a row in the project’s call log. The Drawer has been through several iterations on Future AGI Observe; a major revamp, shipped per the release notes, added the multi-channel audio player and granular audio field selection for evals. The five panels below are the reference structure we recommend regardless of which backend hosts the trace data.

[Image: FAGI Observe call detail drawer]

Panel 1: audio

Two waveforms, rendered as separate channels.

Assistant audio. The TTS output stream the user actually heard.
Customer audio. The microphone input stream from the user side.

Each waveform is independently playable and independently downloadable. Future AGI supports export in four formats: Caller Audio, Agent Audio, Mono Audio, and Stereo Audio. The stereo file is the engineering favorite because barge-in debugging needs both streams synchronized.

The separation matters because the bugs you find here are timing bugs. If the customer audio shows speech starting at t=1.4s and the assistant audio shows TTS still playing at t=1.6s, you have a barge-in detection gap of at least 200ms. If you only have a single mono mixdown, you cannot see that gap. If the customer waveform shows clipping at the start of every turn, you have a mic-gating issue on the device side, not a model issue. None of this is visible in the transcript.
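
A minimal sketch of how that gap can be measured once each channel has been reduced to activity intervals. The interval representation and the `overlap_ms` helper are illustrative assumptions, not part of any Future AGI API, and the sketch assumes intervals within a single channel do not overlap each other.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ms, end_ms) of detected activity on one channel


def overlap_ms(assistant: List[Interval], customer: List[Interval]) -> int:
    """Total time, in ms, during which both channels are active at once."""
    total = 0
    for a_start, a_end in assistant:
        for c_start, c_end in customer:
            # Two intervals overlap between the later start and the earlier end.
            total += max(0, min(a_end, c_end) - max(a_start, c_start))
    return total


# Hypothetical intervals: TTS runs until 1.6s, customer speech starts at 1.4s.
assistant_activity = [(0, 1600)]
customer_activity = [(1400, 3000)]
barge_in_gap = overlap_ms(assistant_activity, customer_activity)
```

With a mono mixdown the two interval lists collapse into one, which is why the overlap cannot be recovered after the fact.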

A scrubber on either waveform syncs with the transcript panel and the span tree panel below. Codec artifacts and packet loss render as visible gaps. Most teams who switch from a transcript-only debugging flow report catching one class of bug per week they previously missed entirely.

Panel 2: transcript

Turn-by-turn alternating transcript, color-coded by speaker (user / assistant), each turn timestamped with the absolute call clock and an offset from the previous turn.

Per-turn metadata renders inline on hover:

STT confidence score (from the provider, or from your audio_transcription evaluator if you re-score).
Detected language tag (relevant for multilingual deployments).
Any words flagged as low-confidence (highlighted in amber).

Mistranscriptions flagged by audio_transcription (eval_id 73) highlight in red on the user side. Brand-name mispronunciations flagged by audio_quality (eval_id 75) highlight in red on the assistant side. The reviewer can label a span manually and that label feeds back into the evaluator calibration loop.

This panel catches one specific class of bug well: ASR errors that snowball into LLM misintents. If the user said “cancel” but the STT returned “council,” the next turn’s prompt sees “council,” the LLM produces a nonsensical reply, and the user hangs up. Without the transcript-with-confidence panel, that bug looks like an LLM bug rather than an ASR bug, and you spend a day chasing the wrong layer.

Panel 3: span tree

The full OpenInference span hierarchy for the conversation, rendered as a collapsible tree.

session (root)
├── turn[0]
│   ├── stt.transcribe          duration=240ms  status=ok
│   ├── llm.generate            duration=420ms  ttft=180ms  status=ok
│   │   ├── tool.lookup_account duration=160ms  status=ok
│   │   └── tool.fetch_balance  duration=180ms  status=ok
│   └── tts.synthesize          duration=200ms  status=ok
├── turn[1]
│   ├── stt.transcribe          duration=310ms  status=ok
│   ├── llm.generate            duration=510ms  ttft=210ms  status=ok
│   └── tts.synthesize          duration=240ms  status=ok
...

Each span node shows: name, duration, status, and the most relevant attributes inline. Click any span to expand the full attribute list.

Real attributes you’ll see (these are the documented OpenInference and Future AGI span attributes that traceAI emits):

session.id, user.id, tag.tags, metadata — session-level identity and free-form metadata.
Business tag axes carried in metadata and used as filter axes downstream: customer_id, agent_version, intent, vertical, campaign_id, language.
llm.input_messages, llm.output_messages, llm.token_count.prompt, llm.token_count.completion, llm.invocation_parameters — the LLM call.
tool.name, tool.parameters, tool.result — for tool spans.
Voice-specific fields under the gen_ai.voice.* namespace: gen_ai.voice.stt.provider, gen_ai.voice.stt.language, gen_ai.voice.tts.provider, gen_ai.voice.tts.voice_id, gen_ai.voice.latency.transcriber_avg_ms, gen_ai.voice.latency.voice_avg_ms, gen_ai.voice.latency.turn_avg_ms, gen_ai.voice.latency.ttfb_ms, gen_ai.voice.interruptions.user_count, gen_ai.voice.interruptions.assistant_count, and the recording references gen_ai.voice.recording.assistant_url, gen_ai.voice.recording.customer_url, gen_ai.voice.recording.stereo_url.
Eval results written back by ai-evaluation as gen_ai.evaluation.name, gen_ai.evaluation.score.value, gen_ai.evaluation.score.label, gen_ai.evaluation.explanation, and gen_ai.evaluation.target_span_id. The Drawer renders them inline next to the span the eval was attached to.

Per-stage latency is the most-used surface on this panel. Voice budgets are tight (most teams target sub-800ms end-to-end per turn), so you need to see at a glance whether the 750ms turn was 300ms STT + 350ms LLM + 100ms TTS (acceptable, mostly LLM-bound) or 100ms STT + 200ms LLM + 450ms TTS (unacceptable, TTS regression). The tree makes that obvious. A flat span list does not.
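
The same reading can be done programmatically. The sketch below takes the two 750ms breakdowns from the paragraph above and reports which stage dominates each; the `dominant_stage` helper and the dictionary shape are assumptions for illustration.

```python
from typing import Dict, Tuple


def dominant_stage(stage_ms: Dict[str, int]) -> Tuple[str, float]:
    """Return the slowest stage and its share of the total turn latency."""
    total = sum(stage_ms.values())
    stage = max(stage_ms, key=lambda name: stage_ms[name])
    return stage, stage_ms[stage] / total


# Both turns total 750ms, but the budget is spent in different places.
llm_bound = {"stt": 300, "llm": 350, "tts": 100}
tts_regression = {"stt": 100, "llm": 200, "tts": 450}

print(dominant_stage(llm_bound))
print(dominant_stage(tts_regression))
```

The end-to-end number is identical for both turns; only the per-stage split shows that the second one is a TTS problem.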

Panel 4: eval scores

A small table, one row per evaluator that ran against this call.

eval_id	name	score	passed	reasoning
73	audio_transcription	0.91	yes	clear audio, minor noise in turn 4
75	audio_quality	0.84	yes	one brand mispronunciation on turn 2
1	conversation_coherence	0.95	yes	flow followed expected pattern
2	conversation_resolution	0.78	yes	resolved with one clarification turn
99	task_completion	1.00	yes	refund processed
project user_eval_id (UUID)	tone_compliance (custom)	0.62	no	informal phrasing on turn 6

Each row is expandable to show the reasoning string from the evaluator. Failed scores highlight in red. Custom rubrics show alongside the built-ins.

ai-evaluation ships 56 built-in eval templates. The voice-relevant defaults we recommend on every captured call are eval_id 73 (audio_transcription), 75 (audio_quality), 1 (conversation_coherence), 2 (conversation_resolution), and 99 (task_completion). Custom evaluators are authored through the in-product agent that reads your traces and proposes rubric logic, or you write them directly via the programmatic eval API.

Crucially, the scores are not recomputed on dashboard load. They live as derived attributes on the trace, written back by the async eval pipeline. The Drawer reads them. This is what keeps the per-call view fast even with a dozen rubrics attached.

Panel 5: Error Feed cluster link

A single card at the bottom of the Drawer. If this call clustered into a known failure pattern, the card shows:

The cluster name (e.g. “STT mistranscription on Indian English accents, turn 1”).
The cluster category (factual grounding / tool crash / broken workflow / safety violation / reasoning gap).
The cluster’s auto-written root cause.
The auto-written quick fix to ship today.
The auto-written long-term recommendation.
A jump-out link to the full cluster view with all sibling traces.

The Error Feed is zero-config in Observe. It clusters incoming traces continuously and writes the analysis per cluster. The dashboard sidebar surfaces the top clusters by severity (call count multiplied by severity weighting). Each cluster tracks its volume trend (rising / steady / falling) so you can tell which fires are getting worse.
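
To make the ranking concrete, here is a rough sketch of the score and the trend label. The `Cluster` shape, the weight values, and the 10% tolerance band are all assumptions for illustration; the actual weighting inside Observe is not documented here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    name: str
    call_count: int         # calls in the ranking window
    severity_weight: float  # hypothetical per-category weight


def priority(cluster: Cluster) -> float:
    """Ranking score as described in the text: call count x severity weighting."""
    return cluster.call_count * cluster.severity_weight


def rank(clusters: List[Cluster]) -> List[Cluster]:
    """Highest-priority cluster first."""
    return sorted(clusters, key=priority, reverse=True)


def trend(previous_count: int, current_count: int, tolerance: float = 0.1) -> str:
    """Label volume as rising / steady / falling; the 10% band is an assumption."""
    if previous_count == 0:
        return "rising" if current_count > 0 else "steady"
    change = (current_count - previous_count) / previous_count
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "falling"
    return "steady"


# Hypothetical clusters: fewer calls can still rank higher with a heavier weight.
clusters = [
    Cluster("STT mistranscription, turn 1", call_count=50, severity_weight=1.0),
    Cluster("tool crash on lookup", call_count=12, severity_weight=5.0),
]
ranked = rank(clusters)
```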

This is the panel that converts debugging from “look at one broken call” to “look at the pattern this call belongs to.” When 50 calls share a root cause, the on-call engineer should be paged on the pattern, not 50 times on the individual calls.

The aggregate analytics view: SLO grid + drill-down

The aggregate view is the project landing page. Three tiers stacked vertically, an Error Feed sidebar on the left, and a filter rail across the top.

Tier 1: the SLO gauges (top row)

Three gauges, no more. These are the metrics that page on-call when they go red.

Gauge	Metric	Threshold	Source
1	P95 turn latency	Under 800ms	Aggregate over turn span duration
2	task_completion pass rate	Above 90%	Aggregate over `task_completion` (eval_id 99) pass/fail results
3	conversation_resolution rate	Above 85% (or your business floor)	Aggregate over `conversation_resolution` (eval_id 2) pass/fail results

Each gauge: current value, threshold line, sparkline of the last hour, burn-rate number against the monthly SLO budget. Red, yellow, green coloring against the threshold. A red on tier 1 pages on-call.
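
For readers who have not worked with burn rates, the sketch below uses the standard definition from SRE practice: the observed failure rate divided by the failure rate the SLO allows. The function name and the sample counts are illustrative assumptions.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed failure rate divided by the failure rate the SLO allows.

    1.0 means the error budget is being spent exactly on schedule;
    above 1.0 the budget runs out before the SLO window ends.
    """
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # allowed failure fraction
    observed = failed / total        # failure fraction in the window
    return observed / error_budget


# Hypothetical hour: 1,000 calls, 150 failed task_completion, 90% SLO target.
rate = burn_rate(failed=150, total=1000, slo_target=0.90)
```

In this example the fleet is failing at 15% against a 10% allowance, so the budget is burning roughly one and a half times faster than planned.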

We hold tier 1 to three gauges deliberately. The on-call engineer has to be able to glance at the top of the page and immediately know “fleet healthy” or “fleet on fire.” Four gauges already take two glances. Seven is the noise threshold where people start ignoring half of them.

Tier 2: SLI trend charts (middle band)

The underlying signals. Four charts, each ~25% of horizontal width.

WER trend over 7 and 30 days, segmented by accent if you tag for it. A creeping WER on a specific accent cluster is the canary for ASR drift.
Intent confidence distribution, rendered as a stacked histogram. The shape of the distribution matters; mean intent confidence can hold steady while the left tail fattens, which is the actual problem.
audio_quality (eval_id 75) drift, daily mean and 5th percentile. The 5th percentile catches TTS regressions before the mean moves.
Barge-in failure rate per 1000 calls. Spikes here usually correlate with TTS chunk-size or VAD threshold changes on the runtime side.

These charts answer “what’s underlying the tier 1 health.” When task_completion pass rate dips on tier 1, you scroll to tier 2 and see which SLI is moving.
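
A small sketch of why the charts above track a low percentile alongside the mean. It uses a nearest-rank percentile over made-up audio_quality scores; the `percentile` helper and the data are assumptions, and the same helper works for a P95 latency gauge.

```python
import math
from statistics import mean
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical audio_quality scores for one day: mostly healthy, two bad calls.
scores = [0.90] * 18 + [0.40, 0.35]
daily_mean = mean(scores)
daily_p5 = percentile(scores, 5)

print(f"mean={daily_mean:.4f} p5={daily_p5:.2f}")
```

Two bad calls out of twenty barely move the mean, while the 5th percentile drops to the worst score in the batch.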

Tier 3: business KPI trends (bottom band)

The product-side surface. Same chart shelf as tier 2, different metrics, different audience cadence (reviewed weekly, not paged on).

AHT (average handle time), broken down by intent.
FCR proxy derived from conversation_resolution (eval_id 2) plus customer return-call signal.
Drop-off rate by turn position (heatmap; turn 1 abandonment is a different bug than turn 6 abandonment).
Escalation rate, split planned escalation vs failure-driven.
Deflection rate if your agent is a deflection layer in front of a human contact center.

Tier 3 is where the product manager spends time. The cadence is weekly. The audience is product, ops, and CX leadership. Engineering rarely opens tier 3, and that’s the right division of labor.

The Error Feed sidebar (left)

Persistent across the page. Top 5 to 10 active clusters ranked by severity (call count × severity weighting). Each card shows the cluster name, the category, the 24-hour volume, and the trend arrow.

Click any cluster card to filter the rest of the dashboard to that cluster’s traces. The aggregate view becomes a cluster-scoped view: SLO gauges show the gauge values for the cluster only, SLI charts re-aggregate to the cluster only, the trace browser (one click further in) filters to the cluster’s calls only.

This is how the on-call workflow becomes triage by cluster, not triage by alert. The cluster is the unit of work.

The filter rail (top)

Sticky filters across all of Observe (the sticky filter behavior shipped per the release notes). The filter axes that matter for voice:

agent_version — the deploy axis. Always have this on so you can A/B by deploy candidate.
customer_id — for support cases when a specific customer reports an issue.
intent — the conversation classifier output.
vertical — if you run support / sales / scheduling on the same agent definition.
campaign_id — for outbound agents.
language — for multilingual deployments.
Date range, eval-score threshold, SLO breach flag.

These are not built-in metric names; they are tag axes you populate via your agent-side metadata. The filter rail reads whatever tag keys you emit on the traces, so the discipline of consistent tagging is what unlocks slicing. If you forget to tag agent_version, you can’t A/B deploys. Tag everything, tag every span, tag every release.
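
One cheap way to enforce that discipline is a check at instrumentation time, before the trace leaves the agent. The sketch below is an assumption about how a team might do it; the required-key list and the `missing_tags` helper are hypothetical, not a Future AGI feature.

```python
from typing import Dict, List

# Tag axes named in the text; which ones are mandatory is a team decision.
REQUIRED_TAGS = ["agent_version", "customer_id", "intent", "language"]


def missing_tags(metadata: Dict[str, str], required: List[str] = REQUIRED_TAGS) -> List[str]:
    """Return the required tag keys that are absent or empty in the metadata."""
    return [key for key in required if not metadata.get(key)]


# Hypothetical session metadata with one axis forgotten.
metadata = {"agent_version": "v2.4.1", "intent": "refund", "language": "en"}
problems = missing_tags(metadata)
```

Running a check like this in CI or at session start surfaces a missing `customer_id` before it turns into an unfilterable trace.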

Data flow: how spans, evals, tags, and audio reach the dashboard

This is the part most “dashboard anatomy” posts skip and it’s the part that actually determines whether your dashboard works. Four data streams converge into the rendered panels.

Spans

Two paths, depending on whether you have code-level access to the agent runtime.

Path A: native voice observability (no SDK). This is the supported path for Vapi, Retell, and LiveKit dashboard-driven agents where you don’t ship your own runtime code. Inside Future AGI you create an Agent Definition (the multi-step UX revamped per the release notes: Basic Information → Configuration → Behaviour), enable observability, and paste the provider API key plus the Assistant ID. Future AGI pulls call logs, separate assistant and customer recordings, transcripts, and runs the configured eval set. No SDK code is required. This is the path for teams running on top of a managed voice platform.

Path B: traceAI SDK (code-level access). This is the path for Pipecat, LiveKit when you self-host the agent stack, or any custom voice runtime where you control the code. traceAI is the OpenInference-compatible auto-instrumentation library, Apache 2.0, with 30+ documented integrations across Python and TypeScript including dedicated traceAI-pipecat and traceai-livekit packages. Initialize traceAI once, and spans flow over OTLP/gRPC or OTLP/HTTP into Future AGI Observe, indexed by session and tags.

LiveKit example:

from fi_instrumentation.otel import register, ProjectType
from traceai_livekit import enable_http_attribute_mapping

# Register the tracer provider once at startup for the "voice-prod" Observe project.
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="voice-prod",
    set_global_tracer_provider=True,
)
# Enable the traceai_livekit attribute mapping.
enable_http_attribute_mapping()

Pipecat example:

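from fi_instrumentation.otel import register, ProjectType
from traceai_pipecat import enable_http_attribute_mapping
from pipecat.pipeline.task import PipelineTask

# Register the tracer provider once at startup for the "voice-prod" Observe project.
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="voice-prod",
    set_global_tracer_provider=True,
)
# Enable the traceai_pipecat attribute mapping.
enable_http_attribute_mapping()

# `pipeline` and `session_id` are built elsewhere in your application code.
task = PipelineTask(
    pipeline,
    enable_tracing=True,        # emit spans for the pipeline
    enable_turn_tracking=True,  # group spans by conversation turn
    additional_span_attributes={"session.id": session_id},
)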

Both paths land in the same Observe project shape. Same Drawer, same span tree, same eval engine.

Eval scores

ai-evaluation runs asynchronously per span and per session. The result is written back to the trace as the documented gen_ai.evaluation.* attribute group (gen_ai.evaluation.name, gen_ai.evaluation.score.value, gen_ai.evaluation.score.label, gen_ai.evaluation.explanation, gen_ai.evaluation.target_span_id). eval_id stays as the template identifier the query layer joins on. The dashboard joins these attributes at query time, which is why the per-call Drawer renders all five panels in one round-trip.

For the per-call view this matters because you don’t want the engineer to wait while six rubrics recompute. For the aggregate view this matters because the SLO gauges aggregate over the same attribute that the Drawer reads, so the gauge value and the row-level value are guaranteed consistent. No drift between “what the chart says” and “what I see when I click in.”
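
The consistency argument is easier to see in code. In the sketch below, the row-level view and the gauge-level view are two functions over the same list of eval records. The record shape (one flat dict per result) and the "pass"/"fail" label values are assumptions; only the `gen_ai.evaluation.*` attribute names come from the text.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical eval records, one dict per written-back result.
eval_records = [
    {"gen_ai.evaluation.name": "task_completion",
     "gen_ai.evaluation.score.value": 1.00,
     "gen_ai.evaluation.score.label": "pass",
     "gen_ai.evaluation.target_span_id": "span-a"},
    {"gen_ai.evaluation.name": "task_completion",
     "gen_ai.evaluation.score.value": 0.20,
     "gen_ai.evaluation.score.label": "fail",
     "gen_ai.evaluation.target_span_id": "span-b"},
    {"gen_ai.evaluation.name": "audio_quality",
     "gen_ai.evaluation.score.value": 0.84,
     "gen_ai.evaluation.score.label": "pass",
     "gen_ai.evaluation.target_span_id": "span-a"},
]


def evals_by_span(records: List[dict]) -> Dict[str, List[dict]]:
    """Row-level view: group eval results under the span they target."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        grouped[record["gen_ai.evaluation.target_span_id"]].append(record)
    return dict(grouped)


def pass_rate(records: List[dict], eval_name: str) -> float:
    """Gauge-level view: share of results for one evaluator labeled 'pass'."""
    matching = [r for r in records if r["gen_ai.evaluation.name"] == eval_name]
    if not matching:
        return 0.0
    passed = sum(r["gen_ai.evaluation.score.label"] == "pass" for r in matching)
    return passed / len(matching)
```

Because both functions read the same stored records, the gauge cannot disagree with what an engineer sees after clicking into a call.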

Audio

On the native path (Vapi/Retell/LiveKit), Future AGI pulls the recording from the provider API automatically when the call ends. The Call Details Drawer renders both channels in the multi-channel audio player. Granular audio field selection on the eval side lets you score against just the assistant channel, just the customer channel, or the mixed stereo file, depending on the rubric.

On the SDK path, emit recording URL metadata using the documented voice fields (gen_ai.voice.recording.assistant_url, gen_ai.voice.recording.customer_url, and gen_ai.voice.recording.stereo_url), or, if your runtime only carries one recording reference, the generic audio.url, audio.mime_type, and audio.transcript fields. From the Drawer’s perspective the result is identical: the audio panel renders once the recording reference is present on the trace.
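
A rough sketch of the fallback described above, as a reader of the trace might implement it. The attribute names are the ones listed in the text; the `recording_refs` helper, the preference for the voice-specific fields, and the example URLs are assumptions.

```python
from typing import Dict

VOICE_KEYS = [
    "gen_ai.voice.recording.stereo_url",
    "gen_ai.voice.recording.assistant_url",
    "gen_ai.voice.recording.customer_url",
]


def recording_refs(attributes: Dict[str, str]) -> Dict[str, str]:
    """Collect recording references, preferring the voice-specific fields.

    Falls back to the generic audio.url only when no voice field is set.
    """
    found = {key: attributes[key] for key in VOICE_KEYS if attributes.get(key)}
    if not found and attributes.get("audio.url"):
        found["audio.url"] = attributes["audio.url"]
    return found


# Hypothetical traces: one with per-channel URLs, one with a single recording.
two_channel = {
    "gen_ai.voice.recording.assistant_url": "https://example.com/a.wav",
    "gen_ai.voice.recording.customer_url": "https://example.com/c.wav",
    "audio.url": "https://example.com/mixed.wav",
}
single = {"audio.url": "https://example.com/mixed.wav", "audio.mime_type": "audio/wav"}
```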

What FAGI ships out of the box vs what you configure

Pulling this together explicitly because it’s where most engineering planning conversations get muddled.

Ships out of the box, zero configuration:

Per-call detail Drawer with all five panels (audio, transcript, span tree, eval scores, Error Feed link).
OpenInference span ingestion over OTLP, indexed by session and tags.
Eval-to-span join (the score becomes a derived attribute on the trace automatically).
Error Feed clustering with auto-written root cause, supporting span evidence, quick fix, and long-term recommendation per cluster.
Native voice observability for Vapi, Retell, and LiveKit via Agent Definition (provider API key plus Assistant ID, no SDK).
Multi-channel audio player in the Call Details Drawer with stereo/mono/assistant/customer export.
Sticky filters in Observe, multi-step Agent Definition UX, Show Reasoning column in Simulate, Error Localization in simulation runs.
56 built-in eval templates including the voice-relevant defaults.

You configure:

SLO thresholds for your traffic profile (P95 budget, task_completion floor, conversation_resolution floor).
Alert routing (where the page goes on tier-1 red).
Tag taxonomy (which keys you emit on traces).
Custom rubrics beyond the 56 built-ins, authored through the in-product agent or the programmatic API.
Filter axes shown by default on the aggregate view.

You do not build:

Span ingestion, OTel exporter wiring, OpenInference compatibility layer.
Eval-to-trace joining infrastructure.
Clustering algorithm or root-cause writer.
Audio capture and pull from supported providers.
The Drawer renderer.

This split matters because building any of the “do not build” items in-house is a multi-quarter project. Configuring the thresholds and the rubrics is a one-week project for a single engineer.

The closed loop: dashboard to simulation to dashboard

The dashboard is not a terminal surface. It is a node in a loop. Production observability feeds production-derived simulation, simulation feeds prompt optimization, optimization feeds the next deploy, the next deploy gets observed.

The loop runs like this:

Error Feed surfaces a failure cluster on the aggregate view. Say it’s “STT mistranscription on Indian English accents, turn 1, cluster volume rising over 48 hours.”
Engineer drills into one example call from the cluster, opens the Drawer, plays the audio, reads the transcript, sees audio_transcription (eval_id 73) score of 0.42 on turn 1.
Workflow Builder generates a scenario tree from this conversation. Branch visibility shows the expected conversation paths plus edge cases the conversation could have taken but didn’t. Add scenarios manually, generate via AI, or import from an existing dataset. The persona library (18 pre-built personas plus unlimited custom-authored personas with name, description, gender, age range bracket, location across US/Canada/UK/Australia/India, personality traits, communication style, accent, conversation speed, background noise, multilingual support) provides the caller-side variation.
Run Tests via the 4-step wizard: Config → Scenarios → Eval → Execute. Pick the same eval templates the production calls were scored with so the simulation result is directly comparable. Error Localization pinpoints the exact failing turn in each simulation run.
Prompt optimization reads the simulation dataset and runs one of the six optimizers in agent-opt: Bayesian Search (smart few-shot, bayesian-search), Meta-Prompt (bilevel optimization, arXiv 2505.09666), ProTeGi (textual gradients + beam search + critique), GEPA (Genetic-Pareto reflective evolution, arXiv 2507.19457), Random Search (baseline, arXiv 2311.09569), and PromptWizard (production-grade prompt optimization). Optimization runs through the Dataset UI or the agent-opt Python library; the dashboard surfaces optimizer iterations, candidate prompts, and final scores.
Deploy the winning prompt candidate with a new agent_version tag. The aggregate view filters to the new version. The cluster in the Error Feed either resolves (volume drops to zero) or it doesn’t (in which case you have a new candidate hypothesis to test and the loop continues).
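
The final check in the loop, whether the cluster resolved on the new version, can be sketched as a count per `agent_version`. The trace shape, the `cluster_id` key, and the version strings below are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def cluster_volume_by_version(traces: List[dict], cluster_id: str) -> Dict[str, int]:
    """Count calls in one cluster per agent_version tag."""
    counts = Counter(
        t["agent_version"] for t in traces if t.get("cluster_id") == cluster_id
    )
    return dict(counts)


def resolved_on(volumes: Dict[str, int], version: str) -> bool:
    """The loop treats the cluster as resolved when its volume drops to zero."""
    return volumes.get(version, 0) == 0


# Hypothetical traces after deploying v2.4.2.
traces = [
    {"cluster_id": "c1", "agent_version": "v2.4.1"},
    {"cluster_id": "c1", "agent_version": "v2.4.1"},
    {"cluster_id": "c2", "agent_version": "v2.4.2"},
]
volumes = cluster_volume_by_version(traces, "c1")
```

A zero count only means something once the new version has handled a comparable volume of calls, so pair this check with the total call count for that version.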

The optimization step is always explicit and human-gated. agent-opt does not rewrite production prompts behind your back. You run a job, you review the candidates, you ship the one you trust. The loop framing here is the production-to-simulation pipeline that the team operates, not an autonomous self-rewriting system.

This loop is what separates a working voice analytics setup from a chart-junk one. The dashboard is the trigger. The simulation suite is the workshop. The optimizer is the assistant. The deploy is the commit. The next observation is the test.

How the dashboard evolves over the agent lifecycle

Both views change shape as the agent matures and traffic accumulates.

At launch (week 0). The aggregate view’s tier 1 is the whole story. Tier 2 charts are still building enough signal to be useful. Tier 3 KPIs are noisy because the sample size is small. The Error Feed sidebar is empty because clustering needs volume. The per-call Drawer is the workhorse: engineers open it on every call to validate the agent’s behavior.

Weeks 2 to 4. Error Feed clusters start appearing. Tier 2 charts have enough density to see daily and weekly seasonality. The trace browser (one click in from the aggregate view) becomes the main workflow surface. The Drawer is still opened many times a day but increasingly from cluster drill-downs rather than from random sampling.

Months 2 to 3. Dashboard is in mature shape. Tier 1 is the heartbeat, reviewed continuously. Tier 2 is the troubleshooting surface, reviewed on demand. Tier 3 is the weekly product review surface. The Error Feed sidebar has 20+ named clusters at varying severity. The Drawer is opened dozens of times a day, mostly via cluster jumps. The closed loop into simulation runs every deploy.

Past 6 months. The dashboard increasingly feeds the optimization layer. The same trace data that the Drawer reads is now also being read by agent-opt against accumulated failure patterns. The optimizer proposes candidate prompts; the simulation suite scores them against the persona library; the dashboard validates the winning candidate in production after deploy. The library of regression scenarios grows continuously with each new cluster the team resolves.

Common dashboard anti-patterns to avoid

Twelve-tile tier-1 grids. Three. Always three. P95 turn latency, task_completion pass rate, conversation_resolution rate. Add a fourth and on-call starts to ignore it.

Average latency as the headline number. Always P95 or P99. Average hides the long tail; the long tail is who hangs up.

Transcript-only Drawer. Half the bugs in voice systems are timing bugs that only show in the waveform. The two-channel audio panel is non-negotiable.

One dashboard for engineering and product. Tier 1 + tier 2 are engineering. Tier 3 is product. Trying to put both audiences on one ungrouped surface ends with one team muting the other team’s alerts.

Static thresholds for everything. Hard limits (5xx rate, uptime) take static thresholds. Traffic-dependent signals (latency at peak, WER under new accent mix) need anomaly detection. Pages should fire on burn rate, not on instantaneous threshold breach.

Untagged traces. If you don’t tag agent_version on every span, you cannot A/B deploys. If you don’t tag intent, you cannot slice tier 2. Tagging discipline at instrumentation time is what makes the dashboard usable downstream.

Alert without analysis. A page that says “P95 latency breached” is worse than a page that says “P95 latency breached, root cause is tool call X regressed after deploy v2.4.1, quick fix is rollback.” The Error Feed’s auto-written analysis is what makes the on-call workflow sustainable.

Where FAGI falls short

Optimization is explicit and human-gated. agent-opt with the six optimizers (Bayesian Search, Meta-Prompt, ProTeGi, GEPA, Random Search, PromptWizard) reads from the same trace and dataset the dashboard renders, but it never rewrites production prompts without an explicit run and a human approval gate. Teams who want autonomous self-rewriting behavior in production won’t get it here. That is intentional; the loop is operator-controlled.

Persona library depth is Cekura’s home turf. For pre-launch synthetic caller coverage at extreme variation counts, Cekura has a deeper catalog in some narrow accent segments. Future AGI ships 18 pre-built personas plus unlimited custom-authored personas with full configurability (gender, age range, location, accent, conversation speed, background noise, multilingual), but if raw persona count is the one thing the buyer is optimizing for, pair the two.

BYOC routing for the most regulated workloads. SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 are certified per futureagi.com/trust. ISO 42001 is in progress. For workloads that require an in-VPC deployment with a customer-owned audit boundary, BYOC is supported, but the BYOC config has more knobs than the managed SaaS path and the rollout takes longer.

Putting the architecture into practice

If you’re starting from a blank backend, build the per-call detail Drawer first, the aggregate analytics view second. The reason is that the Drawer is what your engineers debug from on day one. The aggregate view needs traffic to be useful; the Drawer needs only one trace.

Inside the Drawer, keep the visible panel order as audio, transcript, span tree, eval scores, Error Feed link. If you are implementing from scratch, instrument spans early so the tree is correct from day one, but do not change the user-facing panel order. Inside the aggregate view, build tier 1 first (three gauges), Error Feed sidebar second (it’s already produced by the clustering layer), tier 2 SLIs third, tier 3 KPIs fourth.

The full production instrumentation procedure that produces the trace data lives in the production monitoring playbook. The 12 metrics that populate the SLO grid and the KPI tabs are detailed in the conversation monitoring metrics post. The end-to-end implementation walkthrough including SDK setup is in the voice AI observability implementation guide. The platform-by-platform comparison of where each layer comes from is in the voice agent monitoring platforms roundup.

Sources and references

Voice observability quickstart: docs.futureagi.com — voice quickstart
Voice observability overview: docs.futureagi.com — voice overview
traceAI on GitHub: github.com/future-agi/traceAI
ai-evaluation on GitHub: github.com/future-agi/ai-evaluation
agent-opt optimizer docs (Bayesian Search, Meta-Prompt, ProTeGi, GEPA, Random Search, PromptWizard): docs.futureagi.com — optimization
Error Feed and Observe docs: docs.futureagi.com/docs/observe
Future AGI Protect docs: docs.futureagi.com/docs/protect
Agent Command Center docs: docs.futureagi.com/docs/command-center
arXiv 2507.19457 (GEPA): arxiv.org/abs/2507.19457
arXiv 2505.09666 (Meta-Prompt bilevel optimization): arxiv.org/abs/2505.09666
arXiv 2311.09569 (Random Search baseline): arxiv.org/abs/2311.09569
arXiv 2510.13351 (Protect inline latency): arxiv.org/abs/2510.13351
Trust page: futureagi.com/trust
OpenInference specification: github.com/Arize-ai/openinference
Google SRE workbook on SLOs and error budgets: sre.google/workbook/implementing-slos/

Frequently asked questions

What's the difference between a per-call detail view and the aggregate analytics view?

The per-call detail view is what an engineer opens when they click one call in the call log. It contains the assistant and customer audio waveforms downloadable separately, the turn-by-turn transcript with STT confidence, the full OpenInference span tree (session to turn to STT/LLM/tool/TTS spans with per-stage latency), the eval scores attached to that call (audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion, plus any custom rubrics), and a link to the Error Feed cluster the call belongs to if one exists. The aggregate analytics view is what a reliability lead opens to see fleet health: three tier-1 SLO gauges at the top, tier-2 SLI trend charts in the middle, tier-3 KPI trends at the bottom, and the Error Feed sidebar listing the active failure clusters by severity. Most teams ship both. The detail view is where you debug. The aggregate view is where you decide what to debug next.

Which eval IDs should I run on every captured voice call?

On Future AGI the production-grade default set is eval_id 73 (audio_transcription), 75 (audio_quality), 1 (conversation_coherence), 2 (conversation_resolution), and 99 (task_completion). These cover the audio layer, the conversation flow layer, and the outcome layer. ai-evaluation ships 56 built-in templates total, so you can layer on additional rubrics like tone, faithfulness, and tool correctness, plus unlimited custom evaluators authored through the in-product agent. Scores attach to the trace as derived attributes the moment the eval finishes running, so the dashboard joins them to spans without any glue code.

How does eval-score data actually reach the dashboard?

Spans flow from traceAI (OpenInference compatible) over OTLP/gRPC or OTLP/HTTP into Future AGI Observe and get indexed by session ID and tags. ai-evaluation runs the configured rubric set asynchronously per span. The resulting scores are written back as derived attributes on the same trace, keyed by the span ID and the eval_id. The dashboard query layer joins on trace + eval at read time, which is why every panel renders without any precomputation. For Vapi, Retell, and LiveKit, the dashboard pulls audio and assistant metadata automatically once you wire the provider API key plus Assistant ID into the Agent Definition, no SDK required. For Pipecat, custom runtimes, or anything that needs full code-level instrumentation, the traceAI SDK path covers it via dedicated packages including traceAI-pipecat and traceai-livekit.
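
The read-time join described above can be sketched in a few lines of plain Python. This is an illustration only: the row shapes, field names (`trace_id`, `span_id`, `latency_ms`, `score`), and the `join_spans_with_evals` helper are assumptions made for the example, not the Observe schema or query API.

```python
from collections import defaultdict

# Hypothetical span rows after ingest (field names are assumptions).
spans = [
    {"trace_id": "t1", "span_id": "s1", "kind": "STT", "latency_ms": 120},
    {"trace_id": "t1", "span_id": "s2", "kind": "LLM", "latency_ms": 410},
]

# Hypothetical derived eval attributes, keyed by span ID and eval_id.
eval_scores = [
    {"span_id": "s1", "eval_id": 73, "name": "audio_transcription", "score": 0.94},
    {"span_id": "s2", "eval_id": 99, "name": "task_completion", "score": 1.0},
]


def join_spans_with_evals(spans, eval_scores):
    """Attach eval scores to their spans at read time; nothing is precomputed."""
    by_span = defaultdict(dict)
    for row in eval_scores:
        by_span[row["span_id"]][row["eval_id"]] = row["score"]
    # Spans without any finished eval simply get an empty "evals" dict.
    return [{**span, "evals": dict(by_span.get(span["span_id"], {}))} for span in spans]


joined = join_spans_with_evals(spans, eval_scores)
```

Because evals run asynchronously, a span can be read before its scores exist. Joining at read time means the panel shows the span immediately and fills in scores on a later read.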

What goes in the tier-1 SLO row at the top of the aggregate view?

Three gauges, no more. P95 turn latency under 800ms, task_completion pass rate above 90%, and conversation_resolution rate above your business floor (we recommend 85% for most deployments). Each gauge is green/yellow/red against the threshold with a sparkline of the last hour and a burn-rate number against the monthly SLO budget. A red on any tier-1 gauge pages on-call. Tier 2 in the middle band shows the underlying SLIs: WER trend over 7 and 30 days, intent confidence distribution, audio_quality drift, and barge-in failure rate. Tier 3 at the bottom shows business KPIs: AHT, FCR proxy, drop-off rate, escalation rate, deflection rate. Three tiers, twelve to fifteen total chart cells. Beyond that the page becomes noise.
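
A minimal sketch of the arithmetic behind one tier-1 gauge, assuming a nearest-rank P95 and the usual SRE definition of burn rate (observed error rate divided by the error rate the SLO allows). The function names and the 5% yellow band (`warn_margin`) are illustrative assumptions; the post only specifies the thresholds and the green/yellow/red states.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0
    allowed = 1.0 - slo_target
    return (bad_events / total_events) / allowed


def gauge_color(value, threshold, higher_is_better, warn_margin=0.05):
    """Green when the threshold is met, yellow within warn_margin of it, else red."""
    if higher_is_better:
        if value > threshold:
            return "green"
        return "yellow" if value >= threshold * (1 - warn_margin) else "red"
    if value < threshold:
        return "green"
    return "yellow" if value <= threshold * (1 + warn_margin) else "red"


# Example: twenty turn latencies in ms, one slow outlier.
turn_latencies_ms = [300] * 19 + [900]
latency_gauge = gauge_color(p95(turn_latencies_ms), 800, higher_is_better=False)
completion_gauge = gauge_color(0.87, 0.90, higher_is_better=True)
```

A burn rate above 1.0 means the error budget is being spent faster than the SLO window allows. That makes it a better paging signal than the raw pass rate alone.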

How does the dashboard close the loop into the simulation suite?

The Error Feed clusters production failures into named issues, each with an auto-written root cause, supporting span evidence, a quick fix, and a long-term recommendation. From any cluster you drill into one example call, then promote that call into the Workflow Builder, which auto-generates a scenario tree from the conversation. The scenario goes into the simulation suite alongside the persona library (18 pre-built personas plus unlimited custom-authored ones with configurable age range, accent, location, conversation speed, background noise, and multilingual settings), runs through the 4-step Run Tests wizard, and gets scored by the same eval engine that scored the production call. Error Localization pinpoints the exact failing turn. Run the simulation on each deploy candidate, watch the cluster resolve in production, and repeat. The optimization layer, which includes Bayesian Search, Meta-Prompt, ProTeGi, GEPA (arXiv 2507.19457), Random Search (arXiv 2311.09569), and PromptWizard, reads the same dataset and proposes prompt revisions against the failing scenarios.

What does Future AGI ship out of the box vs what do I have to build?

Out of the box: the per-call detail drawer with separate assistant/customer audio download and turn-by-turn transcript, the OpenInference span ingest, the eval-to-span join, the Error Feed clustering and root-cause writing, native voice observability for Vapi/Retell/LiveKit via Agent Definition (no SDK), the multi-channel audio player in the Call Details Drawer with stereo/mono/assistant/customer export, and the sticky filters in Observe. You configure: SLO thresholds for your traffic, alert routing, the tag taxonomy (agent_version, customer_id, vertical, campaign_id, whatever filter axes your team needs), and any custom rubrics beyond the 56 built-ins. You don't build: span ingest, OTel exporter wiring, eval-to-trace joining, clustering algorithm, or the audio capture path for the three supported providers.

What latency does the Future AGI Protect model family add to inline checks?

Sub-100ms inline per arXiv 2510.13351. Built on Gemma 3n with LoRA-trained adapters per safety dimension (Toxicity, Tone, Sexism, Prompt Injection, Data Privacy), multi-modal across text, image, and audio. ProtectFlash gives a single-call binary classifier path when you need an even tighter latency budget. The dashboard surfaces a Protect panel showing blocked-output rate per dimension and the latency of each check, so you can verify the guardrail isn't eating your voice budget on a per-turn basis.
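
A per-turn check of the kind the Protect panel supports can be sketched as below. The stage names, the assumption that stages run sequentially so their latencies add, and the `check_turn` helper are all hypothetical. Only the 800ms turn target and the sub-100ms Protect figure come from this post.

```python
TURN_BUDGET_MS = 800      # tier-1 P95 turn latency target from this post
PROTECT_CEILING_MS = 100  # "sub-100ms inline" figure for Protect checks


def check_turn(stage_latencies_ms):
    """Summarize one turn: total latency, Protect's share, and budget flags."""
    # Assumes sequential stages, so the turn latency is the plain sum.
    total = sum(stage_latencies_ms.values())
    protect = stage_latencies_ms.get("protect", 0)
    return {
        "total_ms": total,
        "protect_share": protect / total if total else 0.0,
        "protect_ok": protect < PROTECT_CEILING_MS,
        "turn_ok": total < TURN_BUDGET_MS,
    }


# Hypothetical per-stage latencies for a single turn, in milliseconds.
turn = {"stt": 140, "llm": 420, "tts": 160, "protect": 60}
report = check_turn(turn)
```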
