Anatomy of a Voice Agent Analytics Dashboard: A 2026 Engineering Walkthrough
Engineering walkthrough of a voice agent analytics dashboard: per-call detail drawer with 5 panels, aggregate SLO grid with 3 tiers, span/eval/tag data flow, and the production-to-simulation closed loop.
A voice agent analytics dashboard is two surfaces, not one. There’s the per-call detail view an engineer opens to debug a specific failure, and there’s the aggregate analytics view a reliability lead opens to see how the fleet is doing this week. Most teams build both, often in the wrong order, and end up with a fleet view that nobody trusts because the underlying per-call story is incomplete. This post is the engineering walkthrough we use for production voice dashboards in 2026. It covers the actual panel structure, the actual span attributes, the actual eval IDs, the actual data flow, and the production-to-simulation loop that closes around the dashboard.
Treat it as a build spec. Every component referenced here is a shipped surface in Future AGI Observe, ai-evaluation, the Agent Command Center, or the simulation product. Where you build vs. where you configure is called out explicitly in each section.
What “voice agent analytics dashboard” actually means
Most posts on this topic conflate two distinct surfaces and end up describing neither well. Pull them apart first.
Per-call detail view. Opens when an engineer clicks one row in the call log. The entire context for one conversation lives here: the audio (assistant and customer as separate channels), the transcript, the span tree, the eval scores, and a link out to any Error Feed cluster the call belongs to. This is the debugging surface. It is rendered on demand for one trace at a time.
Aggregate analytics view. Opens as the landing page of the project. Tells the reliability lead whether the fleet is healthy this hour, this day, this week. SLO gauges across the top, SLI trends in the middle, business KPIs at the bottom, and the Error Feed sidebar listing the active failure clusters by severity. This is the prioritization surface. It is rendered from aggregations across thousands of traces.
The detail view answers “why did this call fail.” The aggregate view answers “which failure pattern is worth my time today.” Build them separately, render them from the same underlying data, and link the two views at every appropriate jump-point. The rest of this post walks each view in panel-level detail.
The per-call detail view: five panels
The per-call view is the Call Details Drawer that opens when you click a row in the project’s call log. The Drawer has been through several iterations on Future AGI Observe; per the release notes, a major revamp added the multi-channel audio player and granular audio field selection for evals. The five panels below are the reference structure we recommend regardless of which backend hosts the trace data.
[Image: FAGI Observe call detail drawer]
Panel 1: audio
Two waveforms, rendered as separate channels.
- Assistant audio. The TTS output stream the user actually heard.
- Customer audio. The microphone input stream from the user side.
Each waveform is independently playable and independently downloadable. Future AGI supports export in four formats: Caller Audio, Agent Audio, Mono Audio, and Stereo Audio. The stereo file is the engineering favorite because barge-in debugging needs both streams synchronized.
The separation matters because the bugs you find here are timing bugs. If the customer audio shows speech starting at t=1.4s and the assistant audio shows TTS still playing at t=1.6s, you have a barge-in detection gap of 200ms. If you only have a single mono mixdown, you cannot see that gap. If the customer waveform shows clipping at the start of every turn, you have a mic-gating issue on the device side, not a model issue. None of this is visible in the transcript.
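The gap in that example is simple arithmetic once both channels carry timestamps. A minimal sketch, assuming you already have the customer speech onset and the moment the assistant audio actually stopped (the function name and its inputs are hypothetical, not part of any Future AGI API):

```python
def barge_in_gap_ms(customer_speech_start_s: float, assistant_tts_stop_s: float) -> int:
    """Milliseconds the assistant kept talking after the customer started.

    Returns 0 when the assistant had already stopped, i.e. no overlap.
    """
    overlap_s = assistant_tts_stop_s - customer_speech_start_s
    return max(0, round(overlap_s * 1000))


# Customer starts speaking at t=1.4s; assume the assistant audio stops at t=1.6s.
print(barge_in_gap_ms(1.4, 1.6))  # 200
```

A mono mixdown gives you neither timestamp reliably, which is why the calculation only works on separated channels.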
A scrubber on either waveform syncs with the transcript panel and the span tree panel below. Codec artifacts and packet loss render as visible gaps. Most teams who switch from a transcript-only debugging flow report catching one class of bug per week they previously missed entirely.
Panel 2: transcript
Turn-by-turn alternating transcript, color-coded by speaker (user / assistant), each turn timestamped with the absolute call clock and an offset from the previous turn.
Per-turn metadata renders inline on hover:
- STT confidence score (from the provider, or from your audio_transcription evaluator if you re-score).
- Detected language tag (relevant for multilingual deployments).
- Any words flagged as low-confidence (highlighted in amber).
Mistranscriptions flagged by audio_transcription (eval_id 73) highlight in red on the user side. Brand-name mispronunciations flagged by audio_quality (eval_id 75) highlight in red on the assistant side. The reviewer can label a span manually and that label feeds back into the evaluator calibration loop.
This panel catches one specific class of bug well: ASR errors that snowball into LLM misintents. If the user said “cancel” but the STT returned “council,” the next turn’s prompt sees “council,” the LLM produces a nonsensical reply, and the user hangs up. Without the transcript-with-confidence panel, that bug looks like an LLM bug rather than an ASR bug, and you spend a day chasing the wrong layer.
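The amber highlighting comes down to a threshold over per-word confidence. A minimal sketch, assuming the STT provider returns a confidence per word; the `Word` type and the 0.6 cutoff are illustrative assumptions, not documented defaults:

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # 0.0-1.0, as reported by the STT provider


def low_confidence_words(words: list[Word], threshold: float = 0.6) -> list[str]:
    """Return the words a reviewer should double-check against the audio."""
    return [w.text for w in words if w.confidence < threshold]


turn = [Word("please", 0.97), Word("council", 0.41), Word("my", 0.95), Word("order", 0.93)]
print(low_confidence_words(turn))  # ['council']
```

Here the misheard “council” is the only word that falls under the cutoff, which points the reviewer at the ASR layer before anyone touches the prompt.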
Panel 3: span tree
The full OpenInference span hierarchy for the conversation, rendered as a collapsible tree.
```
session (root)
├── turn[0]
│   ├── stt.transcribe            duration=240ms  status=ok
│   ├── llm.generate              duration=420ms  ttft=180ms  status=ok
│   │   ├── tool.lookup_account   duration=160ms  status=ok
│   │   └── tool.fetch_balance    duration=180ms  status=ok
│   └── tts.synthesize            duration=200ms  status=ok
├── turn[1]
│   ├── stt.transcribe            duration=310ms  status=ok
│   ├── llm.generate              duration=510ms  ttft=210ms  status=ok
│   └── tts.synthesize            duration=240ms  status=ok
...
```
Each span node shows: name, duration, status, and the most relevant attributes inline. Click any span to expand the full attribute list.
Real attributes you’ll see (these are the documented OpenInference and Future AGI span attributes that traceAI emits):
- session.id, user.id, tag.tags, metadata: session-level identity and free-form metadata.
- Business tag axes carried in metadata and used as filter axes downstream: customer_id, agent_version, intent, vertical, campaign_id, language.
- llm.input_messages, llm.output_messages, llm.token_count.prompt, llm.token_count.completion, llm.invocation_parameters: the LLM call.
- tool.name, tool.parameters, tool.result: for tool spans.
- Voice-specific fields under the gen_ai.voice.* namespace: gen_ai.voice.stt.provider, gen_ai.voice.stt.language, gen_ai.voice.tts.provider, gen_ai.voice.tts.voice_id, gen_ai.voice.latency.transcriber_avg_ms, gen_ai.voice.latency.voice_avg_ms, gen_ai.voice.latency.turn_avg_ms, gen_ai.voice.latency.ttfb_ms, gen_ai.voice.interruptions.user_count, gen_ai.voice.interruptions.assistant_count, and the recording references gen_ai.voice.recording.assistant_url, gen_ai.voice.recording.customer_url, gen_ai.voice.recording.stereo_url.
- Eval results written back by ai-evaluation as gen_ai.evaluation.name, gen_ai.evaluation.score.value, gen_ai.evaluation.score.label, gen_ai.evaluation.explanation, and gen_ai.evaluation.target_span_id. The Drawer renders them inline next to the span the eval was attached to.
Per-stage latency is the most-used surface on this panel. Voice budgets are tight (most teams target sub-800ms end-to-end per turn), so you need to see at a glance whether the 750ms turn was 300ms STT + 350ms LLM + 100ms TTS (acceptable, mostly LLM-bound) or 100ms STT + 200ms LLM + 450ms TTS (unacceptable, TTS regression). The tree makes that obvious. A flat span list does not.
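The two 750ms turns above can be told apart with a few lines of arithmetic. A minimal sketch, assuming per-stage durations have already been read off the span tree into a plain dict (the helper name is hypothetical):

```python
def dominant_stage(stage_ms: dict[str, int]) -> tuple[str, float]:
    """Return the slowest stage and its share of the total turn time."""
    total = sum(stage_ms.values())
    stage = max(stage_ms, key=stage_ms.get)
    return stage, stage_ms[stage] / total


llm_bound = {"stt": 300, "llm": 350, "tts": 100}
tts_regression = {"stt": 100, "llm": 200, "tts": 450}

for turn in (llm_bound, tts_regression):
    stage, share = dominant_stage(turn)
    print(f"{sum(turn.values())}ms turn, dominated by {stage} ({share:.0%})")
```

Both turns total 750ms, but the first is 47% LLM and the second is 60% TTS, which is the distinction a flat total hides.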
Panel 4: eval scores
A small table, one row per evaluator that ran against this call.
| eval_id | name | score | passed | reasoning |
|---|---|---|---|---|
| 73 | audio_transcription | 0.91 | yes | clear audio, minor noise in turn 4 |
| 75 | audio_quality | 0.84 | yes | one brand mispronunciation on turn 2 |
| 1 | conversation_coherence | 0.95 | yes | flow followed expected pattern |
| 2 | conversation_resolution | 0.78 | yes | resolved with one clarification turn |
| 99 | task_completion | 1.00 | yes | refund processed |
| project user_eval_id (UUID) | tone_compliance (custom) | 0.62 | no | informal phrasing on turn 6 |
Each row is expandable to show the reasoning string from the evaluator. Failed scores highlight in red. Custom rubrics show alongside the built-ins.
ai-evaluation ships 56 built-in eval templates. The voice-relevant defaults we recommend on every captured call are eval_id 73 (audio_transcription), 75 (audio_quality), 1 (conversation_coherence), 2 (conversation_resolution), and 99 (task_completion). Custom evaluators are authored through the in-product agent that reads your traces and proposes rubric logic, or you write them directly via the programmatic eval API.
Crucially, the scores are not recomputed on dashboard load. They live as derived attributes on the trace, written back by the async eval pipeline. The Drawer reads them. This is what keeps the per-call view fast even with a dozen rubrics attached.
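To make the read-only nature of this panel concrete, here is a minimal sketch of turning stored attributes into table rows. It assumes each eval result arrives as a flat dict keyed by the gen_ai.evaluation.* names listed earlier, and that the label is the string "pass" or "fail"; both the input shape and the label values are assumptions for illustration:

```python
def eval_rows(eval_events: list[dict]) -> list[dict]:
    """Map stored gen_ai.evaluation.* attributes to Drawer table rows.

    Nothing is scored here: the function only reads values that the
    async eval pipeline already wrote onto the trace.
    """
    rows = []
    for attrs in eval_events:
        rows.append({
            "name": attrs["gen_ai.evaluation.name"],
            "score": attrs["gen_ai.evaluation.score.value"],
            # Assumption: the label is the string "pass" or "fail".
            "passed": attrs.get("gen_ai.evaluation.score.label") == "pass",
            "reasoning": attrs.get("gen_ai.evaluation.explanation", ""),
            "span_id": attrs.get("gen_ai.evaluation.target_span_id"),
        })
    return rows


events = [
    {
        "gen_ai.evaluation.name": "task_completion",
        "gen_ai.evaluation.score.value": 1.00,
        "gen_ai.evaluation.score.label": "pass",
        "gen_ai.evaluation.explanation": "refund processed",
    },
    {
        "gen_ai.evaluation.name": "tone_compliance",
        "gen_ai.evaluation.score.value": 0.62,
        "gen_ai.evaluation.score.label": "fail",
        "gen_ai.evaluation.explanation": "informal phrasing on turn 6",
    },
]
failed = [row["name"] for row in eval_rows(events) if not row["passed"]]
print(failed)  # ['tone_compliance']
```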
Panel 5: Error Feed cluster link
A single card at the bottom of the Drawer. If this call clustered into a known failure pattern, the card shows:
- The cluster name (e.g. “STT mistranscription on Indian English accents, turn 1”).
- The cluster category (factual grounding / tool crash / broken workflow / safety violation / reasoning gap).
- The cluster’s auto-written root cause.
- The auto-written quick fix to ship today.
- The auto-written long-term recommendation.
- A jump-out link to the full cluster view with all sibling traces.
The Error Feed is zero-config in Observe. It clusters incoming traces continuously and writes the analysis per cluster. The dashboard sidebar surfaces the top clusters by severity (call count multiplied by severity weighting). Each cluster tracks its volume trend (rising / steady / falling) so you can tell which fires are getting worse.
This is the panel that converts debugging from “look at one broken call” to “look at the pattern this call belongs to.” When 50 calls share a root cause, the on-call engineer should be paged on the pattern, not 50 times on the individual calls.
The aggregate analytics view: SLO grid + drill-down
The aggregate view is the project landing page. Three tiers stacked vertically, an Error Feed sidebar on the left, and a filter rail across the top.
Tier 1: the SLO gauges (top row)
Three gauges, no more. These are the metrics that page on-call when they go red.
| Gauge | Metric | Threshold | Source |
|---|---|---|---|
| 1 | P95 turn latency | Under 800ms | Aggregate over turn span duration |
| 2 | task_completion pass rate | Above 90% | Aggregate over task_completion (eval_id 99) pass/fail results |
| 3 | conversation_resolution rate | Above 85% (or your business floor) | Aggregate over conversation_resolution (eval_id 2) pass/fail results |
Each gauge: current value, threshold line, sparkline of the last hour, burn-rate number against the monthly SLO budget. Red, yellow, green coloring against the threshold. A red on tier 1 pages on-call.
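Two numbers on each gauge are worth spelling out: the P95 itself and the burn rate. A minimal sketch, assuming a nearest-rank percentile and the usual SRE definition of burn rate (observed failure rate divided by the error budget implied by the SLO target); the function names and sample data are illustrative:

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


turn_ms = list(range(500, 900, 20))  # 20 sample turn latencies: 500ms .. 880ms
print(p95(turn_ms))  # 860

# 12 failed task_completion evals out of 100 calls against a 90% floor.
print(round(burn_rate(12, 100, 0.90), 2))  # 1.2
```

In this sample the P95 of 860ms breaches the 800ms threshold, and a burn rate of 1.2 means the budget is being consumed 20% faster than the SLO allows.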
We hold tier 1 to three gauges deliberately. The on-call engineer has to be able to glance at the top of the page and immediately know “fleet healthy” or “fleet on fire.” Four gauges already take two glances. Seven is the noise threshold where people start ignoring half of them.
Tier 2: SLI trend charts (middle band)
The underlying signals. Four charts, each ~25% of horizontal width.
- WER trend over 7 and 30 days, segmented by accent if you tag for it. A creeping WER on a specific accent cluster is the canary for ASR drift.
- Intent confidence distribution, rendered as a stacked histogram. The shape of the distribution matters; mean intent confidence can hold steady while the left tail fattens, which is the actual problem.
- audio_quality (eval_id 75) drift, daily mean and 5th percentile. The 5th percentile catches TTS regressions before the mean moves.
- Barge-in failure rate per 1000 calls. Spikes here usually correlate with TTS chunk-size or VAD threshold changes on the runtime side.
These charts answer “what’s underlying the tier 1 health.” When task_completion pass rate dips on tier 1, you scroll to tier 2 and see which SLI is moving.
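The claim that a low percentile moves before the mean is easy to check on toy numbers. A minimal sketch with made-up audio_quality scores, using a nearest-rank percentile (the data and the helper are illustrative only):

```python
from statistics import mean


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


baseline = [0.90] * 20
# Two calls regress badly while the other eighteen improve slightly.
regressed = [0.92] * 18 + [0.50, 0.55]

print(mean(baseline), round(mean(regressed), 4))            # 0.9 0.8805
print(percentile(baseline, 5), percentile(regressed, 5))    # 0.9 0.5
```

The daily mean shifts by about 0.02, which is easy to dismiss as noise, while the 5th percentile drops by 0.40.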
Tier 3: business KPI trends (bottom band)
The product-side surface. Same chart shelf as tier 2, different metrics, different audience cadence (reviewed weekly, not paged on).
- AHT (average handle time), broken by intent.
- FCR proxy derived from conversation_resolution (eval_id 2) plus customer return-call signal.
- Drop-off rate by turn position (heatmap; turn 1 abandonment is a different bug than turn 6 abandonment).
- Escalation rate, split planned escalation vs failure-driven.
- Deflection rate if your agent is a deflection layer in front of a human contact center.
Tier 3 is where the product manager spends time. The cadence is weekly. The audience is product, ops, and CX leadership. Engineering rarely opens tier 3, and that’s the right division of labor.
The Error Feed sidebar (left rail)
Persistent across the page. Top 5 to 10 active clusters ranked by severity (call count × severity weighting). Each card shows the cluster name, the category, the 24-hour volume, and the trend arrow.
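The ranking rule is a single multiplication followed by a sort. A minimal sketch, assuming a per-cluster severity weight is available; the `Cluster` type, the weights, and the sample clusters are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    calls_24h: int
    severity_weight: float  # hypothetical per-category weight


def rank_clusters(clusters: list[Cluster], limit: int = 10) -> list[Cluster]:
    """Order clusters by call count x severity weight, highest first."""
    ranked = sorted(clusters, key=lambda c: c.calls_24h * c.severity_weight, reverse=True)
    return ranked[:limit]


clusters = [
    Cluster("STT mistranscription, turn 1", calls_24h=50, severity_weight=1.0),
    Cluster("tool crash on lookup_account", calls_24h=12, severity_weight=5.0),
    Cluster("informal tone", calls_24h=80, severity_weight=0.5),
]
print([c.name for c in rank_clusters(clusters, limit=2)])
# ['tool crash on lookup_account', 'STT mistranscription, turn 1']
```

Note that the highest-volume cluster ranks last here: weighting is what keeps a noisy but low-impact pattern from crowding out a rarer, more damaging one.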
Click any cluster card to filter the rest of the dashboard to that cluster’s traces. The aggregate view becomes a cluster-scoped view: SLO gauges show the gauge values for the cluster only, SLI charts re-aggregate to the cluster only, the trace browser (one click further in) filters to the cluster’s calls only.
This is how the on-call workflow becomes triage by cluster, not triage by alert. The cluster is the unit of work.
The filter rail (top)
Sticky filters across all of Observe (the sticky filter behavior shipped per the release notes). The filter axes that matter for voice:
- agent_version: the deploy axis. Always have this on so you can A/B by deploy candidate.
- customer_id: for support cases when a specific customer reports an issue.
- intent: the conversation classifier output.
- vertical: if you run support / sales / scheduling on the same agent definition.
- campaign_id: for outbound agents.
- language: for multilingual deployments.
- Date range, eval-score threshold, SLO breach flag.
These are not built-in metric names; they are tag axes you populate via your agent-side metadata. The filter rail reads whatever tag keys you emit on the traces, so the discipline of consistent tagging is what unlocks slicing. If you forget to tag agent_version, you can’t A/B deploys. Tag everything, tag every span, tag every release.
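What “A/B by deploy” means in practice is grouping the same pass/fail signal by a tag value. A minimal sketch over a hypothetical in-memory trace shape (a `tags` dict plus a boolean `task_completion_passed` field, neither of which is a documented schema):

```python
from collections import defaultdict


def pass_rate_by_tag(traces: list[dict], tag: str) -> dict[str, float]:
    """Group traces by a tag value and compute the share that passed."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for trace in traces:
        value = trace["tags"].get(tag)
        if value is None:
            continue  # untagged traces cannot be sliced
        totals[value] += 1
        passes[value] += int(trace["task_completion_passed"])
    return {value: passes[value] / totals[value] for value in totals}


traces = [
    {"tags": {"agent_version": "v2.4.0"}, "task_completion_passed": True},
    {"tags": {"agent_version": "v2.4.0"}, "task_completion_passed": True},
    {"tags": {"agent_version": "v2.4.1"}, "task_completion_passed": True},
    {"tags": {"agent_version": "v2.4.1"}, "task_completion_passed": False},
    {"tags": {}, "task_completion_passed": False},  # forgot to tag the deploy
]
print(pass_rate_by_tag(traces, "agent_version"))  # {'v2.4.0': 1.0, 'v2.4.1': 0.5}
```

The untagged trace silently drops out of the comparison, which is exactly the failure mode the tagging discipline is meant to prevent.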
Data flow: how spans, evals, tags, and audio reach the dashboard
This is the part most “dashboard anatomy” posts skip and it’s the part that actually determines whether your dashboard works. Four data streams converge into the rendered panels.
Spans
Two paths, depending on whether you have code-level access to the agent runtime.
Path A: native voice observability (no SDK). This is the supported path for Vapi, Retell, and LiveKit dashboard-driven agents where you don’t ship your own runtime code. Inside Future AGI you create an Agent Definition (the multi-step UX revamped per the release notes: Basic Information → Configuration → Behaviour), enable observability, and paste the provider API key plus the Assistant ID. Future AGI pulls call logs, separate assistant and customer recordings, transcripts, and runs the configured eval set. No SDK code is required. This is the path for teams running on top of a managed voice platform.
Path B: traceAI SDK (code-level access). This is the path for Pipecat, LiveKit when you self-host the agent stack, or any custom voice runtime where you control the code. traceAI is the OpenInference-compatible auto-instrumentation library, Apache 2.0, with 30+ documented integrations across Python and TypeScript including dedicated traceAI-pipecat and traceai-livekit packages. Initialize traceAI once, and spans flow over OTLP/gRPC or OTLP/HTTP into Future AGI Observe, indexed by session and tags.
LiveKit example:
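```python
from fi_instrumentation.otel import register, ProjectType
from traceai_livekit import enable_http_attribute_mapping

# Register a tracer provider for the "voice-prod" Observe project
# and install it as the global provider.
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="voice-prod",
    set_global_tracer_provider=True,
)

# Turn on the HTTP attribute mapping provided by traceai_livekit.
enable_http_attribute_mapping()
```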
Pipecat example:
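```python
from fi_instrumentation.otel import register, ProjectType
from traceai_pipecat import enable_http_attribute_mapping
from pipecat.pipeline.task import PipelineTask

# Register a tracer provider for the "voice-prod" Observe project
# and install it as the global provider.
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="voice-prod",
    set_global_tracer_provider=True,
)

# Turn on the HTTP attribute mapping provided by traceai_pipecat.
enable_http_attribute_mapping()

# `pipeline` and `session_id` come from your own Pipecat application code.
task = PipelineTask(
    pipeline,
    enable_tracing=True,
    enable_turn_tracking=True,
    # Stamp every span with the session so the Drawer can group turns.
    additional_span_attributes={"session.id": session_id},
)
```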
Both paths land in the same Observe project shape. Same Drawer, same span tree, same eval engine.
Eval scores
ai-evaluation runs asynchronously per span and per session. The result is written back to the trace as the documented gen_ai.evaluation.* attribute group (gen_ai.evaluation.name, gen_ai.evaluation.score.value, gen_ai.evaluation.score.label, gen_ai.evaluation.explanation, gen_ai.evaluation.target_span_id). eval_id stays as the template identifier the query layer joins on. The dashboard joins these attributes at query time, which is why the per-call Drawer renders all five panels in one round-trip.
For the per-call view this matters because you don’t want the engineer to wait while six rubrics recompute. For the aggregate view this matters because the SLO gauges aggregate over the same attribute that the Drawer reads, so the gauge value and the row-level value are guaranteed consistent. No drift between “what the chart says” and “what I see when I click in.”
Tags
Tags are agent-side metadata you attach when you emit the trace. The dashboard reads whatever tag keys you set. For voice we recommend at minimum:
- agent_version: every deploy
- customer_id: per session (hashed if needed)
- intent: per turn, from your classifier
- vertical: support / sales / scheduling
- campaign_id: outbound only
- language: multilingual only
Tags become the filter axes on the aggregate view, the row metadata on the trace browser, and the inline attributes on the span tree. Tagging discipline at instrumentation time is what makes the dashboard usable later. There is no fixing missing tags retroactively; the trace is immutable once emitted.
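Because missing tags cannot be repaired after the fact, it is worth checking them before the trace is emitted, for example in a unit test or a startup assertion. A minimal sketch of such a check, assuming the minimum tag set listed above; the helper and its flags are hypothetical:

```python
REQUIRED_TAGS = {"agent_version", "customer_id", "intent", "vertical"}


def missing_tags(metadata: dict, outbound: bool = False, multilingual: bool = False) -> set[str]:
    """Return the recommended tag keys that are absent or empty."""
    required = set(REQUIRED_TAGS)
    if outbound:
        required.add("campaign_id")  # outbound agents only
    if multilingual:
        required.add("language")  # multilingual deployments only
    return {key for key in required if not metadata.get(key)}


metadata = {"agent_version": "v2.4.1", "customer_id": "a1b2", "intent": "refund"}
print(sorted(missing_tags(metadata, outbound=True)))  # ['campaign_id', 'vertical']
```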
Audio
On the native path (Vapi/Retell/LiveKit), Future AGI pulls the recording from the provider API automatically when the call ends. The Call Details Drawer renders both channels in the multi-channel audio player. Granular audio field selection on the eval side lets you score against just the assistant channel, just the customer channel, or the mixed stereo file, depending on the rubric.
On the SDK path, emit recording URL metadata using the documented voice fields (gen_ai.voice.recording.assistant_url, gen_ai.voice.recording.customer_url, and gen_ai.voice.recording.stereo_url), or, if your runtime only carries one recording reference, the generic audio.url, audio.mime_type, and audio.transcript fields. From the Drawer’s perspective the result is identical: the audio panel renders once the recording reference is present on the trace.
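The fallback order described here can be sketched as a small resolver. This is an illustration of the lookup logic only, assuming span attributes are available as a flat dict; how the Drawer actually resolves recordings internally is not documented in this post:

```python
def recording_refs(attrs: dict) -> dict[str, str]:
    """Pick the recording URLs present on a trace, preferring the voice fields."""
    voice_keys = {
        "assistant": "gen_ai.voice.recording.assistant_url",
        "customer": "gen_ai.voice.recording.customer_url",
        "stereo": "gen_ai.voice.recording.stereo_url",
    }
    refs = {channel: attrs[key] for channel, key in voice_keys.items() if attrs.get(key)}
    if not refs and attrs.get("audio.url"):
        # Runtime carries a single recording reference only.
        refs["mixed"] = attrs["audio.url"]
    return refs


two_channel = {
    "gen_ai.voice.recording.assistant_url": "https://example.com/a.wav",
    "gen_ai.voice.recording.customer_url": "https://example.com/c.wav",
}
single = {"audio.url": "https://example.com/call.wav", "audio.mime_type": "audio/wav"}

print(sorted(recording_refs(two_channel)))  # ['assistant', 'customer']
print(recording_refs(single))  # {'mixed': 'https://example.com/call.wav'}
```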
What FAGI ships out of the box vs what you configure
Pulling this together explicitly because it’s where most engineering planning conversations get muddled.
Ships out of the box, zero configuration:
- Per-call detail Drawer with all five panels (audio, transcript, span tree, eval scores, Error Feed link).
- OpenInference span ingestion over OTLP, indexed by session and tags.
- Eval-to-span join (the score becomes a derived attribute on the trace automatically).
- Error Feed clustering with auto-written root cause, supporting span evidence, quick fix, and long-term recommendation per cluster.
- Native voice observability for Vapi, Retell, and LiveKit via Agent Definition (provider API key plus Assistant ID, no SDK).
- Multi-channel audio player in the Call Details Drawer with stereo/mono/assistant/customer export.
- Sticky filters in Observe, multi-step Agent Definition UX, Show Reasoning column in Simulate, Error Localization in simulation runs.
- 56 built-in eval templates including the voice-relevant defaults.
You configure:
- SLO thresholds for your traffic profile (P95 budget, task_completion floor, conversation_resolution floor).
- Alert routing (where the page goes on tier-1 red).
- Tag taxonomy (which keys you emit on traces).
- Custom rubrics beyond the 56 built-ins, authored through the in-product agent or the programmatic API.
- Filter axes shown by default on the aggregate view.
You do not build:
- Span ingestion, OTel exporter wiring, OpenInference compatibility layer.
- Eval-to-trace joining infrastructure.
- Clustering algorithm or root-cause writer.
- Audio capture and pull from supported providers.
- The Drawer renderer.
This split matters because building any of the “do not build” items in-house is a multi-quarter project. Configuring the thresholds and the rubrics is a one-week project for a single engineer.
The closed loop: dashboard to simulation to dashboard
The dashboard is not a terminal surface. It is a node in a loop. Production observability feeds production-derived simulation, simulation feeds prompt optimization, optimization feeds the next deploy, the next deploy gets observed.
The loop runs like this:
- Error Feed surfaces a failure cluster on the aggregate view. Say it’s “STT mistranscription on Indian English accents, turn 1, cluster volume rising over 48 hours.”
- Engineer drills into one example call from the cluster, opens the Drawer, plays the audio, reads the transcript, and sees an audio_transcription (eval_id 73) score of 0.42 on turn 1.
- Workflow Builder generates a scenario tree from this conversation. Branch visibility shows the expected conversation paths plus edge cases the conversation could have taken but didn’t. Add scenarios manually, generate via AI, or import from an existing dataset. The persona library (18 pre-built personas plus unlimited custom-authored personas with name, description, gender, age range bracket, location across US/Canada/UK/Australia/India, personality traits, communication style, accent, conversation speed, background noise, multilingual support) provides the caller-side variation.
- Run Tests via the 4-step wizard: Config → Scenarios → Eval → Execute. Pick the same eval templates the production calls were scored with so the simulation result is directly comparable. Error Localization pinpoints the exact failing turn in each simulation run.
- Prompt optimization reads the simulation dataset and runs one of the six optimizers in agent-opt: Bayesian Search (smart few-shot, bayesian-search), Meta-Prompt (bilevel optimization, arXiv 2505.09666), ProTeGi (textual gradients + beam search + critique), GEPA (Genetic-Pareto reflective evolution, arXiv 2507.19457), Random Search (baseline, arXiv 2311.09569), and PromptWizard (production-grade prompt optimization). Optimization runs through the Dataset UI or the agent-opt Python library; the dashboard surfaces optimizer iterations, candidate prompts, and final scores.
- Deploy the winning prompt candidate with a new agent_version tag. The aggregate view filters to the new version. The cluster in the Error Feed either resolves (volume drops to zero) or it doesn’t (in which case you have a new candidate hypothesis to test and the loop continues).
The optimization step is always explicit and human-gated. agent-opt does not rewrite production prompts behind your back. You run a job, you review the candidates, you ship the one you trust. The loop framing here is the production-to-simulation pipeline that the team operates, not an autonomous self-rewriting system.
This loop is what separates a working voice analytics setup from a chart-junk one. The dashboard is the trigger. The simulation suite is the workshop. The optimizer is the assistant. The deploy is the commit. The next observation is the test.
How the dashboard evolves over the agent lifecycle
Both views change shape as the agent matures and traffic accumulates.
At launch (week 0). The aggregate view’s tier 1 is the whole story. Tier 2 charts are still building enough signal to be useful. Tier 3 KPIs are noisy because the sample size is small. The Error Feed sidebar is empty because clustering needs volume. The per-call Drawer is the workhorse: engineers open it on every call to validate the agent’s behavior.
Weeks 2 to 4. Error Feed clusters start appearing. Tier 2 charts have enough density to see daily and weekly seasonality. The trace browser (one click in from the aggregate view) becomes the main workflow surface. The Drawer is still opened many times a day but increasingly from cluster drill-downs rather than from random sampling.
Months 2 to 3. Dashboard is in mature shape. Tier 1 is the heartbeat, reviewed continuously. Tier 2 is the troubleshooting surface, reviewed on demand. Tier 3 is the weekly product review surface. The Error Feed sidebar has 20+ named clusters at varying severity. The Drawer is opened dozens of times a day, mostly via cluster jumps. The closed loop into simulation runs every deploy.
Past 6 months. The dashboard increasingly feeds the optimization layer. The same trace data that the Drawer reads is now also being read by agent-opt against accumulated failure patterns. The optimizer proposes candidate prompts; the simulation suite scores them against the persona library; the dashboard validates the winning candidate in production after deploy. The library of regression scenarios grows continuously with each new cluster the team resolves.
Common dashboard anti-patterns to avoid
Twelve-tile tier-1 grids. Three. Always three. P95 turn latency, task_completion pass rate, conversation_resolution rate. Add a fourth and on-call starts to ignore it.
Average latency as the headline number. Always P95 or P99. Average hides the long tail; the long tail is who hangs up.
Transcript-only Drawer. Half the bugs in voice systems are timing bugs that only show in the waveform. The two-channel audio panel is non-negotiable.
One dashboard for engineering and product. Tier 1 + tier 2 are engineering. Tier 3 is product. Trying to put both audiences on one ungrouped surface ends with one team muting the other team’s alerts.
Static thresholds for everything. Hard limits (5xx rate, uptime) take static thresholds. Traffic-dependent signals (latency at peak, WER under new accent mix) need anomaly detection. Pages should fire on burn rate, not on instantaneous threshold breach.
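For the traffic-dependent signals, even a crude baseline comparison beats a fixed line. A minimal sketch using a z-score against the same hour on previous days; the 3.0 cutoff and the sample values are assumptions, and a production detector would also need to handle seasonality and trend:

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], current: float, z_threshold: float = 3.0) -> bool:
    """Flag a value far above its own recent baseline."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current > baseline
    return (current - baseline) / spread > z_threshold


# P95 turn latency (ms) at the same hour over the previous seven days.
same_hour_p95 = [710, 725, 690, 740, 705, 720, 715]

print(is_anomalous(same_hour_p95, 740))  # False
print(is_anomalous(same_hour_p95, 790))  # True
```

The 790ms reading is still under a static 800ms line, yet it sits well outside what this hour normally looks like, which is the case a static threshold misses.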
Untagged traces. If you don’t tag agent_version on every span, you cannot A/B deploys. If you don’t tag intent, you cannot slice tier 2. Tagging discipline at instrumentation time is what makes the dashboard usable downstream.
Alert without analysis. A page that says “P95 latency breached” is worse than a page that says “P95 latency breached, root cause is tool call X regressed after deploy v2.4.1, quick fix is rollback.” The Error Feed’s auto-written analysis is what makes the on-call workflow sustainable.
Where FAGI calibrates short
Optimization is explicit and human-gated. agent-opt with the six optimizers (Bayesian Search, Meta-Prompt, ProTeGi, GEPA, Random Search, PromptWizard) reads from the same trace and dataset the dashboard renders, but it never rewrites production prompts without an explicit run and a human approval gate. Teams who want autonomous self-rewriting behavior in production won’t get it here. That is intentional; the loop is operator-controlled.
Persona library depth is Cekura’s home turf. For pre-launch synthetic caller coverage at extreme variation counts, Cekura has a deeper catalog in some narrow accent segments. Future AGI ships 18 pre-built personas plus unlimited custom-authored personas with full configurability (gender, age range, location, accent, conversation speed, background noise, multilingual), but if the buyer is optimizing purely for the number of pre-built personas, pair the two tools.
BYOC routing for the most regulated workloads. SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 are certified per futureagi.com/trust. ISO 42001 is in progress. For workloads that require an in-VPC deployment with a customer-owned audit boundary, BYOC is supported, but the BYOC config has more knobs than the managed SaaS path and the rollout takes longer.
Putting the architecture into practice
If you’re starting from a blank backend, build the per-call detail Drawer first, the aggregate analytics view second. The reason is that the Drawer is what your engineers debug from on day one. The aggregate view needs traffic to be useful; the Drawer needs only one trace.
Inside the Drawer, keep the visible panel order as audio, transcript, span tree, eval scores, Error Feed link. If you are implementing from scratch, instrument spans early so the tree is correct from day one, but do not change the user-facing panel order. Inside the aggregate view, build tier 1 first (three gauges), Error Feed sidebar second (it’s already produced by the clustering layer), tier 2 SLIs third, tier 3 KPIs fourth.
The full production instrumentation procedure that produces the trace data lives in the production monitoring playbook. The 12 metrics that populate the SLO grid and the KPI tabs are detailed in the conversation monitoring metrics post. The end-to-end implementation walkthrough including SDK setup is in the voice AI observability implementation guide. The platform-by-platform comparison of where each layer comes from is in the voice agent monitoring platforms roundup.
Related reading
- 7 best voice agent monitoring platforms in 2026
- How to monitor AI voice agents in production: a 2026 playbook
- 12 metrics that actually matter for AI conversation monitoring in 2026
- How to implement voice AI observability in 2026
- Accent and dialect testing for voice AI in 2026
- When your agent passes evals but fails in production
Sources and references
- Voice observability quickstart: docs.futureagi.com — voice quickstart
- Voice observability overview: docs.futureagi.com — voice overview
- traceAI on GitHub: github.com/future-agi/traceAI
- ai-evaluation on GitHub: github.com/future-agi/ai-evaluation
- agent-opt optimizer docs (Bayesian Search, Meta-Prompt, ProTeGi, GEPA, Random Search, PromptWizard): docs.futureagi.com — optimization
- Error Feed and Observe docs: docs.futureagi.com/docs/observe
- Future AGI Protect docs: docs.futureagi.com/docs/protect
- Agent Command Center docs: docs.futureagi.com/docs/command-center
- arXiv 2507.19457 (GEPA): arxiv.org/abs/2507.19457
- arXiv 2505.09666 (Meta-Prompt bilevel optimization): arxiv.org/abs/2505.09666
- arXiv 2311.09569 (Random Search baseline): arxiv.org/abs/2311.09569
- arXiv 2510.13351 (Protect inline latency): arxiv.org/abs/2510.13351
- Trust page: futureagi.com/trust
- OpenInference specification: github.com/Arize-ai/openinference
- Google SRE workbook on SLOs and error budgets: sre.google/workbook/implementing-slos/