Logging and Analytics Architecture for Voice Agents in 2026
Design the data plane for voice agents in 2026: spans, OTLP, eval, dashboards, alerts, retention, and GDPR/HIPAA tradeoffs across the full architecture.
The architecture diagram on a voice agent monitoring slide deck looks simple: spans go in, dashboards come out. The reality is six layers of data flowing across services, retention policies, compliance constraints, and query patterns. This guide walks through the actual data plane: what gets logged, where it lives, how it flows, what retention rules apply, which query patterns the analytics layer needs to support, and where Future AGI’s Error Feed sits on top.
TL;DR step preview
- Capture six classes of data per call: metadata, audio, transcripts, spans, eval scores, tags.
- Ship spans via OTLP to your backend. Store audio in object storage with URL references on the spans. Keep eval scores joined to the spans they scored.
- Apply retention per attribute class. Customer-identifying fields and audio are short-lived; aggregated metrics are long-lived.
- Wire compliance from the start. Tenant-isolated storage, PII redaction, Data Privacy Compliance auditing, encryption in transit and at rest, audit logs on the trace store.
- Support five query patterns: tag filter, eval-score filter, time-series aggregation, cluster view, session replay.
- Put Error Feed on top of the eval and span data. It clusters failures into named issues with auto-written root cause, quick fix, and long-term recommendation.
The rest of the guide walks the architecture, the retention policies, the compliance surface, and the common pitfalls.
The full data plane
The end-to-end view of a voice agent monitoring stack in 2026:
    +-----------------+
    | Voice provider  |   Vapi / Retell / LiveKit / Pipecat /
    | or framework    |   custom (via Enable Others)
    +--------+--------+
             |
             +---> call audio (assistant + customer streams)
             |              |
             |              v
             |     +------------------+
             |     | Object storage   |   S3 / GCS / Azure Blob /
             |     | (audio)          |   provider's recording bucket
             |     +------------------+
             |
             +---> call metadata + transcripts (via native voice obs)
             |              |
             |              v
             |     +------------------+
             |     | Observe project  |   call log table, session view
             |     | (FAGI dashboard) |   audio playback, transcript
             |     +------------------+
             |
             +---> spans (via traceAI SDK, OpenInference)
                            |
                            v   OTLP gRPC or HTTP
                   +------------------+
                   | OTel collector   |   filtering, batching, fanout
                   | (optional)       |
                   +--------+---------+
                            |
                            v
                   +------------------+
                   | Backend ingest   |   FAGI Observe API +
                   |                  |   span store
                   +--------+---------+
                            |
                            v
                   +------------------+
                   | Scoring layer    |   70+ built-in eval rubrics
                   |                  |   + unlimited custom
                   +--------+---------+
                            |
                            v
                   +------------------+
                   | Clustering layer |   Error Feed: named issues,
                   |                  |   root cause, quick fix
                   +--------+---------+
                            |
                            +---> dashboards (product, ops)
                            +---> alerts (engineering on-call)
                            +---> simulation (pre-launch coverage)
                            +---> agent-opt (prompt optimization)
                            +---> Protect (inline guardrails)
Each box is a separate concern. The capture layer is where data enters. The storage layer is where it lives. The scoring, clustering, and downstream layers are where it becomes useful.
What gets logged per call
Six classes of data. They have different retention, access, and compliance requirements, so design them as separate stores even when one platform manages all of them.
Call metadata
The lightweight envelope: start time, end time, duration, direction (inbound or outbound), customer phone, agent assistant id, final status, and any tags. This is the row in the call log table. Cheap to store, indexed for filtering, retained as long as you retain the call record itself.
Typical retention: 1-3 years depending on industry. Healthcare and financial services often require longer.
Audio
Two streams per call: assistant audio and customer audio, kept separate. Future AGI’s native voice observability stores them as separately-downloadable files on every call, which is what makes debugging barge-in failures or TTS regressions tractable.
Audio belongs in object storage, not in your span attribute table. Span pipelines are built for small attribute values: a recording embedded in an attribute pushes every export toward transport message-size limits, and base64 encoding inflates the audio by roughly a third before it is stored. The right pattern: put the audio file in S3, GCS, or your provider’s recording bucket; put the URL on the span as an attribute.
Typical retention: 30-90 days for raw audio, 1 year for compliance-required calls. Encryption at rest with a KMS-managed key. Pre-signed URL access for the dashboard player; no public bucket access.
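A minimal sketch of the URL-reference pattern follows. The bucket name, key layout, and the 4 KB guard limit are illustrative assumptions, not Future AGI or OpenTelemetry defaults; the point is only to contrast a pointer attribute with an embedded payload.

```python
import base64

# Hypothetical guard limit for illustration; choose one that fits your backend.
MAX_ATTRIBUTE_BYTES = 4096


def audio_span_attributes(bucket: str, call_id: str, stream: str, duration_s: float) -> dict:
    """Build span attributes that point at the recording instead of embedding it."""
    return {
        # Assumed key layout: one object per call per stream.
        "input.audio.url": f"s3://{bucket}/calls/{call_id}/{stream}.wav",
        "audio.duration_seconds": duration_s,
    }


def oversized_attributes(attributes: dict, limit: int = MAX_ATTRIBUTE_BYTES) -> list:
    """Return the names of string attributes whose UTF-8 size exceeds the limit."""
    return [
        name
        for name, value in attributes.items()
        if isinstance(value, str) and len(value.encode("utf-8")) > limit
    ]


# One second of 16 kHz, 16-bit mono audio is 32,000 bytes before encoding.
raw_audio = bytes(32_000)
embedded = {"input.audio": base64.b64encode(raw_audio).decode("ascii")}
referenced = audio_span_attributes("voice-recordings", "call-123", "customer", 1.0)
```

Running `oversized_attributes` over both dictionaries flags only the embedded variant, which is the check you would want in a pre-export hook or a unit test.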
Transcripts
Turn-by-turn text with speaker tags and per-turn timestamps. Future AGI captures these automatically when wired to a native voice provider. For SDK-driven setups, the transcript usually comes from your STT provider’s response on each turn.
Transcripts contain PII by default. Customer names, account numbers, credit card numbers, addresses, medical details, anything the customer said in plain language. Apply the PII rubric for detection and the DataPrivacyCompliance rubric for audit. Redaction policies should run on a schedule that matches your industry’s requirements.
Typical retention: 30-90 days raw, indefinitely redacted.
Spans
The trace tree itself. One span per call leg: STT, LLM, tool, retrieval, TTS. Each span carries OpenInference attributes for its kind (LLM model name, retrieved chunks, tool parameters, audio duration). Each span also carries cross-cutting attributes (conversation_id, customer_id, agent_version, channel, turn_index).
Spans are cheap. A typical span is 1-3 KB. A 5-minute call with 30 turns and 6 spans per turn is 180 spans, 200-500 KB total. At 10,000 calls a day, that’s 2-5 GB per day of span data, hot. Plan storage accordingly: 30-day hot, 90-day warm, archive beyond that.
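The sizing arithmetic above can be reproduced with a few lines. This is a back-of-the-envelope sketch using the figures from the paragraph (30 turns, 6 spans per turn, 1-3 KB per span, 10,000 calls a day); the function name and the decimal-GB convention are assumptions. The exact products (180-540 KB per call, 1.8-5.4 GB per day) round to the ranges quoted in the text.

```python
def span_volume(turns_per_call: int, spans_per_turn: int, span_kb: float, calls_per_day: int) -> dict:
    """Estimate span count and storage volume from the shape of a typical call."""
    spans_per_call = turns_per_call * spans_per_turn
    kb_per_call = spans_per_call * span_kb
    gb_per_day = kb_per_call * calls_per_day / 1_000_000  # decimal GB
    return {
        "spans_per_call": spans_per_call,
        "kb_per_call": kb_per_call,
        "gb_per_day": gb_per_day,
        "gb_hot_30_days": gb_per_day * 30,
    }


low = span_volume(turns_per_call=30, spans_per_turn=6, span_kb=1, calls_per_day=10_000)
high = span_volume(turns_per_call=30, spans_per_turn=6, span_kb=3, calls_per_day=10_000)
```

Swapping in your own turn counts and call volume gives a first estimate of the 30-day hot tier before you commit to a storage plan.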
The Future AGI traceAI instrumentors emit OpenInference-compatible spans via OTLP. The traceai-livekit and traceAI-pipecat packages cover the voice frameworks; the broader catalog of 30+ documented integrations across Python + TypeScript covers the LLM providers and agent frameworks behind the voice runtime.
Eval scores
One row per rubric per call. A call with 5 attached rubrics produces 5 score rows. Each row joins back to the spans it scored.
The Future AGI ai-evaluation SDK ships 70+ built-in eval templates. The voice-specific ones are audio_transcription, audio_quality, conversation_coherence, conversation_resolution, and task_completion. Multi-lingual ones include translation_accuracy and cultural_sensitivity. Tone ones include is_polite, is_helpful, and is_concise. Plus unlimited custom evaluators authored either in code or via the in-product agent.
Eval scores are dense but compact. A score row is a few hundred bytes (score, reasoning, span pointer, rubric id, timestamp). Retention typically matches span retention, so they live and die together.
Tags
The filter axes. customer_id, vertical, agent_version, campaign_id, intent, plus any custom dimensions your business needs (cohort, geography, channel partner, A/B bucket).
Tags get applied at the Agent Definition level on Future AGI and propagate to every call. For SDK-driven setups, tags ride on the root span as attributes and propagate to children. Without tags, you can capture everything and analyze nothing.
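For SDK-driven setups, root-to-child propagation might look like the sketch below. It operates on plain attribute dictionaries rather than a real tracing SDK, and the helper name and the `call.direction` attribute are hypothetical; the tag keys come from the list above.

```python
TAG_KEYS = ("customer_id", "vertical", "agent_version", "campaign_id", "intent")


def propagate_tags(root_attributes: dict, child_attributes: dict, tag_keys=TAG_KEYS) -> dict:
    """Copy tag attributes from the root span onto a child.

    Only keys listed in tag_keys are inherited, and a value the child
    already carries wins over the inherited one.
    """
    inherited = {key: root_attributes[key] for key in tag_keys if key in root_attributes}
    return {**inherited, **child_attributes}


root = {
    "customer_id": "cust-42",
    "vertical": "support",
    "agent_version": "v3",
    "call.direction": "inbound",  # hypothetical non-tag attribute, not inherited
}
stt_child = propagate_tags(root, {"provider": "deepgram", "turn_index": 0})
```

Restricting inheritance to an explicit key list keeps child spans small and avoids copying customer-identifying fields that were never meant to be filter axes.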
Span attribute taxonomy
The span attributes that matter for voice debugging, by span kind:
LLM span
- llm.model_name, llm.provider
- llm.input_messages, llm.output_messages (serialized)
- llm.token_count.prompt, llm.token_count.completion
- llm.invocation_parameters (temperature, tools list, max tokens)
STT span (TOOL or custom kind)
- provider (deepgram, assemblyai, whisper, etc.)
- model
- input.audio.url (pointer to object storage)
- output.value (transcribed text)
- confidence
- audio.duration_seconds
- language
TTS span (TOOL or custom kind)
- provider (cartesia, elevenlabs, openai, etc.)
- voice_id
- output.audio.url
- audio.duration_seconds
- input.value (text)
- ssml (if used)
Tool span
- tool.name
- tool.parameters (JSON arguments)
- tool.return (result payload or pointer)
Retriever span
- retrieval.documents
- retrieval.query
- embedding.model_name
- retrieval.top_k
Cross-cutting (on every span)
- conversation_id
- customer_id
- agent_version
- channel=voice
- turn_index
- intent
The same attribute taxonomy applies whether you’re emitting spans via traceai-livekit, traceAI-pipecat, or hand-rolled OTel SDK code. OpenInference standardizes the names; the platform reads them.
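One way to keep hand-rolled instrumentation honest is a small check that runs in tests or a pre-export hook. The sketch below uses attribute names from the taxonomy in this section; which attributes count as required per span kind is an assumption made for illustration, not a rule from OpenInference or the platform.

```python
# Cross-cutting attributes expected on every span (from the list in this section).
CROSS_CUTTING = {
    "conversation_id", "customer_id", "agent_version", "channel", "turn_index", "intent",
}

# Assumed minimum per span kind; adjust to what your debugging workflow needs.
REQUIRED_BY_KIND = {
    "LLM": {"llm.model_name", "llm.provider"},
    "STT": {"provider", "input.audio.url", "output.value"},
    "TTS": {"provider", "output.audio.url", "input.value"},
    "TOOL": {"tool.name", "tool.parameters"},
    "RETRIEVER": {"retrieval.query", "retrieval.documents"},
}


def missing_attributes(kind: str, attributes: dict) -> set:
    """Return the required attribute names that are absent from a span."""
    required = CROSS_CUTTING | REQUIRED_BY_KIND.get(kind, set())
    return required - set(attributes)


stt_span = {
    "conversation_id": "conv-1",
    "customer_id": "cust-42",
    "agent_version": "v3",
    "channel": "voice",
    "turn_index": 4,
    "intent": "reschedule",
    "provider": "deepgram",
    "output.value": "I need to move my appointment",
}
missing = missing_attributes("STT", stt_span)  # the audio pointer was never attached
```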
OTLP transport
Spans travel over the OpenTelemetry Protocol (OTLP): gRPC by default for low overhead, with HTTP as the fallback when a proxy or compliance layer forces it. The Future AGI register() helper handles transport setup for you when using traceAI. For hand-rolled instrumentation:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Build a provider, attach a batching OTLP/gRPC exporter, and register it globally.
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="otlp.your-backend.example.com:4317")  # 4317 is the standard OTLP gRPC port
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
For voice workloads, tune max_queue_size and max_export_batch_size upward from defaults. Voice calls produce dense span batches per session, and the defaults assume sparse HTTP-shaped workloads.
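A hedged sketch of that tuning: the OpenTelemetry SDK specification defines `OTEL_BSP_MAX_QUEUE_SIZE` and `OTEL_BSP_MAX_EXPORT_BATCH_SIZE` (defaults 2048 and 512), and the batch size may not exceed the queue size. The sizing heuristic and the batch cap below are illustrative assumptions rather than recommended values; measure your own span rate before settling on numbers.

```python
import os

# Defaults from the OpenTelemetry SDK specification for the batch span processor.
DEFAULT_MAX_QUEUE_SIZE = 2048
DEFAULT_MAX_EXPORT_BATCH_SIZE = 512

# Hypothetical cap that keeps a single export request modest in size.
MAX_BATCH_CAP = 1024


def batch_settings(spans_per_call: int, concurrent_calls: int) -> dict:
    """Size the queue to hold roughly one full call per concurrent session."""
    queue = max(DEFAULT_MAX_QUEUE_SIZE, spans_per_call * concurrent_calls)
    # Raise the batch size above the default, but never past the cap or the queue.
    batch = min(queue, MAX_BATCH_CAP)
    return {
        "OTEL_BSP_MAX_QUEUE_SIZE": str(queue),
        "OTEL_BSP_MAX_EXPORT_BATCH_SIZE": str(batch),
    }


settings = batch_settings(spans_per_call=180, concurrent_calls=50)
# Set these before the BatchSpanProcessor is constructed so the SDK can pick them up.
os.environ.update(settings)
```

The same values can instead be passed to `BatchSpanProcessor` directly as `max_queue_size` and `max_export_batch_size`; environment variables are used here so the tuning lives in deployment config rather than code.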
The OTel collector is an optional middle layer. Use it when you want to:
- Fan spans out to multiple backends in parallel (Future AGI plus an in-house warehouse, for example).
- Apply attribute filters before export (strip PII before it leaves your VPC).
- Absorb backend outages without losing trace data.
For a starting setup, you can skip the collector and ship straight from the application to the backend. Add it when the parallel-export or attribute-filter requirements emerge.
Retention policy
Retention rules per data class, with typical defaults for non-regulated industries. Regulated industries (healthcare, financial services, public sector) extend these.
| Data class | Hot retention | Warm retention | Archive |
|---|---|---|---|
| Call metadata | 90 days | 1 year | 7 years if required |
| Audio (raw) | 30 days | 90 days | 1 year for compliance calls |
| Transcripts (raw) | 30 days | 90 days | indefinite if redacted |
| Spans | 30 days | 90 days | aggregated indefinitely |
| Eval scores | 30 days | 90 days | aggregated indefinitely |
| Tags + attribution | indefinite | - | - |
| Error Feed clusters | 90 days | 1 year | aggregated indefinitely |
The principle: customer-identifying fields and audio expire fast, aggregated metrics persist. PII redaction policies run on a schedule that hits the warm retention boundary, so anything that survives past 90 days is already redacted.
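For a self-managed store, the table can be expressed as a lookup that a lifecycle job consults. This is a sketch under one assumption: the hot and warm columns are read as age thresholds measured from creation (so a 30/90 row means hot until day 30, warm until day 90). The class keys and function name are hypothetical.

```python
from datetime import date

# (hot_days, warm_days) per data class, taken from the retention table.
RETENTION_DAYS = {
    "call_metadata": (90, 365),
    "audio_raw": (30, 90),
    "transcript_raw": (30, 90),
    "spans": (30, 90),
    "eval_scores": (30, 90),
    "error_feed_clusters": (90, 365),
}


def storage_tier(data_class: str, created: date, today: date) -> str:
    """Return the tier a record belongs in, based on its age in days."""
    hot_days, warm_days = RETENTION_DAYS[data_class]
    age = (today - created).days
    if age <= hot_days:
        return "hot"
    if age <= warm_days:
        return "warm"
    # Past the warm boundary: archive, aggregate, or delete per the table.
    return "archive_or_expire"


created = date(2026, 1, 1)
tiers = [storage_tier("audio_raw", created, d) for d in (date(2026, 1, 20), date(2026, 3, 1), date(2026, 6, 1))]
```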
Future AGI handles retention policy enforcement as a managed feature. For BYOC self-host, you wire the equivalent lifecycle rules on your object storage and span store.
Compliance surface
Five practices cover the regulated-industry surface.
Tenant-isolated storage
Every span, every audio file, every transcript, every score carries a tenant id. Queries enforce row-level filtering on tenant. The trace store should refuse a query that doesn’t include a tenant filter, not return a global view. The Future AGI Agent Command Center handles this with RBAC scoped per tenant.
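The refuse-by-default behavior can be sketched as a guard in front of the query path. This is an in-memory illustration with hypothetical names, not how Agent Command Center implements it; a real store would enforce the same rule in its query planner or row-level security policy.

```python
class MissingTenantFilter(ValueError):
    """Raised when a trace-store query has no tenant scope."""


def scoped_query(rows: list, filters: dict) -> list:
    """Return rows matching every filter, refusing queries without a tenant_id."""
    if not filters.get("tenant_id"):
        raise MissingTenantFilter("query rejected: a tenant_id filter is required")
    return [row for row in rows if all(row.get(key) == value for key, value in filters.items())]


rows = [
    {"tenant_id": "acme", "call_id": "c1", "agent_version": "v3"},
    {"tenant_id": "acme", "call_id": "c2", "agent_version": "v2"},
    {"tenant_id": "globex", "call_id": "c3", "agent_version": "v3"},
]
acme_v3 = scoped_query(rows, {"tenant_id": "acme", "agent_version": "v3"})
```

Raising instead of returning an empty list makes a missing tenant scope a visible bug rather than a silent cross-tenant read.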
Retention policies per attribute class
Customer-identifying fields, audio, and full transcripts are short-lived. Hashed identifiers, aggregated metrics, and Error Feed clusters are long-lived. The policy applies at the attribute class level, not at the record level.
PII detection and redaction
The PII rubric flags PII fields in captured text. The DataPrivacyCompliance rubric audits the whole call session for privacy violations. The platform that runs them should be able to redact in-place after the hot retention window expires.
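To make the redact-in-place step concrete, here is a deliberately simple stand-in. It uses two regular expressions instead of the PII rubric, and both patterns are illustrative assumptions that will miss names, addresses, and most other PII; treat it as the shape of the operation, not a detector.

```python
import re

# Illustrative patterns only; a production detector needs far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def redact(text: str) -> str:
    """Replace each matched span with a bracketed label such as [CARD]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


turn = "My card is 4111 1111 1111 1111 and email is jo@example.com."
clean = redact(turn)
```

Replacing matches with typed labels rather than deleting them keeps the redacted transcript readable for later review.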
Encryption in transit and at rest
TLS to the OTLP collector. KMS-managed keys on object storage for audio. Encrypted span store at rest. Pre-signed URL access for dashboard audio playback (no public bucket access).
Audit log on the trace store itself
Every query against the trace store should produce an audit log entry: who queried, what filter, what time, what result count. The audit log lives in its own retention bucket separate from the span store.
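An audit entry with those four fields might be serialized as below. The field names and the JSON-lines shape are assumptions for illustration; the timestamp is normalized to UTC so entries from different regions sort consistently.

```python
import json
from datetime import datetime, timezone


def audit_entry(user: str, tenant_id: str, filters: dict, result_count: int, now: datetime) -> str:
    """Serialize one audit record: who queried, what filter, what time, what result count."""
    record = {
        "user": user,
        "tenant_id": tenant_id,
        "filters": filters,
        "queried_at": now.astimezone(timezone.utc).isoformat(),
        "result_count": result_count,
    }
    # Sorted keys give a stable line format that is easy to diff and grep.
    return json.dumps(record, sort_keys=True)


line = audit_entry(
    user="oncall@example.com",
    tenant_id="acme",
    filters={"agent_version": "v3"},
    result_count=12,
    now=datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc),
)
```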
Future AGI is SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per futureagi.com/trust. The platform handles most of these practices by default. For federal procurement, deploy via BYOC self-host so the audit boundary stays inside your VPC.
Query patterns the analytics layer needs
Five query patterns matter most for voice agent analytics. A good platform handles all five at production scale.
Tag filter
“All calls for customer X in the last 7 days.” “All calls in the support vertical on agent_version v3.” “All calls in the outbound_sales campaign id 47.” This is the bread-and-butter query for ops and customer success.
Implementation: indexed tag attributes on the call log table, filterable from the dashboard, query API for programmatic access.
Eval-score filter
“All calls with conversation_resolution below 0.5.” “All calls with audio_transcription flagging mistranscription.” “All calls with is_polite below a threshold.”
Implementation: eval scores joined to call rows, filterable by score range, with the option to filter on AND combinations across multiple rubrics.
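The AND-combination filter can be sketched over plain score rows. The row shape (call_id, rubric, score) and the 0-1 score scale are assumptions; a call qualifies only if it has a score for every listed rubric and each score falls below its threshold.

```python
def calls_below(score_rows: list, thresholds: dict) -> set:
    """Return call ids whose score is below the threshold for every listed rubric."""
    by_call = {}
    for row in score_rows:
        by_call.setdefault(row["call_id"], {})[row["rubric"]] = row["score"]
    return {
        call_id
        for call_id, rubric_scores in by_call.items()
        if all(
            rubric in rubric_scores and rubric_scores[rubric] < limit
            for rubric, limit in thresholds.items()
        )
    }


score_rows = [
    {"call_id": "c1", "rubric": "conversation_resolution", "score": 0.3},
    {"call_id": "c1", "rubric": "is_polite", "score": 0.4},
    {"call_id": "c2", "rubric": "conversation_resolution", "score": 0.2},
    {"call_id": "c2", "rubric": "is_polite", "score": 0.9},
]
flagged = calls_below(score_rows, {"conversation_resolution": 0.5, "is_polite": 0.6})
```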
Time-series aggregation
“Completion rate over the last 30 days, bucketed daily.” “Latency p50, p95, p99 over the last 7 days.” “Escalation rate by intent over the last quarter.”
Implementation: pre-aggregated rollups for fast time-series queries, on-demand drill-down to underlying calls.
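As a rough illustration of what a rollup job computes, the sketch below buckets completion rate by day and derives latency percentiles with the standard library. The record shape and function names are hypothetical, and a production rollup would run incrementally in the store rather than over an in-memory list.

```python
from collections import defaultdict
from statistics import quantiles


def daily_completion_rate(calls: list) -> dict:
    """Return {day: completed / total}, ordered by day."""
    totals = defaultdict(lambda: [0, 0])  # day -> [completed, total]
    for call in calls:
        bucket = totals[call["day"]]
        bucket[0] += 1 if call["completed"] else 0
        bucket[1] += 1
    return {day: completed / total for day, (completed, total) in sorted(totals.items())}


def latency_percentiles(latencies_ms: list) -> dict:
    """Return p50, p95, and p99 using inclusive quantiles (needs at least two samples)."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


calls = [
    {"day": "2026-01-01", "completed": True},
    {"day": "2026-01-01", "completed": False},
    {"day": "2026-01-02", "completed": True},
]
rates = daily_completion_rate(calls)
latency = latency_percentiles(list(range(1, 102)))  # 1 ms .. 101 ms
```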
Cluster view
“What named issues is Error Feed surfacing this week, and which are rising?” “Show me the supporting evidence for the ‘STT confidence drop on Indian English’ cluster.” “How many calls hit the ‘Tool argument schema mismatch in book_appointment’ issue?”
Implementation: Error Feed clusters as first-class objects, with named issue rows, status (active, acknowledged, resolved), trend direction (rising, falling, flat), and drill-down to underlying calls and spans.
Session replay
“Replay this specific call with audio, transcript, and span tree side by side.”
Implementation: a session view that pulls the assistant audio, customer audio, transcript, and span tree into one panel. The dashboard player handles playback in sync with transcript scrolling and span timeline highlighting.
All five queries are common enough that they should be one click from the platform’s homepage. If any one of them requires SQL or a custom export, the platform is incomplete.
Dashboards and alerts
Dashboards summarize metrics for ops and product. Alerts fire on SLO breaches for engineering.
Dashboard primitives
- Call volume by hour, day, week.
- Completion rate time-series, segmented by intent or vertical.
- Latency percentiles (TTFT, end-to-end turn latency) time-series.
- Eval score distributions for the attached rubrics (histogram of conversation_resolution, task_completion, is_polite, etc.).
- Error Feed named-issue list with status, trend, count.
- Top tags by call volume (which customers, verticals, agent versions are driving traffic).
These are the standing review primitives. Product looks at completion rate trends. Ops looks at volume and named-issue counts. Engineering looks at latency percentiles and rising issues.
Alert primitives
- SLO breach: completion rate below threshold, latency above threshold, error rate above threshold.
- Cluster spike: Error Feed sees a sudden rise in a named issue.
- Eval-score anomaly: a rubric score distribution shifts significantly from baseline.
- Audio failure spike: audio_transcription or audio_quality failure rate rises.
- Compliance signal: PII or DataPrivacyCompliance detects an unexpected pattern.
Alerts wire into your on-call rotation. The Error Feed clustering layer matters here: instead of 50 individual alerts on related failures, you get one named issue with a quick fix recommendation.
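The collapse from many alerts to one issue can be illustrated with a naive grouping. This is a stand-in that groups on two exact-match fields (both hypothetical); it is not Error Feed's clustering method, which the source does not describe, and it only shows why a grouped view is easier to page on.

```python
from collections import Counter


def group_alerts(alerts: list) -> list:
    """Collapse alerts that share a category and component into one issue with a count."""
    counts = Counter((alert["category"], alert["component"]) for alert in alerts)
    return [
        {"category": category, "component": component, "count": count}
        for (category, component), count in counts.most_common()
    ]


alerts = [{"category": "tool_crash", "component": "book_appointment"}] * 50
alerts.append({"category": "safety_violation", "component": "llm"})
issues = group_alerts(alerts)
```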
Where Error Feed sits
Error Feed is the clustering and triage layer above the raw eval and span data. It reads the captured spans, identifies failure patterns across five categories (factual grounding failures, tool crashes, broken workflows, safety violations, reasoning gaps), and groups related failures into named issues, each with an auto-written root cause, supporting evidence from spans, a quick fix, and a long-term recommendation. It also tracks whether each issue is rising or falling.
The output is your engineering backlog. Instead of triaging individual alerts, you triage clusters. Each cluster carries:
- A name (auto-generated, like “STT confidence drop on Indian English”).
- A count (how many calls hit this cluster).
- A trend (rising, falling, flat).
- Supporting evidence (sample spans, audio links, transcripts).
- A quick fix (something to ship today).
- A long-term recommendation (the structural change).
- Status (active, acknowledged, in progress, resolved).
That’s the workflow that turns raw data into action. Without the clustering layer, you have data; with it, you have a backlog.
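The trend field on a cluster could be derived from counts in two consecutive windows. The sketch below is one plausible rule, with a hypothetical 10% tolerance band; the source does not specify how Error Feed computes trend direction.

```python
def trend(previous_count: int, current_count: int, tolerance: float = 0.1) -> str:
    """Label a cluster rising, falling, or flat from two consecutive window counts."""
    if previous_count == 0:
        # No baseline: any occurrence counts as a new, rising issue.
        return "rising" if current_count > 0 else "flat"
    change = (current_count - previous_count) / previous_count
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "falling"
    return "flat"


weekly = [trend(40, 60), trend(60, 40), trend(50, 52)]
```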
Error Feed is zero-config. The moment spans flow into an Observe project, clustering starts. It needs a few days of traffic to populate, but after that the named issues update continuously.
Closing the loop
The architecture doesn’t end at dashboards and alerts. Three feedback loops close the system:
Simulation. Failures clustered by Error Feed feed into simulation runs. The simulation product ships 18 pre-built personas plus unlimited custom (gender, age range, location, accent, communication style, conversation speed, background noise, and multilingual controls). Workflow Builder auto-generates branching scenarios (20, 50, or 100 rows) with branch visibility. Error Localization in Simulate (release 2025-11-25) pinpoints the exact failing turn when a scenario breaks. The same Agent Definition wired in observability plugs into Simulate.
Optimization. agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) and is available both as a UI workflow inside the Dataset surface and as a Python SDK. The loop: production traces flag a failure pattern, Error Feed clusters it, you correct the assistant’s response on a few examples, you pick an optimizer + a dataset + an evaluator and start a run through the UI or SDK, the new prompt ships after human approval, the cluster shrinks. Custom evaluators in the in-product agent calibrate from human review feedback so each iteration sharpens the rubric.
Inline guardrails. The Future AGI Protect model family runs sub-100ms inline per arXiv 2510.13351. It is built on a Gemma 3n foundation with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance), and is multi-modal across text, image, and audio. ProtectFlash adds a single-call binary classifier path. The guardrail runs between the LLM response and the TTS leg on the critical voice path; its verdicts land on the FAGI span so the trust team can review denied responses in Error Feed.
The architecture is unified around one platform: the same Agent Definition wires observability, simulation, optimization, and inline guardrails. One dashboard, one trust posture, one bill.
Common pitfalls
Logging audio to span attributes. Said in several places in this guide because it’s the most common production failure. Use URLs to object storage; don’t base64 the audio onto a span attribute.
Single-tier retention. Hot retention for everything is expensive and unnecessary. Hot for recent, warm for queryable history, archive for aggregated metrics. Customer-identifying fields and audio expire fast; aggregated metrics persist.
Skipping tag attribution at the source. If you can’t filter by customer, vertical, agent version, and intent, you have telemetry but not analytics. Apply tags at the Agent Definition level and propagate.
Treating PII redaction as a one-off. Redaction policies run on a schedule, not on demand. Hit the warm retention boundary, redact, retain the redacted version indefinitely.
Running raw alerts without clustering. A noisy alert stream burns out on-call. Error Feed clustering is what turns alerts into a manageable backlog.
Not designing for federal procurement. Federal teams deploy in their own VPC with a customer-owned audit boundary.
Where Future AGI fits
Future AGI is one of several platforms that handle the voice agent data plane in 2026. The architectural reason most voice teams pick it is the unified surface: capture, storage, scoring, clustering, alerting, simulation, optimization, and inline guardrails on one platform.
- Native voice observability for Vapi, Retell AI, and LiveKit with no SDK required. Add provider API key + Assistant ID, get call logs, separate assistant + customer audio, transcripts, and the full eval engine on every call. Enable Others mode supports any provider via mobile-number simulation; Indian phone number support shipped 2025-11-25.
- traceAI for SDK-driven span capture. 30+ documented integrations across Python + TypeScript, OpenInference-compatible, Apache 2.0. Dedicated traceai-livekit and traceAI-pipecat pip packages for voice frameworks. Custom voices from ElevenLabs and Cartesia configurable in Run Prompt and Experiments.
- ai-evaluation ships 70+ built-in eval templates in Apache 2.0. Voice rubrics include audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion. Multilingual include translation_accuracy, cultural_sensitivity. Tone include is_polite, is_helpful, is_concise. RAG include groundedness, chunk_attribution, context_relevance. In-house classifier models tuned for the LLM-as-judge cost/latency tradeoff on high-volume scoring.
- Error Feed clusters trace failures into named issues with auto-written root cause, supporting evidence, quick fix, and long-term recommendation. Zero-config.
- Error Localization in Simulate pinpoints the exact failing turn. Programmatic eval API for configure + re-run enables CI integration.
- Future AGI Protect for inline guardrails. Sub-100ms per arXiv 2510.13351. ProtectFlash binary classifier for the lowest-latency surface.
- Agent Command Center for hosted, multi-region, or BYOC self-host. RBAC, AWS Marketplace, 15+ providers in the router surface. SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per futureagi.com/trust.
- agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard). UI workflow inside the Dataset surface and a Python SDK for programmatic control.
Two deliberate tradeoffs
Optimization is an explicit, gated run. The six-optimizer agent-opt surface (UI + SDK) never auto-rewrites prompts in production. Every optimization run is started by a human, gated by an evaluator, and surfaces candidate prompts for approval before they ship. That’s a deliberate process choice: production prompt changes go through human review.
Native voice observability ships for Vapi, Retell, and LiveKit out of the box. The dashboard path covers the three runtimes most teams are on with no SDK required. For any other runtime (Bland, ElevenLabs Agents, Pipecat, or a custom stack on Twilio/Plivo/Telnyx), Enable Others mode + traceAI SDK + webhook covers ingest. Between native and Enable Others, the active production stack in 2026 is in scope.
Related reading
- Voice AI Observability for Vapi: 2026 implementation guide
- OpenInference + OpenTelemetry for Voice Agents: 2026 tracing guide
- An Introduction to Production Monitoring for Voice Agents in 2026
- 7 best voice agent monitoring platforms in 2026
Sources and references
- traceAI on GitHub: github.com/future-agi/traceAI
- ai-evaluation on GitHub: github.com/future-agi/ai-evaluation
- Error Feed docs: docs.futureagi.com/docs/observe
- Future AGI Protect docs: docs.futureagi.com/docs/protect
- Agent Command Center docs: docs.futureagi.com/docs/command-center
- arXiv 2510.13351 (Protect): arxiv.org/abs/2510.13351
- arXiv 2507.19457 (GEPA): arxiv.org/abs/2507.19457
- arXiv 2505.09666 (Meta-Prompt): arxiv.org/abs/2505.09666
- arXiv 2311.09569 (Random Search baseline): arxiv.org/abs/2311.09569
- Trust page: futureagi.com/trust
- OpenInference spec: github.com/Arize-ai/openinference
- OpenTelemetry: opentelemetry.io
- OTLP spec: github.com/open-telemetry/opentelemetry-proto