Logging and Analytics Architecture for Voice Agents in 2026

Design the data plane for voice agents in 2026: spans, OTLP, eval, dashboards, alerts, retention, and GDPR/HIPAA tradeoffs across the full architecture.

The architecture diagram on a voice agent monitoring slide deck looks simple: spans go in, dashboards come out. The reality is six layers of data flowing across services, retention policies, compliance constraints, and query patterns. This guide walks through the actual data plane: what gets logged, where it lives, how it flows, what retention rules apply, which query patterns the analytics layer needs to support, and where Future AGI’s Error Feed sits on top.

TL;DR step preview

  1. Capture six classes of data per call: metadata, audio, transcripts, spans, eval scores, tags.
  2. Ship spans via OTLP to your backend. Store audio in object storage with URL references on the spans. Keep eval scores joined to the spans they scored.
  3. Apply retention per attribute class. Customer-identifying fields and audio are short-lived; aggregated metrics are long-lived.
  4. Wire compliance from the start. Tenant-isolated storage, PII redaction, Data Privacy Compliance auditing, encryption in transit and at rest, audit logs on the trace store.
  5. Support five query patterns: tag filter, eval-score filter, time-series aggregation, cluster view, session replay.
  6. Put Error Feed on top of the eval and span data. It clusters failures into named issues with auto-written root cause, quick fix, and long-term recommendation.

The rest of the guide walks the architecture, the retention policies, the compliance surface, and the common pitfalls.

The full data plane

The end-to-end view of a voice agent monitoring stack in 2026:

+-----------------+
| Voice provider  |   Vapi / Retell / LiveKit / Pipecat /
| or framework    |   custom (via Enable Others)
+--------+--------+
         |
         +---> call audio (assistant + customer streams)
         |          |
         |          v
         |    +------------------+
         |    | Object storage   |   S3 / GCS / Azure Blob /
         |    | (audio)          |   provider's recording bucket
         |    +------------------+
         |
         +---> call metadata + transcripts (via native voice obs)
         |          |
         |          v
         |    +------------------+
         |    | Observe project  |   call log table, session view
         |    | (FAGI dashboard) |   audio playback, transcript
         |    +------------------+
         |
         +---> spans (via traceAI SDK, OpenInference)
                    |
                    v  OTLP gRPC or HTTP
              +------------------+
              | OTel collector   |   filtering, batching, fanout
              | (optional)       |
              +--------+---------+
                       |
                       v
              +------------------+
              | Backend ingest   |   FAGI Observe API +
              |                  |   span store
              +--------+---------+
                       |
                       v
              +------------------+
              | Scoring layer    |   70+ built-in eval rubrics
              |                  |   + unlimited custom
              +--------+---------+
                       |
                       v
              +------------------+
              | Clustering layer |   Error Feed: named issues,
              |                  |   root cause, quick fix
              +--------+---------+
                       |
                       +---> dashboards (product, ops)
                       +---> alerts (engineering on-call)
                       +---> simulation (pre-launch coverage)
                       +---> agent-opt (prompt optimization)
                       +---> Protect (inline guardrails)

Each box is a separate concern. The capture layer is where data enters. The storage layer is where it lives. The scoring, clustering, and downstream layers are where it becomes useful.

What gets logged per call

Six classes of data. They have different retention, access, and compliance requirements, so design them as separate stores even when one platform manages all of them.

Call metadata

The lightweight envelope: start time, end time, duration, direction (inbound or outbound), customer phone, agent assistant id, final status, and any tags. This is the row in the call log table. Cheap to store, indexed for filtering, retained as long as you retain the call record itself.

Typical retention: 1-3 years depending on industry. Healthcare and financial services often require longer.

Audio

Two streams per call: assistant audio and customer audio, kept separate. Future AGI’s native voice observability stores them as separately-downloadable files on every call, which is what makes debugging barge-in failures or TTS regressions tractable.

Audio belongs in object storage, not in your span attribute table. Span attributes are meant for small values: SDK attribute-length limits and collector or backend message-size limits can truncate or reject large payloads, and base64-encoding audio inflates its size by about a third before any indexing overhead. The right pattern: put the audio file in S3, GCS, or your provider’s recording bucket; put the URL on the span as an attribute.

Typical retention: 30-90 days for raw audio, 1 year for compliance-required calls. Encryption at rest with a KMS-managed key. Pre-signed URL access for the dashboard player; no public bucket access.
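
A minimal sketch of the URL-on-span pattern, using a plain dictionary in place of a real span object. The attribute keys follow the taxonomy used in this guide; the helper name `audio_span_attributes`, the set of allowed schemes, and the example bucket path are illustrative assumptions, not part of any SDK.

```python
from urllib.parse import urlparse

# Schemes accepted as audio references (an assumption for this sketch).
ALLOWED_SCHEMES = {"s3", "gs", "https"}


def audio_span_attributes(audio_url: str, duration_seconds: float) -> dict:
    """Build span attributes that point at audio instead of embedding it."""
    scheme = urlparse(audio_url).scheme
    if scheme not in ALLOWED_SCHEMES:
        # data: URIs and raw base64 strings have no allowed scheme, so they fail here.
        raise ValueError(f"expected an object-storage URL, got scheme {scheme!r}")
    return {
        "input.audio.url": audio_url,
        "audio.duration_seconds": round(duration_seconds, 3),
    }


# Hypothetical bucket and key, for illustration only.
attrs = audio_span_attributes("s3://example-recordings/calls/abc123/customer.wav", 42.5)
```

Rejecting anything that is not a reference at the point where attributes are built is one way to keep inline audio out of the span store.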

Transcripts

Turn-by-turn text with speaker tags and per-turn timestamps. Future AGI captures these automatically when wired to a native voice provider. For SDK-driven setups, the transcript usually comes from your STT provider’s response on each turn.

Transcripts contain PII by default. Customer names, account numbers, credit card numbers, addresses, medical details, anything the customer said in plain language. Apply the PII rubric for detection and the DataPrivacyCompliance rubric for audit. Redaction policies should run on a schedule that matches your industry’s requirements.

Typical retention: 30-90 days raw, indefinitely redacted.
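
As a rough illustration of what a scheduled redaction pass does to transcript text, here is a regex-based sketch. It is not the PII rubric or the DataPrivacyCompliance rubric; the three patterns and the `redact` helper are hypothetical and cover only a few obvious formats.

```python
import re

# Illustrative patterns only; production detection needs far broader coverage.
PATTERNS = {
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\+?\d{1,3}[ -]?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each matched span with a bracketed label such as [CARD]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


turn = redact("My card is 4111 1111 1111 1111.")
contact = redact("Call +1 415 555 0100 or mail jo@example.com")
```

Keeping the label in place of the value preserves the shape of the turn, so redacted transcripts stay useful for debugging after the raw text expires.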

Spans

The trace tree itself. One span per call leg: STT, LLM, tool, retrieval, TTS. Each span carries OpenInference attributes for its kind (LLM model name, retrieved chunks, tool parameters, audio duration). Each span also carries cross-cutting attributes (conversation_id, customer_id, agent_version, channel, turn_index).

Spans are cheap. A typical span is 1-3 KB. A 5-minute call with 30 turns and 6 spans per turn is 180 spans, 200-500 KB total. At 10,000 calls a day, that’s 2-5 GB per day of span data, hot. Plan storage accordingly: 30-day hot, 90-day warm, archive beyond that.
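
The sizing arithmetic above can be sketched as a small calculator. The function name and the decimal-unit convention are assumptions of this sketch; the inputs are the figures quoted in the paragraph.

```python
def span_volume(calls_per_day, turns_per_call, spans_per_turn, kb_per_span):
    """Return (spans per call, KB per call, GB per day) for one span size."""
    spans_per_call = turns_per_call * spans_per_turn
    kb_per_call = spans_per_call * kb_per_span
    # Decimal units: 1 GB = 1,000,000 KB.
    gb_per_day = kb_per_call * calls_per_day / 1_000_000
    return spans_per_call, kb_per_call, gb_per_day


low = span_volume(10_000, 30, 6, kb_per_span=1)
high = span_volume(10_000, 30, 6, kb_per_span=3)

# Size of a 30-day hot tier at the low and high span sizes.
hot_30_day_gb = (low[2] * 30, high[2] * 30)
```

With 1-3 KB spans this gives 180 spans and 180-540 KB per call, about 1.8-5.4 GB per day, and roughly 54-162 GB for a 30-day hot tier.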

The Future AGI traceAI instrumentors emit OpenInference-compatible spans via OTLP. The traceai-livekit and traceAI-pipecat packages cover the voice frameworks; the broader catalog of 30+ documented integrations across Python + TypeScript covers the LLM providers and agent frameworks behind the voice runtime.

Eval scores

One row per rubric per call. A call with 5 attached rubrics produces 5 score rows. Each row joins back to the spans it scored.

The Future AGI ai-evaluation SDK ships 70+ built-in eval templates. The voice-specific ones are audio_transcription, audio_quality, conversation_coherence, conversation_resolution, and task_completion. Multi-lingual ones include translation_accuracy and cultural_sensitivity. Tone ones include is_polite, is_helpful, and is_concise. Plus unlimited custom evaluators authored either in code or via the in-product agent.

Eval scores are dense but compact. A score row is a few hundred bytes (score, reasoning, span pointer, rubric id, timestamp). Retention typically matches span retention, so they live and die together.

Tags

The filter axes. customer_id, vertical, agent_version, campaign_id, intent, plus any custom dimensions your business needs (cohort, geography, channel partner, A/B bucket).

Tags get applied at the Agent Definition level on Future AGI and propagate to every call. For SDK-driven setups, tags ride on the root span as attributes and propagate to children. Without tags, you can capture everything and analyze nothing.
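
For the SDK-driven case, root-to-child propagation can be sketched with plain dictionaries standing in for span attributes. The `propagate_tags` helper and the choice to let root tags win on conflict are assumptions of this sketch, not documented traceAI behavior.

```python
TAG_KEYS = ("customer_id", "vertical", "agent_version", "campaign_id", "intent")


def propagate_tags(root_attrs: dict, child_attrs: dict, tag_keys=TAG_KEYS) -> dict:
    """Copy tag attributes from the root span onto a child span's attributes."""
    inherited = {k: root_attrs[k] for k in tag_keys if k in root_attrs}
    # Root tags win on conflict so every span in a call filters the same way.
    return {**child_attrs, **inherited}


root = {"customer_id": "cust_42", "vertical": "support", "agent_version": "v3"}
child = propagate_tags(root, {"llm.model_name": "example-model", "agent_version": "stale"})
```

Letting the root value override a child's value trades flexibility for consistency: a tag filter then returns whole calls rather than fragments of them.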

Span attribute taxonomy

The span attributes that matter for voice debugging, by span kind:

LLM span

  • llm.model_name, llm.provider
  • llm.input_messages, llm.output_messages (serialized)
  • llm.token_count.prompt, llm.token_count.completion
  • llm.invocation_parameters (temperature, tools list, max tokens)

STT span (TOOL or custom kind)

  • provider (deepgram, assemblyai, whisper, etc.)
  • model
  • input.audio.url (pointer to object storage)
  • output.value (transcribed text)
  • confidence
  • audio.duration_seconds
  • language

TTS span (TOOL or custom kind)

  • provider (cartesia, elevenlabs, openai, etc.)
  • voice_id
  • output.audio.url
  • audio.duration_seconds
  • input.value (text)
  • ssml (if used)

Tool span

  • tool.name
  • tool.parameters (JSON arguments)
  • tool.return (result payload or pointer)

Retriever span

  • retrieval.documents
  • retrieval.query
  • embedding.model_name
  • retrieval.top_k

Cross-cutting (on every span)

  • conversation_id
  • customer_id
  • agent_version
  • channel = voice
  • turn_index
  • intent

The same attribute taxonomy applies whether you’re emitting spans via traceai-livekit, traceAI-pipecat, or hand-rolled OTel SDK code. OpenInference standardizes the names; the platform reads them.
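
One way to keep hand-rolled instrumentation honest is to check each span's attributes against the cross-cutting list before export. The sketch below treats attributes as a dictionary; `missing_cross_cutting` is a hypothetical helper, and treating all six keys as required is an assumption drawn from the list above.

```python
CROSS_CUTTING = (
    "conversation_id",
    "customer_id",
    "agent_version",
    "channel",
    "turn_index",
    "intent",
)


def missing_cross_cutting(attrs: dict) -> list:
    """Return the cross-cutting keys that a span's attributes do not carry."""
    return [key for key in CROSS_CUTTING if key not in attrs]


stt_span = {
    "provider": "deepgram",
    "conversation_id": "conv_1",
    "customer_id": "cust_42",
    "agent_version": "v3",
    "channel": "voice",
    "turn_index": 0,
}
gaps = missing_cross_cutting(stt_span)
```

Running a check like this in a test suite or a span processor surfaces untagged spans before they reach the analytics layer.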

OTLP transport

OpenTelemetry Protocol, gRPC by default for low overhead, HTTP fallback when a proxy or compliance layer forces it. The Future AGI register() helper handles transport setup for you when using traceAI. For hand-rolled instrumentation:

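from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export spans over OTLP/gRPC (port 4317 by convention). Replace the endpoint with your backend's.
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="otlp.your-backend.example.com:4317")
# Batch spans in a background thread so export does not block the call path.
provider.add_span_processor(BatchSpanProcessor(exporter))
# Register the provider globally so trace.get_tracer() picks it up.
trace.set_tracer_provider(provider)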

For voice workloads, tune max_queue_size and max_export_batch_size upward from defaults. Voice calls produce dense span batches per session, and the defaults assume sparse HTTP-shaped workloads.
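
A hedged sketch of that tuning through the batch span processor environment variables defined by the OpenTelemetry SDK specification. The Python SDK defaults are a queue of 2048 spans and batches of 512; the larger values below are illustrative starting points rather than recommendations, and `configure_batch_processor_env` is a hypothetical helper.

```python
import os


def configure_batch_processor_env(
    max_queue_size: int = 8192,
    max_export_batch_size: int = 1024,
    schedule_delay_ms: int = 2000,
) -> dict:
    """Set the OTEL_BSP_* variables; call this before creating the processor."""
    if max_export_batch_size > max_queue_size:
        # The SDK rejects a batch size larger than the queue.
        raise ValueError("max_export_batch_size must not exceed max_queue_size")
    settings = {
        "OTEL_BSP_MAX_QUEUE_SIZE": str(max_queue_size),
        "OTEL_BSP_MAX_EXPORT_BATCH_SIZE": str(max_export_batch_size),
        "OTEL_BSP_SCHEDULE_DELAY": str(schedule_delay_ms),
    }
    os.environ.update(settings)
    return settings


settings = configure_batch_processor_env()
```

The same values can be passed directly to `BatchSpanProcessor` as the keyword arguments `max_queue_size`, `max_export_batch_size`, and `schedule_delay_millis` if you prefer configuration in code.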

The OTel collector is an optional middle layer. Use it when you want to:

  • Fan spans out to multiple backends in parallel (Future AGI plus an in-house warehouse, for example).
  • Apply attribute filters before export (strip PII before it leaves your VPC).
  • Absorb backend outages without losing trace data.

For a starting setup, you can skip the collector and ship straight from the application to the backend. Add it when the parallel-export or attribute-filter requirements emerge.

Retention policy

Retention rules per data class, with typical defaults for non-regulated industries. Regulated industries (healthcare, financial services, public sector) extend these.

| Data class | Hot retention | Warm retention | Archive |
| --- | --- | --- | --- |
| Call metadata | 90 days | 1 year | 7 years if required |
| Audio (raw) | 30 days | 90 days | 1 year for compliance calls |
| Transcripts (raw) | 30 days | 90 days | indefinite if redacted |
| Spans | 30 days | 90 days | aggregated indefinitely |
| Eval scores | 30 days | 90 days | aggregated indefinitely |
| Tags + attribution | indefinite | - | - |
| Error Feed clusters | 90 days | 1 year | aggregated indefinitely |

The principle: customer-identifying fields and audio expire fast, aggregated metrics persist. PII redaction policies run on a schedule that hits the warm retention boundary, so anything that survives past 90 days is already redacted.

Future AGI handles retention policy enforcement as a managed feature. For BYOC self-host, you wire the equivalent lifecycle rules on your object storage and span store.
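
For a self-hosted setup, the retention defaults above can be expressed as data plus a small lookup. This is a sketch of the idea only: the dictionary keys, the `retention_tier` helper, and the single "archive_or_expire" state are naming assumptions, and real enforcement lives in the lifecycle rules of your object storage and span store.

```python
# (hot_days, warm_days) per data class, taken from the non-regulated defaults.
RETENTION_DAYS = {
    "call_metadata": (90, 365),
    "audio_raw": (30, 90),
    "transcript_raw": (30, 90),
    "span": (30, 90),
    "eval_score": (30, 90),
    "error_feed_cluster": (90, 365),
}


def retention_tier(data_class: str, age_days: int) -> str:
    """Return which tier a record of the given age belongs to."""
    hot_days, warm_days = RETENTION_DAYS[data_class]
    if age_days <= hot_days:
        return "hot"
    if age_days <= warm_days:
        return "warm"
    # Past warm: archive, aggregate, or delete, depending on the class.
    return "archive_or_expire"


tiers = [retention_tier("audio_raw", age) for age in (10, 45, 120)]
```

Keeping the policy in one table makes it straightforward to review with compliance and to extend for regulated workloads.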

Compliance surface

Five practices cover the regulated-industry surface.

Tenant-isolated storage

Every span, every audio file, every transcript, every score carries a tenant id. Queries enforce row-level filtering on tenant. The trace store should refuse a query that doesn’t include a tenant filter, not return a global view. The Future AGI Agent Command Center handles this with RBAC scoped per tenant.
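
The refuse-by-default rule can be sketched over an in-memory list of span rows. The `query_spans` function and `MissingTenantFilter` exception are hypothetical names; a real trace store would enforce the same check in its query layer.

```python
class MissingTenantFilter(Exception):
    """Raised when a query does not scope itself to a tenant."""


def query_spans(spans: list, filters: dict) -> list:
    """Return spans matching every filter, refusing queries without a tenant."""
    if not filters.get("tenant_id"):
        raise MissingTenantFilter("queries must include a tenant_id filter")
    return [
        span for span in spans
        if all(span.get(key) == value for key, value in filters.items())
    ]


store = [
    {"tenant_id": "tenant_a", "span_id": "s1", "agent_version": "v3"},
    {"tenant_id": "tenant_b", "span_id": "s2", "agent_version": "v3"},
]
rows = query_spans(store, {"tenant_id": "tenant_a", "agent_version": "v3"})
```

Failing loudly on a missing tenant filter is safer than silently returning a global view, because the mistake shows up in development rather than in an audit.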

Retention policies per attribute class

Customer-identifying fields, audio, and full transcripts are short-lived. Hashed identifiers, aggregated metrics, and Error Feed clusters are long-lived. The policy applies at the attribute class level, not at the record level.

PII detection and redaction

The PII rubric flags PII fields in captured text. The DataPrivacyCompliance rubric audits the whole call session for privacy violations. The platform that runs them should be able to redact in-place after the hot retention window expires.

Encryption in transit and at rest

TLS to the OTLP collector. KMS-managed keys on object storage for audio. Encrypted span store at rest. Pre-signed URL access for dashboard audio playback (no public bucket access).

Audit log on the trace store itself

Every query against the trace store should produce an audit log entry: who queried, what filter, what time, what result count. The audit log lives in its own retention bucket separate from the span store.
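
A sketch of what one audit entry might look like as a JSON line. The field names and the `audit_entry` helper are assumptions for illustration; the four facts recorded are the ones listed above.

```python
import json
from datetime import datetime, timezone


def audit_entry(user, tenant_id, filters, result_count, now=None):
    """Serialize one trace-store query as a JSON audit line."""
    when = now or datetime.now(timezone.utc)
    record = {
        "who": user,
        "tenant_id": tenant_id,
        "filter": filters,
        "when": when.isoformat(),
        "result_count": result_count,
    }
    # Sorted keys keep lines stable for diffing and downstream parsing.
    return json.dumps(record, sort_keys=True)


line = audit_entry(
    "alice@example.com",
    "tenant_a",
    {"agent_version": "v3"},
    128,
    now=datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc),
)
```

Recording the filter and result count, not the results themselves, keeps the audit log free of the PII it is meant to protect.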

Future AGI is SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per futureagi.com/trust. The platform handles most of these practices by default. For federal procurement, deploy via BYOC self-host so the audit boundary stays inside your VPC.

Query patterns the analytics layer needs

Five query patterns matter most for voice agent analytics. A good platform handles all five at production scale.

Tag filter

“All calls for customer X in the last 7 days.” “All calls in the support vertical on agent_version v3.” “All calls in the outbound_sales campaign id 47.” This is the bread-and-butter query for ops and customer success.

Implementation: indexed tag attributes on the call log table, filterable from the dashboard, query API for programmatic access.

Eval-score filter

“All calls with conversation_resolution below 0.5.” “All calls with audio_transcription flagging mistranscription.” “All calls with is_polite below a threshold.”

Implementation: eval scores joined to call rows, filterable by score range, with the option to filter on AND combinations across multiple rubrics.
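
The AND-combination filter can be sketched over a dictionary of per-call scores. The data shape (call id mapped to rubric scores) and the `calls_below` helper are assumptions of this sketch, not a platform API; the rubric names are the ones used in this guide.

```python
def calls_below(scores_by_call: dict, thresholds: dict) -> list:
    """Return call ids where every listed rubric scored below its threshold."""
    return [
        call_id
        for call_id, scores in scores_by_call.items()
        # A call missing one of the rubrics does not match.
        if all(
            rubric in scores and scores[rubric] < limit
            for rubric, limit in thresholds.items()
        )
    ]


scores_by_call = {
    "call_1": {"conversation_resolution": 0.3, "is_polite": 0.9},
    "call_2": {"conversation_resolution": 0.4, "is_polite": 0.2},
    "call_3": {"conversation_resolution": 0.8, "is_polite": 0.1},
}
unresolved = calls_below(scores_by_call, {"conversation_resolution": 0.5})
unresolved_and_rude = calls_below(
    scores_by_call, {"conversation_resolution": 0.5, "is_polite": 0.5}
)
```

Treating a missing rubric as a non-match avoids flagging calls that were never scored on that dimension.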

Time-series aggregation

“Completion rate over the last 30 days, bucketed daily.” “Latency p50, p95, p99 over the last 7 days.” “Escalation rate by intent over the last quarter.”

Implementation: pre-aggregated rollups for fast time-series queries, on-demand drill-down to underlying calls.
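
To make the rollup idea concrete, here is a sketch of a daily completion-rate bucket and a nearest-rank percentile. Both functions are hypothetical helpers over in-memory data; a production system would compute the same rollups in its store.

```python
import math
from collections import defaultdict
from datetime import datetime


def daily_completion_rate(calls):
    """calls: iterable of (started_at, completed). Returns {day: rate}."""
    totals = defaultdict(lambda: [0, 0])  # day -> [total calls, completed calls]
    for started_at, completed in calls:
        day = started_at.date().isoformat()
        totals[day][0] += 1
        totals[day][1] += int(completed)
    return {day: done / total for day, (total, done) in sorted(totals.items())}


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


rates = daily_completion_rate([
    (datetime(2026, 1, 1, 9), True),
    (datetime(2026, 1, 1, 10), False),
    (datetime(2026, 1, 2, 9), True),
])
latencies_ms = list(range(1, 101))
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
```

Nearest-rank is one of several percentile definitions; whichever you pick, use the same one in dashboards and alerts so thresholds line up.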

Cluster view

“What named issues is Error Feed surfacing this week, and which are rising?” “Show me the supporting evidence for the ‘STT confidence drop on Indian English’ cluster.” “How many calls hit the ‘Tool argument schema mismatch in book_appointment’ issue?”

Implementation: Error Feed clusters as first-class objects, with named issue rows, status (active, acknowledged, resolved), trend direction (rising, falling, flat), and drill-down to underlying calls and spans.

Session replay

“Replay this specific call with audio, transcript, and span tree side by side.”

Implementation: a session view that pulls the assistant audio, customer audio, transcript, and span tree into one panel. The dashboard player handles playback in sync with transcript scrolling and span timeline highlighting.

All five queries are common enough that they should be one click from the platform’s homepage. If any one of them requires SQL or a custom export, the platform is incomplete.

Dashboards and alerts

Dashboards summarize metrics for ops and product. Alerts fire on SLO breaches for engineering.

Dashboard primitives

  • Call volume by hour, day, week.
  • Completion rate time-series, segmented by intent or vertical.
  • Latency percentiles (TTFT, end-to-end turn latency) time-series.
  • Eval score distributions for the attached rubrics (histogram of conversation_resolution, task_completion, is_polite, etc.).
  • Error Feed named-issue list with status, trend, count.
  • Top tags by call volume (which customers, verticals, agent versions are driving traffic).

These are the standing review primitives. Product looks at completion rate trends. Ops looks at volume and named-issue counts. Engineering looks at latency percentiles and rising issues.

Alert primitives

  • SLO breach: completion rate below threshold, latency above threshold, error rate above threshold.
  • Cluster spike: Error Feed sees a sudden rise in a named issue.
  • Eval-score anomaly: a rubric score distribution shifts significantly from baseline.
  • Audio failure spike: audio_transcription or audio_quality failure rate rises.
  • Compliance signal: PII or DataPrivacyCompliance detects an unexpected pattern.

Alerts wire into your on-call rotation. The Error Feed clustering layer matters here: instead of 50 individual alerts on related failures, you get one named issue with a quick fix recommendation.
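
The SLO breach primitive reduces to comparing current metrics against directional thresholds. The metric names, threshold values, and the `slo_breaches` helper below are illustrative assumptions.

```python
def slo_breaches(metrics: dict, slos: dict) -> list:
    """Return the names of metrics that violate their SLO.

    slos maps a metric name to (direction, threshold):
    "min" means the metric must stay at or above the threshold,
    "max" means it must stay at or below it.
    """
    breaches = []
    for name, (direction, threshold) in slos.items():
        value = metrics.get(name)
        if value is None:
            continue  # no data for this window; handle staleness separately
        if direction == "min" and value < threshold:
            breaches.append(name)
        elif direction == "max" and value > threshold:
            breaches.append(name)
    return breaches


slos = {
    "completion_rate": ("min", 0.85),
    "turn_latency_p95_ms": ("max", 1200),
    "error_rate": ("max", 0.02),
}
current = {"completion_rate": 0.81, "turn_latency_p95_ms": 950, "error_rate": 0.05}
breached = slo_breaches(current, slos)
```

Missing metrics are skipped here rather than treated as breaches; whether silence should page someone is a separate policy decision.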

Where Error Feed sits

Error Feed is the clustering and triage layer above the raw eval and span data. It reads the captured spans, identifies failure patterns across five categories (factual grounding failures, tool crashes, broken workflows, safety violations, reasoning gaps), groups related failures into named issues with auto-written root cause, supporting evidence from spans, quick fix, and long-term recommendation, and tracks whether each issue is rising or falling.

The output is your engineering backlog. Instead of triaging individual alerts, you triage clusters. Each cluster carries:

  • A name (auto-generated, like “STT confidence drop on Indian English”).
  • A count (how many calls hit this cluster).
  • A trend (rising, falling, flat).
  • Supporting evidence (sample spans, audio links, transcripts).
  • A quick fix (something to ship today).
  • A long-term recommendation (the structural change).
  • Status (active, acknowledged, in progress, resolved).

That’s the workflow that turns raw data into action. Without the clustering layer, you have data; with it, you have a backlog.
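
To illustrate what a trend label means, the sketch below compares cluster counts across two equal windows. This is not Error Feed's algorithm, which the source does not describe; the `trend` function and the 10% tolerance are assumptions chosen for illustration.

```python
def trend(previous_count: int, current_count: int, tolerance: float = 0.1) -> str:
    """Label a cluster rising, falling, or flat from two window counts."""
    if previous_count == 0:
        return "rising" if current_count > 0 else "flat"
    change = (current_count - previous_count) / previous_count
    if change > tolerance:
        return "rising"
    if change < -tolerance:
        return "falling"
    return "flat"


labels = [trend(40, 60), trend(40, 30), trend(40, 42), trend(0, 5)]
```

A tolerance band keeps small week-to-week noise from flipping a cluster between rising and falling.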

Error Feed is zero-config. The moment spans flow into an Observe project, clustering starts. It needs a few days of traffic to populate, but after that the named issues update continuously.

Closing the loop

The architecture doesn’t end at dashboards and alerts. Three feedback loops close the system:

Simulation. Failures clustered by Error Feed feed into simulation runs. The simulation product ships 18 pre-built personas plus unlimited custom (gender, age range, location, accent, communication style, conversation speed, background noise, and multilingual controls). Workflow Builder auto-generates branching scenarios (20, 50, or 100 rows) with branch visibility. Error Localization in Simulate (release 2025-11-25) pinpoints the exact failing turn when a scenario breaks. The same Agent Definition wired in observability plugs into Simulate.

Optimization. agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) and is available both as a UI workflow inside the Dataset surface and as a Python SDK. The loop: production traces flag a failure pattern, Error Feed clusters it, you correct the assistant’s response on a few examples, you pick an optimizer + a dataset + an evaluator and start a run through the UI or SDK, the new prompt ships after human approval, the cluster shrinks. Custom evaluators in the in-product agent calibrate from human review feedback so each iteration sharpens the rubric.

Inline guardrails. The Future AGI Protect model family runs sub-100ms inline per arXiv 2510.13351. Gemma 3n foundation with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance), multi-modal across text, image, and audio. ProtectFlash adds a single-call binary classifier path. The guardrail runs between the LLM response and the TTS leg on the critical voice path; its verdicts land on the FAGI span so the trust team can review denied responses in Error Feed.
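
The placement of a guardrail between the LLM response and the TTS leg can be sketched as a wrapper that records its verdict and latency. The `guard` callable here is a stand-in, not the Protect API; the attribute names `guard.allowed` and `guard.latency_ms` and the fallback text are assumptions.

```python
import time


def guarded_reply(llm_text, guard, fallback="Sorry, I can't help with that."):
    """Run a guard on the LLM text before it reaches TTS.

    Returns (text to speak, attributes describing the verdict).
    """
    start = time.perf_counter()
    allowed = bool(guard(llm_text))
    latency_ms = (time.perf_counter() - start) * 1000
    verdict = {"guard.allowed": allowed, "guard.latency_ms": latency_ms}
    return (llm_text if allowed else fallback), verdict


def toy_guard(text: str) -> bool:
    # Stand-in check for illustration; a real guard would call a model.
    return "ssn" not in text.lower()


spoken, verdict = guarded_reply("Your SSN on file is 123-45-6789", toy_guard)
```

Writing the verdict and latency onto the span for the turn lets reviewers see both what was blocked and what the check cost on the critical path.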

The architecture is unified around one platform: the same Agent Definition wires observability, simulation, optimization, and inline guardrails. One dashboard, one trust posture, one bill.

Common pitfalls

Logging audio to span attributes. Said in several places in this guide because it’s the most common production failure. Use URLs to object storage; don’t base64 the audio onto a span attribute.

Single-tier retention. Hot retention for everything is expensive and unnecessary. Hot for recent, warm for queryable history, archive for aggregated metrics. Customer-identifying fields and audio expire fast; aggregated metrics persist.

Skipping tag attribution at the source. If you can’t filter by customer, vertical, agent version, and intent, you have telemetry but not analytics. Apply tags at the Agent Definition level and propagate.

Treating PII redaction as a one-off. Redaction policies run on a schedule, not on demand. Hit the warm retention boundary, redact, retain the redacted version indefinitely.

Running raw alerts without clustering. A noisy alert stream burns out on-call. Error Feed clustering is what turns alerts into a manageable backlog.

Not designing for federal procurement. Federal teams deploy in their own VPC with a customer-owned audit boundary, so plan for BYOC self-host from the start.

Where Future AGI fits

Future AGI is one of several platforms that handle the voice agent data plane in 2026. The architectural reason most voice teams pick it is the unified surface: capture, storage, scoring, clustering, alerting, simulation, optimization, and inline guardrails on one platform.

  • Native voice observability for Vapi, Retell AI, and LiveKit with no SDK required. Add provider API key + Assistant ID, get call logs, separate assistant + customer audio, transcripts, and the full eval engine on every call. Enable Others mode supports any provider via mobile-number simulation; Indian phone number support shipped 2025-11-25.
  • traceAI for SDK-driven span capture. 30+ documented integrations across Python + TypeScript, OpenInference-compatible, Apache 2.0. Dedicated traceai-livekit and traceAI-pipecat pip packages for voice frameworks. Custom voices from ElevenLabs and Cartesia configurable in Run Prompt and Experiments.
  • ai-evaluation ships 70+ built-in eval templates in Apache 2.0. Voice rubrics include audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion. Multilingual include translation_accuracy, cultural_sensitivity. Tone include is_polite, is_helpful, is_concise. RAG include groundedness, chunk_attribution, context_relevance. In-house classifier models tuned for the LLM-as-judge cost/latency tradeoff on high-volume scoring.
  • Error Feed clusters trace failures into named issues with auto-written root cause, supporting evidence, quick fix, and long-term recommendation. Zero-config.
  • Error Localization in Simulate pinpoints the exact failing turn. Programmatic eval API for configure + re-run enables CI integration.
  • Future AGI Protect for inline guardrails. Sub-100ms per arXiv 2510.13351. ProtectFlash binary classifier for the lowest-latency surface.
  • Agent Command Center for hosted, multi-region, or BYOC self-host. RBAC, AWS Marketplace, 15+ providers in the router surface. SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per futureagi.com/trust.
  • agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard). UI workflow inside the Dataset surface and a Python SDK for programmatic control.

Two deliberate tradeoffs

Optimization is an explicit, gated run. The six-optimizer agent-opt surface (UI + SDK) never auto-rewrites prompts in production. Every optimization run is started by a human, gated by an evaluator, and surfaces candidate prompts for approval before they ship. That’s a deliberate process choice: production prompt changes go through human review.

Native voice observability ships for Vapi, Retell, and LiveKit out of the box. The dashboard path covers the three runtimes most teams are on with no SDK required. For any other runtime (Bland, ElevenLabs Agents, Pipecat, or a custom stack on Twilio/Plivo/Telnyx), Enable Others mode + traceAI SDK + webhook covers ingest. Between the native path and Enable Others, the voice stacks in active production use in 2026 are covered.

Frequently asked questions

What gets logged for a voice agent call?
Six classes of data. Call metadata (start time, duration, direction, customer id, agent version). Audio (separate assistant and customer streams stored in object storage with retention policies). Transcripts (turn-by-turn with speaker tags and timestamps). Spans (one per call leg: STT, LLM, tool, retrieval, TTS, with OpenInference attributes). Eval scores (one per rubric per call, joined to the relevant spans). Tags (customer_id, vertical, agent_version, campaign_id, intent for filter axes). The split matters because each class has different retention, access, and compliance requirements.
How do I handle GDPR and HIPAA for voice traces?
Five practices cover the surface. Tenant-isolated storage with row-level filtering on every query. Retention policies per attribute class (PII fields auto-redact in 30 days, audio in 90, aggregated metrics indefinitely). PII detection on transcripts using the PII rubric and DataPrivacyCompliance auditing. Encryption in transit (TLS to OTLP) and at rest (KMS-managed keys on object storage). Audit logs on the trace store itself (who queried what when, with what filter). Future AGI is SOC 2 Type II + HIPAA + GDPR + CCPA + ISO 27001 certified per the trust page; the platform handles most of these by default.
What query patterns should the analytics layer support?
Five matter most. Filter by tag (all calls for customer X, all calls in vertical Y, all calls on agent version Z). Filter by eval score (all calls with conversation_resolution below threshold). Time-series aggregations (completion rate over time, latency percentiles, escalation rate by intent). Cluster views (failures grouped by named issue from Error Feed). Session replay (single call with audio, transcript, span tree side by side). The platform that handles all five well becomes the daily-driver for engineering, ops, and product.
How big does my OTLP backend need to be?
Math on a typical setup. 10,000 calls per day, 5 turns per call, 6 spans per turn, 2KB per span gives 600MB/day raw. Add eval score writes (one per rubric per call) at maybe 50MB/day. Add audio metadata (URLs, not payloads) at negligible cost. That is about 650MB/day, so 30 days of hot retention holds roughly 20GB and a 90-day warm window roughly 60GB at this volume. Scale linearly with call volume and with turns per call. Audio itself sits in object storage and is the dominant cost; spans are cheap. Future AGI handles the storage as a managed surface; BYOC self-host runs against the cloud storage already in your VPC.
What latency does inline Protect add on the critical path?
Sub-100ms inline per arXiv 2510.13351. The Protect model family is built on Gemma 3n with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance), multi-modal across text, image, and audio. ProtectFlash adds a single-call binary classifier path when even rule-based scan time is too much. Either fits inside a sub-500ms per-turn voice budget. Pair inline Protect with async eval (run after the call for richer rubrics) and you get guardrails on the critical path without slowing the call.
How does Error Feed sit on top of the logging architecture?
Error Feed is the clustering and triage layer above raw traces and eval scores. It reads the captured spans, identifies failure patterns (factual grounding failures, tool crashes, broken workflows, safety violations, reasoning gaps), groups related failures into named issues with auto-written root cause plus quick fix plus long-term recommendation, and tracks whether each issue is rising or falling. It works zero-config the moment spans flow into a project. The output is your engineering backlog: instead of triaging individual alerts, you triage clusters.
When should I move from native dashboard observability to SDK-driven traceAI?
Stay on native for as long as it covers your debugging needs. Move to SDK when you need turn-level depth that the provider's API doesn't expose: tool call arguments, retrieval chunks, prompt variants, A/B test span attribution, or richer LLM-level metadata. The two paths compose: native captures call-level surface, SDK captures turn-level depth, both land in the same Observe project. Most teams run both. Future AGI's traceai-livekit and traceAI-pipecat packages are the SDK entry points for code-driven voice frameworks; for Vapi and Retell, the native path usually stays the primary surface.