How to Trace Voice Agents with traceAI in 2026: STT, LLM, TTS, and Tool Spans

Trace voice agents with traceAI in 2026: how STT/LLM/TTS/tool spans are captured, OTLP transport, the FAGI Observe backend, and traceAI code for LiveKit and Pipecat.

January 22, 2026

Updated May 21, 2026

15 min read

voice-ai 2026 observability opentelemetry traceai tracing

Voice agents broke the assumption that one OpenTelemetry span per HTTP call is enough. A real call has STT, an LLM, a few tool invocations, a TTS leg, and turn-taking logic between them. Each of those can fail independently, and the failure signature is in the span attributes you chose to capture. This guide shows how traceAI, Future AGI’s open-source tracing SDK, captures a voice pipeline as a span tree, what OTLP transport looks like in practice, and how the FAGI Observe backend turns those spans into eval scores and clustered failures.

TL;DR step preview

traceAI is built on the OpenTelemetry SDK. OTel is the wire format; traceAI adds the voice-specific span kinds and attributes on top.
Install the framework package: traceai-livekit for LiveKit Agents or traceAI-pipecat for Pipecat. Both come from Future AGI’s traceAI catalog.
Call register() once at service startup. Every turn then emits a span tree: STT, LLM, TOOL, and TTS spans, auto-captured.
Spans ship via OTLP gRPC or HTTP to FAGI Observe, where the eval engine scores them and Error Feed clusters the failures.
Attach the voice rubrics (audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion) and the loop closes.

The rest of the guide walks the span model, the attribute taxonomy, the transport, the backend, and the production gotchas.

What traceAI actually is

traceAI is Future AGI’s open-source observability SDK for LLM and voice applications. It is Apache 2.0 licensed and built directly on the OpenTelemetry SDK, so it inherits the standard wire format, trace context propagation, and OTLP export rather than reinventing them. What traceAI adds is the layer that a generic OTel install does not have: a span model that understands LLM calls, retriever calls, tool calls, and the audio legs that bracket a voice turn.

The catalog ships 30+ documented integrations across Python and TypeScript. Most of them instrument an LLM provider or framework (Anthropic, OpenAI, Groq, Mistral, Bedrock, Vertex, LangChain, LlamaIndex, and the long tail). Two of them are voice-specific and matter for this guide: traceai-livekit and traceAI-pipecat. Each is a standalone pip package that instruments one open-source voice runtime and emits the right spans without you writing instrumentation code.

For voice, traceAI works because a voice agent is a multi-stage LLM workload with audio legs at the edges. Once you register traceAI in your service, every turn auto-captures four span types:

An STT span for the transcription leg, with the provider, model, audio duration, and confidence.
An LLM span for each model call, with the full message payloads and token counts.
TOOL spans for function calls the agent makes, with arguments and returns.
A TTS span for the synthesis leg, with the provider, voice id, and rendered audio URL.

Those spans ship over OTLP to the FAGI Observe backend, where the eval engine runs the audio and conversation rubrics on them, Error Feed clusters the failures into named issues, and the dashboard renders the call as a trace tree with audio attached. Because traceAI is built on OpenTelemetry, the spans follow OTel semantic conventions and also render in any OTel-compatible backend; the eval scoring and clustering are FAGI Observe features.

The voice agent span tree

The mental model for a voice call trace that traceAI produces:

root span: voice_session (kind: AGENT)
  span: turn_1 (kind: CHAIN)
    span: stt_call (kind: TOOL, provider: deepgram)
    span: llm_call (kind: LLM, model: claude-sonnet-4-7)
      span: tool_call: lookup_account (kind: TOOL)
      span: retriever: kb_lookup (kind: RETRIEVER)
    span: tts_call (kind: TOOL, provider: cartesia)
  span: turn_2 (kind: CHAIN)
    ... same pattern
  ... more turns

Every span carries the conversation id, so the call ties back to a single session regardless of which service emitted the span. Turn spans wrap the per-turn legs so a single failed turn is easy to isolate. STT and TTS map to TOOL spans with a provider attribute, since that is the closest fit in traceAI’s span model for an audio leg that is not itself an LLM call. traceAI sets this mapping for you in the framework packages; if you emit spans by hand, stay consistent across your stack and the FAGI Observe backend renders them the same way.

Attributes per stage

The attribute set traceAI captures per span type, using the standard names where they exist and the FAGI-documented voice keys where the audio legs need them:

STT span (TOOL kind, voice leg)

provider: deepgram, assemblyai, whisper, etc.
model: model id
input.audio.url: pointer to the audio file in object storage, not the base64 payload
output.value: transcribed text
confidence: STT confidence if the provider exposes it
audio.duration_seconds
language: detected or specified language code

LLM span (LLM kind)

llm.model_name
llm.provider
llm.input_messages: serialized messages
llm.output_messages: serialized response
llm.token_count.prompt and llm.token_count.completion
llm.invocation_parameters: temperature, max tokens, tools list

Tool span (TOOL kind)

tool.name
tool.parameters: the JSON arguments
tool.return: the result payload (or pointer if it’s large)

Retriever span (RETRIEVER kind)

retrieval.documents: the retrieved chunks
retrieval.query
embedding.model_name: the embedder used for the query
retrieval.top_k

TTS span (TOOL kind, voice leg)

provider: cartesia, elevenlabs, openai, etc.
voice_id
output.audio.url: pointer to the rendered audio
audio.duration_seconds
input.value: the text that was synthesized
ssml: any SSML payload, if used

Cross-cutting attributes

On every span in a call, regardless of kind:

conversation_id: ties spans to a session
customer_id: tenant filter axis
agent_version: prompt or build version for A/B comparisons
channel: voice
turn_index: the ordering of the turn within the call
intent: top-level intent if you have it

These are the attributes that turn raw spans into filterable dashboards. The framework packages set the stage-specific attributes automatically; the cross-cutting ones you pass when you wrap the conversation root span. If you skip them, you get tracing but you don’t get analytics.

OTLP transport

OTLP is the OpenTelemetry Protocol, and traceAI exports over it because traceAI is built on the OTel SDK. Two variants matter for voice: gRPC for low overhead and HTTP for environments where gRPC is awkward. Pick gRPC by default; switch to HTTP only when a proxy or compliance layer forces it.

The TracerProvider lifecycle is the standard OpenTelemetry pattern. Initialize once at service startup, register a BatchSpanProcessor that buffers spans, point the OTLP exporter at the FAGI Observe ingest URL. traceAI’s register() helper handles all of this for you. If you are emitting spans by hand for a runtime traceAI does not package, the underlying pattern looks like:

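from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Create the provider once, at service startup.
provider = TracerProvider()
# gRPC OTLP exporter; replace the placeholder with your backend's ingest endpoint.
exporter = OTLPSpanExporter(endpoint="otlp.your-backend.example.com:4317")
# Buffer finished spans and export them in batches, off the request path.
provider.add_span_processor(BatchSpanProcessor(exporter))
# Make this provider the process-wide default.
trace.set_tracer_provider(provider)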

For voice workloads with high call volume, tune the BatchSpanProcessor’s max_queue_size and max_export_batch_size upward from defaults. The defaults assume HTTP-shaped workloads with low span counts per request. A 5-minute call with 30 turns and 6 spans per turn is 180 spans for one session, and those numbers add up fast across concurrent calls. When you use register(), traceAI applies voice-appropriate batching, so this tuning is only your concern on a hand-rolled provider.
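The arithmetic behind that advice can be sketched in a few lines. This is a rough back-of-envelope model, not a sizing tool. It assumes spans arrive at a steady rate and compares the spans produced during one export interval with the OpenTelemetry Python default queue size. The concurrency figures are illustrative, and the model ignores that the processor also exports early once a full batch has accumulated.

```python
import math

# Defaults of the OpenTelemetry Python BatchSpanProcessor.
DEFAULT_MAX_QUEUE_SIZE = 2048
DEFAULT_SCHEDULE_DELAY_S = 5.0


def spans_per_call(turns, spans_per_turn):
    """Total spans one call produces, ignoring the root and turn wrapper spans."""
    return turns * spans_per_turn


def spans_per_export_interval(concurrent_calls, call_seconds, turns, spans_per_turn,
                              schedule_delay_s=DEFAULT_SCHEDULE_DELAY_S):
    """Average spans one process emits between two scheduled exports."""
    rate_per_second = concurrent_calls * spans_per_call(turns, spans_per_turn) / call_seconds
    return math.ceil(rate_per_second * schedule_delay_s)


per_call = spans_per_call(turns=30, spans_per_turn=6)

# 200 concurrent 5-minute calls on one process: fits the default queue.
light = spans_per_export_interval(200, call_seconds=300, turns=30, spans_per_turn=6)

# 1000 concurrent calls on one process: exceeds the default queue between exports.
heavy = spans_per_export_interval(1000, call_seconds=300, turns=30, spans_per_turn=6)
```

When the estimate approaches the queue size, raise `max_queue_size` and `max_export_batch_size` on the `BatchSpanProcessor`, or shorten `schedule_delay_millis`, since spans that arrive while the queue is full are dropped.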

Sampling for voice

Tail-based sampling beats head-based for voice. The interesting calls are the long ones, the failed ones, and the ones with bad eval scores. A head-based 10% sample throws away most of the signal. If your backend supports tail sampling (FAGI Observe does), keep every failed call, every call above a latency threshold, and a random sample of the rest.

If your backend only supports head sampling, sample at 100% during the first few weeks of a new agent version. The cost is real, but the debugging value of a complete trace dataset early in a release outweighs the storage bill.
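The tail-sampling rule described above (keep every failed call, every slow call, and a random sample of the rest) can be sketched as a pure decision function. This is an illustration only: `CallSummary`, its fields, and the thresholds are hypothetical, and in practice the decision runs in the collector or backend after the whole trace has arrived.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CallSummary:
    """Hypothetical per-call summary, computed once the call has ended."""
    conversation_id: str
    failed: bool
    worst_turn_latency_s: float


def keep_trace(call, latency_threshold_s=2.0, baseline_rate=0.10):
    """Return True if the whole trace for this call should be retained."""
    if call.failed:
        return True  # always keep failures
    if call.worst_turn_latency_s >= latency_threshold_s:
        return True  # always keep slow calls
    # Deterministic sample of the rest: hash the id into [0, 1) so every
    # service reaches the same decision for the same call.
    digest = hashlib.sha256(call.conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_rate
```

Hashing the conversation id instead of calling a random number generator keeps the decision repeatable, which matters when spans for one call arrive from more than one service.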

Code: traceAI for LiveKit

LiveKit is the open-source voice orchestration runtime. traceai-livekit is the dedicated traceAI package that instruments it. The package traces voice events and audio interactions across the LiveKit agent loop and emits the right span kinds and attributes for the STT, LLM, tool, and TTS legs automatically.

import os
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_livekit import enable_http_attribute_mapping

os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

trace_provider = register(
    project_name="livekit_voice_agent",
    project_type=ProjectType.OBSERVE,
    set_global_tracer_provider=True,
)
enable_http_attribute_mapping()

Install with pip install traceai-livekit. After registration, every LiveKit Agent turn emits a span tree. STT calls (Deepgram, AssemblyAI, Whisper, or whichever you wire) show up as TOOL spans with provider and confidence attributes. LLM calls show up as LLM spans with full message payloads. Tool calls show up as TOOL spans with parameters and returns. TTS calls show up as TOOL spans with voice id and rendered audio URL.

enable_http_attribute_mapping() is the toggle that maps LiveKit’s transport-layer calls (the HTTP and gRPC calls it makes to STT, LLM, and TTS providers) onto traceAI’s voice span attributes. Without it, the spans land with provider names but without the LLM-specific input and output content. The call is a one-liner; keep it in the worker entrypoint.

The export goes to the OTLP endpoint register() is configured for, which is FAGI Observe by default. Because traceAI rides the standard OTel SDK, you can add a second exporter on the same TracerProvider if you want spans to fan out elsewhere; the spans go to every configured destination.

Code: traceAI for Pipecat

Pipecat is Daily’s open-source voice agent framework. traceAI-pipecat is the traceAI package that instruments it the same way.

import os
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_pipecat import enable_http_attribute_mapping

os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="pipecat_voice_app",
    set_global_tracer_provider=True,
)
enable_http_attribute_mapping()

Install with pip install traceAI-pipecat "pipecat-ai[tracing]" (quote the extra so shells such as zsh do not try to expand the brackets). Pipecat needs its own tracing extra installed so the framework emits the transport-layer spans that traceAI-pipecat then maps. enable_http_attribute_mapping() is the toggle that maps Pipecat’s internal HTTP, gRPC, and explicit transport attributes onto traceAI’s voice span attributes. Without it, the spans land with Pipecat-native field names and the FAGI Observe backend misses the LLM-specific rendering. With it, the spans look the same as any other traceAI workload.

Both LiveKit and Pipecat traceAI packages are Apache 2.0. Both ship as standalone pip packages. Both work alongside the broader traceAI catalog of 30+ documented integrations across Python and TypeScript, including the LLM providers behind your voice agent (Anthropic, OpenAI, Groq, Mistral, Bedrock, Vertex, and the long tail). A team running both LiveKit and Pipecat sees one consistent view, because both packages emit the same span model and can report to the same FAGI Observe project when given the same project_name.

Native voice observability without an SDK

For Vapi, Retell AI, and LiveKit, there is a path that skips SDK instrumentation entirely. Future AGI ships native dashboard-driven voice observability for those three providers. You add the provider API key plus the Assistant ID to a FAGI Agent Definition, enable observability, and every call streams in with:

Auto call log capture
Separate assistant and customer audio downloads
Auto transcripts
The full eval engine running on every call

The captured calls land in the same Observe project as your traceAI spans. The dashboard joins them under the same Agent Definition. You can run both paths in parallel: the native path captures the call-level surface, traceAI captures the turn-level spans on the LLM provider behind the voice agent. The same session view renders both.

This is the path most teams pick first because it needs no code at all. Add traceAI instrumentation later if and when you want richer LLM-level depth.

The FAGI Observe backend

traceAI emits the spans; FAGI Observe is the backend that receives them and runs the layer that makes voice traces actionable. Because traceAI rides OTLP, the spans also land in any OTel-compatible backend, but the eval scoring, error clustering, audio playback, and inline guardrails described below are FAGI Observe features.

What FAGI Observe does on top of the traceAI span stream:

Native voice observability for Vapi, Retell, and LiveKit with no SDK required. Add provider API key plus Assistant ID, get call logs, separate assistant and customer audio, transcripts, and the full eval engine on every call.
70+ built-in eval templates in ai-evaluation, Apache 2.0. Voice-specific rubrics include audio_transcription, audio_quality, conversation_coherence, conversation_resolution, and task_completion. Multilingual rubrics include translation_accuracy and cultural_sensitivity. Tone rubrics include is_polite, is_helpful, and is_concise. RAG rubrics include groundedness, chunk_attribution, and context_relevance. Scores attach onto the traceAI spans and the gen_ai.evaluation.* namespace carries the results.
Error Feed auto-clusters trace failures into named issues with auto-written root cause, supporting evidence from spans, a quick fix to ship today, and a long-term recommendation. Zero-config.
Error Localization in Simulate (release 2025-11-25) pinpoints the exact failing turn when a scenario breaks. A programmatic eval API for configure-and-rerun lets you wire the eval flow into your CI.
18 pre-built personas plus unlimited custom in the simulation product. Each persona controls gender, age range, location, accent, communication style, conversation speed, background noise, and a multilingual toggle. Workflow Builder auto-generates branching scenarios with branch visibility.
Future AGI Protect model family. A Gemma 3n foundation with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance) per arXiv 2510.13351, sub-100ms inline, multi-modal across text, image, and audio. ProtectFlash adds a single-call binary classifier path.
Agent Command Center for hosted, multi-region, or BYOC self-host. RBAC, AWS Marketplace, 15+ providers in the router surface. SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per futureagi.com/trust.
agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) as a UI workflow inside the Dataset surface and as a Python SDK for programmatic control.

A note on naming: gen_ai.voice.* is the namespace Future AGI documents for voice attributes in its product docs, and gen_ai.evaluation.* is the namespace it documents for eval results that attach to spans. The vanilla OpenTelemetry GenAI conventions do not yet specify voice-specific keys, so these are the FAGI-documented extensions that the Observe backend renders when it reads traceAI spans.

Attribute payload sizing

The most common production OTLP failure for voice is attribute payload bloat. Audio is the obvious offender. Don’t base64 it onto a span attribute. The OTLP collector won’t reject it, but query performance collapses, storage cost balloons, and the trace UI starts truncating.

The right pattern: write the audio to object storage (S3, GCS, Azure Blob, or your provider’s recording URL), put the URL on the span, and let the backend fetch on demand. traceAI follows this pattern, and Future AGI’s native voice observability does the upload automatically. For SDK-driven instrumentation, you write the upload step yourself and put the resulting URL on the span.
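The upload step for SDK-driven instrumentation might look like the sketch below. It is an illustration under assumptions: `store_audio` writes to a local directory as a stand-in for S3, GCS, or Azure Blob, and the path layout is invented for the example. Only the resulting URL goes on the span.

```python
import tempfile
from pathlib import Path


def store_audio(audio_bytes, conversation_id, turn_index, leg, root):
    """Write audio under root and return a file:// URL.

    Stand-in for an object-storage upload; swap the body for your S3/GCS client.
    """
    path = Path(root) / conversation_id / f"turn_{turn_index}_{leg}.wav"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio_bytes)
    return path.resolve().as_uri()


def stt_audio_attributes(audio_bytes, conversation_id, turn_index, root):
    """Build the STT span attributes: a pointer to the audio, never the audio itself."""
    url = store_audio(audio_bytes, conversation_id, turn_index, "stt", root)
    return {"input.audio.url": url}


# Demo with one second of silent 48 kB audio and a temporary directory.
demo_root = tempfile.mkdtemp()
demo_audio = b"\x00" * 48_000
attrs = stt_audio_attributes(demo_audio, "conv-123", 1, demo_root)
```

The attribute stays a short string no matter how long the recording is, which is the point of the pattern.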

LLM prompts and outputs are the second offender for high-volume agents. A 4000-token system prompt repeated on every turn span across 10,000 calls a day is a lot of redundant storage. Patterns that help:

Hash the system prompt and store the mapping separately. Put the hash on the span; full text lookups go to a side store.
Truncate user messages above a threshold for traces (full text stays in your application logs).
Use the OpenTelemetry SDK’s attribute value length limit so oversized values truncate at the source.
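The first two patterns can be sketched with the standard library alone. The names here are assumptions for illustration: `prompt_store` is an in-memory stand-in for a side store, `llm.system_prompt.sha256` is a hypothetical attribute key, and the 2000-character threshold is arbitrary.

```python
import hashlib

MAX_USER_MESSAGE_CHARS = 2000  # illustrative threshold

prompt_store = {}  # stand-in for a side store such as a database table


def prompt_reference(system_prompt):
    """Store the prompt once, keyed by its SHA-256 hash, and return the hash."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    prompt_store.setdefault(digest, system_prompt)
    return digest


def truncate_for_trace(text, limit=MAX_USER_MESSAGE_CHARS):
    """Cut long text for tracing and record how much was dropped."""
    if len(text) <= limit:
        return text
    return text[:limit] + f"...[truncated {len(text) - limit} chars]"


def llm_span_attributes(system_prompt, user_message):
    """Attributes for an LLM span: a prompt hash plus a bounded user message."""
    return {
        "llm.system_prompt.sha256": prompt_reference(system_prompt),  # hypothetical key
        "input.value": truncate_for_trace(user_message),
    }
```

A 4000-token system prompt then costs 64 characters per span, and the full text is stored once per distinct prompt version.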

Sampling, retention, and PII

Voice data is regulated almost everywhere it ships. Transcripts contain customer PII by default. Audio contains voice biometrics. Even span metadata (customer id, intent, retrieved knowledge chunks) often falls under data-protection rules.

The minimum you need:

Retention policy per span attribute class: customer-identifying fields auto-redact after 30 days; full audio after 90; aggregated metrics indefinitely.
Tenant-isolated storage: tag every span with a tenant id, enforce row-level filtering on the backend.
PII redaction on transcripts: name, account number, SSN, credit card. Future AGI’s PII eval flags these inline; DataPrivacyCompliance audits the whole call session for privacy violations.
Audit log on the trace store itself: who queried what, when, with what filter.
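As a first line of defense before a transcript reaches a span, pattern-based redaction can be sketched as below. This is a deliberately narrow illustration, not a replacement for the PII eval: the two regexes cover only US-style SSNs and 13 to 16 digit card numbers, and names or account numbers need entity recognition that a regex cannot provide.

```python
import re

# Order matters: the more specific SSN pattern runs before the card pattern.
_PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),
]


def redact(transcript):
    """Replace each matched span of digits with a bracketed label."""
    for label, pattern in _PATTERNS:
        transcript = pattern.sub(f"[{label}]", transcript)
    return transcript


redacted = redact("my card is 4111 1111 1111 1111 and my SSN is 123-45-6789")
```

Running redaction in the service, before the span is created, keeps the raw values out of the export path entirely.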

For regulated industries, the trust posture matters. Future AGI is SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page. Agent Command Center also ships a BYOC self-host option so the audit boundary stays inside your VPC.

A working reference architecture

End-to-end, a traceAI voice stack looks like this:

+-------------------+         +--------------------+         +------------------+
| Voice Orchestrator| ------> | Your LLM Service   | ------> | LLM Provider     |
| (LiveKit/Pipecat) |         | + traceAI instr    |         | (OpenAI, Claude) |
|  + traceai-livekit|         | + Protect inline   |         +------------------+
|  + audio upload   |         +---------+----------+
+---------+---------+                   |
          |                             | traceAI spans via OTLP
          | spans + audio URLs          v
          |              +-------------------------------+
          +------------> | OTel Collector (optional)     |
                         | - filters, batching, fanout   |
                         +---------------+---------------+
                                         |
                                         v
                         +-------------------------------+
                         | FAGI Observe backend          |
                         | - render trace tree           |
                         | - eval scoring                |
                         | - error clustering            |
                         | - inline guardrails           |
                         +-------------------------------+

The OTel collector is optional but useful for production. It lets you fan out spans to multiple destinations in parallel, apply attribute filters before export, and absorb backend outages without losing trace data.

Where Future AGI fits in this picture

traceAI is the SDK; FAGI Observe is the backend; the reason most voice teams adopt the pair is the layer that runs once the spans land. The wire format under traceAI is plain OpenTelemetry, so the spans stay portable. Two FAGI Observe capabilities matter beyond the eval and clustering layer already covered above.

Simulation. The product ships 18 pre-built personas plus unlimited custom. Each persona controls gender, age range, location, accent, communication style, conversation speed, background noise, and a multilingual toggle. Workflow Builder auto-generates branching scenarios with branch visibility, and Error Localization (release 2025-11-25) pinpoints the exact failing turn when a scenario breaks. A programmatic eval API for configure-and-rerun wires the eval flow into your CI.

Inline guardrails and optimization. The Future AGI Protect model family runs inline: a Gemma 3n foundation with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance) per arXiv 2510.13351, sub-100ms, multi-modal across text, image, and audio, with ProtectFlash as a single-call binary classifier path. agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) as a UI workflow inside the Dataset surface and as a Python SDK. The whole stack runs hosted, multi-region, or BYOC self-host through Agent Command Center, with RBAC and the cert set on futureagi.com/trust: SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001.

traceAI gets the voice pipeline traced; FAGI Observe is where the loop closes.

A deliberate tradeoff

Optimization is an explicit, gated run. The six-optimizer agent-opt surface (UI plus SDK) never auto-rewrites prompts in production. Every optimization run is initiated by a human, gated by an evaluator, and surfaces candidate prompts in the dashboard for approval before they ship. That is a deliberate process choice: production prompt changes go through human review.

Native voice observability ships for Vapi, Retell, and LiveKit out of the box. The dashboard path covers the three runtimes most teams are on with no SDK required. For any other runtime, the traceAI SDK plus the Enable Others mode covers the rest. Between the native path and traceAI, stacks built on Twilio, Plivo, Telnyx, Bland, ElevenLabs Agents, and Pipecat are also in scope. The boundary is native dashboard ingest versus SDK or webhook ingest, not supported versus unsupported.

Common production gotchas

Missing trace context propagation across services. If your voice orchestrator and your LLM service are separate processes, you need to inject and extract the OpenTelemetry context across the call boundary. Without it, the orchestrator’s call span and the LLM service’s spans land as separate traces in the backend. The fix is the standard OTel propagation pattern: inject in the client, extract in the server, and carry the W3C Trace Context traceparent header across the boundary. traceAI uses the standard OTel propagator, so this works the moment you wire the header.

Forgetting to flush spans on shutdown. Voice services that scale up and down miss spans on shutdown if the BatchSpanProcessor doesn’t flush its buffer before the process exits. Add trace_provider.shutdown() to your service’s graceful-shutdown hook; the provider that register() returns exposes it.

Putting full audio on attributes. Said above. Worth saying twice. Use URLs.

Sampling out the failures. Head-based sampling at 10% throws away 90% of your debugging data. Use tail-based sampling, or sample at 100% during early release cycles.

Skipping conversation_id on every span. If the LLM-service spans don’t carry the voice call’s conversation id, you can’t join them to the call session in the backend. Set it on every span, not just the root. With the framework packages, pass it on the conversation root span and the child spans inherit it.

Forgetting the attribute mapping toggle. Both traceai-livekit and traceAI-pipecat need enable_http_attribute_mapping() called in the entrypoint. Without it, the provider HTTP calls land as spans with names but no input or output content.

Sources and references

traceAI on GitHub: github.com/future-agi/traceAI
ai-evaluation on GitHub: github.com/future-agi/ai-evaluation
Error Feed docs: docs.futureagi.com/docs/observe
Future AGI Protect docs: docs.futureagi.com/docs/protect
Agent Command Center docs: docs.futureagi.com/docs/command-center
OpenTelemetry: opentelemetry.io
OTLP spec: github.com/open-telemetry/opentelemetry-proto
arXiv 2510.13351 (Protect): arxiv.org/abs/2510.13351
arXiv 2507.19457 (GEPA): arxiv.org/abs/2507.19457
arXiv 2505.09666 (Meta-Prompt): arxiv.org/abs/2505.09666
arXiv 2311.09569 (Random Search baseline): arxiv.org/abs/2311.09569
Trust page: futureagi.com/trust

Frequently asked questions

What is traceAI and how does it trace a voice agent?

traceAI is Future AGI's open-source observability SDK for LLM and voice applications, Apache 2.0 licensed, built on the OpenTelemetry SDK. It ships 30+ documented integrations across Python and TypeScript, including dedicated traceai-livekit and traceAI-pipecat packages for voice. Once registered, traceAI auto-captures the four span types a voice call produces: an STT span for the transcription leg, an LLM span for each model call, TOOL spans for function calls, and a TTS span for synthesis. Each span carries provider, model, latency, and stage-specific attributes. The spans ship over OTLP to the FAGI Observe backend, where the eval engine, Error Feed, and audio rubrics run on them. You wire it once at service startup and every turn produces a span tree without further instrumentation code.

Does traceAI work with my existing OpenTelemetry setup?

Yes. traceAI is layered on the standard OpenTelemetry SDK and exporter, so it runs inside the same TracerProvider that already handles your HTTP, database, and application spans. The voice spans land at the same OTLP collector, take the same sampling decisions, and join under the same trace ID. Nothing about traceAI replaces OTel; it adds the voice-specific span kinds and attributes on top. Engineers who already operate OpenTelemetry for application performance get voice agent traces in the dashboards they already use. traceAI spans follow OpenTelemetry semantic conventions, so they also render in any OTel-compatible backend, though the eval scoring, Error Feed clustering, and audio playback are FAGI Observe features.

Can I use traceAI if my pipeline isn't an LLM SDK call?

Yes. A voice agent is more than one LLM call, and traceAI is built for that. The traceai-livekit and traceAI-pipecat packages instrument the whole agent loop: the STT call to Deepgram or AssemblyAI, the LLM call, the tool invocations, and the TTS call to Cartesia or ElevenLabs. STT and TTS are not LLM SDK calls, and traceAI still captures them as first-class spans with provider, audio duration, and confidence attributes. If you run a framework traceAI does not yet package, you can emit the spans manually with the OpenTelemetry SDK and use traceAI's attribute names so the FAGI Observe backend renders them. The voice rubrics score the STT and TTS spans directly regardless of how they were emitted.

Do I need to instrument STT and TTS separately?

Yes if you want per-stage debugging, which you almost always do. STT failures (mistranscription, accent drift, background noise sensitivity) and TTS failures (mispronunciation, voice regressions, prosody flatness) have distinct root causes and distinct fixes. A single root span per call buries them. traceAI emits a child span for the STT call with provider, model, confidence, and audio duration as attributes, and a child span for the TTS call with provider, voice id, duration, and any SSML payload. Future AGI's audio_transcription and audio_quality rubrics score those spans directly. With the traceai-livekit and traceAI-pipecat packages this separation is automatic; the spans arrive split out of the box.

How do I avoid logging the full audio payload on every span?

Don't put base64 audio on span attributes. Even when the collector accepts the payload, oversized attributes slow queries, get truncated in the trace UI, and inflate trace storage cost. Instead, store the audio in object storage (S3, GCS, or your provider's recording URL), put the URL on the span as an attribute (input.audio.url or output.audio.url), and let the backend fetch it on demand. traceAI follows this pattern, and Future AGI's native voice observability for Vapi, Retell, and LiveKit handles it automatically: separate assistant and customer audio files attach to the call session, not to the span attribute table. For SDK-driven instrumentation you write the upload step, then put the resulting URL on the span.

What span attributes matter most for voice debugging?

Per stage, you want at minimum: span kind (LLM/RETRIEVER/TOOL), provider name, model id, latency, and token counts or audio duration. For voice specifically: conversation_id on every span so a turn ties back to the call, channel set to voice, customer_id and agent_version for filter axes, and a turn_index so spans within a call order correctly. STT spans add confidence and language; TTS spans add voice_id and any SSML payload. traceAI sets the stage-specific attributes for you; the cross-cutting ones (conversation_id, customer_id, agent_version) you pass when you wrap the conversation root span. These are the attributes Future AGI's Error Feed clusters on when it auto-names failure modes.

What does traceai-livekit actually capture, and how is it different from traceAI-pipecat?

Both packages instrument an open-source voice runtime and emit the same span model to FAGI Observe. traceai-livekit targets LiveKit Agents: it traces voice events and audio interactions across the agent loop and maps the provider HTTP calls into voice spans. traceAI-pipecat targets Pipecat, Daily's voice framework, and does the same for that runtime. The registration code is nearly identical; both call register() from fi_instrumentation and then enable_http_attribute_mapping() to map the framework's transport-layer calls onto the right voice attributes. Pipecat additionally needs pipecat-ai installed with the tracing extra. Spans from either package land in the same Observe project and render in the same dashboard, so a team running both runtimes sees one consistent view.

Which Future AGI rubrics run on traceAI voice spans?

The ai-evaluation SDK ships 70+ built-in eval templates, Apache 2.0 licensed. Five carry most of a voice workload: audio_transcription scores ASR drift on the customer audio, audio_quality scores TTS clarity and prosody on the assistant audio, conversation_coherence scores multi-turn flow, conversation_resolution scores whether the call met the customer's goal, and task_completion scores whether the agent finished its tool calls and workflow. For multilingual agents, translation_accuracy and cultural_sensitivity add language-pair coverage. The rubrics attach scores directly onto the traceAI spans in FAGI Observe, and the gen_ai.evaluation.* namespace carries the results. Error Feed then clusters low-scoring spans into named failure modes with root cause and a quick fix.
