How to Trace Voice Agents with traceAI in 2026: STT, LLM, TTS, and Tool Spans
Trace voice agents with traceAI in 2026: how STT/LLM/TTS/tool spans are captured, OTLP transport, the FAGI Observe backend, and traceAI code for LiveKit and Pipecat.
Voice agents broke the assumption that one OpenTelemetry span per HTTP call is enough. A real call has STT, an LLM, a few tool invocations, a TTS leg, and turn-taking logic between them. Each of those can fail independently, and the failure signature is in the span attributes you chose to capture. This guide shows how traceAI, Future AGI’s open-source tracing SDK, captures a voice pipeline as a span tree, what OTLP transport looks like in practice, and how the FAGI Observe backend turns those spans into eval scores and clustered failures.
TL;DR step preview
- traceAI is built on the OpenTelemetry SDK. OTel is the wire format; traceAI adds the voice-specific span kinds and attributes on top.
- Install the framework package: traceai-livekit for LiveKit Agents or traceAI-pipecat for Pipecat. Both come from Future AGI’s traceAI catalog.
- Call register() once at service startup. Every turn then emits a span tree: STT, LLM, TOOL, and TTS spans, auto-captured.
- Spans ship via OTLP gRPC or HTTP to FAGI Observe, where the eval engine scores them and Error Feed clusters the failures.
- Attach the voice rubrics (audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion) and the loop closes.
The rest of the guide walks the span model, the attribute taxonomy, the transport, the backend, and the production gotchas.
What traceAI actually is
traceAI is Future AGI’s open-source observability SDK for LLM and voice applications. It is Apache 2.0 licensed and built directly on the OpenTelemetry SDK, so it inherits the standard wire format, trace context propagation, and OTLP export rather than reinventing them. What traceAI adds is the layer that a generic OTel install does not have: a span model that understands LLM calls, retriever calls, tool calls, and the audio legs that bracket a voice turn.
The catalog ships 30+ documented integrations across Python and TypeScript. Most of them instrument an LLM provider or framework (Anthropic, OpenAI, Groq, Mistral, Bedrock, Vertex, LangChain, LlamaIndex, and the long tail). Two of them are voice-specific and matter for this guide: traceai-livekit and traceAI-pipecat. Each is a standalone pip package that instruments one open-source voice runtime and emits the right spans without you writing instrumentation code.
For voice, traceAI works because a voice agent is a multi-stage LLM workload with audio legs at the edges. Once you register traceAI in your service, every turn auto-captures four span types:
- An STT span for the transcription leg, with the provider, model, audio duration, and confidence.
- An LLM span for each model call, with the full message payloads and token counts.
- TOOL spans for function calls the agent makes, with arguments and returns.
- A TTS span for the synthesis leg, with the provider, voice id, and rendered audio URL.
Those spans ship over OTLP to the FAGI Observe backend, where the eval engine runs the audio and conversation rubrics on them, Error Feed clusters the failures into named issues, and the dashboard renders the call as a trace tree with audio attached. Because traceAI is built on OpenTelemetry, the spans follow OTel semantic conventions and also render in any OTel-compatible backend; the eval scoring and clustering are FAGI Observe features.
The voice agent span tree
The mental model for a voice call trace that traceAI produces:
root span: voice_session (kind: AGENT)
  span: turn_1 (kind: CHAIN)
    span: stt_call (kind: TOOL, provider: deepgram)
    span: llm_call (kind: LLM, model: claude-sonnet-4-7)
    span: tool_call: lookup_account (kind: TOOL)
    span: retriever: kb_lookup (kind: RETRIEVER)
    span: tts_call (kind: TOOL, provider: cartesia)
  span: turn_2 (kind: CHAIN)
    ... same pattern
  ... more turns
Every span carries the conversation id, so the call ties back to a single session regardless of which service emitted the span. Turn spans wrap the per-turn legs so a single failed turn is easy to isolate. STT and TTS map to TOOL spans with a provider attribute, since that is the closest fit in traceAI’s span model for an audio leg that is not itself an LLM call. traceAI sets this mapping for you in the framework packages; if you emit spans by hand, stay consistent across your stack and the FAGI Observe backend renders them the same way.
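To make the shape concrete, here is a minimal sketch of that tree as plain Python data. SpanNode and build_call are illustrative names, not traceAI or OpenTelemetry types, and treating the STT, LLM, tool, retriever, and TTS legs as direct children of the turn span is an assumption about the nesting.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class SpanNode:
    """Illustrative stand-in for a span; not a traceAI or OpenTelemetry type."""
    name: str
    kind: str
    conversation_id: str
    children: List["SpanNode"] = field(default_factory=list)

    def child(self, name: str, kind: str) -> "SpanNode":
        # Children copy the conversation id so every span ties back to the call.
        node = SpanNode(name, kind, self.conversation_id)
        self.children.append(node)
        return node

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "SpanNode"]]:
        # Depth-first traversal, yielding each span with its nesting depth.
        yield depth, self
        for node in self.children:
            yield from node.walk(depth + 1)


def build_call(conversation_id: str, turns: int) -> SpanNode:
    root = SpanNode("voice_session", "AGENT", conversation_id)
    for i in range(1, turns + 1):
        turn = root.child(f"turn_{i}", "CHAIN")
        turn.child("stt_call", "TOOL")
        turn.child("llm_call", "LLM")
        turn.child("tool_call: lookup_account", "TOOL")
        turn.child("retriever: kb_lookup", "RETRIEVER")
        turn.child("tts_call", "TOOL")
    return root


call = build_call("conv-123", turns=2)
for depth, span in call.walk():
    print("  " * depth + f"span: {span.name} (kind: {span.kind})")
```

Filtering the walk by turn name is the data-structure equivalent of isolating one failed turn in the trace view.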
Attributes per stage
The attribute set traceAI captures per span type, using the standard names where they exist and the FAGI-documented voice keys where the audio legs need them:
STT span (TOOL kind, voice leg)
- provider: deepgram, assemblyai, whisper, etc.
- model: model id
- input.audio.url: pointer to the audio file in object storage, not the base64 payload
- output.value: transcribed text
- confidence: STT confidence if the provider exposes it
- audio.duration_seconds
- language: detected or specified language code
LLM span (LLM kind)
- llm.model_name
- llm.provider
- llm.input_messages: serialized messages
- llm.output_messages: serialized response
- llm.token_count.prompt and llm.token_count.completion
- llm.invocation_parameters: temperature, max tokens, tools list
Tool span (TOOL kind)
- tool.name
- tool.parameters: the JSON arguments
- tool.return: the result payload (or pointer if it’s large)
Retriever span (RETRIEVER kind)
- retrieval.documents: the retrieved chunks
- retrieval.query
- embedding.model_name: the embedder used for the query
- retrieval.top_k
TTS span (TOOL kind, voice leg)
- provider: cartesia, elevenlabs, openai, etc.
- voice_id
- output.audio.url: pointer to the rendered audio
- audio.duration_seconds
- input.value: the text that was synthesized
- ssml: any SSML payload, if used
Cross-cutting attributes
On every span in a call, regardless of kind:
- conversation_id: ties spans to a session
- customer_id: tenant filter axis
- agent_version: prompt or build version for A/B comparisons
- channel: voice
- turn_index: the ordering of the turn within the call
- intent: top-level intent if you have it
These are the attributes that turn raw spans into filterable dashboards. The framework packages set the stage-specific attributes automatically; the cross-cutting ones you pass when you wrap the conversation root span. If you skip them, you get tracing but you don’t get analytics.
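One way to keep the cross-cutting keys consistent is to build them in a single helper and merge them into each stage payload. The sketch below uses the attribute names listed above; the helper names and the sample values are hypothetical, and nothing here is part of the traceAI API.

```python
from typing import Dict, Optional, Union

AttrValue = Union[str, int, float, bool]


def cross_cutting_attributes(
    conversation_id: str,
    customer_id: str,
    agent_version: str,
    turn_index: int,
    intent: Optional[str] = None,
) -> Dict[str, AttrValue]:
    """Build the attributes that belong on every span of a voice call."""
    attrs: Dict[str, AttrValue] = {
        "conversation_id": conversation_id,
        "customer_id": customer_id,
        "agent_version": agent_version,
        "channel": "voice",
        "turn_index": turn_index,
    }
    # intent is optional: only set it when the agent actually classified one.
    if intent is not None:
        attrs["intent"] = intent
    return attrs


def span_attributes(
    stage: Dict[str, AttrValue], common: Dict[str, AttrValue]
) -> Dict[str, AttrValue]:
    # Cross-cutting keys are applied last so a stage payload cannot overwrite them.
    return {**stage, **common}


common = cross_cutting_attributes("conv-123", "tenant-42", "prompt-v7", turn_index=1)
stt = span_attributes(
    {"provider": "deepgram", "output.value": "I need my balance", "confidence": 0.93},
    common,
)
print(stt)
```

With a hand-rolled OpenTelemetry span, a dict like this can be passed to the span's set_attributes() method.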
OTLP transport
OTLP is the OpenTelemetry Protocol, and traceAI exports over it because traceAI is built on the OTel SDK. Two variants matter for voice: gRPC for low overhead and HTTP for environments where gRPC is awkward. Pick gRPC by default; switch to HTTP only when a proxy or compliance layer forces it.
The TracerProvider lifecycle is the standard OpenTelemetry pattern: initialize once at service startup, register a BatchSpanProcessor that buffers spans, and point the OTLP exporter at the FAGI Observe ingest URL. traceAI’s register() helper handles all of this for you. If you are emitting spans by hand for a runtime traceAI does not package, the underlying pattern looks like:
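from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Create the provider once at service startup.
provider = TracerProvider()
# Point the OTLP gRPC exporter at your backend's ingest endpoint (placeholder host).
exporter = OTLPSpanExporter(endpoint="otlp.your-backend.example.com:4317")
# Buffer spans and export them in batches instead of one call per span.
provider.add_span_processor(BatchSpanProcessor(exporter))
# Make this provider the process-wide default.
trace.set_tracer_provider(provider)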
For voice workloads with high call volume, tune the BatchSpanProcessor’s max_queue_size and max_export_batch_size upward from defaults. The defaults assume HTTP-shaped workloads with low span counts per request. A 5-minute call with 30 turns and 6 spans per turn is 180 spans for one session, and those numbers add up fast across concurrent calls. When you use register(), traceAI applies voice-appropriate batching, so this tuning is only your concern on a hand-rolled provider.
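The arithmetic above can be turned into a rough sizing check. The three constants are the OpenTelemetry Python SDK defaults for BatchSpanProcessor; the 1,000 concurrent calls figure is an illustrative assumption, and the result is a rough upper bound because the processor also exports early once a full batch is queued.

```python
import math

# OpenTelemetry Python SDK defaults for BatchSpanProcessor.
DEFAULT_MAX_QUEUE_SIZE = 2048
DEFAULT_MAX_EXPORT_BATCH_SIZE = 512
DEFAULT_SCHEDULE_DELAY_S = 5.0


def spans_per_call(turns: int, spans_per_turn: int) -> int:
    return turns * spans_per_turn


def spans_per_second(
    concurrent_calls: int, turns: int, spans_per_turn: int, call_seconds: float
) -> float:
    # Average emission rate if spans are spread evenly across the call.
    return concurrent_calls * spans_per_call(turns, spans_per_turn) / call_seconds


def spans_per_flush_interval(rate: float, delay_s: float = DEFAULT_SCHEDULE_DELAY_S) -> int:
    # Spans that can accumulate between two scheduled exports.
    return math.ceil(rate * delay_s)


rate = spans_per_second(concurrent_calls=1000, turns=30, spans_per_turn=6, call_seconds=300)
pending = spans_per_flush_interval(rate)
print(f"{rate:.0f} spans/s, up to {pending} spans per flush interval")
print("exceeds default queue:", pending > DEFAULT_MAX_QUEUE_SIZE)
```

Under these assumptions one process handling 1,000 concurrent calls can queue about 3,000 spans between scheduled exports, which is more than the default queue of 2,048 holds.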
Sampling for voice
Tail-based sampling beats head-based for voice. The interesting calls are the long ones, the failed ones, and the ones with bad eval scores. A head-based 10% sample throws away most of the signal. If your backend supports tail sampling (FAGI Observe does), keep every failed call, every call above a latency threshold, and a random sample of the rest.
If your backend only supports head sampling, sample at 100% during the first few weeks of a new agent version. The cost is real, but the debugging value of a complete trace dataset early in a release outweighs the storage bill.
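The keep-or-drop rule for tail sampling can be sketched as a pure function that runs once a call has finished. CallSummary, the thresholds, and the 10% baseline are hypothetical values for illustration; this is not the FAGI Observe sampling implementation.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CallSummary:
    """Hypothetical per-call summary, available only after the call ends."""
    conversation_id: str
    failed: bool
    duration_seconds: float
    min_eval_score: float  # lowest eval score on the call, 0.0 to 1.0


def keep_trace(
    call: CallSummary,
    latency_threshold_s: float = 120.0,
    score_floor: float = 0.5,
    baseline_rate: float = 0.10,
) -> bool:
    # Always keep the interesting calls: failures, long calls, bad eval scores.
    if call.failed:
        return True
    if call.duration_seconds > latency_threshold_s:
        return True
    if call.min_eval_score < score_floor:
        return True
    # For the rest, hash the conversation id into [0, 1) so the decision is
    # deterministic and every service keeps or drops the same call.
    digest = hashlib.sha256(call.conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_rate


print(keep_trace(CallSummary("conv-1", failed=True, duration_seconds=20, min_eval_score=0.9)))
print(keep_trace(CallSummary("conv-2", failed=False, duration_seconds=400, min_eval_score=0.9)))
```

Hashing the conversation id rather than calling a random number generator keeps the decision consistent when more than one service evaluates the same call.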
Code: traceAI for LiveKit
LiveKit is the open-source voice orchestration runtime. traceai-livekit is the dedicated traceAI package that instruments it. The package traces voice events and audio interactions across the LiveKit agent loop and emits the right span kinds and attributes for the STT, LLM, tool, and TTS legs automatically.
import os

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_livekit import enable_http_attribute_mapping

# Placeholder credentials; load real keys from your secret store.
os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

# Register once at worker startup.
trace_provider = register(
    project_name="livekit_voice_agent",
    project_type=ProjectType.OBSERVE,
    set_global_tracer_provider=True,
)

# Map LiveKit's transport-layer calls onto traceAI's voice span attributes.
enable_http_attribute_mapping()
Install with pip install traceai-livekit. After registration, every LiveKit Agent turn emits a span tree. STT calls (Deepgram, AssemblyAI, Whisper, or whichever you wire) show up as TOOL spans with provider and confidence attributes. LLM calls show up as LLM spans with full message payloads. Tool calls show up as TOOL spans with parameters and returns. TTS calls show up as TOOL spans with voice id and rendered audio URL.
enable_http_attribute_mapping() is the toggle that maps LiveKit’s transport-layer calls (the HTTP and gRPC calls it makes to STT, LLM, and TTS providers) onto traceAI’s voice span attributes. Without it, the spans land with provider names but without the LLM-specific input and output content. The call is a one-liner; keep it in the worker entrypoint.
The export goes to the OTLP endpoint register() is configured for, which is FAGI Observe by default. Because traceAI rides the standard OTel SDK, you can add a second exporter on the same TracerProvider if you want spans to fan out elsewhere; the spans go to every configured destination.
Code: traceAI for Pipecat
Pipecat is Daily’s open-source voice agent framework. traceAI-pipecat is the traceAI package that instruments it the same way.
import os

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_pipecat import enable_http_attribute_mapping

# Placeholder credentials; load real keys from your secret store.
os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

# Register once at service startup.
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="pipecat_voice_app",
    set_global_tracer_provider=True,
)

# Map Pipecat's transport attributes onto traceAI's voice span attributes.
enable_http_attribute_mapping()
Install with pip install traceAI-pipecat "pipecat-ai[tracing]" (quote the extra so shells such as zsh do not treat the brackets as a glob). Pipecat needs its own tracing extra installed so the framework emits the transport-layer spans that traceAI-pipecat then maps. enable_http_attribute_mapping() is the toggle that maps Pipecat’s internal HTTP, gRPC, and explicit transport attributes onto traceAI’s voice span attributes. Without it, the spans land with Pipecat-native field names and the FAGI Observe backend misses the LLM-specific rendering. With it, the spans look the same as any other traceAI workload.
Both LiveKit and Pipecat traceAI packages are Apache 2.0. Both ship as standalone pip packages. Both work alongside the broader traceAI catalog of 30+ documented integrations across Python and TypeScript, including the LLM providers behind your voice agent (Anthropic, OpenAI, Groq, Mistral, Bedrock, Vertex, and the long tail). A team running both LiveKit and Pipecat sees one consistent span model across both, because both packages emit to the same FAGI Observe project.
Native voice observability without an SDK
For Vapi, Retell AI, and LiveKit, there is a path that skips SDK instrumentation entirely. Future AGI ships native dashboard-driven voice observability for those three providers. You add the provider API key plus the Assistant ID to a FAGI Agent Definition, enable observability, and every call streams in with:
- Auto call log capture
- Separate assistant and customer audio downloads
- Auto transcripts
- The full eval engine running on every call
The captured calls land in the same Observe project as your traceAI spans. The dashboard joins them under the same Agent Definition. You can run both paths in parallel: the native path captures the call-level surface, traceAI captures the turn-level spans on the LLM provider behind the voice agent. The same session view renders both.
This is the path most teams pick first because it needs no code at all. Add traceAI instrumentation later if and when you want richer LLM-level depth.
The FAGI Observe backend
traceAI emits the spans; FAGI Observe is the backend that receives them and runs the layer that makes voice traces actionable. Because traceAI rides OTLP, the spans also land in any OTel-compatible backend, but the eval scoring, error clustering, audio playback, and inline guardrails described below are FAGI Observe features.
What FAGI Observe does on top of the traceAI span stream:
- Native voice observability for Vapi, Retell, and LiveKit with no SDK required. Add provider API key plus Assistant ID, get call logs, separate assistant and customer audio, transcripts, and the full eval engine on every call.
- 70+ built-in eval templates in ai-evaluation, Apache 2.0. Voice-specific rubrics include audio_transcription, audio_quality, conversation_coherence, conversation_resolution, and task_completion. Multilingual rubrics include translation_accuracy and cultural_sensitivity. Tone rubrics include is_polite, is_helpful, and is_concise. RAG rubrics include groundedness, chunk_attribution, and context_relevance. Scores attach onto the traceAI spans and the gen_ai.evaluation.* namespace carries the results.
- Error Feed auto-clusters trace failures into named issues with auto-written root cause, supporting evidence from spans, a quick fix to ship today, and a long-term recommendation. Zero-config.
- Error Localization in Simulate (release 2025-11-25) pinpoints the exact failing turn when a scenario breaks. A programmatic eval API for configure-and-rerun lets you wire the eval flow into your CI.
- 18 pre-built personas plus unlimited custom in the simulation product. Each persona controls gender, age range, location, accent, communication style, conversation speed, background noise, and a multilingual toggle. Workflow Builder auto-generates branching scenarios with branch visibility.
- Future AGI Protect model family. A Gemma 3n foundation with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance) per arXiv 2510.13351, sub-100ms inline, multi-modal across text, image, and audio. ProtectFlash adds a single-call binary classifier path.
- Agent Command Center for hosted, multi-region, or BYOC self-host. RBAC, AWS Marketplace, 15+ providers in the router surface. SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per futureagi.com/trust.
- agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) as a UI workflow inside the Dataset surface and as a Python SDK for programmatic control.
A note on naming: gen_ai.voice.* is the namespace Future AGI documents for voice attributes in its product docs, and gen_ai.evaluation.* is the namespace it documents for eval results that attach to spans. The vanilla OpenTelemetry GenAI conventions do not yet specify voice-specific keys, so these are the FAGI-documented extensions that the Observe backend renders when it reads traceAI spans.
Attribute payload sizing
The most common production OTLP failure for voice is attribute payload bloat. Audio is the obvious offender. Don’t base64 it onto a span attribute. The OTLP collector won’t reject it, but query performance collapses, storage cost balloons, and the trace UI starts truncating.
The right pattern: write the audio to object storage (S3, GCS, Azure Blob, or your provider’s recording URL), put the URL on the span, and let the backend fetch on demand. traceAI follows this pattern, and Future AGI’s native voice observability does the upload automatically. For SDK-driven instrumentation, you write the upload step yourself and put the resulting URL on the span.
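A quick size comparison shows why the URL wins. The sketch assumes 16 kHz, 16-bit mono PCM for a 5-minute call, and the storage URL is a made-up example.

```python
import base64


def pcm_bytes(seconds: int, sample_rate_hz: int = 16_000,
              bytes_per_sample: int = 2, channels: int = 1) -> int:
    # Raw size of uncompressed PCM audio.
    return seconds * sample_rate_hz * bytes_per_sample * channels


def base64_length(n_bytes: int) -> int:
    # Base64 emits 4 output characters for every 3 input bytes, padded up.
    return 4 * ((n_bytes + 2) // 3)


# Sanity-check the formula against the real encoder on a small sample.
assert base64_length(10) == len(base64.b64encode(b"x" * 10))

raw = pcm_bytes(300)               # 5-minute call
encoded = base64_length(raw)       # size if inlined on a span attribute
audio_url = "s3://example-bucket/calls/conv-123/call.wav"  # hypothetical pointer

print(f"raw audio:      {raw:>12,} bytes")
print(f"base64 on span: {encoded:>12,} bytes")
print(f"URL on span:    {len(audio_url):>12,} bytes")
```

Under those assumptions the inlined attribute is roughly 12.8 MB per call, while the pointer is a few dozen bytes.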
LLM prompts and outputs are the second offender for high-volume agents. A 4000-token system prompt repeated on every turn span across 10,000 calls a day is a lot of redundant storage. Patterns that help:
- Hash the system prompt and store the mapping separately. Put the hash on the span; full text lookups go to a side store.
- Truncate user messages above a threshold for traces (full text stays in your application logs).
- Use the OpenTelemetry SDK’s attribute value length limit so oversized values truncate at the source.
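The first two patterns can be sketched in a few lines. PROMPT_STORE stands in for whatever side store you use, and the attribute name llm.system_prompt_hash and the 2,000-character threshold are hypothetical choices, not traceAI conventions.

```python
import hashlib
from typing import Dict

# Stand-in for a side store (a database table, a KV store, and so on).
PROMPT_STORE: Dict[str, str] = {}


def prompt_ref(system_prompt: str) -> str:
    """Store the full prompt once and return its SHA-256 hash for the span."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    PROMPT_STORE.setdefault(digest, system_prompt)
    return digest


def truncate(text: str, max_chars: int = 2000, marker: str = "...[truncated]") -> str:
    """Cap a message for tracing; the full text stays in application logs."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + marker


system_prompt = "You are a billing support agent. " * 200
span_attrs = {
    "llm.system_prompt_hash": prompt_ref(system_prompt),  # hypothetical key
    "input.value": truncate("Hi, I have a question about my invoice. " * 100),
}
print(len(system_prompt), "chars of prompt ->", len(span_attrs["llm.system_prompt_hash"]), "chars on the span")
```

For the third pattern, the OpenTelemetry SDK reads the OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT and OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT environment variables, so the cap can be set without code changes.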
Sampling, retention, and PII
Voice data is regulated almost everywhere it ships. Transcripts contain customer PII by default. Audio contains voice biometrics. Even span metadata (customer id, intent, retrieved knowledge chunks) often falls under data-protection rules.
The minimum you need:
- Retention policy per span attribute class: customer-identifying fields auto-redact after 30 days; full audio after 90; aggregated metrics indefinitely.
- Tenant-isolated storage: tag every span with a tenant id, enforce row-level filtering on the backend.
- PII redaction on transcripts: name, account number, SSN, credit card. Future AGI’s PII eval flags these inline; DataPrivacyCompliance audits the whole call session for privacy violations.
- Audit log on the trace store itself: who queried what, when, with what filter.
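As a rough illustration of redacting a transcript before it lands on a span, here is a regex-only sketch for US SSNs and card numbers. It is not Future AGI’s PII eval, the patterns are deliberately simple assumptions, and names or account numbers need more than regular expressions.

```python
import re

# Simple illustrative patterns; real redaction needs broader coverage.
PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),
]


def redact(transcript: str) -> str:
    """Replace each match with a bracketed label such as [SSN]."""
    for label, pattern in PATTERNS:
        transcript = pattern.sub(f"[{label}]", transcript)
    return transcript


text = "My card is 4111 1111 1111 1111 and my SSN is 123-45-6789."
print(redact(text))
```

Running the redaction before the transcript is attached to the span keeps the raw value out of the trace store entirely.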
For regulated industries, the trust posture matters. Future AGI is SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001 certified per the trust page. Agent Command Center also ships a BYOC self-host option so the audit boundary stays inside your VPC.
A working reference architecture
End-to-end, a traceAI voice stack looks like this:
+-------------------+ +--------------------+ +------------------+
| Voice Orchestrator| ------> | Your LLM Service | ------> | LLM Provider |
| (LiveKit/Pipecat) | | + traceAI instr | | (OpenAI, Claude) |
| + traceai-livekit| | + Protect inline | +------------------+
| + audio upload | +---------+----------+
+---------+---------+ |
| | traceAI spans via OTLP
| spans + audio URLs v
| +-------------------------------+
+------------> | OTel Collector (optional) |
| - filters, batching, fanout |
+---------------+---------------+
|
v
+-------------------------------+
| FAGI Observe backend |
| - render trace tree |
| - eval scoring |
| - error clustering |
| - inline guardrails |
+-------------------------------+
The OTel collector is optional but useful for production. It lets you fan out spans to multiple destinations in parallel, apply attribute filters before export, and absorb backend outages without losing trace data.
Where Future AGI fits in this picture
traceAI is the SDK; FAGI Observe is the backend; the reason most voice teams adopt the pair is the layer that runs once the spans land. The wire format under traceAI is plain OpenTelemetry, so the spans stay portable. Two FAGI Observe capabilities matter beyond the eval and clustering layer already covered above.
Simulation. The product ships 18 pre-built personas plus unlimited custom. Each persona controls gender, age range, location, accent, communication style, conversation speed, background noise, and a multilingual toggle. Workflow Builder auto-generates branching scenarios with branch visibility, and Error Localization (release 2025-11-25) pinpoints the exact failing turn when a scenario breaks. A programmatic eval API for configure-and-rerun wires the eval flow into your CI.
Inline guardrails and optimization. The Future AGI Protect model family runs inline: a Gemma 3n foundation with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance) per arXiv 2510.13351, sub-100ms, multi-modal across text, image, and audio, with ProtectFlash as a single-call binary classifier path. agent-opt ships six prompt optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard) as a UI workflow inside the Dataset surface and as a Python SDK. The whole stack runs hosted, multi-region, or BYOC self-host through Agent Command Center, with RBAC and the cert set on futureagi.com/trust: SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001.
traceAI gets the voice pipeline traced; FAGI Observe is where the loop closes.
A deliberate tradeoff
Optimization is an explicit, gated run. The six-optimizer agent-opt surface (UI plus SDK) never auto-rewrites prompts in production. Every optimization run is initiated by a human, gated by an evaluator, and surfaces candidate prompts in the dashboard for approval before they ship. That is a deliberate process choice: production prompt changes go through human review.
Native voice observability ships for Vapi, Retell, and LiveKit out of the box. The dashboard path covers the three runtimes most teams are on, with no SDK required. For any other runtime, the traceAI SDK plus the Enable Others mode covers the rest. Between the native path and traceAI, stacks built on Twilio, Plivo, Telnyx, Bland, ElevenLabs Agents, and Pipecat are all in scope. The boundary is native dashboard ingest versus SDK or webhook ingest, not supported versus unsupported.
Common production gotchas
Missing trace context propagation across services. If your voice orchestrator and your LLM service are separate processes, you need to inject and extract the OpenTelemetry context across the call boundary. Without it, the orchestrator’s call span and the LLM service’s spans land as separate traces in the backend. The fix is the standard OTel propagation pattern: inject the context in the client, extract it in the server, and carry it between them in the W3C Trace Context traceparent header. traceAI uses the standard OTel propagator, so this works the moment you wire the header.
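To show what actually crosses the service boundary, here is a standard-library sketch that builds and parses a W3C traceparent header by hand. In a real service the OpenTelemetry propagation API (inject and extract in opentelemetry.propagate) does this for you; make_traceparent and parse_traceparent are illustrative helpers only.

```python
import re
import secrets
from typing import Dict, Optional

# version "-" 32-hex trace id "-" 16-hex parent span id "-" 2-hex flags
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id: Optional[str] = None,
                     span_id: Optional[str] = None,
                     sampled: bool = True) -> str:
    trace_id = trace_id or secrets.token_hex(16)
    span_id = span_id or secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Dict[str, object]:
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    _version, trace_id, span_id, flags = match.groups()
    # All-zero ids are invalid per the W3C Trace Context specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("traceparent carries an all-zero id")
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Orchestrator side: attach the header to the outgoing request.
headers = {"traceparent": make_traceparent()}
# LLM service side: read it back and continue the same trace.
ctx = parse_traceparent(headers["traceparent"])
print(ctx["trace_id"], ctx["sampled"])
```

If the LLM service starts its spans under the extracted trace id, both services land in one trace instead of two.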
Forgetting to flush spans on shutdown. Voice services that scale up and down miss spans on shutdown if the BatchSpanProcessor doesn’t flush its buffer before the process exits. Add trace_provider.shutdown() to your service’s graceful-shutdown hook; the provider that register() returns exposes it.
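A minimal sketch of that hook, assuming only that the provider object has a shutdown() method. The helper names are hypothetical, and a real service would usually fold this into its existing graceful-shutdown path.

```python
import signal
import sys


def make_shutdown_handler(provider):
    """Return a signal handler that flushes buffered spans before exit."""
    def handler(signum, frame):
        # shutdown() flushes the span processors attached to the provider.
        provider.shutdown()
        sys.exit(0)
    return handler


def install_shutdown_hook(provider, signals=(signal.SIGTERM, signal.SIGINT)):
    # Must be called from the main thread, which is where signal.signal works.
    handler = make_shutdown_handler(provider)
    for sig in signals:
        signal.signal(sig, handler)
    return handler


# Usage at service startup, with the provider returned by register():
#     install_shutdown_hook(trace_provider)
```

SIGTERM matters here because orchestrators typically send it on scale-down, and a process killed by an unhandled SIGTERM does not run its atexit hooks.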
Putting full audio on attributes. Said above. Worth saying twice. Use URLs.
Sampling out the failures. Head-based sampling at 10% throws away 90% of your debugging data. Use tail-based sampling, or sample at 100% during early release cycles.
Skipping conversation_id on every span. If the LLM-service spans don’t carry the voice call’s conversation id, you can’t join them to the call session in the backend. Set it on every span, not just the root. With the framework packages, pass it on the conversation root span and the child spans inherit it.
Forgetting the attribute mapping toggle. Both traceai-livekit and traceAI-pipecat need enable_http_attribute_mapping() called in the entrypoint. Without it, the provider HTTP calls land as spans with names but no input or output content.
Related reading
- Voice AI Observability for Vapi: 2026 implementation guide
- Voice AI Observability for LiveKit Agents: a 2026 guide
- How to monitor AI voice agents in production: a 2026 playbook
- How to implement voice AI observability in 2026
Sources and references
- traceAI on GitHub: github.com/future-agi/traceAI
- ai-evaluation on GitHub: github.com/future-agi/ai-evaluation
- Error Feed docs: docs.futureagi.com/docs/observe
- Future AGI Protect docs: docs.futureagi.com/docs/protect
- Agent Command Center docs: docs.futureagi.com/docs/command-center
- OpenTelemetry: opentelemetry.io
- OTLP spec: github.com/open-telemetry/opentelemetry-proto
- arXiv 2510.13351 (Protect): arxiv.org/abs/2510.13351
- arXiv 2507.19457 (GEPA): arxiv.org/abs/2507.19457
- arXiv 2505.09666 (Meta-Prompt): arxiv.org/abs/2505.09666
- arXiv 2311.09569 (Random Search baseline): arxiv.org/abs/2311.09569
- Trust page: futureagi.com/trust
Frequently asked questions
What is traceAI and how does it trace a voice agent?
Does traceAI work with my existing OpenTelemetry setup?
Can I use traceAI if my pipeline isn't an LLM SDK call?
Do I need to instrument STT and TTS separately?
How do I avoid logging the full audio payload on every span?
What span attributes matter most for voice debugging?
What does traceai-livekit actually capture, and how is it different from traceAI-pipecat?
Which Future AGI rubrics run on traceAI voice spans?