Voice AI Observability for LiveKit Agents: A 2026 Guide
Implement voice AI observability for LiveKit Agents: native FAGI dashboard via Assistant ID plus traceai-livekit pip package for code-driven span tracing.
LiveKit owns the WebRTC layer, the agent loop, and the orchestration of STT, LLM, and TTS. Your job is to know what happened in every call, score it against the rubrics that matter, and route the failures into something the on-call rotation can actually act on. This guide walks through both paths Future AGI ships for LiveKit Agents: a no-SDK dashboard path, and a code-driven traceai-livekit path. The two compose.
Step preview
LiveKit is an engineering-driven stack, so the primary path is the SDK. The Native Voice Obs dashboard path is available as a secondary view for teams that want both.
- Install traceai-livekit and register the tracer in your worker entrypoint. Voice events and audio interactions land as traceAI spans.
- Verify the spans appear in the FAGI dashboard: root conversation span; LLM, tool, STT, and TTS children; audio attached.
- Optional dashboard-only path: wire your LiveKit assistant into a Future AGI Agent Definition via API key + Assistant ID for a no-code call log view alongside the SDK spans.
- Attach the named voice rubrics: audio_transcription, audio_quality, conversation_coherence, conversation_resolution, task_completion.
- Turn on Error Feed for auto-clustered failures and Future AGI Protect for inline guardrails. (Simulation has its own UI + SDK paths; see voice-agent-simulation-2026-guide.)
The rest of the post fills in the details.
Why LiveKit specifically
LiveKit is the open-source orchestration layer that most production voice teams reach for when they want full control over the call pipeline. The genuine wedges are three:
WebRTC depth. LiveKit owns the SFU and the real-time media layer. That means accurate packet-loss, jitter, and codec stats at the call level, which the hosted vendors expose only after a delay (and sometimes not at all).
Open-source orchestration. Agents run in your runtime, your cloud, your boundary. For regulated workloads (healthcare, fintech, federal), this is often the right deployment, and BYOC is the default rather than an upsell.
Provider flexibility. LiveKit Agents bridges to any STT, LLM, and TTS the team picks: Deepgram for STT, Anthropic or OpenAI for the LLM, Cartesia or ElevenLabs for TTS. Swap any leg without rewriting the runtime.
What LiveKit Agents does not ship is a deep observability and eval layer. LiveKit Telemetry gives you WebRTC-layer infra metrics inside the LiveKit dashboard, and that’s solid for media-layer debugging. It does not score every call against multi-turn rubrics, it does not auto-cluster failures into named issues with root cause analysis, and it does not run inline guardrails on the LLM response. That gap is where FAGI sits.
The pattern we recommend: keep LiveKit as your call runtime, add FAGI as your observability, eval, and guardrail layer. The two compose cleanly because FAGI ships both a native API-driven integration and a dedicated traceai-livekit SDK.
Step 1: Install traceai-livekit and instrument the agent
LiveKit teams are engineering teams. Most production LiveKit deployments lead with the SDK because you’re already writing Python or TypeScript to define the agent, and the SDK gives span-level depth (LLM, tool, STT, TTS) that no-code paths do not surface. traceai-livekit is a dedicated pip package that traces voice events and audio interactions and lands them in the FAGI Observe project as spans. The dashboard-only path stays available as a secondary view (Step 3 below); the two compose.
Install and register
import os

from fi_instrumentation import FITracer
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_livekit import enable_http_attribute_mapping

os.environ["FI_API_KEY"] = "your-future-agi-api-key"
os.environ["FI_SECRET_KEY"] = "your-future-agi-secret-key"

trace_provider = register(
    project_name="LiveKit Agent",
    project_type=ProjectType.OBSERVE,
    set_global_tracer_provider=True,
)

enable_http_attribute_mapping()
pip install traceai-livekit brings in the package. enable_http_attribute_mapping() switches on the attribute mapping that joins LiveKit’s HTTP-layer calls (to STT, LLM, TTS providers) into the same OpenInference spans the dashboard renders. Voice events and audio interactions are traced via the same package.
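One way to catch a misconfigured worker early is to check the two credentials before calling register. The sketch below is a hypothetical helper, not part of traceai-livekit or fi_instrumentation: the name missing_credentials is an assumption, and it checks only the FI_API_KEY and FI_SECRET_KEY variables shown above.

```python
import os

# The two variables the register() example above expects to find.
REQUIRED_ENV = ("FI_API_KEY", "FI_SECRET_KEY")


def missing_credentials(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]
```

A worker entrypoint could raise when the returned list is non-empty, so a bad deploy fails at startup instead of on the first call.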
Wrap each conversation in a root span
The voice events and audio interactions are auto-traced; the conversation grouping is on you. Wrap the agent entrypoint in a root span and pass conversation_id as an attribute so every child span joins under it.
from fi_instrumentation import FITracer

tracer = FITracer(trace_provider.get_tracer(__name__))


async def entrypoint(ctx):
    with tracer.start_as_current_span(
        "livekit_conversation",
        attributes={
            "conversation_id": ctx.room.name,
            "customer_id": ctx.proc.userdata.get("customer_id", "unknown"),
            "agent_version": "4.3.1",
            "channel": "voice",
            "provider": "livekit",
        },
    ):
        await run_agent_loop(ctx)
Now every STT call, LLM call, tool invocation, and TTS call inside run_agent_loop joins under the conversation root. The dashboard renders the trace tree per call with all the leg spans visible.
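Why wrapping the entrypoint is enough can be illustrated with a toy model. The sketch below is not the FITracer implementation; it assumes the tracer follows the OpenTelemetry model that its method names suggest, where the "current" span lives in a context variable and each new span reads its parent from there. The span helper, fake_agent_loop, and run_demo are hypothetical names.

```python
import asyncio
import contextvars
from contextlib import contextmanager

# Holds the name of the span that is "current" for the running task.
_current_span = contextvars.ContextVar("current_span", default=None)
recorded = []  # (span_name, parent_name) pairs, in start order


@contextmanager
def span(name):
    """Toy stand-in for a tracer span: record the parent, then become current."""
    parent = _current_span.get()
    recorded.append((name, parent))
    token = _current_span.set(name)
    try:
        yield
    finally:
        _current_span.reset(token)  # restore the previous current span


async def fake_agent_loop():
    # Stand-ins for the auto-traced STT / LLM / TTS legs.
    for leg in ("stt.recognize", "llm.completion", "tts.synthesize"):
        with span(leg):
            await asyncio.sleep(0)


async def entrypoint(room_name):
    with span(f"livekit_conversation:{room_name}"):
        await fake_agent_loop()


def run_demo():
    recorded.clear()
    asyncio.run(entrypoint("room-abc123"))
    return list(recorded)
```

Calling run_demo() yields four pairs: the root with no parent, then each leg with the root as its parent. None of the leg code mentions the root; the nesting comes entirely from where the with block sits.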
What the spans look like
Once traceai-livekit is wired in, a typical voice turn produces something like this span tree:
livekit_conversation [conversation_id=room-abc123]
  voice_turn [turn_index=3]
    stt.recognize [provider=deepgram, model=nova-3]
    llm.completion [provider=openai, model=gpt-4o-mini]
    tool.call [name=lookup_customer]
    tts.synthesize [provider=cartesia, voice_id=sonic-en]
Each span carries standard OpenInference attributes (input.value, output.value, latency, token counts) plus voice-specific attributes (audio durations, confidence scores). Eval rubrics from step 4 attach scores directly onto these spans.
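To make the per-turn view concrete, here is a minimal sketch that rolls child-span durations up by leg. It assumes the leg spans are direct children of voice_turn and follow the dotted naming in the tree above; the Span dataclass is a hypothetical stand-in for whatever your trace export returns, and the durations are illustrative, not measurements.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list["Span"] = field(default_factory=list)


def leg_latency(turn: Span) -> dict[str, float]:
    """Sum the direct children of a voice turn by leg prefix (stt, llm, tool, tts)."""
    totals: dict[str, float] = {}
    for child in turn.children:
        leg = child.name.split(".", 1)[0]
        totals[leg] = totals.get(leg, 0.0) + child.duration_ms
    return totals


# Illustrative numbers only.
turn = Span("voice_turn", 1240.0, [
    Span("stt.recognize", 180.0),
    Span("llm.completion", 620.0),
    Span("tool.call", 210.0),
    Span("tts.synthesize", 190.0),
])
```

With these sample values, leg_latency(turn) attributes 620 ms to the llm leg, which makes it the first place to look when a turn feels slow.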
When to add the SDK on top of the dashboard path
The dashboard path covers most call-level debugging. Add traceai-livekit when:
- You need tool call arguments visible at span level.
- You’re A/B testing prompt revisions and need turn-level eval differentials.
- You’re integrating with a RAG retrieval layer and need retrieval spans on the trace tree.
- You want to debug LLM or TTS latency at the per-turn granularity.
- You need cross-service tracing (e.g. the LiveKit agent calls a separate inference microservice).
For most support and inbound use cases, the dashboard path is enough. The SDK adds depth; it does not replace the dashboard.
Step 2: Verify the spans appear in the FAGI dashboard
Run your agent and place a test call into the LiveKit room. Within a few minutes the spans land in the FAGI project. Open the trace and you should see the conversation root span with child spans for every LLM call, tool invocation, STT and TTS leg, and audio interaction the agent fired. The Call Log row shows the four panels below.
Audio panel: two players, one labelled Assistant, one labelled Customer. Each has its own waveform and a download button. The separation lets you debug a barge-in failure (interruption timing in the customer leg) or a TTS regression (clarity in the assistant leg) without listening to both legs mixed.
Transcript panel: turn-by-turn rows with speaker tags and timestamps. Hover a row to see STT confidence per turn.
Session timeline: horizontal trace tree rendering the call as the root. Without traceai-livekit installed, the timeline shows turn boundaries only. With the Step 1 instrumentation in place, voice events, LLM calls, tool invocations, and audio interactions appear as nested child spans.
Tags panel: whatever metadata LiveKit passed through.
If any panel is missing, double-check that the LiveKit API key has the right scope and that observability is enabled on the agent.
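The same check can be scripted for a test call. The sketch below assumes you can get the span names of one trace as a list of strings (how you export them is up to your setup) and that they follow the dotted naming used earlier; missing_legs and REQUIRED_LEGS are hypothetical names.

```python
# Tool spans are left out on purpose: not every call invokes a tool.
REQUIRED_LEGS = ("stt", "llm", "tts")


def missing_legs(span_names, root_name="livekit_conversation"):
    """Return what a test-call trace lacks: the root span and/or leg prefixes."""
    names = list(span_names)
    missing = []
    if root_name not in names:
        missing.append(root_name)
    seen = {name.split(".", 1)[0] for name in names}
    missing.extend(leg for leg in REQUIRED_LEGS if leg not in seen)
    return missing
```

An empty list means the trace has the root and all three legs; anything else names what to look for in the worker logs.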
Step 3: Optional — add a Native Voice Obs Agent Definition for a dashboard-only view
If your team wants a no-code dashboard view alongside (or instead of) the SDK spans, FAGI’s Native Voice Obs supports LiveKit as one of the three natively-supported providers (Vapi, Retell AI, LiveKit). Wire the same LiveKit credentials into a FAGI Agent Definition and the call log table populates without any code on top of the SDK spans.
Create the Agent Definition
In the FAGI console, open the Observe product and create a new project. Inside the project, create an Agent Definition. The form asks for:
- Agent name: free-text, what shows up in the call log table.
- Provider: pick LiveKit. The natively supported providers are Vapi, Retell AI, and LiveKit.
- Provider API key: paste the LiveKit API key from your LiveKit project settings.
- Assistant ID: paste the Assistant ID expected by the FAGI LiveKit agent definition.
- Observability toggle: enable.
Save the agent. FAGI handshakes with LiveKit to verify the credentials. If the handshake fails, the dashboard surfaces the error inline.
What lands after save
The next call routed to that LiveKit agent appears in FAGI within a few minutes. The Call Log row carries:
- Two separate audio files: assistant audio and customer audio, downloadable independently.
- The auto transcript: turn-by-turn alternating rows with timestamps and speaker tags.
- The session timeline: call as a root span, turn boundaries as child events.
- The tags panel: whatever metadata you passed through LiveKit’s call API.
This is the surface that needs zero code. You can stop here and you already have more observability than LiveKit Telemetry’s media-layer focus surfaces.
Tagging for KPI attribution
Set custom attributes via LiveKit’s agent metadata. The common set:
- customer_id: filter axis for per-account analysis.
- vertical: e.g. support, outbound_sales, appointment_booking.
- agent_version: lets you A/B compare prompt revisions.
- room_id: links the FAGI session to the LiveKit room.
- intent: top-level intent class.
These ride into the FAGI session as filter axes in the Observe dashboard and as cluster keys in Error Feed.
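A small validator keeps the tag set consistent across agents. This is a sketch under assumptions: it treats the five tags above as required, treats the three example verticals as the full allowed set, and serializes to a JSON string on the assumption that your metadata field takes a string. build_call_tags is a hypothetical helper, not a LiveKit or FAGI API.

```python
import json

# Assumption: the three verticals named above are the only valid values.
ALLOWED_VERTICALS = {"support", "outbound_sales", "appointment_booking"}
REQUIRED_KEYS = ("customer_id", "vertical", "agent_version", "room_id", "intent")


def build_call_tags(**tags):
    """Validate the common tag set and serialize it as a JSON string."""
    missing = [key for key in REQUIRED_KEYS if not tags.get(key)]
    if missing:
        raise ValueError(f"missing tags: {', '.join(missing)}")
    if tags["vertical"] not in ALLOWED_VERTICALS:
        raise ValueError(f"unknown vertical: {tags['vertical']!r}")
    return json.dumps(tags, sort_keys=True)
```

Rejecting unknown verticals at build time avoids near-duplicate filter values (support vs. Support) showing up as separate axes later.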
Step 4: Attach the named voice rubrics
The ai-evaluation SDK ships 70+ built-in eval templates under the Apache 2.0 license. Five of them carry most of the load on a LiveKit Agent workload.
| Rubric | What it scores |
|---|---|
| audio_transcription | ASR drift on customer audio against the rendered transcript |
| audio_quality | TTS clarity and prosody on the assistant audio |
| conversation_coherence | Multi-turn coherence across the whole call |
| conversation_resolution | Did the call resolve the customer’s stated goal |
| task_completion | Did the agent complete its tool calls and workflow |
In the dashboard, open the project’s Evals tab and add the five built-ins. They run on every captured call going forward. Past calls require an explicit backfill (one click).
If you’d rather keep eval config in code, the pattern looks like this:
from fi.testcases import MLLMTestCase, MLLMAudio, ConversationalTestCase, LLMTestCase
from fi.evals import (
    Evaluator,
    AudioTranscriptionEvaluator,
    AudioQualityEvaluator,
    ConversationCoherence,
    ConversationResolution,
    TaskCompletion,
)

ev = Evaluator(
    fi_api_key="your-future-agi-api-key",
    fi_secret_key="your-future-agi-secret-key",
)

assistant_audio = MLLMAudio(url="https://fagi.example.com/calls/livekit-room-abc/assistant.wav")
audio_case = MLLMTestCase(input=assistant_audio, query="Score the assistant TTS leg")

conv = ConversationalTestCase(messages=[
    LLMTestCase(query="Hey, can you help me return an order?", response="Of course. What's the order number?"),
    LLMTestCase(query="It's 8842-A", response="Got it. I see an order placed last week. What's the reason for the return?"),
])

result = ev.evaluate(
    eval_templates=[
        AudioTranscriptionEvaluator(),
        AudioQualityEvaluator(),
        ConversationCoherence(),
        ConversationResolution(),
        TaskCompletion(),
    ],
    inputs=[audio_case, conv],
)
MLLMAudio accepts seven formats out of the box: .mp3, .wav, .ogg, .m4a, .aac, .flac, .wma. URLs or local paths, auto base64 encoded.
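If recordings come from mixed sources, a pre-flight extension check avoids submitting files the evaluator will not accept. The sketch below uses only the seven extensions listed above; is_supported_audio is a hypothetical helper and does not inspect the file contents.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

SUPPORTED_AUDIO = {".mp3", ".wav", ".ogg", ".m4a", ".aac", ".flac", ".wma"}


def is_supported_audio(path_or_url: str) -> bool:
    """True if the file extension is one of the seven listed formats."""
    # urlparse drops any query string or fragment from a URL; a plain
    # relative path comes back unchanged in .path.
    path = urlparse(path_or_url).path or path_or_url
    return PurePosixPath(path).suffix.lower() in SUPPORTED_AUDIO
```

The check is case-insensitive and ignores signed-URL query parameters, so a link ending in assistant.wav?sig=... still passes.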
Multilingual LiveKit agents
If your LiveKit agent handles multiple languages, add two more rubrics from ai-evaluation:
- translation_accuracy: scores translation quality across language pairs.
- cultural_sensitivity: scores cultural appropriateness of the response in the target locale.
Both are Apache 2.0 built-ins. They run alongside the core five and surface multilingual-specific failure modes (mistranslated entities, culturally tone-deaf responses) that the standard rubrics miss.
Step 5: Turn on Error Feed and inline Protect
This is where the loop closes.
Error Feed auto-clusters LiveKit failures
Error Feed is the zero-config error monitoring layer in the FAGI Observe product. It detects errors across five categories: factual grounding failures, tool crashes, broken workflows, safety violations, and reasoning gaps. It auto-clusters them into named issues with auto-written root cause, supporting span evidence, a quick fix to ship today, and a long-term recommendation.
For LiveKit Agent workloads, the common clusters look like this:
- “WebRTC packet loss correlated with audio quality drop” clusters cases where media-layer degradation cascades into TTS or STT failures.
- “STT confidence drop on jitter-affected segments” correlates LiveKit’s network stats with ASR drift.
- “Late barge-in detection after framework version bump” clusters turn-taking failures and points at the LiveKit Agents version that introduced the regression.
- “Tool argument schema mismatch in book_appointment” clusters failed tool calls and points at the prompt drift.
You don’t write these names. The clustering layer writes them.
Inline guardrails via Future AGI Protect
The Future AGI Protect model family runs sub-100ms inline. Foundation is Gemma 3n with LoRA-trained adapters across 4 safety dimensions (Content Moderation, Bias Detection, Security, Data Privacy Compliance), multi-modal across text, image, and audio.
The integration sits inside your LiveKit agent loop, between the LLM response and the TTS leg:
from fi.evals import Protect

p = Protect()


def safe_reply(user_text, agent_text):
    out = p.protect(
        inputs={"input": user_text, "output": agent_text},
        protect_rules=[
            {"metric": "content_moderation"},
            {"metric": "security"},
            {"metric": "data_privacy_compliance"},
        ],
    )
    if out.blocked:
        return "I'm sorry, I can't help with that. Let me hand you to a human agent."
    return agent_text
For the fastest path:
out = p.protect(
    inputs={"input": user_text, "output": agent_text},
)
ProtectFlash returns a single harmful or not-harmful verdict in one call. The verdict lands on the FAGI span, so the trust team can review denied responses in Error Feed.
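Because the guard sits on the critical path between the LLM and TTS, it can help to cap how long the agent waits for a verdict. The sketch below is one possible pattern, not part of the Protect API: guarded_reply is a hypothetical wrapper, check is any coroutine function you supply that returns True when the text should be blocked, and the 100 ms default budget is an assumption drawn from the sub-100ms figure above.

```python
import asyncio

FALLBACK = "I'm sorry, I can't help with that. Let me hand you to a human agent."


async def guarded_reply(agent_text, check, budget_s=0.1):
    """Run an async guardrail check under a time budget; fail closed on timeout.

    `check` is a coroutine function returning True when the text is blocked.
    """
    try:
        blocked = await asyncio.wait_for(check(agent_text), timeout=budget_s)
    except asyncio.TimeoutError:
        return FALLBACK  # fail closed: never speak unchecked text
    return FALLBACK if blocked else agent_text
```

Failing closed trades an occasional unnecessary handoff for the guarantee that no unchecked text reaches TTS; a team that prefers availability could return agent_text on timeout instead. A synchronous guard call can be adapted by running it through asyncio.to_thread inside check.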
A full reference architecture
+--------------------+        +---------------------+        +-------------------+
| LiveKit Agent      | -----> | STT / LLM / TTS     | -----> | Providers         |
| (your runtime,     |        | (Deepgram, OpenAI,  |        | (Deepgram, OpenAI,|
|  open-source)      |        |  Cartesia, etc.)    |        |  Cartesia, etc.)  |
| + traceai-livekit  |        | + Protect inline    |        +-------------------+
| + Protect inline   |        +----------+----------+
+--------------------+                   |
         |                               | OpenInference spans
         |                               v
         |                 +----------------------------+
         |                 | FAGI Observe project       |
         +------------>    | - native LiveKit integ.    |
   call log + audio        | - traceai-livekit spans    |
   + transcript            | - 70+ built-in rubrics     |
                           | - Error Feed clustering    |
                           | - inline Protect verdicts  |
                           +----------------------------+
                                         |
                                         v
                           +----------------------------+
                           | Agent Command Center       |
                           | - RBAC, BYOC, multi-region |
                           | - SOC 2 + HIPAA + GDPR     |
                           |   + CCPA + ISO 27001       |
                           +----------------------------+
LiveKit Agents owns the runtime. traceai-livekit instruments inside the agent loop. Inline Protect guards the LLM response before TTS. The FAGI Observe project receives both surfaces (native call-level plus SDK span-level) and joins them under one session. Agent Command Center hosts the whole stack with RBAC, multi-region or BYOC, and the cert set listed on futureagi.com/trust: SOC 2 Type II, HIPAA, GDPR, CCPA, and ISO 27001.
Calibrated honesty: where LiveKit genuinely wins
LiveKit is the deepest open-source orchestration layer for voice in 2026. The wedge matters in three concrete ways:
WebRTC layer accuracy. LiveKit owns the SFU. The infra-layer voice quality stats (jitter, packet loss, codec, MOS estimates) come straight from the media path, which is more accurate than any vendor that proxies the stats from an upstream provider.
Open-source orchestration. The runtime is yours. You own the code, the deployment, the network path. For regulated workloads, BYOC is the default rather than an upsell. For teams that need to audit every leg of the pipeline, this is the only option that survives a security review.
Provider flexibility. Swap any leg without rewriting the runtime. Deepgram to AssemblyAI on STT, OpenAI to Anthropic on LLM, ElevenLabs to Cartesia on TTS, all configurable per-agent. The orchestration is the abstraction.
What LiveKit Agents does not ship is the deep observability and eval layer that production teams need on top. LiveKit Telemetry covers the media layer; FAGI covers the agent layer. The two compose. Native voice observability on FAGI even reads from LiveKit’s call API, so you get both surfaces in one dashboard.
Two deliberate tradeoffs
Async eval gating is explicit. FAGI never auto-rewrites prompts in production without an explicit run plus a human approval gate. The Dataset UI ships UI-driven optimization across all six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard); the agent-opt Python library exposes the same six for programmatic control. Either way, the loop stays explicit: point the run at a dataset, pick an evaluator, pick the optimizer, then promote a candidate by hand.
Native voice obs ships for Vapi, Retell, and LiveKit; everything else routes through traceAI or Enable Others. The provider-API-key dashboard path covers the three runtimes most production teams pick. The remaining 10 percent (Synthflow, Bland, Pipecat, custom RTP) lands through the traceAI SDK (from fi_instrumentation import register plus from fi_instrumentation.fi_types import ProjectType) or via Enable Others mode with mobile-number simulation. Active iteration on the dashboard surface keeps shipping every release: multi-step Agent Definition UX, Prompt Workbench Revamp, redesigned Run Test performance metrics, Show Reasoning column in Simulate, sticky filters in Observe, scenario generation with branch visibility, and Error Localization that pinpoints the failing turn.
Common pitfalls when wiring LiveKit observability
Don’t skip the enable_http_attribute_mapping() call. Without it, the HTTP-layer calls LiveKit Agents makes to providers (Deepgram, OpenAI, Cartesia) don’t get the OpenInference attribute mapping, and the spans render with provider names but without input or output content. The call is a one-liner; just make sure it’s in the entrypoint.
Don’t wrap the conversation in a span outside the entrypoint. The tracer.start_as_current_span has to be inside the LiveKit agent entrypoint so that the room context is bound when the span is created. If you create the root span in a parent process and pass it in, child spans from traceai-livekit won’t attach correctly.
Don’t run traceai-livekit without the native dashboard integration. The SDK gives you span depth; the native integration gives you the audio recordings and the transcript surface. The two compose; running only one leaves you with a thinner debug surface than necessary.
Don’t run all five rubrics on every call from day one. Start with conversation_resolution and task_completion. Add audio_transcription and audio_quality once you’ve seen a TTS or STT regression. Add conversation_coherence once you have enough multi-turn data.
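The staged rollout can be written down as a small rule so the decision lives in config rather than in someone's head. This is a sketch: rubrics_to_enable is a hypothetical helper, and the 500-call threshold for "enough multi-turn data" is a placeholder assumption to tune for your own traffic.

```python
def rubrics_to_enable(seen_audio_regression: bool,
                      multi_turn_calls: int,
                      min_multi_turn_calls: int = 500) -> list[str]:
    """Return the rubric names to run, following the staged rollout above."""
    rubrics = ["conversation_resolution", "task_completion"]  # day-one pair
    if seen_audio_regression:
        rubrics += ["audio_transcription", "audio_quality"]
    if multi_turn_calls >= min_multi_turn_calls:
        rubrics.append("conversation_coherence")
    return rubrics
```

On day one, rubrics_to_enable(False, 0) returns only the two outcome rubrics; the audio pair and conversation_coherence join as their conditions are met.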
Don’t ignore Error Feed for the first week. It needs traffic to populate the named issue list. Once volume crosses a threshold, the clusters start surfacing.
Don’t deploy in BYOC without sizing the trace storage. Trace volume scales with call volume. Size the trace storage (Postgres or ClickHouse depending on deployment) for at least 90 days of retention at peak call rate. Agent Command Center provides sizing guidance per deployment.
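A back-of-envelope estimate is a reasonable starting point before asking for deployment-specific guidance. The sketch below multiplies peak call rate by spans per call, average span size, and retention; trace_storage_gb is a hypothetical helper, the inputs are assumptions you should measure on your own traces, and the result excludes audio files, indexes, and replication overhead.

```python
def trace_storage_gb(peak_calls_per_day: int,
                     spans_per_call: int,
                     avg_span_bytes: int,
                     retention_days: int = 90) -> float:
    """Raw span storage in decimal gigabytes for the retention window."""
    total_bytes = (peak_calls_per_day * spans_per_call
                   * avg_span_bytes * retention_days)
    return total_bytes / 1e9
```

For example, 10,000 calls per day at 60 spans per call and 2,000 bytes per span comes to 10,000 × 60 × 2,000 × 90 = 108 GB of raw span data over 90 days.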
When you’ve outgrown this setup
Once the dashboard path, traceai-livekit, eval rubrics, Error Feed, and inline Protect are running cleanly, the next move is simulation. FAGI’s simulation product ships 18 pre-built personas plus unlimited custom-authored personas. Custom personas configure name, gender, age range (18-25 / 25-32 / 32-40 / 40-50 / 50-60 / 60+), location (US / Canada / UK / Australia / India), personality traits, communication style, accent, conversation speed, background noise, multilingual toggle, custom properties, and free-form behavioral instructions. The Workflow Builder (Conversation Node, End Call Node, Transfer Call Node) auto-generates branching scenarios (20, 50, or 100 rows) with branch visibility; Dataset scenarios accept CSV, JSON, and Excel uploads or synthetic generation; script-based runs cover deterministic regression. The 4-step Run Tests wizard runs the suite against your LiveKit assistant, Error Localization pinpoints the exact failing turn, and the Show Reasoning column surfaces eval rationale per scenario.
The same Agent Definition you wired in Step 3 plugs into Simulate. The same eval rubrics run on simulated calls. The same Error Feed clusters scenario failures alongside production failures. Custom voices from ElevenLabs and Cartesia plug into Run Prompt and Experiments for per-run voice routing; Indian phone number simulation ships as a configurable region. The Tool Calling eval and programmatic eval API cover CI integration.
The other natural extension is closing the loop into optimization. The Dataset UI ships all six optimizers (Bayesian Search, Meta-Prompt per arXiv 2505.09666, ProTeGi, GEPA Genetic-Pareto per arXiv 2507.19457, Random Search per arXiv 2311.09569, PromptWizard); the agent-opt Python library exposes the same six for programmatic runs. Both read the same trace data the dashboard renders and propose prompt revisions against live failure patterns. The loop is explicit by design. Turn it on after the first month, once eval baselines stabilize.
For a deeper walkthrough of the simulation side, see the voice agent scenario guide. For the broader production monitoring playbook, see how to monitor AI voice agents in production.
Related reading
- Voice AI Observability for Vapi: A 2026 Implementation Guide
- Voice AI Observability for Retell AI: A 2026 Implementation Guide
- How to monitor AI voice agents in production: a 2026 playbook
- 7 best voice agent monitoring platforms in 2026
Sources and references
- traceAI on GitHub: github.com/future-agi/traceAI
- ai-evaluation on GitHub: github.com/future-agi/ai-evaluation
- Error Feed docs: docs.futureagi.com/docs/observe
- Future AGI Protect docs: docs.futureagi.com/docs/protect
- Agent Command Center docs: docs.futureagi.com/docs/command-center
- arXiv 2510.13351 (Protect): arxiv.org/abs/2510.13351
- arXiv 2507.19457 (GEPA): arxiv.org/abs/2507.19457
- Trust page (SOC 2 + HIPAA + GDPR + CCPA + ISO 27001): futureagi.com/trust
- OpenInference spec: github.com/Arize-ai/openinference
- LiveKit