What Is a Media Stream?
A continuous, time-ordered flow of audio or video frames between sender and receiver, transported over WebRTC, RTP, or WebSocket.
A media stream is a continuous, time-ordered flow of audio, video, or mixed-media frames between two endpoints, transported over real-time protocols such as WebRTC, RTP, bidirectional gRPC, or WebSocket. In voice-AI and multimodal-agent systems, the media stream is the path microphone audio takes into the ASR pipeline and the path synthesised TTS audio takes back to the user. Unlike a file, a stream is unbounded and lossy on the wire: a missed frame is gone for good, and every component downstream (VAD, transcriber, turn detector, agent loop) must accept frames as they arrive within strict latency budgets.
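To make the contrast with a file concrete, here is an illustrative sketch (not any FutureAGI or WebRTC API) of a downstream consumer that takes frames as they arrive and simply drops the ones that miss their latency budget. The frame shape, 20 ms duration, and 200 ms budget are assumptions chosen for the example.

```python
# Illustrative sketch only: a downstream consumer that must keep up with an
# unbounded stream of fixed-size frames. Frame shape, 20 ms duration, and the
# 200 ms budget are assumptions, not part of any FutureAGI or WebRTC API.
import asyncio
import time
from dataclasses import dataclass

FRAME_MS = 20            # typical PCM/Opus frame duration
LATENCY_BUDGET_S = 0.2   # frames arriving later than this are useless downstream

@dataclass
class AudioFrame:
    capture_ts: float    # wall-clock time the frame was captured
    pcm: bytes           # raw samples for one 20 ms chunk

def feed_downstream(pcm: bytes) -> None:
    ...                  # placeholder for the real VAD -> ASR -> turn-detector path

async def consume(frames: "asyncio.Queue[AudioFrame]") -> None:
    """Process frames as they arrive; late frames are dropped, not replayed."""
    while True:
        frame = await frames.get()
        lateness = time.time() - frame.capture_ts
        if lateness > LATENCY_BUDGET_S:
            continue                 # a missed or late frame is simply gone
        feed_downstream(frame.pcm)
```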
Why It Matters in Production LLM and Agent Systems
A voice agent is only as reliable as the media stream feeding it. Drop 80 ms of inbound audio in the middle of a digit sequence and the ASR transcribes “405” as “45”; the agent confidently books the wrong room. Pile up 250 ms of jitter on the outbound stream and the user hears the TTS stutter, perceives the agent as broken, and hangs up. Most of these failures are silent at the application layer because the agent code only sees text — the media stream is upstream of every metric the LLM team usually watches.
The pain hits the SRE first (latency p99 alarms), then the voice-agent product team (call abandonment up, CSAT down), then the ML team (transcription accuracy regressions that turn out not to be the model’s fault). Logs show clean, complete LLM traces while users are hearing chopped audio.
For 2026-era multimodal agents the stakes go up. A live screen-share stream, a camera feed, and a microphone stream all need to be aligned to the same wall-clock so the agent can reason about “what the user is pointing at right now”. When media streams desync, the agent’s vision and audio modalities disagree and downstream tool calls go wrong. Treat the media stream as part of the trace, not as plumbing.
How FutureAGI Handles Media Streams
FutureAGI captures media streams as first-class objects inside the simulate-sdk via the LiveKitEngine voice simulation surface. When you build a voice agent, LiveKitEngine opens a real WebRTC media stream against your agent endpoint, pipes synthetic-persona audio in, captures the agent’s outbound audio frames, and emits a TestCaseResult containing the full transcript and the captured .wav file path. That artefact is what FutureAGI’s evaluators score against — ASRAccuracy measures inbound transcription quality, AudioQualityEvaluator measures playback quality (jitter, clipping, naturalness), and CaptionHallucination checks whether the transcript invented words that were never spoken.
In production, the same media-stream view ties to OpenTelemetry. A traceAI integration with livekit or pipecat instruments the audio pipeline so each frame chunk emits span attributes such as audio.frame.timestamp, audio.frame.bytes, and audio.frame.dropped. That lets a platform engineer look at a single failed call, see exactly which 40 ms of inbound audio went missing, and correlate it with a specific transcription gap. FutureAGI’s voice-agent dashboard surfaces the most common stream-level regressions — packet loss above threshold, end-to-end latency above 800 ms, audio-quality MOS below 3.5 — so the on-call engineer sees stream health and agent health in the same view.
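As a rough sketch of what that per-chunk instrumentation can look like with the plain OpenTelemetry Python API: the attribute names come from this section, while the function and tracer names are placeholders. The traceAI livekit/pipecat integrations emit these attributes for you, so this only shows the shape of the data.

```python
from opentelemetry import trace

tracer = trace.get_tracer("voice.agent.media")  # placeholder tracer name

def record_frame_chunk(frame_ts: float, frame_bytes: int, dropped: bool) -> None:
    # One span per frame chunk; attribute names follow the convention above.
    with tracer.start_as_current_span("audio.frame_chunk") as span:
        span.set_attribute("audio.frame.timestamp", frame_ts)
        span.set_attribute("audio.frame.bytes", frame_bytes)
        span.set_attribute("audio.frame.dropped", dropped)
```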
How to Measure or Detect It
Media streams are evaluated at several layers: transport, signal, and content, plus end-to-end latency and user-feedback proxies (a transport-level check is sketched in code after the list):
- Transport: packet-loss rate, jitter, RTT, and frame-drop count from WebRTC stats; alarm on sustained packet loss above 2%.
- Signal: `AudioQualityEvaluator` returns a 0–1 quality score per captured stream, with sub-scores for clipping, noise, and naturalness.
- Content: `ASRAccuracy` returns word error rate against the synthetic-persona ground truth; `CaptionHallucination` flags transcripts that contain words not present in the audio.
- End-to-end: time-to-first-audio (TTFA) and turn-taking latency captured as OpenTelemetry span attributes.
- User-feedback proxy: call-abandon rate within the first 10 s correlates strongly with stream-level failures.
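For the transport layer, a minimal sketch of deriving a packet-loss rate from two successive WebRTC stats snapshots and applying the 2% sustained-loss alarm. The dictionary keys are placeholders for whatever your WebRTC stack actually reports.

```python
# Sketch only: compute transport-layer health from two WebRTC stats snapshots.
# The dictionary keys are placeholders, not a specific library's field names.
ALARM_LOSS_RATE = 0.02  # sustained packet loss > 2% should page someone

def packet_loss_rate(prev: dict, curr: dict) -> float:
    lost = curr["packets_lost"] - prev["packets_lost"]
    received = curr["packets_received"] - prev["packets_received"]
    total = lost + received
    return lost / total if total else 0.0

def should_alarm(window: list[float]) -> bool:
    # "Sustained" here means every sample in the window is above threshold.
    return bool(window) and all(rate > ALARM_LOSS_RATE for rate in window)
```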
Minimal Python:
```python
from fi.simulate import LiveKitEngine, Persona
from fi.evals import ASRAccuracy, AudioQualityEvaluator

# Open a real WebRTC media stream against the agent and drive it with a synthetic persona.
engine = LiveKitEngine(agent_endpoint="wss://agent.example/ws")
result = engine.run(Persona.from_yaml("flight-booking.yaml"))

print(ASRAccuracy().evaluate(result).score)             # content layer: word error rate
print(AudioQualityEvaluator().evaluate(result).score)   # signal layer: 0-1 quality score
```
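`CaptionHallucination`, the third evaluator named above, scores the same captured call; `result` here is the TestCaseResult returned by `engine.run(...)` in the snippet above.

```python
from fi.evals import CaptionHallucination

# Content-layer check on the same captured call: flags transcript words
# that were never actually spoken in the audio.
print(CaptionHallucination().evaluate(result).score)
```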
Common Mistakes
- Logging only the LLM trace, not the media frames. Half of voice-agent regressions live in the audio path; missing frames look identical to a perfect call in text logs.
- Treating jitter and latency as the same thing. Constant high latency is fixable; variable jitter ruins turn-taking even at low average latency.
- Recording at 16 kHz mono and evaluating against a 24 kHz reference. Sample-rate mismatch corrupts both ASR and audio-quality scores (a resampling fix is sketched after this list).
- Skipping silence frames in evaluation. Endpointing depends on silence — strip them and your turn-detection eval becomes meaningless.
- Assuming the codec is lossless. Opus and G.711 introduce different failure modes; evaluate against the codec the user will actually hear.
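One way to avoid the sample-rate mismatch above is to resample the reference to the capture rate before scoring. A minimal sketch, assuming a 16 kHz capture and a 24 kHz reference, using soundfile and scipy; the file names are placeholders.

```python
import soundfile as sf
from scipy.signal import resample_poly

# Bring the 24 kHz reference down to the 16 kHz capture rate before scoring,
# so ASR and audio-quality metrics compare like with like.
ref, ref_rate = sf.read("reference_24k.wav")
assert ref_rate == 24000
ref_16k = resample_poly(ref, up=2, down=3)   # 24000 * 2/3 = 16000
sf.write("reference_16k.wav", ref_16k, 16000)
```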
Frequently Asked Questions
What is a media stream?
A media stream is a real-time, time-ordered flow of audio or video frames carried over WebRTC, RTP, or WebSocket. In voice agents, it carries microphone input into ASR and synthesised speech back to the user.
How is a media stream different from a media file?
A media file is a finite, indexable container. A media stream is unbounded and ordered by arrival time — frames must be processed as they arrive, with strict latency and jitter constraints, before they're discarded or buffered.
How do you evaluate the quality of a media stream?
FutureAGI captures media streams through `LiveKitEngine` and scores them with `AudioQualityEvaluator` and `ASRAccuracy`, surfacing jitter, packet loss, and transcription gaps alongside model-level voice-agent metrics.