
Multimodal LLM Tracing in 2026: Image, Audio, and Text Spans

Tracing image, audio, and text spans across multimodal LLM apps in 2026. OTel schema, payload handling, redaction, sampling, and the tools that ingest them.

multimodal llm-tracing opentelemetry vision audio observability traceai 2026
Editorial cover image on a pure black starfield background with faint white grid. Bold all-caps white headline MULTIMODAL LLM TRACING fills the left half. The right half shows a wireframe trace tree with one root node branching into three child nodes, each branch ending in a small icon (image rectangle, audio waveform, text bubble), with a soft white halo glow on the root node, drawn in pure white outlines.

A multimodal customer-support agent ingests a screenshot of a billing error, transcribes the user’s voice memo, and generates a refund explanation. By 2pm the agent’s latency p99 hits 11 seconds. The on-call engineer opens the trace and sees a single 11-second span labeled “agent.invocation” with no children, because the team instrumented OpenAI’s chat span but not the vision encoder, the audio transcription, or the tool calls. The fix is not faster compute. It is multimodal tracing: a trace tree that records the image embed, the audio transcription, the LLM generation, and the tool call as separate spans, with the modality tagged and the payload referenced via URL so the storage cost stays sane.

This guide covers what multimodal LLM tracing looks like in 2026, the OTel schema that makes it cross-vendor, the storage and redaction patterns that keep it compliant, and the tools that actually ingest it.

TL;DR: Multimodal tracing in one paragraph

Multimodal LLM tracing captures image, audio, video, and text spans in one OpenTelemetry trace tree, with the modality named on the span attribute and the payload referenced via URL to a blob store. The schema is the OTel GenAI semantic conventions (still in Development status as of 2026), with multimodal payload URLs (image_url, audio_url, video_url) carried as OpenInference, traceAI, or vendor extensions until a stable cross-vendor schema lands. The tools that handle it natively are OTel-native LLM platforms (FutureAGI traceAI, Phoenix with OpenInference, Langfuse). The hard parts are payload storage, pre-store redaction, and modality-aware sampling.

Why multimodal tracing matters in 2026

Three pressures pushed multimodal tracing from “edge case” to “table stakes” by 2026.

Multimodal models hit production at scale. GPT-5.5 with vision and audio, Claude 4.x with vision, Gemini 3.x natively multimodal. Production teams ship apps where a single user invocation routinely involves three modalities. Tracing that captures only text misses two-thirds of the system.

Voice agents became real. Voice agents in customer support, healthcare intake, and field service ship with sub-second latency requirements and audio-first interaction patterns. The tracing has to record transcription, intent classification, response generation, and TTS as separate spans because each one fails differently.

Compliance asks for image and audio audit trails. EU AI Act Article 12 (record-keeping) and HIPAA both ask for traceable records of inputs and outputs. For text, this is a logging problem. For images and audio, it requires storage architecture decisions: where the payload lives, how long it stays, who can access it.

The anatomy of a multimodal trace

A typical 2026 multimodal trace has four span types:

  1. Invocation span (root). Carries the trace_id, the user_id, the session_id, and aggregate latency.
  2. Encoder spans. One per non-text input. Vision encoder, audio encoder, video encoder. Each carries a payload URL and the encoder model id.
  3. Generation span. The LLM call. Carries the prompt template id, the prompt version, the model id, the input messages with content_type per message, and the completion.
  4. Tool spans. One per tool call. Carries the tool name, the input arguments, the output, and a modality tag if the output is non-text.

Each span carries gen_ai.* attributes per the OTel GenAI semantic conventions. Vendor extensions add gen_ai.tool.calls, gen_ai.tool.definitions, and provider-specific cost attributes.
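
The nesting below is a minimal sketch of that span tree using the OpenTelemetry Python SDK, assuming a configured tracer provider. The gen_ai.* names follow the in-development semconv; the session id, model ids, span names, and the *_url payload attributes are illustrative extension-style choices, not canonical OTel attributes.

```python
# Sketch of the four span types with the OpenTelemetry Python SDK.
# gen_ai.* names follow the (in-development) GenAI semconv; the *_url
# payload attributes are extension-style assumptions, not canonical OTel.
from opentelemetry import trace

tracer = trace.get_tracer("support-agent")

with tracer.start_as_current_span("agent.invocation") as root:        # 1. invocation (root)
    root.set_attribute("session.id", "sess-42")

    with tracer.start_as_current_span("vision.embed") as vision:      # 2. encoder span
        vision.set_attribute("gen_ai.request.model", "vision-encoder-v2")
        vision.set_attribute("input.image_url", "s3://traces/img/abc123.png")

    with tracer.start_as_current_span("audio.transcribe") as audio:   # 2. encoder span
        audio.set_attribute("gen_ai.request.model", "speech-to-text-v3")
        audio.set_attribute("input.audio_url", "s3://traces/audio/abc123.wav")

    with tracer.start_as_current_span("text.generate") as gen:        # 3. generation span
        gen.set_attribute("gen_ai.request.model", "gpt-5.5")
        gen.set_attribute("gen_ai.usage.input_tokens", 1843)
        gen.set_attribute("gen_ai.usage.output_tokens", 212)

    with tracer.start_as_current_span("tool.lookup_invoice") as tool: # 4. tool span
        tool.set_attribute("tool.name", "lookup_invoice")
        tool.set_attribute("output.modality", "text")
```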

Editorial diagram on a black starfield background titled MULTIMODAL TRACE TREE with subhead ONE INVOCATION, THREE MODALITIES. A wireframe trace tree drawn in thin white outlines with one root span at the top labeled agent.invocation branching down into three child spans labeled vision.embed, audio.transcribe, and text.generate. Each child span shows small mode-specific icons: a tiny wireframe image rectangle, a tiny wireframe audio waveform, a tiny wireframe text bubble. Below each child are leaf spans showing OTel attributes such as gen_ai.input.messages, gen_ai.usage.input_tokens, and vendor-extension payload-URL fields like image_url and audio_url. A soft white radial halo glow surrounds the central multimodal aggregation node where the three branches converge into the root.

How multimodal tracing is implemented

Span attributes for multimodal

The OTel GenAI semconv is still in Development status as of 2026 and stabilizing across releases. Commonly-supported attributes today:

  • gen_ai.request.model. The model id (for example gpt-5.5, claude-opus-4-7, gemini-3.1-flash). Verify against current provider docs at the time of instrumentation.
  • gen_ai.input.messages and gen_ai.output.messages. Arrays of messages with role and parts.
  • gen_ai.output.type. The response type when applicable.
  • gen_ai.usage.input_tokens, gen_ai.usage.output_tokens. Token counts.
  • gen_ai.tool.definitions. Tool definitions for tool-calling models.

Multimodal payload URLs (image_url, audio_url, video_url) and per-message content_type are commonly carried as OpenInference, traceAI, or vendor-specific extensions on the span until a stable cross-vendor multimodal schema lands. Treat the URL fields as extensions rather than canonical OTel attributes.
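
For illustration, here is the rough shape of a multimodal gen_ai.input.messages value. The role/parts layout follows the in-development semconv; the image_url part type and its field names are assumptions standing in for whichever extension your instrumentation library uses.

```python
# Illustrative shape of gen_ai.input.messages for a multimodal call.
# The role/parts layout follows the in-development GenAI semconv; the
# image_url part and its field names are assumptions, not settled schema.
input_messages = [
    {
        "role": "user",
        "parts": [
            {"type": "text", "content": "Why was I billed twice this month?"},
            {"type": "image_url", "image_url": "https://blob.example.com/traces/billing-error.png"},
        ],
    }
]

# Most exporters serialize the structure to JSON before attaching it, e.g.
# span.set_attribute("gen_ai.input.messages", json.dumps(input_messages))
```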

Payload storage

Three patterns for the actual bytes:

  • Blob store with signed URL. S3, GCS, or Azure Blob. Payloads are written before the span exports; the span carries the signed URL on an OpenInference, traceAI, or vendor-specific attribute (often named image_url, audio_url, or similar). The OTel GenAI spec does not yet define these attribute names canonically. This is the dominant 2026 pattern because it survives at scale; a sketch follows this list.
  • Platform artifact store. Phoenix, FutureAGI, Langfuse all accept small media payloads via API and reference them on the span. Convenient for low-volume use; not designed for terabyte-scale ingest.
  • Inline base64. Acceptable for icon-sized images. Catastrophic for full-resolution images or audio. Avoid above a few hundred bytes.
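
A sketch of the blob-store pattern, assuming boto3, an S3 bucket named llm-trace-payloads, and an extension-style attribute name input.image_url; swap in GCS or Azure equivalents as needed.

```python
# Blob-store pattern: write the payload first, reference it on the span.
# Bucket name and the input.image_url attribute are illustrative assumptions.
import boto3
from opentelemetry import trace

s3 = boto3.client("s3")
tracer = trace.get_tracer("support-agent")

def record_image_span(image_bytes: bytes, key: str) -> None:
    # 1. Write the (already redacted) payload to blob storage.
    s3.put_object(Bucket="llm-trace-payloads", Key=key, Body=image_bytes)

    # 2. Mint a short-lived signed URL so the trace UI can render it later.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "llm-trace-payloads", "Key": key},
        ExpiresIn=3600,
    )

    # 3. Reference the URL on the span instead of inlining base64.
    with tracer.start_as_current_span("vision.embed") as span:
        span.set_attribute("input.image_url", url)
```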

Redaction

Pre-store redaction is the only compliant pattern for image and audio:

  • Images. Face blurring (OpenCV, AWS Rekognition with redaction, GCP Cloud Vision), document OCR-then-redact, license plate masking. Run before the upload to blob storage.
  • Audio. Transcribe with a speech-to-text model, run PII detection on the transcript, redact words, optionally regenerate audio with TTS. Or speaker-diarize and remove the speaker that contains the PII.
  • Video. Same as image plus same as audio, frame-by-frame plus track-by-track. Computationally heavy; sample to disk and process offline.

Storing originals in the trace backend and redacting in the dashboard is a compliance failure pattern. The trace backend should never see the original PII payload.
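
A minimal pre-store image-redaction sketch using OpenCV's bundled Haar cascade for face detection. Production pipelines typically use stronger detectors (Rekognition, Cloud Vision); the point here is only the ordering: redact first, upload second.

```python
# Pre-store redaction sketch: blur faces before the artifact ever reaches
# blob storage or the trace backend. Haar cascades are a weak detector;
# treat this as an ordering illustration, not a production redactor.
import cv2

def blur_faces(path_in: str, path_out: str) -> None:
    img = cv2.imread(path_in)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
        img[y:y + h, x:x + w] = cv2.GaussianBlur(img[y:y + h, x:x + w], (51, 51), 0)
    cv2.imwrite(path_out, img)

# Only the redacted artifact is uploaded; the original never leaves the side pipeline.
```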

Sampling

Modality-aware sampling matters because the cost structure differs by orders of magnitude.

  • Text-only routes. Sample at 5-20 percent of traffic. Storage cost is bounded by token count.
  • Image-heavy routes. Sample at 1-5 percent of traffic. A 1024 x 1024 image is roughly 700 KB; even with URL references on the span, the payloads behind those URLs add up quickly as sample volume grows.
  • Audio-heavy routes. Sample at 1-3 percent. One minute of 16 kHz audio is roughly 1 MB.
  • Errors and high-cost. Sample at 100 percent regardless of modality. The diagnostic value swamps the storage cost.

Tail-based sampling (decide after the trace completes) is the right strategy for multimodal. Head-based sampling makes the modality-aware decision impossible.
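
A sketch of the tail-sampling decision, applied once the trace is complete. The rate table mirrors the figures above; the error flag and modality set are assumptions about your own pipeline, not an OTel API.

```python
# Modality-aware tail-sampling decision, evaluated after the trace completes.
# Rates mirror the guidance above; inputs are assumptions about your pipeline.
import random

SAMPLE_RATES = {
    "error": 1.00,  # always keep failures and high-cost traces
    "audio": 0.02,  # audio-heavy routes
    "image": 0.03,  # image-heavy routes
    "text":  0.10,  # text-only routes
}

def keep_trace(has_error: bool, modalities: set[str]) -> bool:
    if has_error:
        return True
    if "audio" in modalities:
        rate = SAMPLE_RATES["audio"]
    elif "image" in modalities:
        rate = SAMPLE_RATES["image"]
    else:
        rate = SAMPLE_RATES["text"]
    return random.random() < rate
```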

Tools that handle multimodal tracing

Five tools cover the multimodal path with first-party support:

  • FutureAGI traceAI. Apache 2.0, OTel-compatible tracing library with integrations for OpenAI, Anthropic, Google GenAI, Vertex AI, and many frameworks. The voice observability docs show recording-URL handling, and the judge model docs list image- and audio-capable evaluators.
  • Arize Phoenix with OpenInference. ELv2. Strong vision and audio span semantics. Same instrumentation in Phoenix and Arize AX.
  • Langfuse. MIT core. Multimodal trace support landed in 2025; payloads stored via media references.
  • OpenInference (Arize). The instrumentation library used by Phoenix and many other backends. Apache 2.0 packages for Python, JavaScript, Java. A minimal wiring sketch follows this list.
  • OpenAI / Anthropic / Google vendor SDKs. First-class multimodal inputs and OTLP export hooks. The schema gap is filled by traceAI or OpenInference auto-instrumentation.
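
A minimal wiring sketch for the Phoenix plus OpenInference path. The register and OpenAIInstrumentor calls follow the arize-phoenix-otel and openinference package docs; verify the names against current releases.

```python
# Wiring sketch: Phoenix as the OTel backend, OpenInference auto-instrumentation
# for OpenAI calls. Package and function names per the Phoenix / OpenInference
# docs at time of writing; check current releases before relying on them.
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register(project_name="support-agent")  # points OTLP export at Phoenix
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here, OpenAI client calls emit OpenInference spans, including
# multimodal message parts, without further code changes.
```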

The tools to avoid are pure-text observability tools that don’t ingest media URLs (most APM tools, several legacy LLMOps platforms). They store the URL as a string but never render or judge the payload.

Common mistakes when tracing multimodal apps

  • Inlining large payloads as base64. Storage cost balloons; OTLP exporters time out. Switch to URL references in week one.
  • Sampling at the trace level without modality awareness. A 5 percent sample on an image-heavy route still costs 100x a text-only route. Pick sampling rates per route.
  • Storing originals before redaction. The trace backend should never see the original PII payload. Redact in a side pipeline before upload.
  • Treating tool spans as opaque text. A tool that returns an image must tag its modality. The trace renderer otherwise truncates the payload as a string.
  • Missing the cross-modality trace_id. A vision span and a text span from the same invocation share the trace_id. Without it, multimodal debugging is impossible.
  • Pure-text judges on multimodal outputs. A text-only judge cannot evaluate image groundedness or audio accuracy. Pick judges that match the modality.
  • No retention policy on payloads. Image and audio payloads accumulate quickly. Set retention shorter than text trace retention; the legal exposure is larger.

The future of multimodal tracing

Three trends shape the next 18 months.

OTel GenAI semconv stabilization. The core spec saw broad adoption in 2026 even while still marked Development; the multimodal extensions lag further behind. Expect the schema to converge by mid-2027.

Multimodal-aware judges become standard. Image groundedness, audio transcription accuracy, video event detection. Distilled multimodal judges (cheaper than frontier multimodal models) make 100 percent online scoring possible.

On-device redaction. Privacy-first deployments (healthcare, defense) push redaction into the client. The trace backend never sees raw payloads at all; spans carry only redacted URLs. Tools that handle on-device-redacted payloads natively will lead.


Read next: What is LLM Tracing?, LLM Tracing Best Practices, Best LLM Tracing Tools 2026

Frequently asked questions

What is multimodal LLM tracing?
Multimodal LLM tracing is the practice of capturing OpenTelemetry spans that record image, audio, video, and text inputs and outputs as part of the same trace tree. A single user invocation might involve a vision span (image embedding), an audio span (transcription), a text span (LLM generation), and a tool span (function call). The trace tree links them with span attributes that name the modality and a payload reference that handles the size question.
How is multimodal tracing different from text-only LLM tracing?
Three differences. First, the payload is large: a 1024 x 1024 image is roughly 700 KB, a one-minute audio clip is roughly 1 MB, while a text prompt is rarely above 16 KB. Storing payloads inline blows up your trace storage. Second, redaction is harder: PII in an image or an audio clip needs vision or transcription to detect. Third, sampling decisions are payload-aware: you sample images differently from text because the cost structure differs.
What does the OTel schema for multimodal spans look like in 2026?
The OTel GenAI semantic conventions are still in Development status as of 2026. The current spec covers gen_ai.request.model, gen_ai.input.messages and gen_ai.output.messages (with parts), gen_ai.output.type, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.tool.definitions. Multimodal payload URLs are commonly carried as OpenInference, traceAI, or vendor extensions on the span (image_url, audio_url, video_url). The convention in 2026 is to store small inline payloads as attributes and large payloads as URL references to a blob store, with the URL on the span.
Where do you store the actual image and audio payloads?
Three patterns. First, S3 or GCS blob storage with a signed URL on the span attribute. This is the dominant pattern in 2026. Second, a dedicated artifact store inside the observability platform (Phoenix, FutureAGI, Langfuse all support this for small payloads). Third, inline storage, which is acceptable only below a few hundred bytes; trace storage systems were not designed for megabyte payloads, and inline-only patterns become unworkable above a few thousand traces a day.
How do you redact PII in image and audio traces?
Image redaction needs vision: face blurring, license-plate masking, OCR-then-redact for documents. Audio redaction needs transcription plus PII detection on the transcript, or speaker-diarization plus voice-removal. Both are slower than text redaction. The pattern in 2026 is pre-store redaction in a side pipeline: redact, write the redacted artifact to blob storage, link the span to the redacted URL. Storing the original then redacting in the dashboard is a compliance failure pattern.
Which tools support multimodal LLM tracing in 2026?
OTel-native tools that handle multimodal payloads with URL references: FutureAGI traceAI (Apache 2.0), Arize Phoenix (ELv2 with OpenInference), Langfuse (MIT core with multimodal trace support). Vendor SDKs that handle multimodal natively: OpenAI Python and JavaScript clients with the responses API, Anthropic Python and TypeScript with vision messages, Google Vertex SDK with image and audio inputs. The tracing tool you pick has to ingest URLs and treat them as first-class span attributes, not just truncated strings.
What are common mistakes when tracing multimodal apps?
Four. First, inlining large payloads as base64 strings in span attributes; this kills storage cost and exporter throughput. Second, sampling at the trace level without considering modality cost (an image-heavy route should sample at a lower rate than a text-only route). Third, treating tool spans as opaque text; a tool that returns an image should mark its modality. Fourth, missing the cross-modality alignment: when a vision span and a text span are part of the same invocation, they must share the trace_id.
Does FutureAGI support multimodal tracing?
Yes. traceAI is an Apache 2.0, OTel-compatible tracing library with integrations for OpenAI, Anthropic, Google GenAI, Vertex AI, and many frameworks. Payload URLs ride on the span as vendor or OpenInference-style extensions; large payloads should live in blob storage with the URL referenced on the span. FutureAGI's evaluator catalog includes audio-transcription and image-related judges; check the [docs](https://docs.futureagi.com/docs/evaluation/concepts/judge-models/) for the current evaluator list.