Guides

Multimodal LLM Tracing in 2026: The Methodology That Actually Works

Multimodal LLM tracing for Gemini Vision, GPT-5 Vision, Claude Vision. Modality attribution on the span, per-modality cost, PII at the modality boundary, traceAI schema.

·
Updated
·
11 min read
llm-tracing multi-modal opentelemetry vision audio traceai 2026
Editorial cover image for Multi-Modal LLM Tracing in 2026
Table of Contents

Tracing a GPT-5 Vision turn with a text-era tracing library shows a single LLM span with prompt: "[image attached]" and a token count that’s 5x what the text would have predicted. Nothing tells you the image was a 4K screenshot the user pasted by accident, nothing separates vision tokens from text tokens in the cost number, and nothing flagged that the screenshot contained an unredacted invoice. Multimodal tracing isn’t text-tracing with audio glued on. It needs three things text-tracing doesn’t: input-content-type attribution on every span, per-modality cost accounting, and PII handling at the modality boundary. This post is the working methodology for all three.

Why text-tracing breaks for multimodal models

Multimodal LLM workloads in 2026 routinely route through more than one input type per call. Gemini Vision, GPT-5 Vision, and Claude Vision all accept text, image, and audio in a single message. The downstream tracing story hasn’t kept up, and the failures show up in three predictable places.

The payload stops being a string. Text-era schemas treat messages as a list of strings with role and content. A Gemini Vision turn pulls in a 700 KB image, a 30-second audio clip, and a tool result inside one message. Most text-era tracing libraries do one of three things when they see that: drop the non-text parts, base64-encode them into a single attribute that blows up the exporter, or store the URL and skip the modality metadata. None of that is useful for debugging.

Tokens stop being a single counter. A vision token is 2-5x more expensive than a text token, and image-token arithmetic is provider-specific. GPT-5 Vision: 85 tokens for low detail; ceil(width/512) * ceil(height/512) * 170 + 85 for high detail. Claude Vision: tile-based with a different geometry. Gemini Vision: 258 tokens flat per image up to 3072x3072. Audio is usually priced per second, not per token. A single total_tokens attribute hides every cost regression that matters.

PII risk moves to the modality boundary. A text PII scanner finds emails and SSNs. It doesn’t find a face in an uploaded photo, an account number visible in a screenshot, or a customer name spoken in an audio clip. Once images, audio, and video enter the trace, the PII surface is the modality boundary, not the text string.

The three problems compose. A vision app where finance can’t see the cost spike, security can’t see the PII surface, and engineering can’t filter spans by modality is a tracing setup that ships zero usable signal the day a real multimodal failure happens.

The three things multimodal tracing needs

The methodology for tracing multimodal LLM apps in 2026 ships as three concrete additions to the text-tracing baseline. They map cleanly to the three failures above.

1. Input content type on every span part

The unit of attribution is the part, not the message. Each input part on an LLM span carries an explicit type: text, image, audio, or video. The OTel GenAI semantic conventions ship this as a structured input.messages array where each message’s content is itself a list of parts; OpenInference ships the same thing under llm.input_messages with message.contents as the part list. Both vocabularies converged on the same shape in late 2025, and both are what register(semantic_convention=...) in traceAI emits today.

The query you want to be able to run on a trace store: “of the spans that failed the brand_safety judge yesterday, how many had an image in the input?” That query is a single filter when the part type is a first-class attribute. It’s an OCR-on-prompt-strings nightmare when it isn’t.

Practical attributes on every multimodal LLM span:

  • gen_ai.input.messages[*].content[*].type — text, image, audio, video
  • gen_ai.input.messages[*].content[*].image_url — pointer to object storage, not the base64 payload
  • gen_ai.input.messages[*].content[*].audio.format — wav, mp3, opus
  • gen_ai.input.messages[*].content[*].audio.duration_seconds
  • gen_ai.input.modalities — denormalized list of part types present, useful for cheap filtering

The denormalized gen_ai.input.modalities attribute is the one that lets you build dashboards. Cardinality is bounded (at most four values: text, image, audio, video), so it works as a chart dimension without exploding the storage cost.

2. Per-modality cost and usage attribution

Two attributes on the LLM span are not enough. A working multimodal cost schema splits usage and cost by modality:

  • gen_ai.usage.input_tokens.text
  • gen_ai.usage.input_tokens.image
  • gen_ai.usage.input_tokens.audio_seconds (audio is per-second on most providers)
  • gen_ai.usage.input_tokens.video_seconds
  • gen_ai.usage.output_tokens.text
  • gen_ai.usage.output_tokens.image (for image-generation calls)
  • gen_ai.cost.input.image_usd, gen_ai.cost.input.audio_usd, gen_ai.cost.input.text_usd
  • gen_ai.cost.output.image_usd, gen_ai.cost.output.audio_usd, gen_ai.cost.output.text_usd
  • gen_ai.cost.price_snapshot_date — the date the price table was loaded; vision pricing moves often enough that a stale snapshot causes a silent regression

The split matters operationally. A 4x cost spike that finance sees on Tuesday morning is either a vision-token spike (a marketing user started pasting 4K screenshots), an audio-seconds spike (a customer support bot started transcribing long voicemails), a request-count spike (traffic), or a model-swap spike (a router moved to a more expensive vision model). Each has a different fix, and a single total_cost_usd attribute can’t tell them apart.

Tip: capture the estimate before the forward and the actual after. GPT-5 Vision’s billed tokens vary by 5-15% from the documented formula because tiling logic rounds based on aspect ratio. The estimate is what your budget cap enforces against; the actual is what you reconcile to.

3. PII at the modality boundary

Multimodal payloads carry more PII risk per byte than text. A screenshot can leak a credential field, a password manager UI, or a chat thread. A voice clip leaks the speaker’s voice biometric and any spoken numbers. A video leaks faces, license plates, and physical environments. Three rules keep the trace store clean:

Redact before commit. The trace store should never hold the raw image bytes or the raw audio. Either store the payload in object storage and put the URL on the span, or strip the part entirely at the SDK layer before the span is exported. traceAI ships the second knob as a set of environment variables, so the right defaults ship without an app-code change:

  • FI_HIDE_INPUT_IMAGES strips image attributes from input messages while leaving the part structure intact
  • FI_HIDE_INPUT_TEXT and FI_HIDE_OUTPUT_TEXT redact text parts while leaving image and audio metadata
  • FI_HIDE_INPUT_MESSAGES and FI_HIDE_OUTPUT_MESSAGES drop the full message arrays
  • FI_BASE64_IMAGE_MAX_LENGTH caps inline base64 length so a stray large image doesn’t slip into the span attribute table
  • FI_PII_REDACTION=true enables the regex scanner for emails, SSNs, credit cards, IPs, API keys, and phone numbers across the text portions of multimodal messages

Hidden values are replaced with __REDACTED__ so the part structure stays valid and downstream evals still have something to read.

Run image and audio guardrails on the request path, not in a side queue. A multimodal PII problem caught in a nightly batch is a multimodal PII problem that already leaked. A guardrail layer that runs inline at modality-aware latency (more on this below) is the only architecture that catches the leak before the trace is committed.

Sample by modality budget, not by trace count. An image-heavy route generates traces 50-100x larger than a text-only route. Head-based percentage sampling that drops 90% of all traces still preserves 90% of the bytes if those bytes are concentrated in the 10% of traces with images. Sample on gen_ai.input.modalities to keep modality coverage honest.

traceAI’s multimodal span schema

The traceAI semantic conventions are verifiable in python/fi_instrumentation/fi_types.py and shipped on every span the corresponding instrumentor emits. Four namespaces are uniquely multimodal-first.

gen_ai.image.* — image input and image generation. Covers prompt, negative_prompt, width, height, quality, style, steps, guidance_scale, seed, format, count, revised_prompt, and output_urls. The revised_prompt field catches DALL-E 3 and gpt-image-1 prompt-rewriter regressions: both rewrite the user prompt before generating, and logging only the original hides the actual cause when an output drifts from the request.

gen_ai.audio.* — audio input, audio output, and TTS/STT. Covers format, duration_seconds, language, model, provider, voice_id, and the per-stage latency breakdown. For voice agents on Pipecat or LiveKit, the equivalent gen_ai.voice.* namespace adds telephony attributes (call_id, from_number, to_number, channel_type), interruption counts, and per-stage cost breakdown across STT, TTS, LLM, and telephony. See OpenInference + OpenTelemetry for voice agents for the voice mapping in detail.

gen_ai.video.* — video input, frame extraction, and video generation. Covers duration_seconds, fps, resolution, codec, frame_count_sampled, and the per-frame token estimate when the provider exposes one (Gemini’s video pricing samples 1 frame per second by default).

gen_ai.computer_use.* — for vision-driven agentic workloads where the model operates a virtual desktop. Covers action, coordinate_x, coordinate_y, screenshot, viewport_width, viewport_height, current_url, element_selector, and result. When an agent clicks the wrong button, the screenshot plus coordinates plus selector tell you in one span what the model saw, what it decided, and what actually happened.

All four namespaces compose with the OpenInference span-kind taxonomy. traceAI ships 14 span kinds (LLM, CHAIN, RETRIEVER, TOOL, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR, CONVERSATION, VECTOR_DB, A2A_CLIENT, A2A_SERVER, UNKNOWN); Phoenix ships 8, Langfuse ships 5. The extras that matter most for multimodal: GUARDRAIL as a top-level kind (so image and audio safety decisions are first-class spans), EVALUATOR as a top-level kind (so a multimodal judge score attaches to the span it scored), and A2A_CLIENT/A2A_SERVER (so a multimodal hand-off between agents stays on one trace tree).

Protect for image and audio: 4 LoRA + Flash

Multimodal payloads break text-only guardrails the moment they hit the request. A toxicity classifier that only reads message.content strings misses the toxic image overlay, the unsafe voice tone, and the screenshot showing a credential. Protect ships per-modality coverage as four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) with prompt templates tuned per modality. The published benchmark in arXiv 2510.13351 reports 65 ms p95 text and 107 ms p95 image median time-to-label, fast enough to run inline on every multimodal call without breaking the per-turn latency budget.

The integration into traceAI is one line. When ai-evaluation is installed, the OpenAIInstrumentor (and the equivalent Anthropic and Google instrumentors) auto-attaches a GuardrailProtectWrapper around fi.evals.Protect.protect. Every guardrail evaluation becomes a GUARDRAIL span on the same trace tree as the LLM span, populated with:

  • guardrail.status, guardrail.failed_rule, guardrail.reasons, guardrail.time_taken
  • guardrail.completed_rules, guardrail.uncompleted_rules
  • gen_ai.guardrail.modality — text, image, audio, or video
  • gen_ai.guardrail.name, gen_ai.guardrail.type, gen_ai.guardrail.result, gen_ai.guardrail.score, gen_ai.guardrail.modified_output

That gives you “the model saw the image, the image guardrail flagged it for data_privacy_compliance in 107 ms, here’s the modified output we returned” as one contiguous span chain. Compare with the usual pattern: a separate guardrail logs table that you JOIN against the trace by request ID after the fact, with no inline latency visible to the SRE.

For sub-10 ms deterministic pre-filters on text portions of multimodal messages, fi.evals.guardrails.scanners ships JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, LanguageScanner, TopicRestrictionScanner, and RegexScanner. The deterministic layer catches the obvious; the LoRA adapters catch the modality-specific.

Practical setup for a GPT-5 Vision multimodal agent

The minimal three-line setup for a multimodal app:

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="my_multimodal_agent",
    project_version_name="v1.0.0",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

The OpenAI instrumentor captures multimodal payloads natively: image parts on client.chat.completions.create(...), audio I/O on the responses API, image generation on client.images.generate(...), and the TTS/STT calls. No application-code change is needed. For Claude Vision and Gemini Vision, the equivalent native instrumentors preserve the provider’s shape end-to-end:

from traceai_anthropic import AnthropicInstrumentor
from traceai_google_genai import GoogleGenAIInstrumentor

AnthropicInstrumentor().instrument(tracer_provider=trace_provider)
GoogleGenAIInstrumentor().instrument(tracer_provider=trace_provider)

For routing through an AI gateway, the Agent Command Center ships native /v1/messages (Anthropic) and /v1beta (Gemini) adapters so the multimodal payload stays in the provider’s shape end-to-end. Many gateways translate every request into an OpenAI envelope on the way in and back out, which works for text and breaks for Claude tool-use blocks, Gemini parts arrays, and Anthropic computer-use action payloads because the OpenAI shape doesn’t have a place for them. For the broader gateway picture see the best AI gateways for image and vision LLM routing.

Common mistakes when tracing multimodal apps

Four patterns that show up in audits:

  1. Inlining megabyte payloads as base64 attributes. Blows up exporter throughput and storage cost. Put the blob in object storage, put the URL on the span. Use FI_BASE64_IMAGE_MAX_LENGTH as a hard cap.
  2. Treating the voice transcript as the source of truth. The transcript is a derived artifact. The audio is the source. If you only log the transcript, you can’t replay the failure with a better STT model later.
  3. Sampling at the trace level without modality awareness. An image-heavy route generates traces 50-100x more expensive than a text-only route. Sample by modality budget, not by trace count.
  4. Logging total_tokens only. A single counter hides every cost regression that matters. Split usage and cost by modality; pin a price_snapshot_date so a silent provider price change doesn’t look like a usage spike.

Where Future AGI fits

Most multimodal observability setups end up with a text-only trace store, a separate object store for screenshots and audio, a separate logs table for guardrail decisions, and a separate eval dashboard that re-fetches traces by ID. Future AGI compresses that into one tree: traceAI emits OTel spans with the multimodal namespaces above, Protect attaches per-modality guardrail decisions as GUARDRAIL spans on the same trace, and ai-evaluation’s CustomLLMJudge accepts inline image and audio payloads through LiteLLM so a single rubric scores text, image, and audio attached to the same span. The 60+ EvalTemplate classes (Toxicity, PromptInjection, DataPrivacyCompliance, Groundedness, ContextAdherence, ChunkAttribution, AnswerRefusal, TaskCompletion, and the rest) read message.contents rather than the scalar message.content, so they apply to multimodal traces without a rewrite.

The pluggable convention layer is the differentiator on the wire-format side. register(semantic_convention=...) accepts FI (default), OTEL_GENAI (OpenTelemetry GenAI SIG), OPENINFERENCE (Arize Phoenix), or OPENLLMETRY (Traceloop). Teams already standardized on one of those keep their downstream tools and still get the multimodal namespaces. Phoenix and Langfuse hardcode one convention each.

Honest limitation: more moving parts than a pure instrumentation library when all you want is the SDK. The multimodal namespaces are also newer than the text equivalents — provider attribute drift (Gemini’s video frame sampling rate, Anthropic’s image-tile geometry) needs an instrumentor bump every few months, which is the cost of staying close to the provider shape instead of flattening to OpenAI.

For the broader tracing-tool picture see the best LLM tracing tools in 2026 and what is LLM tracing. For the compliance pattern around multimodal PII see AI compliance and guardrails for enterprise LLMs.

TL;DR

Multimodal LLM tracing in 2026 needs three things text-tracing doesn’t: input-content-type attribution on every span part (so you can query by modality), per-modality cost and usage accounting (so a vision spike separates from a request spike), and PII handling at the modality boundary (so a screenshot or voice clip never reaches the trace store raw). OTel-GenAI and OpenInference ship the first; traceAI ships the second and third on top, with gen_ai.image.*, gen_ai.audio.*, gen_ai.video.*, and gen_ai.computer_use.* namespaces, Protect’s four Gemma 3n LoRA adapters at 65 ms text and 107 ms image p95, and SDK-layer redaction knobs that keep the raw blob out of the span store. The result is one trace tree where finance can see the cost split, security can see the PII surface, and engineering can filter spans by modality without grepping prompt strings.

Frequently asked questions

Why does text-tracing break for multimodal LLMs?
Three reasons. Payloads stop being strings (a Gemini Vision turn pulls a 700 KB image, an audio clip, and a tool result into one span), tokens stop being a single counter (vision tokens are 2-5x more expensive than text tokens and image-token math is provider-specific), and PII risk moves to the modality boundary (a screenshot leaks an account number, a voice clip leaks a name). Text-era tracing libraries either stringify the non-text payload as base64 that breaks the exporter, drop it, or store the URL and skip the modality-specific attributes.
What does modality attribution on the span actually look like?
Every input part on the span carries an explicit type. Under the OTel GenAI conventions, the input message is an array with each part tagged as text, image, audio, or video. Under OpenInference the same thing lives under message.contents. The unit of attribution is the part, not the message. That lets you answer 'how many of my failing spans had an image in the input?' as a single query on the trace store, instead of regex-grepping prompt strings.
How do I attribute cost per modality?
Capture per-modality token counts and per-modality price snapshots on the LLM span. GPT-5 Vision: 85 tokens low detail, ~170 tokens per 512x512 tile high detail plus an 85-token base. Claude Vision: tile-based with a different geometry. Gemini Vision: 258 tokens flat per image up to 3072x3072. Audio is priced per second on most providers, not per token. Roll all of that into gen_ai.usage.input_tokens.image, gen_ai.usage.input_tokens.audio_seconds, and the corresponding cost fields, and your finance dashboard finally separates a vision spike from a request spike.
What is PII at the modality boundary?
Image, audio, and video payloads carry PII that text PII scanners never see. A screenshot can leak a credential field, password manager UI, or chat thread with a customer name. A voice clip leaks the speaker's voice biometric and any spoken numbers. A video leaks faces, license plates, and physical environments. The modality boundary is where you redact before the payload is committed to the trace store; FI_HIDE_INPUT_IMAGES, FI_HIDE_INPUT_TEXT, FI_BASE64_IMAGE_MAX_LENGTH, and FI_PII_REDACTION in traceAI handle this at the SDK layer so the trace store never holds the raw blob in the first place.
How does Protect handle image and audio safety?
Protect runs four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) with prompt templates tuned per modality. Published benchmark in arXiv 2510.13351 reports 65 ms p95 on text and 107 ms p95 on image, fast enough to run inline on each call. When ai-evaluation is installed, the OpenAIInstrumentor auto-wraps Protect.protect with GuardrailProtectWrapper, so every guardrail decision lands as a GUARDRAIL span next to the LLM span that triggered it, with status, failed_rule, reasons, and time_taken queryable as attributes.
What does setup look like for a GPT-5 Vision plus audio app?
Three lines. Call register(project_type=ProjectType.OBSERVE, project_name='multimodal_agent') from fi_instrumentation, then OpenAIInstrumentor().instrument(tracer_provider=trace_provider). The OpenAI patch captures multimodal payloads natively: image parts on chat completions, audio I/O on the responses API, and image generation calls. For Claude Vision and Gemini Vision, swap AnthropicInstrumentor or GoogleGenAIInstrumentor. Each preserves the provider's native shape end-to-end so multimodal attributes survive the round trip.
Related Articles
View all