Multimodal LLM Tracing in 2026: The Methodology That Actually Works
Multimodal LLM tracing for Gemini Vision, GPT-5 Vision, Claude Vision. Modality attribution on the span, per-modality cost, PII at the modality boundary, traceAI schema.
Table of Contents
Tracing a GPT-5 Vision turn with a text-era tracing library shows a single LLM span with prompt: "[image attached]" and a token count that’s 5x what the text would have predicted. Nothing tells you the image was a 4K screenshot the user pasted by accident, nothing separates vision tokens from text tokens in the cost number, and nothing flagged that the screenshot contained an unredacted invoice. Multimodal tracing isn’t text-tracing with audio glued on. It needs three things text-tracing doesn’t: input-content-type attribution on every span, per-modality cost accounting, and PII handling at the modality boundary. This post is the working methodology for all three.
Why text-tracing breaks for multimodal models
Multimodal LLM workloads in 2026 routinely route through more than one input type per call. Gemini Vision, GPT-5 Vision, and Claude Vision all accept text, image, and audio in a single message. The downstream tracing story hasn’t kept up, and the failures show up in three predictable places.
The payload stops being a string. Text-era schemas treat messages as a list of strings with role and content. A Gemini Vision turn pulls in a 700 KB image, a 30-second audio clip, and a tool result inside one message. Most text-era tracing libraries do one of three things when they see that: drop the non-text parts, base64-encode them into a single attribute that blows up the exporter, or store the URL and skip the modality metadata. None of that is useful for debugging.
Tokens stop being a single counter. A vision token is 2-5x more expensive than a text token, and image-token arithmetic is provider-specific. GPT-5 Vision: 85 tokens for low detail; ceil(width/512) * ceil(height/512) * 170 + 85 for high detail. Claude Vision: tile-based with a different geometry. Gemini Vision: 258 tokens flat per image up to 3072x3072. Audio is usually priced per second, not per token. A single total_tokens attribute hides every cost regression that matters.
PII risk moves to the modality boundary. A text PII scanner finds emails and SSNs. It doesn’t find a face in an uploaded photo, an account number visible in a screenshot, or a customer name spoken in an audio clip. Once images, audio, and video enter the trace, the PII surface is the modality boundary, not the text string.
The three problems compose. A vision app where finance can’t see the cost spike, security can’t see the PII surface, and engineering can’t filter spans by modality is a tracing setup that ships zero usable signal the day a real multimodal failure happens.
The three things multimodal tracing needs
The methodology for tracing multimodal LLM apps in 2026 ships as three concrete additions to the text-tracing baseline. They map cleanly to the three failures above.
1. Input content type on every span part
The unit of attribution is the part, not the message. Each input part on an LLM span carries an explicit type: text, image, audio, or video. The OTel GenAI semantic conventions ship this as a structured input.messages array where each message’s content is itself a list of parts; OpenInference ships the same thing under llm.input_messages with message.contents as the part list. Both vocabularies converged on the same shape in late 2025, and both are what register(semantic_convention=...) in traceAI emits today.
The query you want to be able to run on a trace store: “of the spans that failed the brand_safety judge yesterday, how many had an image in the input?” That query is a single filter when the part type is a first-class attribute. It’s an OCR-on-prompt-strings nightmare when it isn’t.
Practical attributes on every multimodal LLM span:
gen_ai.input.messages[*].content[*].type— text, image, audio, videogen_ai.input.messages[*].content[*].image_url— pointer to object storage, not the base64 payloadgen_ai.input.messages[*].content[*].audio.format— wav, mp3, opusgen_ai.input.messages[*].content[*].audio.duration_secondsgen_ai.input.modalities— denormalized list of part types present, useful for cheap filtering
The denormalized gen_ai.input.modalities attribute is the one that lets you build dashboards. Cardinality is bounded (at most four values: text, image, audio, video), so it works as a chart dimension without exploding the storage cost.
2. Per-modality cost and usage attribution
Two attributes on the LLM span are not enough. A working multimodal cost schema splits usage and cost by modality:
gen_ai.usage.input_tokens.textgen_ai.usage.input_tokens.imagegen_ai.usage.input_tokens.audio_seconds(audio is per-second on most providers)gen_ai.usage.input_tokens.video_secondsgen_ai.usage.output_tokens.textgen_ai.usage.output_tokens.image(for image-generation calls)gen_ai.cost.input.image_usd,gen_ai.cost.input.audio_usd,gen_ai.cost.input.text_usdgen_ai.cost.output.image_usd,gen_ai.cost.output.audio_usd,gen_ai.cost.output.text_usdgen_ai.cost.price_snapshot_date— the date the price table was loaded; vision pricing moves often enough that a stale snapshot causes a silent regression
The split matters operationally. A 4x cost spike that finance sees on Tuesday morning is either a vision-token spike (a marketing user started pasting 4K screenshots), an audio-seconds spike (a customer support bot started transcribing long voicemails), a request-count spike (traffic), or a model-swap spike (a router moved to a more expensive vision model). Each has a different fix, and a single total_cost_usd attribute can’t tell them apart.
Tip: capture the estimate before the forward and the actual after. GPT-5 Vision’s billed tokens vary by 5-15% from the documented formula because tiling logic rounds based on aspect ratio. The estimate is what your budget cap enforces against; the actual is what you reconcile to.
3. PII at the modality boundary
Multimodal payloads carry more PII risk per byte than text. A screenshot can leak a credential field, a password manager UI, or a chat thread. A voice clip leaks the speaker’s voice biometric and any spoken numbers. A video leaks faces, license plates, and physical environments. Three rules keep the trace store clean:
Redact before commit. The trace store should never hold the raw image bytes or the raw audio. Either store the payload in object storage and put the URL on the span, or strip the part entirely at the SDK layer before the span is exported. traceAI ships the second knob as a set of environment variables, so the right defaults ship without an app-code change:
FI_HIDE_INPUT_IMAGESstrips image attributes from input messages while leaving the part structure intactFI_HIDE_INPUT_TEXTandFI_HIDE_OUTPUT_TEXTredact text parts while leaving image and audio metadataFI_HIDE_INPUT_MESSAGESandFI_HIDE_OUTPUT_MESSAGESdrop the full message arraysFI_BASE64_IMAGE_MAX_LENGTHcaps inline base64 length so a stray large image doesn’t slip into the span attribute tableFI_PII_REDACTION=trueenables the regex scanner for emails, SSNs, credit cards, IPs, API keys, and phone numbers across the text portions of multimodal messages
Hidden values are replaced with __REDACTED__ so the part structure stays valid and downstream evals still have something to read.
Run image and audio guardrails on the request path, not in a side queue. A multimodal PII problem caught in a nightly batch is a multimodal PII problem that already leaked. A guardrail layer that runs inline at modality-aware latency (more on this below) is the only architecture that catches the leak before the trace is committed.
Sample by modality budget, not by trace count. An image-heavy route generates traces 50-100x larger than a text-only route. Head-based percentage sampling that drops 90% of all traces still preserves 90% of the bytes if those bytes are concentrated in the 10% of traces with images. Sample on gen_ai.input.modalities to keep modality coverage honest.
traceAI’s multimodal span schema
The traceAI semantic conventions are verifiable in python/fi_instrumentation/fi_types.py and shipped on every span the corresponding instrumentor emits. Four namespaces are uniquely multimodal-first.
gen_ai.image.* — image input and image generation. Covers prompt, negative_prompt, width, height, quality, style, steps, guidance_scale, seed, format, count, revised_prompt, and output_urls. The revised_prompt field catches DALL-E 3 and gpt-image-1 prompt-rewriter regressions: both rewrite the user prompt before generating, and logging only the original hides the actual cause when an output drifts from the request.
gen_ai.audio.* — audio input, audio output, and TTS/STT. Covers format, duration_seconds, language, model, provider, voice_id, and the per-stage latency breakdown. For voice agents on Pipecat or LiveKit, the equivalent gen_ai.voice.* namespace adds telephony attributes (call_id, from_number, to_number, channel_type), interruption counts, and per-stage cost breakdown across STT, TTS, LLM, and telephony. See OpenInference + OpenTelemetry for voice agents for the voice mapping in detail.
gen_ai.video.* — video input, frame extraction, and video generation. Covers duration_seconds, fps, resolution, codec, frame_count_sampled, and the per-frame token estimate when the provider exposes one (Gemini’s video pricing samples 1 frame per second by default).
gen_ai.computer_use.* — for vision-driven agentic workloads where the model operates a virtual desktop. Covers action, coordinate_x, coordinate_y, screenshot, viewport_width, viewport_height, current_url, element_selector, and result. When an agent clicks the wrong button, the screenshot plus coordinates plus selector tell you in one span what the model saw, what it decided, and what actually happened.
All four namespaces compose with the OpenInference span-kind taxonomy. traceAI ships 14 span kinds (LLM, CHAIN, RETRIEVER, TOOL, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR, CONVERSATION, VECTOR_DB, A2A_CLIENT, A2A_SERVER, UNKNOWN); Phoenix ships 8, Langfuse ships 5. The extras that matter most for multimodal: GUARDRAIL as a top-level kind (so image and audio safety decisions are first-class spans), EVALUATOR as a top-level kind (so a multimodal judge score attaches to the span it scored), and A2A_CLIENT/A2A_SERVER (so a multimodal hand-off between agents stays on one trace tree).
Protect for image and audio: 4 LoRA + Flash
Multimodal payloads break text-only guardrails the moment they hit the request. A toxicity classifier that only reads message.content strings misses the toxic image overlay, the unsafe voice tone, and the screenshot showing a credential. Protect ships per-modality coverage as four Gemma 3n LoRA adapters (toxicity, bias_detection, prompt_injection, data_privacy_compliance) with prompt templates tuned per modality. The published benchmark in arXiv 2510.13351 reports 65 ms p95 text and 107 ms p95 image median time-to-label, fast enough to run inline on every multimodal call without breaking the per-turn latency budget.
The integration into traceAI is one line. When ai-evaluation is installed, the OpenAIInstrumentor (and the equivalent Anthropic and Google instrumentors) auto-attaches a GuardrailProtectWrapper around fi.evals.Protect.protect. Every guardrail evaluation becomes a GUARDRAIL span on the same trace tree as the LLM span, populated with:
guardrail.status,guardrail.failed_rule,guardrail.reasons,guardrail.time_takenguardrail.completed_rules,guardrail.uncompleted_rulesgen_ai.guardrail.modality— text, image, audio, or videogen_ai.guardrail.name,gen_ai.guardrail.type,gen_ai.guardrail.result,gen_ai.guardrail.score,gen_ai.guardrail.modified_output
That gives you “the model saw the image, the image guardrail flagged it for data_privacy_compliance in 107 ms, here’s the modified output we returned” as one contiguous span chain. Compare with the usual pattern: a separate guardrail logs table that you JOIN against the trace by request ID after the fact, with no inline latency visible to the SRE.
For sub-10 ms deterministic pre-filters on text portions of multimodal messages, fi.evals.guardrails.scanners ships JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, LanguageScanner, TopicRestrictionScanner, and RegexScanner. The deterministic layer catches the obvious; the LoRA adapters catch the modality-specific.
Practical setup for a GPT-5 Vision multimodal agent
The minimal three-line setup for a multimodal app:
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor
trace_provider = register(
project_type=ProjectType.OBSERVE,
project_name="my_multimodal_agent",
project_version_name="v1.0.0",
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
The OpenAI instrumentor captures multimodal payloads natively: image parts on client.chat.completions.create(...), audio I/O on the responses API, image generation on client.images.generate(...), and the TTS/STT calls. No application-code change is needed. For Claude Vision and Gemini Vision, the equivalent native instrumentors preserve the provider’s shape end-to-end:
from traceai_anthropic import AnthropicInstrumentor
from traceai_google_genai import GoogleGenAIInstrumentor
AnthropicInstrumentor().instrument(tracer_provider=trace_provider)
GoogleGenAIInstrumentor().instrument(tracer_provider=trace_provider)
For routing through an AI gateway, the Agent Command Center ships native /v1/messages (Anthropic) and /v1beta (Gemini) adapters so the multimodal payload stays in the provider’s shape end-to-end. Many gateways translate every request into an OpenAI envelope on the way in and back out, which works for text and breaks for Claude tool-use blocks, Gemini parts arrays, and Anthropic computer-use action payloads because the OpenAI shape doesn’t have a place for them. For the broader gateway picture see the best AI gateways for image and vision LLM routing.
Common mistakes when tracing multimodal apps
Four patterns that show up in audits:
- Inlining megabyte payloads as base64 attributes. Blows up exporter throughput and storage cost. Put the blob in object storage, put the URL on the span. Use
FI_BASE64_IMAGE_MAX_LENGTHas a hard cap. - Treating the voice transcript as the source of truth. The transcript is a derived artifact. The audio is the source. If you only log the transcript, you can’t replay the failure with a better STT model later.
- Sampling at the trace level without modality awareness. An image-heavy route generates traces 50-100x more expensive than a text-only route. Sample by modality budget, not by trace count.
- Logging
total_tokensonly. A single counter hides every cost regression that matters. Split usage and cost by modality; pin aprice_snapshot_dateso a silent provider price change doesn’t look like a usage spike.
Where Future AGI fits
Most multimodal observability setups end up with a text-only trace store, a separate object store for screenshots and audio, a separate logs table for guardrail decisions, and a separate eval dashboard that re-fetches traces by ID. Future AGI compresses that into one tree: traceAI emits OTel spans with the multimodal namespaces above, Protect attaches per-modality guardrail decisions as GUARDRAIL spans on the same trace, and ai-evaluation’s CustomLLMJudge accepts inline image and audio payloads through LiteLLM so a single rubric scores text, image, and audio attached to the same span. The 60+ EvalTemplate classes (Toxicity, PromptInjection, DataPrivacyCompliance, Groundedness, ContextAdherence, ChunkAttribution, AnswerRefusal, TaskCompletion, and the rest) read message.contents rather than the scalar message.content, so they apply to multimodal traces without a rewrite.
The pluggable convention layer is the differentiator on the wire-format side. register(semantic_convention=...) accepts FI (default), OTEL_GENAI (OpenTelemetry GenAI SIG), OPENINFERENCE (Arize Phoenix), or OPENLLMETRY (Traceloop). Teams already standardized on one of those keep their downstream tools and still get the multimodal namespaces. Phoenix and Langfuse hardcode one convention each.
Honest limitation: more moving parts than a pure instrumentation library when all you want is the SDK. The multimodal namespaces are also newer than the text equivalents — provider attribute drift (Gemini’s video frame sampling rate, Anthropic’s image-tile geometry) needs an instrumentor bump every few months, which is the cost of staying close to the provider shape instead of flattening to OpenAI.
For the broader tracing-tool picture see the best LLM tracing tools in 2026 and what is LLM tracing. For the compliance pattern around multimodal PII see AI compliance and guardrails for enterprise LLMs.
TL;DR
Multimodal LLM tracing in 2026 needs three things text-tracing doesn’t: input-content-type attribution on every span part (so you can query by modality), per-modality cost and usage accounting (so a vision spike separates from a request spike), and PII handling at the modality boundary (so a screenshot or voice clip never reaches the trace store raw). OTel-GenAI and OpenInference ship the first; traceAI ships the second and third on top, with gen_ai.image.*, gen_ai.audio.*, gen_ai.video.*, and gen_ai.computer_use.* namespaces, Protect’s four Gemma 3n LoRA adapters at 65 ms text and 107 ms image p95, and SDK-layer redaction knobs that keep the raw blob out of the span store. The result is one trace tree where finance can see the cost split, security can see the PII surface, and engineering can filter spans by modality without grepping prompt strings.
Frequently asked questions
Why does text-tracing break for multimodal LLMs?
What does modality attribution on the span actually look like?
How do I attribute cost per modality?
What is PII at the modality boundary?
How does Protect handle image and audio safety?
What does setup look like for a GPT-5 Vision plus audio app?
OTel for LLM apps in 2026 = OTel-GenAI + OpenInference + eval-as-span-attribute. The three layers, the traceAI register pattern, span enrichment, and sampling.
traceAI is the open-source OpenTelemetry-native tracing library for LLM and agent apps. Span model, 30+ integrations, OTLP transport, and how to choose your tracing layer in 2026.
Trace voice agents with traceAI in 2026: how STT/LLM/TTS/tool spans are captured, OTLP transport, the FAGI Observe backend, and traceAI code for LiveKit and Pipecat.