Python Decorator Tracing for LLM Apps in 2026: Patterns and Pitfalls
Decorator tracing for Python LLM apps in 2026: when to use @-tracing, when middleware fits better, OTel GenAI attributes, async pitfalls, cardinality.
A Python LLM app with three custom decorators on the wrong functions produces 14 spans per request, three of them duplicates from auto-instrumentation, none of them carrying the prompt version. The on-call engineer reading the trace wants to know which step regressed. The flat tree shows nothing useful. The decorator pattern is one of the cleaner ways to instrument Python; it is also one of the easiest ways to ship a trace stack that looks impressive and answers no operational questions.
This post walks through the decorator pattern for tracing LLM apps in Python, when it fits, when middleware fits better, and the specific pitfalls that matter when you put decorators in front of LLM, retriever, tool, and evaluator code. Examples use OpenTelemetry directly with OTel GenAI attributes (gen_ai.*); traceAI, OpenInference, and OpenLLMetry decorators apply the same patterns over their respective schemas (OpenInference uses fields like input.value, retrieval.documents, document.id).
TL;DR: When to reach for the decorator
| Scenario | Decorator fits | Better alternative |
|---|---|---|
| One Python function = one observable operation (LLM call, tool, retriever) | Yes | None |
| Custom business logic between LLM calls | Yes | None |
| FastAPI request envelope | No | OTel FastAPI auto-instrumentation |
| OpenAI, Anthropic, Bedrock client call | Maybe | Library auto-instrumentation if available |
| Long-running stream end-of-stream metric | No | Span event or end-of-stream callback |
| 200 hot-loop iterations per request | No | Aggregate, then one summary span |
The decorator is a precise tool. It gets you a stable span name, OTel GenAI attributes, and parent-child structure. It does not magically give you the right granularity. That is still a design call.
Why decorators fit function-level LLM tracing
Three reasons.
First, the Python decorator is the natural way to express “this function is an observable unit.” Reading the code, the @trace_llm line at the top of a chat-completion function tells the next engineer what the function is for, what its span name will be, and which OTel attributes will be set. No separate observability config file to consult.
Second, OTel’s tracer.start_as_current_span plus async support handles most of the heavy lifting. The decorator becomes a thin wrapper that calls the tracer, opens a span, runs the function, captures the result, and ends the span. Forty lines of Python.
Third, LLM-specific instrumentation libraries (traceAI, OpenInference, OpenLLMetry) shipped decorator or helper APIs that pre-fill an AI tracing schema from the wrapped call’s response. Raw OTel examples in this post use the gen_ai.* namespace (gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.response.finish_reasons); OpenInference-compatible libraries layer their own attributes (input.value, llm.model_name, retrieval.documents). The schema discipline lives in the library; you keep your application code clean.
The minimal OTel decorator
A working decorator that creates a span, captures latency, sets an operation name, and propagates exceptions:
```python
import functools
import time

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("llm-app")


def trace_op(name=None, kind=None):
    def decorator(fn):
        span_name = name or fn.__name__

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(span_name) as span:
                if kind:
                    span.set_attribute("gen_ai.operation.name", kind)
                t0 = time.perf_counter()
                try:
                    result = fn(*args, **kwargs)
                    span.set_status(Status(StatusCode.OK))
                    return result
                except Exception as exc:
                    span.set_status(Status(StatusCode.ERROR, str(exc)))
                    span.record_exception(exc)
                    raise
                finally:
                    span.set_attribute("duration.ms", (time.perf_counter() - t0) * 1000)
        return wrapper
    return decorator
```
That is the shape. Real decorators add async support, attribute propagation, and schema-aware extraction from the return value. The shape stays the same.
Decorator versus middleware versus auto-instrumentation
Three layers exist at once in any production Python LLM app. Pick consciously.
Auto-instrumentation runs at the library boundary. The OTel Python contrib repo ships instrumentations for FastAPI, Starlette, requests, httpx, asyncio, SQLAlchemy, and a long list of others. Install the package, call the instrumentor at process start, every request through that library gets a span. Zero application code touched.
Middleware runs at the framework hook layer. FastAPI middleware wraps the entire request pipeline. Starlette middleware does the same. The middleware-created span becomes the root span for the request. Auto-instrumentation often is middleware under the hood; you can also write your own middleware for cross-cutting concerns (correlation ids, tenant headers).
Decorators run at the function boundary. They give you per-function spans inside the request handler, after the middleware has opened the root span. The decorator-created spans nest under whatever span is current when the decorated function is called.
The pattern that works for most production stacks: auto-instrumentation for FastAPI plus the HTTP client (requests, httpx). Decorators on the LLM call, the retriever query, the tool function, the sub-agent dispatch, and the evaluator. Middleware is reserved for cross-cutting concerns that do not fit either of the other two layers.
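A minimal sketch of that layering, assuming the OTel contrib packages opentelemetry-instrumentation-fastapi and opentelemetry-instrumentation-httpx are installed; swap in the instrumentors for whichever framework and HTTP client you actually run:

```python
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

app = FastAPI()

# Layer 1: auto-instrumentation opens the request span and the HTTP client spans.
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()


# Layer 2: decorators on the meaningful units nest under the request span.
@trace_op(name="chat.openai", kind="chat")
def call_llm(messages):
    ...
```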
What does not work: writing a decorator that wraps every function the framework already auto-instruments. Duplicate spans, drift between attribute sets, and confusion about which span “won” the parent slot.

OTel GenAI attributes the decorator should set
For an LLM call site, the OTel GenAI semantic conventions name the canonical attribute set. The decorator extracts these from the response object and sets them on the span:
- `gen_ai.operation.name` (e.g., `chat`, `embeddings`, `text_completion`)
- `gen_ai.provider.name` (e.g., `openai`, `anthropic`, `aws.bedrock`, `gcp.vertex_ai`)
- `gen_ai.request.model`
- `gen_ai.response.id`
- `gen_ai.response.model` (when the response model differs from the request model, e.g., routing)
- `gen_ai.usage.input_tokens`
- `gen_ai.usage.output_tokens`
- `gen_ai.response.finish_reasons`
Cache and reasoning token attributes (when present in the provider response) get separate fields:
- `gen_ai.usage.cache_creation.input_tokens`
- `gen_ai.usage.cache_read.input_tokens`
- `gen_ai.usage.reasoning.output_tokens`
A reasoning model that emits 30,000 reasoning tokens before producing 500 visible output tokens has very different cost and latency characteristics from a non-reasoning chat call. Collapsing both into a single gen_ai.usage.output_tokens field hides the cost.
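As a sketch of what "extracts these from the response object" can look like, assuming an OpenAI-style chat completions response; other providers expose the same information under different field names, and the helper name is illustrative:

```python
def set_genai_attributes(span, response, request_model):
    # Assumes an OpenAI-style chat completions response object.
    span.set_attribute("gen_ai.request.model", request_model)
    span.set_attribute("gen_ai.response.id", response.id)
    span.set_attribute("gen_ai.response.model", response.model)
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
    span.set_attribute(
        "gen_ai.response.finish_reasons",
        [choice.finish_reason for choice in response.choices],
    )
    # Reasoning-token detail is optional in most SDK responses; guard for absence.
    details = getattr(response.usage, "completion_tokens_details", None)
    if details is not None and getattr(details, "reasoning_tokens", None):
        span.set_attribute("gen_ai.usage.reasoning.output_tokens", details.reasoning_tokens)
```

The decorator calls a helper like this right after the wrapped function returns, while the span is still open.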
For tool calls, the OTel conventions name gen_ai.tool.name and gen_ai.tool.call.id; if you add arguments, use a project-owned bounded attribute namespace and document it. Avoid setting the full argument JSON as a single attribute; pick the fields that matter for debugging.
For retrievers, the OTel conventions are still under development; OpenInference uses retrieval.documents with per-document fields like document.id, document.score, document.content, and document.metadata, with the query commonly carried on input attributes and top_k defined on reranker spans. Either schema works; pick one and stay consistent across the codebase.
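A hedged sketch of a retriever span using the OpenInference-style flattened document fields named above; the `vector_store` client, the document fields, and the `retrieval.document_count` summary attribute are illustrative, and per-document content stays behind the same opt-in used for prompts:

```python
@trace_op(name="retriever.vector_search")
def retrieve(query: str, top_k: int = 5):
    docs = vector_store.search(query, top_k=top_k)  # your own retriever client
    span = trace.get_current_span()
    span.set_attribute("retrieval.document_count", len(docs))  # project-owned summary field
    for i, doc in enumerate(docs):
        # OpenInference flattens per-document fields under an indexed prefix.
        span.set_attribute(f"retrieval.documents.{i}.document.id", doc.id)
        span.set_attribute(f"retrieval.documents.{i}.document.score", doc.score)
    return docs
```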
Async, streaming, and the context-manager subtlety
Async Python with start_as_current_span works:
```python
def trace_op_async(name=None, kind=None):
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(name or fn.__name__) as span:
                if kind:
                    span.set_attribute("gen_ai.operation.name", kind)
                return await fn(*args, **kwargs)
        return wrapper
    return decorator
```
The decorator factory is a plain def; only the wrapper is async. A unified decorator that handles both sync and async callables can dispatch with inspect.iscoroutinefunction(fn).
The with block holds the span open across the await. The OTel Python SDK installs the span context into the contextvar that asyncio uses for context propagation. Child tasks created with asyncio.create_task inherit the context.
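A sketch of that unified decorator, reusing the tracer and functools imports from the earlier examples:

```python
import inspect


def trace_any(name=None, kind=None):
    def decorator(fn):
        span_name = name or fn.__name__

        def _annotate(span):
            if kind:
                span.set_attribute("gen_ai.operation.name", kind)

        if inspect.iscoroutinefunction(fn):
            @functools.wraps(fn)
            async def async_wrapper(*args, **kwargs):
                with tracer.start_as_current_span(span_name) as span:
                    _annotate(span)
                    return await fn(*args, **kwargs)
            return async_wrapper

        @functools.wraps(fn)
        def sync_wrapper(*args, **kwargs):
            with tracer.start_as_current_span(span_name) as span:
                _annotate(span)
                return fn(*args, **kwargs)
        return sync_wrapper
    return decorator
```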
Streaming is the part where new instrumentation breaks.
```python
@trace_op(name="chat.stream", kind="chat")
async def stream_chat(client, model, messages):
    response = await client.chat.completions.create(model=model, messages=messages, stream=True)
    async for chunk in response:
        yield chunk
```
The decorator opens a span. The function returns an async generator. The with block exits when the function returns. The span ends before the first chunk is yielded.
The fix is to manage the span manually inside the generator:
```python
async def stream_chat(client, model, messages):
    span = tracer.start_span("chat.stream", attributes={"gen_ai.operation.name": "chat"})
    # Make the span current so any auto-instrumented child calls nest under it.
    with trace.use_span(span, end_on_exit=False):
        try:
            response = await client.chat.completions.create(model=model, messages=messages, stream=True)
            async for chunk in response:
                yield chunk
            span.set_status(Status(StatusCode.OK))
        except Exception as exc:
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            span.record_exception(exc)
            raise
        finally:
            span.end()
```
The pattern: streaming responses get a manual span lifecycle. The decorator pattern fits non-streaming functions cleanly. For streaming, either skip the decorator and write the manual lifecycle inline, or use a stream-aware decorator that detects the generator return and defers span.end() to the generator’s aclose.
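A sketch of that stream-aware variant, assuming the wrapped function is an async generator; use_span keeps the span current across chunks and records exceptions with its defaults, and end() is deferred until the generator finishes or is closed:

```python
import inspect


def trace_stream(name=None, kind=None):
    def decorator(fn):
        if not inspect.isasyncgenfunction(fn):
            raise TypeError("trace_stream only wraps async generator functions")

        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            span = tracer.start_span(name or fn.__name__)
            if kind:
                span.set_attribute("gen_ai.operation.name", kind)
            try:
                # use_span records exceptions and sets error status; end() is ours.
                with trace.use_span(span, end_on_exit=False):
                    async for chunk in fn(*args, **kwargs):
                        yield chunk
                span.set_status(Status(StatusCode.OK))
            finally:
                span.end()
        return wrapper
    return decorator
```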
Cardinality, attribute hygiene, and the “ship every prompt” trap
The decorator does not blow up cardinality on its own. Attribute values do.
Three cardinality landmines:
- Setting raw user input as a span attribute. A million different users produce a million different attribute values. Trace search slows down. Storage cost goes up. Privacy review fails.
- Setting full prompts and completions on every span. The OpenTelemetry GenAI conventions name `gen_ai.input.messages` and `gen_ai.output.messages` opt-in for this reason.
- Embedding request ids in span names. `chat.openai.req_abc123` produces one unique span name per request. Aggregations break.
The hygiene rules:
- Span name is a low-cardinality string. `chat.openai`, not `chat.openai.req_abc123`. Use attributes for the per-request identity.
- Bounded enums for fields with finite domains: `gen_ai.provider.name`, `gen_ai.operation.name`, `gen_ai.request.model`.
- Hashed or pseudonymized identifiers for fields with high cardinality you still want: `user.cohort_hash`, `tenant.id`.
- Content fields (prompt, completion, retriever chunks) gated behind an opt-in flag and redacted at the collector before storage.
The opt-in flag is the only realistic answer for production. Dev environments emit prompts; production redacts at the collector or skips the opt-in attributes entirely. The schema discipline travels in the same repo as the decorator. See LLM tracing best practices for the broader span-hygiene discussion.
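A minimal sketch of the opt-in gate; the flag name and helper are illustrative, and JSON-serializing the messages is one common way to fit structured content into an attribute value:

```python
import json
import os

# Illustrative flag name; wire it to whatever config system you already use.
CAPTURE_CONTENT = os.getenv("LLM_TRACE_CAPTURE_CONTENT", "false").lower() == "true"


def maybe_set_content(span, messages, completion_text):
    # Content attributes only when explicitly enabled; redaction still happens
    # at the collector before storage.
    if not CAPTURE_CONTENT:
        return
    span.set_attribute("gen_ai.input.messages", json.dumps(messages))
    span.set_attribute(
        "gen_ai.output.messages",
        json.dumps([{"role": "assistant", "content": completion_text}]),
    )
```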
Prompt versions, user cohorts, and the contextvar pattern
Decorators capture function-local information cleanly. They struggle with request-scoped values that need to ride along on every span downstream of the request handler.
The decent pattern for that is contextvars:
```python
import contextvars

prompt_version_var = contextvars.ContextVar("prompt_version", default=None)
user_cohort_var = contextvars.ContextVar("user_cohort", default=None)


def trace_with_request_ctx(name=None, kind=None):
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(name or fn.__name__) as span:
                if kind:
                    span.set_attribute("gen_ai.operation.name", kind)
                if (v := prompt_version_var.get()):
                    span.set_attribute("prompt.version", v)
                if (c := user_cohort_var.get()):
                    span.set_attribute("user.cohort", c)
                return await fn(*args, **kwargs)
        return wrapper
    return decorator
```
The middleware sets the contextvar at request start. Every decorator-created span inside the request inherits it. Async tasks inherit it through the asyncio context machinery. No thread-locals, no manual propagation.
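A sketch of that middleware for FastAPI/Starlette; the header names are illustrative, and the values could just as easily come from your own request model or feature-flag service:

```python
from starlette.middleware.base import BaseHTTPMiddleware


class RequestContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        # Set the contextvars before the handler runs; decorator-created spans
        # downstream read them via prompt_version_var.get() / user_cohort_var.get().
        prompt_version_var.set(request.headers.get("x-prompt-version"))
        user_cohort_var.set(request.headers.get("x-user-cohort"))
        return await call_next(request)


# app.add_middleware(RequestContextMiddleware)
```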
The trap that catches teams: setting the contextvar inside a task spawned with asyncio.create_task and expecting the value to be visible outside it. The task runs in a copy of the context taken at creation time, so sets inside the task never propagate back to the parent, and sets made in the parent after creation never reach the task. Set request-scoped values in the middleware before any tasks are spawned; on Python 3.11+ you can pass an explicit snapshot with asyncio.create_task(coro, context=contextvars.copy_context()) when you need one.
Picking a library: traceAI, OpenInference, OpenLLMetry, raw OTel
Four realistic choices.
Raw OTel. Ten lines of decorator code, you own the schema, no extra dependency. Fits when you have three LLM call sites and want full control. Breaks down when the codebase has 30 LLM call sites and the schema starts drifting.
traceAI (github.com/future-agi/traceAI) is Future AGI's OTel-native instrumentation framework, Apache 2.0 licensed, with 50+ integrations across Python, TypeScript, Java, and C#. Decorators pre-fill the OpenInference attribute schema. Emits OTLP to any backend. Fits when you want OpenInference-compatible attributes and wide framework coverage.
OpenInference (github.com/Arize-ai/openinference) is Arize’s OTel semantic conventions for LLM applications, around 31 Python packages plus JavaScript and Java coverage. Decorators set the OpenInference attribute schema. Fits when your backend is Phoenix or any OTel backend that understands OpenInference attributes.
OpenLLMetry (github.com/traceloop/openllmetry) is Traceloop’s OTel-based instrumentation, Apache-2.0 licensed, decorator-shaped API. Fits when you want a thin layer over raw OTel with sensible defaults.
The comparison: all four emit OTLP and work with most OTel backends. The three library-backed options ship LLM-specific decorator helpers; raw OTel gives you start_as_current_span plus the work of writing the decorator and configuring an OTLP exporter yourself. Schema differences narrowed in 2026, so pick by which library covers your framework set best (CrewAI, LangGraph, AutoGen, Pydantic AI, OpenAI Agents SDK, Vercel AI SDK if you have a TypeScript surface). If you use Future AGI’s evaluation and observability stack, traceAI emits attributes that the Future AGI tracing platform understands natively (no custom schema mapping required); SDK registration, credentials, and exporter wiring are still part of normal setup.
Common mistakes when adopting decorator tracing
- Wrapping framework boundaries the auto-instrumentation already covers. Duplicate spans. Audit the call graph first.
- Putting the decorator inside a hot loop. A retriever that scores 1,000 candidates does not need 1,000 spans. Aggregate, emit one summary span.
- Setting raw user inputs as span attributes. Cardinality explosion plus privacy risk.
- Using thread-local storage for request-scoped values in async code. Use contextvars.
- Decorating sync code with an async decorator (or vice versa). Check `inspect.iscoroutinefunction(fn)` and dispatch.
- Forgetting to end the span on streaming responses. The `with` block exits before the stream is consumed.
- Letting every team write their own decorator. Schema drift. Write one, document the attribute set, treat schema changes as breaking changes.
- Gating the decorators themselves in dev. OTel's no-op tracer is fine; decorators on top of it have negligible overhead. Gate the SDK init, not the decorators.
When the decorator is the wrong tool
Three cases.
First, when the unit of observability is the whole request. The FastAPI middleware (auto-instrumentation) already opens a request span. Wrapping the request handler with a decorator duplicates that span.
Second, when the unit of observability is a side effect that does not fit a function call. End-of-stream metrics, eviction events on a cache, garbage-collection pauses. These belong to span events on an existing span or to OTel metrics, not to a decorator.
Third, when the function is called from many places and the right span name depends on the caller. The decorator names the span at decoration time. If the caller wants agent.dispatch.weather_tool and another caller wants agent.dispatch.search_tool for the same underlying function, the decorator name is wrong. Open the span at the call site instead.
How to actually adopt decorator tracing in 2026
- Pick the library. Raw OTel for small surfaces; traceAI, OpenInference, or OpenLLMetry for larger ones.
- Document the attribute schema. Required attributes, opt-in fields, custom dimensions (prompt.version, user.cohort, tenant.id).
- Audit the call graph. Where is auto-instrumentation already creating spans? Do not duplicate.
- Decorate the meaningful units. LLM call, retriever, tool, sub-agent, evaluator. Custom logic between LLM calls if it carries operational weight.
- Wire the contextvar layer. Middleware sets prompt.version, user.cohort at request start; decorators read them.
- Set the cardinality budget. No raw user input as attributes. Bounded enums, hashed identifiers, opt-in content fields.
- Handle streaming explicitly. Manual span lifecycle for async generators; decorator only for non-streaming functions.
- Wire the collector. OTLP to a collector, redaction processor, tail-sampling processor in front of the backend (client-side exporter wiring is sketched after this list).
- Test the no-op path. Dev with OTLP off should not slow down meaningfully. If it does, the decorator is doing too much.
- Audit quarterly. Span volume per request, attribute drift, duplicate spans. Refactor before the schema fragments.
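A sketch of the client-side half of steps 8 and 9, assuming the opentelemetry-sdk and gRPC OTLP exporter packages; the gating flag name is illustrative, and the redaction and tail-sampling processors live in the collector config, not here:

```python
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def init_tracing():
    # Gate the SDK init, not the decorators: without a provider, decorated
    # functions run against the no-op tracer and cost almost nothing.
    if os.getenv("LLM_TRACING_ENABLED", "false").lower() != "true":  # illustrative flag
        return
    provider = TracerProvider(resource=Resource.create({"service.name": "llm-app"}))
    exporter = OTLPSpanExporter(
        endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
    )
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
```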
What is shifting in Python LLM tracing in 2026
These are the directions worth tracking. Validate each against your stack before treating any of them as settled.
- OTel GenAI semantic conventions are still in Development, with an opt-in stability transition (`OTEL_SEMCONV_STABILITY_OPT_IN`). Cross-vendor decorator interop is improving but not yet stable.
- Reasoning-token attributes are named in the OTel GenAI spec (`gen_ai.usage.reasoning.output_tokens`). Decorators that do not capture them under-attribute cost on reasoning models.
- Async context propagation in OTel Python is mature: contextvars-based attribute propagation works without manual copy in most async patterns.
- Per-framework instrumentation coverage has expanded across OpenInference, traceAI, and OpenLLMetry, so decorator-only stacks are less common than auto-instrumentation plus targeted decorators.
- Tail sampling at the OTel collector is widely deployed (the contrib processor is still beta), reducing the case for client-side sampling logic in decorators.
How FutureAGI implements Python decorator tracing for LLMs
FutureAGI is the production-grade backend for Python decorator-based LLM tracing built around the closed reliability loop that decorator-only stacks stitch together by hand. The full stack runs on one Apache 2.0 self-hostable plane:
- Decorator-friendly tracing, traceAI (Apache 2.0) ships Python decorators alongside auto-instrumentation for 35+ frameworks; the same library covers TypeScript, Java (LangChain4j and Spring AI), and a C# core, so decorator-instrumented Python services share trace IDs with non-Python services.
- Span-attached evals, 50+ first-party metrics attach as span attributes on decorator-emitted spans; BYOK lets any LLM serve as the judge at zero platform fee, and `turing_flash` runs the same rubrics at 50 to 70 ms p95.
- Simulation, persona-driven scenarios exercise decorator-instrumented code in pre-prod with the same scorer contract that judges production traces.
- Gateway and guardrails, the Agent Command Center fronts 100+ providers with BYOK routing, and 18+ runtime guardrails enforce policy on the same plane.
Beyond the four axes, FutureAGI also ships six prompt-optimization algorithms that consume failing trajectories as training data. Pricing starts free with a 50 GB tracing tier; Boost is $250 per month, Scale is $750 per month with HIPAA, and Enterprise from $2,000 per month with SOC 2 Type II.
Most teams running decorator-based Python LLM tracing in production end up running three or four backend tools alongside the decorators: one for traces, one for evals, one for the gateway, one for guardrails. FutureAGI is the recommended pick because tracing, evals, simulation, gateway, and guardrails all live on one self-hostable runtime; the loop closes without stitching.
Sources
- OpenTelemetry GenAI semantic conventions
- OpenTelemetry Python tracer API
- OpenTelemetry Python contrib repo
- OpenTelemetry Python asyncio context support
- OpenInference GitHub repo
- OpenInference semantic conventions
- traceAI GitHub repo
- OpenLLMetry GitHub repo
- OTel collector tail sampling processor
- Python contextvars docs
- PEP 567 contextvars
- Future AGI traceAI announcement
- Future AGI tracing platform
Series cross-link
Related: LLM Tracing Best Practices in 2026, What is LLM Tracing?, What is an LLM Span vs Trace?, Best OTel Instrumentation Tools for LLM in 2026
Frequently asked questions
When should I use a decorator versus middleware for LLM tracing?
Will adding @trace to every function blow up my cardinality?
Do decorators work with async Python and streaming responses?
Should I use OpenTelemetry directly or an LLM-specific decorator library?
How do I attach prompt version and user cohort to a decorator-created span?
What attributes should the decorator set automatically?
What goes wrong when teams over-decorate?
Can I use decorators in production without a backend running?