What Is TraceAI?
FutureAGI's open-source OpenTelemetry instrumentation library for LLM and agent frameworks, emitting gen_ai.* compliant spans under Apache 2.0.
TraceAI is FutureAGI’s open-source instrumentation library that turns LLM and agent code into structured OpenTelemetry traces. As of May 2026, it ships drop-in tracers for 50+ frameworks across Python, TypeScript, Java, and C#: model SDKs (OpenAI, Anthropic, Bedrock, Google GenAI, Mistral, Cohere, Groq, Together, xAI), agent frameworks (LangChain, LlamaIndex, CrewAI, OpenAI Agents SDK, Google ADK, AutoGen, Mastra, Pydantic AI, DSPy, Strands, Smolagents, Haystack), MCP servers and clients, voice (LiveKit, Pipecat), gateways (Portkey, LiteLLM), and vector stores (Pinecone, Weaviate, Qdrant, Chroma, LanceDB, Milvus, MongoDB, pgvector). It is Apache 2.0, OTel-native, and exports to any OTLP backend; the Agent Command Center inside FutureAGI is just one possible target. The whole point of traceAI is that the trace you collect in dev is the same trace shape you replay in simulate, score in evaluate, and watch in production tracing.
Why traceAI matters in production LLM and agent systems
Most teams write their own LLM instrumentation once and regret it for two years. In 2026 the cost of rolling your own is higher than it was in 2023, because the surface area has multiplied: GPT-5.x, Claude Opus 4.7, Gemini 3.x, and Llama 4 each ship slightly different streaming, tool-call, reasoning-trace, and structured-output formats; agent frameworks like CrewAI, LangGraph, OpenAI Agents SDK, and Google ADK each model planning, handoff, and tool use differently; MCP and A2A added two more wire protocols that need span coverage. Hand-rolled wrappers patch OpenAI.chat.completions.create, capture prompt/completion strings, compute token counts, thread parent span ids through async tool calls, and try to stitch agent state into a tree. That is six months of platform work, maintained forever, and broken every time an upstream SDK ships a new method.
Two failure modes follow from skipping a real instrumentation library. First, trace gaps. A LangGraph agent calls a tool, the tool spawns an inner LLM call via a different SDK, the inner call has no parent span id, and the trace tree breaks at the boundary. The user sees a slow turn; the engineer sees half the spans and cannot reconstruct what the agent trajectory actually did. Second, schema drift. Three teams instrument three different ways (one writes tokens_in, another writes prompt_tokens, a third writes usage.input.tokens), and dashboards stop being comparable across services. Cost reporting, LLM observability alerts, and the evaluator layer all degrade because there is no canonical attribute to read.
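To make the schema-drift cost concrete, here is a minimal sketch of the shim a team ends up writing when three services disagree on the input-token attribute. The legacy names are the ones from the example above; the helper `normalize_input_tokens` is hypothetical and not part of traceAI.

```python
# Legacy names come from the schema-drift example in the text; the
# mapping helper itself is hypothetical and not part of traceAI.
LEGACY_INPUT_TOKEN_KEYS = ("tokens_in", "prompt_tokens", "usage.input.tokens")
CANONICAL_INPUT_TOKENS = "gen_ai.usage.input_tokens"


def normalize_input_tokens(attributes):
    """Return a copy of the attributes with a single canonical input-token key."""
    normalized = dict(attributes)
    for key in LEGACY_INPUT_TOKEN_KEYS:
        if key in normalized:
            value = normalized.pop(key)
            # Keep an existing canonical value if one is already present.
            normalized.setdefault(CANONICAL_INPUT_TOKENS, value)
    return normalized


# Three services, three spellings of the same measurement.
spans_from_three_teams = [
    {"service": "search", "tokens_in": 120},
    {"service": "billing", "prompt_tokens": 80},
    {"service": "support", "usage.input.tokens": 200},
]

total_input_tokens = sum(
    normalize_input_tokens(span)[CANONICAL_INPUT_TOKENS]
    for span in spans_from_three_teams
)
print(total_input_tokens)  # 400
```

Every dashboard, alert, and cost report needs a shim like this until the attribute is canonical at the point of emission, which is the argument for a shared schema.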
The pain lands across roles. Developers see flaky tool traces and missing context windows when a regression test fails. SREs see latency outliers without a way to slice by model, by tenant, or by retriever. Cost owners see surprise spikes because they were counting tokens at the gateway only, not at every fallback path. Compliance leads need audit trails that include exact prompts, tool args, retrieved chunks, and policy decisions; without an OTel-aligned library those records sit in five different schemas.
TraceAI exists to remove that work. Each integration is maintained against the upstream SDK version it tracks; spans share a single attribute schema (gen_ai.* plus fi.span.kind); parent-child relationships flow through OTel context propagation across threads, async tasks, and process boundaries. The result is one instrumentation contract across the whole agent stack, not 50 bespoke wrappers. And because it is OTel, the same spans flow to a vendor-neutral backend if a team ever wants to leave FutureAGI.
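The parent-child claim is easier to trust once you see the mechanism. The sketch below is a toy, not traceAI or OpenTelemetry code: it uses only the standard library's `contextvars` (the same primitive OpenTelemetry's Python context is built on) to show how a "current span" follows work into concurrent async tasks. `ToySpan` and `recorded` are illustrative names.

```python
import asyncio
import contextvars
import itertools

# Stand-in for OpenTelemetry's "current span" slot; not a traceAI or OTel API.
_current_span_id = contextvars.ContextVar("current_span_id", default=None)
_ids = itertools.count(1)
recorded = []  # (span_id, parent_id, name) tuples in start order


class ToySpan:
    """Context manager that records its parent from the ambient context."""

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        self.span_id = next(_ids)
        recorded.append((self.span_id, _current_span_id.get(), self.name))
        self._token = _current_span_id.set(self.span_id)
        return self

    def __exit__(self, *exc_info):
        _current_span_id.reset(self._token)
        return False


async def call_tool(name):
    with ToySpan(f"tool:{name}"):
        await asyncio.sleep(0)  # yield to the event loop mid-span
        with ToySpan("llm:inner"):
            await asyncio.sleep(0)


async def agent_turn():
    with ToySpan("agent:turn"):
        # Each task gets a copy of the context, so both tools see the agent span.
        await asyncio.gather(call_tool("search"), call_tool("lookup"))


asyncio.run(agent_turn())
for span_id, parent_id, name in recorded:
    print(span_id, parent_id, name)
```

Both tool spans report the agent span as parent, and each inner LLM span reports its own tool span, even though the two tasks interleave. Crossing a thread pool or a process boundary needs explicit propagation, which is the part an instrumentation library has to get right for you.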
How FutureAGI builds and ships traceAI
FutureAGI’s approach is to keep traceAI deliberately decoupled from the rest of the platform. The library emits OTel spans; they go wherever you point OTLP: FutureAGI, Phoenix, Langfuse, Datadog, Honeycomb, Grafana Tempo, or a self-hosted collector. There is no proprietary SDK requirement, no closed wire format, and no telemetry held hostage behind a login.
A typical Python install is two lines plus the integration:
from fi_instrumentation import register
from traceai_openai import OpenAIInstrumentor
trace_provider = register(project_name="checkout-agent")
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)
That snippet alone produces one span per chat completion, with gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.cost.total, full prompts (if not redacted), the streamed completion, and any tool calls. Add traceai-langchain and the same call now produces a parent CHAIN span with the LLM span as a child, plus retriever spans for vector DB hits and tool spans for any LangChain Tool invocation. The agent tracing tree is automatic: you do not write a single with tracer.start_as_current_span block.
For agents, the traceAI-openai-agents integration captures the OpenAI Agents SDK’s tool-call decisions, handoffs, and trajectory; combined with fi.evals.TrajectoryScore, you get an agent-graph view with eval verdicts at each node. The traceAI-crewai integration captures crew composition, role assignments, and the per-task plan. The traceAI-google-adk and traceAI-mastra integrations cover Google’s Agent Development Kit and Mastra workflows. For voice, traceAI-livekit captures STT, LLM, and TTS stages with gen_ai.voice.latency.ttfb_ms and gen_ai.voice.latency.tts_first_audio_ms per turn, which is the only honest way to ground a voice agent latency SLO. For gateways, traceAI-portkey and traceAI-litellm propagate the trace context across the proxy hop so a single user request stays one trace even when traffic crosses provider boundaries, a property that almost nothing else in the LLM observability space handles cleanly.
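What "propagate the trace context across the proxy hop" means on the wire is usually the W3C Trace Context `traceparent` header. The sketch below handles only version `00` of that header and uses a hypothetical `forward_through_proxy` function to stand in for a gateway; it does not show how traceAI-portkey or traceAI-litellm implement the hop.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id, span_id, sampled=True):
    """Serialize a trace id and span id into a version-00 traceparent value."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return the parsed fields, or None when the header is missing or invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero ids are invalid per the specification.
    if int(trace_id, 16) == 0 or int(parent_span_id, 16) == 0:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def forward_through_proxy(incoming_headers):
    """Hypothetical proxy hop: continue the caller's trace, or start a new one."""
    context = parse_traceparent(incoming_headers.get("traceparent", ""))
    if context is None:
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, sampled = context["trace_id"], context["sampled"]
    proxy_span_id = secrets.token_hex(8)  # the proxy's own span becomes the parent
    return {"traceparent": make_traceparent(trace_id, proxy_span_id, sampled)}


client_headers = {
    "traceparent": make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
}
provider_headers = forward_through_proxy(client_headers)
print(provider_headers["traceparent"])
```

The trace id survives the hop while the parent id changes to the proxy's span, which is what keeps the client-side and provider-side spans in one tree. A proxy that drops the header produces the second branch of `forward_through_proxy`: a brand-new, unrelated trace.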
Three things differentiate traceAI from OpenInference (Arize’s analogous library). First, language reach: traceAI ships in Python, TypeScript, Java, and C#, where most enterprise voice, contact-center, and JVM-based RAG stacks live. OpenInference is Python-and-TS first. Second, the fi.span.kind taxonomy distinguishes LLM, RETRIEVER, TOOL, AGENT, CHAIN, EMBEDDING, RERANKER, GUARDRAIL, and EVALUATOR spans, meaning a dashboard can filter “show me only retriever spans with retrieval.score below 0.4” without parsing free-form span names. Third, tight integration with the FutureAGI eval, simulate, and gateway surfaces: the same span ids you debug in production are the ones the simulator replays, the same ones protect-flash decisions reference, and the same ones the evaluation framework writes verdicts onto. Langfuse and Helicone do their own thing well; if your stack is already OTel-first, traceAI is the lowest-friction path.
| Library | Maintainer | Languages | Frameworks | Span taxonomy | Backend lock-in |
|---|---|---|---|---|---|
| traceAI | FutureAGI | Python, TS, Java, C# | 50+ incl. CrewAI, OpenAI Agents SDK, Mastra, ADK, MCP, LiveKit | fi.span.kind (LLM/RETRIEVER/TOOL/AGENT/CHAIN/EMBEDDING/RERANKER/GUARDRAIL/EVALUATOR) | None; any OTLP backend |
| OpenInference | Arize | Python, TS | 30+ frameworks | openinference.span.kind | None; any OTLP backend |
| Langfuse SDK | Langfuse | Python, TS, JS | LangChain, LlamaIndex, OpenAI, Anthropic | Langfuse-native (trace/span/generation) | Langfuse backend |
| Helicone proxy | Helicone | Provider-agnostic (HTTP proxy) | Any OpenAI-compatible | Custom log schema | Helicone backend |
| Datadog LLM Obs | Datadog | Python, Node | OpenAI, Anthropic, LangChain | Datadog APM schema | Datadog |
| Hand-rolled | You | Whatever you write | Whatever you patch | Whatever you invent | Yours, forever |
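The value of a span-kind taxonomy shows up when you try the "retriever spans with retrieval.score below 0.4" filter both ways. The sketch below runs over plain dictionaries with made-up span names and scores; only the attribute keys fi.span.kind and retrieval.score come from the text, and a real backend would express the same thing in its own query language.

```python
# Illustrative spans; names and scores are invented for the example.
spans = [
    {"name": "pinecone.query", "fi.span.kind": "RETRIEVER", "retrieval.score": 0.31},
    {"name": "retrieve_docs_v2", "fi.span.kind": "RETRIEVER", "retrieval.score": 0.82},
    {"name": "retrieval_summary_llm", "fi.span.kind": "LLM"},
    {"name": "kb-lookup", "fi.span.kind": "RETRIEVER", "retrieval.score": 0.12},
]


def weak_retrievals(spans, threshold=0.4):
    """Filter on the span kind, then on the score; spans without a score are skipped."""
    return [
        span
        for span in spans
        if span.get("fi.span.kind") == "RETRIEVER"
        and span.get("retrieval.score") is not None
        and span["retrieval.score"] < threshold
    ]


def name_matches(spans, fragment="retriev"):
    """The fragile alternative: substring match on free-form span names."""
    return [span for span in spans if fragment in span["name"]]


by_kind = [span["name"] for span in weak_retrievals(spans)]
by_name = [span["name"] for span in name_matches(spans)]
print(by_kind)  # ['pinecone.query', 'kb-lookup']
print(by_name)  # ['retrieve_docs_v2', 'retrieval_summary_llm']
```

The name-based filter misses both weak retrievals and picks up an LLM span that merely has "retrieval" in its name, which is the failure the taxonomy is meant to prevent.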
How to use traceAI in your stack
The install pattern is consistent across every integration. Pick the framework you care about, install traceai-{framework}, register a project, and call Instrumentor().instrument(). The same shape works for vector database clients, reranker calls, MCP servers, and guardrails policies.
A typical agent stack install:
from fi_instrumentation import register
from traceai_openai_agents import OpenAIAgentsInstrumentor
from traceai_pinecone import PineconeInstrumentor
from traceai_langchain import LangChainInstrumentor
trace_provider = register(project_name="support-agent-prod")
OpenAIAgentsInstrumentor().instrument(tracer_provider=trace_provider)
PineconeInstrumentor().instrument(tracer_provider=trace_provider)
LangChainInstrumentor().instrument(tracer_provider=trace_provider)
Once those three instrumentors are live, a single user message produces a distributed trace with the agent root span, planning spans, tool-call spans for every Pinecone retrieve, LangChain chain spans for any subprocess, LLM spans for each call, and final reply spans. From there:
- Production monitoring: point the OTLP exporter at the FutureAGI Agent Command Center; you get the standard tracing dashboard plus per-cohort eval-fail rate from the evaluator layer.
- Offline replay: re-export the spans into the simulate sandbox; the persona engine replays the same user trajectory against a candidate model.
- Release gates: attach fi.evals.HallucinationScore, Groundedness, AnswerRelevancy, ContextPrecision, or TrajectoryScore to the dataset; the CI job re-runs the gate and blocks the deploy on regression.
- Cost and latency budgeting: slice gen_ai.usage.input_tokens and gen_ai.client.operation.duration by gen_ai.request.model and tenant; the same query works whether traffic crossed a gateway or hit the provider directly.
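The cost and latency slice in the last bullet can be sketched over exported spans as a plain group-by. This is an illustration only: the `tenant` key, the placeholder model names, and the assumption that gen_ai.client.operation.duration is recorded in seconds are all choices made for the example, not something the text specifies.

```python
from collections import defaultdict
from statistics import median

# Illustrative exported spans; "tenant" is a hypothetical attribute key.
spans = [
    {"gen_ai.request.model": "model-a", "tenant": "acme",
     "gen_ai.usage.input_tokens": 100, "gen_ai.client.operation.duration": 0.5},
    {"gen_ai.request.model": "model-a", "tenant": "acme",
     "gen_ai.usage.input_tokens": 300, "gen_ai.client.operation.duration": 1.5},
    {"gen_ai.request.model": "model-b", "tenant": "acme",
     "gen_ai.usage.input_tokens": 50, "gen_ai.client.operation.duration": 0.25},
    {"gen_ai.request.model": "model-a", "tenant": "globex",
     "gen_ai.usage.input_tokens": 70, "gen_ai.client.operation.duration": 2.0},
]


def slice_by_model_and_tenant(spans):
    """Sum input tokens and take the median duration per (model, tenant) pair."""
    groups = defaultdict(lambda: {"input_tokens": 0, "durations": []})
    for span in spans:
        key = (span["gen_ai.request.model"], span["tenant"])
        groups[key]["input_tokens"] += span["gen_ai.usage.input_tokens"]
        groups[key]["durations"].append(span["gen_ai.client.operation.duration"])
    return {
        key: {
            "input_tokens": group["input_tokens"],
            "median_duration_s": median(group["durations"]),
        }
        for key, group in groups.items()
    }


report = slice_by_model_and_tenant(spans)
for key, row in sorted(report.items()):
    print(key, row)
```

Because the grouping reads only attribute names, the same function works regardless of which provider or gateway produced the span, provided the schema is consistent.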
There are two 2026 tells that an LLM team has traceAI wired correctly: their on-call can answer “show me every Claude Opus 4.7 call in the last hour where retrieval recall@5 dropped below 0.6” in one query, and their CI can block a deploy on a regression eval without anyone writing custom plumbing.
Wiring traceAI into a multi-environment rollout
A common 2026 pattern is the same agent shipped across dev, staging, canary, and prod with different OTLP exporters in each. Each environment registers its own project_name and tenant label; the OTel resource attributes carry deployment.environment, service.name, service.version, and a build sha. The exporter destinations differ: dev to a local collector, staging to FutureAGI sandbox, canary to FutureAGI prod under a separate project, prod to FutureAGI prod under the main project. From the engineer’s side, the application code does not change between environments; only the registration call’s args do. That property is what makes traceAI usable across hundreds of services without per-service forks of an instrumentation wrapper.
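One way to keep application code identical across environments is to derive the registration arguments from a single lookup table. The sketch below is an assumption-heavy illustration: the endpoint URLs are placeholders, `DEPLOY_ENV` and `build.sha` are invented names, and the returned dictionary is not the signature of `register()`, which the text only shows taking `project_name`.

```python
import os

# Placeholder destinations; only localhost:4317 (the default OTLP gRPC port) is real.
ENVIRONMENTS = {
    "dev": "http://localhost:4317",
    "staging": "https://otlp.staging.example.invalid",
    "canary": "https://otlp.prod.example.invalid",
    "prod": "https://otlp.prod.example.invalid",
}


def registration_args(service, version, build_sha, environment=None):
    """Build per-environment settings; application code calls this unchanged."""
    # DEPLOY_ENV is a hypothetical variable name chosen for this sketch.
    environment = environment or os.environ.get("DEPLOY_ENV", "dev")
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return {
        "project_name": f"{environment}-{service}",
        "otlp_endpoint": ENVIRONMENTS[environment],
        "resource_attributes": {
            "deployment.environment": environment,
            "service.name": service,
            "service.version": version,
            "build.sha": build_sha,  # hypothetical attribute key for the build sha
        },
    }


staging = registration_args("checkout", "1.4.2", "abc123", environment="staging")
canary = registration_args("checkout", "1.4.2", "abc123", environment="canary")
print(staging["project_name"], canary["project_name"])
```

Canary and prod share a destination but get distinct project names, which matches the "separate project" split described above. Failing loudly on an unknown environment is a deliberate choice here, since a typo that silently falls back to dev would blend data.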
The other property worth understanding is stable attribute schema across model providers. A 2026 stack rarely uses only one provider; the same agent often routes to GPT-5.x for default traffic, Claude Opus 4.7 for long-context legal questions, Gemini 3.x for video understanding, and Llama 4 for cost-sensitive batch jobs. Without a stable schema, each provider’s spans would carry different attribute names, and a dashboard built on OpenAI would silently miss half the data. TraceAI normalises this: gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.cost.total mean the same thing whether the call hit OpenAI, Anthropic, Bedrock, Vertex AI, or a local vLLM endpoint. The provider is recorded in gen_ai.system and gen_ai.provider.name, so a single query can slice by provider when you need to, and ignore it otherwise.
For MCP servers and A2A protocol traffic, traceAI emits the protocol type, the remote agent identity, the tool manifest hash, and per-call timing on a dedicated span kind. That gives compliance teams the audit trail they need without anyone having to write custom code: every cross-agent call, every tool surface declared by an MCP server, every A2A handoff is a structured span.
How to measure or detect traceAI coverage
TraceAI itself is the instrumentation layer; what it produces is what you measure. The OTel attributes traceAI emits are the actual signals that downstream evals, dashboards, and alerts read:
- Span kind: fi.span.kind distinguishes LLM, RETRIEVER, TOOL, AGENT, CHAIN, EMBEDDING, RERANKER, GUARDRAIL, EVALUATOR. Filter by kind to isolate a layer.
- Provider and model: gen_ai.system, gen_ai.provider.name, gen_ai.request.model, gen_ai.response.model.
- Tokens and cost: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.cost.total, gen_ai.cost.input, gen_ai.cost.output.
- Latency: gen_ai.client.operation.duration, gen_ai.server.time_to_first_token, gen_ai.server.time_per_output_token, and for voice gen_ai.voice.latency.ttfb_ms.
- Tool calls: gen_ai.tool.name, gen_ai.tool.call.arguments, gen_ai.tool.call.result, gen_ai.tool.call.error.
- Retrieval: retrieval.documents, retrieval.score, vector.collection, vector.metric.
- Coverage health: percentage of LLM API calls in your repo that produce a span (target ≥ 99%); orphan-span rate (spans with no parent); average tree depth per trace.
- fi.evals.HallucinationScore: runs at span granularity on the captured prompt/response; the score writes back as an evaluator span next to the LLM span.
A reliable proxy for “is traceAI working”: every production LLM call in the last 5 minutes has a non-null gen_ai.request.model, a non-null gen_ai.usage.input_tokens, and a parent agent or chain span. If any of those is null for more than 1% of traces, an integration version is out of sync.
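That proxy can be sketched as a small check over recently exported spans. The span dictionaries below use a hypothetical shape (`span_id`, `parent_id`, `attributes`), and the check counts per LLM span rather than per trace, which is a simplification of the rule above.

```python
def coverage_health(spans, max_null_rate=0.01):
    """Compute the share of LLM spans failing each check and an overall verdict."""
    by_id = {span["span_id"]: span for span in spans}
    llm_spans = [s for s in spans if s["attributes"].get("fi.span.kind") == "LLM"]
    failures = {
        "missing_model": 0,
        "missing_input_tokens": 0,
        "missing_agent_or_chain_parent": 0,
    }
    for span in llm_spans:
        attributes = span["attributes"]
        if attributes.get("gen_ai.request.model") is None:
            failures["missing_model"] += 1
        if attributes.get("gen_ai.usage.input_tokens") is None:
            failures["missing_input_tokens"] += 1
        parent = by_id.get(span.get("parent_id"))
        parent_kind = parent["attributes"].get("fi.span.kind") if parent else None
        if parent_kind not in ("AGENT", "CHAIN"):
            failures["missing_agent_or_chain_parent"] += 1
    total = len(llm_spans)
    rates = {name: (count / total if total else 0.0) for name, count in failures.items()}
    return {"rates": rates, "healthy": all(rate <= max_null_rate for rate in rates.values())}


# Illustrative data: one well-formed LLM span and one orphan without token counts.
recent_spans = [
    {"span_id": "a", "parent_id": None, "attributes": {"fi.span.kind": "AGENT"}},
    {"span_id": "b", "parent_id": "a", "attributes": {
        "fi.span.kind": "LLM",
        "gen_ai.request.model": "model-a",
        "gen_ai.usage.input_tokens": 120,
    }},
    {"span_id": "c", "parent_id": None, "attributes": {
        "fi.span.kind": "LLM",
        "gen_ai.request.model": "model-a",
    }},
]

health = coverage_health(recent_spans)
print(health)
```

Reporting the three rates separately matters for triage: a missing model points at the SDK integration, while a missing parent points at the framework integration or at context propagation.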
For evaluator wiring on traced data, the common minimal pattern is:
from fi.evals import HallucinationScore
evaluator = HallucinationScore()
for span in llm_spans:
    score = evaluator.evaluate(
        input=span.attributes["gen_ai.prompt"],
        output=span.attributes["gen_ai.completion"],
        context=span.attributes.get("retrieval.documents", []),
    )
    span.set_attribute("fi.eval.hallucination", score.value)
That snippet works whether the spans came from OpenAI, Claude, Gemini, or a local Llama 4, because the attribute schema is the same.
We’ve found that the teams who get the most out of traceAI are the ones that wire evaluator output back onto the span itself, not into a separate metrics store. When fi.eval.hallucination sits next to gen_ai.request.model on the same span, the on-call query is one filter; when it sits in a different system, the same investigation takes three tabs and a join. In our 2026 evals across enterprise pilots, the median time to root-cause a regression dropped from over three hours to under twenty minutes once eval results were attached at span level rather than aggregated nightly.
For agent-shaped workloads, the span-level attachment matters most when you are pacing against public benchmarks: τ-bench (Sierra’s multi-turn customer-support benchmark; frontier scores 55-70% in May 2026), SWE-Bench Verified (500 human-verified GitHub bug-fix tasks; frontier 70-78%), and GAIA (multi-step assistant tasks across browsing and tool use) all score outcomes that traceAI’s agent.trajectory.step and gen_ai.tool.* spans capture natively. For RAG-shaped traffic, RAGTruth (roughly 18K annotated responses) and HaluEval (35K Q&A pairs; GPT-4 ~16.4% hallucination rate) are the standard public anchors that calibrate the fi.evals.HallucinationScore thresholds you attach to LLM spans.
Common mistakes
- Instrumenting only the model SDK. If you skip the agent framework (LangChain, CrewAI, OpenAI Agents SDK, Google ADK), you lose the parent span and the trace becomes a bag of LLM calls with no graph. Always install the framework integration too.
- Forgetting to propagate context across async boundaries. Use OTel context.attach/detach or framework-native helpers; otherwise tool spans orphan from their parent and your agent tracing tree breaks.
- Mixing two instrumentation libraries on the same stack. Running both traceAI and OpenInference, or traceAI and the Langfuse SDK, produces duplicate spans and inconsistent attribute names. Pick one and uninstall the others.
- Disabling content capture without a redactor. Setting FI_HIDE_LLM_INVOCATION_PARAMETERS=true hides debugging detail; pair it with span-level redaction (regex or PII) so you keep tokens, timings, and cost while masking sensitive payloads.
- Pinning to an old integration version. TraceAI tracks upstream SDKs; pinning to a year-old traceai-openai means new tool-call shapes, reasoning trace fields, and structured output deltas drop on the floor. Re-pin quarterly.
- Not setting project_name per environment. A single project_name="default" across dev, staging, and prod blends data and breaks per-cohort dashboards. Use dev-{service}, staging-{service}, prod-{service}.
- Skipping the gateway integration. If traffic goes through Portkey, LiteLLM, or a custom proxy and the proxy is not instrumented, you get two unrelated traces: one client side, one provider side. Install traceai-portkey or traceai-litellm and propagate the OTel context header.
- Ignoring fi.span.kind in dashboards. Filtering by span name is fragile; the kind taxonomy is stable. Build dashboards on fi.span.kind = "RETRIEVER", not on name LIKE "%retriev%".
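For the redaction point in the list above, a span-level redactor can be as small as a function that masks content attributes and passes everything else through. This is a rough sketch, not traceAI's redaction mechanism: the two regular expressions are deliberately simple and will miss many PII forms, and the set of content keys is an assumption based on the attribute names used in this article.

```python
import re

# Deliberately simple patterns for illustration; real PII detection needs more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

# Assumed set of attributes that carry free-text payloads.
CONTENT_KEYS = {"gen_ai.prompt", "gen_ai.completion", "gen_ai.tool.call.arguments"}


def redact_span_attributes(attributes):
    """Mask sensitive text in content attributes; leave metrics untouched."""
    redacted = {}
    for key, value in attributes.items():
        if key in CONTENT_KEYS and isinstance(value, str):
            value = EMAIL_RE.sub("[EMAIL]", value)
            value = CARD_RE.sub("[CARD]", value)
        redacted[key] = value
    return redacted


span_attributes = {
    "gen_ai.request.model": "model-a",
    "gen_ai.usage.input_tokens": 42,
    "gen_ai.prompt": "Email jane@example.com about card 4111 1111 1111 1111",
}

redacted = redact_span_attributes(span_attributes)
print(redacted["gen_ai.prompt"])  # Email [EMAIL] about card [CARD]
```

The model name and token count pass through unchanged, so cost and latency dashboards keep working while the payload is masked.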
Frequently Asked Questions
What is TraceAI?
TraceAI is FutureAGI's Apache 2.0 OpenTelemetry instrumentation library that auto-instruments 50+ LLM and agent frameworks (OpenAI, Anthropic, LangChain, CrewAI, OpenAI Agents SDK, LiveKit, vector DBs) across Python, TypeScript, Java, and C#.
How is TraceAI different from OpenInference?
Both are OTel-aligned LLM instrumentation libraries. OpenInference is maintained by Arize and focuses on broad framework coverage; TraceAI is maintained by FutureAGI, ships in four languages including Java and C#, adds an fi.span.kind taxonomy, and integrates directly with FutureAGI's eval, simulate, and gateway products.
How do you use TraceAI in production?
Install the integration package (e.g. traceAI-openai, traceAI-langchain), call register(project_name='prod') from fi_instrumentation, then call the framework's Instrumentor().instrument(). Spans flow to any OTLP backend, including FutureAGI.