
traceAI: Open-Source OpenTelemetry LLM Tracing for 35+ Frameworks in Python, TypeScript, Java, and C#

traceAI is open-source OpenTelemetry AI tracing for 35+ frameworks in Python, TypeScript, Java, and C#. Two lines of code. Zero vendor lock-in.


Why Do AI Agents Fail Silently in Production, and How Does traceAI Fix It?

Let’s start with the worst kind of bug I have ever shipped.

A healthcare client built a RAG agent on top of clinical guidelines, deployed to clinicians during triage. Three weeks in, a senior nurse flagged a response that recommended a 500mg dose for a drug that maxes out at 200mg.

We pulled the logs. Every request returned 200 OK, every retrieval call hit the vector DB, every OpenAI completion came back clean. The agent had been confidently citing the wrong dose for three weeks, and the monitoring stack had nothing to say about it.

Turned out the retriever was pulling the right document, but a chunking bug split the dosage table across two chunks. The model only ever saw the first half. No error, no warning, no log line, just a quiet hallucination on top of partial context.

This is the observability gap that kills production AI systems. Your logs say everything is fine, your dashboards show green, and somewhere inside the chain (the retriever, the tool call, the prompt template, the model), something is silently failing.

traceAI is the open-source instrumentation library built for this problem. It auto-instruments 35+ LLM providers, orchestration frameworks, and agent systems across Python, TypeScript, Java, and C#.

Two lines of code. No proprietary lock-in. Every span goes to whatever backend you already run.

This post walks through why production AI is invisible to traditional logging, what tracing actually catches that logs miss, and how to debug an agent that's silently fabricating customer data with five lines of traceAI.

What Causes the Observability Gap Killing Your AI Stack?

Why Can’t You See Inside a Multi-Step AI Pipeline?

Your LLM application isn’t one service. It’s a chain of interdependent calls: retrieval, embedding, reranking, inference, tool execution, guardrail checks.

When something goes wrong, traditional logs tell you the chain completed. They tell you nothing about which step broke.

The gap shows up in three concrete ways:

  • Debugging gap. Your multi-step agent returned a hallucinated answer. Was it the retriever returning irrelevant chunks? The prompt? The model? Without spans, you’re guessing.
  • Performance gap. Latency spiked from 800ms to 4.2 seconds. Is it the LLM call, the embedding step, or the reranker? Token costs? Invisible.
  • Workflow gap. LangChain handles orchestration. OpenAI handles inference. Pinecone handles retrieval. No single tool shows you the full execution path across all three.

You can paper over one of these gaps with print statements. You can’t paper over all three.

Why Does Traditional APM Miss Every LLM Failure?

Datadog and New Relic trace HTTP requests beautifully. They do not understand what an LLM call is, what a retrieval step does, or why an agent chose Tool A over Tool B.

print() debugging works for one developer on one pipeline. It does not scale to production agents handling thousands of requests per day.

Teams end up with three different tracing setups for three different frameworks, no unified view, no correlation between spans. That’s not observability, that’s a wall of partial dashboards.

What’s missing is an open standard that works across every LLM provider, every framework, every language, and exports to whatever backend you already run.

That standard exists. It’s called OpenTelemetry. But OpenTelemetry doesn’t know what an LLM call is.

The newer AI tracing libraries try to fix this, but most of them trade one lock-in for another. OpenInference (Arize) ships solid instrumentors but pulls you toward Phoenix's storage and UI. Langfuse and LangSmith run their own backends, with their own SDKs and their own span formats.

The moment you want your AI traces to live next to your HTTP traces in Datadog, or you want to switch backends six months in, you’re rewriting your instrumentation layer.

traceAI is built directly on OpenTelemetry semantic conventions, so the same spans flow into Jaeger, Tempo, Datadog, or Future AGI Observe with no code change.

That’s the gap traceAI was built to close.

What is traceAI?

traceAI is an open-source instrumentation library built on OpenTelemetry that auto-traces LLM calls, tool invocations, retrieval steps, and agent decisions across 35+ frameworks in Python, TypeScript, Java, and C#.

It captures every span with AI-specific attributes (token counts, model names, prompts, latency) and exports to any OTel-compatible backend. You debug production AI failures in minutes instead of days.

The Case for Open Instrumentation

AI observability is too critical to be locked behind proprietary tooling.

When the instrumentation layer is open source, you can audit it, extend it, and trust it the same way engineers trust OpenTelemetry for general-purpose tracing.

Instrumentors that live in your vendor’s closed SDK are a liability. Instrumentors you can read, fork, and contribute to are infrastructure.

traceAI is built on top of OpenTelemetry. It auto-instruments AI frameworks and maps their operations to standardized trace attributes and spans. Licensed under MIT.

What Does Open-Source LLM Tracing Actually Unlock?

| Capability | What It Means in Practice |
| --- | --- |
| Auto-instrumentation | Two lines of code. Every LLM call, retrieval, embedding, tool invocation, and agent decision becomes a span automatically. |
| Standardized semantic conventions | AI-specific span attributes (token counts, model names, prompt templates, retrieval scores) mapped to a consistent schema across all providers. |
| Multi-language support | Python (PyPI), TypeScript (npm), Java (JitPack/Maven), C# (NuGet). Same tracing model, four languages. |
| Backend-agnostic export | Traces go to any OpenTelemetry-compatible backend: Jaeger, Grafana Tempo, Datadog, or Future AGI's Observe platform. |
| Manual tracing helpers | @tracer.chain, @tracer.tool, @tracer.agent decorators for custom spans when auto-instrumentation does not reach your business logic. |
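
The manual helpers in the last row look like this in practice. A minimal sketch, assuming a FITracer registered the way the walkthrough later in this post does, and the same keyword style as @tracer.tool; llm_summarize is a hypothetical helper:

@tracer.chain(name="summarize_ticket")
def summarize_ticket(ticket_text: str) -> str:
    # Everything inside runs under one CHAIN span; any LLM or TOOL
    # spans created here become its children automatically.
    return llm_summarize(ticket_text.strip())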

What traceAI Is Not

traceAI is the instrumentation layer, not the platform.

It feeds observability platforms. It extends OpenTelemetry with AI-specific semantics. It’s not locked to Future AGI.

It works with any OTel-compatible backend. Your traces, your infrastructure.

You can also check out the complete Future AGI documentation.

Which AI Frameworks and Languages Does traceAI Support?

Which Python AI Frameworks Are Auto-Instrumented?

Python coverage is the most mature and is the recommended starting point for your first integration.

| Package | Framework |
| --- | --- |
| traceAI-openai | OpenAI |
| traceAI-anthropic | Anthropic |
| traceAI-langchain | LangChain |
| traceAI-llamaindex | LlamaIndex |
| traceAI-crewai | CrewAI |
| traceAI-autogen | AutoGen |
| traceAI-dspy | DSPy |
| traceAI-haystack | Haystack |
| traceAI-openai-agents | OpenAI Agents SDK |
| traceAI-smolagents | Hugging Face SmolAgents |
| traceAI-litellm | LiteLLM |
| traceAI-groq | Groq |
| traceAI-mistralai | Mistral AI |
| traceAI-bedrock | AWS Bedrock |
| traceAI-vertexai | Vertex AI (Gemini) |
| traceAI-google-genai | Google GenAI |
| traceAI-guardrails | Guardrails AI |
| traceAI-mcp | Model Context Protocol |

Plus voice and realtime: LiveKit and Pipecat integrations. Plus low-code: n8n.

How Does traceAI Work in TypeScript, Java, and C# Projects?

TypeScript uses @traceai/fi-core for tracer registration, with @traceai/openai for OpenAI instrumentation. Mastra and Vercel AI SDK are supported too.

Java uses a Traced<Client> wrapper pattern. Wrap your existing client and tracing happens automatically. Supported: OpenAI, Anthropic, AWS Bedrock, Cohere, Pinecone, and Spring Boot.

C# uses FITracer.Initialize() for setup. Coverage is expanding to match Python’s breadth.

What’s the Full Language and Category Coverage Matrix?

| Category | Python | TypeScript | Java | C# |
| --- | --- | --- | --- | --- |
| LLM Providers | 12+ | Yes | 4+ | Yes |
| Orchestration Frameworks | 10+ | 3+ | Spring Boot | Planned |
| Agent Frameworks | 5+ | Yes | Planned | Planned |
| Voice/Realtime | LiveKit, Pipecat | Yes | Planned | Planned |
| Vector Databases | Direct support | Direct support | Direct support | Planned |

How Does traceAI Trace LLM Calls and Agent Decisions Under the Hood?

OpenTelemetry Instrumentation Model for AI Workflows

traceAI uses OpenTelemetry’s Instrumentor pattern. Each framework gets its own instrumentor class (OpenAIInstrumentor, LangChainInstrumentor, and so on).

When you call .instrument(), the instrumentor wraps the framework’s core methods. Every LLM call, retrieval, or tool invocation automatically creates an OTel span.

Each span carries AI-specific attributes: model name, token counts, prompt templates, retrieval scores, tool parameters, and execution status. You don't write that mapping; the instrumentor does.
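
To make the pattern concrete, here is a minimal sketch of what an instrumentor does under the hood. This is not traceAI's actual source: the attribute keys are illustrative and the wrapped method path is an assumption about the openai SDK's layout.

from opentelemetry import trace
from opentelemetry.instrumentation.instrumentor import BaseInstrumentor
import wrapt

class SketchOpenAIInstrumentor(BaseInstrumentor):
    """Not traceAI's real code: a minimal illustration of the pattern."""

    def instrumentation_dependencies(self):
        return ["openai >= 1.0"]

    def _instrument(self, **kwargs):
        tracer = trace.get_tracer(
            __name__, tracer_provider=kwargs.get("tracer_provider")
        )

        def traced_create(wrapped, instance, args, kw):
            # Wrap the framework call in a span and attach AI-specific
            # attributes; the keys here are illustrative, not traceAI's schema.
            with tracer.start_as_current_span("openai.chat.completions.create") as span:
                span.set_attribute("llm.model_name", kw.get("model", "unknown"))
                response = wrapped(*args, **kw)
                usage = getattr(response, "usage", None)
                if usage is not None:
                    span.set_attribute("llm.token_count.total", usage.total_tokens)
                return response

        wrapt.wrap_function_wrapper(
            "openai.resources.chat.completions", "Completions.create", traced_create
        )

    def _uninstrument(self, **kwargs):
        pass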

What Are Semantic Span Kinds, and Why Do They Matter for Debugging?

Every span carries a typed span kind. This is the part that separates traceAI from generic HTTP tracing:

| Span Kind | Use Case |
| --- | --- |
| CHAIN | Full pipeline or workflow execution |
| LLM | Individual LLM call (completion, chat, embedding) |
| TOOL | Tool/function invocation within an agent |
| RETRIEVER | Vector search or document retrieval |
| EMBEDDING | Embedding generation |
| AGENT | Agent-level orchestration span |
| RERANKER | Reranking step in a RAG pipeline |
| GUARDRAIL | Safety/guardrail check |
| EVALUATOR | Inline evaluation span |

In your dashboard, you don’t see a flat list of HTTP calls. You see a nested timeline: Agent → Retriever → Embedding → LLM → Guardrail → Response.

The first time you see your agent rendered like that, the bug usually finds itself.
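
The nesting itself is ordinary OpenTelemetry context propagation. A minimal sketch with the raw OTel API, which traceAI's decorators handle for you:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("sales_prep_agent"):      # AGENT
    with tracer.start_as_current_span("search_docs"):       # RETRIEVER
        pass  # vector search happens here
    with tracer.start_as_current_span("chat.completions"):  # LLM
        pass  # model call happens here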

The Export Pipeline

traceAI doesn’t own the export path. OpenTelemetry does.

Traces flow: your code → traceAI instrumentor → OTel TracerProvider → BatchSpanProcessor → OTLP exporter → your backend.

Supported exporters include OTLP (HTTP/gRPC), Jaeger, Zipkin, and any custom OTel exporter.
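
If you prefer to wire that pipeline by hand instead of using traceAI's register() helper, it is all standard OTel SDK. A sketch, with the endpoint as a placeholder for whatever backend you run:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)

# Hand the provider to any traceAI instrumentor exactly as in the steps below:
# OpenAIInstrumentor().instrument(tracer_provider=provider)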

Building on OpenTelemetry instead of a custom protocol means your AI traces live next to your HTTP traces, your database traces, your queue traces. Same dashboard. Same alerts. Same team can own it.

How Do You Debug an AI Agent Giving Confidently Wrong Answers in 5 Steps?

The Problem

You built an internal AI assistant that helps sales reps prepare for calls. The agent has two tools: search_company_docs (searches your knowledge base) and fetch_deal_context (calls your CRM API for deal stage, last interaction, and open questions).

Staging is fine. In production, sales reps report the bot “makes up deal details.”

Here’s what’s actually happening: your CRM API intermittently times out during peak hours. The try/except block swallows the timeout and returns an empty string.

The LLM receives an empty tool result and fabricates plausible deal context. Your agent is lying to your sales team, and nothing in your logs flags it.

Let’s solve this with traceAI.

Step 1: Install traceAI and Register the Tracer

# pip install fi-instrumentation-otel traceAI-openai
import os

os.environ["FI_API_KEY"] = "your-fi-api-key"
os.environ["FI_SECRET_KEY"] = "your-fi-secret-key"
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType

# register() handles TracerProvider, BatchSpanProcessor, and OTLP
# exporter in one call. Swap the endpoint to Jaeger, Grafana Tempo,
# or Datadog with zero code changes.
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="sales-prep-agent",
)

Checkpoint: Tracer provider is ready. No traces flowing yet, just the pipeline waiting to receive spans.

Step 2: Auto-Instrument OpenAI

Two lines. Every LLM call in your codebase is now traced.

from traceai_openai import OpenAIInstrumentor

OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

The instrumentor wraps OpenAI’s core methods. Every call now emits a span with model name, token counts (input, output, total), latency in milliseconds, the full messages array, and the complete response.

Checkpoint: Every OpenAI call is traced. You changed zero lines of your existing agent code.

Step 3: Trace the Tool Calls That Actually Break

Auto-instrumentation covers the LLM. The tools that feed it data, though, are your own functions, and that is where the bug lives. Use @tracer.tool to make every tool call a visible span:

import requests
from fi_instrumentation import FITracer

tracer = FITracer(trace_provider.get_tracer(__name__))

@tracer.tool(
    name="search_company_docs",
    description="Searches knowledge base for product specs, case studies, pricing",
)
def search_company_docs(query: str) -> str:
    # vector_db stands in for your existing vector store client.
    results = vector_db.search(query, top_k=5)
    if not results:
        return "No relevant documents found."
    return "\n---\n".join([r.text for r in results])

@tracer.tool(
    name="fetch_deal_context",
    description="Fetches prospect deal stage, last interaction, open questions from CRM",
)
def fetch_deal_context(prospect_name: str) -> str:
    try:
        resp = requests.get(
            "https://internal-crm.company.com/api/deals",
            params={"prospect": prospect_name},
            timeout=3,
        )
        resp.raise_for_status()
        deal = resp.json()
        return (
            f"Deal stage: {deal['stage']}\n"
            f"Last interaction: {deal['last_interaction']}\n"
            f"Open questions: {', '.join(deal['open_questions'])}"
        )
    except (requests.Timeout, requests.HTTPError):
        # THIS IS THE BUG. The timeout is silently swallowed.
        return ""

When fetch_deal_context returns an empty string after a timeout, the span’s output attribute shows "". That empty string is now visible evidence, not a hidden failure.

Checkpoint: Both tools are traced. When the CRM API times out, the fetch_deal_context span shows an empty output and 3,002ms latency. Before, this was invisible.

Step 4: Connect the Agent and Run It

from openai import OpenAI

client = OpenAI()

@tracer.agent
def sales_prep_agent(rep_question: str, prospect_name: str) -> str:
    product_context = search_company_docs(rep_question)   # TOOL span
    deal_context = fetch_deal_context(prospect_name)      # TOOL span

    response = client.chat.completions.create(            # LLM span
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a sales preparation assistant. Use ONLY the provided "
                    "product context and deal context to prepare a call brief. "
                    "If any context is missing or empty, explicitly say so. "
                    "Never fabricate deal details."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Prepare me for a call with {prospect_name}.\n\n"
                    f"Product context:\n{product_context}\n\n"
                    f"Deal context:\n{deal_context}\n\n"
                    f"My question: {rep_question}"
                ),
            },
        ],
    )
    return response.choices[0].message.content

brief = sales_prep_agent(
    rep_question="What pricing plans do we offer for enterprise?",
    prospect_name="Acme Corp",
)

The trace structure this produces:

  • Root span: sales_prep_agent (AGENT) wraps the entire run
  • Child 1: search_company_docs (TOOL) captures the query, returned chunks, and latency
  • Child 2: fetch_deal_context (TOOL) captures the prospect name, returned data (or an empty string on timeout), and latency
  • Child 3: openai.chat.completions.create (LLM) captures the full messages, model, token counts, response, and latency

Checkpoint: In your Future AGI dashboard, you can see the full trace with three nested child spans under the agent root.

Step 5: Find the Silent Tool Failure in the Trace

A sales rep reports “the bot made up deal details about Acme Corp.” Here’s the debugging flow:

  1. Open the trace for the reported request. Three nested child spans under sales_prep_agent.
  2. Click search_company_docs (TOOL). Output: 5 relevant document chunks, latency 120ms. Fine.
  3. Click fetch_deal_context (TOOL). Output: "". Latency: 3,002ms. That’s the timeout.
  4. Click openai.chat.completions.create (LLM). The deal_context field in the messages array is blank. The model fabricated a deal stage, a last interaction date, and open questions that never existed.

Root cause in under 60 seconds. The fix: change the except block to return "Deal context unavailable. CRM API did not respond." instead of "".
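
Here is the patched tool, identical to Step 3 except for the except block:

@tracer.tool(
    name="fetch_deal_context",
    description="Fetches prospect deal stage, last interaction, open questions from CRM",
)
def fetch_deal_context(prospect_name: str) -> str:
    try:
        resp = requests.get(
            "https://internal-crm.company.com/api/deals",
            params={"prospect": prospect_name},
            timeout=3,
        )
        resp.raise_for_status()
        deal = resp.json()
        return (
            f"Deal stage: {deal['stage']}\n"
            f"Last interaction: {deal['last_interaction']}\n"
            f"Open questions: {', '.join(deal['open_questions'])}"
        )
    except (requests.Timeout, requests.HTTPError):
        # Fixed: the failure is now explicit in the prompt and in the span output.
        return "Deal context unavailable. CRM API did not respond."

The system prompt from Step 4 already tells the model to say so when context is missing, so the fabrication stops.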

Before traceAI: you spend days trying to reproduce a timing-dependent failure. You blame the prompt and rewrite it three times.

After traceAI: 60 seconds. Click three spans. Done.

What Do You Get From AI Tracing in Production?

Debug Multi-Step Agents in Minutes

With traceAI, every agent decision becomes a span.

You see the full execution tree: which tool fired, what the retriever returned, what the LLM generated, and where the chain broke.

“The agent gave a wrong answer” stops being a vague complaint. It becomes “the retriever returned 3 irrelevant chunks at step 2, the LLM hallucinated based on chunk 3, and the guardrail did not catch it because it only checks for PII.”

Track Cost and Latency Per Span

Token counts (input, output, total) sit on every LLM span. Multiply by your per-token rate. You know exactly how much each agent run costs.
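
If you want that arithmetic automated, a custom span processor can price every finished LLM span. A hedged sketch: the attribute keys and the rates are assumptions, so check the schema your backend actually receives before relying on them.

from opentelemetry.sdk.trace import SpanProcessor

# Illustrative $/1K-token rates; substitute your provider's current pricing.
COST_PER_1K = {"gpt-4o": {"in": 0.0025, "out": 0.01}}

class CostLogger(SpanProcessor):
    def on_end(self, span):
        attrs = span.attributes or {}
        model = attrs.get("llm.model_name")
        if model in COST_PER_1K:
            rate = COST_PER_1K[model]
            cost = (
                attrs.get("llm.token_count.prompt", 0) / 1000 * rate["in"]
                + attrs.get("llm.token_count.completion", 0) / 1000 * rate["out"]
            )
            print(f"{span.name}: ${cost:.4f}")

# Register alongside the exporter:
# trace_provider.add_span_processor(CostLogger())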

Latency breaks down by span type. Is the bottleneck in the LLM call (model-side) or the retriever (your infrastructure)?

Set up monitors for latency spikes or token usage outliers through Future AGI’s Alerts or any OTel-compatible alerting system.

Export to Your Existing Stack

| Export Destination | Use Case |
| --- | --- |
| Jaeger / Grafana Tempo | Self-hosted trace visualization |
| Datadog | Unified infra + AI monitoring |
| S3 / Azure Blob / GCS | Compliance and audit archival |
| SQS / Pub/Sub | Stream trace events to custom consumers |
| Future AGI Observe | Full AI observability with trace-to-eval linkage |

One team added traceAI three months into production. Within a week they found that 12% of their retrieval calls were returning empty results due to a stale index. The agent was covering for it by hallucinating plausible-sounding answers. Tracing surfaced it on day one.

The Bottom Line

When Should You Add Open-Source AI Tracing to Your Production Stack?

| Stage | Should You Add traceAI? |
| --- | --- |
| Prototyping / hackathon | Optional. Adds less than 5 minutes of setup. |
| Pre-production / staging | Yes. Find issues before your users do. |
| Production | Non-negotiable. You can't operate what you can't observe. |
| Multi-agent systems | Critical. Without tracing, debugging agent-to-agent handoffs is nearly impossible. |

Takeaways

AI systems are multi-component, multi-framework, multi-provider pipelines. Traditional observability does not cover them. Custom logging does not scale.

traceAI gives you standardized, OpenTelemetry-native instrumentation across 35+ frameworks in 4 languages. Two lines of code. Zero vendor lock-in.

If you ship AI to production without tracing, you operate blind. traceAI fixes that.

It’s open source. Try it out.

Frequently Asked Questions

What is the best way to add observability to a LangChain or LlamaIndex agent in production?

Install traceAI-langchain or traceAI-llamaindex, call .instrument() once, and every chain, retrieval, and LLM call emits a structured OTel span automatically. You get token counts, latency per step, and the full input/output at each stage without touching your existing agent code.
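
In code, that is two lines (the module path is assumed to mirror traceai_openai from the walkthrough above):

from traceai_langchain import LangChainInstrumentor

LangChainInstrumentor().instrument(tracer_provider=trace_provider)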

How do I trace OpenAI API calls and see token costs per request?

OpenAIInstrumentor().instrument(tracer_provider=trace_provider) wraps every chat.completions.create() call in your codebase. It captures model name, input tokens, output tokens, total tokens, and latency as span attributes. Multiply token counts by your per-token rate directly in the trace. No third-party cost dashboard required.

What is the difference between OpenTelemetry and AI-specific tracing frameworks like traceAI?

OpenTelemetry is a general-purpose distributed tracing standard built for HTTP services, databases, and queues. It has no concept of an LLM call, a retrieval step, or a tool invocation. traceAI extends OTel with AI-specific semantic conventions (span kinds like LLM, RETRIEVER, TOOL, AGENT) so your traces carry the context you actually need to debug an AI pipeline.

Can I use open-source AI tracing without sending data to a third-party platform?

Yes. traceAI exports to any OTLP-compatible backend, including self-hosted Jaeger or Grafana Tempo. The instrumentation layer is fully independent of Future AGI’s Observe platform. Point the exporter at your own endpoint and your traces never leave your infrastructure.

How do I debug a multi-step AI agent that returns wrong answers without any errors in the logs?

Silent failures almost always live in the tool calls that feed data to the LLM, not in the LLM itself. Wrap each tool function with @tracer.tool and check the output attribute on the resulting span. An empty string, a stale payload, or a 3-second latency spike tells you exactly which tool broke and why the model filled the gap with fabricated content.
