Guides

Claude Code Observability with OpenInference and OpenTelemetry in 2026

A 2026 walkthrough for Claude Code observability with OpenInference and OpenTelemetry: architecture, 4-step setup with config, and 5 scored backends.

·
17 min read
ai-gateway 2026 claude-code llm-observability
Editorial cover image for Claude Code Observability with OpenInference and OpenTelemetry in 2026

Claude Code is the most popular interactive coding agent on the market, and in May 2026 it still has no native trace exporter. The CLI streams output to your terminal, logs minimal local diagnostics, and forwards every request to api.anthropic.com. Nothing else leaves the box. That’s fine for one developer; it’s a problem the moment a team has more than five engineers, a finance department asking for chargeback, or a security team that wants to know which prompts touched a regulated repository.

The fix is a pair of open standards. OpenInference (github.com/Arize-ai/openinference) is the semantic convention for LLM telemetry, it defines what a span for an LLM call looks like, what attributes it carries (llm.input_messages, llm.token_count.total, llm.tool_calls), and how tool-use and retrieval spans nest underneath. OpenTelemetry (OTel) is the W3C-stewarded transport: it ships those spans to any backend you choose. Together they give Claude Code the observability story Anthropic hasn’t built, without locking you into a single vendor.

This post is half walkthrough, half buyer’s guide. The first half is the architecture and the four-step setup to get Claude Code emitting OpenInference-compliant spans over OTLP. The second half scores five backends that consume those spans, with a 7-axis matrix and a “where it falls short” block on each.


The problem: Claude Code has no native trace export

Anthropic ships Claude Code with three places telemetry could come from, none enough on its own.

  1. ~/.claude/logs/, a local rotating log of CLI events. Useful for “why did my CLI crash”; useless for “how much did the team spend on Tuesday.” No full request payload, no model response, rotated on a short window.
  2. The Anthropic dashboard, aggregate token totals per API key, refreshed daily. If your team shares one key, every developer collapses into one row. If every developer has their own key, you lose the bulk discount.
  3. The CLI’s --debug mode, verbose stdout traces per turn. Local debugging only. You can’t pipe it into a SIEM, can’t query with SQL, and it leaks into the same terminal the developer is working in.

What an OTel-native engineer wants in 2026 is a span tree per session: one root span per Claude Code turn, child spans for each tool call, model name and token counts as attributes, the whole trace shipped over OTLP to a backend the team controls. That’s what OpenInference defines and what Claude Code doesn’t produce.

The workaround is a gateway in front of Claude Code that intercepts the Anthropic call, builds the OpenInference span, and exports it. The gateway is the seam.


Why OpenInference + OpenTelemetry matter

You could solve this with a proprietary SDK. The reason to choose the open path comes down to four things:

No vendor lock-in on the wire format. OpenInference spans look the same whether produced by Future AGI’s traceAI, Arize’s openinference-instrumentation-anthropic, or a hand-rolled span builder. If you switch backends in 18 months, you don’t rewrite instrumentation, you change the OTLP endpoint URL.

Standards-grade semantic conventions. Attribute names like llm.model_name, llm.token_count.prompt, llm.token_count.completion, llm.tool_calls, input.value, output.value are stable across the ecosystem. Any analytics tool that speaks OpenInference can read your spans without a custom adapter.

OTel transport is battle-tested. OTLP/gRPC and OTLP/HTTP are the same protocols your microservices already export to. The collectors, exporters, and samplers your platform team already runs for non-LLM workloads work unchanged for LLM workloads.

The eval and optimizer ecosystem is built on these spans. ai-evaluation, Arize Phoenix’s eval framework, and several open-source agent optimizers consume OpenInference traces directly. If your spans are well-formed, evals and failure clustering come for free.


Architecture: Claude Code through the seam

The reference architecture has four moving parts. Reading left to right:

Claude Code CLI
    │   (Anthropic API)

Gateway (Future AGI / Phoenix / Langfuse / Helicone / Maxim)
    │   (1) Forward to api.anthropic.com
    │   (2) Build OpenInference span

    ├──► api.anthropic.com  (the actual model call)

    └──► OTel exporter (OTLP/gRPC or OTLP/HTTP)


       Backend storage + query
       (ClickHouse / Postgres / Phoenix / FAGI)


       Dashboard + eval + alerts

The gateway is the only new thing the developer notices, and only because ANTHROPIC_BASE_URL now points to it. The Claude Code CLI needs no plugin, fork, or custom build. Tool calls survive because the gateway preserves Anthropic’s tool-use blocks. Streaming survives because the gateway passes SSE through without buffering.

The span builder in the gateway constructs an OpenInference span tree on every request. The root span carries openinference.span.kind = "LLM", llm.model_name, llm.token_count.{prompt,completion,total}, and a session.id that ties every turn in a conversation together. Tool-use child spans get openinference.span.kind = "TOOL" with the tool name and arguments. Retrieval spans get "RETRIEVER". The whole tree exports over OTLP on the gateway’s own connection.

Three numbers to set expectations on overhead:

  • Span construction at the gateway: typically 1 to 3 ms per call.
  • OTLP export: asynchronous and batched. The default batch span processor sends every 5 seconds or 512 spans, whichever comes first. Zero added latency on the Claude Code turn.
  • Sampling 100% in dev / 10% in prod: a typical posture. A 30-developer team at 100% produces around 250K spans/day; at 10% production sampling, ~25K spans/day.

4-step setup walkthrough

This is the minimal config to get Claude Code spans into any OpenInference-compatible backend. Examples use traceAI (Apache 2.0); the same shape works for Arize Phoenix’s openinference-instrumentation-anthropic if you prefer producer-side instrumentation.

Step 1: Prereqs

Versions pinned as of May 2026:

pip install traceai-anthropic==0.6.4 \
            opentelemetry-api==1.30.0 \
            opentelemetry-sdk==1.30.0 \
            opentelemetry-exporter-otlp-proto-grpc==1.30.0

# environment
export ANTHROPIC_API_KEY="sk-ant-..."        # your real key
export FI_API_KEY="..."                       # if using FAGI as backend
export FI_SECRET_KEY="..."
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.futureagi.com/v1/traces"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"

For Phoenix as backend, swap the endpoint to http://localhost:6006/v1/traces and add OTEL_EXPORTER_OTLP_HEADERS="api_key=...". Langfuse uses the same OTLP collector, point at its ingress, supply public/secret key pair as headers.

Step 2: Point Claude Code at the gateway

Set ANTHROPIC_BASE_URL in your shell profile so both the CLI and IDE plugin use the gateway:

# ~/.zshrc or ~/.bashrc
export ANTHROPIC_BASE_URL="https://gateway.your-company.internal/anthropic"
export ANTHROPIC_AUTH_TOKEN="${ANTHROPIC_API_KEY}"   # gateway re-validates

Most gateways accept Anthropic-shape requests at /anthropic/v1/messages and forward transparently. Confirm SSE pass-through, buffering streaming responses breaks Claude Code’s progress UI mid-turn.

Step 3: Wire the OpenInference instrumentation at the gateway

In FAGI’s Agent Command Center the instrumentation is built in, traceAI emits OpenInference spans on every Anthropic call without user code. For a self-hosted proxy (or LiteLLM/Bifrost deployment), the instrumentation lives in the proxy process:

from traceai_anthropic import AnthropicInstrumentor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "claude-code-gateway",
    "service.version": "2026.5.17",
    "deployment.environment": "prod",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://api.futureagi.com/v1/traces",
            headers={
                "x-fi-api-key": os.environ["FI_API_KEY"],
                "x-fi-secret-key": os.environ["FI_SECRET_KEY"],
            },
        )
    )
)
trace.set_tracer_provider(provider)

# Attach OpenInference instrumentation to every Anthropic client in the proxy
AnthropicInstrumentor().instrument(tracer_provider=provider)

The AnthropicInstrumentor wraps the Anthropic SDK so every messages.create produces a properly named OpenInference span, openinference.span.kind = "LLM", input on llm.input_messages, output on llm.output_messages, tokens on llm.token_count.*. Tool calls produce TOOL child spans automatically.

Step 4: Add session + developer + repo attributes

The default span tree is correct but anonymous. To make chargeback and per-repo slicing work, set three custom attributes on every root span: session.id (the Claude Code conversation), user.id (the developer’s SSO email), repo.url (the git remote).

from opentelemetry import trace
from contextlib import contextmanager

tracer = trace.get_tracer(__name__)

@contextmanager
def claude_code_span(session_id: str, user_id: str, repo_url: str):
    with tracer.start_as_current_span("claude_code.turn") as span:
        span.set_attribute("session.id", session_id)
        span.set_attribute("user.id", user_id)
        span.set_attribute("repo.url", repo_url)
        span.set_attribute("openinference.span.kind", "AGENT")
        yield span

The Claude Code wrapper passes X-Session-Id, X-User-Id, X-Repo-Url headers (FAGI derives them from the API key claim). Every child span inherits the parent’s session.id via OTel context propagation, tag once at the root, the whole tree is slice-able.

Verification: what success looks like

Run Claude Code with the gateway in front; within ~10 seconds the backend should show:

  • One root span per turn (claude_code.turn) with the three custom attributes set.
  • Nested openinference.span.kind = "LLM" spans for each messages.create, with token counts and model name.
  • Nested openinference.span.kind = "TOOL" spans for every bash, file read, or grep Claude Code made.
  • A session.id filter that surfaces every turn in the same conversation across developers and repositories.

Flat spans (no nesting) mean OTel context didn’t propagate across the gateway’s async handlers, fix that before scoring backends. Empty llm.input_messages means a sampler or redaction policy is stripping content; check the SDK config.


5 observability backends scored

The setup above is producer-agnostic. These five backends all consume OpenInference; they vary in what they do once spans land.

The 7 axes:

AxisWhat it measures
1. OpenInference complianceDoes the backend understand the OpenInference semantic conventions natively, or via a custom adapter?
2. OTel-native exportDoes it ingest OTLP/gRPC + OTLP/HTTP without proprietary collectors?
3. Span attribute richnessDoes the UI surface tool-call args, retrieval hits, eval scores per span?
4. Sampling controlsTail-based sampling, head-based, attribute-driven — what’s supported?
5. Trace correlation across sessionsCan you filter every span for a session.id across the entire team?
6. Eval hook surfaceCan scoring run on captured spans (latency, faithfulness, code-correctness)?
7. RetentionDefault raw-span retention and aggregate retention.

Five picks, scored on all seven, with where each falls short.


1. Future AGI Agent Command Center: Best for trace-to-optimizer loop

Verdict: Future AGI is the only backend here that takes the OpenInference spans it ingests and uses them to improve the system that produced them. The other four are observation layers; Agent Command Center is an observation layer wired to an evaluator and an optimizer.

What it does for Claude Code observability:

  • OpenInference compliance is native. traceAI (Apache 2.0) is the open producer; Agent Command Center is the open consumer. No adapter layer.
  • OTel-native export over OTLP/gRPC and OTLP/HTTP at api.futureagi.com/v1/traces. Self-host endpoint in the BYOC deployment.
  • Span attribute richness. UI groups by openinference.span.kind, shows tool-call args + responses inline, surfaces eval scores per span.
  • Sampling controls include attribute-driven (sample 100% of user.id in pilot_users, 10% otherwise) and tail-based (sample 100% of spans where llm.token_count.total > 100000).
  • Trace correlation is the strongest feature, session.id is a first-class index, with one-click pivot from a Claude Code turn back to every other turn in the same session.
  • Eval hook surface is fi.evals running on every captured span: faithfulness, code-correctness, tool-use accuracy. Low-scoring sessions cluster into a failure dataset.
  • Retention is 30 days raw spans on Free, 90 days on Scale, custom on Enterprise.

The loop. Every trace gets scored. traceAI instruments 35+ frameworks OpenInference-natively, and Error Feed (FAGI’s “Sentry for AI agents”) sits alongside as the zero-config error monitor: auto-clusters low-scoring sessions by failure mode into named issues (50 traces → 1 issue, e.g., “Claude Opus called when Sonnet would have been enough”), auto-writes the root cause from span evidence plus a quick fix plus a long-term recommendation, and tracks rising/steady/falling trend per issue so regressions surface like exceptions rather than buried in flame graphs. fi.opt.optimizers (ProTeGi, Bayesian, GEPA) rewrites prompts or routing policy against the clustered failures. The gateway applies the updated route on the next request. The Future AGI Protect model family sits on the same data path at ~67 ms p50 text and ~109 ms p50 image (arXiv 2510.13351). FAGI’s own fine-tuned Gemma 3n adapters across content moderation, bias detection, security/prompt-injection, and data privacy/PII, multi-modal across text/image/audio, a model family rather than a plugin chain.

Where it falls short:

  • agent-opt is opt-in, for a one-week pilot focused on looking at spans, start with traceAI + ai-evaluation. Phoenix is also a lighter alternative for that single-feature brief.
  • The flame-graph view is opinionated, fewer knobs than Phoenix’s, faster to read; Phoenix power users will want a week to acclimatise.

Pricing: Free tier with 100K spans/month. Scale tier from $99/month. Enterprise custom with SOC 2 Type II + BAA. AWS Marketplace listing for procurement.

Score: 7/7 axes.


2. Arize Phoenix: Best for OSS-first OpenInference reference

Verdict: Phoenix is the reference implementation of OpenInference. Arize maintains the spec; Phoenix is the matching open-source consumer. If your team wants a 100% OSS observability stack, Phoenix is the right starting point. Falls short on the optimizer side.

What it does for Claude Code observability:

  • OpenInference compliance is, by definition, native. The UI was designed around OpenInference attribute names.
  • OTel-native export over OTLP/gRPC and OTLP/HTTP. Runs self-hosted via Python package or Docker; Arize-hosted version is also available.
  • Span attribute richness is strong. Tool-use, retrieval hits, model parameters, and evaluations side by side. Flame-graph view is best-in-class.
  • Sampling controls are mostly client-side; tail-based requires an upstream OTel collector.
  • Trace correlation by session.id works but the UI’s session-pivot is less polished, you filter, you don’t click-through pivot.
  • Eval hook surface is phoenix.evals: hallucination, relevance, toxicity, custom LLM-as-judge. No optimizer downstream.
  • Retention depends on your Postgres / SQLite storage on self-host.

Where it falls short:

  • No optimizer. Phoenix observes and evaluates; it doesn’t feed back into routing.
  • Hosted offering is younger; scale-out beyond a few hundred RPS on self-host needs Postgres tuning.
  • Per-developer chargeback requires custom SQL on the spans table.

Pricing: OSS Phoenix is free under the Elastic License 2.0. Arize-hosted starts free; Pro from ~$50/seat/month; Enterprise custom with SOC 2 Type II.

Score: 5.5/7 axes (missing: optimizer, mature hosted scale-out).


3. Langfuse: Best for hosted OTel consumer with mature ops

Verdict: Langfuse pivoted onto OpenTelemetry in late 2025 and added OpenInference attribute mapping in early 2026. The dashboard, prompt management, eval library, and RBAC are all polished. Falls short on optimizer feedback and on depth of OpenInference-specific span typing.

What it does for Claude Code observability:

  • OpenInference compliance is via attribute mapping, not native. Fidelity is good enough for typical Claude Code workloads but not exhaustive.
  • OTel-native export at Langfuse’s OTLP ingress (cloud or self-host). HTTP and gRPC both supported as of v2.95+.
  • Span attribute richness is strong on the standard set (input, output, tokens, model, latency). Retrieval-span typing is shallower than Phoenix or FAGI.
  • Sampling controls are head-based at the producer. Tail-based not natively supported.
  • Trace correlation by session_id is a first-class concept.
  • Eval hook surface is langfuse.evals plus LLM-as-judge templates. No optimizer downstream.
  • Retention is 90 days on Hobby, 365 on Pro, custom on Enterprise; self-host is unbounded.

Where it falls short:

  • No optimizer.
  • OpenInference-to-Langfuse mapping occasionally drops less-common attributes (e.g., retrieval.documents[].content). Rarely matters for pure Claude Code; does for RAG-heavy workloads.
  • Hosted product is EU-first; US region exists but feature parity sometimes lags a release.

Pricing: Hobby free with 50K observations/month. Core $59/month. Pro $199/month. Self-host (MIT) free.

Score: 5.5/7 axes (missing: optimizer, full OpenInference fidelity).


4. Helicone: Best for low-friction per-request observability

Verdict: Helicone is the easiest drop-in proxy here. As of v3 it emits OpenInference-formatted spans via OTLP. Shallower than the four above, but for a 10-developer team that just wants per-request cost + a span search UI for Claude Code, the time-to-value is minutes.

What it does for Claude Code observability:

  • OpenInference compliance added in v3 (March 2026). OTLP exporter emits OpenInference span kinds correctly.
  • OTel-native export through Helicone’s OTLP relay endpoint; fan-out to your own backend via OTel collector.
  • Span attribute richness is decent on the basics; tool-call argument surfacing is workable but less deep than Phoenix.
  • Sampling controls are head-based via custom properties; tail-based not supported.
  • Trace correlation by Helicone-Session-Id header. Claude Code wrapper must set it.
  • Eval hook surface is Helicone’s “Score” feature plus user feedback hooks. Less expressive than Phoenix or Langfuse.
  • Retention is 30 days on Free, 90 on Pro, custom on Enterprise.

Where it falls short:

  • No optimizer.
  • OpenInference v3 support is recent; nested AGENT spans need manual proxy instrumentation. Plan a half-day to wire.
  • Routing intelligence is basic, round-robin and failover. Model-tier routing has to be coded upstream.

Pricing: Free with 10K requests/month. Pro from $25/month. Enterprise custom.

Score: 4.5/7 axes (missing: optimizer, deep span typing, tail-based sampling).


5. Maxim Bifrost: Best for OTel-collector-first deployments

Verdict: Bifrost is a Go-based gateway that ships OpenInference-formatted spans through a standard OTel collector. The pitch is “we’re the gateway; you bring your own backend.” If your platform team already runs an OTel collector + Tempo/Jaeger/ClickHouse, Bifrost slots in without a new SaaS dependency. Trade-off: the LLM-specific UI is leaner than dedicated LLM observability products.

What it does for Claude Code observability:

  • OpenInference compliance in the OTLP output, span kinds and attribute names on every request.
  • OTel-native export is the entire point. OTLP/gRPC + HTTP, head-based + tail-based sampling at the collector tier.
  • Span attribute richness depends on your downstream. Bifrost emits, your backend (Tempo, ClickHouse, Honeycomb) renders.
  • Sampling controls are the strongest in this list, you push sampling into the OTel collector where existing patterns work.
  • Trace correlation by session.id works as a span attribute; what you do with it’s a property of your backend.
  • Eval hook surface is Maxim’s eval product, a separate purchase. Gateway emits clean spans.
  • Retention is whatever your downstream backend provides.

Where it falls short:

  • No optimizer.
  • The LLM observability UI is owned by your backend. Bifrost ships spans, not a polished LLM-specific dashboard. Plan a week to wire the UX layer.
  • Maxim’s hosted observability + eval are separate purchases; expect mix-and-match for the full story.

Pricing: Bifrost is OSS under Apache 2.0. Maxim’s hosted observability + eval starts free; Pro from $99/month; Enterprise custom.

Score: 5/7 axes (missing: optimizer, native LLM-grade dashboard, polished eval-on-trace UX).


Capability matrix

AxisFAGI Agent Command CenterArize PhoenixLangfuseHeliconeMaxim Bifrost
OpenInference complianceNativeNative (reference)Mappedv3 nativeNative
OTel-native exportOTLP/gRPC + HTTPOTLP/gRPC + HTTPOTLP/gRPC + HTTPOTLP relayOTel collector first
Span attribute richnessHighHighMedium-highMediumBackend-dependent
Sampling controlsHead + tail + attributeHead (client)HeadHeadTail (collector)
Trace correlation by session.idFirst-class pivotFilterFirst-classHeader-drivenBackend-dependent
Eval hook surfacefi.evals + optimizerphoenix.evalslangfuse.evalsScores + feedbackMaxim eval (separate)
Retention (raw spans)30/90/customSelf-hosted = unbounded90/365/custom30/90/customBackend-dependent
Feedback loop / optimizerfi.optNoneNoneNoneNone

Decision framework: Choose X if

Choose Future AGI Agent Command Center if the OpenInference traces should also drive prompt and route optimization. Pick this when Claude Code is becoming a significant line item ($10K+/month) and you want the eval-and-optimize loop end-to-end without three separate vendors.

Choose Arize Phoenix if your team wants the canonical OSS reference for OpenInference and is comfortable self-hosting a Postgres-backed service. Pick when open-source-first and reference correctness matter more than hosted polish.

Choose Langfuse if you want a hosted OTel consumer with the most polished LLM-ops dashboard, mature RBAC, and prompt-management alongside traces. Pick when procurement matters and you accept mapped (not native) OpenInference fidelity.

Choose Helicone for small teams that want per-request cost visibility today and don’t need an optimizer or deep sampling. Best fit for under 10 developers on Claude Code.

Choose Maxim Bifrost if your platform team already runs an OTel collector + Tempo/ClickHouse/Honeycomb and wants LLM spans in the same stack. Pick when “no new SaaS” is the governing constraint.


Common mistakes when wiring Claude Code observability

MistakeWhat goes wrongFix
Buffering streaming responses at the gatewayClaude Code’s progress UI freezes mid-turnConfirm SSE pass-through on the gateway — every pick above supports it but some self-built proxies do not
Tagging only user_id, not session.idSession-level cost attribution is impossible; one ballooned context turn cannot be traced back to its conversationTag both; session.id is the OpenInference standard for the conversation root
Sampling at the application before the proxySpans go missing for the workloads you care about mostSample at the OTel collector with tail-based rules; let the gateway always emit
Forgetting context propagation across async handlers in the gatewaySpans appear flat (no parent-child nesting)Use OTel’s context propagation in async middleware; traceAI does this automatically, hand-rolled proxies often miss it
Mixing OpenInference and the vendor’s proprietary attribute namesHalf your data uses llm.model_name, half uses model; queries breakPick one (OpenInference) and stick to it; map the vendor’s names at the producer, not the consumer
Setting OTLP sync export in devPer-turn latency in Claude Code goes up 50-200msAlways use the batch span processor; sync export is for local debugging only

How Future AGI closes the loop on Claude Code observability

The other four backends treat OpenInference traces as an end state: capture, show, alert. Future AGI treats them as the input to a feedback loop:

  1. Trace. Every Claude Code turn produces an OpenInference span tree via traceAI (Apache 2.0). Captures inputs, outputs, tool calls, model, and the session.id + user.id + repo.url attributes.
  2. Evaluate. fi.evals scores every turn against task-completion, faithfulness, code-correctness, and tool-use accuracy. Scores live on the same span as the cost data.
  3. Cluster. Low-scoring sessions cluster by failure mode in Agent Command Center, e.g., “Claude Opus called when Sonnet would have been enough.”
  4. Optimize. fi.opt.optimizers (ProTeGi, Bayesian, GEPA) rewrites the system prompt or routing policy against the clustered failures, feeding off the same OpenInference spans.
  5. Route. The gateway applies the updated policy on the next request. Live Protect guardrails at ~67ms (arXiv 2510.13351) sit on the same data path.
  6. Re-deploy. Versioned prompt + route; automatic rollback if the score regresses.

The three Apache 2.0 building blocks: traceAI, ai-evaluation, agent-opt. The hosted Agent Command Center adds failure-cluster view, live Protect, RBAC, SOC 2 Type II certified, and AWS Marketplace.


What we did not include

We deliberately left out three backends that show up in other 2026 listicles:

  • Datadog LLM Observability. OpenInference-aware via mapping, but LLM-specific eval surface is narrower and pricing escalates aggressively on span volume. Worth re-evaluating if your team already runs Datadog APM.
  • Honeycomb. Outstanding OTel backend and a natural pair with Maxim Bifrost, but LLM-specific dashboarding is less prescriptive than Phoenix or FAGI.
  • LangSmith. Strong product, but OpenInference compliance is via adapter as of May 2026; remains primarily a LangChain-shaped consumer.


Sources

  • OpenInference specification, github.com/Arize-ai/openinference
  • OpenTelemetry specification, opentelemetry.io/docs/specs
  • traceAI, github.com/future-agi/traceAI (Apache 2.0)
  • ai-evaluation, github.com/future-agi/ai-evaluation (Apache 2.0)
  • agent-opt, github.com/future-agi/agent-opt (Apache 2.0)
  • Arize Phoenix, github.com/Arize-ai/phoenix
  • Langfuse OTel support, langfuse.com/docs/opentelemetry
  • Helicone OpenInference v3 release notes, helicone.ai/changelog
  • Maxim Bifrost gateway, github.com/maximhq/bifrost
  • Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (67ms text, 109ms image)
  • Anthropic Claude Code documentation, claude.ai/docs/claude-code

Frequently asked questions

Do I have to use traceAI or can I use Arize's openinference-instrumentation-anthropic?
Either works. Both produce OpenInference-compliant spans. `traceAI` is the FAGI-maintained Apache 2.0 producer; Arize's `openinference-instrumentation-anthropic` is the reference. They share the spec and emit interoperable spans.
How much latency does the OTel exporter add to a Claude Code turn?
With the batch span processor, effectively zero — span export happens asynchronously after the response streams back. Span construction at the gateway adds 1–3ms; OTLP serialise + send is off the critical path.
Can I sample at the OTel collector for cost reasons?
Yes — and you should once volumes exceed a few hundred thousand spans per day. Use `tail_sampling_processor` with attribute rules: keep 100% of spans where `llm.token_count.total > 100000`, keep 100% of errors, sample the rest at 10%.
Is OpenInference stable enough to commit to in 2026?
Yes. Stable since v0.1.0 (mid-2024); v1.0 shipped late 2025. Major LLM observability vendors all consume OpenInference spans; the LLM semantic conventions inside the OTel spec itself track OpenInference closely.
How does this compare to Anthropic's first-party observability when it ships?
Anthropic has not announced one as of May 2026. If they ship one, the open path still wins on portability — your spans work the same whether your model is Claude, GPT, Gemini, or self-hosted Llama. Betting on a vendor-specific format means rewriting instrumentation when you change models.
Related Articles
View all
Top 5 Tools for Claude Code Cost Management in 2026
Guides

Five tools for Claude Code cost management in 2026 — four gateways plus the native Anthropic dashboard and a FinOps platform — scored on attribution, chargeback, caps, routing, cache observability, FinOps integration, and audit trail.

NVJK Kartik
NVJK Kartik ·
18 min
Stay updated on AI observability

Get weekly insights on building reliable AI systems. No spam.