Claude Code Observability with OpenInference and OpenTelemetry in 2026
A 2026 walkthrough for Claude Code observability with OpenInference and OpenTelemetry: architecture, 4-step setup with config, and 5 scored backends.
Claude Code is the most popular interactive coding agent on the market, and in May 2026 it still has no native trace exporter. The CLI streams output to your terminal, logs minimal local diagnostics, and forwards every request to api.anthropic.com. Nothing else leaves the box. That’s fine for one developer; it’s a problem the moment a team has more than five engineers, a finance department asking for chargeback, or a security team that wants to know which prompts touched a regulated repository.
The fix is a pair of open standards. OpenInference (github.com/Arize-ai/openinference) is the semantic convention for LLM telemetry: it defines what a span for an LLM call looks like, what attributes it carries (llm.input_messages, llm.token_count.total, llm.tool_calls), and how tool-use and retrieval spans nest underneath. OpenTelemetry (OTel) is the CNCF-stewarded transport: it ships those spans to any backend you choose. Together they give Claude Code the observability story Anthropic hasn’t built, without locking you into a single vendor.
This post is half walkthrough, half buyer’s guide. The first half is the architecture and the four-step setup to get Claude Code emitting OpenInference-compliant spans over OTLP. The second half scores five backends that consume those spans, with a 7-axis matrix and a “where it falls short” block on each.
The problem: Claude Code has no native trace export
Anthropic ships Claude Code with three places telemetry could come from, none enough on its own.
- ~/.claude/logs/, a local rotating log of CLI events. Useful for “why did my CLI crash”; useless for “how much did the team spend on Tuesday.” No full request payload, no model response, rotated on a short window.
- The Anthropic dashboard, aggregate token totals per API key, refreshed daily. If your team shares one key, every developer collapses into one row. If every developer has their own key, you lose the bulk discount.
- The CLI’s --debug mode, verbose stdout traces per turn. Local debugging only. You can’t pipe it into a SIEM, can’t query with SQL, and it leaks into the same terminal the developer is working in.
What an OTel-native engineer wants in 2026 is a span tree per session: one root span per Claude Code turn, child spans for each tool call, model name and token counts as attributes, the whole trace shipped over OTLP to a backend the team controls. That’s what OpenInference defines and what Claude Code doesn’t produce.
The workaround is a gateway in front of Claude Code that intercepts the Anthropic call, builds the OpenInference span, and exports it. The gateway is the seam.
Why OpenInference + OpenTelemetry matter
You could solve this with a proprietary SDK. The reason to choose the open path comes down to four things:
No vendor lock-in on the wire format. OpenInference spans look the same whether produced by Future AGI’s traceAI, Arize’s openinference-instrumentation-anthropic, or a hand-rolled span builder. If you switch backends in 18 months, you don’t rewrite instrumentation; you change the OTLP endpoint URL.
Standards-grade semantic conventions. Attribute names like llm.model_name, llm.token_count.prompt, llm.token_count.completion, llm.tool_calls, input.value, output.value are stable across the ecosystem. Any analytics tool that speaks OpenInference can read your spans without a custom adapter.
OTel transport is battle-tested. OTLP/gRPC and OTLP/HTTP are the same protocols your microservices already export to. The collectors, exporters, and samplers your platform team already runs for non-LLM workloads work unchanged for LLM workloads.
The eval and optimizer ecosystem is built on these spans. ai-evaluation, Arize Phoenix’s eval framework, and several open-source agent optimizers consume OpenInference traces directly. If your spans are well-formed, evals and failure clustering come for free.
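To make the attribute names above concrete, here is a minimal sketch of how a hand-rolled span builder might flatten an Anthropic-shaped request and response into OpenInference-style keys. The function name `llm_span_attributes` and the sample payloads are illustrative assumptions, not part of any library; a real producer such as traceAI sets these on an OTel span rather than returning a dict.

```python
import json
from typing import Any


def llm_span_attributes(request: dict[str, Any], response: dict[str, Any]) -> dict[str, Any]:
    """Flatten an Anthropic-shaped request/response into OpenInference-style keys.

    Hypothetical helper for illustration only.
    """
    attrs: dict[str, Any] = {
        "openinference.span.kind": "LLM",
        "llm.model_name": response.get("model", request.get("model", "")),
        "input.value": json.dumps(request["messages"]),
    }
    # Input messages are flattened with a numeric index per message.
    for i, msg in enumerate(request["messages"]):
        attrs[f"llm.input_messages.{i}.message.role"] = msg["role"]
        attrs[f"llm.input_messages.{i}.message.content"] = str(msg["content"])

    # Join the text blocks of the response into one assistant message.
    text = "".join(b["text"] for b in response["content"] if b["type"] == "text")
    attrs["llm.output_messages.0.message.role"] = "assistant"
    attrs["llm.output_messages.0.message.content"] = text
    attrs["output.value"] = text

    # Anthropic reports input/output tokens; OpenInference names them prompt/completion.
    usage = response.get("usage", {})
    prompt = usage.get("input_tokens", 0)
    completion = usage.get("output_tokens", 0)
    attrs["llm.token_count.prompt"] = prompt
    attrs["llm.token_count.completion"] = completion
    attrs["llm.token_count.total"] = prompt + completion
    return attrs
```

Because every producer writes the same keys, a query such as "sum llm.token_count.total by llm.model_name" should not need to change when the backend does.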
Architecture: Claude Code through the seam
The reference architecture has four moving parts. Reading left to right:
Claude Code CLI
      │  (Anthropic API)
      ▼
Gateway (Future AGI / Phoenix / Langfuse / Helicone / Maxim)
      │  (1) Forward to api.anthropic.com
      │  (2) Build OpenInference span
      │
      ├──► api.anthropic.com (the actual model call)
      │
      └──► OTel exporter (OTLP/gRPC or OTLP/HTTP)
                 │
                 ▼
           Backend storage + query
           (ClickHouse / Postgres / Phoenix / FAGI)
                 │
                 ▼
           Dashboard + eval + alerts
The gateway is the only new thing the developer notices, and only because ANTHROPIC_BASE_URL now points to it. The Claude Code CLI needs no plugin, fork, or custom build. Tool calls survive because the gateway preserves Anthropic’s tool-use blocks. Streaming survives because the gateway passes SSE through without buffering.
The span builder in the gateway constructs an OpenInference span tree on every request. The root span carries openinference.span.kind = "LLM", llm.model_name, llm.token_count.{prompt,completion,total}, and a session.id that ties every turn in a conversation together. Tool-use child spans get openinference.span.kind = "TOOL" with the tool name and arguments. Retrieval spans get "RETRIEVER". The whole tree exports over OTLP on the gateway’s own connection.
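The shape of that span tree can be sketched without any SDK. The example below is an illustration under assumptions: `SpanRecord` and `build_span_tree` are hypothetical names, the root span name `anthropic.messages` is made up, and a real gateway would create OTel spans instead of plain records. It shows one LLM root per response and one TOOL child per `tool_use` block, all sharing a trace ID.

```python
import json
import secrets
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SpanRecord:
    """Plain stand-in for an OTel span (hypothetical, for illustration)."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict[str, Any] = field(default_factory=dict)


def build_span_tree(response: dict[str, Any], session_id: str) -> list[SpanRecord]:
    trace_id = secrets.token_hex(16)  # 128-bit trace ID, hex encoded
    usage = response.get("usage", {})
    prompt = usage.get("input_tokens", 0)
    completion = usage.get("output_tokens", 0)
    root = SpanRecord(
        name="anthropic.messages",
        trace_id=trace_id,
        span_id=secrets.token_hex(8),  # 64-bit span ID
        parent_id=None,
        attributes={
            "openinference.span.kind": "LLM",
            "llm.model_name": response.get("model", ""),
            "llm.token_count.prompt": prompt,
            "llm.token_count.completion": completion,
            "llm.token_count.total": prompt + completion,
            "session.id": session_id,
        },
    )
    spans = [root]
    # Each tool_use block in the response becomes a TOOL child of the LLM span.
    for block in response.get("content", []):
        if block.get("type") != "tool_use":
            continue
        spans.append(
            SpanRecord(
                name=block["name"],
                trace_id=trace_id,
                span_id=secrets.token_hex(8),
                parent_id=root.span_id,
                attributes={
                    "openinference.span.kind": "TOOL",
                    "tool.name": block["name"],
                    "tool.parameters": json.dumps(block.get("input", {})),
                    "session.id": session_id,
                },
            )
        )
    return spans
```

Note that the sketch writes `session.id` on every record explicitly; nothing in the tree structure copies it from parent to child.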
Three numbers to set expectations on overhead:
- Span construction at the gateway: typically 1 to 3 ms per call.
- OTLP export: asynchronous and batched. The default batch span processor sends every 5 seconds or 512 spans, whichever comes first. Because export happens off the request path, it adds effectively no latency to the Claude Code turn.
- Sampling 100% in dev / 10% in prod: a typical posture. A 30-developer team at 100% produces around 250K spans/day; at 10% production sampling, ~25K spans/day.
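The sampling arithmetic is easy to check. The sketch below assumes a per-developer rate derived from the 250K spans/day figure above (250,000 / 30) and shows a deterministic head-sampling decision based on the low 64 bits of the trace ID, similar in spirit to ratio-based samplers in OTel SDKs. The function names are illustrative, not an SDK API.

```python
TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of a trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace if its low 64 bits fall below ratio * 2^64.

    The decision depends only on the trace ID, so every span in a trace
    gets the same answer.
    """
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


def expected_spans_per_day(developers: int, spans_per_developer: float, ratio: float) -> float:
    """Expected exported span volume under a given sampling ratio."""
    return developers * spans_per_developer * ratio


# Assumption: per-developer rate back-derived from the article's 30-developer figure.
per_developer = 250_000 / 30
dev_volume = round(expected_spans_per_day(30, per_developer, 1.0))
prod_volume = round(expected_spans_per_day(30, per_developer, 0.10))
print(dev_volume, prod_volume)
```

Head sampling like this is cheap but blind to outcome; it cannot keep "the expensive turns" because the token count is unknown when the decision is made. That is the case for tail-based rules at the collector.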
4-step setup walkthrough
This is the minimal config to get Claude Code spans into any OpenInference-compatible backend. Examples use traceAI (Apache 2.0); the same shape works for Arize Phoenix’s openinference-instrumentation-anthropic if you prefer producer-side instrumentation.
Step 1: Prereqs
Versions pinned as of May 2026:
pip install traceai-anthropic==0.6.4 \
opentelemetry-api==1.30.0 \
opentelemetry-sdk==1.30.0 \
opentelemetry-exporter-otlp-proto-grpc==1.30.0
# environment
export ANTHROPIC_API_KEY="sk-ant-..." # your real key
export FI_API_KEY="..." # if using FAGI as backend
export FI_SECRET_KEY="..."
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.futureagi.com/v1/traces"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
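For Phoenix as backend, swap the endpoint to http://localhost:6006/v1/traces and add OTEL_EXPORTER_OTLP_HEADERS="api_key=...". Langfuse works the same way over OTLP: point at its ingress and supply the public/secret key pair as headers. Note that /v1/traces is the OTLP/HTTP path, so check that OTEL_EXPORTER_OTLP_PROTOCOL matches the endpoint you choose.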
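OTEL_EXPORTER_OTLP_HEADERS is a comma-separated list of key=value pairs. If a self-built proxy needs to read that variable itself, a minimal parser might look like the sketch below. `parse_otlp_headers` is a hypothetical helper, and lowercasing keys and URL-decoding values are assumptions made here for convenience; OTel SDK exporters already parse this variable on their own.

```python
from urllib.parse import unquote


def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict (hypothetical helper)."""
    headers: dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas and empty segments
        # Split on the first '=' only, so values such as base64 keep their padding.
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed header pair: {pair!r}")
        headers[key.strip().lower()] = unquote(value.strip())
    return headers
```

Failing loudly on a malformed pair is a deliberate choice in this sketch: a silently dropped auth header usually surfaces later as an opaque 401 from the backend.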
Step 2: Point Claude Code at the gateway
Set ANTHROPIC_BASE_URL in your shell profile so both the CLI and IDE plugin use the gateway:
# ~/.zshrc or ~/.bashrc
export ANTHROPIC_BASE_URL="https://gateway.your-company.internal/anthropic"
export ANTHROPIC_AUTH_TOKEN="${ANTHROPIC_API_KEY}" # gateway re-validates
Most gateways accept Anthropic-shape requests at /anthropic/v1/messages and forward transparently. Confirm SSE pass-through: buffering streaming responses breaks Claude Code’s progress UI mid-turn.
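For a self-built proxy, pass-through means handing each SSE line to the client as it arrives while reading token usage on the side. The sketch below is one way to do that, assuming the upstream body is available as an iterable of byte lines and that usage arrives in `message_start` and `message_delta` events. `passthrough_sse` is a hypothetical name, not a gateway API.

```python
import json
from typing import Iterable, Iterator


def _record_usage(line: bytes, usage: dict[str, int]) -> None:
    """Update the usage tally from one SSE line, ignoring anything unparseable."""
    if not line.startswith(b"data:"):
        return
    try:
        event = json.loads(line[len(b"data:"):].strip())
    except ValueError:
        return  # not JSON; forward it untouched and move on
    if not isinstance(event, dict):
        return
    if event.get("type") == "message_start":
        usage.update(event.get("message", {}).get("usage", {}))
    elif event.get("type") == "message_delta":
        # Later events overwrite output_tokens with the newer value.
        usage.update(event.get("usage", {}))


def passthrough_sse(lines: Iterable[bytes], usage: dict[str, int]) -> Iterator[bytes]:
    """Yield every line unchanged, one at a time, never holding the full response."""
    for line in lines:
        _record_usage(line, usage)
        yield line
```

The generator never accumulates lines, so the client sees each event as soon as the upstream produces it; the `usage` dict is complete once the stream ends and can then be written onto the LLM span.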
Step 3: Wire the OpenInference instrumentation at the gateway
In FAGI’s Agent Command Center the instrumentation is built in: traceAI emits OpenInference spans on every Anthropic call without user code. For a self-hosted proxy (or LiteLLM/Bifrost deployment), the instrumentation lives in the proxy process:
import os

from traceai_anthropic import AnthropicInstrumentor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Resource attributes identify the gateway process on every exported span
resource = Resource.create({
    "service.name": "claude-code-gateway",
    "service.version": "2026.5.17",
    "deployment.environment": "prod",
})
provider = TracerProvider(resource=resource)
# The batch processor exports asynchronously, off the request path
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://api.futureagi.com/v1/traces",
            headers={
                "x-fi-api-key": os.environ["FI_API_KEY"],
                "x-fi-secret-key": os.environ["FI_SECRET_KEY"],
            },
        )
    )
)
trace.set_tracer_provider(provider)
# Attach OpenInference instrumentation to every Anthropic client in the proxy
AnthropicInstrumentor().instrument(tracer_provider=provider)
The AnthropicInstrumentor wraps the Anthropic SDK so every messages.create produces a properly named OpenInference span: openinference.span.kind = "LLM", input on llm.input_messages, output on llm.output_messages, tokens on llm.token_count.*. Tool calls produce TOOL child spans automatically.
Step 4: Add session + developer + repo attributes
The default span tree is correct but anonymous. To make chargeback and per-repo slicing work, set three custom attributes on every root span: session.id (the Claude Code conversation), user.id (the developer’s SSO email), repo.url (the git remote).
from opentelemetry import trace
from contextlib import contextmanager

tracer = trace.get_tracer(__name__)

@contextmanager
def claude_code_span(session_id: str, user_id: str, repo_url: str):
    with tracer.start_as_current_span("claude_code.turn") as span:
        span.set_attribute("session.id", session_id)
        span.set_attribute("user.id", user_id)
        span.set_attribute("repo.url", repo_url)
        span.set_attribute("openinference.span.kind", "AGENT")
        yield span
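The Claude Code wrapper passes X-Session-Id, X-User-Id, X-Repo-Url headers (FAGI derives them from the API key claim). Spans created inside this block become children of the root through OTel context propagation and share its trace ID, so once the root is tagged the whole tree can be sliced by session.id. Span attributes themselves are not copied onto children automatically; if your backend needs session.id on every span, propagate it explicitly (for example via OTel baggage plus a span processor).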
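On the gateway side, turning those three headers into span attributes is a small mapping step. The sketch below assumes the header names given above and treats a missing header as an error; `turn_attributes` and `HEADER_TO_ATTRIBUTE` are hypothetical names introduced for illustration.

```python
from typing import Mapping

# Header names from the wrapper, mapped to the span attribute each one feeds.
HEADER_TO_ATTRIBUTE = {
    "x-session-id": "session.id",
    "x-user-id": "user.id",
    "x-repo-url": "repo.url",
}


def turn_attributes(headers: Mapping[str, str]) -> dict[str, str]:
    """Map request headers to root-span attributes (hypothetical helper).

    HTTP header names are case-insensitive, so keys are lowercased first.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    attrs: dict[str, str] = {}
    missing: list[str] = []
    for header, attribute in HEADER_TO_ATTRIBUTE.items():
        value = lowered.get(header, "").strip()
        if value:
            attrs[attribute] = value
        else:
            missing.append(header)
    if missing:
        # An anonymous trace cannot be charged back, so fail before the span starts.
        raise ValueError(f"missing attribution headers: {', '.join(missing)}")
    return attrs
```

The three values can then be passed as the session_id, user_id, and repo_url arguments of claude_code_span. Rejecting the request is only one option; a gateway could equally fall back to a placeholder value and alert on it.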
Verification: what success looks like
Run Claude Code with the gateway in front; within ~10 seconds the backend should show:
- One root span per turn (claude_code.turn) with the three custom attributes set.
- Nested openinference.span.kind = "LLM" spans for each messages.create, with token counts and model name.
- Nested openinference.span.kind = "TOOL" spans for every bash, file read, or grep Claude Code made.
- A session.id filter that surfaces every turn in the same conversation across developers and repositories.
Flat spans (no nesting) mean OTel context didn’t propagate across the gateway’s async handlers; fix that before scoring backends. Empty llm.input_messages means a sampler or redaction policy is stripping content; check the SDK config.
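These checks can be automated against a trace pulled from the backend. The sketch below assumes each span has been exported as a dict with `span_id`, `parent_id`, `name`, and `attributes` keys; that record shape and the `check_trace` helper are assumptions for illustration, not any backend's export format.

```python
from typing import Any

REQUIRED_ROOT_ATTRIBUTES = ("session.id", "user.id", "repo.url")


def check_trace(spans: list[dict[str, Any]]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the trace looks right."""
    problems: list[str] = []
    span_ids = {s["span_id"] for s in spans}
    roots = [s for s in spans if s.get("parent_id") is None]
    turn_roots = [s for s in roots if s["name"] == "claude_code.turn"]

    # More than one parentless span in a single trace is the flat-span symptom.
    if len(roots) > 1:
        problems.append(f"flat spans: {len(roots)} spans have no parent")
    if not turn_roots:
        problems.append("no claude_code.turn root span")
    for root in turn_roots:
        for key in REQUIRED_ROOT_ATTRIBUTES:
            if key not in root["attributes"]:
                problems.append(f"root span missing {key}")

    for span in spans:
        parent = span.get("parent_id")
        if parent is not None and parent not in span_ids:
            problems.append(f"span {span['span_id']} has unknown parent {parent}")
        attrs = span["attributes"]
        if attrs.get("openinference.span.kind") == "LLM":
            if not any(k.startswith("llm.input_messages") for k in attrs):
                problems.append(f"LLM span {span['span_id']} has empty llm.input_messages")
    return problems
```

Running this in CI against a canned Claude Code session is one way to catch a context-propagation regression before it reaches a dashboard.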
5 observability backends scored
The setup above is producer-agnostic. These five backends all consume OpenInference; they vary in what they do once spans land.
The 7 axes:
| Axis | What it measures |
|---|---|
| 1. OpenInference compliance | Does the backend understand the OpenInference semantic conventions natively, or via a custom adapter? |
| 2. OTel-native export | Does it ingest OTLP/gRPC + OTLP/HTTP without proprietary collectors? |
| 3. Span attribute richness | Does the UI surface tool-call args, retrieval hits, eval scores per span? |
| 4. Sampling controls | Tail-based sampling, head-based, attribute-driven — what’s supported? |
| 5. Trace correlation across sessions | Can you filter every span for a session.id across the entire team? |
| 6. Eval hook surface | Can scoring run on captured spans (latency, faithfulness, code-correctness)? |
| 7. Retention | Default raw-span retention and aggregate retention. |
Five picks, scored on all seven, with where each falls short.
1. Future AGI Agent Command Center: Best for trace-to-optimizer loop
Verdict: Future AGI is the only backend here that takes the OpenInference spans it ingests and uses them to improve the system that produced them. The other four are observation layers; Agent Command Center is an observation layer wired to an evaluator and an optimizer.
What it does for Claude Code observability:
- OpenInference compliance is native. traceAI (Apache 2.0) is the open producer; Agent Command Center is the open consumer. No adapter layer.
- OTel-native export over OTLP/gRPC and OTLP/HTTP at api.futureagi.com/v1/traces. Self-host endpoint in the BYOC deployment.
- Span attribute richness. UI groups by openinference.span.kind, shows tool-call args + responses inline, surfaces eval scores per span.
- Sampling controls include attribute-driven (sample 100% of user.id in pilot_users, 10% otherwise) and tail-based (sample 100% of spans where llm.token_count.total > 100000).
- Trace correlation is the strongest feature, session.id is a first-class index, with one-click pivot from a Claude Code turn back to every other turn in the same session.
- Eval hook surface is fi.evals running on every captured span: faithfulness, code-correctness, tool-use accuracy. Low-scoring sessions cluster into a failure dataset.
- Retention is 30 days raw spans on Free, 90 days on Scale, custom on Enterprise.
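The two sampling rules in that list can be expressed as a single keep-or-drop decision made after a trace completes. The sketch below illustrates the logic only; it is not Future AGI's configuration syntax. `keep_trace` is a hypothetical function, and hashing session.id with CRC32 for the fallback bucket is a choice made here to keep the decision deterministic.

```python
import zlib
from typing import Any


def keep_trace(
    span_attrs: list[dict[str, Any]],
    pilot_users: set[str],
    default_ratio: float = 0.10,
) -> bool:
    """Decide whether to keep a finished trace, given the attributes of its spans."""
    # Attribute-driven rule: always keep traces from pilot users.
    if any(a.get("user.id") in pilot_users for a in span_attrs):
        return True
    # Tail-based rule: always keep traces containing a very large LLM call.
    if any(a.get("llm.token_count.total", 0) > 100_000 for a in span_attrs):
        return True
    # Fallback: hash session.id into [0, 1) so a session is kept or dropped as a unit.
    session = next((a["session.id"] for a in span_attrs if "session.id" in a), "")
    bucket = zlib.crc32(session.encode("utf-8")) / 2**32
    return bucket < default_ratio
```

The token-count rule is what makes this tail-based: it can only run once the LLM span has finished and its usage is known.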
The loop. Every trace gets scored. traceAI instruments 35+ frameworks OpenInference-natively, and Error Feed (FAGI’s “Sentry for AI agents”) sits alongside as the zero-config error monitor. It auto-clusters low-scoring sessions by failure mode into named issues (50 traces → 1 issue, e.g., “Claude Opus called when Sonnet would have been enough”), auto-writes the root cause from span evidence plus a quick fix plus a long-term recommendation, and tracks a rising/steady/falling trend per issue so regressions surface like exceptions rather than staying buried in flame graphs. fi.opt.optimizers (ProTeGi, Bayesian, GEPA) rewrites prompts or routing policy against the clustered failures. The gateway applies the updated route on the next request. The Future AGI Protect model family sits on the same data path at ~67 ms p50 text and ~109 ms p50 image (arXiv 2510.13351). Protect consists of FAGI’s own fine-tuned Gemma 3n adapters across content moderation, bias detection, security/prompt-injection, and data privacy/PII, multi-modal across text/image/audio; it is a model family rather than a plugin chain.
Where it falls short:
- agent-opt is opt-in. For a one-week pilot focused on looking at spans, start with traceAI + ai-evaluation. Phoenix is also a lighter alternative for that single-feature brief.
- The flame-graph view is opinionated: fewer knobs than Phoenix’s, faster to read. Phoenix power users will want a week to acclimatise.
Pricing: Free tier with 100K spans/month. Scale tier from $99/month. Enterprise custom with SOC 2 Type II + BAA. AWS Marketplace listing for procurement.
Score: 7/7 axes.
2. Arize Phoenix: Best for OSS-first OpenInference reference
Verdict: Phoenix is the reference implementation of OpenInference. Arize maintains the spec; Phoenix is the matching open-source consumer. If your team wants a 100% OSS observability stack, Phoenix is the right starting point. Falls short on the optimizer side.
What it does for Claude Code observability:
- OpenInference compliance is, by definition, native. The UI was designed around OpenInference attribute names.
- OTel-native export over OTLP/gRPC and OTLP/HTTP. Runs self-hosted via Python package or Docker; Arize-hosted version is also available.
- Span attribute richness is strong. Tool-use, retrieval hits, model parameters, and evaluations side by side. Flame-graph view is best-in-class.
- Sampling controls are mostly client-side; tail-based requires an upstream OTel collector.
- Trace correlation by session.id works but the UI’s session-pivot is less polished, you filter, you don’t click-through pivot.
- Eval hook surface is phoenix.evals: hallucination, relevance, toxicity, custom LLM-as-judge. No optimizer downstream.
- Retention depends on your Postgres / SQLite storage on self-host.
Where it falls short:
- No optimizer. Phoenix observes and evaluates; it doesn’t feed back into routing.
- Hosted offering is younger; scale-out beyond a few hundred RPS on self-host needs Postgres tuning.
- Per-developer chargeback requires custom SQL on the spans table.
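As a rough idea of what that custom SQL involves, the sketch below aggregates tokens per developer over an in-memory SQLite table. The `spans` table and its columns are a hypothetical, simplified schema invented for this example; it is not Phoenix's actual storage schema, where attributes would need to be extracted from the stored span payload first.

```python
import sqlite3

# Hypothetical, simplified spans table for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spans ("
    " span_kind TEXT, user_id TEXT,"
    " prompt_tokens INTEGER, completion_tokens INTEGER)"
)
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?, ?)",
    [
        ("LLM", "ana@example.com", 1200, 300),
        ("LLM", "ben@example.com", 400, 100),
        ("TOOL", "ana@example.com", 0, 0),
        ("LLM", "ana@example.com", 800, 200),
    ],
)


def tokens_by_developer(db: sqlite3.Connection) -> list[tuple[str, int]]:
    """Total LLM tokens per developer, largest consumer first."""
    return db.execute(
        "SELECT user_id, SUM(prompt_tokens + completion_tokens) AS total"
        " FROM spans"
        " WHERE span_kind = 'LLM'"
        " GROUP BY user_id"
        " ORDER BY total DESC"
    ).fetchall()


chargeback = tokens_by_developer(conn)
print(chargeback)
```

Filtering on the LLM span kind matters: TOOL and AGENT spans carry no token counts of their own, and summing a parent together with its children would double-count if totals are ever rolled up onto the root.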
Pricing: OSS Phoenix is free under the Elastic License 2.0. Arize-hosted starts free; Pro from ~$50/seat/month; Enterprise custom with SOC 2 Type II.
Score: 5.5/7 axes (missing: optimizer, mature hosted scale-out).
3. Langfuse: Best for hosted OTel consumer with mature ops
Verdict: Langfuse pivoted onto OpenTelemetry in late 2025 and added OpenInference attribute mapping in early 2026. The dashboard, prompt management, eval library, and RBAC are all polished. Falls short on optimizer feedback and on depth of OpenInference-specific span typing.
What it does for Claude Code observability:
- OpenInference compliance is via attribute mapping, not native. Fidelity is good enough for typical Claude Code workloads but not exhaustive.
- OTel-native export at Langfuse’s OTLP ingress (cloud or self-host). HTTP and gRPC both supported as of v2.95+.
- Span attribute richness is strong on the standard set (input, output, tokens, model, latency). Retrieval-span typing is shallower than Phoenix or FAGI.
- Sampling controls are head-based at the producer. Tail-based not natively supported.
- Trace correlation by session_id is a first-class concept.
- Eval hook surface is langfuse.evals plus LLM-as-judge templates. No optimizer downstream.
- Retention is 90 days on Hobby, 365 on Pro, custom on Enterprise; self-host is unbounded.
Where it falls short:
- No optimizer.
- OpenInference-to-Langfuse mapping occasionally drops less-common attributes (e.g., retrieval.documents[].content). Rarely matters for pure Claude Code; does for RAG-heavy workloads.
- Hosted product is EU-first; US region exists but feature parity sometimes lags a release.
Pricing: Hobby free with 50K observations/month. Core $59/month. Pro $199/month. Self-host (MIT) free.
Score: 5.5/7 axes (missing: optimizer, full OpenInference fidelity).
4. Helicone: Best for low-friction per-request observability
Verdict: Helicone is the easiest drop-in proxy here. As of v3 it emits OpenInference-formatted spans via OTLP. Shallower than the four above, but for a 10-developer team that just wants per-request cost + a span search UI for Claude Code, the time-to-value is minutes.
What it does for Claude Code observability:
- OpenInference compliance added in v3 (March 2026). OTLP exporter emits OpenInference span kinds correctly.
- OTel-native export through Helicone’s OTLP relay endpoint; fan-out to your own backend via OTel collector.
- Span attribute richness is decent on the basics; tool-call argument surfacing is workable but less deep than Phoenix.
- Sampling controls are head-based via custom properties; tail-based not supported.
- Trace correlation by Helicone-Session-Id header. Claude Code wrapper must set it.
- Eval hook surface is Helicone’s “Score” feature plus user feedback hooks. Less expressive than Phoenix or Langfuse.
- Retention is 30 days on Free, 90 on Pro, custom on Enterprise.
Where it falls short:
- No optimizer.
- OpenInference v3 support is recent; nested AGENT spans need manual proxy instrumentation. Plan a half-day to wire.
- Routing intelligence is basic, round-robin and failover. Model-tier routing has to be coded upstream.
Pricing: Free with 10K requests/month. Pro from $25/month. Enterprise custom.
Score: 4.5/7 axes (missing: optimizer, deep span typing, tail-based sampling).
5. Maxim Bifrost: Best for OTel-collector-first deployments
Verdict: Bifrost is a Go-based gateway that ships OpenInference-formatted spans through a standard OTel collector. The pitch is “we’re the gateway; you bring your own backend.” If your platform team already runs an OTel collector + Tempo/Jaeger/ClickHouse, Bifrost slots in without a new SaaS dependency. Trade-off: the LLM-specific UI is leaner than dedicated LLM observability products.
What it does for Claude Code observability:
- OpenInference compliance in the OTLP output, span kinds and attribute names on every request.
- OTel-native export is the entire point. OTLP/gRPC + HTTP, head-based + tail-based sampling at the collector tier.
- Span attribute richness depends on your downstream. Bifrost emits, your backend (Tempo, ClickHouse, Honeycomb) renders.
- Sampling controls are the strongest in this list, you push sampling into the OTel collector where existing patterns work.
- Trace correlation by session.id works as a span attribute; what you do with it is a property of your backend.
- Eval hook surface is Maxim’s eval product, a separate purchase. Gateway emits clean spans.
- Retention is whatever your downstream backend provides.
Where it falls short:
- No optimizer.
- The LLM observability UI is owned by your backend. Bifrost ships spans, not a polished LLM-specific dashboard. Plan a week to wire the UX layer.
- Maxim’s hosted observability + eval are separate purchases; expect mix-and-match for the full story.
Pricing: Bifrost is OSS under Apache 2.0. Maxim’s hosted observability + eval starts free; Pro from $99/month; Enterprise custom.
Score: 5/7 axes (missing: optimizer, native LLM-grade dashboard, polished eval-on-trace UX).
Capability matrix
| Axis | FAGI Agent Command Center | Arize Phoenix | Langfuse | Helicone | Maxim Bifrost |
|---|---|---|---|---|---|
| OpenInference compliance | Native | Native (reference) | Mapped | v3 native | Native |
| OTel-native export | OTLP/gRPC + HTTP | OTLP/gRPC + HTTP | OTLP/gRPC + HTTP | OTLP relay | OTel collector first |
| Span attribute richness | High | High | Medium-high | Medium | Backend-dependent |
| Sampling controls | Head + tail + attribute | Head (client) | Head | Head | Tail (collector) |
| Trace correlation by session.id | First-class pivot | Filter | First-class | Header-driven | Backend-dependent |
| Eval hook surface | fi.evals + optimizer | phoenix.evals | langfuse.evals | Scores + feedback | Maxim eval (separate) |
| Retention (raw spans) | 30/90/custom | Self-hosted = unbounded | 90/365/custom | 30/90/custom | Backend-dependent |
| Feedback loop / optimizer | fi.opt | None | None | None | None |
Decision framework: Choose X if
Choose Future AGI Agent Command Center if the OpenInference traces should also drive prompt and route optimization. Pick this when Claude Code is becoming a significant line item ($10K+/month) and you want the eval-and-optimize loop end-to-end without three separate vendors.
Choose Arize Phoenix if your team wants the canonical OSS reference for OpenInference and is comfortable self-hosting a Postgres-backed service. Pick when open-source-first and reference correctness matter more than hosted polish.
Choose Langfuse if you want a hosted OTel consumer with the most polished LLM-ops dashboard, mature RBAC, and prompt-management alongside traces. Pick when procurement matters and you accept mapped (not native) OpenInference fidelity.
Choose Helicone for small teams that want per-request cost visibility today and don’t need an optimizer or deep sampling. Best fit for under 10 developers on Claude Code.
Choose Maxim Bifrost if your platform team already runs an OTel collector + Tempo/ClickHouse/Honeycomb and wants LLM spans in the same stack. Pick when “no new SaaS” is the governing constraint.
Common mistakes when wiring Claude Code observability
| Mistake | What goes wrong | Fix |
|---|---|---|
| Buffering streaming responses at the gateway | Claude Code’s progress UI freezes mid-turn | Confirm SSE pass-through on the gateway — every pick above supports it but some self-built proxies do not |
| Tagging only user_id, not session.id | Session-level cost attribution is impossible; one ballooned context turn cannot be traced back to its conversation | Tag both; session.id is the OpenInference standard for the conversation root |
| Sampling at the application before the proxy | Spans go missing for the workloads you care about most | Sample at the OTel collector with tail-based rules; let the gateway always emit |
| Forgetting context propagation across async handlers in the gateway | Spans appear flat (no parent-child nesting) | Use OTel’s context propagation in async middleware; traceAI does this automatically, hand-rolled proxies often miss it |
| Mixing OpenInference and the vendor’s proprietary attribute names | Half your data uses llm.model_name, half uses model; queries break | Pick one (OpenInference) and stick to it; map the vendor’s names at the producer, not the consumer |
| Setting OTLP sync export in dev | Per-turn latency in Claude Code goes up 50-200ms | Always use the batch span processor; sync export is for local debugging only |
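The context-propagation row is the one that most often bites hand-rolled proxies. The sketch below uses only the standard library to show the underlying mechanism: a `ContextVar` stands in for the "current span", asyncio tasks and `asyncio.to_thread` carry it along, and work handed to a plain executor needs an explicit context copy. The variable and function names are illustrative assumptions.

```python
import asyncio
import contextvars

# Stand-in for the "current span" that a tracing library keeps in context.
current_span = contextvars.ContextVar("current_span", default="none")


def read_span() -> str:
    return current_span.get()


async def child_handler() -> str:
    # A task copies the caller's context when it is created.
    return read_span()


async def handle_turn(turn: str) -> dict[str, str]:
    current_span.set(f"claude_code.turn/{turn}")
    in_task = await asyncio.create_task(child_handler())
    # asyncio.to_thread copies the current context into the worker thread.
    in_thread = await asyncio.to_thread(read_span)
    # For a plain executor, copy the context and run the function inside it.
    loop = asyncio.get_running_loop()
    ctx = contextvars.copy_context()
    in_executor = await loop.run_in_executor(None, ctx.run, read_span)
    return {"task": in_task, "thread": in_thread, "executor": in_executor}


async def main() -> list[dict[str, str]]:
    # gather wraps each coroutine in its own task, so the two turns do not share state.
    return await asyncio.gather(handle_turn("a"), handle_turn("b"))


results = asyncio.run(main())
print(results)
```

If a child span shows up without a parent, the usual culprit is a hop of the third kind (a thread pool, a background queue, a callback) where the context copy was skipped.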
How Future AGI closes the loop on Claude Code observability
The other four backends treat OpenInference traces as an end state: capture, show, alert. Future AGI treats them as the input to a feedback loop:
- Trace. Every Claude Code turn produces an OpenInference span tree via traceAI (Apache 2.0). Captures inputs, outputs, tool calls, model, and the session.id + user.id + repo.url attributes.
- Evaluate. fi.evals scores every turn against task-completion, faithfulness, code-correctness, and tool-use accuracy. Scores live on the same span as the cost data.
- Cluster. Low-scoring sessions cluster by failure mode in Agent Command Center, e.g., “Claude Opus called when Sonnet would have been enough.”
- Optimize. fi.opt.optimizers (ProTeGi, Bayesian, GEPA) rewrites the system prompt or routing policy against the clustered failures, feeding off the same OpenInference spans.
- Route. The gateway applies the updated policy on the next request. Live Protect guardrails at ~67ms (arXiv 2510.13351) sit on the same data path.
- Re-deploy. Versioned prompt + route; automatic rollback if the score regresses.
The three Apache 2.0 building blocks are traceAI, ai-evaluation, and agent-opt. The hosted Agent Command Center adds a failure-cluster view, live Protect, RBAC, SOC 2 Type II certification, and an AWS Marketplace listing.
What we did not include
We deliberately left out three backends that show up in other 2026 listicles:
- Datadog LLM Observability. OpenInference-aware via mapping, but LLM-specific eval surface is narrower and pricing escalates aggressively on span volume. Worth re-evaluating if your team already runs Datadog APM.
- Honeycomb. Outstanding OTel backend and a natural pair with Maxim Bifrost, but LLM-specific dashboarding is less prescriptive than Phoenix or FAGI.
- LangSmith. Strong product, but OpenInference compliance is via adapter as of May 2026; remains primarily a LangChain-shaped consumer.
Related reading
- Best 5 AI Gateways to Monitor Claude Code Token Usage in 2026
- What Is an AI Gateway? The 2026 Definition
- Best LLM Cost Tracking Tools in 2026
- Best AI Gateways for Agentic AI in 2026
Sources
- OpenInference specification, github.com/Arize-ai/openinference
- OpenTelemetry specification, opentelemetry.io/docs/specs
- traceAI, github.com/future-agi/traceAI (Apache 2.0)
- ai-evaluation, github.com/future-agi/ai-evaluation (Apache 2.0)
- agent-opt, github.com/future-agi/agent-opt (Apache 2.0)
- Arize Phoenix, github.com/Arize-ai/phoenix
- Langfuse OTel support, langfuse.com/docs/opentelemetry
- Helicone OpenInference v3 release notes, helicone.ai/changelog
- Maxim Bifrost gateway, github.com/maximhq/bifrost
- Future AGI Protect latency benchmarks, arxiv.org/abs/2510.13351 (67ms text, 109ms image)
- Anthropic Claude Code documentation, claude.ai/docs/claude-code
Frequently asked questions
Do I have to use traceAI or can I use Arize's openinference-instrumentation-anthropic?
How much latency does the OTel exporter add to a Claude Code turn?
Can I sample at the OTel collector for cost reasons?
Is OpenInference stable enough to commit to in 2026?
How does this compare to Anthropic's first-party observability when it ships?