What Is Trace Comparison?

Trace comparison diffs two LLM or agent traces to identify changed spans, latency, cost, errors, and quality outcomes.

Trace comparison is an LLM observability technique for diffing two production traces to find what changed in an LLM or agent run. It compares span order, model calls, retrievals, tool calls, token usage, latency, errors, and evaluator scores across a baseline and candidate trace. FutureAGI uses traceAI integrations such as traceAI-langchain to capture the span attributes needed for comparison, so engineers can explain why a prompt, model, route, or workflow release became slower, costlier, or less reliable.
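
A minimal setup sketch for that instrumentation step, assuming the traceAI quickstart pattern; register and LangChainInstrumentor follow that pattern, and argument names may differ across versions:

# Setup sketch: module and argument names follow the traceAI quickstart
# pattern and may differ by version; API keys are assumed to be supplied
# via environment variables.
from fi_instrumentation import register
from traceai_langchain import LangChainInstrumentor

# Register a tracer provider that exports spans to FutureAGI.
trace_provider = register(project_name="refund-policy-agent")

# Auto-instrument LangChain: every chain, LLM, retriever, and tool call
# becomes a span carrying fi.span.kind and gen_ai.* attributes.
LangChainInstrumentor().instrument(tracer_provider=trace_provider)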

Why Trace Comparison Matters in Production LLM and Agent Systems

Bad releases often look like “quality dropped” until the trace is compared step by step. A model swap may preserve the final answer format while adding two extra tool calls. A prompt edit may leave the LLM span normal but change the retrieval query enough to pull stale context. A routing change may improve median latency and still make one high-value workflow hit a costly fallback path.

The pain is shared. Developers need to know which commit changed behavior. SREs need to explain p99 and token-cost spikes without guessing. Product teams need to know why a cohort started abandoning a workflow. Compliance teams need evidence that the same policy, guardrail, and data path ran before and after a release.

Useful symptoms show up in traces and dashboards: new span kinds, missing parent-child links, different gen_ai.request.model values, output-token growth, repeated agent.trajectory.step spans, higher tool-timeout rate, or eval-fail-rate-by-cohort moving after deployment. Logs rarely show this cleanly because logs flatten causality.
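
Several of these symptoms can be computed from exported spans in a few lines of Python. A sketch, assuming spans arrive as dicts with an attributes map; that shape is illustrative, not a fixed traceAI export format:

from collections import Counter

def span_kinds(spans):
    # Count spans per fi.span.kind; spans are assumed dicts with an
    # "attributes" map.
    return Counter(s["attributes"].get("fi.span.kind") for s in spans)

def kind_drift(baseline, candidate):
    # Per-kind count changes in either direction across the two traces.
    b, c = span_kinds(baseline), span_kinds(candidate)
    return {k: c[k] - b[k] for k in b.keys() | c.keys() if c[k] != b[k]}

def model_drift(baseline, candidate):
    # Models that appear in one trace but not the other.
    def models(spans):
        return {s["attributes"].get("gen_ai.request.model") for s in spans} - {None}
    return models(baseline) ^ models(candidate)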

Trace comparison matters more in 2026-era agent systems because one user turn can include planning, retrieval, tool use, sub-agent handoff, guardrails, fallback, and final answer synthesis. A single before/after answer diff is too shallow. The trace diff shows where the behavior changed.

How FutureAGI Uses traceAI for Trace Comparison

FutureAGI’s approach is to make the trace the unit of regression analysis. In a LangChain support agent instrumented with traceAI-langchain, a baseline trace and a candidate trace share the same task input but may differ by prompt version, model, retrieval index, or route. The comparison aligns spans by fi.span.kind, parent-child structure, and agent.trajectory.step, then surfaces differences in gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.server.time_to_first_token, errors, and attached eval scores.
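
A minimal alignment sketch under those rules; the span dict shape and helper names are assumptions, and the product's internal alignment may be more sophisticated:

def align_key(span, by_id):
    # Pair spans by kind, trajectory step, and the chain of ancestor kinds,
    # not by wall-clock order, so parallel branches still line up.
    ancestors = []
    parent = by_id.get(span.get("parent_id"))
    while parent:
        ancestors.append(parent["attributes"].get("fi.span.kind"))
        parent = by_id.get(parent.get("parent_id"))
    return (span["attributes"].get("fi.span.kind"),
            span["attributes"].get("agent.trajectory.step"),
            tuple(ancestors))

def diff_traces(baseline, candidate, fields):
    # baseline/candidate: lists of span dicts with id, parent_id, attributes.
    # Spans that collide on the same key collapse here; a fuller diff would
    # disambiguate repeated steps.
    def index(spans):
        by_id = {s["id"]: s for s in spans}
        return {align_key(s, by_id): s for s in spans}
    b, c = index(baseline), index(candidate)
    for key in b.keys() | c.keys():
        if key not in b or key not in c:
            yield key, "added" if key not in b else "removed"
            continue
        for field in fields:
            old, new = b[key]["attributes"].get(field), c[key]["attributes"].get(field)
            if old != new:
                yield key, (field, old, new)

Calling diff_traces with fields such as gen_ai.request.model and gen_ai.usage.output_tokens yields exactly the attribute-level deltas described above.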

A real workflow: an engineer tests a refund-policy agent before shipping a new prompt. The old trace has one retrieval span, one policy-tool span, and a final answer with TaskCompletion above threshold. The new trace adds a second retrieval, calls the policy tool with a narrower date range, and returns a shorter answer. TrajectoryScore drops because step efficiency and tool selection changed, while raw latency still looks acceptable. The engineer keeps the prompt out of production, adds the failing trace to a regression dataset, and sets an alert for the same route when TrajectoryScore falls more than five points against the baseline.
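
The alert condition itself reduces to a small per-route check. A sketch, where the 0.05 threshold reads "five points" on the 0-1 scale (an assumption) and the per-route score lists are hypothetical:

# "Five points" is read here as 0.05 on TrajectoryScore's 0-1 scale.
ALERT_DROP = 0.05

def should_alert(route, baseline_scores, candidate_scores):
    # Compare per-route means so a regressed workflow is not averaged away.
    base = sum(baseline_scores[route]) / len(baseline_scores[route])
    cand = sum(candidate_scores[route]) / len(candidate_scores[route])
    return base - cand > ALERT_DROP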

Unlike a generic Jaeger trace, which can show timing but not LLM quality, FutureAGI keeps evaluator results next to the span tree. Unlike a plain LangSmith run diff, the traceAI view can be filtered by OpenTelemetry attributes across production cohorts, not only by a manually selected pair of runs.

How to Measure or Detect Trace Differences

Measure trace comparison as a structured diff between a baseline trace set and a candidate trace set (a delta sketch follows the list):

  • Span-shape drift: changed count, order, or nesting of fi.span.kind values across comparable requests.
  • Model and route drift: changed gen_ai.request.model, provider route, fallback count, or guardrail path.
  • Token and latency deltas: movement in input tokens, output tokens, p99 trace duration, and gen_ai.server.time_to_first_token.
  • Agent-step changes: added, removed, or repeated agent.trajectory.step spans, especially after prompt or tool updates.
  • Evaluator movement: compare TrajectoryScore, a 0-1 trajectory score with component breakdown, beside TaskCompletion for outcome health.
  • User proxy: thumbs-down rate, escalation rate, and abandoned workflows joined back to trace ids.
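
A sketch of the token and latency deltas from that list; the per-trace summary fields are assumed rather than taken from a fixed schema:

import statistics

def p99(samples):
    # Coarse percentile for a sketch; use a proper estimator in production.
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

def trace_deltas(baseline, candidate):
    # baseline/candidate: lists of per-trace summaries with output_tokens
    # and duration_ms pre-aggregated from span attributes.
    return {
        "output_tokens_mean": (statistics.mean(t["output_tokens"] for t in candidate)
                               - statistics.mean(t["output_tokens"] for t in baseline)),
        "p99_duration_ms": (p99([t["duration_ms"] for t in candidate])
                            - p99([t["duration_ms"] for t in baseline])),
    }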

Minimal evaluator check:

from fi.evals import TrajectoryScore

# Score both trajectories against the same task definition.
# TrajectoryScore returns a 0-1 trajectory score with a component breakdown.
metric = TrajectoryScore()
before = metric.evaluate(trajectory=baseline.trajectory,
                         task=task_definition)
after = metric.evaluate(trajectory=candidate.trajectory,
                        task=task_definition)

# A negative delta means the candidate trajectory regressed.
print(after.score - before.score)

The strongest signal is not one changed field. It is a trace-level pattern: a span changed, a cost or latency measure moved, and a quality evaluator or user proxy moved in the same cohort.
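
That pattern can be encoded as a simple conjunction per cohort; every key and threshold below is an assumed starting point layered on the earlier sketches, not a product default:

def regression_verdict(cohort):
    # Flag a cohort only when structure, cost or latency, and quality all
    # move together.
    shape_moved = bool(cohort["span_kind_drift"])
    cost_moved = (cohort["p99_duration_ms_delta"] > 500
                  or cohort["output_tokens_delta"] > 0.2 * cohort["baseline_output_tokens"])
    quality_moved = (cohort["trajectory_score_delta"] < -0.05
                     or cohort["thumbs_down_rate_delta"] > 0.02)
    return shape_moved and cost_moved and quality_moved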

Common Mistakes

  • Comparing traces by wall-clock order only. Agent branches can run in parallel; align by parent-child span relationships and fi.span.kind first.
  • Diffing prompt text while ignoring tool outputs. The prompt may be unchanged while a retriever or API response shifts the answer.
  • Averaging across cohorts. A model swap can improve simple chats and regress high-value workflows; compare by route, tenant, and intent.
  • Dropping failed or canceled traces. The missing trace is often the incident; keep timeouts, guardrail blocks, and fallback chains in scope.
  • Treating every span change as bad. Some diffs are expected after caching, routing, or prompt versioning; judge against SLOs and eval thresholds.

Frequently Asked Questions

What is trace comparison?

Trace comparison is the practice of diffing two LLM or agent traces to find changed spans, costs, latency, errors, tool calls, and evaluator outcomes. It explains why one run behaved differently from another.

How is trace comparison different from distributed tracing?

Distributed tracing records a request path across services. Trace comparison puts two request paths side by side, then explains what changed across spans, models, tools, tokens, and quality scores.

How do you measure trace comparison?

Use traceAI fields such as fi.span.kind, gen_ai.request.model, token attributes, and agent.trajectory.step, then compare trace-attached evaluators such as TrajectoryScore and TaskCompletion.