What Is Root Cause Analysis (AI / LLM Systems)?

Root cause analysis (RCA) in an AI system is the practice of tracing a user-visible failure — wrong answer, hallucinated claim, dropped tool call, runaway cost, broken handoff — back to the specific span, prompt, retrieval, or model decision that caused it. In a single LLM call, RCA is mostly prompt and retrieval inspection. In a multi-step agent trajectory, it requires step-level traces, per-step evaluator scores, and the ability to diff failed runs against successful ones. In a FutureAGI workflow, RCA starts at the failed trace, walks the trajectory of OpenTelemetry spans, and pinpoints the step where the evaluator score collapsed.

Why It Matters in Production LLM and Agent Systems

Without RCA, every failure becomes “the model is broken” and every fix becomes a guess. A team gets a Slack message — “the agent gave a wrong refund amount” — and starts debugging. They re-run the prompt. The model gives the right answer this time. They check a different prompt. It works. They give up and assume it was flaky. Three days later the same failure surfaces from a different user, and they still cannot explain it.

The pain is rooted in the fact that LLM failures rarely throw exceptions. The pipeline runs to completion; it just produces wrong output. A retriever returned six chunks and the wrong one was top-ranked. A planner picked the wrong tool because the input phrasing happened to match a different tool’s signature. A summariser dropped a sentence because the context window was full and the relevant content was at the bottom. None of these surface as errors — they surface as low evaluator scores days later.

In the multi-step agent stacks of 2026, RCA without trajectory data is effectively impossible. A five-step agent with one bad step still produces output; the user complains; and the team has to find the one broken step among the five. Per-step evaluators plus per-span traces turn that needle-in-a-haystack search into a query.

How FutureAGI Handles Root Cause Analysis

FutureAGI’s approach is to make the trajectory queryable. Every agent step lands as an OpenTelemetry span instrumented by traceAI-openai-agents, traceAI-langgraph, traceAI-langchain, or any of the 30+ traceAI integrations. Each span carries agent.trajectory.step, tool.name, llm.model, llm.prompt, llm.output, retrieval.chunks, and a parent-trace ID. Evaluator scores attach to spans as span_event records — so a low Groundedness score on step 3 of an 8-step trajectory is filterable.
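As a rough illustration, here is a minimal plain-OpenTelemetry sketch of how those attributes might land on a single tool-call span. In practice the traceAI integrations set them automatically; the span name, attribute values, and event payload below are invented for the example.

from opentelemetry import trace

tracer = trace.get_tracer("refund-agent")

# Hypothetical tool-call step; a traceAI integration would normally
# create and annotate this span for you.
with tracer.start_as_current_span("amount_lookup") as span:
    span.set_attribute("agent.trajectory.step", 5)
    span.set_attribute("tool.name", "lookup_refund_amount")
    span.set_attribute("llm.model", "gpt-4o")
    # Evaluator scores attach as span events; the payload shape here is illustrative.
    span.add_event("span_event", {"eval.name": "ToolSelectionAccuracy", "eval.score": 0.31})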

Concretely: a team gets a customer-reported wrong refund amount. They open the trace by user-session ID in FutureAGI. The trajectory shows eight spans; six are green, two are red. Span 3 (intent classification) and span 5 (amount lookup tool call) both have low ToolSelectionAccuracy. The platform surfaces a “diff” view comparing this trajectory to ten similar trajectories that produced correct outputs. The diff reveals that span 3’s prompt included a customer note (“urgent please”) that overlapped with an unrelated tool’s signature, and the agent picked that tool instead. Fix: refine the planner prompt’s tool descriptions. Validation: regression eval on the affected cohort drops eval-fail-rate-by-cohort from 8% to 1.2%.
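A rough sketch of that diff logic, assuming each span exposes a step number, a tool name, and a dict of evaluator scores; these field names are hypothetical stand-ins, not the platform API.

from collections import defaultdict
from statistics import mean

def diff_trajectories(failed, passing, evaluator="ToolSelectionAccuracy"):
    """Rank the failed trajectory's steps by how far their scores fall
    below the average of passing trajectories on the same task."""
    baseline = defaultdict(list)
    for trajectory in passing:
        for span in trajectory.spans:
            baseline[span.step].append(span.scores[evaluator])

    gaps = []
    for span in failed.spans:
        if baseline[span.step]:
            gap = mean(baseline[span.step]) - span.scores[evaluator]
            gaps.append((span.step, span.tool_name, round(gap, 2)))

    # The step with the largest score drop is the first RCA candidate.
    return sorted(gaps, key=lambda g: g[2], reverse=True)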

For ongoing RCA, the team builds a Dataset of recurring failure modes and runs a weekly regression eval to catch the same root cause if it returns. RCA becomes a feedback loop, not a one-off investigation.
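A sketch of that weekly loop; run_agent, the dataset fields, and the 1% alert threshold are hypothetical placeholders for a team's own setup, and the evaluate(...).score pattern mirrors the evaluator usage in the measurement section below.

def weekly_regression_eval(failure_modes, run_agent, evaluator, threshold=0.01):
    """Re-run every recorded failure mode and flag the cohort if the
    fail rate creeps back above the threshold."""
    failures = 0
    for example in failure_modes:
        output = run_agent(example.input)  # hypothetical agent entry point
        if evaluator.evaluate(output).score < example.min_score:
            failures += 1

    fail_rate = failures / len(failure_modes)
    if fail_rate > threshold:
        print(f"Regression detected: fail rate {fail_rate:.1%} > {threshold:.1%}")
    return fail_rate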

How to Measure or Detect It

RCA depends on signals that localise failures, not signals that aggregate them:

  • TaskCompletion: aggregate score per trajectory; the entry point for “did it work?”
  • Groundedness, ToolSelectionAccuracy, ReasoningQuality: per-step evaluators that tell you which step broke.
  • agent.trajectory.step (OTel attribute): canonical span attribute for filtering by step number.
  • Trace diff: compare failed trajectories against passing trajectories on the same task.
  • eval-fail-rate-by-cohort: dashboard signal — RCA starts when this spikes.
  • Replay: rerunning a failed trace with a deterministic seed reveals whether the bug is the input or the model.
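
A minimal sketch of scoring each step of a failed trace with the per-step evaluators above, assuming failed_trace has already been fetched from the platform and exposes its spans: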
from fi.evals import TaskCompletion, ToolSelectionAccuracy, Groundedness

# TaskCompletion scores the trajectory as a whole; the two per-step
# evaluators below localise which span broke.
task = TaskCompletion()
tool = ToolSelectionAccuracy()
ground = Groundedness()

# Iterate over the trajectory spans for a failed trace and print
# per-step scores; the step where the scores collapse is the RCA candidate.
for span in failed_trace.spans:
    print(span.step, tool.evaluate(span).score, ground.evaluate(span).score)

Common Mistakes

  • Re-running the prompt to debug. LLM outputs are non-deterministic; one passing run does not invalidate the failure.
  • Reading only the final response. The bug usually lives mid-trajectory; the final response is the symptom.
  • Skipping per-step evaluators. Without them, “agent fail rate up” is unactionable.
  • No regression eval after the fix. A fix that ships without a regression eval will silently regress within weeks.
  • Treating RCA as an SRE concern. RCA on LLM failures usually involves prompts and retrieval — engineers, not just SREs, own it.

Frequently Asked Questions

What is root cause analysis in AI systems?

Root cause analysis is the practice of tracing an AI failure — wrong answer, hallucinated claim, dropped tool call, runaway cost — back to the specific span, prompt, retrieval, or model decision that caused it.

How is RCA different in LLM systems vs traditional software?

Traditional RCA reads stack traces and exception messages. LLM RCA reads trajectory traces, prompt content, retrieved chunks, and per-step evaluator scores. The failure is rarely an exception — it is a low-quality output.

How do you do root cause analysis on an AI agent failure?

FutureAGI surfaces the failed trace, runs per-step evaluators (TaskCompletion, Groundedness, ToolSelectionAccuracy), and diffs the failed trace against successful runs of the same task to localise the broken step.