What Is Failure Analysis (ML)?
A structured debugging process that traces ML, LLM, or agent failures to evidence, causes, owners, and regression tests.
Failure analysis in ML is the structured process of explaining why a model, LLM workflow, or agent failed and proving the fix with evidence. It is a failure-mode practice used in eval pipelines and production traces when an output hallucinates, violates a schema, selects the wrong tool, leaks prompt data, or degrades for a cohort. In FutureAGI, teams anchor failure analysis on evaluators such as HallucinationScore, JSONValidation, or ToolSelectionAccuracy, then convert failures into alerts, guardrails, and regression evals.
Why Failure Analysis Matters in Production LLM and Agent Systems
Failure analysis starts where a normal metric ends: a release did not just score 0.71; it failed for a reason. Ignore that reason and the same defect returns under a new prompt, model, retriever, or route. A support assistant may cite a refund policy from stale context. A workflow agent may choose cancel_subscription instead of pause_subscription. A JSON extraction service may pass unit tests but break on one vendor invoice shape.
The pain is operational. Developers need a minimal failing example, not a dashboard average. SREs need to know whether p99 latency, token cost, retry count, or evaluator fail rate changed first. Compliance teams need evidence that a policy, safety rule, or privacy boundary was breached. End users see the visible symptom: wrong answers, loops, unsafe actions, or refusals that appear arbitrary.
For 2026 multi-step pipelines, failure analysis is harder because the defect can begin several spans before the final answer. A planner copies a false premise, a retriever returns irrelevant chunks, a tool call times out, and a fallback model writes a plausible summary. In traces, look for clustered evaluator failures, repeated tool attempts, rising agent.trajectory.step counts, divergent outputs for the same prompt version, and user feedback that mentions “wrong source,” “old policy,” or “it did not do what I asked.”
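A minimal sketch of that triage, assuming traces are exported as plain dicts carrying an agent.trajectory.step count, a tool retry counter, and evaluator results (the record shape here is an illustrative assumption, not a fixed FutureAGI export format):

from typing import Any

def flag_suspect_traces(traces: list[dict[str, Any]],
                        step_budget: int = 8,
                        retry_budget: int = 2) -> list[str]:
    """Return IDs of traces showing the failure signals named above."""
    suspects = []
    for t in traces:
        too_many_steps = t.get("agent.trajectory.step", 0) > step_budget
        retried_tools = t.get("tool_retry_count", 0) > retry_budget
        failed_evals = [e for e in t.get("evals", []) if not e.get("passed")]
        # Two or more evaluator failures in one trace counts as a cluster.
        if too_many_steps or retried_tools or len(failed_evals) >= 2:
            suspects.append(t["trace_id"])
    return suspects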
How FutureAGI Handles Failure Analysis
FutureAGI’s approach is to make failure analysis an eval-backed workflow, not an anecdotal postmortem. For each failure, the engineer chooses an anchor evaluator that matches the failure class: HallucinationScore for unsupported claims, Groundedness for context-bound answers, JSONValidation for structured output failures, ToolSelectionAccuracy for agent tool choice, and PromptInjection or ProtectFlash for adversarial input.
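As a sketch, that mapping can live in code as a small routing table; the failure-class labels and the lookup helper below are illustrative, and only the evaluator names come from the list above:

# Illustrative routing table from failure class to anchor evaluator.
# The class labels are assumptions for this sketch; the evaluator
# names are the ones discussed in the surrounding text.
ANCHOR_EVALUATORS = {
    "unsupported_claim": "HallucinationScore",
    "context_bound_answer": "Groundedness",
    "structured_output": "JSONValidation",
    "wrong_tool_choice": "ToolSelectionAccuracy",
    "adversarial_input": "PromptInjection",
}

def anchor_for(failure_class: str) -> str:
    """Pick the evaluator a failure record should be anchored on."""
    return ANCHOR_EVALUATORS[failure_class]

Keeping the table in one place makes the anchor choice reviewable in code review instead of living in individual dashboards.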
A practical workflow starts with a checkout-support agent instrumented through traceAI-langchain. Each production trace carries the prompt version, retrieved chunks, tool names, route, model, and evaluator outputs. A failed case might show ToolSelectionAccuracy = 0 on step 4, JSONValidation = false on the final extraction, and elevated llm.token_count.prompt after a memory append. The engineer opens the trace, tags the cause as prompt, retrieval, tool, policy, model, or data, then promotes the row into a regression eval. That gives the team one incident record: observed output, expected behavior, evaluator evidence, trace span, owner, and fix version.
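A minimal sketch of that incident record as a plain dataclass; the field names mirror the paragraph above, and the structure itself is an assumption rather than a FutureAGI schema:

from dataclasses import dataclass

@dataclass
class IncidentRecord:
    observed_output: str
    expected_behavior: str
    evaluator_evidence: dict   # e.g. {"ToolSelectionAccuracy": 0}
    trace_span: str            # span ID of the first bad step
    cause: str                 # prompt | retrieval | tool | policy | model | data
    owner: str
    fix_version: str

record = IncidentRecord(
    observed_output="agent called cancel_subscription",
    expected_behavior="agent calls pause_subscription",
    evaluator_evidence={"ToolSelectionAccuracy": 0, "JSONValidation": False},
    trace_span="step-4",
    cause="tool",
    owner="agents-team",
    fix_version="v2.3.1",
)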
The next action depends on the cause. A prompt-injection case can add a pre-guardrail using PromptInjection. An unsupported answer can add a post-guardrail, a fallback response, or a stricter Groundedness threshold. A wrong tool call can trigger a tool-schema fix and a ToolSelectionAccuracy release gate. Unlike a Ragas-only faithfulness run, which narrows the question to RAG claim support, FutureAGI connects multiple evaluator classes to traces, datasets, guardrails, and release decisions.
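A sketch of the release-gate side of that decision, assuming regression results arrive as dicts with an eval name and a passed flag (both illustrative); the threshold is an example, not a FutureAGI default:

def gate_release(results: list[dict], min_pass_rate: float = 0.98) -> bool:
    """Return True when ToolSelectionAccuracy regressions stay fixed."""
    tool_runs = [r for r in results if r["eval"] == "ToolSelectionAccuracy"]
    if not tool_runs:
        return False  # missing regression coverage also blocks the release
    pass_rate = sum(r["passed"] for r in tool_runs) / len(tool_runs)
    return pass_rate >= min_pass_rate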
How to Measure or Detect Failure Analysis
Measure failure analysis by tracking whether failures become reproducible, classified, and covered by regression tests:
- fi.evals.HallucinationScore — scores unsupported or fabricated claims so teams can separate factual failures from tool or schema failures.
- fi.evals.JSONValidation — evaluates JSON output against a schema and pinpoints structured-output failures.
- fi.evals.ToolSelectionAccuracy — evaluates whether an agent chose the expected tool for the task.
- Trace fields — inspect agent.trajectory.step, llm.token_count.prompt, route, prompt version, tool name, and fallback path.
- Dashboard signals — track eval-fail-rate-by-cohort, regression-pass-rate-after-fix, and mean-time-to-root-cause.
- User-feedback proxy — watch thumbs-down rate, correction rate, and escalation rate for clusters tied to the same cause.
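The snippet below runs one anchor evaluator against a single failed case, where output is the model's claim and context is the retrieved evidence it should be grounded in: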
from fi.evals import HallucinationScore

# Score one failed case: the output claims 60 days, but the
# retrieved context only supports 30, so the claim is unsupported.
evaluator = HallucinationScore()
result = evaluator.evaluate(
    output="The policy allows refunds for 60 days.",
    context="Refunds are allowed within 30 days.",
)
print(result.score, result.reason)
A useful detection view groups failures by first bad span, not only by final answer. If the first failing signal is retrieval relevance, fix the index or reranker. If the first failure is tool choice, fix the tool description or planner prompt.
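A minimal sketch of that routing rule; the span shape, signal names, and owner table are assumptions for illustration:

# Walk spans in execution order and return the earliest failing signal.
# Signal names and the owner table are hypothetical examples.
FIX_OWNERS = {
    "retrieval_relevance": "search-team",  # fix the index or reranker
    "tool_selection": "agents-team",       # fix the tool description or planner prompt
    "json_validation": "api-team",         # fix the output schema or prompt
}

def first_bad_span(spans: list[dict]):
    """Return (signal, owner) for the first failure, or None if clean."""
    for span in spans:
        for signal, passed in span.get("signals", {}).items():
            if not passed:
                return signal, FIX_OWNERS.get(signal, "triage")
    return None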
Common Mistakes
Good failure analysis is easy to weaken. The usual mistakes are procedural, not mathematical:
- Stopping at one metric. A low hallucination score does not reveal whether retrieval failed, the model ignored context, or the prompt asked for unsupported certainty.
- Blaming the model first. Many failures come from stale context, ambiguous tool schemas, missing guardrails, or invalid assumptions in the task definition.
- Mixing incident data with regression data. Keep the failing input, expected behavior, evaluator output, and fix version together.
- Ignoring successful retries. A recovered trace can still reveal a timeout, unsafe first action, or fallback path that should be tested.
- Using averages for rare harms. Cohort-level fail rates hide failures concentrated in one customer tier, language, tool, or policy boundary.
Frequently Asked Questions
What is failure analysis in ML?
Failure analysis in ML explains why a model, LLM app, or agent failed and turns the evidence into a fix. FutureAGI connects evaluator results, traces, and regression datasets so teams can reproduce and prevent repeat failures.
How is failure analysis different from a failure mode?
A failure mode is the class of defect, such as hallucination or schema failure. Failure analysis investigates a specific failed case, identifies its cause, and proves the fix with regression evidence.
How do you measure failure analysis?
Use FutureAGI evaluator outputs such as HallucinationScore, JSONValidation, and ToolSelectionAccuracy, then track eval-fail-rate-by-cohort, regression pass rate, and mean time to root cause.