What Is Failure Analysis in Machine Learning?
The structured investigation of where, how, and why a machine learning model produces wrong outputs, using slice analysis, feature attribution, and error clustering.
What Is Failure Analysis in Machine Learning?
Failure analysis in machine learning is the structured investigation of where a model is wrong, why it is wrong, and what to change. It treats each misprediction as evidence: which input slice failed, which feature contributed, which pipeline stage broke. Tools include confusion matrices, error clustering, slice-level metrics, and feature attribution. In LLM and agent systems, failure analysis extends to hallucinations, schema violations, tool-call errors, and trajectory drops. The output is a ranked list of failure cohorts engineers can act on — not a single accuracy number that hides the long tail.
Why It Matters in Production LLM and Agent Systems
A model that scores 92% offline can be wrong half the time for a single user cohort in production, and nobody notices for weeks. Aggregate metrics average across slices; the slices people actually use can look very different. Failure analysis is what surfaces these hidden cohorts before the support tickets do.
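A minimal sketch of the problem, assuming per-prediction results have been exported to a table with a cohort column (the column names here are illustrative, not any platform's schema): the aggregate score looks healthy while one cohort fails half the time.

import pandas as pd

# Illustrative eval export: one row per prediction, with cohort metadata.
results = pd.DataFrame({
    "user_cohort": ["enterprise"] * 84 + ["smb"] * 16,
    "correct":     [1] * 84 + [1] * 8 + [0] * 8,
})

print(results["correct"].mean())                         # 0.92 -- the aggregate looks fine
print(results.groupby("user_cohort")["correct"].mean())  # enterprise 1.00, smb 0.50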
The pain shows up across roles. ML engineers stare at a stable evaluation score while users churn. SREs see latency p99 spike on a specific route and cannot tie it to model behaviour. Product owners get told “the model is fine” while CSAT trends down. Compliance teams cannot answer “which inputs does the model fail on” because nobody clustered the errors.
In 2026-era stacks, the problem compounds. A multi-step agent fails when one of seven steps fails. End-to-end accuracy can hide which step broke. A trajectory-level failure analysis decomposes the run: the planner picked the right tool, the retriever returned stale context, the generator hallucinated, the critic missed it. Each step is its own failure surface. Without per-step error clustering, teams patch the wrong layer — they re-tune the prompt when the retriever was the problem, or they swap models when the tool schema was wrong. FutureAGI’s eval-fail-rate-by-cohort dashboard is the canonical fix: it groups failures by route, model, prompt version, and user cohort so the right team owns the right fix.
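To make the per-step decomposition concrete, here is a hedged sketch (the run structure and field names are assumptions, not FutureAGI span attributes): reduce each failed trajectory to the first step whose check failed, then count failures by step so the fix lands on the right layer.

from collections import Counter

# Assumed shape: each failed agent run has already been reduced to the
# first pipeline step whose per-step check failed.
failed_runs = [
    {"run_id": "r1", "first_failed_step": "retriever"},
    {"run_id": "r2", "first_failed_step": "retriever"},
    {"run_id": "r3", "first_failed_step": "generator"},
    {"run_id": "r4", "first_failed_step": "retriever"},
]

by_step = Counter(run["first_failed_step"] for run in failed_runs)
for step, count in by_step.most_common():
    print(step, count)
# retriever 3  <- fix the retrieval layer, not the prompt
# generator 1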
How FutureAGI Handles Failure Analysis
FutureAGI’s approach is to make failure analysis a first-class workflow over the same Dataset you evaluate against. After running Dataset.add_evaluation() with evaluators like DetectHallucination, TaskCompletion, and ContextRelevance, the platform exposes per-row scores plus reasons — so each failure carries a structured “why” string, not just a 0/1 flag. Engineers can filter the dataset by score < 0.5 and group by any metadata column (route, prompt version, user segment, model variant) to see which cohort dominates the failures.
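A hedged sketch of that filter-and-group step, assuming the per-row scores, reasons, and request metadata have been exported to a DataFrame (the column names are illustrative, not the platform's schema):

import pandas as pd

# Illustrative export: one row per evaluated example.
rows = pd.DataFrame([
    {"score": 0.2, "reason": "claim not in context",  "route": "/billing", "prompt_version": "v7"},
    {"score": 0.9, "reason": "grounded",              "route": "/billing", "prompt_version": "v7"},
    {"score": 0.3, "reason": "stale chunk retrieved", "route": "/search",  "prompt_version": "v6"},
    {"score": 0.1, "reason": "stale chunk retrieved", "route": "/search",  "prompt_version": "v6"},
])

failures = rows[rows["score"] < 0.5]
cohorts = (
    failures.groupby(["route", "prompt_version"])
    .size()
    .sort_values(ascending=False)
)
print(cohorts)  # the dominant failure cohort sits at the top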
A concrete workflow: a RAG team running on traceAI-langchain notices Faithfulness drops from 0.91 to 0.78 in one week. They open the failure cohort, sort by reason, and find 60% of failures share the same retrieval pattern — a single document chunk is being retrieved for unrelated queries. They fix the embedding pipeline, run a regression eval on the same canonical dataset, and confirm the score recovers. Unlike a static benchmark snapshot, this loop is continuous. Failures are not endpoints — they are the input to the next iteration. The Agent Command Center can also pin a model fallback route for traffic where the analysis shows a smaller model handles certain cohorts better, without re-deploying the primary model.
How to Measure or Detect It
Failure analysis is not a single metric but a workflow over signals:
- DetectHallucination: returns a per-row 0–1 score plus a reason string identifying the unsupported claim.
- TaskCompletion: returns whether the agent reached its goal, with trajectory-level breakdown for partial credit.
- Confusion matrix: for classification heads, the off-diagonal cells are your top failure cohorts.
- Eval-fail-rate-by-cohort (dashboard signal): the percentage of failures per route, model, or prompt version.
- Span attributes (agent.trajectory.step, llm.token_count.completion): segment failures by pipeline stage to isolate the broken layer.
- Reason clustering: group evaluator reason strings via embedding similarity to find recurring failure patterns (see the clustering sketch after the code example below).
from fi.evals import DetectHallucination

# Instantiate the per-row hallucination evaluator
detector = DetectHallucination()

# Score one output against its retrieved context; the result carries
# both a 0-1 score and a reason string explaining the failure
result = detector.evaluate(
    input="What was Q3 revenue?",
    output="Q3 revenue was $42M.",
    context="Q3 revenue was $38M."
)

print(result.score, result.reason)
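For the reason-clustering signal in the list above, here is a minimal sketch using sentence-transformers and scikit-learn (the tooling choices are assumptions; any embedding model and clustering method will do):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Evaluator reason strings collected from failed rows
reasons = [
    "The cited figure $42M does not appear in the retrieved context.",
    "Revenue number is unsupported by the provided documents.",
    "Answer ignores the user's requested date range.",
    "Response omits the requested date range.",
]

# Embed the reasons and group similar ones into candidate failure patterns
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(reasons)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

for label, reason in sorted(zip(labels, reasons)):
    print(label, reason)  # recurring failure patterns share a cluster label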
Common Mistakes
- Stopping at the aggregate score. A 92% accuracy hides the 8% that ruin trust. Always segment by cohort.
- Skipping the reason field. A 0/1 score tells you something failed; the reason tells you why. Don’t discard it.
- Confusing model failure with pipeline failure. A wrong answer can come from the retriever, the prompt, or the model. Don’t blame the LLM until you’ve checked the upstream span.
- Re-running the same eval and hoping. Failure analysis should feed the next run, not just describe the last one: add new dataset rows that exercise the failed cohort (see the sketch after this list).
- Treating one bad week as drift. Distinguish a spike from a trend, and base decisions on sustained signal.
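A minimal sketch of that regression step, assuming failures are exported with their inputs and the canonical regression set is a JSONL file (both assumptions about your setup): failed-cohort rows become permanent regression cases.

import json

# Illustrative rows pulled from the failure cohort
failed_rows = [
    {"input": "What was Q3 revenue?", "context": "Q3 revenue was $38M.", "cohort": "finance"},
]

# Append them to the canonical regression dataset so the next eval run
# exercises exactly the inputs that just failed.
with open("regression_set.jsonl", "a", encoding="utf-8") as f:
    for row in failed_rows:
        f.write(json.dumps(row) + "\n")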
Frequently Asked Questions
What is failure analysis in machine learning?
Failure analysis is the structured investigation of model errors — where they occur (which input slice), why they occur (which feature or pipeline stage), and how to fix them. It turns aggregate accuracy into actionable failure cohorts.
How is failure analysis different from model evaluation?
Evaluation gives you the score; failure analysis gives you the diagnosis. Evaluation says 92% accuracy. Failure analysis says the failing 8% cluster on inputs longer than 4K tokens with retrieved context older than 30 days.
How do you measure failures in an LLM pipeline?
FutureAGI runs evaluators like `DetectHallucination` and `TaskCompletion` against production traces, then segments the results by cohort, route, and span attribute so failure clusters surface as named groups, not just averages.