Failure Modes

A failure mode where later training improves one capability but degrades behavior the model had already learned.

What Is Catastrophic Forgetting?

Catastrophic forgetting is an AI failure mode where a model loses behavior it already learned after fine-tuning, continual learning, or a later training stage optimizes a different capability. In production LLM systems, it shows up in training and release-eval surfaces: a new model handles fresh cases but fails older prompts, tools, policies, or facts. FutureAGI treats it as a release regression, measured with golden datasets, capability-level eval deltas, and traces tied to model and prompt versions.

Why It Matters in Production LLM and Agent Systems

Catastrophic forgetting is expensive because it hides behind a successful release note. A support model fine-tuned on new refund policies may answer the new policy correctly while forgetting prior escalation rules. An agent planner trained to call a billing tool may stop respecting safety refusals that were passing last week. The visible failure is not random bad quality; it is a trade where one capability improves while another silently falls below its old baseline.

Developers feel it as flaky release gates: the new task demo works, but archived tickets, older API schemas, or legacy customer segments regress. SREs see unchanged latency and cost, so the incident does not look like infrastructure. Compliance teams notice only after a retained policy answer or refusal rule breaks. End users experience it as a model that “used to know this” and now gives an overconfident wrong answer.

The symptoms are measurable if the system keeps history: per-capability pass rate drops, old golden-dataset cohorts fail while new fine-tuning cohorts pass, tool calls shift toward newer schemas, and thumbs-down events cluster around long-tail workflows. Agentic systems make the problem worse because memory, planning, retrieval, and tool-use policies can forget independently. A model that forgets one policy can produce a wrong plan, call the wrong tool, and then write a confident final answer that masks the original regression.
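If those results are kept with a release ID and a capability tag per scored row, detecting the symptom is a few lines of analysis. The sketch below assumes a flat results file with release, capability, and passed columns; the column and release names are illustrative assumptions, not a fixed FutureAGI schema.

import pandas as pd

# Scored eval rows kept across releases; column and release names here
# are illustrative assumptions, not a fixed FutureAGI schema.
results = pd.read_csv("eval_results.csv")  # columns: release, capability, passed

# Pass rate per capability, one column per release.
rates = (
    results.groupby(["release", "capability"])["passed"]
    .mean()
    .unstack("release")
)

# Delta between the previous accepted release and the candidate.
rates["delta"] = rates["support-v13"] - rates["support-v12"]

# Capabilities that regressed, even if the aggregate improved.
print(rates[rates["delta"] < -0.05].sort_values("delta"))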

How FutureAGI Handles Catastrophic Forgetting

FutureAGI does not ship a single dedicated detector for catastrophic forgetting; it models the problem as a release regression across named capabilities. The core workflow starts with fi.datasets.Dataset: keep a stable golden dataset with rows tagged by capability, policy area, tool schema, customer segment, and model lineage. Then attach evaluators through Dataset.add_evaluation(), such as TaskCompletion for task success, GroundTruthMatch for expected-answer preservation, and FactualAccuracy for domain facts that must survive fine-tuning.
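A minimal sketch of that setup follows. Only the class and method names come from this section; the Dataset constructor arguments, the add_row() helper, and the add_evaluation() arguments are illustrative assumptions about the fi SDK surface, not exact signatures.

from fi.datasets import Dataset
from fi.evals import FactualAccuracy, GroundTruthMatch, TaskCompletion

# Stable golden dataset; the baseline is kept, never replaced per release.
golden = Dataset(name="support-golden-v1")  # constructor args are assumed

# Rows carry the tags that regression analysis will group by.
golden.add_row(  # add_row() is a hypothetical helper for illustration
    input="Cancel my paid plan and confirm the refund rule.",
    expected_output="Cancel the plan and explain the 30-day refund window.",
    metadata={
        "capability": "cancellation-policy",
        "tool_schema": "billing-v1",
        "segment": "legacy-enterprise",
    },
)

# Evaluators that must keep passing release over release.
golden.add_evaluation(TaskCompletion())    # task success
golden.add_evaluation(GroundTruthMatch())  # expected-answer preservation
golden.add_evaluation(FactualAccuracy())   # domain facts that must survive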

A real example: a team fine-tunes a support model to improve multilingual refund handling. Before rollout, CI runs the old support-policy cohort, the new multilingual cohort, and a tool-calling cohort. The new model improves Spanish refund answers by 9 points, but TaskCompletion on legacy cancellation tickets drops from 0.91 to 0.73. The engineer opens the failing rows, sees that the model forgot the older cancellation flow, and blocks the release until the fine-tuning set includes those counterexamples.
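The release gate for that scenario can be a short script. This sketch hard-codes the pass rates quoted above; in CI they would come from running each cohort against both model versions rather than from literals.

# Per-cohort TaskCompletion pass rates; literals stand in for real
# evaluator runs against each model version.
baseline = {"legacy-cancellation": 0.91, "es-refunds": 0.78}   # previous accepted release
candidate = {"legacy-cancellation": 0.73, "es-refunds": 0.87}  # fine-tuned candidate

MAX_DROP = 0.05  # allowed per-cohort regression before the gate blocks

for cohort, old in baseline.items():
    new = candidate[cohort]
    if old - new > MAX_DROP:
        raise SystemExit(f"Release blocked: {cohort} dropped {old:.2f} -> {new:.2f}")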

For production attribution, the team instruments the app with traceAI-langchain and stores model version, prompt version, dataset cohort, and llm.token_count.prompt on spans. If the regression slips through, FutureAGI can group failures by release and route. Unlike a bare MLflow model registry check, which can confirm the artifact changed but not which old behavior degraded, this workflow ties forgetting to concrete eval deltas and failing traces.
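A sketch of those span fields, written against the OpenTelemetry API that trace instrumentation builds on: llm.token_count.prompt follows the attribute name used above, while the app.* keys are illustrative custom attributes, not a fixed traceAI schema.

from opentelemetry import trace

tracer = trace.get_tracer("support-app")

# Attributes that let failures be grouped by release and cohort later.
# The app.* keys are illustrative custom fields.
with tracer.start_as_current_span("answer_ticket") as span:
    span.set_attribute("llm.token_count.prompt", 412)
    span.set_attribute("app.model_version", "support-v13")
    span.set_attribute("app.prompt_version", "refund-prompt-7")
    span.set_attribute("app.dataset_cohort", "legacy-cancellation")
    span.set_attribute("app.release_id", "2025-06-02")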

How to Measure or Detect It

Catastrophic forgetting is not measured from one output. Measure the delta between a new release and the previous accepted release on unchanged tasks.

  • TaskCompletion by cohort — returns a task-success signal that shows whether older tasks still complete after the new training stage.
  • GroundTruthMatch on protected examples — catches answers that no longer match known-good policy, API, or workflow outputs.
  • Capability-regression delta — compare pass rate by tag: old tool schema, old product area, old language, old compliance rule.
  • Trace fields — record model version, prompt version, dataset cohort, llm.token_count.prompt, and release ID on every scored span.
  • User-feedback proxy — watch escalation rate and thumbs-down rate for workflows that were stable before rollout.
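
A single protected example run through TaskCompletion looks like this:
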
from fi.evals import TaskCompletion

evaluator = TaskCompletion()

# Score one golden-dataset row: the output contradicts the expected
# 30-day refund policy, so the evaluator should flag it.
result = evaluator.evaluate(
    input="Cancel my paid plan and confirm the refund rule.",
    output="I cancelled it. Refunds are never available.",
    expected_output="Cancel the plan and explain the 30-day refund window.",
)
print(result.score, result.reason)

Set alerts on score deltas, not absolute score alone. A 0.88 pass rate can be fine for a new cohort and a release blocker if the same cohort scored 0.96 yesterday.

Common Mistakes

Most catastrophic-forgetting misses come from evaluation design, not from the model being mysterious.

  • Testing only the new objective. Fine-tuning evals must include old capabilities, old refusals, and old tool schemas.
  • Replacing the golden dataset each release. If the baseline changes, the team cannot prove a skill was preserved.
  • Hiding cohort drops inside one aggregate. A multilingual gain can mask a cancellation-policy regression.
  • Confusing it with model drift. Catastrophic forgetting is caused by training or adaptation; drift is a production distribution shift.
  • Checking single-turn prompts only. Agent memory, planner behavior, and tool selection can forget even when final-answer text looks acceptable.
  • Ignoring small stable cohorts. Low-volume enterprise workflows often contain the policies that create the largest incident cost.

Frequently Asked Questions

What is catastrophic forgetting?

Catastrophic forgetting is a model failure mode where fine-tuning or later training improves one task but erases behavior the model previously handled. In LLM systems, it appears as release-over-release regression on older prompts, tools, policies, or facts.

How is catastrophic forgetting different from model drift?

Catastrophic forgetting is caused by a training or adaptation step that damages prior behavior. Model drift is a production behavior shift over time, often caused by changing inputs, users, or upstream data.

How do you measure catastrophic forgetting?

Run `TaskCompletion`, `GroundTruthMatch`, and `FactualAccuracy` on stable golden datasets before every release. FutureAGI compares per-capability score deltas and ties failures to model, prompt, and dataset versions in traces.