What Is Data Decomposition?

Data decomposition is the practice of breaking a complex dataset, signal, or workload into smaller, well-defined components that can be analyzed, modeled, or evaluated independently. Classical examples include time-series decomposition (trend, seasonality, residual), feature decomposition with PCA or ICA, and matrix factorization. In LLM and agent stacks the same idea applies to tasks: a user request is decomposed into planner steps, sub-goals, and tool calls. FutureAGI evaluates the decomposed pieces with TaskCompletion, GoalProgress, and StepEfficiency rather than scoring the whole task as a black box.
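The classical time-series case can be made concrete with a stdlib-only sketch: an additive decomposition of a toy series into trend, seasonality, and residual via a centered moving average. In practice you would reach for statsmodels' seasonal_decompose or Prophet; everything below (series, period, constants) is illustrative:

```python
import math

# Illustrative additive decomposition: series = trend + seasonality + residual.
PERIOD = 12
N = 120
# Toy series: linear trend plus a sinusoidal season, no noise, so the
# residual should come out near zero.
series = [0.5 * t + 10 * math.sin(2 * math.pi * t / PERIOD) for t in range(N)]

def trend_at(t):
    # Classical 2x12 centered moving average: endpoints half-weighted.
    window = series[t - PERIOD // 2 : t + PERIOD // 2 + 1]
    return (window[0] / 2 + sum(window[1:-1]) + window[-1] / 2) / PERIOD

half = PERIOD // 2
detrended = [series[t] - trend_at(t) for t in range(half, N - half)]

# Seasonal component: average the detrended values at each cycle position.
seasonal = [sum(detrended[i::PERIOD]) / len(detrended[i::PERIOD])
            for i in range(PERIOD)]

# Residual: whatever trend and seasonality do not explain.
residual = [detrended[i] - seasonal[i % PERIOD] for i in range(len(detrended))]
print(max(abs(r) for r in residual))  # ~0 for this noiseless series
```

Each component can then be modeled or evaluated on its own, which is the same move the agent-side evaluators make with sub-goals.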

Why It Matters in Production LLM and Agent Systems

When agents fail, they fail at a step. A question about a refund touches retrieval, policy lookup, account check, calculation, and email drafting. If you only score the final answer, you cannot tell which step broke. If retrieval pulled the wrong policy, the planner’s “calculation” was already doomed and the email looks correct in shape. Decomposed evaluation — scoring each step against its sub-goal — is the only way to assign blame and ship a targeted fix.
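A toy sketch of that blame assignment, with hypothetical per-step scores (the step names and numbers are invented for illustration, not FutureAGI output):

```python
# Hypothetical per-step verdicts for the refund example above. Scoring
# each step against its sub-goal localizes the failure; an end-to-end
# pass/fail would only say "wrong refund amount".
step_scores = {
    "retrieval":      1.0,
    "policy_lookup":  0.2,   # wrong policy document retrieved
    "account_check":  1.0,
    "calculation":    0.9,   # arithmetic fine, but on doomed inputs
    "email_draft":    1.0,   # looks correct in shape
}

def first_broken_step(scores, threshold=0.5):
    """Return the earliest step whose score falls below the threshold."""
    for step, score in scores.items():  # dicts preserve insertion order
        if score < threshold:
            return step
    return None

print(first_broken_step(step_scores))  # -> policy_lookup
```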

The pain spans roles. ML engineers debug end-to-end failures and find no single broken component. SREs see retry storms when one decomposed step breaks but the agent keeps trying. Product teams ship a “fix” that improves average completion but actually masks a regression in step three. Compliance teams need step-by-step evidence under audit — a single end-to-end pass/fail does not answer “which decision used which retrieved document under which policy version.”

In 2026's multi-step agent stacks built on MCP, LangGraph, or the OpenAI Agents SDK, decomposition is structural: every planner produces sub-tasks, every tool call is a sub-task, and trajectories are explicitly stepwise. Useful symptoms include a step_efficiency_score that drops while task_completion_score holds steady, retrieval-step success rates that diverge from final-answer success, and trajectories where the same goal takes 12 steps in one cohort and 4 in another.
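That last symptom is cheap to check. A stdlib sketch comparing step-count distributions per cohort (cohort names and counts are invented):

```python
from statistics import mean, quantiles

# Hypothetical step counts per trajectory, grouped by cohort. A cohort
# whose tail widens (same goal, many more steps) hints at a
# decomposition failure even when final-answer success looks stable.
steps_by_cohort = {
    "en-prompts": [4, 4, 5, 4, 6, 4, 5],
    "de-prompts": [4, 12, 5, 11, 13, 4, 12],
}

for cohort, counts in steps_by_cohort.items():
    p90 = quantiles(counts, n=10)[-1]  # 90th-percentile step count
    print(f"{cohort}: mean={mean(counts):.1f} p90={p90:.1f}")
```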

How FutureAGI Handles Data Decomposition

FutureAGI’s approach is built around stepwise eval. The TaskCompletion evaluator scores whether the agent achieved the overall goal; GoalProgress scores progress toward the goal step-by-step; StepEfficiency scores whether each step was necessary; TrajectoryScore aggregates the whole. A practical workflow: a customer-support agent decomposes a refund request into four planner sub-goals (verify identity, retrieve policy, calculate amount, send confirmation). Each sub-goal becomes a span, recorded by traceAI-langchain with agent.trajectory.step set.

Offline, the team builds a Dataset of trajectories with sub-goal labels and runs TaskCompletion, GoalProgress, and StepEfficiency against the spans. Release is gated when any sub-goal regresses below threshold — a model that improves average completion but tanks the “retrieve policy” step is blocked, not shipped. Online, the same evaluators run on production samples; per-step dashboards show where to look first.
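The gate itself reduces to a few lines. This is a hand-rolled sketch, not the FutureAGI gating API; sub-goal names, scores, and thresholds are illustrative:

```python
# Block a release if ANY sub-goal regresses below its threshold, even
# when the average across sub-goals improves.
def release_gate(scores, thresholds):
    failures = [
        goal for goal, score in scores.items()
        if score < thresholds.get(goal, 0.0)
    ]
    return ("block", failures) if failures else ("ship", [])

candidate = {
    "verify_identity":   0.97,
    "retrieve_policy":   0.71,   # regressed below its gate
    "calculate_amount":  0.95,
    "send_confirmation": 0.99,
}
gates = {goal: 0.85 for goal in candidate}

print(release_gate(candidate, gates))  # -> ('block', ['retrieve_policy'])
```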

For data-style decomposition (time-series, feature-level), FutureAGI doesn’t replace SciPy or Prophet — it sits one layer above. If a forecasting model is decomposed into trend + seasonal + residual, FutureAGI’s RegressionEval scores the model’s outputs against the decomposed reference. Unlike a generic LLM benchmark that produces one accuracy number, this surfaces which decomposed component the model is getting wrong. FutureAGI’s approach is consistent: respect the structure already in the data or task, and put the evaluator on each piece.
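A minimal illustration of component-level scoring, using plain RMSE in place of RegressionEval (all arrays and values are invented):

```python
import math

def rmse(pred, ref):
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

# Decomposed reference series and a model's decomposed predictions.
reference = {
    "trend":    [1.0, 2.0, 3.0, 4.0],
    "seasonal": [0.5, -0.5, 0.5, -0.5],
    "residual": [0.0, 0.1, -0.1, 0.0],
}
predicted = {
    "trend":    [1.1, 2.0, 2.9, 4.1],
    "seasonal": [1.5, -1.6, 1.4, -1.5],   # badly off
    "residual": [0.0, 0.1, -0.1, 0.0],
}

# One error number per component, instead of one aggregate accuracy.
per_component = {c: rmse(predicted[c], reference[c]) for c in reference}
worst = max(per_component, key=per_component.get)
print(worst)  # -> seasonal
```

The per-component breakdown is the point: the model above forecasts trend well and seasonality badly, which a single aggregate score would hide.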

How to Measure or Detect It

Decomposition is observable through stepwise evaluators and trace fields:

  • TaskCompletion — overall goal completion across the trajectory.
  • GoalProgress — per-step progress toward the goal; surfaces the step where progress stalled.
  • StepEfficiency — flags unnecessary steps and loops.
  • TrajectoryScore — composite trajectory score combining multiple step-level metrics.
  • agent.trajectory.step distribution — number of steps per trajectory by cohort; widening tail hints at decomposition failure.
A minimal sketch of running the stepwise evaluators on a single decomposed trajectory:

from fi.evals import TaskCompletion, GoalProgress, StepEfficiency

# One decomposed trajectory: the overall goal plus its ordered sub-steps.
trajectory = {
    "goal": "process refund request",
    "steps": [
        {"action": "verify_identity"},
        {"action": "retrieve_policy"},
        {"action": "calculate_amount"},
        {"action": "send_email"},
    ],
}

# Overall completion, per-step progress, and step necessity.
print(TaskCompletion().evaluate(**trajectory))
print(GoalProgress().evaluate(**trajectory))
print(StepEfficiency().evaluate(**trajectory))

Common Mistakes

  • Scoring only the final answer. End-to-end pass/fail hides which decomposed step broke.
  • Letting the model decompose freely without labels. Without sub-goal labels in the dataset, the evaluator cannot align steps to ground truth.
  • Mixing decomposition strategies in one dataset. Some trajectories use a planner agent and others use a static workflow — score them separately.
  • Ignoring step efficiency. A correct answer that took 18 steps when 4 would do is a quiet regression in cost and latency.
  • Skipping cohort slicing. Decomposition behavior often varies by language, prompt length, and tool catalog — slice metrics accordingly.
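Cohort slicing itself is a one-liner worth doing early. A stdlib sketch with invented records:

```python
from collections import defaultdict
from statistics import mean

# Group the same metric by a cohort key, so a regression confined to
# one slice cannot hide inside the overall average.
records = [
    {"lang": "en", "task_completion": 0.95},
    {"lang": "en", "task_completion": 0.92},
    {"lang": "fr", "task_completion": 0.55},
    {"lang": "fr", "task_completion": 0.60},
]

by_lang = defaultdict(list)
for r in records:
    by_lang[r["lang"]].append(r["task_completion"])

sliced = {lang: mean(scores) for lang, scores in by_lang.items()}
print(sliced)  # the fr slice is far below the en slice
```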

Frequently Asked Questions

What is data decomposition?

Data decomposition is the practice of breaking a complex dataset, signal, or workload into smaller components — for example, time-series trend/seasonality/residual or task-level subgoals — that can be analyzed or evaluated independently.

How does it apply to LLM agents?

LLM agents decompose user requests into planner steps, tool calls, and sub-goals. Evaluation has to follow the decomposition: scoring each step with FutureAGI's TaskCompletion, StepEfficiency, and GoalProgress is how you find which step failed.

How is decomposition different from chain-of-thought?

Chain-of-thought is a prompting style that asks the model to show its reasoning. Decomposition is the broader practice of structuring a problem into sub-problems; it can be implemented with planner agents, sub-agents, or static workflows.