What Is Decomposed Evaluation?
An evaluation design that scores each component of a complex LLM or agent workflow separately before combining results for diagnosis or release gating.
Decomposed evaluation is an LLM-evaluation method that breaks a complex model or agent task into separately scored components, then combines those scores for diagnosis and release decisions. Instead of asking one judge whether the whole response is good, it scores retrieval quality, reasoning, tool choice, schema compliance, and final-answer grounding independently. It shows up in eval pipelines and production traces for multi-step agents. FutureAGI maps this pattern to `eval:CustomEvaluation`, per-step evaluators, and aggregate thresholds.
Why Decomposed Evaluation Matters in Production LLM and Agent Systems
One aggregate quality score hides the failure that needs fixing. A RAG agent can return a wrong answer because the retriever missed the right document, because the planner chose the wrong tool, because the tool result was ignored, or because the final model invented a bridge between facts. If all four cases become “quality = 0.62,” the engineer has no repair path.
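A toy sketch in plain Python makes the collapse concrete. The component names and scores below are made up for illustration; the point is that two runs with opposite root causes can average to the same headline number:

```python
# Hypothetical component scores for two runs that fail for opposite reasons.
run_a = {"retrieval": 0.06, "tool_choice": 0.90, "grounding": 0.90}  # retriever missed the document
run_b = {"retrieval": 0.90, "tool_choice": 0.90, "grounding": 0.06}  # model invented a bridge

for run in (run_a, run_b):
    aggregate = sum(run.values()) / len(run)
    print(round(aggregate, 2), run)  # both runs print an aggregate of 0.62
```

Only the per-component view separates a retrieval miss from hallucinated synthesis.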
The pain shows up differently by role. Developers see low reproducibility: the same prompt change improves one dataset slice and breaks another. SREs see retry storms, p99 latency jumps, and traces with many `agent.trajectory.step` spans before failure. Product teams see task-completion drops without knowing whether UX, retrieval, or model reasoning changed. Compliance teams lose the ability to prove which control failed when an answer is unsupported or a required disclosure is missing. End-users feel it as confident answers that are almost right but unsafe to act on.
Decomposed evaluation matters more for 2026-era multi-step systems than for single-turn chat. Agents cross boundaries: retriever to planner, planner to tool, tool to formatter, formatter to final answer. Each boundary has a different contract. A final-answer judge can catch the bad outcome, but it cannot tell whether the cause was context recall, tool selection accuracy, JSON schema drift, or hallucinated synthesis. Decomposition turns a vague failure into a work queue.
How FutureAGI Handles Decomposed Evaluation
The explicit FutureAGI surface for this term is `eval:CustomEvaluation`, exposed as `fi.evals.CustomEvaluation`: a dynamically created evaluation defined through a builder or decorator. FutureAGI’s approach is to make each decomposition dimension a named evaluator with its own rubric, score range, threshold, and version, then keep the component scores visible after aggregation.
A real workflow starts with a LangChain support agent instrumented through `traceAI-langchain`. Each trace stores retrieved chunks, tool calls, the final response, token fields such as `llm.token_count.prompt`, and step boundaries such as `agent.trajectory.step`. The team defines four component checks: `ContextRelevance` for whether retrieved context matches the question, `ToolSelectionAccuracy` for whether the planner picked the right CRM tool, `Groundedness` for whether the final answer is supported, and a `CustomEvaluation` named `refund_policy_disclosure` for the business-specific compliance sentence.
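A sketch of those four checks, written here as `CustomEvaluation` instances using the constructor shape from the snippet in the measurement section below; the rubric strings are illustrative, and `ContextRelevance`, `ToolSelectionAccuracy`, and `Groundedness` may ship as built-in evaluators rather than custom definitions:

```python
from fi.evals import CustomEvaluation

# Illustrative stand-ins; real deployments may use built-in evaluators
# for the first three and reserve CustomEvaluation for business rules.
checks = [
    CustomEvaluation(name="context_relevance",
                     rubric="Score 0-1: does the retrieved context match the question?"),
    CustomEvaluation(name="tool_selection_accuracy",
                     rubric="Score 0-1: did the planner pick the right CRM tool?"),
    CustomEvaluation(name="groundedness",
                     rubric="Score 0-1: is the final answer supported by the context?"),
    CustomEvaluation(name="refund_policy_disclosure",
                     rubric="Return 1 only if the required refund-policy sentence appears."),
]
```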
Those scores run on a golden dataset before release and on sampled production traces after release. `AggregatedMetric` produces the headline release gate, but the dashboard still shows each component. If `ToolSelectionAccuracy` falls below 0.90 only on enterprise-account traces, the engineer opens a targeted regression eval for the planner prompt instead of rewriting the whole answer prompt. Unlike Ragas faithfulness, which focuses on support between context and answer, decomposed evaluation can grade retriever, planner, tool, schema, and final answer as separate failure surfaces.
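A minimal cohort-slicing sketch, assuming each sampled trace is exported as a plain dict with a cohort label and per-component sub-scores (the field names here are hypothetical, not a FutureAGI export format):

```python
from collections import defaultdict

def failing_components(traces: list[dict], floor: float = 0.90) -> dict:
    """Return (cohort, component) pairs whose mean sub-score falls below the floor."""
    scores = defaultdict(list)
    for trace in traces:
        # Assumed shape: {"cohort": "enterprise", "sub_scores": {"tool_selection_accuracy": 0.82, ...}}
        for component, score in trace["sub_scores"].items():
            scores[(trace["cohort"], component)].append(score)
    return {key: sum(vals) / len(vals)
            for key, vals in scores.items()
            if sum(vals) / len(vals) < floor}
```

A result such as `{("enterprise", "tool_selection_accuracy"): 0.82}` points the regression eval at the planner prompt for that cohort.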
How to Measure or Detect Decomposed Evaluation
Treat decomposed evaluation as a component scorecard plus an aggregate gate:
- `fi.evals.CustomEvaluation` returns a team-defined score, label, or reason for one workflow component.
- `ToolSelectionAccuracy` checks whether an agent selected the expected tool for a step or goal.
- `Groundedness` evaluates whether the final response is supported by the provided context.
- Trace fields such as `agent.trajectory.step` and `llm.token_count.prompt` locate which step failed and how expensive it was.
- Dashboard signals include eval-fail-rate-by-cohort, component score distribution, repair-loop count, and task-completion rate after retries.
- User-feedback proxies include thumbs-down rate, escalation rate, and manual-review overturn rate for each failed component.
Minimal Python:

```python
from fi.evals import CustomEvaluation, AggregatedMetric

# Placeholder inputs; in practice these come from a golden-dataset row or a trace.
query = "How do I request a refund?"
response = "You can request a refund within 30 days of purchase."
trace = {"retrieved_chunks": ["Refund policy: refunds within 30 days."]}

# One evaluator per failure surface, each with its own rubric.
retrieval = CustomEvaluation(name="retrieval_fit", rubric="Score 0-1 for context fit.")
tooling = CustomEvaluation(name="tool_plan", rubric="Score 0-1 for tool choice.")

# Weighted release gate; component scores stay visible as sub_scores.
suite = AggregatedMetric(evaluators=[retrieval, tooling], weights=[0.5, 0.5])

result = suite.evaluate(input=query, output=response, context=trace)
print(result.score, result.sub_scores)
```
Common Mistakes
Bad decomposition usually comes from splitting the workflow in ways that do not match real failure modes.
- Decomposing by code layer only. Score user-visible obligations, not just retriever, model, and formatter modules.
- Averaging safety failures away. Use fail-fast thresholds for compliance, schema, and unsafe-action components; see the sketch after this list.
- Changing sub-metrics every release. Version each rubric, threshold, and component name so regressions stay comparable.
- Replacing end-to-end evals. Keep final outcome scoring; decomposition explains failures, but it does not prove the whole task succeeded.
- Ignoring score correlation. If two sub-scores always move together, merge them or clarify the rubrics.
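A plain-Python sketch of the fail-fast rule from the second bullet, with no FutureAGI API assumed; safety-critical components veto the gate outright, so a weighted average can never bury them:

```python
SAFETY_CRITICAL = {"refund_policy_disclosure", "schema_valid"}  # illustrative names

def release_gate(sub_scores: dict[str, float],
                 weights: dict[str, float],
                 gate: float = 0.85) -> bool:
    # Fail fast: any safety-critical component below a perfect score blocks release.
    if any(sub_scores.get(name, 1.0) < 1.0 for name in SAFETY_CRITICAL):
        return False
    # Only then gate on the weighted aggregate of the remaining components.
    weighted = sum(sub_scores[name] * weight for name, weight in weights.items())
    return weighted >= gate
```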
Frequently Asked Questions
What is decomposed evaluation?
Decomposed evaluation breaks a complex LLM or agent task into separately scored components, such as retrieval, reasoning, tool use, schema validity, and final-answer grounding. FutureAGI maps this pattern to `eval:CustomEvaluation` and aggregate release gates.
How is decomposed evaluation different from end-to-end evaluation?
End-to-end evaluation returns one outcome score for the whole workflow. Decomposed evaluation keeps sub-scores visible, so engineers can see whether the retriever, planner, tool call, or final response caused the failure.
How do you measure decomposed evaluation?
Use FutureAGI `fi.evals.CustomEvaluation` for each component, then combine scores with `AggregatedMetric` while tracking trace fields such as `agent.trajectory.step` and eval-fail-rate-by-cohort.