Evaluating Claude Sub-Agents: The Dispatch Is the Unit (2026)
Evaluating Claude sub-agents in 2026: dispatch is the eval unit. Three rubrics, per-handoff scoring in CI, traceAI Task-tool spans, the production loop.
A Claude Code supervisor scores 0.92 on TaskCompletion against the final user answer. The refactor lands, the tests pass, the run closes green. Pull the trace: the supervisor fired the Task tool nine times. Four of those dispatches were to the wrong sub-agent type. Two sub-agents drifted out of scope and the supervisor stitched their drift into the final patch. One dispatch returned a useful diff the supervisor ignored and regenerated inline. Nothing in the final-answer rubric moved. Everything broke in the dispatches.
Sub-agent eval is not single-agent eval plus a tree. The unit is the dispatch: the supervisor LLM picks a sub-agent through the Task tool, hands it scope, and reads what comes back. Three things break independently: the dispatch itself, the sub-agent’s scope adherence, and the supervisor’s integration of the result. This post is the working pattern for evaluating Anthropic Claude Agent SDK sub-agents in 2026: the dispatch as the unit, the three rubrics that catch the three failures, the test set built from real production dispatches, the traceAI Claude instrumentor that emits the spans the rubrics need, and the Error Feed loop that clusters failures back to a supervisor or sub-agent prompt change.
Why Claude sub-agent eval differs
The Claude Agent SDK pattern is dispatch, not orchestration. A supervisor Claude session runs a plan. When a chunk of work is well-scoped, the supervisor calls the built-in Task tool with three parameters: a subagent_type (which sub-agent definition to instantiate, like research-agent or refactor-agent), a prompt that scopes the work, and optionally an allowed_tools subset. A new Claude session spins up with its own context window and that tool subset, runs to completion, and returns a single string result. The supervisor reads it and continues planning.
That dispatch shape is verified in the source. The traceAI Claude instrumentor at traceAI/python/frameworks/claude-agent-sdk/traceai_claude_agent_sdk/_subagent_tracker.py pulls subagent_type, description, prompt, and allowed_tools out of every Task tool invocation and emits them on a dedicated SUBAGENT span. The parent_tool_use_id chain reconstructs nesting when one sub-agent dispatches another.
This makes Claude sub-agents different from AutoGen group chats or LangGraph nodes. AutoGen passes a message between peers; LangGraph traverses an edge in a fixed graph. Claude’s supervisor makes a discrete choice at runtime (dispatch or not, which type, what scope) and then has to do something with the return value. The failure surface is the choice and the integration, not the path through a graph. Evaluating that surface needs rubrics that score the dispatch and the integration as first-class artifacts, not as side effects of the final answer. The evaluating LLM agent handoffs post covers the shared spine across patterns; the definitive agent evaluation guide covers the broader axis taxonomy.
The dispatch as the unit of evaluation
A dispatch is a triple: the supervisor’s decision at the Task tool call, the sub-agent’s run inside the dispatched scope, and the supervisor’s next turn after the Task tool returns. The eval unit is (supervisor_decision, sub_agent_run, supervisor_integration), not the final assistant message and not the sub-agent’s output in isolation.
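One way to make the triple concrete is a small record type. The sketch below is illustrative only: the class and field names (SupervisorDecision, SubAgentRun, DispatchUnit, tools_called) are hypothetical and are not part of the Claude Agent SDK or the ai-evaluation SDK.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SupervisorDecision:
    """The Task tool call as the supervisor issued it."""
    subagent_type: str
    prompt: str
    allowed_tools: Tuple[str, ...] = ()


@dataclass(frozen=True)
class SubAgentRun:
    """What the child session did inside the dispatched scope."""
    tools_called: Tuple[str, ...]
    result: str


@dataclass(frozen=True)
class DispatchUnit:
    """One eval unit: the decision, the run, and the supervisor's next turn."""
    decision: SupervisorDecision
    run: SubAgentRun
    supervisor_next_turn: str


# Example values are made up for illustration.
unit = DispatchUnit(
    decision=SupervisorDecision(
        subagent_type="refactor-agent",
        prompt="Rename foo() to bar() in utils.py",
        allowed_tools=("Read", "Edit"),
    ),
    run=SubAgentRun(tools_called=("Read", "Edit"), result="Renamed 3 call sites."),
    supervisor_next_turn="The sub-agent renamed 3 call sites; running tests next.",
)
```

Keeping the three parts in one record means a test case can never carry a sub-agent output without the decision that produced it or the turn that consumed it.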
This reframes everything. The regression set is a list of real production dispatches with their expected sub-agent type, scope coverage, and supervisor follow-through. The CI gate runs assertions against per-dispatch scores, not against one supervisor-level number. Error Feed clusters failing supervisors by dispatch pattern (wrong type, scope bleed, integration skip), not by final-answer category. The optimizer tunes the supervisor planner prompt and each sub-agent system prompt as separate study targets because the failure attribution is per-dispatch.
A working definition: a dispatch is correct when the supervisor picked the right sub-agent type for the goal, the sub-agent stayed inside the dispatched prompt and tool subset, and the supervisor actually used the return value in its next plan step. A dispatch is broken when any one of those three fails. Three named failure modes, three dedicated rubrics, one localised diagnostic the moment a regression lands.
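The working definition maps directly onto a small classifier over the three rubric scores. This is a sketch under assumptions: the pass marks reuse the starting thresholds suggested later in this post, and the failure-mode labels and the function name are hypothetical.

```python
from typing import Dict, List

# Rubric names come from this post; the pass marks are placeholder starting values.
PASS_MARKS = {
    "DispatchCorrectness": 0.85,
    "ScopeFidelity": 0.90,
    "ResultIntegration": 0.80,
}

# Hypothetical labels for the three named failure modes.
FAILURE_MODE = {
    "DispatchCorrectness": "wrong_dispatch",
    "ScopeFidelity": "scope_bleed",
    "ResultIntegration": "integration_skip",
}


def failure_modes(scores: Dict[str, float]) -> List[str]:
    """Return the failure modes for one dispatch; an empty list means correct."""
    modes = []
    for axis, mark in PASS_MARKS.items():
        # A missing score counts as a failure: an unscored axis cannot be called correct.
        if scores.get(axis, 0.0) < mark:
            modes.append(FAILURE_MODE[axis])
    return modes
```

A dispatch that returns an empty list is correct in the sense above; anything else names the axis to look at first.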
Three rubrics: dispatch correctness, scope fidelity, result integration
Three rubrics, one per failure mode, all built on the CustomLLMJudge from the ai-evaluation SDK. The Future AGI Platform’s classifier-backed scoring runs these at lower per-eval cost than Galileo Luna-2 once the rubrics stabilise, but the SDK is the starting surface.
from fi.evals import Evaluator
from fi.evals.templates import (
    TaskCompletion, EvaluateFunctionCalling,  # alias: LLMFunctionCalling
    AnswerRefusal, ConversationCoherence,
)
from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

dispatch_correctness = CustomLLMJudge(
    name="DispatchCorrectness",
    rubric=(
        "Given the supervisor's plan state, the Task tool call "
        "(subagent_type, prompt, allowed_tools), and the available sub-agent "
        "catalog, score whether the supervisor picked the right type, scoped "
        "the prompt, gave the right tool subset, and chose dispatch over "
        "inline when dispatch was warranted. "
        "1.0 = correct. 0.5 = right type, weak scope or wrong tool subset. "
        "0.0 = wrong type, or dispatch where inline was right."
    ),
    input_mapping={
        "supervisor_plan_state": "supervisor_pre_dispatch_context",
        "dispatch_call": "task_tool_input",
        "available_subagents": "subagent_catalog",
    },
)

scope_fidelity = CustomLLMJudge(
    name="ScopeFidelity",
    rubric=(
        "Given the dispatched prompt, the dispatched allowed_tools, and the "
        "sub-agent's full turn trace, score whether the child stayed inside "
        "the dispatched scope. Penalise tools called outside the subset, work "
        "done outside the prompt's goal, fabricated context the dispatch did "
        "not supply, sibling-scope bleed. Score 0.0 to 1.0."
    ),
    input_mapping={
        "dispatch_prompt": "subagent_prompt",
        "dispatch_tools": "subagent_allowed_tools",
        "subagent_trace": "subagent_turn_transcript",
    },
)

result_integration = CustomLLMJudge(
    name="ResultIntegration",
    rubric=(
        "Given the sub-agent's return value and the supervisor's next turn "
        "after the Task tool returned, score whether the supervisor read the "
        "result, propagated its constraints, and let the result change the "
        "plan. Penalise: regenerates the work inline, ignores a returned "
        "constraint, contradicts the child's conclusion without justification."
    ),
    input_mapping={
        "subagent_result": "task_tool_return_value",
        "supervisor_next_turn": "supervisor_post_dispatch_turn",
    },
)
Run them in layers. TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, and ConversationCoherence cover the supervisor and sub-agent baseline. The three dispatch rubrics cover the seams the baseline cannot see. Wire per-axis thresholds into the CI gate — a reasonable starting set is DispatchCorrectness >= 0.85, ScopeFidelity >= 0.90 per sub-agent type, ResultIntegration >= 0.80 — and bind the assertions to the supervisor system prompt version, the sub-agent definition versions, and the test set tag.
The CI gate fails on the failing axis. One bisect instead of three days. The LLM evaluation playbook covers threshold calibration; the agent passes evals fails production post covers the axis-blindness pattern these rubrics are designed to defeat.
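The "per sub-agent type" part of the ScopeFidelity threshold is easy to lose when scores are averaged globally. Below is a minimal sketch of a gate that averages ScopeFidelity per sub-agent type and the other two axes across the whole set, then reports the failing axis together with the versions it was bound to. The row shape, the binding keys, and the function names are assumptions for illustration, not an SDK API.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Starting thresholds from the text above.
THRESHOLDS = {"DispatchCorrectness": 0.85, "ScopeFidelity": 0.90, "ResultIntegration": 0.80}
PER_TYPE_AXES = {"ScopeFidelity"}  # axes gated per sub-agent type rather than globally

Row = Tuple[str, str, float]  # hypothetical shape: (subagent_type, axis, score)


def failing_axes(rows: Iterable[Row]) -> List[str]:
    """Return labels such as 'ScopeFidelity[research-agent]' for every failing bucket."""
    buckets: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for subagent_type, axis, score in rows:
        scope = subagent_type if axis in PER_TYPE_AXES else "*"
        buckets[(axis, scope)].append(score)
    return sorted(
        f"{axis}[{scope}]"
        for (axis, scope), scores in buckets.items()
        if axis in THRESHOLDS and mean(scores) < THRESHOLDS[axis]
    )


def gate_message(rows: Iterable[Row], binding: Dict[str, str]) -> str:
    """One-line gate result tagged with the prompt and test-set versions under test."""
    failed = failing_axes(rows)
    tag = " ".join(f"{key}={value}" for key, value in sorted(binding.items()))
    return f"PASS {tag}" if not failed else f"FAIL {', '.join(failed)} {tag}"
```

Printing the binding next to the failing axis is what makes the later bisect cheap: the message already says which prompt version and which test-set tag produced the drop.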
Building the test set from real production dispatches
Synthetic dispatches mislead. The supervisor LLM picks the Task tool based on the actual plan state in front of it, and that state is messy in ways a hand-written test case will not capture. The test set has to come from real production traces.
Forty to two hundred dispatches is enough to start. Pull a representative cross-section from the live traceAI stream, stratified by subagent_type so each sub-agent definition has at least eight to ten dispatches in the set. Annotate each one with four artifacts:
- Supervisor pre-dispatch context. The last two to three assistant turns plus the user input that led to the Task tool call. This is what DispatchCorrectness reads.
- The dispatch call itself. The subagent_type, prompt, and allowed_tools the supervisor passed to the Task tool. Verbatim, not paraphrased.
- The sub-agent’s full turn trace. Every assistant turn and tool call the child made. This is what ScopeFidelity scores against the dispatched prompt and tool subset.
- The supervisor’s next turn after the Task tool returned. The first assistant turn the supervisor produced after reading the child’s result. This is what ResultIntegration scores.
Wrap each as a TestCase, version the set in git alongside the supervisor and sub-agent system prompts, and tag dispatches with their failure mode when you find them so the next regression has a labelled example to bisect against. Keep sixty to seventy percent green dispatches in the set — the CI gate needs a passing baseline to detect regressions. If every case is a known failure, you cannot tell whether a prompt change made the supervisor better or just shifted the failure surface.
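The sampling rules in this section (stratify by subagent_type, keep a minimum per type, hold a green baseline) can be checked mechanically before the set is committed. The sketch below assumes each exported dispatch is a plain dict with a "subagent_type" key and a boolean "green" key; both keys and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def stratified_sample(dispatches: List[dict], per_type: int = 10, seed: int = 0) -> List[dict]:
    """Take up to `per_type` dispatches for each subagent_type, reproducibly."""
    rng = random.Random(seed)
    by_type: Dict[str, List[dict]] = defaultdict(list)
    for dispatch in dispatches:
        by_type[dispatch["subagent_type"]].append(dispatch)
    sample: List[dict] = []
    for subagent_type in sorted(by_type):
        pool = by_type[subagent_type]
        sample.extend(rng.sample(pool, min(per_type, len(pool))))
    return sample


def underfilled_types(sample: List[dict], minimum: int = 8) -> List[str]:
    """Sub-agent types that have fewer than `minimum` dispatches in the sample."""
    counts = Counter(d["subagent_type"] for d in sample)
    return sorted(t for t, n in counts.items() if n < minimum)


def green_fraction(sample: List[dict]) -> float:
    """Share of dispatches annotated as passing; the text suggests 0.6 to 0.7."""
    if not sample:
        return 0.0
    return sum(1 for d in sample if d["green"]) / len(sample)
```

A type reported by underfilled_types needs more traces pulled before its per-type ScopeFidelity average means much.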
Per-dispatch scoring in CI
The CI loop has one job: refuse to merge a prompt change that drops dispatch quality below the per-axis threshold on the test set. The Evaluator.evaluate call returns scores per template per case; the assertion is per axis, per sub-agent type, per test case.
from fi.evals import Evaluator
from fi.evals.templates import (
    TaskCompletion, EvaluateFunctionCalling,
    AnswerRefusal, ConversationCoherence,
)

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

# dispatch_correctness, scope_fidelity, and result_integration are the
# CustomLLMJudge instances defined in the rubric section above.
results = evaluator.evaluate(
    eval_templates=[
        TaskCompletion(), EvaluateFunctionCalling(),
        AnswerRefusal(), ConversationCoherence(),
        dispatch_correctness, scope_fidelity, result_integration,
    ],
    inputs=production_dispatch_test_set,  # list of TestCase, one per dispatch
)

THRESHOLDS = {
    "DispatchCorrectness": 0.85,
    "ScopeFidelity": 0.90,
    "ResultIntegration": 0.80,
    "TaskCompletion": 0.85,
}

# Collect every (case, axis) pair that scored below its threshold.
failures = [
    (r.case_id, m.name, m.value)
    for r in results.eval_results
    for m in r.metrics
    if (t := THRESHOLDS.get(m.name)) and m.value < t
]

assert not failures, f"Dispatch eval failed on {len(failures)} axis x case pairs"
Run this on every PR that touches the supervisor planner prompt, any sub-agent definition, or the supervisor’s allowed_tools surface. Run it on every model checkpoint bump. The failure report localises to a sub-agent type and an axis, which means the bisect is one prompt or one rubric — not one supervisor session.
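To get from a flat failure list to "one sub-agent type, one axis", group the failures before reading them. The sketch below takes tuples shaped like the failures list in the CI snippet, (case_id, axis, score), plus a case_id to sub-agent type lookup; that lookup and the function name are assumptions, since the post does not show how the test set stores the type.

```python
from collections import Counter
from typing import Dict, List, Tuple

Failure = Tuple[str, str, float]  # (case_id, axis, score)


def localise(
    failures: List[Failure], case_types: Dict[str, str]
) -> List[Tuple[Tuple[str, str], int]]:
    """Count failures per (subagent_type, axis), most frequent first."""
    counts = Counter(
        (case_types.get(case_id, "unknown"), axis) for case_id, axis, _ in failures
    )
    return counts.most_common()


# Made-up example data.
example_failures = [
    ("c1", "ScopeFidelity", 0.40),
    ("c2", "ScopeFidelity", 0.60),
    ("c3", "DispatchCorrectness", 0.50),
]
example_types = {"c1": "validator-agent", "c2": "validator-agent", "c3": "research-agent"}
report = localise(example_failures, example_types)
```

The first entry of the report is the prompt to open first: here, the validator-agent system prompt on the ScopeFidelity axis.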
The platform side carries a managed version of the same surface: the CI job posts the dispatch set as an Evaluator.submit async job, the platform runs the rubrics in parallel, and the result lands on a PR comment with the failing axis and a link to the failing case in the trace UI.
traceAI Claude instrumentor: per-dispatch span attribution
A dispatch rubric needs a dispatch span. The ClaudeAgentInstrumentor emits that span tree without code changes inside your agent definitions. It supports claude-agent-sdk >= 0.1.0, patches ClaudeSDKClient on import, and emits five span kinds: CONVERSATION for the supervisor session, ASSISTANT_TURN for each Claude turn, TOOL_EXECUTION for each tool call, MCP_TOOL for MCP server tool calls, and SUBAGENT for each Task tool dispatch.
pip install claude-agent-sdk traceAI-claude-agent-sdk \
    fi-instrumentation-otel ai-evaluation

import os

os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_claude_agent_sdk import ClaudeAgentInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="claude-supervisor-dispatch-eval",
)
ClaudeAgentInstrumentor().instrument(tracer_provider=trace_provider)
After this call, every Claude Agent SDK session emits the per-dispatch signal the rubrics need.
The SUBAGENT span carries claude_agent.subagent.type (which sub-agent definition was dispatched), claude_agent.subagent.prompt (the scoped dispatch prompt verbatim), claude_agent.subagent.tools (JSON list of the allowed tool subset), and claude_agent.parent_tool_use_id for nested dispatches. The SubagentTracker aggregates input tokens, output tokens, and cost.total_usd up the parent chain so the supervisor span carries an aggregated_cost_usd including every nested dispatch. The supervisor session itself gets a CONVERSATION span with the session id, system prompt, and allowed_tools.
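The roll-up idea is simple enough to reproduce offline when you want to sanity-check an aggregated cost. The following is a sketch of the general technique only, not the SubagentTracker implementation: it assumes exported spans are dicts with hypothetical "id", "parent_id", and "cost_usd" keys.

```python
from typing import Dict, List, Optional


def aggregate_costs(spans: List[dict]) -> Dict[str, float]:
    """Return span id -> own cost plus the cost of every descendant span."""
    parent: Dict[str, Optional[str]] = {s["id"]: s.get("parent_id") for s in spans}
    totals: Dict[str, float] = {s["id"]: 0.0 for s in spans}
    for span in spans:
        node: Optional[str] = span["id"]
        seen = set()
        # Walk up the parent chain, adding this span's cost to itself and each ancestor.
        # The `seen` set guards against a malformed chain that loops back on itself.
        while node is not None and node in totals and node not in seen:
            totals[node] += span["cost_usd"]
            seen.add(node)
            node = parent.get(node)
    return totals


# Made-up example: a supervisor, one dispatch, and one nested dispatch.
example_spans = [
    {"id": "supervisor", "parent_id": None, "cost_usd": 0.10},
    {"id": "dispatch-a", "parent_id": "supervisor", "cost_usd": 0.04},
    {"id": "dispatch-b", "parent_id": "dispatch-a", "cost_usd": 0.01},
]
totals = aggregate_costs(example_spans)
```

In this example the supervisor total covers its own cost plus both dispatches, which is the property the aggregated cost on the supervisor span is described as having.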
That topology is exactly what the three rubrics read. DispatchCorrectness reads the supervisor’s ASSISTANT_TURN immediately preceding a SUBAGENT span. ScopeFidelity reads the SUBAGENT span’s subagent.prompt plus subagent.tools and compares against the nested TOOL_EXECUTION spans. ResultIntegration reads the SUBAGENT span’s tool_output plus the supervisor’s ASSISTANT_TURN immediately following. A generic OTel tracer collapses the whole run into one chat span and loses this attribution, which is why a hand-rolled scorer over the chat transcript misses every dispatch defect. The Claude Code observability with OpenInference and OpenTelemetry post covers the cross-tool span topology.
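To show how the three rubric inputs fall out of the span tree, here is a sketch that walks a time-ordered export and builds one window per SUBAGENT span. The attribute keys claude_agent.subagent.type and claude_agent.subagent.tools are the ones named above; the dict layout ("id", "kind", "parent_id", "name", "attributes") and the assumption that a SUBAGENT span shares its parent with the supervisor turns around it are hypothetical, so adjust them to your actual export.

```python
import json
from typing import List, Optional


def _first_turn(spans: List[dict], level: str) -> Optional[str]:
    """Id of the first ASSISTANT_TURN in `spans` whose parent is `level`."""
    for s in spans:
        if s["kind"] == "ASSISTANT_TURN" and s["parent_id"] == level:
            return s["id"]
    return None


def dispatch_windows(spans: List[dict]) -> List[dict]:
    """For each SUBAGENT span, find the turns before and after and any off-subset tools."""
    windows = []
    for i, span in enumerate(spans):
        if span["kind"] != "SUBAGENT":
            continue
        attrs = span["attributes"]
        allowed = set(json.loads(attrs["claude_agent.subagent.tools"]))  # JSON list
        called = [
            s["name"] for s in spans
            if s["kind"] == "TOOL_EXECUTION" and s["parent_id"] == span["id"]
        ]
        windows.append({
            "subagent_type": attrs["claude_agent.subagent.type"],
            # Search backwards for the preceding turn, forwards for the following one.
            "turn_before": _first_turn(spans[:i][::-1], span["parent_id"]),
            "turn_after": _first_turn(spans[i + 1:], span["parent_id"]),
            "out_of_scope_tools": [t for t in called if t not in allowed],
        })
    return windows


# Made-up example trace.
example = [
    {"id": "t1", "kind": "ASSISTANT_TURN", "parent_id": "conv", "name": "plan"},
    {"id": "s1", "kind": "SUBAGENT", "parent_id": "conv", "name": "Task",
     "attributes": {"claude_agent.subagent.type": "refactor-agent",
                    "claude_agent.subagent.tools": '["Read", "Edit"]'}},
    {"id": "x1", "kind": "TOOL_EXECUTION", "parent_id": "s1", "name": "Read"},
    {"id": "x2", "kind": "TOOL_EXECUTION", "parent_id": "s1", "name": "Bash"},
    {"id": "t2", "kind": "ASSISTANT_TURN", "parent_id": "conv", "name": "integrate"},
]
windows = dispatch_windows(example)
```

The out_of_scope_tools list is a cheap deterministic pre-check; the ScopeFidelity judge still has to score work done outside the prompt's goal, which no tool list can show.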
Production observability and Error Feed clustering by dispatch failure
CI is necessary, not sufficient. A 100-dispatch regression set is a snapshot; production is a river. Score the live trace stream with the same three rubrics and you catch the regressions the offline set cannot have, because the offline set was frozen before users found the failure mode. EvalTag on the registered tracer attaches DispatchCorrectness, ScopeFidelity, and ResultIntegration to matching spans server-side, at zero inline latency to the supervisor.
Error Feed is the loop closer. Failing supervisor sessions flow into ClickHouse with their span embeddings; HDBSCAN soft-clustering groups them into named issues. The clusters that turn up most on Claude sub-agent stacks are dispatch-shaped:
- Wrong-dispatch clusters. The supervisor invokes research-agent when refactor-agent was right, or dispatches at all when an inline tool call would have been faster. The immediate_fix is usually a planner prompt edit that adds an explicit dispatch-or-inline criterion plus a one-shot of each branch.
- Scope-bleed clusters. The validator-agent starts drafting code, the refactor-agent proposes tests, or a sub-agent calls a tool outside its allowed_tools subset. The fix is a tighter role contract in the sub-agent system prompt plus a one-shot of the scope boundary.
- Integration-skip clusters. The supervisor receives a useful result and continues as if the dispatch never returned: it regenerates the work inline, ignores a returned constraint, or contradicts the child’s conclusion. The fix is a planner prompt change that forces a one-sentence acknowledgement of the return value before the next step.
Per cluster, a Claude Sonnet 4.5 JudgeAgent runs a 30-turn investigation across eight span-tools (with a Haiku Chauffeur for spans over 3000 characters; prompt-cache hit around 90 percent). The Judge writes three artifacts engineers read: a 5-category, 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, 1 to 5 each), and an immediate_fix naming the prompt edit to ship today. Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
The fix feeds the platform’s self-improving evaluators so the rubric sharpens on that mode next run. The cluster’s representative dispatches promote into the regression set. agent-opt then tunes the supervisor planner prompt and each sub-agent system prompt as separate study targets — BayesianSearchOptimizer or GEPAOptimizer works well, and per-prompt separation keeps a winning tweak on the refactor-agent from being masked by a flat planner. The automated optimization for agents post covers the optimizer mix. Agent Command Center adds the safety floor — MaxAgentDepth at the gateway caps how many nested dispatches a single request can fan out (default 10, configurable to 25), so a buggy planner hits the cap before the bill arrives.
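Dispatch depth can also be audited offline from the parent_tool_use_id chain, which is useful for choosing a cap before enforcing one. This sketch is not how the gateway implements MaxAgentDepth; it assumes a hypothetical mapping from each dispatch's tool-use id to its parent's id (None for a top-level dispatch).

```python
from typing import Dict, List, Optional


def dispatch_depth(tool_use_id: str, parent_of: Dict[str, Optional[str]]) -> int:
    """Depth of a dispatch: 1 for a top-level dispatch, plus 1 per nested parent."""
    depth = 0
    seen = set()
    node: Optional[str] = tool_use_id
    while node is not None:
        if node in seen:
            raise ValueError(f"cycle in parent chain at {node}")
        seen.add(node)
        depth += 1
        node = parent_of.get(node)
    return depth


def over_cap(parent_of: Dict[str, Optional[str]], cap: int = 10) -> List[str]:
    """Dispatch ids whose nesting depth exceeds `cap` (10 mirrors the stated default)."""
    return sorted(t for t in parent_of if dispatch_depth(t, parent_of) > cap)


# Made-up chain: a dispatches b, b dispatches c.
chain = {"a": None, "b": "a", "c": "b"}
```

Running over_cap against a week of traces with a few candidate caps shows how much legitimate traffic each cap would have cut off.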
Common Claude sub-agent eval anti-patterns
Four mistakes, each of which hides one or more of the failure modes above.
Scoring only the supervisor’s final answer. TaskCompletion on the supervisor’s last assistant message misses every dispatch defect that did not quite poison the final string. Score per-dispatch or the diagnostic that tells you which decision to fix never surfaces.
One rubric for supervisor and sub-agent. Supervisor and sub-agent are doing different jobs. Supervisor decides and integrates; sub-agent executes inside a scope. A single rubric blurs the two and produces nothing actionable. DispatchCorrectness is supervisor-side. ScopeFidelity is sub-agent-side. ResultIntegration is supervisor-side again. Keep them separated.
No SUBAGENT span, only a chat transcript. A generic OTel tracer collapses a supervisor session into one chat span. claude_agent.subagent.prompt and claude_agent.subagent.tools are gone. The three rubrics cannot run because there is no dispatch object to score against. Use the ClaudeAgentInstrumentor or build the same span shape manually before attempting per-dispatch eval.
Treating model checkpoint refreshes as silent. Refreshed claude-sonnet-4-5 checkpoints drift dispatch behaviour first and final-answer quality second; more helpful supervisors over-dispatch, more eager sub-agents drift scope. DispatchCorrectness and ScopeFidelity are the earliest indicators. Pin model versions in the supervisor and sub-agent configs, run the regression set on every refresh, and track per-axis trend lines per checkpoint.
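Tracking per-axis trend lines per checkpoint comes down to grouping scores by (checkpoint, axis) and comparing means. A minimal sketch, assuming score rows shaped as (checkpoint, axis, score); the row shape, the tolerance value, and the function names are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Row = Tuple[str, str, float]  # hypothetical shape: (checkpoint, axis, score)


def axis_means(rows: Sequence[Row]) -> Dict[str, Dict[str, float]]:
    """Mean score per axis for each checkpoint."""
    buckets: Dict[str, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for checkpoint, axis, score in rows:
        buckets[checkpoint][axis].append(score)
    return {c: {a: mean(v) for a, v in axes.items()} for c, axes in buckets.items()}


def regressed_axes(
    rows: Sequence[Row], baseline: str, candidate: str, tolerance: float = 0.02
) -> List[str]:
    """Axes whose mean dropped by more than `tolerance` from baseline to candidate."""
    means = axis_means(rows)
    base, cand = means.get(baseline, {}), means.get(candidate, {})
    return sorted(a for a in base if a in cand and base[a] - cand[a] > tolerance)
```

The tolerance absorbs judge noise between runs; set it from the spread you see when the same checkpoint is scored twice.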
How Future AGI ships the full Claude sub-agent eval stack
Four surfaces, one loop, no separate products to glue together.
ai-evaluation SDK (Apache 2.0) ships the Evaluator, 60-plus EvalTemplate classes (TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, ConversationCoherence, Groundedness, ChunkAttribution, 11 CustomerAgent* templates), the CustomLLMJudge that carries DispatchCorrectness, ScopeFidelity, and ResultIntegration, 13 guardrail backends, and distributed runners (Celery, Ray, Temporal, Kubernetes) for when the dispatch set outgrows one process.
traceAI (Apache 2.0) ships the ClaudeAgentInstrumentor with its five span kinds (CONVERSATION, ASSISTANT_TURN, TOOL_EXECUTION, SUBAGENT, MCP_TOOL), the SubagentTracker that rolls cost and tokens up the dispatch chain, 50-plus other AI surface instrumentors across Python and TypeScript, and the EvalTag mechanism that attaches a rubric to a span kind so evals run server-side without polling.
Future AGI Platform ships the self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the platform with the HDBSCAN clustering, the Sonnet 4.5 JudgeAgent, the 5-category 30-subtype taxonomy, the 4-D trace score, and the immediate_fix artifact.
Agent Command Center plus agent-opt close the safety and improvement loops. The gateway caps dispatch depth, surfaces per-call cost on response headers that traceAI rolls up the dispatch chain, and exposes guardrails for sub-agent output safety. agent-opt’s six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) consume the dispatch rubrics as the optimization objective.
Honest tradeoff: if your stack is one Claude session with a tool registry and zero Task tool calls, you do not need this. A TaskCompletion rubric and the base traceai-anthropic instrumentor cover it. The dispatch eval stack earns its weight the moment the supervisor starts fanning out work (real production traffic, multiple sub-agent definitions, dispatch depth greater than one), and the dispatch is the unit that decides whether the supervisor is doing real planning or busywork it did not need.
What to do this week
One supervisor config, end to end. Five steps.
- Wire ClaudeAgentInstrumentor().instrument(tracer_provider=trace_provider) into your existing Claude Agent SDK project. Verify CONVERSATION, ASSISTANT_TURN, TOOL_EXECUTION, and SUBAGENT spans land in traceAI with claude_agent.subagent.type, claude_agent.subagent.prompt, and claude_agent.subagent.tools populated.
- Pull a hundred real production dispatches across all sub-agent types. Annotate each with supervisor pre-dispatch context, the dispatch call, the sub-agent trace, and the supervisor’s post-dispatch turn. Skew toward the patterns that show up most in your traffic.
- Define DispatchCorrectness, ScopeFidelity, and ResultIntegration as CustomLLMJudge rubrics. Run them through Evaluator.evaluate alongside TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, and ConversationCoherence.
- Wire per-axis thresholds into CI. Start at DispatchCorrectness >= 0.85, ScopeFidelity >= 0.90 per sub-agent type, ResultIntegration >= 0.80. Tune as the dataset matures. Cap MaxAgentDepth at the gateway at 10.
- Turn on Error Feed. Watch the first week’s clusters. Promote each cluster’s representative dispatches into the regression set. Run a BayesianSearchOptimizer study on the highest-impact prompt: usually the supervisor planner if DispatchCorrectness is the failing axis, the sub-agent system prompt if ScopeFidelity is.
The teams shipping reliable Claude sub-agent stacks in 2026 stopped grading the supervisor’s final answer and started grading the dispatches. The Claude Agent SDK gives you the runtime. The eval stack gives you the per-dispatch signal that keeps the supervisor honest, one Task tool call at a time.
Related reading
The 2026 working pattern for AI agent evaluation. Six dimensions, six rubrics, a 4-D trajectory score, the CI gate that beats aggregate scoring, and the loop production needs.
Evaluating AutoGen agents in 2026: the handoff is the eval unit. Three failure modes, three rubrics, per-pair spans, and the production loop.
Evaluating Claude Code tool use in 2026: per-tool selection F1, argument fidelity, irreversibility awareness, recovery on error, on traceAI traces.