
Evaluating Claude Sub-Agents: The Dispatch Is the Unit (2026)

Evaluating Claude sub-agents in 2026: the dispatch is the eval unit. Three rubrics, per-dispatch scoring in CI, traceAI Task-tool spans, and the production loop.

May 3, 2026

Updated May 20, 2026

12 min read

claude-agent-sdk sub-agents agent-evaluation task-tool traceai dispatch-evaluation 2026


A Claude Code supervisor scores 0.92 on TaskCompletion against the final user answer. The refactor lands, the tests pass, the run closes green. Pull the trace: the supervisor fired the Task tool nine times. Four of those dispatches were to the wrong sub-agent type. Two sub-agents drifted out of scope and the supervisor stitched their drift into the final patch. One dispatch returned a useful diff the supervisor ignored and regenerated inline. Nothing in the final-answer rubric moved. Everything broke in the dispatches.

Sub-agent eval is not single-agent eval plus a tree. The unit is the dispatch: the supervisor LLM picks a sub-agent through the Task tool, hands it scope, and reads what comes back. Three things break independently: the dispatch itself, the sub-agent’s scope adherence, and the supervisor’s integration of the result. This post is the working pattern for evaluating Anthropic Claude Agent SDK sub-agents in 2026: the dispatch as the unit, the three rubrics that catch the three failures, the test set built from real production dispatches, the traceAI Claude instrumentor that emits the spans the rubrics need, and the Error Feed loop that clusters failures back to a supervisor or sub-agent prompt change.

Why Claude sub-agent eval differs

The Claude Agent SDK pattern is dispatch, not orchestration. A supervisor Claude session runs a plan. When a chunk of work is well-scoped, the supervisor calls the built-in Task tool with three parameters: a subagent_type (which sub-agent definition to instantiate, like research-agent or refactor-agent), a prompt that scopes the work, and optionally an allowed_tools subset. A new Claude session spins up with its own context window and that tool subset, runs to completion, and returns a single string result. The supervisor reads it and continues planning.

That dispatch shape is verified in the source. The traceAI Claude instrumentor at traceAI/python/frameworks/claude-agent-sdk/traceai_claude_agent_sdk/_subagent_tracker.py pulls subagent_type, description, prompt, and allowed_tools out of every Task tool invocation and emits them on a dedicated SUBAGENT span. The parent_tool_use_id chain reconstructs nesting when one sub-agent dispatches another.

This makes Claude sub-agents different from AutoGen group chats or LangGraph nodes. AutoGen passes a message between peers; LangGraph traverses an edge in a fixed graph. Claude’s supervisor makes a discrete choice at runtime (dispatch or not, which type, what scope) and then has to do something with the return value. The failure surface is the choice and the integration, not the path through a graph. Evaluating that surface needs rubrics that score the dispatch and the integration as first-class artifacts, not as side effects of the final answer. The post on evaluating LLM agent handoffs covers the shared spine across patterns; the definitive agent evaluation guide covers the broader axis taxonomy.

The dispatch as the unit of evaluation

A dispatch is a triple: the supervisor’s decision at the Task tool call, the sub-agent’s run inside the dispatched scope, and the supervisor’s next turn after the Task tool returns. The eval unit is (supervisor_decision, sub_agent_run, supervisor_integration), not the final assistant message and not the sub-agent’s output in isolation.

This reframes everything. The regression set is a list of real production dispatches with their expected sub-agent type, scope coverage, and supervisor follow-through. The CI gate runs assertions against per-dispatch scores, not against one supervisor-level number. Error Feed clusters failing supervisors by dispatch pattern (wrong type, scope bleed, integration skip), not by final-answer category. The optimizer tunes the supervisor planner prompt and each sub-agent system prompt as separate study targets because the failure attribution is per-dispatch.

A working definition: a dispatch is correct when the supervisor picked the right sub-agent type for the goal, the sub-agent stayed inside the dispatched prompt and tool subset, and the supervisor actually used the return value in its next plan step. A dispatch is broken when any one of those three fails. Three named failure modes, three dedicated rubrics, one localised diagnostic the moment a regression lands.
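The working definition above can be sketched as a small data structure. This is an illustrative sketch only: `DispatchVerdict` and its field names are hypothetical and are not part of the Claude Agent SDK or the ai-evaluation SDK. It assumes each of the three checks has already been reduced to a pass or fail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchVerdict:
    """Hypothetical pass/fail summary of one dispatch triple."""
    right_subagent_type: bool  # supervisor_decision
    stayed_in_scope: bool      # sub_agent_run
    result_was_used: bool      # supervisor_integration

    def failure_modes(self) -> list[str]:
        """Return the named failure modes; empty when the dispatch is correct."""
        modes = []
        if not self.right_subagent_type:
            modes.append("wrong_dispatch")
        if not self.stayed_in_scope:
            modes.append("scope_bleed")
        if not self.result_was_used:
            modes.append("integration_skip")
        return modes

    @property
    def is_correct(self) -> bool:
        # A dispatch is broken when any one of the three checks fails.
        return not self.failure_modes()


verdict = DispatchVerdict(
    right_subagent_type=True, stayed_in_scope=False, result_was_used=True
)
```

Here `verdict.is_correct` is false and `verdict.failure_modes()` names only the scope failure, which is the localised diagnostic the three rubrics are meant to give.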

Three rubrics: dispatch correctness, scope fidelity, result integration

Three rubrics, one per failure mode, all built on the CustomLLMJudge from the ai-evaluation SDK. The Future AGI Platform’s classifier-backed scoring runs these at lower per-eval cost than Galileo Luna-2 once the rubrics stabilise, but the SDK is the starting surface.

from fi.evals import Evaluator
from fi.evals.templates import (
    TaskCompletion, EvaluateFunctionCalling,  # alias: LLMFunctionCalling
    AnswerRefusal, ConversationCoherence,
)
from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

dispatch_correctness = CustomLLMJudge(
    name="DispatchCorrectness",
    rubric=(
        "Given the supervisor's plan state, the Task tool call "
        "(subagent_type, prompt, allowed_tools), and the available sub-agent "
        "catalog, score whether the supervisor picked the right type, scoped "
        "the prompt, gave the right tool subset, and chose dispatch over "
        "inline when dispatch was warranted. "
        "1.0 = correct. 0.5 = right type, weak scope or wrong tool subset. "
        "0.0 = wrong type, or dispatch where inline was right."
    ),
    input_mapping={
        "supervisor_plan_state": "supervisor_pre_dispatch_context",
        "dispatch_call": "task_tool_input",
        "available_subagents": "subagent_catalog",
    },
)

scope_fidelity = CustomLLMJudge(
    name="ScopeFidelity",
    rubric=(
        "Given the dispatched prompt, the dispatched allowed_tools, and the "
        "sub-agent's full turn trace, score whether the child stayed inside "
        "the dispatched scope. Penalise tools called outside the subset, work "
        "done outside the prompt's goal, fabricated context the dispatch did "
        "not supply, sibling-scope bleed. Score 0.0 to 1.0."
    ),
    input_mapping={
        "dispatch_prompt": "subagent_prompt",
        "dispatch_tools": "subagent_allowed_tools",
        "subagent_trace": "subagent_turn_transcript",
    },
)

result_integration = CustomLLMJudge(
    name="ResultIntegration",
    rubric=(
        "Given the sub-agent's return value and the supervisor's next turn "
        "after the Task tool returned, score whether the supervisor read the "
        "result, propagated its constraints, and let the result change the "
        "plan. Penalise: regenerates the work inline, ignores a returned "
        "constraint, contradicts the child's conclusion without justification. "
        # Explicit scale, matching the other two rubrics and the 0.80 CI threshold.
        "Score 0.0 to 1.0."
    ),
    input_mapping={
        "subagent_result": "task_tool_return_value",
        "supervisor_next_turn": "supervisor_post_dispatch_turn",
    },
)

Run them in layers. TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, and ConversationCoherence cover the supervisor and sub-agent baseline. The three dispatch rubrics cover the seams the baseline cannot see. Wire per-axis thresholds into the CI gate — a reasonable starting set is DispatchCorrectness >= 0.85, ScopeFidelity >= 0.90 per sub-agent type, ResultIntegration >= 0.80 — and bind the assertions to the supervisor system prompt version, the sub-agent definition versions, and the test set tag.
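One way to bind a score to the supervisor prompt version, the sub-agent definition versions, and the test set tag is to hash all three into a single fingerprint and store it next to the CI result. The sketch below uses only the standard library; `gate_fingerprint` and its arguments are hypothetical names, not an SDK feature.

```python
import hashlib
import json


def gate_fingerprint(supervisor_prompt: str,
                     subagent_prompts: dict[str, str],
                     test_set_tag: str) -> str:
    """Hash everything a threshold is bound to into one short identifier."""
    payload = json.dumps(
        {
            "supervisor": supervisor_prompt,
            "subagents": subagent_prompts,
            "test_set": test_set_tag,
        },
        sort_keys=True,  # stable ordering so the same config gives the same hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


base = gate_fingerprint("plan carefully", {"refactor-agent": "refactor only"}, "v1")
edited = gate_fingerprint("plan carefully", {"refactor-agent": "refactor and test"}, "v1")
```

Because `base` and `edited` differ, a score drop between the two runs can be attributed to the one sub-agent prompt that changed.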

The CI gate fails on the failing axis. One bisect instead of three days. The LLM evaluation playbook covers threshold calibration; the post on agents that pass evals but fail production covers the axis-blindness pattern these rubrics are designed to defeat.

Building the test set from real production dispatches

Synthetic dispatches mislead. The supervisor LLM picks the Task tool based on the actual plan state in front of it, and that state is messy in ways a hand-written test case will not capture. The test set has to come from real production traces.

Forty to two hundred dispatches is enough to start. Pull a representative cross-section from the live traceAI stream, stratified by subagent_type so each sub-agent definition has at least eight to ten dispatches in the set. Annotate each one with four artifacts:

Supervisor pre-dispatch context. The last two to three assistant turns plus the user input that led to the Task tool call. This is what DispatchCorrectness reads.
The dispatch call itself. The subagent_type, prompt, and allowed_tools the supervisor passed to the Task tool. Verbatim, not paraphrased.
The sub-agent’s full turn trace. Every assistant turn and tool call the child made. This is what ScopeFidelity scores against the dispatched prompt and tool subset.
The supervisor’s next turn after the Task tool returned. The first assistant turn the supervisor produced after reading the child’s result. This is what ResultIntegration scores.
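The four artifacts can be held in one record per dispatch. This is a minimal sketch: `AnnotatedDispatch` is a hypothetical container, not the SDK's TestCase, and its field names simply reuse the input_mapping values from the rubric definitions earlier. The helper flags any artifact left empty during annotation.

```python
from dataclasses import dataclass, asdict


@dataclass
class AnnotatedDispatch:
    """Hypothetical record holding the four annotated artifacts for one dispatch."""
    supervisor_pre_dispatch_context: str  # read by DispatchCorrectness
    task_tool_input: dict                 # subagent_type, prompt, allowed_tools, verbatim
    subagent_turn_transcript: list        # read by ScopeFidelity
    supervisor_post_dispatch_turn: str    # read by ResultIntegration


def missing_artifacts(record: AnnotatedDispatch) -> list[str]:
    """Name every artifact that is empty, so incomplete cases are caught early."""
    return [name for name, value in asdict(record).items() if not value]


record = AnnotatedDispatch(
    supervisor_pre_dispatch_context="User asked for a rename; plan step 2 of 4.",
    task_tool_input={
        "subagent_type": "refactor-agent",
        "prompt": "Rename load_cfg to load_config in src/.",
        "allowed_tools": ["Read", "Edit"],
    },
    subagent_turn_transcript=[],  # not yet annotated
    supervisor_post_dispatch_turn="Rename applied; moving on to tests.",
)
gaps = missing_artifacts(record)
```

A case with a non-empty `gaps` list cannot feed all three rubrics, so it is worth fixing before the set is versioned.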

Wrap each as a TestCase, version the set in git alongside the supervisor and sub-agent system prompts, and tag dispatches with their failure mode when you find them so the next regression has a labelled example to bisect against. Keep sixty to seventy percent green dispatches in the set — the CI gate needs a passing baseline to detect regressions. If every case is a known failure, you cannot tell whether a prompt change made the supervisor better or just shifted the failure surface.
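Stratifying by subagent_type and checking the share of green dispatches are both a few lines of standard-library Python. The sketch below assumes each dispatch is a plain dict with a `subagent_type` key and a `failure_mode` key that is `None` for green cases; that shape is an assumption for illustration, not an export format.

```python
import random
from collections import defaultdict


def stratified_sample(dispatches, per_type=8, seed=0):
    """Pick up to `per_type` dispatches for every subagent_type."""
    rng = random.Random(seed)  # fixed seed keeps the set reproducible
    by_type = defaultdict(list)
    for dispatch in dispatches:
        by_type[dispatch["subagent_type"]].append(dispatch)
    sample = []
    for subagent_type in sorted(by_type):
        pool = by_type[subagent_type]
        sample.extend(rng.sample(pool, min(per_type, len(pool))))
    return sample


def green_ratio(sample):
    """Fraction of dispatches that carry no failure-mode tag."""
    return sum(1 for d in sample if d.get("failure_mode") is None) / len(sample)


# Toy data: two sub-agent types, twelve dispatches each, one in three tagged as a failure.
dispatches = [
    {"subagent_type": t, "failure_mode": None if i % 3 else "scope_bleed"}
    for t in ("research-agent", "refactor-agent")
    for i in range(12)
]
sample = stratified_sample(dispatches)
ratio = green_ratio(sample)
```

If `ratio` falls well outside the sixty to seventy percent band, resample or add green cases before using the set as a gate.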

Per-dispatch scoring in CI

The CI loop has one job: refuse to merge a prompt change that drops dispatch quality below the per-axis threshold on the test set. The Evaluator.evaluate call returns scores per template per case; the assertion is per axis, per sub-agent type, per test case.

from fi.evals import Evaluator
from fi.evals.templates import (
    TaskCompletion, EvaluateFunctionCalling,
    AnswerRefusal, ConversationCoherence,
)

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

results = evaluator.evaluate(
    eval_templates=[
        TaskCompletion(), EvaluateFunctionCalling(),
        AnswerRefusal(), ConversationCoherence(),
        dispatch_correctness, scope_fidelity, result_integration,
    ],
    inputs=production_dispatch_test_set,  # list of TestCase, one per dispatch
)

THRESHOLDS = {
    "DispatchCorrectness": 0.85,
    "ScopeFidelity": 0.90,
    "ResultIntegration": 0.80,
    "TaskCompletion": 0.85,
}

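failures = [
    (r.case_id, m.name, m.value)
    for r in results.eval_results
    for m in r.metrics
    # Compare against None explicitly so a threshold of 0.0 is not skipped.
    if (t := THRESHOLDS.get(m.name)) is not None and m.value < t
]
# Include the failing (case, axis, score) triples so the CI log names them.
assert not failures, (
    f"Dispatch eval failed on {len(failures)} axis x case pairs: {failures}"
)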

Run this on every PR that touches the supervisor planner prompt, any sub-agent definition, or the supervisor’s allowed_tools surface. Run it on every model checkpoint bump. The failure report localises to a sub-agent type and an axis, which means the bisect is one prompt or one rubric — not one supervisor session.
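To get a report that localises to a sub-agent type and an axis, the per-case scores can be averaged per (sub-agent type, axis) pair before thresholding. The sketch below is independent of the SDK: it assumes the scores have already been flattened into `(subagent_type, axis, score)` tuples, which is a hypothetical shape chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def scores_by_type(rows):
    """Map (subagent_type, axis) to the mean score across its dispatches."""
    buckets = defaultdict(list)
    for subagent_type, axis, score in rows:
        buckets[(subagent_type, axis)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


def failing_axes(rows, thresholds):
    """Return the (subagent_type, axis) pairs whose mean is below threshold."""
    return sorted(
        key
        for key, average in scores_by_type(rows).items()
        if key[1] in thresholds and average < thresholds[key[1]]
    )


rows = [
    ("refactor-agent", "ScopeFidelity", 0.95),
    ("refactor-agent", "ScopeFidelity", 0.93),
    ("validator-agent", "ScopeFidelity", 0.70),
    ("validator-agent", "ScopeFidelity", 0.90),
    ("validator-agent", "ResultIntegration", 0.85),
]
report = failing_axes(rows, {"ScopeFidelity": 0.90, "ResultIntegration": 0.80})
```

In this toy data only the validator-agent's ScopeFidelity falls below its threshold, so the bisect starts at that one sub-agent prompt.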

The platform side carries a managed version of the same surface: the CI job posts the dispatch set as an Evaluator.submit async job, the platform runs the rubrics in parallel, and the result lands on a PR comment with the failing axis and a link to the failing case in the trace UI.

traceAI Claude instrumentor: per-dispatch span attribution

A dispatch rubric needs a dispatch span. The ClaudeAgentInstrumentor emits that span tree without code changes inside your agent definitions. It supports claude-agent-sdk >= 0.1.0, patches ClaudeSDKClient on import, and emits five span kinds: CONVERSATION for the supervisor session, ASSISTANT_TURN for each Claude turn, TOOL_EXECUTION for each tool call, MCP_TOOL for MCP server tool calls, and SUBAGENT for each Task tool dispatch.

pip install claude-agent-sdk traceAI-claude-agent-sdk \
    fi-instrumentation-otel ai-evaluation

import os
os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_claude_agent_sdk import ClaudeAgentInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="claude-supervisor-dispatch-eval",
)
ClaudeAgentInstrumentor().instrument(tracer_provider=trace_provider)

After this call, every Claude Agent SDK session emits the per-dispatch signal the rubrics need.

The SUBAGENT span carries claude_agent.subagent.type (which sub-agent definition was dispatched), claude_agent.subagent.prompt (the scoped dispatch prompt verbatim), claude_agent.subagent.tools (JSON list of the allowed tool subset), and claude_agent.parent_tool_use_id for nested dispatches. The SubagentTracker aggregates input tokens, output tokens, and cost.total_usd up the parent chain so the supervisor span carries an aggregated_cost_usd including every nested dispatch. The supervisor session itself gets a CONVERSATION span with the session id, system prompt, and allowed_tools.
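The cost rollup can be pictured as a walk up the parent chain. The following is a sketch of the idea, not the SubagentTracker's actual implementation: the `spans` dict shape and the `aggregate_costs` name are assumptions, and the parent links are assumed to be acyclic.

```python
def aggregate_costs(spans):
    """Return own cost plus all descendant cost for every span.

    spans: {span_id: {"parent": span_id or None, "cost_usd": float}}
    """
    totals = {span_id: span["cost_usd"] for span_id, span in spans.items()}
    for span in spans.values():
        parent = span["parent"]
        # Walk up the chain, adding this span's own cost to each ancestor.
        while parent is not None:
            totals[parent] += span["cost_usd"]
            parent = spans[parent]["parent"]
    return totals


spans = {
    "supervisor": {"parent": None, "cost_usd": 0.10},
    "research-1": {"parent": "supervisor", "cost_usd": 0.04},
    "refactor-1": {"parent": "supervisor", "cost_usd": 0.06},
    "validator-1": {"parent": "refactor-1", "cost_usd": 0.02},
}
totals = aggregate_costs(spans)
```

The supervisor total covers all four spans, while `refactor-1` carries its own cost plus the nested validator dispatch.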

That topology is exactly what the three rubrics read. DispatchCorrectness reads the supervisor’s ASSISTANT_TURN immediately preceding a SUBAGENT span. ScopeFidelity reads the SUBAGENT span’s subagent.prompt plus subagent.tools and compares against the nested TOOL_EXECUTION spans. ResultIntegration reads the SUBAGENT span’s tool_output plus the supervisor’s ASSISTANT_TURN immediately following. A generic OTel tracer collapses the whole run into one chat span and loses this attribution, which is why a hand-rolled scorer over the chat transcript misses every dispatch defect. The Claude Code observability with OpenInference and OpenTelemetry post covers the cross-tool span topology.
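The mapping from spans to rubric inputs can be sketched over a flat, time-ordered span list. The dict shape here (`id`, `parent`, `kind`) is a hypothetical simplification for illustration, and the sketch assumes a sub-agent's tool calls are direct children of its SUBAGENT span; a real export may nest them under the child's own turns.

```python
def dispatch_inputs(spans):
    """For each SUBAGENT span, collect the spans each rubric would read."""
    units = []
    for index, span in enumerate(spans):
        if span["kind"] != "SUBAGENT":
            continue

        def is_supervisor_turn(candidate, owner=span["parent"]):
            # Supervisor turns share the SUBAGENT span's parent.
            return candidate["kind"] == "ASSISTANT_TURN" and candidate["parent"] == owner

        before = [s for s in spans[:index] if is_supervisor_turn(s)]
        after = [s for s in spans[index + 1:] if is_supervisor_turn(s)]
        child_tools = [
            s["id"] for s in spans
            if s["kind"] == "TOOL_EXECUTION" and s["parent"] == span["id"]
        ]
        units.append({
            "dispatch": span["id"],                               # all three rubrics
            "pre_turn": before[-1]["id"] if before else None,     # DispatchCorrectness
            "child_tool_calls": child_tools,                      # ScopeFidelity
            "post_turn": after[0]["id"] if after else None,       # ResultIntegration
        })
    return units


spans = [
    {"id": "conv", "parent": None, "kind": "CONVERSATION"},
    {"id": "turn-1", "parent": "conv", "kind": "ASSISTANT_TURN"},
    {"id": "sub-1", "parent": "conv", "kind": "SUBAGENT"},
    {"id": "tool-1", "parent": "sub-1", "kind": "TOOL_EXECUTION"},
    {"id": "turn-2", "parent": "conv", "kind": "ASSISTANT_TURN"},
]
units = dispatch_inputs(spans)
```

Each entry in `units` is one dispatch triple: the turn before, the child's tool calls, and the turn after.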

Production observability and Error Feed clustering by dispatch failure

CI is necessary, not sufficient. A 100-dispatch regression set is a snapshot; production is a river. Score the live trace stream with the same three rubrics and you catch the regressions the offline set cannot have, because the offline set was frozen before users found the failure mode. EvalTag on the registered tracer attaches DispatchCorrectness, ScopeFidelity, and ResultIntegration to matching spans server-side, at zero inline latency to the supervisor.

Error Feed is the loop closer. Failing supervisor sessions flow into ClickHouse with their span embeddings; HDBSCAN soft-clustering groups them into named issues. The clusters that turn up most on Claude sub-agent stacks are dispatch-shaped:

Wrong-dispatch clusters. The supervisor invokes research-agent when refactor-agent was right, or dispatches at all when an inline tool call would have been faster. The immediate_fix is usually a planner prompt edit that adds an explicit dispatch-or-inline criterion plus a one-shot of each branch.
Scope-bleed clusters. The validator-agent starts drafting code, the refactor-agent proposes tests, or a sub-agent calls a tool outside its allowed_tools subset. The fix is a tighter role contract in the sub-agent system prompt plus a one-shot of the scope boundary.
Integration-skip clusters. The supervisor receives a useful result and continues as if the dispatch never returned — regenerates the work inline, ignores a returned constraint, or contradicts the child’s conclusion. The fix is a planner prompt change that forces a one-sentence acknowledgement of the return value before the next step.
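One slice of scope bleed, a sub-agent calling a tool outside its allowed_tools subset, does not need a judge at all. A deterministic pre-check along these lines can run before the ScopeFidelity rubric; the function name and the tool names in the example are illustrative assumptions.

```python
def out_of_scope_tools(allowed_tools, called_tools):
    """Return tools the sub-agent called that were not in its dispatched subset.

    Order of first appearance is preserved and duplicates are dropped.
    """
    allowed = set(allowed_tools)
    violations = []
    for tool in called_tools:
        if tool not in allowed and tool not in violations:
            violations.append(tool)
    return violations


violations = out_of_scope_tools(
    allowed_tools=["Read", "Grep"],
    called_tools=["Read", "Edit", "Grep", "Edit", "Bash"],
)
```

A non-empty `violations` list is already a scope-bleed finding; the LLM judge is still needed for the softer cases, such as work done outside the prompt's goal with permitted tools.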

Per cluster, a Claude Sonnet 4.5 JudgeAgent runs a 30-turn investigation across eight span-tools (with a Haiku Chauffeur for spans over 3000 characters; prompt-cache hit around 90 percent). The Judge writes three artifacts engineers read: a 5-category, 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, 1 to 5 each), and an immediate_fix naming the prompt edit to ship today. Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.

The fix feeds the platform’s self-improving evaluators so the rubric sharpens on that mode next run. The cluster’s representative dispatches promote into the regression set. agent-opt then tunes the supervisor planner prompt and each sub-agent system prompt as separate study targets — BayesianSearchOptimizer or GEPAOptimizer works well, and per-prompt separation keeps a winning tweak on the refactor-agent from being masked by a flat planner. The automated optimization for agents post covers the optimizer mix. Agent Command Center adds the safety floor — MaxAgentDepth at the gateway caps how many nested dispatches a single request can fan out (default 10, configurable to 25), so a buggy planner hits the cap before the bill arrives.
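Dispatch depth can also be checked offline from the parent_tool_use_id chain. The sketch below is not how the gateway enforces MaxAgentDepth; it is a hypothetical trace-side check that assumes a `{tool_use_id: parent_tool_use_id}` mapping with no cycles, where a dispatch made directly by the supervisor has depth 1.

```python
def dispatch_depth(parents, tool_use_id):
    """Count how many dispatches sit on the chain ending at `tool_use_id`."""
    depth = 0
    current = tool_use_id
    while current is not None:
        depth += 1
        current = parents[current]
    return depth


def exceeds_cap(parents, max_depth=10):
    """Return the dispatches nested deeper than `max_depth`."""
    return [tid for tid in parents if dispatch_depth(parents, tid) > max_depth]


# "a" was dispatched by the supervisor, "b" by "a", and "c" by "b".
parents = {"a": None, "b": "a", "c": "b"}
too_deep = exceeds_cap(parents, max_depth=2)
```

Running this over production traces shows how close real dispatch trees come to the cap before it is ever hit.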

Common Claude sub-agent eval anti-patterns

Four mistakes, each of which hides one or more of the failure modes above.

Scoring only the supervisor’s final answer. TaskCompletion on the supervisor’s last assistant message misses every dispatch defect that did not quite poison the final string. Score per-dispatch or the diagnostic that tells you which decision to fix never surfaces.

One rubric for supervisor and sub-agent. Supervisor and sub-agent are doing different jobs. Supervisor decides and integrates; sub-agent executes inside a scope. A single rubric blurs the two and produces nothing actionable. DispatchCorrectness is supervisor-side. ScopeFidelity is sub-agent-side. ResultIntegration is supervisor-side again. Keep them separated.

No SUBAGENT span, only a chat transcript. A generic OTel tracer collapses a supervisor session into one chat span. claude_agent.subagent.prompt and claude_agent.subagent.tools are gone. The three rubrics cannot run because there is no dispatch object to score against. Use the ClaudeAgentInstrumentor or build the same span shape manually before attempting per-dispatch eval.

Treating model checkpoint refreshes as silent. Refreshed claude-sonnet-4-5 checkpoints drift dispatch behaviour first and final-answer quality second; more helpful supervisors over-dispatch, more eager sub-agents drift scope. DispatchCorrectness and ScopeFidelity are the earliest indicators. Pin model versions in the supervisor and sub-agent configs, run the regression set on every refresh, and track per-axis trend lines per checkpoint.
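A per-axis trend between two checkpoints can be computed as a difference of means. This is a minimal sketch under assumed inputs: the `scores` layout and the checkpoint labels are hypothetical, and the numbers are toy values rather than measurements.

```python
from statistics import mean


def axis_deltas(scores, baseline, candidate):
    """Return candidate mean minus baseline mean for every shared axis.

    scores: {checkpoint: {axis: [per-dispatch scores]}}
    """
    return {
        axis: round(mean(scores[candidate][axis]) - mean(scores[baseline][axis]), 4)
        for axis in scores[baseline]
        if axis in scores[candidate]
    }


scores = {
    "checkpoint-a": {"DispatchCorrectness": [0.9, 0.9], "TaskCompletion": [0.9, 0.9]},
    "checkpoint-b": {"DispatchCorrectness": [0.8, 0.7], "TaskCompletion": [0.9, 0.9]},
}
deltas = axis_deltas(scores, "checkpoint-a", "checkpoint-b")
```

In this toy case DispatchCorrectness drops while TaskCompletion is flat, which is the pattern described above: the dispatch axes move before the final-answer axis does.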

How Future AGI ships the full Claude sub-agent eval stack

Four surfaces, one loop, no separate products to glue together.

ai-evaluation SDK (Apache 2.0) ships the Evaluator, 60-plus EvalTemplate classes (TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, ConversationCoherence, Groundedness, ChunkAttribution, 11 CustomerAgent* templates), the CustomLLMJudge that carries DispatchCorrectness, ScopeFidelity, and ResultIntegration, 13 guardrail backends, and distributed runners (Celery, Ray, Temporal, Kubernetes) for when the dispatch set outgrows one process.

traceAI (Apache 2.0) ships the ClaudeAgentInstrumentor with its five span kinds (CONVERSATION, ASSISTANT_TURN, TOOL_EXECUTION, SUBAGENT, MCP_TOOL), the SubagentTracker that rolls cost and tokens up the dispatch chain, 50-plus other AI surface instrumentors across Python and TypeScript, and the EvalTag mechanism that attaches a rubric to a span kind so evals run server-side without polling.

Future AGI Platform ships the self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the platform with the HDBSCAN clustering, the Sonnet 4.5 JudgeAgent, the 5-category 30-subtype taxonomy, the 4-D trace score, and the immediate_fix artifact.

Agent Command Center plus agent-opt close the safety and improvement loops. The gateway caps dispatch depth, surfaces per-call cost on response headers that traceAI rolls up the dispatch chain, and exposes guardrails for sub-agent output safety. agent-opt’s six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) consume the dispatch rubrics as the optimization objective.

Honest tradeoff: if your stack is one Claude session with a tool registry and zero Task tool calls, you do not need this. A TaskCompletion rubric and the base traceai-anthropic instrumentor cover it. The dispatch eval stack earns its weight the moment the supervisor starts fanning out work (real production traffic, multiple sub-agent definitions, dispatch depth greater than one), and the dispatch is the unit that decides whether the supervisor is doing real planning or busywork it did not need.

What to do this week

One supervisor config, end to end. Five steps.

1. Wire ClaudeAgentInstrumentor().instrument(tracer_provider=trace_provider) into your existing Claude Agent SDK project. Verify CONVERSATION, ASSISTANT_TURN, TOOL_EXECUTION, and SUBAGENT spans land in traceAI with claude_agent.subagent.type, claude_agent.subagent.prompt, and claude_agent.subagent.tools populated.
2. Pull a hundred real production dispatches across all sub-agent types. Annotate each with supervisor pre-dispatch context, the dispatch call, the sub-agent trace, and the supervisor’s post-dispatch turn. Skew toward the patterns that show up most in your traffic.
3. Define DispatchCorrectness, ScopeFidelity, and ResultIntegration as CustomLLMJudge rubrics. Run them through Evaluator.evaluate alongside TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, and ConversationCoherence.
4. Wire per-axis thresholds into CI. Start at DispatchCorrectness >= 0.85, ScopeFidelity >= 0.90 per sub-agent type, ResultIntegration >= 0.80. Tune as the dataset matures. Cap MaxAgentDepth at the gateway at 10.
5. Turn on Error Feed. Watch the first week’s clusters. Promote each cluster’s representative dispatches into the regression set. Run a BayesianSearchOptimizer study on the highest-impact prompt: usually the supervisor planner if DispatchCorrectness is the failing axis, the sub-agent system prompt if ScopeFidelity is.

The teams shipping reliable Claude sub-agent stacks in 2026 stopped grading the supervisor’s final answer and started grading the dispatches. The Claude Agent SDK gives you the runtime. The eval stack gives you the per-dispatch signal that keeps the supervisor honest, one Task tool call at a time.

Frequently asked questions

What is a Claude sub-agent in the Claude Agent SDK?

A Claude sub-agent is a scoped child agent the supervisor LLM invokes through the Task tool inside the Claude Agent SDK or Claude Code. The supervisor decides three things at dispatch time: which sub-agent type to call (research-agent, refactor-agent, validator-agent), the prompt that scopes the work, and the allowed_tools subset the child can use. The child runs to completion in its own context window with its own turn budget, then returns a single string result that the supervisor reads and acts on. The Task tool is the dispatch primitive. The traceAI ClaudeAgentInstrumentor confirms the shape: a SUBAGENT span per dispatch with claude_agent.subagent.type, claude_agent.subagent.tools, claude_agent.subagent.prompt, and a parent_tool_use_id chain for nesting. Each dispatch is a first-class unit of work with its own trace, its own cost, and its own correctness rubric.

Why is sub-agent evaluation different from chat or single-agent evaluation?

Chat eval scores one input-output pair against a rubric. Single-agent eval extends that to tool selection and argument extraction inside one actor. Sub-agent evaluation has to score a delegation, not a turn. The supervisor LLM made a decision (pick this sub-agent, scope the prompt this way, give it this tool subset) and then had to use whatever came back. Three things can break independently: the supervisor picks the wrong sub-agent type, the sub-agent ignores or exceeds the dispatched scope, and the supervisor fails to integrate the return value into the next planning step. A TaskCompletion score on the supervisor's final answer hides all three. The unit of evaluation is the dispatch triple (supervisor_decision, sub_agent_run, supervisor_integration).

What are the three rubrics for Claude sub-agent evaluation?

Dispatch correctness, scope fidelity, and result integration. DispatchCorrectness scores the supervisor's choice at the Task tool call: was this the right sub-agent type for the goal, was the prompt sufficient and well-scoped, was the allowed_tools subset right, or should the supervisor have handled the work inline. ScopeFidelity scores the sub-agent run: did the child stay inside the dispatched prompt and tool subset, or did it drift into adjacent work, call tools outside its subset, or fabricate context the dispatch never supplied. ResultIntegration scores the supervisor's next turn after the Task tool returns: did the supervisor read the child's result, propagate constraints from it, and let the result actually change the plan, or did it discard the work and continue as if the dispatch never happened. Each rubric maps to a different span in the traceAI tree and a different fix in the loop.

What does traceAI's ClaudeAgentInstrumentor capture that generic OTel tracers miss?

traceAI's ClaudeAgentInstrumentor (verified at traceAI/python/frameworks/claude-agent-sdk/traceai_claude_agent_sdk/_instrumentor.py) patches ClaudeSDKClient and emits the typed span tree the rubrics need. The SUBAGENT span kind carries claude_agent.subagent.type, claude_agent.subagent.description, claude_agent.subagent.prompt, claude_agent.subagent.tools (JSON list of the allowed tool subset), and claude_agent.parent_tool_use_id for nesting. The SubagentTracker rolls per-child usage_input_tokens, usage_output_tokens, and cost_total_usd up the parent chain so the supervisor span carries an aggregated_cost_usd that includes every nested dispatch. The CONVERSATION span at the root holds the supervisor session id, system prompt, allowed_tools, and num_turns. A generic OTel tracer collapses the run into one CHAT span with no parent_tool_use_id chain and no per-dispatch attribution, which removes exactly the topology that dispatch correctness, scope fidelity, and result integration need in order to score.

How do you wire the three rubrics into CI on per-dispatch scoring?

Build a test set of real production dispatches (forty to two hundred is enough to start), each annotated with the supervisor input, the dispatch decision the supervisor made, the sub-agent run that resulted, and the supervisor's next turn after the Task tool returned. Wrap each as a TestCase and run the three CustomLLMJudge rubrics through evaluator.evaluate alongside TaskCompletion, EvaluateFunctionCalling (alias LLMFunctionCalling), AnswerRefusal, and ConversationCoherence from the ai-evaluation SDK. Wire per-axis CI thresholds: DispatchCorrectness greater than or equal to 0.85, ScopeFidelity greater than or equal to 0.90 per sub-agent type, ResultIntegration greater than or equal to 0.80. Bind the thresholds to the supervisor system prompt version and the sub-agent definition versions so a regression localises to one prompt change. CI fails on the failing axis, not on one aggregate.

How does Error Feed cluster Claude sub-agent failures by dispatch defect?

Error Feed runs HDBSCAN soft-clustering over span attributes plus per-dispatch embeddings inside ClickHouse, then fires a Claude Sonnet 4.5 JudgeAgent on each cluster with a 30-turn budget, eight span-tools, and a Haiku Chauffeur for spans over 3000 characters. Prompt-cache hit ratio sits around 90 percent. On Claude sub-agent stacks the clusters are dispatch-shaped: wrong-dispatch clusters (the supervisor invokes research-agent when refactor-agent was right, or dispatches at all when inline would have been faster), scope-bleed clusters (the validator-agent starts drafting, the refactor-agent starts proposing tests), and integration-skip clusters (the supervisor receives a sub-agent result and continues as if it never returned). The Judge writes a 5-category 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, 1 to 5 each), and an immediate_fix naming the supervisor planner prompt edit, the sub-agent system prompt tighten, or the rubric calibration that should ship today. Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.

Where does Future AGI ship the full Claude sub-agent eval stack?

The eval stack ships as a package, not a single product. ai-evaluation SDK (Apache 2.0) ships 60-plus EvalTemplate classes including TaskCompletion, EvaluateFunctionCalling (alias LLMFunctionCalling), AnswerRefusal, ConversationCoherence, Groundedness, ChunkAttribution, plus the CustomLLMJudge that carries DispatchCorrectness, ScopeFidelity, and ResultIntegration. traceAI (Apache 2.0) ships the ClaudeAgentInstrumentor, which emits the CONVERSATION, ASSISTANT_TURN, TOOL_EXECUTION, MCP_TOOL, and SUBAGENT span kinds, alongside 50-plus other AI surface instrumentors in Python and TypeScript. The Future AGI Platform adds self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the stack as the clustering and what-to-fix layer. Agent Command Center caps MaxAgentDepth at the gateway so a buggy supervisor cannot fan out a runaway dispatch tree. agent-opt's six optimizers tune the supervisor planner prompt and each sub-agent system prompt as separate study targets.
