Evaluating LLM Agent Handoffs (2026)
Evaluating LLM agent handoffs in 2026: the handoff is the cross-framework eval unit. Four rubrics, per-handoff spans, CI gates, and Error Feed clustering.
A four-agent system passes CI at 0.91 TaskCompletion on the user-visible answer. Planner hands to researcher. Researcher returns a citation. Critic checks it. Writer drafts the response. A week later the same run scores 0.89, and the production trace tells a different story: the planner dispatched twice when inline was right, the researcher dropped two constraints, the critic’s flag got buried because the writer never read the return value, and a tool timeout on turn five was swallowed without a retry. Every per-agent rubric is green. Every seam is broken.
The handoff is the cross-framework eval unit. Every multi-agent system has them. Every framework calls them something different: Task in the Claude Agent SDK, handoff() in the OpenAI Agents SDK, an edge in LangGraph, delegation in CrewAI, a turn in AutoGen, an A2A message between services. The shape is identical, and the eval that catches the real failures lives at the transition, not on either side of it. This guide is the framework-agnostic methodology: the cross-framework mapping, the four rubrics that catch the real failures, the traceAI HANDOFF span the rubrics read, the CI gate, and the Error Feed loop that closes the iteration.
Handoffs across frameworks: one primitive, many names
Stop pretending these are different problems.
| Framework | What the handoff is called | Where the eval fires |
|---|---|---|
| Claude Agent SDK | Task tool dispatch into a sub-agent | SUBAGENT span: scope, allowed_tools, return value |
| OpenAI Agents SDK | handoff() primitive between agents | HANDOFF span: source, target, context |
| LangGraph | Edge between graph nodes | Edge span: from-node, to-node, channel state |
| CrewAI | Task delegation between agents | Delegation span: delegator, delegate, expected output |
| AutoGen | Turn in a group chat (RoundRobin, Selector, Swarm, MagenticOne) | Per-pair AGENT_RUN span: speaker, prior speaker, channel |
| Google ADK | Sub-agent invocation in SequentialAgent / ParallelAgent | Sub-agent span: parent, child, payload |
| A2A protocol | A2A message between services | A2A_CLIENT / A2A_SERVER span pair |
Seven surfaces. One primitive. A sender produces output, scope, and constraints; a receiver consumes them and returns a result; the sender reads the result and decides what to do next. The framework decides the syntax; the eval decides whether the unit is sound. A team that picks Claude sub-agents for one workflow and LangGraph for another should not be writing two evaluation systems. The definitive agent evaluation guide covers the broader axis taxonomy; the Claude sub-agents and AutoGen posts cover the framework-specific span shapes.
Why the handoff is the unit, and why per-agent rubrics miss it
A single agent is a function: input, system prompt, tools, output. Score it with TaskCompletion, LLMFunctionCalling, and a CustomLLMJudge and you’re done. A multi-agent system is not that. Each agent’s output becomes the next agent’s input through whatever channel the framework provides, the receiver only sees what the sender chose to pass, and the math compounds in the seams. A team where every agent scores 0.95 on per-turn quality can still ship the wrong answer 30 percent of the time if each handoff drops one constraint.
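A back-of-the-envelope sketch of how seam losses compound. It assumes each handoff independently preserves every constraint with the same probability; the independence assumption and the function name are illustrative and do not come from any SDK.

```python
def end_to_end_success(per_handoff_fidelity: float, num_handoffs: int) -> float:
    """Probability that every handoff in a chain preserves all constraints,
    assuming (hypothetically) that handoffs fail independently."""
    return per_handoff_fidelity ** num_handoffs


# Show how quickly a 0.95 per-handoff fidelity erodes along a chain.
for n in (1, 3, 5, 7):
    p = end_to_end_success(0.95, n)
    print(f"{n} handoffs at 0.95 each: {p:.3f} success, {1 - p:.1%} failure")
```

Under that assumption, a chain of seven handoffs at 0.95 each already fails roughly 30 percent of the time, even though every individual leg looks healthy.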
This is the axis-blindness pattern at the multi-agent level. Aggregate TaskCompletion on the final answer catches half the failures and tells you nothing about which handoff broke. A team-level number that moves is not a diagnostic. The diagnostic lives at the sender-receiver pair.
A working definition: a handoff is correct when the sender chose to dispatch (or not) for the right reason, the receiver stayed inside the dispatched scope, the sender used the return value in its next step, and any error on the round-trip got handled rather than swallowed. Four named failure modes. Four dedicated rubrics.
Four framework-agnostic rubrics
These four work across every framework in the table because they score the transition, not the syntax. Each maps to a different span attribute, threshold, and prompt to fix.
Rubric 1: dispatch correctness
Score the sender’s decision. Right moment to dispatch, or should the sender have answered inline? Right receiver, or did a research agent get a refactor task? Tight enough scope, or vague enough that the receiver had to guess? Right tool subset, or did the sender over-grant capability?
The label sits on the dispatch call itself: pre-handoff state plus call plus receiver catalog. Penalise over-dispatch more heavily than under-dispatch when the receiver is expensive; flip the asymmetry when the receiver is cheap and running the sender inline blows the latency budget.
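One way to encode that asymmetry is a small deterministic mapping from the judge's label to a score. This is a sketch only: the label names, the penalty weights, and the single `receiver_is_expensive` flag (which folds the latency-budget condition into one boolean) are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchPenalty:
    """Hypothetical penalty weights for the two dispatch mistakes."""
    over_dispatch: float   # dispatched when inline was right
    under_dispatch: float  # answered inline when dispatch was right


def penalty_for(receiver_is_expensive: bool) -> DispatchPenalty:
    # Expensive receiver: a wasted dispatch hurts more.
    # Cheap receiver: a missed dispatch (slow inline work) hurts more.
    if receiver_is_expensive:
        return DispatchPenalty(over_dispatch=1.0, under_dispatch=0.5)
    return DispatchPenalty(over_dispatch=0.5, under_dispatch=1.0)


def dispatch_score(label: str, receiver_is_expensive: bool) -> float:
    """Map a judge label to a 0..1 score using the asymmetric penalties."""
    weights = penalty_for(receiver_is_expensive)
    if label == "correct":
        return 1.0
    if label == "over_dispatch":
        return 1.0 - weights.over_dispatch
    if label == "under_dispatch":
        return 1.0 - weights.under_dispatch
    raise ValueError(f"unknown label: {label}")
```

Keeping the weights in data rather than in the judge prompt lets the asymmetry be tuned per receiver without re-calibrating the rubric text.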
Rubric 2: scope fidelity
Score the receiver against the dispatched scope. Stay inside the prompt and tool subset, or drift? Call a tool outside the allowed set? Fabricate context the sender never supplied (a “we already agreed on X” reference to a turn that does not exist)? Planner starts drafting, critic starts planning, validator starts coding?
Score per receiver type. A refactor-agent that drifts into design has a different fix than a research-agent that drifts into critique. This is the rubric most sensitive to model checkpoint refreshes: a more helpful checkpoint drifts more aggressively, and ScopeFidelity drops before final-answer quality does.
Rubric 3: result integration
Score the sender’s next step after the receiver returns. Did it read the return value, propagate constraints, and let it change the plan? Or continue as if the dispatch never happened: regenerate the work inline, ignore a returned constraint, contradict the receiver’s conclusion without justification?
Most teams miss this one because it shows up as wasted work, not a wrong answer. A research agent returns a useful artifact. The supervisor ignores it and re-derives it inline. The user-visible answer is fine. The cost graph is double and nothing in TaskCompletion surfaces it.
Rubric 4: recovery on error
Score what happens when the receiver fails, times out, or returns an error. Did the sender retry with a tighter scope, fall back, escalate, or silently continue with a missing artifact? A handoff that fires and fails without structured recovery is worse than no handoff at all: the sender is now planning against state that does not exist.
Inputs: pre-handoff state, receiver error (timeout, tool error, guardrail block, refusal), sender’s next action. Score retries that loop on the same scope as failures unless the underlying error is genuinely transient. Score swallowed errors as zero. The agent failure modes catalog covers the error taxonomy this reads against.
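The scoring rules above are mechanical enough to run as a deterministic pre-check before any LLM judge. A minimal sketch, assuming hypothetical string labels for error kinds and sender actions, and assuming only timeouts count as transient:

```python
from typing import Optional

# Assumption: which error kinds are treated as genuinely transient.
TRANSIENT_ERRORS = {"timeout"}


def recovery_score(
    error_kind: Optional[str],
    next_action: str,
    retried_same_scope: bool = False,
) -> Optional[float]:
    """Return None when no error fired, 0.0 for a failed recovery,
    and 1.0 for a structured recovery."""
    if error_kind is None:
        return None  # the rubric does not apply to this handoff
    if next_action == "continue":
        return 0.0  # swallowed error
    if next_action == "retry" and retried_same_scope:
        # Looping on the same scope only passes for transient errors.
        return 1.0 if error_kind in TRANSIENT_ERRORS else 0.0
    if next_action in {"retry", "fallback", "escalate"}:
        return 1.0
    raise ValueError(f"unknown next_action: {next_action}")
```

A pre-check like this leaves the judge to score only the ambiguous cases, such as whether a tightened retry scope was actually tighter.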
Per-handoff scoring: the traceAI HANDOFF span
A handoff rubric needs a handoff span. traceAI emits one regardless of the framework underneath. The per-framework instrumentors (AutogenInstrumentor, ClaudeAgentInstrumentor, LangGraphInstrumentor, CrewAIInstrumentor, OpenAIAgentsInstrumentor) all land on the same HANDOFF span kind with the same attribute schema. The rubric reads the span; the framework is invisible to it.
pip install ai-evaluation fi-instrumentation-otel
# plus the framework instrumentor you need, e.g.:
pip install traceAI-autogen traceAI-claude-agent-sdk traceAI-langgraph

import os

os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_autogen import AutogenInstrumentor  # swap per framework

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="multi-agent-handoff-eval",
)
AutogenInstrumentor().instrument(tracer_provider=trace_provider)
After registration, every handoff emits a span with the attributes the four rubrics read directly: handoff.source and handoff.target (sender, receiver), handoff.scope_prompt (verbatim scope), handoff.allowed_tools (tool subset), handoff.context_summary (state summary), handoff.reason (dispatch trigger), handoff.return_value (receiver result), handoff.error (structured error when one fires), handoff.parent_id (parent handoff for nesting).
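Before running any rubric, it helps to confirm those attributes are actually populated. The check below is a sketch over a plain dictionary of span attributes; how you export spans into dictionaries is left open, and the sample values are made up for illustration.

```python
# Attributes every HANDOFF span needs before the rubrics can score it.
# handoff.error and handoff.parent_id are omitted because they are
# only present on failing or nested handoffs.
REQUIRED_HANDOFF_ATTRS = (
    "handoff.source",
    "handoff.target",
    "handoff.scope_prompt",
    "handoff.allowed_tools",
    "handoff.return_value",
)


def missing_handoff_attrs(span_attributes: dict) -> list[str]:
    """Return the required attribute names that are absent or empty."""
    return [
        key
        for key in REQUIRED_HANDOFF_ATTRS
        if span_attributes.get(key) in (None, "", [])
    ]


# Hypothetical span attributes, as they might look after export.
example_span = {
    "handoff.source": "planner",
    "handoff.target": "research-agent",
    "handoff.scope_prompt": "Find two primary sources published after 2023.",
    "handoff.allowed_tools": ["web_search"],
    "handoff.return_value": "",
}
print(missing_handoff_attrs(example_span))
```

In this example the only gap is the empty `handoff.return_value`, which is exactly the attribute ResultIntegration depends on.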
Tool calls land as TOOL_CALL spans nested under the owning agent’s AGENT span, so per-receiver tool-use scoring is a filter, not a parse. Cost and latency roll up from the underlying LLM spans. A generic OTel tracer collapses the run into one conversation span and handoff.scope_prompt is gone; the four rubrics cannot run, which is why a hand-rolled scorer over the chat transcript misses every handoff defect.
Build each rubric as a CustomLLMJudge and run all four alongside the SDK templates that cover the per-leg baseline. The pattern is the same for each (two shown below, the other two follow identically):
from fi.evals import Evaluator
from fi.evals.templates import (
    TaskCompletion, LLMFunctionCalling,
    AnswerRefusal, ConversationCoherence,
)
from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

dispatch_correctness = CustomLLMJudge(
    name="DispatchCorrectness",
    rubric=(
        "Given the sender's pre-handoff state, the handoff call (target, "
        "scope_prompt, allowed_tools), and the receiver catalog, score whether "
        "the sender picked the right receiver, scoped the prompt, and chose "
        "to dispatch rather than answer inline."
    ),
    input_mapping={
        "sender_state": "handoff.sender_pre_state",
        "dispatch_call": "handoff.call",
        "receiver_catalog": "handoff.receiver_catalog",
    },
)

scope_fidelity = CustomLLMJudge(
    name="ScopeFidelity",
    rubric=(
        "Given the dispatched scope_prompt, allowed_tools, and the receiver's "
        "full turn trace, score whether the receiver stayed inside scope. "
        "Penalise tools outside the subset, work outside the scope, fabricated "
        "context the dispatch did not supply."
    ),
    input_mapping={
        "scope_prompt": "handoff.scope_prompt",
        "allowed_tools": "handoff.allowed_tools",
        "receiver_trace": "handoff.receiver_trace",
    },
)

# result_integration and recovery_on_error follow the same pattern,
# reading handoff.return_value / handoff.sender_post_turn and
# handoff.error / handoff.sender_post_turn respectively.
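For completeness, here is one possible shape for the two rubrics the closing comment leaves out. The definitions are kept as plain keyword dictionaries so the snippet runs without the SDK installed; the rubric wording and the left-hand mapping keys are assumptions, while the span attribute names on the right come from that comment.

```python
RESULT_INTEGRATION_KWARGS = {
    "name": "ResultIntegration",
    "rubric": (
        "Given the receiver's return value and the sender's next turn, score "
        "whether the sender read the result, propagated returned constraints, "
        "and let it change the plan. Penalise re-deriving the work inline or "
        "contradicting the result without justification."
    ),
    "input_mapping": {
        "return_value": "handoff.return_value",
        "sender_post_turn": "handoff.sender_post_turn",
    },
}

RECOVERY_ON_ERROR_KWARGS = {
    "name": "RecoveryOnError",
    "rubric": (
        "Given the receiver error and the sender's next turn, score whether "
        "the sender retried with a tighter scope, fell back, or escalated. "
        "Score swallowed errors as zero."
    ),
    "input_mapping": {
        "error": "handoff.error",
        "sender_post_turn": "handoff.sender_post_turn",
    },
}

# With the SDK installed, these would be passed to the same constructor
# used for the two judges above (not executed here):
# result_integration = CustomLLMJudge(**RESULT_INTEGRATION_KWARGS)
# recovery_on_error = CustomLLMJudge(**RECOVERY_ON_ERROR_KWARGS)
```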
Wire the handoff suite into a CI gate
CI does one job: refuse to merge a prompt change that drops handoff quality below the per-axis threshold. Thresholds apply per axis, per receiver type, and per case, not as one aggregate.
results = evaluator.evaluate(
    eval_templates=[
        TaskCompletion(), LLMFunctionCalling(),
        AnswerRefusal(), ConversationCoherence(),
        dispatch_correctness, scope_fidelity,
        result_integration, recovery_on_error,
    ],
    inputs=production_handoff_regression_set,  # list of TestCase, one per handoff
)

THRESHOLDS = {
    "DispatchCorrectness": 0.85,
    "ScopeFidelity": 0.90,
    "ResultIntegration": 0.80,
    "RecoveryOnError": 0.85,
    "TaskCompletion": 0.85,
}

failures = [
    (r.case_id, m.name, m.value)
    for r in results.eval_results
    for m in r.metrics
    if (t := THRESHOLDS.get(m.name)) and m.value < t
]
assert not failures, f"Handoff eval failed on {len(failures)} axis x case pairs"
Run this on every PR that touches a sender prompt, a receiver definition, or routing logic. Run it on every model checkpoint bump. The failure report localises to a receiver type and an axis, so the bisect is one prompt or one rubric, not one team session.
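The gate above keys thresholds by axis only. A per-receiver-type variant could look like the sketch below, which works on plain dictionaries rather than SDK result objects. The row shape, the override table, and its 0.92 value are hypothetical.

```python
from typing import Optional

DEFAULT_THRESHOLDS = {
    "DispatchCorrectness": 0.85,
    "ScopeFidelity": 0.90,
    "ResultIntegration": 0.80,
    "RecoveryOnError": 0.85,
}

# Hypothetical overrides keyed by (receiver_type, axis).
RECEIVER_OVERRIDES = {("research-agent", "ScopeFidelity"): 0.92}


def threshold_for(axis: str, receiver_type: str) -> Optional[float]:
    """Prefer a receiver-specific threshold, else the per-axis default."""
    return RECEIVER_OVERRIDES.get((receiver_type, axis), DEFAULT_THRESHOLDS.get(axis))


def find_failures(rows: list[dict]) -> list[tuple]:
    """rows: dicts with case_id, receiver_type, axis, and value keys."""
    failures = []
    for row in rows:
        limit = threshold_for(row["axis"], row["receiver_type"])
        if limit is not None and row["value"] < limit:
            failures.append(
                (row["receiver_type"], row["axis"], row["case_id"], row["value"])
            )
    # Sorting groups the report by receiver type, then axis.
    return sorted(failures)
```

Sorting by receiver type and axis keeps the failure report aligned with the bisect described above: one receiver, one axis, a short list of cases.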
Build the regression set from real production handoffs. Synthetic handoffs mislead because the sender picks the receiver based on actual upstream state, and that state is messy in ways a hand-written test case will not capture. Forty to two hundred is enough to start. Stratify by sender-receiver pair (eight to ten per pair minimum) and keep sixty to seventy percent green cases; the gate needs a passing baseline to detect regressions. The LLM evaluation playbook covers the calibration cadence.
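The stratification can be sketched with the standard library. The record shape (`source`, `target`, `passed`) and the default of eight cases per pair at a 65 percent green share are assumptions chosen to sit inside the ranges above.

```python
import random
from collections import defaultdict


def stratified_sample(handoffs, per_pair=8, green_fraction=0.65, seed=0):
    """Sample up to per_pair handoffs for each sender-receiver pair,
    keeping roughly green_fraction of them passing ("green") cases."""
    rng = random.Random(seed)
    by_pair = defaultdict(lambda: {"green": [], "red": []})
    for h in handoffs:
        bucket = "green" if h["passed"] else "red"
        by_pair[(h["source"], h["target"])][bucket].append(h)

    n_green = round(per_pair * green_fraction)
    n_red = per_pair - n_green
    sample = []
    for pair in sorted(by_pair):
        buckets = by_pair[pair]
        # Take fewer if a pair does not have enough cases of one kind.
        sample += rng.sample(buckets["green"], min(n_green, len(buckets["green"])))
        sample += rng.sample(buckets["red"], min(n_red, len(buckets["red"])))
    return sample
```

Pairs that cannot fill their quota come back short rather than padded, which makes thin coverage visible instead of hiding it behind duplicates.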
Production observability and Error Feed clustering by handoff failure
CI is necessary, not sufficient. A 200-handoff regression set is a snapshot; production is a river. Score the live trace stream with the same four rubrics and you catch the regressions the offline set cannot, because the offline set was frozen before users found the failure mode. EvalTag on the registered tracer attaches the four rubrics to matching HANDOFF spans server-side, at zero inline latency.
Error Feed sits inside the eval stack as the loop closer. Failing handoffs flow into ClickHouse with their span embeddings; HDBSCAN soft-clustering groups them into named issues. The clusters on multi-agent stacks are handoff-shaped:
- Dropped-constraint clusters. The receiver loses a constraint type at one handoff (budget caps, date ranges, exclusion lists). immediate_fix: tighter sender prompt that emits constraints in a JSON block.
- Wrong-dispatch clusters. The sender invokes research-agent when refactor-agent was right, or dispatches when inline would have been faster. Fix: planner edit adding an explicit dispatch-or-inline criterion.
- Scope-bleed clusters. A validator starts drafting code, a refactor agent proposes tests. Fix: tighter receiver scope plus a one-shot of the boundary.
- Integration-skip clusters. The sender ignores the receiver’s result and continues inline. Fix: a sender prompt change that forces a one-sentence acknowledgement of the return value.
- Recovery-blind clusters. A receiver times out and the sender continues with a missing artifact. Fix: explicit recovery branch in the sender prompt plus an Agent Command Center retry policy at the gateway.
Per cluster, a Claude Sonnet 4.5 JudgeAgent runs a 30-turn investigation across eight span-tools (with a Haiku Chauffeur for spans over 3000 characters; prompt-cache hit around 90 percent). The Judge writes three artifacts engineers read: a 5-category, 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, 1 to 5 each), and an immediate_fix naming the prompt edit to ship today. Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
The fix feeds the Platform’s self-improving evaluators so the rubric sharpens next run. Representative handoffs promote into the regression set. agent-opt’s six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) then tune the sender prompt and each receiver definition as separate study targets. Per-prompt separation keeps a winning tweak on one receiver from being masked by a flat sender. The automated optimization for agents post covers the optimizer mix.
Four anti-patterns that hide handoff failures
Scoring only the final answer. TaskCompletion on the user-visible response misses every handoff defect that did not quite poison the final string. Score per-handoff or the diagnostic never surfaces.
One rubric across sender and receiver. They’re doing different jobs. Sender decides and integrates; receiver executes inside a scope. A single rubric blurs them and produces nothing actionable. Keep DispatchCorrectness and ResultIntegration on the sender, ScopeFidelity on the receiver, RecoveryOnError on the sender’s error path.
No HANDOFF span, only a chat transcript. A generic OTel tracer collapses the run into one conversation span. handoff.scope_prompt, handoff.allowed_tools, and handoff.return_value are gone. Use the framework instrumentor or build the same span shape manually before attempting per-handoff eval.
Treating model checkpoint refreshes as silent. Refreshed checkpoints drift dispatch behaviour first and final-answer quality second; helpful senders over-dispatch, eager receivers drift scope. DispatchCorrectness and ScopeFidelity are the earliest indicators. Pin model versions, run the regression set on every refresh, track per-axis trend lines per checkpoint.
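Tracking per-axis trends across checkpoints can start as a simple mean comparison. The sketch below assumes scores are already grouped by axis for a baseline and a candidate checkpoint; the 0.02 tolerance and the function names are hypothetical.

```python
from statistics import mean


def axis_deltas(baseline: dict, candidate: dict) -> dict:
    """Mean score change per axis (candidate minus baseline)."""
    return {
        axis: mean(candidate[axis]) - mean(baseline[axis])
        for axis in baseline
        if axis in candidate
    }


def regressed_axes(baseline: dict, candidate: dict, tolerance: float = 0.02) -> list[str]:
    """Axes whose mean dropped by more than the tolerance."""
    deltas = axis_deltas(baseline, candidate)
    return sorted(axis for axis, delta in deltas.items() if delta < -tolerance)
```

Run against the regression set on every refresh, a check like this flags a DispatchCorrectness or ScopeFidelity drop even while TaskCompletion still looks flat.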
How Future AGI ships the full handoff eval stack
Four surfaces, one loop, no separate products to glue together.
ai-evaluation SDK (Apache 2.0) ships the Evaluator, 60-plus EvalTemplate classes (TaskCompletion, LLMFunctionCalling, AnswerRefusal, ConversationCoherence, Groundedness, ContextAdherence, ChunkAttribution, 11 CustomerAgent* templates), the CustomLLMJudge that carries the four handoff rubrics, 13 guardrail backends, and distributed runners (Celery, Ray, Temporal, Kubernetes).
traceAI (Apache 2.0) ships the HANDOFF span kind plus the framework instrumentors (AutogenInstrumentor, ClaudeAgentInstrumentor, LangGraphInstrumentor, CrewAIInstrumentor, OpenAIAgentsInstrumentor, ADKInstrumentor) so the same rubrics run across every framework. 50-plus AI surface instrumentors across Python, TypeScript, Java, and C#. EvalTag attaches a rubric to a span kind so evals run server-side without polling.
Future AGI Platform adds self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the platform with HDBSCAN clustering, the Sonnet 4.5 JudgeAgent, the 5-category 30-subtype taxonomy, the 4-D trace score, and the immediate_fix artifact.
Agent Command Center plus agent-opt close the safety and improvement loops. The gateway caps handoff depth (MaxAgentDepth default 10), surfaces per-call cost on response headers that traceAI rolls up across the handoff tree, and exposes 18+ built-in guardrail scanners plus 15 third-party adapters. agent-opt’s six optimizers consume the four handoff rubrics as the optimization objective. SOC 2 Type II, HIPAA, GDPR, and CCPA certified per futureagi.com/trust.
Honest tradeoff: if your stack is a single agent with zero handoffs, you do not need this. A TaskCompletion rubric and the base instrumentor cover it. The handoff eval stack earns its weight the moment a second agent enters the loop, and more so with three-plus agents, mixed models, dynamic routing, or production traffic. From that point on, the handoff is the unit that decides whether the team ships.
What to do this week
One multi-agent config, end to end. Five steps.
- Wire the framework-specific traceAI instrumentor into your existing project. Verify per-agent AGENT spans, TOOL_CALL spans, and HANDOFF spans with handoff.source, handoff.target, handoff.scope_prompt, handoff.allowed_tools, and handoff.return_value populated.
- Pull 100 real production handoffs across all sender-receiver pairs. Annotate each with pre-handoff state, dispatch call, receiver trace, sender post-return turn, and expected behaviour on errors.
- Define DispatchCorrectness, ScopeFidelity, ResultIntegration, and RecoveryOnError as CustomLLMJudge rubrics. Run them through Evaluator.evaluate alongside TaskCompletion, LLMFunctionCalling, AnswerRefusal, and ConversationCoherence.
- Wire per-axis thresholds into CI. Start at DispatchCorrectness >= 0.85, ScopeFidelity >= 0.90 per receiver type, ResultIntegration >= 0.80, RecoveryOnError >= 0.85. Cap MaxAgentDepth at the gateway at 10.
- Turn on Error Feed. Watch the first week’s clusters. Promote representatives into the regression set. Run a BayesianSearchOptimizer study on the highest-impact prompt: sender for DispatchCorrectness, receiver for ScopeFidelity, sender error branch for RecoveryOnError.
The teams shipping reliable multi-agent systems in 2026 stopped grading the final answer and started grading the seams. The framework gives you orchestration. The eval stack gives you the per-handoff signal that keeps the team honest, one transition at a time.