
Evaluating LLM Agent Handoffs (2026)

Evaluating LLM agent handoffs in 2026: the handoff is the cross-framework eval unit. Four rubrics, per-handoff spans, CI gates, and Error Feed clustering.


A four-agent system passes CI at 0.91 TaskCompletion on the user-visible answer. Planner hands to researcher. Researcher returns a citation. Critic checks it. Writer drafts the response. A week later the same run scores 0.89, and the production trace tells a different story: the planner dispatched twice when inline was right, the researcher dropped two constraints, the critic’s flag got buried because the writer never read the return value, and a tool timeout on turn five was swallowed without a retry. Every per-agent rubric is green. Every seam is broken.

The handoff is the cross-framework eval unit. Every multi-agent system has them. Every framework calls them something different: Task in the Claude Agent SDK, handoff() in the OpenAI Agents SDK, an edge in LangGraph, delegation in CrewAI, a turn in AutoGen, an A2A message between services. The shape is identical, and the eval that catches the real failures lives at the transition, not on either side of it. This guide is the framework-agnostic methodology: the cross-framework mapping, the four rubrics that catch the real failures, the traceAI HANDOFF span the rubrics read, the CI gate, and the Error Feed loop that closes the iteration.

Handoffs across frameworks: one primitive, many names

Stop pretending these are different problems.

| Framework | What the handoff is called | Where the eval fires |
| --- | --- | --- |
| Claude Agent SDK | Task tool dispatch into a sub-agent | SUBAGENT span: scope, allowed_tools, return value |
| OpenAI Agents SDK | handoff() primitive between agents | HANDOFF span: source, target, context |
| LangGraph | Edge between graph nodes | Edge span: from-node, to-node, channel state |
| CrewAI | Task delegation between agents | Delegation span: delegator, delegate, expected output |
| AutoGen | Turn in a group chat (RoundRobin, Selector, Swarm, MagenticOne) | Per-pair AGENT_RUN span: speaker, prior speaker, channel |
| Google ADK | Sub-agent invocation in SequentialAgent / ParallelAgent | Sub-agent span: parent, child, payload |
| A2A protocol | A2A message between services | A2A_CLIENT / A2A_SERVER span pair |

Seven surfaces. One primitive. A sender produces output, scope, and constraints; a receiver consumes them and returns a result; the sender reads the result and decides what to do next. The framework decides the syntax; the eval decides whether the unit is sound. A team that picks Claude sub-agents for one workflow and LangGraph for another should not be writing two evaluation systems. The definitive agent evaluation guide covers the broader axis taxonomy; the Claude sub-agents and AutoGen posts cover the framework-specific span shapes.

Why the handoff is the unit, and why per-agent rubrics miss it

A single agent is a function: input, system prompt, tools, output. Score it with TaskCompletion, LLMFunctionCalling, and a CustomLLMJudge and you’re done. A multi-agent system is not that. Each agent’s output becomes the next agent’s input through whatever channel the framework provides, the receiver only sees what the sender chose to pass, and the math compounds in the seams. A team where every agent scores 0.95 on per-turn quality can still ship the wrong answer 30 percent of the time if each handoff drops one constraint.
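
To make the compounding concrete, here is a minimal sketch. It assumes failures are independent and that the team is a linear chain, and the 0.95 handoff fidelity is an illustrative number rather than a measurement. `end_to_end_success` is a hypothetical helper, not part of any SDK.

```python
def end_to_end_success(agent_quality: float, handoff_fidelity: float,
                       n_agents: int) -> float:
    """Probability the whole chain is right, assuming independent failures.

    A linear chain of n_agents has n_agents - 1 handoffs between them.
    """
    n_handoffs = n_agents - 1
    return (agent_quality ** n_agents) * (handoff_fidelity ** n_handoffs)


# Four agents at 0.95 each, with a hypothetical 0.95 chance that each
# handoff passes every constraint through intact.
success = end_to_end_success(agent_quality=0.95, handoff_fidelity=0.95, n_agents=4)
failure_rate = 1.0 - success
print(f"end-to-end failure rate: {failure_rate:.1%}")
```

Under those assumptions the chain fails roughly 30 percent of the time, even though no single agent looks weak in isolation.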

This is the axis-blindness pattern at the multi-agent level. Aggregate TaskCompletion on the final answer catches only the failures that surface in the user-visible output and tells you nothing about which handoff broke. A team-level number that moves is not a diagnostic. The diagnostic lives at the sender-receiver pair.

A working definition: a handoff is correct when the sender chose to dispatch (or not) for the right reason, the receiver stayed inside the dispatched scope, the sender used the return value in its next step, and any error on the round-trip got handled rather than swallowed. Four named failure modes. Four dedicated rubrics.

Four framework-agnostic rubrics

These four work across every framework in the table because they score the transition, not the syntax. Each maps to a different span attribute, threshold, and prompt to fix.

Rubric 1: dispatch correctness

Score the sender’s decision. Right moment to dispatch, or should the sender have answered inline? Right receiver, or did a research agent get a refactor task? Tight enough scope, or vague enough that the receiver had to guess? Right tool subset, or did the sender over-grant capability?

The label sits on the dispatch call itself: pre-handoff state plus call plus receiver catalog. Penalise over-dispatch more heavily than under-dispatch when the receiver is expensive; flip the asymmetry when the receiver is cheap and running the sender inline blows the latency budget.
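
The asymmetry can be sketched as a small penalty function. This is an illustration only: `dispatch_penalty` and the 1.0 / 0.5 weights are hypothetical, and a real rubric would likely calibrate them against receiver cost and the latency budget.

```python
def dispatch_penalty(dispatched: bool, should_dispatch: bool,
                     receiver_is_expensive: bool) -> float:
    """Return a penalty in [0, 1] for one dispatch decision; 0 means correct."""
    if dispatched == should_dispatch:
        return 0.0
    over_dispatch = dispatched and not should_dispatch
    # Hypothetical weights: the heavier penalty goes to the costlier mistake.
    heavy, light = 1.0, 0.5
    if receiver_is_expensive:
        # Expensive receiver: a needless dispatch is the worse error.
        return heavy if over_dispatch else light
    # Cheap receiver: staying inline and blowing the latency budget is worse.
    return light if over_dispatch else heavy


print(dispatch_penalty(dispatched=True, should_dispatch=False,
                       receiver_is_expensive=True))
```

Keeping the weights in one place makes the flip between expensive and cheap receivers an explicit, reviewable choice.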

Rubric 2: scope fidelity

Score the receiver against the dispatched scope. Stay inside the prompt and tool subset, or drift? Call a tool outside the allowed set? Fabricate context the sender never supplied (a “we already agreed on X” reference to a turn that does not exist)? Planner starts drafting, critic starts planning, validator starts coding?

Score per receiver type. A refactor-agent that drifts into design has a different fix than a research-agent that drifts into critique. This is the rubric most sensitive to model checkpoint refreshes: a more helpful checkpoint drifts more aggressively, and ScopeFidelity drops before final-answer quality does.
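
One slice of scope fidelity, tools called outside the allowed set, does not need a judge at all. A minimal deterministic pre-check might look like the sketch below. The function name and the tool names are hypothetical, and the check covers only tool use, not scope drift or fabricated context.

```python
from typing import Iterable, List


def out_of_scope_tools(allowed_tools: Iterable[str],
                       called_tools: Iterable[str]) -> List[str]:
    """Return each tool the receiver called that was not in the allowed set.

    Violations are reported once each, in first-call order.
    """
    allowed = set(allowed_tools)
    seen = set()
    violations = []
    for name in called_tools:
        if name not in allowed and name not in seen:
            violations.append(name)
            seen.add(name)
    return violations


violations = out_of_scope_tools(
    allowed_tools=["web_search", "fetch_url"],
    called_tools=["web_search", "write_file", "fetch_url", "write_file"],
)
print(violations)  # ['write_file']
```

A non-empty result can fail the case outright, leaving the judge to score only the subtler drift.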

Rubric 3: result integration

Score the sender’s next step after the receiver returns. Did it read the return value, propagate constraints, and let it change the plan? Or continue as if the dispatch never happened: regenerate the work inline, ignore a returned constraint, contradict the receiver’s conclusion without justification?

Most teams miss this one because it shows up as wasted work, not a wrong answer. A research agent returns a useful artifact. The supervisor ignores it and re-derives it inline. The user-visible answer is fine. The cost graph is double and nothing in TaskCompletion surfaces it.
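
A crude pre-filter can flag obvious integration skips before the judge runs. The sketch below assumes the receiver's constraints are available as short strings and checks whether each one appears in the sender's next turn. Both the helper name and that input shape are assumptions, and a substring match will miss paraphrases, so treat it as a triage signal rather than a score.

```python
from typing import List


def unpropagated_constraints(returned_constraints: List[str],
                             sender_next_turn: str) -> List[str]:
    """Return the constraints that never show up in the sender's next turn."""
    text = sender_next_turn.casefold()
    return [c for c in returned_constraints if c.casefold() not in text]


missing = unpropagated_constraints(
    returned_constraints=["budget under $500", "before 2026-03-01"],
    sender_next_turn="Drafting options with a budget under $500 for the trip.",
)
print(missing)  # ['before 2026-03-01']
```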

Rubric 4: recovery on error

Score what happens when the receiver fails, times out, or returns an error. Did the sender retry with a tighter scope, fall back, escalate, or silently continue with a missing artifact? A handoff that fires and fails without structured recovery is worse than no handoff at all: the sender is now planning against state that does not exist.

Inputs: pre-handoff state, receiver error (timeout, tool error, guardrail block, refusal), sender’s next action. Score retries that loop on the same scope as failures unless the underlying error is genuinely transient. Score swallowed errors as zero. The agent failure modes catalog covers the error taxonomy this reads against.
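
The two hard rules above (swallowed errors score zero, same-scope retries fail unless the error is transient) can be written down as a rule table. This is a sketch under assumptions: the `NextAction` labels and the binary 0/1 scores are hypothetical, and labelling the sender's next action is itself a judgment the rubric still has to make.

```python
from enum import Enum


class NextAction(Enum):
    RETRY_SAME_SCOPE = "retry_same_scope"
    RETRY_TIGHTER_SCOPE = "retry_tighter_scope"
    FALLBACK = "fallback"
    ESCALATE = "escalate"
    CONTINUE = "continue"  # carried on as if the receiver had returned


def recovery_score(next_action: NextAction, error_is_transient: bool) -> float:
    """Score the sender's reaction to a receiver error on a 0-to-1 scale."""
    if next_action is NextAction.CONTINUE:
        return 0.0  # swallowed error
    if next_action is NextAction.RETRY_SAME_SCOPE:
        # Retrying unchanged only makes sense when the error may clear itself.
        return 1.0 if error_is_transient else 0.0
    # Tighter retry, fallback, and escalation all count as structured recovery.
    return 1.0


print(recovery_score(NextAction.RETRY_SAME_SCOPE, error_is_transient=False))
```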

Per-handoff scoring: the traceAI HANDOFF span

A handoff rubric needs a handoff span. traceAI emits one regardless of the framework underneath. The per-framework instrumentors (AutogenInstrumentor, ClaudeAgentInstrumentor, LangGraphInstrumentor, CrewAIInstrumentor, OpenAIAgentsInstrumentor) all land on the same HANDOFF span kind with the same attribute schema. The rubric reads the span; the framework is invisible to it.

pip install ai-evaluation fi-instrumentation-otel
# plus the framework instrumentor you need, e.g.:
pip install traceAI-autogen traceAI-claude-agent-sdk traceAI-langgraph

import os
os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_autogen import AutogenInstrumentor  # swap per framework

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="multi-agent-handoff-eval",
)
AutogenInstrumentor().instrument(tracer_provider=trace_provider)

After registration, every handoff emits a span with the attributes the four rubrics read directly: handoff.source and handoff.target (sender, receiver), handoff.scope_prompt (verbatim scope), handoff.allowed_tools (tool subset), handoff.context_summary (state summary), handoff.reason (dispatch trigger), handoff.return_value (receiver result), handoff.error (structured error when one fires), handoff.parent_id (parent handoff for nesting).
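
Before running any rubric it is worth checking that the attributes are actually populated. The sketch below works on a plain dict of span attributes. Which attributes count as required is an assumption here, and `missing_handoff_attrs` is a hypothetical helper rather than a traceAI function.

```python
from typing import Any, Dict, List

# Assumed minimum set: the attributes the scope and integration rubrics read.
REQUIRED_ATTRS = (
    "handoff.source",
    "handoff.target",
    "handoff.scope_prompt",
    "handoff.allowed_tools",
    "handoff.return_value",
)


def missing_handoff_attrs(attributes: Dict[str, Any]) -> List[str]:
    """Return required attribute names that are absent, None, or empty strings.

    An empty allowed_tools list is treated as populated: it is a valid
    "no tools granted" scope, not a missing value.
    """
    missing = []
    for key in REQUIRED_ATTRS:
        value = attributes.get(key)
        if value is None or value == "":
            missing.append(key)
    return missing


span_attributes = {
    "handoff.source": "planner",
    "handoff.target": "researcher",
    "handoff.scope_prompt": "Find two primary sources for the claim.",
    "handoff.allowed_tools": [],
}
print(missing_handoff_attrs(span_attributes))  # ['handoff.return_value']
```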

Tool calls land as TOOL_CALL spans nested under the owning agent’s AGENT span, so per-receiver tool-use scoring is a filter, not a parse. Cost and latency roll up from the underlying LLM spans. A generic OTel tracer collapses the run into one conversation span and handoff.scope_prompt is gone; the four rubrics cannot run, which is why a hand-rolled scorer over the chat transcript misses every handoff defect.
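
"A filter, not a parse" can be read literally. The sketch below assumes spans have been exported as dicts with `kind`, `parent_id`, and `name` keys. That export shape is hypothetical and will differ depending on how the trace is pulled.

```python
from typing import Any, Dict, List


def tool_calls_for_agent(spans: List[Dict[str, Any]],
                         agent_span_id: str) -> List[str]:
    """Return the names of TOOL_CALL spans nested directly under one AGENT span."""
    return [
        span["name"]
        for span in spans
        if span["kind"] == "TOOL_CALL" and span["parent_id"] == agent_span_id
    ]


spans = [
    {"span_id": "a1", "kind": "AGENT", "parent_id": None, "name": "researcher"},
    {"span_id": "t1", "kind": "TOOL_CALL", "parent_id": "a1", "name": "web_search"},
    {"span_id": "a2", "kind": "AGENT", "parent_id": None, "name": "writer"},
    {"span_id": "t2", "kind": "TOOL_CALL", "parent_id": "a2", "name": "write_file"},
]
print(tool_calls_for_agent(spans, "a1"))  # ['web_search']
```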

Build each rubric as a CustomLLMJudge and run all four alongside the SDK templates that cover the per-leg baseline. The pattern is the same for all four:

from fi.evals import Evaluator
from fi.evals.templates import (
    TaskCompletion, LLMFunctionCalling,
    AnswerRefusal, ConversationCoherence,
)
from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

dispatch_correctness = CustomLLMJudge(
    name="DispatchCorrectness",
    rubric=(
        "Given the sender's pre-handoff state, the handoff call (target, "
        "scope_prompt, allowed_tools), and the receiver catalog, score whether "
        "the sender picked the right receiver, scoped the prompt, and chose "
        "to dispatch rather than answer inline."
    ),
    input_mapping={
        "sender_state": "handoff.sender_pre_state",
        "dispatch_call": "handoff.call",
        "receiver_catalog": "handoff.receiver_catalog",
    },
)

scope_fidelity = CustomLLMJudge(
    name="ScopeFidelity",
    rubric=(
        "Given the dispatched scope_prompt, allowed_tools, and the receiver's "
        "full turn trace, score whether the receiver stayed inside scope. "
        "Penalise tools outside the subset, work outside the scope, fabricated "
        "context the dispatch did not supply."
    ),
    input_mapping={
        "scope_prompt": "handoff.scope_prompt",
        "allowed_tools": "handoff.allowed_tools",
        "receiver_trace": "handoff.receiver_trace",
    },
)

# The remaining two judges read the sender's turn after the receiver returns.
result_integration = CustomLLMJudge(
    name="ResultIntegration",
    rubric=(
        "Given the receiver's return value and the sender's next turn, score "
        "whether the sender read the result, propagated its constraints, and "
        "let it change the plan. Penalise re-deriving the work inline, "
        "ignoring a returned constraint, or contradicting the receiver "
        "without justification."
    ),
    input_mapping={
        "return_value": "handoff.return_value",
        "sender_post_turn": "handoff.sender_post_turn",
    },
)

recovery_on_error = CustomLLMJudge(
    name="RecoveryOnError",
    rubric=(
        "Given the receiver's error and the sender's next turn, score whether "
        "the sender retried with a tighter scope, fell back, or escalated. "
        "Score a swallowed error as zero, and score a retry on the same scope "
        "as a failure unless the error is transient."
    ),
    input_mapping={
        "error": "handoff.error",
        "sender_post_turn": "handoff.sender_post_turn",
    },
)

Wire the handoff suite into a CI gate

CI does one job: refuse to merge a prompt change that drops handoff quality below the per-axis threshold. Threshold is per axis, per receiver type, per case, not one aggregate.

results = evaluator.evaluate(
    eval_templates=[
        TaskCompletion(), LLMFunctionCalling(),
        AnswerRefusal(), ConversationCoherence(),
        dispatch_correctness, scope_fidelity,
        result_integration, recovery_on_error,
    ],
    inputs=production_handoff_regression_set,  # list of TestCase, one per handoff
)

THRESHOLDS = {
    "DispatchCorrectness": 0.85,
    "ScopeFidelity": 0.90,
    "ResultIntegration": 0.80,
    "RecoveryOnError": 0.85,
    "TaskCompletion": 0.85,
}

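# Collect every (case, axis, score) that falls below its axis threshold.
# Axes without an entry in THRESHOLDS are reported but never gate the merge.
failures = [
    (r.case_id, m.name, m.value)
    for r in results.eval_results
    for m in r.metrics
    if (t := THRESHOLDS.get(m.name)) is not None and m.value < t
]
assert not failures, (
    f"Handoff eval failed on {len(failures)} axis x case pairs: {failures}"
)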

Run this on every PR that touches a sender prompt, a receiver definition, or routing logic. Run it on every model checkpoint bump. The failure report localises to a receiver type and an axis, so the bisect is one prompt or one rubric, not one team session.
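
Localising a failure to a receiver type and an axis implies thresholds keyed on both. One way to express that lookup is sketched below over plain tuples rather than the SDK's result objects. The `Score` record, the override table, and the receiver names are assumptions for illustration.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Score(NamedTuple):
    case_id: str
    receiver_type: str
    axis: str
    value: float


AXIS_DEFAULTS: Dict[str, float] = {
    "DispatchCorrectness": 0.85,
    "ScopeFidelity": 0.90,
    "ResultIntegration": 0.80,
    "RecoveryOnError": 0.85,
}
# Hypothetical override: hold one receiver type to a stricter bar.
OVERRIDES: Dict[Tuple[str, str], float] = {
    ("ScopeFidelity", "research-agent"): 0.95,
}


def threshold_for(axis: str, receiver_type: str) -> Optional[float]:
    """Prefer the (axis, receiver type) override, then the axis default."""
    return OVERRIDES.get((axis, receiver_type), AXIS_DEFAULTS.get(axis))


def gate(scores: List[Score]) -> List[Score]:
    """Return the scores that fall below their threshold; ungated axes pass."""
    failures = []
    for score in scores:
        threshold = threshold_for(score.axis, score.receiver_type)
        if threshold is not None and score.value < threshold:
            failures.append(score)
    return failures


scores = [
    Score("case-1", "research-agent", "ScopeFidelity", 0.93),
    Score("case-2", "refactor-agent", "ScopeFidelity", 0.93),
]
print([s.case_id for s in gate(scores)])  # ['case-1']
```

The same 0.93 passes for one receiver type and fails for the other, which is the point of keying the threshold on the pair.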

Build the regression set from real production handoffs. Synthetic handoffs mislead because the sender picks the receiver based on actual upstream state, and that state is messy in ways a hand-written test case will not capture. Forty to two hundred is enough to start. Stratify by sender-receiver pair (eight to ten per pair minimum) and keep sixty to seventy percent green cases; the gate needs a passing baseline to detect regressions. The LLM evaluation playbook covers the calibration cadence.
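
The stratification rules are easy to check mechanically before the set is frozen. The sketch below assumes each case is a dict with `source`, `target`, and `expected_pass` keys. Those field names are hypothetical, and the bounds mirror the numbers in the paragraph above.

```python
from collections import Counter
from typing import Any, Dict, List


def audit_regression_set(cases: List[Dict[str, Any]],
                         min_per_pair: int = 8,
                         green_low: float = 0.60,
                         green_high: float = 0.70) -> Dict[str, Any]:
    """Report under-sampled sender-receiver pairs and the share of green cases."""
    pair_counts = Counter((c["source"], c["target"]) for c in cases)
    thin_pairs = sorted(p for p, n in pair_counts.items() if n < min_per_pair)
    green = sum(1 for c in cases if c["expected_pass"])
    green_fraction = green / len(cases) if cases else 0.0
    return {
        "thin_pairs": thin_pairs,
        "green_fraction": green_fraction,
        "green_in_range": green_low <= green_fraction <= green_high,
    }


cases = (
    [{"source": "planner", "target": "researcher", "expected_pass": i < 7}
     for i in range(10)]
    + [{"source": "critic", "target": "writer", "expected_pass": i < 2}
       for i in range(3)]
)
report = audit_regression_set(cases)
print(report["thin_pairs"])  # [('critic', 'writer')]
```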

Production observability and Error Feed clustering by handoff failure

CI is necessary, not sufficient. A 200-handoff regression set is a snapshot; production is a river. Score the live trace stream with the same four rubrics and you catch the regressions the offline set cannot, because the offline set was frozen before users found the failure mode. EvalTag on the registered tracer attaches the four rubrics to matching HANDOFF spans server-side, so scoring adds no latency to the request path.

Error Feed sits inside the eval stack as the loop closer. Failing handoffs flow into ClickHouse with their span embeddings; HDBSCAN soft-clustering groups them into named issues. The clusters on multi-agent stacks are handoff-shaped:

  • Dropped-constraint clusters. The receiver loses a constraint type at one handoff (budget caps, date ranges, exclusion lists). Fix: tighter sender prompt that emits constraints in a JSON block.
  • Wrong-dispatch clusters. The sender invokes research-agent when refactor-agent was right, or dispatches when inline would have been faster. Fix: planner edit adding an explicit dispatch-or-inline criterion.
  • Scope-bleed clusters. A validator starts drafting code, a refactor agent proposes tests. Fix: tighter receiver scope plus a one-shot of the boundary.
  • Integration-skip clusters. The sender ignores the receiver’s result and continues inline. Fix: a sender prompt change that forces a one-sentence acknowledgement of the return value.
  • Recovery-blind clusters. A receiver times out and the sender continues with a missing artifact. Fix: explicit recovery branch in the sender prompt plus an Agent Command Center retry policy at the gateway.

Per cluster, a Claude Sonnet 4.5 JudgeAgent runs a 30-turn investigation across eight span-tools (with a Haiku Chauffeur for spans over 3000 characters; prompt-cache hit around 90 percent). The Judge writes three artifacts engineers read: a 5-category, 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, 1 to 5 each), and an immediate_fix naming the prompt edit to ship today. Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.

The fix feeds the Platform’s self-improving evaluators so the rubric sharpens next run. Representative handoffs promote into the regression set. agent-opt’s six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) then tune the sender prompt and each receiver definition as separate study targets. Per-prompt separation keeps a winning tweak on one receiver from being masked by a flat sender. The automated optimization for agents post covers the optimizer mix.

Four anti-patterns that hide handoff failures

Scoring only the final answer. TaskCompletion on the user-visible response misses every handoff defect that did not quite poison the final string. Score per-handoff or the diagnostic never surfaces.

One rubric across sender and receiver. They’re doing different jobs. Sender decides and integrates; receiver executes inside a scope. A single rubric blurs them and produces nothing actionable. Keep DispatchCorrectness and ResultIntegration on the sender, ScopeFidelity on the receiver, RecoveryOnError on the sender’s error path.

No HANDOFF span, only a chat transcript. A generic OTel tracer collapses the run into one conversation span. handoff.scope_prompt, handoff.allowed_tools, and handoff.return_value are gone. Use the framework instrumentor or build the same span shape manually before attempting per-handoff eval.

Treating model checkpoint refreshes as silent. Refreshed checkpoints drift dispatch behaviour first and final-answer quality second; helpful senders over-dispatch, eager receivers drift scope. DispatchCorrectness and ScopeFidelity are the earliest indicators. Pin model versions, run the regression set on every refresh, track per-axis trend lines per checkpoint.
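
Tracking per-axis trends across checkpoints can start as a comparison of mean scores on the same regression set. The sketch below is one such comparison. The 0.02 tolerance and the `axis_drops` helper are assumptions, and a mean difference on a small set is a prompt to look closer, not a significance test.

```python
from statistics import mean
from typing import Dict, List


def axis_drops(baseline: Dict[str, List[float]],
               candidate: Dict[str, List[float]],
               tolerance: float = 0.02) -> Dict[str, float]:
    """Return axes whose mean score fell by more than `tolerance`."""
    drops = {}
    for axis, base_scores in baseline.items():
        if axis not in candidate:
            continue  # axis not scored on the new checkpoint
        delta = mean(candidate[axis]) - mean(base_scores)
        if delta < -tolerance:
            drops[axis] = round(delta, 4)
    return drops


baseline = {"DispatchCorrectness": [0.90, 0.90], "TaskCompletion": [0.90, 0.92]}
candidate = {"DispatchCorrectness": [0.80, 0.84], "TaskCompletion": [0.90, 0.91]}
print(sorted(axis_drops(baseline, candidate)))  # ['DispatchCorrectness']
```

In this made-up example the dispatch axis moves while the final-answer axis barely does, which is the early-indicator pattern described above.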

How Future AGI ships the full handoff eval stack

Four surfaces, one loop, no separate products to glue together.

ai-evaluation SDK (Apache 2.0) ships the Evaluator, 60-plus EvalTemplate classes (TaskCompletion, LLMFunctionCalling, AnswerRefusal, ConversationCoherence, Groundedness, ContextAdherence, ChunkAttribution, 11 CustomerAgent* templates), the CustomLLMJudge that carries the four handoff rubrics, 13 guardrail backends, and distributed runners (Celery, Ray, Temporal, Kubernetes).

traceAI (Apache 2.0) ships the HANDOFF span kind plus the framework instrumentors (AutogenInstrumentor, ClaudeAgentInstrumentor, LangGraphInstrumentor, CrewAIInstrumentor, OpenAIAgentsInstrumentor, ADKInstrumentor) so the same rubrics run across every framework. 50-plus AI surface instrumentors across Python, TypeScript, Java, and C#. EvalTag attaches a rubric to a span kind so evals run server-side without polling.

Future AGI Platform adds self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the platform with HDBSCAN clustering, the Sonnet 4.5 JudgeAgent, the 5-category 30-subtype taxonomy, the 4-D trace score, and the immediate_fix artifact.

Agent Command Center plus agent-opt close the safety and improvement loops. The gateway caps handoff depth (MaxAgentDepth default 10), surfaces per-call cost on response headers that traceAI rolls up across the handoff tree, and exposes 18+ built-in guardrail scanners plus 15 third-party adapters. agent-opt’s six optimizers consume the four handoff rubrics as the optimization objective. SOC 2 Type II, HIPAA, GDPR, and CCPA certified per futureagi.com/trust.

Honest tradeoff: if your stack is a single agent with zero handoffs, you do not need this. A TaskCompletion rubric and the base instrumentor cover it. The handoff eval stack earns its weight the moment a second agent enters the loop, and more so with three-plus agents, mixed models, dynamic routing, or production traffic. At that point the handoff is the unit that decides whether the team ships.

What to do this week

One multi-agent config, end to end. Five steps.

  1. Wire the framework-specific traceAI instrumentor into your existing project. Verify per-agent AGENT spans, TOOL_CALL spans, and HANDOFF spans with handoff.source, handoff.target, handoff.scope_prompt, handoff.allowed_tools, and handoff.return_value populated.
  2. Pull 100 real production handoffs across all sender-receiver pairs. Annotate each with pre-handoff state, dispatch call, receiver trace, sender post-return turn, and expected behaviour on errors.
  3. Define DispatchCorrectness, ScopeFidelity, ResultIntegration, and RecoveryOnError as CustomLLMJudge rubrics. Run them through Evaluator.evaluate alongside TaskCompletion, LLMFunctionCalling, AnswerRefusal, and ConversationCoherence.
  4. Wire per-axis thresholds into CI. Start at DispatchCorrectness >= 0.85, ScopeFidelity >= 0.90 per receiver type, ResultIntegration >= 0.80, RecoveryOnError >= 0.85. Cap MaxAgentDepth at the gateway at 10.
  5. Turn on Error Feed. Watch the first week’s clusters. Promote representatives into the regression set. Run a BayesianSearchOptimizer study on the highest-impact prompt: sender for DispatchCorrectness, receiver for ScopeFidelity, sender error branch for RecoveryOnError.

The teams shipping reliable multi-agent systems in 2026 stopped grading the final answer and started grading the seams. The framework gives you orchestration. The eval stack gives you the per-handoff signal that keeps the team honest, one transition at a time.

Frequently asked questions

What is an agent handoff in 2026, and why does the term matter across frameworks?
A handoff is the moment work crosses from one agent to another, with whatever context the receiver needs to continue. Every multi-agent system has them. Every framework calls them something different: Anthropic's Claude Agent SDK calls a sub-agent invocation a Task tool dispatch, the OpenAI Agents SDK exposes a handoff() primitive, LangGraph models the transition as an edge between nodes, CrewAI calls it delegation, AutoGen treats it as a turn between participants in a group chat, and A2A wraps the same transition in a network protocol. The shape is identical. One agent produces output, scope, and constraints. Another agent receives them and is supposed to act on them. The eval that works at the framework-agnostic layer scores the transition itself, not either side of it.
What are the four framework-agnostic rubrics for handoff evaluation?
Dispatch correctness, scope fidelity, result integration, and recovery on error. DispatchCorrectness scores the sender's choice: was this the right moment to hand off, the right receiver to pick, and the right scope to set. ScopeFidelity scores the receiver: did it stay inside the dispatched scope and constraints, or drift, fabricate context, or call tools it was not given. ResultIntegration scores the sender's next step after control returns: did the sender read the result, propagate its constraints, and let it change the plan, or discard it and continue inline. RecoveryOnError scores what happens when the receiver fails or times out: did the sender retry, fall back, escalate, or silently continue with a missing artifact. Each rubric maps to a different span, a different threshold, and a different prompt to fix.
How does traceAI emit per-handoff spans, and what attributes do the rubrics read?
traceAI emits a HANDOFF span kind alongside AGENT, TOOL_CALL, and CHAIN, with attributes the four rubrics read directly. handoff.source and handoff.target carry the sender and receiver identities. handoff.scope_prompt carries the dispatched scope verbatim. handoff.allowed_tools carries the tool subset the receiver may use. handoff.context_summary carries the state summary the sender wrote. handoff.reason carries the dispatch trigger. handoff.return_value carries the receiver's result. handoff.error carries the failure mode when one fires. The framework-specific instrumentors (AutogenInstrumentor, ClaudeAgentInstrumentor, LangGraphInstrumentor, CrewAIInstrumentor, OpenAIAgentsInstrumentor) all emit the same HANDOFF span shape. A generic OTel tracer collapses the entire team run into one chat span and the four rubrics cannot run.
How do you wire the four handoff rubrics into CI?
Build a regression set of real production handoffs (forty to two hundred is enough to start), stratified by sender-receiver pair so each pair has at least eight to ten examples. Annotate each with the pre-handoff state, the dispatch call, the receiver's full trace, the sender's post-return turn, and the expected outcome on errors. Wrap each as a TestCase and run the four CustomLLMJudge rubrics through Evaluator.evaluate alongside TaskCompletion, LLMFunctionCalling, AnswerRefusal, and ConversationCoherence from the ai-evaluation SDK. Wire per-axis thresholds (DispatchCorrectness at least 0.85, ScopeFidelity at least 0.90 per receiver type, ResultIntegration at least 0.80, RecoveryOnError at least 0.85). Bind the assertions to the sender prompt version, the receiver definition versions, and the test set tag so a failing axis localises to one prompt change.
How does Error Feed cluster handoff failures back to a fix?
Error Feed runs HDBSCAN soft-clustering over span attributes plus per-handoff embeddings inside ClickHouse, then fires a Claude Sonnet 4.5 JudgeAgent on each cluster with a 30-turn budget and eight span-tools. For multi-agent stacks the natural clusters are handoff-shaped: dropped-constraint clusters (the receiver lost a budget cap or a date range), scope-bleed clusters (a validator started drafting, a refactor agent proposed tests), integration-skip clusters (the sender ignored the receiver's result), and recovery-blind clusters (a receiver timed out and the sender continued as if it returned). The Judge writes a 5-category 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, 1 to 5 each), and an immediate_fix naming the sender prompt edit, the receiver scope tighten, or the rubric calibration to ship today. Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
Where does Future AGI ship the full handoff eval stack?
The eval stack ships as a package, not a single product. ai-evaluation SDK (Apache 2.0) ships 60-plus EvalTemplate classes including TaskCompletion, LLMFunctionCalling, AnswerRefusal, ConversationCoherence, Groundedness, ChunkAttribution, plus the CustomLLMJudge that carries DispatchCorrectness, ScopeFidelity, ResultIntegration, and RecoveryOnError. traceAI (Apache 2.0) ships HANDOFF span kind plus framework instrumentors (AutogenInstrumentor, ClaudeAgentInstrumentor, LangGraphInstrumentor, CrewAIInstrumentor, OpenAIAgentsInstrumentor) so the same rubrics run across every framework. The Future AGI Platform adds self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed clusters failures and writes the immediate_fix. Agent Command Center caps handoff depth at the gateway. agent-opt's six optimizers tune sender and receiver prompts as separate study targets.