What Is the A2A Protocol?

A2A Protocol is an open interoperability standard for AI agents to discover peers, exchange task messages, and delegate work across systems.

The A2A Protocol (Agent-to-Agent Protocol) is an open interoperability standard for agents to discover peers, exchange structured task messages, stream artifacts, and delegate work across heterogeneous runtimes. It belongs to the agent family and appears in production traces at handoff boundaries: capability discovery against an AgentCard, task creation, status streaming, artifact transfer, completion, and errors. A2A was originally proposed by Google, later donated to the Linux Foundation, and by May 2026 is shipping inside every major orchestrator alongside MCP. Where MCP standardizes how an agent talks to a tool, A2A standardizes how an agent talks to another agent that can itself plan, call tools, and return state. FutureAGI’s traceAI ships a traceAI-a2a integration so every handoff is measurable for completion, latency, cost, and downstream failure analysis.

The shortest mental model: A2A turns the boundary between a planner agent and a specialist agent from a private webhook into a typed protocol with a published capability card, a lifecycle, and a trace context that survives the network hop. As of May 2026, the major orchestration runtimes (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Google ADK, Strands, Pydantic AI) either ship an A2A client/server pair natively or expose an adapter, and the AgentCard format has converged enough that a planner written against one runtime can call a specialist written against another with zero glue code.
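To make the capability-card idea concrete, here is a minimal sketch of a planner checking a specialist's card before delegating. The card layout (`name`, `url`, `skills`, `id`), the example URL, and the helpers `find_skill` and `can_delegate` are illustrative assumptions, not a verbatim copy of the A2A AgentCard schema.

```python
from typing import Optional

# Hypothetical, simplified AgentCard; real cards carry more fields
# (auth schemes, capabilities, versioning).
AGENT_CARD = {
    "name": "billing-agent",
    "url": "https://agents.example.com/billing",
    "skills": [
        {"id": "process_refund", "description": "Refund an order"},
        {"id": "query_invoice", "description": "Look up an invoice"},
    ],
}


def find_skill(card: dict, skill_id: str) -> Optional[dict]:
    """Return the advertised skill with this id, or None if absent."""
    for skill in card.get("skills", []):
        if skill.get("id") == skill_id:
            return skill
    return None


def can_delegate(card: dict, skill_id: str) -> bool:
    """A planner should only hand off work the card actually advertises."""
    return find_skill(card, skill_id) is not None
```

A planner would call `can_delegate(card, "process_refund")` before creating a task, and fall back or fail loudly when the skill is not advertised.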

Why the A2A Protocol matters in production LLM and agent systems

The dominant failure mode in 2026 multi-agent stacks is silent delegation failure. A planner agent sends a task to a billing agent, receives a plausible completed event, and the workflow continues, except that the billing agent used the wrong customer id, hit a stale account, or returned a refund total that ignored loyalty credits. The end user sees a confident answer. The developer sees a generic success response. The SRE sees retries and p99 latency growth but cannot say which boundary caused it because the trace orphans at the network hop.

A2A matters because 2026 agent systems rarely stay inside one runtime. A support workflow may use one agent for intent routing, another for account lookup, another for policy explanation, and a vendor agent for refunds. Without a protocol, each pair becomes a private API contract with unique auth, payload shape, retry semantics, and error handling. Audits become brittle, incident response slows, and the multi-agent system becomes a graph nobody can debug end-to-end. The same anti-pattern (every team building its own RPC dialect) killed early service meshes, and A2A is the analogous standardization for agent handoff.

The pain shows up unevenly. Platform engineers see orphaned task ids, missing trace ids across hops, capability descriptions that no longer match live behavior, and cost spikes attached to the wrong service owner. SREs watch p99 grow when one called agent throttles, but the dashboard cannot localize the hop. Compliance teams discover incomplete audit logs when regulators ask “which agent took this action.” Product teams see refund agents refunding the wrong order, tool-calling chains executing against stale state, and end users escalating because the response felt authoritative but landed wrong. In 2026 production agent stacks, where agentic AI workflows routinely fan out across five or more agents per user request, the handoff boundary is the new failure surface, and a protocol-grade view of it is no longer optional.

Unlike MCP, which standardizes how agents reach tools, A2A standardizes how agents reach other agents that reason, call tools, and return task state. The two are complementary; most serious 2026 stacks run both, and traces from each show up side by side in the same trace tree.

There is also a structural reason A2A matters more in 2026 than in 2024: frontier model improvements have pushed single-agent ceilings against domain boundaries. A planner agent built on Claude Opus 4.7 is excellent at reasoning, but it cannot replace a domain-specific finance agent that owns ledger access, audit logs, and a customer master record. The economic answer is composition, not one larger model. A2A is the protocol layer that makes that composition portable across vendors, runtimes, and trust boundaries, and it is the layer where reliability either holds or breaks.

A2A vs MCP at a glance

Engineers ask this every week. The two protocols share an OAuth-style discovery model and JSON-RPC framing, but they sit at different layers of the stack.

Concern               | A2A Protocol                           | Model Context Protocol (MCP)
Counterpart           | Another agent that plans and acts      | A tool, resource, or prompt server
Discovery unit        | AgentCard (capabilities, skills, auth) | Tool / resource / prompt manifests
Message model         | Task lifecycle with streamed events    | Synchronous request / response
State                 | Long-running, multi-turn task state    | Stateless tool invocation (mostly)
Trust boundary        | Agent-to-agent, potentially cross-org  | Agent-to-tool, usually intra-org
Failure modes         | Silent delegation, capability drift    | Tool error, schema mismatch
FutureAGI integration | traceAI-a2a                            | traceAI-mcp

Most real 2026 architectures use both: the planner agent uses MCP for its own tool layer and uses A2A to delegate a sub-task to a specialist agent that itself uses MCP under the hood. A clean agent observability story has to capture both surfaces inside a single trace tree.

How FutureAGI handles the A2A Protocol

FutureAGI’s approach is to treat A2A as a reliability boundary, not just an integration format. The traceAI-a2a integration records capability discovery, task creation, message exchange, streamed artifacts, cancellation, completion, and error events as OpenTelemetry spans inside the same distributed trace. W3C traceparent headers ride on every A2A message, so a parent trace that starts in a LangGraph orchestrator at one company keeps its trace id when the task crosses an A2A hop to a specialist agent at another company.
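The trace-id continuity described above depends on the W3C `traceparent` header, whose layout is `version-traceid-parentid-flags` in lowercase hex. The sketch below shows only the header mechanics: the caller keeps the trace id and mints a new span id for the outgoing hop. The function names are illustrative assumptions; in practice an OpenTelemetry propagator would normally do this for you, and this is not a description of how traceAI-a2a is implemented.

```python
import re
import secrets

# W3C Trace Context layout: 2-hex version, 32-hex trace id,
# 16-hex parent (span) id, 2-hex flags.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields, or raise ValueError."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats all-zero trace ids and parent ids as invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero ids are invalid")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
    }


def child_traceparent(incoming: str) -> str:
    """Keep the trace id; mint a fresh span id for the outgoing A2A hop."""
    fields = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    return f"{fields['version']}-{fields['trace_id']}-{new_span_id}-{fields['flags']}"
```

If the callee receives no valid header, it has nothing to continue from and starts a fresh trace, which is how the graph orphans at the hop.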

The key span attribute is agent.trajectory.step. On every A2A handoff, FutureAGI tags the caller, the called agent, the task id, the lifecycle state, the AgentCard skill name, the streamed artifact type, and the final outcome. Engineers can then query “all failed handoffs to the refunds agent in the last hour” or “all cancelled A2A tasks where the called agent never streamed an artifact” instead of replaying logs across three services. The trace surface also lets evaluators run at the boundary: TaskCompletion checks whether the delegated task actually finished, ToolSelectionAccuracy checks whether the called agent chose the right tool path, and TrajectoryScore summarizes the full multi-agent route, a critical signal for agent-as-judge workflows where one agent reviews another’s run.
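The two example queries can be sketched as plain filters over exported spans. Only `agent.trajectory.step` comes from the text; the span-as-dict shape and the attribute names `a2a.callee`, `a2a.task.state`, and `a2a.artifact.count` are hypothetical stand-ins for whatever your trace backend exposes.

```python
A2A_HOP = "a2a.hop"  # step value marking an A2A handoff span


def _attr(span: dict, key: str):
    """Read one attribute from a span dict, or None if it is missing."""
    return span.get("attributes", {}).get(key)


def failed_handoffs(spans: list, callee: str) -> list:
    """All A2A hops to `callee` that ended in the failed state."""
    return [
        s for s in spans
        if _attr(s, "agent.trajectory.step") == A2A_HOP
        and _attr(s, "a2a.callee") == callee          # hypothetical attribute
        and _attr(s, "a2a.task.state") == "failed"    # hypothetical attribute
    ]


def cancelled_without_artifact(spans: list) -> list:
    """Cancelled A2A tasks where the callee never streamed an artifact."""
    return [
        s for s in spans
        if _attr(s, "agent.trajectory.step") == A2A_HOP
        and _attr(s, "a2a.task.state") == "cancelled"
        and (_attr(s, "a2a.artifact.count") or 0) == 0  # hypothetical attribute
    ]
```

A time-window filter ("in the last hour") would be one more predicate on the span's timestamp.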

A concrete example: a commerce assistant built on LangGraph delegates a warranty decision to a policy agent over A2A. The policy agent returns artifacts, status updates, and a final answer. FutureAGI attaches those events to the original user trace, runs TaskCompletion and ToolSelectionAccuracy on the delegated subtask, and fires an alert when TaskCompletion drops below the release threshold for premium accounts. The engineer can then route the workflow through a backup policy agent inside the Agent Command Center (using a routing policy that prefers a vetted internal agent over the third-party one) and add the failing rows to a regression eval cohort before traffic returns to the primary path. Compared with a raw webhook integration, this keeps protocol events, evaluation scores, and operator actions in one timeline. Compared with LangChain’s older agent-chaining pattern, it preserves trace context across processes, and unlike OpenAI’s Swarm-style handoffs, which stay inside one runtime, A2A spans even cross-org boundaries cleanly.

In our 2026 evals, the most common A2A failure pattern is “capability drift”: a remote agent advertised process_refund(order_id) six months ago, but the live agent now expects process_refund(order_id, currency) and silently mishandles legacy callers. FutureAGI catches this by diffing the live AgentCard against the version cached on first connect and raising an agent.capability.drift event into the same trace. Engineers see the drift, the failing trace, and the eval regression in one view, and can either pin to the older capability or upgrade the caller. The same trace also feeds agent observability dashboards keyed off agent.trajectory.step, so agent loop iteration counts, handoff edges, and tool decisions all render alongside A2A lifecycle events.
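The drift check can be illustrated with a small diff over two card snapshots. This is a sketch under stated assumptions, not FutureAGI's implementation: the card shape (a `skills` list whose entries have `id` and `params`) and the helper names are hypothetical.

```python
def skill_params(card: dict) -> dict:
    """Map each skill id to the tuple of parameter names it advertises."""
    return {
        skill["id"]: tuple(skill.get("params", []))
        for skill in card.get("skills", [])
    }


def diff_agent_cards(cached: dict, live: dict) -> dict:
    """Compare the card cached at first connect with the live card."""
    old, new = skill_params(cached), skill_params(live)
    shared = set(old) & set(new)
    return {
        "removed": sorted(set(old) - set(new)),
        "added": sorted(set(new) - set(old)),
        "changed": sorted(k for k in shared if old[k] != new[k]),
    }


def has_drift(diff: dict) -> bool:
    """True when any skill was removed, added, or changed signature."""
    return any(diff.values())


# The process_refund example from the text, as two snapshots.
CACHED = {"skills": [{"id": "process_refund", "params": ["order_id"]}]}
LIVE = {"skills": [{"id": "process_refund", "params": ["order_id", "currency"]}]}
```

Here `diff_agent_cards(CACHED, LIVE)` reports `process_refund` under `changed`, which is the condition a caller would turn into a drift event.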

Engineers wire this into release gates through the evaluate surface and into live traffic through the Agent Command Center. When a low TaskCompletion score fires post-release, the tracing view points to the exact A2A hop where the trajectory broke. For high-impact paths (refunds, healthcare, financial trading), engineers can also pre-test new agent versions against Persona and Scenario simulations through simulate before exposing them to production traffic. The same trace shape produced by traceAI-a2a in production is produced by the simulation harness, so a failing regression scenario in CI looks identical to a failing production incident, and the engineer never context-switches between two trace formats.

A second concrete pattern: a B2B SaaS company exposes a partner-facing A2A endpoint that lets customer agents pull invoice data and trigger payment workflows. The endpoint is rate-limited per partner via Agent Command Center, guarded by a pre-guardrail that screens for prompt injection embedded in task payloads, and post-guarded by a PII evaluator to ensure the response never leaks data outside the partner’s scope. Every A2A hop carries the partner’s tenant id as a span attribute, so per-tenant TaskCompletion, latency, and cost roll up cleanly. When a single partner agent starts misbehaving, for example by calling cancel_invoice instead of query_invoice, the per-tenant dashboard isolates the regression to that partner’s AgentCard version within minutes, and the routing policy automatically degrades that partner to a more conservative skill subset until it is patched.
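The per-tenant rollup and the "degrade to a conservative skill subset" step can be sketched as two small functions. Everything here is an assumption for illustration: the hop record fields (`tenant_id`, `task_completed`), the 0.9 threshold, and the function names are hypothetical, and this is not the Agent Command Center's routing-policy API.

```python
from collections import defaultdict


def completion_rate_by_tenant(hops: list) -> dict:
    """Fraction of A2A hops per tenant whose delegated task completed."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for hop in hops:
        tenant = hop["tenant_id"]
        totals[tenant] += 1
        if hop["task_completed"]:
            passed[tenant] += 1
    return {tenant: passed[tenant] / totals[tenant] for tenant in totals}


def allowed_skills(requested: set, rate: float, safe: set,
                   threshold: float = 0.9) -> set:
    """Below the threshold, restrict the tenant to the safe skill subset."""
    if rate < threshold:
        return requested & safe
    return set(requested)
```

With a completion rate of 0.5, a partner requesting `{"query_invoice", "cancel_invoice"}` against a safe set of `{"query_invoice"}` would be left with the read-only skill until its rate recovers.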

How to measure or detect A2A Protocol reliability

Measure A2A reliability at the handoff boundary, then aggregate by caller, called agent, task type, AgentCard version, and release. Vanity metrics like “A2A 2xx rate” hide the real failure mode: a successful protocol call that nonetheless completed the wrong task.

  • TaskCompletion: returns whether the delegated task reached the expected outcome, not merely whether the A2A call returned completed. The single most important signal at the boundary.
  • ToolSelectionAccuracy: checks whether the called agent picked the right tool path during its sub-trajectory. Critical when the caller cannot inspect the callee’s internals.
  • TrajectoryScore: summarizes the full multi-step route, including planning, A2A handoff, sub-agent tool use, and final completion. Pair with GoalProgress for partial credit.
  • Faithfulness: when the called agent returns a grounded answer (RAG-backed), check that the answer is supported by the cited sources.
  • agent.trajectory.step: tags the span representing the A2A hop so traces can be grouped by handoff depth, retry count, and parent agent.
  • gen_ai.agent.graph.node_id and gen_ai.agent.graph.parent_node_id: preserve the call graph topology across processes; needed for agent loop detection and graph-view rendering.
  • AgentCard drift: alert when the live capability card diverges from the version cached at first connect.
  • Dashboard signals: p99 A2A round-trip latency, eval-fail-rate-by-called-agent, retry rate, cancellation rate, streamed-artifact-error rate, and token-cost-per-trace by hop.
  • User proxy: escalation rate after delegated work, especially when the caller reports success but the user reopens the issue within 24 hours.
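Two of the dashboard signals above, p99 round-trip latency and eval-fail-rate by called agent, can be computed from exported hop records with a few lines of standard-library Python. The record fields (`callee`, `latency_ms`, `score`) are hypothetical, the percentile uses the simple nearest-rank method, and the 0.7 default is only an example threshold.

```python
import math
from collections import defaultdict


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile: smallest value covering pct% of the data."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_latency_by_callee(hops: list) -> dict:
    """p99 A2A round-trip latency, grouped by called agent."""
    grouped = defaultdict(list)
    for hop in hops:
        grouped[hop["callee"]].append(hop["latency_ms"])
    return {callee: percentile(vals, 99) for callee, vals in grouped.items()}


def eval_fail_rate_by_callee(hops: list, threshold: float = 0.7) -> dict:
    """Share of hops per called agent whose eval score is below threshold."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for hop in hops:
        totals[hop["callee"]] += 1
        if hop["score"] < threshold:
            fails[hop["callee"]] += 1
    return {callee: fails[callee] / totals[callee] for callee in totals}
```

The same grouping key can be swapped for caller, AgentCard version, task type, or release to get the other slices.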

Minimal evaluator pairing:

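from fi.evals import TaskCompletion, ToolSelectionAccuracy, TrajectoryScore

# Placeholder inputs: task, result, called_agent_trace, full_a2a_trace, and
# user_goal come from your own pipeline and are not defined in this snippet.
completion = TaskCompletion()
tool_choice = ToolSelectionAccuracy()
trajectory = TrajectoryScore()

# Did the delegated task reach the expected outcome?
c = completion.evaluate(input=task, output=result)
# Did the called agent pick the right tool path?
t = tool_choice.evaluate(trajectory=called_agent_trace)
# How good was the full multi-agent route against the user's goal?
j = trajectory.evaluate(trace=full_a2a_trace, goal=user_goal)
print(c.score, t.score, j.score)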

When the score drops, the trace view shows exactly which A2A hop fired the regression and which AgentCard skill was invoked. Pair these scores with Faithfulness and Groundedness whenever the called agent backs its answer with retrieved sources; pair with PII and Toxicity when the response crosses a tenant or trust boundary.

To wire the same evaluators online, attach them to the A2A span emitted by traceAI-a2a so every production handoff is scored as it happens:

from traceai_a2a import A2AInstrumentor
from fi.evals import TaskCompletion, ToolSelectionAccuracy, Faithfulness

# Instrument once at startup so every A2A hop emits a span.
A2AInstrumentor().instrument()

SCORE_THRESHOLD = 0.7  # alert when an evaluator scores below this

online_chain = [
    TaskCompletion(),
    ToolSelectionAccuracy(),
    Faithfulness(),
]

# Each A2A hop becomes a span tagged `agent.trajectory.step=a2a.hop`;
# the evaluator chain runs against that span tree per request.
# a2a_spans_for_trace, alert, and trace_id are placeholders that this
# snippet does not define; supply them from your own code.
for span in a2a_spans_for_trace(trace_id):
    for evaluator in online_chain:
        result = evaluator.evaluate_span(span)
        if result.score < SCORE_THRESHOLD:
            alert("a2a.regression", span=span, evaluator=evaluator, score=result.score)

Benchmarking A2A reliability against agent-era suites

In our 2026 evals at FutureAGI we have found that traditional single-turn QA benchmarks tell you almost nothing about A2A reliability. The benchmarks that do correlate are the trajectory suites: τ-bench retail and airline (multi-turn customer support with tool state), SWE-Bench Verified (multi-step code edits across files), GAIA (multi-hop assistant tasks), and OSWorld (real OS-level multi-agent action). Teams that score above 65% on τ-bench retail with an A2A-decomposed architecture typically run two to four specialist agents behind a planner; teams that try to handle the same workload with a single monolithic agent saturate around 55%. The delta is almost entirely explained by tool-selection accuracy at handoff boundaries, which is exactly what ToolSelectionAccuracy on the A2A span tree measures. The corollary: if your internal eval scores trail your single-turn benchmark scores, the A2A boundary is the first place to look.

Common mistakes

  • Treating A2A as raw RPC. Capability discovery, task lifecycle, auth, and streamed events are part of the contract; dropping them breaks interoperability and silently regresses to a bespoke webhook.
  • Trusting completed events without evaluators. A completed A2A task can still be wrong, unsafe, late, or attached to the wrong user. Always pair the protocol signal with TaskCompletion.
  • Caching AgentCards forever. Remote capabilities drift. Refresh on schedule and alert on diff, or callers will hit silent schema mismatches.
  • Using aggregate metrics only. Handoff failures hide inside global averages unless dashboards slice by caller, called agent, AgentCard version, task type, and release.
  • Confusing A2A with MCP. A2A is agent-to-agent; MCP is agent-to-tool. Mixing them produces unclear ownership and weak incident traces.
  • Skipping trace propagation. A2A messages must carry W3C traceparent or the second agent starts a fresh trace and the graph orphans.
  • No fallback design. Third-party agents fail, throttle, drift, or get retired. The Agent Command Center routing policy should define a backup A2A target before launch, with a model fallback-style chain for agents.
  • Sending free-text instructions inside the task payload. A2A tasks are structured. Stuffing freeform instructions inside a text field invites prompt injection against the callee. Use typed skill arguments and validate them.
  • No artifact-stream timeout. A2A supports streamed artifacts; a misbehaving callee can stall the stream indefinitely. Set per-skill timeouts and budget the calling agent loop accordingly.
  • Ignoring authentication scope. Each AgentCard skill should declare the minimum auth scope it needs. Calling a query skill with a write token expands blast radius for compromised partners.
  • Reusing the same A2A endpoint for tools. When the called counterpart is really a tool (no planning, no autonomy), use MCP instead. Wrapping tools in A2A adds protocol surface without value.
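The artifact-stream timeout from the list above can be sketched with `asyncio.wait_for` applied to each chunk, so a stalled callee fails fast instead of blocking the calling agent loop. The stream here is any async iterator of chunks; `collect_artifacts` and the `demo_stream` stand-in callee are hypothetical names, not part of an A2A SDK.

```python
import asyncio
from typing import AsyncIterator, List


async def collect_artifacts(stream: AsyncIterator[str],
                            per_chunk_timeout: float) -> List[str]:
    """Drain an artifact stream; raise asyncio.TimeoutError on a stall."""
    chunks: List[str] = []
    iterator = stream.__aiter__()
    while True:
        try:
            # Bound the wait for every chunk, not just the whole stream.
            chunk = await asyncio.wait_for(
                iterator.__anext__(), timeout=per_chunk_timeout
            )
        except StopAsyncIteration:
            return chunks  # callee closed the stream normally
        chunks.append(chunk)


async def demo_stream(chunks: List[str], delay: float) -> AsyncIterator[str]:
    """Hypothetical callee: yields each chunk after `delay` seconds."""
    for chunk in chunks:
        await asyncio.sleep(delay)
        yield chunk
```

A per-chunk bound catches a callee that sends one artifact and then goes quiet; an overall task deadline can be layered on top by wrapping the whole `collect_artifacts` call in another `asyncio.wait_for`.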

Frequently Asked Questions

What is the A2A Protocol?

The A2A Protocol is an open standard that lets AI agents discover each other through AgentCards, exchange structured task messages, stream artifacts, and delegate work across systems. FutureAGI observes those handoffs with traceAI-a2a so teams can evaluate reliability at the agent boundary.

How is A2A different from MCP?

MCP connects an agent to tools and data sources; A2A connects agents to other agents that can plan, call tools, and return task status. Most 2026 production stacks need both: MCP for the tool layer and A2A for the agent-to-agent layer.

How do you measure A2A Protocol reliability?

Use traceAI-a2a spans tagged with agent.trajectory.step, then score delegated work with TaskCompletion, ToolSelectionAccuracy, and TrajectoryScore. Track p99 handoff latency, retry rate, and cancellation rate by called agent.