Guides

LLM Eval vs Agent Protocol Evolution in 2026: A Protocol-Neutral Eval Playbook for MCP, A2A, Responses, Realtime, Tool Use, and ADK

MCP, A2A, OpenAI Responses, Realtime, Anthropic Tool Use, Google ADK. Six protocols changed agent eval in 2026. Here is how to keep your eval stack protocol-neutral.

·
14 min read
agents evaluation mcp a2a protocols observability
LLM Eval vs Agent Protocol Evolution: Going Protocol-Neutral in 2026
Table of Contents

The Problem: Six Protocols Shipped, Your Eval Stack Was Built for One

In Q1 2026 most production agent teams are running on at least three protocol surfaces at the same time. The agent talks to tools over MCP. It hands work to other agents over A2A. It streams chat back to a frontend through the OpenAI Responses API or the Anthropic Messages API. A subset of teams add Realtime voice, native Tool Use vision, and Google ADK session-state on top.

The eval stack underneath was built for one of those surfaces. Usually the first one. Usually chat completions.

That is the gap. Every new protocol changes how tool calls, handoffs, streams, and modalities appear in your traces. If your evaluators only know how to read the original surface, they will score whatever subset still looks like a chat completion and silently miss the rest. A multi-agent handoff returns a green dashboard while a critical hop in the middle violated policy. A Realtime call rolls out PII through a streaming token that the post-call evaluator never saw. An MCP server installed last sprint slips a poisoned tool descriptor past a guardrail that only looked at the final message.

The fix is not to write five parallel eval stacks. The fix is to build the one stack to be protocol-neutral, the same way the observability stack had to become vendor-neutral after every cloud shipped its own tracing format. This post walks through the six protocols, the five eval-protocol gaps emerging in 2026, the Future AGI primitives that close those gaps, and a 5-step protocol-readiness audit you can run this week.

Why Protocol Evolution Matters for Eval

Eval is not a model property. Eval is a property of the path the request took. The path is defined by the protocol.

A simple chat completion has a single path: prompt in, response out. One rubric grades it. The moment you add MCP, the path forks into prompt, tool descriptor list, tool selection, tool-call arguments, tool result, final response. Add A2A and the path becomes a tree across multiple agents. Add Realtime and the path becomes a stream that never has a single end token. Each new protocol multiplies the number of decision points the agent makes and the number of seams where the trace can be broken or hijacked.

Three concrete consequences:

  1. Eval becomes a per-seam problem. You need a rubric per decision point, not one rubric for the final message. MCP needs per-tool audit. A2A needs per-hop audit. Realtime needs mid-stream audit.
  2. Trace fidelity is the upper bound on eval fidelity. If your trace cannot represent an A2A hop or a streaming chunk, no evaluator downstream can score it. Span kinds and semantic conventions become first-class eval infrastructure.
  3. Protocol drift is constant. MCP shipped a new revision in 2025-06-18. A2A is on 0.3.x. The Responses API ships changes every quarter. Hard-coding any single protocol surface guarantees a rebuild within twelve months.

Teams that treat eval as a protocol-neutral layer ride those revisions cheaply. Teams that bake a single protocol into their evaluators end up rewriting the eval stack on every protocol bump.

The Six Protocols and What Each Requires From Eval

The six protocols below are the ones that matter for an enterprise agent stack in 2026. Each has a distinct eval surface.

MCP: Model Context Protocol

MCP is the de facto standard for tool access. The agent connects to a server, lists tools, then calls them with structured arguments. The eval surface is the tool catalog itself plus every tool call.

The Future AGI gateway runs a dual scanner. The first pass, mcpsec.go, evaluates the full tool catalog at the chat-completion stage and rejects servers whose descriptors look poisoned or whose schema implies a sensitive write the user did not approve. The second pass, toolguard.go, fires at per-tool-call time via the mcp.ToolCallGuard interface. Every individual call is scored against policy before the gateway lets it through. For a deeper walkthrough see evaluating MCP servers for security and the step by step MCP eval guide.

A2A: Agent-to-Agent

A2A is the peer-to-peer surface for cross-agent coordination. The eval challenge is distributed observability. When agent A hands a task to agent B, most stacks lose the trace at the hop. The traceAI SDK emits A2A_CLIENT and A2A_SERVER span kinds and propagates a gen_ai.a2a.propagated_trace_id attribute across the hop, so the entire multi-agent path lands in one trace. Once the trace is whole, you can run a per-hop rubric: did the calling agent pass enough context, did the called agent stay in scope, did the artifact returned match the requested task? See MCP vs A2A and evaluating LLM agent handoffs for the underlying patterns.

OpenAI Responses API

The Responses API folds streaming, structured output, and tool calls into one envelope. The eval rubric must split by output type. A structured JSON output is graded against a schema. A streaming text segment is graded incrementally. A tool call is graded against a per-tool policy. One end-to-end pass does not work because the three sub-outputs have nothing in common except the wrapping envelope.

OpenAI Realtime API

Realtime is bidirectional. Audio in, audio out, with optional text and tool calls in between. The eval discipline shifts to mid-stream. You cannot wait for the full response because in Realtime there is no full response, just a continuous stream. Streaming guardrails with a check_interval evaluate partial output and act with a stop or disclaimer action when a rule fires. Latency budgets sit below 500ms end to end, which constrains how heavy each eval pass can be. The patterns in evaluating streaming LLM responses and the sub 500ms voice AI guide cover the practical wiring.

Anthropic Tool Use

Tool Use in the Anthropic Messages API uses native tool blocks rather than a separate function-calling channel. Vision is inline in the same message. The eval surface adds a per-modality rubric: the text portion uses one set of templates, the image portion needs a vision-capable judge. The LLMFunctionCalling template handles the tool-call portion, while a multi-modal CustomLLMJudge handles the image portion in the same trace. See evaluating Claude Code tool use for the developer flow.

Google ADK and Vertex Agent Engine

ADK adds explicit session-state, Gemini-native multi-modal, and a deployment surface in Vertex. The eval surface needs to read session-state across turns rather than treating each turn as independent, and needs a multi-modal judge that scores text plus image plus audio in the same rubric. See evaluating Google ADK agents for the full pattern.

The headline: six protocols, six distinct eval surfaces, all of them in production at the same enterprise at the same time.

The Five Eval-Protocol Gaps Emerging in 2026

Across audits this year, the same five gaps keep showing up. Each one is a place where the eval stack lags the protocol surface.

Gap 1: Distributed Handoff Observability

Most eval stacks were built when the agent was a single process. The moment a second agent enters the picture, the trace breaks at the hop. The fix is span kinds that explicitly model the hop. The traceAI SDK ships fourteen span kinds, including A2A_CLIENT and A2A_SERVER, plus a gen_ai.a2a.propagated_trace_id attribute that survives the wire. Without that, your multi-agent dashboard lies.

Gap 2: Streaming-Eval Support

A guardrail that runs after the model finishes is too late on a streaming surface. The Realtime API and any streaming chat surface need mid-stream evaluators. The Protect runtime supports a check_interval so the evaluator runs every N tokens, and stop or disclaimer actions so it can cut the stream when a policy fires.

Gap 3: Multi-Modal Eval

Image, voice, and computer-use need different rubrics than text. A vision-capable judge needs to actually open the image. A voice judge needs to operate on the audio plus the ASR transcript, because errors at the ASR layer drive most of the production defects. The CustomLLMJudge template handles multi-modal inputs inline via LiteLLM, so a single rubric can include text, image, and audio in one prompt.

Gap 4: Per-Tool Audit on MCP

MCP exposes a tool catalog to the agent at runtime. The catalog can come from any server. A poisoned descriptor or a tool whose schema implies a sensitive write needs to be caught before the model picks it. Per-tool audit at the gateway level is now a baseline requirement, not a nice to have.

Gap 5: Cross-Protocol Routing

The same logical agent often gets invoked over multiple protocols. A support agent might be reached over MCP from an IDE, over the Responses API from a web app, and over A2A from another agent. Each protocol surfaces it differently. A unified eval has to score the same agent identity across all three. That is where pluggable semantic conventions matter: register(semantic_convention=FI | OTEL_GENAI | OPENINFERENCE | OPENLLMETRY) normalizes the spans so one set of rubrics applies across surfaces.

How Future AGI Closes the Gaps

The Future AGI stack is built around protocol-native primitives rather than protocol-specific evaluators. The point is that you should not have to choose between protocols at the eval layer. The eval layer should adapt under you.

TraceAI Span Kinds for Every Protocol Surface

The traceAI SDK ships fourteen span kinds that already cover the six-protocol matrix. A2A_CLIENT and A2A_SERVER handle distributed handoff. computer_use, voice, and image cover the multi-modal surfaces. GUARDRAIL and EVALUATOR carry the protect runtime and eval runtime telemetry, so guardrail decisions and evaluator scores are first-class spans in the same trace as the model call. The result is a single trace per request that survives across protocols. See what a good LLM trace looks like and the traceAI OpenTelemetry intro for the underlying model.

Gateway with Native Protocol Support

The Future AGI gateway speaks the protocols natively rather than translating to a single internal shape. It accepts OpenAI v1, Anthropic /v1/messages, Gemini /v1beta, MCP /mcp, A2A first-class via model: "a2a/<name>", and a Realtime WebSocket surface. Each protocol is preserved end to end, which means evaluators downstream see the exact shape the upstream client sent. For the enterprise control surface see the AI gateway overview.

Dual MCP Scanner

The MCP scanner is split deliberately. The mcpsec.go pass runs at chat completion time and scores the full tool catalog before any tool is called. The toolguard.go pass runs per tool call via the mcp.ToolCallGuard interface. The split matters because catalog-level threats and call-level threats have different signatures. Poisoned descriptors look fine at call time and only show up against the full catalog. Argument-injection attacks look fine at catalog time and only show up against a specific call.

Multi-Modal CustomLLMJudge

A single CustomLLMJudge template can take a text prompt, an image input, and an audio input through LiteLLM. The judge model can be GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Pro, or a self-hosted model. The same template works for an image-grounded answer in an Anthropic Tool Use call and a voice transcript in a Realtime session.

Streaming Guardrails

The Protect runtime accepts a check_interval so the evaluator runs every N tokens of a stream. When a rule fires, the runtime applies a stop or disclaimer action mid-stream. For Realtime voice the same primitive runs every N audio chunks.

Templates That Work Across Protocols

The template library is intentionally protocol-agnostic. Groundedness, LLMFunctionCalling, TaskCompletion, AnswerRefusal, Toxicity, PromptInjection, and DataPrivacyCompliance are written against the normalized trace, not against a particular protocol surface. The same Groundedness rubric grades a RAG answer that came in over the Responses API and a RAG answer that came in over MCP. For a deeper read see the ultimate guide to LLM guardrails.

Pluggable Semantic Conventions

The SDK exposes register(semantic_convention=FI | OTEL_GENAI | OPENINFERENCE | OPENLLMETRY). The same SDK call can emit FI conventions for first-party tools, OTEL GenAI for the OpenTelemetry collectors that already exist in most enterprises, OpenInference for teams running Arize Phoenix, or OpenLLMetry for teams running Traceloop. The benefit is protocol-evolution resilience: when a new convention ships, you flip the flag rather than rewriting instrumentation.

The 5-Step Protocol-Readiness Audit

Run this audit when you scope quarterly eval work. Each step takes a day or two. The goal is to find the seams where the eval stack lags the protocol surface and to gate releases on per-protocol thresholds.

Step 1: Inventory Which Protocols Your Agent Stack Uses Today

List every surface in production. MCP servers connected to any agent. A2A endpoints. Streaming surfaces. Voice surfaces. Vision surfaces. Computer-use surfaces. For each surface, note the protocol version and the client library. Most teams find at least one surface they did not plan for, usually an MCP server installed by a developer last sprint.

Step 2: Instrument With TraceAI and Confirm Span Kinds

For each surface, install traceAI and pull a trace from production. Confirm the right span kinds emit. An A2A hop should show an A2A_CLIENT on the calling side and an A2A_SERVER on the called side with the propagated trace id. A Realtime call should show a streaming voice span. A vision call should show an image span. Where a span kind is missing or wrong, fix the instrumentation before defining rubrics. The Future AGI MCP server post and instrumenting your AI agent with traceAI cover the setup.

Step 3: Define Per-Protocol Rubric Sets

For each protocol, write the rubric set that protocol needs. MCP needs per-tool audit, which means a LLMFunctionCalling template per tool plus a DataPrivacyCompliance template against the tool arguments. A2A needs per-hop audit, which means a TaskCompletion template per hop plus an AnswerRefusal template against the artifact returned. Responses needs split rubrics per output type. Realtime needs mid-stream guardrails. Vision needs a CustomLLMJudge configured for image input. Document the rubric set as code in your repo so reviewers can see what is being scored.

Step 4: Run a Per-Protocol Regression Suite, Gate on Per-Protocol Thresholds

Build a regression dataset per protocol surface. Run the rubric set against it on every change. Gate releases on per-protocol thresholds rather than a single overall score, because a regression on the MCP tool-audit rubric is a different kind of failure than a regression on the Realtime guardrail. See evaluate RAG applications in CI CD and CI CD LLM eval with GitHub Actions for the wiring.

Step 5: Monitor With Error Feed, Cluster by Protocol, Auto-Generate Fixes

In production, route every failed evaluator score into the Error Feed. The feed uses HDBSCAN to cluster failures, and clustering by protocol surfaces patterns that look like noise when you read them one at a time. A Sonnet 4.5 Judge writes an immediate_fix per cluster, which feeds the self-improving evaluator loop on the Platform side. Today the Error Feed integrates with Linear for ticketing. The trace-stream-to-agent-opt connector is on the roadmap.

The Meta-Point: Protocol-Neutral Is the New Vendor-Neutral

Five years ago every observability team learned the same lesson: lock to one vendor’s tracing format and you rebuild every time you switch vendors. The industry response was OpenTelemetry. The shape is repeating now at the protocol layer for agents. Lock your eval to one protocol’s surface and you rebuild every time a new protocol ships, which is every quarter.

The teams that absorb that lesson early build their evaluators against a normalized trace shape with pluggable semantic conventions, span kinds that already model the next protocol’s primitives, and rubric templates that score the path rather than the surface. The teams that do not absorb it pay rebuild costs every time a new protocol ships and ship blind in the gap between.

Anti-Patterns That Show Up in 2026 Eval Audits

Five patterns to watch for. Each one is a sign the eval stack is single-protocol when the agent stack is multi-protocol.

  1. Hard-coding one protocol’s eval surface. Evaluators written directly against the OpenAI chat completions shape, no abstraction over the trace. Every new protocol triggers a rewrite.
  2. Ignoring distributed handoff observability. Multi-agent evals that score each agent in isolation and miss the seam where the handoff dropped context. The end-to-end dashboard looks green while critical hops failed silently.
  3. No streaming-guard eval. Realtime and streaming surfaces ship with post-call evaluators only. By the time the evaluator sees the output, the dangerous tokens are already on the wire.
  4. No per-tool audit on MCP. Tool descriptors and tool-call arguments are accepted from any MCP server without a gateway-level scan. An untrusted server installs a backdoor that the eval layer never sees.
  5. No multi-modal eval. Voice and computer-use surfaces are scored only on text. ASR errors and vision errors drive most of the production defects and are silently invisible.

For more on the operational side see agent observability vs evaluation vs benchmarking and when an agent passes evals but fails production.

Honest Framing: What Ships Today, What is On the Roadmap

To keep this useful for scoping, three honest notes:

  • The trace-stream-to-agent-opt connector that closes the loop between production failures and automated agent re-optimization is on the roadmap. Today the loop is closed manually: the Error Feed clusters failures, the Sonnet 4.5 Judge writes an immediate_fix, the engineer applies it.
  • Eval-driven optimization on prompts ships today via six optimizers on the Platform. The same surface for agent-graph optimization is in development.
  • Linear is the only Error Feed ticketing integration today. Jira, GitHub Issues, and ServiceNow are in the queue.
  • Future AGI Protect ML weights are closed. The gateway self-hosts in your VPC. The ML hop reaches api.futureagi.com or, under the enterprise license, your own private vLLM endpoint. That is the trade-off for the ML quality you get; full on-prem ML weights are not a current option.

What to Do This Week

If you only have a half day, run Step 1 of the audit. List every protocol surface in production, including the ones developers added without telling you. The list is usually longer than the team expects, and it tells you which evaluators are currently scoring the wrong path.

If you have a week, run Steps 1 through 3. Inventory, instrument, write the rubric sets. The output is a per-protocol rubric document and a working trace per surface. That document is the artifact that survives every future protocol revision, because it is written against the path rather than the surface.

If you have a month, run all five steps and wire the Error Feed into the team’s ticket queue. By then the eval stack is protocol-neutral, the audit is repeatable, and the next protocol that ships does not trigger a rebuild.

Protocol evolution is the constant. The eval stack you build now should expect it.

Frequently asked questions

Why does agent protocol evolution change LLM eval in 2026?
Every new agent protocol changes how tool calls, handoffs, streams, and modalities show up in your traces. Eval that was wired to a single protocol surface scores the wrong thing once your stack adds a second one. Protocol-neutral eval traces, span kinds, and rubrics keep evaluators valid as MCP, A2A, Responses, Realtime, Tool Use, and ADK ship new revisions.
Which agent protocols matter for eval in 2026?
Six dominate. MCP for tool discovery and per-tool audit. A2A for distributed agent handoffs. OpenAI Responses API for streaming plus structured output plus tool calls in one envelope. OpenAI Realtime API for bidirectional voice. Anthropic Tool Use for native tool blocks and multi-modal vision. Google ADK plus Vertex Agent Engine for session-state and Gemini-native multi-modal.
What is distributed handoff observability and why does eval need it?
When one agent calls another over A2A, most eval stacks drop context at the handoff seam. The traceAI SDK emits A2A_CLIENT and A2A_SERVER span kinds and propagates a gen_ai.a2a.propagated_trace_id attribute across the hop, so you can score the full multi-agent path instead of one isolated agent at a time.
How do streaming and Realtime APIs change guardrail eval?
Streaming responses arrive token by token. Voice arrives chunk by chunk. You cannot wait for the full response to score it. Streaming guardrails use a check_interval to evaluate partial output, and act mid-stream with stop or disclaimer actions when a policy fires.
What is per-tool audit and why is MCP a special case?
MCP exposes a tool catalog from a server that may be untrusted. Per-tool audit means evaluating each tool descriptor and each tool-call argument set against policy. The Future AGI gateway uses a dual scanner: mcpsec.go at the chat-completion stage and toolguard.go at per-tool-call time via the mcp.ToolCallGuard interface.
How do you keep one eval stack protocol-neutral across MCP, A2A, Responses, Realtime, and ADK?
Use pluggable semantic conventions so your spans normalize across vendors, span kinds that already cover A2A_CLIENT, A2A_SERVER, computer_use, voice, image, GUARDRAIL, and EVALUATOR, and rubric templates that work across protocols such as Groundedness, LLMFunctionCalling, TaskCompletion, AnswerRefusal, Toxicity, PromptInjection, and DataPrivacyCompliance.
What is the 5-step protocol-readiness audit?
Step 1, inventory which protocols your agent stack uses today. Step 2, instrument with traceAI and confirm the right span kinds emit per protocol. Step 3, define per-protocol rubric sets, with per-tool audit on MCP and per-hop audit on A2A. Step 4, run a per-protocol regression suite and gate on per-protocol thresholds. Step 5, monitor with the Error Feed, cluster by protocol via HDBSCAN, and let the Sonnet 4.5 Judge write an immediate_fix per cluster.
Related Articles
View all