
Evaluating Google ADK Agents in 2026

Google ADK's opinionated primitives (Sequential, Parallel, Loop, sub-agent dispatch) demand ADK-native eval, not a LangChain rig in a trench coat.

google-adk agent-evaluation vertex-ai gemini llm-evaluation multi-agent 2026

Google ADK is the most opinionated agent framework on the market. SequentialAgent runs sub-agents in order. ParallelAgent runs them concurrently and merges. LoopAgent re-invokes until a condition holds. Sub-agent dispatch is a typed tool call (transfer_to_agent). VertexAiSearchTool and GoogleSearchTool are first-class registered tools, not user code. That opinion makes some eval primitives easy (the trace is structured) and others a problem (the rubrics ADK ships are built for the development inner loop and grade Google’s idea of correctness, not your domain’s). The eval that matches ADK reads the workflow primitive, scores sub-agent dispatch, separates ADK-tool failures from custom-tool failures, grades Vertex Search retrieval, and asserts coherence across parallel branches. This post is that eval, end to end.

Why ADK eval differs from AutoGen and CrewAI

AutoGen treats agents as participants in a group chat; the substrate is messages. CrewAI treats them as role-typed workers with a process over tasks; the substrate is tasks. ADK treats them as a tree of typed agents with workflow primitives that compose. The substrate is the workflow. That difference shows up in three places the eval has to handle.

Workflow primitives are part of the contract. A SequentialAgent’s correctness is a property of the order and what got dropped at each boundary. A ParallelAgent’s correctness is a property of the merge. A LoopAgent’s correctness is a property of the termination signal. Score per primitive or you cannot bisect the failure.

Sub-agent dispatch is a tool call, not a routing decision. AutoGen routes via a manager LLM. CrewAI routes via process configuration. ADK routes via transfer_to_agent, which the LLM emits as a function call that the framework intercepts. Dispatch failures look like tool-call failures in the trace, and you score them with the same function-calling rubric plus a sub-agent-name F1 with an irrelevance bucket. A generic LangChain or CrewAI rig does not have this surface.

The tool registry is two-tier. ADK ships managed tools (VertexAiSearchTool, GoogleSearchTool, BuiltInCodeExecutor, AgentTool) alongside your custom BaseTool subclasses. A custom tool either raises or returns; a managed tool returns a blob the model has to interpret. Mixing the two under one tool_correctness score hides Vertex Search retrieval regressions behind a strong global mean. For the wider framing, see agent observability vs evaluation vs benchmarking.

The five ADK-native axes

A complete eval suite for an ADK agent covers five axes that no generic agent evaluation framework gives you out of the box.

1. Sub-agent dispatch correctness

Dispatch is the load-bearing decision in any multi-agent ADK build. A coordinator with three sub-agents (billing, support, escalation) is exactly one bad transfer away from sending a refund request to support. Three sub-scores cover it.

  • Sub-agent F1 with an irrelevance bucket. Macro-F1 across the registry with no_transfer as a class. Without the bucket, the over-dispatch failure where a prompt revision makes the coordinator bolder is invisible until users complain.
  • Scope preservation across the handoff. Read the Session snapshot on each side of the transfer and score whether the keys the coordinator wrote (user_id, intent, verified_human) survived into the sub-agent’s context.
  • Return-to-coordinator behavior. Multi-hop tasks break in two ways: the sub-agent never returns, or control returns but the coordinator forgets the original task. Score AGENT span tree depth against the gold trajectory and final-response match against the original user intent, not the sub-agent’s last output.
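The first sub-score can be computed without any SDK. What follows is a minimal sketch, assuming the gold and predicted dispatch labels have already been pulled off the trace as plain strings; `dispatch_macro_f1` and the label lists are hypothetical names used only for illustration.

```python
from collections import Counter

NO_TRANSFER = "no_transfer"  # irrelevance bucket: the coordinator should keep the turn

def dispatch_macro_f1(gold, predicted, registry):
    """Macro-F1 over every sub-agent in the registry plus the no_transfer class."""
    classes = list(registry) + [NO_TRANSFER]
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[g] += 1  # gold class gets a false negative
    per_class = {}
    for c in classes:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        total = precision + recall
        per_class[c] = 2 * precision * recall / total if total else 0.0
    return sum(per_class.values()) / len(classes), per_class

# Hypothetical labels: one turn that needed no transfer was sent to support.
gold = ["billing", "support", NO_TRANSFER, NO_TRANSFER, "escalation"]
pred = ["billing", "support", "support", NO_TRANSFER, "escalation"]
macro, per_class = dispatch_macro_f1(gold, pred, ["billing", "support", "escalation"])
print(round(macro, 3), per_class)
```

The single over-dispatch lowers both the `support` and the `no_transfer` F1, which is the signal that disappears when the bucket is left out. Note that in this sketch a class with no examples scores 0.0 and pulls the macro average down, so the golden set should cover every class.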

The traceAI google-adk instrumentor surfaces the transfer_to_agent call as a TOOL span and the receiving sub-agent run as a fresh AGENT span. Source: traceAI/python/frameworks/google-adk/traceai_google_adk/_wrappers.py. The eval reads the call graph off the trace; no manual logging.

2. ADK-tool registry vs custom-tool correctness

Split the rubric by tool class.

For custom BaseTool subclasses, score the four function-calling axes the ai-evaluation SDK ships as local heuristic metrics: function_name_match, parameter_validation, function_call_accuracy, plus return-payload utilization (Groundedness with the context slot pointed at the tool return). All four run deterministically off the TOOL span attributes. Source: ai-evaluation/python/fi/evals/metrics/function_calling/metrics.py.

For managed ADK tools, the question is the output, not the invocation. The model chose to call VertexAiSearchTool. That choice rarely fails. What fails is the response chain that conditions on what came back. Score managed tools downstream:

  • VertexAiSearchTool: Recall@k on a labeled query-to-doc-id set against the configured datastore, plus Groundedness and ChunkAttribution on the final answer.
  • GoogleSearchTool: snippets carry no stable chunk IDs. Score Groundedness against the snippet text directly, plus citation correctness (did the agent attribute the claim or hallucinate the source).
  • BuiltInCodeExecutor: parse/run correctness on the return, plus output utilization.
  • AgentTool (the wrapper that turns another agent into a tool): treat as sub-agent dispatch and score under axis 1.

The mistake we see most: treating Vertex Search like a function call. A tool_correctness score that runs function_name_match on VertexAiSearchTool will be 0.99 forever while the agent silently grounds on stale chunks. The split rubric makes the regression visible. For the broader pattern, see evaluating tool-calling agents.
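The masking effect is easy to reproduce with plain numbers. This is a sketch only: the span dictionaries, the `lookup_order` tool, and the scores are made up for illustration, and the tool-class sets simply mirror the names used in this post.

```python
from statistics import mean

# Tool names follow this post; the span dict shape is an assumption.
MANAGED_TOOLS = {"VertexAiSearchTool", "GoogleSearchTool", "BuiltInCodeExecutor"}
DISPATCH_TOOLS = {"AgentTool", "transfer_to_agent"}  # scored under axis 1 instead

def tool_class(tool_name):
    if tool_name in DISPATCH_TOOLS:
        return "dispatch"
    if tool_name in MANAGED_TOOLS:
        return "managed"
    return "custom"

def scores_by_class(scored_spans):
    """Average per-span scores separately for each tool class."""
    buckets = {}
    for span in scored_spans:
        buckets.setdefault(tool_class(span["tool_name"]), []).append(span["score"])
    return {cls: mean(vals) for cls, vals in buckets.items()}

# Nine healthy custom-tool spans and one weak Vertex Search span.
spans = [{"tool_name": "lookup_order", "score": 0.98}] * 9
spans.append({"tool_name": "VertexAiSearchTool", "score": 0.55})

global_mean = mean(s["score"] for s in spans)
by_class = scores_by_class(spans)
print(round(global_mean, 3), by_class)
```

The global mean stays above 0.93 while the managed bucket sits at 0.55, so a gate on the global number would pass this run.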

3. Vertex Search retrieval quality

Vertex AI Search is the Google-specific RAG layer ADK ships out of the box. The retrieval call hides inside an opaque tool invocation, so a faithfulness regression looks identical to a model regression unless you score retrieval as its own step.

The pattern: build a labeled set of (query, expected_doc_ids, expected_chunk_text) against your datastore. Replay the agent’s retrieval. Compute Recall@k at a fixed cutoff (the rubric and CI gate in this post use k = 5). Pair with Groundedness and ChunkAttribution on the retrieved chunks, plus ContextAdherence and ChunkUtilization on the final answer. The split bisects the failure: low Recall@k says fix the datastore (chunking, embeddings, the index); high Recall@k with low Groundedness says fix the system prompt or swap the Gemini variant.
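Recall@k itself is a set intersection and needs no judge model. A minimal sketch, assuming the retrieved document IDs have been read off the tool span in rank order; the IDs other than those in the golden-set example are invented.

```python
def recall_at_k(expected_doc_ids, retrieved_doc_ids, k):
    """Fraction of expected documents that appear in the top-k retrieved list."""
    expected = set(expected_doc_ids)
    if not expected:
        return None  # an unlabeled case cannot be scored
    top_k = set(retrieved_doc_ids[:k])
    return len(expected & top_k) / len(expected)

expected = ["refund_policy_v3", "order_4421"]
retrieved = ["faq_12", "refund_policy_v3", "shipping_v2", "tos_v9", "faq_3", "order_4421"]

r5 = recall_at_k(expected, retrieved, 5)
r10 = recall_at_k(expected, retrieved, 10)
print(r5, r10)
```

Here one of the two expected documents ranks sixth, so it counts at k = 10 but not at k = 5. Reporting more than one cutoff shows whether a miss is a ranking problem or an indexing problem.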

Two Vertex-specific gotchas worth naming. Datastore region (global, us, eu) silently affects retrieval quality on cross-language corpora; tag every test case with a language field and alert when any non-English subset falls more than 5 points below the English baseline. The implicit extractive vs generative answer mode can flip on a GCP console update; pin the mode and assert it in CI. For deeper RAG metrics, see the 2026 LLM evaluation playbook.
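The language alert can be a few lines over per-case results. This sketch assumes scores on a 0 to 1 scale, so "5 points" becomes 0.05, and assumes each result row carries a `language` tag and a `recall` score; both field names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def language_gaps(cases, baseline="en", max_gap=0.05):
    """Return {language: gap} for subsets more than max_gap below the baseline mean."""
    by_lang = defaultdict(list)
    for case in cases:
        by_lang[case["language"]].append(case["recall"])
    means = {lang: mean(vals) for lang, vals in by_lang.items()}
    base = means[baseline]
    return {
        lang: base - m
        for lang, m in means.items()
        if lang != baseline and base - m > max_gap
    }

cases = [
    {"language": "en", "recall": 0.90}, {"language": "en", "recall": 0.92},
    {"language": "de", "recall": 0.80}, {"language": "de", "recall": 0.82},
    {"language": "fr", "recall": 0.88}, {"language": "fr", "recall": 0.90},
]
alerts = language_gaps(cases)
print(alerts)
```

Only the German subset is flagged in this example; French trails English by about two points and stays under the threshold.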

4. Parallel-execution coherence

ParallelAgent and LoopAgent are where ADK earns its opinionated reputation, and where the eval thins out fastest.

For ParallelAgent, three failure modes recur. Write-set collision: two branches write to the same Session key and the last writer wins. Deterministic check: collect the state_delta from each branch’s AGENT span, intersect the key sets, flag any non-empty intersection. Merge-decision correctness: when branches return conflicting answers, the parent picks one. Score the choice against a labeled rubric (a CustomLLMJudge named ParallelMergeCorrectness) and watch the per-branch contribution rate so a parent that always picks branch one is visible. Side-effect isolation: a branch that calls a write-class tool (charge_card, send_email, delete_record) before the merge leaks. Inspect every TOOL span inside a parallel block for verbs on a configurable write list and fail any unjustified call.
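The two deterministic checks above can be sketched in plain Python. The input shapes are assumptions: `branch_deltas` stands in for the state_delta collected per branch, and each tool span is reduced to a dict with hypothetical `span_id` and `tool_name` fields.

```python
from itertools import combinations

def write_set_collisions(branch_deltas):
    """branch_deltas: {branch_name: state_delta}. Returns {(a, b): shared keys}."""
    collisions = {}
    for a, b in combinations(sorted(branch_deltas), 2):
        shared = set(branch_deltas[a]) & set(branch_deltas[b])
        if shared:
            collisions[(a, b)] = sorted(shared)
    return collisions

# Configurable write list; these verbs are the examples named in the text.
WRITE_TOOLS = {"charge_card", "send_email", "delete_record"}

def unjustified_writes(tool_spans, justified_span_ids=frozenset()):
    """Write-class tool calls inside a parallel block that nobody has signed off on."""
    return [
        s["span_id"]
        for s in tool_spans
        if s["tool_name"] in WRITE_TOOLS and s["span_id"] not in justified_span_ids
    ]

deltas = {
    "pricing": {"user_intent": "refund", "quote": 12},
    "policy": {"user_intent": "complaint", "policy_id": "v3"},
    "history": {"orders": 3},
}
tool_spans = [
    {"span_id": "t1", "tool_name": "lookup_order"},
    {"span_id": "t2", "tool_name": "send_email"},
]
collisions = write_set_collisions(deltas)
leaks = unjustified_writes(tool_spans)
print(collisions, leaks)
```

Both checks return evidence (the colliding keys, the offending span IDs) rather than a bare boolean, which makes the CI failure message actionable.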

For LoopAgent, the termination condition is the failure surface. A loop that re-invokes until a state key is set will run forever if the sub-agent reads the wrong key. Track the iteration-count distribution against a gold budget. The instrumentor emits one AGENT span per iteration, so the count reads off the trace. For the broader frame, see best multi-agent debugging tools.
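Counting iterations off the trace might look like the sketch below. It assumes a flat list of span dicts with hypothetical `span_id`, `parent_id`, `kind`, and `name` fields, and that each iteration's AGENT span is a direct child of the loop's AGENT span; check the actual parent-child layout in your traces before relying on it.

```python
from collections import Counter

def iteration_counts(spans, loop_agent_name):
    """Count child AGENT spans under each run of the named loop agent."""
    loop_ids = {
        s["span_id"]
        for s in spans
        if s["kind"] == "AGENT" and s["name"] == loop_agent_name
    }
    counts = Counter({loop_id: 0 for loop_id in loop_ids})
    for s in spans:
        if s["kind"] == "AGENT" and s.get("parent_id") in loop_ids:
            counts[s["parent_id"]] += 1
    return counts

def over_budget(counts, budget):
    """Loop runs whose iteration count exceeds the gold budget."""
    return {loop_id: n for loop_id, n in counts.items() if n > budget}

spans = [{"span_id": "L1", "parent_id": None, "kind": "AGENT", "name": "retry_loop"}]
spans += [{"span_id": f"a{i}", "parent_id": "L1", "kind": "AGENT", "name": "searcher"} for i in range(3)]
spans += [{"span_id": "L2", "parent_id": None, "kind": "AGENT", "name": "retry_loop"}]
spans += [{"span_id": f"b{i}", "parent_id": "L2", "kind": "AGENT", "name": "searcher"} for i in range(6)]

counts = iteration_counts(spans, "retry_loop")
flagged = over_budget(counts, budget=4)
print(dict(counts), flagged)
```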

5. Vertex Agent Engine parity

ADK runs locally and deploys to Vertex AI Agent Engine. The two surfaces drift on safety_settings defaults, Gemini revision pinning, tool timeout values, and Session serialization. Run every scenario against both, capture the Session and tool trajectory, and score the pair with a VertexParity custom rubric. Gate deploys on a parity threshold. This is the cheapest insurance against the bug where the local agent works, the deployed agent works, and the two diverge in production. The deeper hosting-side patterns live in evaluating Vertex AI Agent Engine.
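A deterministic pre-check can run before any rubric-based parity score. The sketch below is not the VertexParity rubric; it assumes each run has been reduced to a hypothetical dict holding the ordered tool names and the final Session state.

```python
def parity_report(local, deployed):
    """Compare tool trajectory and final state for one scenario on two surfaces."""
    keys = set(local["state"]) | set(deployed["state"])
    drift = {
        k: (local["state"].get(k), deployed["state"].get(k))
        for k in keys
        if local["state"].get(k) != deployed["state"].get(k)
    }
    return {
        "same_trajectory": local["tools"] == deployed["tools"],
        "state_drift": drift,
    }

def parity_rate(reports):
    """Share of scenarios with an identical trajectory and no state drift."""
    clean = sum(r["same_trajectory"] and not r["state_drift"] for r in reports)
    return clean / len(reports)

local = {"tools": ["transfer_to_agent", "lookup_order"],
         "state": {"user_id": "u1", "verified_human": True}}
deployed = {"tools": ["transfer_to_agent", "lookup_order"],
            "state": {"user_id": "u1", "verified_human": "true"}}

report = parity_report(local, deployed)
print(report, parity_rate([report]))
```

The example shows a serialization drift: the same flag comes back as a boolean locally and as a string from the deployed run, even though the tool trajectory is identical.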

The traceAI Google ADK instrumentor

The trace is the unit. Two lines of setup attach OpenTelemetry to every ADK run in the process.

pip install google-adk traceai-google-adk ai-evaluation

# Then, in Python:
from fi_instrumentation import register, ProjectType
from traceai_google_adk import GoogleADKInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="adk-prod",
)
GoogleADKInstrumentor().instrument(tracer_provider=trace_provider)

After that, every Runner.run_async emits a CHAIN span, every BaseAgent invocation emits an AGENT span with gen_ai.agent.name, every BaseTool call emits a TOOL span with gen_ai.tool.name and gen_ai.tool.description, and every Gemini call emits an LLM span with gen_ai.image.* and gen_ai.voice.* attributes for multi-modal payloads. The standard fi.span.kind taxonomy applies (CHAIN, AGENT, LLM, TOOL, RETRIEVER, GUARDRAIL, EVALUATOR), so the same evaluator runs on a single agent and on a SequentialAgent + ParallelAgent composition without rewriting the rubric. For distributed ADK using A2A across processes, traceAI ships A2A_CLIENT and A2A_SERVER span kinds.
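Once spans are exported, most of the structural scores reduce to walking the tree. As a rough sketch, assuming spans have been flattened to dicts with hypothetical `span_id`, `parent_id`, and `kind` fields (the real exported attribute names may differ), AGENT nesting depth can be read like this:

```python
def agent_depth(spans):
    """Deepest chain of nested AGENT spans in a flat span list."""
    by_id = {s["span_id"]: s for s in spans}

    def depth(span):
        own = 1 if span["kind"] == "AGENT" else 0
        parent = by_id.get(span.get("parent_id"))
        return own + (depth(parent) if parent else 0)

    return max((depth(s) for s in spans), default=0)

# Hypothetical trace: a coordinator transfers to billing, which calls one tool.
trace = [
    {"span_id": "c0", "parent_id": None, "kind": "CHAIN"},
    {"span_id": "a1", "parent_id": "c0", "kind": "AGENT"},   # coordinator
    {"span_id": "t1", "parent_id": "a1", "kind": "TOOL"},    # transfer_to_agent
    {"span_id": "a2", "parent_id": "a1", "kind": "AGENT"},   # billing
    {"span_id": "t2", "parent_id": "a2", "kind": "TOOL"},    # custom tool
]
d = agent_depth(trace)
print(d)
```

Comparing this number against the depth of the gold trajectory is one cheap way to catch a sub-agent that never returns or a coordinator that dispatches one hop too many.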

Wiring the five axes into ai-evaluation

The ai-evaluation SDK (Apache 2.0) ships 60+ EvalTemplate classes, 13 guardrail backends, and four distributed runners (Celery, Ray, Temporal, Kubernetes) that parallelize the eval across Gemini variants and Vertex datastore regions without changing the rubric code.

from fi.evals import Evaluator
from fi.evals.templates import (
    EvaluateFunctionCalling,
    TaskCompletion,
    Groundedness,
    ContextAdherence,
    ChunkAttribution,
    ChunkUtilization,
    AnswerRefusal,
    CustomLLMJudge,
)

sub_agent_dispatch = CustomLLMJudge(
    name="SubAgentDispatchCorrectness",
    rubric=(
        "Score 1 if the coordinator transferred to the expected sub-agent "
        "(or correctly did not transfer when no_transfer was expected), "
        "preserved the required Session keys across the handoff, and "
        "returned control to the coordinator on multi-hop tasks. "
        "Use AGENT span tree and the transfer_to_agent TOOL span attributes."
    ),
)

vertex_search_recall = CustomLLMJudge(
    name="VertexSearchRecallAtK",
    rubric=(
        "Score the fraction of expected_doc_ids that appear in the "
        "VertexAiSearchTool TOOL span's results, top-5. Penalize if "
        "Groundedness on the retrieved chunks is below 0.85."
    ),
)

parallel_merge = CustomLLMJudge(
    name="ParallelMergeCorrectness",
    rubric=(
        "Given conflicting branch outputs from a ParallelAgent and the "
        "merged parent answer, score whether the merge chose the branch "
        "whose answer best matches the gold response. Flag write-set "
        "collisions and unjustified side-effect tool calls in branches."
    ),
)

evaluator = Evaluator()
report = evaluator.evaluate(
    eval_templates=[
        EvaluateFunctionCalling(),
        TaskCompletion(),
        Groundedness(),
        ContextAdherence(),
        ChunkAttribution(),
        ChunkUtilization(),
        AnswerRefusal(),
        sub_agent_dispatch,
        vertex_search_recall,
        parallel_merge,
    ],
    inputs=golden_set,
)

The golden set carries the ADK-specific labels.

from fi.evals import TestCase

golden_set = [
    TestCase(
        input="Refund my last order, the food was cold.",
        expected_transfer_agent="billing",
        expected_session_keys=["user_id", "order_id", "verified_human"],
        expected_doc_ids=["refund_policy_v3", "order_4421"],
        retrieval_context_required=True,
        metadata={
            "scenario": "refund_dispatch",
            "workflow_primitive": "SequentialAgent",
            "should_block": False,
        },
    ),
    # 50-100 cases per route, stratified by sub-agent, workflow primitive,
    # Vertex Search datastore region, and Gemini variant
]

Run the suite across every Gemini variant the agent might resolve to. The default ADK matrix in 2026 is gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, plus the Live variants for voice. The Ray runner finishes a 200-case suite across four variants in single-digit minutes on a modest cluster.
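The variant matrix is a cross product of cases and models, and the result should be read per variant rather than pooled. A small sketch with hypothetical job and result dicts; how jobs are actually submitted to a runner is left out.

```python
from collections import defaultdict
from itertools import product
from statistics import mean

VARIANTS = ["gemini-2.5-pro", "gemini-2.5-flash", "gemini-2.5-flash-lite"]

def build_jobs(case_ids, variants=VARIANTS):
    """One job per (case, variant) pair."""
    return [{"case_id": c, "model": m} for c, m in product(case_ids, variants)]

def per_variant_mean(results):
    """Average score per model so a weak variant cannot hide in the pool."""
    by_model = defaultdict(list)
    for r in results:
        by_model[r["model"]].append(r["score"])
    return {model: mean(scores) for model, scores in by_model.items()}

jobs = build_jobs(range(200))
# Made-up results for illustration only.
results = [
    {"model": "gemini-2.5-pro", "score": 1.0},
    {"model": "gemini-2.5-pro", "score": 1.0},
    {"model": "gemini-2.5-flash", "score": 1.0},
    {"model": "gemini-2.5-flash", "score": 0.0},
]
means = per_variant_mean(results)
print(len(jobs), means)
```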

The SDK also exposes enable_auto_enrichment() and enrich_span_with_evaluation(), which attach the score and reason back onto the scored span. Observe then shows per-turn quality next to per-turn latency and cost on the same timeline; cross-axis debugging becomes one trace view instead of three dashboards.

CI gate: per-axis thresholds, not an aggregate

The bug is treating one agent_score as a ship gate. A 0.85 aggregate hides a 0.62 on sub-agent dispatch behind a 0.97 on tool selection, and the production failure rides on the dispatch layer. Wire per-axis assertions.

# config.yaml for `fi run`
assertions:
  - "sub_agent_dispatch.score >= 0.95 for at_least 95% of cases"
  - "function_call_accuracy.score >= 0.90 for at_least 90% of cases"
  - "vertex_search_recall_at_5.score >= 0.85 for at_least 90% of cases"
  - "groundedness.score >= 0.90 for at_least 90% of cases"
  - "parallel_merge.score >= 0.90 for at_least 90% of cases"
  - "task_completion.score >= 0.85 for at_least 90% of cases"
  - "vertex_parity.score >= 0.92 for at_least 95% of cases"

When the gate fails, the failing assertion names the axis that regressed, which is where the bisect starts. One bisect instead of three days.
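The semantics of "score >= t for at_least p% of cases" can be sketched independently of any CLI. The code below is an illustration of the gating logic only, not the `fi run` implementation; the gate table and scores are made up.

```python
from statistics import mean

def passes(scores, threshold, at_least):
    """True if at least `at_least` (0 to 1) of the cases score >= threshold."""
    if not scores:
        return False  # an axis with no scored cases should not pass silently
    ok = sum(s >= threshold for s in scores)
    return ok / len(scores) >= at_least

def failing_axes(results, gates):
    """Names of axes whose per-axis assertion does not hold."""
    return [
        axis
        for axis, (threshold, at_least) in gates.items()
        if not passes(results.get(axis, []), threshold, at_least)
    ]

gates = {
    "sub_agent_dispatch": (0.95, 0.95),
    "function_call_accuracy": (0.90, 0.90),
}
results = {
    "sub_agent_dispatch": [1.0] * 18 + [0.0] * 2,   # 90% of cases pass
    "function_call_accuracy": [0.95] * 20,          # 100% of cases pass
}
aggregate = mean(s for scores in results.values() for s in scores)
failed = failing_axes(results, gates)
print(round(aggregate, 3), failed)
```

The pooled mean comes out at 0.925 and would clear a single 0.85 ship gate, while the per-axis check fails on dispatch.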

Production observability and Error Feed

CI is necessary, not sufficient. The eval set is a snapshot; production is a river. Score the live trace stream with the same rubrics and you get the regression signal the offline set cannot have, because the offline set was frozen before users found the failure.

Error Feed sits inside the eval stack. Failing ADK traces land in ClickHouse with their span embeddings. HDBSCAN soft-clustering groups them. Each cluster fires a JudgeAgent on Claude Sonnet 4.5 for a 30-turn investigation across eight span-tools (read_span, get_children, get_spans_by_type, search_spans, plus a Haiku Chauffeur for spans over 3000 characters, with ~90% prompt-cache hit).

Per cluster, the Judge writes three artifacts engineers read: a 5-category, 30-subtype taxonomy, a 4-D trace score (factual grounding, privacy and safety, instruction adherence, optimal plan execution; 1-5 each), and an immediate_fix. On ADK the recurring clusters look like:

  • “Coordinator skips the billing transfer on refund intent when verified_human is false.” Fix: dispatch regardless of verification state and let billing handle the check.
  • “VertexAiSearchTool returns zero results in eu-west1 for German queries; LoopAgent never terminates.” Fix: pin datastore to global, re-index multilingual content, add a max-iteration guard.
  • “ParallelAgent branches both write user_intent and the second branch wins.” Fix: namespace state keys per branch (branch_a.user_intent) and resolve in the merge.
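The third fix, namespacing state keys per branch and resolving in the merge, might look like the following sketch. The function names and the resolver policy are hypothetical; in a real ADK build the merge would live in whatever agent runs after the ParallelAgent.

```python
def namespace_delta(branch, delta):
    """Prefix every key a branch writes with the branch name."""
    return {f"{branch}.{key}": value for key, value in delta.items()}

def merge_branches(branch_deltas, resolve):
    """Keep each branch's namespaced writes, then resolve one value per bare key."""
    merged, candidates = {}, {}
    for branch, delta in branch_deltas.items():
        merged.update(namespace_delta(branch, delta))
        for key, value in delta.items():
            candidates.setdefault(key, {})[branch] = value
    for key, by_branch in candidates.items():
        merged[key] = resolve(key, by_branch)
    return merged

def prefer_branch_a(key, by_branch):
    """Example policy: branch_a wins when it wrote the key, otherwise take any writer."""
    if "branch_a" in by_branch:
        return by_branch["branch_a"]
    return next(iter(by_branch.values()))

deltas = {
    "branch_a": {"user_intent": "refund"},
    "branch_b": {"user_intent": "complaint", "order_id": "4421"},
}
state = merge_branches(deltas, prefer_branch_a)
print(state)
```

Because both namespaced values survive in the merged state, the merge decision is explicit and can be scored instead of being an accident of write order.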

Each fix feeds the Platform’s self-improving evaluators, so the next eval run already knows the failure mode. Linear is the only ticket destination wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap. For the loop from named issue back to fixed agent, automated optimization for agents walks through pointing one of agent-opt’s six optimizers (RandomSearch, BayesianSearch with Optuna, MetaPrompt, ProTeGi, GEPA, PromptWizard) at the ADK agent’s instruction field with the eval suite as the scoring function.

Five ADK anti-patterns

Patterns we see often enough to name.

  1. Treating ADK’s built-in evaluators as the production eval. final_response_match_v2, tool_trajectory_avg_score, hallucinations_v1, safety_v1 are useful for the development inner loop. They do not see sub-agent dispatch, do not separate VertexAiSearchTool from custom tools, and do not run on the live trace stream. Use them locally; do not promote them to the CI gate.
  2. One tool_correctness score across the registry. Score managed ADK tools on their output. Score custom BaseTool subclasses on function-calling axes. Mixing the two hides Vertex Search retrieval regressions behind strong global means.
  3. No irrelevance bucket on sub-agent dispatch. Without no_transfer as a class, the over-dispatch failure where a prompt revision makes the coordinator bolder is invisible until users complain.
  4. Skipping parallel-execution write-set checks. A ParallelAgent that lets two branches write to the same Session key is a race condition shaped like an agent.
  5. Single-variant golden set when production routes across Gemini variants. A pass on gemini-2.5-pro is not a pass on gemini-2.5-flash. Run the suite against every variant the agent might resolve to.

How Future AGI ships the ADK eval stack

Three packages do the work. They are designed to be used together, but ship independently.

traceAI (Apache 2.0). GoogleADKInstrumentor for Runner, BaseAgent, BaseTool, and the Gemini call path, across Python, TypeScript, and Java. 14 span kinds with the standard fi.span.kind taxonomy. 50+ AI surfaces total across 4 languages. Pluggable semantic conventions at register() time (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) so spans flow into whatever OTel collector you already run.

ai-evaluation (Apache 2.0). 60+ EvalTemplate classes including EvaluateFunctionCalling, TaskCompletion, Groundedness, ContextAdherence, ChunkAttribution, ChunkUtilization, AnswerRefusal, and CustomLLMJudge for the ADK-specific axes. 20+ local heuristic metrics (including function_name_match, parameter_validation, function_call_accuracy). 13 guardrail backends (including Llama Guard 3, Qwen3Guard, Granite Guardian, WildGuard, ShieldGemma, Turing Flash, Turing Safety, OpenAI Moderation, Azure Content Safety). Four distributed runners parallelize the matrix across Gemini variants and Vertex regions.

Agent Command Center (Apache 2.0, single Go binary). The gateway includes Gemini and Vertex AI as native providers (100+ total) and exposes a /v1beta adapter so ADK calls Gemini directly without the OpenAI-translation hop. Every response carries x-prism-cost, x-prism-latency-ms, x-prism-model-used, and on fallback x-prism-fallback-used headers. 18+ built-in scanners + 15 third-party adapters. ~29k req/s, P99 ≤ 21 ms with guardrails on, on t3.xlarge. The gateway self-hosts inside your GCP project, which keeps Gemini and Vertex traffic in-residency for EU and India workloads.

The eval-stack story is one package across three surfaces: code-first per-axis scoring through the SDK, hosted self-improving evaluators on the Platform at lower per-eval cost than Galileo Luna-2, and Error Feed inside the same loop so failure clusters drive the next eval run. The Platform is SOC 2 Type II, HIPAA, GDPR, and CCPA certified per futureagi.com/trust; ISO/IEC 27001 is in active audit.

Ready to evaluate your first ADK agent? Wire the GoogleADKInstrumentor against a sandboxed ADK build this afternoon, drop the seven CI assertions above into your `fi run` config against the ai-evaluation SDK, and route the live trace stream through Agent Command Center so Error Feed can start clustering the dispatch and Vertex Search failures the offline set hasn’t seen yet.

Frequently asked questions

Why does evaluating Google ADK agents need a different rig than LangChain or CrewAI?
ADK is the most opinionated agent framework on the market. It ships SequentialAgent, ParallelAgent, and LoopAgent as first-class workflow primitives, sub-agent dispatch through the typed `transfer_to_agent` tool call, a typed tool registry (BaseTool + AgentTool + VertexAiSearchTool + GoogleSearchTool), and a Gemini-native call path that treats image and audio as first-class. None of those primitives have a clean equivalent in LangChain or CrewAI, so a generic agent eval rig drops three signals: it cannot score sub-agent dispatch correctness, it cannot tell ADK-tool failures from custom-tool failures, and it cannot grade Vertex Search retrieval as its own step. ADK eval has to read the workflow primitive in the trace and score per primitive, not per turn.
What is sub-agent dispatch correctness and how do you score it?
Sub-agent dispatch is ADK's mechanism for handing control to a specialist. A coordinator agent calls `transfer_to_agent(agent_name='billing')`, ADK switches the active agent, and the conversation continues inside billing's instructions and tool list. Dispatch correctness has three sub-scores: F1 per sub-agent against the gold dispatch label including an irrelevance bucket for `no transfer expected`, scope preservation that checks the right Session state keys carried across the handoff, and return-to-coordinator behavior on multi-hop tasks. The traceAI google-adk instrumentor emits an AGENT span per active agent and a TOOL span for the `transfer_to_agent` call itself, so the dispatch graph reads off the trace without custom plumbing.
Do I evaluate built-in ADK tools the same way as custom tools?
No. ADK's tool registry separates managed tools (VertexAiSearchTool, GoogleSearchTool, BuiltInCodeExecutor, the new AgentTool that wraps another agent) from your `BaseTool` subclasses. Managed tools fail in opaque ways. Vertex Search returns a results blob that the model has to ground on; GoogleSearchTool's snippets carry no chunk IDs; BuiltInCodeExecutor runs out-of-process. Score managed tools on their output quality (Recall@k on Vertex Search against a labeled set, Groundedness on the answer that conditions on the snippets) rather than on argument schemas. Score custom tools on the four function-calling axes (`function_name_match`, `parameter_validation`, `function_call_accuracy`, return-payload utilization). Mixing the two patterns hides regressions in both.
How do you grade ParallelAgent execution coherence?
ParallelAgent runs sub-agents concurrently and merges their outputs into the parent state. Three things go wrong: branches write to the same Session key and clobber each other, the merge step picks the wrong branch when results conflict, or one branch's tool calls leak side effects (a write to the database) before the others have run. Grade coherence on three signals: write-set disjointness across branches (deterministic, off the AGENT spans), merge-decision correctness against a labeled rubric, and side-effect isolation per branch (any TOOL span with a write-class verb during a parallel block needs justification). LoopAgent has a related but distinct failure: the termination condition checks state that changes mid-iteration. Score the loop on iteration-count distribution against the gold budget.
What does traceAI capture for Google ADK that ADK's own evaluators do not?
ADK ships `final_response_match_v2`, `tool_trajectory_avg_score`, `hallucinations_v1`, and `safety_v1` for the development inner loop. Those are useful for a notebook check on the agent's final response. They do not surface span trees, they do not name the active sub-agent per turn, and they do not separate VertexAiSearchTool spans from custom-tool spans. The traceAI google-adk instrumentor emits a CHAIN span for every Runner invocation, an AGENT span for every BaseAgent run (with the agent name as an attribute), a TOOL span for every BaseTool call including built-in tools, and an LLM span for every Gemini call with `gen_ai.image.*` and `gen_ai.voice.*` attributes for multi-modal payloads. The trace becomes the unit of evaluation rather than the response.
How does FAGI plug into Vertex AI Agent Engine deployments specifically?
Two paths. The traceAI google-adk instrumentor runs inside the Agent Engine container if you bring it into your reasoning_engine deployment package, so production spans land in the same Observe view as local runs. The Agent Command Center gateway includes Gemini and Vertex AI as native providers and exposes a /v1beta adapter that ADK calls directly without the OpenAI-translation hop; every response carries `x-prism-cost`, `x-prism-latency-ms`, `x-prism-model-used`, and on fallback `x-prism-fallback-used` headers. The gateway self-hosts as a single Go binary inside the same GCP project as Agent Engine, which keeps the request path in-residency for EU and India workloads.
Can Error Feed cluster ADK-specific failure modes?
Yes. Failing ADK traces flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering groups them; a Claude Sonnet 4.5 Judge runs a 30-turn investigation across eight span-tools (read_span, get_children, get_spans_by_type, search_spans, plus a Haiku Chauffeur for spans over 3000 characters, with ~90% prompt-cache hit). Per cluster, the Judge writes a 5-category 30-subtype taxonomy, a 4-dimensional trace score (factual grounding, privacy and safety, instruction adherence, optimal plan execution; 1-5 each), and an `immediate_fix`. On ADK the recurring clusters look like 'coordinator forgets to route to billing sub-agent on refunds', 'Vertex Search returns zero results in eu-west and the loop never terminates', or 'ParallelAgent branches both write to `user_intent` and the second branch wins'. Linear is the only ticket destination wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.