AI Evaluations

AI Agents

How to Evaluate Google ADK Agents with FutureAGI


Last Updated

Mar 11, 2026

By

Rishav Hada

Time to read

16 mins

Table of Contents

  1. Introduction

Google’s Agent Development Kit (ADK) has carved out a solid reputation as a framework for building multi-agent systems. It gives you workflow orchestration through SequentialAgent, ParallelAgent, and LoopAgent, plays well with Gemini models, and deploys natively to Vertex AI Agent Engine or Cloud Run. ADK also ships with a built-in evaluation framework that lets you test tool trajectories, score responses, and even run hallucination checks during development. That evaluation layer covers a lot of ground for pre-deployment testing.

Where things get thin is after deployment. ADK’s eval tools are designed for the development loop: you define test cases, run them through pytest or the CLI, and check scores before merging code. That workflow does not extend into production. Once your agents start handling real user requests, you lose visibility into quality drift, cost attribution per agent, latency bottlenecks across workflow steps, and continuous quality scoring on live traffic. That gap between development eval and production Google ADK observability is exactly what this guide addresses.

We will walk through a complete Google ADK agent testing and monitoring setup. You will learn what ADK’s built-in evaluation actually measures (and where it falls short), how FutureAGI’s end-to-end stack for observability and evaluation fills the production gaps, how to instrument ADK agents with traceAI, how to evaluate every workflow pattern ADK supports, and how to set up dashboards that track cost, latency, and quality in real time.

  2. Understanding ADK’s Built-in Evaluation and Where It Falls Short

Before layering on external tooling, it is worth understanding what ADK already provides and where those capabilities hit their ceiling. According to the ADK evaluation criteria documentation, the framework ships with several evaluation metrics. The two foundational ones are tool_trajectory_avg_score and response_match_score. The first compares the sequence of tools your agent actually called against a list of expected calls, supporting EXACT, IN_ORDER, and ANY_ORDER match types. The second uses ROUGE-1, a word-overlap metric that predates generative AI by years, to measure how closely the agent’s final response matches a reference answer.

ADK has expanded beyond these two metrics over time. It now includes final_response_match_v2 (which uses an LLM as a judge for semantic equivalence), hallucinations_v1 (which segments responses and checks each sentence for grounding), safety_v1 (which delegates to the Vertex AI Eval SDK for harmlessness scoring), and rubric-based criteria for both response quality and tool usage. These additions are significant, and they make ADK’s eval story much stronger than it was at launch.

You define test cases in JSON files and run them via pytest, the CLI, or the web UI. Here is a basic pytest integration:

from google.adk.evaluation.agent_evaluator import AgentEvaluator
import pytest

@pytest.mark.asyncio
async def test_agent_basic():
    await AgentEvaluator.evaluate(
        agent_module="my_agent",
        eval_dataset_file_path_or_dir="tests/eval.test.json",
    )

This approach catches regressions during development and validates agent behavior before deployment. But there are clear gaps that show up once you move past the development loop:

  • No production monitoring: Every ADK eval method (web UI, CLI, pytest) runs against predefined test cases. None of them operate on live traffic. Once your agents serve real users, you have no automated way to track quality drift, cost spikes, or latency degradation over time.

  • No cost attribution: ADK does not break down token usage or API spend by individual agents in a multi-agent hierarchy. If one sub-agent is burning through your Gemini quota, you will not know from ADK’s eval output alone.

  • No per-step scoring in multi-agent pipelines: ADK evaluates the root agent’s final response. It does not score intermediate outputs from individual sub-agents within a SequentialAgent or ParallelAgent workflow, which means quality issues in early pipeline stages stay invisible until they corrupt the final output.

  • No continuous evaluation on live traffic: ADK’s hallucination and safety checks require predefined eval sets. There is no built-in mechanism to sample production requests and run async quality scoring without adding latency to user-facing responses.

  • Limited multimodal evaluation: If your ADK agents process images, audio, or video through Gemini, ADK’s eval criteria do not cover accuracy checks on those non-text modalities.

This is where an end-to-end observability and evaluation stack picks up the slack. FutureAGI’s traceAI, Evaluate, and Observe products are built for exactly this scenario: you keep ADK for orchestration and agent building, and you add a production-grade evaluation and observability layer on top.

  3. Why Use FutureAGI for Google ADK Evaluation and Observability?

Before jumping into instrumentation code, it helps to understand what FutureAGI actually brings to the table and why it complements ADK’s built-in eval rather than replacing it.

Future AGI and Google ADK

Figure 1: Future AGI x Google ADK

FutureAGI operates as a full-lifecycle platform for AI evaluation, observability, and optimization. For ADK users specifically, three products matter. traceAI is an open-source, OpenTelemetry-native instrumentation package that auto-captures every agent invocation, tool call, LLM request, and sub-agent delegation as structured span data. Evaluate provides 50+ pre-built evaluation templates (groundedness, factual accuracy, instruction adherence, custom rubrics, and more) that run programmatically via SDK or API. Observe turns that trace and eval data into real-time dashboards for cost tracking, latency analysis, quality scoring, and alerting.

The key differentiator is not any single feature. It is the closed loop: trace your agents in production, evaluate sampled traffic continuously, surface quality drops in dashboards, and feed failures back into your eval datasets. ADK’s built-in eval handles the development inner loop. FutureAGI handles everything that comes after.

  4. Instrument ADK Agents with traceAI for Full Observability

The first step in any Google ADK monitoring setup is instrumentation. You need to capture every agent invocation, tool call, sub-agent handoff, and model interaction as structured trace data. traceAI handles this through OpenTelemetry-compatible auto-instrumentation. No manual span creation required.

Step 1: Install the Required Packages

pip install traceai-google-adk google-adk

Step 2: Set Environment Variables

import os
os.environ["FI_API_KEY"] = "your-futureagi-api-key"
os.environ["FI_SECRET_KEY"] = "your-futureagi-secret-key"
os.environ["GOOGLE_API_KEY"] = "your-google-api-key"

Step 3: Initialize the Trace Provider and Instrument

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_google_adk import GoogleADKInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="my_adk_project",
)

GoogleADKInstrumentor().instrument(
    tracer_provider=trace_provider
)

Once this is in place, traceAI automatically picks up every ADK operation (agent invocations, tool executions, LLM completions, and workflow agent orchestration via SequentialAgent, ParallelAgent, and LoopAgent) as OpenTelemetry spans. When your root agent delegates work to a sub-agent, both invocations appear as linked parent-child spans in your trace data. That gives you full ADK agent tracing through the entire hierarchy without writing any additional instrumentation code.

The trace data flows directly to FutureAGI’s Observe dashboard, where you see execution visualized as nested timelines. Click into any span to inspect inputs, outputs, token counts, and latency. From here, you can also set up cost tracking per agent, latency alerts, quality drift detection on sampled traffic, and error rate monitoring across your entire ADK deployment.
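To illustrate what per-agent cost attribution computes, here is a minimal sketch that aggregates token counts from exported span records into per-agent totals. The `agent_name` and `total_tokens` field names and the price constant are assumptions for illustration, not the actual span schema.

```python
from collections import defaultdict

def cost_per_agent(spans, price_per_1k_tokens=0.0003):
    """Aggregate token usage and estimated spend by agent.

    spans: iterable of span-record dicts; `agent_name` and `total_tokens`
    are hypothetical field names standing in for real span attributes.
    """
    totals = defaultdict(int)
    for span in spans:
        totals[span["agent_name"]] += span["total_tokens"]
    return {
        name: {"tokens": t, "cost": t / 1000 * price_per_1k_tokens}
        for name, t in totals.items()
    }
```

Feeding this a day's worth of spans immediately shows which sub-agent in the hierarchy is consuming the most tokens, which is the same question the dashboard answers out of the box.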

  5. Evaluate ADK Workflow Patterns: Sequential, Parallel, Loop, and Dynamic

ADK supports four workflow patterns, and each one creates different failure modes. Here is how to approach Google ADK agent testing for each pattern.

5.1 Sequential Workflows (SequentialAgent)

A SequentialAgent runs sub-agents in order. Agent A finishes, Agent B picks up where A left off. The core risk is error compounding: if Agent A produces a weak output, every downstream agent inherits that problem and may amplify it.

Evaluation approach: Score each step’s output independently rather than only grading the final result. Use FutureAGI’s Evaluate SDK to run groundedness and factual accuracy checks on every intermediate result.

from fi.evals import Evaluator

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

# step_a_context and step_a_output are the context given to and the
# output produced by the first sub-agent in the sequence
result = evaluator.evaluate(
    eval_templates="groundedness",
    inputs={
        "context": step_a_context,
        "output": step_a_output,
    },
    model_name="turing_flash"
)

5.2 Parallel Workflows (ParallelAgent)

A ParallelAgent runs multiple sub-agents at the same time. Think of a travel planner where a flight agent and a hotel agent run simultaneously. The evaluation challenge is whether the results from independent agents were merged correctly and whether the combined output stays consistent.

Evaluation approach: First, check each parallel branch for individual quality using groundedness scoring. Then evaluate the merged output for coherence and consistency. FutureAGI’s instruction adherence metric works well here because it verifies that the merging logic followed whatever rules you specified in the orchestrator’s instructions.
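The branch-then-merge check can be sketched as plain decision logic. The per-branch groundedness scores and the merged-output adherence score would come from Evaluate calls like the one shown for sequential workflows; the function name and thresholds below are illustrative assumptions, not part of any API.

```python
def parallel_merge_check(branch_scores, merge_adherence,
                         branch_min=0.7, merge_min=0.8):
    """Flag weak parallel branches and a poorly merged final output.

    branch_scores: dict mapping branch name -> groundedness score (0-1)
    merge_adherence: instruction-adherence score (0-1) for the merged output
    Thresholds are illustrative; tune them to your own quality bar.
    """
    weak_branches = [name for name, score in branch_scores.items()
                     if score < branch_min]
    return {
        "weak_branches": weak_branches,
        "merge_ok": merge_adherence >= merge_min,
    }
```

For the travel-planner example, a strong flight branch, a weak hotel branch, and a clean merge would surface only the hotel branch for review.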

5.3 Loop Workflows (LoopAgent)

A LoopAgent repeats a sub-agent until a termination condition is met. The risks are infinite loops on one end and premature exits on the other. Did the agent loop enough times to actually reach the quality bar? Or did it bail early because of a timeout?

Evaluation approach: Track how many iterations actually ran and compare that against the range you expected. Check whether the final output genuinely met the quality threshold or whether the loop just timed out. You can pull iteration counts straight from traceAI spans and match them against output quality scores to spot the difference.
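That comparison can be sketched as follows; the iteration count would be read from your traceAI spans, and the function name and thresholds here are illustrative assumptions.

```python
def check_loop_run(iterations, expected_range, quality_score, quality_bar):
    """Classify a LoopAgent run: premature exit, runaway loop, or low quality.

    iterations: how many times the loop body actually ran (from trace spans)
    expected_range: (min, max) iteration counts you consider healthy
    quality_score / quality_bar: final-output eval score vs. your threshold
    """
    lo, hi = expected_range
    issues = []
    if iterations < lo:
        issues.append("premature_exit")
    elif iterations > hi:
        issues.append("possible_runaway")
    if quality_score < quality_bar:
        # e.g. the loop hit a timeout before the output converged
        issues.append("quality_below_bar")
    return issues
```

A run that exits after two iterations when you expected at least three is flagged even if its final score happens to pass, which is exactly the early-bail case described above.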

5.4 Dynamic Routing (LLM-Driven Agent Transfer)

ADK’s LLMAgent can dynamically route to sub-agents based on the user’s request. The LLM reads the user query, decides which sub-agent should handle it, and transfers control. This gives you maximum flexibility but also makes evaluation harder because routing decisions are non-deterministic.

Evaluation approach: Build a Google ADK tool trajectory evaluation dataset that maps input queries to expected sub-agent selections. Treat this as a classification accuracy test. Run your test queries through the agent, record which sub-agent was selected each time, and measure how often the router picked the correct one. If misrouting exceeds an acceptable threshold, revisit your routing instructions or the sub-agent descriptions the LLM uses to make its decisions.
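Scoring that dataset reduces to classification accuracy. A minimal sketch, with illustrative names:

```python
def routing_accuracy(observed, expected):
    """observed: list of (query_id, selected_agent) pairs from test runs.
    expected: dict mapping query_id -> the sub-agent that should handle it."""
    correct = sum(1 for qid, agent in observed if expected.get(qid) == agent)
    return correct / len(observed)

# Example: 2 of 3 routing decisions match the labeled dataset
expected = {"q1": "flight_agent", "q2": "hotel_agent", "q3": "flight_agent"}
observed = [("q1", "flight_agent"), ("q2", "hotel_agent"),
            ("q3", "hotel_agent")]
```

Tracking this number across prompt revisions tells you whether changes to the routing instructions or sub-agent descriptions are actually helping.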

5.5 Workflow Evaluation Quick Reference

Workflow Type   | Primary Risk                     | Evaluation Metric                  | Tool
SequentialAgent | Error compounding across steps   | Per-step groundedness scoring      | FutureAGI Evaluate
ParallelAgent   | Inconsistent merge results       | Coherence + instruction adherence  | FutureAGI Evaluate
LoopAgent       | Infinite loops or premature exit | Iteration count + output quality   | traceAI + Evaluate
Dynamic Routing | Wrong sub-agent selection        | Routing accuracy (classification)  | FutureAGI Evaluate

Table 1: Workflow Evaluation

  6. LLM-Agnostic Quality Checks for ADK Agents

ADK is model-agnostic by design: it works with Gemini, but it also supports other LLMs through its BaseLlm interface. The quality checks below apply regardless of which model powers your agents. Whether you are running Gemini 2.5 Flash, GPT-4.1, or a fine-tuned open-source model, these evaluations catch the same categories of failure.

6.1 Multimodal Input Evaluation

If your ADK agents process images, PDFs, audio, or video, you need to verify that the model correctly interprets non-text inputs. FutureAGI supports multimodal evaluation across text, image, audio, and video modalities. You can run accuracy checks on how your agent understood visual or audio content within the ADK pipeline.

6.2 Grounding Verification

When agents pull in external context, whether through Google Search grounding, RAG retrieval, or tool outputs, there is always a risk of misinterpreting or selectively citing that context. FutureAGI’s groundedness evaluator checks whether the agent’s claims are actually supported by the context it was given. It does not require manual ground truth labels, which makes it practical for automated pipelines.

6.3 Code Execution Accuracy

ADK agents that use code execution tools can generate and run Python in a sandboxed environment. The evaluation question is straightforward: did the generated code produce the correct result? Run ADK response quality evaluation to check code outputs against expected results. Layer a factual accuracy template on top to catch situations where the code runs without errors but returns the wrong answer.

6.4 Hallucination Detection in Production

ADK now includes hallucinations_v1 for pre-deployment hallucination checks using an LLM-as-a-judge approach. That covers development-time testing well. For production traffic, though, you need a way to run Google ADK hallucination detection continuously on sampled live requests without blocking user responses. FutureAGI’s evaluation SDK runs async hallucination checks on production traces, feeding results into the Observe dashboard so you can catch grounding failures as they happen, not days later during a manual review.
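One common pattern for sampling live requests is deterministic sampling: hash the trace ID so the same request always gets the same decision, then queue sampled traces for asynchronous evaluation. A sketch, assuming a 10 percent default rate (the helper name is illustrative):

```python
import hashlib

def should_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministically pick roughly `rate` of traces for async evaluation.

    Hashing the trace ID (instead of calling random()) means retries and
    replays of the same request land in the same sample bucket.
    """
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < int(rate * 10_000)

# Sampled traces can then be queued for an async evaluator.evaluate(...)
# call (e.g. the groundedness template) without blocking the user response.
```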

  7. Production Monitoring for ADK Agents

Pre-deployment eval catches known failure modes. Production monitoring catches everything else. Once your ADK agents are live, whether on Vertex AI Agent Engine, Cloud Run, or your own infrastructure, you need continuous Google ADK monitoring to track quality, cost, and latency over time.

Since you have already instrumented your agents with traceAI (covered in the instrumentation section above), the Observe dashboard automatically picks up all trace data and gives you:

  • Cost per agent: Token usage and API costs broken down by individual agents within your hierarchy. You will know exactly which sub-agent is consuming the most resources.

  • Latency per workflow step: Pinpoint bottlenecks. Find out whether your SequentialAgent is slow because of a single sub-agent or whether the LoopAgent is taking too many iterations.

  • Quality drift detection: Run continuous evaluation on sampled production traffic. FutureAGI’s eval metrics run asynchronously, so they do not add latency to your agent responses.

  • Error rate tracking: Monitor tool call failures, LLM errors, and agent timeout rates across your entire ADK deployment.

  8. Complete ADK Agent with FutureAGI Integration

Here is a full working example that combines ADK agent definition with traceAI instrumentation and production-ready monitoring:

import os
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_google_adk import GoogleADKInstrumentor
from google.adk.agents import Agent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService

# Configure keys
os.environ["FI_API_KEY"] = "your-futureagi-api-key"
os.environ["FI_SECRET_KEY"] = "your-futureagi-secret-key"
os.environ["GOOGLE_API_KEY"] = "your-google-api-key"

# Initialize tracing
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="adk_production",
)
GoogleADKInstrumentor().instrument(tracer_provider=trace_provider)

# Define your ADK agent (search_tool and summarize_tool stand in for
# your own tool definitions)
agent = Agent(
    name="research_agent",
    model="gemini-2.5-flash",
    instruction="You are a research assistant.",
    tools=[search_tool, summarize_tool],
)

# Run with full tracing
session_service = InMemorySessionService()
runner = Runner(agent=agent, app_name="research",
    session_service=session_service)

Every interaction through this runner is now traced, scored, and visible in your FutureAGI dashboard. No additional code changes required for production deployment.

  9. ADK Built-in Eval vs. FutureAGI: Feature Comparison

Capability                      | ADK Built-in                       | FutureAGI
Tool trajectory matching        | Yes (exact, in-order, any-order)   | Yes (via traceAI spans)
Response scoring                | ROUGE-1 + LLM-as-judge (v2)        | LLM-based quality scoring (50+ templates)
Hallucination detection         | Yes (hallucinations_v1, dev-time)  | Yes (groundedness evaluator, dev + production)
Multi-agent per-step scoring    | Final response only                | Per-agent and per-step scoring
Production monitoring           | Cloud Trace (latency only)         | Cost, latency, quality, errors
Multimodal evaluation           | No                                 | Text, image, audio, video
CI/CD integration               | pytest + adk eval CLI              | SDK + API + pytest compatible
Cost attribution                | No                                 | Per-agent token and cost tracking
Continuous eval on live traffic | No                                 | Async eval on sampled production data

Table 2: ADK Built-in Eval vs. FutureAGI

  10. Best Practices for Google ADK Agent Performance Testing

Based on real production deployments, here are the patterns that work best for Google ADK agent performance testing:

  • Start with ADK’s built-in eval for development: Use .test.json files and adk eval for fast feedback loops during agent development. Keep test cases focused on tool trajectory and response matching. Use hallucinations_v1 and safety_v1 for pre-deployment quality gates.

  • Add FutureAGI Evaluate for production-grade quality gates: Before merging code, run FutureAGI’s evaluation SDK in your CI pipeline. Check groundedness, factual accuracy, and instruction adherence across a broader set of criteria than ADK’s built-in metrics cover.

  • Instrument early with traceAI: Adding observability after deployment is significantly harder than building it in from the start. Instrument your agents on day one.

  • Monitor quality continuously in production: Sample 5 to 10 percent of production traffic for asynchronous evaluation. Set alerts for quality score drops.

  • Separate eval by workflow type: Do not use the same evaluation criteria for SequentialAgent and ParallelAgent workflows. Each pattern has different failure modes and needs different metrics.

  • Version your eval datasets: As your agents evolve, your test cases should evolve too. Use FutureAGI’s dataset management to track evaluation data alongside your agent code.

  11. Conclusion

Google ADK evaluation does not end with passing a .test.json file. ADK gives you a strong foundation for pre-deployment testing with trajectory matching, ROUGE scoring, LLM-as-judge response matching, and even hallucination detection. That covers the development inner loop well.

But production agents need more: continuous quality scoring on live traffic, per-agent cost attribution, per-step quality checks in multi-agent workflows, multimodal evaluation, and real-time dashboards that tell you what is actually happening once users show up.

FutureAGI fills that gap with three products that map directly to the ADK lifecycle. traceAI auto-instruments your agent hierarchy with zero code changes. Evaluate scores every workflow step with 50+ evaluation templates. Observe gives you real-time dashboards for cost, latency, and quality across your entire Google ADK monitoring setup.

If you are building with ADK, start by instrumenting your agents with traceAI. It takes five lines of code and gives you full visibility from day one. From there, add evaluation gates to your CI pipeline and set up production dashboards as your agents scale.

Ready to evaluate your first ADK agent? Get started with FutureAGI and follow the Google ADK integration guide.

Frequently Asked Questions

How do you evaluate Google ADK agents in production?

What metrics does ADK’s built-in evaluation support?

Can FutureAGI detect hallucinations in Google ADK agents?

How does Google ADK multi-agent evaluation work?


Rishav Hada is an Applied Scientist at Future AGI, specializing in AI evaluation and observability. Previously at Microsoft Research, he built frameworks for generative AI evaluation and multilingual language technologies. His research, funded by Twitter and Meta, has been published in top AI conferences and earned the Best Paper Award at FAccT’24.

Related Articles


Ready to deploy Accurate AI?

Book a Demo