Introduction
Google’s Agent Development Kit (ADK) has carved out a solid reputation as a framework for building multi-agent systems. It gives you workflow orchestration through SequentialAgent, ParallelAgent, and LoopAgent, plays well with Gemini models, and deploys natively to Vertex AI Agent Engine or Cloud Run. ADK also ships with a built-in evaluation framework that lets you test tool trajectories, score responses, and even run hallucination checks during development. That evaluation layer covers a lot of ground for pre-deployment testing.
Where things get thin is after deployment. ADK’s eval tools are designed for the development loop: you define test cases, run them through pytest or the CLI, and check scores before merging code. That workflow does not extend into production. Once your agents start handling real user requests, you lose visibility into quality drift, cost attribution per agent, latency bottlenecks across workflow steps, and continuous quality scoring on live traffic. That gap between development eval and production Google ADK observability is exactly what this guide addresses.
We will walk through a complete Google ADK agent testing and monitoring setup. You will learn what ADK’s built-in evaluation actually measures (and where it falls short), how FutureAGI’s end-to-end stack for observability and evaluation fills the production gaps, how to instrument ADK agents with traceAI, how to evaluate every workflow pattern ADK supports, and how to set up dashboards that track cost, latency, and quality in real time.
Understanding ADK’s Built-in Evaluation and Where It Falls Short
Before layering on external tooling, it is worth understanding what ADK already provides and where those capabilities hit their ceiling. According to the ADK evaluation criteria documentation, the framework ships with several evaluation metrics. The two foundational ones are tool_trajectory_avg_score and response_match_score. The first compares the sequence of tools your agent actually called against a list of expected calls, supporting EXACT, IN_ORDER, and ANY_ORDER match types. The second uses ROUGE-1, a word-overlap metric that predates generative AI by years, to measure how closely the agent’s final response matches a reference answer.
ADK has expanded beyond these two metrics over time. It now includes final_response_match_v2 (which uses an LLM as a judge for semantic equivalence), hallucinations_v1 (which segments responses and checks each sentence for grounding), safety_v1 (which delegates to the Vertex AI Eval SDK for harmlessness scoring), and rubric-based criteria for both response quality and tool usage. These additions are significant, and they make ADK’s eval story much stronger than it was at launch.
You define test cases in JSON files and run them via pytest, the CLI, or the web UI. Here is a basic pytest integration:
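A minimal sketch of such a test, assuming the `AgentEvaluator` helper from `google.adk.evaluation` and a placeholder agent module and eval set path (recent ADK versions expose this call as async, so the example uses pytest-asyncio):

```python
import pytest
from google.adk.evaluation.agent_evaluator import AgentEvaluator

@pytest.mark.asyncio
async def test_tool_trajectory_and_response():
    # Runs every case in the eval set against the agent and fails the
    # test if tool_trajectory_avg_score or response_match_score drops
    # below the thresholds configured for the eval set.
    await AgentEvaluator.evaluate(
        agent_module="my_agent",  # placeholder: package exposing root_agent
        eval_dataset_file_path_or_dir="tests/basic.test.json",  # placeholder path
    )
```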
This approach catches regressions during development and validates agent behavior before deployment. But there are clear gaps that show up once you move past the development loop:
No production monitoring: Every ADK eval method (web UI, CLI, pytest) runs against predefined test cases. None of them operate on live traffic. Once your agents serve real users, you have no automated way to track quality drift, cost spikes, or latency degradation over time.
No cost attribution: ADK does not break down token usage or API spend by individual agents in a multi-agent hierarchy. If one sub-agent is burning through your Gemini quota, you will not know from ADK’s eval output alone.
No per-step scoring in multi-agent pipelines: ADK evaluates the root agent’s final response. It does not score intermediate outputs from individual sub-agents within a SequentialAgent or ParallelAgent workflow, which means quality issues in early pipeline stages stay invisible until they corrupt the final output.
No continuous evaluation on live traffic: ADK’s hallucination and safety checks require predefined eval sets. There is no built-in mechanism to sample production requests and run async quality scoring without adding latency to user-facing responses.
Limited multimodal evaluation: If your ADK agents process images, audio, or video through Gemini, ADK’s eval criteria do not cover accuracy checks on those non-text modalities.
This is where an end-to-end observability and evaluation stack picks up the slack. FutureAGI’s traceAI, Evaluate, and Observe products are built for exactly this scenario: you keep ADK for orchestration and agent building, and you add a production-grade evaluation and observability layer on top.
Why Use FutureAGI for Google ADK Evaluation and Observability?
Before jumping into instrumentation code, it helps to understand what FutureAGI actually brings to the table and why it complements ADK’s built-in eval rather than replacing it.

Figure 1: Future AGI x Google ADK
FutureAGI operates as a full-lifecycle platform for AI evaluation, observability, and optimization. For ADK users specifically, three products matter. traceAI is an open-source, OpenTelemetry-native instrumentation package that auto-captures every agent invocation, tool call, LLM request, and sub-agent delegation as structured span data. Evaluate provides 50+ pre-built evaluation templates including groundedness, factual accuracy, instruction adherence, and custom rubrics that run programmatically via SDK or API. Observe turns that trace and eval data into real-time dashboards for cost tracking, latency analysis, quality scoring, and alerting.
The key differentiator is not any single feature. It is the closed loop: trace your agents in production, evaluate sampled traffic continuously, surface quality drops in dashboards, and feed failures back into your eval datasets. ADK’s built-in eval handles the development inner loop. FutureAGI handles everything that comes after.
Instrument ADK Agents with traceAI for Full Observability
The first step in any Google ADK monitoring setup is instrumentation. You need to capture every agent invocation, tool call, sub-agent handoff, and model interaction as structured trace data. traceAI handles this through OpenTelemetry-compatible auto-instrumentation. No manual span creation required.
Step 1: Install the Required Packages
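The package names below follow FutureAGI’s traceAI naming convention and may differ from the current releases; check the integration guide before installing:

```shell
pip install google-adk            # the Agent Development Kit itself
pip install traceAI-google-adk    # assumed name of the traceAI instrumentor for ADK
pip install futureagi             # assumed name of the Evaluate/Observe SDK
```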
Step 2: Set Environment Variables
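A sketch of the variables involved; FI_API_KEY and FI_SECRET_KEY are assumed credential names for FutureAGI’s SDK, so verify them against your account settings:

```shell
export FI_API_KEY="your-futureagi-api-key"        # assumed variable name
export FI_SECRET_KEY="your-futureagi-secret-key"  # assumed variable name
export GOOGLE_API_KEY="your-gemini-api-key"       # used by ADK for Gemini calls
```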
Step 3: Initialize the Trace Provider and Instrument
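A sketch under assumed module names (`fi_instrumentation`, `traceai_google_adk`); the shape mirrors FutureAGI’s other traceAI instrumentors, but confirm the exact imports in the integration guide:

```python
from fi_instrumentation import register            # assumed module
from fi_instrumentation.fi_types import ProjectType
from traceai_google_adk import GoogleADKInstrumentor

# Register an OpenTelemetry trace provider pointed at FutureAGI.
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="adk-agent-demo",  # placeholder project name
)

# From here on, every ADK agent, tool, and LLM call is auto-captured.
GoogleADKInstrumentor().instrument(tracer_provider=trace_provider)
```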
Once this is in place, traceAI automatically picks up every ADK operation (agent invocations, tool executions, LLM completions, and workflow agent orchestration via SequentialAgent, ParallelAgent, and LoopAgent) as OpenTelemetry spans. When your root agent delegates work to a sub-agent, both invocations appear as linked parent-child spans in your trace data. That gives you full ADK agent tracing through the entire hierarchy without writing any additional instrumentation code.
The trace data flows directly to FutureAGI’s Observe dashboard, where you see execution visualized as nested timelines. Click into any span to inspect inputs, outputs, token counts, and latency. From here, you can also set up cost tracking per agent, latency alerts, quality drift detection on sampled traffic, and error rate monitoring across your entire ADK deployment.
Evaluate ADK Workflow Patterns: Sequential, Parallel, Loop, and Dynamic
ADK supports four workflow patterns, and each one creates different failure modes. Here is how to approach Google ADK agent testing for each pattern.
5.1 Sequential Workflows (SequentialAgent)
A SequentialAgent runs sub-agents in order. Agent A finishes, Agent B picks up where A left off. The core risk is error compounding: if Agent A produces a weak output, every downstream agent inherits that problem and may amplify it.
Evaluation approach: Score each step’s output independently rather than only grading the final result. Use FutureAGI’s Evaluate SDK to run groundedness and factual accuracy checks on every intermediate result.
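A sketch of per-step scoring, assuming the Evaluate SDK exposes an `Evaluator` client with named templates; the client name, template id, and input fields here are assumptions, so check the SDK reference for the real signatures:

```python
from fi.evals import Evaluator  # assumed client name

evaluator = Evaluator()

# intermediate_outputs: one (step_name, output, context) tuple per
# SequentialAgent step, collected from session state or traceAI spans.
for step_name, output, context in intermediate_outputs:
    result = evaluator.evaluate(
        eval_templates="groundedness",  # assumed template id
        inputs={"output": output, "context": context},
    )
    print(step_name, result)  # flag weak steps before they compound
```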
5.2 Parallel Workflows (ParallelAgent)
A ParallelAgent runs multiple sub-agents at the same time. Think of a travel planner where a flight agent and a hotel agent run simultaneously. The evaluation challenge is whether the results from independent agents were merged correctly and whether the combined output stays consistent.
Evaluation approach: First, check each parallel branch for individual quality using groundedness scoring. Then evaluate the merged output for coherence and consistency. FutureAGI’s instruction adherence metric works well here because it verifies that the merging logic followed whatever rules you specified in the orchestrator’s instructions.
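Rule-based consistency checks can complement the LLM-based ones. As a toy illustration for the travel-planner merge (the field names are invented for this example):

```python
def check_merge_consistency(flight: dict, hotel: dict) -> bool:
    """Verify the hotel stay from one parallel branch covers the
    flight dates from the other. ISO date strings compare correctly
    as plain text, so no date parsing is needed."""
    return (hotel["check_in"] <= flight["arrival_date"]
            and hotel["check_out"] >= flight["return_date"])
```

In production you would run this kind of deterministic rule alongside the coherence and instruction-adherence evals, not instead of them.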
5.3 Loop Workflows (LoopAgent)
A LoopAgent repeats a sub-agent until a termination condition is met. The risks are infinite loops on one end and premature exits on the other. Did the agent loop enough times to actually reach the quality bar? Or did it bail early because of a timeout?
Evaluation approach: Track how many iterations actually ran and compare that against the range you expected. Check whether the final output genuinely met the quality threshold or whether the loop just timed out. You can pull iteration counts straight from traceAI spans and match them against output quality scores to spot the difference.
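The check above can be sketched as a plain function over span data pulled from traceAI (the span fields shown are illustrative, not the actual span schema):

```python
def check_loop_health(spans, min_iters, max_iters, quality_threshold):
    """Flag a LoopAgent run that exited early, overran its budget, or
    stopped before the output met the quality bar.

    spans: list of dicts like {"iteration": int, "quality_score": float},
    one per loop iteration, in order.
    """
    if not spans:
        return {"iterations": 0, "final_score": 0.0, "issues": ["no_iterations"]}
    iterations = max(s["iteration"] for s in spans)
    final_score = spans[-1]["quality_score"]
    issues = []
    if iterations < min_iters:
        issues.append("premature_exit")
    if iterations > max_iters:
        issues.append("iteration_overrun")
    if final_score < quality_threshold:
        issues.append("quality_not_met")
    return {"iterations": iterations, "final_score": final_score, "issues": issues}
```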
5.4 Dynamic Routing (LLM-Driven Agent Transfer)
ADK’s LLMAgent can dynamically route to sub-agents based on the user’s request. The LLM reads the user query, decides which sub-agent should handle it, and transfers control. This gives you maximum flexibility but also makes evaluation harder because routing decisions are non-deterministic.
Evaluation approach: Build a Google ADK tool trajectory evaluation dataset that maps input queries to expected sub-agent selections. Treat this as a classification accuracy test. Run your test queries through the agent, record which sub-agent was selected each time, and measure how often the router picked the correct one. If misrouting exceeds an acceptable threshold, revisit your routing instructions or the sub-agent descriptions the LLM uses to make its decisions.
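The measurement itself is ordinary classification accuracy. A self-contained sketch, with a toy keyword router standing in for the real agent invocation:

```python
def routing_accuracy(cases, route_fn):
    """cases maps query -> expected sub-agent name; route_fn runs the
    router (in practice: invoke the agent and read the selected
    sub-agent from the trace) and returns the chosen name."""
    hits = sum(1 for query, expected in cases.items() if route_fn(query) == expected)
    return hits / len(cases)

# Toy stand-in router, for illustration only.
def toy_router(query):
    if "flight" in query:
        return "flight_agent"
    if "hotel" in query:
        return "hotel_agent"
    return "flight_agent"  # deliberately bad fallback

cases = {
    "book a flight to Tokyo": "flight_agent",
    "find a hotel in Paris": "hotel_agent",
    "reserve a dinner table": "dining_agent",
}
accuracy = routing_accuracy(cases, toy_router)  # 2 of 3 routed correctly
```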
5.5 Workflow Evaluation Quick Reference
| Workflow Type | Primary Risk | Evaluation Metric | Tool |
| --- | --- | --- | --- |
| SequentialAgent | Error compounding across steps | Per-step groundedness scoring | FutureAGI Evaluate |
| ParallelAgent | Inconsistent merge results | Coherence + instruction adherence | FutureAGI Evaluate |
| LoopAgent | Infinite loops or premature exit | Iteration count + output quality | traceAI + Evaluate |
| Dynamic Routing | Wrong sub-agent selection | Routing accuracy (classification) | FutureAGI Evaluate |

Table 1: Workflow Evaluation
LLM-Agnostic Quality Checks for ADK Agents
ADK is model-agnostic by design: it works with Gemini, but it also supports other LLMs through its BaseLlm interface. The quality checks below apply regardless of which model powers your agents. Whether you are running Gemini 2.5 Flash, GPT-4.1, or a fine-tuned open-source model, these evaluations catch the same categories of failure.
6.1 Multimodal Input Evaluation
If your ADK agents process images, PDFs, audio, or video, you need to verify that the model correctly interprets non-text inputs. FutureAGI supports multimodal evaluation across text, image, audio, and video modalities. You can run accuracy checks on how your agent understood visual or audio content within the ADK pipeline.
6.2 Grounding Verification
When agents pull in external context (whether through Google Search grounding, RAG retrieval, or tool outputs), there is always a risk of misinterpreting or selectively citing that context. FutureAGI’s groundedness evaluator checks whether the agent’s claims are actually supported by the context it was given. It does not require manual ground truth labels, which makes it practical for automated pipelines.
6.3 Code Execution Accuracy
ADK agents that use code execution tools can generate and run Python in a sandboxed environment. The evaluation question is straightforward: did the generated code produce the correct result? Run ADK response quality evaluation to check code outputs against expected results. Layer a factual accuracy template on top to catch situations where the code runs without errors but returns the wrong answer.
6.4 Hallucination Detection in Production
ADK now includes hallucinations_v1 for pre-deployment hallucination checks using an LLM-as-a-judge approach. That covers development-time testing well. For production traffic, though, you need a way to run Google ADK hallucination detection continuously on sampled live requests without blocking user responses. FutureAGI’s evaluation SDK runs async hallucination checks on production traces, feeding results into the Observe dashboard so you can catch grounding failures as they happen, not days later during a manual review.
Production Monitoring for ADK Agents
Pre-deployment eval catches known failure modes. Production monitoring catches everything else. Once your ADK agents are live (whether on Vertex AI Agent Engine, Cloud Run, or your own infrastructure), you need continuous Google ADK monitoring to track quality, cost, and latency over time.
Since you have already instrumented your agents with traceAI (covered in the instrumentation section above), the Observe dashboard automatically picks up all trace data and gives you:
Cost per agent: Token usage and API costs broken down by individual agents within your hierarchy. You will know exactly which sub-agent is consuming the most resources.
Latency per workflow step: Pinpoint bottlenecks. Find out whether your SequentialAgent is slow because of a single sub-agent or whether the LoopAgent is taking too many iterations.
Quality drift detection: Run continuous evaluation on sampled production traffic. FutureAGI’s eval metrics run asynchronously, so they do not add latency to your agent responses.
Error rate tracking: Monitor tool call failures, LLM errors, and agent timeout rates across your entire ADK deployment.
Complete ADK Agent with FutureAGI Integration
Here is a full working example that combines ADK agent definition with traceAI instrumentation and production-ready monitoring:
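A sketch of the combined setup. The FutureAGI module names are assumed (as in the instrumentation section); `Agent` and `InMemoryRunner` are standard ADK classes:

```python
from fi_instrumentation import register              # assumed module
from fi_instrumentation.fi_types import ProjectType  # assumed module
from traceai_google_adk import GoogleADKInstrumentor # assumed module
from google.adk.agents import Agent
from google.adk.runners import InMemoryRunner

# 1. Instrument first so every subsequent ADK call is traced.
trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="support-agent",  # placeholder project name
)
GoogleADKInstrumentor().instrument(tracer_provider=trace_provider)

# 2. Define tools and the agent exactly as you would without tracing.
def get_order_status(order_id: str) -> dict:
    """Hypothetical tool: look up an order in your own backend."""
    return {"order_id": order_id, "status": "shipped"}

root_agent = Agent(
    name="support_agent",
    model="gemini-2.5-flash",
    instruction="Answer order questions. Use get_order_status for lookups.",
    tools=[get_order_status],
)

# 3. Every request through this runner shows up as a nested trace.
runner = InMemoryRunner(agent=root_agent)
```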
Every interaction through this runner is now traced, scored, and visible in your FutureAGI dashboard. No additional code changes required for production deployment.
ADK Built-in Eval vs. FutureAGI: Feature Comparison
| Capability | ADK Built-in | FutureAGI |
| --- | --- | --- |
| Tool trajectory matching | Yes (exact, in-order, any-order) | Yes (via traceAI spans) |
| Response scoring | ROUGE-1 + LLM-as-judge (v2) | LLM-based quality scoring (50+ templates) |
| Hallucination detection | Yes (hallucinations_v1, dev-time) | Yes (groundedness evaluator, dev + production) |
| Multi-agent per-step scoring | Final response only | Per-agent and per-step scoring |
| Production monitoring | Cloud Trace (latency only) | Cost, latency, quality, errors |
| Multimodal evaluation | No | Text, image, audio, video |
| CI/CD integration | pytest + adk eval CLI | SDK + API + pytest compatible |
| Cost attribution | No | Per-agent token and cost tracking |
| Continuous eval on live traffic | No | Async eval on sampled production data |

Table 2: ADK Built-in Eval vs. FutureAGI
Best Practices for Google ADK Agent Performance Testing
Based on real production deployments, here are the patterns that work best for Google ADK agent performance testing:
Start with ADK’s built-in eval for development: Use .test.json files and adk eval for fast feedback loops during agent development. Keep test cases focused on tool trajectory and response matching. Use hallucinations_v1 and safety_v1 for pre-deployment quality gates.
Add FutureAGI Evaluate for production-grade quality gates: Before merging code, run FutureAGI’s evaluation SDK in your CI pipeline. Check groundedness, factual accuracy, and instruction adherence across a broader set of criteria than ADK’s built-in metrics cover.
Instrument early with traceAI: Adding observability after deployment is significantly harder than building it in from the start. Instrument your agents on day one.
Monitor quality continuously in production: Sample 5 to 10 percent of production traffic for asynchronous evaluation. Set alerts for quality score drops.
Separate eval by workflow type: Do not use the same evaluation criteria for SequentialAgent and ParallelAgent workflows. Each pattern has different failure modes and needs different metrics.
Version your eval datasets: As your agents evolve, your test cases should evolve too. Use FutureAGI’s dataset management to track evaluation data alongside your agent code.
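The sampling decision in the continuous-monitoring practice above can be sketched as a deterministic hash on the trace id, so the same request is always in or out of the eval sample no matter which service asks:

```python
import hashlib

def should_evaluate(trace_id: str, sample_rate: float = 0.10) -> bool:
    """Return True for roughly sample_rate of trace ids, chosen
    deterministically so repeated checks on the same trace agree."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < int(sample_rate * 10_000)
```

Traces that pass the check get queued for async evaluation; user-facing latency is untouched either way.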
Conclusion
Google ADK evaluation does not end with passing a .test.json file. ADK gives you a strong foundation for pre-deployment testing with trajectory matching, ROUGE scoring, LLM-as-judge response matching, and even hallucination detection. That covers the development inner loop well.
But production agents need more: continuous quality scoring on live traffic, per-agent cost attribution, per-step quality checks in multi-agent workflows, multimodal evaluation, and real-time dashboards that tell you what is actually happening once users show up.
FutureAGI fills that gap with three products that map directly to the ADK lifecycle. traceAI auto-instruments your agent hierarchy with zero code changes. Evaluate scores every workflow step with 50+ evaluation templates. Observe gives you real-time dashboards for cost, latency, and quality across your entire Google ADK monitoring setup.
If you are building with ADK, start by instrumenting your agents with traceAI. It takes five lines of code and gives you full visibility from day one. From there, add evaluation gates to your CI pipeline and set up production dashboards as your agents scale.
Ready to evaluate your first ADK agent? Get started with FutureAGI and follow the Google ADK integration guide.
Frequently Asked Questions
How do you evaluate Google ADK agents in production?
What metrics does ADK’s built-in evaluation support?
Can FutureAGI detect hallucinations in Google ADK agents?
How does Google ADK multi-agent evaluation work?













