
Evaluate Google ADK Agents: The 6-Step 2026 Production Loop

Evaluate Google ADK agents in 6 steps: traceAI instrumentation, span-attached evaluate() scoring, AgentEvaluator CI gates, persona simulation, and Bayesian prompt opt.


Open In Colab

Originally published March 11, 2026. Updated May 14, 2026. Replaces the verbose Evaluator(...).evaluate(...).eval_results[0] pattern with the unified fi.evals.evaluate() API, drops manual span_id plumbing in favor of enable_auto_enrichment(), and points readers to agent-simulate and agent-opt for the pre-production and prompt-optimization steps. Steps 1 to 3 are runnable as shown. Steps 4 to 6 are integration patterns that adapt to your agent module, dataset, and mapper.

The ADK Production Eval Loop. Six stages from instrumentation through optimization, showing how Google ADK agents flow through traceAI, fi.evals.evaluate(), AgentEvaluator CI gates, agent-simulate scenarios, and agent-opt prompt optimization.

TL;DR: The 6-Step ADK Production Eval Loop

| Step | What you do | Tool | Runnable? |
|---|---|---|---|
| 1. Instrument | Auto-trace agents, sub-agents, LLM calls, tool calls | traceai-google-adk + register() | Yes, 30 seconds |
| 2. Score | Attach groundedness, factuality, safety to spans | fi.evals.evaluate() + enable_auto_enrichment() | Yes, copy-paste |
| 3. Gate | Block bad releases on .test.json thresholds | google.adk.evaluation.AgentEvaluator | Yes, pytest |
| 4. Recipe | One eval recipe per ADK workflow pattern | fi.evals.evaluate() per Sequential/Parallel/Loop/Routing | Integration pattern |
| 5. Simulate | Persona-driven multi-turn pre-prod scenarios | ADK user-simulator (chat) or agent-simulate (voice) | Integration pattern |
| 6. Optimize | Search prompt space on failing traces | fi.opt.optimizers.BayesianSearchOptimizer | Integration pattern |

Steps 1 to 3 are copy-paste runnable. Steps 4 to 6 are integration patterns that adapt to your agent module, dataset, and mapper. ADK 1.17+ is the floor version; the unified evaluate() API replaces the older Evaluator(...).evaluate(...).eval_results[0] pattern.

Why ADK’s Built-in Evaluation Falls Short After Deployment and What This Guide Covers

Google’s Agent Development Kit (ADK) has carved out a solid reputation as a framework for building multi-agent systems. It gives you workflow orchestration through SequentialAgent, ParallelAgent, and LoopAgent, plays well with Gemini models, and deploys natively to Vertex AI Agent Engine or Cloud Run. ADK also ships with a built-in evaluation framework that lets you test tool trajectories, score responses, and even run hallucination checks during development. That evaluation layer covers a lot of ground for pre-deployment testing.

Where things get thin is after deployment. ADK’s eval tools are designed for the development loop: you define test cases, run them through pytest or the CLI, and check scores before merging code. That workflow does not extend into production. Once your agents start handling real user requests, you lose visibility into quality drift, cost attribution per agent, latency bottlenecks across workflow steps, and continuous quality scoring on live traffic. That gap between development eval and production Google ADK observability is exactly what this guide addresses.

We will walk through a complete Google ADK agent testing and monitoring setup. You will learn what ADK’s built-in evaluation actually measures (and where it falls short), how FutureAGI’s end-to-end stack for observability and evaluation fills the production gaps, how to instrument ADK agents with traceAI, how to evaluate every workflow pattern ADK supports, and how to set up dashboards that track cost, latency, and quality in real time.

The 2026 ADK update story is short and worth getting straight up front. ADK 1.17+ ships native OpenTelemetry export to Google Cloud Observability via --otel_to_cloud, which makes trace export table-stakes. The unresolved layer is continuous, span-level quality scoring on live, multi-step agent traces. That is what the rest of this guide builds.

A Concrete Failure Your Tests Won’t Catch

The agent passes every .test.json case in CI, then in production you see a trace like this:

root_agent.run("Find me a vegetarian restaurant in Berlin and book a 7pm table")
├── search_agent.run(...)   tool="google_search"   ok
├── search_agent.run(...)   tool="google_search"   ok   (same query)
├── search_agent.run(...)   tool="google_search"   ok   (third identical call)
└── booking_agent.run(...)  tool="reserve_table"   ok
final_response = "I've booked you a table at a vegetarian restaurant in Berlin for 7pm."

tool_trajectory_avg_score passes with ANY_ORDER. response_match_score passes on keyword overlap. safety_v1 passes. But the agent looped on google_search three times, the final answer omits the restaurant name and address, and the user lands at a phantom reservation. The eval suite missed it because it only graded the final response and trajectory. Not per-step output, not loop count, not grounding in the actual search results. This is what production eval has to catch and what development eval misses.
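
To make the gap concrete, here is a sketch of the trace-level check that would have flagged this run. The tool_calls record shape and the search_results placeholder are hypothetical stand-ins for whatever your trace export gives you; the groundedness call uses the evaluate() API introduced in Step 2:

from collections import Counter

from fi.evals import evaluate

# Hypothetical shape: one dict per tool-call span pulled from the trace export.
tool_calls = [
    {"tool": "google_search", "args": "vegetarian restaurant Berlin"},
    {"tool": "google_search", "args": "vegetarian restaurant Berlin"},
    {"tool": "google_search", "args": "vegetarian restaurant Berlin"},
    {"tool": "reserve_table", "args": "19:00"},
]
search_results = "..."   # concatenated google_search outputs captured from the trace
final_response = "I've booked you a table at a vegetarian restaurant in Berlin for 7pm."

# Loop signal: the same tool invoked with identical args more than once.
repeats = Counter((call["tool"], call["args"]) for call in tool_calls)
looped = any(count > 1 for count in repeats.values())

# Grounding signal: is the final answer supported by what the tools actually returned?
grounded = evaluate("groundedness", output=final_response,
                    context=search_results, model="turing_flash")

if looped or not grounded.passed:
    print("flag this trace for review:", dict(repeats), grounded.reason)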

The fix is the ADK Production Eval Loop: six steps that go from a 30-second hello-trace to persona-driven simulation and prompt optimization. Steps 1 to 3 are copy-paste runnable. Steps 4 to 6 are integration patterns that need to adapt to your agent’s module shape, scenario harness, and dataset.

Prerequisites

- Python 3.10+
- google-adk >= 1.17 (for native --otel_to_cloud support)
- A Google API key for Gemini (GOOGLE_API_KEY)
- A FutureAGI project + API key + secret (FI_API_KEY, FI_SECRET_KEY). Sign up at https://futureagi.com

The pip install is one line:

pip install traceai-google-adk ai-evaluation google-adk

Two things to know about that line:

  • traceai-google-adk pulls in fi-instrumentation-otel transitively. That’s the package that gives you from fi_instrumentation import register.
  • ai-evaluation is the package that gives you from fi.evals import evaluate. The futureagi PyPI package does not include fi/evals. If you previously did pip install futureagi and tried to from fi.evals import Evaluator, that’s why it failed.

Set the three environment variables:

export FI_API_KEY="..."
export FI_SECRET_KEY="..."
export GOOGLE_API_KEY="..."

You’re ready for Step 1.

Step 1. Install and Verify the Trace Pipeline in 30 Seconds

The fastest way to confirm everything works is the canonical hello-trace: a single-agent app that answers a weather question with a tool call. Save this as hello-trace.py:

import asyncio

from google.adk.agents import Agent
from google.adk.runners import InMemoryRunner
from google.genai import types

from traceai_google_adk import GoogleADKInstrumentor
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType

tracer_provider = register(
    project_name="adk_demo",
    project_type=ProjectType.OBSERVE,
)
GoogleADKInstrumentor().instrument(tracer_provider=tracer_provider)

def get_weather(city: str) -> dict:
    if city.lower() == "new york":
        return {"status": "success",
                "report": "New York is sunny, 25°C (77°F)."}
    return {"status": "error", "error_message": f"No data for {city}."}

agent = Agent(
    name="weather_agent",
    model="gemini-2.5-flash",
    description="Answer weather questions using the weather tool.",
    instruction="You must use the available tools to find an answer.",
    tools=[get_weather],
)

async def main():
    app_name, user_id, session_id = "weather_app", "demo_user", "demo_session"
    runner = InMemoryRunner(agent=agent, app_name=app_name)
    await runner.session_service.create_session(
        app_name=app_name, user_id=user_id, session_id=session_id,
    )
    async for event in runner.run_async(
        user_id=user_id,
        session_id=session_id,
        new_message=types.Content(
            role="user", parts=[types.Part(text="What is the weather in New York?")]
        ),
    ):
        if event.is_final_response():
            print(event.content.parts[0].text.strip())

if __name__ == "__main__":
    asyncio.run(main())

Run it:

python hello-trace.py
# Example output (Gemini may rephrase): "New York is sunny, 25°C (77°F)."

Three things just happened automatically. The Runner.run_async call became one OTel span. The Gemini call inside the agent became a child span. The get_weather tool call became another child span. Open Observe in your FutureAGI project. You’ll see the trace tree under the adk_demo project with the agent invocation, the LLM call, and the tool call as separate spans.

Screenshot of the FutureAGI Observe trace tree showing the root weather_agent.run_async span with child spans for the Gemini LLM call and the get_weather tool call.

If the trace doesn’t show up, the most common cause is missing env vars. register() reads FI_API_KEY and FI_SECRET_KEY from the environment. Set them, re-run, refresh Observe.

You now have a working trace pipeline. Step 2 attaches eval scores to those spans.

Step 2. Score Outputs with the Unified evaluate() API

The unified evaluate() API is the simplest way to score an output against any FutureAGI evaluator. It returns an EvalResult with score, passed, reason, and latency_ms. No client setup, no batch result unpacking.

from fi.evals import evaluate

# Re-use Step 1's hello-trace output to keep the example concrete:
user_question = "What is the weather in New York?"
context = get_weather("New York")["report"]
# context = "New York is sunny, 25°C (77°F)."
response = "It's sunny in New York with a temperature of 25°C."

# Local heuristic. Runs offline, sub-second, free.
r = evaluate("contains", output="hello world", keyword="hello")
print(r.score, r.passed, r.reason)
# 1.0  True  "Found 'hello' in output"

# Cloud template via FutureAGI Turing models. LLM-as-a-judge.
r = evaluate("groundedness", output=response, context=context, model="turing_flash")
print(r.score, r.passed, r.reason)
# 0.92  True  "All claims are grounded in the provided context."

# Custom prompt against any LiteLLM-supported model.
r = evaluate(
    prompt="Rate clarity 0-1 of: {output}",
    output="ML is a subset of AI.",
    engine="llm",
    model="gemini/gemini-2.5-flash",
)

You can run multiple evaluators in one call:

from fi.evals import evaluate

# `response`, `context`, `user_question` are the same variables defined above.
results = evaluate(
    ["groundedness", "factual_accuracy", "toxicity"],
    output=response, context=context, input=user_question,
    model="turing_flash",
)
for r in results:
    print(r.eval_name, r.score, r.passed)

That’s the whole API surface for output-level scoring: a local-metric call, a cloud-template call, a custom-prompt call, and the multi-evaluator batch form shown above for running several checks in one call.

Auto-attaching scores to the active span

To score an ADK trace step (rather than a standalone output), call enable_auto_enrichment() once at startup. After that, every evaluate() result inside an active span context attaches its score, reason, and latency to that span automatically.

from fi.evals import evaluate
from fi.evals.otel import enable_auto_enrichment, get_tracer

enable_auto_enrichment()
tracer = get_tracer()

# `ans` is the agent's answer for the current step; `ctx` is the grounding source.
ans = response                  # from the variables defined above
ctx = context

with tracer.start_as_current_span("rag-pipeline"):
    with tracer.start_as_current_span("generate-answer"):
        r = evaluate("groundedness", output=ans, context=ctx, model="turing_flash")
        # No span_id plumbing required. The score is on this span.

If you want explicit control over which span gets enriched (for sub-agent boundaries, for example), use enrich_span_with_evaluation:

from fi.evals.otel import enrich_span_with_evaluation

with tracer.start_as_current_span("quality-gate") as span:
    r = evaluate("faithfulness", output=ans, context=ctx)
    enrich_span_with_evaluation(
        metric_name="faithfulness",
        score=r.score,
        reason=r.reason[:200],
        latency_ms=r.latency_ms,
        span=span,
    )

Open Observe and you’ll see the eval score as a span attribute next to the trace. Same screen, no log-stitching.

Screenshot of an Observe trace with the FutureAGI eval score attached as a span attribute, showing groundedness=0.92 alongside the agent step's LLM call and tool call.

This is the production eval primitive. The remaining steps are how to use it correctly across CI, scenarios, and optimization.

Step 3. Gate Releases on .test.json with AgentEvaluator

ADK’s built-in evaluation is the right tool for CI gates. Drop a .test.json fixture in your tests directory, define thresholds in a test_config.json, and let AgentEvaluator.evaluate() run them in pytest.

Minimal tests/fixtures/weather_basic.test.json (the schema is from the official ADK docs):

{
  "eval_set_id": "weather_basic_set",
  "name": "weather basic queries",
  "description": "Single-turn weather agent smoke tests.",
  "eval_cases": [
    {
      "eval_id": "ny_weather",
      "conversation": [
        {
          "invocation_id": "1",
          "user_content": {"parts": [{"text": "What is the weather in New York?"}], "role": "user"},
          "final_response": {"parts": [{"text": "New York is sunny, 25°C (77°F)."}], "role": "model"},
          "intermediate_data": {
            "tool_uses": [{"args": {"city": "New York"}, "name": "get_weather"}],
            "intermediate_responses": []
          }
        }
      ],
      "session_input": {"app_name": "weather_app", "user_id": "test_user", "state": {}}
    }
  ]
}

Thresholds in tests/fixtures/test_config.json:

{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}

AgentEvaluator.evaluate() discovers the agent through an importable Python module path. Save Step 1’s agent object as weather_app/agent.py with the special root_agent name ADK looks for, and put test_config.json in the same folder as the .test.json fixtures so ADK auto-loads it:

weather_app/
  __init__.py
  agent.py            # exports root_agent (the Agent from Step 1)
tests/
  fixtures/
    test_config.json
    weather_basic.test.json
  test_weather.py

weather_app/agent.py:

# Re-use the Agent definition from Step 1's hello-trace.
from google.adk.agents import Agent

def get_weather(city: str) -> dict:
    return {"status": "success", "report": "stub for example"}

agent = Agent(
    name="weather_agent",
    model="gemini-2.5-flash",
    description="Answer weather questions using the weather tool.",
    instruction="You must use the available tools to find an answer.",
    tools=[get_weather],
)

root_agent = agent   # AgentEvaluator looks up `root_agent` from the module

Run it from pytest. Pass the folder containing the test config and fixtures:

import pytest
from google.adk.evaluation.agent_evaluator import AgentEvaluator

@pytest.mark.asyncio
async def test_weather_basic():
    await AgentEvaluator.evaluate(
        agent_module="weather_app",
        eval_dataset_file_path_or_dir="tests/fixtures",
    )

The 11 ADK criteria you can reference in test_config.json are listed in the FAQ below. For most ADK projects, start with Google’s defaults, tool_trajectory_avg_score at 1.0 and response_match_score at 0.8, then add hallucinations_v1 once you have grounding context.
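
A starting test_config.json with those defaults plus a hallucination gate might look like the sketch below; the 0.8 value for hallucinations_v1 is an assumed threshold, so check the ADK criteria reference for that criterion's scoring direction before adopting it:

{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8,
    "hallucinations_v1": 0.8
  }
}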

This is your CI development inner loop. Step 4 covers what to score across each ADK workflow pattern.

Step 4. Apply the Right Eval Recipe per ADK Workflow Pattern

ADK gives you four orchestration patterns. SequentialAgent, ParallelAgent, LoopAgent, and dynamic-routing agents that transfer_to_agent. Each fails differently in production. Each gets a different eval recipe.

SequentialAgent: Score each step’s output to prevent error compounding

from fi.evals import evaluate
from fi.evals.otel import enable_auto_enrichment

enable_auto_enrichment()

# Inside each sub-agent in the chain:
def _planning_step(input_text: str) -> str:
    plan = planning_agent.run(input_text)
    evaluate("instruction_adherence", output=plan, input=input_text, model="turing_flash")
    return plan

Each step’s eval auto-attaches to that step’s span. When the final response is wrong, you can trace which step compounded the error.

ParallelAgent: Score branch quality + merged-output consistency

from fi.evals import evaluate
from fi.evals.otel import enrich_span_with_evaluation

def _evaluate_parallel_outputs(branches: list[str], merged: str) -> None:
    for i, branch in enumerate(branches):
        r = evaluate("conversation_coherence", output=branch)
        # Rename the span attribute so each branch lands distinctly in Observe.
        enrich_span_with_evaluation(
            metric_name=f"branch_{i}_quality",
            score=r.score, reason=r.reason, latency_ms=r.latency_ms,
        )
    evaluate("factual_accuracy", output=merged, context="\n".join(branches),
             model="turing_flash")

Per-branch coherence catches a single bad branch dragging the merged output. Whole-output factual accuracy catches the merged result diverging from any individual branch.

LoopAgent: Score iteration count and per-iteration output

from fi.evals import evaluate
from fi.evals.otel import enrich_span_with_evaluation

def _evaluate_loop(iterations: list[str], goal: str, max_iters: int) -> None:
    if len(iterations) >= max_iters:
        evaluate("contains", output="MAX_ITERATIONS_HIT", keyword="MAX_ITERATIONS_HIT")
    for i, iter_output in enumerate(iterations):
        r = evaluate("instruction_adherence", output=iter_output, input=goal)
        enrich_span_with_evaluation(
            metric_name=f"loop_iter_{i}",
            score=r.score, reason=r.reason, latency_ms=r.latency_ms,
        )

Loop count is the runaway-cost signal. Per-iteration adherence catches the loop drifting away from the original goal.

Dynamic Routing: Score routing accuracy as classification

from fi.evals import evaluate
from fi.evals.otel import enrich_span_with_evaluation

def _evaluate_routing(query: str, chosen_agent: str, expected_agent: str) -> None:
    r = evaluate("contains", output=chosen_agent, keyword=expected_agent)
    enrich_span_with_evaluation(
        metric_name="routing_accuracy",
        score=r.score, reason=r.reason, latency_ms=r.latency_ms,
    )

Routing is a classification task. Treat it that way and score it on a known-routing dataset before you ship the routing rules.
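
A minimal offline version of that check, as a sketch: loop over a small hand-labeled routing set and reuse the "contains" heuristic from Step 2. The routing_dataset rows and the route() stub are hypothetical; swap in however your app invokes the routing agent:

from fi.evals import evaluate

# Hypothetical labeled set: each query mapped to the sub-agent that should handle it.
routing_dataset = [
    {"query": "What is the weather in New York?", "expected_agent": "weather_agent"},
    {"query": "Book a table for two at 7pm", "expected_agent": "booking_agent"},
]

def route(query: str) -> str:
    # Stand-in: call your routing agent and return the name of the chosen sub-agent.
    return "weather_agent"

hits = 0
for row in routing_dataset:
    chosen = route(row["query"])
    r = evaluate("contains", output=chosen, keyword=row["expected_agent"])
    hits += int(r.passed)

print(f"routing accuracy: {hits / len(routing_dataset):.2f}")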

| Pattern | Failure mode | Recipe | What it catches |
|---|---|---|---|
| SequentialAgent | Step compounding | Per-step instruction_adherence | Bad step → corrupted final |
| ParallelAgent | Bad branch dragging merge | Branch coherence + merged factuality | Off-pattern branch |
| LoopAgent | Runaway iterations | Loop-count gate + per-iter adherence | Drift, cost runaway |
| Dynamic Routing | Wrong sub-agent picked | Routing accuracy as classification | Mis-routing in prod |

ADK Built-in vs FutureAGI for ADK at a glance:

| Capability | ADK Built-in | FutureAGI |
|---|---|---|
| Pre-deployment .test.json evaluation | Via AgentEvaluator | Use ADK |
| Live trace export | ✅ via --otel_to_cloud | ✅ via traceAI (vendor-portable) |
| Continuous quality scoring on live traffic | — | ✅ via evaluate() + enable_auto_enrichment() |
| Per-agent / per-tool cost attribution | — | ✅ via OTel span attributes |
| Persona-driven multi-turn simulation | Partial (user simulator framework) | ✅ via agent-simulate |
| Prompt optimization on failing traces | — | ✅ via agent-opt |
| Multi-turn evaluators | 4 metrics shipped | 100+ cloud Turing templates + 76+ local heuristics |

Step 5. Run Pre-Production Scenarios with a Multi-Turn Simulator

Per-step eval and CI gates catch most failures before production. The remaining class is cross-turn behavior, where the agent looks fine on each turn but fails the conversation as a whole. That class needs scenario testing.

Two tools cover this for ADK:

ADK’s user-simulator framework is the right fit for chat ADK agents. It pairs the four multi-turn criteria (multi_turn_task_success_v1, multi_turn_trajectory_quality_v1, multi_turn_tool_use_quality_v1, per_turn_user_simulator_quality_v1) with a simulated user driver that replays personas against your root_agent. Read the ADK user-simulator docs for the current API surface and use AgentEvaluator with the multi-turn criteria inside pytest.

FutureAGI’s agent-simulate is the right fit when the agent is a voice agent on LiveKit. The SDK is built around real-time audio rooms, with AgentDefinition(url=..., room_name=..., system_prompt=...), Persona(persona=..., situation=..., outcome=...), Scenario(name=..., dataset=[...]), and an async TestRunner().run_test(). It is not the right tool for a text-only ADK agent; for that, use ADK’s user-simulator above. See the fi.simulate README for the LiveKit setup.
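
For orientation, here is a minimal sketch of that LiveKit path, assuming the class names and fields quoted from the fi.simulate README above; the import path, the placeholder URL, and the run_test() arguments are assumptions, so check the README for the exact surface:

import asyncio

from fi.simulate import AgentDefinition, Persona, Scenario, TestRunner  # assumed import path

agent = AgentDefinition(
    url="wss://your-livekit-host",        # placeholder LiveKit endpoint
    room_name="booking-demo",
    system_prompt="You help callers find and book restaurants.",
)

persona = Persona(
    persona="Impatient caller who changes requirements mid-call",
    situation="Wants a vegetarian restaurant in Berlin tonight",
    outcome="A confirmed 7pm booking that includes the restaurant's name and address",
)

scenario = Scenario(name="berlin_booking_pressure", dataset=[persona])

async def main():
    # Drives the persona against the live voice agent; the call shape is an assumption.
    results = await TestRunner().run_test(agent, scenario)
    print(results)

if __name__ == "__main__":
    asyncio.run(main())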

For either path, the loop is the same: drive the persona, capture the conversation, score the transcript with the multi-turn criteria. The output you carry forward to Step 6 is the set of failing transcripts.

Screenshot of a multi-turn scenario report in the FutureAGI Observe UI showing the persona, the four-step plan, the resulting transcript, and the multi_turn_task_success_v1 score per turn.

Step 6. Close the Loop with agent-opt

If your scenarios surface a failure mode the agent reproduces (an instruction it ignores, a tool it picks wrong, a hallucinated detail), the next step is to optimize the prompt that drives the misbehavior. That’s agent-opt.

from fi.opt.optimizers import BayesianSearchOptimizer
from fi.opt.base.evaluator import Evaluator as OptEvaluator
from fi.opt.datamappers import BasicDataMapper
from fi.evals.metrics import BLEUScore

# `failing_trace_dataset` is whatever you load from your Step 2 sampled failures.
# Each row needs `context` and `question` fields plus a `reference` answer.
failing_trace_dataset = [
    {"context": "...", "question": "...", "reference": "..."},
    # ...
]

# `mapper` tells the optimizer which dataset fields fill which prompt placeholders.
mapper = BasicDataMapper(
    field_mapping={"context": "context", "question": "question", "reference": "reference"},
)

optimizer = BayesianSearchOptimizer(
    inference_model_name="gpt-4o-mini",
    teacher_model_name="gpt-4o",
    n_trials=10,
)

result = optimizer.optimize(
    evaluator=OptEvaluator(BLEUScore()),
    data_mapper=mapper,
    dataset=failing_trace_dataset,
    initial_prompts=["Given context: {context}, answer: {question}"],
)

print("final_score:", result.final_score)
print("best_prompt:", result.best_generator.get_prompt_template())

The dataset is your collection of failing traces from production (sampled by Step 2’s eval scores). The evaluator can be a public metric like BLEUScore or any custom evaluator you’ve built. The optimizer searches prompt variants and returns the one that lifts your evaluator score the most. Promote that prompt through your CI gate, redeploy, watch the failure rate drop in Observe.

The loop is now closed: instrument → score → gate → simulate → optimize → re-instrument.

Production Hardening

The six steps above are the loop. Three additional considerations harden it for production traffic.

Sample 5 to 10 percent of production traffic for eval. Running every production trace through evaluate() is too expensive. Sample. The unified evaluate() call is synchronous; for async production scoring, run it in a background worker or job queue you control so the agent’s user-facing response path is never blocked on evaluation latency.
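
One way to wire that up, sketched with a plain thread pool (the 5 percent sample rate, worker count, and maybe_score helper name are placeholders; any job queue you already run works the same way):

import random
from concurrent.futures import ThreadPoolExecutor

from fi.evals import evaluate

SAMPLE_RATE = 0.05                      # score roughly 5% of production traffic
_eval_pool = ThreadPoolExecutor(max_workers=2)

def maybe_score(response: str, context: str) -> None:
    # Fire-and-forget scoring: the user-facing response path never waits on this.
    if random.random() > SAMPLE_RATE:
        return
    _eval_pool.submit(
        evaluate, "groundedness",
        output=response, context=context, model="turing_flash",
    )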

Roll trace cost up to per-user-request budgets. Span-level cost attribution is now a baseline observability requirement; without it, audit-grade cost reporting in production is impossible. OTel GenAI semantic conventions (such as gen_ai.system and gen_ai.request.model) give you the per-call inputs you need; rolling them up the trace tree is the platform’s job, as sketched below.
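
As an illustration of that roll-up, assume you have exported LLM spans carrying the GenAI usage attributes (gen_ai.usage.input_tokens and gen_ai.usage.output_tokens in the current semconv); the span dicts and per-million-token prices below are made-up examples:

from collections import defaultdict

# Hypothetical per-million-token prices; substitute your models' real rates.
PRICE_PER_M = {"gemini-2.5-flash": {"in": 0.30, "out": 2.50}}

# Hypothetical export: one dict per LLM span, tagged with the user request it served.
spans = [
    {"request_id": "req-1", "gen_ai.request.model": "gemini-2.5-flash",
     "gen_ai.usage.input_tokens": 1200, "gen_ai.usage.output_tokens": 300},
    {"request_id": "req-1", "gen_ai.request.model": "gemini-2.5-flash",
     "gen_ai.usage.input_tokens": 800, "gen_ai.usage.output_tokens": 150},
]

cost_per_request = defaultdict(float)
for s in spans:
    price = PRICE_PER_M[s["gen_ai.request.model"]]
    cost_per_request[s["request_id"]] += (
        s["gen_ai.usage.input_tokens"] / 1e6 * price["in"]
        + s["gen_ai.usage.output_tokens"] / 1e6 * price["out"]
    )

print(dict(cost_per_request))   # roughly {'req-1': 0.0017} with these made-up prices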

Use --otel_to_cloud if you’re already on Google Cloud Observability. ADK 1.17+ exports natively. If you also want vendor-portable traces and FutureAGI’s eval scoring on top, run both. The traceAI instrumentation is OTel-native and stacks cleanly with other exporters.

Validation: How to Know the Eval Is Working

After running through Steps 1–6, three checks confirm the pipeline is wired correctly:

  1. Trace tree visible in Observe. Open the latest run. The agent invocation, every sub-agent, every LLM call, and every tool call should be separate spans under one root span.
  2. Eval score on the span. Click into a span where you called evaluate(). You should see the eval name, score, reason, and latency as span attributes. Not just in your logs.
  3. CI fails on a deliberately broken case. Add a .test.json case where the agent should fail (wrong city in the weather example, missing tool call). Run pytest. AgentEvaluator.evaluate() should fail. If it passes, your test_config.json thresholds are too loose.

If all three pass, the loop is real.

Pitfalls

The five mistakes most ADK + FutureAGI integrations hit, ordered by frequency:

  1. pip install futureagi followed by from fi.evals import Evaluator fails. The futureagi package excludes fi/evals. Use pip install ai-evaluation (or pip install traceai-google-adk ai-evaluation google-adk to pull in everything at once).
  2. runner.run_async(new_message="...") raises a Pydantic validation error. ADK’s run_async expects new_message=types.Content(role="user", parts=[types.Part(text="...")]). Build a Content object; don’t pass a string.
  3. runner.run_async raises Session not found. You skipped await runner.session_service.create_session(app_name=..., user_id=..., session_id=...). The session must exist before the first run_async call.
  4. Eval scores don’t show up on spans. You forgot to call enable_auto_enrichment() once at startup, OR you called evaluate() outside an active span context. Wrap the call in with tracer.start_as_current_span(...).
  5. AgentEvaluator.evaluate() passes everything in CI. Your test_config.json thresholds are too low or absent. Pick the tight defaults: tool_trajectory_avg_score: 1.0, response_match_score: 0.8, then tighten further.

Why Use FutureAGI for Google ADK Evaluation and Observability

Different layers of the eval stack solve different problems. ADK’s built-in evaluation is excellent for the development inner loop: .test.json fixtures, pytest integration, and the 11 evaluation criteria documented in current ADK docs. (ADK 1.17+ is required for --otel_to_cloud. The metric set shown here works with the latest google-adk.) What ADK does not give you is continuous evaluation on live production traffic, vendor-portable trace export, persona-driven scenario testing across personas you’d never write by hand, or prompt optimization on failing-trace datasets.

FutureAGI covers these pieces, but it does not replace ADK’s CI evals or remove the need for goldens. traceAI auto-instruments every agent invocation, sub-agent call, LLM completion, and tool execution as OpenTelemetry spans (vendor-portable, OTel 1.37+ GenAI semantic conventions compatible). ai-evaluation ships 76+ local metrics and 100+ cloud Turing templates through the unified evaluate() API, with auto-enrichment that attaches scores to your active span without manual plumbing. agent-simulate runs voice agents on LiveKit through persona-driven multi-turn scenarios; for chat ADK agents, ADK’s own user-simulator framework is the right fit. agent-opt searches the prompt space for variants that lift your evaluator score on a failing-trace dataset, closing the loop from production failures back to versioned prompt improvements.

You keep ADK for orchestration, agent building, and CI gates. You add production-grade observability, continuous live-traffic eval, and prompt optimization on top.

Limitations

A short list of what this guide is and isn’t.

  • Verified against May 2026 SDK versions. traceai-google-adk 0.1.4, ai-evaluation (current), google-adk >= 1.17 (required for --otel_to_cloud). On older ADK versions the multi-turn metrics may not exist; check the ADK criteria reference for what’s available in your installed version.
  • The agent-simulate and agent-opt examples assume both packages are installed. They’re separate from ai-evaluation. Install with pip install agent-simulate agent-opt if you want Steps 5 and 6.
  • Trace screenshots show OBSERVE projects. If you registered with ProjectType.EXPERIMENT, the UI surfaces are slightly different. Both work; pick OBSERVE for production.
  • The unified evaluate() API is newer than the legacy Evaluator class. Both ship in ai-evaluation. The unified API is more ergonomic; the legacy class is what existing Langfuse-style integrations expect. Use whichever your team has already standardized on.
  • We are FutureAGI. We sell tools that make this loop easier. The protocol itself (instrument, score, gate, simulate, optimize) is the same on Langfuse, LangSmith, Braintrust, or in-house. The specific code blocks change; the loop doesn’t.

For the broader 2026 production AI context that motivates the Production Eval Loop, see Generative AI Trends 2026: Why Reliability Became the Main Story. For the per-model comparison and benchmark trust framework, see LLM Benchmarking 2026: Compare Top Models When Leaderboards Break. For continuous live-traffic eval on a multi-month buying-guide cadence, see Best LLMs of May 2026 which references this post in its “How to actually pick one for production” section.


Ready to evaluate your first ADK agent? Get started with FutureAGI and follow the evaluation platform and tracing docs.

Frequently asked questions

What's the right pip install for evaluating Google ADK agents with FutureAGI?
pip install traceai-google-adk ai-evaluation google-adk. The traceai-google-adk package pulls fi-instrumentation-otel transitively. The futureagi PyPI package does NOT include fi/evals. That lives in ai-evaluation. Past guides that say pip install futureagi will fail with ModuleNotFoundError when you import from fi.evals.
What evaluation areas does ADK currently cover?
ADK evaluation started with two foundational areas. First, tool trajectory: did the agent call the right tools in the right order (tool_trajectory_avg_score with EXACT, IN_ORDER, ANY_ORDER match types)? Second, response quality: how close was the final response to the reference answer (response_match_score using ROUGE-1, plus the LLM-as-judge variant final_response_match_v2 for semantic equivalence). ADK has since expanded the criteria set to include hallucinations_v1, safety_v1, rubric-based metrics, and four multi-turn metrics that shipped alongside the user simulation framework. Check the ADK criteria reference for the current full list.
Can ADK evaluate multi-turn agent conversations?
Yes. ADK shipped multi_turn_task_success_v1, multi_turn_trajectory_quality_v1, multi_turn_tool_use_quality_v1, and per_turn_user_simulator_quality_v1 alongside the user simulation framework. Task success is binary (was the goal achieved), trajectory quality grades the path the agent took regardless of outcome, tool-use quality grades tool correctness across turns, and per-turn simulator quality grades the user simulator's faithfulness. Reference-based criteria like response_match_score do not work without ground-truth references; for user-simulation runs use hallucinations_v1, safety_v1, and the multi-turn criteria where applicable. Check the ADK criteria reference for the current support matrix.
How do I attach a FutureAGI eval to a specific ADK trace span?
Don't plumb span_id manually. Call enable_auto_enrichment() once at startup. After that, every fi.evals.evaluate() call inside an active span context auto-attaches the score, reason, and latency to that span. If you need explicit control, use enrich_span_with_evaluation(metric_name=..., score=..., span=...) inside the with tracer.start_as_current_span(...) block. Both come from fi.evals.otel.
What's the difference between final_response_match_v2 and hallucinations_v1?
final_response_match_v2 uses an LLM as a judge to score semantic equivalence between the agent's final response and a reference answer. It needs ground truth. hallucinations_v1 segments the response into sentences and checks each one against the provided context for grounding. It does not need a reference answer. Use final_response_match_v2 when you have golden responses; use hallucinations_v1 when you have grounding context but no golden answer.
Does Google ADK support production monitoring out of the box?
Partly. ADK 1.17+ ships native OpenTelemetry export to Google Cloud Observability via the --otel_to_cloud flag, which gives you traces, latency, and basic span attributes in production. What ADK does not give you out of the box is continuous quality scoring on live traffic, per-agent cost attribution across a multi-agent hierarchy, or asynchronous evaluation that does not block user responses. Those are the gaps a production eval and observability layer fills.
How do I evaluate a Sequential Agent's individual steps?
Score each sub-agent's output independently rather than only grading the final result. With traceAI instrumentation already in place, every sub-agent call appears as a separate OTel span. Use enable_auto_enrichment() and call evaluate("groundedness", output=step_output, context=step_input) inside each sub-agent's tool. The score attaches to that span automatically. Per-step scoring catches error compounding before it corrupts the final output.