Agents

What Is Google ADK?

Google ADK is Google's open-source Agent Development Kit for building, evaluating, and deploying agentic applications. It is used to define agents, tools, sessions, memory, callbacks, planning, and multi-agent workflows, often with Gemini or Vertex AI. In production traces, an ADK run appears as agent steps, tool calls, events, and run outcomes rather than a single model response. FutureAGI instruments ADK through traceAI:google-adk so those steps can be evaluated and debugged.

Why It Matters in Production LLM and Agent Systems

Google ADK matters because agent failures happen between steps, not only in the final answer. A support agent can choose a refund tool before identity verification, a research agent can loop across search calls until cost spikes, or a workflow agent can hand off stale state to the next specialist. The user sees one bad outcome; the engineering team sees a trace with planner events, callback decisions, tool calls, and session state.

Developers feel the pain when local ADK examples pass but production traffic breaks under retries, parallel branches, or long-running tools. SREs see p99 latency, tool-timeout rate, and token-cost-per-trace rise after a new tool or model route ships. Product teams see incomplete tasks: the agent sounds helpful but never completes the booking, refund, or data update. Compliance teams need to know which tool call caused a real-world action and whether a policy check ran before it.

This is especially important for 2026 agent pipelines because ADK apps often connect Gemini, Vertex AI Agent Engine, external APIs, memory, callbacks, and multi-agent orchestration. A single failure can sit in a callback, a LoopAgent, a tool schema, or a session boundary. Without trace-level evaluation, teams debate symptoms instead of proving which step failed.

How FutureAGI Handles Google ADK

FutureAGI’s approach is to treat Google ADK as an execution surface, not just a framework label. When an ADK Python or TypeScript app is instrumented with traceAI:google-adk, each run becomes an OpenTelemetry-style trace with agent steps, tool spans, model spans, callback outcomes, and status fields. The integration point is specific: traceAI:google-adk maps ADK behavior into FutureAGI traces, where evaluator scores sit beside runtime evidence.

Example: a financial support team builds an ADK agent that checks identity, looks up invoices, decides whether a refund is allowed, and drafts the customer reply. FutureAGI records the path as verify_identity -> invoice_lookup -> refund_policy_check -> draft_reply. The trace carries agent.trajectory.step, tool name, tool status, argument shape, latency, and token fields such as llm.token_count.prompt when the model span emits them.
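The recorded path above can be sketched as a flat list of span records. The attribute names (agent.trajectory.step, llm.token_count.prompt) come from the description above; the record shape itself is a simplified assumption for illustration, not FutureAGI's wire format.

```python
# Simplified sketch of the refund-agent trace as flat span records.
# Attribute names follow the text above; the record shape is illustrative only.
trace_spans = [
    {"agent.trajectory.step": 1, "tool.name": "verify_identity",
     "tool.status": "ok", "latency_ms": 210},
    {"agent.trajectory.step": 2, "tool.name": "invoice_lookup",
     "tool.status": "ok", "latency_ms": 480},
    {"agent.trajectory.step": 3, "tool.name": "refund_policy_check",
     "tool.status": "ok", "latency_ms": 95},
    {"agent.trajectory.step": 4, "tool.name": "draft_reply",
     "tool.status": "ok", "llm.token_count.prompt": 1843, "latency_ms": 1200},
]

# The trajectory an evaluator would compare against the expected path.
path = [s["tool.name"] for s in trace_spans]
print(" -> ".join(path))
```

A trajectory evaluator only needs this ordered view of tool names and statuses to detect a reordered or skipped step.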

ToolSelectionAccuracy checks whether the ADK agent chose the correct tool for the user intent. FunctionCallAccuracy checks whether the tool arguments matched the expected schema and values. TaskCompletion checks the final outcome, while TrajectoryScore catches loops, skipped steps, and wasteful detours. Unlike a LangSmith-style trace review that starts with manual inspection, the FutureAGI workflow turns ADK traces into regression evidence. If a new prompt causes the refund tool to run before identity verification, the engineer adds the failed trace to a dataset, sets an eval-fail-rate threshold, and blocks the release until the trajectory recovers.
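The release-gate step in that workflow can be sketched as a simple threshold check. The function names and the 0.05 threshold here are illustrative choices, not a FutureAGI API.

```python
# Hypothetical release gate: block the ship if the eval-fail rate on the
# regression dataset exceeds a threshold. Names and thresholds are examples.
def eval_fail_rate(scores, passing=0.5):
    """Fraction of regression traces whose evaluator score fell below `passing`."""
    failures = sum(1 for s in scores if s < passing)
    return failures / len(scores)

def release_gate(scores, max_fail_rate=0.05):
    """Return True only when the candidate prompt or model route may ship."""
    return eval_fail_rate(scores) <= max_fail_rate

# One of ten regression traces regressed after the prompt change,
# so the fail rate is 0.10, above the 0.05 threshold: the gate stays closed.
scores = [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(release_gate(scores))  # False
```

Keeping the failed refund trace in the dataset means this gate re-tests the exact trajectory that broke, not a synthetic approximation of it.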

How to Measure or Detect Google ADK Reliability

Measure Google ADK by scoring both the completed task and the decisions inside the run.

  • ToolSelectionAccuracy: returns a score for whether the ADK agent selected the expected tool.
  • FunctionCallAccuracy: evaluates function or tool arguments against the expected call.
  • TaskCompletion: checks whether the user-visible goal was completed by the end of the run.
  • TrajectoryScore: summarizes the path through agent steps, handoffs, retries, and tool calls.
  • Trace signals: repeated agent.trajectory.step, rising llm.token_count.prompt, tool-error rate, callback failure rate, p99 latency, and token-cost-per-trace.
  • User proxies: thumbs-down rate, escalation rate, reopened-ticket rate, and human-review rate by ADK agent version.

Use the metric trend, not one isolated score. A failing ADK run usually has a small cluster: wrong tool, malformed arguments, extra loop step, and higher latency in the same trace.
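That clustering idea can be sketched as a check that flags a trace only when several weak signals co-occur. The signal names mirror the list above; the thresholds and trace fields are made-up examples.

```python
# Flag a trace when multiple weak failure signals co-occur, instead of
# alerting on any single metric. Fields and thresholds are illustrative.
def failure_signals(trace):
    signals = []
    if trace.get("wrong_tool"):
        signals.append("wrong_tool")
    if trace.get("malformed_args"):
        signals.append("malformed_args")
    if trace.get("loop_steps", 0) > 0:
        signals.append("extra_loop_step")
    if trace.get("latency_ms", 0) > 5000:
        signals.append("high_latency")
    return signals

def is_failing_cluster(trace, min_signals=2):
    return len(failure_signals(trace)) >= min_signals

trace = {"wrong_tool": True, "malformed_args": True,
         "loop_steps": 1, "latency_ms": 7200}
print(is_failing_cluster(trace), failure_signals(trace))
```

A trace with only one weak signal stays below the alert line; the cluster of wrong tool, malformed arguments, an extra loop, and high latency is what marks a genuinely failing run.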

# trace_spans, trace_tool_calls, and refund_schema are assumed to come
# from your own instrumented ADK run and expected-call definition.
from fi.evals import ToolSelectionAccuracy, FunctionCallAccuracy

tool_eval = ToolSelectionAccuracy()
args_eval = FunctionCallAccuracy()

# Score whether the agent picked the expected tool for this run,
# then whether the tool arguments matched the expected schema.
tool = tool_eval.evaluate(trajectory=trace_spans, expected_tool="refund_lookup")
args = args_eval.evaluate(tool_calls=trace_tool_calls, expected_schema=refund_schema)
print(tool.score, args.score)

Common Mistakes

The common pattern is hidden control flow: the ADK app works, but the trace cannot explain why it worked. The fix is usually stricter state boundaries, a clearer tool schema, or an eval that targets the failing step.

  • Treating ADK evals as final-answer tests. Tool choice and argument shape can fail while the final answer still reads correctly.
  • Losing session boundaries. Mixing Session, State, and long-term memory makes repeat-user failures look like random model drift.
  • Ignoring callback failures. A callback that blocks unsafe actions must emit a trace event, not disappear into application logs.
  • Testing only one workflow path. SequentialAgent, ParallelAgent, and LoopAgent patterns produce different failure signatures under retries.
  • Assuming Gemini optimization removes model risk. Model-agnostic ADK apps still need provider-specific latency, cost, and safety thresholds.
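The callback point above can be sketched generically. This is not the ADK callback API, whose real signatures differ; it only shows the pattern the bullet asks for: decide, emit a trace event, and return a block/allow outcome instead of logging silently.

```python
# Generic sketch of a blocking before-tool policy check that emits a trace
# event. NOT the ADK callback API; names and shapes are illustrative.
trace_events = []

def before_tool(tool_name, args, session_state):
    """Block refund tools until identity verification has run in this session."""
    if tool_name.startswith("refund") and not session_state.get("identity_verified"):
        trace_events.append({"event": "callback.blocked",
                             "tool": tool_name,
                             "reason": "identity_not_verified"})
        return {"blocked": True}
    trace_events.append({"event": "callback.allowed", "tool": tool_name})
    return {"blocked": False}

outcome = before_tool("refund_lookup", {"invoice_id": "INV-1"},
                      {"identity_verified": False})
print(outcome, trace_events[-1]["event"])
```

Because the block decision lands in trace_events rather than an application log, an evaluator can later confirm the policy check actually ran before the real-world action.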

Frequently Asked Questions

What is Google ADK?

Google ADK is Google's open-source Agent Development Kit for building, evaluating, and deploying AI agents. FutureAGI traces ADK runs so teams can inspect tool calls, agent trajectories, and task outcomes.

How is Google ADK different from LangGraph?

Google ADK is a full agent development framework optimized for Gemini and the Google ecosystem while remaining model-agnostic. LangGraph is a graph-based runtime often used with LangChain for stateful agent workflows.

How do you measure Google ADK agents?

FutureAGI measures ADK agents through traceAI:google-adk spans plus evaluators such as ToolSelectionAccuracy, FunctionCallAccuracy, TaskCompletion, and TrajectoryScore.