What Is Pydantic AI?

Pydantic AI is a Python agent framework for typed LLM agents, structured outputs, dependency injection, and validated tool calls.

Pydantic AI is a Python agent framework for building type-safe LLM agents with tools, dependency injection, and structured outputs validated through Pydantic schemas. It is an agent framework, not a model provider. In production, a Pydantic AI application shows up inside a trace as agent runs, tool-call spans, output-validation attempts, retries, and token usage. FutureAGI connects to that surface through traceAI:pydantic-ai, so teams can evaluate typed agent behavior rather than trusting schemas alone.

Why Pydantic AI matters in production LLM and agent systems

Pydantic AI reduces a common agent failure class: untyped tool calls and malformed structured outputs crossing into production services. The risk does not disappear, though. A Pydantic schema can reject an invalid payload, but it cannot prove the model chose the right tool, picked the right customer record, or completed the business task. If teams treat validation as reliability, they still ship agents that are well-typed and wrong.
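
To make the gap concrete, here is a minimal sketch that uses only the Python standard library rather than Pydantic itself. The RefundRequest type, the parse_refund shape check, and the account IDs are hypothetical, chosen only to show that a payload can pass validation and still name the wrong record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefundRequest:
    """Hypothetical tool-argument type; stands in for a Pydantic model."""
    account_id: str
    amount_cents: int


def parse_refund(payload: dict) -> RefundRequest:
    """Shape check only: the right keys with the right types."""
    account_id = payload.get("account_id")
    amount = payload.get("amount_cents")
    if not isinstance(account_id, str):
        raise ValueError("account_id must be a string")
    # bool is a subclass of int, so reject it explicitly.
    if not isinstance(amount, int) or isinstance(amount, bool):
        raise ValueError("amount_cents must be an integer")
    return RefundRequest(account_id=account_id, amount_cents=amount)


def matches_intent(request: RefundRequest, expected_account_id: str) -> bool:
    """Domain check: did the agent act on the account the user meant?"""
    return request.account_id == expected_account_id


# Assumed scenario: the user asked about "acct-123"; the model produced "acct-321".
model_payload = {"account_id": "acct-321", "amount_cents": 4500}
request = parse_refund(model_payload)             # passes the shape check
intent_ok = matches_intent(request, "acct-123")   # fails the domain check
print(f"schema_valid=True intent_ok={intent_ok}")
```

The shape check and the intent check are separate functions on purpose: the first can live in the schema, while the second needs ground truth from a labeled dataset or a reviewer.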

Developers feel this first as confusing exception patterns: output validation errors, retry spikes, tool arguments that pass type checks but fail domain rules, and logs where the final AgentRunResult.output looks correct while earlier tool calls drifted. SREs see p99 latency and token-cost-per-trace rise when the agent retries after schema failures. Product teams see inconsistent task completion across cohorts. Compliance reviewers ask whether a typed tool call was authorized, not only whether it parsed.

The issue is sharper in 2026 multi-step pipelines because Pydantic AI agents often sit between MCP servers, databases, retrieval systems, user state, and payment or support tools. Unlike LangGraph, which makes workflow state transitions explicit as graph nodes, Pydantic AI emphasizes typed Python functions, dependency injection, and output schemas. That makes it pleasant for Python engineers, but it also means runtime behavior must be traced step by step. The reliability question is not “did Pydantic validate the object?” It is “did the agent take the correct path to a safe, useful result?”

How FutureAGI handles Pydantic AI

FutureAGI’s approach is to treat Pydantic AI’s type safety as a starting condition, then verify each agent run with trace-level evidence and evaluator scores. The anchor traceAI:pydantic-ai maps to the pydantic-ai traceAI integration in the FutureAGI inventory. It captures a Pydantic AI run as an OpenTelemetry trace with agent steps, model calls, tool names, tool arguments, output validation outcomes, latency, and token fields such as llm.token_count.prompt when emitted by the instrumentation layer.

Example: a banking support agent uses Pydantic AI with a SupportDependencies object for account services, a lookup_transaction tool, and a Pydantic output type for the final resolution. A customer asks why a wire transfer failed. FutureAGI records each agent.trajectory.step, the selected tool through gen_ai.tool.name, the raw arguments through gen_ai.tool.call.arguments, and the final typed output. JSONValidation checks whether the structured output matches the schema. FunctionCallAccuracy and ToolSelectionAccuracy check whether the lookup tool and arguments match the user intent. TaskCompletion checks whether the agent actually resolved the support goal.
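
As a rough illustration of how those span attributes can be read back, the sketch below walks a list of spans and pulls out each tool call. The span layout (plain dictionaries with an "attributes" mapping), the JSON-string encoding of the arguments, and the "wire-001" transaction ID are assumptions for this example; a real traceAI export may be shaped differently.

```python
import json

# Hypothetical flattened spans for one agent run.
spans = [
    {"name": "agent.run", "attributes": {"agent.trajectory.step": 1}},
    {
        "name": "tool.call",
        "attributes": {
            "agent.trajectory.step": 2,
            "gen_ai.tool.name": "lookup_transaction",
            # Assumed to be stored as a JSON string on the span.
            "gen_ai.tool.call.arguments": '{"transaction_id": "wire-001"}',
        },
    },
]


def extract_tool_calls(run_spans):
    """Return (step, tool name, parsed arguments) for each tool-call span."""
    calls = []
    for span in run_spans:
        attrs = span.get("attributes", {})
        name = attrs.get("gen_ai.tool.name")
        if name is None:
            continue  # not a tool-call span
        raw_args = attrs.get("gen_ai.tool.call.arguments", "{}")
        calls.append((attrs.get("agent.trajectory.step"), name, json.loads(raw_args)))
    return calls


tool_calls = extract_tool_calls(spans)

# Compare against the call a reviewer expects for this user request.
expected = ("lookup_transaction", {"transaction_id": "wire-001"})
call_matches = any((name, args) == expected for _, name, args in tool_calls)
print(tool_calls, call_matches)
```

An exact-match comparison like this is the simplest possible check; the evaluators named above are what score tool and argument correctness in practice.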

The engineer then turns failures into release criteria. If JSONValidation stays high but ToolSelectionAccuracy drops after a prompt change, the schema is not the problem; the agent is calling the wrong function. The next action might be a stricter tool description, a smaller tool registry, a regression eval on failed traces, or an Agent Command Center model fallback for high-risk routes. The point is to evaluate the Pydantic AI runtime as a trajectory, not just a validated return type.
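
The triage logic in that paragraph can be sketched as a comparison of mean evaluator scores before and after a change. The scores, the 0.05 drop tolerance, and the function names below are illustrative assumptions, not FutureAGI defaults.

```python
def regressed_evaluators(baseline, candidate, max_drop=0.05):
    """Return evaluators whose mean score fell by more than max_drop."""
    return [
        name
        for name, base_score in baseline.items()
        if base_score - candidate.get(name, 0.0) > max_drop
    ]


def triage(regressed):
    """Map the regression pattern to a first place to look."""
    if "JSONValidation" in regressed:
        return "output schema or output prompt"
    if "ToolSelectionAccuracy" in regressed or "FunctionCallAccuracy" in regressed:
        return "tool descriptions or tool registry"
    if regressed:
        return "failed trajectories, step by step"
    return "no regression"


# Hypothetical mean scores before and after a prompt change.
baseline = {"JSONValidation": 0.99, "ToolSelectionAccuracy": 0.94, "TaskCompletion": 0.90}
candidate = {"JSONValidation": 0.99, "ToolSelectionAccuracy": 0.81, "TaskCompletion": 0.84}

regressed = regressed_evaluators(baseline, candidate)
print(regressed, "->", triage(regressed))
```

Here the schema score is unchanged while tool selection drops, so the sketch points at the tools rather than the output type.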

How to measure or detect Pydantic AI

Measure Pydantic AI agents across schema, tool, trajectory, and production outcomes:

  • JSONValidation: returns whether the final structured output conforms to the expected JSON Schema.
  • FunctionCallAccuracy: scores function name and argument correctness against expected calls.
  • ToolSelectionAccuracy: scores whether the Pydantic AI agent chose the right tool for the intent.
  • TaskCompletion: checks whether the run completed the assigned user or workflow goal.
  • Trace signals: repeated agent.trajectory.step, rising llm.token_count.prompt, schema-fail rate, retry count, tool-timeout rate, and p99 latency.
  • User proxies: thumbs-down rate, reopened-ticket rate, escalation rate, and human-review rate by agent version.

Minimal Python:

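from pydantic import BaseModel
from fi.evals import ToolSelectionAccuracy, JSONValidation

# Illustrative output type; substitute the agent's real output model.
class SupportOutput(BaseModel):
    resolution: str

# trace_spans and agent_output are assumed to come from one traced agent run.
tool_eval = ToolSelectionAccuracy()
schema_eval = JSONValidation()

# Score tool choice against the expected tool, then check the output against the schema.
tool_result = tool_eval.evaluate(trajectory=trace_spans, expected_tool="lookup_transaction")
schema_result = schema_eval.evaluate(output=agent_output, schema=SupportOutput.model_json_schema())
print(tool_result.score, schema_result.passed)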
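
The trace signals in the list above can also be aggregated from run records with standard-library Python. This is a minimal sketch: the record fields (latency_ms, retries, schema_failed) and the sample values are hypothetical, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math

# Hypothetical per-run records derived from traces.
runs = [
    {"latency_ms": 820, "retries": 0, "schema_failed": False},
    {"latency_ms": 910, "retries": 0, "schema_failed": False},
    {"latency_ms": 1040, "retries": 1, "schema_failed": True},
    {"latency_ms": 4300, "retries": 3, "schema_failed": True},
    {"latency_ms": 880, "retries": 0, "schema_failed": False},
]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records):
    """Aggregate schema-fail rate, mean retries, and p99 latency."""
    total = len(records)
    return {
        "schema_fail_rate": sum(r["schema_failed"] for r in records) / total,
        "mean_retries": sum(r["retries"] for r in records) / total,
        "p99_latency_ms": percentile([r["latency_ms"] for r in records], 99),
    }


signals = summarize(runs)
print(signals)
```

With only five runs the p99 is simply the slowest run; the number becomes meaningful once the window holds hundreds of traces per agent version.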

Pydantic AI vs other 2026 agent frameworks

In our 2026 evals, the meaningful comparison between agent frameworks is not “which has prettier syntax” but “which makes failures observable.” Pydantic AI’s selling point is type safety; the production cost is that typed boundaries hide intent errors. The table below is how we think about framework choice when a team is shipping a multi-tool agent against models like Claude Opus 4.7, GPT-5.1, or Gemini 3 Pro:

Framework               | State model                  | Tool surface                        | Common 2026 failure
Pydantic AI             | Typed agent + dependencies   | Python functions + Pydantic outputs | Validated payloads with wrong intent
LangGraph               | Explicit graph nodes + state | Bound tool nodes                    | Wrong edge taken between state transitions
LangChain AgentExecutor | Implicit memory + scratchpad | LangChain Tool objects              | Drift across long agent trajectories
OpenAI Agents SDK       | Handoffs + sessions          | OpenAI tool schemas                 | Lost context across handoffs
CrewAI                  | Role + task hierarchy        | Crew tools                          | Role confusion under load

What unites the failure modes is that none of them is caught by JSONValidation alone. FutureAGI’s stance is that typed agents reduce one class of bugs (malformed payloads) and shift others (intent drift, tool selection drift, goal-state confusion) into the trajectory, which is why TrajectoryScore, ToolSelectionAccuracy, FunctionCallAccuracy, and TaskCompletion belong in every Pydantic AI release gate.

Common mistakes

  • Treating Pydantic validation as agent correctness. A valid object can still contain the wrong account, action, or recommendation.
  • Skipping tool-level evaluation. Typed tool arguments only prove shape; FunctionCallAccuracy checks whether the name and values match intent.
  • Hiding retries inside the framework. Retries after validation errors should appear in traces and dashboards, not only application logs.
  • Overloading one dependency object. Broad dependencies make tools harder to test and make authorization boundaries harder to audit.
  • Comparing only final answers with LangGraph or CrewAI. Frameworks fail differently; compare trajectories, tool calls, and task completion.
  • Skipping per-frontier-model regression. A Pydantic AI agent that passes TaskCompletion on GPT-5.1 can fail on Claude Opus 4.7 because the system message is interpreted differently. Run the regression eval against every model route before promoting.
  • Ignoring traceAI attributes. Without agent.trajectory.step, gen_ai.tool.name, gen_ai.tool.call.arguments, and llm.token_count.prompt on the span, validation success and intent success cannot be separated.
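
One way to catch the last mistake early is a small check that flags spans missing the attributes named above. The sketch below assumes spans are plain dictionaries with a hypothetical "kind" field; which attributes belong on which span kind is also an assumption made for illustration.

```python
# Assumed mapping from span kind to the attributes it should carry.
REQUIRED = {
    "tool": ("agent.trajectory.step", "gen_ai.tool.name", "gen_ai.tool.call.arguments"),
    "llm": ("agent.trajectory.step", "llm.token_count.prompt"),
}


def missing_attributes(span):
    """List required attribute keys that are absent from a span."""
    required = REQUIRED.get(span.get("kind"), ())
    attrs = span.get("attributes", {})
    return [key for key in required if key not in attrs]


# Hypothetical spans: the tool span lacks its arguments attribute.
spans = [
    {
        "name": "model-call",
        "kind": "llm",
        "attributes": {"agent.trajectory.step": 1, "llm.token_count.prompt": 412},
    },
    {
        "name": "lookup_transaction",
        "kind": "tool",
        "attributes": {"agent.trajectory.step": 2, "gen_ai.tool.name": "lookup_transaction"},
    },
]

report = {span["name"]: missing_attributes(span) for span in spans}
print(report)
```

Running a check like this in CI against a sample of exported traces keeps instrumentation gaps from silently disabling the tool-level evaluators.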

Pydantic AI on the 2026 release gate

In our 2026 evals, Pydantic AI ships into production with a clear release contract: every agent version is scored on JSONValidation (schema shape), FunctionCallAccuracy (right tool and arguments), ToolSelectionAccuracy (intent fit), TaskCompletion (user goal), and TrajectoryScore (path health) against a golden dataset of 500-2000 rows. The right public anchors are BFCL v3 (Berkeley Function Calling Leaderboard, 2000+ multi-turn function-call cases; frontier models score 80-90% on simple calls but drop to ~55% on multi-step compositional tasks) and τ-bench (Sierra, multi-turn customer-support tool use; frontier models score ~50% on the retail domain). Both expose intent-fit failures that pure schema validation hides. The same suite runs against Claude Opus 4.7, GPT-5.1, Gemini 3 Pro, and any self-hosted Llama 4 route the team supports. Unlike a LangGraph deployment that exposes graph nodes as the unit of test, Pydantic AI keeps tests at the typed function level, which is good for unit safety and weaker for catching trajectory-level reasoning regressions. FutureAGI's evaluator cohort closes that gap.
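
A release gate of this kind reduces to a threshold check per evaluator and per model route. The sketch below is one possible shape for it; the thresholds, the route names, and the scores are hypothetical placeholders rather than recommended values or measured results.

```python
# Hypothetical minimum mean scores per evaluator.
THRESHOLDS = {
    "JSONValidation": 0.98,
    "FunctionCallAccuracy": 0.90,
    "ToolSelectionAccuracy": 0.90,
    "TaskCompletion": 0.85,
    "TrajectoryScore": 0.80,
}


def release_gate(scores_by_route, thresholds=THRESHOLDS):
    """Return {route: [evaluators below threshold]} for failing routes only."""
    failures = {}
    for route, scores in scores_by_route.items():
        # A missing score counts as a failure rather than a pass.
        failed = [
            name for name, minimum in thresholds.items()
            if scores.get(name, 0.0) < minimum
        ]
        if failed:
            failures[route] = failed
    return failures


# Hypothetical golden-dataset results for two model routes.
scores_by_route = {
    "route-a": {"JSONValidation": 0.99, "FunctionCallAccuracy": 0.93,
                "ToolSelectionAccuracy": 0.92, "TaskCompletion": 0.88,
                "TrajectoryScore": 0.85},
    "route-b": {"JSONValidation": 0.99, "FunctionCallAccuracy": 0.91,
                "ToolSelectionAccuracy": 0.84, "TaskCompletion": 0.79,
                "TrajectoryScore": 0.82},
}

failures = release_gate(scores_by_route)
promote = not failures
print(failures, promote)
```

Both routes pass JSONValidation here, yet route-b is blocked on tool selection and task completion, which is the pattern a schema-only gate would miss.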

Frequently Asked Questions

What is Pydantic AI?

Pydantic AI is a Python framework for building typed LLM agents with tools, dependency injection, and structured outputs. FutureAGI traces its runs through traceAI:pydantic-ai so each tool call and validation step can be evaluated.

How is Pydantic AI different from LangGraph?

LangGraph models workflows as explicit graphs of nodes and state transitions. Pydantic AI starts from typed Python agents, dependencies, tools, and output schemas, so validation and static typing are central to the runtime.

How do you measure Pydantic AI?

FutureAGI instruments the pydantic-ai traceAI integration and scores spans with ToolSelectionAccuracy, FunctionCallAccuracy, JSONValidation, and TaskCompletion. Monitor agent.trajectory.step, tool arguments, schema-fail rate, and eval-fail-rate-by-release.