Guides

Evaluating Pydantic AI Agents That Use MCP Tools (2026)

Evaluate Pydantic AI agents that call MCP tools in 2026: per-typed-output rubrics, tool-call argument fidelity, MCP security checks, dependency invariants.

·
Updated
·
11 min read
pydantic-ai mcp agent-evaluation llm-evaluation traceai tool-calling agents python 2026
Editorial cover image for Evaluating Pydantic AI Agents That Use MCP Tools
Table of Contents

A Pydantic AI agent returns BookingResult(confirmed=True, flight_id='AA127', total_usd=412.50). Every Pydantic check passes. The output_type matches. The tool_call validated against the Pydantic model on the way in. Retry counter at zero. The user asked to book the Sunday flight; AA127 is a Friday flight. The agent confirmed the wrong booking with full type safety.

Pydantic AI’s typed agent contract gives you free schema validation. It does not give you free quality eval. The shape is guaranteed, the retry-on-validation-failure loop is wired, the output_type parses. That is the floor. The ceiling is whether the agent picked the right tool, whether arguments were semantically correct, whether the typed output reflects the user goal, whether the deps invariants held, and whether the MCP tool catalog itself ships an injection payload that bypasses every type check. This post is the methodology that matches Pydantic AI: per-typed-output semantic rubric, tool-call argument fidelity, MCP integration security check, and agent-dependency invariant verification, with the PydanticAIInstrumentor that makes the run legible.

Why a typed contract is the floor, not the ceiling

Pydantic AI is the agent framework from the Pydantic team. You declare an Agent with a model identifier, an output_type (Pydantic model), a typed deps_type for runtime dependencies, and tools whose arguments are Pydantic models too. The framework dispatches the model call, parses the tool selection, validates arguments against the schema, calls the tool, and re-prompts on validation failure. First-class MCP client support ships via MCPServerStdio, MCPServerHTTP, and MCPServerSSE. Background in what is Pydantic AI; alternatives in Pydantic AI alternatives.

That contract handles four classes of bug for free: tool argument shape mismatch, output schema mismatch, non-existent tool name, and deps_type enforcement at call time. Each contract is a question about syntax. The questions that break in production are semantic.

Here is the gap, grouped by what Pydantic AI catches and what it does not:

Failure modeCaught by Pydantic AINeeds eval
Tool argument shape mismatchYesNo
Output output_type shape mismatchYesNo
Wrong tool picked for the user intentNoYes
Right tool, semantically wrong argumentsNoYes
Right tool, right arguments, wrong MCP serverNoYes
Typed output values do not reflect user goalNoYes
Trajectory loops, stops short, or stallsNoYes
Refusal calibration drifts under retry pressureNoYes
MCP tool description carries prompt-injection payloadNoRuntime guardrail
MCP tool result tampers with the next planning turnNoRuntime guardrail
Agent dependency invariant violated by a planned callNoYes

Schema validation is one assertion per field. The eval that matters is the assertion across fields, across calls, across servers, and across the deps that anchor the run. Tool-calling depth in evaluating tool-calling agents; general agent surface in the agent evaluation guide.

Layer 1: per-typed-output semantic rubrics

Pydantic AI’s structured output is the part most teams over-trust. A BookingResult(confirmed: bool, flight_id: str, total_usd: float) constrains the shape. It says nothing about whether confirmed is correctly True, whether flight_id matches the flight the user asked about, or whether total_usd is plausible for the route. The semantic check is the part you write.

The rule of thumb: any field whose correctness depends on the input gets a CustomLLMJudge. Pure-structure fields (a UUID, a fixed-format timestamp) do not. Treat the output_type the way the Instructor structured-outputs post treats response_model: one judge per non-trivial field, each rubric narrow enough that the score is unambiguous.

from pydantic import BaseModel, Field
from fi.evals.templates import CustomLLMJudge

class BookingResult(BaseModel):
    confirmed: bool
    flight_id: str
    total_usd: float = Field(ge=0)
    departure_iso: str

flight_id_judge = CustomLLMJudge(
    config={
        "name": "FlightIdFidelity",
        "rubric": (
            "Given the user request and the chosen flight_id, score whether "
            "the flight matches the user's stated day, route, and time window. "
            "5 = correct day, route, time. 1 = different day or route."
        ),
        "input_mapping": {
            "user_input": "input",
            "flight_value": "output.flight_id",
            "departure": "output.departure_iso",
        },
        "model": "turing-flash",
    }
)

Add the same pattern for confirmed (was the booking actually successful in the trace) and total_usd (is the value plausible for the route). Per-field judges produce per-field signal, which is the diagnostic that names the fix. A monolithic “is this output_type correct” judge produces uninterpretable averages. CI runs per-axis thresholds (FlightIdFidelity >= 4.5, ConfirmationFidelity == 5) and fails the build on the axis that broke. One bisect instead of three.

Layer 2: tool-call argument fidelity

Pydantic validates that book_flight(flight_id='AA127', date='2026-08-15') parses against the tool’s arg model. It says nothing about whether 'AA127' is the flight the user asked about or whether '2026-08-15' matches the date they said. Tool-call argument fidelity sits between Pydantic’s “the arguments parsed” and the user’s “the agent did what I asked”.

EvaluateFunctionCalling (aliased as LLMFunctionCalling, verified at python/fi/evals/templates.py:344) is the right rubric. Point it at every tool_call span. The PydanticAIInstrumentor serializes the resolved arguments on pydantic_ai.tool.args, the tool name on pydantic_ai.tool.name, and the user prompt on the parent agent_run span’s pydantic_ai.run.prompt, which is everything the judge needs.

from fi.evals import Evaluator
from fi.evals.templates import (
    LLMFunctionCalling, TaskCompletion, AnswerRefusal, CustomLLMJudge,
)

server_attribution = CustomLLMJudge(
    config={
        "name": "ToolServerAttribution",
        "rubric": (
            "Given the user request and the list of MCP servers available "
            "in this run, score whether each tool call was routed to the "
            "correct server. PASS only if every call went to the right one."
        ),
        "input_mapping": {
            "user_input": "input",
            "available_servers": "metadata.available_servers",
            "tool_calls": "metadata.tool_calls",
        },
        "model": "turing-flash",
    }
)

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
result = evaluator.evaluate(
    eval_templates=[
        LLMFunctionCalling(), TaskCompletion(), AnswerRefusal(), server_attribution,
    ],
    inputs=production_replay_cases,
)

TaskCompletion scores the end-state against the user goal across the trajectory; it runs on the parent agent_run span. AnswerRefusal catches both over-refusal and under-refusal, which matters because Pydantic AI’s retry-on-validation-failure loop can push the agent into a refusal posture when validation keeps failing on edge cases. The ToolServerAttribution rubric carries the case the other templates do not: when two MCP servers expose similarly-named tools (a docs search and a CRM search), the agent can call the wrong one and still produce a plausible answer.

CI gate per run: LLMFunctionCalling >= 0.90, TaskCompletion >= 0.85, AnswerRefusal == "calibrated", ToolServerAttribution == PASS. The build fails on the axis that broke. The general template surface is in agent evaluation frameworks; the function-calling rubric in LLM function calling.

Layer 3: MCP integration security check

Three properties of MCP change the threat model. The tool catalog is part of the prompt: tools/list returns each tool’s name, description, and JSON schema directly into the model’s context. The supply chain is npm install or pip install. The trust boundary moves per session because the active toolset is the union of every registered server at run time. Full methodology in evaluating MCP servers for security; deployment context in what is an MCP gateway.

For a Pydantic AI agent, four runtime checks land at the gateway and one runs in CI. The runtime checks ship through the Agent Command Center dual scanner. mcpsec.go at the chat-completion boundary scans every tool definition before it enters the model’s context, so a poisoned MCP description is rejected at registration. toolguard.go at the per-tool-call hook scans arguments before dispatch and results before they enter the next LLM turn, so footer-injection inside a read_file return never reaches the planning step. Sandbox and permission-escape attempts on arguments score against a fixed payload catalog plus the tool’s declared scope. Cross-tenant isolation comes from per-key AllowedTools / DeniedTools plus x-agentcc-trace-id audit propagation. The CI piece is a 200-500 case regression suite of attack tool definitions and tampered results, run on every PR touching MCP policy. Failures cluster through Error Feed, the Sonnet 4.5 JudgeAgent writes the immediate_fix, the cluster becomes a permanent test on the next run.

Layer 4: agent-dependency invariant verification

Pydantic AI’s deps_type is the most underused contract surface. You declare a deps type, the agent injects it into every tool via ctx.deps on RunContext, and Pydantic enforces the type at parse. What it cannot enforce is whether the model’s plan respects the deps’ invariants. A support agent with deps=AccountContext(tenant_id='acme', role='viewer') can plan a delete_record call. Every type check passes. The action is a policy violation that should never have been planned.

The fix is a deterministic invariant assertion per tool call. The pattern: a pre_call_check that takes (tool_name, args, ctx) and returns an allow or deny verdict against the deps. Surface the verdict as a span attribute. Fail CI on any allow-then-deny case where a verdict flipped because the policy tightened.

from dataclasses import dataclass
from pydantic_ai import Agent, RunContext

@dataclass
class AccountContext:
    tenant_id: str
    role: str  # "viewer", "editor", "admin"

WRITE_TOOLS = {"delete_record", "update_record", "send_email"}

def assert_role_allows(tool_name: str, ctx: RunContext[AccountContext]) -> None:
    if tool_name in WRITE_TOOLS and ctx.deps.role == "viewer":
        raise PermissionError(
            f"role={ctx.deps.role} cannot call {tool_name}"
        )

agent = Agent("openai:gpt-4o", deps_type=AccountContext, output_type=BookingResult)

@agent.tool
def delete_record(ctx: RunContext[AccountContext], record_id: str) -> dict:
    assert_role_allows("delete_record", ctx)
    return _delete(ctx.deps.tenant_id, record_id)

The PermissionError surfaces inside Pydantic AI’s retry loop as a structured tool error. The model sees the message, plans an alternative, and the agent run proceeds without escalating to a wrong action. The pydantic_ai.tool.is_error attribute on the tool_call span carries the verdict, which is what the eval set scores. CI gate: zero tool_call spans where a deps-invariant violation flipped is_error=True on the golden set. Production: a deny verdict at the gateway routes the same call through the toolguard.go permission catalog before it dispatches. Two layers, one invariant, no silent policy drift.

Wiring it up: PydanticAIInstrumentor and EvalTag

You cannot evaluate what you cannot see. The starting point is the PydanticAIInstrumentor (verified at traceAI/python/frameworks/pydantic-ai/traceai_pydantic_ai/_instrumentor.py), a singleton that patches Agent.run, Agent.run_sync, and Agent.run_stream at instrument time plus every decorated @agent.tool function via wrap_tool_function.

pip install pydantic-ai traceai-pydantic-ai ai-evaluation
import os
os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_pydantic_ai import PydanticAIInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="pydantic-mcp-agent",
    project_version_name="prod-2026-05",
)
PydanticAIInstrumentor().instrument(tracer_provider=trace_provider)

What every agent run emits, per _attributes.py:

  • pydantic_ai.span_kind=agent_run on the parent: agent.model, agent.result_type, agent.deps_type, run.prompt, gen_ai.usage.total_tokens, gen_ai.cost.total_usd.
  • pydantic_ai.span_kind=tool_call per tool: tool.name, tool.args, tool.result, tool.duration_ms, tool.is_error, tool.retry_count.
  • pydantic_ai.span_kind=model_request on the underlying call: GenAI semantic conventions (gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens).
  • pydantic_ai.span_kind=result_validation: validation.is_valid, validation.retries. The retry loop is visible as its own span.

Pair the instrumentor with EvalTag rules on the platform so per-field judges, LLMFunctionCalling, TaskCompletion, AnswerRefusal, and PromptInjection attach server-side. A use_builtin=True flag delegates to Pydantic AI’s own Agent.instrument_all() for teams that want a single source of spans.

Production loop: where the four layers compound

Run the gates cheapest-first. output_type parse and deps_type enforcement run inside the framework at no token cost. The deps-invariant assertion is the next layer up, still pure Python. MCP runtime scanners are sub-10 ms for the SDK Scanner tier and 65 ms median for the Protect prompt_injection adapter. LLMFunctionCalling, TaskCompletion, and per-field CustomLLMJudge calls run last, on rows the deterministic gates let through.

Build the golden set from production retries. 300 cases where each is a prompt that produced two or more attempts in the last 30 days of live traffic. pydantic_ai.validation.retries is the natural filter; the prompts that retry are the prompts closest to the framework’s limits.

Score offline and live with the same rubrics. Offline: evaluator.evaluate over the golden set with per-axis CI thresholds. Live: EvalTag rules apply the same templates to sampled production spans. A regression that escapes CI gets caught by the live stream within hours.

Watch retries as a cost signal. Pydantic AI’s retry default is good for uptime and bad for the bill. A prompt whose mean pydantic_ai.validation.retries drifts from 1.1 to 2.4 has doubled in cost without changing in correctness. Chart the distribution per pydantic_ai.agent.result_type; alert on shifts.

Error Feed clusters you will actually see

CI is necessary, not sufficient. A 300-case offline set is a snapshot; production is a river. Error Feed turns the river into named clusters via HDBSCAN soft-clustering over span embeddings in ClickHouse, then the Sonnet 4.5 JudgeAgent (30-turn budget, eight span tools, around 90% prompt-cache hit). Per cluster the Judge writes a 5-category 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, 1-5 each), and an immediate_fix naming the change to ship today.

On Pydantic AI workloads the clusters tend to be contract-shaped:

  • Per-typed-output drift. BookingResult.confirmed=True correlates with a missing provider span. immediate_fix is usually a tightened rubric or a new Pydantic Field constraint.
  • Tool-call attribution drift. Account-shaped queries route to the docs MCP server. immediate_fix is a system-prompt clarification or a more selective toolsets.
  • MCP security verdicts. A tool definition with an injection payload was blocked by mcpsec.go once before the policy tightened. immediate_fix is a new validate_inputs pattern plus a CI regression case.
  • Dependency invariant violations. role='viewer' traces show planned (then blocked) write-tool calls. immediate_fix updates the system prompt so the model stops planning the call.

Linear OAuth ships today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.

How Future AGI ships the Pydantic AI eval stack

Four surfaces, one loop.

ai-evaluation SDK (Apache 2.0) ships the Evaluator, 60+ EvalTemplate classes (EvaluateFunctionCalling aliased as LLMFunctionCalling, TaskCompletion, AnswerRefusal, Completeness, PromptInjection, Groundedness, plus 11 CustomerAgent* templates), CustomLLMJudge for per-typed-output rubrics, and 20+ local heuristic metrics that run sub-second with zero API cost.

traceAI (Apache 2.0) ships the PydanticAIInstrumentor with the singleton patch on Agent.run/run_sync/run_stream, full @agent.tool wrapping, GenAI semantic conventions on model_request spans, and 50+ other instrumentors across Python, TypeScript, Java, and C#.

Agent Command Center is the runtime enforcement point: OpenAI-compatible AI gateway in a single Go binary, Apache 2.0, ~29k req/s at P99 21 ms with guardrails on (t3.xlarge per README). The MCP dual scanner (mcpsec.go + toolguard.go) covers tool-description and tool-result scanning. Per-key AllowedTools / DeniedTools carry the deps-invariant verdict at the gateway. Cloud at gateway.futureagi.com/v1 or self-hosted.

Future AGI Platform ships self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside with HDBSCAN clustering, the Sonnet 4.5 JudgeAgent, and the immediate_fix artifact. SOC 2 Type II, HIPAA, GDPR, and CCPA certified; ISO/IEC 27001 is in active audit. agent-opt closes the loop: six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) consume per-typed-output judge scores as separate study targets. The trace-stream-to-agent-opt connector is the active roadmap item; the eval-driven path with manual export ships today.

Honest tradeoff: if your output_type has two fields, no deps_type invariants, and you do not consume MCP, schema-only testing is enough. The four-layer stack earns its weight when the agent has enough fields, enough tool calls, enough MCP servers, and enough deps that the typed contract stops being a useful health metric.

Anti-patterns we keep seeing

Four mistakes recur on Pydantic AI evaluation programs.

Schema-pass-rate as the eval metric. Parse success is a floor. Treating it as the ceiling produces a system that ships wrong values confidently because every type check is green. Add at least one per-typed-output judge per output_type.

Monolithic “is this object correct” judges. A single 1-5 score over the whole output_type is uninterpretable. Did flight_id break or did confirmed? Per-field judges produce per-field signal; per-field signal names the fix.

No deps-invariant assertions. deps_type enforces the type at parse. It does not enforce the policy. Every write tool reachable from a viewer-role deps run is a silent escalation path until the assertion is in place.

Trusting MCP descriptions blindly. A registered MCP tool is not a vetted MCP tool. The mcpsec.go description scan and toolguard.go result scan are not optional once the agent consumes any third-party server.

What to do this week

Five steps, one Pydantic AI agent.

  1. Wire PydanticAIInstrumentor().instrument(tracer_provider=trace_provider) at startup. Verify the agent_run, tool_call, and model_request spans show up with pydantic_ai.tool.args and pydantic_ai.agent.result_type populated.
  2. Add a CustomLLMJudge per non-trivial field of your output_type. Score offline over a 50-row golden set. Wire per-axis CI thresholds.
  3. Add assert_role_allows-style deterministic checks on every write tool. Surface the verdict on the span; fail CI on golden-set runs that planned a denied call.
  4. Front the agent with the Agent Command Center MCP dual scanner. Enable allowed_servers, validate_inputs, validate_outputs, and per-key AllowedTools from day one.
  5. Turn on Error Feed. Watch the first week’s clusters. Promote representative rows into the regression set. Run a BayesianSearchOptimizer study against the per-typed-output judge that scored worst.

The teams shipping reliable Pydantic AI MCP agents in 2026 stopped reporting schema-pass-rate and started reporting per-typed-output judge scores, tool-call argument fidelity, MCP scanner verdicts, and deps-invariant pass rates. Pydantic AI gives you the type-safe agent. The eval stack gives you the correct one.

Frequently asked questions

Pydantic AI gives me typed agents and typed tools. Why do I need a separate eval layer?
Pydantic AI's typed contract is a free floor. It guarantees that the tool name resolves, the arguments parse against the Pydantic model, the `output_type` matches the declared shape, and a validation failure triggers a structured retry. None of that answers whether the agent picked the right tool, whether the argument values were correct for the user's intent, whether the multi-step trajectory was efficient, or whether the final typed object carried the right values. A `BookingResult` with `confirmed=True` and the wrong `flight_id` passes every type check and still fails the user. The eval layer scores the four things Pydantic AI cannot see: per-typed-output semantic correctness, tool-call argument fidelity, MCP tool-server security, and agent-dependency invariant integrity.
What changes when those Pydantic AI agents consume MCP tools instead of in-process Python tools?
Three things change. First, the tool surface becomes dynamic because MCP servers can be added, removed, or upgraded outside your code, so the eval set has to record which server-version-tool tuple ran. Second, the latency and failure modes move to the network, so each MCP call has its own timeout, partial-result, and retry semantics that the eval should score per-server. Third, security expands to per-call risk because an MCP server description is read by the model as planning context, which makes the tool catalog itself a prompt-injection surface. Evaluation has to capture tool-server attribution, per-server timing, and the MCP-specific guardrail verdict alongside the usual correctness metrics. The dual-scanner methodology is unpacked in our [MCP server security evaluation post](/blog/evaluating-mcp-servers-security-2026/).
How does traceAI instrument Pydantic AI?
The `PydanticAIInstrumentor` (verified at `traceAI/python/frameworks/pydantic-ai/traceai_pydantic_ai/_instrumentor.py`) singleton-patches `Agent.run`, `Agent.run_sync`, and `Agent.run_stream` at instrument time, then wraps every decorated `@agent.tool` function via `wrap_tool_function`. Each agent run emits an OpenTelemetry span tree with `pydantic_ai.span_kind` set to `agent_run` for the parent, `tool_call` for each tool, and `model_request` for the underlying LLM call. Standard GenAI semantic conventions carry the cost surface (`gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.cost.total_usd`); Pydantic-AI-specific extensions carry the typed surface (`pydantic_ai.agent.result_type`, `pydantic_ai.tool.args`, `pydantic_ai.validation.is_valid`, `pydantic_ai.validation.retries`). A `use_builtin=True` flag delegates to Pydantic AI's own `Agent.instrument_all()` when teams want one source of spans.
Which Future AGI evaluation templates apply to Pydantic AI MCP agents?
Four carry most of the load and a fifth catches what the four leave open. `EvaluateFunctionCalling` (aliased as `LLMFunctionCalling`) scores whether the tool name and arguments matched the user intent on every `tool_call` span. `TaskCompletion` scores whether the final typed `output_type` reflects the user goal across the whole trajectory. `AnswerRefusal` scores refusal calibration so retry loops do not push the agent into silent over-refusal. A `CustomLLMJudge` rubric per typed output field scores the semantic correctness Pydantic cannot see (`BookingResult.confirmed` is a bool, but is it the right bool). Layer `PromptInjection` and `DataPrivacyCompliance` on every `model_request` span where the LLM ingested an MCP tool result, since that is where injected content turns into action.
How does the MCP dual scanner protect a Pydantic AI agent at runtime?
Agent Command Center ships an MCP dual scanner that pairs with the Pydantic AI eval set. `mcpsec.go` at the chat-completion boundary scans every tool definition, name, description, and full JSON schema before it lands in the model's context, which is where tool-poisoning attacks ride in through MCP. `toolguard.go` implements the `mcp.ToolCallGuard` interface and fires at every per-tool-call hook, scanning resolved arguments before dispatch and tool results before they enter the next LLM turn. Per-key `AllowedTools` and `DeniedTools` give per-tenant isolation; `MaxAgentDepth` caps loop budget; `x-agentcc-trace-id` ties every call to its issuing key. Both layers can call the `Protect` `prompt_injection` Gemma 3n LoRA adapter (65 ms median per arXiv 2510.13351) and the 8 sub-10 ms SDK Scanners.
What is the agent-dependency invariant and why do typed agents still fail it?
Pydantic AI agents take typed dependencies through `deps_type`, which a tool reads via `ctx.deps` inside `RunContext`. The contract is that the deps are valid for the duration of the run. The failure that schema validation cannot see is when the model's plan implicitly violates the deps' invariants. A support agent with `deps=AccountContext(tenant_id='acme', role='viewer')` calls `delete_record` even though `role='viewer'` means the call is a policy violation that never should have been planned. Every type check passes. The action is wrong. The fix is to run deterministic invariant assertions on every tool call against the live `RunContext`, route violations to a deny verdict at the gateway, and score the assertion pass-rate per `tool_call` span. Cheap to write, free to run, catches the failures that look like clean tool calls in the trace.
How does Error Feed cluster Pydantic AI failures?
Error Feed sits inside the eval stack. HDBSCAN soft-clustering runs over the failing-span embeddings in ClickHouse, then a Claude Sonnet 4.5 JudgeAgent with a 30-turn budget and eight span tools (plus a Haiku Chauffeur for spans over 3000 characters, around 90% prompt-cache hit) writes one `immediate_fix` per cluster. On Pydantic AI workloads the clusters tend to be contract-shaped: per-typed-output drift (`BookingResult.confirmed` is True on `tool_call` failure traces), tool-call attribution drift (`search` calls go to the docs MCP server instead of the CRM MCP server on account questions), MCP security verdicts (the gateway blocked a tool definition with an injection payload, but the same payload landed once before the policy was tightened), and dependency invariant violations (`role='viewer'` plans `delete_record`). Each cluster lands as a Linear ticket today with a pre-written fix; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
Related Articles
View all