The Step-by-Step Guide to MCP Evaluation (2026)
A 2026 workflow for evaluating MCP servers end to end: functional checks, security checks, cross-client compatibility, stress tests, and the CI gate.
Table of Contents
An MCP server is two products under one binary. The first product is the functional surface: tool descriptions that match what the tools do, calls that succeed with the right arguments, results that integrate into the next agent turn without the chain breaking. The second product is the security surface: a tool catalog that can carry injection payloads, results that can be tampered, arguments that can escape sandboxes, tenants that can read each other’s calls. Eval one and you ship a server that works on the demo and breaks on the audit. Eval the other and you ship a server that passes the security review and falls over on a real workload. The MCP eval workflow has to run both gates, on every PR, with one trace tree underneath. This is the step-by-step.
TL;DR: the workflow
| Stage | What it scores | When it runs |
|---|---|---|
| Functional eval | Tool-description accuracy, call success, result integration | Pre-merge on every server PR |
| Security eval | Description injection, result tampering, sandbox escape, cross-tenant | Pre-merge, weekly red-team |
| Compatibility eval | Same tool across Claude / OpenAI / Gemini / Cursor | Pre-merge on schema changes |
| Stress eval | Concurrency, rate-limit, long-tail latency | Nightly, plus pre-release |
| CI gate | All four wired into one PR check | On every server and agent PR |
Functional and security are the two halves of the thesis. Compatibility, stress, and the CI gate are the operational layers that keep both halves honest after the third upstream server change.
Functional eval: three layers, one trace
A functional MCP eval has to score three things, and the cheap layer matters more than the LLM-judge layer.
Tool-description accuracy
The agent reads the tool’s name, description, and JSON schema at tools/list time and uses that text to plan. If the description says “returns the customer’s billing history” and the tool actually returns the last 30 days of invoices, the agent will pick the tool for queries it can’t satisfy. This is a documentation regression that looks like a model regression in the logs.
The eval has two passes. Pass one is deterministic: validate the JSON schema against the tool’s actual signature. A required field missing from the schema, an integer typed as string, an enum that doesn’t match the runtime allow-list — those are mechanical failures with mechanical fixes. The 8 sub-10 ms Scanners from the ai-evaluation SDK pick these up before any LLM hop.
Pass two is semantic. Run a golden corpus of 30 to 60 user prompts that should and should not call this tool through the agent, and score how often the agent picks correctly. The LLMFunctionCalling template (alias EvaluateFunctionCalling, eval ID 98) covers this in one call. A description that’s drifted produces a recall problem; a description that’s overpromised produces a precision problem. Both show up as the same number, then split on the confusion matrix.
Tool-call success rate
A call succeeds when three things hold: the agent picks the right tool, the arguments validate against the schema, and the server returns 200 inside the per-tool budget. The deterministic metrics in python/fi/evals/metrics/function_calling/metrics.py give you function_name_match, parameter_validation, function_call_accuracy, and function_call_exact_match for free. Aim for 98 percent schema compliance in production; below that, an upstream server has drifted and the audit will be three weeks behind.
from fi.evals import Evaluator, LLMFunctionCalling, TestCase
evaluator = Evaluator(fi_api_key=FI_API_KEY, fi_secret_key=FI_SECRET_KEY)
result = evaluator.evaluate(
eval_templates=[LLMFunctionCalling()],
inputs=[TestCase(
input="Refund the duplicate charge on invoice 2024-99812.",
output={
"tool_calls": [
{"name": "search_invoices", "arguments": {"invoice_id": "2024-99812"}},
{"name": "issue_refund", "arguments": {"invoice_id": "2024-99812"}},
],
"final_response": "Refund issued for invoice 2024-99812.",
},
expected_tool_calls=[
{"name": "search_invoices"},
{"name": "issue_refund"},
],
)],
)
Per-tool latency budget matters here. If the user-facing budget for a 4-call chain is 8 seconds, the per-tool p95 lives at about 1.5 seconds. Track p50, p95, p99 per tool and alarm on p95 growth above the 7-day baseline. The tool-calling agents eval guide covers the per-tool span shape.
Result integration
A tool call that returned 200 still fails the chain if the agent can’t parse the result, drops the right field, or ignores the response and replans from the prompt. The result-integration eval scores three signals on the next turn: does the agent reference the returned data, does the chain progress to the next step, and does the final answer cite the right value.
The 7 deterministic agent-trajectory metrics — task_completion, step_efficiency, tool_selection_accuracy, trajectory_score, goal_progress, action_safety, reasoning_quality — score this from an AgentTrajectoryInput of steps, tool calls, and expected goal. All seven set supports_llm_judge=True, so the same metric runs as a cheap heuristic in CI and a judge-augmented rubric in production. The evaluate-MCP-connected-agents guide walks through the chain-level shape end to end.
Security eval: defer to the sister post
The four security checks — tool-description injection scan, tool-result tampering, sandbox and permission-escape attempt detection, cross-tenant data isolation — get their own depth treatment in Evaluating MCP Servers for Security. The summary that matters here:
- The threat surface is wider than a fixed-tool agent because tool catalogs are part of the prompt and the supply chain is
npm install. - The scan layer runs at registration and on every
tools/listrefresh, against name, description, top-level schema, and every nesteddescriptionandenumvalue. - The runtime layer fires at both the chat-completion boundary (
mcpsec.go) and the per-tool-call hook (toolguard.go) because a single stage misses either the catalog or the loop calls. - The cross-tenant layer is structural, not classifier-based: per-key
AllowedTools, namespaced tool IDs, trace-ID scoping on every span.
If you only read one post on MCP security, it is the sister post. If your eval workflow only runs one gate today, run the functional one first and add the security one before the next quarterly red-team. The OWASP LLM Top 10 covers LLM01 (indirect prompt injection) and LLM06 (excessive agency), both of which land directly on MCP.
Compatibility eval: same tool, four clients
An MCP server that works in Claude Desktop and breaks in OpenAI’s Responses API is a production bug that no single-client test will catch. The compatibility gate runs the same tool through four client configs and asserts the same answer.
Three places where compatibility breaks:
- Schema dialect. Claude’s planner handles deeply nested
oneOfandanyOf; OpenAI tool-calling refuses some union shapes. Gemini’s tool config wants flat parameter lists. Cursor’s tool integration is closer to OpenAI’s. A schema that planned cleanly in Claude can crash OpenAI’s tool router. - Argument coercion. A tool that declares
integerand a client that passes"5"works in some clients and fails in others. Booleans get coerced to strings, dates land as ISO 8601 in one client and Unix epoch in another. The schema is the contract; the client is the enforcement. - Result format. Text-only versus content-block results,
mimeTypeon file results, image base64 versus URL. A tool that returns{ content: "..." }works everywhere; a tool that returns{ items: [...] }works in some clients and silently fails in others.
The gate is a fixture suite. For each tool, run a 5 to 10 case fixture set against Claude, OpenAI, Gemini, and Cursor agent configs. Score result equality up to known reformatting (case, whitespace) and assert the same tool sequence per case. A diff between clients is a server bug, not a client quirk; ship the schema fix before the catalog drift hits the next client. The Pydantic AI MCP agent eval covers one specific client; the cross-client matrix is the production-facing version.
Stress eval: concurrency, rate-limit, long tail
Stress eval is the layer that distinguishes a server that runs in dev from one that holds up in production.
Three numbers to gate.
Concurrency. Open N parallel sessions to the server, each calling the same tool with non-overlapping arguments. The break point shows up as 503s, dropped SSE streams, or state leaks between sessions. The stress test runs at 2x peak production concurrency and asserts that error rate stays under 1 percent and that no session reads another session’s state. State leaks are the failure mode that production traffic eventually finds; the eval gate is where you find it first.
Rate-limit behavior. Hit the per-tool rate limit and verify the response is a structured 429 with a Retry-After header, not a 500 with a server stack trace. The gateway’s toolGuardRateCounter atomics enforce this on the Future AGI side; the upstream server needs its own back-pressure layer. The tool_rate_limits field in the gateway config is {tool_name: max_calls_per_minute}. Set it conservatively per tool; widen after observed load.
Long-tail latency. p50 is not the right budget. A 200 ms p50 with a 4-second p99 is a chain that quietly retries on the long tail and pays in token cost. Track p50, p95, p99 per tool and alarm on p95 growth above the 7-day baseline. The Agent Command Center returns x-agentcc-latency-ms per request and tags each fi.span.kind=TOOL span with start and end timestamps, so the latency check runs as a deterministic eval next to the LLM judge.
Load drivers are off-the-shelf — Locust, k6, vegeta. The eval gate is the assertion that the four numbers (error rate, 429 shape, p95, p99) sit inside the production budget. The agent-passes-evals-fails-production post is what this gate exists to prevent.
The CI gate: pre-merge on the server, post-merge on the agent
Two gates, two datasets, one SDK.
Pre-merge on the server PR. When the MCP server changes — a new tool, an updated description, a schema migration — the CI run executes:
- Schema validation across the golden test set. Hard fail on any schema regression.
LLMFunctionCallingon 200 to 500 traces sampled from the last week of production. Threshold: 95 percent pass for tools with stable descriptions, 99 percent for tools the PR did not touch.- Compatibility fixture suite across Claude, OpenAI, Gemini, Cursor configs. Hard fail on any cross-client diff.
- Security regression set from the sister post: description-injection scan, result-tampering scan, sandbox-escape patterns. The threshold is recall above 0.95 on the adversarial set and precision above 0.99 on the benign set.
- Stress run at 2x peak concurrency, with the four-number budget enforced.
Post-merge on the agent PR. When the agent changes — a new prompt, a new model, a new tool inclusion — the CI run executes the integration suite. Frozen prompts run through the agent against the candidate server. The eval asserts the same plan, the same tool sequence, and the same final answer as the baseline. Drift in any of the three is a regression; the CI run blocks the merge until the trace tree explains the diff.
Both gates share the ai-evaluation SDK templates. The difference is the dataset (server-side golden corpus vs. agent-side integration scenarios) and the threshold (98 percent functional pass vs. trajectory equality). The CI/CD LLM eval guide covers the GitHub Actions wiring; the LLM evaluation playbook covers the dataset and judge disciplines.
Worked example: a refund agent across three MCP servers
A support agent connects to invoices (read-only), payments (refund issuance), and notifications (email send). The trace for one request:
- User input: “Refund the duplicate charge on invoice 2024-99812 and tell the customer.”
tools/list: gateway returnsinvoices_search,invoices_get,payments_refund,notifications_email_send. Schemas fetched at session start. Description scan runs against every entry.invoices_get(TOOLspan): arguments{"invoice_id": "2024-99812"}. Latency 180 ms. Schema valid.payments_refund(TOOLspan): arguments{"invoice_id": "2024-99812", "amount": 49.99}. Latency 410 ms. Triggers the per-tool-call hook for the payments allow-list check.notifications_email_send(TOOLspan): the email argument is redacted from the audit log by thedata_privacy_complianceadapter before the row writes.- Final response (
LLMspan): “Refund issued and email sent.”
The eval suite runs on this trace:
function_call_exact_matchagainst the expected three-call chain.parameter_validationagainst each tool’s JSON schema.LLMFunctionCallingfor semantic argument correctness.task_completionandstep_efficiencyfor outcome.action_safetyfor destructive-tool policy.- Description and result scans for the security half.
The trace ID links the audit row, the spans, and the eval scores. A failure in any rubric routes into Error Feed, clusters with similar failures from other traces, and produces an immediate_fix the on-call engineer can ship.
Common mistakes
- Trace at the LLM level only. A response-only score misses the over-calling, retry-on-same-tool, and dropped-context failures that drive cost. Trace at the tool level. The 14 span kinds in traceAI (including
TOOL,A2A_CLIENT,A2A_SERVER) are there for this. - Pin tool names in tests. MCP catalogs are dynamic. Test for tool families and intent, not literal names; otherwise the eval suite breaks on the next
tools/listrefresh. - One budget for the whole chain. A single per-request budget hides the one tool that’s regressing. Budget per tool and alarm on p95 growth.
- Skip cross-client compatibility. A server that works in Claude Desktop and breaks in OpenAI’s Responses API is a production bug. Run the fixture suite across four clients on every schema change.
- No stress gate. A server that’s fine at 1x concurrency can leak state at 2x. Stress eval finds the state leak before the production user does.
- Skip the security gate. Functional eval misses every threat in the sister post. Run both gates on every PR.
How Future AGI ships the MCP eval workflow
The eval stack is one package, and the MCP workflow is the same package read top to bottom.
- ai-evaluation SDK (Apache 2.0). 60+
EvalTemplateclasses includingLLMFunctionCalling,TaskCompletion,Groundedness,PromptInjection,DataPrivacyCompliance,IsHarmfulAdvice. 13 guardrail backends (9 open-weight, 4 API). 8 sub-10 ms Scanners (JailbreakScanner,CodeInjectionScanner,SecretsScanner,MaliciousURLScanner,InvisibleCharScanner,LanguageScanner,TopicRestrictionScanner,RegexScanner). 4 distributed runners (Celery, Ray, Temporal, Kubernetes). 7 deterministic agent-trajectory metrics. The CI gate runs here. - traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, C#.
MCPInstrumentorfor themcppackage;A2AInstrumentorfora2a-sdk. 14 span kinds withTOOL,A2A_CLIENT,A2A_SERVER. Pluggable semantic conventions (FI,OTEL_GENAI,OPENINFERENCE,OPENLLMETRY) atregister()time.@tracer.tooldecorator infersdescriptionfrom docstrings andparametersfrom type hints. - Agent Command Center. Gateway as MCP server and MCP client. Dual scanner (
mcpsec.goat the chat-completion boundary,toolguard.goat the per-tool-call hook). Per-keyAllowedTools/DeniedToolsfor tenant isolation.MaxAgentDepthand per-tool rate counters for budget control. 5-level hierarchical budgets (org, team, user, key, tag).x-agentcc-trace-idpropagation. SOC 2 Type II, HIPAA, GDPR, CCPA certified. Single Go binary, Apache 2.0, ~29k req/s at p99 21 ms with guardrails on per the README. - Future AGI Platform. Self-improving evaluators retune per-tool thresholds from production feedback. Error Feed clusters trace failures via HDBSCAN soft-clustering over ClickHouse-stored embeddings (
euclidean_threshold=0.5,min_cluster_size=2,allow_soft_clustering=True). A Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, 8 span tools, Haiku Chauffeur for >3000-char spans, 90 percent prompt-cache hit) writes theimmediate_fixper cluster. Classifier-backed evals run at lower per-eval cost than Galileo Luna-2, so a daily full-dataset rerun is feasible. - futureagi-mcp-server. Exposes evaluation, dataset, prompt, project, optimization, agent simulation, span-search, and
protecttools to any MCP-compatible client (Claude Desktop, Claude Code, Codex CLI, Continue). The futureagi-mcp-server walkthrough covers the tool list.
Start with the SDK and traceAI for the functional eval and the security checks. Add the gateway when traffic and team size justify the policy surface. Wire the Platform’s Error Feed when cluster-level RCA pays back the integration cost.
Related reading
- Evaluating MCP Servers for Security (2026)
- What is an MCP Gateway?
- Best MCP Gateways (2026)
- Evaluating Tool-Calling Agents (2026)
- How to Evaluate MCP-Connected AI Agents in Production
- CI/CD LLM Eval with GitHub Actions (2026)
- The 2026 LLM Evaluation Playbook
- OWASP LLM Top 10 (2025): Risks and Mitigations
- Your AI Agent Passes Evals But Still Fails in Production
- MCP vs A2A: Two Protocols, Two Problems
- Future AGI MCP Server
Frequently asked questions
Why does MCP evaluation need both functional and security checks?
What functional checks belong in an MCP eval suite?
How do compatibility issues show up in MCP servers?
What does stress evaluation look like for MCP?
Where does the CI gate live in an MCP eval workflow?
How does FAGI score MCP servers end to end?
What's the minimum eval set for an MCP server in production?
Evaluate Pydantic AI agents that call MCP tools in 2026: per-typed-output rubrics, tool-call argument fidelity, MCP security checks, dependency invariants.
Bedrock's built-in eval is dev-loop only. Score action-group correctness, KB retrieval quality, and guardrail precision/recall on every release.
How to evaluate LiteLLM-routed apps: paired comparison across providers on your data, tool-call parity, latency parity, and the gateway alternative.