
The Step-by-Step Guide to MCP Evaluation (2026)

A 2026 workflow for evaluating MCP servers end to end: functional checks, security checks, cross-client compatibility, stress tests, and the CI gate.


An MCP server is two products under one binary. The first product is the functional surface: tool descriptions that match what the tools do, calls that succeed with the right arguments, results that integrate into the next agent turn without the chain breaking. The second product is the security surface: a tool catalog that can carry injection payloads, results that can be tampered with, arguments that can escape sandboxes, tenants that can read each other’s calls. Evaluate only the first and you ship a server that works on the demo and breaks on the audit. Evaluate only the second and you ship a server that passes the security review and falls over on a real workload. The MCP eval workflow has to run both gates, on every PR, with one trace tree underneath. This is the step-by-step.

TL;DR: the workflow

| Stage | What it scores | When it runs |
| --- | --- | --- |
| Functional eval | Tool-description accuracy, call success, result integration | Pre-merge on every server PR |
| Security eval | Description injection, result tampering, sandbox escape, cross-tenant | Pre-merge, weekly red-team |
| Compatibility eval | Same tool across Claude / OpenAI / Gemini / Cursor | Pre-merge on schema changes |
| Stress eval | Concurrency, rate-limit, long-tail latency | Nightly, plus pre-release |
| CI gate | All four wired into one PR check | On every server and agent PR |

Functional and security are the two halves of the thesis. Compatibility, stress, and the CI gate are the operational layers that keep both halves honest after the third upstream server change.

Functional eval: three layers, one trace

A functional MCP eval has to score three things, and the cheap layer matters more than the LLM-judge layer.

Tool-description accuracy

The agent reads the tool’s name, description, and JSON schema at tools/list time and uses that text to plan. If the description says “returns the customer’s billing history” and the tool actually returns the last 30 days of invoices, the agent will pick the tool for queries it can’t satisfy. This is a documentation regression that looks like a model regression in the logs.

The eval has two passes. Pass one is deterministic: validate the JSON schema against the tool’s actual signature. A required field missing from the schema, an integer typed as string, an enum that doesn’t match the runtime allow-list — those are mechanical failures with mechanical fixes. The 8 sub-10 ms Scanners from the ai-evaluation SDK pick these up before any LLM hop.
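A minimal sketch of the deterministic pass, assuming the tool is a plain Python function with type annotations. `schema_mismatches`, `PY_TO_JSON`, and the `get_invoices` example are hypothetical names for illustration and are not part of the ai-evaluation SDK:

```python
import inspect

# Small, deliberately incomplete map from Python annotations to JSON Schema types.
PY_TO_JSON = {int: "integer", float: "number", str: "string", bool: "boolean"}

def schema_mismatches(func, schema):
    """Return readable mismatches between a function signature and its JSON schema."""
    problems = []
    props = schema.get("properties", {})
    required = set(schema.get("required", []))
    params = inspect.signature(func).parameters
    for name, param in params.items():
        if name not in props:
            problems.append(f"{name}: missing from schema")
            continue
        expected = PY_TO_JSON.get(param.annotation)
        declared = props[name].get("type")
        if expected and declared != expected:
            problems.append(f"{name}: schema says {declared}, signature says {expected}")
        has_default = param.default is not inspect.Parameter.empty
        if not has_default and name not in required:
            problems.append(f"{name}: required by signature but optional in schema")
    for name in props.keys() - params.keys():
        problems.append(f"{name}: in schema but not in signature")
    return problems

def get_invoices(customer_id: str, days: int, include_voided: bool = False):
    return []

drifted_schema = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string"},
        "days": {"type": "string"},  # drift: the signature takes an integer
    },
    "required": ["customer_id"],
}

problems = schema_mismatches(get_invoices, drifted_schema)
for line in problems:
    print(line)
```

Each entry in `problems` maps to one mechanical fix in the schema, which is why this pass can hard-fail a PR without an LLM in the loop.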

Pass two is semantic. Run a golden corpus of 30 to 60 user prompts that should and should not call this tool through the agent, and score how often the agent picks correctly. The LLMFunctionCalling template (alias EvaluateFunctionCalling, eval ID 98) covers this in one call. A description that’s drifted produces a recall problem; a description that’s overpromised produces a precision problem. Both show up as the same number, then split on the confusion matrix.
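To make the precision and recall split concrete, here is a hand-rolled sketch of the confusion matrix for one tool. It is not the `LLMFunctionCalling` template; `selection_scores` and the golden-corpus tuples are hypothetical, and a real corpus would hold the 30 to 60 prompts described above:

```python
def selection_scores(cases, tool):
    """Score tool selection for one tool.

    cases: list of (expected_tool, picked_tool); None means "no tool call".
    """
    tp = fp = fn = tn = 0
    for expected, picked in cases:
        should, did = expected == tool, picked == tool
        if should and did:
            tp += 1
        elif did:
            fp += 1  # agent called the tool when it should not have
        elif should:
            fn += 1  # agent skipped the tool when it should have called it
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}

golden = [
    ("get_invoices", "get_invoices"),
    ("get_invoices", "get_invoices"),
    ("get_invoices", "get_invoices"),
    ("get_invoices", None),          # missed call: lowers recall
    (None, "get_invoices"),          # over-call: lowers precision
    ("get_refunds", "get_refunds"),  # different tool: true negative here
]
scores = selection_scores(golden, "get_invoices")
print(scores)
```

A drifted description tends to grow the `fn` cell and an overpromising one tends to grow the `fp` cell, so reading the two cells separately says which direction to edit the description.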

Tool-call success rate

A call succeeds when three things hold: the agent picks the right tool, the arguments validate against the schema, and the server returns 200 inside the per-tool budget. The deterministic metrics in python/fi/evals/metrics/function_calling/metrics.py give you function_name_match, parameter_validation, function_call_accuracy, and function_call_exact_match for free. Aim for 98 percent schema compliance in production; below that, an upstream server has drifted and the audit will be three weeks behind.

```python
import os

from fi.evals import Evaluator, LLMFunctionCalling, TestCase

# Credentials are read from the environment rather than hard-coded.
FI_API_KEY = os.environ["FI_API_KEY"]
FI_SECRET_KEY = os.environ["FI_SECRET_KEY"]

evaluator = Evaluator(fi_api_key=FI_API_KEY, fi_secret_key=FI_SECRET_KEY)

result = evaluator.evaluate(
    eval_templates=[LLMFunctionCalling()],
    inputs=[TestCase(
        input="Refund the duplicate charge on invoice 2024-99812.",
        # What the agent actually did: two tool calls, then a final answer.
        output={
            "tool_calls": [
                {"name": "search_invoices", "arguments": {"invoice_id": "2024-99812"}},
                {"name": "issue_refund",    "arguments": {"invoice_id": "2024-99812"}},
            ],
            "final_response": "Refund issued for invoice 2024-99812.",
        },
        # The tool sequence the golden case expects.
        expected_tool_calls=[
            {"name": "search_invoices"},
            {"name": "issue_refund"},
        ],
    )],
)
```

Per-tool latency budgets matter here. If the user-facing budget for a 4-call chain is 8 seconds, the per-tool p95 lives at about 1.5 seconds: four calls at that ceiling use 6 seconds and leave about 2 seconds for everything else in the chain. Track p50, p95, and p99 per tool and alarm on p95 growth above the 7-day baseline. The tool-calling agents eval guide covers the per-tool span shape.
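A small sketch of the percentile tracking and the p95 alarm, using only the standard library. The 20 percent tolerance, the function names, and the synthetic latency samples are assumptions for illustration; a real gate would read per-tool span durations from the trace store:

```python
import statistics

def latency_summary(samples_ms):
    """Return p50, p95, and p99 for one tool's latency samples (milliseconds)."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def p95_alarm(baseline_ms, current_ms, tolerance=0.20):
    """True when current p95 exceeds the baseline p95 by more than the tolerance."""
    base = latency_summary(baseline_ms)["p95"]
    now = latency_summary(current_ms)["p95"]
    return now > base * (1 + tolerance)

# Synthetic 7-day baseline: 100 samples spread between 100 ms and 298 ms.
baseline = [100 + 2 * i for i in range(100)]
# Same traffic, but the slowest 10 percent of calls got three times slower.
current = baseline[:90] + [3 * ms for ms in baseline[90:]]

print(latency_summary(baseline))
print(latency_summary(current))
print("alarm:", p95_alarm(baseline, current))
```

In this synthetic data the p50 is identical in both windows while the p95 alarm fires, which is the case a median-only dashboard would miss.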

Result integration

A tool call that returned 200 still fails the chain if the agent can’t parse the result, drops the right field, or ignores the response and replans from the prompt. The result-integration eval scores three signals on the next turn: does the agent reference the returned data, does the chain progress to the next step, and does the final answer cite the right value.
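The three signals can be approximated with plain string matching before any judge runs. This is a rough heuristic sketch, not one of the SDK metrics; `result_integration_signals` and the field names in the example are hypothetical:

```python
def result_integration_signals(tool_result, next_call, final_answer, key_field):
    """Heuristic check of how one tool result flowed into the rest of the chain.

    tool_result: dict returned by the tool
    next_call:   {"name": ..., "arguments": {...}} for the following call, or None
    key_field:   the field in tool_result that the chain is expected to carry forward
    """
    value = str(tool_result.get(key_field, ""))
    next_args = (next_call or {}).get("arguments", {})
    next_values = {str(v) for v in next_args.values()}
    return {
        "references_result": bool(value) and value in next_values,
        "chain_progressed": next_call is not None,
        "answer_cites_value": bool(value) and value in final_answer,
    }

tool_result = {"invoice_id": "2024-99812", "amount": "49.99"}
next_call = {"name": "payments_refund",
             "arguments": {"invoice_id": "2024-99812", "amount": "49.99"}}

healthy = result_integration_signals(
    tool_result, next_call, "Refund of 49.99 issued.", key_field="amount")
stalled = result_integration_signals(
    tool_result, None, "I could not find that invoice.", key_field="amount")
print(healthy)
print(stalled)
```

Exact string matching will miss reformatted values such as `$49.99` versus `49.990`, so a check like this works as a cheap first filter rather than a final verdict.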

The 7 deterministic agent-trajectory metrics — task_completion, step_efficiency, tool_selection_accuracy, trajectory_score, goal_progress, action_safety, reasoning_quality — score this from an AgentTrajectoryInput of steps, tool calls, and expected goal. All seven set supports_llm_judge=True, so the same metric runs as a cheap heuristic in CI and a judge-augmented rubric in production. The evaluate-MCP-connected-agents guide walks through the chain-level shape end to end.

Security eval: defer to the sister post

The four security checks — tool-description injection scan, tool-result tampering, sandbox and permission-escape attempt detection, cross-tenant data isolation — get their own depth treatment in Evaluating MCP Servers for Security. The summary that matters here:

  • The threat surface is wider than a fixed-tool agent because tool catalogs are part of the prompt and the supply chain is npm install.
  • The scan layer runs at registration and on every tools/list refresh, against name, description, top-level schema, and every nested description and enum value.
  • The runtime layer fires at both the chat-completion boundary (mcpsec.go) and the per-tool-call hook (toolguard.go) because a single stage misses either the catalog or the loop calls.
  • The cross-tenant layer is structural, not classifier-based: per-key AllowedTools, namespaced tool IDs, trace-ID scoping on every span.

If you only read one post on MCP security, make it the sister post. If your eval workflow only runs one gate today, run the functional one first and add the security one before the next quarterly red-team. The OWASP Top 10 for LLM Applications (2025) lists LLM01 (prompt injection, which includes the indirect form) and LLM06 (excessive agency), both of which land directly on MCP.

Compatibility eval: same tool, four clients

An MCP server that works in Claude Desktop and breaks in OpenAI’s Responses API is a production bug that no single-client test will catch. The compatibility gate runs the same tool through four client configs and asserts the same answer.

Three places where compatibility breaks:

  • Schema dialect. Claude’s planner handles deeply nested oneOf and anyOf; OpenAI tool-calling refuses some union shapes. Gemini’s tool config wants flat parameter lists. Cursor’s tool integration is closer to OpenAI’s. A schema that planned cleanly in Claude can crash OpenAI’s tool router.
  • Argument coercion. A tool that declares integer and a client that passes "5" works in some clients and fails in others. Booleans get coerced to strings, dates land as ISO 8601 in one client and Unix epoch in another. The schema is the contract; the client is the enforcement.
  • Result format. Text-only versus content-block results, mimeType on file results, image base64 versus URL. A tool that returns { content: "..." } works everywhere; a tool that returns { items: [...] } works in some clients and silently fails in others.

The gate is a fixture suite. For each tool, run a 5 to 10 case fixture set against Claude, OpenAI, Gemini, and Cursor agent configs. Score result equality up to known reformatting (case, whitespace) and assert the same tool sequence per case. A diff between clients is a server bug, not a client quirk; ship the schema fix before the catalog drift hits the next client. The Pydantic AI MCP agent eval covers one specific client; the cross-client matrix is the production-facing version.
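One fixture case across four client configs can be compared with a few lines of Python. This is a sketch under assumptions: `runs` stands in for results you have already collected from each client, the first client is treated as the reference, and normalization covers only case and whitespace:

```python
def normalize(text):
    """Collapse case and whitespace, the only reformatting treated as harmless."""
    return " ".join(text.lower().split())

def cross_client_diffs(runs):
    """Compare every client against the first one; return {client: [what differs]}."""
    clients = list(runs)
    reference = runs[clients[0]]
    diffs = {}
    for client in clients[1:]:
        issues = []
        if runs[client]["tools"] != reference["tools"]:
            issues.append("tool sequence")
        if normalize(runs[client]["answer"]) != normalize(reference["answer"]):
            issues.append("answer")
        if issues:
            diffs[client] = issues
    return diffs

expected_tools = ["invoices_get", "payments_refund"]
runs = {
    "claude": {"tools": expected_tools, "answer": "Refund issued for invoice 2024-99812."},
    "openai": {"tools": expected_tools, "answer": "refund issued for  invoice 2024-99812."},
    "gemini": {"tools": ["payments_refund"], "answer": "Refund issued for invoice 2024-99812."},
    "cursor": {"tools": expected_tools, "answer": "Refund failed."},
}

diffs = cross_client_diffs(runs)
print(diffs)
```

An empty result means the case passes; any entry is a candidate server bug to triage before the schema change merges.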

Stress eval: concurrency, rate-limit, long tail

Stress eval is the layer that distinguishes a server that runs in dev from one that holds up in production.

Three areas to gate.

Concurrency. Open N parallel sessions to the server, each calling the same tool with non-overlapping arguments. The break point shows up as 503s, dropped SSE streams, or state leaks between sessions. The stress test runs at 2x peak production concurrency and asserts that error rate stays under 1 percent and that no session reads another session’s state. State leaks are the failure mode that production traffic eventually finds; the eval gate is where you find it first.
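The shape of the concurrency check can be sketched with `asyncio` against an in-memory stand-in. `FakeServer` is hypothetical and exists only so the example runs; a real run would point the same loop at the server under test through an MCP client and use N equal to 2x peak production concurrency:

```python
import asyncio

class FakeServer:
    """In-memory stand-in for an MCP server that keeps per-session state."""

    def __init__(self):
        self._state = {}

    async def call_tool(self, session_id, value):
        self._state[session_id] = value
        await asyncio.sleep(0)  # yield control so other sessions interleave
        return self._state[session_id]

async def run_sessions(server, n):
    """Run n parallel sessions with non-overlapping arguments and classify each."""
    async def one(i):
        try:
            echoed = await server.call_tool(f"session-{i}", f"value-{i}")
            return "ok" if echoed == f"value-{i}" else "leak"
        except Exception:
            return "error"
    return await asyncio.gather(*(one(i) for i in range(n)))

outcomes = asyncio.run(run_sessions(FakeServer(), 200))
error_rate = outcomes.count("error") / len(outcomes)
leaks = outcomes.count("leak")
print(f"error_rate={error_rate:.3f} leaks={leaks}")
```

Giving each session a distinct value is what makes a leak detectable: if a session ever reads back another session's value, the outcome is counted as `leak` rather than `ok`.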

Rate-limit behavior. Hit the per-tool rate limit and verify the response is a structured 429 with a Retry-After header, not a 500 with a server stack trace. The gateway’s toolGuardRateCounter atomics enforce this on the Future AGI side; the upstream server needs its own back-pressure layer. The tool_rate_limits field in the gateway config is {tool_name: max_calls_per_minute}. Set it conservatively per tool; widen after observed load.
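The "structured 429" assertion can be written as a small checker over the status, headers, and body captured by the load driver. `rate_limit_problems` is a hypothetical helper; the only outside fact it relies on is that HTTP allows `Retry-After` to carry either a number of seconds or an HTTP date:

```python
from email.utils import parsedate_to_datetime

def rate_limit_problems(status, headers, body):
    """Return a list of problems with a rate-limit response; empty means well-formed."""
    problems = []
    if status != 429:
        problems.append(f"expected 429, got {status}")
    lowered = {key.lower(): value for key, value in headers.items()}
    retry_after = lowered.get("retry-after")
    if retry_after is None:
        problems.append("missing Retry-After header")
    elif not retry_after.strip().isdigit():
        # Not delay-seconds, so it must parse as an HTTP date.
        try:
            parsedate_to_datetime(retry_after)
        except (TypeError, ValueError):
            problems.append(f"unparseable Retry-After: {retry_after!r}")
    if "Traceback" in body:
        problems.append("body leaks a stack trace")
    return problems

good = rate_limit_problems(429, {"Retry-After": "30"}, '{"error": "rate_limited"}')
bad = rate_limit_problems(500, {}, "Traceback (most recent call last): ...")
print(good)
print(bad)
```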

Long-tail latency. p50 is not the right budget. A 200 ms p50 with a 4-second p99 is a chain that quietly retries on the long tail and pays in token cost. Track p50, p95, p99 per tool and alarm on p95 growth above the 7-day baseline. The Agent Command Center returns x-agentcc-latency-ms per request and tags each fi.span.kind=TOOL span with start and end timestamps, so the latency check runs as a deterministic eval next to the LLM judge.

Load drivers are off-the-shelf — Locust, k6, vegeta. The eval gate is the assertion that the four numbers (error rate, 429 shape, p95, p99) sit inside the production budget. The agent-passes-evals-fails-production post is what this gate exists to prevent.

The CI gate: pre-merge on the server, post-merge on the agent

Two gates, two datasets, one SDK.

Pre-merge on the server PR. When the MCP server changes — a new tool, an updated description, a schema migration — the CI run executes:

  1. Schema validation across the golden test set. Hard fail on any schema regression.
  2. LLMFunctionCalling on 200 to 500 traces sampled from the last week of production. Threshold: 95 percent pass for tools with stable descriptions, 99 percent for tools the PR did not touch.
  3. Compatibility fixture suite across Claude, OpenAI, Gemini, Cursor configs. Hard fail on any cross-client diff.
  4. Security regression set from the sister post: description-injection scan, result-tampering scan, sandbox-escape patterns. The threshold is recall above 0.95 on the adversarial set and precision above 0.99 on the benign set.
  5. Stress run at 2x peak concurrency, with the four-number budget enforced.
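The five steps above reduce to a list of scores checked against thresholds. A minimal aggregator sketch follows; the `StageResult` type, the stage names, and every score are made up for illustration, hard-fail stages are modeled as a threshold of 1.0, the under-1-percent error budget is expressed as a 0.99 success rate, and the comparison is inclusive by assumption:

```python
from dataclasses import dataclass

@dataclass
class StageResult:
    name: str
    score: float      # pass rate between 0.0 and 1.0
    threshold: float  # minimum score required to pass

def gate(results):
    """Return (passed, failures) for one PR check."""
    failures = [r for r in results if r.score < r.threshold]
    return (not failures), failures

results = [
    StageResult("schema_validation", 1.0, 1.0),      # hard fail on any regression
    StageResult("function_calling", 0.96, 0.95),
    StageResult("compatibility", 1.0, 1.0),          # hard fail on any client diff
    StageResult("security_recall", 0.93, 0.95),      # adversarial set
    StageResult("security_precision", 0.995, 0.99),  # benign set
    StageResult("stress_success_rate", 0.995, 0.99),
]

passed, failures = gate(results)
print("merge allowed:", passed)
for failure in failures:
    print(f"  {failure.name}: {failure.score} < {failure.threshold}")
```

Keeping every stage in one list gives the PR a single check status while still naming the stage that blocked the merge.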

Post-merge on the agent PR. When the agent changes — a new prompt, a new model, a new tool inclusion — the CI run executes the integration suite. Frozen prompts run through the agent against the candidate server. The eval asserts the same plan, the same tool sequence, and the same final answer as the baseline. Drift in any of the three is a regression; the CI run blocks the merge until the trace tree explains the diff.

Both gates share the ai-evaluation SDK templates. The difference is the dataset (server-side golden corpus vs. agent-side integration scenarios) and the threshold (98 percent functional pass vs. trajectory equality). The CI/CD LLM eval guide covers the GitHub Actions wiring; the LLM evaluation playbook covers the dataset and judge disciplines.

Worked example: a refund agent across three MCP servers

A support agent connects to invoices (read-only), payments (refund issuance), and notifications (email send). The trace for one request:

  1. User input: “Refund the duplicate charge on invoice 2024-99812 and tell the customer.”
  2. tools/list: gateway returns invoices_search, invoices_get, payments_refund, notifications_email_send. Schemas fetched at session start. Description scan runs against every entry.
  3. invoices_get (TOOL span): arguments {"invoice_id": "2024-99812"}. Latency 180 ms. Schema valid.
  4. payments_refund (TOOL span): arguments {"invoice_id": "2024-99812", "amount": 49.99}. Latency 410 ms. Triggers the per-tool-call hook for the payments allow-list check.
  5. notifications_email_send (TOOL span): the email argument is redacted from the audit log by the data_privacy_compliance adapter before the row writes.
  6. Final response (LLM span): “Refund issued and email sent.”

The eval suite runs on this trace:

  • function_call_exact_match against the expected three-call chain.
  • parameter_validation against each tool’s JSON schema.
  • LLMFunctionCalling for semantic argument correctness.
  • task_completion and step_efficiency for outcome.
  • action_safety for destructive-tool policy.
  • Description and result scans for the security half.

The trace ID links the audit row, the spans, and the eval scores. A failure in any rubric routes into Error Feed, clusters with similar failures from other traces, and produces an immediate_fix the on-call engineer can ship.

Common mistakes

  • Trace at the LLM level only. A response-only score misses the over-calling, retry-on-same-tool, and dropped-context failures that drive cost. Trace at the tool level. The 14 span kinds in traceAI (including TOOL, A2A_CLIENT, A2A_SERVER) are there for this.
  • Pin tool names in tests. MCP catalogs are dynamic. Test for tool families and intent, not literal names; otherwise the eval suite breaks on the next tools/list refresh.
  • One budget for the whole chain. A single per-request budget hides the one tool that’s regressing. Budget per tool and alarm on p95 growth.
  • Skip cross-client compatibility. A server that works in Claude Desktop and breaks in OpenAI’s Responses API is a production bug. Run the fixture suite across four clients on every schema change.
  • No stress gate. A server that’s fine at 1x concurrency can leak state at 2x. Stress eval finds the state leak before the production user does.
  • Skip the security gate. Functional eval misses every threat in the sister post. Run both gates on every PR.
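For the tool-name pinning mistake, one option is to assert on glob patterns per tool family instead of literal names. This sketch uses the standard library's `fnmatch`; the family patterns and `calls_match_families` are hypothetical and would need to follow your own naming scheme:

```python
import fnmatch

def calls_match_families(tool_calls, expected_families):
    """True when each call, in order, matches the glob for its expected family."""
    if len(tool_calls) != len(expected_families):
        return False
    return all(
        fnmatch.fnmatchcase(name, pattern)
        for name, pattern in zip(tool_calls, expected_families)
    )

expected = ["invoices_*", "payments_refund*", "notifications_*"]

original = ["invoices_get", "payments_refund", "notifications_email_send"]
renamed = ["invoices_lookup", "payments_refund", "notifications_email_send"]
wrong_intent = ["invoices_get", "payments_charge", "notifications_email_send"]

print(calls_match_families(original, expected))
print(calls_match_families(renamed, expected))
print(calls_match_families(wrong_intent, expected))
```

A rename inside a family still passes, while a call with a different intent fails, which is the behavior the test should keep across `tools/list` refreshes.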

How Future AGI ships the MCP eval workflow

The eval stack is one package, and the MCP workflow is the same package read top to bottom.

  • ai-evaluation SDK (Apache 2.0). 60+ EvalTemplate classes including LLMFunctionCalling, TaskCompletion, Groundedness, PromptInjection, DataPrivacyCompliance, IsHarmfulAdvice. 13 guardrail backends (9 open-weight, 4 API). 8 sub-10 ms Scanners (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner). 4 distributed runners (Celery, Ray, Temporal, Kubernetes). 7 deterministic agent-trajectory metrics. The CI gate runs here.
  • traceAI (Apache 2.0). 50+ AI surfaces across Python, TypeScript, Java, C#. MCPInstrumentor for the mcp package; A2AInstrumentor for a2a-sdk. 14 span kinds with TOOL, A2A_CLIENT, A2A_SERVER. Pluggable semantic conventions (FI, OTEL_GENAI, OPENINFERENCE, OPENLLMETRY) at register() time. @tracer.tool decorator infers description from docstrings and parameters from type hints.
  • Agent Command Center. Gateway as MCP server and MCP client. Dual scanner (mcpsec.go at the chat-completion boundary, toolguard.go at the per-tool-call hook). Per-key AllowedTools / DeniedTools for tenant isolation. MaxAgentDepth and per-tool rate counters for budget control. 5-level hierarchical budgets (org, team, user, key, tag). x-agentcc-trace-id propagation. SOC 2 Type II, HIPAA, GDPR, CCPA certified. Single Go binary, Apache 2.0, ~29k req/s at p99 21 ms with guardrails on per the README.
  • Future AGI Platform. Self-improving evaluators retune per-tool thresholds from production feedback. Error Feed clusters trace failures via HDBSCAN soft-clustering over ClickHouse-stored embeddings (euclidean_threshold=0.5, min_cluster_size=2, allow_soft_clustering=True). A Claude Sonnet 4.5 Judge agent on Bedrock (30-turn budget, 8 span tools, Haiku Chauffeur for >3000-char spans, 90 percent prompt-cache hit) writes the immediate_fix per cluster. Classifier-backed evals run at lower per-eval cost than Galileo Luna-2, so a daily full-dataset rerun is feasible.
  • futureagi-mcp-server. Exposes evaluation, dataset, prompt, project, optimization, agent simulation, span-search, and protect tools to any MCP-compatible client (Claude Desktop, Claude Code, Codex CLI, Continue). The futureagi-mcp-server walkthrough covers the tool list.

Start with the SDK and traceAI for the functional eval and the security checks. Add the gateway when traffic and team size justify the policy surface. Wire the Platform’s Error Feed when cluster-level RCA pays back the integration cost.

Frequently asked questions

Why does MCP evaluation need both functional and security checks?
Because an MCP server can pass each in isolation and still ship a broken product. The functional side scores whether `tools/list` returns accurate descriptions, whether the agent picks and calls the right tool, and whether the result flows back into the next turn without the chain breaking. The security side scores whether a tool description can carry an injection payload, whether tool results can be tampered with, whether a `path` argument can escape the sandbox, and whether one tenant's calls can be enumerated by another. Skip functional and the server works on a demo and breaks on a real workload. Skip security and it works in QA and breaks on the audit. The MCP eval suite that ships to production runs both gates, on every PR that touches the server, with the same trace plumbing underneath.
What functional checks belong in an MCP eval suite?
Three layers. Tool-description accuracy: the description returned by `tools/list` matches what the tool actually does, the JSON schema validates against the real signature, and required arguments are marked required. A pure documentation regression here causes silent over-calling and silent error retries. Tool-call success rate: the agent picks the right tool from the live catalog (`function_name_match`), the call passes schema validation (`parameter_validation`), and the call returns 200 within the per-tool budget. Result integration: the returned payload parses, the agent reads the right field, and the chain progresses to the next call without dropping context. `LLMFunctionCalling` covers the semantic layer; the deterministic metrics from `python/fi/evals/metrics/function_calling/` cover the cheap layer.
How do compatibility issues show up in MCP servers?
Three places. The tool catalog: Claude, OpenAI's Responses API, Gemini, and Cursor each render JSON schemas with slightly different rules. A `oneOf` that works in Claude's planner can crash OpenAI's. The argument types: integers, booleans, and date strings get coerced differently per client. A schema that says `integer` paired with a client that sends `"5"` is a real production class of bug. The result format: text-only versus content-block results, the `mimeType` field on file results, image base64 versus URL. The compatibility gate is a fixture suite that runs the same tool through Claude, OpenAI, Gemini, and Cursor configs and asserts the same answer. Without it, the server ships to one client and silently breaks on the second.
What does stress evaluation look like for MCP?
Three numbers. Concurrency: open N parallel sessions, each calling the same tool. The break point shows up as 503s, dropped streams, or state leaks between sessions. Rate-limit behavior: hit the per-tool rate limit and verify the server returns a structured 429 with a `Retry-After` header, not a 500. Long-tail latency: at p95 and p99, the per-tool latency should sit inside the chain budget. A 200 ms p50 with a 4 second p99 is a chain that retries on the long tail. Locust, k6, or `vegeta` drive the load; the eval gate asserts that error rate stays below 1 percent at 2x peak concurrency.
Where does the CI gate live in an MCP eval workflow?
In two places. A pre-merge gate on the server PR: schema validation runs against a golden test set, `LLMFunctionCalling` runs on 200 to 500 traces from production sampling, compatibility runs across 4 client configs, and the security regression set from `evaluating-mcp-servers-security-2026` runs against the new build. A post-merge gate on the agent side: the integration suite runs the agent against the server using a frozen prompt and asserts the same plan, the same tool sequence, and the same final answer. Both gates use the same `ai-evaluation` SDK templates; the difference is the dataset and the threshold. The [CI/CD LLM eval guide](/blog/ci-cd-llm-eval-github-actions-2026/) covers the GitHub Actions plumbing.
How does Future AGI score MCP servers end to end?
The eval surface (`ai-evaluation` SDK) handles the functional and security checks as evaluators; traceAI handles the per-tool span instrumentation; the Agent Command Center gateway handles policy enforcement at runtime with `mcpsec.go` at the chat-completion boundary and `toolguard.go` at the per-tool-call hook. Error Feed clusters production failures via HDBSCAN soft-clustering over ClickHouse-stored embeddings, then a Claude Sonnet 4.5 Judge agent on Bedrock writes the `immediate_fix` per cluster with evidence quotes pulled from the trace. The Future AGI Platform retunes per-tool thresholds from production feedback. Honest framing: classifier-backed evals run at lower per-eval cost than Galileo Luna-2, so a daily full-dataset rerun is feasible.
What's the minimum eval set for an MCP server in production?
Six checks gated on every server PR. (1) `parameter_validation` against the JSON schema on every test call. (2) `LLMFunctionCalling` (eval ID 98) on a 100-trace sample of recent production calls. (3) Description-injection scan with the `PromptInjection` template and the `Protect` `prompt_injection` adapter across every registered tool's name, description, and schema. (4) Result-tampering scan with the same cascade applied to every tool return string. (5) Compatibility fixture suite across Claude, OpenAI, Gemini, and Cursor configs. (6) Stress run at 2x peak concurrency with the error-rate threshold at 1 percent. Pass all six or the PR does not merge. Anything narrower ships a server that works on the happy path.