Research

Claude Skills Evaluation Deep Dive: The Skill Is the Contract (2026)

Evaluating Claude Skills in 2026: the skill is a contract, eval the contract. Three rubrics for dispatch, trajectory, integration, on traceAI.

May 3, 2026

Updated May 20, 2026

11 min read

claude-skills agent-evaluation llm-evaluation traceai anthropic skill-dispatch skill-contract 2026

Table of Contents

A Claude Agent SDK supervisor scores 0.91 on TaskCompletion against the final user answer. The refactor lands, the run closes green. Pull the trace: the supervisor activated the pdf-extraction skill on a plain text-summarisation task, the activated skill called Bash even though its SKILL.md frontmatter listed only Read and Write, and a separate finance-audit skill returned a clean variance report the supervisor ignored and recomputed inline with a stale heuristic. The final string reads fine. Every skill activation broke a different contract.

Claude Skills eval is not generic agent eval or generic tool-call eval. A Claude Skill is a contract. The SKILL.md description promises a capability scope, the allowed-tools frontmatter promises a tool surface, and the markdown body promises a procedure. The supervisor reads the contract at dispatch time and integrates the result at the other end. The eval unit is the skill activation triple — supervisor decision, skill-internal trajectory, supervisor integration — and the rubric is contract conformance per skill. This post walks the three rubrics, the traceAI Claude Agent SDK instrumentor that emits the spans the rubrics read, and where the Future AGI eval stack closes the loop.

Why Claude Skills eval differs from tool-call eval

Three structural differences separate Claude Skills from the kind of surface the four-axis Claude Code tool-use eval was written for, and from the dispatch eval the Claude sub-agent post covered.

A skill is procedural knowledge, not a function call. A Claude tool is one callable with a JSON schema; the model picks it, fills the args, reads the structured return. A Claude Skill is a folder containing a SKILL.md (markdown body with YAML frontmatter for name, description, and optional allowed-tools) plus sibling scripts and reference files. Loading a skill is closer to a human reading a runbook than to a function call: the body steers the next several turns, the skill calls multiple tools, and the result is a piece of work.

A skill is a contract surface, not a single decision. The supervisor reads the description to decide whether to activate, the allowed-tools to know what surface the skill is allowed to touch, and the markdown body for the procedure. Each is a separate promise. A skill can pass dispatch (right pick) and break trajectory (drifts past allowed-tools). It can stay on procedure and break integration (the supervisor ignores the return). One score across the whole activation blurs three contracts.

A skill is shared across agents and versionable in git. A regression in one supervisor session hurts one user; a regression in a popular skill hurts every supervisor that ever activates it. Skill authors update bodies, narrow allowed-tools, swap reference files. Eval has to gate on the skill version, not just the supervisor checkpoint, which means a per-skill golden set and CI on every skill version bump.

If your eval stack only grades the supervisor’s final answer, none of these three are visible.

The skill activation as the unit of evaluation

A skill activation is a triple: the supervisor’s decision to load this skill with this activation prompt, the skill body’s internal trajectory inside its declared scope, and the supervisor’s next turn after the skill returns. The eval unit is (supervisor_decision, skill_internal_trajectory, supervisor_integration), not the final assistant message and not the skill output in isolation.

This reframes everything downstream. The regression set is a list of real production activations with their expected skill, scope coverage, and supervisor follow-through. The CI gate asserts per-activation scores, not one supervisor-level number. Error Feed clusters failures by activation pattern (wrong skill, trajectory drift, integration skip), not by final-answer category. agent-opt tunes the planner prompt and each SKILL.md body as separate study targets because failure attribution is per-activation.

A working definition: an activation is correct when the supervisor picked the right skill (or correctly skipped), the skill body stayed inside its declared description and allowed-tools, and the supervisor used the return value in the next plan step. Three named failure modes, three dedicated rubrics, one localised diagnostic the moment a regression lands.

Three rubrics: dispatch correctness, internal trajectory, output integration

Three rubrics, one per contract surface, all built on CustomLLMJudge from the ai-evaluation SDK. The Future AGI Platform’s classifier-backed scoring runs them at lower per-eval cost than Galileo Luna-2 once the rubrics stabilise; the SDK is the starting surface.

from fi.evals import Evaluator
from fi.evals.templates import (
    TaskCompletion, EvaluateFunctionCalling,  # alias: LLMFunctionCalling
    AnswerRefusal, Groundedness,
)
from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

skill_dispatch = CustomLLMJudge(
    name="SkillDispatchCorrectness",
    rubric=(
        "Given the supervisor's plan state, the catalog of available skills "
        "with their name + description + allowed_tools, and the activation "
        "call (skill_name, activation_prompt), score whether the supervisor "
        "picked the right skill, scoped the activation prompt, and chose "
        "activation over inline when activation was warranted. "
        "1.0 = correct. 0.5 = right skill, weak activation prompt. "
        "0.0 = wrong skill, or activation where inline answer was right."
    ),
    input_mapping={
        "plan_state": "supervisor_pre_activation_context",
        "activation_call": "skill_activation_input",
        "available_skills": "skill_catalog",
    },
)

skill_trajectory = CustomLLMJudge(
    name="SkillInternalTrajectory",
    rubric=(
        "Given the SKILL.md description, the allowed_tools frontmatter, and "
        "the skill's full internal turn trace, score whether the body stayed "
        "inside its declared procedure. Penalise tool calls outside the "
        "allowed_tools subset, work done outside the description's scope, "
        "fabricated context the activation never supplied, and reference-doc "
        "drift. Score 0.0 to 1.0."
    ),
    input_mapping={
        "skill_description": "skill_md_description",
        "allowed_tools": "skill_md_allowed_tools",
        "skill_trace": "skill_internal_transcript",
    },
)

skill_integration = CustomLLMJudge(
    name="SkillOutputIntegration",
    rubric=(
        "Given the skill's return value and the supervisor's next turn after "
        "the skill returned, score whether the supervisor read the result, "
        "propagated its constraints, and let the result change the plan. "
        "Penalise: regenerates the work inline, ignores a returned constraint, "
        "contradicts the skill's conclusion without justification."
    ),
    input_mapping={
        "skill_output": "skill_return_value",
        "supervisor_next_turn": "supervisor_post_activation_turn",
    },
)

Run them in layers. TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, and Groundedness cover the supervisor and the skill-internal trajectory at the baseline. Groundedness matters specifically when the skill loads reference files; if the skill returns claims that are not backed by the sibling docs, the eval surfaces it without a separate hallucination test. The three contract rubrics cover the seams the baseline cannot see.

Wire per-axis thresholds into the CI gate. A reasonable starting set is SkillDispatchCorrectness >= 0.85, SkillInternalTrajectory >= 0.90 per skill, SkillOutputIntegration >= 0.80. Bind the assertions to the supervisor system prompt version, the SKILL.md content hash, and the test set tag, so a regression localises to one prompt change. The CI gate fails on the failing axis on the failing skill — one bisect instead of three days.

Building the golden set per skill

Synthetic activations mislead. The supervisor picks a skill based on the actual plan state in front of it, and that state is messy in ways a hand-written test case will not capture. The golden set has to come from real production traces.

Fifty activations is enough to start for a low-traffic skill; two hundred for a flagship one. Pull a representative cross-section from the live traceAI stream, stratified by claude.skill.name so each skill carries at least eight to ten activations. Annotate each with four artifacts: supervisor pre-activation context, the activation call verbatim, the skill’s full internal trajectory, and the supervisor’s next turn after the skill returned. Those map one-to-one onto the three rubrics.

Version the set in git alongside the supervisor system prompt and every SKILL.md in the catalogue. Tag each activation with data_source and difficulty_bucket so a regression in the hard-prod bucket flags differently from one in the easy-synthetic bucket. Keep sixty to seventy percent green activations — the gate needs a passing baseline to detect regressions. Refresh expected outputs quarterly for low-traffic skills, monthly for flagship ones, and quarantine ambiguous scenarios into an advisory set the moment two reviewers disagree.

traceAI Claude Agent SDK skill instrumentation

A skill rubric needs a skill span. The ClaudeAgentSDKInstrumentor emits that span tree without code changes inside your skill definitions. It supports claude-agent-sdk >= 0.1.0, patches ClaudeSDKClient on import, and emits the typed span kinds the rubrics need.

pip install claude-agent-sdk traceAI-claude-agent-sdk \
    fi-instrumentation-otel ai-evaluation

import os
os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_claude_agent_sdk import ClaudeAgentSDKInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="claude-skills-prod",
)
ClaudeAgentSDKInstrumentor().instrument(tracer_provider=trace_provider)

After this call, every supervisor session emits the per-activation signal the rubrics need.

Each skill activation produces a span with fi.span.kind set to AGENT for a top-level skill or CHAIN for a skill that composes other skills. Per-activation attributes include claude.skill.name, claude.skill.version, claude.skill.allowed_tools (JSON list from the frontmatter), claude.skill.activation_prompt (the scoped prompt the supervisor passed verbatim), and claude.skill.discovery_score (the model’s own confidence in the pick when exposed). Tool calls inside the skill nest as fi.span.kind=TOOL children of the skill span, which means a query for “all spans with claude.skill.name=finance_audit and claude.skill.version=2.3” returns exactly the population needed to score that version in isolation, and the same trace tree pivots by skill, by tool, or by supervisor session without rebuilding the data model.

That topology is what the three rubrics read. SkillDispatchCorrectness reads the supervisor’s ASSISTANT_TURN immediately preceding the skill span. SkillInternalTrajectory reads the skill span’s allowed_tools plus the nested TOOL children. SkillOutputIntegration reads the skill span’s return plus the supervisor’s ASSISTANT_TURN immediately following. A generic OTel tracer collapses the whole session into one chat span and loses this attribution, which is exactly why a hand-rolled scorer over the chat transcript misses every contract defect. The Claude Code observability with OpenInference post covers the broader span-topology rationale.

Production loop: Error Feed clustering and version regression

CI is necessary, not sufficient. A two-hundred-activation regression set is a snapshot; production is a river. Score the live trace stream with the same three rubrics and you catch regressions the offline set cannot have, because it was frozen before users found the failure mode. EvalTag on the registered tracer attaches the three rubrics to matching spans server-side, at zero inline latency to the supervisor.

Error Feed is the loop closer. Failing activations flow into ClickHouse with their span embeddings; HDBSCAN soft-clustering groups them into named issues. The clusters that turn up most on Claude Skills stacks are contract-shaped:

Wrong-skill clusters. The supervisor activates finance_audit on a plain Q&A or activates at all when an inline answer was right. The immediate_fix is usually a planner prompt edit adding an explicit activate-or-inline criterion plus a tighter description field in the over-picked skill.
Trajectory-drift clusters. validator-skill starts drafting code; refactor-skill calls a tool outside its allowed-tools frontmatter. The fix is a tighter SKILL.md body — one paragraph enumerating what the skill must not do — plus narrowing allowed-tools.
Integration-skip clusters. The supervisor receives a useful skill output and continues as if it never returned. The fix is a supervisor prompt change that forces a one-sentence acknowledgement of the skill output before the next planning step.

Per cluster, a Claude Sonnet 4.5 JudgeAgent runs a 30-turn investigation across eight span-tools, with a Haiku Chauffeur on spans over 3000 characters and roughly 90 percent prompt-cache hit. The Judge writes a 5-category, 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, 1 to 5 each), and an immediate_fix naming the planner prompt edit, the SKILL.md tighten, or the integration prompt change to ship today. Linear OAuth is wired today; Slack, GitHub, Jira, PagerDuty on the roadmap.

The cluster’s representative activations promote into the regression set. agent-opt’s six optimizers consume the three contract rubrics as the optimisation objective, with the planner prompt and each SKILL.md body as separate study targets — per-prompt separation keeps a winning tweak on finance_audit from being masked by a flat planner. The automated optimization for agents post covers the optimiser mix.

For third-party skills entering the catalogue, the registration-time gate runs JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, and RegexScanner against the SKILL.md body, the allowed-tools frontmatter, and every reference file. Each scanner returns sub-ten milliseconds; the full bundle finishes well under a second. Block publish on any high-severity finding.

Common Claude Skill eval anti-patterns

Four mistakes that hide each contract surface above.

Scoring only the supervisor’s final answer. TaskCompletion on the closing turn misses every activation defect that did not quite poison the final string. One activation picked the wrong skill, another drifted past allowed-tools, the supervisor ignored the one useful result, the final answer scored fine. Score per-activation or the diagnostic that tells you which contract broke never surfaces.

One rubric across all activations. Dispatch, trajectory, and integration are doing different jobs. A single skill-quality rubric blurs the three contracts and produces nothing actionable. Keep them separated.

No skill span, only a chat transcript. A generic OTel tracer collapses a supervisor session into one chat span; claude.skill.name, claude.skill.activation_prompt, and claude.skill.allowed_tools are gone. The rubrics cannot run because there is no activation object to score against.

No skill-version regression test. Skill authors will ship SKILL.md bumps. Without a frozen golden set per skill version, you only notice the regression when a downstream supervisor complains, by which point five other agents depend on the broken skill. Pin the golden set to claude.skill.version and rerun on every commit that touches the body or its reference files.

How Future AGI ships the Claude Skills eval stack

Four surfaces, one loop, no separate products to glue together.

ai-evaluation SDK (Apache 2.0) ships the Evaluator, 60-plus EvalTemplate classes (TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, Groundedness, ChunkAttribution, 11 CustomerAgent* templates), the CustomLLMJudge that carries the three contract rubrics, the Scanner suite for third-party audits, and distributed runners (Celery, Ray, Temporal, Kubernetes).

traceAI (Apache 2.0) ships the ClaudeAgentSDKInstrumentor with its typed span kinds and per-skill attributes (claude.skill.name, claude.skill.version, claude.skill.allowed_tools, claude.skill.activation_prompt), 50-plus other instrumentors across Python and TypeScript, and the EvalTag mechanism that attaches a rubric to a span kind so evals run server-side without polling.

Future AGI Platform ships self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the platform with HDBSCAN clustering, the Sonnet 4.5 JudgeAgent, the 5-category 30-subtype taxonomy, the 4-D trace score, and the immediate_fix artifact. The closed self-improving loop — generate synthetic activations, simulate supervisor sessions, evaluate, optimise the planner and the SKILL.md, fold back into the eval set — is what FAGI ships as a single product. Other vendors ship the parts; FAGI ships the loop.

Agent Command Center plus agent-opt close the safety and improvement loops. The gateway runs a per-tenant policy layer in front of Claude — AllowedTools and DeniedTools on each virtual key extend cleanly to per-skill allowlists, so tenant A can only activate finance skills and tenant B can only activate support skills. Per-call response headers (x-prism-cost, x-prism-latency-ms, x-prism-model-used) give per-tenant per-skill cost telemetry without changing any agent code. SOC 2 Type II, HIPAA, GDPR, CCPA certified per futureagi.com/trust; ISO/IEC 27001 in active audit.

Honest tradeoff: if your stack runs one supervisor with two skills and no third-party imports, the dispatch rubric plus TaskCompletion covers most of it. The three-rubric stack earns its weight the moment you have a shared catalogue, more than one supervisor, or a CI gate that refuses merges on regressions across the skills humans share.

What to do this week

Five steps from zero to a working Claude Skills eval loop.

Wire ClaudeAgentSDKInstrumentor().instrument(tracer_provider=trace_provider). Confirm skill activation spans land with claude.skill.name, claude.skill.version, claude.skill.allowed_tools, and claude.skill.activation_prompt populated.
Pull a hundred production activations across the skills in your catalogue. Annotate with the four artifacts above, stratified by skill name and difficulty bucket.
Define SkillDispatchCorrectness, SkillInternalTrajectory, and SkillOutputIntegration as CustomLLMJudge rubrics. Run them through Evaluator.evaluate alongside TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, and Groundedness.
Wire per-axis thresholds into CI: SkillDispatchCorrectness >= 0.85, SkillInternalTrajectory >= 0.90 per skill, SkillOutputIntegration >= 0.80. Run the six-scanner audit on every third-party skill at registration. Pin the gateway in front of production traffic.
Turn on Error Feed. Promote each week’s cluster reps into the regression set. Run a BayesianSearchOptimizer study on the highest-impact prompt — the planner if SkillDispatchCorrectness is failing, the SKILL.md body if SkillInternalTrajectory is, the integration prompt if SkillOutputIntegration is. Pin the claude-sonnet-4-5 checkpoint; dispatch and trajectory shift first on every refresh.

The teams shipping reliable Claude Skills in 2026 stopped grading the supervisor’s final answer and started grading the contracts. The Claude Agent SDK gives you the runtime. The eval stack gives you the per-activation signal that keeps every skill in the catalogue honest, one SKILL.md at a time.

Frequently asked questions

What is a Claude Skill in the Claude Agent SDK and how does it differ from a tool?

A Claude Skill is a folder containing a SKILL.md file plus optional scripts and reference materials, packaged as a reusable unit of procedural knowledge. The supervisor Claude session sees the skill name and description by default and loads the body only when the task matches. Once loaded, the skill steers the next few turns, calls the tools listed in its allowed-tools frontmatter, and produces a result the supervisor reads. A tool is one callable function with a JSON schema; a skill is a runbook that can call many tools. The Claude Agent SDK and Claude Code both ship with skill support, and traceAI emits a dedicated span for each skill activation. The unit of work is the skill, not the underlying tool calls.

Why is Claude Skills evaluation different from generic agent or tool-call evaluation?

A Claude Skill is a contract: the SKILL.md description promises a capability scope, the allowed-tools list promises a tool surface, and the body promises a procedure. The supervisor reads the contract at dispatch time and integrates the result at the other end. Three things can break independently: the supervisor invokes the wrong skill (or invokes a skill when inline reasoning was right), the skill body drifts off its declared procedure, or the supervisor ignores what the skill returned. A TaskCompletion score on the supervisor's final answer hides all three. The eval unit is the skill activation triple: supervisor decision, skill-internal trajectory, supervisor integration.

What are the three rubrics for Claude Skills evaluation?

SkillDispatchCorrectness scores the supervisor's pre-activation choice — was this the right skill for the goal, was the activation prompt scoped, or should the supervisor have answered inline. SkillInternalTrajectory scores what happens once the skill is loaded — did the body follow its declared procedure, did tool calls stay inside the allowed-tools frontmatter, did the skill respect its capability scope. SkillOutputIntegration scores the supervisor's next turn after the skill returns — did the supervisor read the skill output, propagate constraints, and let the result actually change the plan. Each rubric reads a different span in the traceAI tree and points at a different fix: planner prompt, SKILL.md body, or integration prompt.

How do I instrument Claude Skills with traceAI?

Install claude-agent-sdk plus traceAI-claude-agent-sdk plus ai-evaluation, then call register(project_type=ProjectType.OBSERVE, project_name='claude-skills-prod') and ClaudeAgentSDKInstrumentor().instrument(tracer_provider=trace_provider). Each skill activation produces a span with claude.skill.name, claude.skill.version, claude.skill.allowed_tools, and claude.skill.activation_prompt attributes. Tool calls inside the skill nest as TOOL_EXECUTION children of the skill span, so the same trace tree gives you per-skill rubrics, per-tool rubrics, and per-supervisor-session rubrics without restructuring the data.

How do I catch a third-party Claude Skill carrying prompt injection or scope creep?

Run the Future AGI scanner suite (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, RegexScanner) on the SKILL.md body, the allowed-tools frontmatter, and every reference file at registration time, and rerun on every version bump. JailbreakScanner catches adversarial instructions embedded in the skill body, SecretsScanner blocks embedded credentials, MaliciousURLScanner checks reference doc links, and RegexScanner enforces denylists like ignore-previous patterns. For runtime defense, route Claude traffic through Agent Command Center, which scans both the chat completion stage and each individual tool call inside the skill.

How does Error Feed cluster Claude Skill failures?

Error Feed runs HDBSCAN soft-clustering over span attributes plus per-activation embeddings inside ClickHouse, then fires a Claude Sonnet 4.5 JudgeAgent on each cluster with a 30-turn budget. On Claude Skill stacks the clusters are skill-shaped: wrong-skill clusters (supervisor activates the finance skill on a plain Q&A or activates at all when inline was faster), trajectory-drift clusters (the skill body wanders outside its declared procedure or calls a tool outside its allowed-tools frontmatter), and integration-skip clusters (the supervisor reads the skill output and continues as if it never returned). The Judge writes a 5-category 30-subtype taxonomy entry, the 4-D trace score, and an immediate_fix naming the planner prompt edit, the SKILL.md body tighten, or the integration prompt change that should ship today. Linear OAuth is wired today; Slack, GitHub, Jira, PagerDuty on the roadmap.

View all

Research

Simulated Multi-Turn LLM Evaluation: 2026 Playbook

Simulate persona x scenario x adversary, score multi-turn outcomes, gate releases. Vendor-neutral playbook with code that runs without proprietary SDKs.

NVJK Kartik · Dec 14, 2025

5 min

Research

Evaluating AI Agent Skills in 2026: A Skill-Tree Playbook

Skill-level eval for agents in 2026: discrete skills, per-skill rubrics, regression sets, and CI gates. Vendor-neutral code, no proprietary SDK.

Vrinda Damani · Sep 28, 2025

7 min

Research

Opik Alternatives in 2026: 6 LLM Eval and Observability Tools

FutureAGI, Langfuse, Phoenix, Braintrust, LangSmith, and DeepEval as Comet Opik alternatives in 2026. Pricing, OSS license, judge metrics, and tradeoffs.

Rishav Hada · Apr 10, 2026

13 min

Why Claude Skills eval differs from tool-call eval

The skill activation as the unit of evaluation

Three rubrics: dispatch correctness, internal trajectory, output integration

Building the golden set per skill

traceAI Claude Agent SDK skill instrumentation

Production loop: Error Feed clustering and version regression

Common Claude Skill eval anti-patterns

How Future AGI ships the Claude Skills eval stack

What to do this week

Related reading

Frequently asked questions