Claude Skills Evaluation Deep Dive: The Skill Is the Contract (2026)
Evaluating Claude Skills in 2026: the skill is a contract, eval the contract. Three rubrics for dispatch correctness, internal trajectory, output integration, on traceAI.
Table of Contents
A Claude Agent SDK supervisor scores 0.91 on TaskCompletion against the final user answer. The refactor lands, the run closes green. Pull the trace: the supervisor activated the pdf-extraction skill on a plain text-summarisation task, the activated skill called Bash even though its SKILL.md frontmatter listed only Read and Write, and a separate finance-audit skill returned a clean variance report the supervisor ignored and recomputed inline with a stale heuristic. The final string reads fine. Every skill activation broke a different contract.
Claude Skills eval is not generic agent eval or generic tool-call eval. A Claude Skill is a contract. The SKILL.md description promises a capability scope, the allowed-tools frontmatter promises a tool surface, and the markdown body promises a procedure. The supervisor reads the contract at dispatch time and integrates the result at the other end. The eval unit is the skill activation triple — supervisor decision, skill-internal trajectory, supervisor integration — and the rubric is contract conformance per skill. This post walks the three rubrics, the traceAI Claude Agent SDK instrumentor that emits the spans the rubrics read, and where the Future AGI eval stack closes the loop.
Why Claude Skills eval differs from tool-call eval
Three structural differences separate Claude Skills from the kind of surface the four-axis Claude Code tool-use eval was written for, and from the dispatch eval the Claude sub-agent post covered.
A skill is procedural knowledge, not a function call. A Claude tool is one callable with a JSON schema; the model picks it, fills the args, reads the structured return. A Claude Skill is a folder containing a SKILL.md (markdown body with YAML frontmatter for name, description, and optional allowed-tools) plus sibling scripts and reference files. Loading a skill is closer to a human reading a runbook than to a function call: the body steers the next several turns, the skill calls multiple tools, and the result is a piece of work.
A skill is a contract surface, not a single decision. The supervisor reads the description to decide whether to activate, the allowed-tools to know what surface the skill is allowed to touch, and the markdown body for the procedure. Each is a separate promise. A skill can pass dispatch (right pick) and break trajectory (drifts past allowed-tools). It can stay on procedure and break integration (the supervisor ignores the return). One score across the whole activation blurs three contracts.
A skill is shared across agents and versionable in git. A regression in one supervisor session hurts one user; a regression in a popular skill hurts every supervisor that ever activates it. Skill authors update bodies, narrow allowed-tools, swap reference files. Eval has to gate on the skill version, not just the supervisor checkpoint, which means a per-skill golden set and CI on every skill version bump.
If your eval stack only grades the supervisor’s final answer, none of these three are visible.
The skill activation as the unit of evaluation
A skill activation is a triple: the supervisor’s decision to load this skill with this activation prompt, the skill body’s internal trajectory inside its declared scope, and the supervisor’s next turn after the skill returns. The eval unit is (supervisor_decision, skill_internal_trajectory, supervisor_integration), not the final assistant message and not the skill output in isolation.
This reframes everything downstream. The regression set is a list of real production activations with their expected skill, scope coverage, and supervisor follow-through. The CI gate asserts per-activation scores, not one supervisor-level number. Error Feed clusters failures by activation pattern (wrong skill, trajectory drift, integration skip), not by final-answer category. agent-opt tunes the planner prompt and each SKILL.md body as separate study targets because failure attribution is per-activation.
A working definition: an activation is correct when the supervisor picked the right skill (or correctly skipped), the skill body stayed inside its declared description and allowed-tools, and the supervisor used the return value in the next plan step. Three named failure modes, three dedicated rubrics, one localised diagnostic the moment a regression lands.
Three rubrics: dispatch correctness, internal trajectory, output integration
Three rubrics, one per contract surface, all built on CustomLLMJudge from the ai-evaluation SDK. The Future AGI Platform’s classifier-backed scoring runs them at lower per-eval cost than Galileo Luna-2 once the rubrics stabilise; the SDK is the starting surface.
from fi.evals import Evaluator
from fi.evals.templates import (
TaskCompletion, EvaluateFunctionCalling, # alias: LLMFunctionCalling
AnswerRefusal, Groundedness,
)
from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge
evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
skill_dispatch = CustomLLMJudge(
name="SkillDispatchCorrectness",
rubric=(
"Given the supervisor's plan state, the catalog of available skills "
"with their name + description + allowed_tools, and the activation "
"call (skill_name, activation_prompt), score whether the supervisor "
"picked the right skill, scoped the activation prompt, and chose "
"activation over inline when activation was warranted. "
"1.0 = correct. 0.5 = right skill, weak activation prompt. "
"0.0 = wrong skill, or activation where inline answer was right."
),
input_mapping={
"plan_state": "supervisor_pre_activation_context",
"activation_call": "skill_activation_input",
"available_skills": "skill_catalog",
},
)
skill_trajectory = CustomLLMJudge(
name="SkillInternalTrajectory",
rubric=(
"Given the SKILL.md description, the allowed_tools frontmatter, and "
"the skill's full internal turn trace, score whether the body stayed "
"inside its declared procedure. Penalise tool calls outside the "
"allowed_tools subset, work done outside the description's scope, "
"fabricated context the activation never supplied, and reference-doc "
"drift. Score 0.0 to 1.0."
),
input_mapping={
"skill_description": "skill_md_description",
"allowed_tools": "skill_md_allowed_tools",
"skill_trace": "skill_internal_transcript",
},
)
skill_integration = CustomLLMJudge(
name="SkillOutputIntegration",
rubric=(
"Given the skill's return value and the supervisor's next turn after "
"the skill returned, score whether the supervisor read the result, "
"propagated its constraints, and let the result change the plan. "
"Penalise: regenerates the work inline, ignores a returned constraint, "
"contradicts the skill's conclusion without justification."
),
input_mapping={
"skill_output": "skill_return_value",
"supervisor_next_turn": "supervisor_post_activation_turn",
},
)
Run them in layers. TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, and Groundedness cover the supervisor and the skill-internal trajectory at the baseline. Groundedness matters specifically when the skill loads reference files; if the skill returns claims that are not backed by the sibling docs, the eval surfaces it without a separate hallucination test. The three contract rubrics cover the seams the baseline cannot see.
Wire per-axis thresholds into the CI gate. A reasonable starting set is SkillDispatchCorrectness >= 0.85, SkillInternalTrajectory >= 0.90 per skill, SkillOutputIntegration >= 0.80. Bind the assertions to the supervisor system prompt version, the SKILL.md content hash, and the test set tag, so a regression localises to one prompt change. The CI gate fails on the failing axis on the failing skill — one bisect instead of three days.
Building the golden set per skill
Synthetic activations mislead. The supervisor picks a skill based on the actual plan state in front of it, and that state is messy in ways a hand-written test case will not capture. The golden set has to come from real production traces.
Fifty activations is enough to start for a low-traffic skill; two hundred for a flagship one. Pull a representative cross-section from the live traceAI stream, stratified by claude.skill.name so each skill carries at least eight to ten activations. Annotate each with four artifacts: supervisor pre-activation context, the activation call verbatim, the skill’s full internal trajectory, and the supervisor’s next turn after the skill returned. Those map one-to-one onto the three rubrics.
Version the set in git alongside the supervisor system prompt and every SKILL.md in the catalogue. Tag each activation with data_source and difficulty_bucket so a regression in the hard-prod bucket flags differently from one in the easy-synthetic bucket. Keep sixty to seventy percent green activations — the gate needs a passing baseline to detect regressions. Refresh expected outputs quarterly for low-traffic skills, monthly for flagship ones, and quarantine ambiguous scenarios into an advisory set the moment two reviewers disagree.
traceAI Claude Agent SDK skill instrumentation
A skill rubric needs a skill span. The ClaudeAgentSDKInstrumentor emits that span tree without code changes inside your skill definitions. It supports claude-agent-sdk >= 0.1.0, patches ClaudeSDKClient on import, and emits the typed span kinds the rubrics need.
pip install claude-agent-sdk traceAI-claude-agent-sdk \
fi-instrumentation-otel ai-evaluation
import os
os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_claude_agent_sdk import ClaudeAgentSDKInstrumentor
trace_provider = register(
project_type=ProjectType.OBSERVE,
project_name="claude-skills-prod",
)
ClaudeAgentSDKInstrumentor().instrument(tracer_provider=trace_provider)
After this call, every supervisor session emits the per-activation signal the rubrics need.
Each skill activation produces a span with fi.span.kind set to AGENT for a top-level skill or CHAIN for a skill that composes other skills. Per-activation attributes include claude.skill.name, claude.skill.version, claude.skill.allowed_tools (JSON list from the frontmatter), claude.skill.activation_prompt (the scoped prompt the supervisor passed verbatim), and claude.skill.discovery_score (the model’s own confidence in the pick when exposed). Tool calls inside the skill nest as fi.span.kind=TOOL children of the skill span, which means a query for “all spans with claude.skill.name=finance_audit and claude.skill.version=2.3” returns exactly the population needed to score that version in isolation, and the same trace tree pivots by skill, by tool, or by supervisor session without rebuilding the data model.
That topology is what the three rubrics read. SkillDispatchCorrectness reads the supervisor’s ASSISTANT_TURN immediately preceding the skill span. SkillInternalTrajectory reads the skill span’s allowed_tools plus the nested TOOL children. SkillOutputIntegration reads the skill span’s return plus the supervisor’s ASSISTANT_TURN immediately following. A generic OTel tracer collapses the whole session into one chat span and loses this attribution, which is exactly why a hand-rolled scorer over the chat transcript misses every contract defect. The Claude Code observability with OpenInference post covers the broader span-topology rationale.
Production loop: Error Feed clustering and version regression
CI is necessary, not sufficient. A two-hundred-activation regression set is a snapshot; production is a river. Score the live trace stream with the same three rubrics and you catch regressions the offline set cannot have, because it was frozen before users found the failure mode. EvalTag on the registered tracer attaches the three rubrics to matching spans server-side, at zero inline latency to the supervisor.
Error Feed is the loop closer. Failing activations flow into ClickHouse with their span embeddings; HDBSCAN soft-clustering groups them into named issues. The clusters that turn up most on Claude Skills stacks are contract-shaped:
- Wrong-skill clusters. The supervisor activates
finance_auditon a plain Q&A or activates at all when an inline answer was right. Theimmediate_fixis usually a planner prompt edit adding an explicit activate-or-inline criterion plus a tighterdescriptionfield in the over-picked skill. - Trajectory-drift clusters.
validator-skillstarts drafting code;refactor-skillcalls a tool outside itsallowed-toolsfrontmatter. The fix is a tighterSKILL.mdbody — one paragraph enumerating what the skill must not do — plus narrowingallowed-tools. - Integration-skip clusters. The supervisor receives a useful skill output and continues as if it never returned. The fix is a supervisor prompt change that forces a one-sentence acknowledgement of the skill output before the next planning step.
Per cluster, a Claude Sonnet 4.5 JudgeAgent runs a 30-turn investigation across eight span-tools, with a Haiku Chauffeur on spans over 3000 characters and roughly 90 percent prompt-cache hit. The Judge writes a 5-category, 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, 1 to 5 each), and an immediate_fix naming the planner prompt edit, the SKILL.md tighten, or the integration prompt change to ship today. Linear OAuth is wired today; Slack, GitHub, Jira, PagerDuty on the roadmap.
The cluster’s representative activations promote into the regression set. agent-opt’s six optimizers consume the three contract rubrics as the optimisation objective, with the planner prompt and each SKILL.md body as separate study targets — per-prompt separation keeps a winning tweak on finance_audit from being masked by a flat planner. The automated optimization for agents post covers the optimiser mix.
For third-party skills entering the catalogue, the registration-time gate runs JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, and RegexScanner against the SKILL.md body, the allowed-tools frontmatter, and every reference file. Each scanner returns sub-ten milliseconds; the full bundle finishes well under a second. Block publish on any high-severity finding.
Common Claude Skill eval anti-patterns
Four mistakes that hide each contract surface above.
Scoring only the supervisor’s final answer. TaskCompletion on the closing turn misses every activation defect that did not quite poison the final string. One activation picked the wrong skill, another drifted past allowed-tools, the supervisor ignored the one useful result, the final answer scored fine. Score per-activation or the diagnostic that tells you which contract broke never surfaces.
One rubric across all activations. Dispatch, trajectory, and integration are doing different jobs. A single skill-quality rubric blurs the three contracts and produces nothing actionable. Keep them separated.
No skill span, only a chat transcript. A generic OTel tracer collapses a supervisor session into one chat span; claude.skill.name, claude.skill.activation_prompt, and claude.skill.allowed_tools are gone. The rubrics cannot run because there is no activation object to score against.
No skill-version regression test. Skill authors will ship SKILL.md bumps. Without a frozen golden set per skill version, you only notice the regression when a downstream supervisor complains, by which point five other agents depend on the broken skill. Pin the golden set to claude.skill.version and rerun on every commit that touches the body or its reference files.
How Future AGI ships the Claude Skills eval stack
Four surfaces, one loop, no separate products to glue together.
ai-evaluation SDK (Apache 2.0) ships the Evaluator, 60-plus EvalTemplate classes (TaskCompletion, EvaluateFunctionCalling, AnswerRefusal, Groundedness, ChunkAttribution, 11 CustomerAgent* templates), the CustomLLMJudge that carries the three contract rubrics, the Scanner suite for third-party audits, and distributed runners (Celery, Ray, Temporal, Kubernetes).
traceAI (Apache 2.0) ships the ClaudeAgentSDKInstrumentor with its typed span kinds and per-skill attributes (claude.skill.name, claude.skill.version, claude.skill.allowed_tools, claude.skill.activation_prompt), 50-plus other instrumentors across Python and TypeScript, and the EvalTag mechanism that attaches a rubric to a span kind so evals run server-side without polling.
Future AGI Platform ships self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the platform with HDBSCAN clustering, the Sonnet 4.5 JudgeAgent, the 5-category 30-subtype taxonomy, the 4-D trace score, and the immediate_fix artifact. The closed self-improving loop — generate synthetic activations, simulate supervisor sessions, evaluate, optimise the planner and the SKILL.md, fold back into the eval set — is what FAGI ships as a single product. Other vendors ship the parts; FAGI ships the loop.
Agent Command Center plus agent-opt close the safety and improvement loops. The gateway runs a per-tenant policy layer in front of Claude — AllowedTools and DeniedTools on each virtual key extend cleanly to per-skill allowlists, so tenant A can only activate finance skills and tenant B can only activate support skills. Per-call response headers (x-prism-cost, x-prism-latency-ms, x-prism-model-used) give per-tenant per-skill cost telemetry without changing any agent code. SOC 2 Type II, HIPAA, GDPR, CCPA certified per futureagi.com/trust; ISO/IEC 27001 in active audit.
Honest tradeoff: if your stack runs one supervisor with two skills and no third-party imports, the dispatch rubric plus TaskCompletion covers most of it. The three-rubric stack earns its weight the moment you have a shared catalogue, more than one supervisor, or a CI gate that refuses merges on regressions across the skills humans share.
What to do this week
Five steps from zero to a working Claude Skills eval loop.
- Wire
ClaudeAgentSDKInstrumentor().instrument(tracer_provider=trace_provider). Confirm skill activation spans land withclaude.skill.name,claude.skill.version,claude.skill.allowed_tools, andclaude.skill.activation_promptpopulated. - Pull a hundred production activations across the skills in your catalogue. Annotate with the four artifacts above, stratified by skill name and difficulty bucket.
- Define
SkillDispatchCorrectness,SkillInternalTrajectory, andSkillOutputIntegrationasCustomLLMJudgerubrics. Run them throughEvaluator.evaluatealongsideTaskCompletion,EvaluateFunctionCalling,AnswerRefusal, andGroundedness. - Wire per-axis thresholds into CI:
SkillDispatchCorrectness >= 0.85,SkillInternalTrajectory >= 0.90per skill,SkillOutputIntegration >= 0.80. Run the six-scanner audit on every third-party skill at registration. Pin the gateway in front of production traffic. - Turn on Error Feed. Promote each week’s cluster reps into the regression set. Run a
BayesianSearchOptimizerstudy on the highest-impact prompt — the planner ifSkillDispatchCorrectnessis failing, theSKILL.mdbody ifSkillInternalTrajectoryis, the integration prompt ifSkillOutputIntegrationis. Pin theclaude-sonnet-4-5checkpoint; dispatch and trajectory shift first on every refresh.
The teams shipping reliable Claude Skills in 2026 stopped grading the supervisor’s final answer and started grading the contracts. The Claude Agent SDK gives you the runtime. The eval stack gives you the per-activation signal that keeps every skill in the catalogue honest, one SKILL.md at a time.
Related reading
- Evaluating Claude Sub-Agents: The Dispatch Is the Unit (2026)
- Evaluating Claude Code Tool Use in 2026
- What is an Agent Skill? The SKILL.md Primitive Explained for 2026
- Evaluating AI Agent Skills in 2026: A Skill-Tree Playbook
- Claude Code Observability with OpenInference and OpenTelemetry (2026)
- Automated Optimization for Agents (2026)
Frequently asked questions
What is a Claude Skill in the Claude Agent SDK and how does it differ from a tool?
Why is Claude Skills evaluation different from generic agent or tool-call evaluation?
What are the three rubrics for Claude Skills evaluation?
How do I instrument Claude Skills with traceAI?
How do I catch a third-party Claude Skill carrying prompt injection or scope creep?
How does Error Feed cluster Claude Skill failures?
Simulate persona × scenario × adversary, score multi-turn outcomes, and gate releases. Vendor-neutral playbook with code that runs without proprietary SDKs.
Skill-level eval for agents in 2026: discrete skills, per-skill rubrics, regression sets, and CI gates. Vendor-neutral code, no proprietary SDK.
FutureAGI, Langfuse, Phoenix, Braintrust, LangSmith, and DeepEval as Comet Opik alternatives in 2026. Pricing, OSS license, judge metrics, and tradeoffs.