What is an Agent Skill? The SKILL.md Primitive Explained for 2026
An agent skill is a folder of instructions, scripts, and resources packaged as a SKILL.md unit. What it is, how skills compose, and how teams use them in 2026.
Picture a team whose customer-support agent works well on the simple cases (lookup, refund, status) but struggles on legacy escalation paths, where the right answer requires reading a four-page runbook, running three SQL queries, and following a decision tree about when to route to billing vs operations. Cramming the runbook into the system prompt blows the context budget on every turn; pulling it into RAG often retrieves the wrong chunks; agentic decision trees do not chunk well. Packaging the whole workflow as a skill (a folder with SKILL.md describing the problem, the runbook embedded as markdown, a Python script for the SQL queries, and a few example transcripts) gives the agent a single addressable artifact that loads only when a ticket matches the trigger description. That situation is the problem skills exist to solve; the actual deltas (pass-rate, token budget, latency) are what you measure with the eval setup later in this piece.
This piece walks through what an agent skill is, the SKILL.md anatomy, how skills compose, where they fit alongside tools and MCP, and how to build and evaluate one in production in 2026.
TL;DR: What an agent skill is
An agent skill is a folder of instructions, scripts, and resources packaged as a unit, with a SKILL.md file at the root that describes when to use it. The pattern was introduced by Anthropic and is used across Claude.ai, Claude Desktop, the Claude Agent SDK, and an increasing number of third-party agent frameworks. A skill folder typically contains the SKILL.md (markdown with YAML frontmatter), optional executable scripts (Python, shell), reference data, and example transcripts. The agent loads the skill name and description by default; it loads the body only when the task matches the description. The pattern is progressive disclosure of context, which lets a single agent carry a large library of skills without paying the context cost for all of them on every turn.
Why skills exist
Three forces converged.
First, system prompts hit a context-budget ceiling. Every system prompt token is paid on every turn. Cramming ten runbooks into one system prompt costs 30K tokens per turn; the responses get slower; the cost compounds across users.
Second, RAG over runbooks can miss when the structure matters. Retrieval against a database of runbooks works when the user’s question is well-formed; agentic workflows that follow a decision tree are not well-suited to chunk retrieval. The right runbook is the runbook the agent has in front of it at the right step, not the runbook the retriever guessed at.
Third, the unit of agent knowledge needed a portable, versionable, evaluatable home. Tools are great at single-call mechanism. Prompts are great at one-shot framing. Neither is the right unit for “here is how this team handles refund escalations end-to-end.”
Skills filled the gap. The skill folder is the unit; SKILL.md is the contract; lazy loading is the context-budget strategy; git is the version-control story; eval is the quality gate.
The skill folder
A skill is a folder. The minimum is one file:
my-skill/
└── SKILL.md
A typical production skill is a few files:
refund-escalation/
├── SKILL.md
├── runbook.md
├── query_orders.py
├── decision_tree.md
└── examples/
├── case-001.md
├── case-002.md
└── case-003.md
The agent loads SKILL.md first. The body of SKILL.md references the sibling files by relative path; the agent reads them as needed.
SKILL.md anatomy
The file is markdown with YAML frontmatter.
---
name: refund-escalation
description: Handle refund cases that require legacy-system reconciliation. Use when a refund request involves orders before 2024-06 or has an escalation tag.
# Claude Code only: scope which Claude Code tools this skill may call.
# Agent SDK applications control tool access via SDK options instead.
allowed-tools:
- Read
- Bash
- Edit
license: MIT
version: 1.2.0
---
# Refund escalation
When a customer's refund request matches the conditions in the description, follow the steps below.
1. Read `runbook.md` to confirm the case type.
2. Run `query_orders.py` with the customer id to get the legacy order record.
3. Walk through `decision_tree.md` based on the order record.
4. If the decision tree resolves: issue the refund through the standard flow and explain the resolution to the customer.
5. If the decision tree does not resolve: route to billing-ops with a summary.
See `examples/` for three worked cases.
The frontmatter declares the skill’s identity. The body is the instruction the agent follows when the skill loads.
Frontmatter fields
- name (required): unique identifier, kebab-case.
- description (required): a sentence or two explaining when to use the skill. The agent matches the user’s task against this description; a clear, specific description is what makes the skill discoverable.
- allowed-tools (optional, Claude Code only): a list of Claude Code tool names the skill is allowed to call, used as a scoping mechanism for security in the Claude Code surface. Agent SDK applications instead control tool access through SDK options at runtime.
- Cataloging fields like license and version are common conventions in published skills; the canonical Anthropic basic structure is name and description, with everything else surface-specific or community convention.
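For teams that lint skills in CI, a minimal check might look like the sketch below. It assumes PyYAML is available and a local skill folder; the kebab-case rule for name is the convention described above, and lint_skill is a hypothetical helper, not part of any published tooling.

# Hypothetical lint: checks SKILL.md frontmatter for the two canonical
# fields (name, description) plus the kebab-case naming convention.
import re
from pathlib import Path

import yaml  # assumes PyYAML is installed

KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def lint_skill(skill_dir: str) -> list[str]:
    """Return a list of problems found in <skill_dir>/SKILL.md, empty if clean."""
    problems = []
    text = (Path(skill_dir) / "SKILL.md").read_text(encoding="utf-8")

    # Frontmatter is the block between the first two '---' markers.
    parts = text.split("---", 2)
    if len(parts) < 3:
        return ["missing YAML frontmatter"]
    meta = yaml.safe_load(parts[1]) or {}

    if not meta.get("name"):
        problems.append("missing required field: name")
    elif not KEBAB_CASE.match(meta["name"]):
        problems.append(f"name is not kebab-case: {meta['name']!r}")
    if not meta.get("description"):
        problems.append("missing required field: description")

    return problems

print(lint_skill("refund-escalation"))  # [] if the frontmatter above is intact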
The body
The body is system-prompt-shaped. It can be free-form prose, numbered steps, decision trees, or a mix. It can reference sibling files by relative path; the agent reads them when the body says to.
How skills compose
Three composition patterns.

Lazy loading
The agent’s runtime presents the list of available skills (name plus description) at the top of the conversation. The body is not loaded until the agent decides the task matches. The pattern is identical to how a developer with a stack of runbooks in a wiki picks the right one to read.
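A sketch of what that two-stage loop can look like in a custom runtime follows. The helper names (parse_frontmatter, build_catalog, handle_turn, call_model) are hypothetical; the only structural assumptions are the ones the pattern itself makes: the catalog carries name plus description, and a body is read from disk only after the model decides the task matches.

# Progressive disclosure, sketched: the system prompt carries only names and
# descriptions; a skill body is read from disk only when the model decides
# the current task matches. Helper names are hypothetical.
from pathlib import Path

import yaml  # assumes PyYAML is installed

def parse_frontmatter(text: str) -> dict:
    """YAML block between the first two '---' markers of a SKILL.md."""
    return yaml.safe_load(text.split("---", 2)[1]) or {}

def build_catalog(skills_root: str) -> dict[str, dict]:
    """Map skill name -> {description, path} for every folder containing a SKILL.md."""
    catalog = {}
    for skill_md in Path(skills_root).glob("*/SKILL.md"):
        meta = parse_frontmatter(skill_md.read_text(encoding="utf-8"))
        catalog[meta["name"]] = {"description": meta["description"], "path": skill_md}
    return catalog

def system_prompt(catalog: dict) -> str:
    lines = ["You have these skills. Load one only when the task matches its description:"]
    lines += [f"- {name}: {info['description']}" for name, info in catalog.items()]
    return "\n".join(lines)

def handle_turn(task: str, catalog: dict, call_model) -> str:
    # Stage 1: the model sees only the catalog and names a skill, or returns "".
    choice = call_model(system_prompt(catalog), task).strip()
    if choice in catalog:
        # Stage 2: pay the context cost for exactly one body.
        body = catalog[choice]["path"].read_text(encoding="utf-8")
        return call_model(system_prompt(catalog) + "\n\n" + body, task)
    return call_model(system_prompt(catalog), task)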
Sequencing
A skill’s body can instruct the agent to load another skill. refund-escalation might say “if the case requires currency conversion, follow currency-conversion.” The chain is explicit and auditable; it is not magic agent reasoning, it is a documented sequence.
Sub-agent dispatch
In a multi-agent system, the planner agent has one set of skills, each worker agent has a different set. Skills are the agent’s specialization layer; the same base model serves many roles, differentiated by which skills it loads.
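A minimal sketch of that specialization layer, with illustrative role and skill names: the shared catalog is the same, but each sub-agent only sees its slice.

# Sketch: one shared skill catalog, sliced differently per sub-agent.
# Role names and skill names are illustrative, not prescriptive.
ROLE_SKILLS = {
    "planner": ["ticket-triage", "routing-policy"],
    "support-worker": ["refund-escalation", "billing-dispute"],
    "ops-worker": ["legacy-reconciliation", "pii-redaction"],
}

def skills_for(role: str, catalog: dict) -> dict:
    """The subset of the shared catalog this sub-agent is allowed to load."""
    return {name: catalog[name] for name in ROLE_SKILLS.get(role, []) if name in catalog}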
Skills vs tools vs MCP vs prompts
A practical map.
- Tools. Callable functions: name, schema, return value. Mechanism. The model calls a tool to do one thing.
- Prompts. Reusable templates surfaced to the user as commands. User-driven.
- MCP servers. A wire protocol for federating tools, resources, and prompts across clients.
- Skills. Packaged folders of instructions plus resources. Knowledge-plus-mechanism.
The four compose. A skill body might tell the agent to call certain tools (which themselves come from MCP servers) and reference a prompt template the user has already invoked. None of the four replaces another; they sit at different layers.
Building a skill
The process is editorial more than engineering.
- Pick the workload. A specific task the agent does often and gets wrong without context.
- Write the description. One or two sentences that match the task. Include the trigger conditions: what user intent or state activates this skill.
- Write the body. Numbered steps. Decision trees. Reference data. Worked examples.
- Add scripts and resources. If the skill needs SQL queries, regex patterns, schema files, attach them as siblings.
- Test against the workload. Run the agent with the skill loaded against the eval set; compare with the agent without the skill loaded.
- Iterate. Tighten the description if the skill loads too often (false positives) or too rarely (false negatives). Tighten the body if the agent skips steps.
- Version. Commit to git. Publish if the skill is shareable.
Evaluating a skill
Treat the skill as a unit under test.
- Curated eval set. 50 to 200 production traces or synthetic tasks the skill is meant to solve.
- Baseline. Run the agent without the skill loaded; record per-rubric pass-rate.
- Skill-loaded. Run the agent with the skill loaded; record per-rubric pass-rate.
- Delta. The skill earns its keep when the delta is large enough to justify the context budget.
For the delta to be measurable, the eval rubric needs to be specific. “Helpful” is not enough; “follows the decision tree without skipping step 3” is. Tools like FutureAGI’s evaluation library, DeepEval, and Promptfoo support per-rubric scoring with LLM-as-judge or heuristic scorers.
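A sketch of the baseline-vs-loaded comparison under those assumptions: run_agent and judge stand in for your agent runtime and rubric scorer (LLM-as-judge or heuristic); both are hypothetical names rather than any specific library’s API.

# Per-rubric pass-rate with and without the skill, and the resulting delta.
from statistics import mean

def pass_rate(tasks, run_agent, judge, skill_loaded: bool) -> dict[str, float]:
    """Per-rubric pass-rate over the eval set, with or without the skill loaded."""
    rubric_scores: dict[str, list[int]] = {}
    for task in tasks:
        transcript = run_agent(task, skill_loaded=skill_loaded)
        # judge returns e.g. {"follows_decision_tree_step_3": True, ...}
        for rubric, passed in judge(task, transcript).items():
            rubric_scores.setdefault(rubric, []).append(int(passed))
    return {rubric: mean(scores) for rubric, scores in rubric_scores.items()}

def skill_delta(tasks, run_agent, judge) -> dict[str, float]:
    baseline = pass_rate(tasks, run_agent, judge, skill_loaded=False)
    loaded = pass_rate(tasks, run_agent, judge, skill_loaded=True)
    return {rubric: loaded[rubric] - baseline.get(rubric, 0.0) for rubric in loaded}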
How skills appear across the agent ecosystem
Native SKILL.md support today is concentrated in the Anthropic stack; other frameworks express the same mental model with their own primitives.
Native SKILL.md support (Anthropic stack):
- Claude Agent SDK. First-class skills with the canonical SKILL.md format.
- Claude.ai. Native UI for browsing and invoking skills (custom skills are uploaded per surface, not auto-synced from other clients).
- Claude Code. Project skills live under .claude/skills/<name>/SKILL.md; personal skills live under ~/.claude/skills/.
- Anthropic Skills repo. Reference skills published by Anthropic; many are Apache 2.0, some document-format skills are source-available.
Similar instruction-pattern conventions in other tools:
- OpenAI Agents SDK, Google ADK, CrewAI, AutoGen. No native SKILL.md loader, but the same mental model maps to custom skill loaders, sub-agents, or role/agent definitions; teams have built community skill loaders against several of these.
- Cline. Organizes its own per-skill instruction folders inside user or project directories; the mental model is similar to SKILL.md without being identical.
- Cursor. Uses .cursor/rules and AGENTS.md for project-level instruction conventions, not a root SKILL.md, but the underlying mental model (project-pinned guidance) is similar.
The format is a convention more than a tightly-versioned protocol; any framework can support it with a small custom loader.
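One way such a loader can work is sketched below: expose skill loading as an ordinary function tool, so any tool-calling framework can serve a body on demand. SKILLS_ROOT and load_skill are hypothetical names, and registration uses whatever function-tool mechanism your framework provides.

# Sketch: bolt SKILL.md support onto a framework with no native loader by
# exposing skill bodies as an ordinary tool. The catalog of names and
# descriptions still goes into the system prompt (as in the lazy-loading
# sketch above); this tool serves a body only when the agent asks for it.
from pathlib import Path

SKILLS_ROOT = Path(".claude/skills")  # or wherever your skill folders live

def load_skill(name: str) -> str:
    """Tool handler: return the full SKILL.md for a named skill, or an error string."""
    skill_md = SKILLS_ROOT / name / "SKILL.md"
    if not skill_md.is_file():
        return f"No skill named {name!r}."
    return skill_md.read_text(encoding="utf-8")

# Register load_skill with your framework's function-tool mechanism; the
# loader itself is the same few lines everywhere.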
Production patterns
Three that show up.
1. Per-team skill catalog
The customer-support team owns refund-escalation, billing-dispute, account-recovery, plan-change, and twenty more skills. The catalog is a folder in the support repo. Every support agent uses the same catalog. Updates roll out via PRs; an eval gate catches regressions.
2. Cross-team shared skills
A skill like pii-redaction or gdpr-export-format is owned by the platform team and used by every agent in the company. The skill lives in a shared repo; consumer agents pull it as a submodule or a vendored copy.
3. User-curated personal skills
In Claude.ai and Claude Desktop, custom skills are managed through the Customize / Skills surface (uploaded or installed through the UI), not by editing a personal config folder. Power users curate their own catalog: a research workflow, a writing style guide, an exam-prep template. Claude Code, by contrast, loads filesystem skills from ~/.claude/skills/ (personal) and .claude/skills/ (project).
Common mistakes
- Vague description. A skill that loads on every turn (because the description matches everything) defeats the lazy-loading point. Make the trigger specific.
- Body that duplicates the description. The body should be the runbook; the description is the trigger. Don’t restate one in the other.
- No examples. A skill without worked examples is harder for the agent to follow. Three examples are a strong default.
- Treating skills as code. Skills are markdown plus optional scripts. They are editorial artifacts; the body is the contract.
- No eval gate. A skill that does not measurably improve a rubric is paying context cost for nothing. Test before shipping.
- Skill bloat. A 5,000-token skill body is itself a large context cost when loaded. Split into multiple skills if the runbook is too long; reference siblings instead of inlining.
- Hard-coding paths. Skills are portable across clients. Don’t hard-code /Users/<name>/... paths; use relative or environment-variable paths.
- No version pinning. Skills change. Pin a version in the consuming agent’s config so a skill update does not silently change behavior.
How FutureAGI implements agent skill observability and evaluation
FutureAGI is the production-grade observability and evaluation platform for skill-driven agents built around the closed reliability loop that other skill stacks stitch together by hand. The full stack runs on one Apache 2.0 self-hostable plane:
- Skill loading and execution tracing: traceAI (Apache 2.0) instruments the agent runtime across 35+ frameworks in Python, TypeScript, Java, and C# to emit a span when a skill loads (skill name, version, body length); subsequent tool calls and LLM calls nest under the skill span automatically.
- Per-skill evals: 50+ first-party metrics (Tool Correctness, Plan Adherence, Task Completion, Faithfulness, Hallucination) attach as span attributes and support per-skill rubric tracking so you can see which skills move the needle on which workloads; BYOK lets any LLM serve as the judge at zero platform fee, and turing_flash runs the same rubrics at 50 to 70 ms p95.
- Simulation: persona-driven scenarios exercise skill loading on real workloads in pre-prod with the same scorer contract that judges production traces.
- Gateway and guardrails: the Agent Command Center fronts 100+ providers with BYOK routing for the model that consumes skills, and 18+ runtime guardrails enforce policy on the same plane.
Beyond the four axes, FutureAGI also ships six prompt-optimization algorithms that consume failing trajectories as training data so skill bodies and triggers feed back into versioned updates that the CI gate enforces. Pricing starts free with a 50 GB tracing tier; Boost is $250 per month, Scale is $750 per month with HIPAA, and Enterprise from $2,000 per month with SOC 2 Type II.
Most teams running skill-driven agents in production end up running three or four tools alongside the agent runtime: one for traces, one for evals, one for the gateway, one for guardrails. FutureAGI is the recommended pick because tracing, evals, simulation, gateway, and guardrails all live on one self-hostable runtime; the loop closes without stitching. Pair with the evaluating AI agent skills workflow for the offline test set.
Sources
- Anthropic skills documentation
- anthropics/skills on GitHub
- Equipping agents for the real world with agent skills (Anthropic engineering)
- Claude Agent SDK reference
- Future AGI evaluating AI agent skills
- traceAI on GitHub (Apache 2.0)
- DeepEval
- Promptfoo
Series cross-link
Related: Evaluating AI Agent Skills, What is the Claude Agent SDK?, What is an MCP Server?, What is LLM Tracing?
Frequently asked questions
What is an agent skill in plain terms?
What does a SKILL.md file look like?
How is a skill different from a tool?
How is a skill different from a system prompt?
How is a skill different from an MCP server?
What is the structure of a skill folder?
Where can I find existing skills?
How do I evaluate a skill?