Evaluating AI Agent Skills in 2026: A Skill-Tree Playbook
Skill-level eval for agents in 2026: discrete skills, per-skill rubrics, regression sets, and CI gates. Vendor-neutral code, no proprietary SDK.
Consider a hypothetical team that ships an agent failing 12 percent of refund conversations. Trace-level eval says “Refund Agent: 88 percent pass.” That’s the only number the team has. Days of debugging later, they discover the issue: the calculate_refund_amount skill is fine on small amounts but fails above $500 because the prompt says “round to the nearest dollar” and the model rounds to the nearest ten on long contexts. A skill-level eval would have surfaced the regression directly: calculate_refund_amount failing its >$500 bucket while every other skill stayed flat. The fix is the same; the time-to-diagnosis collapses from days to the time it takes to read a per-skill table.
This is why skill-level evaluation matters in 2026. Trace-level scores are too coarse. Per-skill scores tell you what to fix. This guide is a vendor-neutral playbook with code that runs end-to-end without any proprietary SDK.
TL;DR: Skill-level eval in one paragraph
Build a skill tree (5-15 leaf skills, composed skills above, high-level skills at the apex). Give each skill a rubric, a regression set, and a threshold. Run on every PR; gate merges on per-skill pass-rate regression. The unit of debugging shifts from “the agent regressed” to “this specific skill regressed.” Combine with trace-level eval as a safety net for emergent failures.

Why skill-level eval in 2026
Three pressures pushed skill-level eval from research to production by 2026.
Trace-level eval is too coarse. A refund agent at 88 percent pass-rate hides which sub-component regressed. The aggregate is a lagging indicator; the per-skill score is a debugging map.
Skills are shared across workflows. parse_user_intent often runs in the refund, billing, FAQ, and escalation workflows. Fixing it once fixes all four. Trace-level eval re-measures the same skill in every workflow; skill-level eval measures it once and reports per-workflow.
Agents evolved beyond single-call evaluation. Multi-step production agents in 2026 are typically directed graphs with several LLM and tool calls per request. Final-answer eval misses where the graph broke. Skill-level eval points at the broken edge.
Building the skill tree
Start from real traces. Pull 50-100 production traces. List every distinct LLM call, tool call, and decision point. Group calls with the same input shape and the same rubric into a skill.
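As a sketch of that first pass, assuming traces exported as JSON files with a flat spans list carrying a name field (an OTel-style layout; the field names are assumptions, not a fixed schema):

```python
import json
from collections import Counter

def skill_candidates(trace_paths: list[str]) -> Counter:
    # Count distinct span names across exported traces; frequent names
    # that share an input shape and a rubric are leaf-skill candidates.
    counts: Counter = Counter()
    for path in trace_paths:
        with open(path) as f:
            trace = json.load(f)
        for span in trace["spans"]:
            counts[span["name"]] += 1
    return counts
```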
As a starting point, try 5-15 leaf skills; split or merge based on rubric reuse, failure-diagnosis value, and dataset size. Examples:
| Leaf skill | Rubric | Example agent |
|---|---|---|
| parse_user_intent | classification F1 | refund, support |
| lookup_order_by_id | tool-call arg correctness | refund, billing |
| calculate_refund_amount | numerical correctness | refund |
| validate_refund_policy | policy adherence | refund |
| write_resolution_message | writing quality + tone | refund, support |
| call_escalation_tool | tool-call timing | refund, support |
| extract_entity_from_doc | extraction F1 | research |
| summarize_long_context | summary quality | research, support |
| route_to_subagent | routing accuracy | supervisor |
| validate_tool_output_schema | schema adherence | every agent |
Composed skills sit above leaf skills:
- multi_step_retrieval (combines: query_rewrite + lookup + rerank)
- refund_calculation (combines: lookup_order + validate_policy + calculate_amount)
- escalation_handoff (combines: detect_escalation_signal + summarize_context + call_escalation_tool)
High-level skills sit above composed:
- resolve_refund_case
- handle_escalation
- complete_onboarding
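One minimal way to keep the tree in version control is a plain mapping, using the illustrative skill names above; this is a sketch, not a required format:

```python
# Skill tree as plain data, checked in next to the eval code. High-level
# skills map to composed skills; composed skills map to their leaf skills.
SKILL_TREE = {
    "resolve_refund_case": {
        "refund_calculation": ["lookup_order_by_id", "validate_refund_policy", "calculate_refund_amount"],
        "write_resolution_message": [],
    },
    "handle_escalation": {
        "escalation_handoff": ["detect_escalation_signal", "summarize_context", "call_escalation_tool"],
    },
}

def leaf_skills(tree: dict) -> set[str]:
    # Collect every leaf: a composed entry with no children is itself a leaf.
    leaves: set[str] = set()
    for children in tree.values():
        for composed, kids in children.items():
            if kids:
                leaves.update(kids)
            else:
                leaves.add(composed)
    return leaves
```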
Refactor the tree as the agent evolves. Deprecate skills, split skills, merge skills.
Per-skill rubrics
The rubric is skill-specific. A generic “agent quality” rubric is too broad for any single skill. The Python sketches below are illustrative and assume you have wired up your own dataset loader, judge runner, and span helpers.
```python
RUBRICS = {
    "parse_user_intent": {
        "type": "classification",
        "metric": "weighted_f1",
        "threshold": 0.78,
        "judge_prompt": "Classify the predicted intent against the gold intent. Return JSON {match: bool, reasoning: str}.",
    },
    "lookup_order_by_id": {
        "type": "tool_call",
        "metric": "argument_correctness",
        "threshold": 0.95,
        "judge_prompt": "Did the tool call arguments match what the user asked for? JSON {correct: bool, missing: [str]}.",
    },
    "calculate_refund_amount": {
        "type": "numerical",
        "metric": "abs_error_under_1_dollar",
        "threshold": 0.98,
    },
    "write_resolution_message": {
        "type": "writing_quality",
        "metric": "weighted_avg",
        "rubrics": ["tone", "completeness", "policy_compliance"],
        "threshold": 0.80,
    },
}
```
Each threshold is calibrated against the incumbent version's score on the per-skill regression set. A drop below threshold blocks the merge.
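A minimal sketch of how a rubric runner might dispatch on rubric type. Per-row exact match stands in for the aggregate weighted F1 here, and run_llm_judge is a hypothetical judge caller, not part of any SDK:

```python
def run_rubric(skill_name: str, predicted, gold) -> float:
    # Dispatch on the rubric type declared in RUBRICS; return a 0-1 score.
    rubric = RUBRICS[skill_name]
    if rubric["type"] == "classification":
        return 1.0 if predicted == gold else 0.0
    if rubric["type"] == "numerical":
        # abs_error_under_1_dollar: pass when the prediction is within $1.
        return 1.0 if abs(float(predicted) - float(gold)) < 1.0 else 0.0
    # tool_call and writing_quality rubrics fall through to an LLM judge.
    return run_llm_judge(rubric["judge_prompt"], predicted, gold)  # hypothetical helper
```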
Skill extraction from traces
Skills get extracted from traces by span name. Auto-instrumentation libraries like traceAI and OpenInference emit spans for raw LLM and tool calls; the application adds a skill.name attribute or wraps the call in a named span to map traces back to the skill tree. The spans below show the manual wrapping pattern.
```python
import os

from openai import OpenAI
from opentelemetry import trace
from pydantic import BaseModel

tracer = trace.get_tracer("agent.skills")
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


class IntentResult(BaseModel):
    # Illustrative output schema; shape the fields to your intent taxonomy.
    intent: str
    confidence: float


def parse_user_intent(query: str) -> dict:
    # Manual wrapping: a named span plus a skill.name attribute lets the
    # eval harness map this call back to the skill tree.
    with tracer.start_as_current_span("skill.parse_user_intent") as span:
        span.set_attribute("skill.name", "parse_user_intent")
        resp = client.responses.parse(
            model="gpt-5-nano",
            input=[{"role": "user", "content": f"Classify intent: {query}"}],
            text_format=IntentResult,
        )
        result = resp.output_parsed
        span.set_attribute("skill.input", query)
        span.set_attribute("skill.output", result.model_dump_json())
        return result.model_dump()
```
A thin trace_skill decorator can factor this wrapping out: it opens the span and tags it with skill.name so every decorated skill function is attributed automatically. The eval harness queries spans by skill.name and runs the matching rubric.
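A minimal sketch of such a decorator, assuming the tracer from the block above; the serialization choices are illustrative:

```python
import functools
import json

def trace_skill(skill_name: str):
    # Thin wrapper: opens a span named after the skill, tags skill.name,
    # and records the call's input and output as span attributes.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(f"skill.{skill_name}") as span:
                span.set_attribute("skill.name", skill_name)
                span.set_attribute("skill.input", json.dumps([args, kwargs], default=str))
                result = fn(*args, **kwargs)
                span.set_attribute("skill.output", json.dumps(result, default=str))
                return result
        return wrapper
    return decorator
```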
Per-skill regression set
Each skill gets its own dataset. 200-500 rows is the floor. Stratify across difficulty:
- easy. Clean inputs, single intent.
- medium. Ambiguous inputs, missing fields.
- hard. Adversarial inputs, edge cases (long context, special characters, multilingual).
Sources for the regression set:
- Hand-labeled production samples. Pull production rows where this skill ran; label each with the gold output and a difficulty tier.
- Synthetic generation. Frontier model generates rows conditioned on each difficulty tier.
- Negative-feedback expansion. Production rows where the skill produced a result that triggered downstream failure.
Update the regression set every two weeks. Stale datasets hide drift.
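A sketch of how a per-skill regression file might look on disk, assuming one JSONL row per example with input, gold, and difficulty fields (illustrative, not a required schema):

```python
import json

# Illustrative rows for data/skills/parse_user_intent.jsonl, one per tier.
rows = [
    {"input": "I want my money back for order 812", "gold": "refund", "difficulty": "easy"},
    {"input": "charged twice?? also where is my package", "gold": "billing", "difficulty": "medium"},
    {"input": "remboursez-moi ou j'appelle mon avocat", "gold": "refund", "difficulty": "hard"},
]

with open("data/skills/parse_user_intent.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

The CI test in the next section reads exactly this layout.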
CI integration
```python
# tests/test_skills.py
import json

import pytest

# Project-local registry: SKILLS maps skill name -> callable, RUBRICS holds
# the per-skill rubric config, run_rubric scores one row (see rubric section).
from skills_lib import RUBRICS, SKILLS, run_rubric


def load_dataset(path: str) -> list[dict]:
    # One JSON object per line: {"input": ..., "gold": ..., "difficulty": ...}
    with open(path) as f:
        return [json.loads(line) for line in f]


@pytest.mark.parametrize("skill_name", list(SKILLS.keys()))
def test_skill_pass_rate(skill_name):
    dataset = load_dataset(f"data/skills/{skill_name}.jsonl")
    threshold = RUBRICS[skill_name]["threshold"]
    pass_count = 0
    for row in dataset:
        predicted = SKILLS[skill_name](row["input"])
        score = run_rubric(skill_name, predicted, row["gold"])
        if score >= 0.5:  # per-row pass cutoff on the 0-1 rubric score
            pass_count += 1
    pass_rate = pass_count / len(dataset)
    assert pass_rate >= threshold, \
        f"{skill_name} regression: {pass_rate:.3f} < {threshold}"
```
Run on every PR. The reviewer sees the per-skill table in the PR comment. Drops below threshold block.
```text
pytest tests/test_skills.py -v --tb=short
# parse_user_intent .... pass_rate=0.84 threshold=0.78 PASS
# lookup_order_by_id ... pass_rate=0.91 threshold=0.95 FAIL
# calculate_refund ..... pass_rate=0.99 threshold=0.98 PASS
```
Production deployment
Three operational details matter beyond CI.
Per-skill drift alerts. Track a rolling-mean pass rate per skill per route, and page on 3-5 percentage-point moves. Skill-level drift catches the model-version-update problem before user complaints do.
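A minimal sketch of such a monitor; the window size, alert delta, and page_oncall pager hook are all hypothetical:

```python
from collections import deque

WINDOW = 500        # results per (skill, route) window; illustrative
ALERT_DROP = 0.03   # page on a 3-point drop vs. baseline; illustrative

windows: dict[tuple[str, str], deque] = {}
baselines: dict[tuple[str, str], float] = {}

def record_result(skill: str, route: str, passed: bool) -> None:
    key = (skill, route)
    window = windows.setdefault(key, deque(maxlen=WINDOW))
    window.append(1.0 if passed else 0.0)
    if len(window) < WINDOW:
        return  # wait for a full window before comparing
    rate = sum(window) / len(window)
    baseline = baselines.setdefault(key, rate)  # first full window = baseline
    if baseline - rate >= ALERT_DROP:
        page_oncall(f"{skill}@{route}: pass rate {rate:.2f} vs baseline {baseline:.2f}")  # hypothetical pager
```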
Per-skill A/B with eval-gated rollback. When shipping a new prompt for a skill, route 5 percent of traffic to the new prompt. Monitor that skill’s pass rate. Roll back if the metric regresses below threshold.
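A sketch of the gating logic, with an illustrative traffic fraction and sample floor:

```python
import random

CANDIDATE_TRAFFIC = 0.05  # fraction of requests routed to the new prompt
MIN_SAMPLES = 200         # don't judge the candidate on too few rows

def choose_variant() -> str:
    # Route ~5 percent of this skill's traffic to the candidate prompt.
    return "candidate" if random.random() < CANDIDATE_TRAFFIC else "incumbent"

def rollback_needed(pass_count: int, total: int, threshold: float) -> bool:
    # Roll back once the candidate's online pass rate drops below the
    # same per-skill threshold that gates CI merges.
    if total < MIN_SAMPLES:
        return False
    return pass_count / total < threshold
```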
Annotation queue per skill. Misclassifications and judge-disagreement rows for a skill flow into a review queue. Reviewers correct labels; corrected rows feed the per-skill regression set.
Common mistakes when evaluating agent skills
- Trace-level only. Aggregate scores are lagging indicators. Per-skill scores are debugging maps.
- Same rubric for every skill. parse_user_intent and write_resolution_message need different rubrics.
- No regression set per skill. A skill without a dataset is unverifiable.
- Treating tool calls as opaque. A tool call has its own rubric (argument correctness, timing).
- Skill tree drift. The tree must update as the agent evolves; otherwise you are scoring skills that no longer match the code.
- Aggregating with simple averages. A simple average can hide a critical-skill regression behind a benign-skill improvement; weight skills by criticality.
- Final-answer-only. Final-answer eval gives you a lagging indicator, not a map.
- Hidden skills. A skill that exists in code but not in the eval tree is the one that ships broken.
What changed in 2026 for skill-level eval
| Date | Event | Why it matters |
|---|---|---|
| Mar 2026 | FutureAGI shipped Agent Command Center and ClickHouse trace storage | Per-skill aggregations across millions of spans became cheap. |
| 2026 | OTel GenAI semconv (Development status) | OTel GenAI semantic conventions kept maturing; teams continue to add a custom skill.name attribute for skill-level grouping. |
| Dec 2025 | DeepEval v3.9.x agentic metrics | Agentic metrics (Tool Correctness, Argument Correctness, Step Efficiency, Plan Adherence, Plan Quality) make per-skill checks easier when paired with explicit skill grouping. |
| 2026 | Galileo Luna 2 SLM judges (Enterprise) | Lower-latency SLM judges reduced reliance on frontier judges for some online evaluation workloads. |
| Mar 19, 2026 | LangSmith Fleet (rename of Agent Builder) | Agent deployment workflows expanded; eval and deployment surfaces stayed separate but adjacent. |
Sources
- DeepEval agent metrics
- DeepEval GitHub
- OpenAI structured outputs
- Anthropic tool use
- traceAI GitHub repo
- OpenInference GitHub repo
- OpenTelemetry GenAI semantic conventions
- FutureAGI pricing
- Galileo research
- Phoenix docs
- LangChain Fleet
Series cross-link
Read next: Agent Evaluation Frameworks 2026, Best AI Agent Observability Tools 2026, Multi-Turn LLM Evaluation 2026
Frequently asked questions
What is skill-level evaluation for AI agents?
Why does skill-level eval matter in 2026?
How do I build a skill tree for my agent?
What's the right rubric for a skill?
Can I run skill-level eval without a proprietary SDK?
How do I integrate skill-level eval into CI?
What is the difference between skill eval and step eval?
What are common mistakes in skill-level eval?