Guides

Evaluating CrewAI Agents: Role Adherence Is the Unit (2026)

Evaluating CrewAI agents in 2026: role adherence as the primary metric, plus task delegation, crew coherence, and manager-worker fidelity.

·
Updated
·
12 min read
crewai agent-evaluation multi-agent role-adherence traceai agent-observability 2026
Editorial cover image for Evaluating CrewAI Agents: A 2026 Deep Tutorial
Table of Contents

A three-agent CrewAI crew scores 0.92 on TaskCompletion in CI. The Senior Research Analyst produces a clean source list. The Technical Writer ships a 1500-word explainer. The Editor signs off. A week later, the same crew scores 0.90, but the writer is now running fresh web searches mid-draft, the researcher is paraphrasing the writer’s structure back into the brief, and cost per run has doubled because the editor started pulling new citations. Nothing in the final artifact rubric moved. The role contract collapsed.

CrewAI is role-first, so its evaluation has to be role-first too. The thesis of this post is simple: CrewAI’s design makes eval easier in one way — every agent ships with a declared role, goal, and backstory you can grade against — and harder in another, because role adherence becomes a first-class failure mode that other frameworks do not surface. This is the working pattern for evaluating CrewAI agents in 2026: role adherence as the primary metric, plus task delegation correctness, crew-level coherence, and manager-worker fidelity, all riding on the per-agent spans the traceAI CrewAIInstrumentor emits out of the box.

Why CrewAI eval differs from AutoGen and LangGraph eval

Each major multi-agent framework has a different load-bearing primitive, and the eval surface follows it.

AutoGen is message-first: group chats route turns through a team channel and the failure modes live in the handoff between sender and receiver. LangGraph is state-first: the graph mutates a shared state object across nodes and failures live in node correctness, edge selection, and state-diff integrity. CrewAI is role-first. Each agent is declared with an explicit role, goal, and backstory. The runtime executes a declared task list. Roles are not metadata; they are the contract the runtime enforces.

This shifts what you grade. AutoGen eval centres on handoff fidelity. LangGraph eval centres on graph topology. CrewAI eval centres on the role contract: did each agent stay inside its declared role, did the manager pick the right worker for each task, did the final artifact reflect every agent’s contribution, did the manager hand the worker enough context to act without inventing constraints. The unit of evaluation is the agent’s adherence to its role on each task it executes, not the team-level transcript. The cross-framework spine lives in the agent eval guide; the rest of this post stays inside CrewAI.

Role adherence as the primary metric

A CrewAI agent is a declarative object. You ship role, goal, backstory, tools, allow_delegation, and optional max_iter, max_rpm. The runtime reads these as the agent’s contract. Role drift is what happens when the agent answers outside that contract. The Senior Research Analyst starts drafting the explainer. The Technical Writer starts running new searches. The Editor pulls fresh citations the researcher never produced.

Drift breaks two things at once. The per-agent rubric you wrote for that role stops being a meaningful test, because the agent is no longer doing the job the rubric measures. The team’s division of labour collapses, which is why you ran a crew instead of one planner-with-tools to begin with. Drift is also the failure mode most sensitive to model refreshes: a refreshed gpt-4o or claude-sonnet-4-5 checkpoint trained to be more helpful drifts role behaviour first and final-answer quality second. Per-agent RoleAdherence is the earliest indicator the refresh landed.

Score it per agent, per task, against the agent’s own contract. The rubric reads role, goal, backstory, and the agent’s output for that task, and asks one question: did the agent stay inside the contract. Track the trend line per checkpoint. A 5-point drop overnight is the signal you would otherwise pick up from a cost graph two weeks later.

Task delegation correctness on hierarchical crews

Hierarchical crews introduce a second failure surface. The manager owns delegation: it reads each task, picks a worker, and routes the task description plus context. Delegation defects are the hardest crew defect class because the final output can still look fine when the wrong agent did the right task. The researcher writes an OK explainer, the writer produces an OK source list, the cost graph spikes, and the role contract is silently broken across every run.

Two failure modes carry most production delegation defects. Wrong worker. The manager assigns the writer’s job to the researcher, the editor’s job to the writer. Aggregate TaskCompletion still passes because the work gets done, just by the wrong agent at the wrong cost. Over-delegation. The manager delegates trivial work (formatting, light edits, single-line clarifications) to the expensive research agent instead of handling it inline, and the per-task cost spike hides inside the crew aggregate.

DelegationCorrectness is a CustomLLMJudge that reads the task description, the candidate agents, and the chosen agent, and scores whether the chosen agent matches the task shape. A drop here is the leading indicator of latency tails and cost overruns even when the crew-coherence score still looks fine.

Crew-level coherence and manager-worker fidelity

The final crew output is the artifact users see, and it has its own failure mode beyond role drift and bad delegation. Coherence collapse is what happens when the artifact reflects one agent’s solo work with the other agents’ names attached. The researcher’s brief is paraphrased back as the final article. The writer’s draft is shipped as the source list. The editor’s notes are absorbed into the body without changing the underlying claims. No per-agent rubric flags this because every individual turn is well-formed; the team-level synthesis broke between them.

CrewCoherence is a CustomLLMJudge that reads each agent’s contribution and the final output, and scores whether the artifact reflects every agent’s work. Penalise outputs that collapse into one agent’s voice, ignore an upstream correction, or contradict an earlier correct intermediate.

ManagerWorkerFidelity is the rubric pair for hierarchical crews. It reads the manager’s task description and the worker’s response, and scores whether the manager handed enough context for the worker to act without inventing constraints. Most fidelity defects are under-spec: the manager says “research this topic” and the worker invents a date range, a scope cap, or an exclusion list the manager never set. The defect surfaces downstream as constraint bleed that no single agent ever owned. Score the manager’s task descriptions as a first-class artifact, not as plumbing.

The four role-anchored axes in one table

AxisWhat you measureEvaluator
Role adherenceDid each agent stay in the declared roleCustomLLMJudge as RoleAdherence, per agent
Task delegation correctnessDid the manager pick the right workerCustomLLMJudge as DelegationCorrectness, per delegation
Crew-level coherenceDoes the final output reflect every agentCustomLLMJudge as CrewCoherence, per kickoff
Manager-worker fidelityDid the manager hand enough context to the workerCustomLLMJudge as ManagerWorkerFidelity, per task

Layer these on top of the SDK’s existing templates: TaskCompletion per task, EvaluateFunctionCalling (aliased LLMFunctionCalling) per tool, AnswerRefusal per agent output, Groundedness and ContextAdherence on tasks that consume retrieved context. The four role-anchored rubrics give you the CrewAI-specific signal. The SDK templates give you the baseline a single-agent run would also need.

traceAI CrewAIInstrumentor: what the spans actually carry

A role-adherence rubric needs an agent span tagged with the role contract. The CrewAIInstrumentor emits that span tree without code changes inside your crew definitions, by patching three CrewAI internals at import time.

pip install crewai
pip install fi-instrumentation-otel traceAI-crewai
pip install ai-evaluation
import os
os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_crewai import CrewAIInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="content-research-crew",
)
CrewAIInstrumentor().instrument(tracer_provider=trace_provider)

After this call, every crew kickoff in the process emits the span tree the four rubrics filter on. The instrumentor (verified at traceAI/python/frameworks/crewai/traceai_crewai/_wrappers.py) wraps three internals.

Crew.kickoff becomes a CHAIN span carrying crew_id, crew_key, crew_inputs, the full crew_agents JSON list (role, goal, backstory, allow_delegation, tools_names, max_iter, max_rpm per agent), the full crew_tasks JSON list (id, description, expected_output, async_execution, human_input, agent_role, context, tools_names per task), and the gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.total_tokens totals from CrewAI’s own usage_metrics. The crew topology lands on the root span as structured JSON, not flat strings.

Task._execute_core becomes an AGENT span per task with crew_id, task_id, task_key, and, when crew.share_crew is set, formatted_description and formatted_expected_output. The agent’s role, goal, and backstory ride on the span’s input value. Per-agent RoleAdherence is a filter on fi.span.kind=AGENT joined against the parent CHAIN’s crew_agents list.

ToolUsage._use becomes a TOOL span per tool call with gen_ai.tool.name, the configured function_calling_llm, and the full tool input. Retries surface as sibling TOOL spans on the same parent agent, so retry-count eval is COUNT(*) GROUP BY tool.name, parent_agent.

The instrumentor does not emit a dedicated HANDOFF or DELEGATION span today. Delegation in CrewAI rides on the manager’s AGENT span and the worker’s subsequent AGENT span, joined by the manager-worker assignment in the parent’s crew_tasks list. Score delegation by lifting crew_tasks[i].agent_role from the CHAIN span and matching it against the agent that actually ran the task. One query, not a parser. Honest limitation worth naming.

The four rubrics in code

All four ride the CustomLLMJudge interface from ai-evaluation. One pattern, four configs.

from fi.evals import Evaluator
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
provider = LiteLLMProvider()

role_adherence = CustomLLMJudge(
    provider=provider,
    config={
        "name": "RoleAdherence",
        "grading_criteria": (
            "Read the agent's role, goal, and backstory, then read the agent's "
            "output for the task. Score whether the output stayed inside the "
            "contract. Penalise role bleed: a researcher that drafts the explainer, "
            "a writer that runs searches, an editor that pulls new citations."
        ),
    },
)

delegation_correctness = CustomLLMJudge(
    provider=provider,
    config={
        "name": "DelegationCorrectness",
        "grading_criteria": (
            "Read the task description, the candidate agents, and the chosen "
            "agent. Score whether the chosen agent matches the task shape. "
            "Penalise wrong-worker and over-delegation of trivial work."
        ),
    },
)

crew_coherence = CustomLLMJudge(
    provider=provider,
    config={
        "name": "CrewCoherence",
        "grading_criteria": (
            "Read each agent's contribution and the final crew output. Score "
            "whether the artifact reflects every agent's work. Penalise outputs "
            "that collapse into one agent's voice or ignore an upstream correction."
        ),
    },
)

manager_worker_fidelity = CustomLLMJudge(
    provider=provider,
    config={
        "name": "ManagerWorkerFidelity",
        "grading_criteria": (
            "Read the manager's task description and the worker's response. "
            "Score whether the manager handed enough context for the worker to "
            "act without inventing scope, date ranges, or constraints."
        ),
    },
)

Wire per-axis CI thresholds. A reasonable starting set is RoleAdherence >= 0.90 per agent, DelegationCorrectness >= 0.85 per delegation, CrewCoherence >= 0.80 per kickoff, ManagerWorkerFidelity >= 0.85 per task. The gate fails on the failing axis, not a single aggregate. One bisect instead of three days. The LLM evaluation playbook covers threshold cadence; the agent passes evals fails production post covers the axis-blindness pattern that hides role drift behind a passing TaskCompletion.

Production observability and Error Feed clustering

CI is a snapshot; production is a river. Score the live trace stream with the same four rubrics and you get the regression signal the offline set cannot have, because the offline set was frozen before the model refresh that broke role behaviour landed.

Error Feed is the loop closer inside the eval stack. Failing crew runs flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering groups them into named issues. The cluster shapes that recur on CrewAI crews are role-shaped, which is the point of grading at the role contract.

  • Role-bleed clusters. The writer starts citing fresh sources after a refreshed gpt-4o checkpoint lands. RoleAdherence on the writer drops 8 points overnight. The immediate_fix is a stricter role contract plus a one-shot example of the role boundary.
  • Delegation-asymmetry clusters. The manager over-delegates formatting to the expensive research agent. Cost per run climbs 30 percent while CrewCoherence stays flat. The immediate_fix is a manager prompt edit that handles trivial tasks inline.
  • Context-handoff clusters. The researcher does not pass enough context to the writer and Groundedness drops below 0.7. The immediate_fix is a task context chain edit plus a ManagerWorkerFidelity tighten on the research-to-writer handoff.
  • Termination-drift clusters. A sequential crew skips the editor task when the writer output looks long enough. The CHAIN span shows two AGENT spans where the config promised three. The immediate_fix is an explicit acceptance criterion that forces the editor to fire.

Per cluster, a Claude Sonnet 4.5 JudgeAgent runs a 30-turn investigation across eight span-tools (read_span, get_children, get_spans_by_type, search_spans, plus a Haiku Chauffeur for spans over 3000 characters; prompt-cache hit ratio around 90 percent). The Judge writes three artifacts: a 5-category 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, each 1 to 5), and an immediate_fix naming the role contract tighten, manager prompt edit, or rubric calibration that ships today. Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.

The cluster’s representative crew runs become regression cases. agent-opt then tunes each agent’s role contract and the manager’s delegation prompt as separate study targets. BayesianSearchOptimizer or GEPAOptimizer with EarlyStoppingConfig works well, and the per-agent separation keeps a winning writer tweak from being masked by a flat researcher. The automated optimization for agents post covers the optimizer mix; the trace and debug multi-agent systems guide covers cross-framework topology if you also run AutoGen or LangGraph.

Per-agent cost telemetry via the gateway

Per-agent cost is the axis most teams ship without. Route CrewAI’s underlying LLM calls through https://gateway.futureagi.com/v1 and the per-response headers do the work. CrewAI’s litellm-backed LLM picks up OPENAI_BASE_URL for OpenAI-compatible providers. Each response carries x-prism-cost, x-prism-latency-ms, x-prism-provider, and the resolved-model header. traceAI attaches them to per-agent LLM spans, so per-agent cost roll-up is SUM(x-prism-cost) GROUP BY parent_agent_role and per-task budget is SUM(x-prism-cost) GROUP BY task_id. The agent that spikes 4x on cost in a regression run is one chart, not a forensic spelunk. See AI agent cost optimization for the deeper story.

Common CrewAI eval anti-patterns

Four mistakes that hide each failure mode above.

Scoring only the final crew artifact. A passing final output tells you the crew produced something. It does not tell you which agent did the work, whether agents stayed in role, or what each agent cost. Per-agent scoring is how you debug a crew when scores regress.

One rubric for all agents. A Senior Research Analyst and a Technical Writer need different rubrics. RoleAdherence judged with one team-wide rubric either over-fits to one role or under-specs both. The CustomLLMJudge pattern makes per-role rubrics cheap; use them.

Ignoring delegation on hierarchical crews. Manager-induced regressions (wrong worker, over-delegation, under-spec’d task descriptions) look like worker regressions in aggregate. Without DelegationCorrectness and ManagerWorkerFidelity, you will spend weeks tuning the worker’s prompt when the actual defect is the manager.

Treating model refreshes as silent. Refreshed checkpoints of gpt-4o, claude-sonnet-4-5, and the open-weight roster drift role behaviour first and final-answer quality second. Pin model versions in your crew config, run the regression set on every refresh, and track per-agent RoleAdherence trend lines per checkpoint.

How Future AGI ships the full CrewAI eval stack

Three surfaces, one loop, no separate products to glue together.

ai-evaluation SDK (Apache 2.0) ships the Evaluator, 60-plus EvalTemplate classes (TaskCompletion, EvaluateFunctionCalling aliased LLMFunctionCalling, AnswerRefusal, ConversationCoherence, Groundedness, ContextAdherence, ChunkAttribution, 11 CustomerAgent* templates), the CustomLLMJudge that carries the four role-anchored rubrics, and four distributed runners (Celery, Ray, Temporal, Kubernetes).

traceAI (Apache 2.0) ships the CrewAIInstrumentor wrapping Crew.kickoff (CHAIN), Task._execute_core (AGENT), and ToolUsage._use (TOOL), 50-plus other AI surface instrumentors across Python, TypeScript, Java, and C#, plus the EvalTag mechanism that attaches a rubric to a span kind so evals run server-side without polling.

Future AGI Platform ships the self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the platform with the HDBSCAN clustering, the Sonnet 4.5 JudgeAgent, the 30-subtype taxonomy, the 4-D trace score, and the immediate_fix artifact.

agent-opt closes the loop with six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) consuming the role-anchored rubrics as the optimization objective. Each agent’s role contract and the manager’s delegation prompt are separate study targets. Direct trace-stream-to-agent-opt is roadmap; eval-driven optimization ships today.

Honest tradeoff: if your stack is a single Agent and a tool registry, a lighter framework-specific tracer plus a hand-rolled TaskCompletion rubric is enough. The eval stack above earns its weight when you run real crews — three-plus agents, declared roles, hierarchical delegation, production traffic — and the role contract is the unit that decides whether the crew ships.

What to do this week

One crew config, end to end. Five steps.

  1. Wire CrewAIInstrumentor().instrument(tracer_provider=trace_provider) into your project. Verify per-task AGENT spans, per-tool TOOL spans, and the Crew.kickoff CHAIN parent with crew_agents and crew_tasks JSON attributes.
  2. Build a 50-200 scenario regression set per crew config. Tag each scenario with expected role coverage, expected delegation pattern on hierarchical crews, and expected tool sequence per agent.
  3. Define RoleAdherence, DelegationCorrectness, CrewCoherence, and ManagerWorkerFidelity as CustomLLMJudge rubrics. Run alongside TaskCompletion, LLMFunctionCalling, and AnswerRefusal.
  4. Wire per-axis CI thresholds. Start at RoleAdherence >= 0.90, DelegationCorrectness >= 0.85, CrewCoherence >= 0.80, ManagerWorkerFidelity >= 0.85.
  5. Turn on Error Feed. Watch the first week’s clusters. Promote representative crew runs into the regression set. Run a BayesianSearchOptimizer study on the agent whose RoleAdherence trend line ranks worst.

The teams shipping reliable CrewAI crews in 2026 stopped grading the final artifact and started grading the role contract. The framework gives you the declarative role surface; the eval stack gives you the signal that keeps each agent honest to it, one task at a time.

Frequently asked questions

Why is evaluating CrewAI agents different from evaluating AutoGen or LangGraph agents?
CrewAI is role-first. Each agent ships with an explicit role, goal, and backstory, and the runtime executes a declared task list. AutoGen is message-first; LangGraph is state-first. The CrewAI shape moves the primary failure mode from handoff misinterpretation (AutoGen) or state mutation (LangGraph) to role adherence: the researcher starts writing, the writer starts running searches, the editor starts citing fresh sources. A clean final artifact hides every role drift that did not happen to break the output string. Evaluating a CrewAI crew honestly means scoring role adherence as a first-class metric, then task delegation correctness on hierarchical crews, then crew-level coherence, then per-agent tool use and cost. The role contract is the rubric anchor.
What is role drift in a CrewAI crew and how do you catch it?
Role drift is what happens when an agent answers outside the role, goal, and backstory declared on its Agent definition. The Senior Research Analyst starts drafting the explainer. The Technical Writer starts running new web searches. The Editor starts pulling extra citations. None of these turns will fail a generic TaskCompletion rubric in isolation, because each one is well-formed prose with a defensible structure. They fail the team-design contract that made you run a crew instead of a single planner-with-tools in the first place. Catch role drift with a per-agent RoleAdherence rubric configured as a CustomLLMJudge against the agent's own role plus goal plus backstory, scored on every AGENT span the traceAI CrewAIInstrumentor emits via the Task._execute_core wrapper. Track per-agent RoleAdherence trend lines across model refreshes; refreshed checkpoints drift role behaviour faster than they drift output quality.
What does traceAI's CrewAIInstrumentor capture that generic OpenTelemetry tracers miss?
traceAI's CrewAIInstrumentor (verified at traceAI/python/frameworks/crewai/traceai_crewai/_wrappers.py) wraps three CrewAI internals at import time. Crew.kickoff becomes a CHAIN span with crew_id, crew_key, crew_inputs, the full crew_agents JSON list (role, goal, backstory, allow_delegation, tools_names, max_iter, max_rpm per agent), the full crew_tasks JSON list (id, description, expected_output, async_execution, human_input, agent_role, context, tools_names per task), and gen_ai.usage.* token totals. Task._execute_core becomes an AGENT span per task with crew_id, task_id, task_key, formatted_description, and formatted_expected_output. ToolUsage._use becomes a TOOL span per tool call with gen_ai.tool.name, function_calling_llm, and full input value. A generic OTel tracer collapses a crew run into one chain span and loses the per-task agent attribution that role adherence, delegation correctness, and crew coherence rubrics filter on. The dedicated instrumentor preserves the crew topology as first-class attributes.
Which Future AGI evaluators should I attach to a CrewAI crew?
Four rubrics anchored on the role contract, layered on the SDK's template suite. RoleAdherence, configured per agent as a CustomLLMJudge against that agent's role plus goal plus backstory, scored on every Task._execute_core AGENT span. DelegationCorrectness, configured as a CustomLLMJudge against the task description plus the candidate agents plus the chosen agent, scored on every delegation event in hierarchical crews. CrewCoherence, scored once per Crew.kickoff CHAIN span, asking whether the final crew output reflects every agent's contribution rather than collapsing into one agent's solo work. ManagerWorkerFidelity on hierarchical crews, scoring whether the manager's task description carried enough context for the chosen worker to act without inventing new constraints. Run these alongside the SDK's TaskCompletion on per-task spans, LLMFunctionCalling on per-tool spans, and AnswerRefusal on per-agent outputs. Wire per-axis thresholds into the CI gate so a 5-percent drop on RoleAdherence does not hide behind a passing crew-level TaskCompletion score.
How does Error Feed cluster CrewAI failures by role and delegation defect?
Error Feed runs HDBSCAN soft-clustering over span attributes plus per-agent role tags plus trajectory embeddings inside ClickHouse, then fires a Claude Sonnet 4.5 JudgeAgent on each cluster with a 30-turn budget and eight span-tools. For CrewAI crews, the natural clusters are role-shaped: role-bleed clusters (the writer started citing fresh sources after a model refresh), delegation-asymmetry clusters (the manager over-delegated trivial work to the expensive research agent), context-handoff clusters (the researcher did not pass enough context to the writer and groundedness drops on the final artifact), and termination-drift clusters (the sequential crew skipped the editor task when the writer output looked long enough). The Judge writes a 5-category 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution), and an immediate_fix naming the role contract tighten, the manager prompt change, or the rubric calibration that should ship today. Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
Where does Future AGI ship the full CrewAI eval stack?
The eval stack ships as a package, not a single product. ai-evaluation SDK (Apache 2.0) ships 60-plus EvalTemplate classes including TaskCompletion, EvaluateFunctionCalling (alias LLMFunctionCalling), AnswerRefusal, ConversationCoherence, ConversationResolution, Groundedness, ChunkAttribution, and 11 CustomerAgent templates, plus the CustomLLMJudge that carries RoleAdherence, DelegationCorrectness, CrewCoherence, and ManagerWorkerFidelity. traceAI (Apache 2.0) ships the CrewAIInstrumentor wrapping Crew.kickoff, Task._execute_core, and ToolUsage._use across 50-plus AI surfaces. The Future AGI Platform adds self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the stack as the clustering and what-to-fix layer. agent-opt's six optimizers (RandomSearch, BayesianSearch, MetaPrompt, ProTeGi, GEPA, PromptWizard) tune each agent's role contract and the manager's delegation prompt as separate study targets.
Related Articles
View all
Evaluating LLM Agent Handoffs (2026)
Guides

Evaluating LLM agent handoffs in 2026: the handoff is the cross-framework eval unit. Four rubrics, per-handoff spans, CI gates, and Error Feed clustering.

NVJK Kartik
NVJK Kartik ·
11 min