Evaluating CrewAI Agents: Role Adherence Is the Unit (2026)
Evaluating CrewAI agents in 2026: role adherence as the primary metric, plus task delegation, crew coherence, and manager-worker fidelity.
A three-agent CrewAI crew scores 0.92 on TaskCompletion in CI. The Senior Research Analyst produces a clean source list. The Technical Writer ships a 1500-word explainer. The Editor signs off. A week later, the same crew scores 0.90, but the writer is now running fresh web searches mid-draft, the researcher is paraphrasing the writer’s structure back into the brief, and cost per run has doubled because the editor started pulling new citations. Nothing in the final artifact rubric moved. The role contract collapsed.
CrewAI is role-first, so its evaluation has to be role-first too. The thesis of this post is simple: CrewAI’s design makes eval easier in one way, because every agent ships with a declared role, goal, and backstory you can grade against, and harder in another, because role adherence becomes a first-class failure mode that other frameworks do not surface. This is the working pattern for evaluating CrewAI agents in 2026: role adherence as the primary metric, plus task delegation correctness, crew-level coherence, and manager-worker fidelity, all riding on the per-agent spans the traceAI CrewAIInstrumentor emits out of the box.
Why CrewAI eval differs from AutoGen and LangGraph eval
Each major multi-agent framework has a different load-bearing primitive, and the eval surface follows it.
AutoGen is message-first: group chats route turns through a team channel and the failure modes live in the handoff between sender and receiver. LangGraph is state-first: the graph mutates a shared state object across nodes and failures live in node correctness, edge selection, and state-diff integrity. CrewAI is role-first. Each agent is declared with an explicit role, goal, and backstory. The runtime executes a declared task list. Roles are not metadata; they are the contract the runtime enforces.
This shifts what you grade. AutoGen eval centres on handoff fidelity. LangGraph eval centres on graph topology. CrewAI eval centres on the role contract: did each agent stay inside its declared role, did the manager pick the right worker for each task, did the final artifact reflect every agent’s contribution, did the manager hand the worker enough context to act without inventing constraints. The unit of evaluation is the agent’s adherence to its role on each task it executes, not the team-level transcript. The cross-framework spine lives in the agent eval guide; the rest of this post stays inside CrewAI.
Role adherence as the primary metric
A CrewAI agent is a declarative object. You ship role, goal, backstory, tools, allow_delegation, and optional max_iter, max_rpm. The runtime reads these as the agent’s contract. Role drift is what happens when the agent answers outside that contract. The Senior Research Analyst starts drafting the explainer. The Technical Writer starts running new searches. The Editor pulls fresh citations the researcher never produced.
Drift breaks two things at once. The per-agent rubric you wrote for that role stops being a meaningful test, because the agent is no longer doing the job the rubric measures. The team’s division of labour collapses, which is why you ran a crew instead of one planner-with-tools to begin with. Drift is also the failure mode most sensitive to model refreshes: a refreshed gpt-4o or claude-sonnet-4-5 checkpoint trained to be more helpful drifts role behaviour first and final-answer quality second. Per-agent RoleAdherence is the earliest indicator the refresh landed.
Score it per agent, per task, against the agent’s own contract. The rubric reads role, goal, backstory, and the agent’s output for that task, and asks one question: did the agent stay inside the contract. Track the trend line per checkpoint. A 5-point drop overnight is the signal you would otherwise pick up from a cost graph two weeks later.
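The trend-line check can be sketched in a few lines. This is a minimal sketch, assuming RoleAdherence has already been averaged per agent per checkpoint and expressed in points on a 0-100 scale; the `flag_role_drift` helper, the checkpoint labels, and the data layout are hypothetical and not part of CrewAI or ai-evaluation.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: agent role -> ordered (checkpoint label, mean RoleAdherence in points).
History = Dict[str, List[Tuple[str, float]]]


def flag_role_drift(history: History, max_drop: float = 5.0) -> List[Tuple[str, str, float]]:
    """Return (role, checkpoint, drop) for each consecutive-checkpoint drop >= max_drop."""
    flagged = []
    for role, series in history.items():
        # Compare every checkpoint with the one immediately before it.
        for (_, prev_score), (label, score) in zip(series, series[1:]):
            drop = prev_score - score
            if drop >= max_drop:
                flagged.append((role, label, round(drop, 2)))
    return flagged


history = {
    "Senior Research Analyst": [("ckpt-a", 94.0), ("ckpt-b", 93.0)],
    "Technical Writer": [("ckpt-a", 92.0), ("ckpt-b", 84.0)],
}
print(flag_role_drift(history))
```

Keeping the series per role, rather than per crew, is what lets a writer-only regression show up even when the researcher's scores hold steady.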
Task delegation correctness on hierarchical crews
Hierarchical crews introduce a second failure surface. The manager owns delegation: it reads each task, picks a worker, and routes the task description plus context. Delegation defects are the hardest crew defect class because the final output can still look fine when the wrong agent did the right task. The researcher writes an OK explainer, the writer produces an OK source list, the cost graph spikes, and the role contract is silently broken across every run.
Two failure modes carry most production delegation defects. Wrong worker. The manager assigns the writer’s job to the researcher, the editor’s job to the writer. Aggregate TaskCompletion still passes because the work gets done, just by the wrong agent at the wrong cost. Over-delegation. The manager delegates trivial work (formatting, light edits, single-line clarifications) to the expensive research agent instead of handling it inline, and the per-task cost spike hides inside the crew aggregate.
DelegationCorrectness is a CustomLLMJudge that reads the task description, the candidate agents, and the chosen agent, and scores whether the chosen agent matches the task shape. A drop here is the leading indicator of latency tails and cost overruns even when the crew-coherence score still looks fine.
Crew-level coherence and manager-worker fidelity
The final crew output is the artifact users see, and it has its own failure mode beyond role drift and bad delegation. Coherence collapse is what happens when the artifact reflects one agent’s solo work with the other agents’ names attached. The researcher’s brief is paraphrased back as the final article. The writer’s draft is shipped as the source list. The editor’s notes are absorbed into the body without changing the underlying claims. No per-agent rubric flags this because every individual turn is well-formed; the team-level synthesis broke between them.
CrewCoherence is a CustomLLMJudge that reads each agent’s contribution and the final output, and scores whether the artifact reflects every agent’s work. Penalise outputs that collapse into one agent’s voice, ignore an upstream correction, or contradict an earlier correct intermediate.
ManagerWorkerFidelity is the paired rubric for hierarchical crews: it reads the manager’s task description together with the worker’s response, and scores whether the manager handed enough context for the worker to act without inventing constraints. Most fidelity defects are under-spec: the manager says “research this topic” and the worker invents a date range, a scope cap, or an exclusion list the manager never set. The defect surfaces downstream as constraint bleed that no single agent ever owned. Score the manager’s task descriptions as a first-class artifact, not as plumbing.
The four role-anchored axes in one table
| Axis | What you measure | Evaluator |
|---|---|---|
| Role adherence | Did each agent stay in the declared role | CustomLLMJudge as RoleAdherence, per agent |
| Task delegation correctness | Did the manager pick the right worker | CustomLLMJudge as DelegationCorrectness, per delegation |
| Crew-level coherence | Does the final output reflect every agent | CustomLLMJudge as CrewCoherence, per kickoff |
| Manager-worker fidelity | Did the manager hand enough context to the worker | CustomLLMJudge as ManagerWorkerFidelity, per task |
Layer these on top of the SDK’s existing templates: TaskCompletion per task, EvaluateFunctionCalling (aliased LLMFunctionCalling) per tool, AnswerRefusal per agent output, Groundedness and ContextAdherence on tasks that consume retrieved context. The four role-anchored rubrics give you the CrewAI-specific signal. The SDK templates give you the baseline a single-agent run would also need.
traceAI CrewAIInstrumentor: what the spans actually carry
A role-adherence rubric needs an agent span tagged with the role contract. The CrewAIInstrumentor emits that span tree without code changes inside your crew definitions, by patching three CrewAI internals when instrument() is called.
```bash
pip install crewai
pip install fi-instrumentation-otel traceAI-crewai
pip install ai-evaluation
```

```python
import os

os.environ["FI_API_KEY"] = "..."
os.environ["FI_SECRET_KEY"] = "..."

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_crewai import CrewAIInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="content-research-crew",
)
CrewAIInstrumentor().instrument(tracer_provider=trace_provider)
```
After this call, every crew kickoff in the process emits the span tree the four rubrics filter on. The instrumentor (verified at traceAI/python/frameworks/crewai/traceai_crewai/_wrappers.py) wraps three internals.
Crew.kickoff becomes a CHAIN span carrying crew_id, crew_key, crew_inputs, the full crew_agents JSON list (role, goal, backstory, allow_delegation, tools_names, max_iter, max_rpm per agent), the full crew_tasks JSON list (id, description, expected_output, async_execution, human_input, agent_role, context, tools_names per task), and the gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.total_tokens totals from CrewAI’s own usage_metrics. The crew topology lands on the root span as structured JSON, not flat strings.
Task._execute_core becomes an AGENT span per task with crew_id, task_id, task_key, and, when crew.share_crew is set, formatted_description and formatted_expected_output. The agent’s role, goal, and backstory ride on the span’s input value. Per-agent RoleAdherence is a filter on fi.span.kind=AGENT joined against the parent CHAIN’s crew_agents list.
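The join described above can be sketched offline on exported spans. This is a minimal sketch under stated assumptions: the CHAIN span's attributes are available as a dict whose `crew_agents` value is a JSON string, and each AGENT span has been flattened into a dict with hypothetical `task_id`, `agent_role`, and `output` keys. The real export shape depends on your trace backend.

```python
import json
from typing import Any, Dict, List


def build_role_adherence_inputs(
    chain_attrs: Dict[str, Any], agent_spans: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    """Pair each AGENT span's output with that agent's declared contract."""
    # Index the declared contracts by role so each span is a dictionary lookup.
    contracts = {a["role"]: a for a in json.loads(chain_attrs["crew_agents"])}
    rows = []
    for span in agent_spans:
        contract = contracts.get(span["agent_role"])
        if contract is None:
            continue  # role not declared on this crew; triage separately
        rows.append({
            "task_id": span["task_id"],
            "role": contract["role"],
            "goal": contract["goal"],
            "backstory": contract["backstory"],
            "output": span["output"],
        })
    return rows


# Illustrative data only.
chain_attrs = {
    "crew_agents": json.dumps([
        {"role": "Technical Writer", "goal": "Draft the explainer",
         "backstory": "Writes from the research brief only"},
    ])
}
agent_spans = [
    {"task_id": "t2", "agent_role": "Technical Writer", "output": "Draft v1"},
    {"task_id": "t9", "agent_role": "Unlisted Agent", "output": "n/a"},
]
rows = build_role_adherence_inputs(chain_attrs, agent_spans)
print(rows)
```

Each row carries everything the RoleAdherence rubric asks for, so the judge never has to re-read the crew definition.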
ToolUsage._use becomes a TOOL span per tool call with gen_ai.tool.name, the configured function_calling_llm, and the full tool input. Retries surface as sibling TOOL spans on the same parent agent, so retry-count eval is COUNT(*) GROUP BY tool.name, parent_agent.
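The same grouping can be done in plain Python on exported TOOL spans. A minimal sketch, assuming each span has been flattened into a dict with hypothetical `tool_name` and `parent_agent` keys; note that repeated calls are only suspected retries, since an agent can legitimately call the same tool twice.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tool_call_counts(tool_spans: List[Dict[str, str]]) -> Dict[Tuple[str, str], int]:
    """Equivalent of COUNT(*) GROUP BY tool name, parent agent."""
    return dict(Counter((s["tool_name"], s["parent_agent"]) for s in tool_spans))


def suspected_retries(tool_spans: List[Dict[str, str]], min_calls: int = 2) -> Dict[Tuple[str, str], int]:
    """Keep only (tool, agent) pairs called at least min_calls times."""
    return {k: n for k, n in tool_call_counts(tool_spans).items() if n >= min_calls}


# Illustrative data only.
tool_spans = [
    {"tool_name": "web_search", "parent_agent": "Senior Research Analyst"},
    {"tool_name": "web_search", "parent_agent": "Senior Research Analyst"},
    {"tool_name": "web_search", "parent_agent": "Technical Writer"},
]
print(suspected_retries(tool_spans))
```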
One honest limitation worth naming: the instrumentor does not emit a dedicated HANDOFF or DELEGATION span today. Delegation in CrewAI rides on the manager’s AGENT span and the worker’s subsequent AGENT span, joined by the manager-worker assignment in the parent’s crew_tasks list. Score delegation by lifting crew_tasks[i].agent_role from the CHAIN span and matching it against the agent that actually ran the task. That is one query, not a parser.
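The matching step can be sketched as a deterministic pre-check that runs before any LLM judge. This is a minimal sketch, assuming the CHAIN span's `crew_tasks` attribute is a JSON string of tasks with `id` and `agent_role`, and that each AGENT span has been flattened into a dict with hypothetical `task_id` and `agent_role` keys naming who actually ran the task.

```python
import json
from typing import Any, Dict, List


def delegation_mismatches(
    chain_attrs: Dict[str, Any], agent_spans: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    """List tasks whose executing agent differs from the declared agent_role."""
    declared = {t["id"]: t["agent_role"] for t in json.loads(chain_attrs["crew_tasks"])}
    mismatches = []
    for span in agent_spans:
        expected = declared.get(span["task_id"])
        if expected is not None and expected != span["agent_role"]:
            mismatches.append({
                "task_id": span["task_id"],
                "declared": expected,
                "actual": span["agent_role"],
            })
    return mismatches


# Illustrative data only.
chain_attrs = {
    "crew_tasks": json.dumps([
        {"id": "research", "agent_role": "Senior Research Analyst"},
        {"id": "draft", "agent_role": "Technical Writer"},
    ])
}
agent_spans = [
    {"task_id": "research", "agent_role": "Senior Research Analyst"},
    {"task_id": "draft", "agent_role": "Senior Research Analyst"},
]
print(delegation_mismatches(chain_attrs, agent_spans))
```

A check like this catches the wrong-worker case cheaply; judging whether the declared assignment was itself sensible still needs the DelegationCorrectness rubric.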
The four rubrics in code
All four ride the CustomLLMJudge interface from ai-evaluation. One pattern, four configs.
```python
from fi.evals import Evaluator
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm.providers.litellm import LiteLLMProvider

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
provider = LiteLLMProvider()

role_adherence = CustomLLMJudge(
    provider=provider,
    config={
        "name": "RoleAdherence",
        "grading_criteria": (
            "Read the agent's role, goal, and backstory, then read the agent's "
            "output for the task. Score whether the output stayed inside the "
            "contract. Penalise role bleed: a researcher that drafts the explainer, "
            "a writer that runs searches, an editor that pulls new citations."
        ),
    },
)

delegation_correctness = CustomLLMJudge(
    provider=provider,
    config={
        "name": "DelegationCorrectness",
        "grading_criteria": (
            "Read the task description, the candidate agents, and the chosen "
            "agent. Score whether the chosen agent matches the task shape. "
            "Penalise wrong-worker and over-delegation of trivial work."
        ),
    },
)

crew_coherence = CustomLLMJudge(
    provider=provider,
    config={
        "name": "CrewCoherence",
        "grading_criteria": (
            "Read each agent's contribution and the final crew output. Score "
            "whether the artifact reflects every agent's work. Penalise outputs "
            "that collapse into one agent's voice or ignore an upstream correction."
        ),
    },
)

manager_worker_fidelity = CustomLLMJudge(
    provider=provider,
    config={
        "name": "ManagerWorkerFidelity",
        "grading_criteria": (
            "Read the manager's task description and the worker's response. "
            "Score whether the manager handed enough context for the worker to "
            "act without inventing scope, date ranges, or constraints."
        ),
    },
)
```
Wire per-axis CI thresholds. A reasonable starting set is RoleAdherence >= 0.90 per agent, DelegationCorrectness >= 0.85 per delegation, CrewCoherence >= 0.80 per kickoff, and ManagerWorkerFidelity >= 0.85 per task. The gate fails on the failing axis, not on a single aggregate, so a regression costs one bisect instead of three days. The LLM evaluation playbook covers threshold cadence; the post on agents that pass evals but fail in production covers the axis-blindness pattern that hides role drift behind a passing TaskCompletion.
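A per-axis gate can be as small as the sketch below. It assumes the four rubric scores have already been reduced to one number per axis on a 0-1 scale (for example, the minimum across agents for RoleAdherence); the `failing_axes` helper and the score dict are hypothetical, and only the threshold values come from the text above.

```python
from typing import Dict, Optional, Tuple

# Starting thresholds from the text; tune per crew.
THRESHOLDS = {
    "RoleAdherence": 0.90,
    "DelegationCorrectness": 0.85,
    "CrewCoherence": 0.80,
    "ManagerWorkerFidelity": 0.85,
}


def failing_axes(
    scores: Dict[str, float], thresholds: Dict[str, float] = THRESHOLDS
) -> Dict[str, Tuple[Optional[float], float]]:
    """Return axis -> (score, floor) for every axis below its floor.

    An axis with no score is treated as failing, so a skipped rubric
    cannot silently pass the gate.
    """
    failures = {}
    for axis, floor in thresholds.items():
        score = scores.get(axis)
        if score is None or score < floor:
            failures[axis] = (score, floor)
    return failures


# Illustrative scores only.
scores = {
    "RoleAdherence": 0.93,
    "DelegationCorrectness": 0.81,
    "CrewCoherence": 0.80,
    "ManagerWorkerFidelity": 0.88,
}
for axis, (score, floor) in failing_axes(scores).items():
    print(f"FAIL {axis}: {score} < {floor}")
```

In CI, a non-empty result would typically map to a non-zero exit code, with the failing axis names in the job log so the bisect starts on the right rubric.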
Production observability and Error Feed clustering
CI is a snapshot; production is a river. Score the live trace stream with the same four rubrics and you get the regression signal the offline set cannot have, because the offline set was frozen before the model refresh that broke role behaviour landed.
Error Feed is the loop closer inside the eval stack. Failing crew runs flow into ClickHouse with their span embeddings. HDBSCAN soft-clustering groups them into named issues. The cluster shapes that recur on CrewAI crews are role-shaped, which is the point of grading at the role contract.
- Role-bleed clusters. The writer starts citing fresh sources after a refreshed gpt-4o checkpoint lands. RoleAdherence on the writer drops 8 points overnight. The immediate_fix is a stricter role contract plus a one-shot example of the role boundary.
- Delegation-asymmetry clusters. The manager over-delegates formatting to the expensive research agent. Cost per run climbs 30 percent while CrewCoherence stays flat. The immediate_fix is a manager prompt edit that handles trivial tasks inline.
- Context-handoff clusters. The researcher does not pass enough context to the writer and Groundedness drops below 0.7. The immediate_fix is a task context chain edit plus a ManagerWorkerFidelity tighten on the research-to-writer handoff.
- Termination-drift clusters. A sequential crew skips the editor task when the writer output looks long enough. The CHAIN span shows two AGENT spans where the config promised three. The immediate_fix is an explicit acceptance criterion that forces the editor to fire.
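The termination-drift signature (fewer AGENT spans than declared tasks) can also be checked deterministically. A minimal sketch, assuming the CHAIN span's `crew_tasks` attribute is a JSON string of tasks with `id` and `agent_role`, and that AGENT spans have been flattened into dicts with a hypothetical `task_id` key.

```python
import json
from typing import Any, Dict, List


def skipped_tasks(chain_attrs: Dict[str, Any], agent_spans: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Return declared tasks that have no matching AGENT span in the trace."""
    executed = {span["task_id"] for span in agent_spans}
    return [
        {"task_id": t["id"], "agent_role": t["agent_role"]}
        for t in json.loads(chain_attrs["crew_tasks"])
        if t["id"] not in executed
    ]


# Illustrative data only: three declared tasks, two executed.
chain_attrs = {
    "crew_tasks": json.dumps([
        {"id": "research", "agent_role": "Senior Research Analyst"},
        {"id": "draft", "agent_role": "Technical Writer"},
        {"id": "edit", "agent_role": "Editor"},
    ])
}
agent_spans = [{"task_id": "research"}, {"task_id": "draft"}]
print(skipped_tasks(chain_attrs, agent_spans))
```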
Per cluster, a Claude Sonnet 4.5 JudgeAgent runs a 30-turn investigation across eight span-tools (among them read_span, get_children, get_spans_by_type, and search_spans, plus a Haiku Chauffeur for spans over 3000 characters), with a prompt-cache hit ratio around 90 percent. The Judge writes three artifacts: a 5-category 30-subtype taxonomy entry, the 4-D trace score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution, each 1 to 5), and an immediate_fix naming the role contract tighten, manager prompt edit, or rubric calibration that ships today. Linear OAuth is wired today; Slack, GitHub, Jira, and PagerDuty are on the roadmap.
The cluster’s representative crew runs become regression cases. agent-opt then tunes each agent’s role contract and the manager’s delegation prompt as separate study targets. BayesianSearchOptimizer or GEPAOptimizer with EarlyStoppingConfig works well, and the per-agent separation keeps a winning writer tweak from being masked by a flat researcher. The automated optimization for agents post covers the optimizer mix; the trace and debug multi-agent systems guide covers cross-framework topology if you also run AutoGen or LangGraph.
Per-agent cost telemetry via the gateway
Per-agent cost is the axis most teams ship without. Route CrewAI’s underlying LLM calls through https://gateway.futureagi.com/v1 and the per-response headers do the work. CrewAI’s litellm-backed LLM picks up OPENAI_BASE_URL for OpenAI-compatible providers. Each response carries x-prism-cost, x-prism-latency-ms, x-prism-provider, and the resolved-model header. traceAI attaches them to per-agent LLM spans, so per-agent cost roll-up is SUM(x-prism-cost) GROUP BY parent_agent_role and per-task budget is SUM(x-prism-cost) GROUP BY task_id. The agent that spikes 4x on cost in a regression run is one chart, not a forensic spelunk. See AI agent cost optimization for the deeper story.
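The two roll-ups can be sketched offline on exported LLM spans. This is a minimal sketch, assuming each span has been flattened into a dict carrying the `x-prism-cost` header value as a numeric string alongside hypothetical `parent_agent_role` and `task_id` keys; the cost unit is whatever the gateway reports.

```python
from collections import defaultdict
from typing import Dict, List


def cost_by(spans: List[Dict[str, str]], key: str) -> Dict[str, float]:
    """SUM(x-prism-cost) GROUP BY the given span key."""
    totals: Dict[str, float] = defaultdict(float)
    for span in spans:
        # Header values arrive as strings, so convert before summing.
        totals[span[key]] += float(span["x-prism-cost"])
    return dict(totals)


# Illustrative data only.
llm_spans = [
    {"parent_agent_role": "Senior Research Analyst", "task_id": "research", "x-prism-cost": "0.25"},
    {"parent_agent_role": "Senior Research Analyst", "task_id": "draft", "x-prism-cost": "0.5"},
    {"parent_agent_role": "Technical Writer", "task_id": "draft", "x-prism-cost": "0.125"},
]
print(cost_by(llm_spans, "parent_agent_role"))
print(cost_by(llm_spans, "task_id"))
```

Grouping by role and by task from the same rows makes it easy to spot the case in the sample data, where the researcher is spending on the drafting task.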
Common CrewAI eval anti-patterns
Four mistakes that hide each failure mode above.
Scoring only the final crew artifact. A passing final output tells you the crew produced something. It does not tell you which agent did the work, whether agents stayed in role, or what each agent cost. Per-agent scoring is how you debug a crew when scores regress.
One rubric for all agents. A Senior Research Analyst and a Technical Writer need different rubrics. RoleAdherence judged with one team-wide rubric either over-fits to one role or under-specs both. The CustomLLMJudge pattern makes per-role rubrics cheap; use them.
Ignoring delegation on hierarchical crews. Manager-induced regressions (wrong worker, over-delegation, under-spec’d task descriptions) look like worker regressions in aggregate. Without DelegationCorrectness and ManagerWorkerFidelity, you will spend weeks tuning the worker’s prompt when the actual defect is the manager.
Treating model refreshes as silent. Refreshed checkpoints of gpt-4o, claude-sonnet-4-5, and the open-weight roster drift role behaviour first and final-answer quality second. Pin model versions in your crew config, run the regression set on every refresh, and track per-agent RoleAdherence trend lines per checkpoint.
How Future AGI ships the full CrewAI eval stack
Three surfaces, one loop, no separate products to glue together.
ai-evaluation SDK (Apache 2.0) ships the Evaluator, 60-plus EvalTemplate classes (TaskCompletion, EvaluateFunctionCalling aliased LLMFunctionCalling, AnswerRefusal, ConversationCoherence, Groundedness, ContextAdherence, ChunkAttribution, 11 CustomerAgent* templates), the CustomLLMJudge that carries the four role-anchored rubrics, and four distributed runners (Celery, Ray, Temporal, Kubernetes).
traceAI (Apache 2.0) ships the CrewAIInstrumentor wrapping Crew.kickoff (CHAIN), Task._execute_core (AGENT), and ToolUsage._use (TOOL), 50-plus other AI surface instrumentors across Python, TypeScript, Java, and C#, plus the EvalTag mechanism that attaches a rubric to a span kind so evals run server-side without polling.
Future AGI Platform ships the self-improving evaluators tuned by feedback, in-product custom rubric authoring, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed sits inside the platform with the HDBSCAN clustering, the Sonnet 4.5 JudgeAgent, the 30-subtype taxonomy, the 4-D trace score, and the immediate_fix artifact.
agent-opt closes the loop with six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) consuming the role-anchored rubrics as the optimization objective. Each agent’s role contract and the manager’s delegation prompt are separate study targets. Direct trace-stream-to-agent-opt is roadmap; eval-driven optimization ships today.
Honest tradeoff: if your stack is a single Agent and a tool registry, a lighter framework-specific tracer plus a hand-rolled TaskCompletion rubric is enough. The eval stack above earns its weight when you run real crews (three-plus agents, declared roles, hierarchical delegation, production traffic) and the role contract is the unit that decides whether the crew ships.
What to do this week
One crew config, end to end. Five steps.
- Wire CrewAIInstrumentor().instrument(tracer_provider=trace_provider) into your project. Verify per-task AGENT spans, per-tool TOOL spans, and the Crew.kickoff CHAIN parent with crew_agents and crew_tasks JSON attributes.
- Build a 50-200 scenario regression set per crew config. Tag each scenario with expected role coverage, expected delegation pattern on hierarchical crews, and expected tool sequence per agent.
- Define RoleAdherence, DelegationCorrectness, CrewCoherence, and ManagerWorkerFidelity as CustomLLMJudge rubrics. Run alongside TaskCompletion, LLMFunctionCalling, and AnswerRefusal.
- Wire per-axis CI thresholds. Start at RoleAdherence >= 0.90, DelegationCorrectness >= 0.85, CrewCoherence >= 0.80, ManagerWorkerFidelity >= 0.85.
- Turn on Error Feed. Watch the first week’s clusters. Promote representative crew runs into the regression set. Run a BayesianSearchOptimizer study on the agent whose RoleAdherence trend line ranks worst.
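For the regression set in the second step, one possible shape for a tagged scenario is sketched below. The `Scenario` dataclass, its field names, and the `tool_sequence_diffs` helper are all hypothetical; they only illustrate how the three tags (role coverage, delegation pattern, tool sequence) could be stored and compared against an observed run.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Scenario:
    name: str
    crew_inputs: Dict[str, str]
    expected_roles: List[str]  # roles that must run at least one task
    expected_delegation: Dict[str, str] = field(default_factory=dict)  # task id -> role
    expected_tools: Dict[str, List[str]] = field(default_factory=dict)  # role -> ordered tool names


def tool_sequence_diffs(scenario: Scenario, observed: Dict[str, List[str]]) -> Dict[str, Dict[str, List[str]]]:
    """Return roles whose observed tool sequence differs from the expected one."""
    diffs = {}
    for role, expected in scenario.expected_tools.items():
        actual = observed.get(role, [])
        if actual != expected:
            diffs[role] = {"expected": expected, "observed": actual}
    return diffs


# Illustrative data only.
scenario = Scenario(
    name="explainer-basic",
    crew_inputs={"topic": "vector databases"},
    expected_roles=["Senior Research Analyst", "Technical Writer", "Editor"],
    expected_tools={"Senior Research Analyst": ["web_search"], "Technical Writer": []},
)
observed = {"Senior Research Analyst": ["web_search"], "Technical Writer": ["web_search"]}
print(tool_sequence_diffs(scenario, observed))
```

Here the writer calling `web_search` shows up as a diff, which is the same role-bleed pattern the RoleAdherence rubric scores from the output side.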
The teams shipping reliable CrewAI crews in 2026 stopped grading the final artifact and started grading the role contract. The framework gives you the declarative role surface; the eval stack gives you the signal that keeps each agent honest to it, one task at a time.