
How to Organize an LLM Eval Team in 2026

Eval ownership splits three ways: a platform team that owns tooling, product teams that own rubrics, and a quality council that owns policy. This guide covers the RACI matrix, org models, hiring profiles, and Future AGI (FAGI) as the platform-team accelerator.


You ship an LLM product. Six months in, the eval suite has 200 rubrics no one can rank by importance, two of the last three releases loosened a threshold “just for this launch,” the platform engineer who wrote the runner left, and the chief privacy officer just learned the team has been running a custom toxicity rubric that was never put through privacy review. Each problem has a different owner. None of them has a name on the doc.

LLM eval doesn’t fail because the rubrics are wrong or the judge model is undercooked. It fails because the work has three different shapes and most companies try to put all three under one team. The thesis of this piece: eval ownership splits between an eval platform team (the tooling), product teams (the rubrics), and a quality council (the policy and calibration). Skip any one and either the tooling rots, the rubrics drift, or the company has no coherent stance on what “good” means. The rest of this piece is the RACI matrix that makes that split operational, the org models that scale through it, the hiring profiles for each role, and where Future AGI fits.

TL;DR: the three-role split

| Group | Owns | Failure mode when missing |
| --- | --- | --- |
| Eval platform team | SDK, runners, judge-calibration tooling, Error Feed plumbing, CI gate infra | Tooling rots; each product team builds a shadow runner |
| Product teams | Rubrics per surface, golden set, per-product triage rotation | Rubrics drift from user needs; the platform team writes them and gets it wrong |
| Quality council | Cross-cutting safety rubrics, threshold-exception process, judge-of-judges, deploy-block authority | No coherent “what good means”; every launch relitigates the gate |

If you only get one thing right: stand up the quality council. The platform team and the product teams will form anyway. The council is the role companies skip and pay for at the next regulator question or first public incident.

Why three, not one or five

The single-team pattern fails because the work is three different jobs. Building a Ray-backed runner that finishes a 50,000-example regression overnight is platform engineering. Deciding whether a clinical-notes summarizer omitting a side effect is moderate or severe is subject-matter expertise. Adjudicating whether a 2.1-point drop on the jailbreak rubric blocks a release in front of a launch deadline is policy judgment. One team that owns all three becomes the bottleneck on at least two of them, and product teams learn to route around it. The longer view of that decay is in the eval-team scaling guide.

The five-role split most lists publish (Eval Owner, Rubric Authors, Annotators, Eval Engineer, Incident Triager) is the activity decomposition, not the org chart. The activities are real, but the company doesn’t hire five role types; it hires platform engineers into one team, eval leads into product teams, and forms a council from existing leadership. The RACI matrix below is what makes that mapping operational.

The RACI matrix per eval activity

The matrix below is the working pattern for a mid-to-enterprise team running the three-role split. R = responsible (does the work), A = accountable (signs off, has the authority), C = consulted (asked before the decision), I = informed (told after).

| Activity | Platform team | Product team | Quality council |
| --- | --- | --- | --- |
| New rubric for a product surface | C (infra fit) | R / A | I |
| New cross-cutting safety rubric (PII, jailbreak, toxicity) | R | C | A |
| Golden-set refresh for a product | C | R / A | I |
| Annotator agreement floor and kappa policy | C | R | A |
| Judge-calibration regression detected | R | C | A |
| Judge-of-judges quarterly review | R | C | A |
| CI gate plumbing and PR check infra | R / A | C | I |
| CI threshold for a product rubric | C | R / A | I |
| Deploy-block override (ship past a failing gate) | I | R (requests) | A |
| Incident triage rotation per product | C | R / A | I |
| Cross-product incident pattern (same failure in N products) | C | C | R / A |
| Vendor decision (SDK, judge model, eval platform) | R | C | A |
| Eval roadmap per product | C | R / A | I |
| Eval roadmap company-wide | C | C | R / A |

Two patterns to read out of this matrix.

The council is Accountable for anything cross-cutting and Informed on the rest. Per-product rubric work flows through product teams. Anything that affects more than one product (safety, judge calibration, deploy policy) needs a single Accountable owner above the products, and that’s the council. Without that line, every cross-cutting decision turns into a multi-team negotiation that never closes.

The platform team is Responsible but rarely Accountable. It builds and runs the infrastructure, but the call on whether a rubric ships, a threshold holds, or a release blocks belongs to the product team (for their surface) or the council (for cross-cutting work). Platform teams that absorb Accountability for the rubric set become the bottleneck the three-role split exists to avoid.
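One way to keep the matrix honest is to hold it as data and check it mechanically. The sketch below transcribes the rows above into a dictionary and verifies that every activity has exactly one Accountable group. The activity keys, the `holders` and `validate` helpers, and the "RA" shorthand are illustrative names chosen here, not part of any tool the guide mentions.

```python
GROUPS = ("platform", "product", "council")

# Codes per activity in GROUPS order, transcribed from the matrix above.
# "RA" means the group is both Responsible and Accountable.
RACI = {
    "new_product_rubric": ("C", "RA", "I"),
    "new_cross_cutting_safety_rubric": ("R", "C", "A"),
    "golden_set_refresh": ("C", "RA", "I"),
    "annotator_agreement_floor": ("C", "R", "A"),
    "judge_calibration_regression": ("R", "C", "A"),
    "judge_of_judges_review": ("R", "C", "A"),
    "ci_gate_plumbing": ("RA", "C", "I"),
    "ci_threshold_product_rubric": ("C", "RA", "I"),
    "deploy_block_override": ("I", "R", "A"),
    "incident_triage_rotation": ("C", "RA", "I"),
    "cross_product_incident_pattern": ("C", "C", "RA"),
    "vendor_decision": ("R", "C", "A"),
    "eval_roadmap_per_product": ("C", "RA", "I"),
    "eval_roadmap_company_wide": ("C", "C", "RA"),
}


def holders(activity: str, code: str) -> list[str]:
    """Return the groups holding a given RACI code for an activity."""
    return [g for g, codes in zip(GROUPS, RACI[activity]) if code in codes]


def validate(matrix: dict) -> None:
    """Require exactly one Accountable and at least one Responsible per row."""
    for activity, row in matrix.items():
        accountable = [c for c in row if "A" in c]
        responsible = [c for c in row if "R" in c]
        if len(accountable) != 1 or not responsible:
            raise ValueError(f"{activity}: ambiguous or missing ownership")


validate(RACI)

# Activities where the platform team carries Accountability.
platform_accountable = [a for a in RACI if "platform" in holders(a, "A")]
```

Run against these rows, the platform team is Accountable for a single activity (CI gate plumbing), which is the "Responsible but rarely Accountable" pattern in concrete form. A check like `validate` can run in CI so that a new row without a single Accountable owner fails review.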

Org models at scale

The right shape of the three-role split depends on company size and product surface.

Small (10-40 engineers, 1-2 product lines)

The platform team and the council overlap. A single Eval Owner runs both, with two or three SMEs sitting on a lightweight council that meets monthly. Product teams are small enough that each one has a part-time Eval Lead (often the founding engineer on the surface). No dedicated platform headcount yet; the SDK does the heavy lifting.

The trap here is prematurely formalizing the council with bi-weekly meetings and a charter document when the company has one product. The council shape at this stage is “Eval Owner walks down the hall and gets a yes.” Build the muscle, skip the ceremony.

Mid (40-200 engineers, 2-6 product lines)

This is where the three-role split becomes load-bearing. A dedicated eval platform team of two to five people. Each product team has an Eval Lead (full-time) plus rotating Rubric Authors from the domain side (PMs, designers, clinicians, lawyers, support leads). A formal quality council of five to eight people meets every two weeks, with a charter and a deploy-block playbook.

Most mid-stage failures are the council not existing yet. The platform team is named, the product teams have leads, but no one sits above the product teams on cross-cutting policy. The first regulator inquiry or the first PR-visible incident is what creates the council, expensively.

Enterprise (200-plus engineers, 5-plus product lines, regulated)

Platform team grows to 8-20 with sub-functions: an SDK and runners group, a judge-calibration and golden-set group, an Error Feed and incident infra group. Product teams have Eval Leads plus dedicated Eval Engineers per surface. The council is a formal cross-functional group reporting to the CTO or Chief AI Officer, often co-chaired with Legal or Trust and Safety, with rubric versioning, regulatory tags per rubric, multi-region policy, and an annual external audit. The chargeback model bills consumer teams per rubric per request via the platform’s hierarchical budgets.

The enterprise failure mode is the platform team accumulating policy decisions because it has the most context. The fix is to keep the council named and active, with a published exception-decision process the platform team executes but doesn’t own.
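The per-rubric, per-request chargeback described above reduces to a group-by over usage records. The following is a minimal sketch under stated assumptions: the `UsageRecord` shape, the flat `unit_cost_usd` price, and the `chargeback` function are hypothetical and do not come from the Future AGI API, and only two of the budget levels (team and rubric) are modeled.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    team: str             # consumer team being billed
    rubric: str           # rubric that was evaluated
    requests: int         # eval requests in the billing window
    unit_cost_usd: float  # assumed flat price per request for this rubric


def chargeback(records):
    """Sum cost per (team, rubric) pair and roll the totals up per team."""
    per_rubric = defaultdict(float)
    per_team = defaultdict(float)
    for record in records:
        cost = record.requests * record.unit_cost_usd
        per_rubric[(record.team, record.rubric)] += cost
        per_team[record.team] += cost
    return dict(per_rubric), dict(per_team)


usage = [
    UsageRecord("support", "Groundedness", 1000, 0.002),
    UsageRecord("support", "Toxicity", 500, 0.001),
    UsageRecord("search", "Groundedness", 2000, 0.002),
]
by_rubric, by_team = chargeback(usage)
```

The platform team would own the pipeline that produces the records; the pricing and the exception policy behind it stay with the council.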

Hiring profile per role

The three groups need three different profiles. Confusing them is the most common staffing mistake.

Eval platform engineer

Platform engineering with ML literacy, not ML engineering with platform literacy. The day-to-day is distributed-systems work: keeping a Ray runner healthy on 50,000-example regressions, wiring the SDK into pytest and GitHub Actions, maintaining the judge-calibration program, plumbing Error Feed clusters into the on-call tracker, building the chargeback view on top of hierarchical budgets.

Hire from. ML platform teams (Databricks, AWS SageMaker, internal ML platforms), data engineering, backend infrastructure. Strong on Python, async, distributed compute, observability. Comfortable reading judge prompts but doesn’t need to author rubrics.

Avoid hiring. Pure applied ML researchers. They will spend three months over-engineering the calibration math and ship no runners.

Product team Eval Lead

Domain literacy first, code fluency second. Reads 50 production traces a week. Talks to SMEs (clinicians, lawyers, support leads, PMs) and translates ambiguous “this answer feels off” into concrete rubric definitions. Owns the golden set for the surface. Adjudicates annotator disagreement. Runs the per-product incident triage rotation.

Hire from. Applied ML, technical PM, strong domain backgrounds with a coding muscle (e.g., a paralegal who codes for a legal-AI product). Should be able to read and modify rubric code in the ai-evaluation SDK, but does not need to build runners.

Avoid hiring. Strong systems engineers who joined to “do ML.” The role is half product, half quality engineering; pure infra people get bored inside a quarter.
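Adjudicating annotator disagreement usually starts from an agreement statistic. Below is a short sketch of Cohen's kappa for two annotators, plus a floor check. The 0.6 floor is an illustrative assumption, not a value this guide prescribes; per the RACI matrix, the council is Accountable for setting it. The function names are chosen here for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def meets_floor(kappa, floor=0.6):
    """True when agreement is at or above the policy floor (assumed 0.6)."""
    return kappa >= floor


annotator_1 = ["pass", "pass", "fail", "fail"]
annotator_2 = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

When a batch falls below the floor, the Eval Lead would typically send the disagreeing items back for adjudication before they enter the golden set.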

Quality council member

A standing seat, not a hire. The council is composed of existing leadership: head of the platform team, two or three product Eval Leads on rotation, head of Trust and Safety or AppSec, and a senior PM or domain expert per high-risk surface. New hires for the council are rare and senior (often a head of AI quality reporting to the CTO at enterprise scale).

Hire from. When you do hire onto the council, hire from quality engineering, trust and safety, or applied ML with a regulatory background. The role is policy and judgment; pattern-matching from past incidents matters more than depth in any single technical area.

The Rubric Author confusion

Rubric Authors are not a separate hire. They are SMEs from inside the company (or contracted domain experts) who sit on the product team’s rubric work part-time. The product Eval Lead is the persistent owner; the Rubric Author is the recurring contributor. Treating Rubric Authors as a headcount line is what produces the failure mode where the eval team has “five Rubric Authors” and zero actual SMEs from the product side.

The five anti-patterns the split prevents

Each of these is the predictable failure when one of the three roles is missing or merged.

Platform team absent. Each product team rolls its own runner, its own kappa pipeline, its own CI plumbing. The judge model differs by product. Aggregating eval scores across the company is impossible. The fix is naming a platform team and consolidating the infrastructure.

Product team Eval Leads absent. The platform team writes the rubrics, and they drift toward what’s easy to measure rather than what users care about. A faithfulness rubric written for news summarization is still scoring the medical surface, unchanged, six months later. The fix is named Eval Leads inside the product teams with the authority to define what good means.

Quality council absent. Cross-cutting safety rubrics are owned by whoever happened to write them, often a platform engineer with no policy mandate. Threshold overrides happen unilaterally under launch pressure. The first PR-visible incident or the first regulator question is the council-forming event. The fix is forming it before the event.

Platform team owns rubrics. The conflation that turns the split back into the central-team pattern. The platform team has the context (it sees all the traces), so it writes the rubrics, and now the platform team is the bottleneck. The fix is explicit: the platform team is Responsible for infrastructure, never Accountable for rubric content.

Council has no deploy-block authority. The council exists, meets, publishes policy, and is overridden every launch because the gate is advisory. The fix is the same as every other gate failure: written deploy authority for the council, a documented exception process, and a public log of overrides. The agent-rollout view of this is in the agent rollout-strategies post.
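The three parts of that fix (written authority, an exception process, a log of overrides) map onto a small gate object. The sketch below assumes hypothetical names throughout (`DeployGate`, `Override`, `approve_override`, `can_ship`); it shows the shape of the rule, not a real release system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Override:
    release: str
    rubric: str
    requested_by: str  # product team making the request (Responsible)
    approved_by: str   # council member signing off (Accountable)
    reason: str
    approved_at: datetime


@dataclass
class DeployGate:
    council_members: frozenset
    override_log: list = field(default_factory=list)  # append-only record

    def approve_override(self, release, rubric, requested_by, approved_by, reason):
        """Record an exception; only a council member can grant one."""
        if approved_by not in self.council_members:
            raise PermissionError(f"{approved_by} has no deploy-block authority")
        entry = Override(release, rubric, requested_by, approved_by, reason,
                         datetime.now(timezone.utc))
        self.override_log.append(entry)
        return entry

    def can_ship(self, release, failing_rubrics):
        """A release ships only if every failing rubric has a logged override."""
        covered = {o.rubric for o in self.override_log if o.release == release}
        return set(failing_rubrics) <= covered


gate = DeployGate(council_members=frozenset({"council-chair"}))
blocked = not gate.can_ship("v2.4", ["jailbreak"])
gate.approve_override("v2.4", "jailbreak", "support-team", "council-chair",
                      "regression isolated to a deprecated prompt path")
unblocked = gate.can_ship("v2.4", ["jailbreak"])
```

The point of the design is that an override is a logged decision with a named approver, so a gate that was shipped past leaves a record the council can review later.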

Future AGI as the platform-team accelerator

The three-role split is what the company owns. Future AGI is what the platform team buys to avoid building it.

  • ai-evaluation SDK (Apache 2.0). 60-plus EvalTemplate classes (Groundedness, ContextAdherence, Completeness, TaskCompletion, FactualAccuracy, Toxicity, PromptInjection, DataPrivacyCompliance, AnswerRefusal, and more); 13 guardrail backends (9 open-weight, including LlamaGuard, Qwen3Guard, Granite Guardian, WildGuard, and ShieldGemma; 4 API: OpenAI Moderation, Azure Content Safety, Turing Flash, and Turing Safety); 8 sub-10ms Scanners; and 4 distributed runners (Celery, Ray, Temporal, Kubernetes). The platform team uses this instead of building runners, kappa pipelines, and rubric templates from scratch.

The code the platform team owns shrinks to wiring, not building. A typical CI gate looks like:

```python
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextAdherence, Completeness, TaskCompletion
from fi.testcases import TestCase

# load_golden_set and assert_rubric_thresholds are not imported here;
# they are assumed to be project-local helpers defined in your CI code.
evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
dataset = load_golden_set("customer-support-v3")

# Score every golden-set row against the four rubrics.
results = evaluator.evaluate(
    eval_templates=[
        Groundedness(),
        ContextAdherence(),
        Completeness(),
        TaskCompletion(),
    ],
    inputs=[TestCase(**row) for row in dataset],
)

# Fail the CI job if any rubric scores below its threshold.
assert_rubric_thresholds(results, {
    "Groundedness": 0.85,
    "ContextAdherence": 0.85,
    "Completeness": 0.80,
    "TaskCompletion": 0.85,
})
```
  • Future AGI Platform. Self-improving evaluators tuned by production thumbs-up and thumbs-down feedback. In-product authoring agent so Rubric Authors describe a rubric in natural language and ship it as an evaluator without writing Python. Lower per-eval cost than Galileo Luna-2 on classifier-backed evals. Five-level hierarchical budgets (org, team, user, key, tag) so finance maps eval spend onto the same three-role split.
  • Error Feed. HDBSCAN soft-clustering runs over the trace stream; a Sonnet 4.5 Judge writes an RCA, an immediate_fix, and a four-dimension score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution) per cluster; and Linear is OAuth-wired today as the direct ticketing integration. Product teams use it for per-product triage; the council uses it for cross-product pattern detection. The trace-stream-to-dataset connector is on the roadmap; today, promoting a cluster trace into the regression set is a one-click manual step. Mechanics are in the incident response playbook.
  • Agent Command Center. Hosts the runtime next to the eval surface: a 17 MB Go binary, six native provider adapters (20-plus providers in total), and shadow/mirror/race modes for canary work next to the CI gate. Also listed: RBAC; SOC 2 Type II, HIPAA, GDPR, and CCPA; AWS Marketplace; multi-region.

The platform team is the primary buyer. The product teams use the in-product authoring agent and the Error Feed surface. The council uses the cross-cutting safety rubrics, the judge-of-judges calibration in the SDK, and the hierarchical budget view for vendor-level cost visibility.
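The CI gate snippet above calls `assert_rubric_thresholds` without defining it. One possible shape for such a helper is sketched below. It assumes the evaluation results have already been reduced to a mean score per rubric; the SDK's result object is not shown in this guide, so that reduction step is left out and the signature here is an assumption.

```python
def assert_rubric_thresholds(mean_scores, thresholds):
    """Raise AssertionError naming every rubric that is missing or below its floor.

    mean_scores: mapping of rubric name to mean score (assumed input shape).
    thresholds:  mapping of rubric name to minimum acceptable mean score.
    """
    failures = []
    for rubric, floor in thresholds.items():
        score = mean_scores.get(rubric)
        if score is None:
            failures.append(f"{rubric}: no score reported (floor {floor:.2f})")
        elif score < floor:
            failures.append(f"{rubric}: {score:.3f} < {floor:.2f}")
    if failures:
        raise AssertionError("eval gate failed:\n" + "\n".join(failures))


thresholds = {"Groundedness": 0.85, "Completeness": 0.80}

# Passes silently: both rubrics are at or above their floors.
assert_rubric_thresholds({"Groundedness": 0.91, "Completeness": 0.80}, thresholds)
```

Collecting every failure before raising means one CI run reports all the rubrics that regressed, rather than stopping at the first. Treating a missing score as a failure keeps a silently dropped rubric from passing the gate.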

Where Future AGI sits in each org model

| Org model | Future AGI role |
| --- | --- |
| Small (10-40 engineers) | Outsourced surface; SDK for code-first, Platform for in-product authoring, Error Feed for triage. The interim Eval Owner owns all three roles. |
| Mid (40-200 engineers) | Platform-team accelerator. The dedicated platform team builds on the SDK, product teams author rubrics in the Platform, and the council uses Error Feed for cross-product pattern detection. |
| Enterprise (200-plus engineers) | Underlying platform with BYOC or multi-region deployment, hierarchical budgets for chargeback, rubric versioning per regulated surface, Linear ticketing into the SRE-style on-call rotation. |

The eval cost-optimization piece covers the financial side of the platform-team buy. The judge-calibration write-up covers the mechanics of the judge-of-judges loop the council owns. The golden-set design guide covers the rubric-set hygiene each product team is Accountable for.

The cultural shift

Teams that win at LLM eval treat the three-role split as load-bearing org design, not a labeling exercise. The platform team has a roadmap and headcount. Product team Eval Leads sit in the staffing plan alongside backend engineers, not as a side gig for a senior IC. The council meets every two weeks and has a charter, and its deploy-block authority shows up in the launch checklist.

Teams that lose treat eval as one team’s problem. Either the platform team owns everything and becomes the bottleneck, or product teams own everything and lose cross-cutting coherence, or the council exists on paper and gets overridden the first time a launch is in the way. The discipline shift is whose name is on the doc when a rubric is wrong, who’s paged when a cluster appears, and who has the authority to hold the line.

If you’re sitting on the failure modes from the opening paragraph today, the first move isn’t another Eval Engineer. It’s standing up the quality council with a charter and a deploy-block playbook. The platform team and the product Eval Leads will form around it, because the council is the thing that gives them somewhere to escalate. That order surprises most teams because the visible work looks like rubric code, but the underlying constraint is almost always organizational, not technical.

For the broader pattern of how eval fits the lifecycle, the 2026 LLM evaluation playbook covers the six-layer architecture. For the longitudinal view of how this split grows from 5 to 500 engineers, the eval-team scaling guide is the next read. The three-role view in this piece is the steady-state shape underneath both.

Frequently asked questions

Who actually owns LLM evaluation inside a company shipping LLM products?
Three groups, not one. An eval platform team owns the tooling: SDK, runners, judge-calibration program, Error Feed, CI gate plumbing. Product teams own the rubrics for their surface: what 'good' means, the golden set, per-product incident triage. A quality council owns policy: cross-cutting safety rubrics, judge-of-judges calibration, the threshold-exception process, and the deploy-block authority. Skip any one and either the tooling rots, the rubrics drift from user needs, or the company has no coherent stance on what 'good' means and every launch relitigates it.
Why split eval ownership three ways instead of giving it to one team?
Because the work is three different jobs that need three different skill profiles. Building a kappa-checked annotation pipeline and a Ray-backed distributed runner is platform engineering. Deciding whether a clinical-notes summarizer omitting a side effect is moderate or severe failure is subject-matter expertise. Adjudicating whether a missed-jailbreak regression blocks a release is policy judgment. One team that owns all three becomes a bottleneck on at least two of them, and the company learns to route around it.
What is a quality council and who sits on it?
A small standing group (typically 4-7 people) that owns eval policy across the company. Membership: head of the eval platform team, two or three Eval Leads from product teams, head of trust and safety or AppSec, and a senior PM or domain expert per high-risk surface. The council meets every two weeks, owns cross-cutting safety rubrics, signs off on threshold changes, runs the judge-of-judges calibration, and is the named decision-maker when product wants to ship past a failing gate. Without the council, every deadline-pressured launch becomes a unilateral threshold override.
What does the RACI matrix look like for a typical eval activity?
For 'add a new rubric for a product surface': Rubric Author (product team SME) is Responsible, Eval Lead on the product team is Accountable, the platform team is Consulted on infra, the quality council is Informed unless the rubric is cross-cutting. For 'judge-calibration regression detected': platform team is Responsible, quality council is Accountable, product teams are Consulted, all engineering is Informed. The pattern is that the council is Accountable for anything cross-cutting and Informed on the rest; product teams are Accountable for their rubrics; the platform team is Responsible for infra.
What hiring profile fits the eval platform team versus the product team Eval Lead?
The platform team is platform engineering with ML literacy: distributed-systems chops, observability instincts, the patience to run a kappa-floor annotation program. Hire from ML platform, data engineering, or backend infra; the ML pedigree matters less than the systems pedigree. The product team Eval Lead is the opposite: domain literacy first, code-fluency second. They need to read 50 traces a week, talk to SMEs, and translate rubric ambiguity into clear definitions. Hire from applied ML, technical PM, or strong domain backgrounds with a coding muscle. Confusing the two profiles is how teams end up with a platform full of domain experts who can't ship a runner and product teams with Ray experts who can't tell a good answer from a bad one.
How does Future AGI fit a three-role org?
Future AGI is the accelerator for the platform team specifically. The ai-evaluation SDK gives platform engineers 60-plus EvalTemplate classes, 13 guardrail backends, 8 sub-10ms Scanners, and 4 distributed runners (Celery, Ray, Temporal, Kubernetes) so the platform team is not building Ray jobs from scratch. The Future AGI Platform gives product teams an in-product agent for rubric authoring so Rubric Authors ship without code, and self-improving evaluators tuned by thumbs-up and thumbs-down feedback. Error Feed gives the quality council and product teams a shared queue for production failures with a Sonnet 4.5 Judge writing immediate_fix per cluster. Five-level hierarchical budgets (org, team, user, key, tag) map eval spend onto the same three-role split for chargeback.
When is the three-role split the wrong shape for a company?
Below about 25 engineers and one product line, the council and the platform team are the same three people, and forcing the formal split is over-engineering. The right Stage 1 shape is a single Eval Owner with a part-time SME per product surface, on the open-source SDK. The three-role split becomes the right shape around 50 engineers and two or more product lines, when one person can no longer credibly hold both 'what we measure' and 'how we measure it at scale.' The scaling view is in the eval-team scaling guide; this piece covers the steady-state pattern at mid and enterprise scale.