How to Organize an LLM Eval Team in 2026
Eval ownership splits three ways: a platform team that owns tooling, product teams that own rubrics, and a quality council that owns policy. This piece covers the RACI matrix, the org models, the hiring profiles, and Future AGI (FAGI) as the platform-team accelerator.
You ship an LLM product. Six months in, the eval suite has 200 rubrics no one can rank by importance, two of the last three releases loosened a threshold “just for this launch,” the platform engineer who wrote the runner left, and the chief privacy officer just learned the team has been running a custom toxicity rubric they never reviewed. Each problem has a different owner. None of them have a name on the doc.
LLM eval doesn’t fail because the rubrics are wrong or the judge model is undercooked. It fails because the work has three different shapes and most companies try to put all three under one team. The thesis of this piece: eval ownership splits between an eval platform team (the tooling), product teams (the rubrics), and a quality council (the policy and calibration). Skip any one and either the tooling rots, the rubrics drift, or the company has no coherent stance on what “good” means. The rest of this piece is the RACI matrix that makes that split operational, the org models that scale through it, the hiring profiles for each role, and where Future AGI fits.
TL;DR: the three-role split
| Group | Owns | Failure mode when missing |
|---|---|---|
| Eval platform team | SDK, runners, judge-calibration tooling, Error Feed plumbing, CI gate infra | Tooling rots; each product team builds a shadow runner |
| Product teams | Rubrics per surface, golden set, per-product triage rotation | Rubrics drift from user needs; the platform team writes them and gets it wrong |
| Quality council | Cross-cutting safety rubrics, threshold-exception process, judge-of-judges, deploy-block authority | No coherent “what good means”; every launch relitigates the gate |
If you only get one thing right: stand up the quality council. The platform team and the product teams will form anyway. The council is the role companies skip and pay for at the next regulator question or first public incident.
Why three, not one or five
The single-team pattern fails because the work is three different jobs. Building a Ray-backed runner that finishes a 50,000-example regression overnight is platform engineering. Deciding whether a clinical-notes summarizer omitting a side effect is moderate or severe is subject-matter expertise. Adjudicating whether a 2.1-point drop on the jailbreak rubric blocks a release in front of a launch deadline is policy judgment. One team that owns all three becomes the bottleneck on at least two of them, and product teams learn to route around it. The longer view of that decay is in the eval-team scaling guide.
The five-role split most lists publish (Eval Owner, Rubric Authors, Annotators, Eval Engineer, Incident Triager) is the activity decomposition, not the org chart. The activities are real, but the company doesn’t hire five role types; it hires platform engineers into one team, eval leads into product teams, and forms a council from existing leadership. The RACI matrix below is what makes that mapping operational.
The RACI matrix per eval activity
The matrix below is the working pattern for a mid-to-enterprise team running the three-role split. R = responsible (does the work), A = accountable (signs off, has the authority), C = consulted (asked before the decision), I = informed (told after).
| Activity | Platform team | Product team | Quality council |
|---|---|---|---|
| New rubric for a product surface | C (infra fit) | R / A | I |
| New cross-cutting safety rubric (PII, jailbreak, toxicity) | R | C | A |
| Golden-set refresh for a product | C | R / A | I |
| Annotator agreement floor and kappa policy | C | R | A |
| Judge-calibration regression detected | R | C | A |
| Judge-of-judges quarterly review | R | C | A |
| CI gate plumbing and PR check infra | R / A | C | I |
| CI threshold for a product rubric | C | R / A | I |
| Deploy-block override (ship past a failing gate) | I | R (requests) | A |
| Incident triage rotation per product | C | R / A | I |
| Cross-product incident pattern (same failure in N products) | C | C | R / A |
| Vendor decision (SDK, judge model, eval platform) | R | C | A |
| Eval roadmap per product | C | R / A | I |
| Eval roadmap company-wide | C | C | R / A |
Two patterns to read out of this matrix.
The council is Accountable for anything cross-cutting and Informed on the rest. Per-product rubric work flows through product teams. Anything that affects more than one product (safety, judge calibration, deploy policy) needs a single Accountable owner above the products, and that’s the council. Without that line, every cross-cutting decision turns into a multi-team negotiation that never closes.
The platform team is Responsible but rarely Accountable. It builds and runs the infrastructure, but the call on whether a rubric ships, a threshold holds, or a release blocks belongs to the product team (for their surface) or the council (for cross-cutting work). Platform teams that absorb Accountability for the rubric set become the bottleneck the three-role split exists to avoid.
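One way to keep those two patterns from drifting is to hold the matrix as data and check it mechanically. The sketch below transcribes the rows above into a Python dict (the shortened activity names and the `RA` encoding for "R / A" are illustrative choices, not a standard format) and asserts that every activity has exactly one Accountable group.

```python
from typing import Dict, Tuple

GROUPS = ("platform", "product", "council")

# Activity -> RACI codes in GROUPS order, transcribed from the matrix.
# "RA" stands for the "R / A" cells. Activity names are shortened for readability.
RACI: Dict[str, Tuple[str, str, str]] = {
    "new product rubric": ("C", "RA", "I"),
    "new cross-cutting safety rubric": ("R", "C", "A"),
    "golden-set refresh": ("C", "RA", "I"),
    "annotator agreement floor and kappa policy": ("C", "R", "A"),
    "judge-calibration regression": ("R", "C", "A"),
    "judge-of-judges quarterly review": ("R", "C", "A"),
    "CI gate plumbing": ("RA", "C", "I"),
    "CI threshold for a product rubric": ("C", "RA", "I"),
    "deploy-block override": ("I", "R", "A"),
    "incident triage rotation": ("C", "RA", "I"),
    "cross-product incident pattern": ("C", "C", "RA"),
    "vendor decision": ("R", "C", "A"),
    "eval roadmap per product": ("C", "RA", "I"),
    "eval roadmap company-wide": ("C", "C", "RA"),
}


def accountable(activity: str) -> str:
    """Return the single Accountable group, or raise if the row is ambiguous."""
    owners = [group for group, code in zip(GROUPS, RACI[activity]) if "A" in code]
    if len(owners) != 1:
        raise ValueError(f"{activity!r} has {len(owners)} accountable groups")
    return owners[0]


def accountable_counts() -> Dict[str, int]:
    """Count how many activities each group is Accountable for."""
    counts = {group: 0 for group in GROUPS}
    for activity in RACI:
        counts[accountable(activity)] += 1
    return counts
```

Calling `accountable_counts()` on this transcription gives one activity for the platform team (CI gate plumbing), five for product teams, and eight for the council, which is the "Responsible but rarely Accountable" pattern in numbers.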
Org models at scale
The right shape of the three-role split depends on company size and product surface.
Small (10-40 engineers, 1-2 product lines)
The platform team and the council overlap. A single Eval Owner runs both, with two or three SMEs sitting on a lightweight council that meets monthly. Product teams are small enough that each one has a part-time Eval Lead (often the founding engineer on the surface). No dedicated platform headcount yet; the SDK does the heavy lifting.
The trap here is prematurely formalizing the council with bi-weekly meetings and a charter document when the company has one product. The council shape at this stage is “Eval Owner walks down the hall and gets a yes.” Build the muscle, skip the ceremony.
Mid (40-200 engineers, 2-6 product lines)
This is where the three-role split becomes load-bearing. A dedicated eval platform team of two to five people. Each product team has an Eval Lead (full-time) plus rotating Rubric Authors from the domain side (PMs, designers, clinicians, lawyers, support leads). A formal quality council of five to eight people meets every two weeks, with a charter and a deploy-block playbook.
Most mid-stage failures are the council not existing yet. The platform team is named, the product teams have leads, but no one sits above the product teams on cross-cutting policy. The first regulator inquiry or the first PR-visible incident is what creates the council, expensively.
Enterprise (200-plus engineers, 5-plus product lines, regulated)
Platform team grows to 8-20 with sub-functions: an SDK and runners group, a judge-calibration and golden-set group, an Error Feed and incident infra group. Product teams have Eval Leads plus dedicated Eval Engineers per surface. The council is a formal cross-functional group reporting to the CTO or Chief AI Officer, often co-chaired with Legal or Trust and Safety, with rubric versioning, regulatory tags per rubric, multi-region policy, and an annual external audit. The chargeback model bills consumer teams per rubric per request via the platform’s hierarchical budgets.
The enterprise failure mode is the platform team accumulating policy decisions because it has the most context. The fix is to keep the council named and active, with a published exception-decision process the platform team executes but doesn’t own.
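The per-rubric, per-request chargeback mentioned for the enterprise shape reduces to a simple rollup. The sketch below is illustrative only: the `UsageRecord` shape and the prices are hypothetical, and it does not model the platform's own budget hierarchy, just the arithmetic a finance view would sit on.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable, NamedTuple


class UsageRecord(NamedTuple):
    team: str       # consumer team being billed
    rubric: str     # rubric that was evaluated
    requests: int   # number of eval requests in the billing window


# Hypothetical per-request prices; real rates come from the vendor contract.
PRICE_PER_REQUEST: Dict[str, Decimal] = {
    "Groundedness": Decimal("0.002"),
    "Toxicity": Decimal("0.0005"),
}


def chargeback(records: Iterable[UsageRecord]) -> Dict[str, Decimal]:
    """Sum price * requests per team. Decimal avoids float rounding in billing."""
    totals: Dict[str, Decimal] = defaultdict(Decimal)
    for rec in records:
        totals[rec.team] += PRICE_PER_REQUEST[rec.rubric] * rec.requests
    return dict(totals)
```

A rubric missing from the price table raises `KeyError` on purpose: an unpriced rubric is a policy gap the council should see, not a line item to default to zero.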
Hiring profile per role
The three groups need three different profiles. Confusing them is the most common staffing mistake.
Eval platform engineer
Platform engineering with ML literacy, not ML engineering with platform literacy. The day-to-day is distributed-systems work: keeping a Ray runner healthy on 50,000-example regressions, wiring the SDK into pytest and GitHub Actions, maintaining the judge-calibration program, plumbing Error Feed clusters into the on-call tracker, building the chargeback view on top of hierarchical budgets.
Hire from. ML platform teams (Databricks, AWS SageMaker, internal ML platforms), data engineering, backend infrastructure. Strong on Python, async, distributed compute, observability. Comfortable reading judge prompts but doesn’t need to author rubrics.
Avoid hiring. Pure applied ML researchers. They will spend three months over-engineering the calibration math and ship no runners.
Product team Eval Lead
Domain literacy first, code fluency second. Reads 50 production traces a week. Talks to SMEs (clinicians, lawyers, support leads, PMs) and translates ambiguous “this answer feels off” into concrete rubric definitions. Owns the golden set for the surface. Adjudicates annotator disagreement. Runs the per-product incident triage rotation.
Hire from. Applied ML, technical PM, strong domain backgrounds with a coding muscle (e.g., a paralegal who codes for a legal-AI product). Should be able to read and modify rubric code in the ai-evaluation SDK, but does not need to build runners.
Avoid hiring. Strong systems engineers who joined to “do ML.” The role is half product, half quality engineering; pure infra people get bored inside a quarter.
Quality council member
A standing seat, not a hire. The council is composed of existing leadership: head of the platform team, two or three product Eval Leads on rotation, head of Trust and Safety or AppSec, and a senior PM or domain expert per high-risk surface. New hires for the council are rare and senior (often a head of AI quality reporting to the CTO at enterprise scale).
Hire from. When you do hire onto the council, hire from quality engineering, trust and safety, or applied ML with a regulatory background. The role is policy and judgment; pattern-matching from past incidents matters more than depth in any single technical area.
The Rubric Author confusion
Rubric Authors are not a separate hire. They are SMEs from inside the company (or contracted domain experts) who sit on the product team’s rubric work part-time. The product Eval Lead is the persistent owner; the Rubric Author is the recurring contributor. Treating Rubric Authors as a headcount line is what produces the failure mode where the eval team has “five Rubric Authors” and zero actual SMEs from the product side.
The five anti-patterns the split prevents
Each of these is the predictable failure when one of the three roles is missing or merged.
Platform team absent. Each product team rolls its own runner, its own kappa pipeline, its own CI plumbing. The judge model differs by product. Aggregating eval scores across the company is impossible. The fix is naming a platform team and consolidating the infrastructure.
Product team Eval Leads absent. The platform team writes the rubrics. They drift toward what’s easy to measure rather than what users care about. A faithfulness rubric written for news summarization scores the medical surface unchanged six months later. The fix is named Eval Leads inside the product teams with the authority to define what good means.
Quality council absent. Cross-cutting safety rubrics are owned by whoever happened to write them, often a platform engineer with no policy mandate. Threshold overrides happen unilaterally under launch pressure. The first PR-visible incident or the first regulator question is the council-forming event. The fix is forming it before the event.
Platform team owns rubrics. The conflation that turns the split back into the central-team pattern. The platform team has the context (it sees all the traces), so it writes the rubrics, and now the platform team is the bottleneck. The fix is explicit: the platform team is Responsible for infrastructure, never Accountable for rubric content.
Council has no deploy-block authority. The council exists, meets, publishes policy, and is overridden every launch because the gate is advisory. The fix is the same as every other gate failure: written deploy authority for the council, a documented exception process, and a public log of overrides. The agent-rollout view of this is in the agent rollout-strategies post.
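To make "written deploy authority, a documented exception process, and a log of overrides" concrete, here is a minimal sketch of a gate that only ships past a failing rubric when a council member has approved an unexpired override. Everything in it is an assumption for illustration: the `Override` fields, the `COUNCIL` roster, and the in-memory `OVERRIDE_LOG` stand in for whatever charter and audit store a team actually uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Override:
    release: str
    rubric: str
    requested_by: str   # product team (Responsible: requests the override)
    approved_by: str    # council member (Accountable: grants it)
    reason: str
    expires: datetime   # overrides are time-boxed, not permanent


COUNCIL = frozenset({"alice", "bob"})   # hypothetical council roster
OVERRIDE_LOG: List[Override] = []       # stand-in for a durable, visible log


def failing_rubrics(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Rubrics below their floor. A rubric with no score counts as failing."""
    return sorted(name for name, floor in thresholds.items()
                  if scores.get(name, 0.0) < floor)


def can_ship(release: str, scores: Dict[str, float], thresholds: Dict[str, float],
             overrides: List[Override], now: Optional[datetime] = None) -> bool:
    """Ship only if every failing rubric has a valid council-approved override."""
    now = now or datetime.now(timezone.utc)
    granted: List[Override] = []
    for rubric in failing_rubrics(scores, thresholds):
        match = next((o for o in overrides
                      if o.release == release and o.rubric == rubric
                      and o.approved_by in COUNCIL and o.expires > now), None)
        if match is None:
            return False            # the gate holds; nothing is logged
        granted.append(match)
    OVERRIDE_LOG.extend(granted)    # every override that was used is recorded
    return True
```

The design choice worth copying is that the approver check lives in the gate itself, so an override signed by the requesting team is indistinguishable from no override at all.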
Future AGI as the platform-team accelerator
The three-role split is what the company owns. Future AGI is what the platform team buys to avoid building it.
- ai-evaluation SDK (Apache 2.0). 60-plus EvalTemplate classes (Groundedness, ContextAdherence, Completeness, TaskCompletion, FactualAccuracy, Toxicity, PromptInjection, DataPrivacyCompliance, AnswerRefusal, and more), 13 guardrail backends (9 open-weight, including LlamaGuard, Qwen3Guard, Granite Guardian, WildGuard, and ShieldGemma; 4 API: OpenAI Moderation, Azure Content Safety, Turing Flash and Safety), 8 sub-10ms Scanners, and 4 distributed runners (Celery, Ray, Temporal, Kubernetes). The platform team uses this instead of building runners, kappa pipelines, and rubric templates from scratch.
The code the platform team owns shrinks to wiring, not building. A typical CI gate looks like:
```python
from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextAdherence, Completeness, TaskCompletion
from fi.testcases import TestCase

# load_golden_set and assert_rubric_thresholds are not imported above;
# they are assumed to be helpers the platform team maintains.
evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
dataset = load_golden_set("customer-support-v3")

results = evaluator.evaluate(
    eval_templates=[
        Groundedness(),
        ContextAdherence(),
        Completeness(),
        TaskCompletion(),
    ],
    inputs=[TestCase(**row) for row in dataset],
)

# Fail the CI job if any rubric falls below its floor.
assert_rubric_thresholds(results, {
    "Groundedness": 0.85,
    "ContextAdherence": 0.85,
    "Completeness": 0.80,
    "TaskCompletion": 0.85,
})
```
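The gate above calls `assert_rubric_thresholds` without defining it. A minimal sketch of what such a helper might look like follows. It assumes the SDK results have already been reduced to a plain mapping of rubric name to per-example scores in the range 0 to 1; the adapter from the real result object to that mapping is not shown, and the signature here is a guess at the intent, not the SDK's API.

```python
from statistics import mean
from typing import List, Mapping, Sequence


def assert_rubric_thresholds(scores_by_rubric: Mapping[str, Sequence[float]],
                             thresholds: Mapping[str, float]) -> None:
    """Raise AssertionError listing every rubric whose mean score is below its floor.

    Assumption: scores_by_rubric maps rubric name -> per-example scores.
    A rubric with a threshold but no scores is treated as a failure,
    so a silently skipped evaluator cannot pass the gate.
    """
    failures: List[str] = []
    for rubric, floor in thresholds.items():
        scores = scores_by_rubric.get(rubric)
        if not scores:
            failures.append(f"{rubric}: no scores returned")
            continue
        avg = mean(scores)
        if avg < floor:
            failures.append(f"{rubric}: mean {avg:.3f} < threshold {floor:.2f}")
    if failures:
        raise AssertionError("eval gate failed:\n" + "\n".join(failures))
```

Collecting all failures before raising means one CI run reports every rubric that regressed, rather than stopping at the first.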
- Future AGI Platform. Self-improving evaluators tuned by production thumbs-up and thumbs-down feedback. In-product authoring agent so Rubric Authors describe a rubric in natural language and ship it as an evaluator without writing Python. Lower per-eval cost than Galileo Luna-2 on classifier-backed evals. Five-level hierarchical budgets (org, team, user, key, tag) so finance maps eval spend onto the same three-role split.
- Error Feed. HDBSCAN soft-clustering over the trace stream. A Sonnet 4.5 judge writes an RCA, an immediate_fix, and a four-dimension score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution) per cluster. Linear is OAuth-wired today as the direct ticketing integration. Product teams use it for per-product triage; the council uses it for cross-product pattern detection. The trace-stream-to-dataset connector is on the roadmap; today, promoting a cluster trace into the regression set is a one-click manual step. Mechanics are in the incident response playbook.
- Agent Command Center. Hosts the runtime next to the eval surface: 17 MB Go binary, six native provider adapters plus 20-plus providers total, shadow/mirror/race modes for canary work next to the CI gate. RBAC, SOC 2 Type II, HIPAA, GDPR, and CCPA, AWS Marketplace, multi-region.
The platform team is the primary buyer. The product teams use the in-product authoring agent and the Error Feed surface. The council uses the cross-cutting safety rubrics, the judge-of-judges calibration in the SDK, and the hierarchical budget view for vendor-level cost visibility.
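The council's cross-product pattern check ("same failure in N products" in the RACI matrix) can be approximated with a few lines once each incident carries a failure label. The sketch below assumes labels are comparable across products, which in practice depends on how clusters are named; the `(product, label)` tuple shape and the example labels are hypothetical and not the Error Feed schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def cross_product_patterns(incidents: Iterable[Tuple[str, str]],
                           min_products: int = 2) -> Dict[str, List[str]]:
    """Return failure labels seen in at least min_products distinct products.

    incidents: (product, failure_label) pairs, one per triaged incident.
    """
    products_by_label: Dict[str, Set[str]] = defaultdict(set)
    for product, label in incidents:
        products_by_label[label].add(product)
    return {label: sorted(products)
            for label, products in products_by_label.items()
            if len(products) >= min_products}
```

Repeated incidents inside one product do not count toward the threshold, since those belong to that product's triage rotation rather than the council.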
Where Future AGI sits in each org model
| Org model | Future AGI role |
|---|---|
| Small (10-40 engineers) | Outsourced surface; SDK for code-first, Platform for in-product authoring, Error Feed for triage. The interim Eval Owner owns all three roles. |
| Mid (40-200 engineers) | Platform-team accelerator. The dedicated platform team builds on the SDK, product teams author rubrics in the Platform, and the council uses Error Feed for cross-product pattern detection. |
| Enterprise (200-plus engineers) | Underlying platform with BYOC or multi-region deployment, hierarchical budgets for chargeback, rubric versioning per regulated surface, Linear ticketing into the SRE-style on-call rotation. |
The eval cost-optimization piece covers the financial side of the platform-team buy. The judge-calibration write-up covers the mechanics of the judge-of-judges loop the council owns. The golden-set design guide covers the rubric-set hygiene each product team is Accountable for.
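The agreement floor and kappa policy in the RACI matrix presuppose a way to measure agreement, whether between two annotators or between a judge model and human labels. A minimal sketch using Cohen's kappa in plain Python follows; the 0.6 floor is a placeholder, since the actual floor is the council's decision.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's label frequencies.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label]
                   for label in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def below_floor(kappa: float, floor: float = 0.6) -> bool:
    """True when agreement is under the policy floor (0.6 is a placeholder)."""
    return kappa < floor
```

Running the same function on judge labels versus a human-labelled golden set each quarter, and comparing against the previous quarter's value, is one simple way to detect the judge-calibration regression the matrix assigns to the platform team.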
The cultural shift
Teams that win at LLM eval treat the three-role split as load-bearing org design, not a labeling exercise. The platform team has a roadmap and headcount. Product team Eval Leads sit in the staffing plan alongside backend engineers, not as a side gig for a senior IC. The council meets every two weeks, has a charter, and the deploy-block authority shows up in the launch checklist.
Teams that lose treat eval as one team’s problem. Either the platform team owns everything and becomes the bottleneck, or product teams own everything and lose cross-cutting coherence, or the council exists on paper and gets overridden the first time a launch is in the way. The discipline shift is whose name is on the doc when a rubric is wrong, who’s paged when a cluster appears, and who has the authority to hold the line.
If you’re sitting on the failure modes from the opening paragraph today, the first move isn’t another Eval Engineer. It’s standing up the quality council with a charter and a deploy-block playbook. The platform team and the product Eval Leads will form around it, because the council is the thing that gives them somewhere to escalate. That order surprises most teams because the visible work looks like rubric code, but the underlying constraint is almost always organizational, not technical.
For the broader pattern of how eval fits the lifecycle, the 2026 LLM evaluation playbook covers the six-layer architecture. For the longitudinal view of how this split grows from 5 to 500 engineers, the eval-team scaling guide is the next read. The three-role view in this piece is the steady-state shape underneath both.