
LLM Evaluation Best Practices Checklist for 2026

The 7-item LLM evaluation best practices checklist that actually ships: dataset, judge calibration, deterministic floor, CI gate, statistical math, production observability, closed loop. Paste it into a sprint.


Most “LLM evaluation best practices” lists are not checklists; they are taxonomies. A 20-item taxonomy of every metric family, every guardrail tier, every cohort, every dashboard widget. Engineers read it, nod, file it, and ship without any of it because nothing in a 20-item list is small enough to finish in a sprint.

This checklist is the seven items that survive the cut. Each is something you can paste into a planning doc, scope to a person, and ship in two weeks. Anything more is procrastination.

The 7 items

| # | Item | Failure mode it prevents |
| --- | --- | --- |
| 1 | Stratified golden dataset | Eval scores on data that does not look like production |
| 2 | Calibrated judge | Confidently wrong rubric, every downstream chart wrong by an unknown amount |
| 3 | Deterministic floor | Judge cost balloons, easy failures hit the expensive layer |
| 4 | CI gate that blocks merges | Regressions ship, eval is “advisory” forever |
| 5 | Statistical math on every delta | “Is this real?” debate every PR; engineers stop trusting the gate |
| 6 | Production observability | Drift invisible until users complain three days later |
| 7 | Closed loop | The same bug ships twice; rubrics decay quarter over quarter |

If you only build three this sprint: dataset, deterministic floor, CI gate. Without those, the rest cannot stand up. Items 5, 6, and 7 are what keep the program honest after the first month.

Why these seven, not twenty

The 20-item version of this checklist exists. We watched five teams print it, pin it, and ship the same broken eval program they would have shipped without it. Twenty items is what the checklist author writes when they want to be complete; seven is what they write when they want the team to finish.

The seven are load-bearing. Remove any one and the rest collapse into theater:

  • Without the dataset, every metric scores noise.
  • Without judge calibration, every score is wrong by an unknown amount.
  • Without the deterministic floor, judge cost makes the program too expensive to run on every trace.
  • Without the CI gate, regressions ship.
  • Without statistical math, the gate becomes a coin flip and engineers route around it.
  • Without production observability, drift between offline and online is invisible.
  • Without the closed loop, the dataset rots and next quarter starts from zero.

Everything else (refusal targets, latency budgets, canary cohorts, version pinning, four-dimension scoring) is a refinement on one of the seven. Ship the seven, then refine.

Item 1: stratified golden dataset

The dataset is the contract between the eval program and reality. A flat list of 200 hand-written examples scores fine on average and misses every cohort that matters in production.

What. 50 to 200 examples per route, stratified by intent (what the user is trying to do), persona (who they are), edge-case bucket (long input, tool failure, multi-hop), regulated-content flag (PII, financial advice, medical), and language. Pin the hardest 10 percent of historical failures as immutable. Sample 80 to 90 percent from recent production traffic. Refresh weekly.

Why. Average scores hide cohort regressions. The classic failure is “average groundedness is fine, but Spanish-speaking users got 11 points worse.” With stratification you see the per-cohort delta; without it you ship the regression and learn from churn.

How. Tag every example with its stratum at insertion time. Run the rubric per-stratum and per-aggregate. Track per-stratum coverage as the dataset grows — a stratum with five examples is not a stratum, it is a rumor.
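To make the per-stratum view concrete, here is a minimal sketch that reproduces the shape of the Spanish-cohort failure described above. The scores, the stratum labels, and the helper names (`stratum_means`, `stratum_deltas`, `overall_mean`) are all hypothetical; only the idea of scoring per stratum and per aggregate comes from the checklist.

```python
from collections import defaultdict
from statistics import fmean

def stratum_means(examples):
    """Mean score per stratum from (stratum, score) pairs."""
    buckets = defaultdict(list)
    for stratum, score in examples:
        buckets[stratum].append(score)
    return {stratum: fmean(scores) for stratum, scores in buckets.items()}

def stratum_deltas(baseline, candidate):
    """Candidate minus baseline, for every stratum present in both runs."""
    base, cand = stratum_means(baseline), stratum_means(candidate)
    return {s: cand[s] - base[s] for s in base if s in cand}

def overall_mean(examples):
    """Aggregate mean that ignores strata entirely."""
    return fmean(score for _, score in examples)

# Hypothetical groundedness scores on a 0-1 scale.
baseline = [("en", 0.80)] * 90 + [("es", 0.80)] * 10
candidate = [("en", 0.82)] * 90 + [("es", 0.69)] * 10

overall_delta = overall_mean(candidate) - overall_mean(baseline)
deltas = stratum_deltas(baseline, candidate)

print(f"overall: {overall_delta:+.3f}")
for stratum, delta in sorted(deltas.items()):
    print(f"{stratum}: {delta:+.3f}")
```

In this made-up run the aggregate moves up slightly while the Spanish stratum drops by about 11 points, which is the regression an average-only report would hide.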

Pitfall. Sampling uniformly from production traffic. Uniform sampling underweights edge cases by definition. Stratify first, sample within strata.
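"Stratify first, sample within strata" can be sketched in a few lines of standard-library Python. The trace fields (`intent`, `language`), the traffic mix, and the function names are assumptions made for illustration, not part of any SDK.

```python
import random
from collections import defaultdict

def stratified_sample(traces, key, per_stratum, seed=0):
    """Group traces by stratum, then sample up to `per_stratum` from each group."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for trace in traces:
        by_stratum[key(trace)].append(trace)
    sample = []
    for stratum in sorted(by_stratum):
        group = by_stratum[stratum]
        # Small strata contribute everything they have instead of being skipped.
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

def coverage(examples, key):
    """Count examples per stratum so thin strata are visible."""
    counts = defaultdict(int)
    for example in examples:
        counts[key(example)] += 1
    return dict(counts)

def stratum_of(trace):
    return (trace["intent"], trace["language"])

# Hypothetical traffic: 95 English FAQ traces, 5 Spanish refund traces.
traces = (
    [{"id": i, "intent": "faq", "language": "en"} for i in range(95)]
    + [{"id": 95 + i, "intent": "refund", "language": "es"} for i in range(5)]
)

golden = stratified_sample(traces, stratum_of, per_stratum=5)
print(coverage(golden, stratum_of))
```

A uniform draw of ten traces from the same traffic would usually contain zero or one Spanish refund example; the stratified draw keeps all five.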

Item 2: calibrated judge

A judge is an LLM (or classifier) that turns a rubric prompt plus a candidate response into a numeric score. Until you measure how often the judge agrees with a human on the same example, every score downstream of it is wrong by an unknown amount.

What. A human-labeled hold-out set of 50 to 100 examples per rubric, kept separate from the golden set. A measured agreement rate between judge and human (Cohen’s kappa, or simple percent-agreement at the score threshold). A recalibration cadence — at minimum quarterly, and whenever the judge model, the rubric prompt, or the underlying product behavior changes.

Why. A judge that drifts to 70 percent agreement is producing systematic mislabels at scale. Every dashboard, every CI gate, every drift alert downstream is then wrong in a direction the team cannot see. The most common eval bug we see in the wild is a high-variance rubric that nobody calibrated, and the team treats its scores as gospel.

How. Sample 50 disagreements per quarter, look at the failure shape, fix the rubric prompt or the threshold (rubric clarification fixes more bugs than swapping judge models), re-run agreement. Track inter-rater agreement between two humans on the same hold-out — that is the ceiling the judge cannot beat.

Pitfall. Picking the judge model that produces the highest absolute scores. The right judge is the one that agrees with humans, not the one that grades on a curve.
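Both agreement measures named above fit in a short standard-library sketch. The labels below are invented pass/fail decisions at the score threshold; the function names are illustrative.

```python
from collections import Counter

def percent_agreement(human, judge):
    """Fraction of examples where judge and human gave the same label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human, judge):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = percent_agreement(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        # Degenerate case: both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical hold-out labels: 1 = pass, 0 = fail.
human_labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
judge_labels = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

print(f"percent agreement: {percent_agreement(human_labels, judge_labels):.2f}")
print(f"cohen's kappa:     {cohens_kappa(human_labels, judge_labels):.2f}")
```

Kappa comes out lower than raw agreement here because half of the agreement on a balanced binary label is expected by chance. Running the same functions on two human annotators gives the ceiling mentioned in the How step.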

Item 3: deterministic floor

Most failures do not need a model to detect. Schema violations, tool-call shape errors, jailbreak patterns, secrets leakage, invisible-character attacks, regex-defined per-tenant patterns — all of these have deterministic checks that run in single-digit milliseconds.

What. A layer of sub-10 ms checks that runs on every request before any classifier or judge. Future AGI’s ai-evaluation SDK ships eight Scanners: JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner. Wire them as a Guardrails rail at the gateway.

Why. The deterministic floor cuts judge calls by an order of magnitude on the common path. Without it, every easy fail (a missing JSON field, a leaked API key) pays full judge cost and full judge latency. The cascade is the difference between an eval program you can afford to run on every production trace and one you can only afford to sample at 1 percent.

How.

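from fi.evals import Evaluator, Guardrails
# AggregationStrategy is used below but was not imported in the original snippet;
# this import path is an assumption, so check it against your SDK version.
from fi.evals import AggregationStrategy
from fi.evals.scanners import (
    JailbreakScanner, SecretsScanner, RegexScanner,
)

# Deterministic rails that run before any classifier or judge.
guardrails = Guardrails(
    rails=[
        JailbreakScanner(),
        SecretsScanner(),
        RegexScanner(pattern=r"\b\d{16}\b"),  # naive credit-card sniff
    ],
    aggregation=AggregationStrategy.ANY,
)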

Then turn on the cascade in the evaluator so the classifier and the judge only see the cases the floor could not resolve:

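# Assumes `evaluator` is an Evaluator instance and that Groundedness, Toxicity,
# and TestCase are imported from the SDK (the original snippet omits both).
result = evaluator.evaluate(
    eval_templates=[Groundedness(), Toxicity()],
    inputs=[TestCase(input=..., output=..., context=...)],
    augment=True,  # cascade: heuristic, then classifier, then judge
)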

Pitfall. Skipping the floor because “we want the judge to see everything.” Judges hallucinate on trivial inputs. Save the judge for the cases that need semantic scoring.
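Independent of any SDK, the cascade idea can be sketched with the standard library alone. Everything here is hypothetical: the response schema, the secret pattern, the 0.75 pass threshold, and the `judge` callable stand in for whatever your stack uses, and none of it mirrors the Scanner implementations named above.

```python
import json
import re

SECRET_PATTERN = re.compile(r"\b(?:sk|pk)-[A-Za-z0-9]{16,}\b")  # hypothetical key shape
INVISIBLE_CHARS = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
REQUIRED_FIELDS = ("answer", "sources")  # hypothetical response schema

def floor_failures(raw_output):
    """Run cheap deterministic checks; return the names of the ones that failed."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(payload, dict):
        return ["not_an_object"]
    failures = [f"missing_field:{f}" for f in REQUIRED_FIELDS if f not in payload]
    # Re-serialize without ASCII escaping so escaped characters are scanned too.
    text = json.dumps(payload, ensure_ascii=False)
    if SECRET_PATTERN.search(text):
        failures.append("secret_leak")
    if INVISIBLE_CHARS.search(text):
        failures.append("invisible_char")
    return failures

def evaluate(raw_output, judge, threshold=0.75):
    """Call the expensive judge only when the floor finds nothing wrong."""
    failures = floor_failures(raw_output)
    if failures:
        return {"passed": False, "layer": "floor", "failures": failures}
    score = judge(raw_output)
    return {"passed": score >= threshold, "layer": "judge", "score": score}

judge_calls = []

def fake_judge(text):
    """Stand-in for an LLM judge; records how often it is invoked."""
    judge_calls.append(text)
    return 0.9

good = json.dumps({"answer": "42", "sources": ["doc-1"]})
leaky = json.dumps({"answer": "key is sk-ABCDEFGHIJKLMNOP1234", "sources": []})

results = [evaluate(out, fake_judge) for out in (good, leaky, "not json")]
for result in results:
    print(result)
print("judge calls:", len(judge_calls))
```

Of the three outputs, only the clean one reaches the judge; the other two are resolved by the floor at regex and JSON-parse cost.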

Item 4: CI gate that blocks merges

An eval program that does not block a merge is advisory. Advisory programs get ignored under deadline pressure, which is when they were supposed to catch regressions.

What. Run the stratified golden set against the candidate branch on every PR. Block the merge if any rubric drops more than 5 percent against the trunk baseline, or if any rubric falls below an agreed absolute floor (0.75 for faithfulness, 0.85 for task completion are reasonable starting points). Surface the per-rubric, per-stratum delta in the PR check.

Why. Without a hard gate, eval is decoration. With one, every prompt change gets argued against the data instead of against opinions. The PR-blocking gate is the single intervention that changes team behavior the most; everything else in this checklist is downstream of it.

How. Two run sizes. A smoke set of 20 to 40 examples runs on every commit in seconds. The full regression set runs on labeled or trunk-bound PRs in minutes. Compare against a trailing seven-day rolling baseline, not a frozen number — models drift, the baseline drifts with them, the gate catches regressions relative to the moving truth. Pair the gate with shadow routing in production for prompt or model changes that need real-traffic validation past CI.

Pitfall. A flaky gate. If the gate fails on noise, engineers learn to retry until it passes. The fix is item 5, not a lower threshold.
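The blocking rule from the What step can be written as one small function. This sketch assumes "drops more than 5 percent" means a relative drop against the baseline mean; if your team means absolute points, change the comparison. The rubric names, scores, and the `gate` helper are illustrative.

```python
def gate(baseline, candidate, floors, max_relative_drop=0.05):
    """Return the reasons to block the merge; an empty list means the gate passes."""
    reasons = []
    for rubric, base in baseline.items():
        cand = candidate[rubric]
        # Relative regression against the trunk baseline.
        if base > 0 and (base - cand) / base > max_relative_drop:
            reasons.append(
                f"{rubric}: dropped {(base - cand) / base:.1%} vs baseline {base:.2f}"
            )
        # Absolute floor, checked independently of the baseline.
        floor = floors.get(rubric)
        if floor is not None and cand < floor:
            reasons.append(f"{rubric}: {cand:.2f} is below the floor {floor:.2f}")
    return reasons

# Hypothetical rubric means on a 0-1 scale.
baseline = {"faithfulness": 0.82, "task_completion": 0.90}
candidate = {"faithfulness": 0.80, "task_completion": 0.84}
floors = {"faithfulness": 0.75, "task_completion": 0.85}

reasons = gate(baseline, candidate, floors)
for reason in reasons:
    print("BLOCK:", reason)
```

In CI the script would exit non-zero when `reasons` is non-empty. Running the same function per stratum gives the per-stratum delta for the PR check.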

Item 5: statistical math on every delta

Most CI-gate arguments are statistical arguments in disguise. “Is the 2.1-point drop real?” “Is the per-stratum noise on the Spanish slice too high to score?” Without three specific numbers, every conversation regresses to vibes.

What. Three numbers, computed once per regression set and refreshed weekly.

  1. Noise band. Run the rubric against trunk five times back-to-back. Take the standard deviation of the rubric mean. Multiply by two. Any delta inside that band is noise, full stop.
  2. Minimum stratum size. The smallest sample size per stratum that produces a stable rubric mean across reruns. Typically 30 to 50 for a binary rubric, 50 to 100 for a 1–5 scale. Strata smaller than this do not get gated.
  3. Bootstrap confidence interval. Resample the regression set with replacement 1,000 times, compute the rubric mean each time, take the 2.5th and 97.5th percentiles. That is the 95 percent CI on the rubric mean. Surface it in the PR check next to the point estimate.

Why. Engineers stop trusting a gate that fails on noise. With the three numbers, the conversation becomes “the drop is 2.6 points, the noise band is 1.4 points, the 95 percent CI does not overlap baseline — it is real.” That conversation closes in 30 seconds.

How. Add the three computations to the eval runner. The ai-evaluation SDK returns per-example scores; the bootstrap is twenty lines of NumPy, the noise band is a five-run wrapper, the minimum stratum size is a one-time power calculation per rubric.

Pitfall. Reporting only the point estimate. A 2-point drop without a CI is a Rorschach test.
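The noise band and the bootstrap interval can be sketched as follows. This version uses only the standard library so it runs without dependencies (the NumPy version mentioned above would be shorter); it assumes the sample standard deviation and nearest-rank percentiles, and all scores are made up. The minimum-stratum-size calculation is not covered here.

```python
import random
import statistics

def noise_band(run_means):
    """Two sample standard deviations of the rubric mean across repeated trunk runs."""
    return 2 * statistics.stdev(run_means)

def bootstrap_ci(scores, n_resamples=1000, seed=0):
    """95 percent percentile-bootstrap interval on the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Nearest-rank 2.5th and 97.5th percentiles, using integer arithmetic.
    low = means[n_resamples * 25 // 1000]
    high = means[n_resamples * 975 // 1000 - 1]
    return low, high

def outside_noise(delta, band):
    """A delta only counts as a candidate regression when it exceeds the band."""
    return abs(delta) > band

# Hypothetical rubric means from five back-to-back runs against trunk.
trunk_run_means = [0.812, 0.805, 0.818, 0.809, 0.811]
band = noise_band(trunk_run_means)

# Hypothetical per-example binary scores from one candidate run.
candidate_scores = [1] * 80 + [0] * 20
low, high = bootstrap_ci(candidate_scores)

print(f"noise band: +/-{band:.3f}")
print(f"candidate mean: {statistics.fmean(candidate_scores):.3f}")
print(f"95% CI: [{low:.3f}, {high:.3f}]")
print("0.026 drop outside noise:", outside_noise(-0.026, band))
```

The fixed seed keeps the interval reproducible between CI reruns, so the check itself does not become a new source of flakiness.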

Item 6: production observability

The CI gate catches regressions; production observability catches drift. They run the same rubrics, in two places.

What. Sample production traffic (uniformly or by failure signal), score with the rubrics used in CI, attach the score to the OpenTelemetry span as an attribute. Alert on rolling-mean degradation per route, per prompt version, per cohort. A 2 to 5 point sustained drop over 15 to 60 minutes is the right detection threshold for most products.

Why. Offline pass is a necessary, not sufficient, condition. Real users find what the test author did not think of, and the only way to see it is to score the live trace stream. The lag from incident to detection without production observability is days; with it, minutes.

How. traceAI (Apache 2.0) ships 50+ AI surfaces across Python, TypeScript, Java, and C#. Instrument the whole tree:

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType

register(
    project_type=ProjectType.OBSERVE,
    project_name="prod",
)

Span-attached eval scores via EvalTag give you zero-added-latency rubric attributes on the trace tree. Drift alerts run on the rolling mean per route, per rubric, per prompt version. See LLM tracing best practices and agent passes evals fails production for the surrounding pattern.

Pitfall. Running eval at 100 percent inline because “we want to be safe.” That is a 5x cost multiplier and a measurable latency hit. Cascade, sample, and tier by criticality.
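A rolling-mean drift alert of the kind described in the What step might look like the sketch below. It is standard-library only and does not use traceAI or OpenTelemetry; the class name, the 0-1 score scale (so a 3-point drop is 0.03), the 15-minute window, and the minimum sample count are all assumptions.

```python
from collections import deque

class DriftMonitor:
    """Alert when the rolling mean over a time window falls below baseline."""

    def __init__(self, baseline, window_seconds=900, drop_threshold=0.03, min_samples=30):
        self.baseline = baseline
        self.window_seconds = window_seconds
        self.drop_threshold = drop_threshold
        self.min_samples = min_samples
        self._window = deque()  # (timestamp, score) pairs, oldest first

    def observe(self, timestamp, score):
        """Record one sampled score; return True when the drop is large enough."""
        self._window.append((timestamp, score))
        # Evict samples that have aged out of the window.
        while self._window[0][0] <= timestamp - self.window_seconds:
            self._window.popleft()
        if len(self._window) < self.min_samples:
            return False  # too few samples to trust the mean
        rolling_mean = sum(s for _, s in self._window) / len(self._window)
        return self.baseline - rolling_mean >= self.drop_threshold

# Hypothetical stream: one scored trace every 10 seconds on a single route.
monitor = DriftMonitor(baseline=0.85)
healthy_alerts = [monitor.observe(t * 10, 0.85) for t in range(60)]
degraded_alerts = [monitor.observe((60 + t) * 10, 0.70) for t in range(60)]

print("alerts while healthy:", any(healthy_alerts))
print("first degraded sample alerts:", degraded_alerts[0])
print("alert after sustained drop:", degraded_alerts[-1])
```

One monitor per route, rubric, and prompt version matches the alerting dimensions listed above. The minimum sample count keeps a quiet route from paging on a single bad trace.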

Item 7: closed loop

Without a loop, every incident produces a one-off fix and the team writes the same regression twice. The loop is what compounds.

What. A pipeline that takes failing production traces, clusters them into named issues, scores each cluster, and promotes representative examples into the next regression-set refresh. Rubrics retune against the new examples; the optimizer searches for prompts that beat the failing cluster on the golden set; the new prompt ships through the same CI gate.

Why. A dataset that does not refresh from production drifts past usefulness in one or two quarters. A rubric nobody retunes drifts out of calibration. The loop keeps both alive.

How. Three pieces:

  • Error Feed. HDBSCAN soft-clustering over the trace stream in ClickHouse (v2 production; noise points with prob >= 0.4 get reassigned to the highest-probability cluster). A Sonnet 4.5 Judge on Bedrock (30-turn budget, 8 span-tools, Haiku Chauffeur for span summarization, ~90 percent prompt-cache hit) writes the RCA, evidence quotes, an immediate_fix, and a four-dimension score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution).
  • Promote-to-dataset. From each cluster the on-call engineer promotes representative traces into the offline eval set. The next PR that touches the offending path has to clear them.
  • Optimizer. Failing rubrics on the regression set become targets for agent-opt. Six optimizers ship (RandomSearchOptimizer, BayesianSearchOptimizer, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer); pair any with EarlyStoppingConfig so the run stops when the rubric plateaus.

# Assumes Groundedness, regression_set, and current_prompt are defined elsewhere
# (the original snippet omits them).
from agent_opt import BayesianSearchOptimizer, EarlyStoppingConfig

optimizer = BayesianSearchOptimizer(
    eval_template=Groundedness(),
    dataset=regression_set,
    early_stopping=EarlyStoppingConfig(patience=5, min_delta=0.01),
)
best_prompt = optimizer.optimize(seed_prompt=current_prompt, n_trials=50)

Pitfall. Treating “trace to dataset to optimization to deploy” as turnkey. It is not, in any vendor, today. The trace-stream-to-agent-opt-dataset connector is on the Future AGI roadmap; today, the dataset curation step is human-in-the-loop. Build the curation discipline first; the optimizer is the part you bolt on.
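The noise-point rule mentioned in the Error Feed description (noise points with probability at or above 0.4 move to their highest-probability cluster) can be illustrated on its own. This is one reading of that sentence, not the Error Feed implementation: the function name is hypothetical, and the labels and membership vectors are invented inputs of the kind a soft-clustering step would produce.

```python
NOISE = -1

def reassign_noise(labels, membership, min_prob=0.4):
    """Move noise points to their most probable cluster when confident enough.

    labels: hard cluster label per point, with -1 meaning noise.
    membership: per-point list of soft-membership probabilities, one per cluster.
    """
    reassigned = []
    for label, probs in zip(labels, membership):
        if label == NOISE and probs:
            best = max(range(len(probs)), key=probs.__getitem__)
            if probs[best] >= min_prob:
                label = best
        reassigned.append(label)
    return reassigned

# Hypothetical output of a soft-clustering step over four failing traces.
labels = [0, NOISE, NOISE, 1]
membership = [
    [0.90, 0.05],
    [0.10, 0.55],  # noise, but confident enough for cluster 1
    [0.20, 0.25],  # noise, stays noise
    [0.10, 0.80],
]

print(reassign_noise(labels, membership))
```

Points that stay noise are the ones worth a human look during curation, since they match no named issue yet.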

The 90-day rollout

Seven items, ninety days. The shape:

| Week | Item | Concrete deliverable |
| --- | --- | --- |
| 1–2 | Item 1 (dataset) | 50–200 stratified examples per route, weekly refresh job |
| 3 | Item 2 (judge calibration) | Hold-out set + agreement rate dashboard |
| 4 | Item 3 (deterministic floor) | Scanners wired at the gateway, cascade on |
| 5–6 | Item 4 (CI gate) | Regression set blocks merges; per-stratum delta in PR check |
| 7 | Item 5 (statistical math) | Noise band, min stratum size, bootstrap CI surfaced in PR check |
| 8–10 | Item 6 (production observability) | traceAI in prod, span-attached scores, drift alerts |
| 11–13 | Item 7 (closed loop) | Error Feed live, promote-to-dataset cadence, first optimizer run |

If a team finishes weeks 1 to 7 (dataset → CI gate → statistical math) and gets nothing else done in the quarter, the program already beats most production eval setups. Weeks 8 to 13 are what compound the wins.

How Future AGI wires the seven items

Each item maps to a surface. The practice is portable; the surfaces save the months.

  • Dataset (item 1). Hosted, versioned, stratified datasets on the Future AGI Platform. Weekly refresh via Error Feed promotions plus manual annotation.
  • Calibrated judge (item 2). The Platform’s in-product authoring agent runs the rubric against a labeled hold-out on every edit; self-improving evaluators recalibrate from thumbs up/down feedback. The ai-evaluation SDK pins EvalTemplate versions so historical scores stay comparable.
  • Deterministic floor (item 3). Eight sub-10 ms Scanners and 13 guardrail backends (9 open-weight: LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B; 4 API: OPENAI_MODERATION, AZURE_CONTENT_SAFETY, TURING_FLASH, TURING_SAFETY). augment=True cascades heuristic → classifier → judge automatically.
  • CI gate (item 4). ai-evaluation plugs into pytest, GitHub Actions, GitLab CI. Four distributed runners (Celery, Ray, Temporal, Kubernetes) for the full-set runs.
  • Statistical math (item 5). Per-example scores ship out of the SDK; the bootstrap CI, noise band, and stratum-size calculations live in the eval runner. The Platform surfaces the CI in the PR check.
  • Production observability (item 6). traceAI (Apache 2.0) ships 50+ AI surfaces across Python, TypeScript, Java, and C#; 14 span kinds; 62 built-in evals via EvalTag. The Agent Command Center returns x-agentcc-cost, x-agentcc-latency-ms, x-agentcc-model-used, x-agentcc-fallback-used, x-agentcc-routing-strategy, and x-agentcc-guardrail-triggered as response headers.
  • Closed loop (item 7). Error Feed (HDBSCAN + Sonnet 4.5 Judge) clusters and scores; promote-to-dataset is one action on the cluster page; agent-opt ships six optimizers with EarlyStoppingConfig. The trace-to-dataset connector is the active roadmap item; curation is human-in-the-loop today.

The Agent Command Center hosts the runtime: a 17 MB Go binary with six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure) among 20+ providers in total, six routing strategies, and shadow/mirror/race modes for the canary work that sits next to the CI gate. It also covers RBAC, SOC 2 Type II, HIPAA, GDPR, CCPA, AWS Marketplace, and multi-region deployment.

What to skip in the first 90 days

The list of things to defer is longer than the list of things to ship:

  • Refusal-rate targets per route — refinement on item 2.
  • Per-metric latency budgets — refinement on item 3.
  • Four-dimension scoring — refinement on item 6.
  • Canary cohorts and shadow routing — wait until item 4 is shipping.
  • Self-improving evaluators — wait until item 7 is shipping.
  • Rubric version pinning, eval-the-evaluator agreement charts, per-tenant cost dashboards — all real, all refinements.

Use the 2026 LLM evaluation playbook and agent observability vs evaluation vs benchmarking as the refinement catalog once the seven are live. Do not use them as the starting checklist; that path ends at week 12 with five items half-built and a team that has lost faith in the program.

Ship seven. Refine after.

Frequently asked questions

Why a 7-item LLM evaluation checklist instead of 20?
Because the 20-item version of this checklist is the reason most eval programs never ship. Twenty items is a planning artifact; seven is something an engineer can paste into a sprint and finish. The seven that survive the cut are the ones whose absence will kill the program: a stratified golden set, a calibrated judge, a deterministic floor, a CI gate, real statistical math on the deltas, production observability, and a closed loop from failure back into the dataset. Everything else (refusal targets, version pins, latency budgets, canaries) is a refinement on one of those seven.
How big should the eval golden set be?
50 to 200 examples per route is the working range for CI gating. Smaller sets let noise dominate the signal and produce flaky merges. Larger sets push CI wall time past the budget engineers tolerate, and judge cost stops scaling. Pin the hardest 10 percent of historical failures as immutable, sample 80 to 90 percent from recent production traffic, refresh weekly. Beyond about 500 examples per route, sampling strategy matters more than raw count.
What is judge calibration and how often should I do it?
Calibration is measuring how often the LLM judge agrees with a human label on the same example. Sample 50 to 100 traces per quarter, label by hand, compare to the judge. Track judge-versus-human agreement as its own metric. A judge that drifts below 80 percent agreement is silently producing systematic mislabels at scale, and every downstream chart is wrong by an unknown amount. Recalibrate whenever the rubric, the judge model, or the underlying product behavior changes.
What is the deterministic floor in an eval program?
The deterministic floor is the layer of cheap, sub-10 ms checks that catch failures that do not need a model to detect: schema violations, tool-call shape errors, jailbreak patterns, secrets leakage, invisible-character attacks, regex-defined per-tenant patterns. The Future AGI ai-evaluation SDK ships eight Scanners (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner) that run before any classifier or judge. Running the floor first cuts judge cost by an order of magnitude on the common path.
How do I gate a pull request on LLM eval results?
Run the regression set on every PR and block the merge if any rubric drops more than 5 percent against the trunk baseline, or if any rubric falls below an agreed absolute floor (0.75 for faithfulness, 0.85 for task completion are reasonable starting points). Surface the per-rubric delta in the PR check. Run a small smoke set on every commit in seconds, the full regression set only on trunk-bound PRs. Pair the gate with shadow routing in production for changes that pass CI but need real-traffic validation.
What statistical math do I actually need for LLM eval?
Three numbers. The noise band on the regression set (run the same eval against trunk five times, take the standard deviation of the rubric mean, multiply by two; that is the band below which a delta is noise). The minimum sample size per stratum that produces a non-flaky CI signal (typically 30 to 50 for a binary rubric). The bootstrap confidence interval on the rubric mean (so a “two-point drop” has a defensible error bar). Without these three, every regression conversation devolves into “is this real?” and engineers learn to ignore the gate.
What closes the loop between production failures and the eval suite?
A pipeline that auto-clusters failing traces, names them, scores them, and promotes representative examples into the next week's regression set. The Future AGI Error Feed runs HDBSCAN soft-clustering over the trace stream in ClickHouse; a Sonnet 4.5 Judge writes an immediate_fix candidate per cluster and a four-dimensional score (factual_grounding, privacy_and_safety, instruction_adherence, optimal_plan_execution). Those fixes feed the platform's self-improving evaluators so the rubric stays calibrated as the product changes. The trace-stream-to-agent-opt-dataset connector is on the roadmap; the dataset curation step is where the engineering still lives.