LLM Eval Golden Set Design: A 2026 Engineering Guide

Build a four-bucket golden set (production sample, adversarial, edge cases, failure replays) so a CI eval gate actually proves something about production.

May 16, 2026

12 min read

llm-evaluation golden-set ci-eval-gate annotation rag agent-evaluation 2026

A golden set is not “examples you labeled.” It is the fixture your CI eval gate runs against on every build, and the gate is only as honest as the fixture under it. The serious version has four buckets: a stratified sample of production traffic, an adversarial library, deliberately constructed edge cases, and replays of failures that already shipped. Anything less is a vibe check dressed up as a test suite. Drop a bucket and the green CI run becomes a lie of omission.

This guide is the working pattern for a set that earns its keep, sized for a team running a CI gate against a real product.

TL;DR: the four buckets

Bucket	Source	Share	What it proves
Production sample	Stratified trace export	60%	The system handles the traffic it actually gets
Adversarial	Jailbreak corpora, Scanners, red-team	15%	The system holds under attack
Edge cases	Hand-written by domain experts	15%	The system handles the long tail it has not seen yet
Failure replays	Error Feed clusters, past incidents	10%	The system does not regress on bugs you already paid for

If any bucket hits zero, the gate stops gating. A set that is 100 percent production is blind to attacks. A set that is 100 percent adversarial is blind to the user. A set with no replays lets the same bug ship again every quarter. The buckets are not optional.

Why “examples you label” is not a golden set

The pattern we see most in audits: a founding engineer hand-writes 80 cases in a Notion doc on day three, the team labels them, runs them in CI, calls it a golden set. Six months later the rubric is refined, the judge tuned, the thresholds calibrated, the gate green every build. Production is on fire. The eval tests a system that does not exist.

The break is structural. Hand-written cases are a sample of one engineer’s imagination. The production distribution, the attacks users try, the escalations the support team handles daily, the bug that woke the on-call last quarter: none of it is in the set. Each gap is silent, and the gate reads green because nothing in the set fails.

Stop thinking of the golden set as a list. Think of it as a four-bucket portfolio. Each bucket answers a different question; dropping any bucket drops the answer while still feeling like you have a test suite. The LLM evaluation playbook covers the surrounding stack; this is the dataset layer everything else reads from.

Bucket one: stratified production sample (60 percent)

The base layer. Pull five times your target size from production traces, then downsample to balance coverage. For a 300-case bucket-one target, pull 1,500 raw traces.

Stratify on three axes. Intent (refund, status, complaint, escalation, FAQ). Persona (new user, power user, enterprise admin, regulated-vertical user). Retrieval shape (full-context, partial, missing, contradictory). RAG systems without the retrieval-shape axis miss the most common cause of production failures, which is the context rather than the model. The RAG evaluation guide covers why this axis dominates corpus size.

Build the coverage matrix with one cell per combination of intent, persona, and retrieval shape. Every cell with non-trivial production volume gets at least 5 to 10 cases. Cells with zero volume get zero cases here; they live in bucket three.

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType

register(project_type=ProjectType.OBSERVE, project_name="support-agent-prod")

raw_traces = traceai_export(
    project="support-agent-prod",
    filters={"fi.span.kind": "LLM"},
    sample_size=1500,
    date_range="last_90d",
)
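# traceai_export (above) and stratified_sample (below) are not imported in this
# snippet; treat them as placeholders for your own export and sampling helpers.
# session.id and tag.tags are native stratification keys in traceAI; this call
# assumes intent, persona, and retrieval_shape were attached to each trace as tags.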
stratified = stratified_sample(
    raw_traces,
    keys=("intent", "persona", "retrieval_shape"),
    target_per_cell=10,
)
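
The snippet above leaves stratified_sample undefined. One way it could work is sketched below, assuming each trace is a plain dict that already carries intent, persona, and retrieval_shape fields. The function body, the seed parameter, and the toy traces are illustrative assumptions, not the SDK's implementation.

```python
import random
from collections import defaultdict


def stratified_sample(traces, keys, target_per_cell, seed=0):
    """Group traces by the stratification keys and keep at most
    target_per_cell traces from each cell (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    cells = defaultdict(list)
    for trace in traces:
        cells[tuple(trace[k] for k in keys)].append(trace)

    sample = []
    for cell in sorted(cells):
        members = cells[cell]
        take = min(target_per_cell, len(members))
        sample.extend(rng.sample(members, take))
    return sample


# Toy data: one high-volume cell (40 traces) and one sparse cell (3 traces).
traces = [
    {"intent": "refund", "persona": "new_user",
     "retrieval_shape": "full", "id": i}
    for i in range(40)
] + [
    {"intent": "escalation", "persona": "enterprise_admin",
     "retrieval_shape": "missing", "id": 100 + i}
    for i in range(3)
]

picked = stratified_sample(
    traces, ("intent", "persona", "retrieval_shape"), target_per_cell=10
)
print(len(picked))  # 10 from the dense cell plus all 3 from the sparse cell
```

The sparse cell comes back under target. That shortfall is the signal to hand the cell to bucket three rather than pad it with near-duplicates.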

Class balance matters here. If production is 95 percent allowed and 5 percent disallowed, do not mirror it: a judge that always predicts “allowed” scores 95 percent accuracy and zero recall on the class that matters. Build bucket one at roughly 40 to 60 percent per class for binary rubrics, then project dashboard metrics back to the real ratio. The data drift handling guide covers the projection math.
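
The arithmetic behind that warning is small enough to check by hand. The sketch below scores an always-"allowed" judge on a production-mirrored set, then shows one plausible projection back to production ratios: per-class accuracy weighted by production prevalence. The projection function and the 0.98 and 0.80 per-class figures are illustrative assumptions, not numbers from a real run.

```python
def accuracy(labels, preds):
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def recall(labels, preds, positive):
    # Predictions on the cases whose true label is the positive class.
    relevant = [p for l, p in zip(labels, preds) if l == positive]
    return sum(p == positive for p in relevant) / len(relevant)


# A set that mirrors production: 95 allowed, 5 disallowed.
labels = ["allowed"] * 95 + ["disallowed"] * 5
preds = ["allowed"] * 100  # a judge that always predicts the majority class

majority_accuracy = accuracy(labels, preds)
disallowed_recall = recall(labels, preds, "disallowed")
print(majority_accuracy, disallowed_recall)


def project_accuracy(per_class_accuracy, production_share):
    """Weight per-class accuracy (measured on the balanced set) by the
    share each class has in production. Hypothetical helper."""
    return sum(per_class_accuracy[c] * production_share[c]
               for c in production_share)


projected = project_accuracy(
    {"allowed": 0.98, "disallowed": 0.80},   # assumed balanced-set results
    {"allowed": 0.95, "disallowed": 0.05},   # production class ratio
)
print(round(projected, 3))
```

The balanced set exposes the 0.80 on the minority class; the projection gives the dashboard number stakeholders expect to see.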

Bucket two: adversarial coverage (15 percent)

Production traces underrepresent attacks that have not happened yet. Three sources.

Public corpora. HarmBench, AdvBench, OWASP LLM Top 10 example sets cover the canonical attack space. Pull these in as ground-truth-labeled cases (expected output is refusal, not the attack target). The OWASP LLM Top 10 guide covers the threat catalog.

Scanner harvest from production. The ai-evaluation SDK ships eight sub-10 ms Scanners (JailbreakScanner, CodeInjectionScanner, SecretsScanner, MaliciousURLScanner, InvisibleCharScanner, LanguageScanner, TopicRestrictionScanner, RegexScanner). Run them over the raw bucket-one pull. Treat any trace that trips a Scanner as a candidate attack: confirm it, then promote it to bucket two with the Scanner verdict as a label. This bucket grows for free as production grows.

Synthetic adversarial. AutoEvalPipeline.from_description ships seven prebuilt domain templates (customer support, RAG, code assistant, content moderation, agent workflow, healthcare, financial) for attack categories the corpora and Scanner harvest miss.

from fi.evals.scanners import (
    JailbreakScanner, CodeInjectionScanner, SecretsScanner,
    MaliciousURLScanner, InvisibleCharScanner,
)

scanners = [JailbreakScanner(), CodeInjectionScanner(), SecretsScanner(),
            MaliciousURLScanner(), InvisibleCharScanner()]
adversarial_seeds = [
    {"input": t["input"], "scanner": s.__class__.__name__,
     "label": v.reason, "source": "adversarial:scanner_harvest"}
    for t in raw_traces for s in scanners
    if (v := s.scan(t["input"])).flagged
]

Tag every case with source: "adversarial" and a sub-category (jailbreak, injection, malformed, poisoned_retrieval). The dashboard splits adversarial-robustness from production-distribution scores; mixing muddies both. The LLM jailbreak step-by-step guide and the prompt injection examples guide cover the categories.

Bucket three: edge cases (15 percent)

Edge cases are inputs production has not yet sent in volume but will. Regulated-vertical queries, ambiguous prompts where the right answer depends on context the model does not have, rare classes, multilingual inputs, malformed but well-intentioned requests. Bucket two is malicious; bucket three is legitimate-but-hard. The line is intent.

Two rules.

Domain experts write them, not engineers. A clinical edge case looks normal to an ML engineer and dangerous to a nurse. Pay clinicians, paralegals, financial advisors, support team leads to write 50 cases each in their domain. Pair every case with a worked explanation of why the right answer is the right answer. That explanation becomes annotator guidance.

Cover cells bucket one cannot fill. Bucket one stratifies by production volume; bucket three fills cells with zero or near-zero volume. If your (intent=escalation, persona=regulated_vertical_user, retrieval_shape=missing) cell has two production traces a quarter, bucket one cannot give you 10 cases there. Bucket three writes them.
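
Finding those cells can be mechanical. A minimal sketch, assuming cases are dicts with the three stratification fields and that you can enumerate the values on each axis; the function name, the axis values, and the minimum of 5 per cell are illustrative choices.

```python
from collections import Counter
from itertools import product


def coverage_gaps(cases, axes, min_per_cell):
    """Return {cell: missing_count} for every cell of the coverage matrix
    that holds fewer than min_per_cell cases (hypothetical helper)."""
    counts = Counter(tuple(c[a] for a in axes) for c in cases)
    gaps = {}
    for cell in product(*axes.values()):
        if counts[cell] < min_per_cell:
            gaps[cell] = min_per_cell - counts[cell]
    return gaps


axes = {
    "intent": ["refund", "escalation"],
    "persona": ["new_user", "regulated_vertical_user"],
    "retrieval_shape": ["full", "missing"],
}

# Bucket-one output: one well-covered cell, one cell with two traces.
cases = [
    {"intent": "refund", "persona": "new_user", "retrieval_shape": "full"}
] * 5 + [
    {"intent": "escalation", "persona": "regulated_vertical_user",
     "retrieval_shape": "missing"}
] * 2

gaps = coverage_gaps(cases, axes, min_per_cell=5)
for cell, missing in sorted(gaps.items()):
    print(cell, "needs", missing, "expert-written cases")
```

The output is the work order for the domain experts: each listed cell, with the number of cases still owed.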

The minimal record carries enough metadata to audit:

{
  "case_id": "gs_edge_2026_q2_0042",
  "input": "...",
  "expected_output": "...",
  "rubrics": {"faithfulness": 1, "helpfulness": 1, "safety": 1},
  "intent": "escalation",
  "persona": "regulated_vertical_user",
  "retrieval_shape": "missing",
  "source": "edge:expert_written",
  "expert_author": "nurse_practitioner_panel",
  "guidance_version": "v3.1",
  "set_version": "2026.04"
}

Bucket four: failure replays (10 percent)

Every production incident is a free golden case if you harvest it. Bucket four is the regression-test layer, and the one teams most often skip because the discipline is operational, not engineering. Three sources.

Error Feed clusters. Production traces flow through traceAI, failing rubrics cluster nightly via HDBSCAN, and a Sonnet 4.5 Judge writes an immediate_fix per cluster. Pull 5 to 10 representative traces from each cluster, label them, promote them into bucket four. The feedback loop design guide covers the upstream pipeline; this is where its output lands as a permanent fixture.

Incident post-mortems. Convert each incident description into 3 to 5 test cases that reproduce the failure and pin the expected behavior. The PR that fixes the incident references the bucket-four cases as the regression test.

Customer complaints that survive triage. Any support ticket that escalates to engineering is signal the eval missed something. Pull those traces into bucket four. The customer already told you the system failed; the bucket records it.

Bucket four has a property the other three do not: it never retires. A bucket-one case retires when its intent disappears from production. A bucket-four case stays until you can prove the failure class cannot happen again, which is almost never. Bucket four grows monotonically with the age of the product.

Sizing math: per-bucket, by route

The floor per route is 300 cases; below that, per-bucket breakdowns lose statistical power.

Route size	Bucket 1 (60%)	Bucket 2 (15%)	Bucket 3 (15%)	Bucket 4 (10%)	Total
Floor	180	45	45	30	300
Working	300	75	75	50	500
Mature	600	150	150	100	1000

Beyond 1,000 cases per route, judge cost makes sampling a stronger lever than raw size. At 60 rubrics and 1,000 cases, one CI run is 60,000 judge calls; at 5,000 it is 300,000, with no equivalent quality lift. Stratified sampling within each bucket per CI run preserves coverage at lower cost. The LLM eval cost optimization guide covers the math.
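
The table and the judge-call figures follow from two multiplications. A small sketch that reproduces them; the bucket names and helper functions are illustrative, and shares are held as integer percentages to keep the arithmetic exact.

```python
# Bucket shares from the TL;DR table, as integer percentages.
SHARES_PCT = {"production": 60, "adversarial": 15, "edge": 15, "replay": 10}


def bucket_sizes(total_cases):
    """Split a per-route total across the four buckets."""
    return {name: total_cases * pct // 100 for name, pct in SHARES_PCT.items()}


def judge_calls(cases, rubrics):
    """One judge call per (case, rubric) pair in a full CI run."""
    return cases * rubrics


for label, total in [("floor", 300), ("working", 500), ("mature", 1000)]:
    print(label, bucket_sizes(total))

print(judge_calls(1000, 60))  # 60000
print(judge_calls(5000, 60))  # 300000
```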

Size by route, not by application. Support and finance routes fail, get attacked, and escalate differently. One mature golden set per route is the right shape; one giant set for the whole app is what we untangle every audit.

Calibration discipline: multi-rater, Cohen’s kappa

Below a Cohen’s kappa of 0.7, the labels are noise. Two annotators on the same guidance and case should agree most of the time. If they do not, the rubric is fuzzy, the guidance is thin, the case is borderline, or all three. None of those produce a label your judge can learn from.

The cheap version.

Two annotators label the first 50 cases independently, no comparing notes.
Compute Cohen’s kappa per rubric. Faithfulness and helpfulness usually have very different agreement levels; aggregating hides the broken rubric.
If any rubric lands below 0.7, freeze annotation on it, rewrite guidance with worked examples on the cases the two annotators disagreed on, re-label the 50.
Repeat until every rubric clears 0.7, then scale to the full set.

from sklearn.metrics import cohen_kappa_score

for rubric in ["faithfulness", "helpfulness", "safety"]:
    a = [c[f"annotator_a_{rubric}"] for c in pilot_set]
    b = [c[f"annotator_b_{rubric}"] for c in pilot_set]
    kappa = cohen_kappa_score(a, b)
    if kappa < 0.7:
        print(f"{rubric}: kappa {kappa:.2f}, fix guidance before scaling")

A 300-case set with 0.8 kappa beats a 3,000-case set with 0.4 kappa on every downstream metric. The human vs LLM annotation guide covers when an LLM can serve as a cheap third opinion; the LLM annotation guide covers the workflow patterns.
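
Raw agreement can look healthy while kappa fails, because kappa subtracts the agreement two raters would reach by chance. The sketch below computes Cohen's kappa by hand for a 50-case pilot with made-up binary labels. It is a plain-Python illustration of the standard formula, not a replacement for the library call above.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa: (observed - expected) / (1 - expected), where expected
    is the chance agreement implied by each rater's label frequencies."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; this sketch reports
        # 1.0 here by choice, since the formula would divide by zero.
        return 1.0
    return (observed - expected) / (1 - expected)


# 50 pilot cases: the raters agree on 42 of them (84 percent).
annotator_a = [1] * 40 + [0] * 10
annotator_b = [1] * 36 + [0] * 4 + [0] * 6 + [1] * 4

raw_agreement = sum(x == y for x, y in zip(annotator_a, annotator_b)) / 50
kappa = cohen_kappa(annotator_a, annotator_b)
print(raw_agreement, round(kappa, 2))
```

With labels this skewed toward one class, chance agreement is already 0.68, so 84 percent raw agreement yields a kappa of 0.5, well under the 0.7 bar.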

Two more disciplines worth treating as release-blocking on every refresh.

Provenance. Every case carries who labeled it, when, what guidance version, what production trace it came from, what bucket and stratification cell. Treat each golden case as a database row, not a spreadsheet cell. Without provenance you cannot audit a label, roll back a bad batch, or tell why a score dropped between refreshes.

Outcome anchoring. Where a downstream signal exists (user thumbs, conversation resolution, refund reversal, deal close), anchor the label to it. Annotator opinion drifts; user behavior does not. Outcome-verified labels become the calibration anchor for the annotator-only labels.
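
Provenance is easy to enforce at refresh time with a check that refuses incomplete records. A minimal sketch; the field names (annotator, labeled_at, and so on) are assumptions for illustration, so substitute whatever schema your set actually uses.

```python
# Hypothetical required provenance fields; adjust to your own schema.
REQUIRED_FIELDS = (
    "case_id", "annotator", "labeled_at",
    "guidance_version", "source", "bucket", "set_version",
)


def missing_provenance(case):
    """Return the required fields that are absent or empty on a case."""
    return [f for f in REQUIRED_FIELDS if not case.get(f)]


complete = {
    "case_id": "gs_edge_2026_q2_0042",
    "annotator": "nurse_practitioner_panel",
    "labeled_at": "2026-04-02",
    "guidance_version": "v3.1",
    "source": "edge:expert_written",
    "bucket": 3,
    "set_version": "2026.04",
}
incomplete = {k: v for k, v in complete.items() if k != "annotator"}

print(missing_provenance(complete))    # []
print(missing_provenance(incomplete))  # ['annotator']
```

Run it over the whole set before bumping the version, and block the refresh if any case returns a non-empty list.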

Versioning and drift

Golden sets are not append-only. Cases retire, labels get corrected, new buckets land. You need diff-able versions so a score drop can be attributed cleanly to model regression, rubric change, or set change. Without them, every triage cycle becomes an argument about which thing moved.
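
A version diff does not need special tooling. The sketch below compares two set versions keyed by case_id and reports what was added, retired, or relabeled. It assumes each case carries a case_id and a rubrics dict, as in the record format shown earlier; the function name and sample cases are illustrative.

```python
def diff_sets(old, new):
    """Compare two golden-set versions by case_id (hypothetical helper)."""
    old_by_id = {c["case_id"]: c for c in old}
    new_by_id = {c["case_id"]: c for c in new}
    shared = old_by_id.keys() & new_by_id.keys()
    return {
        "added": sorted(new_by_id.keys() - old_by_id.keys()),
        "retired": sorted(old_by_id.keys() - new_by_id.keys()),
        "relabeled": sorted(
            cid for cid in shared
            if old_by_id[cid]["rubrics"] != new_by_id[cid]["rubrics"]
        ),
    }


v_old = [
    {"case_id": "gs_0001", "rubrics": {"faithfulness": 1}},
    {"case_id": "gs_0002", "rubrics": {"faithfulness": 1}},
    {"case_id": "gs_0003", "rubrics": {"faithfulness": 0}},
]
v_new = [
    {"case_id": "gs_0001", "rubrics": {"faithfulness": 1}},
    {"case_id": "gs_0003", "rubrics": {"faithfulness": 1}},  # label corrected
    {"case_id": "gs_0004", "rubrics": {"faithfulness": 1}},  # new case
]

changes = diff_sets(v_old, v_new)
print(changes)
```

Storing this diff alongside each version bump is what lets a later triage separate "the set changed" from "the model changed".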

Three cadences.

Monthly drift refresh. Sample new production patterns into bucket one. Pull new Error Feed clusters into bucket four. Retire bucket-one patterns that no longer fire. Re-run kappa on a 20-case sample. Bump to 2026.QN.M. The LLM drift guide covers what counts as drift worth promoting.

Quarterly adversarial sweep. Add new jailbreak vectors, injection variants, and red-team findings to bucket two. The AI red teaming guide covers the upstream cadence. Bump to 2026.QN.0 of the next quarter.

Annual re-baseline. Re-annotate the full set with refreshed guidance. Label drift compounds silently: a 2026 annotator reading 2025 guidance interprets edge cases differently than the 2025 annotator did. Re-baselining caps the drift. Bump major version.

Every CI run pins to a specific set version. Score deltas across builds are only interpretable when the version is held constant; mixing runs across versions and treating the deltas as model regressions is the most common false positive we see on dashboards. The prompt versioning guide covers the same discipline at the prompt layer.
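
The pinning rule can be enforced in the comparison itself, so a cross-version delta never reaches a dashboard. A minimal sketch, assuming each run is stored as a dict with a set_version and per-bucket scores; the structure and the numbers are illustrative.

```python
def score_delta(baseline_run, candidate_run):
    """Per-bucket score change between two runs, refusing to compare
    runs that were scored against different set versions."""
    if baseline_run["set_version"] != candidate_run["set_version"]:
        raise ValueError(
            "set_version mismatch: "
            f"{baseline_run['set_version']} vs {candidate_run['set_version']}"
        )
    return {
        bucket: round(candidate_run["scores"][bucket] - score, 4)
        for bucket, score in baseline_run["scores"].items()
    }


baseline = {"set_version": "2026.05",
            "scores": {1: 0.92, 2: 0.86, 3: 0.88, 4: 1.00}}
candidate = {"set_version": "2026.05",
             "scores": {1: 0.92, 2: 0.81, 3: 0.88, 4: 1.00}}

delta = score_delta(baseline, candidate)
print(delta)  # only bucket 2 moved
```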

Implementation in the ai-evaluation SDK

The shape that ships: one Evaluator.evaluate call runs the full set against every rubric, with bucket and source as filterable metadata.

from fi.evals import Evaluator
from fi.evals.templates import (
    Groundedness, ContextAdherence, ContextRelevance,
    Completeness, FactualAccuracy, AnswerRefusal,
)
from fi.testcases import TestCase

evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")
inputs = [
    TestCase(
        input=case["input"],
        output=case["expected_output"],
        context=case.get("retrieval_context"),
        metadata={
            "bucket": case["bucket"],   # 1, 2, 3, 4
            "source": case["source"],   # production / adversarial / edge / replay
            "set_version": "2026.05",
        },
    )
    for case in golden_set_v2026_05
]
result = evaluator.evaluate(
    eval_templates=[Groundedness(), ContextAdherence(), ContextRelevance(),
                    Completeness(), FactualAccuracy(), AnswerRefusal()],
    inputs=inputs,
)

# Per-bucket scores feed the CI gate; aggregates hide regressions
for bucket in [1, 2, 3, 4]:
    scores = [r for r in result if r.metadata["bucket"] == bucket]
    print(f"bucket {bucket}: {len(scores)} cases")

The CI gate reads per-bucket scores, not aggregates. A drop in bucket two with bucket one steady is an adversarial regression. A drop in bucket four is a re-introduced bug. An aggregated number hides both. Configure the gate to block on per-bucket thresholds, each calibrated against its own baseline.
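
To see why the aggregate hides a regression, the sketch below gates on per-bucket thresholds and compares the result with a share-weighted aggregate. All scores and thresholds are made-up numbers, and the gate function is a hypothetical helper rather than part of the SDK.

```python
BUCKET_SHARES = {1: 0.60, 2: 0.15, 3: 0.15, 4: 0.10}


def gate_failures(bucket_scores, thresholds):
    """Return {bucket: (score, threshold)} for every bucket below its bar."""
    return {
        b: (score, thresholds[b])
        for b, score in bucket_scores.items()
        if score < thresholds[b]
    }


scores = {1: 0.93, 2: 0.81, 3: 0.88, 4: 1.00}
thresholds = {1: 0.90, 2: 0.85, 3: 0.85, 4: 1.00}  # per-bucket baselines

aggregate = sum(scores[b] * BUCKET_SHARES[b] for b in scores)
failures = gate_failures(scores, thresholds)

print(round(aggregate, 4))  # clears a 0.90 aggregate bar
print(failures)             # bucket 2 is below its own threshold
```

In a CI job, a non-empty failures dict is the signal to exit non-zero; the aggregate alone would have passed this build.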

For multi-rubric orchestration at scale, the Future AGI Platform’s EvalTemplateManager composite surface (create_composite, submit_composite, execute_composite, upload_ground_truth) runs the same set against a composite of rubrics with one submission. The templates that read directly off a labeled set include Groundedness, ContextAdherence, ContextRelevance, Completeness, ChunkAttribution, ChunkUtilization, FactualAccuracy, Toxicity, PromptInjection, DataPrivacyCompliance, AnswerRefusal, TaskCompletion, and LLMFunctionCalling. The CI/CD LLM eval with GitHub Actions guide covers the gate wiring.

Anti-patterns

The failure modes we see most often.

Synthetic-only golden sets. Generated from one prompt, never anchored to production. The judge calibrates to the synthesis style, not to user behavior. Fix: at least 60 percent bucket-one production-sampled cases.

Single annotator. Kappa is undefined, and every threshold trusts a single opinion. Fix: at minimum, two annotators on the 50-case kappa pilot, even if one is an LLM judge.

Bucket conflation. Adversarial cases mixed into production-sample dashboards. The aggregate reads fine while attack-robustness quietly drops. Fix: per-bucket dashboards and per-bucket CI gates.

No failure replay bucket. Every production incident is a one-off, not a permanent regression test, and the same class of failure ships twice. Fix: bucket four. Every Error Feed cluster and incident post-mortem produces cases.

Never refreshed. Built at launch, frozen forever. By month six the production distribution has drifted, the attack landscape has changed, and the set scores a system that no longer exists. Fix: monthly, quarterly, and annual cadence.

No provenance. Cases live in a spreadsheet with no annotator, no timestamp, no source trace. When the score drops, no one can tell why. Fix: every case is a database row.

Class imbalance left raw. Production is 95-5; the set mirrors it; the judge scores 95 percent by predicting the majority class. Fix: rebalance inside the bucket, project to production ratios on the dashboard.

How Future AGI’s surfaces map to the workflow

Three surfaces cover the build, the run, and the loop.

ai-evaluation SDK (Apache 2.0) runs the labeled set against rubrics. One Evaluator.evaluate call takes the rubric list and the TestCase list. 60-plus templates cover RAG, safety, agent, and factuality rubrics. Eight sub-10 ms Scanners harvest bucket-two adversarial seeds. 13 guardrail backends (nine open-weight) score safety rubrics without a separate API. Four distributed runners (Celery, Ray, Temporal, Kubernetes) handle large sets.

traceAI (Apache 2.0) does the sampling for bucket one. 50-plus AI surfaces across Python, TypeScript, Java, and C# capture production traces with fi.span.kind, session.id, and tag.tags as native stratification keys. The stratified sample is a query, not an export-and-transform job.

Future AGI Platform orchestrates the run and feeds bucket four. The EvalTemplateManager composite surface runs multi-rubric evaluations from one submission. AutoEval generates the bucket-two synthetic share from seven prebuilt domain templates. Self-improving evaluators retune thresholds as the set grows. Error Feed clusters production failures via HDBSCAN, the Sonnet 4.5 Judge writes the immediate_fix, and Linear OAuth handles human-in-the-loop promotion into bucket four. The DeepEval and Confident AI alternatives guide covers how this surface compares.

Closing

A CI eval gate is only as honest as the golden set under it. Four buckets, sized by route, calibrated with kappa above 0.7, versioned with provenance, refreshed on a real cadence. The shape is not exotic; the discipline of building each bucket on purpose, scoring them separately, and treating bucket four as a monotonically growing regression suite is what separates the gate that catches the next failure from the gate that ships it.

The visible work is the rubric. The work that decides whether the rubric scores anything useful is the set under it. Build the four buckets and the rest of the eval stack inherits the quality.

Frequently asked questions

What is an LLM evaluation golden set?

A golden set is a stratified, labeled fixture that a CI eval gate runs against on every build. The serious version has four buckets: a representative sample of production traffic, a covered library of adversarial inputs, deliberately constructed edge cases, and replays of failures that already shipped to users. Anything less is a vibe check. If the golden set is missing a bucket, the CI gate is not actually proving anything about production behavior.

What are the four buckets of a CI-grade golden set?

Bucket one is stratified production traffic, sampled by intent, persona, and retrieval shape (60 percent of the set). Bucket two is adversarial coverage from jailbreak corpora, prompt-injection variants, and the eight Scanners in the ai-evaluation SDK (15 percent). Bucket three is hand-built edge cases written by domain experts to cover regulated-domain, ambiguous, and rare-class inputs (15 percent). Bucket four is production-failure replays harvested from Error Feed clusters and previous incidents (10 percent). Each bucket is versioned independently and scored on a separate dashboard panel.

How big should each bucket be?

Per route, the floor is 300 cases (180 production samples, 45 adversarial, 45 edge cases, 30 failure replays) and a working set is 500 (300, 75, 75, and 50). Beyond 1,000 cases per route, judge cost makes sampling a stronger lever than raw size. Size by route, not by application, because the support and the finance route fail differently and need separately calibrated bars.

How do I keep annotator agreement high?

Two annotators label the first 50 cases independently. Compute Cohen's kappa per rubric, not just overall. If kappa drops below 0.7, freeze annotation, rewrite the guidance with worked examples on the cases the two annotators disagreed on, then re-label. Only scale to the full set once kappa clears. Track kappa as a release-blocking metric on every refresh, the same way a code repo tracks test coverage.

How often should a golden set be refreshed?

Monthly, quarterly, and annually. Monthly: promote new production patterns and Error Feed clusters into the set, retire patterns that no longer fire. Quarterly: add new jailbreak vectors and prompt-injection variants from the latest red-team work. Annually: re-baseline the full set with refreshed guidance so label drift does not compound silently. Every refresh bumps a version and writes a diff against the prior set.

Should I use synthetic data for a golden set?

Only for adversarial coverage and rare classes, never as the only source. A synthetic-only set calibrates the judge to the synthesis prompt rather than to user behavior. The honest mix is roughly 60 percent production-sampled, 15 percent adversarial, 15 percent expert-written edge cases, 10 percent failure replays. AutoEval pipelines ship seven prebuilt domain templates that generate the synthetic share without losing distribution control.

How does Future AGI support golden set workflows?

Three surfaces. The ai-evaluation SDK runs any golden set against 60-plus EvalTemplate rubrics with one Evaluator.evaluate call and ships eight sub-10 ms Scanners that harvest adversarial seeds from production traces. traceAI samples stratified production traces using fi.span.kind, session.id, and tag.tags as native stratification keys. The Future AGI Platform's EvalTemplateManager runs multi-rubric golden set evaluations with create_composite, submit_composite, execute_composite, and upload_ground_truth; AutoEval generates the synthetic share; Error Feed clusters production failures so they promote cleanly into the replay bucket.
