What Is a Golden Dataset?

A reviewed, versioned eval dataset containing representative inputs, expected outputs, labels, and rubrics used to score LLM or agent behavior.

A golden dataset is a reviewed, versioned set of representative inputs with trusted expected outputs, labels, rubrics, or reference context used to evaluate LLM and agent behavior. It is an LLM-evaluation asset, not raw training data: teams run it through eval pipelines, regression-eval suites, and sampled production-trace checks to detect whether a prompt, model, retriever, or tool change broke known cases. In FutureAGI, golden datasets map to the fi.datasets.Dataset surface for repeatable scoring. In 2026 production stacks, the golden dataset is the single most important reliability artifact: public benchmarks shortlist, golden datasets decide.

Why golden datasets matter in production LLM and agent systems

Without a golden dataset, every release argues from anecdotes. A retriever update can cause silent hallucinations because the model still sounds confident while citing the wrong chunk. A classifier prompt can create label drift where refund, billing, and cancellation intents blur together. A tool-calling agent can choose the wrong function for edge cases that passed last week. None of these failures necessarily shows up in latency, token count, or uptime; they show up in customer-impact metrics weeks later, after the regression has been baked into a fine-tuned model or memorized as “expected behavior.”

The pain lands on different teams. Developers lose a stable regression signal and debug from scattered traces. Product managers cannot tell whether a new prompt improved the core workflow or only the demo path. SREs see spikes in escalation rate or eval-fail-rate-by-cohort but lack row-level evidence. Compliance teams cannot prove that reviewed safety and policy cases were rechecked before deploy. In regulated verticals (healthcare, finance, legal), the golden dataset is the audit trail; without it, you cannot answer the basic question “how do you know your assistant still refuses cases X, Y, Z?”

Agentic systems make the need sharper. A single 2026 request may include planning, retrieval, tool selection, schema validation, and final answer generation. One missing row in the golden set means the release gate can miss a multi-step failure that compounds across the trajectory. The dataset is the contract: these cases must keep working, with the same references, the same rubric, and the same threshold history.

Golden dataset vs benchmark vs production traces

These three keep getting conflated, and synthetic eval sets add a fourth. They are different artifacts with different jobs.

| Asset | Built by | Updated when | Job |
| --- | --- | --- | --- |
| Public benchmark (e.g., GAIA: 466 questions across 3 levels; SWE-Bench Verified: 500 real GitHub issues; τ-bench: multi-turn customer support from Sierra) | Research community | Rarely, often frozen | Tier-filter models across the field |
| Golden dataset | Your team + domain reviewers | Continuously, from new failures | Protect product behavior, gate releases |
| Sampled production traces | Auto-sampled from live traffic | Every hour | Detect drift, surface new failure modes |
| Synthetic eval set | LLM-generated, optionally with simulate-sdk | On demand | Cover gaps the other three miss |

A complete 2026 evaluation stack uses all four. The golden dataset is the middle layer: smaller than the trace stream, larger than the public benchmark, and the one that actually gates production releases.

How FutureAGI handles golden datasets

FutureAGI’s approach is to treat the golden dataset as a versioned reliability artifact, not a spreadsheet beside the eval code. The specific FAGI anchor is sdk:Dataset, exposed as fi.datasets.Dataset. Engineers create or import rows, add columns such as input, expected_response, context, rubric, cohort, source_trace_id, reviewer_status, and reviewer_agreement, then attach evaluators through Dataset.add_evaluation. The resulting scores stay tied to the dataset version that produced them.

A real workflow: a support agent team keeps a 2,400-row golden dataset with human-reviewed examples from refunds, account deletion, charge disputes, and policy refusal cases. For canonical labels, they run GroundTruthMatch. For RAG answers, they run Groundedness against the stored context. For agent outcomes, they run TaskCompletion and slice by cohort and dataset_version. For agent paths, they run TrajectoryScore. A prompt change that lifts overall pass rate from 0.91 to 0.93 but drops the account-deletion cohort to 0.82 is blocked, because the row-level report shows exactly which policy cases failed and which evaluator fired.

Production traces feed the loop. A traceAI-langchain (or traceAI-openai-agents, or traceAI-strands, or traceAI-google-adk) integration preserves the user input, model output, retrieved context, agent.trajectory.step, and gen_ai.request.model. Failed traces are promoted only after human review through annotation queues exposed as fi.queues.AnnotationQueue, with a minimum of two reviewers and reviewer_agreement stored on the row. Unlike Ragas-style reference-free checks, a golden dataset gives the team an explicit row-level contract for product behavior. Unlike LangSmith datasets, which focus on framework-level inputs and outputs, FutureAGI’s Dataset is multi-modal-aware, carries cohort metadata as first-class, and integrates directly with production trace promotion.

In our 2026 evals, the strongest signal comes from mixing three sources: curated edge cases (the seed set), production failures (the growth set), and synthetic scenarios generated by simulate-sdk ScenarioGenerator that target gaps found in the dashboard. We’ve found that teams who skip the synthetic third source consistently miss “obvious” failure modes: the ones users haven’t hit yet but will the moment a new vertical onboards.

Wiring golden datasets into release gates

Inside FutureAGI, a regression eval run becomes a release gate via three components: a baseline (the last shipped model’s scores on the same rows), per-evaluator delta thresholds (e.g., Groundedness may not drop more than 2 points; TaskCompletion may not drop at all on safety-critical cohorts), and a cohort filter (refund, billing, legal, healthcare). The CI job runs the eval suite against the candidate, posts evaluator scores back to the Dataset, and either passes the build or blocks the deploy with a diff link.
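The three gate components above (baseline, per-evaluator delta thresholds, cohort filter) can be sketched in plain Python. This is an illustration only, not the FutureAGI gate: `GateRule` and `check_gate` are hypothetical names, scores are assumed to be means on a 0-1 scale, and the "2 points" threshold is assumed to mean 0.02 on that scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateRule:
    evaluator: str   # evaluator name, e.g. "Groundedness"
    cohort: str      # cohort the rule applies to, e.g. "refund"
    max_drop: float  # largest allowed drop versus baseline (0.0 = no drop)


def check_gate(baseline, candidate, rules, tolerance=1e-9):
    """Compare candidate scores to baseline scores on the same rows.

    `baseline` and `candidate` map (evaluator, cohort) -> mean score in [0, 1].
    Returns a list of violation messages; an empty list means the gate passes.
    """
    violations = []
    for rule in rules:
        key = (rule.evaluator, rule.cohort)
        drop = baseline[key] - candidate[key]
        # The tolerance absorbs floating-point noise in the subtraction.
        if drop > rule.max_drop + tolerance:
            violations.append(
                f"{rule.evaluator}/{rule.cohort}: dropped {drop:.3f}, "
                f"allowed {rule.max_drop:.3f}"
            )
    return violations


# Hypothetical scores for the last shipped model and the release candidate.
baseline = {("Groundedness", "refund"): 0.90, ("TaskCompletion", "healthcare"): 0.95}
candidate = {("Groundedness", "refund"): 0.89, ("TaskCompletion", "healthcare"): 0.94}
rules = [
    GateRule("Groundedness", "refund", max_drop=0.02),
    GateRule("TaskCompletion", "healthcare", max_drop=0.0),  # safety-critical
]
violations = check_gate(baseline, candidate, rules)
```

In this example the Groundedness drop stays inside its allowance, while the TaskCompletion drop on the safety-critical cohort produces a violation. A CI job would fail the build whenever the returned list is non-empty.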

Lifecycle of a 2026 golden dataset

A healthy golden dataset has a lifecycle, not a static state. The pattern that works:

  1. Seed (Week 1). 100-300 rows hand-written by domain reviewers from product requirements, refusal cases, and known edge cases. Tag every row with a cohort. This is the day-one minimum to gate releases.
  2. Growth from production (Weeks 2-8). Sample failed traces (eval_score < threshold or user_feedback = negative), route to fi.queues.AnnotationQueue, get two-reviewer agreement, promote into the dataset. Expect 30-80 new rows per week on an active product.
  3. Synthetic fill (Month 2+). Use simulate-sdk ScenarioGenerator to fill cohort gaps the dashboard surfaces: multilingual edge cases, rare intents, adversarial inputs. Mark these rows source=synthetic and audit a sample.
  4. Pruning (Quarterly). Remove rows that pass on every model in your gateway routing pool; they no longer discriminate between models and only inflate run cost. Keep them in a deprecated archive for historical comparison.
  5. Sync with policy changes. When refund policy changes, when a tool is renamed, when a refusal rule updates, the affected rows must be re-rubric’d. Stale references silently mis-grade every model after the policy date.
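The growth step in the lifecycle above (sample failed traces, require two-reviewer agreement, then promote) can be sketched without any SDK. This is a minimal illustration under assumptions: the dictionary keys, the 0.7 threshold, and the helper names `select_for_review` and `promote` are hypothetical, and the real flow would go through fi.queues.AnnotationQueue.

```python
def select_for_review(traces, score_threshold=0.7):
    """Keep traces that failed an eval or drew negative user feedback."""
    return [
        t for t in traces
        if t["eval_score"] < score_threshold or t["user_feedback"] == "negative"
    ]


def promote(reviewed, min_reviewers=2):
    """Split reviewed traces into promotable gold rows and quarantined ids."""
    promoted, quarantined = [], []
    for item in reviewed:
        labels = item["reviewer_labels"]
        # Promote only when enough reviewers voted and they all agree.
        unanimous = len(labels) >= min_reviewers and len(set(labels)) == 1
        if unanimous:
            promoted.append({
                "input": item["input"],
                "expected_response": labels[0],
                "source": "production",
                "source_trace_id": item["trace_id"],
                "reviewer_agreement": 1.0,
            })
        else:
            quarantined.append(item["trace_id"])
    return promoted, quarantined


traces = [
    {"trace_id": "t1", "input": "Refund my order", "eval_score": 0.40, "user_feedback": None},
    {"trace_id": "t2", "input": "Delete my account", "eval_score": 0.95, "user_feedback": "negative"},
    {"trace_id": "t3", "input": "Update my email", "eval_score": 0.92, "user_feedback": "positive"},
]
queue = select_for_review(traces)

# Hypothetical reviewer decisions collected after human review.
labels = {"t1": ["issue_refund", "issue_refund"], "t2": ["confirm_identity", "escalate"]}
reviewed = [dict(t, reviewer_labels=labels[t["trace_id"]]) for t in queue]
promoted, quarantined = promote(reviewed)
```

Here t1 is promoted with its source_trace_id preserved, t2 is quarantined because the reviewers disagree, and t3 never enters review because it neither failed an eval nor drew negative feedback.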

Cohort design is the work

The single biggest predictor of whether a golden dataset is useful is cohort design. A 5,000-row dataset with one cohort is worse than a 600-row dataset with twenty cohorts. The cohorts we see work in 2026:

  • Intent cohorts: refund, billing, account, technical-support, escalation
  • Difficulty cohorts: happy-path, edge-case, adversarial
  • Tool cohorts: single-tool, multi-tool, no-tool, MCP-only
  • Trust cohorts: high-stakes (money, health), medium-stakes, low-stakes
  • Locale cohorts: by language and region
  • Source cohorts: seed, production-promoted, synthetic
  • Modality cohorts: text-only, multimodal, voice-transcript

Every row carries every cohort tag. Eval-fail-rate-by-cohort then becomes the dashboard view that actually predicts incidents.
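Because each row carries one tag per cohort dimension, eval-fail-rate-by-cohort is a simple group-by over (dimension, value) pairs. The sketch below assumes a hypothetical row shape with a `cohorts` dict and a boolean `passed` field; it is not the dashboard's implementation.

```python
from collections import defaultdict


def fail_rate_by_cohort(rows):
    """Return {(dimension, value): fail_rate} across every cohort tag."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for row in rows:
        # A single row contributes to one bucket per cohort dimension.
        for dimension, value in row["cohorts"].items():
            key = (dimension, value)
            totals[key] += 1
            if not row["passed"]:
                fails[key] += 1
    return {key: fails[key] / totals[key] for key in totals}


rows = [
    {"cohorts": {"intent": "refund", "difficulty": "happy-path"}, "passed": True},
    {"cohorts": {"intent": "refund", "difficulty": "adversarial"}, "passed": False},
    {"cohorts": {"intent": "billing", "difficulty": "happy-path"}, "passed": True},
    {"cohorts": {"intent": "billing", "difficulty": "edge-case"}, "passed": True},
]
rates = fail_rate_by_cohort(rows)
```

The overall fail rate in this toy set is 25%, yet the adversarial cohort fails 100% of the time and the refund intent fails 50%. That gap is what a single aggregate score hides.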

Sizing: how many rows is enough?

The 2026 question we get most often. The answer depends on cohort design more than total size, but a useful rule of thumb:

  • Minimum viable: 200-500 rows total, 8-15 cohorts, 15-30 rows per cohort. Enough to gate releases on a narrow product surface.
  • Production-grade: 1,500-3,500 rows, 20-40 cohorts, 50-100 rows per cohort. Used by most teams shipping to thousands of users in 2026.
  • High-stakes (healthcare, fintech, legal): 5,000-15,000 rows, 40-80 cohorts, 100-200 rows per cohort. Used when individual incidents have regulatory or safety consequences.
  • Beyond 15,000 rows: diminishing returns; the next investment is better cohort coverage, not more rows. Often cheaper to use simulate-sdk synthesis to fill specific cohort gaps than to keep manually authoring rows.

The ratio matters more than the total. A 3,000-row dataset with 40 well-balanced cohorts catches more regressions than a 30,000-row dataset where 80% of rows are happy-path support queries.
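A balance audit makes the "ratio over total" point checkable. The sketch below flags cohort values with too few rows and values that dominate the set; the function name, the row shape, and the `min_rows` and `max_share` thresholds are assumptions chosen for illustration.

```python
from collections import Counter


def audit_dimension(rows, dimension, expected_values, min_rows, max_share):
    """Flag under-filled and dominant cohort values for one dimension.

    `expected_values` lists every cohort value the set should cover, so a
    value with zero rows is still reported as under-filled.
    """
    counts = Counter(row["cohorts"][dimension] for row in rows)
    total = len(rows)
    underfilled = sorted(v for v in expected_values if counts[v] < min_rows)
    dominant = sorted(v for v, n in counts.items() if n / total > max_share)
    return underfilled, dominant


# Toy set: 80% happy-path rows and no adversarial rows at all.
rows = (
    [{"cohorts": {"difficulty": "happy-path"}}] * 8
    + [{"cohorts": {"difficulty": "edge-case"}}] * 2
)
underfilled, dominant = audit_dimension(
    rows,
    dimension="difficulty",
    expected_values=["happy-path", "edge-case", "adversarial"],
    min_rows=2,
    max_share=0.5,
)
```

Running the audit per dimension before each release shows where the next rows should go, which is usually a better use of reviewer time than growing the total.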

Cost of running the gold set

Every release-gate run costs tokens. In 2026, a typical 2,500-row golden dataset run across Groundedness, AnswerRelevancy, TaskCompletion, ToolSelectionAccuracy, Faithfulness, and a CustomEvaluation costs $40-$200 per run depending on judge model. At 5-15 release candidates per week, that adds up to roughly $200-$3,000 per week.

Patterns that reduce cost without sacrificing signal:

  • Tiered judges. Use a cheap model for first-pass evaluator scoring, escalate disagreements to a stronger judge. Cuts cost 60-70% with minimal accuracy loss.
  • Cohort sampling for fast lane. PR-time evals run a stratified sample (200-300 rows), full set runs nightly. PR feedback in minutes, full coverage daily.
  • Cache evaluator decisions by (input hash, output hash, prompt version). Re-running an identical eval is a cache hit, not a new LLM call.
  • Skip stable rows. Rows that have passed on every model and prompt in the last 50 runs are candidates for the deprecated archive; they no longer discriminate.
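The caching pattern in the list above can be sketched as a small wrapper keyed by (input hash, output hash, prompt version). `EvalCache` and `toy_judge` are hypothetical names; the judge here is a stand-in function, where a real setup would call an LLM evaluator.

```python
import hashlib


def _digest(text):
    """Stable content hash, so identical text always maps to the same key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class EvalCache:
    """Memoize judge decisions by (input hash, output hash, prompt version)."""

    def __init__(self, judge):
        self._judge = judge
        self._scores = {}
        self.judge_calls = 0  # counts real (uncached) judge invocations

    def score(self, input_text, output_text, prompt_version):
        key = (_digest(input_text), _digest(output_text), prompt_version)
        if key not in self._scores:
            self.judge_calls += 1
            self._scores[key] = self._judge(input_text, output_text)
        return self._scores[key]


def toy_judge(input_text, output_text):
    """Stand-in for an LLM judge: passes outputs that mention a refund."""
    return 1.0 if "refund" in output_text.lower() else 0.0


cache = EvalCache(toy_judge)
first = cache.score("Where is my refund?", "Your refund was issued.", "prompt-v7")
second = cache.score("Where is my refund?", "Your refund was issued.", "prompt-v7")
third = cache.score("Where is my refund?", "Your refund was issued.", "prompt-v8")
```

The second call is a cache hit, while the prompt-version change forces a fresh judge call. If several judge models or evaluator versions share one cache, adding them to the key is likely a prudent extension.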

How to measure or detect golden dataset quality

Measure the dataset, not just the model running against it. A bad gold set silently mis-grades every model that runs through it:

  • GroundTruthMatch pass rate: checks row-level agreement against trusted answers or labels; split by cohort and dataset version.
  • Coverage by failure mode: percentage of known production failure modes represented in at least one reviewed row. Maintain a failure-mode taxonomy and audit coverage every release.
  • Reviewer agreement: share of rows where two reviewers select the same expected label or rubric score, ideally reported as chance-corrected Cohen’s κ; low agreement (κ < 0.85) means noisy gold data. Quarantine and re-rubric.
  • Staleness: days since the last promoted production failure; long gaps (>14 days for high-traffic surfaces) usually mean the dataset no longer reflects traffic.
  • Eval-fail-rate-by-cohort: dashboard signal showing which slice regressed after a model, prompt, retriever, or tool change.
  • Distribution match: compare cohort distribution in the gold set against production traffic distribution; >25% drift on a major cohort means the gold set lies.
  • Per-evaluator stability: score variance across reruns on the same model + prompt; high variance indicates a flaky evaluator, not a flaky model.
  • fi.evals.HallucinationScore: surfaces hallucination clusters that may indicate weak references in the gold set.
  • User-feedback proxy: thumbs-down rate, corrected-label rate, and escalation rate for cohorts missing from the golden set.

Attaching evaluators to a pinned dataset version:

from fi.datasets import Dataset
from fi.evals import GroundTruthMatch, Groundedness, TaskCompletion, TrajectoryScore

# Load a pinned version so scores stay comparable across runs
golden = Dataset.get("support-golden", version="v12")

# Attach one evaluator per behavior the gate should protect
golden.add_evaluation(GroundTruthMatch())
golden.add_evaluation(Groundedness())
golden.add_evaluation(TaskCompletion())
golden.add_evaluation(TrajectoryScore())

Reruns must be reproducible: pin model name, prompt version, and judge model. If the score moves on a rerun with no inputs changed, fix the eval before trusting any release decision.
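The reviewer-agreement metric above refers to Cohen's κ, which corrects raw agreement for the agreement two reviewers would reach by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is expected chance agreement. A minimal standard-library sketch follows; the function name and the example labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same rows."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty rows")
    n = len(labels_a)
    # p_o: share of rows where both reviewers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from each reviewer's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


reviewer_a = ["refund", "refund", "billing", "billing"]
reviewer_b = ["refund", "billing", "billing", "billing"]
kappa = cohens_kappa(reviewer_a, reviewer_b)
```

In this example the reviewers agree on 3 of 4 rows (75% raw agreement), but chance agreement is 50%, so κ is 0.5. Raw agreement alone would overstate how clean the labels are.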

For a cohort-filtered regression run that promotes production failures into the gold set through an AnnotationQueue:

from fi.datasets import Dataset
from fi.queues import AnnotationQueue
from fi.evals import Groundedness, TaskCompletion, AggregatedMetric

# Promote failed traces (eval score below threshold) into review
queue = AnnotationQueue(name="golden-promotion", min_reviewers=2)
queue.enqueue_from_traces(
    filter={"eval.Groundedness": {"$lt": 0.7}, "cohort": "refund"},
    last="7d",
)

# After human review, approved rows land in the golden Dataset
ds = Dataset.get("support-golden", version="v13")
ds.append_from_queue(queue, only_approved=True)

# Run cohort-filtered regression eval before release
agg = AggregatedMetric(
    metrics=[Groundedness(), TaskCompletion()],
    weights=[0.5, 0.5],
)
gate = agg.run_dataset(ds.filter(cohort="refund"), model="gpt-5.1")
assert gate.score >= 0.91, f"Refund cohort regressed to {gate.score}"

Multi-modal and voice gold sets

The 2026 surface for voice agents and multimodal assistants expanded what a golden dataset has to carry. A voice-agent gold row in our customer stacks now includes:

  • The user’s transcribed text (with timing and confidence)
  • The audio reference (for ASRAccuracy and TTSAccuracy evaluation)
  • The expected response text and tone
  • A persona definition from simulate-sdk Persona for replay
  • Interruption-handling expectations
  • Latency budgets (first-token, end-of-turn detection)

For multimodal gold sets (image input plus text response), the row carries the image reference, the expected response, the bounding-box or attention-region rubric, and an ImageInstructionAdherence check. The same fi.datasets.Dataset surface holds all three (text, voice, and multimodal) with no separate tooling.
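To make the voice-row fields listed above concrete, here is one possible shape as a dataclass. Every field name, the audio URI, and the budget values are hypothetical; the actual column names in fi.datasets.Dataset may differ.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceGoldRow:
    row_id: str
    transcript: str                # user's transcribed text
    transcript_confidence: float   # ASR confidence in [0, 1]
    word_timings_ms: list          # start offset of each word, in milliseconds
    audio_uri: str                 # reference audio for ASR/TTS evaluation
    expected_response: str
    expected_tone: str
    persona: dict                  # persona definition used for replay
    interruption_expectation: str  # what the agent should do when interrupted
    first_token_budget_ms: int
    end_of_turn_budget_ms: int
    cohorts: dict = field(default_factory=dict)

    def within_latency_budget(self, first_token_ms, end_of_turn_ms):
        """True when both measured latencies fit the row's budgets."""
        return (
            first_token_ms <= self.first_token_budget_ms
            and end_of_turn_ms <= self.end_of_turn_budget_ms
        )


row = VoiceGoldRow(
    row_id="voice-0001",
    transcript="I want to cancel my order",
    transcript_confidence=0.94,
    word_timings_ms=[0, 180, 390, 520, 760, 900],
    audio_uri="s3://example-bucket/audio/voice-0001.wav",  # placeholder path
    expected_response="I can help with that. Which order would you like to cancel?",
    expected_tone="calm",
    persona={"name": "impatient-caller", "locale": "en-US"},
    interruption_expectation="stop speaking and yield the turn",
    first_token_budget_ms=500,
    end_of_turn_budget_ms=800,
    cohorts={"modality": "voice-transcript", "intent": "escalation"},
)
```

Keeping latency budgets on the row, rather than in the test harness, lets the same release gate check both answer quality and responsiveness per cohort.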

The contamination wall between training and eval

The single non-negotiable in a 2026 golden dataset is the wall between training data and gold rows. If a model has seen the gold answers during fine-tuning or DPO, the pass rate is a measurement of memorization, not generalization. The patterns that hold the wall:

  • Hash-tag every gold row with a stable identifier and check training corpora for overlap before any tuning run.
  • Never use production traces as training data without first checking whether they overlap with gold-promoted traces. The check is a join on source_trace_id.
  • Hold out a fresh 10-15% slice of the gold dataset from any fine-tuning pipeline as a contamination control. Score it separately.
  • Run an entity-rename probe on a sample of gold rows after every fine-tune; large accuracy gaps between original and renamed indicate the model memorized the gold answer.

When teams skip this discipline, their golden dataset slowly becomes a vanity dashboard.
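The first two wall patterns (stable row identifiers and a join on source_trace_id) can be sketched as a pre-tuning check. This is a minimal illustration: the helper names and record keys are assumptions, and exact hashing after whitespace and case normalization only catches near-verbatim overlap, not paraphrases.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide overlap."""
    return " ".join(text.lower().split())


def row_fingerprint(input_text, expected_text):
    """Stable identifier for a gold row, derived from its content."""
    payload = normalize(input_text) + "\x1f" + normalize(expected_text)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_contamination(gold_rows, training_examples):
    """Return ids of gold rows that also appear in the training corpus."""
    by_fingerprint = {
        row_fingerprint(r["input"], r["expected_response"]): r["row_id"]
        for r in gold_rows
    }
    by_trace = {
        r["source_trace_id"]: r["row_id"]
        for r in gold_rows if r.get("source_trace_id")
    }
    hits = set()
    for example in training_examples:
        fingerprint = row_fingerprint(example["input"], example["output"])
        if fingerprint in by_fingerprint:
            hits.add(by_fingerprint[fingerprint])
        # Join on source_trace_id catches traces promoted into the gold set.
        trace_id = example.get("source_trace_id")
        if trace_id in by_trace:
            hits.add(by_trace[trace_id])
    return sorted(hits)


gold_rows = [
    {"row_id": "g1", "input": "Refund my order", "expected_response": "Refund issued.",
     "source_trace_id": "t1"},
    {"row_id": "g2", "input": "Delete my account", "expected_response": "Identity check first.",
     "source_trace_id": None},
    {"row_id": "g3", "input": "Change my plan", "expected_response": "Plan updated.",
     "source_trace_id": "t9"},
]
training_examples = [
    {"input": "refund   my ORDER", "output": "Refund issued.", "source_trace_id": None},
    {"input": "Something unrelated", "output": "Sure.", "source_trace_id": "t9"},
    {"input": "Reset my password", "output": "Link sent.", "source_trace_id": "t4"},
]
contaminated = find_contamination(gold_rows, training_examples)
```

Here g1 is caught by content fingerprint despite casing and spacing differences, and g3 is caught by the trace-id join. Any non-empty result should block the tuning run until the overlapping examples are removed.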

Common mistakes

  • Mixing eval data with training data. If the model has seen the gold answers during fine-tuning, the pass rate overstates production reliability. Keep a hard wall between training corpora and gold rows.
  • Editing rows in place. Changing expected answers without a dataset version destroys release-to-release comparison and makes old regression results unreadable. Versioning is non-negotiable.
  • Only collecting happy paths. Golden sets need rare intents, refusals, locale issues, stale-context cases, multilingual edge cases, and tool failures, not only successful demo prompts. Failure rows are more valuable than success rows.
  • Skipping reviewer agreement. A row with disputed labels teaches the evaluator annotation noise. Quarantine it until the rubric or reference is clarified by a third reviewer.
  • Letting the set age silently. A dataset that ignores new production failures becomes a museum of last quarter’s risks. Promote failed traces weekly.
  • One global gold set across products. Refund, onboarding, compliance, and coding agents have different failure surfaces. Maintain a gold set per route and pool only the metadata, not the rows.
  • Sampling production traces without filters. Random sampling overrepresents happy paths. Sample stratified on eval_score < threshold and user_feedback = negative.
  • No judge-model contamination check. If the judge model and the production model are from the same family, scores can inflate on subjective rubrics because judges tend to favor outputs that resemble their own. Pin the judge to a different family.

Frequently Asked Questions

What is a golden dataset?

A golden dataset is a reviewed, versioned eval set with representative inputs and trusted references. Teams use it to score LLM or agent outputs repeatably across prompt, model, retriever, and tool changes.

How is a golden dataset different from a benchmark?

A benchmark is usually public and built to compare models broadly. A golden dataset is private to your product, updated from real failures, and used to protect the behavior your users expect.

How do you measure a golden dataset?

FutureAGI stores it through fi.datasets.Dataset and scores rows with evaluators such as GroundTruthMatch, Groundedness, and TaskCompletion. Track coverage, reviewer agreement, eval-fail-rate-by-cohort, and regression stability.