What Is a Dataset (LLM Eval)?
A versioned collection of inputs, references, labels, context, metadata, and trace links used to evaluate LLM or agent behavior.
A dataset for LLM evaluation is a versioned collection of test inputs, expected outputs, labels, context, metadata, and trace references used to score model or agent behavior repeatably. It is a data-layer reliability asset that appears in eval pipelines, regression suites, release gates, and production-trace promotion workflows. FutureAGI maps this concept to fi.datasets.Dataset, where engineers manage rows and columns, attach evaluators, run prompts, and track eval stats across model, prompt, retriever, or tool changes.
In 2026 the shape of a useful eval dataset has changed completely from the tabular classifier era. A row is no longer a (features, label) tuple; it is a structured object with input, retrieved context, expected behavior, agent trajectory, tool calls, refusal status, cohort tags, and a link back to the production trace that justified its inclusion. The dataset is the spine of every LLM evaluation workflow: it pins what “correct” means, it carries the rows that block bad releases, and it is the artifact a regulator can inspect to answer “what did you test before shipping?”
Why datasets matter in production LLM and agent systems
Without a dependable eval dataset, release quality becomes a moving target. A RAG update may improve average answer relevance while causing silent hallucinations for policy questions whose reference context never made it into the test set. A support agent may pass happy-path demos while failing chargeback, cancellation, or privacy requests because those rows are missing. A tool-calling workflow may regress only when the row requires two tool calls and a refusal branch, a path that exactly zero of your demo rows exercise.
The pain spreads across teams. Developers lose a stable reproduction case and debug from scattered traces. Product managers cannot distinguish real quality gains from a friendlier test mix. SREs see eval-fail-rate-by-cohort rise after deploy but lack row-level evidence. Compliance teams cannot prove that reviewed PII, refusal, or policy rows were rerun before release. End users feel the result as wrong answers, unnecessary escalations, or inconsistent agent behavior. In our 2026 evals, the strongest correlate of post-launch reliability is not which model the team picked or how big their context window is; it is whether their dataset row count, cohort coverage, and provenance metadata were tracked from day one.
Datasets matter more in 2026-era multi-step systems because one user request can pass through retrieval, a planner, multiple tool calls, a model fallback, and a final answer. A single-row prompt/response CSV cannot represent that shape. The agent-era benchmarks that frontier labs report (τ-bench, SWE-Bench Verified, GAIA, OSWorld) all store rows that include database state, tool definitions, expected trajectories, and pass/fail criteria. Production datasets need to match that fidelity. A useful 2026 eval dataset row carries input, reference context, expected outcome, trajectory notes, metadata such as locale or account type, the MCP tool surface the agent had available, and a link back to the production trace or synthetic scenario that justified the row.
Dataset row shape in 2026
The columns that consistently appear in production-grade datasets:
| Column | Purpose | Required for |
|---|---|---|
| input | The user query or prompt | Every row |
| expected_response | Canonical or reference answer | GroundTruthMatch |
| context | Retrieved chunks or system context | Groundedness, ContextRelevance, Faithfulness |
| expected_tool | Tool the agent should call | ToolSelectionAccuracy |
| expected_trajectory | Multi-step plan or trajectory | TrajectoryScore, TaskCompletion |
| cohort | Locale, plan, tenant, intent | Cohort segmentation, fairness |
| source_trace_id | Trace this row was promoted from | Provenance, drift investigation |
| reviewer_status | Approved, pending, rejected | Golden dataset promotion |
| dataset_version | Semver-style version | Reproducibility |
| refusal_expected | Whether refusal is the right answer | Refusal-rate measurement |
| policy_tag | Policy or regulation the row tests | EU AI Act, PII compliance |
| pii_present | Whether the row contains PII | Privacy handling |
Three of these columns (source_trace_id, cohort, and reviewer_status) are the ones engineering teams skip and then regret. Without source_trace_id you cannot answer “why is this row in the dataset?” Without cohort you cannot detect data drift. Without reviewer_status you cannot tell a promoted golden row from a candidate row, and your release gate is grading itself.
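As a rough illustration of the row shape in the table above, the sketch below models one row as a plain Python dataclass and reports which of the three often-skipped columns are empty. The class `EvalRow` and the helper `missing_audit_fields` are hypothetical names invented for this example; they are not part of the FutureAGI SDK.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalRow:
    """Hypothetical row model; field names mirror the column table above."""
    input: str
    expected_response: Optional[str] = None
    context: list = field(default_factory=list)
    expected_tool: Optional[str] = None
    expected_trajectory: list = field(default_factory=list)
    cohort: Optional[str] = None
    source_trace_id: Optional[str] = None
    reviewer_status: Optional[str] = None  # "approved", "pending", or "rejected"
    dataset_version: Optional[str] = None
    refusal_expected: bool = False
    policy_tag: Optional[str] = None
    pii_present: bool = False


def missing_audit_fields(row: EvalRow) -> list:
    """Return the names of the three often-skipped columns that are empty."""
    audited = ("source_trace_id", "cohort", "reviewer_status")
    return [name for name in audited if not getattr(row, name)]


# A freshly captured row with only the input filled in.
bare = EvalRow(input="Cancel my enterprise plan")
# A row that has been promoted with provenance, cohort, and review recorded.
complete = EvalRow(
    input="Cancel my enterprise plan",
    cohort="enterprise-cancellation",
    source_trace_id="trace-123",  # made-up identifier for illustration
    reviewer_status="approved",
)
```

Running `missing_audit_fields` over every row before a release is one cheap way to surface rows whose provenance or review state was never recorded.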
How FutureAGI handles datasets
FutureAGI’s approach is to treat the dataset as the system of record for eval evidence. The specific anchor is fi.datasets.Dataset, exposed in the SDK and the evaluate UI. Engineers create or import datasets, add columns and rows, import files or Hugging Face data, run prompts over rows, attach evaluations, inspect eval stats, and use the results for optimization or regression gating.
A realistic dataset for a billing support agent might include columns named input, expected_response, context, cohort, source_trace_id, tool_path, reviewer_status, and dataset_version. Rows come from three sources: production traces captured through traceAI (real failure modes, sampled and reviewed), synthetic scenarios generated via the simulate surface (rare cohorts, adversarial inputs, multilingual variants), and human annotation (canonical answers, policy edge cases). Imports from Hugging Face Datasets are also a common starting point for public-benchmark coverage. Each source ties back to a ground truth reference where applicable and a reviewer status field that gates promotion to the gating subset. The team attaches GroundTruthMatch for canonical answers, Groundedness for responses that must cite provided context, Faithfulness for RAG fidelity, AnswerRelevancy for intent fit, and TaskCompletion plus TrajectoryScore for agent outcomes. Trace fields such as agent.trajectory.step, gen_ai.tool.name, and llm.token_count.prompt help explain whether a failure came from planning, retrieval, cost pressure, or the final generation step.
What happens next is operational, not archival. If a new prompt raises overall pass rate but drops the “enterprise cancellation” cohort below a 0.90 threshold, the release is blocked. If the dataset shows a retrieval-only regression, the engineer reruns the RAG eval before changing the system prompt. If a high-cost row repeatedly triggers a model fallback in Agent Command Center, it is flagged for routing review. Unlike Ragas-style single-turn RAG datasets, this dataset carries agent trajectory, tool, and policy context in the same row. Unlike LangSmith dataset collections, which center on input/output pairs, FutureAGI’s row model treats the trajectory and trace link as first-class fields so drift investigations are one query, not a manual join. In our 2026 evals, the best datasets explain why each row exists before they explain what score it produced.
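The cohort gate described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the FutureAGI implementation: the function names, the `(cohort, passed)` input format, and the sample numbers are all assumptions made for the example. Only the 0.90 threshold and the cohort name come from the text.

```python
from collections import defaultdict


def cohort_pass_rates(results):
    """Map each cohort to its pass rate, given (cohort, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for cohort, passed in results:
        totals[cohort] += 1
        if passed:
            passes[cohort] += 1
    return {cohort: passes[cohort] / totals[cohort] for cohort in totals}


def blocking_cohorts(results, threshold=0.90):
    """Return the cohorts whose pass rate falls below the gate threshold."""
    rates = cohort_pass_rates(results)
    return sorted(c for c, rate in rates.items() if rate < threshold)


# Made-up results: the overall pass rate is about 96%, but one cohort is at 85%.
results = (
    [("smb-billing", True)] * 98
    + [("smb-billing", False)] * 2
    + [("enterprise-cancellation", True)] * 17
    + [("enterprise-cancellation", False)] * 3
)
blocked = blocking_cohorts(results)
release_allowed = not blocked
```

The point of the sketch is that the gate evaluates each cohort separately, so a healthy global mean cannot mask a failing cohort.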
The dataset lifecycle
A production dataset in 2026 moves through five states. Candidate rows come from sampled production traces or synthetic generation but have no expected output yet. Annotated rows have a reference answer and reviewer comments. Approved rows are promoted to the golden dataset and used in release gates. Deprecated rows have been superseded by a newer version (a pricing change, policy update, or product launch) and are kept for historical context but excluded from gating. Archived rows are retired entirely. FutureAGI’s Dataset API exposes status transitions and version diffs so the team can answer “which rows are blocking this release?” with a single query rather than a manual review.
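One way to keep the five states honest is to encode them as an explicit state machine. The sketch below is an assumption-laden illustration: the state names come from the paragraph above, but the allowed transitions (for example, letting a Candidate row go straight to Archived when it is rejected) are a guess at a sensible policy, and `RowState`, `ALLOWED`, `transition`, and `gates_release` are hypothetical names rather than FutureAGI API.

```python
from enum import Enum


class RowState(Enum):
    CANDIDATE = "candidate"
    ANNOTATED = "annotated"
    APPROVED = "approved"
    DEPRECATED = "deprecated"
    ARCHIVED = "archived"


# Assumed transition policy; adjust to match the team's actual review process.
ALLOWED = {
    RowState.CANDIDATE: {RowState.ANNOTATED, RowState.ARCHIVED},
    RowState.ANNOTATED: {RowState.APPROVED, RowState.ARCHIVED},
    RowState.APPROVED: {RowState.DEPRECATED},
    RowState.DEPRECATED: {RowState.ARCHIVED},
    RowState.ARCHIVED: set(),
}


def transition(current: RowState, target: RowState) -> RowState:
    """Return the target state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def gates_release(state: RowState) -> bool:
    """Only Approved rows participate in release gating."""
    return state is RowState.APPROVED
```

Rejecting illegal moves at write time (for instance, Candidate directly to Approved) keeps unreviewed rows out of the gating subset by construction.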
Public benchmarks vs internal datasets
A 2026 senior engineer should treat public benchmark datasets and internal eval datasets as different tools that share a name. Public benchmarks (HLE, GPQA Diamond, SWE-Bench Verified, τ-bench, BFCL v3, MMMU-Pro, RULER, MLE-Bench) are built for comparison across the field and almost universally have a contamination risk and a “what we tested in the lab” framing. Internal datasets are built for one product, one set of policies, one customer base, and stay alive across releases. Public scores shortlist; internal datasets decide; production traces confirm. The teams that ship reliably in 2026 keep both alive: they run public LLM benchmarks for tier selection and trend tracking, and they maintain an internal dataset that gates every release. Treating one as a substitute for the other is the single most common eval-program mistake we see.
Source mix: production, synthetic, and human
Three sources, three failure modes if you use them wrong. Production-sampled rows are the most realistic but the most biased toward the cohorts you already have traffic for; they cannot cover a launch you have not made yet. Synthetic rows generated through Persona and Scenario definitions in the simulate surface cover hypothetical cohorts and adversarial cases but can drift from real user phrasing if the generation prompt is not refreshed. Human-annotated rows are the most accurate but the slowest and most expensive. The 2026 sweet spot we see in mature teams is roughly 60% production-sampled, 25% synthetic, 15% human-annotated for canonical edge cases, with the mix re-balanced quarterly as the product evolves. A common anti-pattern: a dataset that is 95% synthetic because production sampling is hard to wire up. That dataset measures whether the model can pass the generator’s idea of a test, not whether it serves real users.
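A quarterly rebalance starts with knowing the current mix. The sketch below computes the share of rows per source and reports how far each share sits from the rough 60/25/15 target mentioned above. The source labels, the 10-point tolerance, and the function names are assumptions chosen for illustration.

```python
from collections import Counter

# Rough target mix from the text above; treat it as a starting point, not a rule.
TARGET_MIX = {"production": 0.60, "synthetic": 0.25, "human": 0.15}


def source_mix(sources):
    """Return the share of rows per source label."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("dataset has no rows")
    return {label: counts.get(label, 0) / total for label in TARGET_MIX}


def mix_drift(sources, tolerance=0.10):
    """Return sources whose share deviates from target by more than tolerance."""
    mix = source_mix(sources)
    return {
        label: round(mix[label] - TARGET_MIX[label], 2)
        for label in TARGET_MIX
        if abs(mix[label] - TARGET_MIX[label]) > tolerance
    }


# The anti-pattern from the text: a dataset that is 95% synthetic.
row_sources = ["synthetic"] * 95 + ["production"] * 3 + ["human"] * 2
drift = mix_drift(row_sources)
```

A positive value means the source is over-represented relative to the target, and a negative value means it is under-represented.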
Dataset versioning and reproducibility
Reproducibility is the part most teams discover the hard way. A 2026 release report should be able to answer four questions: which dataset_version ran, which gen_ai.request.model snapshot ran, which evaluator versions ran, and which prompt revision ran. Skip any of these and the regression eval is a moving target. FutureAGI’s Dataset.get(name, version=...) API pins the dataset version; the evaluator imports pin themselves through their class name plus their internal version field; the model snapshot is pinned in trace attributes. The combination is what makes “compare last week’s release to this week’s” a one-query operation rather than a forensic exercise.
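The four pins can be captured in a small run manifest that is stored next to every eval result. The sketch below is illustrative only: `RunManifest`, `unpinned_fields`, and `changed_pins` are hypothetical names, and the snapshot and revision strings are placeholders rather than real identifiers.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunManifest:
    """The four things a release report should be able to name."""
    dataset_version: str
    model_snapshot: str        # the value recorded as gen_ai.request.model
    evaluator_versions: tuple  # e.g. (("Groundedness", "1"),)
    prompt_revision: str


def unpinned_fields(manifest: RunManifest) -> list:
    """Return the names of pins that were left empty."""
    return [name for name, value in asdict(manifest).items() if not value]


def changed_pins(old: RunManifest, new: RunManifest) -> dict:
    """Return {field: (old_value, new_value)} for every pin that differs."""
    before, after = asdict(old), asdict(new)
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}


# Placeholder values for two consecutive weekly runs.
last_week = RunManifest(
    "2026-05-08", "model-snapshot-a", (("Groundedness", "1"),), "prompt-r41"
)
this_week = RunManifest(
    "2026-05-15", "model-snapshot-a", (("Groundedness", "1"),), "prompt-r42"
)
diff = changed_pins(last_week, this_week)
```

If `diff` shows more than one changed pin, a score difference between the two runs cannot be attributed to a single cause, which is usually worth flagging in the release report.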
How to measure dataset quality
Measure the dataset as an eval instrument, not only as storage. A useful 2026 quality stack covers eight signals:
- Coverage by cohort: percent of known intents, locales, account types, policies, and failure modes represented by at least one reviewed row. Sub-2% coverage on a cohort means you cannot guard it.
- Reference completeness: share of rows with expected_response, ground truth label, context, rubric, or expected_tool populated when required by the attached evaluators.
- GroundTruthMatch pass rate: checks generated output against a trusted reference; split by dataset_version and cohort to spot reference drift.
- Groundedness failure rate: flags answers that are not supported by the stored context, which often reveals missing or stale context rows in the dataset itself.
- TaskCompletion and TrajectoryScore: for agent datasets, whether the row drove the agent to a successful end state and whether the trajectory matched expectations.
- Provenance health: percent of rows with source_trace_id, reviewer, synthetic scenario, or import source recorded.
- Reviewer agreement: Cohen’s kappa or simple agreement rate across human reviewers on the same row; below 0.7 means the rubric or the expected answer needs work.
- Dashboard signal: eval-fail-rate-by-cohort, reviewer-disagreement rate, score variance across versions, and user-feedback proxies such as escalation rate.
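```python
from fi.datasets import Dataset
from fi.evals import GroundTruthMatch, Groundedness, TaskCompletion

# Pin the dataset version so the run is reproducible.
dataset = Dataset.get("support-eval", version="2026-05-15")

# Attach the evaluators that the dataset's columns support.
dataset.add_evaluation(GroundTruthMatch())
dataset.add_evaluation(Groundedness())
dataset.add_evaluation(TaskCompletion())

# Run all attached evaluators and report the Groundedness mean per cohort.
results = dataset.run()
print(results.mean_by_cohort("Groundedness"))
```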
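The reviewer-agreement signal in the list above can be computed without any SDK. The sketch below implements Cohen’s kappa for two reviewers using only the standard library; the function name and the sample labels are made up for illustration, and the 0.7 cut-off is the one quoted in the list.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same rows."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of rows where both reviewers match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each reviewer's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels from two reviewers on the same ten rows.
reviewer_1 = ["correct"] * 6 + ["unsupported_claim"] * 4
reviewer_2 = ["correct"] * 5 + ["unsupported_claim"] * 5
kappa = cohens_kappa(reviewer_1, reviewer_2)
needs_rubric_work = kappa < 0.7
```

In this example the reviewers agree on nine of ten rows and chance agreement is 0.5, so kappa comes out at roughly 0.8, which clears the 0.7 bar.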
To promote production traces into the dataset as Candidate rows and immediately attach an annotation queue for human review:
```python
from fi.datasets import Dataset
from fi.queues import AnnotationQueue
from fi.evals import HallucinationScore

ds = Dataset.get("support-eval", version="2026-05-15")
queue = AnnotationQueue.get_or_create("support-eval-promotion")

# NOTE: `traces` is not defined in this snippet. It is assumed to be an
# already-configured trace query handle for your tracing backend.
# Pull traces from the last 24h on the enterprise-cancellation cohort
for trace in traces.filter(cohort="enterprise-cancellation", since="-1d"):
    h = HallucinationScore().evaluate(
        response=trace.attributes["llm.output"],
        context=trace.attributes["retrieval.documents"],
    )
    if h.score > 0.4:  # likely unsupported
        # Add the trace as a Candidate row with provenance recorded.
        row = ds.add_row(
            input=trace.attributes["input.value"],
            context=trace.attributes["retrieval.documents"],
            source_trace_id=trace.id,
            cohort="enterprise-cancellation",
            reviewer_status="candidate",
        )
        # Queue the new row for human review.
        queue.add_item(row_id=row.id, label_schema=["correct", "unsupported_claim", "wrong_tool"])
```
Healthy dataset signals: stable cohort coverage above 80% of known intents, reviewer_status populated on 100% of approved rows, no row older than 12 months in the active version, and a source_trace_id on every promoted production row. A dataset that fails any of these is technical debt that compounds with every release.
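Three of the health signals above (cohort coverage, row age, and provenance on production rows) are easy to check mechanically. The sketch below assumes rows are plain dictionaries with `id`, `cohort`, `reviewer_status`, `created`, `source`, and `source_trace_id` keys; that layout, the `health_report` name, and the sample data are assumptions for illustration only.

```python
from datetime import date


def health_report(rows, known_intents, today, max_age_days=365):
    """Check cohort coverage, stale rows, and missing production provenance."""
    approved_cohorts = {
        r["cohort"] for r in rows if r.get("reviewer_status") == "approved"
    }
    coverage = len(approved_cohorts & set(known_intents)) / len(known_intents)
    stale = [r["id"] for r in rows if (today - r["created"]).days > max_age_days]
    missing_trace = [
        r["id"]
        for r in rows
        if r.get("source") == "production" and not r.get("source_trace_id")
    ]
    return {
        "coverage": coverage,
        "coverage_ok": coverage > 0.80,
        "stale_rows": stale,
        "missing_trace": missing_trace,
    }


# Made-up rows: one healthy, one stale without a trace link, one unreviewed.
rows = [
    {"id": "r1", "cohort": "billing", "reviewer_status": "approved",
     "created": date(2026, 1, 10), "source": "production", "source_trace_id": "t-1"},
    {"id": "r2", "cohort": "cancellation", "reviewer_status": "approved",
     "created": date(2025, 3, 1), "source": "production", "source_trace_id": None},
    {"id": "r3", "cohort": "privacy", "reviewer_status": "pending",
     "created": date(2026, 4, 1), "source": "synthetic", "source_trace_id": None},
]
known = ["billing", "cancellation", "privacy", "chargeback"]
report = health_report(rows, known, today=date(2026, 5, 15))
```

Here only two of four known intents have an approved row, so coverage is 50% and the dataset would not meet the 80% bar; row `r2` is also flagged as both stale and missing its trace link.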
Promoting production traces into the dataset
The single highest-leverage dataset workflow in 2026 is the trace-to-dataset promotion loop. Start by sampling 2-5% of production traces per cohort, weighting toward failed evals and low-confidence outputs. Triage candidates with LLM-as-a-judge using a CustomEvaluation rubric that flags policy-relevant or novel rows. Send the triaged shortlist to human review. Approved rows enter the Approved state with source_trace_id populated. We’ve found this loop catches 70-80% of would-be drift incidents before they ship, because the rows that fail offline regression after promotion are exactly the rows that were quietly failing in production. Compared to a manual “annotate from scratch” workflow, the promotion loop produces 3-5× more useful rows per reviewer-hour because every row is grounded in a real user query rather than a hypothetical one.
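The first step of the loop, sampling a small share of traces per cohort while favoring failed evals, might look like the sketch below. It uses weighted sampling without replacement (the Efraimidis-Spirakis key trick) from the standard library. The trace dictionary layout, the 3% default rate, the 4x weight for failed traces, and the function name are all assumptions for illustration.

```python
import random
from collections import defaultdict


def sample_candidates(traces, rate=0.03, fail_weight=4.0, seed=0):
    """Sample about `rate` of traces per cohort, favoring failed evals."""
    rng = random.Random(seed)
    by_cohort = defaultdict(list)
    for trace in traces:
        by_cohort[trace["cohort"]].append(trace)

    picked = []
    for group in by_cohort.values():
        k = max(1, round(len(group) * rate))  # at least one row per cohort
        # Efraimidis-Spirakis: key = u ** (1 / weight); keep the k largest keys.
        keyed = [
            (rng.random() ** (1.0 / (fail_weight if t["eval_failed"] else 1.0)), t)
            for t in group
        ]
        keyed.sort(key=lambda pair: pair[0], reverse=True)
        picked.extend(t for _, t in keyed[:k])
    return picked


# Made-up traces: two cohorts, every tenth trace failed its eval.
traces = [
    {"id": f"a-{i}", "cohort": "smb-billing", "eval_failed": i % 10 == 0}
    for i in range(200)
] + [
    {"id": f"b-{i}", "cohort": "enterprise-cancellation", "eval_failed": i % 10 == 0}
    for i in range(100)
]
candidates = sample_candidates(traces)
```

Sampling per cohort rather than globally keeps low-traffic cohorts represented, and the fixed seed makes a given sampling run repeatable for audit purposes.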
Common mistakes
- Using one dataset for training and eval. Once eval rows enter fine-tuning, prompt examples, or RAG retrieval, pass rates stop estimating production behavior. Maintain hard partitions and audit them on every release.
- Leaving row provenance blank. Without source_trace_id, reviewer, or synthetic-scenario tag, engineers cannot tell why a row exists. Two years later the row is undeletable because nobody knows what it tests.
- Averaging away cohorts. One overall score can hide failures for locale, product plan, tool path, or protected-class cases. Always report per-cohort scores alongside the global mean.
- Changing expected answers without versioning. Silent edits break regression history and make 2026 release comparisons meaningless. Use dataset_version and never edit in place.
- Keeping only model inputs and outputs. Agent datasets need context, tool calls, intermediate state, and final outcome labels. Single-turn rows cannot evaluate multi-step systems.
- Treating synthetic data as cheap. Synthetic rows inherit the bias of the generation model. Pair every synthetic batch with a sampled human review or you build a dataset that grades the generator’s worldview, not your users’.
- Not pinning the dataset version. Bumping the model snapshot or the retriever without bumping the dataset version means your regression evals run on a moving target. Every release should pin both gen_ai.request.model and dataset_version.
- Skipping refusal rows. Without refusal_expected: true rows, you cannot detect over-refusal regressions where the model starts refusing safe queries, a common failure mode after policy updates.
- Letting the dataset grow without a retention policy. A 50,000-row dataset with no aging rule is mostly dead weight; the regression eval runs slowly and the gating threshold drifts. Cap row count per cohort, retire rows older than 12 months unless they cover a still-valid policy, and rebalance quarterly.
- Not pinning evaluator versions. Bumping Groundedness to a new judge-model snapshot changes the score distribution. Pin the evaluator import or the regression diff is meaningless.
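The partition audit from the first mistake in the list can start as a simple fingerprint comparison between eval inputs and any corpus used for fine-tuning, few-shot examples, or retrieval. The sketch below is a minimal illustration with hypothetical function names; it catches only matches that are identical after case and whitespace normalization, so paraphrased leaks would need a fuzzier method.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_rows(eval_inputs, other_corpus):
    """Return eval inputs that also appear in another corpus."""
    seen = {fingerprint(text) for text in other_corpus}
    return [text for text in eval_inputs if fingerprint(text) in seen]


# Made-up data: one eval input has leaked into a fine-tuning set.
eval_inputs = ["How do I cancel my plan?", "What is the refund policy?"]
finetune_inputs = ["how do I   CANCEL my plan?", "Where is my invoice?"]
leaks = leaked_rows(eval_inputs, finetune_inputs)
```

Failing the release when `leaks` is non-empty turns the hard-partition rule into an automated check instead of a convention.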
Frequently Asked Questions
What is a dataset in LLM evaluation?
A dataset in LLM evaluation is a versioned set of rows containing inputs, expected outputs, labels, context, metadata, and trace references. Teams run it through eval pipelines to score model, prompt, retriever, and agent changes repeatably.
How is a dataset different from a golden dataset?
A dataset is the general container for eval, training, analysis, or synthetic rows. A golden dataset is the reviewed, trusted subset used as a high-confidence regression and release gate.
How do you measure an LLM eval dataset?
FutureAGI uses fi.datasets.Dataset with evaluators such as GroundTruthMatch, Groundedness, and TaskCompletion. Track coverage, row provenance, reviewer agreement, eval-fail-rate-by-cohort, and score stability across dataset versions.