What Is a Test Set?
A held-back dataset split used to measure AI model, RAG, or agent behavior on examples not used for development.
What Is a Test Set?
A test set is a held-back group of labeled examples used to measure whether an AI model, RAG pipeline, or agent still works after training, prompt edits, retrieval changes, or tool updates. It is a data reliability asset, not production data at rest. In an eval pipeline, the test set supplies inputs, references, context, and metadata so results can be compared across versions. FutureAGI represents this workflow through sdk:Dataset and dataset-attached evaluations.
Why Test Sets Matter in Production LLM and Agent Systems
An unreliable test set creates false confidence. The most common failure mode is data leakage: examples used during prompt tuning, few-shot selection, fine-tuning, or threshold setting are later reported as independent evidence. Scores look strong because the system has already seen the cases. Another failure mode is coverage collapse. A chatbot test set may cover simple product questions while missing refund disputes, policy refusals, stale-context RAG cases, or tool failures.
The pain is practical. Developers lose a trusted release signal and debug from anecdotes. SREs see escalation rate or tool retry rate rise while the offline eval still passes. Product teams ship a new model because average score moved up, then learn that one regulated cohort regressed. Compliance reviewers cannot prove that sensitive rows were held out from tuning.
Agentic systems make the split more important because one row may exercise retrieval, planning, tool selection, structured output, and final response quality. Symptoms show up as a sudden all-green eval after prompt optimization, repeated failures in eval-fail-rate-by-cohort, or a model route that passes single-turn examples but fails multi-step traces. A useful test set gives each team the same question: did the changed system improve on cases it did not use to make the change?
How FutureAGI Handles Test Sets
FutureAGI’s approach is to treat the test set as a versioned reliability contract, not a spreadsheet copied into CI. The specific surface is sdk:Dataset, exposed as fi.datasets.Dataset. Engineers store rows with fields such as input, expected_response, reference_context, cohort, dataset_version, source_trace_id, expected_tool_path, and review_status, then attach evaluations through Dataset.add_evaluation.
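A minimal sketch of one such row before ingestion, assuming rows are expressed as plain dicts; the field names follow the list above, while the dict layout and the required-field check are illustrative rather than the SDK's exact ingestion format:

```python
# One illustrative test-set row using the fields named above.
# The dict shape and the required-field assertion are assumptions for this
# sketch; the actual fi.datasets ingestion call may expect a different format.

REQUIRED_FIELDS = {"input", "expected_response", "cohort", "dataset_version"}

row = {
    "input": "Can I get a refund on an annual plan after 45 days?",
    "expected_response": "Annual plans are refundable within 30 days only.",
    "reference_context": "Refund policy v7: annual plans refundable within 30 days.",
    "cohort": "refund_policy",
    "dataset_version": "2026-05-07",
    "source_trace_id": "trace-8f2c",                   # provenance for audits
    "expected_tool_path": ["lookup_policy", "respond"],
    "review_status": "approved",
}

missing = REQUIRED_FIELDS - row.keys()
assert not missing, f"row is missing required fields: {missing}"
```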
A realistic workflow starts with a customer-support agent. The team promotes reviewed production traces and synthetic edge cases into a test dataset, while keeping prompt-tuning and validation rows in separate dataset versions. They attach GroundTruthMatch for canonical labels, Groundedness for answers that must use supplied context, and TaskCompletion for agent outcomes. Trace data from traceAI-langchain adds fields such as llm.token_count.prompt and agent.trajectory.step, which helps explain whether a failed row came from retrieval bloat, a bad tool decision, or the final response.
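A rough sketch of that triage, assuming a failed row's trace is available as a flat dict keyed by the attribute names above; the token budget, the trace shape, and the diagnose_failure helper are illustrative assumptions, not traceAI output parsing as shipped:

```python
# Illustrative triage of a failed row using trace attributes like those
# emitted by traceAI-langchain. The attribute names come from the text above;
# the threshold and the trace dict layout are assumptions for this sketch.

PROMPT_TOKEN_BUDGET = 6000  # assumed budget for this route

def diagnose_failure(trace: dict, expected_tool_path: list[str]) -> str:
    prompt_tokens = trace.get("llm.token_count.prompt", 0)
    steps = trace.get("agent.trajectory.step", [])
    tools_used = [s["tool"] for s in steps if "tool" in s]

    if prompt_tokens > PROMPT_TOKEN_BUDGET:
        return "retrieval_bloat"        # context stuffed past the token budget
    if tools_used != expected_tool_path:
        return "bad_tool_decision"      # trajectory diverged from the labeled path
    return "final_response_quality"     # blame the answer itself, not the plumbing

failed_trace = {
    "llm.token_count.prompt": 7421,
    "agent.trajectory.step": [{"tool": "web_search"}, {"tool": "respond"}],
}
print(diagnose_failure(failed_trace, ["lookup_policy", "respond"]))  # retrieval_bloat
```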
What the engineer does next depends on the slice. If the full pass rate rises but the refund_policy cohort drops below 0.92, the release is blocked and the failed rows become a regression eval. If failures concentrate in rows with long context, the team checks retrieval and token budget before changing the answer prompt. Unlike a Kaggle leaderboard split, a production LLM test set must preserve trace context, evaluator results, and row provenance so the same evidence can guide release gates, audits, and rollback decisions.
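A minimal sketch of that release gate, assuming row-level evaluator results are available as plain dicts with cohort and passed fields; the gate_release helper and the default threshold are illustrative, while the 0.92 floor for refund_policy mirrors the example above:

```python
# Cohort-level release gate: the aggregate pass rate can rise while a
# protected cohort regresses, so the gate checks each cohort separately.

from collections import defaultdict

COHORT_THRESHOLDS = {"refund_policy": 0.92}
DEFAULT_THRESHOLD = 0.85  # assumed floor for cohorts without a specific gate

def gate_release(results: list[dict]) -> tuple[bool, dict]:
    passed, total = defaultdict(int), defaultdict(int)
    for r in results:
        total[r["cohort"]] += 1
        passed[r["cohort"]] += int(r["passed"])

    pass_rates = {c: passed[c] / total[c] for c in total}
    blocked = {
        c: rate
        for c, rate in pass_rates.items()
        if rate < COHORT_THRESHOLDS.get(c, DEFAULT_THRESHOLD)
    }
    return (not blocked), {"pass_rates": pass_rates, "blocked_cohorts": blocked}

ok, report = gate_release([
    {"cohort": "refund_policy", "passed": False},
    {"cohort": "refund_policy", "passed": True},
    {"cohort": "simple_faq", "passed": True},
])
print(ok, report)  # False: refund_policy at 0.50 blocks the release
```

Because the gate reports per-cohort rates rather than one aggregate, the refund_policy regression described above cannot hide behind a rising average score.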
How to Measure or Detect Test Set Quality
Treat the test set itself as a measurement instrument and track:
- Holdout integrity: no row appears in training examples, prompt-tuning sets, validation sets, or optimizer feedback (a simple overlap check is sketched after this list).
- Coverage by cohort: each critical intent, locale, policy, user type, tool path, and failure mode has reviewed rows.
- `GroundTruthMatch` pass rate: returns evaluator results comparing outputs with trusted references; split by `dataset_version` and cohort.
- `Groundedness` failure rate: catches answers not supported by `reference_context`, which often reveals stale or incomplete test rows.
- Agent outcome score: `TaskCompletion` helps measure whether multi-step rows reached the intended goal, not only whether the final answer sounded right.
- Dashboard signal: track eval-fail-rate-by-cohort, score drift across versions, reviewer disagreement, and user-feedback proxies such as escalation rate.
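The holdout-integrity check can be approximated with a simple input-overlap scan. A minimal sketch, assuming test and tuning rows are plain dicts with an `input` field and that exact or near-exact string overlap is the leakage signal of interest:

```python
# Rough holdout-integrity check: flag test rows whose inputs also appear in
# tuning or validation splits. Normalization here is a simple lowercase and
# whitespace collapse; a real check might also catch near-duplicates.

def _norm(text: str) -> str:
    return " ".join(text.lower().split())

def leaked_rows(test_rows: list[dict], other_rows: list[dict]) -> list[dict]:
    seen = {_norm(r["input"]) for r in other_rows}
    return [r for r in test_rows if _norm(r["input"]) in seen]

tuning = [{"input": "Can I get a refund after 45 days?"}]
test = [
    {"input": "can i get a refund after 45 days?"},
    {"input": "How do I reset my password?"},
]
print(len(leaked_rows(test, tuning)))  # 1 leaked row
```

With the splits verified, dataset-level evaluators can be attached through the SDK surface described above: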
```python
from fi.datasets import Dataset
from fi.evals import GroundTruthMatch, Groundedness

# Load a specific, versioned test set so scores stay comparable across runs
dataset = Dataset.get("support-test-set", version="2026-05-07")

# Attach dataset-level evaluators: reference match and context grounding
dataset.add_evaluation(GroundTruthMatch())
dataset.add_evaluation(Groundedness())
```
Common Mistakes
- Reusing validation rows in the test set. This makes tuning decisions look like generalization and inflates release confidence.
- Sampling only successful traces. Clean paths miss refusals, escalations, retries, tool errors, and user corrections.
- Editing references without versioning. Score movement becomes impossible to attribute to model, prompt, label, or policy changes.
- Testing only final answers. Agent test sets also need expected tool paths, intermediate state, and outcome labels.
- Reporting one average score. Cohort failures for high-risk intents can disappear behind a good aggregate pass rate.
Frequently Asked Questions
What is a test set?
A test set is a held-back group of labeled examples used to measure whether an AI model, RAG pipeline, or agent works after development choices are made. It should stay separate from training, prompt tuning, and validation workflows.
How is a test set different from a validation set?
A validation set guides tuning, prompt selection, threshold setting, and model choice. A test set is reserved for final or recurring measurement after those choices are frozen, so it gives a cleaner estimate of production behavior.
How do you measure a test set with FutureAGI?
Use `sdk:Dataset` to version rows and attach evaluators such as `GroundTruthMatch`, `Groundedness`, and `TaskCompletion`. Track pass rate, eval-fail-rate-by-cohort, coverage, and score drift across dataset versions.