What Is a Test Set in Machine Learning?
A held-out portion of labelled data used only to estimate model generalization, never used during training or hyperparameter tuning.
A test set is the held-out portion of labelled data that the model never sees during training, validation, or hyperparameter tuning, used only to estimate generalization performance. It is the third partition of the canonical train/validation/test split — typically 10–20% of the labelled data, frozen the moment the project starts. In LLM evaluation, the equivalent is a golden dataset of inputs and reference answers used to score the model under realistic conditions. A good test set is representative, large enough to give tight confidence intervals, and never leaked into training.
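A minimal sketch of the canonical split in Python, assuming scikit-learn and toy data; the 80/10/10 proportions and the seed are illustrative, not prescriptive:
from sklearn.model_selection import train_test_split

# Toy labelled data; in practice these are your real inputs and labels.
inputs = [f"example {i}" for i in range(1000)]
labels = [i % 2 for i in range(1000)]

# Carve off 20% as a held-out pool, then split it into validation and test halves.
train_x, rest_x, train_y, rest_y = train_test_split(
    inputs, labels, test_size=0.2, random_state=42, stratify=labels)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.5, random_state=42, stratify=rest_y)

print(len(train_x), len(val_x), len(test_x))  # 800 100 100 -- test_x stays frozen until final evaluation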
Why It Matters in Production LLM and Agent Systems
A model that scores well in development and badly in production almost always has a test-set problem. The most common cause is leakage: a fine-tuning corpus pulls from the same source as the test set, so the model has effectively seen the answers. The second is non-representativeness: the test set is the easy slice of traffic, while production carries multilingual prompts, long contexts, and tool-call edge cases that the test set never sampled.
ML engineers feel this when training metrics climb and customer escalations stay flat. Applied-AI leads feel it when a benchmark score does not predict real-world TaskCompletion. SREs feel it when a model that passed the regression suite produces bad answers under load — because the test set never tested batched inference behavior. Compliance teams feel it during an audit, when “how was this evaluated” returns “on a static test set from 18 months ago.”
For 2026 agent stacks the test-set problem is harder. The unit being evaluated is no longer a single input/output pair — it is a trajectory of LLM calls, tool invocations, and handoffs. A test-set row for a multi-step agent must include the goal, the tool registry, the expected trajectory, and the success criteria. Testing only the final answer hides every middle-step failure. Building a test set therefore means designing both the data and the evaluator stack that grades it.
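One way to represent such a row is a plain dictionary; the field names and the refund scenario below are illustrative assumptions, not a fixed FutureAGI schema:
# Hypothetical shape of one agent test-set row: goal, allowed tools,
# expected trajectory, and explicit success criteria.
agent_test_row = {
    "goal": "Refund order 8841 and notify the customer by email",
    "tool_registry": ["lookup_order", "issue_refund", "send_email"],
    "expected_trajectory": [
        {"tool": "lookup_order", "args": {"order_id": "8841"}},
        {"tool": "issue_refund", "args": {"order_id": "8841"}},
        {"tool": "send_email", "args": {"template": "refund_confirmation"}},
    ],
    "success_criteria": {"refund_issued": True, "email_sent": True, "max_steps": 6},
}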
How FutureAGI Handles Test Sets
FutureAGI stores test sets as a Dataset — a versioned, replayable collection of rows that are graded by attached evaluators on every run. The team calls Dataset.add_evaluation(TaskCompletion) or Dataset.add_evaluation(Groundedness), the dataset runs end-to-end across the model under test, and per-row scores are stored against the dataset version so historical runs are diffable. Test sets can be loaded from CSV or JSON, imported from Hugging Face, or built from sampled production traces ingested via traceAI — converting real failures into permanent regression rows.
A real workflow: a financial-summarization team ships v3 of a fine-tuned 8B model. They run Dataset.add_evaluation(FactualConsistency) and Dataset.add_evaluation(IsConcise) against a 1,200-row test set composed of canonical earnings reports plus 200 sampled production traces. The aggregate score holds at 0.86, but eval-fail-rate-by-cohort for “10-K” filings rises by 11 points. The team flags the cohort, runs a regression eval against the prior model, and reverts before the production push. Without the test-set partition, the cohort regression would have shipped silently.
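The cohort-regression check in that workflow can be sketched in plain Python; the per-row result format and the 5-point threshold are assumptions, not FutureAGI internals:
from collections import defaultdict

def fail_rate_by_cohort(rows):
    # rows: per-row eval results carrying a cohort label and a pass/fail flag.
    totals, fails = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row["cohort"]] += 1
        fails[row["cohort"]] += 0 if row["passed"] else 1
    return {c: fails[c] / totals[c] for c in totals}

def regressed_cohorts(prev_rows, new_rows, threshold=0.05):
    # Flag cohorts whose fail rate rose by more than `threshold` between model versions.
    prev, new = fail_rate_by_cohort(prev_rows), fail_rate_by_cohort(new_rows)
    return {c: (prev.get(c, 0.0), rate) for c, rate in new.items()
            if rate - prev.get(c, 0.0) > threshold}

prev_rows = [{"cohort": "10-K", "passed": True}] * 90 + [{"cohort": "10-K", "passed": False}] * 10
new_rows = [{"cohort": "10-K", "passed": True}] * 79 + [{"cohort": "10-K", "passed": False}] * 21
print(regressed_cohorts(prev_rows, new_rows))  # {'10-K': (0.1, 0.21)} -- hold the release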
Unlike a static benchmark such as MMLU, which scores fixed multiple-choice questions, FutureAGI test sets are domain-specific, evaluator-graded, and updated continuously. The test set is your model’s contract with production, not a leaderboard entry.
How to Measure or Detect It
A test set is itself the measurement instrument. Watch the meta-signals:
- Test-set leakage rate: substring overlap between training corpus and test inputs; even 1% leakage inflates eval scores noticeably (a rough check is sketched after this list).
- Cohort balance: distribution of test-set rows across user personas, languages, intents, and tool calls — match production traffic, not the easy slice.
- TaskCompletion aggregate: 0–1 score across the trajectory for agent test sets; the canonical regression metric.
- Groundedness for RAG test sets: returns whether responses are anchored in retrieved context.
- Eval-fail-rate-by-cohort: percentage of test-set rows failing per cohort; reveals which slice degraded after a model or prompt change.
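The leakage-rate signal in the first bullet can be approximated with a crude exact-match overlap check; a minimal sketch, not FutureAGI’s implementation:
# Rough leakage check: count test inputs whose normalized text appears verbatim in the training corpus.
def normalize(text):
    return " ".join(text.lower().split())

def leakage_rate(train_texts, test_inputs):
    corpus = " ||| ".join(normalize(t) for t in train_texts)
    leaked = sum(1 for x in test_inputs if normalize(x) in corpus)
    return leaked / max(len(test_inputs), 1)

train_texts = ["Q3 revenue rose 12% year over year.", "The 10-K filing lists three risk factors."]
test_inputs = ["The 10-K filing lists three risk factors.", "Summarize the Q2 cash-flow statement."]
print(f"leakage rate: {leakage_rate(train_texts, test_inputs):.0%}")  # 50% -- rebuild the test set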
Minimal Python with the FutureAGI SDK:
from fi.evals import TaskCompletion, Groundedness
from fi.datasets import Dataset

# Load the frozen test-set version and attach the evaluators that grade each row.
ds = Dataset.load("agent-test-set-v3")
ds.add_evaluation(TaskCompletion())
ds.add_evaluation(Groundedness())

# Run the dataset end-to-end against the model under test and inspect per-cohort scores.
results = ds.run()
print(results.cohort_breakdown())
Common Mistakes
- Building the test set after the model. Test sets designed post-hoc tend to validate the model’s existing strengths; build the test set before you finish training.
- Refreshing the test set every sprint. Constant refresh resets your historical baseline; freeze versions and add new versions alongside, do not overwrite.
- Treating the validation set as the test set. Hyperparameter tuning on the validation set means you have already over-fit it; the test set must be untouched until final evaluation.
- Ignoring data drift. A test set captured 6 months ago may no longer match production traffic; sample new traces continuously into a parallel cohort.
- Aggregating only the mean. A 0.85 mean hides a 0.55 cohort; always break down by user, language, and route (see the sketch after this list).
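To make the last point concrete, a hedged sketch with pandas; the scores and cohort labels are invented to mirror the 0.85/0.55 example:
import pandas as pd

# A healthy overall mean can hide a weak cohort -- always aggregate per cohort as well.
rows = pd.DataFrame({
    "language": ["en"] * 90 + ["de"] * 10,
    "score":    [0.88] * 90 + [0.55] * 10,
})
print(f"overall mean: {rows['score'].mean():.2f}")   # 0.85 -- looks fine
print(rows.groupby("language")["score"].mean())      # de cohort at 0.55 -- not fine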
Frequently Asked Questions
What is a test set in machine learning?
A test set is held-out labelled data that the model never sees during training or tuning. It produces an unbiased estimate of generalization performance and is the third partition in the train/validation/test split.
How is a test set different from a validation set?
A validation set is used during training to tune hyperparameters and select model checkpoints. A test set is touched only once, after model selection is done, to give a final unbiased generalization estimate.
How do you build a test set for an LLM application?
FutureAGI stores LLM test sets as a Dataset of input/expected-output rows, attaches evaluators like TaskCompletion or Groundedness, and runs them as a regression eval on every release.