What Is a Test Case?
A single input-and-expected-outcome pair used to evaluate a model, agent, or pipeline against an evaluator's scoring function.
What Is a Test Case?
A test case is a single input-and-expected-outcome pair that an evaluator scores. For LLM applications, a test case is a row containing the prompt, any retrieved context, an optional reference answer, and any metadata the evaluator needs — user persona, tool list, expected schema. Hundreds of test cases form a test set; the test set plus its scoring rubric is what an “eval” actually runs against. In FutureAGI, a test case is a row inside a Dataset object: versioned, replayable, and graded by every attached evaluator on every regression run.
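As a rough illustration, a single test-case row for a retrieval-augmented assistant might look like the sketch below; the field names are illustrative of the shape just described, not a fixed FutureAGI column schema.

# One illustrative test-case row. Field names are examples, not a required schema.
test_case = {
    "input": "What is the notice period in the master services agreement?",
    "retrieved_context": ["Section 9.2: Either party may terminate with 60 days written notice."],
    "reference_answer": "60 days",
    "metadata": {"persona": "legal-ops", "expected_schema": None, "tools": []},
}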
Why It Matters in Production LLM and Agent Systems
Without explicit test cases, every release is a roll of the dice. A team ships a prompt edit, the model behaves differently on five edge cases nobody wrote down, and a customer files a ticket two days later. The fix is not “test more” in the abstract — it is “convert every reported failure into a test case.” A real regression suite grows by capturing each bug as a test case, attaching the right evaluator, and refusing to regress.
Engineers feel the absence in three ways. The on-call sees a Sentry error from a JSON-parse failure in a downstream service and has nowhere to add a JSONValidation test case. The product manager wants to A/B two prompts but has no held-out cohort to score them on. The applied-AI lead is asked, “are you sure this prompt edit didn’t break refusal behavior?” and has only vibes for an answer.
In 2026 agent stacks, test cases get richer. A single test case for a multi-step agent includes the user goal, a tool list, the expected trajectory, and the expected final state — not just an input string and a reference answer. Step-level evaluators like ToolSelectionAccuracy and GoalProgress need this shape. Treating a test case as a one-line prompt is the most common reason agent regressions get missed: the test case shape is wrong, so the evaluator never fires.
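A sketch of that richer shape, again with illustrative field names rather than a prescribed schema:

# An agent-level test case carries the goal, tool list, expected trajectory,
# and expected final state described above. Field names are illustrative.
agent_test_case = {
    "goal": "Refund the duplicate charge and email the customer a confirmation",
    "tools": ["lookup_order", "issue_refund", "send_email"],
    "expected_trajectory": ["lookup_order", "issue_refund", "send_email"],
    "expected_final_state": {"refund_issued": True, "customer_notified": True},
}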
How FutureAGI Handles Test Cases
FutureAGI’s approach is to make a test case a first-class row inside a Dataset. Each row carries the input, the expected output, retrieved context, and any structured metadata the evaluator consumes. The team calls Dataset.add_evaluation(Groundedness) or Dataset.add_evaluation(TaskCompletion) and every test case is scored, results are versioned by run, and prior runs are diffable. Datasets can be loaded from CSV, JSON, Hugging Face, or imported from production traces sampled via traceAI — turning a real customer failure into a permanent test case in a single workflow.
A practical example: a contracting team running a LangGraph agent samples 500 production traces with eval-fail-rate-by-cohort above threshold, exports them as test cases into a Dataset, attaches TaskCompletion and ToolSelectionAccuracy evaluators, and runs the suite against four candidate prompts. The output is a per-test-case score plus reason, aggregated by evaluator, surfaced as a regression-eval dashboard. When prompt v3 drops TaskCompletion from 0.82 to 0.74 on a specific cohort, the engineer can drill into the failing test cases, read the evaluator’s reason field, and fix the prompt before deploying.
Unlike Pytest fixtures, FutureAGI test cases are graded by probabilistic evaluators (judge models, embedding similarity, NLI metrics) rather than only by exact-equality assertions — which is what open-ended LLM output requires.
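To make the contrast concrete, here is a toy sketch: an exact-equality assertion fails on a correct but differently worded answer, while threshold-based grading passes it. A judge model, embedding similarity, or NLI metric would replace the toy string-overlap scorer in a real evaluator.

from difflib import SequenceMatcher

reference = "The notice period is 60 days."
output = "The notice period is sixty days."

# Exact-equality check, as a classic unit test would make: fails even though
# the answer is right, because open-ended LLM output rarely matches verbatim.
print(output == reference)  # False

# Threshold-based grading: score the output against the reference and pass
# above a cutoff instead of requiring an exact string match.
score = SequenceMatcher(None, output.lower(), reference.lower()).ratio()
print(score >= 0.8)  # True for this pair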
How to Measure or Detect It
Test-case quality drives every downstream eval signal. Watch for:
- Test-case coverage: percentage of production failure modes that exist as test cases in the dataset; aim for every Sentry error, escalation, or thumbs-down to become a test case.
- Dataset.add_evaluation coverage: which evaluators are attached to which test cases; missing evaluators mean unscored failure modes.
- Per-test-case score distribution: not just the mean — flag the worst-performing test cases and use them as regression anchors.
- Eval-fail-rate-by-cohort: percentage of test cases that fail per cohort (route, model, persona); the canonical regression alarm (a sketch of this computation follows the minimal Python example below).
- Reason-field clustering: group evaluator reason strings to find recurring failure patterns across test cases.
Minimal Python:
from fi.evals import TaskCompletion
from fi.datasets import Dataset

# Load the versioned test set, attach an evaluator, score every test case
ds = Dataset.load("agent-regression-v1")
ds.add_evaluation(TaskCompletion())
results = ds.run()
print(results.summary())
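Building on the minimal example, the sketch below shows how eval-fail-rate-by-cohort can be computed once per-test-case results are available as plain rows. The cohort, score, and reason field names are assumptions for illustration, not the documented FutureAGI result schema.

from collections import defaultdict

# Illustrative per-test-case result rows; field names are assumptions.
rows = [
    {"cohort": "enterprise", "score": 0.91, "reason": "all required clauses present"},
    {"cohort": "enterprise", "score": 0.44, "reason": "missed termination clause"},
    {"cohort": "smb", "score": 0.78, "reason": "tool call order acceptable"},
]

THRESHOLD = 0.7  # a score below this counts as a failing test case

def fail_rate_by_cohort(rows, threshold=THRESHOLD):
    totals, fails = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row["cohort"]] += 1
        if row["score"] < threshold:
            fails[row["cohort"]] += 1
    return {cohort: fails[cohort] / totals[cohort] for cohort in totals}

print(fail_rate_by_cohort(rows))  # e.g. {'enterprise': 0.5, 'smb': 0.0}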
Common Mistakes
- Writing test cases only for the happy path. The bugs live in edge cases — long inputs, multilingual prompts, malformed tool outputs, contradictory context.
- Skipping the reference answer. Reference-free metrics work for some evaluators, but high-stakes test cases (factual recall, schema compliance) need a gold answer.
- Letting the test set drift from production. A test set that does not get updated weekly turns into a stale benchmark within a quarter.
- Mixing training-set and test-set rows. If your fine-tuning data leaks into the test cases, your eval scores are inflated and the regression suite is silently broken (a quick overlap check is sketched after this list).
- One mega test case per scenario. A test case should test one behavior; if a single case mixes refusal, formatting, and tool selection, the failure is unattributable.
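A minimal way to catch that leakage is to hash normalized inputs from both sets and look for overlap. The sketch assumes each row is a dict with an "input" field, which is illustrative rather than a FutureAGI schema.

import hashlib

def row_key(row: dict) -> str:
    # Hash the normalized input text so the comparison is cheap and case-insensitive.
    return hashlib.sha256(row["input"].strip().lower().encode()).hexdigest()

def leaked_rows(train_rows: list[dict], test_rows: list[dict]) -> list[dict]:
    # Any test row whose input also appears in the training data is a leak.
    train_keys = {row_key(r) for r in train_rows}
    return [r for r in test_rows if row_key(r) in train_keys]

train = [{"input": "Summarize the attached MSA."}]
test = [{"input": "Summarize the attached MSA."}, {"input": "What is the notice period?"}]
print(len(leaked_rows(train, test)))  # 1 leaked row -> inflated eval scores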
Frequently Asked Questions
What is a test case?
A test case is a single input-and-expected-outcome pair fed to an evaluator. For LLM evaluation, it usually contains a prompt, optional context, an optional reference answer, and metadata the evaluator uses to grade the model's output.
How is a test case different from a test set?
A test case is one row; a test set is the full collection. Hundreds of test cases form a test set, which is then split or held out from training for evaluation.
How do you organize test cases for an LLM application?
FutureAGI exposes Dataset.add_evaluation() to attach evaluators like TaskCompletion or Groundedness to each test case, version the dataset, and replay regressions on every release.