What Is Selective Sampling?
A strategy where examples for labeling, training, or evaluation are chosen by uncertainty, novelty, or cohort membership rather than uniformly at random.
What Is Selective Sampling?
Selective sampling is a machine-learning strategy where the model — or a loop around it — chooses which examples to label, train on, or evaluate, instead of treating every datapoint equally. Selection criteria include uncertainty, disagreement between models, novelty against the existing dataset, and cohort under-representation. The goal is to spend a finite labeling, training, or evaluation budget on the examples that move metrics the most. In LLM evaluation it appears as smart trace sampling: you cannot evaluate 100% of production traffic, so you sample the slices most likely to surface regressions.
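As a minimal illustration of the uncertainty criterion, the sketch below ranks an unlabeled pool by the model's own confidence and keeps the least confident examples for labeling. The scikit-learn-style predict_proba interface, the threshold, and the budget are assumptions for the example, not part of any particular SDK:

import numpy as np

# Minimal sketch of uncertainty-based selection (illustrative only).
def select_uncertain(model, unlabeled_X, budget=100, threshold=0.6):
    proba = model.predict_proba(unlabeled_X)             # shape: (n_examples, n_classes)
    confidence = proba.max(axis=1)                       # top-class probability per example
    candidates = np.where(confidence < threshold)[0]     # low-confidence pool
    ranked = candidates[np.argsort(confidence[candidates])]
    return ranked[:budget]                               # spend the label budget on the least confident rows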
Why It Matters in Production LLM and Agent Systems
Random sampling wastes budget. If 99% of your traffic is the same five common intents and 1% is the long tail where failures hide, a uniform 5% sample evaluates the easy cases and misses the bugs. The pain is concrete: a customer-service LLM team runs evals on 5,000 random production traces a day, ships a prompt change, sees no metric movement, and gets paged a week later when a low-volume but high-stakes refund flow has been broken since deploy.
The cost dimension is just as sharp. Judge-model evaluators are not free — running Groundedness on every trace at GPT-4-class quality can add five figures a month. Engineering leaders feel the squeeze: either cut sample size and miss regressions, or hold sample size and watch eval costs eat the budget.
In 2026-era agent stacks the problem compounds. A single agent trace can produce ten LLM spans, each potentially eligible for evaluation. Uniform sampling at the trace level still produces too many spans to score. You need stratified sampling by route and model variant, oversampling of low-confidence or high-cost traces, and explicit cohorting for new releases. That is what selective sampling buys you: eval coverage that scales with risk, not with volume.
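A sampling policy like that can be expressed in a few lines before any evaluator runs. The sketch below uses plain Python with made-up field names (route, model, prescreen_score, release) and made-up rates; it is a shape to adapt, not any specific library's API:

import random

# Per-stratum sampling rates plus two overrides: keep every borderline
# pre-screen result, and oversample the freshly released cohort.
STRATUM_RATES = {("refund", "gpt-4o"): 1.00, ("faq", "gpt-4o-mini"): 0.05}
DEFAULT_RATE = 0.10
NEW_RELEASE_RATE = 0.25

def sample_traces(traces, seed=42):
    rng = random.Random(seed)                        # fixed seed -> reproducible eval cohorts
    selected = []
    for t in traces:
        if t.get("prescreen_score") is not None and t["prescreen_score"] < 0.5:
            selected.append(t)                       # always keep borderline pre-screen results
            continue
        if t.get("release") == "new":
            rate = NEW_RELEASE_RATE                  # oversample the new release cohort
        else:
            rate = STRATUM_RATES.get((t["route"], t["model"]), DEFAULT_RATE)
        if rng.random() < rate:
            selected.append(t)
    return selected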
How FutureAGI Handles Selective Sampling
FutureAGI’s approach is to expose selective sampling as a first-class step between trace ingestion and evaluation. Production traces flow into traceAI; before they hit the eval queue, you apply a sampling policy: stratify by route or model name, weight by trace cost, oversample any trace where a fast pre-screen evaluator (ProtectFlash, IsJson, Contains) already returned a borderline score, and explicitly include traces from a freshly deployed cohort. The selected slice lands as rows in a Dataset, where Dataset.add_evaluation() runs the heavier judges (Groundedness, TaskCompletion, HallucinationScore) only on the rows that earned their compute.
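A minimal sketch of that pre-screen gating step, assuming the class names above: the import paths, the evaluate() call, the score attribute, the thresholds, and the from_rows() constructor are illustrative assumptions, not the documented SDK surface.

from fi.evals import ProtectFlash, Groundedness
from fi.datasets import Dataset

prescreen = ProtectFlash()                       # fast, cheap pre-screen evaluator
borderline = []
for trace in production_traces:                  # traces already ingested via traceAI
    result = prescreen.evaluate(trace)           # assumed per-trace call returning a score
    if 0.3 < result.score < 0.7:                 # borderline band; thresholds are illustrative
        borderline.append(trace)

ds = Dataset.from_rows(borderline)               # assumed constructor; only the rows that earned compute
ds.add_evaluation(Groundedness())                # heavy judge runs on the selected slice only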
For training-time selective sampling — the active-learning case — FutureAGI’s annotation queue (fi.queues.AnnotationQueue) lets you push uncertain or model-disagreement examples to human annotators first, then loop the labeled examples back into a fine-tuning or regression eval workflow.
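In code, that active-learning loop can look roughly like the sketch below. AnnotationQueue and its progress attribute are named in this article; the constructor argument and push() method are assumed shapes, not documented signatures.

from fi.queues import AnnotationQueue

queue = AnnotationQueue(name="uncertain-rag-answers")    # assumed constructor argument
for example in disagreement_examples:                    # e.g., rows where two model variants disagree
    queue.push(example)                                  # humans label the high-information rows first
print(queue.progress)                                    # fraction labeled, with agreement scores
# Labeled rows then flow back into fine-tuning or a regression eval set.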
Concretely: a RAG team on traceAI-langchain samples 100% of traces that fail a fast ContextRelevance pre-check, 5% of traces from the dominant intent, and 25% from a freshly-released model variant. Across a million daily traces the eval cohort is around 30,000 — enough to catch regressions, cheap enough to run the slow judges, and cohort-balanced so the rare-but-critical failure modes still show up.
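The arithmetic behind a cohort of that size is easy to check. The traffic split below is a hypothetical that happens to land near 30,000; it is not data from the scenario above.

daily_traces = 1_000_000
prescreen_failures = int(0.003 * daily_traces)   # ~3,000 traces, kept at 100%
dominant_intent    = int(0.40  * daily_traces)   # 400,000 traces, sampled at 5%
new_variant        = int(0.03  * daily_traces)   # 30,000 traces, sampled at 25%

eval_cohort = prescreen_failures + int(0.05 * dominant_intent) + int(0.25 * new_variant)
print(eval_cohort)  # 30,500 rows per day for the slow judges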
How to Measure or Detect It
Selective sampling is a workflow choice, but you measure its effect:
- Eval coverage by cohort (dashboard signal): the percentage of each route/model-variant/intent cohort that lands in the eval set. Aim for floors per cohort, not just a global average.
- Eval-fail-rate-by-cohort: the canonical regression alarm; selective sampling makes it accurate by guaranteeing each cohort has enough rows.
- AnnotationQueue.progress: the fraction of an active-learning queue that has been labeled by humans, with agreement scores.
- Information gain per sample: a custom metric — change in eval-fail-rate variance per added trace — to confirm your sampling policy is not just dumping low-information rows into the dataset.
- Cost-per-failure-found: total judge-model spend divided by the number of distinct regressions caught; selective sampling should drive it down. A short sketch of this metric and coverage-by-cohort follows this list.
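As referenced above, here is a small sketch of coverage-by-cohort and cost-per-failure-found over plain dicts; the field names ("route", "in_eval_set") and the structure of the trace records are assumptions for the example.

from collections import Counter

def coverage_by_cohort(traces, cohort_key="route"):
    total = Counter(t[cohort_key] for t in traces)
    sampled = Counter(t[cohort_key] for t in traces if t.get("in_eval_set"))
    # Share of each cohort that landed in the eval set; check floors per cohort, not the average.
    return {cohort: sampled[cohort] / total[cohort] for cohort in total}

def cost_per_failure_found(total_judge_spend_usd, distinct_regressions_caught):
    if distinct_regressions_caught == 0:
        return float("inf")                  # no regressions found: the spend bought no signal
    return total_judge_spend_usd / distinct_regressions_caught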
Minimal Python:
from fi.evals import Groundedness
from fi.datasets import Dataset

# production_traces: traces already ingested via traceAI
ds = Dataset.from_traces(
    traces=production_traces,
    sampler="stratified",                  # stratify so every cohort keeps a floor
    strata=["route", "model"],
    oversample={"failed_prescreen": 1.0,   # keep every trace that failed the pre-screen
                "new_model_variant": 0.25},
)
ds.add_evaluation(Groundedness())          # heavy judge runs only on the sampled rows
Common Mistakes
- Sampling uniformly at random and calling it “evaluation”. Uniform sampling under-covers the long tail; bugs hide there.
- Oversampling failures without rebalancing. Your dashboard now shows 60% fail-rate because you only kept failures — meaningless for trend lines.
- Forgetting to fix the sampling seed. Non-reproducible eval cohorts make week-over-week comparisons noise.
- Selective sampling without cohort floors. A new low-volume route gets zero traces sampled and rolls out unevaluated.
- Treating selective sampling as a substitute for golden datasets. Production sampling catches drift; golden datasets catch regressions on canonical cases. You need both.
Frequently Asked Questions
What is selective sampling?
Selective sampling is the practice of choosing which examples to label, train on, or evaluate based on their expected information value rather than sampling uniformly at random.
How is selective sampling different from random sampling?
Random sampling treats every example as equally valuable. Selective sampling weights toward uncertain, novel, or under-represented examples so a fixed labeling budget produces more signal per dollar.
How do you do selective sampling for LLM evaluation?
FutureAGI lets you sample production traces by route, model variant, or evaluator score before pushing them into a Dataset for evaluation, so you spend judge-model compute on the rows most likely to fail.