What Is Variance?

The sensitivity of a model's predictions to its training sample; a primary axis in the bias-variance tradeoff.

Variance in machine learning is the sensitivity of a model’s predictions to the specific training sample it was fit on. A high-variance model gives very different answers when retrained on a slightly different dataset — the canonical signature of overfitting. It pairs with bias in the bias-variance tradeoff: low variance plus high bias is rigid; high variance plus low bias is unstable. For LLM systems, variance shows up as run-to-run answer divergence, cohort-specific eval-score swings, and unstable rankings between fine-tunes. FutureAGI surfaces this through regression evals across runs.
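
In classical terms, this is the variance term of the standard bias-variance decomposition of expected squared error at a point x, where f is the true function, f̂_D is the model fit on training set D, and σ² is irreducible noise:

```latex
\mathbb{E}_{D,\varepsilon}\left[(y - \hat{f}_D(x))^2\right]
  = \underbrace{\left(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\right)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}_D\left[\left(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\right)^2\right]}_{\text{Variance}}
  + \sigma^2
```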

Why It Matters in Production LLM and Agent Systems

A high-variance model is a nightmare to operate. Two near-identical fine-tunes deliver wildly different scores; a tiny corpus refresh shifts behavior; sampling temperature creates noise that dominates the eval signal you actually wanted to measure. Teams confuse genuine quality differences with variance and ship the wrong checkpoint.

Roles see different symptoms. ML engineers struggle to attribute eval-score deltas to model changes versus run noise. SREs see latency and quality metrics fluctuate without an obvious cause. Product owners experience demos that work brilliantly one day and fail the next. Compliance leads cannot point to a single canonical model behavior because the model behaves like a distribution.

In 2026 LLM and agent systems, two distinct kinds of variance matter. Training variance (the classical kind): how much would scores change if I retrained on a resampled training set? Inference variance: how much would scores change if I ran the same prompt N times at temperature > 0? A planner with high inference variance can pick different tools on different runs, producing trajectories that succeed sometimes and fail sometimes — a near-impossible bug to triage without quantifying the variance first. The fix is to lift variance from anecdote to a tracked metric per release.

How FutureAGI Handles Variance

FutureAGI’s approach is to make variance a measurable signal rather than an unspoken nuisance. Run-to-run variance shows up directly in regression evals: when you re-run fi.evals.AnswerRelevancy against the same Dataset with the same model and prompt at temperature > 0, the spread across runs is the inference variance. Across model checkpoints, the same evaluator gives you cross-run training variance.

Concretely: a team fine-tunes three checkpoints from the same base with different random seeds and runs each against a 250-row golden Dataset. fi.evals.AnswerRelevancy returns per-row scores; FutureAGI aggregates mean ± stddev across the three checkpoints. A small spread on the validation cohort but a large spread on a production-sampled cohort reveals that the model is unstable on real-world distribution — a variance problem masked by a clean golden set. The mitigation: more diverse training data, lower temperature for inference, or self-consistency sampling at the response layer. Unlike a typical “single eval-score” workflow, FutureAGI treats N-run distributions as the unit of measurement, so variance is visible.
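
That aggregation can be sketched with the standard-library statistics module; the checkpoint names and per-row scores below are invented for illustration, not real FutureAGI output:

```python
import statistics

# Hypothetical per-row AnswerRelevancy scores for three fine-tune seeds
# run against the same golden Dataset (values invented for illustration).
checkpoint_scores = {
    "seed-1": [0.91, 0.88, 0.90],
    "seed-2": [0.90, 0.87, 0.92],
    "seed-3": [0.62, 0.95, 0.79],
}

# Mean score per checkpoint, then the spread of those means:
# a checkpoint-to-checkpoint stddev is the training-variance signal.
means = {ckpt: statistics.mean(s) for ckpt, s in checkpoint_scores.items()}
spread = statistics.stdev(means.values())

print(f"mean per checkpoint: {means}")
print(f"training-variance signal: {spread:.3f}")
```

Here seed-3's outlier rows drag its mean down, so the spread across checkpoints is large even though two of the three fine-tunes agree closely.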

How to Measure or Detect It

Signals to track:

  • Run-to-run score stddev: re-run the same eval N times; high stddev = high inference variance.
  • Checkpoint-to-checkpoint score stddev: same eval, different fine-tune seeds; high stddev = high training variance.
  • fi.evals.AnswerRelevancy distribution: per-row score histogram, not just the mean.
  • Self-consistency spread: sample the same prompt 5x and score with EmbeddingSimilarity to quantify response divergence.
  • Cohort-specific stddev: variance on the golden set vs production cohort — divergence indicates distribution-mismatch variance.

For example, the first signal as a sketch (`model`, `q`, and the exact `fi.evals` call signature are placeholders for your own client, query, and evaluator setup):

```python
from fi.evals import AnswerRelevancy
import statistics

scores = []
for _ in range(5):
    # Same prompt, same model, temperature > 0: the spread across
    # these runs is pure inference variance.
    out = model.generate(input=q, temperature=0.7)
    scores.append(AnswerRelevancy().evaluate(input=q, output=out).score)

print("variance signal:", statistics.stdev(scores))
```
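
The self-consistency-spread signal can also be approximated without an evaluator. This sketch uses a toy bag-of-words cosine similarity as a stand-in for an embedding model; a real pipeline would score pairs with EmbeddingSimilarity or a sentence embedder:

```python
import itertools
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def consistency_spread(responses: list[str]) -> float:
    # Mean pairwise similarity across sampled responses;
    # lower spread = higher inference variance.
    pairs = list(itertools.combinations(responses, 2))
    return sum(cosine(embed(x), embed(y)) for x, y in pairs) / len(pairs)

print(consistency_spread(["Paris", "Paris", "Paris is the capital"]))
```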

Common Mistakes

  • Reporting a single mean. A mean of 0.80 across 5 runs hides whether scores were all 0.80 or three 1.0s and two 0.5s; report stddev.
  • Confusing temperature variance with quality variance. Set temperature to 0 to remove sampling noise; any variance that remains across runs or checkpoints is attributable to training, not decoding.
  • Ignoring cohort-specific variance. Stable on golden, unstable on production = real-world variance the team has not measured.
  • Adding variance reduction without bias awareness. Heavy regularization can lower variance but raise bias; use the tradeoff plot, not a single knob.
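
The tradeoff in that last point can be demonstrated with a toy simulation, not an LLM workflow: a rigid estimator (predict the dataset mean everywhere; high bias, low variance) versus a flexible one (1-nearest-neighbour; low bias, high variance), each retrained on 200 fresh samples of the same noisy target:

```python
import random
import statistics

random.seed(0)

def sample_dataset(n=30):
    # Noisy samples of the true function f(x) = x^2 on [-1, 1].
    xs = [random.uniform(-1, 1) for _ in range(n)]
    return [(x, x * x + random.gauss(0, 0.1)) for x in xs]

def predict_mean(data, x):
    # Rigid model: ignores x entirely (high bias, low variance).
    return statistics.mean(y for _, y in data)

def predict_1nn(data, x):
    # Flexible model: copies the nearest point (low bias, high variance).
    return min(data, key=lambda p: abs(p[0] - x))[1]

# Retrain each model 200 times and watch its prediction at x0 move.
x0 = 0.9
mean_preds = [predict_mean(sample_dataset(), x0) for _ in range(200)]
nn_preds = [predict_1nn(sample_dataset(), x0) for _ in range(200)]

print("variance (rigid):   ", statistics.variance(mean_preds))
print("variance (flexible):", statistics.variance(nn_preds))
```

The rigid model's predictions barely move across retrainings but sit far from the truth f(0.9) = 0.81; the 1-NN predictions center near 0.81 but scatter much more. Regularization slides a model along this axis rather than improving both ends at once.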

Frequently Asked Questions

What is variance in machine learning?

Variance is the sensitivity of a model's predictions to the specific training data sample. A high-variance model gives very different predictions when retrained on slightly different data, the canonical signature of overfitting.

How is variance different from bias?

Bias is the gap between average prediction and ground truth — wrong on average. Variance is the spread of predictions across retrainings — unstable. The bias-variance tradeoff says reducing one usually raises the other.

How do you reduce variance in an LLM workflow?

FutureAGI runs `fi.evals` regression evals across model checkpoints; high run-to-run score variance signals over-tuning. Mitigations include more training data, regularization, ensembling, or self-consistency sampling at inference time.
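
As a sketch of that last mitigation, self-consistency at inference time can be as simple as a majority vote over sampled answers; `sample_fn` here is a hypothetical stand-in for a temperature > 0 model call:

```python
import itertools
from collections import Counter

def self_consistency(sample_fn, prompt: str, n: int = 5) -> str:
    """Sample n answers and keep the most common one,
    trading latency for lower run-to-run variance."""
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy demo: a "model" that answers inconsistently.
fake = itertools.cycle(["42", "42", "41"]).__next__
print(self_consistency(lambda p: fake(), "What is 6 * 7?"))  # majority answer: "42"
```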