What Is the Binomial Distribution?
A discrete probability distribution describing the number of successes in N independent trials, each with success probability p.
The binomial distribution is a discrete probability model for counting successes across N independent trials when each trial has the same success probability p. In LLM evaluation, each pass/fail evaluator result is one Bernoulli trial, so the number of passing traces in a cohort follows a binomial pattern. FutureAGI uses that framing to attach confidence intervals, sample-size plans, and regression-test decisions to pass rates instead of treating a single eval percentage as certainty.
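As a quick sketch of the underlying math, the probability of exactly k passes in N trials is C(N, k) * p^k * (1 - p)^(N - k), and scipy exposes this directly. The 0.85 pass rate and 200-row cohort below are illustrative numbers, not FutureAGI defaults:
from scipy.stats import binom
# P(exactly 170 passes out of 200) when the true pass rate is 0.85
print(binom.pmf(k=170, n=200, p=0.85))
# P(170 or fewer passes): the tail probability behind "is this drop noise?"
print(binom.cdf(k=170, n=200, p=0.85))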
Why It Matters in Production LLM and Agent Systems
Most LLM eval results are binary at the surface — JSONValidation returns pass/fail, AnswerRefusal returns refused/answered, IsCompliant returns yes/no. When you average those over a cohort you get a pass rate, and what makes it interpretable is knowing the variance of that pass rate. Without a binomial framing, teams over-react to a 2% drop on a 100-row cohort and under-react to a 0.5% drop on a 10,000-row cohort — even though the second is statistically much stronger evidence of regression.
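The arithmetic behind that asymmetry is the binomial standard error, sqrt(p * (1 - p) / n). A minimal sketch, assuming a 0.9 baseline pass rate:
import math
p = 0.9  # assumed baseline pass rate
for n in (100, 10_000):
    print(n, round(math.sqrt(p * (1 - p) / n), 4))
# n=100    -> SE ~0.030, so a 2% drop is well under one standard error
# n=10,000 -> SE ~0.003, so a 0.5% drop is roughly 1.7 standard errors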
The pain shows up across roles. A platform engineer ships a “100% pass rate on 50 examples” prompt change and is surprised when production fail rate climbs to 6% — small samples plus high p have huge confidence intervals. A product lead approves an A/B test that compares two prompts on 200 traces each and finds “no significant difference” — at that sample size, only effects above 7 percentage points would have been detected. A compliance reviewer asks for an eval-fail-rate confidence interval for an audit and gets a single number.
In 2026 agent stacks the issue compounds. A trajectory has 10 step-level evaluators; each is a Bernoulli with its own p. Trajectory-success is the joint, which behaves binomially only if steps are independent — they rarely are. Modeling each step as binomial and reporting confidence intervals per step is what separates real regression detection from anecdotes.
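A small simulation makes the independence caveat concrete; the step count, pass rate, and noise scale below are all assumptions chosen for illustration:
import numpy as np

rng = np.random.default_rng(0)
n_traj, n_steps, p_step = 10_000, 10, 0.95

# Independent steps: a trajectory succeeds iff all 10 steps pass
indep = (rng.random((n_traj, n_steps)) < p_step).all(axis=1)

# Correlated steps: a shared per-trajectory difficulty shifts every step's p
difficulty = rng.normal(0.0, 0.05, size=(n_traj, 1))
corr = (rng.random((n_traj, n_steps)) < np.clip(p_step - difficulty, 0, 1)).all(axis=1)

print(indep.mean())  # close to 0.95**10, about 0.60
print(corr.mean())   # noticeably higher: failures cluster on hard trajectories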
How FutureAGI Handles the Binomial Distribution
FutureAGI does not surface “binomial distribution” as a UI element; the distribution shows up wherever pass/fail counts and confidence intervals do. FutureAGI’s approach is to keep the statistical claim attached to the exact cohort, evaluator, trace slice, and release decision. Three places matter most. First, the eval dashboard reports eval_pass_rate per cohort with a Wilson 95% confidence interval — the de facto small-sample binomial CI — so you can see whether a 4% drop on a 200-row cohort is inside the noise band. Second, regression-eval gates in Dataset.add_evaluation use a binomial test against the previous run’s pass rate; small cohorts produce wide CIs, and the gate refuses to block a release on insufficient evidence. Third, A/B comparison between two prompt versions can use traffic mirroring to collect comparable production traces before a two-proportion z-test reports a p-value and CI rather than a point estimate.
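For intuition about what that Wilson band looks like, here is a hand-rolled version (the 0.88 baseline and 200-row cohort are made-up numbers, and this sketch is not the platform’s internal code):
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# A 4% drop from an 0.88 baseline on a 200-row cohort: 168/200 = 0.84
print(wilson_ci(168, 200))  # ~(0.78, 0.88): the old rate is still inside the band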
A real workflow: a team is rolling out a new system prompt for a support agent. The pre-existing prompt has TaskCompletion pass rate 0.78, while JSONValidation tracks whether tool responses keep the expected schema. A 200-row pre-release run on the new prompt scores 0.83. Naive math says ship; the binomial-derived CI says the difference (95% CI: -0.01 to +0.11) overlaps zero. FutureAGI flags this as inconclusive, the team runs another 800 evaluations, the pass rate stabilizes at 0.81 (95% CI: 0.78 to 0.84), and the rollout decision is made on real evidence rather than on the first 200 rows.
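To reproduce that inconclusive verdict, here is a plain two-proportion z-test; the baseline cohort size of 1,000 is an assumption chosen to roughly match the interval quoted above:
import numpy as np
from scipy.stats import norm

k1, n1 = 780, 1000   # old prompt: 0.78 pass rate (assumed cohort size)
k2, n2 = 166, 200    # new prompt: 0.83 on the 200-row pre-release run
p1, p2 = k1 / n1, k2 / n2

# z-test with pooled SE under H0: equal pass rates
p_pool = (k1 + k2) / (n1 + n2)
z = (p2 - p1) / np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# 95% CI on the difference (unpooled SE); overlaps zero, so: inconclusive
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo, hi = (p2 - p1) - 1.96 * se, (p2 - p1) + 1.96 * se
print(f"z={z:.2f}, p={2 * norm.sf(abs(z)):.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")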
Unlike spreadsheet eval reports or a naive normal-approximation interval with no cohort context, FutureAGI ties the pass-rate CI back to the dataset row, the prompt version, llm.token_count.prompt, and the model variant — so a regression alert has the cohort attached.
How to Measure or Detect It
Pick the right binomial calculation for the question you are asking:
- Wilson CI on pass rate — the default for small cohorts; tighter than naive normal-approximation CIs.
- Two-proportion z-test — for “is prompt B different from prompt A?”; assumes N large enough that the normal approximation holds.
- Exact binomial test — for very small N (under 30 per arm); avoids the normal-approximation breakdown.
- Sample-size formula — n = (z_alpha + z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p1 - p2)^2; the canonical pre-release planner (see the sketch after this list).
- Eval-fail-rate-by-cohort with CI bands — dashboard-level signal with noise bands so 1–2% jitter is visible as noise, not regression.
- Beta-binomial smoothing — when cohort sizes vary, apply a beta prior so 0/3 doesn’t show as 0% with a 0% CI.
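The sample-size formula above translates directly into code; a sketch assuming a two-sided test at the usual 95% confidence and 80% power:
import math
from scipy.stats import norm

def per_arm_n(p1, p2, alpha=0.05, power=0.80):
    """Per-arm cohort size to detect a p1 -> p2 shift (two-sided, normal approx.)."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return math.ceil((z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

print(per_arm_n(0.80, 0.85))  # ~900 rows per arm for a 5-point difference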
Minimal Python:
from fi.evals import JSONValidation
from scipy.stats import binomtest
schema_eval = JSONValidation(schema={"type": "object"})
passes, total = 184, 220  # pass/fail counts tallied from schema_eval over the cohort
# Test H0: true pass rate equals 0.85 (e.g., the previous release's rate)
result = binomtest(k=passes, n=total, p=0.85, alternative="two-sided")
print(result.proportion_ci(confidence_level=0.95, method="wilson"))
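For the beta-binomial smoothing mentioned above, a uniform Beta(1, 1) prior is one simple choice (an illustrative assumption, not a documented FutureAGI default):
from scipy.stats import beta

k, n = 0, 3                           # 0 passes out of 3 rows
posterior = beta(1 + k, 1 + (n - k))  # Beta(1, 1) prior updated with the counts
print(posterior.mean())               # 0.2 rather than a hard 0%
print(posterior.interval(0.95))       # a wide, non-degenerate credible interval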
Common Mistakes
- Reporting a pass rate without a CI. “92% pass” on 50 examples and on 5,000 examples are very different claims; always report N alongside.
- Comparing two arms without a significance test. Eyeball comparisons of 0.78 vs 0.83 ignore that the difference may be inside the binomial noise.
- Assuming independence inside a trajectory. Step-level evaluators in an agent are correlated; using vanilla binomial CIs on trajectory success understates uncertainty.
- Powering A/Bs to detect “any” effect. Without an effect-size target, you will run forever or stop early on noise.
- Using the normal approximation for tiny p or small N. Use Wilson or exact binomial when p is near 0 or 1, or when N is under a few hundred.
Frequently Asked Questions
What is the binomial distribution?
The binomial distribution gives the probability of getting exactly k successes in N independent trials, each with success probability p. It is the foundation of pass-rate confidence intervals.
How is the binomial distribution used in LLM evaluation?
Every pass/fail eval result is a Bernoulli trial. Cohort fail rates follow a binomial, so confidence intervals and A/B significance for two prompts are computed against the binomial.
How do you size an LLM eval cohort?
Use the binomial-derived sample size for the difference you need to detect. For a 5% absolute difference at 95% confidence and 80% power, plan around 600 to 1,200 evaluations per arm.