What Is Ensemble Learning?

A machine-learning paradigm that trains multiple base models and combines their predictions — via bagging, boosting, or stacking — for a more accurate or lower-variance output.

Ensemble learning is a modeling technique that combines multiple base models or model calls into one prediction so the output is more accurate, lower-variance, or better-calibrated than any single member. In production LLM systems, it appears as bagging, boosting, stacking, mixture-of-experts routing, self-consistency sampling, and jury-of-models evaluation. In FutureAGI workflows, teams evaluate ensemble outputs, member disagreement, and calibration drift before release.

Why Ensemble Learning Matters in Production LLM and Agent Systems

Single models — classical or LLM — have variance. The same input can give a different answer on two runs of the same LLM, and a single tabular model can underfit a feature interaction another model captures cleanly. Ensemble learning is the canonical variance-reduction and bias-reduction technique. It buys reliability where reliability matters: fraud scoring that lifts F1 from 0.91 to 0.94, an LLM judge that drops standard deviation from 0.18 to 0.06, a multi-retriever RAG that picks up rare-domain queries a single embedding model misses.
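The variance-reduction claim is easy to verify numerically. A minimal, dependency-free sketch with simulated independent scorers (all numbers here are synthetic, not real model runs):

```python
import random
import statistics

random.seed(0)
TRUE_SCORE = 0.80  # ground truth each noisy scorer is estimating

def noisy_scorer() -> float:
    # One model run: an unbiased but noisy estimate of the true score
    return TRUE_SCORE + random.gauss(0, 0.10)

single = [noisy_scorer() for _ in range(2000)]
ensemble = [statistics.mean(noisy_scorer() for _ in range(5)) for _ in range(2000)]

# Averaging 5 independent members shrinks the spread by roughly sqrt(5)
print(f"single-member stdev: {statistics.stdev(single):.3f}")
print(f"5-member mean stdev: {statistics.stdev(ensemble):.3f}")
```

The ensemble's spread drops by about a factor of √5 while the mean stays on target, which is exactly the reliability the paragraph above describes.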

The pain shows up unevenly. An ML engineer at a credit-risk shop watches a single GBM stagnate; switching to a stacked ensemble of GBM, logistic regression, and a small MLP buys the regulator’s required 3-point lift in PR-AUC. An LLM evaluation engineer running nightly regressions sees the same answer score 0.6 on Monday and 0.85 on Wednesday because of judge non-determinism — a 3-judge jury collapses the noise. A retrieval team finds a multi-retriever ensemble (BM25 + dense + ColBERT) recovers 8 points of recall on rare domains a single retriever misses.
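The 3-judge jury mentioned above can be sketched as a median over independent judge calls. The judge functions here are stand-ins for real LLM evaluator calls, not an actual API:

```python
import statistics
from typing import Callable

def jury_score(answer: str, judges: list[Callable[[str], float]]) -> float:
    """Median of independent judge scores; robust to one outlier judge run."""
    scores = [judge(answer) for judge in judges]
    return statistics.median(scores)

# Stand-in judges: in production these would be separate LLM evaluator calls.
# One judge returns the noisy 0.60 from the Monday run in the scenario above.
judges = [lambda a: 0.60, lambda a: 0.85, lambda a: 0.82]
print(jury_score("candidate answer", judges))  # → 0.82, the median damps the outlier
```

Median aggregation is preferable to the mean here because a single wild judge run shifts the mean but leaves the median untouched.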

In 2026 stacks, ensemble learning runs at multiple layers: ensemble retrievers, mixture-of-experts decoders, multi-judge evaluators, multi-agent systems with planner-critic-verifier roles. Each ensemble layer is a calibration problem; without per-member visibility, regressions hide inside the average.

How FutureAGI Handles Ensemble Learning

FutureAGI does not train classical ensembles itself; it evaluates ensemble outputs and exposes the traces needed to debug them. AggregatedMetric combines multiple metric evaluators into one release score, while Dataset.add_evaluation records per-release deltas for each model member. When the ensemble sits behind an LLM gateway, controls such as traffic mirroring, fallback, and exact/semantic cache let teams compare a candidate route against the current route before moving production traffic.

FutureAGI’s approach is to treat ensemble reliability as a per-member observation problem, not a single leaderboard number. A fraud-scoring team might train an ensemble of LightGBM, logistic regression, and a small MLP via stacking. They register predictions against a versioned fi.datasets.Dataset in FutureAGI and run Dataset.add_evaluation with AggregatedMetric on every release; the dashboard shows per-member contribution, ensemble PR-AUC, calibration error, and eval-fail-rate-by-cohort. When a new feature pipeline silently degrades the LightGBM member, FutureAGI flags the regression at the member level, not just the ensemble.
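The stacking setup described above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, with LightGBM swapped for scikit-learn's GradientBoostingClassifier so it runs without extra dependencies:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# GBM + logistic regression + small MLP, combined by a logistic meta-learner
stack = StackingClassifier(
    estimators=[
        ("gbm", GradientBoostingClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold base predictions feed the meta-learner, avoiding leakage
)
stack.fit(X_tr, y_tr)
print(f"stacked accuracy: {stack.score(X_te, y_te):.3f}")
```

The `cv=5` setting matters: training the meta-learner on in-fold base predictions would leak labels and overstate the ensemble's lift.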

For LLM-side ensembles, the same team’s RAG application can send langchain traceAI spans from each candidate answer and run judge-style evaluation on the final choice. Compared with Weights & Biases or MLflow logging, which mainly records training metrics and artifacts, FutureAGI’s regression layer records output-side metrics on the production dataset, so calibration drift on a deployed ensemble surfaces before users notice.

How to Measure Ensemble Learning

Score ensembles on lift, calibration, and member health:

  • AggregatedMetric — combines several evaluator scores into one ensemble-health score while keeping component scores visible.
  • Dataset.add_evaluation — attaches release evaluations to a versioned Dataset so ensemble deltas are comparable over time.
  • Lift over best base — accuracy delta between ensemble and strongest single member; the only honest reason to ensemble.
  • Calibration error (ECE) — does predicted probability match observed frequency? Ensembles can sharpen but miscalibrate.
  • Per-member health — track each base model’s score independently; an ensemble that hides a regressing member is unsafe.
  • Gateway route checks — use traffic mirroring before a new ensemble route and fallback when disagreement crosses a threshold.
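Two of these metrics can be computed from predictions alone. A dependency-free sketch of expected calibration error (ECE) and lift over the best base member (function names are illustrative, not a FutureAGI API):

```python
def ece(probs: list[float], labels: list[int], n_bins: int = 10) -> float:
    """Expected calibration error: per-bin gap between mean predicted
    probability and observed accuracy, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        err += len(b) / total * abs(avg_p - acc)
    return err

def lift_over_best_base(ensemble_acc: float, member_accs: list[float]) -> float:
    """The only honest reason to ensemble: accuracy beyond the strongest member."""
    return ensemble_acc - max(member_accs)

print(ece([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 1]))
print(lift_over_best_base(0.94, [0.91, 0.89, 0.90]))
```

A perfectly calibrated predictor scores an ECE of 0; a sharp-but-miscalibrated ensemble can have high accuracy and high ECE at the same time, which is why the bullet above insists on computing both.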
A minimal release gate wiring these pieces together:

from fi.datasets import Dataset
from fi.evals import Groundedness, TaskCompletion, AggregatedMetric

# Versioned golden set keeps release-to-release deltas comparable
golden = Dataset.get("ensemble-release", version="v12")
result = golden.add_evaluation(
    # Equal-weight blend of two evaluators into one ensemble-health score
    AggregatedMetric([Groundedness(), TaskCompletion()], weights=[0.5, 0.5]),
    threshold=0.86,  # release gate: the run fails below this aggregate score
)
print(result.score, result.passed)
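The gateway route check in the last bullet can be sketched as a simple disagreement gate. Everything here is illustrative, not a real gateway API — the route names, member scores, and threshold are placeholders:

```python
def route_with_fallback(scores_by_member: dict[str, float],
                        disagreement_threshold: float = 0.15) -> str:
    """Send traffic to the candidate ensemble route unless the members
    disagree too much, in which case fall back to the current route."""
    spread = max(scores_by_member.values()) - min(scores_by_member.values())
    return "candidate-route" if spread <= disagreement_threshold else "fallback-route"

# Members agree: spread 0.05 stays under the threshold
print(route_with_fallback({"judge_a": 0.82, "judge_b": 0.85, "judge_c": 0.80}))
# One member dissents hard: spread 0.37 trips the fallback
print(route_with_fallback({"judge_a": 0.82, "judge_b": 0.45, "judge_c": 0.80}))
```

Spread (max minus min) is the crudest disagreement signal; variance or pairwise rank correlation works the same way once member scores are logged per request.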

Common Mistakes

  • Ensembling correlated members. Three GPT-class judges or three GBMs on the same features share biases; ensembling them trims a little variance but cannot correct the shared bias, so accuracy barely moves.
  • Reporting only the ensemble metric. Member-level dashboards are how you catch silent regressions hidden inside the average.
  • Skipping calibration. Bagging often improves calibration; boosting often makes it worse. Always compute ECE post-ensemble.
  • Adding members for cost-blind lift. A fourth judge that adds 0.5 points of accuracy at 30% extra latency may not be worth it.
  • Treating LLM jury as classical ensemble. Self-consistency and jury-of-models behave differently from bagging — they sample the same model rather than train independent ones.

Frequently Asked Questions

What is ensemble learning?

Ensemble learning is a machine-learning paradigm that trains multiple base models and combines their predictions — via bagging (Random Forest), boosting (XGBoost), or stacking — for an output that is more accurate or lower-variance than any one base model.

What is the difference between bagging, boosting, and stacking?

Bagging trains models independently on bootstrapped samples and averages predictions. Boosting trains models sequentially, each one focusing on the errors of the prior. Stacking trains a meta-learner that combines base-model outputs as features.
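The bagging half of that distinction fits in a few lines: bootstrap-resample the training data, fit one model per sample, and average the predictions. The "base model" here is a trivial mean predictor, purely for illustration:

```python
import random
import statistics

random.seed(0)

def fit_mean_model(sample: list[float]) -> float:
    # Stand-in "base model": predicts the sample mean (purely illustrative)
    return statistics.mean(sample)

def bagged_predict(data: list[float], n_models: int = 50) -> float:
    models = []
    for _ in range(n_models):
        # Bootstrap: resample the training data with replacement
        boot = [random.choice(data) for _ in data]
        models.append(fit_mean_model(boot))
    # Bagging: average the independently trained models
    return statistics.mean(models)

data = [0.2, 0.4, 0.6, 0.8, 1.0]
print(bagged_predict(data))
```

Boosting would instead train the models sequentially, reweighting toward the examples the previous model got wrong, and stacking would feed all the base predictions into a meta-learner as features.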

How does ensemble learning apply to LLMs?

LLM applications use the same idea in jury-of-models evaluation, mixture-of-experts routing, and self-consistency sampling. FutureAGI scores those systems with AggregatedMetric, Dataset.add_evaluation, and disagreement checks before release.
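Self-consistency sampling, the simplest of those three, is a majority vote over multiple sampled answers from the same model. A minimal sketch, with hard-coded strings standing in for real temperature-sampled LLM outputs:

```python
from collections import Counter

def self_consistency(samples: list[str]) -> str:
    """Majority vote over multiple sampled answers from the *same* model --
    a sampling-based analogue of an ensemble, not independently trained members."""
    return Counter(samples).most_common(1)[0][0]

# Stand-in for five temperature>0 samples of one LLM on one question
samples = ["42", "42", "41", "42", "40"]
print(self_consistency(samples))  # → "42"
```

In practice the vote is taken over the final answer extracted from each sampled chain of thought, which is why it helps most on tasks with a short, checkable answer.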