What Is Juries of Models Metric?
An LLM-evaluation method that aggregates scores from multiple judge models — typically across model families — into a consensus score per row.
The juries-of-models metric is an LLM evaluation method that aggregates scores from three or more judge models, usually from different model families, into one consensus score per evaluated row. Aggregation can use mean, median, majority vote, or trimmed mean. The metric reduces single-judge bias, length preference, and self-evaluation inflation in eval pipelines. FutureAGI teams use it for refusal calibration, factual-accuracy checks, and agent-trajectory scoring where one judge’s quirks could mis-grade a release.
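The choice of aggregation rule is worth making explicit. Below is a minimal sketch of the common rules in plain Python, on an illustrative three-juror score list; the 0.5 pass threshold for majority vote is an assumption, not a fixed convention:

```python
import statistics

def trimmed_mean(scores):
    # Drop the single highest and lowest score before averaging (needs 3+ jurors).
    return statistics.mean(sorted(scores)[1:-1])

def majority_vote(scores, pass_threshold=0.5):
    # Each juror casts a pass/fail vote; the consensus is whichever side has the majority.
    votes = [s >= pass_threshold for s in scores]
    return 1.0 if sum(votes) > len(votes) / 2 else 0.0

scores = [0.9, 0.85, 0.2]                      # three jurors, one dissenter
consensus_mean = statistics.mean(scores)       # 0.65, pulled down by the dissenter
consensus_median = statistics.median(scores)   # 0.85, robust to a single outlier
```

Median and trimmed mean tolerate one drifting juror; mean is the most sensitive to it, which is why the common-mistakes list below warns against mean for bimodal rubrics.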
Why Juries of Models Metric Matters in Production LLM and Agent Systems
A single LLM-as-a-judge is correlated with the model that generated the candidate output. If the generator is GPT-4o and the judge is also GPT-4o, the judge consistently overrates verbose, hedged answers and underrates terse, accurate ones — the well-documented self-evaluation inflation problem. The fix is not “switch the judge to a different model”; that just trades one set of biases for another. The fix is a jury.
ML engineers feel this when they ship a prompt change that improves the single-judge score by 4 points but produces user complaints. SREs see eval-score-vs-user-feedback divergence on a dashboard. Compliance teams refuse to accept a single-judge eval as evidence of safety performance because no auditor signs off on a one-source measurement. Product managers see A/B tests that “win” on judge metrics but lose on retention.
In 2026 agent stacks, the cost of single-judge bias is amplified across trajectory length. Each step produces a candidate that gets judged; bias accumulates. A jury of three judges drawn from different families — Claude, GPT, Gemini — produces a consensus score whose disagreement itself is signal. When all three agree the trajectory failed, you have high-confidence evidence. When they disagree by more than 0.3, you have a row that needs human annotation, not a release decision.
How FutureAGI Handles Juries of Models Metric
FutureAGI’s approach is to treat juries as composed CustomEvaluation instances, not a black-box “jury” knob. A team defines three or five judge prompts, each pointed at a different model — claude-opus-4-7, gpt-4o, gemini-2.5-pro, for example — and wraps each as a CustomEvaluation that returns a 0–1 score and a one-line reason. Each judge becomes a row-level evaluator attached to the Dataset via Dataset.add_evaluation. For RAG rows, teams often compare the jury consensus with Groundedness; for agent rows, they compare it with TaskCompletion to separate bad judging from incomplete work. An aggregation evaluator (mean, median, or majority vote) is added on top to produce the consensus column. The platform versions both the judge prompts and the aggregation rule, so when a regression eval reruns at release time, the judges are pinned to the same prompts and models — reproducibility instead of drift.
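Outside the platform, the pinning idea reduces to freezing the judge prompts, models, and aggregation rule together and fingerprinting them per release. A minimal sketch, assuming nothing about the FutureAGI SDK beyond what is described above; `JuryConfig` and its fields are hypothetical names:

```python
from dataclasses import dataclass
import hashlib
import json

@dataclass(frozen=True)
class JuryConfig:
    # ((judge_name, model, rubric_prompt), ...) pinned for one release.
    judges: tuple
    aggregation: str  # "mean", "median", or "majority"

    def fingerprint(self) -> str:
        # Stable hash of prompts + models + aggregation rule, stored next to the
        # eval results so a rerun at release time can prove it used the same jury.
        blob = json.dumps({"judges": self.judges, "aggregation": self.aggregation}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()[:12]
```

If the fingerprint changes between two runs, the scores are not comparable; that is the drift the platform's versioning is there to prevent.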
A real workflow: an agent team runs a regression eval over 1,200 trajectories with three jurors. The mean-jury score gates the release at 0.78; rows where the three judges disagree by more than 0.25 are routed to a human-annotation queue. Concretely, when one judge model rotated to a new minor version last quarter, the jury’s mean held steady while the single-judge baseline shifted by 3 points — exactly the resilience a jury is supposed to provide. FutureAGI exposes the per-judge breakdown so engineers can spot when one model has drifted relative to the others, and rotate that judge before it poisons the consensus.
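A sketch of that gate and routing logic, assuming the per-judge scores for each row have already been collected; the 0.78 gate and 0.25 disagreement threshold are the values from the workflow above, and the field names are illustrative:

```python
import statistics

RELEASE_GATE = 0.78        # minimum mean-jury score required to ship
DISAGREEMENT_LIMIT = 0.25  # spread beyond this routes the row to human annotation

def gate_release(rows):
    # Each row looks like {"id": "...", "scores": [claude, gpt, gemini]}.
    row_means = [statistics.mean(r["scores"]) for r in rows]
    needs_human = [r["id"] for r in rows
                   if max(r["scores"]) - min(r["scores"]) > DISAGREEMENT_LIMIT]
    jury_mean = statistics.mean(row_means)
    return {
        "jury_mean": jury_mean,
        "ship": jury_mean >= RELEASE_GATE,
        "human_annotation_queue": needs_human,
    }
```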
How to Measure Juries of Models Metric
A jury is a pipeline, not one evaluator. Wire the row-level signals first, then the aggregation:
- Per-judge score: each `CustomEvaluation` returns a 0–1 score with reason; track all of them, not just the consensus.
- Inter-judge agreement (dashboard signal): Krippendorff’s alpha or pairwise Pearson on judge scores; if alpha falls under 0.5, the jury is incoherent and the rubric needs work (a pairwise-agreement sketch follows the minimal example below).
- Disagreement-flagged rate: the share of rows where max judge minus min judge exceeds a threshold; route to human annotation.
- Consensus drift: track jury mean per release; sudden shifts without a known model/prompt change signal a juror has drifted.
- User-feedback correlation: thumbs-down rate vs. consensus score by cohort — the proxy that validates the jury isn’t gaming itself.
Minimal Python (a sketch; assumes `RUBRIC`, `q`, and `a` are defined earlier in your pipeline):
```python
from fi.evals import CustomEvaluation
import statistics

# One judge per model family so no single family's bias dominates the consensus.
jurors = [("claude_judge", "claude-opus-4-7"), ("gpt_judge", "gpt-4o"), ("gemini_judge", "gemini-2.5-pro")]
judges = [CustomEvaluation(name=n, model=m, rubric=RUBRIC) for n, m in jurors]

# Score one row with every juror; the median is the consensus.
scores = [judge.evaluate(input=q, output=a).score for judge in judges]
consensus = statistics.median(scores)

# A wide spread between jurors routes the row to human review instead of a verdict.
needs_review = max(scores) - min(scores) > 0.25
```
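To cover the dashboard signals from the list above, a short extension computes pairwise Pearson agreement, the disagreement-flagged rate, and consensus drift. The 0.5 floor on the minimum pairwise correlation stands in here for the Krippendorff's-alpha guideline, and the previous-release mean is a hypothetical value:

```python
import numpy as np
from itertools import combinations

# One row per evaluated example, one column per juror (0-1 scores).
judge_matrix = np.array([[0.9, 0.8, 0.85],
                         [0.2, 0.3, 0.10],
                         [0.7, 0.9, 0.60]])

# Inter-judge agreement: pairwise Pearson between juror columns.
pairwise = [np.corrcoef(judge_matrix[:, i], judge_matrix[:, j])[0, 1]
            for i, j in combinations(range(judge_matrix.shape[1]), 2)]
jury_incoherent = min(pairwise) < 0.5   # rubric needs work if this is True

# Disagreement-flagged rate: share of rows whose max-min spread exceeds the threshold.
spread = judge_matrix.max(axis=1) - judge_matrix.min(axis=1)
flagged_rate = float((spread > 0.25).mean())

# Consensus drift: compare this run's jury mean with the previous release's.
previous_release_mean = 0.81            # hypothetical value from the last run's report
drift = float(judge_matrix.mean()) - previous_release_mean
```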
Common Mistakes
- Picking three judges from one model family. Three GPT models share most biases; pick across Claude, GPT, Gemini, Llama for real diversity.
- Aggregating with mean when the rubric is bimodal. Mean hides bimodality; use median or majority vote when scores cluster at 0 and 1.
- Letting the candidate model also be a juror. Self-grading inflates scores; the candidate model must not appear in the jury.
- No rubric pinning. If the rubric prompt changes between releases, jury drift is invisible. Version the rubric in the dataset.
- Treating disagreement as noise. High judge disagreement is the most useful signal a jury produces — it means the row deserves human review.
Frequently Asked Questions
What is the juries of models metric?
It is an LLM-as-a-judge evaluation that aggregates scores from three to five different judge models into one consensus score per row, reducing the bias and self-evaluation inflation any single judge introduces.
How is juries-of-models different from a single LLM-as-a-judge?
A single LLM-as-a-judge inherits the judge model's biases — verbosity preference, length bias, family-level drift. Juries aggregate across models, so no single bias dominates and consensus failures are more meaningful signals.
How do you measure juries-of-models in production?
Wrap each judge as a `CustomEvaluation` returning a 0–1 score, run all judges over the same `Dataset`, then aggregate via mean, median, or majority vote. FutureAGI versions both the judge prompts and the aggregation rule per release.