Evaluation

What Is a Jury of Models?

A jury of models is an eval pattern where multiple judge models independently score the same LLM or agent output and an aggregation rule turns their votes into one verdict. It appears in offline eval pipelines, regression gates, and production trace review when one judge model may be biased or brittle. FutureAGI teams use juries for open-ended quality, safety, and task-completion checks where disagreement is a useful signal, not just noise.

Why a Jury of Models Matters in Production LLM and Agent Systems

A single judge model can make the release gate look cleaner than the product really is. The common failure is not a dramatic crash; it is quiet over-acceptance. One judge rewards confident wording, another catches unsupported claims, and a third flags unsafe tool use. If you only run the first judge, silent hallucinations and bad agent actions pass with high scores.

The pain lands across the team. Developers see flaky eval results after prompt edits. SREs get alert noise because the eval-fail-rate jumps and then vanishes on the next sample. Compliance reviewers cannot explain why two similar customer answers received different labels. Product teams see user complaints before the offline benchmark shows a regression. In logs, the symptoms are score volatility, low agreement between human review and judge scores, repeated “borderline pass” reasons, and cohorts where one judge consistently disagrees.

This matters more for 2026-era agent systems than for single-turn chat. Multi-step agents create many judgment points: the plan, each tool call, the final answer, and the recovery path after failure. Unlike Ragas faithfulness, which focuses on whether an answer is supported by retrieved context, a model jury can combine groundedness, task completion, policy fit, and trajectory quality into one auditable decision. That makes disagreement a debugging surface instead of an argument in a spreadsheet.

How FutureAGI Handles a Jury of Models

The anchor surface is eval:CustomEvaluation, exposed as fi.evals.CustomEvaluation. In FutureAGI, each judge in the jury can be represented as a named CustomEvaluation with its own rubric, judge model, score range, and reason field. An AggregatedMetric then combines the votes using the rule the team chose: majority vote, mean score, weighted score, or fail-fast on a safety judge.
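
The aggregation rules themselves are simple enough to reason about directly. The sketch below is plain Python rather than the FutureAGI API, with made-up votes and hypothetical helper names, to show how the four rules can disagree on the same ballot:

from statistics import mean

# Each vote: (judge_name, score in [0, 1], pass/fail label). Values are made up.
votes = [("groundedness", 0.9, True), ("task_completion", 0.7, True), ("safety", 0.2, False)]

def majority_vote(votes):
    # Verdict passes when more than half of the judges pass it.
    return sum(ok for _, _, ok in votes) > len(votes) / 2

def weighted_score(votes, weights):
    # Weighted sum of scores; weights encode how much each judge is trusted.
    return sum(weights[name] * score for name, score, _ in votes)

def fail_fast(votes, safety_judges):
    # A failing safety judge vetoes the verdict regardless of the other scores.
    return all(ok for name, _, ok in votes if name in safety_judges)

print(mean(score for _, score, _ in votes))    # 0.6
print(majority_vote(votes))                    # True: 2 of 3 judges pass
print(weighted_score(votes, {"groundedness": 0.3, "task_completion": 0.3, "safety": 0.4}))  # 0.56
print(fail_fast(votes, {"safety"}))            # False: safety judge vetoes

The same ballot passes by majority but fails under fail-fast, which is why the aggregation rule belongs in the audit record alongside the votes.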

FutureAGI’s approach is to make the jury auditable: every vote, model, rubric version, score, reason, and final aggregation rule is stored with the eval run or trace. A support-agent team might create three judges for refund-policy answers: one Groundedness style judge for cited policy support, one TaskCompletion style judge for whether the agent solved the request, and one custom compliance judge for prohibited promises. The team runs the jury offline against a golden dataset, then attaches the same jury to production traces.

With the traceAI-langchain integration, the engineer can inspect the exact span where disagreement began. If the jury fails only when llm.token_count.prompt is high, the likely issue is context overload. If disagreement clusters around agent.trajectory.step values for tool calls, the next action is a regression eval on tool selection, not another prompt rewrite. In production, the team sets a release threshold: block deploys when jury agreement drops below 0.75 or when the safety judge fails more than 1% of traces.
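
The gate logic itself fits in a few lines. This is a hedged sketch: the `release_gate` helper and its argument names are illustrative, not a built-in FutureAGI API, and the thresholds are the ones from the example above:

def release_gate(jury_agreement: float, safety_fail_rate: float) -> bool:
    # Block the deploy when judges disagree too often or the safety judge
    # fails more than 1% of traces.
    if jury_agreement < 0.75:
        return False
    if safety_fail_rate > 0.01:
        return False
    return True

assert release_gate(0.82, 0.004)          # healthy release
assert not release_gate(0.70, 0.004)      # blocked: low agreement
assert not release_gate(0.82, 0.02)       # blocked: safety fail rate above 1%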

How to Measure Jury-of-Models Quality

Measure the jury and the system it grades. Useful signals include:

  • CustomEvaluation vote output: store each judge’s score, label, and reason before aggregation so disagreement is visible.
  • AggregatedMetric verdict: combine judge votes into the release-gate score; track mean, majority, and fail-fast outcomes separately.
  • Inter-judge agreement: monitor pairwise agreement, Cohen’s kappa, or simple same-label rate on a stable cohort.
  • Dashboard drift signals: alert on eval-fail-rate-by-cohort, vote entropy, p95 judge latency, and token-cost-per-trace.
  • Trace and feedback proxies: compare disagreement with agent.trajectory.step, thumbs-down rate, escalation rate, and human annotation overrides.

A minimal sketch, assuming the fi.evals names described above:

from fi.evals import CustomEvaluation, AggregatedMetric

# Example inputs; in practice these come from a golden dataset or a production trace.
user_query = "Can I get a refund after 45 days?"
agent_answer = "Policy 4.2 allows refunds within 30 days, so no."

# One CustomEvaluation per judge, each with its own rubric and reason field.
policy = CustomEvaluation(name="policy_fit", rubric="Score 0-1 for cited policy support.")
safety = CustomEvaluation(name="safe_action", rubric="Score 0-1; 1 means no unsafe tool use.")

jury = AggregatedMetric(metrics=[policy, safety], aggregation="mean")
result = jury.evaluate(input=user_query, output=agent_answer)
print(result.score, result.reason)
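
To make the inter-judge agreement and vote-entropy signals from the list above concrete, here is a plain-Python sketch over made-up pass/fail labels. Cohen's kappa, which corrects the same-label rate for chance agreement, can replace the raw rate once there are enough traces:

from collections import Counter
from math import log2

# Pass/fail labels from three judges over the same five traces (made-up data).
labels = {
    "policy_fit":  [1, 1, 0, 1, 0],
    "safe_action": [1, 1, 0, 0, 0],
    "grounded":    [1, 0, 0, 1, 0],
}

def same_label_rate(a, b):
    # Fraction of traces where two judges emit the same label.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def vote_entropy(votes):
    # Shannon entropy of one trace's votes: 0.0 is unanimous,
    # 1.0 is a maximal two-way split.
    counts = Counter(votes)
    if len(counts) == 1:
        return 0.0
    return -sum((c / len(votes)) * log2(c / len(votes)) for c in counts.values())

print(same_label_rate(labels["policy_fit"], labels["safe_action"]))  # 0.8
for trace_votes in zip(*labels.values()):
    print(trace_votes, round(vote_entropy(trace_votes), 2))  # e.g. (1, 1, 0) -> 0.92

High-entropy traces are the ones worth routing to human annotation first, since they are exactly where the jury could not settle the question.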

Common Mistakes

  • Treating the jury as truth. A jury reduces single-judge bias; it does not replace calibration against human annotations.
  • Using near-identical judges. Three prompts on the same model often share the same blind spot. Vary model family, rubric, or evaluation method.
  • Hiding disagreement inside the average. A 0.8 mean can mask one hard safety failure: three votes of 1.0, 1.0, and 0.4 average to 0.8. Keep vote-level records.
  • Changing rubrics without versioning. Old and new jury scores become incomparable when rubric text changes silently.
  • Running juries on every trace by default. Sample high-risk cohorts first; otherwise judge cost and latency can obscure the signal.

Frequently Asked Questions

What is a jury of models in evals?

A jury of models uses multiple judge models to score the same LLM or agent output, then combines their votes into one eval verdict. It reduces dependence on a single judge and exposes disagreement.

How is a jury of models different from LLM-as-a-judge?

LLM-as-a-judge typically uses a single model as the grader. A jury of models uses several judges, compares their reasons, and aggregates the scores so disagreement becomes a measurable signal.

How do you measure a jury of models?

Use FutureAGI `fi.evals.CustomEvaluation` judges with `AggregatedMetric`, then track inter-judge agreement, vote entropy, and eval-fail-rate-by-cohort. Trace fields such as `agent.trajectory.step` help isolate where the judges disagree.