What Is a Judge Model?
The large language model that performs grading inside an LLM-as-a-judge evaluator, returning a score and rationale for a candidate output.
A judge model is the LLM that grades outputs inside an LLM-as-a-judge evaluation pipeline. It reads the user prompt, candidate response, optional reference answer, and rubric, then returns a structured score and reason. The judge is separate from the generator model being tested; that separation reduces self-preference bias and makes release gates easier to trust. In FutureAGI, teams pin and version the judge beside the rubric so score changes can be traced to model, prompt, or data drift.
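In practice the contract is narrow: the judge receives one assembled grading prompt and must return a structured verdict. A minimal sketch of that shape follows; the prompt template and verdict schema are illustrative, not FutureAGI's internal format:

```python
from dataclasses import dataclass

# Illustrative grading prompt: the rubric, user prompt, candidate answer,
# and optional reference are assembled into a single judge request.
JUDGE_PROMPT = """You are grading a model response against a rubric.
Rubric: {rubric}
User prompt: {question}
Candidate response: {answer}
Reference answer (may be empty): {reference}
Return JSON: {{"score": <1-5>, "reason": "<one-sentence justification>"}}"""

@dataclass
class Verdict:
    score: int        # rubric score, e.g. 1-5
    reason: str       # written justification, kept for audit trails
    judge_model: str  # pinned judge identity, so score drift is traceable
```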
Why Judge Models Matter in Production
The judge is the most under-engineered piece of most teams’ eval stacks. Teams obsess over rubric wording and ignore which model is reading it, and the downstream effects are large. A weak judge produces noisy scores: score the same trace twice and you get different verdicts. A judge from the same family as the generator produces inflated scores; research has shown self-evaluation rewards style and length, not correctness. A judge with a small context window truncates retrieved context and hallucinates faithfulness verdicts.
The pain shows up as an eval suite that doesn’t predict user feedback. Engineers see the suite pass green; the thumbs-down rate ticks up; nobody trusts the metric anymore; the eval stack quietly stops gating releases. That is the worst place an eval program can land: present, but unreliable.
For 2026-era systems, judge model choice is also a cost decision. Running gpt-4o as a judge on every production trace at 100K traces/day is real money. Many teams under-provision the judge to save cost and end up with a noisy signal. FutureAGI’s approach: use a strong judge for offline calibration runs and the canonical golden dataset; use a smaller, distilled judge for live high-volume scoring, with periodic re-calibration against the strong judge. Comparable open-source frameworks like DeepEval ship default judges but rarely give you a built-in calibration loop.
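To put numbers on that, a back-of-envelope calculation; the per-trace token counts are assumptions and the per-million-token prices are illustrative, so substitute your provider's current rates:

```python
# Assumed averages: ~2,000 input tokens (prompt + trace + rubric) and
# ~200 output tokens (score + rationale) per judge call.
traces_per_day = 100_000
in_tok, out_tok = 2_000, 200

# Illustrative gpt-4o-class pricing in $ per 1M tokens -- not authoritative.
price_in, price_out = 2.50, 10.00

daily = traces_per_day * (in_tok * price_in + out_tok * price_out) / 1e6
print(f"${daily:,.0f}/day, ~${daily * 30:,.0f}/month")  # $700/day, ~$21,000/month
```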
How FutureAGI Handles Judge Models
FutureAGI’s approach is to make the judge model a first-class, swappable parameter on every judge-based evaluator. CustomEvaluation accepts judge_model="gpt-4o", judge_model="gemini-2.5-pro", or any model registered in the SDK’s model database. Built-in cloud-template evaluators (Groundedness, AnswerRelevancy, Faithfulness) ship with managed judge models tuned and calibrated by FutureAGI; you can override the judge for any of them.
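If you do override a built-in template's judge, the call would look something like this; the Groundedness import path is an assumption, and only the judge_model parameter is documented above:

```python
# Sketch -- assumes Groundedness is importable from fi.evals and accepts
# the same judge_model override as CustomEvaluation (path not confirmed).
from fi.evals import Groundedness

grounded = Groundedness(judge_model="gemini-2.5-pro")  # replaces the managed default judge
```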
Per-judge calibration is handled through the annotation queue: pull a sample of evaluator results into fi.queues.AnnotationQueue, have humans label them, and FutureAGI computes Cohen’s kappa between the judge and humans. A judge below 0.7 kappa gets flagged before it gates a release.
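The kappa computation itself is standard; a sketch with scikit-learn, assuming paired judge and human labels exported from the annotation queue (the example labels are illustrative):

```python
from sklearn.metrics import cohen_kappa_score

# Paired verdicts on the same traces: judge scores vs. human labels,
# e.g. pulled from an AnnotationQueue export (illustrative data).
judge_labels = [5, 4, 2, 5, 1, 3, 4, 4]
human_labels = [5, 4, 3, 5, 1, 3, 5, 4]

kappa = cohen_kappa_score(judge_labels, human_labels)
if kappa < 0.7:  # the production-trust floor used above
    print(f"Judge flagged: kappa={kappa:.2f} < 0.70")
```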
Real example: a healthcare team needs ClinicallyInappropriateTone running on every patient-facing message. They run two judge models, gemini-2.5-pro (strong, expensive) and gpt-4o-mini (cheap, fast), against 500 human-labeled traces. Strong-judge agreement is 0.81 kappa; the cheap judge is at 0.58. They use the strong judge for nightly regression evals against the golden dataset, and the cheap judge live with a 5%-sampled audit by the strong judge. The Agent Command Center’s routing policies send the eval calls themselves through a cost-optimized route to keep judge cost predictable.
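The live half of that setup is easy to sketch: the cheap judge scores every trace, and a random 5% is re-scored by the strong judge so drift between the two can trigger re-calibration. Judge construction is omitted; the evaluate signature follows the Minimal Python example below, and the routing logic here is illustrative:

```python
import random

AUDIT_RATE = 0.05  # 5% of live traces also go to the strong judge

def score_trace(trace, cheap_judge, strong_judge, audit_log):
    """cheap_judge / strong_judge: evaluators exposing evaluate(input=, output=)."""
    verdict = cheap_judge.evaluate(input=trace["input"], output=trace["output"])
    if random.random() < AUDIT_RATE:
        audit = strong_judge.evaluate(input=trace["input"], output=trace["output"])
        audit_log.append((verdict, audit))  # sustained disagreement triggers re-calibration
    return verdict
```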
How to Measure or Detect Judge Model Quality
Judge models need their own metrics. Track:
- Human-agreement (Cohen’s kappa): ≥0.7 is the floor for production trust.
- Position bias: in pairwise judge runs, the % of times the judge picks the first response regardless of content. >55% means rebalance with order randomization (see the sketch after this list).
- Self-preference bias: when the judge is from the same family as the generator, the inflation factor — measured against a different-family judge.
- Latency: judge call time per trace; a gpt-4o judge adds 800ms–2s.
- Cost per evaluated trace: tokens × price. Plot weekly.
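The position-bias number is cheap to compute: judge every pair in both orders and count how often position one wins. In this sketch, pairwise_judge is a hypothetical callable returning 1 or 2 for the winning slot:

```python
def first_position_rate(pairs, pairwise_judge):
    """pairs: list of (response_a, response_b); pairwise_judge returns 1 or 2."""
    first_wins = total = 0
    for a, b in pairs:
        for ordering in ((a, b), (b, a)):  # judge both orders of each pair
            total += 1
            first_wins += pairwise_judge(*ordering) == 1
    return first_wins / total  # >0.55 means rebalance with order randomization
```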
Minimal Python:
```python
from fi.evals import CustomEvaluation

judge = CustomEvaluation(
    name="medical_tone",
    rubric="Score 1-5 for clinical appropriateness. 1=inappropriate, 5=ideal.",
    judge_model="gemini-2.5-pro",  # pin the judge so score changes are traceable
)

# Illustrative trace to grade.
q = "I missed my morning dose. What should I do?"
a = "Take it as soon as you remember, unless your next dose is due soon."
result = judge.evaluate(input=q, output=a)
```
Common Mistakes
- Same model as generator. Inflates scores; introduces stylistic bias. Use a different family.
- Skipping calibration. A judge that hasn’t been validated against human labels is decorative.
- Ignoring temperature. A judge run at temperature 0 is far more consistent; many teams leave the default 0.7 and get noisy scores (see the sketch after this list).
- Letting the judge see the gold answer when grading reference-free traits. The judge will reward paraphrase, not correctness.
- Locking in one judge forever. Models improve and degrade; re-bench the judge quarterly against the same human-labeled set.
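One guard for the temperature mistake above; note that judge_temperature is a hypothetical parameter name used only to illustrate pinning the judge's sampling temperature, not confirmed FutureAGI SDK API:

```python
from fi.evals import CustomEvaluation

judge = CustomEvaluation(
    name="medical_tone",
    rubric="Score 1-5 for clinical appropriateness.",
    judge_model="gemini-2.5-pro",
    judge_temperature=0,  # HYPOTHETICAL parameter name -- check the SDK; the point is deterministic grading
)
```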
Frequently Asked Questions
What is a judge model?
A judge model is the LLM that performs the grading inside an LLM-as-a-judge evaluator. It reads the input, the candidate output, and the rubric, then returns a score and a written justification.
How is a judge model different from a generator model?
The generator model produces the response being graded; the judge model evaluates it. They should generally be different — using the same model for both inflates scores by 5–15% and adds family-specific style bias.
How do you pick a judge model?
Pick a model that is at least as strong as the generator, from a different family if possible. Calibrate against human annotations on 50–200 traces. FutureAGI's CustomEvaluation accepts a judge_model argument so you can pin and version the judge.