Best LLM Judge Models in 2026: 8 Models Ranked on Calibration, Cost, and Self-Preference
Eight LLM judge models compared on human correlation, cost per score, latency, and self-preference bias. Pick by your rubric, not by SummEval.
You ship a customer-support agent. The judge is GPT-4o and the helpfulness rubric reads 0.91 every Monday. In March the judge bumps to a 4o minor version. In April the agent quotes a refund off by an order of magnitude. The rubric still reads 0.91. The signal stopped meaning what you thought it meant the day the judge changed.
Most posts on this topic rank judge models by a single SummEval Spearman or an MT-Bench winrate. Those numbers do a lot of unspoken work. A judge that hits 0.514 Spearman on summarization is not automatically the judge that holds for two years on your customer-support rubric. The benchmark measures one task on one dataset with one rubric the paper authors wrote.
The thesis this post defends: pick a judge model by three axes. Human correlation against your rubric. Cost per score at your traffic volume. Self-preference bias against your candidate models. The best judge for your eval wins on your rubric, not on SummEval. We compare eight models worth running on a calibration set in May 2026.
Methodology note: scoring axes below are calibration ceiling (kappa against human labels on subjective rubrics), cost per million output tokens, p95 latency on a 2K-token transcript, self-preference bias documented across published work, and license/deployment shape. Pricing verified May 2026 against vendor pricing pages. Calibration is task-dependent. Treat these as starting points, not procurement decisions.
TL;DR: which judge wins on which axis
| Axis | Pick |
|---|---|
| Best calibration ceiling on subjective rubrics | Claude Sonnet 4.5, Claude Opus 4.x |
| Best structured-output reliability and cost-quality balance | GPT-5 + GPT-5-mini cascade |
| Best long-context judging (over 200K) and multimodal rubrics | Gemini 2.5 Pro |
| Cheapest frontier-tier judge at scale | Gemini 2.5 Flash, GPT-5-mini |
| Best fine-tuned eval-specific judge (closed) | Galileo Luna-2 |
| Best fine-tuned eval-specific judge (with full stack integration) | Future AGI turing_large / turing_flash |
| Best open-weight judge on cost | DeepSeek-V3 |
| Best self-hosted regulated judge | Llama 3.3 70B |
| Best open-weight evaluator-specific fine-tune | Prometheus 2 (8x7B) |
| Worst idea | Same model as judge and candidate |
If you read one row: there is no single winner. Run two or three candidates on a 100-to-300 example human-labeled set, measure kappa, divide by cost per score, then make the call.
The three axes that actually matter
The reason published leaderboards mislead is that they collapse the choice to one number. In production, three axes bind separately and the binding axis changes by team.
Human correlation on your rubric. Cohen’s kappa or Spearman against a human-labeled hold-out, computed on your data, not on SummEval. A marketing-copy rubric tolerates kappa around 0.6. A medical-advice rubric needs 0.85 or higher. A judge that hits 0.85 on a public benchmark and 0.55 on your dataset is a judge that does not know your domain. The fix is unglamorous: hand-label 100 to 300 examples covering the failure modes that matter, run every candidate judge against the same set, and only then read the leaderboard.
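To make the kappa check concrete, here is a minimal sketch of Cohen's kappa in plain Python. The function name and the toy labels are illustrative assumptions, not part of any SDK; on a real calibration set you would pass the 100 to 300 human labels and the judge's labels for the same items.

```python
from collections import Counter

def cohens_kappa(human, judge):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    p_e = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy labels, for illustration only.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
kappa = cohens_kappa(human_labels, judge_labels)  # 6 of 8 agree, chance is 0.5
```

Raw agreement here is 0.75, but kappa corrects for the agreement two raters would reach by chance, which is why it is the number to compare against a rubric's threshold.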
Cost per score at your traffic volume. A frontier judge call on a 30-second agent trace costs $0.01 to $0.05 depending on judge and tokens. At a million traces a day that is $300K to $1.5M monthly. The judge that wins on kappa and loses on dollar-per-score loses the procurement. Three patterns rescue the bill: fine-tuned judges that score at one to ten percent of frontier per call, classifier cascades that escalate only close cases to the frontier model, and sample-don’t-score on routine traffic.
Self-preference bias against your candidates. A judge prefers outputs from its own family at 10 to 25 percent margin per Zheng et al. 2024. The cardinal mistake is the same model as judge and candidate. The second mistake is judging GPT-5 candidates with a GPT-5 judge and Sonnet 4.5 candidates with a Sonnet judge in the same eval suite, which builds family bias into the comparison. The mitigation is a three-judge ensemble across families on launch decisions and single-family judges only for trend tracking.
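The three-judge ensemble can be sketched as a strict majority vote keyed by judge family. This is a sketch under stated assumptions: the family names and the "A" / "B" / "tie" verdict labels are hypothetical, and how each judge is actually called is left out.

```python
from collections import Counter

def ensemble_verdict(votes):
    """Strict-majority vote over {judge_family: verdict}.

    Returns the winning verdict, or None when no verdict has a strict
    majority (a case to send to human review).
    """
    if len(votes) < 3:
        raise ValueError("use at least three judges from different families")
    verdict, count = Counter(votes.values()).most_common(1)[0]
    return verdict if count > len(votes) / 2 else None

# Hypothetical pairwise verdicts from three judge families.
launch_votes = {"anthropic": "A", "openai": "A", "google": "B"}
split_votes = {"anthropic": "A", "openai": "B", "google": "tie"}

launch_decision = ensemble_verdict(launch_votes)
split_decision = ensemble_verdict(split_votes)
```

Keying the votes by family rather than by model makes it impossible to count two judges from the same family as independent votes.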
Add one operational axis underneath the three: judge version stability. Sonnet 4.5 to Sonnet 4.6 is a minor bump that ships with a new training mix. The mean rubric score shifts 3 to 8 points; the distribution narrows. If you do not pin the judge model id inside the eval contract, the dashboard moves even though the agent did not change. See Why LLM-as-a-Judge and the G-Eval definitive guide (both listed under Related reading) for the contract pattern.
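One way to pin the judge is to treat model id, rubric version, and rubric prompt as a single immutable record and stamp every score with its fingerprint. The sketch below is an assumption about what such a contract could look like; `JudgeContract`, its fields, and the model ids are hypothetical, not an API from any SDK named in this post.

```python
import hashlib
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class JudgeContract:
    judge_model_id: str   # exact dated model id, never a floating alias
    rubric_version: str
    rubric_prompt: str

    def fingerprint(self) -> str:
        """Short hash to store next to every score this contract produces."""
        payload = "\n".join(
            [self.judge_model_id, self.rubric_version, self.rubric_prompt]
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

# Hypothetical ids, for illustration only.
pinned = JudgeContract(
    judge_model_id="example-judge-2026-03-01",
    rubric_version="helpfulness-v3",
    rubric_prompt="Score helpfulness from 0 to 1.",
)
# A deliberate bump produces a new contract rather than mutating the old one.
bumped = replace(pinned, judge_model_id="example-judge-2026-04-15")

# Scores are only comparable on a trend line when fingerprints match.
scores_comparable = pinned.fingerprint() == bumped.fingerprint()
```

With the fingerprint stored per score, a dashboard can break the trend line at a judge bump instead of silently averaging across two different judges.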
The eight judge models worth running in 2026
1. Claude Sonnet 4.5 / Opus 4.x: the calibration ceiling
Closed-weight frontier. Anthropic, Bedrock, Vertex.
Where it earns its bill. Subjective rubrics that need open-ended reasoning over long context. Multi-document RAG faithfulness, multi-turn conversation adherence, agent-trajectory rubrics. Sonnet 4.5 sits at the top of LMSYS Chatbot Arena’s pairwise preference table and produces reasoning chains dense enough to survive an audit log. The 200K window covers most production transcripts in one pass.
Cost. $3 input / $15 output per 1M tokens for Sonnet 4.5; Opus 4.x runs higher. A 2K-input, 200-output call lands near $0.009 per score. Viable as the second-stage judge in a cascade. Expensive as the first pass.
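The per-score figure is plain token arithmetic. A minimal sketch, using the Sonnet 4.5 prices and token counts quoted above; the function name is illustrative.

```python
def cost_per_score(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Dollar cost of one judge call given per-million-token prices."""
    return (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000

# 2K input tokens at $3 per 1M plus 200 output tokens at $15 per 1M.
sonnet_call = cost_per_score(2_000, 200, 3.0, 15.0)  # 0.006 + 0.003 dollars
```

Swapping in another judge's prices and your own average transcript length gives the per-score number to put next to kappa.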
Latency. 1.5 to 3 seconds p95. Not an inline-guardrail judge.
Self-preference. Claude prefers Claude-family outputs in published evaluations. Do not use Sonnet 4.5 to judge a Sonnet 4.5 candidate. Pair with GPT or Gemini in an ensemble for launch decisions.
Best for. High-stakes judging where calibration matters more than cost. Pre-launch validation. The frontier slot in a two-stage cascade.
2. GPT-5 / GPT-5-mini: the structured-output workhorse
Closed-weight frontier. OpenAI, Azure OpenAI.
Where it earns its bill. Structured-output evaluation. JSON-mode reliability on GPT-5 is the lowest-parsing-failure tier in this list, which matters when 100K judgments at a 5 percent parse-failure rate is 5,000 retries. The two-stage GPT-5-mini-screens / GPT-5-rescore cascade lands under 30 percent of GPT-5-only cost at the same calibration target.
Cost. Verify on the pricing page. GPT-5-mini is roughly an order of magnitude cheaper per million tokens than GPT-5.
Latency. GPT-5-mini sub-second; GPT-5 at 1.5 to 3 seconds p95.
Self-preference. GPT-4 was the original self-preference data point in Zheng et al. 2024 at 10 to 25 percent margin. Treat the bias as inherited by GPT-5 until measured.
Best for. General-purpose judge in OpenAI stacks. Strict structured-output rubrics where parse failures kill throughput.
3. Gemini 2.5 Pro / Flash: long context, multimodal, cost-balanced
Closed-weight frontier. Vertex AI, AI Studio.
Where it earns its bill. Context windows past 200K. Multimodal judging on image, audio, and video inputs. Gemini 2.5 Pro carries a 1M-plus context window with experimental 2M tiers. The only judge here that scores a whole-document multi-doc RAG response in one pass. Flash brings frontier-tier reasoning to a price point where million-span-a-day scoring is financially viable; Vertex Batch Prediction discounts offline workloads further.
Latency. Flash sub-second on short inputs; Pro at 2 to 4 seconds on long-context judging. Region-dependent.
Self-preference. Less publicly documented than GPT or Claude. Assume present, measure on your set, do not judge Gemini candidates with a Gemini judge.
Worth flagging. The 1M-plus window degrades subtly past a few hundred K tokens. Calibrate empirically on long inputs.
Best for. Long-context judging where 200K is not enough. Multimodal rubrics. Cost-sensitive high-volume scoring with Flash.
4. Galileo Luna-2: the purpose-built evaluator
Closed-weight fine-tune. Galileo.
Where it earns its bill. Eval-specific fine-tunes win on cost-per-score for the rubrics they were trained on. Luna-2 is a 2B-parameter model fine-tuned for hallucination, context adherence, and tool-call correctness. Galileo published agreement numbers against GPT-4o on internal benchmarks and prices it at a fraction of frontier judge calls. The reference category for “small purpose-built judge beats the big general-purpose one on its home turf.”
Latency. Sub-second by design.
Self-preference. Not applicable in the same shape. Luna-2 was not trained as a general assistant, so the bias surface is rubric framing rather than family preference. Kappa on rubrics outside its training distribution can degrade sharply.
Worth flagging. Strong pick when rubrics overlap training. Wrong pick for rubrics that require open-ended reasoning the fine-tune has not seen. The Future AGI Platform runs the same cascade at lower per-eval cost than Galileo Luna-2 with tighter trace + optimizer + gateway integration.
Best for. Cost-optimized hallucination, context adherence, and tool-call scoring. The fine-tune slot in a cascade.
5. Future AGI Turing series (turing_large, turing_flash): fine-tune meets full stack
Closed-weight fine-tune. Routed through the ai-evaluation SDK.
Where it earns its bill. FAGI ships three Turing variants (turing_flash, turing_small, turing_large; see fi/evals/core/registry.py) as the default judge backend for 50+ pre-built EvalTemplate rubrics. Call evaluate("faithfulness", output=..., context=..., model="turing_flash") and the SDK routes to a fine-tuned classifier. turing_large handles heavier reasoning rubrics; turing_flash carries the 50-70 ms p95 inline guardrail path through the Agent Command Center gateway.
Cost. The Future AGI Platform runs Turing-backed scoring at lower per-eval cost than Galileo Luna-2. Volume discounts make daily full-traffic judging financially viable.
Latency. turing_flash at 50-70 ms p95; turing_large at 1-2 seconds for full templates.
Self-preference. Like Luna-2, the bias surface is rubric framing rather than family preference. Calibrate on a human-labeled set per rubric.
Worth flagging. The Turing series composes natively with traceAI spans, Error Feed clustering, and the agent-opt optimizers. Honest tradeoff: Galileo has more public head-to-head agreement data; FAGI has tighter integration and a CustomLLMJudge that runs the same rubric against any LiteLLM-backed model when you want to BYOK GPT, Claude, Gemini, DeepSeek, or open-weight judges through the same SDK.
Best for. Production teams running traces, evals, optimization, and gateway routing in one stack. The inline-guardrail slot with turing_flash. BYOK plus FAGI-routed scoring in the same workflow.
6. DeepSeek-V3: open-weight cost leader
Open weight. DeepSeek Model License; verify commercial terms.
Where it earns its bill. Open-weight cost economics at frontier-tier reasoning. DeepSeek-V3 is a 671B MoE with 37B active parameters, so inference cost is closer to a 37B dense model than to a 671B dense one. Cost-per-token sits in a different league from closed frontier judges. Reasoning-rubric calibration is competitive within a small kappa gap.
Cost. Verify on DeepSeek API pricing or hosted providers like Together. A fraction of GPT-5 or Sonnet 4.5 at frontier-tier quality.
Latency. 1-2 seconds p95 on hosted endpoints; self-hosting requires multi-GPU because MoE is memory-heavy.
Worth flagging. Geopolitical and data-handling considerations apply for some procurement contexts. The cost win compounds at volume; so does the calibration gap if you skip the work.
Best for. High-volume routine scoring where cost-per-judgment binds. The first-pass judge in a two-stage cascade.
7. Llama 3.3 70B: the regulated self-host pick
Open weight. Meta’s Llama 3.3 license.
Where it earns its bill. Regulated workloads where data cannot leave the boundary. Llama 3.3 70B is the standard self-hosted judge in 2026 production stacks. Served via vLLM, TGI, Bedrock, Together, or Fireworks, it holds within a small kappa gap to frontier on subjective rubrics. Dense 70B is operationally cheaper to serve than DeepSeek’s 671B MoE: fewer GPU memory headaches, simpler scaling.
Cost. Free weights. Roughly $1-3 per million tokens on managed providers like Together, Fireworks, and Bedrock. Self-hosting typically runs 4xH100 for BF16 or 2xH100 for FP8.
Latency. 1-2 seconds p95 on managed endpoints.
Self-preference. Documented in Zheng et al. 2024 on Llama 2; the family effect likely persists in 3.3.
Worth flagging. Not a frontier model on hard reasoning rubrics. Function-calling reliability varies by serving framework. Structured-output reliability is good on hosted endpoints but not as strict as OpenAI JSON mode.
Best for. Regulated workloads (healthcare, banking, defense), on-premise deployments, and cost-optimized scoring where hosted closed models are a non-starter.
8. Prometheus 2 (8x7B): the open-weight evaluator fine-tune
Open weight. Apache 2.0. arXiv:2405.01535.
Where it earns its bill. Prometheus 2 is a Mistral-family fine-tune trained on a 100K-instance feedback dataset specifically for absolute and pairwise scoring. The paper showed it outperformed its base Mistral by a wide margin on judge benchmarks and approached GPT-4 on absolute scoring. Two variants ship (7B dense and 8x7B MoE) covering the cost-quality spread.
Cost. Free weights. Sub-dollar per million tokens on hosted providers; cheaper self-hosted.
Latency. Sub-second to 1 second p95.
Self-preference. Trained on diverse teacher annotations rather than its own family’s outputs, which softens (does not eliminate) the bias.
Worth flagging. Right pick when you specifically need an evaluator-trained open-weight model rather than a general-purpose one serving as a judge. Calibration degrades on rubrics outside its training distribution: the universal fine-tune caveat.
Best for. Open-source eval stacks. Research workflows where reproducibility matters. The open-weight slot in a cascade where data residency or budget forbids hosted frontier judges.
How the eight stack up
| Judge | Weight access | Calibration ceiling | Cost shape | p95 latency (2K input) | Self-preference risk |
|---|---|---|---|---|---|
| Claude Sonnet 4.5 / Opus 4.x | Closed | Frontier | $3 in / $15 out per 1M | 1.5-3s | High vs Claude |
| GPT-5 / GPT-5-mini | Closed | Frontier | Frontier; mini ~10x cheaper | 0.5-3s | High vs GPT |
| Gemini 2.5 Pro / Flash | Closed | Frontier; multimodal | Flash cost-balanced | 0.5-4s | Measure |
| Galileo Luna-2 | Closed fine-tune | Strong on trained rubrics | Far below frontier | Sub-second | Rubric-framing |
| FAGI turing_large / flash | Closed fine-tune | Strong on trained rubrics | Below Luna-2 per eval | 50-70 ms (flash) | Rubric-framing |
| DeepSeek-V3 | Open (MoE) | Small gap to frontier | Open-weight low | 1-2s | Measure |
| Llama 3.3 70B | Open (dense) | Small gap to frontier | $1-3 per 1M hosted | 1-2s | High vs Llama |
| Prometheus 2 (8x7B) | Open (Apache 2.0) | Eval-tuned absolute/pairwise | Open-weight low | 0.5-1s | Softened |
Two patterns fall out. Fine-tuned evaluators (Luna-2, Turing, Prometheus 2) sit in a different cost-latency tier than frontier general-purpose judges. The cascade pattern (fine-tune first, frontier on disagreements) is how most production stacks reconcile the two. The open-weight class is the only path for regulated workloads, and the kappa gap to frontier is small enough on most subjective rubrics that the self-host pick is rarely a quality compromise.
The cascade pattern beats picking one judge
Teams that have run judges for more than a quarter stop trying to pick “the” judge. Four layers replace the question.
Layer 1: deterministic floor. JSON schema, refusal regex, parser. Sub-millisecond, free, never drifts.
Layer 2: fine-tuned classifier judge. turing_flash, Luna-2, or a Gemma 3n adapter. Runs on every span at sub-second latency and one to ten percent of frontier cost. Catches confident pass and confident fail; flags close calls.
Layer 3: frontier general-purpose judge. Sonnet 4.5, GPT-5, Gemini 2.5 Pro. Runs only on the close calls layer 2 flagged. Bills for reasoning only when reasoning is the binding constraint.
Layer 4: human review queue. Cases the frontier judge still scores ambiguously go to a labeling queue. Labels feed back into rubric calibration so layer 2 catches more next quarter.
A million-span-a-day frontier-only workload is a $300K-to-$1.5M monthly bill. The same workload as a cascade with 90 percent caught at layer 2 drops the frontier bill by 90 percent without losing detection on hard cases. The ai-evaluation SDK ships the cascade as one call (augment=True runs a local classifier first and hands its reasoning to the frontier judge as in-context evidence). See Why LLM-as-a-Judge (listed under Related reading) for the full pattern.
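The four layers can be sketched as one routing function. This is not the ai-evaluation SDK's implementation: the refusal pattern, the required `answer` field, the 0.2 / 0.8 thresholds, and the `frontier_judge` callable are all assumptions chosen for illustration.

```python
import json
import re

# Hypothetical refusal pattern; a real floor would carry a longer list.
REFUSAL = re.compile(r"\bI (?:can't|cannot|am unable to)\b", re.IGNORECASE)

def deterministic_floor(raw_output):
    """Layer 1: return a failure reason, or None if the output passes."""
    if REFUSAL.search(raw_output):
        return "refusal"
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(parsed, dict) or "answer" not in parsed:
        return "schema"
    return None

def route(raw_output, classifier_score, frontier_judge, low=0.2, high=0.8):
    """Return (verdict, layer) for one span.

    classifier_score is the layer-2 pass probability in [0, 1].
    frontier_judge is a callable returning "pass", "fail", or "unsure";
    it is only invoked on close calls, which is where the savings come from.
    """
    reason = deterministic_floor(raw_output)
    if reason is not None:
        return ("fail", "layer1:" + reason)
    if classifier_score >= high:
        return ("pass", "layer2")
    if classifier_score <= low:
        return ("fail", "layer2")
    verdict = frontier_judge(raw_output)
    if verdict == "unsure":
        return ("needs_human", "layer4")
    return (verdict, "layer3")
```

Passing the frontier judge in as a callable keeps the routing logic testable with stubs and makes a judge swap a one-argument change.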
How to pick: a 90-minute calibration sprint
You do not pick a judge model from a blog post. You pick by running the candidates on your data. This is the cheapest version of that.
- Hand-label 100 to 300 examples that cover the failure modes your rubric needs to catch. Use two annotators where possible and discard low-agreement items. The set is your ground truth for the next quarter.
- Run three or four candidate judges from this list against the same set. A frontier closed (Sonnet 4.5 or GPT-5), a frontier-mini (Gemini 2.5 Flash or GPT-5-mini), a fine-tune (turing_large or Luna-2), and an open-weight (DeepSeek-V3 or Llama 3.3 70B). Same rubric prompt, structured output, position randomized on pairwise.
- Compute Cohen’s kappa or accuracy against the human labels for each judge. Plot kappa on the x-axis and cost-per-score on the y-axis. The Pareto frontier is your shortlist.
- Test self-preference explicitly. Take 50 pairs where one side is from the candidate model family and one side is from a different family with human labels for which is actually better. Score in both orderings. The judge whose winrate diverges from the ground truth by more than 10 percent has self-preference you cannot ignore.
- Project cost at three months. Token cost times tokens-per-judgment times sample rate times retries. The cheapest judge that hits your kappa threshold wins, not the highest kappa.
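The shortlist and final-pick steps above can be sketched in a few lines. The judge names, kappa values, and per-score costs below are made-up placeholders, not measurements; substitute the numbers from your own calibration run.

```python
def pareto_shortlist(judges):
    """Keep judges that no other judge beats on both kappa and cost.

    judges is a list of (name, kappa, cost_per_score) tuples.
    """
    shortlist = []
    for name, kappa, cost in judges:
        dominated = any(
            k2 >= kappa and c2 <= cost and (k2 > kappa or c2 < cost)
            for n2, k2, c2 in judges
            if n2 != name
        )
        if not dominated:
            shortlist.append(name)
    return shortlist

def cheapest_meeting_threshold(judges, min_kappa):
    """The cheapest judge that clears the kappa bar, or None if none does."""
    eligible = [j for j in judges if j[1] >= min_kappa]
    return min(eligible, key=lambda j: j[2])[0] if eligible else None

# Hypothetical calibration results: (name, kappa, dollars per score).
candidates = [
    ("frontier_closed", 0.82, 0.0090),
    ("frontier_mini", 0.74, 0.0010),
    ("fine_tune", 0.78, 0.0004),
    ("open_weight", 0.70, 0.0012),
]
shortlist = pareto_shortlist(candidates)
pick = cheapest_meeting_threshold(candidates, min_kappa=0.75)
```

In this toy data the fine-tune dominates both the mini and the open-weight judge, so the shortlist is the frontier model and the fine-tune, and the fine-tune wins at a 0.75 threshold because it is the cheapest judge that clears it.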
The sprint usually surfaces a two-judge cascade (one fine-tune for layer 2, one frontier for layer 3) and one open-weight fallback for the regulated path. That is the procurement output, not “judge model X is best.”
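The self-preference step of the sprint can be sketched as follows. The pair format and the `judge(first, second)` callable returning `"first"` or `"second"` are assumptions for illustration; the two stub judges exist only to show what the gap looks like for a family-biased judge versus a purely position-biased one.

```python
def self_preference_gap(pairs, judge):
    """Judge's own-family winrate minus the human own-family winrate.

    pairs: list of (own_family_output, other_family_output, human_winner),
    with human_winner in {"own", "other"}.
    judge(first, second) returns "first" or "second".
    """
    own_wins = 0
    for own, other, _ in pairs:
        # Score both orderings so position bias cancels out.
        if judge(own, other) == "first":
            own_wins += 1
        if judge(other, own) == "second":
            own_wins += 1
    judge_rate = own_wins / (2 * len(pairs))
    human_rate = sum(winner == "own" for _, _, winner in pairs) / len(pairs)
    return judge_rate - human_rate

# Toy data: humans prefer the judge's own family in half of the pairs.
pairs = [("own:a", "other:a", "own")] * 5 + [("own:b", "other:b", "other")] * 5

# Stub judges, for illustration only.
def family_biased(first, second):
    return "first" if first.startswith("own") else "second"

def position_biased(first, second):
    return "first"

biased_gap = self_preference_gap(pairs, family_biased)
position_gap = self_preference_gap(pairs, position_biased)
```

A gap above 0.10 is the threshold from the sprint. The position-biased stub lands at zero, which is the point of scoring both orderings: position bias should not be mistaken for family preference.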
Common mistakes when picking an LLM judge model
- Defaulting to last year’s model. Judge model versions shift behavior. A judge calibrated on GPT-4 is not calibrated on GPT-5 without re-validation. Pin the judge model id inside the eval contract; bump deliberately.
- Same model as judge and candidate. Self-preference adds 10 to 25 percent score to the judge’s own family per Zheng et al. 2024. The cardinal mistake.
- Pricing only the per-token cost. Real cost equals tokens-per-judgment times retries times sample rate. A judge with 5 percent parsing failures costs more than the per-token rate suggests.
- Trusting a vendor leaderboard number. Vendors benchmark on their training distribution. Your data is not their training distribution. Calibrate.
- Treating judging as a model choice. Production judging is a workflow: cascade, calibrate, audit, re-calibrate. The model is one input.
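The pricing mistake above is easy to see in numbers. A minimal sketch, assuming each parse failure triggers a retry that can itself fail independently, so the expected number of attempts per judgment is 1 / (1 - failure rate). The traffic, sample rate, and failure rate below are illustrative.

```python
def monthly_judge_cost(traces_per_day, sample_rate, cost_per_call,
                       parse_failure_rate, days=30):
    """Expected monthly judge spend including retries on parse failures."""
    if not 0 <= parse_failure_rate < 1:
        raise ValueError("parse_failure_rate must be in [0, 1)")
    # Assumption: retry until success, with independent failures.
    expected_attempts = 1 / (1 - parse_failure_rate)
    judged = traces_per_day * days * sample_rate
    return judged * cost_per_call * expected_attempts

# Hypothetical workload: 1M traces a day, 10 percent sampled, $0.009 per call.
clean = monthly_judge_cost(1_000_000, 0.10, 0.009, parse_failure_rate=0.0)
with_retries = monthly_judge_cost(1_000_000, 0.10, 0.009, parse_failure_rate=0.05)
```

The gap between `clean` and `with_retries` is the part of the bill a per-token price comparison hides.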
Recent shifts worth tracking
| Date | Event | Why it matters |
|---|---|---|
| 2024 | Llama 3.3 70B, DeepSeek-V3, Prometheus 2 | Open-weight judging closed the gap to frontier on most subjective rubrics. |
| 2025 | GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro 1M+ | Frontier judge baseline shifted upward; recalibration required across stacks. |
| 2025 | Galileo Luna-2 released | Purpose-built 2B evaluator entered production stacks as the reference fine-tune. |
| 2026 | Future AGI Turing series across 50+ EvalTemplate rubrics | Fine-tuned classifier-backed scoring at lower per-eval cost than Galileo Luna-2. |
Where Future AGI fits in this stack
The judge model is one component. The stack around it is what compounds across two years of production.
The ai-evaluation SDK (Apache 2.0) ships CustomLLMJudge, a Jinja2-templated G-Eval primitive that runs against any LiteLLM-backed judge, including every model in this list. The same class powers 50+ pre-built EvalTemplate rubrics with Turing models as the default backend. Drop model="turing_flash" for the cheap first pass, model="gpt-5" or model="claude-sonnet-4-5" for the frontier rescore, or any open-weight model for the self-hosted path. The cascade is one API.
traceAI carries the same rubric as a span-attached EvalTag across Python, TypeScript, Java, and C#. The collector runs the eval server-side and writes results back as gen_ai.evaluation.* attributes. Zero added inline latency. The same rubric runs in pytest as a CI gate, in batch on offline traces, and on live spans. Running one rubric in all three places closes most of the eval-versus-production drift covered in the trace-eval gap post.
The Agent Command Center gateway handles judge routing across 100+ providers with exact and semantic caching, shadow / mirror / race routing, virtual keys with per-key budgets, and 18+ built-in guardrail scanners plus 15 third-party adapters. SOC 2 Type II, HIPAA, GDPR, CCPA certified; ISO/IEC 27001 in active audit. Canary judge swaps become A/B tests rather than deploy events. The Future AGI Platform layers self-improving evaluators that retune from thumbs feedback, lower per-eval cost than Galileo Luna-2, and Error Feed clustering that groups failing-judge traces over ClickHouse embeddings with a Sonnet 4.5 Judge writing the immediate_fix.
Ready to pick a judge against your own workload? Start with the ai-evaluation SDK quickstart, drop three candidates from this list into the same evaluate() call on your labeled set, and pick by kappa per dollar. That is the procurement output that holds for two years.
Related reading
- Why LLM-as-a-Judge (2026): The Case For, Against, and the Hybrid That Wins
- G-Eval (2026): The Definitive Guide for Production LLM Teams
- Evaluating LLM Judge Bias Mitigation (2026)
- LLM Arena as a Judge: Pairwise Comparison Evals (2026)
- LLM Judge Prompt Engineering Guide (2026)
- Best LLM-as-Judge Platforms (2026)
- Your Agent Passes Evals and Fails in Production (2026)
- The 2026 LLM Evaluation Playbook