Best LLM Judge Models in 2026: 8 Models Ranked on Calibration, Cost, and Self-Preference

Eight LLM judge models compared on human correlation, cost per score, latency, and self-preference bias. Pick by your rubric, not by SummEval.

April 18, 2025

Updated May 20, 2026

17 min read

llm-as-judge llm-judge-models judge-calibration self-preference galileo-luna-2 turing-models prometheus-2 2026

You ship a customer-support agent. The judge is GPT-4o and the helpfulness rubric reads 0.91 every Monday. In March the judge bumps to a 4o minor version. In April the agent quotes a refund off by an order of magnitude. The rubric still reads 0.91. The signal stopped meaning what you thought it meant the day the judge changed.

Most posts on this topic rank judge models by a single SummEval Spearman or an MT-Bench winrate. Those numbers do a lot of unspoken work. A judge that hits 0.514 Spearman on summarization is not automatically the judge that holds for two years on your customer-support rubric. The benchmark measures one task on one dataset with one rubric the paper authors wrote.

The thesis this post defends: pick a judge model by three axes. Human correlation against your rubric. Cost per score at your traffic volume. Self-preference bias against your candidate models. The best judge for your eval wins on your rubric, not on SummEval. We compare eight models worth running on a calibration set in May 2026.

Methodology note: scoring axes below are calibration ceiling (kappa against human labels on subjective rubrics), cost per million output tokens, p95 latency on a 2K-token transcript, self-preference bias documented across published work, and license/deployment shape. Pricing verified May 2026 against vendor pricing pages. Calibration is task-dependent. Treat these as starting points, not procurement decisions.

TL;DR: which judge wins on which axis

Axis	Pick
Best calibration ceiling on subjective rubrics	Claude Sonnet 4.5, Claude Opus 4.x
Best structured-output reliability and cost-quality balance	GPT-5 + GPT-5-mini cascade
Best long-context judging (over 200K) and multimodal rubrics	Gemini 2.5 Pro
Cheapest frontier-tier judge at scale	Gemini 2.5 Flash, GPT-5-mini
Best fine-tuned eval-specific judge (closed)	Galileo Luna-2
Best fine-tuned eval-specific judge (with full stack integration)	Future AGI `turing_large` / `turing_flash`
Best open-weight judge on cost	DeepSeek-V3
Best self-hosted regulated judge	Llama 3.3 70B
Best open-weight evaluator-specific fine-tune	Prometheus 2 (8x7B)
Worst idea	Same model as judge and candidate

If you read one row: there is no single winner. Run two or three candidates on a 100-to-300 example human-labeled set, measure kappa, divide by cost per score, then make the call.

The three axes that actually matter

The reason published leaderboards mislead is that they collapse the choice to one number. In production, three axes bind separately and the binding axis changes by team.

Human correlation on your rubric. Cohen’s kappa or Spearman against a human-labeled hold-out, computed on your data, not on SummEval. A marketing-copy rubric tolerates kappa around 0.6. A medical-advice rubric needs 0.85 or higher. A judge that hits 0.85 on a public benchmark and 0.55 on your dataset is a judge that does not know your domain. The fix is unglamorous: hand-label 100 to 300 examples covering the failure modes that matter, run every candidate judge against the same set, and only then read the leaderboard.
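A minimal sketch of the kappa computation, assuming categorical labels (pass/fail here) from one human annotator and one judge. The function name and the ten sample labels are illustrative, not taken from any SDK.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa between two label sequences of equal length."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    human_freq, judge_freq = Counter(human), Counter(judge)
    p_chance = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical pass/fail labels on ten hand-labeled examples.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass", "fail", "pass"]
kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.3f}")
```

Raw agreement in this toy set is 80 percent, but kappa lands well below that because both raters say "pass" most of the time. That gap is why the threshold should be set on kappa rather than on accuracy when labels are imbalanced.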

Cost per score at your traffic volume. A frontier judge call on a 30-second agent trace costs $0.01 to $0.05 depending on judge and tokens. At a million traces a day that is $10K to $50K daily, or roughly $300K to $1.5M monthly. The judge that wins on kappa and loses on dollar-per-score loses the procurement. Three patterns rescue the bill: fine-tuned judges that score at one to ten percent of frontier per call, classifier cascades that escalate only close cases to the frontier model, and sample-don’t-score on routine traffic.

Self-preference bias against your candidates. A judge favors outputs from its own family by a 10 to 25 percent win-rate margin per Zheng et al. 2023 (arXiv:2306.05685). The cardinal mistake is the same model as judge and candidate. The second mistake is judging GPT-5 candidates with a GPT-5 judge and Sonnet 4.5 candidates with a Sonnet judge in the same eval suite, which builds family bias into the comparison. The mitigation is a three-judge ensemble across families on launch decisions and single-family judges only for trend tracking.
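One way the cross-family ensemble could be wired, as a sketch. The model ids, the `JUDGE_FAMILY` mapping, and the rule of dropping the judge that shares a family with the candidate are assumptions made for illustration. Verdicts are assumed to be already collected from each judge.

```python
from collections import Counter
from typing import Dict, Optional

# Hypothetical mapping from judge model id to vendor family.
JUDGE_FAMILY = {
    "claude-sonnet-4-5": "anthropic",
    "gpt-5": "openai",
    "gemini-2.5-pro": "google",
}


def ensemble_verdict(votes: Dict[str, str], candidate_family: str) -> Optional[str]:
    """Majority verdict over judges that are not from the candidate's family.

    Returns None when fewer than two eligible judges remain or the vote ties,
    so the case can be escalated instead of silently decided.
    """
    eligible = [
        verdict
        for judge_id, verdict in votes.items()
        if JUDGE_FAMILY.get(judge_id) != candidate_family
    ]
    if len(eligible) < 2:
        return None
    (top, top_count), *rest = Counter(eligible).most_common()
    if rest and rest[0][1] == top_count:
        return None  # tie between the leading verdicts
    return top


votes = {"claude-sonnet-4-5": "pass", "gpt-5": "fail", "gemini-2.5-pro": "fail"}
print(ensemble_verdict(votes, candidate_family="meta"))    # all three judges vote
print(ensemble_verdict(votes, candidate_family="openai"))  # gpt-5 is excluded
```

Returning `None` on a tie is a deliberate choice in this sketch: a split between families is itself a signal worth sending to human review.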

Add one operational axis underneath the three: judge version stability. Sonnet 4.5 to Sonnet 4.6 is a minor bump that ships with a new training mix. The mean rubric score can shift 3 to 8 points and the distribution can narrow. If you do not pin the judge model id inside the eval contract, the dashboard moves even though the agent did not change. See the why-LLM-as-a-judge post and the G-Eval definitive guide for the contract pattern.
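A sketch of what pinning can look like in practice, assuming the eval contract is a small immutable record stored next to every score. The class name, its fields, and the model ids below are hypothetical; the linked posts may define the contract differently.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JudgeContract:
    """Everything that must stay fixed for scores to remain comparable."""
    judge_model_id: str   # a dated or versioned id, not a floating alias
    rubric_name: str
    rubric_prompt: str
    temperature: float = 0.0

    def fingerprint(self) -> str:
        # Stable hash of the contract; store it next to every score.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def comparable(a: JudgeContract, b: JudgeContract) -> bool:
    """Scores are only comparable when produced under the same contract."""
    return a.fingerprint() == b.fingerprint()


# Hypothetical model ids, used only for illustration.
march = JudgeContract("judge-model-2026-03-01", "helpfulness", "Rate helpfulness 0-1.")
april = JudgeContract("judge-model-2026-04-01", "helpfulness", "Rate helpfulness 0-1.")
print(march.fingerprint(), april.fingerprint(), comparable(march, april))
```

With the fingerprint on every score row, a dashboard can refuse to draw one trend line across two contracts, which is the failure the opening anecdote describes.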

The eight judge models worth running in 2026

1. Claude Sonnet 4.5 / Opus 4.x: the calibration ceiling

Closed-weight frontier. Anthropic, Bedrock, Vertex.

Where it earns its bill. Subjective rubrics that need open-ended reasoning over long context. Multi-document RAG faithfulness, multi-turn conversation adherence, agent-trajectory rubrics. Sonnet 4.5 sits at the top of LMSYS Chatbot Arena’s pairwise preference table and produces reasoning chains dense enough to survive an audit log. The 200K window covers most production transcripts in one pass.

Cost. $3 input / $15 output per 1M tokens for Sonnet 4.5; Opus 4.x runs higher. A 2K-input, 200-output call lands near $0.009 per score. Viable as the second-stage judge in a cascade. Expensive as the first pass.
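The per-score figure is plain token arithmetic. A small sketch, with a hypothetical helper name, that reproduces it from the prices and token counts quoted above:

```python
def cost_per_score(input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Dollar cost of one judge call given token counts and per-million prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


# 2K input tokens at $3 per 1M plus 200 output tokens at $15 per 1M.
sonnet_call = cost_per_score(2_000, 200, usd_per_m_input=3.0, usd_per_m_output=15.0)
print(f"${sonnet_call:.4f} per score")  # $0.0090 per score
```

Input tokens account for two thirds of the bill in this example, so trimming the transcript handed to the judge moves cost more than shortening the judge's reasoning.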

Latency. 1.5 to 3 seconds p95. Not an inline-guardrail judge.

Self-preference. Claude prefers Claude-family outputs in published evaluations. Do not use Sonnet 4.5 to judge a Sonnet 4.5 candidate. Pair with GPT or Gemini in an ensemble for launch decisions.

Best for. High-stakes judging where calibration matters more than cost. Pre-launch validation. The frontier slot in a two-stage cascade.

2. GPT-5 / GPT-5-mini: the structured-output workhorse

Closed-weight frontier. OpenAI, Azure OpenAI.

Where it earns its bill. Structured-output evaluation. JSON-mode reliability on GPT-5 is the lowest-parsing-failure tier in this list, which matters when 100K judgments at a 5 percent parse-failure rate is 5,000 retries. The two-stage GPT-5-mini-screens / GPT-5-rescore cascade lands under 30 percent of GPT-5-only cost at the same calibration target.
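To make the parse-failure cost concrete, here is a sketch of a validate-then-retry loop around a judge call. The judge is a stub callable, and the expected schema (a JSON object with a `score` in [0, 1]) is an assumption for illustration, not any vendor's JSON-mode contract.

```python
import json
from typing import Callable, Optional, Tuple


def parse_judgment(raw: str) -> Optional[dict]:
    """Return the judgment dict if raw is a JSON object with a score in [0, 1], else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    if not 0.0 <= score <= 1.0:
        return None
    return data


def judge_with_retries(call_judge: Callable[[], str],
                       max_attempts: int = 3) -> Tuple[Optional[dict], int]:
    """Call the judge until its output parses; return (judgment, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        judgment = parse_judgment(call_judge())
        if judgment is not None:
            return judgment, attempt
    return None, max_attempts


# Stub judge that returns truncated JSON once, then a valid judgment.
responses = iter([
    '{"score": 0.9, "reason": "cites pol',
    '{"score": 0.9, "reason": "cites policy"}',
])
judgment, attempts = judge_with_retries(lambda: next(responses))
print(judgment, attempts)
```

Every extra attempt is a full-price call, so logging `attempts` per judgment gives the retry multiplier needed for an honest cost projection.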

Cost. Verify on the pricing page. GPT-5-mini is roughly an order of magnitude cheaper per million tokens than GPT-5.

Latency. GPT-5-mini sub-second; GPT-5 at 1.5 to 3 seconds p95.

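Self-preference. GPT-4 was the original self-preference data point in Zheng et al. 2023 (arXiv:2306.05685), which reported it favoring its own outputs by roughly a 10 percent win-rate margin, the low end of the 10 to 25 percent range found across judges in that study. Treat the bias as inherited by GPT-5 until measured.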

Best for. General-purpose judge in OpenAI stacks. Strict structured-output rubrics where parse failures kill throughput.

3. Gemini 2.5 Pro / Flash: long context, multimodal, cost-balanced

Closed-weight frontier. Vertex AI, AI Studio.

Where it earns its bill. Context windows past 200K. Multimodal judging on image, audio, and video inputs. Gemini 2.5 Pro carries a 1M-plus context window with experimental 2M tiers. The only judge here that scores a whole-document multi-doc RAG response in one pass. Flash brings frontier-tier reasoning to a price point where million-span-a-day scoring is financially viable; Vertex Batch Prediction discounts offline workloads further.

Latency. Flash sub-second on short inputs; Pro at 2 to 4 seconds on long-context judging. Region-dependent.

Self-preference. Less publicly documented than GPT or Claude. Assume present, measure on your set, do not judge Gemini candidates with a Gemini judge.

Worth flagging. The 1M-plus window degrades subtly past a few hundred K tokens. Calibrate empirically on long inputs.

Best for. Long-context judging where 200K is not enough. Multimodal rubrics. Cost-sensitive high-volume scoring with Flash.

4. Galileo Luna-2: the purpose-built evaluator

Closed-weight fine-tune. Galileo.

Where it earns its bill. Eval-specific fine-tunes win on cost-per-score for the rubrics they were trained on. Luna-2 is a 2B-parameter model fine-tuned for hallucination, context adherence, and tool-call correctness. Galileo published agreement numbers against GPT-4o on internal benchmarks and prices it at a fraction of frontier judge calls. The reference category for “small purpose-built judge beats the big general-purpose one on its home turf.”

Latency. Sub-second by design.

Self-preference. Not applicable in the same shape. Luna-2 was not trained as a general assistant, so the bias surface is rubric framing rather than family preference. Kappa on rubrics outside its training distribution can degrade sharply.

Worth flagging. Strong pick when rubrics overlap training. Wrong pick for rubrics that require open-ended reasoning the fine-tune has not seen. The Future AGI Platform runs the same cascade at lower per-eval cost than Galileo Luna-2 with tighter trace + optimizer + gateway integration.

Best for. Cost-optimized hallucination, context adherence, and tool-call scoring. The fine-tune slot in a cascade.

5. Future AGI Turing series (`turing_large`, `turing_flash`): fine-tune meets full stack

Closed-weight fine-tune. Routed through the ai-evaluation SDK.

Where it earns its bill. FAGI ships three Turing variants (turing_flash, turing_small, turing_large; see fi/evals/core/registry.py) as the default judge backend for 50+ pre-built EvalTemplate rubrics. Call evaluate("faithfulness", output=..., context=..., model="turing_flash") and the SDK routes to a fine-tuned classifier. turing_large handles heavier reasoning rubrics; turing_flash carries the 50-70 ms p95 inline guardrail path through the Agent Command Center gateway.

Cost. The Future AGI Platform runs Turing-backed scoring at lower per-eval cost than Galileo Luna-2. Volume discounts make daily full-traffic judging financially viable.

Latency. turing_flash at 50-70 ms p95; turing_large at 1-2 seconds for full templates.

Self-preference. Like Luna-2, the bias surface is rubric framing rather than family preference. Calibrate on a human-labeled set per rubric.

Worth flagging. The Turing series composes natively with traceAI spans, Error Feed clustering, and the agent-opt optimizers. Honest tradeoff: Galileo has more public head-to-head agreement data; FAGI has tighter integration and a CustomLLMJudge that runs the same rubric against any LiteLLM-backed model when you want to BYOK GPT, Claude, Gemini, DeepSeek, or open-weight judges through the same SDK.

Best for. Production teams running traces, evals, optimization, and gateway routing in one stack. The inline-guardrail slot with turing_flash. BYOK plus FAGI-routed scoring in the same workflow.

6. DeepSeek-V3: open-weight cost leader

Open weight. DeepSeek Model License; verify commercial terms.

Where it earns its bill. Open-weight cost economics at frontier-tier reasoning. DeepSeek-V3 is a 671B MoE with 37B active parameters, so inference cost is closer to a 37B dense model than to a 671B dense one. Cost-per-token sits in a different league from closed frontier judges. Reasoning-rubric calibration is competitive within a small kappa gap.

Cost. Verify on DeepSeek API pricing or hosted providers like Together. A fraction of GPT-5 or Sonnet 4.5 at frontier-tier quality.

Latency. 1-2 seconds p95 on hosted endpoints; self-hosting requires multi-GPU because MoE is memory-heavy.

Worth flagging. Geopolitical and data-handling considerations apply for some procurement contexts. The cost win compounds at volume; so does the calibration gap if you skip the work.

Best for. High-volume routine scoring where cost-per-judgment binds. The first-pass judge in a two-stage cascade.

7. Llama 3.3 70B: the regulated self-host pick

Open weight. Meta’s Llama 3.3 license.

Where it earns its bill. Regulated workloads where data cannot leave the boundary. Llama 3.3 70B is the standard self-hosted judge in 2026 production stacks. Served via vLLM, TGI, Bedrock, Together, or Fireworks, it holds within a small kappa gap to frontier on subjective rubrics. Dense 70B is operationally cheaper to serve than DeepSeek’s 671B MoE: fewer GPU memory headaches, simpler scaling.

Cost. Free weights. Roughly $1-3 per million tokens on managed providers like Together, Fireworks, and Bedrock. Self-hosting typically runs 4xH100 for BF16 or 2xH100 for FP8.

Latency. 1-2 seconds p95 on managed endpoints.

Self-preference. Zheng et al. 2023 documented the effect on GPT-4 and Claude judges rather than on Llama judges. Assume the family effect applies to Llama 3.3 until you measure it on your own set.

Worth flagging. Not a frontier model on hard reasoning rubrics. Function-calling reliability varies by serving framework. Structured-output reliability is good on hosted endpoints but not as strict as OpenAI JSON mode.

Best for. Regulated workloads (healthcare, banking, defense), on-premise deployments, and cost-optimized scoring where hosted closed models are a non-starter.

8. Prometheus 2 (8x7B): the open-weight evaluator fine-tune

Open weight. Apache 2.0. arXiv:2405.01535.

Where it earns its bill. Prometheus 2 is a Mistral-family fine-tune trained on a 100K-instance feedback dataset specifically for absolute and pairwise scoring. The paper showed it outperformed its base Mistral by a wide margin on judge benchmarks and approached GPT-4 on absolute scoring. Two variants ship (7B dense and 8x7B MoE) covering the cost-quality spread.

Cost. Free weights. Sub-dollar per million tokens on hosted providers; cheaper self-hosted.

Latency. Sub-second to 1 second p95.

Self-preference. Trained on diverse teacher annotations rather than its own family’s outputs, which softens (does not eliminate) the bias.

Worth flagging. Right pick when you specifically need an evaluator-trained open-weight model rather than a general-purpose one serving as a judge. Calibration degrades on rubrics outside its training distribution: the universal fine-tune caveat.

Best for. Open-source eval stacks. Research workflows where reproducibility matters. The open-weight slot in a cascade where data residency or budget forbids hosted frontier judges.

How the eight stack up

Judge	Weight access	Calibration ceiling	Cost shape	p95 latency (2K input)	Self-preference risk
Claude Sonnet 4.5 / Opus 4.x	Closed	Frontier	$3 in / $15 out per 1M	1.5-3s	High vs Claude
GPT-5 / GPT-5-mini	Closed	Frontier	Frontier; mini ~10x cheaper	0.5-3s	High vs GPT
Gemini 2.5 Pro / Flash	Closed	Frontier; multimodal	Flash cost-balanced	0.5-4s	Measure
Galileo Luna-2	Closed fine-tune	Strong on trained rubrics	Far below frontier	Sub-second	Rubric-framing
FAGI `turing_large` / `flash`	Closed fine-tune	Strong on trained rubrics	Below Luna-2 per eval	50-70 ms (flash)	Rubric-framing
DeepSeek-V3	Open (MoE)	Small gap to frontier	Open-weight low	1-2s	Measure
Llama 3.3 70B	Open (dense)	Small gap to frontier	$1-3 per 1M hosted	1-2s	High vs Llama
Prometheus 2 (8x7B)	Open (Apache 2.0)	Eval-tuned absolute/pairwise	Open-weight low	0.5-1s	Softened

Two patterns fall out. First, fine-tuned evaluators (Luna-2, Turing, Prometheus 2) sit in a different cost-latency tier than frontier general-purpose judges, and the cascade pattern (fine-tune first, frontier on disagreements) is how most production stacks reconcile the two. Second, the open-weight class is the only path for regulated workloads where data cannot leave the boundary, and the kappa gap to frontier is small enough on most subjective rubrics that the self-host pick is rarely a quality compromise.

The cascade pattern beats picking one judge

Teams that have run judges for more than a quarter stop trying to pick “the” judge. Four layers replace the question.

Layer 1: deterministic floor. JSON schema, refusal regex, parser. Sub-millisecond, free, never drifts.

Layer 2: fine-tuned classifier judge. turing_flash, Luna-2, or a Gemma 3n adapter. Runs on every span at sub-second latency and one to ten percent of frontier cost. Catches confident pass and confident fail; flags close calls.

Layer 3: frontier general-purpose judge. Sonnet 4.5, GPT-5, Gemini 2.5 Pro. Runs only on the close calls layer 2 flagged. Bills for reasoning only when reasoning is the binding constraint.

Layer 4: human review queue. Cases the frontier judge still scores ambiguously go to a labeling queue. Labels feed back into rubric calibration so layer 2 catches more next quarter.

A million-span-a-day frontier-only workload at $0.01 to $0.05 per score is a $300K-to-$1.5M monthly bill. The same workload as a cascade with 90 percent caught at layer 2 drops the frontier bill by 90 percent without losing detection on hard cases. The ai-evaluation SDK ships the cascade as one call (augment=True runs a local classifier first and hands its reasoning to the frontier judge as in-context evidence). See why LLM-as-a-judge for the full pattern.
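The four layers can be sketched in a few lines. This is not the ai-evaluation SDK's implementation: the classifier and frontier judges are stub callables returning a pass probability, and the JSON shape, refusal regex, and 0.2/0.8 confidence band are assumptions chosen for illustration.

```python
import json
import re
from typing import Callable, Tuple

# Hypothetical refusal pattern; a real floor would carry a longer list.
REFUSAL = re.compile(r"\b(i can(?:'|no)t help|i am unable to)\b", re.IGNORECASE)


def deterministic_floor(output: str) -> bool:
    """Layer 1: output must be a JSON object with a string 'answer' that is not a refusal."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        return False
    return REFUSAL.search(data["answer"]) is None


def cascade(output: str,
            classifier: Callable[[str], float],
            frontier: Callable[[str], float],
            band: Tuple[float, float] = (0.2, 0.8)) -> Tuple[str, str]:
    """Return (verdict, layer). Judge scores are pass probabilities in [0, 1]."""
    if not deterministic_floor(output):
        return "fail", "layer1_deterministic"
    low, high = band
    score = classifier(output)             # Layer 2: cheap, runs on every span
    if score >= high:
        return "pass", "layer2_classifier"
    if score <= low:
        return "fail", "layer2_classifier"
    score = frontier(output)               # Layer 3: only on close calls
    if score >= high:
        return "pass", "layer3_frontier"
    if score <= low:
        return "fail", "layer3_frontier"
    return "needs_review", "layer4_human"  # Layer 4: still ambiguous


good = json.dumps({"answer": "Your refund of $42.00 was issued."})
print(cascade(good, classifier=lambda _: 0.95, frontier=lambda _: 0.5))
print(cascade(good, classifier=lambda _: 0.50, frontier=lambda _: 0.5))
```

The band width is the cost dial: widening it sends more traffic to the frontier judge, narrowing it trusts the classifier more. Tune it on the calibration set rather than by intuition.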

How to pick: a 90-minute calibration sprint

You do not pick a judge model from a blog post. You pick by running the candidates on your data. This is the cheapest version of that.

Hand-label 100 to 300 examples that cover the failure modes your rubric needs to catch. Use two annotators where possible and discard low-agreement items. The set is your ground truth for the next quarter.
Run three or four candidate judges from this list against the same set. Include a frontier closed model (Sonnet 4.5 or GPT-5), a frontier-mini (Gemini 2.5 Flash or GPT-5-mini), a fine-tune (turing_large or Luna-2), and an open-weight model (DeepSeek-V3 or Llama 3.3 70B). Use the same rubric prompt and structured output for all of them, and randomize position on pairwise comparisons.
Compute Cohen’s kappa or accuracy against the human labels for each judge. Plot kappa on the x-axis and cost-per-score on the y-axis. The Pareto frontier is your shortlist.
Test self-preference explicitly. Take 50 pairs where one side is from the candidate model family and one side is from a different family, with human labels for which is actually better. Score in both orderings. A judge whose own-family winrate exceeds the human-label winrate by more than 10 percentage points has self-preference you cannot ignore.
Project cost at three months. Token cost times tokens-per-judgment times sample rate times retries. The cheapest judge that hits your kappa threshold wins, not the highest kappa.
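The shortlist and final pick from the sprint can be computed mechanically. A sketch, assuming you already have one kappa and one cost-per-score per judge; the candidate names and numbers below are made up for illustration and are not measurements.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    kappa: float            # agreement with human labels on your set
    cost_per_score: float   # dollars, including retries


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep judges that no other judge matches or beats on both kappa and cost."""
    def dominated(c: Candidate) -> bool:
        return any(
            o.kappa >= c.kappa and o.cost_per_score <= c.cost_per_score
            and (o.kappa > c.kappa or o.cost_per_score < c.cost_per_score)
            for o in candidates
        )
    return [c for c in candidates if not dominated(c)]


def cheapest_meeting(candidates: List[Candidate], min_kappa: float) -> Optional[Candidate]:
    """The cheapest judge that clears the kappa threshold, or None if none does."""
    eligible = [c for c in candidates if c.kappa >= min_kappa]
    return min(eligible, key=lambda c: c.cost_per_score, default=None)


# Made-up sprint results, for illustration only.
results = [
    Candidate("frontier_closed", 0.82, 0.0090),
    Candidate("frontier_mini", 0.74, 0.0012),
    Candidate("fine_tune", 0.78, 0.0004),
    Candidate("open_weight", 0.71, 0.0010),
]
shortlist = pareto_frontier(results)
pick = cheapest_meeting(results, min_kappa=0.75)
print([c.name for c in shortlist], pick.name if pick else None)
```

In this toy data the frontier judge stays on the shortlist despite its price because nothing else matches its kappa. That is the shape that usually produces a two-judge cascade rather than a single winner.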

The sprint usually surfaces a two-judge cascade (one fine-tune for layer 2, one frontier for layer 3) and one open-weight fallback for the regulated path. That is the procurement output, not “judge model X is best.”

Common mistakes when picking an LLM judge model

Defaulting to last year’s model. Judge model versions shift behavior. A judge calibrated on GPT-4 is not calibrated on GPT-5 without re-validation. Pin the judge model id inside the eval contract; bump deliberately.
Same model as judge and candidate. Self-preference inflates the win rate of the judge’s own family by 10 to 25 percent per Zheng et al. 2023. The cardinal mistake.
Pricing only the per-token cost. Real cost equals per-token price times tokens-per-judgment times retries times sample rate. A judge with 5 percent parsing failures costs more than the per-token rate suggests.
Trusting a vendor leaderboard number. Vendors benchmark on their training distribution. Your data is not their training distribution. Calibrate.
Treating judging as a model choice. Production judging is a workflow: cascade, calibrate, audit, re-calibrate. The model is one input.
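The pricing mistake is easy to quantify. A sketch of the projection, assuming each failed parse triggers a retry at full price up to a fixed attempt cap. The function names and the example inputs (10 percent sampling, $0.009 per call, 5 percent parse failures) are illustrative assumptions.

```python
def expected_attempts(parse_failure_rate: float, max_attempts: int = 3) -> float:
    """Expected calls per judgment when each failed parse triggers a retry."""
    f = parse_failure_rate
    if not 0.0 <= f < 1.0:
        raise ValueError("parse_failure_rate must be in [0, 1)")
    # Attempt k happens only if the previous k attempts all failed: 1 + f + f^2 + ...
    return sum(f ** k for k in range(max_attempts))


def monthly_judge_cost(traces_per_day: int, sample_rate: float, cost_per_call: float,
                       parse_failure_rate: float, days: int = 30) -> float:
    """Projected monthly spend: volume x sample rate x retry multiplier x price."""
    calls_per_day = traces_per_day * sample_rate * expected_attempts(parse_failure_rate)
    return calls_per_day * cost_per_call * days


naive = monthly_judge_cost(1_000_000, 0.10, 0.009, parse_failure_rate=0.0)
real = monthly_judge_cost(1_000_000, 0.10, 0.009, parse_failure_rate=0.05)
print(f"naive ${naive:,.0f}  with retries ${real:,.0f}")
```

The retry multiplier looks small at 5 percent, but it scales linearly with volume and applies to the most expensive layer of a cascade, so it belongs in the projection from day one.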

Recent shifts worth tracking

Date	Event	Why it matters
2024	Llama 3.3 70B, DeepSeek-V3, Prometheus 2	Open-weight judging closed the gap to frontier on most subjective rubrics.
2025	GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro 1M+	Frontier judge baseline shifted upward; recalibration required across stacks.
2025	Galileo Luna-2 released	Purpose-built 2B evaluator entered production stacks as the reference fine-tune.
2026	Future AGI Turing series across 50+ EvalTemplate rubrics	Fine-tuned classifier-backed scoring at lower per-eval cost than Galileo Luna-2.

Where Future AGI fits in this stack

The judge model is one component. The stack around it is what compounds across two years of production.

The ai-evaluation SDK (Apache 2.0) ships CustomLLMJudge, a Jinja2-templated G-Eval primitive that runs against any LiteLLM-backed judge, including every model in this list. The same class powers 50+ pre-built EvalTemplate rubrics with Turing models as the default backend. Drop model="turing_flash" for the cheap first pass, model="gpt-5" or model="claude-sonnet-4-5" for the frontier rescore, or any open-weight model for the self-hosted path. The cascade is one API.

traceAI carries the same rubric as a span-attached EvalTag across Python, TypeScript, Java, and C#. The collector runs the eval server-side and writes results back as gen_ai.evaluation.* attributes. Zero added inline latency. The same rubric runs in pytest as a CI gate, in batch on offline traces, and on live spans. That parity closes most of the eval-versus-production drift covered in the trace-eval gap post.

The Agent Command Center gateway handles judge routing across 100+ providers with exact and semantic caching, shadow / mirror / race routing, virtual keys with per-key budgets, and 18+ built-in guardrail scanners plus 15 third-party adapters. SOC 2 Type II, HIPAA, GDPR, CCPA certified; ISO/IEC 27001 in active audit. Canary judge swaps become A/B tests rather than deploy events. The Future AGI Platform layers self-improving evaluators that retune from thumbs feedback, lower per-eval cost than Galileo Luna-2, and Error Feed clustering that groups failing-judge traces over ClickHouse embeddings with a Sonnet 4.5 Judge writing the immediate_fix.

Ready to pick a judge against your own workload? Start with the ai-evaluation SDK quickstart, drop three candidates from this list into the same evaluate() call on your labeled set, and pick by kappa per dollar. That is the procurement output that holds for two years.

Frequently asked questions

Which LLM is the best judge model in 2026?

There is no single best judge. The right one wins on three axes against your rubric: agreement with your human labels (Cohen's kappa), cost per score at your traffic volume, and self-preference bias against your candidate models. Claude Sonnet 4.5 and GPT-5 sit at the top for subjective rubrics with rich reasoning. Gemini 2.5 Flash and GPT-5-mini are the cost-balanced workhorses. Galileo Luna-2 and the Future AGI Turing models (`turing_large`, `turing_flash`) are fine-tuned eval-specific judges that beat general-purpose frontier models on price-per-score while holding kappa. Llama 3.3 70B and DeepSeek-V3 are the self-hosted picks. Prometheus 2 is the open-weight option specifically trained on absolute and pairwise scoring. Pick by the axis that binds you, then calibrate on a 100-300 example human-labeled set.

How do I measure self-preference bias before picking a judge?

Build a 200-pair set where each pair has one output from the candidate judge's family and one from a different family, with human labels that say which is actually better. Run the judge in both pairwise orders (swap A and B). A judge with no self-preference produces the same winrate as the human-label ground truth in both orders. A self-preferring judge inflates its own family's winrate by 10 to 25 percent. Zheng et al. 2023 (arXiv:2306.05685) reported that GPT-4 and Claude each favor their own outputs, by roughly 10 and 25 percent respectively. Mitigation: never use the candidate model as its own judge, and on launch decisions run a three-judge ensemble across families (Sonnet, GPT, Gemini) so that no single family's bias decides the outcome.
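A sketch of the measurement itself, assuming each pair has already been scored by the judge in both orders and labeled by humans. The record layout and the "own"/"other" encoding are assumptions for illustration.

```python
from typing import List, NamedTuple


class PairResult(NamedTuple):
    human_winner: str       # "own" if humans preferred the judge-family output, else "other"
    judge_own_first: str    # judge's pick when its family's output was shown as A
    judge_own_second: str   # judge's pick when its family's output was shown as B


def self_preference_gap(pairs: List[PairResult]) -> float:
    """Judge's own-family winrate minus the human own-family winrate.

    Both orderings count as separate judgments, so each output is seen
    once in each position.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    human_rate = sum(p.human_winner == "own" for p in pairs) / len(pairs)
    judge_wins = sum(
        (p.judge_own_first == "own") + (p.judge_own_second == "own") for p in pairs
    )
    return judge_wins / (2 * len(pairs)) - human_rate


def position_flip_rate(pairs: List[PairResult]) -> float:
    """Fraction of pairs where swapping A and B changed the judge's pick."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(p.judge_own_first != p.judge_own_second for p in pairs) / len(pairs)


# Four toy pairs; a real set would have around 200.
pairs = [
    PairResult("own", "own", "own"),
    PairResult("other", "own", "own"),      # judge overrules humans in its own favor
    PairResult("other", "other", "other"),
    PairResult("own", "own", "other"),      # pick flips with position
]
print(self_preference_gap(pairs), position_flip_rate(pairs))
```

Reporting the flip rate alongside the gap keeps position bias from being mistaken for family bias: a judge that flips often is unreliable in a different way than one that consistently favors its own family.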

Are open-weight judges good enough for production?

For most rubrics, yes. Llama 3.3 70B and DeepSeek-V3 hold within a small kappa gap to frontier closed judges on subjective rubrics like helpfulness, faithfulness, and instruction adherence. Prometheus 2 was fine-tuned specifically for absolute and pairwise scoring and outperforms its base Mistral model by a wide margin on judge benchmarks. The exact gap to frontier varies by rubric and is the thing you should measure on your own labeled set rather than trust from a public leaderboard. The cost win compounds: a hosted Llama 3.3 70B call runs roughly $1-3 per million tokens versus $3 input / $15 output for Sonnet 4.5. For high-volume scoring the open-weight pick can cut the monthly bill severalfold, depending on your input-to-output token ratio.

What is Galileo Luna-2 and how does it compare to Future AGI Turing models?

Galileo Luna-2 is a 2B-parameter purpose-built evaluator model fine-tuned for hallucination, context adherence, and tool-call correctness scoring. Galileo published agreement numbers against GPT-4o on internal benchmarks and prices it at a fraction of frontier judge calls. The Future AGI Turing series (`turing_flash`, `turing_small`, `turing_large` — see `fi/evals/core/registry.py`) is the equivalent FAGI surface: fine-tuned classifier-backed evaluators that route automatically when you call `evaluate(metric, model='turing_flash')`. The Future AGI Platform runs the same cascade at lower per-eval cost than Galileo Luna-2, and `turing_flash` powers the 50-70 ms inline guardrail path that Luna-2 does not target. The honest tradeoff: Galileo has more public agreement data; FAGI has tighter integration with the trace, optimizer, and gateway surfaces in one stack.

Should I use a fine-tuned judge or a frontier general-purpose model?

Run both on your calibration set and pick by kappa per dollar. Fine-tuned judges (Galileo Luna-2, Future AGI Turing, Prometheus 2) win on cost and latency for the rubrics they were trained on. Frontier general-purpose models (Sonnet 4.5, GPT-5) win on rubrics that require open-ended reasoning the fine-tune has not seen. The production pattern most teams settle on: fine-tuned judge as the cheap first pass on every span; frontier judge as the second pass on the cases the fine-tune flagged as close calls; deterministic checks underneath so the judge never runs on cases a parser already failed. The cascade drops the bill 80 to 90 percent without losing detection rate on hard cases.

How does Future AGI's Turing series integrate with this evaluation stack?

The `ai-evaluation` SDK (Apache 2.0) exposes Turing models as the default judge backend on 50+ built-in evaluators. Call `evaluate('faithfulness', output=..., context=..., model='turing_flash')` and the SDK routes through the FAGI cloud to a fine-tuned classifier; `turing_large` routes to the larger reasoning judge for harder rubrics; `CustomLLMJudge` accepts any LiteLLM model when you want to bring your own. `turing_flash` runs at roughly 50-70 ms p95 for inline guardrails through the Agent Command Center gateway; offline eval templates run in 1-2 seconds. The honest framing: Turing models are fine-tuned for the FAGI evaluator surface, so they compose with traceAI spans, Error Feed clustering, and the agent-opt optimizers without rewiring. For BYOK across GPT, Claude, Gemini, DeepSeek, or open-weight models, the same SDK runs the same rubric against any LiteLLM endpoint.
