LLM-Judge Prompt Engineering: The 2026 Engineering Guide

Judge prompts are eval-time programs. Five elements decide signal vs noise: criterion, calibrated examples, randomization, schema, self-consistency.

April 20, 2026

Updated May 20, 2026

12 min read

llm-judge prompt-engineering llm-evaluation rubric-design judge-calibration judge-bias agent-evaluation 2026

Most LLM-judge prompts fail silently. They return numbers, the dashboard turns green, and the team ships. Two weeks later production breaks in ways the judge never flagged. The judge model wasn’t broken. The prompt was.

The opinion this guide earns: judge prompts are eval-time programs. They need a tight criterion, calibrated examples, position-randomized inputs, structured output, and self-consistency checks. Skip any one of these five and your judge scores noise dressed up as signal. Everything else (chain-of-thought, role lines, scale anchors, anti-bias preambles) is implementation detail in service of those five. Get the five right and a rubric that launches at 0.3 kappa can reach 0.6 to 0.75 in an afternoon of calibration.

The five elements, with patterns and anti-patterns

1. Tight criterion

The criterion is the operational definition of what the judge measures. “Helpful” is aspirational. “Addresses every part of the user’s question with a concrete next step” is operational, because a judge can check it against the input.

Anti-pattern. “Rate the helpfulness on a scale of 1 to 5.” Helpful to whom, under what definition, judged how? The judge invents its own definition per call and variance balloons. Two reruns of the same input return different scores because the rubric itself drifts.

Pattern. One task per prompt, three to seven named dimensions, each with a one-line operational definition the judge can verify against the input.

Groundedness: every factual claim in the response is supported by the provided context.
Completeness: the response addresses every part of the user's question.
Citation validity: every cited span exists verbatim in the provided context.

The moment you bundle “score this for groundedness and completeness and tone” into one judge call, you lose the ability to decompose failures, and any weak dimension drags the others toward the mean. If you need three rubrics, run three judge calls and aggregate downstream. Cost is linear; signal quality compounds.
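A minimal sketch of the one-rubric-per-call pattern, assuming the three dimensions listed above. The names DIMENSIONS, build_prompts, and aggregate are illustrative and not part of any SDK; the judge call itself is left out.

```python
from statistics import mean

# Hypothetical rubric: one operational definition per dimension.
DIMENSIONS = {
    "groundedness": "Every factual claim in the response is supported by the provided context.",
    "completeness": "The response addresses every part of the user's question.",
    "citation_validity": "Every cited span exists verbatim in the provided context.",
}


def build_prompts(question, context, response):
    """Return one single-dimension judge prompt per rubric dimension."""
    prompts = {}
    for name, definition in DIMENSIONS.items():
        prompts[name] = (
            f"Score exactly one dimension: {name}.\n"
            f"Definition: {definition}\n\n"
            f"Question: {question}\nContext: {context}\nResponse: {response}\n"
        )
    return prompts


def aggregate(scores):
    """Keep per-dimension scores visible; add the mean and the weakest dimension."""
    weakest = min(scores, key=scores.get)
    return {
        "per_dimension": dict(scores),
        "mean": mean(list(scores.values())),
        "weakest": weakest,
    }
```

Because each dimension gets its own call, a failing eval points at a named dimension rather than at a blended number.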

2. Calibrated few-shot examples

Two to five examples that span the scale, mixing strong, borderline, and weak. The borderline ones matter most: they teach the judge where the boundary lives.

Anti-pattern. Three examples of the obviously-good case and one example of the obviously-bad case. The judge already knows what “obviously good” looks like; it doesn’t know where you draw your specific line. U-shaped score distributions in the resulting eval are the symptom.

Pattern. Prioritize the boundary cases. If you can only fit one example per scale point, pick the one that sits closest to the next boundary up or down.

Examples:
- Context: "X founded in 2019." Response: "X was founded in 2019." -> 1.0
- Context: "X founded in 2019." Response: "X, founded in 2019, has 500 employees." -> 0.4 (employee count ungrounded)
- Context: "X founded in 2019." Response: "X was founded in 2021 [src: doc1]." -> 0.0 (contradicts + fabricated citation)

Borderline examples are also where calibration lives. When you sweep prompt variants, the score on borderline examples is what moves; the obvious ones rarely change. If the borderline set is shallow, calibration cannot find the right rubric.
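One way to bias the few-shot pool toward the boundary, sketched under assumptions: each candidate carries a human score on a 0-1 scale, and you know the thresholds that matter (here a single hypothetical pass line at 0.5). The function name and the field names are made up for illustration.

```python
def pick_boundary_examples(pool, boundaries, per_boundary=1):
    """Return ids of the examples whose human score sits closest to each boundary.

    pool: list of dicts with hypothetical keys "id" and "human_score".
    boundaries: score thresholds where the rubric has to draw a line.
    """
    picked = []
    for boundary in boundaries:
        # Rank the whole pool by distance to this boundary, nearest first.
        ranked = sorted(pool, key=lambda ex: abs(ex["human_score"] - boundary))
        # Skip examples already chosen for an earlier boundary.
        fresh = [ex["id"] for ex in ranked if ex["id"] not in picked]
        picked.extend(fresh[:per_boundary])
    return picked
```

Called with boundaries=[0.5] and per_boundary=2, this returns the two examples that straddle the pass line and ignores the obvious 0.0 and 1.0 cases.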

3. Position randomization

The judge inherits systematic biases from the underlying model, and the strongest ones are about input ordering and context contamination.

Anti-pattern. Pairwise comparison (“which response is better, A or B?”) with a fixed slot order. Frontier judges show 10 to 15 points of winrate swing depending on which response sits in slot A, per Zheng et al. 2023 (arXiv:2306.05685). A fixed-order pairwise judge is measuring position as much as quality.

Anti-pattern, second variant. The ground-truth reference answer sits in the judge’s context, and the judge parrots it back as the verdict. You are measuring the reference, not the model.

Pattern. Randomize order per call on pairwise. For high-stakes comparisons, run each twice with positions swapped and treat order-dependent verdicts as ties. Strip the reference from the judge context for blind scoring. If you need the reference for the rubric, make the comparison explicit and ask for a delta, not a score conditioned on the reference being correct.
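The swapped-position double run can be sketched as follows. The judge argument is a hypothetical callable that takes the slot-A and slot-B responses and returns "A" or "B"; swap in whatever wraps your real judge call.

```python
def debiased_pairwise(judge, response_1, response_2):
    """Run the judge in both slot orders; order-dependent verdicts become ties.

    judge(slot_a, slot_b) is assumed to return "A" or "B" for the preferred slot.
    """
    first = judge(response_1, response_2)   # response_1 sits in slot A
    second = judge(response_2, response_1)  # response_1 sits in slot B

    winner_first = "response_1" if first == "A" else "response_2"
    winner_second = "response_2" if second == "A" else "response_1"

    # Only trust a verdict that survives the swap.
    return winner_first if winner_first == winner_second else "tie"
```

A judge that always picks slot A comes out as "tie" on every pair, which is exactly the failure you want surfaced instead of counted as a win.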

The same logic extends to self-preference: a model judging its own family scores it 10 to 25 percent higher. Use a different family than the candidate. If you run a multi-family fleet, rotate judge families and average.

4. Structured output

The output schema is what makes the judge programmatically usable. Anything you want to filter, threshold, or aggregate goes in the schema.

Anti-pattern. “Score 1 to 5 and explain your reasoning.” Free-text output, no schema, downstream parsing is a regex with edge cases. The judge that returns “I’ll give this a 4 out of 5” looks fine until production traffic surfaces “between a 3 and a 4” and the parser drops a row.

Pattern. Strict JSON, typed keys, dedicated reasoning field, schema validation as a gate. A non-parsing judge response is a failed eval, not a low score.

{
  "score": 0.0,
  "scale": "0-1",
  "reasoning": "Step-by-step rubric application",
  "rubric_breakdown": {
    "groundedness": 1,
    "completeness": 0,
    "citation_validity": 1
  },
  "confidence": "high|medium|low"
}

Add an explicit chain-of-thought directive (“Reason step by step through the rubric before producing the final score”) and a dedicated reasoning field. Ablations across frontier judges consistently show 10 to 25 percent reduction in score variance when chain-of-thought is enabled. The reasoning traces also become your audit trail when a score looks wrong.
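A sketch of the schema gate using only the standard library, assuming the JSON shape shown above. JudgeParseError and parse_verdict are illustrative names; in practice you might prefer a validation library, but the rule is the same: a response that fails the gate is a failed eval.

```python
import json

REQUIRED_DIMENSIONS = {"groundedness", "completeness", "citation_validity"}
CONFIDENCE_LEVELS = {"high", "medium", "low"}


class JudgeParseError(ValueError):
    """Raised when a judge response fails the schema gate."""


def parse_verdict(raw):
    """Parse and validate one judge response; raise instead of guessing."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise JudgeParseError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise JudgeParseError("top level must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise JudgeParseError("score must be a number")
    if not 0.0 <= score <= 1.0:
        raise JudgeParseError("score must be within [0, 1]")
    if data.get("scale") != "0-1":
        raise JudgeParseError("scale must be '0-1'")

    reasoning = data.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise JudgeParseError("reasoning must be a non-empty string")

    breakdown = data.get("rubric_breakdown")
    if not isinstance(breakdown, dict) or set(breakdown) != REQUIRED_DIMENSIONS:
        raise JudgeParseError("rubric_breakdown must contain exactly the rubric dimensions")
    if any(isinstance(v, bool) or v not in (0, 1) for v in breakdown.values()):
        raise JudgeParseError("rubric_breakdown values must be 0 or 1")

    confidence = data.get("confidence")
    if not isinstance(confidence, str) or confidence not in CONFIDENCE_LEVELS:
        raise JudgeParseError("confidence must be high, medium, or low")
    return data
```

Count JudgeParseError occurrences as their own metric rather than mapping them to a low score.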

5. Self-consistency check

A judge that returns a different score on the same input across runs is noise. A judge that returns the same score reliably might still be wrong, but at least it is measurably wrong.

Anti-pattern. Single-shot scoring with no consistency check, temperature 0.7, “the judge said 4 so it’s a 4.” You have no idea whether the score is reproducible or whether you got lucky.

Pattern. Run the same input through the judge twice (different seeds, swapped positions where applicable) and flag disagreements. On the cases where the two runs diverge, route to a human reviewer or to a second judge family. Temperature 0 for production judging unless you specifically need probability-weighted scoring.

For high-volume rubrics, sample 1 to 5 percent of calls for the self-consistency check rather than every call. The sampled disagreement rate is its own first-class metric: when it drifts up, the rubric is overdue for recalibration.
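A sketch of the two-run check and the sampled disagreement rate. The judge callable, its seed keyword, and the 0.1 tolerance are assumptions for illustration; pick a tolerance that matches your scale.

```python
def consistency_check(judge, item, tolerance=0.1, seeds=(0, 1)):
    """Score the same item once per seed and flag it if the scores diverge.

    judge(item, seed=...) is a hypothetical callable returning a float score.
    """
    scores = [judge(item, seed=seed) for seed in seeds]
    spread = max(scores) - min(scores)
    return {"scores": scores, "spread": spread, "needs_review": spread > tolerance}


def disagreement_rate(results):
    """Share of checked items whose runs disagreed; track this over time."""
    if not results:
        return 0.0
    return sum(r["needs_review"] for r in results) / len(results)
```

Items with needs_review set are the ones to route to a human reviewer or a second judge family.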

The calibration loop

The five elements get you to a prompt. Calibration is what turns the prompt into a measuring instrument. The discipline has four steps and you run them every time the rubric changes.

Write. Draft the first version of the prompt with all five elements explicit. Pin a judge model and a temperature. Treat the version tuple (prompt_template_hash, judge_model_id, few_shot_pool_version) as the eval contract.

Label. Build a 50 to 100 example golden set with human labels, sampled from real production traffic. Label each by hand using the same rubric the judge will use. If two raters disagree, that’s a rubric problem; sharpen the operational definition until inter-rater agreement clears 0.6 Cohen’s kappa before you blame the judge. The cases humans disagree on are exactly the borderline examples you need in the few-shot pool.

Measure. Run the judge against the golden set. Compute Cohen’s kappa between the judge and the human majority. A marketing-copy rubric tolerates kappa around 0.6. A medical-advice rubric needs 0.85 or higher. Score length-controlled subsets (pairs within plus or minus 20 percent token count) alongside the raw rubric; if the two winrates diverge, verbosity bias is doing the work. Rotate rubric phrasings on the calibration set; if scores move with phrasing alone, the rubric is leaking criteria language into the verdict.
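For reference, Cohen's kappa is (observed agreement minus chance agreement) divided by (1 minus chance agreement). A small pure-Python sketch for categorical labels follows; the function name is illustrative, and returning 1.0 when both raters use a single identical label is a convention chosen here, since kappa is undefined in that case.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa between two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal rate, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

With ten items, 80 percent raw agreement, and balanced labels on both sides, chance agreement is 0.5 and kappa comes out at 0.6, which is why raw agreement alone overstates how good a judge is.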

Tweak. Four moves compound. Sharpen the operational definition. Swap borderline few-shot examples (the obvious ones rarely change kappa; the borderline ones do). Switch scale shape (binary vs 3-point vs 0-1 continuous; 5-point Likert is usually the wrong default). Try a different judge family. Re-run after every change. If kappa drops, revert.

Most judges launch at kappa 0.3 to 0.4 and reach 0.6 to 0.75 after three or four sweep iterations. Hours of work, not days. For deeper coverage, see LLM-as-Judge Best Practices in 2026 and Why LLM-as-a-Judge (2026).

Production rollout: versioning and A/B between judge prompts

The judge prompt is production code. It needs the same discipline.

Version the contract, not the file. The eval is the tuple (prompt_template_hash, judge_model_id, few_shot_pool_version). Bump any field deliberately. Cache judge verdicts keyed on the tuple plus the input hash; invalidate on contract change, not on every PR. A judge model minor version bump is a contract change, even when the prompt file is byte-identical.
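A sketch of a verdict cache key built from the contract tuple plus the input hash. JudgeContract and cache_key are hypothetical names; the field names follow the tuple above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JudgeContract:
    """The eval contract: bump any field deliberately."""
    prompt_template_hash: str
    judge_model_id: str
    few_shot_pool_version: str


def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cache_key(contract, judge_input):
    """Stable key: changes when any contract field or the input changes."""
    payload = json.dumps(
        {"contract": asdict(contract), "input_hash": sha256_hex(judge_input)},
        sort_keys=True,  # keep the serialization order-independent
    )
    return sha256_hex(payload)
```

Because the judge model id is part of the key, a model version bump misses the cache even when the prompt file has not changed.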

A/B between judge prompts the way you A/B between models. Run prompt v1 and v2 in parallel on a shadow stream. Compare kappa against the golden set, compare score distributions, route disagreement cases to human review. Ship the winner; archive the loser with its kappa number so you don’t re-test the same losing variant six months later.

Pin the judge model version inside the contract. A rubric calibrated against gpt-4o-2024-08-06 produces different distributions on gpt-4o-2024-11-20. The mean shifts 3 to 8 points; the distribution narrows. If the judge is your only quality metric and the model rotates quarterly, you are measuring the judge change, not the model change. Treat judge rotation as a deliberate eval-suite migration with its own calibration sweep, not a config swap.

Run judges async on a sampled stream, not inline. A frontier judge call adds 500 ms to 2 s of latency. Sampling 5 to 20 percent of traffic and writing verdicts back to the trace gives signal without blowing the latency budget. For dimensions where you need every span scored, use a classifier in front of the judge; reserve the judge for the cases the classifier cannot decide.
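One way to pick the sampled stream is deterministic hash-based sampling, sketched below. It assumes each trace has a stable string id; should_judge is an illustrative name, not an SDK function.

```python
import hashlib


def should_judge(trace_id, sample_rate):
    """Decide whether a trace is in the judged sample.

    The same trace_id always gets the same answer, so retries and
    replays do not change which traces are scored.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 4 bytes to a bucket in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate
```

Hashing instead of calling a random number generator keeps the sample reproducible, which matters when you compare two judge prompts on the same shadow stream.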

CustomLLMJudge: the production-grade judge primitive

You almost never want to write a judge prompt from scratch. The ai-evaluation SDK (Apache 2.0) ships 70+ EvalTemplate classes with judge prompts already engineered for the five elements above. Use one as the base, override what you need.

from fi.evals import Evaluator
from fi.evals.templates import Groundedness, ContextAdherence, Completeness
from fi.testcases import TestCase

# Credentials are elided here; supply your own keys.
evaluator = Evaluator(fi_api_key="...", fi_secret_key="...")

# Score one test case against three built-in templates in a single call.
result = evaluator.evaluate(
    eval_templates=[Groundedness(), ContextAdherence(), Completeness()],
    inputs=[
        TestCase(
            input="Who founded the Bauhaus?",
            output="Walter Gropius founded the Bauhaus in Weimar in 1919.",
            context="The Bauhaus was founded by Walter Gropius in Weimar in 1919.",
        )
    ],
)

When you need a custom rubric, CustomLLMJudge is the surface. It accepts a Jinja2 grading_criteria template (the criterion + few-shot block), enforces structured DefaultJudgeOutput parsing (the schema), and runs against any LiteLLM-supported model with multi-modal input.

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "groundedness_with_citation",
        "model": "anthropic/claude-sonnet-4-5",
        "grading_criteria": """
You are scoring one response for groundedness against the provided context.

Rubric (continuous 0-1):
- Groundedness: every factual claim is supported by the provided context.
- Citation validity: every cited span exists verbatim in the context.

Scale anchors: 1.0 fully grounded; 0.4 a load-bearing claim ungrounded; 0.0 contradicts the context.

Examples:
- Ctx "X founded 2019." Resp "X was founded in 2019." -> 1.0
- Ctx "X founded 2019." Resp "X, founded 2019, has 500 employees." -> 0.4
- Ctx "X founded 2019." Resp "X was founded in 2021." -> 0.0

Length and tone are not quality signals. Reason step by step before scoring.

Context: {{ context }}
Response: {{ output }}
""",
    },
)

result = judge.compute_one(CustomInput(context="...", output="..."))
# result["output"] -> float in [0.0, 1.0]

Every one of the five elements lands on the same surface: criterion in grading_criteria, calibrated examples in few_shot_examples, position randomization handled at the input layer for pairwise variants, structured output enforced by DefaultJudgeOutput, and self-consistency baked into the multi-run eval API. Three properties matter for prompt engineering:

70+ EvalTemplate starting points. Groundedness, ContextAdherence, FactualAccuracy, Toxicity, PromptInjection, TaskCompletion, LLMFunctionCalling, SummaryQuality, EvaluateFunctionCalling, and 60+ others ship as calibrated prompts. Override the rubric, keep the scaffolding.
The augment=True cascade. A local classifier runs first; only ambiguous cases hit the LLM judge. Your prompt only needs to nail the fuzzy middle. 90 percent cost saved with no measurable drop in detection rate on most rubrics.
Multi-modal scoring. CustomLLMJudge accepts image_url and audio_url keys inline; LiteLLM forwards to vision and audio-capable models. The same prompt anatomy works for vision and voice rubrics.

For server-side scoring at zero inline latency, attach the same rubric to a production span via traceAI’s EvalTag. The collector runs the eval server-side and writes results back as gen_ai.evaluation.* attributes. Same rubric in pytest as a CI gate and on live spans in production; that parity closes most of the trace-eval drift covered in the trace-eval gap post.

When LLM-as-judge is the wrong tool

A judge call costs cents and adds 500 ms to 2 s of latency. If the rubric can be scored without one, score it without one.

Deterministic checks suffice. Regex, JSON schema validation, exact match, tool-call argument shape, citation parsing, token budget conformance. Free, instant, never drift.
Classifier-backed evals are 10x cheaper and 5x faster. For high-volume rubrics that fit a classifier (toxicity, prompt injection, language detection, safety categories), the SDK ships 8 sub-10 ms Scanners and 13 guardrail backends including open-weight LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B.
Structural validation is possible. Tool-use traces, function-calling arguments, structured outputs. Score these against the schema, not a judge.

The judge is for the fuzzy middle: groundedness, helpfulness, role adherence, refusal calibration, brand voice. Deterministic first, classifier second, LLM judge only on the cases the cheaper layers cannot decide. The augment=True cascade in Evaluator automates exactly this routing.
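The routing order can be sketched generically. This is an illustration of the layering, not how augment=True is implemented; the callables, the pass-probability convention, and the 0.2 and 0.8 thresholds are all assumptions.

```python
def route(item, deterministic_checks, classifier, judge, low=0.2, high=0.8):
    """Deterministic first, classifier second, LLM judge only on the undecided middle.

    deterministic_checks: callables returning True when the item passes.
    classifier: callable returning a pass probability in [0, 1].
    judge: callable returning a verdict; only called when the classifier is unsure.
    """
    for check in deterministic_checks:
        if not check(item):
            return {"verdict": "fail", "layer": "deterministic"}

    p_pass = classifier(item)
    if p_pass <= low:
        return {"verdict": "fail", "layer": "classifier"}
    if p_pass >= high:
        return {"verdict": "pass", "layer": "classifier"}

    # The fuzzy middle: the cheaper layers could not decide.
    return {"verdict": judge(item), "layer": "judge"}
```

Logging the layer alongside the verdict shows what share of traffic actually reaches the judge, which is the number that drives cost.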

Optimizing the judge prompt itself

The judge prompt is a prompt, and prompts can be optimized. The agent-opt package (Apache 2.0) ships six optimizers that treat the judge prompt as the optimization target: RandomSearchOptimizer, BayesianSearchOptimizer (Optuna-backed, teacher-inferred few-shot, resumable studies), MetaPromptOptimizer, ProTeGi, GEPAOptimizer, and PromptWizardOptimizer. A shared EarlyStoppingConfig caps budget across all six.

What ships today: eval-driven optimization. You give the optimizer a labeled golden set and a starting judge prompt; it sweeps variants and returns the one that maximizes kappa. The active roadmap item is the trace-stream ingestion connector (traceAI to dataset) that auto-promotes production failures into the optimizer’s dataset. For deeper coverage, see Automated Prompt Improvement (2026).

How FAGI ships judge prompts as a package

A judge call by itself is a number. A judge integrated into an eval stack that calibrates, cascades, clusters, and refines is what compounds. Start with the SDK for code-defined judges. Graduate to the Platform when you need self-improving rubrics, in-product authoring, and classifier-backed cost economics at scale.

The ai-evaluation SDK is the code-first surface: CustomLLMJudge for the five-element judge prompt, 70+ EvalTemplate rubrics as calibrated starting points, 13 guardrail backends as the classifier triage layer, 8 sub-10 ms Scanners as the deterministic floor, and four distributed runners (Celery, Ray, Temporal, Kubernetes). traceAI carries the same rubric as a span-attached EvalTag across 50+ AI surfaces in Python, TypeScript, Java, and C#. The Agent Command Center handles judge routing across 20+ providers (SOC 2 Type II, HIPAA, GDPR, CCPA certified, ISO/IEC 27001 in active audit) with shadow, mirror, and race modes so canary judge swaps are A/B tests, not deploy events.

The Future AGI Platform layers what the SDK alone cannot do. Self-improving evaluators retune from thumbs up / down feedback so the rubric ages with the product. An in-product authoring agent writes judge prompts from natural-language descriptions. Classifier-backed scoring runs at lower per-eval cost than Galileo Luna-2, which makes daily full-traffic judging financially viable instead of a quarterly batch. Error Feed closes the loop: HDBSCAN soft-clusters failing-judge traces, a Claude Sonnet 4.5 Judge agent writes an immediate_fix, and the fixes feed the self-improving evaluators. agent-opt consumes the same scores so prompt search runs against the rubric the CI gate uses.

Ready to wire a production-grade judge against your own workload? Start with the ai-evaluation SDK quickstart, drop a CustomLLMJudge with the five elements explicit against your dataset in pytest this afternoon, then attach the same rubric as an EvalTag on live spans via traceAI.

Three takeaways for 2026

Judge prompts are eval-time programs. Tight criterion, calibrated examples, position randomization, structured output, self-consistency. Skip any one and you are scoring noise.
Calibration is the discipline that decides if your judge is signal. Write, label, measure kappa, tweak. Most prompts double agreement in three or four sweep iterations.
The stack is the moat, not the prompt. A judge call is a number. A judge integrated with calibration, cascading, clustering, and self-improving rubrics is what compounds.

Frequently asked questions

What are the five elements every production judge prompt needs?

A tight criterion stated operationally (not 'rate helpfulness' but 'every factual claim is supported by the provided context'). Calibrated few-shot examples spanning the scale with the borderline cases prioritized. Position-randomized inputs so the judge cannot anchor on slot order or reference leakage. A structured output schema with typed JSON keys and a chain-of-thought reasoning field. A self-consistency check that runs the same input through the judge twice (different seeds, swapped positions) and flags disagreements. Skip any one and the judge scores noise.

What does the calibration loop actually look like?

Write the prompt, label 50 to 100 production examples by hand using the same rubric, measure Cohen's kappa between judge and humans, then tweak. Most judges launch at kappa around 0.3 to 0.4 and reach 0.6 to 0.75 after three or four sweep iterations. The four moves that compound: sharpen the operational definition, swap borderline few-shot examples, switch scale shape (binary vs 3-point vs 0-1 continuous), and try a different judge family. Re-test after every prompt edit. Without this loop, every change is a coin flip about whether agreement went up or down.

How do you roll judge prompts into production safely?

Treat the judge prompt as code. Version the rubric (prompt_template_hash + judge_model_id + few_shot_pool_version) as a single contract. A/B between judge prompts the way you A/B between models: run prompt v1 and v2 in parallel on a shadow stream, compare kappa against the golden set, ship the winner. Pin the judge model version inside the contract so a vendor minor bump doesn't silently move scores. Cache verdicts by contract hash and invalidate on contract change, not on every PR.

How does FAGI's CustomLLMJudge implement these five elements?

CustomLLMJudge ships the Jinja2 grading_criteria template (criterion + few-shot block), structured DefaultJudgeOutput parsing (the output schema), and multi-modal input via LiteLLM. The same class powers 70+ EvalTemplate rubrics (Groundedness, ContextAdherence, FactualAccuracy, TaskCompletion, SummaryQuality, EvaluateFunctionCalling) so you start from a calibrated prompt and override fields, not from a blank file. The augment=True cascade routes only ambiguous cases to the LLM judge so the prompt only has to nail the hard middle, and traceAI's EvalTag attaches the same rubric to production spans for server-side scoring at zero inline latency.

Which biases hit LLM judges hardest?

Position bias (the first option in pairwise wins by 10 to 15 points), verbosity bias (longer answers win even when length adds nothing), self-preference bias (a model rewards its own family by 10 to 25 percent per Zheng et al. 2023), calibration drift across judge model versions (rubric scores shift on a minor bump), and prior leakage (the ground-truth reference sits in the judge's context and the judge parrots it). Counters: position randomization across two runs, explicit no-length-no-tone rubric language, different family for judge and candidate, pinned judge model version, and stripping the reference for blind scoring.

Can you optimize a judge prompt automatically?

Yes. The agent-opt package (Apache 2.0) ships six optimizers (RandomSearchOptimizer, BayesianSearchOptimizer with Optuna-backed teacher-inferred few-shot, MetaPromptOptimizer, ProTeGi, GEPAOptimizer, PromptWizardOptimizer) that treat the judge prompt as the optimization target. You give them a labeled golden set and a starting prompt; they sweep variants and return the one that maximizes Cohen's kappa against humans. Eval-driven optimization ships today. The trace-stream ingestion connector that auto-promotes production failures into the optimizer's dataset is the active roadmap item.
