
Custom LLM Eval Metrics (2026): The Three-Part Contract That Works

Custom LLM eval metrics in 2026: a tight criterion, a calibrated corpus, a stability check. Patterns, code, and pitfalls, all in one guide.


A medical-question agent passes the team’s faithfulness and refusal gates all quarter. Within a week of launch, a regulatory review flags a batch of outputs with unverified drug-interaction claims. The faithfulness rubric scored whether the output stayed within retrieved context; the retrieved context contained the claims. The metric was right. The metric was the wrong question.

This is the case for custom metrics in 2026. Stock evaluators (faithfulness, BLEU, ROUGE, toxicity, schema validation) cover the common surface. Production workloads break on the domain underneath: claim modality matching a cited row, brand voice on a regulated tier, citation linking across multi-step reasoning. No stock metric catches those.

The opinion this post earns: a custom eval metric is a contract between the rubric and the judge, and the contract has three parts. A tight criterion (one rubric measures one observable behavior). A calibration corpus (50 to 200 human-labeled examples with inter-annotator agreement above 0.6). A stability check (cross-family judge rotation that bounds variance). Without all three, you are scoring noise dressed as a number. This guide walks the three parts, the deployment pattern, and the five mistakes that kill custom metrics before they ship.

TL;DR: the three-part contract

| Part | What it produces | What goes wrong if you skip it |
| --- | --- | --- |
| Tight criterion | One observable behavior per rubric, anchored scale, structured output | The judge invents the criterion on every call; kappa never breaks 0.5 |
| Calibration corpus | 50-200 human-labeled examples, IAA > 0.6, 20% hold-out, weighted Cohen’s kappa | No evidence the score agrees with humans; you ship a vibes detector |
| Stability check | Cross-family judge rotation, version-pinned contract, quarterly recalibration | The score moves when the judge bumps a minor version; you measure the judge, not the agent |

Skip a part and the metric is not a metric. It is a number that drifts faster than the workload it scores. The three parts compound: a good criterion lifts kappa, a calibrated corpus catches drift, a stability check survives the judge swap that happens every quarter.

When stock metrics aren’t enough

Stock evaluators from Future AGI (FAGI), DeepEval, Ragas, and Phoenix cover the common axes: faithfulness, answer relevance, context precision/recall, BLEU, ROUGE, BERTScore, toxicity, schema validation, tool-call accuracy. Wide surface. Three classes they miss:

Domain-specific correctness. Medical claim verification, legal citation accuracy, financial-rule compliance. The stock library does not encode your domain.

Multi-step structural consistency. Whether the reasoning chain reuses numeric values correctly, whether citations resolve across steps, whether tool arguments match the plan.

Business-defined quality. Brand voice, persona consistency, regulatory tone, customer-tier-appropriate response. Stock metrics do not encode your brand.

The trigger to build a custom metric is the gap between a green dashboard and a broken workload. If stock scores are high and engineering still owns the on-call page, the metric is not broken. It is measuring the wrong thing.

Part 1: write a tight criterion

One rubric measures one observable behavior. This is the single biggest lift in custom-metric design.

A vague rubric forces the judge to interpolate criteria from training data on every call. The score comes back, but the score means something slightly different each time. A specific rubric collapses the ambiguity into defined behaviors and the judge starts pattern-matching anchors.

Bad criterion:

Score the response on helpfulness from 1 to 5.

Good criterion:

Score the response on whether it directly answers the user's question
in the first sentence, without restating the question, without hedging
words ("might", "perhaps"), and without asking a clarifying question
when intent is unambiguous.

One behavior (a direct answer in the first sentence) pinned down by three explicit constraints: no restatement, no hedging, no unnecessary clarifying question. The “1 to 5 helpfulness” rubric is the most common cause of low judge-versus-human kappa in calibration sweeps. Specific, behavior-anchored phrasing is the cheapest single change that lifts agreement.

Anchor every scale point. Free-form “rate 1 to 5” leaks structure. Anchored scales lock each level to a concrete behavior:

5 = answers correctly in the first sentence, no digression, no hedging
4 = answers correctly in the first paragraph, one minor digression
3 = answers correctly but buries the answer after restatement or hedging
2 = addresses the question but contains a factual error
1 = does not address the question

Pick the smallest scale that captures the decision. Binary for compliance (PII present, citation missing, schema invalid); 5-point for semantic quality; continuous only when the judge has a structural rule to pick the value (fraction of claims with valid citations). Seven-point scales typically over-resolve LLM judges; the judge cannot reliably tell 4 from 5 or 5 from 6, and kappa drops.

Output a structured form, not free text. Free-text grades produce higher variance and harder parsing. The G-Eval form-filling shape (Liu et al. 2023) is the production default:

Output JSON only:
{
  "criterion": "<short restatement of what is being scored>",
  "reasoning": "<2-3 sentences quoting the response>",
  "score": <integer 1-5>
}

The reasoning field doubles as audit data. When the metric drifts, the reasoning tells you whether the rubric or the judge model is the cause.
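A minimal sketch of validating that form as it comes back, assuming the judge returns the JSON shape shown above. `parse_judge_output` is a hypothetical helper written for this post, not part of any SDK. It rejects malformed verdicts instead of coercing them, so parse failures stay visible as their own failure class.

```python
import json

REQUIRED_FIELDS = ("criterion", "reasoning", "score")

def parse_judge_output(raw: str, min_score: int = 1, max_score: int = 5) -> dict:
    """Parse and validate one judge verdict; raise ValueError on any deviation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    score = data["score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not min_score <= score <= max_score:
        raise ValueError(f"score {score} outside {min_score}-{max_score}")
    return data
```

Counting how often this raises per judge model is itself a cheap stability signal.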

Bake the bias guards into the rubric. Length-neutral phrasing kills verbosity bias at zero runtime cost:

Do not reward longer responses. A correct one-sentence answer
scores the same as a correct multi-paragraph answer.

A line like that does most of the work on verbosity-induced score inflation. The deeper failure modes (position, self-preference, calibration drift) need the calibration corpus and the stability check to catch.

The criterion is the prompt. Versioned, hashed, source-controlled. Treat rubric edits as schema migrations.

Part 2: build the calibration corpus

A rubric without calibration is a subjective scorer with a JSON schema. The calibration corpus is what turns a prompt into a metric.

Size: 50 to 200 examples per rubric. Below 50 and the kappa estimate has too much variance to trust. Above 200 and the labels stop moving the kappa needle. Go toward the top of the range for high-stakes rubrics (compliance, medical, legal) where the false-pass cost is large.

Stratify across failure modes and cohorts. A random sample is dominated by the same traffic your dashboards already cover. Bucket by failure mode, then balance cohorts inside each bucket:

import random
from collections import defaultdict

def stratified_corpus(traces, n_per_stratum=40):
    """Build a balanced calibration corpus across failure-mode strata."""
    by_mode = defaultdict(list)
    for t in traces:
        by_mode[t["failure_mode"]].append(t)

    selected = []
    for mode, items in by_mode.items():
        by_cohort = defaultdict(list)
        for it in items:
            by_cohort[it["cohort"]].append(it)
        per_cohort = max(1, n_per_stratum // max(1, len(by_cohort)))
        for cohort_items in by_cohort.values():
            random.shuffle(cohort_items)
            selected.extend(cohort_items[:per_cohort])
    return selected

Label with at least two humans. Compute inter-annotator agreement (IAA) first. If IAA sits below 0.6, the rubric is ambiguous, not the labelers. Rewrite and re-label before going near the judge. The judge cannot exceed the kappa ceiling the humans set; chasing a higher number means overfitting to one labeler’s bias.
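For an ordinal rubric, one reasonable way to compute IAA is quadratic-weighted Cohen's kappa between the two labelers. The sketch below is a dependency-free version for illustration; the function name and the default 1-to-5 range are assumptions, and for two categories the statistic reduces to plain unweighted kappa.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score=1, max_score=5):
    """Quadratic-weighted Cohen's kappa for two lists of integer labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    if max_score <= min_score:
        raise ValueError("scale needs at least two points")
    k = max_score - min_score + 1
    n = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        if not (min_score <= a <= max_score and min_score <= b <= max_score):
            raise ValueError(f"label outside {min_score}-{max_score}: {a}, {b}")
        observed[a - min_score][b - min_score] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # far misses cost more
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        # Both raters used one identical category throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A quick check: `quadratic_weighted_kappa(labels_a, labels_b) < 0.6` is the signal to rewrite the rubric before any judge sees the corpus.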

Resolve disagreements. Drop items where humans disagree by more than one scale point. For one-point disagreements, bring in a third labeler and take the majority vote.
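The triage rule can be sketched as a small routing function. The item keys (`label_a`, `label_b`, `human_score`) are hypothetical names chosen for this example; adapt them to however your labeling tool exports data.

```python
def triage_labels(items):
    """Route doubly-labeled items by the size of the human disagreement.

    Each item is a dict with integer fields "label_a" and "label_b".
    Returns (agreed, needs_third_labeler, dropped).
    """
    agreed, needs_third, dropped = [], [], []
    for item in items:
        gap = abs(item["label_a"] - item["label_b"])
        if gap == 0:
            # Exact agreement: the shared label becomes the gold score.
            agreed.append({**item, "human_score": item["label_a"]})
        elif gap == 1:
            # One point apart: escalate for a tie-break.
            needs_third.append(item)
        else:
            # More than one scale point apart: too ambiguous to keep.
            dropped.append(item)
    return agreed, needs_third, dropped
```

A large `dropped` bucket is worth reading by hand: it usually points at the anchor that needs rewording.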

Compute weighted Cohen’s kappa. Once the human labels are clean, score the judge on the same set:

from sklearn.metrics import cohen_kappa_score
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "brand_voice",
        "model": "gpt-5",
        "grading_criteria": (
            "Score 5 if the response uses second-person ('you') and "
            "contains no hedging words. Score 1 at the opposite end."
        ),
    },
)

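# `corpus` is the calibration set built above: each item carries the judge
# inputs and the resolved human label.
judge_scores = [
    judge.compute_one(CustomInput(**item["inputs"]))["output"]
    for item in corpus
]
human_scores = [item["human_score"] for item in corpus]
# Both lists must be on the same discrete scale before computing kappa; if the
# judge returns a normalized score, map it back to the rubric's scale points first.
kappa = cohen_kappa_score(human_scores, judge_scores, weights="quadratic")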

Use weighted (quadratic) kappa for ordinal scales; a 5-vs-1 disagreement counts more than 5-vs-4. Use unweighted kappa for binary.

The Landis & Koch (1977) bands (0.41 to 0.60 moderate, 0.61 to 0.80 substantial, 0.81 to 1.00 almost perfect) map kappa to operational policy. As a starting team rule: kappa above 0.6 for CI gates, above 0.8 for unattended automation. Below 0.6, the false-fail rate turns the gate into noise.

Hold out 20 percent. Never include calibration examples in the rubric’s few-shot block. Never test against the same set you tuned on. The hold-out catches the overfit case where the rubric and the corpus co-evolved until the rubric only fires on the labeled set.
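One way to carve out the hold-out without losing the stratification, sketched under the assumption that each corpus item carries a `failure_mode` key as in the sampling code earlier. The function name and the fixed seed are illustrative choices; the seed matters because the split must stay identical across rubric revisions.

```python
import random
from collections import defaultdict

def split_holdout(corpus, holdout_frac=0.2, seed=0, key="failure_mode"):
    """Split a labeled corpus into (tune, holdout), stratified by `key`."""
    rng = random.Random(seed)  # local RNG: reproducible, no global state
    by_stratum = defaultdict(list)
    for item in corpus:
        by_stratum[item[key]].append(item)

    tune, holdout = [], []
    for items in by_stratum.values():
        shuffled = items[:]  # copy so the caller's lists are not reordered
        rng.shuffle(shuffled)
        n_hold = round(len(shuffled) * holdout_frac)
        if len(shuffled) > 1:
            # Keep every stratum represented in the hold-out.
            n_hold = max(1, n_hold)
        holdout.extend(shuffled[:n_hold])
        tune.extend(shuffled[n_hold:])
    return tune, holdout
```

Tune the rubric and pick few-shot examples only from `tune`; report the final kappa only on `holdout`.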

Part 3: run the stability check

Calibration is a number at a point in time. Stability is what happens to it the day the judge model bumps a minor version.

The judge is a prompt, not a function. A rubric calibrated against gpt-4o-2024-08-06 produces different distributions on gpt-4o-2024-11-20. Mean shifts 3 to 8 points; distribution narrows. Swap GPT-5 for Sonnet 4.5 without recalibrating and the dashboard moves even though the agent didn’t. The judge changed, not the system under test.

Three checks bound the variance.

Cross-family judge rotation on the calibration corpus. Run the same rubric through GPT, Claude, and Gemini judges on the same calibration set. Compare distributions. If kappa-against-humans is 0.8 on GPT and 0.55 on Sonnet, the rubric is leaking criteria phrasing into one judge’s prior. Tighten the criterion or expand the calibration corpus until the rubric travels across families. A rubric that only calibrates on one judge is judge-model-locked, which means you cannot ever rotate without rebuilding the metric.
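The rotation result can be reduced to a small report once you have kappa-against-humans per judge. This is a sketch: the 0.6 floor comes from the CI-gate rule above, while `max_spread=0.15` is an assumed tolerance you should set from your own calibration history.

```python
def rotation_report(kappa_by_judge, floor=0.6, max_spread=0.15):
    """Flag a rubric that calibrates on some judge families but not others.

    kappa_by_judge maps a judge name to its kappa against human labels,
    all measured on the same calibration corpus.
    """
    if not kappa_by_judge:
        raise ValueError("need at least one judge")
    values = list(kappa_by_judge.values())
    below_floor = sorted(name for name, k in kappa_by_judge.items() if k < floor)
    spread = max(values) - min(values)
    return {
        "below_floor": below_floor,
        "spread": spread,
        # Portable means every family clears the floor and they roughly agree.
        "portable": not below_floor and spread <= max_spread,
    }
```

With the example from the text, `rotation_report({"gpt": 0.8, "sonnet": 0.55})` reports `sonnet` below the floor and marks the rubric as not portable.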

Pin the contract. The eval is the tuple (judge_model_id, rubric_version, prompt_template_hash). Bump any field deliberately, never as a side effect of a vendor swap. Cache verdicts keyed on the tuple plus the evaluated input; invalidate on contract change, not on every PR. Store the rubric in source control; block merges to the rubric file unless the PR includes a calibration run with kappa above baseline.
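A sketch of the contract as a frozen dataclass. The class and method names are hypothetical; the three fields are the tuple from the text. The cache key also folds in the evaluated input, since a verdict is only reusable for the same input under the same contract.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class EvalContract:
    judge_model_id: str
    rubric_version: str
    prompt_template_hash: str

    @classmethod
    def from_template(cls, judge_model_id, rubric_version, prompt_template):
        """Hash the rendered rubric template so any wording edit changes the contract."""
        digest = hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()
        return cls(judge_model_id, rubric_version, digest)

    def cache_key(self, eval_input: dict) -> str:
        """Stable key for one verdict; eval_input must be JSON-serializable."""
        payload = json.dumps(
            {"contract": asdict(self), "input": eval_input},
            sort_keys=True,  # key order must not change the hash
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the dataclass is frozen, a judge swap has to construct a new contract, which makes the change show up in code review.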

Three-judge ensemble for launch decisions. Sonnet 4.5, GPT-5, Gemini 2.5 Pro is a defensible default as of May 2026. Family-specific biases cancel. Ensemble costs roughly 3x a single judge; reserve it for the gate, not the weekly trend. Single judge for the dashboard; ensemble for the ship decision.
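The text does not fix an aggregation rule for the ensemble, so the sketch below assumes one: take the median score, which for an odd number of judges is equivalent to a majority vote on pass or fail. The function name and the default threshold of 4 on a 5-point scale are illustrative.

```python
from statistics import median

def ensemble_verdict(scores_by_judge: dict, pass_threshold: float = 4) -> dict:
    """Combine per-judge scores for one item into a single launch verdict."""
    if len(scores_by_judge) < 3 or len(scores_by_judge) % 2 == 0:
        raise ValueError("use an odd number of judges, three or more")
    scores = list(scores_by_judge.values())
    med = median(scores)
    return {
        "median": med,
        "passed": med >= pass_threshold,
        # A wide spread means the families disagree; worth a human look.
        "spread": max(scores) - min(scores),
    }
```

Logging `spread` alongside the verdict gives an early hint of where a rubric fails to travel across families.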

Recalibrate on triggers. A one-week SLA on every one of:

  1. Judge model update (provider rolls a snapshot, releases a minor version).
  2. Rubric edit (any wording change to a criterion or anchor).
  3. Distribution shift (new feature, persona, cohort, prompt revision in the production model).
  4. Rolling-window kappa drops meaningfully below baseline for two consecutive windows.
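Trigger 4 is the only one that needs arithmetic. A minimal sketch, assuming you recompute kappa on a fresh human-labeled sample each window; `tolerance` is a stand-in for "meaningfully below" and the 0.05 default is an assumption, not a recommendation from the text.

```python
def kappa_drift_trigger(window_kappas, baseline, tolerance=0.05, consecutive=2):
    """Return True when the most recent windows all sit below baseline - tolerance.

    window_kappas is ordered oldest to newest.
    """
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(window_kappas) < consecutive:
        return False  # not enough history to decide
    recent = window_kappas[-consecutive:]
    return all(k < baseline - tolerance for k in recent)
```

Requiring consecutive windows keeps a single noisy sample from starting the recalibration clock.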

The cost of recalibration is small. The cost of a metric that silently drifted to fair-agreement kappa for a quarter is a release that shipped on green dashboards and broke production.

The deployment pattern

The rubric runs in two places: pytest as a CI gate, and as a span-attached evaluator on live traffic. The same CustomLLMJudge instance powers both.

from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

medical_claim_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "medical_claim_modality",
        "model": "gpt-5",
        "grading_criteria": (
            "For each numeric drug-interaction claim, extract the stated "
            "modality (always, sometimes, contraindicated) and the cited "
            "row's modality. Score 1 only if every modality matches; "
            "score 0 otherwise."
        ),
        "few_shot_examples": [
            {"inputs": {"response": "...", "context": "..."},
             "output": '{"score": 1.0, "reason": "all modalities match"}'},
        ],
    },
)

result = medical_claim_judge.compute_one(CustomInput(
    response="...",
    context="...",
))
# result["output"] -> float in [0.0, 1.0]
# result["reason"] -> JSON-stringified judge output

DefaultJudgeOutput enforces the form-filling schema (score: float ∈ [0, 1], reason: str). The Jinja template carries the rubric and few-shot calibration block. Multi-modal: pass image_url or audio_url and LiteLLM forwards the media to vision and audio-capable judges.

For zero inline latency on live traffic, attach the same rubric as a span-level EvalTag via traceAI. The tag serializes into the OTel resource; the collector runs the eval server-side and writes results back as gen_ai.evaluation.* attributes. Same rubric in CI and on the trace; that parity closes most of the offline-vs-production drift that wrecks custom-metric programs.

Cascade deterministic in front of the judge. A regex catches a missing citation for free. A JSON schema catches a malformed response in microseconds. Pay the judge only on cases the cheap checks couldn’t decide:

def medical_composite(inputs):
    if not all_claims_have_citations(inputs["response"]):
        return {"score": 0.0, "reason": "missing citation"}
    if not all_citations_resolve(inputs["response"]):
        return {"score": 0.0, "reason": "broken citation"}
    return medical_claim_judge.compute_one(CustomInput(**inputs))

Deterministic checks are 10,000x cheaper than a frontier judge and never drift. The judge bill drops 80 to 90 percent on most workloads.
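What the deterministic floor looks like depends on your citation format. The sketch below assumes bracketed numeric markers such as `[3]` and treats any sentence containing a number as a claim that needs one; both conventions, and both function names, are assumptions made for illustration rather than the helpers used in the composite above.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")
# Split after sentence-ending punctuation followed by whitespace,
# so decimals like "2.5" stay inside their sentence.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

def claims_have_citations(response: str) -> bool:
    """Every sentence that states a number must carry a [n] marker."""
    for sentence in SENTENCE_SPLIT.split(response.strip()):
        # Ignore digits that belong to the citation markers themselves.
        without_markers = CITATION.sub("", sentence)
        if re.search(r"\d", without_markers) and not CITATION.search(sentence):
            return False
    return True

def citations_resolve(response: str, known_ids: set) -> bool:
    """Every [n] marker must point at a retrieved source id."""
    return all(int(marker) in known_ids for marker in CITATION.findall(response))
```

Both checks run in microseconds and return the same answer every time, which is what makes them a safe first stage in front of the judge.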

Common mistakes

Five recurring failures kill custom metrics before they ship.

Vague criterion. One rubric carrying three observable behaviors silently. The judge guesses which one to score on every call; kappa never breaks 0.5. Fix: rewrite the rubric until each scale point is a concrete behavior. If you cannot anchor it, you are measuring more than one thing.

No calibration corpus. The rubric scores 4.2 on average and the team trusts the number. There is no evidence the score agrees with humans on this domain. Fix: 50 to 200 hand-labeled examples, IAA above 0.6, weighted kappa against the judge. Below kappa 0.6, the gate is noise.

Single-judge lock-in. The rubric was calibrated on GPT-5 and nobody checked Sonnet. The provider rolls a snapshot and the score moves, though the agent didn’t. Fix: cross-family judge rotation on the calibration corpus; pin the contract; recalibrate on every judge swap.

Reaching for a judge when a regex works. A $0.04-per-call frontier judge running on a binary “does this contain a citation” decision a regex returns in microseconds. Fix: deterministic floor in front of every judge. The skill is reaching for the cheapest tool that gives the right answer.

One score across multi-dim quality. A response can be on-brand and factually wrong; a patch can pass tests and import a deprecated library. The scalar averages over the failure. Fix: per-dimension scoring. If two dimensions can plausibly move in opposite directions across a release, score them separately. Gate on the minimum across dimensions when every dim is a hard requirement; gate on a weighted average only when the business defends the weights.
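The two gating rules differ in exactly the case that matters. A small sketch with hypothetical dimension names and scores normalized to [0, 1]:

```python
def gate_on_minimum(scores: dict, threshold: float) -> bool:
    """Pass only if every dimension clears the threshold."""
    return min(scores.values()) >= threshold

def gate_on_weighted_average(scores: dict, weights: dict, threshold: float) -> bool:
    """Pass if the weighted mean clears the threshold; weights are a business call."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    average = sum(scores[dim] * weights[dim] for dim in scores) / total
    return average >= threshold

# An on-brand but factually wrong response (illustrative numbers):
scores = {"brand_voice": 0.9, "factual_accuracy": 0.4}
equal_weights = {"brand_voice": 1.0, "factual_accuracy": 1.0}
```

With a threshold of 0.6, `gate_on_weighted_average(scores, equal_weights, 0.6)` passes this response while `gate_on_minimum(scores, 0.6)` blocks it, which is the masking effect described above.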

How Future AGI ships custom metrics as a package

A custom rubric is a contract. A custom rubric integrated into an eval stack that calibrates, cascades, clusters failing traces, and refines is what compounds. Start with the SDK for code-defined rubrics. Graduate to the Platform when you need self-improving evaluators, in-product authoring, and classifier-backed cost economics at scale.

The ai-evaluation SDK (Apache 2.0) is the code-first surface. CustomLLMJudge exposes the G-Eval primitive: Jinja2 template, structured DefaultJudgeOutput, few-shot calibration, multi-modal input. The same class powers 70+ EvalTemplate rubrics across faithfulness, agent quality, function calling, summarization, and multi-modal output. 13 guardrail backends (9 open-weight: LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B) plus 8 sub-10ms local Scanners (jailbreak, code injection, secrets, malicious URL, invisible chars, language, topic restriction, regex) supply the deterministic floor for the cost cascade. Four distributed runners (Celery, Ray, Temporal, Kubernetes) carry rubric execution into whatever orchestrator the team already runs.

traceAI (Apache 2.0) carries the same rubric as a span-attached EvalTag on live traffic across 50+ AI surfaces in Python, TypeScript, Java, and C#. Pluggable semantic conventions at register() time. Server-side scoring at zero inline latency.

The Future AGI Platform layers what the SDK alone cannot do. Self-improving rubrics retune from thumbs up/down feedback so the rubric ages with the product instead of against it. An in-product authoring agent writes custom rubrics from natural-language descriptions and proposes calibration corpora to label. Classifier-backed scoring runs at lower per-eval cost than Galileo Luna-2, which is what makes daily full-traffic scoring financially viable instead of a quarterly batch. The Agent Command Center handles judge routing across 100+ providers (SOC 2 Type II, HIPAA, GDPR, and CCPA certified, ISO/IEC 27001 in active audit) so cross-family judge rotation is a config change, not a deploy. Error Feed sits inside the eval stack: HDBSCAN soft-clusters failing-rubric traces, a Sonnet 4.5 Judge writes the RCA with an immediate_fix, fixes feed the self-improving evaluators. agent-opt consumes the custom rubric across six optimizers so prompt search runs against the same metric the CI gate uses.

Ready to wire a production-grade custom metric against your own workload? Start with the ai-evaluation SDK quickstart, drop a CustomLLMJudge against your dataset in pytest this afternoon, then attach the same rubric as an EvalTag on live spans via traceAI. The same rubric in both places is the diff that turns a custom rubric from a notebook experiment into a metric that holds for two years.

Three takeaways for 2026

  1. A custom metric is a contract. Tight criterion, calibration corpus, stability check. Without all three you are scoring noise.
  2. One rubric, one behavior. The cheapest lift in custom-metric design is naming exactly one observable thing per rubric and anchoring each scale point.
  3. The stack is the moat, not the prompt. A rubric by itself is a JSON blob. A rubric integrated with calibration, cascading, clustering, and self-improving evaluators is what survives the judge swap that happens every quarter.

Frequently asked questions

What is a custom LLM eval metric and when do I need one?
A custom eval metric is a rubric you write yourself because no stock evaluator measures the failure mode that actually breaks your workload. You need one when stock metrics (faithfulness, BLEU, ROUGE, toxicity) return green and the agent still ships the wrong refund, the wrong dosage, the wrong tone. Stock evaluators cover the common surface; custom metrics cover the domain. The three triggers: domain-specific correctness (medical-claim verification, legal-citation accuracy), structural consistency stock parsers miss (citations resolving across steps), and business rules (brand voice, persona consistency, regulatory tone). If your engineers can't name a stock metric that catches the failure, you need a custom one.
What is the three-part contract for a custom metric?
A custom metric is a contract between the rubric and the judge with three parts: a tight criterion (one rubric measures one observable behavior), a calibration corpus (50 to 200 human-labeled examples with inter-annotator agreement above 0.6), and a stability check (cross-family judge rotation that bounds the variance). Without all three, you are scoring noise. Skip the criterion and the judge interpolates from training data. Skip the calibration and you have no evidence the score means what you think. Skip the stability check and you are measuring the judge model, not the agent under test. The three parts are not optional. They are the unit of work.
How do I write a criterion that calibrates?
One rubric measures one observable behavior. Bad criterion: 'Score the response on helpfulness from 1 to 5.' The judge invents the criterion on every call. Good criterion: 'Score the response on whether it directly answers the question in the first sentence, without restating the question, without hedging words, and without asking a clarifying question when intent is unambiguous.' One behavior (a direct first-sentence answer) pinned down by three explicit constraints. Anchor each scale point to a concrete behavior. Use the smallest scale that captures the decision: binary for compliance, 5-point for semantic quality, continuous only when the judge has a structural rule (fraction of claims with valid citations). Name it. Version it. Treat it like code.
How large should my calibration corpus be?
Fifty to two hundred human-labeled examples per rubric is the working range. Below 50 and the kappa estimate has too much variance to trust. Above 200 and you are paying for labels that do not move the kappa needle. Stratify across failure modes and cohorts so kappa per stratum is estimable. Label every item with at least two humans; compute inter-annotator agreement (IAA) first. If IAA sits below 0.6, the rubric is ambiguous, not the labelers. Rewrite the rubric and re-label before going near the judge. Hold out 20 percent of the corpus during calibration to catch overfit.
What stability check should I run before trusting the metric?
Run the same rubric through a cross-family judge rotation. If GPT-5 and Sonnet 4.5 produce divergent score distributions on the same calibration corpus, the rubric is leaking criteria language into the verdict and the metric is judge-model-locked. Run a three-judge ensemble across families (Sonnet, GPT, Gemini) on launch decisions; family-specific biases cancel. Pin the judge model and rubric version as a single contract: the eval is the tuple (judge_model_id, rubric_version, prompt_template_hash). Recalibrate every quarter and on every judge swap. Treat judge rotation as a deliberate eval-suite migration, not a side effect of a vendor swap.
What are the common mistakes building custom metrics?
Five recurring failures. Vague criterion (one rubric carrying three observable behaviors silently). No calibration corpus (the score has no evidence it agrees with humans). Single-judge lock-in (the rubric scores 0.91 on GPT-5 and 0.74 on Sonnet because nobody checked). Reaching for a judge when a regex works (paying frontier-judge prices for what a parser catches deterministically). Scoring everything with one number (a response can be on-brand and factually wrong; the scalar averages the failure). Each one is preventable. None of them are caught by a dashboard that only watches the score.
What does Future AGI ship for custom metrics?
The ai-evaluation SDK (Apache 2.0) ships CustomLLMJudge, a Jinja2-templated G-Eval primitive against any LiteLLM-supported model with structured DefaultJudgeOutput parsing, few-shot calibration, and multi-modal input. The same class powers 70+ EvalTemplate rubrics. 13 guardrail backends (9 open-weight) and 8 sub-10ms local Scanners supply the deterministic floor. traceAI carries the same rubric as a span-attached EvalTag on live spans across 50+ AI surfaces in Python, TypeScript, Java, and C#. The Platform layers self-improving rubrics tuned by thumbs feedback, an in-product authoring agent, and classifier-backed scoring at lower per-eval cost than Galileo Luna-2. Error Feed clusters failing-rubric traces into named issues with an immediate_fix.