Custom LLM Eval Metrics (2026): The Three-Part Contract That Works
Custom LLM eval metrics in 2026: a tight criterion, a calibrated corpus, a stability check. Patterns, code, and pitfalls, all in one guide.
A medical-question agent passes the team’s faithfulness and refusal gates all quarter. Within a week of launch, a regulatory review flags a batch of outputs with unverified drug-interaction claims. The faithfulness rubric scored whether the output stayed within retrieved context; the retrieved context contained the claims. The metric was right. It was answering the wrong question.
This is the case for custom metrics in 2026. Stock evaluators (faithfulness, BLEU, ROUGE, toxicity, schema validation) cover the common surface. Production workloads break on the domain underneath: claim modality matching a cited row, brand voice on a regulated tier, citation linking across multi-step reasoning. No stock metric catches those.
The opinion this post earns: a custom eval metric is a contract between the rubric and the judge, and the contract has three parts. A tight criterion (one rubric measures one observable behavior). A calibration corpus (50 to 200 human-labeled examples with inter-annotator agreement above 0.6). A stability check (cross-family judge rotation that bounds variance). Without all three, you are scoring noise dressed as a number. This guide walks the three parts, the deployment pattern, and the five mistakes that kill custom metrics before they ship.
TL;DR: the three-part contract
| Part | What it produces | What goes wrong if you skip it |
|---|---|---|
| Tight criterion | One observable behavior per rubric, anchored scale, structured output | The judge invents the criterion on every call; kappa never breaks 0.5 |
| Calibration corpus | 50-200 human-labeled examples, IAA > 0.6, 20% hold-out, weighted Cohen’s kappa | No evidence the score agrees with humans; you ship a vibes detector |
| Stability check | Cross-family judge rotation, version-pinned contract, quarterly recalibration | The score moves when the judge bumps a minor version; you measure the judge, not the agent |
Skip a part and the metric is not a metric. It is a number that drifts faster than the workload it scores. The three parts compound: a good criterion lifts kappa, a calibrated corpus catches drift, a stability check survives the judge swap that happens every quarter.
When stock metrics aren’t enough
Stock evaluators from FAGI, DeepEval, Ragas, and Phoenix cover the common axes: faithfulness, answer relevance, context precision/recall, BLEU, ROUGE, BERTScore, toxicity, schema validation, tool-call accuracy. Wide surface. Three classes they miss:
Domain-specific correctness. Medical claim verification, legal citation accuracy, financial-rule compliance. The stock library does not encode your domain.
Multi-step structural consistency. Whether the reasoning chain reuses numeric values correctly, whether citations resolve across steps, whether tool arguments match the plan.
Business-defined quality. Brand voice, persona consistency, regulatory tone, customer-tier-appropriate response. Stock metrics do not encode your brand.
The trigger to build a custom metric is the gap between a green dashboard and a broken workload. If stock scores are high and engineering still owns the on-call page, the metric has not failed at what it measures. It is measuring the wrong thing.
Part 1: write a tight criterion
One rubric measures one observable behavior. This is the single biggest lift in custom-metric design.
A vague rubric forces the judge to interpolate criteria from training data on every call. The score comes back, but the score means something slightly different each time. A specific rubric collapses the ambiguity into defined behaviors and the judge starts pattern-matching anchors.
Bad criterion:
Score the response on helpfulness from 1 to 5.
Good criterion:
Score the response on whether it directly answers the user's question
in the first sentence, without restating the question, without hedging
words ("might", "perhaps"), and without asking a clarifying question
when intent is unambiguous.
Four observable checks, all serving one behavior: answering directly. The “1 to 5 helpfulness” rubric is the most common cause of low judge-versus-human kappa in calibration sweeps. Specific, behavior-anchored phrasing is the cheapest single change that lifts agreement.
Anchor every scale point. Free-form “rate 1 to 5” leaks structure. Anchored scales lock each level to a concrete behavior:
5 = answers correctly in the first sentence, no digression, no hedging
4 = answers correctly in the first paragraph, one minor digression
3 = answers correctly but buries the answer after restatement or hedging
2 = addresses the question but contains a factual error
1 = does not address the question
Pick the smallest scale that captures the decision. Binary for compliance (PII present, citation missing, schema invalid); 5-point for semantic quality; continuous only when the judge has a structural rule to pick the value (fraction of claims with valid citations). Seven-point scales typically over-resolve LLM judges; the judge cannot reliably tell 4 from 5 or 5 from 6, and kappa drops.
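The continuous case can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical claim extractor has already produced a list of claims with a boolean `citation_valid` field; the field name and the empty-input convention are assumptions, not part of any SDK.

```python
def citation_coverage(claims):
    """Continuous score: fraction of claims that carry a valid citation.

    `claims` is a hypothetical list of dicts with a boolean
    "citation_valid" field produced upstream.
    """
    if not claims:
        # Assumption: a response with no claims has nothing to penalize.
        return 1.0
    valid = sum(1 for claim in claims if claim["citation_valid"])
    return valid / len(claims)


claims = [
    {"text": "claim-1", "citation_valid": True},
    {"text": "claim-2", "citation_valid": False},
    {"text": "claim-3", "citation_valid": True},
    {"text": "claim-4", "citation_valid": True},
]
score = citation_coverage(claims)  # 3 of 4 claims are validly cited
```

The value comes from a counting rule, not from the judge's taste, which is what makes a continuous scale defensible here.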
Output a structured form, not free text. Free-text grades produce higher variance and harder parsing. The G-Eval form-filling shape (Liu et al. 2023) is the production default:
Output JSON only:
{
  "criterion": "<short restatement of what is being scored>",
  "reasoning": "<2-3 sentences quoting the response>",
  "score": <integer 1-5>
}
The reasoning field doubles as audit data. When the metric drifts, the reasoning tells you whether the rubric or the judge model is the cause.
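A structured form only helps if the pipeline rejects replies that break it. The sketch below validates a judge reply against the three-field form above using only the standard library; the function name `parse_judge_output` and the strict integer check are illustrative choices, not an SDK API.

```python
import json


def parse_judge_output(raw, low=1, high=5):
    """Validate a judge reply against the criterion/reasoning/score form.

    Raises ValueError on any deviation so the caller can retry or log.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    missing = {"criterion", "reasoning", "score"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    score = data["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not low <= score <= high:
        raise ValueError(f"score {score} outside [{low}, {high}]")
    return data
```

Failing loudly on a malformed reply keeps parse errors out of the score distribution, where they would otherwise look like drift.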
Bake the bias guards into the rubric. Length-neutral phrasing kills verbosity bias at zero runtime cost:
Do not reward longer responses. A correct one-sentence answer
scores the same as a correct multi-paragraph answer.
A line like that does most of the work on verbosity-induced score inflation. The deeper failure modes (position, self-preference, calibration drift) need the calibration corpus and the stability check to catch.
The criterion is the prompt. Versioned, hashed, source-controlled. Treat rubric edits as schema migrations.
Part 2: build the calibration corpus
A rubric without calibration is a subjective scorer with a JSON schema. The calibration corpus is what turns a prompt into a metric.
Size: 50 to 200 examples per rubric. Below 50 and the kappa estimate has too much variance to trust. Above 200 and the labels stop moving the kappa needle. Tune up for high-stakes rubrics (compliance, medical, legal) where the false-pass cost is large.
Stratify across failure modes and cohorts. A random sample is dominated by the same traffic your dashboards already cover. Bucket by failure mode, then balance cohorts inside each bucket:
import random
from collections import defaultdict

def stratified_corpus(traces, n_per_stratum=40):
    """Build a balanced calibration corpus across failure-mode strata."""
    # Bucket traces by failure mode first.
    by_mode = defaultdict(list)
    for t in traces:
        by_mode[t["failure_mode"]].append(t)

    selected = []
    for mode, items in by_mode.items():
        # Within each failure mode, balance across cohorts.
        by_cohort = defaultdict(list)
        for it in items:
            by_cohort[it["cohort"]].append(it)
        per_cohort = max(1, n_per_stratum // max(1, len(by_cohort)))
        for cohort_items in by_cohort.values():
            random.shuffle(cohort_items)
            selected.extend(cohort_items[:per_cohort])
    return selected
Label with at least two humans. Compute inter-annotator agreement (IAA) first. If IAA sits below 0.6, the rubric is ambiguous, not the labelers. Rewrite and re-label before going near the judge. The judge cannot exceed the kappa ceiling the humans set; chasing a higher number means overfitting to one labeler’s bias.
Resolve disagreements. Drop items where humans disagree by more than one scale point. For one-point disagreements, resolve via a third labeler or majority vote.
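The resolution rule can be written down as a small function. This is one possible reading, assuming two primary labels on an integer scale and an optional third label used only as a tie-breaker; the name `resolve_label` and the choice to leave an item unresolved when the third label matches neither are assumptions.

```python
def resolve_label(a, b, third=None):
    """Return the agreed label, or None if the item is dropped or unresolved.

    a, b  : integer labels from the two primary annotators
    third : optional integer label from a tie-breaking annotator
    """
    if abs(a - b) > 1:
        return None  # more than one scale point apart: drop the item
    if a == b:
        return a
    # One-point disagreement: the third label decides if it sides with either.
    if third in (a, b):
        return third
    return None  # still unresolved; send back for another label


labels = [(5, 3, None), (4, 4, None), (4, 5, 5), (4, 5, None)]
resolved = [resolve_label(a, b, t) for a, b, t in labels]
```

Items that come back as `None` stay out of the corpus until they are relabeled, so the kappa estimate is computed only on labels the humans actually agree on.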
Compute weighted Cohen’s kappa. Once the human labels are clean, score the judge on the same set:
from sklearn.metrics import cohen_kappa_score
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "brand_voice",
        "model": "gpt-5",
        "grading_criteria": (
            "Score 5 if the response uses second-person ('you') and "
            "contains no hedging words. Score 1 at the opposite end."
        ),
    },
)

# Score every calibration item with the judge.
# cohen_kappa_score expects discrete labels on a shared scale; if the judge
# returns a normalized float, map it onto the human scale before comparing.
judge_scores = [
    judge.compute_one(CustomInput(**item["inputs"]))["output"]
    for item in corpus
]
human_scores = [item["human_score"] for item in corpus]

# Quadratic weights penalize distant disagreements more than adjacent ones.
kappa = cohen_kappa_score(human_scores, judge_scores, weights="quadratic")
Use weighted (quadratic) kappa for ordinal scales; a 5-vs-1 disagreement counts more than 5-vs-4. Use unweighted kappa for binary.
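To see why the quadratic weights matter, it helps to compute the statistic by hand once. The sketch below is a dependency-free implementation of quadratic weighted kappa for integer ratings, written for illustration only; in production the scikit-learn call is the better choice. The toy ratings are made up.

```python
def quadratic_weighted_kappa(a, b, min_rating=1, max_rating=5):
    """Quadratic weighted Cohen's kappa for two lists of integer ratings."""
    k = max_rating - min_rating + 1
    n = len(a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_a[i] * hist_b[j] / n
    if den == 0:
        raise ValueError("kappa is undefined when ratings have no variance")
    return 1.0 - num / den


humans = [5, 5, 4, 3, 1]
judge_near = [4, 5, 4, 3, 1]  # one 5-vs-4 disagreement
judge_far = [1, 5, 4, 3, 1]   # one 5-vs-1 disagreement
kappa_near = quadratic_weighted_kappa(humans, judge_near)
kappa_far = quadratic_weighted_kappa(humans, judge_far)
```

On this toy set a single adjacent miss leaves kappa at roughly 0.95, while a single 5-vs-1 miss drops it to roughly 0.41. Unweighted kappa would treat both misses the same.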
The Landis & Koch (1977) bands map kappa to operational policy. As a starting team rule: kappa above 0.6 for CI gates, above 0.8 for unattended automation. Below 0.6 the false-fail rate makes the gate noise.
Hold out 20 percent. Never include calibration examples in the rubric’s few-shot block. Never test against the same set you tuned on. The hold-out catches the overfit case where the rubric and the corpus co-evolved until the rubric only fires on the labeled set.
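A seeded split keeps the hold-out stable across runs so nobody tunes against it by accident. A minimal sketch, assuming the corpus is a plain list; the function name and the fixed seed are illustrative.

```python
import random


def split_holdout(corpus, holdout_frac=0.2, seed=0):
    """Split a corpus into (tuning, holdout) lists, reproducibly."""
    rng = random.Random(seed)  # local RNG so global random state is untouched
    shuffled = list(corpus)    # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_frac))
    return shuffled[n_holdout:], shuffled[:n_holdout]


corpus = list(range(100))
tuning, holdout = split_holdout(corpus)
```

This split is purely random. For a small stratified corpus, splitting inside each stratum keeps rare failure modes represented on both sides.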
Part 3: run the stability check
Calibration is a number at a point in time. Stability is what happens to it the day the judge model bumps a minor version.
The judge is a prompt, not a function. A rubric calibrated against gpt-4o-2024-08-06 produces different distributions on gpt-4o-2024-11-20. Mean shifts 3 to 8 points; distribution narrows. Swap GPT-5 for Sonnet 4.5 without recalibrating and the dashboard moves but the agent didn’t. The judge changed, not the system under test.
Four practices bound the variance.
Cross-family judge rotation on the calibration corpus. Run the same rubric through GPT, Claude, and Gemini judges on the same calibration set. Compare distributions. If kappa-against-humans is 0.8 on GPT and 0.55 on Sonnet, the rubric is leaking criteria phrasing into one judge’s prior. Tighten the criterion or expand the calibration corpus until the rubric travels across families. A rubric that only calibrates on one judge is judge-model-locked, which means you cannot ever rotate without rebuilding the metric.
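Once kappa-against-humans has been computed per judge family, the comparison reduces to a floor and a spread. The sketch below assumes those kappas are already in a dict; the 0.6 floor follows the CI-gate rule above, while the 0.15 maximum spread is an illustrative threshold, not a published standard.

```python
def rubric_travels(kappa_by_judge, floor=0.6, max_spread=0.15):
    """True if every judge family clears the floor and the families agree.

    kappa_by_judge: dict mapping a judge label to its kappa against humans.
    max_spread is an assumed tolerance; tune it to your own risk appetite.
    """
    values = list(kappa_by_judge.values())
    spread = max(values) - min(values)
    return min(values) >= floor and spread <= max_spread


locked = rubric_travels({"gpt": 0.80, "sonnet": 0.55})
portable = rubric_travels({"gpt": 0.78, "claude": 0.74, "gemini": 0.71})
```

The first case mirrors the 0.8 versus 0.55 example: one family is below the floor and the spread is wide, so the rubric is judge-model-locked.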
Pin the contract. The eval is the tuple (judge_model_id, rubric_version, prompt_template_hash). Bump any field deliberately, never as a side effect of a vendor swap. Cache verdicts keyed on the tuple; invalidate on contract change, not on every PR. Store the rubric in source control; block merges to the rubric file unless the PR includes a calibration run with kappa above baseline.
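The contract tuple maps naturally onto a frozen dataclass whose fields feed the cache key. A minimal sketch using only the standard library; `EvalContract`, `template_hash`, and the per-trace key layout are hypothetical names chosen for illustration.

```python
import hashlib
from dataclasses import dataclass


def template_hash(template_text):
    """Short, stable fingerprint of the rendered prompt template."""
    return hashlib.sha256(template_text.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class EvalContract:
    judge_model_id: str
    rubric_version: str
    prompt_template_hash: str

    def cache_key(self, trace_id):
        """Verdict cache key: changes whenever any contract field changes."""
        raw = "|".join(
            [self.judge_model_id, self.rubric_version,
             self.prompt_template_hash, trace_id]
        )
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()


tmpl = "Score the response on whether it directly answers the question."
pinned = EvalContract("gpt-4o-2024-08-06", "v3", template_hash(tmpl))
bumped = EvalContract("gpt-4o-2024-11-20", "v3", template_hash(tmpl))
```

Because the key includes the full tuple, a snapshot bump invalidates cached verdicts automatically, and an unrelated PR does not.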
Three-judge ensemble for launch decisions. Sonnet 4.5, GPT-5, Gemini 2.5 Pro is a defensible default as of May 2026. Family-specific biases cancel. Ensemble costs roughly 3x a single judge; reserve it for the gate, not the weekly trend. Single judge for the dashboard; ensemble for the ship decision.
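How the three verdicts get combined is a design choice this guide leaves open. The sketch below assumes a median over 1-to-5 scores and reports the spread alongside it; the median, the pass threshold of 4, and the judge labels are all assumptions for illustration.

```python
from statistics import median


def ensemble_verdict(scores_by_judge, pass_threshold=4):
    """Combine per-judge scores with a median (an assumed aggregation rule).

    Returns the combined score, the pass/fail decision, and the spread,
    which is worth logging: a wide spread means the judges disagree.
    """
    values = list(scores_by_judge.values())
    combined = median(values)
    return {
        "score": combined,
        "passed": combined >= pass_threshold,
        "spread": max(values) - min(values),
    }


verdict = ensemble_verdict({"sonnet": 4, "gpt": 5, "gemini": 2})
```

A median resists one outlier judge, which is the property you want at a ship gate. A wide spread on a passing item is still a signal to read the reasoning fields.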
Recalibrate on triggers. A one-week SLA on every one of:
- Judge model update (provider rolls a snapshot, releases a minor version).
- Rubric edit (any wording change to a criterion or anchor).
- Distribution shift (new feature, persona, cohort, prompt revision in the production model).
- Rolling-window kappa drops meaningfully below baseline for two consecutive windows.
The cost of recalibration is small. The cost of a metric that silently drifted to fair-agreement kappa for a quarter is a release that shipped on green dashboards and broke production.
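The rolling-window trigger is the one that benefits most from being automated. A minimal sketch, assuming kappa is recomputed per window against freshly labeled samples; the 0.05 tolerance stands in for "meaningfully below baseline" and is an assumption to tune.

```python
def needs_recalibration(window_kappas, baseline, tolerance=0.05, consecutive=2):
    """True if kappa sits below (baseline - tolerance) for N windows in a row."""
    run = 0
    for kappa in window_kappas:
        run = run + 1 if kappa < baseline - tolerance else 0
        if run >= consecutive:
            return True
    return False


sustained_drop = needs_recalibration([0.72, 0.64, 0.63], baseline=0.72)
single_dip = needs_recalibration([0.72, 0.64, 0.71, 0.65], baseline=0.72)
```

Requiring consecutive windows keeps one noisy sample from paging the team while still catching a sustained slide.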
The deployment pattern
The rubric runs in two places: in pytest as a CI gate, and as a span-attached evaluator on live traffic. The same CustomLLMJudge instance powers both.
from fi.evals.metrics.llm_as_judges.custom_judge.metric import CustomLLMJudge
from fi.evals.metrics.llm_as_judges.types import CustomInput
from fi.evals.llm.providers.litellm import LiteLLMProvider

medical_claim_judge = CustomLLMJudge(
    provider=LiteLLMProvider(),
    config={
        "name": "medical_claim_modality",
        "model": "gpt-5",
        "grading_criteria": (
            "For each numeric drug-interaction claim, extract the stated "
            "modality (always, sometimes, contraindicated) and the cited "
            "row's modality. Score 1 only if every modality matches; "
            "score 0 otherwise."
        ),
        "few_shot_examples": [
            {"inputs": {"response": "...", "context": "..."},
             "output": '{"score": 1.0, "reason": "all modalities match"}'},
        ],
    },
)

result = medical_claim_judge.compute_one(CustomInput(
    response="...",
    context="...",
))
# result["output"] -> float in [0.0, 1.0]
# result["reason"] -> JSON-stringified judge output
DefaultJudgeOutput enforces the form-filling schema (score: float ∈ [0, 1], reason: str). The Jinja template carries the rubric and few-shot calibration block. Multi-modal: pass image_url or audio_url and LiteLLM forwards the media to vision and audio-capable judges.
For zero inline latency on live traffic, attach the same rubric as a span-level EvalTag via traceAI. The tag serializes into the OTel resource; the collector runs the eval server-side and writes results back as gen_ai.evaluation.* attributes. Same rubric in CI and on the trace; that diff closes most of the offline-vs-production drift that wrecks custom-metric programs.
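The CI half of the pattern can be sketched without any SDK at all. The example below uses a stub in place of the real judge call so it runs offline; `fake_judge`, `GOLDEN_SET`, and the 0.9 pass-rate gate are hypothetical, and the stub only mimics the `output` and `reason` keys shown above.

```python
def fake_judge(inputs):
    """Stand-in for a real judge call; returns the same result shape."""
    cited = "[1]" in inputs["response"]
    return {"output": 1.0 if cited else 0.0, "reason": "stub verdict"}


# Hypothetical golden set; in practice this is the calibration hold-out.
GOLDEN_SET = [
    {"response": "Claim with a source [1].", "context": "row 1"},
    {"response": "Another sourced claim [1].", "context": "row 2"},
]


def pass_rate(score_fn, dataset, threshold=0.5):
    """Fraction of items whose judge score clears the threshold."""
    passed = sum(1 for item in dataset if score_fn(item)["output"] >= threshold)
    return passed / len(dataset)


def test_rubric_gate():
    # pytest collects any function named test_*; the assert is the gate.
    assert pass_rate(fake_judge, GOLDEN_SET) >= 0.9
```

Swapping `fake_judge` for a wrapper around the real judge turns the same test into the release gate, with the threshold set from the calibration run.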
Cascade deterministic in front of the judge. A regex catches a missing citation for free. A JSON schema catches a malformed response in microseconds. Pay the judge only on cases the cheap checks couldn’t decide:
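# all_claims_have_citations and all_citations_resolve are deterministic
# helpers defined elsewhere in the codebase.
def medical_composite(inputs):
    if not all_claims_have_citations(inputs["response"]):
        return {"score": 0.0, "reason": "missing citation"}
    if not all_citations_resolve(inputs["response"]):
        return {"score": 0.0, "reason": "broken citation"}
    # Only cases the cheap checks could not decide reach the judge.
    return medical_claim_judge.compute_one(CustomInput(**inputs))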
Deterministic checks are 10,000x cheaper than a frontier judge and never drift. The judge bill drops 80 to 90 percent on most workloads.
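The first cheap check in that cascade might look like the sketch below. It is a hypothetical implementation of `all_claims_have_citations`, assuming every sentence counts as a claim and citations are bracketed numbers such as `[1]` placed before the sentence's closing punctuation. Real citation formats vary, so treat the patterns as placeholders.

```python
import re

CITATION = re.compile(r"\[\d+\]")
# Split after sentence-ending punctuation followed by whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def all_claims_have_citations(response):
    """True if every sentence contains at least one [n]-style citation.

    Assumption: each sentence is one claim. An empty response fails the check.
    """
    sentences = [s for s in SENTENCE_SPLIT.split(response.strip()) if s]
    return bool(sentences) and all(CITATION.search(s) for s in sentences)


cited = all_claims_have_citations("First claim [1]. Second claim [2].")
uncited = all_claims_have_citations("First claim [1]. Second claim.")
```

A check like this will be wrong on edge cases such as abbreviations or citations placed after the period. That is acceptable for a floor: it only needs to be cheap and conservative, and the judge handles the rest.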
Common mistakes
Five recurring failures kill custom metrics before they ship.
Vague criterion. One rubric carrying three observable behaviors silently. The judge guesses which one to score on every call; kappa never breaks 0.5. Fix: rewrite the rubric until each scale point is a concrete behavior. If you cannot anchor it, you are measuring more than one thing.
No calibration corpus. The rubric scores 4.2 on average and the team trusts the number. There is no evidence the score agrees with humans on this domain. Fix: 50 to 200 hand-labeled examples, IAA above 0.6, weighted kappa against the judge. Below kappa 0.6, the gate is noise.
Single-judge lock-in. The rubric calibrated on GPT-5 and nobody checked Sonnet. The provider rolls a snapshot, the score moves, the agent didn’t. Fix: cross-family judge rotation on the calibration corpus; pin the contract; recalibrate on every judge swap.
Reaching for a judge when a regex works. A $0.04-per-call frontier judge running on a binary “does this contain a citation” decision a regex returns in microseconds. Fix: deterministic floor in front of every judge. The skill is reaching for the cheapest tool that gives the right answer.
One score across multi-dim quality. A response can be on-brand and factually wrong; a patch can pass tests and import a deprecated library. The scalar averages over the failure. Fix: per-dimension scoring. If two dimensions can plausibly move in opposite directions across a release, score them separately. Gate on the minimum across dimensions when every dim is a hard requirement; gate on a weighted average only when the business defends the weights.
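The difference between the two gating rules shows up on a single example. A minimal sketch, assuming per-dimension scores on a 1-to-5 scale; the dimension names, weights, and floors are illustrative.

```python
def gate_min(scores, floor):
    """Pass only if every dimension clears the floor."""
    return min(scores.values()) >= floor


def gate_weighted(scores, weights, floor):
    """Pass if the weighted average clears the floor."""
    total = sum(weights[dim] for dim in scores)
    average = sum(scores[dim] * weights[dim] for dim in scores) / total
    return average >= floor


scores = {"brand_voice": 5, "factuality": 2}
strict = gate_min(scores, floor=4)
averaged = gate_weighted(scores, {"brand_voice": 1, "factuality": 1}, floor=3.5)
```

The on-brand but factually wrong response fails the minimum gate and passes the averaged one, which is exactly the failure a single scalar hides.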
How Future AGI ships custom metrics as a package
A custom rubric is a contract. A custom rubric integrated into an eval stack that calibrates, cascades, clusters failing traces, and refines is what compounds. Start with the SDK for code-defined rubrics. Graduate to the Platform when you need self-improving evaluators, in-product authoring, and classifier-backed cost economics at scale.
The ai-evaluation SDK (Apache 2.0) is the code-first surface. CustomLLMJudge exposes the G-Eval primitive: Jinja2 template, structured DefaultJudgeOutput, few-shot calibration, multi-modal input. The same class powers 70+ EvalTemplate rubrics across faithfulness, agent quality, function calling, summarization, and multi-modal output. 13 guardrail backends (9 open-weight: LLAMAGUARD_3_8B/1B, QWEN3GUARD_8B/4B/0.6B, GRANITE_GUARDIAN_8B/5B, WILDGUARD_7B, SHIELDGEMMA_2B) plus 8 sub-10ms local Scanners (jailbreak, code injection, secrets, malicious URL, invisible chars, language, topic restriction, regex) supply the deterministic floor for the cost cascade. Four distributed runners (Celery, Ray, Temporal, Kubernetes) carry rubric execution into whatever orchestrator the team already runs.
traceAI (Apache 2.0) carries the same rubric as a span-attached EvalTag on live traffic across 50+ AI surfaces in Python, TypeScript, Java, and C#. Pluggable semantic conventions at register() time. Server-side scoring at zero inline latency.
The Future AGI Platform layers what the SDK alone cannot do. Self-improving rubrics retune from thumbs up/down feedback so the rubric ages with the product instead of against it. An in-product authoring agent writes custom rubrics from natural-language descriptions and proposes calibration corpora to label. Classifier-backed scoring runs at lower per-eval cost than Galileo Luna-2, which is what makes daily full-traffic scoring financially viable instead of a quarterly batch. The Agent Command Center handles judge routing across 100+ providers (SOC 2 Type II, HIPAA, GDPR, and CCPA certified, ISO/IEC 27001 in active audit) so cross-family judge rotation is a config change, not a deploy. Error Feed sits inside the eval stack: HDBSCAN soft-clusters failing-rubric traces, a Sonnet 4.5 Judge writes the RCA with an immediate_fix, fixes feed the self-improving evaluators. agent-opt consumes the custom rubric across six optimizers so prompt search runs against the same metric the CI gate uses.
Ready to wire a production-grade custom metric against your own workload? Start with the ai-evaluation SDK quickstart, drop a CustomLLMJudge against your dataset in pytest this afternoon, then attach the same rubric as an EvalTag on live spans via traceAI. The same rubric in both places is the diff that turns a custom rubric from a notebook experiment into a metric that holds for two years.
Three takeaways for 2026
- A custom metric is a contract. Tight criterion, calibration corpus, stability check. Without all three you are scoring noise.
- One rubric, one behavior. The cheapest lift in custom-metric design is naming exactly one observable thing per rubric and anchoring each scale point.
- The stack is the moat, not the prompt. A rubric by itself is a JSON blob. A rubric integrated with calibration, cascading, clustering, and self-improving evaluators is what survives the judge swap that happens every quarter.