Evaluating LLM Translation Quality in 2026
BLEU is dead for LLM translation. The 2026 stack: COMET + LLM-as-judge fluency/adequacy rubrics + per-language-pair calibration. With code and thresholds.
BLEU is dead for LLM translation. ROUGE didn’t help either. Vendor benchmark scores transferred from WMT or FLORES tell you almost nothing about whether the German enterprise customer is going to renew. The eval that matches modern translation is COMET (neural reference-based or reference-free) plus LLM-as-judge fluency and adequacy rubrics, calibrated against native-speaker labels on your own language pairs. This guide is the working stack: why the classic metrics fail, how COMET and COMET-Kiwi actually work, the rubric set that catches what COMET can’t see, the per-pair calibration step, and the production loop that uses post-edit distance as the strongest feedback signal you have.
The thesis
Translation is a multi-dimensional problem with domain-specific failure modes. The 2026 eval stack runs three layers on the same outputs:
- A learned neural metric (COMET) as the single-number regression signal. COMET-22 with references, COMET-Kiwi without. Correlates with human direct assessment far better than BLEU or chrF.
- An LLM-as-judge rubric on five quality dimensions COMET can’t separate: adequacy, fluency, idiom transfer, cultural register, domain term consistency. Scored independently. Calibrated against native-speaker labels per pair.
- A production loop with post-edit distance as the truth signal. The diff between the LLM draft and the published target is the score that actually predicts customer satisfaction.
BLEU and chrF still ship, as the lowest tile on the dashboard for backward compatibility. Never as the gate.
Step 1: kill BLEU as the gate
BLEU was designed by Papineni et al. in 2002 for statistical MT, where the model rendered the source closely and a single reference was authoritative. Those assumptions don’t hold for LLM translation. Three failure modes recur, and BLEU misses all three.
Paraphrase penalty. Given “Could you check on my order?”, an English-to-Spanish reference might be “¿Podrías revisar mi pedido?” An equally correct translation is “¿Puedes mirar mi orden?” BLEU punishes the second. A native Spanish reviewer doesn’t.
Idiom inversion. “Spill the beans” should become a Spanish idiom for revealing a secret, not “derramar los frijoles.” The literal target has higher n-gram overlap with a poorly written reference than the natural one. BLEU rewards the literal mistake.
Dropped clauses with high overlap. An LLM that translates a 40-token German legal sentence into a 30-token Spanish target by silently dropping a subordinate clause still matches most of the reference n-grams. BLEU never sees what’s missing.
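The paraphrase penalty is easy to reproduce by hand. The sketch below computes clipped n-gram precision, which is the core ingredient of BLEU but not the full metric (no brevity penalty, no geometric mean). The tokenizer and the function names are illustrative assumptions, not part of any library.

```python
from collections import Counter

def tokens(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    return [w.strip("¿?¡!.,").lower() for w in text.split()]

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand, ref = tokens(candidate), tokens(reference)
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total

reference = "¿Podrías revisar mi pedido?"
paraphrase = "¿Puedes mirar mi orden?"

unigram = ngram_precision(paraphrase, reference, 1)  # only "mi" overlaps
bigram = ngram_precision(paraphrase, reference, 2)   # no shared bigrams
```

A correct paraphrase shares one word in four with the reference and no bigrams at all, so any overlap-based score treats it as a near-total miss.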
ROUGE was built for summarization recall, not translation. Some suites still ship a ROUGE column. It catches lexical drift and nothing else. WMT ranks systems by human evaluation, and the WMT22 metrics shared task findings, titled “Stop Using BLEU,” recommended learned neural metrics over n-gram overlap. The research community settled this. Production teams haven’t, because BLEU is in every template.
Step 2: COMET and COMET-Kiwi
COMET is a learned neural metric trained on human direct-assessment scores from the WMT shared tasks, fine-tuned over XLM-RoBERTa-large. Source, candidate, optionally reference go in; a score that correlates with human judgment far better than any n-gram metric comes out. Two variants matter in production.
COMET-22 (wmt22-comet-da). Reference-based. Highest correlation with human DA when a gold reference exists. Offline scorer against your golden set.
COMET-Kiwi (wmt22-cometkiwi-da). Reference-free. Scores source-to-target directly. The production scorer. You almost never have a gold reference on a live translation, only the source and the model output.
Both ship in unbabel-comet.
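```python
from comet import download_model, load_from_checkpoint

# golden_pairs: iterable of (source, target, reference) tuples, defined elsewhere.
# production_pairs: iterable of (source, target) tuples, defined elsewhere.

# Reference-based: golden-set offline scoring
ref_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
ref_data = [{"src": s, "mt": t, "ref": r} for s, t, r in golden_pairs]
# gpus=0 runs on CPU; set gpus=1 for GPU throughput.
ref_scores = ref_model.predict(ref_data, batch_size=8, gpus=0)

# Reference-free: production-stream scoring
kiwi_model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))
kiwi_data = [{"src": s, "mt": t} for s, t in production_pairs]
kiwi_scores = kiwi_model.predict(kiwi_data, batch_size=8, gpus=0)

# predict() returns per-segment scores in .scores and the mean in .system_score.
```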
COMET scores sit on a roughly 0-1 scale. The absolute number is not interpretable across pairs: a 0.85 on English-to-Spanish news is fine, a 0.85 on English-to-Arabic legal might be the floor. Calibrate per pair against your post-edit signal, not a universal cutoff. COMET-Kiwi is the streaming scorer (milliseconds per pair on GPU). The LLM-judge rubric is the next layer, where you need to know why a score moved.
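One possible way to turn the post-edit signal into a per-pair floor is sketched below. It is an assumption, not a method the COMET authors prescribe: the function name, the `(kiwi_score, ter)` history format, and both thresholds are hypothetical. It picks the lowest score cutoff at which enough of the drafts at or above it needed only light post-editing.

```python
def calibrate_floor(history, acceptable_ter, target_precision):
    """Pick a COMET-Kiwi floor for one language pair.

    history: list of (kiwi_score, ter) tuples from post-edited production drafts.
    Returns the lowest cutoff where at least target_precision of the drafts
    scoring at or above it have TER <= acceptable_ter, or None if none qualifies.
    """
    for cutoff in sorted({score for score, _ in history}):
        kept = [ter for score, ter in history if score >= cutoff]
        good = sum(1 for ter in kept if ter <= acceptable_ter)
        if good / len(kept) >= target_precision:
            return cutoff
    return None

# Hypothetical history for one pair: (COMET-Kiwi score, TER of the post-edit).
en_es_history = [(0.90, 5), (0.88, 8), (0.80, 30), (0.75, 10), (0.70, 40)]
en_es_floor = calibrate_floor(en_es_history, acceptable_ter=15, target_precision=0.75)
```

Running the same function on each pair's own history is what produces different floors for different pairs, rather than one universal cutoff.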
Step 3: the LLM-as-judge rubric COMET can’t replace
COMET gives you a quality scalar. It can’t tell you whether the regression came from a register slip, a dropped clause, a literal idiom, or a medical-term mistranslation. The LLM-as-judge layer is the diagnostic the scalar lacks. Score five rubrics independently, 1-5 with one-sentence reasoning per score.
Adequacy. Does the target convey the source meaning faithfully? No dropped clauses, no hallucinated content, no inverted negation. The closest analog to a strict reference comparison, but it tolerates paraphrase.
Fluency. Does the target read naturally to a native speaker of the target locale? The dangerous failure mode is a fluent translation that’s inadequate. Reviewers don’t notice. Customers do.
Idiom transfer. Are figurative expressions rendered as natural target-language equivalents rather than literally? Run this rubric on a deliberately idiom-heavy subset of the golden set (10-20% of the pairs).
Cultural register. Is the formality level correct for the target locale, the relationship in the source, and the channel? Spanish formal usted versus informal tú, Japanese honorific levels, German Sie versus du, Korean speech levels: all of these have to stay consistent within a translation and match the source’s social signal.
Domain term consistency. Are legal, medical, financial, or product-specific terms mapped to their canonical target equivalents? Attach the glossary to the request, score adherence per term. A medical translation that renders “myocardial infarction” as a literal target-language phrase rather than the standard local clinical term has translated the words and lost the document.
Two more rubrics ride along as guardrails. Refusal preservation scores whether a refusal in the source language stays a refusal in the target. Agents that translate the input, hand off to a downstream agent, and translate the response back can quietly bypass safety boundaries the source-language rubric never fires on. Hallucinated facts (FactualAccuracy) catches numbers, dates, or named entities introduced during translation.
Holistic single-score templates exist (TranslationAccuracy is one). Useful as dashboard tiles. Never as the gate, because they hide which dimension regressed.
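A minimal sketch of what "scored independently, 1-5 with one-sentence reasoning" can look like in practice. The prompt wording, the JSON reply format, and the function names are assumptions for illustration; the call to the judge model itself is left out.

```python
import json

DIMENSIONS = (
    "adequacy", "fluency", "idiom_transfer",
    "cultural_register", "domain_term_consistency",
)

def build_judge_prompt(dimension, source, target, glossary=""):
    """Build one prompt per dimension so each rubric is scored on its own."""
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    return (
        f"Score the translation on {dimension} from 1 (worst) to 5 (best).\n"
        f"Source: {source}\n"
        f"Target: {target}\n"
        f"Glossary: {glossary or 'none'}\n"
        'Reply with JSON only: {"score": <1-5>, "reason": "<one sentence>"}'
    )

def parse_verdict(raw):
    """Validate the judge reply and return (score, reason)."""
    verdict = json.loads(raw)
    score = verdict["score"]
    if type(score) is not int or not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score!r}")
    return score, str(verdict.get("reason", ""))
```

Rejecting malformed verdicts at parse time keeps a judge that drifts off-format from silently polluting the rubric averages.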
Step 4: per-language-pair calibration
An uncalibrated LLM judge is a confident random number generator. Translation makes this worse because judge models have asymmetric strength across pairs. GPT-class judges are strong on Western European pairs and weaker on Korean and Arabic. Claude-class judges have a different asymmetry. Pick a judge and verify it per pair.
Run the calibration loop once when you set the rubric up and quarterly after:
- Sample 50 source-target pairs per language pair from production. Skew toward edge cases (long sources, mixed-language content, idiom-heavy text, contested register).
- Two native speakers score each pair on all five dimensions. 1-5 Likert with one-sentence reasoning. Resolve disagreements by discussion or a third annotator.
- Run the judge prompt against the same set. Same prompt, same judge model, same temperature you’ll use in production.
- Compute Cohen’s kappa per dimension per pair between judge scores and human consensus. Because the scores are ordinal, a weighted kappa gives partial credit for near misses. As a rough reference point for judge-human agreement (a different statistic on a different task), the G-Eval paper hit Spearman 0.514 with human raters on SummEval using GPT-4. A 2026 translation judge calibrated on a native-speaker set should hit Cohen’s kappa of 0.6 or better on the dimensions your product cares about most. Below that, the judge model is wrong for the pair, or the rubric prompt needs sharpening, or both.
- Pin the judge model and rubric prompt as one versioned contract. Swapping the judge is an eval migration, not a config change. Track judge-human kappa over quarters; when it slides, the source distribution or the model has shifted.
Thirty minutes of native-speaker time per pair, quarterly, prevents months of arguing about whether a release regressed.
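The agreement step in the loop above can be computed without extra dependencies. This is a sketch of Cohen's kappa for 1-5 labels with an optional quadratic weighting for ordinal scores. The function name and the sample scores are illustrative, and a library implementation such as scikit-learn's `cohen_kappa_score` would serve equally well.

```python
def cohen_kappa(judge, human, labels=(1, 2, 3, 4, 5), weighted=False):
    """Cohen's kappa between two raters; quadratic weights if weighted=True."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty score lists of equal length")
    n, k = len(judge), len(labels)
    index = {label: i for i, label in enumerate(labels)}

    # Confusion matrix of judge (rows) against human (columns).
    observed = [[0] * k for _ in range(k)]
    for j, h in zip(judge, human):
        observed[index[j]][index[h]] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][c] for i in range(k)) for c in range(k)]

    def penalty(i, j):
        if weighted:
            return (i - j) ** 2 / (k - 1) ** 2
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    disagree_obs = sum(penalty(i, j) * observed[i][j] for i, j in cells) / n
    disagree_exp = sum(penalty(i, j) * row[i] * col[j] for i, j in cells) / (n * n)
    if disagree_exp == 0:
        return 1.0  # both raters used a single identical label
    return 1.0 - disagree_obs / disagree_exp

# Hypothetical fluency scores on eight calibration pairs.
judge_scores = [5, 4, 4, 3, 5, 2, 4, 5]
human_scores = [5, 4, 3, 3, 5, 2, 4, 4]
kappa = cohen_kappa(judge_scores, human_scores, weighted=True)
```

Compute it once per dimension per language pair and compare against the 0.6 target; the unweighted form is stricter because it treats a 4-versus-5 disagreement the same as 1-versus-5.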
Step 5: domain terminology in three layers
Most production translation pipelines carry a glossary: brand names, legal terms of art, medical canonical translations, do-not-translate lists. The standard failure is that the LLM uses the canonical target in one chunk and translates the same term word-for-word in the next.
- Deterministic lookup. For every glossary entry, scan the source for the source term and the target for the canonical equivalent. Source-present + canonical-absent is a violation. String match, microseconds per translation, catches the easiest 80%.
- ContextAdherence template. Attach the glossary as the context field on the request. ContextAdherence (eval_id=5) scores whether the translation honored the attached glossary or style guide. Catches uses of the canonical term in the wrong context.
- CustomLLMJudge diagnostic. “Score 1-5 whether every glossary term in the source is rendered using the canonical target equivalent attached in the context. Score 1 if any term is translated word-for-word.” The reason field names the term that was missed.
Run in cascade. Deterministic first. Judge calls only on rows the deterministic layer passes.
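The deterministic layer can be as small as the sketch below. It assumes the glossary is a plain dict from source term to canonical target term and uses case-insensitive substring matching, which misses inflected forms; the function name and sample sentences are illustrative.

```python
def glossary_violations(source, target, glossary):
    """Return source terms that appear in the source but whose canonical
    target equivalent is missing from the translation."""
    src, tgt = source.casefold(), target.casefold()
    return [
        term for term, canonical in glossary.items()
        if term.casefold() in src and canonical.casefold() not in tgt
    ]

glossary = {"myocardial infarction": "infarto de miocardio"}
source = "The patient had a myocardial infarction."

literal = glossary_violations(source, "El paciente tuvo un ataque al corazón.", glossary)
canonical = glossary_violations(source, "El paciente tuvo un infarto de miocardio.", glossary)
```

Rows with an empty violation list move on to the judge layers; rows with violations fail here at zero API cost.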
Step 6: build the cascade
Translation matrices grow combinatorially. Language pairs times domains times source styles compounds fast. Running a frontier judge on every translation burns the eval budget before it scores anything useful. The cascade:
- Deterministic first. Length spec, structured-field presence, glossary string match, named-entity preservation. Microseconds, zero API cost, every trace.
- COMET-Kiwi second. Reference-free neural metric, milliseconds per pair on GPU. The streaming quality scalar; drift alarm fires on rolling mean per pair.
- Multilingual classifier third. Toxicity (eval_id=15), PII, prompt injection, refusal preservation. Run in the target language, not just English. Use the nine open-weight backends in ai-evaluation (QWEN3GUARD_8B, LLAMAGUARD_3_8B, SHIELDGEMMA_2B and siblings). Sub-100ms.
- LLM-as-judge fourth. The five quality rubrics. Sampled (5-10% of production traffic, 100% of CI). Pin the judge model, cache verdicts keyed on (judge_model_id, rubric_version, source_hash, target_hash, glossary_hash).
- Frontier adjudication last. When the standard judge and a classifier disagree, or COMET-Kiwi flags but the judge passes, escalate to a stronger judge or a human label. The localization reviewer queue lives here.
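The verdict cache key from the judge step can be built with the standard library. This is a sketch under assumptions: SHA-256 as the hash and a unit-separator join are arbitrary choices, and the function names are hypothetical.

```python
import hashlib

def _sha(text):
    """Hex SHA-256 of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def verdict_cache_key(judge_model_id, rubric_version, source, target, glossary):
    """Key a judge verdict on everything that can change its outcome."""
    parts = [judge_model_id, rubric_version, _sha(source), _sha(target), _sha(glossary)]
    # Join with a separator that cannot appear in the hex digests.
    return _sha("\x1f".join(parts))

key_v1 = verdict_cache_key("judge-model-a", "rubric-v1", "Hello", "Hola", "")
key_v2 = verdict_cache_key("judge-model-a", "rubric-v2", "Hello", "Hola", "")
```

Because the judge model and rubric version are part of the key, swapping either one invalidates old verdicts automatically instead of serving stale scores.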
Step 7: post-edit distance is the truth signal
Localization pipelines almost always have humans in the loop. A post-editor takes the LLM draft, fixes the issues, ships the edited target. The diff between LLM draft and human-edited target is the strongest quality signal you have. It’s free. It’s real. It correlates with customer-facing quality better than any automated metric.
Track translation edit rate (TER): the number of edits needed to turn the LLM draft into the post-edited target, normalized by the length of the post-edited target in tokens. SacreBLEU ships it. Compute per language pair per domain.
```python
import sacrebleu

ter = sacrebleu.metrics.TER()

draft = "El cliente solicita la cancelación del pedido."  # LLM draft (hypothesis)
edited = "El cliente solicitó cancelar el pedido."        # post-edited target (reference)

# Lower is better; 0 = no edits.
ter_score = ter.sentence_score(draft, [edited]).score
```
Three patterns fall out immediately.
TER per pair points at model selection. English-to-Japanese TER of 35 against English-to-Spanish at 12 on the same domain means the Japanese model or prompt needs work. COMET might not have separated them. The reviewer time spent did.
Post-edit diffs seed the golden set. Categorize edits (register fixes, glossary fixes, dropped-clause restorations, idiom naturalization) with a small classifier. Every fix is a free training pair: LLM draft = failing case, post-edited target = gold, diff = labeled failure mode.
Reviewer acceptance rate per pair is a model-selection signal. A translator accepting 95% on English-to-Spanish and 40% on English-to-Japanese is telling you the Japanese pair is the problem, not the reviewer. Surface acceptance per pair alongside COMET and rubric scores.
Stream the diff and TER next to the eval scores on the OTel span. The trace then carries quality scores, the gateway cost and latency headers, and the post-edit truth signal in one record. Cost-per-acceptable-translation becomes a one-query number.
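As a worked example of the last point, here is one way cost-per-acceptable-translation could be computed from trace records. The record fields (`cost_usd`, `ter`) and the definition of "acceptable" as TER at or below a cutoff are assumptions for illustration, not fields traceAI is known to emit under these names.

```python
def cost_per_acceptable(records, max_ter):
    """Total spend divided by the number of drafts needing little post-editing.

    records: dicts with hypothetical keys "cost_usd" and "ter".
    Returns None when no draft qualifies, to avoid dividing by zero.
    """
    total_cost = sum(r["cost_usd"] for r in records)
    acceptable = sum(1 for r in records if r["ter"] <= max_ter)
    if acceptable == 0:
        return None
    return total_cost / acceptable

# Hypothetical records for one language pair.
records = [
    {"cost_usd": 0.25, "ter": 5},
    {"cost_usd": 0.25, "ter": 40},
    {"cost_usd": 0.5, "ter": 10},
]
en_ja_cost = cost_per_acceptable(records, max_ter=15)
```

The rejected draft still counts toward the numerator, which is the point: a cheap model that produces many unusable drafts can cost more per shipped translation than an expensive one.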
Production patterns by use case
The five-rubric framework holds across translation workloads. Weights and operational shape vary.
Content localization (bulk, queued). Marketing copy, product strings, documentation translated through a queue with human post-editors. Fluency, idiom transfer, and domain term consistency dominate. Cultural register matters most for marketing where brand voice has to survive. Cost per source token is the dominant operations axis because volume compounds. Run COMET-22 plus the LLM-judge rubric on every batch; order the reviewer queue by lowest-scoring axis. Post-edit acceptance rate per reviewer per pair is the model-selection truth.
Customer support multilingual. Agent reads a ticket in the customer’s language, responds in the same, often with an English-speaking reviewer before send. Adequacy and cultural register dominate. Refusal preservation matters because the agent has tool access. Latency budgets are loose; the eval can run heavier judges. Watch the gateway’s x-prism-guardrail-triggered header per target language; a spike usually means the multilingual classifier baseline shifted under a model swap.
Real-time and multilingual RAG. Live chat or voice translation runs against a 200-500ms budget per chunk. Adequacy and fluency dominate; idiom handling is best-effort. Use COMET-Kiwi sampled (every 20th chunk) for streaming; the full rubric runs offline on captured chunks. Voice-side patterns are in multilingual voice AI testing. Multilingual RAG retrieves in one language and generates in another with a glossary attached. Groundedness, ContextAdherence, and the glossary rubric carry the weight. The common failure: generator paraphrases the retrieved chunk into the target language and silently drops a numeric clause. The chunking angle is in advanced chunking techniques for RAG.
Wire it into CI
Translation eval isn’t useful unless a regression blocks a release. The shape is the same as any LLM CI gate, with one twist: per-language-pair thresholds, not a single number.
```python
from fi.evals import Evaluator
from fi.evals.templates import (
    TranslationAccuracy, BleuScore,
    Groundedness, Completeness, FactualAccuracy,
    ContextAdherence, AnswerRefusal,
)
from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge
from fi.evals.providers import LiteLLMProvider

ev = Evaluator(fi_api_key="...", fi_secret_key="...")


def judge(name, criteria):
    """Wrap a grading criterion in a custom LLM judge."""
    return CustomLLMJudge(
        provider=LiteLLMProvider(),
        config={"name": name, "grading_criteria": criteria},
    )


fluency = judge("fluency_score",
    "Score 1-5 whether the target reads naturally to a native speaker. "
    "5 = indistinguishable from human writing. 1 = obviously awkward.")

idiom = judge("idiom_transfer",
    "Score 1-5 whether source idioms render as natural target-language "
    "equivalents. Mark 1 if any source idiom is translated word-for-word.")

register_judge = judge("cultural_register",
    "Score 1-5 whether formality, honorifics, and locale conventions match "
    "the source's social signal and stay consistent. Mark 1 if register "
    "slips mid-paragraph.")

# golden_set: list of dicts with "source" and optional "glossary"/"reference".
# model_targets: the model's translations, in the same order as golden_set.
result = ev.evaluate(
    eval_templates=[
        TranslationAccuracy(),  # holistic dashboard tile
        Groundedness(), Completeness(), FactualAccuracy(),
        ContextAdherence(),     # glossary adherence
        AnswerRefusal(),
        BleuScore(),            # backward-compat tile only
        fluency, idiom, register_judge,
    ],
    inputs=[
        {
            "input": ex["source"],
            "output": target,
            "context": ex.get("glossary", ""),
            "expected_text": ex.get("reference", ""),
        }
        for ex, target in zip(golden_set, model_targets)
    ],
)
```
The output is a BatchRunResult with per-input scores and reasons per rubric. Run COMET-22 in the same loop and merge by row. A reasonable starting threshold for English-to-Spanish on a support workload: TranslationAccuracy >= 4.0, fluency >= 4.3, idiom transfer >= 3.8, cultural register >= 4.0, glossary adherence >= 4.5, AnswerRefusal >= 0.95, COMET-22 >= 0.82. English-to-Japanese usually wants a lower idiom floor and a higher register floor. Tune from production, not from a universal table.
Block the PR if any rubric drops more than 2 absolute points from the trailing 7-day baseline, or falls below the floor. Surface the failure with the rubric, the pair, the cluster name from Error Feed, and the immediate_fix text the judge agent wrote. Not the aggregate. The wider release-gate pattern is in A/B testing LLM prompts.
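The gate logic itself is a few lines once scores are aggregated per language pair. The sketch below is independent of the SDK: the function name, the rubric keys, and the per-rubric `max_drop` mapping are assumptions, and the floors are the English-to-Spanish starting values quoted above.

```python
EN_ES_SUPPORT_FLOORS = {
    "TranslationAccuracy": 4.0,
    "fluency": 4.3,
    "idiom_transfer": 3.8,
    "cultural_register": 4.0,
    "glossary_adherence": 4.5,
    "AnswerRefusal": 0.95,
    "comet22": 0.82,
}

def gate_failures(scores, floors, baseline, max_drop):
    """Return one message per rubric that breaks its floor or drops too far.

    scores, baseline: mean score per rubric for one language pair.
    max_drop: allowed drop from baseline per rubric (hypothetical mapping;
    rubrics missing from it are only checked against their floor).
    """
    failures = []
    for rubric, floor in floors.items():
        score = scores[rubric]
        drop = baseline.get(rubric, score) - score
        if score < floor:
            failures.append(f"{rubric}: {score} is below floor {floor}")
        elif drop > max_drop.get(rubric, float("inf")):
            failures.append(f"{rubric}: dropped {drop:.2f} from baseline")
    return failures

# Hypothetical run: fluency regresses below its floor, everything else holds.
run_scores = dict(EN_ES_SUPPORT_FLOORS, fluency=4.1)
failures = gate_failures(run_scores, EN_ES_SUPPORT_FLOORS, baseline={}, max_drop={})
```

In CI, a non-empty list fails the job, and the messages name the rubric rather than an aggregate. Each language pair gets its own floors dict.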
Anti-patterns to avoid
Five recur. Each maps to a dimension the team isn’t measuring.
Single-metric evaluation. BLEU-only gates are the most common and the hardest to argue against because the number looks objective. BLEU misses fluency, idiom transfer, and dropped clauses, and it over-rewards literal renderings. If you ship one number to the dashboard, it’s COMET-22 (offline) or COMET-Kiwi (production), not BLEU.
Single-language-pair calibration. A judge calibrated on English-to-Spanish doesn’t transfer to English-to-Japanese. Run the kappa loop per pair, or you’re flying blind on the long tail.
No cultural register test. Without a register rubric, usted silently becomes tú mid-paragraph and a polite Japanese support reply lands in plain form. Reviewers catch this in spot checks. CI gates don’t, unless you measure it.
No refusal preservation test. Agents that translate the input, hand off to a downstream agent, and translate the response back can bypass safety boundaries. AnswerRefusal scored on the target output with a multilingual guardrail backend catches it. The wider prompt-injection angle is in prompt injection defense at the gateway.
No glossary attachment. Without a glossary in context and a glossary-adherence rubric, medical, legal, and financial jargon translates word-for-word. The translation reads fluent and means something subtly wrong.
Where Future AGI fits
A package, three layers.
- ai-evaluation SDK (Apache 2.0). from fi.evals import Evaluator. 60+ EvalTemplate classes including translation-specific TranslationAccuracy (eval_id=67), CulturalSensitivity (68), BleuScore (101), plus cross-cutting Groundedness (47), Completeness (10), FactualAccuracy (66), ContextAdherence (5), AnswerRefusal (88), and CustomLLMJudge for everything not in the catalog. Nine open-weight multilingual classifier backends (QWEN3GUARD, LLAMAGUARD_3, SHIELDGEMMA, and siblings), four distributed runners, multi-modal CustomLLMJudge via LiteLLM for OCR’d PDFs and product UI screenshots.
- Future AGI Platform. Self-improving evaluators that retune from human post-edit feedback. An in-product authoring agent that turns a natural-language brief into a deployable rubric. Classifier-backed evals at lower per-eval cost than Galileo Luna-2: the economics that make weekly full-golden-set reruns the default.
- Error Feed (inside the eval stack). HDBSCAN soft-clustering over ClickHouse embeddings groups failing translations into named issues. Typical clusters in customer data: “Spanish formal register slips to informal tú mid-paragraph,” “Japanese idiom rendered literally not naturalized,” “Medical jargon translated word-for-word loses canonical meaning,” “Refusal in English source becomes compliant answer in Mandarin target,” “Numeric clause silently dropped in long German legal source.” A Claude Sonnet 4.5 Judge agent writes the root cause and an immediate_fix per cluster. Linear ticketing today; Slack, GitHub, Jira, PagerDuty on the roadmap.
traceAI (Apache 2.0) is the tracing layer: 50+ AI surfaces across Python, TypeScript, Java, and C#. The hosted Agent Command Center sits in front of the translation provider and emits x-prism-cost, x-prism-latency-ms, x-prism-model-used, x-prism-fallback-used, and x-prism-guardrail-triggered headers per call. SOC 2 Type II, HIPAA, GDPR, CCPA certified. Eval-driven prompt optimization ships through the six optimizers in agent-opt; details in automated optimization for agents.
Ready to replace your BLEU gate? Install ai-evaluation and unbabel-comet, pull your last 200 production translations per pair, score them on TranslationAccuracy + fluency + idiom + register + glossary adherence + COMET-Kiwi this afternoon, then wire the same rubrics as EvalTag on live spans via traceAI tomorrow. The same rubric in CI and in production is what turns a translation eval into a regression suite that holds for the language pairs your customers actually use.