
Evaluating LLM Translation Quality in 2026

BLEU is dead for LLM translation. The 2026 stack: COMET + LLM-as-judge fluency/adequacy rubrics + per-language-pair calibration. With code and thresholds.


BLEU is dead for LLM translation. ROUGE didn’t help either. Vendor benchmark scores transferred from WMT or FLORES tell you almost nothing about whether the German enterprise customer is going to renew. The eval that matches modern translation is COMET (neural reference-based or reference-free) plus LLM-as-judge fluency and adequacy rubrics, calibrated against native-speaker labels on your own language pairs. This guide is the working stack: why the classic metrics fail, how COMET and COMET-Kiwi actually work, the rubric set that catches what COMET can’t see, the per-pair calibration step, and the production loop that uses post-edit distance as the strongest feedback signal you have.

The thesis

Translation is a multi-dimensional problem with domain-specific failure modes. The 2026 eval stack runs three layers on the same outputs:

  1. A learned neural metric (COMET) as the single-number regression signal. COMET-22 with references, COMET-Kiwi without. Correlates with human direct assessment far better than BLEU or chrF.
  2. An LLM-as-judge rubric on five quality dimensions COMET can’t separate: adequacy, fluency, idiom transfer, cultural register, domain term consistency. Scored independently. Calibrated against native-speaker labels per pair.
  3. A production loop with post-edit distance as the truth signal. The diff between the LLM draft and the published target is the score that actually predicts customer satisfaction.

BLEU and chrF still ship, as the lowest tile on the dashboard for backward compatibility. Never as the gate.

Step 1: kill BLEU as the gate

BLEU was designed by Papineni et al. in 2002 for statistical MT, where the model rendered the source closely and a single reference was authoritative. Those assumptions don’t hold for LLM translation. Three failure modes recur, and BLEU misses all three.

Paraphrase penalty. Given “Could you check on my order?”, an English-to-Spanish reference might be “¿Podrías revisar mi pedido?” An equally correct translation is “¿Puedes mirar mi orden?” BLEU punishes the second. A native Spanish reviewer doesn’t.

Idiom inversion. “Spill the beans” should become a Spanish idiom for revealing a secret, not “derramar los frijoles.” The literal target has higher n-gram overlap with a poorly written reference than the natural one. BLEU rewards the literal mistake.

Dropped clauses with high overlap. An LLM that translates a 40-token German legal sentence into a 30-token Spanish target by silently dropping a subordinate clause still matches most of the reference n-grams. BLEU never sees what’s missing.
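The paraphrase penalty is easy to reproduce by hand. The sketch below computes clipped n-gram precision, the overlap statistic BLEU is built from, on the Spanish example above. It is a simplified illustration, not full BLEU: it assumes whitespace tokenization and skips the brevity penalty and the geometric mean over n-gram orders. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference (counts clipped)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "¿Podrías revisar mi pedido?"
paraphrase = "¿Puedes mirar mi orden?"

p1_same = clipped_precision(reference, reference, 1)   # identical string
p1_para = clipped_precision(paraphrase, reference, 1)  # only "mi" overlaps
p2_para = clipped_precision(paraphrase, reference, 2)  # no shared bigram
```

Only one of the four paraphrase tokens matches the reference and no bigram does, so any overlap-based score collapses even though a native reviewer would accept both sentences.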

ROUGE was built for summarization recall, not translation. Some suites still ship a ROUGE column. Catches lexical drift. Nothing else. The WMT shared tasks moved off BLEU as the primary system ranking in 2017 and have used learned neural metrics since WMT22. The research community settled this. Production teams haven’t, because BLEU is in every template.

Step 2: COMET and COMET-Kiwi

COMET is a learned neural metric trained on human direct-assessment scores from the WMT shared tasks, fine-tuned over XLM-RoBERTa-large. Source, candidate, optionally reference go in; a score that correlates with human judgment far better than any n-gram metric comes out. Two variants matter in production.

COMET-22 (wmt22-comet-da). Reference-based. Highest correlation with human DA when a gold reference exists. Offline scorer against your golden set.

COMET-Kiwi (wmt22-cometkiwi-da). Reference-free. Scores source-to-target directly. The production scorer. You almost never have a gold reference on a live translation, only the source and the model output.

Both ship in unbabel-comet.

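from comet import download_model, load_from_checkpoint

# Reference-based: golden-set offline scoring.
# golden_pairs: iterable of (source, model_output, reference) string triples.
ref_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
ref_data = [{"src": s, "mt": t, "ref": r} for s, t, r in golden_pairs]
ref_out = ref_model.predict(ref_data, batch_size=8, gpus=0)  # gpus=0 runs on CPU; set gpus=1 for GPU
ref_scores = ref_out.scores  # per-segment scores; ref_out.system_score is their mean

# Reference-free: production-stream scoring.
# production_pairs: iterable of (source, model_output) string pairs.
# This checkpoint is gated on the Hugging Face Hub: accept its license and log in before downloading.
kiwi_model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))
kiwi_data = [{"src": s, "mt": t} for s, t in production_pairs]
kiwi_scores = kiwi_model.predict(kiwi_data, batch_size=8, gpus=0).scores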

COMET scores sit on a roughly 0-1 scale. The absolute number is not interpretable across pairs: a 0.85 on English-to-Spanish news is fine, a 0.85 on English-to-Arabic legal might be the floor. Calibrate per pair against your post-edit signal, not a universal cutoff. COMET-Kiwi is the streaming scorer (milliseconds per pair on GPU). The LLM-judge rubric is the next layer, where you need to know why a score moved.
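One way to turn "calibrate per pair against your post-edit signal" into a number is sketched below. It assumes you keep a history of (language pair, COMET-Kiwi score, post-edit TER) tuples, and it sets each pair's floor at the 10th percentile of scores among segments that reviewers shipped with little or no editing. The percentile, the `max_ter` cutoff, and the minimum sample size are illustrative assumptions, not values from this guide.

```python
from collections import defaultdict
from statistics import quantiles


def per_pair_floors(history, max_ter=10.0, min_samples=20):
    """Derive a COMET-Kiwi floor per language pair from lightly edited segments.

    history: iterable of (language_pair, kiwi_score, ter) tuples.
    """
    accepted = defaultdict(list)
    for pair, score, ter in history:
        if ter <= max_ter:  # reviewers barely touched this draft
            accepted[pair].append(score)

    floors = {}
    for pair, scores in accepted.items():
        if len(scores) < min_samples:
            continue  # too little evidence to set a floor for this pair
        floors[pair] = quantiles(scores, n=10)[0]  # first decile cut point
    return floors
```

Pairs with too little history get no floor at all, which is a deliberate choice here: a missing threshold is easier to notice than one borrowed from another language pair.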

Step 3: the LLM-as-judge rubric COMET can’t replace

COMET gives you a quality scalar. It can’t tell you whether the regression came from a register slip, a dropped clause, a literal idiom, or a medical-term mistranslation. The LLM-as-judge layer is the diagnostic the scalar lacks. Score five rubrics independently, 1-5 with one-sentence reasoning per score.

Adequacy. Does the target convey the source meaning faithfully? No dropped clauses, no hallucinated content, no inverted negation. The closest analog to a strict reference comparison, but it tolerates paraphrase.

Fluency. Does the target read naturally to a native speaker of the target locale? The dangerous failure mode is a fluent translation that’s inadequate. Reviewers don’t notice. Customers do.

Idiom transfer. Are figurative expressions rendered as natural target-language equivalents rather than literally? Run this rubric on a deliberately idiom-heavy subset of the golden set (10-20% of the pairs).

Cultural register. Is the formality level correct for the target locale, the relationship in the source, and the channel? Spanish formal usted versus informal tú, Japanese honorific levels, German Sie versus du, Korean speech levels: all of these have to stay consistent within a translation and match the source’s social signal.

Domain term consistency. Are legal, medical, financial, or product-specific terms mapped to their canonical target equivalents? Attach the glossary to the request, score adherence per term. A medical translation that renders “myocardial infarction” as a literal target-language phrase rather than the standard local clinical term has translated the words and lost the document.

Two more rubrics ride along as guardrails. Refusal preservation scores whether a refusal in the source language stays a refusal in the target. Agents that translate the request, hand off to a downstream agent, and translate the response back can quietly bypass safety boundaries that the source-language rubric never fires on. Hallucinated facts (FactualAccuracy) catches numbers, dates, or named entities introduced during translation.

Holistic single-score templates exist (TranslationAccuracy is one). Useful as dashboard tiles. Never as the gate, because they hide which dimension regressed.
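The five rubrics above can share a single judge call as long as each dimension gets its own score and reason. The sketch below builds such a prompt and validates the reply. The prompt wording, the JSON reply shape, and the function names are assumptions for illustration; the call to the judge model itself is left out.

```python
import json

# One question per dimension, paraphrased from the rubric descriptions above.
RUBRICS = {
    "adequacy": "Does the target convey the source meaning faithfully, with no "
                "dropped clauses, hallucinated content, or inverted negation?",
    "fluency": "Does the target read naturally to a native speaker of the target locale?",
    "idiom_transfer": "Are figurative expressions rendered as natural "
                      "target-language equivalents rather than literally?",
    "cultural_register": "Is the formality level correct for the locale, the "
                         "relationship, and the channel, and is it consistent?",
    "domain_terms": "Are domain terms mapped to the canonical equivalents in the glossary?",
}


def build_judge_prompt(source, target, src_lang, tgt_lang, glossary=""):
    """Assemble one prompt that asks for an independent 1-5 score per dimension."""
    lines = [
        f"You are evaluating a translation from {src_lang} to {tgt_lang}.",
        "Score each dimension independently from 1 (worst) to 5 (best).",
        "",
    ]
    lines += [f"- {name}: {question}" for name, question in RUBRICS.items()]
    lines += [
        "",
        f"Source:\n{source}",
        f"Target:\n{target}",
        f"Glossary:\n{glossary or '(none)'}",
        "",
        'Reply with JSON only: {"<dimension>": {"score": <1-5>, '
        '"reason": "<one sentence>"}, ...}',
    ]
    return "\n".join(lines)


def parse_verdict(raw):
    """Validate the judge reply; raises if a dimension is missing or out of range."""
    data = json.loads(raw)
    verdict = {}
    for name in RUBRICS:
        score = data[name]["score"]
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{name}: score must be an integer from 1 to 5, got {score!r}")
        verdict[name] = (score, str(data[name]["reason"]).strip())
    return verdict
```

Rejecting malformed replies instead of coercing them keeps a flaky judge from silently writing default scores into the dashboard.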

Step 4: per-language-pair calibration

An uncalibrated LLM judge is a confident random number generator. Translation makes this worse because judge models have asymmetric strength across pairs. GPT-class judges are strong on Western European pairs and weaker on Korean and Arabic. Claude-class judges have a different asymmetry. Pick a judge and verify it per pair.

Run the calibration loop once when you set the rubric up and quarterly after:

  1. Sample 50 source-target examples per language pair from production. Skew toward edge cases (long sources, mixed-language content, idiom-heavy text, contested register).
  2. Two native speakers score each pair on all five dimensions. 1-5 Likert with one-sentence reasoning. Resolve disagreements by discussion or a third annotator.
  3. Run the judge prompt against the same set. Same prompt, same judge model, same temperature you’ll use in production.
  4. Compute Cohen’s kappa per dimension per pair between judge scores and human consensus. For a sense of how hard judge-human agreement is, the G-Eval paper reported a Spearman correlation of 0.514 with human raters on SummEval using GPT-4 (a rank correlation, so not directly comparable to kappa). A 2026 translation judge calibrated on a native-speaker set should hit Cohen’s kappa of 0.6 or better on the dimensions your product cares about most. Below that, the judge model is wrong for the pair, or the rubric prompt needs sharpening, or both.
  5. Pin the judge model and rubric prompt as one versioned contract. Swapping the judge is an eval migration, not a config change. Track judge-human kappa over quarters; when it slides, the source distribution or the model has shifted.

Thirty minutes of native-speaker time per pair, quarterly, prevents months of arguing about whether a release regressed.
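The kappa step in the loop above needs no dependencies. The sketch below is a plain-Python Cohen's kappa for 1-5 scores, with an optional quadratic weighting that penalizes a 5-versus-1 disagreement more than a 5-versus-4 one. The weighting option is an addition for ordinal scales, not something the calibration loop requires, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(judge, human, labels=(1, 2, 3, 4, 5), weighted=False):
    """Cohen's kappa between two raters; weighted=True uses quadratic weights."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(judge)
    k = len(labels)
    index = {label: i for i, label in enumerate(labels)}

    observed = Counter((index[j], index[h]) for j, h in zip(judge, human))
    judge_marginal = Counter(index[j] for j in judge)
    human_marginal = Counter(index[h] for h in human)

    def disagreement(i, j):
        if weighted:
            return ((i - j) / (k - 1)) ** 2
        return 0.0 if i == j else 1.0

    # Observed disagreement versus the disagreement expected by chance.
    obs = sum(disagreement(i, j) * c for (i, j), c in observed.items()) / n
    exp = sum(
        disagreement(i, j) * judge_marginal[i] * human_marginal[j]
        for i in range(k) for j in range(k)
    ) / (n * n)
    if exp == 0:
        return 1.0  # both raters used one identical label; treated as full agreement
    return 1.0 - obs / exp


judge_scores = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
human_scores = [5, 4, 3, 3, 5, 2, 4, 4, 3, 4]
kappa = cohens_kappa(judge_scores, human_scores)
kappa_weighted = cohens_kappa(judge_scores, human_scores, weighted=True)
```

Run it once per dimension per language pair and store the result next to the judge model ID and rubric version, so a slide in agreement can be traced to a specific contract.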

Step 5: domain terminology in three layers

Most production translation pipelines carry a glossary: brand names, legal terms of art, medical canonical translations, do-not-translate lists. The standard failure is that the LLM uses the canonical target in one chunk and translates the same term word-for-word in the next.

  • Deterministic lookup. For every glossary entry, scan the source for the source term and the target for the canonical equivalent. Source-present + canonical-absent is a violation. String match, microseconds per translation, catches the easiest 80%.
  • ContextAdherence template. Attach the glossary as the context field on the request. ContextAdherence (eval_id=5) scores whether the translation honored the attached glossary or style guide. Catches uses of the canonical term in the wrong context.
  • CustomLLMJudge diagnostic. “Score 1-5 whether every glossary term in the source is rendered using the canonical target equivalent attached in the context. Score 1 if any term is translated word-for-word.” The reason field names the term that was missed.

Run in cascade. Deterministic first. Judge calls only on rows the deterministic layer passes.
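The deterministic layer fits in a few lines. This is a minimal sketch assuming the glossary is a plain mapping from source term to canonical target term and that case-insensitive substring matching is good enough for a first pass. It will miss inflected forms, which is one reason the judge layers exist. The function name is hypothetical.

```python
def glossary_violations(source, target, glossary):
    """Return (source_term, canonical_target) pairs where the source term is
    present in the source but the canonical equivalent is absent from the target."""
    src = source.casefold()
    tgt = target.casefold()
    return [
        (term, canonical)
        for term, canonical in glossary.items()
        if term.casefold() in src and canonical.casefold() not in tgt
    ]


glossary = {"myocardial infarction": "infarto de miocardio"}
violations = glossary_violations(
    "The patient had a myocardial infarction.",
    "El paciente tuvo un ataque al corazón.",
    glossary,
)
```

Rows with an empty result move on to the ContextAdherence and CustomLLMJudge layers; rows with violations already have their reason string without a judge call.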

Step 6: build the cascade

Translation matrices grow combinatorially. Language pairs times domains times source styles compounds fast. Running a frontier judge on every translation burns the eval budget before it scores anything useful. The cascade:

  • Deterministic first. Length spec, structured-field presence, glossary string match, named-entity preservation. Microseconds, zero API cost, every trace.
  • COMET-Kiwi second. Reference-free neural metric, milliseconds per pair on GPU. The streaming quality scalar; drift alarm fires on rolling mean per pair.
  • Multilingual classifier third. Toxicity (eval_id=15), PII, prompt injection, refusal preservation. Run in the target language, not just English. Use the nine open-weight backends in ai-evaluation (QWEN3GUARD_8B, LLAMAGUARD_3_8B, SHIELDGEMMA_2B and siblings). Sub-100ms.
  • LLM-as-judge fourth. The five quality rubrics. Sampled (5-10% of production traffic, 100% of CI). Pin the judge model, cache verdicts keyed on (judge_model_id, rubric_version, source_hash, target_hash, glossary_hash).
  • Frontier adjudication last. When the standard judge and a classifier disagree, or COMET-Kiwi flags but the judge passes, escalate to a stronger judge or a human label. The localization reviewer queue lives here.
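The ordering above can be sketched as a single function. This version wires the deterministic checks, the COMET-Kiwi score, the sampled judge with a verdict cache, and the escalation flag. It omits the classifier layer for brevity. It also assumes that rows COMET-Kiwi flags always go to the judge, and that the judge returns a dict with a `passed` field. All callables and field names are hypothetical stand-ins for your own scorers.

```python
import hashlib
import random


def _sha(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def verdict_cache_key(judge_model_id, rubric_version, source, target, glossary=""):
    """Key built from the five fields listed above; changing any one misses the cache."""
    parts = [judge_model_id, rubric_version, _sha(source), _sha(target), _sha(glossary)]
    return _sha("\x1f".join(parts))


def run_cascade(row, checks, kiwi_score, kiwi_floor, judge, cache,
                judge_model_id, rubric_version, sample_rate=0.05, rng=random.random):
    # Deterministic layer: runs on every row, no API cost.
    failed = [name for name, check in checks if not check(row)]
    if failed:
        return {"stage": "deterministic", "failed": failed}

    # Reference-free neural score.
    score = kiwi_score(row)
    flagged = score < kiwi_floor

    # Judge layer: flagged rows plus a random sample of the rest (assumption).
    if not flagged and rng() >= sample_rate:
        return {"stage": "kiwi", "kiwi": score}
    key = verdict_cache_key(judge_model_id, rubric_version,
                            row["source"], row["target"], row.get("glossary", ""))
    if key not in cache:
        cache[key] = judge(row)
    verdict = cache[key]

    # Adjudication: COMET-Kiwi flagged the row but the judge passed it.
    return {"stage": "judge", "kiwi": score, "verdict": verdict,
            "escalate": flagged and verdict["passed"]}
```

Because the cache key includes the judge model ID and rubric version, swapping either one re-scores everything, which matches treating a judge change as an eval migration.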

Step 7: post-edit distance is the truth signal

Localization pipelines almost always have humans in the loop. A post-editor takes the LLM draft, fixes the issues, ships the edited target. The diff between LLM draft and human-edited target is the strongest quality signal you have. It’s free. It’s real. It correlates with customer-facing quality better than any automated metric.

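Track translation edit rate (TER): the number of word-level edits (insertions, deletions, substitutions, and shifts) needed to turn the LLM draft into the edited target, normalized by the token length of the edited target. SacreBLEU ships it and reports it on a percentage scale. Compute per language pair per domain.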

import sacrebleu

ter = sacrebleu.metrics.TER()
draft = "El cliente solicita la cancelación del pedido."
edited = "El cliente solicitó cancelar el pedido."
ter_score = ter.sentence_score(draft, [edited]).score  # lower is better; 0 = no edits

Three patterns fall out immediately.

TER per pair points at model selection. English-to-Japanese TER of 35 against English-to-Spanish at 12 on the same domain means the Japanese model or prompt needs work. COMET might not have separated them. The reviewer time spent did.

Post-edit diffs seed the golden set. Categorize edits (register fixes, glossary fixes, dropped-clause restorations, idiom naturalization) with a small classifier. Every fix is a free training pair: LLM draft = failing case, post-edited target = gold, diff = labeled failure mode.

Reviewer acceptance rate per pair is a model-selection signal. A translator accepting 95% on English-to-Spanish and 40% on English-to-Japanese is telling you the Japanese pair is the problem, not the reviewer. Surface acceptance per pair alongside COMET and rubric scores.

Stream the diff and TER next to the eval scores on the OTel span. The trace then carries quality scores, the gateway cost and latency headers, and the post-edit truth signal in one record. Cost-per-acceptable-translation becomes a one-query number.
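Once TER and cost sit on the same record, the per-pair rollup is a short aggregation. The sketch below assumes each span has been flattened to a dict with `pair`, `ter`, and `cost_usd` keys, and it counts a translation as acceptable when its TER is at or below a threshold you choose. Both the field names and the threshold are illustrative assumptions, not a traceAI schema.

```python
from collections import defaultdict


def per_pair_summary(spans, acceptable_ter=10.0):
    """Mean TER, acceptance rate, and cost per acceptable translation by language pair."""
    buckets = defaultdict(list)
    for span in spans:
        buckets[span["pair"]].append(span)

    summary = {}
    for pair, rows in buckets.items():
        accepted = [r for r in rows if r["ter"] <= acceptable_ter]
        total_cost = sum(r["cost_usd"] for r in rows)
        summary[pair] = {
            "mean_ter": sum(r["ter"] for r in rows) / len(rows),
            "acceptance_rate": len(accepted) / len(rows),
            # Total spend divided by usable drafts; None when nothing was acceptable.
            "cost_per_acceptable": total_cost / len(accepted) if accepted else None,
        }
    return summary
```

Dividing total spend by accepted drafts, rather than by all drafts, is what makes a cheap model with a high rejection rate show up as expensive.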

Production patterns by use case

The five-rubric framework holds across translation workloads. Weights and operational shape vary.

Content localization (bulk, queued). Marketing copy, product strings, documentation translated through a queue with human post-editors. Fluency, idiom transfer, and domain term consistency dominate. Cultural register matters most for marketing where brand voice has to survive. Cost per source token is the dominant operations axis because volume compounds. Run COMET-22 plus the LLM-judge rubric on every batch; order the reviewer queue by lowest-scoring axis. Post-edit acceptance rate per reviewer per pair is the model-selection truth.

Customer support multilingual. Agent reads a ticket in the customer’s language, responds in the same, often with an English-speaking reviewer before send. Adequacy and cultural register dominate. Refusal preservation matters because the agent has tool access. Latency budgets are loose; the eval can run heavier judges. Watch the gateway’s x-prism-guardrail-triggered header per target language; a spike usually means the multilingual classifier baseline shifted under a model swap.

Real-time and multilingual RAG. Live chat or voice translation runs against a 200-500ms budget per chunk. Adequacy and fluency dominate; idiom handling is best-effort. Use COMET-Kiwi sampled (every 20th chunk) for streaming; the full rubric runs offline on captured chunks. Voice-side patterns are in multilingual voice AI testing. Multilingual RAG retrieves in one language and generates in another with a glossary attached. Groundedness, ContextAdherence, and the glossary rubric carry the weight. The common failure: generator paraphrases the retrieved chunk into the target language and silently drops a numeric clause. The chunking angle is in advanced chunking techniques for RAG.
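The dropped-numeric-clause failure is cheap to catch before any judge runs. The sketch below compares the numbers found in the source against those in the target, stripping `.` and `,` so that `1,250.00` and `1.250,00` count as the same value. It is a crude check under stated assumptions: it ignores numbers written out as words and digits in non-Western scripts, and the function names are hypothetical.

```python
import re
from collections import Counter

# A run of digits, optionally continued by "." or "," plus more digits.
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def numbers_in(text):
    """Multiset of numbers in the text with locale separators removed."""
    return Counter(re.sub(r"[.,]", "", match) for match in _NUMBER.findall(text))


def number_diff(source, target):
    """Numbers that disappeared from, or were introduced into, the translation."""
    src, tgt = numbers_in(source), numbers_in(target)
    return {
        "dropped": sorted((src - tgt).elements()),
        "introduced": sorted((tgt - src).elements()),
    }


diff = number_diff(
    "Pay 1,250.00 EUR within 30 days.",
    "Pague 1.250,00 EUR.",
)
```

Here the amount survives the locale change but the 30-day term is reported as dropped, which is the kind of row worth sending to the adequacy judge.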

Wire it into CI

Translation eval isn’t useful unless a regression blocks a release. The shape is the same as any LLM CI gate, with one twist: per-language-pair thresholds, not a single number.

from fi.evals import Evaluator
from fi.evals.templates import (
    TranslationAccuracy, BleuScore,
    Groundedness, Completeness, FactualAccuracy,
    ContextAdherence, AnswerRefusal,
)
from fi.evals.metrics.llm_as_judges.custom_judge import CustomLLMJudge
from fi.evals.providers import LiteLLMProvider

ev = Evaluator(fi_api_key="...", fi_secret_key="...")

def judge(name, criteria):
    return CustomLLMJudge(
        provider=LiteLLMProvider(),
        config={"name": name, "grading_criteria": criteria},
    )

fluency = judge("fluency_score",
    "Score 1-5 whether the target reads naturally to a native speaker. "
    "5 = indistinguishable from human writing. 1 = obviously awkward.")
idiom = judge("idiom_transfer",
    "Score 1-5 whether source idioms render as natural target-language "
    "equivalents. Mark 1 if any source idiom is translated word-for-word.")
register_judge = judge("cultural_register",
    "Score 1-5 whether formality, honorifics, and locale conventions match "
    "the source's social signal and stay consistent. Mark 1 if register "
    "slips mid-paragraph.")

result = ev.evaluate(
    eval_templates=[
        TranslationAccuracy(),  # holistic dashboard tile
        Groundedness(), Completeness(), FactualAccuracy(),
        ContextAdherence(),     # glossary adherence
        AnswerRefusal(),
        BleuScore(),            # backward-compat tile only
        fluency, idiom, register_judge,
    ],
    inputs=[
        {
            "input": ex["source"],
            "output": target,
            "context": ex.get("glossary", ""),
            "expected_text": ex.get("reference", ""),
        }
        for ex, target in zip(golden_set, model_targets)
    ],
)

The output is a BatchRunResult with per-input scores and reasons per rubric. Run COMET-22 in the same loop and merge by row. A reasonable starting threshold for English-to-Spanish on a support workload: TranslationAccuracy >= 4.0, fluency >= 4.3, idiom transfer >= 3.8, cultural register >= 4.0, glossary adherence >= 4.5, AnswerRefusal >= 0.95, COMET-22 >= 0.82. English-to-Japanese usually wants a lower idiom floor and a higher register floor. Tune from production, not from a universal table.

Block the PR if any rubric drops more than 2 absolute points from the trailing 7-day baseline, or falls below the floor. Surface the failure with the rubric, the pair, the cluster name from Error Feed, and the immediate_fix text the judge agent wrote. Not the aggregate. The wider release-gate pattern is in A/B testing LLM prompts.
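A per-pair gate along these lines might look like the following sketch. The floors are the English-to-Spanish starting values quoted above. The drop limit for the 1-5 rubrics mirrors the rule above, while the COMET-22 limit is a hypothetical placeholder because that score sits on a different scale. The function and variable names are illustrative, not part of the ai-evaluation SDK.

```python
# Floors per language pair; en-es values are the starting thresholds quoted above.
FLOORS = {
    "en-es": {
        "fluency": 4.3,
        "idiom_transfer": 3.8,
        "cultural_register": 4.0,
        "comet22": 0.82,
    },
}

# Largest tolerated drop from the trailing baseline, per rubric.
MAX_DROP = {
    "fluency": 2.0,
    "idiom_transfer": 2.0,
    "cultural_register": 2.0,
    "comet22": 0.02,  # hypothetical: COMET-22 is on a roughly 0-1 scale
}


def gate(pair, current, baseline, floors=FLOORS, max_drop=MAX_DROP):
    """Return one message per violated rule; an empty list means the PR may merge."""
    failures = []
    for rubric, floor in floors[pair].items():
        score = current[rubric]
        if score < floor:
            failures.append(f"{pair}/{rubric}: {score:.2f} is below the floor {floor:.2f}")
        drop = baseline.get(rubric, score) - score
        if rubric in max_drop and drop > max_drop[rubric]:
            failures.append(f"{pair}/{rubric}: dropped {drop:.2f} from the trailing baseline")
    return failures
```

Each message names the pair and the rubric, so the CI log points at the dimension that regressed instead of an aggregate.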

Anti-patterns to avoid

Five recur. Each maps to a dimension the team isn’t measuring.

Single-metric evaluation. BLEU-only gates are the most common and the hardest to argue against because the number looks objective. BLEU misses fluency, idiom transfer, dropped clauses, and over-rewards literal renders. If you ship one number to the dashboard, it’s COMET-22 (offline) or COMET-Kiwi (production), not BLEU.

Single-language-pair calibration. A judge calibrated on English-to-Spanish doesn’t transfer to English-to-Japanese. Run the kappa loop per pair, or you’re flying blind on the long tail.

No cultural register test. Without a register rubric, usted silently becomes tú mid-paragraph and a polite Japanese support reply lands in plain form. Reviewers catch this in spot checks. CI gates don’t, unless you measure it.

No refusal preservation test. Agents that translate the request, hand off to a downstream agent, and translate the response back can bypass safety boundaries. AnswerRefusal scored on the target output with a multilingual guardrail backend catches it. The wider prompt-injection angle is in prompt injection defense at the gateway.

No glossary attachment. Without a glossary in context and a glossary-adherence rubric, medical, legal, and financial jargon translates word-for-word. The translation reads fluent and means something subtly wrong.

Where Future AGI fits

A package, three layers.

  • ai-evaluation SDK (Apache 2.0). from fi.evals import Evaluator. 60+ EvalTemplate classes including translation-specific TranslationAccuracy (eval_id=67), CulturalSensitivity (68), BleuScore (101), plus cross-cutting Groundedness (47), Completeness (10), FactualAccuracy (66), ContextAdherence (5), AnswerRefusal (88), and CustomLLMJudge for everything not in the catalog. Nine open-weight multilingual classifier backends (QWEN3GUARD, LLAMAGUARD_3, SHIELDGEMMA, and siblings), four distributed runners, multi-modal CustomLLMJudge via LiteLLM for OCR’d PDFs and product UI screenshots.
  • Future AGI Platform. Self-improving evaluators that retune from human post-edit feedback. An in-product authoring agent that turns a natural-language brief into a deployable rubric. Classifier-backed evals at lower per-eval cost than Galileo Luna-2: the economics that make weekly full-golden-set reruns the default.
  • Error Feed (inside the eval stack). HDBSCAN soft-clustering over ClickHouse embeddings groups failing translations into named issues. Typical clusters in customer data: “Spanish formal register slips to informal mid-paragraph,” “Japanese idiom rendered literally not naturalized,” “Medical jargon translated word-for-word loses canonical meaning,” “Refusal in English source becomes compliant answer in Mandarin target,” “Numeric clause silently dropped in long German legal source.” A Claude Sonnet 4.5 Judge agent writes the root cause and an immediate_fix per cluster. Linear ticketing today; Slack, GitHub, Jira, PagerDuty roadmap.

traceAI (Apache 2.0) is the tracing layer: 50+ AI surfaces across Python, TypeScript, Java, and C#. The hosted Agent Command Center sits in front of the translation provider and emits x-prism-cost, x-prism-latency-ms, x-prism-model-used, x-prism-fallback-used, and x-prism-guardrail-triggered headers per call. SOC 2 Type II, HIPAA, GDPR, CCPA certified. Eval-driven prompt optimization ships through the six optimizers in agent-opt; details in automated optimization for agents.

Ready to replace your BLEU gate? Install ai-evaluation and unbabel-comet, pull your last 200 production translations per pair, score them on TranslationAccuracy + fluency + idiom + register + glossary adherence + COMET-Kiwi this afternoon, then wire the same rubrics as EvalTag on live spans via traceAI tomorrow. The same rubric in CI and in production is what turns a translation eval into a regression suite that holds for the language pairs your customers actually use.

Frequently asked questions

Why is BLEU the wrong primary metric for LLM translation?
BLEU scores n-gram overlap against a single reference translation. That made sense for the statistical MT systems it was built for, where models rendered the source closely and references were authoritative. LLM translation breaks both assumptions. The model paraphrases, restructures, and naturalizes idioms, so a faithful translation can share almost no n-grams with the reference and still be perfect. A literal mistranslation that happens to overlap n-grams can score higher. The WMT shared tasks moved off BLEU as the primary system ranking years ago for exactly this reason. Use BLEU as a cheap regression tile on a dashboard. Never as the gate that decides whether a translation ships.
What is COMET and why does it correlate better than BLEU?
COMET is a learned neural metric trained on human direct-assessment scores from the WMT shared tasks. It takes the source, the candidate translation, and optionally a reference, encodes them with a multilingual transformer (XLM-RoBERTa underneath), and predicts a quality score that correlates with human ratings far better than n-gram metrics. The reference-based variant (COMET-22, wmt22-comet-da) is the standard offline scorer. COMET-Kiwi (wmt22-cometkiwi-da) is the reference-free variant that scores source-to-target directly, which matters in production because you usually don't have a gold reference. Both ship in the unbabel-comet package.
Why isn't COMET enough on its own?
COMET is a single scalar trained on aggregated direct-assessment scores. It predicts overall human quality reasonably well, but it can't tell you why a translation failed. Translations that dropped a numeric clause, slipped register, mistranslated a medical term, or rendered an idiom literally can all produce the same middling COMET score. Production teams need the diagnostic dimensions LLM-as-judge gives you (adequacy, fluency, idiom transfer, register, domain terminology) running alongside COMET. COMET catches the overall regression. The rubric tells you which dimension caused it.
How big should a per-language-pair golden set be?
Start at 100 to 200 examples per direction (English-to-Japanese is one pair, Japanese-to-English is another). Stratify by source length, register, and domain because failure modes cluster on all three. Native speakers label one reference target and an acceptable-target set. Pair each example with a COMET-22 reference score so you have a learned-metric anchor next to the LLM-judge rubric. Beyond 300 to 500 per direction, judge cost dominates and sampling beats more data. Grow the set weekly by promoting failing production traces.
How do you calibrate an LLM judge for translation quality?
Same protocol as G-Eval for summarization. Sample 50 source-target pairs per language pair from production, have two native speakers score each on adequacy, fluency, idiom transfer, cultural register, and domain term consistency using a 1-5 Likert. Resolve disagreements. Run the same judge prompt against the set. Compute Cohen's kappa between judge and human consensus per dimension. Anything below 0.6 means the rubric prompt needs sharpening, the judge model is wrong for the pair, or both. Pin judge model plus rubric prompt as one versioned contract. Re-calibrate quarterly and any time you swap models.
How does post-edit distance close the loop in production?
Localization pipelines almost always have human post-editors fixing LLM drafts before publish. The diff between the LLM draft and the human-edited final is the strongest signal you have. Track translation edit rate (TER, the Levenshtein-style edit distance normalized by target length) per language pair and per domain. Stream the post-edit diff next to the LLM-judge scores on the trace. When TER spikes for a pair, you know reviewers are doing more work even if COMET and the judge scores didn't move. The post-edit signal is also free labels: every edited segment becomes a training pair for the next golden set update.
What does Future AGI ship for translation evaluation?
The ai-evaluation SDK (Apache 2.0) carries the dedicated templates (TranslationAccuracy as eval_id 67, CulturalSensitivity as 68, BleuScore as 101) plus the cross-cutting ones (Groundedness 47, Completeness 10, FactualAccuracy 66, ContextAdherence 5, AnswerRefusal 88). CustomLLMJudge slots in for the rubrics that aren't in the catalog (fluency, idiom transfer, domain terminology). The Future AGI Platform adds self-improving evaluators that retune from human post-edit feedback at lower per-eval cost than Galileo Luna-2. Error Feed clusters failing translations into named issues (register slips, dropped clauses, refusal bypass) with an immediate_fix per cluster. Agent Command Center sits in front of the translation provider and emits x-prism-cost, x-prism-latency-ms, x-prism-model-used, and x-prism-fallback-used headers per call.