Multilingual LLM Evaluation: A 2026 Playbook for Non-English

Ship LLM eval that holds up outside English. The 7 multilingual challenges, the 5-step rollout, classifier ensembles per language, and how Future AGI grounds the loop.

Most LLM eval frameworks were built on English data, judged by English-tuned LLMs, and benchmarked against English-language safety sets. When teams ship that same stack to a Hindi, Japanese, or Arabic surface, they ship a bilingual-quality product: an English half that scores well and a non-English half that scores well only because the rubric cannot see what is broken. Multilingual evaluation is the under-served axis of the whole LLM eval discipline in 2026. This playbook covers the seven structural challenges, the five-step rollout, and how the Future AGI stack is wired so the non-English half of your product is not a degraded clone of the English one.

TL;DR: the multilingual eval loop

  1. Stratify the golden set by language and script. Do not lump all CJK together. Do not assume Spanish and Portuguese behave the same. Hire native annotators per language and enforce kappa above 0.7 per language.
  2. Pick a per-language ensemble of classifier backends. Qwen3Guard, ShieldGemma, WildGuard, Granite Guardian, and LlamaGuard variants each have a different language coverage profile. Combine them with a weighted aggregation.
  3. Author rubrics in the target language. Use CustomLLMJudge with grading_criteria written natively, not translated from English.
  4. Tag every trace with its locale. Custom attributes on traceAI spans (llm.input.language, llm.output.language, session.language) make per-language slicing trivial downstream.
  5. Cluster failures per language via Error Feed. HDBSCAN plus a Claude Sonnet 4.5 cluster labeler surfaces patterns like “Japanese formal register slips mid-paragraph” rather than dumping raw failed traces.

The rest of this guide is the working pattern behind each of those steps.

Why English-only eval ships a degraded non-English product

The English bias is not one bug. It is seven structural gaps that compound.

Safety classifier coverage is uneven. LlamaGuard 3 8B and 1B are strong on English unsafe-content categories and noticeably weaker on Hindi, Tamil, and Arabic. Qwen3Guard variants are strong across Mandarin, Japanese, Korean, Hindi, and Arabic, weaker on some European languages. ShieldGemma 2B is broad but not deep. No single classifier covers every language well. A deployment that picks one backend ships a per-language safety floor that varies by 20 to 40 points in measured recall, depending on the language.
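One way to make that per-language floor visible is to compute recall separately for each language on your own labeled set. The following is a minimal sketch, assuming each record is a hypothetical `(language, is_unsafe, flagged)` tuple; the record layout, the function name, and the toy data are illustrative and not part of any Future AGI API.

```python
from collections import defaultdict

def recall_by_language(records):
    """Return {language: recall} over (language, is_unsafe, flagged) records.

    Recall = flagged unsafe cases / all unsafe cases, computed per language.
    Languages with no unsafe cases are omitted because recall is undefined.
    """
    unsafe = defaultdict(int)
    caught = defaultdict(int)
    for language, is_unsafe, flagged in records:
        if is_unsafe:
            unsafe[language] += 1
            if flagged:
                caught[language] += 1
    return {lang: caught[lang] / unsafe[lang] for lang in unsafe}

# Toy data, made up for illustration only.
records = [
    ("en", True, True), ("en", True, True), ("en", True, True), ("en", True, False),
    ("hi", True, True), ("hi", True, False), ("hi", True, False), ("hi", True, False),
    ("hi", False, False),  # benign case, ignored by recall
]
per_language = recall_by_language(records)
```

Running the same function once per backend gives a language-by-backend recall grid, which is the evidence you need before choosing an anchor for each language.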

Judge models carry an English idiom prior. An LLM-judge trained predominantly on English prefers responses that read fluently in English. The same judge applied to Japanese will under-rate concise polite outputs because they look short by English standards. It will over-rate verbose Arabic outputs because they look thorough. The bias is real and quantifiable per language, and the only fix is to either swap to a judge with strong target-language coverage or rewrite the rubric and few-shots in the target language.

Rubric translation does not preserve intent. Translating an English rubric into Japanese preserves the words and loses the register distinction the rubric was meant to enforce. A rubric that says “polite tone” in English maps to several distinct Japanese politeness levels (teineigo, sonkeigo, kenjougo). The translated rubric collapses them. A judge scoring against the translated rubric cannot distinguish a response that mis-mixes keigo levels from one that gets it right.

Annotator agreement is per-language. A Cohen’s kappa of 0.8 on English labels says nothing about the Mandarin labels in the same set. If the Mandarin labels were translated from English guidance by non-native annotators, the per-language kappa is often below 0.5 even when the headline number looks fine. Below 0.7 the labels are noise, and every downstream calibration runs against noise.
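The per-language check is cheap to run. Here is a minimal sketch of Cohen's kappa computed per language for two annotators, assuming rows of `(language, annotator_a_label, annotator_b_label)`; the row layout and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same class.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators used one identical label
    return (observed - expected) / (1 - expected)

def kappa_by_language(rows):
    """rows: iterable of (language, annotator_a_label, annotator_b_label)."""
    grouped = defaultdict(lambda: ([], []))
    for language, a, b in rows:
        grouped[language][0].append(a)
        grouped[language][1].append(b)
    return {lang: cohens_kappa(a, b) for lang, (a, b) in grouped.items()}

# Toy labels, made up for illustration only.
rows = [
    ("en", "pass", "pass"), ("en", "pass", "pass"), ("en", "fail", "fail"), ("en", "fail", "fail"),
    ("zh", "pass", "pass"), ("zh", "pass", "fail"), ("zh", "fail", "pass"), ("zh", "fail", "fail"),
]
kappas = kappa_by_language(rows)
below_floor = [lang for lang, k in kappas.items() if k < 0.7]
```

In this toy set the English labels agree perfectly while the Mandarin labels agree only at chance level, so a single pooled kappa would hide exactly the language that needs its guidance fixed.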

Tokenization costs vary by language. Under most modern tokenizers, Mandarin sits at roughly 2x English at equivalent semantic content. Hindi runs 3 to 4x. Arabic, Tamil, and Korean are also high. The per-eval cost on those languages is correspondingly higher. A Hindi golden set with the same case count as the English one costs roughly three to four times as much to score. Teams that do not budget per language hit overruns at scale.

Idiom, formality, and register are language-specific. Spanish has the informal-versus-formal tú and usted split, which production agents in Mexico, Spain, and Argentina handle differently. Japanese keigo has the three levels above plus context-dependent switches. Arabic deployments cross dialect boundaries (Modern Standard Arabic, Egyptian, Gulf, Maghrebi) where the same lexical form carries different connotations. An eval rubric that does not name the register dimension explicitly leaves the register failure invisible.

Refusal calibration drifts under translation. A query that the agent refuses in English is sometimes complied with when translated and presented in the target language, because the model’s safety boundaries were calibrated mostly on English prompts. The reverse also happens: a benign Japanese query is mis-classified as unsafe because the safety classifier flags a phrase the translator chose. Refusal calibration tests have to run per language, not as an English suite plus translation.
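A paired test makes both leak directions countable. Below is a minimal sketch, assuming each pair holds the refusal outcome for an English prompt and its target-language counterpart; the dictionary keys (`id`, `language`, `refused_en`, `refused_target`) and the sample results are hypothetical, and how you decide that a response counts as a refusal is left to your own rubric.

```python
def refusal_leaks(pairs):
    """Split paired results into the two leak directions.

    pairs: iterable of dicts with hypothetical keys
    'id', 'language', 'refused_en', 'refused_target'.
    """
    under_refusal = []  # refused in English, complied in the target language
    over_refusal = []   # answered in English, refused in the target language
    for pair in pairs:
        if pair["refused_en"] and not pair["refused_target"]:
            under_refusal.append((pair["language"], pair["id"]))
        elif pair["refused_target"] and not pair["refused_en"]:
            over_refusal.append((pair["language"], pair["id"]))
    return under_refusal, over_refusal

# Toy results, made up for illustration only.
pairs = [
    {"id": "r1", "language": "hi", "refused_en": True, "refused_target": False},
    {"id": "r2", "language": "ja", "refused_en": False, "refused_target": True},
    {"id": "r3", "language": "ja", "refused_en": True, "refused_target": True},
]
under, over = refusal_leaks(pairs)
```

Keeping the two lists separate matters because they call for different fixes: under-refusal is a safety gap, while over-refusal is a usability regression.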

Each gap is small in isolation. Stacked, they produce a non-English experience that scores well on the eval dashboard and badly on the user surface. The goal of a multilingual eval program is to surface each gap explicitly and instrument it.

The seven challenges in one table

| Challenge | What breaks | How to instrument |
| --- | --- | --- |
| Safety classifier coverage | Per-language recall varies 20-40 points across backends | Per-language ensemble: Qwen3Guard plus ShieldGemma plus LlamaGuard with weighted aggregation |
| Judge English-idiom bias | Concise non-English outputs under-rated, verbose outputs over-rated | CustomLLMJudge with target-language grading_criteria, native few-shots, judge model with strong target-language coverage |
| Rubric translation | Register distinctions collapse, intent is lost | Author rubrics natively, do not translate from English |
| Annotator agreement | Per-language kappa hides under English-aggregate kappa | Track kappa per language per rubric, hire native annotators, refresh guidance per language |
| Tokenization cost | Mandarin 2x, Hindi 3-4x English token count | Per-language gateway budgets, per-language eval cost dashboards |
| Idiom, formality, register | Register failures invisible to translated rubrics | Custom rubrics like FormalityCorrectness, CulturalRegisterAdherence, IdiomTransferQuality |
| Refusal calibration | Refusals leak across languages in both directions | Per-language refusal test suite, RefusalPreservation rubric, paired English plus target-language prompts |

Per-language classifier ensembles

The single biggest lever on multilingual safety is moving from one classifier backend to a per-language ensemble. The Future AGI Guardrails layer ships nine open-weight backends configurable per route:

  • QWEN3GUARD_8B, QWEN3GUARD_4B, QWEN3GUARD_0_6B. Strongest multilingual line in the menu. Mandarin, Japanese, Korean, Hindi, and Arabic coverage. The 0.6B sits at edge-inference latency for low-stakes routes.
  • LLAMAGUARD_3_8B, LLAMAGUARD_3_1B. Meta’s line. Broad but with a noticeable English bias. Strong on canonical unsafe-content categories.
  • SHIELDGEMMA_2B. Google’s broad-multilingual classifier. Useful as a third vote in an ensemble.
  • WILDGUARD_7B. Allen Institute. Multilingual coverage with a different label taxonomy from the others, which is what makes it useful as a complementary vote.
  • GRANITE_GUARDIAN_8B, GRANITE_GUARDIAN_5B. IBM’s multilingual line. Good European-language coverage.

The pattern that works is a weighted ensemble. For a Mandarin route:

from fi.evals import Guardrails, RailType, AggregationStrategy
from fi.evals.types import LocalEvalBackend as Backend

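mandarin_output_rail = Guardrails(
    rail_type=RailType.OUTPUT,
    aggregation=AggregationStrategy.WEIGHTED,
    backends=[
        Backend.QWEN3GUARD_8B,   # anchor vote for Mandarin
        Backend.SHIELDGEMMA_2B,  # broad complementary vote
        Backend.WILDGUARD_7B,    # complementary vote with a different label taxonomy
    ],
    # Anchor weighted at 0.5; the three weights sum to 1.0.
    weights={"QWEN3GUARD_8B": 0.5, "SHIELDGEMMA_2B": 0.3, "WILDGUARD_7B": 0.2},
)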

A Hindi route uses the same shape with a different weighting. A French route weights Granite Guardian and LlamaGuard more heavily. The aggregation surface accepts WEIGHTED, MAJORITY, ANY_FAIL, and ALL_FAIL strategies, so the route owner picks the calibration. The Future AGI guardrail metrics guide covers how to measure the ensemble’s per-language recall against your own labeled set.

The deployment rule of thumb is straightforward: pick the strongest backend per language as the anchor vote, add one or two complementary backends so a label-taxonomy gap is covered, and weight the anchor at 0.5 to 0.6. Re-run the calibration whenever a backend gets a new release.
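To see why the anchor weight matters, it helps to work the arithmetic by hand. This is a minimal sketch of one plausible weighted vote, assuming each backend returns an unsafe probability between 0 and 1; it illustrates the idea only and is not a description of how the Guardrails WEIGHTED strategy is implemented. The scores and the 0.5 decision threshold are hypothetical.

```python
def weighted_unsafe_score(scores, weights):
    """Combine per-backend unsafe probabilities into one weighted score.

    scores and weights are dicts keyed by backend name. Weights are
    normalized, so they do not have to sum to exactly 1.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for backends: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * w for name, w in weights.items()) / total

# Weights from the Mandarin route above; scores are made up for illustration.
weights = {"QWEN3GUARD_8B": 0.5, "SHIELDGEMMA_2B": 0.3, "WILDGUARD_7B": 0.2}
scores = {"QWEN3GUARD_8B": 0.9, "SHIELDGEMMA_2B": 0.4, "WILDGUARD_7B": 0.2}

combined = weighted_unsafe_score(scores, weights)  # 0.45 + 0.12 + 0.04
blocked = combined >= 0.5  # hypothetical decision threshold
```

Here the anchor alone contributes 0.45 of the combined 0.61, so a confident anchor can carry the decision even when both complementary backends lean safe. That is the behavior to verify against your labeled set before trusting a weighting.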

Authoring rubrics natively rather than translating

Translated rubrics are the second-largest source of multilingual eval drift. The Future AGI CustomLLMJudge surface accepts grading_criteria in any language and pairs it with a judge model that handles that language. For Japanese formal register:

from fi.evals import CustomLLMJudge

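japanese_register_judge = CustomLLMJudge(
    name="JapaneseFormalRegister",
    # English gloss of the criteria below, for reviewers who do not read Japanese:
    #   Evaluate whether the response keeps a consistent keigo level.
    #   Fail if teineigo, sonkeigo, and kenjougo levels are mixed mid-paragraph.
    #   Also fail if casual expressions slip into a formal context.
    grading_criteria="""
    応答が一貫した敬語レベルを保っているかを評価してください。
    丁寧語、尊敬語、謙譲語のレベルが段落の途中で混在している場合は不合格とします。
    カジュアルな表現がフォーマルな文脈に紛れ込んでいる場合も不合格です。
    """,
    rating_scale="pass_fail",
    model="claude-sonnet-4-5",
)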

The same pattern works for Spanish formal-versus-informal register, Arabic dialect consistency, Mandarin classical-versus-vernacular register, and so on. The rubric is authored once, reviewed by a native annotator on the rubric text itself (not the cases), and then frozen as a versioned artifact alongside the English rubrics.

A small set of native rubrics worth shipping by default on any multilingual deployment:

  • IdiomTransferQuality. Did the translation preserve the idiomatic intent rather than render literally?
  • CulturalRegisterAdherence. Did the response match the cultural register expected for the user’s locale?
  • RefusalPreservation. Did a refusal in the source language remain a refusal in the target language?
  • FormalityCorrectness. Did the response hit the expected formality level for the locale (tú vs usted, keigo level, dialect choice)?

Each ships as a versioned CustomLLMJudge artifact. The LLM judge prompt engineering guide covers the few-shot pattern that lifts judge agreement on each of these.

Tagging every trace with its locale

The traceAI SDK supports custom attributes on spans, so every LLM call carries the locale metadata that makes per-language slicing trivial downstream:

from fi.integrations.otel import register, FITracer

tracer = FITracer(register(project_name="multilingual-prod"))

def handle_turn(user_input, detected_language):
    # call_llm and detect_language are application-supplied helpers, not part of traceAI.
    with tracer.start_as_current_span("llm_call") as span:
        # Tag the input side before the call so the locale is recorded even if the call fails.
        span.set_attribute("llm.input.language", detected_language)
        span.set_attribute("session.language", detected_language)
        response = call_llm(user_input)
        # Detect the output language separately; it can differ from the input language.
        span.set_attribute("llm.output.language", detect_language(response))
        return response

Once every span carries its locale, the rest of the Future AGI Platform inherits it. Dashboards filter by language. Error Feed clusters per language. Self-improving evaluators retune thresholds per language. The Platform’s session-level views show the language distribution across a multi-turn conversation, which is what catches code-switching failures that single-turn instrumentation misses. The traceAI instrumentation guide covers the full custom-attribute surface.
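The same attributes are enough to spot code-switching in your own analysis code. Here is a minimal sketch, assuming the spans for one session have been exported as a list of attribute dictionaries in turn order; the export step, the function name, and the sample session are hypothetical, and only the attribute keys come from the handler above.

```python
from collections import Counter

def session_language_profile(spans):
    """Summarize the languages seen across one session's spans.

    spans: list of attribute dicts carrying the 'llm.input.language' and
    'llm.output.language' keys, ordered by turn.
    """
    outputs = Counter(span["llm.output.language"] for span in spans)
    # A turn is mismatched when the model answered in a different language
    # from the one the user wrote in.
    mismatched_turns = [
        index for index, span in enumerate(spans)
        if span["llm.input.language"] != span["llm.output.language"]
    ]
    return {"output_languages": dict(outputs), "mismatched_turns": mismatched_turns}

# Toy session, made up for illustration only.
session = [
    {"llm.input.language": "ja", "llm.output.language": "ja"},
    {"llm.input.language": "ja", "llm.output.language": "en"},
    {"llm.input.language": "ja", "llm.output.language": "ja"},
]
profile = session_language_profile(session)
```

Turn 1 in the toy session is the kind of failure that looks fine in isolation (a fluent English answer) and only shows up when the session is viewed as a whole.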

Parallelizing the per-language matrix

A multilingual eval matrix grows fast. Five top languages times five routes times eight rubrics times 500 cases is 100,000 evaluator calls. Running them sequentially on a single Python process is a day-long batch. The Future AGI evaluation engine ships four distributed runners that parallelize the matrix:

  • ConcurrentEvaluator. Thread-pool for IO-bound evaluators (API calls to hosted judges).
  • BatchEvaluator. Batched submission to the platform’s eval queue.
  • AsyncEvaluator. Async runner for streaming workloads.
  • PipelineEvaluator. Ordered pipelines where a downstream rubric reads an upstream rubric’s score.

A typical multilingual run uses BatchEvaluator for the bulk of the matrix and ConcurrentEvaluator for the IO-bound subset that calls hosted judges. The whole 100,000-call run completes in tens of minutes, not a day. The open-source ai-evaluation library writeup covers the runner selection logic.
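The thread-pool idea behind the IO-bound case can be sketched with the standard library alone. The code below is a stand-in that uses `concurrent.futures`, not the ConcurrentEvaluator API; `run_matrix`, `stub_judge`, and the sample inputs are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_matrix(languages, routes, rubrics, cases, evaluate, max_workers=32):
    """Score every (language, route, rubric, case) cell on a thread pool.

    evaluate is any callable taking (language, route, rubric, case) and
    returning a score. Threads help when it mostly waits on network IO.
    """
    cells = list(product(languages, routes, rubrics, cases))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores line up with cells.
        scores = list(pool.map(lambda cell: evaluate(*cell), cells))
    return dict(zip(cells, scores))

def stub_judge(language, route, rubric, case):
    """Placeholder for a hosted-judge call; always passes."""
    return 1.0

# The matrix from the text: 5 languages x 5 routes x 8 rubrics x 500 cases.
full_matrix_calls = 5 * 5 * 8 * 500

results = run_matrix(
    languages=["ja", "hi"],
    routes=["support"],
    rubrics=["FormalityCorrectness", "RefusalPreservation"],
    cases=["case-1", "case-2", "case-3"],
    evaluate=stub_judge,
)
```

Keying results by the full cell tuple keeps the language dimension attached to every score, so per-language aggregation afterwards is a simple group-by.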

Error Feed: clustering per-language failures

Once the matrix runs, the question is which failures matter. Error Feed clusters failed traces with HDBSCAN and labels each cluster with a Claude Sonnet 4.5 judge that writes a short cluster summary plus an immediate_fix field. On multilingual deployments, the cluster shapes that surface are language-specific:

  • “Japanese formal register slips to informal mid-paragraph.” The judge writes an immediate fix that says to add an explicit keigo-level system instruction and a JapaneseFormalRegister rubric to the CI gate.
  • “Arabic right-to-left rendering breaks structured outputs.” The fix points at the structured-output formatter ignoring RTL Unicode markers.
  • “Mandarin idiomatic phrases mis-translated literally.” The fix points at the translation pipeline and proposes an IdiomTransferQuality threshold tightening.
  • “Spanish (Mexico) responses use usted when the user opened with tú.” The fix points at the FormalityCorrectness rubric and proposes a few-shot update.

Each cluster is a unit of work an engineer can act on. The self-improving agent pipeline writeup covers how clusters feed back into the next training-and-eval cycle.

Per-language gateway budgets

Tokenization variance turns into real cost variance on production. The Agent Command Center gateway carries five-level hierarchical budgets (org, team, project, route, key). For a multilingual deployment, the route level is where the per-language budget lives:

route_budget = {
    "route": "support-agent-jp",
    "language": "ja",
    "monthly_token_budget": 12_000_000,
    "alert_threshold": 0.8,
}

A Hindi route with the same case volume as the English one carries roughly 3-4x the budget. A Mandarin route carries 2x. The gateway enforces the budget and the dashboards surface per-language overruns before the line item drifts. The AI gateway cost optimization guide covers the hierarchical surface in detail.
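Deriving each route budget from the English baseline keeps the multipliers in one place. The following is a minimal sketch, assuming the multipliers quoted in this guide and a hypothetical English baseline of 3,000,000 tokens per month; the helper names and the route name are illustrative, and the alert check is plain arithmetic rather than a call into the gateway.

```python
# Token multipliers relative to English, using the ranges quoted in this guide.
# Hindi takes the upper end of its 3-4x range to stay conservative.
TOKEN_MULTIPLIER = {"en": 1.0, "zh": 2.0, "hi": 4.0}

def scaled_budget(english_monthly_tokens, language):
    """Scale an English token budget by the language's token multiplier."""
    return int(english_monthly_tokens * TOKEN_MULTIPLIER[language])

def should_alert(tokens_used, budget):
    """True once usage crosses the budget's alert_threshold fraction."""
    return tokens_used >= budget["monthly_token_budget"] * budget["alert_threshold"]

hindi_budget = {
    "route": "support-agent-hi",  # hypothetical route name
    "language": "hi",
    "monthly_token_budget": scaled_budget(3_000_000, "hi"),
    "alert_threshold": 0.8,
}
```

Measure the real multiplier for your own tokenizer and traffic before fixing these numbers, since the ratios shift with the model and the content mix.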

The 5-step multilingual rollout

The full rollout sequence we recommend for a team going from English-only to multilingual production:

Step 1: stratify the golden set by language and script. Sample production traces (or seed scenarios) per top language. Do not lump CJK together. Do not collapse RTL languages. The stratification keys are language, script direction, register expectation, and dialect where relevant. Aim for 200 to 500 cases per language per route as a floor, weighted toward the hardest 10 percent of failures observed so far.
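The stratification itself is a few lines of code. This is a minimal sketch, assuming each trace is a dictionary that carries the four stratification keys named above; the key names, the function, and the toy traces are hypothetical, and the sketch does not implement the weighting toward the hardest failures.

```python
import random
from collections import defaultdict

STRATUM_KEYS = ("language", "script_direction", "register", "dialect")

def stratified_sample(traces, per_stratum, seed=0):
    """Sample up to per_stratum traces from each stratum.

    Returns (sample, short), where short maps every stratum that could not
    fill its quota to the number of traces it actually had.
    """
    strata = defaultdict(list)
    for trace in traces:
        # Missing keys (for example no dialect) become None in the stratum key.
        strata[tuple(trace.get(key) for key in STRATUM_KEYS)].append(trace)
    rng = random.Random(seed)  # fixed seed keeps the golden set reproducible
    sample, short = [], {}
    for key, items in strata.items():
        if len(items) < per_stratum:
            short[key] = len(items)
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample, short

# Toy traces, made up for illustration only.
traces = (
    [{"language": "ja", "script_direction": "ltr", "register": "formal", "id": i}
     for i in range(5)]
    + [{"language": "ar", "script_direction": "rtl", "register": "formal",
        "dialect": "gulf", "id": i} for i in range(2)]
)
golden, short = stratified_sample(traces, per_stratum=3)
```

The `short` report is the useful part in practice: it names the strata where you need to collect or author more cases before the per-language floor is met.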

Step 2: hire native-speaker annotators per top language. Translators reading English guidance are not the same as native annotators reading native guidance. Enforce Cohen’s kappa above 0.7 per language per rubric. When a language drops below 0.7, fix the rubric guidance in that language before adding more cases.

Step 3: pick a per-language classifier ensemble. Use the menu above. Anchor on Qwen3Guard for CJK and South Asian languages, anchor on Granite Guardian for European languages, anchor on LlamaGuard for English. Add one or two complementary backends. Weighted aggregation, anchor at 0.5 to 0.6.

Step 4: author rubrics in the target language. Native rubric authoring on the four-rubric multilingual default set (IdiomTransferQuality, CulturalRegisterAdherence, RefusalPreservation, FormalityCorrectness) plus your domain-specific custom judges. Review the rubric text with a native annotator before scoring any cases.

Step 5: monitor per-language drift via Error Feed and retune. Cluster production failures per language. Triage each cluster through the immediate_fix the judge writes. Promote validated fixes into the next golden-set refresh. Let the Platform self-improving evaluators retune per-language thresholds from thumbs-up and thumbs-down feedback. The golden set design guide covers the refresh cadence pattern that fits cleanly with this loop.

Anti-patterns to avoid

The teams that get multilingual wrong tend to repeat the same mistakes.

English golden set with translated test cases. The labels were written against the English version, the translations were generated, and the evaluator scores translations against English-derived labels. This evaluates translation quality, not native behavior, and it hides every register failure the translator smoothed over.

One classifier backend for all languages. Whichever backend wins, it is always weakest somewhere. The fix is the ensemble, not a longer search for a magic single model.

No per-language annotator agreement check. Headline kappa of 0.8 with three of the five languages at kappa 0.4 is common in teams that did not measure per-language. Every downstream Mandarin or Hindi metric reads against noise.

Ignoring per-language token cost. A Hindi route at the same case volume as English costs three to four times as much. Teams that did not budget per language hit overruns at the worst possible moment, which is right after a successful product expansion.

No per-language refusal calibration. Refusals leak across languages in both directions. An agent that refuses correctly in English will sometimes comply in the translated context, and the reverse also happens. The fix is paired English plus target-language prompts in the refusal test suite and a RefusalPreservation rubric on every multilingual route.

The LLM judge bias mitigation guide covers more of the per-language judge-side anti-patterns that show up once the rubric layer is in place.

Honest framing on what ships today versus what is roadmap

A few caveats so this guide stays trustworthy.

The trace-stream connector that feeds production traces directly into the agent-optimization loop is on the near-term roadmap rather than shipping today. The pattern in this guide uses the eval-driven optimization loop that ships now via six optimizers (DSPy MIPROv2, BootstrapFewShot, COPRO, OPRO, PromptWizard, BootstrapFewShotWithRandomSearch), which is more than enough to retune per-language rubric prompts and judge few-shots without the connector. The automated agent optimization writeup covers the six optimizers.

Error Feed ships with a Linear integration today. Jira, Slack, and PagerDuty connectors are on the integrations roadmap. Teams that need a different ticket router run the Sonnet 4.5 cluster summaries through a webhook adapter in the interim.

The FAGI Protect classifier weights are closed-source. The Agent Command Center gateway self-hosts deterministic regex-and-lexicon PII fallbacks across 18 entity types with multilingual coverage, which is what powers most production redaction. The full ML Protect surface is hosted only.

These caveats are the honest perimeter of what the playbook covers. Inside the perimeter the loop is solid: per-language ensembles, native rubric authoring, distributed evaluation, per-language Error Feed clusters, and per-language self-improving evaluator retuning.

How Future AGI grounds the loop

Five surfaces carry the multilingual loop end to end.

Guardrails with the nine-backend open-weight menu. Qwen3Guard 8B, 4B, 0.6B; LlamaGuard 3 8B, 1B; ShieldGemma 2B; WildGuard 7B; Granite Guardian 8B, 5B. Per-language ensembles via weighted aggregation. The guardrails platform comparison covers the broader landscape.

CustomLLMJudge for native rubric authoring. Grading criteria, rating scales, and few-shots in any language. Versioned artifacts alongside the English rubrics.

traceAI with custom language attributes per span. llm.input.language, llm.output.language, session.language, plus any custom tags. Per-language slicing inherits across every dashboard.

Four distributed runners. ConcurrentEvaluator, BatchEvaluator, AsyncEvaluator, PipelineEvaluator parallelize the per-language matrix so 100,000-call runs complete in tens of minutes.

Error Feed plus Platform self-improving evaluators. HDBSCAN clustering, Claude Sonnet 4.5 cluster labeling with immediate_fix, thumbs feedback that retunes per-language thresholds automatically. The open-source eval frameworks roundup covers how this differs from the libraries that ship the runners without the feedback loop.

The combination is the closed loop. Per-language failures surface in Error Feed. The immediate_fix gets applied. The next batch run re-evaluates. Per-language thresholds retune from production thumbs. The non-English half of the product stops being a degraded clone of the English one.

The closing loop

Multilingual evaluation is not a bolt-on feature on top of an English eval stack. It is a redesign of the labeled set, the classifier layer, the judge layer, the trace layer, and the cluster-triage layer so each one carries its locale through the pipeline. The teams that did this in 2025 are the ones whose Japanese, Hindi, Spanish, and Arabic surfaces hold up under the same scrutiny as their English surface. The teams that did not are the ones shipping the bilingual-quality product they did not realize they were shipping. The loop above is the working pattern. Each surface is documented. The honest perimeter is named. The rest is the per-language work that only a native annotator and a careful rubric author can do.

Frequently asked questions

Why do English-only LLM eval frameworks fail outside English?

The rubrics, classifiers, and judge prompts were built and calibrated on English corpora, so they inherit an English prior. LlamaGuard is strong on English but weaker on Hindi, Tamil, and Arabic. Judges trained on English prefer English idiomatic phrasing, which undervalues concise non-English outputs. Tokenization treats non-Latin scripts unevenly. Each gap is small in isolation. Stacked, they ship a bilingual product where the non-English side is a degraded clone of the English one.

Which classifier backends are strongest for non-English safety?

Qwen3Guard variants (8B, 4B, 0.6B) cover Mandarin, Japanese, Korean, Hindi, and Arabic well. ShieldGemma 2B and WildGuard 7B add complementary coverage. Granite Guardian 8B and 5B are IBM's multilingual line. LlamaGuard 3 8B and 1B remain strong on English but weaker on non-Latin scripts. The right answer per language is rarely a single backend. Ensemble two or three with a weighted aggregation so a per-language gap in one classifier is covered by another.

How do I author rubrics in a target language without translating English ones?

Use the Future AGI CustomLLMJudge surface with grading_criteria written in the target language and an evaluator model that handles that language natively. Translated rubrics often preserve words but lose register, and a Japanese rubric that says "polite" without distinguishing keigo levels misses the actual user complaint. Authoring in the target language with a native annotator pass on the rubric text closes that gap before any case is scored.

How does FAGI's Error Feed help on multilingual deployments?

Error Feed clusters failures via HDBSCAN and labels each cluster with a Claude Sonnet 4.5 judge. On multilingual deployments the surfaced clusters look like "Japanese formal register slips to informal mid-paragraph", "Arabic right-to-left rendering breaks structured outputs", or "Mandarin idiomatic phrases mis-translated literally". Each cluster carries an immediate_fix the judge writes. Engineers triage clusters per language rather than scrolling raw failed traces, which is what makes per-language drift visible.

Does non-English really cost more in tokens?

Yes. Mandarin runs roughly 2x English at equivalent semantic content under most modern tokenizers. Hindi sits at 3-4x. Arabic and Tamil are also high. The per-eval cost on those languages is correspondingly higher, so the budget for a 5,000-case golden set in Hindi is not the same as the budget for the same set in English. Track this in the gateway with per-language hierarchical budgets so a translation expansion does not overrun the eval line item silently.

What is the kappa target for per-language annotation?

Cohen's kappa above 0.7 per language per rubric. Below that the labels are noise, and a Mandarin golden set with kappa 0.4 will mis-calibrate every downstream Mandarin judge. Hire native-speaker annotators for each top language rather than relying on translators reading English guidance. Track kappa per language; when it drops, fix the rubric guidance in that language before adding more cases.

Which Future AGI surfaces matter most for multilingual eval?

Five surfaces. Guardrails with per-language backend ensembles via the open-weight classifier menu. CustomLLMJudge for native-language rubric authoring. traceAI with custom language attributes per span so every trace carries its locale. The four-runner distributed evaluation engine that parallelizes the per-language matrix. Error Feed with HDBSCAN clustering and Sonnet 4.5 cluster labeling for triage. The Platform self-improving evaluators retune per-language thresholds from production thumbs feedback.