What Is Language Classification?

Language classification is the task of identifying the language of an input, retrieved document, transcript, or model response. In AI reliability, it is an LLM-evaluation check for multilingual systems: the eval pipeline verifies that the detected language matches the expected locale or user language. In a FutureAGI workflow, language classification shows up on datasets and production traces to catch wrong-language answers, mixed-language drift, and locale routing mistakes before release or live escalation.

Why language classification matters in production LLM and agent systems

Wrong language is a high-signal reliability failure. A Spanish user asks a billing question, the assistant retrieves English policy text, and the final answer returns in English with a confident refund instruction. A multilingual support agent may classify the request as French, route it to the wrong knowledge base, and then fail an intent classifier trained on another locale. The surface error looks small, but the downstream action can be wrong.

The pain is split across teams. Developers see flaky multilingual evals because the same prompt passes in English and fails in Japanese, Arabic, or code-switched text. SREs see higher latency and retry counts when requests are routed through a gateway fallback they never needed. Product teams see abandonment in non-English markets. Compliance teams care when a language mismatch causes policy, consent, or regulated disclosures to be delivered in a language the user did not ask for.

Agentic systems make this sharper. In a 2026-era multi-step pipeline, language classification may affect the retriever, tool route, response template, moderation policy, and escalation queue. One bad language label can poison every later step. Symptoms usually appear as language_mismatch_rate spikes, locale-specific thumbs-down clusters, unexpected model routes, or traces where the input language, retrieved context language, and output language disagree.

How FutureAGI handles language classification

FutureAGI handles language classification as an eval contract, not as a generic NLP label. The anchor surface is `CustomEvaluation`: an engineer defines the expected label, the language source to check, and the score shape for each dataset row or live trace. The evaluator can wrap a deterministic language-ID library, a domain-specific rule, or a judge prompt when code-switching and transliteration need human-like judgment. For customer-agent flows, CustomerAgentLanguageHandling can sit beside the custom check when the product needs conversation-level language consistency rather than a single-label detector.
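
When a deterministic detector is enough, the evaluator body can be a thin wrapper around a language-ID library. A minimal sketch using langdetect; the helper names `detect_language` and `score_language` are illustrative, not part of the FutureAGI SDK:

# Illustrative deterministic detector that a custom eval body could call.
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # langdetect is probabilistic; pin the seed for stable labels

def detect_language(text: str, min_chars: int = 20) -> str:
    """Return an ISO 639-1 code, or 'und' when the text is too short to trust."""
    if len(text.strip()) < min_chars:
        return "und"  # short strings (names, SKUs, addresses) go to explicit fallbacks
    try:
        return detect(text)
    except LangDetectException:
        return "und"

def score_language(expected_language: str, output_text: str) -> dict:
    detected = detect_language(output_text)
    return {
        "score": 1 if detected == expected_language else 0,
        "detected_language": detected,
        "reason": f"expected={expected_language} detected={detected}",
    }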

A real workflow starts with a multilingual support dataset. Each row stores input_text, output_text, expected_language, customer_locale, model, prompt_version, and trace_id. The team registers a CustomEvaluation named language_classification that emits score, detected_language, and reason. A release gate requires language_mismatch_rate < 1% overall and 0 mismatches on regulated disclosure flows. When a run fails, the engineer checks whether the mismatch came from the retriever, the model response, or an earlier routing decision.
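
The release gate itself is a small aggregation over per-row verdicts, as sketched below; the record shape mirrors the dataset fields above, and `regulated_flow` is an assumed per-row flag, not a FutureAGI field:

# Illustrative release gate over per-row eval records. Thresholds match the
# gate described above: < 1% mismatches overall, zero on regulated flows.
def release_gate(records: list[dict]) -> bool:
    mismatches = [r for r in records if r["score"] == 0]
    mismatch_rate = len(mismatches) / max(len(records), 1)
    regulated = [r for r in mismatches if r.get("regulated_flow")]  # assumed flag
    if mismatch_rate >= 0.01 or regulated:
        for r in (regulated or mismatches)[:5]:
            print(
                f"FAIL trace={r['trace_id']} expected={r['expected_language']} "
                f"detected={r['detected_language']} prompt={r['prompt_version']}"
            )
        return False
    return True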

FutureAGI’s approach is to treat the language label as a traceable production contract. Unlike a standalone langdetect check that runs after the response and loses context, FutureAGI can attach the verdict to a dataset row or a trace captured through the langchain traceAI integration. That lets the dashboard correlate failures with llm.token_count.prompt, prompt version, and agent step metadata such as agent.trajectory.step. The next action is concrete: alert on the cohort, add examples to the regression eval, route unsupported languages to a human queue, or pin a locale-specific prompt.

How to measure or detect language classification

Measure language classification with labeled examples first, then use production telemetry to find the slices that need more labels.

  • CustomEvaluation can standardize the score for a language-ID rule or judge rubric over expected_language and detected_language.
  • Track dashboard signals such as language_mismatch_rate, eval-fail-rate-by-locale, unsupported-language-rate, and mismatch-by-prompt-version.
  • Use langchain traceAI traces to compare input language, retrieved-context language, output language, model route, and agent.trajectory.step.
  • Use user-feedback proxies such as thumbs-down rate, escalation-rate, manual translation requests, and “wrong language” support tags.
  • Inspect a confusion matrix for pairs such as Portuguese vs. Spanish, Serbian Latin vs. Croatian, or simplified vs. traditional Chinese (see the tally sketch after the snippet below).
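
A minimal example of the rubric path, using the `CustomEvaluation` contract described above:
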
from fi.evals import CustomEvaluation

# Register the check once; the rubric defines the fields the judge must emit.
language_eval = CustomEvaluation(
    name="language_classification",
    rubric="Return score=1 if output language matches expected_language; else 0. Include detected_language.",
)

# Score one dataset row: the expected locale rides in the input payload.
result = language_eval.evaluate(
    input={"expected_language": "es"},
    output="Su reembolso fue aprobado."
)
print(result.score, result.reason)
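
To surface systematic confusions such as Portuguese detected as Spanish, tally the off-diagonal pairs from the same per-row records. A standard-library sketch, assuming the record fields used above:

# Illustrative confusion tally; hot off-diagonal pairs (pt -> es, sr -> hr)
# point to locales that need more labeled examples or a better detector.
from collections import Counter

def confusion_pairs(records: list[dict]) -> Counter:
    return Counter(
        (r["expected_language"], r["detected_language"])
        for r in records
        if r["expected_language"] != r["detected_language"]
    )

# Usage: print the five most common mismatch pairs.
# for (expected, detected), n in confusion_pairs(records).most_common(5):
#     print(f"{expected} -> {detected}: {n}")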

Set thresholds by risk. A marketing chatbot may tolerate occasional mixed-language copy. A consent, healthcare, or financial workflow should treat wrong-language regulated text as a release blocker.
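
One way to encode that policy is a per-flow threshold table that the release gate reads; the flow names here are hypothetical:

# Hypothetical per-flow gates; regulated disclosures block release outright.
LANGUAGE_GATES = {
    "marketing_chat": {"max_mismatch_rate": 0.05},
    "billing_support": {"max_mismatch_rate": 0.01},
    "regulated_disclosure": {"max_mismatch_rate": 0.0},
}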

Common mistakes

  • Reporting one global language-classification score while hiding failures in low-volume locales, right-to-left scripts, or mixed-language conversations.
  • Confusing language classification with translation quality; a response can be in Spanish and still mistranslate the source.
  • Running language detection only on final output, missing retrieval steps that injected the wrong-language policy text.
  • Treating code-switched text as a failure without defining whether the product supports mixed-language users.
  • Ignoring short strings, names, SKUs, and addresses; language detectors often mislabel them, so route those cases to explicit fallbacks.

Frequently Asked Questions

What is language classification?

Language classification identifies the language used in an input, retrieved passage, or model output. In LLM evaluation, it verifies that multilingual systems answer in the expected language and stay consistent across locale-specific workflows.

How is language classification different from translation accuracy?

Language classification asks which language the text is in or whether it matches an expected label. Translation accuracy checks whether meaning was preserved across source and target languages.

How do you measure language classification in FutureAGI?

Use `CustomEvaluation` to score detected_language against expected_language on a labeled dataset, then monitor language_mismatch_rate by locale, model, prompt version, and trace cohort.