What Is Text Normalization?

The preprocessing step that converts raw text into a standardized form — lowercasing, Unicode normalization, number expansion, punctuation handling — for downstream models.

Text normalization is the preprocessing step that rewrites raw text into a standardized form before it is fed to a downstream model. Typical operations include Unicode normalization (NFC versus NFD), lowercasing, expanding contractions, converting digits and dates to canonical strings, stripping accents, collapsing whitespace, and applying provider-specific rules for currency, time, and units. For TTS it expands 123 into “one hundred twenty-three”; for ASR post-processing it inserts numerals and punctuation. In LLM pipelines it lives inside the tokenizer or as upstream cleaning for retrieval.
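A minimal sketch of a few of these operations using only Python's standard library (the specific regex and step ordering below are illustrative, not a production rule set):

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Apply a few canonical normalization steps to raw text."""
    text = unicodedata.normalize("NFC", text)  # compose Unicode to canonical form
    text = text.lower()                        # fold case for matching
    text = re.sub(r"\s+", " ", text).strip()   # collapse and trim whitespace
    return text

print(normalize("  The   CAFE\u0301 opens at  9AM "))  # -> "the café opens at 9am"
```

Number and date expansion are deliberately omitted here; those are locale-specific and usually live in the TTS provider or a dedicated toolkit such as NeMo's.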

Why It Matters in Production LLM and Agent Systems

Normalization decisions silently change which tokens a model sees, which embeddings get computed, and which strings match in retrieval. A retrieval index built with NFC-normalized text will miss queries written in NFD — the same character has a different byte representation. A TTS model that does not expand numerals reads “2023” as “two-zero-two-three” instead of “twenty twenty-three”, and the agent sounds robotic. An ASR system that emits raw text without normalization hands downstream classifiers a lowercased “january” while the entity recognizer was trained on title-cased months.
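The NFC-versus-NFD mismatch is easy to demonstrate: the two encodings of “café” render identically but compare unequal until both sides are normalized to the same form.

```python
import unicodedata

nfc = "caf\u00e9"    # é as one precomposed code point (NFC)
nfd = "cafe\u0301"   # e followed by a combining acute accent (NFD)

print(nfc == nfd)                                 # False: different code points
print(unicodedata.normalize("NFC", nfd) == nfc)   # True once both sides are NFC
```

An index built from one form and queried with the other fails exact and lexical matching in exactly this way.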

ML engineers feel this when retrieval quality drops after a corpus refresh that introduced different Unicode source files. Voice teams feel it when a TTS provider upgrades and number pronunciation changes overnight. Retrieval engineers feel it when accents in user queries fail to match accent-stripped index entries. None of these are obvious bugs — they are quiet quality regressions on a specific cohort.

For 2026 voice and multilingual agent stacks, normalization is the most-frequently-broken layer because every locale has its own rules. French preserves accents; Spanish and Portuguese sometimes do; German capitalizes every noun, so blanket lowercasing erases signal that Dutch text does not carry. A single shared “lowercase + strip accents” rule that worked for English breaks the other locales. FutureAGI’s role is to expose the consequence in evaluation scores rather than to enforce normalization rules.

How FutureAGI Handles Text Normalization

FutureAGI does not run a normalization library — that lives in the tokenizer, the TTS provider, or the retrieval-side cleaning function (often Hugging Face tokenizers, unicodedata, or NeMo’s text-normalization toolkit). FutureAGI’s role is to evaluate whether the normalization choices upstream produce trustworthy outputs downstream. The relevant fi.evals surfaces are ASRAccuracy (transcript correctness, where punctuation and numerals are normalization-sensitive), TTSAccuracy (spoken output where number expansion matters), ContextRelevance (retrieval where Unicode and accent handling decide match rates), and EmbeddingSimilarity (semantic comparisons that drift if normalization changes between corpus and query).

A real workflow: a logistics voice-agent team adds a French locale. The first day, ASRAccuracy holds at 0.91 but TaskCompletion drops to 0.62 on French calls. They sample failed traces — the agent is repeatedly mispronouncing tracking numbers because the TTS layer stopped expanding them after a provider upgrade changed normalization defaults. Re-enabling number expansion restores TTSAccuracy and TaskCompletion together. Without per-cohort eval scores tied to traceAI-livekit spans, the bug would have stayed silent.

For retrieval-side normalization, the team scores ContextRelevance and ContextPrecision on a query cohort with accents (café, résumé); a regression points directly at the normalization layer.

How to Measure or Detect It

Normalization is not directly scored — measure it through the layer it affects:

  • ASRAccuracy: transcript-correctness score; drops when punctuation, casing, or numeral handling changes.
  • TTSAccuracy: spoken-output score against a reference; drops when number, date, or unit expansion fails.
  • ContextRelevance + ContextPrecision: retrieval-side scores; drift when corpus and query normalization diverge.
  • Per-cohort eval-fail-rate: split by language, locale, and Unicode source; the canonical alarm for normalization regressions.
  • Token-count anomalies: llm.token_count.prompt shifts after an upstream normalization change suggest the tokenizer is seeing different inputs.
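The per-cohort alarm above can be sketched in a few lines. The scores and the 0.8 pass threshold below are hypothetical stand-ins for eval results pulled from your own traces:

```python
from collections import defaultdict

# Hypothetical (locale, eval score) pairs sampled from traced calls.
results = [
    ("en-US", 0.94), ("en-US", 0.91), ("en-US", 0.88),
    ("fr-FR", 0.62), ("fr-FR", 0.71), ("fr-FR", 0.93),
]
THRESHOLD = 0.8  # assumed pass bar

stats = defaultdict(lambda: [0, 0])  # locale -> [failed, total]
for locale, score in results:
    stats[locale][0] += score < THRESHOLD
    stats[locale][1] += 1

for locale, (failed, total) in sorted(stats.items()):
    print(f"{locale}: fail rate {failed / total:.2f}")
```

A fail rate that is healthy in aggregate but elevated for one locale is the signature of a normalization regression rather than a model-wide one.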

Minimal Python:

from fi.evals import ASRAccuracy, TTSAccuracy

# Placeholder inputs; in production these come from your pipeline traces.
raw_transcript = "the package arrives january 5"
reference_text = "The package arrives January 5."
tts_output = "one hundred twenty-three Main Street"
expected_speech = "one hundred twenty-three Main Street"

asr = ASRAccuracy()
tts = TTSAccuracy()

asr_result = asr.evaluate(prediction=raw_transcript, reference=reference_text)
tts_result = tts.evaluate(prediction=tts_output, reference=expected_speech)
print(asr_result.score, tts_result.score)

Common Mistakes

  • Applying English-only rules across all languages. Lowercasing and accent-stripping that work for English break Spanish, French, and Vietnamese.
  • Normalizing the corpus differently from the query. If the index lowercases but the query does not, retrieval silently misses cases.
  • Assuming the TTS provider normalizes for you. Some providers expand numerals automatically; others read digits literally — always validate with TTSAccuracy.
  • Ignoring NFC versus NFD. Two visually identical strings can have different byte representations; normalize Unicode form before any string match.
  • Skipping normalization tests in regression evals. A library upgrade can change defaults overnight; pin the normalization function or test it explicitly.
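Several of these mistakes share one fix: apply the same normalization to both sides of every string comparison. A minimal sketch, with an illustrative helper name:

```python
import unicodedata

def match_key(text: str) -> str:
    """Reduce text to one Unicode form and one case before any comparison."""
    return unicodedata.normalize("NFC", text).casefold()

index_entry = "Re\u0301sume\u0301 tips"  # NFD, title case (as indexed)
user_query = "r\u00e9sum\u00e9 tips"     # NFC, lowercase (as typed)

print(index_entry == user_query)                          # False: silent miss
print(match_key(index_entry) == match_key(user_query))    # True
```

Pinning this function in one shared module, and asserting its behavior in regression tests, prevents a library upgrade from changing corpus and query handling independently.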

Frequently Asked Questions

What is text normalization?

Text normalization standardizes raw text before it is consumed by a model — lowercasing, Unicode normalization, expanding numbers, formatting dates, stripping accents. It happens upstream of tokenization, TTS, and retrieval.

How is text normalization different from tokenization?

Normalization rewrites the surface text into a standard form. Tokenization then splits the normalized text into the integer IDs a model consumes. Normalization is preprocessing; tokenization is the model's actual input pipeline.
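A toy illustration of that ordering, where whitespace splitting stands in for a real subword tokenizer that would emit integer IDs:

```python
import unicodedata

raw = "Caf\u00e9 visits: 1,024"

# Step 1: normalization rewrites the surface text into a standard form.
normalized = unicodedata.normalize("NFC", raw).lower()

# Step 2: tokenization splits the normalized text into model input units.
tokens = normalized.split()
print(tokens)  # ['café', 'visits:', '1,024']
```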

How do you measure text-normalization quality?

FutureAGI evaluates the downstream signals — ASRAccuracy for transcripts where normalization affects punctuation and numerals, TTSAccuracy for spoken-form expansion, and retrieval evaluators for normalization in RAG pipelines.