What Are N-Grams?

An n-gram is a contiguous sequence of n tokens — words, characters, or subwords — extracted from a piece of text. Unigrams (n=1), bigrams (n=2), and trigrams (n=3) are the common sizes. N-grams powered classical language models, autocomplete, and feature extraction long before transformers, and they still sit inside the modern eval stack: every BLEU, ROUGE-N, and word-overlap metric is built on n-gram matching. They are the cheapest first pass for detecting duplicate text, leaked prompts, or canned-template outputs.
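The definition above can be sketched in a few lines of standard-library Python; the `ngrams` helper is illustrative, not part of any library:

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

The same function yields unigrams with n=1 and trigrams with n=3; character-level n-grams just use a string instead of a token list.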

Why It Matters in Production LLM and Agent Systems

N-grams are useful precisely because they are cheap and surface-level. When you need to detect that a model regurgitated training data, a user replayed a prompt verbatim, or two responses are near-duplicates of each other, n-gram overlap is fast, deterministic, and easy to reason about. They are the right tool for canonical-reference tasks like translation evaluation (BLEU), structured-output checks, and prompt-leakage detection where you are looking for exact textual reuse.

The misuse case is what bites teams in production. BLEU and ROUGE compute n-gram overlap; both fail badly on open-ended generation, where many surface forms are equally good answers. A model that paraphrases the gold answer correctly can score near zero on BLEU. An ML engineer who treats BLEU as a proxy for quality will flag a regression against a model that is actually better but worded differently. The pain falls on whoever reads the dashboard: a green BLEU number on chat is meaningless, and so is the alert that fires when it drops.
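A minimal sketch of clipped n-gram precision (the core of BLEU, minus the brevity penalty and multi-size averaging) makes the failure mode concrete. `ngram_precision` is a hypothetical helper for illustration, not the fi.evals API:

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Fraction of candidate n-grams that also appear in the reference (clipped counts)."""
    def grams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

gold = "the cat sat on the mat"
print(ngram_precision("a feline was resting on top of the rug", gold))  # 0.0
print(ngram_precision("the cat sat on the mat", gold))                  # 1.0
```

The paraphrase is a perfectly good answer and scores zero; only the verbatim repeat scores well. That is exactly the behavior you want for translation references and exactly the behavior you do not want for chat.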

In 2026 agent stacks, n-grams have a renewed niche: detecting tool-output reuse, catching template-leak in prompts, and flagging when an agent’s response copies retrieved context verbatim instead of synthesizing. They are a complement to embedding-based and judge-model evaluators, not a replacement.
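The verbatim-copy check can be sketched as shared word n-grams between the response and the retrieved context, assuming whitespace tokenization; `copied_fraction` is an illustrative helper, not a library function:

```python
def copied_fraction(response, context, n=5):
    """Fraction of the response's word n-grams that appear verbatim in the context.
    High values suggest the agent is quoting retrieved text rather than synthesizing."""
    def grams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    resp = grams(response)
    if not resp:
        return 0.0
    return len(resp & grams(context)) / len(resp)

context = "the warranty covers manufacturing defects for two years from the date of purchase"
response = ("the warranty covers manufacturing defects for two years "
            "from the date of purchase so you are covered")
print(copied_fraction(response, context))  # well above 0.5: mostly quoted
```

A threshold around 0.5 on 5-grams is a reasonable starting point for flagging copy-paste behavior; tune it on your own traces.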

How FutureAGI Handles N-Grams in Evaluation

FutureAGI’s approach is to expose n-gram-based evaluators alongside embedding-based and judge-model ones, with documentation that says when to use which. BLEUScore and ROUGEScore ship as local-metric evaluators in fi.evals and are the right choice for translation, summarization with canonical references, and structured-text generation. FuzzyMatch provides a more forgiving n-gram-aware string match for cases where minor formatting differences should not cause failure.

Concretely: a localization team running an LLM-based translation pipeline through traceAI-openai evaluates each output against a reference translation with BLEUScore and a per-language judge model. BLEU surfaces obvious surface degradations — token-order regressions, dropped clauses — while the judge handles fluency and idiom. When BLEU drops 4 points overnight, the trace view confirms the model started under-translating long sentences. For a chatbot team running an open-ended support agent, FutureAGI’s docs steer them away from BLEU and toward AnswerRelevancy plus ConversationCoherence, because n-gram overlap on a chat answer is noise. Pick the metric that matches the task.

How to Measure or Detect It

Use n-gram metrics where surface form is canonical, not where it is open:

  • BLEUScore: returns a 0–1 score combining weighted n-gram precisions with a brevity penalty against a reference; standard for translation.
  • ROUGEScore: returns recall-oriented n-gram overlap (ROUGE-1, ROUGE-2, ROUGE-L); standard for summarization.
  • FuzzyMatch: returns 0–1 forgiving similarity using n-gram-aware techniques; good for near-canonical text.
  • N-gram repetition signal: count repeated trigrams in an output to detect model loops or template-fill failures.
  • N-gram overlap with retrieved context: high overlap = copy-paste behavior, useful as a hallucination-detector input.
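The repetition signal from the list above can be sketched with standard-library Python; `repeated_trigram_ratio` is an illustrative helper, not a library function:

```python
from collections import Counter

def repeated_trigram_ratio(text):
    """Share of trigram occurrences that are repeats of an earlier trigram.
    High values flag model loops or template-fill failures."""
    toks = text.lower().split()
    tri = Counter(tuple(toks[i:i + 3]) for i in range(len(toks) - 2))
    total = sum(tri.values())
    if total == 0:
        return 0.0
    repeats = sum(count - 1 for count in tri.values() if count > 1)
    return repeats / total

looping = "I can help with that I can help with that I can help with that"
normal = "The cat sat quietly on the warm mat near the door"
print(repeated_trigram_ratio(looping))  # well above zero: the output is looping
print(repeated_trigram_ratio(normal))   # 0.0
```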

Minimal Python:

```python
from fi.evals import BLEUScore, ROUGEScore

bleu = BLEUScore()
rouge = ROUGEScore()

# Compare the model output to a canonical reference translation.
result = bleu.evaluate(
    output="The cat sat on the mat.",
    reference="A cat is sitting on the mat."
)
print(result.score)
```

Common Mistakes

  • Using BLEU on open-ended chat. BLEU only works when the gold answer is canonical; on chat it measures noise.
  • Picking n=4 by default. ROUGE-2 and BLEU-2 catch most signal; higher n is sparse and easily fooled.
  • Ignoring case and punctuation normalization. Mismatches there can drop BLEU 10 points without any real quality change.
  • Treating BLEU drop as quality drop. Confirm with a judge model or EmbeddingSimilarity before alerting.
  • Skipping reference-free metrics on open tasks. Use AnswerRelevancy or judge rubrics where references are not canonical.
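The normalization mistake in the list above is cheap to avoid. A minimal sketch of a normalization pass, applied to both output and reference before scoring (illustrative, not part of fi.evals):

```python
import string

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace so formatting
    noise does not depress n-gram scores."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(normalize("The cat sat on the mat."))  # "the cat sat on the mat"
print(normalize("the  cat sat on the mat"))  # "the cat sat on the mat"
```

With this in place, a trailing period or a double space can no longer masquerade as a quality regression.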

Frequently Asked Questions

What is an n-gram?

An n-gram is a contiguous sequence of n tokens — words, characters, or subwords — extracted from text. Unigrams, bigrams, and trigrams are the most common sizes.

How are n-grams different from embeddings?

N-grams are surface-level token sequences; embeddings are dense vector representations of meaning. N-grams cannot capture synonymy or paraphrase, which is why they fail on open-ended generation evaluation where embeddings or judge models do better.

How are n-grams used in LLM evaluation?

FutureAGI's `BLEUScore` and `ROUGEScore` evaluators use n-gram overlap to measure surface similarity between model output and a reference, useful when the gold answer is canonical (e.g., translation, structured generation).