
What is Tokenization in LLMs? BPE, SentencePiece, tiktoken in 2026

Tokenization explained for 2026 LLMs: BPE, SentencePiece, WordPiece, tiktoken, why tokenizers shape cost, latency, eval scores, and multilingual quality.


A multilingual support agent ships and the bill triples in two weeks. The prompt template did not change. The user mix did. The Brazilian and Indonesian customer cohorts grew 8x, and their prompts tokenize at 1.6x the per-character cost of English in the older cl100k_base vocabulary the team is using. A switch to a model on o200k_base cuts the per-character token cost on those languages by 35%. The team did not touch the prompt, the model size, or the agent logic. The fix was a tokenizer.

This is what tokenization is for in 2026. The tokenizer is the silent boundary between text and the model, and its choice shapes cost, latency, context length, eval scores, and multilingual quality. Most teams treat tokenizers as a black box; production teams that scale do not. This guide is the entry-point explainer covering the major algorithms (BPE, SentencePiece, WordPiece), the canonical libraries (tiktoken, SentencePiece, transformers), and how to count tokens for OpenAI, Anthropic, and Llama models in 2026.

TL;DR: What tokenization is

Tokenization is the step that turns text into a sequence of integers an LLM can process. The text is split into tokens (often subwords, sometimes whole words or single bytes), each token maps to an id in a fixed vocabulary, and the model operates on the integer ids during prefill and decode. Detokenization is the reverse. Every LLM has a tokenizer, and the choice shapes context length, cost, latency, and quality on multilingual and code workloads. The dominant algorithm in 2026 is BPE and its variants; SentencePiece remains common across open and multilingual model families for BPE and unigram training; OpenAI’s tiktoken is the canonical token counter for GPT models.

Why tokenization matters in 2026

Three reasons it stopped being an implementation detail.

First, cost. LLM pricing is per-token. A prompt that tokenizes to 1,200 tokens in one vocabulary might tokenize to 800 tokens in another. On a workload processing 10M tokens a day, a 33% tokenization gap is real money. Multilingual workloads make this more acute: older English-heavy vocabularies fragment non-English text 2-3x more than newer multilingual vocabularies.

Second, context. Context windows in 2026 range from 32K on older models to 1M on long-context tiers of Gemini 2.5 and Claude Sonnet 4/4.5; GPT-5 sits at 400K. The window is in tokens, not characters. A 200K-character document fits comfortably in a 128K window in English with cl100k_base; the same document in Hindi with the same tokenizer can spill past the cap. Tokenization decides whether your RAG chunk fits.

Third, eval. BLEU and ROUGE are computed on tokens. Schema-validation evals depend on the tokenizer not splitting JSON keys awkwardly. Refusal-rate and safety classifiers were trained with one tokenizer and may behave differently when the input was tokenized with another. Tokenizer drift is a quiet source of eval regression in 2026 stacks that swap models without re-tokenizing the eval set.
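To make the tokenizer dependence concrete, here is a minimal sketch of clipped unigram precision (one ingredient of BLEU) scored under two tokenizations of the same pair of strings. The function name `clipped_precision` and the example sentences are illustrative assumptions; this is not the implementation used by any particular evaluation library.

```python
from collections import Counter

def clipped_precision(hyp_tokens, ref_tokens):
    """Share of hypothesis tokens that also appear in the reference (clipped by count)."""
    hyp, ref = Counter(hyp_tokens), Counter(ref_tokens)
    matched = sum(min(count, ref[tok]) for tok, count in hyp.items())
    return matched / max(len(hyp_tokens), 1)

reference = "the cat sat"
hypothesis = "the cat sits"

# Tokenization 1: split on whitespace (word-level tokens).
word_score = clipped_precision(hypothesis.split(), reference.split())

# Tokenization 2: one token per character.
char_score = clipped_precision(list(hypothesis), list(reference))

print(f"word-level: {word_score:.3f}  char-level: {char_score:.3f}")
```

The text is identical in both runs; only the token boundaries change, and the score moves from about 0.67 to about 0.83. The same effect applies when an eval set is scored under one vocabulary and compared against numbers produced under another.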

Tokenization is no longer a black box detail. It is a parameter that shapes the dollars, the seconds, and the score.

Same sentence, three tokenizers: BPE vs SentencePiece vs WordPiece

Track which tokenizer your eval set was tokenized with; swapping models without re-tokenizing introduces silent eval regression.

| Tokenizer | Used by | Strategy |
|---|---|---|
| BPE (Byte-Pair Encoding) | GPT-2/3/4, Claude family | Greedy merge of the most frequent adjacent pairs |
| SentencePiece | Llama 2, T5, Mistral, Gemini | Unigram or BPE trained on raw text, no whitespace pre-split |
| WordPiece | BERT and derivatives (DistilBERT, ELECTRA) | Likelihood-based merges at training time; greedy longest-match at inference |

The major tokenization algorithms

Byte Pair Encoding (BPE)

BPE starts from a vocabulary of single bytes or characters, then iteratively merges the most frequent adjacent pair until the vocabulary hits the target size. The resulting vocabulary contains common whole words (“the”, “and”), useful subword fragments (“tion”, “ed”, “ing”), and rare ids that map to single bytes for handling unseen characters.
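The merge loop described above can be sketched in a few lines. This is a toy, character-level trainer for illustration only: the function name `train_bpe`, the whitespace pre-split, and the tie-breaking rule are assumptions, and production trainers add byte-level alphabets, pre-tokenization regexes, and much faster pair counting.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties break on the pair itself (an assumption).
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        rewritten = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            rewritten[tuple(out)] += freq
        words = rewritten
    return merges

merges = train_bpe("low low low lower lowest", num_merges=3)
print(merges)
```

On this tiny corpus the first three merges build up the shared stem: `o`+`w`, then `l`+`ow`, then `low`+`e`. The learned merge list, applied in order, is what a BPE encoder replays at inference time.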

GPT-2 popularised byte-level BPE, where the alphabet is the 256 bytes rather than Unicode characters. This avoids the unknown-token problem entirely: any byte sequence can be tokenized. GPT-3, GPT-4, GPT-5, and the Llama 3+ family use byte-level BPE variants.
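A short sketch of why a 256-byte alphabet covers any input: every string has a UTF-8 encoding, and each byte is already an integer in the range 0 to 255. The helper name `utf8_byte_ids` is an assumption for illustration; real byte-level BPE then merges these byte ids into larger tokens.

```python
def utf8_byte_ids(text):
    """Return the UTF-8 bytes of `text` as integers in the range 0..255."""
    return list(text.encode("utf-8"))

# a, e-acute, Devanagari "na", grinning-face emoji
samples = ["a", "\u00e9", "\u0928", "\U0001F600"]

for ch in samples:
    ids = utf8_byte_ids(ch)
    print(repr(ch), len(ids), ids)

byte_lengths = [len(utf8_byte_ids(ch)) for ch in samples]
```

The four characters occupy 1, 2, 3, and 4 bytes respectively. Before any merges are learned, a non-Latin script therefore starts out several times longer than ASCII text, which is one root of the per-language cost gap.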

The original BPE-for-NMT paper is Sennrich, Haddow, and Birch’s 2016 Neural Machine Translation of Rare Words with Subword Units. The byte-level extension is documented in Radford et al.’s GPT-2 paper.

SentencePiece

SentencePiece is a tokenizer library and training framework released by Google in 2018. Two design choices distinguish it.

First, it trains directly on raw text and treats whitespace as a normal symbol, escaped as U+2581 (printed as ▁). This matters for languages without whitespace (Chinese, Japanese, Thai) where a whitespace assumption breaks tokenization. SentencePiece works the same across all languages without separate per-language preprocessing.
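The whitespace-escaping idea can be sketched without the library itself. The helper names and the example segmentation below are hypothetical, and the sketch is simplified: by default the real library also prepends a dummy prefix marker to the input, which is omitted here.

```python
META = "\u2581"  # LOWER ONE EIGHTH BLOCK, the visible stand-in for a space

def escape_whitespace(text):
    """Replace spaces with the meta symbol so whitespace is just another character."""
    return text.replace(" ", META)

def unescape_whitespace(pieces):
    """Detokenize: concatenate pieces, then turn the meta symbol back into spaces."""
    return "".join(pieces).replace(META, " ")

escaped = escape_whitespace("Hello world")

# A hypothetical segmentation of the escaped string into subword pieces.
pieces = ["Hello", META + "wor", "ld"]
restored = unescape_whitespace(pieces)

print(escaped, pieces, restored)
```

Because the space travels inside the pieces, detokenization is a plain concatenation and the round trip is lossless, with no language-specific rules about where spaces belong.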

Second, it supports both BPE and unigram language model training. The unigram variant is a probabilistic model: it starts with a large vocabulary, computes the likelihood of each token under a unigram LM, and prunes the lowest-likelihood tokens until the vocabulary hits the target size. The unigram variant is what T5 and ALBERT use; Llama 2 uses the BPE variant.
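Once a unigram vocabulary is trained, encoding a string means finding the segmentation with the highest total log-probability. Below is a minimal Viterbi sketch of that inference step only (not the pruning-based training); the function name and the toy log-probabilities are made up for illustration.

```python
import math

def viterbi_segment(text, logp):
    """Return the highest-scoring split of `text` into pieces found in `logp`."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Toy vocabulary with invented log-probabilities.
toy_logp = {"un": -2.0, "happy": -3.0, "unhappy": -6.0}
toy_logp.update({ch: -5.0 for ch in "unhappy"})

segmentation = viterbi_segment("unhappy", toy_logp)
print(segmentation)
```

With these numbers `un` + `happy` (total -5.0) beats the single piece `unhappy` (-6.0). Raising the whole-word log-probability above -5.0 flips the result, which is the sense in which the unigram model is probabilistic rather than rule-driven.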

The original paper is Kudo and Richardson’s 2018 SentencePiece: A simple and language independent subword tokenizer. The unigram variant is in Kudo’s Subword Regularization.

WordPiece

WordPiece is the algorithm BERT uses. It is a greedy longest-match subword tokenizer. The vocabulary is trained similarly to BPE but with a likelihood-based merge criterion. Subword pieces are prefixed with ## to mark continuation. WordPiece dominated several encoder-only models (BERT, DistilBERT, ELECTRA); RoBERTa uses byte-level BPE rather than WordPiece. WordPiece has been largely supplanted by BPE for generative LLMs.
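The greedy longest-match step can be sketched as follows. This covers inference only, on a single pre-split word, with a tiny hand-written vocabulary; the function name and the vocabulary entries are illustrative assumptions.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"token", "##ization", "##ize", "un", "##able"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("unable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Note the contrast with byte-level BPE: when no piece matches, the word collapses to an unknown token instead of falling back to bytes.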

Character and byte tokenization

Single-character or byte tokenizers exist (CharFormer, ByT5, byte-level fallback inside BPE) but are rarely the primary tokenizer for production LLMs. The token sequence is much longer at the same character count, so cost and latency are higher; the upside is universal language coverage with no vocabulary mismatch. They are used in some specialised models for code, biology, or low-resource languages.

Tokenizer-by-model: who uses what in 2026

| Model family | Tokenizer | Vocab size | Notes |
|---|---|---|---|
| GPT-3 (base completion models) | tiktoken p50k_base / r50k_base (BPE) | ~50K | Older GPT-2-derived encoding |
| GPT-3.5, GPT-4, GPT-4-turbo | tiktoken cl100k_base (BPE) | ~100K | English-heavy; fragments non-English text |
| GPT-4o, GPT-5 | tiktoken o200k_base (BPE) | ~200K | Multilingual-friendly; fewer tokens on Korean, Chinese, Hindi than cl100k_base |
| Claude 3, Claude 3.5, Claude Sonnet 4 | proprietary (BPE-style) | undisclosed | No public local tokenizer; use count_tokens API |
| Llama 2 | SentencePiece BPE | 32K | Older, smaller vocab |
| Llama 3, Llama 3.1 | tiktoken-style (BPE) | 128K | Larger vocab; better multilingual coverage |
| Llama 4 | tiktoken-style (BPE) | ~200K | Larger vocab again; check the model card for the exact size |
| Gemma | SentencePiece | 256K | Multilingual-first |
| Gemini 2.5 | proprietary | undisclosed | Use Vertex AI SDK count |
| Mistral, Mixtral | SentencePiece BPE | 32K-128K | Varies by version |
| Qwen 2, Qwen 3 | BPE | ~150K | Chinese-friendly |
| BERT, DistilBERT, ELECTRA | WordPiece | 30K | Encoder-only |
| RoBERTa | byte-level BPE | 50K | Encoder-only; uses GPT-2-style BPE |

For canonical sources see the tiktoken model -> encoding table, the Llama tokenizer page, and Anthropic’s token counting docs.

How to count tokens correctly in 2026

Three rules.

Rule 1. Use the official tokenizer for the target model. Approximations across vocabularies are wrong by 10-40% on non-English text and 20-60% on code.

Rule 2. Pin the tokenizer version. tiktoken adds new encodings; if you upgrade the library and the model’s encoding name changed, your historical cost dashboards now compare two different vocabularies.

Rule 3. Count both prompt and completion. Pricing is asymmetric (output is typically 3-5x the input price), so track them separately. Most gateways (FutureAGI Agent Command Center, Helicone, Portkey, LiteLLM) attribute both natively.
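A minimal sketch of why the two counts need separate tracking. The per-million prices below are hypothetical placeholders chosen only to show a 4x output premium; substitute the current rates for your provider and model.

```python
def request_cost_usd(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Cost of one request when input and output tokens are priced separately."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000

# Hypothetical prices: $2.50 per million input tokens, $10.00 per million output tokens.
input_cost = request_cost_usd(1200, 0, 2.50, 10.00)
output_cost = request_cost_usd(0, 300, 2.50, 10.00)
total_cost = request_cost_usd(1200, 300, 2.50, 10.00)

print(f"input ${input_cost:.4f} + output ${output_cost:.4f} = ${total_cost:.4f}")
```

In this example the completion is a fifth of the tokens but half of the bill. A dashboard that reports one blended token count hides that split.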

Code patterns:

```python
# OpenAI (tiktoken): counts locally with the encoding mapped to the model name
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
n_tokens = len(enc.encode("Hello, world!"))

# Anthropic (count_tokens API): a network call; no public local tokenizer
import anthropic
client = anthropic.Anthropic()
n_tokens = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Hello, world!"}],
).input_tokens

# Llama 3 (transformers): encode() includes special tokens by default
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
n_tokens = len(tok.encode("Hello, world!"))
```

For multi-message chat prompts (system + user + assistant turns), the model adds special tokens around each message. The official tokenizer or the count endpoint includes those; ad-hoc word-count approximations do not. Use the official path.

Common mistakes when working with tokenization

  • Approximating tokens as ~4 characters or ~0.75 words. Wrong by 30%+ on non-English, code, JSON, or any structured text. Use the actual tokenizer.
  • Using one tokenizer to estimate cost across providers. OpenAI, Anthropic, and Llama vocabularies are different; cross-vocabulary estimates are wrong.
  • Not counting system messages and tool definitions. A 4K-token system prompt is 4K tokens on every call. Many cost dashboards quietly omit it.
  • Letting the tokenizer drift between eval and production. If the eval set was tokenized with cl100k_base and the production model is on o200k_base, BLEU/ROUGE numbers do not transfer.
  • Treating multilingual workloads as English. Token cost per character can be 2-3x higher in non-English; budget accordingly.
  • Forgetting BOM, zero-width, and control characters. They tokenize, often as their own tokens, often unexpectedly. Strip or normalise input.
  • Using SentencePiece BPE settings from a Llama 2 codepath on Llama 3. Llama 3 changed tokenizers. Old code paths silently miscount.
  • Counting tokens client-side and trusting the result for billing reconciliation. Use the provider’s reported usage.input_tokens / usage.output_tokens in the response. Client-side counts are estimates.
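For the BOM, zero-width, and control-character item above, a minimal normalisation pass might look like the sketch below. The function name and the exact character sets are assumptions: zero-width joiners carry meaning in some scripts and in emoji sequences, so review the strip list against your own traffic before adopting it.

```python
import unicodedata

# Characters stripped outright: BOM / zero-width no-break space, zero-width space,
# zero-width non-joiner, zero-width joiner, word joiner.
ZERO_WIDTH = {"\ufeff", "\u200b", "\u200c", "\u200d", "\u2060"}
# Control characters that are kept because they carry layout.
KEEP_CONTROLS = {"\n", "\t", "\r"}

def clean_for_tokenizer(text):
    """NFC-normalise, then drop zero-width characters and non-layout control characters."""
    text = unicodedata.normalize("NFC", text)
    kept = []
    for ch in text:
        if ch in ZERO_WIDTH:
            continue
        if unicodedata.category(ch) == "Cc" and ch not in KEEP_CONTROLS:
            continue
        kept.append(ch)
    return "".join(kept)

cleaned = clean_for_tokenizer("\ufeffHe\u200bllo\x00\n")
print(repr(cleaned))
```

Run the same pass on both the eval set and production input so that token counts stay comparable.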

Recent tokenization updates

| Date | Event | Why it matters |
|---|---|---|
| May 2024 | OpenAI shipped o200k_base for GPT-4o family | Better multilingual + code; shrank token counts on non-English ~30% |
| Apr 2024 | Llama 3 moved from 32K SentencePiece to 128K tiktoken-style | Multilingual coverage on the OSS side caught up |
| Jun 2024 | T-FREE proposed tokenizer-free generative LLMs with sparse hash embeddings (arXiv 2406.19223) | First credible challenge to the “vocab is mandatory” assumption since CANINE |
| Dec 2024 | Meta released the Byte Latent Transformer (BLT) (arXiv 2412.09871) | Dynamic byte-patching at training time; 8B-param byte-level model matches a Llama-3 8B BPE baseline at comparable FLOPs |
| 2023 | Petrov et al. published “Language Model Tokenizers Introduce Unfairness Between Languages” (arXiv 2305.15425) | Quantified that low-resource-language users pay 2 to 15x more per character; sparked a wave of fairness work |
| May 2024 | Minixhofer et al. published Zero-Shot Tokenizer Transfer (ZeTT) (arXiv 2405.07883) | Transfer a trained LLM to a new tokenizer with a hypernetwork; cuts the cost of retokenizing OSS models for multilingual workloads |
| 2025 | Anthropic published the count_tokens Messages API (docs) | Production-grade token counting without proxying through the model |
| 2025 | Gemma 2/3 shipped with 256K SentencePiece vocabularies | OSS multilingual tokenizers crossed 200K |
| 2025 | OpenAI shipped the harmony chat format and accompanying encoding for its open-weight gpt-oss reasoning models (openai-harmony) | Explicit role, channel, and message-boundary special tokens turned reasoning models into a structured-output target |
| 2026 | Llama 4 and DeepSeek V4 standardized on 128K-200K tiktoken-style vocabs with stronger multilingual coverage | The closed/open tokenizer-efficiency gap effectively closed for the major frontier families |
| 2026 | Most LLM gateways and observability backends report both prompt and completion tokens via OTel gen_ai.usage.* attributes | Cost attribution per-tenant, per-route, per-feature became standard |

Recent research on tokenization (2024 to 2026)

The active research questions in 2026 are not “which BPE merge order” but “should the model see tokens at all” and “how do you make existing models stop punishing low-resource languages.”

Byte-level and tokenizer-free models

The strongest 2024-2025 result is Meta’s Byte Latent Transformer (BLT): rather than tokenizing input into discrete vocabulary IDs, BLT dynamically groups bytes into variable-length patches at training time, with patch boundaries determined by an entropy model. An 8B-parameter BLT trained at comparable FLOPs matches a Llama-3 8B BPE baseline on standard benchmarks while handling typos, code, and low-resource languages more gracefully (arXiv 2412.09871).

T-FREE (Deiseroth et al., 2024) takes a different angle: a tokenizer-free generative LLM that uses sparse hashed character-trigram embeddings instead of a learned vocabulary. The model handles any UTF-8 input without a vocabulary, and the embedding table compresses by ~85% versus a comparable BPE vocab (arXiv 2406.19223).

MambaByte (Wang et al., 2024) shows that state-space models scale to byte-level inputs without the quadratic-attention cost that has historically forced tokenization onto transformers (arXiv 2401.13660).

Practical read for 2026: byte-level approaches are not yet production-default but are no longer fringe. Watch for a frontier release on the byte path.

Tokenizer transfer between models

Zero-Shot Tokenizer Transfer (ZeTT) (Minixhofer et al., NeurIPS 2024) trains a hypernetwork that predicts new embedding matrices given a new tokenizer, letting you swap the tokenizer on a trained LLM without full retraining (arXiv 2405.07883). This is the cleanest answer yet to the “we trained on English BPE and now need Hindi coverage” problem.

Empirical tokenizer choice

“Getting the Most out of Your Tokenizer for Pre-training and Domain Adaptation” (Dagan et al., 2024) is the canonical empirical study on vocabulary size, training corpus, and how tokenizer choice cascades into downstream task performance. The headline finding: tokenizer choice is more sensitive to domain mix than the literature previously claimed, and naive cross-domain transfer of vocabularies underperforms a domain-specific re-train (arXiv 2402.01035).

“Tokenization Is More Than Compression” (Schmidt et al., 2024) decouples tokenizer compression from downstream model quality; better compression on a corpus does not always mean better LM perplexity, which complicates the “smaller token count is strictly better” intuition (arXiv 2402.18376).

Multilingual fairness

Petrov et al., “Language Model Tokenizers Introduce Unfairness Between Languages” (NeurIPS 2023, refreshed 2024) measured the per-character token cost across 100+ languages on GPT, Llama, and BERT tokenizers. Speakers of Burmese, Tamil, and Malayalam paid 5 to 15x more tokens per character than English speakers in 2023-era tokenizers. The work pushed every frontier lab to widen multilingual coverage; o200k_base, Llama 3, and Gemma 2 all narrowed the gap measurably (arXiv 2305.15425).

MAGNET (Ahia et al., 2024) proposes adaptive gradient-based tokenization that learns per-language segment boundaries from gradients during training rather than committing to a fixed BPE vocabulary before pre-training (arXiv 2407.08818). Promising for the next wave of multilingual base models.

Tokenizers for reasoning models

OpenAI’s harmony chat format (openai-harmony repo) ships explicit <|start|>, <|message|>, <|channel|>, and <|end|> tokens that separate the user-visible reply from the model’s internal reasoning. The associated encoding (o200k_harmony) sits next to o200k_base and is the format the open-weight gpt-oss models expect. Production teams targeting reasoning-class models should parse on the channel boundaries; flat text concatenation defeats the structured-output payoff.
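Parsing on channel boundaries can be sketched with a regular expression over a rendered transcript. This is a simplified illustration that handles only the four markers named above on decoded text; the regex, the function name, and the sample transcript are assumptions, and a production parser should rely on the official library and work on token ids.

```python
import re

# One message: <|start|>ROLE[<|channel|>CHANNEL]<|message|>CONTENT<|end|>
MESSAGE_RE = re.compile(
    r"<\|start\|>(?P<role>[^<]*?)"
    r"(?:<\|channel\|>(?P<channel>[^<]*?))?"
    r"<\|message\|>(?P<content>.*?)<\|end\|>",
    re.DOTALL,
)

def split_channels(rendered):
    """Return (role, channel, content) for each message; channel is None if absent."""
    return [
        (m.group("role"), m.group("channel"), m.group("content"))
        for m in MESSAGE_RE.finditer(rendered)
    ]

# Hypothetical transcript assembled by hand for the example.
transcript = (
    "<|start|>user<|message|>Say hi<|end|>"
    "<|start|>assistant<|channel|>analysis<|message|>User wants a greeting.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Hello!<|end|>"
)

messages = split_channels(transcript)
visible = [content for role, channel, content in messages if channel == "final"]
print(visible)
```

Filtering on the channel keeps the internal reasoning out of the user-facing reply, which is the point of parsing rather than concatenating.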

“Unspeakable tokens” and “glitch tokens” research from 2024 onward shows that vocabulary entries trained on near-zero examples can act as adversarial trigger strings (cause refusals, hallucinations, or jailbreaks). Most frontier vocabularies still contain residual glitch tokens. The 2026 mitigation pattern is to filter the vocabulary at train time against a held-out usage corpus and to maintain a per-tokenizer glitch-token deny-list in production preprocessing.

How to actually pick and operate a tokenizer in 2026

  1. Pick the model first. The tokenizer comes with the model; you do not pick a tokenizer in isolation.
  2. Audit per-language efficiency. Run 10K representative samples through the tokenizer; record tokens-per-character by language. Surprising regressions hide here.
  3. Pin the tokenizer version in CI. Treat it like a model version.
  4. Wire token counts into the gateway. Per-tenant, per-feature, per-route attribution. See Best LLM Gateways in 2026.
  5. Use provider-reported usage for billing. Client-side counts are estimates; provider usage.* values are ground truth.
  6. Track tokens-per-character drift. A new model release can shift the tokenizer; the cost dashboard will reflect it before the on-call notices.
  7. Rebuild eval sets when changing tokenizer. Token-based metrics are not portable across vocabularies. For depth, see LLM Cost Optimization and LLM Cost Tracking Best Practices in 2026.
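The per-language audit in step 2 can be sketched as a small function that takes any token counter. The names here are hypothetical, and the UTF-8 byte counter is only a stand-in so the example runs without a model; in practice pass the official tokenizer's count for the target model.

```python
from statistics import mean

def tokens_per_char(samples_by_language, count_tokens):
    """Average tokens-per-character for each language, given a token-counting callable."""
    report = {}
    for language, samples in samples_by_language.items():
        ratios = [count_tokens(text) / len(text) for text in samples if text]
        report[language] = mean(ratios) if ratios else 0.0
    return report

def utf8_byte_count(text):
    """Stand-in counter: one 'token' per UTF-8 byte (a byte-level worst case)."""
    return len(text.encode("utf-8"))

samples = {
    "en": ["hello", "world"],
    "hi": ["\u0928\u092e\u0938\u094d\u0924\u0947"],  # the Hindi greeting "namaste"
}

report = tokens_per_char(samples, utf8_byte_count)
print(report)
```

Storing this report per tokenizer version gives the drift signal from step 6: rerun it on the same samples after a model change and compare the ratios language by language.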

How to use this with FAGI

FutureAGI is the production-grade gateway and observability stack for teams operating tokenizers in production. The Agent Command Center is itself a BYOK gateway across 20+ providers, via six native adapters (OpenAI, Anthropic, Gemini, Bedrock, Cohere, Azure) plus OpenAI-compatible presets and self-hosted backends. It attributes prompt and completion tokens separately at every span, with per-tenant, per-feature, per-route, and per-prompt-version cuts. Provider usage.* values land in span attributes natively; tokens-per-character drift is a chart, not a CSV join. Cost gates run alongside quality gates in CI: a regression that ships 30% more tokens for the same task blocks the merge.

Span-attached evals via turing_flash (50 to 70 ms p95 for guardrail screening, about 1 to 2 seconds for full eval templates) score quality on every sampled trace, so the join “did the longer prompt actually move the rubric?” is one query rather than three vendor exports. The same plane carries 50+ eval metrics, persona-driven simulation, 18+ guardrails, and Apache 2.0 traceAI instrumentation on one self-hostable surface. Pricing starts free with a 50 GB tracing tier.


Read next: LLM Cost Optimization, LLM Cost Tracking Best Practices in 2026, Best Token Cost Tracking Tools in 2026, Embeddings for LLMs

Frequently asked questions

What is tokenization in plain terms?
Tokenization is the step that turns text into a sequence of integers an LLM can process. The text is split into tokens (sometimes whole words, sometimes subwords, sometimes single characters or bytes), and each token is mapped to an id in a vocabulary. The model never sees raw characters during inference; it sees the integer ids. Detokenization is the reverse step that converts the model's output ids back into text. Every LLM has a tokenizer, and the choice of tokenizer shapes context length, cost, latency, and quality on multilingual or code workloads.
What is BPE and why is it the dominant subword algorithm?
BPE (Byte Pair Encoding) starts from a vocabulary of single bytes or characters, then iteratively merges the most frequent adjacent pair until the vocabulary hits the target size. The resulting vocabulary contains common whole words and useful subword fragments. BPE became dominant because it is simple, deterministic, handles unknown words by falling back to subwords, and produces compact token sequences on English text. GPT-2, GPT-3, GPT-4, GPT-5, and Llama all use BPE variants. The original paper is Sennrich, Haddow, and Birch's 2016 [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909).
What is SentencePiece, and how does it differ from BPE?
SentencePiece is a tokenizer library and training algorithm released by Google in 2018. It trains directly on raw text and treats whitespace as a normal symbol (escaped as U+2581), so it works natively across languages without separate per-language preprocessing. SentencePiece supports both BPE and unigram language model training; the unigram variant is what T5 and ALBERT use. The headline difference: SentencePiece does not require pre-tokenization on whitespace, while BPE in its original form assumes whitespace tokenization. The original paper is Kudo and Richardson's 2018 [SentencePiece: A simple and language independent subword tokenizer](https://arxiv.org/abs/1808.06226).
What is tiktoken?
tiktoken is OpenAI's open-source BPE tokenizer library, released in 2022. It implements the exact same tokenization as GPT-3.5, GPT-4, GPT-5, and the embedding models. The library is fast (Rust core, Python bindings) and is the canonical way to count tokens for OpenAI models in production. The encodings are named: cl100k_base for GPT-3.5/4, o200k_base for GPT-4o and later, etc. Find the GitHub repo at [openai/tiktoken](https://github.com/openai/tiktoken). For Anthropic and Llama models, use their respective official tokenizers; do not approximate.
Why does tokenization affect cost and latency?
LLM pricing is per-token, not per-character. A prompt that tokenizes to 1,200 tokens in cl100k_base might tokenize to 800 tokens in a different vocabulary; same text, 33% cost difference. Latency follows similarly because both the prefill phase (input tokens) and the decode phase (output tokens) are token-bounded. On code, JSON, and multilingual text the difference can be 2x or more between tokenizers. For a multilingual support agent that processes 10M tokens a day, a 30% tokenization-efficiency gap is dollars and seconds, not a rounding error.
How does tokenization break on non-English text?
Older BPE vocabularies trained mostly on English text fragment non-English characters into many tokens. A single CJK character can take 2-4 tokens in cl100k_base; Hindi, Thai, Tamil, and Bengali are similar. Newer tokenizers (o200k_base, Llama 3, Gemma) added many multilingual tokens and bring the per-character cost closer to English. The eval implication: a model running on a non-English prompt is paying 2-3x the token cost of the equivalent English prompt and may hit context limits sooner. Track tokens-per-character per language as a tokenizer-quality metric.
How does tokenization affect eval scores?
Several ways. BLEU and ROUGE are computed on tokens, so the tokenizer used at evaluation time changes the score. Schema-validation evals on JSON output can fail if the tokenizer splits a JSON key in a way that the model has never seen. Refusal-rate and safety evals can shift because identical text under different tokenizers reaches different lengths and triggers different context handling. The pragmatic rule: use the same tokenizer for eval that the production model uses, and pin the tokenizer version. Tokenizer drift is a quiet source of regression. For depth, see the [What is LLM Evaluation](/blog/what-is-llm-evaluation-2026) explainer.
What is the right way to count tokens for OpenAI, Anthropic, and Llama models in 2026?
OpenAI: use [tiktoken](https://github.com/openai/tiktoken) with the encoding that matches the model (`cl100k_base` for GPT-4 turbo and earlier, `o200k_base` for GPT-4o family and GPT-5). Anthropic: use the [count_tokens endpoint](https://docs.anthropic.com/en/docs/build-with-claude/token-counting) or the official Anthropic SDK; Anthropic does not ship a public local tokenizer. Llama: use the tokenizer shipped with the model weights (Llama 2 used a SentencePiece BPE; Llama 3 and later use a tiktoken-style BPE tokenizer with a vocabulary of 128K or more). Do not approximate Anthropic counts from tiktoken; the vocabularies differ. Build the count into your gateway layer so cost dashboards reflect ground truth.