Tokenization: Building the Vocabulary
A model performs well in English but fails completely on a customer's Korean text. The vendor says "our tokenizer handles 100 languages." You need to figure out whether the tokenizer is the bottleneck before you retrain. What should you actually inspect?
The previous lesson treated tokens as a unit of measurement. This one is about where that vocabulary comes from. The tokenizer is one of the few pieces of an LLM that is not learned during model training; it is built ahead of time, frozen, and shipped with the weights forever. That decision shapes what the model is good at, what languages it speaks well, what it costs you to serve, and what your platform looks like in production.
What it is
A tokenizer is a deterministic function that takes a string and returns a sequence of integer token IDs. Each token ID corresponds to a string (a "subword unit") in the model's fixed vocabulary, which is typically 32K to 256K entries. The vocabulary plus the algorithm together are the tokenizer.
The two algorithms you will encounter in production:
- Byte-Pair Encoding (BPE): starts with single bytes (or characters) as the vocabulary, then iteratively merges the most frequent adjacent pair into a new token until the vocabulary hits the target size. The GPT models from GPT-2 onward, including GPT-4, and many open models use a BPE variant.
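- SentencePiece: a library that implements either BPE or a probabilistic model called Unigram. Critically, it treats the input as a raw stream of Unicode characters, with no whitespace pre-splitting, and includes whitespace as part of tokens (using ▁ to represent a space). An optional byte-fallback mode covers characters that are missing from the vocabulary. Llama-2, the early Mistral models, and Gemma use SentencePiece; Llama-3 switched to a byte-level BPE tokenizer built on tiktoken.
The practical difference between GPT-style BPE and SentencePiece is mostly in whitespace handling and language-agnosticism. Because SentencePiece does not rely on whitespace to find word boundaries, it works on languages that do not separate words with spaces without language-specific preprocessing. GPT-style BPE variants typically need a pre-tokenization step (such as a regex split on whitespace and punctuation) and behave subtly differently across languages.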
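The whitespace convention is easier to see in a toy form. The sketch below imitates the ▁ marker by hand and does not call the SentencePiece library; `mark_whitespace` and `restore_whitespace` are hypothetical helper names, and the split into two pieces is chosen by hand for illustration.

```python
# Toy imitation of the SentencePiece whitespace convention (not the library itself).
SPACE_MARK = "\u2581"  # the "▁" character used to represent a space


def mark_whitespace(text: str) -> str:
    """Replace spaces with the marker and add a leading marker."""
    return SPACE_MARK + text.replace(" ", SPACE_MARK)


def restore_whitespace(pieces: list[str]) -> str:
    """Concatenate pieces, turn markers back into spaces, drop the leading one."""
    joined = "".join(pieces).replace(SPACE_MARK, " ")
    return joined[1:] if joined.startswith(" ") else joined


marked = mark_whitespace("Hello world")
pieces = [marked[:6], marked[6:]]  # hand-picked split: "▁Hello" and "▁world"
print(pieces)
print(restore_whitespace(pieces))
```

Because the space travels inside the token, detokenization is a plain concatenation and needs no language-specific rules about where spaces belong.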
The vocabulary is fixed at training time. If your model was trained on a vocabulary that lacks subword units for, say, medical terminology or Korean morphemes, every Korean sentence and every drug name will fragment into many short tokens. Cost goes up, throughput drops, and the model has less effective context to work with. This is not a bug. It is a vocabulary mismatch, and it cannot be fixed without retraining.
How it works under the hood
Tokenization runs as a four-stage pipeline. Each stage is configurable per model.
The tokenization pipeline
1. Normalization. Optional Unicode normalization (NFC, NFKC), lowercasing, accent stripping. Most modern tokenizers do minimal normalization to preserve the original input. Some legacy tokenizers were destructive here, which is a portability problem.
2. Pre-tokenization. Split the string into chunks the algorithm can operate on, usually on whitespace and punctuation. SentencePiece skips this step and operates on the raw character stream.
3. Subword segmentation. Break each chunk into vocabulary entries. BPE replays its learned merges in priority order, Unigram picks the most probable segmentation, and WordPiece-style tokenizers greedily take the longest matching subword. This is the core of the algorithm. Output is a sequence of vocabulary entries.
4. ID lookup. Look up each subword in the vocabulary table to get its integer ID. The output is the integer sequence the model actually consumes.
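The four stages can be sketched end to end in plain Python. This is a minimal illustration, not any real tokenizer: `MERGES` and `VOCAB` are tiny hypothetical tables, the pre-tokenizer is a simple regex, and the segmentation step is a toy BPE that replays merges in order.

```python
import re
import unicodedata

# Hypothetical merge list and vocabulary; real tokenizers learn far larger ones.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]
VOCAB = {"l": 0, "o": 1, "w": 2, "e": 3, "r": 4,
         "lo": 5, "low": 6, "er": 7, "<unk>": 8}


def normalize(text: str) -> str:
    """Stage 1: minimal Unicode normalization."""
    return unicodedata.normalize("NFC", text)


def pre_tokenize(text: str) -> list[str]:
    """Stage 2: split into word and punctuation chunks."""
    return re.findall(r"\w+|[^\w\s]", text)


def segment(chunk: str) -> list[str]:
    """Stage 3: start from characters and replay merges in learned order."""
    symbols = list(chunk)
    for left, right in MERGES:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols


def to_ids(pieces: list[str]) -> list[int]:
    """Stage 4: map each piece to its integer ID (unknown pieces map to <unk>)."""
    return [VOCAB.get(piece, VOCAB["<unk>"]) for piece in pieces]


def encode(text: str) -> list[int]:
    ids = []
    for chunk in pre_tokenize(normalize(text)):
        ids.extend(to_ids(segment(chunk)))
    return ids


print(segment("lower"))     # pieces for one chunk
print(encode("lower low"))  # integer IDs for the whole string
```

A character that has no merge and no vocabulary entry falls through to `<unk>` here; byte-level tokenizers avoid that case by guaranteeing every byte has an ID, at the price of more tokens.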
When you train a tokenizer, you collect a corpus, run BPE for N merges, and freeze the result. Llama-3's tokenizer was trained on a corpus that is heavily English-weighted, with substantial code and a long tail of other languages. It uses 128,256 vocabulary entries (Llama-2 was 32,000), specifically expanded to handle multilingual and code use cases better.
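The "collect a corpus, run BPE for N merges, freeze the result" loop can be sketched as follows. This is a character-level toy under simplifying assumptions (whitespace-split words, ties broken by first occurrence); `train_bpe` is a hypothetical name and production trainers work on bytes with far larger corpora.

```python
from collections import Counter


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merges from whitespace-split words."""
    # Each word is a tuple of symbols, weighted by how often it occurs.
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; first one wins ties
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


merges = train_bpe(["low low low lower lowest"], num_merges=2)
print(merges)
```

The returned merge list, together with the base symbols, is the frozen artifact. Whatever was frequent in the training corpus earns its own token, which is why the corpus mix decides which languages tokenize cheaply.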
The reason this matters operationally: a single Korean sentence in Llama-2's tokenizer might fragment into 60 tokens. The same sentence in Llama-3's tokenizer is closer to 25 tokens. Same model family, same architecture, very different cost per Korean conversation, because the vocabulary was rebuilt with multilingual coverage in mind.
You can inspect any tokenizer's behavior in a few lines:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
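# add_special_tokens=False keeps <|begin_of_text|> out of the count.
ids = tok.encode("안녕하세요, 어떻게 도와드릴까요?", add_special_tokens=False)
# Decode one ID at a time to see what each token is.
print(len(ids), [tok.decode([i]) for i in ids])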
The decoded list shows what each token actually is. If the same string in English produces 8 tokens and the Korean version produces 35, you have your answer: the Korean is paying a tokenization tax of more than 4x for being underrepresented in the vocabulary.
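The ratio itself is a one-line calculation once you have an encode function. The sketch below assumes a hypothetical helper `token_tax` and uses a stand-in encoder that emits one token per UTF-8 byte, which approximates the worst case where a script has no learned subwords at all.

```python
from typing import Callable


def token_tax(encode: Callable[[str], list], baseline: str, other: str) -> float:
    """How many times more tokens `other` needs than `baseline`."""
    return len(encode(other)) / len(encode(baseline))


def byte_fallback_encode(text: str) -> list[int]:
    """Stand-in encoder: one token per UTF-8 byte (worst-case fragmentation)."""
    return list(text.encode("utf-8"))


english = "Hello, how can I help you?"
korean = "안녕하세요, 어떻게 도와드릴까요?"
print(round(token_tax(byte_fallback_encode, english, korean), 2))
```

To measure a real model, pass its own encode function instead of the stand-in, for example `lambda s: tok.encode(s, add_special_tokens=False)` with a tokenizer loaded as above.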
Operationalizing it
Three operational concerns recur every time tokenization touches your system:
1. Token count varies by language. This is the single most common surprise. The same prompt template, translated into different languages, can cost 1x to 5x more depending on the target language. If you operate a multilingual product, build per-language token budgets and per-language cost dashboards.
[Figure: Same sentence, different token costs (Llama-3 tokenizer). English: native vocabulary. Korean: underrepresented in pretraining.]
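A per-language budget can start as a small lookup table. The multipliers below are hypothetical placeholders, not measurements; the idea is to replace them with ratios measured on your own traffic. `language_budget` is an illustrative name.

```python
# Hypothetical tokens-per-English-token ratios; measure these on real prompts.
TOKENS_PER_ENGLISH_TOKEN = {"en": 1.0, "es": 1.3, "ko": 2.5}


def language_budget(english_prompt_tokens: int, context_window: int, lang: str) -> dict:
    """Scale an English-measured prompt size to another language."""
    prompt_tokens = round(english_prompt_tokens * TOKENS_PER_ENGLISH_TOKEN[lang])
    return {
        "lang": lang,
        "prompt_tokens": prompt_tokens,
        "remaining_context": context_window - prompt_tokens,
    }


for lang in TOKENS_PER_ENGLISH_TOKEN:
    print(language_budget(english_prompt_tokens=1000, context_window=8192, lang=lang))
```

The same table can feed a cost dashboard: the prompt template is identical across languages, but the remaining context and the bill are not.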
2. Chat templates add tokens you did not count. Every modern instruct model expects a specific chat template that wraps your messages with role markers. For Llama-3 they look like this:
<|begin_of_text|>
<|start_header_id|>user<|end_header_id|>
How do I deploy this?
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
These templates add 5-20 tokens per message turn. Always apply the template before counting:
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "How do I deploy this?"},
]
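templated = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The template already emits <|begin_of_text|>; do not let encode() add a second one.
ids = tok.encode(templated, add_special_tokens=False)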
print(f"Tokens including template: {len(ids)}")
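To see where the overhead comes from without loading a model, the layout shown above can be rendered by hand. This is a simplified approximation of the Llama-3 layout for illustration only; the authoritative template is the one stored with the tokenizer, and `render_llama3_style` and `count_special_markers` are hypothetical helpers.

```python
import re


def render_llama3_style(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Hand-written approximation of the role-marker layout shown above."""
    out = "<|begin_of_text|>"
    for message in messages:
        out += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should answer next.
        out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out


def count_special_markers(rendered: str) -> int:
    """Count <|...|> markers: a lower bound on the template's token overhead."""
    return len(re.findall(r"<\|[a-z_]+\|>", rendered))


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do I deploy this?"},
]
rendered = render_llama3_style(messages)
print(count_special_markers(rendered))
```

The marker count is only a floor: role names and the newlines after each header also consume tokens, which is why counting with the real tokenizer after applying the template is the reliable measurement.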
3. The tokenizer is a shipped artifact. If you are version-pinning your model, version-pin your tokenizer alongside it. Tokenizer files on the Hugging Face Hub are occasionally updated, and even a subtle update can change token IDs without changing the model weights, breaking caches and analytics that key on token IDs.
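One way to detect silent tokenizer drift is to fingerprint the vocabulary and store the hash next to the model version. The sketch below is an assumption about how you might do this, not a standard tool; `vocab_fingerprint` is a hypothetical helper that takes any token-to-ID mapping, such as the dictionary returned by a Hugging Face tokenizer's `get_vocab()`.

```python
import hashlib
import json


def vocab_fingerprint(vocab: dict[str, int]) -> str:
    """Stable SHA-256 over the token-to-ID mapping, independent of dict order."""
    canonical = json.dumps(sorted(vocab.items()), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


pinned = vocab_fingerprint({"low": 6, "er": 7, "<unk>": 8})
reordered = vocab_fingerprint({"<unk>": 8, "er": 7, "low": 6})
shifted = vocab_fingerprint({"low": 6, "er": 9, "<unk>": 8})

print(pinned == reordered)  # same mapping, different insertion order
print(pinned == shifted)    # one ID changed
```

Comparing the stored fingerprint at startup turns a silent ID shift into a loud failure. Pinning the download itself to a specific commit with the `revision` argument of `from_pretrained` addresses the same risk at the source.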
A team migrated from Llama-2 to Llama-3 and saw their per-conversation API cost go down by 40%. They credited the new model. In fact, about 80% of the saving came from the tokenizer (Llama-3's vocabulary is 4x larger and handles their multilingual support content much better) and only about 20% from the model itself. The model's quality improvement was real but was not what drove the cost win. Lesson: when costs change after a migration, check the tokenizer first.
Trade-offs and decision framework
When you are picking a model (or evaluating a vendor), the tokenizer should be one of your evaluation axes. Concretely:
- Run your real production prompts through each candidate's tokenizer. Compare token counts. If you operate in 5 languages, do all 5.
- Check vocabulary size. A 32K vocabulary will usually be more wasteful for non-English content than a 128K vocabulary. Not always; the training data matters too.
- Check special token handling. If you rely on tool calls, function calling, or structured output, look at how each tokenizer handles JSON braces, code fences, and any role markers. Some tokenize JSON significantly more efficiently than others.
- Check tokenizer speed. At very high QPS, the tokenizer itself can become a bottleneck on the client side. tiktoken and the Rust-backed Hugging Face tokenizers library are both fast. Pure-Python implementations are not.
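The checklist above can be turned into a small harness that runs the same prompts through every candidate and records token counts and encode time. This is a sketch with stand-in encoders (whitespace split and UTF-8 bytes) so it runs without downloads; `compare_tokenizers` is a hypothetical name, and in practice each entry would be a real tokenizer's encode function.

```python
import time
from typing import Callable

Encoder = Callable[[str], list]


def compare_tokenizers(encoders: dict[str, Encoder], prompts: list[str]) -> dict[str, dict]:
    """Total tokens and wall-clock encode time per candidate on the same prompts."""
    report = {}
    for name, encode in encoders.items():
        start = time.perf_counter()
        total_tokens = sum(len(encode(prompt)) for prompt in prompts)
        elapsed = time.perf_counter() - start
        report[name] = {"total_tokens": total_tokens, "seconds": elapsed}
    return report


# Stand-in encoders; swap in real candidates' encode functions.
encoders = {
    "whitespace": str.split,
    "utf8_bytes": lambda s: list(s.encode("utf-8")),
}
report = compare_tokenizers(encoders, ["a b c", "d e"])
for name, stats in report.items():
    print(name, stats["total_tokens"], f"{stats['seconds']:.6f}s")
```

Run it once per language with a representative sample of production prompts; a handful of synthetic sentences tends to understate the differences.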
If your domain is heavily code, scientific text, or non-Latin scripts, the model trained on a vocabulary that covers those well will outperform an "objectively better" model with a poorly-suited vocabulary, often by enough to flip the cost math.
Common mistakes
- Comparing models on price-per-token without comparing token-per-prompt. A model that is cheaper per token but tokenizes your input 2x worse is more expensive overall.
- Assuming all tokenizers are roughly the same. They are not. The same string can produce 1x, 2x, or 5x as many tokens across tokenizers. Always measure on your own input.
- Building a custom tokenizer for fine-tuning. Tempting and almost always wrong. The model's embedding table is keyed to the original vocabulary; replacing the tokenizer means retraining the embeddings. This is a real research problem, not a fine-tuning task.
- Ignoring the chat template in cost estimates. Multiply role-marker overhead by your average conversation length. At 5-20 tokens per turn, a 20-turn conversation carries 100-400 tokens of template overhead alone.
- Logging raw token IDs forever. If you upgrade the model, those IDs become meaningless. Log the decoded text or a structured form instead.
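The first mistake in the list comes down to simple arithmetic. The prices and token counts below are hypothetical, chosen only to show how a lower per-token price can still lose; `cost_per_prompt` is an illustrative helper.

```python
def cost_per_prompt(price_per_million_tokens: float, tokens_per_prompt: int) -> float:
    """Effective cost of one prompt: price per token times tokens actually used."""
    return price_per_million_tokens * tokens_per_prompt / 1_000_000


# Hypothetical numbers: model A is cheaper per token but tokenizes the prompt 2x worse.
model_a = cost_per_prompt(price_per_million_tokens=0.50, tokens_per_prompt=2400)
model_b = cost_per_prompt(price_per_million_tokens=0.80, tokens_per_prompt=1200)

print(f"model A: ${model_a:.5f} per prompt")
print(f"model B: ${model_b:.5f} per prompt")
print("cheaper overall:", "A" if model_a < model_b else "B")
```

The comparison only means something when `tokens_per_prompt` is measured with each model's own tokenizer on your own input.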
Why does the same prompt cost different amounts in different languages? What infrastructure decision does this drive when you're sizing a multi-region serving deployment?