The first lossy compression every model performs.
Tokenization controls how efficiently your model sees text, code, math, and other modalities. A bad vocabulary wastes context window, fragments numbers, and hurts multilingual performance. The 2026 default is byte-level BPE with a 128k-200k vocab, but tokenizer-free byte-level models (BLT, MambaByte) are entering production-scale runs.
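To see the compression effect directly, here is a minimal sketch that measures bytes per token with tiktoken's o200k_base encoding; the sample strings are illustrative, not a benchmark.

```python
# Measure compression (bytes per token) with an off-the-shelf encoding.
# Assumes `pip install tiktoken`; o200k_base is tiktoken's ~200k-vocab encoding.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

samples = {
    "english prose": "Tokenization controls how efficiently a model sees text.",
    "python code": "def double_positives(xs):\n    return [x * 2 for x in xs if x > 0]\n",
    "digits": "Invoice total: 1234567.89 USD",
}

for name, text in samples.items():
    ids = enc.encode(text)
    ratio = len(text.encode("utf-8")) / len(ids)
    print(f"{name:14s} {len(ids):3d} tokens  {ratio:.2f} bytes/token")
```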
A frontier-grade implementation, in order.
BPE (production default), Unigram (SentencePiece), or byte-level patching (BLT). For new code: tiktoken or HuggingFace tokenizers; see the training sketch after this list.
Sample 1-10% of the pretraining mix, weighted by language. Train the vocab on the actual mixture you'll pretrain on; see the sampling sketch after this list.
100k-200k for monolingual+code, 256k-300k for highly multilingual. Larger vocab = better compression but bigger embedding/output matrices; see the parameter arithmetic after this list.
Split digits individually for arithmetic robustness. Add code-specific pre-tokenization (whitespace, indentation).
Reserve special tokens for image/audio/video patches, tool calls, system prompts, end-of-thought, etc.
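For the sampling step, a minimal sketch that draws a small, language-weighted subsample of the pretraining mix to train the vocabulary on; the file names and per-source keep-rates are illustrative assumptions.

```python
# Draw a language-weighted subsample of the pretraining mix for vocab training.
# File paths and per-source keep-rates are illustrative assumptions.
import gzip
import random

random.seed(0)

# Fraction of each source to keep (in the 1-10% range); upweighting
# lower-resource languages here directly shapes the learned vocabulary.
keep_rates = {
    "data/english.txt.gz": 0.02,
    "data/code.txt.gz": 0.05,
    "data/multilingual.txt.gz": 0.10,
}

with open("tokenizer_train_sample.txt", "w", encoding="utf-8") as out:
    for path, rate in keep_rates.items():
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if random.random() < rate:  # Bernoulli subsample per line
                    out.write(line)
```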
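To put numbers on the vocab-size trade-off, a back-of-the-envelope sketch of the parameter cost of the input embedding plus output (unembedding) matrix; the model widths are illustrative and assume untied weights.

```python
# Parameter cost of the vocabulary: input embedding + untied output projection.
# Model widths (d_model) are illustrative assumptions.
for vocab_size in (100_000, 200_000, 300_000):
    for d_model in (2048, 4096, 8192):
        params = 2 * vocab_size * d_model
        print(f"vocab={vocab_size:>7,}  d_model={d_model:>5}  ~{params / 1e9:.1f}B params")
```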
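And for the training step itself, a minimal sketch of training a byte-level BPE vocabulary with the HuggingFace tokenizers library, wiring in individual-digit splitting and a reserved special-token block; the vocab size, token names, and input file are assumptions, not a fixed recipe.

```python
# Train a byte-level BPE vocabulary with digit splitting and reserved special tokens.
# Vocab size, special-token names, and the input file are illustrative assumptions.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())

# Pre-tokenization: split every digit individually, then fall back to byte level.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])
tokenizer.decoder = decoders.ByteLevel()

# Reserved special tokens for multimodal patches, tool calls, chat structure, etc.
special_tokens = [
    "<|bos|>", "<|eos|>", "<|pad|>",
    "<|image_patch|>", "<|audio_patch|>", "<|video_patch|>",
    "<|tool_call|>", "<|system|>", "<|end_of_thought|>",
]

trainer = trainers.BpeTrainer(
    vocab_size=131_072,                                    # ~128k, per the guidance above
    special_tokens=special_tokens,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # seed with all 256 byte symbols
)

tokenizer.train(["tokenizer_train_sample.txt"], trainer)   # the sample built above
tokenizer.save("bpe_128k.json")

print(tokenizer.encode("total = 2026 * 3").tokens)
```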
In rough order of foundational → recent.
Neural Machine Translation of Rare Words with Subword Units · Sennrich, Haddow, Birch · 2016
SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing · Kudo, Richardson · 2018
Neural Machine Translation with Byte-Level Subwords · Wang, Cho, Gu · 2020
Byte Latent Transformer: Patches Scale Better Than Tokens · Pagnoni et al. (Meta) · 2024
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers · Yu et al. · 2023
MambaByte: Token-free Selective State Space Model · Wang et al. · 2024
Finite Scalar Quantization: VQ-VAE Made Simple · Mentzer et al. · 2023
High Fidelity Neural Audio Compression · Défossez et al. · 2022
Kyutai · 2024
Meta · 2024