Build a Frontier Model
Step 2 of 11

🔤 Tokenization

The first lossy compression every model performs.

Tokenization controls how efficiently your model sees text, code, math, and other modalities. A bad vocabulary wastes context window, fragments numbers, and hurts multilingual performance. The 2026 default is byte-level BPE with a 128k-200k vocab — but byte-level tokenizer-free models (BLT, MambaByte) are entering production-scale runs.
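
For a concrete feel of what the vocabulary buys you, a few lines of tiktoken compare token counts under the ~100k and ~200k production vocabularies. The sample strings below are placeholders; point this at your own data.

    import tiktoken

    samples = [
        "def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)",
        "Tokenization controls how efficiently your model sees text.",
    ]
    chars = sum(len(s) for s in samples)
    for name in ("cl100k_base", "o200k_base"):   # ~100k vs ~200k vocab
        enc = tiktoken.get_encoding(name)
        tokens = sum(len(enc.encode(s)) for s in samples)
        print(f"{name}: {tokens} tokens for {chars} characters "
              f"({chars / tokens:.2f} chars/token)")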

Why it matters

  • Llama 3 expanded its tokenizer vocab from 32k to 128k; Meta reports the new tokenizer encodes the same text in roughly 15% fewer tokens than Llama 2's, which translates into more effective throughput on the same hardware.
  • Tokenization choices silently bias arithmetic: GPT-3-era BPE merged digit strings inconsistently, so a number like '1234' could land in a single token while '12345' split into two, breaking digit-by-digit math (see the audit sketch after this list).
  • Multimodal models need shared vocabularies across text, images, and audio; otherwise per-modality token costs diverge wildly.
  • Low-frame-rate audio tokenizers (Mimi at ~12.5 Hz, built on EnCodec-style residual quantization) make real-time voice models like Moshi and GPT-4o's voice mode possible.
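
A quick way to audit digit handling, sketched with tiktoken: the probe strings are arbitrary, and the splits you see are whatever the released vocabularies actually contain, so print rather than assume.

    import tiktoken

    # Inspect how each vocabulary groups numbers.
    for name in ("cl100k_base", "o200k_base"):
        enc = tiktoken.get_encoding(name)
        for s in ("1234", "12345", "3.14159"):
            pieces = [enc.decode([tid]) for tid in enc.encode(s)]
            print(f"{name:12s} {s!r:>10} -> {pieces}")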

State of the art

2025-2026
  • tiktoken-style byte-level BPE, moving from cl100k_base (~100k vocab) to o200k_base (~200k), is the production default.
  • FSQ (Finite Scalar Quantization) has largely displaced VQ-VAE codebooks in image/video tokenizers since 2024, sidestepping codebook collapse (see the sketch after this list).
  • BLT (Byte Latent Transformer, Meta Dec 2024) uses dynamic patching to match BPE-trained Llama at 8B scale — the most interesting tokenization paper of the year.
  • Chameleon-style unified text+image vocabularies enabled native multimodal training in GPT-4o, Gemini, and Llama 4.
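
A minimal sketch of the FSQ idea in plain NumPy: each latent dimension is bounded and rounded to a small fixed set of levels, so the "codebook" is implicit. Odd level counts keep the rounding symmetric (the FSQ paper also handles even counts with a half-step offset); no straight-through estimator is shown, which real training code needs.

    import numpy as np

    LEVELS = np.array([7, 5, 5])             # levels per latent dim; 7*5*5 = 175 codes
    HALF = (LEVELS - 1) / 2.0

    def fsq_quantize(z):
        """Bound each dim with tanh, round to the grid, rescale back to [-1, 1]."""
        return np.round(np.tanh(z) * HALF) / HALF

    def fsq_index(zq):
        """Map a quantized vector to a single integer code (mixed-radix digits)."""
        digits = np.round(zq * HALF + HALF).astype(int)           # 0 .. LEVELS[i]-1
        bases = np.concatenate(([1], np.cumprod(LEVELS[:-1])))
        return int((digits * bases).sum())

    z = np.random.randn(3)                    # a latent from the encoder
    zq = fsq_quantize(z)
    print(zq, "-> code", fsq_index(zq))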

The recipe

A frontier-grade implementation, in order.

1. Pick an algorithm

BPE (production default), Unigram (SentencePiece), or byte-level patching (BLT). For new code: tiktoken or HuggingFace tokenizers.
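
A sketch of the two trained-vocab options using the HuggingFace tokenizers library; the vocab size and special token below are placeholders, not a recommendation.

    from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

    def make_bpe():
        # Byte-level BPE: the production default.
        tok = Tokenizer(models.BPE())
        tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
        tok.decoder = decoders.ByteLevel()
        trainer = trainers.BpeTrainer(vocab_size=128_000, special_tokens=["<|endoftext|>"])
        return tok, trainer

    def make_unigram():
        # Unigram LM vocab, SentencePiece-style.
        tok = Tokenizer(models.Unigram())
        tok.pre_tokenizer = pre_tokenizers.Metaspace()
        tok.decoder = decoders.Metaspace()
        trainer = trainers.UnigramTrainer(vocab_size=128_000, special_tokens=["<|endoftext|>"])
        return tok, trainer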

2. Train on representative data

Sample 1-10% of pretraining mix, weighted by language. Train vocab on the actual mixture you'll pretrain on.
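
One way this can look with HuggingFace tokenizers. The shard paths, language weights, and keep-rate heuristic below are hypothetical and assume roughly equal-sized shards; real pipelines weight by bytes.

    import random
    from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

    SHARDS = {"en": ["en_00.txt"], "code": ["code_00.txt"], "zh": ["zh_00.txt"]}  # hypothetical paths
    WEIGHTS = {"en": 0.5, "code": 0.3, "zh": 0.2}                                 # target mix
    SAMPLE_FRACTION = 0.05                                                        # ~5% of the mix

    def sampled_lines():
        # Crude weighting: keep-rate proportional to the target weight.
        for lang, paths in SHARDS.items():
            rate = SAMPLE_FRACTION * WEIGHTS[lang] / max(WEIGHTS.values())
            for path in paths:
                with open(path, encoding="utf-8") as f:
                    for line in f:
                        if random.random() < rate:
                            yield line

    tok = Tokenizer(models.BPE())
    tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tok.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(vocab_size=128_000, special_tokens=["<|endoftext|>"])
    tok.train_from_iterator(sampled_lines(), trainer=trainer)
    tok.save("tokenizer.json")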

3. Choose vocab size

100k-200k for monolingual+code, 256k-300k for highly multilingual. Larger vocab = better compression but bigger embedding/output matrices.
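
A back-of-envelope sketch of the trade-off; the d_model and untied output head here are illustrative, not tied to any particular model.

    # Parameter cost of the vocab vs. measured compression.
    def embedding_params(vocab_size: int, d_model: int, tied: bool = False) -> int:
        # Input embedding plus (if untied) the output projection.
        return vocab_size * d_model * (1 if tied else 2)

    def bytes_per_token(encode, text: str) -> float:
        # Higher is better: more raw bytes packed into each token of context.
        return len(text.encode("utf-8")) / len(encode(text))

    d_model = 8192                            # illustrative, ~70B-class width
    for v in (100_000, 128_000, 200_000, 256_000):
        print(f"vocab {v:>7,}: {embedding_params(v, d_model) / 1e9:.2f}B embedding params")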

4. Pre-tokenize digits + code

Split digits individually for arithmetic robustness. Add code-specific pre-tokenization (whitespace, indentation).
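
With HuggingFace tokenizers, digit splitting can be attached as a pre-tokenization step; `tok` is the Tokenizer from the earlier sketches, and code-specific whitespace rules would be added as further steps in the same sequence.

    from tokenizers import pre_tokenizers

    # Split every digit into its own pre-token so numbers are always seen
    # digit-by-digit; ByteLevel then handles everything else.
    tok.pre_tokenizer = pre_tokenizers.Sequence([
        pre_tokenizers.Digits(individual_digits=True),
        pre_tokenizers.ByteLevel(add_prefix_space=False),
    ])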

5. Add modality tokens

Reserve special tokens for image/audio/video patches, tool calls, system prompts, end-of-thought, etc.
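
A sketch of reserving modality and control tokens up front; every token name here is hypothetical, and the point is simply to fix the IDs before training so they never move.

    # Reserve control and modality tokens before training so their IDs stay stable.
    SPECIALS = [
        "<|begin_of_text|>", "<|end_of_text|>",
        "<|image_patch|>", "<|audio_patch|>", "<|video_patch|>",
        "<|tool_call|>", "<|tool_result|>",
        "<|system|>", "<|end_of_thought|>",
    ] + [f"<|reserved_{i}|>" for i in range(64)]      # headroom for later
    tok.add_special_tokens(SPECIALS)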

⚠️ Common pitfalls

  • Don't reuse another model's tokenizer for a new pretraining run: a mismatched vocab costs throughput and quality.
  • Glitch tokens (SolidGoldMagikarp, etc.) emerge from rare merges: audit the long tail of your vocab.
  • Tokenizer changes are not backwards-compatible: switching invalidates all cached prompts and KV caches.
  • Byte-level fallback is critical: never crash on unseen bytes (emoji, rare scripts).
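
A minimal round-trip audit for the byte-fallback pitfall, using tiktoken's o200k_base as a stand-in for your own tokenizer; the probe strings are arbitrary.

    import tiktoken

    # Encoding must never fail, and decode(encode(s)) must reproduce s exactly,
    # including emoji, rare scripts, and control bytes.
    enc = tiktoken.get_encoding("o200k_base")
    probes = ["hello", "🦙🔥", "ᚠᚢᚦᚨᚱᚲ", "काठमाडौँ", "\x00\x1b[31m"]
    for s in probes:
        assert enc.decode(enc.encode(s)) == s, f"round-trip failed for {s!r}"
    print("byte-level fallback OK on", len(probes), "probes")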