The first lossy compression every model performs.
Tokenization controls how efficiently your model sees text, code, math, and other modalities. A bad vocabulary wastes context window, fragments numbers, and hurts multilingual performance. The 2026 default is byte-level BPE with a 128k-200k vocab, but tokenizer-free byte-level models (BLT, MambaByte) are entering production-scale runs.
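To see the compression effect directly, here is a minimal sketch that measures bytes per token with tiktoken's o200k_base encoding; the sample strings are illustrative, not a benchmark.

```python
# Measure compression (bytes per token) with an off-the-shelf encoding.
# Assumes `pip install tiktoken`; o200k_base is tiktoken's ~200k-vocab encoding.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

samples = {
    "english prose": "Tokenization controls how efficiently a model sees text.",
    "python code": "def double_positives(xs):\n    return [x * 2 for x in xs if x > 0]\n",
    "digits": "Invoice total: 1234567.89 USD",
}

for name, text in samples.items():
    ids = enc.encode(text)
    ratio = len(text.encode("utf-8")) / len(ids)
    print(f"{name:14s} {len(ids):3d} tokens  {ratio:.2f} bytes/token")
```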
A frontier-grade implementation, in order.
BPE (production default), Unigram (SentencePiece), or byte-level patching (BLT). For new code: tiktoken or HuggingFace tokenizers; see the training sketch after this list.
Sample 1-10% of the pretraining mix, weighted by language. Train the vocab on the actual mixture you'll pretrain on; see the sampling sketch after this list.
100k-200k for monolingual+code, 256k-300k for highly multilingual. Larger vocab = better compression but bigger embedding/output matrices; see the parameter arithmetic after this list.
Split digits individually for arithmetic robustness. Add code-specific pre-tokenization (whitespace, indentation).
Reserve special tokens for image/audio/video patches, tool calls, system prompts, end-of-thought, etc.
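For the sampling step, a minimal sketch that draws a small, language-weighted subsample of the pretraining mix to train the vocabulary on; the file names and per-source keep-rates are illustrative assumptions.

```python
# Draw a language-weighted subsample of the pretraining mix for vocab training.
# File paths and per-source keep-rates are illustrative assumptions.
import gzip
import random

random.seed(0)

# Fraction of each source to keep (in the 1-10% range); upweighting
# lower-resource languages here directly shapes the learned vocabulary.
keep_rates = {
    "data/english.txt.gz": 0.02,
    "data/code.txt.gz": 0.05,
    "data/multilingual.txt.gz": 0.10,
}

with open("tokenizer_train_sample.txt", "w", encoding="utf-8") as out:
    for path, rate in keep_rates.items():
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if random.random() < rate:  # Bernoulli subsample per line
                    out.write(line)
```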
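To put numbers on the vocab-size trade-off, a back-of-the-envelope sketch of the parameter cost of the input embedding plus output (unembedding) matrix; the model widths are illustrative and assume untied weights.

```python
# Parameter cost of the vocabulary: input embedding + untied output projection.
# Model widths (d_model) are illustrative assumptions.
for vocab_size in (100_000, 200_000, 300_000):
    for d_model in (2048, 4096, 8192):
        params = 2 * vocab_size * d_model
        print(f"vocab={vocab_size:>7,}  d_model={d_model:>5}  ~{params / 1e9:.1f}B params")
```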
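And for the training step itself, a minimal sketch of training a byte-level BPE vocabulary with the HuggingFace tokenizers library, wiring in individual-digit splitting and a reserved special-token block; the vocab size, token names, and input file are assumptions, not a fixed recipe.

```python
# Train a byte-level BPE vocabulary with digit splitting and reserved special tokens.
# Vocab size, special-token names, and the input file are illustrative assumptions.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())

# Pre-tokenization: split every digit individually, then fall back to byte level.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Digits(individual_digits=True),
    pre_tokenizers.ByteLevel(add_prefix_space=False),
])
tokenizer.decoder = decoders.ByteLevel()

# Reserved special tokens for multimodal patches, tool calls, chat structure, etc.
special_tokens = [
    "<|bos|>", "<|eos|>", "<|pad|>",
    "<|image_patch|>", "<|audio_patch|>", "<|video_patch|>",
    "<|tool_call|>", "<|system|>", "<|end_of_thought|>",
]

trainer = trainers.BpeTrainer(
    vocab_size=131_072,                                    # ~128k, per the guidance above
    special_tokens=special_tokens,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # seed with all 256 byte symbols
)

tokenizer.train(["tokenizer_train_sample.txt"], trainer)   # the sample built above
tokenizer.save("bpe_128k.json")

print(tokenizer.encode("total = 2026 * 3").tokens)
```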
In rough order of foundational → recent.
Neural Machine Translation of Rare Words with Subword Units · Sennrich, Haddow, Birch · 2016
SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing · Kudo, Richardson · 2018
Neural Machine Translation with Byte-Level Subwords · Wang, Cho, Gu · 2020
Byte Latent Transformer: Patches Scale Better Than Tokens · Pagnoni et al. (Meta) · 2024
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers · Yu et al. · 2023
MambaByte: Token-free Selective State Space Model · Wang et al. · 2024
Finite Scalar Quantization: VQ-VAE Made Simple · Mentzer et al. · 2023
High Fidelity Neural Audio Compression · Défossez et al. · 2022
Kyutai · 2024
Meta · 2024