Garbage in, garbage out — at trillion-token scale.
Frontier 2026 models are trained on 15-40T tokens. Quality, not raw volume, is what separates a great model from an also-ran. The pipeline starts with a raw web crawl, then passes through aggressive filtering, deduplication, decontamination, model-based quality scoring, and a curated mix of code, math, multilingual, and synthetic data.
A frontier-grade implementation, in order.
Extraction and filtering: Common Crawl WARC dumps → trafilatura/Resiliparse for HTML→text extraction. Filter by language ID (fastText), domain blocklists, and NSFW classifiers.
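A minimal extraction-and-filtering sketch, assuming warcio, trafilatura, and fastText with the lid.176.bin language-ID model are installed locally; the length and confidence thresholds are illustrative, and domain blocklists plus an NSFW classifier would be applied after the language check.

```python
"""Sketch: Common Crawl WARC -> cleaned English text (illustrative thresholds)."""
import fasttext
import trafilatura
from warcio.archiveiterator import ArchiveIterator

LANG_MODEL = fasttext.load_model("lid.176.bin")  # assumes the lid.176.bin file is local

def extract_english_docs(warc_path, min_chars=500, lang_conf=0.65):
    """Yield (url, text) for English pages in one WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read().decode("utf-8", errors="replace")
            text = trafilatura.extract(html)          # HTML -> main text
            if not text or len(text) < min_chars:
                continue
            # fastText language ID; predict() expects a single line of text
            labels, probs = LANG_MODEL.predict(text.replace("\n", " "))
            if labels[0] != "__label__en" or probs[0] < lang_conf:
                continue
            # Domain blocklists and an NSFW classifier would be applied here.
            yield record.rec_headers.get_header("WARC-Target-URI"), text
```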
Quality scoring: train a lightweight classifier (FineWeb-Edu uses Llama-3-70B labels on ~500k pages, distilled into a small encoder-based classifier). Drop the bottom ~80% by score.
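A scoring sketch using the published FineWeb-Edu classifier on the Hugging Face Hub; the repo id and the roughly 0-5 educational-score convention come from that release, and the 80th-percentile cut-off is illustrative rather than a fixed rule.

```python
"""Quality-scoring sketch with a distilled classifier; the percentile cut is illustrative."""
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "HuggingFaceFW/fineweb-edu-classifier"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
clf = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def edu_scores(texts, batch_size=32):
    """Return one educational-quality score per document (roughly 0-5)."""
    scores = []
    for i in range(0, len(texts), batch_size):
        batch = tok(texts[i:i + batch_size], return_tensors="pt",
                    padding=True, truncation=True, max_length=512)
        scores.extend(clf(**batch).logits.squeeze(-1).tolist())
    return scores

def keep_top_quintile(docs):
    """Drop the lowest-scoring ~80% of (doc_id, text) pairs."""
    scores = edu_scores([text for _, text in docs])
    cutoff = np.percentile(scores, 80)
    return [doc for doc, s in zip(docs, scores) if s >= cutoff]
```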
Deduplication: MinHash + LSH at the document level (Jaccard threshold ~0.7), then SemDeDup or paragraph-level near-duplicate removal. Expect a 60-80% reduction.
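A document-level sketch with the datasketch library; the word-shingle size and permutation count are common defaults here, not prescriptive settings.

```python
"""Minimal MinHash + LSH dedup sketch (document level, Jaccard threshold ~0.7)."""
from datasketch import MinHash, MinHashLSH

def minhash_signature(text, num_perm=128, shingle=5):
    """Build a MinHash signature over word 5-gram shingles."""
    words = text.lower().split()
    sig = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - shingle + 1, 1)):
        sig.update(" ".join(words[i:i + shingle]).encode("utf-8"))
    return sig

def dedup(docs, threshold=0.7, num_perm=128):
    """Keep the first document of each near-duplicate cluster (estimated Jaccard >= threshold)."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for doc_id, text in docs:
        sig = minhash_signature(text, num_perm=num_perm)
        if lsh.query(sig):        # an earlier, near-identical document already exists
            continue
        lsh.insert(doc_id, sig)
        kept.append((doc_id, text))
    return kept
```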
Decontamination: 13-gram overlap against every eval set you might publish on. The Llama 3 paper documents the exact protocol.
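A self-contained sketch of the n-gram check; tokenization is a simple regex here, and how eval texts are loaded is left as a placeholder, whereas a production run would normalize text more carefully and also cover few-shot prompts.

```python
"""Sketch: drop any training document that shares a 13-gram with an eval set."""
import re

N = 13

def ngrams(text, n=N):
    """Set of lowercased word 13-grams in a text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_eval_index(eval_texts):
    """Union of all 13-grams across eval prompts, targets, and few-shot examples."""
    index = set()
    for text in eval_texts:
        index |= ngrams(text)
    return index

def is_contaminated(doc_text, eval_index):
    """True if the training document shares at least one 13-gram with any eval set."""
    return not ngrams(doc_text).isdisjoint(eval_index)
```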
Data mix: web 60-70%, code 15-25%, math 5-10%, multilingual 5-15%, books/papers 1-5%. Anneal on the highest-quality data at the end of training.
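An illustrative two-phase mixture config: the weights are points picked inside the ranges above, and the source names and anneal-phase categories are placeholders rather than a specific published recipe.

```python
"""Illustrative sampling weights for a main phase and a final high-quality anneal phase."""
import random

MAIN_MIX = {            # sampling probability per data source (sums to 1.0)
    "web": 0.65,
    "code": 0.20,
    "math": 0.07,
    "multilingual": 0.05,
    "books_papers": 0.03,
}

ANNEAL_MIX = {          # final phase: upweight the highest-quality and curated data
    "web_top_quality": 0.40,
    "code": 0.20,
    "math": 0.20,
    "books_papers": 0.10,
    "synthetic": 0.10,
}

def sample_source(mix, rng=random):
    """Pick which data source the next training batch is drawn from."""
    sources, weights = zip(*mix.items())
    return rng.choices(sources, weights=weights, k=1)[0]
```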
Synthetic data: Persona-Hub-style diverse generation, textbook-style rewrites (Phi), and reasoning-trace distillation from R1/o3.
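A persona-conditioned generation sketch of the Persona-Hub idea; llm_complete is a placeholder for whatever inference endpoint you use, and the personas and prompt template are illustrative, not taken from the Persona-Hub release.

```python
"""Sketch of persona-conditioned synthetic data generation.
`llm_complete` is a placeholder: any callable that maps a prompt string to a completion."""

PERSONAS = [
    "a structural engineer who inspects bridges",
    "a high-school chemistry teacher preparing lab demos",
    "a speedrunner who documents glitch exploits",
]

PROMPT = (
    "You are {persona}. Write a challenging math word problem grounded in "
    "your daily work, then give a complete step-by-step solution."
)

def generate_synthetic(llm_complete, personas=PERSONAS):
    """Yield one synthetic (problem + solution) document per persona."""
    for persona in personas:
        yield llm_complete(PROMPT.format(persona=persona))
```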
In rough order of foundational → recent.
Penedo et al. (Hugging Face) · 2024
Li et al. · 2024
NVIDIA · 2024
Penedo et al. · 2023
Hoffmann et al. (DeepMind) · 2022
Sardana et al. · 2024
Muennighoff et al. · 2023
Maini et al. · 2024
Chan et al. · 2024
Abbas et al. · 2023
Microsoft · 2024