Garbage in, garbage out — at trillion-token scale.
Frontier 2026 models are trained on 15-40T tokens. Quality, not raw volume, is what separates a great model from an also-ran. The pipeline starts with a raw web crawl, then passes through aggressive filtering, deduplication, decontamination, model-based quality scoring, and a curated mix of code, math, multilingual, and synthetic data.
A frontier-grade implementation, in order.
Extraction and filtering: Common Crawl WARC dumps → trafilatura/Resiliparse for HTML→text extraction. Filter by language ID (fastText), domain blocklists, and NSFW classifiers.
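A minimal extraction-and-filtering sketch, assuming warcio, trafilatura, and fastText with the lid.176.bin language-ID model are installed locally; the length and confidence thresholds are illustrative, and domain blocklists plus an NSFW classifier would be applied after the language check.

```python
"""Sketch: Common Crawl WARC -> cleaned English text (illustrative thresholds)."""
import fasttext
import trafilatura
from warcio.archiveiterator import ArchiveIterator

LANG_MODEL = fasttext.load_model("lid.176.bin")  # assumes the lid.176.bin file is local

def extract_english_docs(warc_path, min_chars=500, lang_conf=0.65):
    """Yield (url, text) for English pages in one WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read().decode("utf-8", errors="replace")
            text = trafilatura.extract(html)          # HTML -> main text
            if not text or len(text) < min_chars:
                continue
            # fastText language ID; predict() expects a single line of text
            labels, probs = LANG_MODEL.predict(text.replace("\n", " "))
            if labels[0] != "__label__en" or probs[0] < lang_conf:
                continue
            # Domain blocklists and an NSFW classifier would be applied here.
            yield record.rec_headers.get_header("WARC-Target-URI"), text
```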
Quality scoring: train a lightweight classifier (FineWeb-Edu uses Llama-3-70B labels on ~500k pages, distilled into a small encoder-based classifier). Drop the bottom ~80% by score.
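A scoring sketch using the published FineWeb-Edu classifier on the Hugging Face Hub; the repo id and the roughly 0-5 educational-score convention come from that release, and the 80th-percentile cut-off is illustrative rather than a fixed rule.

```python
"""Quality-scoring sketch with a distilled classifier; the percentile cut is illustrative."""
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "HuggingFaceFW/fineweb-edu-classifier"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
clf = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def edu_scores(texts, batch_size=32):
    """Return one educational-quality score per document (roughly 0-5)."""
    scores = []
    for i in range(0, len(texts), batch_size):
        batch = tok(texts[i:i + batch_size], return_tensors="pt",
                    padding=True, truncation=True, max_length=512)
        scores.extend(clf(**batch).logits.squeeze(-1).tolist())
    return scores

def keep_top_quintile(docs):
    """Drop the lowest-scoring ~80% of (doc_id, text) pairs."""
    scores = edu_scores([text for _, text in docs])
    cutoff = np.percentile(scores, 80)
    return [doc for doc, s in zip(docs, scores) if s >= cutoff]
```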
Deduplication: MinHash + LSH at the document level (Jaccard threshold ~0.7), then SemDeDup or paragraph-level near-duplicate removal. Expect a 60-80% reduction.
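A document-level sketch with the datasketch library; the word-shingle size and permutation count are common defaults here, not prescriptive settings.

```python
"""Minimal MinHash + LSH dedup sketch (document level, Jaccard threshold ~0.7)."""
from datasketch import MinHash, MinHashLSH

def minhash_signature(text, num_perm=128, shingle=5):
    """Build a MinHash signature over word 5-gram shingles."""
    words = text.lower().split()
    sig = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - shingle + 1, 1)):
        sig.update(" ".join(words[i:i + shingle]).encode("utf-8"))
    return sig

def dedup(docs, threshold=0.7, num_perm=128):
    """Keep the first document of each near-duplicate cluster (estimated Jaccard >= threshold)."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for doc_id, text in docs:
        sig = minhash_signature(text, num_perm=num_perm)
        if lsh.query(sig):        # an earlier, near-identical document already exists
            continue
        lsh.insert(doc_id, sig)
        kept.append((doc_id, text))
    return kept
```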
Decontamination: 13-gram overlap against every eval set you might publish on. The Llama 3 paper documents the exact protocol.
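A self-contained sketch of the n-gram check; tokenization is a simple regex here, and how eval texts are loaded is left as a placeholder, whereas a production run would normalize text more carefully and also cover few-shot prompts.

```python
"""Sketch: drop any training document that shares a 13-gram with an eval set."""
import re

N = 13

def ngrams(text, n=N):
    """Set of lowercased word 13-grams in a text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_eval_index(eval_texts):
    """Union of all 13-grams across eval prompts, targets, and few-shot examples."""
    index = set()
    for text in eval_texts:
        index |= ngrams(text)
    return index

def is_contaminated(doc_text, eval_index):
    """True if the training document shares at least one 13-gram with any eval set."""
    return not ngrams(doc_text).isdisjoint(eval_index)
```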
Data mix: web 60-70%, code 15-25%, math 5-10%, multilingual 5-15%, books/papers 1-5%. Anneal on the highest-quality data at the end of training.
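An illustrative two-phase mixture config: the weights are points picked inside the ranges above, and the source names and anneal-phase categories are placeholders rather than a specific published recipe.

```python
"""Illustrative sampling weights for a main phase and a final high-quality anneal phase."""
import random

MAIN_MIX = {            # sampling probability per data source (sums to 1.0)
    "web": 0.65,
    "code": 0.20,
    "math": 0.07,
    "multilingual": 0.05,
    "books_papers": 0.03,
}

ANNEAL_MIX = {          # final phase: upweight the highest-quality and curated data
    "web_top_quality": 0.40,
    "code": 0.20,
    "math": 0.20,
    "books_papers": 0.10,
    "synthetic": 0.10,
}

def sample_source(mix, rng=random):
    """Pick which data source the next training batch is drawn from."""
    sources, weights = zip(*mix.items())
    return rng.choices(sources, weights=weights, k=1)[0]
```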
Synthetic data: Persona-Hub-style diverse generation, textbook-style rewrites (Phi), and reasoning-trace distillation from R1/o3.
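A persona-conditioned generation sketch of the Persona-Hub idea; llm_complete is a placeholder for whatever inference endpoint you use, and the personas and prompt template are illustrative, not taken from the Persona-Hub release.

```python
"""Sketch of persona-conditioned synthetic data generation.
`llm_complete` is a placeholder: any callable that maps a prompt string to a completion."""

PERSONAS = [
    "a structural engineer who inspects bridges",
    "a high-school chemistry teacher preparing lab demos",
    "a speedrunner who documents glitch exploits",
]

PROMPT = (
    "You are {persona}. Write a challenging math word problem grounded in "
    "your daily work, then give a complete step-by-step solution."
)

def generate_synthetic(llm_complete, personas=PERSONAS):
    """Yield one synthetic (problem + solution) document per persona."""
    for persona in personas:
        yield llm_complete(PROMPT.format(persona=persona))
```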
In rough order of foundational → recent.
Penedo et al. (Hugging Face) · 2024
Li et al. · 2024
NVIDIA · 2024
Penedo et al. · 2023
Hoffmann et al. (DeepMind) · 2022
Sardana et al. · 2024
Muennighoff et al. · 2023
Maini et al. · 2024
Chan et al. · 2024
Abbas et al. · 2023
Microsoft · 2024