Build a Frontier Model
Step 1 of 11

Pretraining Data

Garbage in, garbage out — at trillion-token scale.

Frontier models in 2026 are trained on 15-40T tokens. Quality, not raw volume, is what separates a great model from an also-ran. The pipeline starts with raw web crawl, then passes through aggressive filtering, deduplication, decontamination, and model-based quality scoring, and ends with a curated mix of code, math, multilingual, and synthetic data.

Why it matters

  • DeepSeek and FineWeb-Edu showed that better data filtering can match a 2-3x increase in compute.
  • Token-to-parameter ratios have exploded: Llama 3 trained an 8B model on 15T tokens (~1,800 tokens/param), far past Chinchilla-optimal.
  • Test-set leakage from web crawls silently inflates benchmark scores — decontamination is now a release-blocker.
  • Synthetic data is load-bearing: Phi-4 and post-R1 distillation pipelines lean heavily on model-generated text.

State of the art

2025-2026
  • FineWeb (15T tokens) and FineWeb-Edu replaced RefinedWeb as the open-data SOTA in 2024.
  • DCLM proved controlled data ablations beat intuition — its 7B/2T baseline matched Llama 3 8B.
  • Nemotron-CC (Dec 2024) used an LLM to rephrase low-quality web pages, recovering ~3T extra usable tokens.
  • Annealing on high-quality math/code/instruction data in the final 5-10% of pretraining is now standard (Llama 3, DeepSeek V3, Qwen 3).
  • RLVR-era pipelines distill reasoning traces from R1/o-series models as a primary post-training data source.

The recipe

A frontier-grade implementation, in order.

1. Crawl & extract

Common Crawl WARC dumps → trafilatura/Resiliparse for HTML→text. Filter by language ID (fastText), domain blocklists, NSFW classifiers.
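A minimal sketch of this stage, assuming trafilatura and fastText are installed and that the fastText language-ID model (lid.176.bin) has been downloaded locally; the thresholds are illustrative, not a published setting:

```python
import trafilatura
import fasttext

# Path to the downloaded fastText language-ID model is an assumption.
lang_model = fasttext.load_model("lid.176.bin")

def extract_english(html: str, min_chars: int = 500, min_conf: float = 0.65):
    """Return cleaned main-content text if the page is English and long enough, else None."""
    text = trafilatura.extract(html)           # HTML -> main-content text
    if text is None or len(text) < min_chars:  # drop boilerplate-only pages
        return None
    labels, probs = lang_model.predict(text.replace("\n", " "))
    if labels[0] == "__label__en" and probs[0] >= min_conf:
        return text
    return None
```

Domain blocklists and NSFW classifiers slot in as further predicates on the same per-page loop.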

2. Quality classification

Train a lightweight classifier (FineWeb-Edu uses Llama-3-70B educational-value labels on ~500k pages, distilled into a small encoder-based regressor). Drop the bottom 80%.
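A sketch of the scoring side, using the published FineWeb-Edu classifier checkpoint on the Hugging Face Hub (the checkpoint name and the threshold are assumptions here; swap in whatever classifier you distill yourself):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CKPT = "HuggingFaceFW/fineweb-edu-classifier"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(CKPT)
clf = AutoModelForSequenceClassification.from_pretrained(CKPT)

@torch.no_grad()
def edu_score(text: str) -> float:
    """Score a page's educational value; the classifier is a regression head over roughly 0-5."""
    inputs = tok(text, truncation=True, max_length=512, return_tensors="pt")
    return clf(**inputs).logits.squeeze().item()

def keep(text: str, threshold: float = 3.0) -> bool:
    return edu_score(text) >= threshold
```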

3. Deduplicate

MinHash + LSH at document level (Jaccard threshold ~0.7). Then SemDeDup or paragraph-level near-dup. Expect 60-80% reduction.
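A document-level sketch using the datasketch library; the 5-word shingles and 128 permutations are assumptions, only the ~0.7 Jaccard threshold comes from the recipe above:

```python
from datasketch import MinHash, MinHashLSH

def minhash(doc: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash over 5-word shingles of the document."""
    m = MinHash(num_perm=num_perm)
    words = doc.lower().split()
    if len(words) < 5:                      # very short doc: hash it whole
        m.update(" ".join(words).encode("utf-8"))
        return m
    for i in range(len(words) - 4):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def dedup(docs: dict[str, str], threshold: float = 0.7) -> list[str]:
    """Return the ids of documents that survive near-duplicate removal."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        m = minhash(text)
        if not lsh.query(m):                # no near-duplicate already indexed
            lsh.insert(doc_id, m)
            kept.append(doc_id)
    return kept
```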

4. Decontaminate

13-gram overlap against every eval set you might publish on. The Llama 3 paper documents the exact protocol.
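A whitespace-tokenized sketch of the 13-gram check; production pipelines typically normalize text and use the model tokenizer instead of plain splitting:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_eval_index(eval_texts: list[str], n: int = 13) -> set[tuple[str, ...]]:
    """Union of all 13-grams across every benchmark you plan to report on."""
    index: set[tuple[str, ...]] = set()
    for t in eval_texts:
        index |= ngrams(t, n)
    return index

def is_contaminated(doc: str, eval_index: set, n: int = 13) -> bool:
    """Flag a training document that shares any 13-gram with an eval set."""
    return not ngrams(doc, n).isdisjoint(eval_index)
```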

5. Mix & curriculum

Web 60-70%, code 15-25%, math 5-10%, multilingual 5-15%, books/papers 1-5%. Anneal on highest-quality data at the end.
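An illustrative two-phase sampler; the exact weights and the 10% annealing window are assumptions within the ranges above, not a published recipe:

```python
import random

# Main-phase and annealing-phase mixing weights (assumed, within the stated ranges).
MAIN_MIX = {"web": 0.65, "code": 0.20, "math": 0.05, "multilingual": 0.07, "books_papers": 0.03}
ANNEAL_MIX = {"web": 0.20, "code": 0.30, "math": 0.30, "instruction": 0.20}  # final ~10% of training

def sample_source(step: int, total_steps: int, anneal_frac: float = 0.10) -> str:
    """Pick which corpus the next batch is drawn from."""
    mix = ANNEAL_MIX if step >= (1 - anneal_frac) * total_steps else MAIN_MIX
    sources, weights = zip(*mix.items())
    return random.choices(sources, weights=weights, k=1)[0]
```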

6. Synthetic uplift

Persona-Hub style diverse generation, textbook-style rewrites (Phi), reasoning trace distillation from R1/o3.
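A hedged sketch of the rewrite step against an OpenAI-compatible endpoint; the model name and prompt wording are placeholders, not the actual Nemotron-CC or Phi prompts:

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key / base_url for your rewriting model

REWRITE_PROMPT = (
    "Rewrite the following web page as a clear, factual, textbook-style passage. "
    "Preserve all correct information, drop ads and navigation text, and do not "
    "add facts that are not in the original.\n\n{page}"
)

def rephrase(page: str, model: str = "your-rewriter-model") -> str:
    """Turn a low-quality page into expository text, Nemotron-CC / Phi style."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(page=page)}],
        temperature=0.7,
    )
    return resp.choices[0].message.content
```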

⚠️ Common pitfalls

  • Overly aggressive dedup hurts: keep medium-similarity duplicates (they act as regularizers). DCLM's ablations show where the optimum lies.
  • Quality classifiers learn surface features (formatting, domain). Audit a held-out sample by hand.
  • Synthetic data collapses diversity. Cap it at 30-50% of the mix and rotate generators.
  • Multilingual data quality drops off a cliff outside the top-20 languages; don't trust web crawl alone.