The $100M-$1B run where everything has to go right.
A frontier pretraining run in 2026 burns 10k-50k+ Blackwell GPUs for 2-6 months on 15-40T tokens. Scaling laws, hyperparameter transfer (μP), low-precision training (FP8/MXFP4), and a curriculum that anneals on the highest-quality data at the end are now table stakes. Multi-token prediction (MTP) is the new auxiliary objective.
A frontier-grade implementation, in order.
Step 1: Fit your own scaling laws. Train 8-12 small models (10M-1B params) at varying token counts and fit the scaling-law constants on YOUR data. Don't trust Chinchilla's defaults.
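A minimal sketch of the fit, assuming the Chinchilla-style parametric form L(N, D) = E + A/N^α + B/D^β. The (N, D, loss) points below are placeholders standing in for the measured eval losses from your own sweep:

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder (params, tokens, eval loss) points; substitute the measured
# losses from your own 8-12 small-model runs.
N = np.array([1e7, 1e7, 4e7, 4e7, 1.6e8, 1.6e8, 6.4e8, 1e9])
D = np.array([2e8, 8e8, 8e8, 3.2e9, 3.2e9, 1.28e10, 1.28e10, 2e10])
loss = np.array([5.34, 4.71, 4.06, 3.63, 3.24, 2.95, 2.71, 2.58])

def parametric_loss(X, E, A, B, alpha, beta):
    n, d = X
    return E + A / n**alpha + B / d**beta

(E, A, B, alpha, beta), _ = curve_fit(
    parametric_loss, (N, D), loss,
    p0=(1.69, 406.0, 411.0, 0.34, 0.28),  # Chinchilla's published fit as a starting point
    maxfev=50_000,
)
print(f"L(N, D) = {E:.2f} + {A:.0f}/N^{alpha:.2f} + {B:.0f}/D^{beta:.2f}")
```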
Step 2: Transfer hyperparameters with μP. Find the optimal LR, init scale, and batch size on a 40M-100M-param proxy, then transfer them to the target scale via μTransfer.
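A hand-rolled sketch of the μP rules for Adam from Tensor Programs V (hidden init variance ~ 1/fan_in, a 1/m output multiplier, hidden-matrix LR scaled by 1/m, where m is width over base width). BASE_WIDTH, MupBlock, and mup_adam_groups are illustrative names, not a real library:

```python
import torch
import torch.nn as nn

BASE_WIDTH = 256  # width of the proxy model where the sweep was run (hypothetical)

class MupBlock(nn.Module):
    """Toy μP-parameterized block: embed -> hidden matmul -> readout."""
    def __init__(self, vocab: int, width: int):
        super().__init__()
        self.m = width / BASE_WIDTH               # width multiplier
        self.embed = nn.Embedding(vocab, width)
        self.hidden = nn.Linear(width, width, bias=False)
        self.out = nn.Linear(width, vocab, bias=False)
        nn.init.normal_(self.embed.weight, std=1.0)             # input: width-independent init
        nn.init.normal_(self.hidden.weight, std=width ** -0.5)  # hidden: var ~ 1/fan_in
        nn.init.zeros_(self.out.weight)                         # zero readout init (common μP choice)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.hidden(self.embed(idx)))
        return self.out(h) / self.m   # 1/m output multiplier keeps logits O(1) as width grows

def mup_adam_groups(model: MupBlock, base_lr: float):
    # Adam rules: hidden-matrix LR shrinks as 1/m; embedding and readout keep the
    # base LR (the readout is tamed by its 1/m forward multiplier instead).
    return [
        {"params": model.embed.parameters(), "lr": base_lr},
        {"params": model.hidden.parameters(), "lr": base_lr / model.m},
        {"params": model.out.parameters(), "lr": base_lr},
    ]

model = MupBlock(vocab=32_000, width=1024)
opt = torch.optim.AdamW(mup_adam_groups(model, base_lr=3e-3), betas=(0.9, 0.95))
```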
Step 3: Set the token budget. If you'll deploy at scale, train on 5-30x the Chinchilla-optimal token count (per Sardana et al.'s inference-aware scaling laws); otherwise stay near 20 tokens/param.
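To make the inference-aware trade-off concrete, here is a toy search that minimizes lifetime compute, approximated as 6ND for training plus 2N per served token, subject to hitting a target loss under the fitted law from step 1. All constants, the target loss, and the inference volume are placeholders:

```python
import numpy as np

# Fitted scaling-law constants from step 1 (placeholder values).
E, A, B, alpha, beta = 1.70, 400.0, 410.0, 0.34, 0.28
TARGET_LOSS = 2.10
D_INFERENCE = 5e12  # lifetime inference tokens you expect to serve (assumption)

def tokens_for_loss(n):
    """Smallest D such that L(n, D) <= TARGET_LOSS under the fitted law."""
    gap = TARGET_LOSS - E - A / n**alpha
    return (B / gap) ** (1 / beta) if gap > 0 else np.inf

# Grid-search model size: training cost ~6*N*D, inference cost ~2*N per token.
cost, n = min(
    (6 * n * tokens_for_loss(n) + 2 * n * D_INFERENCE, n)
    for n in np.geomspace(1e9, 1e12, 400)
)
d = tokens_for_loss(n)
print(f"N={n:.3g} params, D={d:.3g} tokens ({d/n:.0f} tokens/param), "
      f"lifetime FLOPs={cost:.3g}")
```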
Step 4: Pick the precision recipe. FP8 forward compute, BF16 master weights, FP32 optimizer states; keep embeddings, the output head, and gradient-norm computations in higher precision.
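As one concrete realization (an assumption about tooling, not necessarily what any given lab runs), NVIDIA's Transformer Engine exposes FP8 GEMMs via te.fp8_autocast; this requires an FP8-capable (Hopper/Blackwell-class) GPU, and the layer sizes and recipe settings here are illustrative:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID = E4M3 in the forward pass, E5M2 for gradients; history length is illustrative.
recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16, amax_compute_algo="max")

layer = te.Linear(4096, 4096, bias=False).cuda()   # FP8-capable GEMM; stored weights stay high precision
x = torch.randn(16, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)            # GEMM executes in FP8 with per-tensor scaling factors
y.float().sum().backward()  # master weights and optimizer states remain BF16/FP32 outside the cast
```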
Step 5: Stage the data curriculum. Stage 1 (0-90% of tokens): broad mix at short context. Stage 2 (90-95%): long-context anneal. Stage 3 (95-100%): anneal on high-quality math, code, and instruction data.
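A sketch of how such a schedule might be expressed in a training loop; the source names and mixture weights are invented for illustration, while the stage boundaries follow the text:

```python
# Invented source names and mixture weights; stage boundaries follow the text.
STAGES = [
    # (progress_end, context_len, sampling mix)
    (0.90, 4_096,   {"web": 0.70, "code": 0.15, "math": 0.05, "papers": 0.10}),
    (0.95, 131_072, {"long_docs": 0.45, "web": 0.35, "code": 0.15, "math": 0.05}),
    (1.00, 131_072, {"math": 0.35, "code": 0.35, "instructions": 0.20, "web": 0.10}),
]

def stage_for(tokens_seen: int, total_tokens: int):
    """Return (context_len, mix) for the current point in the token budget."""
    progress = tokens_seen / total_tokens
    for progress_end, ctx_len, mix in STAGES:
        if progress <= progress_end:
            return ctx_len, mix
    return STAGES[-1][1:]  # clamp past the end

ctx_len, mix = stage_for(tokens_seen=13_800_000_000_000, total_tokens=15_000_000_000_000)
```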
Step 6: Add multi-token prediction. Add 1-4 future-token prediction heads with a loss weight of ~0.1: a free quality boost plus faster inference via self-speculative decoding.
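A minimal parallel-heads sketch in the spirit of Gloeckle et al.; production variants differ (DeepSeek-V3, for instance, uses sequential MTP modules rather than parallel linear heads). The 0.1 weight and 1-4 head count follow the text; all class and argument names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    """n_extra linear heads; head i predicts token t+1+i from the hidden state at t."""
    def __init__(self, d_model: int, vocab: int, n_extra: int = 2, aux_weight: float = 0.1):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab, bias=False) for _ in range(n_extra))
        self.aux_weight = aux_weight

    def loss(self, hidden, main_logits, targets):
        # hidden: (B, T, d_model); targets: (B, T), where targets[:, t] is token t+1.
        total = F.cross_entropy(main_logits.flatten(0, 1), targets.flatten())
        for i, head in enumerate(self.heads, start=1):
            # Head i at position t predicts the target shifted i steps further ahead.
            logits = head(hidden[:, :-i])
            total = total + self.aux_weight * F.cross_entropy(
                logits.flatten(0, 1), targets[:, i:].flatten())
        return total
```

At inference time, the extra heads draft future tokens that the main head then verifies, which is where the self-speculative decoding speedup comes from.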
Key papers, in rough order of foundational → recent.
Training Compute-Optimal Large Language Models · Hoffmann et al. (DeepMind) · 2022
Scaling Laws for Neural Language Models · Kaplan et al. (OpenAI) · 2020
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws · Sardana et al. · 2024
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer · Yang et al. · 2022
DeepSeek-V3 Technical Report · DeepSeek-AI · 2024
The Llama 3 Herd of Models · Meta AI · 2024
Better & Faster Large Language Models via Multi-token Prediction · Gloeckle et al. (Meta) · 2024
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies · Hu et al. · 2024
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens · Ding et al. · 2024
Efficient Training of Language Models to Fill in the Middle · Bavarian et al. (OpenAI) · 2022