Build a Frontier Model
Step 4 of 11

Pretraining at Scale

The $100M-$1B run where everything has to go right.

A frontier pretraining run in 2026 burns 10k-50k+ Blackwell GPUs for 2-6 months on 15-40T tokens. Scaling laws, hyperparameter transfer (μP), low-precision training (FP8/MXFP4), and a curriculum that anneals on the highest-quality data at the end are now table stakes. Multi-token prediction (MTP) is the new auxiliary objective.

Why it matters

  • A single failed run can cost $50M-$500M; most labs run 5-10 small-scale ablations for every full run (a back-of-envelope cost sketch follows this list).
  • μP (Maximal Update Parametrization) lets you tune hyperparameters on a 40M proxy and transfer them to 400B+.
  • FP8 training, validated at scale by DeepSeek-V3, halves memory bandwidth requirements vs BF16.
  • Inference-aware over-training (Sardana 2024) means most labs train well past Chinchilla-optimal: Llama 3 8B saw 15.6T tokens (~1,950 tokens/param).
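
To make the headline cost concrete, here is a back-of-envelope sketch using the standard C ≈ 6·N·D FLOPs approximation. The per-GPU throughput, MFU, and hourly price below are illustrative assumptions, not measured or vendor figures.

```python
# Back-of-envelope pretraining cost using the common C ~= 6 * N * D FLOPs rule.
# gpu_flops, mfu, and usd_per_gpu_hour are illustrative assumptions.

def pretrain_cost(n_params, n_tokens, gpu_flops=2.0e15, mfu=0.40, usd_per_gpu_hour=4.0):
    """Estimate GPU-hours and dollar cost of one pretraining run.

    n_params          -- model parameters (e.g. 4e11 for a 400B model)
    n_tokens          -- training tokens (e.g. 2e13 for 20T)
    gpu_flops         -- assumed peak low-precision FLOP/s per GPU
    mfu               -- assumed model FLOPs utilization
    usd_per_gpu_hour  -- assumed all-in price per GPU-hour
    """
    total_flops = 6.0 * n_params * n_tokens
    gpu_hours = total_flops / (gpu_flops * mfu) / 3600.0
    return gpu_hours, gpu_hours * usd_per_gpu_hour

gpu_hours, cost = pretrain_cost(n_params=4e11, n_tokens=2e13)
print(f"{gpu_hours:,.0f} GPU-hours, ~${cost / 1e6:,.0f}M")  # ~17M GPU-hours, ~$67M
```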

State of the art

2025-2026
  • DeepSeek-V3 was the first frontier model trained primarily in FP8, with selective high-precision residual paths.
  • Multi-Token Prediction (MTP) auxiliary objective gives 1-2% benchmark uplift and accelerates speculative decoding at inference.
  • Warmup-Stable-Decay (WSD) learning rate schedules (MiniCPM, 2024) have largely replaced cosine schedules at frontier labs (a schedule sketch follows this list).
  • Cross-document attention masking (preventing attention bleed across packed documents) is a 2024 hygiene fix.
  • Mid-training 'uplift' stages — short, high-quality data injections between pretraining and post-training — are the new norm.
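
A minimal sketch of a WSD schedule in the spirit of MiniCPM: linear warmup, a long flat plateau, then a short anneal to a floor. The phase fractions, floor, and linear decay shape are placeholder choices, not any lab's actual settings.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01, decay_frac=0.10, min_lr_ratio=0.10):
    """Warmup-Stable-Decay learning rate: warmup -> constant plateau -> decay.

    Fractions, the LR floor, and the linear decay shape are illustrative assumptions.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:                       # 1) linear warmup
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:                        # 2) stable plateau
        return peak_lr
    progress = (step - decay_start) / max(decay_steps, 1)
    return peak_lr * (1.0 - (1.0 - min_lr_ratio) * progress)  # 3) anneal to floor
```

Because the plateau has no decay baked in, the same run can be extended or branched into the anneal at any point, which is a large part of WSD's appeal for frontier-scale runs.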

The recipe

A frontier-grade implementation, in order.

1

Scaling-law sweep

Train 8-12 small models (10M-1B params) at varying token counts. Fit scaling law constants on YOUR data. Don't trust Chinchilla's defaults.
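
A hedged sketch of fitting the usual parametric form L(N, D) = E + A·N^(-α) + B·D^(-β) to your own ablation results with scipy. The run results below are placeholder numbers, not real measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_law(x, E, A, alpha, B, beta):
    """Parametric scaling law L(N, D) = E + A*N^-alpha + B*D^-beta."""
    N, D = x
    return E + A * N ** (-alpha) + B * D ** (-beta)

# Placeholder ablation results: (params, tokens, final validation loss) from small runs.
runs = np.array([
    (1e7, 2e8, 5.3), (3e7, 6e8, 4.3), (1e8, 2e9, 3.5),
    (3e8, 6e9, 3.0), (1e9, 2e10, 2.6), (3e9, 6e10, 2.3),
])
N, D, L = runs[:, 0], runs[:, 1], runs[:, 2]

popt, _ = curve_fit(loss_law, (N, D), L, p0=[1.7, 400.0, 0.3, 400.0, 0.3], maxfev=20000)
E, A, alpha, B, beta = popt
print(f"E={E:.2f}  alpha={alpha:.3f}  beta={beta:.3f}")

# Predict loss for a candidate frontier config, e.g. 400B params on 20T tokens.
print(loss_law((4e11, 2e13), *popt))
```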

2

μP hyperparameter transfer

Find optimal LR, init scale, batch size on a 40M-100M proxy. Transfer to target scale via μTransfer.
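
A minimal sketch of the μP transfer rules for Adam-style optimizers: hidden-layer learning rates and the output multiplier shrink with the width ratio while embedding hyperparameters stay fixed. This is a hand-rolled illustration of the scaling rules (see Yang et al. or the mup package for the full parametrization), not an official API, and the proxy values are placeholders.

```python
import math

def mup_transfer(proxy_hparams, proxy_width, target_width):
    """Transfer proxy hyperparameters to a wider model under muP (Adam-style).

    Rules sketched here (assumption -- consult the muP papers for the full set):
      * hidden-layer Adam LR scales as 1 / width_mult
      * hidden init std scales as 1 / sqrt(width_mult)
      * output-logit multiplier scales as 1 / width_mult
      * embedding LR and init stay fixed
    """
    m = target_width / proxy_width  # width multiplier
    return {
        "lr_hidden":   proxy_hparams["lr_hidden"] / m,
        "lr_embed":    proxy_hparams["lr_embed"],
        "init_std":    proxy_hparams["init_std"] / math.sqrt(m),
        "output_mult": proxy_hparams.get("output_mult", 1.0) / m,
        "batch_size":  proxy_hparams["batch_size"],  # tuned separately, kept fixed here
    }

# Tune on a ~40M-parameter proxy (width 512), then transfer to the target width.
proxy = {"lr_hidden": 3e-3, "lr_embed": 3e-3, "init_std": 0.02, "batch_size": 2 ** 20}
print(mup_transfer(proxy, proxy_width=512, target_width=16384))
```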

3

Token budget

If you'll deploy at scale, 5-30x Chinchilla-optimal tokens (Sardana inference-aware). Otherwise stick near 20 tokens/param.
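
A small worked example of the budget decision, treating ~20 tokens/param as the Chinchilla baseline and the inference-aware over-training factor as a knob; all numbers are illustrative.

```python
def token_budget(n_params, overtrain_factor=1.0, chinchilla_ratio=20.0):
    """Training-token budget: Chinchilla-optimal tokens times an over-training factor.

    overtrain_factor=1 -> compute-optimal; 5-30 -> inference-aware over-training.
    """
    return n_params * chinchilla_ratio * overtrain_factor

n = 8e9  # an 8B-parameter model
print(f"Chinchilla-optimal: {token_budget(n) / 1e12:.2f}T tokens")
print(f"10x over-trained:   {token_budget(n, overtrain_factor=10) / 1e12:.1f}T tokens")
print(f"Llama 3 8B actual:  ~15.6T tokens (~{15.6e12 / n:.0f} tokens/param)")
```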

4

Precision

FP8 forward, BF16 master weights, FP32 optimizer states. Selective high-precision for embeddings, output, and gradient norms.
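
A hedged sketch of the bookkeeping behind this split: a per-tensor precision policy mapping parameter and optimizer-state names to formats. Real FP8 execution goes through scaled-FP8 kernels (Transformer Engine-style); the module-name patterns here are placeholders.

```python
from fnmatch import fnmatch

# Which tensor lives in which format. Patterns are placeholder module names;
# actual FP8 compute requires scaled-FP8 GEMM kernels underneath.
PRECISION_POLICY = {
    "*norm*":               "fp32",       # layernorm / RMSNorm params and stats
    "*embed_tokens*":       "bf16",       # embeddings kept in higher precision
    "*lm_head*":            "bf16",       # output projection kept in higher precision
    "*attn*.weight":        "fp8_e4m3",   # FP8 forward GEMMs (e5m2/e4m3 for grads)
    "*mlp*.weight":         "fp8_e4m3",
    "master_weights":       "bf16",
    "optimizer.exp_avg":    "fp32",
    "optimizer.exp_avg_sq": "fp32",
    "grad_norm_reduction":  "fp32",       # norms and reductions accumulated in FP32
}

def precision_for(name: str) -> str:
    """Return the storage/compute format for a parameter or state tensor."""
    for pattern, fmt in PRECISION_POLICY.items():
        if fnmatch(name, pattern):
            return fmt
    return "fp8_e4m3"  # default: low-precision GEMM path

print(precision_for("model.layers.0.mlp.down_proj.weight"))  # fp8_e4m3
```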

5

Curriculum

Stage 1 (0-90%): broad mix at short context. Stage 2 (90-95%): long-context anneal. Stage 3 (95-100%): high-quality math/code/instruction anneal.
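
A sketch of that three-stage schedule as a function of training progress; the mixture weights, context lengths, and dataset names are all illustrative placeholders, not a real mix.

```python
def curriculum(progress: float) -> dict:
    """Return the data mix and context length for a point in training.

    progress is in [0, 1). Stage boundaries, weights, and names are placeholders.
    """
    if progress < 0.90:   # Stage 1: broad mix, short context
        return {"seq_len": 8192,
                "mix": {"web": 0.60, "code": 0.20, "books_papers": 0.10,
                        "math": 0.05, "multilingual": 0.05}}
    if progress < 0.95:   # Stage 2: long-context anneal
        return {"seq_len": 131072,
                "mix": {"web": 0.35, "long_docs": 0.30, "code_repos": 0.20,
                        "books_papers": 0.15}}
    # Stage 3: high-quality anneal on math/code/instruction-style data
    return {"seq_len": 131072,
            "mix": {"math": 0.35, "code": 0.35, "instruction_like": 0.20,
                    "high_quality_web": 0.10}}

print(curriculum(0.50)["seq_len"], curriculum(0.97)["mix"])
```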

6

MTP auxiliary

Add 1-4 future-token prediction heads with weight ~0.1. Free quality boost + faster inference via self-speculation.
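
A minimal sketch of the auxiliary-loss wiring: extra heads predict tokens several positions ahead, and their cross-entropy is added with a small weight. The single-linear-head design is a simplification (DeepSeek-V3's MTP modules are deeper), and all names are placeholders.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MTPHeads(nn.Module):
    """Auxiliary multi-token prediction heads (simplified: one linear head per offset)."""

    def __init__(self, d_model: int, vocab_size: int, n_future: int = 2, weight: float = 0.1):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(n_future))
        self.weight = weight

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """hidden: [batch, seq, d_model]; tokens: [batch, seq]. Returns the weighted aux loss."""
        total = hidden.new_zeros(())
        for k, head in enumerate(self.heads, start=2):  # head i predicts token t+k (t+1 is the main head)
            logits = head(hidden[:, :-k])               # only positions that have a t+k target
            targets = tokens[:, k:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return self.weight * total

# total_loss = next_token_loss + mtp_heads.loss(hidden_states, input_ids)
```

The extra heads can be dropped at deployment or reused for self-speculative decoding, which is where the inference speedup comes from.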

⚠️

Common pitfalls

  • Loss spikes at scale: keep gradient/activation norm telemetry at every layer. On a spike, restart from the last clean checkpoint and skip the offending batches (see the sketch after this list).
  • Don't change the data mix mid-run unless you've ablated it. The 'just one more high-quality dataset' temptation has killed many runs.
  • Optimizer state explodes at scale: ZeRO-3 / FSDP sharding (plus expert parallelism for MoE) is mandatory above ~70B parameters.
  • Tokenizer recycling: never inherit a tokenizer from a different pretraining mix. The token-frequency mismatch costs ~5% effective throughput.
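
A sketch of the spike-handling loop from the first pitfall: watch gradient norms against a recent rolling window, and on a spike roll back to the last clean checkpoint and skip the offending data. The threshold, window, and checkpoint/dataloader interfaces are placeholders.

```python
from collections import deque

class SpikeGuard:
    """Flag gradient-norm spikes relative to a rolling window (placeholder thresholds)."""

    def __init__(self, window: int = 200, threshold: float = 3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def is_spike(self, grad_norm: float) -> bool:
        # Only start flagging once the window is half full.
        if len(self.history) >= self.history.maxlen // 2:
            mean = sum(self.history) / len(self.history)
            if grad_norm > self.threshold * mean:
                return True  # spike: caller should roll back and skip the data
        self.history.append(grad_norm)
        return False

# Recovery path, as pseudocode (checkpoint/dataloader APIs are hypothetical):
# if guard.is_spike(grad_norm):
#     state = load_checkpoint(last_clean_step)                      # restart from clean weights
#     dataloader.skip(range(last_clean_step, step + 1))             # drop the offending batches
#     step = last_clean_step
```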