Build a Frontier Model
Step 4 of 11

Pretraining at Scale

The $100M-$1B run where everything has to go right.

A frontier pretraining run in 2026 burns 10k-50k+ Blackwell GPUs for 2-6 months on 15-40T tokens. Scaling laws, hyperparameter transfer (μP), low-precision training (FP8/MXFP4), and a curriculum that anneals on the highest-quality data at the end are now table stakes. Multi-token prediction (MTP) is the new auxiliary objective.

Why it matters

  • A single failed run can cost $50M-$500M; most labs run 5-10 small-scale ablations for every full run (a back-of-envelope cost sketch follows this list).
  • μP (Maximal Update Parametrization) lets you tune hyperparameters on a 40M proxy and transfer them to 400B+.
  • FP8 training, validated at scale by DeepSeek-V3, halves memory bandwidth requirements vs BF16.
  • Inference-aware over-training (Sardana 2024) means most labs train well past Chinchilla-optimal: Llama 3 8B saw 15.6T tokens (~1,950 tokens/param).
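
To make the headline cost concrete, here is a back-of-envelope sketch using the standard C ≈ 6·N·D FLOPs approximation. The per-GPU throughput, MFU, and hourly price below are illustrative assumptions, not measured or vendor figures.

```python
# Back-of-envelope pretraining cost using the common C ~= 6 * N * D FLOPs rule.
# gpu_flops, mfu, and usd_per_gpu_hour are illustrative assumptions.

def pretrain_cost(n_params, n_tokens, gpu_flops=2.0e15, mfu=0.40, usd_per_gpu_hour=4.0):
    """Estimate GPU-hours and dollar cost of one pretraining run.

    n_params          -- model parameters (e.g. 4e11 for a 400B model)
    n_tokens          -- training tokens (e.g. 2e13 for 20T)
    gpu_flops         -- assumed peak low-precision FLOP/s per GPU
    mfu               -- assumed model FLOPs utilization
    usd_per_gpu_hour  -- assumed all-in price per GPU-hour
    """
    total_flops = 6.0 * n_params * n_tokens
    gpu_hours = total_flops / (gpu_flops * mfu) / 3600.0
    return gpu_hours, gpu_hours * usd_per_gpu_hour

gpu_hours, cost = pretrain_cost(n_params=4e11, n_tokens=2e13)
print(f"{gpu_hours:,.0f} GPU-hours, ~${cost / 1e6:,.0f}M")  # ~17M GPU-hours, ~$67M
```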

State of the art

2025-2026
  • DeepSeek-V3 was the first frontier model trained primarily in FP8, with selective high-precision residual paths.
  • Multi-Token Prediction (MTP) auxiliary objective gives 1-2% benchmark uplift and accelerates speculative decoding at inference.
  • Warmup-Stable-Decay (WSD) learning rate schedules (MiniCPM, 2024) have largely replaced cosine schedules at frontier labs (a schedule sketch follows this list).
  • Cross-document attention masking (preventing attention bleed across packed documents) is a 2024 hygiene fix.
  • Mid-training 'uplift' stages — short, high-quality data injections between pretraining and post-training — are the new norm.
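
A minimal sketch of a WSD schedule in the spirit of MiniCPM: linear warmup, a long flat plateau, then a short anneal to a floor. The phase fractions, floor, and linear decay shape are placeholder choices, not any lab's actual settings.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.01, decay_frac=0.10, min_lr_ratio=0.10):
    """Warmup-Stable-Decay learning rate: warmup -> constant plateau -> decay.

    Fractions, the LR floor, and the linear decay shape are illustrative assumptions.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:                       # 1) linear warmup
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:                        # 2) stable plateau
        return peak_lr
    progress = (step - decay_start) / max(decay_steps, 1)
    return peak_lr * (1.0 - (1.0 - min_lr_ratio) * progress)  # 3) anneal to floor
```

Because the plateau has no decay baked in, the same run can be extended or branched into the anneal at any point, which is a large part of WSD's appeal for frontier-scale runs.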

The recipe

A frontier-grade implementation, in order.

1

Scaling-law sweep

Train 8-12 small models (10M-1B params) at varying token counts. Fit scaling law constants on YOUR data. Don't trust Chinchilla's defaults.
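
A hedged sketch of fitting the usual parametric form L(N, D) = E + A·N^(-α) + B·D^(-β) to your own ablation results with scipy. The run results below are placeholder numbers, not real measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_law(x, E, A, alpha, B, beta):
    """Parametric scaling law L(N, D) = E + A*N^-alpha + B*D^-beta."""
    N, D = x
    return E + A * N ** (-alpha) + B * D ** (-beta)

# Placeholder ablation results: (params, tokens, final validation loss) from small runs.
runs = np.array([
    (1e7, 2e8, 5.3), (3e7, 6e8, 4.3), (1e8, 2e9, 3.5),
    (3e8, 6e9, 3.0), (1e9, 2e10, 2.6), (3e9, 6e10, 2.3),
])
N, D, L = runs[:, 0], runs[:, 1], runs[:, 2]

popt, _ = curve_fit(loss_law, (N, D), L, p0=[1.7, 400.0, 0.3, 400.0, 0.3], maxfev=20000)
E, A, alpha, B, beta = popt
print(f"E={E:.2f}  alpha={alpha:.3f}  beta={beta:.3f}")

# Predict loss for a candidate frontier config, e.g. 400B params on 20T tokens.
print(loss_law((4e11, 2e13), *popt))
```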

2

μP hyperparameter transfer

Find optimal LR, init scale, batch size on a 40M-100M proxy. Transfer to target scale via μTransfer.
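
A minimal sketch of the μP transfer rules for Adam-style optimizers: hidden-layer learning rates and the output multiplier shrink with the width ratio while embedding hyperparameters stay fixed. This is a hand-rolled illustration of the scaling rules (see Yang et al. or the mup package for the full parametrization), not an official API, and the proxy values are placeholders.

```python
import math

def mup_transfer(proxy_hparams, proxy_width, target_width):
    """Transfer proxy hyperparameters to a wider model under muP (Adam-style).

    Rules sketched here (assumption -- consult the muP papers for the full set):
      * hidden-layer Adam LR scales as 1 / width_mult
      * hidden init std scales as 1 / sqrt(width_mult)
      * output-logit multiplier scales as 1 / width_mult
      * embedding LR and init stay fixed
    """
    m = target_width / proxy_width  # width multiplier
    return {
        "lr_hidden":   proxy_hparams["lr_hidden"] / m,
        "lr_embed":    proxy_hparams["lr_embed"],
        "init_std":    proxy_hparams["init_std"] / math.sqrt(m),
        "output_mult": proxy_hparams.get("output_mult", 1.0) / m,
        "batch_size":  proxy_hparams["batch_size"],  # tuned separately, kept fixed here
    }

# Tune on a ~40M-parameter proxy (width 512), then transfer to the target width.
proxy = {"lr_hidden": 3e-3, "lr_embed": 3e-3, "init_std": 0.02, "batch_size": 2 ** 20}
print(mup_transfer(proxy, proxy_width=512, target_width=16384))
```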

3

Token budget

If you'll deploy at scale, 5-30x Chinchilla-optimal tokens (Sardana inference-aware). Otherwise stick near 20 tokens/param.
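
A small worked example of the budget decision, treating ~20 tokens/param as the Chinchilla baseline and the inference-aware over-training factor as a knob; all numbers are illustrative.

```python
def token_budget(n_params, overtrain_factor=1.0, chinchilla_ratio=20.0):
    """Training-token budget: Chinchilla-optimal tokens times an over-training factor.

    overtrain_factor=1 -> compute-optimal; 5-30 -> inference-aware over-training.
    """
    return n_params * chinchilla_ratio * overtrain_factor

n = 8e9  # an 8B-parameter model
print(f"Chinchilla-optimal: {token_budget(n) / 1e12:.2f}T tokens")
print(f"10x over-trained:   {token_budget(n, overtrain_factor=10) / 1e12:.1f}T tokens")
print(f"Llama 3 8B actual:  ~15.6T tokens (~{15.6e12 / n:.0f} tokens/param)")
```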

4

Precision

FP8 forward, BF16 master weights, FP32 optimizer states. Selective high-precision for embeddings, output, and gradient norms.
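
A hedged sketch of the bookkeeping behind this split: a per-tensor precision policy mapping parameter and optimizer-state names to formats. Real FP8 execution goes through scaled-FP8 kernels (Transformer Engine-style); the module-name patterns here are placeholders.

```python
from fnmatch import fnmatch

# Which tensor lives in which format. Patterns are placeholder module names;
# actual FP8 compute requires scaled-FP8 GEMM kernels underneath.
PRECISION_POLICY = {
    "*norm*":               "fp32",       # layernorm / RMSNorm params and stats
    "*embed_tokens*":       "bf16",       # embeddings kept in higher precision
    "*lm_head*":            "bf16",       # output projection kept in higher precision
    "*attn*.weight":        "fp8_e4m3",   # FP8 forward GEMMs (e5m2/e4m3 for grads)
    "*mlp*.weight":         "fp8_e4m3",
    "master_weights":       "bf16",
    "optimizer.exp_avg":    "fp32",
    "optimizer.exp_avg_sq": "fp32",
    "grad_norm_reduction":  "fp32",       # norms and reductions accumulated in FP32
}

def precision_for(name: str) -> str:
    """Return the storage/compute format for a parameter or state tensor."""
    for pattern, fmt in PRECISION_POLICY.items():
        if fnmatch(name, pattern):
            return fmt
    return "fp8_e4m3"  # default: low-precision GEMM path

print(precision_for("model.layers.0.mlp.down_proj.weight"))  # fp8_e4m3
```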

5

Curriculum

Stage 1 (0-90%): broad mix at short context. Stage 2 (90-95%): long-context anneal. Stage 3 (95-100%): high-quality math/code/instruction anneal.
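
A sketch of that three-stage schedule as a function of training progress; the mixture weights, context lengths, and dataset names are all illustrative placeholders, not a real mix.

```python
def curriculum(progress: float) -> dict:
    """Return the data mix and context length for a point in training.

    progress is in [0, 1). Stage boundaries, weights, and names are placeholders.
    """
    if progress < 0.90:   # Stage 1: broad mix, short context
        return {"seq_len": 8192,
                "mix": {"web": 0.60, "code": 0.20, "books_papers": 0.10,
                        "math": 0.05, "multilingual": 0.05}}
    if progress < 0.95:   # Stage 2: long-context anneal
        return {"seq_len": 131072,
                "mix": {"web": 0.35, "long_docs": 0.30, "code_repos": 0.20,
                        "books_papers": 0.15}}
    # Stage 3: high-quality anneal on math/code/instruction-style data
    return {"seq_len": 131072,
            "mix": {"math": 0.35, "code": 0.35, "instruction_like": 0.20,
                    "high_quality_web": 0.10}}

print(curriculum(0.50)["seq_len"], curriculum(0.97)["mix"])
```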

6

MTP auxiliary

Add 1-4 future-token prediction heads with weight ~0.1. Free quality boost + faster inference via self-speculation.
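
A minimal sketch of the auxiliary-loss wiring: extra heads predict tokens several positions ahead, and their cross-entropy is added with a small weight. The single-linear-head design is a simplification (DeepSeek-V3's MTP modules are deeper), and all names are placeholders.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MTPHeads(nn.Module):
    """Auxiliary multi-token prediction heads (simplified: one linear head per offset)."""

    def __init__(self, d_model: int, vocab_size: int, n_future: int = 2, weight: float = 0.1):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(n_future))
        self.weight = weight

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """hidden: [batch, seq, d_model]; tokens: [batch, seq]. Returns the weighted aux loss."""
        total = hidden.new_zeros(())
        for k, head in enumerate(self.heads, start=2):  # head i predicts token t+k (t+1 is the main head)
            logits = head(hidden[:, :-k])               # only positions that have a t+k target
            targets = tokens[:, k:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return self.weight * total

# total_loss = next_token_loss + mtp_heads.loss(hidden_states, input_ids)
```

The extra heads can be dropped at deployment or reused for self-speculative decoding, which is where the inference speedup comes from.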

⚠️

Common pitfalls

  • Loss spikes at scale: keep gradient/activation norm telemetry at every layer. On a spike, restart from the last clean checkpoint and skip the offending batches (see the sketch after this list).
  • Don't change the data mix mid-run unless you've ablated it. The 'just one more high-quality dataset' temptation has killed many runs.
  • Optimizer state explodes at scale: ZeRO-3 / FSDP sharding (plus expert parallelism for MoE) is mandatory above ~70B parameters.
  • Tokenizer recycling: never inherit a tokenizer from a different pretraining mix. The token-frequency mismatch costs ~5% effective throughput.
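
A sketch of the spike-handling loop from the first pitfall: watch gradient norms against a recent rolling window, and on a spike roll back to the last clean checkpoint and skip the offending data. The threshold, window, and checkpoint/dataloader interfaces are placeholders.

```python
from collections import deque

class SpikeGuard:
    """Flag gradient-norm spikes relative to a rolling window (placeholder thresholds)."""

    def __init__(self, window: int = 200, threshold: float = 3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def is_spike(self, grad_norm: float) -> bool:
        # Only start flagging once the window is half full.
        if len(self.history) >= self.history.maxlen // 2:
            mean = sum(self.history) / len(self.history)
            if grad_norm > self.threshold * mean:
                return True  # spike: caller should roll back and skip the data
        self.history.append(grad_norm)
        return False

# Recovery path, as pseudocode (checkpoint/dataloader APIs are hypothetical):
# if guard.is_spike(grad_norm):
#     state = load_checkpoint(last_clean_step)                      # restart from clean weights
#     dataloader.skip(range(last_clean_step, step + 1))             # drop the offending batches
#     step = last_clean_step
```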