No matter the price tag. The full pipeline a frontier lab actually uses in 2026: from cleaning a Common Crawl dump to deploying a reasoning-trained, multimodal, MCP-speaking agent on a 200,000-GPU cluster.
Each chapter pairs the practical recipe with the foundational and 2024-2026 papers actually being used at OpenAI, Anthropic, Google DeepMind, DeepSeek, and Meta.
Read them in order if you want the full mental model. Jump in anywhere if you already have a frontier-scale lab and just need the latest tricks.
Garbage in, garbage out, at trillion-token scale.
Frontier 2026 models are trained on 15-40T tokens. Quality, not raw volume, is what separates a great model from an also-ran. The pipeline starts with raw web crawl, then passes through aggressive filtering, deduplication, decontamination, model-based quality scoring, and curated mixes of code, math, multilingual, and synthetic data.
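The filtering stage is easy to sketch even if the real thing runs on thousands of CPU nodes. Below is a minimal, illustrative pass over raw documents with heuristic filters and exact hash deduplication; the thresholds, and the absence of MinHash fuzzy dedup, model-based scoring, and decontamination, are simplifications for illustration.

```python
# Minimal sketch: heuristic filtering + exact dedup over raw web documents.
# Thresholds are placeholders, not production values.
import hashlib

def keep_document(text: str) -> bool:
    """Cheap heuristic filters in the spirit of Gopher/FineWeb-style rules."""
    words = text.split()
    if not (50 <= len(words) <= 100_000):              # too short or absurdly long
        return False
    if sum(len(w) for w in words) / len(words) > 10:   # implausible mean word length
        return False
    if text.count("{") + text.count("}") > 0.05 * len(words):  # likely markup soup
        return False
    return True

def dedup_and_filter(docs):
    """Yield documents that pass the filters and are not exact duplicates."""
    seen = set()
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen or not keep_document(text):
            continue
        seen.add(digest)
        yield text
```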
The first lossy compression every model performs.
Tokenization controls how efficiently your model sees text, code, math, and other modalities. A bad vocabulary wastes context window, fragments numbers, and hurts multilingual performance. The 2026 default is byte-level BPE with a 128k-200k vocab, but byte-level tokenizer-free models (BLT, MambaByte) are entering production-scale runs.
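One concrete way to feel the difference a vocabulary makes is to measure bytes per token across domains. The sketch below assumes the open-source tiktoken package and its published cl100k_base and o200k_base encodings; the sample strings are arbitrary.

```python
# Compare tokenizer efficiency (bytes per token) across domains.
# Higher bytes/token means less context window spent per unit of text.
import tiktoken

samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "code":    "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
    "digits":  "3.14159265358979323846264338327950288",
}

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    for domain, text in samples.items():
        tokens = enc.encode(text)
        ratio = len(text.encode("utf-8")) / len(tokens)
        print(f"{name:12s} {domain:8s} {ratio:.2f} bytes/token")
```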
Transformer is the default. Everything else is a knob.
Decoder-only Transformer with RoPE positional encoding, RMSNorm, SwiGLU activations, and Grouped-Query Attention (GQA) is the 2026 default: call it the 'Llama-style stack.' On top, frontier models layer Mixture-of-Experts (DeepSeek V3, Llama 4, Qwen 3), Multi-head Latent Attention (DeepSeek's MLA), state-space hybrids (Jamba, Samba, Hymba), and dynamic-compute routing (MoD, MoR).
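For orientation, here is a minimal PyTorch sketch of one 'Llama-style' block: pre-RMSNorm, grouped-query attention, and a SwiGLU MLP. RoPE, KV caching, MoE routing, and frontier-scale dimensions are deliberately omitted; treat it as a reading aid, not a training-ready module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then scale.
        return self.weight * x * x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()

class GQABlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2, mlp_mult=4):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.attn_norm = RMSNorm(dim)
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        self.mlp_norm = RMSNorm(dim)
        hidden = mlp_mult * dim
        self.gate = nn.Linear(dim, hidden, bias=False)   # SwiGLU gate branch
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)                            # pre-norm
        q = self.q_proj(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # GQA: each KV head serves n_heads // n_kv_heads query heads.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))
        h = self.mlp_norm(x)
        return x + self.down(F.silu(self.gate(h)) * self.up(h))  # SwiGLU MLP

block = GQABlock()
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```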
The $100M-$1B run where everything has to go right.
A frontier pretraining run in 2026 burns 10-50k+ Blackwell GPUs for 2-6 months on 15-40T tokens. Scaling laws, hyperparameter transfer (μP), low-precision training (FP8/MXFP4), and a curriculum that anneals on the highest-quality data at the end are now table stakes. Multi-token prediction (MTP) is the new auxiliary objective.
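A back-of-the-envelope budget shows why these runs cost what they do. The sketch below uses the standard ~6ND FLOPs approximation and a Chinchilla-style ~20 tokens-per-parameter rule of thumb; the per-GPU throughput and MFU figures are placeholders, not vendor specs.

```python
# Back-of-the-envelope training budget: FLOPs ~ 6 * N_params * N_tokens.
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

def gpu_months(flops: float, gpus: int,
               flops_per_gpu: float = 2.0e15, mfu: float = 0.40) -> float:
    """Wall-clock months at a sustained MFU (per-GPU peak is a placeholder)."""
    seconds = flops / (gpus * flops_per_gpu * mfu)
    return seconds / (30 * 24 * 3600)

n_params = 1e12            # illustrative 1T-parameter model
n_tokens = 20 * n_params   # Chinchilla-style tokens-per-parameter rule of thumb
flops = training_flops(n_params, n_tokens)
print(f"{flops:.2e} FLOPs, ~{gpu_months(flops, gpus=50_000):.1f} months on 50k GPUs")
```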
Where a base model becomes useful, or harmful.
Post-training in 2026 is a multi-stage pipeline: SFT on high-quality instructions → preference optimization (DPO and variants) → RLVR (Reinforcement Learning from Verifiable Rewards) on math/code/agentic tasks → final safety polish. Constitutional AI and deliberative alignment shape behavior; distillation from R1/o-class reasoners injects long-CoT capabilities into smaller models.
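The preference-optimization step is compact enough to show directly. Below is a minimal sketch of the DPO objective on per-sequence log-probabilities from the policy and a frozen reference model; batching, masking, and log-prob extraction are elided, and the toy numbers are made up.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: push the policy's implicit reward for the chosen response above the
    rejected one, measured relative to a frozen reference model."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with fabricated per-sequence log-probs (batch of 2).
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```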
When you can't make the model bigger, make it think longer.
The 2025 breakthrough: train models with RL on verifiable rewards, give them an 'unbounded' scratchpad, and long chain-of-thought reasoning emerges. OpenAI's o1/o3, DeepSeek-R1, Gemini 2.5 Pro thinking, Claude Opus 4.x with extended thinking, and Kimi k1.5 all train this way. Inference now spends 1k-100k+ thinking tokens before producing a final answer.
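The 'verifiable' part of RLVR is often as simple as string- or program-level checking. The toy reward below assumes a \boxed{...} final-answer convention purely for illustration; real graders normalize answers, run unit tests, or execute code.

```python
# Toy verifiable reward: 1.0 only if the model's boxed final answer matches
# ground truth; no learned reward model involved.
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(verifiable_reward("... long reasoning ... so the answer is \\boxed{42}", "42"))  # 1.0
```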
From chatbot to autonomous worker.
Agents are LLMs in a loop: observe, plan, call a tool, observe the result, repeat. The 2025-2026 stack centers on function calling, the Model Context Protocol (MCP), browser/computer-use models (Anthropic Computer Use, OpenAI Operator, Google Project Mariner), and verifiable evaluation harnesses (SWE-Bench, OSWorld, Tau-Bench). Single strong agents with good tools beat elaborate multi-agent orchestration for most tasks.
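The loop itself is small; the hard parts are the model and the tools. Below is a skeletal observe-act-observe loop in which call_model and the TOOLS table are stubs standing in for an LLM function-calling API and MCP-exposed tools.

```python
import json

# Stub tools standing in for MCP-exposed capabilities.
TOOLS = {
    "search": lambda query: f"(stub) results for {query!r}",
    "read_file": lambda path: f"(stub) contents of {path}",
}

def call_model(messages):
    """Stub for an LLM function-calling API: request one tool, then answer."""
    if any(m["role"] == "tool" for m in messages):
        return {"tool": None, "args": None, "final": "(stub) answer using tool output"}
    return {"tool": "read_file", "args": {"path": "README.md"}, "final": None}

def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_model(messages)
        if decision["tool"] is None:          # model chose to stop and answer
            return decision["final"]
        result = TOOLS[decision["tool"]](**decision["args"])
        messages.append({"role": "tool", "content": json.dumps({"result": result})})
    return "step budget exhausted"

print(run_agent("Summarize the repo layout"))
```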
Native any-to-any is the new default.
2025-2026 frontier models are natively multimodal: text, images, audio, and (increasingly) video share a unified token space and a single backbone. GPT-5.5 unifies all four modalities in one model. Gemini 2.5 Pro handles 1M-token multimodal contexts. Vision encoders evolved from CLIP → SigLIP → SigLIP 2. Real-time voice (Moshi, GPT-4o) and generative video (Sora, Veo 3) crossed into product reality.
Training is one-time. Inference is forever.
Serving a frontier model economically requires every trick: FlashAttention 3 kernels, paged KV-cache (vLLM), speculative decoding (EAGLE-3, self-speculation), low-precision quantization (FP8, INT4, MXFP4), prompt caching, and disaggregated prefill/decode. Inference cost has dropped 10-100x in 18 months while quality climbed.
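Speculative decoding is the most algorithmically interesting of these tricks, and a greedy-verification version fits in a few lines. In the sketch below, toy_draft_next and toy_target_argmax are stand-ins for a small draft model and one batched target forward pass; production systems (EAGLE-style) verify against sampled distributions rather than argmax.

```python
def speculative_decode(prompt, draft_next, target_argmax, k=4, max_new=64):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1) Draft k candidate tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) One target pass scores every drafted position at once.
        verified = target_argmax(tokens, draft)
        # 3) Keep the longest agreeing prefix, plus the target's correction
        #    at the first disagreement.
        n_accept = 0
        while n_accept < k and draft[n_accept] == verified[n_accept]:
            n_accept += 1
        tokens.extend(draft[:n_accept])
        if n_accept < k:
            tokens.append(verified[n_accept])
    return tokens

def toy_draft_next(tokens):
    return (tokens[-1] + 1) % 100            # stand-in for a small draft model

def toy_target_argmax(tokens, draft):
    # Stand-in for a target forward pass: agrees except on multiples of 7.
    return [d if d % 7 != 0 else (d + 1) % 100 for d in draft]

print(speculative_decode([1, 2, 3], toy_draft_next, toy_target_argmax, k=4, max_new=16))
```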
100,000 GPUs is the new 10,000.
A 2026 frontier training run occupies a cluster of 50k-200k+ Blackwell or TPU v7 chips, networked with InfiniBand or NVLink fabric, sustained for months. xAI's Colossus, OpenAI's Stargate, Meta's Hyperion, and Anthropic's Project Rainier (with AWS Trainium2) define the new scale. 3D+ parallelism (data × tensor × pipeline × expert × context), FSDP/ZeRO, FP8 collectives, and resilient checkpointing are mandatory.
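The parallelism axes multiply: the product of the tensor, pipeline, expert, and context degrees must divide the cluster, and whatever is left over becomes the data-parallel degree. The factorization below is a toy illustration, not a recommended configuration.

```python
# Toy device-mesh factorization across the parallelism axes named above.
def device_mesh(total_gpus, tensor=8, pipeline=16, expert=8, context=2):
    model_parallel = tensor * pipeline * expert * context
    assert total_gpus % model_parallel == 0, "axes must divide the cluster size"
    return {"data": total_gpus // model_parallel, "tensor": tensor,
            "pipeline": pipeline, "expert": expert, "context": context}

print(device_mesh(131_072))  # 131,072 GPUs -> data-parallel degree 64
```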
If you can't measure it, you don't have it.
2026 frontier evaluation goes far beyond MMLU. The benchmarks that actually distinguish models โ GPQA Diamond, FrontierMath, ARC-AGI 2, Humanity's Last Exam, SWE-Bench Verified, OSWorld โ are graduate-or-expert difficulty. On the safety side, Responsible Scaling Policies, mandatory dangerous-capability evals (CBRN, cyber, autonomy), constitutional classifiers, and mechanistic interpretability via Sparse Autoencoders are now release-blockers at frontier labs.
How the chapters connect. Skip a step and the model still works; it just won't be a frontier model.
A frontier pretraining run in 2026 costs $100M-$1B. A frontier inference fleet costs 5-20x that, ongoing. The Stargate program (OpenAI/SoftBank/Oracle/MGX, announced January 2025) committed $500B over four years, and that's for one consortium.
This guide is written assuming you have access to that level of capital, or are merely curious how it's spent. The same techniques scale down (DeepSeek-R1 reproduced o1-class reasoning at ~$5-6M training cost), but the chapters below describe no-compromises choices.
Or jump to any chapter above. Every page is self-contained, but the order is opinionated.
Begin: Pretraining Data →