Updated May 2026 · 11 chapters · 100+ papers

Build the
Most Cutting-Edge AI Model
From Scratch

No matter the price tag. The full pipeline a frontier lab actually uses in 2026: from cleaning a Common Crawl dump to deploying a reasoning-trained, multimodal, MCP-speaking agent on a 200,000-GPU cluster.

Each chapter pairs the practical recipe with the foundational and 2024-2026 papers actually being used at OpenAI, Anthropic, Google DeepMind, DeepSeek, and Meta.

15-40T
Pretraining tokens
100k+
Blackwell GPUs in a frontier cluster
$1B+
Cost of a single frontier run
87.6%
SOTA on SWE-Bench Verified (Apr 2026)

The 11 chapters

Read them in order if you want the full mental model. Jump in anywhere if you already have a frontier-scale lab and just need the latest tricks.

📚
01

Pretraining Data

Garbage in, garbage out, at trillion-token scale.

Frontier 2026 models are trained on 15-40T tokens. Quality, not raw volume, is what separates a great model from an also-ran. The pipeline starts with raw web crawl, then passes through aggressive filtering, deduplication, decontamination, model-based quality scoring, and curated mixes of code, math, multilingual, and synthetic data.

11 papers · Read chapter
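As a taste of the deduplication stage: a toy MinHash sketch, the standard trick for near-duplicate detection at crawl scale. The function names, shingle size, and 64-hash signature length here are illustrative choices, not a production pipeline.

```python
import hashlib

def shingles(text, n=5):
    """Word-level n-gram shingles; near-duplicate documents share most of them."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingle_set, num_hashes=64):
    """One minimum per seeded hash function; the fraction of equal slots between
    two signatures estimates the Jaccard similarity of the shingle sets."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river bend"
sim = estimated_jaccard(minhash_signature(shingles(doc_a)),
                        minhash_signature(shingles(doc_b)))
# sim is high (the true Jaccard here is 0.8), so a dedup pass would drop one copy.
```

Production systems bucket signatures with locality-sensitive hashing so candidate pairs are found without all-pairs comparison.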
🔤
02

Tokenization

The first lossy compression every model performs.

Tokenization controls how efficiently your model sees text, code, math, and other modalities. A bad vocabulary wastes context window, fragments numbers, and hurts multilingual performance. The 2026 default is byte-level BPE with a 128k-200k vocab, though fully tokenizer-free byte models (BLT, MambaByte) are entering production-scale runs.

10 papers · Read chapter
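The core of byte-level BPE fits in a few lines: count adjacent pairs, mint a new token id for the most frequent one, and repeat until the vocab budget is spent. A minimal single-merge sketch (real tokenizers add pre-tokenization regexes and tie-breaking rules):

```python
from collections import Counter

def most_frequent_pair(ids):
    """The most common adjacent token-id pair is the next BPE merge candidate."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the freshly minted token id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "low lower lowest"
ids = list(text.encode("utf-8"))   # byte-level: start from the 256 raw byte values
pair = most_frequent_pair(ids)     # each of "low"'s three occurrences votes for it
ids = merge(ids, pair, 256)        # the vocab grows past the 256 base bytes
```

Training repeats this loop ~128k-200k times; encoding then replays the learned merges in order.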
๐Ÿ›๏ธ
03

Architecture

Transformer is the default. Everything else is a knob.

Decoder-only Transformer with RoPE positional encoding, RMSNorm, SwiGLU activations, and Grouped-Query Attention (GQA) is the 2026 default: call it the 'Llama-style stack.' On top, frontier models layer Mixture-of-Experts (DeepSeek V3, Llama 4, Qwen 3), Multi-head Latent Attention (DeepSeek's MLA), state-space hybrids (Jamba, Samba, Hymba), and dynamic-compute routing (MoD, MoR).

17 papers · Read chapter
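Two of those defaults are small enough to show inline. A dependency-free sketch of RMSNorm (rescale by root-mean-square, with no mean subtraction, unlike LayerNorm) and the SwiGLU gate (elementwise SiLU(gate) times up-projection); real implementations are batched tensor ops, of course:

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of the vector, then apply gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def silu(v):
    """SiLU / Swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    """SwiGLU gating: SiLU(gate) * up, applied elementwise. In a real FFN,
    `gate` and `up` are two linear projections of the same hidden state."""
    return [silu(g) * u for g, u in zip(gate, up)]

h = rms_norm([1.0, -2.0, 3.0], gain=[1.0, 1.0, 1.0])  # output has unit RMS
```

Dropping the mean-centering and bias of LayerNorm saves a reduction per token, which matters at trillion-token scale.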
⚡
04

Pretraining at Scale

The $100M-$1B run where everything has to go right.

A frontier pretraining run in 2026 burns 10-50k+ Blackwell GPUs for 2-6 months on 15-40T tokens. Scaling laws, hyperparameter transfer (μP), low-precision training (FP8/MXFP4), and a curriculum that anneals on the highest-quality data at the end are now table stakes. Multi-token prediction (MTP) is the new auxiliary objective.

10 papers · Read chapter
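The budget behind those numbers is back-of-envelope arithmetic via the standard C ≈ 6·N·D FLOPs estimate. Every concrete figure below (parameter count, per-GPU throughput, 40% utilization) is an illustrative assumption, not any lab's actual config:

```python
# C = 6 * N * D: the common estimate of training FLOPs for a dense Transformer.
params = 1e12      # N: 1T parameters (assumed)
tokens = 20e12     # D: 20T tokens, mid-range of the 15-40T cited above
flops = 6 * params * tokens            # ~1.2e26 FLOPs for the whole run

gpu_flops = 4.5e15   # assumed per-GPU dense FP8 throughput, FLOP/s
mfu = 0.40           # assumed model FLOPs utilization (fraction of peak achieved)
gpus = 10_000        # low end of the 10-50k+ range

seconds = flops / (gpus * gpu_flops * mfu)
days = seconds / 86_400   # ~77 days, i.e. about 2.5 months: consistent
                          # with the "2-6 months" wall-clock above
```

The same arithmetic run in reverse is how labs size clusters: fix the deadline, solve for GPUs.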
🎯
05

Post-Training & Alignment

Where a base model becomes useful, or harmful.

Post-training in 2026 is a multi-stage pipeline: SFT on high-quality instructions → preference optimization (DPO and variants) → RLVR (Reinforcement Learning from Verifiable Rewards) on math/code/agentic tasks → final safety polish. Constitutional AI and deliberative alignment shape behavior; distillation from R1/o-class reasoners injects long-CoT capabilities into smaller models.

13 papers · Read chapter
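The preference-optimization step is compact enough to write down. A sketch of the per-pair DPO loss on sequence log-probabilities; the β value and the toy log-probs are illustrative:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: -log sigmoid(beta * margin), where the margin is how much more the
    policy prefers `chosen` over `rejected` than the frozen reference does."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already leans toward the chosen answer relative to the reference: low loss.
low = dpo_loss(logp_chosen=-10.0, logp_rejected=-14.0,
               ref_chosen=-12.0, ref_rejected=-12.0)
# Policy leans toward the rejected answer: higher loss, larger gradient.
high = dpo_loss(logp_chosen=-14.0, logp_rejected=-10.0,
                ref_chosen=-12.0, ref_rejected=-12.0)
```

No reward model and no on-policy sampling are needed, which is why DPO displaced classic RLHF PPO for the preference stage.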
🧠
06

Reasoning & Test-Time Compute

When you can't make the model bigger, make it think longer.

The 2025 breakthrough: train models with RL on verifiable rewards, give them an 'unbounded' scratchpad, and emergent long chain-of-thought appears. OpenAI's o1/o3, DeepSeek-R1, Gemini 2.5 Pro thinking, Claude Opus 4.x with extended thinking, and Kimi k1.5 all train this way. Inference now spends 1k-100k+ thinking tokens before producing a final answer.

12 papers · Read chapter
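What makes the reward "verifiable" is that a program, not a learned judge, scores each rollout. A deliberately tiny stand-in for a math-answer checker; real verifiers normalize answer formats, run unit tests for code, and so on:

```python
def verifiable_reward(completion, ground_truth):
    """Binary reward for RLVR: 1.0 iff the final line of the rollout matches
    the known answer exactly. No reward model, so no reward hacking a judge."""
    answer = completion.strip().splitlines()[-1]   # convention: answer on last line
    return 1.0 if answer == ground_truth else 0.0

# A long chain of thought is free; only the final answer is scored.
long_cot = "Let me think step by step.\n12 * 7 = 84, minus 5 is 79.\n79"
reward = verifiable_reward(long_cot, "79")   # correct -> 1.0
```

Because intermediate tokens are unscored, RL is free to grow the scratchpad, which is exactly where the emergent long chain-of-thought comes from.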
🤖
07

Tool Use & Agents

From chatbot to autonomous worker.

Agents are LLMs in a loop: observe, plan, call a tool, observe the result, repeat. The 2025-2026 stack centers on function calling, the Model Context Protocol (MCP), browser/computer-use models (Anthropic Computer Use, OpenAI Operator, Google Project Mariner), and verifiable evaluation harnesses (SWE-Bench, OSWorld, Tau-Bench). Single strong agents with good tools beat elaborate multi-agent orchestration for most tasks.

9 papers · Read chapter
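That loop really is the whole skeleton. A minimal sketch with a stubbed-out planner standing in for the model call; the tool set, action format, and step budget are all illustrative, not any particular framework's API:

```python
# Illustrative tool registry: in practice these are function-calling schemas
# exposed to the model (e.g. via MCP).
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def stub_planner(observation):
    """Stand-in for the LLM call: returns either a tool call or a final answer."""
    if observation == "start":
        return {"tool": "add", "args": (2, 3)}
    return {"final": f"the sum is {observation}"}

def run_agent(max_steps=5):
    observation = "start"
    for _ in range(max_steps):           # the agent loop: plan, act, observe
        action = stub_planner(observation)
        if "final" in action:
            return action["final"]
        observation = TOOLS[action["tool"]](*action["args"])
    return "step budget exhausted"

result = run_agent()   # one tool call, then a final answer
```

Everything else in the agent stack (MCP, sandboxes, eval harnesses) is scaffolding around this loop.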
🎨
08

Multimodal

Native any-to-any is the new default.

2025-2026 frontier models are natively multimodal: text, images, audio, and (increasingly) video share a unified token space and a single backbone. GPT-5.5 unifies all four modalities in one model. Gemini 2.5 Pro handles 1M-token multimodal contexts. Vision encoders evolved from CLIP → SigLIP → SigLIP 2. Real-time voice (Moshi, GPT-4o) and generative video (Sora, Veo 3) crossed into product reality.

12 papers · Read chapter
🚀
09

Inference & Serving

Training is one-time. Inference is forever.

Serving a frontier model economically requires every trick: FlashAttention 3 kernels, paged KV-cache (vLLM), speculative decoding (EAGLE-3, self-speculation), low-precision quantization (FP8, INT4, MXFP4), prompt caching, and disaggregated prefill/decode. Inference cost has dropped 10-100x in 18 months while quality climbed.

13 papers · Read chapter
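Speculative decoding's acceptance rule, in its simple greedy-verification form: keep drafted tokens until the first disagreement with the target model, then take the target's token at that position. (The published schemes do rejection sampling over full distributions; this sketch is the greedy special case, with token ids as plain ints.)

```python
def longest_accepted_prefix(proposed, verified):
    """`proposed`: k tokens from the cheap draft model. `verified`: the target
    model's greedy choice at each of those positions, obtained in ONE forward
    pass over the whole drafted block. Returns 1..k accepted tokens."""
    out = []
    for p, v in zip(proposed, verified):
        if p != v:
            out.append(v)      # first disagreement: take the target's token
            return out
        out.append(p)          # agreement: the drafted token is provably correct
    return out

# Draft proposed 4 tokens; the target agreed on the first two,
# so this step emits 3 tokens for the price of one target pass.
accepted = longest_accepted_prefix([5, 7, 9, 2], [5, 7, 3, 8])
```

The output distribution matches plain greedy decoding of the target exactly; only the wall-clock cost changes.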
๐Ÿ—๏ธ
10

Training Infrastructure

100,000 GPUs is the new 10,000.

A 2026 frontier run occupies a cluster of 50k-200k+ Blackwell or TPU v7 chips, networked with InfiniBand or NVLink fabric and sustained for months. xAI's Colossus, OpenAI's Stargate, Meta's Hyperion, and Anthropic's Project Rainier (with AWS Trainium2) define the new scale. 3D+ parallelism (data × tensor × pipeline × expert × context), FSDP/ZeRO, FP8 collectives, and resilient checkpointing are mandatory.

8 papers · Read chapter
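One constraint ties those axes together: the product of the parallelism degrees must equal the GPU count. (In practice ZeRO/FSDP sharding and expert placement complicate the picture; the degrees below are illustrative, not a real job config.)

```python
# Sizing a parallelism layout: world size = data x tensor x pipeline x expert.
cluster_gpus = 98_304   # assumed cluster size, chosen to divide evenly
tensor = 8              # tensor parallelism, kept inside one NVLink domain
pipeline = 16           # pipeline stages across nodes
expert = 8              # expert parallelism for the MoE layers
data = cluster_gpus // (tensor * pipeline * expert)   # remaining axis: 96-way

# The layout must tile the cluster exactly, or GPUs sit idle.
assert tensor * pipeline * expert * data == cluster_gpus
```

Choosing which axis absorbs growth is the core infra decision: tensor parallelism wants the fastest links, pipeline tolerates slower ones, and data parallelism scales almost freely until the batch size hurts convergence.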
🛡️
11

Evaluation & Safety

If you can't measure it, you don't have it.

2026 frontier evaluation goes far beyond MMLU. The benchmarks that actually distinguish models (GPQA Diamond, FrontierMath, ARC-AGI 2, Humanity's Last Exam, SWE-Bench Verified, OSWorld) are graduate-or-expert difficulty. On the safety side, Responsible Scaling Policies, mandatory dangerous-capability evals (CBRN, cyber, autonomy), constitutional classifiers, and mechanistic interpretability via Sparse Autoencoders are now release-blockers at frontier labs.

15 papers · Read chapter

The frontier-model lifecycle

How the chapters connect. Skip a step and the model still works; it just won't be a frontier model.

PHASE 1 · INPUTS

What goes in

  • Data: 15-40T tokens, web + synthetic
  • Tokenization: 128k-200k BPE or byte-level
  • Architecture: Llama-stack + MoE/MLA/SSM
PHASE 2 · TRAINING

How it learns

  • Pretraining: FP8, μP, MTP, curriculum
  • Post-training: SFT + DPO + RLVR
  • Reasoning: long-CoT RL
  • Multimodal: native fusion
  • Agents: tool use + MCP
PHASE 3 · OPS

How it runs

  • Infrastructure: 100k-GPU cluster
  • Inference: vLLM, FA3, MXFP4
  • Eval & safety: RSPs, SAEs, red-team

A note on price tags

A frontier pretraining run in 2026 costs $100M-$1B. A frontier inference fleet costs 5-20x that, ongoing. The Stargate program (OpenAI/SoftBank/Oracle/MGX, announced January 2025) committed $500B over four years, and that's for one consortium.

This guide is written assuming you have access to that level of capital, or are merely curious how it's spent. The same techniques scale down (DeepSeek-R1 reproduced o1-class reasoning at ~$5-6M training cost), but the chapters below describe no-compromises choices.

Start with Chapter 1

Or jump to any chapter above. Every page is self-contained, but the order is opinionated.

Begin: Pretraining Data →