No matter the price tag. The full pipeline a frontier lab actually uses in 2026: from cleaning a Common Crawl dump to deploying a reasoning-trained, multimodal, MCP-speaking agent on a 200,000-GPU cluster.
Each chapter pairs the practical recipe with the foundational and 2024-2026 papers actually being used at OpenAI, Anthropic, Google DeepMind, DeepSeek, and Meta.
Read them in order if you want the full mental model. Jump in anywhere if you already have a frontier-scale lab and just need the latest tricks.
Garbage in, garbage out, at trillion-token scale.
Frontier 2026 models are trained on 15-40T tokens. Quality, not raw volume, is what separates a great model from an also-ran. The pipeline starts with raw web crawl, then passes through aggressive filtering, deduplication, decontamination, model-based quality scoring, and curated mixes of code, math, multilingual, and synthetic data.
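The filtering stage is easy to sketch even if the real thing runs on thousands of CPU nodes. Below is a minimal, illustrative pass over raw documents with heuristic filters and exact hash deduplication; the thresholds, and the absence of MinHash fuzzy dedup, model-based scoring, and decontamination, are simplifications for illustration.

```python
# Minimal sketch: heuristic filtering + exact dedup over raw web documents.
# Thresholds are placeholders, not production values.
import hashlib

def keep_document(text: str) -> bool:
    """Cheap heuristic filters in the spirit of Gopher/FineWeb-style rules."""
    words = text.split()
    if not (50 <= len(words) <= 100_000):              # too short or absurdly long
        return False
    if sum(len(w) for w in words) / len(words) > 10:   # implausible mean word length
        return False
    if text.count("{") + text.count("}") > 0.05 * len(words):  # likely markup soup
        return False
    return True

def dedup_and_filter(docs):
    """Yield documents that pass the filters and are not exact duplicates."""
    seen = set()
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen or not keep_document(text):
            continue
        seen.add(digest)
        yield text
```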
The first lossy compression every model performs.
Tokenization controls how efficiently your model sees text, code, math, and other modalities. A bad vocabulary wastes context window, fragments numbers, and hurts multilingual performance. The 2026 default is byte-level BPE with a 128k-200k vocab, but byte-level tokenizer-free models (BLT, MambaByte) are entering production-scale runs.
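One concrete way to feel the difference a vocabulary makes is to measure bytes per token across domains. The sketch below assumes the open-source tiktoken package and its published cl100k_base and o200k_base encodings; the sample strings are arbitrary.

```python
# Compare tokenizer efficiency (bytes per token) across domains.
# Higher bytes/token means less context window spent per unit of text.
import tiktoken

samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "code":    "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
    "digits":  "3.14159265358979323846264338327950288",
}

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    for domain, text in samples.items():
        tokens = enc.encode(text)
        ratio = len(text.encode("utf-8")) / len(tokens)
        print(f"{name:12s} {domain:8s} {ratio:.2f} bytes/token")
```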
Transformer is the default. Everything else is a knob.
Decoder-only Transformer with RoPE positional encoding, RMSNorm, SwiGLU activations, and Grouped-Query Attention (GQA) is the 2026 default: call it the 'Llama-style stack.' On top, frontier models layer Mixture-of-Experts (DeepSeek V3, Llama 4, Qwen 3), Multi-head Latent Attention (DeepSeek's MLA), state-space hybrids (Jamba, Samba, Hymba), and dynamic-compute routing (MoD, MoR).
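For orientation, here is a minimal PyTorch sketch of one 'Llama-style' block: pre-RMSNorm, grouped-query attention, and a SwiGLU MLP. RoPE, KV caching, MoE routing, and frontier-scale dimensions are deliberately omitted; treat it as a reading aid, not a training-ready module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then scale.
        return self.weight * x * x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()

class GQABlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2, mlp_mult=4):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.attn_norm = RMSNorm(dim)
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        self.mlp_norm = RMSNorm(dim)
        hidden = mlp_mult * dim
        self.gate = nn.Linear(dim, hidden, bias=False)   # SwiGLU gate branch
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)                            # pre-norm
        q = self.q_proj(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # GQA: each KV head serves n_heads // n_kv_heads query heads.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))
        h = self.mlp_norm(x)
        return x + self.down(F.silu(self.gate(h)) * self.up(h))  # SwiGLU MLP

block = GQABlock()
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```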
The $100M-$1B run where everything has to go right.
A frontier pretraining run in 2026 burns 10-50k+ Blackwell GPUs for 2-6 months on 15-40T tokens. Scaling laws, hyperparameter transfer (μP), low-precision training (FP8/MXFP4), and a curriculum that anneals on the highest-quality data at the end are now table stakes. Multi-token prediction (MTP) is the new auxiliary objective.
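A back-of-the-envelope budget shows why these runs cost what they do. The sketch below uses the standard ~6ND FLOPs approximation and a Chinchilla-style ~20 tokens-per-parameter rule of thumb; the per-GPU throughput and MFU figures are placeholders, not vendor specs.

```python
# Back-of-the-envelope training budget: FLOPs ~ 6 * N_params * N_tokens.
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

def gpu_months(flops: float, gpus: int,
               flops_per_gpu: float = 2.0e15, mfu: float = 0.40) -> float:
    """Wall-clock months at a sustained MFU (per-GPU peak is a placeholder)."""
    seconds = flops / (gpus * flops_per_gpu * mfu)
    return seconds / (30 * 24 * 3600)

n_params = 1e12            # illustrative 1T-parameter model
n_tokens = 20 * n_params   # Chinchilla-style tokens-per-parameter rule of thumb
flops = training_flops(n_params, n_tokens)
print(f"{flops:.2e} FLOPs, ~{gpu_months(flops, gpus=50_000):.1f} months on 50k GPUs")
```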
Where a base model becomes useful, or harmful.
Post-training in 2026 is a multi-stage pipeline: SFT on high-quality instructions → preference optimization (DPO and variants) → RLVR (Reinforcement Learning from Verifiable Rewards) on math/code/agentic tasks → final safety polish. Constitutional AI and deliberative alignment shape behavior; distillation from R1/o-class reasoners injects long-CoT capabilities into smaller models.
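The preference-optimization step is compact enough to show directly. Below is a minimal sketch of the DPO objective on per-sequence log-probabilities from the policy and a frozen reference model; batching, masking, and log-prob extraction are elided, and the toy numbers are made up.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: push the policy's implicit reward for the chosen response above the
    rejected one, measured relative to a frozen reference model."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with fabricated per-sequence log-probs (batch of 2).
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```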
When you can't make the model bigger, make it think longer.
The 2025 breakthrough: train models with RL on verifiable rewards, give them an 'unbounded' scratchpad, and long chain-of-thought reasoning emerges. OpenAI's o1/o3, DeepSeek-R1, Gemini 2.5 Pro thinking, Claude Opus 4.x with extended thinking, and Kimi k1.5 all train this way. Inference now spends 1k-100k+ thinking tokens before producing a final answer.
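The 'verifiable' part of RLVR is often as simple as string- or program-level checking. The toy reward below assumes a \boxed{...} final-answer convention purely for illustration; real graders normalize answers, run unit tests, or execute code.

```python
# Toy verifiable reward: 1.0 only if the model's boxed final answer matches
# ground truth; no learned reward model involved.
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(verifiable_reward("... long reasoning ... so the answer is \\boxed{42}", "42"))  # 1.0
```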
From chatbot to autonomous worker.
Agents are LLMs in a loop: observe, plan, call a tool, observe the result, repeat. The 2025-2026 stack centers on function calling, the Model Context Protocol (MCP), browser/computer-use models (Anthropic Computer Use, OpenAI Operator, Google Project Mariner), and verifiable evaluation harnesses (SWE-Bench, OSWorld, Tau-Bench). Single strong agents with good tools beat elaborate multi-agent orchestration for most tasks.
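The loop itself is small; the hard parts are the model and the tools. Below is a skeletal observe-act-observe loop in which call_model and the TOOLS table are stubs standing in for an LLM function-calling API and MCP-exposed tools.

```python
import json

# Stub tools standing in for MCP-exposed capabilities.
TOOLS = {
    "search": lambda query: f"(stub) results for {query!r}",
    "read_file": lambda path: f"(stub) contents of {path}",
}

def call_model(messages):
    """Stub for an LLM function-calling API: request one tool, then answer."""
    if any(m["role"] == "tool" for m in messages):
        return {"tool": None, "args": None, "final": "(stub) answer using tool output"}
    return {"tool": "read_file", "args": {"path": "README.md"}, "final": None}

def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_model(messages)
        if decision["tool"] is None:          # model chose to stop and answer
            return decision["final"]
        result = TOOLS[decision["tool"]](**decision["args"])
        messages.append({"role": "tool", "content": json.dumps({"result": result})})
    return "step budget exhausted"

print(run_agent("Summarize the repo layout"))
```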
Native any-to-any is the new default.
2025-2026 frontier models are natively multimodal: text, images, audio, and (increasingly) video share a unified token space and a single backbone. GPT-5.5 unifies all four modalities in one model. Gemini 2.5 Pro handles 1M-token multimodal contexts. Vision encoders evolved from CLIP → SigLIP → SigLIP 2. Real-time voice (Moshi, GPT-4o) and generative video (Sora, Veo 3) crossed into product reality.
Training is one-time. Inference is forever.
Serving a frontier model economically requires every trick: FlashAttention 3 kernels, paged KV-cache (vLLM), speculative decoding (EAGLE-3, self-speculation), low-precision quantization (FP8, INT4, MXFP4), prompt caching, and disaggregated prefill/decode. Inference cost has dropped 10-100x in 18 months while quality climbed.
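Speculative decoding is the most algorithmically interesting of these tricks, and a greedy-verification version fits in a few lines. In the sketch below, toy_draft_next and toy_target_argmax are stand-ins for a small draft model and one batched target forward pass; production systems (EAGLE-style) verify against sampled distributions rather than argmax.

```python
def speculative_decode(prompt, draft_next, target_argmax, k=4, max_new=64):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1) Draft k candidate tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) One target pass scores every drafted position at once.
        verified = target_argmax(tokens, draft)
        # 3) Keep the longest agreeing prefix, plus the target's correction
        #    at the first disagreement.
        n_accept = 0
        while n_accept < k and draft[n_accept] == verified[n_accept]:
            n_accept += 1
        tokens.extend(draft[:n_accept])
        if n_accept < k:
            tokens.append(verified[n_accept])
    return tokens

def toy_draft_next(tokens):
    return (tokens[-1] + 1) % 100            # stand-in for a small draft model

def toy_target_argmax(tokens, draft):
    # Stand-in for a target forward pass: agrees except on multiples of 7.
    return [d if d % 7 != 0 else (d + 1) % 100 for d in draft]

print(speculative_decode([1, 2, 3], toy_draft_next, toy_target_argmax, k=4, max_new=16))
```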
100,000 GPUs is the new 10,000.
A 2026 frontier training run occupies a cluster of 50k-200k+ Blackwell or TPU v7 chips, networked with InfiniBand or NVLink fabric, sustained for months. xAI's Colossus, OpenAI's Stargate, Meta's Hyperion, and Anthropic's Project Rainier (with AWS Trainium2) define the new scale. 3D+ parallelism (data × tensor × pipeline × expert × context), FSDP/ZeRO, FP8 collectives, and resilient checkpointing are mandatory.
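The parallelism axes multiply: the product of the tensor, pipeline, expert, and context degrees must divide the cluster, and whatever is left over becomes the data-parallel degree. The factorization below is a toy illustration, not a recommended configuration.

```python
# Toy device-mesh factorization across the parallelism axes named above.
def device_mesh(total_gpus, tensor=8, pipeline=16, expert=8, context=2):
    model_parallel = tensor * pipeline * expert * context
    assert total_gpus % model_parallel == 0, "axes must divide the cluster size"
    return {"data": total_gpus // model_parallel, "tensor": tensor,
            "pipeline": pipeline, "expert": expert, "context": context}

print(device_mesh(131_072))  # 131,072 GPUs -> data-parallel degree 64
```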
If you can't measure it, you don't have it.
2026 frontier evaluation goes far beyond MMLU. The benchmarks that actually distinguish models โ GPQA Diamond, FrontierMath, ARC-AGI 2, Humanity's Last Exam, SWE-Bench Verified, OSWorld โ are graduate-or-expert difficulty. On the safety side, Responsible Scaling Policies, mandatory dangerous-capability evals (CBRN, cyber, autonomy), constitutional classifiers, and mechanistic interpretability via Sparse Autoencoders are now release-blockers at frontier labs.
How the chapters connect. Skip a step and the model still works; it just won't be a frontier model.
A frontier pretraining run in 2026 costs $100M-$1B. A frontier inference fleet costs 5-20x that, ongoing. The Stargate program (OpenAI/SoftBank/Oracle/MGX, announced January 2025) committed $500B over four years, and that's for one consortium.
This guide is written assuming you have access to that level of capital, or are merely curious how it's spent. The same techniques scale down (DeepSeek-R1 reproduced o1-class reasoning at ~$5-6M training cost), but the chapters below describe no-compromises choices.
Or jump to any chapter above. Every page is self-contained, but the order is opinionated.
Begin: Pretraining Data →