Build a Frontier Model
🏗️
Step 10 of 11

Training Infrastructure

100,000 GPUs is the new 10,000.

A 2026 frontier training run occupies a cluster of 50k-200k+ Blackwell or TPU v7 chips, networked over InfiniBand or an NVLink fabric and sustained for months. xAI's Colossus, OpenAI's Stargate, Meta's Hyperion, and Anthropic's Project Rainier (with AWS Trainium2) define the new scale. Multi-dimensional "3D+" parallelism (data × tensor × pipeline × expert × context), FSDP/ZeRO, FP8 collectives, and resilient checkpointing are mandatory.

Why it matters

  • Stargate ($500B announced Jan 2025, OpenAI/SoftBank/Oracle/MGX) — first site in Abilene TX is operational at ~1.2GW.
  • xAI Colossus reached 200k+ H100/H200 in 2025; targeting 1M GPUs.
  • Anthropic's Project Rainier (with AWS) deploys ~1M Trainium2 chips — frontier-scale alternative to NVIDIA.
  • Meta's Llama 3 paper documented 419 unexpected interruptions, most hardware-related, in a 54-day run on 16k GPUs — fault tolerance is engineering, not luck.

State of the art

2025-2026
  • GB200 NVL72 racks — 72 Blackwell GPUs treated as a single unit, ~1.4 EFLOPS (FP4) per rack.
  • TPU v7 Ironwood (Apr 2025) — Google's inference-optimized TPU; pods up to 9,216 chips synchronously.
  • AMD MI355X with 288GB HBM (vs 180GB on B200) — OpenAI signed a multi-gigawatt Instinct GPU deal in Oct 2025.
  • FP8 training in production at frontier scale (DeepSeek-V3 first; now standard).
  • DualPipe and cross-node all-to-all overlap (DeepSeek-V3) for MoE expert parallelism.
  • Multi-datacenter training experiments (Stargate distributed, Gemini 1.5) — synchronous gradient exchange across geographically separated sites.

The recipe

A frontier-grade implementation, in order.

1

Cluster spec

10k-200k accelerators (B200/GB200, TPU v6/v7, MI355X, Trainium2). 800 Gb/s-3.2 Tb/s of per-node InfiniBand or NVLink-fabric bandwidth. NVMe scratch plus a parallel filesystem.
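
A quick way to pressure-test a cluster spec against a token budget is the 6·N·D rule of thumb for dense transformers. The sketch below is plain arithmetic; the chip count, peak FLOPs, MFU, and model/token sizes are illustrative assumptions, not figures from this guide.

```python
# Back-of-envelope sizing for a cluster spec. Every number is an illustrative
# assumption, not a vendor figure: swap in your own chip count, peak FLOPs,
# expected MFU, and model/token budget.

def training_days(n_params: float, n_tokens: float, n_chips: int,
                  peak_flops_per_chip: float, mfu: float) -> float:
    """Wall-clock days using the ~6*N*D FLOPs rule of thumb for dense transformers."""
    total_flops = 6 * n_params * n_tokens                   # forward + backward compute
    sustained_flops = n_chips * peak_flops_per_chip * mfu   # what the cluster actually delivers
    return total_flops / sustained_flops / 86_400           # seconds per day

# Example: 500B dense params, 15T tokens, 50k chips at 2 PFLOPS (FP8) peak, 40% MFU.
print(f"{training_days(5e11, 1.5e13, 50_000, 2e15, 0.40):.0f} days")   # ~13 days
```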

2

Parallelism strategy

DP (data) × TP (tensor) × PP (pipeline) × EP (expert, MoE) × CP (context, long-seq). Megatron-LM is the reference codebase.
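
For intuition, here is a minimal sketch of the bookkeeping behind that product: rank layout differs by framework (Megatron-LM, DeepSpeed, and others), but the grid dimensions have to multiply out to the world size. All values below are hypothetical.

```python
# A minimal sketch of the parallelism grid. Rank layout differs by framework;
# this only checks that the dimensions multiply out. Values are hypothetical.

from dataclasses import dataclass

@dataclass
class ParallelConfig:
    world_size: int   # total accelerators in the job
    tp: int           # tensor parallel (keep inside the NVLink domain)
    pp: int           # pipeline parallel (across nodes)
    cp: int           # context parallel (long sequences; 1 otherwise)
    ep: int           # expert parallel (MoE; commonly a subdivision of DP)

    @property
    def dp(self) -> int:
        """Data-parallel replicas are whatever is left after TP x PP x CP."""
        model_ranks = self.tp * self.pp * self.cp
        assert self.world_size % model_ranks == 0, "grid must divide world size"
        dp = self.world_size // model_ranks
        assert dp % self.ep == 0, "expert parallelism must divide the DP group"
        return dp

cfg = ParallelConfig(world_size=16_384, tp=8, pp=16, cp=2, ep=8)
print(f"DP replicas: {cfg.dp}")   # 16384 / (8*16*2) = 64
```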

3

Memory partitioning

FSDP / ZeRO-3 for params + grads + optimizer states. CPU/NVMe offload for largest models. Activation checkpointing at 50-100% recompute.
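
A minimal PyTorch FSDP sketch of full (ZeRO-3-style) sharding plus 100% activation recompute. It assumes the job is launched with torchrun so the process group already exists; TransformerBlock is a toy stand-in for a real decoder block, and FP8 collectives (via Transformer Engine) are not shown.

```python
# A minimal FSDP (ZeRO-3-style) sharding sketch. Assumes torchrun has launched
# the job and torch.distributed is initialized; `TransformerBlock` is a toy
# stand-in for a real decoder block.

import functools
import torch
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy, MixedPrecision, CPUOffload,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)

class TransformerBlock(nn.Module):          # placeholder block
    def __init__(self, d: int = 4096):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=32, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, x):
        a, _ = self.attn(x, x, x)
        return x + a + self.mlp(x)

model = nn.Sequential(*[TransformerBlock() for _ in range(8)]).cuda()

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,       # shard params + grads + optimizer states
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={TransformerBlock},
    ),
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16),
    cpu_offload=CPUOffload(offload_params=False),         # enable only when HBM runs out
    device_id=torch.cuda.current_device(),
)

# 100% recompute: checkpoint every block; switch to a selective check_fn to
# trade memory back for compute.
apply_activation_checkpointing(model, check_fn=lambda m: isinstance(m, TransformerBlock))
```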

4

Communication overlap

Overlap collectives with compute. SHARP in-network reduction. DualPipe-style cross-node overlap for MoE.
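
Frameworks do this overlap per-bucket inside the backward pass; the sketch below only shows the underlying pattern with PyTorch's asynchronous collectives, assuming an initialized NCCL process group.

```python
# A minimal sketch of overlapping a collective with compute using PyTorch's
# async collectives. Assumes an initialized NCCL process group (e.g. torchrun);
# real frameworks do this per-bucket inside the backward pass.

import torch
import torch.distributed as dist

def overlapped_grad_reduce(grad_buckets, independent_compute):
    """Launch every all-reduce, run compute that doesn't need the results, then wait."""
    handles = []
    for bucket in grad_buckets:
        # async_op=True returns a Work handle immediately; NCCL reduces on its
        # own stream while the default stream keeps computing.
        # (ReduceOp.AVG needs a reasonably recent NCCL; use SUM and rescale otherwise.)
        handles.append(dist.all_reduce(bucket, op=dist.ReduceOp.AVG, async_op=True))

    independent_compute()   # e.g. backward for layers whose grads aren't in these buckets

    for h in handles:
        h.wait()            # block only where the reduced grads are consumed
```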

5

Fault tolerance

Async checkpointing every 15-30 min. Health checks for silent data corruption (SDC). Automatic node fencing + rerun-from-checkpoint.
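
Production runs use sharded checkpoint writers (torch.distributed.checkpoint, Orbax, and similar). The sketch below shows only the overlap pattern behind async checkpointing: snapshot to host memory, then write from a background thread while training continues. Paths and cadence are illustrative.

```python
# A minimal async-checkpoint sketch: snapshot to host memory, then write from a
# background thread so the GPUs keep training. Paths and cadence are illustrative.

import threading
import time
import torch

def snapshot_to_cpu(model, optimizer):
    """Cheap blocking copy of state onto the host, freeing the GPUs to continue."""
    return {
        "model": {k: v.detach().to("cpu", copy=True) for k, v in model.state_dict().items()},
        "optim": optimizer.state_dict(),   # a real run also moves optimizer GPU tensors to CPU
        "wall_time": time.time(),
    }

def async_save(state, path):
    """Serialize the host-side snapshot off the critical path."""
    t = threading.Thread(target=torch.save, args=(state, path), daemon=True)
    t.start()
    return t   # join before taking the next snapshot, or at exit

# In the training loop, every 15-30 minutes of wall clock:
#   state = snapshot_to_cpu(model, optimizer)
#   pending = async_save(state, f"/checkpoints/step_{step:08d}.pt")
```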

6

Telemetry

Per-layer grad norm, activation norm, loss spikes. MFU (Model FLOPs Utilization) target ≥40%. Track every node's throughput against its cohort to catch stragglers.
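
MFU itself is a one-liner once you fix the FLOPs model. Below is a minimal version under the ~6·N·D FLOPs-per-token approximation for dense transformers (attention FLOPs omitted); the example numbers are illustrative, roughly a 70B dense run on 8k chips.

```python
# A minimal MFU calculation using the ~6*N*D FLOPs-per-token approximation for
# dense transformers (attention FLOPs omitted; add them for long contexts).
# The example numbers are illustrative.

def mfu(n_params: float, tokens_per_step: float, step_time_s: float,
        n_chips: int, peak_flops_per_chip: float) -> float:
    achieved = 6 * n_params * tokens_per_step / step_time_s   # FLOP/s actually sustained
    peak = n_chips * peak_flops_per_chip                      # cluster peak at the training dtype
    return achieved / peak

# 70B dense params, 16M-token global batch, 2.1 s/step, 8,192 chips at 989 TFLOPS (BF16).
print(f"MFU = {mfu(7e10, 1.6e7, 2.1, 8192, 9.89e14):.1%}")    # ~39%
```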

⚠️

Common pitfalls

Silent Data Corruption (SDC) on Hopper/Blackwell at scale is real — Meta's Llama 3 paper documents detecting it in production. Don't trust your training run without per-step gradient sanity checks (a sketch follows these pitfalls).
Pipeline parallelism creates bubbles; better schedules (interleaved 1F1B, zero-bubble ZB-H1) shrink them. Don't use naive PP above ~16 stages.
MoE all-to-all collectives are the bottleneck — expert parallelism + DualPipe-style overlap is the difference between 30% and 50% MFU.
Storage I/O often becomes the bottleneck, not GPUs. Provision parallel filesystems (Lustre, WekaFS) generously.
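
To make the first pitfall concrete, here is a minimal per-step gradient sanity check: reject NaN/Inf and sudden grad-norm spikes before the optimizer applies them. Real SDC defenses also add redundant computation and hardware health counters; the window size and spike threshold below are illustrative.

```python
# A minimal per-step gradient sanity check: reject NaN/Inf and sudden
# grad-norm spikes before the optimizer applies them. Window and threshold
# are illustrative, not tuned values.

import math
from collections import deque
import torch

class GradSanityChecker:
    def __init__(self, window: int = 100, spike_factor: float = 10.0):
        self.history = deque(maxlen=window)   # recent global grad norms
        self.spike_factor = spike_factor

    def check(self, model: torch.nn.Module) -> bool:
        """Return True if this step's gradients look safe to apply."""
        sq_sum = 0.0
        for p in model.parameters():
            if p.grad is not None:
                sq_sum += p.grad.detach().float().pow(2).sum().item()
        norm = math.sqrt(sq_sum)

        if not math.isfinite(norm):
            return False   # NaN/Inf: skip the step and flag the rank
        if self.history and norm > self.spike_factor * (sum(self.history) / len(self.history)):
            return False   # spike: possible SDC or a bad batch; keep it out of the running mean
        self.history.append(norm)
        return True

# checker = GradSanityChecker()
# if checker.check(model):
#     optimizer.step()
# else:
#     optimizer.zero_grad()   # skip the update; log the step and the offending rank
```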