100,000 GPUs is the new 10,000.
A 2026 frontier training run occupies a cluster of 50k-200k+ Blackwell or TPU v7 chips, networked over InfiniBand or NVLink fabric and sustained for months. xAI's Colossus, OpenAI's Stargate, Meta's Hyperion, and Anthropic's Project Rainier (built on AWS Trainium2) define the new scale. Multi-dimensional parallelism (data × tensor × pipeline × expert × context), FSDP/ZeRO, FP8 collectives, and resilient checkpointing are mandatory.
The components of a frontier-grade implementation, in order:
10k-200k accelerators (B200/GB200, TPU v6/v7, MI355X, Trainium2). 800Gb/s-3.2Tb/s InfiniBand or NVLink fabric. NVMe scratch plus a parallel filesystem.
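To make the cluster sizes concrete, here is a back-of-envelope sketch of wall-clock time for a fixed training FLOP budget. The per-chip peak throughput (~2.25e15 dense BF16 FLOP/s for a B200-class part), the 40% utilization, and the 2e25-FLOP budget are illustrative assumptions, not figures from the text.

```python
# Back-of-envelope: sustained cluster throughput vs. wall-clock training time.
# Assumed numbers: ~2.25e15 dense BF16 FLOP/s per chip, 40% MFU, 2e25-FLOP run.

def days_to_train(total_flops, n_chips, peak_flops_per_chip, mfu):
    """Days of sustained training at the given utilization."""
    sustained = n_chips * peak_flops_per_chip * mfu   # effective FLOP/s
    return total_flops / sustained / 86_400           # seconds -> days

# Same budget on 100k chips vs 10k chips:
print(round(days_to_train(2e25, 100_000, 2.25e15, 0.40), 1))  # 2.6
print(round(days_to_train(2e25, 10_000, 2.25e15, 0.40), 1))   # 25.7
```

The 10x chip count buys a linear 10x reduction in calendar time, which is the whole argument for 100k-chip clusters: the run fits in weeks instead of a quarter.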
DP (data) × TP (tensor) × PP (pipeline) × EP (expert, for MoE) × CP (context, for long sequences). Megatron-LM is the reference codebase.
FSDP / ZeRO-3 sharding for params, grads, and optimizer states. CPU/NVMe offload for the largest models. Activation checkpointing at 50-100% recompute.
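Why sharding is non-negotiable at this scale: a rough memory-accounting sketch, assuming BF16 params/grads and FP32 Adam state (master weights, momentum, variance), i.e. ~16 bytes per parameter before sharding. These precision choices are a common setup, not something the text specifies.

```python
# Sketch: per-GPU memory for model state under ZeRO-3/FSDP sharding.
# Assumes bf16 param (2 B) + bf16 grad (2 B) + 3x fp32 Adam state (12 B).

def zero3_bytes_per_gpu(n_params, shard_world):
    bytes_per_param = 2 + 2 + 4 * 3   # 16 bytes of state per parameter
    return n_params * bytes_per_param / shard_world

# A 1T-parameter model sharded across 1,024 GPUs:
gib = zero3_bytes_per_gpu(1e12, 1024) / 2**30
print(f"{gib:.1f} GiB per GPU")
```

Unsharded, the same model state would need ~16 TB on every GPU; sharded 1,024 ways it fits comfortably, leaving HBM for activations, which is what the 50-100% recompute knob then trades against.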
Overlap collectives with compute. SHARP in-network reduction. DualPipe-style cross-node overlap for MoE.
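The payoff of overlapping collectives with compute can be seen in a toy timeline model: while backward computes bucket i, the collective for the previous bucket's finished gradients is already in flight. The bucket count and timings below are illustrative.

```python
# Toy timeline model of gradient-bucket overlap in backward.
# compute_ms[i]: backward time for bucket i; comm_ms[i]: its collective time.

def step_time(compute_ms, comm_ms, overlap):
    if not overlap:  # serial: all compute, then all communication
        return sum(compute_ms) + sum(comm_ms)
    t_compute = t_comm = 0.0
    for c, m in zip(compute_ms, comm_ms):
        t_compute += c                       # bucket's backward finishes here
        t_comm = max(t_comm, t_compute) + m  # its collective starts no earlier
    return t_comm                            # step ends when the last collective lands

print(step_time([10] * 4, [6] * 4, overlap=False))  # 64
print(step_time([10] * 4, [6] * 4, overlap=True))   # 46.0
```

Only the final bucket's collective is fully exposed; the rest hide behind compute, which is why frameworks chase overlap before chasing raw bandwidth.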
Async checkpointing every 15-30 min. Health checks for silent data corruption (SDC). Automatic node fencing + rerun-from-checkpoint.
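The async-checkpointing pattern can be sketched as: take a cheap synchronous snapshot of the state, then persist it from a background thread while training continues. A minimal sketch; the JSON format, file path, and state layout are illustrative stand-ins for real tensor checkpoints.

```python
# Sketch: asynchronous checkpointing with an atomic publish.
import json
import os
import tempfile
import threading

def async_save(state, path):
    snapshot = dict(state)          # cheap copy taken synchronously
    def _write():
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(snapshot, f)
        os.replace(tmp, path)       # atomic rename: readers never see a torn file
    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t                        # join() before taking the next snapshot

state = {"step": 1000, "loss": 2.31}
ckpt_dir = tempfile.mkdtemp()
t = async_save(state, os.path.join(ckpt_dir, "ckpt-1000.json"))
state["step"] = 1001                # training continues immediately
t.join()
```

The snapshot-then-mutate order is the point: the background writer sees step 1000 even though training has already moved on, and the tmp-file-plus-rename guarantees rerun-from-checkpoint never loads a half-written file.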
Per-layer grad norms, activation norms, loss spikes. Target MFU (Model FLOPs Utilization) ≥40%. Track every node's throughput against its cohort to catch stragglers.
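The MFU target above can be computed with the standard ~6N-FLOPs-per-token approximation for a dense transformer (forward plus backward). The token throughput and per-chip peak FLOP/s below are assumed, illustrative numbers:

```python
# Sketch: Model FLOPs Utilization for a dense transformer.
# Approximation: ~6 * n_params FLOPs per token (forward + backward).

def mfu(n_params, tokens_per_sec, n_chips, peak_flops_per_chip):
    model_flops_per_sec = 6 * n_params * tokens_per_sec
    return model_flops_per_sec / (n_chips * peak_flops_per_chip)

# 405B params, 6M tokens/s on 16,384 chips at an assumed 2.25e15 BF16 FLOP/s:
print(f"{mfu(405e9, 6e6, 16_384, 2.25e15):.1%}")  # 39.6%
```

Because MFU counts only the model's useful FLOPs, it folds communication stalls, recomputation, and stragglers into one number, which is why it is the headline efficiency metric for a run.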
Key papers, in rough order from foundational to recent.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism · Shoeybi et al. (NVIDIA) · 2019
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models · Rajbhandari et al. (Microsoft) · 2019
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel · Zhao et al. (Meta) · 2023
Ring Attention with Blockwise Transformers for Near-Infinite Context · Liu, Zaharia, Abbeel · 2023
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs · ByteDance · 2024
The Llama 3 Herd of Models · Meta AI · 2024
DeepSeek-V3 Technical Report · DeepSeek-AI · 2024
Reducing Activation Recomputation in Large Transformer Models · Korthikanti et al. (NVIDIA) · 2022