Build a Frontier Model
🏗️
Step 10 of 11

Training Infrastructure

100,000 GPUs is the new 10,000.

A 2026 frontier training run occupies a cluster of 50k-200k+ Blackwell or TPU v7 chips, networked over InfiniBand or an NVLink fabric and sustained for months. xAI's Colossus, OpenAI's Stargate, Meta's Hyperion, and Anthropic's Project Rainier (with AWS Trainium2) define the new scale. Multi-dimensional "3D+" parallelism (data × tensor × pipeline × expert × context), FSDP/ZeRO, FP8 collectives, and resilient checkpointing are mandatory.

Why it matters

  • Stargate ($500B announced Jan 2025, OpenAI/SoftBank/Oracle/MGX) — first site in Abilene TX is operational at ~1.2GW.
  • xAI Colossus reached 200k+ H100/H200 in 2025; targeting 1M GPUs.
  • Anthropic's Project Rainier (with AWS) deploys ~1M Trainium2 chips — frontier-scale alternative to NVIDIA.
  • Meta's Llama 3 paper documented 419 unexpected interruptions, most hardware-related, in a 54-day run on 16k GPUs — fault tolerance is engineering, not luck.

State of the art

2025-2026
  • GB200 NVL72 racks — 72 Blackwell GPUs treated as a single unit, ~1.4 EFLOPS (FP4) per rack.
  • TPU v7 Ironwood (Apr 2025) — Google's inference-optimized TPU; pods up to 9,216 chips synchronously.
  • AMD MI355X with 288GB HBM (vs 180GB on B200) — OpenAI signed a multi-gigawatt Instinct GPU deal in Oct 2025.
  • FP8 training in production at frontier scale (DeepSeek-V3 first; now standard).
  • DualPipe and cross-node all-to-all overlap (DeepSeek-V3) for MoE expert parallelism.
  • Multi-datacenter training experiments (Stargate distributed, Gemini 1.5) — synchronous gradient exchange across geographically separated sites.

The recipe

A frontier-grade implementation, in order.

1

Cluster spec

10k-200k accelerators (B200/GB200, TPU v6/v7, MI355X, Trainium2). 800 Gb/s-3.2 Tb/s of per-node InfiniBand or NVLink-fabric bandwidth. NVMe scratch plus a parallel filesystem.
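
A quick way to pressure-test a cluster spec against a token budget is the 6·N·D rule of thumb for dense transformers. The sketch below is plain arithmetic; the chip count, peak FLOPs, MFU, and model/token sizes are illustrative assumptions, not figures from this guide.

```python
# Back-of-envelope sizing for a cluster spec. Every number is an illustrative
# assumption, not a vendor figure: swap in your own chip count, peak FLOPs,
# expected MFU, and model/token budget.

def training_days(n_params: float, n_tokens: float, n_chips: int,
                  peak_flops_per_chip: float, mfu: float) -> float:
    """Wall-clock days using the ~6*N*D FLOPs rule of thumb for dense transformers."""
    total_flops = 6 * n_params * n_tokens                   # forward + backward compute
    sustained_flops = n_chips * peak_flops_per_chip * mfu   # what the cluster actually delivers
    return total_flops / sustained_flops / 86_400           # seconds per day

# Example: 500B dense params, 15T tokens, 50k chips at 2 PFLOPS (FP8) peak, 40% MFU.
print(f"{training_days(5e11, 1.5e13, 50_000, 2e15, 0.40):.0f} days")   # ~13 days
```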

2

Parallelism strategy

DP (data) × TP (tensor) × PP (pipeline) × EP (expert, MoE) × CP (context, long-seq). Megatron-LM is the reference codebase.
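
For intuition, here is a minimal sketch of the bookkeeping behind that product: rank layout differs by framework (Megatron-LM, DeepSpeed, and others), but the grid dimensions have to multiply out to the world size. All values below are hypothetical.

```python
# A minimal sketch of the parallelism grid. Rank layout differs by framework;
# this only checks that the dimensions multiply out. Values are hypothetical.

from dataclasses import dataclass

@dataclass
class ParallelConfig:
    world_size: int   # total accelerators in the job
    tp: int           # tensor parallel (keep inside the NVLink domain)
    pp: int           # pipeline parallel (across nodes)
    cp: int           # context parallel (long sequences; 1 otherwise)
    ep: int           # expert parallel (MoE; commonly a subdivision of DP)

    @property
    def dp(self) -> int:
        """Data-parallel replicas are whatever is left after TP x PP x CP."""
        model_ranks = self.tp * self.pp * self.cp
        assert self.world_size % model_ranks == 0, "grid must divide world size"
        dp = self.world_size // model_ranks
        assert dp % self.ep == 0, "expert parallelism must divide the DP group"
        return dp

cfg = ParallelConfig(world_size=16_384, tp=8, pp=16, cp=2, ep=8)
print(f"DP replicas: {cfg.dp}")   # 16384 / (8*16*2) = 64
```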

3

Memory partitioning

FSDP / ZeRO-3 for params + grads + optimizer states. CPU/NVMe offload for largest models. Activation checkpointing at 50-100% recompute.
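
A minimal PyTorch FSDP sketch of full (ZeRO-3-style) sharding plus 100% activation recompute. It assumes the job is launched with torchrun so the process group already exists; TransformerBlock is a toy stand-in for a real decoder block, and FP8 collectives (via Transformer Engine) are not shown.

```python
# A minimal FSDP (ZeRO-3-style) sharding sketch. Assumes torchrun has launched
# the job and torch.distributed is initialized; `TransformerBlock` is a toy
# stand-in for a real decoder block.

import functools
import torch
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy, MixedPrecision, CPUOffload,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)

class TransformerBlock(nn.Module):          # placeholder block
    def __init__(self, d: int = 4096):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=32, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, x):
        a, _ = self.attn(x, x, x)
        return x + a + self.mlp(x)

model = nn.Sequential(*[TransformerBlock() for _ in range(8)]).cuda()

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,       # shard params + grads + optimizer states
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={TransformerBlock},
    ),
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.bfloat16),
    cpu_offload=CPUOffload(offload_params=False),         # enable only when HBM runs out
    device_id=torch.cuda.current_device(),
)

# 100% recompute: checkpoint every block; switch to a selective check_fn to
# trade memory back for compute.
apply_activation_checkpointing(model, check_fn=lambda m: isinstance(m, TransformerBlock))
```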

4

Communication overlap

Overlap collectives with compute. SHARP in-network reduction. DualPipe-style cross-node overlap for MoE.
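
Frameworks do this overlap per-bucket inside the backward pass; the sketch below only shows the underlying pattern with PyTorch's asynchronous collectives, assuming an initialized NCCL process group.

```python
# A minimal sketch of overlapping a collective with compute using PyTorch's
# async collectives. Assumes an initialized NCCL process group (e.g. torchrun);
# real frameworks do this per-bucket inside the backward pass.

import torch
import torch.distributed as dist

def overlapped_grad_reduce(grad_buckets, independent_compute):
    """Launch every all-reduce, run compute that doesn't need the results, then wait."""
    handles = []
    for bucket in grad_buckets:
        # async_op=True returns a Work handle immediately; NCCL reduces on its
        # own stream while the default stream keeps computing.
        # (ReduceOp.AVG needs a reasonably recent NCCL; use SUM and rescale otherwise.)
        handles.append(dist.all_reduce(bucket, op=dist.ReduceOp.AVG, async_op=True))

    independent_compute()   # e.g. backward for layers whose grads aren't in these buckets

    for h in handles:
        h.wait()            # block only where the reduced grads are consumed
```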

5

Fault tolerance

Async checkpointing every 15-30 min. Health checks for silent data corruption (SDC). Automatic node fencing + rerun-from-checkpoint.
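
Production runs use sharded checkpoint writers (torch.distributed.checkpoint, Orbax, and similar). The sketch below shows only the overlap pattern behind async checkpointing: snapshot to host memory, then write from a background thread while training continues. Paths and cadence are illustrative.

```python
# A minimal async-checkpoint sketch: snapshot to host memory, then write from a
# background thread so the GPUs keep training. Paths and cadence are illustrative.

import threading
import time
import torch

def snapshot_to_cpu(model, optimizer):
    """Cheap blocking copy of state onto the host, freeing the GPUs to continue."""
    return {
        "model": {k: v.detach().to("cpu", copy=True) for k, v in model.state_dict().items()},
        "optim": optimizer.state_dict(),   # a real run also moves optimizer GPU tensors to CPU
        "wall_time": time.time(),
    }

def async_save(state, path):
    """Serialize the host-side snapshot off the critical path."""
    t = threading.Thread(target=torch.save, args=(state, path), daemon=True)
    t.start()
    return t   # join before taking the next snapshot, or at exit

# In the training loop, every 15-30 minutes of wall clock:
#   state = snapshot_to_cpu(model, optimizer)
#   pending = async_save(state, f"/checkpoints/step_{step:08d}.pt")
```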

6

Telemetry

Per-layer grad norm, activation norm, loss spikes. MFU (Model FLOPs Utilization) target ≥40%. Track every node's throughput against its cohort to catch stragglers.
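
MFU itself is a one-liner once you fix the FLOPs model. Below is a minimal version under the ~6·N·D FLOPs-per-token approximation for dense transformers (attention FLOPs omitted); the example numbers are illustrative, roughly a 70B dense run on 8k chips.

```python
# A minimal MFU calculation using the ~6*N*D FLOPs-per-token approximation for
# dense transformers (attention FLOPs omitted; add them for long contexts).
# The example numbers are illustrative.

def mfu(n_params: float, tokens_per_step: float, step_time_s: float,
        n_chips: int, peak_flops_per_chip: float) -> float:
    achieved = 6 * n_params * tokens_per_step / step_time_s   # FLOP/s actually sustained
    peak = n_chips * peak_flops_per_chip                      # cluster peak at the training dtype
    return achieved / peak

# 70B dense params, 16M-token global batch, 2.1 s/step, 8,192 chips at 989 TFLOPS (BF16).
print(f"MFU = {mfu(7e10, 1.6e7, 2.1, 8192, 9.89e14):.1%}")    # ~39%
```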

⚠️

Common pitfalls

Silent Data Corruption (SDC) on Hopper/Blackwell at scale is real — Meta's Llama 3 paper documents detecting it in production. Don't trust your training run without per-step gradient sanity checks (a sketch follows these pitfalls).
Pipeline parallelism creates bubbles; better schedules (interleaved 1F1B, zero-bubble ZB-H1) shrink them. Don't use naive PP above ~16 stages.
MoE all-to-all collectives are the bottleneck — expert parallelism + DualPipe-style overlap is the difference between 30% and 50% MFU.
Storage I/O often becomes the bottleneck, not GPUs. Provision parallel filesystems (Lustre, WekaFS) generously.
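
To make the first pitfall concrete, here is a minimal per-step gradient sanity check: reject NaN/Inf and sudden grad-norm spikes before the optimizer applies them. Real SDC defenses also add redundant computation and hardware health counters; the window size and spike threshold below are illustrative.

```python
# A minimal per-step gradient sanity check: reject NaN/Inf and sudden
# grad-norm spikes before the optimizer applies them. Window and threshold
# are illustrative, not tuned values.

import math
from collections import deque
import torch

class GradSanityChecker:
    def __init__(self, window: int = 100, spike_factor: float = 10.0):
        self.history = deque(maxlen=window)   # recent global grad norms
        self.spike_factor = spike_factor

    def check(self, model: torch.nn.Module) -> bool:
        """Return True if this step's gradients look safe to apply."""
        sq_sum = 0.0
        for p in model.parameters():
            if p.grad is not None:
                sq_sum += p.grad.detach().float().pow(2).sum().item()
        norm = math.sqrt(sq_sum)

        if not math.isfinite(norm):
            return False   # NaN/Inf: skip the step and flag the rank
        if self.history and norm > self.spike_factor * (sum(self.history) / len(self.history)):
            return False   # spike: possible SDC or a bad batch; keep it out of the running mean
        self.history.append(norm)
        return True

# checker = GradSanityChecker()
# if checker.check(model):
#     optimizer.step()
# else:
#     optimizer.zero_grad()   # skip the update; log the step and the offending rank
```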