Build a Frontier Model
🏛️
Step 3 of 11

Architecture

Transformer is the default. Everything else is a knob.

Decoder-only Transformer with RoPE positional encoding, RMSNorm, SwiGLU activations, and Grouped-Query Attention (GQA) is the 2026 default — call it the 'Llama-style stack.' On top, frontier models layer Mixture-of-Experts (DeepSeek V3, Llama 4, Qwen 3), Multi-head Latent Attention (DeepSeek's MLA), state-space hybrids (Jamba, Samba, Hymba), and dynamic-compute routing (MoD, MoR).
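
To make the knobs explicit, here is a minimal config sketch (hypothetical field names; the defaults are illustrative Llama-3-style values, not any model's exact settings):

    from dataclasses import dataclass

    @dataclass
    class StackConfig:
        # Base Llama-style stack -- every field is a knob
        d_model: int = 4096
        n_layers: int = 32
        n_heads: int = 32
        n_kv_heads: int = 8           # GQA: 8 KV heads shared across 32 query heads
        ffn_mult: float = 8 / 3       # SwiGLU expansion factor
        rope_base: float = 500_000.0  # RoPE theta (Llama-3 uses 500k)
        # Optional frontier-style layers on top:
        n_experts: int = 0            # 0 = dense; >0 = Mixture-of-Experts
        experts_per_token: int = 0
        kv_strategy: str = "gqa"      # "gqa" or "mla"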

Why it matters

  • Architecture choices compound: cut the KV cache by some factor (as MLA does) and inference is cheaper by that factor forever; back-of-envelope sketch after this list.
  • MoE delivers 5-10x effective parameter count at the same active compute — DeepSeek V3 is 671B total / 37B active.
  • RoPE + YaRN extension enabled the long-context era (128k → 1M tokens) without retraining from scratch.
  • SSM/hybrid architectures (Mamba-2, Jamba, Samba) trade perfect recall for linear-time long context.
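
Back-of-envelope for the cache claim, under assumed 70B-class dimensions (80 layers, 128-dim heads, 64 query heads; the MLA line models DeepSeek V2's ~576-dim cached latent, so treat the exact ratios as illustrative):

    def kv_bytes_per_token(n_layers, d_head, n_kv_heads, bytes_per_val=2):
        # K and V each store n_kv_heads * d_head values per layer (bf16 = 2 bytes)
        return n_layers * 2 * n_kv_heads * d_head * bytes_per_val

    mha = kv_bytes_per_token(80, 128, 64)  # full multi-head attention
    gqa = kv_bytes_per_token(80, 128, 8)   # grouped-query, 8 KV heads
    mla = 80 * 576 * 2                     # one ~576-dim latent per layer (bf16)
    print(f"MHA {mha / 2**20:.2f} MiB/token | GQA {gqa / 2**20:.2f} | MLA {mla / 2**20:.3f}")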

State of the art

2025-2026
  • MLA (Multi-head Latent Attention, DeepSeek V2/V3) compresses KV into a low-rank latent — drastically smaller cache than GQA.
  • Auxiliary-loss-free MoE load balancing (DeepSeek, 2024) replaced the noisy aux-loss trick.
  • Native Sparse Attention (NSA, Feb 2025) brought sparse attention into pretraining instead of bolting it on at inference.
  • Diffusion language models (LLaDA, Mercury) crossed the threshold to competitive quality at 1B+ scale in 2025.
  • Mixture-of-Recursions (MoR) and Mixture-of-Depths (MoD) let tokens 'opt out' of layers for variable per-token compute; see the sketch after this list.
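
A minimal sketch of the MoD idea, assuming PyTorch and simplified shapes: the router picks a fixed fraction of tokens for the full block and everyone else rides the residual stream unchanged (real MoD also needs a causal router predictor for autoregressive sampling, omitted here):

    import torch

    def mod_layer(x, block, router, capacity=0.125):
        # x: (batch, seq, d); block maps (batch, k, d) -> residual update;
        # router: nn.Linear(d, 1). Only the top-k tokens pay for the block.
        scores = router(x).squeeze(-1)                     # (batch, seq)
        k = max(1, int(capacity * x.shape[1]))
        top_scores, top_idx = scores.topk(k, dim=-1)
        idx = top_idx.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        routed = x.gather(1, idx)                          # (batch, k, d)
        # Weight the update by the router score so the router trains end-to-end
        update = block(routed) * torch.sigmoid(top_scores).unsqueeze(-1)
        return x.scatter_add(1, idx, update)               # unrouted tokens skip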

The recipe

A frontier-grade implementation, in order.

1

Start from the Llama-3 reference

Decoder-only, RoPE, RMSNorm pre-norm, SwiGLU FFN (8/3 expansion), GQA with 8 KV heads. Don't deviate without reason.
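
A compressed sketch of one such block in PyTorch (our shapes and names, training-time only: simplified RoPE, no KV cache, and real configs round the FFN width to a hardware-friendly multiple):

    import torch, torch.nn as nn, torch.nn.functional as F

    class RMSNorm(nn.Module):
        def __init__(self, d, eps=1e-5):
            super().__init__()
            self.g, self.eps = nn.Parameter(torch.ones(d)), eps
        def forward(self, x):
            return self.g * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def rope(x, base=500_000.0):
        # x: (batch, heads, seq, d_head); rotate dim pairs by position-scaled angles
        t, d = x.shape[-2], x.shape[-1]
        inv = 1.0 / base ** (torch.arange(0, d, 2, device=x.device) / d)
        ang = torch.outer(torch.arange(t, device=x.device, dtype=inv.dtype), inv)
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    class Block(nn.Module):
        def __init__(self, d=4096, n_heads=32, n_kv=8):
            super().__init__()
            self.nh, self.nkv, self.hd = n_heads, n_kv, d // n_heads
            self.wq = nn.Linear(d, d, bias=False)
            self.wk = nn.Linear(d, n_kv * self.hd, bias=False)
            self.wv = nn.Linear(d, n_kv * self.hd, bias=False)
            self.wo = nn.Linear(d, d, bias=False)
            hidden = int(8 / 3 * d)                       # SwiGLU expansion
            self.w1 = nn.Linear(d, hidden, bias=False)    # gate
            self.w3 = nn.Linear(d, hidden, bias=False)    # up
            self.w2 = nn.Linear(hidden, d, bias=False)    # down
            self.n1, self.n2 = RMSNorm(d), RMSNorm(d)

        def forward(self, x):                             # pre-norm residual block
            b, t, d = x.shape
            h = self.n1(x)
            q = rope(self.wq(h).view(b, t, self.nh, self.hd).transpose(1, 2))
            k = rope(self.wk(h).view(b, t, self.nkv, self.hd).transpose(1, 2))
            v = self.wv(h).view(b, t, self.nkv, self.hd).transpose(1, 2)
            # GQA: each KV head serves n_heads // n_kv query heads
            k = k.repeat_interleave(self.nh // self.nkv, dim=1)
            v = v.repeat_interleave(self.nh // self.nkv, dim=1)
            a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            x = x + self.wo(a.transpose(1, 2).reshape(b, t, d))
            g = self.n2(x)
            return x + self.w2(F.silu(self.w1(g)) * self.w3(g))  # SwiGLU FFN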

2

Decide MoE or dense

MoE if your compute budget supports >10B-parameter scale AND you have the inference infra to host the experts. Use fine-grained experts + shared experts (DeepSeek-MoE).
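
A sketch of the fine-grained + shared-experts pattern (expert counts and widths are illustrative, not DeepSeek's; the per-expert loop is written for clarity rather than speed, and load balancing, which DeepSeek now does aux-loss-free, is omitted):

    import torch, torch.nn as nn, torch.nn.functional as F

    class MoEFFN(nn.Module):
        def __init__(self, d=1024, n_experts=64, n_shared=2, top_k=6, hidden=512):
            super().__init__()
            make = lambda: nn.Sequential(
                nn.Linear(d, hidden), nn.SiLU(), nn.Linear(hidden, d))
            self.experts = nn.ModuleList(make() for _ in range(n_experts))
            self.shared = nn.ModuleList(make() for _ in range(n_shared))  # always on
            self.gate = nn.Linear(d, n_experts, bias=False)
            self.top_k = top_k

        def forward(self, x):                          # x: (tokens, d)
            out = sum(s(x) for s in self.shared)       # shared experts see every token
            w = F.softmax(self.gate(x), dim=-1)
            top_w, top_i = w.topk(self.top_k, dim=-1)  # route each token to k experts
            top_w = top_w / top_w.sum(-1, keepdim=True)
            routed = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                hit = top_i == e                       # (tokens, top_k) bool
                rows = hit.any(-1)
                if rows.any():
                    coef = (top_w * hit).sum(-1)[rows].unsqueeze(-1)
                    routed[rows] += coef * expert(x[rows])
            return out + routed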

3

Pick KV strategy

GQA (8:1 query-to-KV-head ratio) is safe. MLA cuts the KV cache by ~10x but is harder to train and requires custom kernels.
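
A stripped-down sketch of the MLA idea: cache one low-rank latent per token and re-expand to K and V at attention time. This omits DeepSeek's decoupled RoPE path and query-side compression, both of which the real design needs:

    import torch.nn as nn

    class MLAProjection(nn.Module):
        # Cache a d_latent vector per token instead of full K and V tensors.
        def __init__(self, d=4096, n_heads=32, d_head=128, d_latent=512):
            super().__init__()
            self.down = nn.Linear(d, d_latent, bias=False)  # compress once
            self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
            self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

        def forward(self, h):
            latent = self.down(h)  # (batch, seq, d_latent) -- this is what's cached
            return latent, self.up_k(latent), self.up_v(latent)

At inference the up-projections can be folded into the query and output weights so attention runs directly against the cached latent; that folding is exactly where the custom kernels come in.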

4

Long-context plan

Pretrain at 4-8k context, anneal at 32k-128k, then YaRN-extend to your target length. Or train long-context natively with Ring Attention.
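
A simplified sketch of YaRN-style frequency rescaling ('NTK-by-parts': interpolate the slow bands, leave the fast bands alone). Real YaRN also rescales attention temperature, and the thresholds below are the paper's LLaMA defaults:

    import math
    import torch

    def yarn_like_inv_freq(d_head=128, base=500_000.0, orig_len=8192,
                           target_len=131_072, alpha=1.0, beta=32.0):
        inv_freq = 1.0 / base ** (torch.arange(0, d_head, 2) / d_head)
        scale = target_len / orig_len
        # Rotations each frequency band completes inside the original window
        r = orig_len * inv_freq / (2 * math.pi)
        # gamma = 0: fully interpolate (divide freq by scale); gamma = 1: untouched
        gamma = ((r - alpha) / (beta - alpha)).clamp(0, 1)
        return inv_freq * ((1 - gamma) / scale + gamma)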

5

Multimodal fusion

Native (Chameleon-style) for new pretrains, late-fusion adapters (LLaVA-style) for retrofit. See Multimodal page.
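
A sketch of the late-fusion adapter, roughly the LLaVA-1.5 shape: a small MLP projects frozen vision-encoder patch features into the LM's embedding space, and the image tokens are simply prepended (dims here are assumptions):

    import torch, torch.nn as nn

    class VisionProjector(nn.Module):
        def __init__(self, d_vision=1024, d_model=4096):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(d_vision, d_model), nn.GELU(), nn.Linear(d_model, d_model))

        def forward(self, patch_feats, text_embeds):
            # patch_feats: (batch, n_patches, d_vision); text_embeds: (batch, t, d_model)
            image_tokens = self.proj(patch_feats)
            # The decoder attends to image tokens like ordinary text tokens
            return torch.cat([image_tokens, text_embeds], dim=1)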

⚠️

Common pitfalls

  • Don't 'innovate' on attention before scaling baselines — 95% of clever attention variants regress at scale.
  • MoE training is 2-3x more bug-prone than dense. Budget engineering time accordingly.
  • RoPE base frequency tuning matters: too low → context extension fails, too high → in-distribution loss bumps. See the wavelength check below.
  • Mamba/SSM models lose on tasks requiring exact recall (book retrieval, long-context QA). Prefer hybrid.
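
To make the RoPE-base pitfall concrete, check the slowest frequency band's wavelength against your context length (a quick diagnostic, assuming standard RoPE):

    import math

    def slowest_wavelength(base, d_head=128):
        # Wavelength of the lowest-frequency RoPE pair: 2*pi * base^((d-2)/d)
        return 2 * math.pi * base ** ((d_head - 2) / d_head)

    for base in (10_000, 500_000):
        print(f"base={base:>7,}: slowest band repeats every "
              f"{slowest_wavelength(base):,.0f} positions")
    # Wraps far inside your target context -> distant positions alias, extension fails.
    # Barely rotates at training length -> the band is underused in-distribution.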