Build a Frontier Model
🏛️
Step 3 of 11

Architecture

Transformer is the default. Everything else is a knob.

Decoder-only Transformer with RoPE positional encoding, RMSNorm, SwiGLU activations, and Grouped-Query Attention (GQA) is the 2026 default — call it the 'Llama-style stack.' On top, frontier models layer Mixture-of-Experts (DeepSeek V3, Llama 4, Qwen 3), Multi-head Latent Attention (DeepSeek's MLA), state-space hybrids (Jamba, Samba, Hymba), and dynamic-compute routing (MoD, MoR).
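
To make the knobs explicit, here is a minimal config sketch (hypothetical field names; the defaults are illustrative Llama-3-style values, not any model's exact settings):

    from dataclasses import dataclass

    @dataclass
    class StackConfig:
        # Base Llama-style stack -- every field is a knob
        d_model: int = 4096
        n_layers: int = 32
        n_heads: int = 32
        n_kv_heads: int = 8           # GQA: 8 KV heads shared across 32 query heads
        ffn_mult: float = 8 / 3       # SwiGLU expansion factor
        rope_base: float = 500_000.0  # RoPE theta (Llama-3 uses 500k)
        # Optional frontier-style layers on top:
        n_experts: int = 0            # 0 = dense; >0 = Mixture-of-Experts
        experts_per_token: int = 0
        kv_strategy: str = "gqa"      # "gqa" or "mla"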

Why it matters

  • Architecture choices compound: cut the KV cache by some factor (as MLA does) and inference is cheaper by that factor forever; back-of-envelope sketch after this list.
  • MoE delivers 5-10x effective parameter count at the same active compute — DeepSeek V3 is 671B total / 37B active.
  • RoPE + YaRN extension enabled the long-context era (128k → 1M tokens) without retraining from scratch.
  • SSM/hybrid architectures (Mamba-2, Jamba, Samba) trade perfect recall for linear-time long context.
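
Back-of-envelope for the cache claim, under assumed 70B-class dimensions (80 layers, 128-dim heads, 64 query heads; the MLA line models DeepSeek V2's ~576-dim cached latent, so treat the exact ratios as illustrative):

    def kv_bytes_per_token(n_layers, d_head, n_kv_heads, bytes_per_val=2):
        # K and V each store n_kv_heads * d_head values per layer (bf16 = 2 bytes)
        return n_layers * 2 * n_kv_heads * d_head * bytes_per_val

    mha = kv_bytes_per_token(80, 128, 64)  # full multi-head attention
    gqa = kv_bytes_per_token(80, 128, 8)   # grouped-query, 8 KV heads
    mla = 80 * 576 * 2                     # one ~576-dim latent per layer (bf16)
    print(f"MHA {mha / 2**20:.2f} MiB/token | GQA {gqa / 2**20:.2f} | MLA {mla / 2**20:.3f}")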

State of the art

2025-2026
  • MLA (Multi-head Latent Attention, DeepSeek V2/V3) compresses KV into a low-rank latent — drastically smaller cache than GQA.
  • Auxiliary-loss-free MoE load balancing (DeepSeek, 2024) replaced the noisy aux-loss trick.
  • Native Sparse Attention (NSA, Feb 2025) brought sparse attention into pretraining instead of bolting it on at inference.
  • Diffusion language models (LLaDA, Mercury) crossed the threshold to competitive quality at 1B+ scale in 2025.
  • Mixture-of-Recursions (MoR) and Mixture-of-Depths (MoD) let tokens 'opt out' of layers for variable per-token compute; see the sketch after this list.
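
A minimal sketch of the MoD idea, assuming PyTorch and simplified shapes: the router picks a fixed fraction of tokens for the full block and everyone else rides the residual stream unchanged (real MoD also needs a causal router predictor for autoregressive sampling, omitted here):

    import torch

    def mod_layer(x, block, router, capacity=0.125):
        # x: (batch, seq, d); block maps (batch, k, d) -> residual update;
        # router: nn.Linear(d, 1). Only the top-k tokens pay for the block.
        scores = router(x).squeeze(-1)                     # (batch, seq)
        k = max(1, int(capacity * x.shape[1]))
        top_scores, top_idx = scores.topk(k, dim=-1)
        idx = top_idx.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        routed = x.gather(1, idx)                          # (batch, k, d)
        # Weight the update by the router score so the router trains end-to-end
        update = block(routed) * torch.sigmoid(top_scores).unsqueeze(-1)
        return x.scatter_add(1, idx, update)               # unrouted tokens skip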

The recipe

A frontier-grade implementation, in order.

1

Start from the Llama-3 reference

Decoder-only, RoPE, RMSNorm pre-norm, SwiGLU FFN (8/3 expansion), GQA with 8 KV heads. Don't deviate without reason.
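
A compressed sketch of one such block in PyTorch (our shapes and names, training-time only: simplified RoPE, no KV cache, and real configs round the FFN width to a hardware-friendly multiple):

    import torch, torch.nn as nn, torch.nn.functional as F

    class RMSNorm(nn.Module):
        def __init__(self, d, eps=1e-5):
            super().__init__()
            self.g, self.eps = nn.Parameter(torch.ones(d)), eps
        def forward(self, x):
            return self.g * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def rope(x, base=500_000.0):
        # x: (batch, heads, seq, d_head); rotate dim pairs by position-scaled angles
        t, d = x.shape[-2], x.shape[-1]
        inv = 1.0 / base ** (torch.arange(0, d, 2, device=x.device) / d)
        ang = torch.outer(torch.arange(t, device=x.device, dtype=inv.dtype), inv)
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    class Block(nn.Module):
        def __init__(self, d=4096, n_heads=32, n_kv=8):
            super().__init__()
            self.nh, self.nkv, self.hd = n_heads, n_kv, d // n_heads
            self.wq = nn.Linear(d, d, bias=False)
            self.wk = nn.Linear(d, n_kv * self.hd, bias=False)
            self.wv = nn.Linear(d, n_kv * self.hd, bias=False)
            self.wo = nn.Linear(d, d, bias=False)
            hidden = int(8 / 3 * d)                       # SwiGLU expansion
            self.w1 = nn.Linear(d, hidden, bias=False)    # gate
            self.w3 = nn.Linear(d, hidden, bias=False)    # up
            self.w2 = nn.Linear(hidden, d, bias=False)    # down
            self.n1, self.n2 = RMSNorm(d), RMSNorm(d)

        def forward(self, x):                             # pre-norm residual block
            b, t, d = x.shape
            h = self.n1(x)
            q = rope(self.wq(h).view(b, t, self.nh, self.hd).transpose(1, 2))
            k = rope(self.wk(h).view(b, t, self.nkv, self.hd).transpose(1, 2))
            v = self.wv(h).view(b, t, self.nkv, self.hd).transpose(1, 2)
            # GQA: each KV head serves n_heads // n_kv query heads
            k = k.repeat_interleave(self.nh // self.nkv, dim=1)
            v = v.repeat_interleave(self.nh // self.nkv, dim=1)
            a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            x = x + self.wo(a.transpose(1, 2).reshape(b, t, d))
            g = self.n2(x)
            return x + self.w2(F.silu(self.w1(g)) * self.w3(g))  # SwiGLU FFN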

2

Decide MoE or dense

MoE if your compute budget supports >10B-parameter scale AND you have the inference infra to host the experts. Use fine-grained experts + shared experts (DeepSeek-MoE).
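
A sketch of the fine-grained + shared-experts pattern (expert counts and widths are illustrative, not DeepSeek's; the per-expert loop is written for clarity rather than speed, and load balancing, which DeepSeek now does aux-loss-free, is omitted):

    import torch, torch.nn as nn, torch.nn.functional as F

    class MoEFFN(nn.Module):
        def __init__(self, d=1024, n_experts=64, n_shared=2, top_k=6, hidden=512):
            super().__init__()
            make = lambda: nn.Sequential(
                nn.Linear(d, hidden), nn.SiLU(), nn.Linear(hidden, d))
            self.experts = nn.ModuleList(make() for _ in range(n_experts))
            self.shared = nn.ModuleList(make() for _ in range(n_shared))  # always on
            self.gate = nn.Linear(d, n_experts, bias=False)
            self.top_k = top_k

        def forward(self, x):                          # x: (tokens, d)
            out = sum(s(x) for s in self.shared)       # shared experts see every token
            w = F.softmax(self.gate(x), dim=-1)
            top_w, top_i = w.topk(self.top_k, dim=-1)  # route each token to k experts
            top_w = top_w / top_w.sum(-1, keepdim=True)
            routed = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                hit = top_i == e                       # (tokens, top_k) bool
                rows = hit.any(-1)
                if rows.any():
                    coef = (top_w * hit).sum(-1)[rows].unsqueeze(-1)
                    routed[rows] += coef * expert(x[rows])
            return out + routed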

3

Pick KV strategy

GQA (8:1 query-to-KV-head ratio) is safe. MLA cuts the KV cache by ~10x but is harder to train and requires custom kernels.
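
A stripped-down sketch of the MLA idea: cache one low-rank latent per token and re-expand to K and V at attention time. This omits DeepSeek's decoupled RoPE path and query-side compression, both of which the real design needs:

    import torch.nn as nn

    class MLAProjection(nn.Module):
        # Cache a d_latent vector per token instead of full K and V tensors.
        def __init__(self, d=4096, n_heads=32, d_head=128, d_latent=512):
            super().__init__()
            self.down = nn.Linear(d, d_latent, bias=False)  # compress once
            self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
            self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

        def forward(self, h):
            latent = self.down(h)  # (batch, seq, d_latent) -- this is what's cached
            return latent, self.up_k(latent), self.up_v(latent)

At inference the up-projections can be folded into the query and output weights so attention runs directly against the cached latent; that folding is exactly where the custom kernels come in.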

4

Long-context plan

Pretrain at 4-8k context, anneal at 32k-128k, then YaRN-extend to your target length. Or train long-context natively with Ring Attention.
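
A simplified sketch of YaRN-style frequency rescaling ('NTK-by-parts': interpolate the slow bands, leave the fast bands alone). Real YaRN also rescales attention temperature, and the thresholds below are the paper's LLaMA defaults:

    import math
    import torch

    def yarn_like_inv_freq(d_head=128, base=500_000.0, orig_len=8192,
                           target_len=131_072, alpha=1.0, beta=32.0):
        inv_freq = 1.0 / base ** (torch.arange(0, d_head, 2) / d_head)
        scale = target_len / orig_len
        # Rotations each frequency band completes inside the original window
        r = orig_len * inv_freq / (2 * math.pi)
        # gamma = 0: fully interpolate (divide freq by scale); gamma = 1: untouched
        gamma = ((r - alpha) / (beta - alpha)).clamp(0, 1)
        return inv_freq * ((1 - gamma) / scale + gamma)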

5

Multimodal fusion

Native (Chameleon-style) for new pretrains, late-fusion adapters (LLaVA-style) for retrofit. See Multimodal page.
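
A sketch of the late-fusion adapter, roughly the LLaVA-1.5 shape: a small MLP projects frozen vision-encoder patch features into the LM's embedding space, and the image tokens are simply prepended (dims here are assumptions):

    import torch, torch.nn as nn

    class VisionProjector(nn.Module):
        def __init__(self, d_vision=1024, d_model=4096):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(d_vision, d_model), nn.GELU(), nn.Linear(d_model, d_model))

        def forward(self, patch_feats, text_embeds):
            # patch_feats: (batch, n_patches, d_vision); text_embeds: (batch, t, d_model)
            image_tokens = self.proj(patch_feats)
            # The decoder attends to image tokens like ordinary text tokens
            return torch.cat([image_tokens, text_embeds], dim=1)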

⚠️

Common pitfalls

  • Don't 'innovate' on attention before scaling baselines — 95% of clever attention variants regress at scale.
  • MoE training is 2-3x more bug-prone than dense. Budget engineering time accordingly.
  • RoPE base frequency tuning matters: too low → context extension fails, too high → in-distribution loss bumps. See the wavelength check below.
  • Mamba/SSM models lose on tasks requiring exact recall (book retrieval, long-context QA). Prefer hybrid.
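
To make the RoPE-base pitfall concrete, check the slowest frequency band's wavelength against your context length (a quick diagnostic, assuming standard RoPE):

    import math

    def slowest_wavelength(base, d_head=128):
        # Wavelength of the lowest-frequency RoPE pair: 2*pi * base^((d-2)/d)
        return 2 * math.pi * base ** ((d_head - 2) / d_head)

    for base in (10_000, 500_000):
        print(f"base={base:>7,}: slowest band repeats every "
              f"{slowest_wavelength(base):,.0f} positions")
    # Wraps far inside your target context -> distant positions alias, extension fails.
    # Barely rotates at training length -> the band is underused in-distribution.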