Transformer is the default. Everything else is a knob.
Decoder-only Transformer with RoPE positional encoding, RMSNorm, SwiGLU activations, and Grouped-Query Attention (GQA) is the 2026 default — call it the 'Llama-style stack.' On top, frontier models layer Mixture-of-Experts (DeepSeek V3, Llama 4, Qwen 3), Multi-head Latent Attention (DeepSeek's MLA), state-space hybrids (Jamba, Samba, Hymba), and dynamic-compute routing (MoD, MoR).
A frontier-grade implementation, in order.
Base architecture: decoder-only, RoPE, RMSNorm pre-norm, SwiGLU FFN (8/3 expansion), GQA with 8 KV heads. Don't deviate without reason.
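A minimal single-block sketch of that stack in PyTorch. Module names, toy dimensions, and the smoke test at the end are illustrative assumptions, not taken from any particular model's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

def rope(x, theta=10000.0):
    # x: (batch, heads, seq, head_dim); rotate dim pairs by position-dependent angles.
    b, h, t, d = x.shape
    freqs = 1.0 / (theta ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = torch.arange(t, device=x.device).float()[:, None] * freqs[None, :]  # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class LlamaStyleBlock(nn.Module):
    """Pre-norm decoder block: RMSNorm -> GQA attention with RoPE -> RMSNorm -> SwiGLU FFN."""
    def __init__(self, dim=4096, n_heads=32, n_kv_heads=8):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        hidden = int(8 * dim / 3)               # the "8/3" SwiGLU expansion
        hidden = 256 * ((hidden + 255) // 256)  # round up to a hardware-friendly multiple
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)
        q = self.wq(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)
        rep = self.n_heads // self.n_kv_heads   # each KV head is shared by `rep` query heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(o.transpose(1, 2).reshape(b, t, -1))
        h = self.ffn_norm(x)
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))

block = LlamaStyleBlock(dim=512, n_heads=8, n_kv_heads=2)  # toy sizes for a smoke test
print(block(torch.randn(1, 16, 512)).shape)                # torch.Size([1, 16, 512])
```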
Mixture-of-Experts: only if you have a >10B compute budget AND the inference infra to host experts. Use fine-grained experts plus shared experts (DeepSeek-MoE).
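A toy sketch of the fine-grained + shared-experts idea. Class names, expert counts, and the top-k value are illustrative placeholders; a production layer would add an auxiliary load-balancing loss and fused expert kernels rather than this per-expert loop:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One small SwiGLU FFN, used as a fine-grained expert."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)
    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class FineGrainedMoE(nn.Module):
    """DeepSeek-MoE-flavoured layer: many small routed experts plus always-on shared experts."""
    def __init__(self, dim=512, n_routed=16, n_shared=2, top_k=4, expert_hidden=256):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_routed, bias=False)
        self.routed = nn.ModuleList(SwiGLUExpert(dim, expert_hidden) for _ in range(n_routed))
        self.shared = nn.ModuleList(SwiGLUExpert(dim, expert_hidden) for _ in range(n_shared))

    def forward(self, x):                                     # x: (tokens, dim)
        scores = self.router(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)        # per-token top-k routed experts
        weights = weights / weights.sum(dim=-1, keepdim=True) # renormalise the kept gates
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.routed):              # loop for clarity, not speed
            mask = (idx == e)
            token_ids = mask.any(dim=-1).nonzero(as_tuple=True)[0]
            if token_ids.numel() == 0:
                continue
            gate = (weights * mask)[token_ids].sum(dim=-1, keepdim=True)
            out[token_ids] += gate * expert(x[token_ids])
        for expert in self.shared:                            # shared experts see every token
            out = out + expert(x)
        return out

moe = FineGrainedMoE()
print(moe(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```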
Attention: GQA (8:1 query-to-KV-head ratio) is safe. MLA cuts the KV cache by ~10x but is harder to train and requires custom kernels.
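Back-of-the-envelope KV-cache arithmetic showing why this matters. The layer/head shapes are illustrative, the MLA latent width (512 + 64 decoupled-RoPE dims) is assumed from the DeepSeek-V2 configuration, and the exact ratios depend on the model's actual shapes:

```python
# KV-cache footprint per token in bf16 (2 bytes per element).
n_layers, n_heads, head_dim, bytes_per = 80, 64, 128, 2

mha = n_layers * 2 * n_heads * head_dim * bytes_per   # every head caches K and V
gqa8 = n_layers * 2 * 8 * head_dim * bytes_per        # 8 shared KV heads
mla = n_layers * (512 + 64) * bytes_per               # one compressed latent per token

for name, b in [("MHA", mha), ("GQA-8", gqa8), ("MLA", mla)]:
    print(f"{name:6s} {b / 1024:7.1f} KiB/token   {b * 128_000 / 1024**3:6.1f} GiB at 128k ctx")
```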
Context length: pretrain at 4-8k, anneal at 32k-128k, then YaRN-extend to your target. Or natively train long with Ring Attention.
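A simplified sketch of the frequency rescaling behind RoPE context extension. This is the NTK-aware base rescaling that YaRN builds on, not full YaRN, and the head dimension and scale factor are illustrative:

```python
import torch

def rope_freqs(head_dim, base=10000.0):
    # Per-pair rotary frequencies for a single attention head.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def ntk_scaled_freqs(head_dim, scale, base=10000.0):
    # Stretch the RoPE base so low-frequency bands cover `scale`-times more positions.
    # Full YaRN adds a per-band interpolation ramp and an attention-temperature
    # correction, both omitted here.
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_freqs(head_dim, new_base)

orig = rope_freqs(128)
extended = ntk_scaled_freqs(128, scale=32.0)   # e.g. 4k-trained model -> 128k target
print(orig[-1] / extended[-1])                 # lowest band slowed by ~the scale factor (~32x)
```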
Multimodality: native early fusion (Chameleon-style) for new pretrains, late-fusion adapters (LLaVA-style) for retrofits. See the Multimodal page.
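For the retrofit path, a sketch of a LLaVA-style late-fusion adapter: a small MLP that projects frozen vision-encoder patch embeddings into the LLM's token-embedding space. The dimensions, class name, and splice position are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-encoder patch embeddings into the LLM embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
    def forward(self, patch_embeds):      # (batch, n_patches, vision_dim)
        return self.proj(patch_embeds)    # (batch, n_patches, llm_dim)

# Splice projected image tokens into the embedded text sequence before the decoder stack.
proj = VisionProjector()
image_tokens = proj(torch.randn(1, 576, 1024))   # e.g. a 24x24 grid of ViT patches
text_tokens = torch.randn(1, 32, 4096)           # embedded prompt tokens
sequence = torch.cat([text_tokens[:, :16], image_tokens, text_tokens[:, 16:]], dim=1)
print(sequence.shape)  # torch.Size([1, 608, 4096])
```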
In rough order of foundational → recent. Click any title to open the arXiv abstract.
Vaswani et al. · 2017
Su et al. · 2021
Zhang & Sennrich · 2019
Shazeer · 2020
Ainslie et al. · 2023
Peng et al. · 2023
DeepSeek-AI · 2024
DeepSeek-AI · 2024
Fedus, Zoph, Shazeer · 2021
Mistral AI · 2024
Gu, Dao · 2023
Dao, Gu · 2024
AI21 · 2024
Microsoft · 2024
Raposo et al. (DeepMind) · 2024
DeepSeek-AI · 2025
Nie et al. · 2025