The papers that shaped modern AI, from foundational work such as the 2017 Transformer to the 2025-2026 breakthroughs in reasoning, reinforcement learning with verifiable rewards (RLVR), and agents.
The landmark papers that established the field. Essential reading for understanding modern AI.
Vaswani et al. (Google) • 2017
120,000+ citations
Introduced the Transformer architecture, replacing recurrence and convolutions with self-attention. The foundation of modern LLMs including GPT, Claude, and Gemini.
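As a rough illustration of the self-attention idea, here is a minimal single-head sketch in plain Python. It assumes the queries, keys, and values are already given as lists of vectors; the learned projections, multiple heads, masking, and positional encodings of the full architecture are omitted, and the function names are illustrative only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Each argument is a list of vectors, one per token.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this token's query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy example: three 2-dimensional token vectors attending to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Every token looks at every other token in a single step, which is what removes the need for recurrence.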
Devlin et al. (Google) • 2018
90,000+ citations
Pioneered bidirectional pre-training for language understanding. Introduced masked language modeling and established the pre-train/fine-tune paradigm.
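Masked language modeling can be sketched in a few lines. The version below is a simplification: it always swaps a selected token for `[MASK]`, whereas BERT masks about 15% of tokens and sometimes substitutes a random token or leaves the original in place. The function name and the `seed` parameter are assumptions made for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must recover it from both sides
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

sentence = ["the", "cat", "sat", "on", "the", "mat"]
masked, labels = mask_tokens(sentence, mask_prob=0.5)
```

Because the model sees context on both sides of each masked position, the objective is bidirectional by construction.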
Brown et al. (OpenAI) • 2020
35,000+ citations
Demonstrated that scaling language models to 175B parameters enables few-shot learning without fine-tuning. Sparked the current LLM era.
Ouyang et al. (OpenAI) • 2022
8,000+ citations
Applied RLHF (Reinforcement Learning from Human Feedback) at scale to align language models with human intent, building on earlier preference-learning work. Foundation of ChatGPT's helpfulness.
He et al. (Microsoft) • 2015
180,000+ citations
Introduced residual connections enabling training of very deep networks. Won ImageNet 2015 and revolutionized deep learning architecture design.
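A residual connection simply adds a block's input back to its output. The sketch below uses plain Python lists and hypothetical stand-in layers (`zero_layer`, `halve`) rather than real convolutions, to show why stacking many such blocks is easy to train: a block that learns nothing still passes its input through unchanged.

```python
def residual_block(x, layer):
    """Apply a residual connection: output = x + layer(x)."""
    return [xi + fi for xi, fi in zip(x, layer(x))]

def zero_layer(x):
    """Stand-in for a learned layer whose weights are all zero."""
    return [0.0 for _ in x]

def halve(x):
    """Stand-in for a learned layer that scales its input by 0.5."""
    return [0.5 * xi for xi in x]

x = [1.0, -2.0, 3.0]

# Fifty stacked blocks that contribute nothing still preserve the signal.
deep = x
for _ in range(50):
    deep = residual_block(deep, zero_layer)
```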
Goodfellow et al. • 2014
65,000+ citations
Introduced the adversarial training paradigm with generator and discriminator networks. Pioneered modern generative AI.
The cutting-edge research pushing AI capabilities forward. These papers represent the current frontier.
Tian et al. • 2024
New image generation approach predicting images from coarse to fine resolutions. Outperforms diffusion transformers with LLM-like scaling.
Meta AI • 2024
405B dense Transformer with quality comparable to GPT-4 on many benchmarks. Uses grouped-query attention and a 128K context window. Open weights accelerated the field.
Gu & Dao • 2023
Proposed alternative to Transformers using selective state spaces. Achieves linear-time complexity while matching Transformer quality.
Dao • 2023
Optimized GPU implementation of attention achieving 2x speedup over FlashAttention. Enables longer sequences and faster training.
Bai et al. (Anthropic) • 2022
Training AI to follow principles without extensive human labeling. Self-critique based on defined constitution reduces harmful outputs.
Lewis et al. (Meta) • 2020
Combined retrieval with generation to ground LLM responses in external knowledge. Reduces hallucinations and enables access to current information.
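The retrieve-then-generate pattern can be sketched as follows. This is a toy under stated assumptions: retrieval here is naive word overlap rather than the learned dense retriever used in the paper, and the final call to a language model is left out, so the code only builds the grounded prompt. All function names are hypothetical.

```python
def score(query, doc):
    """Count the distinct lowercase words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query, docs, k=2):
    """Return the k documents with the highest overlap score."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query, docs, k=2):
    """Prepend retrieved passages so the generator can ground its answer."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "The Transformer was introduced in 2017",
    "ResNet introduced residual connections",
    "Diffusion models generate images",
]
prompt = build_prompt("who introduced the transformer", corpus, k=1)
```

Swapping the corpus for a fresher one changes the answers without retraining the generator, which is how retrieval gives access to current information.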
Ho et al. (UC Berkeley) • 2020
Showed that diffusion models can generate high-quality images, paving the way for their later state-of-the-art results. Foundation for DALL-E, Stable Diffusion, and Midjourney.
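The forward (noising) half of a diffusion model has a closed form, which the sketch below illustrates in plain Python. It assumes a linear beta schedule from 1e-4 to 0.02 over 1,000 steps; the helper names are illustrative, and the learned denoising network that reverses the process is not shown.

```python
import math
import random

def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        bars.append(running)
    return bars

def noise_sample(x0, t, bars, rng):
    """Sample x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) in one step."""
    a = bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]

bars = alpha_bars()
rng = random.Random(0)
slightly_noisy = noise_sample([1.0, -1.0], 10, bars, rng)   # still close to x0
almost_noise = noise_sample([1.0, -1.0], 999, bars, rng)    # nearly pure Gaussian noise
```

Training then amounts to teaching a network to predict the noise that was added at a randomly chosen step.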
Darcet et al. (Meta) • 2024
Discovered issues with high-norm tokens in ViT feature maps. Adding register tokens significantly improves performance across vision tasks.
Munkhdalai et al. (Google) • 2024
Method to scale Transformers to arbitrarily long inputs with bounded memory and compute. Combines local attention with a compressive long-term memory.
DeepSeek-AI • 2025
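DeepSeek-R1-Zero showed that pure RL on verifiable rewards (math + code) elicits emergent long chain-of-thought without any SFT; the released DeepSeek-R1 adds a small cold-start SFT stage. Open-weight model rivaling OpenAI o1. The widely cited ~$5-6M training cost refers to the DeepSeek-V3 base model it builds on. Triggered the 'DeepSeek shock' market moment.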
DeepSeek-AI • 2024
First frontier-scale model trained with FP8 mixed precision. 671B total / 37B active MoE with Multi-head Latent Attention (MLA), DualPipe parallelism, and Multi-Token Prediction. Recipe widely adopted in 2025.
Pagnoni et al. (Meta) • 2024
Tokenizer-free architecture using dynamic byte patching that matches BPE-based Llama at 8B scale. The most interesting tokenization paper of 2024.
Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao • 2024
Attention kernel for Hopper GPUs that exploits asynchrony and FP8. 1.5-2.0x speedup over FlashAttention-2 in 16-bit precision on H100, and 2.6x lower numerical error than baseline FP8 attention when running in FP8. Widely used across the 2025-2026 ecosystem.
Microsoft Research • 2025
A 7B model + process reward model + MCTS rivals OpenAI o1-preview on MATH and AIME. Showed test-time compute can partly substitute for pretraining scale.
Muennighoff et al. (Stanford) • 2025
1,000 carefully selected reasoning traces + simple budget control via injected 'Wait' tokens reproduces o1-style performance. Showed reasoning is data-efficient if data is high-quality.
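The budget-control idea can be sketched as a decoding loop. This is a simplified illustration, not the authors' implementation: `next_token` is a hypothetical callable standing in for a language model, `END_THINK` is an assumed end-of-thinking marker, and the budget is counted in tokens.

```python
END_THINK = "</think>"  # assumed end-of-thinking marker, for illustration only

def budget_forced_decode(next_token, min_tokens, max_tokens):
    """Decode a reasoning trace kept between min_tokens and max_tokens long."""
    trace = []
    while len(trace) < max_tokens:
        tok = next_token(trace)
        if tok == END_THINK:
            if len(trace) >= min_tokens:
                break
            # The model tried to stop too early: suppress the stop and
            # inject "Wait" so it keeps reasoning.
            tok = "Wait"
        trace.append(tok)
    # Either the model stopped on its own or the budget cap forces the end.
    trace.append(END_THINK)
    return trace

def eager_model(trace):
    """Toy model that tries to stop after every two reasoning steps."""
    return END_THINK if len(trace) % 3 == 2 else "step"

result = budget_forced_decode(eager_model, min_tokens=5, max_tokens=10)
# result == ["step", "step", "Wait", "step", "step", "</think>"]
```

Raising `min_tokens` spends more inference compute on the same question, which is the knob that test-time scaling turns.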
Dao, Gu • 2024
Unified theory of Transformers and State-Space Models via Structured State Space Duality. Faster training, hardware-aware kernels.
DeepSeek-AI • 2025
Hardware-aligned sparse attention designed into pretraining. Cuts long-context attention cost without quality regression — bringing sparse attention from inference hack to first-class training method.
Center for AI Safety + Scale AI • 2025
Multidisciplinary expert-level benchmark designed to remain unsaturated even as frontier models improve. Even GPT-5 / Claude Opus 4.x score under 30%.
The major directions of current AI research and their significance.
Train on math, code, and agentic tasks where correctness is mechanically checkable.
DeepSeek-R1 (2501.12948) • Kimi k1.5 (2501.12599) • rStar-Math (2501.04519) • Tülu 3 (2411.15124)
The defining post-training paradigm of 2025-2026
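A "mechanically checkable" reward can be as simple as comparing a model's final answer with a known reference. The sketch below assumes the answer is the last number in the completion, which is a deliberately naive extraction rule; real pipelines use stricter answer formats and, for code, unit tests. The function names are hypothetical.

```python
import re

def extract_final_answer(completion):
    """Return the last number that appears in the completion, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None

def verifiable_reward(completion, reference):
    """Return 1.0 if the extracted answer equals the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if abs(float(answer) - float(reference)) < 1e-9 else 0.0

samples = [
    "2 + 2 = 4, so the answer is 4",
    "I think the answer is 5",
    "I am not sure",
]
rewards = [verifiable_reward(s, "4") for s in samples]
```

Because the reward comes from a checker rather than a learned preference model, it is much harder for the policy to game.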
Variable inference budget — long chain-of-thought, search, and verification at decode time.
Snell et al. (2408.03314) • s1 (2501.19393) • rStar-Math • Process Reward Models (2305.20050)
Transforms inference from one-shot to deliberation
Models that plan, call tools, and act in real environments — code, browser, desktop.
ReAct (2210.03629) • SWE-bench (2310.06770) • OSWorld (2404.07972) • Toolformer (2302.04761)
Crossing the usability threshold; the Model Context Protocol (MCP) became the de facto integration standard
Cutting compute and memory at training and inference.
Mamba-2 (2405.21060) • MLA (DeepSeek-V2) • Mixture-of-Depths (2404.02258) • Native Sparse Attention (2502.11089)
MoE + MLA + sparse attention is the new default
Native any-to-any models and generative world simulators.
Chameleon (2405.09818) • SigLIP 2 (2502.14786) • Cosmos (2501.03575) • Movie Gen (2410.13720)
Unified text+image+audio+video; spatial intelligence emerging
Constitutional AI, deliberative alignment, and constitutional classifiers as deployed safety filters.
Constitutional Classifiers (2501.18837) • Deliberative Alignment (2412.16339) • Circuit Breakers (2406.04313)
Responsible Scaling Policies (RSPs) and pre-deployment AI Safety Institute (AISI) evals increasingly gate frontier releases
Sparse Autoencoders to find interpretable features inside frontier models.
Scaling Monosemanticity (Anthropic, 2024) • Gemma Scope (2408.05147) • On the Biology of an LLM (Anthropic, Mar 2025)
First production interpretability tools; Anthropic aims for interpretability that can reliably detect most model problems by 2027
Quality filtering, decontamination, and synthetic data are now ML problems.
FineWeb (2406.17557) • DCLM (2406.11794) • Nemotron-CC (2412.02595) • Persona-Hub (2406.20094)
Data classifiers + LLM-rephrased web are standard