Essential AI Research

The papers that shaped modern AI, from the 2017 Transformer to the 2025-2026 breakthroughs in reasoning, RLVR, and agents.


Foundational Papers

The landmark papers that established the field. Essential reading for understanding modern AI.

Vaswani et al. (Google), 2017

Citations: 120,000+

Introduced the Transformer architecture, eliminating recurrence and convolutions in favor of self-attention. The foundation of all modern LLMs including GPT, Claude, and Gemini.

Key Contributions

Self-attention mechanism for sequence modeling • Multi-head attention for parallel processing • Positional encoding for sequence order • Encoder-decoder architecture
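As a rough illustration of the self-attention mechanism listed above, the following minimal sketch computes single-head scaled dot-product attention in plain Python. The function names and the list-of-lists representation are assumptions made for readability; real implementations use batched tensor operations, learned query/key/value projections, and multiple heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Each argument is a list of vectors (lists of floats). Every query
    attends to every key, and the output for a query is the
    attention-weighted average of the value vectors.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to each key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

When all keys are identical the weights are uniform, so the output is simply the mean of the value vectors; a key that matches the query strongly dominates the average instead.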

Devlin et al. (Google), 2018

Citations: 90,000+

Pioneered bidirectional pre-training for language understanding. Introduced masked language modeling and established the pre-train/fine-tune paradigm.

Key Contributions

Bidirectional context modeling • Masked language modeling (MLM) • Next sentence prediction • Transfer learning for NLP
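The masked language modeling objective can be sketched as follows. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token follow the BERT paper; the function name, the toy vocabulary argument, and the use of `None` for positions that are not scored are illustrative assumptions.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must
    predict, and None at positions that do not contribute to the loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model is trained to recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")           # 80%: replace with mask symbol
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with random token
            else:
                inputs.append(tok)                # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels
```

Because the model sees tokens on both sides of each masked position, predicting the label requires bidirectional context, which is the point of the objective.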

Brown et al. (OpenAI), 2020

Citations: 35,000+

Demonstrated that scaling language models to 175B parameters enables few-shot learning without fine-tuning. Sparked the current LLM era.

Key Contributions

In-context learning without gradient updates • Emergent abilities from scale • Zero/few-shot task performance • Demonstrated path to general-purpose AI

Ouyang et al. (OpenAI), 2022

Citations: 8,000+

Applied RLHF (Reinforcement Learning from Human Feedback) at scale to align large language models with user intent, producing InstructGPT. Foundation of ChatGPT's helpfulness.

Key Contributions

RLHF methodology for alignment • Human preference data collection • Reward modeling from comparisons • Reduced harmful outputs significantly
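Reward modeling from comparisons is usually trained with a pairwise logistic loss: the reward model should score the human-preferred response above the rejected one. The sketch below assumes the two scalar rewards have already been produced by some reward model; the function name is hypothetical.

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Loss -log(sigmoid(reward_chosen - reward_rejected)).

    Small when the preferred response scores higher, large when the
    ranking is inverted. Written as a stable softplus to avoid overflow.
    """
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

For example, equal rewards give a loss of log 2, and the loss falls toward zero as the margin in favor of the chosen response grows. The trained reward model then supplies the signal for the reinforcement learning stage.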

He et al. (Microsoft), 2015

Citations: 180,000+

Introduced residual connections enabling training of very deep networks. Won the ILSVRC 2015 classification challenge and revolutionized deep learning architecture design.

Key Contributions

Skip connections for gradient flow • Enabled 100+ layer networks • Addressed the degradation problem in very deep networks • Influenced Transformer design
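The idea of a skip connection fits in a few lines: the block outputs its input plus a learned correction, y = x + F(x). In this minimal sketch `transform` stands in for the learned layers F, and both names are illustrative assumptions.

```python
def residual_block(x, transform):
    """Return x + transform(x), element-wise, for a vector x."""
    fx = transform(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def stack_blocks(x, transforms):
    """Apply a sequence of residual blocks, one per transform."""
    for transform in transforms:
        x = residual_block(x, transform)
    return x
```

If every transform outputs zeros, a stack of any depth reduces to the identity, so adding layers cannot make the representable function worse than a shallower network. That property is what makes very deep stacks easier to optimize.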

Goodfellow et al., 2014

Citations: 65,000+

Introduced the adversarial training paradigm with generator and discriminator networks. Pioneered modern generative AI.

Key Contributions

Adversarial training framework • Generator/discriminator architecture • Implicit density modeling • Foundation for image generation

Recent Breakthroughs (2020-2026)

The cutting-edge research pushing AI capabilities forward. These papers represent the current frontier.

Tian et al., 2024

New image generation approach predicting images from coarse to fine resolutions. Outperforms diffusion transformers with LLM-like scaling.

NeurIPS 2024 Best Paper
Multi-scale autoregressive generation • Superior to diffusion for visual tasks • Scaling properties similar to LLMs

Meta AI, 2024

405B dense Transformer matching GPT-4 capabilities. Implements grouped query attention and 128K context. Open weights accelerated the field.

Open-weight frontier model • Grouped query attention (GQA) • 128K token context support • Competitive with closed models
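Grouped query attention shares each key/value head across a group of query heads, which shrinks the KV cache that must be held in memory at long context lengths. The sketch below shows the head mapping and a back-of-the-envelope cache size; the function names and the example head counts are illustrative assumptions, not the configuration of any particular model.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the shared key/value head used by a given query head."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV-cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
```

With 8 query heads and 2 KV heads, query heads 0-3 read KV head 0 and heads 4-7 read KV head 1. Cutting the number of KV heads by a factor of 4 cuts the cache by the same factor, which matters most at 128K-token contexts.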

Gu & Dao, 2023

Proposed alternative to Transformers using selective state spaces. Achieves linear-time complexity while matching Transformer quality.

Linear vs quadratic attention complexity • Selective state space mechanism • Hardware-efficient implementation • Viable Transformer alternative
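The linear-time claim comes from replacing all-pairs attention with a recurrence that carries a fixed-size state across the sequence. The toy scan below is a deliberately simplified, scalar-state sketch of that idea and not the parameterization used in the paper: `gate` is a hypothetical function that makes the amount of retained state depend on the current input, which is the "selective" part.

```python
import math


def selective_scan(xs, gate):
    """One pass over the sequence with an input-dependent state update.

    h_t = a_t * h_{t-1} + (1 - a_t) * x_t, where a_t = gate(x_t) in [0, 1].
    Cost is O(len(xs)) with O(1) state, versus O(len(xs)^2) for attention.
    """
    h = 0.0
    ys = []
    for x in xs:
        a = gate(x)                  # how much past state to keep
        h = a * h + (1.0 - a) * x    # blend old state with the new input
        ys.append(h)
    return ys


def sigmoid_gate(x):
    """Example gate: larger inputs overwrite more of the state."""
    return 1.0 / (1.0 + math.exp(x))
```

A gate fixed at 0 copies the input through unchanged, and a gate fixed at 1 ignores it entirely; letting the gate vary with the input allows the model to choose what to remember at each step.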

Dao, 2023

Optimized GPU implementation of attention achieving 2x speedup over FlashAttention. Enables longer sequences and faster training.

IO-aware attention algorithm • Better GPU memory utilization • Enables 16K+ context efficiently • Widely adopted in production

Bai et al. (Anthropic), 2022

Training AI to follow principles without extensive human labeling. Self-critique based on defined constitution reduces harmful outputs.

Principle-based self-improvement • Reduced human annotation needs • Scalable alignment approach • Foundation of Claude models

Lewis et al. (Meta), 2020

Combined retrieval with generation to ground LLM responses in external knowledge. Reduces hallucinations and enables access to current information.

Retrieval + generation architecture • Grounded factual responses • Reduced hallucination rates • Industry-standard technique
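The retrieve-then-generate pattern can be sketched with a toy retriever. This example ranks documents by bag-of-words cosine similarity and pastes the top results into a prompt; it is an assumption-laden stand-in for the learned dense retriever and generator used in practice, and every function name here is hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Assemble a grounded prompt from the retrieved passages."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

The prompt returned by `build_prompt` would then be passed to a language model. Grounding the answer in retrieved text is what lets the system cite current information that was not in the training data.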

Ho et al. (UC Berkeley), 2020

Showed that diffusion models can synthesize images with quality competitive with GANs. Foundation for DALL-E 2, Stable Diffusion, and Midjourney.

Diffusion-based generation • Gradual denoising process • High-quality image synthesis • Enabled text-to-image models
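The denoising process is trained against a fixed forward process that gradually adds Gaussian noise. A minimal sketch of that forward step, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, is shown below using the linear noise schedule from the paper (β from 1e-4 to 0.02 over 1000 steps). The function names and the flat-list representation of an image are assumptions for illustration.

```python
import math
import random


def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    Requires num_steps >= 2.
    """
    bars, product = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        product *= 1.0 - beta
        bars.append(product)
    return bars


def q_sample(x0, t, bars, rng):
    """Jump straight to noise level t without simulating earlier steps.

    Returns the noised sample and the noise that was added; the network
    is trained to predict that noise from the noised sample and t.
    """
    signal = math.sqrt(bars[t])
    sigma = math.sqrt(1.0 - bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * e for x, e in zip(x0, noise)]
    return xt, noise
```

Early steps leave the sample almost untouched, while by the final step the cumulative signal coefficient is close to zero and the sample is nearly pure noise. Generation runs the learned reverse of this process.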

Darcet et al. (Meta), 2024

Discovered issues with high-norm tokens in ViT feature maps. Adding register tokens significantly improves performance across vision tasks.

ICLR 2024 Outstanding Paper
Identified artifact tokens problem • Simple fix via register tokens • Improved downstream performance

Munkhdalai et al. (Google), 2024

Method to scale Transformers to infinitely long inputs with limited compute. Combines local and compressive memory.

Infinite context handling • Compressive memory mechanism • Bounded compute regardless of length

DeepSeek-AI, 2025

Showed that pure RL on verifiable rewards (math + code) elicits emergent long chain-of-thought without any SFT (the R1-Zero variant). Open-weight model rivaling OpenAI o1. The widely quoted ~$5-6M training cost refers to the final training run of its DeepSeek-V3 base model. Triggered the 'DeepSeek shock' market moment.

Most-cited paper of Q1 2025
RL-only emergence of reasoning (R1-Zero) • GRPO algorithm for stable on-policy RL • Distillation traces unlock smaller dense models • Open weights for the frontier
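The core of GRPO is that it needs no learned value function: several completions are sampled for the same prompt, and each one's advantage is its reward relative to the group. A minimal sketch of that normalization follows; the function name, the use of the population standard deviation, and the `eps` constant are assumptions, and the clipped policy-gradient update that consumes these advantages is omitted.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    advantage_i = (reward_i - mean(rewards)) / (std(rewards) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With binary correctness rewards such as `[1, 0, 0, 1]`, correct completions receive an advantage near +1 and incorrect ones near -1. If every completion in the group gets the same reward, all advantages are zero and the prompt contributes no gradient.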

DeepSeek-AI, 2024

First frontier-scale model trained primarily in FP8. 671B total / 37B active MoE with Multi-head Latent Attention (MLA), DualPipe parallelism, and Multi-Token Prediction. Recipe widely adopted in 2025.

FP8 training at frontier scale • MLA for ~10x KV-cache compression • DualPipe + EP for MoE efficiency • MTP auxiliary objective

Pagnoni et al. (Meta), 2024

Tokenizer-free architecture using dynamic byte patching that matches BPE-based Llama at 8B scale. The most interesting tokenization paper of 2024.

Eliminates tokenizer entirely • Dynamic patching by entropy • Better robustness on noisy/multilingual text

Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao, 2024

Hopper-async-aware attention kernel with FP8 support. 1.5-2.0x speedup over FlashAttention-2 in FP16, with FP8 reaching close to 1.2 PFLOPs/s and 2.6x lower numerical error than baseline FP8 attention. The universal attention kernel across the 2025-2026 ecosystem.

Hopper async tensor cores • FP8 attention with safe scaling • Production-grade quality preservation

Microsoft Research, 2025

A 7B model + Process Reward Model + MCTS matches OpenAI o1-preview on AIME. Showed test-time compute can substitute for pretraining scale.

MCTS over reasoning steps • Process Reward Models at scale • Self-evolution loop

Muennighoff et al. (Stanford), 2025

1,000 carefully selected reasoning traces + simple budget control via injected 'Wait' tokens reproduces o1-style performance. Showed reasoning is data-efficient if data is high-quality.

1k traces sufficient for SFT cold-start • Budget forcing via 'Wait' tokens • Open recipe for reasoning models
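Budget forcing controls how long the model reasons: if it tries to stop before a minimum budget, the end-of-thinking marker is suppressed and 'Wait' is appended so it keeps going; if it reaches the maximum budget, thinking is ended for it. The sketch below is one possible way to express that loop. The `generate` callable, its signature, the `<end>` marker, and the `max_waits` limit are all hypothetical and stand in for a real decoding API.

```python
def budget_forcing(generate, prompt, min_tokens, max_tokens,
                   end_marker="<end>", max_waits=2):
    """Return a reasoning trace whose length is steered toward a budget.

    generate(prompt, trace, max_new) must return a list of at most
    max_new tokens, ending with end_marker if the model chose to stop.
    """
    trace = []
    waits = 0
    while len(trace) < max_tokens:
        chunk = generate(prompt, trace, max_tokens - len(trace))
        stopped = bool(chunk) and chunk[-1] == end_marker
        if stopped:
            chunk = chunk[:-1]  # drop the marker; it is re-added at the end
        trace.extend(chunk)
        if stopped and len(trace) < min_tokens and waits < max_waits:
            trace.append("Wait")  # nudge the model to continue reasoning
            waits += 1
            continue
        break
    # Force the end of thinking once the loop exits, whatever the reason.
    return trace[:max_tokens] + [end_marker]
```

In use, a model that stops after one step with `min_tokens=3` would be extended once with 'Wait' and allowed to stop on its second attempt, while a model that never stops is cut off at `max_tokens`.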

Dao & Gu, 2024

Unified theory of Transformers and State-Space Models via Structured State Space Duality. Faster training, hardware-aware kernels.

SSD framework unifying SSM + attention • 2-8x training speedup • Foundation for hybrid architectures

DeepSeek-AI, 2025

Hardware-aligned sparse attention designed into pretraining. Cuts long-context attention cost without quality regression — bringing sparse attention from inference hack to first-class training method.

Sparse attention at pretraining time • Hardware-aligned block patterns • Frontier-scale validation

Center for AI Safety + Scale AI, 2025

Multidisciplinary expert-level benchmark designed to remain unsaturated even as frontier models improve. Even GPT-5 / Claude Opus 4.x score under 30%.

Crowdsourced from world experts • Designed against future saturation • Replaces MMLU as the headline test

Active Research Areas

The major directions of current AI research and their significance.

RL with Verifiable Rewards (RLVR)

Train on math, code, and agentic tasks where correctness is mechanically checkable.

Key Papers

DeepSeek-R1 (2501.12948) • Kimi k1.5 (2501.12599) • rStar-Math (2501.04519) • Tülu 3 (2411.15124)

The defining post-training paradigm of 2025-2026
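A "mechanically checkable" reward for math can be as simple as extracting the final answer and comparing it with a reference. The sketch below assumes answers are wrapped in `\boxed{...}` and compares them as exact rationals when possible; the function names and the answer convention are illustrative assumptions, and production graders handle far more answer formats.

```python
import re
from fractions import Fraction


def extract_final_answer(completion):
    """Return the contents of the last \\boxed{...} in the text, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def math_reward(completion, reference):
    """Return 1.0 if the final answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        # Numeric comparison, so that 0.5 and 1/2 count as the same answer.
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        # Fall back to an exact string match for non-numeric answers.
        return 1.0 if answer == reference.strip() else 0.0
```

Because the reward is computed by a program rather than a learned model, it cannot be flattered or gamed in the way a reward model can, which is a large part of the appeal of this setup.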

Test-Time Compute Scaling

Variable inference budget — long chain-of-thought, search, and verification at decode time.

Key Papers

Snell et al. (2408.03314) • s1 (2501.19393) • rStar-Math (2501.04519) • Process Reward Models (2305.20050)

Transforms inference from one-shot to deliberation
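Two of the simplest ways to spend extra compute at decode time are to sample several answers and take a majority vote, or to sample several candidates and keep the one a verifier scores highest. The generic sketch below is not taken from any one of the papers listed; `verifier` is a hypothetical scoring function such as a reward model or a unit-test runner.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates, verifier):
    """Return the candidate with the highest verifier score."""
    return max(candidates, key=verifier)
```

Both strategies trade inference cost for accuracy: the more samples drawn, the higher the chance that a correct answer appears and is selected.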

Agents & Tool Use

Models that plan, call tools, and act in real environments — code, browser, desktop.

Key Papers

ReAct (2210.03629) • SWE-Bench (2310.06770) • OSWorld (2404.07972) • Toolformer (2302.04761)

Crossing usability threshold; MCP became the integration standard

Efficient Architectures

Cutting compute and memory at training and inference.

Key Papers

Mamba-2 (2405.21060) • MLA (DeepSeek-V2) • Mixture-of-Depths (2404.02258) • Native Sparse Attention (2502.11089)

MoE + MLA + sparse attention is the new default

Multimodal & World Models

Native any-to-any models and generative world simulators.

Key Papers

Chameleon (2405.09818) • SigLIP 2 (2502.14786) • Cosmos (2501.03575) • Movie Gen (2410.13720)

Unified text+image+audio+video; spatial intelligence emerging

Alignment & Safety

Constitutional AI, deliberative alignment, and constitutional classifiers as deployed safety filters.

Key Papers

Constitutional Classifiers (2501.18837) • Deliberative Alignment (2412.16339) • Circuit Breakers (2406.04313)

RSPs and pre-deployment AISI evals are now release-blockers

Mechanistic Interpretability

Sparse Autoencoders to find interpretable features inside frontier models.

Key Papers

Scaling Monosemanticity (Anthropic, 2024) • Gemma Scope (2408.05147) • On the Biology of an LLM (Anthropic, Mar 2025)

First production interp tools; Anthropic targeting solved-by-2027

Pretraining Data Science

Quality filtering, decontamination, and synthetic data are now ML problems.

Key Papers

FineWeb (2406.17557) • DCLM (2406.11794) • Nemotron-CC (2412.02595) • Persona-Hub (2406.20094)

Data classifiers + LLM-rephrased web are standard
