If you can't measure it, you don't have it.
2026 frontier evaluation goes far beyond MMLU. The benchmarks that actually distinguish models — GPQA Diamond, FrontierMath, ARC-AGI 2, Humanity's Last Exam, SWE-Bench Verified, OSWorld — are graduate-or-expert difficulty. On the safety side, Responsible Scaling Policies, mandatory dangerous-capability evals (CBRN, cyber, autonomy), constitutional classifiers, and mechanistic interpretability via Sparse Autoencoders are now release-blockers at frontier labs.
A frontier-grade implementation, in order.
MMLU-Pro, GPQA Diamond, AIME, FrontierMath, SWE-Bench Verified, LiveCodeBench, Humanity's Last Exam. Use lm-evaluation-harness for reproducibility.
MMMU for vision QA, OSWorld for desktop agents, Tau-Bench for customer service, MCP-Atlas for tool use, VBench for video.
13-gram overlap of training data against every published eval. Document the methodology in your model card.
CBRN uplift (WMDP), cyber (Cybench), autonomous replication (METR), persuasion (MakeMePay). Required by RSPs.
Recruit external red-teamers. Combine automated jailbreaks (PAIR, GCG) + human creativity. Convert findings into preference data.
Train Sparse Autoencoders on intermediate layers. Identify safety-relevant features. Use feature steering to reduce harmful outputs.
In rough order of foundational → recent. Click any title to open the arXiv abstract.
Hendrycks et al. · 2021
Wang et al. · 2024
Rein et al. · 2023
Jimenez et al. · 2024
Glazer et al. (Epoch AI) · 2024
Center for AI Safety + Scale AI · 2025
Liang et al. (Stanford) · 2022
Li et al. · 2024
Zhang et al. · 2024
Anthropic · 2024
DeepMind · 2024
Cunningham et al. · 2023
Zou et al. · 2023
Chao et al. · 2023
Zou et al. · 2024