Understanding how AI models are evaluated, compared, and ranked. From standardized tests to human preference ratings.
The classic benchmarks (MMLU, HumanEval, HellaSwag) saturated by 2024. The list below leads with the 2025-2026 frontier benchmarks — Humanity's Last Exam, FrontierMath, ARC-AGI 2, SWE-Bench Verified, GPQA Diamond — that still have headroom and actually distinguish today's top models.
Center for AI Safety + Scale AI
Multidisciplinary expert-level questions across math, physics, biology, history, and more. Designed to outlast benchmark saturation. Even GPT-5 / Claude Opus 4.x score under 30%.
GPT-5 Pro / Gemini 2.5 Pro · Jan 2025
Epoch AI Frontier Mathematics
Hundreds of original research-level math problems crafted by mathematicians. Closed test set; no contamination. Frontier models score in the low double digits.
GPT-5 / o3-class · 2025
Abstract Reasoning Corpus 2
Visual/abstract reasoning puzzles resistant to memorization. Designed by François Chollet to test fluid intelligence rather than pattern matching.
OpenAI o3 / GPT-5 · 2025
Software Engineering Benchmark (Verified subset)
Real GitHub issues from 12 popular Python projects, human-verified for solvability. Models must produce a working patch that passes all tests.
Claude Opus 4.7 · Apr 2026
Harder Software Engineering Benchmark
Harder professional-grade software-engineering tasks. Open-weight Chinese models (Kimi K2.6, GLM-5.1) currently top this leaderboard.
Kimi K2.6 · Apr 2026
Graduate-Level Google-Proof Q&A (Diamond)
Hardest subset of GPQA: 198 graduate-level physics, chemistry, and biology questions. Resistant to web search; even PhDs in adjacent fields score ~34%.
GPT-5 Pro w/ thinking · 2025
American Invitational Mathematics Examination 2025
Olympiad-level high school math. Numeric answers make it cleanly verifiable — the canonical RLVR target.
GPT-5 (no tools) · 2025
Massive Multitask Language Understanding (Pro)
Harder follow-on to MMLU with 10 answer choices and reasoning-focused questions. The successor to MMLU, since the original saturated above 90%.
GPT-5 / Claude Opus 4.x · 2025
OS World (computer-use agents)
Realistic computer-use tasks across Linux, macOS, and Windows applications. Agents must drive the desktop end-to-end.
GPT-5.5 · 2025
Tool-augmented Conversation Benchmark
Multi-turn customer-service workflows with tool use, policy adherence, and grounded reasoning over a database.
GPT-5.5 (Telecom) · 2025
Model Context Protocol — Atlas
Tool-use benchmark over heterogeneous MCP servers. Tests model ability to discover, choose, and chain real-world tools.
Claude Opus 4.7 · Apr 2026
Live Code Benchmark (v5)
Continuously refreshed LeetCode-style problems collected after model release dates — contamination-resistant by construction.
Gemini 2.5 Pro · 2025
Terminal Bench v2
Long-horizon shell/coding tasks executed end-to-end in a real terminal sandbox — ops, debugging, system admin.
GPT-5.5 · 2025
Generalist Domain Performance Validation
OpenAI's economic-value-weighted suite covering 44 occupations from law to accounting to medicine. Flags real-world deployment readiness.
GPT-5.5 · 2025
Massive Multi-discipline Multimodal Understanding
College-level multimodal questions across art, business, science, health, humanities, and engineering. The vision-LLM analog of MMLU.
GPT-5 · 2025
Multi-Round Coreference Resolution at 1M context
Long-context multi-round retrieval and coreference task. The current frontier benchmark for 1M+ token context windows.
Gemini 2.5 Pro (128k) · 2025
Massive Multitask Language Understanding (saturated)
57-subject knowledge test. Saturated above 90% by 2024; replaced by MMLU-Pro as the live benchmark.
Saturated · superseded
OpenAI HumanEval (saturated)
164 Python problems. Above 95% on frontier models. Replaced by SWE-Bench Verified, LiveCodeBench, and SWE-Bench Pro.
Saturated · superseded
Beyond standardized tests, human preference ratings show how models perform in real-world use. The Elo system, borrowed from chess, ranks models based on head-to-head comparisons.
A crowdsourced, randomized battle platform where users chat with two anonymous models side by side and vote for the better one. More than 6M user votes are used to compute Elo ratings. As of April 2026, Claude Opus 4.6 Thinking holds #1 at 1504 Elo and Grok 4.20 Beta1 sits at #4 (1491); the 1500-Elo barrier has been broken for the first time.
Uses the Bradley-Terry model. Models gain points for wins and lose points for losses. The difference between two models' Elo scores determines the expected win probability: the larger the gap, the more often the higher-rated model is expected to win. The 'Thinking' model variants (extended-CoT) have come to dominate the top tier.
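The link between a rating gap and a win probability can be sketched in a few lines. This is an illustration under stated assumptions, not the leaderboard's actual fitting code: the 400-point scale and the K-factor of 32 are the conventional chess values, and the per-battle update shown here is the classic online Elo rule rather than a full Bradley-Terry fit over all votes.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the logistic (Bradley-Terry style) Elo curve.

    Assumption: scale=400 is the chess convention, not a value from the source.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    Assumption: k=32 is an illustrative K-factor.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Ratings quoted above: 1504 vs 1491 gives roughly a 52% expected win rate.
p = expected_score(1504, 1491)
```

A 13-point gap therefore corresponds to a near coin flip, which is why small differences at the top of the leaderboard should be read with caution.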
Newly broken in 2026: Claude Opus 4.6 Thinking, GPT-5.4, Grok 4.20 — reasoning-tuned frontier models.
Frontier non-reasoning models: GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.6, DeepSeek V3.2.
Mid-tier 2025 models, top open-weight (Qwen 3.6, Kimi K2.6, GLM-5.1).
Last-generation models still competitive for many tasks (Claude 3.7, GPT-4o, Llama 4).
Mid-tier open-source 7-70B models.
Track the latest model rankings across different evaluation criteria.
By 2026, the original LLM benchmarks — MMLU, HumanEval, HellaSwag, ARC, GSM8K — are functionally saturated. Frontier models score above 90% and the remaining points are noise. The field has rotated to adversarial, expert-level, and contamination-resistant tests: Humanity's Last Exam, FrontierMath (closed test set), ARC-AGI 2, SWE-Bench Verified, and LiveCodeBench (continuously refreshed post-release).
Contamination is a real problem. Models that have seen a test set during pretraining will silently inflate scores. The Llama 3 paper documents the standard 13-gram-overlap protocol now expected in any release.
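A simplified sketch of an n-gram overlap check of the kind mentioned above follows. It assumes lowercase whitespace tokenization and an in-memory list of training documents; real pipelines use the model's tokenizer and an index over the full corpus. The names `ngrams` and `overlap_ratio` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-token windows in the token list (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(sample: str, corpus_docs: Iterable[str], n: int = 13) -> float:
    """Fraction of the sample's n-grams that also occur in any corpus document.

    Assumption: lowercase whitespace splitting stands in for real tokenization.
    """
    sample_grams = ngrams(sample.lower().split(), n)
    if not sample_grams:
        return 0.0  # sample shorter than n tokens: nothing to compare
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc.lower().split(), n)
    return len(sample_grams & corpus_grams) / len(sample_grams)
```

A benchmark item whose ratio exceeds some chosen threshold would be flagged as potentially contaminated; where to set that threshold is a judgment call that this sketch leaves open.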
Gaming is also a real problem. In April 2026, scrutiny of some Chinese open-weight models (Kimi K2.6, Qwen 3.6) suggested that headline benchmark gains overstate generalization, a reading supported by their gap on harder, gaming-resistant tests like ARC-AGI 2. The takeaway: never trust a single benchmark.
Our models are continuously evaluated against these benchmarks. Try FullAI and see the difference.