AI Benchmarks & Metrics

How AI models are evaluated, compared, and ranked, from standardized tests to human preference ratings.

Standard Benchmarks

The classic benchmarks (MMLU, HumanEval, HellaSwag) were saturated by 2024. The list below leads with the 2025-2026 frontier benchmarks (Humanity's Last Exam, FrontierMath, ARC-AGI 2, SWE-Bench Verified, GPQA Diamond) that still have headroom and actually distinguish today's top models.

🌌Frontier

Humanity's Last Exam

Center for AI Safety + Scale AI

Multidisciplinary expert-level questions across math, physics, biology, history, and more. Designed to outlast benchmark saturation. Even frontier models such as GPT-5 and Claude Opus 4.x score roughly 30% or lower.

SOTA~30%

GPT-5 Pro / Gemini 2.5 Pro · 2025

📐Math

FrontierMath

Epoch AI Frontier Mathematics

Hundreds of original research-level math problems crafted by mathematicians. The test set is kept closed to limit contamination. Frontier models still solve only about a quarter of the problems at best.

SOTA~25%

GPT-5 / o3-class · 2025

🧩Reasoning

ARC-AGI 2

Abstract Reasoning Corpus 2

Visual/abstract reasoning puzzles resistant to memorization. Designed by François Chollet to test fluid intelligence rather than pattern matching.

SOTA~25-40%

OpenAI o3 / GPT-5 · 2025

💻Coding

SWE-Bench Verified

Software Engineering Benchmark (Verified subset)

Real GitHub issues from 12 popular Python projects, human-verified for solvability. Models must produce a working patch that passes all tests.

SOTA87.6%

Claude Opus 4.7 · Apr 2026
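The pass criterion for SWE-Bench Verified can be sketched as follows. This is a minimal illustration, not the official harness: the `TaskResult` schema and function names are hypothetical. It assumes each task tracks two groups of tests, those the patch must fix and those that must keep passing, and counts a task as resolved only when both groups are fully green.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after applying a model's patch (hypothetical schema)."""
    fail_to_pass: dict[str, bool]  # tests the patch must fix: test name -> passed?
    pass_to_pass: dict[str, bool]  # tests that already passed and must not regress


def is_resolved(result: TaskResult) -> bool:
    # A task counts only if the target tests now pass AND nothing regressed.
    # An empty fail_to_pass set is treated as unresolved rather than trivially solved.
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolved_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved, the headline number on the leaderboard."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)
```

A patch that fixes the issue but breaks an unrelated test scores zero for that task, which is why the metric is stricter than "the new test passes".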

🛠️Coding

SWE-Bench Pro

Harder Software Engineering Benchmark

Harder professional-grade software-engineering tasks. Open-weight Chinese models (Kimi K2.6, GLM-5.1) currently top this leaderboard.

SOTA58.6%

Kimi K2.6 · Apr 2026

🎓Expert

GPQA Diamond

Graduate-Level Google-Proof Q&A (Diamond)

Hardest subset of GPQA: 198 graduate-level physics, chemistry, and biology questions. Resistant to web search; even PhDs in adjacent fields score ~34%.

SOTA88.4%

GPT-5 Pro w/ thinking · 2025

🧮Math

AIME 2025

American Invitational Mathematics Examination 2025

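Olympiad-qualifier high school math. Every answer is an integer from 0 to 999, which makes grading cleanly verifiable and makes AIME the canonical target for reinforcement learning with verifiable rewards (RLVR).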

SOTA94.6%

GPT-5 (no tools) · 2025
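Because AIME answers are integers from 0 to 999, a grader can be a few lines of code. The sketch below is one possible checker, not any lab's actual reward function: the function names are hypothetical, and it assumes the model either wraps its final answer in `\boxed{...}` or ends its response with the answer as the last standalone integer.

```python
import re
from typing import Optional

# Assumed output conventions: a LaTeX \boxed{...} answer, else the last 1-3 digit integer.
BOXED = re.compile(r"\\boxed\{\s*(\d{1,3})\s*\}")
INTEGER = re.compile(r"(?<!\d)(\d{1,3})(?!\d)")


def extract_answer(response: str) -> Optional[int]:
    """Return the model's final integer answer, or None if none can be found."""
    boxed = BOXED.findall(response)
    candidates = boxed if boxed else INTEGER.findall(response)
    return int(candidates[-1]) if candidates else None


def verifiable_reward(response: str, gold: int) -> float:
    """Binary reward: 1.0 for an exact match with the reference answer, else 0.0."""
    answer = extract_answer(response)
    return 1.0 if answer is not None and answer == gold else 0.0
```

The fallback to "last integer in the text" is deliberately crude (it would misread a trailing decimal, for example); a production grader would insist on a strict answer format.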

📚Knowledge

MMLU-Pro

Massive Multitask Language Understanding (Pro)

Harder follow-on to MMLU with 10 answer choices and reasoning-focused questions. It became the successor to MMLU once the original saturated above 90%.

SOTA~85%

GPT-5 / Claude Opus 4.x · 2025

🖥️Agents

OSWorld-Verified

OS World (computer-use agents)

Realistic computer-use tasks across Linux, macOS, and Windows applications. Agents must drive the desktop end-to-end.

SOTA78.7%

GPT-5.5 · 2025

📞Agents

Tau-Bench / Tau2

Tool-Agent-User Interaction Benchmark

Multi-turn customer-service workflows with tool use, policy adherence, and grounded reasoning over a database.

SOTA98%

GPT-5.5 (Telecom) · 2025

🧰Agents

MCP-Atlas

Model Context Protocol — Atlas

Tool-use benchmark over heterogeneous MCP servers. Tests model ability to discover, choose, and chain real-world tools.

SOTA77.3%

Claude Opus 4.7 · Apr 2026

🐍Coding

LiveCodeBench v5

Live Code Benchmark (v5)

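Continuously refreshed competitive-programming problems from LeetCode, AtCoder, and Codeforces. Each problem carries its publication date, so a model can be scored only on problems released after its training cutoff, which makes the benchmark contamination-resistant by construction.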

SOTA70.4%

Gemini 2.5 Pro · 2025
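The date-based filtering behind LiveCodeBench's contamination resistance can be sketched as below. This is an illustration under assumptions, not the benchmark's own code: the `Problem` record and `contamination_free` helper are hypothetical names, and the only requirement is that each problem has a known publication date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date  # date the problem was first published


def contamination_free(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


if __name__ == "__main__":
    pool = [
        Problem("two-sum-variant", date(2024, 3, 1)),
        Problem("grid-paths", date(2025, 2, 10)),
    ]
    # A model with a mid-2024 cutoff is scored on the later problem only.
    for problem in contamination_free(pool, date(2024, 6, 30)):
        print(problem.problem_id)
```

The consequence is that two models with different cutoffs may be scored on different problem sets, so comparisons should use a window that postdates both.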

⌨️Agents

Terminal-Bench 2.0

Terminal Bench v2

Long-horizon shell/coding tasks executed end-to-end in a real terminal sandbox — ops, debugging, system admin.

SOTA82.7%

GPT-5.5 · 2025

💼Knowledge work

GDPval

Economically Valuable Task Evaluation (named after GDP)

OpenAI's economic-value-weighted suite covering 44 occupations from law to accounting to medicine. Flags real-world deployment readiness.

SOTA84.9%

GPT-5.5 · 2025

🖼️Multimodal

MMMU

Massive Multi-discipline Multimodal Understanding

College-level multimodal questions across art, business, science, health, humanities, and engineering. The vision-LLM analog of MMLU.

SOTA84.2%

GPT-5 · 2025

📜Long context

MRCR (1M)

Multi-Round Coreference Resolution at 1M context

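Long-context multi-round retrieval and coreference task, reported at context lengths up to 1M tokens. Scores drop as the context grows, so check the length at which a result was measured. The current frontier benchmark for 1M+ token context windows.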

SOTA91.5%

Gemini 2.5 Pro (128k) · 2025

📖Saturated

MMLU

Massive Multitask Language Understanding (saturated)

57-subject knowledge test. Saturated above 90% by 2024; replaced by MMLU-Pro as the live benchmark.

SOTA~92%

Saturated · superseded

Saturated

HumanEval

OpenAI HumanEval (saturated)

164 Python problems. Above 95% on frontier models. Replaced by SWE-Bench Verified, LiveCodeBench, and SWE-Bench Pro.

SOTA~98%

Saturated · superseded

Elo Ratings & Arena Rankings

Beyond standardized tests, human preference ratings show how models perform on real user prompts. The Elo system, borrowed from chess, ranks models based on head-to-head comparisons.

Chatbot Arena (LMSYS)

A crowdsourced, randomized battle platform where users chat with two anonymous models side-by-side and vote for the better one. More than 6M user votes are used to compute Elo ratings. As of April 2026, Claude Opus 4.6 Thinking holds #1 at 1504 Elo, breaking the 1500-Elo barrier for the first time, with Grok 4.20 Beta1 at #4 (1491).

Methodology

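Ratings are fit with the Bradley-Terry model over all recorded battles and reported on an Elo-style scale: wins against strong opponents raise a model's rating, and losses lower it. The difference between two models' scores determines the expected win probability. On the conventional 400-point Elo scale, a 100-point lead corresponds to roughly a 64% expected win rate. The 'Thinking' model variants (extended chain-of-thought) have come to dominate the top tier.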
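The relationship between rating gaps and win probability, and the way Bradley-Terry strengths are fit from vote counts, can be sketched as follows. This is a simplified illustration, not the Arena's production code: the function names, the 1000-point anchor, and the plain iterative fit are assumptions, ties are ignored, and every model is assumed to have at least one win and one loss.

```python
import math


def expected_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def fit_bradley_terry(wins: dict[tuple[str, str], int], iterations: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths from {(winner, loser): count} by fixed-point iteration."""
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            total_wins = sum(c for (winner, _), c in wins.items() if winner == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom > 0 else strength[i]
        # Strengths are only defined up to a constant factor, so renormalize each round.
        norm = sum(updated.values())
        strength = {m: s * len(models) / norm for m, s in updated.items()}
    return strength


def to_elo_scale(strength: dict[str, float], anchor: float = 1000.0) -> dict[str, float]:
    """Map strengths to an Elo-style scale; the anchor is arbitrary (assumed 1000 here)."""
    return {m: anchor + 400.0 * math.log10(s) for m, s in strength.items()}
```

Because the fit uses every recorded battle at once, the resulting ratings do not depend on the order in which votes arrived, unlike a game-by-game Elo update.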

Understanding Elo Scores

1500+
Frontier (2026)

Newly reached in 2026 by reasoning-tuned frontier models: Claude Opus 4.6 Thinking leads, with GPT-5.4 and Grok 4.20 clustered around the 1500 line.

1400-1500
Top tier

Frontier non-reasoning models: GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.6, DeepSeek V3.2.

1300-1400
Excellent

Mid-tier 2025 models, top open-weight (Qwen 3.6, Kimi K2.6, GLM-5.1).

1200-1300
Strong

Last-generation models still competitive for many tasks (Claude 3.7, GPT-4o, Llama 4).

1100-1200
Competent

Mid-tier open-source 7-70B models.

Benchmark Saturation, Contamination, and Gaming

By 2026, the original LLM benchmarks are functionally saturated: MMLU, HumanEval, HellaSwag, ARC (the AI2 Reasoning Challenge, not ARC-AGI), and GSM8K. Frontier models score above 90%, and differences in the remaining few points are mostly noise. The field has rotated to adversarial, expert-level, and contamination-resistant tests: Humanity's Last Exam, FrontierMath (closed test set), ARC-AGI 2, SWE-Bench Verified, and LiveCodeBench (continuously refreshed with newly published problems).
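To see why a few points near the ceiling carry little signal, it helps to put an error bar on a benchmark score. The sketch below uses the normal approximation to the binomial, which is rough near 0% and 100% and treats questions as independent. The function name is hypothetical, and this is a back-of-the-envelope check rather than a full statistical treatment.

```python
import math


def accuracy_interval(accuracy: float, n_questions: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for an accuracy measured on n questions."""
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    low = max(0.0, accuracy - z * standard_error)
    high = min(1.0, accuracy + z * standard_error)
    return low, high


if __name__ == "__main__":
    # Question counts taken from the benchmark descriptions above.
    for name, score, n in [("GPQA Diamond", 0.884, 198), ("HumanEval", 0.98, 164)]:
        low, high = accuracy_interval(score, n)
        print(f"{name}: {score:.1%} (95% CI {low:.1%} to {high:.1%})")
```

On a 198-question set, an 88.4% score has an interval about 4.5 points wide on each side, so two models a point or two apart are statistically indistinguishable.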

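Contamination is a real problem. A model that saw a test set during pretraining reports inflated scores with no visible sign that anything is wrong. The GPT-3 paper popularized the 13-gram-overlap check, and later reports such as the Llama 3 paper run similar n-gram overlap analyses, which are now expected in any release.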
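An n-gram overlap check of the kind described above can be sketched in a few lines. This is a simplified illustration, not the protocol from any specific paper: the function names are hypothetical, tokenization is plain whitespace splitting on lowercased text, and a test item is flagged if it shares even one n-gram with any training document.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All word-level n-grams in the text (empty set if the text is shorter than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(test_item: str, training_docs: list[str], n: int = 13) -> bool:
    """Flag a test item that shares at least one n-gram with any training document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return False  # item too short to check at this n
    return any(item_grams & ngrams(doc, n) for doc in training_docs)


def contamination_rate(test_items: list[str], training_docs: list[str], n: int = 13) -> float:
    """Fraction of test items flagged as overlapping with the training data."""
    if not test_items:
        return 0.0
    flagged = sum(is_contaminated(item, training_docs, n) for item in test_items)
    return flagged / len(test_items)
```

At real pretraining scale the training-side n-grams would be hashed and indexed once instead of recomputed per test item, but the decision rule is the same. Exact-match overlap also misses paraphrased or translated leaks, so a clean result is evidence, not proof.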

Gaming is also a real problem. In April 2026, scrutiny of some Chinese open-weight models (Kimi K2.6, Qwen 3.6) suggested that headline benchmark gains overstate generalization, a reading supported by the gap on harder, gaming-resistant tests like ARC-AGI 2. The takeaway: never trust a single benchmark.
