Build a Frontier Model
🛡️
Step 11 of 11

Evaluation & Safety

If you can't measure it, you don't have it.

2026 frontier evaluation goes far beyond MMLU. The benchmarks that actually distinguish models — GPQA Diamond, FrontierMath, ARC-AGI 2, Humanity's Last Exam, SWE-Bench Verified, OSWorld — are graduate-or-expert difficulty. On the safety side, Responsible Scaling Policies, mandatory dangerous-capability evals (CBRN, cyber, autonomy), constitutional classifiers, and mechanistic interpretability via Sparse Autoencoders are now release-blockers at frontier labs.

Why it matters

  • MMLU saturated above 90% — benchmark inflation makes it useless for distinguishing frontier models.
  • Humanity's Last Exam (Jan 2025) — even GPT-5 / Claude Opus 4.x score under 30%; the new headroom benchmark.
  • RSPs (Responsible Scaling Policies) gate frontier deployment on red-team and dangerous-capability evals.
  • Sparse Autoencoders (Anthropic 'Scaling Monosemanticity', Gemma Scope) cracked open feature-level interpretability.

State of the art

2025-2026
  • SWE-Bench Verified — Claude Opus 4.7 leads at 87.6% (Apr 2026); GPT-5 at 74.9%.
  • GPQA Diamond — GPT-5 Pro at 88.4% with thinking; the graduate-physics standard.
  • ARC-AGI 2 — abstract reasoning frontier; Western models still beat Chinese open-weight models on this hard-to-game benchmark.
  • Humanity's Last Exam (Jan 2025) — multidisciplinary expert test designed to outlast frontier saturation.
  • Mandatory pre-deployment evals by AI Safety Institutes (UK, US) for frontier releases.
  • Constitutional Classifiers (Anthropic, Jan 2025) — deployed safety filters trained against red-team data.
  • Sparse Autoencoders for interpretability — Anthropic's 'On the Biology of a Large Language Model' (Mar 2025) attributed model behavior to learned features.

The recipe

A frontier-grade implementation, in order.

1

Capabilities battery

MMLU-Pro, GPQA Diamond, AIME, FrontierMath, SWE-Bench Verified, LiveCodeBench, Humanity's Last Exam. Use lm-evaluation-harness for reproducibility.
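
A minimal sketch of wiring part of that battery through lm-evaluation-harness. The task names, checkpoint path, and batch size are placeholders, and not every benchmark above ships with the harness (FrontierMath and Humanity's Last Exam are gated or external), so confirm availability with `lm_eval --tasks list` for the version you pin.

```python
# Sketch: reproducible capability scoring via lm-evaluation-harness.
# Task names and the checkpoint are assumptions -- verify against
# `lm_eval --tasks list` for the harness version you pin.
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                                           # HuggingFace backend
    model_args="pretrained=your-org/frontier-candidate,dtype=bfloat16",   # hypothetical checkpoint
    tasks=["mmlu_pro", "gpqa_diamond_zeroshot"],                          # subset that ships with the harness
    batch_size=8,
    log_samples=True,                                                     # keep per-sample records for audits
)

with open("capabilities_report.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```

Pin the harness commit and record it in the model card so the numbers stay reproducible.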

2

Multimodal & agentic

MMMU for vision QA, OSWorld for desktop agents, Tau-Bench for customer service, MCP-Atlas for tool use, VBench for video.

3

Decontamination check

Check 13-gram overlap between training data and every published eval. Document the methodology in your model card.
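
A toy version of that check, assuming whitespace tokenization; a real pass would tokenize with the model's tokenizer, normalize casing and punctuation, and hash n-grams to keep the index in memory.

```python
# Sketch: flag training documents sharing any 13-gram with a published eval.
from typing import Iterable, Set

NGRAM = 13

def ngrams(text: str, n: int = NGRAM) -> Set[tuple]:
    toks = text.lower().split()                      # naive tokenization for illustration
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_eval_index(eval_docs: Iterable[str]) -> Set[tuple]:
    index: Set[tuple] = set()
    for doc in eval_docs:
        index |= ngrams(doc)
    return index

def is_contaminated(train_doc: str, eval_index: Set[tuple]) -> bool:
    return not eval_index.isdisjoint(ngrams(train_doc))

# Drop (or at least log) flagged documents and report per-benchmark hit rates.
eval_index = build_eval_index(["example eval question text ..."])
corpus = ["example training document ..."]
clean = [d for d in corpus if not is_contaminated(d, eval_index)]
```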

4

Dangerous-capability evals

CBRN uplift (WMDP), cyber (Cybench), autonomous replication (METR), persuasion (MakeMePay). Required by RSPs.
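
One way to frame the WMDP piece is as uplift over a reference model rather than a raw score. The task names, metric key, and the escalation threshold below are assumptions; the real thresholds and the rest of the battery (Cybench, METR task suites, persuasion evals) come from your RSP, not this script.

```python
# Sketch: WMDP accuracy reported as uplift over the last cleared release.
import lm_eval

TASKS = ["wmdp_bio", "wmdp_chem", "wmdp_cyber"]   # assumed harness task names

def accuracy(model_args: str) -> dict:
    out = lm_eval.simple_evaluate(model="hf", model_args=model_args, tasks=TASKS)
    # Metric key ("acc,none") depends on the harness version.
    return {t: out["results"][t]["acc,none"] for t in TASKS}

candidate = accuracy("pretrained=your-org/frontier-candidate")    # hypothetical checkpoints
reference = accuracy("pretrained=your-org/last-cleared-release")

for task in TASKS:
    uplift = candidate[task] - reference[task]
    print(f"{task}: {candidate[task]:.3f} (uplift {uplift:+.3f})")
    if uplift > 0.05:   # placeholder threshold; your RSP sets the real one
        print(f"  -> escalate {task} to full dangerous-capability review")
```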

5

Red-team

Recruit external red-teamers. Combine automated jailbreaks (PAIR, GCG) + human creativity. Convert findings into preference data.
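
A sketch of how findings can flow back into training as DPO-style preference pairs. The record schema and the reviewer-written safe_completion field are assumptions about how your triage pipeline stores results.

```python
# Sketch: red-team findings -> preference pairs (chosen = safe, rejected = observed failure).
import json
from dataclasses import dataclass

@dataclass
class RedTeamFinding:
    prompt: str              # adversarial prompt (human, PAIR, or GCG)
    harmful_completion: str  # what the model actually produced
    safe_completion: str     # reference response written during triage
    source: str              # "human", "pair", "gcg", ...

def to_preference_pair(f: RedTeamFinding) -> dict:
    return {
        "prompt": f.prompt,
        "chosen": f.safe_completion,
        "rejected": f.harmful_completion,
        "meta": {"source": f.source},
    }

findings = [RedTeamFinding(
    prompt="<adversarial prompt from a red-team session>",
    harmful_completion="<policy-violating output>",
    safe_completion="<refusal or safe alternative written by a reviewer>",
    source="human",
)]

with open("redteam_prefs.jsonl", "w") as out:
    for f in findings:
        out.write(json.dumps(to_preference_pair(f)) + "\n")
```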

6

Interpretability

Train Sparse Autoencoders on intermediate layers. Identify safety-relevant features. Use feature steering to reduce harmful outputs.
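
A minimal sparse-autoencoder training step plus a feature-steering line, assuming residual-stream activations are already cached. Dictionary size, the L1 coefficient, the plain-ReLU architecture, and the feature index are all illustrative; production SAEs use far larger dictionaries and careful dead-feature handling.

```python
# Sketch: one optimization step of an SAE over cached layer activations,
# in the spirit of 'Scaling Monosemanticity' / Gemma Scope.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 4096, d_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))   # sparse, nonnegative features
        recon = self.decoder(feats)
        return recon, feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 5e-4

acts = torch.randn(1024, 4096)  # stand-in for a batch of cached residual-stream activations
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
loss.backward()
opt.step()

# Feature steering (sketch): once a feature is identified as safety-relevant,
# add a scaled copy of its decoder direction to the activations at inference.
steer_direction = sae.decoder.weight[:, 123]    # hypothetical feature index
steered_acts = acts + 4.0 * steer_direction     # broadcasts over the batch
```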

⚠️

Common pitfalls

Benchmark gaming: Chinese open-weight models reportedly inflated SWE-Bench scores in 2025. Validate on private/contamination-resistant tests (ARC-AGI 2).
Refusal training degrades helpfulness — measure both attack success rate AND over-refusal rate (see the sketch after this list).
Static benchmarks decay fast — top models saturate each new one within about six months. Rotate to harder evals.
Mech-interp via SAEs is promising but expensive and hasn't yet caught a real-world deployment failure. Treat as research-grade.
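
A sketch of the two rates from the refusal pitfall above, assuming each logged interaction is already annotated for prompt harmfulness and refusal (by classifier or human review).

```python
# Sketch: report attack success rate and over-refusal rate side by side.
from dataclasses import dataclass
from typing import List

@dataclass
class Interaction:
    prompt_is_harmful: bool
    model_refused: bool

def attack_success_rate(logs: List[Interaction]) -> float:
    attacks = [x for x in logs if x.prompt_is_harmful]
    return sum(not x.model_refused for x in attacks) / max(len(attacks), 1)

def over_refusal_rate(logs: List[Interaction]) -> float:
    benign = [x for x in logs if not x.prompt_is_harmful]
    return sum(x.model_refused for x in benign) / max(len(benign), 1)

logs = [Interaction(True, True), Interaction(False, False), Interaction(False, True)]
print(attack_success_rate(logs), over_refusal_rate(logs))  # track BOTH on the eval dashboard
```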