AI Model Benchmarks May 2026 — LM Council

Summary

A comparison tool and leaderboard for frontier AI models across 18 benchmarks, including Humanity's Last Exam, SWE-bench, and FrontierMath, using data from Epoch AI and Scale AI.

Key quotes

Independently-run benchmarks by Epoch, Scale and others, so may not match self-reported scores by AI orgs.

The page provides an interactive comparison tool for frontier models, including versions of GPT-5, Claude 4, Gemini 3, and Grok 4. It tracks performance across domains such as advanced mathematics, software engineering, and visual physics.