What is SWE-bench Verified?

SWE-bench Verified is a set of 500 human-validated software engineering tasks taken from real GitHub issues in popular Python repositories. For each task, the model must produce a code patch that resolves the issue and passes the repository's own test suite. It has become the standard measure of practical, agentic coding ability. As of July 2026 the top published score is 95% (Claude Fable 5).

What is GPQA Diamond?

GPQA Diamond is a set of 198 graduate-level physics, chemistry, and biology questions written by domain experts and designed to be "Google-proof": skilled non-experts with full web access score only about 34%. Frontier models now exceed 90%, yet it remains one of the few knowledge benchmarks that still differentiates top models.

Are these benchmarks still reliable?

Benchmark saturation is why this leaderboard tracks the current generation of evaluations: classics like MMLU, HumanEval, and GSM8K are effectively solved (top models cluster above 90%), so the industry moved to harder, contamination-resistant tests like SWE-bench Pro and ARC-AGI-2. Even so, vendor-reported scores can use favourable scaffolding — for robust comparisons, combine benchmarks with task-specific evaluations on your own data.

LLM Benchmark Leaderboard

Benchmark scores for 11 major language models, compiled July 2026 from public leaderboards and vendor announcements. A cell marked "—" means no published score; scores are never estimated. Sortable by provider and evaluation metric.

Score colours: ≥ 90 excellent · 75–89 good · < 75 below avg · — not reported
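
The colour bands above amount to a simple threshold rule. As a minimal sketch (the function name `score_band` is hypothetical, and treating values between 89 and 90 as "good" is an assumption, since the legend does not say how fractional scores are bucketed):

```python
from typing import Optional


def score_band(score: Optional[float]) -> str:
    """Map a benchmark score (in percent) to the legend's band label.

    None stands for a score that was not reported.
    """
    if score is None:
        return "not reported"
    if score >= 90:
        return "excellent"
    if score >= 75:
        # Assumption: scores such as 89.5 fall in "good", i.e. anything below 90.
        return "good"
    return "below avg"


if __name__ == "__main__":
    # Sample values taken from the SWE-bench Verified column.
    for value in (95.0, 88.7, 73.7, None):
        print(value, "->", score_band(value))
```

Keeping "not reported" as its own band, rather than treating it as zero, matches the leaderboard's rule that scores are never estimated.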


Model	Provider	Released	Params	SWE-bench Verified	SWE-bench Pro	GPQA Diamond	OSWorld	Terminal-Bench	ARC-AGI-2	Context	Notes
Claude Fable 5	Anthropic	2026	Unknown	95.0	—	—	—	—	—	1M	Anthropic premium tier; leads SWE-bench Verified among generally available models
GPT-5.5	OpenAI	2026	Unknown	88.7	—	93.6	—	—	—	1M	SWE-bench Verified score is vendor-reported
Claude Opus 4.8	Anthropic	2026	Unknown	88.6	69.2	93.6	—	—	—	1M	Leads SWE-bench Pro among active models
Claude Sonnet 5	Anthropic	2026-06	Unknown	82.1	63.2	96.2	88.3	—	84.7	1M	GPQA Diamond record holder (96.2%) at mid-tier pricing
Gemini 3.1 Pro	Google	2026	Unknown	80.6	—	94.3	—	—	77.1	1M	Strong multimodal reasoning; video input
DeepSeek V4 Flash	DeepSeek	2026-04	284B MoE (13B active)	73.7	—	86.0	—	—	—	1M	Open weights (MIT); 79.0% SWE-bench Verified at max reasoning effort
DeepSeek V4 Pro	DeepSeek	2026-04	1.6T MoE (49B active)	73.6	—	—	—	—	—	1M	Open weights (MIT); 80.6% SWE-bench Verified at max reasoning effort
Claude Haiku 4.5	Anthropic	2025-10	Unknown	73.3	—	—	—	—	—	200K	Fastest and cheapest Claude
o3	OpenAI	2025-04	Unknown	69.1	—	83.3	—	—	—	200K	Scores from OpenAI's April 2025 announcement
GPT-5.4	OpenAI	2026	Unknown	—	—	—	75.0	—	—	1M	Value flagship with tool search
Gemini 3.5 Flash	Google	2026	Unknown	—	55.1	—	—	76.2	—	1M	Agentic coding at fast-tier pricing
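
When ranking models on a single metric, unreported scores need explicit handling so that they are neither estimated nor treated as zero. A minimal sketch, assuming missing scores are stored as `None` and listed last (the helper name `rank_by_score` and the dictionary layout are illustrative, not part of the leaderboard itself):

```python
from typing import Optional

# A few SWE-bench Verified scores copied from the table; None = not reported.
swe_bench_verified: dict[str, Optional[float]] = {
    "Claude Sonnet 5": 82.1,
    "GPT-5.4": None,
    "Claude Fable 5": 95.0,
    "Claude Opus 4.8": 88.6,
    "GPT-5.5": 88.7,
}


def rank_by_score(scores: dict[str, Optional[float]]) -> list[tuple[str, Optional[float]]]:
    """Sort models by score, highest first, with unreported scores last."""
    return sorted(
        scores.items(),
        # The first key element pushes None entries to the end;
        # the second orders the reported scores in descending order.
        key=lambda item: (item[1] is None, -(item[1] or 0.0)),
    )


if __name__ == "__main__":
    for model, score in rank_by_score(swe_bench_verified):
        print(f"{model}: {'not reported' if score is None else score}")
```

Because Python's sort is stable, models with no reported score keep their original relative order at the end of the list.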

Data as of July 2026. Scores may vary by source, prompt format, and evaluation methodology. Always verify against primary sources.
