LLM Benchmark Leaderboard
Benchmark scores for 20 major language models. Filter by provider and sort by evaluation metric.
| Model | Provider | Released | Params | MMLU | HumanEval | GSM8K | HellaSwag | ARC | MATH | MBPP | Context | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 | DeepSeek | 2025-01 | 671B (MoE) | 90.8 | 92.3 | 97.3 | 96.0 | 97.1 | 91.0 | 90.7 | 128K | Reasoning-focused; chain-of-thought distillation |
| GPT-4o | OpenAI | 2024-05 | Unknown | 88.7 | 90.2 | 96.4 | 95.3 | 96.7 | 76.6 | 87.0 | 128K | Multimodal flagship: text, image, audio |
| Llama 3.1 405B | Meta | 2024-07 | 405B | 88.6 | 89.0 | 96.8 | 95.6 | 96.1 | 73.8 | 88.6 | 128K | Largest open-weight Llama; rivals proprietary models |
| DeepSeek V3 | DeepSeek | 2024-12 | 671B (MoE) | 88.5 | 91.6 | 97.1 | 95.7 | 96.5 | 87.1 | 89.9 | 128K | MoE architecture; top open-source performance |
| Claude 3.5 Sonnet | Anthropic | 2024-10 | Unknown | 88.3 | 92.0 | 96.4 | 95.1 | 96.4 | 78.3 | 90.7 | 200K | Strong coding, 200K context |
| Claude 3 Opus | Anthropic | 2024-03 | Unknown | 86.8 | 84.9 | 95.0 | 95.4 | 94.8 | 60.1 | 89.6 | 200K | Anthropic flagship (2024 Q1) |
| GPT-4 Turbo | OpenAI | 2023-11 | Unknown | 86.4 | 87.1 | 93.7 | 94.2 | 95.0 | 72.6 | 85.7 | 128K | Predecessor to GPT-4o |
| Qwen 2.5 72B | Alibaba | 2024-09 | 72B | 86.1 | 86.6 | 95.2 | 94.1 | 94.5 | 83.1 | 88.2 | 131K | Strong multilingual and math performance |
| Gemini 1.5 Pro | Google | 2024-02 | Unknown | 85.9 | 84.1 | 91.7 | 92.5 | 92.3 | 67.7 | 83.2 | 1M | 1M token context window |
| Llama 3.1 70B | Meta | 2024-07 | 70B | 83.1 | 80.5 | 93.0 | 93.0 | 93.4 | 65.7 | 86.0 | 128K | Strong open-source mid-size model |
| GPT-4o mini | OpenAI | 2024-07 | Unknown | 82.0 | 87.2 | 91.5 | 90.1 | 91.3 | 70.2 | 83.5 | 128K | Fast, cost-efficient variant of GPT-4o |
| Mistral Large | Mistral | 2024-02 | Unknown | 81.2 | 79.0 | 90.2 | 89.2 | 90.2 | 45.0 | 79.6 | 131K | Flagship Mistral model |
| Gemini 2.0 Flash | Google | 2025-01 | Unknown | 76.4 | 82.4 | 89.0 | 89.5 | 90.7 | — | — | 1M | Fast, low-cost Gemini model |
| Command R+ | Cohere | 2024-04 | 104B | 75.7 | 69.1 | 85.5 | 88.4 | 87.3 | — | — | 128K | Optimised for RAG and tool use |
| Gemma 2 27B | Google | 2024-06 | 27B | 75.2 | 74.0 | 90.5 | 88.3 | 88.1 | 54.3 | 74.8 | 8K | Open-weight Gemma flagship |
| Claude 3.5 Haiku | Anthropic | 2024-11 | Unknown | 74.5 | 87.0 | 88.9 | 89.0 | 90.1 | — | — | 200K | Fast and affordable with strong coding |
| Llama 3.1 8B | Meta | 2024-07 | 8B | 73.0 | 72.6 | 84.5 | 82.1 | 83.4 | 51.9 | 72.8 | 128K | Efficient open-weight small model |
| Phi-3 Mini | Microsoft | 2024-04 | 3.8B | 70.9 | 59.0 | 86.6 | 78.7 | 82.9 | 44.6 | 62.1 | 128K | Small model with outsized performance |
| GPT-3.5 Turbo | OpenAI | 2022-11 | Unknown | 70.0 | 72.5 | 80.8 | 85.5 | 85.2 | 37.8 | 70.8 | 16K | Legacy model, largely superseded |
| Mistral 7B | Mistral | 2023-09 | 7B | 64.2 | 37.4 | 52.1 | 81.3 | 74.2 | 28.4 | 50.7 | 33K | Compact but performant open model |
Data as of May 2025. Scores may vary by source, prompt format, and evaluation methodology. Always verify against primary sources.
About
This leaderboard aggregates published benchmark scores for 20 large language models across seven standard evaluations: MMLU (general knowledge, 57 subjects), HumanEval (Python code generation), GSM8K (grade school math), HellaSwag (commonsense reasoning), ARC-Challenge (science questions), MATH (competition-level math problems), and MBPP (basic Python programming problems). Scores are sourced from official model cards and published research. Filter by provider to narrow the comparison, and click column headers to sort.
How to use
1. Use the provider filter chips to show only models from specific vendors.
2. Click any column header to sort by that benchmark; click again to reverse the order.
3. Score cells are color-coded: green (≥90), amber (75 to <90), muted (<75).
4. Missing scores are shown as "—" (not all models are evaluated on all benchmarks).
5. Click the "What is X?" links to expand descriptions of each benchmark.
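
The color-coding and sorting rules in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual implementation: the function names `score_band` and `sort_by_metric` are hypothetical, the sketch assumes the amber band runs from 75 up to (but not including) 90, and it assumes models without a score sort last in either direction. The three sample rows are copied from the table.

```python
from typing import Optional


def score_band(score: Optional[float]) -> str:
    """Map a benchmark score to its display band; None means no published score."""
    if score is None:
        return "missing"
    if score >= 90:
        return "green"
    if score >= 75:  # assumption: amber covers 75 up to just below 90
        return "amber"
    return "muted"


def sort_by_metric(rows, metric, descending=True):
    """Sort rows by one benchmark column, keeping rows without a score at the end."""
    scored = [r for r in rows if r.get(metric) is not None]
    missing = [r for r in rows if r.get(metric) is None]
    scored.sort(key=lambda r: r[metric], reverse=descending)
    return scored + missing


# Three rows copied from the table; None stands in for a missing score.
rows = [
    {"model": "Command R+", "MMLU": 75.7, "MATH": None},
    {"model": "DeepSeek R1", "MMLU": 90.8, "MATH": 91.0},
    {"model": "Mistral 7B", "MMLU": 64.2, "MATH": 28.4},
]

for row in sort_by_metric(rows, "MATH"):
    print(row["model"], score_band(row["MATH"]))
```

Sorting by MATH in descending order lists DeepSeek R1 (green) first, then Mistral 7B (muted), with Command R+ last because it has no MATH score.
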
- What is MMLU?
- MMLU (Massive Multitask Language Understanding) tests knowledge across 57 subjects including math, science, history, law, and medicine. Each question is multiple-choice with 4 options. A score of 90 means the model answered 90% of questions correctly across all 57 subjects. Human expert performance is ~89%.
- What is HumanEval?
- HumanEval measures code generation capability: 164 Python programming problems where the model must write a function body. Pass@1 score = percentage of problems solved correctly on the first attempt. 90+ is considered strong coding ability.
- Are these benchmarks still reliable?
- Benchmark saturation is a real concern: the top five models in the table score between 88.3 and 90.8 on MMLU, which leaves little room for differentiation. Training data contamination (models trained on benchmark test sets) is also a risk. For robust comparisons, combine benchmark scores with task-specific evaluations on your own data.
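
To make the Pass@1 definition in the HumanEval answer concrete, here is a small Python sketch. It uses the unbiased pass@k estimator commonly reported alongside HumanEval, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests; with k = 1 this reduces to c / n. The function names and the sample outcomes are illustrative assumptions, not data from this leaderboard.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results, k=1):
    """Average pass@k over problems, as a percentage. results: list of (n, c)."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical outcomes for four problems, one sample each (n=1): three solved.
results = [(1, 1), (1, 1), (1, 0), (1, 1)]
print(benchmark_score(results))  # 75.0
```

With one sample per problem, the score is simply the share of problems solved on the first attempt, which is how a HumanEval value such as 90 should be read.
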