CodeLint.Dev Dev Tools

LLM Benchmark Leaderboard

Benchmark scores for 20 major language models, sortable by provider and evaluation metric. Rows are listed by MMLU score, highest first.

Score colours: ≥ 90 excellent; 75 – 89 good; < 75 below avg; — not reported
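The legend's bands can be expressed as a small lookup. This is a minimal sketch, not the site's own code: the function name `score_band` is hypothetical, and it assumes that scores between 89 and 90 (for example 89.5) count as "good", since the legend does not say how fractional values at that boundary are treated.

```python
from typing import Optional


def score_band(score: Optional[float]) -> str:
    """Map a benchmark score to the legend's band label.

    None stands for a score that is not reported.
    Assumption: anything from 75 up to (but not including) 90 is "good".
    """
    if score is None:
        return "not reported"
    if score >= 90:
        return "excellent"
    if score >= 75:
        return "good"
    return "below avg"


if __name__ == "__main__":
    # Scores taken from the table: DeepSeek R1 MMLU, GPT-4o MMLU, Mistral 7B MMLU.
    for value in (90.8, 88.7, 64.2, None):
        print(value, "->", score_band(value))
```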
| Model | Provider | Released | Params | MMLU | HumanEval | GSM8K | HellaSwag | ARC | MATH | MBPP | Context | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek R1 | DeepSeek | 2025-01 | 671B (MoE) | 90.8 | 92.3 | 97.3 | 96.0 | 97.1 | 91.0 | 90.7 | 128K | Reasoning-focused; chain-of-thought distillation |
| GPT-4o | OpenAI | 2024-05 | Unknown | 88.7 | 90.2 | 96.4 | 95.3 | 96.7 | 76.6 | 87.0 | 128K | Multimodal flagship: text, image, audio |
| Llama 3.1 405B | Meta | 2024-07 | 405B | 88.6 | 89.0 | 96.8 | 95.6 | 96.1 | 73.8 | 88.6 | 128K | Largest open-weight Llama; rivals proprietary models |
| DeepSeek V3 | DeepSeek | 2024-12 | 671B (MoE) | 88.5 | 91.6 | 97.1 | 95.7 | 96.5 | 87.1 | 89.9 | 128K | MoE architecture; top open-source performance |
| Claude 3.5 Sonnet | Anthropic | 2024-10 | Unknown | 88.3 | 92.0 | 96.4 | 95.1 | 96.4 | 78.3 | 90.7 | 200K | Strong coding, 200K context |
| Claude 3 Opus | Anthropic | 2024-03 | Unknown | 86.8 | 84.9 | 95.0 | 95.4 | 94.8 | 60.1 | 89.6 | 200K | Anthropic flagship (2024 Q1) |
| GPT-4 Turbo | OpenAI | 2023-11 | Unknown | 86.4 | 87.1 | 93.7 | 94.2 | 95.0 | 72.6 | 85.7 | 128K | Predecessor to GPT-4o |
| Qwen 2.5 72B | Alibaba | 2024-09 | 72B | 86.1 | 86.6 | 95.2 | 94.1 | 94.5 | 83.1 | 88.2 | 131K | Strong multilingual and math performance |
| Gemini 1.5 Pro | Google | 2024-02 | Unknown | 85.9 | 84.1 | 91.7 | 92.5 | 92.3 | 67.7 | 83.2 | 1M | 1M token context window |
| Llama 3.1 70B | Meta | 2024-07 | 70B | 83.1 | 80.5 | 93.0 | 93.0 | 93.4 | 65.7 | 86.0 | 128K | Strong open-source mid-size model |
| GPT-4o mini | OpenAI | 2024-07 | Unknown | 82.0 | 87.2 | 91.5 | 90.1 | 91.3 | 70.2 | 83.5 | 128K | Fast, cost-efficient variant of GPT-4o |
| Mistral Large | Mistral | 2024-02 | Unknown | 81.2 | 79.0 | 90.2 | 89.2 | 90.2 | 45.0 | 79.6 | 131K | Flagship Mistral model |
| Gemini 2.0 Flash | Google | 2025-01 | Unknown | 76.4 | 82.4 | 89.0 | 89.5 | 90.7 | — | — | 1M | Fast, low-cost Gemini model |
| Command R+ | Cohere | 2024-04 | 104B | 75.7 | 69.1 | 85.5 | 88.4 | 87.3 | — | — | 128K | Optimised for RAG and tool use |
| Gemma 2 27B | Google | 2024-06 | 27B | 75.2 | 74.0 | 90.5 | 88.3 | 88.1 | 54.3 | 74.8 | 8K | Open-weight Gemma flagship |
| Claude 3.5 Haiku | Anthropic | 2024-11 | Unknown | 74.5 | 87.0 | 88.9 | 89.0 | 90.1 | — | — | 200K | Fast and affordable with strong coding |
| Llama 3.1 8B | Meta | 2024-07 | 8B | 73.0 | 72.6 | 84.5 | 82.1 | 83.4 | 51.9 | 72.8 | 128K | Efficient open-weight small model |
| Phi-3 Mini | Microsoft | 2024-04 | 3.8B | 70.9 | 59.0 | 86.6 | 78.7 | 82.9 | 44.6 | 62.1 | 128K | Small model with outsized performance |
| GPT-3.5 Turbo | OpenAI | 2022-11 | Unknown | 70.0 | 72.5 | 80.8 | 85.5 | 85.2 | 37.8 | 70.8 | 16K | Legacy model, largely superseded |
| Mistral 7B | Mistral | 2023-09 | 7B | 64.2 | 37.4 | 52.1 | 81.3 | 74.2 | 28.4 | 50.7 | 33K | Compact but performant open model |
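To rank the models by a metric other than MMLU, the rows can be re-sorted while keeping unreported scores at the bottom. The following is a rough sketch over four rows copied from the table. The names `ROWS` and `sort_by_metric` are hypothetical, and treating the MATH score for Command R+ as not reported is an assumption about which cells are blank in the source.

```python
from typing import Optional

# A few rows copied from the table. None marks a score that is not reported.
ROWS: list[dict] = [
    {"model": "GPT-4o", "provider": "OpenAI", "MMLU": 88.7, "MATH": 76.6},
    {"model": "DeepSeek R1", "provider": "DeepSeek", "MMLU": 90.8, "MATH": 91.0},
    # Assumption: MATH is one of the scores not reported for Command R+.
    {"model": "Command R+", "provider": "Cohere", "MMLU": 75.7, "MATH": None},
    {"model": "Qwen 2.5 72B", "provider": "Alibaba", "MMLU": 86.1, "MATH": 83.1},
]


def sort_by_metric(rows: list[dict], metric: str) -> list[dict]:
    """Return rows sorted by `metric`, highest first; unreported scores go last."""

    def key(row: dict) -> tuple[bool, float]:
        score: Optional[float] = row[metric]
        # False sorts before True, so reported scores come first;
        # negating the score gives descending order within that group.
        return (score is None, -(score if score is not None else 0.0))

    return sorted(rows, key=key)


if __name__ == "__main__":
    for row in sort_by_metric(ROWS, "MATH"):
        print(row["model"], row["MATH"])
    # Sorting by provider is a plain alphabetical sort on that field.
    for row in sorted(ROWS, key=lambda r: r["provider"]):
        print(row["provider"], row["model"])
```

Putting the "is missing" flag first in the sort key avoids comparing `None` with a float, which would raise a `TypeError` in Python 3.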

Data as of May 2025. Scores may vary by source, prompt format, and evaluation methodology. Always verify against primary sources.