
LLM Benchmarks

Pass@1 rates on the Qiskit HumanEval benchmark (151 examples, zero-shot, greedy decoding). Maintained by Marqov.

| # | Model | Provider | Semantic | 95% CI |
|---|-------|----------|----------|--------|
| 1 | Claude Opus 4.7 | Anthropic | 72% | 64-78% |
| 2 | Claude Sonnet 4.6 | Anthropic | 68% | 60-75% |
| 3 | Claude Opus 4.6 | Anthropic | 62% | 54-70% |
| 4 | Qwen3 Coder 480B | Other | 37% | 30-45% |
| 5 | Gemini 2.5 Pro | Google | 36% | 29-44% |
| 6 | DeepSeek V4 Pro | Other | 35% | 28-43% |
| 7 | DeepSeek V4 Flash | Other | 30% | 23-38% |
| 8 | Qwen3 Coder Plus | Other | 29% | 22-37% |
| 9 | Gemini 2.5 Flash | Google | 28% | 22-36% |
| 10 | GPT OSS 120B | Other | 23% | 17-31% |
| 11 | GPT OSS 20B | Other | 18% | 13-25% |
| 12 | Gemma 4 12B | Other | 18% | 13-25% |
| 13 | Codestral | Other | 17% | 11-23% |
| 14 | Kimi K2.6 | Other | 17% | 12-24% |
| 15 | Mistral Large | Other | 13% | 8-19% |
| 16 | Llama 4 Scout | Other | 7% | 4-13% |
version: 2026-05 · published: May 20, 2026 · n: 151 examples · suite: sha256:0e46ce03…

In v1.0, a semantic pass is the same as an execution pass: the generated code runs without crashing. Scores therefore do not measure circuit correctness. Intervals shown are Wilson 95% CIs. Read the methodology →
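For readers who want to see how an interval of this kind relates to a pass rate, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. The function name `wilson_interval` and the example count of 108 passes are illustrative assumptions; the page does not publish raw pass counts or its own implementation.

```python
import math


def wilson_interval(passed: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    passed: number of examples that passed
    n:      total number of examples
    z:      normal quantile (1.959964 corresponds to a 95% interval)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passed <= n:
        raise ValueError("passed must be between 0 and n")

    p = passed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical count: 108 of 151 is about 72%; the real counts are not published here.
    low, high = wilson_interval(108, 151)
    print(f"pass@1 = {108 / 151:.0%}, 95% CI = {low:.0%} to {high:.0%}")
```

With 151 examples, intervals near the middle of the range are roughly 14 to 15 percentage points wide, so adjacent rows in the table often overlap and small rank differences should be read with care.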