LLM Benchmarks
Pass@1 rates on the Qiskit HumanEval benchmark (151 examples, zero-shot, greedy decoding). Maintained by Marqov.
| # | Model | Provider | Semantic pass@1 | 95% CI |
|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 72% | 64–78% |
| 2 | Claude Sonnet 4.6 | Anthropic | 68% | 60–75% |
| 3 | Claude Opus 4.6 | Anthropic | 62% | 54–70% |
| 4 | Qwen3 Coder 480B | Other | 37% | 30–45% |
| 5 | Gemini 2.5 Pro | Google | 36% | 29–44% |
| 6 | DeepSeek V4 Pro | Other | 35% | 28–43% |
| 7 | DeepSeek V4 Flash | Other | 30% | 23–38% |
| 8 | Qwen3 Coder Plus | Other | 29% | 22–37% |
| 9 | Gemini 2.5 Flash | Google | 28% | 22–36% |
| 10 | GPT OSS 120B | Other | 23% | 17–31% |
| 11 | GPT OSS 20B | Other | 18% | 13–25% |
| 12 | Gemma 4 12B | Other | 18% | 13–25% |
| 13 | Codestral | Other | 17% | 11–23% |
| 14 | Kimi K2.6 | Other | 17% | 12–24% |
| 15 | Mistral Large | Other | 13% | 8–19% |
| 16 | Llama 4 Scout | Other | 7% | 4–13% |
Version: 2026-05. Published: May 20, 2026. n: 151 examples. Suite: sha256:0e46ce03…
In v1.0, a semantic pass is the same as an execution pass: the generated code runs without crashing. Scores therefore do not measure circuit correctness. Intervals are Wilson 95% confidence intervals.
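As a rough illustration of how a Wilson 95% interval like those in the table can be computed, here is a minimal sketch using only the Python standard library. The function name `wilson_interval` is hypothetical, and the pass count of 108 out of 151 is an assumption chosen because it rounds to 72%; the source reports only rounded percentages, not raw counts.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(passes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for passes out of n trials."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = passes / n
    denom = 1 + z * z / n
    # The Wilson center is pulled from p toward 0.5; the half-width shrinks with n.
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical count: 108 of 151 is an assumption, not a figure from the leaderboard.
low, high = wilson_interval(108, 151)
print(f"{108 / 151:.0%} ({low:.0%} to {high:.0%})")
```

Under that assumed count, the interval rounds to 64–78%, which matches the top row. With only 151 examples the intervals span roughly 10 to 16 percentage points, so models whose intervals overlap (for example ranks 4 through 9) are not clearly separated by this benchmark.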