LLM Benchmarks

Pass@1 rates on the Qiskit HumanEval benchmark (151 examples, zero-shot, greedy decoding). Maintained by Marqov.
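With a single greedy sample per example, pass@1 reduces to the fraction of examples whose one completion passes. A minimal sketch of that arithmetic follows; the function name `pass_at_1` and the list-of-booleans input are illustrative assumptions, not part of the benchmark's published harness.

```python
from typing import Sequence


def pass_at_1(passed: Sequence[bool]) -> float:
    """Fraction of examples whose single (greedy) completion passed.

    `passed` holds one boolean per benchmark example. This is a
    hypothetical helper for illustration, not the leaderboard's code.
    """
    if not passed:
        raise ValueError("need at least one example")
    return sum(passed) / len(passed)


# Illustrative input: 111 passes out of 151 examples, as in the
# Execution column for the top-ranked row of the table.
results = [True] * 111 + [False] * 40
rate = pass_at_1(results)
print(f"{round(100 * rate)}% {sum(results)}/{len(results)}")  # prints "74% 111/151"
```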

#	Model	Provider	Syntax	Execution	Semantic (95% CI)
1	Claude Opus 4.7	Anthropic	100% (151/151)	74% (111/151)	72% (64–78%)
2	Claude Sonnet 4.6	Anthropic	100% (151/151)	72% (108/151)	68% (60–75%)
3	Claude Opus 4.6	Anthropic	100% (151/151)	64% (96/151)	62% (54–70%)
4	Qwen3 Coder 480B	Other	100% (151/151)	38% (57/151)	37% (30–45%)
5	Gemini 2.5 Pro	Google	84% (127/151)	37% (56/151)	36% (29–44%)
6	DeepSeek V4 Pro	Other	95% (143/151)	39% (59/151)	35% (28–43%)
7	DeepSeek V4 Flash	Other	95% (144/151)	32% (48/151)	30% (23–38%)
8	Qwen3 Coder Plus	Other	100% (151/151)	29% (44/151)	29% (22–37%)
9	Gemini 2.5 Flash	Google	93% (141/151)	28% (43/151)	28% (22–36%)
10	GPT OSS 120B	Other	99% (149/151)	24% (36/151)	23% (17–31%)
11	GPT OSS 20B	Other	81% (122/151)	18% (27/151)	18% (13–25%)
12	Gemma 4 12B	Other	81% (123/151)	19% (28/151)	18% (13–25%)
13	Codestral	Other	100% (151/151)	17% (25/151)	17% (11–23%)
14	Kimi K2.6	Other	98% (148/151)	17% (26/151)	17% (12–24%)
15	Mistral Large	Other	51% (77/151)	13% (19/151)	13% (8–19%)
16	Llama 4 Scout	Other	100% (151/151)	8% (12/151)	7% (4–13%)

Version: 2026-05 | Published: May 20, 2026 | n: 151 examples | Suite: sha256:0e46ce03…

In v1.0, a semantic pass is defined the same way as an execution pass: the generated code runs without crashing. Scores therefore do not measure circuit correctness. Intervals shown are Wilson 95% confidence intervals.
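For readers who want to reproduce intervals of this kind, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. The function name `wilson_interval` is an illustrative assumption, and the leaderboard's exact rounding is not stated in the source. The usage line feeds in 12/151, the Execution count for the last row; the published intervals are attached to the Semantic column, whose raw counts are not shown, so the output need not match any table cell.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; not the leaderboard's code.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(12, 151)
print(f"{low:.1%} to {high:.1%}")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for small pass counts, which matters for the lower rows of the table.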