44 OpenAI models evaluated
| Rank | Model | Accuracy | Correct | Total | Incorrect | Errors |
|---|---|---|---|---|---|---|
| 1 | Openai/Gpt-5 | 98.9 ± 1.1% | 61 | 61 | 0 | 0 |
| 2 | Openai/Gpt-5-Codex | 98.9 ± 1.1% | 60 | 60 | 0 | 0 |
| 2 | Openai/Gpt-5-Image-Mini | 98.9 ± 1.1% | 60 | 60 | 0 | 0 |
| 2 | Openai/O3-Mini | 98.9 ± 1.1% | 60 | 60 | 0 | 0 |
| 3 | Openai/Gpt-5-Mini | 97.3 ± 2.3% | 60 | 61 | 1 | 0 |
| 3 | Openai/O3 | 97.3 ± 2.3% | 60 | 61 | 1 | 0 |
| 4 | Openai/Gpt-5-Nano | 97.3 ± 2.3% | 59 | 60 | 1 | 0 |
| 4 | Openai/O3-Mini-High | 97.3 ± 2.3% | 59 | 60 | 1 | 0 |
| 4 | Openai/O4-Mini-Deep-Research | 97.3 ± 2.3% | 59 | 60 | 0 | 1 |
| 5 | Openai/O4-Mini | 97.0 ± 2.5% | 54 | 55 | 1 | 0 |
| 5 | Openai/O4-Mini-High | 97.0 ± 2.5% | 54 | 55 | 1 | 0 |
| 6 | Openai/Gpt-4.1-Mini | 95.5 ± 3.4% | 56 | 58 | 1 | 1 |
| 7 | Openai/Gpt-5-Chat | 93.8 ± 4.3% | 55 | 58 | 1 | 2 |
| 7 | Openai/Gpt-Oss-20b | 93.8 ± 4.3% | 55 | 58 | 3 | 0 |
| 8 | Openai/Gpt-Oss-120b | 93.7 ± 4.4% | 54 | 57 | 3 | 0 |
| 9 | Openai/O1-Mini | 93.6 ± 4.5% | 53 | 56 | 3 | 0 |
| 10 | Openai/Codex-Mini | 91.6 ± 5.4% | 50 | 54 | 3 | 1 |
| 11 | Openai/Gpt-Oss-20b:free | 90.9 ± 5.5% | 56 | 61 | 4 | 1 |
| 12 | Openai/O1-Mini-2024-09-12 | 90.1 ± 5.9% | 51 | 56 | 4 | 1 |
| 13 | Openai/Gpt-4o-Mini-2024-07-18 | 80.5 ± 10.2% | 31 | 38 | 5 | 2 |
| 14 | Openai/Gpt-4.1-Nano | 71.9 ± 14.3% | 19 | 26 | 4 | 3 |
| 14 | Openai/Gpt-4o | 71.9 ± 14.3% | 19 | 26 | 3 | 4 |
| 14 | Openai/Gpt-4o-2024-11-20 | 71.9 ± 14.3% | 19 | 26 | 4 | 3 |
| 15 | Openai/Chatgpt-4o-Latest | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 15 | Openai/Gpt-4 | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 15 | Openai/Gpt-4-Turbo | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 15 | Openai/Gpt-4o-2024-05-13 | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 15 | Openai/Gpt-5-Image | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 15 | Openai/Gpt-5-Pro | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 15 | Openai/O1 | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 15 | Openai/O1-Pro | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 15 | Openai/O3-Deep-Research | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 16 | Openai/Gpt-4o-Search-Preview | 69.7 ± 15.3% | 17 | 24 | 6 | 1 |
| 17 | Openai/Gpt-4o-Mini | 64.1 ± 17.8% | 13 | 20 | 6 | 1 |
| 18 | Openai/Gpt-4o-2024-08-06 | 60.3 ± 19.4% | 11 | 18 | 6 | 1 |
| 19 | Openai/Gpt-4o-Mini-Search-Preview | 53.5 ± 23.5% | 7 | 13 | 6 | 0 |
| 20 | Openai/Gpt-3.5-Turbo | 46.0 ± 26.4% | 5 | 11 | 6 | 0 |
| 21 | Openai/Gpt-3.5-Turbo-16k | 32.1 ± 33.0% | 2 | 7 | 5 | 0 |
| 22 | Openai/Gpt-4-0314 | 29.3 ± 54.9% | 0 | 1 | 0 | 1 |
| 22 | Openai/Gpt-4-1106-Preview | 29.3 ± 54.9% | 0 | 1 | 1 | 0 |
| 22 | Openai/Gpt-4-Turbo-Preview | 29.3 ± 54.9% | 0 | 1 | 1 | 0 |
| 22 | Openai/Gpt-4o:extended | 29.3 ± 54.9% | 0 | 1 | 1 | 0 |
| 23 | Openai/Gpt-3.5-Turbo-0613 | 22.8 ± 35.0% | 1 | 6 | 5 | 0 |
| 23 | Openai/Gpt-3.5-Turbo-Instruct | 22.8 ± 35.0% | 1 | 6 | 5 | 0 |
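
The page does not say how the Accuracy column is computed, and it is clearly not the raw ratio Correct / Total (a model with 1 of 1 correct shows 70.7%, not 100%). The figures do appear consistent with a Bayesian estimate under a uniform prior: the median of a Beta(Correct + 1, Total − Correct + 1) posterior, with the ± value being the distance from that median up to the 97.5% quantile. This is an inference from the numbers, not a documented method. The sketch below reproduces a few rows under that assumption; the function names `beta_cdf`, `beta_quantile`, and `summarize` are hypothetical.

```python
from math import comb


def beta_cdf(x, a, b):
    """CDF of Beta(a, b) at x, for positive integer a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_quantile(p, a, b, tol=1e-10):
    """Invert beta_cdf by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def summarize(correct, total):
    """Return (accuracy %, plus-minus %) under the ASSUMED scheme:
    posterior median and (97.5% quantile - median), uniform prior.
    Anything that is not correct (incorrect or error) counts as a failure.
    """
    a, b = correct + 1, total - correct + 1
    median = beta_quantile(0.5, a, b)
    upper = beta_quantile(0.975, a, b)
    return round(100 * median, 1), round(100 * (upper - median), 1)


print(summarize(61, 61))  # (98.9, 1.1), matches the rank 1 row
print(summarize(1, 1))    # (70.7, 28.0), matches the rank 15 rows
print(summarize(0, 1))    # (29.3, 54.9), matches the rank 22 rows
```

Under this reading, rows with a Total of 1 say very little about the model: the estimate is pulled strongly toward 50% by the prior, which is why nine models with a single correct answer sit at 70.7% below models that answered 19 of 26 correctly.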