LLM Derivatives Benchmark

This page reports the results of benchmarking the models available on OpenRouter on their ability to compute derivatives. You can read the motivation for this benchmark here.

The benchmark, in its current form, is a useful but incomplete signal. We would ideally have kept testing derivatives until the accuracy of the top models dropped enough to separate them, but we were constrained by the budget we dedicated to this. If you wish to sponsor this project with a BYOK (bring your own key), please reach out at [email protected].

The key motivation for this project is to determine whether an LLM can, with pure symbolic manipulation, compute a derivative correctly. We believe that differentiation, at some level, measures an LLM's ability to strictly follow a set of rules to a definite, verifiable result.

Top Models by Provider

This table shows the best-performing model(s) from each provider. In this table and the one below, Total is the number of attempts (Correct + Incorrect + Errors), and the Accuracy column is a statistical estimate with an uncertainty term rather than the raw ratio Correct/Total, which is why a model with no failures can still score below 100%.
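
The exact estimator behind the Accuracy column is not documented on this page. The point values are, however, consistent with the posterior median of a Beta(Correct + 1, Failures + 1) distribution, i.e. a uniform prior on accuracy updated with the observed attempts. The sketch below reproduces the reported point estimates under that assumption; it shows a 95% credible interval in place of the page's ± term, whose exact definition we cannot confirm.

```python
# Sketch of an accuracy estimator consistent with the reported point values.
# Assumption: posterior median of Beta(correct + 1, failures + 1). The page's
# "+/-" term is undocumented; a 95% credible interval is shown instead.
from scipy.stats import beta

def accuracy_estimate(correct: int, total: int) -> tuple[float, float, float]:
    failures = total - correct          # incorrect answers + errors
    posterior = beta(correct + 1, failures + 1)
    return posterior.median(), posterior.ppf(0.025), posterior.ppf(0.975)

# 62/62 correct -> median ~ 0.989, matching the reported 98.9%
print(accuracy_estimate(62, 62))
# 0 correct out of 4 attempts -> median ~ 0.129, matching the reported 12.9%
print(accuracy_estimate(0, 4))
```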

Provider Top Model(s) Accuracy Correct Total Incorrect Errors
google Gemini-2.5-Flash 98.9 ± 1.1% 62 62 0 0
openai Gpt-5 98.9 ± 1.1% 61 61 0 0
qwen Qwen3-Max 98.9 ± 1.1% 61 61 0 0
x-ai Grok-4 98.9 ± 1.1% 61 61 0 0
anthropic Claude-3.7-Sonnet:thinking 98.9 ± 1.1% 60 60 0 0
switchpoint Router 98.9 ± 1.1% 60 60 0 0
z-ai Glm-4.5 98.9 ± 1.1% 60 60 0 0
meta-llama Llama-4-Maverick:free 95.6 ± 3.3% 58 60 1 1
nvidia Llama-3.3-Nemotron-Super-49b-V1.5 95.5 ± 3.4% 56 58 2 0
deepseek Deepseek-R1 95.0 ± 3.8% 50 52 1 1
meituan Longcat-Flash-Chat 93.8 ± 4.3% 55 58 3 0
inclusionai Ling-1t 93.7 ± 4.4% 54 57 2 1
openrouter Auto 93.1 ± 4.8% 49 52 3 0
perplexity Sonar-Reasoning 93.1 ± 4.8% 49 52 1 2
baidu Ernie-4.5-300b-A47b 92.1 ± 5.1% 54 58 4 0
mistralai Mistral-Medium-3.1 92.0 ± 5.1% 53 57 3 1
microsoft Phi-4-Reasoning-Plus 90.4 ± 5.7% 53 58 4 1
moonshotai Kimi-K2 88.8 ± 6.3% 52 58 5 1
alibaba Tongyi-Deepresearch-30b-A3b 88.6 ± 6.4% 51 57 6 0
arcee-ai Virtuoso-Large 88.6 ± 6.4% 51 57 6 0
stepfun-ai Step3 80.8 ± 12.1% 19 23 2 2
cohere Command-A 80.5 ± 10.2% 31 38 4 3
deepcogito Cogito-V2-Preview-Deepseek-671b 80.5 ± 10.2% 31 38 5 2
tencent Hunyuan-A13b-Instruct 78.5 ± 12.6% 20 25 5 0
minimax Minimax-M1 76.7 ± 13.5% 18 23 2 3
aion-labs Aion-1.0-Mini 71.9 ± 14.3% 19 26 6 1
nousresearch Hermes-3-Llama-3.1-405b 70.9 ± 14.8% 18 25 6 1
amazon Nova-Pro-V1 69.7 ± 15.3% 17 24 7 0
thedrummer Cydonia-24b-V4.1 68.5 ± 15.9% 16 23 6 1
inception Mercury-Coder 65.7 ± 17.1% 14 21 5 2
cognitivecomputations Dolphin-Mistral-24b-Venice-Edition:free 63.6 ± 19.1% 11 17 5 1
sao10k L3.3-Euryale-70b 59.2 ± 21.1% 9 15 3 3
ai21 Jamba-Large-1.7 56.5 ± 22.2% 8 14 5 1
anthracite-org Magnum-V4-72b 50.0 ± 24.9% 6 12 3 3
ibm-granite Granite-4.0-H-Micro 50.0 ± 24.9% 6 12 5 1
inflection Inflection-3-Productivity 32.1 ± 33.0% 2 7 5 0
bytedance Ui-Tars-1.5-7b 22.8 ± 35.0% 1 6 4 1
neversleep Noromaid-20b 22.8 ± 35.0% 1 6 4 1
eleutherai Llemma_7b 12.9 ± 39.2% 0 4 2 2
gryphe Mythomax-L2-13b 12.9 ± 39.2% 0 4 3 1
mancer Weaver 12.9 ± 39.2% 0 4 2 2
undi95 Remm-Slerp-L2-13b 12.9 ± 39.2% 0 4 3 1

Model Performance

Rank Model Accuracy Correct Total Incorrect Errors
1 Google/Gemini-2.5-Flash 98.9 ± 1.1% 62 62 0 0
2 Google/Gemini-2.5-Pro 98.9 ± 1.1% 61 61 0 0
2 Google/Gemini-2.5-Pro-Preview 98.9 ± 1.1% 61 61 0 0
2 Openai/Gpt-5 98.9 ± 1.1% 61 61 0 0
2 Qwen/Qwen3-Max 98.9 ± 1.1% 61 61 0 0
2 X-ai/Grok-4 98.9 ± 1.1% 61 61 0 0
3 Anthropic/Claude-3.7-Sonnet:thinking 98.9 ± 1.1% 60 60 0 0
3 Google/Gemini-2.5-Flash-Image 98.9 ± 1.1% 60 60 0 0
3 Openai/Gpt-5-Codex 98.9 ± 1.1% 60 60 0 0
3 Openai/Gpt-5-Image-Mini 98.9 ± 1.1% 60 60 0 0
3 Openai/O3-Mini 98.9 ± 1.1% 60 60 0 0
3 Switchpoint/Router 98.9 ± 1.1% 60 60 0 0
3 Z-ai/Glm-4.5 98.9 ± 1.1% 60 60 0 0
4 Openai/Gpt-5-Mini 97.3 ± 2.3% 60 61 1 0
4 Openai/O3 97.3 ± 2.3% 60 61 1 0
5 Openai/Gpt-5-Nano 97.3 ± 2.3% 59 60 1 0
5 Openai/O3-Mini-High 97.3 ± 2.3% 59 60 1 0
5 Openai/O4-Mini-Deep-Research 97.3 ± 2.3% 59 60 0 1
5 Qwen/Qwen3-Next-80b-A3b-Instruct 97.3 ± 2.3% 59 60 1 0
6 Google/Gemini-2.5-Pro-Preview-05-06 97.2 ± 2.4% 58 59 1 0
7 Google/Gemini-2.5-Flash-Image-Preview 97.2 ± 2.4% 57 58 1 0
7 Qwen/Qwen-Plus 97.2 ± 2.4% 57 58 1 0
7 Qwen/Qwen3-Vl-30b-A3b-Instruct 97.2 ± 2.4% 57 58 1 0
8 Qwen/Qwen3-30b-A3b-Thinking-2507 97.1 ± 2.5% 56 57 1 0
8 Qwen/Qwen3-Next-80b-A3b-Thinking 97.1 ± 2.5% 56 57 1 0
8 Qwen/Qwen3-Vl-235b-A22b-Instruct 97.1 ± 2.5% 56 57 1 0
9 Openai/O4-Mini 97.0 ± 2.5% 54 55 1 0
9 Openai/O4-Mini-High 97.0 ± 2.5% 54 55 1 0
10 Qwen/Qwen3-235b-A22b-Thinking-2507 96.9 ± 2.6% 52 53 1 0
11 Qwen/Qwq-32b 96.9 ± 2.7% 51 52 1 0
12 Anthropic/Claude-Sonnet-4 95.8 ± 3.2% 60 62 2 0
12 X-ai/Grok-3 95.8 ± 3.2% 60 62 2 0
13 Meta-llama/Llama-4-Maverick:free 95.6 ± 3.3% 58 60 1 1
14 Qwen/Qwen-Vl-Max 95.6 ± 3.4% 57 59 2 0
15 Nvidia/Llama-3.3-Nemotron-Super-49b-V1.5 95.5 ± 3.4% 56 58 2 0
15 Openai/Gpt-4.1-Mini 95.5 ± 3.4% 56 58 1 1
15 Qwen/Qwen-2.5-Coder-32b-Instruct 95.5 ± 3.4% 56 58 2 0
15 Qwen/Qwen-Max 95.5 ± 3.4% 56 58 1 1
15 Qwen/Qwen-Plus-2025-07-28:thinking 95.5 ± 3.4% 56 58 2 0
16 Google/Gemini-2.5-Flash-Lite-Preview-06-17 95.4 ± 3.5% 55 57 1 1
16 Nvidia/Nemotron-Nano-9b-V2 95.4 ± 3.5% 55 57 2 0
16 Qwen/Qwen-Plus-2025-07-28 95.4 ± 3.5% 55 57 2 0
16 Qwen/Qwen3-235b-A22b-2507 95.4 ± 3.5% 55 57 2 0
17 Deepseek/Deepseek-R1 95.0 ± 3.8% 50 52 1 1
17 Qwen/Qwen3-235b-A22b:free 95.0 ± 3.8% 50 52 1 1
18 Qwen/Qwen3-Vl-30b-A3b-Thinking 94.1 ± 4.5% 42 44 2 0
19 Deepseek/Deepseek-Chat 93.8 ± 4.3% 55 58 3 0
19 Google/Gemini-2.5-Flash-Preview-09-2025 93.8 ± 4.3% 55 58 3 0
19 Meituan/Longcat-Flash-Chat 93.8 ± 4.3% 55 58 3 0
19 Openai/Gpt-5-Chat 93.8 ± 4.3% 55 58 1 2
19 Openai/Gpt-Oss-20b 93.8 ± 4.3% 55 58 3 0
19 Qwen/Qwen3-14b 93.8 ± 4.3% 55 58 3 0
19 Qwen/Qwen3-30b-A3b-Instruct-2507 93.8 ± 4.3% 55 58 3 0
20 Inclusionai/Ling-1t 93.7 ± 4.4% 54 57 2 1
20 Openai/Gpt-Oss-120b 93.7 ± 4.4% 54 57 3 0
21 Openai/O1-Mini 93.6 ± 4.5% 53 56 3 0
21 Qwen/Qwen3-Coder-Plus 93.6 ± 4.5% 53 56 3 0
21 Z-ai/Glm-4.6 93.6 ± 4.5% 53 56 3 0
22 Deepseek/Deepseek-R1-0528 93.5 ± 4.9% 38 40 1 1
23 Anthropic/Claude-Sonnet-4.5 93.2 ± 4.7% 50 53 2 1
24 Openrouter/Auto 93.1 ± 4.8% 49 52 3 0
24 Perplexity/Sonar-Reasoning 93.1 ± 4.8% 49 52 1 2
24 Qwen/Qwen3-8b 93.1 ± 4.8% 49 52 3 0
25 Inclusionai/Ring-1t 92.6 ± 5.6% 33 35 2 0
26 Nvidia/Nemotron-Nano-9b-V2:free 92.5 ± 4.8% 57 61 3 1
27 Baidu/Ernie-4.5-300b-A47b 92.1 ± 5.1% 54 58 4 0
27 Google/Gemini-2.5-Flash-Lite-Preview-09-2025 92.1 ± 5.1% 54 58 3 1
27 Meta-llama/Llama-4-Maverick 92.1 ± 5.1% 54 58 2 2
28 Anthropic/Claude-3.5-Sonnet 92.0 ± 5.1% 53 57 3 1
28 Mistralai/Mistral-Medium-3.1 92.0 ± 5.1% 53 57 3 1
28 Qwen/Qwen3-Vl-8b-Instruct 92.0 ± 5.1% 53 57 4 0
28 X-ai/Grok-3-Mini 92.0 ± 5.1% 53 57 4 0
28 X-ai/Grok-4-Fast 92.0 ± 5.1% 53 57 4 0
29 Openai/Codex-Mini 91.6 ± 5.4% 50 54 3 1
29 X-ai/Grok-3-Beta 91.6 ± 5.4% 50 54 4 0
30 Anthropic/Claude-3.7-Sonnet 91.4 ± 5.5% 49 53 3 1
31 Openai/Gpt-Oss-20b:free 90.9 ± 5.5% 56 61 4 1
32 Deepseek/Deepseek-Prover-V2 90.4 ± 5.7% 53 58 4 1
32 Google/Gemini-2.0-Flash-001 90.4 ± 5.7% 53 58 5 0
32 Microsoft/Phi-4-Reasoning-Plus 90.4 ± 5.7% 53 58 4 1
33 Qwen/Qwen3-Vl-8b-Thinking 90.3 ± 6.2% 43 47 1 3
34 Deepseek/Deepseek-Chat-V3-0324 90.3 ± 5.8% 52 57 3 2
34 Google/Gemma-3-27b-It 90.3 ± 5.8% 52 57 5 0
35 Openai/O1-Mini-2024-09-12 90.1 ± 5.9% 51 56 4 1
36 Qwen/Qwen3-4b:free 89.1 ± 10.5% 5 5 0 0
37 Deepseek/Deepseek-V3.1-Terminus 88.8 ± 6.3% 52 58 5 1
37 Mistralai/Mistral-Medium-3 88.8 ± 6.3% 52 58 4 2
37 Moonshotai/Kimi-K2 88.8 ± 6.3% 52 58 5 1
38 Alibaba/Tongyi-Deepresearch-30b-A3b 88.6 ± 6.4% 51 57 6 0
38 Arcee-ai/Virtuoso-Large 88.6 ± 6.4% 51 57 6 0
38 X-ai/Grok-3-Mini-Beta 88.6 ± 6.4% 51 57 6 0
39 Qwen/Qwen3-Vl-235b-A22b-Thinking 86.3 ± 8.2% 35 40 2 3
40 Anthropic/Claude-3.5-Sonnet-20240620 85.9 ± 7.5% 46 53 7 0
40 Nvidia/Llama-3.1-Nemotron-Ultra-253b-V1 85.9 ± 7.5% 46 53 6 1
41 Deepseek/Deepseek-Chat-V3.1 85.6 ± 7.6% 45 52 6 1
42 Qwen/Qwen3-Coder 85.3 ± 7.8% 44 51 4 3
42 Z-ai/Glm-4.5-Air 85.3 ± 7.8% 44 51 4 3
43 Mistralai/Mistral-Large 85.1 ± 7.9% 43 50 7 0
44 Baidu/Ernie-4.5-21b-A3b 84.8 ± 8.1% 42 49 6 1
45 Mistralai/Mistral-Large-2411 84.5 ± 8.2% 41 48 4 3
45 Perplexity/Sonar-Reasoning-Pro 84.5 ± 8.2% 41 48 3 4
46 Meta-llama/Llama-3.3-70b-Instruct 84.1 ± 8.4% 40 47 5 2
47 Perplexity/Sonar-Deep-Research 84.0 ± 9.5% 29 34 2 3
48 Moonshotai/Kimi-Dev-72b 82.6 ± 9.7% 31 37 3 3
48 Z-ai/Glm-4.5v 82.6 ± 9.7% 31 37 6 0
49 Deepseek/Deepseek-V3.2-Exp 81.4 ± 9.7% 33 40 6 1
50 Qwen/Qwen2.5-Vl-32b-Instruct 81.1 ± 10.4% 28 34 4 2
51 Stepfun-ai/Step3 80.8 ± 12.1% 19 23 2 2
52 Deepseek/Deepseek-R1-Distill-Qwen-32b 80.7 ± 11.3% 23 28 4 1
53 Cohere/Command-A 80.5 ± 10.2% 31 38 4 3
53 Deepcogito/Cogito-V2-Preview-Deepseek-671b 80.5 ± 10.2% 31 38 5 2
53 Mistralai/Devstral-Medium 80.5 ± 10.2% 31 38 4 3
53 Openai/Gpt-4o-Mini-2024-07-18 80.5 ± 10.2% 31 38 5 2
53 Perplexity/Sonar-Pro 80.5 ± 10.2% 31 38 3 4
53 X-ai/Grok-Code-Fast-1 80.5 ± 10.2% 31 38 5 2
54 Baidu/Ernie-4.5-21b-A3b-Thinking 78.9 ± 11.0% 28 35 7 0
54 Google/Gemini-2.5-Flash-Lite 78.9 ± 11.0% 28 35 7 0
54 Meta-llama/Llama-3.3-70b-Instruct:free 78.9 ± 11.0% 28 35 3 4
54 Qwen/Qwen-Vl-Plus 78.9 ± 11.0% 28 35 6 1
54 Qwen/Qwen2.5-Vl-72b-Instruct 78.9 ± 11.0% 28 35 6 1
54 Qwen/Qwen3-30b-A3b 78.9 ± 11.0% 28 35 6 1
55 Tencent/Hunyuan-A13b-Instruct 78.5 ± 12.6% 20 25 5 0
56 Z-ai/Glm-4-32b 78.3 ± 11.3% 27 34 7 0
57 Qwen/Qwen3-235b-A22b 77.3 ± 12.4% 22 28 2 4
58 Minimax/Minimax-M1 76.7 ± 13.5% 18 23 2 3
59 Qwen/Qwen-2.5-72b-Instruct 76.3 ± 12.3% 24 31 6 1
60 Deepseek/Deepseek-R1-0528-Qwen3-8b 75.6 ± 13.3% 20 26 4 2
61 Baidu/Ernie-4.5-Vl-424b-A47b 74.7 ± 13.0% 22 29 5 2
61 Meta-llama/Llama-3.2-90b-Vision-Instruct 74.7 ± 13.0% 22 29 4 3
61 Mistralai/Mistral-Small-3.2-24b-Instruct 74.7 ± 13.0% 22 29 6 1
62 Mistralai/Magistral-Medium-2506:thinking 74.7 ± 13.8% 19 25 2 4
62 Qwen/Qwen3-32b 74.7 ± 13.8% 19 25 6 0
63 Microsoft/Phi-4-Multimodal-Instruct 73.9 ± 13.4% 21 28 7 0
64 Mistralai/Mistral-Saba 72.9 ± 13.8% 20 27 7 0
65 Aion-labs/Aion-1.0-Mini 71.9 ± 14.3% 19 26 6 1
65 Arcee-ai/Coder-Large 71.9 ± 14.3% 19 26 6 1
65 Deepcogito/Cogito-V2-Preview-Llama-70b 71.9 ± 14.3% 19 26 7 0
65 Deepseek/Deepseek-R1-Distill-Qwen-14b 71.9 ± 14.3% 19 26 5 2
65 Moonshotai/Kimi-K2-0905 71.9 ± 14.3% 19 26 5 2
65 Openai/Gpt-4.1-Nano 71.9 ± 14.3% 19 26 4 3
65 Openai/Gpt-4o 71.9 ± 14.3% 19 26 3 4
65 Openai/Gpt-4o-2024-11-20 71.9 ± 14.3% 19 26 4 3
65 Qwen/Qwen3-Coder-Flash 71.9 ± 14.3% 19 26 6 1
66 Google/Gemma-3-12b-It 70.9 ± 14.8% 18 25 7 0
66 Meta-llama/Llama-4-Scout 70.9 ± 14.8% 18 25 1 6
66 Mistralai/Mistral-Large-2407 70.9 ± 14.8% 18 25 6 1
66 Nousresearch/Hermes-3-Llama-3.1-405b 70.9 ± 14.8% 18 25 6 1
67 Aion-labs/Aion-1.0 70.7 ± 28.0% 1 1 0 0
67 Anthropic/Claude-3-Opus 70.7 ± 28.0% 1 1 0 0
67 Anthropic/Claude-Opus-4 70.7 ± 28.0% 1 1 0 0
67 Anthropic/Claude-Opus-4.1 70.7 ± 28.0% 1 1 0 0
67 Deepcogito/Cogito-V2-Preview-Llama-405b 70.7 ± 28.0% 1 1 0 0
67 Openai/Chatgpt-4o-Latest 70.7 ± 28.0% 1 1 0 0
67 Openai/Gpt-4 70.7 ± 28.0% 1 1 0 0
67 Openai/Gpt-4-Turbo 70.7 ± 28.0% 1 1 0 0
67 Openai/Gpt-4o-2024-05-13 70.7 ± 28.0% 1 1 0 0
67 Openai/Gpt-5-Image 70.7 ± 28.0% 1 1 0 0
67 Openai/Gpt-5-Pro 70.7 ± 28.0% 1 1 0 0
67 Openai/O1 70.7 ± 28.0% 1 1 0 0
67 Openai/O1-Pro 70.7 ± 28.0% 1 1 0 0
67 Openai/O3-Deep-Research 70.7 ± 28.0% 1 1 0 0
68 Baidu/Ernie-4.5-Vl-28b-A3b 70.1 ± 16.0% 15 21 4 2
69 Amazon/Nova-Pro-V1 69.7 ± 15.3% 17 24 7 0
69 Google/Gemini-2.0-Flash-Lite-001 69.7 ± 15.3% 17 24 7 0
69 Google/Gemma-3-4b-It 69.7 ± 15.3% 17 24 5 2
69 Mistralai/Devstral-Small-2505 69.7 ± 15.3% 17 24 4 3
69 Nousresearch/Hermes-4-405b 69.7 ± 15.3% 17 24 7 0
69 Openai/Gpt-4o-Search-Preview 69.7 ± 15.3% 17 24 6 1
69 Qwen/Qwen-2.5-7b-Instruct 69.7 ± 15.3% 17 24 5 2
69 Qwen/Qwen-Turbo 69.7 ± 15.3% 17 24 5 2
69 Qwen/Qwen3-Coder-30b-A3b-Instruct 69.7 ± 15.3% 17 24 6 1
70 Anthropic/Claude-3.5-Haiku 68.5 ± 15.9% 16 23 7 0
70 Deepcogito/Cogito-V2-Preview-Llama-109b-Moe 68.5 ± 15.9% 16 23 7 0
70 Mistralai/Devstral-Small 68.5 ± 15.9% 16 23 6 1
70 Mistralai/Pixtral-Large-2411 68.5 ± 15.9% 16 23 7 0
70 Perplexity/Sonar 68.5 ± 15.9% 16 23 1 6
70 Thedrummer/Cydonia-24b-V4.1 68.5 ± 15.9% 16 23 6 1
71 Inception/Mercury-Coder 65.7 ± 17.1% 14 21 5 2
71 Thedrummer/Anubis-70b-V1.1 65.7 ± 17.1% 14 21 5 2
72 Microsoft/Phi-4 65.5 ± 18.2% 12 18 5 1
72 Mistralai/Codestral-2508 65.5 ± 18.2% 12 18 6 0
72 Nousresearch/Hermes-3-Llama-3.1-405b:free 65.5 ± 18.2% 12 18 6 0
72 Nvidia/Llama-3.1-Nemotron-70b-Instruct 65.5 ± 18.2% 12 18 2 4
73 Google/Gemma-2-9b-It 64.1 ± 17.8% 13 20 7 0
73 Openai/Gpt-4o-Mini 64.1 ± 17.8% 13 20 6 1
74 Amazon/Nova-Lite-V1 63.6 ± 19.1% 11 17 4 2
74 Cognitivecomputations/Dolphin-Mistral-24b-Venice-Edition:free 63.6 ± 19.1% 11 17 5 1
74 Meta-llama/Llama-3-70b-Instruct 63.6 ± 19.1% 11 17 5 1
74 Nousresearch/Hermes-3-Llama-3.1-70b 63.6 ± 19.1% 11 17 5 1
75 Mistralai/Mistral-Small-3.1-24b-Instruct:free 61.5 ± 20.0% 10 16 6 0
76 Google/Gemma-3n-E4b-It 60.3 ± 19.4% 11 18 6 1
76 Inception/Mercury 60.3 ± 19.4% 11 18 7 0
76 Meta-llama/Llama-4-Scout:free 60.3 ± 19.4% 11 18 3 4
76 Mistralai/Codestral-2501 60.3 ± 19.4% 11 18 6 1
76 Mistralai/Mistral-7b-Instruct-V0.3 60.3 ± 19.4% 11 18 7 0
76 Mistralai/Mistral-Small-3.1-24b-Instruct 60.3 ± 19.4% 11 18 4 3
76 Openai/Gpt-4o-2024-08-06 60.3 ± 19.4% 11 18 6 1
77 Amazon/Nova-Micro-V1 59.2 ± 21.1% 9 15 3 3
77 Anthropic/Claude-Haiku-4.5 59.2 ± 21.1% 9 15 1 5
77 Mistralai/Magistral-Small-2506 59.2 ± 21.1% 9 15 0 6
77 Sao10k/L3.3-Euryale-70b 59.2 ± 21.1% 9 15 3 3
78 Mistralai/Mistral-7b-Instruct:free 58.2 ± 20.3% 10 17 7 0
79 Ai21/Jamba-Large-1.7 56.5 ± 22.2% 8 14 5 1
79 Meta-llama/Llama-3.1-405b-Instruct 56.5 ± 22.2% 8 14 3 3
79 Meta-llama/Llama-3.1-70b-Instruct 56.5 ± 22.2% 8 14 4 2
79 Thedrummer/Skyfall-36b-V2 56.5 ± 22.2% 8 14 5 1
80 Anthropic/Claude-3-Haiku 53.5 ± 23.5% 7 13 6 0
80 Google/Gemma-2-27b-It 53.5 ± 23.5% 7 13 5 1
80 Mistralai/Magistral-Medium-2506 53.5 ± 23.5% 7 13 3 3
80 Mistralai/Mistral-Small 53.5 ± 23.5% 7 13 5 1
80 Mistralai/Pixtral-12b 53.5 ± 23.5% 7 13 4 2
80 Openai/Gpt-4o-Mini-Search-Preview 53.5 ± 23.5% 7 13 6 0
81 Anthracite-org/Magnum-V4-72b 50.0 ± 24.9% 6 12 3 3
81 Ibm-granite/Granite-4.0-H-Micro 50.0 ± 24.9% 6 12 5 1
81 Mistralai/Mistral-7b-Instruct 50.0 ± 24.9% 6 12 6 0
81 Sao10k/L3.1-70b-Hanami-X1 50.0 ± 24.9% 6 12 5 1
81 Sao10k/L3.1-Euryale-70b 50.0 ± 24.9% 6 12 6 0
82 Microsoft/Wizardlm-2-8x22b 46.0 ± 26.4% 5 11 2 4
82 Mistralai/Ministral-8b 46.0 ± 26.4% 5 11 3 3
82 Openai/Gpt-3.5-Turbo 46.0 ± 26.4% 5 11 6 0
82 Sao10k/L3-Euryale-70b 46.0 ± 26.4% 5 11 5 1
83 Meta-llama/Llama-3.2-3b-Instruct 41.2 ± 28.0% 4 10 5 1
84 Meta-llama/Llama-3.3-8b-Instruct:free 39.3 ± 30.8% 3 8 1 4
84 Mistralai/Ministral-3b 39.3 ± 30.8% 3 8 4 1
84 Mistralai/Mistral-Small-24b-Instruct-2501 39.3 ± 30.8% 3 8 3 2
85 Cohere/Command-R-08-2024 36.4 ± 34.5% 2 6 3 1
86 Cohere/Command-R-Plus-08-2024 35.5 ± 29.7% 3 9 5 1
86 Mistralai/Mistral-Tiny 35.5 ± 29.7% 3 9 3 3
87 Inflection/Inflection-3-Productivity 32.1 ± 33.0% 2 7 5 0
87 Meta-llama/Llama-3-8b-Instruct 32.1 ± 33.0% 2 7 4 1
87 Mistralai/Mistral-Nemo 32.1 ± 33.0% 2 7 3 2
87 Mistralai/Mixtral-8x22b-Instruct 32.1 ± 33.0% 2 7 2 3
87 Nousresearch/Hermes-4-70b 32.1 ± 33.0% 2 7 4 1
87 Openai/Gpt-3.5-Turbo-16k 32.1 ± 33.0% 2 7 5 0
87 Qwen/Qwen-2.5-Vl-7b-Instruct 32.1 ± 33.0% 2 7 3 2
88 Meta-llama/Llama-3.1-405b 29.3 ± 54.9% 0 1 1 0
88 Openai/Gpt-4-1106-Preview 29.3 ± 54.9% 0 1 1 0
88 Openai/Gpt-4-Turbo-Preview 29.3 ± 54.9% 0 1 1 0
88 Openai/Gpt-4o:extended 29.3 ± 54.9% 0 1 1 0
89 Arcee-ai/Afm-4.5b 26.4 ± 37.7% 1 5 2 2
89 Meta-llama/Llama-3.1-8b-Instruct 26.4 ± 37.7% 1 5 3 1
90 Ai21/Jamba-Mini-1.7 22.8 ± 35.0% 1 6 3 2
90 Aion-labs/Aion-Rp-Llama-3.1-8b 22.8 ± 35.0% 1 6 3 2
90 Bytedance/Ui-Tars-1.5-7b 22.8 ± 35.0% 1 6 4 1
90 Inflection/Inflection-3-Pi 22.8 ± 35.0% 1 6 3 2
90 Microsoft/Phi-3-Medium-128k-Instruct 22.8 ± 35.0% 1 6 2 3
90 Neversleep/Noromaid-20b 22.8 ± 35.0% 1 6 4 1
90 Openai/Gpt-3.5-Turbo-0613 22.8 ± 35.0% 1 6 5 0
90 Openai/Gpt-3.5-Turbo-Instruct 22.8 ± 35.0% 1 6 5 0
90 Qwen/Qwen2.5-Coder-7b-Instruct 22.8 ± 35.0% 1 6 3 2
90 Sao10k/L3-Lunaris-8b 22.8 ± 35.0% 1 6 5 0
90 Thedrummer/Rocinante-12b 22.8 ± 35.0% 1 6 4 1
90 Thedrummer/Unslopnemo-12b 22.8 ± 35.0% 1 6 4 1
91 Cohere/Command-R7b-12-2024 12.9 ± 39.2% 0 4 2 2
91 Eleutherai/Llemma_7b 12.9 ± 39.2% 0 4 2 2
91 Gryphe/Mythomax-L2-13b 12.9 ± 39.2% 0 4 3 1
91 Mancer/Weaver 12.9 ± 39.2% 0 4 2 2
91 Meta-llama/Llama-3.2-11b-Vision-Instruct 12.9 ± 39.2% 0 4 1 3
91 Meta-llama/Llama-3.2-3b-Instruct:free 12.9 ± 39.2% 0 4 3 1
91 Microsoft/Phi-3-Mini-128k-Instruct 12.9 ± 39.2% 0 4 2 2
91 Microsoft/Phi-3.5-Mini-128k-Instruct 12.9 ± 39.2% 0 4 1 3
91 Mistralai/Mistral-7b-Instruct-V0.1 12.9 ± 39.2% 0 4 2 2
91 Mistralai/Mixtral-8x7b-Instruct 12.9 ± 39.2% 0 4 3 1
91 Neversleep/Llama-3.1-Lumimaid-8b 12.9 ± 39.2% 0 4 4 0
91 Nousresearch/Hermes-2-Pro-Llama-3-8b 12.9 ± 39.2% 0 4 3 1
91 Undi95/Remm-Slerp-L2-13b 12.9 ± 39.2% 0 4 3 1

Verification Process

We curated a set of 273 expressions. Approximately a third were selected from a popular calculus textbook, on the assumption that they carry some pedagogic value; the rest were randomly generated with varying degrees of complexity, as sketched below. We assume the randomly generated set is new to the LLMs and never formed part of their training data.
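
The generation procedure is not described in more detail; a minimal sketch of one way to produce random expressions of varying complexity, assuming recursive composition of a few operators and primitives with sympy, might look like this.

```python
# Hypothetical sketch of a random expression generator; the benchmark's
# actual generation procedure is not documented here.
import random
import sympy as sp

x = sp.Symbol("x")
LEAVES = [x, *map(sp.Integer, range(1, 6))]
UNARY = [sp.sin, sp.cos, sp.exp, sp.log, sp.sqrt]
BINARY = [sp.Add, sp.Mul, lambda a, b: a / b, lambda a, b: a ** b]

def random_expr(depth: int) -> sp.Expr:
    """Recursively compose operators; depth bounds the complexity."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(LEAVES)
    if random.random() < 0.5:
        return random.choice(UNARY)(random_expr(depth - 1))
    op = random.choice(BINARY)
    return op(random_expr(depth - 1), random_expr(depth - 1))

print(random_expr(depth=3))  # e.g. sin(x)*exp(x) + 2
```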

Each model was asked to differentiate an expression with respect to a given variable and to return both its reasoning and its final answer.
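
The exact prompt is not published; a minimal sketch of such a query against OpenRouter's OpenAI-compatible chat completions endpoint (the prompt wording and answer-marker convention here are our assumptions) could look like this.

```python
# Hypothetical query; the benchmark's actual prompt is not published.
# OpenRouter exposes an OpenAI-compatible chat completions API.
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "openai/gpt-5",
        "messages": [{
            "role": "user",
            "content": (
                "Differentiate f(x) = x^2 sin(x) with respect to x. "
                "Show your reasoning, then give the final answer as LaTeX "
                "on a line starting with 'ANSWER:'."
            ),
        }],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```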

We then parsed the answer from LaTeX into Python and numerically evaluated the difference between the supplied answer and the actual derivative. If the two expressions differed by less than 1e-9 at each of ten sample points, the result was marked as correct.
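
A minimal sketch of that check, assuming sympy's LaTeX parser and real-valued sample points (the benchmark's actual sampling domain and handling of parse or domain errors are not specified):

```python
# Sketch of the numerical verification step; sample-point selection and
# error handling are assumptions, not the benchmark's exact code.
import random
import sympy as sp
from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

def is_correct(expr: sp.Expr, answer_latex: str, var: sp.Symbol,
               n_samples: int = 10, tol: float = 1e-9) -> bool:
    reference = sp.diff(expr, var)          # ground-truth derivative
    candidate = parse_latex(answer_latex)   # model's answer, LaTeX -> sympy
    f = sp.lambdify(var, reference - candidate, "mpmath")
    for _ in range(n_samples):
        point = random.uniform(0.1, 2.0)    # assumed sampling domain
        if abs(f(point)) >= tol:
            return False
    return True

x = sp.Symbol("x")
print(is_correct(x**2 * sp.sin(x), r"2 x \sin(x) + x^{2} \cos(x)", x))  # True
```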