LLM Derivatives Benchmark

This page reports the results of benchmarking the models available on OpenRouter on their ability to compute derivatives. You can read the motivation for this benchmark here.

The benchmark, in its current form, is a useful but incomplete signal. We would ideally have kept testing derivatives until the accuracy of the top models dropped enough to separate them, but we were constrained by the budget we dedicated to this. If you wish to sponsor this project with a BYOK (bring your own key), please reach out at [email protected].

The key motivation for this project is to determine whether an LLM can, with pure symbolic manipulation, compute a derivative correctly. We believe that differentiation, at some level, measures an LLM's ability to strictly follow a set of rules to a definite, verifiable result.

Top Models by Provider

This table shows the best-performing model(s) from each provider. In this table and the one below, Total is the number of attempts (Correct + Incorrect + Errors), and the Accuracy column is a statistical estimate with an uncertainty term rather than the raw ratio Correct/Total, which is why a model with no failures can still score below 100%.
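
The exact estimator behind the Accuracy column is not documented on this page. The point values are, however, consistent with the posterior median of a Beta(Correct + 1, Failures + 1) distribution, i.e. a uniform prior on accuracy updated with the observed attempts. The sketch below reproduces the reported point estimates under that assumption; it shows a 95% credible interval in place of the page's ± term, whose exact definition we cannot confirm.

```python
# Sketch of an accuracy estimator consistent with the reported point values.
# Assumption: posterior median of Beta(correct + 1, failures + 1). The page's
# "+/-" term is undocumented; a 95% credible interval is shown instead.
from scipy.stats import beta

def accuracy_estimate(correct: int, total: int) -> tuple[float, float, float]:
    failures = total - correct          # incorrect answers + errors
    posterior = beta(correct + 1, failures + 1)
    return posterior.median(), posterior.ppf(0.025), posterior.ppf(0.975)

# 62/62 correct -> median ~ 0.989, matching the reported 98.9%
print(accuracy_estimate(62, 62))
# 0 correct out of 4 attempts -> median ~ 0.129, matching the reported 12.9%
print(accuracy_estimate(0, 4))
```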

Provider Top Model(s) Accuracy Correct Total Incorrect Errors
google Gemini-2.5-Flash 98.9 ± 1.1% 62 62 0 0
openai Gpt-5 98.9 ± 1.1% 61 61 0 0
qwen Qwen3-Max 98.9 ± 1.1% 61 61 0 0
x-ai Grok-4 98.9 ± 1.1% 61 61 0 0
anthropic Claude-3.7-Sonnet:thinking 98.9 ± 1.1% 60 60 0 0
switchpoint Router 98.9 ± 1.1% 60 60 0 0
z-ai Glm-4.5 98.9 ± 1.1% 60 60 0 0
meta-llama Llama-4-Maverick:free 95.6 ± 3.3% 58 60 1 1
nvidia Llama-3.3-Nemotron-Super-49b-V1.5 95.5 ± 3.4% 56 58 2 0
deepseek Deepseek-R1 95.0 ± 3.8% 50 52 1 1
meituan Longcat-Flash-Chat 93.8 ± 4.3% 55 58 3 0
inclusionai Ling-1t 93.7 ± 4.4% 54 57 2 1
openrouter Auto 93.1 ± 4.8% 49 52 3 0
perplexity Sonar-Reasoning 93.1 ± 4.8% 49 52 1 2
baidu Ernie-4.5-300b-A47b 92.1 ± 5.1% 54 58 4 0
mistralai Mistral-Medium-3.1 92.0 ± 5.1% 53 57 3 1
microsoft Phi-4-Reasoning-Plus 90.4 ± 5.7% 53 58 4 1
moonshotai Kimi-K2 88.8 ± 6.3% 52 58 5 1
alibaba Tongyi-Deepresearch-30b-A3b 88.6 ± 6.4% 51 57 6 0
arcee-ai Virtuoso-Large 88.6 ± 6.4% 51 57 6 0
stepfun-ai Step3 80.8 ± 12.1% 19 23 2 2
cohere Command-A 80.5 ± 10.2% 31 38 4 3
deepcogito Cogito-V2-Preview-Deepseek-671b 80.5 ± 10.2% 31 38 5 2
tencent Hunyuan-A13b-Instruct 78.5 ± 12.6% 20 25 5 0
minimax Minimax-M1 76.7 ± 13.5% 18 23 2 3
aion-labs Aion-1.0-Mini 71.9 ± 14.3% 19 26 6 1
nousresearch Hermes-3-Llama-3.1-405b 70.9 ± 14.8% 18 25 6 1
amazon Nova-Pro-V1 69.7 ± 15.3% 17 24 7 0
thedrummer Cydonia-24b-V4.1 68.5 ± 15.9% 16 23 6 1
inception Mercury-Coder 65.7 ± 17.1% 14 21 5 2
cognitivecomputations Dolphin-Mistral-24b-Venice-Edition:free 63.6 ± 19.1% 11 17 5 1
sao10k L3.3-Euryale-70b 59.2 ± 21.1% 9 15 3 3
ai21 Jamba-Large-1.7 56.5 ± 22.2% 8 14 5 1
anthracite-org Magnum-V4-72b 50.0 ± 24.9% 6 12 3 3
ibm-granite Granite-4.0-H-Micro 50.0 ± 24.9% 6 12 5 1
inflection Inflection-3-Productivity 32.1 ± 33.0% 2 7 5 0
bytedance Ui-Tars-1.5-7b 22.8 ± 35.0% 1 6 4 1
neversleep Noromaid-20b 22.8 ± 35.0% 1 6 4 1
eleutherai Llemma_7b 12.9 ± 39.2% 0 4 2 2
gryphe Mythomax-L2-13b 12.9 ± 39.2% 0 4 3 1
mancer Weaver 12.9 ± 39.2% 0 4 2 2
undi95 Remm-Slerp-L2-13b 12.9 ± 39.2% 0 4 3 1

Model Performance

Rank Model Accuracy Correct Total Incorrect Errors
1 Google/Gemini-2.5-Flash 98.9 ± 1.1% 62 62 0 0
2 Google/Gemini-2.5-Pro 98.9 ± 1.1% 61 61 0 0
2 Google/Gemini-2.5-Pro-Preview 98.9 ± 1.1% 61 61 0 0
2 Openai/Gpt-5 98.9 ± 1.1% 61 61 0 0
2 Qwen/Qwen3-Max 98.9 ± 1.1% 61 61 0 0
2 X-ai/Grok-4 98.9 ± 1.1% 61 61 0 0
3 Anthropic/Claude-3.7-Sonnet:thinking 98.9 ± 1.1% 60 60 0 0
3 Google/Gemini-2.5-Flash-Image 98.9 ± 1.1% 60 60 0 0
3 Openai/Gpt-5-Codex 98.9 ± 1.1% 60 60 0 0
3 Openai/Gpt-5-Image-Mini 98.9 ± 1.1% 60 60 0 0
3 Openai/O3-Mini 98.9 ± 1.1% 60 60 0 0
3 Switchpoint/Router 98.9 ± 1.1% 60 60 0 0
3 Z-ai/Glm-4.5 98.9 ± 1.1% 60 60 0 0
4 Openai/Gpt-5-Mini 97.3 ± 2.3% 60 61 1 0
4 Openai/O3 97.3 ± 2.3% 60 61 1 0
5 Openai/Gpt-5-Nano 97.3 ± 2.3% 59 60 1 0
5 Openai/O3-Mini-High 97.3 ± 2.3% 59 60 1 0
5 Openai/O4-Mini-Deep-Research 97.3 ± 2.3% 59 60 0 1
5 Qwen/Qwen3-Next-80b-A3b-Instruct 97.3 ± 2.3% 59 60 1 0
6 Google/Gemini-2.5-Pro-Preview-05-06 97.2 ± 2.4% 58 59 1 0
7 Google/Gemini-2.5-Flash-Image-Preview 97.2 ± 2.4% 57 58 1 0
7 Qwen/Qwen-Plus 97.2 ± 2.4% 57 58 1 0
7 Qwen/Qwen3-Vl-30b-A3b-Instruct 97.2 ± 2.4% 57 58 1 0
8 Qwen/Qwen3-30b-A3b-Thinking-2507 97.1 ± 2.5% 56 57 1 0
8 Qwen/Qwen3-Next-80b-A3b-Thinking 97.1 ± 2.5% 56 57 1 0
8 Qwen/Qwen3-Vl-235b-A22b-Instruct 97.1 ± 2.5% 56 57 1 0
9 Openai/O4-Mini 97.0 ± 2.5% 54 55 1 0
9 Openai/O4-Mini-High 97.0 ± 2.5% 54 55 1 0
10 Qwen/Qwen3-235b-A22b-Thinking-2507 96.9 ± 2.6% 52 53 1 0
11 Qwen/Qwq-32b 96.9 ± 2.7% 51 52 1 0
12 Anthropic/Claude-Sonnet-4 95.8 ± 3.2% 60 62 2 0
12 X-ai/Grok-3 95.8 ± 3.2% 60 62 2 0
13 Meta-llama/Llama-4-Maverick:free 95.6 ± 3.3% 58 60 1 1
14 Qwen/Qwen-Vl-Max 95.6 ± 3.4% 57 59 2 0
15 Nvidia/Llama-3.3-Nemotron-Super-49b-V1.5 95.5 ± 3.4% 56 58 2 0
15 Openai/Gpt-4.1-Mini 95.5 ± 3.4% 56 58 1 1
15 Qwen/Qwen-2.5-Coder-32b-Instruct 95.5 ± 3.4% 56 58 2 0
15 Qwen/Qwen-Max 95.5 ± 3.4% 56 58 1 1
15 Qwen/Qwen-Plus-2025-07-28:thinking 95.5 ± 3.4% 56 58 2 0
16 Google/Gemini-2.5-Flash-Lite-Preview-06-17 95.4 ± 3.5% 55 57 1 1
16 Nvidia/Nemotron-Nano-9b-V2 95.4 ± 3.5% 55 57 2 0
16 Qwen/Qwen-Plus-2025-07-28 95.4 ± 3.5% 55 57 2 0
16 Qwen/Qwen3-235b-A22b-2507 95.4 ± 3.5% 55 57 2 0
17 Deepseek/Deepseek-R1 95.0 ± 3.8% 50 52 1 1
17 Qwen/Qwen3-235b-A22b:free 95.0 ± 3.8% 50 52 1 1
18 Qwen/Qwen3-Vl-30b-A3b-Thinking 94.1 ± 4.5% 42 44 2 0
19 Deepseek/Deepseek-Chat 93.8 ± 4.3% 55 58 3 0
19 Google/Gemini-2.5-Flash-Preview-09-2025 93.8 ± 4.3% 55 58 3 0
19 Meituan/Longcat-Flash-Chat 93.8 ± 4.3% 55 58 3 0
19 Openai/Gpt-5-Chat 93.8 ± 4.3% 55 58 1 2
19 Openai/Gpt-Oss-20b 93.8 ± 4.3% 55 58 3 0
19 Qwen/Qwen3-14b 93.8 ± 4.3% 55 58 3 0
19 Qwen/Qwen3-30b-A3b-Instruct-2507 93.8 ± 4.3% 55 58 3 0
20 Inclusionai/Ling-1t 93.7 ± 4.4% 54 57 2 1
20 Openai/Gpt-Oss-120b 93.7 ± 4.4% 54 57 3 0
21 Openai/O1-Mini 93.6 ± 4.5% 53 56 3 0
21 Qwen/Qwen3-Coder-Plus 93.6 ± 4.5% 53 56 3 0
21 Z-ai/Glm-4.6 93.6 ± 4.5% 53 56 3 0
22 Deepseek/Deepseek-R1-0528 93.5 ± 4.9% 38 40 1 1
23 Anthropic/Claude-Sonnet-4.5 93.2 ± 4.7% 50 53 2 1
24 Openrouter/Auto 93.1 ± 4.8% 49 52 3 0
24 Perplexity/Sonar-Reasoning 93.1 ± 4.8% 49 52 1 2
24 Qwen/Qwen3-8b 93.1 ± 4.8% 49 52 3 0
25 Inclusionai/Ring-1t 92.6 ± 5.6% 33 35 2 0
26 Nvidia/Nemotron-Nano-9b-V2:free 92.5 ± 4.8% 57 61 3 1
27 Baidu/Ernie-4.5-300b-A47b 92.1 ± 5.1% 54 58 4 0
27 Google/Gemini-2.5-Flash-Lite-Preview-09-2025 92.1 ± 5.1% 54 58 3 1
27 Meta-llama/Llama-4-Maverick 92.1 ± 5.1% 54 58 2 2
28 Anthropic/Claude-3.5-Sonnet 92.0 ± 5.1% 53 57 3 1
28 Mistralai/Mistral-Medium-3.1 92.0 ± 5.1% 53 57 3 1
28 Qwen/Qwen3-Vl-8b-Instruct 92.0 ± 5.1% 53 57 4 0
28 X-ai/Grok-3-Mini 92.0 ± 5.1% 53 57 4 0
28 X-ai/Grok-4-Fast 92.0 ± 5.1% 53 57 4 0
29 Openai/Codex-Mini 91.6 ± 5.4% 50 54 3 1
29 X-ai/Grok-3-Beta 91.6 ± 5.4% 50 54 4 0
30 Anthropic/Claude-3.7-Sonnet 91.4 ± 5.5% 49 53 3 1
31 Openai/Gpt-Oss-20b:free 90.9 ± 5.5% 56 61 4 1
32 Deepseek/Deepseek-Prover-V2 90.4 ± 5.7% 53 58 4 1
32 Google/Gemini-2.0-Flash-001 90.4 ± 5.7% 53 58 5 0
32 Microsoft/Phi-4-Reasoning-Plus 90.4 ± 5.7% 53 58 4 1
33 Qwen/Qwen3-Vl-8b-Thinking 90.3 ± 6.2% 43 47 1 3
34 Deepseek/Deepseek-Chat-V3-0324 90.3 ± 5.8% 52 57 3 2
34 Google/Gemma-3-27b-It 90.3 ± 5.8% 52 57 5 0
35 Openai/O1-Mini-2024-09-12 90.1 ± 5.9% 51 56 4 1
36 Qwen/Qwen3-4b:free 89.1 ± 10.5% 5 5 0 0
37 Deepseek/Deepseek-V3.1-Terminus 88.8 ± 6.3% 52 58 5 1
37 Mistralai/Mistral-Medium-3 88.8 ± 6.3% 52 58 4 2
37 Moonshotai/Kimi-K2 88.8 ± 6.3% 52 58 5 1
38 Alibaba/Tongyi-Deepresearch-30b-A3b 88.6 ± 6.4% 51 57 6 0
38 Arcee-ai/Virtuoso-Large 88.6 ± 6.4% 51 57 6 0
38 X-ai/Grok-3-Mini-Beta 88.6 ± 6.4% 51 57 6 0
39 Qwen/Qwen3-Vl-235b-A22b-Thinking 86.3 ± 8.2% 35 40 2 3
40 Anthropic/Claude-3.5-Sonnet-20240620 85.9 ± 7.5% 46 53 7 0
40 Nvidia/Llama-3.1-Nemotron-Ultra-253b-V1 85.9 ± 7.5% 46 53 6 1
41 Deepseek/Deepseek-Chat-V3.1 85.6 ± 7.6% 45 52 6 1
42 Qwen/Qwen3-Coder 85.3 ± 7.8% 44 51 4 3
42 Z-ai/Glm-4.5-Air 85.3 ± 7.8% 44 51 4 3
43 Mistralai/Mistral-Large 85.1 ± 7.9% 43 50 7 0
44 Baidu/Ernie-4.5-21b-A3b 84.8 ± 8.1% 42 49 6 1
45 Mistralai/Mistral-Large-2411 84.5 ± 8.2% 41 48 4 3
45 Perplexity/Sonar-Reasoning-Pro 84.5 ± 8.2% 41 48 3 4
46 Meta-llama/Llama-3.3-70b-Instruct 84.1 ± 8.4% 40 47 5 2
47 Perplexity/Sonar-Deep-Research 84.0 ± 9.5% 29 34 2 3
48 Moonshotai/Kimi-Dev-72b 82.6 ± 9.7% 31 37 3 3
48 Z-ai/Glm-4.5v 82.6 ± 9.7% 31 37 6 0
49 Deepseek/Deepseek-V3.2-Exp 81.4 ± 9.7% 33 40 6 1
50 Qwen/Qwen2.5-Vl-32b-Instruct 81.1 ± 10.4% 28 34 4 2
51 Stepfun-ai/Step3 80.8 ± 12.1% 19 23 2 2
52 Deepseek/Deepseek-R1-Distill-Qwen-32b 80.7 ± 11.3% 23 28 4 1
53 Cohere/Command-A 80.5 ± 10.2% 31 38 4 3
53 Deepcogito/Cogito-V2-Preview-Deepseek-671b 80.5 ± 10.2% 31 38 5 2
53 Mistralai/Devstral-Medium 80.5 ± 10.2% 31 38 4 3
53 Openai/Gpt-4o-Mini-2024-07-18 80.5 ± 10.2% 31 38 5 2
53 Perplexity/Sonar-Pro 80.5 ± 10.2% 31 38 3 4
53 X-ai/Grok-Code-Fast-1 80.5 ± 10.2% 31 38 5 2
54 Baidu/Ernie-4.5-21b-A3b-Thinking 78.9 ± 11.0% 28 35 7 0
54 Google/Gemini-2.5-Flash-Lite 78.9 ± 11.0% 28 35 7 0
54 Meta-llama/Llama-3.3-70b-Instruct:free 78.9 ± 11.0% 28 35 3 4
54 Qwen/Qwen-Vl-Plus 78.9 ± 11.0% 28 35 6 1
54 Qwen/Qwen2.5-Vl-72b-Instruct 78.9 ± 11.0% 28 35 6 1
54 Qwen/Qwen3-30b-A3b 78.9 ± 11.0% 28 35 6 1
55 Tencent/Hunyuan-A13b-Instruct 78.5 ± 12.6% 20 25 5 0
56 Z-ai/Glm-4-32b 78.3 ± 11.3% 27 34 7 0
57 Qwen/Qwen3-235b-A22b 77.3 ± 12.4% 22 28 2 4
58 Minimax/Minimax-M1 76.7 ± 13.5% 18 23 2 3
59 Qwen/Qwen-2.5-72b-Instruct 76.3 ± 12.3% 24 31 6 1
60 Deepseek/Deepseek-R1-0528-Qwen3-8b 75.6 ± 13.3% 20 26 4 2
61 Baidu/Ernie-4.5-Vl-424b-A47b 74.7 ± 13.0% 22 29 5 2
61 Meta-llama/Llama-3.2-90b-Vision-Instruct 74.7 ± 13.0% 22 29 4 3
61 Mistralai/Mistral-Small-3.2-24b-Instruct 74.7 ± 13.0% 22 29 6 1
62 Mistralai/Magistral-Medium-2506:thinking 74.7 ± 13.8% 19 25 2 4
62 Qwen/Qwen3-32b 74.7 ± 13.8% 19 25 6 0
63 Microsoft/Phi-4-Multimodal-Instruct 73.9 ± 13.4% 21 28 7 0
64 Mistralai/Mistral-Saba 72.9 ± 13.8% 20 27 7 0
65 Aion-labs/Aion-1.0-Mini 71.9 ± 14.3% 19 26 6 1
65 Arcee-ai/Coder-Large 71.9 ± 14.3% 19 26 6 1
65 Deepcogito/Cogito-V2-Preview-Llama-70b 71.9 ± 14.3% 19 26 7 0
65 Deepseek/Deepseek-R1-Distill-Qwen-14b 71.9 ± 14.3% 19 26 5 2
65 Moonshotai/Kimi-K2-0905 71.9 ± 14.3% 19 26 5 2
65 Openai/Gpt-4.1-Nano 71.9 ± 14.3% 19 26 4 3
65 Openai/Gpt-4o 71.9 ± 14.3% 19 26 3 4
65 Openai/Gpt-4o-2024-11-20 71.9 ± 14.3% 19 26 4 3
65 Qwen/Qwen3-Coder-Flash 71.9 ± 14.3% 19 26 6 1
66 Google/Gemma-3-12b-It 70.9 ± 14.8% 18 25 7 0
66 Meta-llama/Llama-4-Scout 70.9 ± 14.8% 18 25 1 6
66 Mistralai/Mistral-Large-2407 70.9 ± 14.8% 18 25 6 1
66 Nousresearch/Hermes-3-Llama-3.1-405b 70.9 ± 14.8% 18 25 6 1
67 Aion-labs/Aion-1.0 70.7 ± 28.0% 1 1 0 0
67 Anthropic/Claude-3-Opus 70.7 ± 28.0% 1 1 0 0
67 Anthropic/Claude-Opus-4 70.7 ± 28.0% 1 1 0 0
67 Anthropic/Claude-Opus-4.1 70.7 ± 28.0% 1 1 0 0
67 Deepcogito/Cogito-V2-Preview-Llama-405b 70.7 ± 28.0% 1 1 0 0
67 Openai/Chatgpt-4o-Latest 70.7 ± 28.0% 1 1 0 0
67 Openai/Gpt-4 70.7 ± 28.0% 1 1 0 0
67 Openai/Gpt-4-Turbo 70.7 ± 28.0% 1 1 0 0
67 Openai/Gpt-4o-2024-05-13 70.7 ± 28.0% 1 1 0 0
67 Openai/Gpt-5-Image 70.7 ± 28.0% 1 1 0 0
67 Openai/Gpt-5-Pro 70.7 ± 28.0% 1 1 0 0
67 Openai/O1 70.7 ± 28.0% 1 1 0 0
67 Openai/O1-Pro 70.7 ± 28.0% 1 1 0 0
67 Openai/O3-Deep-Research 70.7 ± 28.0% 1 1 0 0
68 Baidu/Ernie-4.5-Vl-28b-A3b 70.1 ± 16.0% 15 21 4 2
69 Amazon/Nova-Pro-V1 69.7 ± 15.3% 17 24 7 0
69 Google/Gemini-2.0-Flash-Lite-001 69.7 ± 15.3% 17 24 7 0
69 Google/Gemma-3-4b-It 69.7 ± 15.3% 17 24 5 2
69 Mistralai/Devstral-Small-2505 69.7 ± 15.3% 17 24 4 3
69 Nousresearch/Hermes-4-405b 69.7 ± 15.3% 17 24 7 0
69 Openai/Gpt-4o-Search-Preview 69.7 ± 15.3% 17 24 6 1
69 Qwen/Qwen-2.5-7b-Instruct 69.7 ± 15.3% 17 24 5 2
69 Qwen/Qwen-Turbo 69.7 ± 15.3% 17 24 5 2
69 Qwen/Qwen3-Coder-30b-A3b-Instruct 69.7 ± 15.3% 17 24 6 1
70 Anthropic/Claude-3.5-Haiku 68.5 ± 15.9% 16 23 7 0
70 Deepcogito/Cogito-V2-Preview-Llama-109b-Moe 68.5 ± 15.9% 16 23 7 0
70 Mistralai/Devstral-Small 68.5 ± 15.9% 16 23 6 1
70 Mistralai/Pixtral-Large-2411 68.5 ± 15.9% 16 23 7 0
70 Perplexity/Sonar 68.5 ± 15.9% 16 23 1 6
70 Thedrummer/Cydonia-24b-V4.1 68.5 ± 15.9% 16 23 6 1
71 Inception/Mercury-Coder 65.7 ± 17.1% 14 21 5 2
71 Thedrummer/Anubis-70b-V1.1 65.7 ± 17.1% 14 21 5 2
72 Microsoft/Phi-4 65.5 ± 18.2% 12 18 5 1
72 Mistralai/Codestral-2508 65.5 ± 18.2% 12 18 6 0
72 Nousresearch/Hermes-3-Llama-3.1-405b:free 65.5 ± 18.2% 12 18 6 0
72 Nvidia/Llama-3.1-Nemotron-70b-Instruct 65.5 ± 18.2% 12 18 2 4
73 Google/Gemma-2-9b-It 64.1 ± 17.8% 13 20 7 0
73 Openai/Gpt-4o-Mini 64.1 ± 17.8% 13 20 6 1
74 Amazon/Nova-Lite-V1 63.6 ± 19.1% 11 17 4 2
74 Cognitivecomputations/Dolphin-Mistral-24b-Venice-Edition:free 63.6 ± 19.1% 11 17 5 1
74 Meta-llama/Llama-3-70b-Instruct 63.6 ± 19.1% 11 17 5 1
74 Nousresearch/Hermes-3-Llama-3.1-70b 63.6 ± 19.1% 11 17 5 1
75 Mistralai/Mistral-Small-3.1-24b-Instruct:free 61.5 ± 20.0% 10 16 6 0
76 Google/Gemma-3n-E4b-It 60.3 ± 19.4% 11 18 6 1
76 Inception/Mercury 60.3 ± 19.4% 11 18 7 0
76 Meta-llama/Llama-4-Scout:free 60.3 ± 19.4% 11 18 3 4
76 Mistralai/Codestral-2501 60.3 ± 19.4% 11 18 6 1
76 Mistralai/Mistral-7b-Instruct-V0.3 60.3 ± 19.4% 11 18 7 0
76 Mistralai/Mistral-Small-3.1-24b-Instruct 60.3 ± 19.4% 11 18 4 3
76 Openai/Gpt-4o-2024-08-06 60.3 ± 19.4% 11 18 6 1
77 Amazon/Nova-Micro-V1 59.2 ± 21.1% 9 15 3 3
77 Anthropic/Claude-Haiku-4.5 59.2 ± 21.1% 9 15 1 5
77 Mistralai/Magistral-Small-2506 59.2 ± 21.1% 9 15 0 6
77 Sao10k/L3.3-Euryale-70b 59.2 ± 21.1% 9 15 3 3
78 Mistralai/Mistral-7b-Instruct:free 58.2 ± 20.3% 10 17 7 0
79 Ai21/Jamba-Large-1.7 56.5 ± 22.2% 8 14 5 1
79 Meta-llama/Llama-3.1-405b-Instruct 56.5 ± 22.2% 8 14 3 3
79 Meta-llama/Llama-3.1-70b-Instruct 56.5 ± 22.2% 8 14 4 2
79 Thedrummer/Skyfall-36b-V2 56.5 ± 22.2% 8 14 5 1
80 Anthropic/Claude-3-Haiku 53.5 ± 23.5% 7 13 6 0
80 Google/Gemma-2-27b-It 53.5 ± 23.5% 7 13 5 1
80 Mistralai/Magistral-Medium-2506 53.5 ± 23.5% 7 13 3 3
80 Mistralai/Mistral-Small 53.5 ± 23.5% 7 13 5 1
80 Mistralai/Pixtral-12b 53.5 ± 23.5% 7 13 4 2
80 Openai/Gpt-4o-Mini-Search-Preview 53.5 ± 23.5% 7 13 6 0
81 Anthracite-org/Magnum-V4-72b 50.0 ± 24.9% 6 12 3 3
81 Ibm-granite/Granite-4.0-H-Micro 50.0 ± 24.9% 6 12 5 1
81 Mistralai/Mistral-7b-Instruct 50.0 ± 24.9% 6 12 6 0
81 Sao10k/L3.1-70b-Hanami-X1 50.0 ± 24.9% 6 12 5 1
81 Sao10k/L3.1-Euryale-70b 50.0 ± 24.9% 6 12 6 0
82 Microsoft/Wizardlm-2-8x22b 46.0 ± 26.4% 5 11 2 4
82 Mistralai/Ministral-8b 46.0 ± 26.4% 5 11 3 3
82 Openai/Gpt-3.5-Turbo 46.0 ± 26.4% 5 11 6 0
82 Sao10k/L3-Euryale-70b 46.0 ± 26.4% 5 11 5 1
83 Meta-llama/Llama-3.2-3b-Instruct 41.2 ± 28.0% 4 10 5 1
84 Meta-llama/Llama-3.3-8b-Instruct:free 39.3 ± 30.8% 3 8 1 4
84 Mistralai/Ministral-3b 39.3 ± 30.8% 3 8 4 1
84 Mistralai/Mistral-Small-24b-Instruct-2501 39.3 ± 30.8% 3 8 3 2
85 Cohere/Command-R-08-2024 36.4 ± 34.5% 2 6 3 1
86 Cohere/Command-R-Plus-08-2024 35.5 ± 29.7% 3 9 5 1
86 Mistralai/Mistral-Tiny 35.5 ± 29.7% 3 9 3 3
87 Inflection/Inflection-3-Productivity 32.1 ± 33.0% 2 7 5 0
87 Meta-llama/Llama-3-8b-Instruct 32.1 ± 33.0% 2 7 4 1
87 Mistralai/Mistral-Nemo 32.1 ± 33.0% 2 7 3 2
87 Mistralai/Mixtral-8x22b-Instruct 32.1 ± 33.0% 2 7 2 3
87 Nousresearch/Hermes-4-70b 32.1 ± 33.0% 2 7 4 1
87 Openai/Gpt-3.5-Turbo-16k 32.1 ± 33.0% 2 7 5 0
87 Qwen/Qwen-2.5-Vl-7b-Instruct 32.1 ± 33.0% 2 7 3 2
88 Meta-llama/Llama-3.1-405b 29.3 ± 54.9% 0 1 1 0
88 Openai/Gpt-4-1106-Preview 29.3 ± 54.9% 0 1 1 0
88 Openai/Gpt-4-Turbo-Preview 29.3 ± 54.9% 0 1 1 0
88 Openai/Gpt-4o:extended 29.3 ± 54.9% 0 1 1 0
89 Arcee-ai/Afm-4.5b 26.4 ± 37.7% 1 5 2 2
89 Meta-llama/Llama-3.1-8b-Instruct 26.4 ± 37.7% 1 5 3 1
90 Ai21/Jamba-Mini-1.7 22.8 ± 35.0% 1 6 3 2
90 Aion-labs/Aion-Rp-Llama-3.1-8b 22.8 ± 35.0% 1 6 3 2
90 Bytedance/Ui-Tars-1.5-7b 22.8 ± 35.0% 1 6 4 1
90 Inflection/Inflection-3-Pi 22.8 ± 35.0% 1 6 3 2
90 Microsoft/Phi-3-Medium-128k-Instruct 22.8 ± 35.0% 1 6 2 3
90 Neversleep/Noromaid-20b 22.8 ± 35.0% 1 6 4 1
90 Openai/Gpt-3.5-Turbo-0613 22.8 ± 35.0% 1 6 5 0
90 Openai/Gpt-3.5-Turbo-Instruct 22.8 ± 35.0% 1 6 5 0
90 Qwen/Qwen2.5-Coder-7b-Instruct 22.8 ± 35.0% 1 6 3 2
90 Sao10k/L3-Lunaris-8b 22.8 ± 35.0% 1 6 5 0
90 Thedrummer/Rocinante-12b 22.8 ± 35.0% 1 6 4 1
90 Thedrummer/Unslopnemo-12b 22.8 ± 35.0% 1 6 4 1
91 Cohere/Command-R7b-12-2024 12.9 ± 39.2% 0 4 2 2
91 Eleutherai/Llemma_7b 12.9 ± 39.2% 0 4 2 2
91 Gryphe/Mythomax-L2-13b 12.9 ± 39.2% 0 4 3 1
91 Mancer/Weaver 12.9 ± 39.2% 0 4 2 2
91 Meta-llama/Llama-3.2-11b-Vision-Instruct 12.9 ± 39.2% 0 4 1 3
91 Meta-llama/Llama-3.2-3b-Instruct:free 12.9 ± 39.2% 0 4 3 1
91 Microsoft/Phi-3-Mini-128k-Instruct 12.9 ± 39.2% 0 4 2 2
91 Microsoft/Phi-3.5-Mini-128k-Instruct 12.9 ± 39.2% 0 4 1 3
91 Mistralai/Mistral-7b-Instruct-V0.1 12.9 ± 39.2% 0 4 2 2
91 Mistralai/Mixtral-8x7b-Instruct 12.9 ± 39.2% 0 4 3 1
91 Neversleep/Llama-3.1-Lumimaid-8b 12.9 ± 39.2% 0 4 4 0
91 Nousresearch/Hermes-2-Pro-Llama-3-8b 12.9 ± 39.2% 0 4 3 1
91 Undi95/Remm-Slerp-L2-13b 12.9 ± 39.2% 0 4 3 1

Verification Process

We curated a set of 273 expressions. Approximately a third were selected from a popular calculus textbook, on the assumption that they carry some pedagogic value; the rest were randomly generated with varying degrees of complexity, as sketched below. We assume the randomly generated set is new to the LLMs and never formed part of their training data.
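
The generation procedure is not described in more detail; a minimal sketch of one way to produce random expressions of varying complexity, assuming recursive composition of a few operators and primitives with sympy, might look like this.

```python
# Hypothetical sketch of a random expression generator; the benchmark's
# actual generation procedure is not documented here.
import random
import sympy as sp

x = sp.Symbol("x")
LEAVES = [x, *map(sp.Integer, range(1, 6))]
UNARY = [sp.sin, sp.cos, sp.exp, sp.log, sp.sqrt]
BINARY = [sp.Add, sp.Mul, lambda a, b: a / b, lambda a, b: a ** b]

def random_expr(depth: int) -> sp.Expr:
    """Recursively compose operators; depth bounds the complexity."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(LEAVES)
    if random.random() < 0.5:
        return random.choice(UNARY)(random_expr(depth - 1))
    op = random.choice(BINARY)
    return op(random_expr(depth - 1), random_expr(depth - 1))

print(random_expr(depth=3))  # e.g. sin(x)*exp(x) + 2
```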

Each model was asked to differentiate an expression with respect to a given variable and to return both its reasoning and its final answer.
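
The exact prompt is not published; a minimal sketch of such a query against OpenRouter's OpenAI-compatible chat completions endpoint (the prompt wording and answer-marker convention here are our assumptions) could look like this.

```python
# Hypothetical query; the benchmark's actual prompt is not published.
# OpenRouter exposes an OpenAI-compatible chat completions API.
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "openai/gpt-5",
        "messages": [{
            "role": "user",
            "content": (
                "Differentiate f(x) = x^2 sin(x) with respect to x. "
                "Show your reasoning, then give the final answer as LaTeX "
                "on a line starting with 'ANSWER:'."
            ),
        }],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```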

We then parsed the answer from LaTeX into Python and numerically evaluated the difference between the supplied answer and the actual derivative. If the two expressions differed by less than 1e-9 at each of ten sample points, the result was marked as correct.
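
A minimal sketch of that check, assuming sympy's LaTeX parser and real-valued sample points (the benchmark's actual sampling domain and handling of parse or domain errors are not specified):

```python
# Sketch of the numerical verification step; sample-point selection and
# error handling are assumptions, not the benchmark's exact code.
import random
import sympy as sp
from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime

def is_correct(expr: sp.Expr, answer_latex: str, var: sp.Symbol,
               n_samples: int = 10, tol: float = 1e-9) -> bool:
    reference = sp.diff(expr, var)          # ground-truth derivative
    candidate = parse_latex(answer_latex)   # model's answer, LaTeX -> sympy
    f = sp.lambdify(var, reference - candidate, "mpmath")
    for _ in range(n_samples):
        point = random.uniform(0.1, 2.0)    # assumed sampling domain
        if abs(f(point)) >= tol:
            return False
    return True

x = sp.Symbol("x")
print(is_correct(x**2 * sp.sin(x), r"2 x \sin(x) + x^{2} \cos(x)", x))  # True
```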