The Model Showdown: Testing 49 LLMs Across 5 Dimensions for $4.63
March 2026 — Run 2 (full coding coverage, 49 models)
The Question Behind the Showdown
Which model should you actually use? Not which one tops a leaderboard somewhere — which one will reason through your problem, follow your formatting instructions, write code that compiles, answer factual questions correctly, and orchestrate real-world tools without falling apart?
We built an evaluation suite that tests 49 language models across 5 fundamentally different capabilities: logical reasoning, factual knowledge, precise instruction following, executable code generation, and MCP tool orchestration against a live API. The total cost was $4.63.
The lineup spans the full spectrum: from frontier models (Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, Grok 4.20) to free-tier NVIDIA models to 13 local Ollama models running on a MacBook. Premium ($5+/M tokens), value ($0.05-0.50/M), budget (<$0.05/M), and free. Closed-source and open-weight. Cloud and local.
This is Run 2. Run 1 had incomplete coding coverage due to container DNS failures and timeouts. Run 2 re-evaluated all models across all 18 coding tasks with full local execution — no containers, no DNS failures, no time limits. Eleven models now achieve a perfect 126/126 on coding.
Note on MCP coverage: 41 of 49 models have complete 5-dimension results (including MCP tool use). The remaining 8 either lack tool-use support (phi4, gemma3n, the deepseek-r1 variants) or were not included in the MCP evaluation run (the nemotron-nano-9b-v2 variants, minimax-01). gemma-3-27b-it is counted among the 41: OpenRouter exposes no tool-use endpoint for it, so it scores 0/16 on MCP. The 4-dimension leaderboard below covers every model with complete results on those four dimensions; the 5-dimension leaderboard covers the 41 with MCP data.
What We Tested and Why
The Models
49 models across three cloud providers plus local inference:
OpenRouter (31 models via @openrouter/ai-sdk-provider):
- Premium: anthropic/claude-opus-4.6, anthropic/claude-sonnet-4.6, openai/gpt-5.4, x-ai/grok-4.20-beta
- High: openai/gpt-5.4-mini, anthropic/claude-haiku-4.5, moonshotai/kimi-k2.5, moonshotai/kimi-k2, google/gemini-3.1-pro-preview
- Mid/Value: Qwen 3.5 (397B, 122B, 35B), DeepSeek v3.2, MiniMax M2.7, Inception Mercury 2/Coder, OpenAI GPT-OSS (120B, 20B), GPT-5.4 Nano, Grok 4.1 Fast
- Budget: Meta Llama 4 (Maverick, Scout), Mistral (Small, Codestral, Ministral 8B), Gemma 3 27B
- Free: NVIDIA Nemotron 3 (Super 120B, Nano 30B, Nano 9B v2)
Google (2 models, direct API):
gemini-3-flash-preview, gemini-3.1-pro-preview
DeepInfra (3 models):
- NVIDIA Nemotron 3 (Super 120B, Nano 30B, Nano 9B v2)
Ollama (13 models, local on MacBook):
deepseek-r1 (latest/32b/14b), devstral, phi4, mistral-small, glm-4.7-flash, gemma3n:e4b, qwen3 (32b/30b-a3b), nemotron-3-nano (latest/4b), gpt-oss
The 5 Dimensions
Each dimension tests something that the others can't.
Reasoning (/20) — Can the model think past intuitive traps? Four classic logic puzzles where the obvious answer is wrong. A bat and ball that don't cost what you think. A patch of lily pads where halving the time is the wrong move. A surgeon who isn't who you assumed. And the hardest: find a counterfeit coin among 12 using exactly 3 weighings on a balance scale. Scored by an LLM judge on reasoning quality, not just correctness.
Knowledge (/30) — Does the model know things? 30 factual questions across Science, Geography, History, Technology, AI/ML, and Tricky/Adversarial categories. Binary scoring (correct or not) with an LLM judge that allows formatting variations — "5,730 years" and "5730 years" both count.
Instruction Following (/30) — Can the model do exactly what you ask? Six tasks with precise formatting constraints scored deterministically — no LLM judge, just regex, JSON parsing, and character counting. Write exactly 12 words. Output valid JSON without markdown fences. Convert CSV to a markdown table. Follow negative constraints ("don't use the word 'beautiful'").
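Because the scoring is deterministic, the checks are plain code. A minimal sketch in the spirit of this dimension (the checker names and exact constraints here are illustrative, not umwelten's actual implementation):

```typescript
// Illustrative deterministic instruction checks -- no LLM judge involved.
type Check = (response: string) => boolean;

const checks: Record<string, Check> = {
  // "Write exactly 12 words": count whitespace-delimited tokens.
  exactWordCount: (r) => r.trim().split(/\s+/).length === 12,

  // "Output valid JSON without markdown fences."
  rawJson: (r) => {
    if (/^```/m.test(r)) return false; // any fence fails
    try { JSON.parse(r.trim()); return true; } catch { return false; }
  },

  // Negative constraint: the word "beautiful" must not appear.
  negativeConstraint: (r) => !/\bbeautiful\b/i.test(r),
};

export function scoreInstruction(response: string): number {
  // One point per constraint satisfied.
  return Object.values(checks).filter((check) => check(response)).length;
}
```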
Coding (/126) — Can the model write code that actually runs? Six programming challenges across TypeScript, Python, and Go — 18 tasks total. Each submission is compiled, executed against test cases, and scored on correctness (compile: 1pt, run: 1pt, output: 0-5pts = 0-7 per task). FizzBuzz with a twist, business day calculation with holidays, a vending machine state machine, grid path counting with obstacles, rail fence cipher, and data pipeline aggregation.
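The rubric maps one-to-one onto code. A sketch of the arithmetic (field names are illustrative):

```typescript
// Per-task scoring as described above: compile 1pt, run 1pt, output 0-5pts.
interface TaskResult {
  compiled: boolean;   // did the toolchain accept it?
  ran: boolean;        // did it execute without crashing?
  outputScore: number; // 0-5, from comparison against expected test output
}

function scoreTask(t: TaskResult): number {
  return (t.compiled ? 1 : 0) + (t.ran ? 1 : 0) + t.outputScore; // 0-7
}

// 6 challenges x 3 languages = 18 tasks, 7 points each = 126 max.
const maxCoding = 18 * 7; // 126
```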
MCP Tool Use (/16) — Can the model orchestrate real-world tools? Each model connects to the TezLab MCP server (real EV vehicle data) and must analyze battery health and charging patterns by calling the right sequence of tools: list vehicles, get battery health, pull charging history, check efficiency stats, find chargers, and search for alternatives. Scored on both tool usage (did you call all 6 required tools? 0-6) and response quality (did you synthesize the data into something useful? 1-10).
The Full Leaderboard: 49 Models × 4 Dimensions
All 49 models ranked by combined score across reasoning, knowledge, instruction following, and coding. This includes premium-tier models, local Ollama models, and free-tier offerings that couldn't be tested on MCP.
| Rank | Model | Provider | Combined | Reasoning | Knowledge | Instruction | Coding | Cost | Time |
|---|---|---|---|---|---|---|---|---|---|
| 1 | anthropic/claude-sonnet-4.6 | OpenRouter | 100.0% | 20/20 | 30/30 | 30/30 | 126/126 | $0.20 | 3m |
| 2 | qwen/qwen3.5-397b-a17b | OpenRouter | 99.2% | 20/20 | 29/30 | 30/30 | 126/126 | $0.52 | 36m |
| 3 | x-ai/grok-4.20-beta | OpenRouter | 98.8% | 20/20 | 29/30 | 30/30 | 124/126 | $0.09 | 2m |
| 4 | gemini-3-flash-preview | Google | 98.8% | 19/20 | 30/30 | 30/30 | 126/126 | $0.04 | 1h41m |
| 5 | openai/gpt-5.4 | OpenRouter | 98.8% | 19/20 | 30/30 | 30/30 | 126/126 | $0.17 | 2m |
| 6 | openai/gpt-oss-120b | OpenRouter | 98.3% | 20/20 | 28/30 | 30/30 | 126/126 | $0.01 | 10m |
| 7 | qwen/qwen3.5-122b-a10b | OpenRouter | 97.6% | 20/20 | 30/30 | 30/30 | 114/126 | $0.43 | 18m |
| 8 | openai/gpt-5.4-mini | OpenRouter | 97.3% | 20/20 | 30/30 | 28/30 | 121/126 | $0.05 | 1m |
| 9 | gemini-3.1-pro-preview | Google | 97.3% | 18/20 | 30/30 | 30/30 | 125/126 | $0.01 | 23m |
| 10 | qwen/qwen3.5-35b-a3b | OpenRouter | 97.2% | 20/20 | 30/30 | 30/30 | 112/126 | $0.21 | 30m |
| 11 | google/gemini-3.1-pro-preview | OpenRouter | 96.9% | 18/20 | 30/30 | 30/30 | 123/126 | $1.85 | 31m |
| 12 | anthropic/claude-haiku-4.5 | OpenRouter | 96.3% | 17/20 | 30/30 | 30/30 | 126/126 | $0.07 | 2m |
| 13 | minimax/minimax-m2.7 | OpenRouter | 96.3% | 17/20 | 30/30 | 30/30 | 126/126 | $0.10 | 24m |
| 14 | moonshotai/kimi-k2 | OpenRouter | 95.8% | 20/20 | 30/30 | 25/30 | 126/126 | $0.03 | 5m |
| 15 | moonshotai/kimi-k2.5 | OpenRouter | 95.8% | 20/20 | 30/30 | 25/30 | 126/126 | $0.26 | 50m |
| 16 | x-ai/grok-4.1-fast | OpenRouter | 95.8% | 20/20 | 30/30 | 30/30 | 105/126 | $0.06 | 13m |
| 17 | anthropic/claude-opus-4.6 | OpenRouter | 95.8% | 18/20 | 30/30 | 28/30 | 126/126 | $0.37 | 4m |
| 18 | inception/mercury-2 | OpenRouter | 94.4% | 20/20 | 30/30 | 25/30 | 119/126 | $0.06 | 2m |
| 19 | mistralai/mistral-small-2603 | OpenRouter | 94.1% | 17/20 | 30/30 | 30/30 | 115/126 | $0.008 | 1m |
| 20 | openai/gpt-oss-20b | OpenRouter | 94.0% | 17/20 | 29/30 | 30/30 | 119/126 | $0.02 | 8m |
| 21 | gpt-oss:latest | Ollama | 93.3% | 20/20 | 28/30 | 29/30 | 105/126 | Free | 1h11m |
| 22 | deepseek/deepseek-v3.2 | OpenRouter | 93.2% | 17/20 | 29/30 | 28/30 | 123/126 | $0.007 | 14m |
| 23 | nvidia/Nemotron-3-Nano-30B-A3B | DeepInfra | 93.1% | 17/20 | 30/30 | 30/30 | 110/126 | Free | 14m |
| 24 | nvidia/Nemotron-Super-120B-A12B | DeepInfra | 90.8% | 20/20 | 28/30 | 30/30 | 88/126 | Free | 3m |
| 25 | nemotron-3-nano:latest | Ollama | 89.4% | 20/20 | 30/30 | 27/30 | 85/126 | Free | 36m |
| 26 | meta-llama/llama-4-maverick | OpenRouter | 89.1% | 17/20 | 29/30 | 25/30 | 115/126 | $0.008 | 16m |
| 27 | meta-llama/llama-4-scout | OpenRouter | 88.8% | 17/20 | 27/30 | 30/30 | 101/126 | $0.004 | 15m |
| 28 | openai/gpt-5.4-nano | OpenRouter | 87.4% | 15/20 | 29/30 | 27/30 | 111/126 | $0.02 | 2m |
| 29 | nvidia/nemotron-3-nano-30b-a3b:free | OpenRouter | 87.4% | 20/20 | 30/30 | 28/30 | 71/126 | Free | 2m |
| 30 | mistralai/codestral-2508 | OpenRouter | 87.3% | 17/20 | 25/30 | 25/30 | 123/126 | $0.01 | 48s |
| 31 | google/gemma-3-27b-it | OpenRouter | 87.1% | 17/20 | 29/30 | 25/30 | 105/126 | $0.003 | 4m |
| 32 | qwen3:32b | Ollama | 86.7% | 17/20 | 29/30 | 25/30 | 103/126 | Free | 59m |
| 33 | inception/mercury-coder | OpenRouter | 86.5% | 17/20 | 25/30 | 25/30 | 119/126 | $0.01 | 2m |
| 34 | nvidia/nemotron-3-super-120b-a12b:free | OpenRouter | 85.5% | 20/20 | 30/30 | 25/30 | 74/126 | Free | 11m |
| 35 | phi4:latest | Ollama | 84.6% | 16/20 | 27/30 | 27/30 | 99/126 | Free | 25m |
| 36 | mistral-small:latest | Ollama | 84.4% | 17/20 | 29/30 | 23/30 | 100/126 | Free | 49m |
| 37 | glm-4.7-flash:latest | Ollama | 84.1% | 17/20 | 29/30 | 26/30 | 86/126 | Free | 54m |
| 38 | devstral:latest | Ollama | 84.0% | 16/20 | 28/30 | 26/30 | 96/126 | Free | 44m |
| 39 | deepseek-r1:32b | Ollama | 84.0% | 17/20 | 28/30 | 28/30 | 81/126 | Free | 1h5m |
| 40 | mistralai/mistral-small-3.2-24b-instruct | OpenRouter | 83.0% | 17/20 | 29/30 | 26/30 | 80/126 | $0.002 | 2m |
| 41 | mistralai/ministral-8b-2512 | OpenRouter | 81.4% | 16/20 | 27/30 | 25/30 | 91/126 | $0.004 | 1m |
| 42 | gemma3n:e4b | Ollama | 79.7% | 17/20 | 29/30 | 24/30 | 72/126 | Free | 54m |
| 43 | deepseek-r1:14b | Ollama | 78.2% | 17/20 | 24/30 | 26/30 | 77/126 | Free | 57m |
| 44 | deepseek-r1:latest | Ollama | 78.1% | 15/20 | 26/30 | 30/30 | 64/126 | Free | 33m |
| 45 | nvidia/nemotron-nano-9b-v2:free | OpenRouter | 77.6% | 17/20 | 26/30 | 29/30 | 53/126 | Free | 1h52m |
| 46 | qwen3:30b-a3b | Ollama | 77.5% | 20/20 | 29/30 | 4/30 | 126/126 | Free | 1h14m |
| 47 | nvidia/Nemotron-Nano-9B-v2 | DeepInfra | 75.7% | 17/20 | 26/30 | 25/30 | 60/126 | Free | 9m |
| 48 | nemotron-3-nano:4b | Ollama | 73.8% | 19/20 | 25/30 | 25/30 | 42/126 | Free | 42m |
Key Findings From the Full Lineup
1. Claude Sonnet 4.6 achieves a perfect 100%. The only model to score maximum on all 4 dimensions: 20/20 reasoning, 30/30 knowledge, 30/30 instruction, 126/126 coding. At $0.20 it's the gold standard for capability — but it costs 5x more than Gemini Flash, which sits just 1.2 points behind.
2. Gemini 3 Flash is the value champion at 98.8% for $0.04. It ties Grok 4.20 and GPT-5.4 at 98.8% while costing a fraction of either ($0.09 and $0.17, respectively). Perfect knowledge, perfect instruction, perfect coding — it lost only 1 point, on reasoning.
3. openai/gpt-oss-120b at $0.01 is absurd. 98.3% — #6 overall, ahead of Gemini Pro and GPT-5.4 Mini — for one cent. Perfect coding (126/126), perfect instruction, perfect reasoning. It missed 2 knowledge questions. An Apache-licensed model at a penny.
4. gpt-oss:latest on Ollama scores 93.3% — for free, running locally. The same GPT-OSS architecture running on a MacBook ranks #21 overall, beating Meta Llama 4, DeepSeek v3.2, and all Mistral models. Local inference is no longer a compromise.
5. qwen3:30b-a3b on Ollama gets perfect coding (126/126) but catastrophic instruction following (4/30). It produces excellent code but can't follow formatting constraints — markdown fences everywhere, wrong word counts. A pure specialist that would tie for #1 on coding alone but sits at #46 overall.
6. The Ollama models cluster at 78-93%. Local models are consistently 5-15 points behind cloud models on the same architecture. The gap comes primarily from coding (Go struggles) and instruction following, not reasoning or knowledge.
The 5-Dimension Leaderboard: 41 Models With MCP
41 models have complete results across all 5 dimensions including MCP tool use. Eight models are excluded: five that lack tool-use support entirely (phi4, gemma3n, the three deepseek-r1 variants) and three that were not part of the MCP run (the two nemotron-nano-9b-v2 variants, minimax-01). gemma-3-27b-it remains in the table; it was attempted and scored 0/16.
| Rank | Model | Provider | Combined | Reasoning | Knowledge | Instruction | Coding | MCP | Cost | Time |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | anthropic/claude-sonnet-4.6 | OpenRouter | 93.8% | 20/20 | 30/30 | 30/30 | 126/126 | 11/16 | $0.33 | 4m |
| 2 | qwen/qwen3.5-397b-a17b | OpenRouter | 93.1% | 20/20 | 29/30 | 30/30 | 126/126 | 11/16 | $0.54 | 37m |
| 3 | x-ai/grok-4.20-beta | OpenRouter | 92.8% | 20/20 | 29/30 | 30/30 | 124/126 | 11/16 | $0.16 | 3m |
| 4 | gemini-3-flash-preview | Google | 92.8% | 19/20 | 30/30 | 30/30 | 126/126 | 11/16 | $0.05 | 1h42m |
| 5 | openai/gpt-5.4 | OpenRouter | 92.8% | 19/20 | 30/30 | 30/30 | 126/126 | 11/16 | $0.24 | 3m |
| 6 | qwen/qwen3.5-122b-a10b | OpenRouter | 91.8% | 20/20 | 30/30 | 30/30 | 114/126 | 11/16 | $0.44 | 19m |
| 7 | gemini-3.1-pro-preview | Google | 91.6% | 18/20 | 30/30 | 30/30 | 125/126 | 11/16 | $0.02 | 23m |
| 8 | qwen/qwen3.5-35b-a3b | OpenRouter | 91.5% | 20/20 | 30/30 | 30/30 | 112/126 | 11/16 | $0.22 | 31m |
| 9 | google/gemini-3.1-pro-preview | OpenRouter | 90.9% | 18/20 | 30/30 | 30/30 | 123/126 | 10.7/16 | $1.91 | 32m |
| 10 | minimax/minimax-m2.7 | OpenRouter | 90.8% | 17/20 | 30/30 | 30/30 | 126/126 | 11/16 | $0.11 | 27m |
| 11 | moonshotai/kimi-k2 | OpenRouter | 90.4% | 20/20 | 30/30 | 25/30 | 126/126 | 11/16 | $0.04 | 5m |
| 12 | moonshotai/kimi-k2.5 | OpenRouter | 90.4% | 20/20 | 30/30 | 25/30 | 126/126 | 11/16 | $0.28 | 53m |
| 13 | anthropic/claude-opus-4.6 | OpenRouter | 90.4% | 18/20 | 30/30 | 28/30 | 126/126 | 11/16 | $0.65 | 6m |
| 14 | x-ai/grok-4.1-fast | OpenRouter | 90.0% | 20/20 | 30/30 | 30/30 | 105/126 | 10.7/16 | $0.07 | 13m |
| 15 | openai/gpt-5.4-mini | OpenRouter | 90.0% | 20/20 | 30/30 | 28/30 | 121/126 | 9.7/16 | $0.06 | 2m |
| 16 | openai/gpt-oss-120b | OpenRouter | 89.9% | 20/20 | 28/30 | 30/30 | 126/126 | 9/16 | $0.01 | 11m |
| 17 | inception/mercury-2 | OpenRouter | 89.3% | 20/20 | 30/30 | 25/30 | 119/126 | 11/16 | $0.07 | 2m |
| 18 | mistralai/mistral-small-2603 | OpenRouter | 89.0% | 17/20 | 30/30 | 30/30 | 115/126 | 11/16 | $0.01 | 2m |
| 19 | openai/gpt-oss-20b | OpenRouter | 89.0% | 17/20 | 29/30 | 30/30 | 119/126 | 11/16 | $0.02 | 9m |
| 20 | deepseek/deepseek-v3.2 | OpenRouter | 88.3% | 17/20 | 29/30 | 28/30 | 123/126 | 11/16 | $0.02 | 16m |
| 21 | nemotron-3-nano:latest | Ollama | 84.0% | 20/20 | 30/30 | 27/30 | 85/126 | 10/16 | Free | 38m |
| 22 | nvidia/nemotron-3-nano-30b-a3b:free | OpenRouter | 83.7% | 20/20 | 30/30 | 28/30 | 71/126 | 11/16 | Free | 4m |
| 23 | meta-llama/llama-4-maverick | OpenRouter | 82.5% | 17/20 | 29/30 | 25/30 | 115/126 | 9/16 | $0.01 | 16m |
| 24 | nvidia/nemotron-3-super-120b-a12b:free | OpenRouter | 82.2% | 20/20 | 30/30 | 25/30 | 74/126 | 11/16 | Free | 14m |
| 25 | openai/gpt-5.4-nano | OpenRouter | 82.1% | 15/20 | 29/30 | 27/30 | 111/126 | 9.7/16 | $0.02 | 2m |
| 26 | inception/mercury-coder | OpenRouter | 81.7% | 17/20 | 25/30 | 25/30 | 119/126 | 10/16 | $0.02 | 2m |
| 27 | glm-4.7-flash:latest | Ollama | 81.1% | 17/20 | 29/30 | 26/30 | 86/126 | 11/16 | Free | 57m |
| 28 | mistral-small:latest | Ollama | 80.9% | 17/20 | 29/30 | 23/30 | 100/126 | 10.7/16 | Free | 52m |
| 29 | meta-llama/llama-4-scout | OpenRouter | 80.7% | 17/20 | 27/30 | 30/30 | 101/126 | 7.7/16 | $0.005 | 15m |
| 30 | qwen3:32b | Ollama | 80.6% | 17/20 | 29/30 | 25/30 | 103/126 | 9/16 | Free | 1h3m |
| 31 | anthropic/claude-haiku-4.5 | OpenRouter | 79.5% | 17/20 | 30/30 | 30/30 | 126/126 | 2/16 | $0.08 | 2m |
| 32 | devstral:latest | Ollama | 78.5% | 16/20 | 28/30 | 26/30 | 96/126 | 9/16 | Free | 46m |
| 33 | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 77.6% | 20/20 | 28/30 | 30/30 | 88/126 | 4/16 | Free | 3m |
| 34 | mistralai/codestral-2508 | OpenRouter | 77.4% | 17/20 | 25/30 | 25/30 | 123/126 | 6/16 | $0.02 | 54s |
| 35 | gpt-oss:latest | Ollama | 77.2% | 20/20 | 28/30 | 29/30 | 105/126 | 2/16 | Free | 1h11m |
| 36 | nvidia/Nemotron-3-Nano-30B-A3B | DeepInfra | 77.0% | 17/20 | 30/30 | 30/30 | 110/126 | 2/16 | Free | 15m |
| 37 | mistralai/mistral-small-3.2-24b-instruct | OpenRouter | 76.4% | 17/20 | 29/30 | 26/30 | 80/126 | 8/16 | $0.003 | 2m |
| 38 | mistralai/ministral-8b-2512 | OpenRouter | 73.9% | 16/20 | 27/30 | 25/30 | 91/126 | 7/16 | $0.006 | 2m |
| 39 | qwen3:30b-a3b | Ollama | 73.3% | 20/20 | 29/30 | 4/30 | 126/126 | 9/16 | Free | 1h16m |
| 40 | google/gemma-3-27b-it | OpenRouter | 69.7% | 17/20 | 29/30 | 25/30 | 105/126 | 0/16 | $0.003 | 4m |
| 41 | nemotron-3-nano:4b | Ollama | 64.0% | 19/20 | 25/30 | 25/30 | 42/126 | 4/16 | Free | 42m |
The Biggest Surprises (5-Dimension View)
1. Claude Sonnet 4.6 takes #1. With MCP data for all major models, Sonnet leads at 93.8% — the only model to achieve 100% on 4 of 5 dimensions, with a strong 11/16 MCP showing.
2. MCP reshuffles everything. Claude Haiku 4.5 is #12 on 4 dimensions (96.3%) but drops to #31 on 5 dimensions (79.5%) due to its MCP collapse (2/16). Tool use acts as a great equalizer — models that can't orchestrate tools fall behind models that are weaker on other axes.
3. Free models in the top 22. nemotron-3-nano:latest on Ollama at 84.0% (5-dim) and nvidia/nemotron-3-nano-30b-a3b:free at 83.7% both beat Haiku, Llama 4, and most Mistral models. Perfect reasoning, perfect knowledge, strong MCP.
4. Same weights, different provider, different results. Nemotron Nano 30B: OpenRouter 83.7%, DeepInfra 77.0%. A 6.7pp gap — larger than the gap between ranks #4 and #10.
Deep Dive: Reasoning
The counterfeit coin problem is the single hardest task in the entire showdown. Only 9 out of 41 models (5-dim) solved it correctly with quality reasoning.
Why the Counterfeit Coin Is So Hard
The problem: you have 12 coins, one is counterfeit (heavier or lighter — you don't know which). Using a balance scale exactly 3 times, find the counterfeit coin and determine whether it's heavier or lighter.
This is information-theoretically tight. Three weighings give you 3^3 = 27 possible outcomes. There are 24 possible states (12 coins × 2 weight possibilities). So it's barely possible — and the procedure must be exhaustive, covering every branch.
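The counting argument, spelled out:

```latex
% 3 weighings, each with outcomes {left heavy, right heavy, balance}:
3^3 = 27 \;\text{distinguishable outcomes}
% states to identify: which of 12 coins, and heavier vs. lighter:
12 \times 2 = 24 \;\text{states}
% 24 \le 27: three weighings can suffice; 3^2 = 9 < 24: two cannot.
```

With only three spare outcomes (27 minus 24), nearly every weighing has to split the surviving states into three near-equal groups, which is why hand-waved second and third weighings fail the judge.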
Models that solved it (quality score 5/5): All three Qwen 3.5 variants, openai/gpt-oss-120b, moonshotai/kimi-k2, nvidia/nemotron-3-nano-30b-a3b:free (OpenRouter), inception/mercury-2, both Nemotron Super 120B variants (OpenRouter and DeepInfra).
Models that failed (quality score 2/5): openai/gpt-oss-20b, minimax/minimax-m2.7, deepseek/deepseek-v3.2, both Llama 4 variants, mistralai/mistral-small-2603, inception/mercury-coder, anthropic/claude-haiku-4.5, nvidia/Nemotron-3-Nano-30B-A3B (DeepInfra), mistralai/codestral-2508, google/gemma-3-27b-it, and mistralai/ministral-8b-2512.
The failure pattern is consistent: models correctly identify the first step (divide into 3 groups of 4, weigh group A vs group B) but then fail to rigorously enumerate the subcases. They hand-wave through the second and third weighings with phrases like "narrow it down to the suspect" without proving that exactly 3 weighings suffice for every branch.
The judge explains it precisely:
Claude Haiku 4.5 on counterfeit-coin (2/5): "The response correctly identifies that the puzzle is solvable in 3 weighings and attempts a reasonable initial strategy (divide into thirds, weigh 4 vs 4). However, the procedure is incomplete and contains critical gaps. Step 2 is vague and hand-wavy."
DeepSeek v3.2 on counterfeit-coin (2/5): "The model correctly states it's possible and attempts a systematic approach, but the procedure has critical flaws. Case 1 is mostly sound. However, Case 2 has a significant problem: after Step 2, the model claims results that don't follow from the weighing."
What's striking is that openai/gpt-oss-20b — which scores 119/126 on coding — falls on its face here. It can write correct FizzBuzz and business-day calculators in three languages, but it can't reason through a logic puzzle that requires exhaustive case analysis. This is exactly why multi-dimensional evaluation matters.
The Easy Puzzles
The bat-and-ball, lily pad, and surgeon riddles were each solved by 21+ models. These have become training-data staples. The surgeon riddle ("I can't operate — he's my son") was universally handled, with only gemini-3-flash-preview getting a 4/5 for briefly mentioning an alternative answer alongside the correct one.
Deep Dive: Knowledge
20 out of 41 models (5-dim) got a perfect 30/30 on factual knowledge. The remainder each missed 1-5 questions.
"How Many R's in Strawberry?"
Three models — both Llama 4 variants (meta-llama/llama-4-scout, meta-llama/llama-4-maverick) and mistralai/mistral-small-2603 — answered "2" instead of "3." This is the classic character-counting failure. The word "strawberry" has three r's (st**r**awbe**rr**y), but models that tokenize the word rather than examining individual characters consistently get it wrong.
"All But 9 Die"
"A farmer has 10 sheep. All but 9 die. How many sheep does the farmer have left?"
mistralai/codestral-2508 and mistralai/ministral-8b-2512 both answered "8," interpreting "all but 9" as something other than "9 survive." Both failing models are from Mistral, suggesting a shared training-data blind spot.
The Carbon-14 Ambiguity
Four models got the Carbon-14 half-life wrong; the correct answer is 5,730 years. This is the hardest science question by error count, and because answers can be reasonably rounded, it also introduces scoring ambiguity.
AI/ML Gotchas
google/gemma-3-27b-it said the "T" in GPT stands for "Transformative" instead of "Transformer." mistralai/ministral-8b-2512 said the original Transformer's d_model was 64 instead of 512. These are basic ML knowledge gaps — surprising for models that are themselves transformers.
Deep Dive: Instruction Following
18 out of 41 models (5-dim) achieved a perfect 30/30. The failures are mechanical and revealing.
The Markdown Fence Epidemic
The most systematic failure is "markdown fence hallucination." When told to output raw JSON with no markdown fences, models wrap the JSON in a ```json fence anyway. When told to output a markdown table, models wrap the table in a ```markdown fence (double-wrapping markdown inside markdown).
Word Counting Is Hard
"Write a 12-word sentence about the ocean. Nothing else."
Multiple models got this wrong — writing 11 or 13 words instead of exactly 12. Models that reliably nail word counts tend to be the same ones that score well on reasoning — they actually count rather than estimate.
Instruction Following Doesn't Predict Tool Use
Perfect instruction following (30/30) does not predict good MCP tool use. anthropic/claude-haiku-4.5 and nvidia/Nemotron-3-Nano-30B-A3B (DeepInfra) both got perfect 30/30 on instruction following, yet both scored 2/16 on MCP. Meanwhile, inception/mercury-2 scored only 25/30 on instruction following but got 11/16 on MCP.
Instruction following tests compliance; MCP tests initiative. They're different skills.
Deep Dive: Coding
The coding dimension has the widest raw score spread. Among the 22 models in the matrix below it runs from 71/126 (nvidia/nemotron-3-nano-30b-a3b:free, OpenRouter) to a perfect 126/126; across the full lineup the floor is 42/126 (nemotron-3-nano:4b, Ollama).
Run 2: What Changed
Run 1 ran code inside Dagger containers with DNS issues and 10-minute timeouts. Run 2 runs everything locally — tsc && node, python3, go run — with no artificial time limits. The result: coding scores increased across the board, and the true capability ceiling became visible.
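A sketch of what local execution means mechanically (the command lines come from the paragraph above; the file layout and helper names are hypothetical):

```typescript
// Minimal local runner: compile/run each submission with the host toolchain.
// No container, no DNS, no artificial timeout.
import { execSync } from "node:child_process";

const runners: Record<string, (dir: string) => string> = {
  typescript: (dir) => execSync(`tsc ${dir}/solution.ts && node ${dir}/solution.js`).toString(),
  python: (dir) => execSync(`python3 ${dir}/solution.py`).toString(),
  go: (dir) => execSync(`go run ${dir}/solution.go`).toString(),
};

export function runLocally(language: string, dir: string): string {
  return runners[language](dir); // stdout is compared against expected test output
}
```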
Eleven models achieved a perfect 126/126. Six appear in the matrix below: qwen/qwen3.5-397b-a17b, gemini-3-flash-preview, minimax/minimax-m2.7, moonshotai/kimi-k2, openai/gpt-oss-120b, and anthropic/claude-haiku-4.5. The full leaderboard adds anthropic/claude-sonnet-4.6, anthropic/claude-opus-4.6, openai/gpt-5.4, moonshotai/kimi-k2.5, and qwen3:30b-a3b (Ollama). The perfect scorers solved every challenge in every language — FizzBuzz, business days with holidays, vending machine state machines, grid path counting, rail fence cipher, and data pipeline aggregation.
The Coding Results Matrix
The matrix below covers 22 models: total score out of 126 and the number of the 18 tasks scored a perfect 7/7.
| Model | Total | Perfect Tasks |
|---|---|---|
| qwen/qwen3.5-397b-a17b | 126/126 | 18/18 |
| gemini-3-flash-preview | 126/126 | 18/18 |
| minimax/minimax-m2.7 | 126/126 | 18/18 |
| moonshotai/kimi-k2 | 126/126 | 18/18 |
| openai/gpt-oss-120b | 126/126 | 18/18 |
| anthropic/claude-haiku-4.5 | 126/126 | 18/18 |
| deepseek/deepseek-v3.2 | 123/126 | 17/18 |
| mistralai/codestral-2508 | 123/126 | 17/18 |
| inception/mercury-2 | 119/126 | 17/18 |
| openai/gpt-oss-20b | 119/126 | 17/18 |
| inception/mercury-coder | 119/126 | 17/18 |
| meta-llama/llama-4-maverick | 115/126 | 15/18 |
| qwen/qwen3.5-122b-a10b | 114/126 | 16/18 |
| qwen/qwen3.5-35b-a3b | 112/126 | 16/18 |
| nvidia/Nemotron-3-Nano-30B-A3B | 110/126 | 15/18 |
| google/gemma-3-27b-it | 105/126 | 14/18 |
| meta-llama/llama-4-scout | 101/126 | 13/18 |
| mistralai/ministral-8b-2512 | 91/126 | 11/18 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B | 88/126 | 12/18 |
| mistralai/mistral-small-2603 | 80/126 | 10/18 |
| nvidia/nemotron-3-super-120b-a12b:free | 74/126 | 10/18 |
| nvidia/nemotron-3-nano-30b-a3b:free | 71/126 | 9/18 |
Language Comparison
| Language | Avg Score | Compile Rate | Run Rate | Perfect Rate |
|---|---|---|---|---|
| TypeScript | 6.5/7 | 98% | 98% | 88% |
| Python | 6.2/7 | 91% | 90% | 88% |
| Go | 5.9/7 | 86% | 86% | 81% |
Go remains the hardest language. Models generate time.Time{Year: 2025, Month: 1}, which doesn't compile because time.Time's fields are unexported (construction goes through time.Date(2025, time.January, 1, ...)), or they call nonexistent methods. These are Go-specific pitfalls with no analogue in TypeScript or Python.
Challenge Difficulty
| Challenge | Avg Score | Perfect (7/7) | Zero (0/7) | Hardest Language |
|---|---|---|---|---|
| FizzBuzz Boom | 6.8/7 | 64 | 2 | python (6.4/7) |
| Grid Paths | 6.6/7 | 62 | 2 | go (6.1/7) |
| Vending Machine | 6.5/7 | 60 | 3 | go (6.1/7) |
| Data Pipeline | 6.4/7 | 56 | 4 | python (6.0/7) |
| Business Days | 5.7/7 | 51 | 9 | go (4.5/7) |
| Rail Fence Cipher | 5.1/7 | 41 | 13 | go (4.8/7) |
The rail fence cipher is the hardest challenge — 13 complete failures across all models. The encode step is straightforward; the decode step (computing rail lengths, filling in reading order, then extracting in zigzag order) is where models break.
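For reference, here is the decode in the standard zigzag convention. This is a sketch, since the challenge's exact I/O contract isn't reproduced here:

```typescript
// Rail fence decode: trace the zigzag once to find each position's rail,
// slice the ciphertext into rails by length, then re-walk the zigzag.
function railFenceDecode(cipher: string, rails: number): string {
  if (rails < 2) return cipher;
  // Step 1: which rail does each plaintext position sit on?
  const railOf: number[] = [];
  for (let i = 0, rail = 0, dir = 1; i < cipher.length; i++) {
    railOf.push(rail);
    if (rail === 0) dir = 1;
    else if (rail === rails - 1) dir = -1;
    rail += dir;
  }
  // Step 2: compute rail lengths, then slice the ciphertext in reading order.
  const lengths = Array(rails).fill(0);
  for (const r of railOf) lengths[r]++;
  const rows: string[][] = [];
  for (let r = 0, offset = 0; r < rails; offset += lengths[r], r++) {
    rows.push(cipher.slice(offset, offset + lengths[r]).split(""));
  }
  // Step 3: extract in zigzag order, consuming each rail front-to-back.
  return railOf.map((r) => rows[r].shift()!).join("");
}
```

Each of the three steps is exactly one of the stages the failing models fumble.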
Deep Dive: MCP Tool Use
This is the most revealing dimension because it tests something no multiple-choice benchmark can: the ability to autonomously plan and execute a multi-step task using real external tools.
The Task
Each model connects to the TezLab MCP server — a real API for electric vehicle data — and must analyze battery health and charging patterns by calling the right tools in the right order.
Six tools must be called:
- list_vehicles — discover what vehicles exist
- get_battery_health — check battery degradation
- get_charges — review charging history
- get_efficiency — pull efficiency stats
- get_my_chargers — see which chargers are used
- find_nearby_chargers — search for alternatives
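Scoring is mechanical on the tool side and LLM-judged on the prose side. Roughly (a sketch with illustrative names):

```typescript
// MCP scoring per the rubric above: 0-6 for tool coverage, 1-10 from the judge.
const REQUIRED_TOOLS = [
  "list_vehicles", "get_battery_health", "get_charges",
  "get_efficiency", "get_my_chargers", "find_nearby_chargers",
];

function scoreMcp(toolCalls: string[], judgeQuality: number): number {
  const toolScore = REQUIRED_TOOLS.filter((t) => toolCalls.includes(t)).length;
  return toolScore + judgeQuality; // out of 16
}
```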
Results
| Model | Tool Score | Judge | Total | Time | Cost |
|---|---|---|---|---|---|
| inception/mercury-2 | 6/6 | 5/10 | 11/16 | 14s | $0.007 |
| mistralai/mistral-small-2603 | 6/6 | 5/10 | 11/16 | 18s | $0.004 |
| qwen/qwen3.5-35b-a3b | 6/6 | 5/10 | 11/16 | 24s | $0.006 |
| openai/gpt-5.4 | 6/6 | 5/10 | 11/16 | 28s | $0.075 |
| moonshotai/kimi-k2 | 6/6 | 5/10 | 11/16 | 34s | $0.014 |
| qwen/qwen3.5-122b-a10b | 6/6 | 5/10 | 11/16 | 41s | $0.010 |
| gemini-3.1-pro-preview | 6/6 | 5/10 | 11/16 | 44s | $0.007 |
| qwen/qwen3.5-397b-a17b | 6/6 | 5/10 | 11/16 | 47s | $0.011 |
| x-ai/grok-4.20-beta | 6/6 | 5/10 | 11/16 | 48s | $0.063 |
| openai/gpt-oss-20b | 6/6 | 5/10 | 11/16 | 52s | $0.001 |
| anthropic/claude-sonnet-4.6 | 6/6 | 5/10 | 11/16 | 53s | $0.126 |
| gemini-3-flash-preview | 6/6 | 5/10 | 11/16 | 96s | $0.010 |
| nvidia/nemotron-3-nano-30b-a3b:free | 6/6 | 5/10 | 11/16 | 112s | Free |
| anthropic/claude-opus-4.6 | 6/6 | 5/10 | 11/16 | 117s | $0.280 |
| deepseek/deepseek-v3.2 | 6/6 | 5/10 | 11/16 | 150s | $0.009 |
| minimax/minimax-m2.7 | 6/6 | 5/10 | 11/16 | 152s | $0.014 |
| glm-4.7-flash:latest (Ollama) | 6/6 | 5/10 | 11/16 | 204s | Free |
| moonshotai/kimi-k2.5 | 6/6 | 5/10 | 11/16 | 211s | $0.015 |
| nvidia/nemotron-3-super-120b-a12b:free | 6/6 | 5/10 | 11/16 | 300s | Free |
| google/gemini-3.1-pro-preview | 6/6 | 4.7/10 | 10.7/16 | 65s | $0.057 |
| x-ai/grok-4.1-fast | 6/6 | 4.7/10 | 10.7/16 | 32s | $0.006 |
| mistral-small:latest (Ollama) | 6/6 | 4.7/10 | 10.7/16 | 156s | Free |
| inception/mercury-coder | 6/6 | 4/10 | 10/16 | 14s | $0.005 |
| nemotron-3-nano:latest (Ollama) | 5/6 | 5/10 | 10/16 | 94s | Free |
| openai/gpt-5.4-nano | 5/6 | 4.7/10 | 9.7/16 | 14s | $0.005 |
| openai/gpt-5.4-mini | 6/6 | 3.7/10 | 9.7/16 | 62s | $0.012 |
| meta-llama/llama-4-maverick | 5/6 | 4/10 | 9/16 | 23s | $0.004 |
| openai/gpt-oss-120b | 4/6 | 5/10 | 9/16 | 28s | $0.001 |
| qwen3:30b-a3b (Ollama) | 5/6 | 4/10 | 9/16 | 121s | Free |
| devstral:latest (Ollama) | 5/6 | 4/10 | 9/16 | 127s | Free |
| qwen3:32b (Ollama) | 5/6 | 4/10 | 9/16 | 281s | Free |
| mistralai/mistral-small-3.2-24b-instruct | 5/6 | 3/10 | 8/16 | 17s | $0.001 |
| meta-llama/llama-4-scout | 4/6 | 3.7/10 | 7.7/16 | 8s | $0.002 |
| mistralai/ministral-8b-2512 | 5/6 | 2/10 | 7/16 | 10s | $0.002 |
| mistralai/codestral-2508 | 5/6 | 1/10 | 6/16 | 6s | $0.004 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B | 2/6 | 2/10 | 4/16 | 37s | Free |
| nemotron-3-nano:4b (Ollama) | 2/6 | 2/10 | 4/16 | 37s | Free |
| anthropic/claude-haiku-4.5 | 1/6 | 1/10 | 2/16 | 8s | $0.013 |
| nvidia/Nemotron-3-Nano-30B-A3B | 1/6 | 1/10 | 2/16 | 33s | Free |
| gpt-oss:latest (Ollama) | 1/6 | 1/10 | 2/16 | 21s | Free |
| google/gemma-3-27b-it | 0/6 | 0/10 | 0/16 | — | $0.001 |
The Failures Tell the Real Story
anthropic/claude-haiku-4.5 via OpenRouter (2/16) — the costliest failure in ranking terms. It called list_vehicles twice, got an error, and returned an apology:
"I apologize—I'm encountering persistent server errors when trying to connect to your TezLab account."
The service was running fine for every other model. On every other dimension Haiku was strong: 30/30 knowledge, 30/30 instruction, 126/126 coding. One catastrophic failure in one dimension defines the whole ranking.
nvidia/Nemotron-3-Nano-30B-A3B on DeepInfra (2/16) — Called list_vehicles once, correctly identified both vehicles, and then asked the user for clarification instead of proceeding. The same weights on OpenRouter scored 11/16 by just picking the Tesla and running all 6 tools. Same model, different provider, one asks permission and the other gets to work.
google/gemma-3-27b-it via OpenRouter (0/16) — OpenRouter returned "No endpoints found that support tool use." This model doesn't support tool calling at all.
The Quality Gap
Among models that called all 6 tools (tool score 6/6), quality scores ranged from 3.7/10 to 5/10. The majority scored 5/10 — quality variance is far smaller than tool-usage variance. Models either figured out the full tool chain or stopped short. This suggests MCP tool use is primarily a planning problem, not a generation problem.
The Provider Effect: Why Infrastructure Matters
The same model weights, served by different providers, produce meaningfully different results.
The Numbers
| Model | Dimension | OpenRouter | DeepInfra | Gap (DeepInfra - OpenRouter) |
|---|---|---|---|---|
| Nemotron Nano 30B | Reasoning | 20/20 | 17/20 | -3 |
| Nemotron Nano 30B | Knowledge | 30/30 | 30/30 | 0 |
| Nemotron Nano 30B | Instruction | 28/30 | 30/30 | +2 |
| Nemotron Nano 30B | Coding | 71/126 | 110/126 | +39 |
| Nemotron Nano 30B | MCP Tool Use | 11/16 | 2/16 | -9 |
| Nemotron Nano 30B | Combined | 83.7% | 77.0% | -6.7pp |
| Nemotron Super 120B | Reasoning | 20/20 | 20/20 | 0 |
| Nemotron Super 120B | Knowledge | 30/30 | 28/30 | -2 |
| Nemotron Super 120B | Instruction | 25/30 | 30/30 | +5 |
| Nemotron Super 120B | Coding | 74/126 | 88/126 | +14 |
| Nemotron Super 120B | MCP Tool Use | 11/16 | 4/16 | -7 |
| Nemotron Super 120B | Combined | 82.2% | 77.6% | -4.6pp |
The Nano 30B gap is 6.7 percentage points — larger than the gap between ranks #4 and #10 in our leaderboard. The pattern is inverted between dimensions: DeepInfra dominates coding (110/126 vs 71/126) while OpenRouter dominates MCP (11/16 vs 2/16) and reasoning (20/20 vs 17/20). The Super 120B gap widened to 4.6pp with updated MCP data — OpenRouter now scores 11/16 vs DeepInfra's 4/16.
Why the Same Weights Behave Differently
At least seven layers can introduce behavioral differences: quantization precision, SDK layers (native vs OpenAI-compatible adapter), middleware (context compression, response healing), tool-calling implementation, default parameters, token usage reporting, and reasoning effort configuration.
Tool use is the most provider-sensitive capability. Knowledge and reasoning showed 0-3 point differences; MCP showed a 9-point difference. If your application relies on tool calling, provider choice matters more than model choice.
Cost and Speed Analysis
The Cost Efficiency Curve (4 Dimensions, All 49 Models)
| Tier | Cost Range | Best Model | 4-dim Score |
|---|---|---|---|
| Free (local) | $0.00 | gpt-oss:latest (Ollama) | 93.3% |
| Free (cloud) | $0.00 | nvidia/Nemotron-3-Nano-30B-A3B (DeepInfra) | 93.1% |
| Sub-penny | $0.001-$0.01 | openai/gpt-oss-120b (OR) | 98.3% |
| Penny | $0.01-$0.05 | gemini-3-flash-preview (Google) | 98.8% |
| Dime | $0.05-$0.20 | x-ai/grok-4.20-beta (OR) | 98.8% |
| Quarter+ | $0.20+ | anthropic/claude-sonnet-4.6 (OR) | 100.0% |
The knee is at openai/gpt-oss-120b ($0.01, 98.3%). For one cent you get a model that's within 2 points of the absolute best. The jump from 98.3% to 100% costs 20x more ($0.20 for Sonnet 4.6). Below that, local gpt-oss:latest on Ollama delivers 93.3% for free.
The Speed/Quality Frontier
| Model | Score | Time | Sweet Spot? |
|---|---|---|---|
| mistralai/codestral-2508 | 87.3% | 48s | Fastest overall |
| openai/gpt-5.4-mini | 97.3% | 1m | Fastest >95% |
| anthropic/claude-sonnet-4.6 | 100.0% | 3m | Perfect score |
| x-ai/grok-4.20-beta | 98.8% | 2m | Best speed/quality |
| gemini-3-flash-preview | 98.8% | 1h41m | Best value |
| openai/gpt-oss-120b | 98.3% | 10m | Best sub-penny |
What This Tells Us
1. Claude Sonnet 4.6 is the only perfect model. 100% across 4 dimensions — 20/20 reasoning, 30/30 knowledge, 30/30 instruction, 126/126 coding. No other model achieves this. But at $0.20, you pay a premium for perfection.
2. The top 5 models are within 1.2 points of each other. Sonnet 4.6 (100%), Qwen 397B (99.2%), Grok 4.20 (98.8%), Gemini Flash (98.8%), GPT-5.4 (98.8%). The difference between them is essentially noise — model choice at the frontier matters less than it used to.
3. openai/gpt-oss-120b at $0.01 is the deal of the century. 98.3% — ahead of GPT-5.4 Mini, Gemini Pro, and Claude Opus — for one cent. Apache-licensed. This is an open-weight model costing 20x less than Sonnet and scoring within 2 points.
4. Local models are no longer a compromise. gpt-oss:latest on Ollama (93.3%) and nemotron-3-nano:latest (89.4%) run entirely on a MacBook, for free, and beat many cloud models. The 5-7% gap vs cloud is primarily in coding (Go struggles) and instruction following.
5. Multi-dimensional evaluation reveals things single-axis benchmarks hide. anthropic/claude-haiku-4.5 scores 96.3% on 4 dimensions but drops to 79.5% on 5 dimensions due to its MCP collapse. qwen3:30b-a3b gets perfect coding (126/126) but 4/30 instruction following.
6. Provider choice matters as much as model choice. The 6.7pp gap between the same Nemotron Nano 30B on OpenRouter vs DeepInfra is larger than the gap between many adjacent-ranked models.
7. Tool orchestration is a binary skill. Models either call all the right tools and produce excellent analysis, or they stall early and produce nothing. No graceful degradation.
8. Go code generation is the biggest language gap. TypeScript compiles 98% of the time; Go only 86%.
9. The frontier is crowded. 17 models score above 95% on 4 dimensions. At this level, the differentiators are cost, speed, and tool use — not raw capability.
Methodology
All evaluations ran with temperature 0.0 (knowledge, instruction) or 0.2-0.3 (reasoning, coding). Reasoning and knowledge responses were judged by anthropic/claude-haiku-4.5 (via OpenRouter). Instruction following and coding used deterministic verification — no LLM judge involved. MCP tool usage is scored mechanically (did you call the tool?); response quality is LLM-judged.
Results are cached per model per task; interrupted runs can be resumed. All models were tested under the same conditions with identical prompts.
Combined scores are the mean of normalized dimension percentages (each dimension's raw score divided by its maximum, then averaged across all 5 dimensions). Only models present in all 5 dimensions appear in the combined leaderboard.
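In code, with this run's dimension maxima (a direct transcription of the formula above):

```typescript
// Combined score = mean of per-dimension percentages.
const MAX = { reasoning: 20, knowledge: 30, instruction: 30, coding: 126, mcp: 16 };

function combined(scores: Record<keyof typeof MAX, number>): number {
  const dims = Object.keys(MAX) as (keyof typeof MAX)[];
  return (100 * dims.reduce((sum, d) => sum + scores[d] / MAX[d], 0)) / dims.length;
}

// Claude Sonnet 4.6's five-dimension row:
combined({ reasoning: 20, knowledge: 30, instruction: 30, coding: 126, mcp: 11 });
// => 93.75, which rounds to the 93.8% shown in the leaderboard
```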
Run Details
- Reasoning: Run #7, 4 puzzles, scored /20
- Knowledge: Run #2, 30 questions across 6 categories, scored /30
- Instruction: Run #2, 6 constraint tasks, scored /30
- Coding: Run #6, 6 challenges × 3 languages = 18 tasks, scored /126 (local execution, no containers)
- MCP Tool Use: Run #1, 1 multi-tool task, scored /16
Total evaluation cost: $4.63 across all 49 models and all dimensions.
Limitations
- 8 of 49 models lack complete 5-dimension results: phi4, gemma3n, and the deepseek-r1 variants don't support tool use; the nemotron-nano-9b-v2 variants and minimax-01 were not included in the MCP run.
- The MCP eval used a single task with a live API. Results may vary with different MCP servers, tool schemas, or task descriptions.
- Free-tier models may have different availability, rate limits, or routing than paid versions.
- We tested each model once per task. Stochastic variation means scores could shift by 1-3 points on a re-run.
Built with umwelten — an open-source framework for multi-model evaluation, MCP tool integration, and LLM-judged scoring.