The Model Showdown: Testing 49 LLMs Across 5 Dimensions for $4.63

March 2026 — Run 2 (full coding coverage, 49 models)

The Question Behind the Showdown

Which model should you actually use? Not which one tops a leaderboard somewhere — which one will reason through your problem, follow your formatting instructions, write code that compiles, answer factual questions correctly, and orchestrate real-world tools without falling apart?

We built an evaluation suite that tests 49 language models across 5 fundamentally different capabilities: logical reasoning, factual knowledge, precise instruction following, executable code generation, and MCP tool orchestration against a live API. The total cost was $4.63.

The lineup spans the full spectrum: from frontier models (Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, Grok 4.20) to free-tier NVIDIA models to 13 local Ollama models running on a MacBook. Premium ($5+/M tokens), value ($0.05-0.50/M), budget (<$0.05/M), and free. Closed-source and open-weight. Cloud and local.

This is Run 2. Run 1 had incomplete coding coverage due to container DNS failures and timeouts. Run 2 re-evaluated all models across all 18 coding tasks with full local execution — no containers, no DNS failures, no time limits. Six models now achieve a perfect 126/126 on coding.

Note on MCP coverage: 41 of 49 models have complete 5-dimension results (including MCP tool use). The remaining 8 either lack tool-use support (phi4, gemma3n, deepseek-r1 variants, gemma-3-27b-it) or were not included in the MCP evaluation run (nemotron-nano-9b-v2 variants, minimax-01). The 4-dimension leaderboard below includes all 49 models; the 5-dimension leaderboard covers the 41 with complete data.

What We Tested and Why

The Models

49 models across 4 providers and local inference:

OpenRouter (31 models via @openrouter/ai-sdk-provider):

  • Premium: anthropic/claude-opus-4.6, anthropic/claude-sonnet-4.6, openai/gpt-5.4, x-ai/grok-4.20-beta
  • High: openai/gpt-5.4-mini, anthropic/claude-haiku-4.5, moonshotai/kimi-k2.5, moonshotai/kimi-k2, google/gemini-3.1-pro-preview
  • Mid/Value: Qwen 3.5 (397B, 122B, 35B), DeepSeek v3.2, MiniMax M2.7, Inception Mercury 2/Coder, OpenAI GPT-OSS (120B, 20B), GPT-5.4 Nano, Grok 4.1 Fast
  • Budget: Meta Llama 4 (Maverick, Scout), Mistral (Small, Codestral, Ministral 8B), Gemma 3 27B
  • Free: NVIDIA Nemotron 3 (Super 120B, Nano 30B, Nano 9B v2)

Google (2 models, direct API):

  • gemini-3-flash-preview, gemini-3.1-pro-preview

DeepInfra (3 models):

  • NVIDIA Nemotron 3 (Super 120B, Nano 30B, Nano 9B v2)

Ollama (13 models, local on MacBook):

  • deepseek-r1 (latest/32b/14b), devstral, phi4, mistral-small, glm-4.7-flash, gemma3n:e4b, qwen3 (32b/30b-a3b), nemotron-3-nano (latest/4b), gpt-oss

The 5 Dimensions

Each dimension tests something that the others can't.

Reasoning (/20) — Can the model think past intuitive traps? Four classic logic puzzles where the obvious answer is wrong. A bat and ball that don't cost what you think. A patch of lily pads where halving the time is the wrong move. A surgeon who isn't who you assumed. And the hardest: find a counterfeit coin among 12 using exactly 3 weighings on a balance scale. Scored by an LLM judge on reasoning quality, not just correctness.
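
In the standard version of that first puzzle, the bat and ball cost $1.10 together and the bat costs $1.00 more than the ball. The intuitive answer of $0.10 fails the constraint; writing out the algebra:

  ball + (ball + 1.00) = 1.10  →  2 × ball = 0.10  →  ball = 0.05

So the ball costs $0.05 and the bat $1.05, not $0.10 and $1.00.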

Knowledge (/30) — Does the model know things? 30 factual questions across Science, Geography, History, Technology, AI/ML, and Tricky/Adversarial categories. Binary scoring (correct or not) with an LLM judge that allows formatting variations — "5,730 years" and "5730 years" both count.

Instruction Following (/30) — Can the model do exactly what you ask? Six tasks with precise formatting constraints scored deterministically — no LLM judge, just regex, JSON parsing, and character counting. Write exactly 12 words. Output valid JSON without markdown fences. Convert CSV to a markdown table. Follow negative constraints ("don't use the word 'beautiful'").
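
As an illustration of what "deterministic" means here, this is a minimal TypeScript sketch of two such checks (exact word count and a banned-word constraint). The function names and example strings are ours, not the suite's actual code:

```typescript
// Hypothetical checkers in the spirit of the suite's deterministic scoring.

// "Write a sentence of exactly N words. Nothing else."
function checkExactWordCount(response: string, n: number): boolean {
  const words = response.trim().split(/\s+/).filter(Boolean);
  return words.length === n;
}

// Negative constraint: the response must not contain a banned word.
function checkBannedWord(response: string, banned: string): boolean {
  return !new RegExp(`\\b${banned}\\b`, "i").test(response);
}

// Example usage:
checkExactWordCount("Waves rise and fall across the quiet endless blue of dawn.", 12); // false (11 words)
checkBannedWord("The sunset was striking, vivid, and calm.", "beautiful");             // true
```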

Coding (/126) — Can the model write code that actually runs? Six programming challenges across TypeScript, Python, and Go — 18 tasks total. Each submission is compiled, executed against test cases, and scored on correctness (compile: 1pt, run: 1pt, output: 0-5pts = 0-7 per task). FizzBuzz with a twist, business day calculation with holidays, a vending machine state machine, grid path counting with obstacles, rail fence cipher, and data pipeline aggregation.
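
A sketch of that per-task arithmetic (the field names are ours; the real harness may differ):

```typescript
// Per-task score: compile (0/1) + run (0/1) + output correctness (0-5) = 0-7.
// 6 challenges x 3 languages = 18 tasks, so the dimension maximum is 18 x 7 = 126.
interface TaskResult {
  compiled: boolean;
  ran: boolean;
  outputPoints: number; // 0-5, from test-case comparison
}

function scoreTask(r: TaskResult): number {
  return (r.compiled ? 1 : 0) + (r.ran ? 1 : 0) + Math.min(Math.max(r.outputPoints, 0), 5);
}

function scoreCodingDimension(results: TaskResult[]): number {
  return results.reduce((sum, r) => sum + scoreTask(r), 0); // out of 126 for 18 tasks
}
```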

MCP Tool Use (/16) — Can the model orchestrate real-world tools? Each model connects to the TezLab MCP server (real EV vehicle data) and must analyze battery health and charging patterns by calling the right sequence of tools: list vehicles, get battery health, pull charging history, check efficiency stats, find chargers, and search for alternatives. Scored on both tool usage (did you call all 6 required tools? 0-6) and response quality (did you synthesize the data into something useful? 1-10).
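
A minimal sketch of how that composite could be computed, using the tool names from the task; the quality score comes from an LLM judge, not from code, and this is our own illustration rather than the suite's implementation:

```typescript
// The six required tools, as described in the MCP task.
const REQUIRED_TOOLS = [
  "list_vehicles", "get_battery_health", "get_charges",
  "get_efficiency", "get_my_chargers", "find_nearby_chargers",
] as const;

// coverage (0-6) + LLM-judged quality (1-10) = total out of 16.
function scoreMcp(calledTools: string[], judgeScore: number): number {
  const called = new Set(calledTools);
  const coverage = REQUIRED_TOOLS.filter((t) => called.has(t)).length;
  return coverage + judgeScore;
}
```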


The Full Leaderboard: 49 Models × 4 Dimensions

All 49 models ranked by combined score across reasoning, knowledge, instruction following, and coding. This includes premium-tier models, local Ollama models, and free-tier offerings that couldn't be tested on MCP.

| Rank | Model | Provider | Combined | Reasoning | Knowledge | Instruction | Coding | Cost | Time |
|---|---|---|---|---|---|---|---|---|---|
| 1 | anthropic/claude-sonnet-4.6 | OpenRouter | 100.0% | 20/20 | 30/30 | 30/30 | 126/126 | $0.20 | 3m |
| 2 | qwen/qwen3.5-397b-a17b | OpenRouter | 99.2% | 20/20 | 29/30 | 30/30 | 126/126 | $0.52 | 36m |
| 3 | x-ai/grok-4.20-beta | OpenRouter | 98.8% | 20/20 | 29/30 | 30/30 | 124/126 | $0.09 | 2m |
| 4 | gemini-3-flash-preview | Google | 98.8% | 19/20 | 30/30 | 30/30 | 126/126 | $0.04 | 1h41m |
| 5 | openai/gpt-5.4 | OpenRouter | 98.8% | 19/20 | 30/30 | 30/30 | 126/126 | $0.17 | 2m |
| 6 | openai/gpt-oss-120b | OpenRouter | 98.3% | 20/20 | 28/30 | 30/30 | 126/126 | $0.01 | 10m |
| 7 | qwen/qwen3.5-122b-a10b | OpenRouter | 97.6% | 20/20 | 30/30 | 30/30 | 114/126 | $0.43 | 18m |
| 8 | openai/gpt-5.4-mini | OpenRouter | 97.3% | 20/20 | 30/30 | 28/30 | 121/126 | $0.05 | 1m |
| 9 | gemini-3.1-pro-preview | Google | 97.3% | 18/20 | 30/30 | 30/30 | 125/126 | $0.01 | 23m |
| 10 | qwen/qwen3.5-35b-a3b | OpenRouter | 97.2% | 20/20 | 30/30 | 30/30 | 112/126 | $0.21 | 30m |
| 11 | google/gemini-3.1-pro-preview | OpenRouter | 96.9% | 18/20 | 30/30 | 30/30 | 123/126 | $1.85 | 31m |
| 12 | anthropic/claude-haiku-4.5 | OpenRouter | 96.3% | 17/20 | 30/30 | 30/30 | 126/126 | $0.07 | 2m |
| 13 | minimax/minimax-m2.7 | OpenRouter | 96.3% | 17/20 | 30/30 | 30/30 | 126/126 | $0.10 | 24m |
| 14 | moonshotai/kimi-k2 | OpenRouter | 95.8% | 20/20 | 30/30 | 25/30 | 126/126 | $0.03 | 5m |
| 15 | moonshotai/kimi-k2.5 | OpenRouter | 95.8% | 20/20 | 30/30 | 25/30 | 126/126 | $0.26 | 50m |
| 16 | x-ai/grok-4.1-fast | OpenRouter | 95.8% | 20/20 | 30/30 | 30/30 | 105/126 | $0.06 | 13m |
| 17 | anthropic/claude-opus-4.6 | OpenRouter | 95.8% | 18/20 | 30/30 | 28/30 | 126/126 | $0.37 | 4m |
| 18 | inception/mercury-2 | OpenRouter | 94.4% | 20/20 | 30/30 | 25/30 | 119/126 | $0.06 | 2m |
| 19 | mistralai/mistral-small-2603 | OpenRouter | 94.1% | 17/20 | 30/30 | 30/30 | 115/126 | $0.008 | 1m |
| 20 | openai/gpt-oss-20b | OpenRouter | 94.0% | 17/20 | 29/30 | 30/30 | 119/126 | $0.02 | 8m |
| 21 | gpt-oss:latest | Ollama | 93.3% | 20/20 | 28/30 | 29/30 | 105/126 | Free | 1h11m |
| 22 | deepseek/deepseek-v3.2 | OpenRouter | 93.2% | 17/20 | 29/30 | 28/30 | 123/126 | $0.007 | 14m |
| 23 | nvidia/Nemotron-3-Nano-30B-A3B | DeepInfra | 93.1% | 17/20 | 30/30 | 30/30 | 110/126 | Free | 14m |
| 24 | nvidia/Nemotron-Super-120B-A12B | DeepInfra | 90.8% | 20/20 | 28/30 | 30/30 | 88/126 | Free | 3m |
| 25 | nemotron-3-nano:latest | Ollama | 89.4% | 20/20 | 30/30 | 27/30 | 85/126 | Free | 36m |
| 26 | meta-llama/llama-4-maverick | OpenRouter | 89.1% | 17/20 | 29/30 | 25/30 | 115/126 | $0.008 | 16m |
| 27 | meta-llama/llama-4-scout | OpenRouter | 88.8% | 17/20 | 27/30 | 30/30 | 101/126 | $0.004 | 15m |
| 28 | openai/gpt-5.4-nano | OpenRouter | 87.4% | 15/20 | 29/30 | 27/30 | 111/126 | $0.02 | 2m |
| 29 | nvidia/nemotron-3-nano-30b-a3b:free | OpenRouter | 87.4% | 20/20 | 30/30 | 28/30 | 71/126 | Free | 2m |
| 30 | mistralai/codestral-2508 | OpenRouter | 87.3% | 17/20 | 25/30 | 25/30 | 123/126 | $0.01 | 48s |
| 31 | google/gemma-3-27b-it | OpenRouter | 87.1% | 17/20 | 29/30 | 25/30 | 105/126 | $0.003 | 4m |
| 32 | qwen3:32b | Ollama | 86.7% | 17/20 | 29/30 | 25/30 | 103/126 | Free | 59m |
| 33 | inception/mercury-coder | OpenRouter | 86.5% | 17/20 | 25/30 | 25/30 | 119/126 | $0.01 | 2m |
| 34 | nvidia/nemotron-3-super-120b-a12b:free | OpenRouter | 85.5% | 20/20 | 30/30 | 25/30 | 74/126 | Free | 11m |
| 35 | phi4:latest | Ollama | 84.6% | 16/20 | 27/30 | 27/30 | 99/126 | Free | 25m |
| 36 | mistral-small:latest | Ollama | 84.4% | 17/20 | 29/30 | 23/30 | 100/126 | Free | 49m |
| 37 | glm-4.7-flash:latest | Ollama | 84.1% | 17/20 | 29/30 | 26/30 | 86/126 | Free | 54m |
| 38 | devstral:latest | Ollama | 84.0% | 16/20 | 28/30 | 26/30 | 96/126 | Free | 44m |
| 39 | deepseek-r1:32b | Ollama | 84.0% | 17/20 | 28/30 | 28/30 | 81/126 | Free | 1h5m |
| 40 | mistralai/mistral-small-3.2-24b-instruct | OpenRouter | 83.0% | 17/20 | 29/30 | 26/30 | 80/126 | $0.002 | 2m |
| 41 | mistralai/ministral-8b-2512 | OpenRouter | 81.4% | 16/20 | 27/30 | 25/30 | 91/126 | $0.004 | 1m |
| 42 | gemma3n:e4b | Ollama | 79.7% | 17/20 | 29/30 | 24/30 | 72/126 | Free | 54m |
| 43 | deepseek-r1:14b | Ollama | 78.2% | 17/20 | 24/30 | 26/30 | 77/126 | Free | 57m |
| 44 | deepseek-r1:latest | Ollama | 78.1% | 15/20 | 26/30 | 30/30 | 64/126 | Free | 33m |
| 45 | nvidia/nemotron-nano-9b-v2:free | OpenRouter | 77.6% | 17/20 | 26/30 | 29/30 | 53/126 | Free | 1h52m |
| 46 | qwen3:30b-a3b | Ollama | 77.5% | 20/20 | 29/30 | 4/30 | 126/126 | Free | 1h14m |
| 47 | nvidia/Nemotron-Nano-9B-v2 | DeepInfra | 75.7% | 17/20 | 26/30 | 25/30 | 60/126 | Free | 9m |
| 48 | nemotron-3-nano:4b | Ollama | 73.8% | 19/20 | 25/30 | 25/30 | 42/126 | Free | 42m |

Key Findings From the Full Lineup

1. Claude Sonnet 4.6 achieves a perfect 100%. The only model to score maximum on all 4 dimensions: 20/20 reasoning, 30/30 knowledge, 30/30 instruction, 126/126 coding. At $0.20 it's the gold standard for capability, but it costs 5x more than Gemini Flash, which is only 1.2 points behind.

2. Gemini 3 Flash is the value champion at 98.8% for $0.04. Effectively tied for #3, at one-fifth the cost of the next-cheapest competitor in its tier. Perfect knowledge, perfect instruction, perfect coding; its only dropped point was on reasoning.

3. openai/gpt-oss-120b at $0.01 is absurd. 98.3% — #6 overall, ahead of Gemini Pro and GPT-5.4 Mini — for one cent. Perfect coding (126/126), perfect instruction, perfect reasoning. It missed 2 knowledge questions. An Apache-licensed model at a penny.

4. gpt-oss:latest on Ollama scores 93.3% — for free, running locally. The same GPT-OSS architecture running on a MacBook ranks #21 overall, beating Meta Llama 4, DeepSeek v3.2, and all Mistral models. Local inference is no longer a compromise.

5. qwen3:30b-a3b on Ollama gets perfect coding (126/126) but catastrophic instruction following (4/30). It produces excellent code but can't follow formatting constraints — markdown fences everywhere, wrong word counts. A pure specialist that would rank #1 on coding alone but #46 overall.

6. The Ollama models cluster at roughly 74-93%. Local models are consistently 5-15 points behind cloud models on the same architecture. The gap comes primarily from coding (Go struggles) and instruction following, not reasoning or knowledge.


The 5-Dimension Leaderboard: 41 Models With MCP

41 models have complete results across all 5 dimensions including MCP tool use. Only 8 models are excluded — 5 that lack tool-use support entirely (phi4, gemma3n, deepseek-r1 variants), 1 without tool endpoints (gemma-3-27b-it), and 2 not attempted (nemotron-nano-9b-v2 variants).

| Rank | Model | Provider | Combined | Reasoning | Knowledge | Instruction | Coding | MCP | Cost | Time |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | anthropic/claude-sonnet-4.6 | OpenRouter | 93.8% | 20/20 | 30/30 | 30/30 | 126/126 | 11/16 | $0.33 | 4m |
| 2 | qwen/qwen3.5-397b-a17b | OpenRouter | 93.1% | 20/20 | 29/30 | 30/30 | 126/126 | 11/16 | $0.54 | 37m |
| 3 | x-ai/grok-4.20-beta | OpenRouter | 92.8% | 20/20 | 29/30 | 30/30 | 124/126 | 11/16 | $0.16 | 3m |
| 4 | gemini-3-flash-preview | Google | 92.8% | 19/20 | 30/30 | 30/30 | 126/126 | 11/16 | $0.05 | 1h42m |
| 5 | openai/gpt-5.4 | OpenRouter | 92.8% | 19/20 | 30/30 | 30/30 | 126/126 | 11/16 | $0.24 | 3m |
| 6 | qwen/qwen3.5-122b-a10b | OpenRouter | 91.8% | 20/20 | 30/30 | 30/30 | 114/126 | 11/16 | $0.44 | 19m |
| 7 | gemini-3.1-pro-preview | Google | 91.6% | 18/20 | 30/30 | 30/30 | 125/126 | 11/16 | $0.02 | 23m |
| 8 | qwen/qwen3.5-35b-a3b | OpenRouter | 91.5% | 20/20 | 30/30 | 30/30 | 112/126 | 11/16 | $0.22 | 31m |
| 9 | google/gemini-3.1-pro-preview | OpenRouter | 90.9% | 18/20 | 30/30 | 30/30 | 123/126 | 10.7/16 | $1.91 | 32m |
| 10 | minimax/minimax-m2.7 | OpenRouter | 90.8% | 17/20 | 30/30 | 30/30 | 126/126 | 11/16 | $0.11 | 27m |
| 11 | moonshotai/kimi-k2 | OpenRouter | 90.4% | 20/20 | 30/30 | 25/30 | 126/126 | 11/16 | $0.04 | 5m |
| 12 | moonshotai/kimi-k2.5 | OpenRouter | 90.4% | 20/20 | 30/30 | 25/30 | 126/126 | 11/16 | $0.28 | 53m |
| 13 | anthropic/claude-opus-4.6 | OpenRouter | 90.4% | 18/20 | 30/30 | 28/30 | 126/126 | 11/16 | $0.65 | 6m |
| 14 | x-ai/grok-4.1-fast | OpenRouter | 90.0% | 20/20 | 30/30 | 30/30 | 105/126 | 10.7/16 | $0.07 | 13m |
| 15 | openai/gpt-5.4-mini | OpenRouter | 90.0% | 20/20 | 30/30 | 28/30 | 121/126 | 9.7/16 | $0.06 | 2m |
| 16 | openai/gpt-oss-120b | OpenRouter | 89.9% | 20/20 | 28/30 | 30/30 | 126/126 | 9/16 | $0.01 | 11m |
| 17 | inception/mercury-2 | OpenRouter | 89.3% | 20/20 | 30/30 | 25/30 | 119/126 | 11/16 | $0.07 | 2m |
| 18 | mistralai/mistral-small-2603 | OpenRouter | 89.0% | 17/20 | 30/30 | 30/30 | 115/126 | 11/16 | $0.01 | 2m |
| 19 | openai/gpt-oss-20b | OpenRouter | 89.0% | 17/20 | 29/30 | 30/30 | 119/126 | 11/16 | $0.02 | 9m |
| 20 | deepseek/deepseek-v3.2 | OpenRouter | 88.3% | 17/20 | 29/30 | 28/30 | 123/126 | 11/16 | $0.02 | 16m |
| 21 | nemotron-3-nano:latest | Ollama | 84.0% | 20/20 | 30/30 | 27/30 | 85/126 | 10/16 | Free | 38m |
| 22 | nvidia/nemotron-3-nano-30b-a3b:free | OpenRouter | 83.7% | 20/20 | 30/30 | 28/30 | 71/126 | 11/16 | Free | 4m |
| 23 | meta-llama/llama-4-maverick | OpenRouter | 82.5% | 17/20 | 29/30 | 25/30 | 115/126 | 9/16 | $0.01 | 16m |
| 24 | nvidia/nemotron-3-super-120b-a12b:free | OpenRouter | 82.2% | 20/20 | 30/30 | 25/30 | 74/126 | 11/16 | Free | 14m |
| 25 | openai/gpt-5.4-nano | OpenRouter | 82.1% | 15/20 | 29/30 | 27/30 | 111/126 | 9.7/16 | $0.02 | 2m |
| 26 | inception/mercury-coder | OpenRouter | 81.7% | 17/20 | 25/30 | 25/30 | 119/126 | 10/16 | $0.02 | 2m |
| 27 | glm-4.7-flash:latest | Ollama | 81.1% | 17/20 | 29/30 | 26/30 | 86/126 | 11/16 | Free | 57m |
| 28 | mistral-small:latest | Ollama | 80.9% | 17/20 | 29/30 | 23/30 | 100/126 | 10.7/16 | Free | 52m |
| 29 | meta-llama/llama-4-scout | OpenRouter | 80.7% | 17/20 | 27/30 | 30/30 | 101/126 | 7.7/16 | $0.005 | 15m |
| 30 | qwen3:32b | Ollama | 80.6% | 17/20 | 29/30 | 25/30 | 103/126 | 9/16 | Free | 1h3m |
| 31 | anthropic/claude-haiku-4.5 | OpenRouter | 79.5% | 17/20 | 30/30 | 30/30 | 126/126 | 2/16 | $0.08 | 2m |
| 32 | devstral:latest | Ollama | 78.5% | 16/20 | 28/30 | 26/30 | 96/126 | 9/16 | Free | 46m |
| 33 | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B | DeepInfra | 77.6% | 20/20 | 28/30 | 30/30 | 88/126 | 4/16 | Free | 3m |
| 34 | mistralai/codestral-2508 | OpenRouter | 77.4% | 17/20 | 25/30 | 25/30 | 123/126 | 6/16 | $0.02 | 54s |
| 35 | gpt-oss:latest | Ollama | 77.2% | 20/20 | 28/30 | 29/30 | 105/126 | 2/16 | Free | 1h11m |
| 36 | nvidia/Nemotron-3-Nano-30B-A3B | DeepInfra | 77.0% | 17/20 | 30/30 | 30/30 | 110/126 | 2/16 | Free | 15m |
| 37 | mistralai/mistral-small-3.2-24b-instruct | OpenRouter | 76.4% | 17/20 | 29/30 | 26/30 | 80/126 | 8/16 | $0.003 | 2m |
| 38 | mistralai/ministral-8b-2512 | OpenRouter | 73.9% | 16/20 | 27/30 | 25/30 | 91/126 | 7/16 | $0.006 | 2m |
| 39 | qwen3:30b-a3b | Ollama | 73.3% | 20/20 | 29/30 | 4/30 | 126/126 | 9/16 | Free | 1h16m |
| 40 | google/gemma-3-27b-it | OpenRouter | 69.7% | 17/20 | 29/30 | 25/30 | 105/126 | 0/16 | $0.003 | 4m |
| 41 | nemotron-3-nano:4b | Ollama | 64.0% | 19/20 | 25/30 | 25/30 | 42/126 | 4/16 | Free | 42m |

The Biggest Surprises (5-Dimension View)

1. Claude Sonnet 4.6 takes #1. With MCP data for all major models, Sonnet leads at 93.8% — the only model to achieve 100% on 4 of 5 dimensions, with a strong 11/16 MCP showing.

2. MCP reshuffles everything. Claude Haiku 4.5 is #12 on 4 dimensions (96.3%) but drops to #31 on 5 dimensions (79.5%) due to its MCP collapse (2/16). Tool use acts as a great equalizer — models that can't orchestrate tools fall behind models that are weaker on other axes.

3. Free models in the top 22. nemotron-3-nano:latest on Ollama at 84.0% (5-dim) and nvidia/nemotron-3-nano-30b-a3b:free at 83.7% both beat Haiku, Llama 4, and most Mistral models. Perfect reasoning, perfect knowledge, strong MCP.

4. Same weights, different provider, different results. Nemotron Nano 30B: OpenRouter 83.7%, DeepInfra 77.0%. A 6.7pp gap — larger than the gap between ranks #4 and #10.


Deep Dive: Reasoning

The counterfeit coin problem is the single hardest task in the entire showdown. Only 9 out of 41 models (5-dim) solved it correctly with quality reasoning.

Why the Counterfeit Coin Is So Hard

The problem: you have 12 coins, one is counterfeit (heavier or lighter — you don't know which). Using a balance scale exactly 3 times, find the counterfeit coin and determine whether it's heavier or lighter.

This is information-theoretically tight. Three weighings give you 3^3 = 27 possible outcomes. There are 24 possible states (12 coins × 2 weight possibilities). So it's barely possible — and the procedure must be exhaustive, covering every branch.

Models that solved it (quality score 5/5): All three Qwen 3.5 variants, openai/gpt-oss-120b, moonshotai/kimi-k2, nvidia/nemotron-3-nano-30b-a3b:free (OpenRouter), inception/mercury-2, both Nemotron Super 120B variants (OpenRouter and DeepInfra).

Models that failed (quality score 2/5): openai/gpt-oss-20b, minimax/minimax-m2.7, deepseek/deepseek-v3.2, both Llama 4 variants, mistralai/mistral-small-2603, inception/mercury-coder, anthropic/claude-haiku-4.5, nvidia/Nemotron-3-Nano-30B-A3B (DeepInfra), mistralai/codestral-2508, google/gemma-3-27b-it, and mistralai/ministral-8b-2512.

The failure pattern is consistent: models correctly identify the first step (divide into 3 groups of 4, weigh group A vs group B) but then fail to rigorously enumerate the subcases. They hand-wave through the second and third weighings with phrases like "narrow it down to the suspect" without proving that exactly 3 weighings suffice for every branch.

The judge explains it precisely:

Claude Haiku 4.5 on counterfeit-coin (2/5): "The response correctly identifies that the puzzle is solvable in 3 weighings and attempts a reasonable initial strategy (divide into thirds, weigh 4 vs 4). However, the procedure is incomplete and contains critical gaps. Step 2 is vague and hand-wavy."

DeepSeek v3.2 on counterfeit-coin (2/5): "The model correctly states it's possible and attempts a systematic approach, but the procedure has critical flaws. Case 1 is mostly sound. However, Case 2 has a significant problem: after Step 2, the model claims results that don't follow from the weighing."

What's striking is that openai/gpt-oss-20b — which scores 119/126 on coding — falls on its face here. It can write correct FizzBuzz and business-day calculators in three languages, but it can't reason through a logic puzzle that requires exhaustive case analysis. This is exactly why multi-dimensional evaluation matters.

The Easy Puzzles

The bat-and-ball, lily pad, and surgeon riddles were all solved by 21+ models. These have become training-data staples. The surgeon riddle ("I can't operate — he's my son") was universally handled, with only gemini-3-flash-preview getting a 4/5 for briefly mentioning an alternative answer alongside the correct one.


Deep Dive: Knowledge

10 out of 41 models (5-dim) got a perfect 30/30 on factual knowledge. The remainder each missed 1-5 questions.

"How Many R's in Strawberry?"

Three models answered "2" instead of "3": both Llama 4 variants (meta-llama/llama-4-scout, meta-llama/llama-4-maverick) and mistralai/mistral-small-2603. This is the classic character-counting failure. The word "strawberry" has three r's (st-r-awbe-r-r-y), but models that tokenize the word rather than examining individual characters consistently get it wrong.
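
The check itself is trivial once individual characters are examined rather than tokens; in code it is a one-liner:

```typescript
// Count occurrences of "r" by examining each character of the string.
const rCount = [..."strawberry"].filter((c) => c === "r").length; // 3
```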

"All But 9 Die"

"A farmer has 10 sheep. All but 9 die. How many sheep does the farmer have left?"

mistralai/codestral-2508 and mistralai/ministral-8b-2512 both answered "8," interpreting "all but 9" as something other than "9 survive." Both failing models are from Mistral, suggesting a shared training-data blind spot.

The Carbon-14 Ambiguity

Four models got the Carbon-14 half-life wrong. The correct answer is 5,730 years. This is the hardest science question by error count — questions that allow rounding introduce scoring complexity.

AI/ML Gotchas

google/gemma-3-27b-it said the "T" in GPT stands for "Transformative" instead of "Transformer." mistralai/ministral-8b-2512 said the original Transformer's d_model was 64 instead of 512. These are basic ML knowledge gaps — surprising for models that are themselves transformers.


Deep Dive: Instruction Following

11 out of 41 models (5-dim) achieved a perfect 30/30. The failures are mechanical and revealing.

The Markdown Fence Epidemic

The most systematic failure is "markdown fence hallucination." When told to output raw JSON with no markdown fences, models wrap it in ```json ... ``` fences anyway. When told to output a markdown table, models wrap the table in ```markdown ... ``` fences, double-wrapping markdown inside markdown.
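
This is also why the dimension can be scored without a judge: the failure is mechanically detectable. A sketch of the kind of check involved (our own naming, not the suite's implementation):

```typescript
// Pass only if the response parses as JSON *and* is not wrapped in markdown fences.
function isRawJson(response: string): boolean {
  const text = response.trim();
  if (text.startsWith("```")) return false; // markdown fence hallucination
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

isRawJson('{"city": "Lisbon", "population": 545000}'); // true
isRawJson('```json\n{"city": "Lisbon"}\n```');          // false
```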

Word Counting Is Hard

"Write a 12-word sentence about the ocean. Nothing else."

Multiple models got this wrong — writing 11 or 13 words instead of exactly 12. Models that reliably nail word counts tend to be the same ones that score well on reasoning — they actually count rather than estimate.

Instruction Following Doesn't Predict Tool Use

Perfect instruction following (30/30) does not predict good MCP tool use. anthropic/claude-haiku-4.5 and nvidia/Nemotron-3-Nano-30B-A3B (DeepInfra) both got perfect 30/30 on instruction following, yet both scored 2/16 on MCP. Meanwhile, inception/mercury-2 scored only 25/30 on instruction following but got 11/16 on MCP.

Instruction following tests compliance; MCP tests initiative. They're different skills.


Deep Dive: Coding

The coding dimension has the widest score spread: 71/126 (nvidia/nemotron-3-nano-30b-a3b:free, OpenRouter) to 126/126 (six models tied for perfect).

Run 2: What Changed

Run 1 ran code inside Dagger containers with DNS issues and 10-minute timeouts. Run 2 runs everything locally — tsc && node, python3, go run — with no artificial time limits. The result: coding scores increased across the board, and the true capability ceiling became visible.
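
A rough sketch of what local execution looks like, using the same commands named above (simplified error handling; the actual harness in umwelten may differ):

```typescript
import { execSync } from "node:child_process";

// Compile and run one submission, returning stdout or null on failure.
// Commands mirror the ones named above: tsc && node, python3, go run.
function runSubmission(language: "typescript" | "python" | "go", file: string): string | null {
  const commands: Record<string, string> = {
    typescript: `tsc ${file} && node ${file.replace(/\.ts$/, ".js")}`,
    python: `python3 ${file}`,
    go: `go run ${file}`,
  };
  try {
    return execSync(commands[language], { encoding: "utf8" });
  } catch {
    return null; // compile or runtime failure
  }
}
```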

Six models achieved a perfect 126/126: qwen/qwen3.5-397b-a17b, gemini-3-flash-preview, minimax/minimax-m2.7, moonshotai/kimi-k2, openai/gpt-oss-120b, and anthropic/claude-haiku-4.5. These models solved every challenge in every language — FizzBuzz, business days with holidays, vending machine state machines, grid path counting, rail fence cipher, and data pipeline aggregation.

The Coding Results Matrix

Per-model totals and perfect-task counts. Each of the 18 tasks is scored out of 7, and six models achieve a perfect 126/126.

| Model | Total | Perfect Tasks |
|---|---|---|
| qwen/qwen3.5-397b-a17b | 126/126 | 18/18 |
| gemini-3-flash-preview | 126/126 | 18/18 |
| minimax/minimax-m2.7 | 126/126 | 18/18 |
| moonshotai/kimi-k2 | 126/126 | 18/18 |
| openai/gpt-oss-120b | 126/126 | 18/18 |
| anthropic/claude-haiku-4.5 | 126/126 | 18/18 |
| deepseek/deepseek-v3.2 | 123/126 | 17/18 |
| mistralai/codestral-2508 | 123/126 | 17/18 |
| inception/mercury-2 | 119/126 | 17/18 |
| openai/gpt-oss-20b | 119/126 | 17/18 |
| inception/mercury-coder | 119/126 | 17/18 |
| meta-llama/llama-4-maverick | 115/126 | 15/18 |
| qwen/qwen3.5-122b-a10b | 114/126 | 16/18 |
| qwen/qwen3.5-35b-a3b | 112/126 | 16/18 |
| nvidia/Nemotron-3-Nano-30B-A3B | 110/126 | 15/18 |
| google/gemma-3-27b-it | 105/126 | 14/18 |
| meta-llama/llama-4-scout | 101/126 | 13/18 |
| mistralai/ministral-8b-2512 | 91/126 | 11/18 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B | 88/126 | 12/18 |
| mistralai/mistral-small-2603 | 80/126 | 10/18 |
| nvidia/nemotron-3-super-120b-a12b:free | 74/126 | 10/18 |
| nvidia/nemotron-3-nano-30b-a3b:free | 71/126 | 9/18 |

Language Comparison

| Language | Avg Score | Compile Rate | Run Rate | Perfect Rate |
|---|---|---|---|---|
| TypeScript | 6.5/7 | 98% | 98% | 88% |
| Python | 6.2/7 | 91% | 90% | 88% |
| Go | 5.9/7 | 86% | 86% | 81% |

Go remains the hardest language. Models write time.Time{Year: 2025, Month: 1}, which doesn't compile because time.Time has no exported fields (the correct constructor is time.Date), or they call nonexistent methods. These are subtle Go-specific patterns that don't exist in TypeScript or Python.

Challenge Difficulty

| Challenge | Avg Score | Perfect | Zero | Hardest Language |
|---|---|---|---|---|
| FizzBuzz Boom | 6.8/7 | 64 | 2 | python (6.4/7) |
| Grid Paths | 6.6/7 | 62 | 2 | go (6.1/7) |
| Vending Machine | 6.5/7 | 60 | 3 | go (6.1/7) |
| Data Pipeline | 6.4/7 | 56 | 4 | python (6.0/7) |
| Business Days | 5.7/7 | 51 | 9 | go (4.5/7) |
| Rail Fence Cipher | 5.1/7 | 41 | 13 | go (4.8/7) |

The rail fence cipher is the hardest challenge — 13 complete failures across all models. The encode step is straightforward; the decode step (computing rail lengths, filling in reading order, then extracting in zigzag order) is where models break.
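
For reference, here is one way to implement the decode step just described: mark the zigzag pattern, slice the ciphertext into rails by length, then replay the zigzag. This is our own sketch, not the reference solution or any model's submission:

```typescript
// Rail index visited by each character position when writing in zigzag order.
function railPattern(length: number, rails: number): number[] {
  const pattern: number[] = [];
  let rail = 0;
  let dir = 1;
  for (let i = 0; i < length; i++) {
    pattern.push(rail);
    if (rail === 0) dir = 1;
    else if (rail === rails - 1) dir = -1;
    rail += dir;
  }
  return pattern;
}

// Decode: compute rail lengths, slice the ciphertext in reading order,
// then pull characters back out in zigzag order.
function railFenceDecode(cipher: string, rails: number): string {
  if (rails <= 1) return cipher;
  const pattern = railPattern(cipher.length, rails);
  const counts = new Array(rails).fill(0);
  for (const r of pattern) counts[r]++;
  const rows: string[] = [];
  let offset = 0;
  for (const c of counts) {
    rows.push(cipher.slice(offset, offset + c));
    offset += c;
  }
  const cursors = new Array(rails).fill(0);
  return pattern.map((r) => rows[r][cursors[r]++]).join("");
}

railFenceDecode("WECRLTEERDSOEEFEAOCAIVDEN", 3); // "WEAREDISCOVEREDFLEEATONCE"
```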


Deep Dive: MCP Tool Use

This is the most revealing dimension because it tests something no multiple-choice benchmark can: the ability to autonomously plan and execute a multi-step task using real external tools.

The Task

Each model connects to the TezLab MCP server — a real API for electric vehicle data — and must analyze battery health and charging patterns by calling the right tools in the right order.

Six tools must be called:

  1. list_vehicles — discover what vehicles exist
  2. get_battery_health — check battery degradation
  3. get_charges — review charging history
  4. get_efficiency — pull efficiency stats
  5. get_my_chargers — see which chargers are used
  6. find_nearby_chargers — search for alternatives

Results

| Model | Tool Score | Judge | Total | Time | Cost |
|---|---|---|---|---|---|
| inception/mercury-2 | 6/6 | 5/10 | 11/16 | 14s | $0.007 |
| mistralai/mistral-small-2603 | 6/6 | 5/10 | 11/16 | 18s | $0.004 |
| qwen/qwen3.5-35b-a3b | 6/6 | 5/10 | 11/16 | 24s | $0.006 |
| openai/gpt-5.4 | 6/6 | 5/10 | 11/16 | 28s | $0.075 |
| moonshotai/kimi-k2 | 6/6 | 5/10 | 11/16 | 34s | $0.014 |
| qwen/qwen3.5-122b-a10b | 6/6 | 5/10 | 11/16 | 41s | $0.010 |
| gemini-3.1-pro-preview | 6/6 | 5/10 | 11/16 | 44s | $0.007 |
| qwen/qwen3.5-397b-a17b | 6/6 | 5/10 | 11/16 | 47s | $0.011 |
| x-ai/grok-4.20-beta | 6/6 | 5/10 | 11/16 | 48s | $0.063 |
| openai/gpt-oss-20b | 6/6 | 5/10 | 11/16 | 52s | $0.001 |
| anthropic/claude-sonnet-4.6 | 6/6 | 5/10 | 11/16 | 53s | $0.126 |
| gemini-3-flash-preview | 6/6 | 5/10 | 11/16 | 96s | $0.010 |
| nvidia/nemotron-3-nano-30b-a3b:free | 6/6 | 5/10 | 11/16 | 112s | Free |
| anthropic/claude-opus-4.6 | 6/6 | 5/10 | 11/16 | 117s | $0.280 |
| deepseek/deepseek-v3.2 | 6/6 | 5/10 | 11/16 | 150s | $0.009 |
| minimax/minimax-m2.7 | 6/6 | 5/10 | 11/16 | 152s | $0.014 |
| nvidia/nemotron-3-super-120b-a12b:free | 6/6 | 5/10 | 11/16 | 300s | Free |
| moonshotai/kimi-k2.5 | 6/6 | 5/10 | 11/16 | 211s | $0.015 |
| glm-4.7-flash:latest (Ollama) | 6/6 | 5/10 | 11/16 | 204s | Free |
| google/gemini-3.1-pro-preview | 6/6 | 4.7/10 | 10.7/16 | 65s | $0.057 |
| x-ai/grok-4.1-fast | 6/6 | 4.7/10 | 10.7/16 | 32s | $0.006 |
| mistral-small:latest (Ollama) | 6/6 | 4.7/10 | 10.7/16 | 156s | Free |
| inception/mercury-coder | 6/6 | 4/10 | 10/16 | 14s | $0.005 |
| nemotron-3-nano:latest (Ollama) | 5/6 | 5/10 | 10/16 | 94s | Free |
| openai/gpt-5.4-nano | 5/6 | 4.7/10 | 9.7/16 | 14s | $0.005 |
| openai/gpt-5.4-mini | 6/6 | 3.7/10 | 9.7/16 | 62s | $0.012 |
| meta-llama/llama-4-maverick | 5/6 | 4/10 | 9/16 | 23s | $0.004 |
| openai/gpt-oss-120b | 4/6 | 5/10 | 9/16 | 28s | $0.001 |
| qwen3:30b-a3b (Ollama) | 5/6 | 4/10 | 9/16 | 121s | Free |
| devstral:latest (Ollama) | 5/6 | 4/10 | 9/16 | 127s | Free |
| qwen3:32b (Ollama) | 5/6 | 4/10 | 9/16 | 281s | Free |
| mistralai/mistral-small-3.2-24b-instruct | 5/6 | 3/10 | 8/16 | 17s | $0.001 |
| meta-llama/llama-4-scout | 4/6 | 3.7/10 | 7.7/16 | 8s | $0.002 |
| mistralai/ministral-8b-2512 | 5/6 | 2/10 | 7/16 | 10s | $0.002 |
| mistralai/codestral-2508 | 5/6 | 1/10 | 6/16 | 6s | $0.004 |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B | 2/6 | 2/10 | 4/16 | 37s | Free |
| nemotron-3-nano:4b (Ollama) | 2/6 | 2/10 | 4/16 | 37s | Free |
| anthropic/claude-haiku-4.5 | 1/6 | 1/10 | 2/16 | 8s | $0.013 |
| nvidia/Nemotron-3-Nano-30B-A3B | 1/6 | 1/10 | 2/16 | 33s | Free |
| gpt-oss:latest (Ollama) | 1/6 | 1/10 | 2/16 | 21s | Free |
| google/gemma-3-27b-it | 0/6 | 0/10 | 0/16 | n/a | $0.001 |

The Failures Tell the Real Story

anthropic/claude-haiku-4.5 via OpenRouter (2/16) — The most expensive failure. It called list_vehicles twice, got an error, and returned an apology:

"I apologize—I'm encountering persistent server errors when trying to connect to your TezLab account."

The service was running fine for every other model. On every other dimension Haiku was strong: 30/30 knowledge, 30/30 instruction, 126/126 coding. One catastrophic failure in one dimension defines the whole ranking.

nvidia/Nemotron-3-Nano-30B-A3B on DeepInfra (2/16) — Called list_vehicles once, correctly identified both vehicles, and then asked the user for clarification instead of proceeding. The same weights on OpenRouter scored 11/16 by just picking the Tesla and running all 6 tools. Same model, different provider, one asks permission and the other gets to work.

google/gemma-3-27b-it via OpenRouter (0/16) — OpenRouter returned "No endpoints found that support tool use." This model doesn't support tool calling at all.

The Quality Gap

Among models that called all 6 tools (tool score 6/6), quality scores ranged from 3.7/10 to 5/10. The majority scored 5/10 — quality variance is far smaller than tool-usage variance. Models either figured out the full tool chain or stopped short. This suggests MCP tool use is primarily a planning problem, not a generation problem.


The Provider Effect: Why Infrastructure Matters

The same model weights, served by different providers, produce meaningfully different results.

The Numbers

| Model | Dimension | OpenRouter | DeepInfra | Gap (DeepInfra minus OpenRouter) |
|---|---|---|---|---|
| Nemotron Nano 30B | Reasoning | 20/20 | 17/20 | -3 |
| Nemotron Nano 30B | Knowledge | 30/30 | 30/30 | 0 |
| Nemotron Nano 30B | Instruction | 28/30 | 30/30 | +2 |
| Nemotron Nano 30B | Coding | 71/126 | 110/126 | +39 |
| Nemotron Nano 30B | MCP Tool Use | 11/16 | 2/16 | -9 |
| Nemotron Nano 30B | Combined | 83.7% | 77.0% | -6.7pp |
| Nemotron Super 120B | Reasoning | 20/20 | 20/20 | 0 |
| Nemotron Super 120B | Knowledge | 30/30 | 28/30 | -2 |
| Nemotron Super 120B | Instruction | 25/30 | 30/30 | +5 |
| Nemotron Super 120B | Coding | 74/126 | 88/126 | +14 |
| Nemotron Super 120B | MCP Tool Use | 11/16 | 4/16 | -7 |
| Nemotron Super 120B | Combined | 82.2% | 77.6% | -4.6pp |

The Nano 30B gap is 6.7 percentage points — larger than the gap between ranks #4 and #10 in our leaderboard. The pattern is inverted between dimensions: DeepInfra dominates coding (110/126 vs 71/126) while OpenRouter dominates MCP (11/16 vs 2/16) and reasoning (20/20 vs 17/20). The Super 120B gap widened to 4.6pp with updated MCP data — OpenRouter now scores 11/16 vs DeepInfra's 4/16.

Why the Same Weights Behave Differently

At least seven layers can introduce behavioral differences: quantization precision, SDK layers (native vs OpenAI-compatible adapter), middleware (context compression, response healing), tool-calling implementation, default parameters, token usage reporting, and reasoning effort configuration.

Tool use is the most provider-sensitive capability. Knowledge and reasoning showed 0-3 point differences; MCP showed a 9-point difference. If your application relies on tool calling, provider choice matters more than model choice.


Cost and Speed Analysis

The Cost Efficiency Curve (4 Dimensions, All 49 Models)

| Tier | Cost Range | Best Model | 4-dim Score |
|---|---|---|---|
| Free (local) | $0.00 | gpt-oss:latest (Ollama) | 93.3% |
| Free (cloud) | $0.00 | nvidia/Nemotron-3-Nano-30B-A3B (DeepInfra) | 93.1% |
| Sub-penny | $0.001-$0.01 | openai/gpt-oss-120b (OR) | 98.3% |
| Penny | $0.01-$0.05 | gemini-3-flash-preview (Google) | 98.8% |
| Dime | $0.05-$0.20 | x-ai/grok-4.20-beta (OR) | 98.8% |
| Quarter+ | $0.20+ | anthropic/claude-sonnet-4.6 (OR) | 100.0% |

The knee is at openai/gpt-oss-120b ($0.01, 98.3%). For one cent you get a model that's within 2 points of the absolute best. The jump from 98.3% to 100% costs 20x more ($0.20 for Sonnet 4.6). Below that, local gpt-oss:latest on Ollama delivers 93.3% for free.

The Speed/Quality Frontier

| Model | Score | Time | Sweet Spot? |
|---|---|---|---|
| mistralai/codestral-2508 | 87.3% | 48s | Fastest overall |
| openai/gpt-5.4-mini | 97.3% | 1m | Fastest >95% |
| anthropic/claude-sonnet-4.6 | 100.0% | 3m | Perfect score |
| x-ai/grok-4.20-beta | 98.8% | 2m | Best speed/quality |
| gemini-3-flash-preview | 98.8% | 1h41m | Best value |
| openai/gpt-oss-120b | 98.3% | 10m | Best sub-penny |

What This Tells Us

1. Claude Sonnet 4.6 is the only perfect model. 100% across 4 dimensions — 20/20 reasoning, 30/30 knowledge, 30/30 instruction, 126/126 coding. No other model achieves this. But at $0.20, you pay a premium for perfection.

2. The top 5 models are within 1.2 points of each other. Sonnet 4.6 (100%), Qwen 397B (99.2%), Grok 4.20 (98.8%), Gemini Flash (98.8%), GPT-5.4 (98.8%). The difference between them is essentially noise — model choice at the frontier matters less than it used to.

3. openai/gpt-oss-120b at $0.01 is the deal of the century. 98.3% — ahead of GPT-5.4 Mini, Gemini Pro, and Claude Opus — for one cent. Apache-licensed. This is an open-weight model costing 20x less than Sonnet and scoring within 2 points.

4. Local models are no longer a compromise. gpt-oss:latest on Ollama (93.3%) and nemotron-3-nano:latest (89.4%) run entirely on a MacBook, for free, and beat many cloud models. The 5-7% gap vs cloud is primarily in coding (Go struggles) and instruction following.

5. Multi-dimensional evaluation reveals things single-axis benchmarks hide. anthropic/claude-haiku-4.5 scores 96.3% on 4 dimensions but drops to 79.5% on 5 dimensions due to its MCP collapse. qwen3:30b-a3b gets perfect coding (126/126) but 4/30 instruction following.

6. Provider choice matters as much as model choice. The 6.7pp gap between the same Nemotron Nano 30B on OpenRouter vs DeepInfra is larger than the gap between many adjacent-ranked models.

7. Tool orchestration is a binary skill. Models either call all the right tools and produce excellent analysis, or they stall early and produce nothing. No graceful degradation.

8. Go code generation is the biggest language gap. TypeScript compiles 98% of the time; Go only 86%.

9. The frontier is crowded. 11 models score above 95% on 4 dimensions. At this level, the differentiators are cost, speed, and tool use — not raw capability.


Methodology

All evaluations ran with temperature 0.0 (knowledge, instruction) or 0.2-0.3 (reasoning, coding). Reasoning and knowledge responses were judged by anthropic/claude-haiku-4.5 (via OpenRouter). Instruction following and coding used deterministic verification — no LLM judge involved. MCP tool usage is scored mechanically (did you call the tool?); response quality is LLM-judged.

Results are cached per model per task; interrupted runs can be resumed. All models were tested under the same conditions with identical prompts.

Combined scores are the mean of normalized dimension percentages (each dimension's raw score divided by its maximum, then averaged across all 5 dimensions). Only models present in all 5 dimensions appear in the combined leaderboard.
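
A sketch of that formula in code, with the dimension maxima defined above:

```typescript
// Combined score = mean of per-dimension percentages (raw / max), as described above.
const DIMENSION_MAX = { reasoning: 20, knowledge: 30, instruction: 30, coding: 126, mcp: 16 };

function combinedScore(raw: Record<keyof typeof DIMENSION_MAX, number>): number {
  const dims = Object.keys(DIMENSION_MAX) as (keyof typeof DIMENSION_MAX)[];
  const mean = dims.reduce((sum, d) => sum + raw[d] / DIMENSION_MAX[d], 0) / dims.length;
  return Math.round(mean * 1000) / 10; // percentage with one decimal place
}

// Claude Sonnet 4.6 (5-dim): 20/20, 30/30, 30/30, 126/126, 11/16 -> 93.8
combinedScore({ reasoning: 20, knowledge: 30, instruction: 30, coding: 126, mcp: 11 });
```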

Run Details

  • Reasoning: Run #7, 4 puzzles, scored /20
  • Knowledge: Run #2, 30 questions across 6 categories, scored /30
  • Instruction: Run #2, 6 constraint tasks, scored /30
  • Coding: Run #6, 6 challenges × 3 languages = 18 tasks, scored /126 (local execution, no containers)
  • MCP Tool Use: Run #1, 1 multi-tool task, scored /16

Total evaluation cost: $4.63 across all 49 models and 4-5 dimensions.

Limitations

  • 8 of the 49 tested models lack complete 5-dimension results: they either do not support tool calling (phi4, gemma3n, the deepseek-r1 variants, gemma-3-27b-it) or were not included in the MCP evaluation run (the nemotron-nano-9b-v2 variants, minimax-01).
  • The MCP eval used a single task with a live API. Results may vary with different MCP servers, tool schemas, or task descriptions.
  • Free-tier models may have different availability, rate limits, or routing than paid versions.
  • We tested each model once per task. Stochastic variation means scores could shift by 1-3 points on a re-run.

Built with umwelten — an open-source framework for multi-model evaluation, MCP tool integration, and LLM-judged scoring.

Released under the MIT License.