Building a Multi-Dimension Model Showdown

A walkthrough of building the "Model Showdown" — a comprehensive evaluation suite that tests 49 models across 5 dimensions (reasoning, knowledge, instruction following, coding, MCP tool use), combines the results into a unified leaderboard, and generates a full narrative report with per-dimension analysis.

Time Required: 30 minutes to build, 2–4 hours to run
Prerequisites: Node.js 20+, pnpm, API keys for Google, OpenRouter, and DeepInfra
Cost: ~$4.63 for a full 49-model run

What We're Building

A multi-evaluation suite that:

Runs 5 independent evaluations testing different capabilities
Uses LLM judges and deterministic verifiers to score results
Combines results across all dimensions into per-model scorecards
Generates both structured reports (console, markdown, JSON) and full narrative writeups
Includes per-dimension breakdowns, cost/speed analysis, and provider comparisons

Unlike the Car Wash evaluation, which tests one question across many models, the Model Showdown tests multiple questions across multiple dimensions and introduces the suite combine system for cross-evaluation analysis.

Architecture

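The showdown is split into three layers: shared utilities, one directory per evaluation dimension, and top-level scripts that configure, combine, and orchestrate the suite: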

examples/model-showdown/
├── shared/                  # Common: models, judge, env, utilities
├── reasoning/               # 4 logic puzzles, LLM-judged
├── knowledge/               # 30 factual questions, LLM-judged
├── instruction/             # 6 constraint tasks, deterministic scoring
├── coding/                  # 6 challenges × 3 languages, compiled & run
├── mcp-tool-use/            # MCP tool orchestration, LLM-judged
├── suite-config.ts          # EvalDimension[] — defines how to combine
├── generate-report.ts       # Report entry point (4 output formats)
└── run-all.ts               # Orchestrator — runs all evals + report

Each evaluation writes results to output/evaluations/model-showdown-{name}/runs/{number}/. The combine system reads from these directories to build unified reports.

Step 1: Define Your Model List

All evals share the same model list from shared/models.ts:

typescript

import { ModelDetails } from '../@umwelten/core/cognition/types.js';

const LOCAL_TEST_MODELS: ModelDetails[] = [
  { name: 'gemini-3-flash-preview', provider: 'google' },
  { name: 'anthropic/claude-haiku-4.5', provider: 'openrouter' },
];

const ALL_MODELS: ModelDetails[] = [
  { name: 'gemini-3-flash-preview', provider: 'google' },
  { name: 'anthropic/claude-haiku-4.5', provider: 'openrouter' },
  { name: 'qwen/qwen3.5-122b-a10b', provider: 'openrouter' },
  { name: 'openai/gpt-oss-120b', provider: 'openrouter' },
  // ... 45 more models across openrouter, deepinfra, and ollama
];

export const MODELS = process.argv.includes('--all') ? ALL_MODELS : LOCAL_TEST_MODELS;

Step 2: Build the Five Evaluation Dimensions

Reasoning (4 puzzles, /20)

Four classic logic puzzles that test whether models can reason past intuitive traps:

Bat & Ball: The classic $1.10 problem (trap: $0.10, correct: $0.05)
Counterfeit Coin: 12 coins, 3 weighings, find the fake
Lily Pad: Doubling lily pads — covers the lake in 48 days, half in...?
Surgeon Riddle: "I can't operate — he's my son"

Each puzzle is scored 1–5 by an LLM judge (Claude Haiku 4.5) on reasoning quality — not just the answer, but whether the model shows genuine understanding. Max score: 20.

typescript

// reasoning/puzzles.ts
export const PUZZLES = [
  {
    id: 'bat-ball',
    prompt: 'A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?',
    correct: '$0.05',
    trap: '$0.10',
  },
  // ... 3 more puzzles
];

Knowledge (30 questions, /30)

30 factual questions across 6 categories: Science, Geography, History, Technology, AI/ML, and Tricky/Adversarial. Each is binary — correct (1) or incorrect (0) — judged by an LLM that allows formatting variations (e.g., "5730 years" and "5,730 years" both count).

typescript

// knowledge/questions.ts
export const QUESTIONS = [
  { id: 'sci-1', category: 'Science',
    question: 'What is the half-life of carbon-14?',
    answer: '5,730 years' },
  { id: 'geo-3', category: 'Geography',
    question: 'Which country has the most time zones?',
    answer: 'France (12 time zones including overseas territories)' },
  // ... 28 more
];

Instruction Following (6 tasks, /30)

Six tasks with strict format constraints verified deterministically — no LLM judge needed. Each task is scored 0–5 based on how many constraints are satisfied.

Examples:

Write exactly 3 sentences, each starting with a given word
Output valid JSON with specific required fields
Write a poem with exactly 4 lines of exactly 8 words each
Respond with only uppercase letters

The scorer uses regex, JSON parsing, and character counting — no ambiguity.
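To make the deterministic approach concrete, here is a minimal sketch of two such checks. The function names `scorePoem` and `isAllUppercase`, and the way the five points are split for the poem task, are assumptions for illustration and may not match the real scorer in instruction/.

```typescript
// Hypothetical checker for "a poem with exactly 4 lines of exactly 8 words each".
// Assumed point split: 1 for the line count + 1 per valid line (max 5).
function scorePoem(response: string): number {
  const lines = response
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  // One point for having exactly four non-empty lines.
  let score = lines.length === 4 ? 1 : 0;

  // One point for each of the first four lines with exactly eight words.
  for (const line of lines.slice(0, 4)) {
    if (line.split(/\s+/).length === 8) score += 1;
  }
  return score;
}

// Hypothetical checker for "respond with only uppercase letters".
// Digits and punctuation are ignored; at least one letter is required.
function isAllUppercase(response: string): boolean {
  const letters = response.replace(/[^A-Za-z]/g, '');
  return letters.length > 0 && letters === letters.toUpperCase();
}
```

Because every check is a pure function of the response text, the same response always gets the same score, which is what lets this dimension skip the LLM judge.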

Coding (6 challenges × 3 languages, /126)

Six programming challenges implemented in TypeScript, Python, and Rust. Each submission is compiled (where applicable), run against test cases, and scored on correctness.

Challenges include FizzBuzz, palindrome checking, Fibonacci, matrix transposition, Caesar cipher, and anagram detection. Each challenge has 3–7 test cases with expected outputs. Max score per challenge per language: 7 (1 for compilation + 1 per test case).

Execution uses either Dagger containers (isolated Docker) or local runtimes with the --no-dagger flag.
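The scoring rule above can be sketched as follows. The `SubmissionOutcome` shape, the `scoreSubmission` name, and the cap at 7 points are assumptions made for this example; the real result schema in coding/ may differ.

```typescript
// Hypothetical shape of one submission's outcome; not the real result schema.
interface SubmissionOutcome {
  compiled: boolean;      // for interpreted languages, read as "ran without a syntax error"
  testsPassed: boolean[]; // one entry per test case
}

// Stated maximum per challenge per language.
const MAX_PER_SUBMISSION = 7;

function scoreSubmission(outcome: SubmissionOutcome): number {
  // Code that never builds cannot pass any test.
  if (!outcome.compiled) return 0;
  const passed = outcome.testsPassed.filter(Boolean).length;
  // 1 point for compiling + 1 per passing test, capped at the stated maximum
  // (the cap is an assumption for challenges with many test cases).
  return Math.min(MAX_PER_SUBMISSION, 1 + passed);
}

// Dimension maximum from the heading: 6 challenges x 3 languages x 7 points.
const CODING_MAX = 6 * 3 * MAX_PER_SUBMISSION;
console.log(CODING_MAX); // 126
```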

MCP Tool Use (1 task, /16)

Tests whether models can orchestrate multiple MCP (Model Context Protocol) tools to analyze real-world data. The eval connects to the TezLab MCP server (electric vehicle data) and asks each model to analyze battery health and charging patterns.

Scoring has two components:

Tool score (0-6): Deterministic — checks whether the model called each required tool: list_vehicles, get_battery_health, get_charges, get_efficiency, get_my_chargers, find_nearby_chargers
Quality score (1-10): LLM judge assesses data synthesis, actionable insights, and factual grounding

typescript

// mcp-tool-use/mcp-eval.ts
const PROMPT =
  `Analyze my vehicle's battery health and charging patterns. ` +
  `First identify my vehicle, then get the battery health data...`;

// Scoring
function scoreToolUsage(calls: ToolCall[]): ToolUsage {
  const names = calls.map(c => c.name);
  const checks = [
    [names.includes('list_vehicles'), 'list_vehicles'],
    [names.includes('get_battery_health'), 'get_battery_health'],
    // ... 4 more required tools
  ];
  return { tool_score: checks.filter(([ok]) => ok).length, ... };
}

This dimension requires TezLab OAuth credentials. The eval connects via examples/mcp-chat/tezlab-mcp.ts and runs models sequentially over a shared MCP connection.
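As a rough sketch of how the two components could add up to the /16 total: the tool names come from the list above, while `combineMcpScore`, the `McpScore` fields, and the clamping of the judge score are assumptions for illustration.

```typescript
// Required tools as listed in the scoring description.
const REQUIRED_TOOLS = [
  'list_vehicles',
  'get_battery_health',
  'get_charges',
  'get_efficiency',
  'get_my_chargers',
  'find_nearby_chargers',
] as const;

// Hypothetical result shape; the real eval may store this differently.
interface McpScore {
  toolScore: number;    // 0 to 6, deterministic
  qualityScore: number; // 1 to 10, from the LLM judge
  total: number;        // out of 16
}

function combineMcpScore(calledTools: string[], judgeQuality: number): McpScore {
  const called = new Set(calledTools);
  // Count each required tool once, however many times it was called.
  const toolScore = REQUIRED_TOOLS.filter((tool) => called.has(tool)).length;
  // Keep the judge's score inside the documented 1 to 10 range (assumed behavior).
  const qualityScore = Math.min(10, Math.max(1, Math.round(judgeQuality)));
  return { toolScore, qualityScore, total: toolScore + qualityScore };
}
```

A model that calls all six tools but writes a weak analysis still lands well below 16, which is why the two components are reported separately.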

Step 3: Define the Suite Configuration

The suite config tells the combine system how to read each evaluation's results:

See examples/model-showdown/suite-config.ts for the live SHOWDOWN_SUITE (includes perTaskMaxScore per dimension).

Each dimension defines:

Field	Purpose
`evalName`	Maps to `output/evaluations/{evalName}/` directory
`label`	Human-readable name for reports
`maxScore`	Perfect score for a full run of this dimension (documentation + legacy mode)
`perTaskMaxScore`	Optional. When set, each loaded result file adds this to the model’s denominator for that dimension (sum of per-file maxima). Use for multi-task evals so a model evaluated on only some tasks shows e.g. 28/28 or 56/56 instead of 28/126 for coding.
`extractScore`	Function to pull a numeric score from each result JSON file
`hasResultsSubdir`	Whether results are in `{task}/results/` vs `{task}/`

The extractScore function is the key abstraction — each eval stores results differently, and this function normalizes them to a number. perTaskMaxScore keeps leaderboard fractions honest when the cache only has a subset of tasks (e.g. new models added mid-campaign).
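The effect of perTaskMaxScore on the denominator can be illustrated with a small sketch. `DimensionLike` and `dimensionDenominator` are hypothetical names, and the value 7 for coding is an assumption based on the per-challenge maximum described earlier; check suite-config.ts for the real values.

```typescript
// Hypothetical subset of a dimension config, for illustration only.
interface DimensionLike {
  maxScore: number;
  perTaskMaxScore?: number;
}

function dimensionDenominator(dim: DimensionLike, loadedResultFiles: number): number {
  // With perTaskMaxScore, each loaded result file contributes its own maximum.
  if (dim.perTaskMaxScore !== undefined) {
    return dim.perTaskMaxScore * loadedResultFiles;
  }
  // Legacy mode: always divide by the full-run maximum.
  return dim.maxScore;
}

const codingDim: DimensionLike = { maxScore: 126, perTaskMaxScore: 7 };
console.log(dimensionDenominator(codingDim, 4));           // 28
console.log(dimensionDenominator(codingDim, 18));          // 126
console.log(dimensionDenominator({ maxScore: 126 }, 4));   // 126
```

With four cached result files, a perfect model is shown as 28/28 rather than 28/126.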

Step 4: Run the Evaluations

Run Everything at Once

bash

# Quick test (2 models; 4 core evals, then MCP tool use)
dotenvx run -- pnpm tsx examples/model-showdown/run-all.ts

# Full showdown (49 models; 4 core evals, then MCP tool use)
dotenvx run -- pnpm tsx examples/model-showdown/run-all.ts --all

# Fresh run (don't reuse cached responses)
dotenvx run -- pnpm tsx examples/model-showdown/run-all.ts --all --new

Run Individual Evals

bash

dotenvx run -- pnpm tsx examples/model-showdown/reasoning/reasoning-eval.ts --all
dotenvx run -- pnpm tsx examples/model-showdown/knowledge/knowledge-eval.ts --all
dotenvx run -- pnpm tsx examples/model-showdown/instruction/instruction-eval.ts --all
dotenvx run -- pnpm tsx examples/model-showdown/coding/coding-eval.ts --all --no-dagger
dotenvx run -- pnpm tsx examples/model-showdown/mcp-tool-use/mcp-eval.ts --all

Each eval writes results to output/evaluations/model-showdown-{name}/runs/{NNN}/. The combine system automatically picks the latest run.
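Picking the latest run can be sketched as a pure function over the directory names found under runs/. `pickLatestRun` is a hypothetical name, and the assumption here is that run directories are purely numeric (for example 001, 002); the real loader may apply different rules.

```typescript
// Return the highest-numbered run directory name, or undefined if none exist.
// Names are compared numerically, so '10' beats '9' regardless of zero-padding.
function pickLatestRun(dirNames: string[]): string | undefined {
  let latest: string | undefined;
  let latestNumber = -1;
  for (const name of dirNames) {
    // Skip anything that is not a plain run number (e.g. stray files).
    if (!/^\d+$/.test(name)) continue;
    const n = Number.parseInt(name, 10);
    if (n > latestNumber) {
      latestNumber = n;
      latest = name;
    }
  }
  return latest;
}

console.log(pickLatestRun(['001', '002', '010'])); // 010
```

In practice the input would come from listing the runs/ directory, for example with Node's `fs.readdirSync`.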

Step 5: Generate Combined Reports

Via the Generate Script

bash

# Console tables (default)
dotenvx run -- pnpm tsx examples/model-showdown/generate-report.ts

# Structured markdown
dotenvx run -- pnpm tsx examples/model-showdown/generate-report.ts --format md

# Full narrative writeup
dotenvx run -- pnpm tsx examples/model-showdown/generate-report.ts --format narrative

# Save to file
dotenvx run -- pnpm tsx examples/model-showdown/generate-report.ts --format narrative --output report.md

Via the CLI

bash

# Console tables
dotenvx run -- pnpm run cli eval combine --config examples/model-showdown/suite-config.ts

# Narrative report to file
dotenvx run -- pnpm run cli eval combine \
  --config examples/model-showdown/suite-config.ts \
  --format narrative \
  --output report.md

# Focus on specific models
dotenvx run -- pnpm run cli eval combine \
  --config examples/model-showdown/suite-config.ts \
  --format md \
  --focus nemotron qwen

Step 6: Understanding the Report Formats

Console / Structured Report

The buildSuiteReport() function produces a Report object (from packages/evaluation/src/reporting/types.ts) with these sections:

Overall Leaderboard — Rank, Model, Combined %, one column per dimension, Cost, Time
Cost Efficiency — Sorted by score/dollar (free models at top)
Speed Leaderboard — Sorted by total time ascending
Per-Dimension Details — task-level breakdowns, judge explanations, failure reasons
Focus Model Comparison — filtered deep-dive (when --focus is set)
Run Info — which directories were used, model count, timestamp
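The cost-efficiency ordering (score per dollar, free models first) can be sketched like this. `ModelRow` and `sortByCostEfficiency` are hypothetical names, the rows are made-up illustration data, and breaking ties by score is an assumption.

```typescript
// Hypothetical leaderboard row, for illustration only.
interface ModelRow {
  model: string;
  combinedPct: number;
  cost: number; // total USD across all dimensions
}

function sortByCostEfficiency(rows: ModelRow[]): ModelRow[] {
  // Free models get infinite efficiency so they sort to the top.
  const perDollar = (r: ModelRow): number =>
    r.cost > 0 ? r.combinedPct / r.cost : Number.POSITIVE_INFINITY;

  return [...rows].sort((a, b) => {
    const ea = perDollar(a);
    const eb = perDollar(b);
    if (ea !== eb) return ea > eb ? -1 : 1;
    // Equal efficiency (e.g. two free models): higher score first.
    return b.combinedPct - a.combinedPct;
  });
}

const sorted = sortByCostEfficiency([
  { model: 'paid-expensive', combinedPct: 100, cost: 1.0 },
  { model: 'free-local', combinedPct: 90, cost: 0 },
  { model: 'paid-cheap', combinedPct: 95, cost: 0.01 },
]);
console.log(sorted.map((r) => r.model).join(', '));
// free-local, paid-cheap, paid-expensive
```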

Render via Reporter:

typescript

import { loadSuite, buildSuiteReport } from '@umwelten/evaluation/evaluation/combine/index.js';
import { Reporter } from '@umwelten/evaluation/reporting/reporter.js';
import { SHOWDOWN_SUITE } from './suite-config.js';

const result = loadSuite(SHOWDOWN_SUITE);
const report = buildSuiteReport(result, {
  title: 'Model Showdown — Combined Results',
  focusModels: ['nemotron'],
});

const reporter = new Reporter();
reporter.toConsole(report);   // terminal with colors
reporter.toMarkdown(report);  // markdown string
reporter.toJson(report);      // JSON string

Narrative Report

The buildNarrativeReport() function produces a standalone markdown article — a full writeup with:

Overview — models tested, providers, total cost and time
Overall Leaderboard — table with raw scores and percentages
Key Findings — best overall, best value, fastest
Per-Dimension Sections — each with:
- Methodology (what's being tested, how it's scored)
- Test descriptions (the actual puzzles/questions/tasks)
- Results table with per-task scores
- Analysis (which tasks were hardest, error patterns)
- Selected judge explanations (for LLM-judged dimensions)
Cost & Speed Analysis — cost efficiency, speed leaderboard
Provider Comparison — average scores by inference provider
Methodology — how scores are combined, what "combined %" means

typescript

import { loadSuite, buildNarrativeReport } from '@umwelten/evaluation/evaluation/combine/index.js';
import { SHOWDOWN_SUITE } from './suite-config.js';

const result = loadSuite(SHOWDOWN_SUITE);
const markdown = buildNarrativeReport(result, {
  title: 'Model Showdown — Full Evaluation Report',
});

How the Combine System Works

Loading

loadSuite(dimensions) does the following:

For each dimension, find the latest run directory under output/evaluations/{evalName}/runs/
Walk task subdirectories, read each {modelKey}.json file
Call dimension.extractScore(result) to get the score
Sum cost and durationMs per model across all tasks
Normalize to percentage: rawScore / maxScore × 100 (when perTaskMaxScore is set, the denominator is the sum of per-file maxima instead of maxScore)
Only include models present in ALL dimensions — partial coverage is excluded
Combine across dimensions: mean(dimension percentages)
Sort by combined percentage descending
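The normalize, filter, and combine steps above can be sketched as follows. `DimensionScores`, `CombinedRow`, and `combineDimensions` are hypothetical names with simplified shapes; the real `loadSuite` also tracks cost, duration, and raw task results, and may differ in detail.

```typescript
// Hypothetical per-dimension input: raw scores already summed per model.
interface DimensionScores {
  label: string;
  maxScore: number;
  rawByModel: Record<string, number>;
}

interface CombinedRow {
  model: string;
  dimensionPct: number[]; // one percentage per dimension, in input order
  combinedPct: number;    // mean of dimensionPct
}

function combineDimensions(dimensions: DimensionScores[]): CombinedRow[] {
  const [first] = dimensions;
  if (!first) return [];

  // Keep only models that appear in every dimension.
  const models = Object.keys(first.rawByModel).filter((model) =>
    dimensions.every((dim) => model in dim.rawByModel),
  );

  const rows = models.map((model) => {
    // Normalize each dimension to a percentage of its maximum.
    const dimensionPct = dimensions.map(
      (dim) => ((dim.rawByModel[model] ?? 0) / dim.maxScore) * 100,
    );
    // Combined score is the unweighted mean of the dimension percentages.
    const combinedPct =
      dimensionPct.reduce((sum, pct) => sum + pct, 0) / dimensionPct.length;
    return { model, dimensionPct, combinedPct };
  });

  // Best combined percentage first.
  return rows.sort((a, b) => b.combinedPct - a.combinedPct);
}
```

Because the mean is unweighted, a 30-point dimension and a 126-point dimension carry equal weight in the combined percentage.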

Model Key Parsing

Result files are named like gemini-3-flash-preview-google.json. The loader strips known provider suffixes (-google, -openrouter, -deepinfra, etc.) to extract the model name and provider.
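A minimal sketch of this suffix stripping, assuming a fixed provider list. `parseModelKey` is a hypothetical name; the source lists -google, -openrouter, and -deepinfra followed by "etc.", so the ollama entry is an assumption, and how slashes in model names are encoded in file names is not covered here.

```typescript
// Assumed provider list; only the first three are named in the text above.
const PROVIDER_SUFFIXES = ['google', 'openrouter', 'deepinfra', 'ollama'];

function parseModelKey(
  fileName: string,
): { name: string; provider: string } | undefined {
  // Drop the .json extension to get the model key.
  const key = fileName.replace(/\.json$/, '');
  for (const provider of PROVIDER_SUFFIXES) {
    const suffix = `-${provider}`;
    // Require a non-empty model name in front of the suffix.
    if (key.endsWith(suffix) && key.length > suffix.length) {
      return { name: key.slice(0, -suffix.length), provider };
    }
  }
  return undefined; // unknown provider suffix
}

console.log(parseModelKey('gemini-3-flash-preview-google.json'));
// { name: 'gemini-3-flash-preview', provider: 'google' }
```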

Task Results

The loader preserves the full raw JSON from every result file in SuiteResult.taskResults. This enables the detailed per-dimension reports — judge explanations, compilation errors, wrong answers, and response previews are all available.

Adapting This for Your Own Suite

Create your evaluations. Each eval should write JSON result files to output/evaluations/{evalName}/runs/{NNN}/. Use the run-based caching pattern from the Car Wash walkthrough.
Define your suite config. Create an EvalDimension[] array:

typescript

import type { EvalDimension } from '@umwelten/evaluation/evaluation/combine/types.js';

export const MY_SUITE: EvalDimension[] = [
  {
    evalName: 'my-eval-accuracy',
    label: 'Accuracy',
    maxScore: 100,
    extractScore: (r) => r.score ?? 0,
  },
  {
    evalName: 'my-eval-speed',
    label: 'Speed',
    maxScore: 50,
    extractScore: (r) => r.timingScore ?? 0,
    hasResultsSubdir: true,
  },
];

Generate reports. Use the CLI:

bash

dotenvx run -- pnpm run cli eval combine --config path/to/suite-config.ts
dotenvx run -- pnpm run cli eval combine --config path/to/suite-config.ts --format narrative --output report.md

Or programmatically:

typescript

import { loadSuite, buildSuiteReport, buildNarrativeReport } from '@umwelten/evaluation/evaluation/combine/index.js';
import { Reporter } from '@umwelten/evaluation/reporting/reporter.js';

const result = loadSuite(MY_SUITE);

// Structured report
const report = buildSuiteReport(result, { title: 'My Suite Results' });
new Reporter().toConsole(report);

// Narrative writeup
const narrative = buildNarrativeReport(result, { title: 'My Evaluation Report' });

Patterns You Can Reuse

Pattern 1: Multi-Dimension Evaluation

Test models on fundamentally different capabilities and combine into one ranking. The key insight: a model that scores 90% on coding but 60% on reasoning has a different profile than one that scores 75% on both — the suite captures this.

Pattern 2: Mixed Scoring Methods

Combine LLM-judged scores (reasoning, knowledge) with deterministic scores (instruction, coding) in the same suite. The extractScore function normalizes everything to a number, regardless of how it was produced.

Pattern 3: Suite Configuration as Data

The EvalDimension[] is just data — you can version it, share it, and modify it without changing any framework code. Different teams can define different suites over the same evaluations.

Pattern 4: Report Format Separation

The combine system produces data (SuiteResult). Rendering is separate: buildSuiteReport() for structured Report objects, buildNarrativeReport() for prose markdown. You can add new renderers without touching the loader.

Sample Output

From the model showdown with 49 models across 4 dimensions (all models) and 22 models across 5 dimensions (with MCP):

Best overall (4-dim): Claude Sonnet 4.6 at 100% across all 4 dimensions
Best value: openai/gpt-oss-120b — 98.3% for $0.01
Best free: gpt-oss:latest on Ollama — 93.3% for $0.00
Hardest dimension: Coding — average 55%
Easiest dimension: Instruction — average 89%
Total cost: $4.63 across all 49 models
The frontier is crowded: 11 models score above 95%

The counterfeit coin reasoning puzzle was the hardest individual task — only 9/22 models scored above 2/5. The MCP dimension revealed that most models can orchestrate all 6 required tools (scoring 6/6), but quality varied — the LLM judge scores ranged from 2/10 to 5/10 based on how well models synthesized the tool results into actionable insights.

Quick Reference: Running & Generating Reports

Run the full suite (with caching)

By default, runs resume the latest run number and skip models that already have cached results. Only new/missing models are evaluated.

bash

# Full showdown — all base models, uses cached results
dotenvx run -- pnpm tsx examples/model-showdown/run-all.ts --all

# Full showdown with thinking effort variants (low/medium/high)
dotenvx run -- pnpm tsx examples/model-showdown/run-all.ts --all --with-reasoning-levels

# `run-all` always runs MCP tool-use eval after the four core suites (TezLab MCP server must be reachable)

# Force a fresh run (ignores all caches, creates new run number)
dotenvx run -- pnpm tsx examples/model-showdown/run-all.ts --all --new

# Resume a specific run number
dotenvx run -- pnpm tsx examples/model-showdown/run-all.ts --all --run 8

Run individual evals

bash

dotenvx run -- pnpm tsx examples/model-showdown/reasoning/reasoning-eval.ts --all
dotenvx run -- pnpm tsx examples/model-showdown/knowledge/knowledge-eval.ts --all
dotenvx run -- pnpm tsx examples/model-showdown/instruction/instruction-eval.ts --all
dotenvx run -- pnpm tsx examples/model-showdown/coding/coding-eval.ts --all --no-dagger
dotenvx run -- pnpm tsx examples/model-showdown/mcp-tool-use/mcp-eval.ts --all

Generate reports

Markdown (structured tables)

bash

dotenvx run -- pnpm tsx examples/model-showdown/generate-report.ts --format md
dotenvx run -- pnpm tsx examples/model-showdown/generate-report.ts --format md --output output/model-showdown-results.md

Markdown (narrative writeup)

bash

dotenvx run -- pnpm tsx examples/model-showdown/generate-report.ts --format narrative
dotenvx run -- pnpm tsx examples/model-showdown/generate-report.ts --format narrative --output output/model-showdown-narrative.md

4-Dimension Report (All Models)

If some models are missing from the MCP dimension, use the 4-dimension report to include all models:

bash

dotenvx run -- pnpm tsx examples/model-showdown/generate-4dim-report.ts --format narrative --output output/model-showdown-4dim-narrative.md
dotenvx run -- pnpm tsx examples/model-showdown/generate-4dim-report.ts --format md --output output/model-showdown-4dim-full.md

JSON

bash

dotenvx run -- pnpm tsx examples/model-showdown/generate-report.ts --format json --output output/model-showdown-results.json

HTML (via Focus.AI brand report skill)

Generate a styled HTML report from the narrative markdown, ready for PDF printing:

bash

# 1. Generate the narrative markdown
dotenvx run -- pnpm tsx examples/model-showdown/generate-report.ts \
  --format narrative --output output/model-showdown-narrative.md

# 2. Convert to branded HTML using the /report skill in Claude Code:
#    /report file:output/model-showdown-narrative.md style:labs
#
#    This opens a styled HTML page in the browser. Print to PDF with Cmd+P.

CLI (via eval combine command)

bash

# Console output
dotenvx run -- pnpm run cli eval combine --config examples/model-showdown/suite-config.ts

# Narrative to file
dotenvx run -- pnpm run cli eval combine \
  --config examples/model-showdown/suite-config.ts \
  --format narrative --output report.md

# Focus on specific models
dotenvx run -- pnpm run cli eval combine \
  --config examples/model-showdown/suite-config.ts \
  --format md --focus nemotron qwen

How caching works

Results are stored as JSON files in output/evaluations/model-showdown-{dim}/runs/{NNN}/
Each model's API response is cached in {task}/responses/{modelKey}.json
Each model's scored result is cached in {task}/results/{modelKey}.json
Without --new, the runner picks the latest run number and only evaluates models missing from that run
Adding new models to SHOWDOWN_MODELS and re-running will only call APIs for the new models — existing results are preserved
Use --new to force a clean run (creates a new run number, ignores all previous results)
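The resume behavior boils down to a set difference between the models you want and the result files already in the run. A minimal sketch, with `modelsToEvaluate` as a hypothetical name and model keys assumed to match result file names without the .json extension:

```typescript
// Return the model keys that still need to be evaluated for one task.
// `wanted` holds model keys; `cachedFiles` holds file names from {task}/results/.
function modelsToEvaluate(wanted: string[], cachedFiles: string[]): string[] {
  const cached = new Set(cachedFiles.map((file) => file.replace(/\.json$/, '')));
  return wanted.filter((key) => !cached.has(key));
}

console.log(
  modelsToEvaluate(
    ['model-a-google', 'model-b-openrouter', 'model-c-deepinfra'],
    ['model-a-google.json', 'model-b-openrouter.json'],
  ),
); // [ 'model-c-deepinfra' ]
```

With --new the cached list is effectively empty, so every model is evaluated again in a fresh run directory.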

Output files

File	Format	Description
`output/model-showdown-narrative.md`	Markdown	Full prose report with methodology, analysis, tables (5-dim, models with MCP)
`output/model-showdown-4dim-narrative.md`	Markdown	Full prose report across 4 dimensions (all models)
`output/model-showdown-results.md`	Markdown	Structured tables (leaderboard, cost, speed)
`output/model-showdown-results.json`	JSON	Machine-readable results
`/tmp/focus-report-*.html`	HTML	Branded report for PDF printing (via `/report` skill)

Full Source

See examples/model-showdown/ for the complete implementation.

The generated narrative report is at output/model-showdown-narrative.md.
