Model Evaluation

Learn how to systematically evaluate and compare AI models using Umwelten's comprehensive evaluation system.

Overview

Model evaluation is at the heart of Umwelten's functionality. The eval command family provides systematic testing across multiple models with comprehensive reporting, cost analysis, and resume capability.

Quick Start with EvalSuite

The fastest way to create an evaluation is EvalSuite. Define tasks with prompts and scoring, and the suite handles caching, execution, judging, and output.

typescript

import '../../src/env/load.js';
import { z } from 'zod';
import { EvalSuite } from '../../src/evaluation/suite.js';

const suite = new EvalSuite({
  name: 'my-eval',
  stimulus: { role: 'helpful assistant', temperature: 0.3, maxTokens: 500 },
  models: [
    { name: 'gemini-3-flash-preview', provider: 'google' },
    { name: 'openai/gpt-5.4-nano', provider: 'openrouter' },
  ],
  tasks: [{
    id: 'q1',
    prompt: 'What is 2+2?',
    maxScore: 1,
    verify: (r) => ({ score: r.trim() === '4' ? 1 : 0, details: r.trim() }),
  }],
});

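// run() is assumed to return a promise; report failures instead of silently dropping them.
suite.run().catch((err) => {
  console.error(err);
  process.exit(1);
});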

bash

dotenvx run -- pnpm tsx my-eval.ts          # run it
dotenvx run -- pnpm tsx my-eval.ts --all    # use allModels list
dotenvx run -- pnpm tsx my-eval.ts --new    # fresh run

Two scoring modes: VerifyTask (deterministic verify() function) and JudgeTask (LLM judge with Zod schema). See Creating Evaluations for full details and examples.
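
As an illustration of the deterministic mode, the sketch below shows a verify() function that is more tolerant than the strict string comparison in the quick start. It assumes, as in that example, that verify receives the model's response text and returns an object with score and details. The helper name verifyInteger and the regular expression are hypothetical and not part of Umwelten.

```typescript
// Hypothetical helper (not part of Umwelten): builds a verify() function that
// accepts a response if the last run of digits in it equals the expected value.
// Assumes verify(response: string) => { score, details }, as in the quick start.
interface VerifyResult {
  score: number;
  details: string;
}

function verifyInteger(expected: number): (response: string) => VerifyResult {
  return (response: string): VerifyResult => {
    // Collect every run of digits (with optional leading minus sign).
    const matches = response.match(/-?\d+/g) ?? [];
    const last = matches.length > 0 ? Number(matches[matches.length - 1]) : NaN;
    return {
      score: last === expected ? 1 : 0,
      details: matches.length > 0 ? `parsed ${last}` : 'no integer found',
    };
  };
}

const verify = verifyInteger(4);
console.log(verify('The answer is 4.'));
```

Taking the last number rather than the first is a design choice: models often restate the question ("2+2 is...") before giving the answer.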

CLI Evaluation

Simple Model Comparison

bash

pnpm run cli -- eval run \
  --prompt "Explain machine learning in simple terms" \
  --models "ollama:gemma3:12b,google:gemini-3-flash-preview,openrouter:openai/gpt-4o-mini" \
  --id "ml-explanation" \
  --concurrent

With System Context

bash

pnpm run cli -- eval run \
  --prompt "Explain quantum computing applications" \
  --models "google:gemini-3-flash-preview,openrouter:openai/gpt-4o" \
  --id "quantum-apps" \
  --system "You are a physics professor explaining to undergraduate students" \
  --temperature 0.3

Advanced Features

Interactive UI Mode

Watch evaluations in real time:

bash

pnpm run cli -- eval run \
  --prompt "Write a creative story about AI" \
  --models "ollama:gemma3:12b,google:gemini-3-flash-preview" \
  --id "ai-story" \
  --ui \
  --concurrent

File Attachments

Test multimodal capabilities:

bash

pnpm run cli -- eval run \
  --prompt "Analyze this document and extract key insights" \
  --models "google:gemini-3-flash-preview,google:gemini-2.5-pro-exp-03-25" \
  --id "document-analysis" \
  --attach "./documents/report.pdf" \
  --concurrent

Evaluation Options

Core Parameters

--prompt: The prompt to evaluate (required)
--models: Comma-separated models in provider:model format (required)
--id: Unique evaluation identifier (required)
--system: Optional system prompt
--temperature: Temperature for generation (0.0-2.0)
--timeout: Timeout in milliseconds (minimum 1000ms)
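
The provider:model format deserves one note: some model names contain colons themselves, such as ollama:gemma3:12b. The sketch below shows one way such strings could be split, treating only the first colon as the separator. How the CLI actually parses --models is not documented here, so parseModelSpec and parseModelList are hypothetical names used for illustration only.

```typescript
// Hypothetical parser for the "provider:model" strings passed to --models.
// Only the first colon separates provider from model, so model names that
// contain colons (e.g. "gemma3:12b") stay intact.
interface ModelSpec {
  provider: string;
  name: string;
}

function parseModelSpec(spec: string): ModelSpec {
  const trimmed = spec.trim();
  const idx = trimmed.indexOf(':');
  // Reject a missing colon, an empty provider, or an empty model name.
  if (idx <= 0 || idx === trimmed.length - 1) {
    throw new Error(`Expected "provider:model", got "${spec}"`);
  }
  return { provider: trimmed.slice(0, idx), name: trimmed.slice(idx + 1) };
}

function parseModelList(list: string): ModelSpec[] {
  return list.split(',').map(parseModelSpec);
}

const models = parseModelList(
  'ollama:gemma3:12b,google:gemini-3-flash-preview,openrouter:openai/gpt-4o-mini',
);
console.log(models);
```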

Advanced Options

--resume: Re-run existing responses (default: false)
--attach: Comma-separated file paths to attach
--ui: Use interactive UI with streaming responses
--concurrent: Enable concurrent evaluation for faster processing
--max-concurrency <number>: Maximum concurrent evaluations (1-20, default: 3)
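
To make the effect of --max-concurrency concrete, here is a minimal sketch of a bounded worker pool. It is not Umwelten's scheduler: mapWithLimit and its signature are hypothetical, and only the default of 3 is taken from the option list above.

```typescript
// Hypothetical bounded-concurrency helper (not Umwelten's implementation).
// Runs fn over items with at most `limit` calls in flight, preserving input order.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i] as T);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage: evaluate five placeholder "models" with at most 3 in flight.
mapWithLimit(['a', 'b', 'c', 'd', 'e'], 3, async (m) => m.toUpperCase()).then((out) =>
  console.log(out),
);
```

A worker pool keeps every slot busy until the queue is empty, which is usually faster than processing fixed-size batches when response times vary between models.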

Report Generation

Generate Reports

bash

# Markdown report (default)
pnpm run cli -- eval report --id ml-explanation

# HTML report with rich formatting
pnpm run cli -- eval report --id quantum-apps --format html --output report.html

# CSV export for analysis
pnpm run cli -- eval report --id ai-story --format csv --output results.csv

# JSON for programmatic use
pnpm run cli -- eval report --id document-analysis --format json

List Evaluations

bash

# List all evaluations
pnpm run cli -- eval list

# Show detailed information
pnpm run cli -- eval list --details

# JSON format for scripting
pnpm run cli -- eval list --json

Best Practices

Model Selection

Start with free Ollama models for development
Use Google Gemini 2.0 Flash for production (cost-effective)
Reserve premium models (GPT-4o) for critical quality needs
Use multiple models for comparison and validation

Prompt Design

Be specific about desired output format and length
Include context about target audience when relevant
Use system prompts to set role and expertise level
Test with different temperature values for creativity vs consistency

Performance Optimization

Use --concurrent for faster multi-model evaluation (3-5x speedup)
Set appropriate --timeout for complex prompts
Use --ui for long-running evaluations to monitor progress
Enable --resume for reliability with large evaluation sets

Pairwise Ranking

After running an evaluation, you can rank the results head-to-head using an LLM judge with Elo ratings. The PairwiseRanker class (src/evaluation/ranking/pairwise-ranker.ts) handles pairing, judging, position-bias mitigation, and caching.

typescript

import { PairwiseRanker, evaluationResultsToRankingEntries } from '../src/evaluation/ranking/index.js';

const entries = evaluationResultsToRankingEntries(evalResult);
const ranker = new PairwiseRanker(entries, {
  judgeModel: { name: 'anthropic/claude-haiku-4.5', provider: 'openrouter' },
  judgeInstructions: ['Compare these responses. Which is more helpful and accurate?'],
  pairingMode: 'swiss',
  swissRounds: 5,
  cacheDir: './output/rankings/my-ranking',
});

const output = await ranker.rank();
for (const r of output.rankings) {
  console.log(`${r.model} — Elo ${r.elo} (${r.wins}W/${r.losses}L/${r.ties}T)`);
}

See the full Pairwise Ranking Guide for configuration details and the Pairwise Ranking Example for a complete walkthrough.
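
For readers unfamiliar with Elo, the sketch below shows the standard rating update applied after a single judged comparison. The K-factor of 32 and the 1500 starting rating are illustrative assumptions. This page does not state which values PairwiseRanker uses, and expectedScore and updateElo are hypothetical names.

```typescript
// Standard Elo update for one pairwise comparison.
// K = 32 and the 1500 starting rating are illustrative assumptions,
// not values confirmed for PairwiseRanker.
type Outcome = 'a' | 'b' | 'tie';

// Probability-like expected score of A against B under the Elo model.
function expectedScore(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

function updateElo(
  ratingA: number,
  ratingB: number,
  outcome: Outcome,
  k = 32,
): [number, number] {
  // Actual score for A: win = 1, loss = 0, tie = 0.5.
  const scoreA = outcome === 'a' ? 1 : outcome === 'b' ? 0 : 0.5;
  const delta = k * (scoreA - expectedScore(ratingA, ratingB));
  // Points gained by one side are lost by the other, so the total is conserved.
  return [ratingA + delta, ratingB - delta];
}

const [newA, newB] = updateElo(1500, 1500, 'a');
console.log(newA, newB); // 1516 1484
```

Because the update depends on the rating gap, beating a higher-rated model moves the ratings more than beating a lower-rated one.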

Combining Multiple Evaluations

When you have multiple evaluations that test different capabilities, use eval combine to aggregate them into a unified leaderboard.

Define a Suite Configuration

Create a TypeScript file that defines how to read each evaluation's results:

typescript

import type { EvalDimension } from '../src/evaluation/combine/types.js';

export const MY_SUITE: EvalDimension[] = [
  {
    evalName: 'my-eval-reasoning',
    label: 'Reasoning',
    maxScore: 20,
    extractScore: (r) => r.judge?.reasoning_quality ?? 0,
    hasResultsSubdir: true,
  },
  {
    evalName: 'my-eval-knowledge',
    label: 'Knowledge',
    maxScore: 30,
    extractScore: (r) => r.correct ? 1 : 0,
  },
];

Generate Combined Reports

bash

# Console leaderboard
dotenvx run -- pnpm run cli eval combine --config path/to/suite-config.ts

# Structured markdown
dotenvx run -- pnpm run cli eval combine --config path/to/suite-config.ts --format md

# Full narrative writeup with methodology, analysis, and judge explanations
dotenvx run -- pnpm run cli eval combine --config path/to/suite-config.ts --format narrative --output report.md

# Focus on specific models
dotenvx run -- pnpm run cli eval combine --config path/to/suite-config.ts --format md --focus nemotron qwen

The combine system:

Reads result JSON files from each eval's output/evaluations/{name}/runs/ directory
Extracts scores using the dimension's extractScore function
Normalizes each dimension to 0–100%, then averages across dimensions
Only includes models present in ALL dimensions
Preserves raw data for detailed per-task breakdowns and judge explanations
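
The normalization and averaging steps listed above can be sketched as follows. The data shapes and the combineScores function are hypothetical and much simpler than the real combine module. The maxScore values of 20 and 30 come from the suite configuration example, while the model names and raw scores are made up for illustration.

```typescript
// Sketch of the aggregation described above: normalize each dimension to a
// percentage, keep only models present in every dimension, then average.
// These types are hypothetical, not Umwelten's actual types.
interface DimensionScores {
  label: string;
  maxScore: number;
  scores: Record<string, number>; // model name -> raw total score
}

interface CombinedRow {
  model: string;
  combined: number; // mean of per-dimension percentages, 0-100
}

function combineScores(dimensions: DimensionScores[]): CombinedRow[] {
  const first = dimensions[0];
  if (first === undefined) return [];
  // Keep only models that have a score in ALL dimensions.
  const common = Object.keys(first.scores).filter((model) =>
    dimensions.every((d) => model in d.scores),
  );
  return common
    .map((model) => {
      const percents = dimensions.map((d) => ((d.scores[model] ?? 0) / d.maxScore) * 100);
      const combined = percents.reduce((sum, p) => sum + p, 0) / percents.length;
      return { model, combined };
    })
    .sort((x, y) => y.combined - x.combined); // best first
}

const leaderboard = combineScores([
  { label: 'Reasoning', maxScore: 20, scores: { 'model-a': 15, 'model-b': 10, 'model-c': 18 } },
  { label: 'Knowledge', maxScore: 30, scores: { 'model-a': 24, 'model-b': 30 } },
]);
console.log(leaderboard);
```

Here model-c is dropped because it has no Knowledge score, and model-a ranks first with (75% + 80%) / 2 = 77.5% against model-b's (50% + 100%) / 2 = 75%. Normalizing before averaging keeps a 30-point dimension from outweighing a 20-point one.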

For a complete walkthrough, see Building a Multi-Dimension Model Showdown.

Examples

For comprehensive examples, see:

Text Generation - Basic model comparison
Creative Writing - Temperature and creativity testing
Analysis & Reasoning - Complex reasoning tasks
Cost Optimization - Budget-conscious evaluation
Pairwise Ranking - Head-to-head Elo ranking via LLM judge
Model Showdown - Multi-dimension evaluation suite with combined reporting

Next Steps

Try batch processing for multiple files
Explore structured output for data extraction
Learn cost analysis for budget optimization
Use pairwise ranking for head-to-head model comparison
Build multi-dimension suites with combined reporting
