Pairwise Ranking Example

This example demonstrates head-to-head model comparison using the PairwiseRanker to produce Elo-based rankings via an LLM judge.

Running the Example

The Rivian narrative ranking script compares model responses from a prior evaluation:

bash

# Swiss tournament (default, 5 rounds)
dotenvx run -- pnpm tsx examples/mcp-chat/elo-rivian.ts --run 4

# Full round-robin (all pairs)
dotenvx run -- pnpm tsx examples/mcp-chat/elo-rivian.ts --full

# Custom swiss rounds
dotenvx run -- pnpm tsx examples/mcp-chat/elo-rivian.ts --rounds 7

What This Example Shows

Pairwise Ranking: Head-to-head LLM-judge comparisons between model responses
Elo Ratings: Bradley-Terry model for computing relative rankings from pairwise results
Swiss Tournament: Efficient pairing strategy (5 rounds vs full round-robin)
Caching: Comparison results are cached — re-runs are instant for existing matchups
Metadata Preservation: Tool usage scores carried through from evaluation to ranking
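
To make the Elo Ratings item more concrete, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise outcomes and mapping them onto an Elo-like scale. It is an illustration only, not PairwiseRanker's implementation: the name `bradleyTerryElo`, the 1500 baseline, the 400-point scale, counting a tie as half a win for each side, and the fixed iteration count are all assumptions.

```typescript
// Hypothetical sketch; the library's actual rating code may differ.
type Outcome = { first: string; second: string; winner: 'first' | 'second' | 'tie' };

function bradleyTerryElo(keys: string[], outcomes: Outcome[], iterations = 200): Map<string, number> {
  const index = new Map<string, number>();
  keys.forEach((k, i) => index.set(k, i));
  const n = keys.length;

  // wins[i][j] = wins of i over j; a tie adds half a win to each side (assumption).
  const wins: number[][] = Array.from({ length: n }, () => new Array<number>(n).fill(0));
  for (const o of outcomes) {
    const i = index.get(o.first);
    const j = index.get(o.second);
    if (i === undefined || j === undefined) continue; // skip unknown keys
    if (o.winner === 'first') wins[i][j] += 1;
    else if (o.winner === 'second') wins[j][i] += 1;
    else { wins[i][j] += 0.5; wins[j][i] += 0.5; }
  }

  // Standard minorization-maximization update: p_i = W_i / sum_j n_ij / (p_i + p_j).
  let strength: number[] = new Array<number>(n).fill(1);
  for (let iter = 0; iter < iterations; iter++) {
    const next = strength.map((current, i) => {
      let won = 0;
      let denom = 0;
      for (let j = 0; j < n; j++) {
        if (j === i) continue;
        won += wins[i][j];
        denom += (wins[i][j] + wins[j][i]) / (current + strength[j]);
      }
      if (denom === 0) return current;       // entry played no games
      return Math.max(won / denom, 1e-6);    // floor keeps winless entries finite
    });
    // Normalize so the geometric mean is 1, which centers the ratings on 1500.
    const logMean = next.reduce((sum, v) => sum + Math.log(v), 0) / n;
    const scale = Math.exp(logMean);
    strength = next.map((v) => v / scale);
  }

  const elo = new Map<string, number>();
  keys.forEach((k, i) => elo.set(k, 1500 + 400 * Math.log10(strength[i])));
  return elo;
}
```

Because the whole set of results is fitted at once, the ratings do not depend on the order in which comparisons were judged, unlike a sequential Elo update.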

Code Walkthrough

1. Load Responses from Prior Evaluation

The script reads cached model responses from a completed evaluation run:

typescript

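import fs from 'node:fs';
import path from 'node:path';
import { PairwiseRanker } from '@umwelten/evaluation/evaluation/ranking/index.js';
import type { RankingEntry } from '@umwelten/evaluation/evaluation/ranking/index.js';

// Load responses from the evaluation run directory.
// Assumption: resultsDir and responsesDir are resolved earlier in the script and
// each holds one <key>.json file per model; adjust to your run's actual layout.
const entries: RankingEntry[] = [];
for (const file of fs.readdirSync(resultsDir)) {
  if (!file.endsWith('.json')) continue;
  const key = path.basename(file, '.json');
  const resultPath = path.join(resultsDir, file);
  const responsePath = path.join(responsesDir, file);
  if (!fs.existsSync(responsePath)) continue; // no response text to rank

  const result = JSON.parse(fs.readFileSync(resultPath, 'utf8'));
  const resp = JSON.parse(fs.readFileSync(responsePath, 'utf8'));

  entries.push({
    key,
    model: result.model,
    provider: result.provider,
    responseText: resp.responseText,
    // Carry the tool-usage score through so it can be shown next to the Elo rating.
    metadata: { toolScore: result.toolUsage?.tool_score ?? 0 },
  });
}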

2. Configure the Ranker

Set up the judge model and ranking parameters:

typescript

const ranker = new PairwiseRanker(entries, {
  judgeModel: { name: 'anthropic/claude-haiku-4.5', provider: 'openrouter' },
  judgeInstructions: [
    'You will see two model responses (A and B) to the same prompt:',
    '"Summarize the 10 days of the Rivian\'s activity."',
    '',
    'Judge ONLY the quality of the narrative summary. Consider:',
    '- Storytelling: Does it read like a story, not a data dump?',
    '- Specificity: Real dates, distances, locations, charge percentages?',
    '- Completeness: Does it cover the full 10-day range?',
    '- Engagement: Would a human enjoy reading this?',
    '- Structure: Is it well-organized with a clear arc?',
    '',
    'Focus purely on which response is a better piece of writing.',
    'If one is clearly better, pick it. Only say "tie" if genuinely equal.',
  ],
  pairingMode: fullRoundRobin ? 'all' : 'swiss',
  swissRounds: 5,
  cacheDir: path.join(runDir, 'elo'),
  onProgress: (label, cached) => {
    process.stdout.write(cached ? `  📁 ${label} (cached)\n` : `  ${label}\n`);
  },
});
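
The pairingMode option above switches between judging every pair and running a Swiss tournament. As a rough illustration of what one Swiss round involves, the sketch below sorts entries by current score and pairs each one with the nearest-ranked opponent it has not already faced. How PairwiseRanker actually builds its pairings is not shown in this example, so `Standing`, `swissRound`, and the greedy strategy are assumptions.

```typescript
// Hypothetical sketch of a greedy Swiss pairing round.
type Standing = { key: string; score: number };

// Order-independent identifier for a matchup.
function pairKey(a: string, b: string): string {
  return JSON.stringify([a, b].sort());
}

function swissRound(standings: Standing[], played: Set<string>): [string, string][] {
  // Highest score first; entries with similar scores end up adjacent.
  const sorted = [...standings].sort((x, y) => y.score - x.score);
  const used = new Set<string>();
  const pairs: [string, string][] = [];

  for (let i = 0; i < sorted.length; i++) {
    const a = sorted[i].key;
    if (used.has(a)) continue;
    // Take the closest lower-ranked entry that is free and has not met `a` yet.
    for (let j = i + 1; j < sorted.length; j++) {
      const b = sorted[j].key;
      if (used.has(b) || played.has(pairKey(a, b))) continue;
      pairs.push([a, b]);
      used.add(a);
      used.add(b);
      played.add(pairKey(a, b)); // remember the matchup for later rounds
      break;
    }
  }
  // Entries left unpaired (odd count, or no fresh opponent) sit this round out.
  return pairs;
}
```

A greedy pass like this can leave an entry without an opponent in later rounds; tournament-grade Swiss systems use backtracking or matching algorithms to avoid that.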

3. Run and Display Results

typescript

const output = await ranker.rank();

for (let i = 0; i < output.rankings.length; i++) {
  const r = output.rankings[i];
  const medal = i < 3 ? ['🥇', '🥈', '🥉'][i] : `${i + 1}.`;
  const toolScore = (r.metadata as any)?.toolScore ?? 0;
  console.log(
    `${medal} ${r.provider}:${r.model} — Elo ${r.elo} ` +
    `(${r.wins}W/${r.losses}L/${r.ties}T) Tools: ${toolScore}/5`
  );
}

Expected Output

⚔️  Elo Narrative Ranking — Run 004
════════════════════════════════════════════════════════════════
Loaded 12 responses with text

Mode: Swiss tournament — 5 rounds
Judge: openrouter:anthropic/claude-haiku-4.5

📋 Round 1/5
────────────────────────────────────────────────────────────
  [R1 #1] gpt-4o vs gemini-3-flash-preview → gpt-4o (high)
  [R1 #2] claude-sonnet-4 vs llama-3.3-70b → claude-sonnet-4 (high)
  [R1 #3] gemini-2.5-pro vs mistral-large → gemini-2.5-pro (medium)
  ...

🏆 ELO NARRATIVE RANKINGS
════════════════════════════════════════════════════════════════
Rank  Model                                     Elo     W    L    T    Games  Tools
────────────────────────────────────────────────────────────────
🥇    openrouter:openai/gpt-4o                  1580    4    1    0    5      4/5
🥈    openrouter:anthropic/claude-sonnet-4       1565    3    1    1    5      5/5
🥉    google:gemini-2.5-pro                      1548    3    2    0    5      3/5
4.    openrouter:meta-llama/llama-3.3-70b        1502    2    2    1    5      2/5
...

📊 30 comparisons completed
📁 Results: output/evaluations/rivian-10day/runs/004/elo/rankings.json
   Comparisons cached — re-run is instant for existing matchups.

Standalone Usage Pattern

Use the ranker independently of any evaluation framework:

typescript

import { PairwiseRanker } from '@umwelten/evaluation/evaluation/ranking/index.js';

// Your responses from any source
const entries = [
  { key: 'model-a', model: 'gpt-4o', provider: 'openrouter', responseText: 'Response from model A...' },
  { key: 'model-b', model: 'gemini-3-flash', provider: 'google', responseText: 'Response from model B...' },
  { key: 'model-c', model: 'claude-sonnet-4', provider: 'openrouter', responseText: 'Response from model C...' },
];

const ranker = new PairwiseRanker(entries, {
  judgeModel: { name: 'gemini-3-flash-preview', provider: 'google' },
  judgeInstructions: ['Compare the two responses. Which is more helpful?'],
  pairingMode: 'all',  // round-robin for small sets
});

const output = await ranker.rank();
console.log(`Top model: ${output.rankings[0].model} with Elo ${output.rankings[0].elo}`);

Integration with Evaluation Framework

Bridge from runEvaluation() results:

typescript

import { runEvaluation } from '@umwelten/evaluation/evaluation/api.js';
import { evaluationResultsToRankingEntries, PairwiseRanker } from '@umwelten/evaluation/evaluation/ranking/index.js';

// Step 1: Run evaluation
const evalResult = await runEvaluation({
  evaluationId: 'my-eval',
  prompt: 'Write a haiku about programming',
  models: ['google:gemini-3-flash-preview', 'openrouter:openai/gpt-4o', 'openrouter:anthropic/claude-sonnet-4'],
});

// Step 2: Convert to ranking entries
const entries = evaluationResultsToRankingEntries(evalResult);

// Step 3: Rank pairwise
const ranker = new PairwiseRanker(entries, {
  judgeModel: { name: 'gemini-3-flash-preview', provider: 'google' },
  judgeInstructions: ['Which haiku is more evocative and follows proper 5-7-5 form?'],
  pairingMode: 'all',
});

const output = await ranker.rank();

Key Concepts

Position Bias Mitigation

The ranker randomly flips which response is shown as "A" vs "B" for each comparison. Randomizing the order means that a judge's systematic preference for the first or second position does not consistently favor the same model.
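
A minimal sketch of the flip described above, assuming the judge returns 'A', 'B', or 'tie'. The helper names `presentPair` and `resolveWinner` are hypothetical and only illustrate the idea; the key step is mapping the positional verdict back to the original entry.

```typescript
// Hypothetical helpers; not part of the PairwiseRanker API.
type Entry = { key: string; responseText: string };
type Verdict = 'A' | 'B' | 'tie';
type Presented = { a: Entry; b: Entry; flipped: boolean };

// Decide at random which entry the judge sees in position A.
function presentPair(first: Entry, second: Entry, random: () => number = Math.random): Presented {
  const flipped = random() < 0.5;
  return flipped
    ? { a: second, b: first, flipped }
    : { a: first, b: second, flipped };
}

// Translate the judge's positional verdict back into an entry key (null means tie).
function resolveWinner(verdict: Verdict, shown: Presented): string | null {
  if (verdict === 'tie') return null;
  return verdict === 'A' ? shown.a.key : shown.b.key;
}
```

With this in place, a judge that always answers 'A' awards the win to each entry about half the time, so its position bias adds noise rather than a consistent advantage for one model.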

Incremental Caching

With cacheDir set, every comparison is cached to comparisons.json. If you re-run the script:

Cached comparisons replay instantly (no API calls)
New matchups (e.g., from additional swiss rounds) are computed and added
Rankings update incrementally
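
The caching behavior above can be pictured as a small store keyed by the unordered pair of entry keys. The sketch below is an assumption about how such a cache might look, not the actual format of comparisons.json: the `ComparisonCache` class and the `CachedComparison` shape are hypothetical.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical record shape; the real comparisons.json schema may differ.
type CachedComparison = { winner: string | null; confidence: string };

class ComparisonCache {
  private readonly file: string;
  private readonly records: Record<string, CachedComparison>;

  constructor(cacheDir: string) {
    fs.mkdirSync(cacheDir, { recursive: true });
    this.file = path.join(cacheDir, 'comparisons.json');
    // Start from whatever earlier runs left on disk.
    this.records = fs.existsSync(this.file)
      ? JSON.parse(fs.readFileSync(this.file, 'utf8'))
      : {};
  }

  // Sorting the keys makes (a, b) and (b, a) hit the same record.
  private static pairKey(a: string, b: string): string {
    return JSON.stringify([a, b].sort());
  }

  get(a: string, b: string): CachedComparison | undefined {
    return this.records[ComparisonCache.pairKey(a, b)];
  }

  set(a: string, b: string, result: CachedComparison): void {
    this.records[ComparisonCache.pairKey(a, b)] = result;
    // Persist after every comparison so an interrupted run loses nothing.
    fs.writeFileSync(this.file, JSON.stringify(this.records, null, 2));
  }
}

// Usage: a second run pointed at the same directory sees the earlier result.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'elo-cache-'));
new ComparisonCache(demoDir).set('model-a', 'model-b', { winner: 'model-a', confidence: 'high' });
console.log(new ComparisonCache(demoDir).get('model-b', 'model-a'));
```

A ranker built on this pattern would call `get` before each judge request and only pay for matchups that return `undefined`.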

Swiss vs Round-Robin Trade-offs

For 12 models:

Swiss (5 rounds): 30 comparisons (6 matches per round), good approximate ranking
Round-robin: 66 comparisons (every pair judged once), most complete ranking
Swiss (7 rounds): 42 comparisons, accuracy close to round-robin
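
The comparison counts above follow from simple arithmetic: a round-robin judges every unordered pair once, while each Swiss round pairs every entry at most once. The helper names below are hypothetical, and the Swiss figure assumes every round can be fully paired (one entry sits out when the count is odd).

```typescript
// Every unordered pair is judged once: n choose 2.
function roundRobinComparisons(n: number): number {
  return (n * (n - 1)) / 2;
}

// Each round has floor(n / 2) matches; Swiss can never exceed the round-robin total.
function swissComparisons(n: number, rounds: number): number {
  return Math.min(Math.floor(n / 2) * rounds, roundRobinComparisons(n));
}

console.log(roundRobinComparisons(12)); // 66
console.log(swissComparisons(12, 5));   // 30
console.log(swissComparisons(12, 7));   // 42
```

The gap widens quickly with more models, since round-robin grows quadratically while Swiss grows linearly for a fixed number of rounds.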

Error Handling

If a judge call fails (network error, rate limit, malformed response), the comparison counts as a tie. This prevents one bad API call from corrupting the ranking.
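
The fallback described above can be sketched as a wrapper around the judge call. This is a hedged illustration rather than the ranker's real code: `judgeOrTie`, the `Comparison` shape, and the assumption that the judge replies with a bare "A", "B", or "tie" are all hypothetical.

```typescript
// Hypothetical wrapper; not part of the PairwiseRanker API.
type Comparison = { winner: 'A' | 'B' | 'tie'; failed: boolean };

async function judgeOrTie(callJudge: () => Promise<string>): Promise<Comparison> {
  try {
    const raw = (await callJudge()).trim().toUpperCase();
    if (raw === 'A' || raw === 'B') return { winner: raw, failed: false };
    if (raw === 'TIE') return { winner: 'tie', failed: false };
    // Malformed response: record a tie instead of guessing a winner.
    return { winner: 'tie', failed: true };
  } catch {
    // Network errors and rate limits also degrade to a tie.
    return { winner: 'tie', failed: true };
  }
}
```

Keeping a `failed` flag separate from the verdict makes it possible to tell genuine ties from fallbacks, for example to retry them later instead of caching them permanently.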

Next Steps

Pairwise Ranking Guide — Detailed configuration and usage guide
API Reference — Full type and function reference
Matrix Evaluation — Generate responses to rank
Cost Optimization — Budget-conscious evaluation
