Pairwise Ranking Example
This example demonstrates head-to-head model comparison using the PairwiseRanker to produce Elo-based rankings via an LLM judge.
Running the Example
The Rivian narrative ranking script compares model responses from a prior evaluation:
# Swiss tournament (default, 5 rounds)
dotenvx run -- pnpm tsx examples/mcp-chat/elo-rivian.ts --run 4
# Full round-robin (all pairs)
dotenvx run -- pnpm tsx examples/mcp-chat/elo-rivian.ts --full
# Custom swiss rounds
dotenvx run -- pnpm tsx examples/mcp-chat/elo-rivian.ts --rounds 7
What This Example Shows
- Pairwise Ranking: Head-to-head LLM-judge comparisons between model responses
- Elo Ratings: Bradley-Terry model for computing relative rankings from pairwise results
- Swiss Tournament: Efficient pairing strategy (5 rounds vs full round-robin)
- Caching: Comparison results are cached — re-runs are instant for existing matchups
- Metadata Preservation: Tool usage scores carried through from evaluation to ranking
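The "Elo Ratings" item can be made concrete with the standard Elo update rule. This is a generic sketch, not PairwiseRanker's actual implementation: the K-factor of 32, the 1500 starting rating, and the names `expectedScore` and `updateElo` are all assumptions made for illustration.

```typescript
// Standard Elo update for a single pairwise result (illustrative sketch).
// Assumptions: K = 32 and the usual 400-point logistic scale.
type Outcome = 'a' | 'b' | 'tie';

// Probability that A beats B under the logistic (Bradley-Terry style) model.
function expectedScore(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

function updateElo(
  ratingA: number,
  ratingB: number,
  outcome: Outcome,
  k = 32,
): [number, number] {
  const scoreA = outcome === 'a' ? 1 : outcome === 'b' ? 0 : 0.5;
  const delta = k * (scoreA - expectedScore(ratingA, ratingB));
  // Zero-sum: whatever A gains, B loses.
  return [ratingA + delta, ratingB - delta];
}

const [winner, loser] = updateElo(1500, 1500, 'a');
console.log(winner, loser); // 1516 1484
```

A tie between equally rated entries leaves both ratings unchanged, while an upset win against a higher-rated entry moves both ratings further than an expected win does.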
Code Walkthrough
1. Load Responses from Prior Evaluation
The script reads cached model responses from a completed evaluation run:
import fs from 'node:fs';
import path from 'node:path';
import { PairwiseRanker } from '../../src/evaluation/ranking/index.js';
import type { RankingEntry } from '../../src/evaluation/ranking/index.js';
// Load responses from evaluation run directory
const entries: RankingEntry[] = [];
for (const file of fs.readdirSync(resultsDir)) {
  // resultsDir, key, resultPath, and responsePath are set up elsewhere in the script (omitted in this excerpt)
  const result = JSON.parse(fs.readFileSync(resultPath, 'utf8'));
  const resp = JSON.parse(fs.readFileSync(responsePath, 'utf8'));
  entries.push({
    key,
    model: result.model,
    provider: result.provider,
    responseText: resp.responseText,
    // Carry the tool usage score through to the ranking output
    metadata: { toolScore: result.toolUsage?.tool_score ?? 0 },
  });
}
2. Configure the Ranker
Set up the judge model and ranking parameters:
const ranker = new PairwiseRanker(entries, {
  judgeModel: { name: 'anthropic/claude-haiku-4.5', provider: 'openrouter' },
  judgeInstructions: [
    'You will see two model responses (A and B) to the same prompt:',
    '"Summarize the 10 days of the Rivian\'s activity."',
    '',
    'Judge ONLY the quality of the narrative summary. Consider:',
    '- Storytelling: Does it read like a story, not a data dump?',
    '- Specificity: Real dates, distances, locations, charge percentages?',
    '- Completeness: Does it cover the full 10-day range?',
    '- Engagement: Would a human enjoy reading this?',
    '- Structure: Is it well-organized with a clear arc?',
    '',
    'Focus purely on which response is a better piece of writing.',
    'If one is clearly better, pick it. Only say "tie" if genuinely equal.',
  ],
  pairingMode: fullRoundRobin ? 'all' : 'swiss',
  swissRounds: 5,
  cacheDir: path.join(runDir, 'elo'),
  onProgress: (label, cached) => {
    process.stdout.write(cached ? ` 📁 ${label} (cached)\n` : ` ${label}\n`);
  },
});
3. Run and Display Results
const output = await ranker.rank();
for (let i = 0; i < output.rankings.length; i++) {
  const r = output.rankings[i];
  const medal = i < 3 ? ['🥇', '🥈', '🥉'][i] : `${i + 1}.`;
  const toolScore = (r.metadata as any)?.toolScore ?? 0;
  console.log(
    `${medal} ${r.provider}:${r.model} — Elo ${r.elo} ` +
    `(${r.wins}W/${r.losses}L/${r.ties}T) Tools: ${toolScore}/5`
  );
}
Expected Output
⚔️ Elo Narrative Ranking — Run 004
════════════════════════════════════════════════════════════════
Loaded 12 responses with text
Mode: Swiss tournament — 5 rounds
Judge: openrouter:anthropic/claude-haiku-4.5
📋 Round 1/5
────────────────────────────────────────────────────────────
[R1 #1] gpt-4o vs gemini-3-flash-preview → gpt-4o (high)
[R1 #2] claude-sonnet-4 vs llama-3.3-70b → claude-sonnet-4 (high)
[R1 #3] gemini-2.5-pro vs mistral-large → gemini-2.5-pro (medium)
...
🏆 ELO NARRATIVE RANKINGS
════════════════════════════════════════════════════════════════
Rank  Model                                   Elo   W  L  T  Games  Tools
────────────────────────────────────────────────────────────────
🥇    openrouter:openai/gpt-4o                1580  4  1  0  5      4/5
🥈    openrouter:anthropic/claude-sonnet-4    1565  3  1  1  5      5/5
🥉    google:gemini-2.5-pro                   1548  3  2  0  5      3/5
4.    openrouter:meta-llama/llama-3.3-70b     1502  2  2  1  5      2/5
...
📊 30 comparisons completed
📁 Results: output/evaluations/rivian-10day/runs/004/elo/rankings.json
Comparisons cached — re-run is instant for existing matchups.
Standalone Usage Pattern
Use the ranker independently of any evaluation framework:
import { PairwiseRanker } from '../../src/evaluation/ranking/index.js';
// Your responses from any source
const entries = [
  { key: 'model-a', model: 'gpt-4o', provider: 'openrouter', responseText: 'Response from model A...' },
  { key: 'model-b', model: 'gemini-3-flash', provider: 'google', responseText: 'Response from model B...' },
  { key: 'model-c', model: 'claude-sonnet-4', provider: 'openrouter', responseText: 'Response from model C...' },
];
const ranker = new PairwiseRanker(entries, {
  judgeModel: { name: 'gemini-3-flash-preview', provider: 'google' },
  judgeInstructions: ['Compare the two responses. Which is more helpful?'],
  pairingMode: 'all', // round-robin for small sets
});
const output = await ranker.rank();
console.log(`Top model: ${output.rankings[0].model} with Elo ${output.rankings[0].elo}`);
Integration with Evaluation Framework
Bridge from runEvaluation() results:
import { runEvaluation } from '../../src/evaluation/api.js';
import { evaluationResultsToRankingEntries, PairwiseRanker } from '../../src/evaluation/ranking/index.js';
// Step 1: Run evaluation
const evalResult = await runEvaluation({
  evaluationId: 'my-eval',
  prompt: 'Write a haiku about programming',
  models: ['google:gemini-3-flash-preview', 'openrouter:openai/gpt-4o', 'openrouter:anthropic/claude-sonnet-4'],
});
// Step 2: Convert to ranking entries
const entries = evaluationResultsToRankingEntries(evalResult);
// Step 3: Rank pairwise
const ranker = new PairwiseRanker(entries, {
  judgeModel: { name: 'gemini-3-flash-preview', provider: 'google' },
  judgeInstructions: ['Which haiku is more evocative and follows proper 5-7-5 form?'],
  pairingMode: 'all',
});
const output = await ranker.rank();
Key Concepts
Position Bias Mitigation
The ranker randomly flips which response is shown as "A" vs "B" for each comparison. Because the order is random, any preference the judge has for the first or second position does not systematically favor one model over another.
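A minimal sketch of the idea, assuming a coin flip per comparison and a verdict that must be mapped back to the original order. The names `presentPair` and `resolveWinner` are hypothetical and are not part of the PairwiseRanker API.

```typescript
type Verdict = 'A' | 'B' | 'tie';
type Winner = 'first' | 'second' | 'tie';

interface Presentation {
  shownAsA: string;
  shownAsB: string;
  flipped: boolean;
}

// Decide the display order for one comparison.
// `rng` is injectable so the behaviour can be made deterministic in tests.
function presentPair(
  first: string,
  second: string,
  rng: () => number = Math.random,
): Presentation {
  const flipped = rng() < 0.5;
  return flipped
    ? { shownAsA: second, shownAsB: first, flipped }
    : { shownAsA: first, shownAsB: second, flipped };
}

// Map the judge's A/B verdict back to the original entry order.
function resolveWinner(verdict: Verdict, flipped: boolean): Winner {
  if (verdict === 'tie') return 'tie';
  const aWon = verdict === 'A';
  // When flipped, "A" on screen was really the second entry.
  return aWon !== flipped ? 'first' : 'second';
}
```

The important detail is that the flip is recorded alongside the verdict. Without it, a win for "A" cannot be attributed to the right entry.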
Incremental Caching
With cacheDir set, every comparison is cached to comparisons.json. If you re-run the script:
- Cached comparisons replay instantly (no API calls)
- New matchups (e.g., from additional swiss rounds) are computed and added
- Rankings update incrementally
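The caching behaviour described above can be sketched as a small JSON-backed lookup. Only the file name comparisons.json comes from this page. The record shape, the `::` key format, and the helper names are assumptions, and the real cache layout may differ.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Hypothetical record shape: the key of the winning entry, or null for a tie.
type Cache = Record<string, { winnerKey: string | null }>;

// Order-independent key so (x, y) and (y, x) hit the same cache entry.
function pairKey(keyA: string, keyB: string): string {
  return [keyA, keyB].sort().join('::');
}

function loadCache(cacheDir: string): Cache {
  const file = path.join(cacheDir, 'comparisons.json');
  return fs.existsSync(file)
    ? (JSON.parse(fs.readFileSync(file, 'utf8')) as Cache)
    : {};
}

function saveCache(cacheDir: string, cache: Cache): void {
  fs.mkdirSync(cacheDir, { recursive: true });
  fs.writeFileSync(
    path.join(cacheDir, 'comparisons.json'),
    JSON.stringify(cache, null, 2),
  );
}

// Replay a cached result if one exists; otherwise call the judge once and persist it.
async function compareCached(
  cacheDir: string,
  keyA: string,
  keyB: string,
  judge: (a: string, b: string) => Promise<string | null>,
): Promise<{ winnerKey: string | null; cached: boolean }> {
  const cache = loadCache(cacheDir);
  const key = pairKey(keyA, keyB);
  if (key in cache) return { ...cache[key], cached: true };
  const winnerKey = await judge(keyA, keyB);
  cache[key] = { winnerKey };
  saveCache(cacheDir, cache);
  return { winnerKey, cached: false };
}
```

Writing the file after every new comparison keeps the cache usable even if a run is interrupted partway through.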
Swiss vs Round-Robin Trade-offs
For 12 models:
- Swiss (5 rounds): ~30 comparisons (6 pairings per round), good approximate ranking
- Round-robin: 66 comparisons (every one of the 12 × 11 / 2 pairs), most complete ranking
- Swiss at 7 rounds with 12 models (~42 comparisons) gives accuracy close to round-robin
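The counts above follow from simple arithmetic, sketched here under the assumption that each Swiss round pairs every entry exactly once and that one entry sits out when the count is odd. The function names are illustrative only.

```typescript
// Number of judge calls for each pairing mode (illustrative arithmetic only).

// Round-robin: every unordered pair is compared once.
function roundRobinComparisons(n: number): number {
  return (n * (n - 1)) / 2;
}

// Swiss: each round pairs entries two at a time.
// Assumption: with an odd count, one entry gets a bye each round.
function swissComparisons(n: number, rounds: number): number {
  return Math.floor(n / 2) * rounds;
}

console.log(roundRobinComparisons(12)); // 66
console.log(swissComparisons(12, 5)); // 30
console.log(swissComparisons(12, 7)); // 42
```

Round-robin cost grows quadratically with the number of models, while Swiss cost grows linearly for a fixed number of rounds, so the savings widen as more models are added.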
Error Handling
If a judge call fails (network error, rate limit, malformed response), the comparison counts as a tie. This prevents one bad API call from corrupting the ranking.
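A minimal sketch of this fallback, assuming the judge replies with a bare "A", "B", or "tie". The real judge response format is not shown on this page, and `parseVerdict` and `safeJudge` are hypothetical names.

```typescript
type JudgeVerdict = 'A' | 'B' | 'tie';

// Anything that is not a clear "A" or "B" is treated as a tie.
function parseVerdict(raw: string): JudgeVerdict {
  const cleaned = raw.trim().toUpperCase();
  if (cleaned === 'A' || cleaned === 'B') return cleaned;
  return 'tie';
}

// Wrap a judge call so that thrown errors (network failure, rate limit)
// and malformed replies both resolve to a tie instead of rejecting.
async function safeJudge(callJudge: () => Promise<string>): Promise<JudgeVerdict> {
  try {
    return parseVerdict(await callJudge());
  } catch {
    return 'tie';
  }
}
```

Because a failed comparison is indistinguishable from a genuine tie in this sketch, logging the failure separately would help explain an unexpectedly high tie count.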
Next Steps
- Pairwise Ranking Guide — Detailed configuration and usage guide
- API Reference — Full type and function reference
- Matrix Evaluation — Generate responses to rank
- Cost Optimization — Budget-conscious evaluation