Evaluation framework (API)
Use this page as a quick map of evaluation-related code. Long-form patterns (strategies, caching, runner examples) live in the canonical doc: Evaluation framework (architecture).
Published package exports
From umwelten (src/index.ts):
- runEvaluation, runEvaluationWithProgress, generateReport, listEvaluations, parseModel
- Types: EvaluationConfig, EvaluationResult, EnhancedEvaluationConfig
Deep imports from umwelten/dist/... reach EvaluationRunner, strategies, combine/ suite loaders, PairwiseRanker, etc., when you need more control than the CLI wrappers.
Mental model
- Stimulus: What to test (prompt shape, tools, output style).
- Interaction + BaseModelRunner: How a single model run is executed (used inside runners and runEvaluation).
- EvaluationRunner / strategies: Repeatable evaluations with disk cache under an eval id.
- eval combine + EvalDimension[]: Merge several evaluation runs into one leaderboard/report (examples/model-showdown).
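The "disk cache under an eval id" idea can be illustrated with a minimal sketch. This is not umwelten's actual cache layout or API: the helper cachedUnderEvalId, the one-JSON-file-per-key layout, and the temporary root directory are all assumptions made for illustration. It only shows why a repeated run under the same eval id can skip work that has already been done.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper, NOT part of umwelten: return the cached value for
// (evalId, key) if a file already exists, otherwise compute it and persist it.
function cachedUnderEvalId(
  rootDir: string,
  evalId: string,
  key: string,
  compute: () => string,
): { value: string; fromCache: boolean } {
  // Assumed layout: <rootDir>/<evalId>/<key>.json
  const dir = path.join(rootDir, evalId);
  const file = path.join(dir, `${key}.json`);

  if (fs.existsSync(file)) {
    const cached = JSON.parse(fs.readFileSync(file, "utf8")) as { content: string };
    return { value: cached.content, fromCache: true };
  }

  const value = compute();
  fs.mkdirSync(dir, { recursive: true });
  fs.writeFileSync(file, JSON.stringify({ content: value }), "utf8");
  return { value, fromCache: false };
}

// Usage: the second call with the same eval id and key skips the computation.
const cacheRoot = fs.mkdtempSync(path.join(os.tmpdir(), "eval-cache-sketch-"));
let calls = 0;
const produce = (): string => {
  calls += 1; // stands in for an expensive model call
  return "hello";
};

const first = cachedUnderEvalId(cacheRoot, "demo-eval", "model-a", produce);
const second = cachedUnderEvalId(cacheRoot, "demo-eval", "model-a", produce);
console.log(first.fromCache, second.fromCache, calls);
```

Keying the directory on the eval id means that changing the id starts from an empty cache, while rerunning under the same id reuses earlier responses.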
Minimal custom runner
Extend EvaluationRunner and implement getModelResponse. When working inside the umwelten repo (e.g. pnpm tsx scripts), import from src/... as in the examples under Evaluation framework (architecture). From another package, use deep imports from the published dist/ layout (the root exports field only exposes the main entry; EvaluationRunner is not re-exported there).
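The "extend and implement getModelResponse" pattern can be sketched in isolation. Everything below is a stand-in so that the snippet runs without the package: SketchEvaluationRunner, SketchModelDetails, SketchModelResponse, and evaluateAll are hypothetical names, and the real EvaluationRunner constructor and method signatures may differ. The only detail taken from this page is that a response exposes content rather than text.

```typescript
// Stand-in types for illustration only; umwelten's real classes are richer.
interface SketchModelResponse {
  content: string; // this page notes ModelResponse uses .content, not .text
}

interface SketchModelDetails {
  name: string;
  provider: string;
}

// Hypothetical base class mirroring the shape of the pattern: the base class
// owns the loop, and subclasses supply a single model run.
abstract class SketchEvaluationRunner {
  protected readonly evaluationId: string;

  constructor(evaluationId: string) {
    this.evaluationId = evaluationId;
  }

  // Subclasses decide how one model produces one response.
  protected abstract getModelResponse(
    model: SketchModelDetails,
  ): Promise<SketchModelResponse>;

  // Run each model in sequence and collect responses keyed by model name.
  async evaluateAll(
    models: SketchModelDetails[],
  ): Promise<Map<string, SketchModelResponse>> {
    const results = new Map<string, SketchModelResponse>();
    for (const model of models) {
      results.set(model.name, await this.getModelResponse(model));
    }
    return results;
  }
}

class EchoRunner extends SketchEvaluationRunner {
  protected async getModelResponse(
    model: SketchModelDetails,
  ): Promise<SketchModelResponse> {
    // A real runner would build a Stimulus and an Interaction here and call
    // the model; this stub just echoes its inputs.
    return {
      content: `[${this.evaluationId}] response from ${model.provider}:${model.name}`,
    };
  }
}
```

A caller would construct the subclass with an eval id, for example new EchoRunner("demo-eval"), and await evaluateAll with a list of models. In the real package, the import path for EvaluationRunner depends on whether you are inside the repo or consuming the published dist/ layout, as described above.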
CLI parity
Prefer the CLI for ad-hoc runs; use the API when embedding in your own scripts:
- pnpm run cli -- eval run … → runEvaluation
- pnpm run cli -- eval report … → generateReport
- pnpm run cli -- eval combine --config … → suite aggregation (see Model evaluation)
See also
- Model evaluation guide
- Creating evaluations
- Pairwise ranking API
- Cognition / ModelResponse (use .content, not .text)