Skills ml-model-eval-benchmark
Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.
Install
Source · Clone the upstream repo
git clone https://github.com/openclaw/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/0x-professor/ml-model-eval-benchmark" ~/.claude/skills/clawdbot-skills-ml-model-eval-benchmark && rm -rf "$T"
manifest: skills/0x-professor/ml-model-eval-benchmark/SKILL.md
ML Model Eval Benchmark
Overview
Produce consistent, reproducible model rankings from per-candidate evaluation metrics combined with a fixed set of metric weights.
Workflow
- Define metric weights and accepted metric ranges.
- Ingest model metrics for each candidate.
- Compute the weighted score for each candidate and rank them deterministically (see the sketch after this list).
- Export leaderboard and promotion recommendation.
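A minimal sketch of the scoring and ranking step, assuming every metric has already been normalized to a 0-1 scale with higher values better; the weights, metric names, and candidate data are illustrative assumptions, not the bundled script's actual interface.

```python
# Minimal sketch: weighted scoring and deterministic ranking.
# Assumes metrics are normalized to [0, 1] with higher = better;
# names, weights, and values are illustrative, not the script's API.

WEIGHTS = {"accuracy": 0.5, "f1": 0.3, "latency_score": 0.2}  # should sum to 1.0

candidates = {
    "model-a": {"accuracy": 0.91, "f1": 0.88, "latency_score": 0.70},
    "model-b": {"accuracy": 0.89, "f1": 0.90, "latency_score": 0.85},
}

def weighted_score(metrics: dict[str, float]) -> float:
    """Weighted sum over the agreed metric set; a missing metric raises KeyError."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())

# Deterministic ranking: score descending, then model name ascending as a tie-break.
leaderboard = sorted(
    ((weighted_score(m), name) for name, m in candidates.items()),
    key=lambda item: (-item[0], item[1]),
)

for rank, (score, name) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: {score:.4f}")
```

Sorting on the (negated score, name) pair is one way to keep the ranking deterministic when two candidates tie on score; the reference guide's tie-break rules should take precedence where they differ.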
Use Bundled Resources
- Run scripts/benchmark_models.py to generate benchmark outputs (an illustrative output sketch follows this list).
- Read references/benchmarking-guide.md for weighting and tie-break guidance.
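An illustrative sketch of what an exported leaderboard could record so the promotion recommendation stays reproducible; the output path, field names, promotion rule, and champion score are assumptions, not the bundled script's actual output format.

```python
# Illustrative export sketch: a leaderboard report that records the weighting
# assumptions alongside the ranking. Field names, output path, promotion rule,
# and the champion score are assumptions, not the bundled script's format.
import json

weights = {"accuracy": 0.5, "f1": 0.3, "latency_score": 0.2}
leaderboard = [
    {"rank": 1, "model": "model-b", "score": 0.8850},
    {"rank": 2, "model": "model-a", "score": 0.8590},
]

report = {
    "weights": weights,                         # record weighting assumptions
    "tie_break": "score desc, then model name asc",
    "leaderboard": leaderboard,
    "promotion": {
        "candidate": leaderboard[0]["model"],
        # Hypothetical rule: recommend promotion only if the top candidate
        # beats the current champion's score (0.8700 is an assumed baseline).
        "recommended": leaderboard[0]["score"] > 0.8700,
    },
}

with open("benchmark_report.json", "w") as fh:
    json.dump(report, fh, indent=2)
```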
Guardrails
- Keep metric names and scales consistent across candidates (a minimal consistency check follows this list).
- Record weighting assumptions in the output.
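A minimal consistency check for the guardrails above, assuming the same metric set and 0-1 scale as the earlier sketch; the expected metric names and range are assumptions, not requirements stated by the skill.

```python
# Minimal guardrail check: every candidate must report the same metric names,
# and every value must fall inside the agreed scale. The expected metric set
# and the 0-1 range are assumptions carried over from the earlier sketch.

EXPECTED_METRICS = {"accuracy", "f1", "latency_score"}

def validate(candidates: dict[str, dict[str, float]]) -> None:
    for name, metrics in candidates.items():
        missing = EXPECTED_METRICS - metrics.keys()
        extra = metrics.keys() - EXPECTED_METRICS
        if missing or extra:
            raise ValueError(f"{name}: inconsistent metric names (missing={missing}, extra={extra})")
        out_of_range = {k: v for k, v in metrics.items() if not 0.0 <= v <= 1.0}
        if out_of_range:
            raise ValueError(f"{name}: values outside the 0-1 scale: {out_of_range}")

validate({"model-a": {"accuracy": 0.91, "f1": 0.88, "latency_score": 0.70}})
```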