Skilllibrary benchmark-design

Designs reproducible ML/LLM evaluation benchmarks including metric selection (BLEU, ROUGE, pass@k), evaluation protocols (few-shot, zero-shot, chain-of-thought), dataset splits with contamination prevention, and statistical significance testing. Use when building or auditing evaluation suites, leaderboards, or model comparison frameworks. Do not use for training pipelines or data curation.

install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/benchmark-design" ~/.claude/skills/merceralex397-collab-skilllibrary-benchmark-design && rm -rf "$T"
manifest: 12-ai-llm-training-architecture-and-research/benchmark-design/SKILL.md
source content

Purpose

Designs fair, discriminating, reproducible benchmarks for LLM and ML model evaluation. Covers evaluation protocol design, metric selection, dataset split integrity, statistical significance testing, and leaderboard construction using standard frameworks like lm-evaluation-harness.

When to use this skill

Use this skill when:

  • designing a new evaluation suite or benchmark for model comparison
  • selecting metrics (accuracy, F1, BLEU, ROUGE-L, pass@k, exact match) for a specific task type
  • defining evaluation protocols: few-shot prompting (k=0,1,5), chain-of-thought, or tool-use evals
  • auditing an existing benchmark for data contamination or statistical validity
  • building leaderboard infrastructure with reproducibility guarantees

Do not use this skill when

  • the task is about curating training data (use dataset-curation)
  • the task is about cleaning or labeling raw data (use data-cleaning-labeling)
  • the work is model training or architecture changes, not evaluation design
  • you need to optimize inference speed rather than measure model quality

Operating procedure

  1. Define the capability being measured. State what the benchmark tests (e.g., reasoning, code generation, factual recall). Map to existing benchmark patterns: MMLU for knowledge, HumanEval/MBPP for code, GSM8K for math, TruthfulQA for factuality.
  2. Select evaluation protocol. Choose few-shot count and prompt format. For reasoning tasks, decide between direct answering vs. chain-of-thought. Document the exact prompt template including system message and separator tokens.
  3. Choose metrics. Match metrics to task type:
    • Classification/QA: accuracy, macro-F1, exact match
    • Generation: BLEU, ROUGE-L, BERTScore
    • Code: pass@k (k=1,10,100) with temperature sampling, per the Chen et al. unbiased estimator (a sketch follows this procedure)
    • Open-ended: win rate via LLM-as-judge with position debiasing
  4. Design dataset splits. Create held-out test sets with contamination prevention: compute 13-gram overlap against Common Crawl and known training corpora. Insert canary strings (BENCHMARK_CANARY_GUID) to detect leakage. Ensure the test set has ≥500 samples for stable estimates.
  5. Implement with lm-evaluation-harness. Define tasks as YAML configs in lm_eval/tasks/. Use EleutherAI/lm-evaluation-harness for standardized few-shot evaluation. For custom tasks, subclass Task and implement doc_to_text, doc_to_target, and process_results.
  6. Compute statistical significance. Report bootstrap confidence intervals (n=1000 resamples) for all metrics. Use paired bootstrap tests or McNemar's test when comparing two models. Report effect sizes, not just p-values. (A resampling sketch also follows this procedure.)
  7. Document reproducibility. Record: model revision/commit, decoding params (temperature, top_p, max_tokens), prompt hash, library versions, and hardware. Publish a benchmark card with dataset size, license, and known limitations.
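
The pass@k figures in step 3 come from the Chen et al. unbiased estimator rather than the raw fraction of problems solved. A minimal sketch in Python, assuming n samples per problem were drawn at nonzero temperature and c of them passed the unit tests; the function name pass_at_k is illustrative:

  import numpy as np

  def pass_at_k(n: int, c: int, k: int) -> float:
      # Unbiased estimator from Chen et al.: 1 - C(n-c, k) / C(n, k),
      # i.e. the probability that at least one of k samples drawn without
      # replacement from the n generated samples is correct.
      if n - c < k:
          return 1.0
      return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

  # Benchmark-level pass@k is the mean over problems, e.g.:
  # scores = [pass_at_k(n=200, c=c_i, k=10) for c_i in per_problem_correct_counts]

As the decision rules below note, greedy pass@1 is not a substitute for this estimate when k > 1.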
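
The bootstrap intervals and paired tests in step 6 can both be built on resampling per-example scores. A minimal sketch, assuming each model yields one score per test example (0/1 correctness or any real-valued metric); the function names and fixed seed are illustrative:

  import numpy as np

  rng = np.random.default_rng(0)  # fixed seed so the resampling is reproducible

  def bootstrap_ci(scores, n_resamples=1000, alpha=0.05):
      # Percentile bootstrap CI for the mean of per-example scores.
      scores = np.asarray(scores, dtype=float)
      idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
      means = scores[idx].mean(axis=1)
      return scores.mean(), np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

  def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=1000):
      # Resample per-example score differences; the p-value is the fraction
      # of resamples in which model A fails to beat model B.
      diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
      idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
      return float(np.mean(diffs[idx].mean(axis=1) <= 0.0))

Report the mean difference (the effect size) alongside the p-value, per step 6.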

Decision rules

  • Use pass@k with unbiased estimator for code benchmarks; never report pass@1 from greedy decoding as pass@k.
  • Require ≥1000 bootstrap resamples for confidence intervals on benchmarks with <2000 samples.
  • Flag any benchmark where 13-gram overlap with known pretraining data exceeds 5% of test samples (see the overlap sketch after these rules).
  • For LLM-as-judge evaluations, always run both orderings (A/B and B/A) and average to remove position bias (see the judging sketch after these rules).
  • Prefer normalized metrics (accuracy, macro-F1) over raw counts for cross-dataset comparison.
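
The 13-gram threshold above reduces to a set-intersection check once the reference n-grams exist. A minimal sketch of the comparison logic only; in practice the pretraining side is far too large for an in-memory set, so hashed n-grams, a Bloom filter, or dedicated decontamination tooling would stand in for corpus_ngrams, and the names here are illustrative:

  def ngrams(tokens, n=13):
      # All contiguous n-grams of a tokenized document, as hashable tuples.
      return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

  def contamination_report(test_docs, corpus_ngrams, n=13, threshold=0.05):
      # test_docs: list of token lists, one per test example.
      # corpus_ngrams: set of n-gram tuples from known pretraining data.
      flagged = [i for i, toks in enumerate(test_docs) if ngrams(toks, n) & corpus_ngrams]
      rate = len(flagged) / max(len(test_docs), 1)
      return {"flagged": flagged, "rate": rate, "quarantine": rate > threshold}

When quarantine comes back true, report cleaned vs. full results as described under failure handling.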
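
The both-orderings rule for LLM-as-judge amounts to scoring each pair twice and averaging. A minimal sketch, assuming a hypothetical judge callable that wraps the judge-model call and returns 1.0 if the first-shown response wins, 0.0 if the second wins, and 0.5 for a tie:

  def debiased_win_rate(judge, pairs):
      # pairs: iterable of (prompt, response_a, response_b).
      scores = []
      for prompt, a, b in pairs:
          s_ab = judge(prompt, a, b)        # A shown in the first position
          s_ba = 1.0 - judge(prompt, b, a)  # B shown first; flipped back to A's perspective
          scores.append((s_ab + s_ba) / 2.0)
      return sum(scores) / len(scores)      # position-averaged win rate of A over B

Averaging the two orderings cancels any constant preference the judge has for the first or second position.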

Output requirements

  1. Benchmark specification — task definition, prompt template, metric formulas, and dataset split sizes
  2. Evaluation config — lm-evaluation-harness YAML or equivalent runner configuration
  3. Contamination report — n-gram overlap analysis between test set and known training corpora
  4. Results table — metric scores with 95% confidence intervals and sample counts
  5. Reproducibility card — model versions, decoding params, library versions, hardware

References

  • EleutherAI lm-evaluation-harness: github.com/EleutherAI/lm-evaluation-harness
  • OpenAI Evals framework: github.com/openai/evals
  • HELM (Holistic Evaluation of Language Models): crfm.stanford.edu/helm
  • Chen et al. "Evaluating Large Language Models Trained on Code" — pass@k estimator
  • MMLU, HumanEval, GSM8K, TruthfulQA benchmark papers

Related skills

  • eval-dataset-design — designing the data that benchmarks consume
  • dataset-curation — sourcing and filtering data for training or evaluation
  • safety-alignment — evaluating safety properties specifically
  • reward-modeling — using human preference data for evaluation signals

Failure handling

  • If the test set is too small for stable metrics (<200 samples), flag this and recommend increasing size or using bootstrapped CIs.
  • If contamination analysis reveals >5% overlap, quarantine affected samples and report cleaned vs. full results.
  • If no existing benchmark covers the target capability, design a minimal viable eval with ≥500 samples before committing to a full suite.
  • If inter-annotator agreement (Cohen's κ) on human-labeled gold data is <0.6, revise annotation guidelines before using the labels as ground truth (a computation sketch follows).
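
The κ gate in the last rule can be checked with scikit-learn before accepting human labels as gold. A minimal sketch, assuming exactly two annotators labeled the same items (for more annotators, Fleiss' κ or Krippendorff's α is the usual choice); the gate function name is illustrative:

  from sklearn.metrics import cohen_kappa_score

  def gate_gold_labels(labels_a, labels_b, threshold=0.6):
      # labels_a, labels_b: parallel label lists from two annotators.
      kappa = cohen_kappa_score(labels_a, labels_b)
      if kappa < threshold:
          raise ValueError(
              f"Cohen's kappa = {kappa:.2f} is below {threshold}; "
              "revise the annotation guidelines before using these labels."
          )
      return kappa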