Skilllibrary eval-dataset-design
Designs evaluation datasets for LLMs including task taxonomy, difficulty calibration, contamination prevention, scoring rubrics, and benchmark integration with lm-evaluation-harness and HELM. Use when creating, auditing, or improving eval sets for model assessment.
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/eval-dataset-design" ~/.claude/skills/merceralex397-collab-skilllibrary-eval-dataset-design && rm -rf "$T"
manifest:
12-ai-llm-training-architecture-and-research/eval-dataset-design/SKILL.md
Purpose
Design rigorous evaluation datasets that measure real model capabilities across task types (classification, generation, reasoning, code, math) with calibrated difficulty, contamination controls, and reproducible scoring.
When to use this skill
Use this skill when:
- designing a new eval benchmark or task suite for an LLM
- auditing an existing eval set for contamination, difficulty calibration, or coverage gaps
- creating tasks for lm-evaluation-harness or HELM-style evaluation pipelines
- building scoring rubrics for open-ended generation tasks
- selecting few-shot examples or designing prompt templates for consistent evaluation
Do not use this skill when
- the task is fine-tuning or training a model (use fine-tuning or instruction-tuning)
- the goal is inference speed optimization (use inference-kernel-optimization)
- you need to run existing benchmarks without modification (just use lm-eval-harness directly)
Operating procedure
- Define task taxonomy. Classify eval targets: extractive QA, multi-choice classification, free-form generation, code synthesis, mathematical reasoning, or multi-turn dialogue. Each type requires different scoring.
- Calibrate difficulty. Apply item response theory (IRT) concepts: estimate item difficulty and discrimination from pilot annotations. Establish human baselines — recruit 3–5 annotators and measure inter-annotator agreement (Krippendorff's α ≥ 0.7; see the agreement sketch after this list).
- Prevent contamination. Insert canary strings (unique UUIDs) in held-out splits. Enforce temporal cutoffs (training data before date X, eval data after). Deduplicate against known pretraining corpora using MinHash or exact n-gram matching (see the overlap-check sketch after this list).
- Design prompt templates. Use fixed prompt templates per task type. For few-shot, select examples via diversity sampling (maximize coverage of answer types). Keep k between 0–5 shots; always include a balanced label distribution (see the prompt sketch after this list).
- Build scoring rubrics. For open-ended generation: define 3–5 rubric dimensions (accuracy, relevance, fluency, completeness). Use Likert scales (1–5). For code: use pass@k with unit test suites (see the pass@k sketch after this list). For math: exact-match after normalization.
- Implement in lm-evaluation-harness. Create a task YAML: define dataset_path, doc_to_text, doc_to_target, and metric_list. Register under lm_eval/tasks/. For HELM: define scenario + adapter + metric triples.
- Validate the eval set. Run against 2+ models of known quality to verify score separation. Check ceiling (human) and floor (random baseline) performance.
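A minimal sketch of the agreement check from the calibration step, assuming pilot labels from three annotators and the third-party krippendorff package; the annotator counts, labels, and borderline threshold below are illustrative, not prescribed by this skill.

```python
# Check inter-annotator agreement on a pilot batch before scaling up annotation.
# Assumes: `pip install krippendorff numpy`; labels are nominal (categorical).
import numpy as np
import krippendorff

# One row per annotator, one column per item; np.nan marks items an annotator skipped.
pilot_ratings = np.array([
    [1, 0, 2, 1, 0, np.nan, 2, 1],
    [1, 0, 2, 0, 0, 1,      2, 1],
    [1, 1, 2, 1, 0, 1,      2, np.nan],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=pilot_ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.3f}")

if alpha < 0.6:
    print("Agreement too low: revise task definition or annotation guidelines.")
elif alpha < 0.7:
    print("Borderline: run a calibration round before collecting final annotations.")
```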
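For the contamination step, a self-contained sketch of the exact n-gram overlap check; the 13-gram window and the whitespace/punctuation normalization are illustrative choices, and MinHash (e.g. via a sketching library) is the usual substitute at corpus scale.

```python
# Flag eval examples whose n-grams also appear in a (sample of the) pretraining corpus.
import re
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    # Lowercase and split on word characters as a simple normalization.
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_corpus_index(corpus_docs: Iterable[str], n: int = 13) -> Set[Tuple[str, ...]]:
    index: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        index |= ngrams(doc, n)
    return index

def flag_contaminated(eval_examples: List[str],
                      corpus_index: Set[Tuple[str, ...]],
                      n: int = 13) -> List[int]:
    """Return indices of eval examples sharing at least one n-gram with the corpus."""
    return [i for i, ex in enumerate(eval_examples)
            if ngrams(ex, n) & corpus_index]

# Flagged examples should be quarantined and re-sampled from the held-out pool
# (see Failure handling below).
```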
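For the prompt-template step, a sketch of a fixed template with a label-balanced k-shot block. The template wording and field names are illustrative, and round-robin sampling over answer labels is used here as a simple stand-in for fuller diversity sampling.

```python
# Build a fixed k-shot prompt with a balanced label distribution across the shots.
import random
from collections import defaultdict
from typing import Dict, List

TEMPLATE = "Question: {question}\nAnswer: {answer}"

def balanced_fewshot(pool: List[Dict[str, str]], k: int, seed: int = 0) -> List[Dict[str, str]]:
    """Round-robin over answer labels so no single class dominates the k shots."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for ex in pool:
        by_label[ex["answer"]].append(ex)
    for exs in by_label.values():
        rng.shuffle(exs)
    shots, labels = [], list(by_label)
    while len(shots) < k and any(by_label[label] for label in labels):
        for label in labels:
            if by_label[label] and len(shots) < k:
                shots.append(by_label[label].pop())
    return shots

def render_prompt(shots: List[Dict[str, str]], query: str) -> str:
    blocks = [TEMPLATE.format(**s) for s in shots]
    blocks.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(blocks)
```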
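For the scoring-rubric step, a minimal sketch of the unbiased pass@k estimator commonly used for code evaluation: generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k draws passes. The sample counts below are illustrative.

```python
# Unbiased pass@k from n generated samples, c of which pass the unit tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n-c, k) / C(n, k), computed stably as a running product."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 37 pass the tests.
print(pass_at_k(n=200, c=37, k=1))   # equals c/n for k=1
print(pass_at_k(n=200, c=37, k=10))
```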
Decision rules
- Minimum 200 examples per eval task for statistical power; 500+ preferred for fine-grained analysis.
- Always include a random-baseline and human-ceiling measurement.
- If inter-annotator agreement < 0.6, revise task definition or annotation guidelines before proceeding.
- Prefer exact-match and pass@k over BLEU/ROUGE for tasks with deterministic answers.
- Use bootstrap confidence intervals (n=1000) when reporting aggregate scores.
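A minimal sketch of the percentile bootstrap from the last rule, assuming per-example 0/1 scores; the 95% confidence level and the synthetic scores below are illustrative assumptions, while n=1000 resamples follows the rule above.

```python
# Percentile bootstrap CI over per-example scores (n=1000 resamples).
import numpy as np

def bootstrap_ci(scores, n_boot: int = 1000, alpha: float = 0.05, seed: int = 0):
    """Return (mean, lower, upper) for the aggregate score via percentile bootstrap."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), lo, hi

# Example: report accuracy with a 95% CI. The same intervals help judge score
# separation between two reference models during eval-set validation.
per_example = np.random.default_rng(1).integers(0, 2, size=500)
print(bootstrap_ci(per_example))
```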
Output requirements
- Task specification — taxonomy, prompt template, and example format
- Dataset splits — train (optional few-shot source), validation, test with contamination controls
- Scoring rubric — metric definitions, human annotation guidelines, automated scorer code
- Baseline results — random baseline, human ceiling, and 1–2 reference model scores
- lm-eval-harness task YAML or HELM scenario config
References
- EleutherAI lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness
- HELM benchmark: https://crfm.stanford.edu/helm/
- Item Response Theory for NLP: Lalor et al., 2019
- Contamination detection: Dodge et al., "Documenting Large Webtext Corpora"
- Benchmark saturation: Kiela et al., "Dynabench" (NAACL 2021)
Related skills
benchmark-design · safety-alignment · fine-tuning · instruction-tuning
Failure handling
- If contamination is detected in the eval set, quarantine affected examples and re-sample from the held-out pool.
- If human agreement is too low, run a calibration round with revised guidelines before collecting final annotations.
- If score separation between known-good and known-bad models is < 5%, the eval task lacks discriminative power — redesign the task or increase difficulty.