skilllibrary · dataset-curation
Designs training dataset composition including domain mixing ratios (code, text, math, conversation), quality scoring (perplexity, classifier, reward-model), decontamination against eval sets, token budgeting, versioning, and dataset card documentation. Use when assembling or rebalancing a pretraining or fine-tuning corpus. Do not use for raw text cleaning or annotation workflows.
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/dataset-curation" ~/.claude/skills/merceralex397-collab-skilllibrary-dataset-curation && rm -rf "$T"
manifest:
12-ai-llm-training-architecture-and-research/dataset-curation/SKILL.md
Purpose
Assembles, balances, decontaminates, versions, and documents training datasets for LLM pretraining and fine-tuning. Covers domain mixing strategy, quality scoring pipelines, eval-set decontamination, token budget estimation, and HuggingFace dataset card authoring.
When to use this skill
Use this skill when:
- deciding domain mixing ratios for a pretraining corpus (e.g., 50% web, 20% code, 15% books, 10% math, 5% conversation)
- scoring document quality using perplexity models, classifier-based scoring, or reward-model-based filtering
- decontaminating training data against evaluation benchmarks (MMLU, HumanEval, GSM8K) via n-gram overlap
- estimating total token counts and compute budgets for a training run
- versioning datasets and writing dataset cards for reproducibility
- building data pipelines with HuggingFace datasets, datatrove, or RedPajama-style tooling
Do not use this skill when
- the task is raw text normalization, dedup, or PII removal (use data-cleaning-labeling)
- the task is generating synthetic data (use synthetic-data-generation)
- the task is designing evaluation benchmarks (use benchmark-design)
- the task is preference data collection for RLHF (use preference-optimization)
Operating procedure
- Inventory available sources. Catalog each data source with: name, domain (web/code/books/math/conversation/scientific), estimated token count, license, and quality tier. Use datasets.load_dataset_builder(name).info to inspect metadata. Track in a manifest file or database table.
- Define domain mixing ratios. Set proportions based on target capabilities. Common starting points (Llama-style): web text 50%, code 20%, academic/books 15%, math 10%, conversation 5%. Use the DoReMi or data-mixing-law approach: train small proxy models with different ratios and measure downstream performance to optimize (a mixing sketch follows this list).
- Score document quality. Apply tiered scoring (a Tier 1 sketch follows this list):
  - Tier 1 (fast): heuristic filters — remove short docs (<50 tokens), high-repetition docs, and non-target-language docs
  - Tier 2 (medium): perplexity scoring with a KenLM model or small GPT-2; keep the bottom 80% by perplexity
  - Tier 3 (expensive): a classifier trained on curated high-quality examples, or reward-model scores
- Assign each document a quality score; use the score for weighted sampling during training.
- Decontaminate against eval sets. For every benchmark in the eval suite: (a) extract all test-set items, (b) compute 13-gram overlap between training documents and test items, (c) remove or flag any training document with ≥1 matching 13-gram from a test item. Insert canary strings in eval sets for ongoing detection. Use datatrove's decontamination module or a custom suffix-array implementation (an in-memory sketch follows this list).
- Estimate token budget. Count tokens per source using the target tokenizer (AutoTokenizer.from_pretrained). Compute total tokens needed using Chinchilla scaling: ~20 tokens per parameter for compute-optimal training. Track epoch count per domain — avoid >2 epochs on any single source (a budgeting sketch follows this list).
- Version and document. Tag each dataset version with a unique ID (git SHA or timestamp). Write a HuggingFace dataset card including: source descriptions, processing steps applied, mixing ratios, total tokens per domain, license information, known limitations, and a decontamination report. Store the manifest as dataset_card.yaml alongside the data.
- Build the pipeline. Use datatrove for large-scale processing: readers → filters → dedup → decontam → writers. For smaller corpora, use HuggingFace datasets.interleave_datasets() with probability weights matching the target ratios (see the mixing sketch after this list). Output in tokenized format (Arrow or binary) for efficient DataLoader consumption.
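A small-corpus mixing sketch using datasets.interleave_datasets() with the example Llama-style ratios from step 2. The Dataset objects here are toy stand-ins for real loaded sources:

```python
from datasets import Dataset, interleave_datasets

domains = {  # toy stand-ins for real loaded sources
    "web":   Dataset.from_dict({"text": ["web doc"] * 100}),
    "code":  Dataset.from_dict({"text": ["code doc"] * 100}),
    "books": Dataset.from_dict({"text": ["book doc"] * 100}),
    "math":  Dataset.from_dict({"text": ["math doc"] * 100}),
    "chat":  Dataset.from_dict({"text": ["chat doc"] * 100}),
}
mixed = interleave_datasets(
    list(domains.values()),
    probabilities=[0.50, 0.20, 0.15, 0.10, 0.05],  # must match manifest weights
    seed=42,
    stopping_strategy="all_exhausted",  # oversample small sources; cap epochs separately
)
```

With "all_exhausted", small domains repeat until the largest is consumed, so the per-domain epoch caps from the decision rules must be enforced separately.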
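A minimal sketch of the Tier 1 heuristic pass, assuming whitespace tokenization is acceptable for the fast tier. The 50-token and repetition thresholds come from the tiers above; the helper names and toy documents are illustrative:

```python
from collections import Counter

def repetition_ratio(tokens: list[str]) -> float:
    """Fraction of all tokens taken up by the single most frequent token."""
    if not tokens:
        return 1.0
    return Counter(tokens).most_common(1)[0][1] / len(tokens)

def passes_tier1(text: str, min_tokens: int = 50, max_rep: float = 0.3) -> bool:
    # Cheap whitespace split is enough for the fast tier; Tier 2 re-scores
    # survivors with a KenLM or small-GPT-2 perplexity model.
    tokens = text.split()
    return len(tokens) >= min_tokens and repetition_ratio(tokens) <= max_rep

docs = ["too short", "spam " * 100, " ".join(f"word{i}" for i in range(80))]
kept = [d for d in docs if passes_tier1(d)]  # keeps only the varied 80-token doc
```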
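A minimal in-memory sketch of the 13-gram check in steps (a) through (c). Real corpora need datatrove or a suffix array; the test items and training documents below are placeholders:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_test_index(test_items: list[str], n: int = 13) -> set[tuple[str, ...]]:
    index: set[tuple[str, ...]] = set()
    for item in test_items:  # step (a): every test-set item, every benchmark
        index |= ngrams(item, n)
    return index

def is_contaminated(doc: str, index: set[tuple[str, ...]], n: int = 13) -> bool:
    # Steps (b)+(c): any single shared 13-gram flags the document.
    return not ngrams(doc, n).isdisjoint(index)

test_items = ["placeholder benchmark question text goes here " * 3]
train_docs = ["an ordinary training document with its own wording " * 10]
index = build_test_index(test_items)
clean = [d for d in train_docs if not is_contaminated(d, index)]
```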
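A budgeting sketch under the assumption that a small sample per source is representative; in practice you would stream the full corpus. The gpt2 tokenizer and the 7B-parameter target are stand-ins for the actual run:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the run's tokenizer

samples = {  # in practice stream the full corpus, not a handful of docs
    "web": ["an example web document about anything"],
    "code": ["def add(a, b):\n    return a + b\n"],
}
tokens_per_domain = {
    domain: sum(len(tok(doc)["input_ids"]) for doc in docs)
    for domain, docs in samples.items()
}
total = sum(tokens_per_domain.values())

n_params = 7e9                      # assumed target model size
chinchilla_budget = 20 * n_params   # ~20 tokens per parameter
print(tokens_per_domain)
print(f"have {total:,} tokens, compute-optimal target ~{chinchilla_budget:.1e}")
# Epochs per domain = (budget * mixing weight) / tokens available in that
# domain; flag any domain that would exceed 2 epochs.
```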
Decision rules
- Never exceed 2 epochs on any single data source; diminishing returns and memorization risk increase sharply.
- Always decontaminate against every eval benchmark before training; post-hoc contamination analysis is insufficient.
- Prefer quality over quantity: a 500B-token high-quality corpus outperforms a 1T-token unfiltered corpus.
- When mixing ratios are uncertain, run ablations on a small model (1B params) for 10B tokens to compare.
- Code data should preserve file structure and imports; avoid random chunking that breaks syntax.
Output requirements
- Data manifest — table of sources with domain, token count, license, quality tier, and mixing weight (a YAML sketch follows this list)
- Mixing config — domain ratios and sampling probabilities, with justification
- Decontamination report — overlap counts per eval benchmark, documents removed, method used
- Token budget — total tokens, tokens per domain, estimated epochs, compute estimate
- Dataset card — HuggingFace-format card with provenance, processing steps, and limitations
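A sketch of the manifest written alongside the data as dataset_card.yaml (the filename from the procedure above). The column set matches the requirements; every value and both example sources are invented for illustration:

```python
import yaml  # PyYAML

manifest = {
    "version": "2025-01-15-3f2a9c1",  # git SHA or timestamp tag
    "sources": [
        {"name": "example-web-crawl", "domain": "web",
         "tokens": 500_000_000_000, "license": "per-source audit pending",
         "quality_tier": 2, "mixing_weight": 0.50},
        {"name": "example-code-dump", "domain": "code",
         "tokens": 200_000_000_000, "license": "permissive-only subset",
         "quality_tier": 1, "mixing_weight": 0.20},
    ],
}
with open("dataset_card.yaml", "w") as f:
    yaml.safe_dump(manifest, f, sort_keys=False)
```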
References
- HuggingFace datasets library: huggingface.co/docs/datasets
- datatrove pipeline: github.com/huggingface/datatrove
- RedPajama-Data: github.com/togethercomputer/RedPajama-Data
- Dolma toolkit: github.com/allenai/dolma
- Chinchilla scaling laws: Hoffmann et al., "Training Compute-Optimal Large Language Models"
- DoReMi data mixing: Xie et al., "DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining"
Related skills
- data-cleaning-labeling — raw text cleaning and annotation before curation
- synthetic-data-generation — generating additional data for underrepresented domains
- benchmark-design — designing the eval sets that curation must decontaminate against
- preference-optimization — curating preference pairs for alignment training
Failure handling
- If a data source has unclear licensing, quarantine it and flag for legal review; do not include in the default mix.
- If decontamination removes >10% of a domain's data, investigate whether the source was derived from eval benchmarks.
- If token counts diverge >20% from estimates after tokenization, re-audit the manifest for counting errors.
- If ablation experiments show a domain is hurting performance, reduce its weight to ≤2% before removing entirely.