Skilllibrary dataset-curation

Designs training dataset composition including domain mixing ratios (code, text, math, conversation), quality scoring (perplexity, classifier, reward-model), decontamination against eval sets, token budgeting, versioning, and dataset card documentation. Use when assembling or rebalancing a pretraining or fine-tuning corpus. Do not use for raw text cleaning or annotation workflows.

install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/dataset-curation" ~/.claude/skills/merceralex397-collab-skilllibrary-dataset-curation && rm -rf "$T"
manifest: 12-ai-llm-training-architecture-and-research/dataset-curation/SKILL.md
source content

Purpose

Assembles, balances, decontaminates, versions, and documents training datasets for LLM pretraining and fine-tuning. Covers domain mixing strategy, quality scoring pipelines, eval-set decontamination, token budget estimation, and HuggingFace dataset card authoring.

When to use this skill

Use this skill when:

  • deciding domain mixing ratios for a pretraining corpus (e.g., 50% web, 20% code, 15% books, 10% math, 5% conversation)
  • scoring document quality using perplexity models, classifier-based scoring, or reward-model-based filtering
  • decontaminating training data against evaluation benchmarks (MMLU, HumanEval, GSM8K) via n-gram overlap
  • estimating total token counts and compute budgets for a training run
  • versioning datasets and writing dataset cards for reproducibility
  • building data pipelines with HuggingFace datasets, datatrove, or RedPajama-style tooling

Do not use this skill when

  • the task is raw text normalization, dedup, or PII removal (use data-cleaning-labeling)
  • the task is generating synthetic data (use synthetic-data-generation)
  • the task is designing evaluation benchmarks (use benchmark-design)
  • the task is preference data collection for RLHF (use preference-optimization)

Operating procedure

  1. Inventory available sources. Catalog each data source with: name, domain (web/code/books/math/conversation/scientific), estimated token count, license, and quality tier. Use datasets.load_dataset_builder(name).info to inspect metadata without downloading the data (a sketch follows this procedure). Track sources in a manifest file or database table.
  2. Define domain mixing ratios. Set proportions based on target capabilities. Common starting points (Llama-style): web text 50%, code 20%, academic/books 15%, math 10%, conversation 5%. To optimize ratios, use the DoReMi or data-mixing-laws approach: train small proxy models on different mixtures and measure downstream performance.
  3. Score document quality. Apply tiered scoring (a perplexity-scoring sketch follows this procedure):
    • Tier 1 (fast): heuristic filters — remove short docs (<50 tokens), high-repetition docs, and non-target-language text
    • Tier 2 (medium): perplexity scoring with a KenLM model or a small GPT-2; keep the lowest-perplexity 80% of documents
    • Tier 3 (expensive): a classifier trained on curated high-quality examples, or reward-model scores
    • Assign each document a quality score and use it for weighted sampling during training.
  4. Decontaminate against eval sets. For every benchmark in the eval suite: (a) extract all test-set items, (b) compute 13-gram overlap between training documents and test items, (c) remove or flag any training document that shares at least one 13-gram with a test item. Insert canary strings in eval sets for ongoing detection. Use datatrove's decontamination module or a custom suffix-array implementation (an n-gram sketch follows this procedure).
  5. Estimate token budget. Count tokens per source using the target tokenizer (AutoTokenizer.from_pretrained). Compute total tokens needed with Chinchilla scaling: roughly 20 tokens per parameter for compute-optimal training. Track epoch count per domain and avoid more than 2 epochs on any single source (a budgeting sketch follows this procedure).
  6. Version and document. Tag each dataset version with a unique ID (git SHA or timestamp). Write a HuggingFace dataset card including: source descriptions, processing steps applied, mixing ratios, total tokens per domain, license information, known limitations, and the decontamination report. Store the manifest as dataset_card.yaml alongside the data (an example manifest appears after the output requirements below).
  7. Build the pipeline. Use datatrove for large-scale processing: readers → filters → dedup → decontam → writers. For smaller corpora, use HuggingFace datasets.interleave_datasets() with probability weights matching the target ratios (see the interleaving sketch after this procedure). Output in tokenized format (Arrow or binary) for efficient DataLoader consumption.
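
Code sketches

The Python sketches below illustrate the steps above; every dataset ID, org prefix, model size, and numeric value in them is a placeholder, not part of this skill's pinned tooling.

Step 1 inspects source metadata without downloading the data. A minimal sketch using the HuggingFace datasets API, with a public dataset standing in for a real source:

    from datasets import load_dataset_builder

    # Inspect description, license, splits, and sizes without downloading the data.
    builder = load_dataset_builder("wikitext", "wikitext-103-raw-v1")  # placeholder source
    info = builder.info
    print(info.license, info.dataset_size, info.download_size)
    print({name: split.num_examples for name, split in info.splits.items()})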
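
Step 3's Tier 2 filter can be sketched with a small GPT-2 from transformers; at corpus scale a KenLM model is far cheaper, and the 80% keep-threshold is a starting point rather than a fixed rule:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def perplexity(text: str, max_length: int = 1024) -> float:
        # Causal-LM loss over the (truncated) document, exponentiated to a perplexity score.
        enc = tok(text, return_tensors="pt", truncation=True, max_length=max_length)
        with torch.no_grad():
            loss = model(**enc, labels=enc["input_ids"]).loss
        return float(torch.exp(loss))

    # Tier 2: keep the lowest-perplexity 80% of documents.
    docs = ["a fluent, well-formed paragraph ...", "zx qq 11 zx qq 11 zx qq 11"]  # placeholder documents
    scored = sorted(docs, key=perplexity)
    kept = scored[: int(0.8 * len(scored))]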
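
Step 4's 13-gram overlap check, sketched against GSM8K as one example benchmark. A production pipeline would use datatrove's decontamination blocks or suffix arrays and normalize text identically on both sides:

    from datasets import load_dataset

    N = 13

    def ngrams(tokens, n=N):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    # Collect every 13-gram that appears in the benchmark's test split.
    gsm8k = load_dataset("gsm8k", "main", split="test")
    eval_ngrams = set()
    for item in gsm8k:
        eval_ngrams |= ngrams((item["question"] + " " + item["answer"]).lower().split())

    def is_contaminated(doc_text: str) -> bool:
        # Flag a training document if it shares at least one 13-gram with any test item.
        return not eval_ngrams.isdisjoint(ngrams(doc_text.lower().split()))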
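
Step 5's budgeting arithmetic, with a placeholder tokenizer, model size, and per-domain token counts (count your own with the run's target tokenizer, sampling large sources rather than tokenizing everything):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder; use the run's target tokenizer

    def count_tokens(texts) -> int:
        # Tokenize a (sampled) batch of documents and sum sequence lengths.
        return sum(len(ids) for ids in tok(list(texts))["input_ids"])

    n_params = 7e9                    # target model size (placeholder)
    total_budget = 20 * n_params      # Chinchilla: ~20 tokens per parameter

    mix = {"web": 0.50, "code": 0.20, "books": 0.15, "math": 0.10, "conversation": 0.05}
    available = {"web": 500e9, "code": 150e9, "books": 80e9, "math": 30e9, "conversation": 20e9}  # placeholders

    for domain, ratio in mix.items():
        needed = ratio * total_budget
        epochs = needed / available[domain]
        flag = "  <-- exceeds 2-epoch cap" if epochs > 2 else ""
        print(f"{domain}: {needed / 1e9:.0f}B tokens, {epochs:.2f} epochs{flag}")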
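
Step 7's small-corpus path, sketched with datasets.interleave_datasets over streaming sources; the dataset IDs are hypothetical stand-ins for the curated per-domain shards produced by the earlier steps:

    from datasets import load_dataset, interleave_datasets

    # Hypothetical per-domain shards produced by filtering and decontamination.
    web  = load_dataset("my-org/web-filtered",  split="train", streaming=True)
    code = load_dataset("my-org/code-filtered", split="train", streaming=True)
    math = load_dataset("my-org/math-filtered", split="train", streaming=True)

    mixed = interleave_datasets(
        [web, code, math],
        probabilities=[0.625, 0.25, 0.125],   # target ratios renormalized over these three domains
        seed=42,
        stopping_strategy="first_exhausted",  # "all_exhausted" re-iterates small sources and can breach the 2-epoch cap
    )

Tokenize the mixed stream and write it to Arrow or a binary format before training so the DataLoader does not pay tokenization cost at every step.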

Decision rules

  • Never exceed 2 epochs on any single data source; diminishing returns and memorization risk increase sharply.
  • Always decontaminate against every eval benchmark before training; post-hoc contamination analysis is insufficient.
  • Prefer quality over quantity: a 500B-token high-quality corpus outperforms a 1T-token unfiltered corpus.
  • When mixing ratios are uncertain, run ablations on a small model (1B params) for 10B tokens to compare.
  • Code data should preserve file structure and imports; avoid random chunking that breaks syntax.

Output requirements

  1. Data manifest — table of sources with domain, token count, license, quality tier, and mixing weight
  2. Mixing config — domain ratios and sampling probabilities, with justification
  3. Decontamination report — overlap counts per eval benchmark, documents removed, method used
  4. Token budget — total tokens, tokens per domain, estimated epochs, compute estimate
  5. Dataset card — HuggingFace-format card with provenance, processing steps, and limitations
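
Step 6 and output requirement 1 call for a versioned manifest stored next to the data. A minimal sketch writing dataset_card.yaml with PyYAML; every field value here is a placeholder:

    import yaml  # PyYAML

    manifest = {
        "dataset_version": "2025-01-15-a3f9c2e",  # timestamp plus git SHA (placeholder)
        "sources": [
            {"name": "my-org/web-filtered", "domain": "web", "tokens": 70_000_000_000,
             "license": "CC-BY-SA-4.0", "quality_tier": 2, "mixing_weight": 0.50},
        ],
        "decontamination": {"method": "13-gram overlap",
                            "benchmarks": ["MMLU", "HumanEval", "GSM8K"],
                            "documents_removed": None},  # fill in from the decontamination report
        "limitations": "placeholder; document known gaps and biases here",
    }

    with open("dataset_card.yaml", "w") as f:
        yaml.safe_dump(manifest, f, sort_keys=False)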

References

  • HuggingFace datasets library: huggingface.co/docs/datasets
  • datatrove pipeline: github.com/huggingface/datatrove
  • RedPajama-Data: github.com/togethercomputer/RedPajama-Data
  • Dolma toolkit: github.com/allenai/dolma
  • Chinchilla scaling laws: Hoffmann et al., "Training Compute-Optimal Large Language Models"
  • DoReMi data mixing: Xie et al., "DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining"

Related skills

  • data-cleaning-labeling — raw text cleaning and annotation before curation
  • synthetic-data-generation — generating additional data for underrepresented domains
  • benchmark-design — designing the eval sets that curation must decontaminate against
  • preference-optimization — curating preference pairs for alignment training

Failure handling

  • If a data source has unclear licensing, quarantine it and flag for legal review; do not include in the default mix.
  • If decontamination removes >10% of a domain's data, investigate whether the source was derived from eval benchmarks.
  • If token counts diverge >20% from estimates after tokenization, re-audit the manifest for counting errors.
  • If ablation experiments show a domain is hurting performance, reduce its weight to ≤2% before removing entirely.