Skilllibrary data-cleaning-labeling
Cleans raw text data (Unicode normalization, deduplication via MinHash/SimHash, perplexity filtering, language ID) and manages annotation pipelines (inter-annotator agreement, gold sets, calibration). Use when preparing training or evaluation data that needs dedup, quality filtering, PII removal, or human labeling workflows. Do not use for dataset mixing/curation strategy or synthetic data generation.
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/12-ai-llm-training-architecture-and-research/data-cleaning-labeling" ~/.claude/skills/merceralex397-collab-skilllibrary-data-cleaning-labeling && rm -rf "$T"
manifest:
12-ai-llm-training-architecture-and-research/data-cleaning-labeling/SKILL.md
Purpose
Cleans and normalizes raw text corpora for ML training, implements deduplication pipelines, applies quality filters, removes PII, and designs annotation workflows with measurable inter-annotator reliability. Covers the full path from raw scraped data to labeled, training-ready datasets.
When to use this skill
Use this skill when:
- normalizing text data: fixing Unicode (NFKC), collapsing whitespace, stripping HTML/boilerplate, fixing encoding issues
- deduplicating a corpus using MinHash (datasketch), SimHash, exact hash, or n-gram overlap methods
- building quality filters using perplexity scoring (KenLM), heuristic rules, or classifier-based filtering
- performing language identification with fasttext lid.176.bin or similar models
- designing annotation guidelines, measuring inter-annotator agreement (Cohen's κ, Fleiss' κ, Krippendorff's α)
- removing PII (emails, phone numbers, SSNs, names) using regex patterns or NER-based detection
Do not use this skill when
- the task is about dataset mixing ratios or domain balancing (use dataset-curation)
- the task is generating synthetic training data (use synthetic-data-generation)
- you need to design evaluation benchmarks (use benchmark-design)
- the work is standard ETL with no ML training or annotation component
Operating procedure
- Profile the raw data. Sample 10K documents and compute: character encoding distribution, language distribution (via fasttext), average document length, duplicate rate (exact MD5), and Unicode category breakdown. Use datasets.load_dataset(..., streaming=True) for large corpora (profiling sketch after this list).
- Normalize text. Apply in order: (a) Unicode NFKC normalization, (b) fix mojibake via ftfy.fix_text(), (c) strip HTML with trafilatura or beautifulsoup4, (d) collapse whitespace and normalize line endings, (e) remove zero-width characters and control chars except \n and \t (normalization sketch below).
- Deduplicate. Choose method by corpus size (MinHash sketch below):
  - <1M docs: exact dedup via SHA-256 hash of normalized text
  - 1M–100M docs: MinHash LSH with datasketch.MinHashLSH(threshold=0.8, num_perm=128)
  - Substring-level: use suffix arrays or SimHash for near-duplicate paragraph detection
  - Log dedup rate; typical web crawls lose 30–60% of documents.
- Filter for quality. Apply cascading filters (heuristic-filter sketch below):
  - Language ID: keep only target languages (fasttext confidence >0.5)
  - Perplexity: score with a KenLM model trained on known-good data; remove the top 5% highest-perplexity documents
  - Heuristic filters: remove docs with >80% punctuation, <50 chars, >30% uppercase, or excessive repetition (a line repeated >3 times)
  - Classifier: optionally train a fasttext quality classifier on curated positive/negative examples
- Remove PII. Apply regex for emails, phone numbers, SSNs, IP addresses. Use a NER model (spaCy en_core_web_lg or Presidio) for names/addresses. Replace detected PII with type-specific tokens: [EMAIL], [PHONE], [NAME] (PII sketch below).
- Design annotation pipeline. Write annotation guidelines with ≥5 worked examples per label. Compute inter-annotator agreement on a pilot batch (≥200 samples, ≥3 annotators). Target Cohen's κ ≥ 0.7 for binary tasks, Fleiss' κ ≥ 0.6 for multi-class (agreement and gold-set sketch below).
- Implement label QA. Embed 5–10% gold-standard items (pre-labeled by experts) in each annotation batch. Flag annotators whose gold accuracy drops below 85%. Run calibration sessions when κ drops below threshold.
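The sketches below work through the steps above; each is a minimal illustration under stated assumptions, not a drop-in implementation. For step 1, profiling a streamed sample, assuming documents carry a "text" field, a local lid.176.bin model, and a placeholder dataset name:

```python
# Step 1 sketch: profile a 10K-document sample from a streamed corpus.
# Assumptions: docs carry a "text" field; "my_raw_corpus" is a placeholder name.
import hashlib
from collections import Counter
from itertools import islice

import fasttext
from datasets import load_dataset

lid = fasttext.load_model("lid.176.bin")  # fasttext language-ID model
stream = load_dataset("my_raw_corpus", split="train", streaming=True)

langs, lengths, seen = Counter(), [], set()
exact_dupes = 0
for doc in islice(stream, 10_000):
    text = doc["text"]
    labels, _ = lid.predict(text.replace("\n", " "))  # predict() rejects newlines
    langs[labels[0].removeprefix("__label__")] += 1
    lengths.append(len(text))
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    if digest in seen:
        exact_dupes += 1
    seen.add(digest)

print("top languages:", langs.most_common(5))
print(f"avg length: {sum(lengths) / len(lengths):.0f} chars")
print(f"exact-duplicate rate: {exact_dupes / len(lengths):.1%}")
```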
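For step 2, the normalization chain in the stated (a)–(e) order; HTML stripping is assumed to happen upstream via trafilatura or beautifulsoup4:

```python
# Step 2 sketch: Unicode and whitespace normalization in the order above.
import re
import unicodedata

import ftfy

ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")

def normalize_text(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # (a) Unicode NFKC
    text = ftfy.fix_text(text)                  # (b) repair mojibake
    # (c) HTML stripping assumed done upstream (trafilatura / beautifulsoup4)
    text = re.sub(r"\r\n?", "\n", text)         # (d) normalize line endings
    text = re.sub(r"[ \t]+", " ", text)         # (d) collapse horizontal whitespace
    text = ZERO_WIDTH.sub("", text)             # (e) strip zero-width characters
    # (e) drop remaining control/format chars, keeping \n and \t
    return "".join(
        c for c in text if c in "\n\t" or unicodedata.category(c)[0] != "C"
    ).strip()
```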
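For step 3's middle tier, near-duplicate removal with datasketch MinHash LSH; the word 5-gram shingling and the toy corpus are assumptions to tune per corpus:

```python
# Step 3 sketch: drop near-duplicates with MinHash LSH (Jaccard threshold 0.8).
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.split()
    for i in range(max(len(words) - 4, 1)):  # word 5-gram shingles (assumed choice)
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

corpus = [  # toy stand-in for an iterator over (doc_id, normalized_text)
    ("a", "the quick brown fox jumps over the lazy dog"),
    ("b", "the quick brown fox jumps over the lazy dog again"),
    ("c", "an entirely different document about minhash dedup"),
]

lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for doc_id, text in corpus:
    m = minhash_of(text)
    if lsh.query(m):       # an already-kept near-duplicate exists
        continue           # drop the later copy
    lsh.insert(doc_id, m)
    kept.append(doc_id)

print(f"kept {kept}; dedup rate {1 - len(kept) / len(corpus):.0%}")
```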
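For step 4, the heuristic tier as one predicate; the KenLM perplexity filter is omitted here and would run between language ID and these rules:

```python
# Step 4 sketch: heuristic quality gate using the thresholds listed above.
from collections import Counter

def passes_heuristics(text: str) -> bool:
    if len(text) < 50:                                    # too short
        return False
    punct = sum((not c.isalnum()) and (not c.isspace()) for c in text)
    if punct / len(text) > 0.80:                          # >80% punctuation
        return False
    letters = [c for c in text if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.30:
        return False                                      # >30% uppercase
    lines = Counter(l.strip() for l in text.splitlines() if l.strip())
    if lines and lines.most_common(1)[0][1] > 3:          # a line repeated >3 times
        return False
    return True

assert passes_heuristics("A normal paragraph of prose long enough to clear the length gate.")
assert not passes_heuristics("!!! ??? ... !!!")
```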
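For step 5, regex passes for structured PII plus spaCy NER for names; the patterns are simplified illustrations, so audit recall before release (see Failure handling):

```python
# Step 5 sketch: scrub structured PII by regex, names by NER.
import re

import spacy

PATTERNS = {  # illustrative, not production-grade recall
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "[PHONE]": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[IP]": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}
nlp = spacy.load("en_core_web_lg")  # or hand off to Presidio instead

def scrub_pii(text: str) -> str:
    for token, pattern in PATTERNS.items():
        text = pattern.sub(token, text)
    # Replace entities right-to-left so earlier character offsets stay valid.
    for ent in reversed(nlp(text).ents):
        if ent.label_ == "PERSON":
            text = text[: ent.start_char] + "[NAME]" + text[ent.end_char :]
    return text

print(scrub_pii("Reach Jane Doe at jane@example.com or 555-867-5309."))
```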
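For steps 6 and 7, pairwise Cohen's κ on the pilot batch and per-annotator gold accuracy, using scikit-learn and toy label arrays:

```python
# Steps 6-7 sketch: inter-annotator agreement and gold-set QA (toy data).
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

pilot = {  # annotator -> labels on the same pilot items (>=200 in practice)
    "ann_a": [1, 0, 1, 1, 0, 1, 0, 0],
    "ann_b": [1, 0, 1, 0, 0, 1, 0, 1],
    "ann_c": [1, 0, 1, 1, 0, 1, 1, 0],
}
for (a, la), (b, lb) in combinations(pilot.items(), 2):
    print(f"{a} vs {b}: kappa = {cohen_kappa_score(la, lb):.2f}")  # target >=0.7 binary

gold = {"q1": 1, "q2": 0, "q3": 1}  # expert-labeled items embedded in each batch
answers = {"ann_a": {"q1": 1, "q2": 0, "q3": 1}, "ann_b": {"q1": 1, "q2": 1, "q3": 0}}
for ann, resp in answers.items():
    acc = sum(resp[q] == y for q, y in gold.items()) / len(gold)
    if acc < 0.85:  # flag threshold from step 7
        print(f"flag {ann}: gold accuracy {acc:.0%}")
```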
Decision rules
- Always deduplicate before quality filtering to avoid wasting compute on duplicates.
- Use MinHash with ≥128 permutations for corpora >1M documents; exact hash is insufficient for near-duplicates.
- Require Cohen's κ ≥ 0.7 before using human labels for training; retrain annotators if below 0.6.
- PII removal must run before any data leaves the secure processing environment.
- Log every filtering step with sample counts: raw → deduped → filtered → labeled, to track data loss (funnel sketch below).
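A trivial sketch of the funnel log named in the last rule; the counts are invented placeholders:

```python
# Sketch: report the raw -> deduped -> filtered -> labeled funnel with loss per stage.
funnel = {"raw": 1_000_000, "deduped": 520_000, "filtered": 310_000, "labeled": 45_000}
prev = None
for stage, count in funnel.items():
    loss = f" ({1 - count / prev:.0%} lost)" if prev else ""
    print(f"{stage}: {count:,}{loss}")
    prev = count
```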
Output requirements
- Data profile report — encoding stats, language distribution, length distribution, duplicate rate
- Cleaning pipeline config — ordered list of normalization and filter steps with thresholds
- Dedup report — method used, threshold, number of clusters, documents removed
- Annotation guidelines — label definitions, examples, edge cases, and decision tree
- Quality metrics — inter-annotator κ scores, gold-set accuracy per annotator, final label distribution
References
- datasketch library for MinHash/LSH: github.com/ekzhu/datasketch
- fasttext language identification: fasttext.cc/docs/en/language-identification.html
- ftfy for Unicode fixes: github.com/rspeer/python-ftfy
- trafilatura for web text extraction: github.com/adbar/trafilatura
- Microsoft Presidio for PII detection: github.com/microsoft/presidio
- HuggingFace datasets library for data loading and processing
Related skills
- dataset-curation — mixing and balancing cleaned data into training sets
- synthetic-data-generation — generating additional labeled data
- benchmark-design — designing evaluation sets from cleaned data
- llm-creation — end-to-end pretraining that consumes cleaned data
Failure handling
- If dedup rate exceeds 70%, investigate the data source for crawler traps or template pages before proceeding.
- If inter-annotator κ is below 0.5, halt labeling and revise guidelines; do not train on unreliable labels.
- If PII detection recall is uncertain, run a manual audit on 500 random samples before releasing the dataset.
- If quality filtering removes >50% of data, lower thresholds incrementally and inspect borderline samples rather than accepting a severely reduced corpus.