Claude-skill-registry funsloth-check
Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/funsloth-check" ~/.claude/skills/majiayu000-claude-skill-registry-funsloth-check && rm -rf "$T"
manifest:
skills/data/funsloth-check/SKILL.mdsource content
Dataset Validation for Unsloth Fine-tuning
Validate datasets before fine-tuning with Unsloth.
Quick Start
For automated validation, use the script:
python scripts/validate_dataset.py --dataset "dataset-id" --model llama-3.1-8b --lora-rank 16
Workflow
1. Get Dataset Source
Ask for: HF dataset ID (e.g.,
mlabonne/FineTome-100k) or local path (e.g., ./data.jsonl)
2. Load and Detect Format
Auto-detect format from structure. See DATA_FORMATS.md for details.
| Format | Detection | Key Fields |
|---|---|---|
| Raw | only | |
| Alpaca | + | , |
| ShareGPT | array | , |
| ChatML | array | , |
3. Validate Schema
Check required fields exist. Report issues with fix suggestions.
4. Show Samples
Display 2-3 examples for visual verification.
5. Token Analysis
Report statistics: total tokens, min/max/mean/median sequence length.
Flag concerns:
- Sequences > 4096 tokens
- Sequences < 10 tokens
6. Chinchilla Analysis
Ask for target model and LoRA rank, then calculate:
| Chinchilla Fraction | Interpretation |
|---|---|
| < 0.5x | Dataset may be too small |
| 0.5x - 2.0x | Good range |
| > 2.0x | Large dataset, may take longer |
7. Recommendations
Based on analysis, suggest:
for ShareGPT datastandardize_sharegpt()- Sequence length adjustments
- Learning rate for small datasets
8. Optional: HF Upload
Offer to upload local datasets to Hub.
9. Handoff
Pass context to
funsloth-train:
dataset_id: "mlabonne/FineTome-100k" format_type: "sharegpt" total_tokens: 15000000 target_model: "llama-3.1-8b" use_lora: true lora_rank: 16 chinchilla_fraction: 1.2
Bundled Resources
- scripts/validate_dataset.py - Automated validation script
- DATA_FORMATS.md - Dataset format reference