Claude-skill-registry funsloth-check

Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/funsloth-check" ~/.claude/skills/majiayu000-claude-skill-registry-funsloth-check && rm -rf "$T"
manifest: skills/data/funsloth-check/SKILL.md
source content

Dataset Validation for Unsloth Fine-tuning

Validate datasets before fine-tuning with Unsloth.

Quick Start

For automated validation, use the script:

python scripts/validate_dataset.py --dataset "dataset-id" --model llama-3.1-8b --lora-rank 16

Workflow

1. Get Dataset Source

Ask for: HF dataset ID (e.g.,

mlabonne/FineTome-100k
) or local path (e.g.,
./data.jsonl
)

2. Load and Detect Format

Auto-detect format from structure. See DATA_FORMATS.md for details.

FormatDetectionKey Fields
Raw
text
only
text
Alpaca
instruction
+
output
instruction
,
output
ShareGPT
conversations
array
from
,
value
ChatML
messages
array
role
,
content

3. Validate Schema

Check required fields exist. Report issues with fix suggestions.

4. Show Samples

Display 2-3 examples for visual verification.

5. Token Analysis

Report statistics: total tokens, min/max/mean/median sequence length.

Flag concerns:

  • Sequences > 4096 tokens
  • Sequences < 10 tokens

6. Chinchilla Analysis

Ask for target model and LoRA rank, then calculate:

Chinchilla FractionInterpretation
< 0.5xDataset may be too small
0.5x - 2.0xGood range
> 2.0xLarge dataset, may take longer

7. Recommendations

Based on analysis, suggest:

  • standardize_sharegpt()
    for ShareGPT data
  • Sequence length adjustments
  • Learning rate for small datasets

8. Optional: HF Upload

Offer to upload local datasets to Hub.

9. Handoff

Pass context to

funsloth-train
:

dataset_id: "mlabonne/FineTome-100k"
format_type: "sharegpt"
total_tokens: 15000000
target_model: "llama-3.1-8b"
use_lora: true
lora_rank: 16
chinchilla_fraction: 1.2

Bundled Resources