claude-skill-registry / fine-tuning-data-generator
Generates comprehensive synthetic fine-tuning datasets in ChatML format (JSONL) for use with Unsloth, Axolotl, and similar training frameworks. Gathers requirements, creates datasets with diverse examples, validates quality, and provides framework integration guidance.
```bash
# Clone the full registry
git clone https://github.com/majiayu000/claude-skill-registry

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/fine-tuning-data-generator" ~/.claude/skills/majiayu000-claude-skill-registry-fine-tuning-data-generator && rm -rf "$T"
```
skills/data/fine-tuning-data-generator/SKILL.md
Fine-Tuning Data Generator
This skill generates high-quality synthetic training data in ChatML format for fine-tuning language models using frameworks like Unsloth, Axolotl, or similar tools.
What Do I Need?
| Need | Resource |
|---|---|
| Planning my dataset - requirements, strategy, quality checklist | `resources/dataset-strategy.md` |
| How to create diverse examples - variation techniques, multi-turn patterns, format-specific guidance | `resources/generation-techniques.md` |
| ChatML format details - structure, specification, common issues, framework compatibility | `resources/chatml-format.md` |
| Example datasets - inspiration across domains, multi-turn samples, edge cases | `resources/examples.md` |
| Validating quality - validation workflow, analyzing datasets, troubleshooting | `resources/quality-validation.md` |
| Training & deployment - framework setup, hyperparameters, optimization, deployment | `resources/framework-integration.md` |
Workflow
Phase 1: Gather Requirements
Start with these essential clarifying questions:
Task Definition:
- What is the model being trained to do? (e.g., customer support, code generation, creative writing)
- What specific domain or subject matter? (e.g., legal, medical, e-commerce, software development)
- How many training examples are needed? (Recommend: 100+ for simple tasks, 500-1000+ for complex)
Quality & Diversity:
- Complexity range: simple to complex mix, or focus on specific difficulty level?
- Diversity: edge cases, error handling, unusual scenarios?
- Tone/style: professional, friendly, technical, concise, detailed?
- Response length preferences?
- Any specific formats: code blocks, lists, tables, JSON?
Dataset Composition:
- Distribution across subtopics: evenly distributed or weighted?
- Include negative examples (what NOT to do)?
- Need validation split? (Recommend 10-20% of total)
See `resources/dataset-strategy.md` for detailed question templates.
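If a validation split is requested, it can be carved off after generation. A minimal sketch, assuming the examples live in `training_data.jsonl` (the output file names here are placeholders, not fixed by the skill):

```python
import random

random.seed(42)  # fixed seed so the split is reproducible

# Read one ChatML example per line.
with open("training_data.jsonl") as f:
    examples = [line for line in f if line.strip()]

random.shuffle(examples)
cutoff = int(len(examples) * 0.9)  # hold out ~10% for validation

with open("train_split.jsonl", "w") as f:
    f.writelines(examples[:cutoff])
with open("validation_data.jsonl", "w") as f:
    f.writelines(examples[cutoff:])
```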
Phase 2: Create Generation Plan
Present a plan covering:
- Number and distribution of examples across categories
- Key topics/scenarios to cover
- Diversity strategies (phrasing variations, complexity levels, edge cases)
- System prompt approach (consistent vs. varied)
- Quality assurance approach
Get user approval before generating.
Phase 3: Generate Synthetic Data
Create examples following these quality standards:
Key Principles:
- Realistic scenarios reflecting real-world use cases
- Natural language with varied phrasing and formality levels
- Accurate, helpful responses aligned with desired behavior
- Consistent ChatML formatting throughout
- Balanced difficulty (unless specified)
- Meaningful variety (no repetition)
- Include edge cases and error scenarios
Diversity Techniques:
- Vary query phrasing (questions, commands, statements)
- Include different expertise levels (beginner, intermediate, expert)
- Cover both positive and negative examples
- Mix short and long-form responses
- Include multi-step reasoning when appropriate
- Add context variations
See `resources/generation-techniques.md` for detailed techniques, domain-specific guidance, and the batch generation workflow.
Phase 4: Validate & Document
Run validation tools and checks:
```bash
# Validate JSON formatting and structure
python scripts/validate_chatml.py training_data.jsonl

# Analyze dataset statistics and diversity
python scripts/analyze_dataset.py training_data.jsonl

# Export statistics
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```
Quality Checklist:
- JSON validation passed (no errors)
- Analysis shows good diversity metrics
- Manual sample review passed
- No duplicate or near-duplicate examples
- All required fields present
- Realistic user queries
- Accurate, helpful responses
- Balanced category distribution
- Dataset metadata documented
See `resources/quality-validation.md` for validation details, troubleshooting, and documentation templates.
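One way to screen for the "no duplicate or near-duplicate examples" item is normalized hashing. A rough sketch; it only catches duplicates that become identical after normalization, and it is not necessarily what `scripts/validate_chatml.py` does internally:

```python
import hashlib
import json

seen, dupes = set(), []
with open("training_data.jsonl") as f:
    for lineno, line in enumerate(f, 1):
        messages = json.loads(line)["messages"]
        # Normalize: lowercase, collapse whitespace, skip the system prompt.
        text = " ".join(
            " ".join(m["content"].lower().split())
            for m in messages
            if m["role"] != "system"
        )
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:
            dupes.append(lineno)
        seen.add(digest)

print(f"{len(dupes)} duplicate example(s) at lines: {dupes[:10]}")
```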
Phase 5: Integration & Training
Prepare for training with your framework of choice:
Output Files:
- `training_data.jsonl`: Main training set
- `validation_data.jsonl`: Optional validation set
- `dataset_info.txt`: Metadata and statistics
Framework Setup:
- Unsloth: Automatic ChatML detection, efficient 4-bit training
- Axolotl: Specify `type: chat_template` and `chat_template: chatml`
- Hugging Face: Use the tokenizer's `apply_chat_template()` method
- Custom: Load from JSONL and handle ChatML formatting yourself
See `resources/framework-integration.md` for setup code, hyperparameters, deployment options, and best practices.
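As a quick illustration of the Hugging Face path, a sketch that loads the JSONL and renders each example through the tokenizer's chat template (the model name is a placeholder, not a recommendation):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder model; any chat model with a chat template works the same way.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

def render(example):
    # Turn the ChatML messages into one training string via the chat template.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return {"text": text}

dataset = dataset.map(render)
print(dataset[0]["text"][:200])
```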
ChatML Format Overview
Each training example is a JSON object with a `messages` array:

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How do I reverse a string in Python?"}, {"role": "assistant", "content": "Use slicing: `text[::-1]`"}]}
```
Roles:
- `system`: Sets assistant behavior (optional but recommended)
- `user`: The user's input or query
- `assistant`: The model's expected response
Multi-turn: Add additional user/assistant message pairs for conversations.
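A multi-turn example looks like this (pretty-printed here for readability; in the actual JSONL file each example must stay on a single line):

```json
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "How do I reverse a string in Python?"},
  {"role": "assistant", "content": "Use slicing: `text[::-1]`"},
  {"role": "user", "content": "Does that work on lists too?"},
  {"role": "assistant", "content": "Yes, `my_list[::-1]` returns a reversed copy of the list."}
]}
```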
See `resources/chatml-format.md` for the detailed specification, validation, common issues, and framework-specific notes.
Tool Reference
Scripts in scripts/
scripts/validate_chatml.py
Validates ChatML format JSONL files:
```bash
python scripts/validate_chatml.py training_data.jsonl
python scripts/validate_chatml.py training_data.jsonl --verbose
```
Checks:
- Valid JSON formatting
- Required fields (messages, role, content)
- Valid role values (system, user, assistant)
- Proper message order
- Duplicate detection
- Diversity metrics
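For a sense of what these checks look like, a minimal structural check in the spirit of the script (illustrative only, not its actual implementation):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def check_line(line: str) -> list:
    """Return a list of problems found in one JSONL line (sketch)."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    errors = []
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            errors.append(f"message {i}: invalid role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            errors.append(f"message {i}: missing or empty content")
    return errors
```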
scripts/analyze_dataset.py
Provides comprehensive statistics and analysis:
```bash
python scripts/analyze_dataset.py training_data.jsonl
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```
Provides:
- Dataset overview (total examples, message counts)
- Message length statistics
- System prompt variations
- User query patterns (questions, commands, code-related, length categories)
- Assistant response patterns (code blocks, lists, headers, length categories)
- Quality indicators (diversity score, balance ratio)
- Token estimates and cost projection
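The token estimate is necessarily a heuristic. A sketch of one common approach (roughly four characters per token for English text; the script may well use a different method):

```python
import json

def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

total = 0
with open("training_data.jsonl") as f:
    for line in f:
        for msg in json.loads(line)["messages"]:
            total += estimate_tokens(msg["content"])

print(f"Estimated dataset size: ~{total:,} tokens")
```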
Common Workflows
Small Dataset (100-200 examples)
- Gather requirements
- Create generation plan for 1-2 categories
- Generate in single batch, review quality
- Validate and document
- Ready for training
Medium Dataset (500-1000 examples)
- Gather requirements
- Create detailed plan with multiple categories
- Generate in 2-3 batches, reviewing after each
- Analyze diversity and adjust approach
- Fill any gaps
- Final validation and documentation
Large Dataset (2000+ examples)
- Gather comprehensive requirements
- Create multi-batch generation plan
- Batch 1 (50-100): Foundation examples
- Batch 2 (100-200): Complexity expansion
- Batch 3 (100-200): Coverage filling
- Batch 4 (50-100): Polish and validation
- Run full validation suite
- Generate comprehensive documentation
Best Practices
Start Small, Iterate
- Generate 10-20 examples first
- Review and get feedback
- Refine approach based on feedback
- Scale up to full dataset
Quality Over Quantity
- Better to have 500 diverse, high-quality examples than 5,000 repetitive ones
- Each example should teach something new
- Maintain consistent response quality throughout
Diversify Systematically
- Vary query phrasing (questions, commands, statements)
- Cover different expertise levels
- Mix response complexities
- Include edge cases (typically 20-30% of dataset)
- Use batch generation workflow for large datasets
Test Before Deployment
- Test dataset with actual training framework
- Monitor training metrics for issues
- Test fine-tuned model outputs before deployment
- Compare results to base model
Document Everything
- Keep notes on generation parameters
- Save different dataset versions
- Document any modifications made
- Record generation strategies used
- Track model performance metrics
Advanced Features
Batch Generation Strategy
For datasets of 500+ examples:
- Generate 50-100 examples at a time
- Review distribution and diversity after each batch
- Adjust generation strategy based on identified gaps
- Prevents repetition and maintains creativity
Common Pitfalls to Avoid
- Over-templating: Creates repetitive patterns (vary naturally)
- Unrealistic Queries: Overly formal/robotic user inputs (use varied phrasing)
- Narrow Coverage: Limited scenarios and phrasing (ensure diversity)
- Inconsistent Quality: Quality degradation over time (use quality checklist)
- JSON Errors: Invalid formatting breaking training (always validate)
- Missing Context: System prompts without detail (provide clear instructions)
- Response Mismatch: Responses don't address queries (verify relevance)
Dataset Size Recommendations
| Task Complexity | Recommended Size | Notes |
|---|---|---|
| Simple tasks | 100-500 | Well-defined, limited variation |
| Medium tasks | 500-2,000 | Multiple scenarios, moderate complexity |
| Complex tasks | 2,000-10,000+ | Many edge cases, high variability |
| Domain adaptation | 1,000-5,000 | Specialized knowledge required |
Resources
- Planning & Strategy: `resources/dataset-strategy.md` (requirements gathering, planning, quality checklists)
- Generation Techniques: `resources/generation-techniques.md` (diversity techniques, domain-specific guidance, batch workflows)
- ChatML Specification: `resources/chatml-format.md` (format details, validation, framework notes)
- Example Datasets: `resources/examples.md` (diverse domain examples, multi-turn patterns)
- Quality Validation: `resources/quality-validation.md` (validation workflow, analysis, troubleshooting)
- Framework Integration: `resources/framework-integration.md` (setup for Unsloth, Axolotl, Hugging Face; deployment options)
Version: 2.0 | Updated: 2024 | Pattern: Modular Orchestration