Claude-skill-registry ai-training-data-generation
Generate high-quality training datasets from documents, text corpora, and structured content. Use when creating AI training data from dictionaries, documents, or when generating examples for machine learning models. Optimized for low-resource languages and domain-specific knowledge extraction.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/ai-training-data-generation" ~/.claude/skills/majiayu000-claude-skill-registry-ai-training-data-generation && rm -rf "$T"
manifest:
skills/data/ai-training-data-generation/SKILL.md
AI Training Data Generation
Overview
A comprehensive skill for automatically generating high-quality training datasets from documents, text corpora, and structured content. Optimized for low-resource languages, dictionary content, and domain-specific knowledge extraction.
Capabilities
- Multi-strategy Generation: Dictionary pairs, contextual definitions, completion tasks, classification examples
- Quality Filtering: Confidence scoring, duplicate removal, and content validation
- Format Flexibility: Support for multiple AI training formats (JSONL, HuggingFace, Ollama, OpenAI)
- Language Awareness: Multi-language support with special handling for accented characters
- Scalable Processing: Generate thousands of examples from large documents
- Balance Management: Ensure dataset diversity and prevent category imbalance
Core Strategies
1. Dictionary Pair Extraction
Extract word-definition pairs from structured and semi-structured text.
Detection Patterns:
- Separator-based: word – definition, term: meaning
- Linguistic indicators: "means", "is defined as", "refers to"
- Structural cues: indentation, formatting, list structures
- Context analysis: surrounding text used for validation
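The separator-based detection above can be sketched as a small regular-expression matcher. This is illustrative only: the separator list, the length cap on terms, and the function name are assumptions, not the skill's actual implementation.

```python
import re

# Separators seen in dictionary-style lines: "word – definition", "word - definition",
# "term: meaning". The list is an assumption for illustration.
SEPARATORS = [r"\s+–\s+", r"\s+-\s+", r":\s+"]

def extract_pairs(lines):
    """Yield (term, definition) pairs from separator-based dictionary lines."""
    pattern = re.compile(
        r"^(?P<term>[\w'\-]{1,40})(?:" + "|".join(SEPARATORS) + r")(?P<definition>.+)$"
    )
    for line in lines:
        m = pattern.match(line.strip())
        if m:
            yield m.group("term"), m.group("definition").strip()

pairs = list(extract_pairs(["ááfengen – very good, excellent", "not a pair"]))
# pairs → [("ááfengen", "very good, excellent")]
```

Because `\w` matches Unicode word characters in Python 3, accented terms such as `ááfengen` are handled without extra configuration; a real extractor would add the linguistic-indicator and context checks described above.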
Implementation Pattern
from .ai_training_generator import AITrainingDataGenerator

# Initialize generator
generator = AITrainingDataGenerator(min_confidence=0.7)

# Generate comprehensive training data
training_data = generator.generate_comprehensive_training_data(
    parsed_document,
    target_count=10000
)

# Export in multiple formats
files = generator.export_training_data(
    training_data,
    output_dir="training_output",
    format_type="ollama"
)
Output Format Examples
JSONL Format (Standard)
{"input": "What does 'ááfengen' mean?", "output": "very good, excellent", "type": "dictionary_pair", "confidence": 0.95}
Ollama Format
{"prompt": "Translate this Chuukese word: ngang", "response": "fish", "system": "You are a Chuukese-English translator."}
HuggingFace Format
{"text": "### Instruction:\nWhat does 'chomong' mean in Chuukese?\n\n### Response:\nto help, assist"}
OpenAI Fine-tuning Format
{"messages": [{"role": "user", "content": "Define: kúún"}, {"role": "assistant", "content": "to go, to leave"}]}
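A single extracted (word, definition) pair can be projected into any of these formats. The sketch below covers two of them; the field names mirror the examples above, while the function names and prompt templates are illustrative assumptions.

```python
import json

def to_jsonl(word, definition, confidence):
    """Standard JSONL record, matching the field layout shown above."""
    return {
        "input": f"What does '{word}' mean?",
        "output": definition,
        "type": "dictionary_pair",
        "confidence": confidence,
    }

def to_openai(word, definition):
    """OpenAI fine-tuning chat record: one user turn, one assistant turn."""
    return {"messages": [
        {"role": "user", "content": f"Define: {word}"},
        {"role": "assistant", "content": definition},
    ]}

record = to_openai("kúún", "to go, to leave")
# ensure_ascii=False keeps accented characters readable in the output file
print(json.dumps(record, ensure_ascii=False))
```

Writing one such record per line (newline-delimited JSON) produces a file that fine-tuning pipelines can consume directly.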
Quality Assurance
- Content validity: Does the example make linguistic sense?
- Pattern matching: Does it follow expected language patterns?
- Context appropriateness: Is the context relevant and helpful?
- Uniqueness: Is the content free of repetition and duplicates?
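The confidence and uniqueness checks can be combined into a single filtering pass. This sketch uses hashlib for duplicate detection, as listed under Dependencies; the threshold and field names are assumptions for illustration.

```python
import hashlib

def filter_examples(examples, min_confidence=0.7):
    """Drop low-confidence records, then deduplicate by hashing input+output."""
    seen, kept = set(), []
    for ex in examples:
        if ex.get("confidence", 0.0) < min_confidence:
            continue
        # Hash the content pair so near-identical records collapse to one entry.
        key = hashlib.sha256(
            (ex["input"] + "\x00" + ex["output"]).encode("utf-8")
        ).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(ex)
    return kept
```

Hashing instead of storing full strings keeps the seen-set small even when filtering thousands of generated examples.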
Best Practices
- Multiple validation passes: Automated and manual quality checks
- Confidence thresholds: Adjust based on use case requirements
- Human review sampling: Periodic manual validation of generated examples
- Balance management: Ensure even distribution across categories
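Balance management can be approximated by capping each example type at the size of the smallest category, using the collections module listed under Dependencies. The function below is a sketch under that policy, not the skill's actual implementation.

```python
import random
from collections import defaultdict

def balance_by_type(examples, max_per_type=None, seed=0):
    """Downsample each example type so no single category dominates the dataset."""
    by_type = defaultdict(list)
    for ex in examples:
        by_type[ex["type"]].append(ex)
    # Default cap: the size of the smallest category (illustrative policy).
    cap = max_per_type or min(len(v) for v in by_type.values())
    rng = random.Random(seed)  # seeded for reproducible sampling
    balanced = []
    for exs in by_type.values():
        rng.shuffle(exs)
        balanced.extend(exs[:cap])
    return balanced
```

For strongly skewed corpora, oversampling rare categories or generating extra examples for them may be preferable to discarding data from the dominant one.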
Dependencies
- re: Regular expression pattern matching
- json: Data serialization and export
- hashlib: Duplicate detection and content hashing
- collections: Data structure utilities and counting