Claude-skill-registry data-designer
Generate high-quality synthetic datasets using statistical samplers and Claude's native LLM capabilities. Use when users ask to create synthetic data, generate datasets, create fake/mock data, generate test data, training data, or any data generation task. Supports CSV, JSON, JSONL, Parquet output. Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-designer" ~/.claude/skills/majiayu000-claude-skill-registry-data-designer && rm -rf "$T"
skills/data/data-designer/SKILL.md
Data Designer
Generate synthetic datasets combining statistical samplers with Claude's LLM capabilities. No external API keys required.
Workflow
- Clarify requirements - Ask about purpose, columns, size, format
- Create schema - Write dataset_schema.json defining columns
- Generate preview - Run batch_generator.py for 3-5 rows
- Iterate - Refine based on feedback
- Generate full dataset - Batch generate, then merge
- Deliver - Export to requested format
Column Types
Statistical Samplers (No LLM)
| Type | Description | Key Params |
|---|---|---|
| category | Weighted random choice | values, weights |
| subcategory | Hierarchical (parent-based) | |
| | Uniform distribution | |
| | Normal distribution | |
| | Binary probability | |
| | Poisson distribution | |
| | Random dates | |
| | Synthetic personas | |
| | Unique IDs | |
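Several of the sampler rows above correspond to ordinary standard-library random draws. The following is a minimal sketch under stated assumptions, not the skill's own code: `values` and `weights` match the schema example in this document, while the parent-to-children mapping, the mean and standard deviation, the ranges, and the probability are hypothetical values chosen for illustration.

```python
import random
import uuid


def sample_row(rng: random.Random) -> dict:
    """Draw one row from a few sampler-style columns (illustrative only)."""
    # Weighted random choice: values and weights as in the schema example.
    category = rng.choices(["A", "B"], weights=[0.6, 0.4], k=1)[0]

    # Hierarchical choice: allowed values depend on the parent column.
    # This parent -> children mapping is a hypothetical layout.
    children = {"A": ["a1", "a2"], "B": ["b1"]}
    subcategory = rng.choice(children[category])

    # Normal distribution, rounded and clamped to the 1-5 range.
    priority = min(5, max(1, round(rng.gauss(3.0, 1.0))))

    # Uniform distribution over an interval (hypothetical bounds).
    price = round(rng.uniform(5.0, 50.0), 2)

    # Binary outcome that is True with probability 0.3 (hypothetical p).
    is_returned = rng.random() < 0.3

    # Unique ID built from the seeded generator so reruns reproduce it.
    row_id = str(uuid.UUID(int=rng.getrandbits(128), version=4))

    return {
        "row_id": row_id,
        "category": category,
        "subcategory": subcategory,
        "priority": priority,
        "price": price,
        "is_returned": is_returned,
    }


def sample_rows(n: int, seed: int = 42) -> list[dict]:
    """Generate n rows; the same seed always yields the same rows."""
    rng = random.Random(seed)
    return [sample_row(rng) for _ in range(n)]
```

Passing a single seeded `random.Random` instance through every draw is what makes the output repeatable, which is the point of the `seed` field in the schema.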
LLM Columns (Claude generates)
| Type | Description |
|---|---|
| llm_text | Free-form text |
| | Code with syntax validation |
| | JSON matching schema |
| | Quality scoring |
Schema Format
Create dataset_schema.json:
{
  "name": "dataset_name",
  "seed": 42,
  "columns": [
    {"name": "category", "type": "category", "params": {"values": ["A", "B"], "weights": [0.6, 0.4]}},
    {"name": "text", "type": "llm_text", "prompt": "Write about {{ category }}.", "depends_on": ["category"]}
  ],
  "output": {"format": "csv", "filename": "output"}
}
For full schema reference: references/schema.md
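Before generating anything, it can help to sanity-check a schema of this shape. The sketch below is an assumption about what such a check might look like; `check_schema` is a hypothetical helper, not part of the skill's scripts, and it only inspects the fields shown in the example above.

```python
import json

SCHEMA_TEXT = """
{
  "name": "dataset_name",
  "seed": 42,
  "columns": [
    {"name": "category", "type": "category",
     "params": {"values": ["A", "B"], "weights": [0.6, 0.4]}},
    {"name": "text", "type": "llm_text",
     "prompt": "Write about {{ category }}.",
     "depends_on": ["category"]}
  ],
  "output": {"format": "csv", "filename": "output"}
}
"""


def check_schema(schema: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    columns = schema.get("columns", [])
    names = [col["name"] for col in columns]
    if len(names) != len(set(names)):
        problems.append("duplicate column names")
    for col in columns:
        # Every dependency must name a column defined in the same schema.
        for dep in col.get("depends_on", []):
            if dep not in names:
                problems.append(f"{col['name']} depends on unknown column {dep}")
        # Weighted choices need exactly one weight per value.
        params = col.get("params", {})
        if "weights" in params and len(params["weights"]) != len(params.get("values", [])):
            problems.append(f"{col['name']}: weights and values differ in length")
    return problems


schema = json.loads(SCHEMA_TEXT)
print(check_schema(schema))  # prints []
```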
Jinja2 Templating
Reference columns in prompts:
Write a {{ rating }}-star review for {{ product_name }} by {{ customer.first_name }}.
Supports: {{ var }}, {{ obj.field }}, {% if %}, filters
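To show how placeholders resolve against a row, here is a small standard-library stand-in. It is deliberately not Jinja2: it handles only the `{{ var }}` and `{{ obj.field }}` forms, and `{% if %}` blocks and filters would need the real Jinja2 library. The function name `render_prompt` and the sample row are made up for illustration.

```python
import re

# Matches {{ name }} and dotted paths such as {{ customer.first_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_prompt(template: str, row: dict) -> str:
    """Substitute placeholders in template with values from row."""

    def lookup(match: re.Match) -> str:
        value = row
        # Walk a dotted path one key at a time.
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if the column is missing
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


# Illustrative row; in practice these values come from earlier columns.
row = {"rating": 4, "product_name": "Desk Lamp", "customer": {"first_name": "Ana"}}
template = (
    "Write a {{ rating }}-star review for {{ product_name }} "
    "by {{ customer.first_name }}."
)
print(render_prompt(template, row))
# prints: Write a 4-star review for Desk Lamp by Ana.
```

Failing loudly on a missing key is a design choice here: a prompt that references a column not yet generated usually signals a missing `depends_on` entry.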
Scripts
Generate Data
# Preview
python scripts/batch_generator.py --schema schema.json --rows 5 --output preview.json --preview
# Full generation
python scripts/batch_generator.py --schema schema.json --rows 100 --batch-size 20 --output batches/
Merge & Export
python scripts/merger.py --input batches/ --output dataset.csv --flatten
Formats: csv, json, jsonl, parquet
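The merge step can be pictured as concatenating batch files and flattening nested fields into columns. This is a rough sketch of that idea, not the actual behavior of merger.py: it assumes each batch file holds a JSON list of row objects, that flattening means dotted keys such as `customer.email`, and that only CSV output is needed. The names `flatten` and `merge_batches` are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def flatten(row: dict, prefix: str = "") -> dict:
    """Turn nested dicts into one level with dotted keys."""
    flat = {}
    for key, value in row.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def merge_batches(batch_dir: Path, out_csv: Path) -> int:
    """Concatenate batch files in name order into one CSV; return the row count."""
    rows = []
    for path in sorted(batch_dir.glob("*.json")):
        rows.extend(flatten(r) for r in json.loads(path.read_text()))
    # Union of keys in first-seen order so every column gets a header.
    fieldnames = list(dict.fromkeys(k for r in rows for k in r))
    with out_csv.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Small demonstration with two one-row batches in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    batches = Path(tmp) / "batches"
    batches.mkdir()
    (batches / "batch_000.json").write_text(
        json.dumps([{"id": 1, "customer": {"email": "a@example.com"}}])
    )
    (batches / "batch_001.json").write_text(
        json.dumps([{"id": 2, "customer": {"email": "b@example.com"}}])
    )
    count = merge_batches(batches, Path(tmp) / "dataset.csv")
    header = (Path(tmp) / "dataset.csv").read_text().splitlines()[0]
```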
Generation Strategy
- Sampler columns first - Python scripts, fast
- LLM columns in dependency order - Topological sort by depends_on
- Batch processing - Generate in batches of 20-50 for large datasets
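The ordering and batching steps above can be sketched with Python's standard `graphlib` module (Python 3.9+). This is one possible approach under assumptions: `generation_order` and `batch_sizes` are hypothetical helpers, and the column list is made up to exercise a dependency chain.

```python
from graphlib import TopologicalSorter


def generation_order(columns: list[dict]) -> list[str]:
    """Order columns so each one comes after everything in its depends_on."""
    graph = {col["name"]: set(col.get("depends_on", [])) for col in columns}
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


def batch_sizes(total_rows: int, batch_size: int) -> list[int]:
    """Split total_rows into full batches plus one smaller remainder batch."""
    full, rest = divmod(total_rows, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


# Illustrative columns, listed out of order on purpose.
columns = [
    {"name": "review", "type": "llm_text", "depends_on": ["rating", "product_name"]},
    {"name": "rating", "type": "category"},
    {"name": "product_name", "type": "llm_text", "depends_on": ["category"]},
    {"name": "category", "type": "category"},
]
order = generation_order(columns)
print(order)
print(batch_sizes(100, 20))
```

Columns with no dependencies can appear in any relative order, so only the "dependency before dependent" property is guaranteed.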
For LLM columns, Claude generates directly:
- Render Jinja2 prompt with row data
- Generate content
- Validate if configured
- Retry on failure (max 3)
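The generate, validate, retry loop above can be sketched as follows. This is an illustration under assumptions: it reads "max 3" as three attempts in total, uses a syntax-only check as the validator for a code column, and replaces the LLM with a stand-in function that returns canned drafts. `generate_with_retry` and `is_valid_python` are hypothetical names.

```python
import ast
from typing import Callable


def generate_with_retry(
    generate: Callable[[int], str],
    validate: Callable[[str], bool],
    max_attempts: int = 3,
) -> str:
    """Call generate(attempt) until validate accepts the result or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if validate(candidate):
            return candidate
    raise ValueError(f"no valid output after {max_attempts} attempts")


def is_valid_python(source: str) -> bool:
    """Syntax check only: the code must parse; nothing is executed."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Stand-in generator: the first draft is missing a colon, the second parses.
drafts = ["def add(a, b) return a + b", "def add(a, b):\n    return a + b\n"]
result = generate_with_retry(lambda attempt: drafts[attempt - 1], is_valid_python)
print(result)
```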
Examples
Simple:
"Generate 50 product reviews with ratings 1-5"
Complex:
"Create 200 support tickets with: ticket_id (UUID), customer (name, email), category (billing/technical/general), priority (1-5 gaussian), description (LLM)"
Code:
"Generate 100 Python functions with description, code (validated), tests"
Tips
- Use seed for reproducibility
- Preview first, then scale
- Keep LLM prompts specific
- Use subcategory for correlated data
Attribution
Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).