Claude-skill-registry data-designer

Generate high-quality synthetic datasets using statistical samplers and Claude's native LLM capabilities. Use when users ask to create synthetic data, generate datasets, create fake/mock data, generate test data, training data, or any data generation task. Supports CSV, JSON, JSONL, Parquet output. Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-designer" ~/.claude/skills/majiayu000-claude-skill-registry-data-designer && rm -rf "$T"
manifest: skills/data/data-designer/SKILL.md

Data Designer

Generate synthetic datasets combining statistical samplers with Claude's LLM capabilities. No external API keys required.

Workflow

  1. Clarify requirements - Ask about purpose, columns, size, format
  2. Create schema - Write dataset_schema.json defining columns
  3. Generate preview - Run batch_generator.py for 3-5 rows
  4. Iterate - Refine based on feedback
  5. Generate full dataset - Batch generate, then merge
  6. Deliver - Export to requested format

Column Types

Statistical Samplers (No LLM)

| Type | Description | Key Params |
|---|---|---|
| category | Weighted random choice | values, weights |
| subcategory | Hierarchical (parent-based) | mapping, category |
| uniform | Uniform distribution | low, high, dtype |
| gaussian | Normal distribution | mean, std, min_val, max_val |
| bernoulli | Binary probability | p, true_value, false_value |
| poisson | Poisson distribution | mean |
| datetime | Random dates | start, end, format |
| person | Synthetic personas | fields, age_range, locale |
| uuid | Unique IDs | prefix, format |
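As a rough illustration of how a few of these samplers might behave, here is a minimal sketch using only Python's standard library. The function names, the example column values, and the clipping behavior for min_val/max_val are assumptions made for illustration; the skill's own scripts may implement them differently.

```python
import random

def sample_category(rng, values, weights=None):
    """Weighted random choice; uniform when no weights are given."""
    return rng.choices(values, weights=weights, k=1)[0]

def sample_gaussian(rng, mean, std, min_val=None, max_val=None):
    """Normal draw, clipped to [min_val, max_val] when bounds are set (assumed behavior)."""
    x = rng.gauss(mean, std)
    if min_val is not None:
        x = max(min_val, x)
    if max_val is not None:
        x = min(max_val, x)
    return x

def sample_bernoulli(rng, p, true_value=True, false_value=False):
    """Return true_value with probability p, otherwise false_value."""
    return true_value if rng.random() < p else false_value

def generate_rows(n, seed):
    """Build n rows from three sampler columns using one seeded generator."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        rows.append({
            "category": sample_category(rng, ["billing", "technical", "general"], [0.5, 0.3, 0.2]),
            "priority": round(sample_gaussian(rng, mean=3, std=1, min_val=1, max_val=5)),
            "is_premium": sample_bernoulli(rng, p=0.2),
        })
    return rows

for row in generate_rows(3, seed=42):
    print(row)
```

Passing a single seeded `random.Random` instance through every sampler keeps the whole row sequence repeatable for a given seed.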

LLM Columns (Claude generates)

| Type | Description |
|---|---|
| llm_text | Free-form text |
| llm_code | Code with syntax validation |
| llm_structured | JSON matching schema |
| llm_judge | Quality scoring |
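The syntax validation mentioned for llm_code could, for Python output, be as simple as parsing the generated source without running it. The sketch below is an assumption about one way to do this; the helper name `is_valid_python` is hypothetical and the skill may validate code differently or support other languages.

```python
import ast

def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python. The code is never executed."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b) return a + b"

print(is_valid_python(good))  # True
print(is_valid_python(bad))   # False
```

Parsing only catches syntax errors. It says nothing about whether the code is correct, which is where generated tests or an llm_judge column would come in.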

Schema Format

Create dataset_schema.json:

{
  "name": "dataset_name",
  "seed": 42,
  "columns": [
    {"name": "category", "type": "category", "params": {"values": ["A","B"], "weights": [0.6,0.4]}},
    {"name": "text", "type": "llm_text", "prompt": "Write about {{ category }}.", "depends_on": ["category"]}
  ],
  "output": {"format": "csv", "filename": "output"}
}

For full schema reference: references/schema.md
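Before generating anything, it can help to sanity-check a schema like the one above. The following sketch is not part of the skill; `check_schema` is a hypothetical helper, and the checks it performs (unique column names, known depends_on targets, matching values/weights lengths) are assumptions about what a reasonable validator would cover.

```python
import json

SCHEMA_TEXT = """
{
  "name": "dataset_name",
  "seed": 42,
  "columns": [
    {"name": "category", "type": "category", "params": {"values": ["A","B"], "weights": [0.6,0.4]}},
    {"name": "text", "type": "llm_text", "prompt": "Write about {{ category }}.", "depends_on": ["category"]}
  ],
  "output": {"format": "csv", "filename": "output"}
}
"""

def check_schema(schema):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    names = [col["name"] for col in schema["columns"]]
    if len(names) != len(set(names)):
        problems.append("duplicate column names")
    for col in schema["columns"]:
        # Every dependency must name a column defined in the schema.
        for dep in col.get("depends_on", []):
            if dep not in names:
                problems.append(f"{col['name']} depends on unknown column {dep}")
        # For category columns, weights (if present) must align with values.
        if col["type"] == "category":
            params = col.get("params", {})
            weights = params.get("weights")
            if weights is not None and len(weights) != len(params.get("values", [])):
                problems.append(f"{col['name']}: values and weights differ in length")
    return problems

schema = json.loads(SCHEMA_TEXT)
print(check_schema(schema))  # []
```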

Jinja2 Templating

Reference columns in prompts:

Write a {{ rating }}-star review for {{ product_name }} by {{ customer.first_name }}.

Supports: {{ var }}, {{ obj.field }}, {% if %}, filters
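Because prompts reference other columns by name, the columns a prompt depends on can be read off the template. The sketch below uses a plain regular expression rather than Jinja2's own parser, so it is only an approximation: it looks at `{{ ... }}` expressions and ignores names used inside `{% if %}` tags. The helper name `referenced_columns` is hypothetical.

```python
import re

# Captures the first identifier inside {{ ... }}, e.g. "customer" in {{ customer.first_name }}.
_VAR = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)")

def referenced_columns(template: str) -> list[str]:
    """Return the top-level names used in {{ ... }} expressions, in first-seen order."""
    seen = []
    for name in _VAR.findall(template):
        if name not in seen:
            seen.append(name)
    return seen

prompt = "Write a {{ rating }}-star review for {{ product_name }} by {{ customer.first_name }}."
print(referenced_columns(prompt))  # ['rating', 'product_name', 'customer']
```

A list like this could be compared against a column's depends_on entry to catch a prompt that uses a column it never declared.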

Scripts

Generate Data

# Preview
python scripts/batch_generator.py --schema dataset_schema.json --rows 5 --output preview.json --preview

# Full generation
python scripts/batch_generator.py --schema dataset_schema.json --rows 100 --batch-size 20 --output batches/
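To make the relationship between --rows and --batch-size concrete, the sketch below works out how many batches a run would produce. It is an illustration only: `plan_batches` is a hypothetical helper, and the assumption that a leftover partial batch is written last is not confirmed by the source.

```python
def plan_batches(rows: int, batch_size: int) -> list[int]:
    """Split `rows` into batch sizes, with any remainder as a final short batch."""
    if rows < 0 or batch_size <= 0:
        raise ValueError("rows must be >= 0 and batch_size must be > 0")
    full, remainder = divmod(rows, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])

print(plan_batches(100, 20))  # [20, 20, 20, 20, 20]
print(plan_batches(105, 20))  # [20, 20, 20, 20, 20, 5]
```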

Merge & Export

python scripts/merger.py --input batches/ --output dataset.csv --flatten
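What merging and flattening might involve can be sketched as follows. This is not the code of merger.py: it assumes each batch is a .json file holding a list of row objects, and that --flatten turns nested objects into dot-separated column names. Both assumptions, and the function names, are illustrative.

```python
import csv
import json
from pathlib import Path

def flatten(row, parent_key="", sep="."):
    """Turn {"customer": {"email": "a@b.c"}} into {"customer.email": "a@b.c"}."""
    flat = {}
    for key, value in row.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

def merge_batches(input_dir, output_csv):
    """Read every *.json batch in name order, flatten rows, write one CSV."""
    rows = []
    for path in sorted(Path(input_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            rows.extend(flatten(r) for r in json.load(f))
    # Union of all keys in first-seen order, so rows with extra fields still fit.
    fieldnames = list(dict.fromkeys(k for r in rows for k in r))
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `merge_batches("batches/", "dataset.csv")` would mirror the command above under these assumptions. Rows that lack a column are written with an empty cell, which is the `csv.DictWriter` default.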

Formats: csv, json, jsonl, parquet

Generation Strategy

  1. Sampler columns first - Python scripts, fast
  2. LLM columns in dependency order - Topological sort by depends_on
  3. Batch processing - Generate in batches of 20-50 for large datasets
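The dependency ordering in step 2 can be sketched with the standard library's `graphlib` module (Python 3.9+). The column definitions here are made up for illustration, and `generation_order` is a hypothetical helper rather than a function from the skill's scripts.

```python
from graphlib import CycleError, TopologicalSorter

def generation_order(columns):
    """Order column names so each column comes after everything it depends on."""
    # Map each column to the set of columns that must be generated before it.
    graph = {col["name"]: set(col.get("depends_on", [])) for col in columns}
    return list(TopologicalSorter(graph).static_order())

columns = [
    {"name": "score", "type": "llm_judge", "depends_on": ["review"]},
    {"name": "review", "type": "llm_text", "depends_on": ["rating", "product_name"]},
    {"name": "rating", "type": "uniform"},
    {"name": "product_name", "type": "category"},
]

print(generation_order(columns))

# A circular dependency cannot be ordered and is reported as an error.
try:
    generation_order([
        {"name": "a", "depends_on": ["b"]},
        {"name": "b", "depends_on": ["a"]},
    ])
except CycleError:
    print("cycle detected")
```

The relative order of independent columns (here rating and product_name) is not fixed by the sort; only the dependency constraints are guaranteed.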

For LLM columns, Claude generates directly:

  • Render Jinja2 prompt with row data
  • Generate content
  • Validate if configured
  • Retry on failure (max 3)
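The generate, validate, retry loop above can be sketched as below. Since Claude produces the content directly, `generate` is a stand-in callable here, and the stub `fake_generate` exists only for the demo. The sketch also assumes "max 3" means three attempts in total; the source does not say whether it means three attempts or three retries after the first.

```python
import json

def generate_with_retry(generate, validate, max_attempts=3):
    """Call generate(attempt) until validate accepts the result or attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        ok, error = validate(candidate)
        if ok:
            return candidate
        last_error = error
    raise ValueError(f"validation failed after {max_attempts} attempts: {last_error}")

def validate_json(text):
    """Example validator: accept only text that parses as JSON."""
    try:
        json.loads(text)
        return True, None
    except json.JSONDecodeError as exc:
        return False, str(exc)

def fake_generate(attempt):
    """Demo stub: malformed output on the first attempt, valid JSON afterwards."""
    return '{"rating": 5' if attempt == 1 else '{"rating": 5}'

print(generate_with_retry(fake_generate, validate_json))  # {"rating": 5}
```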

Examples

Simple:

"Generate 50 product reviews with ratings 1-5"

Complex:

"Create 200 support tickets with: ticket_id (UUID), customer (name, email), category (billing/technical/general), priority (1-5 gaussian), description (LLM)"

Code:

"Generate 100 Python functions with description, code (validated), tests"

Tips

  • Use seed for reproducibility
  • Preview first, then scale
  • Keep LLM prompts specific
  • Use subcategory for correlated data
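Two of these tips, seeding and subcategory, can be shown together in a small sketch. The mapping values and the `sample_rows` helper are hypothetical; the point is only that the child value is drawn from the parent's own list, and that a fixed seed reproduces the same rows.

```python
import random

# Hypothetical parent -> children mapping for a subcategory column.
MAPPING = {
    "billing": ["refund", "invoice"],
    "technical": ["login", "crash"],
}

def sample_rows(n, seed):
    """Draw a parent category, then a subcategory from that parent's list."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        category = rng.choice(list(MAPPING))
        subcategory = rng.choice(MAPPING[category])
        rows.append({"category": category, "subcategory": subcategory})
    return rows

first = sample_rows(10, seed=42)
second = sample_rows(10, seed=42)
print(first == second)  # True
```

Because the subcategory is always chosen from its parent's list, combinations such as billing with crash cannot occur, which is what keeps the two columns correlated.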

Attribution

Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).