Dotclaude skill-creator
Use when creating a new skill, rewriting an existing skill, or running evaluations to improve skill quality and trigger accuracy.
```shell
git clone https://github.com/JHostalek/dotclaude
```

Install into `~/.claude/skills`:

```shell
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/JHostalek/dotclaude "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/skill-creator" ~/.claude/skills/jhostalek-dotclaude-skill-creator \
  && rm -rf "$T"
```
skills/skill-creator/SKILL.md

Skill Creator
Create, evaluate, and iteratively improve skills.
The Loop
Every skill follows this cycle. Figure out where the user is and jump in:
- Capture intent — understand what the skill does and when it triggers
- Draft the skill — write SKILL.md and any bundled resources
- Run test cases — spawn with-skill and baseline runs in parallel
- Evaluate with the user — generate the eval viewer, run quantitative evals, get feedback
- Improve — rewrite based on feedback, rerun, repeat
- Optimize description — tune triggering accuracy (after skill is stable)
The user might arrive at any stage. Maybe they already have a draft — skip to eval. Maybe they want to vibe without formal evals — adapt. The loop is a guide, not a mandate.
Capturing Intent
If the current conversation already contains a workflow to capture ("turn this into a skill"), extract from history first: tools used, step sequence, corrections made, input/output formats observed. Fill gaps with the user.
Key questions:
- What should this skill enable Claude to do?
- When should it trigger? (phrases, contexts)
- Expected output format?
- Are test cases appropriate? Objectively verifiable outputs (file transforms, data extraction, code generation) benefit from test cases. Subjective outputs (writing style, art) often don't.
Proactively research edge cases, dependencies, and similar existing skills before writing.
Writing Skills
Anatomy
```
skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter (name, description required)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/     - Executable code for deterministic/repetitive tasks
    ├── references/  - Docs loaded into context as needed
    └── assets/      - Files used in output (templates, icons, fonts)
```
Progressive Disclosure
Skills use three-level loading:
- Metadata (name + description) — always in context (~100 words)
- SKILL.md body — loaded when skill triggers (<500 lines ideal)
- Bundled resources — loaded on demand (unlimited; scripts execute without loading)
Keep SKILL.md under 500 lines. If approaching this limit, push detail into reference files with clear pointers. For large references (>300 lines), include a table of contents.
For multi-domain skills, organize by variant:
```
cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md
```
Description
The description is the primary triggering mechanism. Include both what the skill does AND specific contexts for when to use it. Claude tends to under-trigger — it won't use skills when they'd be useful. Counteract this by making descriptions slightly pushy: instead of "Build dashboards for internal data", write "Build dashboards for internal data. Use whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of company data, even if they don't explicitly ask for a 'dashboard.'"
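As a sketch, the pushy version of that description might sit in frontmatter like this (the skill name is hypothetical):

```yaml
name: dashboard-builder
description: >
  Build dashboards for internal data. Use whenever the user mentions
  dashboards, data visualization, internal metrics, or wants to display
  any kind of company data, even if they don't explicitly ask for a
  "dashboard."
```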
Writing Principles
Apply prompt design principles from:

```shell
cat ~/.claude/skills/prompt/SKILL.md
```
Key calibrations specific to skill writing:
- Explain why, not just what. "ALWAYS" and "NEVER" in all caps are yellow flags — reframe as reasoning so the model applies the principle in novel situations.
- Teach theory of mind. A skill that helps the model understand the user's goal generalizes better than one specifying exact steps.
- No-surprise principle. Skills must not contain malware, exploit code, or content that would surprise the user if described.
Test Cases
After drafting, create 2-3 realistic test prompts — things a real user would say. Share with user for confirmation, then run.
Save to `evals/evals.json`:

```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
```
See `references/schemas.md` for the full schema (including assertions, added after the first run).
Running Evals
This section is one continuous sequence. Do NOT use `/skill-test` or other testing skills.
Put results in `<skill-name>-workspace/` as a sibling to the skill directory. Organize by iteration (`iteration-1/`, `iteration-2/`, etc.) and eval (`eval-0/`, `eval-1/`, etc.). Create directories as you go.
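The skeleton can be created up front; a shell sketch following the convention above (the skill name `my-skill` is illustrative):

```shell
# Workspace skeleton for iteration 1, eval 0, with-skill and baseline runs
mkdir -p my-skill-workspace/iteration-1/eval-0/with_skill/outputs
mkdir -p my-skill-workspace/iteration-1/eval-0/without_skill/outputs
ls my-skill-workspace/iteration-1/eval-0
```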
Spawn All Runs in One Turn
For each test case, spawn both with-skill and baseline teammates simultaneously — not sequentially. This matters for wall-clock time.
With-skill run:
```
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the user cares about>
```
Baseline run depends on context:
- New skill → no skill at all. Save to `without_skill/outputs/`
- Improving existing skill → snapshot the old version first (`cp -r`), point the baseline at the snapshot. Save to `old_skill/outputs/`
Write `eval_metadata.json` per test case with a descriptive name (not "eval-0"). If an iteration uses new or modified eval prompts, create fresh metadata files — don't assume carryover.

```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
```
Draft Assertions While Runs Execute
Use the wait productively. Draft quantitative assertions for each test case and explain them to the user. Good assertions are objectively verifiable and have descriptive names that read clearly in the benchmark viewer.
Subjective skills (writing style, design quality) are better evaluated qualitatively — don't force assertions onto things that need human judgment.
Update `eval_metadata.json` and `evals/evals.json` with the assertions.
Capture Timing Data as Runs Complete
When a teammate task completes, the notification contains `total_tokens` and `duration_ms`. Save them immediately to `timing.json` in the run directory — this data is not persisted elsewhere.

```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```
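A minimal sketch for persisting those fields as soon as the notification arrives (Python; the values are the sample numbers above, and the derived `total_duration_seconds` is rounded to one decimal):

```python
import json
from pathlib import Path

def save_timing(run_dir, total_tokens, duration_ms):
    """Write timing.json into the run directory. The notification data
    is not persisted anywhere else, so capture it immediately."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    (Path(run_dir) / "timing.json").write_text(json.dumps(timing, indent=2))
    return timing

if __name__ == "__main__":
    print(save_timing(".", 84852, 23332))
```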
Grade, Aggregate, Launch Viewer
Once all runs complete:
- Grade each run — spawn a grader teammate using `agents/grader.md`. Save to `grading.json` per run. The expectations array must use the fields `text`, `passed`, and `evidence` — the viewer depends on these exact names. For programmatically checkable assertions, write and run a script rather than eyeballing it.
- Aggregate — run from the skill-creator directory:

  ```shell
  python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
  ```

  Produces `benchmark.json` and `benchmark.md`. Put each with_skill version before its baseline counterpart. See `references/schemas.md` for the exact schema.
- Analyst pass — spawn an analyst teammate using `agents/analyzer.md` to surface patterns the aggregate stats hide (non-discriminating assertions, high-variance evals, time/token tradeoffs).
- Launch viewer — always use `generate_review.py`, never custom HTML:

  ```shell
  nohup python <skill-creator-path>/eval-viewer/generate_review.py \
    <workspace>/iteration-N \
    --skill-name "my-skill" \
    --benchmark <workspace>/iteration-N/benchmark.json \
    > /dev/null 2>&1 &
  VIEWER_PID=$!
  ```

  For iteration 2+, add `--previous-workspace <workspace>/iteration-<N-1>`. Headless/Cowork: use `--static <output_path>` for standalone HTML. Feedback downloads as `feedback.json` — copy it into the workspace directory.
- Tell the user the viewer is ready and to come back when they're done reviewing.
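For the programmatic checks mentioned in the grading step, a sketch of what a grading script could look like (Python; the assertion and the `outputs/report.md` file are hypothetical — only the `text`/`passed`/`evidence` field names are fixed by the viewer):

```python
import json
from pathlib import Path

def grade_run(run_dir, assertions):
    """Check each (text, check) assertion against a run directory and
    write grading.json. The viewer requires exactly these field names:
    text, passed, evidence."""
    expectations = []
    for text, check in assertions:
        passed, evidence = check(Path(run_dir))
        expectations.append({"text": text, "passed": passed, "evidence": evidence})
    grading = {"expectations": expectations}
    (Path(run_dir) / "grading.json").write_text(json.dumps(grading, indent=2))
    return grading

# Hypothetical assertion: the run produced a non-empty outputs/report.md
def report_exists(run_dir):
    f = run_dir / "outputs" / "report.md"
    if f.is_file() and f.stat().st_size > 0:
        return True, f"{f} exists ({f.stat().st_size} bytes)"
    return False, f"{f} missing or empty"
```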
Read Feedback
When the user finishes, read `feedback.json`. Empty feedback means the user thought it was fine. Focus improvements on test cases with specific complaints.
Kill the viewer:

```shell
kill $VIEWER_PID 2>/dev/null
```
Improving the Skill
Improvement Principles
- Generalize from examples. You're iterating on a few test cases to move fast, but the skill will be used across millions of prompts. If a fix only works for the current examples, it's overfitting. When a stubborn issue persists, try different metaphors or patterns rather than adding fiddly constraints.
- Keep the prompt lean. Read transcripts, not just outputs. If the skill causes the model to waste time on unproductive steps, remove those instructions.
- Explain the why. Transmit your understanding of the user's intent into the instructions. Rigid MUSTs are less effective than motivated reasoning.
- Extract repeated work. If all test runs independently wrote similar helper scripts, bundle that script in `scripts/` and reference it from the skill.
Iteration Loop
- Apply improvements
- Rerun all test cases into `iteration-<N+1>/`, including baselines. For new skills, the baseline is always `without_skill`. For improvements, use judgment on the baseline version.
- Launch the viewer with `--previous-workspace` pointing to the previous iteration
- Wait for user review
- Read feedback, improve, repeat
Stop when: the user is happy, feedback is all empty, or improvements plateau.
Blind Comparison (Optional)
Use this for rigorous A/B comparison between skill versions: spawn comparator and analyzer teammates using `agents/comparator.md` and `agents/analyzer.md`. Each comparison gives two outputs to an independent teammate without revealing which is which.
Most users won't need this — the human review loop is usually sufficient.
Description Optimization
After the skill is stable, optimize the description for triggering accuracy.
Generate Trigger Eval Queries
Create 20 queries — a mix of should-trigger (8-10) and should-not-trigger (8-10). Save as JSON:

```json
[
  {"query": "the user prompt", "should_trigger": true},
  {"query": "another prompt", "should_trigger": false}
]
```
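A quick sanity check on that balance, sketched in Python (the `trigger_eval.json` file name is illustrative):

```python
import json

def check_balance(queries):
    """Verify the trigger eval set has 8-10 queries on each side."""
    pos = sum(1 for q in queries if q["should_trigger"])
    neg = len(queries) - pos
    ok = 8 <= pos <= 10 and 8 <= neg <= 10
    return pos, neg, ok

if __name__ == "__main__":
    with open("trigger_eval.json") as f:
        queries = json.load(f)
    pos, neg, ok = check_balance(queries)
    print(f"{pos} should-trigger, {neg} should-not-trigger, balanced: {ok}")
```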
Query quality matters. Queries must be realistic — concrete, specific, with file paths, personal context, company names, URLs. Mix lengths, include typos and casual speech.
Bad: "Format this data", "Extract text from PDF", "Create a chart"
Good: "ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think"
Should-trigger queries: Different phrasings of the same intent, some formal, some casual. Include cases where the user doesn't name the skill explicitly but clearly needs it.
Should-not-trigger queries: Near-misses are the most valuable — queries sharing keywords but needing something different. Adjacent domains, ambiguous phrasing. "Write a fibonacci function" as a negative for a PDF skill tests nothing.
Triggering mechanics: Skills appear in Claude's `available_skills` list. Claude only consults skills for tasks it can't handle directly — simple one-step queries may not trigger even with a perfect description. Eval queries should be substantive enough to benefit from skill consultation.
Review with User
Use the HTML template from `assets/eval_review.html`:

- Replace `__EVAL_DATA_PLACEHOLDER__` with the JSON array (no quotes — it becomes a JS variable)
- Replace `__SKILL_NAME_PLACEHOLDER__` and `__SKILL_DESCRIPTION_PLACEHOLDER__`
- Write to `/tmp/eval_review_<skill-name>.html` and open it
- User edits queries, exports the eval set
- Check `~/Downloads/` for the export (may have a numeric suffix)
Run Optimization
```shell
python -m scripts.run_loop \
  --eval-set <path-to-trigger-eval.json> \
  --skill-path <path-to-skill> \
  --model <model-id-powering-this-session> \
  --max-iterations 5 \
  --verbose
```
Run in background. Periodically tail output to give user updates. The loop splits 60% train / 40% test, evaluates 3x per query for reliability, uses extended thinking for improvements, and selects by test score to avoid overfitting.
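The 60/40 split mentioned above could look like the following sketch (the actual `scripts/run_loop` implementation may differ, e.g. in shuffling, seeding, or rounding):

```python
import random

def split_eval_set(queries, train_frac=0.6, seed=0):
    """Shuffle trigger-eval queries and split them into train/test sets.
    Selecting the winner by test score guards against overfitting the
    description to the training queries."""
    queries = list(queries)
    random.Random(seed).shuffle(queries)
    cut = int(len(queries) * train_frac)
    return queries[:cut], queries[cut:]

if __name__ == "__main__":
    train, test = split_eval_set(range(20))
    print(len(train), len(test))  # 12 8
```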
Apply Result
Take `best_description` from the output JSON and update the SKILL.md frontmatter. Show the before/after and the scores.
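A sketch of applying the result (the frontmatter layout and single-line description are assumptions; multi-line descriptions would need a YAML parser):

```python
import re
from pathlib import Path

def apply_description(skill_md_path, best_description):
    """Replace the description field in SKILL.md's YAML frontmatter.
    Assumes the description value fits on one line."""
    text = Path(skill_md_path).read_text()
    new_text, n = re.subn(
        r"^description:.*$",
        f"description: {best_description}",
        text,
        count=1,
        flags=re.MULTILINE,
    )
    if n != 1:
        raise ValueError("no description field found in frontmatter")
    Path(skill_md_path).write_text(new_text)
```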
Environment Adaptations
Claude.ai
No teammates — run test cases serially yourself. Skip baselines and quantitative benchmarking. Present results directly in conversation instead of the viewer. Skip description optimization (it requires the `claude -p` CLI).
Cowork
Full workflow works. Use `--static <output_path>` for the viewer (no display). Generate the eval viewer before evaluating outputs yourself — get results in front of the human first. Feedback downloads as a file.
Packaging (if `present_files` tool available)

```shell
python -m scripts.package_skill <path/to/skill-folder>
```

Direct the user to the resulting `.skill` file.
Reference Files
- `agents/grader.md` — assertion evaluation against outputs
- `agents/comparator.md` — blind A/B output comparison
- `agents/analyzer.md` — benchmark pattern analysis
- `references/schemas.md` — JSON schemas for all data files