Openclacky skill-creator
Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
```shell
git clone https://github.com/clacky-ai/openclacky
```

Or install just this skill:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/clacky-ai/openclacky "$T" && mkdir -p ~/.claude/skills && cp -r "$T/lib/clacky/default_skills/skill-creator" ~/.claude/skills/clacky-ai-openclacky-skill-creator && rm -rf "$T"
```
lib/clacky/default_skills/skill-creator/SKILL.md

Skill Creator
A skill for creating new skills and iteratively improving them.
Usage Modes
This skill supports two modes:
1. Interactive Mode (default)
The full workflow with user interviews, test cases, and iteration cycles. Use when creating or refining skills manually.
At a high level, the process of creating a skill goes like this:
- Decide what you want the skill to do and roughly how it should do it
- Write a draft of the skill
- Create a few test prompts and simulate running them (with vs. without the skill instructions)
- Help the user evaluate the results both qualitatively and quantitatively
- While reviewing, draft quantitative assertions if there aren't any
- Use `eval-viewer/generate_review.py` to generate a static HTML viewer for the user to review results and leave feedback
- Rewrite the skill based on the user's feedback
- Repeat until satisfied
Your job is to figure out where the user is in this process and jump in to help them progress through these stages. Maybe they say "I want to make a skill for X" — help narrow down the intent, write a draft, write test cases, evaluate, and repeat. Or maybe they already have a draft — go straight to the eval/iterate part.
Always be flexible. If the user says "skip the evals, just vibe with me", do that instead.
2. Quick Mode (for agent self-evolution)
Trigger: When invoked with `mode: "quick"` in the task arguments.
Fast, opinionated skill creation without user interaction. This mode is used by the agent's self-evolution system to automatically create or improve skills.
Behavior:
- Skip user interviews and detailed requirements gathering
- Extract workflow pattern from provided context
- Write a minimal but functional SKILL.md
- Save to `~/.clacky/skills/auto-<name>-<timestamp>/` (or improve existing skill in place)
- Skip test cases and evals (user can refine later if needed)
- Always validate frontmatter with the validator script after creation
- Focus on the happy path; edge cases can be added later
Expected arguments when using quick mode:
- `task`: Clear description of what to automate and how (be specific about workflow steps)
- `mode`: Must be set to `"quick"`
- `suggested_name`: (optional) Proposed skill identifier (lowercase, hyphens OK)
Quick mode principles:
- Be opinionated: Make reasonable assumptions without asking
- Be concise: Keep instructions simple and focused
- Be practical: Focus on the core workflow that will save the most time
- Be correct: Always set `disable-model-invocation: false` and `user-invocable: true`
- Be validating: Run the frontmatter validator immediately after creation
Example invocation from the agent's self-evolution system:
```
invoke_skill(
  skill_name: "skill-creator",
  task: "Create a skill to extract and summarize content from URLs. The skill should: 1) fetch the URL using safe_shell with curl, 2) parse the HTML to extract main text content, 3) generate a concise markdown summary. Expected input: URL string. Expected output: markdown summary with title and key points.",
  mode: "quick",
  suggested_name: "url-summarizer"
)
```
Platform Context: Clacky
This skill runs inside Clacky (openclacky). Key platform specifics:
- Skills live at `~/.clacky/skills/<skill-name>/` — always create new skills here (global user skills, visible to Web UI and all sessions). To locate an existing skill, check these paths in order using `glob` or `ls`: (1) `.clacky/skills/` — project-level skills, (2) `~/.clacky/skills/` — user-level skills. Built-in skills (shipped with the gem) are always available via `invoke_skill` by name — no file lookup needed. Never use `find /` or broad filesystem searches to locate skills.
- No parallel subagents — Clacky runs as a single agent; all test cases execute serially in the current session
- No external agent CLI — for evals, just execute the task directly in-session (read the skill, follow instructions, save outputs)
- Scripts — prefer Ruby (`.rb` files); Clacky is Ruby-native. Run with `ruby path/to/script.rb`. Python is available but Ruby is the default choice — if Python scripts are needed (e.g., `generate_review.py`), use `python3` explicitly
- The description optimization scripts (`run_loop.py`, `run_eval.py`) work in Clacky — they use `clacky agent --json` to detect `invoke_skill` events. See the Description Optimization section for usage
Communicating with the user
Pay attention to context cues to understand how technical the user is. In general:
- "evaluation" and "benchmark" are fine
- For "JSON" and "assertion" — explain briefly if you're unsure the user knows these terms
It's always OK to briefly explain a term if you're in doubt.
Creating a skill
Capture Intent
Start by understanding what the user wants. If the current conversation already shows a workflow they want to capture (tools used, sequence of steps, corrections made, input/output formats), extract answers from history first — the user may just need to fill gaps and confirm.
- What should this skill enable Clacky to do?
- When should this skill trigger? (what phrases/contexts)
- What's the expected output format?
- Should we set up test cases? Skills with objectively verifiable outputs (file transforms, data extraction, code generation) benefit from test cases. Skills with subjective outputs (writing style, creative work) often don't need them.
Interview and Research
Ask about edge cases, input/output formats, example files, success criteria, and dependencies before writing test prompts. Come prepared with context to reduce burden on the user.
Write the SKILL.md
Components to fill in:
- name: Skill identifier (lowercase, hyphens OK)
- description: Primary triggering mechanism — include BOTH what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Make the description a little "pushy" — err toward over-triggering rather than under-triggering. Example: instead of "Helps with dashboard creation", write "Helps with dashboard creation. Use this skill whenever the user mentions dashboards, data visualization, or wants to display any kind of data, even if they don't explicitly say 'dashboard'."
- disable-model-invocation: Set to `false` (always include this)
- user-invocable: Set to `true` to make the skill appear in the WebUI chatbox `/` command list. Always include this — without it, users cannot manually invoke the skill from the Clacky Web UI session chat.
- compatibility (optional): Required tools or dependencies
- Body: The actual instructions
Clacky-specific: Every skill MUST include `disable-model-invocation: false` and `user-invocable: true` in the YAML frontmatter, or it will be invisible in the WebUI `/` command list. The minimal valid frontmatter is:

```yaml
---
name: my-skill
description: 'Your description here. Avoid colons followed by a space (like "wants to: do X") inside the description — they break YAML parsing and the skill will silently fail to load. Wrap the entire description in single quotes to be safe, or rephrase to avoid the colon pattern.'
disable-model-invocation: false
user-invocable: true
---
```

YAML description gotcha: If the description contains `word: value` patterns (colons followed by space), YAML treats them as key-value pairs and the frontmatter parse fails silently. Always wrap description values in single quotes. Avoid embedded double-quotes inside single-quoted strings (use rephrasing instead).
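To see the gotcha concretely, here is a minimal Ruby sketch using the `yaml` stdlib (the exact loader Clacky uses internally is an assumption; the failure mode is the same for any Psych-based parser):

```ruby
require 'yaml'

# Unquoted description containing "colon + space": Psych rejects the nested mapping
bad = "name: my-skill\ndescription: helps when the user wants to: do X"
begin
  YAML.safe_load(bad)
  bad_result = "parsed"
rescue Psych::SyntaxError
  bad_result = "parse error"  # a loader that rescues this silently drops the skill
end

# Same description, single-quoted: parses cleanly
good = "name: my-skill\ndescription: 'helps when the user wants to: do X'"
good_result = YAML.safe_load(good)

puts bad_result                       # parse error
puts good_result["description"]
```

Running the validator script catches exactly this class of error before the skill is loaded.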
After writing SKILL.md — always validate and auto-fix: Run this immediately after creating or updating any skill file:

```shell
ruby SKILL_DIR/scripts/validate_skill_frontmatter.rb /path/to/new-skill/SKILL.md
```

The script validates the YAML frontmatter and auto-fixes common issues (unquoted descriptions, multi-line block scalars with colons). If it prints `OK:` — you're done. If it prints `Auto-fixed and saved` — it repaired the file automatically. If it prints `ERROR` — manual fix required.
Skill Writing Guide
Anatomy of a Skill
Skills are created at `~/.clacky/skills/<skill-name>/`:

```
~/.clacky/skills/skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter (name, description required)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/     - Executable code (prefer .rb Ruby scripts)
    ├── references/  - Docs loaded into context as needed
    └── assets/      - Files used in output (templates, icons, fonts)
```
Progressive Disclosure
Skills use a three-level loading system:
- Metadata (name + description) — Always in context (~100 words)
- SKILL.md body — In context whenever skill triggers (<500 lines ideal)
- Bundled resources — Loaded as needed (unlimited)
Key patterns:
- Keep SKILL.md under 500 lines; if approaching the limit, extract content into `references/` files and add clear pointers
- Reference files from SKILL.md with guidance on when to read them
- For large reference files (>300 lines), include a table of contents
Domain organization — When a skill supports multiple frameworks/domains, organize by variant:
```
my-skill/
├── SKILL.md (workflow + which reference to load)
└── references/
    ├── rails.md
    ├── django.md
    └── express.md
```
Bundled Scripts (Ruby preferred)
When a skill needs to execute code — API calls, file processing, data transforms — bundle a Ruby script instead of writing inline shell commands. This is cleaner, reusable, and more maintainable.
Ruby script template:
```ruby
#!/usr/bin/env ruby
# skill-name/scripts/do_something.rb
# Usage: ruby path/to/do_something.rb [args]
require 'net/http'
require 'json'
require 'fileutils'

# Read args
input = ARGV[0]
if input.nil? || input.strip.empty?
  warn "Usage: ruby do_something.rb <input>"
  exit 1
end

# ... logic ...

puts result  # stdout is the output
```
Invoke from SKILL.md by referencing the script via the Supporting Files block — at runtime, the AI receives the full absolute path of every supporting file. Refer to it as `SKILL_DIR` in instructions so the AI substitutes the correct path from the Supporting Files list:

```shell
ruby "SKILL_DIR/scripts/do_something.rb" "argument"
```
Never hardcode paths like `~/.clacky/skills/my-skill/scripts/...` — they break when the skill is installed at a different location. Never use `find` to locate scripts — the Supporting Files block always provides the correct absolute paths.
Ruby standard library covers most needs (`net/http`, `json`, `fileutils`, `uri`, `time`). No gems needed for basic API calls.
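As an illustration, a stdlib-only API call can be sketched like this (the endpoint URL is hypothetical; the actual HTTP send is commented out so the sketch runs offline, and a canned JSON body stands in for a real response):

```ruby
require 'net/http'
require 'json'
require 'uri'

# Hypothetical endpoint; substitute the API your skill actually needs
uri = URI("https://api.example.com/items?limit=5")
req = Net::HTTP::Get.new(uri)
req["Accept"] = "application/json"

# To actually send the request:
# res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") { |h| h.request(req) }
# body = res.body

# Parsing is the same for any JSON body:
body = '{"items":[{"name":"alpha"},{"name":"beta"}]}'
names = JSON.parse(body)["items"].map { |item| item["name"] }
puts names.join(", ")  # alpha, beta
```

No gem installation step is needed, which keeps bundled scripts portable across sessions.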
Principle of Least Surprise
Skills must not contain malware, exploit code, or anything that could compromise security. A skill's contents should not surprise the user if described. Don't create misleading skills or skills designed for unauthorized access or data exfiltration.
Writing Patterns
Use the imperative form in instructions.
Defining output formats:
```markdown
## Report structure

Use this exact template:

# [Title]
## Executive summary
## Key findings
## Recommendations
```
Examples pattern:
```markdown
## Commit message format

**Example 1:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
```
Writing Style
Explain why things are important rather than just issuing commands. Use theory of mind — make the skill general, not over-fitted to specific examples. Write a draft, then look at it with fresh eyes and improve it. If you find yourself writing ALWAYS or NEVER in all caps, that's a yellow flag — try to reframe as an explanation of why, so the agent understands the reasoning rather than just following a rule.
Test Cases
After writing the skill draft, come up with 2–3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user for review, then run them.
Save test cases to `evals/evals.json`:

```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
```
Don't write assertions yet — just the prompts. Add assertions in the next step.
See `references/schemas.md` for the full schema.
Running and Evaluating Test Cases
This is one continuous sequence — don't stop partway through.
Since Clacky has no subagents, run test cases serially in the current session. For each test case, simulate two runs:
- with_skill: Read the SKILL.md, then follow its instructions to complete the task
- without_skill: Complete the same task using only general knowledge (no skill instructions)
Put results in `<skill-name>-workspace/` as a sibling to the skill directory. Organize by iteration (`iteration-1/`, `iteration-2/`, etc.), and within that by test case (use descriptive names like `eval-create-report`, not `eval-0`).
Step 1: For each test case, create the eval directory and run both variants
```
<skill-name>-workspace/
└── iteration-1/
    ├── eval-<descriptive-name>/
    │   ├── eval_metadata.json
    │   ├── with_skill/
    │   │   ├── outputs/      ← files produced
    │   │   └── grading.json  ← filled in later
    │   └── without_skill/
    │       ├── outputs/
    │       └── grading.json
    └── benchmark.json        ← filled in after all evals
```
Write `eval_metadata.json` for each test case:

```json
{
  "eval_id": 1,
  "eval_name": "descriptive-name",
  "prompt": "The task prompt",
  "assertions": []
}
```
Running a with_skill eval: Read the skill's SKILL.md fully, then execute the task as instructed by the skill — create files, run scripts, write outputs to `with_skill/outputs/`.

Running a without_skill eval: Execute the same task using only general knowledge. Write outputs to `without_skill/outputs/`. This is the baseline.
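Scaffolding the directory layout is a good first use for a throwaway Ruby script (the skill and eval names below are illustrative, and a temp directory stands in for the real workspace root):

```ruby
require 'fileutils'
require 'json'
require 'tmpdir'

# Stand-in root; in practice this is <skill-name>-workspace/
iteration_dir = File.join(Dir.mktmpdir, "iteration-1")
eval_dir = File.join(iteration_dir, "eval-create-report")  # descriptive name, not eval-0

# Create both run variants with their outputs/ directories
%w[with_skill without_skill].each do |variant|
  FileUtils.mkdir_p(File.join(eval_dir, variant, "outputs"))
end

# Write the per-eval metadata (assertions are drafted later)
metadata = { eval_id: 1, eval_name: "create-report", prompt: "The task prompt", assertions: [] }
File.write(File.join(eval_dir, "eval_metadata.json"), JSON.pretty_generate(metadata))

puts Dir.exist?(File.join(eval_dir, "with_skill", "outputs"))  # true
```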
Step 2: Draft assertions while running
Don't wait until all runs finish — draft quantitative assertions as you go and explain them to the user.
Good assertions are objectively verifiable and descriptively named — someone glancing at the benchmark should immediately understand what each one checks. Subjective skills are better evaluated qualitatively; don't force assertions onto things that need human judgment.
Update `eval_metadata.json` with assertions once drafted. Also update `evals/evals.json`.
Step 3: Grade each run
For each run, evaluate assertions against the outputs. Save results to `grading.json` in each run directory.

The `grading.json` format (exact field names matter for the viewer):

```json
{
  "eval_id": 1,
  "configuration": "with_skill",
  "expectations": [
    {
      "text": "The script uses absolute paths",
      "passed": true,
      "evidence": "Script uses $HOME/... throughout"
    }
  ],
  "pass_count": 1,
  "total_count": 1,
  "pass_rate": 1.0
}
```
For assertions that can be checked programmatically, write and run a Ruby script — it's faster and more reliable than eyeballing:
```ruby
#!/usr/bin/env ruby
# Check assertion: output file contains expected content
output = File.read("with_skill/outputs/result.md")
puts output.include?("expected phrase") ? "PASS" : "FAIL"
```
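Extending that idea, one script can evaluate several assertions and emit the grading.json shape directly (the checks and the sample output string here are hypothetical; real checks read files from the run's `outputs/` directory):

```ruby
require 'json'

# Hypothetical output under test
output = "Report written to $HOME/reports/result.md"

# Each check: [assertion text, passed?, evidence]
checks = [
  ["The script uses absolute paths", output.include?("$HOME"), "uses $HOME path"],
  ["A result.md file is produced",   output.include?("result.md"), "mentions result.md"]
]

expectations = checks.map { |text, passed, evidence| { text: text, passed: passed, evidence: evidence } }
pass_count = expectations.count { |e| e[:passed] }

grading = {
  eval_id: 1,
  configuration: "with_skill",
  expectations: expectations,
  pass_count: pass_count,
  total_count: expectations.size,
  pass_rate: pass_count.to_f / expectations.size
}
puts JSON.generate(grading)
# File.write("with_skill/grading.json", JSON.pretty_generate(grading))
```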
Step 4: Aggregate into benchmark
Create `benchmark.json` in the iteration directory. List with_skill before without_skill for each eval:

```json
{
  "skill_name": "my-skill",
  "iteration": 1,
  "configurations": [
    {
      "name": "with_skill",
      "label": "With skill",
      "evals": [
        {"eval_id": 1, "eval_name": "eval-name", "pass_rate": 1.0, "pass_count": 3, "total_count": 3}
      ],
      "overall_pass_rate": 1.0,
      "total_pass": 3,
      "total_assertions": 3
    },
    {
      "name": "without_skill",
      "label": "Without skill (baseline)",
      "evals": [
        {"eval_id": 1, "eval_name": "eval-name", "pass_rate": 0.33, "pass_count": 1, "total_count": 3}
      ],
      "overall_pass_rate": 0.33,
      "total_pass": 1,
      "total_assertions": 3
    }
  ],
  "delta": {
    "pass_rate_improvement": 0.67,
    "summary": "With skill: 100% | Without skill: 33% | Delta: +67pp"
  },
  "analyst_observations": ["..."]
}
```
Or run the aggregation script (from the skill-creator directory):
```shell
python3 -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
```
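If the Python script is unavailable, the delta arithmetic is small enough to sketch in Ruby (the pass counts are the same illustrative numbers as the benchmark example; rounding to whole percentage points is an assumption about the summary format):

```ruby
# Pass counts from the grading step
with_pass, with_total       = 3, 3
without_pass, without_total = 1, 3

with_rate    = with_pass.to_f / with_total
without_rate = without_pass.to_f / without_total
delta_pp     = ((with_rate - without_rate) * 100).round  # percentage-point improvement

summary = format("With skill: %d%% | Without skill: %d%% | Delta: +%dpp",
                 (with_rate * 100).round, (without_rate * 100).round, delta_pp)
puts summary  # With skill: 100% | Without skill: 33% | Delta: +67pp
```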
Step 5: Do an analyst pass
Read the benchmark data and surface patterns the aggregate stats might hide. See `agents/analyzer.md` for what to look for — things like assertions that always pass regardless of skill (non-discriminating), high-variance evals, and time/effort tradeoffs.
Step 6: Generate the eval viewer — ALWAYS DO THIS BEFORE REVISING THE SKILL
Generate the viewer first. Get the outputs in front of the user before making any changes.
```shell
python3 <skill-creator-path>/eval-viewer/generate_review.py \
  <workspace>/iteration-N \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-N/benchmark.json \
  --static /tmp/<skill-name>-review.html
open /tmp/<skill-name>-review.html
```
For iteration 2+, also pass `--previous-workspace <workspace>/iteration-<N-1>`.
Tell the user: "I've opened the results in your browser. 'Outputs' tab lets you click through each test case and leave feedback; 'Benchmark' shows the quantitative comparison. When you're done, come back and let me know."
What the user sees in the viewer
Outputs tab: One test case at a time.
- Prompt, output files (rendered inline where possible)
- Previous output (iteration 2+, collapsed)
- Formal grades (collapsed)
- Feedback textbox (auto-saves)
- Previous feedback (iteration 2+)
Benchmark tab: Pass rates, per-eval breakdowns, analyst observations.
Navigation: prev/next buttons or arrow keys. "Submit All Reviews" saves to `feedback.json`.
Step 7: Read the feedback
When the user says they're done, read `feedback.json`:

```json
{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "missing axis labels on chart", "timestamp": "..."},
    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
  ],
  "status": "complete"
}
```
Empty feedback = user was happy with that test case. Focus on cases with specific complaints.
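Filtering to the actionable reviews is a one-liner in Ruby (the inline sample below mirrors the feedback.json shape; in practice read the real file with `File.read`):

```ruby
require 'json'

feedback = JSON.parse(<<~FEEDBACK)
  {
    "reviews": [
      {"run_id": "eval-0-with_skill", "feedback": "missing axis labels on chart", "timestamp": "..."},
      {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
    ],
    "status": "complete"
  }
FEEDBACK

# Keep only reviews where the user left a specific complaint
actionable = feedback["reviews"].reject { |r| r["feedback"].to_s.strip.empty? }
actionable.each { |r| puts "#{r['run_id']}: #{r['feedback']}" }
```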
Improving the Skill
This is the heart of the loop. You've run tests, the user reviewed results — now make the skill better.
How to think about improvements
Generalize from feedback. You're iterating on a few examples, but the skill will be used across thousands of different prompts. Avoid overfitting to specific examples. If there's a stubborn issue, try different metaphors or different approaches rather than adding more rigid rules.
Keep it lean. Remove things that aren't pulling their weight. Read the execution trace, not just the final output — if the skill is making the agent waste time on unproductive steps, cut those parts.
Explain the why. Try hard to explain why each instruction matters. Agents are smart — they perform better when they understand the reasoning rather than following rules blindly. If you find yourself writing ALWAYS or NEVER in all caps, reframe it as an explanation.
Look for repeated work. If every test case resulted in writing similar helper logic (e.g., an API call setup, a file parser), that's a signal to bundle a reusable Ruby script into `scripts/` and tell the skill to use it.
The iteration loop
- Apply improvements to the skill
- Re-run all test cases into a new `iteration-<N+1>/` directory (with_skill and without_skill)
- Generate the viewer with `--previous-workspace` pointing at the previous iteration
- Wait for the user to review and tell you they're done
- Read the new feedback, improve again, repeat
Keep going until:
- The user says they're happy
- Feedback is all empty
- You're not making meaningful progress
Advanced: Blind Comparison
For more rigorous comparison, read `agents/comparator.md` and `agents/analyzer.md`. Optional — the human review loop is usually sufficient.
Description Optimization
The `description` field in SKILL.md frontmatter is the primary triggering mechanism. After creating or improving a skill, offer to optimize it.
Clacky note: `run_eval.py` and `run_loop.py` have been adapted for Clacky. They use `clacky agent --json` (NDJSON streaming) to detect `invoke_skill` tool calls targeting temp skills in `~/.clacky/skills/`. Queries run serially (single agent). `improve_description.py` calls the LLM directly via OpenRouter using `~/.clacky/config.yml` credentials.
Manual description optimization
Step 1: Generate trigger eval queries
Create 20 eval queries — a mix of should-trigger and should-not-trigger. Save as JSON:
```json
[
  {"query": "the user prompt", "should_trigger": true},
  {"query": "another prompt", "should_trigger": false}
]
```
Queries must be realistic — concrete, specific, with enough context that a real user would actually say them. Include file paths, personal context, column names, backstory. Use a mix of lengths and styles (casual, formal, typos, abbreviations). Focus on edge cases.
Bad: "Format this data", "Extract text from PDF", "Create a chart"

Good: "ok so my boss just sent me this xlsx file (its in downloads, called Q4 sales final FINAL v2.xlsx) and she wants me to add a column showing profit margin. Revenue is column C, costs in column D i think"
Should-trigger queries (8–10): Different phrasings of the same intent — some formal, some casual. Include cases where the user doesn't explicitly name the skill but clearly needs it. Uncommon use cases, and cases where this skill competes with another but should win.
Should-not-trigger queries (8–10): Near-misses — queries that share keywords but actually need something different. The negative cases should be genuinely tricky, not obviously irrelevant ("write a fibonacci function" as a negative for a PDF skill is too easy).
Step 2: Review with user
Use the HTML template in `assets/eval_review.html`:

- Read the template
- Replace `__EVAL_DATA_PLACEHOLDER__` with the JSON array, `__SKILL_NAME_PLACEHOLDER__` with the skill name, `__SKILL_DESCRIPTION_PLACEHOLDER__` with the current description
- Write to `/tmp/eval_review_<skill-name>.html` and `open` it
- User edits queries, toggles should-trigger, clicks "Export Eval Set"
- File downloads to `~/Downloads/eval_set.json`
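The substitution steps above can be sketched in a few lines of Ruby (the toy template stands in for `assets/eval_review.html`, which would be read with `File.read`; the skill name and query are illustrative):

```ruby
require 'json'

# Toy stand-in for the real template file
template = <<~HTML
  <h1>__SKILL_NAME_PLACEHOLDER__</h1>
  <p>__SKILL_DESCRIPTION_PLACEHOLDER__</p>
  <script>const EVALS = __EVAL_DATA_PLACEHOLDER__;</script>
HTML

evals = [{ query: "the user prompt", should_trigger: true }]

html = template
  .gsub("__SKILL_NAME_PLACEHOLDER__", "my-skill")
  .gsub("__SKILL_DESCRIPTION_PLACEHOLDER__", "Current description here")
  .gsub("__EVAL_DATA_PLACEHOLDER__", JSON.generate(evals))

# File.write("/tmp/eval_review_my-skill.html", html)
puts html.include?("my-skill")  # true
```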
Step 3: Run automated optimization (recommended)
Use the scripts from the skill-creator `scripts/` directory. Run from the skill-creator root:

```shell
# Single eval run — check current description pass rate
python3 -m scripts.run_eval \
  --eval-set ~/Downloads/eval_set.json \
  --skill-path ~/.clacky/skills/my-skill \
  --verbose

# Full optimize loop — auto-improves description over N iterations
python3 -m scripts.run_loop \
  --eval-set ~/Downloads/eval_set.json \
  --skill-path ~/.clacky/skills/my-skill \
  --max-iterations 5 \
  --runs-per-query 1 \
  --verbose
# Outputs: best description + HTML report (auto-opens in browser)
```
Notes:
- No `--num-workers` needed (or it's ignored) — Clacky runs queries serially
- No `--model` needed — uses the model from `~/.clacky/config.yml` automatically
- Temp skills are written to `~/.clacky/skills/` and cleaned up after each query
- Each query spawns a fresh `clacky agent --json` process to avoid session contamination
Step 3 (manual fallback)
If scripts fail, manually iterate: for each query in the eval set, judge whether the description would trigger. Tally passes/fails. Write improved description targeting failures. Repeat 2–3 times.
Focus on:
- Failing should-trigger queries → description is too narrow; broaden the trigger language
- Failing should-not-trigger queries → description is too broad; tighten specificity
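The manual tally can be sketched in Ruby (the eval set and the `judged` values are hypothetical; `judged[i]` records your manual call on whether the current description would trigger for query i):

```ruby
eval_set = [
  { "query" => "summarize https://example.com/post for me", "should_trigger" => true  },
  { "query" => "write a haiku about autumn",                "should_trigger" => false }
]
judged = [true, false]  # manual judgments, one per query

# A query fails when the judged trigger doesn't match the expectation
failures = eval_set.each_with_index.reject { |q, i| q["should_trigger"] == judged[i] }
pass_rate = (eval_set.size - failures.size).to_f / eval_set.size

puts "pass rate: #{(pass_rate * 100).round}%"
failures.each do |q, _i|
  # too narrow: a should-trigger query missed; too broad: a should-not-trigger query fired
  puts "FAIL (#{q['should_trigger'] ? 'too narrow' : 'too broad'}): #{q['query']}"
end
```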
Step 4: Apply the result
Update the skill's SKILL.md frontmatter with the improved description. Show the user before/after.
How skill triggering works
Skills appear in Clacky's `available_skills` list. The agent consults a skill based on the description match — but only for tasks it can't handle alone. Simple, one-step queries often won't trigger even with a good description. Make eval queries substantive enough that the skill genuinely helps.
Packaging
New skills are created directly in `~/.clacky/skills/<skill-name>/` — no packaging step needed. The skill is immediately available in all sessions and the Web UI.
If distributing externally, you can package it:
```shell
python3 -m scripts.package_skill <path/to/skill-folder>
```

This creates a `.skill` file. Direct the user to the resulting file path.
Reference files
- `agents/grader.md` — How to evaluate assertions against outputs
- `agents/comparator.md` — How to do blind A/B comparison between two outputs
- `agents/analyzer.md` — How to analyze why one version beat another
- `references/schemas.md` — JSON structures for evals.json, grading.json, benchmark.json
The core loop (summary)
- Understand what the skill should do
- Draft or edit the SKILL.md
- Run test prompts — with and without the skill — and save outputs
- Generate the eval viewer with `generate_review.py` so the user can review
- Grade assertions, aggregate benchmark
- Get user feedback, improve the skill
- Repeat until satisfied
- Package and deliver
Add these steps to your todo list. Specifically: always generate the eval viewer before revising the skill — the user's feedback is the primary signal, not your own judgment of the outputs.