openclaw-superpowers · skill-trigger-tester

Scores a skill's description field against sample user prompts to predict whether OpenClaw will correctly trigger it — before you publish or install.

Install

Source · Clone the upstream repo:

git clone https://github.com/ArchieIndian/openclaw-superpowers

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/ArchieIndian/openclaw-superpowers "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/core/skill-trigger-tester" ~/.claude/skills/archieindian-openclaw-superpowers-skill-trigger-tester && rm -rf "$T"

OpenClaw · Install into ~/.openclaw/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/ArchieIndian/openclaw-superpowers "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/core/skill-trigger-tester" ~/.openclaw/skills/archieindian-openclaw-superpowers-skill-trigger-tester && rm -rf "$T"

Manifest: skills/core/skill-trigger-tester/SKILL.md

Source content

Skill Trigger Tester

What it does

OpenClaw maps user intent to skills by matching the user's message against each skill's description: field. A weak description means your skill silently never fires; a description that's too broad means it fires when it shouldn't.

Skill Trigger Tester helps you validate the trigger quality of a skill's description before publishing. You give it:

  • The description string you're testing
  • A set of "should fire" prompts (true positives)
  • A set of "should not fire" prompts (true negatives)

It scores precision and recall, and gives an overall trigger-quality grade (A–F) plus actionable suggestions.

When to invoke

  • Before publishing any new skill to ClawHub
  • When a skill you expect to trigger isn't firing
  • When a skill keeps firing on irrelevant prompts
  • Inside the create-skill workflow (Step 5: validation)

Scoring model

The tool uses a keyword + semantic overlap heuristic against the description field:

Metric      Meaning
Recall      % of "should fire" prompts that would match
Precision   % of matches that are actually "should fire"
F1          Harmonic mean of recall and precision

Grade thresholds:

Grade   F1
A       ≥ 0.85
B       ≥ 0.70
C       ≥ 0.55
D       ≥ 0.40
F       < 0.40
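The tool's internal matcher isn't documented here, but the metrics and grade cut-offs are. A minimal sketch of the scoring, assuming matching reduces to simple token overlap with the description (the real heuristic also weighs semantic similarity; matches and score are hypothetical helper names, the thresholds come from the grade table):

```python
def matches(description: str, prompt: str, min_overlap: int = 2) -> bool:
    """Toy stand-in for the keyword-overlap heuristic (an assumption)."""
    desc_tokens = set(description.lower().split())
    prompt_tokens = set(prompt.lower().split())
    return len(desc_tokens & prompt_tokens) >= min_overlap

def score(description, should_fire, should_not_fire):
    """Compute recall, precision, F1, and the A-F grade for a description."""
    tp = sum(matches(description, p) for p in should_fire)      # correct fires
    fp = sum(matches(description, p) for p in should_not_fire)  # spurious fires
    recall = tp / len(should_fire) if should_fire else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # Grade thresholds from the table above
    for grade, cutoff in (("A", 0.85), ("B", 0.70), ("C", 0.55), ("D", 0.40)):
        if f1 >= cutoff:
            return precision, recall, f1, grade
    return precision, recall, f1, "F"
```

With a description that covers every "should fire" prompt and none of the others, this yields precision = recall = F1 = 1.0 and grade A; a description sharing no tokens with any prompt bottoms out at grade F.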

How to use

python3 test.py --description "Diagnoses skill discovery failures" \
  --should-fire "why isn't my skill loading" \
               "my skill disappeared from the registry" \
               "check if my skills are healthy" \
  --should-not-fire "write a skill" \
                    "install superpowers" \
                    "review my code"

python3 test.py --file skill-spec.yaml    # Load test cases from YAML file
python3 test.py --format json             # Machine-readable output

Test spec file format

description: "Diagnoses skill discovery failures — YAML parse errors, path violations"
should_fire:
  - "why isn't my skill loading"
  - "my skill disappeared"
  - "check skill health"
should_not_fire:
  - "write a new skill"
  - "install openclaw"

Procedure

Step 1 — Write your test cases

For each skill you're testing, list 3–5 prompts that should trigger it and 3–5 that should not. Be honest about edge cases.

Step 2 — Run the scorer

python3 test.py --description "<your description>" \
  --should-fire "..." --should-not-fire "..."

Step 3 — Interpret results

  • Grade A/B: description is well-calibrated. Publish.
  • Grade C: borderline — add more specific keywords to the description, or narrow the wording.
  • Grade D/F: description is too vague or uses jargon the user won't say. Rewrite and retest.

Step 4 — Iterate

Try alternative descriptions and compare scores side-by-side using --compare.

Step 5 — Add test file to the skill directory

Commit the trigger-tests.yaml spec alongside the skill. Future contributors can run it to verify trigger quality hasn't regressed.

Common mistakes

  • Too generic: "Helps with skills" — will either never fire or fire on everything
  • Technical jargon: "Validates SKILL.md frontmatter schema coherence" — users don't say this
  • Action + object only: "Creates skills" — add when/why context
  • Missing synonyms: if users might say "check", "verify", or "test", the description needs to capture that semantic range
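The synonym point is easy to see concretely. Assuming matching is roughly token overlap (an assumption about the heuristic; shared is a hypothetical helper), a description that only says "test" shares nothing but "skill" with a prompt phrased as "check":

```python
def shared(description: str, prompt: str) -> set[str]:
    """Tokens common to the description and a user prompt."""
    return set(description.lower().split()) & set(prompt.lower().split())

narrow = "test skill descriptions"
broad = "test check verify skill descriptions"

# "check my skill" shares only "skill" with the narrow description,
# but gains the verb once synonyms are added:
# shared(narrow, "check my skill") -> {"skill"}
# shared(broad, "check my skill")  -> {"check", "skill"}
```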

Output example

Skill Trigger Quality — skill-doctor
─────────────────────────────────────────────
Description: "Diagnoses silent skill discovery failures..."

Should fire (5 prompts):    4 / 5 matched   recall    = 0.80
Should not fire (5 prompts): 1 / 5 matched  precision = 0.80
F1 score: 0.80   Grade: B

⚠ 1 false negative: "check if skills are healthy"
   Suggestion: Add "healthy", "health", or "check" to the description.

⚠ 1 false positive: "check my code"
   Suggestion: Narrow description to avoid generic "check" overlap.