# Skillprobe write-tests

```bash
git clone https://github.com/Anyesh/skillprobe
```

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/Anyesh/skillprobe "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/write-tests" ~/.claude/skills/anyesh-skillprobe-write-tests-6a952c && rm -rf "$T"
```

`.claude/skills/write-tests/SKILL.md`

# Writing skillprobe Tests
Generate YAML test files for skillprobe. Always read the skill content first before writing tests.
When asked to write tests for a skill, generate BOTH files unless the user asks for only one:
- A behavioral test file (for `skillprobe run`)
- An activation test file (for `skillprobe activation`)
## Behavioral Tests (for `skillprobe run`)

Test what the model DOES when the skill is loaded.
```yaml
harness: claude-code          # or cursor
model: claude-haiku-4-5-20251001
timeout: 120
skill: ./path/to/skill        # directory with SKILL.md, or a .md file
scenarios:
  - name: "descriptive name"
    steps:
      - prompt: "what the user would type"
        assert:
          - type: contains
            value: "expected text"
```
## Activation Tests (for `skillprobe activation`)

Test whether the skill LOADS for relevant prompts and stays quiet for irrelevant ones. This is a separate YAML format from behavioral tests.
```yaml
harness: claude-code
model: claude-haiku-4-5-20251001
timeout: 60
skill: ./path/to/skill
activation:
  skill_name: my-skill
  should_activate:
    - "prompt that should trigger this skill"
    - "another relevant prompt"
  should_not_activate:
    - "completely unrelated prompt"
    - "another unrelated prompt"
```
The `activation` key replaces `scenarios`. These are two different file formats loaded by different commands (`skillprobe run` vs `skillprobe activation`). Never put both keys in the same file: skillprobe will either ignore one silently or fail to parse the file.
Bad:

```yaml
skill: ./skills/commit
activation:
  skill_name: commit
  should_activate:
    - "commit my changes"
scenarios:
  - name: "commit happens"
    steps:
      - prompt: "commit"
        assert:
          - type: contains
            value: "commit"
```
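Good (the same coverage split into two files, one per format; file names are illustrative):

```yaml
# tests/commit-run.yaml -- for `skillprobe run`
harness: claude-code
model: claude-haiku-4-5-20251001
skill: ./skills/commit
scenarios:
  - name: "commit happens"
    steps:
      - prompt: "commit"
        assert:
          - type: contains
            value: "commit"
```

```yaml
# tests/commit-activation.yaml -- for `skillprobe activation`
harness: claude-code
model: claude-haiku-4-5-20251001
skill: ./skills/commit
activation:
  skill_name: commit
  should_activate:
    - "commit my changes"
  should_not_activate:
    - "what's the weather today"
```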
Pick one format per file.
## Testing skill combinations

Some tests need to load more than one skill at a time, either because you are probing interactions between skills or because the scenario legitimately needs both. Use a `skills:` list at the suite level instead of the singular `skill:`:
```yaml
harness: claude-code
model: claude-haiku-4-5-20251001
skills:
  - ./skills/clean-code
  - ./skills/bad-python
scenarios:
  - name: "clean-code wins over bad-python on function shape"
    steps:
      - prompt: "Write a Python function called add that takes two integers"
        runs: 5
        min_pass_rate: 0.8
        assert:
          - type: regex
            value: 'def add\([^)]*: *int[^)]*: *int[^)]*\) *-> *int'
          - type: not_contains
            value: '"""'
```
`skill:` (singular) and `skills:` (plural list) are mutually exclusive, and the matrix form below cannot coexist with either. Pick exactly one of the three per file.
## Sweeping one skill against many with a matrix

When you want to test one base skill against a list of other skills without writing a separate file per pairing, use a `matrix:` block at suite level. It expands into N scenario runs, one per `pair_with` entry, loading `base` together with that entry:
```yaml
harness: claude-code
model: claude-haiku-4-5-20251001
matrix:
  base: ./skills/my-new-skill
  pair_with:
    - ./skills/popular-a
    - ./skills/popular-b
    - ./skills/popular-c
scenarios:
  - name: "my-new-skill still works when paired with a popular skill"
    steps:
      - prompt: "the usual request for my-new-skill"
        runs: 5
        min_pass_rate: 0.8
        assert:
          - type: regex
            value: "expected pattern"
```
Use `matrix:` when you have one base skill and a reference set of others to sweep it against. Use a plain `skills:` list when you want a specific hand-picked combination. Do not use both in the same file.
## Detecting real regressions with baseline mode

A matrix run reports pass or fail per pairing but does not tell you whether a combined failure is a genuine regression or just natural model variance. For that, tell the user to run the same matrix YAML with `skillprobe run ... --baseline --baseline-runs 20`. Baseline mode runs every pairing in three configurations per scenario (base skill alone, paired skill alone, combined) and classifies each assertion as one of regression, shared_failure, flaky, or ok.
The test author does not write anything special for baseline mode; it works on any YAML that already has a `matrix:` block. Your job is to set realistic `runs:` and `min_pass_rate:` values on the scenarios, and when you hand the test to the user, tell them that baseline mode is the right command to run when they want to separate real combination bugs from variance noise. Baseline mode is expensive (roughly three times a normal matrix run), so treat it as a nightly or on-demand audit, not a per-PR check.
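For example, against the matrix file from the previous section:

```bash
skillprobe run tests/my-matrix.yaml --baseline --baseline-runs 20
```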
## Before picking min_pass_rate, measure

`skillprobe measure test.yaml --runs 20` runs each scenario 20 times and reports per-assertion pass rates with 95 percent Wilson confidence intervals and a variance classification of deterministic, probabilistic, noisy, or unreliable. Run this against your draft test before committing to a `min_pass_rate` value. If the observed pass rate is 70 percent with a CI spanning 0.40 to 0.89, setting `min_pass_rate: 0.8` will produce a scenario that fails roughly half the time on variance alone, and users will distrust the test.
Measure first, then pick a threshold from the data.
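A minimal workflow sketch (the file name is illustrative):

```bash
# 1. Measure the draft suite: 20 runs per scenario, per-assertion pass rates
#    with Wilson confidence intervals and a variance classification.
skillprobe measure tests/my-skill.yaml --runs 20

# 2. Set min_pass_rate in the YAML from the measured rates, then run the suite.
skillprobe run tests/my-skill.yaml
```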
## Assertion Types
| Type | Checks | Example |
|---|---|---|
| `contains` | Response includes value (case-insensitive) | `value: "expected text"` |
| `not_contains` | Response does NOT include value | `value: '"""'` |
| `regex` | Response matches pattern | `value: "-> int"` |
| | A tool was invoked | |
| | A skill was loaded | |
| | File exists in workspace (after block) | |
| | File contains text (path:content) | |
Any assertion supports `negate: true` to invert it.
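For example, an assert block combining the documented types (the values are illustrative):

```yaml
assert:
  - type: contains
    value: "deploy-k8s.sh"
  - type: regex
    value: "-> int"
  - type: contains
    value: "helm install"
    negate: true   # inverted: per the note above, this should behave like not_contains
```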
## Multi-Run

Skills are probabilistic. Use `runs` and `min_pass_rate` to measure reliability:

```yaml
steps:
  - prompt: "Write a function with type hints"
    runs: 5
    min_pass_rate: 0.8
    assert:
      - type: regex
        value: "-> "
```
## Pass rate thresholds

`min_pass_rate: 0.0` is never correct. A zero threshold means the scenario passes no matter what the model does, which makes the assertion vacuous. Use these defaults:

- Deterministic rule (the skill either followed the rule or it didn't, no ambiguity expected): `min_pass_rate: 1.0`
- Probabilistic rule (model variance expected, most runs should follow the rule): `min_pass_rate: 0.8`
- Genuinely uncertain contest (two competing rules, either could win on any run): `min_pass_rate: 0.5`
If you cannot justify a `min_pass_rate` of at least 0.5, the scenario is probably testing the wrong thing.
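A sketch putting the three defaults side by side (prompts and patterns are illustrative):

```yaml
scenarios:
  - name: "deterministic: never emits a TODO comment"
    steps:
      - prompt: "write a small helper function"
        runs: 5
        min_pass_rate: 1.0
        assert:
          - type: not_contains
            value: "TODO"
  - name: "probabilistic: usually adds an int return type hint"
    steps:
      - prompt: "write a function that adds two numbers"
        runs: 5
        min_pass_rate: 0.8
        assert:
          - type: regex
            value: "-> int"
  - name: "contested: clean-code vs bad-python on docstrings"
    steps:
      - prompt: "write a small helper function"
        runs: 6
        min_pass_rate: 0.5
        assert:
          - type: not_contains
            value: '"""'
```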
## Writing Good Domain Skill Tests
Domain skills teach project-specific knowledge. Test for YOUR specifics, not generic behavior.
Bad: `value: "deploy"` (any deployment knowledge passes)

Good: `value: "deploy-k8s.sh"` with `not_contains: "helm install"` (checks your exact tool)
For every domain skill, cover:
- Correct commands: your specific tools, not generic alternatives
- Specific details: URLs, config values, architecture names from the skill
- Negative cases: tools or patterns your project doesn't use
- Activation: prompts from the skill's trigger words, plus unrelated prompts that should not trigger it
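Putting these together for the hypothetical deploy skill above (script name, prompts, and paths are illustrative):

```yaml
harness: claude-code
model: claude-haiku-4-5-20251001
skill: ./skills/deploy
scenarios:
  - name: "uses deploy-k8s.sh, not helm"
    steps:
      - prompt: "deploy the service to staging"
        runs: 5
        min_pass_rate: 0.8
        assert:
          - type: contains
            value: "deploy-k8s.sh"
          - type: not_contains
            value: "helm install"
```

The activation prompts (relevant ones like "deploy the service to staging", plus unrelated ones that should stay quiet) go in a separate activation file, as described above.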
## Strong vs weak assertions
An assertion is only useful if it discriminates between "skill was followed" and "skill was not followed." An assertion that would pass regardless of model behavior is noise.
Weak (passes for any relevant response, no signal):

- `contains: "import"`: any Python code mentions imports
- `regex: "result ="`: any variable assignment passes
- `contains: "function"`: any explanation mentions functions

Strong (only passes when the specific rule was followed):

- `regex: "^from [a-z]+ import [a-z]"` for "use from-imports in alphabetical order"
- `regex: "def \\w+\\([^)]*: *int[^)]*\\) *-> *int"` for "must use int type hints on params and return"
- `not_contains: '"""'` for "no docstrings on small helpers"
Before writing an assertion, ask: "what response would this assertion accept that should not be accepted?" If the answer is "any reasonable response," the assertion is too weak.
## Avoid contradictory scenarios

Each scenario should have exactly one clear expected outcome. Two scenarios with the same prompt asserting opposite outcomes cannot both pass, so the suite produces no signal about which rule the skill should enforce. If you are probing a contest between two behaviors, use ONE scenario with a probabilistic `min_pass_rate`, not two scenarios with mutually exclusive assertions.
Bad:

```yaml
- name: "uses type hints"
  steps:
    - prompt: "write a function that adds two numbers"
      assert:
        - type: regex
          value: "-> int"
- name: "no type hints"
  steps:
    - prompt: "write a function that adds two numbers"
      assert:
        - type: not_contains
          value: "-> "
```
Good:

```yaml
- name: "uses int type hints at least 80% of the time"
  steps:
    - prompt: "write a function that adds two numbers"
      runs: 5
      min_pass_rate: 0.8
      assert:
        - type: regex
          value: "-> int"
```
## Running

```bash
skillprobe run tests/my-skill.yaml
skillprobe run tests/my-skill.yaml --harness cursor --model auto
skillprobe run tests/my-matrix.yaml --baseline --baseline-runs 20
skillprobe measure tests/my-skill.yaml --runs 20
skillprobe activation tests/my-activation.yaml
```