skilllibrary / skill-testing-harness

Create test infrastructure for a skill: trigger tests (positive and negative cases in JSONL format), output format tests, and baseline comparisons. Use this when building evals for a new or refined skill, when a skill lacks an evals/ directory, or when the user says "create tests for this skill", "set up evals", or "build a test harness". Do not use for running existing tests (use skill-eval-runner), benchmarking variants (use skill-benchmarking), or when tests exist and just need minor updates.

Install

Source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary

Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/03-meta-skill-engineering/skill-testing-harness" ~/.claude/skills/merceralex397-collab-skilllibrary-skill-testing-harness && rm -rf "$T"

Manifest: 03-meta-skill-engineering/skill-testing-harness/SKILL.md

Source content

Purpose

Creates test infrastructure for a skill: trigger tests (positive and negative), output tests (format and content), and baseline comparisons. The harness enables repeatable evaluation during development and refinement.

When to use this skill

Use when:

  • User says "create tests for this skill", "set up evals", or "build a test harness"
  • New skill being created needs test coverage
  • Skill lacks evals/ directory or test fixtures
  • Skill refinement requires regression tests

Do NOT use when:

  • Running existing tests (use skill-eval-runner)
  • Evaluating quality (use skill-evaluation)
  • Benchmarking performance (use skill-benchmarking)
  • Tests exist and just need updating (edit directly)

Operating procedure

Step 1 — Analyze skill to test

Read the target SKILL.md completely and extract (a parsing sketch follows this list):

  • Trigger signals from the description field
  • Positive cases from "When to use" section
  • Negative cases from "Do NOT use when" section
  • Expected output format from "Output defaults" section
  • Quality criteria from the procedure steps
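
A minimal extraction sketch, assuming the target SKILL.md carries a YAML frontmatter description field and uses "##" markdown headings for its sections; both are assumptions about typical SKILL.md layout, not something this skill guarantees:

# Sketch: pull the description and section bodies out of a SKILL.md.
# Assumes "---"-delimited YAML frontmatter and "## " section headings;
# adjust the patterns if the target skill is laid out differently.
import re
from pathlib import Path

def analyze_skill(path: str) -> dict:
    text = Path(path).read_text()
    desc = ""
    fm = re.search(r"\A---\n(.*?)\n---", text, re.DOTALL)
    if fm:
        m = re.search(r"^description:\s*(.+)$", fm.group(1), re.MULTILINE)
        if m:
            desc = m.group(1).strip()
    # Section bodies keyed by heading, e.g. "When to use this skill".
    sections = {}
    for m in re.finditer(r"^##\s+([^\n]+)\n(.*?)(?=^##\s|\Z)",
                         text, re.MULTILINE | re.DOTALL):
        sections[m.group(1).strip()] = m.group(2).strip()
    return {"description": desc, "sections": sections}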

Step 2 — Create trigger-positive.jsonl

File: evals/trigger-positive.jsonl

Each line is a JSON object representing a prompt that SHOULD activate the skill. Include 8-15 cases covering core use cases, edge cases, and paraphrasings.

{"prompt": "Create a new skill for handling PDF extraction", "expected": "trigger", "category": "core", "notes": "Direct request matching primary use case"}
{"prompt": "Write a SKILL.md that helps with code review", "expected": "trigger", "category": "core", "notes": "Explicit skill authoring request"}
{"prompt": "I need a reusable procedure for database migrations", "expected": "trigger", "category": "indirect", "notes": "Implicit skill creation — repeated task pattern"}
{"prompt": "Can you make a skill that handles our deploy workflow?", "expected": "trigger", "category": "paraphrase", "notes": "Casual phrasing"}
{"prompt": "Package this workflow as a skill for the team", "expected": "trigger", "category": "edge", "notes": "Packaging intent implies creation first"}

Fields:

| Field | Required | Description |
|-------|----------|-------------|
| prompt | Yes | The user message that should trigger the skill |
| expected | Yes | Always "trigger" for positive cases |
| category | Yes | One of: core, indirect, paraphrase, edge |
| notes | No | Why this case should trigger |

Step 3 — Create trigger-negative.jsonl

File: evals/trigger-negative.jsonl

Each line is a JSON object representing a prompt that should NOT activate the skill. Include 8-15 cases covering adjacent skills, out-of-scope tasks, and common confusion cases. A schema-validation sketch for both trigger files follows the field table below.

{"prompt": "Fix the trigger description on this skill", "expected": "no_trigger", "better_skill": "skill-trigger-optimization", "notes": "Trigger fix, not full creation"}
{"prompt": "This skill is broken, the output is wrong", "expected": "no_trigger", "better_skill": "skill-refinement", "notes": "Refinement, not test creation"}
{"prompt": "Run the existing eval suite", "expected": "no_trigger", "better_skill": "skill-evaluation", "notes": "Running tests, not building them"}
{"prompt": "Write a Python function to parse JSON", "expected": "no_trigger", "better_skill": null, "notes": "General coding, not skill engineering"}
{"prompt": "Compare these two skill variants", "expected": "no_trigger", "better_skill": "skill-benchmarking", "notes": "Benchmarking, not test infrastructure"}

Fields:

| Field | Required | Description |
|-------|----------|-------------|
| prompt | Yes | The user message that should NOT trigger the skill |
| expected | Yes | Always "no_trigger" for negative cases |
| better_skill | Yes | Name of the correct skill, or null if no skill matches |
| notes | No | Why this case should not trigger |
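
A minimal validation sketch for both trigger files, written against the schemas above; the script is an illustration of how the field rules could be enforced, not part of the harness this skill emits:

# Sketch: validate the trigger JSONL files against the field schemas above.
import json
import sys

SCHEMAS = {
    "evals/trigger-positive.jsonl": {
        "required": {"prompt", "expected", "category"},
        "expected_value": "trigger",
        "categories": {"core", "indirect", "paraphrase", "edge"},
    },
    "evals/trigger-negative.jsonl": {
        "required": {"prompt", "expected", "better_skill"},
        "expected_value": "no_trigger",
        "categories": None,  # negative cases carry no category field
    },
}

def validate(path: str, schema: dict) -> list[str]:
    errors = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                case = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"{path}:{lineno}: invalid JSON ({exc})")
                continue
            missing = schema["required"] - case.keys()
            if missing:
                errors.append(f"{path}:{lineno}: missing fields {sorted(missing)}")
            if case.get("expected") != schema["expected_value"]:
                errors.append(f'{path}:{lineno}: expected must be "{schema["expected_value"]}"')
            if schema["categories"] and case.get("category") not in schema["categories"]:
                errors.append(f"{path}:{lineno}: category must be one of {sorted(schema['categories'])}")
    return errors

if __name__ == "__main__":
    problems = [e for path, s in SCHEMAS.items() for e in validate(path, s)]
    print("\n".join(problems) or "all cases valid")
    sys.exit(1 if problems else 0)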

Step 4 — Create output tests

File: evals/output-tests.jsonl

Each line defines an input prompt with expected output characteristics; a checker sketch follows the examples below.

{"prompt": "Create trigger tests for skill-authoring", "expected_files": ["evals/trigger-positive.jsonl", "evals/trigger-negative.jsonl"], "required_patterns": ["\"expected\": \"trigger\"", "\"expected\": \"no_trigger\""], "forbidden_patterns": ["TODO", "placeholder"], "min_cases": 5}
{"prompt": "Build a full test harness for the pdf-extraction skill", "expected_files": ["evals/trigger-positive.jsonl", "evals/trigger-negative.jsonl", "evals/output-tests.jsonl", "evals/README.md"], "required_patterns": ["\"category\""], "forbidden_patterns": [], "min_cases": 8}

Step 5 — Create test fixtures (if needed)

Directory: evals/fixtures/

  • Sample input files the skill would process
  • Mock data for deterministic testing
  • Expected output examples for comparison

Only create fixtures when the skill processes files or external data.

Step 6 — Create evals README

File: evals/README.md (a trigger-test runner sketch follows the template)

# Eval Suite for [skill-name]

## Files
| File | Purpose | Case Count |
|------|---------|------------|
| trigger-positive.jsonl | Prompts that SHOULD trigger | N |
| trigger-negative.jsonl | Prompts that should NOT trigger | N |
| output-tests.jsonl | Output format/content validation | N |

## Running
- Trigger tests: Feed each prompt to router, verify trigger/no_trigger matches expected
- Output tests: Run skill on each prompt, verify files/patterns/counts

## Adding Cases
Append new JSON lines to the appropriate .jsonl file. Follow the field schema:
- trigger-positive: prompt, expected ("trigger"), category, notes
- trigger-negative: prompt, expected ("no_trigger"), better_skill, notes
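
A sketch of the trigger-test loop the README describes. The route() hook is hypothetical; it stands in for however your evaluation setup asks the router which skill a prompt activates:

# Sketch of the trigger-test loop. route() is a hypothetical hook; wire it
# to your router or eval framework before running.
import json

SKILL_NAME = "skill-testing-harness"  # the skill under test

def route(prompt: str) -> str | None:
    raise NotImplementedError("connect this to your router")

def run_trigger_tests(path: str) -> tuple[int, int]:
    passed = failed = 0
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            case = json.loads(line)
            fired = route(case["prompt"]) == SKILL_NAME
            should_fire = case["expected"] == "trigger"
            if fired == should_fire:
                passed += 1
            else:
                failed += 1
                print(f"FAIL [{case['expected']}]: {case['prompt']}")
    return passed, failed

for path in ("evals/trigger-positive.jsonl", "evals/trigger-negative.jsonl"):
    ok, bad = run_trigger_tests(path)
    print(f"{path}: {ok} passed, {bad} failed")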

Output defaults

evals/
├── README.md              # How to run and extend tests
├── trigger-positive.jsonl # 8-15 should-trigger cases
├── trigger-negative.jsonl # 8-15 should-not-trigger cases
├── output-tests.jsonl     # Output validation cases
└── fixtures/              # Optional test data
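
A scaffolding sketch for the layout above; it only creates the skeleton, and the file contents still come from Steps 2-6:

# Sketch: scaffold the evals/ layout above inside a target skill directory.
from pathlib import Path

def scaffold(skill_dir: str) -> None:
    evals = Path(skill_dir, "evals")
    (evals / "fixtures").mkdir(parents=True, exist_ok=True)
    for name in ("README.md", "trigger-positive.jsonl",
                 "trigger-negative.jsonl", "output-tests.jsonl"):
        (evals / name).touch()

scaffold(".")  # run from the skill's root directory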

Failure handling

  • Skill has no clear triggers in description: Cannot write trigger tests — flag for skill-trigger-optimization first
  • Output format undefined in skill: Cannot write output tests — flag for skill-refinement to add output defaults
  • Too few distinct trigger phrases: Aim for a minimum of 5 positive and 5 negative cases; if the skill is too narrow to support that, consider whether it should be merged into a broader skill
  • Skill too complex for single harness: Break into sub-capabilities, create separate JSONL files per capability
  • No comparable baseline exists: Skip baseline comparison, focus on absolute trigger accuracy and output format compliance