Skilllibrary skill-eval-runner

Run trigger tests, behavior tests, and baseline comparisons for a skill's eval suite, then produce a structured quality verdict. Use when a skill has been modified and needs regression testing, when CI/pre-release validation requires documented eval results, or when measuring quality before catalog inclusion. Do not use when no evals exist yet (build them first) or for manual evaluation without test files.

install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/01-package-scaffolding/skill-eval-runner" ~/.claude/skills/merceralex397-collab-skilllibrary-skill-eval-runner && rm -rf "$T"
manifest: 01-package-scaffolding/skill-eval-runner/SKILL.md
source content

Skill Eval Runner

Executes a skill's eval suite and produces a structured quality report.

Procedure

1. Locate test suite

Check for eval files in the skill directory:

  • evals/triggers.yaml
    — trigger accuracy tests
  • evals/outputs.yaml
    — behavior correctness tests
  • evals/baselines.yaml
    — baseline comparison tests

Note which exist and which are missing.

2. Run trigger tests

For each case in

triggers.yaml
:

  • Positive cases: Does the skill fire? (expected: yes)
  • Negative cases: Does the skill NOT fire? (expected: no)

Record: prompt, expected, actual, Pass/Fail.

3. Run output tests

For each case in

outputs.yaml
:

  • Run the skill with the given input
  • Check against
    expected_sections
    ,
    required_patterns
    ,
    forbidden_patterns

Record: test name, checks passed/total, Pass/Fail.

4. Run baseline comparison

For each case in

baselines.yaml
:

  • Run with skill active vs without skill
  • Does the skill add value? (Yes if 2+ elements baseline would lack)

5. Aggregate results

Calculate:

  • Trigger precision: TP / (TP + FP)
  • Trigger recall: TP / (TP + FN)
  • Output pass rate: passed / total checks
  • Baseline win: Yes/No

6. Issue verdict

VerdictCriteria
PassAll rates ≥80% AND baseline win
Pass with issuesAny rate 60-79%
FailAny rate <60% OR baseline lose

Output contract

## Eval Report: [skill-name]
Date: [YYYY-MM-DD]

### Trigger Tests
| Prompt | Type | Expected | Actual | Result |
|--------|------|----------|--------|--------|
Precision: X%  Recall: Y%

### Output Tests
| Test | Checks Passed | Result |
|------|--------------|--------|
Pass rate: X%

### Baseline Comparison
Skill adds value: [Yes/No]

### Verdict: [Pass | Pass with issues | Fail]
Issues: [list or "None"]

Failure handling

  • Silent skill failure (no output): Record Fail, halt remaining tests for that case
  • Missing test files: Report which are missing, run available ones, note incomplete coverage
  • Flaky results: Run 3x, use majority result, note flakiness in report
  • All tests missing: Cannot evaluate — report "No eval suite found" and recommend building one

References