Skilllibrary skill-eval-runner
Run trigger tests, behavior tests, and baseline comparisons for a skill's eval suite, then produce a structured quality verdict. Use when a skill has been modified and needs regression testing, when CI/pre-release validation requires documented eval results, or when measuring quality before catalog inclusion. Do not use when no evals exist yet (build them first) or for manual evaluation without test files.
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/01-package-scaffolding/skill-eval-runner" ~/.claude/skills/merceralex397-collab-skilllibrary-skill-eval-runner && rm -rf "$T"
manifest: 01-package-scaffolding/skill-eval-runner/SKILL.md
Skill Eval Runner
Executes a skill's eval suite and produces a structured quality report.
Procedure
1. Locate test suite
Check for eval files in the skill directory:
- evals/triggers.yaml: trigger accuracy tests
- evals/outputs.yaml: behavior correctness tests
- evals/baselines.yaml: baseline comparison tests
Note which exist and which are missing.
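A minimal sketch of this step, assuming the three file names listed above live under an evals/ folder in the skill directory. The function name locate_eval_files is hypothetical and not part of the skill itself.

```python
from pathlib import Path

# File names are taken from the list above; the descriptions are for reporting only.
EVAL_FILES = {
    "triggers.yaml": "trigger accuracy tests",
    "outputs.yaml": "behavior correctness tests",
    "baselines.yaml": "baseline comparison tests",
}


def locate_eval_files(skill_dir):
    """Return (present, missing) lists of eval file names under <skill_dir>/evals."""
    evals_dir = Path(skill_dir) / "evals"
    present = [name for name in EVAL_FILES if (evals_dir / name).is_file()]
    missing = [name for name in EVAL_FILES if name not in present]
    return present, missing
```

If present comes back empty, the "All tests missing" rule under Failure handling applies.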
2. Run trigger tests
For each case in triggers.yaml:
- Positive cases: Does the skill fire? (expected: yes)
- Negative cases: Does the skill NOT fire? (expected: no)
Record: prompt, expected, actual, Pass/Fail.
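One way to record these results is sketched below. The schema of triggers.yaml is not specified here, so the sketch assumes each case has already been reduced to a (prompt, expected) pair; skill_fires is a hypothetical callable standing in for however the harness detects that the skill was triggered.

```python
from dataclasses import dataclass


@dataclass
class TriggerResult:
    prompt: str
    expected: bool  # True for positive cases, False for negative cases
    actual: bool    # whether the skill actually fired

    @property
    def passed(self):
        # A case passes when the skill fired exactly when it was expected to.
        return self.expected == self.actual


def run_trigger_tests(cases, skill_fires):
    """cases: iterable of (prompt, expected) pairs.

    skill_fires: hypothetical callable mapping a prompt to True/False.
    """
    return [
        TriggerResult(prompt, expected, bool(skill_fires(prompt)))
        for prompt, expected in cases
    ]
```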
3. Run output tests
For each case in outputs.yaml:
- Run the skill with the given input
- Check against expected_sections, required_patterns, forbidden_patterns
Record: test name, checks passed/total, Pass/Fail.
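The three check types could be counted roughly as follows. This is a sketch under two assumptions that the text above does not confirm: a section counts as present when its heading text appears verbatim in the output, and the patterns are regular expressions. The function name check_output is hypothetical.

```python
import re


def check_output(output, expected_sections=(), required_patterns=(), forbidden_patterns=()):
    """Return (checks_passed, checks_total) for one output test case."""
    checks = []
    # Assumption: a section is "present" if its heading text occurs in the output.
    checks += [section in output for section in expected_sections]
    # Assumption: patterns are regular expressions matched anywhere in the output.
    checks += [re.search(p, output) is not None for p in required_patterns]
    checks += [re.search(p, output) is None for p in forbidden_patterns]
    return sum(checks), len(checks)
```

A case would then be recorded as Pass only when checks_passed equals checks_total.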
4. Run baseline comparison
For each case in baselines.yaml:
- Run with skill active vs without skill
- Does the skill add value? (Yes if the skill's output contains 2+ elements that the baseline output lacks)
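The 2+ rule can be expressed as a set difference. How "elements" are extracted from each output is not defined here, so this sketch assumes they have already been identified and labeled; skill_adds_value is a hypothetical helper name.

```python
def skill_adds_value(skill_elements, baseline_elements, min_extra=2):
    """True if the skill run has at least `min_extra` elements the baseline lacks."""
    extra = set(skill_elements) - set(baseline_elements)
    return len(extra) >= min_extra
```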
5. Aggregate results
Calculate:
- Trigger precision: TP / (TP + FP)
- Trigger recall: TP / (TP + FN)
- Output pass rate: checks passed / total checks
- Baseline win: Yes/No
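The rates above can be computed as in the sketch below. Here TP is a positive case where the skill fired, FP is a negative case where it fired anyway, and FN is a positive case where it stayed silent. Returning None for a zero denominator is an assumption; the procedure does not say how to report that case.

```python
def trigger_metrics(results):
    """results: iterable of (expected, actual) booleans, one pair per trigger case."""
    results = list(results)
    tp = sum(1 for expected, actual in results if expected and actual)
    fp = sum(1 for expected, actual in results if not expected and actual)
    fn = sum(1 for expected, actual in results if expected and not actual)
    # Assumption: a rate with a zero denominator is reported as None, not 0.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


def output_pass_rate(checks_passed, checks_total):
    """Fraction of individual checks that passed across all output tests."""
    return checks_passed / checks_total if checks_total else None
```

For example, if 3 of 4 positive cases fire and 1 of 3 negative cases also fires, then TP = 3, FP = 1, FN = 1, giving precision 0.75 and recall 0.75.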
6. Issue verdict
| Verdict | Criteria |
|---|---|
| Pass | All rates ≥80% AND baseline win |
| Pass with issues | Lowest rate is 60-79% AND baseline win |
| Fail | Any rate <60% OR baseline loss |
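The table can be read as a decision on the lowest rate. The sketch below assumes that the Fail row takes precedence over the other two and that rates are given as fractions between 0 and 1; the function name verdict is hypothetical.

```python
def verdict(rates, baseline_win):
    """rates: non-empty iterable of fractions in [0, 1]; baseline_win: bool."""
    lowest = min(rates)
    # Assumption: Fail is checked first, so a baseline loss fails
    # even when every rate is high.
    if lowest < 0.60 or not baseline_win:
        return "Fail"
    if lowest < 0.80:
        return "Pass with issues"
    return "Pass"
```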
Output contract
## Eval Report: [skill-name]
Date: [YYYY-MM-DD]

### Trigger Tests
| Prompt | Type | Expected | Actual | Result |
|--------|------|----------|--------|--------|

Precision: X% Recall: Y%

### Output Tests
| Test | Checks Passed | Result |
|------|--------------|--------|

Pass rate: X%

### Baseline Comparison
Skill adds value: [Yes/No]

### Verdict: [Pass | Pass with issues | Fail]
Issues: [list or "None"]
Failure handling
- Silent skill failure (no output): Record Fail, halt remaining tests for that case
- Missing test files: Report which are missing, run available ones, note incomplete coverage
- Flaky results: Run 3x, use majority result, note flakiness in report
- All tests missing: Cannot evaluate — report "No eval suite found" and recommend building one
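The flaky-result rule (run 3x, take the majority) might look like the sketch below. run_once is a hypothetical zero-argument callable that executes one test case and returns True for Pass; with an odd number of runs and a two-valued outcome there can be no tie.

```python
from collections import Counter


def majority_result(run_once, runs=3):
    """Run a test `runs` times; return (majority_outcome, flaky)."""
    outcomes = [bool(run_once()) for _ in range(runs)]
    majority = Counter(outcomes).most_common(1)[0][0]
    # The case is flaky if the runs did not all agree.
    flaky = len(set(outcomes)) > 1
    return majority, flaky
```

The flaky flag is what would feed the "note flakiness in report" requirement.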
References
- Anthropic eval overview: https://docs.anthropic.com/en/docs/test-and-evaluate/eval-overview
- OpenAI evals: https://developers.openai.com/docs/evals/getting-started