Skilllibrary skill-evaluation
Evaluate whether a skill routes correctly and produces better output than a baseline — measuring precision, recall, output quality, and win rate against no-skill runs. Use this when validating a new skill before promotion, verifying a refinement worked, or auditing whether an existing skill still adds value. Do not use for building the test harness itself (use skill-testing-harness), benchmarking multiple variants head-to-head (use skill-benchmarking), or debugging a skill that is obviously broken (fix it first with skill-refinement).
```bash
# Clone the whole library
git clone https://github.com/merceralex397-collab/skilllibrary

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/03-meta-skill-engineering/skill-evaluation" ~/.claude/skills/merceralex397-collab-skilllibrary-skill-evaluation && rm -rf "$T"
```
03-meta-skill-engineering/skill-evaluation/SKILL.md
Purpose
Measures whether a skill is working: does it route correctly (triggering when it should and staying silent when it shouldn't), and does it produce better output than running without the skill? Produces quantitative evidence that a skill adds value.
When to use this skill
Use when:
- User says "is this skill working?", "evaluate this skill", "does this help?"
- New skill needs validation before promotion to stable
- Skill was refined and needs verification that the fix worked
- Comparing skill variants to decide which to keep
- Periodic audit of skill library quality
Do NOT use when:
- Building the test harness/suite (use skill-testing-harness)
- Running existing automated evals (use skill-eval-runner)
- Benchmarking performance metrics (use skill-benchmarking)
- Skill is obviously broken; fix it first (use skill-refinement)
Operating procedure
- Define what "working" means for this skill:
- Routing accuracy: Triggers on positive cases? Stays silent on negative cases?
- Output quality: Correct, well-formatted, complete?
- Baseline comparison: Better than without skill?
- Prepare evaluation inputs:
- 5-10 positive trigger cases (should trigger)
- 5-10 negative trigger cases (should NOT trigger)
- 3-5 cases for output quality assessment
- Run routing evaluation:
- Each positive case: Did skill trigger? (target: 100%)
- Each negative case: Did skill NOT trigger? (target: 100%)
- Calculate Precision = TP / (TP + FP), where TP = triggered on a positive case and FP = triggered on a negative case
- Calculate Recall = TP / (TP + FN), where FN = failed to trigger on a positive case (see the scoring sketch after this procedure)
- Run output quality evaluation:
- For each quality case, run with skill active
- Assess against rubric:
- Correct: Solves the problem?
- Complete: All parts present?
- Well-formatted: Matches expected format?
- No hallucination: Claims accurate?
- Run baseline comparison (recommended):
- Same quality cases, run without skill
- Compare: which is better? (blind if possible)
- Calculate win rate: skill wins / total
- Synthesize results:
- Routing: Pass if precision ≥95%, recall ≥90%
- Quality: Pass if ≥80% outputs meet rubric
- Baseline: Pass if skill wins ≥60%
- Issue verdict: Pass / Fail / Needs Work (with specific issues)
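Steps 3, 5, and 6 are mechanical enough to script. Below is a minimal sketch in Python, assuming each routing case is recorded as a boolean (did the skill fire?) and each quality/baseline case as a pass or win flag; the `EvalResults` structure and function names are illustrative, not part of any skill API.

```python
from dataclasses import dataclass

# Thresholds from step 6 of the procedure above.
PRECISION_TARGET = 0.95
RECALL_TARGET = 0.90
QUALITY_TARGET = 0.80
WIN_RATE_TARGET = 0.60

@dataclass
class EvalResults:
    positive_triggered: list[bool]  # should trigger; True = skill fired
    negative_triggered: list[bool]  # should NOT trigger; True = false positive
    quality_passed: list[bool]      # one rubric pass/fail per quality case
    skill_won: list[bool]           # per baseline case: skill output judged better

def routing_metrics(r: EvalResults) -> tuple[float, float]:
    tp = sum(r.positive_triggered)        # triggered on a positive case
    fn = len(r.positive_triggered) - tp   # missed a positive case
    fp = sum(r.negative_triggered)        # triggered on a negative case
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def verdict(r: EvalResults) -> str:
    precision, recall = routing_metrics(r)
    quality = sum(r.quality_passed) / len(r.quality_passed)
    win_rate = sum(r.skill_won) / len(r.skill_won)
    checks = [
        precision >= PRECISION_TARGET,
        recall >= RECALL_TARGET,
        quality >= QUALITY_TARGET,
        win_rate >= WIN_RATE_TARGET,
    ]
    # One reasonable mapping; the procedure leaves the Fail vs. Needs Work
    # boundary to judgment, so adjust to taste.
    if all(checks):
        return "Pass"
    return "Needs Work" if any(checks) else "Fail"
```

Worked example: if 9 of 10 positive cases trigger and 1 of 10 negative cases triggers, then TP=9, FN=1, FP=1, so precision = 9/10 = 90% and recall = 9/10 = 90%; recall clears its ≥90% bar, but precision misses ≥95%, so routing fails and the trigger description needs tightening.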
Output defaults
```
## Skill Evaluation: [skill-name]

### Routing Accuracy

| Metric | Value | Target | Pass? |
|--------|-------|--------|-------|
| Precision | X% | ≥95% | ✓/✗ |
| Recall | X% | ≥90% | ✓/✗ |

**Issues**: [false negatives/positives]

### Output Quality (N cases)

**Score**: X/N pass (Y%)

### Baseline Comparison

**Win rate**: X/N (Y%)

### Verdict: [Pass | Fail | Needs Work]

**Issues**: [list or "None"]
```
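If the evaluation is scripted, this report can be rendered from the same results. A sketch reusing the hypothetical `EvalResults` helpers from the block above (issue lists omitted for brevity):

```python
def render_report(skill_name: str, r: EvalResults) -> str:
    precision, recall = routing_metrics(r)
    q_pass, q_total = sum(r.quality_passed), len(r.quality_passed)
    wins, total = sum(r.skill_won), len(r.skill_won)

    def mark(ok: bool) -> str:
        return "✓" if ok else "✗"

    return "\n".join([
        f"## Skill Evaluation: {skill_name}",
        "### Routing Accuracy",
        "| Metric | Value | Target | Pass? |",
        "|--------|-------|--------|-------|",
        f"| Precision | {precision:.0%} | ≥95% | {mark(precision >= PRECISION_TARGET)} |",
        f"| Recall | {recall:.0%} | ≥90% | {mark(recall >= RECALL_TARGET)} |",
        f"### Output Quality ({q_total} cases)",
        f"**Score**: {q_pass}/{q_total} pass ({q_pass / q_total:.0%})",
        "### Baseline Comparison",
        f"**Win rate**: {wins}/{total} ({wins / total:.0%})",
        f"### Verdict: {verdict(r)}",
    ])
```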
References
- https://docs.anthropic.com/en/docs/test-and-evaluate/eval-overview — Anthropic eval framework
- https://docs.github.com/en/copilot/concepts/agents/about-agent-skills — Agent skills overview
- https://developers.openai.com/codex/skills — Codex skill format
- Skill's evals/ directory if it exists
Failure handling
- No eval cases exist: Create a minimum set (3 positive, 3 negative triggers; see the sketch after this list)
- Can't determine if triggered: Check client logs or add instrumentation
- Baseline inconclusive: Increase sample size or define clearer rubric
- Routing passes but quality fails: Route to skill-refinement
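For the no-eval-cases fallback above, the minimum set can be as small as the sketch below. The prompts are hypothetical examples for a skill under evaluation; swap in phrasings drawn from the target skill's own trigger description.

```python
# Hypothetical minimum trigger set: 3 prompts that should route to the
# skill under test and 3 near-miss prompts that should route elsewhere.
MIN_EVAL_CASES = {
    "positive": [  # should trigger
        "Is the commit-message skill actually working?",
        "Evaluate this skill before we promote it to stable.",
        "Does this skill still produce better output than no skill?",
    ],
    "negative": [  # should NOT trigger
        "Build a test harness for my skill library.",        # skill-testing-harness
        "Benchmark these two skill variants head-to-head.",  # skill-benchmarking
        "This skill is clearly broken, help me fix it.",     # skill-refinement
    ],
}
```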