Skilllibrary skill-benchmarking
Compare skill variants or before-and-after versions using pass rate, token usage, latency, and qualitative win rate. Use this when choosing between multiple skill variants, measuring whether a refinement helped, or justifying skill maintenance investment. Do not use for evaluating a single skill in isolation (use skill-evaluation) or for building test infrastructure (use skill-testing-harness).
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/03-meta-skill-engineering/skill-benchmarking" ~/.claude/skills/merceralex397-collab-skilllibrary-skill-benchmarking && rm -rf "$T"
manifest:
03-meta-skill-engineering/skill-benchmarking/SKILL.md
Purpose
Compares skill variants or before/after versions using quantitative metrics: pass rate, token usage, latency, and qualitative win rate. Produces data to decide which variant to keep, whether a refinement helped, or whether a skill is worth maintaining at all.
When to use this skill
Use when:
- Multiple skill variants exist and one needs choosing
- Skill was refined and impact needs measurement
- User asks "which is better?", "did this help?", "benchmark these"
- Periodic audit to cull underperformers
- Justifying skill maintenance investment
Do NOT use when:
- No variants to compare (use skill-evaluation for a single skill)
- Building test infrastructure (use skill-testing-harness)
- Debugging why a skill fails (use skill-refinement)
- Single quick check, not a systematic comparison
Operating procedure
- Define comparison:
- What variants? (A vs B, before vs after, skill vs no-skill)
- What metrics matter?
- Minimum sample size? (recommend N≥10 per variant)
- Select benchmark cases:
- Use evals/ cases if they exist
- Cover: typical, edge, adversarial
- Same cases for all variants
- Collect quantitative metrics:
- Pass rate: % meeting acceptance criteria
- Token usage: Avg input + output tokens
- Latency: Time to complete (if measurable)
- Routing accuracy: Precision and recall
- Collect qualitative metrics (blind comparison):
- Present outputs without labels
- Rate: Which better? (A/B/Tie)
- Calculate win rate
- Statistical analysis:
- Pass rate: Significant difference? (chi-squared)
- Continuous metrics: Is the difference meaningful? (>10% relative difference = meaningful)
- Win rate: Different from 50%?
- Produce benchmark report:
- Summary table
- Statistical notes
- Recommendation
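The statistical-analysis step above can be sketched with the standard library alone. This is a minimal sketch, not part of the skill itself: a two-proportion z-test approximates the chi-squared test on pass rates, and an exact binomial test checks whether the blind win rate differs from 50%. All sample counts are hypothetical.

```python
import math

def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Approximate two-sided p-value for a difference in pass rates."""
    p_pool = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (pass_a / n_a - pass_b / n_b) / se
    # Two-sided p-value from the normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def binomial_two_sided_p(wins, trials, p=0.5):
    """Exact two-sided binomial test against p=0.5 (ties excluded)."""
    def pmf(k):
        return math.comb(trials, k) * p**k * (1 - p)**(trials - k)
    observed = pmf(wins)
    return min(1.0, sum(pmf(k) for k in range(trials + 1)
                        if pmf(k) <= observed + 1e-12))

# Hypothetical results: 10 cases per variant (the recommended minimum),
# and 20 blind pairwise ratings of which B won 13.
p_pass = two_proportion_z(pass_a=8, n_a=10, pass_b=9, n_b=10)
p_win = binomial_two_sided_p(wins=13, trials=20)
print(f"pass-rate difference p={p_pass:.3f}, win-rate vs 50% p={p_win:.3f}")
```

With samples this small, neither test reaches significance, which is exactly why the procedure recommends N≥10 per variant as a floor rather than a target.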
Output defaults
## Benchmark: [Skill A] vs [Skill B]

### Summary

| Metric | A | B | Winner |
|--------|---|---|--------|
| Pass Rate | 85% | 92% | B |
| Avg Tokens | 1200 | 980 | B |
| Win Rate | 35% | 65% | B |

### Detailed Results

[By category if applicable]

### Statistical Notes

- Pass rate difference: p=X
- Win rate: significantly > 50%? Yes/No

### Recommendation

**Keep [winner]**, [deprecate/archive] [loser]. Rationale: [why]
References
- https://docs.anthropic.com/en/docs/test-and-evaluate/eval-overview — Anthropic eval guidance
- https://docs.github.com/en/copilot/concepts/agents/about-agent-skills — Agent skills overview
- Skill evals/ directories for test cases
Failure handling
- Not enough test cases: Note that results are preliminary; minimum N=10 per variant
- Metrics too close: Keep simpler/smaller as tiebreaker
- Variants serve different purposes: Don't force winner; document when each appropriate
- Can't blind the comparison: Note the bias risk