Meta-Skill-Engineering skill-benchmarking

install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/Meta-Skill-Engineering
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/Meta-Skill-Engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skill-benchmarking" ~/.claude/skills/merceralex397-collab-meta-skill-engineering-skill-benchmarking-89be41 && rm -rf "$T"
manifest: skill-benchmarking/SKILL.md
source content

Purpose

Compare two or more skill variants (A vs B, before vs after, skill vs no-skill) on the same test cases. Produce a summary table with pass rate, token usage, and win rate, then recommend which variant to keep.

When to use

  • Two or more skill variants exist and one must be chosen
  • A skill was refined and the change needs measured impact
  • User says "which is better?", "did this help?", "benchmark these"
  • Periodic audit to cull underperforming variants
  • Justifying whether skill maintenance investment is paying off

When NOT to use

  • Only one skill, no variant to compare →
    skill-evaluation
  • Need to build test cases or harness →
    skill-testing-harness
  • Skill is broken and needs fixing →
    skill-improver
  • Quick spot-check, not systematic comparison

Procedure

  1. Define the comparison

    • Identify variants: A vs B, before vs after, skill vs no-skill baseline.
    • Choose metrics: pass rate, token usage, win rate. Drop metrics irrelevant to the decision.
    • Set minimum sample size (N ≥ 10 per variant).
  2. Select benchmark cases

    • Reuse existing eval cases if available.
    • Cover typical, edge, and adversarial inputs.
    • Use identical cases across all variants — never mix case sets between variants.
  3. Collect metrics (sketches of these computations follow the procedure)

    • Pass rate: % of cases meeting acceptance criteria.
    • Token usage: Average input + output tokens per case.
    • Routing accuracy: Precision and recall if routing is in scope.
    • Win rate: For each case, judge which variant produced better output (A / B / Tie). Calculate overall win percentage.

    Win-rate judging method:

    • Blind comparison: Strip skill-name indicators and variant labels from outputs before comparing. Present as "Output A" and "Output B" so judgment isn't biased by which variant is "new".
    • Scoring rubric (apply to each case independently):
      • Correctness: Is the output factually right and free of hallucination?
      • Completeness: Are all required sections and elements present?
      • Conciseness: Is there unnecessary padding, repetition, or filler?
      • Actionability: Could someone act on this output without further clarification?
    • Ties: If the quality difference is marginal across all rubric dimensions, favor the shorter or cheaper output (token efficiency as tiebreaker). Do not force a winner when there isn't one.
    • >2 variants: Use round-robin pairwise comparison (A vs B, A vs C, B vs C), not free-for-all ranking. Tally wins per variant across all pairs. This avoids the "middle option bias" that free-form ranking introduces.
  4. Assess significance (a threshold-check sketch follows the procedure)

    • Pass rate: Is the difference > 5 percentage points?
    • Token usage: Is the difference > 10%?
    • Win rate: Is it meaningfully above 50%?
    • If all metrics are within noise, declare a tie.
  5. Produce benchmark report

    • Fill in the output contract below.
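
To make step 3 concrete, the sketch below shows one way the per-variant metrics could be computed from per-case results. The `CaseResult` record and its field names are illustrative assumptions, not part of the skill; adapt them to however the test harness records its runs.

```python
from dataclasses import dataclass

# Hypothetical per-case record for one variant; names are assumptions, not part of the skill.
@dataclass
class CaseResult:
    case_id: str
    passed: bool    # did the output meet the acceptance criteria?
    tokens: int     # input + output tokens for this case

def pass_rate(results: list[CaseResult]) -> float:
    """Percentage of cases meeting the acceptance criteria."""
    return 100.0 * sum(r.passed for r in results) / len(results)

def avg_tokens(results: list[CaseResult]) -> float:
    """Average input + output tokens per case."""
    return sum(r.tokens for r in results) / len(results)

def win_rate(judgments: list[str], variant: str) -> float:
    """Share of blind per-case judgments ("A", "B", or "Tie") won by `variant`.

    Ties count for neither side, so the two win rates need not sum to 100%.
    """
    return 100.0 * sum(j == variant for j in judgments) / len(judgments)
```

The same helpers run over each variant's result list; the win-rate judgments are shared across variants, one blind verdict per case.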
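
For more than two variants, the round-robin pairwise comparison could be tallied along these lines. `itertools.combinations` enumerates each unordered pair exactly once, and the `judge` callable stands in for whatever blind, rubric-based comparison is used in the two-variant case; every name here is an assumption for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Callable

def round_robin(variants: dict[str, list[str]],
                judge: Callable[[str, str], str]) -> Counter:
    """Tally pairwise wins across all variant pairs on identical cases.

    `variants` maps a variant name to its per-case outputs (same case order for
    every variant). `judge(out_x, out_y)` returns "X", "Y", or "Tie" for one
    blind comparison; ties score for neither side.
    """
    wins: Counter = Counter({name: 0 for name in variants})
    for name_x, name_y in combinations(variants, 2):   # A vs B, A vs C, B vs C, ...
        for out_x, out_y in zip(variants[name_x], variants[name_y]):
            verdict = judge(out_x, out_y)
            if verdict == "X":
                wins[name_x] += 1
            elif verdict == "Y":
                wins[name_y] += 1
    return wins
```

The per-variant tallies can then stand in for the single win-rate row when more than two variants are reported.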
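
The step 4 thresholds could be checked mechanically as sketched below. The 55% cutoff for "meaningfully above 50%" and the denominator chosen for the token delta are assumptions, since the skill leaves both open.

```python
def assess_significance(pass_a: float, pass_b: float,
                        tok_a: float, tok_b: float,
                        win_rate_winner: float) -> dict[str, bool]:
    """Apply the step 4 thresholds; True means the difference is meaningful."""
    return {
        "pass_rate": abs(pass_a - pass_b) > 5.0,                   # > 5 percentage points
        "tokens": abs(tok_a - tok_b) / max(tok_a, tok_b) > 0.10,   # > 10% relative (denominator is an assumption)
        "win_rate": win_rate_winner > 55.0,                        # "meaningfully above 50%" (assumed cutoff)
    }

def verdict(significance: dict[str, bool]) -> str:
    """If every metric is within noise, the comparison is a tie."""
    return "tie" if not any(significance.values()) else "meaningful difference"
```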

Output contract

Produce exactly this structure:

## Benchmark: [Variant A] vs [Variant B]

### Summary
| Metric         | A     | B     | Winner |
|----------------|-------|-------|--------|
| Pass Rate      | 85%   | 92%   | B      |
| Avg Tokens     | 1200  | 980   | B      |
| Win Rate       | 35%   | 65%   | B      |
| Cases Tested   | 15    | 15    | —      |

### Breakdown
[Results by category if cases span multiple categories. Omit if single category.]

### Significance
- Pass rate delta: [X]pp — [meaningful | within noise]
- Token delta: [X]% — [meaningful | within noise]
- Win rate: [X]% — [above | at | below] 50%

### Recommendation
**Keep [winner]**. [Deprecate | archive] [loser].
Rationale: [one sentence explaining the deciding factor]
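
As a reference point, the Summary table could be rendered directly from the collected numbers. This sketch assumes two variants and the metric values computed during the procedure; the function and parameter names are illustrative, not part of the contract.

```python
def summary_table(name_a: str, name_b: str,
                  pass_a: float, pass_b: float,
                  tok_a: float, tok_b: float,
                  win_a: float, win_b: float,
                  n_cases: int) -> str:
    """Render the Summary section of the benchmark report as markdown rows."""
    def winner(a: float, b: float, lower_is_better: bool = False) -> str:
        if a == b:
            return "—"
        better_a = a < b if lower_is_better else a > b
        return name_a if better_a else name_b

    rows = [
        ("Pass Rate",    f"{pass_a:.0f}%", f"{pass_b:.0f}%", winner(pass_a, pass_b)),
        ("Avg Tokens",   f"{tok_a:.0f}",   f"{tok_b:.0f}",   winner(tok_a, tok_b, lower_is_better=True)),
        ("Win Rate",     f"{win_a:.0f}%",  f"{win_b:.0f}%",  winner(win_a, win_b)),
        ("Cases Tested", str(n_cases),     str(n_cases),     "—"),
    ]
    lines = [f"| Metric | {name_a} | {name_b} | Winner |", "|---|---|---|---|"]
    lines += [f"| {m} | {a} | {b} | {w} |" for m, a, b, w in rows]
    return "\n".join(lines)
```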

Failure handling

  • Fewer than 10 cases per variant: Run anyway but mark results as preliminary. State the sample size and warn that conclusions may not hold.
  • Metrics too close to call: Recommend keeping the simpler or smaller variant as tiebreaker. Never force a winner when differences are within noise.
  • Variants serve different purposes: Do not force a single winner. Document which contexts favor each variant and recommend keeping both with routing guidance.
  • Missing acceptance criteria: Ask the user to define pass/fail before running. Do not invent criteria.

Next steps

After benchmarking:

  • Improve the weaker variant →
    skill-improver
  • Deprecate the losing variant →
    skill-lifecycle-management
  • Package the winner →
    skill-packaging