Meta-Skill-Engineering skill-benchmarking

install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/Meta-Skill-Engineering
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/Meta-Skill-Engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.opencode/skills/skill-benchmarking" ~/.claude/skills/merceralex397-collab-meta-skill-engineering-skill-benchmarking && rm -rf "$T"
manifest: .opencode/skills/skill-benchmarking/SKILL.md
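To confirm the copy landed where Claude Code looks for skills, check that the manifest exists at the destination path used in the one-liner above:

ls ~/.claude/skills/merceralex397-collab-meta-skill-engineering-skill-benchmarking/SKILL.md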

Purpose

Compare two or more skill variants (A vs B, or before vs after) on the same test cases. Produce a summary table with pass rate, routing accuracy, and usefulness score, then recommend which variant to keep.

When to use

  • Two or more skill variants exist and one must be chosen
  • A skill was refined and the change needs measured impact
  • User says "which is better?", "did this help?", "benchmark these"
  • Periodic audit to cull underperforming variants
  • Justifying whether skill maintenance investment is paying off

When NOT to use

  • Only one skill, no variant to compare → skill-evaluation
  • Need to build test cases or harness → skill-testing-harness
  • Skill is broken and needs fixing → skill-improver
  • Quick spot-check, not systematic comparison → skill-evaluation (single-skill mode)

Procedure

  1. Define the comparison

    • Identify variants: A vs B, or before vs after. For skill vs no-skill baseline evaluation, use skill-evaluation instead.
    • Choose metrics: pass rate, routing accuracy, usefulness score. Drop metrics irrelevant to the decision.
    • Set minimum sample size (N ≥ 10 per variant).
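    Writing the definition down before any runs keeps the comparison honest. A minimal sketch; the variant names and the plan file are illustrative and not consumed by the scripts:

    # Illustrative only: record variants, metrics, and sample size up front.
    printf '%s\n' \
      'Variants: before.md (A) vs after.md (B)' \
      'Metrics:  pass rate, routing accuracy, usefulness score' \
      'Cases:    15 per variant (minimum 10)' > benchmark-plan.txt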
  2. Select benchmark cases

    • Reuse existing eval cases if available.
    • Cover typical, edge, and adversarial inputs.
    • Use identical cases across all variants — never mix.
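    One way to enforce the identical-cases rule is to checksum the case set before the first run and verify it before every later run; the case path below is illustrative:

    # Pin the case set before running variant A (case path is illustrative).
    sha256sum evals/cases/*.md > benchmark-caseset.sha256
    # Before running variant B, confirm the same cases are still in place.
    sha256sum --check benchmark-caseset.sha256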
  3. Collect metrics

    Use automated tooling to collect metrics consistently:

    ./scripts/run-evals.sh <skill-name>                                     # Trigger + behavior tests
    ./scripts/run-evals.sh --usefulness <skill-name>                        # Add LLM-judged quality
    ./scripts/run-baseline-comparison.sh <variant-a.md> <variant-b.md>     # Before/after comparison
    
    • Pass rate: % of trigger-positive and behavior cases meeting acceptance criteria.
    • Routing accuracy: Precision and recall from trigger-positive and trigger-negative eval sets.
    • Usefulness score: Average LLM-judged quality score (1–5) from --usefulness runs. Compare scores between variants to determine which produces higher-quality output.
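    The first two metrics reduce to simple arithmetic once the counts are tallied from the eval runs. A minimal sketch with placeholder counts:

    # Sketch: pass rate and routing accuracy from raw counts. The numbers are
    # placeholders for counts tallied from the trigger and behavior eval runs.
    passed=12; total=15
    tp=9; fp=1; fn=2                      # correct fires, false fires, missed fires
    awk -v p=$passed -v t=$total -v tp=$tp -v fp=$fp -v fn=$fn \
      'BEGIN { printf "pass rate: %.0f%%  precision: %.2f  recall: %.2f\n",
               100*p/t, tp/(tp+fp), tp/(tp+fn) }'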

    Usefulness comparison method:

    • Run --usefulness evals for each variant under identical conditions.
    • Compare the average usefulness scores directly. A variant with a meaningfully higher score (≥ 0.5 points) is the winner on quality.
    • Scoring rubric (applied by LLM judge per case):
      • Correctness: Is the output factually right and free of hallucination?
      • Completeness: Are all required sections and elements present?
      • Conciseness: Is there unnecessary padding, repetition, or filler?
      • Actionability: Could someone act on this output without further clarification?
    • Ties: If the usefulness score difference is < 0.5 points and pass rates are within noise, favor the simpler variant. Do not force a winner when there isn't one.
    • >2 variants: Compare usefulness scores pairwise (A vs B, A vs C, B vs C). Tally wins per variant across all pairs.
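    For the >2-variant case, the pairwise tally can be scripted; a sketch using the 0.5-point threshold from above, with placeholder averages:

    # Sketch: pairwise usefulness-win tally for three variants. Scores are
    # placeholder averages from the --usefulness run for each variant.
    declare -A score=( [A]=4.2 [B]=3.4 [C]=3.6 )
    declare -A wins=()
    for a in "${!score[@]}"; do
      for b in "${!score[@]}"; do
        [ "$a" = "$b" ] && continue
        # A "win" requires a meaningfully higher score (>= 0.5 points).
        if awk -v x="${score[$a]}" -v y="${score[$b]}" 'BEGIN { exit !(x >= y + 0.5) }'; then
          wins[$a]=$(( ${wins[$a]:-0} + 1 ))
        fi
      done
    done
    for v in "${!score[@]}"; do echo "$v: ${wins[$v]:-0} pairwise wins"; done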
  4. Assess significance

    • Pass rate: Is the difference > 5 percentage points?
    • Usefulness score: Is the difference ≥ 0.5 points?
    • If all metrics are within noise, declare tie.
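    Both thresholds are mechanical checks; a sketch with placeholder numbers pulled from the two reports:

    # Sketch: apply the significance thresholds from this step to placeholder
    # values; pass rates are percentages, usefulness scores are 1-5 averages.
    pass_a=80; pass_b=88
    useful_a=3.2; useful_b=4.1
    awk -v pa=$pass_a -v pb=$pass_b -v ua=$useful_a -v ub=$useful_b 'BEGIN {
      dp = pb - pa; du = ub - ua
      printf "pass-rate delta: %+dpp (%s)\n", dp, (dp > 5 || dp < -5) ? "meaningful" : "within noise"
      printf "usefulness delta: %+.1f (%s)\n", du, (du >= 0.5 || du <= -0.5) ? "meaningful" : "within noise"
    }'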
  5. Produce benchmark report

    • Fill in the output contract below.

Output contract

Produce exactly this structure:

## Benchmark: [Variant A] vs [Variant B]

### Summary
| Metric           | A     | B     | Winner |
|------------------|-------|-------|--------|
| Pos Trigger %    | 85%   | 92%   | B      |
| Neg Reject %     | 95%   | 90%   | A      |
| Behavior Pass %  | 80%   | 88%   | B      |
| Usefulness Score | 3.2   | 4.1   | B      |
| Cases Tested     | 15    | 15    | —      |

### Breakdown
[Results by category if cases span multiple categories. Omit if single category.]

### Significance
- Pass rate delta: [X]pp — [meaningful | within noise]
- Usefulness delta: [X] points — [meaningful | within noise]

### Recommendation
**Keep [winner]**. [Deprecate | archive] [loser].
Rationale: [one sentence explaining the deciding factor]

Failure handling

  • Fewer than 10 cases per variant: Run anyway but mark results as preliminary. State the sample size and warn that conclusions may not hold.
  • Metrics too close to call: Recommend keeping the simpler or smaller variant as the tiebreaker. Never force a winner when differences are within noise.
  • Variants serve different purposes: Do not force a single winner. Document which contexts favor each variant and recommend keeping both with routing guidance.
  • Missing acceptance criteria: Ask the user to define pass/fail before running. Do not invent criteria.

Next steps

After benchmarking:

  • Improve the weaker variant → skill-improver
  • Deprecate the losing variant → skill-lifecycle-management