Meta-Skill-Engineering skill-evaluation
```bash
git clone https://github.com/merceralex397-collab/Meta-Skill-Engineering
```

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/Meta-Skill-Engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.opencode/skills/skill-evaluation" ~/.claude/skills/merceralex397-collab-meta-skill-engineering-skill-evaluation && rm -rf "$T"
```
.opencode/skills/skill-evaluation/SKILL.md

Purpose
Produce quantitative evidence that a single skill adds value: it triggers on the right inputs, stays silent on wrong inputs, and improves output quality over the no-skill baseline.
When to use
- "Is this skill working?" / "evaluate this skill" / "does this help?"
- "Run the eval suite" / "regression test this skill"
- New skill needs validation before promotion to stable
- Skill was refined and you need to verify the fix worked
- Skill has been modified and needs regression testing against its eval suite
- CI/pre-release validation requires documented eval results
- Periodic audit of whether an existing skill still adds value
When NOT to use
- Comparing two or more skill variants head-to-head → skill-benchmarking
- Creating eval files, trigger tests, or test infrastructure → skill-testing-harness
- Skill is obviously broken or producing bad output → skill-improver
Procedure
Entry mode selection
Check whether the skill has an existing eval suite:
- If an evals/ directory exists with test files → use Suite Mode (Step 0); a quick shell check is sketched below
- If no eval suite exists → use Ad-hoc Mode (start at Step 1)
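A minimal sketch of that check, assuming the skill was installed under ~/.claude/skills/ as in the install command above (the exact path is an assumption, not something this skill prescribes):

```bash
# Entry-mode check: Suite Mode if any eval files exist, otherwise Ad-hoc Mode.
SKILL_DIR="$HOME/.claude/skills/my-skill"   # replace with the skill under test

if ls "$SKILL_DIR"/evals/*.jsonl >/dev/null 2>&1; then
  echo "Suite Mode: existing eval suite found (go to Step 0)"
else
  echo "Ad-hoc Mode: no eval suite (start at Step 1)"
fi
```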
Step 0 — Suite mode: run existing eval suite
Locate test files in the skill directory. Supported formats:
evals/trigger-positive.jsonl, evals/trigger-negative.jsonl, and evals/behavior.jsonl
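Before running the suite, it can help to gauge its coverage. A small sketch, using only the file names listed above (everything else is illustrative):

```bash
# Count cases per eval file; a missing file means incomplete coverage (see the
# fall-through note at the end of Step 0).
for f in evals/trigger-positive.jsonl evals/trigger-negative.jsonl evals/behavior.jsonl; do
  if [ -f "$f" ]; then
    printf '%-30s %s cases\n' "$f" "$(wc -l < "$f")"
  else
    printf '%-30s missing\n' "$f"
  fi
done
```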
Run the eval suite using scripts/run-evals.sh:

```bash
./scripts/run-evals.sh <skill-name>                # Standard eval
./scripts/run-evals.sh --usefulness <skill-name>   # With LLM-as-Judge usefulness scoring
./scripts/run-evals.sh --runs 3 <skill-name>       # Multi-run with majority voting
```
This runs trigger tests (positive trigger rate and negative rejection rate), behavior tests (output format compliance), and optionally usefulness tests (LLM-judged output quality). Results are saved to eval-results/.
Calculate positive trigger rate, negative rejection rate, output pass rate, and baseline win rate. Then skip to Step 6 to synthesize the verdict.
If some eval files are missing, note incomplete coverage and fall through to ad-hoc mode for the missing test types.
1. Define success criteria
- Routing: triggers on positive cases, stays silent on negative cases
- Quality: outputs are correct, complete, well-formatted, no hallucination
- Baseline: outputs are better than running without the skill
2. Prepare evaluation inputs
- 5–10 positive trigger cases (should activate the skill)
- 5–10 negative trigger cases (should NOT activate)
- 3–5 quality cases for output assessment
How to construct effective test cases (a hypothetical eval-file sketch follows this list):
- Positive cases: Read the skill's "When to use" section. Each bullet becomes at least one test case using realistic phrasing. Then add paraphrased versions — formal ("Please evaluate this skill's effectiveness"), casual ("is this skill any good?"), and indirect ("I'm not sure this skill helps"). This tests routing robustness, not just keyword matching.
- Negative cases: Read the skill's "When NOT to use" section. Each bullet becomes at least one test case. Then add near-miss cases drawn from adjacent skills' trigger phrases — these test whether the boundary is sharp. For example, if evaluating skill-evaluation, add trigger phrases from skill-benchmarking as negative cases.
- Quality cases: Use realistic, complete task prompts that exercise the full procedure — not just routing. Include at least one edge case where the skill must make a judgment call (e.g., ambiguous input, missing data, conflicting requirements).
- Anti-pattern to avoid: Do not write trigger tests that contain the skill name (e.g., "use skill-evaluation to assess this"). Real users rarely name the skill explicitly; tests that do will inflate precision and miss real routing failures.
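As a concrete illustration of these guidelines, here is a hypothetical fragment of the two trigger files for this skill. The JSONL field names (prompt, expect) are an assumption for illustration, not the schema the harness actually requires:

```bash
# Hypothetical eval cases following the guidance above: paraphrased positives,
# near-miss negatives borrowed from adjacent skills, and no skill names in prompts.
mkdir -p evals

cat > evals/trigger-positive.jsonl <<'EOF'
{"prompt": "Is this skill working? Can you check it?", "expect": "trigger"}
{"prompt": "Please evaluate this skill's effectiveness before we promote it.", "expect": "trigger"}
{"prompt": "I'm not sure this skill actually helps.", "expect": "trigger"}
EOF

cat > evals/trigger-negative.jsonl <<'EOF'
{"prompt": "Compare these two skill variants and tell me which is better.", "expect": "silent"}
{"prompt": "Create trigger tests and an eval harness for my new skill.", "expect": "silent"}
EOF
```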
3. Evaluate routing accuracy
- Run each positive case — did the skill trigger? (target: 100%)
- Run each negative case — did the skill stay silent? (target: 100%)
- Positive trigger rate = TP / (TP + FN) — how often the skill fires on positive cases
- Negative rejection rate = TN / (TN + FP) — how often the skill stays silent on negative cases (worked example below)
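A worked example with made-up counts, to show how the two rates fall out of the confusion-matrix terms:

```bash
# Made-up counts: 9 of 10 positive cases triggered, 8 of 10 negative cases stayed silent.
TP=9; FN=1; TN=8; FP=2
echo "positive trigger rate:   $(( 100 * TP / (TP + FN) ))%"   # 90%: passes the 80% gate, below the 95% target
echo "negative rejection rate: $(( 100 * TN / (TN + FP) ))%"   # 80%: at the gate, below the 90% target
```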
4. Evaluate output quality
- Run each quality case with the skill active
- Score against rubric: correct? complete? well-formatted? no hallucination?
5. Run baseline comparison (optional, manual)
This step is manual and requires the copilot CLI. It measures whether the skill is actually better than no skill (a minimal shell sketch of the disable/restore step follows this list):
- Temporarily remove or rename the skill's SKILL.md so it cannot be loaded
- Re-run the same quality cases without the skill active
- Restore the skill's SKILL.md after baseline runs complete
- Blind-compare outputs (judge without knowing which is skill vs baseline)
- Win rate = skill-wins / total-cases
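A minimal sketch of the disable/restore dance, assuming the skill lives under ~/.claude/skills/ (the path is an assumption); the baseline runs themselves are left as a placeholder since the copilot CLI invocation depends on your setup:

```bash
# Temporarily disable the skill, run baselines, then restore it.
SKILL_MD="$HOME/.claude/skills/my-skill/SKILL.md"   # replace with the skill under test

mv "$SKILL_MD" "$SKILL_MD.disabled"     # skill can no longer be loaded
# ... re-run the same quality cases here with your copilot CLI, saving outputs as the baseline ...
mv "$SKILL_MD.disabled" "$SKILL_MD"     # restore the skill after baseline runs complete
```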
Note: run-baseline-comparison.sh compares two versions of a SKILL.md (before/after modification), not skill-vs-no-skill. Use it for modification quality checks, not for this baseline step.
6. Synthesize and verdict
- Routing target: positive trigger rate ≥ 95% and negative rejection rate ≥ 90%
- Minimum gate (automated): positive trigger rate ≥ 80% and negative rejection rate ≥ 80%. Skills below this threshold fail.
- Quality target: ≥ 80% of outputs meet the rubric
- Baseline target: win rate ≥ 60%
- The 95%/90% targets are aspirational quality bars. The 80%/80% gates are the automated pass/fail thresholds in run-evals.sh (a sketch of the gate logic follows this list). Skills between 80–95% pass the gate but should still be improved.
- Verdict: Pass / Fail / Needs Work with the specific failing metrics
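The automated gate reduces to a simple threshold check. A sketch with illustrative variable names (not the ones run-evals.sh actually uses):

```bash
# Illustrative gate check mirroring the 80%/80% pass/fail thresholds.
POS_RATE=90   # positive trigger rate from the eval run
NEG_RATE=75   # negative rejection rate from the eval run

if [ "$POS_RATE" -ge 80 ] && [ "$NEG_RATE" -ge 80 ]; then
  echo "gate: pass (still compare against the 95%/90% targets)"
else
  echo "gate: fail"
fi
```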
Output contract
```markdown
## Skill Evaluation: [skill-name]

### Routing Accuracy
| Metric                  | Value | Gate (≥80%) | Target | Pass? |
|-------------------------|-------|-------------|--------|-------|
| Positive trigger rate   | X%    | ≥ 80%       | ≥ 95%  | ✓/✗   |
| Negative rejection rate | X%    | ≥ 80%       | ≥ 90%  | ✓/✗   |

Misrouted cases: [list or "None"]

### Output Quality (N cases)
Score: X/N pass (Y%)

### Baseline Comparison (if run separately via run-baseline-comparison.sh)
Win rate: X/N (Y%)

### Verdict: [Pass | Fail | Needs Work]
Failing metrics: [list or "None"]
Next action: [specific remediation or "Ready for promotion"]

### Handoff (for downstream skills)
- **Eval report**: eval-results/[skill-name]-eval.md
- **Primary failure**: [routing | output-quality | usefulness | none]
- **Failing cases**:
  - [prompt text] — [reason: misrouted / wrong output / low usefulness score]
  - ...
- **Recommended next skill**: [skill-trigger-optimization | skill-improver | skill-benchmarking | skill-safety-review | none]
```
The Handoff section is derived from the eval results above. When routing to downstream skills (especially skill-improver), include the eval report path and the specific failing prompts. The run-evals.sh script writes reports to eval-results/ with a -eval.md symlink to the latest.
Failure handling
| Situation | Action |
|---|---|
| No eval cases exist | Create minimum set: 3 positive triggers, 3 negative triggers, 2 quality cases. Mark them as ad-hoc in the report. |
| Cannot determine whether skill triggered | Inspect client routing logs. If unavailable, compare output structure with and without the skill description present. |
| Baseline comparison inconclusive (win rate 45–55%) | Double the sample size. If still inconclusive, report as "neutral — skill neither helps nor hurts." |
| Routing passes but output quality fails | Stop evaluation. Route to skill-improver with the eval report path (eval-results/<skill-name>-eval.md) and the specific failing prompts. |
| Skill passes eval but fails in real usage | Eval set has coverage gaps. Add the failing real-world case and re-run. |
Routing to downstream skills: When handing off to another skill (trigger-optimization, improver, etc.), always include:
- The eval report path: eval-results/<skill-name>-eval.md
- The primary failure type and specific failing prompts from the evaluation
- The recommended next skill based on the failure type
This enables skill-improver to use eval-driven diagnosis (reading the report) rather than relying on heuristic guesswork.
Next steps
After evaluation:
- If routing fails → skill-trigger-optimization
- Diagnose anti-patterns before improving → skill-anti-patterns
- If output quality fails → skill-improver
- If comparing variants → skill-benchmarking
- Before promotion to stable → skill-safety-review