Crucible skill-selection-evals
Eval-only skill for measuring skill routing accuracy. Not invoked directly — contains selection evals that test whether the agent picks the correct skill for a given prompt.
install
source · Clone the upstream repo
git clone https://github.com/raddue/crucible
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/raddue/crucible "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/skill-selection-evals" ~/.claude/skills/raddue-crucible-skill-selection-evals && rm -rf "$T"
manifest:
skills/skill-selection-evals/SKILL.md · source content
Skill-Selection Evals
This is not an executable skill. It contains evaluation data for measuring the accuracy of skill selection (routing) decisions.
Purpose
Crucible's 49 execution evals measure quality once a skill is invoked. Selection evals measure whether the right skill gets invoked in the first place.
Eval Types
- Direct selection: Given a prompt, does the agent pick the correct skill?
- Negative selection: Given a prompt that sounds like skill X but is not, does the agent avoid the false positive?
- Context-dependent: Same verb, different context, different correct skill.
- Cascade ordering: Multi-skill tasks requiring correct invocation order.
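The four eval types above can be sketched as data records with a small grading helper. This is a minimal illustration, not the actual schema of evals/evals.json — the record fields, prompts, and grading semantics here are all assumptions:

```python
# Hypothetical eval records for each eval type. Field names ("type",
# "expected", "not_expected", "expected_order") are illustrative only;
# the real schema lives in evals/evals.json.
evals = [
    {"type": "direct", "prompt": "Write failing tests first for this feature",
     "expected": "test-methodology"},
    {"type": "negative", "prompt": "Proofread this PR description for typos",
     "not_expected": "code-review"},  # sounds like code-review, is not
    {"type": "context", "prompt": "Check that the fix actually works",
     "expected": "verify"},
    {"type": "cascade", "prompt": "Debug the crash, then confirm the fix",
     "expected_order": ["debugging", "verify"]},
]

def grade(record, selection):
    """Grade one routing decision (assumed semantics for each eval type)."""
    if record["type"] == "negative":
        # Pass means the agent avoided the false-positive skill.
        return selection != record["not_expected"]
    if record["type"] == "cascade":
        # Pass means the full invocation order matches.
        return selection == record["expected_order"]
    return selection == record["expected"]
```

A negative eval passes when the tempting skill is *not* chosen, which is why its grading branch inverts the comparison.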
Boundaries Tested
- test-methodology — TDD vs test-coverage vs adversarial-tester
- review-direction — code-review vs review-feedback
- adversarial-scope — red-team vs inquisitor vs audit vs siege
- completion-claims — verify vs finish
- bug-handling — debugging vs verify vs audit
Difficulty Ratings
Each eval is rated easy/medium/hard based on routing ambiguity. This enables stratified baseline measurement — distinguishing between improvements that lift hard cases (high value) vs confirming easy cases already work (low signal).
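Stratified measurement like this amounts to grouping pass/fail results by difficulty before averaging. A minimal sketch, assuming results arrive as (difficulty, passed) pairs — the input shape is an assumption, not the GRADING.md protocol:

```python
from collections import defaultdict

def stratified_accuracy(results):
    """Compute per-difficulty accuracy from (difficulty, passed) pairs."""
    totals = defaultdict(lambda: [0, 0])  # difficulty -> [passed, total]
    for difficulty, passed in results:
        totals[difficulty][0] += int(passed)
        totals[difficulty][1] += 1
    return {d: p / t for d, (p, t) in totals.items()}

results = [("easy", True), ("easy", True), ("medium", True),
           ("medium", False), ("hard", False), ("hard", True)]
print(stratified_accuracy(results))
# → {'easy': 1.0, 'medium': 0.5, 'hard': 0.5}
```

An aggregate accuracy over this sample would hide that all the headroom is in the medium and hard strata, which is exactly the signal the ratings are meant to preserve.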
See Also
evals/evals.json — the eval data
GRADING.md — grading criteria and baseline measurement protocol