ai · skill-judge

Evaluate Agent Skill quality against official specifications. Use when reviewing SKILL.md files, auditing skill packages, improving skill design, or checking if a skill follows best practices. Provides 8-dimension scoring (120 points) with actionable improvements. Triggers on review skill, evaluate skill, audit skill, improve skill, skill quality, SKILL.md review.

install
source · Clone the upstream repo
git clone https://github.com/wpank/ai
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/wpank/ai "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/tools/skill-judge" ~/.claude/skills/wpank-ai-skill-judge && rm -rf "$T"
manifest: skills/tools/skill-judge/SKILL.md
source content

Skill Judge

Evaluate Agent Skills against official specifications and patterns derived from 17+ official examples.

WHAT This Skill Does

Scores skills across 8 dimensions (120 points total) and provides specific, actionable improvement suggestions.

WHEN To Use

  • Reviewing/auditing a SKILL.md file
  • Improving an existing skill's design
  • Checking if a skill follows best practices
  • Before publishing a skill to the ecosystem

KEYWORDS: review skill, evaluate skill, audit skill, skill quality, SKILL.md

Installation

OpenClaw / Moltbot / Clawbot

npx clawhub@latest install skill-judge

Core Philosophy

The Core Formula

Good Skill = Expert-only Knowledge − What Claude Already Knows

A Skill's value = its knowledge delta — the gap between what it provides and what the model already knows.

| Type | Definition | Treatment |
|------|------------|-----------|
| Expert | Claude genuinely doesn't know this | Must keep — this is the Skill's value |
| Activation | Claude knows but may not think of it | Keep if brief — serves as a reminder |
| Redundant | Claude definitely knows this | Delete — wastes tokens |

Good Skill ratio: >70% Expert, <20% Activation, <10% Redundant


Evaluation Dimensions (120 points)

D1: Knowledge Delta (20 pts) — THE CORE DIMENSION

Does the Skill add genuine expert knowledge?

| Score | Criteria |
|-------|----------|
| 0-5 | Explains basics Claude knows (tutorials, standard library usage) |
| 6-10 | Mixed: some expert knowledge diluted by obvious content |
| 11-15 | Mostly expert knowledge with minimal redundancy |
| 16-20 | Pure knowledge delta — every paragraph earns its tokens |

Red flags (instant ≤5): "What is [basic concept]", step-by-step tutorials, generic best practices

Green flags (high delta): Decision trees, non-obvious trade-offs, edge cases from experience, "NEVER do X because [non-obvious reason]"


D2: Mindset + Procedures (15 pts)

Does the Skill transfer expert thinking patterns AND domain-specific procedures?

| Score | Criteria |
|-------|----------|
| 0-3 | Only generic procedures Claude already knows |
| 4-7 | Has domain procedures but lacks thinking frameworks |
| 8-11 | Good balance: thinking patterns + domain-specific workflows |
| 12-15 | Expert-level: shapes thinking AND provides procedures Claude wouldn't know |

Valuable thinking patterns: "Before [action], ask yourself: Purpose? Constraints? Differentiation?"

Valuable procedures: Domain-specific sequences, non-obvious ordering, critical steps easy to miss

Redundant procedures: Generic file operations, standard programming patterns


D3: Anti-Pattern Quality (15 pts)

Does the Skill have effective NEVER lists?

| Score | Criteria |
|-------|----------|
| 0-3 | No anti-patterns mentioned |
| 4-7 | Generic warnings ("avoid errors", "be careful") |
| 8-11 | Specific NEVER list with some reasoning |
| 12-15 | Expert-grade anti-patterns with WHY — things only experience teaches |

Test: Would an expert read the anti-pattern list and say "yes, I learned this the hard way"?


D4: Specification Compliance — Especially Description (15 pts)

The description is THE MOST IMPORTANT field. It's the only thing the agent sees before deciding to load the skill.

| Score | Criteria |
|-------|----------|
| 0-5 | Missing frontmatter or invalid format |
| 6-10 | Has frontmatter but the description is vague or incomplete |
| 11-13 | Valid frontmatter; description has WHAT but is weak on WHEN |
| 14-15 | Perfect: comprehensive description with WHAT, WHEN, and trigger keywords |

Description must answer:

  1. WHAT: What does this Skill do?
  2. WHEN: In what situations should it be used?
  3. KEYWORDS: What terms should trigger this Skill?

Poor: "Helps with document tasks"
Good: "Create, edit, and analyze .docx files. Use when working with Word documents, tracked changes, or professional document formatting."
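For reference, a frontmatter that would land in the 14-15 band might look like the following sketch (the skill name and wording are hypothetical, not taken from an official example):

```yaml
---
name: docx-handler
description: >
  Create, edit, and analyze .docx files, including tracked changes and
  professional formatting. Use when working with Word documents or when the
  user mentions docx, tracked changes, or document formatting.
---
```

Note that the WHEN clause and trigger keywords live inside the description itself, since the agent sees nothing else before deciding to load the skill.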


D5: Progressive Disclosure (15 pts)

Does the Skill implement proper content layering?

| Layer | Content | Size |
|-------|---------|------|
| 1: Metadata | name + description | ~100 tokens |
| 2: SKILL.md | Guidelines, decision trees | <500 lines ideal |
| 3: Resources | scripts/, references/, assets/ | No limit |

| Score | Criteria |
|-------|----------|
| 0-5 | Everything dumped in SKILL.md (>500 lines, no structure) |
| 6-10 | Has references but unclear when to load them |
| 11-13 | Good layering with MANDATORY triggers present |
| 14-15 | Perfect: decision trees + explicit triggers + "Do NOT Load" guidance |

Good trigger: "MANDATORY - READ ENTIRE FILE: Before proceeding, you MUST read docx-js.md"

Bad trigger: Just listing references at the end without loading guidance


D6: Freedom Calibration (15 pts)

Is specificity appropriate for the task's fragility?

| Task Type | Should Have | Why |
|-----------|-------------|-----|
| Creative/Design | High freedom | Multiple valid approaches |
| Code review | Medium freedom | Principles exist but judgment required |
| File format operations | Low freedom | One wrong byte corrupts the file |

| Score | Criteria |
|-------|----------|
| 0-5 | Severely mismatched (rigid scripts for creative work, vague guidance for fragile operations) |
| 6-10 | Partially appropriate |
| 11-13 | Good calibration for most scenarios |
| 14-15 | Perfect freedom calibration throughout |

Test: "If Agent makes a mistake, what's the consequence?" High consequence → Low freedom


D7: Pattern Recognition (10 pts)

Does the Skill follow an established pattern?

| Pattern | ~Lines | When to Use |
|---------|--------|-------------|
| Mindset | ~50 | Creative tasks requiring taste |
| Navigation | ~30 | Multiple distinct scenarios (routes to sub-files) |
| Philosophy | ~150 | Art/creation requiring originality |
| Process | ~200 | Complex multi-step projects |
| Tool | ~300 | Precise operations on specific formats |

| Score | Criteria |
|-------|----------|
| 0-3 | No recognizable pattern, chaotic structure |
| 4-6 | Partially follows a pattern with significant deviations |
| 7-8 | Clear pattern with minor deviations |
| 9-10 | Masterful application of the appropriate pattern |

D8: Practical Usability (15 pts)

Can an Agent actually use this Skill effectively?

| Score | Criteria |
|-------|----------|
| 0-5 | Confusing, incomplete, or untested guidance |
| 6-10 | Usable but with noticeable gaps |
| 11-13 | Clear guidance for common cases |
| 14-15 | Comprehensive: edge cases, error handling, decision trees |

Check for: Decision trees for multi-path scenarios, working code examples, error handling/fallbacks, edge cases covered


NEVER Do When Evaluating

  • Give high scores just because it "looks professional"
  • Ignore token waste — every redundant paragraph = deduction
  • Let length impress you — a 43-line Skill can outperform a 500-line one
  • Skip mentally testing the decision trees
  • Forgive explaining basics with "provides helpful context"
  • Overlook missing anti-patterns
  • Undervalue the description field — poor description = skill never gets used
  • Put "when to use" info only in the body (agent only sees description before loading)

Evaluation Protocol

Step 1: Knowledge Delta Scan

Read SKILL.md and mark each section:

  • [E] Expert: Claude doesn't know this — value-add
  • [A] Activation: Claude knows but reminder useful — acceptable
  • [R] Redundant: Claude knows this — should delete

Calculate the E:A:R ratio (target: >70% E, <20% A, <10% R)
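The ratio calculation can be sketched as a small script over the tagged sections; the function and sample data below are illustrative, not part of the skill:

```python
import re


def knowledge_ratio(marked_sections: list[str]) -> tuple[int, int, int]:
    """Given sections tagged [E], [A], or [R], return the percentage ratio E:A:R.

    Rounded percentages may not sum to exactly 100; this is a quick scan,
    not a precise metric.
    """
    counts = {"E": 0, "A": 0, "R": 0}
    for section in marked_sections:
        m = re.match(r"\[([EAR])\]", section)
        if m:
            counts[m.group(1)] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return tuple(round(100 * counts[k] / total) for k in ("E", "A", "R"))


# A skill with 7 expert, 2 activation, and 1 redundant section hits the target:
sections = ["[E] edge cases"] * 7 + ["[A] reminder"] * 2 + ["[R] basics"]
print(knowledge_ratio(sections))  # (70, 20, 10)
```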

Step 2: Structure Analysis

[ ] Valid frontmatter (name ≤64 chars, comprehensive description)
[ ] Total lines in SKILL.md
[ ] Reference files and sizes
[ ] Pattern identification (Mindset/Navigation/Philosophy/Process/Tool)
[ ] Loading triggers present (if references exist)
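The mechanically checkable items in this checklist could be automated roughly as follows (a sketch; it assumes `---`-delimited frontmatter with simple `key: value` lines, and a real implementation should use a YAML parser):

```python
def check_structure(skill_md: str) -> dict[str, bool]:
    """Run the Step 2 checklist items that can be checked mechanically."""
    lines = skill_md.splitlines()
    has_frontmatter = lines[:1] == ["---"] and "---" in lines[1:]
    # Crude field extraction; multi-line YAML values are not handled here.
    front = lines[1:lines.index("---", 1)] if has_frontmatter else []
    fields = dict(line.split(":", 1) for line in front if ":" in line)
    name = fields.get("name", "").strip()
    desc = fields.get("description", "").strip()
    return {
        "valid_frontmatter": has_frontmatter and bool(name) and bool(desc),
        "name_within_64_chars": 0 < len(name) <= 64,
        "under_500_lines": len(lines) < 500,
    }


sample = "---\nname: skill-judge\ndescription: Evaluate Agent Skills.\n---\n# Body\n"
print(check_structure(sample))  # all three checks True
```

Pattern identification and trigger presence still need human (or agent) judgment; only the structural limits lend themselves to a script like this.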

Step 3: Score Each Dimension

For each dimension: find evidence, assign score, note improvements if < max

Step 4: Calculate Total & Grade

| Grade | Percentage | Meaning |
|-------|------------|---------|
| A | 90%+ (108+) | Excellent — production-ready |
| B | 80-89% (96-107) | Good — minor improvements needed |
| C | 70-79% (84-95) | Adequate — clear improvement path |
| D | 60-69% (72-83) | Below average — significant issues |
| F | <60% (<72) | Poor — needs fundamental redesign |
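The grade thresholds translate directly into code; this sketch computes the percentage against the 120-point maximum:

```python
def grade(total: int, max_points: int = 120) -> str:
    """Map a dimension-score sum to the letter grade from the table above."""
    pct = 100 * total / max_points
    if pct >= 90:
        return "A"
    if pct >= 80:
        return "B"
    if pct >= 70:
        return "C"
    if pct >= 60:
        return "D"
    return "F"


print(grade(108), grade(96), grade(84), grade(72), grade(71))  # A B C D F
```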

Step 5: Generate Report

# Skill Evaluation Report: [Skill Name]

## Summary
- **Total Score**: X/120 (X%)
- **Grade**: [A/B/C/D/F]
- **Pattern**: [Mindset/Navigation/Philosophy/Process/Tool]
- **Knowledge Ratio**: E:A:R = X:Y:Z
- **Verdict**: [One sentence]

## Dimension Scores
| Dimension | Score | Max | Notes |
|-----------|-------|-----|-------|
| D1: Knowledge Delta | X | 20 | |
| D2: Mindset + Procedures | X | 15 | |
| D3: Anti-Pattern Quality | X | 15 | |
| D4: Specification Compliance | X | 15 | |
| D5: Progressive Disclosure | X | 15 | |
| D6: Freedom Calibration | X | 15 | |
| D7: Pattern Recognition | X | 10 | |
| D8: Practical Usability | X | 15 | |

## Critical Issues
[Must-fix problems]

## Top 3 Improvements
1. [Highest impact with specific guidance]
2. [Second priority]
3. [Third priority]

Common Failure Patterns

| Pattern | Symptom | Fix |
|---------|---------|-----|
| Tutorial | Explains what X is, basic library usage | Delete basics. Focus on expert decisions. |
| Dump | 800+ lines, everything included | Core in SKILL.md (<300 lines), details in references/ |
| Orphan References | References exist but are never loaded | Add "MANDATORY - READ" at decision points |
| Checkbox Procedure | Step 1, Step 2... mechanical | Transform to "Before doing X, ask yourself..." |
| Vague Warning | "Be careful", "avoid errors" | Specific NEVER list with concrete examples |
| Invisible Skill | Great content, rarely activated | Fix description: WHAT + WHEN + KEYWORDS |
| Wrong Location | "When to use" in body, not description | Move triggers to the description field |
| Over-Engineered | README, CHANGELOG, CONTRIBUTING | Delete. Only what the Agent needs for the task. |

The Meta-Question

"Would an expert in this domain say: 'Yes, this captures knowledge that took me years to learn'?"

If yes → genuine value. If no → it merely compresses what Claude already knows.

The best Skills are compressed expert brains — 10 years of experience in 50 lines.