Claude-code-ultimate-guide eval-skills

Audit all skills in the current project for frontmatter completeness, effort level appropriateness, allowed-tools scoping, and content quality. Produces a scored report with effort-level recommendations for each skill. Use when onboarding to a new project, reviewing skill quality before shipping, or adding effort fields to an existing skill library.

Install

Source · Clone the upstream repo:

git clone https://github.com/FlorianBruniaux/claude-code-ultimate-guide

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/FlorianBruniaux/claude-code-ultimate-guide "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/skills/eval-skills" ~/.claude/skills/florianbruniaux-claude-code-ultimate-guide-eval-skills && rm -rf "$T"

Manifest: examples/skills/eval-skills/SKILL.md

Source content

Skill Evaluator

Discover all skills in the project, score them across 6 criteria, and infer the appropriate `effort` level based on content analysis.

When to Use

  • New project: run once to establish baseline quality
  • Before committing a skill to a team repo
  • After bulk-importing skills from another project
  • When adding `effort` fields for the first time (v2.1.80+)

What Gets Audited

All `SKILL.md` files and flat `.md` files found in:

  • .claude/skills/**
  • ~/.claude/skills/** (if requested)
  • Any path passed as argument: /eval-skills ./my-skills-dir

Scoring Criteria (14 pts per skill)

| # | Criterion | Max | What is checked |
|---|-----------|-----|-----------------|
| 1 | name | 1 | Present, lowercase, hyphens only, matches directory name |
| 2 | description | 2 | Present + has "Use when" / "when to" / trigger phrasing |
| 3 | allowed-tools | 2 | Present + not overly broad (Bash without scoping when read-only) |
| 4 | effort | 3 | Present (1 pt) + appropriate for content (2 pts, based on inference) |
| 5 | content structure | 4 | Has Purpose/When section (1), has examples/usage (1), has clear workflow (1), no placeholder text (1) |
| 6 | bonus | +2 | argument-hint present (1), version/author metadata (1) |

Note: `tags` is NOT an officially supported frontmatter field in Claude Code; it is ignored by the runtime. Do not include it or score it as a quality criterion.

Thresholds:

  • ✅ Good: ≥11/14 (≈79% and up)
  • ⚠️ Needs work: 8–10/14 (≈57–71%)
  • ❌ Fix: <8/14 (below ≈57%)
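
A minimal sketch of this bucketing, with the cutoffs above hard-coded (`score_status` is a hypothetical helper name, not part of the skill):

```bash
# Hypothetical helper: map a raw score (out of 14) to a status bucket.
# Cutoffs mirror the thresholds above: >=11 good, 8-10 needs work, <8 fix.
score_status() {
  local score=$1
  if   [ "$score" -ge 11 ]; then echo "✅ Good"
  elif [ "$score" -ge 8 ];  then echo "⚠️ Needs work"
  else                           echo "❌ Fix"
  fi
}

score_status 9   # prints: ⚠️ Needs work
```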

Effort Level Inference Engine

For each skill, analyze the description + content and classify using these signals (a keyword-matching sketch follows the `high` examples below):

`low` — Mechanical execution, no design decisions

Signals:

  • Verbs: commit, push, sync, scaffold, generate (template-based), format, rename, bump, wrap, convert
  • No reasoning required: sequential steps, template instantiation, data fetching
  • allowed-tools: Bash only, or Read-only
  • No sub-agents spawned
  • Short workflow (<5 steps)

Examples: `/commit`, `/release-notes`, `/scaffold`, `/sync`, `/format`

`medium` — Analysis with bounded scope, categorization

Signals:

  • Verbs: review, triage, analyze, categorize, suggest, evaluate (single file or bounded scope)
  • Requires pattern recognition but not architectural reasoning
  • allowed-tools: Read + Grep + Bash combination
  • May spawn 1-2 sub-agents but with predefined scope
  • Produces structured output (tables, categorized lists)

Examples: `/code-review` (single PR), `/issue-triage`, `/dependency-audit`, `/test-coverage`

`high` — Design decisions, adversarial reasoning, cross-system analysis

Signals:

  • Verbs: architect, redesign, threat-model, audit (security), orchestrate (multi-agent), score, assess trade-offs
  • Requires reasoning about edge cases, attack vectors, or system-wide implications
  • allowed-tools: broad access (Read + Write + Bash + external tools)
  • Spawns multiple sub-agents or uses parallel execution
  • Produces analysis with explicit uncertainty or trade-off sections
  • Keywords in content: "security", "architecture", "adversarial", "pipeline", "threat", "design decision"

Examples: `/security-audit`, `/architecture-review`, `/cyber-defense`, `/eval-agents`
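
A first-pass classifier along these lines can be sketched with grep over the skill body. This is illustrative only: the keyword lists are incomplete, and the real inference should also weigh allowed-tools scope and sub-agent counts as described above. `infer_effort` is a hypothetical helper name.

```bash
# Sketch only: keyword-based effort inference over a skill file.
# High-signal keywords are checked first; files with no analysis
# verbs fall through to low (mechanical execution).
infer_effort() {
  local file=$1
  if grep -qiE 'security|threat|adversarial|architect|orchestrat' "$file"; then
    echo high
  elif grep -qiE 'review|triage|analyz|categoriz|evaluat' "$file"; then
    echo medium
  else
    echo low
  fi
}
```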

Mismatch flag

If a skill has `effort:` already set but the inferred level differs, flag it:

⚠️ Effort mismatch: declared `low`, inferred `high` — skill spawns 4 sub-agents and performs security analysis
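
One way to produce that flag, sketched on top of the hypothetical `infer_effort` helper above; the sed/awk pipeline assumes frontmatter delimited by the first two `---` lines, and `$file` is the skill path:

```bash
# Compare the declared effort (frontmatter) against the inferred level.
declared=$(sed -n '/^---$/,/^---$/p' "$file" | awk -F': *' '$1 == "effort" {print $2}')
inferred=$(infer_effort "$file")
if [ -n "$declared" ] && [ "$declared" != "$inferred" ]; then
  echo "⚠️ Effort mismatch: declared \`$declared\`, inferred \`$inferred\`"
fi
```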


Execution Instructions

Step 1 — Discovery

# Find all SKILL.md files
find .claude/skills -name "SKILL.md" 2>/dev/null

# Find flat skill files
find .claude/skills -maxdepth 1 -name "*.md" ! -name "README*" 2>/dev/null

# If argument provided, use that path instead
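
The argument fallback hinted at in the last comment could look like this (a sketch; `$1` is the optional path argument):

```bash
# Default to the project skills directory when no path argument is given.
ROOT="${1:-.claude/skills}"
find "$ROOT" -name "SKILL.md" 2>/dev/null
find "$ROOT" -maxdepth 1 -name "*.md" ! -name "README*" 2>/dev/null
```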

Step 2 — Parse each skill

For each skill file found:

  1. Read the full file
  2. Extract the YAML frontmatter (between the first `---` and the second `---`; see the parsing sketch below)
  3. Parse: name, description, allowed-tools, effort, argument-hint, version
  4. Note presence/absence of each field
  5. Read the body content for structure analysis
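
A minimal parsing sketch for steps 1–3, assuming standard sed/awk; the `/tmp/SKILL.md` fixture and the choice of field are illustrative:

```bash
# Build a sample skill file to parse.
cat > /tmp/SKILL.md <<'EOF'
---
name: demo-skill
description: Demo. Use when testing the evaluator.
allowed-tools: Read, Grep
effort: low
---
Body content here.
EOF

# Frontmatter block between the first two --- lines (delimiters stripped).
sed -n '/^---$/,/^---$/p' /tmp/SKILL.md | sed '1d;$d'

# A single field, e.g. description.
awk '/^description:/ {sub(/^description: */, ""); print; exit}' /tmp/SKILL.md
```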

Step 3 — Score and infer

Apply the scoring criteria above to each skill:

  • Check frontmatter fields
  • Evaluate description quality (does it answer "when to use"? is it under 1024 chars? see the sketch after this list)
  • Evaluate allowed-tools scope (is Bash used when only Read would suffice? are tools scoped with wildcards when possible?)
  • Infer effort level from content analysis
  • Compare inferred vs declared effort (if set)
  • Evaluate content structure (scan for "When to Use", "Purpose", "Example", "Workflow" sections)
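
For the description checks specifically, a sketch of the two-point rubric plus the length guard (`$desc` stands for the already-extracted description string):

```bash
desc="Demo. Use when testing the evaluator."
score=0
[ -n "$desc" ] && score=$((score + 1))                                # present
echo "$desc" | grep -qiE 'use when|when to' && score=$((score + 1))   # trigger phrasing
[ "${#desc}" -gt 1024 ] && echo "⚠️ description exceeds 1024 chars"
echo "description: $score/2"
```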

Step 4 — Output

Produce a structured report:

# Skills Audit — [project name or path]
Date: [today] | Scanned: N skills

## Summary
| Status | Count |
|--------|-------|
| ✅ Good (≥11/14) | N |
| ⚠️ Needs work (8–10/14) | N |
| ❌ Fix (<8/14) | N |

**Effort coverage**: N/N skills have effort field set

---

## Per-Skill Results

### [skill-name] — [score]/14 [✅/⚠️/❌]

| Criterion | Score | Notes |
|-----------|-------|-------|
| name | ✅ 1/1 | — |
| description | ⚠️ 1/2 | Missing "Use when" phrasing |
| allowed-tools | ✅ 2/2 | Well-scoped |
| effort | ❌ 0/3 | Missing — Recommended: high |
| content structure | ⚠️ 2/4 | No examples section |

**Effort inference**: `high` — skill performs security analysis with adversarial reasoning
  Signals: "threat", "attack surface", "vulnerability scoring" in content; spawns 4 agents

**Priority fixes** (ordered by impact):
1. Add `effort: high` to frontmatter
2. Add "Use when" to description
3. Add a concrete usage example section

---

After all skills: print a Fix Summary — all missing effort fields with recommended values, ready to copy-paste.


Fix Summary Format

At the end, print a ready-to-use patch block for all missing/mismatched effort fields:

## Recommended effort fields (copy-paste ready)

skill-name-1: effort: low     # mechanical scaffold
skill-name-2: effort: high    # security analysis, spawns agents
skill-name-3: effort: medium  # code review, bounded scope
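
If the audit accumulates its findings as tab-separated rows of skill name, level, and reason (`results.tsv` is a hypothetical intermediate file, not something the skill specifies), the block above can be rendered in one awk pass:

```bash
awk -F'\t' '{printf "%s: effort: %-6s # %s\n", $1, $2, $3}' results.tsv
```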

And a 1-line count:

N skills need effort field · N mismatches · N missing allowed-tools