Vibecosystem skill-evolution

Self-evolving skill system. Skills are scored after execution (0-100) on 5 dimensions. Score 90+ over 5 runs = crystallized (locked). Score below 30 = auto-repair attempted. Skills improve themselves through usage feedback.

install
source · Clone the upstream repo
git clone https://github.com/vibeeval/vibecosystem
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/vibeeval/vibecosystem "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/skill-evolution" ~/.claude/skills/vibeeval-vibecosystem-skill-evolution && rm -rf "$T"
manifest: skills/skill-evolution/SKILL.md
source content

Skill Evolution

Darwinian selection for skills. Skills that produce good outcomes are crystallized and protected. Skills that produce poor outcomes are repaired or archived. Every execution generates a score that drives the next generation of the skill.

The 5 Scoring Dimensions

Each skill execution is scored 0-100 on five dimensions:

Dimension          Weight  What It Measures
Accuracy           25%     Did the skill produce the correct result for the task?
Relevance          20%     Was the skill content applicable to the actual use case?
Token Efficiency   20%     Did the skill guide the agent without bloat or repetition?
User Satisfaction  20%     Did the outcome meet or exceed user expectations?
Reusability        15%     Could another agent use this skill in a similar situation?

Composite score = weighted average of all five dimensions (0-100).
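
The weighted average can be sketched in Python (the `composite` helper and its name are illustrative, not part of the skill itself; the weights come from the table above):

```python
# Dimension weights from the table above.
WEIGHTS = {
    "accuracy": 0.25,
    "relevance": 0.20,
    "token_efficiency": 0.20,
    "user_satisfaction": 0.20,
    "reusability": 0.15,
}

def composite(scores: dict) -> float:
    """Weighted average of the five dimension scores (each 0-100)."""
    return round(sum(scores[d] * w for d, w in WEIGHTS.items()), 2)
```

For example, the second sample record under Score Storage Format (accuracy 95, relevance 90, token efficiency 82, user satisfaction 95, reusability 88) works out to 90.35, which the stored record rounds to 90.4.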

Scoring Rubric

90-100: Excellent -- candidate for crystallization
70-89:  Good -- active skill, no action needed
50-69:  Adequate -- flag for review after 3 more runs
30-49:  Poor -- schedule auto-repair attempt
0-29:   Critical -- immediate auto-repair or archive
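
The rubric tiers reduce to a small dispatch function (a sketch; the name `rubric_action` is illustrative):

```python
# Map a composite score (0-100) to its rubric tier and follow-up action.
def rubric_action(composite: float) -> str:
    if composite >= 90:
        return "excellent: candidate for crystallization"
    if composite >= 70:
        return "good: active skill, no action needed"
    if composite >= 50:
        return "adequate: flag for review after 3 more runs"
    if composite >= 30:
        return "poor: schedule auto-repair attempt"
    return "critical: immediate auto-repair or archive"
```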

Skill Lifecycle

DRAFT          ACTIVE         CRYSTALLIZED      ARCHIVED
  |               |                |                |
New skill   In regular use   Proven stable    Deprecated/replaced

Transitions:
  DRAFT  -- first run -------------------------> ACTIVE
  ACTIVE -- score >90 for 5+ runs -------------> CRYSTALLIZED
  ACTIVE -- score <30 -------------------------> auto-repair (3 attempts)
  auto-repair succeeds (synthetic score 50+) --> ACTIVE
  auto-repair fails 3x ------------------------> ARCHIVED

Draft

New skills enter as Draft. They receive no special protection and are evaluated critically on first use. A Draft skill that scores below 30 on its very first run is discarded rather than repaired.

Active

Skills in regular use. Scores are tracked in ~/.claude/skill-scores.jsonl. No action unless scores trend below 30 or above 90 over a rolling window of 5 runs.
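
The rolling-window check can be sketched as follows (the `rolling_status` helper is illustrative; file path as above, thresholds from the rubric):

```python
import json
from pathlib import Path
from statistics import mean

def rolling_status(skill: str, path: str = "~/.claude/skill-scores.jsonl",
                   window: int = 5) -> str:
    """Average composite over the last `window` runs of one skill.

    Returns 'crystallize', 'repair', or 'hold'.
    """
    lines = Path(path).expanduser().read_text().splitlines()
    records = [json.loads(l) for l in lines if l.strip()]
    recent = [r["composite"] for r in records if r["skill"] == skill][-window:]
    if len(recent) < window:
        return "hold"  # not enough data for a full window
    avg = mean(recent)
    if avg > 90:
        return "crystallize"
    if avg < 30:
        return "repair"
    return "hold"
```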

Crystallized

A skill that maintains an average composite score above 90 over 5 or more consecutive runs is crystallized:

  • Git tag applied: skill/<name>/crystallized-v<N>
  • Read-only flag added to frontmatter: locked: true
  • Skill is excluded from auto-repair
  • Changes require explicit human unlock + PR

Archived

A skill that fails auto-repair 3 times is archived:

  • Moved to skills/_archived/<name>/
  • Git tag applied: skill/<name>/archived
  • Replacement skill drafted by the catalyst agent if the capability is still needed

Score Storage Format

Append one record per execution to ~/.claude/skill-scores.jsonl:

{"skill":"experiment-loop","ts":"2026-04-07T10:00:00Z","session":"abc123","scores":{"accuracy":88,"relevance":92,"token_efficiency":75,"user_satisfaction":90,"reusability":85},"composite":86.2,"feedback":"Loop ran 4 iterations successfully, target nearly met"}
{"skill":"experiment-loop","ts":"2026-04-07T14:30:00Z","session":"def456","scores":{"accuracy":95,"relevance":90,"token_efficiency":82,"user_satisfaction":95,"reusability":88},"composite":90.4,"feedback":"Bundle size reduced 28%, target exceeded"}
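
Writing a record in this format can be sketched as follows (the `record_score` helper is illustrative; the weights come from the dimensions table above):

```python
import datetime
import json
from pathlib import Path

WEIGHTS = {"accuracy": 0.25, "relevance": 0.20, "token_efficiency": 0.20,
           "user_satisfaction": 0.20, "reusability": 0.15}

def record_score(skill: str, session: str, scores: dict, feedback: str,
                 path: str = "~/.claude/skill-scores.jsonl") -> dict:
    """Append one execution record in the storage format shown above."""
    rec = {
        "skill": skill,
        "ts": datetime.datetime.now(datetime.timezone.utc)
                               .strftime("%Y-%m-%dT%H:%M:%SZ"),
        "session": session,
        "scores": scores,
        "composite": round(sum(scores[d] * w for d, w in WEIGHTS.items()), 1),
        "feedback": feedback,
    }
    with Path(path).expanduser().open("a") as f:
        f.write(json.dumps(rec) + "\n")
    return rec
```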

Score CLI (quick check)

# Average composite score for a skill over its last 10 runs.
# Run inside a shell script or function, where $1 is the skill name.
cat ~/.claude/skill-scores.jsonl | python3 -c "
import sys, json, statistics
skill = '$1'
runs = [r for r in map(json.loads, sys.stdin) if r.get('skill') == skill][-10:]
if runs:
    avg = statistics.mean(r['composite'] for r in runs)
    print(f'{skill}: {avg:.1f} avg over {len(runs)} runs')
"

Crystallization Protocol

When a skill reaches 90+ composite score over 5+ consecutive runs:

  1. Verify scores in ~/.claude/skill-scores.jsonl -- confirm no outliers inflating the average
  2. Add locked: true to the skill's frontmatter
  3. Apply git tag:
     git tag skill/<name>/crystallized-v1 -m "Crystallized: avg score 92.3 over 7 runs"
     git push origin skill/<name>/crystallized-v1
  4. Log the crystallization in thoughts/SKILL-EVOLUTION.md
  5. Notify via canavar cross-training so all agents know this skill is stable
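
The outlier check in step 1 is unspecified; one plausible heuristic (an assumption, not part of the protocol) is to require that mean and median agree and that no single run fell below the 90 threshold:

```python
from statistics import mean, median

def outlier_free(composites: list[float], tolerance: float = 5.0) -> bool:
    """Sanity check before crystallizing: the average must not be
    carried by a few unusually high runs."""
    return (abs(mean(composites) - median(composites)) <= tolerance
            and min(composites) >= 90)
```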

Auto-Repair Protocol

When a skill's composite score drops below 30:

Diagnosis

  1. Identify the lowest-scoring dimension (the primary failure mode)
  2. Read the last 3 session feedback notes from ~/.claude/skill-scores.jsonl
  3. Summarize what went wrong (specific, not vague)
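
Against the storage format above, the diagnosis steps can be sketched as (the `diagnose` helper is illustrative):

```python
import json
from pathlib import Path

def diagnose(skill: str, path: str = "~/.claude/skill-scores.jsonl"):
    """Return (lowest-scoring dimension of the latest run,
    last 3 feedback notes) for a failing skill."""
    lines = Path(path).expanduser().read_text().splitlines()
    records = [json.loads(l) for l in lines if l.strip()]
    runs = [r for r in records if r["skill"] == skill]
    latest = runs[-1]["scores"]
    worst = min(latest, key=latest.get)   # primary failure mode
    notes = [r["feedback"] for r in runs[-3:]]
    return worst, notes
```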

Repair

The catalyst agent rewrites the failing section(s) of the skill:

  • Only the sections relevant to the low-scoring dimension
  • Preserve all high-scoring sections unchanged
  • Add a concrete example for the repaired section

Validation

After repair, the skill is re-scored on a synthetic test case by the verifier agent:

  • Synthetic score must be 50+ for the skill to return to the Active state
  • If the synthetic score is below 50, the next repair attempt begins (3 attempts maximum)
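
The validation gate plus the three-attempt budget reduces to a small loop (a sketch; `validate_or_escalate` is an illustrative name):

```python
def validate_or_escalate(synthetic_scores: list[float]) -> str:
    """Walk the verifier's synthetic scores, one per repair attempt,
    and decide the skill's fate."""
    for attempt, score in enumerate(synthetic_scores[:3], start=1):
        if score >= 50:
            return f"active (passed on attempt {attempt})"
    return "archived (3 failed auto-repairs)"
```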

Escalation

After 3 failed auto-repairs:

  • Archive the skill
  • Alert via thoughts/SKILL-EVOLUTION.md
  • Spawn catalyst to draft a replacement from scratch

Evolution Log Format

Append events to thoughts/SKILL-EVOLUTION.md:

## 2026-04-07

### skill: experiment-loop
- Status change: Active -> Crystallized
- Trigger: avg composite 91.2 over 6 consecutive runs
- Git tag: skill/experiment-loop/crystallized-v1
- Notable strength: Token Efficiency dimension consistently 85+

### skill: legacy-deploy-helper
- Status change: Active -> Auto-Repair (attempt 1/3)
- Trigger: composite 24 on last run
- Lowest dimension: Relevance (12) -- skill referenced outdated Heroku patterns
- Repair: catalyst rewrote "Deployment Targets" section with Vercel/Railway focus
- Post-repair synthetic score: 71 -- promoted back to Active
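
Appending an entry in this format can be sketched as follows (the `log_event` helper is illustrative):

```python
import datetime
from pathlib import Path

def log_event(skill: str, details: list[str],
              path: str = "thoughts/SKILL-EVOLUTION.md") -> None:
    """Append one dated event block in the log format shown above."""
    today = datetime.date.today().isoformat()
    entry = ([f"\n## {today}\n", f"\n### skill: {skill}\n"]
             + [f"- {line}\n" for line in details])
    with Path(path).expanduser().open("a") as f:
        f.writelines(entry)
```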

Integration with Canavar Cross-Training

Skill evolution data feeds into canavar's cross-training pipeline:

  • A crystallized skill is injected into canavar's skill-matrix.json with trust: locked
  • An archived skill is marked trust: deprecated -- agents stop referencing it
  • Auto-repair failures are logged to error-ledger.jsonl with source: skill-evolution
  • The canavar leaderboard tracks which agents most frequently produce high-scoring skill executions

# View crystallized skills
node ~/.claude/hooks/dist/canavar-cli.mjs leaderboard --filter crystallized

# View skills needing repair
cat ~/.claude/skill-scores.jsonl | python3 -c "
import sys, json, collections
runs = [json.loads(l) for l in sys.stdin]
low = {r['skill'] for r in runs if r['composite'] < 30}
print('Skills needing repair:', low)
"

Activation

This skill activates automatically when:

  • A skill completes an execution (PostToolUse hook)
  • A skill is referenced in a session that ends with user dissatisfaction
  • The verifier agent reports a skill-guided task as failed

Agents involved: catalyst (repair), verifier (validation), self-learner (feedback extraction), canavar (cross-training propagation).