# claude-elixir-phoenix skill-monitor

Analyze skill effectiveness across sessions. Computes per-skill metrics (action rate, friction, outcomes), identifies degrading skills, and generates improvement recommendations. Requires session-scan data in `metrics.jsonl`.

## Install

Source · clone the upstream repo:

```shell
git clone https://github.com/oliver-kriska/claude-elixir-phoenix
```

Claude Code · install into `~/.claude/skills/`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/oliver-kriska/claude-elixir-phoenix "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skill-monitor" ~/.claude/skills/oliver-kriska-claude-elixir-phoenix-skill-monitor && rm -rf "$T"
```

Manifest: `.claude/skills/skill-monitor/SKILL.md`

## Source content

# Skill Monitor

Closed-loop skill effectiveness monitoring. Reads session metrics, computes per-skill signals, identifies what's working and what needs improvement.

Inspired by the deploy-monitor-evaluate-improve feedback loop: skills get better over time instead of staying static.

## Requirements

Requires `.claude/session-metrics/metrics.jsonl` from `/session-scan`. If no data exists, suggest running `/session-scan` first.

## Usage

```
/skill-monitor                     # Dashboard: all skills
/skill-monitor --skill review      # Deep-dive on one skill
/skill-monitor --improve           # Generate improvement recommendations
/skill-monitor --window 30d        # Change comparison window (default: 7d)
```

## What Main Context Does

### Step 1: Parse Arguments

Extract from `$ARGUMENTS`:

- `--skill NAME`: focus on one skill (e.g., `review`, `plan`, `investigate`)
- `--improve`: spawn the analysis agent for improvement recommendations
- `--window PERIOD`: comparison window (`7d`, `30d`, `all`; default: `7d`)
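The flag extraction above can be sketched in Python; the function name `parse_args` and the returned dictionary shape are illustrative, not part of the skill:

```python
import shlex

def parse_args(arguments: str) -> dict:
    """Parse the raw $ARGUMENTS string into monitor options (illustrative sketch)."""
    opts = {"skill": None, "improve": False, "window": "7d"}  # defaults per Usage
    tokens = shlex.split(arguments)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--skill" and i + 1 < len(tokens):
            opts["skill"] = tokens[i + 1]
            i += 2
        elif tok == "--improve":
            opts["improve"] = True
            i += 1
        elif tok == "--window" and i + 1 < len(tokens):
            opts["window"] = tokens[i + 1]
            i += 2
        else:
            i += 1  # ignore unrecognized tokens
    return opts
```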

### Step 2: Load Metrics

Read `.claude/session-metrics/metrics.jsonl`. For each entry, extract the `skill_effectiveness` field (added by compute-metrics.py v2).

Filter by window period. Count sessions with and without skill usage.

If no `skill_effectiveness` data exists in metrics: "Metrics were computed before skill tracking was added. Run `/session-scan --rescan` to recompute."
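Loading and window-filtering might look like the sketch below. The metrics.jsonl schema is not shown in this document, so the per-entry timestamp field (`computed_at`, ISO-8601) is an assumption borrowed from the dashboard output format:

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

def load_metrics(path: str, window: str) -> list[dict]:
    """Load metrics.jsonl entries that fall inside the comparison window."""
    if window == "all":
        cutoff = None
    else:
        days = int(window.rstrip("d"))  # "7d" -> 7
        cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    entries = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines in the JSONL file
        entry = json.loads(line)
        ts = datetime.fromisoformat(entry["computed_at"].replace("Z", "+00:00"))
        if cutoff is None or ts >= cutoff:
            entries.append(entry)
    return entries
```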

### Step 3: Compute Per-Skill Aggregates

For each skill found across all sessions, aggregate:

| Metric              | Computation                                    |
|---------------------|------------------------------------------------|
| Total invocations   | Sum of invocation_count across sessions        |
| Sessions used in    | Count of sessions containing this skill        |
| Action rate         | Weighted avg of per-session action_rate         |
| Avg post-errors     | Weighted avg of avg_post_errors                |
| Avg post-corrections| Weighted avg of avg_post_corrections           |
| Outcome distribution| Count of effective/friction/no_action/mixed    |
| Effectiveness score | action_rate - (0.3 * avg_post_corrections)     |
| Adjusted score      | For analysis/check skills, use lower thresholds |
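The table's weighted averages and score can be computed per skill roughly as follows; `sessions` is a hypothetical list of per-session records for one skill, with field names taken from the table:

```python
def aggregate_skill(sessions: list[dict]) -> dict:
    """Aggregate per-session records for one skill, weighting by invocation count."""
    total = sum(s["invocation_count"] for s in sessions)

    def wavg(field: str) -> float:
        # Weighted average of a per-session field, by invocation count
        return sum(s[field] * s["invocation_count"] for s in sessions) / total

    action_rate = wavg("action_rate")
    avg_post_corrections = wavg("avg_post_corrections")
    return {
        "total_invocations": total,
        "sessions_used_in": len(sessions),
        "action_rate": action_rate,
        "avg_post_errors": wavg("avg_post_errors"),
        "avg_post_corrections": avg_post_corrections,
        # Effectiveness score from the table above
        "effectiveness_score": action_rate - 0.3 * avg_post_corrections,
    }
```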

Skill type weighting: Analysis and check skills (verify, triage, perf, boundaries, pr-review, audit) have low action rates BY DESIGN — their success is "found issues" or "confirmed things pass". Apply adjusted thresholds:

| Skill Type                                    | Flag Threshold | Expected Action Rate |
|-----------------------------------------------|----------------|----------------------|
| Execution (work, quick, full)                 | < 0.5          | > 0.7                |
| Analysis (perf, boundaries, audit, pr-review) | < 0.3          | 0.3-0.5              |
| Check (verify, triage)                        | < 0.1          | 0.0-0.3              |
| Knowledge (compound, learn, brief)            | < 0.5          | > 0.5                |

Also compute baseline friction (avg friction of sessions WITHOUT any skill usage) vs skill friction (avg friction of sessions WITH skill usage). Delta = skill_friction - baseline_friction. Negative delta = skills reduce friction (good).
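The type-adjusted flagging and the baseline comparison reduce to a few lines; the dictionary keys below are informal labels for the four skill types in the weighting table:

```python
# Flag thresholds by skill type, from the weighting table above
FLAG_THRESHOLDS = {"execution": 0.5, "analysis": 0.3, "check": 0.1, "knowledge": 0.5}

def flag_skill(skill_type: str, action_rate: float) -> bool:
    """A skill is flagged when its action rate falls below its type's threshold."""
    return action_rate < FLAG_THRESHOLDS[skill_type]

def friction_delta(with_skills: list[float], without_skills: list[float]) -> float:
    """skill_friction - baseline_friction; negative means skills reduce friction."""
    skill_friction = sum(with_skills) / len(with_skills)
    baseline_friction = sum(without_skills) / len(without_skills)
    return skill_friction - baseline_friction
```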

### Step 4: Display Dashboard

Dashboard mode (no `--skill`):

```
## Skill Effectiveness Dashboard (last {window})

Baseline friction (no skills): 0.32 | With skills: 0.18 | Delta: -0.14

| Skill           | Uses | Sessions | Action% | Errors | Corrections | Outcome    | Score |
|-----------------|------|----------|---------|--------|-------------|------------|-------|
| /phx:review     | 12   | 8        | 92%     | 0.5    | 0.1         | effective  | 0.89  |
| /phx:plan       | 9    | 7        | 100%    | 0.2    | 0.0         | effective  | 1.00  |
| /phx:investigate| 5    | 5        | 80%     | 1.2    | 0.4         | mixed      | 0.68  |

Skills needing attention: /phx:investigate (high post-errors)
```

Flag skills using type-adjusted thresholds (see weighting table above). Also flag if avg_post_corrections > 1 or outcome is predominantly "friction". When displaying flagged skills, note if the flag is "expected" for the skill type (e.g., verify at 0.24 is normal for a check skill).

Skill deep-dive (`--skill NAME`):

Show a per-session breakdown for that skill, including session IDs, dates, and individual outcome signals. If session reports exist in `.claude/session-analysis/`, reference them.

### Step 5: Improvement Mode (--improve)

Spawn the `skill-effectiveness-analyzer` agent:
```
Agent(subagent_type="skill-effectiveness-analyzer", model="sonnet", prompt="""
Analyze skill effectiveness data and recommend improvements.

Metrics data: {aggregated_metrics_json}

Sessions with friction outcomes: {session_ids}

For each underperforming skill:
1. Identify failure patterns from outcome signals
2. Propose specific skill/agent changes
3. Suggest new Iron Laws if patterns are systematic

Write recommendations to: .claude/skill-metrics/recommendations-{date}.md
""")
```

### Step 6: Write Output

Write aggregated metrics to `.claude/skill-metrics/dashboard-{date}.json`:

```json
{
  "computed_at": "2026-03-03T14:00:00Z",
  "window": "7d",
  "baseline_friction": 0.32,
  "skill_friction": 0.18,
  "friction_delta": -0.14,
  "skills": { ... },
  "flagged_skills": ["investigate"]
}
```

Append-only: never modify previous dashboard files.
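A minimal sketch of the append-only write rule; the helper name `write_dashboard` and the fail-on-existing behavior are illustrative choices, not prescribed by the skill:

```python
import json
from datetime import date
from pathlib import Path

def write_dashboard(metrics: dict, out_dir: str = ".claude/skill-metrics") -> Path:
    """Write a new dated dashboard file; refuse to overwrite a previous one."""
    path = Path(out_dir) / f"dashboard-{date.today().isoformat()}.json"
    if path.exists():
        # Append-only rule: previous dashboard files are never modified
        raise FileExistsError(f"{path} already exists; dashboards are append-only")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metrics, indent=2))
    return path
```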

## Iron Laws

  1. NEVER modify metrics.jsonl — read-only from this skill
  2. Baseline comparison is mandatory — raw numbers without baseline are meaningless
  3. Flag, don't judge — surface data, let the human decide what to fix
  4. Evidence tags on recommendations — every suggestion needs session citations

## Integration

```
/session-scan → metrics.jsonl (with skill_effectiveness)
       ↓
/skill-monitor → dashboard + flagged skills
       ↓
/skill-monitor --improve → recommendations
       ↓
Developer updates skills/agents → deploy → repeat
```

## References

- `references/effectiveness-metrics.md` — Full metrics schema and evaluation criteria
- `references/improvement-template.md` — Template for improvement recommendations