Mycelium eval-runner

Run eval scenarios to benchmark Mycelium effectiveness. Execute tasks using reflexion loop, validate against success criteria, record metrics.

install
source · Clone the upstream repo
git clone https://github.com/haabe/mycelium
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/haabe/mycelium "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/eval-runner" ~/.claude/skills/haabe-mycelium-eval-runner && rm -rf "$T"
manifest: .claude/skills/eval-runner/SKILL.md
source content

Eval Runner

Benchmark the agent's performance against defined scenarios. Adapted from n-trax eval system.

Commands

run <category/name>

  1. Read YAML from
    .claude/evals/scenarios/<category>/<name>.yml
  2. Parse fields (name, category, task_prompt, success_criteria, budget)
  3. Execute setup steps if defined
  4. Record start time
  5. Execute task via reflexion workflow (read corrections first)
  6. Record end time and iteration count
  7. Validate ALL success criteria
  8. Write result JSON to
    .claude/evals/results/<timestamp>-<name>.json
  9. Report summary

run-all [category]

  1. Glob
    .claude/evals/scenarios/**/*.yml
  2. Skip scenarios with
    status: retired
  3. For each: run in isolation (git stash), record result, restore
  4. Update
    .claude/evals/pass-history.json
    with each result
  5. Aggregate and report

run-split <optimization|holdout>

  1. Glob
    .claude/evals/scenarios/**/*.yml
  2. Read each YAML, filter by
    split
    field matching the requested set
  3. Skip scenarios with
    status: retired
  4. For each matching scenario: run in isolation, record result, restore
  5. Update
    .claude/evals/pass-history.json
    with each result
  6. Aggregate and report (label output clearly as "Optimization Set" or "Holdout Set")

report

  1. Read all results from
    .claude/evals/results/
  2. Generate summary table:
| Category    | Pass Rate | Avg Iterations | Avg Time | Notes |
|-------------|-----------|----------------|----------|-------|
| discovery   | ...       | ...            | ...      |       |
| delivery    | ...       | ...            | ...      |       |
| integration | ...       | ...            | ...      |       |
| **Overall** | ...       | ...            | ...      |       |
  1. List failure patterns and recommendations

prune

  1. Read
    .claude/evals/pass-history.json
  2. Flag evals where
    last_5
    is all-pass (
    saturated
    ) or all-fail (
    broken
    )
  3. Flag evals with no runs in 30+ days (
    stale
    )
  4. For saturated evals, suggest: retire or increase difficulty
  5. For broken evals, suggest: fix criteria or retire
  6. Present recommendations — do NOT auto-retire
  7. On user confirmation: set
    status: retired
    in scenario YAML, update pass-history.json, log in decision-log.md

mine

Analyze audit logs to propose new eval scenarios from observed failure patterns.

  1. Read
    .claude/state/change-log.jsonl
    (last 100 entries)
  2. Read
    .claude/state/diamond-state-audit.jsonl
    (all entries)
  3. Group change-log entries by
    session_id
    , identify: a. Correction clusters: 3+ edits to same file in one session (agent struggled) b. Skill friction: edits to
    .claude/skills/*/SKILL.md
    during a session (instructions unclear) c. Missing test coverage: 5+ files changed with no test file edits
  4. Count diamond bypass entries from audit log
  5. Cross-reference with existing scenarios in pass-history.json to avoid duplicates
  6. For each pattern, output a proposed eval scenario as YAML template
  7. Tag proposals with
    source: trace-mining
    and originating
    session_id
  8. Do NOT auto-create files — present for human review

See

.claude/evals/schema.md
§Trace Mining Heuristics for pattern-to-eval mappings.

Result Format

See

.claude/evals/schema.md
for YAML scenario and JSON result formats.

Pass History

After writing each result JSON (step 8 of

run
), also update
.claude/evals/pass-history.json
:

  • Increment
    runs
    and
    passes
    (if passed) for the eval
  • Append
    true
    /
    false
    to
    last_5
    (trim to keep only last 5)
  • Update
    last_run
    timestamp

Creating Scenarios

Place YAML files in

.claude/evals/scenarios/<category>/
. Define task_prompt, success_criteria, and budget. Set
split
,
status
, and
source
fields per schema.