Mycelium eval-runner

Run eval scenarios to benchmark Mycelium effectiveness. Execute tasks using reflexion loop, validate against success criteria, record metrics.

install

source · Clone the upstream repo

git clone https://github.com/haabe/mycelium

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/haabe/mycelium "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/eval-runner" ~/.claude/skills/haabe-mycelium-eval-runner && rm -rf "$T"

manifest: .claude/skills/eval-runner/SKILL.md

source content

Eval Runner

Benchmark the agent's performance against defined scenarios. Adapted from n-trax eval system.

Commands

run <category/name>

Read YAML from

.claude/evals/scenarios/<category>/<name>.yml

Parse fields (name, category, task_prompt, success_criteria, budget)
Execute setup steps if defined
Record start time
Execute task via reflexion workflow (read corrections first)
Record end time and iteration count
Validate ALL success criteria

Write result JSON to

.claude/evals/results/<timestamp>-<name>.json

Report summary

run-all [category]

Glob
```
.claude/evals/scenarios/**/*.yml
```
Skip scenarios with
```
status: retired
```
For each: run in isolation (git stash), record result, restore
Update
```
.claude/evals/pass-history.json
```
with each result
Aggregate and report

run-split <optimization|holdout>

Glob
```
.claude/evals/scenarios/**/*.yml
```
Read each YAML, filter by
```
split
```
field matching the requested set
Skip scenarios with
```
status: retired
```
For each matching scenario: run in isolation, record result, restore
Update
```
.claude/evals/pass-history.json
```
with each result
Aggregate and report (label output clearly as "Optimization Set" or "Holdout Set")

report

Read all results from
```
.claude/evals/results/
```
Generate summary table:

| Category    | Pass Rate | Avg Iterations | Avg Time | Notes |
|-------------|-----------|----------------|----------|-------|
| discovery   | ...       | ...            | ...      |       |
| delivery    | ...       | ...            | ...      |       |
| integration | ...       | ...            | ...      |       |
| **Overall** | ...       | ...            | ...      |       |

List failure patterns and recommendations

prune

Read
```
.claude/evals/pass-history.json
```
Flag evals where
```
last_5
```
is all-pass (
```
saturated
```
) or all-fail (
```
broken
```
)
Flag evals with no runs in 30+ days (
```
stale
```
)
For saturated evals, suggest: retire or increase difficulty
For broken evals, suggest: fix criteria or retire
Present recommendations — do NOT auto-retire
On user confirmation: set
```
status: retired
```
in scenario YAML, update pass-history.json, log in decision-log.md

mine

Analyze audit logs to propose new eval scenarios from observed failure patterns.

Read
```
.claude/state/change-log.jsonl
```
(last 100 entries)
Read
```
.claude/state/diamond-state-audit.jsonl
```
(all entries)
Group change-log entries by
```
session_id
```
, identify: a. Correction clusters: 3+ edits to same file in one session (agent struggled) b. Skill friction: edits to
```
.claude/skills/*/SKILL.md
```
during a session (instructions unclear) c. Missing test coverage: 5+ files changed with no test file edits
Count diamond bypass entries from audit log
Cross-reference with existing scenarios in pass-history.json to avoid duplicates
For each pattern, output a proposed eval scenario as YAML template
Tag proposals with
```
source: trace-mining
```
and originating
```
session_id
```
Do NOT auto-create files — present for human review

See

.claude/evals/schema.md

§Trace Mining Heuristics for pattern-to-eval mappings.

Result Format

See

.claude/evals/schema.md

for YAML scenario and JSON result formats.

Pass History

After writing each result JSON (step 8 of

run

), also update

.claude/evals/pass-history.json

Increment
```
runs
```
and
```
passes
```
(if passed) for the eval
Append
```
true
```
/
```
false
```
to
```
last_5
```
(trim to keep only last 5)
Update
```
last_run
```
timestamp

Creating Scenarios

Place YAML files in

.claude/evals/scenarios/<category>/

. Define task_prompt, success_criteria, and budget. Set

split

status

, and

source

fields per schema.

Mycelium eval-runner

Eval Runner

Commands

`run <category/name>`

`run-all [category]`

`run-split <optimization|holdout>`

`report`

`prune`

`mine`

Result Format

Pass History

Creating Scenarios