# Mycelium eval-runner

Run eval scenarios to benchmark Mycelium's effectiveness: execute tasks through the reflexion loop, validate them against success criteria, and record metrics.
## Install

Clone the upstream repo:

```sh
git clone https://github.com/haabe/mycelium
```

Claude Code: install into `~/.claude/skills/`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/haabe/mycelium "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/eval-runner" ~/.claude/skills/haabe-mycelium-eval-runner && rm -rf "$T"
```

Manifest: `.claude/skills/eval-runner/SKILL.md`
## Eval Runner

Benchmark the agent's performance against defined scenarios. Adapted from the n-trax eval system.
## Commands

### `run <category/name>`

1. Read YAML from `.claude/evals/scenarios/<category>/<name>.yml`
2. Parse fields (`name`, `category`, `task_prompt`, `success_criteria`, `budget`)
3. Execute setup steps if defined
4. Record start time
5. Execute task via the reflexion workflow (read corrections first)
6. Record end time and iteration count
7. Validate ALL success criteria
8. Write result JSON to `.claude/evals/results/<timestamp>-<name>.json`
9. Report summary
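Step 2's field check can be sketched as a small helper; the function name and the "all required fields present" rule are assumptions, while the field names come from this doc.

```python
# Required scenario fields, as listed in the `run` command's parse step.
REQUIRED_FIELDS = ("name", "category", "task_prompt", "success_criteria", "budget")

def validate_scenario(scenario: dict) -> list[str]:
    """Return the required fields missing from a parsed scenario (empty = valid)."""
    return [f for f in REQUIRED_FIELDS if f not in scenario]

# Example of a minimal parsed scenario dict (values illustrative).
scenario = {
    "name": "demo",
    "category": "discovery",
    "task_prompt": "...",
    "success_criteria": ["..."],
    "budget": {"max_iterations": 3},
}
missing = validate_scenario(scenario)  # [] when every required field is present
```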
### `run-all [category]`

- Glob `.claude/evals/scenarios/**/*.yml`
- Skip scenarios with `status: retired`
- For each: run in isolation (git stash), record result, restore
- Update `.claude/evals/pass-history.json` with each result
- Aggregate and report
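The selection steps above (glob, optional category filter, skip retired) can be sketched as a pure function over already-parsed scenarios; the helper name and the `key -> scenario dict` shape are assumptions.

```python
def select_scenarios(all_scenarios: dict[str, dict], category: str | None = None) -> list[str]:
    """Pick scenario keys to run, e.g. {"discovery/grep": {...}} -> ["discovery/grep"].

    Skips entries whose YAML carries `status: retired`; optionally restricts
    to one category prefix, mirroring `run-all [category]`.
    """
    selected = []
    for key, sc in sorted(all_scenarios.items()):
        if category and not key.startswith(category + "/"):
            continue
        if sc.get("status") == "retired":
            continue
        selected.append(key)
    return selected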
### `run-split <optimization|holdout>`

- Glob `.claude/evals/scenarios/**/*.yml`
- Read each YAML, filter by the `split` field matching the requested set
- Skip scenarios with `status: retired`
- For each matching scenario: run in isolation, record result, restore
- Update `.claude/evals/pass-history.json` with each result
- Aggregate and report (label output clearly as "Optimization Set" or "Holdout Set")
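The split filter above can be sketched the same way; the helper name is an assumption, the `split` values and `status: retired` skip come from this doc.

```python
def filter_by_split(scenarios: dict[str, dict], split: str) -> list[str]:
    """Return scenario keys whose `split` field matches, skipping retired ones."""
    if split not in ("optimization", "holdout"):
        raise ValueError(f"unknown split: {split}")
    return [key for key, sc in sorted(scenarios.items())
            if sc.get("split") == split and sc.get("status") != "retired"]
```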
### `report`

- Read all results from `.claude/evals/results/`
- Generate summary table:

| Category    | Pass Rate | Avg Iterations | Avg Time | Notes |
|-------------|-----------|----------------|----------|-------|
| discovery   | ...       | ...            | ...      |       |
| delivery    | ...       | ...            | ...      |       |
| integration | ...       | ...            | ...      |       |
| **Overall** | ...       | ...            | ...      |       |

- List failure patterns and recommendations
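The aggregation behind the table can be sketched as follows; the result-dict field names (`category`, `passed`, `iterations`, `seconds`) are assumptions about the result JSON, not the schema.

```python
from collections import defaultdict

def summarize(results: list[dict]) -> dict[str, dict]:
    """Aggregate per-category pass rate, average iterations, and average time."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for r in results:
        buckets[r["category"]].append(r)
    return {
        cat: {
            "pass_rate": sum(r["passed"] for r in rs) / len(rs),
            "avg_iterations": sum(r["iterations"] for r in rs) / len(rs),
            "avg_seconds": sum(r["seconds"] for r in rs) / len(rs),
        }
        for cat, rs in buckets.items()
    }
```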
### `prune`

- Read `.claude/evals/pass-history.json`
- Flag evals where `last_5` is all-pass (`saturated`) or all-fail (`broken`)
- Flag evals with no runs in 30+ days (`stale`)
- For saturated evals, suggest: retire or increase difficulty
- For broken evals, suggest: fix criteria or retire
- Present recommendations; do NOT auto-retire
- On user confirmation: set `status: retired` in the scenario YAML, update pass-history.json, log in decision-log.md
### `mine`

Analyze audit logs to propose new eval scenarios from observed failure patterns.

- Read `.claude/state/change-log.jsonl` (last 100 entries)
- Read `.claude/state/diamond-state-audit.jsonl` (all entries)
- Group change-log entries by `session_id`, then identify:
  a. Correction clusters: 3+ edits to the same file in one session (agent struggled)
  b. Skill friction: edits to `.claude/skills/*/SKILL.md` during a session (instructions unclear)
  c. Missing test coverage: 5+ files changed with no test file edits
- Count diamond bypass entries from the audit log
- Cross-reference with existing scenarios in pass-history.json to avoid duplicates
- For each pattern, output a proposed eval scenario as a YAML template
- Tag proposals with `source: trace-mining` and the originating `session_id`
- Do NOT auto-create files; present for human review

See `.claude/evals/schema.md` §Trace Mining Heuristics for pattern-to-eval mappings.
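Heuristic (a), correction clusters, can be sketched as follows; the change-log entry fields `session_id` and `file`, and the helper name, are assumptions about the JSONL shape.

```python
from collections import Counter, defaultdict

def correction_clusters(entries: list[dict], threshold: int = 3) -> dict:
    """Find files edited `threshold`+ times within a single session.

    Returns {(session_id, file): edit_count} for every cluster found.
    """
    per_session: dict[str, Counter] = defaultdict(Counter)
    for e in entries:
        per_session[e["session_id"]][e["file"]] += 1
    return {(sid, f): n
            for sid, counts in per_session.items()
            for f, n in counts.items() if n >= threshold}
```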
## Result Format

See `.claude/evals/schema.md` for YAML scenario and JSON result formats.
## Pass History

After writing each result JSON (step 8 of `run`), also update `.claude/evals/pass-history.json`:

- Increment `runs` and, if the eval passed, `passes`
- Append `true`/`false` to `last_5` (trim to keep only the last 5)
- Update the `last_run` timestamp
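The three bookkeeping steps above can be sketched as one in-place update; the entry shape follows the fields named here (`runs`, `passes`, `last_5`, `last_run`), and the helper name is an assumption.

```python
def record_result(history: dict, eval_name: str, passed: bool, timestamp: str) -> dict:
    """Apply one eval result to the pass-history mapping and return the entry."""
    entry = history.setdefault(eval_name, {"runs": 0, "passes": 0, "last_5": []})
    entry["runs"] += 1
    if passed:
        entry["passes"] += 1
    entry["last_5"] = (entry["last_5"] + [passed])[-5:]  # keep only the last 5
    entry["last_run"] = timestamp
    return entry
```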
## Creating Scenarios

Place YAML files in `.claude/evals/scenarios/<category>/`. Define `task_prompt`, `success_criteria`, and `budget`. Set the `split`, `status`, and `source` fields per the schema.
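As an illustration only (the authoritative shapes live in `.claude/evals/schema.md`), a scenario file might look like this; the field names come from this doc, every value is hypothetical.

```yaml
# Hypothetical scenario: .claude/evals/scenarios/discovery/find-config-loader.yml
name: find-config-loader
category: discovery
split: optimization        # or: holdout
status: active             # set to retired to have run-all/run-split skip it
source: manual             # or: trace-mining
task_prompt: |
  Locate the module that loads runtime configuration and summarize its entry points.
success_criteria:
  - Names the correct file
  - Lists at least two entry points
budget:
  max_iterations: 3
```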