# Ralph-orchestrator evaluate-presets
Use when testing Ralph's hat collection presets, validating preset configurations, or auditing the preset library for bugs and UX issues.
```bash
# Clone the full repository
git clone https://github.com/mikeyobrien/ralph-orchestrator

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/mikeyobrien/ralph-orchestrator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/evaluate-presets" ~/.claude/skills/mikeyobrien-ralph-orchestrator-evaluate-presets && rm -rf "$T"
```
`.claude/skills/evaluate-presets/SKILL.md`

# Evaluate Presets

## Overview
Systematically test all hat collection presets using shell scripts. Direct CLI invocation—no meta-orchestration complexity.
## When to Use
- Testing preset configurations after changes
- Auditing the preset library for quality
- Validating new presets work correctly
- After modifying hat routing logic
## Quick Start

Evaluate a single preset:

```bash
./tools/evaluate-preset.sh tdd-red-green claude
```

Evaluate all presets:

```bash
./tools/evaluate-all-presets.sh claude
```
Arguments:

- First arg: preset name (without the `.yml` extension)
- Second arg: backend (`claude` or `kiro`, defaults to `claude`)
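For example, to run the same preset against the non-default backend (a minimal usage sketch):

```bash
./tools/evaluate-preset.sh tdd-red-green kiro
```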
## Bash Tool Configuration
IMPORTANT: When invoking these scripts via the Bash tool, use these settings:
- Single preset evaluation: Use `timeout: 600000` (10 minutes max) and `run_in_background: true`
- All presets evaluation: Use `timeout: 600000` (10 minutes max) and `run_in_background: true`

Since preset evaluations can run for hours (especially the full suite), always run in background mode and use the TaskOutput tool to check progress periodically.
Example invocation pattern:

```
Bash tool with:
  command: "./tools/evaluate-preset.sh tdd-red-green claude"
  timeout: 600000
  run_in_background: true
```
After launching, use TaskOutput with `block: false` to check status without waiting for completion.
## What the Scripts Do
### evaluate-preset.sh

- Loads test task from `tools/preset-test-tasks.yml` (if `yq` is available)
- Creates merged config with evaluation settings
- Runs Ralph with `--record-session` for metrics capture
- Captures output logs, exit codes, and timing
- Extracts metrics: iterations, hats activated, events published
Output structure:

```
.eval/
├── logs/<preset>/<timestamp>/
│   ├── output.log          # Full stdout/stderr
│   ├── session.jsonl       # Recorded session
│   ├── metrics.json        # Extracted metrics
│   ├── environment.json    # Runtime environment
│   └── merged-config.yml   # Config used
└── logs/<preset>/latest -> <timestamp>
```
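Once a run finishes, the `latest` symlink points at the newest artifacts. A minimal inspection sketch, using `tdd-red-green` as a stand-in preset name:

```bash
# Inspect the most recent run for one preset via the latest symlink
RUN=.eval/logs/tdd-red-green/latest
tail -n 20 "$RUN/output.log"   # end of the transcript
jq . "$RUN/metrics.json"       # extracted metrics
```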
### evaluate-all-presets.sh

Runs all 12 presets sequentially and generates a summary:
```
.eval/results/<suite-id>/
├── SUMMARY.md       # Markdown report
├── <preset>.json    # Per-preset metrics
└── latest -> <suite-id>
```
## Presets Under Evaluation
| Preset | Test Task |
|---|---|
|  | Add function |
|  | Review user input handler for security |
|  | Understand |
|  | Specify and implement |
|  | Implement a data structure |
|  | Debug failing mock test assertion |
|  | Understand history of |
|  | Profile hat matching |
|  | Design a trait |
|  | Document |
|  | Respond to "tests failing in CI" |
|  | Plan v1 to v2 config migration |
## Interpreting Results

Exit codes from `evaluate-preset.sh`:

- `0` — Success (LOOP_COMPLETE reached)
- `124` — Timeout (preset hung or took too long)
- Other — Failure (check `output.log`)

Metrics in `metrics.json`:

- `iterations` — How many event loop cycles
- `hats_activated` — Which hats were triggered
- `events_published` — Total events emitted
- `completed` — Whether the completion promise was reached
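These conventions are easy to script against. A hedged sketch that wraps a single evaluation and reports using the exit codes and `metrics.json` fields listed above (the wrapper name and foreground invocation are illustrative, not part of the tooling):

```bash
#!/usr/bin/env bash
# check-preset.sh: run one preset and summarize the result
PRESET=${1:?usage: check-preset.sh <preset>}
./tools/evaluate-preset.sh "$PRESET" claude
case $? in
  0)   echo "$PRESET: success (LOOP_COMPLETE reached)" ;;
  124) echo "$PRESET: timeout" ;;
  *)   echo "$PRESET: failure, check .eval/logs/$PRESET/latest/output.log" ;;
esac
# Pull the headline numbers from the documented metrics fields
jq -r '"iterations=\(.iterations) events=\(.events_published) completed=\(.completed)"' \
  ".eval/logs/$PRESET/latest/metrics.json"
```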
## Hat Routing Performance
Critical: Validate that hats get fresh context per Tenet #1 ("Fresh Context Is Reliability").
### What Good Looks Like

Each hat should execute in its own iteration:

```
Iter 1: Ralph → publishes starting event → STOPS
Iter 2: Hat A → does work → publishes next event → STOPS
Iter 3: Hat B → does work → publishes next event → STOPS
Iter 4: Hat C → does work → LOOP_COMPLETE
```
### Red Flags (Same-Iteration Hat Switching)

BAD: Multiple hat personas in one iteration:

```
Iter 2: Ralph does Blue Team + Red Team + Fixer work
        ^^^ All in one bloated context!
```
### How to Check
1. Count iterations vs. events in `session.jsonl`:

```bash
# Count iterations
grep -c "_meta.loop_start\|ITERATION" .eval/logs/<preset>/latest/output.log

# Count events published
grep -c "bus.publish" .eval/logs/<preset>/latest/session.jsonl
```
Expected: iterations ≈ events published (one event per iteration).
Bad sign: 2-3 iterations but 5+ events (all work happening in a single iteration).
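The comparison can be scripted. A sketch, where allowing one event of slack (for the initial Ralph event) is an assumption rather than a documented threshold:

```bash
# Compare iteration count to published-event count for one preset
PRESET=tdd-red-green   # stand-in preset name
RUN=.eval/logs/$PRESET/latest
ITERS=$(grep -c "_meta.loop_start\|ITERATION" "$RUN/output.log")
EVENTS=$(grep -c "bus.publish" "$RUN/session.jsonl")
echo "iterations=$ITERS events=$EVENTS"
if [ "$EVENTS" -gt $((ITERS + 1)) ]; then
  echo "RED FLAG: more events than iterations (same-iteration switching?)"
fi
```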
2. Check for same-iteration hat switching in `output.log`:

```bash
grep -E "ITERATION|Now I need to perform|Let me put on|I'll switch to" \
  .eval/logs/<preset>/latest/output.log
```
Red flag: Hat-switching phrases WITHOUT an ITERATION separator between them.
3. Check event timestamps in `session.jsonl`:

```bash
cat .eval/logs/<preset>/latest/session.jsonl | jq -r '.ts'
```
Red flag: Multiple events with identical timestamps (published in same iteration).
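Duplicate timestamps can be surfaced mechanically. A minimal sketch, assuming `.ts` is the timestamp field shown above (substitute your preset name for `<preset>`):

```bash
# Print any timestamp shared by two or more events; any output is a red flag
jq -r '.ts' .eval/logs/<preset>/latest/session.jsonl | sort | uniq -d
```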
### Routing Performance Triage
| Pattern | Diagnosis | Action |
|---|---|---|
| iterations ≈ events | ✅ Good | Hat routing working |
| iterations << events | ⚠️ Same-iteration switching | Check prompt has STOP instruction |
| iterations >> events | ⚠️ Recovery loops | Agent not publishing required events |
| 0 events | ❌ Broken | Events not being read from JSONL |
### Root Cause Checklist

If hat routing is broken:

- [ ] Check workflow prompt in `hatless_ralph.rs`:
  - Does it say "CRITICAL: STOP after publishing"?
  - Is the DELEGATE section clear about yielding control?
- [ ] Check hat instructions propagation:
  - Does `HatInfo` include the `instructions` field?
  - Are instructions rendered in the `## HATS` section?
- [ ] Check events context:
  - Is `build_prompt(context)` using the context parameter?
  - Does the prompt include the `## PENDING EVENTS` section?
## Autonomous Fix Workflow
After evaluation, delegate fixes to subagents:
### Step 1: Triage Results

Read `.eval/results/latest/SUMMARY.md` and identify:

- ❌ FAIL → Create code tasks for fixes
- ⏱️ TIMEOUT → Investigate infinite loops
- ⚠️ PARTIAL → Check for edge cases
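To pull the problem rows out of the report mechanically (a sketch; it assumes the status labels appear verbatim in SUMMARY.md):

```bash
grep -E "FAIL|TIMEOUT|PARTIAL" .eval/results/latest/SUMMARY.md
```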
### Step 2: Dispatch Task Creation

For each issue, spawn a Task agent:

```
Use /code-task-generator to create a task for fixing: [issue from evaluation]
Output to: tasks/preset-fixes/
```
### Step 3: Dispatch Implementation

For each created task:

```
Use /code-assist to implement: tasks/preset-fixes/[task-file].code-task.md
Mode: auto
```
### Step 4: Re-evaluate

```bash
./tools/evaluate-preset.sh <fixed-preset> claude
```
## Prerequisites

- `yq` (optional): For loading test tasks from YAML. Install: `brew install yq`
- Cargo: Must be able to build Ralph
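A quick preflight sketch before kicking off a long suite (the build command is the generic one; the repo may use different flags):

```bash
command -v yq >/dev/null    || echo "yq missing (optional: YAML test tasks will be skipped)"
command -v cargo >/dev/null || echo "cargo missing (required to build Ralph)"
cargo build   # confirm Ralph compiles before spending hours on evaluations
```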
## Related Files

- `tools/evaluate-preset.sh` — Single preset evaluation
- `tools/evaluate-all-presets.sh` — Full suite evaluation
- `tools/preset-test-tasks.yml` — Test task definitions
- `tools/preset-evaluation-findings.md` — Manual findings doc
- `presets/` — The preset collection being evaluated