Citadel experiment
git clone https://github.com/SethGammon/Citadel
T=$(mktemp -d) && git clone --depth=1 https://github.com/SethGammon/Citadel "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/experiment" ~/.claude/skills/sethgammon-citadel-experiment && rm -rf "$T"
skills/experiment/SKILL.md

/experiment — Metric-Driven Optimization Loop
Identity
/experiment is an automated optimization loop with a scalar fitness function. It takes a hypothesis, runs isolated experiments in git worktrees, measures results with a metric command, and keeps improvements or discards failures. Think of it as automated A/B testing for code changes.
Inputs
The user provides three things:
- scope: Files to modify (glob pattern, e.g., "src/api/**/*.ts")
- metric: Shell command that outputs a single number (e.g., `npm run build 2>&1 | tail -1 | grep -oP '\d+'`)
- budget: Iteration cap (default: 5) or time cap (e.g., "10 minutes")
If any input is missing, ask for it. The metric MUST output a single number to stdout.
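A minimal sketch of that check in shell, run once before any iterations. The `METRIC` variable is a stand-in for the user-supplied command (here seeded with the example above):

```bash
# Sanity-check the metric: it must print exactly one number to stdout.
METRIC="npm run build 2>&1 | tail -1 | grep -oP '\d+'"
value=$(bash -c "$METRIC")
if [[ "$value" =~ ^-?[0-9]+([.][0-9]+)?$ ]]; then
  echo "metric OK: $value"
else
  echo "metric must output a single number, got: '$value'" >&2
  exit 1
fi
```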
Protocol
Step 1: BASELINE
- Stash any uncommitted changes (restore on exit)
- Run the metric command. Record the baseline value.
- Determine direction: does lower = better (bundle size, error count) or higher = better (FPS, test count)? Ask the user if ambiguous.
- Log:
Baseline: {value} ({metric command})
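A shell sketch of the baseline step, assuming the `METRIC` variable from the input check above (the `stashed` flag is an illustrative name):

```bash
# Step 1: stash uncommitted work, measure, log the baseline.
stashed=0
if ! git diff --quiet || ! git diff --cached --quiet; then
  git stash push -m "experiment: pre-baseline stash"
  stashed=1
fi
baseline=$(bash -c "$METRIC")
echo "Baseline: $baseline ($METRIC)"
# On exit (success or error): [[ $stashed -eq 1 ]] && git stash pop
```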
Step 2: ITERATE
For each iteration (up to budget):
- Create isolation: Spawn a sub-agent in a worktree (`isolation: "worktree"`)
- Propose change: The agent modifies files within scope to improve the metric. Provide context: baseline value, metric direction, scope, what previous iterations tried.
- Measure: Run the metric command in the worktree (via `node scripts/run-with-timeout.js 300`)
- Gate: Run typecheck (also via the timeout wrapper). If it fails, discard immediately.
- Evaluate:
- Improved? → KEEP. Merge the worktree branch. New baseline = new value.
- Same or worse? → DISCARD. Delete the worktree.
- Log iteration:
Iteration {N}: {value} ({delta from baseline}) → {KEEP|DISCARD}
Change: {one-line description of what was tried}
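One iteration of the loop, sketched in shell for the lower-is-better case. The worktree path, branch name (`exp-$i`), and the argument order of scripts/run-with-timeout.js are assumptions for illustration:

```bash
# Step 2: isolate, change, measure, gate, then keep or discard.
i=1
WT=$(mktemp -d)
git worktree add "$WT" -b "exp-$i"    # isolated checkout on its own branch
# ... sub-agent edits files within scope inside "$WT" ...
value=$(cd "$WT" && node scripts/run-with-timeout.js 300 bash -c "$METRIC")
if (cd "$WT" && node scripts/run-with-timeout.js 300 npx tsc --noEmit) &&
   awk -v v="$value" -v b="$baseline" 'BEGIN { exit !(v < b) }'; then
  git merge "exp-$i"                  # KEEP: fold the change in; new baseline
  baseline=$value
fi
git worktree remove --force "$WT"     # KEEP or DISCARD, the worktree goes away
git branch -D "exp-$i" 2>/dev/null || true
```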
Step 3: CONVERGENCE CHECK
After each iteration, check:
- Local optimum: Last 3 iterations all discarded → stop ("no more improvements found")
- Diminishing returns: Last kept improvement was < 0.5% → stop ("diminishing returns")
- Budget exhausted: Iteration count or time exceeded → stop
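The three checks, sketched in shell. The counters (`discard_streak`, `last_gain_pct`, `iter`, `budget`) are illustrative names for state the loop would maintain:

```bash
# Step 3: evaluate stop conditions after each iteration.
if (( discard_streak >= 3 )); then
  echo "stop: no more improvements found"
elif awk -v g="$last_gain_pct" 'BEGIN { exit !(g < 0.5) }'; then
  echo "stop: diminishing returns"   # last KEPT improvement under 0.5%
elif (( iter >= budget )); then
  echo "stop: budget exhausted"
fi
```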
Step 4: REPORT
Write results to `.planning/research/experiment-{slug}.md`:

```markdown
# Experiment: {Description}

> Metric: `{command}`
> Direction: {lower|higher} is better
> Scope: {glob pattern}
> Budget: {N iterations}
> Date: {ISO date}

## Results

| Iteration | Value | Delta | Verdict | Change |
|-----------|-------|-------|---------|--------|
| baseline  | {N}   | —     | —       | —      |
| 1         | {N}   | {+/-} | KEEP    | {desc} |
| 2         | {N}   | {+/-} | DISCARD | {desc} |

## Outcome

- **Start**: {baseline}
- **End**: {final value}
- **Improvement**: {percentage}
- **Iterations**: {kept}/{total}
- **Stop reason**: {convergence|diminishing|budget}

## Kept Changes

{List of changes that were kept, with commit hashes}
```
Also log to `.planning/telemetry/agent-runs.jsonl`:
{"event":"experiment-complete","slug":"{slug}","baseline":0,"final":0,"improvement":"0%","kept":0,"total":0,"timestamp":"ISO"}
Common Metrics
Example commands are shown; adjust them to the project's tooling.

| Goal | Metric Command |
|---|---|
| Reduce bundle size | `du -sb dist \| cut -f1` |
| Reduce type errors | `npx tsc --noEmit 2>&1 \| grep -c 'error TS'` |
| Increase test pass rate | `npm test 2>&1 \| grep -oP '\d+(?= passed)' \| head -1` |
| Reduce file count | `find src -type f \| wc -l` |
| Reduce line count | `find src -name '*.ts' -exec cat {} + \| wc -l` |
When to Use
- When you want to optimize a measurable metric (bundle size, error count, test coverage, FPS)
- When you have a clear hypothesis but aren't sure which of several approaches wins
- When manual A/B testing would be too slow or error-prone
- NOT when the goal is subjective ("make it feel better") — the metric must be a number
Safety Rules
- NEVER modify files outside scope
- ALWAYS use worktree isolation for changes
- ALWAYS run typecheck before keeping a change
- Restore stashed changes on exit (even on error)
- If the metric command fails, treat as DISCARD (not crash)
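The stash-restore rule can be enforced with a shell trap; a minimal sketch, assuming the `stashed` flag from the baseline step:

```bash
# Restore stashed changes on every exit path, including errors.
cleanup() {
  [[ "${stashed:-0}" -eq 1 ]] && git stash pop
}
trap cleanup EXIT
```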
Quality Gates
- Baseline was measured before any iterations ran
- Every kept iteration improved the metric AND passed typecheck
- Every discarded iteration has a logged reason
- The stop reason is one of: convergence, diminishing returns, or budget exhausted
- The experiment report exists at `.planning/research/experiment-{slug}.md` with all iteration rows filled
Fringe Cases
Metric command outputs nothing or non-numeric text: Treat as a metric failure. Ask the user to provide a command that outputs a single number to stdout before starting iterations.
No worktree support (e.g., shallow clone): Fall back to branch isolation. Create a branch, run changes there, measure, then delete or merge the branch. Never modify the working tree directly.
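A sketch of that branch fallback (branch name and `verdict` variable are illustrative; uncommitted changes were already stashed at baseline, so the checkout is safe):

```bash
# Branch isolation when worktrees are unavailable.
orig=$(git rev-parse --abbrev-ref HEAD)
git checkout -b "exp-$i"
# ... apply the change, measure, run the typecheck gate ...
git checkout "$orig"
if [[ "$verdict" == "KEEP" ]]; then
  git merge "exp-$i"
fi
git branch -D "exp-$i"
```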
If `.planning/research/` does not exist: Create it before writing the experiment report. If `.planning/` itself doesn't exist, create the full path or output the report inline.
Budget exhausted with zero kept iterations: Report outcome as "no improvement found". This is a valid result — do not continue past the budget.
Exit Protocol
---HANDOFF---
- Experiment: {description}
- Result: {baseline} → {final} ({improvement}%)
- Kept: {N}/{total} iterations
- Stop reason: {reason}
- Report: .planning/research/experiment-{slug}.md
---