# Simmer

Install the skill:

```
T=$(mktemp -d) && git clone --depth=1 https://github.com/2389-research/simmer "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills" ~/.claude/skills/2389-research-simmer-simmer && rm -rf "$T"
```
Iterative refinement loop — take an artifact (single file or workspace) and hone it repeatedly against user-defined criteria until it's as good as it can get.
Related skills (test-kitchen family):
- `test-kitchen:omakase-off` — don't know what you want → parallel designs → react → pick
- `test-kitchen:cookoff` — know what you want, it's code → parallel implementations → fixed criteria → steal the best
- `simmer` — know what you want, it's anything → user-defined criteria → iterate until good
## Flow

```
"Simmer this" / "Refine this" / "Optimize this pipeline"
        ↓
┌─────────────────────────────────────┐
│ SETUP (identify + criteria)         │
│ Load simmer-setup subskill          │
│                                     │
│ Output: artifact, rubric, N iters,  │
│         evaluator (optional),       │
│         background (optional)       │
└─────────────────────────────────────┘
        ↓
┌─────────────────────────────────────┐
│ LOOP (default 3 iterations)         │
│                                     │
│ Each iteration:                     │
│   1. Dispatch generator subagent    │
│   2. Run evaluator (if present)     │
│   3. Dispatch judge subagent        │
│   4. Load reflect subskill          │
│                                     │
│ Generator gets: candidate + ASI     │
│                 + background        │
│ Judge gets: candidate + rubric      │
│             + evaluator output      │
│ Reflect gets: full score history    │
└─────────────────────────────────────┘
        ↓
┌─────────────────────────────────────┐
│ OUTPUT                              │
│ Best candidate → result file        │
│ Score trajectory displayed          │
└─────────────────────────────────────┘
```
## When to Use
Trigger when user wants iterative refinement of any kind:
- "Simmer this", "refine this", "hone this", "iterate on this"
- "Make this better", "improve this over a few rounds"
- "Polish this", "tighten this up"
- "Optimize this pipeline", "find the best model for this task"
- "Tune this configuration", "improve these prompts against this test suite"
- Any request to iteratively improve an artifact or workspace
Judge mode is auto-selected by setup based on problem complexity:
| Condition | JUDGE_MODE |
|---|---|
| text/creative, ≤2 criteria, short artifact (email, tweet, tagline) | `single` |
| text/creative, 3 criteria or long/complex artifact | `board` |
| code/testable (any) | `single` |
| pipeline/engineering (any) | `board` |
| User says "with a single judge" | `single` (override) |
| User says "with a judge board" or "with a panel" | `board` (override) |
Plateau upgrade: If the loop started with a single judge and detects a plateau (3 iterations without improvement), offer: "Scores have plateaued. Switch to judge board for deeper diagnosis?" If the user accepts, switch to `JUDGE_MODE: board` for remaining iterations.
Not simmer: If the artifact is code and the user wants parallel implementations, use cookoff instead.
## Orchestration
Announce: "I'm using the simmer skill to set up iterative refinement."
Track progress (TodoWrite if available, otherwise inline):
- Setup — identify artifact, elicit criteria, determine evaluation method
- Refinement loop (N iterations)
- Output best version with score trajectory
## Phase 1: Setup

Invoke `simmer:simmer-setup`.
Do not attempt to identify the artifact or ask about criteria yourself — that is the setup subskill's job.
Shortcut: If the user (or calling system) has already provided artifact, criteria (each with at least one sentence describing what a high score looks like), iteration count, mode, and optionally evaluator/background, skip the setup subskill entirely. Construct the setup brief directly and proceed to Phase 2.
Setup returns a brief:

```
ARTIFACT: [content, file path, or directory path]
ARTIFACT_TYPE: [single-file | workspace]
CRITERIA:
- [criterion 1]: [what better looks like]
- [criterion 2]: [what better looks like]
- [criterion 3]: [what better looks like]
PRIMARY: [criterion name — omit if equally weighted]
EVALUATOR: [command to run — omit for judge-only mode]
BACKGROUND: [constraints, available resources, domain knowledge — omit if not needed]
OUTPUT_CONTRACT: [valid output format description — omit for text/creative]
VALIDATION_COMMAND: [quick check command — omit if no cheap validation exists]
SEARCH_SPACE: [what's in scope to explore — omit if unconstrained]
JUDGE_MODE: [single | board — auto-selected by setup based on complexity. User can override]
JUDGE_PANEL: [optional custom judge definitions — omit to use defaults for problem class]
ITERATIONS: [N]
MODE: [seedless | from-file | from-paste | from-workspace]
OUTPUT_DIR: [path, default: docs/simmer]
```
## Phase 2: Refinement Loop

For single-file mode:

```
mkdir -p {OUTPUT_DIR}
```
For workspace mode:

```
# Create initial commit to snapshot the seed state
cd {ARTIFACT}
git add -A && git commit -m "simmer: iteration 0 — seed state"
```
Iteration counting: "N iterations" means N generate-judge-reflect cycles AFTER the initial seed judgment. The seed judgment is iteration 0 (not counted toward N). So `ITERATIONS: 3` means:
- Iteration 0: Judge the seed (no generator)
- Iteration 1: Generate → Judge → Reflect
- Iteration 2: Generate → Judge → Reflect
- Iteration 3: Generate → Judge → Reflect
- Total: 3 generation passes + 1 seed judgment = 4 judge rounds
For seedless mode: iteration 1 generates the initial candidate AND judges it. `ITERATIONS: 3` means 3 generation passes total.
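The counting rules above can be sketched as a driver loop. This is a minimal sketch, not the skill itself: `generate`, `judge`, and `reflect` are hypothetical stand-ins for the subskill dispatches, and the toy scorer is illustrative only.

```python
def simmer(generate, judge, reflect, seed, iterations, seedless=False):
    """Drive simmer's loop: N generate->judge->reflect cycles after an
    uncounted iteration-0 judgment of the seed."""
    history = []                 # (iteration, score, candidate)
    candidate = seed
    if not seedless:
        history.append((0, judge(candidate), candidate))  # iteration 0: judge the seed only
    for n in range(1, iterations + 1):
        candidate = generate(candidate)                   # seedless iteration 1 creates the seed
        history.append((n, judge(candidate), candidate))
        reflect(history)
    best = max(history, key=lambda h: h[1])               # track best-so-far, not last
    return best, len(history)                             # len(history) = judge rounds

# Toy run: generation appends a token, judge scores by length.
best, rounds = simmer(lambda c: c + "+", len, lambda h: None, "seed", 3)
print(rounds)   # 4 judge rounds: 1 seed judgment + 3 cycles
print(best[0])  # best candidate came from iteration 3
```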
Iteration 0 (seed):
Single-file mode:
- Write the seed artifact to `{OUTPUT_DIR}/iteration-0-candidate.md`
- If seedless: dispatch generator subagent to produce initial candidate from description + criteria, then judge it
- If from-file or from-paste: the seed IS the starting artifact — judge it directly (no generator)
Workspace mode:
- The seed is the current state of the workspace directory
- If from-workspace: judge the current state directly (no generator)
- If seedless: dispatch generator to scaffold the initial workspace, then judge it
Each iteration:
### Step 1: Generator (subagent)
Invoke `simmer:simmer-generator` as a subagent.
Single-file subagent prompt:
```
You are the generator in a simmer refinement loop.
Invoke the skill: simmer:simmer-generator

ITERATION: [N]
ARTIFACT_TYPE: single-file
CRITERIA: [rubric from setup]
CURRENT CANDIDATE: [full text of current best candidate]
JUDGE FEEDBACK (ASI from previous round): [ASI text, or "First iteration — generate initial candidate" if seedless iteration 1]

Write your improved candidate to: {OUTPUT_DIR}/iteration-[N]-candidate.md
(or appropriate extension matching artifact type)

Report: what specifically changed and why (2-3 sentences).
```
Workspace subagent prompt:
```
You are the generator in a simmer refinement loop.
Invoke the skill: simmer:simmer-generator

ITERATION: [N]
ARTIFACT_TYPE: workspace
WORKSPACE: [directory path]
CRITERIA: [rubric from setup]
BACKGROUND: [constraints, available resources, domain knowledge from setup]
OUTPUT_CONTRACT: [valid output format — omit if not specified in setup]
VALIDATION_COMMAND: [quick check command — omit if not specified in setup]
SEARCH_SPACE: [what's in scope to explore — omit if not specified in setup]
JUDGE FEEDBACK (ASI from previous round): [ASI text — may describe coordinated changes across multiple files]
EXPLORATION STATUS: [from reflect: what's been tried vs untried — omit on iteration 1 or if no search space]

Make your changes directly in the workspace directory. You may edit multiple files in a single iteration when the ASI calls for coordinated changes. If making infrastructure changes, run VALIDATION_COMMAND (if available) before reporting success.

Report: what specifically changed and why (2-3 sentences).
```
### Step 2: Run Evaluator (if present)

If the setup brief includes an EVALUATOR command:

```
cd {ARTIFACT}   # for workspace mode
{EVALUATOR}
```
Capture stdout and stderr. This output will be passed to the judge.
Timeouts: Set generous timeouts for evaluator commands. If the evaluator involves LLM inference, network calls, or large data processing, allow 10-60 minutes per run. The orchestrator should not timeout before the evaluator completes.
If no evaluator, skip this step.
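A minimal Python sketch of this capture step, assuming the EVALUATOR field is a shell command string; the 30-minute default ceiling is an illustrative choice per the timeout guidance above:

```python
import subprocess

def run_evaluator(evaluator_cmd, workdir=None, timeout_s=1800):
    """Run the setup brief's EVALUATOR command, capturing stdout and stderr
    for the judge. Generous timeout: LLM/network evaluators can take many minutes."""
    proc = subprocess.run(
        evaluator_cmd,
        shell=True,           # EVALUATOR is a shell command string from the brief
        cwd=workdir,          # cd {ARTIFACT} for workspace mode
        capture_output=True,  # both streams go to the judge; evaluators often
        text=True,            # print diagnostics to stderr
        timeout=timeout_s,
    )
    return proc.returncode, proc.stdout, proc.stderr

code, out, err = run_evaluator("echo ok")
print(code, out.strip())  # 0 ok
```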
### Step 3: Judge (subagent or judge board)

If `JUDGE_MODE: board`: Invoke `simmer:simmer-judge-board` instead of the single judge. Pass it all the same context below, plus JUDGE_PANEL if specified in the setup brief. The board dispatches multiple judges, runs deliberation, and returns output in the exact same format as a single judge. The rest of the loop (reflect, generator) is unchanged.
Include file paths so judges can investigate. In addition to pasted content, pass:
- Path to the candidate file (or workspace directory)
- Path to the evaluator script (if evaluator mode)
- Path to ground truth / test data (if known from setup inspection)
- Paths to prior iteration candidate files
- Paths to config files (from setup inspection)
Judges need to read these files themselves — not just the pre-digested summaries in the prompt. A judge who reads the evaluator script discovers exact-match scoring on iteration 0 instead of learning it through 3 iterations of trial and error.
Otherwise: Invoke `simmer:simmer-judge` as a subagent.
Without evaluator:
```
You are the judge in a simmer refinement loop.
Invoke the skill: simmer:simmer-judge

ITERATION: [N]
ARTIFACT_TYPE: [single-file | workspace]
CRITERIA: [rubric from setup]
CANDIDATE: [full text of candidate, or key files from workspace]
SEED CALIBRATION: [full text of original seed artifact, or key seed files]
SEED SCORES: [iteration 0 scores — omit this block on iteration 0]

Score this candidate against the criteria using the seed as a calibration reference. Do NOT look at or consider any intermediate iteration scores.
```
With evaluator:
```
You are the judge in a simmer refinement loop.
Invoke the skill: simmer:simmer-judge

ITERATION: [N]
ARTIFACT_TYPE: [single-file | workspace]
CRITERIA: [rubric from setup]
CANDIDATE: [full text of candidate, or key files from workspace]
EVALUATOR OUTPUT: [stdout and stderr from the evaluator command]
SEED CALIBRATION: [full text of original seed artifact, or key seed files]
SEED SCORES: [iteration 0 scores — omit this block on iteration 0]
OUTPUT_CONTRACT: [valid output format — omit if not specified in setup]
SEARCH_SPACE: [what's in scope to explore — omit if not specified in setup]
PREVIOUS ASI: [the ASI from the previous judge round — omit on iteration 0]
ITERATION HISTORY: [condensed trajectory: iteration number, scores, config, key change for each prior iteration — omit on iteration 0]
EXPLORATION STATUS: [from reflect: what's been tried vs untried in the search space — omit on iteration 0 or if no search space specified]

Interpret the evaluator output alongside the criteria. Check evaluator output against the output contract if specified. Score this candidate using the seed as a calibration reference. Use the iteration history, previous ASI, and exploration status to inform your ASI — analyze what's been tried, what worked, what didn't, and propose an evidence-based direction. You may research approaches if the current path is stuck.
```
### Step 4: Reflect (inline, load subskill)

Invoke `simmer:simmer-reflect`.
Provide: full score history across all iterations so far, current iteration number, max iterations, judge output from this round.
After reflect completes, display the updated trajectory table to the user. Show the full table so far — the user should see scores accumulate row by row as the loop runs. This is especially important during long evaluator runs where the user otherwise sees nothing for 10-15 minutes per iteration.
Iteration 2 complete.

| Iter | Value Prop | Tone | CTA | Composite | Key Change |
|------|-----------|------|-----|-----------|------------|
| 0 | 4 | 5 | 3 | 4.0 | seed |
| 1 | 7 | 5 | 4 | 5.3 | specific problem statement |
| 2 | 7 | 6 | 6 | 6.3 | low-friction CTA |

Best so far: iteration 2 (6.3/10). 1 iteration remaining.
Handling regression: If reflect reports that this iteration scored lower than best-so-far:
- Single-file: the NEXT generator receives the best candidate file (not the latest regressed one)
- Workspace: selectively restore workspace files from the best iteration's commit: `git checkout <best-commit> -- <workspace-files>`. Do NOT revert trajectory.md or other tracking files in `{OUTPUT_DIR}`.
- The generator prompt should note: "Starting from the best version (iteration N), not the latest (which regressed)."
Plateau detection: If the best-so-far score (primary criterion if set, otherwise composite) has not improved for 3 consecutive iterations — including regressions that were rolled back:
- If currently using single judge (`JUDGE_MODE: single`): Offer upgrade: "Best score has not improved for 3 iterations (best: N.N/10 at iteration M). Switch to judge board for deeper diagnosis, or stop?" If the user accepts the upgrade, switch to `JUDGE_MODE: board` and add 2 iterations to the remaining count (the board typically needs 2-3 iterations to surface and act on new insights). The board's multi-perspective deliberation often surfaces blind spots the single judge missed.
- If already using board: Offer early termination: "Best score has not improved for 3 iterations with the judge board (best: N.N/10 at iteration M). Continue or stop?"
This catches both flat plateaus and oscillation around a ceiling. Especially important when evaluator runs are expensive (minutes to hours per iteration).
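One way to implement the check, as a sketch: record the best-so-far score (primary criterion if set, otherwise composite) after every iteration, then compare against the value `window` iterations back. Rolled-back regressions leave best-so-far flat, so oscillation around a ceiling is caught too.

```python
def plateaued(best_so_far, window=3):
    """True when the best-so-far score has not improved for `window`
    consecutive iterations. best_so_far holds one value per completed
    iteration, including iteration 0."""
    if len(best_so_far) <= window:
        return False  # not enough history to declare a plateau
    return best_so_far[-1] <= best_so_far[-1 - window]

print(plateaued([4.0, 5.3, 6.3, 6.3, 6.3, 6.3]))  # True: offer board upgrade or stop
print(plateaued([4.0, 5.3, 6.3, 6.3, 6.3, 7.1]))  # False: just improved, keep going
```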
## Phase 3: Output
After all iterations complete:
Single-file mode:
- Write best-scoring candidate to `{OUTPUT_DIR}/result.md`
- Display full trajectory table
- Summarize what changed from start to finish (2-3 sentences)
- Offer: "N iterations complete. Run 3 more?"
Workspace mode:
- Ensure workspace is on the best iteration's state
- Display full trajectory table
- Summarize what changed from start to finish (2-3 sentences)
- Offer: "N iterations complete. Run 3 more?"
If user continues: carry forward best candidate as new seed, continue iteration numbering (e.g., iterations 4, 5, 6), run 3 more.
## Directory Structure
Single-file mode:
```
{OUTPUT_DIR}/
  iteration-0-candidate.md   # Seed (or seedless first generation)
  iteration-1-candidate.md   # Each improved candidate
  iteration-2-candidate.md
  iteration-3-candidate.md
  trajectory.md              # Running score table
  result.md                  # Final best output
```
Workspace mode:
```
{WORKSPACE}/       # The target directory
  [project files]  # Modified in place by generator

{OUTPUT_DIR}/      # Tracking files (can be inside or outside workspace)
  trajectory.md    # Running score table
```
Iterations are tracked via git commits in workspace mode rather than separate candidate files.
`{OUTPUT_DIR}` defaults to `docs/simmer`. Override via the setup brief's OUTPUT_DIR field.
## Single-Agent Mode
If you cannot dispatch separate subagents (e.g., nested Claude sessions are blocked, or you're running in a constrained environment), execute all roles sequentially.
Context discipline is aspirational in single-agent mode. You will see prior scores and evaluator output. Mitigate bias by:
- (a) Writing your judge scores BEFORE reading your previous trajectory
- (b) Scoring against the criterion descriptions and seed reference, not against your memory of prior scores
- (c) In the generator step, work from the ASI text only — do not reference raw evaluator metrics or output. If the ASI is well-written (specific, citing concrete failures), it already contains the signal you need.
Per-iteration checklist (single-agent):
- GENERATOR: Review the simmer-generator constraints (especially what context you receive and do NOT receive). Read ASI + current best candidate + background. Write improved version.
- RUN EVALUATOR: If evaluator command exists, run it and capture output.
- JUDGE: Review the simmer-judge constraints (especially scoring rules, seed calibration, and ASI format). Score against criteria + seed reference + evaluator output (if any). Write scores in required format.
- REFLECT: Update `{OUTPUT_DIR}/trajectory.md`. Note best-so-far. If regression, flag it and roll back to best candidate. Skip the formal "output to orchestrator" block — just update the file and continue.
## Context Discipline
This is critical for consistent results:
| Subskill | Receives | Does NOT receive |
|---|---|---|
| Generator | Current candidate, criteria, ASI from last judge, background, exploration status | Score history, previous candidates, evaluator output |
| Judge (text/creative) | Current candidate, criteria, iteration number, seed + seed scores | Intermediate scores, intermediate candidates, previous ASI, trajectory |
| Judge (code/pipeline) | Current candidate, criteria, iteration number, seed + seed scores, evaluator output, previous ASI, iteration history, search space, exploration status | Full candidate history |
| Judge Board | Same as single judge per problem class, plus: other panelists' scores during deliberation | Other panelists' ASI candidates (withheld until synthesis) |
| Reflect | Full score history, all iteration summaries, search space | Candidate content (just scores + summaries) |
The generator improves based on specific feedback (ASI) and available resources (background), not scores. The judge scores against criteria definitions, evaluator output, and the seed as a fixed calibration reference — no intermediate scores. The judge board preserves these same rules per panelist — deliberation adds within-iteration cross-judge visibility only, no new cross-iteration information. The reflect subskill is the only one that sees the full trajectory.
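The table above can be read as a pair of context-builder functions. A sketch only: the dict keys and `state` fields are illustrative, not part of the skill's wire format.

```python
def generator_context(state):
    """Per the table: generator gets the ASI, the current best candidate,
    criteria, and background -- never scores, prior candidates, or raw
    evaluator output."""
    return {
        "candidate": state["best_candidate"],
        "criteria": state["criteria"],
        "asi": state["last_asi"],
        "background": state.get("background"),
    }

def judge_context(state, problem_class="text"):
    """Judge always gets candidate + criteria + seed calibration, never
    intermediate scores. Code/pipeline judges additionally get evaluator
    output, the previous ASI, and a condensed iteration history."""
    ctx = {
        "candidate": state["latest_candidate"],
        "criteria": state["criteria"],
        "seed": state["seed"],
        "seed_scores": state.get("seed_scores"),
    }
    if problem_class in ("code", "pipeline"):
        ctx.update(
            evaluator_output=state.get("evaluator_output"),
            previous_asi=state.get("last_asi"),
            iteration_history=state.get("condensed_history"),
        )
    return ctx
```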
## Skill Dependencies

| Dependency | Usage |
|---|---|
| Subagent dispatch | Runs generator, judge, and board in isolation — fallback: dispatch sequentially |
## Common Mistakes
Giving the generator score history
- Problem: Generator optimizes for scores instead of addressing the specific ASI
- Fix: Generator only sees current candidate + ASI + criteria + background
Giving the judge previous scores
- Problem: Anchoring — judge calibrates relative to prior scores instead of fresh
- Fix: Judge only sees current candidate + criteria + evaluator output
Trying to fix everything at once (single-file mode)
- Problem: Generator makes scattered edits, regression on some criteria
- Fix: ASI is a single focused fix — focused improvement compounds
Treating ASI as always single-edit (workspace mode)
- Problem: Generator makes one tiny change when the ASI calls for a coordinated move
- Fix: In workspace mode, ASI describes a single direction which may involve coordinated changes across files
Sharing candidate history with the judge
- Problem: Judge compares to previous versions instead of scoring against criteria
- Fix: Judge sees only the current candidate
Not tracking best candidate separately
- Problem: Last iteration may not be the best
- Fix: Reflect tracks best-scoring candidate across all iterations
Not rolling back on regression (workspace mode)
- Problem: Generator builds on a regressed state instead of the best state
- Fix: Selectively restore workspace files: `git checkout <best-commit> -- <files>`. Do NOT revert trajectory.md.
Dual-writing embedded artifacts
- Problem: Artifact is embedded in a larger system (prompt inside a script, config inside YAML) requiring updates in two places
- Fix: Design evaluators to read from the canonical candidate file rather than requiring dual writes. Pass the artifact path as input to the evaluator script.
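One way to sketch that fix in Python: the evaluator accepts the candidate path as an argument, so the loop scores the one canonical file the generator edits. `evaluate.py`, the `--candidate` flag, and the `chars` metric are all hypothetical here.

```python
# evaluate.py -- hypothetical evaluator that reads the canonical candidate file
import argparse
import json

def score(candidate_text):
    """Stand-in metric; a real evaluator would run the test suite here."""
    return {"chars": len(candidate_text)}

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--candidate", required=True,
                        help="path to the file the generator edits in place")
    args = parser.parse_args(argv)
    with open(args.candidate) as f:
        # One source of truth: score whatever is currently in the candidate file.
        print(json.dumps(score(f.read())))
```

The setup brief can then point EVALUATOR at something like `python evaluate.py --candidate {ARTIFACT}` (illustrative), and no second embedded copy of the artifact ever needs updating.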
Reverting trajectory on git rollback (workspace mode)
- Problem: `git checkout <commit>` reverts ALL files, including trajectory.md tracking
- Fix: Always use selective checkout: `git checkout <commit> -- file1 file2`. Keep trajectory.md and other tracking files outside the rollback scope.
## Example Flow: Single-File (v1 behavior)
```
User: "Simmer this" [pastes a pitch email]

Claude: I'm using the simmer skill to set up iterative refinement.
[Invokes simmer-setup]
Setup identifies: pitch email, suggests criteria
User accepts: value prop clarity, tone match, call to action strength
Iterations: 3

[Iteration 0: Judge scores seed — 4.0/10]
[Iteration 1: Generator fixes value prop → 5.3/10]
[Iteration 2: Generator fixes CTA → 6.3/10]
[Iteration 3: Generator fixes tone → 7.3/10]

Trajectory:
| Iter | Value Prop | Tone | CTA | Composite | Key Change |
|------|-----------|------|-----|-----------|------------|
| 0    | 4         | 5    | 3   | 4.0       | seed       |
| 1    | 7         | 5    | 4   | 5.3       | specific problem statement |
| 2    | 7         | 6    | 6   | 6.3       | lower-friction CTA |
| 3    | 7         | 7    | 8   | 7.3       | specific asset in CTA |

Best candidate: iteration 3 (7.3/10)
3 iterations complete. Run 3 more?
```
## Example Flow: Workspace with Evaluator
```
User: "Simmer this pipeline — find the best model and prompt setup"

Claude: I'm using the simmer skill to set up iterative refinement.
[Invokes simmer-setup]
Setup identifies: workspace at ./pipeline/
Evaluator: python evaluate.py --input output.json
Background: "Available models: claude-sonnet, gpt-4o-mini, llama-8b, llama-70b.
  Topologies: single-call, multi-step chain, parallel fan-out.
  Budget: <$0.01/call, <2s latency."
Criteria: accuracy, cost efficiency, latency
Iterations: 5

[Iteration 0: Run evaluator on seed, judge scores — 3.7/10]
  accuracy: 6/10, cost: 2/10, latency: 3/10
  ASI: "Using claude-sonnet for a simple extraction task. The model is
  overkill — accuracy is fine but cost is 5x over budget. Switch to
  gpt-4o-mini which handles extraction well at 1/10th the cost."

[Iteration 1: Generator swaps model + adjusts prompt → 5.3/10]
  accuracy: 5/10, cost: 8/10, latency: 7/10
  ASI: "Cost and latency are great now but accuracy dropped on multi-step
  reasoning tasks (cases 7, 12). Split into two calls — extraction on mini,
  reasoning on sonnet — to get accuracy back without blowing the budget."

[Iteration 2: Generator restructures to 2-step chain → 7.0/10]
  accuracy: 7/10, cost: 7/10, latency: 7/10
  ASI: "Architecture is solid. The extraction prompt is too generic — add
  3 few-shot examples from the test cases to anchor the format."

[Iteration 3: Generator adds few-shot examples → 7.7/10]
...

Best candidate: iteration 3 (7.7/10)
5 iterations complete. Run 3 more?
```