Simmer

Install

Source · Clone the upstream repo:

git clone https://github.com/2389-research/simmer

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/2389-research/simmer "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills" ~/.claude/skills/2389-research-simmer-simmer && rm -rf "$T"

Manifest: skills/SKILL.md

Simmer

Iterative refinement loop — take an artifact (single file or workspace) and hone it repeatedly against user-defined criteria until it's as good as it can get.

Related skills (test-kitchen family):

  • test-kitchen:omakase-off
    — don't know what you want → parallel designs → react → pick
  • test-kitchen:cookoff
    — know what you want, it's code → parallel implementations → fixed criteria → steal the best
  • simmer
    — know what you want, it's anything → user-defined criteria → iterate until good

Flow

"Simmer this" / "Refine this" / "Optimize this pipeline"
    ↓
┌─────────────────────────────────────┐
│  SETUP (identify + criteria)        │
│  Load simmer-setup subskill         │
│                                     │
│  Output: artifact, rubric, N iters, │
│  evaluator (optional),              │
│  background (optional)              │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  LOOP (default 3 iterations)        │
│                                     │
│  Each iteration:                    │
│  1. Dispatch generator subagent     │
│  2. Run evaluator (if present)      │
│  3. Dispatch judge subagent         │
│  4. Load reflect subskill           │
│                                     │
│  Generator gets: candidate + ASI    │
│           + background              │
│  Judge gets: candidate + rubric     │
│       + evaluator output (if any)   │
│  Reflect gets: full score history   │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│  OUTPUT                             │
│  Best candidate → result file       │
│  Score trajectory displayed         │
└─────────────────────────────────────┘

When to Use

Trigger when user wants iterative refinement of any kind:

  • "Simmer this", "refine this", "hone this", "iterate on this"
  • "Make this better", "improve this over a few rounds"
  • "Polish this", "tighten this up"
  • "Optimize this pipeline", "find the best model for this task"
  • "Tune this configuration", "improve these prompts against this test suite"
  • Any request to iteratively improve an artifact or workspace

Judge mode is auto-selected by setup based on problem complexity:

| Condition                                                          | JUDGE_MODE        |
|--------------------------------------------------------------------|-------------------|
| text/creative, ≤2 criteria, short artifact (email, tweet, tagline) | single            |
| text/creative, 3+ criteria or long/complex artifact                | board             |
| code/testable (any)                                                | board             |
| pipeline/engineering (any)                                         | board             |
| User says "with a single judge"                                    | single (override) |
| User says "with a judge board" or "with a panel"                   | board (override)  |
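
As a sketch of the same rule in shell (the variable names PROBLEM_CLASS, NUM_CRITERIA, and ARTIFACT_LENGTH are illustrative, not fields of the setup brief):

case "$PROBLEM_CLASS" in
  text-creative)
    # Short, low-criteria creative artifacts get a single judge;
    # everything else in this class gets a board.
    if (( NUM_CRITERIA <= 2 )) && [[ "$ARTIFACT_LENGTH" == short ]]; then
      JUDGE_MODE=single
    else
      JUDGE_MODE=board
    fi
    ;;
  code-testable|pipeline-engineering)
    JUDGE_MODE=board
    ;;
esac
# An explicit user request ("with a single judge" / "with a panel") overrides.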

Plateau upgrade: If the loop started with a single judge and detects a plateau (3 iterations without improvement), offer: "Scores have plateaued. Switch to judge board for deeper diagnosis?" If the user accepts, switch to `JUDGE_MODE: board` for the remaining iterations.

Not simmer: If the artifact is code and the user wants parallel implementations, use cookoff instead.

Orchestration

Announce: "I'm using the simmer skill to set up iterative refinement."

Track progress (TodoWrite if available, otherwise inline):

  1. Setup — identify artifact, elicit criteria, determine evaluation method
  2. Refinement loop (N iterations)
  3. Output best version with score trajectory

Phase 1: Setup

Invoke `simmer:simmer-setup`.

Do not attempt to identify the artifact or ask about criteria yourself — that is the setup subskill's job.

Shortcut: If the user (or calling system) has already provided artifact, criteria (each with at least one sentence describing what a high score looks like), iteration count, mode, and optionally evaluator/background, skip the setup subskill entirely. Construct the setup brief directly and proceed to Phase 2.

Setup returns a brief:

ARTIFACT: [content, file path, or directory path]
ARTIFACT_TYPE: [single-file | workspace]
CRITERIA:
  - [criterion 1]: [what better looks like]
  - [criterion 2]: [what better looks like]
  - [criterion 3]: [what better looks like]
PRIMARY: [criterion name — omit if equally weighted]
EVALUATOR: [command to run — omit for judge-only mode]
BACKGROUND: [constraints, available resources, domain knowledge — omit if not needed]
OUTPUT_CONTRACT: [valid output format description — omit for text/creative]
VALIDATION_COMMAND: [quick check command — omit if no cheap validation exists]
SEARCH_SPACE: [what's in scope to explore — omit if unconstrained]
JUDGE_MODE: [single | board — auto-selected by setup based on complexity. User can override]
JUDGE_PANEL: [optional custom judge definitions — omit to use defaults for problem class]
ITERATIONS: [N]
MODE: [seedless | from-file | from-paste | from-workspace]
OUTPUT_DIR: [path, default: docs/simmer]

Phase 2: Refinement Loop

For single-file mode:

mkdir -p {OUTPUT_DIR}

For workspace mode:

# Create initial commit to snapshot the seed state
cd {ARTIFACT}
git add -A && git commit -m "simmer: iteration 0 — seed state"

Iteration counting:

"N iterations" means N generate-judge-reflect cycles AFTER the initial seed judgment. The seed judgment is iteration 0 (not counted toward N). So `ITERATIONS: 3` means:

  • Iteration 0: Judge the seed (no generator)
  • Iteration 1: Generate → Judge → Reflect
  • Iteration 2: Generate → Judge → Reflect
  • Iteration 3: Generate → Judge → Reflect
  • Total: 3 generation passes + 1 seed judgment = 4 judge rounds

For seedless mode: iteration 1 generates the initial candidate AND judges it, so `ITERATIONS: 3` means 3 generation passes total.
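
As a control-flow sketch (generate, run_evaluator, judge, and reflect are placeholder names for the subskill dispatches described below):

judge 0                      # seed judgment; skipped in seedless mode
for n in $(seq 1 "$ITERATIONS"); do
  generate "$n"              # simmer:simmer-generator subagent
  run_evaluator "$n"         # only if EVALUATOR is set
  judge "$n"                 # simmer:simmer-judge or the judge board
  reflect "$n"               # simmer:simmer-reflect; updates trajectory.md
done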

Iteration 0 (seed):

Single-file mode:

  • Write the seed artifact to `{OUTPUT_DIR}/iteration-0-candidate.md`
  • If seedless: dispatch generator subagent to produce initial candidate from description + criteria, then judge it
  • If from-file or from-paste: the seed IS the starting artifact — judge it directly (no generator)

Workspace mode:

  • The seed is the current state of the workspace directory
  • If from-workspace: judge the current state directly (no generator)
  • If seedless: dispatch generator to scaffold the initial workspace, then judge it

Each iteration:

Step 1: Generator (subagent)

Invoke `simmer:simmer-generator` as a subagent.

Single-file subagent prompt:

You are the generator in a simmer refinement loop.

Invoke the skill: simmer:simmer-generator

ITERATION: [N]
ARTIFACT_TYPE: single-file
CRITERIA:
[rubric from setup]

CURRENT CANDIDATE:
[full text of current best candidate]

JUDGE FEEDBACK (ASI from previous round):
[ASI text, or "First iteration — generate initial candidate" if seedless iteration 1]

Write your improved candidate to: {OUTPUT_DIR}/iteration-[N]-candidate.md
(or appropriate extension matching artifact type)

Report: what specifically changed and why (2-3 sentences).

Workspace subagent prompt:

You are the generator in a simmer refinement loop.

Invoke the skill: simmer:simmer-generator

ITERATION: [N]
ARTIFACT_TYPE: workspace
WORKSPACE: [directory path]
CRITERIA:
[rubric from setup]

BACKGROUND:
[constraints, available resources, domain knowledge from setup]

OUTPUT_CONTRACT:
[valid output format — omit if not specified in setup]

VALIDATION_COMMAND:
[quick check command — omit if not specified in setup]

SEARCH_SPACE:
[what's in scope to explore — omit if not specified in setup]

JUDGE FEEDBACK (ASI from previous round):
[ASI text — may describe coordinated changes across multiple files]

EXPLORATION STATUS:
[from reflect: what's been tried vs untried — omit on iteration 1 or if no search space]

Make your changes directly in the workspace directory.
You may edit multiple files in a single iteration when the ASI calls for coordinated changes.
If making infrastructure changes, run VALIDATION_COMMAND (if available) before reporting success.

Report: what specifically changed and why (2-3 sentences).

Step 2: Run Evaluator (if present)

If the setup brief includes an `EVALUATOR` command:

cd {ARTIFACT}  # for workspace mode
{EVALUATOR}

Capture stdout and stderr. This output will be passed to the judge.

Timeouts: Set generous timeouts for evaluator commands. If the evaluator involves LLM inference, network calls, or large data processing, allow 10-60 minutes per run. The orchestrator should not timeout before the evaluator completes.
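
A minimal sketch of one run, assuming GNU `timeout` is available (the 30-minute ceiling and log path are illustrative):

cd "$ARTIFACT"    # workspace mode only
timeout 30m bash -c "$EVALUATOR" > "$OUTPUT_DIR/iteration-$N-eval.log" 2>&1 \
  || echo "evaluator exited nonzero ($?)" >> "$OUTPUT_DIR/iteration-$N-eval.log"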

If no evaluator, skip this step.

Step 3: Judge (subagent or judge board)

If `JUDGE_MODE: board`: Invoke `simmer:simmer-judge-board` instead of the single judge. Pass it all the same context below, plus `JUDGE_PANEL` if specified in the setup brief. The board dispatches multiple judges, runs deliberation, and returns output in the exact same format as a single judge. The rest of the loop (reflect, generator) is unchanged.

Include file paths so judges can investigate. In addition to pasted content, pass:

  • Path to the candidate file (or workspace directory)
  • Path to the evaluator script (if evaluator mode)
  • Path to ground truth / test data (if known from setup inspection)
  • Paths to prior iteration candidate files
  • Paths to config files (from setup inspection)

Judges need to read these files themselves — not just the pre-digested summaries in the prompt. A judge who reads the evaluator script discovers exact-match scoring on iteration 0 instead of learning it through 3 iterations of trial and error.

Otherwise: Invoke `simmer:simmer-judge` as a subagent.

Without evaluator:

You are the judge in a simmer refinement loop.

Invoke the skill: simmer:simmer-judge

ITERATION: [N]
ARTIFACT_TYPE: [single-file | workspace]
CRITERIA:
[rubric from setup]

CANDIDATE:
[full text of candidate, or key files from workspace]

SEED CALIBRATION:
[full text of original seed artifact, or key seed files]
SEED SCORES:
[iteration 0 scores — omit this block on iteration 0]

Score this candidate against the criteria using the seed as a calibration reference.
Do NOT look at or consider any intermediate iteration scores.

With evaluator:

You are the judge in a simmer refinement loop.

Invoke the skill: simmer:simmer-judge

ITERATION: [N]
ARTIFACT_TYPE: [single-file | workspace]
CRITERIA:
[rubric from setup]

CANDIDATE:
[full text of candidate, or key files from workspace]

EVALUATOR OUTPUT:
[stdout and stderr from the evaluator command]

SEED CALIBRATION:
[full text of original seed artifact, or key seed files]
SEED SCORES:
[iteration 0 scores — omit this block on iteration 0]

OUTPUT_CONTRACT:
[valid output format — omit if not specified in setup]

SEARCH_SPACE:
[what's in scope to explore — omit if not specified in setup]

PREVIOUS ASI:
[the ASI from the previous judge round — omit on iteration 0]

ITERATION HISTORY:
[condensed trajectory: iteration number, scores, config, key change for each
 prior iteration — omit on iteration 0]

EXPLORATION STATUS:
[from reflect: what's been tried vs untried in the search space — omit on
 iteration 0 or if no search space specified]

Interpret the evaluator output alongside the criteria.
Check evaluator output against the output contract if specified.
Score this candidate using the seed as a calibration reference.
Use the iteration history, previous ASI, and exploration status to inform
your ASI — analyze what's been tried, what worked, what didn't, and propose
an evidence-based direction. You may research approaches if the current
path is stuck.

Step 4: Reflect (inline, load subskill)

Invoke `simmer:simmer-reflect`.

Provide: full score history across all iterations so far, current iteration number, max iterations, judge output from this round.

After reflect completes, display the updated trajectory table to the user. Show the full table so far — the user should see scores accumulate row by row as the loop runs. This is especially important during long evaluator runs where the user otherwise sees nothing for 10-15 minutes per iteration.

Iteration 2 complete.

| Iter | Value Prop | Tone | CTA | Composite | Key Change |
|------|-----------|------|-----|-----------|------------|
| 0    | 4         | 5    | 3   | 4.0       | seed       |
| 1    | 7         | 5    | 4   | 5.3       | specific problem statement |
| 2    | 7         | 6    | 6   | 6.3       | low-friction CTA |

Best so far: iteration 2 (6.3/10). 1 iteration remaining.

Handling regression: If reflect reports that this iteration scored lower than best-so-far:

  • Single-file: the NEXT generator receives the best candidate file (not the latest regressed one)
  • Workspace: selectively restore workspace files from the best iteration's commit: `git checkout <best-commit> -- <workspace-files>`. Do NOT revert trajectory.md or other tracking files in `{OUTPUT_DIR}` (see the sketch after this list).
  • The generator prompt should note: "Starting from the best version (iteration N), not the latest (which regressed)."
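
A sketch of that rollback, assuming commit messages follow the "simmer: iteration N" convention from Phase 2 (the restored paths are illustrative):

# Iteration 3 regressed; iteration 2 scored best.
best_commit=$(git log --grep='simmer: iteration 2' --format=%H -n 1)
git checkout "$best_commit" -- src/ config/   # workspace files only, never trajectory.md
git commit -m "simmer: roll back workspace to iteration 2 (best so far)"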

Plateau detection: If the best-so-far score (primary criterion if set, otherwise composite) has not improved for 3 consecutive iterations — including regressions that were rolled back:

  • If currently using a single judge (`JUDGE_MODE: single`): Offer an upgrade: "Best score has not improved for 3 iterations (best: N.N/10 at iteration M). Switch to judge board for deeper diagnosis, or stop?" If the user accepts the upgrade, switch to `JUDGE_MODE: board` and add 2 iterations to the remaining count (the board typically needs 2-3 iterations to surface and act on new insights). The board's multi-perspective deliberation often surfaces blind spots the single judge missed.

  • If already using board: Offer early termination: "Best score has not improved for 3 iterations with the judge board (best: N.N/10 at iteration M). Continue or stop?"

This catches both flat plateaus and oscillation around a ceiling. Especially important when evaluator runs are expensive (minutes to hours per iteration).
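
A minimal sketch of the check, assuming a hypothetical scores.txt with one composite (or primary-criterion) score per line, iteration 0 first:

awk '{ if ($1 > best) { best = $1; best_iter = NR - 1 } }  # strict ">": a tie is not an improvement
     END { if ((NR - 1) - best_iter >= 3) printf "Plateau: best %.1f at iteration %d, no new best since\n", best, best_iter }' scores.txt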

Phase 3: Output

After all iterations complete:

Single-file mode:

  1. Write the best-scoring candidate to `{OUTPUT_DIR}/result.md`
  2. Display full trajectory table
  3. Summarize what changed from start to finish (2-3 sentences)
  4. Offer: "N iterations complete. Run 3 more?"

Workspace mode:

  1. Ensure workspace is on the best iteration's state
  2. Display full trajectory table
  3. Summarize what changed from start to finish (2-3 sentences)
  4. Offer: "N iterations complete. Run 3 more?"

If user continues: carry forward best candidate as new seed, continue iteration numbering (e.g., iterations 4, 5, 6), run 3 more.

Directory Structure

Single-file mode:

{OUTPUT_DIR}/
  iteration-0-candidate.md     # Seed (or seedless first generation)
  iteration-1-candidate.md     # Each improved candidate
  iteration-2-candidate.md
  iteration-3-candidate.md
  trajectory.md                # Running score table
  result.md                    # Final best output

Workspace mode:

{WORKSPACE}/                    # The target directory
  [project files]               # Modified in place by generator

{OUTPUT_DIR}/                   # Tracking files (can be inside or outside workspace)
  trajectory.md                 # Running score table

Iterations are tracked via git commits in workspace mode rather than separate candidate files.
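
Each generator pass can be snapshotted with a commit in the same message convention as the seed commit (the summary variable is illustrative):

git add -A && git commit -m "simmer: iteration $N — $ONE_LINE_SUMMARY"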

`{OUTPUT_DIR}` defaults to `docs/simmer`. Override via the setup brief's `OUTPUT_DIR` field.

Single-Agent Mode

If you cannot dispatch separate subagents (e.g., nested Claude sessions are blocked, or you're running in a constrained environment), execute all roles sequentially.

Context discipline is aspirational in single-agent mode. You will see prior scores and evaluator output. Mitigate bias by:

  • (a) Writing your judge scores BEFORE reading your previous trajectory
  • (b) Scoring against the criterion descriptions and seed reference, not against your memory of prior scores
  • (c) Working from the ASI text only in the generator step — not from raw evaluator metrics or output. If the ASI is well-written (specific, citing concrete failures), it already contains the signal you need.

Per-iteration checklist (single-agent):

  1. GENERATOR: Review the simmer-generator constraints (especially what context you receive and do NOT receive). Read ASI + current best candidate + background. Write improved version.
  2. RUN EVALUATOR: If evaluator command exists, run it and capture output.
  3. JUDGE: Review the simmer-judge constraints (especially scoring rules, seed calibration, and ASI format). Score against criteria + seed reference + evaluator output (if any). Write scores in required format.
  4. REFLECT: Update `{OUTPUT_DIR}/trajectory.md`. Note best-so-far. If regression, flag it and roll back to the best candidate. Skip the formal "output to orchestrator" block — just update the file and continue.

Context Discipline

This is critical for consistent results:

| Subskill              | Receives | Does NOT receive |
|-----------------------|----------|------------------|
| Generator             | Current candidate, criteria, ASI from last judge, background, exploration status | Score history, previous candidates, evaluator output |
| Judge (text/creative) | Current candidate, criteria, iteration number, seed + seed scores | Intermediate scores, intermediate candidates, previous ASI, trajectory |
| Judge (code/pipeline) | Current candidate, criteria, iteration number, seed + seed scores, evaluator output, previous ASI, iteration history, search space, exploration status | Full candidate history |
| Judge Board           | Same as single judge per problem class, plus: other panelists' scores during deliberation | Other panelists' ASI candidates (withheld until synthesis) |
| Reflect               | Full score history, all iteration summaries, search space | Candidate content (just scores + summaries) |

The generator improves based on specific feedback (ASI) and available resources (background), not scores. The judge scores against criteria definitions, evaluator output, and the seed as a fixed calibration reference — no intermediate scores. The judge board preserves these same rules per panelist — deliberation adds within-iteration cross-judge visibility only, no new cross-iteration information. The reflect subskill is the only one that sees the full trajectory.

Skill Dependencies

| Dependency      | Usage |
|-----------------|-------|
| parallel-agents | `superpowers:dispatching-parallel-agents` — fallback: dispatch sequentially |

Common Mistakes

Giving the generator score history

  • Problem: Generator optimizes for scores instead of addressing the specific ASI
  • Fix: Generator only sees current candidate + ASI + criteria + background

Giving the judge previous scores

  • Problem: Anchoring — the judge calibrates relative to prior scores instead of scoring fresh
  • Fix: Judge sees only the current candidate + criteria + evaluator output + the fixed seed reference

Trying to fix everything at once (single-file mode)

  • Problem: Generator makes scattered edits, regression on some criteria
  • Fix: ASI is a single focused fix — focused improvement compounds

Treating ASI as always single-edit (workspace mode)

  • Problem: Generator makes one tiny change when the ASI calls for a coordinated move
  • Fix: In workspace mode, ASI describes a single direction which may involve coordinated changes across files

Sharing candidate history with the judge

  • Problem: Judge compares to previous versions instead of scoring against criteria
  • Fix: Judge sees only the current candidate

Not tracking best candidate separately

  • Problem: Last iteration may not be the best
  • Fix: Reflect tracks best-scoring candidate across all iterations

Not rolling back on regression (workspace mode)

  • Problem: Generator builds on a regressed state instead of the best state
  • Fix: Selectively restore workspace files: `git checkout <best-commit> -- <files>`. Do NOT revert trajectory.md.

Dual-writing embedded artifacts

  • Problem: Artifact is embedded in a larger system (prompt inside a script, config inside YAML) requiring updates in two places
  • Fix: Design evaluators to read from the canonical candidate file rather than requiring dual writes. Pass the artifact path as input to the evaluator script.
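
For example (the --prompt-file flag is a hypothetical addition to the example evaluate.py, not an existing option):

python evaluate.py --prompt-file "$OUTPUT_DIR/iteration-$N-candidate.md" --input output.json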

Reverting trajectory on git rollback (workspace mode)

  • Problem: `git checkout <commit>` reverts ALL files, including trajectory.md tracking
  • Fix: Always use selective checkout: `git checkout <commit> -- file1 file2`. Keep trajectory.md and other tracking files outside the rollback scope.

Example Flow: Single-File (v1 behavior)

User: "Simmer this" [pastes a pitch email]

Claude: I'm using the simmer skill to set up iterative refinement.

[Invokes simmer-setup]

Setup identifies: pitch email, suggests criteria
User accepts: value prop clarity, tone match, call to action strength
Iterations: 3

[Iteration 0: Judge scores seed — 4.0/10]
[Iteration 1: Generator fixes value prop → 5.3/10]
[Iteration 2: Generator fixes CTA → 6.3/10]
[Iteration 3: Generator fixes tone → 7.3/10]

Trajectory:
| Iter | Value Prop | Tone | CTA | Composite | Key Change |
|------|-----------|------|-----|-----------|------------|
| 0    | 4         | 5    | 3   | 4.0       | seed       |
| 1    | 7         | 5    | 4   | 5.3       | specific problem statement |
| 2    | 7         | 6    | 6   | 6.3       | lower-friction CTA |
| 3    | 7         | 7    | 8   | 7.3       | specific asset in CTA |

Best candidate: iteration 3 (7.3/10)
3 iterations complete. Run 3 more?

Example Flow: Workspace with Evaluator

User: "Simmer this pipeline — find the best model and prompt setup"

Claude: I'm using the simmer skill to set up iterative refinement.

[Invokes simmer-setup]

Setup identifies: workspace at ./pipeline/
Evaluator: python evaluate.py --input output.json
Background: "Available models: claude-sonnet, gpt-4o-mini, llama-8b, llama-70b.
            Topologies: single-call, multi-step chain, parallel fan-out.
            Budget: <$0.01/call, <2s latency."
Criteria: accuracy, cost efficiency, latency
Iterations: 5

[Iteration 0: Run evaluator on seed, judge scores — 3.7/10]
  accuracy: 6/10, cost: 2/10, latency: 3/10
  ASI: "Using claude-sonnet for a simple extraction task. The model is
       overkill — accuracy is fine but cost is 5x over budget. Switch to
       gpt-4o-mini which handles extraction well at 1/10th the cost."

[Iteration 1: Generator swaps model + adjusts prompt → 6.7/10]
  accuracy: 5/10, cost: 8/10, latency: 7/10
  ASI: "Cost and latency are great now but accuracy dropped on multi-step
       reasoning tasks (cases 7, 12). Split into two calls — extraction
       on mini, reasoning on sonnet — to get accuracy back without
       blowing the budget."

[Iteration 2: Generator restructures to 2-step chain → 7.0/10]
  accuracy: 7/10, cost: 7/10, latency: 7/10
  ASI: "Architecture is solid. The extraction prompt is too generic —
       add 3 few-shot examples from the test cases to anchor the format."

[Iteration 3: Generator adds few-shot examples → 7.7/10]
  ...

Best candidate: iteration 3 (7.7/10)
5 iterations complete. Run 3 more?