Compound-engineering-plugin ce-optimize

Run metric-driven iterative optimization loops. Define a measurable goal, build measurement scaffolding, then run parallel experiments that try many approaches, measure each against hard gates and/or LLM-as-judge quality scores, keep improvements, and converge toward the best solution. Use when optimizing clustering quality, search relevance, build performance, prompt quality, or any measurable outcome that benefits from systematic experimentation. Inspired by Karpathy's autoresearch, generalized for multi-file code changes and non-ML domains.

Install

Source · Clone the upstream repo:

    git clone https://github.com/EveryInc/compound-engineering-plugin

Claude Code · Install into `~/.claude/skills/`:

    T=$(mktemp -d) && git clone --depth=1 https://github.com/EveryInc/compound-engineering-plugin "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/compound-engineering/skills/ce-optimize" ~/.claude/skills/everyinc-compound-engineering-plugin-ce-optimize && rm -rf "$T"

Manifest: `plugins/compound-engineering/skills/ce-optimize/SKILL.md`

Source content

Iterative Optimization Loop

Run metric-driven iterative optimization. Define a goal, build measurement scaffolding, then run parallel experiments that converge toward the best solution.

Interaction Method

Use the platform's blocking question tool when available (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). Otherwise, present numbered options in chat and wait for the user's reply before proceeding.

Input

<optimization_input> #$ARGUMENTS </optimization_input>

If the input above is empty, ask: "What would you like to optimize? Describe the goal, or provide a path to an optimization spec YAML file."

Optimization Spec Schema

Reference the spec schema for validation: `references/optimize-spec-schema.yaml`

Experiment Log Schema

Reference the experiment log schema for state management: `references/experiment-log-schema.yaml`

Quick Start

For a first run, optimize for signal and safety, not maximum throughput:

  • Start from `references/example-hard-spec.yaml` when the metric is objective and cheap to measure
  • Use `references/example-judge-spec.yaml` only when actual quality requires semantic judgment
  • Prefer `execution.mode: serial` and `execution.max_concurrent: 1`
  • Cap the first run with `stopping.max_iterations: 4` and `stopping.max_hours: 1`
  • Avoid new dependencies until the baseline and measurement harness are trusted
  • For judge mode, start with `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5`

For a friendly overview of what this skill is for, when to use hard metrics vs LLM-as-judge, and example kickoff prompts, see `references/usage-guide.md`.


Persistence Discipline

CRITICAL: The experiment log on disk is the single source of truth. The conversation context is NOT durable storage. Results that exist only in the conversation WILL be lost.

The files under `.context/compound-engineering/ce-optimize/<spec-name>/` are local scratch state. They are ignored by git, so they survive local resumes on the same machine but are not preserved by commits, branches, or pushes unless the user exports them separately.

This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters MUST live on disk, not in the agent's memory.

If you produce a results table in the conversation without writing those results to disk first, you have a bug. The conversation is for the user's benefit. The experiment log file is for durability.

Core Rules

  1. Write each experiment result to disk IMMEDIATELY after measurement — not after the batch, not after evaluation, IMMEDIATELY. Append the experiment entry to the experiment log file the moment its metrics are known, before evaluating the next experiment. This is the #1 crash-safety rule.

  2. VERIFY every critical write — after writing the experiment log, read the file back and confirm the entry is present. This catches silent write failures. Do not proceed to the next experiment until verification passes.

  3. Re-read from disk at every phase boundary and before every decision — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk.

  4. The experiment log is append-only during Phase 3 — never rewrite the full file. Append new experiment entries. Update the `best` section in place only when a new best is found. This prevents data loss if a write is interrupted.

  5. Per-experiment result markers for crash recovery — each experiment writes a `result.yaml` marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged.

  6. Strategy digest is written after every batch, before generating new hypotheses — the agent reads the digest (not its memory) when deciding what to try next.

  7. Never present results to the user without writing them to disk first — the pattern is: measure -> write to disk -> verify -> THEN show the user. Not the reverse.

Mandatory Disk Checkpoints

These are non-negotiable write-then-verify steps. At each checkpoint, the agent MUST write the specified file and then read it back to confirm the write succeeded.

| Checkpoint | File Written | Phase |
| --- | --- | --- |
| CP-0: Spec saved | `spec.yaml` | Phase 0, after user approval |
| CP-1: Baseline recorded | `experiment-log.yaml` (initial with baseline) | Phase 1, after baseline measurement |
| CP-2: Hypothesis backlog saved | `experiment-log.yaml` (`hypothesis_backlog` section) | Phase 2, after hypothesis generation |
| CP-3: Each experiment result | `experiment-log.yaml` (append experiment entry) | Phase 3.3, immediately after each measurement |
| CP-4: Batch summary | `experiment-log.yaml` (outcomes + best) + `strategy-digest.md` | Phase 3.5, after batch evaluation |
| CP-5: Final summary | `experiment-log.yaml` (final state) | Phase 4, at wrap-up |

Format of a verification step:

  1. Write the file using the native file-write tool
  2. Read the file back using the native file-read tool
  3. Confirm the expected content is present
  4. If verification fails, retry the write. If it fails twice, alert the user.
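
The write-then-verify pattern can be sketched in shell. This is an illustrative sketch, not the skill's actual implementation: `append_and_verify`, the entry layout, and the `iteration:` marker format are assumptions made for the example.

```shell
# Sketch of an append-then-verify checkpoint (CP-3 style).
# Entry format and the "iteration: N" marker are illustrative assumptions.
append_and_verify() {
  log=$1; entry=$2; marker=$3
  printf '%s\n' "$entry" >> "$log"
  # Read back from disk: a silent write failure leaves the marker absent.
  if grep -q "$marker" "$log"; then
    echo "verified"
  else
    echo "verification failed, retry required" >&2
    return 1
  fi
}

log=$(mktemp)
append_and_verify "$log" "- iteration: 7
  outcome: measured" "iteration: 7"   # prints "verified"
```

A caller following step 4 would retry once on a non-zero return and alert the user on a second failure.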

File Locations (all under `.context/compound-engineering/ce-optimize/<spec-name>/`)

| File | Purpose | Written When |
| --- | --- | --- |
| `spec.yaml` | Optimization spec (immutable during run) | Phase 0 (CP-0) |
| `experiment-log.yaml` | Full history of all experiments | Initialized at CP-1, appended at CP-3, updated at CP-4 |
| `strategy-digest.md` | Compressed learnings for hypothesis generation | Written at CP-4 after each batch |
| `<worktree>/result.yaml` | Per-experiment crash-recovery marker | Immediately after measurement, before CP-3 |

On Resume

When Phase 0.4 detects an existing run:

  1. Read the experiment log from disk — this is the ground truth
  2. Scan worktree directories for `result.yaml` markers not yet in the log
  3. Recover any measured-but-unlogged experiments
  4. Continue from where the log left off
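
The marker scan can be sketched as follows. The worktree layout (`optimize-exp/<spec>/exp-NNN/`) matches the comment on the worktree script in Phase 3.2; `scan_unlogged` and the `iteration: N` log-line format are assumptions for illustration.

```shell
# Sketch of the resume scan: report result.yaml markers whose iteration
# number never made it into the experiment log.
scan_unlogged() {
  root=$1; log=$2
  for marker in "$root"/exp-*/result.yaml; do
    [ -f "$marker" ] || continue
    # exp-007 -> 7 (strip the prefix and leading zeros)
    n=$(basename "$(dirname "$marker")" | sed 's/^exp-0*//')
    grep -q "iteration: $n\$" "$log" || echo "recover: $marker"
  done
}

root=$(mktemp -d); log=$(mktemp)
mkdir -p "$root/exp-001" "$root/exp-002"
echo "metric: 0.8" > "$root/exp-001/result.yaml"
echo "metric: 0.9" > "$root/exp-002/result.yaml"
echo "iteration: 1" >> "$log"   # exp-001 was logged; exp-002 was not
scan_unlogged "$root" "$log"    # reports only exp-002's marker
```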

Phase 0: Setup

0.1 Determine Input Type

Check whether the input is:

  • A spec file path (ends in `.yaml` or `.yml`): read and validate it
  • A description of the optimization goal: help the user create a spec interactively

0.2 Load or Create Spec

If spec file provided:

  1. Read the YAML spec file. The orchestrating agent parses YAML natively -- no shell script parsing.
  2. Validate against `references/optimize-spec-schema.yaml`:
    • All required fields present
    • `name` is lowercase kebab-case and safe to use in git refs / worktree paths
    • `metric.primary.type` is `hard` or `judge`
    • If type is `judge`, `metric.judge` section exists with `rubric` and `scoring`
    • At least one degenerate gate defined
    • `measurement.command` is non-empty
    • `scope.mutable` and `scope.immutable` each have at least one entry
    • Gate check operators are valid (`>=`, `<=`, `>`, `<`, `==`, `!=`)
    • `execution.max_concurrent` is at least 1
    • `execution.max_concurrent` does not exceed 6 when backend is `worktree`
  3. If validation fails, report errors and ask the user to fix them
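
As one concrete slice of that validation, the operator check reduces to a whitelist. `valid_gate_op` is a hypothetical helper name; real validation would parse the YAML spec against the schema file rather than check strings in shell.

```shell
# Minimal operator whitelist check (sketch only).
valid_gate_op() {
  case "$1" in
    '>='|'<='|'>'|'<'|'=='|'!=') return 0 ;;
    *) return 1 ;;
  esac
}

valid_gate_op '>=' && echo "ok: >="
valid_gate_op '=>' || echo "rejected: =>"
```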

If description provided:

  1. Analyze the project to understand what can be measured

  2. Detect whether the optimization target is qualitative or quantitative — this determines `type: hard` vs `type: judge` and is the single most important spec decision:

    Use `type: hard` when:

    • The metric is a scalar number with a clear "better" direction
    • The metric is objectively measurable (build time, test pass rate, latency, memory usage)
    • No human judgment is needed to evaluate "is this result actually good?"
    • Examples: reduce build time, increase test coverage, reduce API latency, decrease bundle size

    Use `type: judge` when:

    • The quality of the output requires semantic understanding to evaluate
    • A human reviewer would need to look at the results to say "this is better"
    • Proxy metrics exist but can mislead (e.g., "more clusters" does not mean "better clusters")
    • The optimization could produce degenerate solutions that look good on paper
    • Examples: clustering quality, search relevance, summarization quality, code readability, UX copy, recommendation relevance

    IMPORTANT: If the target is qualitative, strongly recommend `type: judge`. Explain that hard metrics alone will optimize proxy numbers without checking actual quality. Show the user the three-tier approach:

    • Degenerate gates (hard, cheap, fast): catch obviously broken solutions — e.g., "all items in 1 cluster" or "0% coverage". Run first. If gates fail, skip the expensive judge step.
    • LLM-as-judge (the actual optimization target): sample outputs, score them against a rubric, aggregate. This is what the loop optimizes.
    • Diagnostics (logged, not gated): distribution stats, counts, timing — useful for understanding WHY a judge score changed.

    If the user insists on `type: hard` for a qualitative target, proceed but warn that the results may optimize a misleading proxy.

  3. Design the sampling strategy (for `type: judge`):

    Guide the user through defining stratified sampling. The key question is: "What parts of the output space do you need to check quality on?"

    Walk through these questions:

    • What does one "item" look like? (a cluster, a search result page, a summary, etc.)
    • What are the natural size/quality strata? (e.g., large clusters vs small clusters vs singletons)
    • Where are quality failures most likely? (e.g., very large clusters may be degenerate merges; singletons may be missed groupings)
    • What total sample size balances cost vs signal? (default: 30 items, adjust based on output volume)

    Example stratified sampling for clustering:

    stratification:
      - bucket: "top_by_size"     # largest clusters — check for degenerate mega-clusters
        count: 10
      - bucket: "mid_range"       # middle of non-solo cluster size range — representative quality
        count: 10
      - bucket: "small_clusters"  # clusters with 2-3 items — check if connections are real
        count: 10
    singleton_sample: 15          # singletons — check for false negatives (items that should cluster)
    

    The sampling strategy is domain-specific. For search relevance, strata might be "top-3 results", "results 4-10", "tail results". For summarization, strata might be "short documents", "long documents", "multi-topic documents".

    Singleton evaluation is critical when the goal involves coverage — sampling singletons with the singleton rubric checks whether the system is missing obvious groupings.

  4. Design the rubric (for `type: judge`):

    Help the user define the scoring rubric. A good rubric:

    • Has a 1-5 scale (or similar) with concrete descriptions for each level
    • Includes supplementary fields that help diagnose issues (e.g., `distinct_topics`, `outlier_count`)
    • Is specific enough that two judges would give similar scores
    • Does NOT assume bigger/more is better — "3 items per cluster average" is not inherently good or bad

    Example for clustering:

    rubric: |
      Rate this cluster 1-5:
      - 5: All items clearly about the same issue/feature
      - 4: Strong theme, minor outliers
      - 3: Related but covers 2-3 sub-topics that could reasonably be split
      - 2: Weak connection — items share superficial similarity only
      - 1: Unrelated items grouped together
      Also report: distinct_topics (integer), outlier_count (integer)
    
  5. Guide the user through the remaining spec fields:

    • What degenerate cases should be rejected? (gates — e.g., "solo_pct <= 0.95" catches all-singletons, "max_cluster_size <= 500" catches mega-clusters)
    • What command runs the measurement?
    • What files can be modified? What is immutable?
    • Any constraints or dependencies?
    • If this is the first run: recommend `execution.mode: serial`, `execution.max_concurrent: 1`, `stopping.max_iterations: 4`, and `stopping.max_hours: 1`
    • If `type: judge`: recommend `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5` until the rubric and harness are trusted
  6. Write the spec to `.context/compound-engineering/ce-optimize/<spec-name>/spec.yaml`

  7. Present the spec to the user for approval before proceeding

0.3 Search Prior Learnings

Dispatch `research:ce-learnings-researcher` to search for prior optimization work on similar topics. If relevant learnings exist, incorporate them into the approach.

0.4 Run Identity Detection

Check if the `optimize/<spec-name>` branch already exists:

git rev-parse --verify "optimize/<spec-name>" 2>/dev/null

If the branch exists, check for an existing experiment log at `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml`.

Present the user with a choice via the platform question tool:

  • Resume: read ALL state from the experiment log on disk (do not rely on any in-memory context from a prior session). Recover any measured-but-unlogged experiments by scanning worktree directories for `result.yaml` markers. Continue from the last iteration number in the log.
  • Fresh start: archive the old branch to `optimize-archive/<spec-name>/archived-<timestamp>`, clear the experiment log, start from scratch

0.5 Create Optimization Branch and Scratch Space

git checkout -b "optimize/<spec-name>"  # or switch to existing if resuming

Create scratch directory:

mkdir -p .context/compound-engineering/ce-optimize/<spec-name>/

Phase 1: Measurement Scaffolding

This phase is a HARD GATE. The user must approve baseline and parallel readiness before Phase 2.

1.1 Clean-Tree Gate

Verify no uncommitted changes to files within `scope.mutable` or `scope.immutable`:

git status --porcelain

Filter the output against the scope paths. If any in-scope files have uncommitted changes:

  • Report which files are dirty
  • Ask the user to commit or stash before proceeding
  • Do NOT continue until the working tree is clean for in-scope files
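
Git's own pathspec filtering does the in-scope check directly. This sketch wraps it in a hypothetical `clean_tree_for_scope` helper; the example scope paths in the usage comment stand in for the spec's `scope.mutable` and `scope.immutable` lists.

```shell
# Sketch of the clean-tree gate: fail if any in-scope file has
# uncommitted changes, and print the offenders.
clean_tree_for_scope() {
  dirty=$(git status --porcelain -- "$@")
  if [ -n "$dirty" ]; then
    echo "dirty in-scope files:"
    printf '%s\n' "$dirty"
    return 1
  fi
}

# Usage (illustrative paths): block Phase 2 until this passes.
# clean_tree_for_scope src/cluster/ src/eval/
```

Out-of-scope dirt (e.g., an untracked notes file elsewhere in the repo) does not trip the gate, matching the "filter against the scope paths" rule above.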

1.2 Build or Validate Measurement Harness

If the user provides a measurement harness (the `measurement.command` already exists):

  1. Run it once via the measurement script:
    bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<measurement.working_directory or .>"
    
  2. Validate the JSON output:
    • Contains keys for all degenerate gate metric names
    • Contains keys for all diagnostic metric names
    • Values are numeric or boolean as expected
  3. If validation fails, report what is missing and ask the user to fix the harness

If agent must build the harness:

  1. Analyze the codebase to understand the current approach and what should be measured
  2. Build an evaluation script (e.g., `evaluate.py`, `evaluate.sh`, or equivalent)
  3. Add the evaluation script path to `scope.immutable` -- the experiment agent must not modify it
  4. Run it once and validate the output
  5. Present the harness and its output to the user for review

1.3 Establish Baseline

Run the measurement harness on the current code.

If stability mode is `repeat`:

  1. Run the harness `repeat_count` times
  2. Aggregate results using the configured aggregation method (median, mean, min, max)
  3. Calculate variance across runs
  4. If variance exceeds `noise_threshold`, warn the user and suggest increasing `repeat_count`
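
For the median aggregation case, the repeat-and-aggregate step reduces to a small filter. The piped numbers here stand in for repeated runs of `measurement.command` emitting one scalar each.

```shell
# Sketch: aggregate repeat_count harness runs with a median to damp noise.
median() {
  sort -n | awk '{ a[NR] = $1 }
    END { print (NR % 2) ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2 }'
}

printf '42\n38\n45\n' | median   # three simulated runs -> 42
```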

Record the baseline in the experiment log:

baseline:
  timestamp: "<current ISO 8601 timestamp>"
  gates:
    <gate_name>: <value>
    ...
  diagnostics:
    <diagnostic_name>: <value>
    ...

If primary type is `judge`, also run the judge evaluation on baseline output to establish the starting judge score.

1.4 Parallelism Readiness Probe

Run the parallelism probe script:

bash scripts/parallel-probe.sh "<project_directory>" "<measurement.command>" "<measurement.working_directory>" <shared_files...>

Read the JSON output. Present any blockers to the user with suggested mitigations. Treat the probe as intentionally narrow: it should inspect the measurement command, the measurement working directory, and explicitly declared shared files, not the entire repository.

1.5 Worktree Budget Check

Count existing worktrees:

bash scripts/experiment-worktree.sh count

If count + `execution.max_concurrent` would exceed 12:

  • Warn the user
  • Suggest cleaning up existing worktrees or reducing `max_concurrent`
  • Do NOT block -- the user may proceed at their own risk

1.6 Write Baseline to Disk (CP-1)

MANDATORY CHECKPOINT. Before presenting results to the user, write the initial experiment log with baseline metrics to disk:

  1. Create the experiment log file at `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml`
  2. Include all required top-level sections from `references/experiment-log-schema.yaml`: `spec`, `run_id`, `started_at`, `baseline`, `experiments`, and `best`
  3. Seed `experiments` as an empty array and seed `best` from the baseline snapshot (use `iteration: 0`, baseline metrics, and baseline judge scores if present) so later phases have a valid current-best state to compare against
  4. Optionally seed `hypothesis_backlog: []` here as well so the log shape is stable before Phase 2 populates it
  5. Verify: read the file back and confirm the required sections are present and the baseline values match
  6. Only THEN present results to the user

1.7 User Approval Gate

Present to the user via the platform question tool:

  • Baseline metrics: all gate values, diagnostic values, and judge scores (if applicable)
  • Experiment log location: show the file path so the user knows where results are saved
  • Parallel readiness: probe results, any blockers, mitigations applied
  • Clean-tree status: confirmed clean
  • Worktree budget: current count and projected usage
  • Judge budget: estimated per-experiment judge cost and the configured `max_total_cost_usd` cap (or an explicit note that spend is uncapped)

Options:

  1. Proceed -- approve baseline and parallel config, move to Phase 2
  2. Adjust spec -- modify spec settings before proceeding
  3. Fix issues -- user needs to resolve blockers first

Do NOT proceed to Phase 2 until the user explicitly approves.

If primary type is `judge` and `max_total_cost_usd` is null, call that out as uncapped spend and require explicit approval before proceeding.

State re-read: After gate approval, re-read the spec and baseline from disk. Do not carry stale in-memory values forward.


Phase 2: Hypothesis Generation

2.1 Analyze Current Approach

Read the code within `scope.mutable` to understand:

  • The current implementation approach
  • Obvious improvement opportunities
  • Constraints and dependencies between components

Optionally dispatch `research:ce-repo-research-analyst` for deeper codebase analysis if the scope is large or unfamiliar.

2.2 Generate Hypothesis List

Generate an initial set of hypotheses. Each hypothesis should have:

  • Description: what to try
  • Category: one of the standard categories (signal-extraction, graph-signals, embedding, algorithm, preprocessing, parameter-tuning, architecture, data-handling) or a domain-specific category
  • Priority: high, medium, or low based on expected impact and feasibility
  • Required dependencies: any new packages or tools needed

Include user-provided hypotheses if any were given as input.

Aim for 10-30 hypotheses in the initial backlog. More can be generated during the loop based on learnings.

2.3 Dependency Pre-Approval

Collect all unique new dependencies across all hypotheses.

If any hypotheses require new dependencies:

  1. Present the full dependency list to the user via the platform question tool
  2. Ask for bulk approval
  3. Mark each hypothesis's `dep_status` as `approved` or `needs_approval`

Hypotheses with unapproved dependencies remain in the backlog but are skipped during batch selection. They are re-presented at wrap-up for potential approval.

2.4 Record Hypothesis Backlog (CP-2)

MANDATORY CHECKPOINT. Write the initial backlog to the experiment log file and verify:

hypothesis_backlog:
  - description: "Remove template boilerplate before embedding"
    category: "signal-extraction"
    priority: high
    dep_status: approved
    required_deps: []
  - description: "Try HDBSCAN clustering algorithm"
    category: "algorithm"
    priority: medium
    dep_status: needs_approval
    required_deps: ["scikit-learn"]

Phase 3: Optimization Loop

This phase repeats in batches until a stopping criterion is met.

3.1 Batch Selection

Select hypotheses for this batch:

  • Build a runnable backlog by excluding hypotheses with `dep_status: needs_approval`
  • If `execution.mode` is `serial`, force `batch_size = 1`
  • Otherwise, `batch_size = min(runnable_backlog_size, execution.max_concurrent)`
  • Prefer diversity: select from different categories when possible
  • Within a category, select by priority (high first)

If the backlog is empty and no new hypotheses can be generated, proceed to Phase 4 (wrap-up). If the backlog is non-empty but no runnable hypotheses remain because everything needs approval or is otherwise blocked, proceed to Phase 4 so the user can approve dependencies instead of spinning forever.
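
The batch-size rule reduces to a few comparisons. This hypothetical `batch_size` helper mirrors the spec fields loosely; argument names are invented for the sketch.

```shell
# Sketch of the batch-size rule: serial forces 1, otherwise take the
# smaller of the runnable backlog and the concurrency cap.
batch_size() {
  mode=$1; runnable=$2; max_concurrent=$3
  if [ "$mode" = serial ]; then
    echo 1
  elif [ "$runnable" -lt "$max_concurrent" ]; then
    echo "$runnable"
  else
    echo "$max_concurrent"
  fi
}

batch_size serial 12 4    # -> 1
batch_size parallel 3 4   # -> 3
batch_size parallel 12 4  # -> 4
```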

3.2 Dispatch Experiments

For each hypothesis in the batch, dispatch according to `execution.mode`. In `serial` mode, run exactly one experiment to completion before selecting the next hypothesis. In `parallel` mode, dispatch the full batch concurrently.

Worktree backend:

  1. Create experiment worktree:
    WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "<spec_name>" <exp_index> "optimize/<spec_name>" <shared_files...>)  # creates optimize-exp/<spec_name>/exp-<NNN>
    
  2. Apply port parameterization if configured (set env vars for the measurement script)
  3. Fill the experiment prompt template (`references/experiment-prompt-template.md`) with:
    • Iteration number, spec name
    • Hypothesis description and category
    • Current best and baseline metrics
    • Mutable and immutable scope
    • Constraints and approved dependencies
    • Rolling window of last 10 experiments (concise summaries)
  4. Dispatch a subagent with the filled prompt, working in the experiment worktree

Codex backend:

  1. Check environment guard -- do NOT delegate if already inside a Codex sandbox:
    # If these exist, we're already in Codex -- fall back to subagent
    test -n "${CODEX_SANDBOX:-}" || test -n "${CODEX_SESSION_ID:-}" || test ! -w .git
    
  2. Fill the experiment prompt template
  3. Write the filled prompt to a temp file
  4. Dispatch via Codex:
    cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1
    
  5. Security posture: use the user's selection (ask once per session if not set in spec)

3.3 Collect and Persist Results

Process experiments as they complete — do NOT wait for the entire batch to finish before writing results.

For each completed experiment, immediately:

  1. Run measurement in the experiment's worktree:

    bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>/<measurement.working_directory or .>" <env_vars...>

    • If stability mode is `repeat`, run the measurement harness `repeat_count` times in that working directory and aggregate the results exactly as in Phase 1 before evaluating gates or ranking the experiment.
    • Use the aggregated metrics as the experiment's score; if variance exceeds `noise_threshold`, record that in learnings so the operator knows the result is noisy.
  2. Write crash-recovery marker — immediately after measurement, write `result.yaml` in the experiment worktree containing the raw metrics. This ensures the measurement is recoverable even if the agent crashes before updating the main log.

  3. Read raw JSON output from the measurement script

  4. Evaluate degenerate gates:

    • For each gate in `metric.degenerate_gates`, parse the operator and threshold
    • Compare the metric value against the threshold
    • If ANY gate fails: mark outcome as `degenerate`, skip judge evaluation, save money
  5. If gates pass AND primary type is `judge`:

    • Read the experiment's output (cluster assignments, search results, etc.)
    • Apply stratified sampling per `metric.judge.stratification` config (using `sample_seed`)
    • Group samples into batches of `metric.judge.batch_size`
    • Fill the judge prompt template (`references/judge-prompt-template.md`) for each batch
    • Dispatch `ceil(sample_size / batch_size)` parallel judge sub-agents
    • Each sub-agent returns structured JSON scores
    • Aggregate scores: compute the configured primary judge field from `metric.judge.scoring.primary` (which should match `metric.primary.name`) plus any `scoring.secondary` values
    • If `singleton_sample > 0`: also dispatch singleton evaluation sub-agents
  6. If gates pass AND primary type is `hard`:

    • Use the metric value directly from the measurement output
  7. IMMEDIATELY append to experiment log on disk (CP-3) — do not defer this to batch evaluation. Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml` right now. Use the transitional outcome `measured` once the experiment has valid metrics but has not yet been compared to the current best. Update the outcome to `kept`, `reverted`, or another terminal state in the evaluation step, but the raw metrics are on disk and safe from context compaction.

  8. VERIFY the write (CP-3 verification) — read the experiment log back from disk and confirm the entry just written is present. If verification fails, retry the write. Do NOT proceed to the next experiment until this entry is confirmed on disk.

Why immediately + verify? The agent's context window is NOT a durable store. Context compaction, session crashes, and restarts are expected during long runs. If results only exist in the agent's memory, they are lost. Karpathy's autoresearch writes to `results.tsv` after every single experiment — this skill must do the same with the experiment log. The verification step catches silent write failures that would otherwise lose data.
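
The per-gate comparison in step 4 amounts to a direction-aware numeric check. This `gate_pass` helper is an illustrative sketch covering the spec's six operators; the gate names and thresholds in the usage lines are invented for the example.

```shell
# Sketch: evaluate one degenerate gate. Operators and thresholds come
# from metric.degenerate_gates in the spec; values here are illustrative.
gate_pass() {
  # $1 measured value, $2 operator, $3 threshold
  awk -v v="$1" -v t="$3" -v op="$2" 'BEGIN {
    if (op == ">=") exit !(v + 0 >= t + 0)
    if (op == "<=") exit !(v + 0 <= t + 0)
    if (op == ">")  exit !(v + 0 >  t + 0)
    if (op == "<")  exit !(v + 0 <  t + 0)
    if (op == "==") exit !(v + 0 == t + 0)
    if (op == "!=") exit !(v + 0 != t + 0)
    exit 1  # unknown operator fails closed
  }'
}

gate_pass 0.93 '<=' 0.95 && echo "solo_pct gate passed"
gate_pass 800 '<=' 500  || echo "max_cluster_size gate FAILED: degenerate"
```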

3.4 Evaluate Batch

After all experiments in the batch have been measured:

  1. Rank experiments by primary metric improvement:

    • For hard metrics: compare to the current best using `metric.primary.direction` (`maximize` means higher is better, `minimize` means lower is better), and require the absolute improvement to exceed `measurement.stability.noise_threshold` before treating it as a real win
    • For judge metrics: compare the configured primary judge score (`metric.judge.scoring.primary` / `metric.primary.name`) to the current best, and require it to exceed `minimum_improvement`
  2. Identify the best experiment that passes all gates and improves the primary metric

  3. If best improves on current best: KEEP

    • Commit the experiment branch first so the winning diff exists as a real commit before any merge or cherry-pick
    • Include only mutable-scope changes in that commit; if no eligible diff remains, treat the experiment as non-improving and revert it
    • Merge the committed experiment branch into the optimization branch
    • Use the message `optimize(<spec-name>): <hypothesis description>` for the experiment commit
    • After the merge succeeds, clean up the winner's experiment worktree and branch; the integrated commit on the optimization branch is the durable artifact
    • This is now the new baseline for subsequent batches
  4. Check file-disjoint runners-up (up to `max_runner_up_merges_per_batch`):

    • For each runner-up that also improved, check file-level disjointness with the kept experiment
    • File-level disjointness: two experiments are disjoint if they modified completely different files. Same file = overlapping, even if different lines.
    • If disjoint: cherry-pick the runner-up onto the new baseline, re-run full measurement
    • If combined measurement is strictly better: keep the cherry-pick (outcome: `runner_up_kept`), then clean up that runner-up's experiment worktree and branch
    • Otherwise: revert the cherry-pick, log as "promising alone but neutral/harmful in combination" (outcome: `runner_up_reverted`), then clean up the runner-up's experiment worktree and branch
    • Stop after first failed combination
  5. Handle deferred deps: experiments that need unapproved dependencies get outcome `deferred_needs_approval`

  6. Revert all others: cleanup worktrees, log as `reverted`
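
The file-level disjointness test in step 4 reduces to an empty intersection of changed-file sets. This sketch assumes each input is a deduplicated newline-separated list, such as `git diff --name-only <base> <branch>` produces; all file names are invented for the example.

```shell
# Sketch: two experiments are disjoint when their changed-file
# lists share no path. Duplicated lines across the two lists are
# exactly the overlap, so uniq -d finds it.
disjoint() {
  printf '%s\n%s\n' "$1" "$2" | sort | uniq -d | grep -q . && return 1
  return 0
}

a='src/embed.py
src/cluster.py'
b='scripts/measure.sh'
c='src/cluster.py
README.md'

disjoint "$a" "$b" && echo "a,b disjoint: safe to cherry-pick"
disjoint "$a" "$c" || echo "a,c overlap on src/cluster.py: skip"
```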

3.5 Update State (CP-4)

MANDATORY CHECKPOINT. By this point, individual experiment results are already on disk (written in step 3.3). This step updates aggregate state and verifies.

  1. Re-read the experiment log from disk — do not trust in-memory state. The log is the source of truth.

  2. Finalize outcomes — update experiment entries from step 3.4 evaluation (mark `kept`, `reverted`, `runner_up_kept`, etc.). Write these outcome updates to disk immediately.

  3. Update the `best` section in the experiment log if a new best was found. Write to disk.

  4. Write strategy digest to `.context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md`:

    • Categories tried so far (with success/failure counts)
    • Key learnings from this batch and overall
    • Exploration frontier: what categories and approaches remain untried
    • Current best metrics and improvement from baseline
  5. Generate new hypotheses based on learnings:

    • Re-read the strategy digest from disk (not from memory)
    • Read the rolling window (last 10 experiments from the log on disk)
    • Do NOT read the full experiment log -- use the digest for broad context
    • Add new hypotheses to the backlog and write the updated backlog to disk
  6. Write updated hypothesis backlog to disk — the backlog section of the experiment log must reflect newly added hypotheses and removed (tested) ones.

CP-4 Verification: Read the experiment log back from disk. Confirm: (a) all experiment outcomes from this batch are finalized, (b) the `best` section reflects the current best, (c) the hypothesis backlog is updated. Read `strategy-digest.md` back and confirm it exists. Only THEN proceed to the next batch or stopping criteria check.

Checkpoint: at this point, all state for this batch is on disk. If the agent crashes and restarts, it can resume from the experiment log without loss.
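The CP-4 verification pass can be sketched as follows. The log structure and function shape are illustrative only — the real schema lives in `references/experiment-log-schema.yaml`:

```python
# Minimal sketch of CP-4 verification (hypothetical log structure).

def verify_cp4(log: dict, batch_ids: list[str], digest_exists: bool) -> list[str]:
    """Return a list of CP-4 violations; empty means safe to proceed."""
    problems = []
    # (a) every experiment in this batch must have a finalized outcome
    finalized = {"kept", "reverted", "runner_up_kept", "runner_up_reverted",
                 "deferred_needs_approval", "degenerate", "error", "timeout"}
    by_id = {e["id"]: e for e in log.get("experiments", [])}
    for exp_id in batch_ids:
        outcome = by_id.get(exp_id, {}).get("outcome")
        if outcome not in finalized:
            problems.append(f"{exp_id}: outcome not finalized ({outcome!r})")
    # (b) the best section must be present
    if not log.get("best"):
        problems.append("best section missing")
    # (c) the strategy digest must exist on disk
    if not digest_exists:
        problems.append("strategy-digest.md missing")
    return problems
```

If `verify_cp4` returns a non-empty list, the agent should repair the log before moving on rather than proceeding with inconsistent state.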

3.6 Check Stopping Criteria

Stop the loop if ANY of these are true:

  • Target reached: `stopping.target_reached` is true, `metric.primary.target` is set, and the primary metric reaches that target according to `metric.primary.direction` (`>=` for `maximize`, `<=` for `minimize`)
  • Max iterations: total experiments run >= `stopping.max_iterations`
  • Max hours: wall-clock time since Phase 3 start >= `stopping.max_hours`
  • Judge budget exhausted: cumulative judge spend >= `metric.judge.max_total_cost_usd` (if set)
  • Plateau: no improvement for `stopping.plateau_iterations` consecutive experiments
  • Manual stop: user interrupts (save state and proceed to Phase 4)
  • Empty backlog: no hypotheses remain and no new ones can be generated

If no stopping criterion is met, proceed to the next batch (step 3.1).

3.7 Cross-Cutting Concerns

Codex failure cascade: Track consecutive Codex delegation failures. After 3 consecutive failures, auto-disable Codex for remaining experiments and fall back to subagent dispatch. Log the switch.
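The failure cascade amounts to a consecutive-failure counter with a one-way switch. A minimal sketch, with the class shape purely illustrative (the threshold of 3 comes from the text above):

```python
# Sketch of the Codex failure-cascade tracker (hypothetical class).

class CodexFallback:
    """Disable Codex dispatch after N consecutive delegation failures."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.codex_enabled = True

    def record(self, success: bool) -> None:
        if success:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.codex_enabled = False  # fall back to subagent dispatch
```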

Error handling: If an experiment's measurement command crashes, times out, or produces malformed output:

  • Log as outcome `error` or `timeout` with the error message
  • Revert the experiment (clean up its worktree)
  • The loop continues with remaining experiments in the batch
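Running a measurement command with this error handling might look like the sketch below; the command list, timeout, and result shape are assumptions, not a prescribed interface:

```python
# Sketch: run one experiment's measurement command, mapping failures
# to the log outcomes described above.
import subprocess

def run_measurement(cmd: list[str], timeout_s: float) -> dict:
    """Run the measurement command; map crash/timeout to log outcomes."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"outcome": "timeout", "error": f"exceeded {timeout_s}s"}
    if proc.returncode != 0:
        return {"outcome": "error", "error": proc.stderr.strip()}
    return {"outcome": "measured", "stdout": proc.stdout}
```

Whatever the result, the caller reverts the worktree on `error`/`timeout` and continues with the rest of the batch.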

Progress reporting: After each batch, report:

  • Batch N of estimated M (based on backlog size)
  • Experiments run this batch and total
  • Current best metric and improvement from baseline
  • Cumulative judge cost (if applicable)

Crash recovery: See the Persistence Discipline section. Per-experiment `result.yaml` markers are written in step 3.3, where individual experiment results are also appended to the log immediately. Batch-level state (outcomes, best, digest) is written in step 3.5. On resume (Phase 0.4), the log on disk is the ground truth — scan for any `result.yaml` markers not yet reflected in the log.
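The resume-time scan can be sketched as follows; the worktree layout (one directory per experiment, named after its id) is an assumption for illustration:

```python
# Sketch: find result.yaml markers on disk that the experiment log
# does not yet reflect (layout and naming are hypothetical).
from pathlib import Path

def unlogged_results(worktrees_root: str, logged_ids: set[str]) -> list[Path]:
    """Return result.yaml files whose experiment id is missing from the log."""
    orphans = []
    for marker in Path(worktrees_root).glob("*/result.yaml"):
        exp_id = marker.parent.name  # assume worktree dir is named after the id
        if exp_id not in logged_ids:
            orphans.append(marker)
    return sorted(orphans)
```

Each orphaned marker is replayed into the log before the loop resumes.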


Phase 4: Wrap-Up

4.1 Present Deferred Hypotheses

If any hypotheses were deferred due to unapproved dependencies:

  1. List them with their dependency requirements
  2. Ask the user whether to approve, skip, or save for a future run
  3. If approved: add to backlog and offer to re-enter Phase 3 for one more round

4.2 Summarize Results

Present a comprehensive summary:

```
Optimization: <spec-name>
Duration: <wall-clock time>
Total experiments: <count>
  Kept: <count> (including <runner_up_kept_count> runner-up merges)
  Reverted: <count>
  Degenerate: <count>
  Errors: <count>
  Deferred: <count>

Baseline -> Final:
  <primary_metric>: <baseline_value> -> <final_value> (<delta>)
  <gate_metrics>: ...
  <diagnostics>: ...

Judge cost: $<total_judge_cost_usd> (if applicable)

Key improvements:
  1. <kept experiment 1 hypothesis> (+<delta>)
  2. <kept experiment 2 hypothesis> (+<delta>)
  ...
```

4.3 Preserve and Offer Next Steps

The optimization branch (`optimize/<spec-name>`) is preserved with all commits from kept experiments. The experiment log and strategy digest remain in local `.context/...` scratch space for resume and audit on this machine only; they do not travel with the branch because `.context/` is gitignored.

Present post-completion options via the platform question tool:

  1. Run `/ce-code-review` on the cumulative diff (baseline to final). Load the `ce-code-review` skill with `mode:autofix` on the optimization branch.
  2. Run `/ce-compound` to document the winning strategy as an institutional learning.
  3. Create a PR from the optimization branch to the default branch.
  4. Continue with more experiments: re-enter Phase 3 with the current state, re-reading state from disk first.
  5. Done — leave the optimization branch for manual review.

4.4 Cleanup

Clean up scratch space:

```bash
# Keep the experiment log for local resume/audit on this machine
# Remove temporary batch artifacts
rm -f .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md
```

Do NOT delete the experiment log if the user may resume locally or wants a local audit trail. If they need a durable shared artifact, summarize or export the results into a tracked path before cleanup. Do NOT delete experiment worktrees that are still being referenced.