Compound-engineering-plugin ce-optimize

Run metric-driven iterative optimization loops. Define a measurable goal, build measurement scaffolding, then run parallel experiments that try many approaches, measure each against hard gates and/or LLM-as-judge quality scores, keep improvements, and converge toward the best solution. Use when optimizing clustering quality, search relevance, build performance, prompt quality, or any measurable outcome that benefits from systematic experimentation. Inspired by Karpathy's autoresearch, generalized for multi-file code changes and non-ML domains.

Install

Source · Clone the upstream repo:

    git clone https://github.com/EveryInc/compound-engineering-plugin

Claude Code · Install into `~/.claude/skills/`:

    T=$(mktemp -d) && git clone --depth=1 https://github.com/EveryInc/compound-engineering-plugin "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/compound-engineering/skills/ce-optimize" ~/.claude/skills/everyinc-compound-engineering-plugin-ce-optimize && rm -rf "$T"

Manifest: `plugins/compound-engineering/skills/ce-optimize/SKILL.md`

Source content

Iterative Optimization Loop

Run metric-driven iterative optimization. Define a goal, build measurement scaffolding, then run parallel experiments that converge toward the best solution.

Interaction Method

Use the platform's blocking question tool when available (`AskUserQuestion` in Claude Code, `request_user_input` in Codex, `ask_user` in Gemini). Otherwise, present numbered options in chat and wait for the user's reply before proceeding.

Input

<optimization_input> #$ARGUMENTS </optimization_input>

If the input above is empty, ask: "What would you like to optimize? Describe the goal, or provide a path to an optimization spec YAML file."

Optimization Spec Schema

Reference the spec schema for validation: `references/optimize-spec-schema.yaml`

Experiment Log Schema

Reference the experiment log schema for state management: `references/experiment-log-schema.yaml`

Quick Start

For a first run, optimize for signal and safety, not maximum throughput:

  • Start from `references/example-hard-spec.yaml` when the metric is objective and cheap to measure
  • Use `references/example-judge-spec.yaml` only when actual quality requires semantic judgment
  • Prefer `execution.mode: serial` and `execution.max_concurrent: 1`
  • Cap the first run with `stopping.max_iterations: 4` and `stopping.max_hours: 1`
  • Avoid new dependencies until the baseline and measurement harness are trusted
  • For judge mode, start with `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5`

For a friendly overview of what this skill is for, when to use hard metrics vs LLM-as-judge, and example kickoff prompts, see `references/usage-guide.md`.


Persistence Discipline

CRITICAL: The experiment log on disk is the single source of truth. The conversation context is NOT durable storage. Results that exist only in the conversation WILL be lost.

The files under `.context/compound-engineering/ce-optimize/<spec-name>/` are local scratch state. They are ignored by git, so they survive local resumes on the same machine but are not preserved by commits, branches, or pushes unless the user exports them separately.

This skill runs for hours. Context windows compact, sessions crash, and agents restart. Every piece of state that matters MUST live on disk, not in the agent's memory.

If you produce a results table in the conversation without writing those results to disk first, you have a bug. The conversation is for the user's benefit. The experiment log file is for durability.

Core Rules

  1. Write each experiment result to disk IMMEDIATELY after measurement — not after the batch, not after evaluation, IMMEDIATELY. Append the experiment entry to the experiment log file the moment its metrics are known, before evaluating the next experiment. This is the #1 crash-safety rule.

  2. VERIFY every critical write — after writing the experiment log, read the file back and confirm the entry is present. This catches silent write failures. Do not proceed to the next experiment until verification passes.

  3. Re-read from disk at every phase boundary and before every decision — never trust in-memory state across phase transitions, batch boundaries, or after any operation that might have taken significant time. Re-read the experiment log and strategy digest from disk.

  4. The experiment log is append-only during Phase 3 — never rewrite the full file. Append new experiment entries. Update the `best` section in place only when a new best is found. This prevents data loss if a write is interrupted.

  5. Per-experiment result markers for crash recovery — each experiment writes a `result.yaml` marker in its worktree immediately after measurement. On resume, scan for these markers to recover experiments that were measured but not yet logged.

  6. Strategy digest is written after every batch, before generating new hypotheses — the agent reads the digest (not its memory) when deciding what to try next.

  7. Never present results to the user without writing them to disk first — the pattern is: measure -> write to disk -> verify -> THEN show the user. Not the reverse.

Mandatory Disk Checkpoints

These are non-negotiable write-then-verify steps. At each checkpoint, the agent MUST write the specified file and then read it back to confirm the write succeeded.

| Checkpoint | File Written | Phase |
| --- | --- | --- |
| CP-0: Spec saved | `spec.yaml` | Phase 0, after user approval |
| CP-1: Baseline recorded | `experiment-log.yaml` (initial with baseline) | Phase 1, after baseline measurement |
| CP-2: Hypothesis backlog saved | `experiment-log.yaml` (`hypothesis_backlog` section) | Phase 2, after hypothesis generation |
| CP-3: Each experiment result | `experiment-log.yaml` (append experiment entry) | Phase 3.3, immediately after each measurement |
| CP-4: Batch summary | `experiment-log.yaml` (outcomes + best) + `strategy-digest.md` | Phase 3.5, after batch evaluation |
| CP-5: Final summary | `experiment-log.yaml` (final state) | Phase 4, at wrap-up |

Format of a verification step:

  1. Write the file using the native file-write tool
  2. Read the file back using the native file-read tool
  3. Confirm the expected content is present
  4. If verification fails, retry the write. If it fails twice, alert the user.
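
The write-then-verify pattern can be sketched in shell. This is an illustrative sketch, not the skill's actual implementation: `append_and_verify`, the entry layout, and the `iteration:` marker format are assumptions made for the example.

```shell
# Sketch of an append-then-verify checkpoint (CP-3 style).
# Entry format and the "iteration: N" marker are illustrative assumptions.
append_and_verify() {
  log=$1; entry=$2; marker=$3
  printf '%s\n' "$entry" >> "$log"
  # Read back from disk: a silent write failure leaves the marker absent.
  if grep -q "$marker" "$log"; then
    echo "verified"
  else
    echo "verification failed, retry required" >&2
    return 1
  fi
}

log=$(mktemp)
append_and_verify "$log" "- iteration: 7
  outcome: measured" "iteration: 7"   # prints "verified"
```

A caller following step 4 would retry once on a non-zero return and alert the user on a second failure.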

File Locations (all under `.context/compound-engineering/ce-optimize/<spec-name>/`)

| File | Purpose | Written When |
| --- | --- | --- |
| `spec.yaml` | Optimization spec (immutable during run) | Phase 0 (CP-0) |
| `experiment-log.yaml` | Full history of all experiments | Initialized at CP-1, appended at CP-3, updated at CP-4 |
| `strategy-digest.md` | Compressed learnings for hypothesis generation | Written at CP-4 after each batch |
| `<worktree>/result.yaml` | Per-experiment crash-recovery marker | Immediately after measurement, before CP-3 |

On Resume

When Phase 0.4 detects an existing run:

  1. Read the experiment log from disk — this is the ground truth
  2. Scan worktree directories for `result.yaml` markers not yet in the log
  3. Recover any measured-but-unlogged experiments
  4. Continue from where the log left off
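
The marker scan can be sketched as follows. The worktree layout (`optimize-exp/<spec>/exp-NNN/`) matches the comment on the worktree script in Phase 3.2; `scan_unlogged` and the `iteration: N` log-line format are assumptions for illustration.

```shell
# Sketch of the resume scan: report result.yaml markers whose iteration
# number never made it into the experiment log.
scan_unlogged() {
  root=$1; log=$2
  for marker in "$root"/exp-*/result.yaml; do
    [ -f "$marker" ] || continue
    # exp-007 -> 7 (strip the prefix and leading zeros)
    n=$(basename "$(dirname "$marker")" | sed 's/^exp-0*//')
    grep -q "iteration: $n\$" "$log" || echo "recover: $marker"
  done
}

root=$(mktemp -d); log=$(mktemp)
mkdir -p "$root/exp-001" "$root/exp-002"
echo "metric: 0.8" > "$root/exp-001/result.yaml"
echo "metric: 0.9" > "$root/exp-002/result.yaml"
echo "iteration: 1" >> "$log"   # exp-001 was logged; exp-002 was not
scan_unlogged "$root" "$log"    # reports only exp-002's marker
```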

Phase 0: Setup

0.1 Determine Input Type

Check whether the input is:

  • A spec file path (ends in `.yaml` or `.yml`): read and validate it
  • A description of the optimization goal: help the user create a spec interactively

0.2 Load or Create Spec

If spec file provided:

  1. Read the YAML spec file. The orchestrating agent parses YAML natively -- no shell script parsing.
  2. Validate against `references/optimize-spec-schema.yaml`:
    • All required fields present
    • `name` is lowercase kebab-case and safe to use in git refs / worktree paths
    • `metric.primary.type` is `hard` or `judge`
    • If type is `judge`, `metric.judge` section exists with `rubric` and `scoring`
    • At least one degenerate gate defined
    • `measurement.command` is non-empty
    • `scope.mutable` and `scope.immutable` each have at least one entry
    • Gate check operators are valid (`>=`, `<=`, `>`, `<`, `==`, `!=`)
    • `execution.max_concurrent` is at least 1
    • `execution.max_concurrent` does not exceed 6 when backend is `worktree`
  3. If validation fails, report errors and ask the user to fix them
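
As one concrete slice of that validation, the operator check reduces to a whitelist. `valid_gate_op` is a hypothetical helper name; real validation would parse the YAML spec against the schema file rather than check strings in shell.

```shell
# Minimal operator whitelist check (sketch only).
valid_gate_op() {
  case "$1" in
    '>='|'<='|'>'|'<'|'=='|'!=') return 0 ;;
    *) return 1 ;;
  esac
}

valid_gate_op '>=' && echo "ok: >="
valid_gate_op '=>' || echo "rejected: =>"
```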

If description provided:

  1. Analyze the project to understand what can be measured

  2. Detect whether the optimization target is qualitative or quantitative — this determines `type: hard` vs `type: judge` and is the single most important spec decision:

    Use `type: hard` when:

    • The metric is a scalar number with a clear "better" direction
    • The metric is objectively measurable (build time, test pass rate, latency, memory usage)
    • No human judgment is needed to evaluate "is this result actually good?"
    • Examples: reduce build time, increase test coverage, reduce API latency, decrease bundle size

    Use `type: judge` when:

    • The quality of the output requires semantic understanding to evaluate
    • A human reviewer would need to look at the results to say "this is better"
    • Proxy metrics exist but can mislead (e.g., "more clusters" does not mean "better clusters")
    • The optimization could produce degenerate solutions that look good on paper
    • Examples: clustering quality, search relevance, summarization quality, code readability, UX copy, recommendation relevance

    IMPORTANT: If the target is qualitative, strongly recommend `type: judge`. Explain that hard metrics alone will optimize proxy numbers without checking actual quality. Show the user the three-tier approach:

    • Degenerate gates (hard, cheap, fast): catch obviously broken solutions — e.g., "all items in 1 cluster" or "0% coverage". Run first. If gates fail, skip the expensive judge step.
    • LLM-as-judge (the actual optimization target): sample outputs, score them against a rubric, aggregate. This is what the loop optimizes.
    • Diagnostics (logged, not gated): distribution stats, counts, timing — useful for understanding WHY a judge score changed.

    If the user insists on `type: hard` for a qualitative target, proceed but warn that the results may optimize a misleading proxy.

  3. Design the sampling strategy (for `type: judge`):

    Guide the user through defining stratified sampling. The key question is: "What parts of the output space do you need to check quality on?"

    Walk through these questions:

    • What does one "item" look like? (a cluster, a search result page, a summary, etc.)
    • What are the natural size/quality strata? (e.g., large clusters vs small clusters vs singletons)
    • Where are quality failures most likely? (e.g., very large clusters may be degenerate merges; singletons may be missed groupings)
    • What total sample size balances cost vs signal? (default: 30 items, adjust based on output volume)

    Example stratified sampling for clustering:

    stratification:
      - bucket: "top_by_size"     # largest clusters — check for degenerate mega-clusters
        count: 10
      - bucket: "mid_range"       # middle of non-solo cluster size range — representative quality
        count: 10
      - bucket: "small_clusters"  # clusters with 2-3 items — check if connections are real
        count: 10
    singleton_sample: 15          # singletons — check for false negatives (items that should cluster)
    

    The sampling strategy is domain-specific. For search relevance, strata might be "top-3 results", "results 4-10", "tail results". For summarization, strata might be "short documents", "long documents", "multi-topic documents".

    Singleton evaluation is critical when the goal involves coverage — sampling singletons with the singleton rubric checks whether the system is missing obvious groupings.

  4. Design the rubric (for `type: judge`):

    Help the user define the scoring rubric. A good rubric:

    • Has a 1-5 scale (or similar) with concrete descriptions for each level
    • Includes supplementary fields that help diagnose issues (e.g., `distinct_topics`, `outlier_count`)
    • Is specific enough that two judges would give similar scores
    • Does NOT assume bigger/more is better — "3 items per cluster average" is not inherently good or bad

    Example for clustering:

    rubric: |
      Rate this cluster 1-5:
      - 5: All items clearly about the same issue/feature
      - 4: Strong theme, minor outliers
      - 3: Related but covers 2-3 sub-topics that could reasonably be split
      - 2: Weak connection — items share superficial similarity only
      - 1: Unrelated items grouped together
      Also report: distinct_topics (integer), outlier_count (integer)
    
  5. Guide the user through the remaining spec fields:

    • What degenerate cases should be rejected? (gates — e.g., "solo_pct <= 0.95" catches all-singletons, "max_cluster_size <= 500" catches mega-clusters)
    • What command runs the measurement?
    • What files can be modified? What is immutable?
    • Any constraints or dependencies?
    • If this is the first run: recommend `execution.mode: serial`, `execution.max_concurrent: 1`, `stopping.max_iterations: 4`, and `stopping.max_hours: 1`
    • If `type: judge`: recommend `sample_size: 10`, `batch_size: 5`, and `max_total_cost_usd: 5` until the rubric and harness are trusted
  6. Write the spec to `.context/compound-engineering/ce-optimize/<spec-name>/spec.yaml`

  7. Present the spec to the user for approval before proceeding

0.3 Search Prior Learnings

Dispatch `research:ce-learnings-researcher` to search for prior optimization work on similar topics. If relevant learnings exist, incorporate them into the approach.

0.4 Run Identity Detection

Check if the `optimize/<spec-name>` branch already exists:

git rev-parse --verify "optimize/<spec-name>" 2>/dev/null

If the branch exists, check for an existing experiment log at `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml`.

Present the user with a choice via the platform question tool:

  • Resume: read ALL state from the experiment log on disk (do not rely on any in-memory context from a prior session). Recover any measured-but-unlogged experiments by scanning worktree directories for `result.yaml` markers. Continue from the last iteration number in the log.
  • Fresh start: archive the old branch to `optimize-archive/<spec-name>/archived-<timestamp>`, clear the experiment log, start from scratch

0.5 Create Optimization Branch and Scratch Space

git checkout -b "optimize/<spec-name>"  # or switch to existing if resuming

Create scratch directory:

mkdir -p .context/compound-engineering/ce-optimize/<spec-name>/

Phase 1: Measurement Scaffolding

This phase is a HARD GATE. The user must approve baseline and parallel readiness before Phase 2.

1.1 Clean-Tree Gate

Verify no uncommitted changes to files within `scope.mutable` or `scope.immutable`:

git status --porcelain

Filter the output against the scope paths. If any in-scope files have uncommitted changes:

  • Report which files are dirty
  • Ask the user to commit or stash before proceeding
  • Do NOT continue until the working tree is clean for in-scope files
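
Git's own pathspec filtering does the in-scope check directly. This sketch wraps it in a hypothetical `clean_tree_for_scope` helper; the example scope paths in the usage comment stand in for the spec's `scope.mutable` and `scope.immutable` lists.

```shell
# Sketch of the clean-tree gate: fail if any in-scope file has
# uncommitted changes, and print the offenders.
clean_tree_for_scope() {
  dirty=$(git status --porcelain -- "$@")
  if [ -n "$dirty" ]; then
    echo "dirty in-scope files:"
    printf '%s\n' "$dirty"
    return 1
  fi
}

# Usage (illustrative paths): block Phase 2 until this passes.
# clean_tree_for_scope src/cluster/ src/eval/
```

Out-of-scope dirt (e.g., an untracked notes file elsewhere in the repo) does not trip the gate, matching the "filter against the scope paths" rule above.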

1.2 Build or Validate Measurement Harness

If the user provides a measurement harness (the `measurement.command` already exists):

  1. Run it once via the measurement script:
    bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<measurement.working_directory or .>"
    
  2. Validate the JSON output:
    • Contains keys for all degenerate gate metric names
    • Contains keys for all diagnostic metric names
    • Values are numeric or boolean as expected
  3. If validation fails, report what is missing and ask the user to fix the harness

If agent must build the harness:

  1. Analyze the codebase to understand the current approach and what should be measured
  2. Build an evaluation script (e.g., `evaluate.py`, `evaluate.sh`, or equivalent)
  3. Add the evaluation script path to `scope.immutable` -- the experiment agent must not modify it
  4. Run it once and validate the output
  5. Present the harness and its output to the user for review

1.3 Establish Baseline

Run the measurement harness on the current code.

If stability mode is `repeat`:

  1. Run the harness `repeat_count` times
  2. Aggregate results using the configured aggregation method (median, mean, min, max)
  3. Calculate variance across runs
  4. If variance exceeds `noise_threshold`, warn the user and suggest increasing `repeat_count`
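
For the median aggregation case, the repeat-and-aggregate step reduces to a small filter. The piped numbers here stand in for repeated runs of `measurement.command` emitting one scalar each.

```shell
# Sketch: aggregate repeat_count harness runs with a median to damp noise.
median() {
  sort -n | awk '{ a[NR] = $1 }
    END { print (NR % 2) ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2 }'
}

printf '42\n38\n45\n' | median   # three simulated runs -> 42
```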

Record the baseline in the experiment log:

baseline:
  timestamp: "<current ISO 8601 timestamp>"
  gates:
    <gate_name>: <value>
    ...
  diagnostics:
    <diagnostic_name>: <value>
    ...

If primary type is `judge`, also run the judge evaluation on baseline output to establish the starting judge score.

1.4 Parallelism Readiness Probe

Run the parallelism probe script:

bash scripts/parallel-probe.sh "<project_directory>" "<measurement.command>" "<measurement.working_directory>" <shared_files...>

Read the JSON output. Present any blockers to the user with suggested mitigations. Treat the probe as intentionally narrow: it should inspect the measurement command, the measurement working directory, and explicitly declared shared files, not the entire repository.

1.5 Worktree Budget Check

Count existing worktrees:

bash scripts/experiment-worktree.sh count

If count + `execution.max_concurrent` would exceed 12:

  • Warn the user
  • Suggest cleaning up existing worktrees or reducing `max_concurrent`
  • Do NOT block -- the user may proceed at their own risk

1.6 Write Baseline to Disk (CP-1)

MANDATORY CHECKPOINT. Before presenting results to the user, write the initial experiment log with baseline metrics to disk:

  1. Create the experiment log file at `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml`
  2. Include all required top-level sections from `references/experiment-log-schema.yaml`: `spec`, `run_id`, `started_at`, `baseline`, `experiments`, and `best`
  3. Seed `experiments` as an empty array and seed `best` from the baseline snapshot (use `iteration: 0`, baseline metrics, and baseline judge scores if present) so later phases have a valid current-best state to compare against
  4. Optionally seed `hypothesis_backlog: []` here as well so the log shape is stable before Phase 2 populates it
  5. Verify: read the file back and confirm the required sections are present and the baseline values match
  6. Only THEN present results to the user

1.7 User Approval Gate

Present to the user via the platform question tool:

  • Baseline metrics: all gate values, diagnostic values, and judge scores (if applicable)
  • Experiment log location: show the file path so the user knows where results are saved
  • Parallel readiness: probe results, any blockers, mitigations applied
  • Clean-tree status: confirmed clean
  • Worktree budget: current count and projected usage
  • Judge budget: estimated per-experiment judge cost and the configured `max_total_cost_usd` cap (or an explicit note that spend is uncapped)

Options:

  1. Proceed -- approve baseline and parallel config, move to Phase 2
  2. Adjust spec -- modify spec settings before proceeding
  3. Fix issues -- user needs to resolve blockers first

Do NOT proceed to Phase 2 until the user explicitly approves.

If primary type is `judge` and `max_total_cost_usd` is null, call that out as uncapped spend and require explicit approval before proceeding.

State re-read: After gate approval, re-read the spec and baseline from disk. Do not carry stale in-memory values forward.


Phase 2: Hypothesis Generation

2.1 Analyze Current Approach

Read the code within `scope.mutable` to understand:

  • The current implementation approach
  • Obvious improvement opportunities
  • Constraints and dependencies between components

Optionally dispatch `research:ce-repo-research-analyst` for deeper codebase analysis if the scope is large or unfamiliar.

2.2 Generate Hypothesis List

Generate an initial set of hypotheses. Each hypothesis should have:

  • Description: what to try
  • Category: one of the standard categories (signal-extraction, graph-signals, embedding, algorithm, preprocessing, parameter-tuning, architecture, data-handling) or a domain-specific category
  • Priority: high, medium, or low based on expected impact and feasibility
  • Required dependencies: any new packages or tools needed

Include user-provided hypotheses if any were given as input.

Aim for 10-30 hypotheses in the initial backlog. More can be generated during the loop based on learnings.

2.3 Dependency Pre-Approval

Collect all unique new dependencies across all hypotheses.

If any hypotheses require new dependencies:

  1. Present the full dependency list to the user via the platform question tool
  2. Ask for bulk approval
  3. Mark each hypothesis's `dep_status` as `approved` or `needs_approval`

Hypotheses with unapproved dependencies remain in the backlog but are skipped during batch selection. They are re-presented at wrap-up for potential approval.

2.4 Record Hypothesis Backlog (CP-2)

MANDATORY CHECKPOINT. Write the initial backlog to the experiment log file and verify:

hypothesis_backlog:
  - description: "Remove template boilerplate before embedding"
    category: "signal-extraction"
    priority: high
    dep_status: approved
    required_deps: []
  - description: "Try HDBSCAN clustering algorithm"
    category: "algorithm"
    priority: medium
    dep_status: needs_approval
    required_deps: ["scikit-learn"]

Phase 3: Optimization Loop

This phase repeats in batches until a stopping criterion is met.

3.1 Batch Selection

Select hypotheses for this batch:

  • Build a runnable backlog by excluding hypotheses with `dep_status: needs_approval`
  • If `execution.mode` is `serial`, force `batch_size = 1`
  • Otherwise, `batch_size = min(runnable_backlog_size, execution.max_concurrent)`
  • Prefer diversity: select from different categories when possible
  • Within a category, select by priority (high first)

If the backlog is empty and no new hypotheses can be generated, proceed to Phase 4 (wrap-up). If the backlog is non-empty but no runnable hypotheses remain because everything needs approval or is otherwise blocked, proceed to Phase 4 so the user can approve dependencies instead of spinning forever.
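
The batch-size rule reduces to a few comparisons. This hypothetical `batch_size` helper mirrors the spec fields loosely; argument names are invented for the sketch.

```shell
# Sketch of the batch-size rule: serial forces 1, otherwise take the
# smaller of the runnable backlog and the concurrency cap.
batch_size() {
  mode=$1; runnable=$2; max_concurrent=$3
  if [ "$mode" = serial ]; then
    echo 1
  elif [ "$runnable" -lt "$max_concurrent" ]; then
    echo "$runnable"
  else
    echo "$max_concurrent"
  fi
}

batch_size serial 12 4    # -> 1
batch_size parallel 3 4   # -> 3
batch_size parallel 12 4  # -> 4
```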

3.2 Dispatch Experiments

For each hypothesis in the batch, dispatch according to `execution.mode`. In `serial` mode, run exactly one experiment to completion before selecting the next hypothesis. In `parallel` mode, dispatch the full batch concurrently.

Worktree backend:

  1. Create experiment worktree:
    WORKTREE_PATH=$(bash scripts/experiment-worktree.sh create "<spec_name>" <exp_index> "optimize/<spec_name>" <shared_files...>)  # creates optimize-exp/<spec_name>/exp-<NNN>
    
  2. Apply port parameterization if configured (set env vars for the measurement script)
  3. Fill the experiment prompt template (`references/experiment-prompt-template.md`) with:
    • Iteration number, spec name
    • Hypothesis description and category
    • Current best and baseline metrics
    • Mutable and immutable scope
    • Constraints and approved dependencies
    • Rolling window of last 10 experiments (concise summaries)
  4. Dispatch a subagent with the filled prompt, working in the experiment worktree

Codex backend:

  1. Check environment guard -- do NOT delegate if already inside a Codex sandbox:
    # If these exist, we're already in Codex -- fall back to subagent
    test -n "${CODEX_SANDBOX:-}" || test -n "${CODEX_SESSION_ID:-}" || test ! -w .git
    
  2. Fill the experiment prompt template
  3. Write the filled prompt to a temp file
  4. Dispatch via Codex:
    cat /tmp/optimize-exp-XXXXX.txt | codex exec --skip-git-repo-check - 2>&1
    
  5. Security posture: use the user's selection (ask once per session if not set in spec)

3.3 Collect and Persist Results

Process experiments as they complete — do NOT wait for the entire batch to finish before writing results.

For each completed experiment, immediately:

  1. Run measurement in the experiment's worktree:

    bash scripts/measure.sh "<measurement.command>" <timeout_seconds> "<worktree_path>/<measurement.working_directory or .>" <env_vars...>

    • If stability mode is `repeat`, run the measurement harness `repeat_count` times in that working directory and aggregate the results exactly as in Phase 1 before evaluating gates or ranking the experiment.
    • Use the aggregated metrics as the experiment's score; if variance exceeds `noise_threshold`, record that in learnings so the operator knows the result is noisy.
  2. Write crash-recovery marker — immediately after measurement, write `result.yaml` in the experiment worktree containing the raw metrics. This ensures the measurement is recoverable even if the agent crashes before updating the main log.

  3. Read raw JSON output from the measurement script

  4. Evaluate degenerate gates:

    • For each gate in `metric.degenerate_gates`, parse the operator and threshold
    • Compare the metric value against the threshold
    • If ANY gate fails: mark outcome as `degenerate`, skip judge evaluation, save money
  5. If gates pass AND primary type is `judge`:

    • Read the experiment's output (cluster assignments, search results, etc.)
    • Apply stratified sampling per `metric.judge.stratification` config (using `sample_seed`)
    • Group samples into batches of `metric.judge.batch_size`
    • Fill the judge prompt template (`references/judge-prompt-template.md`) for each batch
    • Dispatch `ceil(sample_size / batch_size)` parallel judge sub-agents
    • Each sub-agent returns structured JSON scores
    • Aggregate scores: compute the configured primary judge field from `metric.judge.scoring.primary` (which should match `metric.primary.name`) plus any `scoring.secondary` values
    • If `singleton_sample > 0`: also dispatch singleton evaluation sub-agents
  6. If gates pass AND primary type is `hard`:

    • Use the metric value directly from the measurement output
  7. IMMEDIATELY append to experiment log on disk (CP-3) — do not defer this to batch evaluation. Write the experiment entry (iteration, hypothesis, outcome, metrics, learnings) to `.context/compound-engineering/ce-optimize/<spec-name>/experiment-log.yaml` right now. Use the transitional outcome `measured` once the experiment has valid metrics but has not yet been compared to the current best. Update the outcome to `kept`, `reverted`, or another terminal state in the evaluation step, but the raw metrics are on disk and safe from context compaction.

  8. VERIFY the write (CP-3 verification) — read the experiment log back from disk and confirm the entry just written is present. If verification fails, retry the write. Do NOT proceed to the next experiment until this entry is confirmed on disk.

Why immediately + verify? The agent's context window is NOT a durable store. Context compaction, session crashes, and restarts are expected during long runs. If results only exist in the agent's memory, they are lost. Karpathy's autoresearch writes to `results.tsv` after every single experiment — this skill must do the same with the experiment log. The verification step catches silent write failures that would otherwise lose data.
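
The per-gate comparison in step 4 amounts to a direction-aware numeric check. This `gate_pass` helper is an illustrative sketch covering the spec's six operators; the gate names and thresholds in the usage lines are invented for the example.

```shell
# Sketch: evaluate one degenerate gate. Operators and thresholds come
# from metric.degenerate_gates in the spec; values here are illustrative.
gate_pass() {
  # $1 measured value, $2 operator, $3 threshold
  awk -v v="$1" -v t="$3" -v op="$2" 'BEGIN {
    if (op == ">=") exit !(v + 0 >= t + 0)
    if (op == "<=") exit !(v + 0 <= t + 0)
    if (op == ">")  exit !(v + 0 >  t + 0)
    if (op == "<")  exit !(v + 0 <  t + 0)
    if (op == "==") exit !(v + 0 == t + 0)
    if (op == "!=") exit !(v + 0 != t + 0)
    exit 1  # unknown operator fails closed
  }'
}

gate_pass 0.93 '<=' 0.95 && echo "solo_pct gate passed"
gate_pass 800 '<=' 500  || echo "max_cluster_size gate FAILED: degenerate"
```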

3.4 Evaluate Batch

After all experiments in the batch have been measured:

  1. Rank experiments by primary metric improvement:

    • For hard metrics: compare to the current best using `metric.primary.direction` (`maximize` means higher is better, `minimize` means lower is better), and require the absolute improvement to exceed `measurement.stability.noise_threshold` before treating it as a real win
    • For judge metrics: compare the configured primary judge score (`metric.judge.scoring.primary` / `metric.primary.name`) to the current best, and require it to exceed `minimum_improvement`
  2. Identify the best experiment that passes all gates and improves the primary metric

  3. If best improves on current best: KEEP

    • Commit the experiment branch first so the winning diff exists as a real commit before any merge or cherry-pick
    • Include only mutable-scope changes in that commit; if no eligible diff remains, treat the experiment as non-improving and revert it
    • Merge the committed experiment branch into the optimization branch
    • Use the message `optimize(<spec-name>): <hypothesis description>` for the experiment commit
    • After the merge succeeds, clean up the winner's experiment worktree and branch; the integrated commit on the optimization branch is the durable artifact
    • This is now the new baseline for subsequent batches
  4. Check file-disjoint runners-up (up to `max_runner_up_merges_per_batch`):

    • For each runner-up that also improved, check file-level disjointness with the kept experiment
    • File-level disjointness: two experiments are disjoint if they modified completely different files. Same file = overlapping, even if different lines.
    • If disjoint: cherry-pick the runner-up onto the new baseline, re-run full measurement
    • If combined measurement is strictly better: keep the cherry-pick (outcome: `runner_up_kept`), then clean up that runner-up's experiment worktree and branch
    • Otherwise: revert the cherry-pick, log as "promising alone but neutral/harmful in combination" (outcome: `runner_up_reverted`), then clean up the runner-up's experiment worktree and branch
    • Stop after first failed combination
  5. Handle deferred deps: experiments that need unapproved dependencies get outcome `deferred_needs_approval`

  6. Revert all others: cleanup worktrees, log as `reverted`
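
The file-level disjointness test in step 4 reduces to an empty intersection of changed-file sets. This sketch assumes each input is a deduplicated newline-separated list, such as `git diff --name-only <base> <branch>` produces; all file names are invented for the example.

```shell
# Sketch: two experiments are disjoint when their changed-file
# lists share no path. Duplicated lines across the two lists are
# exactly the overlap, so uniq -d finds it.
disjoint() {
  printf '%s\n%s\n' "$1" "$2" | sort | uniq -d | grep -q . && return 1
  return 0
}

a='src/embed.py
src/cluster.py'
b='scripts/measure.sh'
c='src/cluster.py
README.md'

disjoint "$a" "$b" && echo "a,b disjoint: safe to cherry-pick"
disjoint "$a" "$c" || echo "a,c overlap on src/cluster.py: skip"
```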

3.5 Update State (CP-4)

MANDATORY CHECKPOINT. By this point, individual experiment results are already on disk (written in step 3.3). This step updates aggregate state and verifies.

  1. Re-read the experiment log from disk — do not trust in-memory state. The log is the source of truth.

  2. Finalize outcomes — update experiment entries from step 3.4 evaluation (mark `kept`, `reverted`, `runner_up_kept`, etc.). Write these outcome updates to disk immediately.

  3. Update the `best` section in the experiment log if a new best was found. Write to disk.

  4. Write strategy digest to `.context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md`:

    • Categories tried so far (with success/failure counts)
    • Key learnings from this batch and overall
    • Exploration frontier: what categories and approaches remain untried
    • Current best metrics and improvement from baseline
  5. Generate new hypotheses based on learnings:

    • Re-read the strategy digest from disk (not from memory)
    • Read the rolling window (last 10 experiments from the log on disk)
    • Do NOT read the full experiment log -- use the digest for broad context
    • Add new hypotheses to the backlog and write the updated backlog to disk
  6. Write updated hypothesis backlog to disk — the backlog section of the experiment log must reflect newly added hypotheses and removed (tested) ones.

CP-4 Verification: Read the experiment log back from disk. Confirm: (a) all experiment outcomes from this batch are finalized, (b) the `best` section reflects the current best, (c) the hypothesis backlog is updated. Read `strategy-digest.md` back and confirm it exists. Only THEN proceed to the next batch or stopping criteria check.

Checkpoint: at this point, all state for this batch is on disk. If the agent crashes and restarts, it can resume from the experiment log without loss.
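The CP-4 verification pass can be sketched as follows. The log structure and function shape are illustrative only — the real schema lives in `references/experiment-log-schema.yaml`:

```python
# Minimal sketch of CP-4 verification (hypothetical log structure).

def verify_cp4(log: dict, batch_ids: list[str], digest_exists: bool) -> list[str]:
    """Return a list of CP-4 violations; empty means safe to proceed."""
    problems = []
    # (a) every experiment in this batch must have a finalized outcome
    finalized = {"kept", "reverted", "runner_up_kept", "runner_up_reverted",
                 "deferred_needs_approval", "degenerate", "error", "timeout"}
    by_id = {e["id"]: e for e in log.get("experiments", [])}
    for exp_id in batch_ids:
        outcome = by_id.get(exp_id, {}).get("outcome")
        if outcome not in finalized:
            problems.append(f"{exp_id}: outcome not finalized ({outcome!r})")
    # (b) the best section must be present
    if not log.get("best"):
        problems.append("best section missing")
    # (c) the strategy digest must exist on disk
    if not digest_exists:
        problems.append("strategy-digest.md missing")
    return problems
```

If `verify_cp4` returns a non-empty list, the agent should repair the log before moving on rather than proceeding with inconsistent state.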

3.6 Check Stopping Criteria

Stop the loop if ANY of these are true:

  • Target reached: `stopping.target_reached` is true, `metric.primary.target` is set, and the primary metric reaches that target according to `metric.primary.direction` (`>=` for `maximize`, `<=` for `minimize`)
  • Max iterations: total experiments run >= `stopping.max_iterations`
  • Max hours: wall-clock time since Phase 3 start >= `stopping.max_hours`
  • Judge budget exhausted: cumulative judge spend >= `metric.judge.max_total_cost_usd` (if set)
  • Plateau: no improvement for `stopping.plateau_iterations` consecutive experiments
  • Manual stop: user interrupts (save state and proceed to Phase 4)
  • Empty backlog: no hypotheses remain and no new ones can be generated

If no stopping criterion is met, proceed to the next batch (step 3.1).

3.7 Cross-Cutting Concerns

Codex failure cascade: Track consecutive Codex delegation failures. After 3 consecutive failures, auto-disable Codex for remaining experiments and fall back to subagent dispatch. Log the switch.
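The failure cascade amounts to a consecutive-failure counter with a one-way switch. A minimal sketch, with the class shape purely illustrative (the threshold of 3 comes from the text above):

```python
# Sketch of the Codex failure-cascade tracker (hypothetical class).

class CodexFallback:
    """Disable Codex dispatch after N consecutive delegation failures."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.codex_enabled = True

    def record(self, success: bool) -> None:
        if success:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.codex_enabled = False  # fall back to subagent dispatch
```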

Error handling: If an experiment's measurement command crashes, times out, or produces malformed output:

  • Log as outcome `error` or `timeout` with the error message
  • Revert the experiment (clean up its worktree)
  • The loop continues with remaining experiments in the batch
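Running a measurement command with this error handling might look like the sketch below; the command list, timeout, and result shape are assumptions, not a prescribed interface:

```python
# Sketch: run one experiment's measurement command, mapping failures
# to the log outcomes described above.
import subprocess

def run_measurement(cmd: list[str], timeout_s: float) -> dict:
    """Run the measurement command; map crash/timeout to log outcomes."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"outcome": "timeout", "error": f"exceeded {timeout_s}s"}
    if proc.returncode != 0:
        return {"outcome": "error", "error": proc.stderr.strip()}
    return {"outcome": "measured", "stdout": proc.stdout}
```

Whatever the result, the caller reverts the worktree on `error`/`timeout` and continues with the rest of the batch.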

Progress reporting: After each batch, report:

  • Batch N of estimated M (based on backlog size)
  • Experiments run this batch and total
  • Current best metric and improvement from baseline
  • Cumulative judge cost (if applicable)

Crash recovery: See the Persistence Discipline section. Per-experiment `result.yaml` markers are written in step 3.3, where individual experiment results are also appended to the log immediately. Batch-level state (outcomes, best, digest) is written in step 3.5. On resume (Phase 0.4), the log on disk is the ground truth — scan for any `result.yaml` markers not yet reflected in the log.
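The resume-time scan can be sketched as follows; the worktree layout (one directory per experiment, named after its id) is an assumption for illustration:

```python
# Sketch: find result.yaml markers on disk that the experiment log
# does not yet reflect (layout and naming are hypothetical).
from pathlib import Path

def unlogged_results(worktrees_root: str, logged_ids: set[str]) -> list[Path]:
    """Return result.yaml files whose experiment id is missing from the log."""
    orphans = []
    for marker in Path(worktrees_root).glob("*/result.yaml"):
        exp_id = marker.parent.name  # assume worktree dir is named after the id
        if exp_id not in logged_ids:
            orphans.append(marker)
    return sorted(orphans)
```

Each orphaned marker is replayed into the log before the loop resumes.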


Phase 4: Wrap-Up

4.1 Present Deferred Hypotheses

If any hypotheses were deferred due to unapproved dependencies:

  1. List them with their dependency requirements
  2. Ask the user whether to approve, skip, or save for a future run
  3. If approved: add to backlog and offer to re-enter Phase 3 for one more round

4.2 Summarize Results

Present a comprehensive summary:

```
Optimization: <spec-name>
Duration: <wall-clock time>
Total experiments: <count>
  Kept: <count> (including <runner_up_kept_count> runner-up merges)
  Reverted: <count>
  Degenerate: <count>
  Errors: <count>
  Deferred: <count>

Baseline -> Final:
  <primary_metric>: <baseline_value> -> <final_value> (<delta>)
  <gate_metrics>: ...
  <diagnostics>: ...

Judge cost: $<total_judge_cost_usd> (if applicable)

Key improvements:
  1. <kept experiment 1 hypothesis> (+<delta>)
  2. <kept experiment 2 hypothesis> (+<delta>)
  ...
```

4.3 Preserve and Offer Next Steps

The optimization branch (`optimize/<spec-name>`) is preserved with all commits from kept experiments. The experiment log and strategy digest remain in local `.context/...` scratch space for resume and audit on this machine only; they do not travel with the branch because `.context/` is gitignored.

Present post-completion options via the platform question tool:

  1. Run `/ce-code-review` on the cumulative diff (baseline to final). Load the `ce-code-review` skill with `mode:autofix` on the optimization branch.
  2. Run `/ce-compound` to document the winning strategy as an institutional learning.
  3. Create a PR from the optimization branch to the default branch.
  4. Continue with more experiments: re-enter Phase 3 with the current state, re-reading state from disk first.
  5. Done — leave the optimization branch for manual review.

4.4 Cleanup

Clean up scratch space:

```bash
# Keep the experiment log for local resume/audit on this machine
# Remove temporary batch artifacts
rm -f .context/compound-engineering/ce-optimize/<spec-name>/strategy-digest.md
```

Do NOT delete the experiment log if the user may resume locally or wants a local audit trail. If they need a durable shared artifact, summarize or export the results into a tracked path before cleanup. Do NOT delete experiment worktrees that are still being referenced.