Auto-claude-code-research-in-sleep ablation-planner

Use when main results pass result-to-claim (`claim_supported = yes` or `partial`) and ablation studies are needed for paper submission. A secondary Codex agent designs ablations from a reviewer's perspective; the local executor reviews feasibility and implements.

install
source · Clone the upstream repo
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/skills-codex/ablation-planner" ~/.claude/skills/wanshuiyin-auto-claude-code-research-in-sleep-ablation-planner-05aab9 && rm -rf "$T"
manifest: skills/skills-codex/ablation-planner/SKILL.md
source content

Ablation Planner

Systematically design ablation studies that answer the questions reviewers will ask. The reviewer agent leads the design; the local executor reviews feasibility and implements.

Context: $ARGUMENTS

Constants

  • REVIEWER_MODEL = gpt-5.4 - Used via a secondary Codex agent for reviewer-style ablation design.

When to Use

  • Main results pass /result-to-claim with claim_supported = yes or partial
  • The user explicitly requests ablation planning
  • /auto-review-loop identifies missing ablations

Workflow

Step 1: Prepare Context

Read available project files to build the full picture (a gathering sketch follows this list):

  • Method description and components (from docs/research_contract.md, project notes, or method docs)
  • Current experiment results (from EXPERIMENT_LOG.md, EXPERIMENT_TRACKER.md, or W&B)
  • Confirmed and intended claims (from /result-to-claim output or project notes)
  • Available compute resources (from server notes, run configs, or user-provided budget)
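
A minimal sketch of this gathering step in Python, assuming the default file names listed above; the claims and compute paths are hypothetical placeholders for wherever the project actually keeps that information:

```python
from pathlib import Path

# Sketch of Step 1: collect whatever project context exists into one dict.
# File names follow the defaults listed above; "claims" and "compute" are
# hypothetical paths - adjust them to the actual project layout.
CONTEXT_FILES = {
    "method": "docs/research_contract.md",
    "experiments": "EXPERIMENT_LOG.md",
    "tracker": "EXPERIMENT_TRACKER.md",
    "claims": "docs/claims.md",         # hypothetical: /result-to-claim output
    "compute": "docs/server_notes.md",  # hypothetical: compute budget notes
}

def gather_context(root: str = ".") -> dict[str, str]:
    """Read each context file that exists; missing files are simply skipped."""
    context = {}
    for key, rel_path in CONTEXT_FILES.items():
        path = Path(root) / rel_path
        if path.is_file():
            context[key] = path.read_text(encoding="utf-8")
    return context
```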

Step 2: Secondary Codex Designs Ablations

spawn_agent:
  model: REVIEWER_MODEL
  reasoning_effort: xhigh
  message: |
    You are a rigorous ML reviewer planning ablation studies.
    Given this method and results, design ablations that:

    1. Isolate the contribution of each novel component
    2. Answer questions reviewers will definitely ask
    3. Test sensitivity to key hyperparameters
    4. Compare against natural alternative design choices

    Method: [description from project files]
    Components: [list of removable or replaceable components]
    Current results: [key metrics from experiments]
    Claims: [what we claim and current evidence]

    For each ablation, specify:
    - name: what to change (for example, "remove module X", "replace Y with Z")
    - what_it_tests: the specific question this answers
    - expected_if_component_matters: what we predict if the component is important
    - priority: 1 (must-run) to 5 (nice-to-have)

    Also provide:
    - coverage_assessment: what reviewer questions these ablations answer
    - unnecessary_ablations: experiments that seem useful but will not add insight
    - suggested_order: run order optimized for maximum early information
    - estimated_compute: total GPU-hours estimate

If delegation is unavailable, generate the same plan locally and mark it [pending external review].
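
A minimal sketch of filling the bracketed placeholders in the prompt above from the Step 1 context; the placeholder-to-key mapping (and the components key) is an assumption, and how the filled prompt is dispatched, via spawn_agent or the local fallback, is left to whatever tooling is available:

```python
# Sketch: substitute the bracketed placeholders in the reviewer prompt with
# text gathered in Step 1. The mapping below is an assumption - the context
# keys must match whatever gather_context() actually collected.
PLACEHOLDERS = {
    "[description from project files]": "method",
    "[list of removable or replaceable components]": "components",
    "[key metrics from experiments]": "experiments",
    "[what we claim and current evidence]": "claims",
}

def fill_reviewer_prompt(template: str, context: dict[str, str]) -> str:
    prompt = template
    for marker, key in PLACEHOLDERS.items():
        prompt = prompt.replace(marker, context.get(key, "(not available)"))
    return prompt
```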

Step 3: Parse Ablation Plan

Normalize the response into a structured format:

## Ablation Plan

### Component Ablations (highest priority)
| # | Name | What It Tests | Expected If Matters | Priority |
|---|------|---------------|---------------------|----------|
| 1 | remove module X | contribution of X | performance drops on metric Y | 1 |
| 2 | replace X with simpler Z | value of learned vs fixed | drops, especially on dataset A | 2 |

### Hyperparameter Sensitivity
| # | Parameter | Values to Test | What It Tests | Priority |
|---|-----------|----------------|---------------|----------|
| 3 | lambda | [0.01, 0.1, 1.0] | sensitivity to regularization | 3 |

### Design Choice Comparisons
| # | Name | What It Tests | Priority |
|---|------|---------------|----------|
| 4 | joint vs separate matching | whether joint adds value | 4 |

### Coverage Assessment
[What reviewer questions these ablations answer]

### Unnecessary Ablations
[Experiments that seem useful but will not add insight - skip these]

### Run Order
[Optimized for maximum early information]

### Estimated Compute
[Total GPU-hours]
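
One way to hold the normalized plan, sketched as plain Python dataclasses; the field names mirror the attributes the reviewer prompt asks for, while kind, config_only, and gpu_hours are assumed bookkeeping fields used by the feasibility check in Step 4:

```python
from dataclasses import dataclass, field

@dataclass
class Ablation:
    name: str                         # e.g. "remove module X"
    what_it_tests: str
    expected_if_component_matters: str
    priority: int                     # 1 (must-run) .. 5 (nice-to-have)
    kind: str = "component"           # "component" | "hyperparameter" | "design_choice"
    config_only: bool = True          # True if no code changes are needed
    gpu_hours: float = 0.0            # rough per-run compute estimate

@dataclass
class AblationPlan:
    ablations: list[Ablation] = field(default_factory=list)
    coverage_assessment: str = ""
    unnecessary_ablations: list[str] = field(default_factory=list)
    suggested_order: list[str] = field(default_factory=list)
    estimated_compute_gpu_hours: float = 0.0
```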

Step 4: Review Feasibility

Before running anything, check (a budget-check sketch follows this list):

  • Compute budget - Can you afford all ablations with available GPUs?
  • Code changes - Which ablations need code modifications vs config-only changes?
  • Dependencies - Which ablations can run in parallel?
  • Cuts - If budget is tight, propose removing lower-priority ablations and ask the reviewer agent to re-prioritize when possible
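
A minimal sketch of the budget check and cut proposal, using the AblationPlan sketch from Step 3; the GPU-hour budget is whatever Step 1 surfaced, and the print stands in for reporting cuts back to the reviewer agent:

```python
def check_feasibility(plan: AblationPlan, budget_gpu_hours: float) -> list[Ablation]:
    """Keep ablations in priority order (config-only first within a priority
    level) until the budget runs out; anything cut is proposed for
    re-prioritization rather than dropped silently."""
    kept, spent = [], 0.0
    for ab in sorted(plan.ablations, key=lambda a: (a.priority, not a.config_only)):
        if spent + ab.gpu_hours <= budget_gpu_hours:
            kept.append(ab)
            spent += ab.gpu_hours
    cut = [ab.name for ab in plan.ablations if ab not in kept]
    if cut:
        print("Over budget - ask the reviewer agent to re-prioritize:", cut)
    return kept
```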

Step 5: Implement and Run

  1. Create configs or scripts for each ablation (config-only changes first)
  2. Smoke test each ablation before the full run (see the runner sketch after this list)
  3. Run in the suggested order, using descriptive names (for example, ablation-no-module-X)
  4. Track results in EXPERIMENT_LOG.md
  5. After all ablations complete, update findings.md with insights
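
A minimal runner sketch covering steps 2-4 above; train.py, its flags, and the configs/ path convention are hypothetical placeholders for the project's real launch command:

```python
import datetime
import subprocess

def run_ablation(ab: Ablation) -> None:
    """Smoke-test one ablation, then run it fully and append an entry to
    EXPERIMENT_LOG.md. The launcher command and flags are placeholders."""
    run_name = f"ablation-{ab.name.replace(' ', '-')}"
    cmd = ["python", "train.py", "--config", f"configs/{run_name}.yaml"]
    subprocess.run(cmd + ["--max-steps", "10"], check=True)     # smoke test
    subprocess.run(cmd + ["--run-name", run_name], check=True)  # full run
    with open("EXPERIMENT_LOG.md", "a", encoding="utf-8") as log:
        log.write(
            f"\n## {run_name} ({datetime.date.today()})\n"
            f"- tests: {ab.what_it_tests}\n"
            f"- expected if component matters: {ab.expected_if_component_matters}\n"
            f"- result: (fill in after analysis)\n"
        )
```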

Rules

  • The reviewer agent leads the design. Do not pre-filter or bias the ablation list before external review sees it. The reviewer thinks like a reviewer; the local executor thinks like an engineer.
  • Every ablation must have a clear what_it_tests and expected_if_component_matters. No "just try it" experiments.
  • Config-only ablations take priority over those needing code changes (faster, less error-prone).
  • If total compute exceeds budget, propose cuts and ask for re-prioritization - do not silently drop ablations.
  • Component ablations (remove or replace) take priority over hyperparameter sweeps.
  • Do not generate ablations for components identical to the baseline (no-op ablations).
  • Record all ablation results in EXPERIMENT_LOG.md, including negative results (for example, component removal had no effect).
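
A small sketch of mechanically enforcing a few of the rules above before anything runs; a real no-op check would diff the ablation config against the baseline, so the name match here is only a crude proxy:

```python
def enforce_rules(plan: AblationPlan) -> AblationPlan:
    """Drop ablations without a clear what_it_tests / expected_if_component_matters,
    drop obvious no-ops, and order by priority with config-only ablations ahead
    of those needing code changes at equal priority."""
    plan.ablations = [
        ab for ab in plan.ablations
        if ab.what_it_tests and ab.expected_if_component_matters
        and "baseline" not in ab.name.lower()   # crude proxy for a no-op ablation
    ]
    plan.ablations.sort(key=lambda ab: (ab.priority, not ab.config_only))
    return plan
```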