Auto-claude-code-research-in-sleep ablation-planner
Use when main results pass `/result-to-claim` (`claim_supported = yes` or `partial`) and ablation studies are needed for paper submission. A secondary Codex agent designs ablations from a reviewer's perspective; the local executor reviews feasibility and implements.
install
source · Clone the upstream repo
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/skills-codex/ablation-planner" ~/.claude/skills/wanshuiyin-auto-claude-code-research-in-sleep-ablation-planner-05aab9 && rm -rf "$T"
manifest:
skills/skills-codex/ablation-planner/SKILL.md · source content
Ablation Planner
Systematically design ablation studies that answer the questions reviewers will ask. The reviewer agent leads the design; the local executor reviews feasibility and implements.
Context: $ARGUMENTS
Constants
- REVIEWER_MODEL = `gpt-5.4`
- Used via a secondary Codex agent for reviewer-style ablation design.
When to Use
- Main results pass `/result-to-claim` with `claim_supported = yes` or `partial`
- The user explicitly requests ablation planning
- `/auto-review-loop` identifies missing ablations
Workflow
Step 1: Prepare Context
Read available project files to build the full picture (a context-gathering sketch follows the list):
- Method description and components (from `docs/research_contract.md`, project notes, or method docs)
- Current experiment results (from `EXPERIMENT_LOG.md`, `EXPERIMENT_TRACKER.md`, or W&B)
- Confirmed and intended claims (from `/result-to-claim` output or project notes)
- Available compute resources (from server notes, run configs, or user-provided budget)
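A minimal context-gathering sketch in Python, using the file names listed above; the `gather_context` helper and its return shape are illustrative, not part of the skill:

```python
from pathlib import Path

# Context sources named above; adjust the paths to your project layout.
CONTEXT_FILES = [
    "docs/research_contract.md",  # method description and components
    "EXPERIMENT_LOG.md",          # current experiment results
    "EXPERIMENT_TRACKER.md",      # alternative results tracker
    "findings.md",                # confirmed and intended claims
]

def gather_context(root: str = ".") -> dict[str, str]:
    """Read whichever context files exist and return {relative path: contents}."""
    found = {}
    for rel in CONTEXT_FILES:
        path = Path(root) / rel
        if path.is_file():
            found[rel] = path.read_text(encoding="utf-8")
    return found

if __name__ == "__main__":
    context = gather_context()
    missing = [f for f in CONTEXT_FILES if f not in context]
    print(f"loaded {len(context)} context files; missing: {missing or 'none'}")
```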
Step 2: Secondary Codex Designs Ablations
```yaml
spawn_agent:
  model: REVIEWER_MODEL
  reasoning_effort: xhigh
  message: |
    You are a rigorous ML reviewer planning ablation studies.
    Given this method and results, design ablations that:
    1. Isolate the contribution of each novel component
    2. Answer questions reviewers will definitely ask
    3. Test sensitivity to key hyperparameters
    4. Compare against natural alternative design choices

    Method: [description from project files]
    Components: [list of removable or replaceable components]
    Current results: [key metrics from experiments]
    Claims: [what we claim and current evidence]

    For each ablation, specify:
    - name: what to change (for example, "remove module X", "replace Y with Z")
    - what_it_tests: the specific question this answers
    - expected_if_component_matters: what we predict if the component is important
    - priority: 1 (must-run) to 5 (nice-to-have)

    Also provide:
    - coverage_assessment: what reviewer questions these ablations answer
    - unnecessary_ablations: experiments that seem useful but will not add insight
    - suggested_order: run order optimized for maximum early information
    - estimated_compute: total GPU-hours estimate
```
If delegation is unavailable, generate the same plan locally and mark it `[pending external review]`; a fallback sketch follows.
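A hedged sketch of the delegation-with-fallback logic. `spawn_agent` here is a stand-in for whatever agent-spawning mechanism your environment exposes, not a real API; the `[pending external review]` tag matches the rule above:

```python
REVIEWER_MODEL = "gpt-5.4"

# Placeholder for the full reviewer message from the spawn_agent block above.
REVIEWER_PROMPT = "You are a rigorous ML reviewer planning ablation studies. ..."

def design_ablations(spawn_agent=None) -> tuple[str, bool]:
    """Return (plan_text, externally_reviewed).

    If no delegation mechanism is available, draft the same plan locally
    and flag it so downstream steps know it still needs external review.
    """
    if spawn_agent is not None:
        plan = spawn_agent(
            model=REVIEWER_MODEL,
            reasoning_effort="xhigh",
            message=REVIEWER_PROMPT,
        )
        return plan, True
    local_plan = "(locally drafted ablation plan)\n\n[pending external review]"
    return local_plan, False
```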
Step 3: Parse Ablation Plan
Normalize the response into a structured format:
```markdown
## Ablation Plan

### Component Ablations (highest priority)

| # | Name | What It Tests | Expected If Matters | Priority |
|---|------|---------------|---------------------|----------|
| 1 | remove module X | contribution of X | performance drops on metric Y | 1 |
| 2 | replace X with simpler Z | value of learned vs fixed | drops, especially on dataset A | 2 |

### Hyperparameter Sensitivity

| # | Parameter | Values to Test | What It Tests | Priority |
|---|-----------|----------------|---------------|----------|
| 3 | lambda | [0.01, 0.1, 1.0] | sensitivity to regularization | 3 |

### Design Choice Comparisons

| # | Name | What It Tests | Priority |
|---|------|---------------|----------|
| 4 | joint vs separate matching | whether joint adds value | 4 |

### Coverage Assessment
[What reviewer questions these ablations answer]

### Unnecessary Ablations
[Experiments that seem useful but will not add insight - skip these]

### Run Order
[Optimized for maximum early information]

### Estimated Compute
[Total GPU-hours]
```
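To make the normalization concrete: a sketch assuming the reviewer returns a JSON list whose fields match those requested in Step 2 (the field names come from the prompt; the parsing code itself is illustrative):

```python
import json
from dataclasses import dataclass

@dataclass
class Ablation:
    name: str
    what_it_tests: str
    expected_if_component_matters: str
    priority: int  # 1 (must-run) to 5 (nice-to-have)

def parse_plan(raw: str) -> list[Ablation]:
    """Parse a JSON list of ablation records and put must-run items first."""
    return sorted(
        (Ablation(**record) for record in json.loads(raw)),
        key=lambda a: a.priority,
    )

def to_table_row(index: int, a: Ablation) -> str:
    """Render one row of the Component Ablations table above."""
    return (f"| {index} | {a.name} | {a.what_it_tests} "
            f"| {a.expected_if_component_matters} | {a.priority} |")
```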
Step 4: Review Feasibility
Before running anything, check the following (a budget-check sketch follows the list):
- Compute budget - Can you afford all ablations with available GPUs?
- Code changes - Which ablations need code modifications vs config-only changes?
- Dependencies - Which ablations can run in parallel?
- Cuts - If budget is tight, propose removing lower-priority ablations and ask the reviewer agent to re-prioritize when possible
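For the compute-budget check, a minimal sketch that keeps must-run ablations first and collects the rest as a cut proposal. The `estimated_gpu_hours` field is assumed to come from the reviewer's plan; per the rules, cuts are proposed back for re-prioritization, never applied silently:

```python
def propose_cuts(plan: list[dict], budget_hours: float) -> tuple[list[dict], list[dict]]:
    """Greedily keep the highest-priority ablations within the GPU-hour budget.

    Each plan entry is assumed to carry 'name', 'priority' (1 = must-run),
    and 'estimated_gpu_hours'. Returns (keep, cut); the cut list is a
    proposal for the reviewer agent, not a silent drop.
    """
    keep, cut, used = [], [], 0.0
    for item in sorted(plan, key=lambda x: x["priority"]):
        if used + item["estimated_gpu_hours"] <= budget_hours:
            keep.append(item)
            used += item["estimated_gpu_hours"]
        else:
            cut.append(item)
    return keep, cut
```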
Step 5: Implement and Run
- Create configs or scripts for each ablation (config-only changes first)
- Smoke test each ablation before the full run
- Run in the suggested order, using descriptive names (for example, `ablation-no-module-X`); see the run-loop sketch after this list
- Track results in `EXPERIMENT_LOG.md`
- After all ablations complete, update `findings.md` with insights
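A sketch of the smoke-test-then-run loop. The `train.py` entry point and its `--config`, `--run-name`, and `--max-steps` flags are assumptions; substitute your project's actual training command:

```python
import subprocess

def run_ablation(name: str, config_path: str, smoke_steps: int = 50) -> None:
    """Smoke-test an ablation with a few steps, then launch the full run."""
    base = ["python", "train.py", "--config", config_path, "--run-name", name]
    # Smoke test first: fail fast on config or code errors before
    # spending GPU-hours on a broken run.
    subprocess.run(base + ["--max-steps", str(smoke_steps)], check=True)
    # Full run under the same descriptive name, e.g. "ablation-no-module-X".
    subprocess.run(base, check=True)
```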
Rules
- The reviewer agent leads the design. Do not pre-filter or bias the ablation list before external review sees it. The reviewer thinks like a reviewer; the local executor thinks like an engineer.
- Every ablation must have a clear `what_it_tests` and `expected_if_component_matters`. No "just try it" experiments.
- Config-only ablations take priority over those needing code changes (faster, less error-prone).
- If total compute exceeds budget, propose cuts and ask for re-prioritization - do not silently drop ablations.
- Component ablations (remove or replace) take priority over hyperparameter sweeps.
- Do not generate ablations for components identical to the baseline (no-op ablations); a validation sketch follows the list.
- Record all ablation results in `EXPERIMENT_LOG.md`, including negative results (for example, component removal had no effect).
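Several of these rules can be checked mechanically before anything runs. A sketch, again assuming plan entries carry the fields requested in Step 2 and that each ablation has a resolved config (`baseline_config` and `configs` are illustrative names):

```python
def validate_plan(plan: list[dict], baseline_config: dict, configs: dict) -> list[str]:
    """Return rule violations: missing rationale or no-op ablations.

    `configs` maps ablation name -> resolved config dict; an ablation whose
    config equals the baseline changes nothing and must be dropped.
    """
    problems = []
    for item in plan:
        if not item.get("what_it_tests") or not item.get("expected_if_component_matters"):
            problems.append(f"{item['name']}: missing rationale ('just try it' is not allowed)")
        if configs.get(item["name"]) == baseline_config:
            problems.append(f"{item['name']}: identical to baseline (no-op ablation)")
    return problems
```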