Auto-claude-code-research-in-sleep experiment-plan
Turn a refined research proposal or method idea into a detailed, claim-driven experiment roadmap. Use after `research-refine`, or when the user asks for a detailed experiment plan, ablation matrix, evaluation protocol, run order, compute budget, or paper-ready validation that supports the core problem, novelty, simplicity, and any LLM / VLM / Diffusion / RL-based contribution.
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep
T=$(mktemp -d) && git clone --depth=1 https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/skills-codex/experiment-plan" ~/.claude/skills/wanshuiyin-auto-claude-code-research-in-sleep-experiment-plan-61812d && rm -rf "$T"
skills/skills-codex/experiment-plan/SKILL.md
Experiment Plan: Claim-Driven, Paper-Oriented Validation
Refine and concretize: $ARGUMENTS
Overview
Use this skill once the method is stable enough that the next question is: what exact experiments should we run, and in what order, to defend the paper? If the user wants the full chain in one request, prefer /research-refine-pipeline.
The goal is not to generate a giant benchmark wishlist. The goal is to turn a proposal into a claim -> evidence -> run order roadmap that supports four things:
- the method actually solves the anchored problem
- the dominant contribution is real and focused
- the method is elegant enough that extra complexity is unnecessary
- any frontier-model-era component is genuinely useful, not decorative
Constants
- OUTPUT_DIR = refine-logs/ — Default destination for experiment planning artifacts.
- MAX_PRIMARY_CLAIMS = 2 — Prefer one dominant claim plus one supporting claim.
- MAX_CORE_BLOCKS = 5 — Keep the must-run experimental story compact.
- MAX_BASELINE_FAMILIES = 3 — Prefer a few strong baselines over many weak ones.
- DEFAULT_SEEDS = 3 — Use 3 seeds when stochastic variance matters and budget allows.
Workflow
Phase 0: Load the Proposal Context
Read the most relevant existing files first if they exist:
- refine-logs/FINAL_PROPOSAL.md
- refine-logs/REVIEW_SUMMARY.md
- refine-logs/REFINEMENT_REPORT.md
Extract:
- Problem Anchor
- Dominant contribution
- Optional supporting contribution
- Critical reviewer concerns
- Data / compute / timeline constraints
- Which frontier primitive is central, if any
If these files do not exist, derive the same information from the user's prompt.
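The Phase 0 context check can be sketched as a small shell loop over the file names listed above (the scratch-directory setup here is only to make the demo self-contained):

```shell
# Demo in a scratch directory: one context file present, two missing.
demo=$(mktemp -d)
mkdir -p "$demo/refine-logs"
touch "$demo/refine-logs/FINAL_PROPOSAL.md"

# Report which proposal context files are available before planning.
missing=0
for f in FINAL_PROPOSAL.md REVIEW_SUMMARY.md REFINEMENT_REPORT.md; do
  if [ -f "$demo/refine-logs/$f" ]; then
    echo "found: refine-logs/$f"
  else
    echo "missing: refine-logs/$f"
    missing=$((missing + 1))
  fi
done
```

If any file is missing, fall back to the user's prompt for that piece of context.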
Phase 1: Freeze the Paper Claims
Before proposing experiments, write down the claims that must be defended.
Use this structure:
- Primary claim: the main mechanism-level contribution
- Supporting claim: optional, only if it directly strengthens the main paper story
- Anti-claim to rule out: e.g. "the gain only comes from more parameters," "the gain only comes from a larger search space," or "the modern component is just decoration"
- Minimum convincing evidence: what would make each claim believable to a strong reviewer?
Do not exceed MAX_PRIMARY_CLAIMS unless the paper truly has multiple inseparable claims.
Phase 2: Build the Experimental Storyline
Design the paper around a compact set of experiment blocks. Default to the following blocks and delete any that are not needed:
- Main anchor result — does the method solve the actual bottleneck?
- Novelty isolation — does the dominant contribution itself matter?
- Simplicity / elegance check — can a bigger or more fragmented version be avoided?
- Frontier necessity check — if an LLM / VLM / Diffusion / RL-era component is central, is it actually the right tool?
- Failure analysis or qualitative diagnosis — what does the method still miss?
For each block, decide whether it belongs in:
- Main paper — essential to defend the core claims
- Appendix — useful but non-blocking
- Cut — interesting, but not worth the paper budget
Prefer one strong baseline family over many weak baselines. If a stronger modern baseline exists, use it instead of padding the list.
Phase 3: Specify Each Experiment Block
For every kept block, fully specify:
- Claim tested
- Why this block exists
- Dataset / split / task
- Compared systems: strongest baselines, ablations, and variants only
- Metrics: decisive metrics first, secondary metrics second
- Setup details: backbone, frozen vs trainable parts, key hyperparameters, training budget, seeds
- Success criterion: what outcome would count as convincing evidence?
- Failure interpretation: if the result is negative, what does it mean?
- Table / figure target: where this result should appear in the paper
Special rules:
- A simplicity check should usually compare the final method against either an overbuilt variant or a tempting extra component that the paper intentionally rejects.
- A frontier necessity check should usually compare the chosen modern primitive against the strongest plausible simpler or older alternative.
- If the proposal is intentionally non-frontier, say so explicitly and skip the frontier block instead of forcing one.
Phase 4: Turn the Plan Into an Execution Order
Build a realistic run order so the user knows what to do first.
Use this milestone structure:
- Sanity stage — data pipeline, metric correctness, one quick overfit or toy split
- Baseline stage — reproduce the strongest baseline(s)
- Main method stage — run the final method on the primary setting
- Decision stage — run the decisive ablations for novelty, simplicity, and frontier necessity
- Polish stage — robustness, qualitative figures, appendix extras
For each milestone, estimate:
- compute cost
- expected turnaround time
- stop / go decision gate
- risk and mitigation
Separate must-run from nice-to-have experiments.
Phase 5: Write the Outputs
Step 5.1: Write refine-logs/EXPERIMENT_PLAN.md
Use this structure:
# Experiment Plan

**Problem**: [problem]
**Method Thesis**: [one-sentence thesis]
**Date**: [today]

## Claim Map

| Claim | Why It Matters | Minimum Convincing Evidence | Linked Blocks |
|-------|-----------------|-----------------------------|---------------|
| C1 | ... | ... | B1, B2 |

## Paper Storyline

- Main paper must prove:
- Appendix can support:
- Experiments intentionally cut:

## Experiment Blocks

### Block 1: [Name]

- Claim tested:
- Why this block exists:
- Dataset / split / task:
- Compared systems:
- Metrics:
- Setup details:
- Success criterion:
- Failure interpretation:
- Table / figure target:
- Priority: MUST-RUN / NICE-TO-HAVE

### Block 2: [Name]

...

## Run Order and Milestones

| Milestone | Goal | Runs | Decision Gate | Cost | Risk |
|-----------|------|------|---------------|------|------|
| M0 | ... | ... | ... | ... | ... |

## Compute and Data Budget

- Total estimated GPU-hours:
- Data preparation needs:
- Human evaluation needs:
- Biggest bottleneck:

## Risks and Mitigations

- [Risk]:
- [Mitigation]:

## Final Checklist

- [ ] Main paper tables are covered
- [ ] Novelty is isolated
- [ ] Simplicity is defended
- [ ] Frontier contribution is justified or explicitly not claimed
- [ ] Nice-to-have runs are separated from must-run runs
Step 5.2: Write refine-logs/EXPERIMENT_TRACKER.md
Use this structure:
# Experiment Tracker

| Run ID | Milestone | Purpose | System / Variant | Split | Metrics | Priority | Status | Notes |
|--------|-----------|---------|------------------|-------|---------|----------|--------|-------|
| R001 | M0 | sanity | ... | ... | ... | MUST | TODO | ... |
Keep the tracker compact and execution-oriented.
Step 5.3: Present a Brief Summary to the User
Experiment plan ready.

Must-run blocks:
- [Block 1]
- [Block 2]

Highest-risk assumption:
- [risk]

First three runs to launch:
1. [run]
2. [run]
3. [run]

Plan file: refine-logs/EXPERIMENT_PLAN.md
Tracker file: refine-logs/EXPERIMENT_TRACKER.md
Output Protocols
Follow these shared protocols for all output files:
- Output Versioning Protocol — write timestamped file first, then copy to fixed name
- Output Manifest Protocol — log every output to MANIFEST.md
- Output Language Protocol — respect the project's language setting
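The versioning step above can be sketched as follows. The exact timestamp format is an assumption; the shared protocol files define the actual convention, and the temp directory here only keeps the demo self-contained:

```shell
# Write the timestamped copy first, then copy it to the fixed name.
out=$(mktemp -d)
ts=$(date +%Y%m%d-%H%M%S)
versioned="$out/EXPERIMENT_PLAN.$ts.md"
printf '# Experiment Plan\n' > "$versioned"
cp "$versioned" "$out/EXPERIMENT_PLAN.md"
echo "wrote: $versioned"
```

This keeps every historical version on disk while tools that expect the fixed name (refine-logs/EXPERIMENT_PLAN.md) keep working.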
Key Rules
- Large file handling: If the Write tool fails due to file size, immediately retry using Bash (`cat << 'EOF' > file`) to write in chunks. Do NOT ask the user for permission — just do it silently.
- Every experiment must defend a claim. If it does not change a reviewer's belief, cut it.
- Prefer a compact paper story. Design the main table first, then add only the ablations that defend it.
- Defend simplicity explicitly. If complexity is a concern, include a deletion study or a stronger-but-bloated variant comparison.
- Defend frontier choices explicitly. If a modern primitive is central, prove why it is better than the strongest simpler alternative.
- Prefer strong baselines over long baseline lists. A short, credible comparison set is better than a padded one.
- Separate must-run from nice-to-have. Do not let appendix ideas delay the core paper evidence.
- Reuse proposal constraints. Do not invent unrealistic budgets or data assumptions.
- Do not fabricate results. Plan evidence; do not claim evidence.
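The large-file fallback in the first rule can be sketched with appended heredoc chunks (the file name and contents here are placeholders, not the real template):

```shell
# Chunked-write fallback: write the first chunk, then append further chunks.
plan=$(mktemp -d)/EXPERIMENT_PLAN.md
cat << 'EOF' > "$plan"
# Experiment Plan
## Claim Map
EOF
cat << 'EOF' >> "$plan"
## Run Order and Milestones
EOF
```

The quoted `'EOF'` delimiter prevents the shell from expanding `$`-variables inside the template text, which matters for placeholders like `$ARGUMENTS`.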
Composing with Other Skills
- /research-refine-pipeline -> one-shot method + experiment planning
- /research-refine -> method and claim refinement
- /experiment-plan -> detailed experiment roadmap
- /run-experiment -> execute the runs
- /auto-review-loop -> react to results and iterate on the paper