Auto-claude-code-research-in-sleep auto-paper-improvement-loop
Autonomously improve a generated paper via GPT-5.4 xhigh review → implement fixes → recompile, for 2 rounds. Use when user says "改论文" ("improve the paper"), "improve paper", "论文润色循环" ("paper polishing loop"), "auto improve", or wants to iteratively polish a generated paper.
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep
T=$(mktemp -d) && git clone --depth=1 https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/skills-codex/auto-paper-improvement-loop" ~/.claude/skills/wanshuiyin-auto-claude-code-research-in-sleep-auto-paper-improvement-loop-d5c330 && rm -rf "$T"
skills/skills-codex/auto-paper-improvement-loop/SKILL.md
Auto Paper Improvement Loop: Review → Fix → Recompile
Autonomously improve the paper at: $ARGUMENTS
Context
This skill is designed to run after Workflow 3 (/paper-plan → /paper-figure → /paper-write → /paper-compile). It takes a compiled paper and iteratively improves it through external LLM review.
Unlike /auto-review-loop (which iterates on research: running experiments, collecting data, rewriting narrative), this skill iterates on paper writing quality: fixing theoretical inconsistencies, softening overclaims, adding missing content, and improving presentation.
Constants
- MAX_ROUNDS = 2: Two rounds of review→fix→recompile. Empirically, Round 1 catches structural issues (4→6/10) and Round 2 catches remaining presentation issues (6→7/10), with diminishing returns beyond 2 rounds for writing-only improvements.
- REVIEWER_MODEL = gpt-5.4: Model used via a secondary Codex agent for paper review.
- REVIEW_LOG = PAPER_IMPROVEMENT_LOG.md: Cumulative log of all rounds, stored in the paper directory.
- HUMAN_CHECKPOINT = false: When true, pause after each round's review and present score + weaknesses to the user. The user can approve fixes, provide custom modification instructions, skip specific fixes, or stop early. When false (default), runs fully autonomously.
💡 Override:
/auto-paper-improvement-loop "paper/" — human checkpoint: true
Inputs
- Compiled paper: paper/main.pdf + LaTeX source files
- All section .tex files: concatenated for the review prompt
State Persistence (Compact Recovery)
If the context window fills up mid-loop, Codex auto-compacts. To recover, this skill writes PAPER_IMPROVEMENT_STATE.json after each round:
{ "current_round": 1, "agent_id": "019ce736-...", "last_score": 6, "status": "in_progress", "timestamp": "2026-03-13T21:00:00" }
On startup: if PAPER_IMPROVEMENT_STATE.json exists with "status": "in_progress" AND timestamp is within 24 hours, read it + PAPER_IMPROVEMENT_LOG.md to recover context, then resume from the next round. Otherwise (file absent, "status": "completed", or older than 24 hours), start fresh.
After each round: overwrite the state file. On completion: set "status": "completed".
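The startup rule above can be sketched in shell. This is a minimal sketch, not the skill's actual implementation: the function names (`write_state`, `state_field`, `startup_action`) are hypothetical, the JSON is read with `sed` on the assumption that the file stays flat like the example, and `date -d` assumes GNU or BusyBox `date`.

```shell
#!/bin/sh
# Sketch only: helper names and sed-based JSON parsing are illustrative assumptions.

# Overwrite the state file after a round (fields follow the example state file).
# $1=file $2=round $3=agent id $4=score $5=status
write_state() {
  printf '{ "current_round": %s, "agent_id": "%s", "last_score": %s, "status": "%s", "timestamp": "%s" }\n' \
    "$2" "$3" "$4" "$5" "$(date +%Y-%m-%dT%H:%M:%S)" > "$1"
}

# Read one string-valued field ($2) from a flat JSON file ($1).
state_field() {
  sed -n 's/.*"'"$2"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Print "resume" or "fresh". $2 optionally overrides "now" (epoch seconds).
startup_action() {
  sa_file="$1"
  sa_now="${2:-$(date +%s)}"
  sa_action="fresh"
  if [ -f "$sa_file" ]; then
    sa_status=$(state_field "$sa_file" status)
    sa_ts=$(state_field "$sa_file" timestamp | tr 'T' ' ')
    # An empty or unparsable timestamp falls through to "fresh".
    if [ -n "$sa_ts" ] && sa_epoch=$(date -d "$sa_ts" +%s 2>/dev/null); then
      sa_age=$((sa_now - sa_epoch))
      if [ "$sa_status" = "in_progress" ] && [ "$sa_age" -le 86400 ]; then
        sa_action="resume"
      fi
    fi
  fi
  echo "$sa_action"
}
```

For example, `startup_action paper/PAPER_IMPROVEMENT_STATE.json` would print `resume` only for an in-progress state written within the last 24 hours (86400 seconds).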
Workflow
Step 0: Preserve Original
cp paper/main.pdf paper/main_round0_original.pdf
Step 1: Collect Paper Text
Concatenate all section files into a single text block for the review prompt:
    # Collect all sections in order (the glob expands alphabetically)
    for f in paper/sections/*.tex; do
      echo "% === $(basename "$f") ==="
      cat "$f"
    done > /tmp/paper_full_text.txt
Step 2: Round 1 Review
Send the full paper text to GPT-5.4 xhigh:
    spawn_agent:
      model: gpt-5.4
      reasoning_effort: xhigh
      message: |
        You are reviewing a [VENUE] paper. Please provide a detailed, structured review.

        ## Full Paper Text:
        [paste concatenated sections]

        ## Review Instructions
        Please act as a senior ML reviewer ([VENUE] level). Provide:
        1. **Overall Score** (1-10, where 6 = weak accept, 7 = accept)
        2. **Summary** (2-3 sentences)
        3. **Strengths** (bullet list, ranked)
        4. **Weaknesses** (bullet list, ranked: CRITICAL > MAJOR > MINOR)
        5. **For each CRITICAL/MAJOR weakness**: A specific, actionable fix
        6. **Missing References** (if any)
        7. **Verdict**: Ready for submission? Yes / Almost / No

        Focus on: theoretical rigor, claims vs evidence alignment, writing clarity,
        self-containedness, notation consistency.
Save the agent id for Round 2.
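Alongside the agent id, the numeric score is what ends up in `last_score` and in the score progression table. A minimal sketch of pulling it out of the raw review text follows, assuming the reviewer writes the score in "N/10" form; the review format is not guaranteed, and `extract_score` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the first "N/10" style score found on stdin.
# Assumes the reviewer states the overall score before any other "N/10" figure.
extract_score() {
  grep -oE '[0-9]+(\.[0-9]+)?/10' | head -n 1 | cut -d/ -f1
}
```

Usage: `extract_score < review_round1.txt`, where the file name is only an example. If nothing matches, the output is empty and the score should be read manually.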
Step 2b: Human Checkpoint (if enabled)
Skip if HUMAN_CHECKPOINT = false.
Present the review results and wait for user input:
    📋 Round 1 review complete.
    Score: X/10 - [verdict]
    Key weaknesses (by severity):
    1. [CRITICAL] ...
    2. [MAJOR] ...
    3. [MINOR] ...
    Reply "go" to implement all fixes, give custom instructions, "skip 2" to skip specific fixes, or "stop" to end.
Parse the user response the same way as /auto-review-loop: approve / custom instructions / skip / stop.
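The four outcomes can be sketched as a small classifier. This is an illustration only: how /auto-review-loop actually parses replies is not shown here, so the matching rules below (exact "go" and "stop", "skip" followed by a number, everything else treated as custom instructions) and the name `classify_reply` are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: map a checkpoint reply to approve / stop / skip N / custom.
classify_reply() {
  # Lowercase and trim surrounding whitespace before matching.
  cr_reply=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
  case "$cr_reply" in
    go)           echo "approve" ;;
    stop)         echo "stop" ;;
    skip\ [0-9]*) echo "skip ${cr_reply#skip }" ;;
    *)            echo "custom" ;;
  esac
}
```

Anything that is not a recognised keyword is passed through as custom instructions, so free-form guidance is never discarded.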
Step 3: Implement Round 1 Fixes
Parse the review and implement fixes by severity:
Priority order:
- CRITICAL fixes (assumption mismatches, internal contradictions)
- MAJOR fixes (overclaims, missing content, notation issues)
- MINOR fixes (if time permits)
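The priority order can be applied mechanically once each weakness carries its severity tag. The sketch below assumes one weakness per line, tagged `[CRITICAL]`, `[MAJOR]`, or `[MINOR]` as in the checkpoint message; `sort_by_severity` is a hypothetical helper, and untagged lines are dropped.

```shell
#!/bin/sh
# Hypothetical helper: reorder tagged weaknesses CRITICAL > MAJOR > MINOR.
# Reads lines on stdin; lines without a severity tag are not printed.
sort_by_severity() {
  awk '
    /\[CRITICAL\]/ { print "1\t" $0; next }
    /\[MAJOR\]/    { print "2\t" $0; next }
    /\[MINOR\]/    { print "3\t" $0; next }
  ' | sort -s -n -k1,1 | cut -f2-
}
```

The stable sort (`-s`) keeps the reviewer's original ranking within each severity level.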
Common fix patterns:
| Issue | Fix Pattern |
|---|---|
| Assumption-model mismatch | Rewrite assumption to match the model, add formal proposition bridging the gap |
| Overclaims | Soften language: "validate" → "demonstrate practical relevance", "comparable" → "qualitatively competitive" |
| Missing metrics | Add quantitative table with honest parameter counts and caveats |
| Theorem not self-contained | Add "Interpretation" paragraph listing all dependencies |
| Notation confusion | Rename conflicting symbols globally, add Notation paragraph |
| Missing references | Add to the bibliography file, cite in appropriate locations |
| Theory-practice gap | Explicitly frame theory as idealized; add synthetic validation subsection |
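Several of these patterns (softened wording, renamed symbols) only hold if no old occurrence survives elsewhere in the paper. A minimal sketch of a leftover search over the LaTeX sources follows; `find_leftovers` is a hypothetical helper, and it matches literal text only.

```shell
#!/bin/sh
# Hypothetical helper: list every occurrence of a literal string in the .tex files.
# $1 = literal text to look for, $2 = paper directory (default: paper)
# Output lines have the form path:line:text; no output means nothing was found.
find_leftovers() {
  find "${2:-paper}" -type f -name '*.tex' -exec grep -nHF -- "$1" {} +
}
```

For example, after softening "validate", running `find_leftovers "validate" paper` would show any remaining uses in the abstract, captions, or appendix.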
Step 4: Recompile Round 1
    cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
    cp main.pdf main_round1.pdf
Verify: 0 undefined references, 0 undefined citations.
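One way to run this verification is to count the corresponding warnings in the LaTeX log. The sketch below assumes the standard warning wording ("Reference ... undefined", "Citation ... undefined") and a log at paper/main.log; `check_undefined` is a hypothetical helper. Long warnings can wrap across log lines, so treat a clean result as a strong hint and not as proof.

```shell
#!/bin/sh
# Hypothetical helper: report undefined references/citations found in a LaTeX log.
# Returns 0 only when both counts are zero, 2 when the log file is missing.
check_undefined() {
  cu_log="${1:-paper/main.log}"
  [ -f "$cu_log" ] || { echo "missing log: $cu_log" >&2; return 2; }
  # grep -c prints 0 itself when nothing matches; "|| true" only hides its exit status.
  cu_refs=$(grep -c 'Reference .* undefined' "$cu_log" || true)
  cu_cites=$(grep -c 'Citation .* undefined' "$cu_log" || true)
  echo "undefined references: $cu_refs, undefined citations: $cu_cites"
  [ "$cu_refs" -eq 0 ] && [ "$cu_cites" -eq 0 ]
}
```

Usage: `check_undefined paper/main.log || echo "fix references before Round 2"`.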
Step 5: Round 2 Review
Use send_input with the saved agent id:
    send_input:
      id: [saved from Round 1]
      model: gpt-5.4
      reasoning_effort: xhigh
      message: |
        [Round 2 update]

        Since your last review, we have implemented:
        1. [Fix 1]: [description]
        2. [Fix 2]: [description]
        ...

        Please re-score and re-assess. Same format:
        Score, Summary, Strengths, Weaknesses, Actionable fixes, Verdict.
Step 5b: Human Checkpoint (if enabled)
Skip if HUMAN_CHECKPOINT = false. Same as Step 2b: present the Round 2 review and wait for user input.
Step 6: Implement Round 2 Fixes
Same process as Step 3. Typical Round 2 fixes:
- Add controlled synthetic experiments validating theory
- Further soften any remaining overclaims
- Formalize informal arguments (e.g., truncation → formal proposition)
- Strengthen limitations section
Step 7: Recompile Round 2
    cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
    cp main.pdf main_round2.pdf
Step 8: Format Check
After the final recompilation, run a format compliance check:
    # 1. Page count vs venue limit
    PAGES=$(pdfinfo paper/main.pdf | awk '/^Pages:/ {print $2}')
    echo "Pages: $PAGES (limit: 9 main body for ICLR/NeurIPS)"

    # 2. Overfull hbox warnings (content exceeding margins)
    # grep -c prints 0 itself when nothing matches, so only its exit status is suppressed
    OVERFULL=$(grep -c "Overfull" paper/main.log 2>/dev/null || true)
    echo "Overfull hbox warnings: ${OVERFULL:-0}"
    grep "Overfull" paper/main.log 2>/dev/null | head -10

    # 3. Underfull hbox warnings (loose spacing)
    UNDERFULL=$(grep -c "Underfull" paper/main.log 2>/dev/null || true)
    echo "Underfull hbox warnings: ${UNDERFULL:-0}"

    # 4. Bad boxes summary
    BADNESS=$(grep -c "badness" paper/main.log 2>/dev/null || true)
    echo "Badness warnings: ${BADNESS:-0}"
Auto-fix patterns:
| Issue | Fix |
|---|---|
| Overfull hbox in equation | Scale it to fit the line width or split it across multiple lines |
| Overfull hbox in table | Reduce the font size or scale the table to fit the available width |
| Overfull hbox in text | Rephrase sentence or add line-break / hyphenation hints |
| Over page limit | Move content to appendix, compress tables, reduce figure sizes |
| Underfull hbox (loose) | Rephrase for better line filling |
If any overfull hbox > 10pt is found, fix it and recompile before documenting.
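The 10pt threshold can be checked directly against the log. The sketch below assumes the usual log wording, `Overfull \hbox (15.2pt too wide) ...`, with the size as the third whitespace-separated field; `overfull_over_limit` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print overfull hbox warnings wider than a limit in points.
# $1 = LaTeX log file, $2 = limit in pt (default: 10)
overfull_over_limit() {
  awk -v limit="${2:-10}" '
    /^Overfull \\hbox \(/ {
      size = $3                 # e.g. "(15.2pt"
      sub(/^\(/, "", size)      # drop the opening parenthesis
      sub(/pt$/, "", size)      # drop the unit
      if (size + 0 > limit) print
    }
  ' "$1"
}
```

Empty output from `overfull_over_limit paper/main.log` means no overfull hbox exceeds the limit, so documentation can proceed.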
Step 9: Document Results
Create PAPER_IMPROVEMENT_LOG.md in the paper directory:
    # Paper Improvement Log

    ## Score Progression
    | Round | Score | Verdict | Key Changes |
    |-------|-------|---------|-------------|
    | Round 0 (original) | X/10 | No/Almost/Yes | Baseline |
    | Round 1 | Y/10 | No/Almost/Yes | [summary of fixes] |
    | Round 2 | Z/10 | No/Almost/Yes | [summary of fixes] |

    ## Round 1 Review & Fixes
    <details>
    <summary>GPT-5.4 xhigh Review (Round 1)</summary>

    [Full raw review text, verbatim]

    </details>

    ### Fixes Implemented
    1. [Fix description]
    2. [Fix description]
    ...

    ## Round 2 Review & Fixes
    <details>
    <summary>GPT-5.4 xhigh Review (Round 2)</summary>

    [Full raw review text, verbatim]

    </details>

    ### Fixes Implemented
    1. [Fix description]
    2. [Fix description]
    ...

    ## PDFs
    - `main_round0_original.pdf`: Original generated paper
    - `main_round1.pdf`: After Round 1 fixes
    - `main_round2.pdf`: Final version after Round 2 fixes
Step 10: Summary
Report to user:
- Score progression table
- Number of CRITICAL/MAJOR/MINOR issues fixed per round
- Final page count
- Remaining issues (if any)
Feishu Notification (if configured)
After each round's review AND at final completion, check ~/.codex/feishu.json:
- After each round: Send review_scored with "Round N: X/10 - [key changes]"
- After final round: Send pipeline_done with the score progression table + final page count
- If config absent or mode "off": skip entirely (no-op)
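The no-op rule can be sketched as a guard that runs before any notification is attempted. Only the file path and the "off" mode come from the text above; the assumption that the config holds a top-level string field named "mode", the `sed`-based parsing, and the name `feishu_enabled` are illustrative. How a notification is actually sent is not covered here.

```shell
#!/bin/sh
# Hypothetical guard: succeed only when the config exists and its mode is not "off".
# Assumes a flat JSON file with a string field "mode"; a missing field counts as enabled.
feishu_enabled() {
  fe_cfg="${1:-$HOME/.codex/feishu.json}"
  [ -f "$fe_cfg" ] || return 1
  fe_mode=$(sed -n 's/.*"mode"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$fe_cfg" | head -n 1)
  [ "$fe_mode" != "off" ]
}
```

A caller would wrap each notification as `if feishu_enabled; then ...; fi`, so an absent or disabled config silently skips the step.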
Output
    paper/
    ├── main_round0_original.pdf   # Original
    ├── main_round1.pdf            # After Round 1
    ├── main_round2.pdf            # After Round 2 (final)
    ├── main.pdf                   # = main_round2.pdf
    └── PAPER_IMPROVEMENT_LOG.md   # Full review log with scores
Key Rules
- Large file handling: If the Write tool fails due to file size, immediately retry using Bash (cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission; just do it silently.
- Preserve all PDF versions: the user needs to compare progression.
- Save FULL raw review text: do not summarize or truncate GPT-5.4 responses.
- Use send_input for Round 2 to maintain conversation context.
- Always recompile after fixes: verify 0 errors before proceeding.
- Do not fabricate experimental results: synthetic validation must describe methodology, not invent numbers.
- Respect the paper's claims: soften overclaims rather than adding unsupported new claims.
- Global consistency: when renaming notation or softening claims, check ALL files (abstract, intro, method, experiments, theory sections, conclusion, tables, figure captions).
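For the large-file rule, writing in chunks means the first here-document creates the file and each later one appends to it. A minimal sketch follows; the function name `write_in_chunks` and the sample content are placeholders, not the skill's actual output.

```shell
#!/bin/sh
# Sketch of chunked writing with here-documents. $1 = target file.
write_in_chunks() {
  wc_out="$1"
  # Chunk 1: ">" creates (or truncates) the file.
  cat << 'EOF' > "$wc_out"
# Paper Improvement Log
## Score Progression
EOF
  # Later chunks: ">>" appends, so earlier content is kept.
  cat << 'EOF' >> "$wc_out"
## Round 1 Review & Fixes
EOF
}
```

Quoting the delimiter (`'EOF'`) stops the shell from expanding `$` and backticks, which matters when the content is LaTeX or raw review text.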
Typical Score Progression
Based on end-to-end testing on a 9-page ICLR 2026 theory paper:
| Round | Score | Key Improvements |
|---|---|---|
| Round 0 | 4/10 (content) | Baseline: assumption-model mismatch, overclaims, notation issues |
| Round 1 | 6/10 (content) | Fixed assumptions, softened claims, added interpretation, renamed notation |
| Round 2 | 7/10 (content) | Added synthetic validation, formal truncation proposition, stronger limitations |
| Round 3 | 5→8.5/10 (format) | Removed hero fig, appendix, compressed conclusion, fixed overfull hbox |
+4.5 points across 3 rounds (2 content + 1 format) is typical for a well-structured but rough first draft. Final: 8 pages main body, 0 overfull hbox, ICLR-compliant.