Awesome-Agent-Skills-for-Empirical-Research auto-review-loop
Autonomous multi-round research review loop. Repeatedly reviews the work with a secondary Codex agent, implements fixes, and re-reviews until a positive assessment is reached or the round limit is hit. Use when the user says "auto review loop" or "review until it passes", or wants autonomous iterative improvement.
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/42-wanshuiyin-ARIS/skills/skills-codex/auto-review-loop" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-auto-review-loop-f032ea && rm -rf "$T"
skills/42-wanshuiyin-ARIS/skills/skills-codex/auto-review-loop/SKILL.md

Auto Review Loop: Autonomous Research Improvement
Autonomously iterate: review → implement fixes → re-review, until the external reviewer gives a positive assessment or MAX_ROUNDS is reached.
Context: $ARGUMENTS
Constants
- MAX_ROUNDS = 4
- POSITIVE_THRESHOLD: score >= 6/10, or verdict contains "accept", "sufficient", "ready for submission"
- REVIEW_DOC: `AUTO_REVIEW.md` in project root (cumulative log)
- REVIEWER_MODEL = `gpt-5.4` — Model used via a secondary Codex agent. Must be an OpenAI model (e.g., `gpt-5.4`, `o3`, `gpt-4o`)
- HUMAN_CHECKPOINT = false — When `true`, pause after each round's review (Phase B) and present the score + weaknesses to the user. Wait for user input before proceeding to Phase C. The user can: approve the suggested fixes, provide custom modification instructions, skip specific fixes, or stop the loop early. When `false` (default), the loop runs fully autonomously.
- COMPACT = false — When `true`, (1) read `EXPERIMENT_LOG.md` and `findings.md` instead of parsing full logs on session recovery, (2) append key findings to `findings.md` after each round.

💡 Override: `/auto-review-loop "topic" — compact: true, human checkpoint: true`
State Persistence (Compact Recovery)
Long-running loops may hit the context window limit, triggering automatic compaction. To survive this, persist state to `REVIEW_STATE.json` after each round:
{ "round": 2, "agent_id": "019cd392-...", "status": "in_progress", "last_score": 5.0, "last_verdict": "not ready", "pending_experiments": ["screen_name_1"], "timestamp": "2026-03-13T21:00:00" }
Write this file at the end of every Phase E (after documenting the round). Overwrite each time — only the latest state matters.
On completion (positive assessment or max rounds), set `"status": "completed"` so future invocations don't accidentally resume a finished loop.
Workflow
Initialization
- Check for `REVIEW_STATE.json` in project root (decision logic sketched at the end of this section):
  - If it does not exist: fresh start (normal case, identical to behavior before this feature existed)
  - If it exists AND `status` is `"completed"`: fresh start (previous loop finished normally)
  - If it exists AND `status` is `"in_progress"` AND `timestamp` is older than 24 hours: fresh start (stale state from a killed/abandoned run — delete the file and start over)
  - If it exists AND `status` is `"in_progress"` AND `timestamp` is within 24 hours: resume
    - Read the state file to recover `round`, `agent_id`, `last_score`, `pending_experiments`
    - Read `AUTO_REVIEW.md` to restore full context of prior rounds
    - If `pending_experiments` is non-empty, check if they have completed (e.g., check screen sessions)
    - Resume from the next round (round = saved round + 1)
    - Log: "Recovered from context compaction. Resuming at Round N."
- Read project narrative documents, memory files, and any prior review documents. When `COMPACT = true` and compact files exist, prefer `findings.md` + `EXPERIMENT_LOG.md` over full raw logs.
- Read recent experiment results (check output directories, logs)
- Identify current weaknesses and open TODOs from prior reviews
- Initialize round counter = 1 (unless recovered from state file)
- Create/update `AUTO_REVIEW.md` with header and timestamp
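The `REVIEW_STATE.json` branch at the top of this list, in the same hypothetical Python sketch (`resume_decision` is an illustrative name; the status values and 24-hour cutoff come from the spec above):

```python
import json
from datetime import datetime, timedelta
from pathlib import Path

def resume_decision(state_path=Path("REVIEW_STATE.json")):
    """Return the saved state dict if the loop should resume, else None for a fresh start."""
    if not state_path.exists():
        return None                      # normal case: no prior state
    state = json.loads(state_path.read_text())
    if state.get("status") == "completed":
        return None                      # previous loop finished normally
    age = datetime.now() - datetime.fromisoformat(state["timestamp"])
    if age > timedelta(hours=24):
        state_path.unlink()              # stale state from a killed/abandoned run
        return None
    return state                         # in_progress and recent: resume at round + 1
```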
Loop (repeat up to MAX_ROUNDS)
Phase A: Review
Send comprehensive context to the external reviewer:
    spawn_agent:
      reasoning_effort: xhigh
      message: |
        [Round N/MAX_ROUNDS of autonomous review loop]
        [Full research context: claims, methods, results, known weaknesses]
        [Changes since last round, if any]

        Please act as a senior ML reviewer (NeurIPS/ICML level).
        1. Score this work 1-10 for a top venue
        2. List remaining critical weaknesses (ranked by severity)
        3. For each weakness, specify the MINIMUM fix (experiment, analysis, or reframing)
        4. State clearly: is this READY for submission? Yes/No/Almost

        Be brutally honest. If the work is ready, say so clearly.
If this is round 2+, use `send_input` with the saved agent id to maintain continuity.
Phase B: Parse Assessment
CRITICAL: Save the FULL raw response from the external reviewer verbatim (store in a variable for Phase E). Do NOT discard or summarize — the raw text is the primary record.
Then extract structured fields:
- Score (numeric 1-10)
- Verdict ("ready" / "almost" / "not ready")
- Action items (ranked list of fixes)
STOP CONDITION: If score >= 6 AND verdict contains "ready" or "almost" → stop loop, document final state.
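One way the extraction and stop condition could be scripted, assuming the reviewer roughly follows the requested format. The helper name and regex are illustrative and will need tuning against real responses:

```python
import re

def parse_assessment(raw: str):
    """Pull score and verdict out of the reviewer's raw text (stored verbatim elsewhere)."""
    m = re.search(r"\b(\d+(?:\.\d+)?)\s*/\s*10\b", raw)
    score = float(m.group(1)) if m else None
    lower = raw.lower()
    if "not ready" in lower:             # check first: "not ready" contains "ready"
        verdict = "not ready"
    elif "almost" in lower:
        verdict = "almost"
    elif any(kw in lower for kw in ("ready", "accept", "sufficient")):
        verdict = "ready"
    else:
        verdict = "unknown"
    stop = score is not None and score >= 6 and verdict in ("ready", "almost")
    return score, verdict, stop
```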
Human Checkpoint (if enabled)
Skip this step entirely if `HUMAN_CHECKPOINT = false`.

When `HUMAN_CHECKPOINT = true`, present the review results and wait for user input:
    📋 Round N/MAX_ROUNDS review complete.
    Score: X/10 — [verdict]

    Top weaknesses:
    1. [weakness 1]
    2. [weakness 2]
    3. [weakness 3]

    Suggested fixes:
    1. [fix 1]
    2. [fix 2]
    3. [fix 3]

    Options:
    - Reply "go" or "continue" → implement all suggested fixes
    - Reply with custom instructions → implement your modifications instead
    - Reply "skip 2" → skip fix #2, implement the rest
    - Reply "stop" → end the loop, document current state
Wait for the user's response. Parse their input:
- Approval ("go", "continue", "ok", "proceed"): proceed to Phase C with all suggested fixes
- Custom instructions (any other text): treat as additional/replacement guidance for Phase C. Merge with reviewer suggestions where appropriate
- Skip specific fixes ("skip 1,3"): remove those fixes from the action list
- Stop ("stop", "enough", "done"): terminate the loop, jump to Termination
Feishu Notification (if configured)
After parsing the score, check if `~/.codex/feishu.json` exists and mode is not `"off"`:

- Send a `review_scored` notification: "Round N: X/10 — [verdict]" with top 3 weaknesses
- If interactive mode and verdict is "almost": send as checkpoint, wait for user reply on whether to continue or stop
- If config absent or mode off: skip entirely (no-op)
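The config gate is small; a sketch, assuming `feishu.json` carries a top-level `mode` key as the description implies:

```python
import json
from pathlib import Path

def feishu_mode() -> str:
    """Return the configured notification mode, or "off" when the config is absent (no-op)."""
    cfg = Path.home() / ".codex" / "feishu.json"
    if not cfg.exists():
        return "off"
    return json.loads(cfg.read_text()).get("mode", "off")
```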
Phase C: Implement Fixes (if not stopping)
For each action item (highest priority first):
- Code changes: Write/modify experiment scripts, model code, analysis scripts
- Run experiments: Deploy to GPU server via SSH + screen/tmux
- Analysis: Run evaluation, collect results, update figures/tables
- Documentation: Update project notes and review document
Prioritization rules:
- Skip fixes requiring excessive compute (flag for manual follow-up)
- Skip fixes requiring external data/models not available
- Prefer reframing/analysis over new experiments when both address the concern
- Always implement metric additions (cheap, high impact)
Phase D: Wait for Results
If experiments were launched:
- Monitor remote sessions for completion
- Collect results from output files and logs
- Training quality check — if W&B is configured, invoke `/training-check` to verify training was healthy (no NaN, no divergence, no plateau). If W&B is not available, skip silently.
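For the monitoring step, a polling sketch over `screen -ls` (it assumes experiments run in named screen sessions on the same host; over SSH the command would need wrapping):

```python
import subprocess
import time

def wait_for_screens(session_names, poll_secs=300):
    """Block until none of the named screen sessions are still listed by `screen -ls`."""
    while True:
        out = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
        if not any(name in out for name in session_names):
            return                       # all sessions gone: results should be on disk
        time.sleep(poll_secs)
```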
Phase E: Document Round
Append to `AUTO_REVIEW.md`:

    ## Round N (timestamp)

    ### Assessment (Summary)
    - Score: X/10
    - Verdict: [ready/almost/not ready]
    - Key criticisms: [bullet list]

    ### Reviewer Raw Response
    <details>
    <summary>Click to expand full reviewer response</summary>

    [Paste the COMPLETE raw response from the external reviewer here — verbatim, unedited. This is the authoritative record. Do NOT truncate or paraphrase.]

    </details>

    ### Actions Taken
    - [what was implemented/changed]

    ### Results
    - [experiment outcomes, if any]

    ### Status
    - [continuing to round N+1 / stopping]
Write `REVIEW_STATE.json` with current round, agent id, score, verdict, and any pending experiments.

Append to `findings.md` (when `COMPACT = true`): one-line entry per key finding this round:
- [Round N] [positive/negative/unexpected]: [one-sentence finding] (metric: X.XX → Y.YY)
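The compact-mode append can be a one-line write; a hypothetical helper following the format above:

```python
def log_finding(round_n: int, kind: str, finding: str, before: float, after: float):
    """Append one line per key finding to findings.md (COMPACT = true only)."""
    with open("findings.md", "a") as f:
        f.write(f"- [Round {round_n}] [{kind}]: {finding} (metric: {before:.2f} → {after:.2f})\n")
```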
Increment round counter → back to Phase A.
Termination
When loop ends (positive assessment or max rounds):
- Update `REVIEW_STATE.json` with `"status": "completed"`
- Write final summary to `AUTO_REVIEW.md`
- Update project notes with conclusions
- Write method/pipeline description to `AUTO_REVIEW.md` under a `## Method Description` section — a concise 1-2 paragraph summary of the final method, architecture, and data flow. This serves as direct input for `/paper-illustration`.
- Generate claims from results — invoke `/result-to-claim` to convert experiment results from `AUTO_REVIEW.md` into structured paper claims. Output: `CLAIMS_FROM_RESULTS.md`. If `/result-to-claim` is unavailable, skip silently.
- If stopped at max rounds without positive assessment:
  - List remaining blockers
  - Estimate effort needed for each
  - Suggest whether to continue manually or pivot
- Feishu notification (if configured): Send `pipeline_done` with final score progression table
Key Rules
- Large file handling: If the Write tool fails due to file size, immediately retry using Bash (`cat << 'EOF' > file`) to write in chunks. Do NOT ask the user for permission — just do it silently.
- ALWAYS use `reasoning_effort: xhigh` for maximum reasoning depth
- Save agent id from first call, use `send_input` for subsequent rounds
- Be honest — include negative results and failed experiments
- Do NOT hide weaknesses to game a positive score
- Implement fixes BEFORE re-reviewing (don't just promise to fix)
- If an experiment takes > 30 minutes, launch it and continue with other fixes while waiting
- Document EVERYTHING — the review log should be self-contained
- Update project notes after each round, not just at the end
Prompt Template for Round 2+
    send_input:
      id: [saved from round 1]
      reasoning_effort: xhigh
      message: |
        [Round N update]

        Since your last review, we have:
        1. [Action 1]: [result]
        2. [Action 2]: [result]
        3. [Action 3]: [result]

        Updated results table:
        [paste metrics]

        Please re-score and re-assess. Are the remaining concerns addressed?
        Same format: Score, Verdict, Remaining Weaknesses, Minimum Fixes.