Auto-deep-researcher-24x7 auto-experiment
Launch an autonomous THINK→EXECUTE→REFLECT experiment loop on a GPU project
install
source · Clone the upstream repo
git clone https://github.com/Xiangyue-Zhang/auto-deep-researcher-24x7
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Xiangyue-Zhang/auto-deep-researcher-24x7 "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/auto-experiment" ~/.claude/skills/xiangyue-zhang-auto-deep-researcher-24x7-auto-experiment && rm -rf "$T"
manifest:
skills/auto-experiment/SKILL.md
/auto-experiment
Launch an autonomous experiment agent that runs your deep learning experiments 24/7.
What This Does
This skill starts a THINK → EXECUTE → REFLECT loop that:
- Reads your `PROJECT_BRIEF.md` to understand the research goal
- Analyzes previous results in `MEMORY_LOG.md`
- Plans the next experiment (hypothesis + success criteria)
- Implements code changes and runs a mandatory dry-run
- Launches GPU training via `nohup` (tracks PID)
- Monitors at zero LLM cost (only `kill -0 PID` + `tail log` + `nvidia-smi`)
- Wakes up when training finishes to analyze results
- Updates memory and decides: iterate, pivot, or report
- Repeats
Usage
```
/auto-experiment
/auto-experiment --project /path/to/my_project --gpu 0
/auto-experiment --project . --max-cycles 5
```
Prerequisites
The project directory must contain:
PROJECT_BRIEF.md (required)

A frozen reference describing your research goal. Example:

```markdown
# Goal
Train a ViT-B/16 on ImageNet to reach 78%+ top-1 accuracy.

# Codebase
- Training: train.py
- Config: configs/vit_base.yaml
- Data: /data/imagenet/

# Constraints
- GPU 0-3 available (use DDP)
- Max 90 epochs per run
- Report val accuracy after each run

# Current Best
- ResNet-50 baseline: 76.1%
```
config.yaml (optional)

Override default agent settings:

```yaml
agent:
  model: "claude-sonnet-4-6"
  max_cycles: -1              # -1 = unlimited
  max_steps_per_cycle: 3      # max sub-agent dispatches per cycle
  cooldown_interval: 300      # 5 min smart polling
memory:
  brief_max_chars: 3000
  log_max_chars: 2000
monitor:
  poll_interval: 900          # check every 15 min during training
  zero_llm: true
experiment:
  mandatory_dry_run: true
```
Optional remote execution over SSH:
```yaml
execution:
  mode: "ssh"
  ssh_host: "user@server"
  remote_workspace: "/home/user/my_project/workspace"
  remote_python: "python3"
```
In SSH mode, the controller state stays local (
PROJECT_BRIEF.md,
workspace/MEMORY_LOG.md, workspace/HUMAN_DIRECTIVE.md, state.json),
while code edits, shell commands, training, log tailing, PID checks, and GPU
queries run on the configured remote host.
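Conceptually, each of those remote actions reduces to an `ssh` invocation from the local controller. A minimal sketch, assuming key-based auth and the `execution` settings above; the PID file and log path are illustrative assumptions, not names the skill defines:

```bash
# Hedged sketch of SSH-mode execution: the controller runs locally and wraps
# each action in an ssh call. PID file and log path are illustrative assumptions.
HOST="user@server"
WS="/home/user/my_project/workspace"

ssh "$HOST" "nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv"
ssh "$HOST" "kill -0 \$(cat $WS/train.pid) 2>/dev/null && echo running || echo finished"
ssh "$HOST" "tail -50 $WS/logs/current.log"
```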
Workflow Details
Phase 1: THINK
- Read `PROJECT_BRIEF.md` (frozen, max 3000 chars)
- Read `MEMORY_LOG.md` (rolling, auto-compacted)
- Check for `HUMAN_DIRECTIVE.md` (highest priority, auto-archived after reading)
- Analyze: What's the current best? What hasn't been tried? What's most promising?
- Output: experiment plan with hypothesis and success criteria
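The directive check is the one step here with a side effect: the file is archived after it is read so it is only acted on once. A minimal shell sketch of that behavior; the archive directory and timestamp format are assumptions:

```bash
# Hedged sketch: consume HUMAN_DIRECTIVE.md once, then move it out of the way.
# The archive directory and timestamp format are assumptions.
if [ -f workspace/HUMAN_DIRECTIVE.md ]; then
  cat workspace/HUMAN_DIRECTIVE.md    # contents are injected into the THINK prompt
  mkdir -p workspace/archive
  mv workspace/HUMAN_DIRECTIVE.md "workspace/archive/HUMAN_DIRECTIVE_$(date +%m%d_%H%M).md"
fi
```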
Phase 2: EXECUTE
- Dispatch to Code Agent (5 tools: `run_shell`, `launch_experiment`, `write_file`, `read_file`, `list_files`)
- Code Agent implements changes
- Mandatory dry-run (2-step verify, abort if fails)
- Launch training via `nohup`, capture PID
- Enter zero-cost monitoring loop (see the sketch after this list):
  - backend PID check — is process alive?
  - backend `nvidia-smi` — GPU utilization
  - backend `tail -50 logfile` — latest training output
- Zero LLM API calls during this phase
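A minimal shell sketch of the launch-and-monitor pattern this phase describes; `train.py`, the log path, and the PID file are illustrative assumptions, while the 900-second poll matches the default `poll_interval`:

```bash
# Hedged sketch: detach training with nohup, record the PID, then poll with
# plain shell tools. train.py, the log path, and the PID file are assumptions.
mkdir -p workspace/logs
nohup python train.py --config configs/vit_base.yaml > workspace/logs/exp.log 2>&1 &
echo $! > workspace/train.pid

while kill -0 "$(cat workspace/train.pid)" 2>/dev/null; do
  nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader
  tail -50 workspace/logs/exp.log
  sleep 900   # poll_interval from config.yaml; no LLM calls in this loop
done
```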
Phase 3: REFLECT
- Parse training logs for metrics (loss, accuracy, FGD, FID, etc.)
- Compare against previous best
- Log milestone if improved (auto-compacted at 1200 chars)
- Log decision (rolling last 15 entries)
- Decide: try another config / pivot direction / generate report
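The metric parsing can be as simple as grepping the log. A sketch assuming the training script prints lines containing `val_acc=...`; the log format, file names, and best-value bookkeeping are assumptions:

```bash
# Hedged sketch: pull the latest validation accuracy from a training log and
# compare it against the stored best. Log format and file names are assumptions.
acc=$(grep -oE 'val_acc=[0-9.]+' workspace/logs/exp.log | tail -1 | cut -d= -f2)
best=$(cat workspace/best_acc.txt 2>/dev/null || echo 0)

if awk -v a="$acc" -v b="$best" 'BEGIN { exit !(a > b) }'; then
  echo "$acc" > workspace/best_acc.txt
  echo "New best: $acc (previous $best); log a milestone and iterate"
fi
```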
Human Override (anytime)
```bash
# Drop a directive file — agent reads it next cycle with highest priority
echo "Try learning rate 1e-5 with cosine schedule" > workspace/HUMAN_DIRECTIVE.md
```
Memory System
Two tiers, with a constant total size (~5K chars / ~1500 tokens) no matter how long the agent runs:

| Tier | File | Content | Cap |
|---|---|---|---|
| 1 | `PROJECT_BRIEF.md` | Frozen project reference | 3,000 chars |
| 2 | `MEMORY_LOG.md` | Key Results + Recent Decisions | 2,000 chars |
Auto-compaction rules:
- Key Results: oldest dropped when section > 1,200 chars
- Recent Decisions: only last 15 entries kept
- Total log hard-capped at 2,000 chars
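The 15-entry rule, for instance, can be expressed with standard tools. A sketch assuming `## Recent Decisions` is the last section of `MEMORY_LOG.md` with one entry per line, as in the example output below; the agent performs the real compaction itself:

```bash
# Hedged sketch: keep only the newest 15 entries under "## Recent Decisions".
# Assumes that section is last in MEMORY_LOG.md and holds one entry per line.
awk '
  /^## Recent Decisions/ { in_dec = 1; print; next }
  !in_dec               { print; next }
                        { buf[n++] = $0 }
  END {
    start = (n > 15) ? n - 15 : 0
    for (i = start; i < n; i++) print buf[i]
  }
' workspace/MEMORY_LOG.md > /tmp/memlog && mv /tmp/memlog workspace/MEMORY_LOG.md
```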
Cost
| Phase | Duration | LLM Cost |
|---|---|---|
| THINK | 5-10 min | ~$0.05 |
| EXECUTE (training) | hours/days | $0.00 |
| REFLECT | 5-10 min | ~$0.03 |
| 24h cycle total | - | ~$0.08 |
Example Output
After a few cycles, your `workspace/MEMORY_LOG.md` will look like:

```markdown
# Memory Log

## Key Results
[04-07 14:30] Exp001: ResNet-50 baseline, lr=0.1, acc=76.1%
[04-07 22:15] Exp002: ViT-B/16, lr=1e-3, acc=74.8% (underperforming, lr too high)
[04-08 06:00] Exp003: ViT-B/16, lr=3e-4 + cosine, acc=77.9% (new best!)
[04-08 14:45] Exp004: ViT-B/16, lr=3e-4 + cosine + mixup, acc=78.3% (target reached!)

## Recent Decisions
[04-07 14:30] Start with ResNet-50 baseline to establish reference
[04-07 22:15] ViT lr=1e-3 too high, try 3e-4 next
[04-08 06:00] Cosine schedule helped significantly, try adding regularization
[04-08 14:45] Target reached! Generate final report.
```