Auto-claude-code-research-in-sleep experiment-queue
SSH job queue for multi-seed/multi-config ML experiments with OOM-aware retry, stale-screen cleanup, and wave-transition race prevention. Use when user says "batch experiments", "队列实验", "run grid", "multi-seed sweep", "auto-chain experiments", or when /run-experiment is insufficient for 10+ jobs that need orchestration.
```bash
# Clone the full repo
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/experiment-queue" ~/.claude/skills/wanshuiyin-auto-claude-code-research-in-sleep-experiment-queue && rm -rf "$T"
```
skills/experiment-queue/SKILL.md
Experiment Queue
Orchestrate large batches of ML experiments on SSH remote GPU servers with proper state tracking, OOM retry, stale cleanup, and wave transitions.
When to Use This Skill
Use when /run-experiment is insufficient:
- ≥10 jobs that need batching across GPUs
- Multi-seed sweeps (e.g., 21 seeds × 12 cells)
- Wave transitions (run wave 1, wait, run wave 2, wait, run wave 3...)
- Teacher+student chains (train teacher then distill; auto-trigger student after teacher done)
- OOM-prone configs where you need to retry with different GPU or wait
- Mixed seed grids where failed cells need re-running
Do NOT use for:
- Single ad-hoc experiment (use /run-experiment)
- Modal/Vast.ai deployments (those have their own orchestration)
- Experiments that need manual inspection between runs
Why This Exists
Based on session audit (2026-04-16), the major wall-clock sinks in multi-seed grid experiments are:
- Stale screens — python finishes, wandb uploads, screen hangs, next wave blocked
- OOM on shared GPU — previous job's memory not yet released
- Wave race — new wave launches before previous wave fully settles
- Missing checkpoints — student launches before teacher saved
- Parser duplication — rewriting multi-seed analysis python every batch
All of these are pure engineering friction that can be orchestrated.
Core Concepts
Job Manifest
A manifest lists jobs with explicit state:
```yaml
project: dllm_distill
cwd: /home/rfyang/rfyang_code/dllm_experiments_torch
conda: dllm
# Optional: override conda hook path if conda is not at a standard location.
# Can be a bare path (wrapped automatically) or a full `eval "$(... shell.bash hook)"` string.
# Falls back to auto-detect of ~/anaconda3, ~/miniconda3, /opt/anaconda3, etc.,
# or the ARIS_CONDA_HOOK environment variable.
# conda_hook: /custom/path/to/conda
ssh: SJTUServer5
default_cmd: >
  python run_pc_distill_exp.py --backbone softmax --lam 0.5 --K 500 --L 96 --W 16
  --n_steps 30000 --batch_size 128 --lr 1e-4
preconditions:
  - type: checkpoint_exists
    path: checkpoints/transformer/pcc_softmax_L96_K500_N{N}_wikitext103.pt
gpus: [0, 1, 2, 3, 4, 5, 6, 7]
max_parallel: 8
gpu_free_threshold_mib: 500   # optional, default 500; raise for shared servers, lower for tight packing
oom_retry:
  delay: 120
  max_attempts: 3
jobs:
  - id: s200_N64_n50K
    args: {seed: 200, n_hidden: 64, n_train_subset: 50000, subset_seed: 2024}
  - id: s200_N128_n50K
    args: {seed: 200, n_hidden: 128, n_train_subset: 50000, subset_seed: 2024}
  # ... 14 more
```
Job State Machine
```
pending → running → completed
                  ↘ failed_oom   → pending (after delay) [retry up to N]
                  ↘ failed_other → stuck (needs manual inspection)
stale_screen_detected → cleaned → pending
```
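For reference, a minimal sketch of how these transitions could be encoded in a scheduler. The state names come from the diagram above; the `ALLOWED_TRANSITIONS` table and `advance()` helper are illustrative names, not the bundled queue_manager.py API.

```python
# Illustrative sketch: encode the legal job-state transitions from the diagram above.
ALLOWED_TRANSITIONS = {
    "pending": {"running"},
    "running": {"completed", "failed_oom", "failed_other", "stale_screen_detected"},
    "failed_oom": {"pending", "stuck"},   # requeue after oom_retry.delay, or give up after max_attempts
    "failed_other": {"stuck"},            # needs manual inspection
    "stale_screen_detected": {"cleaned"},
    "cleaned": {"pending"},
}

def advance(job: dict, new_status: str) -> None:
    """Move a job to new_status, rejecting transitions the diagram does not allow."""
    current = job["status"]
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new_status} for job {job['id']}")
    job["status"] = new_status
```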
Wave Orchestration
A "wave" is a batch of jobs that fit available GPUs. Next wave only starts when:
- All current-wave python processes have exited
- No stale screens remain for current-wave tags
- GPU memory has dropped below threshold (≤500 MiB)
- Precondition checks pass for next-wave jobs
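A minimal sketch of that settle check, assuming it runs on the GPU host itself with nvidia-smi and screen on PATH, and that all screens in a wave share a common name tag; the function names are illustrative:

```python
import subprocess

def gpu_memory_used_mib() -> list[int]:
    """Per-GPU used memory in MiB, parsed from nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(x) for x in out.split()]

def wave_screens_remaining(wave_tag: str) -> bool:
    """True if any screen session carrying the current wave's tag is still listed."""
    out = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
    return wave_tag in out

def wave_settled(wave_tag: str, threshold_mib: int = 500) -> bool:
    """Settled = no wave screens left and every GPU back under the memory threshold."""
    return not wave_screens_remaining(wave_tag) and all(
        used <= threshold_mib for used in gpu_memory_used_mib()
    )
```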
Workflow
Step 1: Parse Manifest / Build from Grid
Input can be:
- YAML manifest (explicit job list, recommended for complex cases)
- Grid spec (Cartesian product of param values, e.g., N=[64,128,256] × n=[50K,150K,500K,652K])
- Natural language description (Claude parses into manifest)
Save the built manifest to <project>/experiment_queue/<timestamp>/manifest.json for reproducibility.
Step 2: Pre-flight
- Check SSH connection works
- Check conda env exists on remote
- Check cwd exists on remote
- Check all preconditions (checkpoints, input files)
- Check GPU availability (at least max_parallel free GPUs)
If any precondition fails, show user which jobs are blocked and why.
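A minimal sketch of that pre-flight pass, assuming key-based SSH, conda available on the remote non-interactive PATH, and the manifest fields shown earlier; `remote_ok` and `preflight` are illustrative helpers, not the bundled tool:

```python
import subprocess

def remote_ok(host: str, command: str) -> bool:
    """Run a command on the SSH host and report whether it exited cleanly."""
    return subprocess.run(["ssh", host, command], capture_output=True).returncode == 0

def preflight(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means the queue is ready to launch."""
    host, problems = manifest["ssh"], []
    if not remote_ok(host, "true"):
        problems.append(f"cannot reach {host} over SSH")
        return problems
    if not remote_ok(host, f"test -d {manifest['cwd']}"):
        problems.append(f"cwd {manifest['cwd']} missing on remote")
    if not remote_ok(host, f"conda env list | grep -qw {manifest['conda']}"):
        problems.append(f"conda env {manifest['conda']} not found on remote")
    for pre in manifest.get("preconditions", []):
        if pre["type"] == "checkpoint_exists" and "{" not in pre["path"]:
            # templated paths like ...N{N}... are resolved and checked per job at launch time
            if not remote_ok(host, f"test -f {manifest['cwd']}/{pre['path']}"):
                problems.append(f"missing checkpoint {pre['path']}")
    return problems
```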
Step 3: Launch Scheduler
Run tools/queue_manager.py (bundled with this skill) as a detached nohup process on the SSH host:
```bash
ssh <server> 'nohup python3 ~/.aris_queue/queue_manager.py \
  --manifest /tmp/manifest.json \
  --state /tmp/queue_state.json \
  --log /tmp/queue.log \
  > /tmp/queue_mgr.log 2>&1 &'
```
The scheduler:
- Reads manifest
- Loops: for each pending job, assign to free GPU, launch via screen
- Polls job status (every 60s)
- Detects stale screens (python exited but screen detached → kill)
- Detects OOM (CUDA OOM in log → mark failed_oom → retry after delay)
- Detects completion (expected output JSON/file exists) → mark completed
- Launches next wave when current wave settles
- Writes state to queue_state.json continuously
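The bundled queue_manager.py is the real implementation; the sketch below only illustrates the shape of that loop (GPU assignment, screen launch, 60 s polling, state flushing), with hypothetical helper names and a logs/<id>.log convention that are not taken from the actual tool:

```python
import json
import subprocess
import time

def _gpu_used_mib() -> list[int]:
    """Per-GPU used memory in MiB from nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(x) for x in out.split()]

def _free_gpus(state: dict, threshold_mib: int) -> list[int]:
    """GPUs under the memory threshold and not already holding a running job."""
    busy = {j.get("gpu") for j in state["jobs"] if j["status"] == "running"}
    return [i for i, mib in enumerate(_gpu_used_mib())
            if mib <= threshold_mib and i not in busy]

def _launch(job: dict, gpu: int, cwd: str) -> None:
    """Start the job in a detached, named screen pinned to one GPU (assumes cwd/logs exists)."""
    cmd = f"cd {cwd} && CUDA_VISIBLE_DEVICES={gpu} {job['cmd']} > logs/{job['id']}.log 2>&1"
    subprocess.run(["screen", "-dmS", f"q_{job['id']}", "bash", "-lc", cmd], check=True)
    job.update(status="running", gpu=gpu)

def scheduler_loop(state: dict, state_path: str, cwd: str, threshold_mib: int = 500) -> None:
    while any(j["status"] in ("pending", "running") for j in state["jobs"]):
        for job in state["jobs"]:
            if job["status"] != "pending":
                continue
            gpus = _free_gpus(state, threshold_mib)
            if not gpus:
                break
            _launch(job, gpus[0], cwd)
        # ... poll running jobs here: stale screens, OOM, completion, wave settling ...
        with open(state_path, "w") as fh:  # flush so /monitor-experiment always sees fresh state
            json.dump(state, fh, indent=2)
        time.sleep(60)
```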
Step 4: Monitoring
User can check state anytime:
```bash
ssh <server> cat /tmp/queue_state.json | jq '.jobs | group_by(.status) | map({(.[0].status): length}) | add'
```
Or invoke /monitor-experiment, which reads the state file.
Step 5: Post-completion
When all jobs in manifest.json are completed or stuck:
- Scheduler exits cleanly
- Write final summary to <project>/experiment_queue/<timestamp>/summary.md
- Invoke /analyze-results if analyze_on_complete: true
Grid Spec Syntax
Instead of writing 36 job entries manually:
```yaml
grid:
  N: [64, 128, 256]
  n: [50000, 150000, 500000, 652000]
  seed: [42, 200, 201]
template:
  id: "s${seed}_N${N}_n${n}"
  args: {seed: ${seed}, n_hidden: ${N}, n_train_subset: ${n}}
```
Expands to 36 jobs automatically.
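A minimal sketch of that expansion, assuming ${var} placeholders as written (Python's string.Template uses the same syntax); build_manifest.py is described as doing this, but the code below is illustrative rather than its actual implementation:

```python
from itertools import product
from string import Template

def expand_grid(grid: dict, template: dict) -> list[dict]:
    """Cartesian-product a grid spec into concrete job entries."""
    keys = list(grid)
    jobs = []
    for combo in product(*(grid[k] for k in keys)):
        values = dict(zip(keys, combo))
        jobs.append({
            "id": Template(template["id"]).substitute(values),
            # substituted values come out as strings in this sketch
            "args": {k: Template(str(v)).substitute(values) for k, v in template["args"].items()},
        })
    return jobs

grid = {"N": [64, 128, 256], "n": [50000, 150000, 500000, 652000], "seed": [42, 200, 201]}
template = {"id": "s${seed}_N${N}_n${n}",
            "args": {"seed": "${seed}", "n_hidden": "${N}", "n_train_subset": "${n}"}}
print(len(expand_grid(grid, template)))  # 3 × 4 × 3 = 36 jobs
```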
Wave Chaining
For sequential phases (teacher → student):
```yaml
phases:
  - name: train_teachers
    grid:
      N: [384, 512]
    template:
      cmd: python run_pc_exp.py --direction c --backbone softmax --n_hidden ${N} ...
      output_check: checkpoints/transformer/pcc_softmax_L96_K500_N${N}_wikitext103.pt
  - name: distill_students
    depends_on: train_teachers
    grid:
      N: [384, 512]
      seed: [42, 200, 201]
    template:
      cmd: python run_pc_distill_exp.py --n_hidden ${N} --seed ${seed} ...
      output_check: figures/pcdistill_sw_N${N}_*_seed${seed}.json
```
Scheduler enforces depends_on: distill_students jobs stay pending until all train_teachers jobs are completed.
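A minimal sketch of that gate, with field names mirroring the YAML above (the helper itself is illustrative):

```python
def phase_ready(phase: dict, jobs_by_phase: dict) -> bool:
    """A phase may launch only after every job in the phase it depends_on has completed."""
    dep = phase.get("depends_on")
    if dep is None:
        return True  # e.g. train_teachers has no dependency
    return all(job["status"] == "completed" for job in jobs_by_phase[dep])
```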
OOM Handling
Detect OOM from stdout:
```
torch\.OutOfMemoryError: CUDA out of memory
```
On detection:
- Mark job failed_oom
- Kill screen
- Wait oom_retry.delay seconds
- Check if current GPU is free; if not, try another free GPU
- Requeue as pending
- Max oom_retry.max_attempts before marking stuck
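A minimal sketch of that path, reusing the q_<id> screen naming and logs/<id>.log layout assumed in the scheduler sketch above (the regex is the one given; the rest is illustrative):

```python
import re
import subprocess
import time

OOM_PATTERN = re.compile(r"torch\.OutOfMemoryError: CUDA out of memory")

def handle_possible_oom(job: dict, log_path: str, oom_retry: dict) -> None:
    """On an OOM hit: mark, kill the screen, back off, and requeue (or give up)."""
    with open(log_path, errors="replace") as fh:
        if not OOM_PATTERN.search(fh.read()):
            return
    job["status"] = "failed_oom"
    subprocess.run(["screen", "-S", f"q_{job['id']}", "-X", "quit"])  # kill the dead job's screen
    job["oom_attempts"] = job.get("oom_attempts", 0) + 1
    if job["oom_attempts"] >= oom_retry["max_attempts"]:
        job["status"] = "stuck"        # bounded retry: give up and alert
        return
    time.sleep(oom_retry["delay"])     # crude backoff; a real scheduler would record a retry-after time
    job["status"] = "pending"          # next cycle picks whichever GPU is actually free
```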
Stale Screen Detection
Every 60s, for each running screen:
- Check screen exists (screen -ls)
- Check python PID still running (ps -p)
- If screen exists but python exited:
  - If expected output file exists → mark completed, kill stale screen
  - If no output file → mark failed_other, kill screen
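A minimal sketch of one such polling pass, again assuming the q_<id> screen naming and that each job record carries its python PID and expected output path (all illustrative):

```python
import os
import subprocess

def screen_alive(name: str) -> bool:
    """True if screen -ls still lists the session."""
    out = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
    return name in out

def python_alive(pid: int) -> bool:
    """True if the recorded python PID is still running."""
    return subprocess.run(["ps", "-p", str(pid)], capture_output=True).returncode == 0

def check_stale(job: dict) -> None:
    """Resolve a running job whose python exited but whose screen is still hanging around."""
    name = f"q_{job['id']}"
    if not screen_alive(name) or python_alive(job["pid"]):
        return  # either already gone, or still genuinely running
    job["status"] = "completed" if os.path.exists(job["expected_output"]) else "failed_other"
    subprocess.run(["screen", "-S", name, "-X", "quit"])  # clean up the stale session
```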
Resume-on-restart
If scheduler crashes / is killed:
- Read queue_state.json
- For each running job: check screen; if still alive, keep; if not, re-evaluate state
- For each pending job: continue normally
- Idempotent: safe to restart scheduler without losing state
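A minimal sketch of that restart path, using the same state-file and screen-naming assumptions as the sketches above:

```python
import json
import os
import subprocess

def resume(state_path: str) -> dict:
    """Reload state and re-evaluate only the jobs that were marked running before the crash."""
    with open(state_path) as fh:
        state = json.load(fh)
    screens = subprocess.run(["screen", "-ls"], capture_output=True, text=True).stdout
    for job in state["jobs"]:
        if job["status"] != "running":
            continue  # pending / completed / stuck jobs carry over unchanged
        if f"q_{job['id']}" in screens:
            continue  # screen still alive: keep it as running
        # screen is gone: trust the output file rather than the stale status
        job["status"] = "completed" if os.path.exists(job["expected_output"]) else "pending"
    return state
```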
Output: Summary Report
```markdown
# Experiment Queue Summary

**Project**: dllm_distill
**Started**: 2026-04-16 11:36:29
**Completed**: 2026-04-16 18:02:14
**Total wall-clock**: 6h 25m
**Jobs**: 40 completed, 2 OOM-retried then completed, 0 stuck

## Phases

| Phase | Jobs | Success | OOM retries | Duration |
| --- | --- | --- | --- | --- |
| train_teachers | 2 | 2 | 0 | 58m |
| distill_students | 24 | 24 | 2 | 4h 02m |
| multi_seed_validation | 16 | 16 | 0 | 1h 25m |

## Results Files

- 42 JSON files in `figures/pcdistill_sw_*.json`

## Next Steps

- Run `/analyze-results` on output JSONs
- Figures auto-regen via `artifact-sync` (if configured)
```
Comparison with /run-experiment
| Feature | /run-experiment | experiment-queue |
|---|---|---|
| Single-shot experiment | ✅ | ✅ (overkill) |
| Multi-GPU parallel | Basic | Proper scheduling |
| Wave transitions | Manual | Automatic |
| OOM retry | Manual | Automatic |
| Stale screen cleanup | Manual | Automatic |
| Teacher→student chain | Manual | Built-in |
| State persistence | No | Yes (JSON) |
| Resume on crash | No | Yes |
| Grid expansion | Manual | Declarative |
Rule: Use /run-experiment for ≤5 jobs. Use experiment-queue for ≥10 jobs or anything with phases.
Key Rules
- Never overlap screens on the same GPU — always wait for memory.used < 500 MiB before launching a new job
- Always write state to disk — every state change flushed to queue_state.json
- Idempotent scheduler — safe to restart; picks up from state file
- Expected-output-based completion — don't trust screen state alone; verify output file exists
- Bounded retry — max N OOM retries, then mark stuck and alert
- Dependencies enforced at launch — never launch student before teacher checkpoint exists
Known Failure Modes
- SSH connection drop during scheduling: scheduler keeps running on remote (nohup), just reconnect and check
- GPU reservation by another user: scheduler waits, does not pre-empt
- Disk full on remote: scheduler detects write failure, marks all pending jobs stuck, alerts
Example Session
User: "跑 T5+T6 全部实验:T5 = N∈{80,192} × n 4 values × seed {200,201}, T6 = N∈{384,512} × n 4 values × seed {42,200,201}; T6 需要先 train teacher"
Claude invokes /experiment-queue:
- Parses description into 2-phase manifest
- Phase 1: T5 (16 jobs, no teacher dependency) + T6 teacher training (2 jobs)
- Phase 2: T6 distillation (24 jobs, depends on teachers)
- Deploys scheduler via nohup
- Reports: "Scheduler PID 93534, total 42 jobs, estimated 6-7h wall-clock"
Then user can check anytime or wait for summary report.
See Also
- /run-experiment — single experiment deployment
- /monitor-experiment — check progress (now reads from queue_state.json)
- /analyze-results — post-hoc analysis
- tools/queue_manager.py (bundled) — the scheduler implementation
- tools/build_manifest.py (bundled) — build manifest from grid spec
Rationale / Source
Identified via 2026-04-16 post-mortem analysis (Codex GPT-5.4 xhigh) of a 1.5-day multi-seed paper experiment session:
- Wall-clock sink: stale screens, OOM, wave transitions, manual parser
- Token sink: re-writing orchestration code each session
- Cognitive sink: tracking which cells succeeded, which failed, which to retry
This skill targets the wall-clock sink specifically; see artifact-sync and paper-fix-auto-apply for the other two.