# Auto-claude-code-research-in-sleep: monitor-experiment

Monitor running experiments, check progress, and collect results. Use when the user says "check results", "is it done", "monitor", or wants experiment output.
```bash
# Clone the full repository
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/monitor-experiment" ~/.claude/skills/wanshuiyin-auto-claude-code-research-in-sleep-monitor-experiment && rm -rf "$T"
```
`skills/monitor-experiment/SKILL.md`

# Monitor Experiment Results
Monitor: $ARGUMENTS
## Workflow
### Step 1: Check What's Running
SSH server:

```bash
ssh <server> "screen -ls"
```
Vast.ai instance (read `ssh_host` and `ssh_port` from `vast-instances.json`):

```bash
ssh -p <PORT> root@<HOST> "screen -ls"
```
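A minimal sketch for filling `<HOST>` and `<PORT>` automatically, assuming `vast-instances.json` is a JSON array of objects with `ssh_host` and `ssh_port` fields (the exact schema in your setup may differ):

```bash
# Hypothetical schema: [{"id": 12345, "ssh_host": "ssh4.vast.ai", "ssh_port": 23456}, ...]
HOST=$(jq -r '.[0].ssh_host' vast-instances.json)
PORT=$(jq -r '.[0].ssh_port' vast-instances.json)
ssh -p "$PORT" "root@$HOST" "screen -ls"
```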
Also check vast.ai instance status:

```bash
vastai show instances
```
Modal (when `gpu: modal` in CLAUDE.md):

```bash
modal app list         # List running/recent apps
modal app logs <app>   # Stream logs from a running app
```
Modal apps auto-terminate when done — if it's not in the list, it already finished. Check results via `modal volume ls <volume>` or local output.
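If results were written to a Modal volume, a sketch like this pulls them down for local inspection (the `/outputs` path and file name are placeholders, not part of the skill):

```bash
# List result files on the volume, then download one for local inspection
modal volume ls <volume> /outputs
modal volume get <volume> /outputs/results.json ./results.json
```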
### Step 2: Collect Output from Each Screen
For each screen session, capture the last N lines:

```bash
ssh <server> "screen -S <name> -X hardcopy /tmp/screen_<name>.txt && tail -50 /tmp/screen_<name>.txt"
```
If `hardcopy` fails, check for log files or output captured via `tee`.
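When several sessions are running, a loop like the following captures the tail of each (a sketch; it assumes session names contain no spaces):

```bash
# Capture the last 50 lines of every screen session on the server
ssh <server> '
for s in $(screen -ls | awk "/Detached|Attached/ {print \$1}"); do
  name=${s#*.}                                        # strip the leading pid
  screen -S "$s" -X hardcopy "/tmp/screen_$name.txt"  # dump the visible buffer
  echo "=== $name ==="
  tail -50 "/tmp/screen_$name.txt"
done
'
```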
### Step 3: Check for JSON Result Files
```bash
ssh <server> "ls -lt <results_dir>/*.json 2>/dev/null | head -20"
```
If JSON results exist, fetch and parse them:

```bash
ssh <server> "cat <results_dir>/<latest>.json"
```
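A minimal parse step, assuming the result JSON exposes top-level metric keys such as `eval_loss` (the key names are placeholders for whatever your training script writes):

```bash
# Pretty-print selected metrics from the fetched result file
ssh <server> "cat <results_dir>/<latest>.json" | python3 -c "
import json, sys
r = json.load(sys.stdin)
for k in ('eval_loss', 'eval_ppl', 'step'):
    print(k, r.get(k, 'N/A'))
"
```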
### Step 3.5: Pull W&B Metrics (when `wandb: true` in CLAUDE.md)

Skip this step entirely if `wandb: true` is not set, or if `wandb` is `false` in CLAUDE.md.
Pull training curves and metrics from Weights & Biases via the Python API:

```bash
# List recent runs in the project
ssh <server> "python3 -c \"
import wandb
api = wandb.Api()
runs = api.runs('<entity>/<project>', per_page=10)
for r in runs:
    print(f'{r.id} {r.state} {r.name} {r.summary.get(\"eval/loss\", \"N/A\")}')
\""

# Pull specific metrics from a run (last 50 steps)
ssh <server> "python3 -c \"
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run_id>')
history = list(run.scan_history(keys=['train/loss', 'eval/loss', 'eval/ppl', 'train/lr'], page_size=50))
print(json.dumps(history[-10:], indent=2))
\""

# Pull run summary (final metrics)
ssh <server> "python3 -c \"
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run_id>')
print(json.dumps(dict(run.summary), indent=2, default=str))
\""
```
What to extract:
- Training loss curve — is it converging? diverging? plateauing?
- Eval metrics — loss, PPL, accuracy at latest checkpoint
- Learning rate — is the schedule behaving as expected?
- GPU memory — any OOM risk?
- Run status — running / finished / crashed?
W&B dashboard link (include in summary for user): `https://wandb.ai/<entity>/<project>/runs/<run_id>`
This gives the auto-review-loop richer signal than just screen output — training dynamics, loss curves, and metric trends over time.
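As a quick automated check on those dynamics, something like the following flags NaN or rising loss, assuming the `scan_history` output above was saved to `history.json` (the file name and the 1.5x divergence threshold are assumptions, not part of the skill):

```bash
# Flag NaN or rising train loss in the last few W&B history rows
python3 -c "
import json, math
history = json.load(open('history.json'))  # saved output of the scan_history snippet
losses = [row['train/loss'] for row in history if row.get('train/loss') is not None]
if any(isinstance(l, float) and math.isnan(l) for l in losses):
    print('WARNING: NaN in train/loss')
elif len(losses) >= 2 and losses[-1] > losses[0] * 1.5:
    print('WARNING: train/loss rising (possible divergence)')
else:
    print('train/loss looks stable')
"
```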
### Step 4: Summarize Results
Present results in a comparison table:

| Experiment | Metric | Delta vs Baseline | Status |
|------------|--------|-------------------|--------|
| Baseline   | X.XX   | —                 | done   |
| Method A   | X.XX   | +Y.Y              | done   |
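A sketch for computing the delta column from two result files, assuming each JSON exposes an `eval_loss` field (a placeholder key):

```bash
# Delta of method vs. baseline on a single metric
jq -n --slurpfile b baseline.json --slurpfile m method_a.json \
  '{baseline: $b[0].eval_loss, method_a: $m[0].eval_loss, delta: ($m[0].eval_loss - $b[0].eval_loss)}'
```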
### Step 5: Interpret
- Compare against known baselines
- Flag unexpected results (negative delta, NaN, divergence)
- Suggest next steps based on findings
### Step 6: Feishu Notification (if configured)

After results are collected, check `~/.claude/feishu.json`:

- Send `experiment_done` notification: results summary table, delta vs baseline
- If config absent or mode `"off"`: skip entirely (no-op)
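A minimal sketch of that notification call, assuming `~/.claude/feishu.json` holds a custom-bot `webhook` URL and a `mode` field (the config schema here is a guess; the `msg_type: text` payload is the standard Feishu custom-bot format):

```bash
# Send a text notification through a Feishu custom-bot webhook
MODE=$(jq -r '.mode // "off"' ~/.claude/feishu.json 2>/dev/null)
if [ "$MODE" != "off" ]; then
  WEBHOOK=$(jq -r '.webhook' ~/.claude/feishu.json)
  curl -s -X POST "$WEBHOOK" \
    -H 'Content-Type: application/json' \
    -d '{"msg_type": "text", "content": {"text": "experiment_done: Method A eval_loss 1.23 (baseline 1.30, delta -0.07)"}}'
fi
```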
## Key Rules
- Always show raw numbers before interpretation
- Compare against the correct baseline (same config)
- Note if experiments are still running (check progress bars, iteration counts)
- If results look wrong, check training logs for errors before concluding
- Vast.ai cost awareness: When monitoring vast.ai instances, report the running cost (hours * $/hr from `vast-instances.json`). If all experiments on an instance are done, remind the user to run `/vast-gpu destroy <instance_id>` to stop billing
- Modal cost awareness: Modal auto-scales to zero — no idle billing. When reporting results from Modal runs, note the actual execution time and estimated cost (time * $/hr from the GPU tier used). No cleanup action needed
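A sketch of the running-cost arithmetic, assuming `vast-instances.json` records a `start_time` (epoch seconds) and `dph` (dollars per hour) per instance; both field names are hypothetical:

```bash
# Estimate the cost accrued so far by each tracked vast.ai instance
jq -r --argjson now "$(date +%s)" \
  '.[] | "\(.id): $\((($now - .start_time) / 3600 * .dph * 100 | round) / 100) so far"' \
  vast-instances.json
```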