Auto-deep-researcher-24x7 gpu-monitor

Check GPU status, running experiments, and available resources

install
source · Clone the upstream repo
git clone https://github.com/Xiangyue-Zhang/auto-deep-researcher-24x7
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Xiangyue-Zhang/auto-deep-researcher-24x7 "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/gpu-monitor" ~/.claude/skills/xiangyue-zhang-auto-deep-researcher-24x7-gpu-monitor && rm -rf "$T"
manifest: skills/gpu-monitor/SKILL.md
source content

/gpu-monitor

Quick GPU status check for experiment management.

Usage

/gpu-monitor
/gpu-monitor --server user@remote-host

Behavior

  1. Run
    nvidia-smi
    to get current GPU status
  2. Display a clean summary table:
    • GPU ID, Name, Memory (used/total), Utilization %, Temperature
    • Running processes on each GPU
  3. Identify which GPUs are free (< 1GB memory used)
  4. Identify which GPUs are running experiments (check for python/torchrun processes)
  5. If
    --server
    is provided, SSH to remote server first

Output Format

GPU Status
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 GPU  Name          Memory         Util  Temp
  0   L20X 144GB    45123/147456   98%   72°C  ← training (PID 12345)
  1   L20X 144GB      234/147456    0%   35°C  ← FREE
  2   L20X 144GB    43210/147456   95%   70°C  ← training (PID 12346)
  3   L20X 144GB     1024/147456   12%   40°C  ← keeper

Free GPUs: [1]
Training: GPU 0 (PID 12345), GPU 2 (PID 12346)