Awesome-Agent-Skills-for-Empirical-Research monitor-experiment

Monitor running experiments, check progress, collect results. Use when user says "check results", "is it done", "monitor", or wants experiment output.

install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/42-wanshuiyin-ARIS/skills/monitor-experiment" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-monitor-experimen && rm -rf "$T"
manifest: skills/42-wanshuiyin-ARIS/skills/monitor-experiment/SKILL.md
source content

Monitor Experiment Results

Monitor: $ARGUMENTS

Workflow

Step 1: Check What's Running

SSH server:

ssh <server> "screen -ls"

Vast.ai instance (read

ssh_host
,
ssh_port
from
vast-instances.json
):

ssh -p <PORT> root@<HOST> "screen -ls"

Also check vast.ai instance status:

vastai show instances

Step 2: Collect Output from Each Screen

For each screen session, capture the last N lines:

ssh <server> "screen -S <name> -X hardcopy /tmp/screen_<name>.txt && tail -50 /tmp/screen_<name>.txt"

If hardcopy fails, check for log files or tee output.

Step 3: Check for JSON Result Files

ssh <server> "ls -lt <results_dir>/*.json 2>/dev/null | head -20"

If JSON results exist, fetch and parse them:

ssh <server> "cat <results_dir>/<latest>.json"

Step 3.5: Pull W&B Metrics (when
wandb: true
in CLAUDE.md)

Skip this step entirely if

wandb
is not set or is
false
in CLAUDE.md.

Pull training curves and metrics from Weights & Biases via Python API:

# List recent runs in the project
ssh <server> "python3 -c \"
import wandb
api = wandb.Api()
runs = api.runs('<entity>/<project>', per_page=10)
for r in runs:
    print(f'{r.id}  {r.state}  {r.name}  {r.summary.get(\"eval/loss\", \"N/A\")}')
\""

# Pull specific metrics from a run (last 50 steps)
ssh <server> "python3 -c \"
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run_id>')
history = list(run.scan_history(keys=['train/loss', 'eval/loss', 'eval/ppl', 'train/lr'], page_size=50))
print(json.dumps(history[-10:], indent=2))
\""

# Pull run summary (final metrics)
ssh <server> "python3 -c \"
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run_id>')
print(json.dumps(dict(run.summary), indent=2, default=str))
\""

What to extract:

  • Training loss curve — is it converging? diverging? plateauing?
  • Eval metrics — loss, PPL, accuracy at latest checkpoint
  • Learning rate — is the schedule behaving as expected?
  • GPU memory — any OOM risk?
  • Run status — running / finished / crashed?

W&B dashboard link (include in summary for user):

https://wandb.ai/<entity>/<project>/runs/<run_id>

This gives the auto-review-loop richer signal than just screen output — training dynamics, loss curves, and metric trends over time.

Step 4: Summarize Results

Present results in a comparison table:

| Experiment | Metric | Delta vs Baseline | Status |
|-----------|--------|-------------------|--------|
| Baseline  | X.XX   | —                 | done   |
| Method A  | X.XX   | +Y.Y              | done   |

Step 5: Interpret

  • Compare against known baselines
  • Flag unexpected results (negative delta, NaN, divergence)
  • Suggest next steps based on findings

Step 6: Feishu Notification (if configured)

After results are collected, check

~/.claude/feishu.json
:

  • Send
    experiment_done
    notification: results summary table, delta vs baseline
  • If config absent or mode
    "off"
    : skip entirely (no-op)

Key Rules

  • Always show raw numbers before interpretation
  • Compare against the correct baseline (same config)
  • Note if experiments are still running (check progress bars, iteration counts)
  • If results look wrong, check training logs for errors before concluding
  • Vast.ai cost awareness: When monitoring vast.ai instances, report the running cost (hours * $/hr from
    vast-instances.json
    ). If all experiments on an instance are done, remind the user to run
    /vast-gpu destroy <instance_id>
    to stop billing