Claude-night-market cpu-gpu-performance
Establish CPU/GPU baselines before resource-intensive operations. Use for regression detection.
install
source · Clone the upstream repo
git clone https://github.com/athola/claude-night-market
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/athola/claude-night-market "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/conserve/skills/cpu-gpu-performance" ~/.claude/skills/athola-claude-night-market-cpu-gpu-performance && rm -rf "$T"
manifest:
plugins/conserve/skills/cpu-gpu-performance/SKILL.md
Table of Contents
- When to Use
- Required TodoWrite Items
- Step 1: Establish Current Baseline
- Step 2: Narrow the Scope
- Step 3: Instrument Before You Optimize
- Step 4: Throttle and Sequence Work
- Step 5: Log Decisions and Next Steps
- Output Expectations
CPU/GPU Performance Discipline
When To Use
- At the beginning of every session (auto-load alongside token-conservation).
- Whenever you plan to build, train, or test anything that could pin CPU cores or GPUs for more than a minute.
- Before retrying a failing command that previously consumed significant resources.
When NOT To Use
- Simple operations with no resource impact
- Quick single-file operations
Required TodoWrite Items
- cpu-gpu-performance:baseline
- cpu-gpu-performance:scope
- cpu-gpu-performance:instrument
- cpu-gpu-performance:throttle
- cpu-gpu-performance:log
Step 1: Establish Current Baseline
- Capture current utilization (a capture sketch follows this list):
  - uptime
  - ps -eo pcpu,cmd | head
  - nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
  Note which hosts/GPUs are already busy.
- Record any CI/cluster budgets (time quotas, GPU hours) before launching work.
- Set a per-task CPU-minute / GPU-minute budget that respects those limits.
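A minimal baseline capture, as a sketch (the log filename and GPU query fields are illustrative assumptions, not mandated by this skill):

    # Snapshot CPU load, top CPU consumers, and GPU state before launching heavy work.
    BASELINE_LOG="baseline-$(date +%Y%m%dT%H%M%S).log"   # assumed naming convention
    {
      echo "== uptime =="
      uptime
      echo "== top CPU consumers =="
      ps -eo pcpu,cmd --sort=-pcpu | head -n 10
      echo "== GPU utilization =="
      nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv 2>/dev/null \
        || echo "no NVIDIA GPU visible on this host"
    } | tee "$BASELINE_LOG"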
Step 2: Narrow the Scope
- Avoid running "whole world" jobs after a small fix. Prefer diff-based or tag-based selective testing (a diff-based sketch follows this step):
  - pytest -k <expression>
  - Bazel target patterns
  - cargo test <module>
- Batch low-level fixes so you can validate multiple changes with a single targeted command.
- For GPU jobs, favor unit-scale smoke inputs or lower epoch counts before scheduling the full training/eval sweep.
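One possible way to scope a rerun is diff-based selection; this sketch assumes a git checkout, pytest, and a tests/ directory, so adjust the base branch and paths to your repo:

    # Run only the test files that changed relative to the merge-base with main.
    CHANGED_TESTS=$(git diff --name-only origin/main...HEAD -- tests/ | grep '\.py$')
    if [ -n "$CHANGED_TESTS" ]; then
      # Intentionally unquoted so each changed file becomes a separate argument.
      pytest $CHANGED_TESTS
    else
      echo "No changed test files; fall back to a targeted -k expression."
    fi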
Step 3: Instrument Before You Optimize
- Pick the right profiler/monitor (a capture sketch follows this step):
  - CPU work: perf, Intel VTune, cargo flamegraph, language-specific profilers
  - GPU work: nvidia-smi dmon, nsys, nvprof, DLProf, framework timeline tracers
- Capture kernel/ops timelines, memory footprints, and data pipeline latency so you have evidence when throttling or parallelizing.
- Record hot paths + I/O bottlenecks in notes so future reruns can jump straight to the culprit.
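One way to capture that evidence for a GPU job, assuming nsys and nvidia-smi are installed (the training command and output names are placeholders):

    # Sample GPU utilization/memory once per second in the background.
    nvidia-smi dmon -s um -d 1 -o T > gpu-dmon.log &
    DMON_PID=$!
    # Wrap the workload in a Nsight Systems timeline capture (placeholder command).
    nsys profile -o train-profile python train.py --epochs 1
    kill "$DMON_PID"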
Step 4: Throttle and Sequence Work
- Use nice, ionice, or Kubernetes/Slurm quotas to prevent starvation of shared nodes.
- Chain heavy tasks with guardrails (see the staged sketch after this list):
- Rerun only the failed test/module
- Then (optionally) escalate to the next-wider shard
- Reserve the full suite for the final gate
- Stagger GPU kernels (smaller batch sizes or gradient accumulation) when memory pressure risks eviction; prefer checkpoint/restore over restarts.
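A staged, throttled rerun might look like this script sketch (the test paths and nice/ionice levels are illustrative):

    # Stage 1: rerun only the previously failing module, at low CPU and I/O priority.
    nice -n 10 ionice -c 2 -n 7 pytest tests/test_orders.py -x || exit 1
    # Stage 2: widen to the surrounding shard only after stage 1 passes.
    nice -n 10 pytest tests/orders/ || exit 1
    # Stage 3: the full suite stays reserved for the final gate.
    nice -n 10 pytest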
Step 5: Log Decisions and Next Steps
Conclude by documenting the commands that were run and their resource cost (duration, CPU%, GPU%), confirming whether they remained within the per-task budget. If a full suite or long training run was necessary, justify why selective or staged approaches were not feasible. Capture any follow-up tasks, such as adding a new test marker or profiling documentation, to simplify future sessions.
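For instance, a session note might read as follows; the angle-bracket fields are placeholders to fill in, not real measurements:

    cpu-gpu-performance log: <date>
    - Ran: pytest tests/test_orders.py -k test_refund (<duration>, <peak CPU%>, <GPU%>)
    - Budget: stayed within the per-task CPU/GPU-minute budget? <yes/no>
    - Justification if a full suite or long run was unavoidable: <reason or n/a>
    - Follow-up: <new test marker, profiling doc, etc.>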
Output Expectations
- Brief summary covering:
- baseline metrics
- scope chosen
- instrumentation captured
- throttling tactics
- follow-up items
- Concrete example(s) of what ran, e.g.:
  - "reran pytest tests/test_orders.py -k test_refund instead of pytest -m slow"
  - "profiled nvidia-smi dmon output to prove GPU idle time before scaling"
- "reran