Skillshub coreweave-observability
install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/coreweave-observability" ~/.claude/skills/comeonoliver-skillshub-coreweave-observability && rm -rf "$T"
manifest: skills/jeremylongshore/claude-code-plugins-plus-skills/coreweave-observability/SKILL.md
CoreWeave Observability
GPU Metrics (DCGM Exporter)
CoreWeave Kubernetes Service (CKS) clusters ship with the NVIDIA DCGM exporter pre-installed. Key metrics:
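| Metric | Description |
|---|---|
| `DCGM_FI_DEV_GPU_UTIL` | GPU core utilization (%) |
| `DCGM_FI_DEV_FB_USED` | GPU memory used (MiB) |
| `DCGM_FI_DEV_FB_FREE` | GPU memory free (MiB) |
| `DCGM_FI_DEV_POWER_USAGE` | Power consumption (W) |
| `DCGM_FI_DEV_GPU_TEMP` | GPU temperature (°C) |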
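To make the metrics concrete, here is a minimal Python sketch that parses Prometheus exposition text and derives a per-GPU memory-used fraction from the framebuffer metrics. The sample text, the `gpu` label, and the helper names `parse_dcgm` and `memory_used_fraction` are assumptions for illustration; real exporter output carries more labels and should be checked against your cluster.

```python
import re

# Hypothetical sample of Prometheus exposition text. Real DCGM exporter
# output carries more labels (UUID, hostname, pod, ...); values are made up.
SAMPLE = """\
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
DCGM_FI_DEV_FB_USED{gpu="0"} 60000
DCGM_FI_DEV_FB_FREE{gpu="0"} 20000
DCGM_FI_DEV_GPU_UTIL{gpu="1"} 12
DCGM_FI_DEV_FB_USED{gpu="1"} 1000
DCGM_FI_DEV_FB_FREE{gpu="1"} 79000
"""

# Matches `NAME{labels} value`; lines with a trailing timestamp are skipped.
LINE = re.compile(r'^(DCGM_\w+)\{([^}]*)\}\s+(\S+)$')
GPU_LABEL = re.compile(r'gpu="([^"]*)"')


def parse_dcgm(text):
    """Return {gpu_id: {metric_name: value}} for DCGM_* sample lines."""
    gpus = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE.match(line)
        if not match:
            continue
        name, labels, value = match.groups()
        gpu = GPU_LABEL.search(labels)
        if not gpu:
            continue  # assumption: every sample of interest has a gpu label
        gpus.setdefault(gpu.group(1), {})[name] = float(value)
    return gpus


def memory_used_fraction(metrics):
    """used / (used + free), the same ratio the memory alert relies on."""
    used = metrics["DCGM_FI_DEV_FB_USED"]
    free = metrics["DCGM_FI_DEV_FB_FREE"]
    total = used + free
    return used / total if total else 0.0


if __name__ == "__main__":
    for gpu_id, metrics in sorted(parse_dcgm(SAMPLE).items()):
        print(gpu_id, metrics["DCGM_FI_DEV_GPU_UTIL"], memory_used_fraction(metrics))
```

In practice these values are usually read through Prometheus queries rather than by scraping the exporter directly; the sketch only shows how the raw series relate to each other.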
Prometheus Alert Rules
groups:
  - name: coreweave-gpu
    rules:
      - alert: GPUUtilizationLow
        expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "GPU utilization below 20% for 30min -- consider scaling down"
      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "GPU memory >95% -- risk of OOM"
      - alert: InferencePodDown
        expr: kube_deployment_status_replicas_available{deployment=~".*inference.*"} == 0
        for: 2m
        labels: { severity: critical }
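The threshold arithmetic in the two GPU rules can be checked offline before deploying them. The following Python sketch mirrors the two expressions for a single instant; the function names and sample values are hypothetical, and the `for:` hold duration is deliberately not modeled, so a `True` result means only that the condition holds at that instant, not that the alert would fire.

```python
from statistics import mean


def gpu_utilization_low(utilizations, threshold=20.0):
    """Instant check mirroring `avg(DCGM_FI_DEV_GPU_UTIL) < 20`.

    An empty input returns False, like a PromQL query with no series.
    """
    return bool(utilizations) and mean(utilizations) < threshold


def gpu_memory_high(used, free, threshold=0.95):
    """Instant check mirroring `FB_USED / (FB_USED + FB_FREE) > 0.95`."""
    total = used + free
    return total > 0 and used / total > threshold


if __name__ == "__main__":
    # Hypothetical readings: three GPUs' utilization, one GPU's framebuffer.
    print(gpu_utilization_low([10, 15, 30]))
    print(gpu_memory_high(77000, 3000))
```

Note that `GPUUtilizationLow` averages across every GPU series Prometheus sees, so a single busy GPU can mask several idle ones; that follows from the `avg(...)` in the rule as written.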
Resources
Next Steps
For incident response, see coreweave-incident-runbook.