Skillshub coreweave-observability

install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/coreweave-observability" ~/.claude/skills/comeonoliver-skillshub-coreweave-observability && rm -rf "$T"
manifest: skills/jeremylongshore/claude-code-plugins-plus-skills/coreweave-observability/SKILL.md
source content

CoreWeave Observability

GPU Metrics (DCGM Exporter)

CKS (CoreWeave Kubernetes Service) clusters come with the NVIDIA DCGM exporter pre-installed. Key metrics:

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU core utilization (%) |
| DCGM_FI_DEV_FB_USED | GPU framebuffer memory used (MiB) |
| DCGM_FI_DEV_FB_FREE | GPU framebuffer memory free (MiB) |
| DCGM_FI_DEV_POWER_USAGE | Power consumption (W) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
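
To show how these gauges combine, the sketch below parses a few lines of exporter output in the Prometheus text format and computes the per-GPU memory fraction, used / (used + free). The sample text, its values, and the helper names `parse_dcgm` and `memory_fraction` are illustrative assumptions, not part of the skill; a real exporter attaches more labels to each sample than the single `gpu` label shown here.

```python
import re
from collections import defaultdict

# Illustrative sample only: values and labels are made up for this sketch.
SAMPLE = """\
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used.
DCGM_FI_DEV_FB_USED{gpu="0"} 78000
DCGM_FI_DEV_FB_FREE{gpu="0"} 2000
DCGM_FI_DEV_FB_USED{gpu="1"} 10000
DCGM_FI_DEV_FB_FREE{gpu="1"} 70000
"""

# Matches lines of the form: NAME{labels} VALUE
LINE_RE = re.compile(r'^(DCGM_\w+)\{([^}]*)\}\s+(\S+)$')
GPU_RE = re.compile(r'gpu="([^"]+)"')


def parse_dcgm(text):
    """Return {gpu_id: {metric_name: value}} for DCGM samples in text."""
    out = defaultdict(dict)
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # comments, blank lines, and non-DCGM metrics
        name, labels, value = m.groups()
        gpu = GPU_RE.search(labels)
        if gpu:
            out[gpu.group(1)][name] = float(value)
    return dict(out)


def memory_fraction(sample):
    """Fraction of framebuffer memory in use for one GPU's samples."""
    used = sample["DCGM_FI_DEV_FB_USED"]
    free = sample["DCGM_FI_DEV_FB_FREE"]
    return used / (used + free)


gpus = parse_dcgm(SAMPLE)
fractions = {gpu: memory_fraction(s) for gpu, s in gpus.items()}
for gpu, frac in sorted(fractions.items()):
    print(f"gpu {gpu}: {frac:.1%} of framebuffer memory in use")
```

The simple regex assumes label values contain no `}` character; for production use, a proper exposition-format parser would be a safer choice.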

Prometheus Alert Rules

groups:
  - name: coreweave-gpu
    rules:
      - alert: GPUUtilizationLow
        expr: avg(DCGM_FI_DEV_GPU_UTIL) < 20
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "GPU utilization below 20% for 30min -- consider scaling down"

      - alert: GPUMemoryHigh
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "GPU memory >95% -- risk of OOM"

      - alert: InferencePodDown
        expr: kube_deployment_status_replicas_available{deployment=~".*inference.*"} == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Inference deployment has no available replicas for 2min"
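
The instantaneous conditions behind these three rules can be mirrored in plain Python to sanity-check thresholds against sample values. This is a minimal sketch, not how Prometheus evaluates rules: the function names and inputs are hypothetical, and the `for:` durations are ignored.

```python
import re
from statistics import mean


def gpu_utilization_low(utils, threshold=20.0):
    """Mirror of avg(DCGM_FI_DEV_GPU_UTIL) < 20.

    The average is taken across all GPUs, so one busy GPU can mask idle ones.
    """
    return mean(utils) < threshold


def gpu_memory_high(used, free, threshold=0.95):
    """Mirror of FB_USED / (FB_USED + FB_FREE) > 0.95 for a single GPU."""
    total = used + free
    if total == 0:
        return False  # PromQL yields NaN for 0/0, and NaN > 0.95 is false
    return used / total > threshold


def inference_pod_down(available_by_deployment):
    """Deployments matching .*inference.* with zero available replicas.

    Prometheus regex matchers are fully anchored, hence re.fullmatch.
    """
    return sorted(
        name
        for name, available in available_by_deployment.items()
        if re.fullmatch(r".*inference.*", name) and available == 0
    )


# Hypothetical sample values for illustration.
print(gpu_utilization_low([10, 15, 5, 30]))   # mean is 15
print(gpu_utilization_low([95, 5, 5, 5]))     # mean is 27.5
print(gpu_memory_high(78000, 2000))           # ratio is 0.975
print(inference_pod_down({"llm-inference": 0, "inference-api": 2, "trainer": 0}))
```

The second utilization call illustrates a design consequence of the cluster-wide `avg()`: three nearly idle GPUs do not trigger the alert while a fourth is busy. If per-node or per-GPU alerting is wanted, grouping the average by a label (for example `avg by (Hostname)`, assuming that label is present on the series) would be one option.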

Next Steps

For incident response, see coreweave-incident-runbook.