Claude-code-plugins-plus coreweave-observability
Install

From source: clone the upstream repo.

```bash
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
```

Claude Code: install the skill into ~/.claude/skills/.

```bash
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/plugins/saas-packs/coreweave-pack/skills/coreweave-observability" \
       ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-coreweave-observability \
  && rm -rf "$T"
```

Manifest: plugins/saas-packs/coreweave-pack/skills/coreweave-observability/SKILL.md
CoreWeave Observability
Overview
CoreWeave runs GPU-intensive workloads on Kubernetes where hardware failures, memory exhaustion, and underutilization directly impact cost and reliability. Observability must cover DCGM GPU metrics, Kubernetes pod health, inference latency, and job completion rates. Proactive monitoring prevents wasted spend on idle GPUs and catches OOM conditions before they cascade.
Key Metrics
| Metric | Type | Target | Alert Threshold |
|---|---|---|---|
| GPU utilization | Gauge | > 60% | < 20% for 30m |
| GPU memory usage | Gauge | < 85% | > 95% for 5m |
| Inference latency p99 | Histogram | < 200ms | > 500ms |
| Job completion rate | Counter | > 99% | < 95% per hour |
| Pod restart count | Counter | 0 | > 3 in 15m |
| Node GPU temperature | Gauge | < 80 °C | > 85 °C for 10m |
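As a reference, the table rows map to PromQL roughly as follows. The `DCGM_*` series come from NVIDIA's DCGM exporter; the `inference_latency` histogram name is an assumption about your own application-level instrumentation, not a CoreWeave default.

```typescript
// PromQL expressions corresponding to the key metrics above.
// DCGM_* names are standard DCGM-exporter metrics; inference_latency_bucket
// is an assumed app-level histogram.
const metricQueries: Record<string, string> = {
  gpu_utilization: 'avg(DCGM_FI_DEV_GPU_UTIL)',
  gpu_memory_pct: 'avg(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))',
  inference_latency_p99: 'histogram_quantile(0.99, sum(rate(inference_latency_bucket[5m])) by (le))',
  pod_restarts_15m: 'increase(kube_pod_container_status_restarts_total[15m])',
  gpu_temperature: 'max(DCGM_FI_DEV_GPU_TEMP)',
};
```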
Instrumentation
```typescript
// Assumes a `metrics` client in scope that exposes record() and increment().
async function trackInference(model: string, fn: () => Promise<any>) {
  const start = Date.now();
  try {
    const result = await fn();
    // Record latency and a completion count on success.
    metrics.record('coreweave.inference.latency', Date.now() - start, { model, status: 'ok' });
    metrics.increment('coreweave.inference.completed', { model });
    return result;
  } catch (err) {
    // Count failures by error code, then rethrow so callers still see the error.
    metrics.increment('coreweave.inference.errors', { model, error: err.code });
    throw err;
  }
}
```
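A call site wraps each model invocation in the tracker. The `inferenceClient.generate` call below is a hypothetical stand-in for your own client:

```typescript
// Hypothetical inference client; substitute your own model invocation.
const output = await trackInference('llama-3-70b', () =>
  inferenceClient.generate({ prompt: 'Summarize the runbook.' })
);
```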
Health Check Dashboard
```typescript
async function coreweaveHealth(): Promise<Record<string, string>> {
  const gpu = await queryPrometheus('avg(DCGM_FI_DEV_GPU_UTIL)');
  const mem = await queryPrometheus('avg(DCGM_FI_DEV_FB_USED/(DCGM_FI_DEV_FB_USED+DCGM_FI_DEV_FB_FREE))');
  const pods = await queryPrometheus('kube_deployment_status_replicas_available{namespace="inference"}');
  return {
    gpu_utilization: gpu > 20 ? 'healthy' : 'underutilized',
    gpu_memory: mem < 0.9 ? 'healthy' : 'critical',
    inference_pods: pods > 0 ? 'healthy' : 'down',
  };
}
```
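`queryPrometheus` is not defined above. A minimal sketch against the standard Prometheus HTTP API (`/api/v1/query`), assuming a `PROMETHEUS_URL` environment variable and Node 18+ for global `fetch`, might look like:

```typescript
// Minimal sketch: run an instant PromQL query and return the first scalar value.
// Assumes PROMETHEUS_URL points at the cluster's Prometheus endpoint.
async function queryPrometheus(query: string): Promise<number> {
  const url = `${process.env.PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(query)}`;
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Prometheus query failed: ${res.status}`);
  const body = await res.json();
  // Instant-vector results arrive as [timestamp, "value"] pairs.
  const value = body.data?.result?.[0]?.value?.[1];
  return value !== undefined ? parseFloat(value) : NaN;
}
```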
Alerting Rules
```typescript
const alerts = [
  { metric: 'DCGM_FI_DEV_GPU_UTIL', condition: 'avg < 20', window: '30m', severity: 'warning' },
  { metric: 'gpu_memory_pct', condition: '> 0.95', window: '5m', severity: 'critical' },
  { metric: 'inference_latency_p99', condition: '> 500ms', window: '10m', severity: 'warning' },
  { metric: 'pod_restart_count', condition: '> 3', window: '15m', severity: 'critical' },
];
```
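These entries are declarative. A sketch of evaluating the first row, reusing the `queryPrometheus` helper above; the `notify` function is a hypothetical pager hook, not part of any CoreWeave API:

```typescript
// Hypothetical notifier: wire this to PagerDuty, Slack, etc.
async function notify(severity: string, message: string): Promise<void> {
  console.error(JSON.stringify({ severity, message, timestamp: new Date().toISOString() }));
}

// Evaluate the first alert row: warn when GPU utilization averages below 20% for 30m.
async function checkGpuIdleAlert() {
  // avg across GPUs of each GPU's 30m average, matching condition 'avg < 20' over window '30m'.
  const util = await queryPrometheus('avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]))');
  if (util < 20) {
    await notify('warning', `GPU utilization averaged ${util.toFixed(1)}% over the last 30m`);
  }
}
```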
Structured Logging
```typescript
function logGpuEvent(event: string, node: string, data: Record<string, any>) {
  console.log(JSON.stringify({
    service: 'coreweave',
    event,
    node,
    gpu_model: data.gpu_model,
    utilization: data.util,
    memory_pct: data.memPct,
    temperature: data.temp,
    timestamp: new Date().toISOString(),
  }));
}
```
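For instance, emitting a thermal event (the field values here are illustrative, not real telemetry):

```typescript
// Illustrative call; values are made up for the example.
logGpuEvent('gpu_thermal_warning', 'gpu-node-a1', {
  gpu_model: 'H100',
  util: 97,
  memPct: 0.88,
  temp: 86,
});
```

Each call prints one JSON object per line, which log aggregators can ingest without extra parsing.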
Error Handling
| Signal | Meaning | Action |
|---|---|---|
| GPU util < 20% sustained | Idle GPUs burning cost | Scale down or reassign workload |
| GPU memory > 95% | OOM imminent | Reduce batch size or add nodes |
| Pod CrashLoopBackOff | Driver or config failure | Check DCGM logs, restart node |
| Inference latency spike | Contention or throttling | Review GPU temp and queue depth |
| Node NotReady | Hardware or network issue | Cordon node, migrate pods |
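A small dispatcher can make the last two rows actionable against the health map from `coreweaveHealth()`. This is a sketch under stated assumptions: `pageOncall` is a hypothetical pager hook, and `cordonNode` assumes `kubectl` is on PATH with cluster credentials:

```typescript
import { execFileSync } from 'node:child_process';

// Hypothetical pager hook; replace with your alerting integration.
async function pageOncall(message: string): Promise<void> {
  console.error(JSON.stringify({ severity: 'page', message }));
}

// Mark the node unschedulable via kubectl (assumes kubectl is configured).
function cordonNode(node: string): void {
  execFileSync('kubectl', ['cordon', node]);
}

// React to the health map returned by coreweaveHealth().
async function remediate(health: Record<string, string>, node: string) {
  if (health.gpu_memory === 'critical') {
    // OOM imminent: page before pods start dying.
    await pageOncall(`GPU memory critical on ${node}: reduce batch size or add nodes`);
  }
  if (health.inference_pods === 'down') {
    // No replicas available: cordon the node so the scheduler migrates work.
    cordonNode(node);
    await pageOncall(`Inference pods down; cordoned ${node}`);
  }
}
```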
Next Steps
For incident response, see coreweave-incident-runbook.