# claude-code-plugins-plus-skills · coreweave-incident-runbook
## Install

Clone the upstream repo:

```bash
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
```

Or install just this skill into `~/.claude/skills/` for Claude Code:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/coreweave-pack/skills/coreweave-incident-runbook" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-coreweave-incident-runbook && rm -rf "$T"
```
Manifest: `plugins/saas-packs/coreweave-pack/skills/coreweave-incident-runbook/SKILL.md`
# CoreWeave Incident Runbook

## Triage Steps
```bash
# 1. Check pod status
kubectl get pods -l app=inference -o wide

# 2. Check recent events
kubectl get events --sort-by=.lastTimestamp | tail -20

# 3. Check node status
kubectl get nodes -l gpu.nvidia.com/class -o wide

# 4. Check GPU health on the first inference pod
kubectl exec -it $(kubectl get pod -l app=inference -o name | head -1) -- nvidia-smi
```
## Common Incidents

### Inference Service Down
- Check pod status and events
- If OOMKilled: reduce batch size or upgrade GPU
- If ImagePullBackOff: check registry credentials
- If Pending: check GPU quota and availability
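The checks above can be sketched as two diagnostic commands. The label `app=inference` comes from the triage steps; the column paths are general kubectl patterns, not CoreWeave-specific:

```bash
# Show each pod's phase plus waiting/terminated reasons:
# ImagePullBackOff appears under state.waiting, OOMKilled under
# lastState.terminated, and Pending pods show an empty reason.
kubectl get pods -l app=inference \
  -o custom-columns='POD:.metadata.name,PHASE:.status.phase,WAITING:.status.containerStatuses[*].state.waiting.reason,LASTEXIT:.status.containerStatuses[*].lastState.terminated.reason'

# For Pending pods, the scheduler's reason (e.g. insufficient GPU
# quota or no matching node) is recorded in the pod's events.
kubectl describe pod -l app=inference | grep -A5 'Events:'
```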
### GPU Node Failure
- Pods will be rescheduled automatically
- If no capacity: scale down non-critical workloads
- Contact CoreWeave support for extended outages
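When rescheduling stalls, the manual version of these steps looks roughly like this; `<failed-node>` is a placeholder and `batch-worker` is a hypothetical non-critical deployment:

```bash
# Cordon the failed GPU node so nothing new schedules onto it,
# then drain it so its pods reschedule onto healthy nodes.
kubectl cordon <failed-node>
kubectl drain <failed-node> --ignore-daemonsets --delete-emptydir-data

# If there is no spare GPU capacity, free some by scaling down a
# non-critical workload (hypothetical deployment name).
kubectl scale deployment/batch-worker --replicas=0
```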
### Model Loading Failure
- Check HuggingFace token secret exists
- Verify model name spelling
- Check PVC has sufficient storage
- Review container logs for download errors
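The checklist above maps to a few commands. The secret name `hf-token` and PVC name `model-cache` are assumptions; substitute the names from your deployment manifests:

```bash
# 1. Confirm the HuggingFace token secret exists (hypothetical name).
kubectl get secret hf-token

# 2. Check the PVC is Bound and how much storage it actually has
#    (hypothetical PVC name).
kubectl get pvc model-cache \
  -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,CAPACITY:.status.capacity.storage'

# 3. Scan recent container logs for download failures: 401/403
#    usually means a bad token, "No space left on device" points
#    at the PVC, and a 404 suggests a misspelled model name.
kubectl logs -l app=inference --tail=100 | grep -iE 'error|denied|not found|space'
```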
## Rollback

```bash
kubectl rollout undo deployment/inference
```
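After the undo, it helps to confirm the rollout converged and to know which revision you landed on; these are standard `kubectl rollout` subcommands:

```bash
# Block until all replicas of the rolled-back version are ready.
kubectl rollout status deployment/inference

# List past revisions; undo can target a specific one if the
# immediately previous revision is also bad.
kubectl rollout history deployment/inference
kubectl rollout undo deployment/inference --to-revision=2  # example revision number
```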
## Next Steps

For data handling, see `coreweave-data-handling`.