Skillshub coreweave-common-errors
install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/coreweave-common-errors" ~/.claude/skills/comeonoliver-skillshub-coreweave-common-errors && rm -rf "$T"
manifest:
skills/jeremylongshore/claude-code-plugins-plus-skills/coreweave-common-errors/SKILL.mdsource content
CoreWeave Common Errors
Error Reference
1. Pod Stuck Pending -- No GPU Available
kubectl describe pod <pod-name> | grep -A5 Events # "0/N nodes are available: insufficient nvidia.com/gpu"
Fix: Check GPU availability:
kubectl get nodes -l gpu.nvidia.com/class=A100_PCIE_80GB. Try a different GPU type or region.
2. CUDA Out of Memory
torch.cuda.OutOfMemoryError: CUDA out of memory
Fix: Reduce batch size, enable gradient checkpointing, or use a larger GPU (A100-80GB instead of 40GB).
3. Image Pull BackOff
Fix: Create an imagePullSecret:
kubectl create secret docker-registry regcred \ --docker-server=ghcr.io \ --docker-username=$GH_USER \ --docker-password=$GH_TOKEN
4. NCCL Timeout (Multi-GPU)
NCCL error: unhandled system error
Fix: Ensure all GPUs are on the same node (NVLink). For multi-node, use InfiniBand-connected nodes.
5. PVC Not Mounting
Fix: Check storage class availability:
kubectl get sc. Use CoreWeave storage classes like shared-hdd-ord1 or shared-ssd-ord1.
6. Node Affinity Mismatch
Fix: List valid GPU class labels:
kubectl get nodes -o json | jq -r '.items[].metadata.labels["gpu.nvidia.com/class"]' | sort -u
7. Service Not Reachable
Fix: Check Service and Endpoints:
kubectl get svc,endpoints <service-name>
Resources
Next Steps
For diagnostics, see
coreweave-debug-bundle.