# Skillshub · coreweave-core-workflow-b

## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/ComeOnOliver/skillshub
```

Claude Code · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/coreweave-core-workflow-b" \
       ~/.claude/skills/comeonoliver-skillshub-coreweave-core-workflow-b \
  && rm -rf "$T"
```

Manifest: `skills/jeremylongshore/claude-code-plugins-plus-skills/coreweave-core-workflow-b/SKILL.md`
# CoreWeave Core Workflow: GPU Training

## Overview

Run distributed GPU training on CoreWeave: single-node multi-GPU and multi-node jobs with PyTorch DDP, Slurm-on-Kubernetes, and shared storage.

## Prerequisites
- CKS cluster with multi-GPU node pools (8xA100 or 8xH100)
- Shared storage (CoreWeave PVC or NFS)
- Training container with PyTorch and NCCL
## Instructions

### Step 1: Single-Node Multi-GPU Training
```yaml
# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: ghcr.io/myorg/trainer:latest
          command: ["torchrun"]
          args:
            - "--nproc_per_node=8"
            - "train.py"
            - "--model_name=meta-llama/Llama-3.1-8B"
            - "--batch_size=4"
            - "--epochs=3"
          resources:
            limits:
              nvidia.com/gpu: "8"
              memory: 512Gi
              cpu: "64"
          volumeMounts:
            - name: data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: training-data
        - name: checkpoints
          persistentVolumeClaim:
            claimName: model-checkpoints
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu.nvidia.com/class
                    operator: In
                    values: ["A100_NVLINK_A100_SXM4_80GB"]
```
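The Job's container runs `torchrun --nproc_per_node=8 train.py ...`, so `train.py` must initialize DDP from the environment variables torchrun sets. The script itself is not part of this skill; below is a minimal sketch of what a compatible entrypoint could look like. The model, dataset, and JSON-lines `training_log.json` format are placeholder assumptions (a real job would load `--model_name` and read from the `/data` mount); only the DDP wiring mirrors the manifest above.

```python
# train.py -- minimal torchrun-compatible DDP entrypoint (illustrative sketch).
import argparse
import json
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", default="meta-llama/Llama-3.1-8B")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--epochs", type=int, default=3)
    args = parser.parse_args()

    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process;
    # NCCL is the backend for GPU collectives.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholders: a real job would load args.model_name and read
    # training data from the /data PVC mount.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
    sampler = DistributedSampler(dataset)  # shards the data across ranks
    loader = DataLoader(dataset, batch_size=args.batch_size, sampler=sampler)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(args.epochs):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # DDP all-reduces gradients here
            optimizer.step()

        # Only rank 0 writes to the /checkpoints PVC; Step 3's monitoring
        # commands read this file. The JSON-lines format is an assumption.
        if dist.get_rank() == 0:
            torch.save(model.module.state_dict(), f"/checkpoints/epoch_{epoch}.pt")
            with open("/checkpoints/training_log.json", "a") as f:
                f.write(json.dumps({"epoch": epoch, "loss": loss.item()}) + "\n")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With `--nproc_per_node=8`, torchrun launches eight of these processes, one per GPU, and NCCL handles the gradient all-reduce between them.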
### Step 2: Persistent Storage for Training Data
```yaml
# storage.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 500Gi
  storageClassName: shared-hdd-ord1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-checkpoints
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 200Gi
  storageClassName: shared-ssd-ord1
```
### Step 3: Monitor Training Progress
```bash
# Watch training logs
kubectl logs -f job/llm-finetune

# Check GPU utilization
kubectl exec -it $(kubectl get pod -l job-name=llm-finetune -o name) -- nvidia-smi

# Check training metrics
kubectl exec -it $(kubectl get pod -l job-name=llm-finetune -o name) -- \
  cat /checkpoints/training_log.json | tail -5
```
## Error Handling
| Error | Cause | Solution |
|---|---|---|
| NCCL timeout | Network issue between GPUs | Use NVLink nodes (SXM4/SXM5) |
| OOMKilled | Batch size too large | Reduce batch size or use gradient accumulation (see the sketch below) |
| Checkpoint save failed | PVC full | Increase storage or prune old checkpoints |
| Job evicted | Preemption | Use on-demand nodes for training |
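On the OOMKilled row: gradient accumulation trades memory for steps by running several small micro-batches per optimizer update, so the effective batch size grows without the extra GPU memory. A minimal sketch (the toy model, data, and `accum_steps=8` are illustrative, not from this skill):

```python
import torch

# Toy model/data; the point is the optimizer-step cadence, not the task.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()
accum_steps = 8  # effective batch = micro-batch size x accum_steps

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(4, 1024, device="cuda")    # micro-batch small enough to fit
    y = torch.randn(4, 1024, device="cuda")
    loss = loss_fn(model(x), y) / accum_steps  # scale so summed grads average out
    loss.backward()                            # grads accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one update per accum_steps micro-batches
        optimizer.zero_grad()
```

Under DDP, wrapping the non-stepping micro-batches in the model's `no_sync()` context manager also skips the redundant gradient all-reduces.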
## Next Steps

For troubleshooting, see `coreweave-common-errors`.