Skillshub coreweave-core-workflow-a
Install

Source · Clone the upstream repo

```bash
git clone https://github.com/ComeOnOliver/skillshub
```

Claude Code · Install into ~/.claude/skills/

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/coreweave-core-workflow-a" ~/.claude/skills/comeonoliver-skillshub-coreweave-core-workflow-a && rm -rf "$T"
```
manifest: skills/jeremylongshore/claude-code-plugins-plus-skills/coreweave-core-workflow-a/SKILL.md
CoreWeave Core Workflow: KServe Inference
Overview
Deploy production inference services on CoreWeave using KServe InferenceService with GPU scheduling, autoscaling, and scale-to-zero. CoreWeave Kubernetes Service (CKS) natively integrates with KServe for serverless GPU inference.
Prerequisites
- Completed coreweave-install-auth setup
- KServe available on your CKS cluster
- Model stored in S3, GCS, or HuggingFace
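The manifest in Step 1 pulls a gated Hugging Face model and reads the token from a Secret named hf-token (key: token). If you are serving a gated model, create that Secret before deploying; a minimal sketch, assuming your token is exported as HF_TOKEN:

```bash
# Create the hf-token Secret referenced by the InferenceService env block below.
# Assumes the Hugging Face token is already exported as HF_TOKEN.
kubectl create secret generic hf-token \
  --from-literal=token="${HF_TOKEN}"
```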
Instructions
Step 1: Deploy an InferenceService
```yaml
# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-inference
  annotations:
    autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
    autoscaling.knative.dev/metric: "concurrency"
    autoscaling.knative.dev/target: "1"
    autoscaling.knative.dev/minScale: "1"
    autoscaling.knative.dev/maxScale: "5"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    containers:
      - name: kserve-container
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.1-8B-Instruct"
          - "--port"
          - "8080"
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 48Gi
            cpu: "8"
          requests:
            nvidia.com/gpu: "1"
            memory: 32Gi
            cpu: "4"
        env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token
                key: token
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: gpu.nvidia.com/class
                  operator: In
                  values: ["A100_PCIE_80GB"]
```
```bash
kubectl apply -f inference-service.yaml
kubectl get inferenceservice llama-inference -w
```
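Readiness can take a few minutes while the image pulls and the model weights download. A couple of generic checks while you wait; the pod label below is the one KServe normally attaches to predictor pods, so adjust it if your cluster labels differ:

```bash
# Confirm the predictor pod was scheduled onto a GPU node and is pulling the image
kubectl get pods -l serving.kserve.io/inferenceservice=llama-inference -o wide

# Inspect conditions and events if the service stays not-ready
kubectl describe inferenceservice llama-inference
```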
Step 2: Scale-to-Zero Configuration
```yaml
# For dev/staging -- scale down to zero when idle
metadata:
  annotations:
    autoscaling.knative.dev/minScale: "0"   # Scale to zero
    autoscaling.knative.dev/maxScale: "3"
    autoscaling.knative.dev/scaleDownDelay: "5m"
```
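These annotations can also be applied to a service that is already running rather than re-applying the full manifest; a sketch using kubectl annotate against the service from Step 1:

```bash
# Switch an existing dev/staging service to scale-to-zero in place
kubectl annotate inferenceservice llama-inference \
  autoscaling.knative.dev/minScale="0" \
  autoscaling.knative.dev/maxScale="3" \
  autoscaling.knative.dev/scaleDownDelay="5m" \
  --overwrite
```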
Step 3: Test the Endpoint
```bash
# Get inference URL
INFERENCE_URL=$(kubectl get inferenceservice llama-inference \
  -o jsonpath='{.status.url}')

curl -X POST "${INFERENCE_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
```
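Before sending chat requests, a quick smoke test against the OpenAI-compatible model-listing route served by vLLM can confirm the model finished loading; a sketch, assuming jq is installed:

```bash
# Should print the served model ID once loading has completed
curl -s "${INFERENCE_URL}/v1/models" | jq -r '.data[].id'
```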
Error Handling
| Error | Cause | Solution |
|---|---|---|
| InferenceService not ready | GPU not available | Check node capacity and affinity |
| Scale-to-zero cold start | First request after idle | Set minScale: "1" for production |
| Model loading timeout | Large model download | Pre-cache model in PVC |
| OOMKilled | Model too large | Use multi-GPU or quantized model |
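For the scheduling-related errors above, a few generic kubectl checks usually narrow down whether the problem is GPU capacity, the affinity rule, or slow image/model pulls; the label key matches the affinity rule in Step 1:

```bash
# List nodes with their GPU class label to verify the requested class exists in the cluster
kubectl get nodes -L gpu.nvidia.com/class

# Recent events for the predictor pod (FailedScheduling, OOMKilled, image pulls, etc.)
kubectl get events --sort-by=.lastTimestamp | grep llama-inference
```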
Next Steps
For GPU training workloads, see coreweave-core-workflow-b.