Skillshub castai-core-workflow-a
install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/castai-core-workflow-a" ~/.claude/skills/comeonoliver-skillshub-castai-core-workflow-a && rm -rf "$T"
manifest: skills/jeremylongshore/claude-code-plugins-plus-skills/castai-core-workflow-a/SKILL.md
CAST AI Core Workflow: Autoscaler & Policies
Overview
Primary workflow for CAST AI: configure autoscaler policies to optimize cluster costs. Covers enabling spot instances, configuring the node downscaler and evictor, setting cluster CPU/memory limits, and creating node templates for workload-specific requirements.
Prerequisites
- Completed castai-install-auth with Phase 2 (cluster controller + evictor)
- CASTAI_API_KEY and CASTAI_CLUSTER_ID set
- Cluster in "ready" status
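A quick preflight check can catch a missing variable before any API call is made. The sketch below is only an illustration: the function name check_castai_env is hypothetical and not part of the skill, and it checks only the two environment variables listed above, not the cluster status.

```shell
# Hypothetical helper (not part of the skill): fail early when a required
# environment variable is unset or empty.
check_castai_env() {
  missing=0
  for var in CASTAI_API_KEY CASTAI_CLUSTER_ID; do
    # Indirect lookup of the variable named in $var (POSIX-compatible).
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "missing required variable: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling the function at the top of a script and exiting on a non-zero return keeps the later curl commands from running with an empty key or cluster ID.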
Instructions
Step 1: Read Current Policies
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq .
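Since the next step overwrites the policies with a PUT, it may be worth saving the current state first so it can be restored later. A minimal sketch, assuming the same endpoint and header as the command above; the function name backup_policies and the default file name are hypothetical.

```shell
# Hypothetical helper: save the current policies to a file before changing them.
# Usage: backup_policies [output-file]
backup_policies() {
  out="${1:-policies-backup-$(date +%Y%m%d-%H%M%S).json}"
  # -f makes curl exit non-zero on HTTP errors instead of saving the error body.
  if ! curl -sf -H "X-API-Key: ${CASTAI_API_KEY}" \
      "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
      > "$out"; then
    rm -f "$out"
    echo "failed to fetch policies; no backup written" >&2
    return 1
  fi
  # Print the backup path so callers can capture it.
  echo "$out"
}
```

The saved file can later be sent back with the same PUT request used in Step 2 if a change needs to be rolled back.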
Step 2: Enable Cost-Optimized Autoscaling
curl -X PUT -H "X-API-Key: ${CASTAI_API_KEY}" \
  -H "Content-Type: application/json" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  -d '{
    "enabled": true,
    "unschedulablePods": {
      "enabled": true,
      "headroom": {
        "cpuPercentage": 10,
        "memoryPercentage": 10,
        "enabled": true
      }
    },
    "nodeDownscaler": {
      "enabled": true,
      "emptyNodes": {
        "enabled": true,
        "delaySeconds": 180
      }
    },
    "spotInstances": {
      "enabled": true,
      "clouds": ["aws"],
      "spotDiversityEnabled": true,
      "spotDiversityPriceIncreaseLimitPercent": 20
    },
    "clusterLimits": {
      "enabled": true,
      "cpu": {
        "minCores": 4,
        "maxCores": 100
      }
    }
  }'
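Keeping the policy JSON in a file makes it easier to validate before sending and to check the HTTP status afterwards. The following is a sketch under assumptions: apply_policies is a hypothetical wrapper, the payload file is whatever you saved the JSON above as, and treating any 2xx status as success is an assumption about the API rather than documented behaviour.

```shell
# Hypothetical wrapper: validate a policy file with jq, then PUT it and
# report the HTTP status. Usage: apply_policies policies.json
apply_policies() {
  payload="$1"
  # "jq empty" parses the file and prints nothing; it exits non-zero on bad JSON.
  if ! jq empty "$payload" >/dev/null 2>&1; then
    echo "invalid JSON in ${payload}; not sending" >&2
    return 1
  fi
  # -w '%{http_code}' prints only the status code; the body is discarded.
  status=$(curl -s -o /dev/null -w '%{http_code}' -X PUT \
    -H "X-API-Key: ${CASTAI_API_KEY}" \
    -H "Content-Type: application/json" \
    "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
    -d @"$payload")
  case "$status" in
    2??) echo "policies updated (HTTP ${status})" ;;
    *)   echo "policy update failed (HTTP ${status})" >&2; return 1 ;;
  esac
}
```

Validating locally first means a malformed payload is caught before it reaches the API, which is the 400 case listed under Error Handling.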
Step 3: Configure Node Templates via Terraform
resource "castai_node_template" "spot_workers" {
  cluster_id = castai_eks_cluster.this.id
  name       = "spot-workers"
  is_default = false
  is_enabled = true

  constraints {
    min_cpu                       = 2
    max_cpu                       = 16
    min_memory                    = 4096
    max_memory                    = 65536
    spot                          = true
    use_spot_fallbacks            = true
    fallback_restore_rate_seconds = 600

    instance_families {
      include = ["m5", "m6i", "c5", "c6i", "r5", "r6i"]
    }

    architectures = ["amd64"]
  }

  custom_labels = {
    "workload-type" = "batch"
  }
}

resource "castai_node_template" "gpu_ondemand" {
  cluster_id = castai_eks_cluster.this.id
  name       = "gpu-ondemand"
  is_default = false
  is_enabled = true

  constraints {
    spot              = false
    gpu_manufacturers = ["NVIDIA"]

    instance_families {
      include = ["p3", "p4d", "g4dn", "g5"]
    }
  }

  custom_labels = {
    "workload-type" = "gpu"
  }
}
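To show how a workload might land on the spot-workers template, the sketch below prints a Kubernetes Job manifest that selects nodes by the workload-type = batch label defined above. Everything except that label is an assumption: the function name batch_job_manifest is hypothetical, and the scheduling.cast.ai/spot toleration assumes CAST AI taints spot nodes with that key, so check the taints on your own nodes before relying on it.

```shell
# Hypothetical helper: print a Job manifest that targets nodes created from
# the spot-workers template. Usage: batch_job_manifest <job-name> <image>
batch_job_manifest() {
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: $1
spec:
  template:
    spec:
      # Matches custom_labels on the spot-workers node template.
      nodeSelector:
        workload-type: batch
      # Assumption: spot nodes carry a scheduling.cast.ai/spot taint.
      tolerations:
        - key: scheduling.cast.ai/spot
          operator: Exists
      containers:
        - name: main
          image: $2
      restartPolicy: Never
EOF
}
```

Piping the output to kubectl apply -f - would submit the Job; printing the manifest first makes it easy to review or commit before applying.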
Step 4: Verify Autoscaler is Working
# Check if the autoscaler is processing nodes
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/external-clusters/${CASTAI_CLUSTER_ID}/nodes" \
  | jq '[.items[] | {name, instanceType, lifecycle, castaiManaged: .castaiManaged}]
        | group_by(.lifecycle)
        | map({lifecycle: .[0].lifecycle, count: length})'
# Expected: mix of spot and on-demand nodes
Error Handling
| Error | Cause | Solution |
|---|---|---|
| Policy update returns 400 | Invalid policy JSON | Validate with jq before sending |
| Nodes not scaling | Policy not enabled | Verify enabled: true in policy |
| Spot instances not used | Provider not configured | Add cloud provider to spotInstances.clouds |
| Evictor too aggressive | Low delay threshold | Increase the delay setting (delaySeconds in Step 2) |
| Cluster limit hit | maxCores too low | Increase clusterLimits.cpu.maxCores |
Resources
Next Steps
For workload-level autoscaling, see castai-core-workflow-b.