Claude-code-plugins-plus-skills castai-prod-checklist
install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/castai-pack/skills/castai-prod-checklist" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-castai-prod-checklist && rm -rf "$T"
manifest:
plugins/saas-packs/castai-pack/skills/castai-prod-checklist/SKILL.mdsource content
CAST AI Production Checklist
Overview
Complete checklist for enabling CAST AI cost optimization on a production Kubernetes cluster. Covers Phase 1 (monitoring) through Phase 2 (full automation) with validation steps at each stage.
Prerequisites
- CAST AI tested on a staging cluster first
- Production API key (Full Access)
- Change management approval for node lifecycle changes
Phase 1: Monitoring Only
- Agent installed with read-only key
- Agent pod healthy:
kubectl get pods -n castai-agent - Console shows cluster as "Connected"
- Savings report generating (wait 24h for full data)
- Review savings estimate before enabling automation
Phase 2: Autoscaling Enabled
- Full Access API key provisioned and stored in secrets manager
- Cluster controller installed
- Evictor installed with conservative settings (non-aggressive)
- Spot handler installed for graceful interruption handling
- Autoscaler policies configured with appropriate limits:
-
set to safe ceilingclusterLimits.cpu.maxCores -
configured (10-15%)unschedulablePods.headroom -
>= 300 for productionnodeDownscaler.emptyNodes.delaySeconds -
= truespotInstances.spotDiversityEnabled
-
- Node templates created for workload-specific needs (GPU, high-memory)
- PodDisruptionBudgets set on all critical workloads
Workload Autoscaler
- Workload autoscaler installed
- Critical deployments annotated with min/max resource bounds
- Anti-shrink cooldown set (300s minimum)
- Memory headroom >= 20% for production workloads
Security
- API key in secrets manager (not Helm values files)
- Kvisor security agent installed
- Network policies applied to castai-agent namespace
- RBAC reviewed and minimized
- Key rotation scheduled (90-day interval)
Monitoring and Alerting
- Alert on agent pod restarts:
kube_pod_container_status_restarts_total{namespace="castai-agent"} - Alert on API errors in agent logs
- CAST AI console email notifications enabled
- Savings report reviewed weekly
- Dashboard tracking spot vs on-demand node ratio
Rollback Procedure
# Disable autoscaling immediately (keeps agent monitoring) curl -X PUT -H "X-API-Key: ${CASTAI_API_KEY}" \ -H "Content-Type: application/json" \ "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \ -d '{"enabled": false}' # Or remove all CAST AI components helm uninstall castai-evictor -n castai-agent helm uninstall cluster-controller -n castai-agent # Keep the agent for monitoring if desired
Validation Commands
# Final pre-go-live verification echo "=== CAST AI Production Validation ===" # Agent healthy kubectl get pods -n castai-agent -o wide # All components running helm list -n castai-agent # Policies correct curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \ "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \ | jq '{enabled, unschedulablePods: .unschedulablePods.enabled, downscaler: .nodeDownscaler.enabled, spot: .spotInstances.enabled}' # Savings estimate curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \ "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/savings" \ | jq '{monthly: .monthlySavings, percent: .savingsPercentage}'
Resources
Next Steps
For version upgrades, see
castai-upgrade-migration.