Skillshub castai-cost-tuning
install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/castai-cost-tuning" ~/.claude/skills/comeonoliver-skillshub-castai-cost-tuning && rm -rf "$T"
manifest: skills/jeremylongshore/claude-code-plugins-plus-skills/castai-cost-tuning/SKILL.md
CAST AI Cost Tuning
Overview
Maximize Kubernetes cost savings through CAST AI: spot instance strategies, workload right-sizing, cluster hibernation, and savings tracking. Typical savings: 50-70% on cloud compute costs.
Prerequisites
- CAST AI Phase 2 enabled with full automation
- Savings report available (requires 24h+ of data)
- Understanding of workload criticality tiers
Instructions
Step 1: Analyze Current Savings
    # Get savings breakdown
    curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
      "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/savings" \
      | jq '{
        currentMonthlyCost: .currentMonthlyCost,
        optimizedMonthlyCost: .optimizedMonthlyCost,
        monthlySavings: .monthlySavings,
        savingsPercentage: .savingsPercentage,
        spotSavings: .spotSavings,
        rightSizingSavings: .rightSizingSavings
      }'
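To sanity-check the figures returned in Step 1, the relationship between the cost fields can be sketched in a few lines. This is a minimal illustration, assuming that savings equal current cost minus optimized cost and that the percentage is relative to current cost; the API's own definition may differ, and `summarizeSavings` is a hypothetical helper, not part of CAST AI.

```typescript
// Hypothetical input shape: only the two cost fields queried in Step 1.
interface SavingsInput {
  currentMonthlyCost: number;
  optimizedMonthlyCost: number;
}

interface SavingsSummary {
  monthlySavings: number;
  savingsPercentage: number;
}

// Assumption: savings = current - optimized, percentage relative to current cost.
function summarizeSavings(input: SavingsInput): SavingsSummary {
  const monthlySavings = input.currentMonthlyCost - input.optimizedMonthlyCost;
  const savingsPercentage =
    input.currentMonthlyCost > 0
      ? (monthlySavings / input.currentMonthlyCost) * 100
      : 0; // avoid dividing by zero for an empty cluster
  return { monthlySavings, savingsPercentage };
}

const savingsExample = summarizeSavings({
  currentMonthlyCost: 8000,
  optimizedMonthlyCost: 4000,
});
console.log(savingsExample); // { monthlySavings: 4000, savingsPercentage: 50 }
```

If the values computed this way diverge noticeably from the reported `savingsPercentage`, the report is probably using a different baseline and is worth a closer look.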
Step 2: Maximize Spot Usage
    # Enable aggressive spot with diversity and fallbacks
    curl -X PUT -H "X-API-Key: ${CASTAI_API_KEY}" \
      -H "Content-Type: application/json" \
      "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
      -d '{
        "enabled": true,
        "spotInstances": {
          "enabled": true,
          "clouds": ["aws"],
          "spotDiversityEnabled": true,
          "spotDiversityPriceIncreaseLimitPercent": 20,
          "spotBackups": {
            "enabled": true,
            "spotBackupRestoreRateSeconds": 600
          }
        }
      }'
Spot allocation strategy by workload tier:
| Workload Type | Spot % | Rationale |
|---|---|---|
| Batch jobs, CI runners | 100% spot | Interruptible, restartable |
| Stateless APIs (behind LB) | 80% spot | Can handle brief interruptions |
| Stateful services, databases | 0% spot | Use on-demand or reserved |
| ML training | 80-100% spot | Checkpointing handles interrupts |
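The tier table above can be turned into a small planning helper. The sketch below is illustrative only: the tier names, `SPOT_PERCENT`, and `splitReplicas` are hypothetical, and ML training uses the lower bound of its 80-100% range as a conservative assumption. It only computes a target split; it does not configure CAST AI.

```typescript
type WorkloadTier = "batch" | "stateless-api" | "stateful" | "ml-training";

// Target spot share per tier, taken from the table above (integer percent).
// Assumption: ML training uses the lower bound (80) of its 80-100% range.
const SPOT_PERCENT: Record<WorkloadTier, number> = {
  "batch": 100,
  "stateless-api": 80,
  "stateful": 0,
  "ml-training": 80,
};

interface ReplicaSplit {
  spot: number;
  onDemand: number;
}

// Split a replica count between spot and on-demand capacity.
// The spot count is rounded down so on-demand never drops below its share.
function splitReplicas(tier: WorkloadTier, replicas: number): ReplicaSplit {
  const spot = Math.floor((replicas * SPOT_PERCENT[tier]) / 100);
  return { spot, onDemand: replicas - spot };
}

console.log(splitReplicas("stateless-api", 10)); // { spot: 8, onDemand: 2 }
console.log(splitReplicas("stateful", 3)); // { spot: 0, onDemand: 3 }
```

Rounding down matters for small deployments: with 3 stateless replicas, the split is 2 spot and 1 on-demand, so at least one replica survives a spot interruption.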
Step 3: Workload Right-Sizing
    # Get resource waste analysis
    curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
      "https://api.cast.ai/v1/workload-autoscaling/clusters/${CASTAI_CLUSTER_ID}/workloads" \
      | jq '[.items[]
          | select(.estimatedSavingsPercent > 20)
          | {
              name: .workloadName,
              namespace: .namespace,
              wastedCpu: (.currentCpuRequest - .recommendedCpuRequest),
              wastedMemory: (.currentMemoryRequest - .recommendedMemoryRequest),
              savingsPercent: .estimatedSavingsPercent
            }]
        | sort_by(-.savingsPercent)
        | .[0:10]'
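The same filtering and ranking can be done in TypeScript when the workload list is already in memory, for example inside a reporting script. This is a sketch that mirrors the jq filter above; the `WorkloadItem` shape is assumed from the field names used in that filter and has not been checked against the API schema.

```typescript
// Assumed item shape, based only on the fields referenced in the jq filter.
interface WorkloadItem {
  workloadName: string;
  namespace: string;
  currentCpuRequest: number;
  recommendedCpuRequest: number;
  currentMemoryRequest: number;
  recommendedMemoryRequest: number;
  estimatedSavingsPercent: number;
}

interface WasteRow {
  name: string;
  namespace: string;
  wastedCpu: number;
  wastedMemory: number;
  savingsPercent: number;
}

// Keep workloads above the savings threshold, compute the over-provisioned
// amounts, and return the top `limit` rows sorted by savings (highest first).
function topWastefulWorkloads(
  items: WorkloadItem[],
  minSavingsPercent = 20,
  limit = 10
): WasteRow[] {
  return items
    .filter((w) => w.estimatedSavingsPercent > minSavingsPercent)
    .map((w) => ({
      name: w.workloadName,
      namespace: w.namespace,
      wastedCpu: w.currentCpuRequest - w.recommendedCpuRequest,
      wastedMemory: w.currentMemoryRequest - w.recommendedMemoryRequest,
      savingsPercent: w.estimatedSavingsPercent,
    }))
    .sort((x, y) => y.savingsPercent - x.savingsPercent)
    .slice(0, limit);
}
```

Units are whatever the API returns for the request fields; the helper only subtracts them, so CPU and memory values should not be compared with each other.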
Step 4: Cluster Hibernation (Dev/Staging)
    # Hibernate non-production clusters during off-hours
    # Scales nodes to zero, resume on demand

    # Enable hibernation
    curl -X POST -H "X-API-Key: ${CASTAI_API_KEY}" \
      -H "Content-Type: application/json" \
      "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/hibernate" \
      -d '{
        "schedule": {
          "enabled": true,
          "hibernateAt": "20:00",
          "wakeUpAt": "08:00",
          "timezone": "America/New_York",
          "weekdaysOnly": true
        }
      }'
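A malformed schedule is easier to catch before the request is sent. The sketch below checks the payload from Step 4 locally; it assumes 24-hour `HH:MM` times (inferred from the example values, not confirmed by the API) and uses the runtime's `Intl.DateTimeFormat` to test whether the timezone is a recognized IANA name. `validateSchedule` and `isValidTimeZone` are hypothetical helpers.

```typescript
// Shape copied from the "schedule" object in the Step 4 request body.
interface HibernationSchedule {
  enabled: boolean;
  hibernateAt: string;
  wakeUpAt: string;
  timezone: string;
  weekdaysOnly: boolean;
}

// Assumption: times are 24-hour HH:MM, as in the example values.
const TIME_PATTERN = /^([01]\d|2[0-3]):[0-5]\d$/;

// Intl.DateTimeFormat throws a RangeError for time zone names it does not know.
function isValidTimeZone(tz: string): boolean {
  try {
    new Intl.DateTimeFormat("en-US", { timeZone: tz });
    return true;
  } catch {
    return false;
  }
}

// Return a list of human-readable problems; an empty list means the
// schedule passed these local checks (the API may still reject it).
function validateSchedule(s: HibernationSchedule): string[] {
  const problems: string[] = [];
  if (!TIME_PATTERN.test(s.hibernateAt)) {
    problems.push(`hibernateAt is not HH:MM: ${s.hibernateAt}`);
  }
  if (!TIME_PATTERN.test(s.wakeUpAt)) {
    problems.push(`wakeUpAt is not HH:MM: ${s.wakeUpAt}`);
  }
  if (!isValidTimeZone(s.timezone)) {
    problems.push(`timezone is not a recognized IANA name: ${s.timezone}`);
  }
  return problems;
}

console.log(
  validateSchedule({
    enabled: true,
    hibernateAt: "20:00",
    wakeUpAt: "08:00",
    timezone: "America/New_York",
    weekdaysOnly: true,
  })
);
```

The timezone check depends on the time zone data bundled with the JavaScript runtime, so results can vary slightly between environments.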
Step 5: Cost Tracking Dashboard
    interface CostReport {
      cluster: string;
      period: string;
      currentCost: number;
      optimizedCost: number;
      savings: number;
      spotPercent: number;
    }

    // Note: castaiGet(path) is assumed to be defined elsewhere; it should
    // perform an authenticated GET and resolve to the parsed JSON body.
    async function generateMonthlyCostReport(
      clusterIds: string[]
    ): Promise<CostReport[]> {
      const reports: CostReport[] = [];

      for (const clusterId of clusterIds) {
        // Fetch cluster details, savings, and nodes in parallel
        const [cluster, savings, nodes] = await Promise.all([
          castaiGet(`/v1/kubernetes/external-clusters/${clusterId}`),
          castaiGet(`/v1/kubernetes/clusters/${clusterId}/savings`),
          castaiGet(`/v1/kubernetes/external-clusters/${clusterId}/nodes`),
        ]);

        const spotNodes = nodes.items.filter(
          (n: { lifecycle: string }) => n.lifecycle === "spot"
        ).length;

        reports.push({
          cluster: cluster.name,
          period: new Date().toISOString().slice(0, 7), // YYYY-MM
          currentCost: savings.currentMonthlyCost,
          optimizedCost: savings.optimizedMonthlyCost,
          savings: savings.monthlySavings,
          // Share of nodes (by count, not cost) running on spot
          spotPercent:
            nodes.items.length > 0
              ? (spotNodes / nodes.items.length) * 100
              : 0,
        });
      }

      return reports;
    }
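Step 5 calls a `castaiGet` helper that this guide does not define. One possible shape is sketched below, assuming the same `X-API-Key` header and `https://api.cast.ai` base URL used in the curl examples. `buildCastaiRequest`, `makeCastaiGet`, and the `Fetcher` type are hypothetical names introduced for illustration, and the fetch function is injected so the sketch does not depend on any particular runtime.

```typescript
// Minimal response and fetch shapes, so the sketch needs no DOM or Node typings.
interface MinimalResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

type Fetcher = (
  url: string,
  init: { headers: Record<string, string> }
) => Promise<MinimalResponse>;

interface CastaiRequest {
  url: string;
  headers: Record<string, string>;
}

// Build the URL and auth header; X-API-Key matches the curl examples.
function buildCastaiRequest(
  path: string,
  apiKey: string,
  baseUrl = "https://api.cast.ai"
): CastaiRequest {
  if (path.charAt(0) !== "/") {
    throw new Error(`path must start with "/": ${path}`);
  }
  return { url: `${baseUrl}${path}`, headers: { "X-API-Key": apiKey } };
}

// Hypothetical factory: returns a castaiGet(path) bound to one API key.
// Rejects on non-2xx responses, otherwise resolves to the parsed JSON body.
function makeCastaiGet(apiKey: string, fetcher: Fetcher) {
  return function castaiGet(path: string): Promise<any> {
    const req = buildCastaiRequest(path, apiKey);
    return fetcher(req.url, { headers: req.headers }).then((res) => {
      if (!res.ok) {
        throw new Error(`CAST AI request failed: ${res.status} ${req.url}`);
      }
      return res.json();
    });
  };
}
```

In a runtime with a global `fetch` (for example Node 18 or later), `makeCastaiGet(apiKey, fetch)` should yield a function usable as the `castaiGet` in Step 5; injecting the fetcher also makes it straightforward to substitute a stub in tests.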
Cost Optimization Checklist
- Spot instances enabled with diversity
- Workload autoscaler right-sizing resources
- Dev/staging clusters hibernated off-hours
- Empty node downscaler enabled
- Instance families include latest generation (cheaper)
- Reserved/savings plan for baseline on-demand nodes
- Weekly savings report review
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Savings lower than expected | Too many on-demand constraints | Relax node template constraints |
| Spot interruptions too frequent | Single instance type | Enable spot diversity |
| Hibernation not triggering | Schedule timezone wrong | Use IANA timezone format |
| Right-sizing too aggressive | Low headroom | Increase memory headroom to 20% |
Next Steps
For architecture patterns, see castai-reference-architecture.