# Skillshub castai-reference-architecture

## Install

Clone the upstream repo:

```sh
git clone https://github.com/ComeOnOliver/skillshub
```

Claude Code: install into `~/.claude/skills/`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/castai-reference-architecture" ~/.claude/skills/comeonoliver-skillshub-castai-reference-architecture && rm -rf "$T"
```

Manifest: `skills/jeremylongshore/claude-code-plugins-plus-skills/castai-reference-architecture/SKILL.md`
# CAST AI Reference Architecture

## Overview
Production-grade architecture for managing CAST AI across multiple Kubernetes clusters. Covers Terraform module layout, per-environment policies, API key management, and observability integration.
## Prerequisites
- Multiple Kubernetes clusters (dev, staging, production)
- Terraform for infrastructure management
- Centralized secrets management
- Monitoring stack (Prometheus, Grafana, or Datadog)
## Terraform Module Structure

```
infrastructure/
├── modules/
│   └── castai-cluster/
│       ├── main.tf            # CAST AI provider resources
│       ├── variables.tf       # Cluster-specific inputs
│       ├── outputs.tf         # Cluster ID, savings metrics
│       ├── policies.tf        # Autoscaler policy configuration
│       ├── node-templates.tf  # Node template definitions
│       └── security.tf        # Kvisor, RBAC
├── environments/
│   ├── dev/
│   │   ├── main.tf            # Dev cluster onboarding
│   │   ├── terraform.tfvars   # Dev-specific values
│   │   └── backend.tf         # State storage
│   ├── staging/
│   │   ├── main.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   └── prod/
│       ├── main.tf
│       ├── terraform.tfvars
│       └── backend.tf
└── shared/
    ├── api-keys.tf            # Key management
    └── monitoring.tf          # Alerting rules
```
## Reusable Module

```hcl
# modules/castai-cluster/main.tf

variable "cluster_name" {
  type = string
}

variable "cluster_id" {
  type = string
}

variable "environment" {
  type = string
}

variable "api_token" {
  type      = string
  sensitive = true
}

variable "provider_type" {
  type = string # eks, gke, aks
}

variable "max_cpu_cores" {
  type    = number
  default = 100
}

variable "spot_enabled" {
  type    = bool
  default = true
}

variable "hibernation_enabled" {
  type    = bool
  default = false
}

variable "evictor_aggressive" {
  type    = bool
  default = false
}

resource "castai_autoscaler" "this" {
  cluster_id = var.cluster_id

  autoscaler_policies_json = jsonencode({
    enabled = true
    unschedulablePods = {
      enabled = true
      headroom = {
        enabled          = true
        cpuPercentage    = var.environment == "prod" ? 15 : 5
        memoryPercentage = var.environment == "prod" ? 15 : 5
      }
    }
    nodeDownscaler = {
      enabled = true
      emptyNodes = {
        enabled      = true
        delaySeconds = var.environment == "prod" ? 300 : 60
      }
    }
    spotInstances = {
      enabled              = var.spot_enabled
      spotDiversityEnabled = true
    }
    clusterLimits = {
      enabled = true
      cpu     = { minCores = 2, maxCores = var.max_cpu_cores }
    }
  })
}

resource "castai_node_template" "default_spot" {
  cluster_id = var.cluster_id
  name       = "${var.environment}-spot-workers"
  is_enabled = var.spot_enabled

  constraints {
    spot               = true
    use_spot_fallbacks = true
    architectures      = ["amd64"]
  }
}
```
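The module tree lists an `outputs.tf` for the cluster ID and savings metrics. A minimal sketch of what it might expose, based only on the resources defined above (the output names are assumptions, not from the upstream module):

```hcl
# modules/castai-cluster/outputs.tf (illustrative sketch)
output "cluster_id" {
  description = "CAST AI cluster ID, for wiring into monitoring and shared config"
  value       = var.cluster_id
}

output "spot_template_name" {
  description = "Name of the spot node template created by this module"
  value       = castai_node_template.default_spot.name
}
```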
## Per-Environment Configuration

```hcl
# environments/dev/terraform.tfvars
environment         = "dev"
max_cpu_cores       = 16
spot_enabled        = true
hibernation_enabled = true  # Hibernate off-hours
evictor_aggressive  = true  # Fast consolidation OK
```

```hcl
# environments/prod/terraform.tfvars
environment         = "prod"
max_cpu_cores       = 200
spot_enabled        = true
hibernation_enabled = false # Never hibernate production
evictor_aggressive  = false # Conservative eviction
```
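A hedged sketch of how an environment's `main.tf` might wire these tfvars into the shared module. The `castai_cluster_id` and `castai_api_token` variables are assumptions (they do not appear in the tfvars above); the token should be injected via `TF_VAR_` or a secrets store, never committed:

```hcl
# environments/dev/main.tf (illustrative sketch)
module "castai" {
  source = "../../modules/castai-cluster"

  cluster_name        = "dev-cluster"          # assumed name
  cluster_id          = var.castai_cluster_id  # assumed variable
  environment         = var.environment
  api_token           = var.castai_api_token   # from TF_VAR_castai_api_token
  provider_type       = "eks"
  max_cpu_cores       = var.max_cpu_cores
  spot_enabled        = var.spot_enabled
  hibernation_enabled = var.hibernation_enabled
  evictor_aggressive  = var.evictor_aggressive
}
```

Each environment directory would declare matching `variable` blocks so its `terraform.tfvars` populates them; only the module call differs per environment.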
## Architecture Diagram

```
                 ┌─────────────────────┐
                 │   CAST AI Console   │
                 │   console.cast.ai   │
                 └──────────┬──────────┘
                            │ API
           ┌────────────────┼────────────────┐
           │                │                │
   ┌───────▼───────┐ ┌──────▼────────┐ ┌─────▼───────┐
   │   Dev (EKS)   │ │ Staging (GKE) │ │ Prod (EKS)  │
   │  Spot: 100%   │ │   Spot: 80%   │ │  Spot: 60%  │
   │ Hibernate: Y  │ │ Hibernate: N  │ │ Hibernate: N│
   │  Max: 16 CPU  │ │  Max: 50 CPU  │ │ Max: 200 CPU│
   └───────┬───────┘ └──────┬────────┘ └─────┬───────┘
           │                │                │
   ┌───────▼───────┐ ┌──────▼────────┐ ┌─────▼───────┐
   │   Terraform   │ │   Terraform   │ │  Terraform  │
   │     dev/      │ │   staging/    │ │    prod/    │
   └───────────────┘ └───────────────┘ └─────────────┘
```
## Monitoring Integration

```yaml
# Prometheus alert rules for CAST AI
groups:
  - name: castai
    rules:
      - alert: CastAIAgentDown
        expr: kube_pod_status_ready{namespace="castai-agent", pod=~"castai-agent.*"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CAST AI agent is down on {{ $labels.cluster }}"
      - alert: CastAIHighSpotInterruptions
        expr: increase(castai_spot_interruptions_total[1h]) > 5
        labels:
          severity: warning
        annotations:
          summary: "High spot interruption rate on {{ $labels.cluster }}"
```
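These rules only fire if Prometheus actually scrapes the `castai-agent` namespace. A minimal sketch of a pod-discovery scrape job covering it (the annotation-based relabeling convention is an assumption about your cluster's setup, not a CAST AI requirement):

```yaml
# prometheus.yml fragment (illustrative sketch)
scrape_configs:
  - job_name: castai-agent
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [castai-agent]
    relabel_configs:
      # Keep only pods that opt in via the common prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```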
## Error Handling
| Issue | Cause | Solution |
|---|---|---|
| State drift between envs | Manual console changes | Enforce Terraform-only policy |
| Module version mismatch | Independent env upgrades | Pin module versions |
| Cross-env key leak | Shared tfvars | Separate state and secrets per env |
| Monitoring gaps | Missing scrape config | Add castai-agent namespace to Prometheus |
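To address the module-version-mismatch row, each environment can pin the shared module to a tagged git ref instead of a relative path, so upgrades are explicit per environment. The org, repo, and tag below are hypothetical placeholders:

```hcl
# environments/prod/main.tf: pin the module to a released tag
module "castai" {
  source = "git::https://github.com/example-org/infrastructure.git//modules/castai-cluster?ref=v1.4.2"
  # ...same cluster inputs as the other environments...
}
```

Dev can track a newer tag for validation while staging and prod stay pinned until the upgrade is proven.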
## Next Steps

This completes the CAST AI skill pack. Start with `castai-install-auth` for new clusters.