Claude-code-plugins-plus-skills castai-common-errors

install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/castai-pack/skills/castai-common-errors" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-castai-common-errors && rm -rf "$T"
manifest: plugins/saas-packs/castai-pack/skills/castai-common-errors/SKILL.md

CAST AI Common Errors

Overview

Diagnostic guide for the 10 most common CAST AI issues, covering agent connectivity, API errors, autoscaler failures, and node provisioning problems.

Prerequisites

  • kubectl access to the cluster
  • CASTAI_API_KEY configured (the policy queries below also expect CASTAI_CLUSTER_ID)
  • Access to CAST AI console for log correlation

Error Reference

1. Agent Pod CrashLoopBackOff

kubectl get pods -n castai-agent
kubectl logs -n castai-agent deployment/castai-agent --tail=50

Causes and fixes:

  • Invalid API key: Regenerate at console.cast.ai > API
  • Wrong provider: Pass --set provider=eks, gke, or aks (whichever matches the cluster) to Helm
  • RBAC missing: Apply the required ClusterRole and ClusterRoleBinding
  • Network blocked: Ensure outbound HTTPS to api.cast.ai is allowed
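
A mistyped provider value is easy to catch before Helm runs. The sketch below is a hypothetical helper, not part of the CAST AI tooling, and it only accepts the three values listed in this guide:

```shell
# Hypothetical pre-flight check for the Helm "provider" value.
# Accepts only the values named in this guide; extend the list if your
# setup uses a provider that is not covered here.
validate_provider() {
  case "$1" in
    eks|gke|aks) return 0 ;;
    *)
      echo "provider '$1' is not one of: eks, gke, aks" >&2
      return 1
      ;;
  esac
}

# Usage:
# validate_provider "$PROVIDER" || exit 1
```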

2. Agent Shows "Disconnected" in Console

# Check agent heartbeat
kubectl logs -n castai-agent deployment/castai-agent | grep -i "heartbeat\|connect\|error"

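# Verify network connectivity from inside the cluster.
# Any HTTP status (even 401, since no key is sent) means the endpoint is reachable; 000 means no response.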
kubectl run castai-debug --image=curlimages/curl --rm -it --restart=Never -- \
  curl -s -o /dev/null -w "%{http_code}" https://api.cast.ai/v1/kubernetes/external-clusters

Fix: If the endpoint is reachable, restart the agent:

kubectl rollout restart deployment/castai-agent -n castai-agent

3. API Returns 401 Unauthorized

# Test API key
curl -s -o /dev/null -w "%{http_code}" \
  -H "X-API-Key: ${CASTAI_API_KEY}" \
  https://api.cast.ai/v1/kubernetes/external-clusters
# Should return 200, not 401

Fix: Generate a new API key at console.cast.ai > API > API Access Keys.
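
The status code printed by the curl check can be mapped to a likely cause. This is a minimal sketch: the function name and messages are illustrative, and only the 200 and 401 cases come from this guide. The 000 case relies on curl printing 000 for %{http_code} when no HTTP response was received.

```shell
# Map the status code from curl's -w "%{http_code}" to a likely cause.
# The function name and message wording are illustrative assumptions.
interpret_castai_status() {
  case "$1" in
    200) echo "ok: API key accepted" ;;
    401) echo "unauthorized: API key invalid or revoked" ;;
    000) echo "no response: check outbound HTTPS to api.cast.ai" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Usage (needs network access and CASTAI_API_KEY):
# code=$(curl -s -o /dev/null -w "%{http_code}" \
#   -H "X-API-Key: ${CASTAI_API_KEY}" \
#   https://api.cast.ai/v1/kubernetes/external-clusters)
# interpret_castai_status "$code"
```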

4. Nodes Not Scaling Up (Unschedulable Pods)

# Check for pending pods
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Verify unschedulable pods policy is enabled
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '.unschedulablePods'

Causes:

  • unschedulablePods.enabled is false -- enable it
  • Cluster limits reached -- increase clusterLimits.cpu.maxCores
  • No matching node template -- check constraints match pod requirements
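
When many pods are pending, grouping them by namespace helps show which workloads are affected. A small sketch, assuming the default column layout of kubectl get pods --all-namespaces --no-headers (namespace in the first column); the function name is made up for illustration:

```shell
# Count pending pods per namespace.
# Reads "kubectl get pods -A --no-headers" output on stdin and assumes the
# namespace is the first whitespace-separated column.
pending_by_namespace() {
  awk 'NF { count[$1]++ } END { for (ns in count) print ns, count[ns] }' | sort
}

# Usage:
# kubectl get pods --all-namespaces --field-selector=status.phase=Pending \
#   --no-headers | pending_by_namespace
```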

5. Nodes Not Scaling Down (Empty Nodes)

# Check node downscaler configuration
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '.nodeDownscaler'

Causes:

  • nodeDownscaler.enabled is false
  • Pods with PodDisruptionBudget blocking eviction
  • DaemonSet-only nodes with system pods preventing drain
  • Delay too high -- reduce emptyNodes.delaySeconds
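
To check the PodDisruptionBudget cause, one option is to list budgets that currently allow zero disruptions, since those block every eviction of the pods they cover. A sketch under the assumption that the input has three columns (namespace, name, .status.disruptionsAllowed); the function name is hypothetical:

```shell
# Print namespace/name for every PDB that currently allows 0 disruptions.
# Expects three whitespace-separated columns on stdin:
#   namespace  name  disruptionsAllowed
blocking_pdbs() {
  awk '$3 == 0 { print $1 "/" $2 }'
}

# Usage:
# kubectl get pdb -A --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#   | blocking_pdbs
```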

6. Spot Instance Fallback Not Working

# Check spot configuration
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '.spotInstances'

Fix: Set spotDiversityEnabled: true and set spotDiversityPriceIncreaseLimitPercent to 20-30 for better availability.

7. Evictor Too Aggressive

Symptoms: Pods are evicted too frequently, causing service disruption.

kubectl get events --field-selector reason=Evicted -A --sort-by=.lastTimestamp | tail -20

Fix: Increase evictor cycle interval or switch to non-aggressive mode:

helm upgrade castai-evictor castai-helm/castai-evictor \
  -n castai-agent \
  --set castai.apiKey="${CASTAI_API_KEY}" \
  --set castai.clusterID="${CASTAI_CLUSTER_ID}" \
  --set evictor.aggressiveMode=false \
  --set evictor.cycleInterval=600

8. Terraform State Drift

terraform plan -var-file=environments/prod.tfvars
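# If drift is detected, update state to match real infrastructure
# (terraform refresh is deprecated in favor of -refresh-only):
terraform apply -refresh-only -var-file=environments/prod.tfvars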

Fix: Avoid mixing Terraform and console-based policy changes. Pick one source of truth.
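
For scheduled drift checks, terraform plan accepts -detailed-exitcode, which exits 0 when there are no changes, 1 on error, and 2 when changes are present. The helper below is a sketch with illustrative naming and messages:

```shell
# Translate the exit code of "terraform plan -detailed-exitcode".
# The function name and message wording are illustrative assumptions.
interpret_plan_exit() {
  case "$1" in
    0) echo "no drift: configuration matches state" ;;
    1) echo "error: terraform plan failed" ;;
    2) echo "drift or pending changes detected" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Usage:
# terraform plan -detailed-exitcode -var-file=environments/prod.tfvars >/dev/null
# interpret_plan_exit "$?"
```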

9. Helm Chart Version Mismatch

# Check installed versions
helm list -n castai-agent
helm search repo castai-helm --versions | head -10

# Update to latest
helm repo update
helm upgrade castai-agent castai-helm/castai-agent -n castai-agent \
  --reuse-values
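
To decide whether an upgrade is needed, the installed chart version can be compared with the latest one in the repo. A minimal sketch that assumes GNU sort (for -V version ordering); the version numbers in the usage lines are made up:

```shell
# Return success if version $1 is strictly older than version $2.
# Relies on "sort -V" (GNU coreutils) for version-aware ordering.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Usage (hypothetical versions):
# if version_lt "0.50.0" "0.51.2"; then
#   echo "castai-agent chart is behind, consider upgrading"
# fi
```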

10. Workload Autoscaler Not Recommending

kubectl logs -n castai-agent deployment/castai-workload-autoscaler --tail=50

Causes:

  • Insufficient metrics data (wait 24h)
  • Missing annotation autoscaling.cast.ai/enabled: "true"
  • Workload autoscaler pod not running

Escalation Path

  1. Collect debug info: Helm releases, agent logs, cluster events
  2. Check https://status.cast.ai for platform issues
  3. Contact support with cluster ID and screenshots
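
Step 1 can be scripted so that the same files are gathered every time. The sketch below reuses commands from this guide; the function names, file names, and the DRY_RUN switch are assumptions for illustration, and this is not the castai-debug-bundle skill itself.

```shell
# Run a command and save its output, or just print it when DRY_RUN is set.
# capture <out_dir> <file_name> <command...>
capture() {
  local out_dir="$1"
  local name="$2"
  shift 2
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$* > $out_dir/$name"
  else
    # Keep going if one command fails so a partial bundle is still produced.
    "$@" > "$out_dir/$name" 2>&1 || echo "warning: '$*' failed" >&2
  fi
}

# Collect Helm releases, agent logs, and cluster events into one directory.
collect_castai_debug() {
  local out_dir="${1:-castai-debug}"
  [ -n "${DRY_RUN:-}" ] || mkdir -p "$out_dir"
  capture "$out_dir" helm-releases.txt helm list -n castai-agent
  capture "$out_dir" agent-logs.txt kubectl logs -n castai-agent deployment/castai-agent --tail=200
  capture "$out_dir" cluster-events.txt kubectl get events -A --sort-by=.lastTimestamp
}

# Usage:
# DRY_RUN=1 collect_castai_debug ./bundle   # preview the commands
# collect_castai_debug ./bundle             # write the files
```

Review the collected logs for secrets before attaching them to a support ticket.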

Next Steps

For comprehensive diagnostics, see castai-debug-bundle.