Skillshub castai-common-errors

install

source · Clone the upstream repo

git clone https://github.com/ComeOnOliver/skillshub

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/castai-common-errors" ~/.claude/skills/comeonoliver-skillshub-castai-common-errors && rm -rf "$T"

manifest: skills/jeremylongshore/claude-code-plugins-plus-skills/castai-common-errors/SKILL.md

CAST AI Common Errors

Overview

Diagnostic guide for the 10 most common CAST AI issues, covering agent connectivity, API errors, autoscaler failures, and node provisioning problems.

Prerequisites

```
kubectl
```
access to the cluster
```
CASTAI_API_KEY
```
configured
Access to CAST AI console for log correlation

Error Reference

1. Agent Pod CrashLoopBackOff

kubectl get pods -n castai-agent
kubectl logs -n castai-agent deployment/castai-agent --tail=50

Causes and fixes:

Invalid API key: Regenerate at console.cast.ai > API
Wrong provider: Set
```
--set provider=eks|gke|aks
```
correctly in Helm
RBAC missing: Apply the required ClusterRole and ClusterRoleBinding
Network blocked: Ensure outbound HTTPS to
```
api.cast.ai
```
is allowed

2. Agent Shows "Disconnected" in Console

# Check agent heartbeat
kubectl logs -n castai-agent deployment/castai-agent | grep -i "heartbeat\|connect\|error"

# Verify network connectivity from inside the cluster
kubectl run castai-debug --image=curlimages/curl --rm -it --restart=Never -- \
  curl -s -o /dev/null -w "%{http_code}" https://api.cast.ai/v1/kubernetes/external-clusters

Fix: Restart the agent pod:

kubectl rollout restart deployment/castai-agent -n castai-agent

3. API Returns 401 Unauthorized

# Test API key
curl -s -o /dev/null -w "%{http_code}" \
  -H "X-API-Key: ${CASTAI_API_KEY}" \
  https://api.cast.ai/v1/kubernetes/external-clusters
# Should return 200, not 401

Fix: Generate a new API key at console.cast.ai > API > API Access Keys.

4. Nodes Not Scaling Up (Unschedulable Pods)

# Check for pending pods
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Verify unschedulable pods policy is enabled
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '.unschedulablePods'

Causes:

```
unschedulablePods.enabled
```
is
```
false
```
-- enable it
Cluster limits reached -- increase
```
clusterLimits.cpu.maxCores
```
No matching node template -- check constraints match pod requirements

5. Nodes Not Scaling Down (Empty Nodes)

# Check node downscaler configuration
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '.nodeDownscaler'

Causes:

```
nodeDownscaler.enabled
```
is
```
false
```
Pods with
```
PodDisruptionBudget
```
blocking eviction
DaemonSet-only nodes with system pods preventing drain
Delay too high -- reduce
```
emptyNodes.delaySeconds
```

6. Spot Instance Fallback Not Working

# Check spot configuration
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '.spotInstances'

Fix: Enable

spotDiversityEnabled: true

and set

spotDiversityPriceIncreaseLimitPercent

to 20-30 for better availability.

7. Evictor Too Aggressive

Symptoms: Pods being evicted too frequently, service disruption.

kubectl get events --field-selector reason=Evicted -A --sort-by=.lastTimestamp | tail -20

Fix: Increase evictor cycle interval or switch to non-aggressive mode:

helm upgrade castai-evictor castai-helm/castai-evictor \
  -n castai-agent \
  --set castai.apiKey="${CASTAI_API_KEY}" \
  --set castai.clusterID="${CASTAI_CLUSTER_ID}" \
  --set evictor.aggressiveMode=false \
  --set evictor.cycleInterval=600

8. Terraform State Drift

terraform plan -var-file=environments/prod.tfvars
# If drift detected:
terraform refresh -var-file=environments/prod.tfvars

Fix: Avoid mixing Terraform and console-based policy changes. Pick one source of truth.

9. Helm Chart Version Mismatch

# Check installed versions
helm list -n castai-agent
helm search repo castai-helm --versions | head -10

# Update to latest
helm repo update
helm upgrade castai-agent castai-helm/castai-agent -n castai-agent \
  --reuse-values

10. Workload Autoscaler Not Recommending

kubectl logs -n castai-agent deployment/castai-workload-autoscaler --tail=50

Causes:

Insufficient metrics data (wait 24h)
Missing annotation
```
autoscaling.cast.ai/enabled: "true"
```
Workload autoscaler pod not running

Escalation Path

Collect debug info: Helm releases, agent logs, cluster events
Check https://status.cast.ai for platform issues
Contact support with cluster ID and screenshots

Resources

Next Steps

For comprehensive diagnostics, see

castai-debug-bundle