Claude-skill-registry kubernetes-troubleshooting
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/kubernetes-troubleshooting" ~/.claude/skills/majiayu000-claude-skill-registry-kubernetes-troubleshooting && rm -rf "$T"
manifest: skills/data/kubernetes-troubleshooting/SKILL.md
Kubernetes / OpenShift Troubleshooting Guide
Systematic approach to diagnosing and resolving cluster issues through event analysis, log interpretation, and Popeye-style health scoring.
Current Versions & Tools (January 2026)
| Platform | Version | Key Changes |
|---|---|---|
| Kubernetes | 1.31.x | Sidecar containers (beta, enabled by default; GA in 1.33), Pod lifecycle improvements |
| OpenShift | 4.17.x | OVN-Kubernetes default, enhanced web terminal |
| EKS | 1.31 | Pod Identity, Auto Mode, Karpenter 1.x |
| AKS | 1.31 | Cilium CNI, Workload Identity GA |
| GKE | 1.31 | Autopilot improvements, Gateway API GA |
Troubleshooting Tools
| Tool | Install | Purpose |
|---|---|---|
| k9s | | Terminal UI |
| stern | | Multi-pod log tailing |
| kubectx/kubens | | Context switching |
| kubectl-node-shell | | Node access |
Command Usage Convention
IMPORTANT: This skill uses `kubectl` as the primary command. When working with:
- OpenShift/ARO clusters: Replace `kubectl` with `oc`
- Standard Kubernetes (AKS, EKS, GKE): Use `kubectl` as shown
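One way to apply this convention in scripts is to choose the CLI once and reuse it. The sketch below assumes that having `oc` on the PATH means the target is an OpenShift cluster, which is a heuristic rather than a guarantee. The names `pick_kube_cli` and `KUBE_CLI` are hypothetical and not part of the skill.

```bash
#!/bin/bash
# Hypothetical helper: prefer oc when it is installed, otherwise use kubectl.
# Assumption: the presence of oc implies an OpenShift/ARO target.
pick_kube_cli() {
  if command -v oc >/dev/null 2>&1; then
    echo "oc"
  else
    echo "kubectl"
  fi
}

# Resolve once, then call "$KUBE_CLI" wherever the guide shows kubectl.
KUBE_CLI=$(pick_kube_cli)

# Example (not executed here):
#   "$KUBE_CLI" get pods -n "${NAMESPACE}" -o wide
```

Resolving the CLI once keeps the rest of a script identical across both platforms.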
Cluster Health Scoring (Popeye-Style)
Health scores range from 0-100. Issues reduce the score based on severity:
- BOOM (Critical): -50 points - Security vulnerabilities, resource exhaustion, failed services
- WARN (Warning): -20 points - Configuration inefficiencies, best practice violations
- INFO (Informational): -5 points - Non-critical issues, optimization opportunities
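The penalty scheme above can be sketched as a small function. The penalties (50, 20, 5) come from the list above. Clamping the result at 0 is an assumption made here so the score stays in the stated 0-100 range, and the function name `health_score` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: compute a health score from issue counts.
# Usage: health_score <boom_count> <warn_count> <info_count>
health_score() {
  local boom="${1:-0}" warn="${2:-0}" info="${3:-0}"
  local score=$(( 100 - 50 * boom - 20 * warn - 5 * info ))
  # Assumption: clamp at 0 so the score stays within 0-100.
  if [ "$score" -lt 0 ]; then
    score=0
  fi
  echo "$score"
}

# Example: one warning and two informational findings.
#   health_score 0 1 2   # prints 70
```

Under this scheme two critical findings are already enough to bring the score to 0.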
Quick Cluster Health Assessment
    #!/bin/bash
    # cluster-health-check.sh
    echo "=== CLUSTER HEALTH ASSESSMENT ==="

    # 1. Node Health (Critical)
    echo "### NODE HEALTH ###"
    kubectl get nodes -o wide | grep -E "NotReady|Unknown" && \
      echo "BOOM: Unhealthy nodes detected!" || echo "✓ All nodes healthy"

    # 2. Pod Issues (Critical)
    echo -e "\n### POD HEALTH ###"
    POD_ISSUES=$(kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers | wc -l)
    if [ "$POD_ISSUES" -gt 0 ]; then
      echo "WARN: $POD_ISSUES pods not running"
      kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
    else
      echo "✓ All pods running"
    fi

    # 3. Security (Critical)
    echo -e "\n### SECURITY ASSESSMENT ###"
    PRIVILEGED=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].securityContext.privileged == true) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
    [ "$PRIVILEGED" -gt 0 ] && echo "BOOM: $PRIVILEGED privileged containers!" || echo "✓ No privileged containers"

    # 4. Resource Configuration (Warning)
    echo -e "\n### RESOURCE CONFIGURATION ###"
    NO_LIMITS=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].resources.limits == null) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
    [ "$NO_LIMITS" -gt 0 ] && echo "WARN: $NO_LIMITS containers without limits" || echo "✓ All have limits"

    # 5. Storage (Warning)
    # PVCs do not support a status.phase field selector, so filter on the STATUS column instead
    echo -e "\n### STORAGE HEALTH ###"
    PENDING_PVC=$(kubectl get pvc -A --no-headers | awk '$3 != "Bound"' | wc -l)
    [ "$PENDING_PVC" -gt 0 ] && echo "WARN: $PENDING_PVC PVCs not bound" || echo "✓ All PVCs bound"

    # OpenShift: Cluster Operators
    # Columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED; flag Available != True or Degraded == True
    if command -v oc &> /dev/null; then
      echo -e "\n### OPENSHIFT OPERATORS ###"
      DEGRADED=$(oc get clusteroperators --no-headers | awk '$3 != "True" || $5 == "True"' | wc -l)
      [ "$DEGRADED" -gt 0 ] && echo "BOOM: $DEGRADED operators degraded!" || echo "✓ All operators healthy"
    fi
Quick Diagnostic Commands
    # Pod status overview
    kubectl get pods -n ${NAMESPACE} -o wide

    # Recent events (sorted by time)
    kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp'

    # Pod details and events
    kubectl describe pod ${POD_NAME} -n ${NAMESPACE}

    # Container logs (current)
    kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER}

    # Container logs (previous crashed instance)
    kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER} --previous

    # Multi-pod log streaming
    stern -n ${NAMESPACE} ${POD_PREFIX}
    stern -A -l app=${APP_NAME} --since 1h

    # Node status
    kubectl get nodes -o wide
    kubectl describe node ${NODE_NAME}

    # Resource usage
    kubectl top pods -n ${NAMESPACE}
    kubectl top nodes
Pod Status Interpretation
Pod Phase States
| Phase | Meaning | Action |
|---|---|---|
| `Pending` | Not scheduled or pulling images | Check events, node resources, PVC status |
| `Running` | At least one container running | Check container statuses if issues |
| `Succeeded` | All containers completed successfully | Normal for Jobs |
| `Failed` | All containers terminated, at least one failed | Check logs, exit codes |
| `Unknown` | Cannot determine state | Node communication issue |
Container Waiting States
| Reason | Cause | Resolution |
|---|---|---|
| `ContainerCreating` | Setting up container | Check events, volume mounts |
| `ImagePullBackOff` | Cannot pull image | Verify image name, registry access, credentials |
| `ErrImagePull` | Image pull failed | Check image exists, network, ImagePullSecrets |
| `CreateContainerConfigError` | Config error | Check ConfigMaps, Secrets exist |
| `CrashLoopBackOff` | Container repeatedly crashing | Check `logs --previous`, fix application |
Container Exit Codes
| Exit Code | Signal | Cause | Resolution |
|---|---|---|---|
| 0 | - | Normal exit | Expected for Jobs |
| 1 | - | Application error | Check logs for stack trace |
| 126 | - | Command not executable | Fix permissions |
| 127 | - | Command not found | Fix command path |
| 137 | SIGKILL | OOM or forced termination | Increase memory limit |
| 143 | SIGTERM | Graceful shutdown | Normal during updates |
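Exit codes above 128 follow the shell convention of 128 plus the signal number, which is why 137 maps to SIGKILL (9) and 143 to SIGTERM (15). A minimal sketch of that lookup, using a hypothetical function name `explain_exit_code` and the causes from the table above:

```bash
#!/bin/bash
# Hypothetical helper: translate a container exit code into a short explanation.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "normal exit" ;;
    1)   echo "application error" ;;
    126) echo "command not executable" ;;
    127) echo "command not found" ;;
    *)
      if [ "$code" -gt 128 ]; then
        # Codes above 128 encode the terminating signal as (code - 128).
        echo "terminated by signal $(( code - 128 ))"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
}

# Example (not executed here): read the last exit code of a pod's containers.
#   kubectl get pod "${POD_NAME}" -n "${NAMESPACE}" \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
```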
Event Analysis
Critical Events to Monitor
Scheduling Events
| Event | Meaning | Resolution |
|---|---|---|
| `FailedScheduling` | Cannot place pod | Check node resources, taints, affinity |
| | No suitable node | Add nodes, adjust requirements |
FailedScheduling Messages:

    "Insufficient cpu" → Reduce requests or add capacity
    "Insufficient memory" → Reduce requests or add capacity
    "node(s) had taint" → Add toleration or remove taint
    "node(s) didn't match selector" → Fix nodeSelector/affinity
    "persistentvolumeclaim not found" → Create PVC or fix name
Image Events
| Event | Meaning | Resolution |
|---|---|---|
| `ImagePullBackOff` | Repeated pull failures | Check image name, registry, auth |
| `ErrImageNeverPull` | Image not local | Change imagePullPolicy or pre-pull |
ImagePullBackOff Diagnosis:
    # Check image name
    kubectl get pod ${POD} -o jsonpath='{.spec.containers[*].image}'

    # Verify ImagePullSecrets
    kubectl get pod ${POD} -o jsonpath='{.spec.imagePullSecrets}'
    kubectl get secret ${SECRET} -n ${NAMESPACE}
Volume Events
| Event | Meaning | Resolution |
|---|---|---|
| `FailedMount` | Cannot mount volume | Check PVC, storage class |
| `FailedAttachVolume` | Cannot attach | Check cloud provider, volume exists |
PVC Pending Diagnosis:
    kubectl describe pvc ${PVC_NAME} -n ${NAMESPACE}
    kubectl get storageclass
    kubectl get pv
Log Analysis Patterns
Common Error Patterns
    # Search for errors
    kubectl logs ${POD} -n ${NS} | grep -iE "(error|exception|fatal|panic)"

    # Java OOM
    java.lang.OutOfMemoryError → Increase memory, tune JVM heap

    # Connection refused
    ECONNREFUSED, Connection refused → Dependency not available

    # DNS failure
    ENOTFOUND, getaddrinfo → DNS resolution failed, check service name

    # Permission denied
    Permission denied → Check securityContext, runAsUser, fsGroup
Memory Issues (OOMKilled)
    Last State: Terminated
      Reason: OOMKilled
      Exit Code: 137

    → Solutions:
    1. Increase memory limit
    2. Profile application memory usage
    3. For JVM: Set -Xmx < container limit (leave ~25% headroom)
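The JVM headroom rule above is simple arithmetic: with 25% headroom, the heap is capped at 75% of the container memory limit. A minimal sketch, assuming the limit is given in MiB. The function name `jvm_xmx_for_limit` and the optional headroom argument are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: derive an -Xmx flag from a container memory limit.
# Usage: jvm_xmx_for_limit <limit_in_MiB> [headroom_percent]
jvm_xmx_for_limit() {
  local limit_mib="$1" headroom_pct="${2:-25}"
  # Integer arithmetic, so the result is rounded down to whole MiB.
  echo "-Xmx$(( limit_mib * (100 - headroom_pct) / 100 ))m"
}

# Example: a 1Gi limit with the default 25% headroom.
#   jvm_xmx_for_limit 1024   # prints -Xmx768m
```

On JVMs that support it, `-XX:MaxRAMPercentage=75.0` expresses the same ratio without hard-coding a size, which may be easier to keep in sync when the limit changes.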
Node Troubleshooting
Node Conditions
| Condition | Status | Meaning |
|---|---|---|
| `Ready` | True | Node healthy |
| `Ready` | False | Kubelet not healthy |
| `Ready` | Unknown | No heartbeat |
| `MemoryPressure` | True | Low memory |
| `DiskPressure` | True | Low disk space |
| `PIDPressure` | True | Too many processes |
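The condition table can be applied mechanically to a node's reported conditions. The sketch below reads `Type=Status` lines on stdin and prints only the ones that need attention. The function name `flag_node_conditions` and the line format are assumptions chosen for illustration.

```bash
#!/bin/bash
# Hypothetical helper: flag unhealthy node conditions.
# Input: one "Type=Status" pair per line on stdin (for example "Ready=True").
flag_node_conditions() {
  local type status
  while IFS='=' read -r type status; do
    case "$type" in
      Ready)
        # Anything other than True means the kubelet is unhealthy or silent.
        if [ "$status" != "True" ]; then
          echo "Ready=$status: kubelet unhealthy or no heartbeat"
        fi
        ;;
      MemoryPressure|DiskPressure|PIDPressure)
        if [ "$status" = "True" ]; then
          echo "$type=True: node under pressure"
        fi
        ;;
    esac
  done
  return 0
}

# Example (not executed here): feed it the conditions of a real node.
#   kubectl get node "${NODE_NAME}" \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | flag_node_conditions
```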
Node NotReady Diagnosis
    kubectl describe node ${NODE_NAME}

    # On the node (SSH or debug)
    systemctl status kubelet
    journalctl -u kubelet -f

    # Check resources
    df -h
    free -m
    top
Networking Troubleshooting
DNS Issues
    # Test DNS resolution
    kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- \
      nslookup ${SERVICE_NAME}.${NAMESPACE}.svc.cluster.local

    # Check CoreDNS
    kubectl get pods -n kube-system -l k8s-app=kube-dns
    kubectl logs -n kube-system -l k8s-app=kube-dns
Service Connectivity
    # Verify service and endpoints
    kubectl get svc ${SERVICE} -n ${NS}
    kubectl get endpoints ${SERVICE} -n ${NS}

    # Test from debug pod
    kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
      curl -v http://${SERVICE}.${NS}.svc.cluster.local:${PORT}
Ingress/Route Issues
    # Check Ingress
    kubectl describe ingress ${INGRESS} -n ${NS}

    # Ingress controller logs
    kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

    # OpenShift Route
    oc describe route ${ROUTE} -n ${NS}
    oc get pods -n openshift-ingress
OpenShift-Specific Troubleshooting
Cluster Operators
    # Check overall health
    oc get clusteroperators

    # Investigate degraded operator
    oc describe clusteroperator ${OPERATOR}
    oc logs -n openshift-${OPERATOR} -l name=${OPERATOR}-operator
Security Context Constraints (SCC)
    # List SCCs
    oc get scc

    # Check which SCC a pod is using
    oc get pod ${POD} -n ${NS} -o yaml | grep scc

    # Common error fix
    # "unable to validate against any security context constraint"
    oc adm policy add-scc-to-user ${SCC} -z ${SERVICE_ACCOUNT} -n ${NS}
Build Failures
    # Check build status
    oc get builds -n ${NS}
    oc describe build ${BUILD} -n ${NS}
    oc logs build/${BUILD} -n ${NS}
Cloud Provider Troubleshooting
EKS (AWS)
    aws eks describe-cluster --name ${CLUSTER} --query 'cluster.status'
    aws eks describe-addon --cluster-name ${CLUSTER} --addon-name vpc-cni
    eksctl get nodegroup --cluster ${CLUSTER}
AKS (Azure)
    az aks show --resource-group ${RG} --name ${CLUSTER} --query provisioningState
    az aks check-network outbound --resource-group ${RG} --name ${CLUSTER}
GKE (Google Cloud)
    gcloud container clusters describe ${CLUSTER} --region ${REGION} --format='value(status)'
    gcloud container operations list --filter="targetLink:${CLUSTER}" --limit=10
Diagnostic Decision Tree
Pod Not Starting
    Pod Phase = Pending?
    ├── Yes → Check Scheduling
    │   ├── "Insufficient cpu/memory" → Add nodes or reduce requests
    │   ├── "node(s) had taint" → Add toleration
    │   ├── "PVC not found" → Create PVC
    │   └── No events → Check API server
    │
    └── No → Check Container Status
        ├── ImagePullBackOff → Fix image name/auth
        ├── CrashLoopBackOff → Check logs --previous
        ├── CreateContainerConfigError → Fix ConfigMap/Secret
        └── Running but not ready → Check readiness probe
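The tree above can be sketched as a lookup that takes the pod phase and, when present, the container waiting reason. One deliberate difference: the sketch checks the waiting reason first, because a pod that is still pulling its image also reports the `Pending` phase. The function name `pod_next_step` and the wording of the hints are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: suggest the next diagnostic step for a pod.
# Usage: pod_next_step <phase> [waiting_reason]
pod_next_step() {
  local phase="$1" reason="${2:-}"

  # A waiting reason is more specific than the phase, so check it first.
  case "$reason" in
    ImagePullBackOff|ErrImagePull) echo "Fix image name or registry auth"; return 0 ;;
    CrashLoopBackOff)              echo "Check logs --previous"; return 0 ;;
    CreateContainerConfigError)    echo "Fix ConfigMap/Secret"; return 0 ;;
  esac

  case "$phase" in
    Pending) echo "Check scheduling events" ;;
    Running) echo "Check readiness probe" ;;
    *)       echo "Inspect with kubectl describe pod" ;;
  esac
}

# Example (not executed here): gather the inputs from a live pod.
#   phase=$(kubectl get pod "${POD}" -n "${NS}" -o jsonpath='{.status.phase}')
#   reason=$(kubectl get pod "${POD}" -n "${NS}" \
#     -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')
#   pod_next_step "$phase" "$reason"
```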
Application Not Responding
    Can reach Service?
    ├── No → Check Service
    │   ├── No endpoints → Fix selector labels
    │   ├── Wrong port → Fix targetPort
    │   └── NetworkPolicy blocking → Adjust policy
    │
    └── Yes → Check Pod
        ├── Probe failing → Fix probe or application
        ├── High latency → Check resources, dependencies
        └── Errors in logs → Fix application
Performance Analysis
Resource Optimization
    # Compare usage vs requests
    kubectl top pods -n ${NS}
    kubectl get pods -n ${NS} -o custom-columns=\
    NAME:.metadata.name,\
    CPU_REQ:.spec.containers[*].resources.requests.cpu,\
    MEM_REQ:.spec.containers[*].resources.requests.memory

    # Find pods without limits
    kubectl get pods -A -o json | jq -r \
      '.items[] | select(.spec.containers[].resources.limits == null) | "\(.metadata.namespace)/\(.metadata.name)"'
Right-Sizing Recommendations
| Symptom | Indication | Action |
|---|---|---|
| CPU throttling | CPU limit too low | Increase CPU limit |
| OOMKilled | Memory limit too low | Increase memory limit |
| Low utilization | Over-provisioned | Reduce requests |
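The "low utilization" row can be made concrete by comparing observed usage against the request. The sketch below works in CPU millicores and treats anything under 30% of the request as over-provisioned. That threshold is an assumption for illustration, not a value from this guide, and the function name `rightsize_hint` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: compare usage with the request and suggest an action.
# Usage: rightsize_hint <used_millicores> <requested_millicores> [low_threshold_percent]
rightsize_hint() {
  local used_m="$1" requested_m="$2" low_pct="${3:-30}"

  # Without a request there is nothing to compare against.
  if [ "$requested_m" -le 0 ]; then
    echo "no request set"
    return 0
  fi

  local pct=$(( used_m * 100 / requested_m ))
  if [ "$pct" -lt "$low_pct" ]; then
    echo "${pct}% of request used: over-provisioned, consider reducing requests"
  else
    echo "${pct}% of request used: no change suggested"
  fi
}

# Example: 50m used against a 500m request.
#   rightsize_hint 50 500   # prints 10% of request used: over-provisioned, consider reducing requests
```

A single `kubectl top` sample is a point-in-time reading, so a decision to lower requests is better based on usage observed over a representative period.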