Claude-skill-registry kubernetes-troubleshooting

Install

Source · Clone the upstream repo:
git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/kubernetes-troubleshooting" ~/.claude/skills/majiayu000-claude-skill-registry-kubernetes-troubleshooting && rm -rf "$T"

Manifest: skills/data/kubernetes-troubleshooting/SKILL.md

Source content

Kubernetes / OpenShift Troubleshooting Guide

Systematic approach to diagnosing and resolving cluster issues through event analysis, log interpretation, and Popeye-style health scoring.

Current Versions & Tools (January 2026)

| Platform   | Version | Key Changes |
|------------|---------|-------------|
| Kubernetes | 1.31.x  | Sidecar containers GA, Pod lifecycle improvements |
| OpenShift  | 4.17.x  | OVN-Kubernetes default, enhanced web terminal |
| EKS        | 1.31    | Pod Identity, Auto Mode, Karpenter 1.x |
| AKS        | 1.31    | Cilium CNI, Workload Identity GA |
| GKE        | 1.31    | Autopilot improvements, Gateway API GA |

Troubleshooting Tools

| Tool | Install | Purpose |
|------|---------|---------|
| k9s | brew install k9s | Terminal UI |
| stern | brew install stern | Multi-pod log tailing |
| kubectx/kubens | brew install kubectx | Context switching |
| kubectl-node-shell | kubectl krew install node-shell | Node access |

Command Usage Convention

IMPORTANT: This skill uses kubectl as the primary command. When working with:

  • OpenShift/ARO clusters: replace kubectl with oc
  • Standard Kubernetes (AKS, EKS, GKE): use kubectl as shown

Cluster Health Scoring (Popeye-Style)

Health scores range from 0-100. Issues reduce the score based on severity:

  • BOOM (Critical): -50 points - Security vulnerabilities, resource exhaustion, failed services
  • WARN (Warning): -20 points - Configuration inefficiencies, best practice violations
  • INFO (Informational): -5 points - Non-critical issues, optimization opportunities
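As a rough illustration, the deductions above can be tallied in a few lines of shell. The severity counts below are hypothetical placeholders, not output of any real scan:

```shell
# Tally a Popeye-style score: start at 100, deduct per finding, clamp at 0.
SCORE=100
BOOMS=0; WARNS=2; INFOS=3   # hypothetical counts from a cluster scan
SCORE=$(( SCORE - BOOMS*50 - WARNS*20 - INFOS*5 ))
[ "$SCORE" -lt 0 ] && SCORE=0
echo "Cluster health score: $SCORE/100"
```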

Quick Cluster Health Assessment

#!/bin/bash
# cluster-health-check.sh -- quick Popeye-style pass over the current cluster
echo "=== CLUSTER HEALTH ASSESSMENT ==="

# 1. Node Health (Critical)
echo "### NODE HEALTH ###"
kubectl get nodes -o wide | grep -E "NotReady|Unknown" && \
  echo "BOOM: Unhealthy nodes detected!" || echo "✓ All nodes healthy"

# 2. Pod Issues (Critical)
echo -e "\n### POD HEALTH ###"
POD_ISSUES=$(kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers | wc -l)
if [ "$POD_ISSUES" -gt 0 ]; then
    echo "WARN: $POD_ISSUES pods not running"
    kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
else
    echo "✓ All pods running"
fi

# 3. Security (Critical)
echo -e "\n### SECURITY ASSESSMENT ###"
PRIVILEGED=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].securityContext.privileged == true) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
[ "$PRIVILEGED" -gt 0 ] && echo "BOOM: $PRIVILEGED privileged containers!" || echo "✓ No privileged containers"

# 4. Resource Configuration (Warning)
echo -e "\n### RESOURCE CONFIGURATION ###"
NO_LIMITS=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].resources.limits == null) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
[ "$NO_LIMITS" -gt 0 ] && echo "WARN: $NO_LIMITS containers without limits" || echo "✓ All have limits"

# 5. Storage (Warning)
echo -e "\n### STORAGE HEALTH ###"
PENDING_PVC=$(kubectl get pvc -A --field-selector=status.phase!=Bound --no-headers | wc -l)
[ "$PENDING_PVC" -gt 0 ] && echo "WARN: $PENDING_PVC PVCs not bound" || echo "✓ All PVCs bound"

# OpenShift only: Cluster Operators
if command -v oc &> /dev/null; then
    echo -e "\n### OPENSHIFT OPERATORS ###"
    DEGRADED=$(oc get clusteroperators --no-headers | grep -c -E "False.*True|False.*False")
    [ "$DEGRADED" -gt 0 ] && echo "BOOM: $DEGRADED operators degraded!" || echo "✓ All operators healthy"
fi

Quick Diagnostic Commands

# Pod status overview
kubectl get pods -n ${NAMESPACE} -o wide

# Recent events (sorted by time)
kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp'

# Pod details and events
kubectl describe pod ${POD_NAME} -n ${NAMESPACE}

# Container logs (current)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER}

# Container logs (previous crashed instance)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER} --previous

# Multi-pod log streaming
stern -n ${NAMESPACE} ${POD_PREFIX}
stern -A -l app=${APP_NAME} --since 1h

# Node status
kubectl get nodes -o wide
kubectl describe node ${NODE_NAME}

# Resource usage
kubectl top pods -n ${NAMESPACE}
kubectl top nodes

Pod Status Interpretation

Pod Phase States

| Phase | Meaning | Action |
|-------|---------|--------|
| Pending | Not scheduled or pulling images | Check events, node resources, PVC status |
| Running | At least one container running | Check container statuses if issues |
| Succeeded | All containers completed successfully | Normal for Jobs |
| Failed | All containers terminated, at least one failed | Check logs, exit codes |
| Unknown | Cannot determine state | Node communication issue |
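To see where a cluster stands across these phases at a glance, the phases can be tallied with jq. A minimal sketch, using a tiny inline sample in place of live `kubectl get pods -A -o json` output:

```shell
# Tally pods by phase; in practice pipe `kubectl get pods -A -o json` into jq.
cat <<'EOF' > /tmp/pods-sample.json
{"items":[
  {"status":{"phase":"Running"}},
  {"status":{"phase":"Pending"}},
  {"status":{"phase":"Running"}}
]}
EOF
jq -r '.items[].status.phase' /tmp/pods-sample.json | sort | uniq -c | sort -rn
```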

Container Waiting States

| Reason | Cause | Resolution |
|--------|-------|------------|
| ContainerCreating | Setting up container | Check events, volume mounts |
| ImagePullBackOff | Cannot pull image | Verify image name, registry access, credentials |
| ErrImagePull | Image pull failed | Check image exists, network, ImagePullSecrets |
| CreateContainerConfigError | Config error | Check ConfigMaps, Secrets exist |
| CrashLoopBackOff | Container repeatedly crashing | Check logs --previous, fix application |

Container Exit Codes

| Exit Code | Signal | Cause | Resolution |
|-----------|--------|-------|------------|
| 0 | - | Normal exit | Expected for Jobs |
| 1 | - | Application error | Check logs for stack trace |
| 126 | - | Command not executable | Fix permissions |
| 127 | - | Command not found | Fix command path |
| 137 | SIGKILL | OOM or forced termination | Increase memory limit |
| 143 | SIGTERM | Graceful shutdown | Normal during updates |
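Exit codes above 128 encode the terminating signal as code minus 128, which is how 137 and 143 map to SIGKILL (9) and SIGTERM (15). A small helper (hypothetical, not part of any tool) makes the decoding explicit:

```shell
# Decode a container exit code: >128 means the process was killed by signal (code - 128).
decode_exit() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $(( code - 128 ))"
  else
    echo "application exited with $code"
  fi
}
decode_exit 137   # SIGKILL: OOM or forced termination
decode_exit 143   # SIGTERM: graceful shutdown
```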

Event Analysis

Critical Events to Monitor

Scheduling Events

| Event | Meaning | Resolution |
|-------|---------|------------|
| FailedScheduling | Cannot place pod | Check node resources, taints, affinity |
| Unschedulable | No suitable node | Add nodes, adjust requirements |

FailedScheduling Messages:

"Insufficient cpu"           → Reduce requests or add capacity
"Insufficient memory"        → Reduce requests or add capacity
"node(s) had taint"          → Add toleration or remove taint
"node(s) didn't match selector" → Fix nodeSelector/affinity
"persistentvolumeclaim not found" → Create PVC or fix name

Image Events

| Event | Meaning | Resolution |
|-------|---------|------------|
| BackOff | Repeated pull failures | Check image name, registry, auth |
| ErrImageNeverPull | Image not local | Change imagePullPolicy or pre-pull |

ImagePullBackOff Diagnosis:

# Check image name
kubectl get pod ${POD} -o jsonpath='{.spec.containers[*].image}'

# Verify ImagePullSecrets
kubectl get pod ${POD} -o jsonpath='{.spec.imagePullSecrets}'
kubectl get secret ${SECRET} -n ${NAMESPACE}

Volume Events

| Event | Meaning | Resolution |
|-------|---------|------------|
| FailedMount | Cannot mount volume | Check PVC, storage class |
| FailedAttachVolume | Cannot attach | Check cloud provider, volume exists |

PVC Pending Diagnosis:

kubectl describe pvc ${PVC_NAME} -n ${NAMESPACE}
kubectl get storageclass
kubectl get pv

Log Analysis Patterns

Common Error Patterns

# Search for errors
kubectl logs ${POD} -n ${NS} | grep -iE "(error|exception|fatal|panic)"

# Java OOM
java.lang.OutOfMemoryError → Increase memory, tune JVM heap

# Connection refused
ECONNREFUSED, Connection refused → Dependency not available

# DNS failure
ENOTFOUND, getaddrinfo → DNS resolution failed, check service name

# Permission denied
Permission denied → Check securityContext, runAsUser, fsGroup
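The error-pattern grep above can be sanity-checked against a canned log before pointing it at a live pod. The sample lines below are invented for the demonstration:

```shell
# Exercise the error pattern on a sample log instead of live pod output.
cat <<'EOF' > /tmp/app-sample.log
INFO  listening on :8080
ERROR connection refused to db:5432
WARN  retrying in 5s
FATAL unrecoverable state, shutting down
EOF
grep -icE "(error|exception|fatal|panic)" /tmp/app-sample.log   # count of matching lines
```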

Memory Issues (OOMKilled)

Last State: Terminated
Reason: OOMKilled
Exit Code: 137

→ Solutions:
1. Increase memory limit
2. Profile application memory usage
3. For JVM: Set -Xmx < container limit (leave ~25% headroom)
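The ~25% headroom rule turns into quick arithmetic when picking -Xmx; the 2048 MiB limit below is an arbitrary example value:

```shell
# Derive an -Xmx from the container memory limit, leaving ~25% headroom
# for metaspace, thread stacks, and other off-heap allocations.
LIMIT_MIB=2048                        # container memory limit (example value)
XMX_MIB=$(( LIMIT_MIB * 75 / 100 ))   # heap gets ~75% of the limit
echo "-Xmx${XMX_MIB}m"
```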

Node Troubleshooting

Node Conditions

| Condition | Status | Meaning |
|-----------|--------|---------|
| Ready | True | Node healthy |
| Ready | False | Kubelet not healthy |
| Ready | Unknown | No heartbeat |
| MemoryPressure | True | Low memory |
| DiskPressure | True | Low disk space |
| PIDPressure | True | Too many processes |
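Trouble conditions can be surfaced across all nodes with jq: Ready is bad when not True, and the pressure conditions are bad when True. The sketch below runs against a tiny inline sample rather than live `kubectl get nodes -o json` output:

```shell
# Flag Ready!=True, or any non-Ready condition (pressure) that is True.
cat <<'EOF' > /tmp/nodes-sample.json
{"items":[{"metadata":{"name":"node-1"},
 "status":{"conditions":[
   {"type":"Ready","status":"True"},
   {"type":"MemoryPressure","status":"True"}]}}]}
EOF
jq -r '.items[] | .metadata.name as $n
       | .status.conditions[]
       | select((.type == "Ready" and .status != "True")
                or (.type != "Ready" and .status == "True"))
       | "\($n): \(.type)=\(.status)"' /tmp/nodes-sample.json
```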

Node NotReady Diagnosis

kubectl describe node ${NODE_NAME}

# On the node (SSH or debug)
systemctl status kubelet
journalctl -u kubelet -f

# Check resources
df -h
free -m
top

Networking Troubleshooting

DNS Issues

# Test DNS resolution
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- \
  nslookup ${SERVICE_NAME}.${NAMESPACE}.svc.cluster.local

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

Service Connectivity

# Verify service and endpoints
kubectl get svc ${SERVICE} -n ${NS}
kubectl get endpoints ${SERVICE} -n ${NS}

# Test from debug pod
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
  curl -v http://${SERVICE}.${NS}.svc.cluster.local:${PORT}

Ingress/Route Issues

# Check Ingress
kubectl describe ingress ${INGRESS} -n ${NS}

# Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# OpenShift Route
oc describe route ${ROUTE} -n ${NS}
oc get pods -n openshift-ingress

OpenShift-Specific Troubleshooting

Cluster Operators

# Check overall health
oc get clusteroperators

# Investigate degraded operator
oc describe clusteroperator ${OPERATOR}
oc logs -n openshift-${OPERATOR} -l name=${OPERATOR}-operator

Security Context Constraints (SCC)

# List SCCs
oc get scc

# Check which SCC a pod is using
oc get pod ${POD} -n ${NS} -o yaml | grep scc

# Common error fix
# "unable to validate against any security context constraint"
oc adm policy add-scc-to-user ${SCC} -z ${SERVICE_ACCOUNT} -n ${NS}

Build Failures

# Check build status
oc get builds -n ${NS}
oc describe build ${BUILD} -n ${NS}
oc logs build/${BUILD} -n ${NS}

Cloud Provider Troubleshooting

EKS (AWS)

aws eks describe-cluster --name ${CLUSTER} --query 'cluster.status'
aws eks describe-addon --cluster-name ${CLUSTER} --addon-name vpc-cni
eksctl get nodegroup --cluster ${CLUSTER}

AKS (Azure)

az aks show --resource-group ${RG} --name ${CLUSTER} --query provisioningState
az aks check-network outbound --resource-group ${RG} --name ${CLUSTER}

GKE (Google Cloud)

gcloud container clusters describe ${CLUSTER} --region ${REGION} --format='value(status)'
gcloud container operations list --filter="targetLink:${CLUSTER}" --limit=10

Diagnostic Decision Tree

Pod Not Starting

Pod Phase = Pending?
├── Yes → Check Scheduling
│   ├── "Insufficient cpu/memory" → Add nodes or reduce requests
│   ├── "node(s) had taint" → Add toleration
│   ├── "PVC not found" → Create PVC
│   └── No events → Check API server
│
└── No → Check Container Status
    ├── ImagePullBackOff → Fix image name/auth
    ├── CrashLoopBackOff → Check logs --previous
    ├── CreateContainerConfigError → Fix ConfigMap/Secret
    └── Running but not ready → Check readiness probe

Application Not Responding

Can reach Service?
├── No → Check Service
│   ├── No endpoints → Fix selector labels
│   ├── Wrong port → Fix targetPort
│   └── NetworkPolicy blocking → Adjust policy
│
└── Yes → Check Pod
    ├── Probe failing → Fix probe or application
    ├── High latency → Check resources, dependencies
    └── Errors in logs → Fix application

Performance Analysis

Resource Optimization

# Compare usage vs requests
kubectl top pods -n ${NS}

kubectl get pods -n ${NS} -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory

# Find pods without limits
kubectl get pods -A -o json | jq -r \
  '.items[] | select(.spec.containers[].resources.limits == null) |
   "\(.metadata.namespace)/\(.metadata.name)"'

Right-Sizing Recommendations

| Symptom | Indication | Action |
|---------|------------|--------|
| CPU throttling | CPU limit too low | Increase CPU limit |
| OOMKilled | Memory limit too low | Increase memory limit |
| Low utilization | Over-provisioned | Reduce requests |
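The low-utilization symptom can be made concrete with a simple ratio check; the threshold and numbers below are illustrative, not a standard:

```shell
# Flag over-provisioning when observed usage is a small fraction of the request.
REQUEST_MCPU=500   # from resources.requests.cpu (500m), example value
USAGE_MCPU=50      # from `kubectl top pods`, example value
PCT=$(( USAGE_MCPU * 100 / REQUEST_MCPU ))
if [ "$PCT" -lt 20 ]; then
  echo "over-provisioned: using ${PCT}% of CPU request"
fi
```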