Claude-skill-registry kubernetes-troubleshooting

Install

Source · Clone the upstream repo:
git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/kubernetes-troubleshooting" ~/.claude/skills/majiayu000-claude-skill-registry-kubernetes-troubleshooting && rm -rf "$T"

Manifest: skills/data/kubernetes-troubleshooting/SKILL.md

Source content

Kubernetes / OpenShift Troubleshooting Guide

Systematic approach to diagnosing and resolving cluster issues through event analysis, log interpretation, and Popeye-style health scoring.

Current Versions & Tools (January 2026)

| Platform   | Version | Key Changes |
|------------|---------|-------------|
| Kubernetes | 1.31.x  | Sidecar containers GA, Pod lifecycle improvements |
| OpenShift  | 4.17.x  | OVN-Kubernetes default, enhanced web terminal |
| EKS        | 1.31    | Pod Identity, Auto Mode, Karpenter 1.x |
| AKS        | 1.31    | Cilium CNI, Workload Identity GA |
| GKE        | 1.31    | Autopilot improvements, Gateway API GA |

Troubleshooting Tools

| Tool | Install | Purpose |
|------|---------|---------|
| k9s | brew install k9s | Terminal UI |
| stern | brew install stern | Multi-pod log tailing |
| kubectx/kubens | brew install kubectx | Context switching |
| kubectl-node-shell | kubectl krew install node-shell | Node access |

Command Usage Convention

IMPORTANT: This skill uses kubectl as the primary command. When working with:

  • OpenShift/ARO clusters: replace kubectl with oc
  • Standard Kubernetes (AKS, EKS, GKE): use kubectl as shown

Cluster Health Scoring (Popeye-Style)

Health scores range from 0-100. Issues reduce the score based on severity:

  • BOOM (Critical): -50 points - Security vulnerabilities, resource exhaustion, failed services
  • WARN (Warning): -20 points - Configuration inefficiencies, best practice violations
  • INFO (Informational): -5 points - Non-critical issues, optimization opportunities
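As a rough illustration, the deductions above can be tallied in a few lines of shell. The severity counts below are hypothetical placeholders, not output of any real scan:

```shell
# Tally a Popeye-style score: start at 100, deduct per finding, clamp at 0.
SCORE=100
BOOMS=0; WARNS=2; INFOS=3   # hypothetical counts from a cluster scan
SCORE=$(( SCORE - BOOMS*50 - WARNS*20 - INFOS*5 ))
[ "$SCORE" -lt 0 ] && SCORE=0
echo "Cluster health score: $SCORE/100"
```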

Quick Cluster Health Assessment

#!/bin/bash
# cluster-health-check.sh -- quick Popeye-style pass over the current cluster
echo "=== CLUSTER HEALTH ASSESSMENT ==="

# 1. Node Health (Critical)
echo "### NODE HEALTH ###"
kubectl get nodes -o wide | grep -E "NotReady|Unknown" && \
  echo "BOOM: Unhealthy nodes detected!" || echo "✓ All nodes healthy"

# 2. Pod Issues (Critical)
echo -e "\n### POD HEALTH ###"
POD_ISSUES=$(kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded --no-headers | wc -l)
if [ "$POD_ISSUES" -gt 0 ]; then
    echo "WARN: $POD_ISSUES pods not running"
    kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
else
    echo "✓ All pods running"
fi

# 3. Security (Critical)
echo -e "\n### SECURITY ASSESSMENT ###"
PRIVILEGED=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].securityContext.privileged == true) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
[ "$PRIVILEGED" -gt 0 ] && echo "BOOM: $PRIVILEGED privileged containers!" || echo "✓ No privileged containers"

# 4. Resource Configuration (Warning)
echo -e "\n### RESOURCE CONFIGURATION ###"
NO_LIMITS=$(kubectl get pods -A -o json | jq -r '.items[] | select(.spec.containers[].resources.limits == null) | "\(.metadata.namespace)/\(.metadata.name)"' | wc -l)
[ "$NO_LIMITS" -gt 0 ] && echo "WARN: $NO_LIMITS containers without limits" || echo "✓ All have limits"

# 5. Storage (Warning)
echo -e "\n### STORAGE HEALTH ###"
PENDING_PVC=$(kubectl get pvc -A --field-selector=status.phase!=Bound --no-headers | wc -l)
[ "$PENDING_PVC" -gt 0 ] && echo "WARN: $PENDING_PVC PVCs not bound" || echo "✓ All PVCs bound"

# OpenShift only: Cluster Operators
if command -v oc &> /dev/null; then
    echo -e "\n### OPENSHIFT OPERATORS ###"
    DEGRADED=$(oc get clusteroperators --no-headers | grep -c -E "False.*True|False.*False")
    [ "$DEGRADED" -gt 0 ] && echo "BOOM: $DEGRADED operators degraded!" || echo "✓ All operators healthy"
fi

Quick Diagnostic Commands

# Pod status overview
kubectl get pods -n ${NAMESPACE} -o wide

# Recent events (sorted by time)
kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp'

# Pod details and events
kubectl describe pod ${POD_NAME} -n ${NAMESPACE}

# Container logs (current)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER}

# Container logs (previous crashed instance)
kubectl logs ${POD_NAME} -n ${NAMESPACE} -c ${CONTAINER} --previous

# Multi-pod log streaming
stern -n ${NAMESPACE} ${POD_PREFIX}
stern -A -l app=${APP_NAME} --since 1h

# Node status
kubectl get nodes -o wide
kubectl describe node ${NODE_NAME}

# Resource usage
kubectl top pods -n ${NAMESPACE}
kubectl top nodes

Pod Status Interpretation

Pod Phase States

| Phase | Meaning | Action |
|-------|---------|--------|
| Pending | Not scheduled or pulling images | Check events, node resources, PVC status |
| Running | At least one container running | Check container statuses if issues |
| Succeeded | All containers completed successfully | Normal for Jobs |
| Failed | All containers terminated, at least one failed | Check logs, exit codes |
| Unknown | Cannot determine state | Node communication issue |
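To see where a cluster stands across these phases at a glance, the phases can be tallied with jq. A minimal sketch, using a tiny inline sample in place of live `kubectl get pods -A -o json` output:

```shell
# Tally pods by phase; in practice pipe `kubectl get pods -A -o json` into jq.
cat <<'EOF' > /tmp/pods-sample.json
{"items":[
  {"status":{"phase":"Running"}},
  {"status":{"phase":"Pending"}},
  {"status":{"phase":"Running"}}
]}
EOF
jq -r '.items[].status.phase' /tmp/pods-sample.json | sort | uniq -c | sort -rn
```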

Container Waiting States

| Reason | Cause | Resolution |
|--------|-------|------------|
| ContainerCreating | Setting up container | Check events, volume mounts |
| ImagePullBackOff | Cannot pull image | Verify image name, registry access, credentials |
| ErrImagePull | Image pull failed | Check image exists, network, ImagePullSecrets |
| CreateContainerConfigError | Config error | Check ConfigMaps, Secrets exist |
| CrashLoopBackOff | Container repeatedly crashing | Check logs --previous, fix application |

Container Exit Codes

| Exit Code | Signal | Cause | Resolution |
|-----------|--------|-------|------------|
| 0 | - | Normal exit | Expected for Jobs |
| 1 | - | Application error | Check logs for stack trace |
| 126 | - | Command not executable | Fix permissions |
| 127 | - | Command not found | Fix command path |
| 137 | SIGKILL | OOM or forced termination | Increase memory limit |
| 143 | SIGTERM | Graceful shutdown | Normal during updates |
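Exit codes above 128 encode the terminating signal as code minus 128, which is how 137 and 143 map to SIGKILL (9) and SIGTERM (15). A small helper (hypothetical, not part of any tool) makes the decoding explicit:

```shell
# Decode a container exit code: >128 means the process was killed by signal (code - 128).
decode_exit() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $(( code - 128 ))"
  else
    echo "application exited with $code"
  fi
}
decode_exit 137   # SIGKILL: OOM or forced termination
decode_exit 143   # SIGTERM: graceful shutdown
```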

Event Analysis

Critical Events to Monitor

Scheduling Events

| Event | Meaning | Resolution |
|-------|---------|------------|
| FailedScheduling | Cannot place pod | Check node resources, taints, affinity |
| Unschedulable | No suitable node | Add nodes, adjust requirements |

FailedScheduling Messages:

"Insufficient cpu"           → Reduce requests or add capacity
"Insufficient memory"        → Reduce requests or add capacity
"node(s) had taint"          → Add toleration or remove taint
"node(s) didn't match selector" → Fix nodeSelector/affinity
"persistentvolumeclaim not found" → Create PVC or fix name

Image Events

| Event | Meaning | Resolution |
|-------|---------|------------|
| BackOff | Repeated pull failures | Check image name, registry, auth |
| ErrImageNeverPull | Image not local | Change imagePullPolicy or pre-pull |

ImagePullBackOff Diagnosis:

# Check image name
kubectl get pod ${POD} -o jsonpath='{.spec.containers[*].image}'

# Verify ImagePullSecrets
kubectl get pod ${POD} -o jsonpath='{.spec.imagePullSecrets}'
kubectl get secret ${SECRET} -n ${NAMESPACE}

Volume Events

| Event | Meaning | Resolution |
|-------|---------|------------|
| FailedMount | Cannot mount volume | Check PVC, storage class |
| FailedAttachVolume | Cannot attach | Check cloud provider, volume exists |

PVC Pending Diagnosis:

kubectl describe pvc ${PVC_NAME} -n ${NAMESPACE}
kubectl get storageclass
kubectl get pv

Log Analysis Patterns

Common Error Patterns

# Search for errors
kubectl logs ${POD} -n ${NS} | grep -iE "(error|exception|fatal|panic)"

# Java OOM
java.lang.OutOfMemoryError → Increase memory, tune JVM heap

# Connection refused
ECONNREFUSED, Connection refused → Dependency not available

# DNS failure
ENOTFOUND, getaddrinfo → DNS resolution failed, check service name

# Permission denied
Permission denied → Check securityContext, runAsUser, fsGroup
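The error-pattern grep above can be sanity-checked against a canned log before pointing it at a live pod. The sample lines below are invented for the demonstration:

```shell
# Exercise the error pattern on a sample log instead of live pod output.
cat <<'EOF' > /tmp/app-sample.log
INFO  listening on :8080
ERROR connection refused to db:5432
WARN  retrying in 5s
FATAL unrecoverable state, shutting down
EOF
grep -icE "(error|exception|fatal|panic)" /tmp/app-sample.log   # count of matching lines
```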

Memory Issues (OOMKilled)

Last State: Terminated
Reason: OOMKilled
Exit Code: 137

→ Solutions:
1. Increase memory limit
2. Profile application memory usage
3. For JVM: Set -Xmx < container limit (leave ~25% headroom)
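The ~25% headroom rule turns into quick arithmetic when picking -Xmx; the 2048 MiB limit below is an arbitrary example value:

```shell
# Derive an -Xmx from the container memory limit, leaving ~25% headroom
# for metaspace, thread stacks, and other off-heap allocations.
LIMIT_MIB=2048                        # container memory limit (example value)
XMX_MIB=$(( LIMIT_MIB * 75 / 100 ))   # heap gets ~75% of the limit
echo "-Xmx${XMX_MIB}m"
```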

Node Troubleshooting

Node Conditions

| Condition | Status | Meaning |
|-----------|--------|---------|
| Ready | True | Node healthy |
| Ready | False | Kubelet not healthy |
| Ready | Unknown | No heartbeat |
| MemoryPressure | True | Low memory |
| DiskPressure | True | Low disk space |
| PIDPressure | True | Too many processes |
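Trouble conditions can be surfaced across all nodes with jq: Ready is bad when not True, and the pressure conditions are bad when True. The sketch below runs against a tiny inline sample rather than live `kubectl get nodes -o json` output:

```shell
# Flag Ready!=True, or any non-Ready condition (pressure) that is True.
cat <<'EOF' > /tmp/nodes-sample.json
{"items":[{"metadata":{"name":"node-1"},
 "status":{"conditions":[
   {"type":"Ready","status":"True"},
   {"type":"MemoryPressure","status":"True"}]}}]}
EOF
jq -r '.items[] | .metadata.name as $n
       | .status.conditions[]
       | select((.type == "Ready" and .status != "True")
                or (.type != "Ready" and .status == "True"))
       | "\($n): \(.type)=\(.status)"' /tmp/nodes-sample.json
```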

Node NotReady Diagnosis

kubectl describe node ${NODE_NAME}

# On the node (SSH or debug)
systemctl status kubelet
journalctl -u kubelet -f

# Check resources
df -h
free -m
top

Networking Troubleshooting

DNS Issues

# Test DNS resolution
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- \
  nslookup ${SERVICE_NAME}.${NAMESPACE}.svc.cluster.local

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

Service Connectivity

# Verify service and endpoints
kubectl get svc ${SERVICE} -n ${NS}
kubectl get endpoints ${SERVICE} -n ${NS}

# Test from debug pod
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
  curl -v http://${SERVICE}.${NS}.svc.cluster.local:${PORT}

Ingress/Route Issues

# Check Ingress
kubectl describe ingress ${INGRESS} -n ${NS}

# Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# OpenShift Route
oc describe route ${ROUTE} -n ${NS}
oc get pods -n openshift-ingress

OpenShift-Specific Troubleshooting

Cluster Operators

# Check overall health
oc get clusteroperators

# Investigate degraded operator
oc describe clusteroperator ${OPERATOR}
oc logs -n openshift-${OPERATOR} -l name=${OPERATOR}-operator

Security Context Constraints (SCC)

# List SCCs
oc get scc

# Check which SCC a pod is using
oc get pod ${POD} -n ${NS} -o yaml | grep scc

# Common error fix
# "unable to validate against any security context constraint"
oc adm policy add-scc-to-user ${SCC} -z ${SERVICE_ACCOUNT} -n ${NS}

Build Failures

# Check build status
oc get builds -n ${NS}
oc describe build ${BUILD} -n ${NS}
oc logs build/${BUILD} -n ${NS}

Cloud Provider Troubleshooting

EKS (AWS)

aws eks describe-cluster --name ${CLUSTER} --query 'cluster.status'
aws eks describe-addon --cluster-name ${CLUSTER} --addon-name vpc-cni
eksctl get nodegroup --cluster ${CLUSTER}

AKS (Azure)

az aks show --resource-group ${RG} --name ${CLUSTER} --query provisioningState
az aks check-network outbound --resource-group ${RG} --name ${CLUSTER}

GKE (Google Cloud)

gcloud container clusters describe ${CLUSTER} --region ${REGION} --format='value(status)'
gcloud container operations list --filter="targetLink:${CLUSTER}" --limit=10

Diagnostic Decision Tree

Pod Not Starting

Pod Phase = Pending?
├── Yes → Check Scheduling
│   ├── "Insufficient cpu/memory" → Add nodes or reduce requests
│   ├── "node(s) had taint" → Add toleration
│   ├── "PVC not found" → Create PVC
│   └── No events → Check API server
│
└── No → Check Container Status
    ├── ImagePullBackOff → Fix image name/auth
    ├── CrashLoopBackOff → Check logs --previous
    ├── CreateContainerConfigError → Fix ConfigMap/Secret
    └── Running but not ready → Check readiness probe

Application Not Responding

Can reach Service?
├── No → Check Service
│   ├── No endpoints → Fix selector labels
│   ├── Wrong port → Fix targetPort
│   └── NetworkPolicy blocking → Adjust policy
│
└── Yes → Check Pod
    ├── Probe failing → Fix probe or application
    ├── High latency → Check resources, dependencies
    └── Errors in logs → Fix application

Performance Analysis

Resource Optimization

# Compare usage vs requests
kubectl top pods -n ${NS}

kubectl get pods -n ${NS} -o custom-columns=\
NAME:.metadata.name,\
CPU_REQ:.spec.containers[*].resources.requests.cpu,\
MEM_REQ:.spec.containers[*].resources.requests.memory

# Find pods without limits
kubectl get pods -A -o json | jq -r \
  '.items[] | select(.spec.containers[].resources.limits == null) |
   "\(.metadata.namespace)/\(.metadata.name)"'

Right-Sizing Recommendations

| Symptom | Indication | Action |
|---------|------------|--------|
| CPU throttling | CPU limit too low | Increase CPU limit |
| OOMKilled | Memory limit too low | Increase memory limit |
| Low utilization | Over-provisioned | Reduce requests |
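The low-utilization symptom can be made concrete with a simple ratio check; the threshold and numbers below are illustrative, not a standard:

```shell
# Flag over-provisioning when observed usage is a small fraction of the request.
REQUEST_MCPU=500   # from resources.requests.cpu (500m), example value
USAGE_MCPU=50      # from `kubectl top pods`, example value
PCT=$(( USAGE_MCPU * 100 / REQUEST_MCPU ))
if [ "$PCT" -lt 20 ]; then
  echo "over-provisioned: using ${PCT}% of CPU request"
fi
```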