Awesome-omni-skill kubernetes-troubleshooting
Debug Kubernetes pods, services, networking, and scaling issues. Use this skill when troubleshooting K8s deployments, investigating pod failures, or diagnosing cluster problems.
install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/devops/kubernetes-troubleshooting" ~/.claude/skills/diegosouzapw-awesome-omni-skill-kubernetes-troubleshooting && rm -rf "$T"
manifest:
skills/devops/kubernetes-troubleshooting/SKILL.md
Kubernetes Troubleshooting
You are a Kubernetes expert. Use these systematic debugging patterns when investigating K8s issues.
Diagnostic Decision Tree
    Pod not running?
    ├── Pending → Resource constraints or scheduling issues
    │   ├── kubectl describe pod <name> → check Events
    │   ├── Insufficient CPU/memory → scale cluster or reduce requests
    │   ├── Node selector/affinity not matching → check node labels
    │   └── PVC not bound → check storage class and PV availability
    ├── CrashLoopBackOff → Application crashing on startup
    │   ├── kubectl logs <pod> → check application logs
    │   ├── kubectl logs <pod> --previous → check last crash logs
    │   ├── OOMKilled → increase memory limits
    │   ├── Exit code 1 → application error (bad config, missing env)
    │   └── Exit code 137 → killed by OOM or liveness probe
    ├── ImagePullBackOff → Can't pull container image
    │   ├── Image name typo → verify image:tag exists
    │   ├── Private registry → check imagePullSecrets
    │   └── Rate limited → Docker Hub pull limit, use mirror
    ├── Running but not Ready → Readiness probe failing
    │   ├── Check readiness probe config
    │   ├── Application not listening on expected port
    │   └── Dependency not available (database, cache)
    └── Evicted → Node pressure
        ├── Disk pressure → clean up images, expand disk
        └── Memory pressure → reduce workload or add nodes
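The top level of the tree can be sketched as a small lookup from pod status to a first command. This is only an illustration: `next_step` is a hypothetical helper name, and the commands chosen for ImagePullBackOff and Evicted are assumptions based on where those errors usually surface (pod Events and node conditions).

```shell
# next_step STATUS
# STATUS is the value shown in the STATUS column of `kubectl get pods`.
# Prints the first command to run for that status (hypothetical helper).
next_step() {
  case "$1" in
    Pending)
      # Scheduling and PVC problems show up in the pod's Events
      echo "kubectl describe pod <name>" ;;
    CrashLoopBackOff)
      # The previous container instance holds the crash output
      echo "kubectl logs <pod> --previous" ;;
    ImagePullBackOff|ErrImagePull)
      # Assumption: the pull error text is read from the pod's Events
      echo "kubectl describe pod <name>" ;;
    Evicted)
      # Assumption: node conditions show DiskPressure / MemoryPressure
      echo "kubectl describe node <name>" ;;
    *)
      # Running-but-not-Ready and anything else: start from Events
      echo "kubectl describe pod <name>" ;;
  esac
}
```

For example, `next_step CrashLoopBackOff` prints the `kubectl logs <pod> --previous` command from the tree.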
Essential Debug Commands
Pod Investigation
    # Overview
    kubectl get pods -A                              # All pods, all namespaces
    kubectl get pods -o wide                         # With node and IP info
    kubectl get pods --sort-by='.status.startTime'   # Sorted by age

    # Deep inspect
    kubectl describe pod <name>                      # Events, conditions, volumes
    kubectl logs <name>                              # Current logs
    kubectl logs <name> --previous                   # Previous crash logs
    kubectl logs <name> -c <container>               # Specific container in multi-container pod
    kubectl logs <name> --tail=100 -f                # Follow, starting from the last 100 lines

    # Interactive debug
    kubectl exec -it <name> -- /bin/sh               # Shell into pod
    kubectl exec -it <name> -- env                   # Check environment
    kubectl exec -it <name> -- cat /etc/resolv.conf  # Check DNS config

    # Resource usage (requires metrics-server)
    kubectl top pods                                 # CPU/memory per pod
    kubectl top nodes                                # CPU/memory per node
Service & Networking
    # Check service endpoints
    kubectl get endpoints <service>        # Are pods registered?
    kubectl get svc <service> -o yaml      # Service config

    # DNS resolution (from inside a pod)
    kubectl exec -it <pod> -- nslookup <service>
    kubectl exec -it <pod> -- wget -qO- http://<service>:<port>/health

    # Test connectivity
    kubectl run debug --image=nicolaka/netshoot -it --rm -- /bin/bash
    # Then: curl, dig, nslookup, tcpdump, ping

    # Ingress
    kubectl get ingress -A
    kubectl describe ingress <name>
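When testing DNS from a pod, keep in mind that the short name `<service>` normally resolves only from pods in the same namespace; from another namespace the namespace-qualified or fully qualified name is needed. A minimal sketch that builds the fully qualified name, assuming the default cluster domain `cluster.local` (`service_fqdn` is a hypothetical helper, and your cluster may use a different domain):

```shell
# service_fqdn SERVICE [NAMESPACE] [CLUSTER_DOMAIN]
# Prints <service>.<namespace>.svc.<cluster-domain>.
# Assumption: namespace defaults to "default", domain to "cluster.local".
service_fqdn() {
  svc="$1"
  ns="${2:-default}"
  domain="${3:-cluster.local}"
  printf '%s.%s.svc.%s\n' "$svc" "$ns" "$domain"
}
```

The result can be passed to the lookup above, for example `kubectl exec -it <pod> -- nslookup "$(service_fqdn <service> <namespace>)"`.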
Cluster Health
    kubectl get nodes                                   # Node status
    kubectl describe node <name>                        # Node conditions, allocatable resources
    kubectl get events -A --sort-by='.lastTimestamp'    # Recent events across all namespaces
    kubectl cluster-info                                # Control plane and core service addresses
Common Issues and Fixes
CrashLoopBackOff
    # 1. Check logs
    kubectl logs <pod> --previous

    # 2. Common causes:
    # - Missing environment variable → check deployment env/configmap/secret
    # - Database not reachable → check network policy, service DNS
    # - Port conflict → check containerPort in deployment
    # - Permissions → check SecurityContext, ServiceAccount

    # 3. Debug with overridden command
    kubectl run debug --image=<same-image> --command -- sleep 3600
    kubectl exec -it debug -- /bin/sh
    # Manually run the entrypoint to see errors
OOMKilled (Exit Code 137)
    # Check current limits
    kubectl describe pod <name> | grep -A 5 "Limits"

    # Fix: increase memory limit
    # In deployment spec:
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "512Mi"  # Increase this

    # Monitor actual usage first
    kubectl top pod <name>
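Exit code 137 is not specific to OOM: exit codes above 128 mean the process was terminated by a signal, with the signal number equal to the code minus 128, so 137 is signal 9 (SIGKILL). Only the `OOMKilled` reason in the container's last state confirms the OOM killer was responsible. A small sketch of the decoding, with `describe_exit_code` as a hypothetical helper name:

```shell
# describe_exit_code CODE
# Decodes a container exit code using the 128 + signal convention.
describe_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exited normally"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "killed by signal 9 (SIGKILL)" ;;
      15) echo "killed by signal 15 (SIGTERM)" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  else
    echo "application error (exit code $code)"
  fi
}
```

So 137 on its own only says the container received SIGKILL; check `kubectl describe pod <name>` for the `OOMKilled` reason before raising the memory limit.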
Service Not Reachable
    # Checklist:
    # 1. Pod is Running and Ready?
    kubectl get pods -l app=<name>

    # 2. Service has endpoints?
    kubectl get endpoints <service>
    # If empty → Service selector doesn't match the pod labels, or no pod is Ready

    # 3. Port correct?
    kubectl get svc <service> -o jsonpath='{.spec.ports[*]}'
    # targetPort must match containerPort

    # 4. NetworkPolicy blocking?
    kubectl get networkpolicy -A
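Step 2 of the checklist comes down to one rule: every key=value pair in the Service selector must be present in the pod's labels, while extra pod labels are ignored. A minimal sketch of that rule for equality-based selectors, assuming both sides are written as comma-separated `key=value` lists (`selector_matches` is a hypothetical helper, not a kubectl feature):

```shell
# selector_matches SELECTOR LABELS
# Both arguments are comma-separated key=value lists.
# Returns 0 when every selector pair appears in LABELS, 1 otherwise.
selector_matches() {
  selector="$1"
  labels=",$2,"          # wrap in commas so whole pairs can be matched
  old_ifs="$IFS"
  IFS=','
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;    # this selector pair is present
      *) IFS="$old_ifs"; return 1 ;;
    esac
  done
  IFS="$old_ifs"
  return 0
}
```

The two inputs can be read with `kubectl get svc <service> -o jsonpath='{.spec.selector}'` and `kubectl get pods --show-labels`; the selector is printed as a JSON map, so it needs to be rewritten as `key=value` pairs before calling the helper.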
Persistent Volume Issues
    # PVC stuck in Pending
    kubectl describe pvc <name>
    # Common: no matching PV, storage class missing, capacity insufficient

    # Check storage classes
    kubectl get storageclass

    # Check PVs
    kubectl get pv
Resource Right-Sizing
Requests vs Limits
    resources:
      requests:          # Guaranteed minimum; the scheduler uses this
        cpu: "100m"      # 0.1 CPU core
        memory: "128Mi"
      limits:            # Maximum allowed: killed if exceeded (memory), throttled (CPU)
        cpu: "500m"
        memory: "256Mi"
Rules of thumb:
- requests = average usage + 20% buffer
- limits = peak usage + 30% buffer
- Never set limits without requests
- CPU limits cause throttling; some teams only set requests for CPU
- Memory limits are hard: the container is OOMKilled if it exceeds them
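The two buffer rules are simple arithmetic. A minimal sketch, assuming usage is given as a whole number in a single unit (millicores for CPU, Mi for memory) and rounding up; `suggest_request` and `suggest_limit` are hypothetical helper names:

```shell
# suggest_request AVG_USAGE  -> average usage plus 20%, rounded up
suggest_request() {
  echo $(( ($1 * 120 + 99) / 100 ))
}

# suggest_limit PEAK_USAGE   -> peak usage plus 30%, rounded up
suggest_limit() {
  echo $(( ($1 * 130 + 99) / 100 ))
}
```

With an average of 100 (Mi or millicores) and a peak of 200, this gives a request of 120 and a limit of 260. The averages and peaks themselves have to come from real measurements, for example repeated `kubectl top pod <name>` readings or a metrics system.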
HPA (Horizontal Pod Autoscaler)
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-app
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
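To reason about what this HPA will do, it helps to apply the documented scaling rule by hand: desired replicas = ceil(current replicas × current utilization / target utilization), clamped to minReplicas and maxReplicas. The sketch below is a simplification that ignores the controller's tolerance band and stabilization behavior; `hpa_desired` is a hypothetical helper name. Utilization is measured against the pods' CPU requests, so the target pods need CPU requests for this metric to work.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
# Utilization values are whole percentages (e.g. 90 for 90%).
hpa_desired() {
  current="$1"; util="$2"; target="$3"; min="$4"; max="$5"
  # Integer ceiling of current * util / target
  desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}
```

With the manifest above (target 70, min 2, max 10), 4 replicas averaging 90% CPU give `hpa_desired 4 90 70 2 10`, which prints 6.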
Quick Reference
| Symptom | First Command | Likely Cause |
|---|---|---|
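| Pod pending | `kubectl describe pod <name>` | Resource constraints |
| Pod crashing | `kubectl logs <pod> --previous` | App error or OOM |
| Service unreachable | `kubectl get endpoints <service>` | Label mismatch or no ready pods |
| Slow response | `kubectl top pods` | CPU throttling or memory pressure |
| DNS not resolving | `kubectl exec -it <pod> -- nslookup <service>` | CoreDNS issue or network policy |
| Storage error | `kubectl describe pvc <name>` | No matching PV or storage class |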