cc-devops-skills k8s-debug
Diagnose and fix Kubernetes pods, CrashLoopBackOff, Pending, DNS, networking, storage, and rollout failures with kubectl.
```sh
git clone https://github.com/akin-ozer/cc-devops-skills
```

Or install the skill directly into `~/.claude/skills`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/akin-ozer/cc-devops-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/devops-skills-plugin/skills/k8s-debug" ~/.claude/skills/akin-ozer-cc-devops-skills-k8s-debug && rm -rf "$T"
```
`devops-skills-plugin/skills/k8s-debug/SKILL.md`

Kubernetes Debugging Skill
Overview
Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.
Trigger Phrases
Use this skill when requests resemble:
- "My pod is in `CrashLoopBackOff`; help me find the root cause."
- "Service DNS works in one pod but not another."
- "Deployment rollout is stuck."
- "Pods are `Pending` and not scheduling."
- "Cluster health looks degraded after a change."
- "PVC is pending and pods cannot mount storage."
Prerequisites
Run from the skill directory (`devops-skills-plugin/skills/k8s-debug`) so relative script paths work as written.
Required
- `kubectl` installed and configured.
- An active cluster context.
- Read access to namespaces, pods, events, services, and nodes.
Quick preflight:
```sh
kubectl config current-context
kubectl auth can-i get pods -A
kubectl auth can-i get events -A
kubectl get ns
```
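The preflight can be wrapped in a small helper so a missing prerequisite short-circuits with the same exit code the skill's scripts use for blocked preconditions. This is a sketch, not part of the shipped scripts; `require_tool` and `preflight` are hypothetical names:

```shell
# Sketch: preflight gate mirroring the exit-code contract
# (2 = blocked precondition). Hypothetical helpers, adapt as needed.
require_tool() {
  # Return 2 when a required binary is not on PATH.
  command -v "$1" >/dev/null 2>&1 || { echo "blocked: $1 not found" >&2; return 2; }
}

preflight() {
  require_tool kubectl || return 2
  kubectl config current-context >/dev/null || return 2
  kubectl auth can-i get pods -A >/dev/null || return 2
}
# Usage: preflight || { echo "fix access/context first" >&2; exit 2; }
```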
Optional but Recommended
- `jq` for more precise filtering in `./scripts/cluster_health.sh`.
- Metrics API (`metrics-server`) for `kubectl top`.
- In-container debug tools (`nslookup`, `getent`, `curl`, `wget`, `ip`) for deep network tests.
Fallback behavior:
- If optional tools are missing, scripts continue and print warnings with reduced output.
- If `kubectl top` is unavailable, continue with `kubectl describe` and events.
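The fallback pattern described above can be sketched as a tiny wrapper that tries an optional probe and degrades to a warning instead of aborting. `run_optional` is a hypothetical helper name, not something the shipped scripts export:

```shell
# Sketch: run an optional check, degrade to a warning on failure.
run_optional() {
  label=$1; shift
  if "$@" 2>/dev/null; then
    :  # optional probe succeeded, its output stands
  else
    echo "WARN: $label unavailable, continuing with reduced output"
  fi
}
# Usage: run_optional "kubectl top" kubectl top nodes
```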
When to Use This Skill
Use this skill for:
- Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
- Service connectivity or DNS resolution issues
- Network policy or ingress problems
- Volume and storage mount failures
- Deployment rollout issues
- Cluster health or performance degradation
- Resource exhaustion (CPU/memory)
- Configuration problems (ConfigMaps, Secrets, RBAC)
Safety Rules for Disruptive Commands
Default mode is read-only diagnosis first. Only execute disruptive commands after confirming blast radius and rollback.
Commands requiring explicit confirmation:
```sh
kubectl delete pod ... --force --grace-period=0
kubectl drain ...
kubectl rollout restart ...
kubectl rollout undo ...
kubectl debug ... --copy-to=...
```
Before disruptive actions:
```sh
# Snapshot current state for rollback and incident notes
kubectl get deploy,rs,pod,svc -n <namespace> -o wide
kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > before-events.txt
```
Reference Navigation Map
Load only the section needed for the observed symptom.
| Symptom / Need | Open | Start section |
|---|---|---|
| You need an end-to-end diagnosis path | `./references/troubleshooting_workflow.md` | |
| Pod state is `Pending`, `CrashLoopBackOff`, or `ImagePullBackOff` | `./references/troubleshooting_workflow.md` | |
| Service reachability or DNS failure | `./references/troubleshooting_workflow.md` | |
| Node pressure or performance regression | `./references/troubleshooting_workflow.md` | |
| PVC / PV / storage class issues | `./references/troubleshooting_workflow.md` | |
| Quick symptom-to-fix lookup | `./references/common_issues.md` | matching issue heading |
| Post-mortem fix options for known issues | `./references/common_issues.md` | solution sections |
Scripts Overview
| Script | Purpose | Required args | Optional args | Output | Fallback behavior |
|---|---|---|---|---|---|
| `./scripts/cluster_health.sh` | Cluster-wide health snapshot (nodes, workloads, events, common failure states) | None | `--strict`, env var | Sectioned report to stdout | Continues on check failures, tracks them in summary and exit code |
| `./scripts/network_debug.sh` | Pod-centric network and DNS diagnostics | `<namespace> <pod-name>` | `--strict`, `--insecure`, env var | Sectioned report to stdout | Uses secure API probe by default; insecure TLS requires explicit `--insecure` |
| `./scripts/pod_diagnostics.py` | Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context) | `<pod-name>` | `-n <namespace>`, `-o <file>` | Sectioned report to stdout or file | Fails fast on missing access; skips optional metrics/log blocks with clear messages |
Script Exit Codes
`./scripts/cluster_health.sh` and `./scripts/network_debug.sh` share the same contract:
- `0`: checks completed with no check failures (warnings allowed unless `--strict` is set).
- `1`: one or more checks failed, or warnings occurred in `--strict` mode.
- `2`: blocked preconditions (for example: missing `kubectl`, no active context, inaccessible namespace/pod).
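A caller (for example a CI job) can branch on this contract. The sketch below assumes the contract as documented; `classify_exit` is a hypothetical wrapper, not part of the repository:

```shell
# Sketch: interpret the shared exit-code contract of the diagnostic scripts.
classify_exit() {
  case "$1" in
    0) echo "healthy" ;;
    1) echo "check failures (or warnings in --strict mode)" ;;
    2) echo "blocked preconditions" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}
# Usage:
#   ./scripts/cluster_health.sh --strict; classify_exit $?
```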
Deterministic Debugging Workflow
Follow this systematic approach for any Kubernetes issue:
1. Preflight and Scope
```sh
kubectl config current-context
kubectl get ns
kubectl auth can-i get pods -n <namespace>
```
If preflight fails, stop and fix access/context first.
2. Identify the Problem Layer
Categorize the issue:
- Application Layer: Application crashes, errors, bugs
- Pod Layer: Pod not starting, restarting, or pending
- Service Layer: Network connectivity, DNS issues
- Node Layer: Node not ready, resource exhaustion
- Cluster Layer: Control plane issues, API problems
- Storage Layer: Volume mount failures, PVC issues
- Configuration Layer: ConfigMap, Secret, RBAC issues
3. Gather Diagnostics with the Right Script
Use the appropriate diagnostic script based on scope:
Pod-Level Diagnostics
Use `./scripts/pod_diagnostics.py` for comprehensive pod analysis:

```sh
python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>
```
This script gathers:
- Pod status and description
- Pod events
- Container logs (current and previous)
- Resource usage
- Node information
- YAML configuration
Output can be saved for analysis:
```sh
python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt
```
Cluster-Level Health Check
Use `./scripts/cluster_health.sh` for overall cluster diagnostics:

```sh
./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt
```
This script checks:
- Cluster info and version
- Node status and resources
- Pods across all namespaces
- Failed/pending pods
- Recent events
- Deployments, services, statefulsets, daemonsets
- PVCs and PVs
- Component health
- Common error states (CrashLoopBackOff, ImagePullBackOff)
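The "common error states" check above can also be done by hand with a one-line filter over plain `kubectl get pods` output. A minimal sketch, assuming the standard `-A --no-headers` column layout (NAMESPACE NAME READY STATUS RESTARTS AGE); `flag_unhealthy` is a hypothetical helper name:

```shell
# Sketch: filter `kubectl get pods -A --no-headers` down to pods that are
# not Running/Completed, or that have restarted more than 5 times.
flag_unhealthy() {
  awk '$4 !~ /^(Running|Completed)$/ || $5 + 0 > 5 {
    print $1 "/" $2 " status=" $4 " restarts=" $5
  }'
}
# Usage (against a live cluster):
#   kubectl get pods -A --no-headers | flag_unhealthy
```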
Network Diagnostics
Use `./scripts/network_debug.sh` for connectivity issues:

```sh
./scripts/network_debug.sh <namespace> <pod-name>
# or force warning sensitivity / insecure TLS only when explicitly needed:
./scripts/network_debug.sh --strict <namespace> <pod-name>
./scripts/network_debug.sh --insecure <namespace> <pod-name>
```
This script analyzes:
- Pod network configuration
- DNS setup and resolution
- Service endpoints
- Network policies
- Connectivity tests
- CoreDNS logs
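For the "DNS setup" part, a captured `resolv.conf` from the pod (e.g. `kubectl exec <pod> -n <ns> -- cat /etc/resolv.conf > resolv.txt`) can be sanity-checked offline. A minimal sketch; `check_resolv` is a hypothetical helper, and the cluster search-domain pattern assumes the default `svc.cluster.local` suffix:

```shell
# Sketch: verify a dumped pod resolv.conf has a nameserver and the
# expected cluster search domain.
check_resolv() {
  awk '
    /^nameserver/ { ns++ }
    /^search/ && /svc\.cluster\.local/ { search++ }
    END {
      if (!ns)     { print "FAIL: no nameserver configured"; exit 1 }
      if (!search) { print "WARN: cluster search domain missing" }
      print "OK: " ns " nameserver(s)"
    }' "$1"
}
```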
4. Follow Issue-Specific Reference Workflow
Based on the identified issue, consult `./references/troubleshooting_workflow.md`:
- Pod Pending: Resource/scheduling workflow
- CrashLoopBackOff: Application crash workflow
- ImagePullBackOff: Image pull workflow
- Service issues: Network connectivity workflow
- DNS failures: DNS troubleshooting workflow
- Resource exhaustion: Performance investigation workflow
- Storage issues: PVC binding workflow
- Deployment stuck: Rollout workflow
5. Apply Targeted Fixes
Refer to `./references/common_issues.md` for symptom-specific fixes.
6. Verify and Close
Run final verification:
```sh
kubectl get pods -n <namespace> -o wide
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
kubectl rollout status deployment/<name> -n <namespace>
```
The issue is resolved when user-visible behavior is healthy and no new critical warning events appear.
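The "no new warning events" criterion can be made mechanical by counting `Warning`-type rows in the recent event listing. A sketch under that assumption; `count_warnings` is a hypothetical helper:

```shell
# Sketch: count events whose TYPE column is Warning.
count_warnings() {
  awk '$0 ~ /[[:space:]]Warning[[:space:]]/ { w++ } END { print w + 0 }'
}
# Usage:
#   kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20 | count_warnings
```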
Example Flows
Example 1: CrashLoopBackOff in payments
Namespace: `payments`

```sh
python3 ./scripts/pod_diagnostics.py payments-api-7c97f95dfb-q9l7k -n payments -o payments-diagnostics.txt
kubectl logs payments-api-7c97f95dfb-q9l7k -n payments --previous --tail=100
kubectl get deploy payments-api -n payments -o yaml | grep -A 8 livenessProbe
```
Then open `./references/common_issues.md` and apply the CrashLoopBackOff solutions.
Example 2: Service DNS/Connectivity Failure
```sh
./scripts/network_debug.sh checkout checkout-api-75f49c9d8f-z6qtm
kubectl get svc checkout-api -n checkout
kubectl get endpoints checkout-api -n checkout
kubectl get networkpolicies -n checkout
```
Then follow the Service Connectivity Workflow in `./references/troubleshooting_workflow.md`.
Essential Manual Commands
Pod Debugging
```sh
# View pod status
kubectl get pods -n <namespace> -o wide

# Detailed pod information
kubectl describe pod <pod-name> -n <namespace>

# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous      # Previous container
kubectl logs <pod-name> -n <namespace> -c <container>  # Specific container

# Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh

# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml
```
Service and Network Debugging
```sh
# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# Check endpoints
kubectl get endpoints -n <namespace>

# Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default

# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```
Resource Monitoring
```sh
# Node resources
kubectl top nodes
kubectl describe nodes

# Pod resources
kubectl top pods -n <namespace>
kubectl top pod <pod-name> -n <namespace> --containers
```
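When scanning many nodes, the `kubectl top nodes` output can be filtered to only those above a CPU threshold. A sketch assuming the standard `--no-headers` column layout (NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%); `flag_hot_nodes` is a hypothetical helper name:

```shell
# Sketch: print nodes whose CPU% (column 3) exceeds a threshold.
flag_hot_nodes() {
  threshold=${1:-80}
  awk -v t="$threshold" '{
    gsub(/%/, "", $3)
    if ($3 + 0 > t) print $1 " cpu=" $3 "%"
  }'
}
# Usage: kubectl top nodes --no-headers | flag_hot_nodes 80
```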
Emergency Operations
```sh
# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>

# Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>

# Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Cordon node (prevent scheduling)
kubectl cordon <node-name>
```
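Because these operations require explicit confirmation per the safety rules above, an interactive gate can wrap them. A minimal sketch; `confirm_and_run` is a hypothetical helper, not part of the shipped scripts:

```shell
# Sketch: require an explicit "y" before running a disruptive command.
confirm_and_run() {
  printf 'About to run: %s. Proceed? [y/N] ' "$*"
  read -r ans
  if [ "$ans" = "y" ]; then
    "$@"
  else
    echo "aborted"
    return 1
  fi
}
# Usage: confirm_and_run kubectl rollout restart deployment/<name> -n <namespace>
```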
Completion Criteria
Troubleshooting session is complete when all are true:
- Cluster context and namespace are confirmed.
- Relevant diagnostic script output is captured.
- Root cause is identified and tied to evidence (events/logs/config/state).
- Any disruptive action was preceded by snapshot and rollback plan.
- Fix verification commands show healthy state.
- Reference path used (`./references/troubleshooting_workflow.md` or `./references/common_issues.md`) is documented in notes.
Related Tools
Useful additional tools for Kubernetes debugging:
- kubectl-debug: Advanced debugging plugin
- stern: Multi-pod log tailing
- kubectx/kubens: Context and namespace switching
- k9s: Terminal UI for Kubernetes
- lens: Desktop IDE for Kubernetes
- Prometheus/Grafana: Monitoring and alerting
- Jaeger/Zipkin: Distributed tracing