Claude-skill-registry kubernetes-debugger
install
source · Clone the upstream repo
```shell
git clone https://github.com/majiayu000/claude-skill-registry
```
Claude Code · Install into ~/.claude/skills/
```shell
T=$(mktemp -d) &&
git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" &&
mkdir -p ~/.claude/skills &&
cp -r "$T/skills/data/kubernetes-debugger" ~/.claude/skills/majiayu000-claude-skill-registry-kubernetes-debugger &&
rm -rf "$T"
```
manifest: `skills/data/kubernetes-debugger/SKILL.md`
Kubernetes Debugger
Systematic debugging workflows for Kubernetes issues using MCP kubernetes tools.
Prerequisites
Install Kubernetes MCP Server
```shell
claude mcp add kubernetes --scope user -- npx mcp-server-kubernetes
```
Requirements:
- Access to a Kubernetes cluster configured for kubectl (minikube, Rancher Desktop, GKE, EKS, AKS, etc.)
- kubeconfig at `~/.kube/config` (default) or the `KUBECONFIG` env var set
- Helm v3 in PATH (optional, for Helm operations)
Alternative installation methods:
```shell
# Global install
npm install -g mcp-server-kubernetes

# Or run directly with npx (no install)
npx mcp-server-kubernetes
```
Verify installation:
```shell
claude mcp list   # Should show 'kubernetes' server
```
Quick Reference: MCP Tools
| Tool | Use For |
|---|---|
| `kubectl_get` | List resources, check status, find resource names |
| `kubectl_describe` | Detailed info, events, conditions |
| `kubectl_logs` | Container stdout/stderr, application errors |
| `exec_in_pod` | Run commands inside containers |
| `kubectl_rollout` | Deployment rollout status/history |
| `node_management` | Cordon/drain/uncordon nodes |
Debugging Decision Tree
```
Issue reported
│
├─ Pod not running? ──────────► See: Pod Debugging Workflow
│
├─ Service unreachable? ──────► See: Service/Network Debugging
│
├─ Deployment stuck? ─────────► See: Deployment Debugging
│
├─ Node issues? ──────────────► See: Node Debugging
│
└─ Performance/Resources? ────► See: Resource Debugging
```
Pod Debugging Workflow
Step 1: Get Pod Status
```
kubectl_get(resourceType="pods", namespace="<ns>")
```
Common statuses and their meaning:
- Pending: Scheduling issues (resources, node selector, affinity)
- CrashLoopBackOff: Container crashing repeatedly
- ImagePullBackOff/ErrImagePull: Cannot pull container image
- Running but not ready: Readiness probe failing
- Terminating: Stuck deletion (finalizers, PDB)
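The status survey above can be automated by parsing `kubectl get pods -o json`. This is a minimal sketch: the `summarize_pod_statuses` helper and the sample document are illustrative, not part of the MCP server.

```python
import json
from collections import Counter

def summarize_pod_statuses(pods_json: str) -> Counter:
    """Count pods by phase, or by container waiting reason when one is set."""
    doc = json.loads(pods_json)
    counts = Counter()
    for pod in doc.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        # A waiting reason like CrashLoopBackOff is more specific than the phase.
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting")
            if waiting and waiting.get("reason"):
                phase = waiting["reason"]
        counts[phase] += 1
    return counts

# Minimal sample resembling `kubectl get pods -o json` output (illustrative only).
sample = json.dumps({"items": [
    {"status": {"phase": "Running"}},
    {"status": {"phase": "Pending"}},
    {"status": {"phase": "Running",
                "containerStatuses": [{"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
]})
print(summarize_pod_statuses(sample))
```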
Step 2: Check Events and Conditions
```
kubectl_describe(resourceType="pod", name="<pod>", namespace="<ns>")
```
Look for in the output:
- Events section: Scheduling failures, image pull errors, probe failures
- Conditions: PodScheduled, Initialized, ContainersReady, Ready
- Container State: Waiting (reason), Running, Terminated (exit code)
Step 3: Get Container Logs
```
kubectl_logs(resourceType="pod", name="<pod>", namespace="<ns>", container="<container>")
```
Options:
- `previous=true`: Logs from crashed container
- `tail=100`: Last N lines
- `since="1h"`: Logs from last hour
Step 4: Exec Into Container (if running)
```
exec_in_pod(name="<pod>", namespace="<ns>", command=["sh", "-c", "<cmd>"])
```
Useful commands:
- `["cat", "/etc/resolv.conf"]`: Check DNS config
- `["env"]`: Verify environment variables
- `["ls", "-la", "/app"]`: Check mounted files
- `["nc", "-zv", "<host>", "<port>"]`: Test connectivity
Common Pod Issues
CrashLoopBackOff
- Get logs: `kubectl_logs(previous=true)` for the crashed container
- Check exit code in `kubectl_describe` output
- Common causes:
- Exit code 1: Application error
- Exit code 137: OOMKilled (check memory limits)
- Exit code 143: SIGTERM (graceful shutdown issue)
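The exit-code rules above follow the standard 128+N convention, where N is the signal number that killed the process. A small sketch (the helper name is illustrative):

```python
import signal

def explain_exit_code(code: int) -> str:
    """Map a container exit code to a likely cause (128+N means killed by signal N)."""
    if code == 0:
        return "clean exit"
    if code > 128:
        sig = code - 128
        name = signal.Signals(sig).name  # e.g. SIGKILL for 137, SIGTERM for 143
        return f"killed by {name} (128+{sig})"
    return "application error (non-zero exit)"

print(explain_exit_code(137))  # SIGKILL: typically OOMKilled under Kubernetes
print(explain_exit_code(143))  # SIGTERM: graceful-shutdown path
```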
ImagePullBackOff
- Check image name/tag in describe output
- Verify image exists in registry
- Check imagePullSecrets if private registry
- Look for "Failed to pull image" in events
Pending Pod
- Check events for scheduling failure reason
- Common causes:
- `Insufficient cpu/memory`: Node capacity exhausted
- `node(s) didn't match node selector`: Wrong labels
- `PersistentVolumeClaim not bound`: Storage issue
- `0/N nodes available`: Taints/tolerations mismatch
Service/Network Debugging
Step 1: Verify Service Exists
```
kubectl_get(resourceType="services", namespace="<ns>")
kubectl_describe(resourceType="service", name="<svc>", namespace="<ns>")
```
Step 2: Check Endpoints
```
kubectl_get(resourceType="endpoints", name="<svc>", namespace="<ns>")
```
No endpoints? Check:
- Pod labels match service selector
- Pods are Running and Ready
- Target port matches container port
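The "labels match selector" check above is just a subset test: every key/value pair in the service selector must appear in the pod's labels. A sketch with illustrative labels:

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """True if every selector key/value pair is present in the pod's labels."""
    return all(pod_labels.get(k) == v for k, v in selector.items())

# Illustrative values; real ones come from kubectl_describe output.
svc_selector = {"app": "web"}
print(selector_matches(svc_selector, {"app": "web", "tier": "frontend"}))  # matched: pod becomes an endpoint
print(selector_matches(svc_selector, {"app": "api"}))                      # mismatch: no endpoint created
```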
Step 3: Test DNS Resolution
```
exec_in_pod(name="<debug-pod>", command=["nslookup", "<service>.<namespace>.svc.cluster.local"])
```
Step 4: Test Connectivity
```
exec_in_pod(name="<debug-pod>", command=["nc", "-zv", "<service>", "<port>"])
```
Deployment Debugging
Check Rollout Status
```
kubectl_rollout(subCommand="status", resourceType="deployment", name="<deploy>", namespace="<ns>")
```
View Rollout History
```
kubectl_rollout(subCommand="history", resourceType="deployment", name="<deploy>", namespace="<ns>")
```
Rollback if Needed
```
kubectl_rollout(subCommand="undo", resourceType="deployment", name="<deploy>", namespace="<ns>")
```
Common Issues
- Progressing stuck: New pods failing (check ReplicaSet pods)
- Available < desired: Pods not passing readiness probes
- Surge/unavailable conflicts: Check deployment strategy
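The "Available &lt; desired" check reads directly off the Deployment status. A sketch over a status fragment in the Kubernetes API's field names (the helper and sample values are illustrative):

```python
def rollout_stalled(status: dict, desired: int) -> bool:
    """Heuristic: rollout is stalled if available replicas lag the desired count."""
    return status.get("availableReplicas", 0) < desired

# Illustrative fragment of `status` from a Deployment object.
status = {"replicas": 3, "updatedReplicas": 3, "availableReplicas": 1}
print(rollout_stalled(status, desired=3))  # True: new pods likely failing readiness probes
```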
Node Debugging
Check Node Status
```
kubectl_get(resourceType="nodes")
kubectl_describe(resourceType="node", name="<node>")
```
Node Conditions to Check
| Condition | Problem If |
|---|---|
| Ready | False or Unknown |
| MemoryPressure | True |
| DiskPressure | True |
| PIDPressure | True |
| NetworkUnavailable | True |
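The table above boils down to: `Ready` must be `True`, every other condition must be `False`. A sketch over the `status.conditions` list of a node (helper name illustrative):

```python
def node_problems(conditions: list) -> list:
    """Return condition types that indicate a problem, per the table above."""
    problems = []
    for c in conditions:
        if c["type"] == "Ready" and c["status"] != "True":
            problems.append("Ready")   # False or Unknown is a problem
        elif c["type"] != "Ready" and c["status"] == "True":
            problems.append(c["type"])  # pressure/unavailable conditions should be False
    return problems

# Illustrative conditions from kubectl_describe on a node.
conditions = [
    {"type": "Ready", "status": "True"},
    {"type": "MemoryPressure", "status": "True"},
    {"type": "DiskPressure", "status": "False"},
]
print(node_problems(conditions))  # ['MemoryPressure']
```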
Drain Node for Maintenance
```
# Prevent new pods from scheduling
node_management(operation="cordon", nodeName="<node>")

# Evict pods
node_management(operation="drain", nodeName="<node>", confirmDrain=true)

# After maintenance:
node_management(operation="uncordon", nodeName="<node>")
```
Resource Debugging
Check Resource Usage
```
kubectl_generic(command="top", resourceType="pods", namespace="<ns>")
kubectl_generic(command="top", resourceType="nodes")
```
OOMKilled Detection
- Run `kubectl_describe` on the pod and look for "OOMKilled" in the container state
- Check memory limits vs actual usage
- Solutions:
- Increase memory limits
- Fix memory leak in application
- Add memory requests for better scheduling
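For reference, requests and limits live on the container spec; a minimal fragment with illustrative values, to be tuned against observed usage from `kubectl top`:

```yaml
resources:
  requests:
    memory: "256Mi"   # used by the scheduler for placement
    cpu: "250m"
  limits:
    memory: "512Mi"   # exceeding this triggers OOMKilled (exit code 137)
```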
CPU Throttling
- Check if CPU limits are too restrictive
- Consider removing CPU limits (keep requests)
- Use `kubectl top pods` to see actual usage
Reference Files
- references/pod-states.md: Complete pod state reference
- references/common-errors.md: Error messages and solutions
- references/network-debug.md: Network troubleshooting details