Claude-skill-registry k8s-sre
```bash
# Clone the full registry
git clone https://github.com/majiayu000/claude-skill-registry

# Or copy only this skill into your local skills directory
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/k8s-sre" ~/.claude/skills/majiayu000-claude-skill-registry-k8s-sre && rm -rf "$T"
```

Skill file: skills/data/k8s-sre/SKILL.md

ACCESSING CLUSTERS
ALWAYS use `export KUBECONFIG=~/.kube/<cluster>.yaml && kubectl ...` when executing kubectl commands against the cluster.
Debugging Kubernetes Incidents
Core Principles
- 5 Whys Analysis - NEVER stop at symptoms. Ask "why" until you reach the root cause.
- Read-Only Investigation - Observe and analyze, never modify resources
- Multi-Source Correlation - Combine logs, events, metrics for complete picture
- Research Unknown Services - Check documentation before deep investigation
The 5 Whys Analysis (CRITICAL)
You MUST apply 5 Whys before concluding any investigation. Stopping at symptoms leads to ineffective fixes.
How to Apply
- Start with the observed symptom
- Ask "Why did this happen?" for each answer
- Continue until you reach an actionable root cause (typically 5 levels)
Example
Symptom: Helm install failed with "context deadline exceeded"
Why #1: Why did Helm time out? → Pods never became Ready
Why #2: Why weren't pods Ready? → Pods stuck in Pending state
Why #3: Why were pods Pending? → PVCs couldn't bind (StorageClass "fast" not found)
Why #4: Why was the StorageClass missing? → longhorn-storage Kustomization failed to apply
Why #5: Why did the Kustomization fail? → numberOfReplicas was an integer instead of a string

ROOT CAUSE: YAML type coercion issue
FIX: Use a properly typed variable for the StorageClass parameters
Red Flags You Haven't Reached Root Cause
- Your "fix" is increasing a timeout or retry count
- Your "fix" addresses the symptom, not what caused it
- You can still ask "but why did THAT happen?"
- Multiple issues share the same underlying cause
❌ WRONG: "Helm timed out → increase timeout to 15m"
✅ CORRECT: "Helm timed out → ... → Kustomization type error → fix YAML"
Cluster Context
CRITICAL: Always confirm cluster before running commands.
| Cluster | Purpose | Kubeconfig |
|---|---|---|
| dev | Manual testing | ~/.kube/dev.yaml |
| integration | Automated testing | ~/.kube/integration.yaml |
| live | Production | ~/.kube/live.yaml |
```bash
KUBECONFIG=~/.kube/<cluster>.yaml kubectl <command>
```
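A minimal read-only sketch for confirming which cluster a kubeconfig actually points at before starting an investigation (the `<cluster>` placeholder follows the pattern above):

```bash
# Confirm the target cluster before running anything else (read-only)
export KUBECONFIG=~/.kube/<cluster>.yaml
kubectl config current-context
kubectl cluster-info
kubectl get nodes -o wide   # node names and IPs are a quick second confirmation
```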
Investigation Phases
Phase 1: Triage
- Confirm cluster - Ask user: "Which cluster? (dev/integration/live)"
- Assess severity - P1 (down) / P2 (degraded) / P3 (minor) / P4 (cosmetic)
- Identify scope - Pod / Deployment / Namespace / Cluster-wide (quick-scope commands sketched below)
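A quick-scope sketch for the triage step above; every command is read-only and `<namespace>` is a placeholder:

```bash
# Cluster-wide: are many namespaces affected, or just one?
kubectl get pods -A --field-selector=status.phase!=Running   # also lists Succeeded pods from Jobs
kubectl get nodes

# Namespace-scoped: one pod, one Deployment, or the whole namespace?
kubectl get deploy,sts,ds -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```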
Phase 2: Data Collection
```bash
# Pod status and events
kubectl get pods -n <namespace>
kubectl describe pod <pod> -n <namespace>

# Logs (current and previous)
kubectl logs <pod> -n <namespace> --tail=100
kubectl logs <pod> -n <namespace> --previous

# Events timeline
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Resource usage
kubectl top pods -n <namespace>
```
Phase 3: Correlation
- Extract timestamps from logs, events, metrics (see the correlation sketch below)
- Identify what happened FIRST (root cause)
- Trace the cascade of effects
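One way to line up the timelines referenced above, sketched with plain kubectl (the custom-columns paths are standard Event fields):

```bash
# Events with explicit timestamps, oldest first
kubectl get events -n <namespace> --sort-by='.lastTimestamp' \
  -o custom-columns=TIME:.lastTimestamp,TYPE:.type,REASON:.reason,OBJECT:.involvedObject.name,MESSAGE:.message

# Logs with timestamps so they can be matched against the events
kubectl logs <pod> -n <namespace> --timestamps --tail=200
```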
Phase 4: Root Cause (5 Whys)
Apply 5 Whys analysis. Validate:
- Temporal: Did it happen BEFORE the symptom?
- Causal: Does it logically explain the symptom?
- Evidence: Is there supporting data?
- Complete: Have you asked "why" enough times?
Phase 5: Remediation
Use the AskUserQuestion tool to present fix options when multiple valid approaches exist.
Provide recommendations only (read-only investigation):
- Immediate: Rollback, scale, restart (example commands below)
- Permanent: Code/config fixes
- Prevention: Alerts, quotas, tests
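Example commands to attach to the recommendation (to be run by the user, not as part of the read-only investigation); resource names are placeholders:

```bash
# Immediate mitigations to recommend
kubectl rollout undo deployment/<name> -n <namespace>      # roll back a bad rollout
kubectl scale deployment/<name> -n <namespace> --replicas=<n>
kubectl rollout restart deployment/<name> -n <namespace>

# Verify the mitigation afterwards
kubectl rollout status deployment/<name> -n <namespace>
```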
Quick Diagnosis
| Symptom | First Check | Common Cause |
|---|---|---|
| `ImagePullBackOff` | Events | Wrong image/registry auth |
| `Pending` | Events, node capacity | Insufficient resources |
| `CrashLoopBackOff` | Logs (`--previous`) | App error, missing config |
| `OOMKilled` | Memory limits | Memory leak, limits too low |
| Readiness/liveness probe failures | Probe config | Slow startup, wrong endpoint |
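Sketches for a couple of the "First Check" cells above (the jsonpath expressions use standard Pod status/spec fields):

```bash
# ImagePullBackOff / Pending: the events usually name the exact failure
kubectl describe pod <pod> -n <namespace> | grep -A 10 Events

# OOMKilled: confirm the last termination reason and the configured memory limit
kubectl get pod <pod> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
kubectl get pod <pod> -n <namespace> \
  -o jsonpath='{.spec.containers[*].resources.limits.memory}'
```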
Common Failure Chains
Storage failures cascade:
StorageClass missing → PVC Pending → Pod Pending → Helm timeout
Network failures cascade:
DNS failure → Service unreachable → Health check fails → Pod restarted
Secret failures cascade:
ExternalSecret fails → Secret missing → Pod CrashLoopBackOff
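A sketch for walking the first chain backwards from the Helm timeout (names are placeholders; every command is read-only):

```bash
# Helm timeout → which pods never became Ready?
kubectl get pods -n <namespace>

# Pod Pending → is a PVC stuck Pending too?
kubectl get pvc -n <namespace>
kubectl describe pvc <pvc> -n <namespace>   # events will name a missing StorageClass

# PVC Pending → does the requested StorageClass exist?
kubectl get storageclass
```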
Flux GitOps Commands
```bash
# Check status
KUBECONFIG=~/.kube/<cluster>.yaml flux get all
KUBECONFIG=~/.kube/<cluster>.yaml flux get kustomizations
KUBECONFIG=~/.kube/<cluster>.yaml flux get helmreleases -A

# Trigger reconciliation
KUBECONFIG=~/.kube/<cluster>.yaml flux reconcile source git flux-system
KUBECONFIG=~/.kube/<cluster>.yaml flux reconcile kustomization <name>
KUBECONFIG=~/.kube/<cluster>.yaml flux reconcile helmrelease <name> -n <namespace>
```
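When `flux get` reports a failed object, a sketch for digging one level deeper; this assumes the standard Flux controllers and only reads status:

```bash
# Full status conditions and the last error message
KUBECONFIG=~/.kube/<cluster>.yaml kubectl describe kustomization <name> -n flux-system
KUBECONFIG=~/.kube/<cluster>.yaml kubectl describe helmrelease <name> -n <namespace>

# Recent errors from the Flux controllers
KUBECONFIG=~/.kube/<cluster>.yaml flux logs --level=error --all-namespaces
```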
Researching Unfamiliar Services
When investigating unknown services, spawn a haiku agent to research documentation:
Task tool:
- subagent_type: "general-purpose"
- model: "haiku"
- prompt: "Research [service] troubleshooting docs. Focus on: 1. Common failure modes 2. Health indicators 3. Configuration gotchas. Start with: [docs-url]"
Chart URL → Docs mapping:
| Chart Source | Documentation |
|---|---|
| cert-manager (charts.jetstack.io) | cert-manager.io/docs |
| Longhorn (charts.longhorn.io) | longhorn.io/docs |
| Grafana (grafana.github.io/helm-charts) | grafana.com/docs |
| Prometheus (prometheus-community.github.io/helm-charts) | prometheus.io/docs |
Common Confusions
❌ Jump to logs without checking events first
✅ Events provide context, then investigate logs

❌ Look only at current pod state
✅ Check `--previous` logs if pod restarted

❌ Assume first error is root cause
✅ Apply 5 Whys to find true root cause

❌ Investigate without confirming cluster
✅ ALWAYS confirm cluster before any kubectl command

❌ Use `helm list` to check Helm release status
✅ Use `kubectl get helmrelease -A` - Flux manages releases via CRDs, not Helm CLI
Keywords
kubernetes, debugging, crashloopbackoff, oomkilled, pending, root cause analysis, 5 whys, incident investigation, pod logs, events, kubectl, flux, gitops, troubleshooting