Claude-skill-registry check-cluster-health
Performs a comprehensive health check of a Kubernetes cluster.
Clone the full registry:
git clone https://github.com/majiayu000/claude-skill-registry
Or install only this skill into ~/.claude/skills:
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/check-cluster-health" ~/.claude/skills/majiayu000-claude-skill-registry-check-cluster-health && rm -rf "$T"
skills/data/check-cluster-health/SKILL.md

Check Cluster Health
Perform a comprehensive health check of the Kubernetes cluster infrastructure.
When to Use
- Initial investigation of any production issue
- Before deep-diving into specific pods or services
- User reports "something is wrong" without specifics
- Periodic health checks
- Post-deployment validation
- After scaling events or cluster changes
Skill Objective
Quickly assess the overall state of the Kubernetes cluster to identify:
- Node health and resource pressure
- Pod health across all namespaces
- System component status
- Recent critical events
- Resource constraints or bottlenecks
Investigation Steps
Step 1: Check Node Health
Get an overview of all nodes in the cluster:
kubectl get nodes -o wide
Look for:
- Nodes in NotReady state
- Node ages (very old or very new nodes)
- Kubernetes versions (version skew)
- Internal/External IPs
Expected Output:
```
NAME     STATUS     ROLES           AGE   VERSION
node-1   Ready      control-plane   45d   v1.28.0
node-2   Ready      <none>          45d   v1.28.0
node-3   Ready      <none>          45d   v1.28.0
node-4   NotReady   <none>          45d   v1.28.0   ⚠️
```
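Optional helper - a minimal filter (assuming the default column layout of `kubectl get nodes`) that lists only unhealthy nodes:

```bash
# List only nodes that are not Ready; no output means every node is healthy
kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1, $2}'
```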
Step 2: Check Node Resource Usage
Get current CPU and memory utilization:
kubectl top nodes
Look for:
- CPU usage > 80%
- Memory usage > 85%
- Significant imbalance between nodes
Expected Output:
```
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   450m         22%    4Gi             50%
node-2   890m         44%    6Gi             75%
node-3   1200m        60%    7Gi             87%   ⚠️
node-4   100m         5%     2Gi             25%
```
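Optional helper - with metrics-server installed and a recent kubectl, the output can be sorted so the most loaded nodes appear first:

```bash
# Sort nodes by current memory usage (most loaded first)
kubectl top nodes --sort-by=memory

# Same, sorted by CPU usage
kubectl top nodes --sort-by=cpu
```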
Step 3: Check Node Conditions
Inspect for resource pressure conditions:
kubectl describe nodes | grep -A 5 "Conditions:"
Look for:
- MemoryPressure: True
- DiskPressure: True
- PIDPressure: True
- NetworkUnavailable: True
Critical Conditions:
```
Conditions:
  Type               Status   Reason
  ----               ------   ------
  MemoryPressure     True     NodeHasInsufficientMemory   ⚠️
  DiskPressure       False    NodeHasSufficientDisk
  PIDPressure        False    NodeHasSufficientPID
  Ready              True     KubeletReady
```
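Alternative - a machine-readable sketch using kubectl's JSONPath output; any condition other than Ready reported as True indicates pressure:

```bash
# For each node, print every condition whose status is currently True
# (a healthy node should only report "Ready")
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .status.conditions[?(@.status=="True")]}{.type}{" "}{end}{"\n"}{end}'
```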
Step 4: Find Problematic Pods
Get all pods that are not in Running or Succeeded state:
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
Alternative - get all pods with issues:
kubectl get pods --all-namespaces | grep -vE 'Running|Completed|Succeeded'
Look for:
- CrashLoopBackOff
- ImagePullBackOff
- Pending
- Error
- Evicted
- OOMKilled
Expected Output:
```
NAMESPACE    NAME              READY   STATUS             RESTARTS   AGE
api          api-service-abc   0/1     CrashLoopBackOff   5          10m   ⚠️
api          api-service-xyz   0/1     OOMKilled          3          15m   ⚠️
default      worker-123        0/1     Pending            0          5m    ⚠️
monitoring   prometheus-456    0/2     ImagePullBackOff   0          20m   ⚠️
```
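Optional helper - a quick tally of pods by status (assuming the default column layout of `kubectl get pods -A`) to gauge how widespread the problem is:

```bash
# Count pods per status across all namespaces
kubectl get pods -A --no-headers | awk '{print $4}' | sort | uniq -c | sort -rn
```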
Step 5: Check System Components
Verify kube-system namespace health:
kubectl get pods -n kube-system
Critical components to check:
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- etcd
- coredns (or kube-dns)
- kube-proxy
Expected Output:
```
NAME                             READY   STATUS    RESTARTS   AGE
coredns-565d847f94-abcde         1/1     Running   0          45d
coredns-565d847f94-fghij         1/1     Running   0          45d
etcd-node-1                      1/1     Running   0          45d
kube-apiserver-node-1            1/1     Running   0          45d
kube-controller-manager-node-1   1/1     Running   0          45d
kube-proxy-klmno                 1/1     Running   0          45d
kube-scheduler-node-1            1/1     Running   0          45d
```
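Optional helper - filter to unhealthy system pods only; `kubectl get componentstatuses` is deprecated and may be unavailable on managed clusters, but where it still works it gives a quick control-plane summary:

```bash
# Show only kube-system pods that are not Running or Completed
kubectl get pods -n kube-system --no-headers | grep -vE 'Running|Completed' \
  || echo "All kube-system pods look healthy"

# Deprecated API, but still present on many clusters
kubectl get componentstatuses
```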
Step 6: Review Recent Critical Events
Get the most recent events, sorted by timestamp and filtered for warnings and errors:
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -50 | grep -E 'Warning|Error'
Alternative - more structured:
kubectl get events --all-namespaces --sort-by='.lastTimestamp' --field-selector type!=Normal
Look for patterns:
- Repeated OOMKilled events
- FailedScheduling (resource constraints)
- FailedMount (volume issues)
- ImagePullBackOff (registry issues)
- Evictions (resource pressure)
- BackOff (crashing containers)
Expected Output:
```
10m   Warning   FailedScheduling   pod/worker-123    0/4 nodes available: insufficient memory
8m    Warning   BackOff            pod/api-service   Back-off restarting failed container
5m    Warning   OOMKilled          pod/api-service   Container exceeded memory limit
3m    Warning   Evicted            pod/cache-789     The node was low on resource: memory
```
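Optional helper - counting warning events by reason exposes recurring patterns rather than individual occurrences:

```bash
# Count warning events by reason across all namespaces
kubectl get events -A --field-selector type=Warning \
  -o custom-columns=REASON:.reason --no-headers | sort | uniq -c | sort -rn
```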
Step 7: Check for Evicted Pods
Find pods that were evicted due to resource pressure:
kubectl get pods --all-namespaces --field-selector=status.phase=Failed | grep Evicted
Evictions indicate:
- Node resource pressure (memory/disk)
- Need for resource limits/requests tuning
- Possible need for cluster scaling
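Alternative - a sketch that also shows the eviction message (evicted pods are garbage-collected over time, so this only works while they still exist in the API):

```bash
# List evicted pods together with the eviction message
kubectl get pods -A --field-selector=status.phase=Failed \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,REASON:.status.reason,MESSAGE:.status.message \
  --no-headers | grep Evicted
```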
Step 8: Review Resource Allocation
Check cluster-wide resource allocation:
kubectl describe nodes | grep -A 7 "Allocated resources:"
Look for:
- CPU allocation > 80%
- Memory allocation > 80%
- Pods per node approaching limits
Expected Output:
```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource            Requests      Limits
  --------            --------      ------
  cpu                 3800m (95%)   7200m (180%)   ⚠️
  memory              24Gi (75%)    32Gi (100%)    ⚠️
  ephemeral-storage   0 (0%)        0 (0%)
  hugepages-2Mi       0 (0%)        0 (0%)
```
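Optional helper - a per-node view of allocatable capacity (what the scheduler has to work with), to complement the requests/limits summary above:

```bash
# Allocatable CPU, memory, and pod slots per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory,PODS:.status.allocatable.pods
```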
MCP Tools to Use
```
kubernetes.get_nodes()
kubernetes.get_node_metrics()
kubernetes.describe_node(node_name)
kubernetes.get_pods(namespace="all", field_selector="status.phase!=Running")
kubernetes.get_pods(namespace="kube-system")
kubernetes.get_events(namespace="all", since="1h", field_selector="type!=Normal")
```
Output Format
Provide a structured summary in this format:
```markdown
# CLUSTER HEALTH SUMMARY
========================

## Cluster Overview
- **Total Nodes:** 5
- **Healthy Nodes:** 4
- **Unhealthy Nodes:** 1
- **Kubernetes Version:** v1.28.0

## Node Health

### Healthy Nodes ✓
- node-1: Ready (CPU: 22%, Memory: 50%)
- node-2: Ready (CPU: 44%, Memory: 75%)
- node-3: Ready (CPU: 60%, Memory: 87%) ⚠️ High memory

### Unhealthy Nodes ⚠️
- **node-4:** NotReady
  - Condition: KubeletNotReady
  - Reason: Node had insufficient memory
  - Duration: 15 minutes

## Pod Health Summary
**Total Pods:** 127
- Running: 120
- Pending: 4 ⚠️
- CrashLoopBackOff: 2 ⚠️
- ImagePullBackOff: 1 ⚠️

### Critical Pod Issues
1. **api-service-abc** (namespace: api)
   - Status: CrashLoopBackOff
   - Restarts: 5 times in 10 minutes
   - Action needed: Investigate with debug-pod-issues skill
2. **api-service-xyz** (namespace: api)
   - Status: OOMKilled
   - Restarts: 3 times in 15 minutes
   - Action needed: Memory limit investigation required
3. **worker-123** (namespace: default)
   - Status: Pending
   - Reason: Insufficient memory to schedule
   - Action needed: Resource analysis needed
4. **prometheus-456** (namespace: monitoring)
   - Status: ImagePullBackOff
   - Reason: Failed to pull image
   - Action needed: Check registry connectivity

## System Components ✓
All critical system components healthy:
- coredns: 2/2 pods running
- kube-apiserver: Running
- kube-controller-manager: Running
- kube-scheduler: Running
- etcd: Running
- kube-proxy: DaemonSet 5/5 ready

## Recent Critical Events (Last 60 minutes)

**OOM Kills:** 3 occurrences
- 14:23: api-service-xyz OOMKilled (namespace: api)
- 14:25: api-service-xyz OOMKilled (namespace: api)
- 14:27: api-service-xyz OOMKilled (namespace: api)

**Scheduling Failures:** 4 occurrences
- 14:20: worker-123 FailedScheduling: insufficient memory
- 14:22: worker-456 FailedScheduling: insufficient memory
- 14:25: worker-789 FailedScheduling: insufficient memory
- 14:28: cache-abc FailedScheduling: insufficient cpu

**Node Issues:**
- 14:15: node-4 NodeNotReady: KubeletNotReady

**Evictions:** 2 occurrences
- 14:18: cache-xyz Evicted: node low on memory
- 14:22: cache-abc Evicted: node low on memory

## Resource Pressure Analysis

### Node-4: MemoryPressure Detected ⚠️
- Current usage: 28Gi / 32Gi (87%)
- Condition: MemoryPressure True
- Impact: Pods may be evicted
- Action: Investigate high memory consumers

### Cluster-Wide Resource Allocation
- **CPU:** 75% allocated (approaching capacity)
- **Memory:** 82% allocated ⚠️ (critical threshold)
- **Risk:** New pods may not schedule

## Issues Detected

### 🚨 CRITICAL Issues (Require Immediate Action)
1. **Multiple OOM Kills in api namespace**
   - Impact: Service degradation/outages
   - Pods affected: api-service-xyz
   - Recommendation: Increase memory limits or investigate memory leak
   - Next step: Use `debug-pod-issues` skill
2. **Node-4 Unhealthy (NotReady)**
   - Impact: Reduced cluster capacity
   - Duration: 15 minutes
   - Recommendation: Investigate node logs, consider cordoning/draining
   - Next step: SSH to node or check kubelet logs
3. **Cluster Memory Capacity Critical (82% allocated)**
   - Impact: Risk of scheduling failures
   - Pods pending: 4
   - Recommendation: Scale cluster or optimize workloads
   - Next step: Use `analyze-resource-usage` skill

### ⚠️ WARNING Issues (Should Be Addressed)
4. **Node-3 High Memory Usage (87%)**
   - Impact: Risk of pressure condition
   - Current state: Still Ready
   - Recommendation: Monitor closely, consider rebalancing pods
5. **ImagePullBackOff in monitoring namespace**
   - Impact: Prometheus not available
   - Likely cause: Registry connectivity or credentials
   - Recommendation: Check image repository access

## Recommended Actions (Priority Order)

### Immediate (Next 15 minutes)
1. **Investigate api-service OOM kills** → Use `debug-pod-issues` skill on api-service-xyz
2. **Check node-4 status** → SSH to node or review kubelet logs
3. **Review pending pods** → Use `analyze-resource-usage` to understand capacity

### Short Term (Next hour)
4. Increase memory limits for api-service pods
5. Consider scaling cluster (add nodes or upsize)
6. Fix ImagePullBackOff for prometheus
7. Investigate memory usage on node-3

### Long Term (This week)
8. Implement pod resource requests/limits across all workloads
9. Set up cluster autoscaling
10. Review and optimize memory-intensive workloads
11. Implement monitoring alerts for:
    - Node NotReady conditions
    - OOM kill events
    - Resource allocation thresholds (>80%)
    - Pod evictions

## Next Steps
Based on the findings, I recommend:
1. **Deep dive into OOM issues** → Skill: `debug-pod-issues`
   - Target: api-service-xyz in api namespace
2. **Analyze resource usage patterns** → Skill: `analyze-resource-usage`
   - Focus on memory consumption and allocation
3. **Check logs for crash patterns** → Skill: `inspect-logs`
   - Target: api-service pods for error patterns

Would you like me to proceed with investigating the OOM kills in the api-service pods?
```
Red Flags to Watch For
- 🚨 Node NotReady status - Immediate impact on capacity
- 🚨 Multiple pods in CrashLoopBackOff - Application issues
- 🚨 Repeated OOMKilled events - Memory configuration problems
- 🚨 System component failures - Cluster instability
- 🚨 High resource allocation (>85%) - Scheduling issues imminent
- ⚠️ High restart counts (>5 in last hour) - Application instability
- ⚠️ Pending pods - Resource constraints
- ⚠️ ImagePullBackOff - Registry or networking issues
- ⚠️ Volume mount failures - Storage problems
- ⚠️ Evicted pods - Node resource pressure
Decision Tree - Next Skill to Use
Based on findings, recommend the next skill:
- If OOMKilled or CrashLoopBackOff detected: → Use `debug-pod-issues` skill
- If high CPU/Memory usage detected: → Use `analyze-resource-usage` skill
- If connection errors in events: → Use `check-network-connectivity` skill
- If errors in events but pods running: → Use `inspect-logs` skill
- If multiple issues: → Prioritize by severity, start with pod crashes
Common Patterns & Root Causes
Pattern: Multiple OOM Kills
- Indicates: Memory limits too low or memory leak
- Next Action: debug-pod-issues + inspect-logs
Pattern: Many Pending Pods
- Indicates: Insufficient cluster capacity
- Next Action: analyze-resource-usage
Pattern: Node NotReady + Evictions
- Indicates: Node resource exhaustion
- Next Action: Investigate node directly, consider draining
Pattern: System Component Failure
- Indicates: Critical cluster issue
- Next Action: Immediate investigation, possibly escalate
Pattern: ImagePullBackOff
- Indicates: Registry access issues
- Next Action: Check network connectivity, registry credentials
Skill Completion Criteria
This skill is complete when:
- ✓ Node health assessed
- ✓ Pod health across all namespaces evaluated
- ✓ System components verified
- ✓ Recent critical events reviewed
- ✓ Issues categorized by severity
- ✓ Recommended next steps provided
- ✓ Clear indication of which skill to use next
Notes for Agent
- Always start with this skill for vague issues
- Provide executive summary before detailed findings
- Categorize issues by severity (Critical/Warning/Info)
- Explicitly recommend next skill based on findings
- Include specific pod names, namespaces, timestamps
- Highlight patterns, not just individual issues
- Keep summary concise but comprehensive