Claude-skill-registry gcp-gke-troubleshooting
Install

Source · Clone the upstream repo:

```shell
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into ~/.claude/skills/:

```shell
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/data/gcp-gke-troubleshooting" \
       ~/.claude/skills/majiayu000-claude-skill-registry-gcp-gke-troubleshooting \
  && rm -rf "$T"
```

Manifest: skills/data/gcp-gke-troubleshooting/SKILL.md
GKE Troubleshooting
Purpose
Systematically diagnose and resolve common GKE issues. This skill provides structured debugging workflows, common causes, and proven solutions for the most frequent problems encountered in production deployments.
When to Use
Use this skill when you need to:
- Debug pods stuck in Pending, CrashLoopBackOff, or ImagePullBackOff status
- Troubleshoot networking issues (DNS failures, service connectivity)
- Fix Cloud SQL connection problems or IAM authentication errors
- Resolve Pub/Sub message processing issues
- Investigate resource exhaustion or scheduling failures
- Debug health probe failures
- Diagnose application crashes or startup issues
Trigger phrases: "pod not starting", "CrashLoopBackOff", "debug GKE issue", "Cloud SQL connection failed", "Pub/Sub not working", "pod pending"
Table of Contents
- Quick Start
- Instructions
- Examples
- Requirements
- See Also
Quick Start
Quick diagnostic flow for any pod issue:
```shell
# 1. Check pod status
kubectl get pods -n wtr-supplier-charges

# 2. View detailed pod information
kubectl describe pod <pod-name> -n wtr-supplier-charges

# 3. Check logs
kubectl logs <pod-name> -n wtr-supplier-charges

# 4. Check previous logs if crashed
kubectl logs <pod-name> -n wtr-supplier-charges --previous

# 5. Check events for scheduling issues
kubectl get events -n wtr-supplier-charges --sort-by='.lastTimestamp'

# 6. Check resource availability
kubectl top nodes
kubectl top pods -n wtr-supplier-charges
```
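The flow above can be wrapped in a small helper for repeated use. This is a sketch, not part of the skill itself: the function name `diagnose_pod`, the `DRY_RUN` flag, and the default namespace are illustrative choices.

```shell
# Sketch: run the quick diagnostic flow for one pod.
# Usage: diagnose_pod <pod-name> [namespace]
# Set DRY_RUN=1 to print the commands instead of executing them.
diagnose_pod() {
  local pod="$1" ns="${2:-wtr-supplier-charges}"
  local run="${DRY_RUN:+echo}"   # "echo" in dry-run mode, empty otherwise
  $run kubectl get pods -n "$ns" -o wide
  $run kubectl describe pod "$pod" -n "$ns"
  $run kubectl logs "$pod" -n "$ns"
  $run kubectl logs "$pod" -n "$ns" --previous || true   # fails if never restarted
  $run kubectl get events -n "$ns" --sort-by=.lastTimestamp
  $run kubectl top pods -n "$ns"
}
```

Running it with `DRY_RUN=1` first is a cheap way to confirm the namespace and pod name before hitting the cluster.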
Instructions
Step 1: Identify the Pod Status
Understand what the pod status means:
```shell
kubectl get pods -n wtr-supplier-charges -o wide
```
| Status | Meaning | Action |
|---|---|---|
| Running | Pod is executing | Check logs if issues |
| Pending | Waiting to be scheduled | Check events, node resources |
| CrashLoopBackOff | App crashes repeatedly | Check logs, configuration |
| ImagePullBackOff | Can't pull image | Verify image, permissions |
| Completed | Pod ran successfully and exited | Normal for batch jobs |
| Error | Pod exited with error | Check logs |
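For scripting, the table's decision column can be encoded in a small helper. This is an illustrative sketch; the function name is an assumption, not part of the skill.

```shell
# Sketch: map a pod status (as shown by `kubectl get pods`) to the
# suggested next step from the table above.
pod_status_action() {
  case "$1" in
    Running)          echo "Check logs if issues" ;;
    Pending)          echo "Check events, node resources" ;;
    CrashLoopBackOff) echo "Check logs, configuration" ;;
    ImagePullBackOff) echo "Verify image, permissions" ;;
    Completed)        echo "Normal for batch jobs" ;;
    Error)            echo "Check logs" ;;
    *)                echo "Unknown status: $1" ;;
  esac
}
```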
Step 2: Investigate Based on Status
Pod Status: ImagePullBackOff
Diagnose:
```shell
# Get the detailed error
kubectl describe pod <pod-name> -n wtr-supplier-charges
# Look for "Failed to pull image" in the Events section
# Example: "Failed to pull image ... access denied"

# Check if the image exists in the registry
gcloud artifacts docker images list \
  europe-west2-docker.pkg.dev/ecp-artifact-registry/wtr-supplier-charges-container-images
```
Solutions:
- Image doesn't exist:
```shell
# Verify the image tag is correct
kubectl get deployment supplier-charges-hub -n wtr-supplier-charges \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```
- Missing Artifact Registry permissions:
```shell
# Grant the Artifact Registry Reader role
gcloud artifacts repositories add-iam-policy-binding \
  wtr-supplier-charges-container-images \
  --location=europe-west2 \
  --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"
```
- Private image registry authentication:
```shell
# Create an image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=europe-west2-docker.pkg.dev \
  --docker-username=_json_key \
  --docker-password="$(cat key.json)" \
  -n wtr-supplier-charges
```

Then reference it in the deployment spec:

```yaml
imagePullSecrets:
  - name: regcred
```
Pod Status: CrashLoopBackOff
Diagnose:
```shell
# Check current logs
kubectl logs <pod-name> -n wtr-supplier-charges

# Check logs from the previous container (if crashed)
kubectl logs <pod-name> -n wtr-supplier-charges --previous

# Check the liveness probe configuration
kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 10 "Liveness"
```
Common Causes:
- Application exits immediately:
```shell
# Check startup logs for Java/Spring Boot errors
kubectl logs <pod-name> -n wtr-supplier-charges | head -50
# Look for: ClassNotFoundException, ConfigurationException, connection errors
```
- Liveness probe fails too early:
```shell
# Increase initialDelaySeconds from 20 to 60
kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"supplier-charges-hub-container","livenessProbe":{"initialDelaySeconds":60}}]}}}}'
```
- Out of memory:
```shell
# Check memory usage
kubectl top pods <pod-name> -n wtr-supplier-charges

# Increase memory limits
kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"supplier-charges-hub-container","resources":{"limits":{"memory":"4Gi"}}}]}}}}'
```
- Missing environment variables:
```shell
# Check which env vars are set
kubectl exec <pod-name> -n wtr-supplier-charges -- env | sort

# Verify ConfigMap/Secret values
kubectl get configmap supplier-charges-hub-config -n wtr-supplier-charges -o yaml
kubectl get secret db-credentials -n wtr-supplier-charges -o yaml
```
Pod Status: Pending (Unschedulable)
Diagnose:
```shell
# Check events for scheduling messages
kubectl describe pod <pod-name> -n wtr-supplier-charges
# Look for: "Insufficient memory", "Insufficient cpu", "PersistentVolumeClaim"

# Check node capacity
kubectl top nodes
kubectl describe nodes
```
Solutions:
- Insufficient cluster resources:
```shell
# Scale the deployment down
kubectl scale deployment supplier-charges-hub --replicas=1 -n wtr-supplier-charges

# Or trigger autoscaling (if available)
# GKE Autopilot automatically provisions capacity
```
- Node affinity/taints preventing scheduling:
```shell
# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# View the pod's node affinity/tolerations
kubectl get pod <pod-name> -n wtr-supplier-charges -o yaml | grep -A 10 -B 2 "affinity\|toleration"
```

Add a toleration to the deployment if needed:

```yaml
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "compute"
      effect: "NoSchedule"
```
- PersistentVolumeClaim not bound:
```shell
# Check PVC status
kubectl get pvc -n wtr-supplier-charges

# If Pending, check the storage class
kubectl get storageclass
```
Step 3: Network and Connectivity Issues
DNS Resolution Failures
Diagnose:
```shell
# Test DNS from the pod
kubectl exec <pod-name> -n wtr-supplier-charges -- nslookup postgres

# Test connectivity to the service
kubectl exec <pod-name> -n wtr-supplier-charges -- curl -v http://postgres:5432
```
Solutions:
- CoreDNS pods not running:
```shell
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Restart CoreDNS if needed
kubectl rollout restart deployment coredns -n kube-system
```
- Service doesn't exist or wrong namespace:
```shell
# Verify the service exists
kubectl get svc postgres -n wtr-supplier-charges

# Use the fully qualified DNS name if the service is in a different namespace:
# service-name.namespace.svc.cluster.local
```
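Building the fully qualified name is easy to get wrong by hand. A trivial sketch, with `svc_fqdn` and the example values (`postgres`, `shared-db`) being illustrative assumptions:

```shell
# Sketch: build the in-cluster FQDN for a Service in another namespace.
svc_fqdn() {
  local svc="$1" ns="$2"
  echo "${svc}.${ns}.svc.cluster.local"
}

svc_fqdn postgres shared-db   # -> postgres.shared-db.svc.cluster.local
```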
Service Not Accessible
Diagnose:
```shell
# Check service endpoints
kubectl get endpoints supplier-charges-hub -n wtr-supplier-charges

# If empty, no pods match the selector
kubectl get svc supplier-charges-hub -n wtr-supplier-charges -o yaml | grep selector
kubectl get pods -n wtr-supplier-charges --show-labels
```
Solutions:
- Pod labels don't match service selector:
```shell
# Add/update labels on the deployment
kubectl patch deployment supplier-charges-hub -n wtr-supplier-charges \
  -p '{"spec":{"template":{"metadata":{"labels":{"app":"supplier-charges-hub"}}}}}'
```
- Pods not in Ready state:
```shell
# Check the readiness probe
kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 10 "Readiness"

# Check the health endpoint
kubectl exec <pod-name> -n wtr-supplier-charges -- \
  curl localhost:8080/actuator/health/readiness
```
Step 4: Database Connection Issues
Diagnose:
```shell
# Test connectivity to the Cloud SQL Proxy
kubectl exec <pod-name> -n wtr-supplier-charges -- nc -zv localhost 5432

# Check Cloud SQL Proxy logs
kubectl logs <pod-name> -c cloud-sql-proxy -n wtr-supplier-charges

# Check application startup logs for DB connection errors
kubectl logs <pod-name> -c supplier-charges-hub-container -n wtr-supplier-charges | grep -i "database\|connection"
```
Solutions:
- IAM Authentication fails:
```shell
# Verify the Workload Identity binding
kubectl get sa app-runtime -n wtr-supplier-charges -o yaml | grep iam.gke.io

# Grant the cloudsql.client role
gcloud projects add-iam-policy-binding project-id \
  --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
  --role="roles/cloudsql.client"

# Check the service account email format (must be {name}@{project}.iam)
```
- Wrong connection string:
```shell
# Verify the DB_CONNECTION_NAME format: project:region:instance
kubectl get configmap db-config -n wtr-supplier-charges -o yaml
# Should look like: ecp-wtr-supplier-charges-labs:europe-west2:supplier-charges-hub
```
- Cloud SQL Proxy not running:
```shell
# Check sidecar logs
kubectl logs <pod-name> -c cloud-sql-proxy -n wtr-supplier-charges

# Check sidecar resources
kubectl describe pod <pod-name> -n wtr-supplier-charges | grep -A 15 "cloud-sql-proxy"
```
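The `project:region:instance` connection-string format above can be sanity-checked in scripts before a deploy. A sketch, assuming bash; the function name is illustrative, and the regex only checks the general shape — it does not verify that the project, region, or instance actually exist:

```shell
# Sketch: syntactic check that a Cloud SQL connection name has the
# project:region:instance shape (lowercase letters, digits, hyphens).
valid_connection_name() {
  [[ "$1" =~ ^[a-z][a-z0-9-]*:[a-z0-9]+(-[a-z0-9]+)*:[a-z][a-z0-9-]*$ ]]
}

valid_connection_name "ecp-wtr-supplier-charges-labs:europe-west2:supplier-charges-hub" \
  && echo "format OK"
```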
Step 5: Pub/Sub Issues
Diagnose:
```shell
# Check the subscription backlog
gcloud pubsub subscriptions describe supplier-charges-incoming-sub \
  --project=ecp-wtr-supplier-charges-labs

# Check application Pub/Sub logs
kubectl logs <pod-name> -c supplier-charges-hub-container \
  -n wtr-supplier-charges | grep -i "pubsub\|subscription"

# Test Pub/Sub connectivity from the pod
kubectl exec <pod-name> -n wtr-supplier-charges -- \
  gcloud pubsub topics list --project=ecp-wtr-supplier-charges-labs
```
Solutions:
- Missing Pub/Sub permissions:
```shell
# Grant Pub/Sub roles
gcloud projects add-iam-policy-binding project-id \
  --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
  --role="roles/pubsub.subscriber"
gcloud projects add-iam-policy-binding project-id \
  --member="serviceAccount:app-runtime@project.iam.gserviceaccount.com" \
  --role="roles/pubsub.publisher"
```
- High subscription backlog (messages not being consumed):
```shell
# Check if the pod is running
kubectl get pods -n wtr-supplier-charges

# Check application logs for processing errors
kubectl logs -f <pod-name> -c supplier-charges-hub-container \
  -n wtr-supplier-charges | grep -i "error\|exception"

# Increase the message processing timeout
# In application.yaml:
# spring.cloud.gcp.pubsub.subscriber.max-ack-extension-period: 600
```
- Message processing failures:
  - Check for poison messages (causing repeated failures)
  - Review the DLQ (Dead Letter Queue) if configured
  - Implement retry logic with exponential backoff
  - See the Spring Cloud GCP documentation for retry configuration
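As a sketch of the exponential-backoff idea applied to ad-hoc debugging commands (a Spring application would instead configure retry in the client library), a shell wrapper might look like this; `retry_with_backoff` and the `BACKOFF_BASE` override are illustrative assumptions:

```shell
# Sketch: retry a command with exponential backoff.
# Usage: retry_with_backoff <max-attempts> <command> [args...]
retry_with_backoff() {
  local max="$1"; shift
  local delay="${BACKOFF_BASE:-1}"   # initial delay in seconds (override for testing)
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))             # double the delay each retry
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 5 gcloud pubsub topics list` would retry a flaky listing up to five times, waiting 1, 2, 4, then 8 seconds between attempts.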
Examples
See examples/examples.md for comprehensive examples including:
- Complete troubleshooting workflow
- Database connectivity debugging
- Pub/Sub debugging
Requirements
- kubectl access to the cluster
- gcloud CLI configured
- Permissions to view pod logs and describe resources
- For database debugging: access to view Cloud SQL configuration
- For Pub/Sub debugging: access to view subscription details
See Also
- gcp-gke-deployment-strategies - Understand deployment health checks
- gcp-gke-monitoring-observability - Monitor applications
- gcp-gke-workload-identity - Debug IAM/Workload Identity issues