# debugging-flux-deployments

Diagnoses and resolves issues with Flux GitOps deployments, Kubernetes pods, services, and HelmReleases in the Superbloom cluster.

Install the full registry:

```shell
git clone https://github.com/majiayu000/claude-skill-registry
```

Or install only this skill:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/debugging-flux-deployments" ~/.claude/skills/majiayu000-claude-skill-registry-debugging-flux-deployments && rm -rf "$T"
```

`skills/data/debugging-flux-deployments/SKILL.md`

# Debugging Flux Deployments
This skill provides systematic approaches to diagnose and fix deployment issues in the Superbloom K3s cluster.
## Capabilities
- Diagnose pod startup failures
- Debug empty service endpoints
- Troubleshoot HelmRelease errors
- Fix SOPS decryption issues
- Resolve certificate/ingress problems
- Identify RBAC permission issues
## Diagnostic Flow

1. Check Flux sync status
2. Check HelmRelease status
3. Check Pod status
4. Check Service endpoints
5. Check container logs
6. Test internal connectivity
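The steps above can be sketched as a single helper. This is a minimal sketch, not part of the skill itself: the `diagnose` function name is hypothetical, and it assumes `flux` and `kubectl` are on PATH and that pods carry the `app.kubernetes.io/instance` label (adjust for your manifests).

```shell
# Hypothetical helper walking the six diagnostic steps in order.
# Assumes flux/kubectl on PATH and an app.kubernetes.io/instance label.
diagnose() {
  local ns="$1" app="$2"
  flux get kustomization -n flux-system                            # 1. Flux sync status
  kubectl get helmrelease "$app" -n "$ns"                          # 2. HelmRelease status
  kubectl get pods -n "$ns" -l app.kubernetes.io/instance="$app"   # 3. Pod status
  kubectl get endpoints -n "$ns"                                   # 4. Service endpoints
  kubectl logs -n "$ns" -l app.kubernetes.io/instance="$app" \
    --all-containers --tail=30                                     # 5. Container logs
  kubectl run curl-test --rm -i --restart=Never \
    --image=curlimages/curl -- \
    curl -s "http://$app.$ns.svc.cluster.local/health"             # 6. Internal connectivity
}
```

Usage would be e.g. `diagnose <namespace> <app>`; stop at the first step that shows a failure and drill in with the commands below.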
## Essential Commands

### Flux Status

```shell
# Check all Flux resources
flux get all -A

# Check specific kustomization
flux get kustomization flux-system -n flux-system

# Force reconciliation
flux reconcile kustomization flux-system --with-source

# Check GitRepository
flux get source git -A
```
### HelmRelease Status

```shell
# List all HelmReleases
kubectl get helmrelease -A

# Detailed status
kubectl describe helmrelease <name> -n <namespace>

# Force HelmRelease reconciliation
flux reconcile helmrelease <name> -n <namespace>
```
### Pod Diagnostics

```shell
# List pods with status
kubectl get pods -n <namespace>

# Detailed pod info
kubectl describe pod <pod-name> -n <namespace>

# Container logs
kubectl logs -n <namespace> <pod-name> -c <container-name>

# All containers
kubectl logs -n <namespace> <pod-name> --all-containers

# Previous crashed container
kubectl logs -n <namespace> <pod-name> -c <container> --previous
```
### Service/Endpoint Check

```shell
# Check service
kubectl get svc -n <namespace>

# Check endpoints (empty = no healthy pods)
kubectl get endpoints <service-name> -n <namespace>

# Test internal connectivity
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://<service>.<namespace>.svc.cluster.local:<port>/health
```
### Secret Verification

```shell
# Check secret exists
kubectl get secret <name> -n <namespace>

# View secret data (base64 decoded)
kubectl get secret <name> -n <namespace> -o jsonpath='{.data.<key>}' | base64 -d

# SOPS decrypt locally
cd flux && sops -d clusters/superbloom/apps/<app>/secrets.yaml
```
## Common Issues & Solutions

### Pod CrashLoopBackOff

**Symptom:** Pod shows `CrashLoopBackOff`, `1/2 Ready`

**Diagnosis:**

```shell
kubectl describe pod <pod> -n <namespace> | tail -30
kubectl logs -n <namespace> <pod> -c <container>
```
**Common causes:**
- Missing RBAC permissions (Tailscale sidecar)
- Missing environment variables
- Failed health probes
- Image pull errors
**Tailscale fix:** Add RBAC for secrets management:

```yaml
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "create", "update", "patch"]
```
### Empty Service Endpoints

**Symptom:** Service has no endpoints, traffic fails

**Diagnosis:**

```shell
kubectl get endpoints <svc> -n <namespace>
kubectl get pods -n <namespace> -o wide
```

**Cause:** Pod not `Ready` (all containers must pass readiness probes)

**Fix:** Ensure all containers are healthy. If a sidecar is failing, fix that first.
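To see which container is blocking readiness, one option is a small helper around `kubectl`'s JSONPath output. This is a sketch, not part of the skill: the `container_readiness` name is hypothetical, and the pod/namespace arguments are placeholders for your resources.

```shell
# Print each container's name and ready state for a pod, one per line.
# Hypothetical helper; pass the pod name and namespace as arguments.
container_readiness() {
  local pod="$1" ns="$2"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'
}
```

Any container reporting `false` is the one keeping the pod out of the service's endpoints.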
### HelmRelease Schema Validation Error

**Symptom:** `values don't meet the specifications of the schema`

**Example error:**

```
at '/serviceAccount/name': got string, want object
```

**Cause:** bjw-s app-template v4 schema mismatch

**Fix for serviceAccount:** Move it inside `controllers.main`:

```yaml
controllers:
  main:
    serviceAccount:  # CORRECT - inside controllers.main
      name: my-app
    containers:
      main:
        # ...
```

**NOT** at top level:

```yaml
serviceAccount:  # WRONG - top level
  name: my-app
```
### SOPS Decryption Failure

**Symptom:** Flux can't decrypt secrets

**Diagnosis:**

```shell
# Check sops-age secret exists
kubectl get secret sops-age -n flux-system

# Test local decryption
cd flux && sops -d clusters/superbloom/apps/<app>/secrets.yaml
```

**Fixes:**

- Ensure `path_regex` in `.sops.yaml` matches the file location
- Verify the age key is available locally and in the cluster
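For reference, a `.sops.yaml` creation rule for this layout might look like the following sketch. The regex and the age recipient are illustrative placeholders, not values from this repo:

```yaml
creation_rules:
  # Placeholder rule: adjust path_regex to where your secrets actually live
  - path_regex: clusters/superbloom/.*/secrets\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: <your-age-public-key>
```

If `path_regex` doesn't match the file's path relative to `.sops.yaml`, `sops` silently encrypts with no rule (or refuses), and Flux decryption fails downstream.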
### Certificate/ACME Failures

**Symptom:** 521 errors, certificate not issued

**Diagnosis:**

```shell
kubectl logs -n caddy-system -l app=caddy --tail=50
```
**Common causes:**
- Domain not in DDNS → Add to DOMAINS list
- Rate limited → Wait 1 hour
- Origin unreachable → Check Cloudflare proxy settings
**Fix:**

```shell
# Add domain to DDNS
sops -d flux/clusters/superbloom/infra/ddns/secrets.yaml
# Add to DOMAINS, re-encrypt

# Restart DDNS and Caddy
kubectl rollout restart deployment ddns -n ddns
kubectl rollout restart deployment caddy -n caddy-system
```
### Image Pull Errors

**Symptom:** `ImagePullBackOff`, `ErrImagePull`

**Diagnosis:**

```shell
kubectl describe pod <pod> -n <namespace> | grep -A5 "Events"
```
**Fixes:**
- Check image exists in GHCR
- Verify image tag is correct
- Check imagePullSecrets if private registry
### Exec Format Error (Corrupted Image Cache)

**Symptom:** Container crashes with `exec /usr/local/bin/<binary>: exec format error`

**Cause:** Containerd cached a wrong-architecture image (common with multi-arch `latest` tags)

**Diagnosis:**

```shell
# Shows "exec format error"
kubectl logs -n <namespace> <pod> -c <container>

# Check cached images
ssh superbloom "sudo ctr --address /run/k3s/containerd/containerd.sock -n k8s.io images ls | grep <image>"
```

**Fix:** Remove the corrupted image from the containerd cache:

```shell
ssh superbloom "sudo ctr --address /run/k3s/containerd/containerd.sock -n k8s.io images rm <image>:<tag>"

# Delete pods to force a fresh pull
kubectl delete pod -n <namespace> -l app.kubernetes.io/instance=<app>
```
**Prevention:** Pin to specific version tags instead of `latest` for sidecars like Tailscale.
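In a bjw-s app-template values block, pinning might look like the following. The repository path is Tailscale's real GHCR image; the tag shown is a placeholder, so substitute a current release:

```yaml
containers:
  tailscale:
    image:
      repository: ghcr.io/tailscale/tailscale
      tag: v1.82.0  # placeholder: pin an explicit release, never "latest"
```

A pinned tag makes containerd's cached manifest deterministic, so a stale multi-arch resolution can't silently swap architectures on the next pull.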
### Container Exits Immediately (Exit Code 0)

**Symptom:** Container shows `Completed` with exit code 0, then restarts into CrashLoopBackOff

**Cause:** Often a Docker build issue: source files are empty (0 bytes)

**Diagnosis:**

```shell
# Check file sizes in the image
kubectl run debug --rm -it --image=<image>:<tag> -- ls -la /app/src/
# If files show 0 bytes, the build is broken
```

**Fix:** Rebuild with the cache disabled:

```yaml
# In .github/workflows/<app>.yml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    no-cache: true  # Temporarily disable cache
```
After successful rebuild, re-enable caching.
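Re-enabling typically means restoring the cache options instead of `no-cache`. The `type=gha` backend is a standard option of `docker/build-push-action`; the exact inputs depend on your workflow, so treat this as a sketch:

```yaml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    push: true
    cache-from: type=gha
    cache-to: type=gha,mode=max  # mode=max caches all layers, not just the final ones
```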
## Useful Patterns

### Watch Resources

```shell
# Watch pods
kubectl get pods -n <namespace> -w

# Watch events
kubectl get events -n <namespace> -w --sort-by='.lastTimestamp'
```
### Exec Into Container

```shell
kubectl exec -it -n <namespace> <pod> -c <container> -- /bin/sh
```
### Port Forward for Testing

```shell
kubectl port-forward -n <namespace> svc/<service> 8080:3000
curl http://localhost:8080/health
```
### Delete and Recreate HelmRelease

```shell
# Sometimes needed to clear stuck state
kubectl delete helmrelease <name> -n <namespace>
flux reconcile kustomization flux-system --with-source
```
## Best Practices
- Always check Flux sync status first
- Read HelmRelease events for schema errors
- Empty endpoints = pod not Ready
- Check ALL container logs in multi-container pods
- Test internal connectivity before debugging ingress
- Rate limits require waiting, not repeated attempts