# aks-deployment-troubleshooter

Diagnose and fix Kubernetes deployment failures, especially `ImagePullBackOff`, `CrashLoopBackOff`, and architecture mismatches. Battle-tested in a 4-hour AKS debugging session that resolved 10+ failure modes.

Install:

```shell
git clone https://github.com/majiayu000/claude-skill-registry
# or copy just this skill into ~/.claude/skills:
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/aks-deployment-troubleshooter" ~/.claude/skills/majiayu000-claude-skill-registry-aks-deployment-troubleshooter && rm -rf "$T"
```

Source: `skills/data/aks-deployment-troubleshooter/SKILL.md`

# AKS Deployment Troubleshooter
## Overview

This skill captures systematic approaches to debugging Kubernetes deployments, with a specific focus on container image issues. It is based on a real debugging session that resolved 10+ distinct failure modes.
## When to Use

- Pods stuck in `ImagePullBackOff`
- Pods in `CrashLoopBackOff` with `exec format error`
- "no match for platform in manifest" errors
- Image registry authentication issues
- Helm deployment timeouts
## Quick Diagnosis Flow

```
Pod not running?
│
├─► ImagePullBackOff
│   │
│   ├─► "not found" ──► Wrong tag or registry path
│   ├─► "unauthorized" ──► Auth/imagePullSecrets issue
│   └─► "no match for platform" ──► Architecture mismatch
│
├─► CrashLoopBackOff
│   │
│   ├─► "exec format error" ──► Wrong CPU architecture
│   ├─► Exit code 1 ──► App startup failure (check logs)
│   └─► OOMKilled ──► Memory limits too low
│
└─► Pending
    │
    ├─► Insufficient CPU/memory ──► Scale cluster or reduce requests
    └─► No matching node ──► Check nodeSelector/tolerations
```
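The triage tree above can be sketched as a small lookup. This is an illustrative helper (not part of the skill's tooling): the pod state plus a substring of the event/log message maps to a likely diagnosis.

```python
def triage(state: str, message: str) -> str:
    """Map a pod state plus its event/describe message to a likely diagnosis.

    Matching is substring-based, since these reasons appear inside
    `kubectl describe pod` and `kubectl get events` output.
    """
    rules = {
        "ImagePullBackOff": [
            ("not found", "Wrong tag or registry path"),
            ("unauthorized", "Auth/imagePullSecrets issue"),
            ("no match for platform", "Architecture mismatch"),
        ],
        "CrashLoopBackOff": [
            ("exec format error", "Wrong CPU architecture"),
            ("OOMKilled", "Memory limits too low"),
        ],
        "Pending": [
            ("Insufficient", "Scale cluster or reduce requests"),
            ("didn't match", "Check nodeSelector/tolerations"),
        ],
    }
    for needle, diagnosis in rules.get(state, []):
        if needle in message:
            return diagnosis
    # e.g. CrashLoopBackOff with exit code 1: app startup failure
    return "Check logs: kubectl logs <pod> --tail=50"
```

For example, `triage("ImagePullBackOff", 'pull image "x": not found')` points straight at a tag or registry-path problem.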
## Diagnostic Commands

### Step 1: Get Pod Status

```shell
kubectl get pods -n <namespace>
```

### Step 2: Describe Failing Pod

```shell
kubectl describe pod <pod-name> -n <namespace> | grep -E "(Image:|Failed|Error|pull)"
```

### Step 3: Check Events

```shell
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```

### Step 4: Check Logs (for CrashLoopBackOff)

```shell
kubectl logs <pod-name> -n <namespace> --tail=50
```

### Step 5: Check Node Architecture

```shell
kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
```
## Error Resolution Guide

### 1. ImagePullBackOff: "not found"

Error:

```
Failed to pull image "ghcr.io/owner/repo/app:abc123": not found
```

Causes & Solutions:

| Cause | Solution |
|---|---|
| Tag doesn't exist | Verify the image was pushed with the exact tag |
| Short vs full SHA | Align metadata-action with deploy (use `type=raw,value=${{ github.sha }}`) |
| Builds skipped | Trigger manually or remove path filters |
| Wrong registry | Check the image repository in Helm values |

Diagnostic:

```shell
# Check what tags exist (requires gh cli and package visibility)
gh api /users/<owner>/packages/container/<package>/versions
```
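To scan the `gh api` output for a specific tag, you can filter its JSON. A minimal sketch, assuming the response shape of the GitHub Packages versions endpoint (a list of versions, each carrying its tags under `metadata.container.tags`); the `sample` fixture below is made up:

```python
import json

def tag_exists(versions_json: str, tag: str) -> bool:
    """Return True if any package version carries the given tag."""
    versions = json.loads(versions_json)
    return any(
        tag in v.get("metadata", {}).get("container", {}).get("tags", [])
        for v in versions
    )

# Fixture mimicking the API response; in practice pipe `gh api ...` output in
sample = json.dumps([
    {"metadata": {"container": {"tags": ["latest", "abc1234"]}}},
])
```

Note how `tag_exists(sample, "abc123")` is False even though `abc1234` exists: an exact-match check like this catches the short-vs-full SHA mismatch in the table above.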
### 2. ImagePullBackOff: "unauthorized"

Error:

```
failed to authorize: failed to fetch anonymous token: 401 Unauthorized
```

Causes & Solutions:

| Cause | Solution |
|---|---|
| Package is private | Make the package public in GHCR settings |
| Missing imagePullSecrets | Create a docker-registry secret |
| Wrong credentials | Regenerate and update the secret |

Create imagePullSecrets:

```shell
kubectl create secret docker-registry ghcr-secret \
  --docker-server=ghcr.io \
  --docker-username=<github-username> \
  --docker-password=<github-token> \
  --namespace=<namespace>
```

Link the secret in the deployment:

```yaml
spec:
  imagePullSecrets:
    - name: ghcr-secret
```
### 3. ImagePullBackOff: "no match for platform in manifest"

Error:

```
no match for platform in manifest: not found
```

Root Cause: Image built for the wrong CPU architecture, OR a buildx provenance issue.

Step 1: Check cluster architecture:

```shell
kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
# Output: amd64 amd64 OR arm64 arm64
```

Step 2: Match build platform:

```yaml
# In GitHub Actions docker/build-push-action
- uses: docker/build-push-action@v5
  with:
    platforms: linux/arm64   # or linux/amd64
    provenance: false        # CRITICAL: Disable attestation manifests
    no-cache: true           # Force fresh build
```
Why `provenance: false`? Buildx creates multi-arch manifest lists with attestations. Some container runtimes can't find the actual image among the extra entries in these complex manifests; disabling provenance produces a simple single-platform image.
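The failure mode can be illustrated by selecting from a manifest list the way a runtime does: match on `platform.os`/`platform.architecture`, and note that attestation entries carry `unknown/unknown`. The `index` fixture below is hand-written sample data, not a real registry response:

```python
def pick_manifest(manifest_list: dict, os_arch: str):
    """Return the digest whose platform matches os/arch, or None.

    Attestation entries (platform unknown/unknown) never match a node,
    which is how a runtime ends up with 'no match for platform'.
    """
    want_os, want_arch = os_arch.split("/")
    for m in manifest_list["manifests"]:
        platform = m.get("platform", {})
        if platform.get("os") == want_os and platform.get("architecture") == want_arch:
            return m["digest"]
    return None  # -> "no match for platform in manifest"

index = {
    "manifests": [
        {"digest": "sha256:aaa",
         "platform": {"os": "linux", "architecture": "amd64"}},
        # extra entry added when provenance/attestations are enabled
        {"digest": "sha256:bbb",
         "platform": {"os": "unknown", "architecture": "unknown"}},
    ]
}
```

With this index, an amd64 node resolves `sha256:aaa`, while an arm64 node finds nothing at all, since only amd64 was built.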
### 4. CrashLoopBackOff: "exec format error"

Error:

```
exec /usr/local/bin/docker-entrypoint.sh: exec format error
```

Root Cause: Binary architecture doesn't match the node architecture.

Example: Built a `linux/amd64` image, deployed to arm64 nodes.
Solution:

1. Check node architecture:

   ```shell
   kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
   ```

2. Update the build platform to match
3. Rebuild WITHOUT cache (cached layers may have the wrong arch)

```yaml
platforms: linux/arm64   # Match your cluster!
no-cache: true           # Force complete rebuild
provenance: false        # Simple manifest
```
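As a local sanity check, the architecture a binary was compiled for is recorded in its ELF header: `e_machine`, a 16-bit field at offset 18 (little-endian for amd64/arm64). A minimal sketch using synthetic header bytes; in practice you would feed it the first 20 bytes of the container's entrypoint binary:

```python
import struct

# ELF e_machine values for the two architectures this guide cares about
E_MACHINE = {0x3E: "amd64", 0xB7: "arm64"}

def elf_arch(header: bytes) -> str:
    """Read e_machine (u16 at offset 18) from an ELF header."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF binary")
    (machine,) = struct.unpack_from("<H", header, 18)
    return E_MACHINE.get(machine, f"unknown({machine:#x})")

# Synthetic 20-byte headers: magic + 14 padding bytes + e_machine
amd64_hdr = b"\x7fELF" + b"\x00" * 14 + struct.pack("<H", 0x3E)
arm64_hdr = b"\x7fELF" + b"\x00" * 14 + struct.pack("<H", 0xB7)
```

If `elf_arch` disagrees with the node architecture from step 1, you have found the cause of the `exec format error` before redeploying anything.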
### 5. Helm --set Comma Parsing Error

Error:

```
failed parsing --set data: key "com" has no value (cannot end with ,)
```

Root Cause: Helm interprets commas as list separators in `--set`.

Wrong:

```shell
--set "origins=https://a.com,https://b.com"
```
Solution: Use a heredoc values file:

```yaml
# In GitHub Actions
- name: Deploy
  run: |
    cat > /tmp/overrides.yaml << EOF
    sso:
      env:
        ALLOWED_ORIGINS: "https://a.com,https://b.com"
    EOF
    helm upgrade --install app ./chart \
      --values /tmp/overrides.yaml
```
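Why the comma breaks things can be seen by replicating the split `--set` performs. This toy parser is NOT Helm's real one, just a sketch of the behavior: commas separate assignments, so the second origin becomes a dangling key with no value.

```python
def parse_set(expr: str) -> dict:
    """Toy sketch of --set parsing: commas separate assignments,
    '=' separates key from value, and a backslash-escaped comma
    stays inside the value."""
    out = {}
    protected = expr.replace("\\,", "\x00")  # shield escaped commas
    for part in protected.split(","):
        part = part.replace("\x00", ",")     # restore them in the value
        if "=" not in part:
            raise ValueError(f'key "{part}" has no value')
        key, value = part.split("=", 1)
        out[key] = value
    return out
```

Helm's documented workaround is backslash-escaping (`--set "origins=https://a.com\,https://b.com"`), but escaping gets brittle inside CI quoting, which is why the values-file approach above is the safer fix.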
### 6. Azure Login "No subscriptions found"

Error:

```
Error: No subscriptions found for ***
```

Root Cause: Missing `subscriptionId` in `AZURE_CREDENTIALS`.

Solution: Use the `--sdk-auth` format:

```shell
az ad sp create-for-rbac \
  --name "github-actions" \
  --role contributor \
  --scopes /subscriptions/<subscription-id>/resourceGroups/<rg-name> \
  --sdk-auth
```
Required JSON structure (`subscriptionId` MUST be present):

```json
{
  "clientId": "xxx",
  "clientSecret": "xxx",
  "subscriptionId": "xxx",
  "tenantId": "xxx"
}
```
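A quick pre-flight check before wiring the secret into GitHub is to diff the JSON's keys against the required set. A minimal sketch (key names from the structure above):

```python
import json

REQUIRED = {"clientId", "clientSecret", "subscriptionId", "tenantId"}

def missing_keys(creds_json: str) -> set:
    """Return which required AZURE_CREDENTIALS keys are absent."""
    return REQUIRED - json.loads(creds_json).keys()
```

Running it on credentials that were created without `--sdk-auth` typically reports `{"subscriptionId"}`, which is exactly the field the login action is complaining about.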
### 7. GHCR 403 Forbidden

Error:

```
403 Forbidden: permission_denied: write_package
```

Solutions:

- Make the package public: GHCR → Package Settings → Change visibility
- Link the package to the repository: Package Settings → Connect Repository
- Ensure the workflow has the `packages: write` permission:

```yaml
permissions:
  contents: read
  packages: write
```
## Docker Build Best Practices for K8s

### Buildx Configuration for Reliability

```yaml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3

- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    platforms: linux/arm64   # Match cluster architecture!
    provenance: false        # Avoid manifest complexity
    no-cache: true           # For debugging; remove in production
    tags: |
      ghcr.io/owner/repo:${{ github.sha }}
      ghcr.io/owner/repo:latest
```
### Image Tag Strategy

Problem: Short SHA vs full SHA mismatch.

```yaml
# docker/metadata-action default: short SHA (7 chars)
type=sha,prefix=      # Creates: ghcr.io/repo:abc1234

# github.sha is the full SHA (40 chars)
${{ github.sha }}     # Is: abc1234567890abcdef...
```
Solution: Use the explicit full SHA:

```yaml
tags: |
  type=raw,value=${{ github.sha }}
  type=raw,value=latest,enable={{is_default_branch}}
```
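Spelled out concretely (the SHA below is made up), the mismatch is simply a truncation:

```python
# github.sha is the full 40-char commit SHA
full_sha = "abc1234567890abcdef1234567890abcdef12345"

# metadata-action's `type=sha,prefix=` pushes only the first 7 chars
pushed_tag = full_sha[:7]

# a deploy step templating ${{ github.sha }} asks for the full SHA,
# a tag that was never pushed -> ImagePullBackOff: "not found"
requested_tag = full_sha
```

Using `type=raw,value=${{ github.sha }}` makes `pushed_tag` and `requested_tag` identical, which is the whole fix.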
## Pre-Deployment Checklist

**Architecture**

- [ ] Checked cluster node architecture (`kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'`)
- [ ] Build platform matches cluster (arm64 vs amd64)

**Docker Build**

- [ ] `provenance: false` set
- [ ] `platforms: linux/<arch>` matches cluster
- [ ] Image tags are consistent between build and deploy

**Registry**

- [ ] Packages are public OR imagePullSecrets configured
- [ ] Workflow has `packages: write` permission

**Helm**

- [ ] No commas in `--set` values (use a values file instead)
- [ ] Image repository and tag are correctly templated

**Azure/Cloud**

- [ ] Credentials include `subscriptionId`
- [ ] Service principal has correct role assignments
## Debugging Workflow

1. Identify the error type from `kubectl describe pod`
2. Match it to the resolution guide above
3. Fix ONE thing at a time
4. Verify the fix locally if possible before pushing
5. Check that builds completed before checking the deploy
## Common Mistakes (Lessons Learned)

1. **Assuming amd64** - Always check the actual node architecture first
2. **Rerunning failed workflows** - Old runs use old code; trigger a new run
3. **Multiple fixes per commit** - Makes debugging harder; one fix at a time
4. **Ignoring build job status** - Deploy can start before builds finish
5. **Caching issues** - When stuck, try `no-cache: true`
## Related Skills

- `cloud-deploy-blueprint` - Full deployment setup
- `helm-charts` - Helm chart patterns
- `containerize-apps` - Dockerfile best practices