# aks-deployment-troubleshooter

Diagnose and fix Kubernetes deployment failures, especially `ImagePullBackOff`, `CrashLoopBackOff`, and architecture mismatches. Battle-tested in a 4-hour AKS debugging session that resolved 10+ failure modes.

Install:

```shell
git clone https://github.com/majiayu000/claude-skill-registry
# or copy just this skill into ~/.claude/skills:
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/aks-deployment-troubleshooter" ~/.claude/skills/majiayu000-claude-skill-registry-aks-deployment-troubleshooter && rm -rf "$T"
```

Source: `skills/data/aks-deployment-troubleshooter/SKILL.md`

# AKS Deployment Troubleshooter
## Overview

This skill captures systematic approaches to debugging Kubernetes deployments, with a specific focus on container image issues. It is based on a real debugging session that resolved 10+ distinct failure modes.
## When to Use

- Pods stuck in `ImagePullBackOff`
- Pods in `CrashLoopBackOff` with `exec format error`
- "no match for platform in manifest" errors
- Image registry authentication issues
- Helm deployment timeouts
## Quick Diagnosis Flow

```
Pod not running?
│
├─► ImagePullBackOff
│   │
│   ├─► "not found" ──► Wrong tag or registry path
│   ├─► "unauthorized" ──► Auth/imagePullSecrets issue
│   └─► "no match for platform" ──► Architecture mismatch
│
├─► CrashLoopBackOff
│   │
│   ├─► "exec format error" ──► Wrong CPU architecture
│   ├─► Exit code 1 ──► App startup failure (check logs)
│   └─► OOMKilled ──► Memory limits too low
│
└─► Pending
    │
    ├─► Insufficient CPU/memory ──► Scale cluster or reduce requests
    └─► No matching node ──► Check nodeSelector/tolerations
```
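The triage tree above can be sketched as a small lookup. This is an illustrative helper (not part of the skill's tooling): the pod state plus a substring of the event/log message maps to a likely diagnosis.

```python
def triage(state: str, message: str) -> str:
    """Map a pod state plus its event/describe message to a likely diagnosis.

    Matching is substring-based, since these reasons appear inside
    `kubectl describe pod` and `kubectl get events` output.
    """
    rules = {
        "ImagePullBackOff": [
            ("not found", "Wrong tag or registry path"),
            ("unauthorized", "Auth/imagePullSecrets issue"),
            ("no match for platform", "Architecture mismatch"),
        ],
        "CrashLoopBackOff": [
            ("exec format error", "Wrong CPU architecture"),
            ("OOMKilled", "Memory limits too low"),
        ],
        "Pending": [
            ("Insufficient", "Scale cluster or reduce requests"),
            ("didn't match", "Check nodeSelector/tolerations"),
        ],
    }
    for needle, diagnosis in rules.get(state, []):
        if needle in message:
            return diagnosis
    # e.g. CrashLoopBackOff with exit code 1: app startup failure
    return "Check logs: kubectl logs <pod> --tail=50"
```

For example, `triage("ImagePullBackOff", 'pull image "x": not found')` points straight at a tag or registry-path problem.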
## Diagnostic Commands

### Step 1: Get Pod Status

```shell
kubectl get pods -n <namespace>
```

### Step 2: Describe Failing Pod

```shell
kubectl describe pod <pod-name> -n <namespace> | grep -E "(Image:|Failed|Error|pull)"
```

### Step 3: Check Events

```shell
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```

### Step 4: Check Logs (for CrashLoopBackOff)

```shell
kubectl logs <pod-name> -n <namespace> --tail=50
```

### Step 5: Check Node Architecture

```shell
kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
```
## Error Resolution Guide

### 1. ImagePullBackOff: "not found"

Error:

```
Failed to pull image "ghcr.io/owner/repo/app:abc123": not found
```

Causes & Solutions:

| Cause | Solution |
|---|---|
| Tag doesn't exist | Verify the image was pushed with the exact tag |
| Short vs full SHA | Align metadata-action with deploy (use `type=raw,value=${{ github.sha }}`) |
| Builds skipped | Trigger manually or remove path filters |
| Wrong registry | Check the image repository in Helm values |

Diagnostic:

```shell
# Check what tags exist (requires gh cli and package visibility)
gh api /users/<owner>/packages/container/<package>/versions
```
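To scan the `gh api` output for a specific tag, you can filter its JSON. A minimal sketch, assuming the response shape of the GitHub Packages versions endpoint (a list of versions, each carrying its tags under `metadata.container.tags`); the `sample` fixture below is made up:

```python
import json

def tag_exists(versions_json: str, tag: str) -> bool:
    """Return True if any package version carries the given tag."""
    versions = json.loads(versions_json)
    return any(
        tag in v.get("metadata", {}).get("container", {}).get("tags", [])
        for v in versions
    )

# Fixture mimicking the API response; in practice pipe `gh api ...` output in
sample = json.dumps([
    {"metadata": {"container": {"tags": ["latest", "abc1234"]}}},
])
```

Note how `tag_exists(sample, "abc123")` is False even though `abc1234` exists: an exact-match check like this catches the short-vs-full SHA mismatch in the table above.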
### 2. ImagePullBackOff: "unauthorized"

Error:

```
failed to authorize: failed to fetch anonymous token: 401 Unauthorized
```

Causes & Solutions:

| Cause | Solution |
|---|---|
| Package is private | Make the package public in GHCR settings |
| Missing imagePullSecrets | Create a docker-registry secret |
| Wrong credentials | Regenerate and update the secret |

Create imagePullSecrets:

```shell
kubectl create secret docker-registry ghcr-secret \
  --docker-server=ghcr.io \
  --docker-username=<github-username> \
  --docker-password=<github-token> \
  --namespace=<namespace>
```

Link the secret in the deployment:

```yaml
spec:
  imagePullSecrets:
    - name: ghcr-secret
```
### 3. ImagePullBackOff: "no match for platform in manifest"

Error:

```
no match for platform in manifest: not found
```

Root Cause: Image built for the wrong CPU architecture, OR a buildx provenance issue.

Step 1: Check cluster architecture:

```shell
kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
# Output: amd64 amd64 OR arm64 arm64
```

Step 2: Match build platform:

```yaml
# In GitHub Actions docker/build-push-action
- uses: docker/build-push-action@v5
  with:
    platforms: linux/arm64   # or linux/amd64
    provenance: false        # CRITICAL: Disable attestation manifests
    no-cache: true           # Force fresh build
```
Why `provenance: false`? Buildx creates multi-arch manifest lists with attestations. Some container runtimes can't find the actual image among the extra entries in these complex manifests; disabling provenance produces a simple single-platform image.
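The failure mode can be illustrated by selecting from a manifest list the way a runtime does: match on `platform.os`/`platform.architecture`, and note that attestation entries carry `unknown/unknown`. The `index` fixture below is hand-written sample data, not a real registry response:

```python
def pick_manifest(manifest_list: dict, os_arch: str):
    """Return the digest whose platform matches os/arch, or None.

    Attestation entries (platform unknown/unknown) never match a node,
    which is how a runtime ends up with 'no match for platform'.
    """
    want_os, want_arch = os_arch.split("/")
    for m in manifest_list["manifests"]:
        platform = m.get("platform", {})
        if platform.get("os") == want_os and platform.get("architecture") == want_arch:
            return m["digest"]
    return None  # -> "no match for platform in manifest"

index = {
    "manifests": [
        {"digest": "sha256:aaa",
         "platform": {"os": "linux", "architecture": "amd64"}},
        # extra entry added when provenance/attestations are enabled
        {"digest": "sha256:bbb",
         "platform": {"os": "unknown", "architecture": "unknown"}},
    ]
}
```

With this index, an amd64 node resolves `sha256:aaa`, while an arm64 node finds nothing at all, since only amd64 was built.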
### 4. CrashLoopBackOff: "exec format error"

Error:

```
exec /usr/local/bin/docker-entrypoint.sh: exec format error
```

Root Cause: Binary architecture doesn't match the node architecture.

Example: Built a `linux/amd64` image, deployed to arm64 nodes.
Solution:

1. Check node architecture:

   ```shell
   kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
   ```

2. Update the build platform to match
3. Rebuild WITHOUT cache (cached layers may have the wrong arch)

```yaml
platforms: linux/arm64   # Match your cluster!
no-cache: true           # Force complete rebuild
provenance: false        # Simple manifest
```
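As a local sanity check, the architecture a binary was compiled for is recorded in its ELF header: `e_machine`, a 16-bit field at offset 18 (little-endian for amd64/arm64). A minimal sketch using synthetic header bytes; in practice you would feed it the first 20 bytes of the container's entrypoint binary:

```python
import struct

# ELF e_machine values for the two architectures this guide cares about
E_MACHINE = {0x3E: "amd64", 0xB7: "arm64"}

def elf_arch(header: bytes) -> str:
    """Read e_machine (u16 at offset 18) from an ELF header."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF binary")
    (machine,) = struct.unpack_from("<H", header, 18)
    return E_MACHINE.get(machine, f"unknown({machine:#x})")

# Synthetic 20-byte headers: magic + 14 padding bytes + e_machine
amd64_hdr = b"\x7fELF" + b"\x00" * 14 + struct.pack("<H", 0x3E)
arm64_hdr = b"\x7fELF" + b"\x00" * 14 + struct.pack("<H", 0xB7)
```

If `elf_arch` disagrees with the node architecture from step 1, you have found the cause of the `exec format error` before redeploying anything.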
### 5. Helm --set Comma Parsing Error

Error:

```
failed parsing --set data: key "com" has no value (cannot end with ,)
```

Root Cause: Helm interprets commas as list separators in `--set`.

Wrong:

```shell
--set "origins=https://a.com,https://b.com"
```
Solution: Use a heredoc values file:

```yaml
# In GitHub Actions
- name: Deploy
  run: |
    cat > /tmp/overrides.yaml << EOF
    sso:
      env:
        ALLOWED_ORIGINS: "https://a.com,https://b.com"
    EOF
    helm upgrade --install app ./chart \
      --values /tmp/overrides.yaml
```
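Why the comma breaks things can be seen by replicating the split `--set` performs. This toy parser is NOT Helm's real one, just a sketch of the behavior: commas separate assignments, so the second origin becomes a dangling key with no value.

```python
def parse_set(expr: str) -> dict:
    """Toy sketch of --set parsing: commas separate assignments,
    '=' separates key from value, and a backslash-escaped comma
    stays inside the value."""
    out = {}
    protected = expr.replace("\\,", "\x00")  # shield escaped commas
    for part in protected.split(","):
        part = part.replace("\x00", ",")     # restore them in the value
        if "=" not in part:
            raise ValueError(f'key "{part}" has no value')
        key, value = part.split("=", 1)
        out[key] = value
    return out
```

Helm's documented workaround is backslash-escaping (`--set "origins=https://a.com\,https://b.com"`), but escaping gets brittle inside CI quoting, which is why the values-file approach above is the safer fix.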
### 6. Azure Login "No subscriptions found"

Error:

```
Error: No subscriptions found for ***
```

Root Cause: Missing `subscriptionId` in `AZURE_CREDENTIALS`.

Solution: Use the `--sdk-auth` format:

```shell
az ad sp create-for-rbac \
  --name "github-actions" \
  --role contributor \
  --scopes /subscriptions/<subscription-id>/resourceGroups/<rg-name> \
  --sdk-auth
```
Required JSON structure (`subscriptionId` MUST be present):

```json
{
  "clientId": "xxx",
  "clientSecret": "xxx",
  "subscriptionId": "xxx",
  "tenantId": "xxx"
}
```
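A quick pre-flight check before wiring the secret into GitHub is to diff the JSON's keys against the required set. A minimal sketch (key names from the structure above):

```python
import json

REQUIRED = {"clientId", "clientSecret", "subscriptionId", "tenantId"}

def missing_keys(creds_json: str) -> set:
    """Return which required AZURE_CREDENTIALS keys are absent."""
    return REQUIRED - json.loads(creds_json).keys()
```

Running it on credentials that were created without `--sdk-auth` typically reports `{"subscriptionId"}`, which is exactly the field the login action is complaining about.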
### 7. GHCR 403 Forbidden

Error:

```
403 Forbidden: permission_denied: write_package
```

Solutions:

- Make the package public: GHCR → Package Settings → Change visibility
- Link the package to the repository: Package Settings → Connect Repository
- Ensure the workflow has the `packages: write` permission:

```yaml
permissions:
  contents: read
  packages: write
```
## Docker Build Best Practices for K8s

### Buildx Configuration for Reliability

```yaml
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v3

- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    platforms: linux/arm64   # Match cluster architecture!
    provenance: false        # Avoid manifest complexity
    no-cache: true           # For debugging; remove in production
    tags: |
      ghcr.io/owner/repo:${{ github.sha }}
      ghcr.io/owner/repo:latest
```
### Image Tag Strategy

Problem: Short SHA vs full SHA mismatch.

```yaml
# docker/metadata-action default: short SHA (7 chars)
type=sha,prefix=      # Creates: ghcr.io/repo:abc1234

# github.sha is the full SHA (40 chars)
${{ github.sha }}     # Is: abc1234567890abcdef...
```
Solution: Use the explicit full SHA:

```yaml
tags: |
  type=raw,value=${{ github.sha }}
  type=raw,value=latest,enable={{is_default_branch}}
```
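Spelled out concretely (the SHA below is made up), the mismatch is simply a truncation:

```python
# github.sha is the full 40-char commit SHA
full_sha = "abc1234567890abcdef1234567890abcdef12345"

# metadata-action's `type=sha,prefix=` pushes only the first 7 chars
pushed_tag = full_sha[:7]

# a deploy step templating ${{ github.sha }} asks for the full SHA,
# a tag that was never pushed -> ImagePullBackOff: "not found"
requested_tag = full_sha
```

Using `type=raw,value=${{ github.sha }}` makes `pushed_tag` and `requested_tag` identical, which is the whole fix.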
## Pre-Deployment Checklist

**Architecture**

- [ ] Checked cluster node architecture (`kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'`)
- [ ] Build platform matches cluster (arm64 vs amd64)

**Docker Build**

- [ ] `provenance: false` set
- [ ] `platforms: linux/<arch>` matches cluster
- [ ] Image tags are consistent between build and deploy

**Registry**

- [ ] Packages are public OR imagePullSecrets configured
- [ ] Workflow has `packages: write` permission

**Helm**

- [ ] No commas in `--set` values (use a values file instead)
- [ ] Image repository and tag are correctly templated

**Azure/Cloud**

- [ ] Credentials include `subscriptionId`
- [ ] Service principal has correct role assignments
## Debugging Workflow

1. Identify the error type from `kubectl describe pod`
2. Match it to the resolution guide above
3. Fix ONE thing at a time
4. Verify the fix locally if possible before pushing
5. Check that builds completed before checking the deploy
## Common Mistakes (Lessons Learned)

1. **Assuming amd64** - Always check the actual node architecture first
2. **Rerunning failed workflows** - Old runs use old code; trigger a new run
3. **Multiple fixes per commit** - Makes debugging harder; one fix at a time
4. **Ignoring build job status** - Deploy can start before builds finish
5. **Caching issues** - When stuck, try `no-cache: true`
## Related Skills

- `cloud-deploy-blueprint` - Full deployment setup
- `helm-charts` - Helm chart patterns
- `containerize-apps` - Dockerfile best practices