Learn-skills.dev kubernetes-specialist

Expert Kubernetes Specialist with deep expertise in container orchestration, cluster management, and cloud-native applications. Proficient in Kubernetes architecture, Helm charts, operators, and multi-cluster management across EKS, AKS, GKE, and on-premises deployments.

install
source · Clone the upstream repo
git clone https://github.com/NeverSight/learn-skills.dev
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/404kidwiz/claude-supercode-skills/kubernetes-specialist" ~/.claude/skills/neversight-learn-skills-dev-kubernetes-specialist && rm -rf "$T"
manifest: data/skills-md/404kidwiz/claude-supercode-skills/kubernetes-specialist/SKILL.md

Kubernetes Specialist

Purpose

Provides expert Kubernetes orchestration and cloud-native application expertise with deep knowledge of container orchestration, cluster management, and production-grade deployments. Specializes in Kubernetes architecture, Helm charts, operators, multi-cluster management, and GitOps workflows across EKS, AKS, GKE, and on-premises deployments.

When to Use

  • Designing Kubernetes cluster architecture for production workloads
  • Implementing Helm charts, operators, or GitOps workflows (ArgoCD, Flux)
  • Troubleshooting cluster issues (networking, storage, performance)
  • Planning Kubernetes upgrades or multi-cluster strategies
  • Optimizing resource utilization and cost in Kubernetes environments
  • Setting up service mesh (Istio, Linkerd) and observability
  • Implementing Kubernetes security and RBAC policies

Quick Start

Invoke this skill for any of the scenarios listed under When to Use.

Do NOT invoke when:

  • Simple Docker container needs (use docker commands directly)
  • Cloud infrastructure provisioning (use cloud-architect instead)
  • Application code debugging (use backend-developer/frontend-developer)
  • Database-specific issues (use database-administrator instead)

Decision Framework

Deployment Strategy Selection

├─ Zero downtime required?
│   ├─ Instant rollback needed → Blue-Green Deployment
│   │   Pros: Instant switch, easy rollback
│   │   Cons: 2x resources during deployment
│   │
│   ├─ Gradual rollout → Canary Deployment
│   │   Pros: Test with subset of traffic
│   │   Cons: Complex routing setup
│   │
│   └─ Simple updates → Rolling Update (default)
│       Pros: Built-in, no extra resources
│       Cons: Rollback takes time
│
├─ Stateful application?
│   ├─ Database → StatefulSet + PVC
│   │   Pros: Stable network IDs, ordered deployment
│   │   Cons: Complex scaling
│   │
│   └─ Stateless → Deployment
│       Pros: Easy scaling, self-healing
│
└─ Batch processing?
    ├─ One-time → Job
    ├─ Scheduled → CronJob
    └─ Parallel processing → Job with parallelism
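
Two branches of the tree above can be sketched as manifests: the default rolling update and a Job with parallelism. This is a minimal sketch, not a production template. The names `web-api` and `report-batch`, the image references, the port, and the `/healthz` path are all hypothetical placeholders.

```yaml
# Rolling update: replace pods one at a time without dropping capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api                # hypothetical name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1              # at most one pod above the desired count
      maxUnavailable: 0        # never go below the desired count mid-rollout
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: web-api
          image: registry.example.com/web-api:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:      # new pods receive traffic only once ready
            httpGet:
              path: /healthz   # assumed health endpoint
              port: 8080
---
# Parallel batch processing: 10 completions, at most 3 pods at a time.
apiVersion: batch/v1
kind: Job
metadata:
  name: report-batch           # hypothetical name
spec:
  completions: 10
  parallelism: 3
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/report-worker:1.0.0   # placeholder image
```

With `maxUnavailable: 0` the rollout only proceeds as new pods pass their readiness probe, which is what makes the default strategy zero-downtime. A rollback is a second rollout in reverse (`kubectl rollout undo deployment/web-api`), which is why it takes time compared with a blue-green switch.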

Resource Configuration Matrix

Workload Type   CPU Request    CPU Limit   Memory Request   Memory Limit
Web API         100m-500m      1000m       256Mi-512Mi      1Gi
Worker          500m-1000m     2000m       512Mi-1Gi        2Gi
Database        1000m-2000m    4000m       2Gi-4Gi          8Gi
Cache           100m-250m      500m        1Gi-4Gi          8Gi
Batch Job       500m-2000m     4000m       1Gi-4Gi          8Gi
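
As an illustration of how one row of the matrix maps onto a pod spec, the sketch below applies the Web API values. The request figures are picked from within the stated ranges (250m CPU, 256Mi memory). The pod name and image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-api-example        # hypothetical name
spec:
  containers:
    - name: web-api
      image: registry.example.com/web-api:1.2.3   # placeholder image
      resources:
        requests:
          cpu: 250m            # within the 100m-500m request range
          memory: 256Mi        # low end of the 256Mi-512Mi range
        limits:
          cpu: 1000m
          memory: 1Gi
```

Requests drive scheduling decisions, while limits cap usage at runtime: a container exceeding its memory limit is OOM-killed, and one exceeding its CPU limit is throttled.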

Node Pool Strategy

Use Case        Instance Type          Scaling      Cost
System pods     t3.large (3 nodes)     Fixed        Low
Applications    m5.xlarge              Auto 3-20    Medium
Batch/Spot      m5.large-2xlarge       Auto 0-50    Very Low
GPU workloads   p3.2xlarge             Manual       High
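
A node pool strategy only takes effect when workloads are steered to the right pool. The sketch below shows one way to pin a batch Job to the Batch/Spot pool. It assumes, beyond anything stated above, that the pool's nodes carry the label `node-pool=batch-spot` and the taint `workload=batch:NoSchedule`. Both are hypothetical conventions, and the real values depend on how the pools were provisioned.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-batch          # hypothetical name
spec:
  template:
    spec:
      restartPolicy: OnFailure # spot nodes can disappear; let the pod retry
      nodeSelector:
        node-pool: batch-spot  # assumed node label on the spot pool
      tolerations:
        - key: workload        # assumed taint on the spot pool
          operator: Equal
          value: batch
          effect: NoSchedule
      containers:
        - name: batch
          image: registry.example.com/batch:1.0.0   # placeholder image
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 4000m
              memory: 8Gi
```

The taint keeps ordinary application pods off the cheaper but interruptible nodes, and the toleration plus node selector opts this Job in explicitly.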

Red Flags → Escalate

STOP and escalate if:

  • Cluster upgrade with breaking API changes (removed or deprecated API versions)
  • Multi-region active-active requirements
  • Compliance requirements (PCI-DSS, HIPAA) need validation
  • Custom scheduler or controller development needed
  • etcd corruption or cluster state issues

Quality Checklist

Cluster Configuration

  • Multi-AZ deployment (nodes spread across availability zones)
  • Node autoscaling configured (Cluster Autoscaler or Karpenter)
  • System node pool with taints (separate critical addons from apps)
  • Encryption enabled (secrets at rest with KMS)
  • Audit logging enabled (API server logs)
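
Spreading nodes across availability zones helps only if pods are spread as well. A minimal sketch using topology spread constraints on the well-known `topology.kubernetes.io/zone` node label follows. The workload name and image are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api                # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                          # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule    # use ScheduleAnyway for a soft rule
          labelSelector:
            matchLabels:
              app: web-api
      containers:
        - name: web-api
          image: registry.example.com/web-api:1.2.3   # placeholder image
```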

Security

  • Pod Security Standards enforced (restricted or baseline)
  • Network policies configured (default deny + explicit allow)
  • RBAC configured (least privilege for all service accounts)
  • Image scanning enabled (scan for vulnerabilities)
  • Private container registry configured
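
Several of the security items above can be expressed declaratively. The sketch below covers Pod Security Standards enforcement through namespace labels, a default-deny network policy with one explicit allow rule, and a least-privilege Role bound to a single service account. The namespace `payments`, the `app` labels, the service account `api`, and port 8080 are hypothetical.

```yaml
# Enforce the "restricted" Pod Security Standard for the namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: payments               # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Default deny: selects every pod and allows no ingress or egress.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Explicit allow: frontend pods may reach api pods on TCP 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
---
# Least privilege: read-only access to ConfigMaps in one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-configmap-reader
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: api                  # hypothetical service account
    namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: configmap-reader
```

Two caveats apply. Network policies are enforced only when the cluster's CNI plugin supports them. A default deny on egress also blocks DNS lookups, so a real setup would typically add an allow rule for the cluster DNS service.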

Resource Management

  • All pods have resource requests and limits
  • HorizontalPodAutoscalers configured for scalable workloads
  • PodDisruptionBudgets defined (limit how many pods can be down at once during voluntary disruptions)
  • ResourceQuotas set per namespace
  • LimitRanges defined (default limits for pods)
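
The autoscaling and quota items above might look like the following. All numbers are illustrative rather than recommendations, and the namespace `payments` and target Deployment `web-api` are hypothetical.

```yaml
# Scale web-api between 2 and 10 replicas, targeting 70% average CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percentage of the pods' CPU requests
---
# Cap the total resources the namespace may claim.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
# Defaults applied to containers that omit requests or limits.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: payments
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 512Mi
```

CPU utilization in the autoscaler is measured against the pods' CPU requests, so it depends on requests being set and on the metrics server being available.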

High Availability

  • Deployments have ≥2 replicas
  • Anti-affinity rules prevent pod co-location
  • Readiness and liveness probes configured
  • PodDisruptionBudgets leave room for node drains and rolling updates (at least one pod can always be evicted)
  • Multi-region cluster (if global scale required)
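
A sketch combining several of these items follows: three replicas, a soft anti-affinity rule that keeps replicas on different nodes, readiness and liveness probes, and a disruption budget that still lets one pod be evicted. The name `api`, the image, the port, and the probe paths are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                    # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: prefer not to place two api pods on the same node.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: api
                topologyKey: kubernetes.io/hostname
      containers:
        - name: api
          image: registry.example.com/api:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:      # gates traffic to the pod
            httpGet:
              path: /ready     # assumed endpoint
              port: 8080
            periodSeconds: 5
          livenessProbe:       # restarts the container when it hangs
            httpGet:
              path: /healthz   # assumed endpoint
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  minAvailable: 2              # with 3 replicas, one pod may be evicted at a time
  selector:
    matchLabels:
      app: api
```

Setting `minAvailable` equal to the replica count would block node drains entirely, so the budget needs headroom below the replica count.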

Observability

  • Metrics server installed (kubectl top works)
  • Prometheus monitoring application metrics
  • Centralized logging (CloudWatch, Elasticsearch, Loki)
  • Distributed tracing (Jaeger, Tempo)
  • Dashboards for cluster and application health
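
If Prometheus is run through the Prometheus Operator (an assumption; the checklist does not say how Prometheus is deployed), application metrics are usually registered with a ServiceMonitor. The sketch below assumes a Service labelled `app: api` that exposes a named port `metrics`, and a Prometheus instance configured to select monitors labelled `release: prometheus`. All of these names are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  namespace: payments          # hypothetical namespace
  labels:
    release: prometheus        # assumed label matched by the Prometheus selector
spec:
  selector:
    matchLabels:
      app: api                 # Services to scrape
  endpoints:
    - port: metrics            # named port on the Service
      path: /metrics
      interval: 30s
```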

Disaster Recovery

  • Velero installed for cluster backups
  • Backup schedule configured (daily minimum)
  • Restore tested (annual drill)
  • etcd backups automated (handled by the provider on cloud-managed clusters; scheduled snapshots on self-managed clusters)
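
The daily backup schedule can be declared as a Velero Schedule resource. This is a minimal sketch, assuming Velero is installed in the `velero` namespace with a backup storage location already configured. The schedule name, the 02:00 run time, and the 30-day retention are illustrative.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup   # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"        # cron syntax: every day at 02:00
  template:
    includedNamespaces:
      - "*"                    # back up all namespaces
    ttl: 720h0m0s              # keep each backup for 30 days
```

A backup is only as good as its restore, so the drill in the checklist would exercise `velero restore create --from-backup <backup-name>` against a scratch cluster or namespace.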

Additional Resources