# claude-skill-registry · kubernetes-operations

## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/kubernetes-operations-kcns008-cluster-skills" ~/.claude/skills/majiayu000-claude-skill-registry-kubernetes-operations && rm -rf "$T"
```

Manifest: `skills/data/kubernetes-operations-kcns008-cluster-skills/SKILL.md`
# Kubernetes / OpenShift Cluster Operations

Day-2 operations, maintenance, and lifecycle management for production clusters.

## Current Versions & Documentation (January 2026)
| Platform | Current Version | Upgrade Path | Documentation |
|---|---|---|---|
| Kubernetes | 1.31.x | 1.30 → 1.31 | https://kubernetes.io/docs/tasks/administer-cluster/ |
| OpenShift | 4.17.x | 4.16 → 4.17 | https://docs.openshift.com/container-platform/4.17/ |
| EKS | 1.31 | Rolling updates | https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html |
| AKS | 1.31 | Blue-green or rolling | https://learn.microsoft.com/azure/aks/upgrade-cluster |
| GKE | 1.31 | Surge upgrades | https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster |
## Key Tools & Versions
| Tool | Version | Install | Purpose |
|---|---|---|---|
| kubeadm | 1.31.x | Package manager | Cluster bootstrap |
| Velero | 1.15.x | Helm/CLI | Backup & restore |
| kube-prometheus-stack | v67.x | Helm | Monitoring |
| VPA | 1.3.x | kubectl apply | Vertical scaling |
| Cluster Autoscaler | 1.31.x | Helm | Node autoscaling |
| Karpenter | 1.1.x | Helm | AWS node provisioning |
## Command Usage Convention

IMPORTANT: This skill uses `kubectl` as the primary command. When working with:

- OpenShift/ARO clusters: replace `kubectl` with `oc`
- Standard Kubernetes (AKS, EKS, GKE): use `kubectl` as shown
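For example, listing nodes is the same operation on both platforms:

```bash
kubectl get nodes -o wide   # AKS, EKS, GKE, or upstream Kubernetes
oc get nodes -o wide        # OpenShift / ARO
```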
## Node Operations

### Node Lifecycle

```bash
# View node status
kubectl get nodes -o wide

# Detailed node info
kubectl describe node ${NODE_NAME}

# Check node resources
kubectl top nodes

# Node labels and taints
kubectl get nodes --show-labels
kubectl describe node ${NODE_NAME} | grep -A 5 Taints
```
### Drain and Cordon

```bash
# Cordon: Mark node unschedulable (no new pods)
kubectl cordon ${NODE_NAME}

# Drain: Evict pods safely
kubectl drain ${NODE_NAME} \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=300s

# Force drain (use with caution)
kubectl drain ${NODE_NAME} \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=30

# Uncordon: Allow scheduling again
kubectl uncordon ${NODE_NAME}
```
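`kubectl drain` honours PodDisruptionBudgets, so a PDB that permits zero disruptions will stall the drain until it times out. A minimal PDB sketch (names are placeholders) that still lets a drain proceed for a multi-replica app:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${APP_NAME}-pdb
  namespace: ${NAMESPACE}
spec:
  minAvailable: 1   # keep at least one replica running while nodes are drained
  selector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
```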
### Cluster Autoscaler Configuration

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
          command:
            - ./cluster-autoscaler
            - --v=4
            - --cloud-provider=${CLOUD_PROVIDER}
            - --nodes=${MIN}:${MAX}:${NODE_GROUP}
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --scale-down-utilization-threshold=0.5
            - --skip-nodes-with-local-storage=false
            - --skip-nodes-with-system-pods=true
            - --balance-similar-node-groups=true
```
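The tools table lists Helm as the install method; a hedged sketch using the community chart follows (chart values vary by chart version and cloud provider, and the AWS auto-discovery values shown here are an assumption to verify against the chart's documentation):

```bash
# Hedged sketch: install the Cluster Autoscaler from the community Helm chart (AWS example)
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set cloudProvider=aws \
  --set awsRegion=${REGION} \
  --set autoDiscovery.clusterName=${CLUSTER_NAME}
```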
## Backup and Recovery

### etcd Backup

```bash
# Backup etcd (run on a control plane node)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

# Verify backup
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table
```
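The disaster-recovery checklist below starts from an etcd restore; a minimal restore sketch, assuming a kubeadm control plane with default paths (stop the kube-apiserver static pod before restoring):

```bash
# Restore the snapshot into a fresh data directory (kubeadm defaults assumed)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir /var/lib/etcd-restored

# Point the etcd static pod at the restored directory
# (edit the hostPath in /etc/kubernetes/manifests/etcd.yaml), then verify:
kubectl get nodes
```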
### Velero Backup (v1.15.x)

```bash
# Install Velero CLI
brew install velero

# Install Velero server with AWS provider
velero install \
  --provider aws \
  --bucket ${BUCKET_NAME} \
  --secret-file ./credentials-velero \
  --backup-location-config region=${REGION} \
  --snapshot-location-config region=${REGION} \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --use-node-agent

# Create backup
velero backup create ${BACKUP_NAME} \
  --include-namespaces ${NAMESPACES} \
  --ttl 720h \
  --default-volumes-to-fs-backup

# Create scheduled backup
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces ${NAMESPACES} \
  --ttl 168h

# Restore from backup
velero restore create --from-backup ${BACKUP_NAME}
```
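Backups can be verified from the CLI before you need to rely on them:

```bash
# Confirm completion and inspect details/logs for the backup created above
velero backup get
velero backup describe ${BACKUP_NAME} --details
velero backup logs ${BACKUP_NAME}
```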
### Velero Backup Manifest

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: ${BACKUP_NAME}
  namespace: velero
spec:
  includedNamespaces:
    - ${NAMESPACE_1}
    - ${NAMESPACE_2}
  excludedResources:
    - events
    - events.events.k8s.io
  storageLocation: default
  volumeSnapshotLocations:
    - default
  ttl: 720h0m0s
  snapshotVolumes: true
  hooks:
    resources:
      - name: backup-hook
        includedNamespaces:
          - ${NAMESPACE}
        labelSelector:
          matchLabels:
            app: database
        pre:
          - exec:
              container: database
              command:
                - /bin/sh
                - -c
                - "pg_dump -U postgres > /backup/pre-backup.sql"
              onError: Fail
              timeout: 120s
```
## Cluster Upgrades

### Pre-Upgrade Checklist

```bash
#!/bin/bash
# pre-upgrade-check.sh

echo "=== Cluster Version ==="
kubectl version

echo -e "\n=== Node Status ==="
kubectl get nodes

echo -e "\n=== Pods Not Running ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

echo -e "\n=== PDBs That May Block Drain ==="
kubectl get pdb -A

echo -e "\n=== Pending PVCs ==="
kubectl get pvc -A --field-selector=status.phase=Pending

echo -e "\n=== Deprecated APIs in Use ==="
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
```
### AKS Upgrade (Azure)

```bash
# Check current version and available upgrades
az aks get-versions --location ${LOCATION} -o table
az aks get-upgrades --resource-group ${RG} --name ${CLUSTER} -o table

# Upgrade control plane and node pools
az aks upgrade --resource-group ${RG} --name ${CLUSTER} \
  --kubernetes-version 1.31.0

# Upgrade a node pool with max surge (extra nodes are added during the rollout)
az aks nodepool upgrade --resource-group ${RG} --cluster-name ${CLUSTER} \
  --name ${NODEPOOL} --kubernetes-version 1.31.0 \
  --max-surge 33%

# Enable auto-upgrade channel
az aks update --resource-group ${RG} --name ${CLUSTER} \
  --auto-upgrade-channel stable
```
### EKS Upgrade

```bash
# Update control plane
aws eks update-cluster-version \
  --name ${CLUSTER_NAME} \
  --kubernetes-version 1.31

# Wait for completion
aws eks wait cluster-active --name ${CLUSTER_NAME}

# Update EKS add-ons
for addon in vpc-cni coredns kube-proxy eks-pod-identity-agent; do
  aws eks update-addon --cluster-name ${CLUSTER_NAME} \
    --addon-name $addon \
    --resolve-conflicts PRESERVE
done

# Update managed node groups
aws eks update-nodegroup-version \
  --cluster-name ${CLUSTER_NAME} \
  --nodegroup-name ${NODEGROUP_NAME}
```
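Before updating add-ons, it can help to confirm which add-on versions are published for the target Kubernetes version. A hedged sketch (coredns shown as an example; the `--query` path is an assumption about the CLI output shape):

```bash
# List add-on versions published for Kubernetes 1.31 (coredns as the example add-on)
aws eks describe-addon-versions \
  --addon-name coredns \
  --kubernetes-version 1.31 \
  --query 'addons[].addonVersions[].addonVersion'
```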
### GKE Upgrade

```bash
# Check available versions
gcloud container get-server-config --region ${REGION}

# Upgrade control plane
gcloud container clusters upgrade ${CLUSTER} --region ${REGION} \
  --master --cluster-version 1.31

# Upgrade node pools
gcloud container clusters upgrade ${CLUSTER} --region ${REGION} \
  --node-pool ${POOL} \
  --cluster-version 1.31

# Enable release channel
gcloud container clusters update ${CLUSTER} --region ${REGION} \
  --release-channel regular
```
### OpenShift Upgrade

```bash
# Check available updates
oc adm upgrade

# View current version and channel
oc get clusterversion
oc get clusterversion version -o jsonpath='{.spec.channel}'

# Change channel
oc adm upgrade channel stable-4.17

# Start upgrade
oc adm upgrade --to-latest
# OR upgrade to a specific version
oc adm upgrade --to=4.17.5

# Monitor upgrade progress
watch -n 10 'oc get clusterversion && oc get clusteroperators'
```
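On OpenShift, worker node updates roll out through MachineConfigPools after the cluster operators settle, so it is worth watching those too. A small sketch (the grep filter simply hides rows that already read Available=True, Progressing=False, Degraded=False):

```bash
# Watch MachineConfigPool rollout and any cluster operators that are not yet healthy
oc get mcp
oc get clusteroperators | grep -vE 'True\s+False\s+False'
```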
## Resource Management

### Resource Quotas

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: ${NAMESPACE}
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    persistentvolumeclaims: "10"
    requests.storage: 100Gi
```
### Limit Ranges

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: ${NAMESPACE}
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: 50m
        memory: 64Mi
```
### Check Resource Usage

```bash
# Namespace resource usage vs quota
kubectl describe quota -n ${NAMESPACE}

# Pod resource usage
kubectl top pods -n ${NAMESPACE} --sort-by=memory
kubectl top pods -n ${NAMESPACE} --sort-by=cpu

# Node resource allocation
kubectl describe nodes | grep -A 5 "Allocated resources"
```
## Certificate Management

### Check Certificate Expiry

```bash
# kubeadm certificates
kubeadm certs check-expiration

# Manual check
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates

# Check all certs
for cert in /etc/kubernetes/pki/*.crt; do
  echo "=== $cert ==="
  openssl x509 -in $cert -noout -dates
done
```
### Rotate Certificates

```bash
# Renew all certificates (kubeadm)
kubeadm certs renew all

# Restart control plane components
crictl pods --name kube-apiserver -q | xargs crictl stopp
crictl pods --name kube-controller-manager -q | xargs crictl stopp
crictl pods --name kube-scheduler -q | xargs crictl stopp
```
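`kubeadm certs renew all` also regenerates the kubeconfig files under `/etc/kubernetes`, so any local kubeconfig copied from `admin.conf` should be refreshed afterwards. A small sketch, assuming the default kubeadm paths:

```bash
# Refresh the admin kubeconfig copied from the renewed admin.conf (default kubeadm path assumed)
sudo cp /etc/kubernetes/admin.conf ~/.kube/config
sudo chown $(id -u):$(id -g) ~/.kube/config

# Confirm the new expiry dates
kubeadm certs check-expiration
```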
## Monitoring Setup

### Prometheus Stack (kube-prometheus-stack v67.x)

```bash
# Add Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.replicas=2 \
  --set prometheus.prometheusSpec.resources.requests.memory=2Gi \
  --set alertmanager.alertmanagerSpec.replicas=3 \
  --set grafana.persistence.enabled=true

# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
```
### Custom ServiceMonitor

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${APP_NAME}
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
      - ${NAMESPACE}
  selector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
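Alerting rules are picked up the same way, via a label selector on the installed stack. A hedged example: the alert expression and threshold are illustrative only, and the `release: prometheus` label must match your installation.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${APP_NAME}-alerts
  namespace: monitoring
  labels:
    release: prometheus   # must match the ruleSelector of the installed kube-prometheus-stack
spec:
  groups:
    - name: ${APP_NAME}.rules
      rules:
        - alert: HighPodRestartRate
          # Illustrative expression: pods restarting more than 3 times in 15 minutes
          expr: increase(kube_pod_container_status_restarts_total{namespace="${NAMESPACE}"}[15m]) > 3
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pods in ${NAMESPACE} are restarting frequently"
```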
## Cost Optimization

### VerticalPodAutoscaler

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ${APP_NAME}-vpa
  namespace: ${NAMESPACE}
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${APP_NAME}
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
```
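Before letting the VPA evict pods, it can help to review the recommendations it has computed (or run it with `updateMode: "Off"` first to collect recommendations without evictions):

```bash
# Inspect current VPA recommendations for the target workload
kubectl get vpa -n ${NAMESPACE}
kubectl describe vpa ${APP_NAME}-vpa -n ${NAMESPACE}
```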
## Namespace Lifecycle

### Namespace Template

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ${NAMESPACE}
  labels:
    app.kubernetes.io/managed-by: cluster-skills
    environment: ${ENVIRONMENT}
    team: ${TEAM}
  annotations:
    owner: ${OWNER_EMAIL}
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: default-quota
  namespace: ${NAMESPACE}
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: ${NAMESPACE}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```
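The template uses `${VAR}` placeholders, so it must be rendered before applying. One lightweight option is `envsubst`; the file name and values below are illustrative only:

```bash
# Render the placeholders and apply (file name is hypothetical)
export NAMESPACE=team-a ENVIRONMENT=prod TEAM=platform OWNER_EMAIL=platform@example.com
envsubst < namespace-template.yaml | kubectl apply -f -
```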
## Disaster Recovery

### Full Cluster Recovery Checklist

- Restore etcd: see the etcd backup and restore commands in the Backup and Recovery section
- Verify Control Plane:

  ```bash
  kubectl get nodes
  kubectl get pods -n kube-system
  kubectl cluster-info
  ```

- Restore Workloads (Velero):

  ```bash
  velero restore create --from-backup ${BACKUP_NAME}
  ```

- Verify Application Health:

  ```bash
  kubectl get pods -A
  kubectl get svc -A
  ```

- Verify DNS and Networking:

  ```bash
  kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup kubernetes
  ```
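If workloads were restored with Velero, the restore object itself can also be checked for partial failures; `${RESTORE_NAME}` here is a placeholder for the name reported by `velero restore get`:

```bash
# Inspect restore status, warnings, and errors
velero restore get
velero restore describe ${RESTORE_NAME} --details
velero restore logs ${RESTORE_NAME}
```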