Claude-skill-registry Kubernetes Platform Engineering
Production Kubernetes platform patterns covering cluster architecture, security, GitOps, observability, autoscaling, and operational guardrails
Install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/kubernetes-platform" ~/.claude/skills/majiayu000-claude-skill-registry-kubernetes-platform-engineering && rm -rf "$T"
manifest:
skills/data/kubernetes-platform/SKILL.md
Kubernetes Platform Engineering
Overview
Platform engineering on Kubernetes is about making the “golden path” easy: secure-by-default workloads, consistent delivery via GitOps, and predictable operations (capacity, upgrades, incident response).
Why This Matters
- Reliability: self-healing + controlled rollouts reduce incidents
- Security: least privilege and isolation by default
- Developer velocity: standard templates + paved roads
- Cost control: right-sizing and autoscaling without surprises
Core Concepts
1. Cluster Architecture
- Separate node pools by workload class (stateless, stateful, GPU, batch, system).
- Define upgrade strategy (surge capacity, maintenance windows, version skew policy).
- Decide multi-cluster approach (per env/region/tenant) and shared services (ingress, monitoring, secrets).
2. Workload Patterns
- Deployment for stateless services; use `PodDisruptionBudget`, `readinessProbe`, and `topologySpreadConstraints`.
- StatefulSet for stable identity/storage (databases, queues) with explicit backup/restore.
- Job/CronJob for batch; enforce concurrency policies and deadlines (see the CronJob sketch after this list).
- DaemonSet for node-level agents (logging, CNI, security).
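The Job/CronJob guardrails above (concurrency, deadlines, retries) are the ones most often left at defaults; a minimal sketch, with an illustrative name, schedule, and image:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: report-generator                # illustrative name
spec:
  schedule: "0 2 * * *"                 # nightly at 02:00
  concurrencyPolicy: Forbid             # skip a run if the previous one is still active
  startingDeadlineSeconds: 300          # give up if the run cannot start within 5 minutes
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600       # kill runs that exceed 1 hour
      backoffLimit: 2                   # at most 2 retries
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: your-registry/report:1.0.0   # illustrative image
              resources:
                requests: { cpu: "100m", memory: "128Mi" }
                limits: { cpu: "500m", memory: "256Mi" }
```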
3. Networking
- Use `Ingress`/Gateway with TLS termination and standardized auth/rate limiting at the edge.
- Apply NetworkPolicies (default-deny + explicit allow) for zero-trust inside the cluster (see the sketch after this list).
- Consider service mesh only when you need it (mTLS, traffic shifting, retries, L7 telemetry); keep complexity intentional.
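A minimal sketch of the default-deny pattern referenced above: deny all traffic in the namespace, then explicitly allow the ingress controller to reach the app (the `ingress-nginx` namespace label and port are assumptions; match your own controller and service):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}                        # every pod in the namespace
  policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-app
spec:
  podSelector:
    matchLabels: { app: app }
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # assumed controller namespace
      ports:
        - protocol: TCP
          port: 3000
```

Remember that default-deny egress also blocks DNS, so most namespaces need a companion allow rule for kube-dns.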
4. Storage
- Standardize `StorageClass` per tier (fast SSD, standard, archival) and backup policies (see the sketch after this list).
- For stateful apps: set anti-affinity/zone spread; verify recovery time objectives (RTO/RPO).
- Avoid “pet” PVs by having a documented restore path and regular restore drills.
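A sketch of tiered StorageClasses as referenced above; the provisioner and volume types assume the AWS EBS CSI driver, so substitute your cloud's driver and parameters:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com               # assumption: AWS EBS CSI driver
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer    # zone-aware: bind once the pod is scheduled
allowVolumeExpansion: true
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: ebs.csi.aws.com
parameters:
  type: st1                                # throughput-optimized HDD tier
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
reclaimPolicy: Retain                      # keep volumes around for restore review
```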
5. Security Baseline
- Enforce Pod Security (restricted baseline): non-root, read-only root filesystem, drop capabilities (see the sketch after this list).
- RBAC least privilege: namespace-scoped roles, separate human vs automation identities.
- Secrets: prefer external secret managers (e.g., External Secrets + AWS/GCP/Azure secret store); avoid long-lived static secrets in Git.
- Admission policies: Kyverno / Gatekeeper for guardrails (image registries, resource limits, required labels).
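A sketch of the restricted baseline: Pod Security Admission labels on the namespace plus a pod spec that passes the profile (namespace and image are illustrative; Kyverno/Gatekeeper would layer the registry/label/limit guardrails on top):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments                                   # illustrative namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
---
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: payments
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile: { type: RuntimeDefault }
  containers:
    - name: app
      image: your-registry/app:1.0.0               # illustrative image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities: { drop: ["ALL"] }
      volumeMounts:
        - { name: tmp, mountPath: /tmp }           # writable scratch; root fs is read-only
  volumes:
    - { name: tmp, emptyDir: {} }
```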
6. Observability & Ops
- Metrics: Prometheus + dashboards per service (latency, errors, saturation, restarts).
- Logs: structured JSON with trace IDs; centralized retention policies.
- Traces: OpenTelemetry for request graphs across services.
- SLOs: define error budgets and link alerts to user impact.
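A sketch of an alert tied to an SLO rather than raw infrastructure noise; it assumes the Prometheus Operator's PrometheusRule CRD and an illustrative `http_requests_total` metric:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-slo-alerts
  labels:
    release: prometheus                  # assumption: matches your operator's rule selector
spec:
  groups:
    - name: app.slo
      rules:
        - alert: AppHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="app", code=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{job="app"}[5m])) > 0.01
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "app error rate above 1% for 10m; error budget is burning"
```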
7. GitOps & Delivery
- Git is source of truth: Argo CD / Flux reconciles desired state (see the Application sketch after this list).
- Use Helm/Kustomize overlays per environment; keep secrets out-of-band.
- Progressive delivery: canary/blue-green via Argo Rollouts / Flagger (or mesh-based traffic shifting).
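A sketch of the GitOps wiring as an Argo CD Application; the repo URL and Kustomize overlay path are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/platform-config   # assumed config repo
    targetRevision: main
    path: apps/app/overlays/prod                            # assumed Kustomize overlay per env
  destination:
    server: https://kubernetes.default.svc
    namespace: app
  syncPolicy:
    automated:
      prune: true       # remove resources deleted from Git
      selfHeal: true    # revert manual drift back to the Git state
    syncOptions:
      - CreateNamespace=true
```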
8. Cost & Capacity Management
- Require resource requests/limits; use VPA recommendations carefully (watch for restarts).
- Use HPA/KEDA for demand-based scaling; use cluster autoscaler/Karpenter for nodes.
- Separate spot/preemptible pools for tolerant workloads; enforce pod placement policies.
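A sketch of demand-based scaling with a capped HPA (`autoscaling/v2`); the utilization target and replica cap are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 10                          # cap to protect downstream dependencies
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300      # avoid flapping on short dips
```

Requests-per-second or queue-depth scaling would use a custom/external metric (or KEDA) instead of the CPU resource metric.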
Quick Start (Minimal “Production-ish” Deployment)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate: { maxUnavailable: 0, maxSurge: 1 }
  selector: { matchLabels: { app: app } }
  template:
    metadata: { labels: { app: app } }
    spec:
      securityContext: { runAsNonRoot: true }
      containers:
        - name: app
          image: your-registry/app:1.0.0
          ports: [{ containerPort: 3000 }]
          resources:
            requests: { cpu: "100m", memory: "256Mi" }
            limits: { cpu: "500m", memory: "512Mi" }
          readinessProbe:
            httpGet: { path: /readyz, port: 3000 }
            initialDelaySeconds: 5
          livenessProbe:
            httpGet: { path: /healthz, port: 3000 }
            initialDelaySeconds: 10
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app
spec:
  minAvailable: 1
  selector: { matchLabels: { app: app } }
```
Production Checklist
- Requests/limits set; namespaces have `LimitRange`/`ResourceQuota`
- `readinessProbe` + `livenessProbe` + graceful shutdown configured
- PDB + topology spread/anti-affinity to survive node/zone events
- NetworkPolicies enforce default-deny and least access
- RBAC least privilege; service accounts scoped per app
- Metrics/logs/traces exported; alerts tied to SLOs
- Backup/restore drills for stateful components
- Upgrade playbook exists (cluster + addons + workload compatibility)
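A sketch of the quota/limit checklist item above, with illustrative namespace and sizing:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments                      # illustrative namespace
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: payments
spec:
  limits:
    - type: Container
      defaultRequest: { cpu: 100m, memory: 128Mi }   # used when a container omits requests
      default: { cpu: 500m, memory: 512Mi }          # used when a container omits limits
```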
Tools & Libraries
| Tool | Purpose |
|---|---|
| Argo CD / Flux | GitOps continuous delivery |
| Helm / Kustomize | Packaging and environment overlays |
| Prometheus + Grafana | Metrics + dashboards |
| Loki/ELK | Central logging |
| OpenTelemetry | Distributed tracing |
| External Secrets | Secret manager integration |
| Kyverno / Gatekeeper | Policy enforcement |
Anti-patterns
- No requests/limits: unpredictable scheduling and noisy-neighbor incidents
- Hand-applied changes: `kubectl apply` drift instead of GitOps reconciliation
- Flat network: no NetworkPolicies; lateral movement is trivial
- Single replica: planned/unplanned disruption becomes downtime
Real-World Examples
Example: Progressive Delivery (Canary)
- Deploy new version at 5% traffic; watch SLOs; ramp to 25/50/100%; auto-rollback on regressions.
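A sketch of that canary flow with Argo Rollouts (assumes the Rollout CRD is installed; fully automated rollback on SLO regressions would additionally reference an AnalysisTemplate, not shown):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  replicas: 4
  selector: { matchLabels: { app: app } }
  template:
    metadata: { labels: { app: app } }
    spec:
      containers:
        - name: app
          image: your-registry/app:1.1.0    # illustrative new version
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 10m }          # watch SLO dashboards before ramping
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        # completing the steps promotes to 100%; aborting shifts traffic back to stable
```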
Example: Autoscaling
- HPA on CPU/requests-per-second + cluster autoscaler for nodes; cap max replicas to protect dependencies.
Example: Zero-Trust Networking
- Default-deny in each namespace; only allow traffic from ingress controller to app and app to explicit dependencies.
Common Mistakes
- Missing readiness probes (traffic hits pods before they’re ready)
- Running containers as root (wider blast radius on compromise)
- Unbounded egress (data exfiltration paths and surprise costs)
- Treating the cluster as the product (no SLOs, no runbooks, no ownership)
Integration Points
- CI (image build, SBOM, scanning, signing)
- CD (GitOps reconciler + progressive delivery)
- Cloud provider (EKS/GKE/AKS, IAM, load balancers, DNS)
- Observability + incident management (paging, runbooks, postmortems)