Advanced Kubernetes: Operators & CRDs
Level 1: Quick Reference
Core Concepts at a Glance
Custom Resource Definitions (CRDs) extend Kubernetes API with custom object types. Operators are controllers that manage these custom resources using domain-specific logic.
CRD vs ConfigMap Comparison:
| Aspect | CRD | ConfigMap |
|---|---|---|
| API Integration | Full Kubernetes API support (CRUD, watch, RBAC) | Simple key-value storage |
| Validation | OpenAPI v3 schema validation, admission webhooks | No built-in validation |
| Versioning | Multiple versions with conversion webhooks | Single version only |
| Use Case | Complex application state, declarative APIs | Configuration data, environment variables |
| Controller Support | Reconciliation loops, status tracking | Manual polling required |
| Example | Database instances, ML workflows, backup policies | App config files, feature flags |
Operator Pattern Overview
┌─────────────────────────────────────────────────────────┐
│                  Kubernetes API Server                  │
│             (stores desired state in etcd)              │
└────────────┬────────────────────────────┬───────────────┘
             │                            │
             │ Watch                      │ Update Status
             ↓                            ↑
     ┌────────────────┐         ┌─────────────────────┐
     │   Controller   │────────→│ External Resources  │
     │  (Reconcile)   │ Manage  │ (DBs, APIs, etc.)   │
     └────────────────┘         └─────────────────────┘
             ↑
             │ Compare
             │
        ┌────┴─────┐
        │ Desired  │
        │ vs Actual│
        └──────────┘
Reconciliation Loop:
- Watch - Controller watches for changes to custom resources
- Compare - Reconcile function compares desired vs actual state
- Act - Controller takes actions to align actual state with desired
- Update Status - Controller updates resource status with current state
- Requeue - Schedule next reconciliation (periodic or event-driven)
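The compare-and-act steps above can be sketched without any Kubernetes dependency. The following is a minimal, stdlib-only model, not controller-runtime code: `State` and `reconcile` are hypothetical names, and a replica count stands in for whatever a real operator manages. It illustrates why reconciliation should be idempotent: a second pass over converged state produces no actions.

```go
package main

import "fmt"

// State is a hypothetical stand-in for the part of a resource being managed.
type State struct {
	Replicas int
}

// reconcile compares desired and actual state, returns the converged actual
// state and the list of actions that were needed to get there.
func reconcile(desired, actual State) (State, []string) {
	var actions []string
	for actual.Replicas < desired.Replicas {
		actual.Replicas++
		actions = append(actions, "create replica")
	}
	for actual.Replicas > desired.Replicas {
		actual.Replicas--
		actions = append(actions, "delete replica")
	}
	return actual, actions
}

func main() {
	desired := State{Replicas: 3}
	actual := State{Replicas: 1}

	// First pass: two replicas are missing, so two actions are taken.
	actual, actions := reconcile(desired, actual)
	fmt.Println("replicas:", actual.Replicas, "actions:", actions)

	// Second pass: state already matches, so nothing happens.
	_, actions = reconcile(desired, actual)
	fmt.Println("actions on second pass:", len(actions))
}
```

In a real controller the "actual" state is read from the cluster or an external system on every pass rather than carried between calls.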
Controller Reconciliation Logic
// Simplified reconciliation pattern
func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // 1. Fetch the custom resource
    obj := &myapi.MyResource{}
    if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Handle deletion (finalizers)
    if !obj.DeletionTimestamp.IsZero() {
        return r.handleDeletion(ctx, obj)
    }

    // 3. Reconcile external state
    if err := r.reconcileExternal(ctx, obj); err != nil {
        return ctrl.Result{}, err
    }

    // 4. Update status
    obj.Status.Ready = true
    if err := r.Status().Update(ctx, obj); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{}, nil // Success, no requeue
}
Essential Checklist
Prerequisites:
- Kubernetes cluster (v1.25+) - local (kind, minikube) or remote
- kubectl configured with admin access
- Go 1.21+ installed
- Docker/Podman for building operator images
Development Tools:
- `kubebuilder` (v3.12+) - scaffolding and code generation
- `operator-sdk` (optional) - alternative framework
- `controller-gen` - generates CRDs, RBACs, webhooks
- `kustomize` - manages Kubernetes manifests
Testing Tools:
- `envtest` - runs API server locally for unit tests
- `kind` - Kubernetes in Docker for integration tests
- `ginkgo` - BDD testing framework (optional)
Key Files in Operator Project:
my-operator/
├── api/v1/           # CRD definitions (Go structs)
├── config/
│   ├── crd/          # Generated CRD YAML
│   ├── rbac/         # Generated RBAC YAML
│   ├── manager/      # Operator deployment
│   └── webhook/      # Webhook configurations
├── controllers/      # Reconciliation logic
├── main.go           # Entrypoint (manager setup)
└── Dockerfile        # Container image build
Quick Commands:
# Initialize operator project
kubebuilder init --domain example.com --repo github.com/myorg/my-operator

# Create CRD + controller
kubebuilder create api --group apps --version v1 --kind MyApp

# Generate manifests
make manifests

# Run locally (connects to current kubeconfig cluster)
make install run

# Run tests
make test

# Build and deploy
make docker-build docker-push deploy IMG=myregistry/my-operator:v1.0.0
Common Pitfalls:
- ❌ Forgetting to update CRD when changing API structs → run `make manifests`
- ❌ Infinite reconciliation loops → use `ctrl.Result{RequeueAfter: time.Minute}`
- ❌ Not handling deletion properly → implement finalizers
- ❌ Blocking operations in reconcile → use background workers for long tasks
- ❌ Not setting owner references → orphaned resources on deletion
When to Use Operators:
- ✅ Managing complex stateful applications (databases, message queues)
- ✅ Automating operational tasks (backups, upgrades, scaling)
- ✅ Integrating with external systems (cloud APIs, SaaS platforms)
- ✅ Enforcing organizational policies (cost controls, security standards)
- ❌ Simple deployments (use Helm or plain manifests)
- ❌ One-time configuration changes (use Jobs or manual kubectl)
Level 2: Implementation Guide
📚 Complete Examples: See REFERENCE.md for full controller implementations, webhook code, test suites, and production-ready patterns.
2.1 Custom Resource Definitions (CRDs)
CRDs extend Kubernetes API with custom object types validated by OpenAPI v3 schemas.
Key Components:
- Spec - Desired state (user input)
- Status - Observed state (controller output, separate subresource)
- Validation - Markers like `+kubebuilder:validation:Minimum=1`
- Versions - Support multiple API versions with conversion webhooks
Essential Kubebuilder Markers:
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=10
Size int32 `json:"size"`

// +kubebuilder:validation:Pattern=`^[a-z0-9.-]+/[a-z0-9.-]+:[a-z0-9.-]+$`
Image string `json:"image"`

// +optional
Port int32 `json:"port,omitempty"`
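To make the effect of these markers concrete, here is a small stdlib-only sketch of the checks they describe. In a real cluster the API server enforces them from the generated OpenAPI schema, so no such function exists in the operator; `validateSpec` is a hypothetical helper used purely for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// imagePattern mirrors the Pattern marker on the Image field.
var imagePattern = regexp.MustCompile(`^[a-z0-9.-]+/[a-z0-9.-]+:[a-z0-9.-]+$`)

// validateSpec is a hypothetical helper that applies the same rules as the
// Minimum=1, Maximum=10 and Pattern markers.
func validateSpec(size int32, image string) error {
	if size < 1 || size > 10 {
		return fmt.Errorf("size %d is outside [1, 10]", size)
	}
	if !imagePattern.MatchString(image) {
		return fmt.Errorf("image %q does not match %s", image, imagePattern)
	}
	return nil
}

func main() {
	fmt.Println(validateSpec(3, "library/nginx:1.25")) // accepted
	fmt.Println(validateSpec(11, "library/nginx:1.25")) // size too large
	fmt.Println(validateSpec(3, "nginx"))               // no repository/tag structure
}
```

Note that the pattern requires a `repo/name:tag` shape in lowercase, so a bare `nginx:1.25` or an image with uppercase letters would be rejected.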
Print columns for `kubectl get`:
// +kubebuilder:printcolumn:name="Ready",type=integer,JSONPath=`.status.readyReplicas`
// +kubebuilder:printcolumn:name="Phase",type=string,JSONPath=`.status.phase`
Subresources:
- `+kubebuilder:subresource:status` - Separate status endpoint
- `+kubebuilder:subresource:scale` - Enable `kubectl scale`
Generate CRDs:
make manifests → outputs to config/crd/bases/
See REFERENCE.md for complete CRD definition, versioning, and conversion webhooks.
2.2 Operators and Controllers
Reconciliation Loop:
- Watch - Controller watches for resource changes (via informers/caches)
- Compare - Reconcile compares desired vs actual state
- Act - Create/update/delete Kubernetes resources to match desired state
- Update Status - Set status conditions (`Ready`, `Progressing`, `Degraded`)
- Requeue - Schedule next reconciliation (event-driven or periodic)
Controller Pattern:
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // 1. Fetch custom resource
    obj := &MyResource{}
    if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Handle deletion (finalizers)
    if !obj.DeletionTimestamp.IsZero() {
        return r.handleDeletion(ctx, obj)
    }

    // 3. Reconcile external state
    if err := r.reconcileDeployment(ctx, obj); err != nil {
        // A non-nil error already triggers a rate-limited requeue;
        // controller-runtime ignores the Result when an error is returned.
        return ctrl.Result{}, err
    }

    // 4. Update status
    obj.Status.Ready = true
    return ctrl.Result{}, r.Status().Update(ctx, obj)
}
Key Functions:
- `controllerutil.CreateOrUpdate()` - Idempotent create/update
- `controllerutil.SetControllerReference()` - Automatic garbage collection
- `controllerutil.AddFinalizer()` - Cleanup before deletion
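The finalizer flow behind the last helper can be modeled in plain Go. This is a simplified sketch under stated assumptions, not the controller-runtime API: `Object`, `cleanupFinalizer`, and the lowercase helper functions are hypothetical, and a boolean `Deleting` field stands in for a non-zero `DeletionTimestamp`.

```go
package main

import "fmt"

// Object is a hypothetical stand-in for a custom resource's metadata.
type Object struct {
	Finalizers []string
	Deleting   bool // models a non-zero DeletionTimestamp
}

const cleanupFinalizer = "example.com/cleanup" // hypothetical finalizer name

func containsFinalizer(o *Object, name string) bool {
	for _, f := range o.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

func addFinalizer(o *Object, name string) {
	if !containsFinalizer(o, name) {
		o.Finalizers = append(o.Finalizers, name)
	}
}

func removeFinalizer(o *Object, name string) {
	kept := o.Finalizers[:0]
	for _, f := range o.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	o.Finalizers = kept
}

// reconcile returns true once no finalizers remain on a deleting object,
// which is the point at which the API server would remove it from storage.
func reconcile(o *Object, cleanup func() error) (bool, error) {
	if !o.Deleting {
		addFinalizer(o, cleanupFinalizer) // register cleanup before doing work
		return false, nil
	}
	if containsFinalizer(o, cleanupFinalizer) {
		if err := cleanup(); err != nil {
			return false, err // keep the finalizer so cleanup is retried
		}
		removeFinalizer(o, cleanupFinalizer)
	}
	return len(o.Finalizers) == 0, nil
}

func main() {
	obj := &Object{}
	cleanup := func() error { fmt.Println("releasing external resource"); return nil }

	done, _ := reconcile(obj, cleanup)
	fmt.Println(done, obj.Finalizers)

	obj.Deleting = true // user ran kubectl delete
	done, _ = reconcile(obj, cleanup)
	fmt.Println(done, obj.Finalizers)
}
```

The key design point is that the finalizer is removed only after cleanup succeeds, so a failed cleanup leaves the object in place for the next reconcile.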
Error Handling:
- Transient errors - Requeue with delay: `ctrl.Result{RequeueAfter: 30 * time.Second}` (returned with a nil error)
- Permanent errors - Set degraded condition, don't requeue
- Unknown errors - Return error for exponential backoff
See REFERENCE.md for complete controller implementation with finalizers, owner references, and error handling.
2.3 Admission Webhooks
Webhooks intercept API requests before persistence for validation/mutation.
Types:
- Validating - Accept/reject requests (JWT validation, cross-field checks)
- Mutating - Modify requests (inject sidecars, set defaults)
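The API server runs mutating webhooks before validating ones, so validation sees the object with defaults already applied. The sketch below models that ordering in plain Go under stated assumptions: `Spec`, `mutate`, `validate`, and `admit` are hypothetical names, and the port range check is an illustrative cross-field style rule rather than something taken from a real operator.

```go
package main

import (
	"errors"
	"fmt"
)

// Spec is a hypothetical subset of a custom resource spec.
type Spec struct {
	Size int32
	Port int32
}

// mutate sets defaults, as a mutating webhook would.
func mutate(s *Spec) {
	if s.Port == 0 {
		s.Port = 8080
	}
}

// validate accepts or rejects the already-mutated object.
func validate(s Spec) error {
	if s.Size < 1 || s.Size > 100 {
		return errors.New("size must be 1-100")
	}
	if s.Port < 1 || s.Port > 65535 {
		return errors.New("port must be 1-65535") // illustrative extra rule
	}
	return nil
}

// admit models the admission order: mutation first, then validation.
func admit(s Spec) (Spec, error) {
	mutate(&s)
	if err := validate(s); err != nil {
		return Spec{}, err
	}
	return s, nil
}

func main() {
	fmt.Println(admit(Spec{Size: 3}))           // port defaulted, accepted
	fmt.Println(admit(Spec{Size: 0, Port: 80})) // rejected by validation
}
```

Because defaults are applied first, an omitted port passes validation here, whereas validating the raw request would have rejected it.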
Implementation:
// Validating webhook
func (r *MyApp) ValidateCreate() (admission.Warnings, error) {
    if r.Spec.Size < 1 || r.Spec.Size > 100 {
        return nil, fmt.Errorf("size must be 1-100")
    }
    return nil, nil
}

// Mutating webhook (Defaulter)
func (r *MyApp) Default() {
    if r.Spec.Port == 0 {
        r.Spec.Port = 8080
    }
}
Setup:
- Implement `webhook.Validator` or `webhook.Defaulter` interface
- Add kubebuilder marker: `// +kubebuilder:webhook:path=/validate-...,mutating=false,...`
- `make manifests` generates webhook config
- Deploy with cert-manager for TLS certificates
Requirements:
- TLS certificates (use cert-manager)
- Service routing webhook traffic to operator
- `failurePolicy: Fail` (default) - reject on webhook errors
See REFERENCE.md for complete webhook examples, cert-manager setup, and validation patterns.
2.4 Leader Election & High Availability
Leader election ensures only one controller instance reconciles at a time (prevents race conditions).
Configuration:
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    LeaderElection:          true,
    LeaderElectionID:        "myapp-controller.example.com",
    LeaderElectionNamespace: "myapp-system",
})
How It Works:
- Uses Kubernetes `Lease` resource for coordination
- One replica acquires lease, becomes leader
- Other replicas standby, ready to take over on leader failure
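- Leader keeps renewing the lease; with controller-runtime defaults it retries every 2s and must succeed within a 10s renew deadline on a 15s lease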
Deployment:
spec:
  replicas: 3  # High availability
  template:
    spec:
      containers:
      - args:
        - --leader-elect
See REFERENCE.md for RBAC requirements and lease configuration tuning.
2.5 Testing Operators
Unit Testing with envtest:
- Runs local API server (no kubelet, no containers)
- Fast tests (milliseconds per test)
- Full CRD validation
Setup:
testEnv = &envtest.Environment{
    CRDDirectoryPaths: []string{filepath.Join("..", "config", "crd", "bases")},
}
cfg, err := testEnv.Start()
Expect(err).NotTo(HaveOccurred())
k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})
Expect(err).NotTo(HaveOccurred())
Test Pattern:
It("Should create Deployment", func() {
    myApp := &MyApp{...}
    Expect(k8sClient.Create(ctx, myApp)).Should(Succeed())

    deployment := &Deployment{}
    Eventually(func() error {
        return k8sClient.Get(ctx, namespacedName, deployment)
    }, timeout, interval).Should(Succeed())

    Expect(*deployment.Spec.Replicas).To(Equal(int32(3)))
})
Integration Testing with kind:
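kind create cluster
make docker-build IMG=operator:test
kind load docker-image operator:test   # load the local image into the kind nodes instead of pushing to a registry
make deploy IMG=operator:test
kubectl wait --for=condition=available deployment/operator
kubectl apply -f test-cr.yaml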
See REFERENCE.md for complete test suites, ginkgo patterns, and E2E test scripts.
2.6 Best Practices & Anti-Patterns
✅ Best Practices:
- Idempotent reconciliation - Same result on multiple calls
- Use `CreateOrUpdate` - Simplifies create/update logic
- Set owner references - Automatic garbage collection
- Finalizers for cleanup - External resources (cloud APIs, databases)
- Status conditions - `Ready`, `Progressing`, `Degraded` with detailed messages
- Structured logging - JSON format with consistent key-value pairs
❌ Anti-Patterns:
- Blocking operations - Don't make sync calls that block reconcile
- Infinite loops - Updating spec in reconcile triggers another reconcile
- Hardcoded values - Use env vars/ConfigMaps
- Missing watches - Ensure RBAC allows watching dependent resources
- No health checks - Implement `/healthz` and `/readyz` endpoints
Requeue Strategies:
// Immediate requeue (rate-limited)
return ctrl.Result{Requeue: true}, nil

// Requeue after delay
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil

// No requeue (wait for watch event)
return ctrl.Result{}, nil

// Error (exponential backoff)
return ctrl.Result{}, fmt.Errorf("transient error")
See REFERENCE.md for advanced patterns, multi-cluster operators, and OLM integration.
Level 3: Deep Dive Resources
Advanced Operator Patterns
State Machine Operators
- Model complex workflows as finite state machines
- Use status phases to track progression through states
- Implement state transition validations and guards
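A phase-based state machine with transition guards might look like the following. This is a minimal sketch under stated assumptions: the phase names and the `transitions` table are hypothetical examples, and a real operator would store the current phase in the resource's status.

```go
package main

import "fmt"

// Phase is a hypothetical status phase for a custom resource.
type Phase string

const (
	Pending      Phase = "Pending"
	Provisioning Phase = "Provisioning"
	Ready        Phase = "Ready"
	Failed       Phase = "Failed"
)

// transitions lists which target phases are allowed from each phase.
var transitions = map[Phase][]Phase{
	Pending:      {Provisioning, Failed},
	Provisioning: {Ready, Failed},
	Ready:        {Provisioning, Failed},
	Failed:       {Pending},
}

// next returns target if the transition is allowed; otherwise it keeps the
// current phase and reports an error.
func next(current, target Phase) (Phase, error) {
	for _, allowed := range transitions[current] {
		if allowed == target {
			return target, nil
		}
	}
	return current, fmt.Errorf("transition %s -> %s is not allowed", current, target)
}

func main() {
	phase := Pending
	for _, target := range []Phase{Provisioning, Ready, Pending} {
		var err error
		phase, err = next(phase, target)
		fmt.Println(phase, err)
	}
}
```

Encoding the allowed transitions as data keeps the guard logic in one place and makes illegal jumps, such as `Ready` directly to `Pending`, easy to reject and to test.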
Multi-Tenancy Operators
- Namespace isolation strategies
- Shared vs dedicated operator deployments
- RBAC scoping for tenant-specific resources
GitOps Integration
- Reconcile against Git repository state
- Implement drift detection and auto-remediation
- Use annotations to track source commits
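Drift detection and remediation reduce to comparing a desired configuration, for example one rendered from Git, with what is live. The sketch below uses flat string maps as a stand-in for real manifests; `drift` and `remediate` are hypothetical helpers, and a production implementation would compare structured objects and ignore server-managed fields.

```go
package main

import (
	"fmt"
	"sort"
)

// drift returns the sorted keys whose values differ between desired and
// live, including keys present on only one side.
func drift(desired, live map[string]string) []string {
	var keys []string
	for k, want := range desired {
		if got, ok := live[k]; !ok || got != want {
			keys = append(keys, k)
		}
	}
	for k := range live {
		if _, ok := desired[k]; !ok {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

// remediate overwrites live so that it matches desired exactly.
func remediate(desired, live map[string]string) {
	for k := range live {
		if _, ok := desired[k]; !ok {
			delete(live, k)
		}
	}
	for k, v := range desired {
		live[k] = v
	}
}

func main() {
	desired := map[string]string{"image": "app:v2", "replicas": "3"}
	live := map[string]string{"image": "app:v1", "replicas": "3", "debug": "true"}

	fmt.Println("drifted keys:", drift(desired, live))
	remediate(desired, live)
	fmt.Println("drifted keys after remediation:", drift(desired, live))
}
```

Separating detection from remediation makes it possible to run in a report-only mode before enabling automatic correction.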
External Secret Management
- Integrate with Vault, AWS Secrets Manager, or Azure Key Vault
- Implement secret rotation without downtime
- Use external-secrets operator pattern
Multi-Cluster Operators
Architecture Patterns:
- Hub-Spoke Model - Central operator manages multiple clusters
- Federated Model - Operators in each cluster coordinate via shared state
- Active-Active - Operators in multiple clusters handle same resources
Implementation Considerations:
- Use cluster-api for cluster lifecycle management
- Implement cross-cluster service discovery (e.g., Submariner)
- Handle network partitions and split-brain scenarios
- Use consensus protocols for distributed state
Tools:
- KubeFed (deprecated) - Kubernetes Federation v2
- OCM (Open Cluster Management) - CNCF sandbox project
- Argo CD ApplicationSet - Multi-cluster GitOps
- Crossplane - Universal control plane for multi-cloud
Operator Lifecycle Manager (OLM)
What is OLM?
- Package manager for Kubernetes operators
- Handles installation, upgrades, and dependency management
- Used by OpenShift and available as a CNCF project
OLM Components:
- Catalog - Repository of operator metadata (CSV, CRD)
- Subscription - Declarative operator installation
- InstallPlan - Execution plan for operator installation
- ClusterServiceVersion (CSV) - Operator metadata and deployment info
Creating an OLM Bundle:
# Generate bundle manifests
operator-sdk generate bundle --version 1.0.0

# Validate bundle
operator-sdk bundle validate ./bundle

# Build and push bundle image
docker build -f bundle.Dockerfile -t myregistry/myapp-operator-bundle:v1.0.0 .
docker push myregistry/myapp-operator-bundle:v1.0.0

# Add to catalog
opm index add --bundles myregistry/myapp-operator-bundle:v1.0.0 \
  --tag myregistry/myapp-catalog:latest
OLM Best Practices:
- Define proper upgrade paths in CSV
- Test upgrade scenarios (skip versions, downgrades)
- Use semantic versioning
- Document breaking changes in release notes
Advanced Testing Strategies
Property-Based Testing:
- Use tools like `gopter` for property-based tests
- Test invariants across state transitions
- Generate random valid/invalid inputs
Chaos Testing:
- Use Chaos Mesh or Litmus to inject failures
- Test operator resilience to node failures, network partitions
- Verify recovery from partial updates
Performance Testing:
- Benchmark reconciliation loop latency
- Test with 1000+ custom resources
- Measure memory/CPU usage under load
- Use profiling tools (pprof) for bottleneck analysis
Production Readiness Checklist
Observability:
- Metrics exported via Prometheus endpoint
- Structured logging with levels (info, warn, error)
- Distributed tracing (OpenTelemetry)
- Custom metrics for business logic (e.g., backup success rate)
Security:
- RBAC follows least-privilege principle
- Secrets encrypted at rest and in transit
- Pod Security Standards enforced
- Network policies restrict traffic
- Image vulnerability scanning in CI/CD
Reliability:
- Leader election enabled for HA
- Graceful shutdown with finalizers
- Rate limiting to prevent API server overload
- Circuit breakers for external dependencies
- Backup/restore procedures documented
Operational:
- Runbooks for common failure scenarios
- SLO/SLI definitions (e.g., 99.9% reconciliation success)
- Alerting rules for critical conditions
- Upgrade/rollback procedures tested
- Capacity planning documented
Resources and Further Learning
Bundled Resources in This Directory:
- `templates/crd-definition.yaml` - Complete CRD with OpenAPI schema
- `templates/operator-scaffold.go` - Controller with reconcile logic
- `templates/webhook.go` - Validating and mutating webhooks
- `templates/rbac.yaml` - RBAC manifests for operator deployment
- `scripts/setup-operator-dev.sh` - Development environment setup
- `resources/operator-patterns.md` - Common patterns and anti-patterns
Next Steps
- Build a Simple Operator - Start with a basic CRD and controller
- Add Validation - Implement admission webhooks
- Test Thoroughly - Write unit tests with envtest, integration tests with kind
- Observe in Production - Deploy with metrics, logging, and tracing
- Iterate - Add features based on operational experience
Advanced Topics to Explore:
- Custom admission plugins
- API aggregation and extension API servers
- Operator Hub and OLM
- Multi-cluster federation
- Operator performance optimization