Claude-skill-registry Canary Deployment
Gradually rolling out new versions to a subset of users, monitoring for issues, and expanding the rollout if everything looks healthy to minimize risk and enable fast rollback.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/canary-deployment-amnadtaowsoam-cerebraskills" ~/.claude/skills/majiayu000-claude-skill-registry-canary-deployment && rm -rf "$T"
manifest:
skills/data/canary-deployment-amnadtaowsoam-cerebraskills/SKILL.mdsource content
Canary Deployment
Current Level: Intermediate
Domain: DevOps / Deployment
Overview
Canary deployment is a deployment strategy where you gradually roll out a new version to a subset of users, monitor for issues, and expand the rollout if everything looks healthy. This minimizes risk by catching issues early and enabling fast rollback.
What is Canary Deployment
Core Concept
┌─────────────────────────────────────────────────────────────────┐ │ Canary Deployment Process │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ Stage 1: 5% Canary │ │ │ │ ┌─────────────────────────────────────────────────┐ │ │ │ │ │ Old Version (95%) │ New Version (5%) │ │ │ │ │ │ └─────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ Stage 2: 25% Canary │ │ │ │ ┌─────────────────────────────────────────────────┐ │ │ │ │ │ Old Version (75%) │ New Version (25%) │ │ │ │ │ │ └─────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ Stage 3: 50% Canary │ │ │ │ ┌─────────────────────────────────────────────────┐ │ │ │ │ │ Old Version (50%) │ New Version (50%) │ │ │ │ │ │ └─────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ │ ┌─────────────────────────────────────────────────────────┐ │ │ │ Stage 4: 100% Rollout │ │ │ │ ┌─────────────────────────────────────────────────┐ │ │ │ │ │ New Version (100%) │ │ │ │ │ │ └─────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘
Why Canary Deployment
| Benefit | Impact |
|---|---|
| Reduce Blast Radius | Only affects small % of users |
| Real-World Testing | Test with actual users |
| Early Issue Detection | Catch problems early |
| Confidence Building | Gradual expansion |
| Rollback Safety | Quick rollback if issues |
Canary Process
Step-by-Step Process
┌─────────────────────────────────────────────────────────────────┐ │ Canary Deployment Steps │ ├─────────────────────────────────────────────────────────────────┤ │ │ │ 1. Deploy to Canary Servers ──▶ 2. Route 5% Traffic ──▶ 3. Monitor Metrics ──▶ 4. Increase to 25% ──▶ 5. Monitor ──▶ 6. Increase to 50% ──▶ 7. Monitor ──▶ 8. 100% Rollout │ │ │ │ └───────────────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────────┘
Step 1: Deploy to Canary
# Deploy new version to canary servers kubectl apply -f canary-deployment.yaml # Wait for canary to be ready kubectl rollout status deployment/web-app-canary
# Canary deployment apiVersion: apps/v1 kind: Deployment metadata: name: web-app-canary labels: app: web-app version: v2.0 spec: replicas: 1 selector: matchLabels: app: web-app template: metadata: labels: app: web-app version: v2.0 spec: containers: - name: web-app image: web-app:v2.0 ports: - containerPort: 8080 readinessProbe: httpGet: path: /health port: 8080 initialDelaySeconds: 5 periodSeconds: 10 successThreshold: 3 failureThreshold: 3 livenessProbe: httpGet: path: /health port: 8080 initialDelaySeconds: 5 periodSeconds: 10 successThreshold: 3 failureThreshold: 3
Step 2: Route 5% Traffic
# Service with traffic splitting apiVersion: v1 kind: Service metadata: name: web-app spec: selector: app: web-app ports: - protocol: TCP port: 80 targetPort: 8080
# Istio VirtualService for canary apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: web-app spec: hosts: - web-app.example.com http: - route: - destination: host: web-app subset: v1 weight: 95 - destination: host: web-app subset: v2 weight: 5
Step 3: Monitor Metrics
# Monitoring script import requests import time def monitor_canary(): """Monitor canary metrics.""" while True: try: # Check error rate error_rate = get_error_rate() print(f"Error rate: {error_rate}%") # Check latency latency = get_latency() print(f"Latency: {latency}ms") # Check business metrics conversion_rate = get_conversion_rate() print(f"Conversion rate: {conversion_rate}%") # Check if metrics are healthy if error_rate > 5: print("⚠️ Error rate too high") trigger_rollback() break if latency > 1000: print("⚠️ Latency too high") trigger_rollback() break if conversion_rate < 10: print("⚠️ Conversion rate too low") trigger_rollback() break print("✓ Metrics healthy") except Exception as e: print(f"✗ Error monitoring: {e}") time.sleep(60) # Check every minute monitor_canary()
Step 4: Increase to 25%
# Update VirtualService for 25% canary apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: web-app spec: hosts: - web-app.example.com http: - route: - destination: host: web-app subset: v1 weight: 75 - destination: host: web-app subset: v2 weight: 25
Step 5: Monitor
# Continue monitoring for 25% canary def monitor_25_percent(): """Monitor 25% canary metrics.""" while True: try: # Check error rate error_rate = get_error_rate() print(f"Error rate: {error_rate}%") # Check latency latency = get_latency() print(f"Latency: {latency}ms") # Check business metrics conversion_rate = get_conversion_rate() print(f"Conversion rate: {conversion_rate}%") # Check if metrics are healthy if error_rate > 5: print("⚠️ Error rate too high") trigger_rollback() break if latency > 1000: print("⚠️ Latency too high") trigger_rollback() break if conversion_rate < 10: print("⚠️ Conversion rate too low") trigger_rollback() break print("✓ Metrics healthy") except Exception as e: print(f"✗ Error monitoring: {e}") time.sleep(60) # Check every minute monitor_25_percent()
Step 6: Increase to 50%
# Update VirtualService for 50% canary apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: web-app spec: hosts: - web-app.example.com http: - route: - destination: host: web-app subset: v1 weight: 50 - destination: host: web-app subset: v2 weight: 50
Step 7: Monitor
# Continue monitoring for 50% canary def monitor_50_percent(): """Monitor 50% canary metrics.""" while True: try: # Check error rate error_rate = get_error_rate() print(f"Error rate: {error_rate}%") # Check latency latency = get_latency() print(f"Latency: {latency}ms") # Check business metrics conversion_rate = get_conversion_rate() print(f"Conversion rate: {conversion_rate}%") # Check if metrics are healthy if error_rate > 5: print("⚠️ Error rate too high") trigger_rollback() break if latency > 1000: print("⚠️ Latency too high") trigger_rollback() break if conversion_rate < 10: print("⚠️ Conversion rate too low") trigger_rollback() break print("✓ Metrics healthy") except Exception as e: print(f"✗ Error monitoring: {e}") time.sleep(60) # Check every minute monitor_50_percent()
Step 8: 100% Rollout
# Update VirtualService for 100% canary apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: web-app spec: hosts: - web-app.example.com http: - route: - destination: host: web-app subset: v2 weight: 100
Traffic Routing
Load Balancer with Weights
# Kubernetes Service with traffic splitting apiVersion: v1 kind: Service metadata: name: web-app spec: selector: app: web-app ports: - protocol: TCP port: 80 targetPort: 8080
# Istio DestinationRule apiVersion: networking.istio.io/v1alpha3 kind: DestinationRule metadata: name: web-app spec: host: web-app subsets: - name: v1 labels: version: v1.0 - name: v2 labels: version: v2.0
Service Mesh
# Istio VirtualService apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: web-app spec: hosts: - web-app.example.com http: - route: - destination: host: web-app subset: v1 weight: 95 - destination: host: web-app subset: v2 weight: 5
Feature Flags
// Using LaunchDarkly for canary const LaunchDarkly = require('ldclient-node'); const client = LaunchDarkly.init('sdk-key'); client.on('ready', () => { const flagValue = client.variation('new-checkout', false, { key: 'user-123', custom: { segment: 'beta' } }); if (flagValue) { // Show new checkout } else { // Show old checkout } });
Canary Metrics
Error Rate
# Calculate error rate def get_error_rate(): """Calculate error rate.""" total_requests = 1000 error_requests = 50 error_rate = (error_requests / total_requests) * 100 return error_rate # Threshold: < 5%
Latency
# Calculate latency def get_latency(): """Calculate latency.""" response_times = [100, 150, 200, 250, 300] average_latency = sum(response_times) / len(response_times) return average_latency # Threshold: < 1000ms
Business Metrics
# Calculate conversion rate def get_conversion_rate(): """Calculate conversion rate.""" visitors = 1000 conversions = 100 conversion_rate = (conversions / visitors) * 100 return conversion_rate # Threshold: > 10%
Resource Usage
# Calculate resource usage def get_resource_usage(): """Calculate resource usage.""" cpu_usage = 50 # Percentage memory_usage = 60 # Percentage return { 'cpu': cpu_usage, 'memory': memory_usage } # Threshold: CPU < 80%, Memory < 80%
Canary Decision
Automated Decision
# Automated canary decision def automated_canary_decision(): """Automatically decide whether to proceed.""" error_rate = get_error_rate() latency = get_latency() conversion_rate = get_conversion_rate() # Check if metrics are healthy if error_rate < 5 and latency < 1000 and conversion_rate > 10: print("✓ Metrics healthy, proceeding to next stage") return True else: print("✗ Metrics unhealthy, rolling back") return False # Usage if automated_canary_decision(): increase_canary_percentage() else: rollback_canary()
Manual Decision
# Manual canary decision def manual_canary_decision(): """Manually decide whether to proceed.""" error_rate = get_error_rate() latency = get_latency() conversion_rate = get_conversion_rate() print(f"Error rate: {error_rate}%") print(f"Latency: {latency}ms") print(f"Conversion rate: {conversion_rate}%") # Ask user for decision decision = input("Proceed to next stage? (yes/no): ") if decision.lower() == 'yes': print("✓ Proceeding to next stage") return True else: print("✗ Rolling back") return False # Usage if manual_canary_decision(): increase_canary_percentage() else: rollback_canary()
Hybrid Decision
# Hybrid canary decision def hybrid_canary_decision(): """Hybrid decision: auto proceed unless anomaly.""" error_rate = get_error_rate() latency = get_latency() conversion_rate = get_conversion_rate() print(f"Error rate: {error_rate}%") print(f"Latency: {latency}ms") print(f"Conversion rate: {conversion_rate}%") # Check for anomalies if error_rate > 5 or latency > 1000 or conversion_rate < 10: print("⚠️ Anomaly detected, manual review required") decision = input("Proceed to next stage? (yes/no): ") return decision.lower() == 'yes' else: print("✓ Metrics healthy, automatically proceeding") return True # Usage if hybrid_canary_decision(): increase_canary_percentage() else: rollback_canary()
Canary Duration
Fast Canary
# Fast canary: 10 minutes per stage def fast_canary(): """Fast canary deployment.""" stages = [5, 25, 50, 100] duration = 10 # minutes for stage in stages: print(f"Rolling out to {stage}%") set_canary_percentage(stage) time.sleep(duration * 60) # Wait for duration if not check_metrics(): rollback_canary() return print("✓ Canary deployment successful")
Slow Canary
# Slow canary: Hours per stage def slow_canary(): """Slow canary deployment.""" stages = [5, 25, 50, 100] duration = 60 # minutes for stage in stages: print(f"Rolling out to {stage}%") set_canary_percentage(stage) time.sleep(duration * 60) # Wait for duration if not check_metrics(): rollback_canary() return print("✓ Canary deployment successful")
Soak Time
# Canary with soak time def canary_with_soak_time(): """Canary deployment with soak time.""" stages = [5, 25, 50, 100] soak_time = 30 # minutes for stage in stages: print(f"Rolling out to {stage}%") set_canary_percentage(stage) time.sleep(soak_time * 60) # Soak time if not check_metrics(): rollback_canary() return print("✓ Canary deployment successful")
User Selection
Random Users
# Random user selection import random def random_user_selection(): """Select random users for canary.""" canary_percentage = 5 user_id = get_user_id() # Randomly select users if random.random() < canary_percentage / 100: return True else: return False # Usage if random_user_selection(): show_new_version() else: show_old_version()
Internal Users
# Internal user selection def internal_user_selection(): """Select internal users for canary.""" user_email = get_user_email() # Check if user is internal if user_email.endswith('@company.com'): return True else: return False # Usage if internal_user_selection(): show_new_version() else: show_old_version()
Beta Testers
# Beta tester selection def beta_tester_selection(): """Select beta testers for canary.""" user_id = get_user_id() # Check if user is beta tester beta_testers = ['user-1', 'user-2', 'user-3'] if user_id in beta_testers: return True else: return False # Usage if beta_tester_selection(): show_new_version() else: show_old_version()
Segment-Based
# Segment-based selection def segment_based_selection(): """Select users based on segment.""" user_segment = get_user_segment() # Select specific segments canary_segments = ['free-tier', 'new-users'] if user_segment in canary_segments: return True else: return False # Usage if segment_based_selection(): show_new_version() else: show_old_version()
Automated Canary with Flagger
Flagger Installation
# Install Flagger kubectl apply -k github.com/fluxcd/flagger//kustomize/istio
Flagger Configuration
# Flagger Canary resource apiVersion: flagger.app/v1beta1 kind: Canary metadata: name: web-app spec: targetRef: apiVersion: apps/v1 kind: Deployment name: web-app service: port: 80 targetPort: 8080 analysis: interval: 1m threshold: 5 maxWeight: 50 stepWeight: 5 metrics: - name: request-success-rate thresholdRange: min: 99 - name: request-duration thresholdRange: max: 500 webhooks: - name: load-test url: http://flagger-loadtester/ timeout: 5s metadata: cmd: "hey -z 1m -q 10 -c 2 http://web-app-canary/"
Flagger Monitoring
# Monitor canary status kubectl get canary web-app -w # Check canary events kubectl describe canary web-app
Monitoring and Alerting
Dashboards
# Create monitoring dashboard def create_canary_dashboard(): """Create canary monitoring dashboard.""" dashboard = { 'title': 'Canary Deployment Monitoring', 'panels': [ { 'title': 'Error Rate', 'type': 'graph', 'targets': [ { 'expr': 'rate(http_requests_total{status=~"5.."}[5m])' } ] }, { 'title': 'Latency', 'type': 'graph', 'targets': [ { 'expr': 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))' } ] }, { 'title': 'Conversion Rate', 'type': 'graph', 'targets': [ { 'expr': 'rate(conversions_total[5m]) / rate(visitors_total[5m])' } ] } ] } return dashboard
Alerting
# Set up alerts def setup_canary_alerts(): """Set up canary alerts.""" alerts = [ { 'name': 'High Error Rate', 'condition': 'error_rate > 5', 'action': 'send_alert', 'message': 'Error rate too high, rolling back' }, { 'name': 'High Latency', 'condition': 'latency > 1000', 'action': 'send_alert', 'message': 'Latency too high, rolling back' }, { 'name': 'Low Conversion Rate', 'condition': 'conversion_rate < 10', 'action': 'send_alert', 'message': 'Conversion rate too low, rolling back' } ] return alerts
Rollback
Automatic Rollback
# Automatic rollback def automatic_rollback(): """Automatically rollback if metrics are unhealthy.""" error_rate = get_error_rate() latency = get_latency() conversion_rate = get_conversion_rate() # Check if metrics are unhealthy if error_rate > 5 or latency > 1000 or conversion_rate < 10: print("✗ Metrics unhealthy, rolling back") rollback_canary() else: print("✓ Metrics healthy")
Manual Rollback
# Manual rollback def manual_rollback(): """Manually rollback canary.""" decision = input("Rollback canary? (yes/no): ") if decision.lower() == 'yes': print("Rolling back canary...") rollback_canary() else: print("Continuing canary...")
Rollback Script
# Rollback script def rollback_canary(): """Rollback canary to old version.""" print("Rolling back canary...") # Update VirtualService to route 100% to old version update_virtual_service(100, 0) # Delete canary deployment delete_canary_deployment() print("✓ Canary rolled back")
Database Challenges
Backward-Compatible Schema Changes
-- Version N: Add new column (nullable) ALTER TABLE users ADD COLUMN new_feature_enabled BOOLEAN; -- Version N+1: Populate new column UPDATE users SET new_feature_enabled = FALSE; -- Version N+2: Enable feature for canary UPDATE users SET new_feature_enabled = TRUE; -- Version N+3: Make column non-nullable ALTER TABLE users ALTER COLUMN new_feature_enabled SET NOT NULL;
Both Versions Work with Same Schema
-- Both versions must work with same schema -- Old version: Ignores new column -- New version: Uses new column -- Example: -- Old version SELECT * FROM users; -- New version SELECT *, new_feature_enabled FROM users;
Session Handling
Sticky Sessions
# Istio DestinationRule with sticky sessions apiVersion: networking.istio.io/v1alpha3 kind: DestinationRule metadata: name: web-app spec: host: web-app trafficPolicy: loadBalancer: consistentHash: httpCookie: name: canary ttl: 60s
Stateless Apps
# Stateless app: No session handling needed def handle_request(request): """Handle request without session.""" # Process request response = process_request(request) # Return response return response
Progressive Delivery
Canary + Feature Flags
# Canary deployment with feature flags def canary_with_feature_flags(): """Canary deployment with feature flags.""" # Deploy new version deploy_new_version() # Enable feature flag for 5% of users set_feature_flag_percentage('new-checkout', 5) # Monitor metrics if check_metrics(): # Increase to 25% set_feature_flag_percentage('new-checkout', 25) if check_metrics(): # Increase to 50% set_feature_flag_percentage('new-checkout', 50) if check_metrics(): # Increase to 100% set_feature_flag_percentage('new-checkout', 100) else: rollback_feature_flag() else: rollback_feature_flag() else: rollback_feature_flag()
A/B Testing + Canary
# A/B testing with canary def ab_testing_with_canary(): """A/B testing with canary deployment.""" # Deploy new version deploy_new_version() # Split traffic 50/50 set_canary_percentage(50) # Monitor metrics if check_metrics(): # Determine winner winner = determine_winner() if winner == 'new': # Rollout new version set_canary_percentage(100) else: # Rollback to old version rollback_canary() else: rollback_canary()
Tools
Flagger
# Install Flagger kubectl apply -k github.com/fluxcd/flagger//kustomize/istio # Create canary kubectl apply -f canary.yaml # Monitor canary kubectl get canary web-app -w
Argo Rollouts
# Install Argo Rollouts kubectl create namespace argo-rollouts kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml # Create rollout kubectl apply -f rollout.yaml # Monitor rollout kubectl argo rollouts get rollout web-app -w
AWS CodeDeploy
# Create deployment group aws deploy create-deployment-group \ --application-name web-app \ --deployment-group-name canary \ --deployment-config-name CodeDeployDefault.Canary10Percent30Minutes # Create deployment aws deploy create-deployment \ --application-name web-app \ --deployment-group-name canary \ --deployment-config-name CodeDeployDefault.Canary10Percent30Minutes \ --s3-location bucket=web-app,bundleType=zip,key=web-app-v2.zip
Service Mesh
# Istio VirtualService apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: web-app spec: hosts: - web-app.example.com http: - route: - destination: host: web-app subset: v1 weight: 95 - destination: host: web-app subset: v2 weight: 5
Feature Flags
// LaunchDarkly const LaunchDarkly = require('ldclient-node'); const client = LaunchDarkly.init('sdk-key'); client.on('ready', () => { const flagValue = client.variation('new-checkout', false, { key: 'user-123' }); if (flagValue) { show_new_version(); } else { show_old_version(); } });
Canary vs Blue-Green
| Aspect | Canary | Blue-Green |
|---|---|---|
| Rollout | Gradual | All-at-once |
| Risk | Lower | Higher |
| Speed | Slower | Faster |
| Rollback | Fast | Instant |
| Cost | Lower | Higher |
| Complexity | Higher | Lower |
Real Examples
Example 1: API Service Canary
Scenario: Deploy new API version
Timeline:
- 09:00: Deploy to canary
- 09:05: Route 5% traffic
- 09:15: Monitor metrics (healthy)
- 09:20: Increase to 25%
- 09:35: Monitor metrics (healthy)
- 09:40: Increase to 50%
- 09:55: Monitor metrics (healthy)
- 10:00: Rollout to 100%
Outcome: Successful deployment
Example 2: Frontend Canary
Scenario: Deploy new checkout flow
Timeline:
- 14:00: Deploy to canary
- 14:05: Route 5% traffic
- 14:15: Monitor metrics (conversion rate low)
- 14:20: Rollback to old version
Outcome: Rollback due to low conversion rate
Example 3: Kubernetes Canary with Flagger
Scenario: Deploy microservice with Flagger
Timeline:
- 10:00: Deploy new version
- 10:05: Flagger automatically routes 5% traffic
- 10:15: Flagger increases to 10%
- 10:25: Flagger increases to 15%
- 10:35: Flagger increases to 20%
- 10:45: Flagger increases to 25%
- 10:55: Flagger increases to 50%
- 11:05: Flagger increases to 100%
Outcome: Successful automated deployment
Canary Anti-Patterns
Not Monitoring Metrics
# Bad: Blind canary (no monitoring) def blind_canary(): """Blind canary deployment.""" # Deploy new version deploy_new_version() # Route 5% traffic set_canary_percentage(5) # Wait 10 minutes time.sleep(600) # Increase to 25% set_canary_percentage(25) # Wait 10 minutes time.sleep(600) # Increase to 50% set_canary_percentage(50) # Wait 10 minutes time.sleep(600) # Rollout to 100% set_canary_percentage(100)
Increasing Too Fast
# Bad: Fast canary (increases too fast) def fast_canary(): """Fast canary deployment.""" stages = [5, 25, 50, 100] duration = 1 # minute for stage in stages: set_canary_percentage(stage) time.sleep(duration * 60)
No Rollback Plan
# Bad: No rollback plan def canary_without_rollback(): """Canary deployment without rollback plan.""" # Deploy new version deploy_new_version() # Route 5% traffic set_canary_percentage(5) # Monitor metrics if not check_metrics(): # No rollback plan print("Metrics unhealthy, but no rollback plan") return # Continue rollout set_canary_percentage(25)
Summary Checklist
Planning
- Canary strategy defined
- Metrics thresholds set
- Rollback plan documented
- Monitoring configured
- Alerting set up
Deployment
- Deploy to canary
- Route initial traffic (5%)
- Monitor metrics
- Increase to 25%
- Monitor metrics
- Increase to 50%
- Monitor metrics
- Rollout to 100%
Monitoring
- Error rate monitored
- Latency monitored
- Business metrics monitored
- Resource usage monitored
- Alerts configured
Rollback
- Rollback triggers defined
- Rollback procedure documented
- Rollback tested
- Rollback automated
--- ## Quick Start ### Kubernetes Canary Deployment ```yaml # canary-deployment.yaml apiVersion: apps/v1 kind: Deployment metadata: name: app-canary spec: replicas: 1 # 5% of traffic selector: matchLabels: app: myapp version: canary template: metadata: labels: app: myapp version: canary spec: containers: - name: app image: myapp:v2 --- apiVersion: v1 kind: Service metadata: name: app-service spec: selector: app: myapp ports: - port: 80
Traffic Splitting
# Istio VirtualService for canary apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: app spec: hosts: - app.example.com http: - match: - headers: canary: exact: "true" route: - destination: host: app subset: canary weight: 100 - route: - destination: host: app subset: stable weight: 95 - destination: host: app subset: canary weight: 5
Production Checklist
- Traffic Splitting: Configure traffic splitting (5% initial)
- Monitoring: Set up comprehensive monitoring
- Metrics: Define success/failure metrics
- Alerts: Configure alerts for canary issues
- Rollback Plan: Automated rollback triggers
- Testing: Test canary deployment process
- Documentation: Document canary procedure
- Approval: Approval process for promotion
- Timeline: Define canary duration (e.g., 1 hour)
- Gradual Rollout: Plan for gradual traffic increase
- Communication: Notify team of canary deployment
- Validation: Validate canary health before promotion
Anti-patterns
❌ Don't: No Monitoring
# ❌ Bad - Deploy without monitoring apiVersion: apps/v1 kind: Deployment # No health checks or metrics!
# ✅ Good - With monitoring apiVersion: apps/v1 kind: Deployment spec: template: spec: containers: - name: app livenessProbe: httpGet: path: /health port: 8080 readinessProbe: httpGet: path: /ready port: 8080
❌ Don't: Too Fast Rollout
# ❌ Bad - Jump to 100% immediately weight: 5 # 5% # ... 1 minute later weight: 100 # 100% - Too fast!
# ✅ Good - Gradual increase # Stage 1: 5% for 1 hour weight: 5 # Stage 2: 25% for 1 hour weight: 25 # Stage 3: 50% for 1 hour weight: 50 # Stage 4: 100% weight: 100
❌ Don't: No Rollback Triggers
# ❌ Bad - Manual rollback only # No automatic triggers
# ✅ Good - Automatic rollback apiVersion: argoproj.io/v1alpha1 kind: Rollout spec: strategy: canary: steps: - setWeight: 5 - pause: { duration: 1h } - analysis: templates: - templateName: error-rate args: - name: error-rate value: "0.01" # Rollback if > 1%
Integration Points
- Blue-Green Deployment (
) - Alternative strategy26-deployment-strategies/blue-green-deployment/ - Rollback Strategies (
) - Rollback procedures26-deployment-strategies/rollback-strategies/ - Monitoring (
) - Deployment monitoring14-monitoring-observability/
Further Reading
- Kubernetes Canary Deployment
- Istio Traffic Management
- Argo Rollouts
- Rollback monitored
Post-Deployment
- Metrics reviewed
- Documentation updated
- Lessons learned
- Process improved