# claude-skill-registry · backup-disaster-recovery
Implement backup strategies, disaster recovery plans, and data restoration procedures for protecting critical infrastructure and data.
## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/data/backup-disaster-recovery" \
       ~/.claude/skills/majiayu000-claude-skill-registry-backup-disaster-recovery \
  && rm -rf "$T"
```
Manifest: `skills/data/backup-disaster-recovery/SKILL.md`
# Backup and Disaster Recovery

## Overview
Design and implement comprehensive backup and disaster recovery strategies to ensure data protection, business continuity, and rapid recovery from infrastructure failures.
## When to Use
- Data protection and compliance
- Business continuity planning
- Disaster recovery planning
- Point-in-time recovery (see the WAL-archiving sketch after this list)
- Cross-region failover
- Data migration
- Compliance and audit requirements
- Recovery time objective (RTO) optimization
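For the point-in-time recovery case in particular, nightly dumps alone are not enough: PostgreSQL needs continuous WAL archiving so the database can be replayed to an arbitrary moment. A minimal sketch, assuming superuser `psql` access and the same `s3://my-backups` bucket used in the examples below:

```bash
# Enable continuous WAL archiving (archive_mode needs a server
# restart to take effect; archive_command can be reloaded live)
psql -c "ALTER SYSTEM SET archive_mode = 'on';"
psql -c "ALTER SYSTEM SET archive_command = 'aws s3 cp %p s3://my-backups/wal/%f';"
psql -c "SELECT pg_reload_conf();"

# To recover to a moment just before an incident, restore a base
# backup and set in postgresql.conf:
#   restore_command      = 'aws s3 cp s3://my-backups/wal/%f %p'
#   recovery_target_time = '2024-01-15 01:55:00'
```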
## Implementation Examples

### 1. Database Backup Configuration
```yaml
# postgres-backup-cronjob.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: backup-script
  namespace: databases
data:
  backup.sh: |
    #!/bin/bash
    set -euo pipefail

    BACKUP_DIR="/backups/postgresql"
    RETENTION_DAYS=30   # remote retention target; local copies are pruned after 7 days
    DB_HOST="${POSTGRES_HOST}"
    DB_PORT="${POSTGRES_PORT:-5432}"
    DB_USER="${POSTGRES_USER}"
    DB_PASSWORD="${POSTGRES_PASSWORD}"

    export PGPASSWORD="$DB_PASSWORD"

    # Create backup directory
    mkdir -p "$BACKUP_DIR"

    # Full backup (plain SQL, restorable with psql)
    BACKUP_FILE="$BACKUP_DIR/full-$(date +%Y%m%d-%H%M%S).sql"
    echo "Starting backup to $BACKUP_FILE"
    pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -v \
      --format=plain --no-owner --no-privileges > "$BACKUP_FILE"

    # Compress backup
    gzip "$BACKUP_FILE"
    echo "Backup compressed: ${BACKUP_FILE}.gz"

    # Upload to S3
    aws s3 cp "${BACKUP_FILE}.gz" \
      "s3://my-backups/postgres/$(date +%Y/%m/%d)/"

    # Clean local old backups
    find "$BACKUP_DIR" -type f -mtime +7 -delete

    # Verify backup by restoring into a scratch database
    # (plain-format dumps are restored with psql, not pg_restore)
    dropdb --if-exists -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore
    createdb -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore
    if gunzip -c "${BACKUP_FILE}.gz" | \
         psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" \
           --single-transaction -d test_restore >/dev/null 2>&1; then
      echo "Backup verification successful"
    fi
    dropdb -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore

    echo "Backup complete: ${BACKUP_FILE}.gz"
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: databases
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: backup-sa
          containers:
            - name: backup
              image: postgres:15-alpine
              env:
                - name: POSTGRES_HOST
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: host
                - name: POSTGRES_USER
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: username
                - name: POSTGRES_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-credentials
                      key: password
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: access-key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: secret-key
              volumeMounts:
                - name: backup-script
                  mountPath: /backup
                - name: backup-storage
                  mountPath: /backups
              command:
                - sh
                - -c
                - apk add --no-cache aws-cli && bash /backup/backup.sh
          volumes:
            - name: backup-script
              configMap:
                name: backup-script
                defaultMode: 0755
            - name: backup-storage
              emptyDir:
                sizeLimit: 100Gi
          restartPolicy: OnFailure
```
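After deploying the CronJob, it is worth exercising it once by hand instead of waiting for the 2 AM schedule. A one-off Job can be created from the CronJob (the `manual-backup-test` name below is arbitrary):

```bash
# Trigger the backup immediately
kubectl create job manual-backup-test --from=cronjob/postgres-backup -n databases

# Follow the run and inspect its output
kubectl get jobs -n databases -w
kubectl logs -n databases job/manual-backup-test
```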
### 2. Disaster Recovery Plan Template
```yaml
# disaster-recovery-plan.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-procedures
  namespace: operations
data:
  dr-runbook.md: |
    # Disaster Recovery Runbook

    ## RTO and RPO Targets
    - RTO (Recovery Time Objective): 4 hours
    - RPO (Recovery Point Objective): 1 hour

    ## Pre-Disaster Checklist
    - [ ] Verify backups are current
    - [ ] Test backup restoration process
    - [ ] Verify DR site resources are provisioned
    - [ ] Confirm failover DNS is configured

    ## Primary Region Failure

    ### Detection (0-15 minutes)
    - Alerting system detects primary region down
    - Incident commander declared
    - War room opened in Slack #incidents

    ### Initial Actions (15-30 minutes)
    - Verify primary region is truly down
    - Check backup systems in secondary region
    - Validate latest backup timestamp

    ### Failover Procedure (30 minutes - 2 hours)
    1. Validate backup integrity
    2. Restore database from latest backup
    3. Update application configuration
    4. Perform DNS failover to secondary region
    5. Verify application health

    ### Recovery Steps
    1. Restore from backup: `restore-backup.sh --backup-id=latest`
    2. Update DNS: `aws route53 change-resource-record-sets --cli-input-json file://failover.json`
    3. Verify: `curl https://myapp.com/health`
    4. Run smoke tests
    5. Monitor error rates and performance

    ## Post-Disaster
    - Document timeline and RCA
    - Update runbooks
    - Schedule post-mortem
    - Test backups again
---
apiVersion: v1
kind: Secret
metadata:
  name: dr-credentials
  namespace: operations
type: Opaque
stringData:
  # Placeholder values only -- in a real cluster, inject these via an
  # external secret manager rather than committing them to manifests.
  backup_aws_access_key: "AKIAIOSFODNN7EXAMPLE"
  backup_aws_secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  dr_site_password: "secure-password-here"
```
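The runbook's "Validate latest backup timestamp" step lends itself to automation. A minimal sketch that fails loudly when the newest backup exceeds the 1-hour RPO target, assuming GNU `date` and the `my-backups` bucket used elsewhere on this page:

```bash
#!/bin/bash
# check-rpo.sh - fail if the newest backup is older than the RPO
set -euo pipefail

BUCKET="my-backups"    # assumed bucket name from the examples above
PREFIX="postgres/"
RPO_SECONDS=3600       # 1 hour, matching the runbook's RPO target

# Newest object under the prefix, by LastModified
latest=$(aws s3api list-objects-v2 \
  --bucket "$BUCKET" --prefix "$PREFIX" \
  --query 'sort_by(Contents, &LastModified)[-1].LastModified' \
  --output text)

age=$(( $(date +%s) - $(date -d "$latest" +%s) ))

if [ "$age" -gt "$RPO_SECONDS" ]; then
  echo "RPO BREACH: newest backup is ${age}s old (limit ${RPO_SECONDS}s)" >&2
  exit 1
fi
echo "OK: newest backup is ${age}s old"
```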
### 3. Backup and Restore Script
```bash
#!/bin/bash
# backup-restore.sh - Complete backup and restore utilities
set -euo pipefail

BACKUP_BUCKET="s3://my-backups"
BACKUP_RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

log_info()  { echo -e "${GREEN}[INFO]${NC} $1"; }
log_error() { echo -e "${RED}[ERROR]${NC} $1"; }
log_warn()  { echo -e "${YELLOW}[WARN]${NC} $1"; }

# Backup function
backup_all() {
    local environment=$1
    log_info "Starting backup for $environment environment"

    # Backup databases
    log_info "Backing up databases..."
    for db in myapp_db analytics_db; do
        local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${TIMESTAMP}.sql.gz"
        pg_dump "$db" | gzip | aws s3 cp - "$backup_file"
        log_info "Backed up $db to $backup_file"
    done

    # Backup Kubernetes resources
    log_info "Backing up Kubernetes resources..."
    kubectl get all,configmap,secret,ingress,pvc -A -o yaml |
        gzip | aws s3 cp - "$BACKUP_BUCKET/$environment/kubernetes-${TIMESTAMP}.yaml.gz"
    log_info "Kubernetes resources backed up"

    # Backup volumes (assumes a long-running backup-pod that mounts
    # each PVC at /data; adjust the pod and path for your cluster)
    log_info "Backing up persistent volumes..."
    for pvc in $(kubectl get pvc -A -o name); do
        local pvc_name
        pvc_name=$(echo "$pvc" | cut -d'/' -f2)
        log_info "Backing up PVC: $pvc_name"
        kubectl exec -n default backup-pod -- tar czf - /data |
            aws s3 cp - "$BACKUP_BUCKET/$environment/volumes/${pvc_name}-${TIMESTAMP}.tar.gz"
    done

    log_info "All backups completed successfully"
}

# Restore function
restore_all() {
    local environment=$1
    local backup_date=$2

    log_warn "Restoring from backup date: $backup_date"
    read -rp "Are you sure? (yes/no): " confirm
    if [ "$confirm" != "yes" ]; then
        log_error "Restore cancelled"
        exit 1
    fi

    # Restore databases
    log_info "Restoring databases..."
    for db in myapp_db analytics_db; do
        local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${backup_date}.sql.gz"
        log_info "Restoring $db from $backup_file"
        aws s3 cp "$backup_file" - | gunzip | psql "$db"
    done

    # Restore Kubernetes resources
    log_info "Restoring Kubernetes resources..."
    local k8s_backup="$BACKUP_BUCKET/$environment/kubernetes-${backup_date}.yaml.gz"
    aws s3 cp "$k8s_backup" - | gunzip | kubectl apply -f -

    log_info "Restore completed successfully"
}

# Test restore
test_restore() {
    local environment=$1
    log_info "Testing restore procedure..."

    # Get latest backup
    local latest_backup
    latest_backup=$(aws s3 ls "$BACKUP_BUCKET/$environment/databases/" |
        sort | tail -n 1 | awk '{print $4}')

    if [ -z "$latest_backup" ]; then
        log_error "No backups found"
        exit 1
    fi
    log_info "Testing restore from: $latest_backup"

    # Create a scratch database; capture the name once so the
    # CREATE and the restore target the same database
    local test_db="test_restore_$(date +%s)"
    psql -c "CREATE DATABASE $test_db;"

    # Download and restore
    aws s3 cp "$BACKUP_BUCKET/$environment/databases/$latest_backup" - |
        gunzip | psql "$test_db"

    log_info "Test restore successful"
    psql -c "DROP DATABASE $test_db;"
}

# List backups
list_backups() {
    local environment=$1
    log_info "Available backups for $environment:"
    aws s3 ls "$BACKUP_BUCKET/$environment/" --recursive |
        grep -E '\.sql\.gz|\.yaml\.gz|\.tar\.gz'
}

# Cleanup old backups (find(1) cannot walk S3; list objects and
# prune anything older than the retention window instead)
cleanup_old_backups() {
    local environment=$1
    local bucket=${BACKUP_BUCKET#s3://}
    local cutoff
    cutoff=$(date -u -d "-${BACKUP_RETENTION_DAYS} days" +%Y-%m-%dT%H:%M:%SZ)
    log_info "Cleaning up backups older than $BACKUP_RETENTION_DAYS days"
    aws s3api list-objects-v2 --bucket "$bucket" --prefix "$environment/" \
        --query "Contents[?LastModified<='${cutoff}'].Key" --output text |
        tr '\t' '\n' | while read -r key; do
            [ -n "$key" ] && [ "$key" != "None" ] && aws s3 rm "s3://$bucket/$key"
        done
    log_info "Cleanup completed"
}

# Main
main() {
    case "${1:-}" in
        backup)  backup_all "${2:-production}" ;;
        restore) restore_all "${2:-production}" "${3:-}" ;;
        test)    test_restore "${2:-production}" ;;
        list)    list_backups "${2:-production}" ;;
        cleanup) cleanup_old_backups "${2:-production}" ;;
        *)
            echo "Usage: $0 {backup|restore|test|list|cleanup} [environment] [backup-date]"
            exit 1
            ;;
    esac
}

main "$@"
```
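Typical invocations (the environment defaults to `production`; the date passed to `restore` must match a timestamp embedded in the backup file names, and the `20240115_020000` value here is illustrative):

```bash
# Nightly backup of the production environment
./backup-restore.sh backup production

# See what is available, then restore a specific snapshot
./backup-restore.sh list production
./backup-restore.sh restore production 20240115_020000

# Periodically prove that restores actually work, and prune old data
./backup-restore.sh test production
./backup-restore.sh cleanup production
```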
### 4. Cross-Region Failover
```yaml
# route53-failover.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: failover-config
  namespace: operations
data:
  failover.sh: |
    #!/bin/bash
    set -euo pipefail

    PRIMARY_REGION="us-east-1"
    SECONDARY_REGION="us-west-2"
    DOMAIN="myapp.com"
    HOSTED_ZONE_ID="Z1234567890ABC"

    echo "Configuring DNS failover toward $SECONDARY_REGION"

    # Get primary endpoint and its alias hosted zone
    PRIMARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
      --region "$PRIMARY_REGION" \
      --query 'LoadBalancers[0].DNSName' --output text)
    PRIMARY_LB_ZONE=$(aws elbv2 describe-load-balancers \
      --region "$PRIMARY_REGION" \
      --query 'LoadBalancers[0].CanonicalHostedZoneId' --output text)

    # Get secondary endpoint and its alias hosted zone
    SECONDARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
      --region "$SECONDARY_REGION" \
      --query 'LoadBalancers[0].DNSName' --output text)
    SECONDARY_LB_ZONE=$(aws elbv2 describe-load-balancers \
      --region "$SECONDARY_REGION" \
      --query 'LoadBalancers[0].CanonicalHostedZoneId' --output text)

    # Upsert PRIMARY/SECONDARY failover records. Alias records take
    # no TTL, and each AliasTarget must use its own load balancer's
    # canonical hosted zone ID (it differs per region).
    aws route53 change-resource-record-sets \
      --hosted-zone-id "$HOSTED_ZONE_ID" \
      --change-batch '{
        "Changes": [
          {
            "Action": "UPSERT",
            "ResourceRecordSet": {
              "Name": "'$DOMAIN'",
              "Type": "A",
              "SetIdentifier": "Primary",
              "Failover": "PRIMARY",
              "AliasTarget": {
                "HostedZoneId": "'$PRIMARY_LB_ZONE'",
                "DNSName": "'$PRIMARY_ENDPOINT'",
                "EvaluateTargetHealth": true
              }
            }
          },
          {
            "Action": "UPSERT",
            "ResourceRecordSet": {
              "Name": "'$DOMAIN'",
              "Type": "A",
              "SetIdentifier": "Secondary",
              "Failover": "SECONDARY",
              "AliasTarget": {
                "HostedZoneId": "'$SECONDARY_LB_ZONE'",
                "DNSName": "'$SECONDARY_ENDPOINT'",
                "EvaluateTargetHealth": false
              }
            }
          }
        ]
      }'

    echo "Failover records configured"
```
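Because the primary alias sets `EvaluateTargetHealth: true`, Route 53 begins answering with the secondary record on its own once the primary load balancer's targets go unhealthy. A quick way to see which side DNS is currently serving (using the same endpoint lookups as the script):

```bash
# Current answer for the apex record, bypassing local caches
dig +short myapp.com @8.8.8.8

# IPs behind each regional load balancer, for comparison
dig +short "$(aws elbv2 describe-load-balancers --region us-east-1 \
  --query 'LoadBalancers[0].DNSName' --output text)"
dig +short "$(aws elbv2 describe-load-balancers --region us-west-2 \
  --query 'LoadBalancers[0].DNSName' --output text)"
```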
## Best Practices

### ✅ DO
- Perform regular backup testing
- Use multiple backup locations
- Implement automated backups
- Document recovery procedures
- Test failover procedures regularly
- Monitor backup completion
- Use immutable backups
- Encrypt backups at rest and in transit (see the S3 sketch after this list)
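On AWS, the last two items map naturally to S3 Object Lock and default bucket encryption. A minimal sketch against the `my-backups` bucket assumed throughout this page; note that Object Lock can only be enabled at bucket creation, and regions other than us-east-1 also need `--create-bucket-configuration`:

```bash
# Create the bucket with Object Lock enabled (WORM-capable from day one)
aws s3api create-bucket --bucket my-backups \
  --object-lock-enabled-for-bucket

# Make every new object immutable for 30 days (COMPLIANCE mode:
# not even the root account can shorten the retention)
aws s3api put-object-lock-configuration --bucket my-backups \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}
  }'

# Encrypt everything at rest with KMS by default
aws s3api put-bucket-encryption --bucket my-backups \
  --server-side-encryption-configuration '{
    "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
  }'
```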
### ❌ DON'T
- Rely on a single backup location
- Ignore backup failures
- Store backups with production data
- Skip testing recovery procedures
- Over-compress backups at the expense of restore speed
- Forget to verify backup integrity
- Store encryption keys with backups
- Assume backups are automatically working