Vibeship-spawner-skills disaster-recovery

id: disaster-recovery

install

Clone the upstream repo:
git clone https://github.com/vibeforge1111/vibeship-spawner-skills
manifest: enterprise/disaster-recovery/skill.yaml

source content

id: disaster-recovery
name: Disaster Recovery
category: enterprise
description: >
  Use when designing disaster recovery strategies, defining RPO/RTO targets,
  implementing failover mechanisms, or conducting chaos engineering tests -
  covers active-active, pilot light, and backup strategies

patterns:
  golden_rules:
    - rule: "Define RTO/RPO first"
      reason: "Drives all architecture decisions"
    - rule: "Test regularly"
      reason: "Untested plans fail when needed"
    - rule: "Automate failover"
      reason: "Manual steps increase RTO"
    - rule: "Document runbooks"
      reason: "Stress impairs memory"
    - rule: "Include dependencies"
      reason: "DR is only as strong as weakest link"

service_tiers:
  mission_critical:
    rto: "15 minutes"
    rpo: "1 minute"
    mtpd: "1 hour"
    strategy: "active-active"
    examples:
      - "Payment processing"
      - "Core trading"
      - "Emergency systems"
  business_critical:
    rto: "1 hour"
    rpo: "15 minutes"
    mtpd: "4 hours"
    strategy: "warm-standby"
    examples:
      - "Order management"
      - "Customer portal"
      - "CRM"
  business_operational:
    rto: "4 hours"
    rpo: "1 hour"
    mtpd: "24 hours"
    strategy: "pilot-light"
    examples:
      - "Reporting"
      - "Analytics"
      - "Internal tools"
  business_support:
    rto: "24 hours"
    rpo: "4 hours"
    mtpd: "3 days"
    strategy: "backup-restore"
    examples:
      - "Development"
      - "Testing"
      - "Archives"
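The tier table above can be sketched as a lookup that a drill script validates measured recovery times against. This is an illustrative helper, not part of the skill manifest; `SERVICE_TIERS` and `meets_tier` are assumed names, with RTO/RPO values taken from the table.

```python
from datetime import timedelta

# Tier targets from the table above, parsed into timedeltas
SERVICE_TIERS = {
    "mission_critical":     {"rto": timedelta(minutes=15), "rpo": timedelta(minutes=1)},
    "business_critical":    {"rto": timedelta(hours=1),    "rpo": timedelta(minutes=15)},
    "business_operational": {"rto": timedelta(hours=4),    "rpo": timedelta(hours=1)},
    "business_support":     {"rto": timedelta(hours=24),   "rpo": timedelta(hours=4)},
}

def meets_tier(tier: str, measured_rto: timedelta, measured_rpo: timedelta) -> bool:
    """True if measured recovery times are within the tier's targets."""
    targets = SERVICE_TIERS[tier]
    return measured_rto <= targets["rto"] and measured_rpo <= targets["rpo"]
```

For example, restoring a customer portal (business_critical) in 40 minutes with 10 minutes of data loss would pass; the same numbers would fail the mission_critical tier.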

dr_strategies:
  backup_restore:
    cost: "$"
    rto: "Hours"
    rpo: "Hours"
    description: "Periodic backups, restore when needed"
  pilot_light:
    cost: "$$"
    rto: "10+ minutes"
    rpo: "Minutes"
    description: "Core services running, scale on failover"
  warm_standby:
    cost: "$$$"
    rto: "Minutes"
    rpo: "Seconds"
    description: "Scaled-down replica always running"
  active_active:
    cost: "$$$$"
    rto: "Seconds"
    rpo: "Near-zero"
    description: "Full redundancy, traffic to both sites"
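The cost/RTO trade-off above implies a simple selection rule: pick the cheapest strategy whose worst-case RTO still meets the target. A minimal sketch, where the numeric ceilings are rough readings of "Hours" / "10+ minutes" / "Minutes" / "Seconds" and are assumptions, not guarantees:

```python
# Strategies ordered cheapest first, with assumed worst-case RTO in seconds
STRATEGIES = [
    ("backup_restore", 8 * 3600),  # hours
    ("pilot_light",    30 * 60),   # tens of minutes
    ("warm_standby",   5 * 60),    # minutes
    ("active_active",  30),        # seconds
]

def cheapest_strategy(rto_target_seconds: int) -> str:
    """Return the cheapest strategy whose worst-case RTO meets the target."""
    for name, worst_case_rto in STRATEGIES:
        if worst_case_rto <= rto_target_seconds:
            return name
    return "active_active"  # tightest option available
```

A 24-hour target is satisfied by plain backup_restore, while a 60-second target forces active_active; this is why "Define RTO/RPO first" drives all architecture decisions.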

backup_types:
  - "full"
  - "incremental"
  - "differential"
  - "snapshot"
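The backup types differ mainly in what a restore needs: a full backup restores alone, a differential needs the last full plus only the newest differential, and incrementals need the last full plus every incremental since it, in order. A sketch of that restore-chain logic (the helper is hypothetical, assuming a homogeneous chain after the last full):

```python
def restore_chain(backups: list[tuple[str, str]]) -> list[str]:
    """backups: (label, type) in chronological order; returns labels to apply."""
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full][0]]
    tail = backups[last_full + 1:]
    differentials = [label for label, kind in tail if kind == "differential"]
    if differentials:
        # each differential captures all changes since the full,
        # so only the newest one matters
        chain.append(differentials[-1])
    else:
        # incrementals must all be replayed, oldest first
        chain.extend(label for label, kind in tail if kind == "incremental")
    return chain
```

The longer the incremental chain, the longer the restore and the larger the RTO, which is why the type chosen feeds back into the tier targets above.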

anti_patterns:
  - pattern: "Untested plans"
    problem: "Fail during actual disaster"
    solution: "Regular DR drills"
  - pattern: "Manual failover"
    problem: "Slow RTO, error-prone"
    solution: "Automate failover steps"
  - pattern: "Ignoring dependencies"
    problem: "Partial recovery"
    solution: "Map all dependencies"
  - pattern: "Same region backups"
    problem: "Lost with primary"
    solution: "Cross-region replication"
  - pattern: "Stale runbooks"
    problem: "Wrong procedures"
    solution: "Review after each test"
  - pattern: "No rollback plan"
    problem: "Stuck in broken state"
    solution: "Always plan failback"

implementation_checklist:
  planning:
    - "RTO/RPO defined for all services"
    - "Services tiered by criticality"
    - "DR strategy selected per tier"
    - "Dependencies mapped"
    - "Runbooks documented"
  implementation:
    - "Backups configured and encrypted"
    - "Cross-region replication active"
    - "Failover automation ready"
    - "DNS failover configured"
    - "Monitoring for DR metrics"
  testing:
    - "DR drill schedule established"
    - "Chaos experiments defined"
    - "Backup restore tested"
    - "Failover tested"
    - "RTO/RPO validated"
  operations:
    - "On-call procedures include DR"
    - "Communication plan ready"
    - "Escalation path defined"
    - "Post-incident review process"
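For the "RTO/RPO validated" item, a drill needs only three timestamps to compute what was actually achieved: when the outage started, when service was restored, and when the last good backup (or replica sync point) was taken. A minimal sketch; the function and field names are assumptions for illustration:

```python
from datetime import datetime, timedelta

def drill_results(outage_start: datetime,
                  service_restored: datetime,
                  last_good_backup: datetime) -> dict[str, timedelta]:
    """Achieved RTO/RPO from drill timestamps."""
    return {
        "achieved_rto": service_restored - outage_start,   # downtime window
        "achieved_rpo": outage_start - last_good_backup,   # data-loss window
    }
```

Comparing these achieved values against the tier targets closes the loop between the testing and planning sections of the checklist.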

handoffs:
  - skill: enterprise-architecture
    trigger: "architecture decisions for resilience"
  - skill: multi-tenancy
    trigger: "per-tenant DR requirements"

ecosystem:
  backup:
    - "AWS Backup"
    - "Azure Backup"
    - "Veeam"
    - "Velero (Kubernetes)"
  replication:
    - "PostgreSQL streaming replication"
    - "MySQL Group Replication"
    - "MongoDB Atlas"
    - "CockroachDB"
  chaos:
    - "Gremlin"
    - "Chaos Monkey"
    - "Litmus"
    - "Chaos Mesh"
  dns:
    - "Route 53"
    - "Azure Traffic Manager"
    - "Cloudflare"

sources:
  references:
    - "AWS Disaster Recovery Strategies"
    - "Gremlin Chaos Engineering Guide"
    - "Azure DR Best Practices"