Awesome-omni-skill sos-emergency

Ship Operating System: Emergency Kubernetes cluster recovery, Talos reset procedures, Synology Container Manager recovery, and graceful shutdown protocols. Trigger with /sos

install

source · Clone the upstream repo

git clone https://github.com/diegosouzapw/awesome-omni-skill

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/devops/sos-emergency" ~/.claude/skills/diegosouzapw-awesome-omni-skill-sos-emergency && rm -rf "$T"

manifest: skills/devops/sos-emergency/SKILL.md

source content

Ship Operating System (SOS) - Emergency Recovery

"When everything fails, this skill has your back."

I provide emergency recovery procedures for Sunkworks home lab infrastructure. This skill embraces the reality that systems fail—often spectacularly, often live on stream—and focuses on getting you back to operational status.

Enterprise Warning: These are emergency procedures. Some operations are destructive. Always verify you're targeting the correct cluster/node before executing recovery commands.

Slash Command

/sos

Runs emergency diagnostic workflow:

Identify cluster connectivity status
Check node health and etcd quorum
Assess recovery options based on failure mode
Recommend appropriate recovery procedure

Usage: Type

/sos

when systems are unresponsive or in critical failure state.

Script Verification: Before executing, verify the script integrity:

sha256sum .github/skills/sos-emergency/scripts/diagnose.sh
# Expected: 8691dce29cdc2a6ab7a02ff71f04fb35d1da30a44e8f0e794eef5c9afe7826c2

Execute diagnostics:

bash .github/skills/sos-emergency/scripts/diagnose.sh

When I Activate

```
/sos
```
(slash command)
"Everything is broken"
"Cluster unreachable"
"etcd quorum lost"
"Talos node hung"
"Synology Container Manager crashed"
"Power failure protocol"
"Emergency shutdown"
"Abandon ship"

Expected Failure Modes

Kubernetes Cluster Failures

Failure Mode	Symptoms	Est. Recovery Time
Single control plane down	API server unreachable, etcd errors	15-30 min
etcd quorum loss	All control planes down, cluster frozen	1-2 hours
Node kubelet crash	Pods stuck Terminating, node NotReady	10-20 min
CNI failure	Pods running but no network connectivity	30-60 min

Talos-Specific Failures

Failure Mode	Symptoms	Est. Recovery Time
Extension deadlock	Node boots but services hang (Tailscale subnet router)	30-45 min
Config drift	talosctl commands rejected, authentication failures	15-30 min
Disk pressure	Node evicting pods, kubelet OOM	20-40 min

Synology Infrastructure Failures

Failure Mode	Symptoms	Est. Recovery Time
Container Manager crash	Docker daemon unresponsive, WebUI timeout	10-20 min
Volume mount failures	Containers crash loop, permission errors	15-30 min
DSM update breakage	Services fail post-update	30-60 min

Core Recovery Procedures

1. Kubernetes Cluster Recovery

Check Cluster Connectivity

# Test API server reachability
kubectl cluster-info --request-timeout=5s

# If unreachable, check kubeconfig context
kubectl config current-context
kubectl config get-contexts

etcd Health Check

# For Talos clusters
talosctl -n <control-plane-ip> etcd status

# Check etcd member list
talosctl -n <control-plane-ip> etcd members

# Verify etcd alarm status (capacity/corruption)
talosctl -n <control-plane-ip> etcd alarm list

etcd Quorum Recovery (DESTRUCTIVE - Single Node)

Use only when quorum is lost and cannot be recovered normally.

# 1. Identify the node with most recent etcd data
talosctl -n <node1-ip> etcd status
talosctl -n <node2-ip> etcd status
talosctl -n <node3-ip> etcd status

# 2. Remove failed members from the surviving node
talosctl -n <surviving-node> etcd remove-member <failed-member-id>

# 3. If single node remains, force new cluster
talosctl -n <surviving-node> etcd forfeit-leadership

etcd Snapshot Restore

# 1. Take snapshot from healthy node (if available)
talosctl -n <healthy-node> etcd snapshot /tmp/etcd-backup.db

# 2. Copy snapshot and restore (cluster down scenario)
talosctl -n <node-ip> bootstrap --recover-from=/path/to/etcd-backup.db

2. Talos Reset Procedures

The Tailscale Subnet Router Deadlock

From Moon and Back Episode: Node hangs during boot when Tailscale extension can't reach coordination server.

Symptoms:

Node boots but never becomes Ready
```
talosctl dmesg
```
shows Tailscale initialization timeout
Network extension blocks kubelet startup

Recovery:

# Option 1: Wait for timeout (may take 10+ minutes)
# Tailscale extension has built-in timeout, node may recover

# Option 2: Force reset to maintenance mode
talosctl -n <node-ip> reset --graceful=false --reboot

# Option 3: Remove extension config temporarily
# Edit machine config to comment out Tailscale extension
talosctl -n <node-ip> edit machineconfig
# Then apply:
talosctl -n <node-ip> apply-config --insecure

Complete Node Reset

DESTRUCTIVE: Wipes node and reinstalls from config.

# Graceful reset (preserves etcd data if possible)
talosctl -n <node-ip> reset --graceful

# Hard reset (complete wipe)
talosctl -n <node-ip> reset --graceful=false --reboot

# Reset with system disk wipe
talosctl -n <node-ip> reset --system-labels-to-wipe STATE --system-labels-to-wipe EPHEMERAL

Recover Unresponsive Node

# 1. Check if maintenance mode is available
talosctl -n <node-ip> --endpoints <node-ip> version --insecure

# 2. If responsive in maintenance, reapply config
talosctl -n <node-ip> apply-config -f machineconfig.yaml --insecure

# 3. If completely unresponsive, physical intervention required
# Boot from Talos ISO and reinstall

3. Synology Emergency Access

Container Manager Crash Recovery

When DSM Container Manager (Docker) becomes unresponsive:

# SSH to Synology (enable SSH in Control Panel first)
ssh admin@<synology-ip>

# Check Docker daemon status
sudo synosystemctl status pkgctl-Docker

# Restart Docker service
sudo synosystemctl restart pkgctl-Docker

# If restart fails, stop and check logs
sudo synosystemctl stop pkgctl-Docker
sudo cat /var/log/synopkg.log | grep -i docker

# Force kill if hung
sudo killall -9 dockerd
sudo synosystemctl start pkgctl-Docker

Emergency SSH Access Setup

Prepare before you need it:

# Enable SSH in DSM Control Panel > Terminal & SNMP
# Set non-standard port for security

# Add SSH key for passwordless access
ssh-copy-id -i ~/.ssh/id_ed25519 admin@<synology-ip>

# Test access
ssh admin@<synology-ip> "sudo whoami"

Volume Mount Recovery

# SSH to Synology and check mounts
ssh admin@<synology-ip>

# List current mounts
mount | grep volume

# Check disk status
sudo cat /proc/mdstat

# If volume not mounted, check for errors
sudo dmesg | tail -50

# Remount volume (CAREFUL: verify volume number)
sudo mount /volume1

4. Abandon Ship Protocol

Graceful Shutdown Sequence

For planned power outages or emergency situations:

#!/bin/bash
# abandon-ship.sh - Graceful home lab shutdown

echo "=== ABANDON SHIP PROTOCOL INITIATED ==="
echo "Time: $(date)"

# Phase 1: Suspend Flux reconciliation (prevent drift during shutdown)
echo "Phase 1: Suspending Flux reconciliation..."
kubectl annotate fluxinstance flux -n flux-system \
  kustomize.toolkit.fluxcd.io/suspend=true

# Phase 2: Scale down workloads
echo "Phase 2: Scaling down deployments..."
kubectl get deployments -A -o name | \
  xargs -I{} kubectl scale {} --replicas=0

# Phase 3: Drain nodes gracefully
echo "Phase 3: Draining worker nodes..."
for node in $(kubectl get nodes -l node-role.kubernetes.io/control-plane!= -o name); do
  kubectl drain $node --ignore-daemonsets --delete-emptydir-data --force
done

# Phase 4: Drain control planes
echo "Phase 4: Draining control planes..."
for node in $(kubectl get nodes -l node-role.kubernetes.io/control-plane -o name); do
  kubectl drain $node --ignore-daemonsets --delete-emptydir-data --force
done

# Phase 5: Shutdown Talos nodes
echo "Phase 5: Shutting down Talos nodes..."
for node_ip in <worker-ips>; do
  talosctl -n $node_ip shutdown
done

# Wait for workers
sleep 30

for node_ip in <control-plane-ips>; do
  talosctl -n $node_ip shutdown
done

# Phase 6: Shutdown Synology (if integrated)
echo "Phase 6: Shutting down Synology NAS..."
ssh admin@<synology-ip> "sudo shutdown -h now"

echo "=== SHUTDOWN COMPLETE ==="
echo "Safe to remove power in 60 seconds"

UPS Integration Check

# Check UPS status via NUT (Network UPS Tools)
upsc <ups-name>@localhost

# Monitor battery level
upsc <ups-name>@localhost battery.charge

# Check if on battery power
upsc <ups-name>@localhost ups.status
# OL = Online (power good)
# OB = On Battery (power lost)
# LB = Low Battery (critical)

Emergency Power Script

#!/bin/bash
# emergency-power.sh - Triggered by UPS on low battery

BATTERY_LEVEL=$(upsc <ups-name>@localhost battery.charge 2>/dev/null)

if [ "$BATTERY_LEVEL" -lt 20 ]; then
  echo "CRITICAL: Battery at ${BATTERY_LEVEL}%"
  echo "Initiating emergency shutdown..."
  bash /path/to/abandon-ship.sh
fi

Recovery Playbooks

Playbook 1: "I Can't Reach My Cluster" (15-30 min)

Check network connectivity:
```
ping <control-plane-ip>
```
Check VPN/Tailscale:
```
tailscale status
```
Check kubeconfig:
```
kubectl config current-context
```

Test direct connection:

kubectl --server=https://<ip>:6443 cluster-info

SSH to node and check kubelet:
```
talosctl -n <ip> service kubelet
```

Playbook 2: "Node Won't Come Back Up" (30-60 min)

Check node status:
```
talosctl -n <ip> version --insecure
```
Review boot logs:
```
talosctl -n <ip> dmesg | tail -100
```
Check etcd membership:
```
talosctl -n <other-node> etcd members
```
If extension hang: Apply config without problematic extension
If total failure: Reset and reinstall from known-good config

Playbook 3: "Everything Is Down" (1-2 hours)

Identify last known good state (Terraform state, git history)
Check physical infrastructure (power, network switches)
Boot minimal cluster (single control plane if needed)
Restore etcd from most recent snapshot
Gradually bring up additional nodes
Verify Flux sync before re-enabling reconciliation

MCP Server Integration

After initial recovery, the Flux Operator MCP Server accelerates verification:

Post-Recovery Health Check

Once cluster connectivity is restored, use MCP tools instead of manual kubectl:

MCP Tool: get_flux_instance
→ Comprehensive Flux controller status and CRD health
→ Faster than iterative /flux-status diagnosis

MCP Tool: get_kubernetes_resources  
→ Query all pods, check for CrashLoopBackOff across namespaces
→ Identify remaining issues post-recovery

Setup: See flux-operator skill Step 10 for MCP server configuration.

Integration Points

Flux Operator → Use MCP server for comprehensive post-recovery verification
Prometheus Observer: Check
```
/prometheus-status
```
for alert state after cluster recovery
Post-Mortem Author: After recovery, use
```
/postmortem
```
to document what happened

Read-Only Diagnostic Commands

Safe commands for assessment without modification:

# Cluster state
kubectl get nodes -o wide
kubectl get pods -A | grep -v Running
kubectl get events -A --sort-by=.lastTimestamp | tail -20

# Talos diagnostics
talosctl -n <node-ip> version
talosctl -n <node-ip> service
talosctl -n <node-ip> dmesg | tail -50

# Network diagnostics
talosctl -n <node-ip> netstat -tulpn

# etcd diagnostics
talosctl -n <node-ip> etcd status

Sunkworks Episode Notes

"The best debugging happens live, with an audience watching you fail."

This skill is designed for live-stream troubleshooting where:

Mistakes are educational
Recovery time matters for audience engagement
Documentation happens in real-time
Iterative debugging is expected and embraced