# nexus-agents · infrastructure-management

## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/williamzujkowski/nexus-agents
```

Claude Code · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/williamzujkowski/nexus-agents "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/infrastructure-management" ~/.claude/skills/williamzujkowski-nexus-agents-infrastructure-management \
  && rm -rf "$T"
```

## Manifest · `skills/infrastructure-management/SKILL.md`
# Infrastructure Management Skill

## Overview

Manages physical and single-board-computer (SBC) infrastructure with awareness of hardware boot times, access hierarchies, and out-of-band (OOB) management capabilities.
## Access Strategy — Try in Order

1. SSH key-based — primary access method
2. SSH password — fallback if key fails
3. Tailscale/VPN — if direct SSH is unreachable
4. OOB management (iDRAC/iLO/IPMI) — for power cycling and console access when SSH is down
5. Serial console — last remote option
6. Physical access — keyboard/monitor as final resort

Always maintain at least two working access methods per host.
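As a minimal sketch (assuming placeholder user, host, and iDRAC values, which are not part of the skill), the fallback order can be scripted:

```bash
#!/usr/bin/env bash
# Hypothetical helper: walk the access hierarchy for one host and report
# the first method that responds. All host values are placeholders.
check_access() {
  local user=$1 host=$2 idrac=$3

  # 1. SSH key-based (BatchMode=yes disables password prompts)
  if ssh -o ConnectTimeout=2 -o BatchMode=yes "$user@$host" true 2>/dev/null; then
    echo "ssh-key"; return 0
  fi

  # 2. SSH password: requires explicit user approval, so only flag it here
  echo "note: ssh key failed; password auth needs explicit approval" >&2

  # 3./4. OOB management reachable? (iDRAC-style endpoint shown as one example)
  if curl -sk --connect-timeout 5 "https://$idrac/data?get=pwState" >/dev/null 2>&1; then
    echo "oob"; return 0
  fi

  # 5./6. Serial console and physical access remain manual steps
  echo "unreachable"; return 1
}

check_access admin host01.lan 10.0.0.120
```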
## Phase 1: Connectivity Audit

For each managed host, check access:

```bash
# SSH connectivity check (2s timeout)
ssh -o ConnectTimeout=2 -o BatchMode=yes USER@HOST "echo ok" 2>&1

# Check SSH via password (if key fails)
# NOTE: sshpass usage requires explicit user approval

# Check if OOB/iDRAC is reachable
curl -sk --connect-timeout 5 "https://IDRAC_IP/data?get=pwState" 2>&1 || echo "iDRAC unreachable"

# IPMI ping check
ipmitool -I lanplus -H IPMI_IP -U root -P PASSWORD power status 2>&1
```
Report format:
```text
Host:      hostname (IP)
SSH Key:   OK | FAIL (reason)
SSH Pass:  OK | FAIL | NOT_TESTED
OOB:       OK (iDRAC6/iLO4/IPMI) | UNREACHABLE
Boot Time: ~30s (SBC) | ~3min (desktop) | ~10min (enterprise)
Status:    HEALTHY | DEGRADED | UNREACHABLE
```
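One possible way to run the audit across an inventory and emit those fields (the `hosts.txt` format here is an assumption, not something the skill prescribes):

```bash
#!/usr/bin/env bash
# Assumed inventory format: one "user host idrac_ip" triple per line.
while read -r user host idrac; do
  if ssh -o ConnectTimeout=2 -o BatchMode=yes "$user@$host" "echo ok" >/dev/null 2>&1; then
    key=OK status=HEALTHY
  else
    key=FAIL status=DEGRADED
  fi
  if curl -sk --connect-timeout 5 "https://$idrac/data?get=pwState" >/dev/null 2>&1; then
    oob=OK
  else
    oob=UNREACHABLE
    [ "$key" = FAIL ] && status=UNREACHABLE
  fi
  printf 'Host: %s\n  SSH Key: %s\n  OOB: %s\n  Status: %s\n' "$host" "$key" "$oob" "$status"
done < hosts.txt
```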
## Phase 2: Hardware Health

Query available health data from each host:

```bash
# Temperature (via SSH)
ssh HOST "cat /sys/class/thermal/thermal_zone*/temp 2>/dev/null || sensors 2>/dev/null"

# Disk health
ssh HOST "df -h && smartctl -a /dev/sda 2>/dev/null | grep -E 'Health|Temperature|Reallocated'"

# Memory
ssh HOST "free -h"

# Uptime and load
ssh HOST "uptime"

# Docker status (if applicable)
ssh HOST "docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}' 2>/dev/null"
```
For iDRAC-equipped servers:
```bash
# Sensor readings via REST API
curl -sk --cookie "session_cookie" "https://IDRAC_IP/data?get=tempprobes"
curl -sk --cookie "session_cookie" "https://IDRAC_IP/data?get=fanstatus"

# System Event Log
ssh IDRAC_IP "racadm getsel" 2>/dev/null
```
## Phase 3: Recovery Actions

When a host is unreachable:

1. Wait for boot — enterprise servers with lots of RAM take 10-15 minutes
2. Try OOB power cycle — `ipmitool power cycle` or the iDRAC web UI/API
3. Check network — ping the gateway, check the switch port
4. Serial console — if available via OOB
5. Physical intervention — document what's needed, create an issue
### Boot Time Reference

| Hardware Type | Expected Boot Time |
|---|---|
| Raspberry Pi / SBC | 30-60 seconds |
| Desktop / small server | 1-3 minutes |
| 1U/2U rack server (≤64GB RAM) | 3-5 minutes |
| Enterprise server (128GB+ RAM) | 8-15 minutes |
| High-memory server (512GB+ RAM) | 12-20 minutes |
Do NOT declare a server failed until at least 2x the expected boot time has passed.
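A hedged sketch combining the Phase 3 OOB power cycle with this rule: poll SSH until up to 2x the expected boot time has elapsed (host values and `IPMI_PASSWORD` are placeholders):

```bash
#!/usr/bin/env bash
# Power cycle via IPMI, then wait up to 2x the expected boot time for SSH.
HOST=server01 IPMI_IP=10.0.0.121 BOOT_SECS=900   # enterprise class: ~15 min

ipmitool -I lanplus -H "$IPMI_IP" -U root -P "$IPMI_PASSWORD" power cycle

deadline=$(( $(date +%s) + 2 * BOOT_SECS ))
until ssh -o ConnectTimeout=2 -o BatchMode=yes "admin@$HOST" true 2>/dev/null; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "FAIL: $HOST not back within 2x expected boot time" >&2
    exit 1
  fi
  sleep 15
done
echo "$HOST is back up"
```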
## Phase 4: Preventive Checks

For SBC hosts (Raspberry Pi):

- Check SD card health: `sudo dmesg | grep -i "mmc\|error\|read-only"`
- Verify USB boot if applicable
- Check power supply voltage: `vcgencmd measure_volts`
- Monitor temperature: `vcgencmd measure_temp`
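The SBC checks above, gathered into one pass (a sketch for Raspberry Pi OS, where `vcgencmd` is available):

```bash
#!/usr/bin/env bash
# Run on the Pi itself. vcgencmd is Raspberry Pi-specific.
sudo dmesg | grep -iE 'mmc|error|read-only' | tail -5   # recent SD card trouble
vcgencmd measure_volts                                  # core voltage
vcgencmd measure_temp                                   # SoC temperature

# A read-only root filesystem is a common end-of-life SD card symptom
awk '$2 == "/" && $4 ~ /(^|,)ro(,|$)/ { print "WARNING: / is mounted read-only" }' /proc/mounts
```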
For enterprise servers:
- Review System Event Log for predictive failures
- Check RAID status: `ssh HOST "cat /proc/mdstat 2>/dev/null || megacli -LDInfo -Lall -aALL 2>/dev/null"`
- Verify firmware versions against known-good baselines
## Output Format

Produce a summary with:

```markdown
## Infrastructure Status Report

### Hosts Summary

| Host | IP | SSH | OOB | Health | Boot Est. |
|------|-----|-----|-----|--------|-----------|
| ...  | ... | ... | ... | ...    | ...       |

### Findings

- [CRITICAL] Host X unreachable via all methods
- [WARNING] Host Y disk SMART warning
- [INFO] Host Z uptime 45 days, consider updates

### Recommended Actions

1. ...
2. ...
```
## Phase 5: BOSH/CF Deployment Verification

For BOSH-managed infrastructure:

```bash
# Verify director health
source ~/deployments/bosh/env.sh
bosh env  # Director reachable?

# Check all VMs running
bosh vms  # All instances "running"?

# Check director processes
ssh -i <key> jumpbox@DIRECTOR_IP "sudo monit summary"
# Expected: nats, postgres, blobstore_nginx, director, workers, health_monitor, lxd_cpi

# Verify CredHub on director
curl -sk https://DIRECTOR_IP:8844/info  # Should return JSON with app name "CredHub"
credhub find                            # Should list credentials

# BBR readiness
bbr director --host DIRECTOR_IP --username bbr --private-key-path bbr.pem pre-backup-check
```
### Post-Deployment Checklist

After any `bosh create-env` or `bosh deploy`:

- Process check: `monit summary` on affected VMs — all processes "running"
- Connectivity check: curl service endpoints (CredHub :8844, UAA :8443)
- VM count: `bosh vms` matches the expected count
- Dependent services: verify services that depend on the updated component
- Smoke tests: `bosh run-errand smoke-tests` if available
- Backup readiness: `bbr pre-backup-check` still passes
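The checklist as a runnable sketch (the expected VM count, director address, and key path are assumptions for illustration):

```bash
#!/usr/bin/env bash
EXPECTED_VMS=5         # assumption: adjust to your deployment
DIRECTOR_IP=10.0.0.6   # placeholder

running=$(bosh vms 2>/dev/null | grep -c running)
[ "$running" -eq "$EXPECTED_VMS" ] || echo "WARNING: $running running VMs, expected $EXPECTED_VMS"

# Connectivity checks: a TLS response is enough to show the service is up
curl -sk --connect-timeout 5 "https://$DIRECTOR_IP:8844/info" >/dev/null || echo "WARNING: CredHub :8844 not responding"
curl -sk --connect-timeout 5 "https://$DIRECTOR_IP:8443/" >/dev/null || echo "WARNING: UAA :8443 not responding"

# Smoke tests and backup readiness
bosh run-errand smoke-tests || echo "WARNING: smoke tests failed or errand absent"
bbr director --host "$DIRECTOR_IP" --username bbr --private-key-path bbr.pem pre-backup-check
```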
### Common Ops File Dependencies

| Ops File | Depends On | Provides |
|---|---|---|
| `credhub.yml` | `uaa.yml` | CredHub on director (:8844) |
| `uaa.yml` | (base) | UAA on director (:8443) |
| `bbr.yml` | (base) | backup-and-restore-sdk |
| CPI ops (e.g., Incus) | (base) | VM lifecycle management |
**CRITICAL:** Missing `uaa.yml` when `credhub.yml` is included causes CredHub to silently fail to start. Always check `monit summary` after ops file changes.
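For example, a hedged sketch of keeping the pair together in a `create-env` invocation (paths follow the upstream `bosh-deployment` layout; your repo layout may differ, and CPI/network ops files plus variables are elided):

```bash
bosh create-env bosh-deployment/bosh.yml \
  --state state.json \
  --vars-store creds.yml \
  -o bosh-deployment/uaa.yml \
  -o bosh-deployment/credhub.yml
# plus your CPI/network ops files and -v variables

# Then confirm the credhub process actually came up:
ssh -i <key> jumpbox@DIRECTOR_IP "sudo monit summary"
```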
## Phase 6: Documentation-Reality Drift Check

Verify documentation against the live system:

```bash
# VM count (compare against the README)
bosh vms 2>/dev/null | grep -c running

# Service inventory
systemctl list-units --state=running --type=service | grep -E "podman|grafana|loki"

# Tool availability (verify before referencing in docs)
which terraform terragrunt make 2>/dev/null

# Network topology
ip -br addr show | grep -E "bond|vlan"

# Storage
zpool list; df -h /srv/nfs
```
Flag any discrepancies between docs and live output. Live system is always authoritative.
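A small sketch of turning that into a pass/fail check (the README pattern is an assumption about how your docs state the count):

```bash
#!/usr/bin/env bash
# Compare documented VM count vs. live. Assumes the README contains "N VMs".
live=$(bosh vms 2>/dev/null | grep -c running)
documented=$(grep -oE '[0-9]+ VMs' README.md | grep -oE '[0-9]+' | head -1)

if [ "$live" != "${documented:-}" ]; then
  echo "DRIFT: README documents ${documented:-?} VMs, live system reports $live"
fi
```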
## Important Notes
- Never store credentials in skill output or issues — reference vault/config
- Always prefer non-destructive actions (check before restart)
- Power cycle is a last resort — data loss risk on unclean shutdown
- Create GitHub issues for persistent problems requiring physical access
- SBC SD cards wear out — check for read-only filesystem warnings
- When fixing one system, verify adjacent systems (discovery pattern)
- Missing ops file dependencies cause silent failures — always verify all processes after deploy