Nexus-agents infrastructure-management

install
source · Clone the upstream repo
git clone https://github.com/williamzujkowski/nexus-agents
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/williamzujkowski/nexus-agents "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/infrastructure-management" ~/.claude/skills/williamzujkowski-nexus-agents-infrastructure-management && rm -rf "$T"
manifest: skills/infrastructure-management/SKILL.md
source content

Infrastructure Management Skill

Overview

Manages physical and single-board computer (SBC) infrastructure with awareness of hardware boot times, access hierarchies, and out-of-band (OOB) management capabilities.

Access Strategy — Try in Order

  1. SSH key-based — Primary access method
  2. SSH password — Fallback if key fails
  3. Tailscale/VPN — If direct SSH unreachable
  4. OOB management (iDRAC/iLO/IPMI) — For power cycling, console when SSH down
  5. Serial console — Last remote option
  6. Physical access — Keyboard/monitor as final resort

Always maintain at least two working access methods per host.
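
A minimal sketch of this fallback chain for a single host; HOST, USER, and OOB_IP are placeholders, and the OOB step only reports reachability since power actions stay manual per Phase 3:

#!/usr/bin/env bash
# try_access.sh - walk the access hierarchy for one host, report first working method
HOST="$1"; USER="${2:-admin}"; OOB_IP="$3"

# 1. SSH key-based (BatchMode=yes disables password prompts)
if ssh -o ConnectTimeout=2 -o BatchMode=yes "$USER@$HOST" true 2>/dev/null; then
  echo "ssh-key"; exit 0
fi

# 2. SSH password requires explicit user approval, so only flag it
echo "ssh-key failed; password auth needs explicit user approval" >&2

# 3./4. OOB reachability (iDRAC/iLO/IPMI web interface)
if [ -n "$OOB_IP" ] && curl -sk --connect-timeout 5 "https://$OOB_IP/" >/dev/null 2>&1; then
  echo "oob-only"; exit 0
fi

echo "unreachable"; exit 1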

Phase 1: Connectivity Audit

For each managed host, check access:

# SSH connectivity check (2s timeout)
ssh -o ConnectTimeout=2 -o BatchMode=yes USER@HOST "echo ok" 2>&1

# Check SSH via password (if key fails)
# NOTE: sshpass usage requires explicit user approval

# Check if OOB/iDRAC is reachable
curl -sk --connect-timeout 5 https://IDRAC_IP/data?get=pwState 2>&1 || echo "iDRAC unreachable"

# IPMI reachability check (queries chassis power state)
ipmitool -I lanplus -H IPMI_IP -U root -P PASSWORD power status 2>&1

Report format:

Host: hostname (IP)
  SSH Key:     OK | FAIL (reason)
  SSH Pass:    OK | FAIL | NOT_TESTED
  OOB:         OK (iDRAC6/iLO4/IPMI) | UNREACHABLE
  Boot Time:   ~30s (SBC) | ~3min (desktop) | ~10min (enterprise)
  Status:      HEALTHY | DEGRADED | UNREACHABLE
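
A hedged sketch that loops the checks above over an inventory and emits the report fields; the hosts.txt format (one "host user idrac_ip" triple per line) is an assumption:

#!/usr/bin/env bash
# audit.sh - Phase 1 connectivity audit over an assumed "host user idrac_ip" inventory
while read -r host user idrac; do
  if ssh -o ConnectTimeout=2 -o BatchMode=yes "$user@$host" "echo ok" >/dev/null 2>&1; then
    ssh_key="OK"
  else
    ssh_key="FAIL"
  fi
  if [ -n "$idrac" ] && curl -sk --connect-timeout 5 "https://$idrac/data?get=pwState" >/dev/null 2>&1; then
    oob="OK"
  else
    oob="UNREACHABLE"
  fi
  status="HEALTHY"
  [ "$ssh_key" = "FAIL" ] && status="DEGRADED"
  [ "$ssh_key" = "FAIL" ] && [ "$oob" = "UNREACHABLE" ] && status="UNREACHABLE"
  printf 'Host: %s\n  SSH Key: %s\n  OOB:     %s\n  Status:  %s\n' "$host" "$ssh_key" "$oob" "$status"
done < hosts.txt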

Phase 2: Hardware Health

Query available health data from each host:

# Temperature (via SSH)
ssh HOST "cat /sys/class/thermal/thermal_zone*/temp 2>/dev/null || sensors 2>/dev/null"

# Disk health
ssh HOST "df -h && smartctl -a /dev/sda 2>/dev/null | grep -E 'Health|Temperature|Reallocated'"

# Memory
ssh HOST "free -h"

# Uptime and load
ssh HOST "uptime"

# Docker status (if applicable)
ssh HOST "docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}' 2>/dev/null"

For iDRAC-equipped servers:

# Sensor readings via REST API
curl -sk --cookie "session_cookie" "https://IDRAC_IP/data?get=tempprobes"
curl -sk --cookie "session_cookie" "https://IDRAC_IP/data?get=fanstatus"

# System Event Log
ssh IDRAC_IP "racadm getsel" 2>/dev/null
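
To run the host-side checks across the fleet in one pass, a sketch assuming a one-hostname-per-line hosts.txt (smartctl and sensors may need sudo depending on the distro):

#!/usr/bin/env bash
# health_sweep.sh - collect the Phase 2 basics from every SSH-reachable host
while read -r host; do
  echo "=== $host ==="
  ssh -o ConnectTimeout=2 -o BatchMode=yes "$host" '
    uptime
    free -h | sed -n "1,2p"
    df -h / | tail -1
    cat /sys/class/thermal/thermal_zone*/temp 2>/dev/null | head -3
  ' 2>/dev/null || echo "  (unreachable, escalate per Phase 3)"
done < hosts.txt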

Phase 3: Recovery Actions

When a host is unreachable:

  1. Wait for boot — enterprise servers with lots of RAM take 10-15 minutes
  2. Try OOB power cycle — ipmitool power cycle or the iDRAC web/API (guarded sketch below)
  3. Check network — ping gateway, check switch port
  4. Serial console — if available via OOB
  5. Physical intervention — document what's needed, create issue
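
Because a power cycle risks data loss on unclean shutdown (see Important Notes), a guarded sketch with an explicit confirmation gate; it uses -E so the password comes from the IPMI_PASSWORD environment variable rather than the command line:

#!/usr/bin/env bash
# power_cycle.sh - last-resort OOB power cycle with a confirmation prompt
# Export IPMI_PASSWORD first; -E reads it from the environment
IPMI_IP="$1"
ipmitool -I lanplus -H "$IPMI_IP" -U root -E power status || exit 1
read -r -p "Power cycle $IPMI_IP? Unclean shutdown risks data loss [y/N] " ans
[ "$ans" = "y" ] && ipmitool -I lanplus -H "$IPMI_IP" -U root -E power cycle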

Boot Time Reference

| Hardware Type               | Expected Boot Time |
|-----------------------------|--------------------|
| Raspberry Pi / SBC          | 30-60 seconds      |
| Desktop / small server      | 1-3 minutes        |
| 1U/2U rack server (≤64GB)   | 3-5 minutes        |
| Enterprise server (128GB+)  | 8-15 minutes       |
| High-memory (512GB+)        | 12-20 minutes      |

Do NOT declare a server failed until at least 2x the expected boot time has passed.
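
The 2x rule as a polling loop, a sketch where EXPECTED_BOOT (in seconds) comes from the table above:

#!/usr/bin/env bash
# wait_for_boot.sh - poll SSH until the host answers or 2x expected boot time passes
HOST="$1"; EXPECTED_BOOT="${2:-180}"   # default assumes a desktop-class host
deadline=$(( $(date +%s) + 2 * EXPECTED_BOOT ))
while [ "$(date +%s)" -lt "$deadline" ]; do
  if ssh -o ConnectTimeout=2 -o BatchMode=yes "$HOST" true 2>/dev/null; then
    echo "$HOST is up"; exit 0
  fi
  sleep 10
done
echo "$HOST still down after 2x expected boot time; escalate per Phase 3" >&2
exit 1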

Phase 4: Preventive Checks

For SBC hosts (Raspberry Pi):

  • Check SD card health:
    sudo dmesg | grep -i "mmc\|error\|read-only"
  • Verify USB boot if applicable
  • Check power supply voltage:
    vcgencmd measure_volts
  • Monitor temperature:
    vcgencmd measure_temp
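
These can be combined into a one-shot remote check, a sketch assuming a Raspberry Pi (vcgencmd is Pi-specific) and an SSH-reachable PI_HOST placeholder:

# One-shot SBC health check (Raspberry Pi assumed for vcgencmd)
ssh PI_HOST '
  sudo dmesg | grep -i "mmc\|error\|read-only" | tail -5
  vcgencmd measure_temp
  vcgencmd measure_volts
'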

For enterprise servers:

  • Review System Event Log for predictive failures
  • Check RAID status:
    ssh HOST "cat /proc/mdstat 2>/dev/null || megacli -LDInfo -Lall -aALL 2>/dev/null"
  • Verify firmware versions against known-good baselines
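
For mdraid hosts, a quick degraded-array heuristic; it assumes the usual /proc/mdstat layout, where an underscore inside the [UU] status brackets marks a missing member:

# Flag degraded md arrays: "_" in the status brackets means a missing member
ssh HOST 'grep -A1 "^md" /proc/mdstat 2>/dev/null' | grep -q "_" \
  && echo "RAID DEGRADED on HOST" || echo "RAID OK (or no mdraid)"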

Output Format

Produce a summary with:

## Infrastructure Status Report

### Hosts Summary
| Host | IP | SSH | OOB | Health | Boot Est. |
|------|-----|-----|-----|--------|-----------|
| ...  | ... | ... | ... | ...    | ...       |

### Findings
- [CRITICAL] Host X unreachable via all methods
- [WARNING] Host Y disk SMART warning
- [INFO] Host Z uptime 45 days, consider updates

### Recommended Actions
1. ...
2. ...

Phase 5: BOSH/CF Deployment Verification

For BOSH-managed infrastructure:

# Verify director health
source ~/deployments/bosh/env.sh
bosh env                            # Director reachable?

# Check all VMs running
bosh vms                            # All instances "running"?

# Check director processes
ssh -i <key> jumpbox@DIRECTOR_IP "sudo monit summary"
# Expected: nats, postgres, blobstore_nginx, director, workers, health_monitor, lxd_cpi

# Verify CredHub on director
curl -sk https://DIRECTOR_IP:8844/info    # Should return JSON with app name "CredHub"
credhub find                               # Should list credentials

# BBR readiness
bbr director --host DIRECTOR_IP --username bbr --private-key-path bbr.pem pre-backup-check

Post-Deployment Checklist

After any bosh create-env or bosh deploy:

  1. Process check: monit summary on affected VMs — all processes "running"
  2. Connectivity check: curl service endpoints (CredHub :8844, UAA :8443)
  3. VM count: bosh vms matches expected count
  4. Dependent services: verify services that depend on the updated component
  5. Smoke tests: bosh run-errand smoke-tests if available
  6. Backup readiness: bbr pre-backup-check still passes
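
A sketch automating the mechanical items (1, 2, 3, and 6); jumpbox.key, bbr.pem, and a DIRECTOR_IP variable set by env.sh are assumptions about the environment:

#!/usr/bin/env bash
# post_deploy_check.sh - run the scriptable parts of the checklist
source ~/deployments/bosh/env.sh   # assumed to set DIRECTOR_IP and BOSH_* vars

# 1. Process check on the director (expect every process "running")
ssh -i jumpbox.key "jumpbox@$DIRECTOR_IP" "sudo monit summary"

# 2. Connectivity check: CredHub (:8844) and UAA (:8443)
curl -sk "https://$DIRECTOR_IP:8844/info" >/dev/null && echo "CredHub: OK" || echo "CredHub: FAIL"
curl -sk "https://$DIRECTOR_IP:8443/login" >/dev/null && echo "UAA: OK" || echo "UAA: FAIL"

# 3. VM count (compare against the expected count for the deployment)
bosh vms | grep -c running

# 6. Backup readiness
bbr director --host "$DIRECTOR_IP" --username bbr --private-key-path bbr.pem pre-backup-check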

Common Ops File Dependencies

| Ops File               | Depends On | Provides                    |
|------------------------|------------|-----------------------------|
| credhub.yml            | uaa.yml    | CredHub on director (:8844) |
| uaa.yml                | (base)     | UAA on director (:8443)     |
| bbr.yml                | (base)     | backup-and-restore-sdk      |
| CPI ops (e.g., Incus)  | (base)     | VM lifecycle management     |

CRITICAL: Missing uaa.yml when credhub.yml is included causes CredHub to silently fail to start. Always check monit summary after ops file changes.
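
A pre-flight guard against this trap, a sketch that inspects the ops-file arguments you are about to pass to bosh create-env or bosh deploy:

#!/usr/bin/env bash
# opsfile_guard.sh - refuse to deploy credhub.yml without uaa.yml
# Usage: opsfile_guard.sh -o credhub.yml -o uaa.yml ...
args="$*"
if echo "$args" | grep -q "credhub.yml" && ! echo "$args" | grep -q "uaa.yml"; then
  echo "ERROR: credhub.yml requires uaa.yml; CredHub will silently fail to start" >&2
  exit 1
fi
echo "ops file dependency check passed"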

Phase 6: Documentation-Reality Drift Check

Verify documentation against live system:

# VM count
bosh vms 2>/dev/null | grep -c running    # Compare against README

# Service inventory
systemctl list-units --state=running --type=service | grep -E "podman|grafana|loki"

# Tool availability (verify before referencing in docs)
which terraform terragrunt make 2>/dev/null

# Network topology
ip -br addr show | grep -E "bond|vlan"

# Storage
zpool list; df -h /srv/nfs

Flag any discrepancies between docs and live output. Live system is always authoritative.
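
A sketch of one such comparison; the "VMs: N" README line it greps for is an assumption about how your docs record the count:

# Compare documented VM count against the live director
documented=$(grep -oE 'VMs: [0-9]+' README.md | grep -oE '[0-9]+')
live=$(bosh vms 2>/dev/null | grep -c running)
[ "$documented" = "$live" ] && echo "VM count matches ($live)" \
  || echo "DRIFT: docs say $documented, live says $live (live wins)"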

Important Notes

  • Never store credentials in skill output or issues — reference vault/config
  • Always prefer non-destructive actions (check before restart)
  • Power cycle is a last resort — data loss risk on unclean shutdown
  • Create GitHub issues for persistent problems requiring physical access
  • SBC SD cards wear out — check for read-only filesystem warnings
  • When fixing one system, verify adjacent systems (discovery pattern)
  • Missing ops file dependencies cause silent failures — always verify all processes after deploy