Awesome-omni-skill infra-guardian
OpenClaw Agent Infrastructure Guardian — keep your agent's infrastructure alive. Process lifecycle management with detached execution, auto-restart on failure. Cron scheduler health monitoring (per-job detection, auto-recovery). Direct Telegram/messaging alerts independent of OpenClaw. System-level watchdog that runs from crontab, not OpenClaw cron. Use when launching background processes, monitoring cron job health, or when things keep dying silently.
git clone https://github.com/diegosouzapw/awesome-omni-skill
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/devops/infra-guardian" ~/.claude/skills/diegosouzapw-awesome-omni-skill-infra-guardian && rm -rf "$T"
skills/devops/infra-guardian/SKILL.mdInfra Guardian
Keep your OpenClaw agent's infrastructure alive. Processes, cron jobs, the works.
What It Does
| Layer | What | How |
|---|---|---|
| Process Management | Launch, track, auto-restart background processes | + + registry + healthcheck |
| Cron Health | Detect stalled OpenClaw cron jobs per-job | Reads from disk, checks vs interval |
| Auto-Recovery | Restart gateway when cron scheduler stalls | SIGUSR1 when >50% jobs overdue |
| Alerting | Telegram alerts independent of OpenClaw | Direct Bot API calls — works even if OpenClaw is down |
Core principle: Monitoring components must not depend on the system they monitor.
Quick Start
# Process management bash scripts/managed-process.sh register my-bot "python3 /path/to/bot.py" 480 bash scripts/managed-process.sh start my-bot bash scripts/managed-process.sh status # Infrastructure watchdog (add to system crontab, NOT OpenClaw cron) */10 * * * * /path/to/scripts/managed-process.sh watchdog
Commands
| Command | Usage | Description |
|---|---|---|
| | Define a managed process. Duration 0 = indefinite. |
| | Launch registered process (fully detached). |
| | Graceful shutdown via SIGTERM. |
| | Stop then start. |
| | Show all processes or one specific. |
| | Check all registered processes, restart dead ones. |
| | Unified check: cron health + process health (for system crontab). |
| | Check OpenClaw cron scheduler per-job health. |
| | Check key process liveness (configurable patterns). |
| | Remove process from registry + clean up files. |
Infrastructure Watchdog
The
watchdog command is the unified health check. Run it from system crontab — never from OpenClaw cron (you can't monitor the scheduler using the scheduler).
# Unified (recommended) bash scripts/managed-process.sh watchdog # Individual checks bash scripts/managed-process.sh cron-health bash scripts/managed-process.sh proc-health
Cron Health (cron-health
)
cron-healthReads OpenClaw's cron state directly from disk (
~/.openclaw/cron/jobs.json). No API dependency.
Per-job detection:
- Each enabled job checked independently
overdue by >2× its interval → STALEnextRunAtMs- Jobs with
use theirkind: "every"
as intervaleveryMs - Jobs with
use 24h as max expected intervalkind: "cron"
Auto-recovery:
- If >50% of jobs are stale → sends SIGUSR1 to restart gateway
- Alerts via Telegram Bot API directly (reads bot token from OpenClaw config)
- 30-minute cooldown between alerts (no spam)
Why this matters: OpenClaw has a known cron bug (#8424) where
kind: "cron" jobs permanently stall after missing a run. This watchdog catches it within 20 minutes instead of discovering it 8 hours later.
Process Health (proc-health
)
proc-healthChecks known background processes via
pgrep. Configurable patterns in the script.
- Alerts via Telegram if any monitored process is down
- 30-minute cooldown
Setup
# Add to system crontab (not OpenClaw cron!) crontab -e # Add this line: */10 * * * * /home/clawdbot/clawd/scripts/managed-process.sh watchdog
Process Management
The Problem
When AI agents launch background processes via
exec & or nohup, those processes are tied to the parent session. Session ends → child processes get SIGTERM → die silently → nobody knows.
The Rule
ALL long-running processes MUST go through this framework. No exceptions.
- ❌
python script.py & - ❌
nohup python script.py & - ❌
withexecbackground: true - ✅
thenbash scripts/managed-process.sh register <name> <cmd>start <name>
Detached Execution
Processes launch via
setsid + nohup + disown, giving them:
- Own session ID (SID) — not tied to any terminal
- PPID=1 (init) — survives parent death
- Immune to SIGHUP/session cleanup
Health Monitoring
The
healthcheck command (can run from cron every 5 min):
- Checks each registered process is alive (PID exists)
- If dead: checks if it completed normally (ran ≥80% of expected duration)
- If premature death: auto-restarts and writes alert flag
- Alert cooldown: 15 min between alerts (no spam)
Alert Integration
When a process dies, the healthcheck writes:
— trigger file/tmp/process_monitor_alert.flag
— alert message/tmp/process_monitor_alert.txt
Configure your agent's heartbeat to check these files and forward alerts.
Process Registry
All processes are tracked in
.process-registry.json:
{ "my-bot": { "command": "python3 /path/to/bot.py", "duration_min": 480, "auto_restart": true, "max_restarts": 5, "restart_cooldown": 30 } }
Signal Handling Best Practice
Your scripts should log signals, not silently exit:
import signal from datetime import datetime def handler(signum, frame): sig_name = signal.Signals(signum).name print(f"⚠️ SIGNAL: {sig_name} at {datetime.now()}", flush=True) global shutdown shutdown = True signal.signal(signal.SIGTERM, handler) signal.signal(signal.SIGINT, handler)
Never call
sys.exit(0) in signal handlers — it makes crashes look like clean exits, preventing watchdogs from restarting.
Architecture
System Crontab (every 10 min) └─ managed-process.sh watchdog ├─ cron-health: reads ~/.openclaw/cron/jobs.json │ ├─ per-job: nextRunAtMs overdue > 2× interval? │ ├─ if >50% stale → SIGUSR1 gateway restart │ └─ alert → Telegram Bot API (direct, no OpenClaw) └─ proc-health: pgrep known patterns └─ alert → Telegram Bot API (direct, no OpenClaw)
Independence chain: System crontab → bash script → disk read → direct Telegram API. Zero OpenClaw dependencies in the monitoring path.