Awesome-omni-skill infra-guardian

OpenClaw Agent Infrastructure Guardian — keep your agent's infrastructure alive. Process lifecycle management with detached execution, auto-restart on failure. Cron scheduler health monitoring (per-job detection, auto-recovery). Direct Telegram/messaging alerts independent of OpenClaw. System-level watchdog that runs from crontab, not OpenClaw cron. Use when launching background processes, monitoring cron job health, or when things keep dying silently.

install

source · Clone the upstream repo

git clone https://github.com/diegosouzapw/awesome-omni-skill

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/devops/infra-guardian" ~/.claude/skills/diegosouzapw-awesome-omni-skill-infra-guardian && rm -rf "$T"

manifest: skills/devops/infra-guardian/SKILL.md

source content

Infra Guardian

Keep your OpenClaw agent's infrastructure alive. Processes, cron jobs, the works.

What It Does

Layer	What	How
Process Management	Launch, track, auto-restart background processes	`setsid` + `nohup` + registry + healthcheck
Cron Health	Detect stalled OpenClaw cron jobs per-job	Reads `jobs.json` from disk, checks `nextRunAtMs` vs interval
Auto-Recovery	Restart gateway when cron scheduler stalls	SIGUSR1 when >50% jobs overdue
Alerting	Telegram alerts independent of OpenClaw	Direct Bot API calls — works even if OpenClaw is down

Core principle: Monitoring components must not depend on the system they monitor.

Quick Start

# Process management
bash scripts/managed-process.sh register my-bot "python3 /path/to/bot.py" 480
bash scripts/managed-process.sh start my-bot
bash scripts/managed-process.sh status

# Infrastructure watchdog (add to system crontab, NOT OpenClaw cron)
*/10 * * * * /path/to/scripts/managed-process.sh watchdog

Commands

Command	Usage	Description
`register`	`register <name> <command> [duration_min]`	Define a managed process. Duration 0 = indefinite.
`start`	`start <name>`	Launch registered process (fully detached).
`stop`	`stop <name>`	Graceful shutdown via SIGTERM.
`restart`	`restart <name>`	Stop then start.
`status`	`status [name]`	Show all processes or one specific.
`healthcheck`	`healthcheck`	Check all registered processes, restart dead ones.
`watchdog`	`watchdog`	Unified check: cron health + process health (for system crontab).
`cron-health`	`cron-health`	Check OpenClaw cron scheduler per-job health.
`proc-health`	`proc-health`	Check key process liveness (configurable patterns).
`deregister`	`deregister <name>`	Remove process from registry + clean up files.

Infrastructure Watchdog

The

watchdog

command is the unified health check. Run it from system crontab — never from OpenClaw cron (you can't monitor the scheduler using the scheduler).

# Unified (recommended)
bash scripts/managed-process.sh watchdog

# Individual checks
bash scripts/managed-process.sh cron-health
bash scripts/managed-process.sh proc-health

Cron Health (

cron-health

)

Reads OpenClaw's cron state directly from disk (

~/.openclaw/cron/jobs.json

). No API dependency.

Per-job detection:

Each enabled job checked independently
```
nextRunAtMs
```
overdue by >2× its interval → STALE
Jobs with
```
kind: "every"
```
use their
```
everyMs
```
as interval
Jobs with
```
kind: "cron"
```
use 24h as max expected interval

Auto-recovery:

If >50% of jobs are stale → sends SIGUSR1 to restart gateway
Alerts via Telegram Bot API directly (reads bot token from OpenClaw config)
30-minute cooldown between alerts (no spam)

Why this matters: OpenClaw has a known cron bug (#8424) where

kind: "cron"

jobs permanently stall after missing a run. This watchdog catches it within 20 minutes instead of discovering it 8 hours later.

Process Health (

proc-health

)

Checks known background processes via

pgrep

. Configurable patterns in the script.

Alerts via Telegram if any monitored process is down
30-minute cooldown

Setup

# Add to system crontab (not OpenClaw cron!)
crontab -e
# Add this line:
*/10 * * * * /home/clawdbot/clawd/scripts/managed-process.sh watchdog

Process Management

The Problem

When AI agents launch background processes via

exec &

nohup

, those processes are tied to the parent session. Session ends → child processes get SIGTERM → die silently → nobody knows.

The Rule

ALL long-running processes MUST go through this framework. No exceptions.

❌
```
python script.py &
```
❌
```
nohup python script.py &
```
❌
```
exec
```
with
```
background: true
```

✅

bash scripts/managed-process.sh register <name> <cmd>

then

start <name>

Detached Execution

Processes launch via

setsid

nohup

disown

, giving them:

Own session ID (SID) — not tied to any terminal
PPID=1 (init) — survives parent death
Immune to SIGHUP/session cleanup

Health Monitoring

The

healthcheck

command (can run from cron every 5 min):

Checks each registered process is alive (PID exists)
If dead: checks if it completed normally (ran ≥80% of expected duration)
If premature death: auto-restarts and writes alert flag
Alert cooldown: 15 min between alerts (no spam)

Alert Integration

When a process dies, the healthcheck writes:

```
/tmp/process_monitor_alert.flag
```
— trigger file
```
/tmp/process_monitor_alert.txt
```
— alert message

Configure your agent's heartbeat to check these files and forward alerts.

Process Registry

All processes are tracked in

.process-registry.json

{
  "my-bot": {
    "command": "python3 /path/to/bot.py",
    "duration_min": 480,
    "auto_restart": true,
    "max_restarts": 5,
    "restart_cooldown": 30
  }
}

Signal Handling Best Practice

Your scripts should log signals, not silently exit:

import signal
from datetime import datetime

def handler(signum, frame):
    sig_name = signal.Signals(signum).name
    print(f"⚠️ SIGNAL: {sig_name} at {datetime.now()}", flush=True)
    global shutdown
    shutdown = True

signal.signal(signal.SIGTERM, handler)
signal.signal(signal.SIGINT, handler)

Never call

sys.exit(0)

in signal handlers — it makes crashes look like clean exits, preventing watchdogs from restarting.

Architecture

System Crontab (every 10 min)
  └─ managed-process.sh watchdog
       ├─ cron-health: reads ~/.openclaw/cron/jobs.json
       │    ├─ per-job: nextRunAtMs overdue > 2× interval?
       │    ├─ if >50% stale → SIGUSR1 gateway restart
       │    └─ alert → Telegram Bot API (direct, no OpenClaw)
       └─ proc-health: pgrep known patterns
            └─ alert → Telegram Bot API (direct, no OpenClaw)

Independence chain: System crontab → bash script → disk read → direct Telegram API. Zero OpenClaw dependencies in the monitoring path.

Awesome-omni-skill infra-guardian

Infra Guardian

What It Does

Quick Start

Commands

Infrastructure Watchdog

Cron Health (
`cron-health`
)

Process Health (
`proc-health`
)

Setup

Process Management

The Problem

The Rule

Detached Execution

Health Monitoring

Alert Integration

Process Registry

Signal Handling Best Practice

Architecture

Awesome-omni-skill infra-guardian

Infra Guardian

What It Does

Quick Start

Commands

Infrastructure Watchdog

Cron Health (cron-health)

Process Health (proc-health)

Setup

Process Management

The Problem

The Rule

Detached Execution

Health Monitoring

Alert Integration

Process Registry

Signal Handling Best Practice

Architecture

Cron Health (
`cron-health`
)

Process Health (
`proc-health`
)