Skills agent-health-diagnostics
Diagnose and fix the 4 most common OpenClaw agent failures — heartbeat spam, API rate limit cascades, channel death loops, and memory/embedding errors. Battle-tested across a 6-agent multi-host deployment.
git clone https://github.com/openclaw/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/agenthyjack/agent-health-diagnostics" ~/.claude/skills/clawdbot-skills-agent-health-diagnostics && rm -rf "$T"
skills/agenthyjack/agent-health-diagnostics/SKILL.mdAgent Health Diagnostics
Scripts available in the Collective Skills repo
Overview
When an OpenClaw agent misbehaves — spamming messages, going dark, burning API credits, or looping on dead channels — this skill provides the diagnostic playbook. Covers the 4 most common failure modes with exact commands to diagnose and fix each one.
Battle-tested across a 6-agent deployment spanning 3 hosts (Windows + Linux + Proxmox).
When to Use This Skill
Use when you observe any of these symptoms:
- Agent sending repeated heartbeat/status messages to Telegram/Discord/etc.
- Agent goes silent despite gateway showing "active"
- Logs show
or429 Too many tokens
errorsrate_limit - Channel connection loops:
,auto-restart attempt 1/10
, etc.2/10 - Memory search errors:
input length exceeds context length - Gateway says "active" but agent doesn't respond to messages
The 4 Failure Modes
1. Heartbeat Spam
Symptom: Agent sends repeated messages every N minutes. Root cause: Heartbeat interval too low (10m = 144 messages/day) + verbose prompt that always generates output instead of HEARTBEAT_OK. Quick fix:
# Check interval grep -A5 heartbeat ~/.openclaw/openclaw.json # Fix: set to 30m minimum, simplify prompt to checklist + HEARTBEAT_OK default # Then restart gateway openclaw gateway restart
Prevention: Never set heartbeat below 20 minutes. Heartbeat prompts should CHECK things, not CREATE things.
2. API Rate Limit Cascade
Symptom: All models fail, agent goes dark. Root cause: Heartbeat + N crons = (N+1) API calls per interval. Exceeds provider TPM limit → all fallbacks exhausted simultaneously. Quick fix:
# Check for rate limits journalctl -u <service> --since '1h ago' | grep '429\|rate_limit' # Count your crons (each burns tokens) openclaw cron list # Fix: reduce heartbeat to 30-60m, disable non-essential crons, stagger schedules
Prevention: Calculate token budget before adding crons. Each run ≈ 2K-10K tokens. Route heartbeats to cheap/local models.
3. Channel Death Loop
Symptom: Logs show repeated
auto-restart attempt N/10 for IRC/Discord/etc.
Root cause: Target server unreachable → health monitor restarts → fails again → loop. Each restart may trigger model calls, burning API tokens.
Quick fix:
# Check for loops journalctl -u <service> --since '1h ago' | grep 'auto-restart\|timed out' # Test connectivity nc -zv <target-ip> <target-port> -w 5 # Fix: disable the broken channel in openclaw.json # channels.<name>.enabled = false openclaw gateway restart
Prevention: Test connectivity BEFORE enabling channels. Disable channels you can't reach.
4. Memory/Embedding Overflow
Symptom:
memory sync failed or input length exceeds context length errors.
Root cause: File too large for embedding model's context window (mxbai-embed-large = 8K tokens).
Quick fix: Archive old sections of large files (MEMORY.md → memory/archive/). Keep active files under 8K tokens.
Prevention: Don't let MEMORY.md grow unbounded. Archive quarterly.
Remote Diagnostic Quick Reference
| What | Command |
|---|---|
| Service status | |
| Recent logs | |
| Live tail | |
| Rate limits | |
| Cron list | |
| Port test | |
| Config backup | |
Golden Rules
- Always back up config before editing.
cp openclaw.json openclaw.json.bak - Always restart gateway after config changes. Hot reload doesn't catch everything.
- Check logs before guessing.
tells you what's wrong 90% of the time.journalctl - Calculate your API budget. Heartbeat freq × (crons + 1) × avg tokens = burn rate.
- Disable what you can't reach. Dead channels create loops that waste resources.
- "Configured" ≠ "working." Verify with actual output after every change.