# Joelclaw talon

Operate Talon, the Rust infrastructure watchdog daemon that supervises the system-bus worker and monitors k8s. ADR-0159.

```shell
git clone https://github.com/joelhooks/joelclaw
```

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/joelhooks/joelclaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/talon" ~/.claude/skills/joelhooks-joelclaw-talon && rm -rf "$T"
```

`skills/talon/SKILL.md`

# Talon — Infrastructure Watchdog Daemon
Compiled Rust binary that supervises the system-bus worker AND monitors the full k8s infrastructure stack. ADR-0159.
## Quick Reference

```shell
talon validate      # Parse/validate config + services files, print summary JSON
talon --check       # Single probe cycle, print results, exit
talon --status      # Current state machine position
talon --dry-run     # Print loaded config, exit
talon --worker-only # Supervisor only, no infra probes
talon               # Full daemon mode (worker + probes + escalation)
```
## Paths

| What | Where |
|---|---|
| Binary | `~/.local/bin/talon` |
| Source | `~/Code/joelhooks/joelclaw/infra/talon` |
| Config | `~/.config/talon/config.toml` |
| Service monitors | `~/.joelclaw/talon/services.toml` |
| Default config | |
| Default services template | |
| Voice stale cleanup | `infra/voice-agent/cleanup-stale.sh` |
| State | |
| Probe results | |
| Log | `~/.local/state/talon/talon.log` (JSON lines, 10MB rotation) |
| Launchd plist | |
| RBAC guard manifest | |
| Worker stdout | |
| Worker stderr | |
| Talon launchd log | `~/.local/log/talon.err` |
## Build

```shell
export PATH="$HOME/.cargo/bin:$PATH"
cd ~/Code/joelhooks/joelclaw/infra/talon
cargo build --release
cp target/release/talon ~/.local/bin/talon
```
## Architecture

```
talon (single binary)
├── Worker Supervisor Thread (only when external launchd supervisor is not loaded)
│   ├── Kill orphan on port 3111
│   ├── Spawn bun (child process)
│   ├── Signal forwarding (SIGTERM → bun)
│   ├── Health poll every 30s
│   ├── PUT sync after healthy startup
│   └── Crash recovery: exponential backoff 1s→30s
│
├── Infrastructure Probe Loop (main thread, 60s)
│   ├── Colima VM alive?
│   ├── Docker socket responding?
│   ├── Talos container running?
│   ├── k8s API reachable?
│   ├── Node Ready + schedulable?
│   ├── Flannel daemonset ready?
│   ├── Redis PONG?
│   ├── Inngest /health 200?
│   ├── Typesense /health ok?
│   └── Worker /api/inngest 200?
│
└── Escalation (on failure)
    ├── Tier 1a: bridge-heal (force-cycle Colima on localhost↔VM split-brain)
    ├── Tier 1b: k8s-reboot-heal.sh (300s timeout, RBAC drift guard, VM `br_netfilter` repair, warmup-aware post-Colima invariants including deployment readiness + ImagePullBackOff pod reset, then voice-agent stale cleanup + launchd kickstart via `infra/voice-agent/cleanup-stale.sh`)
    ├── Tier 2: pi agent (cloud model, 10min cooldown)
    ├── Tier 3: pi agent (Ollama local, network-down fallback)
    └── Tier 4: Telegram + iMessage SOS fan-out (15min critical threshold)
```
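The crash-recovery backoff above can be sketched as a doubling delay capped at 30s. A minimal illustration — `next_backoff` is a hypothetical helper, not code from the talon binary:

```shell
# Illustrative sketch of the supervisor's crash-recovery backoff:
# the delay doubles from 1s and is capped at 30s.
next_backoff() {  # next_backoff <current_secs> -> prints the next delay
  n=$(( $1 * 2 ))
  if [ "$n" -gt 30 ]; then n=30; fi
  echo "$n"
}

b=1
for crash in 1 2 3 4 5 6; do
  echo "crash $crash: restart in ${b}s"
  b=$(next_backoff "$b")
done
# prints delays 1, 2, 4, 8, 16, 30
```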
## State Machine

```
healthy       → degraded      (1 critical probe failure)
degraded      → failed        (3 consecutive failures)
failed        → investigating (agent spawned)
investigating → healthy       (probes pass again)
investigating → critical      (agent failed to fix)
critical      → sos           (SOS sent via Telegram + iMessage)
any           → healthy       (all probes pass)
```
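The transition table above can be restated as a lookup. The event names here are invented labels for the conditions in parentheses — the real state machine lives in the Rust binary:

```shell
# Hypothetical restatement of Talon's state machine; event names are
# made-up labels, not identifiers from the Rust source.
next_state() {  # next_state <state> <event> -> prints the next state
  case "$1:$2" in
    *:all_probes_pass)                  echo healthy ;;       # any → healthy
    healthy:critical_probe_failed)      echo degraded ;;
    degraded:third_consecutive_failure) echo failed ;;
    failed:agent_spawned)               echo investigating ;;
    investigating:agent_failed)         echo critical ;;
    critical:sos_sent)                  echo sos ;;
    *)                                  echo "$1" ;;          # no transition
  esac
}

next_state healthy critical_probe_failed  # → degraded
next_state critical all_probes_pass       # → healthy
```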
## Probes
| Probe | Command | Critical? |
|---|---|---|
| colima | | Yes |
| docker | (Colima socket) | Yes |
| talos_container | | Yes |
| k8s_api | | Yes |
| node_ready | kubectl jsonpath for Ready condition | Yes |
| node_schedulable | kubectl jsonpath for spec (taints/cordon) | Yes |
| flannel | | No |
| redis | | Yes |
| kubelet_proxy_rbac | | Yes |
| vm:docker | | No |
| vm:k8s_api | | No |
| vm:redis | | No |
| vm:inngest | | No |
| vm:typesense | | No |
| inngest | | No |
| typesense | | No |
| worker | | No |
Critical probes trigger escalation immediately. Non-critical need 3 consecutive failures.
VM probes are witness probes only. They let Talon classify "service alive in VM but dead on localhost" as a Colima bridge split-brain and run bridge-heal instead of full recovery first.
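The 3-consecutive-failures rule for non-critical probes amounts to a counter that resets on any success. A hypothetical sketch, not the daemon's actual bookkeeping:

```shell
# Sketch of the non-critical debounce: escalate only on the 3rd consecutive
# failure; any success resets the counter. Illustrative only.
fails=0
record_result() {  # record_result <pass|fail>
  if [ "$1" = fail ]; then
    fails=$(( fails + 1 ))
    if [ "$fails" -ge 3 ]; then echo escalate; fi
  else
    fails=0
  fi
}

record_result fail   # 1st failure: silent
record_result pass   # success resets the counter
record_result fail
record_result fail
record_result fail   # 3rd consecutive failure: prints "escalate"
```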
## Dynamic service probes

Add probes in `~/.joelclaw/talon/services.toml` without rebuilding talon:

```toml
[launchd.gateway]
label = "com.joel.gateway"
critical = true
timeout_secs = 5

[http.gateway_slack]
url = "http://127.0.0.1:3018/health/slack"
critical = true
critical_after_consecutive_failures = 3
timeout_secs = 5

[launchd.voice_agent]
label = "com.joel.voice-agent"
critical = false
timeout_secs = 5

[script.gateway_telegram_409]
command = "test $(tail -20 /tmp/joelclaw/gateway.err 2>/dev/null | grep -c '409: Conflict') -lt 5"
critical = true
critical_after_consecutive_failures = 3
timeout_secs = 5

[script.colima_orphan_usernet]
command = "test $(pgrep -f 'limactl usernet' | wc -l) -le 2"
critical = true
critical_after_consecutive_failures = 2
timeout_secs = 5

[script.k8s_disk_pressure]
command = "! kubectl get nodes -o jsonpath='{.items[0].spec.taints}' 2>/dev/null | grep -q disk-pressure"
critical = true
critical_after_consecutive_failures = 1
timeout_secs = 10
```
- `launchd.<name>` passes when `launchctl list <label>` reports a non-zero PID
- `http.<name>` passes on HTTP `200`
- `script.<name>` passes on exit code 0, fails on non-zero (runs via `sh -c`)
- `critical = true` escalates on failure (immediately, or after debounce if configured)
- `critical_after_consecutive_failures = N` debounces critical alerts for dynamic probes (default `1` = immediate)
- `http.gateway_slack` uses gateway endpoint `GET /health/slack`, fails (503) when the Slack channel is not started, and should be debounced (recommended `3` cycles)
- Do not probe `http://127.0.0.1:8081/` for `voice_agent` by default — root returns `503` when idle and causes false SOS noise
- Service-heal pre-cleanup for `voice_agent` now clears stale `uv/main.py` listeners on `:8081` before `launchctl kickstart` to avoid bind conflicts after force-cycles
- Talon hot-reloads service probes when `services.toml` mtime changes (no restart required)
- `kill -HUP $(launchctl print gui/$(id -u)/com.joel.talon | awk '/pid =/{print $3; exit}')` forces immediate reload
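A `script.*` probe is just a command run through `sh -c` and judged on its exit code. Below is a standalone dry-run of the `gateway_telegram_409` check against a synthetic log file; `run_script_probe` is a hypothetical stand-in for Talon's runner, not the binary itself:

```shell
# Hypothetical stand-in for Talon's script-probe runner: sh -c, exit 0 = pass.
run_script_probe() {
  sh -c "$1" >/dev/null 2>&1 && echo pass || echo fail
}

# Dry-run the 409-flood check against a synthetic log (2 hits < 5 → pass):
log=$(mktemp)
printf '409: Conflict\n409: Conflict\n' > "$log"
run_script_probe "test \$(grep -c '409: Conflict' '$log') -lt 5"   # → pass
rm -f "$log"
```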
Recent dynamic probes added for the 2026-03-17 Colima/Restate incident:
- `script.redis_aof_health` — critical after 3 failures; checks `aof_last_bgrewrite_status:ok` to catch Redis AOF rewrite/persistence corruption.
- `script.colima_vm_uptime` — critical after 2 failures; requires VM uptime >120s to catch Colima crash loops after force-cycles.
- `script.restate_worker_ready` — critical after 3 failures; verifies the `restate-worker` pod reports `Ready=true` before workloads are trusted.
- `script.kvm_device_present` — non-critical witness probe; records whether `/dev/kvm` is present inside Colima for nested-virt / Firecracker diagnosis.
## Health endpoint

- `GET http://127.0.0.1:9999/health` returns Talon state JSON
- Gateway heartbeat consumes this as an additional watchdog signal
- Configure via `[health]` in `~/.config/talon/config.toml`
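A consumer-side sketch, assuming the /health payload is JSON with a top-level `state` field — that field name is an assumption here, so check it against the actual schema:

```shell
# Hypothetical consumer of GET /health. Assumes the payload carries a
# top-level "state" field; adjust to Talon's real schema.
health_state() {  # reads health JSON on stdin, prints the state
  python3 -c 'import json,sys; print(json.load(sys.stdin).get("state", "unknown"))'
}

# Against a live daemon:
#   curl -sS http://127.0.0.1:9999/health | health_state
echo '{"state":"degraded","probes":{}}' | health_state   # → degraded
```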
## SOS channel config

- Tier 4 sends to both Telegram and iMessage
- Telegram fields in `[escalation]`: `sos_telegram_chat_id`, `sos_telegram_secret_name` (defaults to `telegram_bot_token`)
- Talon now leases Telegram tokens via `secrets lease <name> --ttl ...` (no `--raw`). If you still see `curl: (3) URL rejected: Malformed input to a URL function`, redeploy the latest Talon binary.
- iMessage recipient remains `sos_recipient`
## Launchd Management

Talon is active as `com.joel.talon`:

```shell
launchctl print gui/$(id -u)/com.joel.talon | rg "state =|pid =|program =|last exit code ="
```

Reload binary/config after deploy:

```shell
launchctl kickstart -k gui/$(id -u)/com.joel.talon
```

Single owner for worker supervision is mandatory:

- If `com.joel.system-bus-worker` is loaded, Talon now auto-disables its internal worker supervisor to prevent port-3111 thrash.
- Preferred end-state is Talon-only supervision, but coexistence no longer causes kill/restart loops.

```shell
launchctl list com.joel.system-bus-worker
```

Legacy services should stay disabled when fully cut over:

```shell
launchctl bootout gui/$(id -u) ~/Library/LaunchAgents/com.joel.k8s-reboot-heal.plist
```
## Troubleshooting

```shell
# Validate config + service monitor files
talon validate | python3 -m json.tool

# Check what talon sees right now
talon --check | python3 -m json.tool

# Check state machine
talon --status | python3 -m json.tool

# Broken-pipe robustness smoke test (should exit 0)
talon --check | head -n 1 >/dev/null

# Check health endpoint payload
curl -sS http://127.0.0.1:9999/health | python3 -m json.tool

# Check talon's own logs
tail -20 ~/.local/state/talon/talon.log | python3 -m json.tool

# Check launchd
launchctl list | grep talon
tail -50 ~/.local/log/talon.err

# Manual probe test
DOCKER_HOST=unix:///Users/joel/.colima/default/docker.sock docker inspect --format '{{.State.Status}}' joelclaw-controlplane-1
kubectl exec -n joelclaw redis-0 -- redis-cli ping
kubectl auth can-i --as=apiserver-kubelet-client get nodes --subresource=proxy --all-namespaces
kubectl auth can-i --as=apiserver-kubelet-client create nodes --subresource=proxy --all-namespaces
ssh -F ~/.colima/_lima/colima/ssh.config lima-colima 'curl -sS http://127.0.0.1:8288/health'

# Force bridge repair (same behavior Talon uses for split-brain)
colima stop --force && colima start

# Manual voice-agent stale cleanup (same post-gate step k8s-reboot-heal runs)
~/Code/joelhooks/joelclaw/infra/voice-agent/cleanup-stale.sh
```
## Colima Stability Monitoring (2026-03-17)

Talon now monitors failure modes discovered during the Firecracker development incident:

| Probe | What it detects | Critical? |
|---|---|---|
| `script.redis_aof_health` | Corrupted Redis AOF from VM crash mid-write | Yes (after 3) |
| `script.colima_vm_uptime` | VM crash-loop (uptime < 120s = just restarted) | Yes (after 2) |
| `script.restate_worker_ready` | Restate worker pod not 1/1 Ready | Yes (after 3) |
| `script.kvm_device_present` | Whether /dev/kvm exists (nested virt status) | No (informational) |
### Known failure chain: nestedVirtualization → cascade

```
nestedVirtualization ON + heavy Docker build
  → Colima VZ VM crash (silent, no crash report)
  → Docker daemon restart
  → Talos container killed
  → Redis mid-write → AOF corruption → crash-loop
  → Restate mid-journal → stale invocations → infinite retries
  → Lima socket forwarding broken → docker CLI dead on macOS
```
Talon detects each stage:
- `colima_vm_uptime` < 120s → VM just crashed
- `redis` probe fails → Redis down
- `redis_aof_health` fails → AOF corrupted (needs manual fix)
- `restate_worker_ready` fails → worker can't start (may be /dev/kvm mount or image pull)
Talon cannot auto-fix Redis AOF corruption (requires `redis-check-aof --fix`). It WILL escalate to the pi agent (Tier 2), which should load the k8s skill's Redis AOF Recovery procedure.
## Key Design Decisions
- Zero external deps — no tokio, no serde, no reqwest. Pure std. Keeps binary at ~444KB.
- Compiles its own PATH — immune to launchd environment brittleness (the class of bug that caused the 6-day outage).
- Worker is a child process — not a separate launchd service. Signal forwarding prevents orphans.
- TOML config parsed by hand — same pattern as worker-supervisor. No dependency just for config.
- Probes use Colima docker socket for critical host checks and add VM witness probes over Colima SSH for split-brain detection.
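The "compiles its own PATH" decision can be mirrored in any launchd-adjacent shell wrapper: never trust the inherited environment; prepend known tool locations explicitly. The directory list below is an assumption (typical cargo/Homebrew layout), not Talon's actual baked PATH:

```shell
# Defensive PATH baking, mirroring Talon's design: prepend every location
# the tools are known to live in, instead of trusting launchd's minimal PATH.
# Directory list is an assumption for this sketch.
export PATH="$HOME/.cargo/bin:$HOME/.local/bin:/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin"

case ":$PATH:" in
  *":/usr/bin:"*) echo "baked PATH active" ;;
esac
```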
## Related

- ADR-0159: Talon proposal
- ADR-0158: Worker supervisor (superseded by talon)
- `infra/k8s-reboot-heal.sh`: Tier 1 heal script
- `infra/worker-supervisor/`: Original standalone worker supervisor (superseded)
- Ollama + qwen3:8b: Tier 3 local fallback model