Joelclaw gateway
Operate the joelclaw gateway daemon — the always-on pi session that receives events, notifications, and messages. Use the joelclaw CLI for ALL gateway operations. Use when: 'restart gateway', 'gateway status', 'is gateway healthy', 'push to gateway', 'gateway not responding', 'telegram not working', 'messages not going through', 'gateway stuck', 'gateway debug', 'check gateway', 'drain queue', 'test gateway', 'stream events', or any task involving the gateway daemon.
git clone https://github.com/joelhooks/joelclaw
T=$(mktemp -d) && git clone --depth=1 https://github.com/joelhooks/joelclaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/gateway" ~/.claude/skills/joelhooks-joelclaw-gateway && rm -rf "$T"
skills/gateway/SKILL.mdGateway Operations
The gateway daemon is the always-on pi session that receives events from Inngest functions, Telegram, and webhooks. It's the system's notification and communication layer.
Rule: Always use
CLI. Never use joelclaw gateway
, launchctl
, or log file grep directly.curl
CLI Commands
joelclaw gateway status # Daemon availability, runtime mode, session pressure, Redis health joelclaw gateway restart # Roll daemon, clean Redis, fresh session joelclaw gateway enable # Re-enable launch agent + start daemon joelclaw gateway test # Push test event, verify delivery (Redis bridge path) joelclaw gateway push --type <type> [--payload JSON] # Push an event to all sessions joelclaw gateway events # Peek at pending events per session joelclaw gateway drain # Clear all event queues joelclaw gateway stream # NDJSON stream of all gateway events (ADR-0058) joelclaw gateway behavior {add|list|promote|remove|apply|stats} # ADR-0211 behavior control plane
joelclaw gateway restart is the canonical restart. It kills the process, cleans Redis state, re-enables com.joel.gateway if launchd disabled it, waits for launchd to respawn, and verifies the new session. joelclaw gateway enable is the direct recovery path when launchd drift disabled the service. Never use launchctl bootout/bootstrap directly.
Quick Triage
Substrate precheck first (avoid chasing secondary gateway symptoms):
colima status --json kubectl get nodes -o wide kubectl get pods -n joelclaw redis-0 inngest-0
If Colima is down or node/core pods are not healthy, recover substrate before gateway operations.
Run in order, stop at first failure:
joelclaw gateway status # 1. Is it alive? Sessions registered? joelclaw gateway test # 2. Can events flow end-to-end? joelclaw gateway events # 3. Are events piling up? (backpressure) joelclaw gateway restart # 4. If stuck, restart
If
joelclaw gateway status shows pending > 0 on sessions, the agent is mid-stream or stuck. If it persists after a minute, restart.
Redis-degraded mode (ADR-0214)
joelclaw gateway status now distinguishes:
— Redis bridge healthymode: normal
— daemon/channels/session available, but Redis-backed capabilities are degradedmode: redis_degraded
When
mode=redis_degraded:
- direct human conversation can still work
- Redis-backed commands/inspections are only partially trustworthy
validates the Redis bridge path, so expectjoelclaw gateway test
to skip that layer intentionallyjoelclaw gateway diagnose- use
to see the degraded capability list and session pressure fieldsjoelclaw gateway diagnose
Do not report
redis_degraded as “gateway down” unless process/session health is also failing.
Gateway context refresh scoping (ADR-0204)
Gateway rolling context refresh is useful only when it stays scoped.
Hard rule:
- do not treat automated digests, gateway event wrappers, recovery summaries, or terse acknowledgements as recall seeds
- if there is no real conversational topic, skip refresh rather than querying generic global memory
Why:
- a bad hidden
injection poisons the live gateway session even whencontext-refresh
still reports healthyjoelclaw gateway status - this showed up as unrelated voice/livekit notes bleeding into the gateway transcript
If Joel says the gateway session feels "fucked" while health checks look green, inspect the gateway session transcript for hidden
context-refresh / gateway-recovery / memory-recall messages before trusting the CLI summary.
Session pressure visibility (ADR-0218 rank 3 slice)
joelclaw gateway status / joelclaw gateway diagnose now expose session-pressure specifics instead of just a coarse health word:
- context usage % + next action
- next threshold summary (
/compact at 65% ...
/rotate at 75% ...
)rotate immediately - last compaction age + session age
- thread counts (
/active
/warm
)total - fallback state + activation count + consecutive prompt failures
- pressure reasons (
,context_usage
,context_ceiling
,compaction_gap
)session_age - last alert health/time + cooldown state
The daemon emits OTEL under
daemon.session-pressure (session_pressure.alert.sent|suppressed|failed). Operator paging is stricter now: only critical pressure states page Telegram, while elevated / recovered transitions stay in status/diagnose/OTEL.
Idle maintenance is autonomous for time-based pressure:
- watchdog evaluates idle session pressure even when no turn is active
- overdue compaction gaps trigger autonomous compaction instead of waiting for the next human/event turn
- age-triggered rotation can also happen from the watchdog path, seeding the fresh session with the compression summary before the next inbound turn
- those watchdog-triggered runs emit the same
telemetry as turn-bound maintenancedaemon.maintenance.started|completed|failed
Interruptibility and supersession (ADR-0196 / ADR-0218 rank 4 slice)
For direct human turns across Telegram, Discord, iMessage, and Slack invoke paths, the latest message now wins.
Runtime contract:
- new human turns get a short
batching window before dispatch1.5s - batching is per source, so rapid follow-ups collapse into one queued prompt
- if that source is already active, gateway supersedes immediately instead of waiting on the timer
- stale queued prompts from that source are dropped
- durable queue replay must not self-drop the freshest human message; supersession only applies when a genuinely newer same-source message exists
- daemon requests
on the stale turnsession.abort() - stale response text is suppressed instead of being delivered late
exposesjoelclaw gateway status
plussupersessionsupersession.batching
adds anjoelclaw gateway diagnose
layer with supersession and batching detailsinterruptibility
Passive intel / background event routes are excluded from this path.
Operator ack/timeout tracing (ADR-0218 rank 5 slice)
Telegram operator actions now get trace ids and explicit lifecycle tracking.
Current covered paths:
command-menu callbackscmd:*
callbacksworktree:*
ADR pitch callbackspitch:*- default Telegram callback actions and external callback-route handoffs
- direct Telegram slash commands registered through the command handler
- native Telegram
,/stop
, and/esc
commands/kill
Gateway now tracks
kind=callback|command plus ack/dispatched/completed/failed/timed_out state for those paths, exposes canonical operatorTracing in joelclaw gateway status (with callbackTracing kept as a compatibility alias), and adds an operator-tracing layer in joelclaw gateway diagnose.
Queued Telegram agent commands now keep their trace id through downstream gateway execution, complete on the real turn completion path, and fail on prompt error / assistant error / supersession instead of lying at enqueue time. Agent-backed command traces use a longer timeout window than simple callback ack paths.
External callback routes now have a Redis trace-result handoff too: the gateway leaves routed callbacks active, downstream consumers can publish
completed / failed back with the same traceId, and the in-tree Restate Telegram route is wired to close traces on real downstream resolution.
Still open: any out-of-tree external callback consumer that doesn't adopt the handoff will still timeout as untracked work.
Channel runtime contracts (ADR-0218 rank 6 slice)
joelclaw gateway status now exposes a canonical channels surface plus summarized channelHealth for Telegram, Discord, iMessage, and Slack. joelclaw gateway diagnose adds a channel-health layer so owner/passive/fallback and half-dead channel states are visible before a full outage.
Current rank-6 behavior also sends immediate degrade/recover alerts from the daemon, emits OTEL under
daemon.channel-health, respects joelclaw gateway known-issues mute state so known flaky channels stop crying wolf, and now tracks guarded heal policy state per channel.
channelHealth.healing now shows restart/manual policy, degraded streaks, cooldown, attempt counts, last result, plus manualRepairRequired, manualRepairSummary, and manualRepairCommands when the watchdog cannot fix the channel itself. gateway diagnose adds channel-healing, spells out manual operator repair steps, and the watchdog can attempt guarded restarts for restart-eligible degraded channels while leaving ownership/lease conflicts as manual/operator work.
Telegram retrying
getUpdates conflicts no longer read as healthy fallback: when polling is down and only retrying, the ownership contract degrades visibly in /health, gateway status, and gateway diagnose.
Low-signal operator-spam guardrails now also apply:
- treat
/restate
sources as automation for batching, so successful queue-dispatch DAG completions don’t hit the live gateway session immediatelyrestate/* - suppress
from operator delivery by defaulttest.gateway-e2e - route Joel Slack passive intel through the canonical Redis signal pipeline (
) instead of bypassing relay policyslack.signal.received - use
as the single heuristics surface for normalize → score → correlate → route across Slack/email signalspackages/gateway/src/operator-relay.ts - ingest
after the VIP pipeline delivers its direct Telegram brief so the gateway keeps correlation context without sending a duplicate alert, while allowing lower-signal email/Slack items to batch into a correlated digest by project/contact/conversation keysvip.email.received - drop low-signal-only digests (heartbeat-only / queue-dispatch-complete-only) instead of prompting the model for a pointless
HEARTBEAT_OK - routine fallback swap/recovery notices are not operator-facing during quiet hours, and recovery notices are log/OTEL-only unless some higher-signal path escalates them
- suppress direct operator-only
messages during quiet hoursKnowledge Watchdog Alert - add proactive-compaction hysteresis (30m cooldown unless context meaningfully worsens)
- require a minimum dwell on fallback before probing primary again, so fallback swap→restore chatter doesn’t flap immediately
- when the primary model is
, floorclaude-opus-4-6
tofallbackTimeoutMs
;240000
is now treated as stale and too aggressive for real Opus TTFT120000 - clear fallback timeout state immediately on empty/aborted
events so aborted turns cannot poison the next turn's latency/fallback monitoringmessage_end - emit structured fallback decision telemetry (
) with reason buckets / probe counts / probe backoff so OTEL can separate real provider sickness from a noisy control loopmodel_fallback.decision - back off primary recovery probes after transient/persistent probe failures instead of hammering the same sick provider every interval
- treat compaction/rotation as explicit maintenance windows: surface them in gateway status, emit maintenance lifecycle telemetry, and let idle-wait/watchdog logic extend for genuine maintenance work before declaring the turn dead
Muted degraded channels now also flip to
manual with the known-issue reason surfaced as repair guidance, instead of falsely advertising a restart policy that the watchdog will skip while muted.
Still not done: stricter cross-channel ownership enforcement and richer/native repair automation beyond CLI-guided manual steps.
Runtime guardrail enforcement (ADR-0189)
Gateway runtime now enforces two operator-visible guardrails:
- Tool-budget checkpoint tripwire
- channel turns: forced checkpoint after the 2-tool budget is exceeded
- internal/background turns: forced checkpoint after the 4-tool budget is exceeded
- telemetry:
daemon.guardrails:guardrail.checkpoint.*
- Automatic post-push deploy verification
- after successful
wheregit push
touchedHEAD
or root config (apps/web/
,turbo.json
,package.json
)pnpm-lock.yaml - daemon schedules
after ~75svercel ls --yes 2>&1 | head -10 - failures alert Telegram and emit
daemon.guardrails:guardrail.deploy_verification.failed
- after successful
Use
joelclaw gateway status to inspect live guardrails state, and joelclaw gateway diagnose when a checkpoint or deploy verification is active.
Behavior Control Plane (ADR-0211)
Gateway behavior is now explicit + deterministic:
- Runtime authority: Redis active contract (
)joelclaw:gateway:behavior:contract - History/candidates: Typesense
gateway_behavior_history - Write authority: CLI only (
)joelclaw gateway behavior ...
Operator directives can be entered directly via CLI or in-channel using strict syntax:
KEEP: ...MORE: ...LESS: ...STOP: ...START: ...
Gateway extension passively captures those lines and shells to
joelclaw gateway behavior add ....
It does not write Redis or Typesense directly.
Daily review is advisory-only: candidates are generated by cron and must be promoted manually via
joelclaw gateway behavior promote --id <candidate-id>.
Sending Events from Inngest Functions
Every Inngest function receives
gateway via middleware (ADR-0035):
async ({ event, step, ...rest }) => { const gateway = (rest as any).gateway as GatewayContext | undefined; await step.run("notify", async () => { if (!gateway) return; await gateway.notify("my.event.type", { message: "Human-readable summary", data: "structured-payload", }); }); }
| Method | Purpose | Routing |
|---|---|---|
| Pipeline/loop progress | Origin session + central gateway |
| Notifications (task done, webhook) | Origin session + central gateway |
| System warnings (disk, service down) | Central gateway only |
Import:
import type { GatewayContext } from "../middleware/gateway";
Always null-check
gateway — functions can run without a gateway session.
Low-Level: pushGatewayEvent()
For code outside the middleware:
import { pushGatewayEvent } from "./agent-loop/utils"; await pushGatewayEvent({ type: "video.downloaded", source: "inngest/video-download", payload: { message: "Downloaded: My Talk" }, originSession: event.data.originSession, });
Common Failure Modes
| Symptom | Likely Cause | Fix |
|---|---|---|
| Status shows healthy but messages don't arrive | Session stuck mid-stream on hung tool call | |
| Pending events growing on a session | Agent processing or blocked | Wait 1min, then |
| Telegram messages not delivered | HTML parsing error in response | Check , restart |
| Slack passive firehose looks dead (mentions still work) | not derived at startup | Ensure lease works; derives user id via , then |
| Slack replies have no default target | not derived at startup | Ensure lease works; derives DM channel via , then restart |
| Gateway restarts every few seconds | Crash loop — bad secret lease or code error | Check , fix cause |
| Redis connection failed | Redis pod down or Colima/k8s substrate down | Check , then / for cluster health |
optional dependency warning | Langfuse tracing dependency missing for pi extension runtime | Observability degradation only; do not treat as message-path blocker |
Architecture
launchd (com.joel.gateway) └─ gateway-start.sh (leases secrets, sets env) └─ bun run daemon.ts ├─ createAgentSession() → headless pi session (reads SOUL.md) ├─ Redis channel (joelclaw:notify:gateway) ├─ Telegram channel (@JoelClawPandaBot) ├─ WebSocket (port 3018, for TUI attach) ├─ Command queue (serial — one prompt at a time) └─ Heartbeat runner (periodic autonomous checks)
The gateway reads
~/.pi/agent/ at boot for identity/prompt context (SOUL.md, AGENTS.md, MEMORY.md, daily log), but the gateway extension itself is context-local:
- Canonical source:
~/Code/joelhooks/joelclaw/pi/extensions/gateway/index.ts - Active path:
(symlink)~/.joelclaw/gateway/.pi/extensions/gateway - Do not install/restore
globally~/.pi/agent/extensions/gateway
Daemon startup enforces this invariant and will fail if local extension is missing or a global gateway extension is detected.
This keeps gateway automation hooks out of normal interactive pi sessions.
Key Files
| File | Purpose |
|---|---|
| Daemon entry — session creation, channels, heartbeat |
| Redis subscribe, drain, prompt build |
| Telegram bot channel |
| Serial FIFO queue → |
| Periodic autonomous task runner |
| Middleware injecting context |
| CLI subcommands |
| launchd start script |
| Runtime logs and PID |
ADR anchors
- ADR-0213 — session lifecycle guards and anti-thrash behavior
- ADR-0146 — Langfuse observability integration (must fail open when optional dependency is missing)