Joelclaw system-bus
Develop, deploy, and debug the system-bus worker — joelclaw's 110+ Inngest durable function engine, webhook gateway, and observability pipeline. Triggers on 'add a function', 'new inngest function', 'system-bus', 'worker', 'add a webhook', 'deploy worker', 'restart worker', 'function failed', 'worker not working', 'register functions', or any task involving Inngest function development, webhook providers, or worker operations.
```bash
git clone https://github.com/joelhooks/joelclaw
T=$(mktemp -d) && git clone --depth=1 https://github.com/joelhooks/joelclaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/system-bus" ~/.claude/skills/joelhooks-joelclaw-system-bus && rm -rf "$T"
```
skills/system-bus/SKILL.md
System Bus Worker
The system-bus worker (
@joelclaw/system-bus) is joelclaw's event-driven backbone — 110+ Inngest durable functions, webhook ingestion, and observability. It runs as a Hono HTTP server registered with the self-hosted Inngest instance.
Architecture
```
packages/system-bus/
├── src/
│   ├── serve.ts                     # Hono server, Inngest registration, health endpoint
│   ├── inngest/
│   │   ├── client.ts                # Inngest client + event type definitions
│   │   ├── middleware/              # Gateway injection, dependency injection
│   │   └── functions/
│   │       ├── index.ts             # Combined exports
│   │       ├── index.host.ts        # Functions for host-role worker (local Mac)
│   │       ├── index.cluster.ts     # Functions for cluster-role worker (k8s)
│   │       └── <function-name>.ts   # Individual functions
│   ├── lib/                         # Shared utilities
│   │   ├── inference.ts             # LLM calls via pi (CANONICAL — always use this)
│   │   ├── redis.ts                 # Redis client helper
│   │   ├── typesense.ts             # Typesense client
│   │   ├── convex-content-sync.ts   # Convex upsert for content pipeline
│   │   ├── langfuse.ts              # Langfuse tracing
│   │   └── ...
│   ├── observability/
│   │   └── emit.ts                  # OTEL event emission
│   ├── webhooks/
│   │   ├── server.ts                # Webhook router (mounted at /webhooks)
│   │   ├── types.ts                 # Provider interface
│   │   └── providers/               # Per-service webhook handlers
│   │       ├── front.ts
│   │       ├── github.ts
│   │       ├── vercel.ts
│   │       ├── todoist.ts
│   │       ├── mux.ts
│   │       └── joelclaw.ts
│   └── memory/                      # Memory pipeline components
├── scripts/
│   └── sync-content-to-convex.ts    # Manual full Convex sync
└── package.json
```
Worker Roles
Two deployment modes controlled by
WORKER_ROLE env var:
| Role | Where | Functions |
|---|---|---|
| host | Local Mac Mini via Talon supervisor (optional) | Agent loops, heartbeat checks, memory pipeline, content sync, video ingest, book download — anything needing local filesystem, pi CLI, or docker |
| cluster | k8s pod (GHCR image) | Webhooks (Front, GitHub, Vercel, Todoist, Mux), approvals, notifications, Slack backfill — stateless, network-only |
Functions are split between
index.host.ts and index.cluster.ts. The combined index.ts exports everything for tooling/tests.
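A minimal sketch of how that role split can be wired at serve time — the export names (`hostFunctions`, `clusterFunctions`) and the exact Hono wiring are assumptions, not the actual `serve.ts` contents:

```typescript
// Sketch only — illustrative role-based registration, not the real serve.ts.
import { Hono } from "hono";
import { serve } from "inngest/hono";                                  // Inngest's Hono adapter
import { inngest } from "./inngest/client";
import { hostFunctions } from "./inngest/functions/index.host";       // assumed export name
import { clusterFunctions } from "./inngest/functions/index.cluster"; // assumed export name

const role = process.env.WORKER_ROLE === "cluster" ? "cluster" : "host";
const functions = role === "cluster" ? clusterFunctions : hostFunctions;

const app = new Hono();
// Mounted at /api/inngest so `curl -X PUT http://127.0.0.1:3111/api/inngest` re-syncs registration.
app.on(["GET", "PUT", "POST"], "/api/inngest", serve({ client: inngest, functions }));

export default app;
```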
Deployment Model
- Source of truth: `~/Code/joelhooks/joelclaw/packages/system-bus/`
- Running host worker: launchd service `com.joel.system-bus-worker`
  - launch script: `~/Code/system-bus-worker/packages/system-bus/start.sh`
  - checkout used by the running host worker: `~/Code/system-bus-worker/`
- Cluster runtime: `system-bus-worker` Deployment in the Talos/Colima k8s cluster for cluster-role workloads
- Cluster deploy path: `~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh`
Host function rollout reality
After changing
packages/system-bus/src/inngest/functions/* that run on the host worker:
- commit + push the monorepo change to `origin`
- `cd ~/Code/system-bus-worker && git pull --ff-only`
- `launchctl kickstart -k gui/$(id -u)/com.joel.system-bus-worker`
- `curl -X PUT http://127.0.0.1:3111/api/inngest`
Do not trust stale monorepo docs that imply the host worker runs directly from
~/Code/joelhooks/joelclaw.
Queue pilot flags are evaluated inside the live worker process, not your shell. If a host-worker emitter like
discovery-capture or /webhooks/github should switch to queue mode, put the flag in ~/.config/system-bus.env, then kickstart the worker and PUT-sync /api/inngest. Ad-hoc shell env only affects CLI-local emitters.
Queue triage flags follow the same rule. Current bounded admission contract:
- `QUEUE_TRIAGE_MODE=off|shadow|enforce` sets the base triage mode.
- `QUEUE_TRIAGE_FAMILIES=discovery,content,subscriptions,github` (or exact event names) chooses which queue families participate at all.
- `QUEUE_TRIAGE_ENFORCE_FAMILIES=discovery,github` is the narrow Story 4 override that upgrades only `discovery/noted` and `github/workflow_run.completed` into enforce.
- Any non-eligible family is clamped back to `shadow` even if someone sets global `QUEUE_TRIAGE_MODE=enforce`, as sketched below.
- Handler routing always stays registry-derived; triage may only shape bounded admission fields.
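A minimal sketch of that clamping contract, assuming a hypothetical `triageModeFor` helper and plain env parsing (the real triage module may structure this differently):

```typescript
// Illustrative only — not the real triage code. Shows how the flag layering composes.
type TriageMode = "off" | "shadow" | "enforce";

const baseMode = (process.env.QUEUE_TRIAGE_MODE ?? "off") as TriageMode;
const families = (process.env.QUEUE_TRIAGE_FAMILIES ?? "").split(",").filter(Boolean);
const enforceFamilies = (process.env.QUEUE_TRIAGE_ENFORCE_FAMILIES ?? "").split(",").filter(Boolean);

export function triageModeFor(eventName: string): TriageMode {
  const family = eventName.split("/")[0];            // "discovery/noted" -> "discovery"
  const participates = families.includes(family) || families.includes(eventName);
  if (!participates) return "off";                   // family opted out of triage entirely
  if (baseMode !== "enforce") return baseMode;       // off/shadow pass straight through
  const eligible = enforceFamilies.includes(family) || enforceFamilies.includes(eventName);
  return eligible ? "enforce" : "shadow";            // non-eligible families clamp back to shadow
}
```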
content/updated is the odd one out: its ingress comes from the launchd watcher com.joel.content-sync-watcher, not from a worker-local function. The canonical watcher source now belongs in infra/launchd/com.joel.content-sync-watcher.plist plus scripts/content-sync-watcher.sh, and the script reads ~/.config/system-bus.env on each trigger so QUEUE_PILOTS=content can switch between joelclaw queue emit and legacy joelclaw send without hand-editing the live plist.
For Story 5 soak work, start from
joelclaw jobs status for the first operator glance, then drop into joelclaw queue stats before spelunking raw OTEL or Redis. jobs status is the transitional runtime surface that rolls queue / Restate / Dkron / Inngest into one JSON snapshot without forcing the operator to jump between commands just to learn whether the substrate is healthy enough to take work. queue stats remains the queue-specific summary for Restate drainer health and queue triage behavior: it rolls up recent queue.dispatch.started|completed|failed telemetry plus the queue.triage.* lifecycle into live depth, terminal success/failure counts, waitTimeMs percentiles, dispatch-duration percentiles, fallback reasons, disagreement counts, applied-vs-suggested deltas, route mismatches, family rollups, and recent mismatch/fallback samples. Use joelclaw queue stats --since <iso|ms> when you need to anchor the sample to a known-clean point such as a supervised queue.drainer.started after rollout. Honest gotcha from the live Story 5 cleanup follow-through: global depth can lie because of unrelated historical backlog, so judge the supervised sample first with the anchored triage/dispatch window plus joelclaw queue inspect <stream-id> / joelclaw queue list --limit <n> on the fresh sample IDs. If old residue survives a supervised com.joel.restate-worker restart, clear it with a bounded @joelclaw/queue ack() pass only after confirming zero pending leases and an age filter on the orphaned stream IDs. If that command is broken or misleading, fix it before widening queue cutovers.
For ADR-0217 Phase 3 Story 2-4, the operator surfaces are
joelclaw queue observe, joelclaw queue pause, joelclaw queue resume, and joelclaw queue control status. queue observe still answers “what would Sonnet do right now?” in dry-run, but its snapshot.control.activePauses plus top-level control block now reflect the shipped deterministic control plane: active pauses, expirations, and recent queue.control.* OTEL. It short-circuits to a deterministic noop when the backlog is fully explained by fresh active manual pauses and no recent failures suggest downstream trouble, and it now also short-circuits to deterministic resume_family when queued work is entirely held behind a settled observer pause with no fresh drainer/triage failures. That avoids wasting a 60s Sonnet call on an obvious operator hold state and stops the observer from mistaking its own pause for permanent downstream failure. queue pause / queue resume are the bounded manual apply path before any automatic observer mutation. queue control status is the direct operator truth source for active manual controls and recent queue.control.applied|expired|rejected events.
ADR-0217 Phase 3 Story 4 now has a live host-worker runtime in
packages/system-bus/src/inngest/functions/queue-observer.ts. Durable cadence belongs in Inngest, not the gateway daemon: the cron controller stays on queue/observer, while manual queue/observer.requested probes now run through a separate queue/observer-requested function so operator requests do not sit behind the cron pass. Runtime flags live in ~/.config/system-bus.env and require the usual host-worker restart + PUT /api/inngest:
- `QUEUE_OBSERVER_MODE=off|dry-run|enforce`
- `QUEUE_OBSERVER_FAMILIES=discovery,content,subscriptions,github`
- `QUEUE_OBSERVER_AUTO_FAMILIES=content`
- `QUEUE_OBSERVER_INTERVAL_SECONDS` (currently clamped to a 60s minimum on the durable cron path)
Both paths build the same bounded snapshot and use the shared Sonnet observer contract, but only the cron controller may auto-apply queue actions; manual probes are read-only even when the configured mode is
enforce. The shared observer also short-circuits deterministic noops for both truly empty queues and empty queues that only still have active pauses hanging around, so the cron path does not waste a full model call on obvious nothing-to-do snapshots. Idle empty snapshots with no recent drainer failures or triage trouble now report downstreamState=healthy instead of inheriting a noisy degraded label from stale throughput/latency history. Settled observer-held backlogs now get the same treatment: once a queued family is already behind an observer pause for at least one cadence and no fresh drainer/triage failures exist, the shared observer deterministically emits resume_family instead of reading its own hold as proof that downstream is still down. The manual probe function also uses singleton-skip semantics so repeated operator requests do not pile up stale queued runs. Live canaries also exposed a prompt-contract gotcha: Sonnet was happy to emit escalate.reason, which blew up the strict schema. The shared parser now normalizes that legacy shape to { severity: "warn", message }, and the prompt explicitly tells Sonnet to prefer pause_family/resume_family over batch_family for the content/updated pilot when downstream is crook.
Current operator truth after the latest live canaries: dry-run is earned and the hardened enforce path has now completed one full automatic observer cycle on
content/updated. The first supervised canary anchored at since=1772981290859 booted Restate out long enough to build a 30-item content backlog and auto-applied pause_family on snapshot cca656f7-a9ce-4ca2-9f6d-0ed332f56a4d. The patched follow-up canary anchored at since=1772985057594 then paused on snapshot 1cb24e7b-f0cd-4e0c-ae5d-27cb4934b49a, auto-resumed on snapshot 151aa03a-fced-41f0-9a54-2f3d1a70856d / run 01KK72HD0EMT3T34K8QP3SMEW9, and drained the held content item back to queue depth 0. The steady-state worker still belongs in QUEUE_OBSERVER_MODE=dry-run; enforce remains a supervised drill until more soak windows say otherwise. Dogfood follow-through also exposed that check/o11y-triage is a long-running current-state scan, not an irreplaceable per-event handler, so it now needs singleton-skip semantics to avoid piling duplicate queued runs behind one active pass.
Hard-won gotcha from the Story 3 live proof: queue operator commands must resolve Redis from the canonical CLI config (
~/.config/system-bus.env → REDIS_URL) before ambient shell env. The first proof looked wrong because the shell had an unrelated Upstash REDIS_URL, so queue pause wrote control state to the wrong Redis while queue emit still hit the localhost worker/drainer queue. If the CLI and worker disagree about Redis, fix that first or your proof is bullshit.
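A sketch of that resolution order, assuming a simple line-based parse of `~/.config/system-bus.env` (the real CLI helper may differ):

```typescript
// Illustrative only: the canonical config file wins; ambient shell env is a fallback.
import { readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

function loadSystemBusEnv(): Record<string, string> {
  try {
    const raw = readFileSync(join(homedir(), ".config", "system-bus.env"), "utf8");
    const entries = raw
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line && !line.startsWith("#") && line.includes("="))
      .map((line) => [line.slice(0, line.indexOf("=")), line.slice(line.indexOf("=") + 1)] as const);
    return Object.fromEntries(entries);
  } catch {
    return {};
  }
}

// ~/.config/system-bus.env → REDIS_URL first, shell env only as a fallback.
export const redisUrl = loadSystemBusEnv().REDIS_URL ?? process.env.REDIS_URL;
```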
Adding a New Inngest Function
- Create `packages/system-bus/src/inngest/functions/<name>.ts`
- Import `inngest` from `../client`
- Define the function:

```typescript
import { inngest } from "../client";

export const myFunction = inngest.createFunction(
  {
    id: "system/my-function",
    // NEVER set retries: 0 — let Inngest defaults handle retries
    concurrency: { limit: 1, key: '"my-function"' },
  },
  { event: "my/event.name" },
  async ({ event, step, ...rest }) => {
    const gateway = (rest as any).gateway as
      | import("../middleware/gateway").GatewayContext
      | undefined;

    const result = await step.run("do-work", async () => {
      // your logic here
      return { done: true };
    });

    return result;
  }
);
```

- Export from `index.host.ts` or `index.cluster.ts` (depending on role)
- Add the export to `index.ts` as well
- Add the event type to `client.ts` if it's a new event (see the sketch after this list)
- TypeScript check: `bunx tsc --noEmit -p packages/system-bus/tsconfig.json`
- Deploy (see below)
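For the "add the event type to `client.ts`" step, a hedged sketch of the typed-events pattern the Inngest SDK supports — the actual client id and existing event map in `client.ts` are assumptions:

```typescript
// Sketch of a typed event registration — align with whatever client.ts already does.
import { EventSchemas, Inngest } from "inngest";

type Events = {
  // ...existing events stay here
  "my/event.name": {
    data: { itemId: string };   // hypothetical payload for the new event
  };
};

export const inngest = new Inngest({
  id: "system-bus",                                  // assumed client id
  schemas: new EventSchemas().fromRecord<Events>(),  // keeps inngest.send() and triggers typed
});
```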
Event Naming Convention
Events describe what happened, not commands:
agent/memory.observed not agent/memory.write.
LLM Inference
ALWAYS use the shared utility:
```typescript
import { infer } from "../../lib/inference";

const { text } = await infer("Your prompt", {
  task: "classification",
  model: "anthropic/claude-haiku",
  system: "System prompt here",
  component: "my-function",
  action: "my-function.classify",
  noTools: true,
  print: true,
});
```
This shells to
pi -p --no-session --no-extensions. Zero config, zero API cost. NEVER use OpenRouter, read auth.json, or use paid API keys directly.
If the function is doing long-form editorial or other large-context LLM work, set an explicit
timeout on infer() instead of inheriting the shared 120s default. Current earned example: content/review.submitted rewrite/retry/verify runs use a 300s budget because long posts can blow past 120s after all preflight bookkeeping already succeeded.
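A hedged example of that earned pattern — only the `timeout` option is documented above; the other option values here are illustrative:

```typescript
import { infer } from "../../lib/inference";

const longPostRewritePrompt = "Rewrite the draft post, preserving voice and structure…";

// Long-form editorial work: pass an explicit budget instead of inheriting the shared 120s default.
const { text } = await infer(longPostRewritePrompt, {
  task: "long-form-rewrite",            // illustrative task label
  component: "content-review",
  action: "content-review.rewrite",
  timeout: 300_000,                      // 300s — long posts can blow past 120s after preflight work
});
```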
Gateway Context
All functions receive
gateway context via middleware (ADR-0144). Use it for notifications:
```typescript
const gateway = (rest as any).gateway as
  | import("../middleware/gateway").GatewayContext
  | undefined;

await gateway?.notify("event.name", { details });
await gateway?.alert("Something broke", { error: String(err) });
await gateway?.progress("Step 3/5 complete");
```
Hard Rules
- VIP email history now has a dedicated backfill path: `vip/email-threads.backfill` (host worker) walks Front contact history for configured VIP sender emails and hydrates Typesense `email_threads` before narrative briefs rely on prior thread arcs. Default VIP sender aliases now include Alex Hillman's `alex@indyhall.org` address so the backfill can resolve the canonical Front contact.
- `vip/email-received` operator output is narrative-first now: expect 3-5 sentence prose plus one calibrated urgency line (🔴🟠🟡🟢✅), not the old `Thread / Your last reply / Needs your attention` block format.
- NEVER set `retries: 0` — Inngest defaults handle retries. This has caused multiple production failures.
- Events silently dropped if functions not registered. Verify `joelclaw functions` returns >0 before sending events. `joelclaw refresh` forces re-registration.
- Inngest server function registry goes stale on worker restart. Always `curl -X PUT http://127.0.0.1:3111/api/inngest` after restart.
- Don't edit monorepo while a loop is running. `git add -A` scoops up unrelated changes.
- Step names must be unique within a function — Inngest uses them for memoization.
- `step.invoke` over fan-out events for rate-limited APIs — fan-out starts all near-simultaneously even with throttle.
- Silent failure anti-pattern: Functions that shell to CLIs must detect and propagate subprocess failures.
- ADR content sync must degrade on frontmatter parse failures. `upsertAdr` falls back to body-only parsing (empty frontmatter + stripped frontmatter block) and logs a warning, instead of dropping the ADR from Convex.
- Non-authoritative side effects must degrade, not crash the workflow. Example: `memory/proposal-triage` keeps triage authoritative, retries review-task creation across primary/fallback Todoist projects (`MEMORY_REVIEW_TODOIST_PROJECT` → `MEMORY_REVIEW_TODOIST_FALLBACK_PROJECT`), and only records degraded state if both fail.
- Never call `joelclaw` CLI with `Bun.spawnSync` from inside a running Inngest function. `joelclaw inngest status` probes the worker endpoint; sync subprocesses can deadlock the worker event loop. Use async subprocess execution (`Bun.spawn` / `Bun.$`) with explicit timeouts, or direct internal health probes. (See the sketch after this list.)
- Background agent runs must be non-blocking. `system/agent-dispatch` cannot use `execSync`/other blocking subprocess APIs for long codex or claude runs on the host worker; blocking the Bun event loop causes Talon/worker-supervisor health checks to fail, the worker to restart, `/internal/agent-await` to drop, and Inngest runs to go stale.
- Pi is now the preferred Restate PRD story executor. `system/agent-dispatch` must honor the requested `cwd` when it calls `infer()`, should enable pi tools when file work is requested (`readFiles` or path-heavy prompts), and should use the dedicated roster agent `story-executor` for Restate PRD stories so they run under the tight execution prompt instead of the generic background-agent system prompt. The host bridge must also write a `running` inbox snapshot before long agent execution starts and dedupe `/internal/agent-dispatch` by `requestId`; otherwise multi-minute Restate retries spawn duplicate story agents and operators get a useless forever-`pending` state.
- Execution mode: host vs sandbox (ADR-0217 Story 4/next batch). `system/agent-dispatch` accepts `executionMode: "host" | "sandbox"` (default: `"host"`). Host mode uses the existing shared-checkout path. Sandbox mode now has a concrete backend split: `sandboxBackend: "local" | "k8s"` (default local). The local backend is the proved live path on the host worker: it materializes a clean temp checkout at `baseSha`, runs the requested agent inside that isolated repo, exports patch/touched-file artifacts, and then tears the sandbox down without dirtying the operator checkout. Gate A (non-coding vertical slice) is proven via `packages/agent-execution/__tests__/gate-a-smoke.test.ts`. Gate B (minimal coding sandbox) is proven via `packages/agent-execution/__tests__/gate-b-smoke.test.ts`. The k8s backend is now code-landed and opt-in: `@joelclaw/agent-execution` owns Job spec generation plus Job launch/status/log helpers, `job-runner.ts` prints `SandboxExecutionResult` log markers and POSTs terminal results to `/internal/agent-result`, and `InboxResult` now preserves `sandboxBackend` plus optional Job metadata. Current honest limit: `pi` remains the local-backend story executor for now; the k8s runner is for runner-installed CLIs until host-routed pi-in-pod execution is designed. Deterministic sandbox requests should carry `workflowId`, `storyId`, `baseSha`, `repoUrl`, and `branch`; `trigger-prd.ts` now has explicit tool/backend knobs (`PRD_EXEC_TOOL`, `PRD_EXECUTION_MODE`, `PRD_SANDBOX_BACKEND`).
- Terminal state guarantees (ADR-0217 Story 5). `system/agent-dispatch` ensures every execution lands in a terminal state (`completed|failed|cancelled`). Duplicate requests with the same `requestId` are deduped at function entry — if a terminal result already exists, it returns that result without spawning new work. Cancellation via `system/agent.cancelled` kills the active subprocess (tracked in the `activeProcesses` map by requestId) and writes a `cancelled` inbox snapshot via the `onFailure` handler.
- Log surfacing (ADR-0217 Story 5). All terminal results include `stdout`/`stderr` output (truncated to 10KB each) in the `logs` field. This is captured from subprocess execution and attached to the inbox result for post-mortem debugging. The logs are also emitted via OTEL events for searchability.
- Do not capture tool-enabled pi attempts by waiting on pipe EOF. In `src/lib/inference.ts`, background pi runs with tools can spawn descendants that inherit stdout/stderr, leaving `new Response(proc.stdout).text()` or similar pipe readers hanging after the real `pi` child exits. Redirect stdout/stderr to temp files (or another exit-driven sink), wait for `proc.exited`, then read the captured output so `system/agent-dispatch` can always write a terminal inbox snapshot.
- Apply the same exit-driven capture rule inside `system/agent-dispatch` command execution. Codex/Claude/bash subprocesses and local sandbox infra commands can also leave descendants holding stdout/stderr open after the parent exits. If `agent-dispatch` waits on pipe EOF there, terminal inbox writeback stalls and sandbox runs lie in `running` even though the real work already finished or failed.
- Use `tool: "canary"` for deterministic live verification of the dispatch substrate itself. This is the non-LLM proof lane for `system/agent-dispatch`: fixed scenarios like `sleep-timeout` and `orphan-stderr` exercise the same subprocess capture + terminal inbox/registry path without depending on model behavior. Canonical timeout proof script: `bun scripts/verify-agent-dispatch-timeout.ts`. Canonical operator surface: `joelclaw status --agent-dispatch-canary`, and the default status envelope now exposes the latest persisted canary summary. Scheduled health integration is gated off by default and only activates when the live worker sets `HEALTH_AGENT_DISPATCH_CANARY_SCHEDULE=signals`.
- `infer({ timeout })` is an overall budget, not a per-fallback reset. Story 6 proved that reusing a fresh 10-minute timeout on every fallback attempt creates a hidden 30-minute failure chain (`SIGTERM` → `exit 143`) before the real story budget is exhausted. `src/lib/inference.ts` must spend the remaining deadline across attempts and preserve up to a one-hour explicit request budget for Restate PRD story runs.
- Timeout errors must say timeout, not `exit 143: empty output`. When `pi` is killed by the inference timer, surface `pi timed out after <ms>` in the thrown error and OTEL metadata so operators know it was our budget kill, not a mysterious subprocess crash.
- Do not import `packages/cli/src/*` from system-bus via relative paths. Keep runbook resolution local in `packages/system-bus` (or extract to a dedicated leaf package) and avoid creating `@joelclaw/system-bus` ↔ `@joelclaw/sdk` dependency cycles that break Turbo/Vercel.
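A sketch of the async, exit-driven capture pattern several rules above demand (async spawn, explicit budget, temp-file sinks, read only after `proc.exited`) — the helper name and exact options are illustrative, not the real agent-dispatch code:

```typescript
// Illustrative only: never spawnSync in the worker; never wait on pipe EOF for capture.
import { mkdtempSync, openSync, closeSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

export async function runCaptured(cmd: string[], timeoutMs: number) {
  const dir = mkdtempSync(join(tmpdir(), "agent-run-"));
  const outPath = join(dir, "stdout.log");
  const errPath = join(dir, "stderr.log");
  const outFd = openSync(outPath, "w");
  const errFd = openSync(errPath, "w");

  const proc = Bun.spawn(cmd, { stdout: outFd, stderr: errFd }); // descendants inherit files, not our pipes
  const timer = setTimeout(() => proc.kill(), timeoutMs);        // our budget kill — surface it as a timeout

  const exitCode = await proc.exited;                            // exit-driven, not pipe-EOF-driven
  clearTimeout(timer);
  closeSync(outFd);
  closeSync(errFd);

  return {
    exitCode,
    timedOut: proc.killed,                                       // true when our timer fired the kill
    stdout: readFileSync(outPath, "utf8").slice(0, 10_000),      // 10KB truncation, mirroring the logs rule
    stderr: readFileSync(errPath, "utf8").slice(0, 10_000),
  };
}
```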
Deploy: system-bus-worker (k8s)
```bash
~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh
kubectl -n joelclaw rollout status deployment/system-bus-worker --timeout=180s
joelclaw refresh
```
Builds ARM64 image, pushes to GHCR, updates k8s deployment, verifies rollout.
Adding a Webhook Provider
See the webhooks skill for full details. Quick summary:
- Create `src/webhooks/providers/<service>.ts` implementing `WebhookProvider` (see the sketch after this list)
- Register in `src/webhooks/server.ts`
- Add secret to `WEBHOOK_SECRETS` array in `serve.ts`
- Store secret in agent-secrets: `secrets add <service>_webhook_secret`
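A hypothetical provider sketch — the real `WebhookProvider` contract lives in `src/webhooks/types.ts` and will differ; the fields and handler signature here are assumptions:

```typescript
// Hypothetical shape — check src/webhooks/types.ts for the real WebhookProvider contract.
import type { WebhookProvider } from "../types";
import { inngest } from "../../inngest/client";

export const exampleProvider: WebhookProvider = {
  name: "example",                                  // assumed field
  async handle(req: Request): Promise<Response> {   // assumed signature
    const payload = await req.json();
    // Translate the provider payload into a past-tense joelclaw event (what-happened naming).
    await inngest.send({ name: "example/thing.updated", data: payload });
    return new Response("ok");
  },
};
```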
Debugging
```bash
# Check worker health
curl http://localhost:3111/ | jq

# View registered functions
joelclaw functions

# Recent runs
joelclaw runs --count 20

# Inspect a specific run
joelclaw run <RUN_ID>

# Worker logs (k8s)
kubectl logs -n joelclaw deploy/system-bus-worker -f

# Inngest server logs
kubectl logs -n joelclaw inngest-0 | grep ERROR

# Force re-registration
curl -X PUT http://127.0.0.1:3111/api/inngest
```
Runtime forensics: stale RUNNING runs
When Inngest APIs disagree (runs list shows RUNNING, run detail shows terminal or non-cancellable state), treat it as runtime metadata drift, usually after SDK reachability failures.
Operational truths:
- Runtime DB is SQLite inside the k8s Inngest pod: `inngest-0:/data/main.db`.
- `trace_runs.status` alone is not sufficient to infer terminality.
- Terminal source-of-truth is the presence of terminal history entries: `FunctionCompleted`, `FunctionFailed`, `FunctionCancelled`.
Safe reconciliation sequence:
- Preview with `joelclaw inngest sweep-stale-runs`.
- Apply with `joelclaw inngest sweep-stale-runs --apply` (auto backup + transactional writes).
- If manual fallback is required:
  - Backup DB: `kubectl -n joelclaw exec inngest-0 -- sqlite3 /data/main.db '.backup /data/main.db.pre-sweep-<ts>.sqlite'`
  - Find stale candidates via `trace_runs` + `function_finishes` + `history` joins.
  - Insert missing terminal history (`FunctionCancelled`) for stale candidates.
  - Ensure `function_finishes` rows exist.
  - Update `trace_runs.status` to cancelled (`500`) only after history/finishes.
- Verify with `joelclaw run <id>` and a fresh `joelclaw runs --status RUNNING`.
Key Files
| File | Purpose |
|---|---|
| `src/serve.ts` | HTTP server, Inngest registration, health endpoint, and host-only internal agent bridge endpoints (`/internal/agent-dispatch`, `/internal/agent-await`, `/internal/agent-result`) |
| `src/inngest/client.ts` | Event type definitions, Inngest client |
| `src/inngest/middleware/` | Gateway context injection |
| `src/inngest/functions/index.host.ts` | Host-role function list |
| `src/inngest/functions/index.cluster.ts` | Cluster-role function list |
| `src/lib/inference.ts` | LLM inference via pi (use this, not raw APIs) |
| `src/observability/emit.ts` | OTEL event emission |
| `src/webhooks/server.ts` | Webhook route registration |
| `k8s/publish-system-bus-worker.sh` | K8s deploy script |