Joelclaw system-architecture
Canonical joelclaw topology and wiring map. Use when reasoning about architecture, tracing event flow, debugging why something ran/didn't run, identifying which worker executes a function, checking what listens on a port, or following an event end-to-end.
```bash
git clone https://github.com/joelhooks/joelclaw
T=$(mktemp -d) && git clone --depth=1 https://github.com/joelhooks/joelclaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/system-architecture" ~/.claude/skills/joelhooks-joelclaw-system-architecture && rm -rf "$T"
```

skills/system-architecture/SKILL.md

System Architecture (Canonical Topology)
This skill is the single source of truth for joelclaw system wiring. Use it for:
- "why did this run / not run"
- "which worker handles this function"
- "what is listening on port X"
- "how does event Y flow"
- full-stack routing/debug across CLI → Inngest → workers → gateway → telemetry
Ground-Truth Scope + Evidence Snapshot
This document is grounded in direct reads of:
- apps/docs-api/src/index.ts
- packages/restate/Dockerfile
- packages/restate/src/index.ts
- packages/restate/src/workflows/dag-orchestrator.ts
- packages/agent-execution/src/microvm.ts
- packages/system-bus/src/serve.ts
- packages/system-bus/src/inngest/functions/index.host.ts
- packages/system-bus/src/inngest/functions/index.cluster.ts
- packages/system-bus/src/inngest/client.ts
- infra/worker-supervisor/src/main.rs
- ~/Library/LaunchAgents/com.joel*.plist (all files)
- k8s/*
- infra/pds/values.yaml
- packages/gateway/src/daemon.ts
- packages/gateway/src/channels/*.ts
- ~/.joelclaw/gateway/AGENTS.md
- ~/.joelclaw/gateway/.pi/settings.json
- ~/.local/caddy/Caddyfile
- ~/.colima/default/colima.yaml + colima status --json
- packages/cli/src/cli.ts, packages/cli/src/config.ts, packages/cli/src/inngest.ts
- packages/system-bus/src/observability/* (key files: emit.ts, otel-event.ts, store.ts)
- packages/telemetry/src/emitter.ts
- packages/system-bus/src/lib/langfuse.ts
- packages/inference-router/src/tracing.ts
- ADRs in ~/Vault/docs/decisions/ (required + topology-adjacent)
- last 50 lines of ~/Vault/system/system-log.jsonl
Related docs verified
- docs/architecture.md — Restate/Firecracker runtime + workload execution flow
- docs/deploy.md — Restate worker deploy + auth/identity/PVC procedures
- docs/cli.md — workload command tree + runtime bridge
- docs/observability.md — not inspected in this update
1) Physical Topology
Mac Mini "Panda" (host macOS) ├─ launchd services (gateway, worker supervisor, caddy, talon, agent-mail, etc.) ├─ Colima VM (driver: VZ, arch: aarch64, runtime: docker, VM IP: 192.168.64.2) │ └─ Talos node: joelclaw-controlplane-1 (k8s v1.35.0, internal IP 10.5.0.2) │ ├─ namespace: joelclaw │ │ ├─ inngest (StatefulSet + NodePort 8288/8289) │ │ ├─ redis (StatefulSet + NodePort 6379) │ │ ├─ typesense (StatefulSet + ClusterIP 8108) │ │ ├─ restate (StatefulSet + NodePort 8080/9070/9071) │ │ ├─ system-bus-worker (Deployment + ClusterIP 3111) │ │ ├─ restate-worker (Deployment + ClusterIP 9080; full agent image + Firecracker) │ │ ├─ dkron (StatefulSet + ClusterIP 8080) │ │ ├─ docs-api (Deployment + NodePort 3838) │ │ ├─ livekit-server (Deployment + NodePort 7880/7881) │ │ ├─ bluesky-pds (Deployment + NodePort 3000) │ │ └─ minio (StatefulSet + NodePort 30900/30901) │ └─ namespace: aistor │ ├─ aistor operator (Deployments: adminjob-operator, object-store-operator) │ └─ aistor-s3 object store (StatefulSet + NodePort 31000/31001) ├─ Caddy reverse proxy (tailnet HTTPS fan-in) ├─ Gateway daemon (embedded pi session) ├─ Firecracker substrate (requires Colima nestedVirtualization=true for /dev/kvm; OFF by default — unstable under load) └─ NAS "three-body" (NFS tiers per ADR-0088)
Known runtime endpoints
- Colima VM IP: 192.168.64.2 (colima status --json)
- Kubernetes API (local forward): https://127.0.0.1:64784 (kubectl cluster-info)
- Tailnet hostnames seen in config: panda.tail7af24.ts.net (Caddy routes), pds.panda.tail7af24.ts.net (PDS values)
Tailscale mesh state
- tailscale status --json failed in this environment: UNKNOWN — needs manual verification
2) Process Inventory (Long-Running)
Host launchd inventory (snapshot)

Snapshot source: launchctl print gui/$(id -u)/<label> and plist inspection.
| Launchd label | State | PID (snapshot) | Role | Ports / endpoints |
|---|---|---|---|---|
| com.joel.system-bus-worker | running | 75292 | Host worker supervisor (worker-supervisor) | supervises child bun on 3111 |
| — | retired / rollback-only | — | Historical host Restate wrapper | superseded by deployment/restate-worker on 9080 |
| com.joel.gateway | running | 81275 | Gateway daemon | WS 3018, Redis bridge |
| com.joel.caddy | running | 9347 | Reverse proxy | 3443, 5443, 6443, 7443, 8290, 8443, 9443 |
| com.joel.talon | running | 96359 | Infra watchdog | health 9999 |
| — | running | 98048 | Secret lease daemon | no public port |
| — | running | 61110 | iMessage JSON-RPC socket daemon | Unix socket |
| — | running | 32095 | Typesense local port-forward | local 8108 |
| — | running | 71887 | voice agent runtime | local 8081 |
| com.joel.local-sandbox-janitor | scheduled | (launchd timer) | ADR-0221 local sandbox janitor (scripts/local-sandbox-janitor.sh → joelclaw workload sandboxes janitor) | log path UNKNOWN |
| — | spawn scheduled | (none in launchctl snapshot) | agent-mail MCP HTTP service | observed listener on 8765 (python process) |
| — | not running | — | startup helper for Colima | n/a |
| — | not running | — | periodic k8s heal script | n/a |
| — | not running | — | sync guard watcher | n/a |
| — | not running | — | gateway tripwire script | n/a |
| — | not running | — | fs watch -> content/updated event | n/a |
| — | not running | — | Vault log sync watcher | n/a |

Labels marked — were lost from this snapshot: UNKNOWN — needs manual verification.
Process supervision behavior: worker-supervisor

Source: infra/worker-supervisor/src/main.rs

- Default config:
  - worker dir: ~/Code/joelhooks/joelclaw/packages/system-bus
  - command: bun run src/serve.ts
  - port: 3111
  - health endpoint: /api/inngest
  - sync endpoint: PUT /api/inngest
  - health interval: 30s
  - restart after 3 consecutive health failures
  - restart backoff: 1s → 30s max
- Pre-start kills any stale process on port 3111.
- Runs a host import preflight before spawn: bun --eval "await import('./src/inngest/functions/index.host.ts');"
  - on failure, skips the spawn and retries with exponential backoff
- Loads env from ~/.config/system-bus.env plus leased secrets.
- Forces WORKER_ROLE=host for the supervised host worker.
- Emits OTEL events via CLI on supervisor failures/restarts: worker.supervisor.preflight.failed, worker.supervisor.worker_exit, worker.supervisor.health_check.restart
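The supervision loop itself is the compiled Rust binary; as a logic-only sketch of the cycle described above (TypeScript used for illustration, not the actual main.rs):

```typescript
// Logic-only sketch of the supervision cycle above. The real implementation is
// the Rust binary at infra/worker-supervisor/src/main.rs; names here are illustrative.
const HEALTH_INTERVAL_MS = 30_000; // health interval: 30s
const MAX_FAILURES = 3;            // restart after 3 consecutive failures

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function supervise(restartWorker: () => Promise<void>): Promise<never> {
  let failures = 0;
  let backoffMs = 1_000; // restart backoff: 1s → 30s max
  while (true) {
    const healthy = await fetch("http://localhost:3111/api/inngest")
      .then((r) => r.ok)
      .catch(() => false);
    if (healthy) {
      failures = 0;
      backoffMs = 1_000; // healthy again: reset backoff
    } else if (++failures >= MAX_FAILURES) {
      await restartWorker(); // the real binary emits worker.supervisor.health_check.restart here
      failures = 0;
      await sleep(backoffMs);
      backoffMs = Math.min(backoffMs * 2, 30_000);
    }
    await sleep(HEALTH_INTERVAL_MS);
  }
}
```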
Worker supervision split note
- Talon is running (com.joel.talon), but the host worker is still launched via com.joel.system-bus-worker -> worker-supervisor.
- ADR + system-log indicate Talon can defer worker supervision during coexistence.
Kubernetes process inventory

Node joelclaw-controlplane-1 (Talos v1.12.4, k8s v1.35.0, internal IP 10.5.0.2)
Core services

| Service | Workload kind | Service type | Service port(s) | NodePort(s) / exposure | Role |
|---|---|---|---|---|---|
| Inngest | StatefulSet | NodePort | 8288, 8289 | 8288, 8289 | Event API + connect ws |
| Redis | StatefulSet | NodePort | 6379 | 6379 | Queue/state/pubsub |
| Typesense | StatefulSet | ClusterIP | 8108 | host via launchd port-forward 8108 | Search + telemetry store |
| Restate | StatefulSet | NodePort | 8080, 9070, 9071 | 8080, 9070, 9071 | Durable workflow ingress + admin + metrics |
| system-bus-worker | Deployment | ClusterIP | 3111 | in-cluster only | Cluster-role worker (12 functions) |
| restate-worker | Deployment | ClusterIP | 9080 | in-cluster only | dagOrchestrator + dagWorker + queue drainer in full agent image |
| docs-api | Deployment | NodePort | 3838 | 3838 | PDF/docs API + agentic search + taxonomy graph |
| dkron | StatefulSet | ClusterIP + headless peer svc | 8080, 8946, 6868 | in-cluster only; operator access via short-lived CLI-managed tunnel | Distributed cron scheduler for Restate pipelines |
| livekit-server | Deployment (Helm) | NodePort | 80, 7881 | 7880 (for svc port 80), 7881 | LiveKit signaling + rtc tcp |
| bluesky-pds | Deployment (Helm-managed) | NodePort | 3000 | 3000 | AT Proto PDS |
| minio | StatefulSet | ClusterIP + NodePort | 9000, 9001 | 30900, 30901 | Legacy local S3-compatible runtime |
| aistor-s3-api (aistor ns) | NodePort service (operator-managed) | NodePort | 443, 9000 | 31000 (+ dynamic management NodePort) | AIStor S3 API (TLS + management) |
| aistor-s3-console (aistor ns) | NodePort service (operator-managed) | NodePort | 9443 | 31001 | AIStor web console |
Restate / Firecracker runtime note
- deployment/restate-worker is the current durable execution worker. The image bundles Bun + Node + pi + codex, the full repo checkout, and 76 symlinked skills.
- Runtime auth/identity come from secret/pi-auth and configmap/agent-identity, which recreate /root/.pi/agent/auth.json plus the joelclaw identity chain inside the pod.
- Firecracker is enabled in-pod via privileged access to /dev/kvm on Colima VZ. The /dev/kvm hostPath mount uses type "" (optional) so the pod starts without it when nestedVirtualization is off.
- Persistent microVM assets live on PVC firecracker-images, mounted at /tmp/firecracker-test for kernel, rootfs, and snapshot files.
- Retry caps (2026-03-17): dagWorker maxAttempts=5, dagOrchestrator maxAttempts=3. This prevents Restate journal poisoning from infinite retries after code changes or infrastructure failures.
- Colima stability: nestedVirtualization is OFF by default (it crashes the VM under Docker build load). Toggle it ON only for Firecracker testing sessions, then toggle it OFF. See the k8s skill for recovery procedures.
Control-plane access
- kube API exposed locally at 127.0.0.1:64784 (forwarded)
- additional forwarded control ports observed: 64785, 9627 (exact ownership mapping UNKNOWN — needs manual verification)
3) Worker Architecture (Role Split + Registration)
Source files:
- packages/system-bus/src/serve.ts
- packages/system-bus/src/inngest/functions/index.host.ts
- packages/system-bus/src/inngest/functions/index.cluster.ts
- packages/system-bus/src/inngest/client.ts
Role model
- WORKER_ROLE is parsed as host (default) or cluster.
- The registered function set is role-dependent:
  - host uses hostFunctionDefinitions
  - cluster uses clusterFunctionDefinitions
Ground-truth counts
- Host function set: 101
- Cluster function set: 12
- Cluster subset functions: approvalRequest, approvalResolve, todoistCommentAdded, todoistTaskCompleted, todoistTaskCreated, frontMessageReceived, frontMessageSent, frontAssigneeChanged, todoistMemoryReviewBridge, githubWorkflowRunCompleted, githubPackagePublished, webhookSubscriptionDispatchGithubWorkflowRunCompleted
App registration isolation

From inngest/client.ts:
- app id resolves to:
  - system-bus-host when role is host
  - system-bus-cluster when role is cluster
- explicit INNGEST_APP_ID overrides the role-derived id.

This prevents host and cluster workers from overwriting each other's function graphs.
serveHost behavior

From serve.ts:
- host role default serveHost: http://host.docker.internal:3111
- cluster role default serveHost: unset (connect-mode default)
- INNGEST_SERVE_HOST overrides either role.

The Kubernetes cluster worker manifest sets:
- INNGEST_BASE_URL=http://inngest-svc:8288
- INNGEST_SERVE_HOST=http://system-bus-worker:3111
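Condensed, the role-derived resolution described above behaves like this (a sketch; the authoritative logic lives in client.ts and serve.ts):

```typescript
// Sketch of role-derived app-id + serveHost resolution (see client.ts / serve.ts).
type WorkerRole = "host" | "cluster";

function resolveWorkerConfig(env: NodeJS.ProcessEnv) {
  const role: WorkerRole = env.WORKER_ROLE === "cluster" ? "cluster" : "host"; // host is the default
  return {
    role,
    // Explicit INNGEST_APP_ID always wins over the role-derived id.
    appId: env.INNGEST_APP_ID ?? (role === "host" ? "system-bus-host" : "system-bus-cluster"),
    // Host gets a reachable default serveHost; cluster leaves it unset (connect mode).
    serveHost:
      env.INNGEST_SERVE_HOST ?? (role === "host" ? "http://host.docker.internal:3111" : undefined),
  };
}
```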
Registration mechanics
- The worker exposes GET|POST|PUT /api/inngest.
- The worker sends a delayed self-sync PUT /api/inngest ~5s after startup.
- worker-supervisor also performs a startup PUT sync.

Host is primary today

From index comments + function lists:
- ADR-0089 transition: host remains authoritative for broad function ownership.
- Cluster is intentionally limited to the cluster-safe subset (12 functions).
4) Event Flow (CLI → Inngest → Worker → Completion)

Canonical flow: joelclaw send
- CLI joelclaw send <event> calls Inngest.send().
- Inngest.send() POSTs the event JSON to ${INNGEST_URL}/e/${INNGEST_EVENT_KEY}
  - default: http://localhost:8288/e/<key>
- The Inngest server persists the event and resolves matching function triggers.
- Inngest dispatches function steps to the worker app graph that owns that function ID:
  - host app (system-bus-host) for the 101-function host set
  - cluster app (system-bus-cluster) for the 12-function cluster subset
- The worker handles callbacks via /api/inngest (Hono + inngest/hono handler).
- Each step.run result is memoized by Inngest; the next step executes when the prior completes.
- Completion/failure is queryable via GraphQL (/v0/gql) and CLI commands (runs, run, event, events).
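That first hop is a plain HTTP POST to the Inngest Event API. A minimal sketch (endpoint shape from the flow above; the fallback key and function name are illustrative):

```typescript
// Minimal sketch of the CLI → Inngest hop: POST the event JSON to the Event API.
// INNGEST_URL comes from ~/.config/system-bus.env; the "local" fallback key is assumed.
async function sendEvent(name: string, data: unknown): Promise<void> {
  const base = process.env.INNGEST_URL ?? "http://localhost:8288";
  const key = process.env.INNGEST_EVENT_KEY ?? "local";
  const res = await fetch(`${base}/e/${key}`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ name, data }),
  });
  if (!res.ok) throw new Error(`Inngest rejected event ${name}: ${res.status}`);
}
```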
Queue flow: joelclaw queue emit → Restate drainer → durable dispatch
- CLI joelclaw queue emit <event> persists a QueueEventEnvelope into Redis stream joelclaw:queue:events and indexes it in sorted set joelclaw:queue:priority.
- The restate-worker k8s deployment (packages/restate/src/index.ts) starts a deterministic queue drainer beside the channel callback listener.
- On startup, the drainer claims pending + never-delivered entries via @joelclaw/queue#getUnacked(), reindexes replayable entries, and emits OTEL replay evidence.
- Each drain tick selects the next priority candidate from the sorted set, resolves its static registry target from packages/queue/src/registry.ts, and POSTs a one-node DAG request to Restate /dagOrchestrator/{workflowId}/run/send (sketched after this list).
- When backlog remains and a dispatch slot frees, the drainer self-pulses immediately instead of waiting for the next QUEUE_DRAIN_INTERVAL_MS heartbeat. That interval is now the idle poll / retry cadence, not a mandatory 2-second tax between successful sends.
- The current Story-3 bridge re-emits the queue item to its registered Inngest event target inside that one-node DAG request. This is deliberate: the deterministic queue/drainer is proven first; per-family Restate cutovers remain Story 4 work.
- On accepted Restate dispatch, the drainer acks the queue message; on failure it leaves the message in Redis, applies a retry cooldown, and emits queue.dispatch.failed OTEL evidence.
- If backlog remains in Redis but the drainer stops making progress past QUEUE_DRAIN_STALL_AFTER_MS, it emits queue.drainer.stalled and exits non-zero so k8s restarts deployment/restate-worker. That is the self-heal path for a wedged drainer inside an otherwise-running Bun process.
- Crash recovery comes from the Redis stream + consumer-group replay path, not from vibes: restart the restate-worker pod, let getUnacked() reclaim the inflight entries, then drain resumes.
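As a logic sketch of one drain tick (names and payload shape illustrative; the real drainer is in packages/restate/src/index.ts):

```typescript
// Illustrative drain-tick sketch; the deterministic drainer lives in packages/restate/src/index.ts.
import { createClient } from "redis";

const RESTATE_INGRESS = process.env.RESTATE_INGRESS ?? "http://restate:8080"; // env name assumed

async function drainTick(redis: ReturnType<typeof createClient>): Promise<boolean> {
  // Next priority candidate from the sorted set.
  const [workflowId] = await redis.zRange("joelclaw:queue:priority", 0, 0);
  if (!workflowId) return false; // idle: fall back to QUEUE_DRAIN_INTERVAL_MS polling

  // One-node DAG request to Restate (registry lookup from packages/queue/src/registry.ts omitted).
  const res = await fetch(`${RESTATE_INGRESS}/dagOrchestrator/${workflowId}/run/send`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ nodes: [{ id: workflowId }] }), // payload shape illustrative
  });

  if (res.ok) {
    await redis.zRem("joelclaw:queue:priority", workflowId); // ack on accepted dispatch
    return true; // backlog may remain → caller self-pulses immediately
  }
  return false; // leave in Redis, apply cooldown, emit queue.dispatch.failed
}
```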
Workload flow: joelclaw workload run → Redis → Restate DAG → execution
- joelclaw workload plan ... --stages-from <file> can load an explicit stage DAG, validate unknown deps/self-deps/duplicates/cycles, and preserve per-stage acceptance gates.
- joelclaw workload run <plan-artifact> normalizes the selected stage into the canonical workload/requested runtime request.
- Queue admission writes the request into Redis, where the deterministic drainer forwards it into Restate as a dagOrchestrator/{workflowId}/run/send request.
- dagOrchestrator executes dependency waves: ready nodes in parallel, chained nodes only after every dependsOn node has terminal output (see the sketch after this list).
- dagWorker executes the node handler:
  - shell → subprocess work inside the restate-worker pod
  - infer → pi -p --no-session --no-extensions inside the pod, using the mounted auth + identity + skill set
  - microvm → Firecracker boot/restore through /dev/kvm with kernel/rootfs/snapshot files on PVC firecracker-images
- Each node emits OTEL (dag.node.*), and the workflow emits dag.workflow.* so queue → Restate → execution remains observable.
- Current truthful limit: the microVM runtime boots and restores snapshots in-cluster, but the broader exec-in-VM workspace drive protocol is still incomplete for general coding slices.
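The wave semantics reduce to a readiness check over dependsOn; a self-contained sketch (node shape assumed):

```typescript
// Sketch of dagOrchestrator's dependency-wave selection; node shape assumed for illustration.
interface DagNode {
  id: string;
  dependsOn: string[];
}

// A node is ready once every dependsOn node has terminal output.
function nextWave(nodes: DagNode[], terminal: Set<string>): DagNode[] {
  return nodes.filter(
    (n) => !terminal.has(n.id) && n.dependsOn.every((dep) => terminal.has(dep)),
  );
}

// Ready nodes in a wave run in parallel; the loop repeats until nothing is ready.
async function runWaves(nodes: DagNode[], exec: (n: DagNode) => Promise<void>): Promise<void> {
  const terminal = new Set<string>();
  for (let wave = nextWave(nodes, terminal); wave.length > 0; wave = nextWave(nodes, terminal)) {
    await Promise.all(wave.map(exec));
    for (const n of wave) terminal.add(n.id);
  }
}
```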
Webhook flow
- An external service posts to /webhooks/:provider.
- Caddy routes /webhooks/* on localhost:8443 to the worker on localhost:3111.
- webhookApp verifies the signature, normalizes the payload, and emits Inngest events (provider/event).
- Inngest executes subscribed functions.
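A condensed Hono-style sketch of that handler (the signature check and event emission are stand-ins; the real webhookApp lives in the system-bus worker):

```typescript
// Illustrative sketch of the /webhooks/:provider handler described above.
import { Hono } from "hono";

declare function verifySignature(provider: string, rawBody: string, signature: string): boolean;
declare function sendEvent(name: string, data: unknown): Promise<void>; // as sketched in §4

const webhookApp = new Hono();

webhookApp.post("/webhooks/:provider", async (c) => {
  const provider = c.req.param("provider");
  const raw = await c.req.text();

  // Signature verification is provider-specific; verifySignature is a stand-in.
  if (!verifySignature(provider, raw, c.req.header("x-signature") ?? "")) {
    return c.text("bad signature", 401);
  }

  // Normalize the payload, then emit the Inngest event named provider/event.
  await sendEvent(`${provider}/event`, JSON.parse(raw));
  return c.text("ok", 202);
});
```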
"Why did this run / not run" trace recipe
joelclaw send <event> -d '<payload>'joelclaw events --prefix <event-prefix> --hours 1
(fan-out to function runs)joelclaw event <event-id>
(step trace + errors)joelclaw run <run-id>joelclaw runs --count 20 --hours 1joelclaw otel search "<component/action>" --hours 1- Validate function ownership in
/index.host.ts
.index.cluster.ts
5) Port Map (Canonical)
Exposure sources: k8s service manifests, Caddyfile, kubectl get svc, lsof listeners.
| Port | Listener / owner | What it is | Exposure path |
|---|---|---|---|
| 3111 | host bun worker | host system-bus worker HTTP | local host; proxied via Caddy 3443 + webhook path via 8443 |
| 8080 | ssh forward (Colima) -> restate | Restate ingress / workflow API | NodePort + host forward |
| 8288 | ssh forward (Colima) -> Inngest svc | Inngest API + dashboard backend | NodePort + host forward; proxied via Caddy 9443 |
| 8289 | ssh forward (Colima) -> Inngest ws | Inngest connect websocket | NodePort + host forward; proxied via Caddy 8290 |
| 6379 | ssh forward (Colima) -> Redis | Redis | NodePort + host forward |
| 8108 | ssh forward / kubectl port-forward | Typesense API | ClusterIP; exposed locally by port-forward |
| 9070 | ssh forward (Colima) -> restate | Restate admin API | NodePort + host forward |
| 9071 | ssh forward (Colima) -> restate | Restate metrics | NodePort + host forward |
| 9080 | k8s service | Restate worker HTTP (dagOrchestrator, dagWorker, queue drainer) | ClusterIP only |
| random high local port | transient (CLI-managed) tunnel -> dkron | Dkron HTTP API | ClusterIP only; short-lived operator tunnel |
| 3838 | ssh forward (Colima) -> docs-api | docs-api HTTP | NodePort + host forward; proxied via Caddy 5443 |
| 7880 | ssh forward (Colima) -> livekit-server | LiveKit signaling | NodePort 7880; proxied via Caddy 7443 |
| 7881 | ssh forward (Colima) -> livekit-server | LiveKit RTC TCP | NodePort 7881 |
| 3000 | k8s bluesky-pds NodePort | Bluesky PDS HTTP | NodePort 3000 |
| 30900 | k8s minio-nodeport | Legacy MinIO S3 API (HTTP) | NodePort 30900 |
| 30901 | k8s minio-nodeport | Legacy MinIO console (HTTP) | NodePort 30901 |
| 31000 | k8s aistor-s3-api (aistor ns) | AIStor S3 API (TLS) | NodePort 31000 |
| 31001 | k8s aistor-s3-console (aistor ns) | AIStor console (TLS) | NodePort 31001 |
| 3443 | Caddy | HTTPS reverse proxy to localhost:3111 (worker) | tailnet HTTPS |
| 5443 | Caddy | HTTPS reverse proxy to localhost:3838 (docs-api) | tailnet HTTPS |
| 7443 | Caddy | HTTPS reverse proxy to localhost:7880 (LiveKit) | tailnet HTTPS |
| 9443 | Caddy | HTTPS reverse proxy to localhost:8288 (Inngest) | tailnet HTTPS |
| 8290 | Caddy | HTTPS reverse proxy to localhost:8289 (Inngest connect) | tailnet HTTPS |
| 8443 | Caddy (HTTP) | webhook/public ingress router | expected Funnel target |
| 6443 | Caddy | reverse proxy to local 6333 (Qdrant) | tailnet HTTPS |
| 3018 | gateway daemon | gateway websocket stream port | local |
| 9999 | talon | Talon health endpoint | local |
| 8765 | agent-mail HTTP service | MCP agent-mail API | local |
| 64784 | ssh forward | Kubernetes API | local kubectl endpoint |
Notes
- Host NodePort exposure appears through an ssh listener process (Colima portForwarder=ssh).
- The exact per-port ssh forward command line is UNKNOWN — needs manual verification (process introspection is restricted in this environment).
6) Storage Topology
Redis
- Runtime: k8s StatefulSet (redis:7-alpine, appendonly enabled).
- Primary uses:
  - gateway queue/session keys (joelclaw:events:*, joelclaw:notify:*, joelclaw:gateway:sessions)
  - webhook subscriptions (joelclaw:webhook:*)
  - gateway health mute/streak keys (gateway:health:*)
Typesense

From observability code:
- otel_events collection (canonical telemetry event store)
- memory_observations collection (vector-aware memory index; schema validated at startup)
- docs-api also points at http://typesense:8108 for docs search/index surfaces.
Firecracker runtime storage
- PVC: firecracker-images
- Mounted in deployment/restate-worker at /tmp/firecracker-test
- Stores:
  - kernel (vmlinux)
  - rootfs (agent-rootfs.ext4)
  - snapshots (snapshots/vm.snap, snapshots/vm.mem)
- Firecracker snapshot restore is currently operator-proven at ~9ms on the Colima VZ nested-virt path.
Inngest state
- StatefulSet PVC mounted at /data
- INNGEST_SQLITE_DIR=/data
docs-api surface
- Deployment: docs-api on NodePort 3838
- Route count: 11 endpoints including /health
- Key routes:
  - GET /search — hybrid chunk search with concept, concepts, doc_id, expand, and assemble params
  - GET /docs/search
  - GET /docs
  - GET /docs/:id
  - GET /docs/:id/toc
  - GET /docs/:id/chunks
  - GET /chunks/:id
  - GET /concepts
  - GET /concepts/:id
  - GET /concepts/:id/docs
- Taxonomy surface: 21-concept SKOS graph (10 parents + 11 sub-concepts) with broader, narrower, and related edges.
NAS (ADR-0088 + ADR-0187)

Tiering policy:
- Tier 1 local SSD (hot runtime state)
- Tier 2 NAS NVMe (/Volumes/nas-nvme ↔ /volume2/data)
- Tier 3 NAS HDD (/Volumes/three-body)

Access paths

| From | NVMe tier (1.5TB) | HDD tier (56TB) | Method |
|---|---|---|---|
| macOS host | /Volumes/nas-nvme | /Volumes/three-body | NFS mount via LaunchDaemon |
| k8s pods | PVC | PVC | NFS PV (192.168.1.163) |
| host-worker funcs | /Volumes/nas-nvme | /Volumes/three-body | Direct path (runs on macOS) |
k8s ↔ NAS networking

k8s pods reach the NAS via a LAN route through the Colima col0 bridge:

```
Talos → Docker NAT → VM col0 → macOS (ip.forwarding=1) → LAN → NAS
```

The VZ NAT on eth0 does NOT forward LAN traffic. The route is persisted in the Colima provision step plus the colima-tunnel script:

```
ip route replace 192.168.1.0/24 via 192.168.64.1 dev col0
```

Always use IP 192.168.1.163, never the hostname three-body — DNS doesn't resolve from k8s.
Degradation contract (ADR-0187):
- writes must fall back local -> remote -> queued
- queue spool default: /tmp/joelclaw/nas-queue
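A sketch of that write contract (tier paths from this section; which mount counts as "remote" for a given write is a policy detail, so the directories are parameters here):

```typescript
// Sketch of the ADR-0187 write fallback: local -> remote -> queued.
import { mkdir, writeFile } from "node:fs/promises";
import { dirname, join } from "node:path";

const QUEUE_SPOOL = "/tmp/joelclaw/nas-queue"; // queue spool default from this section

async function resilientWrite(
  localDir: string,  // e.g. a Tier 1 local SSD path
  remoteDir: string, // e.g. /Volumes/nas-nvme or /Volumes/three-body
  relPath: string,
  data: Uint8Array,
): Promise<"local" | "remote" | "queued"> {
  for (const [tier, dir] of [["local", localDir], ["remote", remoteDir]] as const) {
    try {
      const target = join(dir, relPath);
      await mkdir(dirname(target), { recursive: true });
      await writeFile(target, data);
      return tier;
    } catch {
      // mount unavailable: fall through to the next tier
    }
  }
  // Both tiers failed: spool to the queue directory for later replay.
  await mkdir(QUEUE_SPOOL, { recursive: true });
  await writeFile(join(QUEUE_SPOOL, relPath.replaceAll("/", "_")), data);
  return "queued";
}
```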
Vault
- Obsidian vault at /Users/joel/Vault
- system log file: /Users/joel/Vault/system/system-log.jsonl
7) Networking Topology
Caddy reverse proxy routes (from ~/.local/caddy/Caddyfile)
- https://panda.tail7af24.ts.net:9443 -> localhost:8288 (Inngest)
- https://panda.tail7af24.ts.net:8290 -> localhost:8289 (Inngest connect)
- https://panda.tail7af24.ts.net:3443 -> localhost:3111 (worker)
- https://panda.tail7af24.ts.net:5443 -> localhost:3838 (docs-api)
- https://panda.tail7af24.ts.net:7443 -> localhost:7880 (LiveKit)
- https://panda.tail7af24.ts.net:6443 -> localhost:6333 (Qdrant)
- http://localhost:8443 path router:
  - /webhooks/* -> localhost:3111
  - fallback -> localhost:8288
Tailscale + Funnel
- Config comments and ADR-0051 describe the Funnel path :443 -> localhost:8443.
- Runtime tailscale status unavailable here: UNKNOWN — needs manual verification.
External webhook ingress

Expected path:
1. Internet provider -> Tailscale Funnel :443
2. Funnel -> local :8443
3. Caddy path route /webhooks/* -> worker :3111
4. Worker /webhooks/:provider verifies + emits the Inngest event
8) CLI Wiring (Command Tree → Endpoint Surface)
Primary command tree root: packages/cli/src/cli.ts.
Endpoint map by command family
| Command family | Primary backend |
|---|---|
| send | Inngest Event API |
| runs, run, event, events, — | Inngest GraphQL |
| — | Inngest/worker health probes + k8s checks + agent-mail liveness |
| — | Redis keys/channels + launchd/system ops |
| workload | workload planner + Redis queue admission + Restate dagOrchestrator runtime |
| — | docs-api REST API |
| — | Dkron REST API (direct or via short-lived tunnel) |
| otel | Typesense via capability adapter |
| — | Typesense recall adapter |
| — | Agent-mail MCP HTTP (8765) via CLI adapter wrappers |
| — | worker launchd + Talon + k8s + Typesense diagnostics |

Command cells marked — were lost from this snapshot: UNKNOWN — needs manual verification.
Config source: ~/.config/system-bus.env (plus env overrides)
- defaults: INNGEST_URL=http://localhost:8288, INNGEST_WORKER_URL=http://localhost:3111
9) Observability + Tracing Topology
OTEL event pipeline
- Worker emits via emitOtelEvent() / emitMeasuredOtelEvent().
- Gateway emits via emitGatewayOtel (@joelclaw/telemetry) to:
  - default OTEL_EMIT_URL=http://localhost:3111/observability/emit
- Worker endpoint /observability/emit validates a token (x-otel-emit-token) if configured.
- Store path (storeOtelEvent):
  - Typesense otel_events (primary)
  - optional Convex mirror for high-severity recent window
  - optional Sentry forward for warn/error/fatal
Langfuse integration points
- Gateway boot: packages/gateway/src/daemon.ts calls initTracing({}) from inference-router.
- Inference router traces model-route decisions: packages/inference-router/src/tracing.ts
  - used from packages/inference-router/src/router.ts
- System-bus LLM traces: packages/system-bus/src/lib/langfuse.ts (traceLlmGeneration)
  - called by packages/system-bus/src/lib/inference.ts and channel-message-classify.ts
10) Key ADR Topology Decisions
| ADR | Title | Status | Topology impact |
|---|---|---|---|
| ADR-0048 | Webhook gateway | shipped | normalization + signature verification + Inngest emission |
| ADR-0088 | NAS-backed storage tiering | shipped | Defines SSD/NAS NVMe/NAS HDD storage contract |
| ADR-0089 | Single-source worker deployment | shipped | Host/cluster role split + single canonical source |
| ADR-0144 | Gateway hexagonal architecture | shipped | Gateway as composition root; heavy logic extracted out of the daemon |
| ADR-0155 | Three-stage story pipeline | shipped | Simplified story function flow through Inngest durable steps |
| ADR-0156 | Graceful worker restart | superseded | Historical restart strategy; superseded by Talon ADR |
| ADR-0159 | Talon watchdog daemon | shipped | Compiled watchdog + infra supervision model |
| ADR-0038 | Embedded pi gateway daemon | shipped | Always-on gateway session architecture |
| ADR-0051 | Tailscale Funnel ingress | shipped | Public webhook ingress via Funnel/Caddy pattern |
| ADR-0148 | k8s resilience policy | accepted | NodePort-first exposure, probe requirements, restart recovery checklist |
| ADR-0158 | worker-supervisor binary | superseded | Legacy supervisor ADR now superseded, but the binary remains in the active launchd path |
| ADR-0182 | node-0 localhost resilience | shipped | endpoint class fallback |
| ADR-0187 | NAS degradation fallback contract | accepted | mandatory local/remote/queued write fallback |
| ADR-0212 | AIStor as local S3 runtime | accepted | maintained local S3 runtime in aistor namespace; legacy MinIO retained for rollback |
10.1) Sandbox Execution Contract (@joelclaw/agent-execution)
Package: packages/agent-execution/

Purpose: Canonical contract for sandboxed story execution shared between Restate workflows, system-bus Inngest functions, and the k8s Job launcher.
Contract Types
Request: SandboxExecutionRequest
- workflowId, requestId, storyId: identifiers
- task: story prompt/task to execute
- agent: { name, variant?, model?, program? }
- sandbox: "workspace-write" | "danger-full-access"
- baseSha: git SHA before execution
- cwd?: working directory
- timeoutSeconds?: timeout
- verificationCommands?: post-execution verification
- sessionId?: tracking identifier

Result: SandboxExecutionResult
- requestId: correlation ID
- state: "pending" | "running" | "completed" | "failed" | "cancelled"
- startedAt, completedAt?, durationMs?: timing
- artifacts?: execution artifacts (see below)
- error?: error message (failed state)
- output?: stdout/stderr output

Artifacts: ExecutionArtifacts
- headSha: git SHA after execution
- touchedFiles: list of modified/untracked files from git status --porcelain
- patch?: git patch content (format-patch or diff)
- verification?: { commands, success, output }
- logs?: { executionLog?, verificationLog? }
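Transcribed as TypeScript for reference (field names and optionality exactly as listed above; this mirrors the package's own definitions rather than replacing them):

```typescript
// Contract shapes transcribed from the lists above.
export interface SandboxExecutionRequest {
  workflowId: string;
  requestId: string;
  storyId: string;
  task: string; // story prompt/task to execute
  agent: { name: string; variant?: string; model?: string; program?: string };
  sandbox: "workspace-write" | "danger-full-access";
  baseSha: string; // git SHA before execution
  cwd?: string;
  timeoutSeconds?: number;
  verificationCommands?: string[];
  sessionId?: string;
}

export interface ExecutionArtifacts {
  headSha: string;        // git SHA after execution
  touchedFiles: string[]; // from `git status --porcelain`
  patch?: string;         // format-patch or diff content
  verification?: { commands: string[]; success: boolean; output: string };
  logs?: { executionLog?: string; verificationLog?: string };
}

export interface SandboxExecutionResult {
  requestId: string; // correlation ID
  state: "pending" | "running" | "completed" | "failed" | "cancelled";
  startedAt: string;
  completedAt?: string;
  durationMs?: number;
  artifacts?: ExecutionArtifacts;
  error?: string;  // failed state
  output?: string; // stdout/stderr
}
```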
Repo Materialization (Story 3)

Function: materializeRepo(targetPath, baseSha, options)

Behavior:
- Clone the repo if the target path doesn't exist (requires remoteUrl)
- Fetch + checkout if the target path exists
- SHA verification after checkout
- Automatic unshallow if the SHA is not in the shallow clone
- Isolated sandbox-local workspace (host worktree untouched)

Returns: { path, sha, freshClone, durationMs }

Key options:
- remoteUrl?: remote URL for a fresh clone
- branch?: branch/ref to fetch (default: "main")
- depth?: shallow clone depth (default: 1)
- includeSubmodules?: include submodules
- timeoutSeconds?: timeout (default: 300)
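A usage sketch with the documented defaults (the import path and argument values are illustrative):

```typescript
// Usage sketch for materializeRepo with the documented defaults.
import { materializeRepo } from "@joelclaw/agent-execution"; // import path assumed

const repo = await materializeRepo("/tmp/sandbox/repo", "<base-sha>", {
  remoteUrl: "https://github.com/joelhooks/joelclaw", // required when the target path doesn't exist
  branch: "main",      // default
  depth: 1,            // default; auto-unshallows if the SHA isn't in the shallow clone
  timeoutSeconds: 300, // default
});

// Returns { path, sha, freshClone, durationMs }
console.log(repo.freshClone ? "fresh clone" : "reused checkout", repo.sha, `${repo.durationMs}ms`);
```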
Artifact Export (Story 3)

Function: generatePatchArtifact(options)

Behavior:
- Captures the touched-file inventory via getTouchedFiles()
- Generates a git patch from baseSha..headSha:
  - Uses git format-patch if commits exist in the range
  - Uses git diff if there are only uncommitted changes
- Optionally includes untracked files as patch content
- Embeds the verification summary and log references
- Serializable to JSON via writeArtifactBundle()

Key options:
- repoPath: path to the git repo
- baseSha: base SHA (start of diff range)
- headSha?: head SHA (default: HEAD)
- includeUntracked?: include untracked files (default: true)
- verificationCommands?, verificationSuccess?, verificationOutput?: verification data
- executionLogPath?, verificationLogPath?: log references
- timeoutSeconds?: timeout (default: 60)

Returns: ExecutionArtifacts
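The format-patch vs diff choice reduces to whether commits exist in the range; a standalone sketch of that decision (not the package's internal code):

```typescript
// Sketch of the patch-generation decision described above.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function generatePatch(repoPath: string, baseSha: string, headSha = "HEAD"): Promise<string> {
  // Count commits in baseSha..headSha.
  const { stdout: count } = await run("git", ["rev-list", "--count", `${baseSha}..${headSha}`], {
    cwd: repoPath,
  });
  if (parseInt(count.trim(), 10) > 0) {
    // Commits exist in the range: format-patch preserves commit metadata.
    const { stdout } = await run("git", ["format-patch", "--stdout", `${baseSha}..${headSha}`], {
      cwd: repoPath,
    });
    return stdout;
  }
  // Only uncommitted changes: plain diff against baseSha.
  const { stdout } = await run("git", ["diff", baseSha], { cwd: repoPath });
  return stdout;
}
```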
Promotion Boundary (Phase 1)

Authoritative output is the patch bundle + metadata.

Sandbox runs do not merge to main or push to remote. The runtime:
- Materializes the repo at baseSha in a sandbox-local workspace
- Executes the agent task
- Runs verification commands
- Exports a patch artifact with touched files and verification results
- Emits a SandboxExecutionResult event with ExecutionArtifacts

Promotion is a separate operator decision:
- The Restate workflow receives ExecutionArtifacts
- The operator reviews the patch + verification summary
- The operator applies the patch to the host repo (or discards it)
- The operator commits and pushes (if approved)

This keeps sandbox runs isolated and reversible.
k8s Job Integration

Job spec generation: generateJobSpec(request, options)

Cold k8s Jobs for isolated story execution:
- Deterministic Job naming keyed by requestId
- Runtime image contract: Git, Bun, agent tooling, /workspace directory
- Environment-driven config: WORKFLOW_ID, REQUEST_ID, STORY_ID, TASK_PROMPT_B64, BASE_SHA, etc.
- Resource limits: 500m-2 CPU, 1-4Gi memory (configurable)
- TTL cleanup: auto-delete after 5 minutes (default)
- Active deadline: 1 hour max runtime (default)
- No automatic retries (backoffLimit: 0)
- Security: non-root (UID 1000), no privilege escalation, capabilities dropped

Runtime contract:
- Decode TASK_PROMPT_B64 from env
- Call materializeRepo() at BASE_SHA
- Execute the agent with the task
- Run verification commands (if VERIFICATION_COMMANDS_B64 is set)
- Call generatePatchArtifact() with the results
- Emit a SandboxExecutionResult event with ExecutionArtifacts
- Exit 0 (success) or non-zero (failure)

Cancellation: delete the Job resource (SIGTERM to the container)

Job deletion: generateJobDeletion(requestId) -> { name, namespace, propagationPolicy }

See k8s/agent-runner.yaml for the full runtime contract specification.
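Stitched together, the in-container sequence above looks roughly like this (the REPO_URL env var and the declared helper stubs are assumptions; materializeRepo/generatePatchArtifact are the documented helpers):

```typescript
// Sequence sketch of the agent-runner Job runtime contract (wiring illustrative).
import { materializeRepo, generatePatchArtifact } from "@joelclaw/agent-execution"; // path assumed

declare function runAgent(task: string, cwd: string): Promise<void>;         // agent execution stub
declare function emitResult(result: unknown): Promise<void>;                 // SandboxExecutionResult event stub
declare function decodeCommands(b64?: string): string[] | undefined;         // VERIFICATION_COMMANDS_B64 decoder stub

const startedAt = new Date().toISOString();
const task = Buffer.from(process.env.TASK_PROMPT_B64 ?? "", "base64").toString("utf8");
const baseSha = process.env.BASE_SHA!;

const repo = await materializeRepo("/workspace/repo", baseSha, {
  remoteUrl: process.env.REPO_URL, // assumed env var for the clone source
});

await runAgent(task, repo.path);

const artifacts = await generatePatchArtifact({
  repoPath: repo.path,
  baseSha,
  verificationCommands: decodeCommands(process.env.VERIFICATION_COMMANDS_B64),
});

await emitResult({
  requestId: process.env.REQUEST_ID!,
  state: "completed",
  startedAt,
  completedAt: new Date().toISOString(),
  artifacts,
});
process.exit(0); // non-zero on failure per the contract
```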
Topology Impact
- Story 2: added contract types and Job spec generation
- Story 3: added repo materialization and artifact export helpers
- ADR-0221 phase 1: added explicit local sandbox isolation primitives — deterministic sandbox identity, deterministic local sandbox paths, per-sandbox env materialization, minimal/full mode vocabulary, and a JSON registry helper for host-worker sandboxes
- ADR-0221 phase 2: wired those local helpers into the real host-worker system/agent-dispatch local backend, so sandbox runs now allocate deterministic paths under ~/.joelclaw/sandboxes/, materialize .sandbox.env, persist registry state, and carry localSandbox metadata in inbox snapshots
- ADR-0221 phase 3: terminal retention/cleanup policy (cleanupAfter + registry metadata), opportunistic pruning of expired local sandboxes on new-run startup, copy-first .devcontainer materialization helpers with exclusion rules for env/secret junk, live sandbox env injection so the agent process actually sees the reserved runtime identity, a hash-preserving sandbox identity fix after live dogfood exposed path collisions from long shared requestId prefixes, abbreviated-baseSha acceptance during repo materialization, truthful failed inbox snapshots when dispatch crashes before normal terminal writeback, and a repeatable operator probe at bun scripts/verify-local-sandbox-dispatch.ts
- ADR-0221 phase 4: sandboxMode=minimal|full through the workload front door, requested-cwd mapping inside the cloned checkout, compose-backed full local mode startup, the reality that stale Restate workers can reject workload/requested until restarted and reloaded, a recursion guard because sandboxed stage runs were able to call joelclaw workload run from inside the sandbox and spawn nested canaries instead of terminating honestly, and a guarded workflow-rig proof run (/scripts/verify-workload-full-mode.ts, WR_20260310_013158) that completes terminally with healthy compose startup plus clean teardown
- ADR-0221 phase 5: the operator-facing CLI surface joelclaw workload sandboxes list|cleanup|janitor, so retained sandboxes can be inspected and janitored on demand instead of only during startup opportunistic pruning; the operator surfaces now reconcile registry entries against per-sandbox metadata before reporting or deleting, so old partial writeback residue stops lying about terminal state
- ADR-0221 phase 6: janitoring is scheduled instead of purely manual via the repo-managed launchd service com.joel.local-sandbox-janitor, which runs scripts/local-sandbox-janitor.sh → joelclaw workload sandboxes janitor at load and every 30 minutes
- Future: runtime image build, hot-image CronJob, warm-pool scheduler, Restate integration

Current state: the host-worker local sandbox path now uses the local-isolation helpers in production code; the package has a concurrent proof that two local sandboxes keep distinct compose identity plus copied devcontainer state; guarded full-mode workflow-rig dogfood closes terminally; and cleanup now has both on-demand CLI surfaces and scheduled launchd janitoring. Follow-on work is about deeper runtime ergonomics and debugging any remaining non-terminal stale residue, not missing basic cleanup automation.
11) Verification Commands (Health + Wiring)

Core topology

```bash
# Colima + VM IP
colima status --json

# Kubernetes control plane + node
kubectl cluster-info
kubectl get nodes -o wide

# Core workloads
kubectl get pods -n joelclaw -o wide
kubectl get svc -n joelclaw -o wide
```

Host supervision

```bash
# Worker supervisor launchd state
launchctl print gui/$(id -u)/com.joel.system-bus-worker | rg "state =|pid =|last exit code"

# Gateway / Caddy / Talon
launchctl print gui/$(id -u)/com.joel.gateway | rg "state =|pid ="
launchctl print gui/$(id -u)/com.joel.caddy | rg "state =|pid ="
launchctl print gui/$(id -u)/com.joel.talon | rg "state =|pid ="

# Talon health
curl -s http://127.0.0.1:9999/health
```

Worker role split

```bash
# Parse role counts directly from source lists
python - <<'PY'
import re
from pathlib import Path
for f, name in [
    ('packages/system-bus/src/inngest/functions/index.host.ts', 'host'),
    ('packages/system-bus/src/inngest/functions/index.cluster.ts', 'cluster'),
]:
    txt = Path(f).read_text()
    body = re.search(rf'export const {name}FunctionDefinitions = \[(.*?)\];', txt, re.S).group(1)
    count = sum(1 for line in body.splitlines() if line.strip() and not line.strip().startswith('//'))
    print(name, count)
PY

# Inngest app ID derivation logic
rg -n "INNGEST_APP_ID|system-bus-host|system-bus-cluster|WORKER_ROLE" packages/system-bus/src/inngest/client.ts
```

Event flow trace

```bash
# Send event
joelclaw send <event> -d '<json>'

# Trace event and resulting runs
joelclaw events --prefix <event-prefix> --hours 1 --count 20
joelclaw event <event-id>
joelclaw runs --hours 1 --count 20
joelclaw run <run-id>

# Telemetry correlation
joelclaw otel search "<component_or_action>" --hours 1
```

Networking

```bash
# Caddy route config
caddy validate --config ~/.local/caddy/Caddyfile

# Listening ports snapshot
/usr/sbin/lsof -iTCP -sTCP:LISTEN -n -P

# Tailscale runtime (if daemon available)
tailscale status --json
```
12) Known Unknowns (Do Not Guess)

- Tailscale daemon state is not readable in this environment.
  - tailscale status --json failed to connect.
  - UNKNOWN — needs manual verification
- docs/architecture.md, docs/deploy.md, docs/observability.md are absent in-repo.
  - UNKNOWN — needs manual verification
- Exact command-line ownership of all Colima ssh forwarding ports (64784, 64785, 9627, etc.)
  - UNKNOWN — needs manual verification
- Ingress controller runtime status for k8s/docs-api-ingress.yaml
  - UNKNOWN — needs manual verification
13) Mandatory Update Policy (Non-Optional)

Update this skill in the same change whenever any of these change:

- Worker runtime wiring
  - serve.ts, client.ts, index.host.ts, index.cluster.ts
  - WORKER_ROLE, app IDs, serveHost behavior, registration path
- Supervision/process topology
  - any ~/Library/LaunchAgents/com.joel*.plist
  - infra/worker-supervisor/*, Talon behavior, gateway launch script/label
- Kubernetes topology
  - any file under k8s/
  - Helm values affecting core services (livekit, pds, etc.)
  - Service type/port changes (NodePort/ClusterIP)
- Networking/ingress
  - Caddyfile route/port changes
  - Tailscale/Funnel hostnames or ingress path changes
  - Colima/VM networking model changes
- Storage topology
  - Redis keyspace contracts for gateway/webhook routing
  - Typesense telemetry collection/schema changes
  - NAS mount/fallback/queue contract changes
- Observability/tracing
  - OTEL emit endpoint/token behavior
  - telemetry storage path changes (Typesense/Convex/Sentry)
  - Langfuse integration points
- CLI control-plane routing
  - command families moved to different endpoints/services
- ADR status changes affecting topology
  - especially ADR-0048, 0088, 0089, 0144, 0155, 0156, 0159, 0182, 0187

If any item above changed and this skill was not updated, this skill is stale and non-canonical.