Joelclaw o11y-logging
Implement and verify joelclaw observability on every change so failures cannot stay silent. Use when adding/updating Inngest functions, gateway channels, webhook providers, APIs, workers, or any pipeline step. Enforces canonical OTEL contract, storage path, and verification gates. Triggers on: 'o11y', 'observability', 'logging', 'otel', 'instrument this', 'silent failure', 'add telemetry', 'log this function'.
git clone https://github.com/joelhooks/joelclaw
T=$(mktemp -d) && git clone --depth=1 https://github.com/joelhooks/joelclaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/o11y-logging" ~/.claude/skills/joelhooks-joelclaw-o11y-logging && rm -rf "$T"
skills/o11y-logging/SKILL.md
JoelClaw Observability + Logging
Prevent silent failure by default. Observability is not optional polish: it is part of done.
Non-Negotiable Rules
- Use the canonical event contract only.
  - packages/system-bus/src/observability/otel-event.ts
  - packages/system-bus/src/observability/emit.ts
  - packages/system-bus/src/observability/store.ts
- Worker/Inngest code emits through emitOtelEvent or emitMeasuredOtelEvent.
- Gateway code emits through emitGatewayOtel.
- Internal ingestion goes through POST /observability/emit (packages/system-bus/src/serve.ts), not ad-hoc writes.
- Never treat console.log as primary observability. Keep structured events as source of truth.
- High-cardinality values go in metadata, not in facet fields (source, component, level, success).
- Failures must set success: false with a meaningful error.
- For warn/error/fatal, verify Convex mirror behavior (rolling window) in addition to Typesense write.
- In Inngest durable functions, any "emit once" telemetry must live inside step.run(...) to avoid replay duplication after resume.
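The emit-once rule for durable functions can be illustrated with a small simulation. This is a sketch under stated assumptions, not the Inngest SDK: `makeStep` is a hypothetical stand-in that memoizes step results by id the way a durable runner would on replay, and `emit` only records action names instead of calling the canonical emitters.

```typescript
// Hypothetical stand-in for a durable step runner: results are memoized by
// step id, so a replayed handler skips steps that already completed.
type StepRunner = { run<T>(id: string, fn: () => Promise<T>): Promise<T> };

function makeStep(memo: Map<string, unknown>): StepRunner {
  return {
    async run<T>(id: string, fn: () => Promise<T>): Promise<T> {
      if (memo.has(id)) return memo.get(id) as T; // replay: reuse stored result
      const result = await fn();
      memo.set(id, result);
      return result;
    },
  };
}

// Illustrative handler; a real one would call the canonical emitter.
async function handler(
  step: StepRunner,
  emit: (action: string) => void
): Promise<void> {
  emit("demo.unguarded"); // handler body: runs again on every replay
  await step.run("emit-guarded", async () => {
    emit("demo.guarded"); // inside a step: runs once, then memoized
  });
}

// Re-invoke the handler several times against the same memo, as a replay would.
async function simulateReplays(times: number): Promise<string[]> {
  const memo = new Map<string, unknown>(); // survives across replays
  const emitted: string[] = [];
  for (let i = 0; i < times; i++) {
    await handler(makeStep(memo), (action) => emitted.push(action));
  }
  return emitted;
}
```

In this simulation, the unguarded emit is recorded once per replay while the guarded one is recorded a single time, which is the duplication the rule is meant to prevent.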
Event Conventions
- source: subsystem (worker, gateway, webhook, memory, verification, etc.)
- component: stable module/service name (check-system-health, redis-channel, observe)
- action: stable dotted action (system.health.checked, events.immediate_telegram)
- metadata: request IDs, deployment IDs, function IDs, session IDs, payload identifiers
- duration_ms: include for timed operations
Use event-per-hop (wide event style): one context-rich event for each major boundary/operation, not scattered string logs.
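The conventions above can be sketched as a type plus a lint-style check. This is illustrative only: the canonical contract lives in otel-event.ts and may differ, and `EventSketch`, `SketchLevel`, and `checkConventions` are hypothetical names. The "looks like an ID" test is a rough heuristic chosen for the example, not a rule from the project.

```typescript
// Illustrative shape only; field names follow the conventions listed above.
type SketchLevel = "debug" | "info" | "warn" | "error" | "fatal";

interface EventSketch {
  level: SketchLevel;
  source: string; // subsystem, e.g. "worker"
  component: string; // stable module/service name
  action: string; // stable dotted action
  success: boolean;
  error?: string;
  duration_ms?: number;
  metadata?: Record<string, unknown>; // high-cardinality values go here
}

// Returns a list of convention problems; an empty list means none were found.
function checkConventions(e: EventSketch): string[] {
  const problems: string[] = [];

  // Actions are dotted names such as "system.health.checked".
  if (!/^[a-z0-9_-]+(\.[a-z0-9_-]+)+$/.test(e.action)) {
    problems.push("action should be a stable dotted name");
  }

  // Failures must carry a meaningful error.
  if (!e.success && !e.error) {
    problems.push("success: false requires a meaningful error");
  }

  // Heuristic (assumption): long digit runs in a facet suggest an ID that
  // belongs in metadata instead.
  for (const facet of [e.source, e.component]) {
    if (/\d{4,}/.test(facet)) {
      problems.push(`facet "${facet}" looks like an ID; move it to metadata`);
    }
  }
  return problems;
}
```

A check like this could run in a unit test over sample events, so that convention drift is caught before it reaches storage.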
Implementation Workflow
- Identify the boundary being changed.
- Inngest function, gateway channel, webhook route, API route, background job, sync step.
- Add success and failure envelopes.
- Start + completion for long tasks, or a single completion event for short tasks.
- Include operational and business context in metadata.
- Example: function id, event id, provider, queue depth, affected resource id.
- Keep severity useful.
- debug/info for normal activity, warn for degraded but recoverable, error/fatal for failures.
- Run verification gates before finishing.
For full checklists and command recipes, read references/implementation-checklist.md.
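The success and failure envelopes from the workflow can be sketched as a small wrapper. This is not the implementation of emitMeasuredOtelEvent, whose behavior is not shown here; `measured`, `Envelope`, and `Sink` are hypothetical names, and the sink stands in for whatever emitter the boundary uses.

```typescript
type EnvelopeLevel = "debug" | "info" | "warn" | "error" | "fatal";

interface Envelope {
  level: EnvelopeLevel;
  source: string;
  component: string;
  action: string;
  success: boolean;
  error?: string;
  duration_ms: number;
  metadata?: Record<string, unknown>;
}

type EnvelopeBase = Pick<Envelope, "source" | "component" | "action" | "metadata">;
type Sink = (e: Envelope) => void | Promise<void>;

// Runs op, emits exactly one completion event, and rethrows on failure so the
// error is recorded without being swallowed.
async function measured<T>(
  base: EnvelopeBase,
  sink: Sink,
  op: () => Promise<T>
): Promise<T> {
  const started = Date.now();
  let ok = true;
  let failure: unknown;
  let result: T | undefined;
  try {
    result = await op();
  } catch (err) {
    ok = false;
    failure = err;
  }
  const duration_ms = Date.now() - started;
  if (ok) {
    await sink({ ...base, level: "info", success: true, duration_ms });
    return result as T;
  }
  const error = failure instanceof Error ? failure.message : String(failure);
  await sink({ ...base, level: "error", success: false, error, duration_ms });
  throw failure;
}
```

Emitting after the try/catch rather than inside it keeps a sink failure from being misreported as an operation failure.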
Quick Patterns
Worker / Inngest timed operation
import { emitMeasuredOtelEvent } from "../../observability/emit";

await emitMeasuredOtelEvent(
  {
    level: "info",
    source: "worker",
    component: "content-sync",
    action: "content_sync.run",
    metadata: { trigger: event.name },
  },
  async () => {
    await runSync();
  }
);
Gateway emission
import { emitGatewayOtel } from "../observability";

await emitGatewayOtel({
  level: "error",
  component: "redis-channel",
  action: "events.immediate_telegram",
  success: false,
  error: "telegram_send_failed",
  metadata: { sessionId, queueDepth },
});
Definition of Done
- Structured OTEL events added for the changed path.
- No direct feature-level writes to Typesense/Convex for observability data.
- Smoke probe passes (scripts/otel-smoke.sh).
- joelclaw otel list and joelclaw otel stats show expected behavior.
- New failure modes are queryable by source, component, and action.
Inngest Replay + Hang Triage
Use this when step code appears to run but runs remain RUNNING/CANCELLED with Finalization errors.
- Inspect run trace first.
joelclaw run <run-id>
Look for errors.Finalization.stack containing "Unable to reach SDK URL".
- Confirm whether this is true network reachability or worker-side blocking.
joelclaw inngest status
joelclaw logs worker --lines 200
joelclaw logs errors --lines 200
- Check for replay-noise in OTEL.
If an action that should emit once (for example manifest.archive.prereqs-passed) appears hundreds of times in one run window, move that emit into its own step.run.
joelclaw otel search "manifest.archive.prereqs-passed" --hours 1
- Treat "Unable to reach SDK URL" as an ambiguous symptom.
It can indicate ingress problems, but in practice it can also happen when a function handler blocks on local IO/dependencies long enough that finalization cannot complete.
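The replay-noise check in the triage steps can be sketched as a counting pass over exported events. This assumes each event carries a run identifier; `SeenEvent`, its `runId` field, and `findReplayNoise` are hypothetical, and the threshold is whatever "too many" means for an action that should fire once per run.

```typescript
// Hypothetical minimal shape for events pulled from a search window.
interface SeenEvent {
  action: string;
  runId: string;
}

// Counts events per (run, action) pair and returns the pairs whose count
// exceeds the threshold, keyed as "runId:action".
function findReplayNoise(
  events: SeenEvent[],
  threshold: number
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    const key = `${e.runId}:${e.action}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const noisy = new Map<string, number>();
  counts.forEach((n, key) => {
    if (n > threshold) noisy.set(key, n);
  });
  return noisy;
}
```

Any pair this returns is a candidate for moving the emit into its own step.run.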
Helper Script
Use scripts/otel-smoke.sh for a fast end-to-end probe:
./skills/o11y-logging/scripts/otel-smoke.sh verification o11y-skill probe.emit
Key Files
- packages/system-bus/src/observability/otel-event.ts
- packages/system-bus/src/observability/emit.ts
- packages/system-bus/src/observability/store.ts
- packages/system-bus/src/serve.ts
- packages/gateway/src/observability.ts
- packages/system-bus/src/inngest/functions/check-system-health.ts
- packages/cli/src/commands/otel.ts
- apps/web/app/api/otel/route.ts