Skilllibrary observability-logging

Instrument backend services with structured logs, metrics, traces, and correlation IDs that answer production questions without leaking secrets. Use this when defining event schemas, trace boundaries, service diagnostics, or alert-friendly log fields. Do not use for one-off debug prints or frontend analytics instrumentation.

install

source · Clone the upstream repo

git clone https://github.com/merceralex397-collab/skilllibrary

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/09-backend-api-and-data/observability-logging" ~/.claude/skills/merceralex397-collab-skilllibrary-observability-logging && rm -rf "$T"

manifest: 09-backend-api-and-data/observability-logging/SKILL.md

source content

Purpose

Use this skill to make backend behavior explainable in production instead of merely noisy.

When to use this skill

Use this skill when:

adding or reviewing structured logging, trace spans, metrics, correlation IDs, or error-reporting fields
debugging why operators cannot answer latency, failure, throughput, or replay questions from the current telemetry
deciding what a service, worker, or webhook handler should emit for production triage

Do not use this skill when

the task is just temporary local debugging with no intention to keep the instrumentation
the work is frontend analytics or product event tagging rather than backend runtime visibility
a narrower active skill already owns the transport problem and observability is only incidental

Operating procedure

Start from operator questions. Define what someone on call must be able to answer: what failed, for whom, where it happened, how often, and whether it is still happening.
Choose the primary event vocabulary. Name stable event types for request lifecycle, external dependency calls, state transitions, retries, and terminal failures instead of free-form prose logs. Example tools: Python
```
structlog
```
, Node.js
```
pino
```
, Go
```
slog
```
/
```
zap
```
. Event names like
```
http.request.completed
```
,
```
db.query.slow
```
,
```
job.retry.exhausted
```
.
Separate logs, metrics, and traces by job. Logs explain specific events, metrics show rates and saturation, and traces connect one request or job across boundaries. Do not expect one signal type to carry all three jobs.
Control field quality. Keep correlation IDs, actor identifiers, resource identifiers, status, duration, and retry counts explicit. Avoid unbounded-cardinality fields or secret-bearing payload dumps.
Verify on unhappy paths. Trigger at least one failure or retry path and confirm that the emitted events are searchable, redacted, and correlated well enough to debug the incident.

Decision rules

Prefer structured event names and fields over giant interpolated message strings.
Log enough identifiers to join events, but never raw secrets, tokens, full auth headers, or unnecessary personal data.
Treat missing correlation IDs and missing error class fields as observability bugs, not polish issues.
If an event cannot drive a dashboard, alert, trace, or incident timeline, question whether it belongs in the long-term schema.

Output requirements

```
Primary Questions
```
```
Event and Field Plan
```
```
Redaction and Cardinality Risks
```
```
Verification
```

Scripts

```
scripts/log_schema_check.py
```
: validate sample JSON logs for required fields and obvious secret-bearing keys.

References

Read these only when relevant:

```
references/structured-event-schema.md
```
```
references/pii-and-secrets-redaction.md
```
```
references/trace-metric-log-triage.md
```

Related skills

```
rate-limits-retries
```
```
webhooks-events
```
```
background-jobs-queues
```

Anti-patterns

Using
```
log.debug
```
/
```
log.info
```
with f-strings instead of structured fields.
Logging full request/response bodies in production, risking PII and secret exposure.
Using
```
print()
```
instead of a structured logger.
Metrics with unbounded label cardinality (e.g.
```
user_id
```
as a metric label).
Trace spans wrapping everything instead of meaningful operation boundaries.

Failure handling

If the repo already ships logs in a legacy shape, describe the migration path before forcing a clean-schema rewrite.
If traces or metrics are unavailable in the stack, say so and design the best log-first fallback instead of pretending all three layers exist.
If the real issue is contract or delivery logic, hand off to the relevant backend skill after documenting the observability gap.