Ai logging-observability
Structured logging, distributed tracing, and metrics collection patterns for building observable systems. Use when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices. Covers structured JSON logging, context propagation, trace sampling, Prometheus/Grafana stack, alert design, and PII/secret scrubbing.
git clone https://github.com/wpank/ai
T=$(mktemp -d) && git clone --depth=1 https://github.com/wpank/ai "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/tools/logging-observability" ~/.claude/skills/wpank-ai-logging-observability && rm -rf "$T"
skills/tools/logging-observability/SKILL.mdLogging & Observability
Patterns for building observable systems across the three pillars: logs, metrics, and traces.
Three Pillars
| Pillar | Purpose | Question It Answers | Example |
|---|---|---|---|
| Logs | What happened | Why did this request fail? | |
| Metrics | How much / how fast | Is latency increasing? | |
| Traces | Request flow | Where is the bottleneck? | Span: |
Each pillar is strongest when correlated. Embed
trace_id in every log line to jump from a log entry to the full distributed trace.
Installation
OpenClaw / Moltbot / Clawbot
npx clawhub@latest install logging-observability
Structured Logging
Always emit logs as structured JSON — never free-text strings.
Required Fields
| Field | Purpose | Required |
|---|---|---|
| ISO-8601 with milliseconds | Yes |
| Severity (DEBUG … FATAL) | Yes |
| Originating service name | Yes |
| Human-readable description | Yes |
| Distributed trace correlation | Yes |
| Current span within trace | Yes |
| Business-level correlation (order ID) | When applicable |
| Structured error object | On errors |
| Request-specific metadata | Recommended |
Context Enrichment
Attach context at the middleware level so downstream logs inherit automatically:
app.use((req, res, next) => { const ctx = { trace_id: req.headers['x-trace-id'] || crypto.randomUUID(), request_id: crypto.randomUUID(), user_id: req.user?.id, method: req.method, path: req.path, }; asyncLocalStorage.run(ctx, () => next()); });
Library Recommendations
| Library | Language | Strengths | Perf |
|---|---|---|---|
| Pino | Node.js | Fastest Node logger, low overhead | Excellent |
| structlog | Python | Composable processors, context binding | Good |
| zerolog | Go | Zero-allocation JSON logging | Excellent |
| zap | Go | High performance, typed fields | Excellent |
| tracing | Rust | Spans + events, async-aware | Excellent |
Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.
Log Levels
| Level | When to Use | Example |
|---|---|---|
| FATAL | App cannot continue, process will exit | Database connection pool exhausted |
| ERROR | Operation failed, needs attention | Payment charge failed: CARD_DECLINED |
| WARN | Unexpected but recoverable | Retry 2/3 for upstream timeout |
| INFO | Normal business events | Order ORD-1234 placed successfully |
| DEBUG | Developer troubleshooting | Cache miss for key user:82:preferences |
| TRACE | Very fine-grained (rarely in prod) | Entering validateAddress with payload |
Rules: Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.
Distributed Tracing
OpenTelemetry Setup
Always prefer OpenTelemetry over vendor-specific SDKs:
import { NodeSDK } from '@opentelemetry/sdk-node'; import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'; import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'; const sdk = new NodeSDK({ serviceName: 'order-service', traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces', }), instrumentations: [getNodeAutoInstrumentations()], }); sdk.start();
Span Creation
const tracer = trace.getTracer('order-service'); async function processOrder(order: Order) { return tracer.startActiveSpan('processOrder', async (span) => { try { span.setAttribute('order.id', order.id); span.setAttribute('order.total_cents', order.totalCents); await validateInventory(order); await chargePayment(order); span.setStatus({ code: SpanStatusCode.OK }); } catch (err) { span.setStatus({ code: SpanStatusCode.ERROR, message: err.message }); span.recordException(err); throw err; } finally { span.end(); } }); }
Context Propagation
- Use W3C Trace Context (
header) — default in OTeltraceparent - Propagate across HTTP, gRPC, and message queues
- For async workers: serialise
into the job payloadtraceparent
Trace Sampling
| Strategy | Use When |
|---|---|
| Always On | Low-traffic services, debugging |
| Probabilistic (N%) | General production use |
| Rate-limited (N/sec) | High-throughput services |
| Tail-based | When you need all error traces |
Always sample 100% of error traces regardless of strategy.
Metrics Collection
RED Method (Request-Driven)
Monitor these three for every service endpoint:
| Metric | What It Measures | Prometheus Example |
|---|---|---|
| Rate | Requests/sec | |
| Errors | Failed request ratio | |
| Duration | Response time | |
USE Method (Resource-Driven)
For infrastructure components (CPU, memory, disk, network):
| Metric | What It Measures | Example |
|---|---|---|
| Utilization | % resource busy | CPU usage at 78% |
| Saturation | Work queued/waiting | 12 requests queued in thread pool |
| Errors | Error events on resource | 3 disk I/O errors in last minute |
Monitoring Stack
| Tool | Category | Best For |
|---|---|---|
| Prometheus | Metrics | Pull-based metrics, alerting rules |
| Grafana | Visualisation | Dashboards for metrics, logs, traces |
| Jaeger | Tracing | Distributed trace visualisation |
| Loki | Logs | Log aggregation (pairs with Grafana) |
| OpenTelemetry | Collection | Vendor-neutral telemetry collection |
Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.
Alert Design
Severity Levels
| Severity | Response Time | Example |
|---|---|---|
| P1 | Immediate | Service fully down, data loss |
| P2 | < 30 min | Error rate > 5%, latency p99 > 5s |
| P3 | Business hours | Disk > 80%, cert expiring in 7 days |
| P4 | Best effort | Non-critical deprecation warning |
Alert Fatigue Prevention
- Alert on symptoms, not causes — "error rate > 5%" not "pod restarted"
- Multi-window, multi-burn-rate — catch both sudden spikes and slow burns
- Require runbook links — every alert must link to diagnosis and remediation
- Review monthly — delete or tune alerts that never fire or always fire
- Group related alerts — use inhibition rules to suppress child alerts
- Set appropriate thresholds — if alert fires daily and is ignored, raise threshold or delete
Dashboard Patterns
Overview Dashboard ("War Room")
- Total requests/sec across all services
- Global error rate (%) with trendline
- p50 / p95 / p99 latency
- Active alerts count by severity
- Deployment markers overlaid on graphs
Service Dashboard (Per-Service)
- RED metrics for each endpoint
- Dependency health (upstream/downstream success rates)
- Resource utilisation (CPU, memory, connections)
- Top errors table with count and last seen
Observability Checklist
Every service must have:
- Structured JSON logging with consistent schema
- Correlation / trace IDs propagated on all requests
- RED metrics exposed for every external endpoint
- Health check endpoints (
and/healthz
)/readyz - Distributed tracing with OpenTelemetry
- Dashboards for RED metrics and resource utilisation
- Alerts for error rate, latency, and saturation with runbook links
- Log level configurable at runtime without redeployment
- PII scrubbing verified and tested
- Retention policies defined for logs, metrics, and traces
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references |
| Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow |
| Unstructured logs | Cannot query or alert on fields | Use structured JSON with consistent schema |
| String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in message |
| Missing correlation IDs | Cannot trace across services | Generate and propagate trace_id everywhere |
| Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication |
| Metrics with high cardinality | Prometheus OOM, dashboard timeouts | Never use user ID or request ID as label |
NEVER Do
- NEVER log passwords, tokens, API keys, or secrets — even at DEBUG level
- NEVER use console.log / print in production — use a structured logger
- NEVER use user IDs, emails, or request IDs as metric labels — cardinality will explode
- NEVER create alerts without a runbook link — unactionable alerts erode trust
- NEVER rely on logs alone — you need metrics and traces for full observability
- NEVER log request/response bodies by default — opt-in only, with PII redaction
- NEVER ignore log volume — set budgets and alert when a service exceeds daily quota
- NEVER skip context propagation in async flows — broken traces are worse than no traces