Skillshub observability-setup

Observability Setup

Install

Source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub

Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/TerminalSkills/skills/observability-setup" ~/.claude/skills/comeonoliver-skillshub-observability-setup && rm -rf "$T"

Manifest: skills/TerminalSkills/skills/observability-setup/SKILL.md

Source content

Observability Setup

Overview

This skill helps AI agents implement the three pillars of observability — traces, metrics, and structured logs — across microservice architectures. It uses OpenTelemetry as the instrumentation standard and Grafana's LGTM stack (Loki, Grafana, Tempo, Mimir) as the backend, though patterns apply to any OTel-compatible backend.

Instructions

OpenTelemetry Instrumentation (Node.js)

  1. Install packages:

    npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
      @opentelemetry/exporter-trace-otlp-http @opentelemetry/exporter-metrics-otlp-http
    
  2. Create src/instrumentation.ts — this file must be loaded BEFORE any other imports (a shutdown hook for it is sketched after this list):

    import { NodeSDK } from '@opentelemetry/sdk-node';
    import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
    import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
    import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
    import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
    
    const sdk = new NodeSDK({
      serviceName: process.env.OTEL_SERVICE_NAME || 'my-service',
      traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/traces' }),
      metricReader: new PeriodicExportingMetricReader({
        exporter: new OTLPMetricExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT + '/v1/metrics' }),
        exportIntervalMillis: 15000,
      }),
      instrumentations: [getNodeAutoInstrumentations()],
    });
    sdk.start();
    
  3. Add to the start script so the instrumentation loads first. Plain node cannot require a .ts file directly, so point it at the compiled output (or preload a TypeScript loader such as ts-node/register):

    node --require ./dist/instrumentation.js dist/index.js

  4. Add custom spans for business-critical operations:

    import { trace, SpanStatusCode } from '@opentelemetry/api';
    const tracer = trace.getTracer('payment-service');
    
    async function processPayment(order) {
      return tracer.startActiveSpan('process-payment', async (span) => {
        span.setAttribute('order.id', order.id);
        span.setAttribute('payment.amount_cents', order.totalCents);
        try {
          const result = await paymentGateway.charge(order);
          span.setAttribute('payment.status', result.status);
          return result;
        } catch (error) {
          span.recordException(error);
          span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
          throw error;
        } finally {
          span.end();
        }
      });
    }
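
To avoid dropping buffered telemetry when the process is stopped, register a shutdown hook in the same instrumentation file (this is the sketch referenced in step 2; sdk.shutdown() flushes the exporters and returns a promise):

// Append to src/instrumentation.ts
process.on('SIGTERM', () => {
  sdk.shutdown()
    .catch((err) => console.error('Error shutting down OpenTelemetry SDK', err))
    .finally(() => process.exit(0));
});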
    

OpenTelemetry Instrumentation (Python)

  1. Install, then run the bootstrap tool so auto-instrumentation is installed for the libraries already present in the environment:
    pip install opentelemetry-distro opentelemetry-exporter-otlp
    opentelemetry-bootstrap -a install
  2. Run:
    opentelemetry-instrument --service_name notification-service python app.py
  3. For structured logs with trace context:
    import structlog
    from opentelemetry import trace
    
    def add_trace_context(logger, method_name, event_dict):
        span = trace.get_current_span()
        ctx = span.get_span_context()
        if ctx.is_valid:
            event_dict['trace_id'] = format(ctx.trace_id, '032x')
            event_dict['span_id'] = format(ctx.span_id, '016x')
        return event_dict
    
    structlog.configure(processors=[add_trace_context, structlog.dev.ConsoleRenderer()])
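
With this processor configured, any log emitted inside an active span carries the trace context automatically. A small usage sketch (the event name and fields are illustrative):

import structlog

log = structlog.get_logger()
# When called inside an active span, add_trace_context injects trace_id and span_id.
log.info("notification queued", channel="email", user_id=42)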
    

Structured Logging with Trace Correlation

Every log line must include trace_id and span_id. Use JSON format:

{"level":"error","msg":"payment failed","trace_id":"abc123...","span_id":"def456...","service":"payment-service","error":"timeout","timestamp":"2025-01-15T10:30:00Z"}

For Node.js, use pino with a custom mixin:

const pino = require('pino');
const { trace } = require('@opentelemetry/api');

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (span) {
      const ctx = span.spanContext();
      return { trace_id: ctx.traceId, span_id: ctx.spanId };
    }
    return {};
  }
});
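
Log calls made inside an active span then carry the trace context without extra arguments, for example (the output shown is illustrative):

logger.error({ order_id: 'ord_123' }, 'payment failed');
// → {"level":50,"msg":"payment failed","order_id":"ord_123","trace_id":"...","span_id":"..."}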

OTel Collector Configuration

receivers:
  otlp:
    protocols:
      grpc: { endpoint: "0.0.0.0:4317" }
      http: { endpoint: "0.0.0.0:4318" }

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Grafana Dashboard Provisioning

Create dashboards as JSON files in grafana/dashboards/ (a provisioning provider sketch follows the panel list). Key panels:

Service Overview Dashboard:

  • Request rate:
    sum(rate(http_server_request_duration_seconds_count[5m])) by (service_name)
  • P95 latency:
    histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le, service_name))
  • Error rate:
    sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[5m])) / sum(rate(http_server_request_duration_seconds_count[5m]))
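
To have Grafana load these JSON files automatically, drop a dashboard provider into grafana/provisioning/dashboards/. A minimal sketch, assuming the volume layout used in the Docker Compose example below:

# grafana/provisioning/dashboards/default.yaml
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards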

Alert Rules:

  • Error rate > 5 % for 5 min → P2
  • P99 latency > 2s for 5 min → P3
  • Service health check failing for 1 min → P1
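
These thresholds translate into standard Prometheus-style alerting rules (evaluated by Mimir's ruler or by Grafana alerting). A sketch of the first rule, reusing the metric names from the panel queries above:

groups:
  - name: service-slos
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_request_duration_seconds_count{http_status_code=~"5.."}[5m]))
            / sum(rate(http_server_request_duration_seconds_count[5m])) > 0.05
        for: 5m
        labels:
          severity: P2
        annotations:
          summary: "Error rate above 5% for the last 5 minutes"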

Signal Correlation in Grafana

  1. Loki data source: add a derived field with regex trace_id=(\w+) linking to Tempo (see the data source sketch after this list).
  2. Tempo data source: enable "Trace to logs" linking to Loki with a filter such as {service_name="$service"} | json | trace_id="$traceId".
  3. Metrics → Logs: use exemplars in Mimir to link metric data points to trace IDs.
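
Step 1 can be expressed as a data source provisioning file. A sketch assuming Tempo's default HTTP port 3200 and the regex above (the doubled $$ escapes Grafana's variable interpolation in provisioning files):

# grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'
          url: '$${__value.raw}'
          datasourceUid: tempo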

Examples

Example 1 — Quick OTel setup for Express app

Input: "Add observability to my Express API."

Output: Create src/instrumentation.ts with the Node SDK config above, add the --require flag to the start script, and set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 and OTEL_SERVICE_NAME=api-gateway in .env.
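
The corresponding .env entries:

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_SERVICE_NAME=api-gateway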

Example 2 — Docker Compose for LGTM stack

Input: "Give me a docker-compose for the observability backend."

Output:

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    volumes: ["./otel-collector-config.yaml:/etc/otelcol/config.yaml"]
    ports: ["4317:4317", "4318:4318"]
  tempo:
    image: grafana/tempo:2.4.0
    command: ["-config.file=/etc/tempo.yaml"]
    volumes: ["./tempo.yaml:/etc/tempo.yaml"]
  loki:
    image: grafana/loki:3.0.0
    ports: ["3100:3100"]
  mimir:
    image: grafana/mimir:2.11.0
    command: ["-config.file=/etc/mimir.yaml"]
    volumes: ["./mimir.yaml:/etc/mimir.yaml"]
  grafana:
    image: grafana/grafana:10.4.0
    ports: ["3000:3000"]
    volumes: ["./grafana/provisioning:/etc/grafana/provisioning", "./grafana/dashboards:/var/lib/grafana/dashboards"]
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Admin

Guidelines

  • Instrument before you optimize. You cannot improve what you cannot see.
  • Auto-instrumentation first, custom spans second. Auto-instrumentation covers 80 % of what you need. Add custom spans only for business-critical paths.
  • Always correlate signals. A trace without logs is incomplete. A metric spike without a trace is unactionable.
  • Set resource limits on the collector. Without memory_limiter, a traffic spike can OOM the collector and create a cascading failure.
  • Use service.name consistently. It is the primary grouping key across all three signal types. Mismatched names break correlation.