Gsd-skill-creator monitoring-observability

Provides monitoring and observability best practices covering the three pillars (logs, metrics, traces), OpenTelemetry instrumentation, Prometheus/Grafana dashboards, SLO-based alerting, and APM strategies. Use when setting up monitoring, observability, prometheus, grafana, opentelemetry, alerting, tracing, logging, metrics, dashboards, SLOs, or APM.

install
source · Clone the upstream repo
git clone https://github.com/Tibsfox/gsd-skill-creator
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/skills/ops/monitoring-observability" ~/.claude/skills/tibsfox-gsd-skill-creator-monitoring-observability && rm -rf "$T"
manifest: examples/skills/ops/monitoring-observability/SKILL.md
source content

Monitoring and Observability

Production systems require visibility into their behavior. Observability goes beyond simple monitoring by enabling you to ask arbitrary questions about system state using logs, metrics, and traces. This guide covers instrumentation, collection, visualization, alerting, and the operational patterns that prevent alert fatigue while keeping systems reliable.

The Three Pillars

| Pillar  | What It Captures               | Best For                                  | Key Tools                                |
|---------|--------------------------------|-------------------------------------------|------------------------------------------|
| Logs    | Discrete events with context   | Debugging specific requests, audit trails | ELK, Loki, CloudWatch Logs               |
| Metrics | Numeric measurements over time | Trends, thresholds, capacity planning     | Prometheus, Datadog, CloudWatch Metrics  |
| Traces  | Request flow across services   | Latency breakdown, dependency mapping     | Jaeger, Tempo, X-Ray                     |

| Question                                | Signal                                     |
|-----------------------------------------|--------------------------------------------|
| "Why did this request fail?"            | Logs (event detail) + Traces (call chain)  |
| "Is error rate increasing?"             | Metrics (counters over time)               |
| "Which service is slow?"                | Traces (span timing)                       |
| "What happened at 3:42 AM?"             | Logs (timestamped events)                  |
| "Are we within SLO budget?"             | Metrics (error ratio, latency percentiles) |
| "How do services depend on each other?" | Traces (service graph)                     |

OpenTelemetry SDK Setup

OpenTelemetry provides a vendor-neutral API for emitting all three signals. Instrument once, export anywhere.

Node.js Auto-Instrumentation

// tracing.ts -- Load BEFORE application code
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'order-service',
    [ATTR_SERVICE_VERSION]: process.env.APP_VERSION || '0.0.0',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingPaths: ['/healthz', '/readyz', '/metrics'],
      },
    }),
  ],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown().finally(() => process.exit(0)));

Manual Span Creation

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service', '1.0.0');

async function processOrder(orderId: string): Promise<Order> {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', orderId);

      const order = await tracer.startActiveSpan('db.fetchOrder', async (dbSpan) => {
        dbSpan.setAttribute('db.system', 'postgresql');
        try {
          return await db.orders.findById(orderId);
        } finally {
          dbSpan.end(); // end the child span even if the query throws
        }
      });

      span.addEvent('order.validated', { 'order.total': order.total });
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Prometheus Metrics

Prometheus uses a pull model -- it scrapes HTTP endpoints at intervals.

| Type      | Use Case                          | Example                            |
|-----------|-----------------------------------|------------------------------------|
| Counter   | Cumulative count (only goes up)   | Total requests, errors, bytes sent |
| Gauge     | Current value (goes up and down)  | Active connections, queue depth    |
| Histogram | Distribution of values in buckets | Request latency, response size     |
| Summary   | Quantiles calculated client-side  | Legacy use -- prefer histograms    |
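
A minimal scrape configuration sketch for the pull model; the job name, target address, and interval below are illustrative placeholders:

# prometheus.yml
scrape_configs:
  - job_name: order-service
    scrape_interval: 15s
    metrics_path: /metrics                # the endpoint exposed by the Express example below
    static_configs:
      - targets: ['order-service:3000']   # placeholder host:port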

Express Application Metrics

import express from 'express';
import promClient from 'prom-client';

const app = express();
promClient.collectDefaultMetrics({ prefix: 'orderservice_' });

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
});

app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    const labels = { method: req.method, route: req.route?.path || req.path, status_code: res.statusCode.toString() };
    end(labels);
    httpRequestsTotal.inc(labels);
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

Grafana Dashboard

Define dashboards as JSON for version control. Key panels for any service overview:

{
  "dashboard": {
    "title": "Order Service Overview",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [{
          "expr": "sum(rate(http_requests_total{service=\"order-service\"}[5m])) by (status_code)",
          "legendFormat": "{{status_code}}"
        }],
        "fieldConfig": { "defaults": { "unit": "reqps" } }
      },
      {
        "title": "P99 Latency",
        "type": "timeseries",
        "targets": [{
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=\"order-service\"}[5m])) by (le, route))",
          "legendFormat": "{{route}}"
        }],
        "fieldConfig": { "defaults": { "unit": "s" } }
      },
      {
        "title": "Error Rate (SLO: 99.9%)",
        "type": "stat",
        "targets": [{
          "expr": "1 - (sum(rate(http_requests_total{service=\"order-service\", status_code=~\"5..\"}[1h])) / sum(rate(http_requests_total{service=\"order-service\"}[1h])))",
          "instant": true
        }],
        "fieldConfig": { "defaults": { "unit": "percentunit" } }
      }
    ]
  }
}
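
If the dashboard JSON lives in version control, Grafana can load it from disk with a file provider. A minimal provisioning sketch -- the provider name, folder, and path are illustrative:

# grafana/provisioning/dashboards/services.yml
apiVersion: 1
providers:
  - name: service-dashboards    # illustrative provider name
    folder: Services            # Grafana folder to place the dashboards in
    type: file
    allowUiUpdates: false       # keep Git as the source of truth
    options:
      path: /var/lib/grafana/dashboards   # directory containing the JSON files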

SLO-Based Alerting

Alert on Service Level Objectives, not raw thresholds. This ties alerting to user-visible impact.

| Concept      | Definition                                    | Example                                   |
|--------------|-----------------------------------------------|-------------------------------------------|
| SLI          | Service Level Indicator -- a measured metric  | Proportion of requests < 300ms            |
| SLO          | Service Level Objective -- target for the SLI | 99.9% of requests < 300ms                 |
| Error Budget | Allowed failures = 1 - SLO                    | 0.1% can be slow (43 min/month)           |
| Burn Rate    | How fast you consume error budget             | 2x = burning twice as fast as sustainable |

Multi-Window Burn Rate Alerts

# alerts/slo-alerts.yml
groups:
  - name: slo-burn-rate
    rules:
      # Fast burn -- 2% of 30-day budget in 1 hour
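      # 14.4x = 0.02 / (1h / 720h): the burn rate that spends 2% of a 30-day budget in one hour.
      # Threshold = burn rate x error budget: 14.4 * 0.001 = 1.44% error ratio.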
      - alert: HighErrorBudgetBurnRate_Fast
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5..", service="order-service"}[1h]))
            / sum(rate(http_requests_total{service="order-service"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status_code=~"5..", service="order-service"}[5m]))
            / sum(rate(http_requests_total{service="order-service"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Burning error budget 14.4x faster than sustainable"
          runbook: "https://wiki.internal/runbooks/order-service-high-errors"

      # Slow burn -- 10% of budget in 3 days
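      # 1.0x = 0.10 / (72h / 720h): sustained, this rate spends the whole budget in exactly 30 days.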
      - alert: HighErrorBudgetBurnRate_Slow
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5..", service="order-service"}[3d]))
            / sum(rate(http_requests_total{service="order-service"}[3d]))
          ) > (1.0 * 0.001)
          and
          (
            sum(rate(http_requests_total{status_code=~"5..", service="order-service"}[6h]))
            / sum(rate(http_requests_total{service="order-service"}[6h]))
          ) > (1.0 * 0.001)
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Slowly burning error budget"

      # Latency SLO
      - alert: LatencySLOBreach
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="order-service"}[5m])) by (le)
          ) > 0.3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency exceeds 300ms SLO"

Log Aggregation Patterns

Structured Logging with Trace Correlation

import pino from 'pino';
import { context, trace } from '@opentelemetry/api';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  mixin() {
    const span = trace.getSpan(context.active());
    if (span) {
      const { traceId, spanId } = span.spanContext();
      return { traceId, spanId };
    }
    return {};
  },
  redact: {
    paths: ['req.headers.authorization', 'password', 'ssn', 'creditCard'],
    censor: '[REDACTED]',
  },
});

// Good: structured with context
logger.info({ orderId, userId, total: order.total }, 'Order placed successfully');

// Bad: unstructured string interpolation -- not queryable, PII leaks
// logger.info(`Order ${orderId} placed by user ${userId} for $${total}`);

Loki Query Examples

# Find errors for a specific trace
{service="order-service"} |= "error" | json | traceId="abc123def456"

# Error rate by service
sum(rate({service=~".+"} |= "error" [5m])) by (service)

# Slow database queries
{service="order-service"} | json | db_duration_ms > 500

Distributed Tracing with Context Propagation

Client -> API Gateway -> Order Service -> Payment Service -> Database
  |           |              |                |
  trace_id=abc trace_id=abc  trace_id=abc     trace_id=abc
  span_id=001  span_id=002   span_id=003      span_id=004
               parent=001    parent=002       parent=003

import { context, propagation, trace } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service', '1.0.0'); // same tracer as in the manual span example

// Outgoing: inject trace context into headers
async function callPaymentService(order: Order): Promise<PaymentResult> {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);

  return fetch('https://payment-service/charge', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', ...headers },
    body: JSON.stringify({ orderId: order.id, amount: order.total }),
  }).then(r => r.json());
}

// Incoming via message queue: extract manually
function processMessage(message: QueueMessage) {
  const parentContext = propagation.extract(context.active(), message.headers);
  context.with(parentContext, () => {
    tracer.startActiveSpan('processMessage', (span) => {
      // spans here are children of the original trace
      span.end();
    });
  });
}

APM Business Metrics

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('order-service', '1.0.0');

const orderCounter = meter.createCounter('orders_total', {
  description: 'Total orders processed',
});
const orderValue = meter.createHistogram('order_value_dollars', {
  description: 'Distribution of order values',
});

async function completeOrder(order: Order) {
  orderCounter.add(1, { payment_method: order.paymentMethod, region: order.region });
  orderValue.record(order.total, { currency: order.currency });
}

Alert Fatigue Prevention

| Strategy                      | Implementation                                | Impact                          |
|-------------------------------|-----------------------------------------------|---------------------------------|
| Alert on symptoms, not causes | Alert on error rate, not CPU                  | Fewer alerts, user-facing focus |
| Multi-window burn rates       | Short + long window must both fire            | Eliminates transient spikes     |
| Severity routing              | Critical -> page, Warning -> ticket           | Right urgency for right signal  |
| Alert deduplication           | Group related alerts by service               | One alert per incident          |
| Regular alert review          | Monthly: delete or tune noisy alerts          | Continuous improvement          |
| Require runbook links         | Every alert links to a runbook                | Responders know what to do      |
| Error budget based            | Alert when burning budget, not on every error | Tolerates expected failure rate |
| Inhibition rules              | Critical suppresses warning for same service  | Reduces duplicate noise         |
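
The severity routing, deduplication, and inhibition strategies above are typically wired up in Alertmanager. A minimal sketch assuming the newer matcher syntax; the receiver names are illustrative:

# alertmanager.yml
route:
  receiver: ticket-queue               # default: non-urgent alerts become tickets
  group_by: ['alertname', 'service']   # group related alerts per service
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty              # critical alerts page the on-call engineer

inhibit_rules:
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['service']                 # a critical alert silences warnings for the same service

receivers:
  - name: pagerduty                    # integration details omitted in this sketch
  - name: ticket-queue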

Anti-Patterns

| Anti-Pattern                           | Problem                                       | Fix                                                 |
|----------------------------------------|-----------------------------------------------|-----------------------------------------------------|
| Alerting on raw thresholds (CPU > 80%) | Alerts on non-issues, misses real problems    | Alert on SLOs and user-facing symptoms              |
| No structured logging                  | Logs are unsearchable at scale                | Use JSON structured logging with consistent fields  |
| Missing trace context in logs          | Cannot correlate logs with traces             | Inject traceId/spanId via logger mixin              |
| High-cardinality metric labels         | Prometheus memory explosion, slow queries     | Never use userId, requestId, or IP as label values  |
| Alerting without runbooks              | Responders waste time figuring out what to do | Require runbook URL in every alert annotation       |
| Sampling 100% of traces                | Storage costs explode, collectors overloaded  | Use head-based or tail-based sampling (1-10%)       |
| Logging PII in plaintext               | Compliance violations (GDPR, HIPAA)           | Redact sensitive fields, use structured redaction   |
| Dashboard sprawl without ownership     | Stale dashboards with broken queries          | Assign team owners, review quarterly                |
| Monitoring only infrastructure         | Misses application-level failures             | Add business KPI metrics (orders/sec, revenue)      |
| Ignoring metric staleness              | Stale metrics give false "all clear"          | Alert on absent metrics with the absent() function  |
| No log retention policy                | Storage costs grow indefinitely               | Set TTLs: 7d hot, 30d warm, 90d cold, archive       |
| Synchronous log shipping               | Log pipeline failure blocks application       | Use async buffered shipping with local fallback     |
| No baseline for normal                 | Cannot detect anomalies                       | Record baselines during stable periods              |
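
For the metric-staleness row above, a Prometheus rule built on absent() might look like this sketch; the metric, service, and runbook names follow the earlier examples in this guide:

# alerts/staleness.yml
groups:
  - name: metric-staleness
    rules:
      - alert: RequestMetricsAbsent
        expr: absent(http_requests_total{service="order-service"})
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "order-service has stopped reporting http_requests_total"
          runbook: "https://wiki.internal/runbooks/order-service-metrics-absent"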

Observability Maturity Checklist

Level 1: Foundations

  • Structured JSON logging with consistent field names across services
  • Health check endpoints (/healthz, /readyz) on all services
  • Prometheus metrics endpoint exposed on all services
  • Default runtime metrics collected (memory, CPU, GC, event loop)
  • Centralized log aggregation (ELK, Loki, or cloud-native)

Level 2: Instrumentation

  • OpenTelemetry SDK integrated with auto-instrumentation
  • Custom business metrics defined (orders/sec, revenue, conversion)
  • Distributed tracing with context propagation across services
  • Trace context injected into log entries (traceId, spanId)
  • Grafana dashboards with RED metrics (Rate, Errors, Duration)

Level 3: Alerting

  • SLOs defined for each critical user journey
  • Multi-window burn rate alerts replacing threshold alerts
  • Every alert has a linked runbook with remediation steps
  • Alert routing configured by severity (page / ticket / dashboard)
  • Alert fatigue review conducted monthly
  • absent() alerts for metrics that stop reporting

Level 4: Advanced

  • Tail-based trace sampling to capture errors and slow requests
  • Exemplars linking metrics to specific traces
  • Error budget tracking visible on team dashboards
  • Anomaly detection on key business metrics
  • Observability costs tracked and optimized
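
Tail-based sampling from this level is usually configured in the OpenTelemetry Collector (contrib distribution) rather than in application code. A sketch of the tail_sampling processor -- the policy names, latency threshold, and sampling percentage are illustrative:

# otel-collector config excerpt (requires the contrib build's tail_sampling processor)
processors:
  tail_sampling:
    decision_wait: 10s           # buffer spans until the trace is likely complete
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 500}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp]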