Agentic-creator-os monitoring-observability

Monitoring, logging, and observability patterns. Covers structured logging, metrics, tracing, alerting, and dashboards with tools like Sentry, Datadog, and OpenTelemetry.

install
source · Clone the upstream repo
git clone https://github.com/frankxai/agentic-creator-os
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/frankxai/agentic-creator-os "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/technical/monitoring-observability" ~/.claude/skills/frankxai-agentic-creator-os-monitoring-observability && rm -rf "$T"
manifest: skills/technical/monitoring-observability/SKILL.md
source content

Monitoring & Observability Skill

Implement comprehensive observability for production applications with logging, metrics, and tracing.

The Three Pillars

| Pillar  | Purpose           | Tools                  |
|---------|-------------------|------------------------|
| Logs    | What happened     | Pino, Winston, Sentry  |
| Metrics | Quantitative data | Prometheus, Datadog    |
| Traces  | Request flow      | OpenTelemetry, Jaeger  |

Structured Logging

Pino Setup (Recommended)

// lib/logger.ts
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  redact: ['password', 'token', 'authorization', 'cookie'],
  base: {
    env: process.env.NODE_ENV,
    version: process.env.APP_VERSION,
  },
});

// Usage
logger.info({ userId: '123', action: 'login' }, 'User logged in');
logger.error({ err, requestId }, 'Request failed');
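The `redact` option censors matching keys before the log line is written, so secrets never reach your log sink. Conceptually it behaves like this sketch (plain TypeScript, not Pino's implementation — Pino's fast-redact matches explicit paths and wildcards rather than key names at arbitrary depth):

```typescript
// Replace the values of sensitive keys with a censor string before
// a log object is serialized. Illustrative only; Pino uses fast-redact.
const CENSOR = '[Redacted]';

function redact(obj: unknown, keys: Set<string>): unknown {
  if (obj === null || typeof obj !== 'object') return obj;
  if (Array.isArray(obj)) return obj.map((v) => redact(v, keys));
  const out: Record<string, unknown> = {};
  for (const [k, v] of Object.entries(obj as Record<string, unknown>)) {
    out[k] = keys.has(k) ? CENSOR : redact(v, keys);
  }
  return out;
}

const safe = redact(
  { user: 'ada', password: 'hunter2', headers: { authorization: 'Bearer x' } },
  new Set(['password', 'token', 'authorization', 'cookie']),
);
// safe.password and safe.headers.authorization are both '[Redacted]'
```

The point of redacting at the logger rather than at call sites is that one configuration protects every log statement, including ones added later.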

Request Logging Middleware

// middleware.ts
import { NextRequest, NextResponse } from 'next/server';
import { logger } from '@/lib/logger';
import { nanoid } from 'nanoid';

export function middleware(request: NextRequest) {
  const requestId = nanoid();
  const start = Date.now();

  const response = NextResponse.next();
  response.headers.set('x-request-id', requestId);

  // Note: middleware returns before the route handler runs, so the
  // duration here covers middleware work only and status is the
  // provisional value; log final timings in the route handler if needed.
  logger.info({
    requestId,
    method: request.method,
    path: request.nextUrl.pathname,
    duration: Date.now() - start,
    status: response.status,
    userAgent: request.headers.get('user-agent'),
  }, 'Request received');

  return response;
}
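If an upstream proxy or gateway already assigns a request ID, reuse it so logs correlate across services instead of minting a second one. A small helper (a sketch: `x-request-id` matches the header used above, and `node:crypto`'s `randomUUID` stands in for `nanoid` so the example has no dependencies):

```typescript
import { randomUUID } from 'node:crypto';

// Reuse an incoming x-request-id when an upstream hop set one,
// otherwise mint a fresh id. Headers is the standard Fetch API
// class, global in Node 18+.
function getOrCreateRequestId(headers: Headers): string {
  return headers.get('x-request-id') ?? randomUUID();
}

const incoming = new Headers({ 'x-request-id': 'abc-123' });
getOrCreateRequestId(incoming);      // 'abc-123' (propagated)
getOrCreateRequestId(new Headers()); // fresh UUID
```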

Error Tracking with Sentry

Setup

// sentry.client.config.ts
import * as Sentry from '@sentry/nextjs';

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  environment: process.env.NODE_ENV,
  tracesSampleRate: 0.1, // 10% of transactions
  replaysSessionSampleRate: 0.1,
  replaysOnErrorSampleRate: 1.0,
});

Error Boundary

// components/ErrorBoundary.tsx
'use client';

import { useEffect } from 'react';
import * as Sentry from '@sentry/nextjs';

export function ErrorBoundary({ error, reset }: {
  error: Error & { digest?: string };
  reset: () => void;
}) {
  useEffect(() => {
    Sentry.captureException(error);
  }, [error]);

  return (
    <div className="error-container">
      <h2>Something went wrong</h2>
      <button onClick={reset}>Try again</button>
    </div>
  );
}

Manual Error Capture

import * as Sentry from '@sentry/nextjs';

try {
  await riskyOperation();
} catch (error) {
  Sentry.captureException(error, {
    tags: { feature: 'payment' },
    extra: { userId, orderId },
  });
  throw error;
}
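The try/catch/rethrow pattern above repeats quickly at call sites. A generic wrapper keeps it in one place (a sketch in plain TypeScript; the `capture` callback stands in for `Sentry.captureException` so the example runs without the SDK):

```typescript
// Run an async operation; report any thrown error to a capture hook,
// then rethrow so callers still observe the failure.
async function withCapture<T>(
  op: () => Promise<T>,
  capture: (err: unknown) => void,
): Promise<T> {
  try {
    return await op();
  } catch (err) {
    capture(err); // report…
    throw err;    // …but never swallow
  }
}

// Usage sketch with Sentry:
// await withCapture(() => riskyOperation(), (err) =>
//   Sentry.captureException(err, { tags: { feature: 'payment' } }));
```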

Metrics with Prometheus

Metrics Endpoint

// lib/metrics.ts
// Define metrics in a shared module: Next.js route files may only
// export route handlers, not arbitrary values.
import { Registry, Counter, Histogram, collectDefaultMetrics } from 'prom-client';

export const register = new Registry();
collectDefaultMetrics({ register });

// Custom metrics
export const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [register],
});

export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'path'],
  buckets: [0.1, 0.3, 0.5, 1, 3, 5, 10],
  registers: [register],
});

// app/api/metrics/route.ts
import { register } from '@/lib/metrics';

export async function GET() {
  return new Response(await register.metrics(), {
    headers: { 'Content-Type': register.contentType },
  });
}
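The `buckets` array defines cumulative upper bounds: each observation increments every bucket whose bound it fits under, plus an implicit `+Inf` bucket, and it is this cumulative shape that lets `histogram_quantile` estimate percentiles later. A dependency-free sketch of that bookkeeping:

```typescript
// Cumulative histogram: an observation increments every bucket whose
// upper bound (le) is >= the observed value, plus the +Inf bucket.
const bounds = [0.1, 0.3, 0.5, 1, 3, 5, 10];

function observe(counts: number[], seconds: number): void {
  bounds.forEach((le, i) => { if (seconds <= le) counts[i]++; });
  counts[bounds.length]++; // +Inf counts everything
}

const counts = new Array(bounds.length + 1).fill(0);
[0.05, 0.2, 0.7, 4].forEach((s) => observe(counts, s));
// counts is now [1, 2, 2, 3, 3, 4, 4, 4]
//                le=0.1 ………………… le=10, +Inf
```

Because the buckets are cumulative, each series is a monotonically non-decreasing counter, which is what makes `rate(..._bucket[5m])` well-defined.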

Distributed Tracing with OpenTelemetry

// instrumentation.ts — runs at process startup
// (in Next.js, call sdk.start() from the exported register() hook)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Health Checks

// app/api/health/route.ts
import { db } from '@/lib/db';
import { redis } from '@/lib/redis';

export async function GET() {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    uptime: process.uptime(),
  };

  const healthy = Object.values(checks).every(c =>
    typeof c === 'object' ? c.status === 'ok' : true
  );

  return Response.json(
    { status: healthy ? 'healthy' : 'unhealthy', checks },
    { status: healthy ? 200 : 503 }
  );
}

async function checkDatabase() {
  try {
    await db.$queryRaw`SELECT 1`;
    return { status: 'ok' };
  } catch (error) {
    // `error` is `unknown` in strict TS; narrow before reading .message
    return { status: 'error', error: error instanceof Error ? error.message : String(error) };
  }
}

async function checkRedis() {
  try {
    await redis.ping();
    return { status: 'ok' };
  } catch (error) {
    return { status: 'error', error: error instanceof Error ? error.message : String(error) };
  }
}
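A dependency check that hangs makes the whole health endpoint hang, and load balancers treat a stalled probe as a failure anyway. Wrapping each check in a deadline keeps `/api/health` bounded (a sketch using `Promise.race`; the 2-second budget in the usage line is an arbitrary choice):

```typescript
// Race a health check against a deadline so a stuck dependency
// degrades to an 'error' result instead of stalling the endpoint.
async function withTimeout<T>(
  check: () => Promise<T>,
  ms: number,
): Promise<T | { status: 'error'; error: string }> {
  let timer: ReturnType<typeof setTimeout>;
  const deadline = new Promise<{ status: 'error'; error: string }>((resolve) => {
    timer = setTimeout(
      () => resolve({ status: 'error', error: `timed out after ${ms}ms` }),
      ms,
    );
  });
  try {
    return await Promise.race([check(), deadline]);
  } finally {
    clearTimeout(timer!); // don't leave the timer pending on the fast path
  }
}

// Usage: database: await withTimeout(checkDatabase, 2000)
```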

Alerting Rules

# prometheus/alerts.yml
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: SlowResponses
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile latency above 2s"
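The `HighErrorRate` expression divides the 5xx request rate by the total request rate; since both use the same 5m window, the per-second factors cancel and the ratio reduces to errors over total for that window. The `for: 5m` clause then requires the condition to hold for five consecutive minutes before firing, which filters out momentary blips. As arithmetic:

```typescript
// rate(5xx[5m]) / rate(total[5m]) reduces to errors / total
// over the window, because the window lengths cancel.
function errorRatio(errors5xx: number, total: number): number {
  return total === 0 ? 0 : errors5xx / total;
}

const breached = errorRatio(6, 100) > 0.05; // 0.06 > 0.05 -> alert pending
const healthy = errorRatio(3, 100) > 0.05;  // 0.03 -> no alert
```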

Dashboard Queries (Grafana)

# Request rate
rate(http_requests_total[5m])

# Error rate percentage
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Active users
sum(increase(user_sessions_total[1h]))
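The P95 query works on the cumulative bucket counts: `histogram_quantile` finds the bucket containing the target rank and interpolates linearly inside it, assuming observations are spread evenly within the bucket. A simplified sketch of the algorithm (Prometheus's version also handles edge cases such as negative buckets and malformed data):

```typescript
// Simplified histogram_quantile: buckets are (upperBound, cumulativeCount)
// pairs sorted by bound, ending with the +Inf bucket.
function histogramQuantile(q: number, buckets: [number, number][]): number {
  const total = buckets[buckets.length - 1][1];
  const rank = q * total;
  let prevBound = 0;
  let prevCount = 0;
  for (const [bound, count] of buckets) {
    if (count >= rank) {
      if (bound === Infinity) return prevBound; // cannot interpolate into +Inf
      return prevBound + (bound - prevBound) * ((rank - prevCount) / (count - prevCount));
    }
    prevBound = bound;
    prevCount = count;
  }
  return prevBound;
}

// 100 requests: 90 under 0.5s, 99 under 1s, all under +Inf.
const p95 = histogramQuantile(0.95, [[0.5, 90], [1, 99], [Infinity, 100]]);
// rank 95 lands in the (0.5, 1] bucket: 0.5 + 0.5 * (95-90)/(99-90) ≈ 0.778
```

This is why bucket boundaries matter: the estimate can only be as precise as the bucket the quantile lands in.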

Anti-Patterns

❌ Logging sensitive data (passwords, tokens)
❌ No request IDs for correlation
❌ Sampling at 100% in production
❌ Ignoring errors silently
❌ No alerts on critical paths

✅ Structured JSON logs with redaction
✅ Request ID propagation
✅ Appropriate sampling rates
✅ Capture and alert on errors
✅ Runbooks for each alert