Awesome-omni-skill monitoring

Production health checks, uptime monitoring, and performance metrics. Monitoring best practices for the devops-engineer agent.

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/development/monitoring" ~/.claude/skills/diegosouzapw-awesome-omni-skill-monitoring && rm -rf "$T"
manifest: skills/development/monitoring/SKILL.md
source content

Monitoring Skill

This skill is used by the devops-engineer agent to monitor systems and run health checks.


🎯 Monitoring Principles

USE Method (Utilization, Saturation, Errors)

┌────────────────────────────────────────┐
│  RESOURCE     │  U   │  S   │  E       │
├────────────────────────────────────────┤
│  CPU          │ 75%  │ 0.5  │ 0        │
│  Memory       │ 60%  │ 0.1  │ 0        │
│  Disk         │ 40%  │ 0.0  │ 2 errors │
│  Network      │ 30%  │ 0.0  │ 1 timeout│
└────────────────────────────────────────┘
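
Parts of this table can be approximated in-process. A minimal sketch using Node's built-in os module; the saturation and error fields are placeholders, since those values normally come from an agent such as node_exporter:

import os from 'os'

// Rough USE snapshot for CPU and memory from the os module
function useSnapshot() {
  const cores = os.cpus().length
  const [load1] = os.loadavg()

  return {
    cpu: {
      utilization: Math.min(load1 / cores, 1), // 1-min load average per core
      saturation: Math.max(load1 - cores, 0),  // runnable work beyond capacity
      errors: 0                                // needs hardware counters, not available here
    },
    memory: {
      utilization: 1 - os.freemem() / os.totalmem(),
      saturation: 0,                           // e.g. swap-in rate; not exposed by `os`
      errors: 0
    }
  }
}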

RED Method (Rate, Errors, Duration)

┌─────────────────────────────────────────────┐
│  ENDPOINT          │  R     │  E    │  D    │
├─────────────────────────────────────────────┤
│  GET /api/users    │ 150/s  │ 0%    │  50ms │
│  POST /api/auth    │  20/s  │ 2%    │ 200ms │
│  GET /api/orders   │  80/s  │ 0.5%  │ 120ms │
└─────────────────────────────────────────────┘
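
A table like this can be accumulated with a single Express middleware. A minimal sketch; the redStats map and the 60-second reporting window are illustrative choices:

let redStats = {}

// Accumulate Rate, Errors, Duration per endpoint
app.use((req, res, next) => {
  const start = Date.now()
  res.on('finish', () => {
    const key = `${req.method} ${req.path}`
    const s = redStats[key] ??= { requests: 0, errors: 0, totalMs: 0 }
    s.requests++
    if (res.statusCode >= 500) s.errors++
    s.totalMs += Date.now() - start
  })
  next()
})

// Report R/E/D per endpoint every 60s, then reset the window
setInterval(() => {
  for (const [endpoint, s] of Object.entries(redStats)) {
    logger.info('RED', {
      endpoint,
      rate: (s.requests / 60).toFixed(1) + '/s',
      errors: ((s.errors / s.requests) * 100).toFixed(1) + '%',
      duration: Math.round(s.totalMs / s.requests) + 'ms' // mean; percentiles need a histogram
    })
  }
  redStats = {}
}, 60000)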

📊 Critical Metrics

1. System Metrics

# CPU usage
top -bn1 | grep "Cpu(s)"

# Memory usage
free -h

# Disk usage
df -h

# Disk I/O
iostat -x 1 5

# Network
netstat -i

2. Application Metrics

Response Time (Latency)

// Measure via middleware
app.use((req, res, next) => {
  const start = Date.now()

  res.on('finish', () => {
    const duration = Date.now() - start
    metrics.recordResponseTime(req.path, duration)

    // Warn on any single response slower than 500ms
    // (p95-based alerting belongs in the metrics backend)
    if (duration > 500) {
      logger.warn('Slow response', { path: req.path, duration })
    }
  })

  next()
})

Error Rate

let totalRequests = 0
let errorRequests = 0

app.use((req, res, next) => {
  totalRequests++

  res.on('finish', () => {
    if (res.statusCode >= 500) {
      errorRequests++
    }

    const errorRate = (errorRequests / totalRequests) * 100

    // Critical if the error rate exceeds 1%
    if (errorRate > 1) {
      alerting.sendCritical('High error rate', { rate: errorRate })
    }
  })

  next()
})

Throughput

// Requests per second, derived from a per-minute counter
let requestCount = 0

app.use((req, res, next) => {
  requestCount++
  next()
})

setInterval(() => {
  const rps = requestCount / 60

  metrics.record('requests_per_second', rps)
  requestCount = 0
}, 60000)

3. Database Metrics

-- Slow queries (PostgreSQL)
SELECT
  query,
  mean_exec_time,
  calls
FROM pg_stat_statements
WHERE mean_exec_time > 1000
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Connection count
SELECT count(*) FROM pg_stat_activity;

-- Database size
SELECT pg_size_pretty(pg_database_size('mydb'));

# MongoDB metrics
mongo --eval "db.serverStatus().connections"
mongo --eval "db.stats()"

🚨 Alerting Strategy

Alert Levels

┌─────────────────────────────────────────────────────────────┐
│  LEVEL     │  THRESHOLD             │  ACTION               │
├─────────────────────────────────────────────────────────────┤
│  INFO      │  Normal event          │  Write to log         │
│  WARNING   │  Potential problem     │  Slack notification   │
│  ERROR     │  Significant failure   │  Email + Slack        │
│  CRITICAL  │  System about to crash │  PagerDuty + phone    │
└─────────────────────────────────────────────────────────────┘
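
One way to keep this mapping enforced is a small dispatcher keyed by severity. A sketch; notifySlack, sendEmail, pagePagerDuty, and callOnCall are hypothetical stand-ins for real integrations:

const severityRoutes = {
  INFO:     [(msg, ctx) => logger.info(msg, ctx)],
  WARNING:  [notifySlack],
  ERROR:    [sendEmail, notifySlack],
  CRITICAL: [pagePagerDuty, callOnCall]
}

// Fan an alert out to every channel configured for its severity
function dispatchAlert(severity, message, context = {}) {
  for (const notify of severityRoutes[severity] ?? severityRoutes.ERROR) {
    notify(message, context)
  }
}

// dispatchAlert('CRITICAL', 'High error rate', { rate: 2.3 })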

Example Alert Rules

# Prometheus alert rules
groups:
  - name: app_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: HighResponseTime
        expr: http_request_duration_seconds{quantile="0.95"} > 0.5
        for: 10m
        labels:
          severity: warning

      - alert: HighMemoryUsage
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: critical

🏥 Health Check Endpoints

Liveness Check

// /health/live - Is the service up?
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive', timestamp: Date.now() })
})

Readiness Check

// /health/ready - Is the service ready to receive traffic?
app.get('/health/ready', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    externalAPI: await checkExternalAPI()
  }

  const allHealthy = Object.values(checks).every(check => check.healthy)

  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? 'ready' : 'not_ready',
    checks,
    timestamp: Date.now()
  })
})

async function checkDatabase() {
  try {
    await db.query('SELECT 1')
    return { healthy: true }
  } catch (error) {
    return { healthy: false, error: error.message }
  }
}
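
checkRedis and checkExternalAPI follow the same pattern as checkDatabase. Sketches assuming an ioredis client and Node 18+ for the global fetch; the status URL is illustrative:

async function checkRedis() {
  try {
    await redis.ping() // ioredis resolves with 'PONG' when healthy
    return { healthy: true }
  } catch (error) {
    return { healthy: false, error: error.message }
  }
}

async function checkExternalAPI() {
  try {
    const res = await fetch('https://api.example.com/status', {
      signal: AbortSignal.timeout(2000) // don't let a slow dependency block readiness
    })
    return { healthy: res.ok }
  } catch (error) {
    return { healthy: false, error: error.message }
  }
}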

Startup Check

// /health/startup - Has initial startup completed?
let isStartupComplete = false

app.get('/health/startup', (req, res) => {
  if (isStartupComplete) {
    res.status(200).json({ status: 'started' })
  } else {
    res.status(503).json({ status: 'starting' })
  }
})

// Once startup completes
async function bootstrap() {
  await initializeDatabase()
  await warmupCache()
  await loadConfiguration()

  isStartupComplete = true
}

📈 Performance Monitoring

Golden Signals

1. LATENCY    - How fast are responses?
2. TRAFFIC    - How much demand is there?
3. ERRORS     - How many requests are failing?
4. SATURATION - How full are the resources?

Node.js Specific Metrics

import v8 from 'v8'
import process from 'process'

function getNodeMetrics() {
  const heapStats = v8.getHeapStatistics()
  const memUsage = process.memoryUsage()

  return {
    // Heap usage
    heap_total: heapStats.total_heap_size,
    heap_used: heapStats.used_heap_size,
    heap_limit: heapStats.heap_size_limit,

    // Memory
    rss: memUsage.rss, // Resident Set Size
    heap_total_mb: Math.round(memUsage.heapTotal / 1024 / 1024),
    heap_used_mb: Math.round(memUsage.heapUsed / 1024 / 1024),

    // Event Loop Lag
    event_loop_lag: getEventLoopLag(),

    // Uptime
    uptime_seconds: process.uptime(),

    // CPU
    cpu_usage: process.cpuUsage()
  }
}

// Event loop lag measurement: how late a 1s timer fires.
// Measured on an interval and cached, so reading the value
// from getNodeMetrics() does not disturb the measurement.
let eventLoopLag = 0
let lastCheck = Date.now()

setInterval(() => {
  const now = Date.now()
  eventLoopLag = now - lastCheck - 1000 // expected interval: 1000ms
  lastCheck = now
}, 1000)

function getEventLoopLag() {
  return eventLoopLag
}
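
Node's perf_hooks module offers a more precise built-in alternative: monitorEventLoopDelay samples delay into a histogram, giving percentiles instead of a single point-in-time value:

import { monitorEventLoopDelay } from 'perf_hooks'

// Sample event loop delay every 20ms into a histogram
const loopDelay = monitorEventLoopDelay({ resolution: 20 })
loopDelay.enable()

function getEventLoopDelayMs() {
  return {
    mean: loopDelay.mean / 1e6,          // histogram values are in nanoseconds
    p99: loopDelay.percentile(99) / 1e6
  }
}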

🔍 Log Monitoring

Structured Logging

// ✅ Structured log (JSON)
logger.info({
  message: 'User login',
  userId: '123',
  ip: req.ip,
  userAgent: req.headers['user-agent'],
  timestamp: new Date().toISOString(),
  duration: 250
})

// Easy to query once logs are aggregated:
// "Show me all logins from userId=123 in the last hour"

Log Levels

const logger = createLogger({
  level: process.env.LOG_LEVEL || 'info'
})

logger.error('Critical error', { error })   // Always logged
logger.warn('Warning', { context })         // Production
logger.info('User action', { userId })      // Production
logger.debug('Variable value', { value })   // Development only
logger.trace('Function call', { args })     // Development only

Log Sampling

// Don't write every log entry on high-traffic endpoints
const shouldLog = Math.random() < 0.1 // 10% sample

if (shouldLog) {
  logger.info('Request processed', { path: req.path })
}
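
Pure random sampling can drop exactly the entries you need during an incident. A common refinement is to sample only routine logs and always keep warnings and errors; the helper name and rate are illustrative:

const SAMPLE_RATE = 0.1

function logSampled(level, message, context) {
  // Warnings and errors always get through; only info-level noise is sampled
  if (level === 'info' && Math.random() >= SAMPLE_RATE) return
  logger[level](message, context)
}

logSampled('info', 'Request processed', { path: req.path }) // kept ~10% of the time
logSampled('error', 'Payment failed', { orderId })          // always kept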

🛠️ Monitoring Tools & Integration

Sentry Integration (MCP)

// Error tracking via Sentry MCP
import { SentryMCP } from '@mcp/sentry'

app.use((err, req, res, next) => {
  // Send the error to Sentry
  SentryMCP.captureException(err, {
    user: { id: req.userId },
    tags: { endpoint: req.path },
    extra: { body: req.body }
  })

  res.status(500).json({ error: 'Internal server error' })
})

Prometheus Metrics

import { register, Counter, Histogram } from 'prom-client'

// Counter
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status']
})

// Histogram (latency)
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'path'],
  buckets: [0.1, 0.5, 1, 2, 5]
})

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})
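
The counter and histogram above still need to be observed somewhere. A sketch wiring them into Express middleware with prom-client's startTimer helper:

app.use((req, res, next) => {
  // startTimer returns a function that records the elapsed seconds when called
  const endTimer = httpRequestDuration.startTimer({ method: req.method, path: req.path })

  res.on('finish', () => {
    // Prefer req.route?.path over req.path in production to avoid
    // unbounded label cardinality from IDs embedded in URLs
    httpRequestsTotal.inc({ method: req.method, path: req.path, status: res.statusCode })
    endTimer()
  })

  next()
})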

📊 Dashboard Example

Minimal Monitoring Dashboard

// /dashboard/metrics endpoint
app.get('/dashboard/metrics', async (req, res) => {
  const metrics = {
    system: {
      uptime: process.uptime(),
      memory: process.memoryUsage(),
      cpu: process.cpuUsage()
    },
    application: {
      total_requests: totalRequests,
      error_rate: (errorRequests / totalRequests * 100).toFixed(2) + '%',
      avg_response_time: calculateAvgResponseTime() + 'ms'
    },
    database: {
      active_connections: await getDbConnections(),
      slow_queries: await getSlowQueries()
    },
    alerts: {
      active: await getActiveAlerts(),
      recent: await getRecentAlerts(24) // Last 24h
    }
  }

  res.json(metrics)
})

🚀 Production Monitoring Checklist

Check these at the start of a session:

  • Are the health check endpoints responding?
  • Is the log aggregation system active?
  • Is error tracking (Sentry) set up?
  • Are the alerting rules defined?
  • Is the metrics endpoint exposed?
  • Is the dashboard reachable?
  • Is backup monitoring running?
  • Is SSL certificate expiry being tracked? (see the sketch below)
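
For the last item, certificate expiry can be checked without external tooling. A sketch using Node's tls module; the host and 14-day threshold are illustrative:

import tls from 'tls'

// Resolve with the number of days until the server certificate expires
function daysUntilCertExpiry(host, port = 443) {
  return new Promise((resolve, reject) => {
    const socket = tls.connect({ host, port, servername: host }, () => {
      const { valid_to } = socket.getPeerCertificate()
      socket.end()
      resolve((new Date(valid_to) - Date.now()) / 86_400_000) // ms per day
    })
    socket.on('error', reject)
  })
}

const days = await daysUntilCertExpiry('myapp.com')
if (days < 14) alerting.sendCritical('Certificate expiring soon', { days })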

🔔 Alert Response Playbook

When a Critical Alert Fires

# 1. Verify the situation
curl https://myapp.com/health/ready

# 2. Check recent logs
tail -f -n 200 /var/log/app/error.log

# 3. Check resource usage
top -bn1
free -h
df -h

# 4. Check service status
systemctl status myapp
docker ps

# 5. Review recent changes
git log --oneline -10

# 6. Roll back if necessary
git revert HEAD
./deploy.sh

# 7. Write an incident report
# docs/incidents/YYYY-MM-DD-incident.md

📝 Monitoring Best Practices

DO ✅

  • Establish a baseline - know what normal looks like
  • Analyze trends - how do metrics change over time?
  • Prevent alert fatigue - too many alerts means ignored alerts
  • Define an SLA - aim for 99.9% uptime
  • Review regularly - go over the dashboard once a week
  • Document - write alert playbooks

DON'T ❌

  • Reactive monitoring - don't look only after something breaks
  • Metric overload - 10 critical metrics beat 100 noisy ones
  • Silent failures - never swallow errors
  • Production debugging - monitor in production, don't debug there
  • Ignoring warnings - today's warning is tomorrow's critical

🎯 SLI/SLO/SLA

Service Level Indicators (SLI)

Availability = (Successful Requests / Total Requests) × 100
Latency p95 = 95th percentile response time
Error Rate = (Failed Requests / Total Requests) × 100

Service Level Objectives (SLO)

Target Availability: 99.9% (43.2 min downtime/month)
Target p95 Latency: < 200ms
Target Error Rate: < 0.1%
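
An SLO translates directly into an error budget. A quick sketch of the arithmetic behind the 43.2-minute figure, assuming a 30-day month:

// Allowed downtime (minutes) per period for a given availability SLO
function errorBudgetMinutes(sloPercent, days = 30) {
  const totalMinutes = days * 24 * 60          // 43,200 min in a 30-day month
  return totalMinutes * (1 - sloPercent / 100)
}

errorBudgetMinutes(99.9)  // => ~43.2
errorBudgetMinutes(99.5)  // => ~216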

Service Level Agreements (SLA)

Guaranteed Availability: 99.5%
If < 99.5%: 10% service credit
If < 99.0%: 25% service credit

🔗 Example Monitoring Stacks

Stack 1: Open Source

Prometheus → Metric collection
Grafana → Visualization
AlertManager → Alerting
Loki → Log aggregation
Jaeger → Distributed tracing

Stack 2: Cloud Native

CloudWatch (AWS) → Metrics + Logs
Datadog → APM + Monitoring
Sentry → Error tracking
PagerDuty → On-call alerting

Stack 3: Minimal (MCP)

Sentry MCP → Error tracking
Custom /metrics endpoint → Prometheus scrape
GitHub Actions → Uptime monitoring
Slack → Alerting

Last updated: 2026-01-26 · User: devops-engineer agent · Related skills: error-recovery, debugging