Awesome-omni-skill monitoring

Production health checks, uptime monitoring, and performance metrics. Monitoring best practices for the devops-engineer agent.

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/development/monitoring" ~/.claude/skills/diegosouzapw-awesome-omni-skill-monitoring && rm -rf "$T"
manifest: skills/development/monitoring/SKILL.md
source content

Monitoring Skill

This skill is used by the devops-engineer agent to monitor systems and run health checks.


🎯 Monitoring Principles

USE Method (Utilization, Saturation, Errors)

┌────────────────────────────────────────┐
│  RESOURCE     │  U   │  S   │  E       │
├────────────────────────────────────────┤
│  CPU          │ 75%  │ 0.5  │ 0        │
│  Memory       │ 60%  │ 0.1  │ 0        │
│  Disk         │ 40%  │ 0.0  │ 2 errors │
│  Network      │ 30%  │ 0.0  │ 1 timeout│
└────────────────────────────────────────┘
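
Parts of this table can be approximated in-process. A minimal sketch using Node's built-in os module; the saturation and error fields are placeholders, since those values normally come from an agent such as node_exporter:

import os from 'os'

// Rough USE snapshot for CPU and memory from the os module
function useSnapshot() {
  const cores = os.cpus().length
  const [load1] = os.loadavg()

  return {
    cpu: {
      utilization: Math.min(load1 / cores, 1), // 1-min load average per core
      saturation: Math.max(load1 - cores, 0),  // runnable work beyond capacity
      errors: 0                                // needs hardware counters, not available here
    },
    memory: {
      utilization: 1 - os.freemem() / os.totalmem(),
      saturation: 0,                           // e.g. swap-in rate; not exposed by `os`
      errors: 0
    }
  }
}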

RED Method (Rate, Errors, Duration)

┌─────────────────────────────────────────────┐
│  ENDPOINT          │  R     │  E    │  D    │
├─────────────────────────────────────────────┤
│  GET /api/users    │ 150/s  │ 0%    │  50ms │
│  POST /api/auth    │  20/s  │ 2%    │ 200ms │
│  GET /api/orders   │  80/s  │ 0.5%  │ 120ms │
└─────────────────────────────────────────────┘
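
A table like this can be accumulated with a single Express middleware. A minimal sketch; the redStats map and the 60-second reporting window are illustrative choices:

let redStats = {}

// Accumulate Rate, Errors, Duration per endpoint
app.use((req, res, next) => {
  const start = Date.now()
  res.on('finish', () => {
    const key = `${req.method} ${req.path}`
    const s = redStats[key] ??= { requests: 0, errors: 0, totalMs: 0 }
    s.requests++
    if (res.statusCode >= 500) s.errors++
    s.totalMs += Date.now() - start
  })
  next()
})

// Report R/E/D per endpoint every 60s, then reset the window
setInterval(() => {
  for (const [endpoint, s] of Object.entries(redStats)) {
    logger.info('RED', {
      endpoint,
      rate: (s.requests / 60).toFixed(1) + '/s',
      errors: ((s.errors / s.requests) * 100).toFixed(1) + '%',
      duration: Math.round(s.totalMs / s.requests) + 'ms' // mean; percentiles need a histogram
    })
  }
  redStats = {}
}, 60000)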

📊 Critical Metrics

1. System Metrics

# CPU usage
top -bn1 | grep "Cpu(s)"

# Memory usage
free -h

# Disk usage
df -h

# Disk I/O
iostat -x 1 5

# Network
netstat -i

2. Application Metrics

Response Time (Latency)

// Measure via middleware
app.use((req, res, next) => {
  const start = Date.now()

  res.on('finish', () => {
    const duration = Date.now() - start
    metrics.recordResponseTime(req.path, duration)

    // Warn on any single response slower than 500ms
    // (p95-based alerting belongs in the metrics backend)
    if (duration > 500) {
      logger.warn('Slow response', { path: req.path, duration })
    }
  })

  next()
})

Error Rate

let totalRequests = 0
let errorRequests = 0

app.use((req, res, next) => {
  totalRequests++

  res.on('finish', () => {
    if (res.statusCode >= 500) {
      errorRequests++
    }

    const errorRate = (errorRequests / totalRequests) * 100

    // Critical if the error rate exceeds 1%
    if (errorRate > 1) {
      alerting.sendCritical('High error rate', { rate: errorRate })
    }
  })

  next()
})

Throughput

// Requests per second, derived from a per-minute counter
let requestCount = 0

app.use((req, res, next) => {
  requestCount++
  next()
})

setInterval(() => {
  const rps = requestCount / 60

  metrics.record('requests_per_second', rps)
  requestCount = 0
}, 60000)

3. Database Metrics

-- Slow queries (PostgreSQL)
SELECT
  query,
  mean_exec_time,
  calls
FROM pg_stat_statements
WHERE mean_exec_time > 1000
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Connection count
SELECT count(*) FROM pg_stat_activity;

-- Database size
SELECT pg_size_pretty(pg_database_size('mydb'));

# MongoDB metrics
mongo --eval "db.serverStatus().connections"
mongo --eval "db.stats()"

🚨 Alerting Strategy

Alert Levels

┌─────────────────────────────────────────────────────────────┐
│  LEVEL     │  THRESHOLD             │  ACTION               │
├─────────────────────────────────────────────────────────────┤
│  INFO      │  Normal event          │  Write to log         │
│  WARNING   │  Potential problem     │  Slack notification   │
│  ERROR     │  Significant failure   │  Email + Slack        │
│  CRITICAL  │  System about to crash │  PagerDuty + phone    │
└─────────────────────────────────────────────────────────────┘
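
One way to keep this mapping enforced is a small dispatcher keyed by severity. A sketch; notifySlack, sendEmail, pagePagerDuty, and callOnCall are hypothetical stand-ins for real integrations:

const severityRoutes = {
  INFO:     [(msg, ctx) => logger.info(msg, ctx)],
  WARNING:  [notifySlack],
  ERROR:    [sendEmail, notifySlack],
  CRITICAL: [pagePagerDuty, callOnCall]
}

// Fan an alert out to every channel configured for its severity
function dispatchAlert(severity, message, context = {}) {
  for (const notify of severityRoutes[severity] ?? severityRoutes.ERROR) {
    notify(message, context)
  }
}

// dispatchAlert('CRITICAL', 'High error rate', { rate: 2.3 })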

Example Alert Rules

# Prometheus alert rules
groups:
  - name: app_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: HighResponseTime
        expr: http_request_duration_seconds{quantile="0.95"} > 0.5
        for: 10m
        labels:
          severity: warning

      - alert: HighMemoryUsage
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 5m
        labels:
          severity: critical

🏥 Health Check Endpoints

Liveness Check

// /health/live - Is the service up?
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive', timestamp: Date.now() })
})

Readiness Check

// /health/ready - Is the service ready to receive traffic?
app.get('/health/ready', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    externalAPI: await checkExternalAPI()
  }

  const allHealthy = Object.values(checks).every(check => check.healthy)

  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? 'ready' : 'not_ready',
    checks,
    timestamp: Date.now()
  })
})

async function checkDatabase() {
  try {
    await db.query('SELECT 1')
    return { healthy: true }
  } catch (error) {
    return { healthy: false, error: error.message }
  }
}
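
checkRedis and checkExternalAPI follow the same pattern as checkDatabase. Sketches assuming an ioredis client and Node 18+ for the global fetch; the status URL is illustrative:

async function checkRedis() {
  try {
    await redis.ping() // ioredis resolves with 'PONG' when healthy
    return { healthy: true }
  } catch (error) {
    return { healthy: false, error: error.message }
  }
}

async function checkExternalAPI() {
  try {
    const res = await fetch('https://api.example.com/status', {
      signal: AbortSignal.timeout(2000) // don't let a slow dependency block readiness
    })
    return { healthy: res.ok }
  } catch (error) {
    return { healthy: false, error: error.message }
  }
}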

Startup Check

// /health/startup - Has initial startup completed?
let isStartupComplete = false

app.get('/health/startup', (req, res) => {
  if (isStartupComplete) {
    res.status(200).json({ status: 'started' })
  } else {
    res.status(503).json({ status: 'starting' })
  }
})

// Once startup completes
async function bootstrap() {
  await initializeDatabase()
  await warmupCache()
  await loadConfiguration()

  isStartupComplete = true
}

📈 Performance Monitoring

Golden Signals

1. LATENCY    - How fast are responses?
2. TRAFFIC    - How much demand is there?
3. ERRORS     - How many requests are failing?
4. SATURATION - How full are the resources?

Node.js Specific Metrics

import v8 from 'v8'
import process from 'process'

function getNodeMetrics() {
  const heapStats = v8.getHeapStatistics()
  const memUsage = process.memoryUsage()

  return {
    // Heap usage
    heap_total: heapStats.total_heap_size,
    heap_used: heapStats.used_heap_size,
    heap_limit: heapStats.heap_size_limit,

    // Memory
    rss: memUsage.rss, // Resident Set Size
    heap_total_mb: Math.round(memUsage.heapTotal / 1024 / 1024),
    heap_used_mb: Math.round(memUsage.heapUsed / 1024 / 1024),

    // Event Loop Lag
    event_loop_lag: getEventLoopLag(),

    // Uptime
    uptime_seconds: process.uptime(),

    // CPU
    cpu_usage: process.cpuUsage()
  }
}

// Event loop lag measurement: how late a 1s timer fires.
// Measured on an interval and cached, so reading the value
// from getNodeMetrics() does not disturb the measurement.
let eventLoopLag = 0
let lastCheck = Date.now()

setInterval(() => {
  const now = Date.now()
  eventLoopLag = now - lastCheck - 1000 // expected interval: 1000ms
  lastCheck = now
}, 1000)

function getEventLoopLag() {
  return eventLoopLag
}
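
Node's perf_hooks module offers a more precise built-in alternative: monitorEventLoopDelay samples delay into a histogram, giving percentiles instead of a single point-in-time value:

import { monitorEventLoopDelay } from 'perf_hooks'

// Sample event loop delay every 20ms into a histogram
const loopDelay = monitorEventLoopDelay({ resolution: 20 })
loopDelay.enable()

function getEventLoopDelayMs() {
  return {
    mean: loopDelay.mean / 1e6,          // histogram values are in nanoseconds
    p99: loopDelay.percentile(99) / 1e6
  }
}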

🔍 Log Monitoring

Structured Logging

// ✅ Structured log (JSON)
logger.info({
  message: 'User login',
  userId: '123',
  ip: req.ip,
  userAgent: req.headers['user-agent'],
  timestamp: new Date().toISOString(),
  duration: 250
})

// Easy to query once logs are aggregated:
// "Show me all logins from userId=123 in the last hour"

Log Levels

const logger = createLogger({
  level: process.env.LOG_LEVEL || 'info'
})

logger.error('Critical error', { error })   // Always logged
logger.warn('Warning', { context })         // Production
logger.info('User action', { userId })      // Production
logger.debug('Variable value', { value })   // Development only
logger.trace('Function call', { args })     // Development only

Log Sampling

// Don't write every log entry on high-traffic endpoints
const shouldLog = Math.random() < 0.1 // 10% sample

if (shouldLog) {
  logger.info('Request processed', { path: req.path })
}
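
Pure random sampling can drop exactly the entries you need during an incident. A common refinement is to sample only routine logs and always keep warnings and errors; the helper name and rate are illustrative:

const SAMPLE_RATE = 0.1

function logSampled(level, message, context) {
  // Warnings and errors always get through; only info-level noise is sampled
  if (level === 'info' && Math.random() >= SAMPLE_RATE) return
  logger[level](message, context)
}

logSampled('info', 'Request processed', { path: req.path }) // kept ~10% of the time
logSampled('error', 'Payment failed', { orderId })          // always kept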

🛠️ Monitoring Tools & Integration

Sentry Integration (MCP)

// Error tracking via Sentry MCP
import { SentryMCP } from '@mcp/sentry'

app.use((err, req, res, next) => {
  // Send the error to Sentry
  SentryMCP.captureException(err, {
    user: { id: req.userId },
    tags: { endpoint: req.path },
    extra: { body: req.body }
  })

  res.status(500).json({ error: 'Internal server error' })
})

Prometheus Metrics

import { register, Counter, Histogram } from 'prom-client'

// Counter
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status']
})

// Histogram (latency)
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'path'],
  buckets: [0.1, 0.5, 1, 2, 5]
})

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})
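
The counter and histogram above still need to be observed somewhere. A sketch wiring them into Express middleware with prom-client's startTimer helper:

app.use((req, res, next) => {
  // startTimer returns a function that records the elapsed seconds when called
  const endTimer = httpRequestDuration.startTimer({ method: req.method, path: req.path })

  res.on('finish', () => {
    // Prefer req.route?.path over req.path in production to avoid
    // unbounded label cardinality from IDs embedded in URLs
    httpRequestsTotal.inc({ method: req.method, path: req.path, status: res.statusCode })
    endTimer()
  })

  next()
})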

📊 Dashboard Example

Minimal Monitoring Dashboard

// /dashboard/metrics endpoint
app.get('/dashboard/metrics', async (req, res) => {
  const metrics = {
    system: {
      uptime: process.uptime(),
      memory: process.memoryUsage(),
      cpu: process.cpuUsage()
    },
    application: {
      total_requests: totalRequests,
      error_rate: (errorRequests / totalRequests * 100).toFixed(2) + '%',
      avg_response_time: calculateAvgResponseTime() + 'ms'
    },
    database: {
      active_connections: await getDbConnections(),
      slow_queries: await getSlowQueries()
    },
    alerts: {
      active: await getActiveAlerts(),
      recent: await getRecentAlerts(24) // Last 24h
    }
  }

  res.json(metrics)
})

🚀 Production Monitoring Checklist

Check these at the start of a session:

  • Are the health check endpoints responding?
  • Is the log aggregation system active?
  • Is error tracking (Sentry) set up?
  • Are the alerting rules defined?
  • Is the metrics endpoint exposed?
  • Is the dashboard reachable?
  • Is backup monitoring running?
  • Is SSL certificate expiry being tracked? (see the sketch below)
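
For the last item, certificate expiry can be checked without external tooling. A sketch using Node's tls module; the host and 14-day threshold are illustrative:

import tls from 'tls'

// Resolve with the number of days until the server certificate expires
function daysUntilCertExpiry(host, port = 443) {
  return new Promise((resolve, reject) => {
    const socket = tls.connect({ host, port, servername: host }, () => {
      const { valid_to } = socket.getPeerCertificate()
      socket.end()
      resolve((new Date(valid_to) - Date.now()) / 86_400_000) // ms per day
    })
    socket.on('error', reject)
  })
}

const days = await daysUntilCertExpiry('myapp.com')
if (days < 14) alerting.sendCritical('Certificate expiring soon', { days })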

🔔 Alert Response Playbook

When a Critical Alert Fires

# 1. Verify the situation
curl https://myapp.com/health/ready

# 2. Check recent logs
tail -f -n 200 /var/log/app/error.log

# 3. Check resource usage
top -bn1
free -h
df -h

# 4. Check service status
systemctl status myapp
docker ps

# 5. Review recent changes
git log --oneline -10

# 6. Roll back if necessary
git revert HEAD
./deploy.sh

# 7. Write an incident report
# docs/incidents/YYYY-MM-DD-incident.md

📝 Monitoring Best Practices

DO ✅

  • Establish a baseline - know what normal looks like
  • Analyze trends - how do metrics change over time?
  • Prevent alert fatigue - too many alerts means ignored alerts
  • Define an SLA - aim for 99.9% uptime
  • Review regularly - go over the dashboard once a week
  • Document - write alert playbooks

DON'T ❌

  • Reactive monitoring - don't look only after something breaks
  • Metric overload - 10 critical metrics beat 100 noisy ones
  • Silent failures - never swallow errors
  • Production debugging - monitor in production, don't debug there
  • Ignoring warnings - today's warning is tomorrow's critical

🎯 SLI/SLO/SLA

Service Level Indicators (SLI)

Availability = (Successful Requests / Total Requests) × 100
Latency p95 = 95th percentile response time
Error Rate = (Failed Requests / Total Requests) × 100

Service Level Objectives (SLO)

Target Availability: 99.9% (43.2 min downtime/month)
Target p95 Latency: < 200ms
Target Error Rate: < 0.1%
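
An SLO translates directly into an error budget. A quick sketch of the arithmetic behind the 43.2-minute figure, assuming a 30-day month:

// Allowed downtime (minutes) per period for a given availability SLO
function errorBudgetMinutes(sloPercent, days = 30) {
  const totalMinutes = days * 24 * 60          // 43,200 min in a 30-day month
  return totalMinutes * (1 - sloPercent / 100)
}

errorBudgetMinutes(99.9)  // => ~43.2
errorBudgetMinutes(99.5)  // => ~216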

Service Level Agreements (SLA)

Guaranteed Availability: 99.5%
If < 99.5%: 10% service credit
If < 99.0%: 25% service credit

🔗 Example Monitoring Stacks

Stack 1: Open Source

Prometheus → Metric collection
Grafana → Visualization
AlertManager → Alerting
Loki → Log aggregation
Jaeger → Distributed tracing

Stack 2: Cloud Native

CloudWatch (AWS) → Metrics + Logs
Datadog → APM + Monitoring
Sentry → Error tracking
PagerDuty → On-call alerting

Stack 3: Minimal (MCP)

Sentry MCP → Error tracking
Custom /metrics endpoint → Prometheus scrape
GitHub Actions → Uptime monitoring
Slack → Alerting

Last updated: 2026-01-26 · User: devops-engineer agent · Related skills: error-recovery, debugging