Awesome-omni-skill Monitoring

Set up observability for applications and infrastructure with metrics, logs, traces, and alerts.

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/devops/monitoring-zhanbei1" ~/.claude/skills/diegosouzapw-awesome-omni-skill-monitoring-6ad765 && rm -rf "$T"
manifest: skills/devops/monitoring-zhanbei1/SKILL.md

Complexity Levels

| Level | Tools | Setup Time | Best For |
| --- | --- | --- | --- |
| Minimal | UptimeRobot, Healthchecks.io | 15 min | Side projects, MVPs |
| Standard | Uptime Kuma, Sentry, basic Grafana | 1-2 hours | Small teams, startups |
| Professional | Prometheus, Grafana, Loki, Alertmanager | 1-2 days | Production systems |
| Enterprise | Datadog, New Relic, or full OSS stack | Ongoing | Large-scale operations |

The Three Pillars

| Pillar | What It Answers | Tools |
| --- | --- | --- |
| Metrics | "How is the system performing?" | Prometheus, Grafana, Datadog |
| Logs | "What happened?" | Loki, ELK, CloudWatch |
| Traces | "Why is this request slow?" | Jaeger, Tempo, Sentry |

Quick Start by Use Case

"I just want to know if it's down" → UptimeRobot (free) or Uptime Kuma (self-hosted). See simple.md.

"I need to debug production errors" → Sentry with your framework SDK. Setup takes about 5 minutes. See apm.md.

"I want real observability" → Prometheus + Grafana + Loki. See prometheus.md.

"I need to centralize logs" → Loki for simple setups, ELK for complex queries. See logs.md.
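
For the "is it down" case, the core of an external check is small enough to sketch in Python. This is only an illustration of the idea, not how UptimeRobot or Uptime Kuma work internally; the function name `check_url`, the 5-second default timeout, and the result fields are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request


def check_url(url: str, timeout: float = 5.0) -> dict:
    """Probe a URL once and report whether it answered successfully."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            up = 200 <= status < 400
            error = None
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        status, up, error = exc.code, False, str(exc)
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, TLS error, and so on.
        status, up, error = None, False, str(exc)
    return {
        "url": url,
        "up": up,
        "status": status,
        "latency_s": round(time.monotonic() - started, 3),
        "error": error,
    }
```

Calling `check_url("https://example.com/health")` (a placeholder URL) returns a dictionary with `up`, `status`, and `latency_s`. A probe like this is only useful when it runs from outside the infrastructure it checks.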

What to Monitor

Applications (RED Method)

  • Rate — requests per second
  • Errors — error rate by endpoint
  • Duration — latency (p50, p95, p99)
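
As a rough illustration, the three RED signals can be computed from a window of request records. The `Request` record, the `red_summary` function, the choice to count only 5xx statuses as errors, and the nearest-rank percentile method are all assumptions for this sketch; a metrics system such as Prometheus would normally compute these from counters and histograms instead.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    endpoint: str
    status: int          # HTTP status code
    duration_ms: float


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests: list[Request], window_s: float) -> dict:
    """Summarize Rate, Errors, and Duration for one window of requests."""
    durations = [r.duration_ms for r in requests]
    error_rate_by_endpoint: dict[str, float] = {}
    for endpoint in sorted({r.endpoint for r in requests}):
        hits = [r for r in requests if r.endpoint == endpoint]
        failed = sum(1 for r in hits if r.status >= 500)  # assumption: 5xx means error
        error_rate_by_endpoint[endpoint] = failed / len(hits)
    return {
        "rate_rps": len(requests) / window_s,
        "error_rate_by_endpoint": error_rate_by_endpoint,
        "latency_ms": (
            {p: percentile(durations, p) for p in (50, 95, 99)} if durations else {}
        ),
    }
```

Percentiles are reported rather than an average because a handful of very slow requests can hide behind a healthy-looking mean.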

Infrastructure (USE Method)

  • Utilization — CPU, memory, disk usage
  • Saturation — queue depth, load average
  • Errors — hardware/system errors
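
A few USE-style signals can be read with the Python standard library alone, which is enough to show the idea. This sketch assumes a Unix-like host (`os.getloadavg` is not available on Windows); the function name `use_snapshot` and the returned field names are made up for illustration. CPU and memory utilization are left out because they need platform-specific sources or a third-party library.

```python
import os
import shutil


def use_snapshot(path: str = "/") -> dict:
    """Collect disk utilization and load-based saturation for one host."""
    cores = os.cpu_count() or 1
    load_1m, load_5m, _ = os.getloadavg()
    disk = shutil.disk_usage(path)
    return {
        # Utilization: how full the filesystem holding `path` is.
        "disk_used_pct": round(100 * disk.used / disk.total, 1),
        # Saturation: load average per core; sustained values above 1.0
        # suggest work is queuing for CPU.
        "load_per_core_1m": round(load_1m / cores, 2),
        "load_per_core_5m": round(load_5m / cores, 2),
    }
```

In a real setup an agent such as the Prometheus node exporter collects these values; the snippet only shows what "utilization" and "saturation" mean in concrete numbers.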

Alerting Principles

| Do | Don't |
| --- | --- |
| Alert on symptoms (user impact) | Alert on causes (CPU high) |
| Include runbook link | Require investigation to understand |
| Set appropriate severity | Make everything P1 |
| Require action | Alert on "interesting" metrics |

Alert fatigue kills monitoring. If alerts are ignored, you have no monitoring.
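
One way to keep these principles from eroding is to lint alert definitions before they ship. The sketch below is hypothetical: `AlertRule` and its fields are not the schema of Alertmanager or any other tool, the P1-P3 scale is assumed (the text above only names P1), and the 25% threshold for "too many P1s" is arbitrary.

```python
from dataclasses import dataclass

# Assumed severity scale; the text above only mentions P1.
SEVERITIES = ("P1", "P2", "P3")


@dataclass
class AlertRule:
    name: str
    symptom: str           # user-facing condition, e.g. "checkout 5xx rate > 2% for 5m"
    severity: str
    runbook_url: str = ""
    action: str = ""       # what the responder is expected to do


def lint_alert(rule: AlertRule) -> list[str]:
    """Return the alerting principles that this rule violates."""
    problems = []
    if not rule.runbook_url:
        problems.append("missing runbook link")
    if rule.severity not in SEVERITIES:
        problems.append(f"unknown severity: {rule.severity}")
    if not rule.action:
        problems.append("no required action")
    return problems


def too_many_p1(rules: list[AlertRule], max_share: float = 0.25) -> bool:
    """Flag rule sets where more than max_share of the alerts are P1."""
    if not rules:
        return False
    p1_count = sum(1 for rule in rules if rule.severity == "P1")
    return p1_count / len(rules) > max_share
```

A rule that fails `lint_alert` with "no required action" is usually better expressed as a dashboard panel than as an alert.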

For alert configuration, severities, and on-call setup, see alerting.md.

Cost Comparison

| Solution | Monthly Cost (small) | Monthly Cost (medium) |
| --- | --- | --- |
| UptimeRobot | Free | $7 |
| Uptime Kuma | $5 (VPS) | $5 (VPS) |
| Sentry | Free / $26 | $80 |
| Grafana Cloud | Free tier | $50+ |
| Datadog | $15/host | $23/host + features |
| Self-hosted stack | $10-20 (VPS) | $50-100 (VPS) |

Common Mistakes

  • Starting with Prometheus/Grafana when Uptime Kuma would suffice
  • No alerting (dashboards nobody watches)
  • Too many alerts (alert fatigue → ignored)
  • Missing runbooks (alert fires, nobody knows what to do)
  • Not monitoring from outside (only internal checks)
  • Storing logs forever (cost explodes)
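
The last mistake in the list, storing logs forever, is usually fixed with a retention setting in the log backend. As a minimal sketch of the same idea for plain log files on disk, the snippet below lists files older than a cutoff. The 30-day default, the `*.log` pattern, and the function names are assumptions for illustration; the code only reports candidates and deletes nothing.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def is_expired(modified_at: float, now: float, max_age_days: int) -> bool:
    """True when a file last modified at `modified_at` is older than the retention window."""
    return now - modified_at > max_age_days * SECONDS_PER_DAY


def expired_logs(directory: str, max_age_days: int = 30) -> list[Path]:
    """List *.log files in `directory` whose modification time is past retention."""
    now = time.time()
    return sorted(
        path
        for path in Path(directory).glob("*.log")
        if path.is_file() and is_expired(path.stat().st_mtime, now, max_age_days)
    )
```

Keeping the age check in its own function makes the retention rule easy to test without touching the filesystem.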