Awesome-omni-skill Monitoring

Set up observability for applications and infrastructure with metrics, logs, traces, and alerts.

install

source · Clone the upstream repo

git clone https://github.com/diegosouzapw/awesome-omni-skill

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/devops/monitoring-zhanbei1" ~/.claude/skills/diegosouzapw-awesome-omni-skill-monitoring-6ad765 && rm -rf "$T"

manifest: skills/devops/monitoring-zhanbei1/SKILL.md

source content

Complexity Levels

Level	Tools	Setup Time	Best For
Minimal	UptimeRobot, Healthchecks.io	15 min	Side projects, MVPs
Standard	Uptime Kuma, Sentry, basic Grafana	1-2 hours	Small teams, startups
Professional	Prometheus, Grafana, Loki, Alertmanager	1-2 days	Production systems
Enterprise	Datadog, New Relic, or full OSS stack	Ongoing	Large-scale operations

The Three Pillars

Pillar	What It Answers	Tools
Metrics	"How is the system performing?"	Prometheus, Grafana, Datadog
Logs	"What happened?"	Loki, ELK, CloudWatch
Traces	"Why is this request slow?"	Jaeger, Tempo, Sentry

Quick Start by Use Case

"I just want to know if it's down" → UptimeRobot (free) or Uptime Kuma (self-hosted). See simple.md.

"I need to debug production errors" → Sentry with your framework SDK. 5-minute setup. See apm.md.

"I want real observability" → Prometheus + Grafana + Loki. See prometheus.md.

"I need to centralize logs" → Loki for simple, ELK for complex queries. See logs.md.
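For the "is it down" case, the check a hosted uptime service performs can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not how UptimeRobot or Uptime Kuma are implemented; the names `CheckResult` and `check_url` are hypothetical, and treating any 2xx/3xx status as "up" is an assumption.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    url: str
    up: bool
    status: Optional[int]  # HTTP status code, or None if no response arrived
    latency_s: float       # wall-clock time spent on the request


def check_url(url: str, timeout: float = 5.0) -> CheckResult:
    """Probe `url` once and report whether it answered with a 2xx/3xx status."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        status = exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        status = None
    latency = time.monotonic() - start
    up = status is not None and 200 <= status < 400
    return CheckResult(url, up, status, latency)
```

Run it from a machine outside your own infrastructure (a cron job on a separate VPS, for example); a check that runs on the same host as the service goes down together with it.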

What to Monitor

Applications (RED Method)

Rate — requests per second
Errors — error rate by endpoint
Duration — latency (p50, p95, p99)
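To make the three RED signals concrete, here is a small sketch that derives them from a list of request records for one time window. In practice a metrics library or Prometheus computes these for you; the `Request` record, the `red_summary` function, the nearest-rank percentile method, and counting only 5xx responses as errors are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    endpoint: str
    status: int         # HTTP status code
    duration_ms: float


def percentile(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of an already-sorted, non-empty sequence."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests: Sequence[Request], window_s: float) -> dict:
    """Summarize Rate, Errors, and Duration for one observation window."""
    if not requests:
        return {"rate_rps": 0.0, "error_rate": 0.0, "errors_by_endpoint": {},
                "p50_ms": None, "p95_ms": None, "p99_ms": None}

    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in requests:
        totals[r.endpoint] += 1
        if r.status >= 500:  # assumption: only server errors count
            errors[r.endpoint] += 1

    durations = sorted(r.duration_ms for r in requests)
    return {
        "rate_rps": len(requests) / window_s,
        "error_rate": sum(errors.values()) / len(requests),
        "errors_by_endpoint": {ep: errors[ep] / totals[ep] for ep in totals},
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

Percentiles are reported instead of an average because a handful of very slow requests can hide behind a healthy-looking mean.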

Infrastructure (USE Method)

Utilization — CPU, memory, disk usage
Saturation — queue depth, load average
Errors — hardware/system errors
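As a rough illustration of utilization and saturation, the sketch below reads disk usage and the 1-minute load average from the local host using the Python standard library. It assumes a Unix-like system (`os.getloadavg` is not available on Windows), does not cover hardware or system errors, and the function names and the 85% / 1.0 limits are hypothetical defaults rather than recommendations from the source.

```python
import os
import shutil


def use_snapshot(path: str = "/") -> dict:
    """Collect two USE-style readings from the local host."""
    disk = shutil.disk_usage(path)
    load_1m, _, _ = os.getloadavg()  # Unix-only
    cpus = os.cpu_count() or 1
    return {
        "disk_utilization": disk.used / disk.total,  # fraction of the filesystem in use
        "load_per_cpu": load_1m / cpus,              # above 1.0 suggests work is queuing
    }


def findings(snapshot: dict, disk_limit: float = 0.85, load_limit: float = 1.0) -> list:
    """Return human-readable findings for readings that exceed their limit."""
    result = []
    if snapshot["disk_utilization"] > disk_limit:
        result.append(f"disk utilization above {disk_limit:.0%}")
    if snapshot["load_per_cpu"] > load_limit:
        result.append(f"load per CPU above {load_limit}")
    return result
```

A node exporter or monitoring agent collects the same kind of readings continuously; this only shows what the numbers mean.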

Alerting Principles

Do	Don't
Alert on symptoms (user impact)	Alert on causes (CPU high)
Include runbook link	Require investigation to understand
Set appropriate severity	Make everything P1
Require action	Alert on "interesting" metrics

Alert fatigue kills monitoring. If alerts are ignored, you have no monitoring.
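The principles in the table can be encoded so that a badly formed alert cannot be created in the first place. The following is a sketch, not a real alerting API: `AlertRule`, `Severity`, `evaluate`, and the example.com runbook URL are hypothetical. The rule fires on a symptom (user-facing error rate), refuses to exist without a runbook link, and carries an explicit severity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    P1 = "P1"  # page someone now
    P2 = "P2"  # handle within working hours
    P3 = "P3"  # review when convenient


@dataclass(frozen=True)
class AlertRule:
    name: str
    max_error_rate: float  # symptom threshold, as a fraction between 0.0 and 1.0
    severity: Severity
    runbook_url: str

    def __post_init__(self):
        # An alert without a runbook forces the responder to investigate from scratch.
        if not self.runbook_url:
            raise ValueError(f"alert {self.name!r} needs a runbook link")


def evaluate(rule: AlertRule, error_rate: float) -> Optional[dict]:
    """Return an alert payload when the symptom crosses the threshold, else None."""
    if error_rate <= rule.max_error_rate:
        return None
    return {
        "alert": rule.name,
        "severity": rule.severity.value,
        "summary": f"error rate {error_rate:.1%} exceeds {rule.max_error_rate:.1%}",
        "runbook": rule.runbook_url,
    }


# Hypothetical rule for a checkout endpoint.
checkout_rule = AlertRule(
    name="checkout-5xx",
    max_error_rate=0.02,
    severity=Severity.P2,
    runbook_url="https://example.com/runbooks/checkout-5xx",
)
```

Calling `evaluate(checkout_rule, 0.05)` returns a payload containing the runbook link; a rate at or below the threshold returns `None`, so nothing is sent.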

For alert configuration, severities, and on-call setup, see alerting.md.

Cost Comparison

Solution	Monthly Cost (small)	Monthly Cost (medium)
UptimeRobot	Free	$7
Uptime Kuma	$5 (VPS)	$5 (VPS)
Sentry	Free / $26	$80
Grafana Cloud	Free tier	$50+
Datadog	$15/host	$23/host + features
Self-hosted stack	$10-20 (VPS)	$50-100 (VPS)

Common Mistakes

Starting with Prometheus/Grafana when Uptime Kuma would suffice
No alerting (dashboards nobody watches)
Too many alerts (alert fatigue → ignored)
Missing runbooks (alert fires, nobody knows what to do)
Not monitoring from outside (only internal checks)
Storing logs forever (cost explodes)
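For the last mistake, a retention policy can be as simple as deleting files past a cutoff age. The sketch below assumes logs are plain `*.log` files in a single directory; `prune_old_logs` is a hypothetical helper, and tools such as Loki, Elasticsearch, and logrotate have their own retention settings that should be preferred when available.

```python
import time
from pathlib import Path
from typing import List, Optional


def prune_old_logs(log_dir: str, max_age_days: int,
                   now: Optional[float] = None, dry_run: bool = True) -> List[Path]:
    """Find *.log files last modified more than `max_age_days` ago.

    With dry_run=True (the default) nothing is deleted; the matching
    paths are only returned so they can be reviewed first.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    expired = []
    for path in sorted(Path(log_dir).glob("*.log")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            expired.append(path)
    return expired
```

Defaulting to a dry run is a deliberate choice here: a retention job with a wrong path or age is hard to undo.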