Agent-almanac configure-alerting-rules
git clone https://github.com/pjt222/agent-almanac
T=$(mktemp -d) && git clone --depth=1 https://github.com/pjt222/agent-almanac "$T" && mkdir -p ~/.claude/skills && cp -r "$T/i18n/caveman-ultra/skills/configure-alerting-rules" ~/.claude/skills/pjt222-agent-almanac-configure-alerting-rules-c74c09 && rm -rf "$T"
i18n/caveman-ultra/skills/configure-alerting-rules/SKILL.md
Configure Alerting Rules
Set up Prometheus alerting rules + Alertmanager → reliable, actionable incident notifications.
See Extended Examples for complete config files + templates.
Use When
- Impl proactive monitoring w/ automated incident detection
- Route alerts to correct teams by severity + service ownership
- Cut alert fatigue via intelligent grouping + dedup
- Integrate monitoring w/ on-call systems (PagerDuty, Opsgenie)
- Establish escalation policies for critical prod issues
- Migrate legacy monitoring → Prometheus-based alerting
- Create actionable alerts guiding responders to resolution
In
- Required: Prometheus metrics to alert on (error rates, latency, saturation)
- Required: On-call rotation + escalation policies
- Optional: Existing alert defs to migrate
- Optional: Notification channels (Slack, email, PagerDuty)
- Optional: Runbook docs for common alerts
Do
Step 1: Deploy Alertmanager
Install + configure Alertmanager to receive alerts from Prometheus.
Docker Compose deployment (basic structure):
```yaml
version: '3.8'
services:
  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    # ... (see EXAMPLES.md for complete configuration)
```
Basic Alertmanager config (`alertmanager.yml` excerpt):
```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
# ... (see EXAMPLES.md for complete routing, inhibition rules, and receivers)
```
Configure Prometheus to use Alertmanager (`prometheus.yml`):
```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      timeout: 10s
      api_version: v2
```
→ Alertmanager UI accessible at http://localhost:9093, Prometheus "Status > Alertmanagers" shows UP.
If err:
- Check Alertmanager logs: `docker logs alertmanager`
- Verify Prometheus can reach Alertmanager: `curl http://alertmanager:9093/api/v2/status`
- Test webhook URLs: `curl -X POST <SLACK_WEBHOOK_URL> -d '{"text":"test"}'`
- Valid. YAML syntax: `amtool check-config alertmanager.yml`
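Optional smoke test: a minimal sketch assuming Alertmanager listens on localhost:9093; the SmokeTest/demo labels are made up.
```bash
# Fire a synthetic alert straight into Alertmanager (bypasses Prometheus)
amtool alert add alertname=SmokeTest severity=warning service=demo \
  --alertmanager.url=http://localhost:9093

# Confirm Alertmanager accepted it and check where it shows up (UI, Slack, etc.)
amtool alert query --alertmanager.url=http://localhost:9093
```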
Step 2: Define Alerting Rules in Prometheus
Create alerting rules that fire when conditions met.
Create alerting rules file (`/etc/prometheus/rules/alerts.yml` excerpt):
```yaml
groups:
  - name: instance_alerts
    interval: 30s
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for >5min."
          runbook_url: "https://wiki.example.com/runbooks/instance-down"
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
      # ... (see EXAMPLES.md for complete alerts)
```
Alert design best practices:
- `for` duration: Prevents flapping alerts. Use 5-10 min for most alerts.
- Descriptive annotations: Include current value, affected resource, runbook link (sketch after this list).
- Severity levels: critical (pages on-call), warning (investigate), info (FYI)
- Team labels: Enable routing to correct team/channel
- Runbook links: Every alert should have runbook URL
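Sketch of the "include current value" practice, extending the HighCPUUsage annotations above (`$value` + `humanize` are standard Prometheus alert template functions; runbook URL is a placeholder):
```yaml
annotations:
  summary: "High CPU on {{ $labels.instance }}: {{ $value | humanize }}% (threshold 80%)"
  runbook_url: "https://wiki.example.com/runbooks/high-cpu"
```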
Load rules into Prometheus:
```yaml
# prometheus.yml
rule_files:
  - "rules/*.yml"
```
Valid. + reload:
```bash
promtool check rules /etc/prometheus/rules/alerts.yml
curl -X POST http://localhost:9090/-/reload
```
→ Alerts visible in Prometheus "Alerts" page, alerts fire when thresholds exceeded, Alertmanager receives fired alerts.
If err:
- Check Prometheus logs for rule eval errors
- Valid. rule syntax w/ `promtool check rules`
- Test alert queries independently in Prometheus UI
- Inspect alert state transitions: Inactive → Pending → Firing
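Rules can also be unit-tested before deploy. Minimal `promtool test rules` sketch for the InstanceDown rule above (file name `tests.yml` is arbitrary):
```yaml
# tests.yml
rule_files:
  - /etc/prometheus/rules/alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # instance stays down (up == 0) for 10 minutes
      - series: 'up{job="node", instance="app-server-1"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 6m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: infrastructure
              job: node
              instance: app-server-1
            exp_annotations:
              summary: "Instance app-server-1 is down"
              description: "app-server-1 has been down for >5min."
              runbook_url: "https://wiki.example.com/runbooks/instance-down"
```
Run w/ `promtool test rules tests.yml`.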
Step 3: Create Notification Templates
Design readable, actionable notification msgs.
Create template file (`/etc/alertmanager/templates/default.tmpl` excerpt):
```
{{ define "slack.default.title" }}
[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}
{{ end }}

{{ define "slack.default.text" }}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity }}
*Summary:* {{ .Annotations.summary }}
{{ if .Annotations.runbook_url }}*Runbook:* {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
{{ end }}
# ... (see EXAMPLES.md for complete email and PagerDuty templates)
```
Use templates in receivers:
```yaml
receivers:
  - name: 'slack-custom'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'
```
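Alertmanager only loads custom templates that `alertmanager.yml` points at. Minimal sketch, assuming the template path used above:
```yaml
# alertmanager.yml
templates:
  - '/etc/alertmanager/templates/*.tmpl'
```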
→ Notifications formatted consistent, include all relevant context, actionable w/ runbook links.
If err:
- Test template rendering: `amtool template test --config.file=alertmanager.yml`
- Check template syntax errs in Alertmanager logs
- Use `{{ . | json }}` to debug template data structure
Step 4: Configure Routing + Grouping
Optimize alert delivery w/ intelligent routing rules.
Advanced routing config (excerpt):
```yaml
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  routes:
    - match:
        team: platform
      receiver: 'team-platform'
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-platform'
          group_wait: 10s
          repeat_interval: 15m
          continue: true  # Also send to Slack
# ... (see EXAMPLES.md for complete routing with time intervals)
```
Grouping strategies:
```yaml
# Group by alertname: All HighCPU alerts bundled together
group_by: ['alertname']

# Group by alertname AND cluster: Separate notifications per cluster
group_by: ['alertname', 'cluster']
```
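Time-based routing (full version in EXAMPLES.md): hedged sketch that only delivers warning-severity notifications during business hours. Interval name + receiver are made up; needs Alertmanager >= 0.24.
```yaml
# alertmanager.yml excerpt: merge into the existing route tree
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
route:
  routes:
    - match:
        severity: warning
      receiver: 'team-platform'
      active_time_intervals: ['business-hours']
```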
→ Alerts routed to correct teams, grouped logically, timing appropriate for severity.
If err:
- Test routing: `amtool config routes test --config.file=alertmanager.yml --alertname=HighCPU --label=severity=critical`
- Check routing tree: `amtool config routes show --config.file=alertmanager.yml`
- Valid. `continue: true` if alert should match multi routes
Step 5: Implement Inhibition + Silencing
Cut alert noise w/ inhibition rules + temporary silences.
Inhibition rules (suppress dependent alerts):
```yaml
inhibit_rules:
  # Cluster down suppresses all node alerts in that cluster
  - source_match:
      alertname: 'ClusterDown'
      severity: 'critical'
    target_match_re:
      alertname: '(InstanceDown|HighCPU|HighMemory)'
    equal: ['cluster']
  # Service down suppresses latency and error alerts
  - source_match:
      alertname: 'ServiceDown'
    target_match_re:
      alertname: '(HighLatency|HighErrorRate)'
    equal: ['service', 'namespace']
  # ... (see EXAMPLES.md for more inhibition patterns)
```
Create silences programmatically:
```bash
# Silence during maintenance
amtool silence add \
  instance=app-server-1 \
  --author="ops-team" \
  --comment="Scheduled maintenance" \
  --duration=2h

# List and manage silences
amtool silence query
amtool silence expire <SILENCE_ID>
```
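Silences can also come from automation (deploy pipelines, maintenance jobs) via the HTTP API. Sketch, assuming Alertmanager at localhost:9093 + GNU `date` for the timestamps:
```bash
curl -X POST http://localhost:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d "{
    \"matchers\": [{\"name\": \"instance\", \"value\": \"app-server-1\", \"isRegex\": false}],
    \"startsAt\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",
    \"endsAt\": \"$(date -u -d '+2 hours' +%Y-%m-%dT%H:%M:%SZ)\",
    \"createdBy\": \"deploy-pipeline\",
    \"comment\": \"Scheduled maintenance\"
  }"
```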
→ Inhibition cuts cascade alerts auto, silences prevent notifications during planned maintenance.
If err:
- Test inhibition logic w/ live alerts
- Check Alertmanager UI "Silences" tab
- Valid. silence matchers exact (labels must match perfectly)
Step 6: Integrate with External Systems
Connect Alertmanager to PagerDuty, Opsgenie, Jira, etc.
PagerDuty integration (excerpt):
```yaml
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_INTEGRATION_KEY'
        severity: '{{ .CommonLabels.severity }}'
        description: '{{ range .Alerts.Firing }}{{ .Annotations.summary }}{{ end }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          alertname: '{{ .GroupLabels.alertname }}'
# ... (see EXAMPLES.md for complete integration examples)
```
Webhook for custom integrations:
```yaml
receivers:
  - name: 'webhook-custom'
    webhook_configs:
      - url: 'https://your-webhook-endpoint.com/alerts'
        send_resolved: true
```
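To build/test the receiving side, it helps to know roughly what Alertmanager POSTs. Hedged, abridged sketch of the webhook payload (field values illustrative; endpoint URL is the placeholder from above):
```bash
curl -X POST https://your-webhook-endpoint.com/alerts \
  -H 'Content-Type: application/json' \
  -d '{
    "version": "4",
    "status": "firing",
    "receiver": "webhook-custom",
    "groupLabels": {"alertname": "InstanceDown"},
    "commonLabels": {"alertname": "InstanceDown", "severity": "critical"},
    "alerts": [
      {
        "status": "firing",
        "labels": {"alertname": "InstanceDown", "instance": "app-server-1"},
        "annotations": {"summary": "Instance app-server-1 is down"},
        "startsAt": "2024-01-01T00:00:00Z"
      }
    ]
  }'
```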
→ Alerts create incidents in PagerDuty, appear in team comms channels, trigger on-call escalations.
If err:
- Valid. API keys/tokens valid
- Check network connectivity to external services
- Test webhook endpoints independently w/ curl
- Enable debug mode: `--log.level=debug`
Check
- Alertmanager receives alerts from Prometheus successfully
- Alerts routed to correct teams by labels + severity
- Notifications delivered to Slack, email, or PagerDuty
- Alert grouping cuts notification volume appropriately
- Inhibition rules suppress dependent alerts correct
- Silences prevent notifications during maintenance windows
- Notification templates include runbook links + context
- Repeat interval prevents alert fatigue for long-running issues
- Resolved notifications sent when alerts clear
- External integrations (PagerDuty, Opsgenie) create incidents
Traps
- Alert fatigue: Too many low-pri alerts → responders ignore critical ones. Set strict thresholds, use inhibition.
- Missing `for` duration: Alerts w/o `for` fire on transient spikes. Always use 5-10 min windows.
- Overly broad grouping: Grouping by `['...']` sends individual notifications. Use specific label grouping.
- No runbook links: Alerts w/o runbooks leave responders guessing. Every alert needs runbook URL.
- Incorrect severity: Mislabeling warnings as critical desensitizes team. Reserve critical for emergencies.
- Forgotten silences: Silences w/o expiration can hide real issues. Always set end times.
- Single route: All alerts to one channel loses context. Use team-specific routing.
- No inhibition: Cascade alerts during outages create noise. Impl inhibition rules.
→
- `setup-prometheus-monitoring`: Define metrics + recording rules feeding alerting rules
- `define-slo-sli-sla`: Generate SLO burn rate alerts for error budget mgmt
- `write-incident-runbook`: Create runbooks linked from alert annotations
- `build-grafana-dashboards`: Visualize alert firing history + silence patterns