Agent-almanac configure-alerting-rules
```bash
git clone https://github.com/pjt222/agent-almanac
```

```bash
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/pjt222/agent-almanac "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/i18n/caveman/skills/configure-alerting-rules" \
       ~/.claude/skills/pjt222-agent-almanac-configure-alerting-rules-2e2b05 \
  && rm -rf "$T"
```
i18n/caveman/skills/configure-alerting-rules/SKILL.md

Configure Alerting Rules
Set up Prometheus alerting rules and Alertmanager for reliable, actionable incident notifications.
See Extended Examples for complete configuration files and templates.
When Use
- Implementing proactive monitoring with automated incident detection
- Routing alerts to appropriate teams based on severity, service ownership
- Reducing alert fatigue through intelligent grouping, deduplication
- Integrating monitoring with on-call systems (PagerDuty, Opsgenie)
- Establishing escalation policies for critical production issues
- Migrating from legacy monitoring systems to Prometheus-based alerting
- Creating actionable alerts that guide responders to resolution
Inputs
- Required: Prometheus metrics to alert on (error rates, latency, saturation)
- Required: On-call rotation, escalation policies
- Optional: Existing alert definitions to migrate
- Optional: Notification channels (Slack, email, PagerDuty)
- Optional: Runbook documentation for common alerts
Steps
Step 1: Deploy Alertmanager
Install, configure Alertmanager to receive alerts from Prometheus.
Docker Compose deployment (basic structure):
```yaml
version: '3.8'
services:
  alertmanager:
    image: prom/alertmanager:v0.26.0
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
# ... (see EXAMPLES.md for complete configuration)
```
Basic Alertmanager configuration (alertmanager.yml excerpt):

```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
# ... (see EXAMPLES.md for complete routing, inhibition rules, and receivers)
```
Configure Prometheus to use Alertmanager (prometheus.yml):

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
      timeout: 10s
      api_version: v2
```
Got: Alertmanager UI accessible at http://localhost:9093, Prometheus "Status > Alertmanagers" shows UP status.
If fail:
- Check Alertmanager logs: `docker logs alertmanager`
- Verify Prometheus can reach Alertmanager: `curl http://alertmanager:9093/api/v2/status`
- Test webhook URLs: `curl -X POST <SLACK_WEBHOOK_URL> -d '{"text":"test"}'`
- Validate YAML syntax: `amtool check-config alertmanager.yml`
Step 2: Define Alerting Rules in Prometheus
Create alerting rules that fire when conditions met.
Create alerting rules file (/etc/prometheus/rules/alerts.yml excerpt):

```yaml
groups:
  - name: instance_alerts
    interval: 30s
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for >5min."
          runbook_url: "https://wiki.example.com/runbooks/instance-down"
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
# ... (see EXAMPLES.md for complete alerts)
```
Alert design best practices:
- `for` duration: Prevents flapping alerts. Use 5-10 minutes for most alerts.
- Descriptive annotations: Include current value, affected resource, runbook link.
- Severity levels: critical (pages on-call), warning (investigate), info (FYI)
- Team labels: Enable routing to correct team/channel
- Runbook links: Every alert should have runbook URL
Load rules into Prometheus:
```yaml
# prometheus.yml
rule_files:
  - "rules/*.yml"
```
Validate, reload:
```bash
promtool check rules /etc/prometheus/rules/alerts.yml
curl -X POST http://localhost:9090/-/reload
```
Got: Alerts visible in Prometheus "Alerts" page, alerts fire when thresholds exceeded, Alertmanager receives fired alerts.
If fail:
- Check Prometheus logs for rule evaluation errors
- Verify rule syntax with `promtool check rules`
- Test alert queries independently in Prometheus UI
- Inspect alert state transitions: Inactive → Pending → Firing
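Alert rules can also be unit-tested offline with `promtool test rules`, which replays synthetic series against the rule file. A minimal sketch against the InstanceDown rule above (the test file name and rule path are assumptions):

```yaml
# alerts_test.yml -- run with: promtool test rules alerts_test.yml
rule_files:
  - /etc/prometheus/rules/alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Instance reports down (up == 0) for the whole test window
      - series: 'up{instance="app-1", job="node"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 6m          # past the 5m "for" duration
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: infrastructure
              instance: app-1
              job: node
            exp_annotations:
              summary: "Instance app-1 is down"
              description: "app-1 has been down for >5min."
              runbook_url: "https://wiki.example.com/runbooks/instance-down"
```

This catches broken expressions and label mismatches before rules reach production.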
Step 3: Create Notification Templates
Design readable, actionable notification messages.
Create template file (/etc/alertmanager/templates/default.tmpl excerpt):

```
{{ define "slack.default.title" }}
[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}
{{ end }}

{{ define "slack.default.text" }}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity }}
*Summary:* {{ .Annotations.summary }}
{{ if .Annotations.runbook_url }}*Runbook:* {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
{{ end }}
# ... (see EXAMPLES.md for complete email and PagerDuty templates)
```
Use templates in receivers:
```yaml
receivers:
  - name: 'slack-custom'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.default.title" . }}'
        text: '{{ template "slack.default.text" . }}'
```
Got: Notifications formatted consistently, include all relevant context, actionable with runbook links.
If fail:
- Test template rendering: `amtool template test --config.file=alertmanager.yml`
- Check template syntax errors in Alertmanager logs
- Use `{{ . | json }}` to debug template data structure
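To see rendered notifications end to end, one option is to push a synthetic alert through the Alertmanager v2 API and watch what the receivers deliver. A sketch assuming Alertmanager on localhost:9093 (the alert name and labels are placeholders):

```shell
# Push a synthetic firing alert through the Alertmanager v2 API so the
# configured receivers render the templates against real data.
cat > /tmp/test-alert.json <<'EOF'
[
  {
    "labels": {
      "alertname": "TemplateTest",
      "severity": "warning",
      "service": "demo"
    },
    "annotations": {
      "summary": "Synthetic alert for template testing",
      "runbook_url": "https://wiki.example.com/runbooks/template-test"
    }
  }
]
EOF
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  --data @/tmp/test-alert.json || echo "Alertmanager not reachable"
```

Because the alert carries a `runbook_url` annotation, the Slack template above should render the runbook line; a synthetic alert without it exercises the `{{ if }}` branch.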
Step 4: Configure Routing and Grouping
Optimize alert delivery with intelligent routing rules.
Advanced routing configuration (excerpt):
```yaml
route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  routes:
    - match:
        team: platform
      receiver: 'team-platform'
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-platform'
          group_wait: 10s
          repeat_interval: 15m
          continue: true  # Also send to Slack
# ... (see EXAMPLES.md for complete routing with time intervals)
```
Grouping strategies:
```yaml
# Group by alertname: All HighCPU alerts bundled together
group_by: ['alertname']

# Group by alertname AND cluster: Separate notifications per cluster
group_by: ['alertname', 'cluster']
```
Got: Alerts routed to correct teams, grouped logically, timing appropriate for severity.
If fail:
- Test routing: `amtool config routes test --config.file=alertmanager.yml --alertname=HighCPU --label=severity=critical`
- Check routing tree: `amtool config routes show --config.file=alertmanager.yml`
- Verify `continue: true` if alert should match multiple routes
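Routes can also be muted on a recurring schedule instead of via ad-hoc silences. A sketch of the time-interval feature (interval name is a placeholder; the `time_intervals` key assumes a recent Alertmanager, roughly 0.24+):

```yaml
# alertmanager.yml (excerpt): mute a route during a weekly maintenance window
time_intervals:
  - name: weekend-maintenance
    time_intervals:
      - weekdays: ['saturday']
        times:
          - start_time: '02:00'
            end_time: '06:00'

route:
  receiver: 'default-receiver'
  routes:
    - match:
        team: platform
      receiver: 'team-platform'
      mute_time_intervals: ['weekend-maintenance']
```

Alerts still fire and stay visible in the UI during the window; only notifications for the muted route are suppressed.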
Step 5: Implement Inhibition and Silencing
Reduce alert noise with inhibition rules, temporary silences.
Inhibition rules (suppress dependent alerts):
```yaml
inhibit_rules:
  # Cluster down suppresses all node alerts in that cluster
  - source_match:
      alertname: 'ClusterDown'
      severity: 'critical'
    target_match_re:
      alertname: '(InstanceDown|HighCPU|HighMemory)'
    equal: ['cluster']
  # Service down suppresses latency and error alerts
  - source_match:
      alertname: 'ServiceDown'
    target_match_re:
      alertname: '(HighLatency|HighErrorRate)'
    equal: ['service', 'namespace']
# ... (see EXAMPLES.md for more inhibition patterns)
```
Create silences programmatically:
```bash
# Silence during maintenance
amtool silence add \
  instance=app-server-1 \
  --author="ops-team" \
  --comment="Scheduled maintenance" \
  --duration=2h

# List and manage silences
amtool silence query
amtool silence expire <SILENCE_ID>
```
Got: Inhibition reduces cascade alerts automatically, silences prevent notifications during planned maintenance.
If fail:
- Test inhibition logic with live alerts
- Check Alertmanager UI "Silences" tab
- Verify silence matchers are exact (labels must match perfectly)
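Where amtool is unavailable (e.g. from a CI job), silences can be created through the v2 API directly. A sketch assuming GNU `date` and Alertmanager on localhost:9093:

```shell
# Create a 2-hour silence via the Alertmanager v2 API (amtool alternative).
# Assumes GNU date for the relative-time flag.
START=$(date -u +%Y-%m-%dT%H:%M:%SZ)
END=$(date -u -d '+2 hours' +%Y-%m-%dT%H:%M:%SZ)
cat > /tmp/silence.json <<EOF
{
  "matchers": [
    {"name": "instance", "value": "app-server-1", "isRegex": false}
  ],
  "startsAt": "$START",
  "endsAt": "$END",
  "createdBy": "ops-team",
  "comment": "Scheduled maintenance"
}
EOF
curl -s -X POST http://localhost:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  --data @/tmp/silence.json || echo "Alertmanager not reachable"
```

The explicit `endsAt` keeps the silence self-expiring, which guards against the forgotten-silence pitfall below.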
Step 6: Integrate with External Systems
Connect Alertmanager to PagerDuty, Opsgenie, Jira, etc.
PagerDuty integration (excerpt):
```yaml
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_INTEGRATION_KEY'
        severity: '{{ .CommonLabels.severity }}'
        description: '{{ range .Alerts.Firing }}{{ .Annotations.summary }}{{ end }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          alertname: '{{ .GroupLabels.alertname }}'
# ... (see EXAMPLES.md for complete integration examples)
```
Webhook for custom integrations:
```yaml
receivers:
  - name: 'webhook-custom'
    webhook_configs:
      - url: 'https://your-webhook-endpoint.com/alerts'
        send_resolved: true
```
Got: Alerts create incidents in PagerDuty, appear in team communication channels, trigger on-call escalations.
If fail:
- Verify API keys/tokens are valid
- Check network connectivity to external services
- Test webhook endpoints independently with curl
- Enable debug mode: `--log.level=debug`
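Before wiring Alertmanager to a custom endpoint, the delivery can be exercised by hand. The sketch below posts a trimmed copy of Alertmanager's version-4 webhook payload to the placeholder URL from the receiver config (fields shown are a subset of the real payload):

```shell
# Simulate an Alertmanager webhook delivery (trimmed version-4 payload).
# The target URL is the placeholder from the webhook receiver above.
cat > /tmp/webhook-payload.json <<'EOF'
{
  "version": "4",
  "status": "firing",
  "receiver": "webhook-custom",
  "groupLabels": {"alertname": "InstanceDown"},
  "commonLabels": {"alertname": "InstanceDown", "severity": "critical"},
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "InstanceDown", "instance": "app-1"},
      "annotations": {"summary": "Instance app-1 is down"}
    }
  ]
}
EOF
curl -s -X POST https://your-webhook-endpoint.com/alerts \
  -H 'Content-Type: application/json' \
  --data @/tmp/webhook-payload.json || echo "endpoint not reachable"
```

If the endpoint accepts this shape, it should also handle real deliveries, including `"status": "resolved"` payloads when `send_resolved: true` is set.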
Checks
- Alertmanager receives alerts from Prometheus successfully
- Alerts routed to correct teams based on labels and severity
- Notifications delivered to Slack, email, or PagerDuty
- Alert grouping reduces notification volume appropriately
- Inhibition rules suppress dependent alerts correctly
- Silences prevent notifications during maintenance windows
- Notification templates include runbook links and context
- Repeat interval prevents alert fatigue for long-running issues
- Resolved notifications sent when alerts clear
- External integrations (PagerDuty, Opsgenie) create incidents
Pitfalls
- Alert fatigue: Too many low-priority alerts cause responders to ignore critical ones. Set strict thresholds, use inhibition.
- Missing `for` duration: Alerts without `for` fire on transient spikes. Always use 5-10 minute windows.
- Overly broad grouping: Grouping by `['...']` sends individual notifications. Use specific label grouping.
- No runbook links: Alerts without runbooks leave responders guessing. Every alert needs runbook URL.
- Incorrect severity: Mislabeling warnings as critical desensitizes team. Reserve critical for emergencies.
- Forgotten silences: Silences without expiration can hide real issues. Always set end times.
- Single route: All alerts to one channel loses context. Use team-specific routing.
- No inhibition: Cascade alerts during outages create noise. Implement inhibition rules.
See Also
- setup-prometheus-monitoring: Define metrics and recording rules that feed alerting rules
- define-slo-sli-sla: Generate SLO burn rate alerts for error budget management
- write-incident-runbook: Create runbooks linked from alert annotations
- build-grafana-dashboards: Visualize alert firing history and silence patterns