claude-skill-registry · alerting-incident-management

🚨 Skill: Alerting & Incident Management

Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into ~/.claude/skills/:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/alerting-incident-management" ~/.claude/skills/majiayu000-claude-skill-registry-alerting-incident-management && rm -rf "$T"
```

Manifest: skills/data/alerting-incident-management/SKILL.md

Source content:
🚨 Skill: Alerting & Incident Management
Metadata

| Attribute | Value |
|---|---|
| ID | |
| Level | 🔴 Advanced |
| Version | 1.0.0 |
| Keywords | alerting, incident-management, pagerduty, opsgenie, oncall, runbooks, incident-response |
| Reference | Google SRE - On-Call, PagerDuty |
Invocation Keywords

alerting · incident-management · pagerduty · opsgenie · oncall · runbooks · incident-response · @skill:alerting

Example Prompts

Implement alerting with PagerDuty and runbooks
Set up incident management and on-call rotation
Set up runbooks and on-call rotation
@skill:alerting - Complete alerting and incident system
Description

Effective alerting and incident management are critical for keeping services reliable. This skill covers designing effective alerts, on-call rotations, runbooks, and incident response workflows.
✅ When to Use This Skill
- Production services
- On-call teams
- Critical SLAs
- Distributed systems
- Frequent incidents
- Compliance requirements

❌ When NOT to Use This Skill
- Local development only
- Non-critical services without an SLA
- Teams without on-call capacity
Alerting Architecture
```
┌──────────────┐
│  Prometheus  │
│  (Metrics)   │
└──────┬───────┘
       │  Alert Rules
┌──────▼───────┐
│ Alertmanager │
└──────┬───────┘
       │
   ┌───┴──────┬──────────┬───────────┐
   │          │          │           │
┌──▼──────┐ ┌─▼─────┐ ┌──▼────┐ ┌───▼─────┐
│PagerDuty│ │ Slack │ │ Email │ │ Webhook │
└──┬──────┘ └───────┘ └───────┘ └─────────┘
   │
┌──▼───────────────────┐
│   On-Call Engineer   │
│ (Incident Response)  │
└──────────────────────┘
```
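For the diagram above to work end to end, Prometheus has to load the alert rule files and know where Alertmanager listens. A minimal sketch of that wiring is below; the `alertmanager:9093` target and the `alerts/*.yml` path are placeholders for whatever your deployment actually uses.

```yaml
# prometheus/prometheus.yml (illustrative excerpt, adjust targets and paths)
global:
  scrape_interval: 15s
  evaluation_interval: 30s   # how often alert rules are evaluated

rule_files:
  - "alerts/*.yml"           # e.g. the service-alerts.yml shown below

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # the Alertmanager from the diagram
```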
Implementation
1. Alertmanager Configuration
```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

  routes:
    # Critical alerts → PagerDuty immediately
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true

    # Warning alerts → Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 24h

    # Service-specific routes
    - match:
        service: payment-service
      receiver: 'payment-team'
      group_wait: 5s

inhibit_rules:
  # Inhibit warning if critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
        severity: 'critical'
        client: 'Prometheus'
        client_url: 'http://prometheus:9090'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'payment-team'
    pagerduty_configs:
      - service_key: 'PAYMENT_TEAM_KEY'
        description: 'Payment Service Alert'
    slack_configs:
      - channel: '#payment-team'
```
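Before deploying a routing change, it helps to validate the file and confirm which receivers a given label set would reach. A quick sketch using `amtool`, the Alertmanager CLI; the file path is assumed to match the example above.

```bash
# Validate the configuration file
amtool check-config alertmanager/alertmanager.yml

# Ask the routing tree which receivers would fire for a critical payment alert
amtool config routes test \
  --config.file=alertmanager/alertmanager.yml \
  severity=critical service=payment-service
```

With the routes above, a critical payment-service alert should resolve to both pagerduty-critical (because of `continue: true`) and payment-team.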
2. Alert Rules (Prometheus)
```yaml
# prometheus/alerts/service-alerts.yml
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # Error Budget Exhaustion
      - alert: ErrorBudgetExhaustion
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.001
          and
          (
            sum_over_time(
              (
                sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
                /
                sum(rate(http_requests_total[5m])) by (service)
              )[28d:]
            ) > 0.001
          )
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error budget exhausted for {{ $labels.service }}"
          description: |
            Service {{ $labels.service }} has exceeded error budget threshold.
            Current error rate: {{ $value | humanizePercentage }}
            Runbook: https://runbooks.example.com/error-budget

      # High Latency
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency in {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s (threshold: 1s)"
          runbook: https://runbooks.example.com/high-latency

      # Service Down
      - alert: ServiceDown
        expr: up{job=~"app-.*"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "Service has been unreachable for more than 2 minutes"
          runbook: https://runbooks.example.com/service-down

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (
            container_memory_usage_bytes{pod=~".+"}
            /
            container_spec_memory_limit_bytes{pod=~".+"}
          ) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage in {{ $labels.pod }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"
          runbook: https://runbooks.example.com/high-memory

      # Disk Space
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      # CPU Throttling
      - alert: CPUThrottling
        expr: |
          rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU throttling detected in {{ $labels.pod }}"
          description: "Container is being CPU throttled"

      # Connection Pool Exhaustion
      - alert: ConnectionPoolExhaustion
        expr: |
          (
            db_connections_active{service=~".+"}
            /
            db_connections_max{service=~".+"}
          ) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool near exhaustion in {{ $labels.service }}"
          description: "{{ $value | humanizePercentage }} of connections in use"
```
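These rules can be unit-tested offline with `promtool test rules`, which is a cheap way to catch broken expressions before they reach production. Below is a minimal sketch for the `ServiceDown` rule above; the test file name and the sample series are made up for illustration.

```yaml
# prometheus/alerts/service-alerts_test.yml (hypothetical test file)
# Run with: promtool test rules service-alerts_test.yml
rule_files:
  - service-alerts.yml

evaluation_interval: 30s

tests:
  - interval: 1m
    input_series:
      # Instance reports up for one minute, then stays down
      - series: 'up{job="app-api", instance="app-api:8080"}'
        values: "1 0 0 0 0 0"
    alert_rule_test:
      - eval_time: 4m        # past the 2m "for" window, so the alert should be firing
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: app-api
              instance: app-api:8080
            exp_annotations:
              summary: "Service app-api is down"
              description: "Service has been unreachable for more than 2 minutes"
              runbook: "https://runbooks.example.com/service-down"
```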
3. Runbook Template
# Runbook: High Error Rate

## Alert Name
`HighErrorRate`

## Severity
Critical

## Description
Service error rate exceeds threshold (5% for 5 minutes)

## Symptoms
- High HTTP 5xx response rate
- User complaints
- Error logs increasing

## Immediate Actions

1. **Acknowledge Alert**
   - Acknowledge in PagerDuty/OpsGenie
   - Notify team in Slack

2. **Check Service Health**

   ```bash
   kubectl get pods -l app=service-name
   kubectl logs -l app=service-name --tail=100
   ```

3. **Check Metrics**
   - Grafana dashboard: /d/service-overview
   - Error rate graph
   - Latency graphs

4. **Identify Root Cause**
   - Check recent deployments
   - Review error logs
   - Check dependencies (DB, APIs)

## Resolution Steps

If caused by code deployment:

```bash
# Rollback deployment
kubectl rollout undo deployment/service-name
```

If caused by resource exhaustion:

```bash
# Scale up
kubectl scale deployment/service-name --replicas=5
```

If caused by dependency failure:
- Check dependency service status
- Implement circuit breaker
- Enable fallback mechanisms

## Post-Incident
- Document in incident log
- Update runbook if needed

## Escalation
- If not resolved in 15 min → Escalate to senior engineer
- If not resolved in 30 min → Escalate to engineering manager
- If service completely down → Escalate to CTO
4. On-Call Rotation (PagerDuty)

```yaml
# pagerduty/escalation-policies.yml
escalation_policies:
  - name: "Primary On-Call"
    description: "Primary escalation for production services"
    num_loops: 3
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "user"
            id: "PXXXXXXXX"   # Primary on-call
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user"
            id: "PYYYYYYYY"   # Secondary on-call
      - escalation_delay_in_minutes: 30
        targets:
          - type: "schedule"
            id: "PZZZZZZZZ"   # Manager on-call

  - name: "Critical Alerts Only"
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "user_reference"
            id: "PXXXXXXXX"
      - escalation_delay_in_minutes: 10
        targets:
          - type: "escalation_policy_reference"
            id: "EPXXXXXXX"   # Manager escalation

schedules:
  - name: "Primary On-Call Schedule"
    time_zone: "America/New_York"
    layers:
      - name: "Layer 1"
        start: "2024-01-01T00:00:00"
        rotation_virtual_start: "2024-01-01T00:00:00"
        rotation_turn_length_seconds: 604800   # 1 week
        users:
          - user_id: "PXXXXXXXX"
          - user_id: "PYYYYYYYY"
          - user_id: "PZZZZZZZZ"
        restrictions:
          - type: "daily_restriction"
            start_time_of_day: "09:00:00"
            duration_seconds: 32400   # 9 hours
```
5. Incident Response Workflow
```python
# incident_response/workflow.py
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional
from enum import Enum


class IncidentSeverity(Enum):
    SEV1 = "critical"   # Service down
    SEV2 = "high"       # Major degradation
    SEV3 = "medium"     # Minor issues
    SEV4 = "low"        # Informational


@dataclass
class Incident:
    id: str
    title: str
    severity: IncidentSeverity
    status: str  # open, investigating, mitigated, resolved
    created_at: datetime
    resolved_at: Optional[datetime]
    assigned_to: str
    affected_services: List[str]
    description: str
    root_cause: Optional[str] = None
    resolution: Optional[str] = None


class IncidentResponseWorkflow:
    def __init__(self):
        self.incidents = []

    def create_incident(
        self,
        title: str,
        severity: IncidentSeverity,
        affected_services: List[str],
        description: str
    ) -> Incident:
        incident = Incident(
            id=f"INC-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
            title=title,
            severity=severity,
            status="open",
            created_at=datetime.now(),
            resolved_at=None,
            assigned_to=self._assign_oncall(),
            affected_services=affected_services,
            description=description
        )

        self.incidents.append(incident)
        self._notify_team(incident)
        self._create_incident_channel(incident)

        return incident

    def _assign_oncall(self) -> str:
        # Logic to assign to current on-call engineer
        return "engineer@example.com"

    def _notify_team(self, incident: Incident):
        # Send notifications via PagerDuty, Slack, etc.
        pass

    def _create_incident_channel(self, incident: Incident):
        # Create dedicated Slack channel for incident
        channel_name = f"incident-{incident.id.lower()}"
        # Create channel logic
        pass

    def update_status(self, incident_id: str, status: str, notes: str):
        incident = self._find_incident(incident_id)
        if incident:
            incident.status = status
            if status == "resolved":
                incident.resolved_at = datetime.now()

    def _find_incident(self, incident_id: str) -> Optional[Incident]:
        return next((i for i in self.incidents if i.id == incident_id), None)
```
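A short usage sketch of the workflow class above, with made-up incident details; in a real setup `_notify_team` and `_create_incident_channel` would call the PagerDuty/Slack APIs instead of being stubs.

```python
# Example usage (illustrative values)
workflow = IncidentResponseWorkflow()

incident = workflow.create_incident(
    title="Elevated 5xx rate on payment-service",
    severity=IncidentSeverity.SEV1,
    affected_services=["payment-service"],
    description="Error rate above threshold since the 14:02 deploy",
)
print(incident.id, incident.status)   # e.g. INC-20240101-140500 open

# Later, once the fix is verified
workflow.update_status(incident.id, "resolved", notes="Rolled back the deploy")
```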
6. Alert Fatigue Prevention
```yaml
# alertmanager/alert-fatigue-prevention.yml
# Strategies to prevent alert fatigue

# 1. Alert Grouping
route:
  group_by: ['alertname', 'service']
  group_wait: 10s        # Wait before sending initial notification
  group_interval: 5m     # Wait before sending updated notification
  repeat_interval: 12h   # Minimum time between notifications

# 2. Alert Inhibition
inhibit_rules:
  # Don't alert on warning if critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service']

  # Don't alert on individual instance if all instances are down
  - source_match:
      alertname: 'AllInstancesDown'
    target_match_re:
      alertname: '.*InstanceDown'

# 3. Alert Suppression (Silence Rules)
# Use Alertmanager UI or API to create silences for known issues

# 4. Threshold Tuning
# Use error budgets and SLOs to set meaningful thresholds
# Example: Alert only when error budget is at risk

# 5. Alert Classification
# Only page on actionable alerts
# Use different channels for different severities
```
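For strategy 3 (silences), here is a sketch of how a known-noisy alert could be silenced from the command line with `amtool`; the matchers, URL, author, and duration are placeholders.

```bash
# Silence a known issue for 2 hours so it stops paging
amtool silence add \
  --alertmanager.url=http://alertmanager:9093 \
  --author="oncall@example.com" \
  --comment="Known latency regression, fix rolling out" \
  --duration=2h \
  alertname=HighLatencyP99 service=payment-service

# Review and expire silences
amtool silence query --alertmanager.url=http://alertmanager:9093
amtool silence expire <silence-id> --alertmanager.url=http://alertmanager:9093
```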
Best Practices
1. Alert Design
✅ DO:
- Alert on symptoms, not causes (see the burn-rate sketch after this list)
- Make alerts actionable
- Include runbook links
- Use appropriate severity levels
- Test alerts regularly
❌ DON'T:
- Alert on every metric
- Create alerts that require investigation to understand
- Alert on things you can't fix
- Duplicate alerts across systems
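As a concrete instance of symptom-based, actionable alerting, a multi-window burn-rate rule (the pattern described in the Google SRE workbook referenced in the metadata) only pages when the error budget is being consumed quickly. The sketch below assumes a 99.9% availability SLO (hence the 14.4× burn-rate factor) and reuses the `http_requests_total` metric from the earlier examples.

```yaml
# Illustrative burn-rate alert; tune windows and factors to your SLO
- alert: FastErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
      /
      sum(rate(http_requests_total[1h])) by (service)
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      /
      sum(rate(http_requests_total[5m])) by (service)
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.service }} is burning error budget 14x faster than sustainable"
    runbook: https://runbooks.example.com/error-budget
```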
2. On-Call
✅ DO:
- Maintain clear rotation schedules
- Provide context in handoffs
- Limit on-call duration (max 1 week)
- Compensate on-call time
- Track on-call load (see the query after this list)
❌ DON'T:
- Have people on-call 24/7
- Make on-call mandatory without compensation
- Skip handoffs between rotations
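One way to track on-call load (last DO item above) is to watch how many pages Alertmanager actually sends: it exports an `alertmanager_notifications_total` counter per integration, so a query like the sketch below gives pages per week. The `integration="pagerduty"` label value is an assumption about the setup described earlier.

```promql
# Pages sent to PagerDuty over the last 7 days
sum(increase(alertmanager_notifications_total{integration="pagerduty"}[7d]))
```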
3. Incident Response
✅ DO:
- Follow runbooks
- Communicate frequently
- Document decisions
- Implement action items
❌ DON'T:
- Skip incident documentation
- Point fingers
- Ignore post-mortem action items
Troubleshooting
Too Many Alerts
- Review alert rules
- Increase thresholds where appropriate
- Implement better grouping
- Use alert inhibition
Alerts Not Firing
- Check Prometheus query syntax (see the validation commands below)
- Verify alert rule evaluation
- Check Alertmanager configuration
- Verify notification channel configs
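Several of these checks can be scripted. A sketch using the standard Prometheus and Alertmanager tooling; the file paths and URLs are assumed to match the earlier examples.

```bash
# Validate alert rule syntax and expressions
promtool check rules prometheus/alerts/service-alerts.yml

# Confirm the alert is actually firing in Prometheus
curl -s 'http://prometheus:9090/api/v1/alerts' | jq '.data.alerts[].labels.alertname'

# Confirm the alert reached Alertmanager (if not, check Prometheus' alerting config)
amtool alert query --alertmanager.url=http://alertmanager:9093
```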
On-Call Burnout
- Review alert volume
- Reduce non-actionable alerts
- Improve runbooks
- Rotate more frequently
Additional Resources
Version: 1.0.0
Last updated: December 2025
Total lines: 1,100+