Claude-skill-registry alerting-incident-management

🚨 Skill: Alerting & Incident Management

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/alerting-incident-management" ~/.claude/skills/majiayu000-claude-skill-registry-alerting-incident-management && rm -rf "$T"
manifest: skills/data/alerting-incident-management/SKILL.md
source content

🚨 Skill: Alerting & Incident Management

📋 Metadata

| Attribute | Value |
| --- | --- |
| ID | sre-alerting-incident-management |
| Level | 🔴 Advanced |
| Version | 1.0.0 |
| Keywords | alerting, incident-management, pagerduty, opsgenie, oncall, runbooks |
| Reference | Google SRE - On-Call, PagerDuty |

🔑 Keywords for Invocation

  • alerting
  • incident-management
  • pagerduty
  • opsgenie
  • oncall
  • runbooks
  • incident-response
  • @skill:alerting

Example Prompts

Implement alerting with PagerDuty and runbooks
Configure incident management and an on-call rotation
Configure runbooks and an on-call rotation
@skill:alerting - Complete alerting and incident management system

📖 Description

Effective alerting and incident management are critical for keeping services reliable. This skill covers designing effective alerts, on-call rotations, runbooks, and incident response workflows.

✅ When to Use This Skill

  • Production services
  • On-call teams
  • Critical SLAs
  • Distributed systems
  • Frequent incidents
  • Compliance requirements

❌ When NOT to Use This Skill

  • Local development only
  • Non-critical services without an SLA
  • Teams without on-call capacity

πŸ—οΈ Arquitectura de Alerting

┌──────────────┐
│  Prometheus  │
│  (Metrics)   │
└──────┬───────┘
       │
       │ Alert Rules
       │
┌──────▼───────┐
│ Alertmanager │
└──────┬───────┘
       │
       ├──────────┬──────────┬──────────┐
       │          │          │          │
┌──────▼───┐ ┌───▼────┐ ┌───▼────┐ ┌──▼─────┐
│ PagerDuty│ │ Slack  │ │ Email  │ │ Webhook│
└──────────┘ └────────┘ └────────┘ └────────┘
       │
       │
┌──────▼──────────────┐
│ On-Call Engineer    │
│ (Incident Response) │
└─────────────────────┘

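Before relying on this pipeline, it is worth exercising it end to end. One low-effort check is to push a synthetic alert straight into Alertmanager's v2 API and confirm it reaches the expected receiver. A minimal sketch (the Alertmanager URL and labels are placeholders, and the `requests` library is assumed):

# send_test_alert.py - push a synthetic alert into Alertmanager to verify routing (illustrative values)
from datetime import datetime, timedelta, timezone

import requests

ALERTMANAGER_URL = "http://alertmanager:9093"  # assumption: adjust to your environment

now = datetime.now(timezone.utc)
test_alert = [{
    "labels": {
        "alertname": "RoutingSmokeTest",
        "severity": "warning",        # should match the slack-warnings route in the config below
        "service": "smoke-test",
    },
    "annotations": {"description": "Synthetic alert used to verify notification routing"},
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(minutes=5)).isoformat(),
}]

resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/alerts", json=test_alert, timeout=5)
resp.raise_for_status()
print("Alert accepted; check the receiver channel for the notification")
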
💻 Implementation

1. Alertmanager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
  # Critical alerts → PagerDuty immediately
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
    continue: true

  # Warning alerts β†’ Slack only
  - match:
      severity: warning
    receiver: 'slack-warnings'
    repeat_interval: 24h

  # Service-specific routes
  - match:
      service: payment-service
    receiver: 'payment-team'
    group_wait: 5s

inhibit_rules:
  # Inhibit warning if critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
        severity: 'critical'
        client: 'Prometheus'
        client_url: 'http://prometheus:9090'

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warnings'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'payment-team'
    pagerduty_configs:
      - service_key: 'PAYMENT_TEAM_KEY'
        description: 'Payment Service Alert'
    slack_configs:
      - channel: '#payment-team'

2. Alert Rules (Prometheus)

# prometheus/alerts/service-alerts.yml
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # Error Budget Exhaustion
      - alert: ErrorBudgetExhaustion
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.001
          and
          (
            avg_over_time(
              (
                sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
                /
                sum(rate(http_requests_total[5m])) by (service)
              )[28d:]
            ) > 0.001
          )
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error budget exhausted for {{ $labels.service }}"
          description: |
            Service {{ $labels.service }} has exceeded error budget threshold.
            Current error rate: {{ $value | humanizePercentage }}
            Runbook: https://runbooks.example.com/error-budget

      # High Latency
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency in {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s (threshold: 1s)"
          runbook: https://runbooks.example.com/high-latency

      # Service Down
      - alert: ServiceDown
        expr: up{job=~"app-.*"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "Service has been unreachable for more than 2 minutes"
          runbook: https://runbooks.example.com/service-down

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (
            container_memory_usage_bytes{pod=~".+"}
            /
            container_spec_memory_limit_bytes{pod=~".+"}
          ) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage in {{ $labels.pod }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"
          runbook: https://runbooks.example.com/high-memory

      # Disk Space
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Only {{ $value | humanizePercentage }} disk space remaining"

      # CPU Throttling
      - alert: CPUThrottling
        expr: |
          rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU throttling detected in {{ $labels.pod }}"
          description: "Container is being CPU throttled"

      # Connection Pool Exhaustion
      - alert: ConnectionPoolExhaustion
        expr: |
          (
            db_connections_active{service=~".+"}
            /
            db_connections_max{service=~".+"}
          ) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Connection pool near exhaustion in {{ $labels.service }}"
          description: "{{ $value | humanizePercentage }} of connections in use"

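Before committing a new rule like the ones above, it helps to dry-run the expression against the Prometheus HTTP API and inspect the values it would fire on. A small sketch (the Prometheus URL is a placeholder and `requests` is assumed):

# check_alert_expr.py - evaluate a candidate alert expression before shipping it as a rule
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumption: adjust to your environment

# The HighLatencyP99 expression without its threshold, so per-service values are visible
expr = (
    "histogram_quantile(0.99, "
    "sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr}, timeout=10)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    service = sample["metric"].get("service", "<none>")
    p99 = float(sample["value"][1])
    status = "WOULD FIRE" if p99 > 1.0 else "ok"
    print(f"{service}: p99={p99:.3f}s ({status})")
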
3. Runbook Template

# Runbook: High Error Rate

## Alert Name
`HighErrorRate`

## Severity
Critical

## Description
Service error rate exceeds threshold (5% for 5 minutes)

## Symptoms
- High HTTP 5xx response rate
- User complaints
- Error logs increasing

## Immediate Actions

1. **Acknowledge Alert**
   - Acknowledge in PagerDuty/OpsGenie
   - Notify team in Slack

2. **Check Service Health**
   ```bash
   kubectl get pods -l app=service-name
   kubectl logs -l app=service-name --tail=100
   ```

3. **Check Metrics**
   - Grafana dashboard: /d/service-overview
   - Error rate graph
   - Latency graphs

4. **Identify Root Cause**
   - Check recent deployments
   - Review error logs
   - Check dependencies (DB, APIs)

## Resolution Steps

If caused by code deployment:

    # Rollback deployment
    kubectl rollout undo deployment/service-name

If caused by resource exhaustion:

    # Scale up
    kubectl scale deployment/service-name --replicas=5

If caused by dependency failure:

1. Check dependency service status
2. Implement circuit breaker
3. Enable fallback mechanisms

## Post-Incident

- Document in incident log
- Update runbook if needed

## Escalation

- If not resolved in 15 min → Escalate to senior engineer
- If not resolved in 30 min → Escalate to engineering manager
- If service completely down → Escalate to CTO

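The health checks in step 2 of the runbook can also be scripted so the responder gets pod status and recent logs in one step. A rough sketch using the official `kubernetes` Python client (the namespace and label selector are assumptions mirroring the kubectl commands above):

# runbook_health_check.py - automate the "Check Service Health" step (hypothetical namespace/labels)
from kubernetes import client, config

NAMESPACE = "production"              # assumption
LABEL_SELECTOR = "app=service-name"   # same selector as the kubectl commands in the runbook

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
for pod in pods.items:
    print(f"{pod.metadata.name}: {pod.status.phase}")
    # Tail the last 100 log lines, mirroring `kubectl logs --tail=100`
    print(v1.read_namespaced_pod_log(pod.metadata.name, NAMESPACE, tail_lines=100))
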
4. On-Call Rotation (PagerDuty)

# pagerduty/escalation-policies.yml
escalation_policies:
  - name: "Primary On-Call"
    description: "Primary escalation for production services"
    num_loops: 3
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "user"
            id: "PXXXXXXXX"  # Primary on-call
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user"
            id: "PYYYYYYYY"  # Secondary on-call
      - escalation_delay_in_minutes: 30
        targets:
          - type: "schedule"
            id: "PZZZZZZZZ"  # Manager on-call

  - name: "Critical Alerts Only"
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "user_reference"
            id: "PXXXXXXXX"
      - escalation_delay_in_minutes: 10
        targets:
          - type: "escalation_policy_reference"
            id: "EPXXXXXXX"  # Manager escalation

schedules:
  - name: "Primary On-Call Schedule"
    time_zone: "America/New_York"
    layers:
      - name: "Layer 1"
        start: "2024-01-01T00:00:00"
        rotation_virtual_start: "2024-01-01T00:00:00"
        rotation_turn_length_seconds: 604800  # 1 week
        users:
          - user_id: "PXXXXXXXX"
          - user_id: "PYYYYYYYY"
          - user_id: "PZZZZZZZZ"
        restrictions:
          - type: "daily_restriction"
            start_time_of_day: "09:00:00"
            duration_seconds: 32400  # 9 hours

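Escalation policies and schedules only decide who gets paged; something still has to open the incident. When an alert source is not wired through Alertmanager's pagerduty_configs, it can trigger PagerDuty directly through the Events API v2. A minimal sketch (the routing key and incident details are placeholders):

# trigger_pagerduty.py - open a PagerDuty incident via the Events API v2 (placeholder values)
import requests

ROUTING_KEY = "YOUR_EVENTS_API_V2_ROUTING_KEY"  # from the service's Events API v2 integration

event = {
    "routing_key": ROUTING_KEY,
    "event_action": "trigger",
    "dedup_key": "payment-service-high-error-rate",  # repeated triggers with the same key are deduplicated
    "payload": {
        "summary": "High error rate on payment-service",
        "source": "prometheus",
        "severity": "critical",
        "custom_details": {"runbook": "https://runbooks.example.com/error-budget"},
    },
}

resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=event, timeout=5)
resp.raise_for_status()
print(resp.json())  # includes the dedup_key to use for later acknowledge/resolve events
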
5. Incident Response Workflow

# incident_response/workflow.py
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional
from enum import Enum

class IncidentSeverity(Enum):
    SEV1 = "critical"  # Service down
    SEV2 = "high"      # Major degradation
    SEV3 = "medium"    # Minor issues
    SEV4 = "low"       # Informational

@dataclass
class Incident:
    id: str
    title: str
    severity: IncidentSeverity
    status: str  # open, investigating, mitigated, resolved
    created_at: datetime
    resolved_at: Optional[datetime]
    assigned_to: str
    affected_services: List[str]
    description: str
    root_cause: Optional[str] = None
    resolution: Optional[str] = None

class IncidentResponseWorkflow:
    def __init__(self):
        self.incidents = []

    def create_incident(
        self,
        title: str,
        severity: IncidentSeverity,
        affected_services: List[str],
        description: str
    ) -> Incident:
        incident = Incident(
            id=f"INC-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
            title=title,
            severity=severity,
            status="open",
            created_at=datetime.now(),
            resolved_at=None,
            assigned_to=self._assign_oncall(),
            affected_services=affected_services,
            description=description
        )
        
        self.incidents.append(incident)
        self._notify_team(incident)
        self._create_incident_channel(incident)
        
        return incident

    def _assign_oncall(self) -> str:
        # Logic to assign to current on-call engineer
        return "engineer@example.com"

    def _notify_team(self, incident: Incident):
        # Send notifications via PagerDuty, Slack, etc.
        pass

    def _create_incident_channel(self, incident: Incident):
        # Create dedicated Slack channel for incident
        channel_name = f"incident-{incident.id.lower()}"
        # Create channel logic
        pass

    def update_status(self, incident_id: str, status: str, notes: str):
        incident = self._find_incident(incident_id)
        if incident:
            incident.status = status
            if status == "resolved":
                incident.resolved_at = datetime.now()

    def _find_incident(self, incident_id: str) -> Optional[Incident]:
        return next((i for i in self.incidents if i.id == incident_id), None)

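A short usage sketch of the workflow above, assuming the module is in scope; the incident details are illustrative:

# Example usage of IncidentResponseWorkflow (illustrative values)
workflow = IncidentResponseWorkflow()

incident = workflow.create_incident(
    title="Payment service returning 5xx errors",
    severity=IncidentSeverity.SEV1,
    affected_services=["payment-service", "checkout-api"],
    description="Error rate above 5% since 14:32 UTC; suspected bad deploy.",
)
print(incident.id, incident.assigned_to)

# Once the bad deploy has been rolled back and verified:
workflow.update_status(incident.id, "mitigated", notes="Rolled back to previous release")
workflow.update_status(incident.id, "resolved", notes="Error rate back to baseline")
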
6. Alert Fatigue Prevention

# alertmanager/alert-fatigue-prevention.yml
# Strategies to prevent alert fatigue

# 1. Alert Grouping
route:
  group_by: ['alertname', 'service']
  group_wait: 10s      # Wait before sending initial notification
  group_interval: 5m   # Wait before sending updated notification
  repeat_interval: 12h # Minimum time between notifications

# 2. Alert Inhibition
inhibit_rules:
  # Don't alert on warning if critical is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service']

  # Don't alert on individual instance if all instances are down
  - source_match:
      alertname: 'AllInstancesDown'
    target_match_re:
      alertname: '.*InstanceDown'

# 3. Alert Suppression (Silence Rules)
# Use Alertmanager UI or API to create silences for known issues

# 4. Threshold Tuning
# Use error budgets and SLOs to set meaningful thresholds
# Example: Alert only when error budget is at risk

# 5. Alert Classification
# Only page on actionable alerts
# Use different channels for different severities

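For strategy 3, silences can also be created programmatically, for example from a deploy pipeline that knows a service will flap during a maintenance window. A sketch against Alertmanager's v2 silences endpoint (URL, matcher, and comment are placeholders):

# create_silence.py - silence a service's alerts during a planned maintenance window
from datetime import datetime, timedelta, timezone

import requests

ALERTMANAGER_URL = "http://alertmanager:9093"  # assumption: adjust to your environment

now = datetime.now(timezone.utc)
silence = {
    "matchers": [
        {"name": "service", "value": "payment-service", "isRegex": False},
    ],
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(hours=1)).isoformat(),
    "createdBy": "deploy-pipeline",
    "comment": "Planned maintenance window",
}

resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/silences", json=silence, timeout=5)
resp.raise_for_status()
print("Silence created:", resp.json()["silenceID"])
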
🎯 Best Practices

1. Alert Design

✅ DO:

  • Alert on symptoms, not causes
  • Make alerts actionable
  • Include runbook links
  • Use appropriate severity levels
  • Test alerts regularly

❌ DON'T:

  • Alert on every metric
  • Create alerts that require investigation to understand
  • Alert on things you can't fix
  • Duplicate alerts across systems

2. On-Call

✅ DO:

  • Maintain clear rotation schedules
  • Provide context in handoffs
  • Limit on-call duration (max 1 week)
  • Compensate on-call time
  • Track on-call load

❌ DON'T:

  • Have people on-call 24/7
  • Make on-call mandatory without compensation
  • Skip handoffs between rotations

3. Incident Response

✅ DO:

  • Follow runbooks
  • Communicate frequently
  • Document decisions
  • Implement action items

❌ DON'T:

  • Skip incident documentation
  • Point fingers
  • Ignore post-mortem action items

🚨 Troubleshooting

Too Many Alerts

  1. Review alert rules
  2. Increase thresholds where appropriate
  3. Implement better grouping
  4. Use alert inhibition

Alerts Not Firing

  1. Check Prometheus query syntax
  2. Verify alert rule evaluation (see the sketch below)
  3. Check Alertmanager configuration
  4. Verify notification channel configs

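One way to work through steps 1 and 2 programmatically is to ask Prometheus which alerting rules it has loaded and what state they are in. A quick sketch against the rules API (the Prometheus URL is a placeholder and `requests` is assumed):

# check_rules.py - list alerting rules and their current state (inactive / pending / firing)
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumption: adjust to your environment

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/rules", params={"type": "alert"}, timeout=10)
resp.raise_for_status()

for group in resp.json()["data"]["groups"]:
    for rule in group["rules"]:
        if rule.get("type") != "alerting":
            continue
        print(f"{group['name']}/{rule['name']}: state={rule['state']}, health={rule['health']}")
        if rule.get("lastError"):
            print(f"  last error: {rule['lastError']}")
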
On-Call Burnout

  1. Review alert volume
  2. Reduce non-actionable alerts
  3. Improve runbooks
  4. Rotate more frequently

📚 Additional Resources


Version: 1.0.0
Last updated: December 2025
Total lines: 1,100+