Claude-skill-registry incident-response
Incident response procedures — triage, communication, investigation, mitigation, and post-incident review. Use when handling production incidents or writing runbooks.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/incident-response-claude-code-communit-claude-code-resource" ~/.claude/skills/majiayu000-claude-skill-registry-incident-response && rm -rf "$T"
skills/data/incident-response-claude-code-communit-claude-code-resource/SKILL.md
Incident Response
Severity Levels
Assign a severity level immediately upon incident detection. Severity determines response urgency, communication cadence, and escalation path.
| Severity | Name | Description | Examples | Response Time | Update Cadence | Responders |
|---|---|---|---|---|---|---|
| P0 | Total Outage | Complete service unavailability. All users affected. Revenue-impacting. | Site completely down, data corruption, security breach with active exploitation | < 15 minutes | Every 15 minutes | All hands on deck, executive notification |
| P1 | Major Degradation | Core functionality severely impaired. Large portion of users affected. | Payment processing broken, authentication failing for most users, major data pipeline stalled | < 30 minutes | Every 30 minutes | On-call engineer + team lead, stakeholder notification |
| P2 | Partial Impact | Non-core functionality broken or core functionality degraded for a subset of users. | Search feature down, slow responses in one region, intermittent errors for some users | < 2 hours | Every 2 hours | On-call engineer |
| P3 | Minor Issue | Cosmetic issues, minor bugs, or issues with workarounds available. | UI glitch, non-critical background job delayed, minor data inconsistency | Next business day | Daily (if ongoing) | Assigned engineer |
Escalation Rules
- If a P2 is not resolved within 4 hours, escalate to P1.
- If a P1 is not resolved within 2 hours, escalate to P0.
- Any incident involving data breach or security compromise is automatically P0.
- When in doubt, over-classify. It is better to downgrade than to under-respond.
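The time-based rules above can be sketched as a small shell function. This is an illustration only: `escalate_severity` is a hypothetical helper, and it assumes the unresolved time is counted in minutes since the current severity was assigned.

```bash
# Hypothetical helper: applies the escalation rules listed above.
# Usage: escalate_severity <severity> <minutes_unresolved> [security_incident: yes|no]
escalate_severity() (
  severity=$1
  minutes=$2
  security=${3:-no}

  # Any data breach or security compromise is automatically P0.
  if [ "$security" = "yes" ]; then
    echo "P0"
    return 0
  fi

  case "$severity" in
    # P2 unresolved for 4 hours (240 minutes) becomes P1.
    P2) if [ "$minutes" -ge 240 ]; then echo "P1"; else echo "P2"; fi ;;
    # P1 unresolved for 2 hours (120 minutes) becomes P0.
    P1) if [ "$minutes" -ge 120 ]; then echo "P0"; else echo "P1"; fi ;;
    # P0 cannot go higher; P3 has no time-based rule in this document.
    P0|P3) echo "$severity" ;;
    *) echo "unknown severity: $severity" >&2; return 1 ;;
  esac
)
```

For example, `escalate_severity P2 250` prints `P1`, and `escalate_severity P3 5 yes` prints `P0`.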
Triage Procedure
When an incident is detected (alert, user report, or monitoring), follow this triage sequence:
Step 1: Assess Impact
Answer these questions within the first 5 minutes:
- Who is affected? (All users, subset, internal only)
- What is broken? (Specific feature, entire service, data integrity)
- When did it start? (Check monitoring for onset time)
- Is it getting worse? (Error rate trending up, stable, or recovering)
Step 2: Assign Severity
Based on the impact assessment, assign a severity level using the table above. Document the severity and reasoning.
Step 3: Assemble the Response Team
| Severity | Who to Notify |
|---|---|
| P0 | On-call engineer, engineering manager, VP Engineering, customer support lead, communications |
| P1 | On-call engineer, engineering manager, customer support lead |
| P2 | On-call engineer, relevant team lead |
| P3 | Assigned engineer (via ticket) |
Step 4: Designate Roles
For P0 and P1 incidents, explicitly assign these roles:
| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Coordinates response, makes decisions, delegates tasks. Does NOT debug. |
| Technical Lead | Leads investigation and mitigation. Communicates findings to IC. |
| Communications Lead | Drafts and sends status updates to stakeholders and customers. |
| Scribe | Documents timeline, actions taken, and decisions in real time. |
Step 5: Open Incident Channel
Create a dedicated communication channel (Slack channel, incident bridge call):
- Name it clearly, for example: #incident-2024-03-15-payments-down
- Pin the initial assessment and severity level.
- All discussion and decisions happen in this channel.
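To keep channel names consistent under pressure, the naming pattern above could be generated rather than typed by hand. The following is a minimal sketch; `incident_channel_name` is a hypothetical helper, and the date is passed in explicitly so the result does not depend on the local clock or timezone.

```bash
# Hypothetical helper: builds "#incident-<date>-<topic-slug>".
# Usage: incident_channel_name <YYYY-MM-DD> <free-text topic>
incident_channel_name() (
  date_part=$1
  topic=$2
  # Lowercase the topic, turn every run of other characters into a single
  # hyphen, then trim hyphens from both ends.
  slug=$(printf '%s' "$topic" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed -e 's/^-*//' -e 's/-*$//')
  printf '#incident-%s-%s\n' "$date_part" "$slug"
)

incident_channel_name "2024-03-15" "Payments down"
# prints: #incident-2024-03-15-payments-down
```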
Communication Templates
Initial Alert
INCIDENT DECLARED: [P0/P1/P2] - [Brief description]
Impact: [Who is affected and how]
Start time: [When it began, UTC]
Current status: [Investigating / Identified / Mitigating]
Incident Commander: [Name]
Channel: #incident-[date]-[topic]
Next update in [15/30] minutes.
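As a sketch of how this template might be filled in automatically, the functions below render the initial alert and take the "next update" interval from the severity table. Both `update_cadence` and `initial_alert` are hypothetical helpers; how the message is then posted (chat webhook, paging tool) is left out because it depends on your tooling.

```bash
# Hypothetical helper: update cadence per the severity table
# (P0 every 15 minutes, P1 every 30 minutes, P2 every 2 hours).
update_cadence() {
  case "$1" in
    P0) echo "15 minutes" ;;
    P1) echo "30 minutes" ;;
    P2) echo "2 hours" ;;
    *)  echo "unsupported severity: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: prints the initial alert text.
# Usage: initial_alert <severity> <description> <impact> <start_utc> <status> <commander> <channel>
initial_alert() (
  cadence=$(update_cadence "$1") || return 1
  cat <<EOF
INCIDENT DECLARED: $1 - $2
Impact: $3
Start time: $4
Current status: $5
Incident Commander: $6
Channel: $7
Next update in $cadence.
EOF
)
```

Deriving the cadence from the severity keeps the promised update interval aligned with the table instead of relying on whoever writes the message to remember it.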
Status Update
INCIDENT UPDATE: [P0/P1/P2] - [Brief description]
Status: [Investigating / Identified / Mitigating / Resolved]
Duration: [Time since start]
Current state: [What is happening now]
Actions taken: [What has been tried]
Next steps: [What will be done next]
Next update in [15/30] minutes.
Resolution Notification
INCIDENT RESOLVED: [P0/P1/P2] - [Brief description]
Duration: [Total time from detection to resolution]
Root cause: [One-sentence summary]
Resolution: [What fixed it]
Customer impact: [Summary of user-facing impact]
Follow-up: Post-incident review scheduled for [date/time]. Monitoring for recurrence.
Customer-Facing Communication
[Service Name] Status Update

We are aware of an issue affecting [description of user impact]. Our team is actively working to resolve this.

Started: [Time, timezone]
Current status: [Brief, non-technical description]

We will provide updates every [30 minutes / 1 hour].

We apologize for the inconvenience.
Investigation Methodology
Follow this structured approach rather than randomly checking things. Work from symptoms toward root cause.
Step 1: Check Dashboards
Start with the service overview dashboard. Look for:
- Error rate spikes (when exactly did they start?)
- Latency increases (gradual degradation or sudden jump?)
- Traffic anomalies (unexpected spike or drop?)
- Resource utilization (CPU, memory, disk, connections)
Step 2: Check Recent Deploys
Most incidents are caused by recent changes. Check:
# Refresh the remote-tracking branch so the log reflects the latest pushes
git fetch origin

# What was deployed recently?
git log --oneline --since="2 hours ago" origin/main

# Any recent infrastructure changes?
# Check deployment pipeline history, Terraform runs, config changes
Questions to answer:
- Was anything deployed in the last 2 hours?
- Was there a configuration change (feature flags, environment variables)?
- Was there an infrastructure change (scaling, migration, certificate renewal)?
Step 3: Examine Logs
Search logs filtered to the incident timeframe:
# Example queries (adapt to your log aggregation platform)
# ELK/Kibana: level:ERROR AND service:order-service AND @timestamp >= "2024-03-15T14:00:00"
# Datadog: service:order-service status:error
# CloudWatch: filter @message like /ERROR/ | sort @timestamp desc
Look for:
- Error messages with stack traces.
- Repeated error patterns (same error thousands of times).
- New error types that were not present before the incident.
- Correlation between errors and the timeline.
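When only raw log files are available, repeated error patterns can be surfaced with standard tools. The sketch below assumes plain-text logs with one event per line and the literal word ERROR on error lines; `top_error_patterns` is a hypothetical helper, not a feature of any log platform.

```bash
# Hypothetical helper: lists the most frequent error messages in a log file.
# Usage: top_error_patterns <logfile> [count]
top_error_patterns() {
  # Replace every run of digits with N so timestamps, IDs and durations
  # do not make otherwise identical messages look different.
  grep 'ERROR' "$1" \
    | sed -E 's/[0-9]+/N/g' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -n "${2:-10}"
}
```

Running `top_error_patterns app.log 5` on a hypothetical `app.log` prints the five most common normalized error lines, each prefixed with its count. A pattern that dominates the list and was absent before the incident is a strong lead.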
Step 4: Hypothesize and Test
Based on data gathered, form a hypothesis and test it:
| Hypothesis | How to Test |
|---|---|
| Bad deploy caused it | Compare error timeline with deploy timestamp. Roll back and observe. |
| Database is overloaded | Check connection pool, slow query log, lock contention. |
| External dependency is down | Check dependency status page, test connectivity, check timeout rates. |
| Traffic spike overwhelmed the service | Check request rate, compare to normal baseline, check auto-scaling. |
| DNS or certificate issue | Test DNS resolution, check certificate expiry, verify SSL handshake. |
| Memory leak | Check memory usage trend, look for OOM kills in system logs. |
| Data corruption | Query for inconsistent data, check recent migration or backfill jobs. |
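The first hypothesis in the table, comparing the error timeline with the deploy timestamp, can be reduced to a simple check. This is a sketch under stated assumptions: `to_minutes` and `deploy_is_suspect` are hypothetical helpers, both times are zero-padded HH:MM values on the same UTC day, and the default 120-minute window mirrors the "last 2 hours" question from Step 2.

```bash
# Hypothetical helper: converts zero-padded HH:MM to minutes since midnight.
to_minutes() (
  h=${1%%:*}
  m=${1##*:}
  # Drop one leading zero so values such as 08 are not parsed as octal.
  h=${h#0}
  m=${m#0}
  echo $(( h * 60 + m ))
)

# Hypothetical helper: succeeds if errors began at or after the deploy and
# within <window_min> minutes of it.
# Usage: deploy_is_suspect <deploy_time> <error_onset_time> [window_min]
deploy_is_suspect() (
  gap=$(( $(to_minutes "$2") - $(to_minutes "$1") ))
  [ "$gap" -ge 0 ] && [ "$gap" -le "${3:-120}" ]
)
```

For a deploy at 13:45 and an error onset at 14:00, `deploy_is_suspect 13:45 14:00` succeeds. A match only makes the deploy a suspect; rolling back and observing the error rate is still the actual test.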
Step 5: Verify the Fix
After applying a fix, confirm all of the following:
- Error rate returning to baseline.
- Latency returning to normal.
- No new error patterns appearing.
- Affected functionality manually verified.
- Monitor for at least 15 minutes (P0/P1) or 30 minutes (P2) before declaring resolved.
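The monitoring window can be checked mechanically. The sketch below assumes error-rate samples have already been collected at a fixed interval (for example one per minute) and are supplied one per line; `fix_is_stable` is a hypothetical helper, and the baseline value is whatever your service normally runs at.

```bash
# Hypothetical helper: succeeds only if at least <min_samples> samples were
# read from stdin and every one is at or below <baseline>.
# Usage: fix_is_stable <baseline> <min_samples> < samples.txt
fix_is_stable() {
  awk -v baseline="$1" -v min="$2" '
    NF == 0 { next }                          # ignore blank lines
    { n++; if ($1 + 0 > baseline + 0) bad++ } # count samples above baseline
    END {
      if (n >= min + 0 && bad == 0) exit 0
      exit 1
    }
  '
}
```

With one sample per minute, `fix_is_stable 0.01 15 < samples.txt` corresponds to the 15-minute window for P0/P1 incidents, and a count of 30 to the P2 window. A single sample above the baseline fails the check, which restarts the waiting period.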
Common Mitigation Actions
When the root cause is identified (or even before, to reduce impact), apply the appropriate mitigation:
| Action | When to Use | How | Risk |
|---|---|---|---|
| Rollback | Bad deploy identified | Revert to previous known-good version via deployment pipeline | May lose new features; verify database compatibility |
| Feature flag toggle | New feature causing issues | Disable the flag in your feature management system | Requires feature flags to be in place |
| Horizontal scaling | Service overwhelmed by traffic | Increase instance count via auto-scaler or manual scaling | Increased cost; may not help if bottleneck is downstream |
| Cache clear | Stale or corrupted cached data | Flush application cache (Redis, CDN purge) | Temporary increase in origin load after flush |
| Circuit breaker | Failing dependency cascading | Activate circuit breaker to fail fast instead of waiting | Gracefully degraded experience for users |
| Traffic shedding | Total overload | Rate limit or redirect traffic, enable maintenance page | Users see errors or degraded service |
| Database failover | Primary database unresponsive | Promote replica to primary (if configured) | Brief downtime during promotion; verify replication lag |
| DNS redirect | Entire region or provider down | Update DNS to point to backup region or provider | Propagation delay (use low TTL proactively) |
| Restart | Process stuck, memory leak | Rolling restart of application instances | Brief capacity reduction during restart |
| Hotfix | Small targeted code fix needed | Fast-track a minimal change through deployment pipeline | Bypasses normal review; must be reviewed post-incident |
Rollback Procedure
# Verify the last known-good version
git log --oneline -10 origin/main

# Tag the rollback point
git tag -a incident-rollback-2024-03-15 -m "Rolling back due to P1 incident"

# Trigger deployment of previous version
# (Adapt to your deployment pipeline)
# Example: Kubernetes
kubectl rollout undo deployment/order-service

# Verify rollback is deployed
kubectl rollout status deployment/order-service

# Monitor error rate and confirm reduction
Post-Incident Review
Conduct a post-incident review (PIR) within 48 hours of resolution for P0/P1 incidents and within one week for P2 incidents.
Post-Incident Review Template
# Post-Incident Review: [Incident Title]

**Date**: [date] | **Severity**: [P0/P1/P2] | **Duration**: [time] | **Author**: [name]

## Summary
[2-3 sentences: what happened, the impact, and the resolution.]

## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:00 | Alert fires: order error rate > 5% |
| 14:05 | P1 declared, incident channel created |
| 14:15 | Recent deploy at 13:45 identified as suspect |
| 14:25 | Rollback deployed |
| 14:45 | Resolved, monitoring for recurrence |

## Impact
Users affected, duration, revenue/SLA impact, data impact.

## Root Cause
[Detailed technical explanation of what went wrong and why.]

## Five Whys
1. **Why** did orders fail? -> Payment validation threw an exception.
2. **Why** did validation throw? -> Null value for a non-nullable field.
3. **Why** was the field null? -> Migration added column but did not backfill.
4. **Why** was backfill missed? -> No checklist step for backfill verification.
5. **Why** no checklist step? -> Migration procedures were undocumented.

## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add integration test for null field validation | @alice | High | YYYY-MM-DD | TODO |
| Lower alert threshold from 5% to 2% | @bob | High | YYYY-MM-DD | TODO |
| Add feature flag to payment flow | @carol | Medium | YYYY-MM-DD | TODO |

## Lessons Learned
- What went well: [e.g., quick detection, rapid team assembly]
- What could improve: [e.g., rollback automation, test coverage]
Blameless Culture Principles
Post-incident reviews are learning opportunities, not blame sessions. Adhere to these principles:
| Principle | Practice |
|---|---|
| Assume good intent | People made the best decisions they could with the information they had at the time. |
| Focus on systems, not individuals | Ask "what allowed this to happen?" not "who caused this?" |
| Separate the what from the who | Describe actions taken without naming individuals in the root cause. Use role titles if context is needed. |
| Reward transparency | Publicly thank people who report incidents, share mistakes, or identify risks. |
| Follow through on action items | PIR action items are tracked and completed. Unfixed systemic issues lead to repeat incidents. |
| Share learnings broadly | Publish PIR summaries (redacted if needed) so other teams learn too. |
Runbook Authoring Guide
A runbook is a step-by-step guide for responding to a specific alert or operational scenario.
Runbook Structure
Every runbook follows this structure:
# Runbook: [Alert or Scenario Name]

## When to Use
[Describe the alert, symptom, or scenario that triggers this runbook.]

## Prerequisites
- Access to [systems, dashboards, tools]
- Permissions: [required roles or access levels]

## Steps

### 1. Verify the Problem
```bash
curl -s https://monitoring.example.com/api/v1/query \
  --data-urlencode 'query=rate(http_errors_total{service="order-service"}[5m])'
```
Expected: Error rate below 0.01. If above, continue. If normal, check thresholds and close.

### 2. Apply Mitigation
```bash
# Option A: Restart
kubectl rollout restart deployment/order-service

# Option B: Rollback
kubectl rollout undo deployment/order-service
```

### 3. Verify Resolution
Expected: Error rate drops below 0.01 within 5 minutes.

### 4. Escalation
If unresolved: escalate to [team], contact [channel/phone], provide [context].
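The "Expected" lines in the template can be made executable so that responders do not have to compare numbers by eye. This is a minimal sketch: `error_rate_verdict` is a hypothetical helper, and the default threshold of 0.01 is the one used in the template.

```bash
# Hypothetical helper: prints "continue" when the observed error rate is above
# the threshold (proceed to mitigation), otherwise "normal".
# Usage: error_rate_verdict <observed_rate> [threshold]
error_rate_verdict() {
  awk -v rate="$1" -v threshold="${2:-0.01}" 'BEGIN {
    if (rate + 0 > threshold + 0) print "continue"
    else print "normal"
  }'
}
```

If the monitoring endpoint returns a Prometheus-style JSON response (an assumption suggested by the `/api/v1/query` path) and `jq` is installed, the observed rate could be extracted with `jq -r '.data.result[0].value[1]'` and passed to this function.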
### Runbook Best Practices

| Practice | Reason |
|----------|--------|
| Use exact commands, not descriptions | Under stress, responders should copy-paste, not interpret |
| Include expected output | So responders know if the command worked |
| Provide verification after each step | Catch issues early, do not proceed blindly |
| Include a rollback for each step | If a mitigation step makes things worse |
| Test runbooks regularly | Outdated runbooks cause confusion during real incidents |
| Date-stamp and version runbooks | Know when it was last verified |
| Link from alert to runbook | Reduce time-to-runbook to one click |

## Incident Response Checklist

Quick reference during an active incident:

- [ ] Impact assessed (who, what, when, trending).
- [ ] Severity assigned and documented.
- [ ] Incident channel or bridge opened.
- [ ] Roles assigned (IC, Technical Lead, Comms, Scribe).
- [ ] Initial stakeholder notification sent.
- [ ] Timeline being recorded in real time.
- [ ] Investigation following structured methodology (dashboards, deploys, logs, hypothesize, test).
- [ ] Mitigation applied and impact reducing.
- [ ] Resolution verified with monitoring data.
- [ ] All-clear communication sent.
- [ ] Post-incident review scheduled (within 48 hours for P0/P1).
- [ ] Action items created with owners and due dates.