Claude-skill-registry incident-response

Incident response procedures — triage, communication, investigation, mitigation, and post-incident review. Use when handling production incidents or writing runbooks.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/incident-response-claude-code-communit-claude-code-resource" ~/.claude/skills/majiayu000-claude-skill-registry-incident-response && rm -rf "$T"

manifest: skills/data/incident-response-claude-code-communit-claude-code-resource/SKILL.md

Incident Response

Severity Levels

Assign a severity level immediately upon incident detection. Severity determines response urgency, communication cadence, and escalation path.

Severity	Name	Description	Examples	Response Time	Update Cadence	Responders
P0	Total Outage	Complete service unavailability. All users affected. Revenue-impacting.	Site completely down, data corruption, security breach with active exploitation	< 15 minutes	Every 15 minutes	All hands on deck, executive notification
P1	Major Degradation	Core functionality severely impaired. Large portion of users affected.	Payment processing broken, authentication failing for most users, major data pipeline stalled	< 30 minutes	Every 30 minutes	On-call engineer + team lead, stakeholder notification
P2	Partial Impact	Non-core functionality broken or core functionality degraded for a subset of users.	Search feature down, slow responses in one region, intermittent errors for some users	< 2 hours	Every 2 hours	On-call engineer
P3	Minor Issue	Cosmetic issues, minor bugs, or issues with workarounds available.	UI glitch, non-critical background job delayed, minor data inconsistency	Next business day	Daily (if ongoing)	Assigned engineer

Escalation Rules

If a P2 is not resolved within 4 hours, escalate to P1.
If a P1 is not resolved within 2 hours, escalate to P0.
Any incident involving data breach or security compromise is automatically P0.
When in doubt, over-classify. It is better to downgrade than to under-respond.

Triage Procedure

When an incident is detected (alert, user report, or monitoring), follow this triage sequence:

Step 1: Assess Impact

Answer these questions within the first 5 minutes:

Who is affected? (All users, subset, internal only)
What is broken? (Specific feature, entire service, data integrity)
When did it start? (Check monitoring for onset time)
Is it getting worse? (Error rate trending up, stable, or recovering)

Step 2: Assign Severity

Based on the impact assessment, assign a severity level using the table above. Document the severity and reasoning.

Step 3: Assemble the Response Team

Severity	Who to Notify
P0	On-call engineer, engineering manager, VP Engineering, customer support lead, communications
P1	On-call engineer, engineering manager, customer support lead
P2	On-call engineer, relevant team lead
P3	Assigned engineer (via ticket)

Step 4: Designate Roles

For P0 and P1 incidents, explicitly assign these roles:

Role	Responsibility
Incident Commander (IC)	Coordinates response, makes decisions, delegates tasks. Does NOT debug.
Technical Lead	Leads investigation and mitigation. Communicates findings to IC.
Communications Lead	Drafts and sends status updates to stakeholders and customers.
Scribe	Documents timeline, actions taken, and decisions in real time.

Step 5: Open Incident Channel

Create a dedicated communication channel (Slack channel, incident bridge call):

Name it clearly:
```
#incident-2024-03-15-payments-down
```
Pin the initial assessment and severity level.
All discussion and decisions happen in this channel.

Communication Templates

Initial Alert

INCIDENT DECLARED: [P0/P1/P2] - [Brief description]

Impact: [Who is affected and how]
Start time: [When it began, UTC]
Current status: [Investigating / Identified / Mitigating]
Incident Commander: [Name]
Channel: #incident-[date]-[topic]

Next update in [15/30] minutes.

Status Update

INCIDENT UPDATE: [P0/P1/P2] - [Brief description]

Status: [Investigating / Identified / Mitigating / Resolved]
Duration: [Time since start]
Current state: [What is happening now]
Actions taken: [What has been tried]
Next steps: [What will be done next]

Next update in [15/30] minutes.

Resolution Notification

INCIDENT RESOLVED: [P0/P1/P2] - [Brief description]

Duration: [Total time from detection to resolution]
Root cause: [One-sentence summary]
Resolution: [What fixed it]
Customer impact: [Summary of user-facing impact]
Follow-up: Post-incident review scheduled for [date/time].

Monitoring for recurrence.

Customer-Facing Communication

[Service Name] Status Update

We are aware of an issue affecting [description of user impact].
Our team is actively working to resolve this.

Started: [Time, timezone]
Current status: [Brief, non-technical description]

We will provide updates every [30 minutes / 1 hour].
We apologize for the inconvenience.

Investigation Methodology

Follow this structured approach rather than randomly checking things. Work from symptoms toward root cause.

Step 1: Check Dashboards

Start with the service overview dashboard. Look for:

Error rate spikes (when exactly did they start?)
Latency increases (gradual degradation or sudden jump?)
Traffic anomalies (unexpected spike or drop?)
Resource utilization (CPU, memory, disk, connections)

Step 2: Check Recent Deploys

Most incidents are caused by recent changes. Check:

# What was deployed recently?
git log --oneline --since="2 hours ago" origin/main

# Any recent infrastructure changes?
# Check deployment pipeline history, Terraform runs, config changes

Questions to answer:

Was anything deployed in the last 2 hours?
Was there a configuration change (feature flags, environment variables)?
Was there an infrastructure change (scaling, migration, certificate renewal)?

Step 3: Examine Logs

Search logs filtered to the incident timeframe:

# Example queries (adapt to your log aggregation platform)
# ELK/Kibana: level:ERROR AND service:order-service AND @timestamp >= "2024-03-15T14:00:00"
# Datadog: service:order-service status:error
# CloudWatch: filter @message like /ERROR/ | sort @timestamp desc

Look for:

Error messages with stack traces.
Repeated error patterns (same error thousands of times).
New error types that were not present before the incident.
Correlation between errors and the timeline.

Step 4: Hypothesize and Test

Based on data gathered, form a hypothesis and test it:

Hypothesis	How to Test
Bad deploy caused it	Compare error timeline with deploy timestamp. Roll back and observe.
Database is overloaded	Check connection pool, slow query log, lock contention.
External dependency is down	Check dependency status page, test connectivity, check timeout rates.
Traffic spike overwhelmed the service	Check request rate, compare to normal baseline, check auto-scaling.
DNS or certificate issue	Test DNS resolution, check certificate expiry, verify SSL handshake.
Memory leak	Check memory usage trend, look for OOM kills in system logs.
Data corruption	Query for inconsistent data, check recent migration or backfill jobs.

Step 5: Verify the Fix

After applying a fix:

Error rate returning to baseline.
Latency returning to normal.
No new error patterns appearing.
Affected functionality manually verified.
Monitor for at least 15 minutes (P0/P1) or 30 minutes (P2) before declaring resolved.

Common Mitigation Actions

When the root cause is identified (or even before, to reduce impact), apply the appropriate mitigation:

Action	When to Use	How	Risk
Rollback	Bad deploy identified	Revert to previous known-good version via deployment pipeline	May lose new features; verify database compatibility
Feature flag toggle	New feature causing issues	Disable the flag in your feature management system	Requires feature flags to be in place
Horizontal scaling	Service overwhelmed by traffic	Increase instance count via auto-scaler or manual scaling	Increased cost; may not help if bottleneck is downstream
Cache clear	Stale or corrupted cached data	Flush application cache (Redis `FLUSHDB` , CDN purge)	Temporary increase in origin load after flush
Circuit breaker	Failing dependency cascading	Activate circuit breaker to fail fast instead of waiting	Gracefully degraded experience for users
Traffic shedding	Total overload	Rate limit or redirect traffic, enable maintenance page	Users see errors or degraded service
Database failover	Primary database unresponsive	Promote replica to primary (if configured)	Brief downtime during promotion; verify replication lag
DNS redirect	Entire region or provider down	Update DNS to point to backup region or provider	Propagation delay (use low TTL proactively)
Restart	Process stuck, memory leak	Rolling restart of application instances	Brief capacity reduction during restart
Hotfix	Small targeted code fix needed	Fast-track a minimal change through deployment pipeline	Bypasses normal review; must be reviewed post-incident

Rollback Procedure

# Verify the last known-good version
git log --oneline -10 origin/main

# Tag the rollback point
git tag -a incident-rollback-2024-03-15 -m "Rolling back due to P1 incident"

# Trigger deployment of previous version
# (Adapt to your deployment pipeline)
# Example: Kubernetes
kubectl rollout undo deployment/order-service

# Verify rollback is deployed
kubectl rollout status deployment/order-service

# Monitor error rate and confirm reduction

Post-Incident Review

Conduct a post-incident review (PIR) within 48 hours of resolution for P0/P1 incidents and within one week for P2 incidents.

Post-Incident Review Template

# Post-Incident Review: [Incident Title]

**Date**: [date] | **Severity**: [P0/P1/P2] | **Duration**: [time] | **Author**: [name]

## Summary
[2-3 sentences: what happened, the impact, and the resolution.]

## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:00 | Alert fires: order error rate > 5% |
| 14:05 | P1 declared, incident channel created |
| 14:15 | Recent deploy at 13:45 identified as suspect |
| 14:25 | Rollback deployed |
| 14:45 | Resolved, monitoring for recurrence |

## Impact
Users affected, duration, revenue/SLA impact, data impact.

## Root Cause
[Detailed technical explanation of what went wrong and why.]

## Five Whys
1. **Why** did orders fail? -> Payment validation threw an exception.
2. **Why** did validation throw? -> Null value for a non-nullable field.
3. **Why** was the field null? -> Migration added column but did not backfill.
4. **Why** was backfill missed? -> No checklist step for backfill verification.
5. **Why** no checklist step? -> Migration procedures were undocumented.

## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add integration test for null field validation | @alice | High | YYYY-MM-DD | TODO |
| Lower alert threshold from 5% to 2% | @bob | High | YYYY-MM-DD | TODO |
| Add feature flag to payment flow | @carol | Medium | YYYY-MM-DD | TODO |

## Lessons Learned
- What went well: [e.g., quick detection, rapid team assembly]
- What could improve: [e.g., rollback automation, test coverage]

Blameless Culture Principles

Post-incident reviews are learning opportunities, not blame sessions. Adhere to these principles:

Principle	Practice
Assume good intent	People made the best decisions they could with the information they had at the time.
Focus on systems, not individuals	Ask "what allowed this to happen?" not "who caused this?"
Separate the what from the who	Describe actions taken without naming individuals in the root cause. Use role titles if context is needed.
Reward transparency	Publicly thank people who report incidents, share mistakes, or identify risks.
Follow through on action items	PIR action items are tracked and completed. Unfixed systemic issues lead to repeat incidents.
Share learnings broadly	Publish PIR summaries (redacted if needed) so other teams learn too.

Runbook Authoring Guide

A runbook is a step-by-step guide for responding to a specific alert or operational scenario.

Runbook Structure

Every runbook follows this structure:

# Runbook: [Alert or Scenario Name]

## When to Use
[Describe the alert, symptom, or scenario that triggers this runbook.]

## Prerequisites
- Access to [systems, dashboards, tools]
- Permissions: [required roles or access levels]

## Steps

### 1. Verify the Problem
```bash
curl -s https://monitoring.example.com/api/v1/query \
  --data-urlencode 'query=rate(http_errors_total{service="order-service"}[5m])'

Expected: Error rate below 0.01. If above, continue. If normal, check thresholds and close.

2. Apply Mitigation

# Option A: Restart
kubectl rollout restart deployment/order-service
# Option B: Rollback
kubectl rollout undo deployment/order-service

3. Verify Resolution

Expected: Error rate drops below 0.01 within 5 minutes.

4. Escalation

If unresolved: escalate to [team], contact [channel/phone], provide [context].


### Runbook Best Practices

| Practice | Reason |
|----------|--------|
| Use exact commands, not descriptions | Under stress, responders should copy-paste, not interpret |
| Include expected output | So responders know if the command worked |
| Provide verification after each step | Catch issues early, do not proceed blindly |
| Include a rollback for each step | If a mitigation step makes things worse |
| Test runbooks regularly | Outdated runbooks cause confusion during real incidents |
| Date-stamp and version runbooks | Know when it was last verified |
| Link from alert to runbook | Reduce time-to-runbook to one click |

## Incident Response Checklist

Quick reference during an active incident:

- [ ] Impact assessed (who, what, when, trending).
- [ ] Severity assigned and documented.
- [ ] Incident channel or bridge opened.
- [ ] Roles assigned (IC, Technical Lead, Comms, Scribe).
- [ ] Initial stakeholder notification sent.
- [ ] Timeline being recorded in real time.
- [ ] Investigation following structured methodology (dashboards, deploys, logs, hypothesize, test).
- [ ] Mitigation applied and impact reducing.
- [ ] Resolution verified with monitoring data.
- [ ] All-clear communication sent.
- [ ] Post-incident review scheduled (within 48 hours for P0/P1).
- [ ] Action items created with owners and due dates.