Claude-skill-registry incident-response
Respond to production incidents systematically with triage, investigation, resolution, and post-mortem analysis to minimize downtime and prevent recurrence. Use when handling production outages, triaging incidents, investigating critical bugs, coordinating incident response, implementing hotfixes, conducting post-mortems, or establishing incident response procedures.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/incident-response" ~/.claude/skills/majiayu000-claude-skill-registry-incident-response-2f4719 && rm -rf "$T"
manifest:
skills/data/incident-response/SKILL.mdsource content
Incident Response - Production Issue Management
When to use this skill
- Responding to production outages
- Triaging critical incidents
- Investigating high-severity bugs
- Coordinating incident response teams
- Implementing emergency hotfixes
- Conducting post-mortem analyses
- Establishing incident response procedures
- Communicating status during incidents
- Creating runbooks for common issues
- Implementing rollback strategies
- Documenting incident timelines
- Preventing incident recurrence
When to use this skill
- Responding to outages, managing incidents, conducting postmortems.
- When working on related tasks or features
- During development that requires this expertise
Use when: Responding to outages, managing incidents, conducting postmortems.
Incident Response Process
1. Detect
- Monitoring alerts
- User reports
- Automated checks
2. Triage
- Assess severity (P0-P4)
- Page on-call engineer
- Create incident channel
3. Mitigate
- Rollback to last known good
- Scale resources
- Apply hotfix
- Communicate status
4. Resolve
- Verify fix
- Monitor metrics
- Update status page
- Close incident
5. Postmortem
- Timeline of events
- Root cause analysis
- Action items
- Follow-up tasks
Severity Levels
- P0 (Critical): Complete outage, data loss
- P1 (High): Major feature broken, revenue impact
- P2 (Medium): Degraded performance, workaround exists
- P3 (Low): Minor bug, cosmetic issue
- P4 (Informational): Enhancement request
Example Runbook
```markdown
High CPU Usage Runbook
Symptoms
- Server CPU > 90%
- Slow response times
- Request timeouts
Investigation
- Check top processes: `top`
- Check memory: `free -h`
- Check logs: `tail -f app.log`
Mitigation
- Scale horizontally: Add servers
- Restart service: `systemctl restart app`
- Rate limit: Enable aggressive rate limiting
Resolution
- Identify root cause (N+1 query, memory leak, etc.)
- Deploy fix
- Monitor for 1 hour ```
Communication Template
``` [INCIDENT] Service X degraded
Status: Investigating Impact: 20% of users seeing slow load times ETA: 30 minutes
Updates:
- 10:00 AM: Issue detected
- 10:05 AM: On-call paged, investigation started
- 10:15 AM: Root cause identified (database bottleneck)
- 10:30 AM: Fix deployed, monitoring
Next update: 11:00 AM ```