## Install

**Source** · Clone the upstream repo:

```bash
git clone https://github.com/dcs-soni/awesome-claude-skills
```

**Claude Code** · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/dcs-soni/awesome-claude-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/incident-response-helper" ~/.claude/skills/dcs-soni-awesome-claude-skills-responding-to-incidents && rm -rf "$T"
```

Manifest: `incident-response-helper/SKILL.md`
# Incident Response Helper
Accelerate incident response with structured workflows, log analysis scripts, and automated postmortem generation.
## Quick Start
When responding to an incident, copy this checklist:
Incident Response Progress:

- [ ] Step 1: Initial Assessment
- [ ] Step 2: Collect & Analyze Logs
- [ ] Step 3: Build Timeline
- [ ] Step 4: Assess Impact
- [ ] Step 5: Root Cause Analysis
- [ ] Step 6: Resolve & Verify
- [ ] Step 7: Generate Postmortem
## Workflow

### Step 1: Initial Assessment
Gather information quickly:
- What's broken? — Identify affected services/endpoints
- Severity? — P1 (total outage), P2 (major degradation), P3 (partial impact)
- When did it start? — Approximate start time
- Who's affected? — Users, regions, features
Run a quick health check if a URL is known:

```bash
python scripts/check_health.py <url> --timeout 10
```
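The script's source isn't reproduced in this manifest. As a rough sketch of what an endpoint check along these lines could look like, using only the Python standard library (the real `check_health.py`'s flags and output format may differ):

```python
# Hypothetical sketch of a check_health.py-style script; not the skill's actual code.
import argparse
import sys
import time
import urllib.error
import urllib.request


def check(url: str, timeout: float) -> int:
    """Fetch the URL once; return 0 on a healthy response, 1 otherwise."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            elapsed = time.monotonic() - start
            print(f"{url}: HTTP {resp.status} in {elapsed:.2f}s")
            return 0 if resp.status < 400 else 1
    except urllib.error.HTTPError as exc:
        print(f"{url}: HTTP {exc.code} ({exc.reason})")
        return 1
    except (urllib.error.URLError, TimeoutError) as exc:
        print(f"{url}: unreachable ({exc})")
        return 1


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Quick endpoint health check")
    parser.add_argument("url")
    parser.add_argument("--timeout", type=float, default=10)
    args = parser.parse_args()
    sys.exit(check(args.url, args.timeout))
```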
### Step 2: Collect & Analyze Logs

Gather logs from the affected services and analyze them:

```bash
python scripts/analyze_logs.py <log_file> --format json
```
Output includes:
- Error patterns and frequency
- Exception stack traces
- Latency anomalies
- HTTP status code distribution
For multiple log files, run the script against each and compare the patterns.
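The analyzer itself isn't shown in this manifest. As a hedged sketch, the core of error-pattern extraction can be as small as normalizing and counting error lines; the log markers, normalization rules, and JSON shape below are assumptions, not the real script's behavior:

```python
# Hypothetical sketch of analyze_logs.py-style pattern counting; assumptions throughout.
import json
import re
import sys
from collections import Counter

ERROR_LINE = re.compile(r"\b(ERROR|CRITICAL|Exception)\b")
NORMALIZE = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<ADDR>"),  # collapse hex addresses first
    (re.compile(r"\b\d+\b"), "<N>"),            # then collapse numbers so similar errors group
]


def error_patterns(path: str) -> Counter:
    counts: Counter = Counter()
    with open(path, errors="replace") as fh:
        for line in fh:
            if not ERROR_LINE.search(line):
                continue
            pattern = line.strip()
            for rx, repl in NORMALIZE:
                pattern = rx.sub(repl, pattern)
            counts[pattern] += 1
    return counts


if __name__ == "__main__":
    top = error_patterns(sys.argv[1]).most_common(10)
    print(json.dumps([{"pattern": p, "count": c} for p, c in top], indent=2))
```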
### Step 3: Build Timeline

Generate a chronological incident timeline:

```bash
python scripts/generate_timeline.py <log_dir> --start "YYYY-MM-DDTHH:MM:SS" --end "YYYY-MM-DDTHH:MM:SS"
```
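The generator's source isn't included here. The heart of such a tool is merging timestamped lines from every log into one time-ordered stream; a minimal sketch, assuming lines begin with an ISO-8601 timestamp (the real script's parsing may differ):

```python
# Hypothetical sketch of generate_timeline.py-style merging; timestamp format assumed.
import sys
from datetime import datetime
from pathlib import Path


def timeline(log_dir: str, start: str, end: str) -> list[tuple[datetime, str, str]]:
    """Collect in-window events from every *.log file and sort chronologically."""
    lo, hi = datetime.fromisoformat(start), datetime.fromisoformat(end)
    events = []
    for path in Path(log_dir).glob("*.log"):
        for line in path.read_text(errors="replace").splitlines():
            try:
                ts = datetime.fromisoformat(line[:19])  # "YYYY-MM-DDTHH:MM:SS" prefix
            except ValueError:
                continue  # skip lines without a leading timestamp
            if lo <= ts <= hi:
                events.append((ts, path.name, line.strip()))
    return sorted(events)  # chronological merge across all files


if __name__ == "__main__":
    for ts, source, line in timeline(sys.argv[1], sys.argv[2], sys.argv[3]):
        print(f"{ts.isoformat()}  [{source}]  {line}")
```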
Key events to identify:
- First error occurrence
- Deployment or config changes
- Traffic patterns
- External dependency failures
### Step 4: Assess Impact
Quantify the damage:
| Metric | How to measure |
|---|---|
| Duration | End time - Start time |
| Users affected | Error logs, support tickets |
| Revenue impact | Failed transactions |
| Data loss | Check persistence layer |
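For example, duration falls straight out of the timeline boundaries (timestamps invented for illustration):

```python
from datetime import datetime

start = datetime.fromisoformat("2024-05-01T12:00:03")  # first error in the timeline
end = datetime.fromisoformat("2024-05-01T13:27:41")    # verified resolution
print(end - start)  # 1:27:38 of impact
```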
### Step 5: Root Cause Analysis
Apply systematic analysis:
- What changed? — Deployments, configs, dependencies
- 5 Whys — Keep asking "why" until the root cause is found
- Contributing factors — List all factors, not just primary cause
For common issues, see RUNBOOKS.md.
### Step 6: Resolve & Verify
- Implement fix — Code change, rollback, or config update
- Verify resolution — Run health checks, monitor metrics
- Communicate — Update stakeholders
Verification:

```bash
python scripts/check_health.py <url> --timeout 10

# Verify logs show no new errors
python scripts/analyze_logs.py <new_logs> --format text
```
### Step 7: Generate Postmortem

Create a blameless postmortem document:

```bash
python scripts/create_postmortem.py --title "Incident Title" --severity P1 --output postmortem.md
```
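The generator isn't reproduced in this manifest; in spirit it is a small templating script along these lines (the section headings below are assumptions based on common blameless-postmortem formats, not the skill's actual template in POSTMORTEM.md):

```python
# Hypothetical sketch of a create_postmortem.py-style generator; template assumed.
import argparse
from datetime import date

TEMPLATE = """\
# Postmortem: {title}

- **Severity:** {severity}
- **Date:** {today}
- **Status:** Draft

## Summary
<!-- One-paragraph, blameless description of what happened. -->

## Timeline
<!-- Paste key events from the timeline step. -->

## Root Cause
<!-- Primary cause plus contributing factors. -->

## Action Items
- [ ] <!-- Owner, due date -->
"""

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--title", required=True)
    parser.add_argument("--severity", choices=["P1", "P2", "P3"], required=True)
    parser.add_argument("--output", default="postmortem.md")
    args = parser.parse_args()
    with open(args.output, "w") as fh:
        fh.write(TEMPLATE.format(title=args.title, severity=args.severity,
                                 today=date.today().isoformat()))
    print(f"Wrote {args.output}")
```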
See POSTMORTEM.md for template and guidelines.
## Utility Scripts
| Script | Purpose |
|---|---|
| `analyze_logs.py` | Parse logs, find error patterns |
| `generate_timeline.py` | Create timeline from logs |
| `create_postmortem.py` | Generate postmortem template |
| `check_health.py` | Quick endpoint health check |
## Examples

### Example 1: Database Connection Exhaustion
User: "Our API is returning 500s, help me investigate"
- Run `check_health.py` → Confirms 500 errors
- Run `analyze_logs.py` on API logs → Finds "connection pool exhausted"
- Run `generate_timeline.py` → Shows spike after traffic increase
- Check RUNBOOKS.md → Database section has resolution
- Fix: Increase pool size, verify with health check
- Generate postmortem
### Example 2: Deployment Caused Regression
User: "Users reporting slow responses after deploy"
- Initial assessment: P2, latency issue
- Analyze logs → High latency in specific endpoint
- Timeline → Correlates with deployment time
- Root cause → N+1 query introduced in new code (illustrated after this list)
- Resolution → Rollback deployment
- Create postmortem with action items
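The N+1 regression in Example 2 is easier to spot once you've seen the shape. A self-contained `sqlite3` illustration, with a schema and data invented purely for the example:

```python
# Illustration of the N+1 query shape from Example 2 (invented schema and data).
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders(id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'lin');
    INSERT INTO orders VALUES (1, 1, 9.5), (2, 1, 3.0), (3, 2, 7.25);
""")

# N+1: one query for the users, then one more query per user inside the loop.
for user_id, name in db.execute("SELECT id, name FROM users"):
    total = db.execute("SELECT SUM(total) FROM orders WHERE user_id = ?",
                       (user_id,)).fetchone()[0]
    print(name, total)

# Fix: a single JOIN replaces the per-row queries.
for name, total in db.execute(
    "SELECT u.name, SUM(o.total) FROM users u "
    "JOIN orders o ON o.user_id = u.id GROUP BY u.id"
):
    print(name, total)
```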
## Related Skills
- `codebase-onboarding` — Understand service architecture first
- `api-docs-generator` — Document API for better debugging