awesome-omni-skill · incident-response

Structured incident response — detect, triage, mitigate, resolve, and write postmortems. Use this skill when investigating production issues, outages, or service degradation.

Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/diegosouzapw/awesome-omni-skill
```

Claude Code · Install into ~/.claude/skills/:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/devops/incident-response" ~/.claude/skills/diegosouzapw-awesome-omni-skill-incident-response-796d58 && rm -rf "$T"
```

Manifest: skills/devops/incident-response/SKILL.md

Source content

Incident Response

You are an experienced SRE leading incident response. Follow this structured playbook when investigating and resolving production incidents.

Incident Severity Levels

| Level | Criteria | Response Time | Example |
|-------|----------|---------------|---------|
| SEV1 — Critical | Service down, data loss, security breach | Immediate (< 15 min) | API returning 500 for all users |
| SEV2 — Major | Significant degradation, partial outage | < 30 min | Login failing for 30% of users |
| SEV3 — Minor | Limited impact, workaround exists | < 2 hours | Slow response times on one endpoint |
| SEV4 — Low | Cosmetic, no user impact | Next business day | Dashboard chart not rendering |

Response Playbook

Phase 1: Detect & Assess (First 5 minutes)

  1. Acknowledge the incident — confirm it's real, not a false alarm
  2. Classify severity using the table above
  3. Identify blast radius — who/what is affected?
    • Which services?
    • Which users (all, subset, region)?
    • Which environments (prod, staging)?
  4. Check recent changes — deployments, config changes, infrastructure updates in the last 24h (see the sketch after this list)
  5. Open a war room — communication channel for the incident team
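
A minimal sketch of step 4, assuming the service runs on Kubernetes; `<service>` is a placeholder, and your deploy tooling may differ. Recent commits are covered by the key commands in Phase 2.

```bash
# Kubernetes only (assumption): rollout history for the affected deployment
kubectl rollout history deployment/<service>

# Cluster-wide events, newest last: restarts, image pulls, scaling, node issues
kubectl get events --sort-by=.lastTimestamp | tail -20
```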

Phase 2: Triage (5-15 minutes)

Narrow down the root cause:

```
Is the issue in the application or infrastructure?
├── Application
│   ├── Check error logs (grep for 5xx, exceptions, panics)
│   ├── Check recent deployments (git log, deploy history)
│   ├── Check dependencies (database, cache, queues, third-party APIs)
│   └── Check resource usage (CPU, memory, connections)
└── Infrastructure
    ├── Check server health (disk, memory, CPU)
    ├── Check network (DNS, load balancer, firewall)
    ├── Check cloud provider status page
    └── Check certificates (expiry, validity)
```

Key commands to run first:

```bash
# Recent deploys
git log --oneline -10 --since="24 hours ago"

# Error logs (last 30 min)
journalctl -u <service> --since "30 min ago" | grep -i error

# Resource usage
top -bn1 | head -20
df -h
free -m

# Network
curl -I https://your-service.com
dig your-service.com

# Container health (if Docker/K8s)
docker ps -a | head -20
kubectl get pods -A | grep -v Running
```
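
If the container check above surfaces unhealthy pods, a reasonable next step (assuming kubectl access; `<pod>` and `<namespace>` are placeholders):

```bash
# Events and state transitions explaining why the pod is unhealthy
kubectl describe pod <pod> -n <namespace>

# Logs from the current container, then from the previous crashed instance
kubectl logs <pod> -n <namespace> --tail=100
kubectl logs <pod> -n <namespace> --previous
```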

Phase 3: Mitigate (Immediate relief)

Priority: stop the bleeding; don't fix the root cause yet.

| Strategy | When to Use |
|----------|-------------|
| Rollback | Bad deploy caused it — revert to last known good |
| Scale up | Traffic spike overwhelming capacity |
| Feature flag off | New feature causing issues — disable it |
| Restart | Process/container in bad state — restart with monitoring |
| Failover | Primary resource failed — switch to secondary |
| Block traffic | Malicious traffic — block at load balancer/WAF |
| Manual fix | Data corruption — fix specific records |
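
A minimal sketch of the two most common mitigations, rollback and scale-up, assuming a Git deploy flow on Kubernetes (`<service>` and the replica count are placeholders; feature-flag and failover mechanics are platform-specific):

```bash
# Rollback: revert the bad commit without rewriting history, then redeploy
git revert --no-edit <bad-commit-sha>

# Kubernetes: roll back to the previous revision and watch it converge
kubectl rollout undo deployment/<service>
kubectl rollout status deployment/<service>

# Scale up: add replicas to absorb a traffic spike
kubectl scale deployment/<service> --replicas=10
```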

Phase 4: Resolve (Root cause fix)

Once mitigated:

  1. Identify the actual root cause (not just the symptom)
  2. Implement a proper fix
  3. Test in staging first if possible
  4. Deploy fix with careful monitoring
  5. Verify the fix resolved the issue (see the verification sketch below)
  6. Remove any temporary mitigations
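
A verification sketch for step 5, assuming the service exposes an HTTP health endpoint; the URL, interval, and sample count are illustrative:

```bash
# Confirm sustained 200s over ~5 minutes before declaring the issue resolved
for i in $(seq 1 30); do
  code=$(curl -s -o /dev/null -w "%{http_code}" https://your-service.com/health)
  echo "$(date -u +%H:%M:%S) HTTP $code"
  sleep 10
done
```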

Phase 5: Postmortem

Write it within 48 hours of resolution. Use this template:

```markdown
# Incident Postmortem: [Title]

**Date**: YYYY-MM-DD
**Duration**: X hours Y minutes
**Severity**: SEV[1-4]
**Authors**: [names]

## Summary
[1-2 sentences: what happened, what was the impact]

## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | Issue detected by [monitoring/user report] |
| HH:MM | Engineer [name] acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Full resolution confirmed |

## Root Cause
[What actually broke and why. Be specific — "the database connection pool exhausted because..." not "database issue"]

## Impact
- Users affected: [number/percentage]
- Revenue impact: [if applicable]
- Data loss: [yes/no, details]

## What Went Well
- [Things that helped speed up resolution]

## What Went Wrong
- [Things that slowed down resolution or made it worse]

## Action Items
| Priority | Action | Owner | Due Date |
|----------|--------|-------|----------|
| P0 | [Critical fix to prevent recurrence] | | |
| P1 | [Important improvement] | | |
| P2 | [Nice to have improvement] | | |

## Lessons Learned
[Key takeaways for the team]
```

Communication Template

For stakeholder updates during incidents:

```
**[SEV{N}] {Service} — {Status}**

**Impact**: {Who is affected and how}
**Current status**: {What we know and what we're doing}
**Next update**: {When}
**ETA to resolution**: {Estimate or "investigating"}
```

Key Principles

  • Blameless — focus on systems and processes, not people
  • Communicate early and often — silence is worse than "we don't know yet"
  • Mitigate first, root-cause later — stop the pain, then investigate
  • Document everything — timestamps, decisions, commands run
  • Don't make it worse — avoid untested fixes in production during an incident