Vibecosystem incident-response-patterns
Incident severity classification, runbook templates, root cause analysis, and post-incident review
install
source · Clone the upstream repo
git clone https://github.com/vibeeval/vibecosystem
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/vibeeval/vibecosystem "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/incident-response-patterns" ~/.claude/skills/vibeeval-vibecosystem-incident-response-patterns && rm -rf "$T"
manifest:
skills/incident-response-patterns/SKILL.mdsource content
Incident Response Patterns
Severity Classification
| Level | Tanım | Response Time | Escalation |
|---|---|---|---|
| SEV-1 | Tam kesinti, data loss | 5 dk | Tüm ekip |
| SEV-2 | Major feature bozuk | 30 dk | On-call + lead |
| SEV-3 | Minor feature bozuk | 4 saat | On-call |
| SEV-4 | Kozmetik, low impact | Sonraki iş günü | Ticket |
Incident Response Protocol
1. DETECT → Alert tetiklendi / kullanıcı bildirdi 2. TRIAGE → Severity belirle, incident channel aç 3. ASSIGN → Incident Commander + Responder ata 4. CONTAIN → Bleeding'i durdur (rollback, feature flag off) 5. FIX → Root cause'u düzelt 6. VERIFY → Fix'i doğrula (monitoring, smoke test) 7. RESOLVE → Incident kapat, stakeholder bilgilendir 8. REVIEW → Post-incident review (48 saat içinde)
Runbook Template
# Runbook: [Service Name] - [Issue Type] ## Symptoms - Alert: [alert name] - Dashboard: [link] - Expected: [normal behavior] - Actual: [observed behavior] ## Diagnosis Steps 1. Check [metric/log/dashboard] 2. Run: `kubectl get pods -n production` 3. Check: `SELECT count(*) FROM errors WHERE ...` ## Mitigation 1. Quick fix: [rollback/restart/scale] 2. Feature flag: [disable feature X] 3. Redirect: [failover to backup] ## Escalation - L1: [on-call engineer] - L2: [team lead] - L3: [CTO/VP Eng]
Root Cause Analysis
5 Whys
Problem: API 500 hatası Why 1: Database connection timeout Why 2: Connection pool tükendi Why 3: Slow query connection'ları tutuyor Why 4: Missing index on frequently queried column Why 5: Schema change review process eksik → Action: Schema change PR'da EXPLAIN ANALYZE zorunlu
Post-Incident Review Template
## Incident Summary - Duration: [start - end] - Impact: [affected users/revenue] - Severity: SEV-[X] ## Timeline - HH:MM - Alert triggered - HH:MM - Incident declared - HH:MM - Root cause identified - HH:MM - Fix deployed - HH:MM - Incident resolved ## Root Cause [Detailed technical explanation] ## Action Items - [ ] [Fix] Add missing index (owner: @dev, due: date) - [ ] [Prevent] Add schema review step (owner: @lead) - [ ] [Detect] Add connection pool alert (owner: @sre)
Checklist
- Severity matrix tanımlı
- On-call rotation aktif
- Runbook'lar güncel
- Incident channel template hazır
- Post-incident review 48 saat içinde
- Action items tracked (Jira/Linear)
- Blameless culture enforced
- Alerting threshold'lar doğru
Anti-Patterns
- Blame culture (kişi değil, sistem düzelt)
- Runbook'suz servis
- Post-incident review atlamak
- Alert fatigue (çok fazla false positive)
- Hotfix without rollback plan