Skilllibrary incident-postmortem

Turns failures or regressions into structured root-cause, corrective-action, and prevention artifacts following blameless SRE practices. Trigger on "write postmortem", "incident review", "root cause analysis", "blameless retrospective". Do NOT use for process-doctor (diagnosing agent process failures) or error-handling (code-level error design).

install

source · Clone the upstream repo

git clone https://github.com/merceralex397-collab/skilllibrary

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/02-generated-repo-core/incident-postmortem" ~/.claude/skills/merceralex397-collab-skilllibrary-incident-postmortem && rm -rf "$T"

manifest: 02-generated-repo-core/incident-postmortem/SKILL.md

source content

Purpose

Write blameless postmortems following Google SRE practices: document what happened, identify contributing factors (not blame), and produce actionable prevention items. The goal is organizational learning, not punishment—if people fear postmortems, incidents get hidden.

When to use this skill

Use when:

User-visible downtime or degradation occurred
Data loss of any kind happened
On-call engineer had to intervene (rollback, traffic reroute)
Resolution took longer than SLO allows
Monitoring failed to detect the issue
Any stakeholder requests a postmortem

Do NOT use when:

Issue was caught in staging/preview (log it, but no formal postmortem)
No user impact and quick self-recovery
Root cause is already well-understood and tracked

Operating procedure

Establish timeline (facts only, no interpretation yet):

## Timeline (all times UTC)
- 14:23 - Deploy of commit abc123 to production
- 14:31 - First error alert fires (PagerDuty incident #1234)
- 14:35 - On-call acknowledges, begins investigation
- 14:42 - Root cause identified: database connection pool exhausted
- 14:45 - Rollback initiated
- 14:52 - Service restored, monitoring confirms recovery

Document impact:

## Impact
- **Duration**: 29 minutes (14:23 - 14:52)
- **Users affected**: ~12,000 (15% of active users)
- **Severity**: SEV2 - Partial service degradation
- **Business impact**: ~$X revenue loss, Y support tickets

Identify contributing factors (NOT root cause singular—incidents rarely have one cause):

## Contributing Factors
1. New feature increased database connections per request from 1 to 3
2. Connection pool size unchanged from original config (10 connections)
3. Load testing did not include new feature enabled
4. Monitoring alert threshold set too high to catch gradual degradation

Write blameless narrative: Use passive voice or system descriptions, not "Person X failed to...":
- ❌ "John didn't test the connection pool changes"
- ✅ "The testing process did not include connection pool behavior verification"

Define action items with owners and due dates:

## Action Items
| ID | Action | Owner | Priority | Due Date |
|----|--------|-------|----------|----------|
| 1 | Increase connection pool to 50 | @backend-team | P1 | 2024-01-15 |
| 2 | Add connection pool utilization metric | @sre | P1 | 2024-01-20 |
| 3 | Update load test to enable all features | @qa | P2 | 2024-01-30 |
| 4 | Lower alert threshold to 70% pool usage | @sre | P1 | 2024-01-15 |

Review and publish: Share with engineering team, not just incident participants. Learning should spread.

Output defaults

# Postmortem: [Incident Title]

**Date**: [YYYY-MM-DD]  
**Authors**: [names]  
**Status**: Draft | Reviewed | Published

## Summary
[2-3 sentence description of what happened and impact]

## Impact
- Duration: [X minutes/hours]
- Users affected: [number or percentage]
- Severity: [SEV1-4]

## Timeline
[Chronological events with timestamps]

## Contributing Factors
[Numbered list of factors that combined to cause the incident]

## Action Items
[Table with ID, action, owner, priority, due date]

## Lessons Learned
### What went well
- [e.g., Alerting fired within 8 minutes]
### What could be improved
- [e.g., Runbook was outdated]

## Supporting Information
- [Links to dashboards, logs, related incidents]

References

Failure handling

Blame creeping into language: Review and rewrite any sentence that names an individual as the cause; focus on process/system gaps
Action items too vague: Each item must be specific enough to verify completion; "improve monitoring" → "add metric X with alert at threshold Y"
No one owns action items: Assign owner at postmortem review meeting; unowned items don't get done
Postmortem not reviewed: Schedule review meeting within 72 hours of incident; stale postmortems lose context