Software_development_department postmortem-writing
Writes blameless postmortems with root cause analysis, incident timelines, contributing factors, and action items. Use when conducting incident reviews or when the user mentions postmortem, root cause analysis, or blameless review.
install
source · Clone the upstream repo
git clone https://github.com/tranhieutt/software_development_department
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/tranhieutt/software_development_department "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/postmortem-writing" ~/.claude/skills/tranhieutt-software-development-department-postmortem-writing && rm -rf "$T"
manifest:
.claude/skills/postmortem-writing/SKILL.mdsource content
Postmortem Writing
Blameless mindset: the non-obvious part
"Blame-focused: Who caused this? → Blameless: What conditions allowed this?"
Engineers don't fail — systems create conditions where failures become inevitable. The goal is improving systems, not punishing people.
Triggers (when to write one)
- SEV1/SEV2 incidents
- Customer-facing outage > 15 minutes
- Data loss or security incident
- Novel failure modes worth sharing
Timeline: Day 0 → Day 7
Day 0: Incident occurs Day 1-2: Draft postmortem (memory is freshest) Day 3-5: Postmortem meeting Day 5-7: Finalize + create tickets Week 2+: Action item completion
Standard template
# Postmortem: [Title] **Date**: YYYY-MM-DD | **Severity**: SEV2 | **Duration**: 47 min **Authors**: @alice, @bob | **Status**: Draft ## Executive Summary [2-3 sentences: what broke, impact, how resolved] **Impact**: [N customers, N minutes, revenue loss, no data loss] ## Timeline (UTC) | Time | Event | |-------|-------| | 14:23 | v2.3.4 deployed to production | | 14:31 | Alert: payment_error_rate > 5% | | 14:33 | On-call @alice acknowledges | | 14:45 | Root cause identified: DB connections | | 14:52 | Decision to rollback | | 15:10 | Rollback complete, error rate normalizing | | 15:18 | Service recovered | ## Root Cause Analysis ### What happened [Technical description of failure] ### 5 Whys - Why did service fail? → DB connections exhausted - Why exhausted? → Each request opened new connection - Why new connections? → Code bypassed connection pool - Why bypassed? → Developer unfamiliar with DB patterns - Why unfamiliar? → No documentation on connection management ### Contributing factors - Code review missed the infrastructure change - No integration tests for connection pool behavior - Staging traffic too low to expose the issue - Alert threshold too high (90%, should be 70%) ## What worked / what didn't | Worked | Didn't work | |---|---| | Alert fired within 8 min | Took 10 min to correlate with deployment | | Clear Grafana dashboard | No deployment-correlated alerting | | Fast rollback decision | No canary deployment | ## Action items | Priority | Action | Owner | Due | Ticket | |---|---|---|---|---| | P0 | Integration test for connection pool | @alice | 2024-01-22 | ENG-1234 | | P0 | Lower DB alert threshold to 70% | @bob | 2024-01-17 | OPS-567 | | P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 | | P2 | Evaluate canary deployment | @charlie | 2024-02-15 | ENG-1235 |
Quick template (SEV3, < 30 min)
# Quick Postmortem: [Title] **Date**: YYYY-MM-DD | **Duration**: 12 min | **Severity**: SEV3 **What happened**: Cache flush caused thundering herd — all requests missed cache simultaneously. **Timeline**: 10:00 flush → 10:02 alerts → 10:05 identified → 10:08 warming enabled → 10:12 normal **Root cause**: Full flush used for minor config update. **Fixes**: Immediate: enabled warming. Long-term: partial invalidation (ENG-999). **Lesson**: Never full-flush production cache; use targeted invalidation.
Meeting structure (60 min)
- Opening (5 min) — state blameless norms explicitly
- Timeline review (15 min) — chronological walkthrough
- Analysis (20 min) — what failed, why, what conditions allowed it
- Action items (15 min) — brainstorm → prioritize → assign owners
- Close (5 min) — confirm owners, schedule follow-up
Anti-patterns
| Anti-pattern | Why it fails |
|---|---|
| "Human error" as root cause | Always dig deeper — why did the system allow it? |
| Shallow analysis (1 why) | Doesn't prevent recurrence |
| No action items | Meeting was a waste of time |
| Unrealistic actions | Never completed |
| No follow-up tracking | Actions forgotten |
Output
Save to
docs/technical/postmortem-YYYY-MM-DD-[slug].md
Deliver: timeline + 1-sentence root cause + max 5 action items with owner and deadline