Software_development_department postmortem-writing

Writes blameless postmortems with root cause analysis, incident timelines, contributing factors, and action items. Use when conducting incident reviews or when the user mentions postmortem, root cause analysis, or blameless review.

install

source · Clone the upstream repo

git clone https://github.com/tranhieutt/software_development_department

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/tranhieutt/software_development_department "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/postmortem-writing" ~/.claude/skills/tranhieutt-software-development-department-postmortem-writing && rm -rf "$T"

manifest: .claude/skills/postmortem-writing/SKILL.md

source content

Postmortem Writing

Blameless mindset: the non-obvious part

"Blame-focused: Who caused this? → Blameless: What conditions allowed this?"

Engineers don't fail — systems create conditions where failures become inevitable. The goal is improving systems, not punishing people.

Triggers (when to write one)

SEV1/SEV2 incidents
Customer-facing outage > 15 minutes
Data loss or security incident
Novel failure modes worth sharing

Timeline: Day 0 → Day 7

Day 0:   Incident occurs
Day 1-2: Draft postmortem (memory is freshest)
Day 3-5: Postmortem meeting
Day 5-7: Finalize + create tickets
Week 2+: Action item completion

Standard template

# Postmortem: [Title]

**Date**: YYYY-MM-DD | **Severity**: SEV2 | **Duration**: 47 min
**Authors**: @alice, @bob | **Status**: Draft

## Executive Summary
[2-3 sentences: what broke, impact, how resolved]

**Impact**: [N customers, N minutes, revenue loss, no data loss]

## Timeline (UTC)

| Time  | Event |
|-------|-------|
| 14:23 | v2.3.4 deployed to production |
| 14:31 | Alert: payment_error_rate > 5% |
| 14:33 | On-call @alice acknowledges |
| 14:45 | Root cause identified: DB connections |
| 14:52 | Decision to rollback |
| 15:10 | Rollback complete, error rate normalizing |
| 15:18 | Service recovered |

## Root Cause Analysis

### What happened
[Technical description of failure]

### 5 Whys
- Why did service fail? → DB connections exhausted
- Why exhausted? → Each request opened new connection
- Why new connections? → Code bypassed connection pool
- Why bypassed? → Developer unfamiliar with DB patterns
- Why unfamiliar? → No documentation on connection management

### Contributing factors
- Code review missed the infrastructure change
- No integration tests for connection pool behavior
- Staging traffic too low to expose the issue
- Alert threshold too high (90%, should be 70%)

## What worked / what didn't

| Worked | Didn't work |
|---|---|
| Alert fired within 8 min | Took 10 min to correlate with deployment |
| Clear Grafana dashboard | No deployment-correlated alerting |
| Fast rollback decision | No canary deployment |

## Action items

| Priority | Action | Owner | Due | Ticket |
|---|---|---|---|---|
| P0 | Integration test for connection pool | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower DB alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P2 | Evaluate canary deployment | @charlie | 2024-02-15 | ENG-1235 |

Quick template (SEV3, < 30 min)

# Quick Postmortem: [Title]
**Date**: YYYY-MM-DD | **Duration**: 12 min | **Severity**: SEV3

**What happened**: Cache flush caused thundering herd — all requests missed cache simultaneously.
**Timeline**: 10:00 flush → 10:02 alerts → 10:05 identified → 10:08 warming enabled → 10:12 normal
**Root cause**: Full flush used for minor config update.
**Fixes**: Immediate: enabled warming. Long-term: partial invalidation (ENG-999).
**Lesson**: Never full-flush production cache; use targeted invalidation.

Meeting structure (60 min)

Opening (5 min) — state blameless norms explicitly
Timeline review (15 min) — chronological walkthrough
Analysis (20 min) — what failed, why, what conditions allowed it
Action items (15 min) — brainstorm → prioritize → assign owners
Close (5 min) — confirm owners, schedule follow-up

Anti-patterns

Anti-pattern	Why it fails
"Human error" as root cause	Always dig deeper — why did the system allow it?
Shallow analysis (1 why)	Doesn't prevent recurrence
No action items	Meeting was a waste of time
Unrealistic actions	Never completed
No follow-up tracking	Actions forgotten

Output

Save to

docs/technical/postmortem-YYYY-MM-DD-[slug].md

Deliver: timeline + 1-sentence root cause + max 5 action items with owner and deadline