Antigravity-awesome-skills postmortem-writing
Write effective blameless postmortems with root cause analysis, timelines, and action items. Use when conducting incident reviews, writing postmortem documents, or improving incident response proce...
install
source · Clone the upstream repo
git clone https://github.com/benjaminasterA/antigravity-awesome-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benjaminasterA/antigravity-awesome-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/postmortem-writing" ~/.claude/skills/benjaminastera-antigravity-awesome-skills-postmortem-writing && rm -rf "$T"
manifest:
skills/postmortem-writing/SKILL.mdsource content
Postmortem Writing
Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
Do not use this skill when
- The task is unrelated to postmortem writing
- You need a different domain or tool outside this scope
Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open
.resources/implementation-playbook.md
Use this skill when
- Conducting post-incident reviews
- Writing postmortem documents
- Facilitating blameless postmortem meetings
- Identifying root causes and contributing factors
- Creating actionable follow-up items
- Building organizational learning culture
Core Concepts
1. Blameless Culture
| Blame-Focused | Blameless |
|---|---|
| "Who caused this?" | "What conditions allowed this?" |
| "Someone made a mistake" | "The system allowed this mistake" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |
| Fear of speaking up | Psychological safety |
2. Postmortem Triggers
- SEV1 or SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
- Incidents requiring unusual intervention
Quick Start
Postmortem Timeline
Day 0: Incident occurs Day 1-2: Draft postmortem document Day 3-5: Postmortem meeting Day 5-7: Finalize document, create tickets Week 2+: Action item completion Quarterly: Review patterns across incidents
Templates
Template 1: Standard Postmortem
# Postmortem: [Incident Title] **Date**: 2024-01-15 **Authors**: @alice, @bob **Status**: Draft | In Review | Final **Incident Severity**: SEV2 **Incident Duration**: 47 minutes ## Executive Summary On January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits. **Impact**: - 12,000 customers unable to complete purchases - Estimated revenue loss: $45,000 - 847 support tickets created - No data loss or security implications ## Timeline (All times UTC) | Time | Event | |------|-------| | 14:23 | Deployment v2.3.4 completed to production | | 14:31 | First alert: `payment_error_rate > 5%` | | 14:33 | On-call engineer @alice acknowledges alert | | 14:35 | Initial investigation begins, error rate at 23% | | 14:41 | Incident declared SEV2, @bob joins | | 14:45 | Database connection exhaustion identified | | 14:52 | Decision to rollback deployment | | 14:58 | Rollback to v2.3.3 initiated | | 15:10 | Rollback complete, error rate dropping | | 15:18 | Service fully recovered, incident resolved | ## Root Cause Analysis ### What Happened The v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections. ### Why It Happened 1. **Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls. 2. **Contributing Factors**: - Code review did not catch the connection handling change - No integration tests specifically for connection pool behavior - Staging environment has lower traffic, masking the issue - Database connection metrics alert threshold was too high (90%) 3. **5 Whys Analysis**: - Why did the service fail? → Database connections exhausted - Why were connections exhausted? → Each request opened new connection - Why did each request open new connection? → Code bypassed connection pool - Why did code bypass connection pool? → Developer unfamiliar with codebase patterns - Why was developer unfamiliar? → No documentation on connection management patterns ### System Diagram
[Client] → [Load Balancer] → [Payment Service] → [Database] ↓ Connection Pool (broken) ↓ Direct connections (cause)
## Detection ### What Worked - Error rate alert fired within 8 minutes of deployment - Grafana dashboard clearly showed connection spike - On-call response was swift (2 minute acknowledgment) ### What Didn't Work - Database connection metric alert threshold too high - No deployment-correlated alerting - Canary deployment would have caught this earlier ### Detection Gap The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster. ## Response ### What Worked - On-call engineer quickly identified database as the issue - Rollback decision was made decisively - Clear communication in incident channel ### What Could Be Improved - Took 10 minutes to correlate issue with recent deployment - Had to manually check deployment history - Rollback took 12 minutes (could be faster) ## Impact ### Customer Impact - 12,000 unique customers affected - Average impact duration: 35 minutes - 847 support tickets (23% of affected users) - Customer satisfaction score dropped 12 points ### Business Impact - Estimated revenue loss: $45,000 - Support cost: ~$2,500 (agent time) - Engineering time: ~8 person-hours ### Technical Impact - Database primary experienced elevated load - Some replica lag during incident - No permanent damage to systems ## Lessons Learned ### What Went Well 1. Alerting detected the issue before customer reports 2. Team collaborated effectively under pressure 3. Rollback procedure worked smoothly 4. Communication was clear and timely ### What Went Wrong 1. Code review missed critical change 2. Test coverage gap for connection pooling 3. Staging environment doesn't reflect production traffic 4. Alert thresholds were not tuned properly ### Where We Got Lucky 1. Incident occurred during business hours with full team available 2. Database handled the load without failing completely 3. No other incidents occurred simultaneously ## Action Items | Priority | Action | Owner | Due Date | Ticket | |----------|--------|-------|----------|--------| | P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 | | P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 | | P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 | | P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 | | P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 | | P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 | ## Appendix ### Supporting Data #### Error Rate Graph [Link to Grafana dashboard snapshot] #### Database Connection Graph [Link to metrics] ### Related Incidents - 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42) ### References - Connection Pool Best Practices - Deployment Runbook
Template 2: 5 Whys Analysis
# 5 Whys Analysis: [Incident] ## Problem Statement Payment service experienced 47-minute outage due to database connection exhaustion. ## Analysis ### Why #1: Why did the service fail? **Answer**: Database connections were exhausted, causing all new requests to fail. **Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests. --- ### Why #2: Why were database connections exhausted? **Answer**: Each incoming request opened a new database connection instead of using the connection pool. **Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`. --- ### Why #3: Why did the code bypass the connection pool? **Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method. **Evidence**: PR #1234 shows the change, made while fixing a different bug. --- ### Why #4: Why wasn't this caught in code review? **Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change. **Evidence**: Review comments only discuss business logic. --- ### Why #5: Why isn't there a safety net for this type of change? **Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns. **Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections. ## Root Causes Identified 1. **Primary**: Missing automated tests for infrastructure behavior 2. **Secondary**: Insufficient documentation of architectural patterns 3. **Tertiary**: Code review checklist doesn't include infrastructure considerations ## Systemic Improvements | Root Cause | Improvement | Type | |------------|-------------|------| | Missing tests | Add infrastructure behavior tests | Prevention | | Missing docs | Document connection patterns | Prevention | | Review gaps | Update review checklist | Detection | | No canary | Implement canary deployments | Mitigation |
Template 3: Quick Postmortem (Minor Incidents)
# Quick Postmortem: [Brief Title] **Date**: 2024-01-15 | **Duration**: 12 min | **Severity**: SEV3 ## What Happened API latency spiked to 5s due to cache miss storm after cache flush. ## Timeline - 10:00 - Cache flush initiated for config update - 10:02 - Latency alerts fire - 10:05 - Identified as cache miss storm - 10:08 - Enabled cache warming - 10:12 - Latency normalized ## Root Cause Full cache flush for minor config update caused thundering herd. ## Fix - Immediate: Enabled cache warming - Long-term: Implement partial cache invalidation (ENG-999) ## Lessons Don't full-flush cache in production; use targeted invalidation.
Facilitation Guide
Running a Postmortem Meeting
## Meeting Structure (60 minutes) ### 1. Opening (5 min) - Remind everyone of blameless culture - "We're here to learn, not to blame" - Review meeting norms ### 2. Timeline Review (15 min) - Walk through events chronologically - Ask clarifying questions - Identify gaps in timeline ### 3. Analysis Discussion (20 min) - What failed? - Why did it fail? - What conditions allowed this? - What would have prevented it? ### 4. Action Items (15 min) - Brainstorm improvements - Prioritize by impact and effort - Assign owners and due dates ### 5. Closing (5 min) - Summarize key learnings - Confirm action item owners - Schedule follow-up if needed ## Facilitation Tips - Keep discussion on track - Redirect blame to systems - Encourage quiet participants - Document dissenting views - Time-box tangents
Anti-Patterns to Avoid
| Anti-Pattern | Problem | Better Approach |
|---|---|---|
| Blame game | Shuts down learning | Focus on systems |
| Shallow analysis | Doesn't prevent recurrence | Ask "why" 5 times |
| No action items | Waste of time | Always have concrete next steps |
| Unrealistic actions | Never completed | Scope to achievable tasks |
| No follow-up | Actions forgotten | Track in ticketing system |
Best Practices
Do's
- Start immediately - Memory fades fast
- Be specific - Exact times, exact errors
- Include graphs - Visual evidence
- Assign owners - No orphan action items
- Share widely - Organizational learning
Don'ts
- Don't name and shame - Ever
- Don't skip small incidents - They reveal patterns
- Don't make it a blame doc - That kills learning
- Don't create busywork - Actions should be meaningful
- Don't skip follow-up - Verify actions completed