Awesome-claude-corporate-skills incident-postmortem
Write blameless incident postmortems with timeline reconstruction, root cause analysis, action items, and preventive measures
git clone https://github.com/w95/awesome-claude-corporate-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/w95/awesome-claude-corporate-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/07-operations/incident-postmortem" ~/.claude/skills/w95-awesome-claude-corporate-skills-incident-postmortem && rm -rf "$T"
07-operations/incident-postmortem/SKILL.md

Incident Postmortem Builder
Overview
Create blameless incident postmortems that transform operational disruptions into learning opportunities. These documents focus on system failures and process gaps, not individual blame, enabling continuous improvement and preventing recurrence.
Core Principles
- Blameless: Focus on systems, not people. "Why did this happen?" not "Who screwed up?"
- Psychological Safety: Team members must feel safe discussing root causes without fear
- Data-Driven: Base findings on logs, metrics, and facts, not assumptions
- Action-Oriented: Every finding leads to actionable improvements
- Learning Culture: Treat incidents as valuable learning events, not failures
- Transparency: Share findings broadly; communicate changes to prevent similar incidents
Timeline Reconstruction
Create a detailed chronology of events:
Time (UTC) | Who | What | Evidence | Context
-----------|-----|------|----------|--------
2024-02-15 14:32 | Jenkins | Deploy v2.1.3 (buggy) | Logs | Automated Friday deploy
14:35 | Customer | Website errors | CloudFront | 500 errors reported
14:37 | On-call | PagerDuty alert | Alert | Error rate exceeded threshold
14:42 | Eng team | Investigation starts | Slack #incidents | Identified deploy cause
14:55 | Lead | Rollback initiated | Logs | Reverted to v2.1.2
15:02 | On-call | Error rate normal | Metrics | Customers back to normal
15:30 | Team | Root cause meeting | Notes | Identified root cause
Timeline Template:
- T+0 (Alert): When the first alert fired
- T+X (Detection): When the incident was recognized
- T+Y (Communication): When stakeholders were notified
- T+Z (Mitigation): When the incident owner began mitigating
- T+N (Resolution): When the system returned to normal
- Duration: Total time from detection to resolution
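The T+X values above are simple timestamp differences; a minimal Python sketch using the timestamps from the example table (variable names are illustrative):

```python
from datetime import datetime, timezone

# Timestamps from the example timeline above (all UTC).
fmt = "%Y-%m-%d %H:%M"
detected  = datetime.strptime("2024-02-15 14:35", fmt).replace(tzinfo=timezone.utc)
alerted   = datetime.strptime("2024-02-15 14:37", fmt).replace(tzinfo=timezone.utc)
mitigated = datetime.strptime("2024-02-15 14:55", fmt).replace(tzinfo=timezone.utc)
resolved  = datetime.strptime("2024-02-15 15:02", fmt).replace(tzinfo=timezone.utc)

# Derive the template's T+X values.
print("Time to alert:     ", alerted - detected)    # 0:02:00
print("Time to mitigation:", mitigated - detected)  # 0:20:00
print("Duration:          ", resolved - detected)   # 0:27:00
```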
Root Cause Analysis (5 Whys)
Go beyond the obvious cause to find systemic issues:
Incident: Website down for 28 minutes

Why 1: Why did website go down?
Answer: Deployment v2.1.3 contained a bug causing infinite loop in auth service

Why 2: Why did the bug reach production?
Answer: Code review missed the issue; test suite didn't catch it

Why 3: Why didn't test suite catch the infinite loop?
Answer: Load/stress tests only run occasionally; not part of standard CI pipeline

Why 4: Why aren't load tests mandatory in CI?
Answer: Historically slow; team prioritized speed over reliability

Why 5: Why does team optimize for deploy speed over testing?
Answer: Pressure to ship features fast; no documented standard for testing rigor

ROOT CAUSE: Process gap - no mandatory load testing in CI; pressure to ship
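For teams that track postmortems programmatically, a why chain is just an ordered list of question/answer pairs; a minimal sketch (the `Why` dataclass and the person-word check are illustrative, not standard tooling):

```python
from dataclasses import dataclass

@dataclass
class Why:
    question: str
    answer: str

# Part of the chain from the example above, in order.
chain = [
    Why("Why did website go down?",
        "Deployment v2.1.3 contained a bug causing infinite loop in auth service"),
    Why("Why did the bug reach production?",
        "Code review missed the issue; test suite didn't catch it"),
    Why("Why does team optimize for deploy speed over testing?",
        "Pressure to ship features fast; no documented standard for testing rigor"),
]

# Crude "stopped too early" check: the final answer should name a process
# or system gap, not a person (see the Avoid list below).
final_answer = chain[-1].answer.lower()
if any(word in final_answer for word in ("developer", "operator", "engineer")):
    print("Warning: root cause names a person; keep asking why.")
```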
Avoid:
- Stopping too early ("operator didn't notice error")
- Human error as root cause ("developer made a mistake")
- Vague systemic findings that no one can act on
Focus on:
- Process failures
- Monitoring gaps
- Communication breakdowns
- Knowledge gaps
- Tool limitations
- Architectural weaknesses
Contributing Factors (Swiss Cheese Model)
Most incidents involve multiple failures aligning:
Incident: Late-night 30-minute outage

Contributing Factors:
1. Code change made Friday afternoon (rush to deploy before weekend)
2. No automated rollback capability (manual process)
3. On-call engineer had weak knowledge of new code (hired 3 weeks ago)
4. No load test coverage for auth service changes (technical debt)
5. Monitoring alert threshold set too high (missed early warning)
6. Deployment not staged; went straight to production (process gap)
7. No change advisory board approval (governance gap)

Any ONE of these alone wouldn't have caused the incident. Combined: a 30-minute outage.
Root Cause vs. Proximate Cause
Proximate Cause (immediate cause):
- Infinite loop in authentication code
- Deployment that shouldn't have happened
- Missing monitoring alert
Root Cause (systemic failure):
- Code review process insufficient for critical changes
- Deployment process lacked staged/canary deployment
- Testing strategy doesn't include stress tests
- Knowledge gap in on-call team for recent changes
Focus postmortem on root causes, not proximate causes.
Action Items (Follow-up)
Structure: [Priority] | [What] | [Why] | [Owner] | [Due Date] | [Status]
Immediate Actions (0-7 days)
Priority | What | Why | Owner | Due Date | Status
---------|------|-----|-------|----------|-------
CRITICAL | Deploy hotfix for infinite loop | Prevent recurrence | Sarah | 2024-02-15 | DONE
HIGH | Document code change impact | Knowledge transfer | John | 2024-02-16 | IN PROGRESS
MEDIUM | Post-incident communication to customers | Transparency | PM | 2024-02-15 | DONE
Short-term Actions (1-4 weeks)
Priority | What | Why | Owner | Due Date | Status
---------|------|-----|-------|----------|-------
HIGH | Implement automatic canary deployment | Catch issues pre-production | DevOps | 2024-03-01 | PENDING
HIGH | Add auth load tests to CI pipeline | Catch performance issues early | QA | 2024-03-01 | PENDING
MEDIUM | Onboard new on-call engineer on recent changes | Knowledge gap closure | Tech Lead | 2024-02-28 | IN PROGRESS
Long-term Actions (1-3 months)
Priority | What | Why | Owner | Due Date | Status
---------|------|-----|-------|----------|-------
MEDIUM | Implement automated rollback capability | Faster recovery time | Arch | 2024-04-15 | PENDING
LOW | Review change advisory board process | Governance improvement | Ops | 2024-05-01 | PENDING
LOW | Schedule quarterly load testing for critical services | Proactive risk management | Perf | 2024-06-01 | PENDING
SMART Action Items:
- Specific: What exactly needs to be done?
- Measurable: How will we know it's complete?
- Assignable: Who owns this?
- Realistic: Can it actually be done?
- Time-bound: When is it due?
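The Structure line above maps naturally onto a small record type, which also makes the monthly action-item review (described under Distribution & Follow-up) easy to automate; a minimal sketch, with `ActionItem` and its fields as illustrative names rather than a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    priority: str   # CRITICAL / HIGH / MEDIUM / LOW
    what: str       # Specific: what exactly needs to be done
    why: str        # Ties the item back to a finding
    owner: str      # Assignable: one named owner
    due: date       # Time-bound
    status: str     # PENDING / IN PROGRESS / DONE

items = [
    ActionItem("HIGH", "Add auth load tests to CI pipeline",
               "Catch performance issues early", "QA", date(2024, 3, 1), "PENDING"),
    ActionItem("MEDIUM", "Implement automated rollback capability",
               "Faster recovery time", "Arch", date(2024, 4, 15), "PENDING"),
]

# Simple overdue report for the monthly action-item review.
today = date(2024, 3, 15)  # or date.today()
for item in items:
    if item.status != "DONE" and item.due < today:
        print(f"OVERDUE: [{item.priority}] {item.what} (owner: {item.owner})")
```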
Preventive Measures
Prevention Strategy: How do we prevent this type of incident?
Process Changes:
- Implement staged/canary deployments for critical services
- Add code review requirement for auth service changes
- Require load test passing for critical path changes
Monitoring & Alerting:
- Lower error rate alert threshold (early warning)
- Add CPU/memory alerts for auth service
- Add canary endpoint synthetic monitoring
Automation:
- Automatic rollback if error rate >2% (see the sketch after this list)
- Load test gate in CI pipeline (mandatory, not optional)
- Automated chaos engineering tests weekly
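The automatic-rollback item above boils down to a sustained-threshold check on an error-rate metric; a minimal sketch of that decision loop, assuming hypothetical `get_error_rate()` and `rollback()` hooks wired to your metrics backend and deploy tooling:

```python
import time

ERROR_RATE_THRESHOLD = 0.02   # 2%, per the preventive measure above
CONSECUTIVE_SAMPLES = 3       # require a sustained breach to avoid flapping

def get_error_rate() -> float:
    """Hypothetical: query your metrics backend for the 5xx rate over the last minute."""
    raise NotImplementedError

def rollback() -> None:
    """Hypothetical: trigger a redeploy of the previous known-good version."""
    raise NotImplementedError

def watch_deploy(duration_s: int = 600, interval_s: int = 30) -> None:
    """Watch the error rate after a deploy; roll back on a sustained breach."""
    breaches = 0
    for _ in range(duration_s // interval_s):
        if get_error_rate() > ERROR_RATE_THRESHOLD:
            breaches += 1
            if breaches >= CONSECUTIVE_SAMPLES:
                rollback()
                return
        else:
            breaches = 0
        time.sleep(interval_s)
```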
Documentation & Training:
- Document architecture of auth service
- Create runbook for auth service incidents
- Schedule knowledge transfer session for on-call team
Organizational:
- Remove deadline pressure; don't deploy Friday afternoon
- Add on-call engineer to code reviews of critical services
- Establish incident SLA: detection to resolution <15 minutes for P0 incidents
Template Structure
INCIDENT POSTMORTEM
Incident ID: INC-2024-047
Date: 2024-02-15
Duration: 28 minutes (14:35 UTC - 15:03 UTC)
Impact: Website unavailable; 0.5M page requests failed
Severity: P1 (Critical)
INCIDENT SUMMARY [1 paragraph overview of what happened and impact]
TIMELINE [Chronological events table]
ROOT CAUSE ANALYSIS Primary Root Cause: [High-level finding]
5 Whys Analysis: [Why chain]
Contributing Factors: [List of systemic issues]
IMPACT ANALYSIS
- Customers Affected: [Number or percentage]
- Duration: [Minutes]
- Revenue Impact: [If quantifiable]
- Reputation Impact: [Qualitative assessment]
- Data Loss: [Yes/No, details if yes]
DETECTION & RESPONSE
- Detection Time: [Minutes to detect]
- Response Time: [Minutes to start mitigation]
- Resolution Time: [Minutes to full recovery]
- Response Quality: [Smooth/Some delays/Chaotic - why?]
WHAT WENT WELL
- [Good thing 1]: Enabled [outcome]
- [Good thing 2]: Enabled [outcome]
- [Good thing 3]: Enabled [outcome]
Recognize excellent work; reinforce good behaviors
WHAT COULD BE BETTER
- [Gap 1]: Impact was [consequence]
- [Gap 2]: Impact was [consequence]
- [Gap 3]: Impact was [consequence]
ACTION ITEMS [Immediate, short-term, long-term actions with owners and dates]
PREVENTIVE MEASURES [How we prevent this incident class in future]
APPENDICES
- Error logs (anonymized)
- Customer communication
- Monitoring graphs during incident
- Architecture diagram
- Related incidents (historical patterns)
Postmortem Facilitation
Blameless Meeting Principles:
- Start with context, not blame: "Here's what was happening at 14:30 UTC..."
- Use neutral language: "Code changed" not "Code was broken"
- Ask curious questions: "What were you seeing on your screen?" not "Why didn't you check the logs?"
- Encourage storytelling: Let people describe their experience; narrative flow
- Capture assumptions: "I assumed..." statements reveal knowledge gaps
- No hierarchy: The on-call engineer's observations are valued the same as the CTO's
- Record decisions: Why did we choose rollback vs. fix? Document the thinking
- Record learnings: What surprised people? What did they learn?
Participants:
- On-call engineer (incident responder)
- Service owner
- DevOps/Infrastructure team
- Product/Business owner
- Facilitator (experienced, neutral party)
Meeting Duration: 30-60 minutes maximum
When to Hold: Within 48 hours of incident resolution (while details are fresh)
Distribution & Follow-up
- Share Widely: The postmortem is an internal learning tool; share it with the full engineering org
- Executive Summary: One-page summary for leadership
- Customer Communication: Transparency about what happened and prevention measures
- Process Review: Monthly review of open action items
- Trend Analysis: Quarterly review - are we preventing incident classes or just firefighting?
Preventing Similar Incidents
Incident Class Tracking:
- Authentication failures
- Database performance degradation
- Memory leaks
- Configuration errors
- Dependency failures
- Deployment failures
If the same incident class occurs twice: escalate prevention measures.
If the same incident occurs three times: organizational escalation (management review).
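That escalation rule is easy to check mechanically against an incident log; a minimal sketch, assuming a hypothetical list of (incident_id, incident_class) records:

```python
from collections import Counter

# Hypothetical incident log: (incident_id, incident_class)
incidents = [
    ("INC-2024-031", "deployment-failure"),
    ("INC-2024-047", "deployment-failure"),
    ("INC-2024-052", "configuration-error"),
]

counts = Counter(cls for _, cls in incidents)
for cls, n in counts.items():
    if n >= 3:
        print(f"{cls}: organizational escalation (management review)")
    elif n == 2:
        print(f"{cls}: escalate prevention measures")
```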
Use this skill to: Transform incidents into learning opportunities, improve system resilience, and build a psychologically safe incident response culture.