Awesome-claude-corporate-skills incident-postmortem

Write blameless incident postmortems with timeline reconstruction, root cause analysis, action items, and preventive measures

install
source · Clone the upstream repo
git clone https://github.com/w95/awesome-claude-corporate-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/w95/awesome-claude-corporate-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/07-operations/incident-postmortem" ~/.claude/skills/w95-awesome-claude-corporate-skills-incident-postmortem && rm -rf "$T"
manifest: 07-operations/incident-postmortem/SKILL.md
source content

Incident Postmortem Builder

Overview

Create blameless incident postmortems that transform operational disruptions into learning opportunities. These documents focus on system failures and process gaps, not individual blame, enabling continuous improvement and preventing recurrence.

Core Principles

  1. Blameless: Focus on systems, not people. "Why did this happen?" not "Who screwed up?"
  2. Psychological Safety: Team members must feel safe discussing root causes without fear
  3. Data-Driven: Base findings on logs, metrics, and facts, not assumptions
  4. Action-Oriented: Every finding leads to actionable improvements
  5. Learning Culture: Treat incidents as valuable learning events, not failures
  6. Transparency: Share findings broadly; communicate changes to prevent similar incidents

Timeline Reconstruction

Create a detailed chronology of events:

Time (UTC) | Who | What | Evidence | Context
-----------|-----|------|----------|----------
2024-02-15 14:32 | Jenkins | Deploy v2.1.3 (buggy) | Logs | Automated Friday deploy
14:35 | Customer | Website errors | CloudFront | 500 errors reported
14:37 | On-call | PagerDuty alert | Alert | Error rate exceeded threshold
14:42 | Eng team | Investigation starts | Slack #incidents | Identified deploy cause
14:55 | Lead | Rollback initiated | Logs | Reverted to v2.1.2
15:02 | On-call | Error rate normal | Metrics | Customers back to normal
15:30 | Team | Root cause meeting | Notes | Identified root cause

Timeline Template:

  • T+0 (Alert): When the first alert fired
  • T+X (Detection): When the event was recognized as an incident
  • T+Y (Communication): When stakeholders were notified
  • T+Z (Mitigation): When the incident owner took corrective action
  • T+N (Resolution): When the system returned to normal
  • Duration: Total time from detection to resolution
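Once timestamps are collected, the T+ offsets and duration can be computed mechanically rather than by hand. A minimal sketch; the event names and times are illustrative, taken from the example timeline above:

```python
from datetime import datetime

def timeline_offsets(events, anchor="alert"):
    """Return each event's offset in whole minutes from the anchor event."""
    t0 = events[anchor]
    return {name: int((ts - t0).total_seconds() // 60) for name, ts in events.items()}

# Illustrative timestamps from the example timeline (all UTC)
events = {
    "alert": datetime(2024, 2, 15, 14, 35),
    "detection": datetime(2024, 2, 15, 14, 37),
    "mitigation": datetime(2024, 2, 15, 14, 55),
    "resolution": datetime(2024, 2, 15, 15, 3),
}

for name, minutes in timeline_offsets(events).items():
    print(f"T+{minutes}m  {name}")
```

Printing the offsets this way also makes duration arithmetic errors (a common postmortem mistake) easy to spot.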

Root Cause Analysis (5 Whys)

Go beyond the obvious cause to find systemic issues:

Incident: Website down for 28 minutes

Why 1: Why did website go down?
Answer: Deployment v2.1.3 contained a bug causing infinite loop in auth service

Why 2: Why did the bug reach production?
Answer: Code review missed the issue; test suite didn't catch it

Why 3: Why didn't test suite catch the infinite loop?
Answer: Load/stress tests only run occasionally; not part of standard CI pipeline

Why 4: Why aren't load tests mandatory in CI?
Answer: Load tests were historically slow to run; the team prioritized deploy speed over reliability

Why 5: Why does team optimize for deploy speed over testing?
Answer: Pressure to ship features fast; no documented standard for testing rigor

ROOT CAUSE: Process gap - no mandatory load testing in CI; pressure to ship

Avoid:

  • Stopping too early ("operator didn't notice error")
  • Human error as root cause ("developer made a mistake")
  • Vague findings that don't point to a fixable systemic issue

Focus on:

  • Process failures
  • Monitoring gaps
  • Communication breakdowns
  • Knowledge gaps
  • Tool limitations
  • Architectural weaknesses

Contributing Factors (Swiss Cheese Model)

Most incidents involve multiple failures aligning:

Incident: Friday-afternoon 28-minute outage

Contributing Factors:
1. Code change made Friday afternoon (rush to deploy before weekend)
2. No automated rollback capability (manual process)
3. On-call engineer had weak knowledge of new code (hired 3 weeks ago)
4. No load test coverage for auth service changes (technical debt)
5. Monitoring alert threshold set too high (missed early warning)
6. Deployment not staged; went straight to production (process gap)
7. No change advisory board approval (governance gap)

Any ONE of these factors alone wouldn't have caused the incident. Combined, they produced a 28-minute outage.

Root Cause vs. Proximate Cause

Proximate Cause (immediate cause):

  • Infinite loop in authentication code
  • Deployment that shouldn't have happened
  • Missing monitoring alert

Root Cause (systemic failure):

  • Code review process insufficient for critical changes
  • Deployment process lacked staged/canary deployment
  • Testing strategy doesn't include stress tests
  • Knowledge gap in on-call team for recent changes

Focus postmortem on root causes, not proximate causes.

Action Items (Follow-up)

Structure: [Priority] | [What] | [Why] | [Owner] | [Due Date] | [Status]

Immediate Actions (0-7 days)

CRITICAL | Deploy hotfix for infinite loop | Prevent recurrence | Sarah | 2024-02-15 | DONE
HIGH | Document code change impact | Knowledge transfer | John | 2024-02-16 | IN PROGRESS
MEDIUM | Post-incident communication to customers | Transparency | PM | 2024-02-15 | DONE

Short-term Actions (1-4 weeks)

HIGH | Implement automatic canary deployment | Catch issues pre-production | DevOps | 2024-03-01 | PENDING
HIGH | Add auth load tests to CI pipeline | Catch performance issues early | QA | 2024-03-01 | PENDING
MEDIUM | Onboard new on-call engineer on recent changes | Knowledge gap closure | Tech Lead | 2024-02-28 | IN PROGRESS

Long-term Actions (1-3 months)

MEDIUM | Implement automated rollback capability | Faster recovery time | Arch | 2024-04-15 | PENDING
LOW | Review change advisory board process | Governance improvement | Ops | 2024-05-01 | PENDING
LOW | Schedule quarterly load testing for critical services | Proactive risk management | Perf | 2024-06-01 | PENDING

SMART Action Items:

  • Specific: What exactly needs to be done?
  • Measurable: How will we know it's complete?
  • Assignable: Who owns this?
  • Realistic: Can it actually be done?
  • Time-bound: When is it due?
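Capturing these fields in a structured record makes the monthly review of open action items scriptable. A minimal sketch, with field names chosen to mirror the Structure line above (not part of any existing tool):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    priority: str            # CRITICAL / HIGH / MEDIUM / LOW
    what: str                # what exactly needs to be done
    why: str                 # rationale linking back to the root cause
    owner: str               # who owns this
    due: date                # when it is due
    status: str = "PENDING"  # PENDING / IN PROGRESS / DONE

    def is_overdue(self, today: date) -> bool:
        """An item is overdue if it isn't DONE and its due date has passed."""
        return self.status != "DONE" and today > self.due

# Illustrative item from the short-term actions table
item = ActionItem("HIGH", "Add auth load tests to CI pipeline",
                  "Catch performance issues early", "QA", date(2024, 3, 1))
print(item.is_overdue(date(2024, 3, 15)))
```

Filtering a list of such records by `is_overdue` gives the agenda for the monthly action-item review.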

Preventive Measures

Prevention Strategy: How do we prevent this type of incident?

  1. Process Changes:

    • Implement staged/canary deployments for critical services
    • Strengthen code review requirements for auth service changes
    • Require load test passing for critical path changes
  2. Monitoring & Alerting:

    • Lower error rate alert threshold (early warning)
    • Add CPU/memory alerts for auth service
    • Add canary endpoint synthetic monitoring
  3. Automation:

    • Automatic rollback if error rate >2%
    • Load test gate in CI pipeline (mandatory, not optional)
    • Automated chaos engineering tests weekly
  4. Documentation & Training:

    • Document architecture of auth service
    • Create runbook for auth service incidents
    • Schedule knowledge transfer session for on-call team
  5. Organizational:

    • Remove deadline pressure; don't deploy Friday afternoon
    • Add on-call engineer to code reviews of critical services
    • Establish incident SLA: detection to resolution <15 minutes for P0 incidents
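The "automatic rollback if error rate >2%" measure under Automation can be sketched as a simple control loop. This is a sketch under stated assumptions: `fetch_error_rate` and `rollback` are hypothetical hooks into your monitoring API and deploy tooling, and the 2% threshold comes from the example above:

```python
import time

ERROR_RATE_THRESHOLD = 0.02  # 2%, per the preventive measure above
CONSECUTIVE_BREACHES = 3     # require a sustained breach to avoid flapping

def should_rollback(samples, threshold=ERROR_RATE_THRESHOLD,
                    breaches=CONSECUTIVE_BREACHES):
    """Trigger only when the last N samples all exceed the threshold."""
    recent = samples[-breaches:]
    return len(recent) == breaches and all(s > threshold for s in recent)

def watch(fetch_error_rate, rollback, interval_s=30):
    """Poll the error rate; roll back once a sustained breach is seen."""
    samples = []
    while True:
        samples.append(fetch_error_rate())  # e.g. 5xx responses / total requests
        if should_rollback(samples):
            rollback()                      # revert to last known-good version
            return
        time.sleep(interval_s)
```

Requiring consecutive breaches trades a slightly slower trigger for protection against rolling back on a single noisy metric sample.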

Template Structure

INCIDENT POSTMORTEM

Incident ID: INC-2024-047
Date: 2024-02-15
Duration: 28 minutes (14:35 UTC - 15:03 UTC)
Impact: Website unavailable; 0.5M page requests failed
Severity: P1 (Critical)

INCIDENT SUMMARY [1 paragraph overview of what happened and impact]

TIMELINE [Chronological events table]

ROOT CAUSE ANALYSIS

Primary Root Cause: [High-level finding]

5 Whys Analysis: [Why chain]

Contributing Factors: [List of systemic issues]

IMPACT ANALYSIS

  • Customers Affected: [Number or percentage]
  • Duration: [Minutes]
  • Revenue Impact: [If quantifiable]
  • Reputation Impact: [Qualitative assessment]
  • Data Loss: [Yes/No, details if yes]

DETECTION & RESPONSE

  • Detection Time: [Minutes to detect]
  • Response Time: [Minutes to start mitigation]
  • Resolution Time: [Minutes to full recovery]
  • Response Quality: [Smooth/Some delays/Chaotic - why?]

WHAT WENT WELL

  • [Good thing 1]: Enabled [outcome]
  • [Good thing 2]: Enabled [outcome]
  • [Good thing 3]: Enabled [outcome]

Recognize excellent work; reinforce good behaviors

WHAT COULD BE BETTER

  • [Gap 1]: Impact was [consequence]
  • [Gap 2]: Impact was [consequence]
  • [Gap 3]: Impact was [consequence]

ACTION ITEMS [Immediate, short-term, long-term actions with owners and dates]

PREVENTIVE MEASURES [How we prevent this incident class in future]

APPENDICES

  • Error logs (anonymized)
  • Customer communication
  • Monitoring graphs during incident
  • Architecture diagram
  • Related incidents (historical patterns)

Postmortem Facilitation

Blameless Meeting Principles:

  1. Start with context, not blame: "Here's what was happening at 14:30 UTC..."
  2. Use neutral language: "Code changed" not "Code was broken"
  3. Ask curious questions: "What were you seeing on your screen?" not "Why didn't you check the logs?"
  4. Encourage storytelling: Let people describe their experience; narrative flow
  5. Capture assumptions: "I assumed..." statements reveal knowledge gaps
  6. No hierarchy: On-call engineer's observations valued same as CTO's
  7. Record decisions: Why did we choose rollback vs. fix? Document the thinking
  8. Record learnings: What surprised people? What did they learn?

Participants:

  • On-call engineer (incident responder)
  • Service owner
  • DevOps/Infrastructure team
  • Product/Business owner
  • Facilitator (experienced, neutral party)

Meeting Duration: 30-60 minutes maximum

When to Hold: Within 48 hours of incident resolution (while details are fresh)

Distribution & Follow-up

  1. Share Widely: Postmortem is internal tool for learning; share with full engineering org
  2. Executive Summary: One-page summary for leadership
  3. Customer Communication: Transparency about what happened and prevention measures
  4. Process Review: Monthly review of open action items
  5. Trend Analysis: Quarterly review - are we preventing incident classes or just firefighting?

Preventing Similar Incidents

Incident Class Tracking:

  • Authentication failures
  • Database performance degradation
  • Memory leaks
  • Configuration errors
  • Dependency failures
  • Deployment failures

If the same class happens twice: escalate prevention measures.
If the same incident happens three times: organizational escalation (management review).
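The escalation rule above can be applied automatically to an incident history. A minimal sketch; the class names and history are illustrative:

```python
from collections import Counter

def escalation_level(incident_classes):
    """Map each incident class to an escalation level:
    2 occurrences -> escalate prevention measures, 3+ -> organizational escalation."""
    levels = {}
    for cls, n in Counter(incident_classes).items():
        if n >= 3:
            levels[cls] = "organizational escalation"
        elif n == 2:
            levels[cls] = "escalate prevention measures"
        else:
            levels[cls] = "track"
    return levels

history = ["deployment failure", "memory leak", "deployment failure",
           "deployment failure", "configuration error"]
print(escalation_level(history))
```

Running this over a quarter's incidents directly answers the trend-analysis question: are we preventing incident classes or just firefighting?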


Use this skill to: Transform incidents into learning opportunities, improve system resilience, and build a psychologically safe incident response culture.