AbsolutelySkilled incident-management

Name: incident-management
Author: AbsolutelySkilled

install

source · Clone the upstream repo

git clone https://github.com/AbsolutelySkilled/AbsolutelySkilled

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/AbsolutelySkilled/AbsolutelySkilled "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/incident-management" ~/.claude/skills/absolutelyskilled-absolutelyskilled-incident-management && rm -rf "$T"

manifest: skills/incident-management/SKILL.md

Incident Management

Incident management is the structured practice of detecting, responding to, resolving, and learning from production failures. It spans the full incident lifecycle - from the moment an alert fires through war room coordination, customer communication via status pages, and the post-mortem that prevents recurrence. This skill provides actionable frameworks for each phase: on-call rotation design, runbook authoring, severity classification, war room protocols, status page communication, and blameless post-mortems. Built for engineering teams that want to move from chaotic firefighting to repeatable, calm incident response.

When to use this skill

Trigger this skill when the user:

Needs to design or improve an on-call rotation or escalation policy
Wants to write, review, or templatize a runbook for an alert or service
Is conducting, writing, or facilitating a post-mortem / post-incident review
Needs to set up or improve a status page and customer communication strategy
Is running or setting up a war room for an active incident
Wants to define severity levels or incident classification criteria
Needs an incident commander playbook or role definitions
Is building incident response tooling or automation

Do NOT trigger this skill for:

Defining SLOs, SLIs, or error budgets without an incident context (use site-reliability skill)
Infrastructure provisioning or deployment pipeline design (use CI/CD or cloud skills)

Key principles

Incidents are system failures, not people failures - Every incident reflects a gap in the system: missing automation, insufficient monitoring, unclear runbooks, or architectural fragility. Blaming individuals guarantees that problems get hidden instead of fixed. Design every process around surfacing systemic issues.
Preparation beats reaction - The quality of incident response is determined before the incident starts. Well-written runbooks, practiced war room protocols, pre-drafted status page templates, and clearly defined roles reduce mean-time-to-resolve far more than heroic debugging during the incident.
Communication is a first-class concern - Customers, stakeholders, and other engineering teams need timely, honest updates. A status page update every 30 minutes during an outage builds trust. Silence destroys it. Assign a dedicated communications role in every major incident.
Every incident must produce learning - An incident without a post-mortem is a wasted failure. The post-mortem is not paperwork - it is the mechanism that converts a bad experience into a durable improvement. Action items without owners and deadlines are wishes, not commitments.
On-call must be sustainable - Unsustainable on-call leads to burnout, attrition, and slower incident response. Track on-call load metrics, enforce rest periods, and treat excessive paging as a reliability problem to fix, not a cost of doing business.

Core concepts

Incident lifecycle

Detection -> Triage -> Response -> Resolution -> Post-mortem -> Prevention
     |           |          |            |              |              |
  Alerts     Severity   War room     Fix/rollback   Review +       Action
  fire       assigned   stands up    deployed       learn          items
                                                                   tracked

Every phase has a defined owner, a set of artifacts, and a handoff to the next phase. Gaps between phases - especially between resolution and post-mortem - are where learning gets lost.

Incident roles

Role	Responsibility	When assigned
Incident Commander (IC)	Owns the response, delegates work, makes decisions	SEV1/SEV2 immediately
Communications Lead	Updates status page, stakeholders, and support teams	SEV1/SEV2 immediately
Technical Lead	Drives root cause investigation and fix implementation	All severities
Scribe	Maintains the incident timeline in real-time	SEV1; optional for SEV2

Role assignment rule: For SEV1, all four roles must be filled within 15 minutes. For SEV2, IC and Technical Lead are mandatory. For SEV3+, the on-call engineer handles all roles.

Severity classification

Severity	Customer impact	Response time	War room	Status page
SEV1	Complete outage or data loss	Page immediately, 5-min ack	Required	Required
SEV2	Degraded core functionality	Page on-call, 15-min ack	Recommended	Required
SEV3	Minor degradation, workaround exists	Next business day	No	Optional
SEV4	Cosmetic or internal-only	Backlog	No	No

Escalation rule: If a SEV2 is not mitigated within 60 minutes, escalate to SEV1 procedures. If the on-call engineer cannot classify severity within 10 minutes, default to SEV2 until more information is available.

Common tasks

Design an on-call rotation

Rotation structure:

Primary on-call:    First responder. Acks within 5 min (SEV1) or 15 min (SEV2).
Secondary on-call:  Backup if primary misses ack window. Auto-escalated by pager.
Manager escalation: If both primary and secondary miss ack. Also for SEV1 war rooms.

Scheduling guidelines:

Rotate weekly. Never assign the same person two consecutive weeks without a gap.
Minimum team size for sustainable on-call: 5 engineers (allows 1-in-5 rotation).
Follow-the-sun for distributed teams: hand off to the next timezone instead of paging at 3am. Each region covers business hours + 2 hours buffer.
Provide comp time or additional pay for after-hours pages. Track and review quarterly.

On-call health metrics:

Metric	Healthy	Unhealthy
Pages per on-call week	< 5	> 10
After-hours pages per week	< 2	> 5
Mean time-to-ack (SEV1)	< 5 min	> 15 min
Mean time-to-ack (SEV2)	< 15 min	> 30 min
Percentage of pages with runbooks	> 80%	< 50%

Write a runbook

Every runbook must contain these sections:

Title:        [Alert name] - [Service name] Runbook
Last updated: [date]
Owner:        [team or individual]

1. SYMPTOM
   What the alert tells you. Quote the alert condition verbatim.

2. IMPACT
   Who is affected. Severity level. Business impact in plain language.

3. INVESTIGATION STEPS
   Numbered steps. Each step has:
   - What to check (command, dashboard link, or query)
   - What a normal result looks like
   - What an abnormal result means and what to do next

4. MITIGATION STEPS
   Numbered steps to stop the bleeding. Prioritize speed over elegance.
   Include rollback commands, feature flag toggles, and traffic shift procedures.

5. ESCALATION
   Who to contact if steps 3-4 do not resolve the issue within [N] minutes.
   Include name, team, and pager handle.

6. CONTEXT
   Links to: service architecture doc, relevant dashboards, past incidents,
   and the service's on-call schedule.

Runbook quality test: A new team member who has never seen this service should be able to follow the runbook and either resolve the issue or escalate correctly within 30 minutes.

Conduct a post-mortem

When to hold one: Every SEV1. Every SEV2 with customer impact. Any incident consuming more than 4 hours of engineering time. Recurring SEV3s from the same cause.

Timeline:

Hour 0:     Incident resolved. IC assigns post-mortem owner.
Day 1:      Owner drafts timeline and initial analysis.
Day 2-3:    Facilitated post-mortem meeting (60-90 minutes).
Day 3-4:    Draft published for 24-hour review period.
Day 5:      Final version published. Action items entered in tracker.
Day 30:     Action item review - are they done?

The five post-mortem questions:

What happened? (factual timeline with timestamps)
Why did it happen? (root cause analysis - use the "five whys" technique)
Why was it not detected sooner? (monitoring and alerting gap)
What slowed down the response? (process and tooling gap)
What prevents recurrence? (action items)

Action item rules: Every action item must have an owner, a due date, a priority (P0/P1/P2), and a measurable definition of done. "Improve monitoring" is not an action item. "Add latency p99 alert for checkout-api with a 500ms threshold, owned by @alice, due 2026-04-01" is.

See

references/postmortem-template.md

for the full template.

Set up a status page

Page structure:

Components:
  - Group by user-facing service (API, Dashboard, Mobile App, Webhooks)
  - Each component has a status: Operational | Degraded | Partial Outage | Major Outage
  - Show uptime percentage over 90 days per component

Incidents:
  - Title: clear, customer-facing description (not internal jargon)
  - Updates: timestamped entries showing investigation progress
  - Resolution: what was fixed and what customers need to do (if anything)

Maintenance:
  - Scheduled windows with start/end times in customer's timezone
  - Description of impact during the window

Communication cadence during incidents:

Phase	Update frequency	Content
Investigating	Every 30 min	"We are aware and investigating" + symptoms
Identified	Every 30 min	Root cause identified, ETA if known
Monitoring	Every 60 min	Fix deployed, monitoring for stability
Resolved	Once	Summary of what happened and what was fixed

Writing rules for status updates:

Use plain language. No internal service names, error codes, or jargon.
State the customer impact first, then what you are doing about it.
Never say "no impact" if customers reported problems.
Include timezone in all timestamps.

Run a war room

War room activation criteria: Any SEV1. Any SEV2 not mitigated within 30 minutes. Any incident affecting multiple services or teams.

War room protocol:

Minute 0-5:   IC opens the war room (video call + shared channel).
              IC states: incident summary, current severity, affected services.
              IC assigns roles: Communications Lead, Technical Lead, Scribe.

Minute 5-15:  Technical Lead drives initial investigation.
              Scribe starts the timeline document.
              Communications Lead posts first status page update.

Every 15 min: IC runs a checkpoint:
              - "What do we know now?"
              - "What are we trying next?"
              - "Do we need to escalate or bring in more people?"
              - "Is the status page current?"

Resolution:   IC confirms the fix is deployed and metrics are recovering.
              Communications Lead posts resolution update.
              IC schedules the post-mortem and assigns an owner.
              War room closed.

War room rules:

One conversation at a time. IC moderates.
No side investigations without telling the IC.
All commands run against production are announced before execution.
The scribe logs every significant action with a timestamp.
If the war room exceeds 2 hours, IC rotates or brings a fresh IC.

Build an escalation policy

Escalation ladder:

Level 0: Automated response (auto-restart, auto-scale, circuit breaker)
Level 1: On-call engineer (primary)
Level 2: On-call engineer (secondary) + team lead
Level 3: Engineering manager + dependent service on-calls
Level 4: Director/VP + incident commander (SEV1 only)

Escalation triggers:

Trigger	Action
Primary on-call does not ack within 5 min (SEV1)	Auto-page secondary
No mitigation progress after 30 min	Escalate one level
Customer-reported incident (not alert-detected)	Escalate one level immediately
Incident spans multiple services	Page all affected service on-calls
Data loss suspected	Immediate SEV1, escalate to Level 4

Anti-patterns / common mistakes

Mistake	Why it is wrong	What to do instead
No runbooks for alerts	Every page becomes an investigation from scratch; MTTR skyrockets	Treat "alert without runbook" as a blocking issue; write the runbook during the incident
Blameful post-mortems	Engineers hide mistakes, avoid risk, and stop reporting near-misses	Use a blameless template; explicitly ban naming individuals as root causes
Status page updates only at resolution	Customers assume you do not know or do not care; support tickets flood in	Update every 30 minutes minimum; assign a dedicated Communications Lead
On-call without compensation or rotation limits	Burnout, attrition, and degraded response quality	Cap rotations, provide comp time, track health metrics quarterly
War rooms without an Incident Commander	Multiple people investigate the same thing, no one communicates, chaos	Always assign an IC first; the IC's job is coordination, not debugging
Post-mortem action items with no owner or deadline	Items rot in a document; the same incident repeats	Every action item needs: owner, due date, priority, and definition of done

Gotchas

Severity escalation delays compound MTTR - The most common cause of a 2-hour incident that should have taken 30 minutes is a 45-minute delay in escalating from SEV3 to SEV2. The escalation rule "if no mitigation progress after 30 minutes, escalate one level" is not optional - build it into your pager escalation policy as an automatic trigger, not a judgment call.
Post-mortem action items decay without a 30-day review - Action items written in the heat of post-mortem often get deprioritized as new features take over the sprint. Without a mandatory 30-day follow-up meeting with the IC and action item owners, the same incident repeats within 6 months. Treat action item review as a blocking ceremony, not a nice-to-have.
Status page updates that use internal jargon erode customer trust - Saying "the Kafka consumer group is lagging due to a partition rebalance" confuses customers and implies you don't know how to communicate. Customers need to know the symptom they're experiencing, whether you're aware, and when you expect resolution. Translate everything to user impact before posting.
War rooms without a single Incident Commander devolve into chaos - When multiple senior engineers simultaneously investigate, propose fixes, and run commands against production without coordination, changes step on each other and the true root cause gets masked by noise. The IC role is not debugging - it is traffic control. Assign an IC before anyone runs a single query.
Runbooks that haven't been tested under stress are not runbooks - A runbook that works when you write it (calm, familiar with the system, full context) may be unusable at 3am by a tired on-call engineer seeing the service for the first time. Run fire drills where engineers who didn't write the runbook follow it end-to-end. Gaps in instructions surface immediately.

References

For detailed guidance on specific incident management domains, load the relevant file from

references/

```
references/postmortem-template.md
```
- full blameless post-mortem template with example entries, facilitation guide, and action item tracker format
```
references/runbook-template.md
```
- detailed runbook template with example investigation steps and mitigation procedures
```
references/status-page-guide.md
```
- status page setup guide with communication templates and incident update examples
```
references/war-room-checklist.md
```
- war room activation checklist, role cards, and checkpoint script

Only load a references file when the current task requires it.

Companion check

On first activation of this skill in a conversation: check which companion skills are installed by running
ls ~/.claude/skills/ ~/.agent/skills/ ~/.agents/skills/ .claude/skills/ .agent/skills/ .agents/skills/ 2>/dev/null
. Compare the results against the
recommended_skills
field in this file's frontmatter. For any that are missing, mention them once and offer to install:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill <name>
Skip entirely if
recommended_skills
is empty or all companions are already installed.