Claude-Skills incident-commander

install
source · Clone the upstream repo
git clone https://github.com/borghei/Claude-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/borghei/Claude-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/engineering/incident-commander" ~/.claude/skills/borghei-claude-skills-incident-commander && rm -rf "$T"
manifest: engineering/incident-commander/SKILL.md

Incident Commander

The agent classifies incident severity, reconstructs timelines from heterogeneous event sources, and generates structured post-incident reviews with root cause analysis and action items.


Quick Start

# Classify an incident (JSON or stdin)
echo '{"description": "Database connections timing out", "affected_users": "80%", "business_impact": "high"}' \
  | python scripts/incident_classifier.py --format text

# Multi-dimensional severity scoring
python scripts/severity_classifier.py incident.json --format markdown

# Reconstruct timeline with phase detection and gap analysis
python scripts/timeline_reconstructor.py --input events.json --detect-phases --gap-analysis --format markdown

# Build structured timeline with MTTD/MTTR metrics
python scripts/incident_timeline_builder.py incident_data.json --format markdown

# Generate Post-Incident Review
python scripts/pir_generator.py --incident incident.json --rca-method fishbone --action-items --format markdown

# Generate postmortem with benchmark comparisons
python scripts/postmortem_generator.py incident_data.json --format markdown

Tools Overview

| Tool | Input | Output |
|------|-------|--------|
| incident_classifier.py | Incident description JSON | Severity level, response teams, communication templates |
| severity_classifier.py | Incident data with impact/signals | Multi-dimensional score across 5 weighted dimensions |
| timeline_reconstructor.py | Timestamped events array | Chronological timeline with phases and gap analysis |
| incident_timeline_builder.py | Incident + events JSON | Timeline with MTTD/MTTR, phase distribution, comms templates |
| pir_generator.py | Incident data + optional timeline | PIR document with RCA (5 Whys, Fishbone, Timeline, Bow Tie) |
| postmortem_generator.py | Incident + resolution + action items | Postmortem with benchmarks, factor analysis, coverage gaps |

Workflow 1: Incident Response (Detection to Resolution)

Step 1 -- Classify severity.

python scripts/severity_classifier.py incident.json --format json

The agent scores across five dimensions: revenue impact (25%), user scope (25%), data/security risk (20%), service criticality (15%), blast radius (15%).
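
As a rough illustration of how a weighted score across these five dimensions could be combined, the sketch below assumes each dimension has already been scored on a 0-100 scale. Only the weights come from the text above; the dictionary keys, the 0-100 scale, and the weighted_severity_score function are hypothetical, and severity_classifier.py may compute its score differently.

```python
# Weights come from the text above; the dimension keys and the 0-100 scale
# are illustrative assumptions, not the script's documented schema.
WEIGHTS = {
    "revenue_impact": 0.25,
    "user_scope": 0.25,
    "data_security_risk": 0.20,
    "service_criticality": 0.15,
    "blast_radius": 0.15,
}


def weighted_severity_score(dimension_scores):
    """Combine per-dimension scores (assumed 0-100) into one weighted score."""
    missing = sorted(set(WEIGHTS) - set(dimension_scores))
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


# Hypothetical incident: high revenue and user impact, low data risk.
example = {
    "revenue_impact": 80,
    "user_scope": 80,
    "data_security_risk": 20,
    "service_criticality": 60,
    "blast_radius": 40,
}
print(round(weighted_severity_score(example), 1))
```

For the hypothetical input above the contributions are 20 + 20 + 4 + 9 + 6, so the combined score is 59 out of 100. How a score maps to SEV-1 through SEV-4 is not specified here and is left out of the sketch.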

| Severity | Definition | Response Time | Comms Cadence |
|----------|------------|---------------|---------------|
| SEV-1 | Complete outage, data loss, security breach | 15 min | Every 15 min |
| SEV-2 | Partial degradation, >25% users affected | 30 min | Every 30 min |
| SEV-3 | Single feature affected, workaround available | 2 hours | At milestones |
| SEV-4 | Cosmetic, dev/test only, no user impact | Next business day | Standard cycle |

Validation checkpoint: Severity classification includes confidence score and recommended escalation path.

Step 2 -- Establish command.

The Incident Commander:

  • Is assigned within 5 min (SEV-1) or 30 min (SEV-2)
  • Creates war room and incident tracking ticket
  • Sends initial notification using generated template
  • Coordinates between technical teams and stakeholders
  • Shields responders from external distractions

Step 3 -- Investigate and mitigate.

The agent generates targeted investigation commands based on the affected service:

kubectl get pods -n production -l app=<service>
kubectl logs -n production -l app=<service> --tail=100
helm history <service> -n production

Decision framework for SEV-1/SEV-2:

  • Bias toward action over analysis
  • Prefer rollbacks to risky fixes under pressure
  • Document every decision for later review
  • Consult SMEs but do not block on them

Step 4 -- Communicate.

The agent generates three communication templates per severity:

  1. Internal notification -- technical details, response team, war room link
  2. Executive summary -- business impact, ETA, leadership actions required
  3. Customer communication -- impact scope, what is being done, next update time

Validation checkpoint: All stakeholders notified within committed timeframes.
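
The "next update time" in the customer communication follows from the comms cadence in the Step 1 severity table. A minimal sketch of that calculation is shown below; the COMMS_CADENCE mapping and the next_update_due function are hypothetical names, not part of the skill's scripts.

```python
from datetime import datetime, timedelta
from typing import Optional

# Update intervals from the severity table in Step 1. SEV-3 ("at milestones")
# and SEV-4 ("standard cycle") have no fixed interval, so they map to None.
COMMS_CADENCE = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(minutes=30),
    "SEV-3": None,
    "SEV-4": None,
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None if updates are not time-based."""
    interval = COMMS_CADENCE[severity]
    return None if interval is None else last_update + interval


# Made-up example: a SEV-1 update sent at 12:00 means the next is due at 12:15.
print(next_update_due("SEV-1", datetime(2026, 4, 1, 12, 0)))
```

Returning None for SEV-3 and SEV-4 keeps milestone-driven updates out of the timer logic rather than inventing an interval for them.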


Workflow 2: Post-Incident Review

Step 1 -- Reconstruct the timeline.

python scripts/timeline_reconstructor.py --input events.json --detect-phases --gap-analysis --format markdown

The agent accepts events from logs, alerts, Slack messages, and deployment systems. Each event needs a timestamp and a description. Optional fields: source, type, actor, severity.

Supported phases: detection, declaration, escalation, investigation, mitigation, communication, resolution.
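
To make the event shape and the idea of gap analysis concrete, here is a small sketch that sorts events chronologically and flags long silences between consecutive events. It assumes ISO-8601 timestamps without a timezone suffix and a 30-minute gap threshold; the function names, the threshold, and the sample events are all hypothetical, and timeline_reconstructor.py may work differently.

```python
from datetime import datetime, timedelta

REQUIRED_FIELDS = ("timestamp", "description")


def build_timeline(events):
    """Drop events missing a required field and sort the rest chronologically."""
    valid = [e for e in events if all(e.get(f) for f in REQUIRED_FIELDS)]
    return sorted(valid, key=lambda e: datetime.fromisoformat(e["timestamp"]))


def find_gaps(timeline, threshold=timedelta(minutes=30)):
    """Return (earlier, later, gap) for consecutive events further apart than threshold."""
    gaps = []
    for earlier, later in zip(timeline, timeline[1:]):
        start = datetime.fromisoformat(earlier["timestamp"])
        end = datetime.fromisoformat(later["timestamp"])
        if end - start > threshold:
            gaps.append((earlier["description"], later["description"], end - start))
    return gaps


# Made-up events, deliberately out of order and with one invalid entry.
SAMPLE_EVENTS = [
    {"timestamp": "2026-04-01T12:45:00", "description": "Rollback started",
     "source": "deploy", "type": "mitigation"},
    {"timestamp": "2026-04-01T12:00:00", "description": "Latency alert fired",
     "source": "alerting", "type": "detection"},
    {"description": "Message with no timestamp"},
    {"timestamp": "2026-04-01T12:05:00", "description": "Incident declared",
     "actor": "on-call"},
]

timeline = build_timeline(SAMPLE_EVENTS)
for first, second, gap in find_gaps(timeline):
    print(f"{gap} with no events between '{first}' and '{second}'")
```

In this sample the only flagged gap is the 40 minutes between the declaration and the rollback, which is the kind of silence a reviewer would want explained.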

Step 2 -- Perform root cause analysis.

python scripts/pir_generator.py --incident incident.json --timeline timeline.json --rca-method five_whys --action-items

Available RCA methods:

| Method | Best For |
|--------|----------|
| five_whys | Linear causal chains, quick analysis |
| fishbone | Multi-category analysis (People, Process, Technology, Environment) |
| timeline | Identifying missed decision points and delays |
| bow_tie | Barriers analysis, prevention and mitigation controls |

Step 3 -- Generate action items.

The agent categorizes action items as: immediate_fix, process_improvement, monitoring_alerting, documentation, training, architectural, tooling.

Each action item includes: title, owner, priority, deadline, success criteria, and dependencies.
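
One way to picture an action item with these fields is as a small record with a validation helper, sketched below. The ActionItem class, its attribute names, and the problems method are hypothetical; the category names and field list come from the text above, while the actual JSON schema used by the scripts is not shown here.

```python
from dataclasses import dataclass, field
from datetime import date

# Category names as listed in the text above.
CATEGORIES = {
    "immediate_fix", "process_improvement", "monitoring_alerting",
    "documentation", "training", "architectural", "tooling",
}


@dataclass
class ActionItem:
    title: str
    owner: str
    priority: str  # e.g. "P0"; the exact priority scale is an assumption
    deadline: date
    category: str
    success_criteria: str = ""
    dependencies: list = field(default_factory=list)

    def problems(self):
        """Return human-readable validation problems; an empty list means valid."""
        found = []
        if self.category not in CATEGORIES:
            found.append(f"unknown category: {self.category}")
        for name in ("title", "owner", "priority"):
            if not getattr(self, name).strip():
                found.append(f"missing {name}")
        return found


# Made-up example item.
item = ActionItem(
    title="Alert on connection pool saturation",
    owner="dana",
    priority="P0",
    deadline=date(2026, 4, 3),
    category="monitoring_alerting",
    success_criteria="Alert fires in staging when pool usage exceeds 90%",
)
print(item.problems())
```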

Step 4 -- Validate postmortem quality.

python scripts/postmortem_generator.py incident_data.json --format json

The agent checks:

  • Every contributing factor has at least one action item (coverage gap detection)
  • Action items have quality scores (0-100) based on specificity
  • MTTD/MTTR benchmarked against industry standards
  • Missing actions suggested for uncovered themes

Validation checkpoint: Zero coverage gaps. All P0 action items have owners and deadlines within 48 hours.
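
The coverage and P0 checks can be sketched as two small functions. This is an illustration only: the addresses key that links an action item to contributing factors, the function names, and the choice to measure the 48 hours from a caller-supplied reference time are all assumptions, since the text does not say how postmortem_generator.py implements them.

```python
from datetime import datetime, timedelta


def find_coverage_gaps(contributing_factors, action_items):
    """Return contributing factors that no action item addresses."""
    covered = {f for item in action_items for f in item.get("addresses", [])}
    return [f for f in contributing_factors if f not in covered]


def p0_violations(action_items, reference_time):
    """Return titles of P0 items lacking an owner or a deadline within 48 hours."""
    limit = reference_time + timedelta(hours=48)
    bad = []
    for item in action_items:
        if item.get("priority") != "P0":
            continue
        deadline = item.get("deadline")
        if not item.get("owner") or deadline is None or deadline > limit:
            bad.append(item["title"])
    return bad


# Made-up postmortem data.
factors = ["no connection pool alert", "untested failover"]
items = [
    {"title": "Add pool saturation alert", "priority": "P0", "owner": "dana",
     "deadline": datetime(2026, 4, 2, 12, 0),
     "addresses": ["no connection pool alert"]},
    {"title": "Run failover drill", "priority": "P0", "owner": "",
     "deadline": datetime(2026, 4, 10), "addresses": []},
]
print(find_coverage_gaps(factors, items))
print(p0_violations(items, datetime(2026, 4, 1, 12, 0)))
```

With this sample data the checkpoint would fail on both counts: "untested failover" has no action item, and the failover drill has no owner and a deadline beyond the 48-hour window.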


Workflow 3: Escalation Management

Technical escalation path:

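| Level | Role | SEV-1 Trigger | SEV-2 Trigger |
|-------|------|---------------|---------------|
| L1 | On-call engineer | Immediate | 15 min |
| L2 | Senior engineer / Team lead | 30 min | 1 hour |
| L3 | Engineering Manager / Staff | 45 min | 2 hours |
| L4 | Director / CTO | 1 hour | 4 hours |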
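
The technical escalation path amounts to a threshold lookup on elapsed time. The sketch below encodes the trigger times from the table in minutes; the ESCALATION structure and the current_escalation_level function are hypothetical, and treating "Immediate" as zero minutes is an assumption.

```python
# Trigger times in minutes since the incident started, per the table above.
# "Immediate" is modelled as 0 minutes (an assumption).
ESCALATION = {
    "SEV-1": [(0, "L1"), (30, "L2"), (45, "L3"), (60, "L4")],
    "SEV-2": [(15, "L1"), (60, "L2"), (120, "L3"), (240, "L4")],
}


def current_escalation_level(severity, minutes_elapsed):
    """Return the highest level whose trigger has passed, or None if none has."""
    level = None
    for threshold, name in ESCALATION[severity]:
        if minutes_elapsed >= threshold:
            level = name
    return level


print(current_escalation_level("SEV-1", 50))
print(current_escalation_level("SEV-2", 10))
```

A SEV-1 incident that is 50 minutes old has passed the 45-minute trigger, so it sits at L3, while a SEV-2 incident at 10 minutes has not yet reached its first trigger and yields None.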

Business escalation:

| Severity | Duration | Escalate To |
|----------|----------|-------------|
| SEV-1 | Immediate | VP Engineering |
| SEV-1 | 30 min | CTO + Customer Success VP |
| SEV-1 | 1 hour | CEO + Full Executive Team |
| SEV-2 | 2 hours | VP Engineering |
| SEV-2 | 4 hours | CTO |

Anti-Patterns

  1. Individual blame in postmortems -- focus on system failures. "Why did the process allow this?" not "Why did Alice do this?"
  2. Skipping PIR for SEV-2 -- every SEV-1 and SEV-2 gets a postmortem within 3 business days.
  3. Action items without owners -- every item needs a specific person and deadline.
  4. Deploying fixes under pressure without validation -- validate fixes before declaring resolution; plan for secondary failures.
  5. Communication gaps -- provide updates even when there is no new information.

Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| Classifier assigns SEV-1 to minor issues | Description keywords trigger high severity without impact data | Provide affected_users percentage and business_impact fields |
| Timeline shows "No valid events found" | Timestamps in unsupported format or missing timestamp key | Use ISO-8601, YYYY-MM-DD HH:MM:SS, or Unix epoch |
| PIR produces shallow 5 Whys | Incident data lacks detail | Enrich input with affected_services, customer_impact; supply timeline via --timeline |
| Postmortem marks all action items invalid | Missing required fields | Each action item needs title, owner, priority, deadline |
| Severity score seems too low | Flat description without structured impact data | Provide full schema with impact, signals, context keys |
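
When events come from mixed sources, normalizing timestamps before running the timeline tools can avoid the "No valid events found" case. The following sketch converts the three formats named above into timezone-aware UTC datetimes. The parse_timestamp function is hypothetical, and treating timestamps without a timezone as UTC is an assumption that may not match your data.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse ISO-8601, 'YYYY-MM-DD HH:MM:SS', or Unix epoch into an aware UTC datetime."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    text = str(value).strip()
    try:
        # Purely numeric strings are treated as Unix epoch seconds.
        return datetime.fromtimestamp(float(text), tz=timezone.utc)
    except ValueError:
        pass
    # fromisoformat accepts both "T" and a space as the date/time separator;
    # a trailing "Z" is rewritten because older Python versions reject it.
    parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    # Assumption: timestamps without a timezone are already in UTC.
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


for raw in ("2026-04-01T12:00:00Z", "2026-04-01 12:00:00", 1700000000):
    print(parse_timestamp(raw).isoformat())
```

Anything that matches none of the three formats raises ValueError, which makes unsupported timestamps visible instead of silently dropping the event.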

References

| Guide | Path |
|-------|------|
| Incident Response Framework | references/incident-response-framework.md |
| Severity Matrix | references/incident_severity_matrix.md |
| Communication Templates | references/communication_templates.md |
| RCA Frameworks Guide | references/rca_frameworks_guide.md |
| SLA Management | references/sla-management-guide.md |

Integration Points

| Skill | Integration |
|-------|-------------|
| senior-devops | Monitoring alerts feed timeline; runbook templates inform playbooks |
| senior-secops | Security incidents auto-escalate to SEV-1; breach indicators trigger SecOps response |
| release-orchestrator | Deployment events feed timeline; rollback data informs release gates |
| senior-architect | Architectural root causes escalate to architecture review |
| code-reviewer | PIR action items route to code review workflows |

Last Updated: April 2026
Version: 1.1.0