Claude-Skills incident-commander

install

source · Clone the upstream repo

git clone https://github.com/borghei/Claude-Skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/borghei/Claude-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/engineering/incident-commander" ~/.claude/skills/borghei-claude-skills-incident-commander && rm -rf "$T"

manifest: engineering/incident-commander/SKILL.md

source content

Incident Commander

The agent classifies incident severity, reconstructs timelines from heterogeneous event sources, and generates structured post-incident reviews with root cause analysis and action items.

Quick Start

# Classify an incident (JSON or stdin)
echo '{"description": "Database connections timing out", "affected_users": "80%", "business_impact": "high"}' \
  | python scripts/incident_classifier.py --format text

# Multi-dimensional severity scoring
python scripts/severity_classifier.py incident.json --format markdown

# Reconstruct timeline with phase detection and gap analysis
python scripts/timeline_reconstructor.py --input events.json --detect-phases --gap-analysis --format markdown

# Build structured timeline with MTTD/MTTR metrics
python scripts/incident_timeline_builder.py incident_data.json --format markdown

# Generate Post-Incident Review
python scripts/pir_generator.py --incident incident.json --rca-method fishbone --action-items --format markdown

# Generate postmortem with benchmark comparisons
python scripts/postmortem_generator.py incident_data.json --format markdown

Tools Overview

Tool	Input	Output
`incident_classifier.py`	Incident description JSON	Severity level, response teams, communication templates
`severity_classifier.py`	Incident data with impact/signals	Multi-dimensional score across 5 weighted dimensions
`timeline_reconstructor.py`	Timestamped events array	Chronological timeline with phases and gap analysis
`incident_timeline_builder.py`	Incident + events JSON	Timeline with MTTD/MTTR, phase distribution, comms templates
`pir_generator.py`	Incident data + optional timeline	PIR document with RCA (5 Whys, Fishbone, Timeline, Bow Tie)
`postmortem_generator.py`	Incident + resolution + action items	Postmortem with benchmarks, factor analysis, coverage gaps

Workflow 1: Incident Response (Detection to Resolution)

Step 1 -- Classify severity.

python scripts/severity_classifier.py incident.json --format json

The agent scores across five dimensions: revenue impact (25%), user scope (25%), data/security risk (20%), service criticality (15%), blast radius (15%).

Severity	Definition	Response Time	Comms Cadence
SEV-1	Complete outage, data loss, security breach	15 min	Every 15 min
SEV-2	Partial degradation, >25% users affected	30 min	Every 30 min
SEV-3	Single feature affected, workaround available	2 hours	At milestones
SEV-4	Cosmetic, dev/test only, no user impact	Next business day	Standard cycle

Validation checkpoint: Severity classification includes confidence score and recommended escalation path.

Step 2 -- Establish command.

The Incident Commander:

Assigns within 5 min (SEV-1) or 30 min (SEV-2)
Creates war room and incident tracking ticket
Sends initial notification using generated template
Coordinates between technical teams and stakeholders
Shields responders from external distractions

Step 3 -- Investigate and mitigate.

The agent generates targeted investigation commands based on the affected service:

kubectl get pods -n production -l app=<service>
kubectl logs -l app=<service> --tail=100
helm history <service> -n production

Decision framework for SEV-1/SEV-2:

Bias toward action over analysis
Prefer rollbacks to risky fixes under pressure
Document every decision for later review
Consult SMEs but do not block on them

Step 4 -- Communicate.

The agent generates three communication templates per severity:

Internal notification -- technical details, response team, war room link
Executive summary -- business impact, ETA, leadership actions required
Customer communication -- impact scope, what is being done, next update time

Validation checkpoint: All stakeholders notified within committed timeframes.

Workflow 2: Post-Incident Review

Step 1 -- Reconstruct the timeline.

python scripts/timeline_reconstructor.py --input events.json --detect-phases --gap-analysis --format markdown

The agent accepts events from logs, alerts, Slack messages, and deployment systems. Each event needs a

timestamp

and

description

. Optional fields:

source

type

actor

severity

Supported phases: detection, declaration, escalation, investigation, mitigation, communication, resolution.

Step 2 -- Perform root cause analysis.

python scripts/pir_generator.py --incident incident.json --timeline timeline.json --rca-method five_whys --action-items

Available RCA methods:

Method	Best For
`five_whys`	Linear causal chains, quick analysis
`fishbone`	Multi-category analysis (People, Process, Technology, Environment)
`timeline`	Identifying missed decision points and delays
`bow_tie`	Barriers analysis, prevention and mitigation controls

Step 3 -- Generate action items.

The agent categorizes action items as:

immediate_fix

process_improvement

monitoring_alerting

documentation

training

architectural

tooling

Each action item includes: title, owner, priority, deadline, success criteria, and dependencies.

Step 4 -- Validate postmortem quality.

python scripts/postmortem_generator.py incident_data.json --format json

The agent checks:

Every contributing factor has at least one action item (coverage gap detection)
Action items have quality scores (0-100) based on specificity
MTTD/MTTR benchmarked against industry standards
Missing actions suggested for uncovered themes

Validation checkpoint: Zero coverage gaps. All P0 action items have owners and deadlines within 48 hours.

Workflow 3: Escalation Management

Technical escalation path:

Level	Role	SEV-1 Trigger	SEV-2 Trigger
L1	On-call engineer	Immediate	15 min
L2	Senior engineer / Team lead	30 min	1 hour
L3	Engineering Manager / Staff	45 min	2 hours
L4	Director / CTO	1 hour	4 hours

Business escalation:

Severity	Duration	Escalate To
SEV-1	Immediate	VP Engineering
SEV-1	30 min	CTO + Customer Success VP
SEV-1	1 hour	CEO + Full Executive Team
SEV-2	2 hours	VP Engineering
SEV-2	4 hours	CTO

Anti-Patterns

Individual blame in postmortems -- focus on system failures. "Why did the process allow this?" not "Why did Alice do this?"
Skipping PIR for SEV-2 -- every SEV-1 and SEV-2 gets a postmortem within 3 business days.
Action items without owners -- every item needs a specific person and deadline.
Deploying fixes under pressure without validation -- validate fixes before declaring resolution; plan for secondary failures.
Communication gaps -- provide updates even when there is no new information.

Troubleshooting

Problem	Cause	Solution
Classifier assigns SEV1 to minor issues	Description keywords trigger high severity without impact data	Provide `affected_users` percentage and `business_impact` fields
Timeline shows "No valid events found"	Timestamps in unsupported format or missing `timestamp` key	Use ISO-8601, `YYYY-MM-DD HH:MM:SS` , or Unix epoch
PIR produces shallow 5 Whys	Incident data lacks detail	Enrich input with `affected_services` , `customer_impact` ; supply timeline via `--timeline`
Postmortem marks all action items invalid	Missing required fields	Each action item needs `title` , `owner` , `priority` , `deadline`
Severity score seems too low	Flat description without structured impact data	Provide full schema with `impact` , `signals` , `context` keys

References

Guide	Path
Incident Response Framework	`references/incident-response-framework.md`
Severity Matrix	`references/incident_severity_matrix.md`
Communication Templates	`references/communication_templates.md`
RCA Frameworks Guide	`references/rca_frameworks_guide.md`
SLA Management	`references/sla-management-guide.md`

Integration Points

Skill	Integration
`senior-devops`	Monitoring alerts feed timeline; runbook templates inform playbooks
`senior-secops`	Security incidents auto-escalate to SEV-1; breach indicators trigger SecOps response
`release-orchestrator`	Deployment events feed timeline; rollback data informs release gates
`senior-architect`	Architectural root causes escalate to architecture review
`code-reviewer`	PIR action items route to code review workflows

Last Updated: April 2026 Version: 1.1.0