Claude-Skills incident-commander
git clone https://github.com/borghei/Claude-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/borghei/Claude-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/engineering/incident-commander" ~/.claude/skills/borghei-claude-skills-incident-commander && rm -rf "$T"
Incident Commander
The agent classifies incident severity, reconstructs timelines from heterogeneous event sources, and generates structured post-incident reviews with root cause analysis and action items.
Quick Start
    # Classify an incident (JSON or stdin)
    echo '{"description": "Database connections timing out", "affected_users": "80%", "business_impact": "high"}' \
      | python scripts/incident_classifier.py --format text

    # Multi-dimensional severity scoring
    python scripts/severity_classifier.py incident.json --format markdown

    # Reconstruct timeline with phase detection and gap analysis
    python scripts/timeline_reconstructor.py --input events.json --detect-phases --gap-analysis --format markdown

    # Build structured timeline with MTTD/MTTR metrics
    python scripts/incident_timeline_builder.py incident_data.json --format markdown

    # Generate Post-Incident Review
    python scripts/pir_generator.py --incident incident.json --rca-method fishbone --action-items --format markdown

    # Generate postmortem with benchmark comparisons
    python scripts/postmortem_generator.py incident_data.json --format markdown
Tools Overview
| Tool | Input | Output |
|---|---|---|
| incident_classifier.py | Incident description JSON | Severity level, response teams, communication templates |
| severity_classifier.py | Incident data with impact/signals | Multi-dimensional score across 5 weighted dimensions |
| timeline_reconstructor.py | Timestamped events array | Chronological timeline with phases and gap analysis |
| incident_timeline_builder.py | Incident + events JSON | Timeline with MTTD/MTTR, phase distribution, comms templates |
| pir_generator.py | Incident data + optional timeline | PIR document with RCA (5 Whys, Fishbone, Timeline, Bow Tie) |
| postmortem_generator.py | Incident + resolution + action items | Postmortem with benchmarks, factor analysis, coverage gaps |
Workflow 1: Incident Response (Detection to Resolution)
Step 1 -- Classify severity.
python scripts/severity_classifier.py incident.json --format json
The agent scores across five dimensions: revenue impact (25%), user scope (25%), data/security risk (20%), service criticality (15%), blast radius (15%).
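As a rough illustration of how the five weights combine, the sketch below computes a weighted sum. The dimension key names and the 0-100 scale for each dimension are assumptions made for this example; the actual scale and thresholds used by severity_classifier.py are not documented here.

```python
# Weights taken from the five dimensions listed above; they sum to 1.0.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "revenue_impact": 0.25,
    "user_scope": 0.25,
    "data_security_risk": 0.20,
    "service_criticality": 0.15,
    "blast_radius": 0.15,
}


def weighted_severity_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (assumed 0-100) into one weighted score."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


example = {
    "revenue_impact": 80,
    "user_scope": 80,
    "data_security_risk": 20,
    "service_criticality": 60,
    "blast_radius": 40,
}
print(round(weighted_severity_score(example), 2))
```

Rejecting input with a missing dimension, rather than defaulting it to zero, avoids silently under-scoring an incident whose impact data is incomplete.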
| Severity | Definition | Response Time | Comms Cadence |
|---|---|---|---|
| SEV-1 | Complete outage, data loss, security breach | 15 min | Every 15 min |
| SEV-2 | Partial degradation, >25% users affected | 30 min | Every 30 min |
| SEV-3 | Single feature affected, workaround available | 2 hours | At milestones |
| SEV-4 | Cosmetic, dev/test only, no user impact | Next business day | Standard cycle |
Validation checkpoint: Severity classification includes confidence score and recommended escalation path.
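The response times and communication cadences in the table can be encoded as a small lookup. This is a minimal sketch, not the skill's own data model: `SeverityPolicy`, `POLICIES`, and `next_update_due` are hypothetical names, and `None` stands in for the entries without a fixed interval ("Next business day", "At milestones", "Standard cycle").

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    response_time: Optional[timedelta]    # None = next business day
    update_interval: Optional[timedelta]  # None = no fixed interval


# Values copied from the severity table above.
POLICIES = {
    "SEV-1": SeverityPolicy(timedelta(minutes=15), timedelta(minutes=15)),
    "SEV-2": SeverityPolicy(timedelta(minutes=30), timedelta(minutes=30)),
    "SEV-3": SeverityPolicy(timedelta(hours=2), None),
    "SEV-4": SeverityPolicy(None, None),
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None if cadence is not time-based."""
    interval = POLICIES[severity].update_interval
    return None if interval is None else last_update + interval


print(next_update_due("SEV-1", datetime(2026, 4, 1, 12, 0)))
```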
Step 2 -- Establish command.
The Incident Commander:
- Is assigned within 5 min (SEV-1) or 30 min (SEV-2)
- Creates war room and incident tracking ticket
- Sends initial notification using generated template
- Coordinates between technical teams and stakeholders
- Shields responders from external distractions
Step 3 -- Investigate and mitigate.
The agent generates targeted investigation commands based on the affected service:
    kubectl get pods -n production -l app=<service>
    kubectl logs -l app=<service> --tail=100
    helm history <service> -n production
Decision framework for SEV-1/SEV-2:
- Bias toward action over analysis
- Prefer rollbacks to risky fixes under pressure
- Document every decision for later review
- Consult SMEs but do not block on them
Step 4 -- Communicate.
The agent generates three communication templates per severity:
- Internal notification -- technical details, response team, war room link
- Executive summary -- business impact, ETA, leadership actions required
- Customer communication -- impact scope, what is being done, next update time
Validation checkpoint: All stakeholders notified within committed timeframes.
Workflow 2: Post-Incident Review
Step 1 -- Reconstruct the timeline.
python scripts/timeline_reconstructor.py --input events.json --detect-phases --gap-analysis --format markdown
The agent accepts events from logs, alerts, Slack messages, and deployment systems. Each event needs a timestamp and a description. Optional fields: source, type, actor, severity.
Supported phases: detection, declaration, escalation, investigation, mitigation, communication, resolution.
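To make the reconstruction step concrete, the sketch below normalizes mixed timestamp formats, sorts events chronologically, and flags gaps between consecutive events. It is an illustration only and does not reproduce timeline_reconstructor.py: the key names `timestamp` and `description`, the 15-minute gap threshold, and the choice to treat naive timestamps as UTC are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(value) -> datetime:
    """Accept Unix epoch seconds (int/float) or an ISO-8601 string."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Treat naive timestamps as UTC (an assumption of this sketch).
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def build_timeline(events):
    """Return events sorted chronologically, each with a parsed 'at' field."""
    normalized = [{**e, "at": parse_timestamp(e["timestamp"])} for e in events]
    return sorted(normalized, key=lambda e: e["at"])


def find_gaps(timeline, threshold=timedelta(minutes=15)):
    """Return (earlier, later) event pairs separated by more than the threshold."""
    return [
        (a, b)
        for a, b in zip(timeline, timeline[1:])
        if b["at"] - a["at"] > threshold
    ]


events = [
    {"timestamp": "2026-04-01T12:40:00Z", "description": "Rollback completed", "source": "deploy"},
    {"timestamp": 1775045400, "description": "On-call acknowledged page", "source": "pager"},
    {"timestamp": "2026-04-01T12:00:00+00:00", "description": "Alert fired", "source": "monitoring"},
]

timeline = build_timeline(events)
for event in timeline:
    print(event["at"].isoformat(), event["description"])
for earlier, later in find_gaps(timeline):
    print("gap between:", earlier["description"], "->", later["description"])
```

Converting everything to timezone-aware UTC before sorting keeps events from different sources comparable, which matters when logs and chat exports disagree about time zones.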
Step 2 -- Perform root cause analysis.
python scripts/pir_generator.py --incident incident.json --timeline timeline.json --rca-method five_whys --action-items
Available RCA methods:
| Method | Best For |
|---|---|
| 5 Whys | Linear causal chains, quick analysis |
| Fishbone | Multi-category analysis (People, Process, Technology, Environment) |
| Timeline | Identifying missed decision points and delays |
| Bow Tie | Barriers analysis, prevention and mitigation controls |
Step 3 -- Generate action items.
The agent categorizes action items as: immediate_fix, process_improvement, monitoring_alerting, documentation, training, architectural, tooling.
Each action item includes: title, owner, priority, deadline, success criteria, and dependencies.
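One way to picture an action item is as a small record that carries the fields listed above. The sketch below is hypothetical: the `ActionItem` class, its field names, and the example values are illustrative and may not match the JSON keys that pir_generator.py actually emits.

```python
from dataclasses import dataclass, field
from datetime import date

# Category names copied from the list above.
CATEGORIES = {
    "immediate_fix",
    "process_improvement",
    "monitoring_alerting",
    "documentation",
    "training",
    "architectural",
    "tooling",
}


@dataclass
class ActionItem:
    title: str
    owner: str
    priority: str              # e.g. "P0"; the priority scheme is assumed
    deadline: date
    success_criteria: str
    category: str
    dependencies: list[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject categories outside the documented set.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


item = ActionItem(
    title="Add connection pool saturation alert",
    owner="alice@example.com",
    priority="P0",
    deadline=date(2026, 4, 3),
    success_criteria="Alert fires in staging when pool usage exceeds 90%",
    category="monitoring_alerting",
)
print(item.title, item.category)
```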
Step 4 -- Validate postmortem quality.
python scripts/postmortem_generator.py incident_data.json --format json
The agent checks:
- Every contributing factor has at least one action item (coverage gap detection)
- Action items have quality scores (0-100) based on specificity
- MTTD/MTTR benchmarked against industry standards
- Missing actions suggested for uncovered themes
Validation checkpoint: Zero coverage gaps. All P0 action items have owners and deadlines within 48 hours.
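The coverage-gap check amounts to asking which contributing factors no action item addresses. A minimal sketch follows, assuming each action item lists the factors it covers under a hypothetical `addresses` key; the real postmortem_generator.py may link factors and actions differently.

```python
def find_coverage_gaps(contributing_factors, action_items):
    """Return contributing factors that no action item claims to address."""
    covered = {
        factor
        for item in action_items
        for factor in item.get("addresses", [])
    }
    return [factor for factor in contributing_factors if factor not in covered]


factors = ["no pool saturation alert", "untested rollback", "stale runbook"]
actions = [
    {"title": "Add pool alert", "addresses": ["no pool saturation alert"]},
    {"title": "Rehearse rollback quarterly", "addresses": ["untested rollback"]},
]

gaps = find_coverage_gaps(factors, actions)
print(gaps)
```

An empty result corresponds to the "zero coverage gaps" condition in the validation checkpoint.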
Workflow 3: Escalation Management
Technical escalation path:
| Level | Role | SEV-1 Trigger | SEV-2 Trigger |
|---|---|---|---|
| L1 | On-call engineer | Immediate | 15 min |
| L2 | Senior engineer / Team lead | 30 min | 1 hour |
| L3 | Engineering Manager / Staff | 45 min | 2 hours |
| L4 | Director / CTO | 1 hour | 4 hours |
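The technical escalation table can be read as a set of elapsed-time thresholds. The sketch below assumes the trigger values mean minutes since the incident was opened without resolution, with "Immediate" treated as 0; that reading, along with the names `ESCALATION_MINUTES` and `levels_to_engage`, is an assumption rather than something the skill specifies.

```python
# Minutes after incident start at which each level is engaged,
# copied from the technical escalation table ("Immediate" = 0).
ESCALATION_MINUTES = {
    "SEV-1": {"L1": 0, "L2": 30, "L3": 45, "L4": 60},
    "SEV-2": {"L1": 15, "L2": 60, "L3": 120, "L4": 240},
}


def levels_to_engage(severity: str, minutes_open: int) -> list[str]:
    """Return the escalation levels whose trigger time has been reached."""
    thresholds = ESCALATION_MINUTES[severity]
    return [level for level, minutes in thresholds.items() if minutes_open >= minutes]


print(levels_to_engage("SEV-1", 50))
print(levels_to_engage("SEV-2", 10))
```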
Business escalation:
| Severity | Duration | Escalate To |
|---|---|---|
| SEV-1 | Immediate | VP Engineering |
| SEV-1 | 30 min | CTO + Customer Success VP |
| SEV-1 | 1 hour | CEO + Full Executive Team |
| SEV-2 | 2 hours | VP Engineering |
| SEV-2 | 4 hours | CTO |
Anti-Patterns
- Individual blame in postmortems -- focus on system failures. "Why did the process allow this?" not "Why did Alice do this?"
- Skipping PIR for SEV-2 -- every SEV-1 and SEV-2 gets a postmortem within 3 business days.
- Action items without owners -- every item needs a specific person and deadline.
- Deploying fixes under pressure without validation -- validate fixes before declaring resolution; plan for secondary failures.
- Communication gaps -- provide updates even when there is no new information.
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Classifier assigns SEV1 to minor issues | Description keywords trigger high severity without impact data | Provide percentage and fields |
| Timeline shows "No valid events found" | Timestamps in unsupported format or missing key | Use ISO-8601, , or Unix epoch |
| PIR produces shallow 5 Whys | Incident data lacks detail | Enrich input with , ; supply timeline via |
| Postmortem marks all action items invalid | Missing required fields | Each action item needs , , , |
| Severity score seems too low | Flat description without structured impact data | Provide full schema with , , keys |
References
| Guide | Path |
|---|---|
| Incident Response Framework | |
| Severity Matrix | |
| Communication Templates | |
| RCA Frameworks Guide | |
| SLA Management | |
Integration Points
| Skill | Integration |
|---|---|
| | Monitoring alerts feed timeline; runbook templates inform playbooks |
| | Security incidents auto-escalate to SEV-1; breach indicators trigger SecOps response |
| | Deployment events feed timeline; rollback data informs release gates |
| | Architectural root causes escalate to architecture review |
| | PIR action items route to code review workflows |
Last Updated: April 2026
Version: 1.1.0