Vibeship-spawner-skills incident-responder

id: incident-responder

install

source · Clone the upstream repo

git clone https://github.com/vibeforge1111/vibeship-spawner-skills

manifest: mind/incident-responder/skill.yaml

BLAMELESS POST-MORTEM STRUCTURE:

1. Summary

""" One paragraph: What happened, how long, who was affected.

Example: "On Jan 15, the orders API was unavailable for 45 minutes between 14:30-15:15 UTC. All checkout attempts failed, affecting approximately 2,000 users. Revenue impact was estimated at $50,000." """

2. Timeline

"""

Time	Event
14:30	Alerts fire for orders API 500 errors
14:32	On-call acknowledges, opens incident channel
14:35	Alice joins, begins investigation
14:40	Identified: database connection exhaustion
14:45	Decision: restart API pods
14:50	Restart complete, errors continue
14:55	Identified: connection leak in new code
15:00	Rollback to previous version initiated
15:10	Rollback complete
15:15	Metrics normalized, incident resolved
"""

3. Contributing Factors (not "Root Cause")

""" ✗ DON'T: "Alice pushed code without testing" ✓ DO: "Code review didn't catch connection leak"

Contributing factors:

Connection pooling change introduced leak
Load testing doesn't simulate long-running connections
No alert for connection pool exhaustion
Rollback took 10 min due to slow deploy pipeline """

4. What Went Well

"""

Alert fired within 2 minutes of issue
Team assembled quickly
Communication was clear and timely
Rollback successfully resolved issue """

5. What Could Be Improved

"""

Connection pool monitoring needed
Load tests should simulate production patterns
Deploy pipeline could be faster
More specific runbook for this scenario """

6. Action Items (with owners and deadlines)

"""

Action	Owner	Due
Add connection pool alerts	Alice	Jan 22
Update load test scenarios	Bob	Jan 29
Create runbook for DB issues	Carol	Jan 25
Investigate deploy speedup	Dave	Feb 5
"""

name: Rollback vs Fix Forward description: Deciding how to resolve the incident when: Incident caused by recent change example: |

THE ROLLBACK DECISION:

Default: Rollback

""" If a recent deploy caused the incident, rollback should be your first instinct.

WHY:
- Immediately restores known-good state
- Doesn't require understanding the bug
- Low risk (returning to tested code)
- Buys time to fix properly """
When to Rollback:
- Recent deploy correlates with incident
- Rollback is fast (< 10 minutes)
- No data migration makes rollback complex
- Root cause is unclear
When to Fix Forward:
- Rollback is impossible (data migration)
- Fix is obvious and quick (< 5 minutes)
- Rollback would cause worse problems
- Root cause is clear and fix is safe
DECISION FRAMEWORK:

""" Is there a recent deploy? ──No──→ Investigate root cause │ Yes │ ▼ Is rollback possible? ──No──→ Fix forward (or workaround) │ Yes │ ▼ Is rollback fast? ──No──→ Consider fix forward if fix is faster │ Yes │ ▼ ROLLBACK FIRST Then fix the bug properly """

The "10-Minute Rule"

""" If you haven't mitigated within 10 minutes of diagnosis, seriously consider rollback or failover.

Debugging under pressure leads to mistakes. Rollback buys time to fix correctly. """
name: Severity Assessment description: Determining incident priority when: Incident detected, need to determine response level example: |

SEVERITY DEFINITIONS:

SEV1 - Critical (All hands)

""" Impact: Complete service outage or security breach Examples: - Site completely down - Database corruption - Security incident / data breach - Payment processing broken

Response: - Page all relevant on-call - All hands if needed - Executives notified - Customer communication prepared """

SEV2 - Major (On-call + team)

""" Impact: Major feature broken, significant user impact Examples: - Checkout broken for some users - API latency 10x normal - One region down, others working - Important integration broken

Response: - On-call responds - Team leads notified - Page additional help if needed - 30-minute update cadence """

SEV3 - Minor (On-call)

""" Impact: Limited feature impact, workaround exists Examples: - Non-critical feature broken - Degraded performance (not severe) - One customer affected - Background job delayed

Response: - On-call investigates - May defer to business hours - Fix within 24 hours - 2-hour update cadence """

SEV4 - Low (Next business day)

""" Impact: Minimal, cosmetic, no user impact Examples: - UI alignment issue - Log errors (no user impact) - Internal tool slow - Dashboard broken

Response: - Document for next day - No after-hours response - Fix when convenient """

SEVERITY ESCALATION:

""" Escalate if:
- Issue not resolved within expected time
- Impact growing
- Need additional expertise
- Customer escalations happening """

anti_patterns:

name: Hero Debugging description: One person trying to solve everything alone why: | The lone hero works in isolation, doesn't communicate, and creates a single point of failure. When they get stuck or burned out, nobody knows what's been tried. Team collaboration solves incidents faster. instead: Designate roles. Communicate constantly. Work in pairs if possible. Share what you're trying.
name: Blame Storming description: Post-mortem focused on who caused the incident why: | Blame discourages transparency. People hide mistakes instead of reporting them. Future incidents are worse because people don't admit early warnings. Learning stops when punishment begins. instead: Focus on contributing factors. Ask "what" and "how," not "who." Blame the process, improve the system.
name: Premature All-Clear description: Declaring incident resolved too quickly why: | "It looks fixed" is not "it's fixed." Declaring resolution too early leads to repeat pages, eroded trust, and incomplete fixes. The incident might not actually be resolved. instead: Wait 15-30 minutes after fix. Check metrics return to baseline. Verify with affected flows.
name: Fix Without Understanding description: Applying changes until something works why: | Random changes might mask the problem or cause new ones. Without understanding the cause, you can't prevent recurrence. You might make things worse. instead: Mitigate first (rollback), then understand. Don't ship a "fix" you don't understand.
name: Silent Incident description: Working on incident without communicating why: | Stakeholders panic when they see impact but hear nothing. Duplicate work happens when people don't know who's handling what. Trust erodes with silence. instead: Post updates every 30 minutes. Even "still investigating" is better than silence.
name: Alert Fatigue Acceptance description: Ignoring alerts because there are too many why: | Noisy alerts train teams to ignore alerts. When a real incident happens, it's missed in the noise. On-call becomes meaningless rotation. instead: Every alert should be actionable. Fix or delete noisy alerts. Quality over quantity.

handoffs:

trigger: deep technical debugging needed to: debugging-master context: Incident mitigated, need root cause investigation
trigger: performance root cause needed to: performance-thinker context: Performance incident, need optimization expertise
trigger: architectural fix needed to: system-designer context: Root cause is architectural, need design changes
trigger: prioritizing post-incident fixes to: decision-maker context: Multiple action items, need prioritization
trigger: incident revealed tech debt to: tech-debt-manager context: Underlying debt contributed to incident

Vibeship-spawner-skills incident-responder

THE OODA LOOP FOR INCIDENTS:

Observe → Orient → Decide → Act → (Repeat)

PHASE 1: DETECT AND ACKNOWLEDGE (first 5 minutes)

PHASE 2: ASSESS SEVERITY

PHASE 3: MITIGATE (first priority)

PHASE 4: COMMUNICATE

PHASE 5: RESOLVE AND VERIFY

PHASE 6: POST-INCIDENT

INCIDENT COMMANDER RESPONSIBILITIES:

1. Coordinate, Don't Fix

2. Delegate Roles

3. Control the Channel

4. Ask the Key Questions

5. Make the Calls

6. Declare Resolution

BLAMELESS POST-MORTEM STRUCTURE:

1. Summary

2. Timeline

3. Contributing Factors (not "Root Cause")

4. What Went Well

5. What Could Be Improved

6. Action Items (with owners and deadlines)

THE ROLLBACK DECISION:

Default: Rollback

When to Rollback:

When to Fix Forward:

DECISION FRAMEWORK:

The "10-Minute Rule"

SEVERITY DEFINITIONS:

SEV1 - Critical (All hands)

SEV2 - Major (On-call + team)

SEV3 - Minor (On-call)

SEV4 - Low (Next business day)

SEVERITY ESCALATION: