Claude-skill-registry Escalation Paths
Clear escalation procedures and paths for incident response
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/escalation-paths" ~/.claude/skills/majiayu000-claude-skill-registry-escalation-paths && rm -rf "$T"
manifest:
skills/data/escalation-paths/SKILL.md
Escalation Paths
Overview
Escalation paths define when and how to escalate incidents to ensure the right expertise is engaged at the right time. Effective escalation prevents incidents from languishing while avoiding unnecessary wake-up calls.
Core Principle: "Escalate early for critical issues, but don't cry wolf for minor problems."
1. What is Escalation and When to Escalate
Definition
Escalation: The process of engaging additional resources or higher-level expertise when:

- Current responder cannot resolve the issue
- Incident exceeds time/severity thresholds
- Specialized expertise is needed
- Executive visibility is required
When to Escalate
✓ Escalate when:

- SEV0 incident (always, immediately)
- SEV1 not resolved in 30 minutes
- SEV2 not resolved in 2 hours
- You don't know how to fix it
- Issue affects multiple teams
- Requires specialized expertise (database, security, networking)
- Customer escalation (enterprise customer affected)
- Regulatory/legal implications

✗ Don't escalate when:

- You can fix it yourself in < 15 minutes
- It's a known issue with a documented fix
- It's outside business hours for SEV3/4
- You haven't tried basic troubleshooting
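The checklist above can be sketched as a small decision helper. This is an illustration, not real incident tooling: the `Severity` type, the `needsSpecialistExpertise` flag, and the function name are assumptions, while the thresholds (30 minutes for SEV1, 2 hours for SEV2) mirror the text.

```typescript
// Hedged sketch of the escalate/don't-escalate rules above.
type Severity = 'SEV0' | 'SEV1' | 'SEV2' | 'SEV3' | 'SEV4';

function shouldEscalate(
  severity: Severity,
  minutesElapsed: number,
  needsSpecialistExpertise: boolean = false
): boolean {
  if (severity === 'SEV0') return true;        // always, immediately
  if (needsSpecialistExpertise) return true;   // database, security, networking, ...
  if (severity === 'SEV1' && minutesElapsed >= 30) return true;
  if (severity === 'SEV2' && minutesElapsed >= 120) return true;
  return false;
}
```

Note the deliberate asymmetry: SEV0 and expertise gaps escalate regardless of elapsed time, while lower severities wait for their threshold.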
2. Escalation Triggers
2.1 Severity Thresholds
```typescript
interface EscalationRule {
  severity: string;
  immediateEscalation: boolean;
  escalateAfter?: number; // minutes
  escalateTo: string[];
}

const escalationRules: EscalationRule[] = [
  {
    severity: 'SEV0',
    immediateEscalation: true,
    escalateTo: ['on-call-senior', 'team-lead', 'engineering-manager', 'cto']
  },
  {
    severity: 'SEV1',
    immediateEscalation: false,
    escalateAfter: 30,
    escalateTo: ['on-call-senior', 'team-lead']
  },
  {
    severity: 'SEV2',
    immediateEscalation: false,
    escalateAfter: 120,
    escalateTo: ['team-lead']
  },
  {
    severity: 'SEV3',
    immediateEscalation: false,
    escalateAfter: 480,
    escalateTo: ['team-lead']
  }
];
```
2.2 Time-Based Escalation
Automatic escalation based on duration:

SEV0:
- 0 minutes: Page on-call engineer
- 0 minutes: Page senior engineer (parallel)
- 0 minutes: Notify team lead
- 15 minutes: Escalate to engineering manager
- 30 minutes: Escalate to CTO

SEV1:
- 0 minutes: Page on-call engineer
- 30 minutes: Escalate to senior engineer
- 60 minutes: Escalate to team lead
- 120 minutes: Escalate to engineering manager

SEV2:
- 0 minutes: Notify on-call engineer
- 120 minutes: Escalate to team lead
- 240 minutes: Escalate to engineering manager

SEV3:
- 0 minutes: Create ticket
- 480 minutes: Notify team lead (business hours)
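The timed ladders above can be expressed as data so an automation job can compute who should already be engaged at any point in an incident. The minute values come from the text; the interface and function names are illustrative assumptions.

```typescript
// The SEV0/SEV1 ladders above as data (SEV2/SEV3 would follow the same shape).
interface TimedStep { afterMinutes: number; notify: string }

const timedEscalation: Record<string, TimedStep[]> = {
  SEV0: [
    { afterMinutes: 0,  notify: 'on-call engineer' },
    { afterMinutes: 0,  notify: 'senior engineer' },
    { afterMinutes: 0,  notify: 'team lead' },
    { afterMinutes: 15, notify: 'engineering manager' },
    { afterMinutes: 30, notify: 'cto' },
  ],
  SEV1: [
    { afterMinutes: 0,   notify: 'on-call engineer' },
    { afterMinutes: 30,  notify: 'senior engineer' },
    { afterMinutes: 60,  notify: 'team lead' },
    { afterMinutes: 120, notify: 'engineering manager' },
  ],
};

// Who should have been engaged by a given elapsed time?
function engagedBy(severity: string, minutesElapsed: number): string[] {
  return (timedEscalation[severity] ?? [])
    .filter(s => s.afterMinutes <= minutesElapsed)
    .map(s => s.notify);
}
```

For example, 45 minutes into a SEV1 the on-call and senior engineers should both be engaged, but not yet the team lead.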
2.3 Expertise Needed
Escalate to a subject matter expert (SME) when:

Database issues:
- Slow queries
- Connection pool exhaustion
- Replication lag
- Failover needed
- → Escalate to: @database-team

Security issues:
- Suspected breach
- DDoS attack
- Vulnerability exploitation
- → Escalate to: @security-team

Infrastructure issues:
- Kubernetes cluster problems
- Network issues
- Cloud provider outage
- → Escalate to: @platform-team

Application-specific:
- Payment processing
- Authentication
- Search functionality
- → Escalate to: @payments-team, @auth-team, @search-team
2.4 Cross-Team Dependencies
Escalate when an issue spans multiple teams:

Example: Payment processing down
- Affects: Frontend, Backend, Payments, Database
- Escalate to: All affected teams
- Coordinate: Incident commander needed

Example: Database slow
- Affects: All services using the database
- Escalate to: Database team (primary), all service teams (notify)
2.5 Executive Visibility Required
Escalate to executives when:

SEV0 incidents (always):
- Complete outage
- Data breach
- Major customer impact

Business impact:
- Revenue loss > $50k/hour
- SLA breach with penalties
- Regulatory violation
- PR/reputation risk

Customer escalation:
- Enterprise customer affected
- Customer threatening to churn
- Legal action threatened
3. Escalation Levels
L1: First Responder (On-Call Engineer)
Role: Primary on-call engineer

Responsibilities:
- Acknowledge alerts within 5-15 minutes
- Perform initial triage
- Follow runbooks
- Resolve common issues
- Escalate when needed

Skills:
- General system knowledge
- Basic troubleshooting
- Runbook execution

Escalation criteria:
- Can't resolve in 30 minutes (SEV1)
- Needs specialized expertise
- SEV0 incident
L2: Subject Matter Expert / Team Lead
Role: Senior engineer or team lead

Responsibilities:
- Deep technical investigation
- Complex troubleshooting
- Decision-making (rollback vs. fix forward)
- Coordinate with other teams
- Guide the L1 engineer

Skills:
- Deep system knowledge
- Advanced troubleshooting
- Architecture understanding

Escalation criteria:
- Can't resolve in 60 minutes (SEV1)
- Requires an architectural decision
- Cross-team coordination needed
- SEV0 incident
L3: Architect / Principal Engineer
Role: Principal engineer or architect

Responsibilities:
- Architectural decisions
- Complex system-wide issues
- Design emergency fixes
- Long-term solution planning

Skills:
- System architecture expertise
- Cross-system knowledge
- Strategic thinking

Escalation criteria:
- Architectural change needed
- System-wide impact
- Novel failure mode
- SEV0 lasting > 30 minutes
L4: Director / VP / CTO
Role: Engineering leadership

Responsibilities:
- Executive decision-making
- Resource allocation
- Customer communication (enterprise)
- PR/legal coordination
- Post-incident accountability

Escalation criteria:
- SEV0 incident (always notified)
- Major business impact
- Customer escalation
- Regulatory/legal issues
- PR crisis
4. Escalation Paths by Service/Component
Service-Specific Escalation Matrix
```typescript
interface ServiceEscalation {
  service: string;
  primary: string;
  secondary: string;
  sme: string[];
  executive: string;
}

const escalationMatrix: ServiceEscalation[] = [
  {
    service: 'api-gateway',
    primary: '@oncall-platform',
    secondary: '@platform-lead',
    sme: ['@platform-architect', '@networking-team'],
    executive: '@vp-engineering'
  },
  {
    service: 'user-service',
    primary: '@oncall-backend',
    secondary: '@backend-lead',
    sme: ['@auth-expert', '@database-team'],
    executive: '@vp-engineering'
  },
  {
    service: 'payment-service',
    primary: '@oncall-payments',
    secondary: '@payments-lead',
    sme: ['@payments-architect', '@security-team'],
    executive: '@cto' // High-stakes service
  },
  {
    service: 'database',
    primary: '@oncall-database',
    secondary: '@database-lead',
    sme: ['@dba-senior', '@platform-team'],
    executive: '@vp-engineering'
  }
];
```
Escalation Flow Diagram
```
Incident Detected
        ↓
L1: On-Call Engineer
        ↓
Can resolve? ──YES→ Resolve & Document
        ↓ NO
Needs expertise? ──NO→ Continue investigation
        ↓ YES          (escalate if not resolved in 2 hours)
L2: SME / Team Lead
        ↓
Can resolve? ──YES→ Resolve
        ↓ NO
SEV0 or > 30 min? ──YES→ L3: Architect / Principal
                                ↓
                         Can resolve? ──YES→ Resolve
                                ↓ NO
                         L4: Executive
                         (major decisions, resource allocation)
```
5. When NOT to Escalate (Avoid Alert Fatigue)
Don't Escalate If
❌ You haven't tried basic troubleshooting
- Check logs
- Review recent changes
- Follow the runbook
- Test critical paths

❌ It's a known issue with a documented fix
- Check the runbook first
- Search incident history
- Review documentation

❌ You can fix it in < 15 minutes
- Simple restart
- Clear cache
- Known configuration fix

❌ It's outside business hours for low severity
- SEV3/4 can wait until morning
- No customer impact
- Non-urgent

❌ You're escalating just to cover yourself
- Escalate because you need help, not to avoid responsibility
Alert Fatigue Prevention
```typescript
// Track escalation patterns
interface EscalationMetrics {
  engineer: string;
  totalEscalations: number;
  appropriateEscalations: number;
  prematureEscalations: number;
  delayedEscalations: number;
}

// Flag concerning patterns
function analyzeEscalationPatterns(metrics: EscalationMetrics[]) {
  for (const m of metrics) {
    if (m.totalEscalations === 0) continue; // avoid division by zero

    const prematureRate = m.prematureEscalations / m.totalEscalations;
    if (prematureRate > 0.5) {
      console.log(`⚠️ ${m.engineer} escalates too quickly (${(prematureRate * 100).toFixed(0)}%)`);
      // Provide additional training
    }
    if (m.delayedEscalations > 5) {
      console.log(`⚠️ ${m.engineer} delays escalation too often`);
      // Review escalation criteria
    }
  }
}
```
6. Escalation Procedures
6.1 Who to Contact
Escalation Directory:

L1 (On-Call Engineer):
- PagerDuty: @oncall-primary
- Slack: #oncall-primary
- Phone: (from PagerDuty)

L2 (Team Lead):
- PagerDuty: @oncall-secondary
- Slack: @team-lead
- Phone: +1-555-0100

L3 (Architect):
- Slack: @principal-engineer
- Phone: +1-555-0200
- Email: architect@example.com

L4 (Executive):
- Slack: @vp-engineering
- Phone: +1-555-0300 (SEV0 only)
- Email: vp@example.com
6.2 How to Contact
Contact Methods by Severity:

SEV0:
1. PagerDuty (immediate page)
2. Phone call (if no response in 2 minutes)
3. Slack @mention + DM
4. Escalate to the next level if no response in 5 minutes

SEV1:
1. PagerDuty (page)
2. Slack @mention in the incident channel
3. Phone call if no response in 15 minutes

SEV2:
1. Slack @mention in the incident channel
2. PagerDuty (low urgency)
3. Email (if outside business hours)

SEV3/4:
1. Slack message (no @mention)
2. Email
3. Create a ticket
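The per-severity ladders above can be encoded as ordered data so a notification bot can try each method in turn. The wait times mirror the text; the method names, interface, and fallback-to-SEV3/4 behavior are illustrative assumptions.

```typescript
// Sketch of the contact ladders above as ordered data.
interface ContactStep { method: string; escalateIfNoResponseMin?: number }

const contactPlan: Record<string, ContactStep[]> = {
  'SEV0': [
    { method: 'pagerduty-immediate-page', escalateIfNoResponseMin: 2 },
    { method: 'phone-call' },
    { method: 'slack-mention-and-dm', escalateIfNoResponseMin: 5 },
  ],
  'SEV1': [
    { method: 'pagerduty-page' },
    { method: 'slack-mention-incident-channel', escalateIfNoResponseMin: 15 },
    { method: 'phone-call' },
  ],
  'SEV2': [
    { method: 'slack-mention-incident-channel' },
    { method: 'pagerduty-low-urgency' },
    { method: 'email-outside-business-hours' },
  ],
  'SEV3/4': [
    { method: 'slack-message' },
    { method: 'email' },
    { method: 'create-ticket' },
  ],
};

// Ordered contact methods; unknown severities fall back to the lowest tier.
function contactSequence(severity: string): string[] {
  return (contactPlan[severity] ?? contactPlan['SEV3/4']).map(s => s.method);
}
```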
6.3 What Information to Provide
## Escalation Message Template

**Escalating to**: @senior-engineer
**From**: @oncall-engineer
**Incident**: INC-2024-001
**Severity**: SEV1
**Time**: 10:45 UTC

**Summary**: API Gateway returning 503 errors for 30 minutes. 50% of users affected.

**What I've Tried**:
- ✅ Checked logs (found database connection errors)
- ✅ Verified database is running
- ✅ Restarted API pods (no improvement)
- ✅ Reviewed recent deployments (none in last 24 hours)

**Current Status**:
- Error rate: 45%
- Users affected: ~25,000
- Duration: 30 minutes

**Why Escalating**:
- SEV1 not resolved in 30 minutes
- Database connection issue beyond my expertise
- Need database team involvement

**Next Steps I Recommend**:
- Check database connection pool
- Review database performance
- Consider database failover

**Links**:
- Incident channel: #inc-2024-001
- Dashboard: https://grafana.example.com/d/incident
- Runbook: https://wiki.example.com/runbooks/db-connection

**Questions?** Ask in #inc-2024-001
6.4 Handoff Checklist
## Escalation Handoff Checklist

### Context
- [ ] Incident ID and severity
- [ ] What's broken (specific symptoms)
- [ ] Impact (users, revenue, services)
- [ ] Timeline (when it started, key events)
- [ ] What's been tried
- [ ] Current hypothesis
- [ ] Relevant links

### Communication
- [ ] Incident channel ownership transferred
- [ ] Status page update responsibility
- [ ] Stakeholder notification
- [ ] Next update timing

### Access
- [ ] Necessary permissions granted
- [ ] VPN/SSH access confirmed
- [ ] Tool access verified

### Actions
- [ ] Current action in progress
- [ ] Next steps documented
- [ ] Blockers identified
7. Escalation SLAs
Response Time SLAs
| Escalation Level | Acknowledgement | Join Incident |
|------------------|-----------------|---------------|
| L1 (On-Call)     | 5 min (SEV0)    | Immediate     |
|                  | 15 min (SEV1)   | Immediate     |
| L2 (Team Lead)   | 10 min (SEV0)   | 5 min         |
|                  | 30 min (SEV1)   | 15 min        |
| L3 (Architect)   | 15 min (SEV0)   | 10 min        |
|                  | 60 min (SEV1)   | 30 min        |
| L4 (Executive)   | 30 min (SEV0)   | As needed     |
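The acknowledgement column of the SLA table can be checked mechanically. A minimal sketch, assuming the table's minute values; levels with no entry for a severity (e.g. L4 for SEV1) are treated as having no SLA.

```typescript
// Acknowledgement SLAs from the table above, plus a breach check.
const ackSlaMinutes: Record<string, Record<string, number>> = {
  L1: { SEV0: 5,  SEV1: 15 },
  L2: { SEV0: 10, SEV1: 30 },
  L3: { SEV0: 15, SEV1: 60 },
  L4: { SEV0: 30 },
};

function ackSlaBreached(level: string, severity: string, ackMinutes: number): boolean {
  const sla = ackSlaMinutes[level]?.[severity];
  return sla !== undefined && ackMinutes > sla; // no defined SLA -> never breached
}
```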
Escalation Timeout
```typescript
// Auto-escalate if no response
async function escalateWithTimeout(
  level: string,
  contact: string,
  timeoutMinutes: number
) {
  const escalation = await sendEscalation(level, contact);

  // Wait for acknowledgement
  const acknowledged = await waitForAck(escalation.id, timeoutMinutes);

  if (!acknowledged) {
    console.log(`No response from ${contact}, escalating to next level`);
    await escalateToNextLevel(level);
  }
}
```
8. On-Call Rotation Tiers
Tier Structure
Tier 1 (Primary On-Call):
- Role: First responder
- Rotation: Weekly
- Compensation: On-call pay + overtime
- Responsibilities:
  - Respond to all alerts
  - Initial triage
  - Resolve common issues
  - Escalate when needed

Tier 2 (Secondary On-Call):
- Role: Backup and escalation
- Rotation: Weekly (offset from Tier 1)
- Compensation: On-call pay
- Responsibilities:
  - Backup if Tier 1 is unavailable
  - Escalation for complex issues
  - Subject matter expertise

Tier 3 (Management On-Call):
- Role: Executive escalation
- Rotation: Monthly
- Compensation: Included in salary
- Responsibilities:
  - SEV0 incidents
  - Executive decisions
  - Customer communication
Rotation Schedule
Week 1:
- Tier 1: Alice
- Tier 2: Bob
- Tier 3: Charlie (Manager)

Week 2:
- Tier 1: Bob
- Tier 2: Charlie
- Tier 3: Charlie (Manager)

Week 3:
- Tier 1: Charlie
- Tier 2: Alice
- Tier 3: David (Manager)
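The weekly round-robin above can be computed from a fixed roster. A minimal sketch, assuming a simple 1-based week counter with no overrides; real schedulers (PagerDuty, Opsgenie) also handle swaps, holidays, and time zones.

```typescript
// Weekly round-robin for Tier 1, mirroring the schedule above.
const tier1Roster = ['Alice', 'Bob', 'Charlie'];

function tier1OnCall(weekNumber: number): string {
  // Week 1 -> Alice, Week 2 -> Bob, Week 3 -> Charlie, Week 4 -> Alice, ...
  return tier1Roster[(weekNumber - 1) % tier1Roster.length];
}
```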
9. Subject Matter Expert (SME) Registry
SME Directory
```typescript
interface SME {
  name: string;
  expertise: string[];
  contact: {
    slack: string;
    phone: string;
    email: string;
  };
  availability: string;
  escalationCriteria: string;
}

const smeRegistry: SME[] = [
  {
    name: 'Alice Chen',
    expertise: ['PostgreSQL', 'Database Performance', 'Replication'],
    contact: { slack: '@alice', phone: '+1-555-0101', email: 'alice@example.com' },
    availability: '24/7 for SEV0/1',
    escalationCriteria: 'Database issues, slow queries, failover needed'
  },
  {
    name: 'Bob Smith',
    expertise: ['Kubernetes', 'Infrastructure', 'Networking'],
    contact: { slack: '@bob', phone: '+1-555-0102', email: 'bob@example.com' },
    availability: 'Business hours + SEV0',
    escalationCriteria: 'K8s cluster issues, networking, infrastructure'
  },
  {
    name: 'Carol Johnson',
    expertise: ['Security', 'Authentication', 'Compliance'],
    contact: { slack: '@carol', phone: '+1-555-0103', email: 'carol@example.com' },
    availability: '24/7 for security incidents',
    escalationCriteria: 'Security breaches, auth issues, compliance'
  }
];
```
SME Lookup Tool
```typescript
// Find SMEs for a specific issue by matching expertise keywords
function findSME(issue: string): SME[] {
  return smeRegistry.filter(sme =>
    sme.expertise.some(exp =>
      issue.toLowerCase().includes(exp.toLowerCase())
    )
  );
}

// Usage
const databaseSMEs = findSME('PostgreSQL slow queries');
console.log(`Contact: ${databaseSMEs[0].contact.slack}`);
```
10. Cross-Team Escalation
Cross-Team Escalation Matrix
| Issue Type       | Primary Team | Secondary Team | Coordinator   |
|------------------|--------------|----------------|---------------|
| API Gateway Down | Platform     | Backend        | Platform Lead |
| Database Slow    | Database     | All Services   | Database Lead |
| Payment Failing  | Payments     | Backend        | Payments Lead |
| Security Breach  | Security     | All Teams      | CISO          |
| Network Issues   | Platform     | Infrastructure | Platform Lead |
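The matrix above can double as a lookup table so an incident bot tags the right teams automatically. Team names mirror the table; the kebab-case keys, interface, and helper function are illustrative assumptions.

```typescript
// The cross-team matrix above as a lookup.
interface CrossTeamRoute { primary: string; secondary: string; coordinator: string }

const crossTeamMatrix: Record<string, CrossTeamRoute> = {
  'api-gateway-down': { primary: 'Platform', secondary: 'Backend',        coordinator: 'Platform Lead' },
  'database-slow':    { primary: 'Database', secondary: 'All Services',   coordinator: 'Database Lead' },
  'payment-failing':  { primary: 'Payments', secondary: 'Backend',        coordinator: 'Payments Lead' },
  'security-breach':  { primary: 'Security', secondary: 'All Teams',      coordinator: 'CISO' },
  'network-issues':   { primary: 'Platform', secondary: 'Infrastructure', coordinator: 'Platform Lead' },
};

// Teams to notify for a given issue type (primary first).
function teamsToNotify(issueType: string): string[] {
  const route = crossTeamMatrix[issueType];
  return route ? [route.primary, route.secondary] : [];
}
```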
Cross-Team Communication
## Cross-Team Escalation Template

**To**: @backend-team, @frontend-team
**From**: @platform-team
**Incident**: INC-2024-001 (SEV1)
**Impact**: API Gateway down, affecting all services

**What We Know**:
- API Gateway returning 503 errors
- Started at 10:13 UTC
- All services affected
- Root cause: Under investigation

**What We Need From You**:
- Backend: Check if your services are receiving traffic
- Frontend: Enable fallback UI for offline mode
- All: Monitor your error rates

**Coordination**:
- War room: https://zoom.us/j/123456
- Incident channel: #inc-2024-001
- Next update: 10:30 UTC (15 minutes)

**Point of Contact**: @platform-lead
11. Vendor Escalation (AWS Support, etc.)
When to Escalate to Vendor
Escalate to the cloud provider when:
- ✓ Suspected provider outage
- ✓ Infrastructure issue beyond your control
- ✓ Need architectural guidance
- ✓ Performance issue with a managed service
- ✓ Billing/quota issues

Don't escalate when:
- ✗ It's your application code
- ✗ You haven't checked the status page
- ✗ It's a known limitation
AWS Support Escalation
```bash
# Check AWS Service Health
aws health describe-events --filter eventTypeCategories=issue

# Open a support case
aws support create-case \
  --subject "RDS instance unresponsive" \
  --service-code "amazon-rds" \
  --severity-code "urgent" \
  --category-code "performance" \
  --communication-body "Production RDS instance db-prod-01 is unresponsive. All queries timing out. Started at 10:13 UTC. Affecting 100% of users."

# Escalate an existing case
aws support add-communication-to-case \
  --case-id "case-123456" \
  --communication-body "Issue is SEV0, please escalate to a senior support engineer"
```
Vendor Escalation Tiers
AWS Support Tiers:
- Developer: Business hours, general guidance
- Business: 24/7, < 1-hour response for production down
- Enterprise: 24/7, < 15-minute response for business-critical down, TAM

GCP Support Tiers:
- Basic: Community support only
- Standard: 4-hour response for P2
- Enhanced: 1-hour response for P1
- Premium: 15-minute response for P1, TAM

Azure Support Tiers:
- Basic: Billing and subscription support
- Developer: Business hours
- Standard: 24/7, < 1 hour for critical
- Professional Direct: < 1 hour for critical, TAM
12. Executive Escalation (When to Wake the CTO)
When to Escalate to Executives
Always escalate to the CTO/VP for:
- ✓ SEV0 incidents
- ✓ Data breach or security incident
- ✓ Revenue loss > $100k
- ✓ Major customer threatening to churn
- ✓ Regulatory violation
- ✓ PR crisis / media attention
- ✓ Legal action threatened

Consider escalating for:
- ✓ SEV1 lasting > 2 hours
- ✓ Multiple SEV1 incidents in a short time
- ✓ Pattern of recurring issues
- ✓ Team morale crisis
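The "always escalate" criteria above can be sketched as a predicate over a few incident facts. The $100k and 2-hour SEV1 thresholds come from the text; the `IncidentFacts` shape is an assumption and omits softer criteria (churn threats, PR risk) that need human judgment.

```typescript
// Hedged sketch of the hard executive-escalation criteria above.
interface IncidentFacts {
  severity: string;
  durationMinutes: number;
  revenueLossUsd: number;
  dataBreach: boolean;
  regulatoryImpact: boolean;
}

function mustNotifyExecutives(i: IncidentFacts): boolean {
  return (
    i.severity === 'SEV0' ||
    i.dataBreach ||
    i.regulatoryImpact ||
    i.revenueLossUsd > 100_000 ||
    (i.severity === 'SEV1' && i.durationMinutes > 120) // "SEV1 lasting > 2 hours"
  );
}
```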
Executive Escalation Template
## Executive Escalation

**To**: @cto
**From**: @engineering-manager
**Urgency**: High
**Time**: 11:00 UTC

**Situation**: SEV0 incident: complete service outage for 45 minutes

**Impact**:
- Users affected: 100% (~50,000 active users)
- Revenue loss: ~$75,000
- SLA breach: Yes (99.9% uptime)
- Customer complaints: 237 support tickets

**Root Cause**: Database connection pool exhausted due to a connection leak in the v2.5.0 deployment

**Current Status**:
- Rolled back to v2.4.9 at 10:40 UTC
- Service recovering
- Error rate dropping (currently 5%, target < 1%)

**Next Steps**:
- Monitor for 30 minutes
- Investigate the connection leak offline
- Postmortem scheduled for tomorrow at 10:00 AM

**Customer Communication**:
- Status page updated
- Email sent to affected users
- Enterprise customers notified directly

**What We Need From You**:
- Approval for postmortem resources
- Customer communication review
- Decision on compensation for affected customers
13. De-Escalation Procedures
When to De-Escalate
De-escalate when:
- ✓ Issue resolved
- ✓ Severity downgraded (SEV0 → SEV1)
- ✓ Handing off to the regular business-hours team
- ✓ Specialized expertise no longer needed
De-Escalation Checklist
## De-Escalation Checklist

### Before De-Escalating
- [ ] Issue resolved or significantly mitigated
- [ ] Monitoring shows a stable state
- [ ] Root cause identified (or an investigation plan in place)
- [ ] Documentation updated
- [ ] Stakeholders notified

### De-Escalation Communication
- [ ] Thank escalated team members
- [ ] Summarize the resolution
- [ ] Document learnings
- [ ] Schedule a postmortem (if needed)
- [ ] Update the incident status

### Handoff
- [ ] Transfer ownership to the business-hours team (if applicable)
- [ ] Document remaining work
- [ ] Create follow-up tickets
De-Escalation Message
## De-Escalation Notice

**Incident**: INC-2024-001 (SEV1 → Resolved)
**Time**: 11:30 UTC
**Duration**: 77 minutes

**Resolution**: Rolled back deployment to v2.4.9. Service fully restored.

**Thanks To**:
- @alice (database expertise)
- @bob (deployment rollback)
- @charlie (customer communication)

**Next Steps**:
- Postmortem scheduled: tomorrow at 10:00 AM
- Follow-up ticket: JIRA-1234 (investigate connection leak)
- Monitoring: continue for 24 hours

**Status**:
- Incident: Resolved
- War room: Closed
- On-call: Returned to normal rotation
14. Tools: PagerDuty Schedules, Opsgenie Escalation Policies
PagerDuty Escalation Policy
```json
{
  "escalation_policy": {
    "name": "Engineering Escalation",
    "escalation_rules": [
      {
        "escalation_delay_in_minutes": 0,
        "targets": [
          { "type": "schedule_reference", "id": "ONCALL_PRIMARY" }
        ]
      },
      {
        "escalation_delay_in_minutes": 15,
        "targets": [
          { "type": "schedule_reference", "id": "ONCALL_SECONDARY" }
        ]
      },
      {
        "escalation_delay_in_minutes": 30,
        "targets": [
          { "type": "user_reference", "id": "ENGINEERING_MANAGER" }
        ]
      }
    ]
  }
}
```
Opsgenie Escalation
```yaml
# Opsgenie escalation policy
name: "Production Escalation"
rules:
  - condition: "match-all"
    notify:
      - type: "schedule"
        name: "Primary On-Call"
        delay: 0
      - type: "schedule"
        name: "Secondary On-Call"
        delay: 15m
      - type: "team"
        name: "Engineering Managers"
        delay: 30m
```
15. Common Escalation Mistakes
Mistake 1: Too Slow Escalation
❌ Problem: Spending 2 hours trying to fix a SEV1 alone

✓ Solution:
- Escalate SEV1 after 30 minutes
- Don't be a hero
- It's better to escalate early than late
Mistake 2: Premature Escalation
❌ Problem: Escalating before trying basic troubleshooting

✓ Solution:
- Check the runbook first
- Try basic fixes (restart, check logs)
- Escalate if still stuck after 15 minutes
Mistake 3: Unclear Handoff
❌ Problem: "Hey @senior-engineer, there's an issue, can you help?"

✓ Solution: Use the escalation template with:
- What's broken
- What you've tried
- Current status
- Why you're escalating
Mistake 4: Escalating to Wrong Person
❌ Problem: Escalating a database issue to the frontend team

✓ Solution:
- Check the SME registry
- Escalate to the relevant expertise
- Use the escalation matrix
Mistake 5: No Follow-Up
❌ Problem: Escalating and disappearing

✓ Solution:
- Stay engaged after escalating
- Provide context as needed
- Help with resolution
- Document the outcome
16. Real Escalation Scenarios
Scenario 1: Database Slow → Escalation to DBA
```
10:00 UTC - Alert: Database slow
10:02 UTC - L1 engineer investigates
10:05 UTC - Finds long-running queries
10:10 UTC - Attempts to kill queries (no improvement)
10:15 UTC - Escalates to L2 (database team)
10:20 UTC - DBA identifies a missing index
10:25 UTC - Creates the index
10:30 UTC - Performance restored
```

- **Escalation**: Appropriate (specialized expertise needed)
- **Time to escalate**: 15 minutes (good)
Scenario 2: API Down → Immediate Escalation
```
14:00 UTC - Alert: API returning 100% errors (SEV0)
14:01 UTC - L1 engineer confirms outage
14:02 UTC - Immediately escalates to L2, L3, and management (in parallel)
14:05 UTC - War room established
14:10 UTC - Root cause identified (bad deployment)
14:15 UTC - Rollback initiated
14:20 UTC - Service restored
```

- **Escalation**: Appropriate (SEV0 requires immediate all-hands)
- **Time to escalate**: 2 minutes (excellent)
Scenario 3: Slow Search → Delayed Escalation
```
09:00 UTC - Alert: Search latency high (SEV2)
09:05 UTC - L1 engineer investigates
09:30 UTC - Tries various fixes (no improvement)
10:00 UTC - Still investigating alone
11:00 UTC - Finally escalates to the search team
11:15 UTC - Search team identifies an Elasticsearch issue
11:30 UTC - Issue resolved
```

- **Escalation**: Too slow (should have escalated at 10:00 UTC)
- **Time to escalate**: 2 hours (should be 1 hour for SEV2)
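The three scenarios above can be graded mechanically. A hedged sketch: the SEV1 (30 min) target comes from earlier sections, the SEV2 (60 min) target is assumed from the Scenario 3 commentary, and the 5-minute SEV0 grace period is an assumption ("immediately" is not literally zero minutes, per Scenario 2).

```typescript
// Grade how quickly an incident was escalated, per the scenarios above.
// Targets: SEV0 within ~5 min (assumed grace), SEV1 within 30 min,
// SEV2 within 60 min (assumed from the Scenario 3 commentary).
const escalateWithinMinutes: Record<string, number> = { SEV0: 5, SEV1: 30, SEV2: 60 };

function escalationTimeliness(severity: string, escalatedAfterMinutes: number): string {
  const target = escalateWithinMinutes[severity];
  if (target === undefined) return 'no target';
  return escalatedAfterMinutes <= target ? 'on time' : 'too slow';
}
```

Under these targets, Scenario 2 (SEV0, 2 minutes) grades "on time" and Scenario 3 (SEV2, 2 hours) grades "too slow", matching the commentary.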
Summary
Key takeaways for Escalation Paths:
- Know when to escalate - Severity, time, expertise triggers
- Escalate early for critical issues - SEV0 immediately, SEV1 after 30 min
- Use clear escalation paths - L1 → L2 → L3 → L4
- Provide context - What, tried, status, why escalating
- Don't be a hero - Ask for help when stuck
- Use SME registry - Escalate to right expertise
- Follow SLAs - Response times by level
- Avoid alert fatigue - Don't escalate unnecessarily
- Document everything - Escalation reasons and outcomes
- De-escalate properly - Thank people, document resolution
Related Skills
- 41-incident-management/incident-triage: Initial assessment before escalation
- 41-incident-management/severity-levels: Severity determines escalation urgency
- 41-incident-management/oncall-playbooks: Runbooks to try before escalating
- 41-incident-management/stakeholder-communication: Communicating escalations