install
source · Clone the upstream repo
git clone https://github.com/MacPhobos/research-mind
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/MacPhobos/research-mind "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/toolchains-universal-emergency-release" ~/.claude/skills/macphobos-research-mind-toolchains-universal-emergency-release && rm -rf "$T"
manifest:
.claude/skills/toolchains-universal-emergency-release/skill.mdsource content
Emergency Release Workflow Skill
Summary
Fast-track workflow for critical production issues requiring immediate deployment. Covers urgency assessment, expedited PR process, deployment verification, and post-incident analysis.
When to Use
- Critical production bugs affecting users
- Security vulnerabilities (CVEs)
- Urgent business requirements
- Data integrity issues
- Service outages
- Payment processing failures
Urgency Assessment
Priority Levels
| Level | Type | Response Time | Deployment | Example |
|---|---|---|---|---|
| P0 | Security vulnerability | < 2 hours | Immediate to production | Auth bypass, data leak, active exploit |
| P1 | Production down | < 4 hours | Same day | App crash, complete feature failure, payment down |
| P2 | Major bug | < 24 hours | Next business day | Critical feature broken, significant user impact |
| P3 | Business critical | < 48 hours | Scheduled release | Marketing campaign blocker, partner deadline |
P0 Criteria (Immediate Action)
- Authentication/authorization bypass
- Data breach or exposure
- Remote code execution vulnerability
- Production service completely unavailable
- Data corruption affecting multiple users
- Payment processing completely broken
P1 Criteria (Same Day)
- Critical feature completely broken
- Error affecting majority of users
- Revenue-impacting bug
- Database connectivity issues
- Third-party integration failure (critical service)
P2 Criteria (Next Business Day)
- Major feature partially broken
- Affects specific user segment
- Workaround available but not ideal
- Performance degradation (not complete failure)
Emergency Release Process
1. Create Hotfix Branch
# Branch from current production (main) git checkout main git pull origin main # Create hotfix branch git checkout -b hotfix/ENG-XXX-brief-description # Example: git checkout -b hotfix/ENG-1234-fix-auth-bypass
2. Implement Minimal Fix
⚠️ CRITICAL: Minimal change only DO: ✅ Fix the immediate issue ✅ Add regression test ✅ Document root cause in comments DON'T: ❌ Refactor surrounding code ❌ Fix unrelated issues ❌ Add new features ❌ Update dependencies (unless that's the fix)
3. Test Thoroughly
# Run full test suite pnpm test # Type check pnpm tsc --noEmit # Build verification pnpm build # Manual testing checklist: # - [ ] Reproduce original issue # - [ ] Verify fix resolves issue # - [ ] Test happy path # - [ ] Test edge cases # - [ ] Verify no new issues introduced
4. Create PR with Labels
git add . git commit -m "fix: [brief description of fix] Fixes critical issue where [description]. Root cause: [explanation]. Ticket: ENG-XXX Priority: P0" git push origin hotfix/ENG-XXX-brief-description
Use clear labels in PR title:
- Direct to production[RELEASE]
- Critical fix, expedited review[HOTFIX]
or[P0]
- Priority indicator[P1]
PR Template for Hotfixes
Hotfix PR Template
## 🚨 [RELEASE] ENG-XXX: Brief description of fix ### Urgency - [x] P0 - Security vulnerability - [ ] P1 - Production down - [ ] P2 - Major bug - [ ] P3 - Business critical ### Impact **Users affected**: [All users / Premium tier / Specific region / etc.] **Severity**: [Choose one] - [ ] Service completely unavailable - [ ] Critical feature broken - [ ] Security vulnerability - [ ] Data integrity issue - [ ] Degraded performance **User impact**: Describe how this affects end users. ### Root Cause [Brief explanation of what caused the issue] **How it happened:** 1. [Step 1] 2. [Step 2] 3. [Result: issue manifested] **Why it wasn't caught:** - [ ] Missing test coverage - [ ] Race condition in production - [ ] External service behavior changed - [ ] Recent deployment introduced regression - [ ] Other: [explain] ### The Fix [What this PR changes to resolve the issue] **Changes made:** - Modified `file.ts` to [specific change] - Added validation for [specific case] - Fixed logic in [specific function] **Why this fixes it:** [Explanation of how the change resolves the root cause] ### Testing - [ ] ✅ Reproduced issue locally - [ ] ✅ Verified fix resolves issue - [ ] ✅ Regression test added - [ ] ✅ No other functionality affected - [ ] ✅ Tested edge cases - [ ] ✅ Deployed to staging and verified ### Regression Test ```typescript // Test added to prevent recurrence describe('ENG-XXX: Auth bypass fix', () => { it('should reject expired tokens', async () => { const expiredToken = generateExpiredToken(); const response = await fetch('/api/protected', { headers: { Authorization: `Bearer ${expiredToken}` } }); expect(response.status).toBe(401); }); });
Rollback Plan
If this causes issues:
# Option 1: Revert commit git revert <commit-hash> git push origin main # Option 2: Deploy previous version vercel rollback # or your platform's rollback command # Option 3: Feature flag Set FEATURE_FIX_XXX=false in environment
Monitoring:
- Error rate in Sentry
- API response times in monitoring dashboard
- User reports in support channels
Deploy Checklist
- PR approved by at least one reviewer (waive for P0 if necessary)
- All CI checks pass
- Deployed to staging and verified
- Monitoring alerts configured
- On-call engineer notified
- Ready for production deployment
Post-Deploy Verification
Immediately after deploy:
- Verify fix in production (test endpoint directly)
- Check error tracking (Sentry, etc.)
- Monitor for new errors
- Confirm user reports stop coming in
Metrics to watch:
- Error rate (should drop)
- API latency (should remain stable)
- User activity (should normalize)
Follow-Up
- Update Linear ticket with resolution
- Schedule post-incident review (if P0/P1)
- Create tickets for proper fix (if this was a band-aid)
- Update runbook/documentation
--- ## Deployment Steps ### Pre-Deployment ```bash # 1. Merge PR to main # (After approval or P0 emergency waiver) # 2. Pull latest git checkout main git pull origin main # 3. Verify commit git log -1 # Confirm this is your hotfix commit # 4. Tag release (if using semantic versioning) git tag -a v2.3.5 -m "Hotfix: Fix auth bypass vulnerability" git push origin v2.3.5
Deployment (Platform-Specific)
Vercel
# Trigger production deployment vercel --prod # Or use Vercel dashboard: # Deployments → Select commit → Deploy to Production # Monitor deployment vercel logs --follow
Netlify
# Deploy via CLI netlify deploy --prod # Or trigger from dashboard: # Deploys → Select commit → Publish deploy
Railway
# Push to main triggers deployment automatically # Monitor in dashboard: railway.app/project/logs
AWS/GCP/Azure
# Follow platform-specific deployment process # Example for AWS Elastic Beanstalk: eb deploy production --staged # Monitor: eb logs --follow
Post-Deployment Verification
1. Smoke Test
# Test the specific fix curl -X POST https://api.production.com/auth/login \ -H "Content-Type: application/json" \ -d '{"token": "expired_token"}' # Expected: 401 Unauthorized
2. Monitor Error Tracking
✅ Check Sentry/Rollbar/etc.: - Error rate should drop - No new errors introduced ⏱️ Monitor for 15-30 minutes after deployment
3. Verify Metrics
Check monitoring dashboard: - API response times (should be normal) - Error rates (should drop) - Database performance (should be stable) - Third-party service health
4. Check User Reports
Monitor support channels: - Support tickets - In-app chat - Social media - Status page comments
Communication
Internal Communication
Slack/Teams Message Template
🚨 **Production Hotfix Deployed** **Issue**: [Brief description] **Ticket**: ENG-XXX **Priority**: P0 **Status**: ✅ Resolved **Timeline:** - Issue discovered: 14:23 UTC - Fix deployed: 15:47 UTC - Duration: 1h 24m **Impact**: [Who was affected and how] **Root Cause**: [Brief explanation] **Fix**: [What was changed] **Verification**: ✅ Error rate dropped from 450/min to 0/min ✅ All systems operating normally **PR**: https://github.com/org/repo/pull/XXX **Follow-up**: - [ ] Post-incident review scheduled for [date] - [ ] Documentation updated
External Communication (if needed)
Status Page Update
🟢 Resolved - [Issue Title] We've resolved an issue that was affecting [feature/service]. **What happened:** Between 14:23 and 15:47 UTC, users experienced [specific issue]. **Current status:** The issue has been fully resolved. All systems are operating normally. **Next steps:** We're conducting a thorough review to prevent similar issues in the future. We apologize for any inconvenience.
Email to Affected Users (for serious issues)
Subject: Update on [Service] Issue - Resolved Hi [User], We're writing to update you on an issue that affected [feature/service] earlier today. **What happened:** Between [time] and [time], you may have experienced [specific issue]. **Resolution:** Our team quickly identified and resolved the root cause. The service is now operating normally. **What we're doing:** We take these issues seriously and are: - Conducting a full review of the incident - Implementing additional safeguards - Improving our monitoring We apologize for any inconvenience this may have caused. If you have any questions or concerns, please reach out to support@company.com. Thank you for your patience. The [Company] Team
Post-Incident
Immediate Actions (Within 24 hours)
- Update Linear ticket with full resolution details
- Add incident to incident log/spreadsheet
- Document timeline of events
- Identify metrics that should have alerted earlier
- Create follow-up tickets for proper fix (if hotfix was temporary)
Post-Incident Review (PIR) - For P0/P1
Schedule within 72 hours
PIR Template
# Post-Incident Review: [ENG-XXX] Date: YYYY-MM-DD Severity: P0/P1 Duration: Xh Xm ## Summary Brief description of the incident. ## Timeline (UTC) - 14:23 - Issue first detected - 14:25 - On-call engineer alerted - 14:30 - Root cause identified - 14:45 - Fix PR opened - 15:20 - PR approved and merged - 15:47 - Fix deployed to production - 16:00 - Verified resolved ## Impact - **Users affected**: ~5,000 users - **Duration**: 1h 24m - **User experience**: Unable to log in - **Revenue impact**: Estimated $X in lost transactions - **Reputation impact**: 23 support tickets, 5 social media mentions ## Root Cause Detailed technical explanation of what caused the issue. [Include code snippets, sequence diagrams if helpful] ## Resolution What was changed to fix the issue. ## What Went Well - ✅ Fast detection (2 minutes after deploy) - ✅ Clear reproduction steps identified quickly - ✅ Team collaborated effectively - ✅ Fix deployed in under 90 minutes ## What Went Wrong - ❌ Missing test coverage for expired token edge case - ❌ Staging didn't catch the issue (different token expiry settings) - ❌ No automatic rollback triggered - ❌ Monitoring alert threshold too high ## Action Items - [ ] ENG-XXX-1: Add test for expired token validation (@engineer, by YYYY-MM-DD) - [ ] ENG-XXX-2: Align staging token expiry with production (@devops, by YYYY-MM-DD) - [ ] ENG-XXX-3: Implement automatic rollback on error spike (@platform, by YYYY-MM-DD) - [ ] ENG-XXX-4: Lower monitoring alert threshold (@observability, by YYYY-MM-DD) - [ ] ENG-XXX-5: Add runbook for similar issues (@oncall, by YYYY-MM-DD) ## Prevention How we'll prevent this from happening again: 1. **Testing**: Add test coverage for edge cases 2. **Monitoring**: Improve alerting thresholds 3. **Process**: Update deployment checklist 4. **Documentation**: Create runbook for on-call ## Lessons Learned Key takeaways for the team.
Backport Strategy
When fix needs to go to multiple branches/environments:
Multiple Environment Deployment
# 1. Fix applied to main (production) git checkout main git cherry-pick <hotfix-commit-hash> # 2. Backport to release candidate git checkout release-candidate git cherry-pick <hotfix-commit-hash> git push origin release-candidate # 3. Backport to develop git checkout develop git cherry-pick <hotfix-commit-hash> git push origin develop # Create PRs for each backport: gh pr create --base release-candidate --head backport/rc/hotfix-ENG-XXX gh pr create --base develop --head backport/dev/hotfix-ENG-XXX
Handling Merge Conflicts
# If cherry-pick fails due to conflicts git cherry-pick <hotfix-commit-hash> # CONFLICT in file.ts # Resolve conflicts manually # Then: git add file.ts git cherry-pick --continue
Alternative: Patch File
# Create patch from hotfix git format-patch -1 <hotfix-commit-hash> # Creates: 0001-fix-auth-bypass.patch # Apply to other branch git checkout release-candidate git apply 0001-fix-auth-bypass.patch
Summary
Emergency Release Quick Reference
Decision Tree
Is production broken? ├─ Yes → Severity level? │ ├─ P0 (security/down) → Deploy immediately, inform after │ ├─ P1 (critical bug) → Fast-track PR, deploy same day │ └─ P2 (major bug) → Standard expedited process └─ No → Use normal deployment process
Time Targets
- P0: Issue → Deploy in < 2 hours
- P1: Issue → Deploy in < 4 hours
- P2: Issue → Deploy in < 24 hours
Key Principles
- Minimal Change: Fix only the immediate issue
- Add Regression Test: Prevent recurrence
- Fast Feedback: Deploy to staging first (except P0)
- Clear Communication: Keep stakeholders informed
- Learn & Improve: Conduct PIR for P0/P1
Checklist
- Assess urgency correctly
- Create hotfix branch from production
- Implement minimal fix with regression test
- Use appropriate PR labels
- Test thoroughly (staging for P1/P2)
- Get approval (or waive for P0)
- Deploy with monitoring
- Verify fix in production
- Communicate status
- Schedule PIR (P0/P1)
- Create follow-up tickets
Use this skill when production issues require immediate attention and fast-track deployment outside normal release processes.