Claude-code-plugins-plus replit-incident-runbook
install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/replit-pack/skills/replit-incident-runbook" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-replit-incident-runbook && rm -rf "$T"
manifest: plugins/saas-packs/replit-pack/skills/replit-incident-runbook/SKILL.md
Replit Incident Runbook
Overview
Rapid incident response for Replit deployment failures, database issues, and platform outages. Covers triage, diagnosis, remediation, rollback, and communication.
Prerequisites
- Access to Replit Workspace and Deployment settings
- Deployment URL for health checks
- Communication channel (Slack, email)
- Familiarity with rollback via Deployment History
Severity Levels
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Complete outage | < 15 min | App returns 5xx, DB down |
| P2 | Degraded service | < 1 hour | Slow responses, intermittent errors |
| P3 | Minor impact | < 4 hours | Non-critical feature broken |
| P4 | No user impact | Next business day | Monitoring gap |
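As a rough illustration of how the table might be applied during triage, the sketch below maps a health-check result to a suggested severity. The function name `suggest_severity` and the 3-second "slow" threshold are assumptions for illustration, not part of the runbook; P3 and P4 still need human judgment.

```shell
# Hypothetical helper: suggest a severity from a /health check result.
# $1 = HTTP status code (000 when the app is unreachable)
# $2 = total response time in seconds
# $3 = "slow" threshold in seconds (assumed default: 3)
suggest_severity() {
  sev_code="$1"
  sev_seconds="$2"
  sev_slow_after="${3:-3}"

  case "$sev_code" in
    000|5??) echo "P1"; return 0 ;;   # complete outage: unreachable or 5xx
  esac

  # awk handles the floating-point comparison that plain sh cannot
  if awk -v t="$sev_seconds" -v max="$sev_slow_after" 'BEGIN { exit !(t > max) }'; then
    echo "P2"                         # degraded: responding, but slowly
  else
    echo "none"                       # healthy from this single probe
  fi
}
```

For example, `suggest_severity 503 0.2` prints `P1`, and `suggest_severity 200 4.5` prints `P2`.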
Quick Triage (First 5 Minutes)
    set -euo pipefail
    DEPLOY_URL="https://your-app.replit.app"

    echo "=== TRIAGE ==="

    # 1. Check Replit platform status
    echo -n "Replit Status: "
    curl -s https://status.replit.com/api/v2/summary.json | \
      python3 -c "import sys,json;print(json.load(sys.stdin)['status']['description'])" 2>/dev/null || \
      echo "Check https://status.replit.com"

    # 2. Check your deployment health
    echo -n "App Health: "
    curl -s -o /dev/null -w "HTTP %{http_code} (%{time_total}s)" "$DEPLOY_URL/health" 2>/dev/null || echo "UNREACHABLE"
    echo ""

    # 3. Get health details
    echo "Health Response:"
    curl -s "$DEPLOY_URL/health" 2>/dev/null | python3 -m json.tool 2>/dev/null || echo "No response"

    # 4. Check if it's a cold start issue (Autoscale)
    echo -n "Second request: "
    curl -s -o /dev/null -w "HTTP %{http_code} (%{time_total}s)\n" "$DEPLOY_URL/health"
Decision Tree
    App not responding?
    ├─ YES: Is status.replit.com reporting an incident?
    │   ├─ YES → Platform issue. Wait for Replit. Communicate to users.
    │   └─ NO → Your deployment issue. Continue below.
    │
    │   Can you access the Replit Workspace?
    │   ├─ YES → Check deployment logs:
    │   │   ├─ Build error → Fix code, redeploy
    │   │   ├─ Runtime crash → Check logs, fix, redeploy
    │   │   └─ Secret missing → Add to Secrets tab, redeploy
    │   └─ NO → Network/browser issue. Try incognito window.
    │
    └─ App responds but with errors?
        ├─ 5xx errors → Check logs for crash/exception
        ├─ Slow responses → Check database, cold start, memory
        └─ Auth not working → Verify deployment domain, not dev URL
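The top of the tree can be sketched as a small shell function that turns two triage observations into a next step. The name `next_step` is hypothetical, and treating 401/403 as the "auth not working" branch is an assumption; the tree itself does not name status codes for that case.

```shell
# Hypothetical helper: pick the next action from two triage observations.
# $1 = "yes" if status.replit.com reports an incident, anything else otherwise
# $2 = HTTP status code from the /health check (000 when unreachable)
next_step() {
  step_incident="$1"
  step_code="$2"

  case "$step_code" in
    000)
      if [ "$step_incident" = "yes" ]; then
        echo "Platform issue: wait for Replit and communicate to users"
      else
        echo "Deployment issue: open the Workspace and check deployment logs"
      fi
      ;;
    5??)     echo "App responds with 5xx: check logs for crash/exception" ;;
    401|403) echo "Auth failing: verify deployment domain, not dev URL" ;;   # assumed mapping
    2??)     echo "App responds: if slow, check database, cold start, memory" ;;
    *)       echo "Unexpected status $step_code: check deployment logs" ;;
  esac
}
```

The HTTP code can come from the triage script's `curl -w "%{http_code}"` output, so `next_step no 000` would point at the deployment logs.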
Remediation by Error Type
Deployment Crash (5xx / App Unreachable)
1. Open Replit Workspace
2. Go to Deployment Settings > Logs
3. Look for the crash reason:
   - "Error: Cannot find module..." → Missing dependency
   - "FATAL: Missing secrets..." → Add to Secrets tab
   - "EADDRINUSE" → Port conflict in .replit config
   - "JavaScript heap out of memory" → Increase VM size or fix memory leak
4. Fix the issue in code
5. Click "Deploy" to redeploy
6. If fix is unclear, ROLLBACK:
   - Deployment Settings > History
   - Click "Rollback" on last known-good version
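If log entries are copied out of Deployment Settings > Logs into a file, the crash-reason lookup in step 3 could be scripted along these lines. This is a minimal sketch: `diagnose_log` is a hypothetical name, and the patterns are simply the four strings listed above.

```shell
# Hypothetical helper: read copied deployment logs on stdin and print the
# matching remediation for each known crash signature.
# Returns non-zero when no known signature is found.
diagnose_log() {
  diag_text="$(cat)"
  diag_matched=no

  # Each line is "substring to look for|advice to print"
  while IFS='|' read -r diag_pattern diag_advice; do
    case "$diag_text" in
      *"$diag_pattern"*)
        echo "$diag_advice"
        diag_matched=yes
        ;;
    esac
  done <<'EOF'
Cannot find module|Missing dependency: install it and redeploy
Missing secrets|Missing secret: add it in the Secrets tab and redeploy
EADDRINUSE|Port conflict: check the port settings in .replit
JavaScript heap out of memory|Out of memory: increase VM size or fix the memory leak
EOF

  [ "$diag_matched" = yes ]
}
```

Usage would look like `diagnose_log < copied-logs.txt`, where `copied-logs.txt` is whatever file the log entries were pasted into.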
Database Connection Failure
1. Check database status in Database pane
2. Verify DATABASE_URL is set in Secrets
3. Test connection:

       # From Replit Shell
       node -e "
       const {Pool} = require('pg');
       const pool = new Pool({connectionString: process.env.DATABASE_URL, ssl:{rejectUnauthorized:false}});
       pool.query('SELECT NOW()')
         .then(r => console.log('OK:', r.rows[0]))
         .catch(e => console.error('FAIL:', e.message))
         .finally(() => pool.end());
       "

4. If connection fails:
   - Check if PostgreSQL is provisioned (Database pane)
   - Try creating a new database
   - Check for connection pool exhaustion (max connections)
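Step 2 can be checked from the Replit Shell without printing credentials. The sketch below assumes a conventional `scheme://user:password@host:port/dbname?params` connection string; `describe_database_url` is a hypothetical helper name.

```shell
# Hypothetical helper: confirm DATABASE_URL is set and show where it points,
# with the user and password stripped so the output is safe to paste into
# an incident channel.
# $1 = optional URL to inspect (defaults to $DATABASE_URL)
describe_database_url() {
  db_url="${1:-${DATABASE_URL:-}}"
  if [ -z "$db_url" ]; then
    echo "DATABASE_URL is not set"
    return 1
  fi

  db_rest="${db_url#*://}"        # drop the scheme
  db_target="${db_rest##*@}"      # drop "user:password@"
  db_target="${db_target%%[?]*}"  # drop query parameters
  echo "DATABASE_URL is set (target: $db_target)"
}
```

Calling `describe_database_url` with no arguments inspects the current environment; a non-zero exit indicates the secret never reached the shell.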
Cold Start Too Slow (Autoscale)
If cold starts exceed acceptable latency:

1. Check deployment type: Autoscale scales to zero
2. Options:
   a. Switch to Reserved VM (always-on, no cold starts)
   b. Set up external keep-alive (ping /health every 4 min)
   c. Optimize startup: lazy imports, defer DB connection
3. To switch to Reserved VM:
   - Update .replit: deploymentTarget = "gce" (Reserved VM; "cloudrun" is the Autoscale target)
   - Redeploy
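Option (b) could look roughly like the loop below. It is a sketch that assumes the script runs on a machine outside the Autoscale deployment (otherwise it stops when the app scales to zero); `keep_alive` and its parameters are hypothetical names, and 240 seconds matches the 4-minute interval mentioned above.

```shell
# Hypothetical keep-alive loop: ping a health endpoint at a fixed interval.
# $1 = URL to ping, e.g. "https://your-app.replit.app/health"
# $2 = seconds between pings (default 240, i.e. every 4 minutes)
# $3 = number of pings to send; 0 means run until interrupted (default 0)
keep_alive() {
  ka_url="$1"
  ka_interval="${2:-240}"
  ka_max="${3:-0}"
  ka_sent=0

  while :; do
    # curl prints 000 and exits non-zero when the app is unreachable
    ka_code="$(curl -s -o /dev/null --max-time 30 -w '%{http_code}' "$ka_url" || true)"
    echo "$(date -u +%H:%M:%S) keep-alive ping: HTTP ${ka_code:-000}"

    ka_sent=$((ka_sent + 1))
    if [ "$ka_max" -gt 0 ] && [ "$ka_sent" -ge "$ka_max" ]; then
      break
    fi
    sleep "$ka_interval"
  done
}
```

A keep-alive only masks cold starts and keeps the deployment billed as active, so it is worth weighing against option (a).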
Secrets Missing After Deploy
1. Open Secrets tab (lock icon in sidebar)
2. Verify all required secrets are present
3. Check Deployment Settings > Environment Variables
4. Secrets should auto-sync (2025+), but if not:
   - Remove and re-add the secret
   - Redeploy
5. For Account-level secrets:
   - Account Settings > Secrets
   - These apply to ALL Repls
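Step 2 can be automated with a check that reports which required variables are unset without echoing their values. A minimal sketch, where `check_secrets` is a hypothetical name and the secret names in the usage line are placeholders:

```shell
# Hypothetical helper: report whether each named environment variable is set.
# Values are never printed. Returns non-zero if any name is missing or invalid.
check_secrets() {
  cs_missing=0
  for cs_name in "$@"; do
    # Only accept plain variable names, since the name is passed to eval
    case "$cs_name" in
      ''|[0-9]*|*[!A-Za-z0-9_]*)
        echo "INVALID NAME: $cs_name"
        cs_missing=1
        continue
        ;;
    esac

    eval "cs_value=\${$cs_name:-}"
    if [ -z "$cs_value" ]; then
      echo "MISSING: $cs_name"
      cs_missing=1
    else
      echo "ok: $cs_name"
    fi
  done
  return "$cs_missing"
}
```

For example, `check_secrets DATABASE_URL SESSION_SECRET` could run at the top of the app's start command so a missing secret fails fast with a clear log line.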
Rollback Procedure
Replit supports one-click rollback to any previous deployment:

1. Deployment Settings > History
2. Find the last successful deployment
3. Click "Rollback to this version"
4. Verify health endpoint
5. Investigate root cause before redeploying fix

Rollback restores:

- Code at that deployment's commit
- Deployment configuration at that time

Rollback does NOT roll back database changes.
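Verifying the health endpoint after a rollback (step 4) usually takes a few attempts while the previous version starts. The sketch below polls until it sees HTTP 200; `wait_healthy`, the attempt count, and the delay are assumptions to adjust per app.

```shell
# Hypothetical helper: poll a health endpoint until it returns HTTP 200.
# $1 = URL to check, e.g. "https://your-app.replit.app/health"
# $2 = maximum number of attempts (default 10)
# $3 = seconds to wait between attempts (default 6)
wait_healthy() {
  wh_url="$1"
  wh_attempts="${2:-10}"
  wh_delay="${3:-6}"
  wh_n=1

  while [ "$wh_n" -le "$wh_attempts" ]; do
    wh_code="$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$wh_url" || true)"
    if [ "$wh_code" = "200" ]; then
      echo "healthy after $wh_n attempt(s)"
      return 0
    fi
    echo "attempt $wh_n/$wh_attempts: HTTP ${wh_code:-000}" >&2
    wh_n=$((wh_n + 1))
    sleep "$wh_delay"
  done

  echo "still unhealthy after $wh_attempts attempts" >&2
  return 1
}
```

A non-zero exit here suggests the rollback target is also unhealthy, or that the fault lies outside the code, such as a database change that the rollback did not revert.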
Communication Templates
Internal (Slack)
    P[1-4] INCIDENT: [App Name] on Replit
    Status: INVESTIGATING / IDENTIFIED / MONITORING / RESOLVED
    Impact: [What users are experiencing]
    Cause: [If known]
    Action: [What we're doing]
    ETA: [When we expect resolution]
    Next update: [Time]
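To keep updates consistent under pressure, the internal template can be filled in by a small function. This is a sketch only: `incident_update` and its argument order are hypothetical, and the output is meant to be pasted into the channel by hand.

```shell
# Hypothetical helper: render the internal incident update from arguments.
# Args: severity, app name, status, impact, cause, action, ETA, next update
incident_update() {
  case "${3:-}" in
    INVESTIGATING|IDENTIFIED|MONITORING|RESOLVED) ;;
    *)
      echo "status must be INVESTIGATING, IDENTIFIED, MONITORING or RESOLVED" >&2
      return 1
      ;;
  esac

  cat <<EOF
${1:-P?} INCIDENT: ${2:-[App Name]} on Replit
Status: $3
Impact: ${4:-Unknown}
Cause: ${5:-Unknown}
Action: ${6:-Investigating}
ETA: ${7:-TBD}
Next update: ${8:-TBD}
EOF
}
```

An early update might be `incident_update P1 "Shop API" INVESTIGATING "Checkout returns 5xx" "" "Checking deployment logs" "" "14:30 UTC"`, leaving cause and ETA at their defaults until they are known.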
External (Status Page)
    [App Name] Service Disruption

    We are experiencing issues with [specific feature/service].
    [Describe user impact].
    We have identified the cause and are working on a fix.
    Estimated resolution: [time].

    Last updated: [timestamp]
Post-Incident
Evidence Collection
    set -euo pipefail

    # Capture deployment logs
    # Go to Deployment Settings > Logs > Copy relevant entries

    # Capture timeline
    echo "Timeline of events:" > incident-report.md
    echo "- [time] Issue detected" >> incident-report.md
    echo "- [time] Investigation started" >> incident-report.md
    echo "- [time] Root cause identified" >> incident-report.md
    echo "- [time] Fix deployed / rollback executed" >> incident-report.md
    echo "- [time] Service restored" >> incident-report.md
Postmortem Template
    ## Incident: [Title]
    **Date:** YYYY-MM-DD
    **Duration:** X hours Y minutes
    **Severity:** P[1-4]

    ### Summary
    [1-2 sentence description of what happened]

    ### Root Cause
    [Technical explanation]

    ### Timeline
    - HH:MM - First alert
    - HH:MM - Investigation started
    - HH:MM - Root cause found
    - HH:MM - Fix deployed / rollback
    - HH:MM - Service restored

    ### Impact
    - Users affected: [N]
    - Downtime: [duration]

    ### Action Items
    - [ ] [Prevention measure] - Owner - Due date
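The Duration field can be derived from the first and last timeline entries. The sketch below assumes both times are in 24-hour HH:MM format and that the incident lasted less than 24 hours; `to_minutes` and `incident_duration` are hypothetical helper names.

```shell
# Convert HH:MM to minutes since midnight. One leading zero is stripped
# so values such as "08" are not parsed as invalid octal numbers.
to_minutes() {
  tm_h="${1%%:*}"
  tm_m="${1##*:}"
  tm_h="${tm_h#0}"
  tm_m="${tm_m#0}"
  echo $(( ${tm_h:-0} * 60 + ${tm_m:-0} ))
}

# Print the time between two HH:MM stamps in the template's wording.
# $1 = first alert, $2 = service restored
incident_duration() {
  id_start="$(to_minutes "$1")"
  id_end="$(to_minutes "$2")"
  id_total=$(( id_end - id_start ))
  if [ "$id_total" -lt 0 ]; then
    id_total=$(( id_total + 1440 ))   # incident crossed midnight
  fi
  echo "$(( id_total / 60 )) hours $(( id_total % 60 )) minutes"
}
```

For a first alert at 14:05 and service restored at 15:47, `incident_duration 14:05 15:47` prints `1 hours 42 minutes`.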
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Can't access Workspace | Replit outage | Use status.replit.com, wait |
| Rollback not available | No previous deployments | Fix forward, deploy fix |
| Logs too short | Container restarted | Set up external log aggregator |
| DB rollback needed | Bad migration | Restore from Replit DB snapshot |
Next Steps
For data handling patterns, see replit-data-handling.