Claude-code-plugins vercel-incident-runbook
install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/vercel-pack/skills/vercel-incident-runbook" ~/.claude/skills/jeremylongshore-claude-code-plugins-vercel-incident-runbook && rm -rf "$T"
manifest: plugins/saas-packs/vercel-pack/skills/vercel-incident-runbook/SKILL.md
Vercel Incident Runbook
Overview
Step-by-step incident response for Vercel deployment failures, function errors, and platform outages. Covers rapid triage, instant rollback, communication templates, and postmortem procedures.
Prerequisites
- Access to Vercel dashboard and CLI
- Access to Vercel status page (vercel-status.com)
- Communication channels (Slack, PagerDuty) configured
- Log drain or runtime log access
Instructions
Step 1: Rapid Triage (First 5 Minutes)
# 1. Check if it's a Vercel platform issue
curl -s "https://www.vercel-status.com/api/v2/summary.json" \
  | jq '.status.description, [.components[] | select(.status != "operational") | {name, status}]'

# 2. Check current production deployment status
vercel ls --prod
vercel inspect "$(vercel ls --prod --json | jq -r '.[0].url')"

# 3. Check recent deployments: did a deploy just happen?
curl -s -H "Authorization: Bearer $VERCEL_TOKEN" \
  "https://api.vercel.com/v6/deployments?target=production&limit=5&projectId=prj_xxx" \
  | jq '.deployments[] | {uid, state, createdAt: (.createdAt/1000 | todate), url}'

# 4. Check function logs for errors
vercel logs "$(vercel ls --prod --json | jq -r '.[0].url')" --level=error --limit=20
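The deployments query in check 3 returns `createdAt` as milliseconds since the epoch, and the decision in Step 2 depends on whether the latest deploy is less than 30 minutes old. A minimal sketch of that age check follows. The function names `minutes_since_ms` and `deployed_recently` are hypothetical helpers, not part of the Vercel CLI, and the 30-minute default simply mirrors the window used in this runbook.

```shell
# Hypothetical helpers: not part of the Vercel CLI.

# Print whole minutes elapsed since a createdAt value (milliseconds since epoch).
# Usage: minutes_since_ms <created_ms> [now_ms]
minutes_since_ms() {
  local created_ms=$1
  local now_ms=${2:-$(( $(date +%s) * 1000 ))}
  echo $(( (now_ms - created_ms) / 60000 ))
}

# Exit 0 if the deployment is younger than the window (default: 30 minutes).
# Usage: deployed_recently <created_ms> [window_min] [now_ms]
deployed_recently() {
  local created_ms=$1 window_min=${2:-30}
  local now_ms=${3:-$(( $(date +%s) * 1000 ))}
  local age_min=$(( (now_ms - created_ms) / 60000 ))
  [ "$age_min" -lt "$window_min" ]
}
```

For example, `deployed_recently "$created_ms" && echo "recent deploy: consider rollback"` could be run with the `createdAt` value of the newest production deployment.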
Step 2: Decision Tree
Is vercel-status.com showing an incident?
├── YES → Vercel platform issue
│   ├── Subscribe to updates on status page
│   ├── Post internal status: "Vercel platform incident, monitoring"
│   └── No action needed from us; wait for Vercel resolution
│
└── NO → Issue is in our deployment
    ├── Did a deployment happen in the last 30 minutes?
    │   ├── YES → Likely deployment regression
    │   │   └── ROLLBACK immediately (Step 3)
    │   └── NO → Application-level issue
    │       ├── Check function logs for new errors
    │       ├── Check external dependency status (DB, APIs)
    │       └── Investigate and hotfix (Step 4)
    │
    └── Is the issue region-specific?
        ├── YES → Check function regions, possible edge issue
        └── NO → Global issue, check code and env vars
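For automation or on-call scripts, the first two questions of the tree above can be reduced to a small lookup. This is only a sketch: `next_step` is a hypothetical function name, the yes/no inputs are assumed to come from your own checks, and the region-specific branch is left out because it needs human judgment.

```shell
# Hypothetical helper encoding the first two branches of the decision tree.
# Usage: next_step <platform_incident: yes|no> <deploy_in_last_30_min: yes|no>
next_step() {
  case "$1:$2" in
    yes:*)  echo "monitor" ;;      # Vercel platform issue: communicate and wait
    no:yes) echo "rollback" ;;     # likely deployment regression (Step 3)
    no:no)  echo "investigate" ;;  # application-level issue (Step 4)
    *)
      echo "usage: next_step <yes|no> <yes|no>" >&2
      return 2
      ;;
  esac
}
```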
Step 3: Instant Rollback (< 30 Seconds)
# Option A: Rollback to previous production deployment (fastest)
vercel rollback
# This instantly swaps production traffic; no rebuild needed

# Option B: Rollback to a specific known-good deployment
vercel rollback dpl_xxxxxxxxxxxx

# Option C: Via API (for automation/PagerDuty integration)
curl -X POST "https://api.vercel.com/v9/projects/my-app/promote" \
  -H "Authorization: Bearer $VERCEL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"deploymentId": "dpl_known_good_id"}'

# Verify rollback succeeded
vercel ls --prod
curl -s https://yourdomain.com/api/health | jq .
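A single health request right after a rollback can hit a transient error, so it may help to retry the verification a few times before declaring the rollback failed. The sketch below assumes the `/api/health` endpoint from the commands above returns a 2xx status when healthy. `retry_until_ok` and `health_ok` are hypothetical helper names.

```shell
# Hypothetical helpers for verifying a rollback.

# Run a command up to <attempts> times, sleeping <delay_seconds> between tries.
# Usage: retry_until_ok <attempts> <delay_seconds> <command> [args...]
retry_until_ok() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Succeeds only if the URL answers with a non-error HTTP status within 5 seconds.
# Usage: health_ok <url>
health_ok() {
  curl -fsS --max-time 5 "$1" > /dev/null
}
```

A possible invocation is `retry_until_ok 5 3 health_ok "https://yourdomain.com/api/health" || echo "rollback not healthy"`, which probes up to five times with three seconds between attempts.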
Step 4: Investigate Root Cause
# Collect evidence while it's fresh (run from inside the project repository)
INCIDENT_DIR="incident-$(date +%Y%m%d)"
mkdir -p "$INCIDENT_DIR" && cd "$INCIDENT_DIR"

# Function logs around the incident time
vercel logs https://yourdomain.com --limit=200 > function-logs.txt

# Deployment metadata: what changed?
curl -s -H "Authorization: Bearer $VERCEL_TOKEN" \
  "https://api.vercel.com/v13/deployments/dpl_broken" \
  | jq '.meta' > broken-deployment-meta.json

# List current env vars to compare against the last known-good configuration
vercel env ls > env-vars.txt

# Check git diff between last good and broken commit
# (replace the placeholders with commit SHAs; ":/" anchors paths at the repo root)
git log --oneline -10
git diff dpl_good_commit..dpl_broken_commit -- :/api :/src
Step 5: Enable Maintenance Page (If Needed)
vercel.json (temporary maintenance mode via rewrite):
{
  "rewrites": [
    {
      "source": "/((?!_next|api/health).*)",
      "destination": "/maintenance.html"
    }
  ]
}
<!-- public/maintenance.html -->
<!DOCTYPE html>
<html>
  <head><title>Maintenance</title></head>
  <body>
    <h1>We'll be right back</h1>
    <p>We're performing scheduled maintenance. Please check back shortly.</p>
  </body>
</html>
Step 6: Communication Templates
Internal — Slack (Incident Start)
:rotating_light: INCIDENT: [Project Name] production issue detected
Status: Investigating
Impact: [Description of user impact]
Start time: [UTC timestamp]
On-call: @[engineer]
Thread: replies here
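Filling in the template by hand under pressure invites mistakes, so the incident-start message can be generated from a few arguments. The following is a sketch only: `render_incident_start` is a hypothetical function name, and how the text gets posted (Slack webhook, PagerDuty note, or copy and paste) is left to your own tooling.

```shell
# Hypothetical helper: print the incident-start message with a UTC timestamp.
# Usage: render_incident_start <project> <impact> <on_call_handle>
render_incident_start() {
  local project=$1 impact=$2 on_call=$3
  local started
  started=$(date -u +"%Y-%m-%d %H:%M UTC")
  cat <<EOF
:rotating_light: INCIDENT: ${project} production issue detected
Status: Investigating
Impact: ${impact}
Start time: ${started}
On-call: @${on_call}
Thread: replies here
EOF
}
```

Calling `render_incident_start "my-app" "Checkout requests failing" "alice"` prints the six-line message with the current UTC time filled in.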
Internal — Slack (Mitigation)
:white_check_mark: MITIGATED: [Project Name]
Action: Rolled back to deployment dpl_xxx
Impact duration: [X minutes]
Root cause: [Brief description]
Postmortem: [link] scheduled for [date]
External — Status Page
Title: Degraded performance on [service]
Body: We are investigating reports of [issue]. Some users may experience [impact]. Our team is actively working on a resolution.
Update: The issue has been resolved. [Brief root cause].
Step 7: Postmortem Template
# Incident Postmortem: [Title]

## Summary
- Duration: [start] to [end] ([X minutes])
- Impact: [users/requests affected]
- Severity: [P1/P2/P3]

## Timeline (UTC)
- HH:MM: [event]
- HH:MM: Alert fired
- HH:MM: On-call acknowledged
- HH:MM: Root cause identified
- HH:MM: Rollback executed
- HH:MM: Service restored

## Root Cause
[What broke and why]

## Resolution
[What was done to fix it]

## Action Items
- [ ] [Preventive action], Owner: @xxx, Due: [date]
- [ ] [Detection improvement], Owner: @xxx, Due: [date]
- [ ] [Process improvement], Owner: @xxx, Due: [date]
Incident Severity Levels
| Severity | Definition | Response Time | Rollback? |
|---|---|---|---|
| P1 | Production down, all users affected | < 5 min | Immediate |
| P2 | Degraded, some users affected | < 15 min | If not fixable in 30 min |
| P3 | Minor issue, workaround exists | < 1 hour | No |
| P4 | Cosmetic or non-urgent | Next business day | No |
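If alerts or chat commands carry a severity label, the table above can be mirrored in a small lookup so scripts print the expected response. This is a sketch under that assumption; `severity_policy` is a hypothetical function name and the strings simply restate the table.

```shell
# Hypothetical helper restating the severity table as a lookup.
# Usage: severity_policy <P1|P2|P3|P4>
severity_policy() {
  case "$1" in
    P1) echo "respond < 5 min; rollback: immediate" ;;
    P2) echo "respond < 15 min; rollback: if not fixable in 30 min" ;;
    P3) echo "respond < 1 hour; rollback: no" ;;
    P4) echo "respond next business day; rollback: no" ;;
    *)
      echo "unknown severity: $1" >&2
      return 1
      ;;
  esac
}
```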
Output
- Incident categorized and triaged within 5 minutes
- Instant rollback executed if deployment regression detected
- Communication sent to internal and external stakeholders
- Postmortem scheduled with action items
Error Handling
| Scenario | Action |
|---|---|
| Vercel status page shows incident | Monitor, communicate, no deployment changes |
| vercel rollback fails | Use API promotion: POST to the project promote endpoint (Step 3, Option C) |
| Rollback deployment also broken | Deploy from a known-good git tag |
| Cannot access Vercel dashboard | Use CLI with saved VERCEL_TOKEN |
| Log retention expired | Check external log drain provider |
Resources
Next Steps
For data handling and compliance, see vercel-data-handling.