Skillshub anth-incident-runbook
install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/anth-incident-runbook" ~/.claude/skills/comeonoliver-skillshub-anth-incident-runbook && rm -rf "$T"
manifest:
skills/jeremylongshore/claude-code-plugins-plus-skills/anth-incident-runbook/SKILL.mdsource content
Anthropic Incident Runbook
Severity Classification
| Severity | Condition | Response Time |
|---|---|---|
| P1 | API returning 500/529 for all requests | Immediate |
| P2 | Rate limiting (429) or high latency (>10s p99) | 15 minutes |
| P3 | Intermittent errors (<5% error rate) | 1 hour |
| P4 | Degraded quality (not errors) | Next business day |
Immediate Triage (First 5 Minutes)
# 1. Check Anthropic status page curl -s https://status.anthropic.com/api/v2/status.json | python3 -c \ "import sys,json; d=json.load(sys.stdin); print(d['status']['indicator'], '-', d['status']['description'])" # 2. Test API connectivity curl -s -w "\nHTTP %{http_code} | Time: %{time_total}s\n" \ https://api.anthropic.com/v1/messages \ -H "x-api-key: $ANTHROPIC_API_KEY" \ -H "anthropic-version: 2023-06-01" \ -H "content-type: application/json" \ -d '{"model":"claude-haiku-4-20250514","max_tokens":8,"messages":[{"role":"user","content":"1"}]}' # 3. Check rate limit headers curl -s -D - https://api.anthropic.com/v1/messages \ -H "x-api-key: $ANTHROPIC_API_KEY" \ -H "anthropic-version: 2023-06-01" \ -H "content-type: application/json" \ -d '{"model":"claude-haiku-4-20250514","max_tokens":8,"messages":[{"role":"user","content":"1"}]}' \ 2>/dev/null | grep -i "ratelimit\|retry-after\|request-id"
Decision Tree
API returning errors? ├── 401/403 → Key issue → Check ANTHROPIC_API_KEY is set and valid ├── 429 → Rate limited → Check headers, reduce traffic, wait for retry-after ├── 500 → Server error → Check status.anthropic.com, retry with backoff ├── 529 → Overloaded → Temporary, retry after 30-60s └── Timeouts → Network or long generation → Increase timeout, check max_tokens
Mitigation Actions
Rate Limiting (429)
# Immediate: reduce traffic # 1. Enable circuit breaker # 2. Queue non-critical requests # 3. Switch to Message Batches for bulk work # 4. Reduce max_tokens to shorten generation time
API Outage (500/529)
# Graceful degradation def get_response_with_fallback(prompt: str) -> str: try: msg = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=1024, messages=[{"role": "user", "content": prompt}] ) return msg.content[0].text except (anthropic.InternalServerError, anthropic.APIStatusError): return "Our AI assistant is temporarily unavailable. Please try again shortly."
Key Compromise
# 1. Immediately revoke key at console.anthropic.com # 2. Generate new key # 3. Deploy new key to all environments # 4. Audit recent usage for unauthorized calls # 5. File incident report
Postmortem Template
## Incident: [Title] - **Duration:** [start] to [end] - **Severity:** P[1-4] - **Impact:** [what users experienced] - **Root Cause:** [what went wrong] - **Detection:** [how we found out] - **Mitigation:** [what we did to fix it] - **Request IDs:** [from debug logs] - **Action Items:** - [ ] [preventive measure 1] - [ ] [preventive measure 2]
Error Handling
| Symptom | Likely Cause | Quick Fix |
|---|---|---|
| All requests fail 401 | Key rotated/expired | Check Console for active keys |
| Sudden 429 spike | Traffic burst or tier change | Check rate limit headers |
| Slow responses (>10s) | Large max_tokens or complex prompt | Reduce max_tokens, use Haiku |
| Intermittent 500s | Upstream API issue | Check status.anthropic.com |
Resources
Next Steps
For data compliance, see
anth-data-handling.