Skillshub clade-incident-runbook
install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/jeremylongshore/claude-code-plugins-plus-skills/clade-incident-runbook" ~/.claude/skills/comeonoliver-skillshub-clade-incident-runbook && rm -rf "$T"
manifest:
skills/jeremylongshore/claude-code-plugins-plus-skills/clade-incident-runbook/SKILL.md
source content
Anthropic Incident Runbook
Overview
Respond to Anthropic API incidents in production — outages, sustained 529 errors, authentication failures, and timeouts. Covers status page checking, severity classification, model fallback activation, communication, and post-incident review.
Step 1: Confirm the Issue
    # Check Anthropic status
    curl -s https://status.anthropic.com/api/v2/status.json | python3 -c "import json, sys; s = json.load(sys.stdin)['status']; print(f\"Status: {s['description']} ({s['indicator']})\")"

    # Test the API directly
    curl -s -w "\nHTTP %{http_code} in %{time_total}s\n" \
      https://api.anthropic.com/v1/messages \
      -H "x-api-key: $ANTHROPIC_API_KEY" \
      -H "anthropic-version: 2023-06-01" \
      -H "content-type: application/json" \
      -d '{"model":"claude-haiku-4-5-20251001","max_tokens":5,"messages":[{"role":"user","content":"ping"}]}'
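When the status check runs from a monitor rather than a terminal, the same JSON can be interpreted in code. The following is a minimal sketch, assuming the endpoint returns the Statuspage v2 shape, in which `status.indicator` is `none` when no incident is open. `StatusPayload`, `isIncidentActive`, and `summarizeStatus` are hypothetical names, and the sample payload is hand-written rather than real data.

```typescript
// Fields used from /api/v2/status.json (assumed Statuspage v2 format).
interface StatusPayload {
  status: { indicator: string; description: string };
}

// Hypothetical helper: any indicator other than "none" counts as an active incident.
function isIncidentActive(payload: StatusPayload): boolean {
  return payload.status.indicator !== "none";
}

// Hypothetical helper: one-line summary in the same format as the curl check above.
function summarizeStatus(payload: StatusPayload): string {
  const { description, indicator } = payload.status;
  return `Status: ${description} (${indicator})`;
}

// Hand-written sample payload for illustration only.
const sample: StatusPayload = {
  status: { indicator: "minor", description: "Partially Degraded Service" },
};
console.log(summarizeStatus(sample), isIncidentActive(sample));
```

Keeping the interpretation in pure functions means the fetch step can be swapped out (cron job, health endpoint, alerting hook) without touching the logic.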
Step 2: Classify Severity
| Symptom | Severity | Action |
|---|---|---|
| 529 overloaded (intermittent) | Low | SDK auto-retries handle this |
| 529 overloaded (sustained 5+ min) | Medium | Switch to fallback model |
| 401/403 on all requests | High | API key issue — check console |
| All requests timing out | High | Check status page, activate fallback |
| Status page shows incident | Varies | Follow status page updates |
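The table above can be encoded so that alerting applies the same rules every time. This is a sketch under stated assumptions: the `Observation` shape is hypothetical, "intermittent" is taken to mean anything under the 5-minute threshold, and the "Status page shows incident" row is left out because its severity varies.

```typescript
type Severity = "Low" | "Medium" | "High";

// Hypothetical summary of what monitoring saw over a recent window.
interface Observation {
  status: number | "timeout";  // dominant HTTP status, or "timeout"
  affectsAllRequests: boolean; // true if every request in the window failed
  sustainedMinutes: number;    // how long the symptom has persisted
}

// Maps an observation to the severity table.
// Returns undefined for symptoms the table does not cover.
function classifySeverity(o: Observation): Severity | undefined {
  if (o.status === 529) {
    // Assumption: under 5 minutes counts as intermittent.
    return o.sustainedMinutes >= 5 ? "Medium" : "Low";
  }
  if ((o.status === 401 || o.status === 403) && o.affectsAllRequests) {
    return "High";
  }
  if (o.status === "timeout" && o.affectsAllRequests) {
    return "High";
  }
  return undefined;
}

console.log(
  classifySeverity({ status: 529, affectsAllRequests: false, sustainedMinutes: 7 }),
);
```

Returning `undefined` for uncovered symptoms forces the caller to decide explicitly, instead of silently defaulting to a severity.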
Step 3: Activate Fallback
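    import Anthropic from '@anthropic-ai/sdk';

    // Reads ANTHROPIC_API_KEY from the environment by default.
    const client = new Anthropic();

    // Retry once on a lower model tier when the API reports overload (529) or a server error (500).
    async function callWithFallback(params: Anthropic.MessageCreateParams) {
      try {
        return await client.messages.create(params);
      } catch (err) {
        if (err instanceof Anthropic.APIError && (err.status === 529 || err.status === 500)) {
          // Try a different model
          if (params.model.includes('opus')) {
            return await client.messages.create({ ...params, model: 'claude-sonnet-4-20250514' });
          }
          if (params.model.includes('sonnet')) {
            return await client.messages.create({ ...params, model: 'claude-haiku-4-5-20251001' });
          }
        }
        // Not a fallback-eligible error, or no lower tier to fall back to
        throw err;
      }
    }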
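The fallback above tries one lower tier and then gives up. A variation, sketched here as an assumption rather than the runbook's own method, keeps stepping down the chain (opus to sonnet to haiku) until a call succeeds or no tier remains. The request function is injected so the logic can be exercised without network access. `nextFallbackModel`, `isFallbackEligible`, and `callWithFallbackChain` are hypothetical names; in production, `send` would wrap `client.messages.create`.

```typescript
// Next model to try, or undefined if there is no lower tier.
// Substring matching and model IDs mirror the fallback function above.
function nextFallbackModel(model: string): string | undefined {
  if (model.includes("opus")) return "claude-sonnet-4-20250514";
  if (model.includes("sonnet")) return "claude-haiku-4-5-20251001";
  return undefined;
}

// Same status codes the fallback function above treats as eligible.
function isFallbackEligible(status: number | undefined): boolean {
  return status === 529 || status === 500;
}

// Walks down the chain until a call succeeds or no fallback remains.
// `send` is an injected (hypothetical) function that performs the request
// for a given model and rejects with an error carrying a `status` field.
async function callWithFallbackChain<T>(
  model: string,
  send: (model: string) => Promise<T>,
): Promise<T> {
  let current = model;
  for (;;) {
    try {
      return await send(current);
    } catch (err) {
      const status = (err as { status?: number } | null)?.status;
      const next = isFallbackEligible(status)
        ? nextFallbackModel(current)
        : undefined;
      // Rethrow when the error is not eligible or the chain is exhausted.
      if (next === undefined) throw err;
      current = next;
    }
  }
}
```

Each step down trades output quality for availability, so callers may want to log which model actually served the request.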
Step 4: Communicate
- Update your status page if user-facing
- Note: Anthropic incidents typically resolve in 15-60 minutes
Step 5: Post-Incident
- Check your error logs for the incident window
- Calculate impact (failed requests, user impact)
- Verify all systems recovered
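The impact calculation in the checklist above can be sketched as a small function over request logs. The `RequestLog` shape is hypothetical, so real logs would need mapping into it, and "user impact" is interpreted here as the number of distinct users with at least one failed request.

```typescript
// Hypothetical log record; adapt to whatever your logging pipeline emits.
interface RequestLog {
  timestamp: number; // epoch milliseconds
  userId: string;
  ok: boolean;       // false if the request ultimately failed
}

interface Impact {
  totalRequests: number;
  failedRequests: number;
  errorRate: number;     // failedRequests / totalRequests, 0 when there was no traffic
  affectedUsers: number; // distinct users with at least one failure
}

// Summarizes impact for requests with start <= timestamp < end.
function calculateImpact(logs: RequestLog[], start: number, end: number): Impact {
  const inWindow = logs.filter((l) => l.timestamp >= start && l.timestamp < end);
  const failed = inWindow.filter((l) => !l.ok);
  const affectedUsers = new Set(failed.map((l) => l.userId)).size;
  return {
    totalRequests: inWindow.length,
    failedRequests: failed.length,
    errorRate: inWindow.length === 0 ? 0 : failed.length / inWindow.length,
    affectedUsers,
  };
}
```

Running the same function over the window immediately after the incident gives a quick check that the error rate has returned to baseline.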
Output
- Incident confirmed via status page and direct API test
- Severity classified (Low/Medium/High) based on symptoms
- Fallback activated if needed (downgrade model or queue requests)
- Impact assessed and documented post-incident
Error Handling
| Error | Cause | Solution |
|---|---|---|
| API Error | Check error type and status code | See |
Examples
See Step 1 (curl status check and API test), Step 2 (severity classification table), Step 3 (fallback code with model downgrade), and Step 5 (post-incident checklist) above.
Resources
Next Steps
See clade-reliability-patterns for building resilient integrations.
Prerequisites
- Production Claude integration deployed
- Fallback model configuration in place (see clade-reliability-patterns)
- Monitoring/alerting configured (see clade-observability)
Instructions
Step 1: Review the patterns above
Each section contains production-ready code examples. Copy and adapt them to your use case.
Step 2: Apply to your codebase
Integrate the patterns that match your requirements. Test each change individually.
Step 3: Verify
Run your test suite to confirm the integration works correctly.