Claude-code-plugins-plus clade-incident-runbook

install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/claude-pack/skills/clade-incident-runbook" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-clade-incident-runbook && rm -rf "$T"
manifest: plugins/saas-packs/claude-pack/skills/clade-incident-runbook/SKILL.md
source content

Anthropic Incident Runbook

Overview

Respond to Anthropic API incidents in production — outages, sustained 529 errors, authentication failures, and timeouts. Covers status page checking, severity classification, model fallback activation, communication, and post-incident review.

Step 1: Confirm the Issue

# Check Anthropic status
curl -s https://status.anthropic.com/api/v2/status.json | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(f\"Status: {d['status']['description']} ({d['status']['indicator']})\")"

# Test API directly
curl -s -w "\nHTTP %{http_code} in %{time_total}s\n" \
  https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "claude-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-5-20251001","max_tokens":5,"messages":[{"role":"user","content":"ping"}]}'

Step 2: Classify Severity

SymptomSeverityAction
529 overloaded (intermittent)LowSDK auto-retries handle this
529 overloaded (sustained 5+ min)MediumSwitch to fallback model
401/403 on all requestsHighAPI key issue — check console
All requests timing outHighCheck status page, activate fallback
Status page shows incidentVariesFollow status page updates

Step 3: Activate Fallback

async function callWithFallback(params: Anthropic.MessageCreateParams) {
  try {
    return await client.messages.create(params);
  } catch (err) {
    if (err instanceof Anthropic.APIError && (err.status === 529 || err.status === 500)) {
      // Try a different model
      if (params.model.includes('opus')) {
        return await client.messages.create({ ...params, model: 'claude-sonnet-4-20250514' });
      }
      if (params.model.includes('sonnet')) {
        return await client.messages.create({ ...params, model: 'claude-haiku-4-5-20251001' });
      }
    }
    throw err;
  }
}

Step 4: Communicate

  • Update your status page if user-facing
  • Note: Anthropic incidents typically resolve in 15-60 minutes

Step 5: Post-Incident

  • Check your error logs for the incident window
  • Calculate impact (failed requests, user impact)
  • Verify all systems recovered

Output

  • Incident confirmed via status page and direct API test
  • Severity classified (Low/Medium/High) based on symptoms
  • Fallback activated if needed (downgrade model or queue requests)
  • Impact assessed and documented post-incident

Error Handling

ErrorCauseSolution
API ErrorCheck error type and status codeSee
clade-common-errors

Examples

See Step 1 (curl status check and API test), Step 2 (severity classification table), Step 3 (fallback code with model downgrade), and Step 5 (post-incident checklist) above.

Resources

Next Steps

See

clade-reliability-patterns
for building resilient integrations.

Prerequisites

  • Production Claude integration deployed
  • Fallback model configuration in place (see
    clade-reliability-patterns
    )
  • Monitoring/alerting configured (see
    clade-observability
    )

Instructions

Step 1: Review the patterns below

Each section contains production-ready code examples. Copy and adapt them to your use case.

Step 2: Apply to your codebase

Integrate the patterns that match your requirements. Test each change individually.

Step 3: Verify

Run your test suite to confirm the integration works correctly.