Claude-code-plugins-plus-skills maintainx-incident-runbook
install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/maintainx-pack/skills/maintainx-incident-runbook" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-maintainx-incident-runbook && rm -rf "$T"
manifest: plugins/saas-packs/maintainx-pack/skills/maintainx-incident-runbook/SKILL.md
MaintainX Incident Runbook
Overview
Step-by-step procedures for responding to MaintainX integration incidents, from detection through resolution and post-mortem.
Prerequisites
- Access to monitoring dashboards
- MaintainX admin API credentials
- On-call contact list
Severity Classification
| Severity | Definition | Response Time |
|---|---|---|
| SEV-1 | Complete integration failure, no work orders processing | 15 min |
| SEV-2 | Partial failure, some endpoints degraded | 1 hour |
| SEV-3 | Performance degradation, slow responses | 4 hours |
| SEV-4 | Non-critical feature broken, workaround available | Next business day |
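The table above can be turned into a first-pass classifier for triage tooling. The following is a minimal sketch, not part of the skill itself: the `EndpointCheck` shape, the function name, and the 2000 ms "slow" threshold are assumptions chosen for illustration.

```typescript
// Hypothetical types and names; severities and response times mirror the table above.
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

interface EndpointCheck {
  endpoint: string;  // e.g. "workorders"
  ok: boolean;       // true if the endpoint returned a 2xx status
  latencyMs: number; // observed response time
}

// Response targets copied from the severity table.
const RESPONSE_TIME: Record<Severity, string> = {
  'SEV-1': '15 min',
  'SEV-2': '1 hour',
  'SEV-3': '4 hours',
  'SEV-4': 'Next business day',
};

// slowThresholdMs is an assumed cutoff for "slow responses"; tune it for your setup.
function classifySeverity(checks: EndpointCheck[], slowThresholdMs = 2000): Severity {
  const failing = checks.filter((c) => !c.ok);
  // Every endpoint failing: complete integration failure.
  if (checks.length > 0 && failing.length === checks.length) return 'SEV-1';
  // Some endpoints failing: partial failure.
  if (failing.length > 0) return 'SEV-2';
  // Everything responds, but at least one endpoint is slow.
  if (checks.some((c) => c.latencyMs > slowThresholdMs)) return 'SEV-3';
  // API checks look healthy; whatever remains is treated as non-critical.
  return 'SEV-4';
}
```

An automated result like this is only a starting point; the on-call engineer should still confirm the severity against actual business impact.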
Instructions
Step 1: Immediate Triage (First 5 Minutes)
```bash
#!/bin/bash
echo "=== MaintainX Incident Triage ==="
echo "Time: $(date -u)"

# Check MaintainX API status
echo -e "\n--- API Health ---"
for endpoint in users workorders assets locations; do
  # --max-time keeps a hung endpoint from stalling triage;
  # curl prints 000 as the status code if no HTTP response arrives
  CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
    "https://api.getmaintainx.com/v1/$endpoint?limit=1" \
    -H "Authorization: Bearer $MAINTAINX_API_KEY")
  echo "  /$endpoint: HTTP $CODE"
done

# Check your integration service
echo -e "\n--- Integration Service ---"
# Capture the response first: piping curl straight into jq hides a curl failure,
# because the pipeline's exit status comes from jq
if HEALTH=$(curl -s --max-time 5 http://localhost:3000/health); then
  echo "$HEALTH" | jq . 2>/dev/null || echo "$HEALTH"
else
  echo "  Service unreachable"
fi

# Check recent error logs
echo -e "\n--- Recent Errors (last 10 min) ---"
# Adjust for your log system:
# journalctl -u maintainx-sync --since "10 min ago" --no-pager | grep -i error | tail -10
```
Step 2: Determine Root Cause
| Symptom | Likely Cause | Check |
|---|---|---|
| All endpoints return 401 | API key expired | Test the key with curl |
| All endpoints return 5xx | MaintainX platform outage | Check status.getmaintainx.com |
| 429 on all requests | Rate limit exceeded | Review request volume in last hour |
| Specific endpoint 404 | API path changed | Check MaintainX changelog |
| Timeouts | Network issue | |
| Your service crashes | Application error | Check container logs, OOM, disk space |
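The symptom table above can also be expressed as a lookup over the status codes printed by the triage script. This is a sketch under stated assumptions: the function name and the `Cause` type are hypothetical, and the rules only cover the HTTP-level rows of the table. A code of `0` corresponds to curl printing `000` when no HTTP response was received.

```typescript
// Hypothetical helper: maps per-endpoint HTTP status codes to the
// "Likely Cause" column of the table above.
type Cause =
  | 'API key expired'
  | 'MaintainX platform outage'
  | 'Rate limit exceeded'
  | 'API path changed'
  | 'Network issue'
  | 'Unknown';

function likelyCause(codes: Record<string, number>): Cause {
  const values = Object.values(codes);
  if (values.length === 0) return 'Unknown';

  const all = (pred: (code: number) => boolean) => values.every(pred);

  if (all((c) => c === 401)) return 'API key expired';
  if (all((c) => c >= 500 && c <= 599)) return 'MaintainX platform outage';
  if (all((c) => c === 429)) return 'Rate limit exceeded';
  // 0 means no HTTP response at all (timeout, DNS failure, refused connection).
  if (values.some((c) => c === 0)) return 'Network issue';
  // A 404 on a specific endpoint while others respond suggests a changed path.
  if (values.some((c) => c === 404)) return 'API path changed';
  return 'Unknown';
}
```

For example, `likelyCause({ users: 200, workorders: 404 })` returns `'API path changed'`, which points to the changelog check in the table.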
Step 3: Apply Mitigation
API Key Expired (SEV-1):
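```bash
# Generate new key: MaintainX > Settings > Integrations > New Key

# Update in production (GCP Secret Manager):
echo -n "NEW_KEY_HERE" | gcloud secrets versions add maintainx-api-key --data-file=-

# Roll out a new revision so the service picks up the new key.
# Assumes the secret is exposed to the service as the MAINTAINX_API_KEY env var.
# (Do not pass --no-traffic here: the new revision would receive no traffic.)
gcloud run services update maintainx-integration --region us-central1 \
  --update-secrets=MAINTAINX_API_KEY=maintainx-api-key:latest
```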
Rate Limited (SEV-2):
```typescript
// Immediately reduce request volume

// 1. Enable emergency rate limiting
process.env.MAINTAINX_MAX_REQUESTS_PER_SEC = '1';

// 2. Disable non-critical sync jobs
await disableScheduledJobs(['asset-sync', 'report-generator']);

// 3. Keep only critical work order processing
```
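The snippet above sets `MAINTAINX_MAX_REQUESTS_PER_SEC` but does not show how a client would honor it. One possible approach is a small throttle that spaces requests evenly. The `Throttle` class below is a hypothetical sketch, not part of the MaintainX SDK or this skill; it takes the current time as an argument so it can be tested without real timers.

```typescript
// Hypothetical throttle: spaces requests so that at most maxPerSec
// requests start per second.
class Throttle {
  private nextSlotMs = 0;
  private readonly intervalMs: number;

  constructor(maxPerSec: number) {
    if (!(maxPerSec > 0)) throw new Error('maxPerSec must be positive');
    this.intervalMs = 1000 / maxPerSec;
  }

  // Reserves the next send slot and returns how many milliseconds the
  // caller should wait before sending, given the current time.
  reserve(nowMs: number): number {
    const slot = Math.max(nowMs, this.nextSlotMs);
    this.nextSlotMs = slot + this.intervalMs;
    return slot - nowMs;
  }
}
```

A caller would read the limit from the environment, call `reserve(Date.now())` before each request, and sleep for the returned number of milliseconds.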
MaintainX Platform Outage (SEV-1):
```typescript
// Switch to queue-based processing:
// buffer all outgoing requests for replay after recovery.
// Note: this queue lives in memory and is lost if the process restarts.
const queue: Array<{ method: string; path: string; body: any }> = [];

function bufferRequest(method: string, path: string, body?: any) {
  queue.push({ method, path, body });
  console.log(`Buffered: ${method} ${path} (queue size: ${queue.length})`);
}

// When MaintainX recovers, replay buffered requests in order.
// Each request is removed only after it succeeds, so an interrupted replay
// can be resumed without re-sending requests that already went through.
async function replayQueue(client: MaintainXClient) {
  console.log(`Replaying ${queue.length} buffered requests...`);
  while (queue.length > 0) {
    const req = queue[0];
    await withRetry(() => client.request(req.method, req.path, req.body));
    queue.shift();
  }
}
```
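The replay code above calls `withRetry` without defining it. A minimal sketch of what such a helper might look like, assuming exponential backoff, is shown here; the defaults (500 ms base delay, 30 s cap, 5 attempts) are illustrative assumptions, not values taken from the skill.

```typescript
// Hypothetical helper: delay before the next retry, doubling on each attempt.
// attempt is 0-based: 500, 1000, 2000, ... capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Hypothetical sketch of withRetry: runs fn up to maxAttempts times,
// sleeping between failed attempts, and rethrows the last error.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // No point sleeping after the final attempt.
      if (attempt < maxAttempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
      }
    }
  }
  throw lastError;
}
```

In a real integration it is usually worth retrying only on transient failures (timeouts, 429, 5xx) and failing fast on errors such as 401 or 404, which a retry will not fix.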
Step 4: Verify Resolution
```bash
# Run full health check
curl -s http://localhost:3000/health | jq .

# Verify data flow
# (date -d is GNU date syntax; on macOS/BSD use: date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)
echo "Work orders created in last hour:"
curl -s "https://api.getmaintainx.com/v1/workorders?createdAtGte=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)&limit=5" \
  -H "Authorization: Bearer $MAINTAINX_API_KEY" | jq '.workOrders | length'

# Check for data gaps
echo "Checking sync state..."
cat .maintainx-sync-state.json 2>/dev/null || echo "No sync state file found"
```
Step 5: Post-Incident Documentation
```markdown
## Incident Report Template

**Date**: YYYY-MM-DD
**Severity**: SEV-X
**Duration**: X hours Y minutes
**Impact**: [What was affected - e.g., "work order sync halted for 2 hours"]

### Timeline
- HH:MM - Alert triggered
- HH:MM - Triage started
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Full recovery confirmed

### Root Cause
[Technical explanation]

### Resolution
[What was done to fix it]

### Action Items
- [ ] Implement [specific improvement]
- [ ] Add monitoring for [gap found]
- [ ] Update runbook with [lesson learned]
```
Output
- Incident triaged and severity classified
- Root cause identified using diagnostic steps
- Mitigation applied (key rotation, rate reduction, or request buffering)
- Recovery verified with health checks and data flow validation
- Post-incident report documented
Error Handling
| Scenario | Immediate Action |
|---|---|
| Total API failure | Buffer requests, check status page, escalate |
| Intermittent 500s | Enable retry logic, reduce request rate |
| Data sync gap | Note gap window, schedule backfill after recovery |
| Webhook delivery failure | Fall back to polling, queue missed events |
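For the "Data sync gap" row, the gap window can be split into smaller chunks so a backfill does not hit the API with one large burst after recovery. The helper below is a hypothetical sketch; the function name, the `BackfillWindow` shape, and the 60-minute default chunk size are assumptions for illustration.

```typescript
// Hypothetical shape for one backfill chunk (ISO 8601 UTC timestamps).
interface BackfillWindow {
  from: string;
  to: string;
}

// Splits the gap [gapStart, gapEnd) into consecutive windows of at most
// chunkMinutes each; the final window is truncated at gapEnd.
function backfillWindows(gapStart: Date, gapEnd: Date, chunkMinutes = 60): BackfillWindow[] {
  if (chunkMinutes <= 0) throw new Error('chunkMinutes must be positive');
  const windows: BackfillWindow[] = [];
  const stepMs = chunkMinutes * 60000;
  const endMs = gapEnd.getTime();
  for (let t = gapStart.getTime(); t < endMs; t += stepMs) {
    windows.push({
      from: new Date(t).toISOString(),
      to: new Date(Math.min(t + stepMs, endMs)).toISOString(),
    });
  }
  return windows;
}
```

Each window can then be processed as its own backfill job, which makes it easy to resume if the backfill itself is interrupted.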
Resources
Next Steps
For data handling patterns, see `maintainx-data-handling`.
Examples
Automated alerting on integration health:
```typescript
// Check health every 5 minutes, alert on failure
import cron from 'node-cron';

cron.schedule('*/5 * * * *', async () => {
  try {
    const res = await fetch('http://localhost:3000/health');
    const health = await res.json();
    if (health.status !== 'healthy') {
      await sendPagerDutyAlert({
        severity: 'critical',
        summary: `MaintainX integration degraded: ${JSON.stringify(health.checks)}`,
      });
    }
  } catch {
    // fetch or JSON parsing failed: treat the service as unreachable
    await sendPagerDutyAlert({
      severity: 'critical',
      summary: 'MaintainX integration service unreachable',
    });
  }
});
```