Claude-code-plugins-plus-skills lindy-incident-runbook
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/lindy-pack/skills/lindy-incident-runbook" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-lindy-incident-runbook && rm -rf "$T"
plugins/saas-packs/lindy-pack/skills/lindy-incident-runbook/SKILL.mdLindy Incident Runbook
Overview
Incident response procedures for Lindy AI agent failures. Covers platform outages, individual agent failures, integration breakdowns, credit exhaustion, and webhook endpoint failures.
Incident Severity Levels
| Severity | Description | Response Time | Examples |
|---|---|---|---|
| SEV1 | All agents failing, customer impact | 15 minutes | Lindy platform outage, all webhooks failing |
| SEV2 | Critical agent down | 30 minutes | Support bot offline, phone agent unreachable |
| SEV3 | Degraded performance | 2 hours | High latency, intermittent failures |
| SEV4 | Minor issue | 24 hours | Non-critical agent misconfigured |
Quick Diagnostics (First 5 Minutes)
Step 1: Check Lindy Platform Status
# Is Lindy up? curl -s -o /dev/null -w "Lindy API: HTTP %{http_code}\n" \ "https://public.lindy.ai" --max-time 5 # Check status page echo "Status page: https://status.lindy.ai"
Step 2: Check Your Integration
# Is your webhook receiver up? curl -s -o /dev/null -w "Our endpoint: HTTP %{http_code}\n" \ "https://api.yourapp.com/health" --max-time 5 # Is the webhook auth working? curl -s -o /dev/null -w "Webhook auth: HTTP %{http_code}\n" \ -X POST "https://api.yourapp.com/lindy/callback" \ -H "Authorization: Bearer $LINDY_WEBHOOK_SECRET" \ -H "Content-Type: application/json" \ -d '{"test": true}' --max-time 5
Step 3: Check Credit Balance
Log in at https://app.lindy.ai > Settings > Billing
- Credits at 0? Agents stop processing
- Credits low? Non-essential agents may be paused
Incident Playbooks
Incident: Lindy Platform Outage (SEV1)
Symptoms: All agents failing, status.lindy.ai shows incident Impact: All Lindy-dependent workflows halted
Runbook:
- Confirm outage at https://status.lindy.ai
- Notify team: "Lindy platform outage confirmed. All agents affected."
- Activate fallback procedures:
- Route support emails to human inbox
- Disable webhook triggers from your app
- Queue events for replay when Lindy recovers
- Monitor status page for recovery
- When recovered: re-enable triggers, replay queued events, verify agent health
Fallback code:
async function triggerLindyWithFallback(payload: any) { try { const response = await fetch(WEBHOOK_URL, { method: 'POST', headers: { 'Authorization': `Bearer ${SECRET}`, 'Content-Type': 'application/json', }, body: JSON.stringify(payload), signal: AbortSignal.timeout(10000), // 10s timeout }); if (!response.ok) throw new Error(`HTTP ${response.status}`); return { routed: 'lindy' }; } catch (error) { console.error('Lindy unreachable, activating fallback:', error); await queueForReplay(payload); // Store for later await notifyTeam(`Lindy trigger failed: ${error}`); return { routed: 'fallback' }; } }
Incident: Individual Agent Failure (SEV2)
Symptoms: Specific agent tasks showing "Failed" status Impact: One workflow affected, others may be fine
Runbook:
- Open agent > Tasks tab > Filter by "Failed"
- Click latest failed task — identify the failing step
- Diagnose based on failing step type:
- Trigger step: Auth expired? Filter too restrictive?
- Action step: Integration token expired? Target API down?
- Condition step: Ambiguous condition prompt?
- Agent step: Looping? Exit conditions unreachable?
- Fix the root cause:
- Re-authorize expired integrations
- Fix action configuration
- Simplify condition prompts
- Add fallback exit conditions
- Test with a manual trigger
- Monitor next 5 tasks for success
Incident: Integration Auth Expired (SEV2-3)
Symptoms: Actions failing with "Not authorized" or "Token expired" Impact: All tasks using that integration fail
Runbook:
- Identify which integration is failing (Gmail, Slack, Sheets, etc.)
- In Lindy dashboard: Settings > Integrations
- Find the expired connection (may show warning icon)
- Click Re-authorize and complete OAuth flow
- Re-test the agent with a manual trigger
- Set calendar reminder for 90-day re-authorization check
Incident: Credit Exhaustion (SEV2-3)
Symptoms: Agents stop running, no new tasks created Impact: All agents paused until credits refill
Runbook:
- Confirm at Settings > Billing: credits at 0
- Immediate: Upgrade plan or purchase additional credits
- Investigate: Which agent consumed the most credits?
- Root cause: trigger storm? looping agent step? large model overuse?
- Fix: Add trigger filters, set exit conditions, downgrade model
- Prevent: Set budget alerts at 50%, 80%, 95% thresholds
Incident: Webhook Endpoint Failure (SEV2-3)
Symptoms: Lindy agent runs but your callback never receives data Impact: Agent completes but results are lost
Runbook:
- Check your endpoint health:
curl -s https://api.yourapp.com/health - Check server logs for incoming requests from Lindy
- Verify the HTTP Request action URL matches your production endpoint
- Test endpoint independently: send a POST with curl
- If endpoint was down: replay failed tasks (re-trigger the agent)
- If URL mismatch: update URL in Lindy agent HTTP Request action
Escalation Matrix
| Level | Contact | When |
|---|---|---|
| L1 | On-call engineer | Initial response, diagnostics |
| L2 | Engineering lead | After 30 min SEV1, 1 hour SEV2 |
| L3 | VP Engineering | After 1 hour SEV1 |
| Lindy Support | support@lindy.ai | Confirmed Lindy platform issue |
Post-Incident Template
## Incident Report **Date**: YYYY-MM-DD **Severity**: SEV[1-4] **Duration**: [start time] to [end time] ([total minutes]) **Impact**: [what was affected, customer impact] ### Timeline - HH:MM — Issue detected via [monitoring/user report] - HH:MM — On-call paged, diagnostics started - HH:MM — Root cause identified: [cause] - HH:MM — Fix applied: [what was done] - HH:MM — Service restored, monitoring confirmed ### Root Cause [Technical description of what failed and why] ### Resolution [What was done to fix it] ### Prevention - [ ] [Action item 1] - [ ] [Action item 2] - [ ] [Action item 3]
Error Handling
| Incident Type | Detection | Automated Response |
|---|---|---|
| Platform outage | Health check fails | Queue events, notify team |
| Agent failure | Task Completed trigger | Slack alert to #ops |
| Auth expiry | Action step fails | Alert + re-auth link |
| Credit exhaustion | Billing check | Pause non-critical agents |
| Endpoint down | Health check | Redirect to fallback |
Resources
Next Steps
Proceed to
lindy-data-handling for data security and compliance.