Claude-code-plugins-plus-skills groq-incident-runbook
install
source · Clone the upstream repo
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/groq-pack/skills/groq-incident-runbook" ~/.claude/skills/jeremylongshore-claude-code-plugins-plus-skills-groq-incident-runbook && rm -rf "$T"
manifest: plugins/saas-packs/groq-pack/skills/groq-incident-runbook/SKILL.md
Groq Incident Runbook
Overview
Rapid incident response procedures for Groq API failures. Groq is a third-party inference provider -- when it goes down, your mitigation options are: wait, fall back to a different model, or fall back to a different provider.
Severity Levels
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Complete API failure | < 15 min | Groq API returns 5xx on all models |
| P2 | Degraded performance | < 1 hour | High latency, partial 429s, one model down |
| P3 | Minor impact | < 4 hours | Intermittent errors, non-critical feature affected |
| P4 | No user impact | Next business day | Monitoring gap, cost anomaly |
Quick Triage (Run First)
    set -euo pipefail

    echo "=== 1. Groq API Status ==="
    curl -sf https://status.groq.com > /dev/null \
      && echo "status.groq.com: REACHABLE" \
      || echo "status.groq.com: UNREACHABLE"

    echo ""
    echo "=== 2. API Authentication ==="
    # "|| true" keeps the script running when curl cannot connect;
    # curl still prints 000 as the status code in that case.
    HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
      https://api.groq.com/openai/v1/models \
      -H "Authorization: Bearer $GROQ_API_KEY" || true)
    echo "GET /models: HTTP $HTTP_CODE"

    echo ""
    echo "=== 3. Model Availability ==="
    for model in "llama-3.1-8b-instant" "llama-3.3-70b-versatile"; do
      CODE=$(curl -s -o /dev/null -w "%{http_code}" \
        https://api.groq.com/openai/v1/chat/completions \
        -H "Authorization: Bearer $GROQ_API_KEY" \
        -H "Content-Type: application/json" \
        -d "{\"model\":\"$model\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}],\"max_tokens\":1}" || true)
      echo "$model: HTTP $CODE"
    done

    echo ""
    echo "=== 4. Rate Limit Status ==="
    curl -si https://api.groq.com/openai/v1/chat/completions \
      -H "Authorization: Bearer $GROQ_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model":"llama-3.1-8b-instant","messages":[{"role":"user","content":"ping"}],"max_tokens":1}' \
      2>/dev/null | grep -iE "^(x-ratelimit|retry-after)" || echo "No rate limit headers"
Decision Tree
    Is the Groq API responding?
    ├─ NO (timeout/connection refused):
    │  ├─ Check status.groq.com
    │  │  ├─ Incident reported → Wait, enable fallback provider
    │  │  └─ No incident → Network issue on our side (check DNS, firewall, proxy)
    │  └─ Check if api.groq.com resolves: dig api.groq.com
    │
    ├─ YES, but 401/403:
    │  ├─ API key revoked or expired → Rotate key
    │  └─ Key not set in environment → Check secret manager
    │
    ├─ YES, but 429:
    │  ├─ retry-after header present → Wait that many seconds
    │  ├─ All models 429 → Org-level limit hit; reduce traffic or upgrade plan
    │  └─ One model 429 → Route to a different model
    │
    ├─ YES, but 500/503:
    │  ├─ One model → Groq capacity issue on that model; use fallback model
    │  └─ All models → Groq-wide outage; enable fallback provider
    │
    └─ YES, but slow (latency > 2s):
       ├─ Large prompts → Reduce input size
       ├─ 70B model → Switch to 8B for speed
       └─ queue_time high → Groq queue congestion; try different model
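The decision tree above can also be encoded as a small lookup so that an alert handler suggests the same first action a human would. This is a minimal sketch, not part of the upstream skill: the `Probe` type, the `triageAction` name, and the returned strings are hypothetical, and treating every status of 500 or above like the 500/503 branch is an assumption.

```typescript
// Hypothetical types: one probe result per triage run.
type Scope = "one-model" | "all-models";

type Probe =
  | { kind: "unreachable" } // timeout or connection refused
  | { kind: "http"; status: number; scope: Scope; latencyMs: number };

// Maps a probe result to the first action from the decision tree.
function triageAction(p: Probe): string {
  if (p.kind === "unreachable") {
    return "Check status.groq.com and DNS; enable fallback provider if an incident is reported";
  }
  if (p.status === 401 || p.status === 403) {
    return "Check the key in the secret manager; rotate it if revoked or expired";
  }
  if (p.status === 429) {
    return p.scope === "all-models"
      ? "Org-level limit hit: reduce traffic or upgrade plan"
      : "Route to a different model";
  }
  if (p.status >= 500) {
    // Assumption: any 5xx is handled like the 500/503 branch.
    return p.scope === "all-models"
      ? "Groq-wide outage: enable fallback provider"
      : "Capacity issue on one model: use fallback model";
  }
  if (p.latencyMs > 2000) {
    return "Slow: reduce prompt size, switch 70B to 8B, or try a different model";
  }
  return "Healthy: no action";
}
```

The `scope` field has to come from probing more than one model, as step 3 of the triage script does; a single failing request cannot distinguish the two branches.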
Immediate Mitigations
Enable Fallback to Different Model
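    import Groq from "groq-sdk";

    // The client reads GROQ_API_KEY from the environment by default.
    const groq = new Groq();

    // If the primary model is failing, walk down the fallback list.
    // Confirm each ID against GET /models first; hosted model IDs get retired over time.
    async function mitigateModelFailure(messages: any[]) {
      const models = [
        "llama-3.3-70b-versatile", // Primary
        "llama-3.3-70b-specdec",   // Same quality, different infra
        "llama-3.1-8b-instant",    // Fastest, most available
      ];

      for (const model of models) {
        try {
          return await groq.chat.completions.create(
            { model, messages, max_tokens: 1024 },
            { timeout: 10_000 }, // per-request option in ms, not a body field
          );
        } catch (err: any) {
          console.warn(`Model ${model} failed: ${err.status} ${err.message}`);
          // Auth errors affect every model, so stop instead of trying the next one.
          if (err.status === 401 || err.status === 403) throw err;
        }
      }
      throw new Error("All Groq models unavailable");
    }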
429 Rate Limit — Immediate Actions
    set -euo pipefail

    # Check exact limit info
    curl -si https://api.groq.com/openai/v1/chat/completions \
      -H "Authorization: Bearer $GROQ_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model":"llama-3.1-8b-instant","messages":[{"role":"user","content":"ping"}],"max_tokens":1}' \
      2>/dev/null | grep -iE "x-ratelimit|retry-after" || echo "No rate limit headers"

    # Options:
    # 1. Wait for retry-after seconds
    # 2. Switch to a different model (each model has separate limits)
    # 3. Reduce request volume (disable non-critical features)
    # 4. If persistent, upgrade Groq plan at console.groq.com
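For option 1, application code needs to turn the `retry-after` header into a wait time. The sketch below is one way to do that, assuming the header carries either a number of seconds or an HTTP date (the two forms the HTTP specification allows); `retryAfterMs` and its `fallbackMs` default are hypothetical names and values, not part of any Groq SDK.

```typescript
// Converts a retry-after header value into milliseconds to wait.
// Returns fallbackMs when the header is missing or unparseable.
function retryAfterMs(
  header: string | null | undefined,
  fallbackMs: number = 1000, // assumed default, tune for your traffic
  now: number = Date.now(),
): number {
  if (header == null || header.trim() === "") return fallbackMs;

  // Form 1: delay in seconds (may be fractional).
  const seconds = Number(header);
  if (isFinite(seconds)) return Math.max(0, Math.ceil(seconds * 1000));

  // Form 2: an HTTP date; wait until that moment.
  const date = Date.parse(header);
  if (!isNaN(date)) return Math.max(0, date - now);

  return fallbackMs;
}

// Small helper for callers that want to wait in place.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));
```

A caller would typically run `await sleep(retryAfterMs(response.headers.get("retry-after")))` before retrying, and cap the result so that a large value does not stall a user-facing request.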
401 Auth Failure — Key Rotation
    set -euo pipefail

    # 1. Verify current key
    echo "Current key prefix: ${GROQ_API_KEY:0:8}"

    # 2. Create new key at console.groq.com/keys

    # 3. Test new key
    curl -s -o /dev/null -w "%{http_code}" \
      https://api.groq.com/openai/v1/models \
      -H "Authorization: Bearer $NEW_GROQ_KEY"

    # 4. Deploy new key to production
    # 5. Delete old key in console
Communication Templates
Internal Alert (Slack/PagerDuty)
    P[1-4] INCIDENT: Groq API [Error Type]
    Status: INVESTIGATING | MITIGATING | RESOLVED
    Impact: [What users see]
    Current action: [What we're doing]
    Fallback: [Enabled/Disabled]
    Next update in: [Time]
    Commander: @[name]
Status Page (External)
    AI Feature Performance Issue

    We're experiencing [degraded performance / intermittent errors] with our AI features.
    [Feature X] may respond slower than usual.
    We've activated backup systems and are monitoring the situation.

    Last updated: [timestamp]
Post-Incident
Evidence Collection
    set -euo pipefail

    INCIDENT_DIR="groq-incident-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$INCIDENT_DIR"

    # API diagnostics (do not abort the bundle if the API is still down)
    curl -s https://api.groq.com/openai/v1/models \
      -H "Authorization: Bearer $GROQ_API_KEY" > "$INCIDENT_DIR/models.json" \
      || echo "Could not fetch /models" >&2

    # Application logs (redacted); "|| true" covers the case where grep matches nothing
    kubectl logs -l app=your-app --since=1h 2>/dev/null \
      | grep -iE "groq|429|error|timeout" \
      | sed 's/gsk_[a-zA-Z0-9]*/gsk_REDACTED/g' \
      | tail -100 > "$INCIDENT_DIR/app-logs.txt" || true

    tar -czf "$INCIDENT_DIR.tar.gz" "$INCIDENT_DIR"
    echo "Evidence bundle: $INCIDENT_DIR.tar.gz"
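When logs are collected inside the application rather than through `kubectl`, the same filter, redact, and tail steps can be applied in code. The following is a rough equivalent of the `grep | sed | tail` pipeline above; `collectEvidence` is a hypothetical helper, and the key pattern is simply the one the script uses, so it inherits the assumption that keys look like `gsk_` followed by letters and digits.

```typescript
// Keeps only incident-relevant lines, masks Groq API keys,
// and returns at most `limit` of the most recent lines.
function collectEvidence(lines: string[], limit: number = 100): string[] {
  const relevant = /groq|429|error|timeout/i; // same terms as the grep step
  const apiKey = /gsk_[a-zA-Z0-9]*/g; // same pattern as the sed step

  if (limit <= 0) return [];

  return lines
    .filter((line) => relevant.test(line))
    .map((line) => line.replace(apiKey, "gsk_REDACTED"))
    .slice(-limit);
}
```

Redacting before the lines leave the process means the evidence bundle can be attached to a postmortem without a separate scrubbing pass.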
Postmortem Template
    ## Incident: Groq [Error Type] - [Date]

    **Duration:** X hours Y minutes
    **Severity:** P[1-4]
    **Impact:** [N users affected, feature X degraded]

    ### Timeline
    - HH:MM - First alert fired
    - HH:MM - On-call acknowledged, began triage
    - HH:MM - Root cause identified: [cause]
    - HH:MM - Mitigation applied: [what]
    - HH:MM - Resolved, monitoring

    ### Root Cause
    [Was it Groq-side or our side? Rate limit hit? Model deprecated? Key expired?]

    ### What Went Well
    - [Fallback activated automatically]

    ### What Could Improve
    - [Alert fired too late / fallback didn't work / no runbook]

    ### Action Items
    - [ ] [Action] - Owner - Due date
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Can't reach status.groq.com | Network issue | Use mobile or different network |
| All models failing | Groq-wide outage | Enable fallback provider (OpenAI, etc.) |
| Key rotation fails | No admin access | Escalate to team lead with console access |
| Fallback provider also down | Multi-provider outage | Degrade gracefully, show cached content |
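The last two rows describe a chain: try Groq, then a fallback provider, then degrade to cached content. A minimal sketch of that chain follows, assuming each provider is wrapped in a function that rejects on failure; `Provider`, `firstAvailable`, and the `degraded` flag are hypothetical names, and no real provider SDK is called here.

```typescript
// A provider is any async call that either resolves with a value or rejects.
type Provider<T> = { name: string; call: () => Promise<T> };

type Outcome<T> = { source: string; value: T; degraded: boolean };

// Tries providers in order; falls back to cached content if all of them fail.
async function firstAvailable<T>(
  providers: Provider<T>[],
  cached: T,
): Promise<Outcome<T>> {
  for (const provider of providers) {
    try {
      const value = await provider.call();
      return { source: provider.name, value, degraded: false };
    } catch (_err) {
      // This provider is down; move on to the next one.
    }
  }
  // Multi-provider outage: serve cached content and flag it as degraded.
  return { source: "cache", value: cached, degraded: true };
}
```

Returning `degraded` alongside the value lets the UI show a notice instead of silently serving stale content, and the `source` field gives the incident timeline a record of when traffic moved off Groq.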
Next Steps
For data handling compliance, see groq-data-handling.