claude-skill-registry · groq-incident-runbook

Install

Source · Clone the upstream repo

```shell
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into ~/.claude/skills/

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/groq-incident-runbook" ~/.claude/skills/majiayu000-claude-skill-registry-groq-incident-runbook && rm -rf "$T"
```

Manifest: skills/data/groq-incident-runbook/SKILL.md
Groq Incident Runbook
Overview
Rapid incident response procedures for Groq-related outages.
Prerequisites
- Access to Groq dashboard and status page
- kubectl access to production cluster
- Prometheus/Grafana access
- Communication channels (Slack, PagerDuty)
Severity Levels
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Complete outage | < 15 min | Groq API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |
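For paging automation, the response-time targets from the table can be encoded directly. The helper name is illustrative, not part of the runbook's tooling:

```shell
#!/usr/bin/env bash
# Map a severity level to its response-time target from the table above.
response_target() {
  case "$1" in
    P1) echo "15 minutes" ;;
    P2) echo "1 hour" ;;
    P3) echo "4 hours" ;;
    P4) echo "next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```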
Quick Triage
```shell
# 1. Check Groq status
curl -s https://status.groq.com | jq

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.groq'

# 3. Check error rate (last 5 min) — quote the URL so the shell
#    doesn't glob the [5m] range selector
curl -s 'localhost:9090/api/v1/query?query=rate(groq_errors_total[5m])'

# 4. Recent error logs
kubectl logs -l app=groq-integration --since=5m | grep -i error | tail -20
```
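Step 3 returns a float rate; a hypothetical threshold check can turn it into a yes/no triage answer (the 0.05 errors/sec default is an assumed SLO, not from this runbook):

```shell
#!/usr/bin/env bash
# Exit 0 when the observed error rate exceeds the threshold.
# awk does the float comparison, which plain shell arithmetic can't.
errors_exceed() {
  awk -v rate="$1" -v max="${2:-0.05}" 'BEGIN { exit !(rate + 0 > max + 0) }'
}
```

Usage: `errors_exceed "$(extract_rate_from_prometheus_response)" 0.05 && echo "over budget"`.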
Decision Tree
```
Groq API returning errors?
├─ YES: Is status.groq.com showing an incident?
│   ├─ YES → Wait for Groq to resolve. Enable fallback.
│   └─ NO  → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO  → Our infrastructure issue. Check pods, memory, network.
```
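The branching above can be sketched as a small shell helper. This is hypothetical: it takes yes/no answers to the tree's questions as arguments rather than performing live checks:

```shell
#!/usr/bin/env bash
# Route an incident to a branch of the decision tree.
# Args: groq_errors (yes/no), groq_incident (yes/no), local_healthy (yes/no)
triage_route() {
  local groq_errors="$1" groq_incident="$2" local_healthy="$3"
  if [ "$groq_errors" = yes ]; then
    if [ "$groq_incident" = yes ]; then
      echo "upstream: wait for Groq, enable fallback"
    else
      echo "integration: check credentials and config"
    fi
  else
    if [ "$local_healthy" = yes ]; then
      echo "monitor: likely resolved or intermittent"
    else
      echo "infra: check pods, memory, network"
    fi
  fi
}
```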
Immediate Actions by Error Type
401/403 - Authentication
```shell
# Verify the API key is set
kubectl get secret groq-secrets -o jsonpath='{.data.api-key}' | base64 -d

# Check whether the key was rotated
# → Verify in the Groq dashboard

# Remediation: update the secret and restart pods
kubectl create secret generic groq-secrets \
  --from-literal=api-key=NEW_KEY \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/groq-integration
```
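Before rotating, it can help to branch on the exact status code, since 401 and 403 imply different fixes. An illustrative classifier, not part of the runbook's tooling:

```shell
#!/usr/bin/env bash
# Suggest a remediation path based on the HTTP auth error code.
auth_action() {
  case "$1" in
    401) echo "key invalid or revoked: rotate secret, restart pods" ;;
    403) echo "key lacks permission: check scopes in Groq dashboard" ;;
    *)   echo "not an auth error" ;;
  esac
}
```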
429 - Rate Limited
```shell
# Check rate limit headers
curl -v https://api.groq.com 2>&1 | grep -i rate

# Enable request queuing
kubectl set env deployment/groq-integration RATE_LIMIT_MODE=queue

# Long-term: contact Groq for a limit increase
```
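`RATE_LIMIT_MODE=queue` is the deployment-level mitigation; on the client side, capped exponential backoff is the usual companion to a 429. A sketch of the retry math (the base of 1s and cap of 60s are assumptions, not values from this runbook):

```shell
#!/usr/bin/env bash
# Delay before retry attempt N: base * 2^(N-1), capped.
backoff_seconds() {
  local attempt="$1" base=1 cap=60
  local delay=$(( base << (attempt - 1) ))   # left shift = multiply by 2^(N-1)
  [ "$delay" -gt "$cap" ] && delay=$cap
  echo "$delay"
}
```

A jittered variant (adding a random fraction of the delay) avoids synchronized retries across pods.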
500/503 - Groq Errors
```shell
# Enable graceful degradation
kubectl set env deployment/groq-integration GROQ_FALLBACK=true

# Then:
# - Notify users of degraded service
# - Update the status page
# - Monitor Groq status for resolution
```
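A hypothetical trip condition for deciding when to flip `GROQ_FALLBACK=true` automatically; the threshold of 3 consecutive 5xx responses is an assumption, not taken from this runbook:

```shell
#!/usr/bin/env bash
# Decide whether to enable fallback after consecutive upstream 5xx errors.
should_fallback() {
  local consecutive="$1" threshold="${2:-3}"
  if [ "$consecutive" -ge "$threshold" ]; then echo yes; else echo no; fi
}
```

Usage: `[ "$(should_fallback "$errs")" = yes ] && kubectl set env deployment/groq-integration GROQ_FALLBACK=true`.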
Communication Templates
Internal (Slack)
```
🔴 P1 INCIDENT: Groq Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
```
External (Status Page)
```
Groq Integration Issue

We're experiencing issues with our Groq integration. Some users may
experience [specific impact]. We're actively investigating and will
provide updates.

Last updated: [timestamp]
```
Post-Incident
Evidence Collection
```shell
# Generate debug bundle
./scripts/groq-debug-bundle.sh

# Export relevant logs
kubectl logs -l app=groq-integration --since=1h > incident-logs.txt

# Capture metrics (query_range requires explicit start/end Unix timestamps
# and a step; a bare start=2h is not valid Prometheus API syntax)
curl -s "localhost:9090/api/v1/query_range?query=groq_errors_total&start=$(( $(date +%s) - 7200 ))&end=$(date +%s)&step=60" > metrics.json
```
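`scripts/groq-debug-bundle.sh` is referenced above but its contents aren't shown in this runbook; a minimal sketch of what such a bundle script might do, assuming the same `app=groq-integration` label:

```shell
#!/usr/bin/env bash
# Create a timestamped directory for incident evidence.
make_bundle_dir() {
  local ts="${1:-$(date -u +%Y%m%dT%H%M%SZ)}"
  local dir="${TMPDIR:-/tmp}/groq-incident-$ts"
  mkdir -p "$dir"
  echo "$dir"
}

# Collect pod logs and cluster events; skip gracefully without kubectl.
collect_into() {
  local dir="$1"
  if command -v kubectl >/dev/null; then
    kubectl logs -l app=groq-integration --since=1h > "$dir/pod-logs.txt" || true
    kubectl get events --sort-by=.lastTimestamp > "$dir/events.txt" || true
  fi
}
```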
Postmortem Template
```
## Incident: Groq [Error Type]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]

### Summary
[1-2 sentence description]

### Timeline
- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause
[Technical explanation]

### Impact
- Users affected: N
- Revenue impact: $X

### Action Items
- [ ] [Preventive measure] - Owner - Due date
```
Instructions
Step 1: Quick Triage
Run the triage commands to identify the issue source.
Step 2: Follow Decision Tree
Determine if the issue is Groq-side or internal.
Step 3: Execute Immediate Actions
Apply the appropriate remediation for the error type.
Step 4: Communicate Status
Update internal and external stakeholders.
Output
- Issue identified and categorized
- Remediation applied
- Stakeholders notified
- Evidence collected for postmortem
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |
Examples
One-Line Health Check
```shell
# jq -e sets a non-zero exit code when no valid result is produced,
# so a failed curl (empty input) also trips the UNHEALTHY branch
curl -sf https://api.yourapp.com/health | jq -e '.services.groq.status' || echo "UNHEALTHY"
```
Resources
Next Steps
For data handling, see groq-data-handling.