Claude-skill-registry-data mistral-incident-runbook
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry-data
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/mistral-incident-runbook" ~/.claude/skills/majiayu000-claude-skill-registry-data-mistral-incident-runbook && rm -rf "$T"
manifest: data/mistral-incident-runbook/SKILL.md
Mistral AI Incident Runbook
Overview
Rapid incident response procedures for Mistral AI-related outages.
Prerequisites
- Access to Mistral AI console
- kubectl access to production cluster (if applicable)
- Prometheus/Grafana access
- Communication channels (Slack, PagerDuty)
Severity Levels
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Complete outage | < 15 min | Mistral API unreachable, all requests failing |
| P2 | Degraded service | < 1 hour | High latency, partial failures, rate limiting |
| P3 | Minor impact | < 4 hours | Occasional errors, non-critical feature down |
| P4 | No user impact | Next business day | Monitoring gaps, documentation issues |
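The response targets in the table can be encoded so that paging or reminder scripts look them up instead of hard-coding them. The following is a minimal sketch; the function name `response_target_minutes` is hypothetical, and the values are copied from the table above.

```shell
# Hypothetical helper: map a severity level from the table above to its
# response target in minutes. P4 has no minute target in the table
# ("Next business day"), so a label is printed instead.
response_target_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 60 ;;
    P3) echo 240 ;;
    P4) echo "next-business-day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Example: print the target for a P2 incident
response_target_minutes P2
```

An unknown level returns a non-zero status, so a calling script can refuse to page rather than silently pick a default.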
Quick Triage
#!/bin/bash
# mistral-triage.sh

echo "=== Mistral AI Quick Triage ==="
echo "Timestamp: $(date)"
echo ""

# 1. Check Mistral API health
echo "1. Checking Mistral API..."
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  https://api.mistral.ai/v1/models)
echo "   API Status: $HTTP_STATUS"

# 2. Check our health endpoint
# jq -e exits non-zero on empty or null output, so the fallback message fires
echo ""
echo "2. Checking our service health..."
curl -s https://api.yourapp.com/health | jq -e '.services.mistral' 2>/dev/null || echo "   Health check failed"

# 3. Check recent error rate
# -G with --data-urlencode keeps curl from treating [5m] as a URL glob
echo ""
echo "3. Recent errors (if Prometheus available)..."
curl -s -G --data-urlencode 'query=rate(mistral_errors_total[5m])' \
  localhost:9090/api/v1/query | jq -e '.data.result' 2>/dev/null || echo "   Prometheus not available"

# 4. Check recent logs
# Test for kubectl explicitly: the exit status of a pipeline is that of tail
echo ""
echo "4. Recent error logs..."
if command -v kubectl >/dev/null 2>&1; then
  kubectl logs -l app=mistral-service --since=5m 2>/dev/null | grep -i error | tail -10
else
  echo "   kubectl not available"
fi
Decision Tree
Mistral API returning errors?
├─ YES: Check api.mistral.ai/v1/models with curl
│   ├─ 401 → API key issue (see Auth section)
│   ├─ 429 → Rate limited (see Rate Limit section)
│   ├─ 5xx → Mistral service issue (wait & monitor)
│   └─ Timeout → Network issue (check connectivity)
└─ NO: Our service returning errors?
    ├─ YES → Check our logs & config
    └─ NO → Likely resolved, continue monitoring
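The first branch of the decision tree can be sketched as a lookup on the HTTP status that the triage script captures. The function name `next_action` and the wording of its output are hypothetical; the only curl behavior relied on is that `%{http_code}` reports `000` when no HTTP response was received.

```shell
# Hypothetical helper: translate the HTTP status from the /v1/models probe
# into the next step named in the decision tree. curl reports 000 when no
# HTTP response arrived (timeout, DNS failure, connection refused).
next_action() {
  case "$1" in
    401) echo "auth: check API key" ;;
    429) echo "rate-limit: reduce load" ;;
    5??) echo "upstream: wait and monitor" ;;
    000) echo "network: check connectivity" ;;
    2??) echo "upstream-ok: check our logs and config" ;;
    *)   echo "unclassified: investigate manually" ;;
  esac
}

# Usage with the variable set by the triage script:
#   next_action "$HTTP_STATUS"
```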
Immediate Actions by Error Type
401 Unauthorized - Authentication Failed
# 1. Verify API key is set
echo "API Key length: ${#MISTRAL_API_KEY}"
echo "API Key prefix: ${MISTRAL_API_KEY:0:10}..."

# 2. Test API key directly
curl -v -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  https://api.mistral.ai/v1/models

# 3. Check if key was rotated
# → Verify in Mistral console: console.mistral.ai

# 4. Update key if needed
kubectl create secret generic mistral-secrets \
  --from-literal=api-key="$NEW_API_KEY" \
  --dry-run=client -o yaml | kubectl apply -f -

# 5. Restart pods
kubectl rollout restart deployment/mistral-service
429 Rate Limited
# 1. Check current rate limit status
curl -v -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  https://api.mistral.ai/v1/models 2>&1 | grep -i "rate\|retry"

# 2. Enable request queuing (if supported)
kubectl set env deployment/mistral-service RATE_LIMIT_MODE=queue

# 3. Reduce request concurrency
kubectl set env deployment/mistral-service MAX_CONCURRENT_REQUESTS=5

# 4. Long-term: Contact Mistral for limit increase
# → console.mistral.ai or support@mistral.ai
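While concurrency is reduced, clients that retry after a 429 usually space their attempts out. The following is a minimal sketch of an exponential backoff schedule; the function name `backoff_seconds`, the 1-second base, and the 60-second cap are illustrative assumptions, not values documented by Mistral.

```shell
# Hypothetical helper: exponential backoff delay in seconds for retry
# attempt N (1-based). Base delay of 1s and cap of 60s are assumptions.
backoff_seconds() {
  attempt=$1
  delay=1
  i=1
  while [ "$i" -lt "$attempt" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge 60 ]; then
      delay=60
      break
    fi
    i=$((i + 1))
  done
  echo "$delay"
}

# Print the delays for the first five attempts, one per line: 1, 2, 4, 8, 16
for n in 1 2 3 4 5; do
  backoff_seconds "$n"
done
```

If the 429 response carries a retry hint in its headers (step 1 greps for one), that value would be a better delay than this fixed schedule.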
500/503 Service Error
# 1. Check Mistral status (if available)
echo "Checking Mistral status..."

# 2. Enable graceful degradation
kubectl set env deployment/mistral-service MISTRAL_FALLBACK=true

# 3. Notify users
# → Update status page

# 4. Monitor for recovery
# MISTRAL_API_KEY must be exported: the single quotes defer expansion to the shell that watch starts
watch -n 30 'curl -s -o /dev/null -w "%{http_code}" -H "Authorization: Bearer ${MISTRAL_API_KEY}" https://api.mistral.ai/v1/models'
Timeout/Network Error
# 1. Test connectivity
curl -v --connect-timeout 5 https://api.mistral.ai/v1/models

# 2. Check DNS resolution
nslookup api.mistral.ai

# 3. Increase timeout
kubectl set env deployment/mistral-service MISTRAL_TIMEOUT=60000

# 4. Check egress rules
kubectl get networkpolicy
Communication Templates
Internal (Slack)
:red_circle: P1 INCIDENT: Mistral AI Integration
**Status**: INVESTIGATING
**Impact**: [Users cannot use AI features / Degraded AI responses]
**Current action**: [What you're doing]
**Next update**: [Time - typically every 15-30 min for P1]
**Incident commander**: @[name]
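Filling the internal template from a script keeps updates consistent across responders. The following is a minimal sketch; the function name `render_incident_update` and its argument order are hypothetical, and posting the result to Slack is left out.

```shell
# Hypothetical helper: fill the internal template from positional arguments.
# Arguments: severity, status, impact, current action, next update, commander.
render_incident_update() {
  printf ':red_circle: %s INCIDENT: Mistral AI Integration\n' "$1"
  printf '**Status**: %s\n' "$2"
  printf '**Impact**: %s\n' "$3"
  printf '**Current action**: %s\n' "$4"
  printf '**Next update**: %s\n' "$5"
  printf '**Incident commander**: @%s\n' "$6"
}

# Example with placeholder values
render_incident_update P1 INVESTIGATING \
  "Users cannot use AI features" \
  "Checking API key validity" \
  "15 min" \
  "oncall"
```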
External (Status Page)
AI Feature Degradation

We're experiencing issues with our AI-powered features. Some users may experience slower responses or temporary unavailability.

Our team is actively investigating and working with our AI provider to resolve this.

Affected services:
- [List affected features]

Workaround: [If available]
Last updated: [timestamp]
Post-Incident
Evidence Collection
#!/bin/bash
# collect-evidence.sh

INCIDENT_DIR="incident-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$INCIDENT_DIR"

# Collect logs
kubectl logs -l app=mistral-service --since=1h > "$INCIDENT_DIR/logs.txt"

# Export metrics (date -d is GNU date syntax)
curl "localhost:9090/api/v1/query_range?query=mistral_errors_total&start=$(date -d '2 hours ago' +%s)&end=$(date +%s)&step=60" \
  > "$INCIDENT_DIR/metrics.json"

# Collect config (redacted)
kubectl get deployment mistral-service -o yaml | grep -v "api-key" > "$INCIDENT_DIR/deployment.yaml"

# Create bundle
tar -czf "$INCIDENT_DIR.tar.gz" "$INCIDENT_DIR"
echo "Evidence bundle: $INCIDENT_DIR.tar.gz"
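The `grep -v "api-key"` step in the script above only drops lines containing that literal string, so values stored under other field names would still reach the bundle. The following is a sketch of a somewhat broader filter; the function name `redact_secrets` and the list of field-name patterns are assumptions to adjust for your manifests. It does not handle Kubernetes `name:`/`value:` environment pairs, where the value sits on a separate line.

```shell
# Hypothetical filter: mask the value on any "field: value" line whose field
# name contains key, token, secret, or password (lowercase or capitalized).
# The pattern list is an assumption and will miss secrets under other names.
redact_secrets() {
  sed -E 's/^([[:space:]]*[^:]*([Kk]ey|[Tt]oken|[Ss]ecret|[Pp]assword)[^:]*:[[:space:]]*).+$/\1[REDACTED]/'
}

# Example on inline YAML: the name line passes through, the api-key value is masked
printf 'name: mistral-service\napi-key: sk-12345\n' | redact_secrets
```

Masking instead of deleting keeps the field visible in the evidence bundle, which helps when the postmortem needs to confirm that a setting was present.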
Postmortem Template
## Incident: Mistral AI [Error Type]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]
**Incident Commander:** [Name]

### Summary
[1-2 sentence description of what happened]

### Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert triggered |
| HH:MM | Incident declared |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Service restored |

### Root Cause
[Technical explanation of what went wrong]

### Impact
- Users affected: [Number or percentage]
- Duration of impact: [Time]
- Requests failed: [Number]
- Revenue impact: [If applicable]

### Detection
How was the incident detected?
- [ ] Automated alerting
- [ ] Customer report
- [ ] Internal testing
- [ ] Other: ___

### Resolution
[What was done to fix the issue]

### Action Items
| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| P1 | [Immediate fix] | @name | YYYY-MM-DD | [ ] |
| P2 | [Preventive measure] | @name | YYYY-MM-DD | [ ] |
| P3 | [Improvement] | @name | YYYY-MM-DD | [ ] |

### Lessons Learned
- What went well:
- What could be improved:
- What we got lucky with:
Output
- Issue identified and categorized
- Mitigation applied
- Stakeholders notified
- Evidence collected for postmortem
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| kubectl fails | Auth expired | Re-authenticate with cloud provider |
| Metrics unavailable | Prometheus down | Check backup metrics or logs |
| Secret rotation fails | Permission denied | Escalate to admin |
| Fallback not working | Not implemented | Use cached responses or error page |
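Several rows in this table come down to a tool or credential not being ready when the incident starts. The following is a minimal preflight sketch that reports missing command-line tools before triage begins; the function name `check_tools` is hypothetical, and it only checks that the commands are on `PATH`, not that credentials are valid.

```shell
# Hypothetical preflight: report which of the given tools are missing from
# PATH. Returns non-zero if any are missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example: check the tools the triage script depends on
check_tools curl jq kubectl || echo "Install the missing tools before triage"
```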
Examples
One-Line Health Check
curl -sf -H "Authorization: Bearer ${MISTRAL_API_KEY}" https://api.mistral.ai/v1/models | jq -e '.data[0].id' || echo "UNHEALTHY"
Quick Rollback
kubectl rollout undo deployment/mistral-service && \
  kubectl rollout status deployment/mistral-service
Next Steps
For data handling, see mistral-data-handling.