Claude-skill-registry-data mistral-incident-runbook

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry-data
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/mistral-incident-runbook" ~/.claude/skills/majiayu000-claude-skill-registry-data-mistral-incident-runbook && rm -rf "$T"
manifest: data/mistral-incident-runbook/SKILL.md
source content

Mistral AI Incident Runbook

Overview

Rapid incident response procedures for Mistral AI-related outages.

Prerequisites

  • Access to Mistral AI console
  • kubectl access to production cluster (if applicable)
  • Prometheus/Grafana access
  • Communication channels (Slack, PagerDuty)

Severity Levels

LevelDefinitionResponse TimeExamples
P1Complete outage< 15 minMistral API unreachable, all requests failing
P2Degraded service< 1 hourHigh latency, partial failures, rate limiting
P3Minor impact< 4 hoursOccasional errors, non-critical feature down
P4No user impactNext business dayMonitoring gaps, documentation issues

Quick Triage

#!/bin/bash
# mistral-triage.sh

echo "=== Mistral AI Quick Triage ==="
echo "Timestamp: $(date)"
echo ""

# 1. Check Mistral API health
echo "1. Checking Mistral API..."
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  https://api.mistral.ai/v1/models)
echo "   API Status: $HTTP_STATUS"

# 2. Check our health endpoint
echo ""
echo "2. Checking our service health..."
curl -s https://api.yourapp.com/health | jq '.services.mistral' 2>/dev/null || echo "   Health check failed"

# 3. Check recent error rate
echo ""
echo "3. Recent errors (if Prometheus available)..."
curl -s "localhost:9090/api/v1/query?query=rate(mistral_errors_total[5m])" | jq '.data.result' 2>/dev/null || echo "   Prometheus not available"

# 4. Check recent logs
echo ""
echo "4. Recent error logs..."
kubectl logs -l app=mistral-service --since=5m 2>/dev/null | grep -i error | tail -10 || echo "   kubectl not available"

Decision Tree

Mistral API returning errors?
├─ YES: Check api.mistral.ai/v1/models with curl
│   ├─ 401 → API key issue (see Auth section)
│   ├─ 429 → Rate limited (see Rate Limit section)
│   ├─ 5xx → Mistral service issue (wait & monitor)
│   └─ Timeout → Network issue (check connectivity)
└─ NO: Our service returning errors?
    ├─ YES → Check our logs & config
    └─ NO → Likely resolved, continue monitoring

Immediate Actions by Error Type

401 Unauthorized - Authentication Failed

# 1. Verify API key is set
echo "API Key length: ${#MISTRAL_API_KEY}"
echo "API Key prefix: ${MISTRAL_API_KEY:0:10}..."

# 2. Test API key directly
curl -v -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  https://api.mistral.ai/v1/models

# 3. Check if key was rotated
# → Verify in Mistral console: console.mistral.ai

# 4. Update key if needed
kubectl create secret generic mistral-secrets \
  --from-literal=api-key="$NEW_API_KEY" \
  --dry-run=client -o yaml | kubectl apply -f -

# 5. Restart pods
kubectl rollout restart deployment/mistral-service

429 Rate Limited

# 1. Check current rate limit status
curl -v -H "Authorization: Bearer ${MISTRAL_API_KEY}" \
  https://api.mistral.ai/v1/models 2>&1 | grep -i "rate\|retry"

# 2. Enable request queuing (if supported)
kubectl set env deployment/mistral-service RATE_LIMIT_MODE=queue

# 3. Reduce request concurrency
kubectl set env deployment/mistral-service MAX_CONCURRENT_REQUESTS=5

# 4. Long-term: Contact Mistral for limit increase
# → console.mistral.ai or support@mistral.ai

500/503 Service Error

# 1. Check Mistral status (if available)
echo "Checking Mistral status..."

# 2. Enable graceful degradation
kubectl set env deployment/mistral-service MISTRAL_FALLBACK=true

# 3. Notify users
# → Update status page

# 4. Monitor for recovery
watch -n 30 'curl -s -o /dev/null -w "%{http_code}" -H "Authorization: Bearer ${MISTRAL_API_KEY}" https://api.mistral.ai/v1/models'

Timeout/Network Error

# 1. Test connectivity
curl -v --connect-timeout 5 https://api.mistral.ai/v1/models

# 2. Check DNS resolution
nslookup api.mistral.ai

# 3. Increase timeout
kubectl set env deployment/mistral-service MISTRAL_TIMEOUT=60000

# 4. Check egress rules
kubectl get networkpolicy

Communication Templates

Internal (Slack)

:red_circle: P1 INCIDENT: Mistral AI Integration
**Status**: INVESTIGATING
**Impact**: [Users cannot use AI features / Degraded AI responses]
**Current action**: [What you're doing]
**Next update**: [Time - typically every 15-30 min for P1]
**Incident commander**: @[name]

External (Status Page)

AI Feature Degradation

We're experiencing issues with our AI-powered features.
Some users may experience slower responses or temporary unavailability.

Our team is actively investigating and working with our AI provider to resolve this.

Affected services:
- [List affected features]

Workaround: [If available]

Last updated: [timestamp]

Post-Incident

Evidence Collection

#!/bin/bash
# collect-evidence.sh

INCIDENT_DIR="incident-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$INCIDENT_DIR"

# Collect logs
kubectl logs -l app=mistral-service --since=1h > "$INCIDENT_DIR/logs.txt"

# Export metrics
curl "localhost:9090/api/v1/query_range?query=mistral_errors_total&start=$(date -d '2 hours ago' +%s)&end=$(date +%s)&step=60" \
  > "$INCIDENT_DIR/metrics.json"

# Collect config (redacted)
kubectl get deployment mistral-service -o yaml | grep -v "api-key" > "$INCIDENT_DIR/deployment.yaml"

# Create bundle
tar -czf "$INCIDENT_DIR.tar.gz" "$INCIDENT_DIR"
echo "Evidence bundle: $INCIDENT_DIR.tar.gz"

Postmortem Template

## Incident: Mistral AI [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]
**Incident Commander:** [Name]

### Summary
[1-2 sentence description of what happened]

### Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert triggered |
| HH:MM | Incident declared |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Service restored |

### Root Cause
[Technical explanation of what went wrong]

### Impact
- Users affected: [Number or percentage]
- Duration of impact: [Time]
- Requests failed: [Number]
- Revenue impact: [If applicable]

### Detection
How was the incident detected?
- [ ] Automated alerting
- [ ] Customer report
- [ ] Internal testing
- [ ] Other: ___

### Resolution
[What was done to fix the issue]

### Action Items
| Priority | Action | Owner | Due Date | Status |
|----------|--------|-------|----------|--------|
| P1 | [Immediate fix] | @name | YYYY-MM-DD | [ ] |
| P2 | [Preventive measure] | @name | YYYY-MM-DD | [ ] |
| P3 | [Improvement] | @name | YYYY-MM-DD | [ ] |

### Lessons Learned
- What went well:
- What could be improved:
- What we got lucky with:

Output

  • Issue identified and categorized
  • Mitigation applied
  • Stakeholders notified
  • Evidence collected for postmortem

Error Handling

IssueCauseSolution
kubectl failsAuth expiredRe-authenticate with cloud provider
Metrics unavailablePrometheus downCheck backup metrics or logs
Secret rotation failsPermission deniedEscalate to admin
Fallback not workingNot implementedUse cached responses or error page

Examples

One-Line Health Check

curl -sf -H "Authorization: Bearer ${MISTRAL_API_KEY}" https://api.mistral.ai/v1/models | jq '.data[0].id' || echo "UNHEALTHY"

Quick Rollback

kubectl rollout undo deployment/mistral-service && \
kubectl rollout status deployment/mistral-service

Resources

Next Steps

For data handling, see

mistral-data-handling
.