Claude-skill-registry investigate-incident

Investigate platform incidents, perform RCA, create incident documentation, and follow alert runbooks in the Kagenti platform

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/investigate-incident" ~/.claude/skills/majiayu000-claude-skill-registry-investigate-incident && rm -rf "$T"
manifest: skills/data/investigate-incident/SKILL.md
source content

Investigate Incident Skill

This skill helps you investigate incidents, perform root cause analysis (RCA), and create comprehensive incident documentation.

When to Use

  • After alerts fire (check runbooks first)
  • When tests fail unexpectedly
  • A user reports a service is unavailable
  • Pods in CrashLoopBackOff or ImagePullBackOff
  • Services showing degraded performance
  • After deployment issues

What This Skill Does

  1. Follow Runbooks: Execute alert-specific investigation steps
  2. Gather Evidence: Collect logs, metrics, events, pod status
  3. Root Cause Analysis: Identify underlying issues
  4. Document Incidents: Create structured RCA in TODO_INCIDENTS.md
  5. Track Fixes: Plan and validate remediation

Investigation Workflow

1. Check If Alert Has Runbook

# List all available runbooks
ls -1 docs/runbooks/alerts/*.md

# For specific alert, check annotation
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
alert_uid = 'prometheus-down'  # Change this
rule = next((r for r in rules if r.get('uid') == alert_uid), None)
if rule:
    print('Runbook URL:', rule['annotations'].get('runbook_url', 'N/A'))
"

2. Follow Runbook Steps

Runbooks are located at:

docs/runbooks/alerts/<alert-uid>.md

Standard runbook sections:

  1. Meaning - What this alert indicates
  2. Impact - Business/platform impact
  3. Diagnosis - Investigation commands
  4. Mitigation - How to fix
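If an alert has no runbook yet, a skeleton that follows these sections can be created up front. A minimal sketch (the file name below is a placeholder for the real alert UID):

# Create a skeleton runbook for a new alert
cat > docs/runbooks/alerts/<alert-uid>.md <<'EOF'
# <Alert Name>

## Meaning
What this alert indicates.

## Impact
Business/platform impact.

## Diagnosis
Investigation commands for this alert.

## Mitigation
How to fix or mitigate.
EOF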

Example: Follow Prometheus Down runbook:

# From docs/runbooks/alerts/prometheus-down.md

# 1. Check pod status
kubectl get pods -n observability -l app=prometheus

# 2. Check pod logs
kubectl logs -n observability deployment/prometheus --tail=100

# 3. Check events
kubectl get events -n observability --field-selector involvedObject.name=prometheus --sort-by='.lastTimestamp'

# 4. Test Prometheus endpoint
kubectl exec -n observability deployment/grafana -- \
  curl -s http://prometheus.observability.svc:9090/-/ready

3. Gather Comprehensive Evidence

Pod Status & Events:

# Get pod status in namespace
kubectl get pods -n <namespace>

# Detailed pod description
kubectl describe pod <pod-name> -n <namespace>

# Recent events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# All failing pods across platform
kubectl get pods -A | grep -E "Error|CrashLoop|ImagePull|Pending"

Logs:

# Current container logs
kubectl logs -n <namespace> <pod-name> --tail=100

# Previous container (if crashed)
kubectl logs -n <namespace> <pod-name> --previous

# Specific container in pod
kubectl logs -n <namespace> <pod-name> -c <container-name>

# All containers in pod
kubectl logs -n <namespace> <pod-name> --all-containers=true

# Query Loki for errors (last 5 minutes)
# Note: 'date -v-5M' is BSD/macOS syntax; on GNU/Linux use: date -u -d '5 minutes ago' +%s
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://loki.observability.svc:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={kubernetes_namespace_name="<namespace>"} |= "error"' \
  --data-urlencode 'limit=100' \
  --data-urlencode "start=$(date -u -v-5M +%s)000000000" \
  --data-urlencode "end=$(date -u +%s)000000000" | python3 -m json.tool

Metrics:

# Check if service is up
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=up{job="<job-name>"}' | python3 -m json.tool

# Check replica availability
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_deployment_status_replicas_available{deployment="<name>"}' \
  | python3 -m json.tool

# Check pod restarts
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_pod_container_status_restarts_total{pod=~"<pod-pattern>"}' \
  | python3 -m json.tool
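The kubectl exec plus curl pattern above repeats for every query. For ad-hoc work it can be wrapped in a small shell helper; promq below is a hypothetical name, not part of the platform tooling:

# Hypothetical helper: run an instant PromQL query through the Grafana pod
promq() {
  kubectl exec -n observability deployment/grafana -- \
    curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
    --data-urlencode "query=$1" | python3 -m json.tool
}

# Usage (job label is an example)
promq 'up{job="prometheus"}'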

ArgoCD Application Status:

# Check application health
argocd app get <app-name> --port-forward --port-forward-namespace argocd --grpc-web

# Check sync status
argocd app list --port-forward --port-forward-namespace argocd --grpc-web | grep -E "Degraded|OutOfSync"

# View recent sync history
argocd app history <app-name> --port-forward --port-forward-namespace argocd --grpc-web

4. Identify Root Cause

Common Root Causes (a combined triage sketch follows this list):

  1. Configuration Error

    • Check recent Git commits:
      git log --oneline -10
    • Check ArgoCD diff:
      argocd app diff <app-name> --port-forward ...
    • Validate YAML:
      kustomize build components/...
  2. Image Issues

    • Check image availability:
      docker exec kagenti-demo-control-plane crictl images | grep <image>
    • Check ImagePullBackOff:
      kubectl describe pod <pod-name> | grep -A10 "Events"
  3. Resource Constraints

    • Check pod resources:
      kubectl top pods -n <namespace>
    • Check node resources:
      kubectl top nodes
    • Check OOM kills:
      kubectl get events -A | grep OOM
  4. Dependency Failure

    • Check if dependent service is healthy
    • Check service endpoints:
      kubectl get endpoints -n <namespace>
    • Test connectivity:
      kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it -- curl http://service-name
  5. mTLS/Network Issues

    • Check Istio sidecar:
      kubectl get pods -n <namespace>
      (should show 2/2)
    • Check PeerAuthentication:
      kubectl get peerauthentication -A
    • Check sidecar logs:
      kubectl logs -n <namespace> <pod-name> -c istio-proxy
  6. Certificate Issues

    • Check certificate status:
      kubectl get certificate -A
    • Check cert-manager logs:
      kubectl logs -n cert-manager deployment/cert-manager
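The quick checks above can be combined into a single first pass. A rough, read-only triage sketch (the namespace is a parameter):

# First-pass triage for a namespace (read-only)
NS=<namespace>
kubectl get pods -n "$NS"
kubectl get events -n "$NS" --sort-by='.lastTimestamp' | tail -20
kubectl get endpoints -n "$NS"
kubectl top pods -n "$NS" 2>/dev/null || echo "metrics-server not available"
kubectl get certificate -n "$NS" 2>/dev/null || true
git log --oneline -5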

5. Create Incident Documentation

Create entry in TODO_INCIDENTS.md:

## Incident #X: [Alert Name] - [Brief Description]

**Status**: 🔴 Active / 🟡 Investigating / 🟢 Resolved

**Detected**: 2025-11-17 08:31:40 UTC

**Severity**: Critical / Warning / Info

**Components Affected**:
- Component 1
- Component 2

### Summary

Brief description of what happened.

### Investigation

**Timeline**:
- 08:31 - Alert fired
- 08:35 - Checked pod status, found CrashLoopBackOff
- 08:40 - Reviewed logs, identified error message
- 08:45 - Identified root cause

**Evidence Collected**:

1. **Pod Status**:

NAME            READY   STATUS             RESTARTS
component-xxx   0/2     CrashLoopBackOff   5


2. **Error Logs**:

ERROR: Failed to connect to database: connection refused


3. **Events**:

Back-off restarting failed container


### Root Cause Analysis

**Root Cause**: [Specific technical reason]

**Why it happened**:
- Contributing factor 1
- Contributing factor 2

**Why alert fired**:
- PromQL query: `query_here`
- Query returned: `value`
- Threshold: `> threshold`

### Resolution

**Fix Applied**:
# Commands to fix the issue
kubectl apply -f ...

**Verification**:

# Commands to verify fix
kubectl get pods -n <namespace>
# Output showing healthy state

**Time to Resolution**: XX minutes

### Lessons Learned

1. **What went well**: Early detection via alerting
2. **What could improve**: Need better validation before deploy
3. **Action items**:
   - Add pre-deployment validation
   - Update runbook with this scenario
   - Add integration test for this case

### Related

- **Alert**: alert-uid
- **Runbook**: docs/runbooks/alerts/alert-uid.md
- **Git commits**: abc1234, def5678
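If you prefer to scaffold the entry from the shell, a minimal sketch (assuming TODO_INCIDENTS.md lives at the repository root; fill in the details by hand):

# Append a skeleton incident entry to TODO_INCIDENTS.md
cat >> TODO_INCIDENTS.md <<EOF

## Incident #X: [Alert Name] - [Brief Description]

**Status**: 🔴 Active

**Detected**: $(date -u '+%Y-%m-%d %H:%M:%S UTC')

**Severity**: TBD

### Summary

### Investigation

### Root Cause Analysis

### Resolution
EOF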

6. Validate Fix

After applying fix:

# 1. Verify pods are healthy
kubectl get pods -n <namespace>

# 2. Check alert stopped firing
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/alertmanager/grafana/api/v2/alerts' \
  -u admin:admin123 | python3 -c "
import sys, json
alerts = json.load(sys.stdin)
firing = [a for a in alerts if a.get('status', {}).get('state') == 'active']
for a in firing:
    print(f\"{a['labels']['alertname']}: {a['labels']['severity']}\")
"

# 3. Run integration tests
pytest tests/integration/test_<component>.py -v

# 4. Check platform status
./scripts/platform-status.sh

# 5. Verify in Grafana UI ('open' is macOS; on Linux use xdg-open or a browser)
open https://grafana.localtest.me:9443/alerting/list
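Before re-checking alerts, it can help to wait for the workload to actually report Ready instead of polling by hand. One way, with example label and timeout values:

# Block until pods with the given label are Ready, or fail after 2 minutes
kubectl wait --for=condition=Ready pod -l app=<service> -n <namespace> --timeout=120s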

Common Investigation Patterns

Pattern 1: Pod Won't Start

# Investigation flow
kubectl get pods -n <namespace>  # Get pod status
kubectl describe pod <pod-name> -n <namespace>  # Check events
kubectl logs <pod-name> -n <namespace>  # Check logs (if available)

# Common causes:
# - ImagePullBackOff: Image not found or not loaded
# - CrashLoopBackOff: Application exits on startup
# - Pending: Resource constraints or scheduling issues
# - Init:Error: Init container failed
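To tell these states apart quickly, the waiting reason can be read straight from the pod status (the output is empty for containers that are already running):

# Print each container's waiting reason (e.g. ImagePullBackOff, CrashLoopBackOff)
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.waiting.reason}{"\n"}{end}'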

Pattern 2: Service Unavailable

# Investigation flow
kubectl get pods -n <namespace> -l app=<service>  # Check pods
kubectl get svc -n <namespace> <service-name>  # Check service
kubectl get endpoints -n <namespace> <service-name>  # Check endpoints
kubectl get httproute -n <namespace>  # Check routes (if using Gateway API)

# Test connectivity
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it \
  -- curl -v http://<service-name>.<namespace>.svc:PORT

Pattern 3: High Resource Usage

# Investigation flow
kubectl top pods -n <namespace> --sort-by=memory  # Memory usage
kubectl top pods -n <namespace> --sort-by=cpu  # CPU usage

# Check limits
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'

# Check for OOM kills
kubectl get events -n <namespace> | grep OOM

Pattern 4: Frequent Restarts

# Investigation flow
kubectl get pods -n <namespace>  # Check RESTARTS column
kubectl describe pod <pod-name> -n <namespace>  # Check restart reason
kubectl logs <pod-name> -n <namespace> --previous  # Logs before crash

# Check restart metrics
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=increase(kube_pod_container_status_restarts_total{pod="<pod-name>"}[1h])'
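The last exit code usually explains the restarts: 137 generally means the container was OOM-killed, while a non-zero application code points at a crash on startup. A one-liner for the first container (index 0 assumes a single app container):

# Last termination exit code and reason of the first container
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'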

Pattern 5: ArgoCD App OutOfSync

# Investigation flow
argocd app get <app-name> --port-forward ...  # Get status
argocd app diff <app-name> --port-forward ...  # See differences

# Check last sync
argocd app history <app-name> --port-forward ...

# Force sync if needed
argocd app sync <app-name> --force --prune --port-forward ...

Integration with Other Skills

Use check-alerts skill to find which alerts are firing:

# This automatically invokes check-alerts skill
"What alerts are currently firing?"

Use check-logs skill to query specific error patterns:

# This automatically invokes check-logs skill
"Show me error logs from keycloak namespace in the last 10 minutes"

Use check-metrics skill to verify service health:

# This automatically invokes check-metrics skill
"What's the CPU usage of pods in observability namespace?"

Incident Priority Guidelines

Critical (P0):

  • Platform-wide outage
  • All users affected
  • Data loss risk
  • Security breach

High (P1):

  • Major feature unavailable
  • Multiple users affected
  • Performance severely degraded

Medium (P2):

  • Minor feature unavailable
  • Some users affected
  • Workaround available

Low (P3):

  • Cosmetic issue
  • Minimal impact
  • Can be fixed in next release


Pro Tips

  1. Always check runbook first: Save time by following proven steps
  2. Capture evidence early: Logs/events may be lost if pod restarts (see the sketch after this list)
  3. Use --previous logs: For crashed pods, previous container logs are critical
  4. Check Git history: Recent commits often reveal configuration changes
  5. Document as you go: Don't wait until end to write RCA
  6. Verify fix completely: Check pods, alerts, tests, and platform status
  7. Update runbook: Add new scenarios discovered during investigation

🤖 Generated with Claude Code