Claude-skill-registry investigate-incident

Investigate platform incidents, perform RCA, create incident documentation, and follow alert runbooks in the Kagenti platform

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/investigate-incident" ~/.claude/skills/majiayu000-claude-skill-registry-investigate-incident && rm -rf "$T"
manifest: skills/data/investigate-incident/SKILL.md
source content

Investigate Incident Skill

This skill helps you investigate incidents, perform root cause analysis (RCA), and create comprehensive incident documentation.

When to Use

  • After alerts fire (check runbooks first)
  • When tests fail unexpectedly
  • A user reports a service is unavailable
  • Pods in CrashLoopBackOff or ImagePullBackOff
  • Services showing degraded performance
  • After deployment issues

What This Skill Does

  1. Follow Runbooks: Execute alert-specific investigation steps
  2. Gather Evidence: Collect logs, metrics, events, pod status
  3. Root Cause Analysis: Identify underlying issues
  4. Document Incidents: Create structured RCA in TODO_INCIDENTS.md
  5. Track Fixes: Plan and validate remediation

Investigation Workflow

1. Check If Alert Has Runbook

# List all available runbooks
ls -1 docs/runbooks/alerts/*.md

# For specific alert, check annotation
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
alert_uid = 'prometheus-down'  # Change this
rule = next((r for r in rules if r.get('uid') == alert_uid), None)
if rule:
    print('Runbook URL:', rule['annotations'].get('runbook_url', 'N/A'))
"

2. Follow Runbook Steps

Runbooks are located at:

docs/runbooks/alerts/<alert-uid>.md

Standard runbook sections:

  1. Meaning - What this alert indicates
  2. Impact - Business/platform impact
  3. Diagnosis - Investigation commands
  4. Mitigation - How to fix
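If an alert has no runbook yet, a skeleton that follows these sections can be created up front. A minimal sketch (the file name below is a placeholder for the real alert UID):

# Create a skeleton runbook for a new alert
cat > docs/runbooks/alerts/<alert-uid>.md <<'EOF'
# <Alert Name>

## Meaning
What this alert indicates.

## Impact
Business/platform impact.

## Diagnosis
Investigation commands for this alert.

## Mitigation
How to fix or mitigate.
EOF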

Example: Follow Prometheus Down runbook:

# From docs/runbooks/alerts/prometheus-down.md

# 1. Check pod status
kubectl get pods -n observability -l app=prometheus

# 2. Check pod logs
kubectl logs -n observability deployment/prometheus --tail=100

# 3. Check events
kubectl get events -n observability --field-selector involvedObject.name=prometheus --sort-by='.lastTimestamp'

# 4. Test Prometheus endpoint
kubectl exec -n observability deployment/grafana -- \
  curl -s http://prometheus.observability.svc:9090/-/ready

3. Gather Comprehensive Evidence

Pod Status & Events:

# Get pod status in namespace
kubectl get pods -n <namespace>

# Detailed pod description
kubectl describe pod <pod-name> -n <namespace>

# Recent events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# All failing pods across platform
kubectl get pods -A | grep -E "Error|CrashLoop|ImagePull|Pending"

Logs:

# Current container logs
kubectl logs -n <namespace> <pod-name> --tail=100

# Previous container (if crashed)
kubectl logs -n <namespace> <pod-name> --previous

# Specific container in pod
kubectl logs -n <namespace> <pod-name> -c <container-name>

# All containers in pod
kubectl logs -n <namespace> <pod-name> --all-containers=true

# Query Loki for errors (last 5 minutes)
# Note: 'date -v-5M' is BSD/macOS syntax; on GNU/Linux use: date -u -d '5 minutes ago' +%s
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://loki.observability.svc:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={kubernetes_namespace_name="<namespace>"} |= "error"' \
  --data-urlencode 'limit=100' \
  --data-urlencode "start=$(date -u -v-5M +%s)000000000" \
  --data-urlencode "end=$(date -u +%s)000000000" | python3 -m json.tool

Metrics:

# Check if service is up
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=up{job="<job-name>"}' | python3 -m json.tool

# Check replica availability
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_deployment_status_replicas_available{deployment="<name>"}' \
  | python3 -m json.tool

# Check pod restarts
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_pod_container_status_restarts_total{pod=~"<pod-pattern>"}' \
  | python3 -m json.tool
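The kubectl exec plus curl pattern above repeats for every query. For ad-hoc work it can be wrapped in a small shell helper; promq below is a hypothetical name, not part of the platform tooling:

# Hypothetical helper: run an instant PromQL query through the Grafana pod
promq() {
  kubectl exec -n observability deployment/grafana -- \
    curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
    --data-urlencode "query=$1" | python3 -m json.tool
}

# Usage (job label is an example)
promq 'up{job="prometheus"}'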

ArgoCD Application Status:

# Check application health
argocd app get <app-name> --port-forward --port-forward-namespace argocd --grpc-web

# Check sync status
argocd app list --port-forward --port-forward-namespace argocd --grpc-web | grep -E "Degraded|OutOfSync"

# View recent sync history
argocd app history <app-name> --port-forward --port-forward-namespace argocd --grpc-web

4. Identify Root Cause

Common Root Causes (a combined triage sketch follows this list):

  1. Configuration Error

    • Check recent Git commits:
      git log --oneline -10
    • Check ArgoCD diff:
      argocd app diff <app-name> --port-forward ...
    • Validate YAML:
      kustomize build components/...
  2. Image Issues

    • Check image availability:
      docker exec kagenti-demo-control-plane crictl images | grep <image>
    • Check ImagePullBackOff:
      kubectl describe pod <pod-name> | grep -A10 "Events"
  3. Resource Constraints

    • Check pod resources:
      kubectl top pods -n <namespace>
    • Check node resources:
      kubectl top nodes
    • Check OOM kills:
      kubectl get events -A | grep OOM
  4. Dependency Failure

    • Check if dependent service is healthy
    • Check service endpoints:
      kubectl get endpoints -n <namespace>
    • Test connectivity:
      kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it -- curl http://service-name
  5. mTLS/Network Issues

    • Check Istio sidecar:
      kubectl get pods -n <namespace>
      (should show 2/2)
    • Check PeerAuthentication:
      kubectl get peerauthentication -A
    • Check sidecar logs:
      kubectl logs -n <namespace> <pod-name> -c istio-proxy
  6. Certificate Issues

    • Check certificate status:
      kubectl get certificate -A
    • Check cert-manager logs:
      kubectl logs -n cert-manager deployment/cert-manager
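The quick checks above can be combined into a single first pass. A rough, read-only triage sketch (the namespace is a parameter):

# First-pass triage for a namespace (read-only)
NS=<namespace>
kubectl get pods -n "$NS"
kubectl get events -n "$NS" --sort-by='.lastTimestamp' | tail -20
kubectl get endpoints -n "$NS"
kubectl top pods -n "$NS" 2>/dev/null || echo "metrics-server not available"
kubectl get certificate -n "$NS" 2>/dev/null || true
git log --oneline -5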

5. Create Incident Documentation

Create entry in TODO_INCIDENTS.md:

## Incident #X: [Alert Name] - [Brief Description]

**Status**: 🔴 Active / 🟡 Investigating / 🟢 Resolved

**Detected**: 2025-11-17 08:31:40 UTC

**Severity**: Critical / Warning / Info

**Components Affected**:
- Component 1
- Component 2

### Summary

Brief description of what happened.

### Investigation

**Timeline**:
- 08:31 - Alert fired
- 08:35 - Checked pod status, found CrashLoopBackOff
- 08:40 - Reviewed logs, identified error message
- 08:45 - Identified root cause

**Evidence Collected**:

1. **Pod Status**:

NAME            READY   STATUS             RESTARTS
component-xxx   0/2     CrashLoopBackOff   5


2. **Error Logs**:

ERROR: Failed to connect to database: connection refused


3. **Events**:

Back-off restarting failed container


### Root Cause Analysis

**Root Cause**: [Specific technical reason]

**Why it happened**:
- Contributing factor 1
- Contributing factor 2

**Why alert fired**:
- PromQL query: `query_here`
- Query returned: `value`
- Threshold: `> threshold`

### Resolution

**Fix Applied**:
# Commands to fix the issue
kubectl apply -f ...

**Verification**:

# Commands to verify fix
kubectl get pods -n <namespace>
# Output showing healthy state

**Time to Resolution**: XX minutes

### Lessons Learned

1. **What went well**: Early detection via alerting
2. **What could improve**: Need better validation before deploy
3. **Action items**:
   - Add pre-deployment validation
   - Update runbook with this scenario
   - Add integration test for this case

### Related

- **Alert**: alert-uid
- **Runbook**: docs/runbooks/alerts/alert-uid.md
- **Git commits**: abc1234, def5678
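If you prefer to scaffold the entry from the shell, a minimal sketch (assuming TODO_INCIDENTS.md lives at the repository root; fill in the details by hand):

# Append a skeleton incident entry to TODO_INCIDENTS.md
cat >> TODO_INCIDENTS.md <<EOF

## Incident #X: [Alert Name] - [Brief Description]

**Status**: 🔴 Active

**Detected**: $(date -u '+%Y-%m-%d %H:%M:%S UTC')

**Severity**: TBD

### Summary

### Investigation

### Root Cause Analysis

### Resolution
EOF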

6. Validate Fix

After applying fix:

# 1. Verify pods are healthy
kubectl get pods -n <namespace>

# 2. Check alert stopped firing
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/alertmanager/grafana/api/v2/alerts' \
  -u admin:admin123 | python3 -c "
import sys, json
alerts = json.load(sys.stdin)
firing = [a for a in alerts if a.get('status', {}).get('state') == 'active']
for a in firing:
    print(f\"{a['labels']['alertname']}: {a['labels']['severity']}\")
"

# 3. Run integration tests
pytest tests/integration/test_<component>.py -v

# 4. Check platform status
./scripts/platform-status.sh

# 5. Verify in Grafana UI ('open' is macOS; on Linux use xdg-open or a browser)
open https://grafana.localtest.me:9443/alerting/list
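Before re-checking alerts, it can help to wait for the workload to actually report Ready instead of polling by hand. One way, with example label and timeout values:

# Block until pods with the given label are Ready, or fail after 2 minutes
kubectl wait --for=condition=Ready pod -l app=<service> -n <namespace> --timeout=120s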

Common Investigation Patterns

Pattern 1: Pod Won't Start

# Investigation flow
kubectl get pods -n <namespace>  # Get pod status
kubectl describe pod <pod-name> -n <namespace>  # Check events
kubectl logs <pod-name> -n <namespace>  # Check logs (if available)

# Common causes:
# - ImagePullBackOff: Image not found or not loaded
# - CrashLoopBackOff: Application exits on startup
# - Pending: Resource constraints or scheduling issues
# - Init:Error: Init container failed
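To tell these states apart quickly, the waiting reason can be read straight from the pod status (the output is empty for containers that are already running):

# Print each container's waiting reason (e.g. ImagePullBackOff, CrashLoopBackOff)
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.waiting.reason}{"\n"}{end}'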

Pattern 2: Service Unavailable

# Investigation flow
kubectl get pods -n <namespace> -l app=<service>  # Check pods
kubectl get svc -n <namespace> <service-name>  # Check service
kubectl get endpoints -n <namespace> <service-name>  # Check endpoints
kubectl get httproute -n <namespace>  # Check routes (if using Gateway API)

# Test connectivity
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it \
  -- curl -v http://<service-name>.<namespace>.svc:PORT

Pattern 3: High Resource Usage

# Investigation flow
kubectl top pods -n <namespace> --sort-by=memory  # Memory usage
kubectl top pods -n <namespace> --sort-by=cpu  # CPU usage

# Check limits
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'

# Check for OOM kills
kubectl get events -n <namespace> | grep OOM

Pattern 4: Frequent Restarts

# Investigation flow
kubectl get pods -n <namespace>  # Check RESTARTS column
kubectl describe pod <pod-name> -n <namespace>  # Check restart reason
kubectl logs <pod-name> -n <namespace> --previous  # Logs before crash

# Check restart metrics
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=increase(kube_pod_container_status_restarts_total{pod="<pod-name>"}[1h])'
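The last exit code usually explains the restarts: 137 generally means the container was OOM-killed, while a non-zero application code points at a crash on startup. A one-liner for the first container (index 0 assumes a single app container):

# Last termination exit code and reason of the first container
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'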

Pattern 5: ArgoCD App OutOfSync

# Investigation flow
argocd app get <app-name> --port-forward ...  # Get status
argocd app diff <app-name> --port-forward ...  # See differences

# Check last sync
argocd app history <app-name> --port-forward ...

# Force sync if needed
argocd app sync <app-name> --force --prune --port-forward ...

Integration with Other Skills

Use check-alerts skill to find which alerts are firing:

# This automatically invokes check-alerts skill
"What alerts are currently firing?"

Use check-logs skill to query specific error patterns:

# This automatically invokes check-logs skill
"Show me error logs from keycloak namespace in the last 10 minutes"

Use check-metrics skill to verify service health:

# This automatically invokes check-metrics skill
"What's the CPU usage of pods in observability namespace?"

Incident Priority Guidelines

Critical (P0):

  • Platform-wide outage
  • All users affected
  • Data loss risk
  • Security breach

High (P1):

  • Major feature unavailable
  • Multiple users affected
  • Performance severely degraded

Medium (P2):

  • Minor feature unavailable
  • Some users affected
  • Workaround available

Low (P3):

  • Cosmetic issue
  • Minimal impact
  • Can be fixed in next release


Pro Tips

  1. Always check runbook first: Save time by following proven steps
  2. Capture evidence early: Logs/events may be lost if pod restarts (see the sketch after this list)
  3. Use --previous logs: For crashed pods, previous container logs are critical
  4. Check Git history: Recent commits often reveal configuration changes
  5. Document as you go: Don't wait until end to write RCA
  6. Verify fix completely: Check pods, alerts, tests, and platform status
  7. Update runbook: Add new scenarios discovered during investigation

🤖 Generated with Claude Code