gsd-skill-creator · chaos-engineering

Provides chaos engineering best practices for resilience testing, fault injection, and game day planning. Use when designing resilience experiments, configuring chaos tools, planning game days, or when user mentions 'chaos engineering', 'resilience', 'litmus', 'game day', 'fault injection', 'chaos monkey', 'blast radius', 'steady state', 'failure mode'.

install
source · Clone the upstream repo
git clone https://github.com/Tibsfox/gsd-skill-creator
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/skills/ops/chaos-engineering" ~/.claude/skills/tibsfox-gsd-skill-creator-chaos-engineering && rm -rf "$T"
manifest: examples/skills/ops/chaos-engineering/SKILL.md
source content

Chaos Engineering

Best practices for systematically injecting failures to discover weaknesses before they cause outages, using steady-state hypotheses, controlled experiments, and progressive blast radius expansion.

Chaos Engineering Principles

Chaos engineering is not random destruction. It is disciplined experimentation on distributed systems to build confidence in their resilience.

Define Steady State --> Form Hypothesis --> Design Experiment --> Control Blast Radius --> Run --> Analyze --> Fix --> Repeat
| Principle | Description | Why It Matters |
|-----------|-------------|----------------|
| Define steady state | Identify measurable normal behavior (latency, error rate, throughput) | Without a baseline, you cannot detect degradation |
| Hypothesize around steady state | Predict the system will maintain steady state during fault | Forces explicit thinking about expected behavior |
| Vary real-world events | Inject failures that actually happen (network, disk, process, dependency) | Simulated failures must map to real failure modes |
| Run in production | Test where real complexity exists (with safeguards) | Staging rarely matches production topology |
| Minimize blast radius | Start small, expand gradually, have kill switches | Chaos should reveal problems, not cause outages |
| Automate experiments | Repeatable experiments run in CI/CD or on schedule | Manual experiments don't scale and introduce bias |
| Build a hypothesis backlog | Track what you want to test and what you've learned | Systematic coverage prevents blind spots |
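
The cycle above can be sketched as control flow. This is a minimal sketch assuming a hypothetical experiment object that exposes the kill switch, fault injection, and steady-state check as methods; none of these names are a real tool's API.

```python
def run_experiment_cycle(experiment, max_rounds=3):
    """Run -> analyze -> fix -> repeat until the hypothesis is confirmed."""
    for _ in range(max_rounds):
        if experiment.abort_requested():    # kill switch: stop immediately
            experiment.rollback()
            return "aborted"
        experiment.inject_fault()
        if experiment.steady_state_held():  # hypothesis confirmed this round
            return "confirmed"
        experiment.record_findings()        # feed fixes back into the backlog
    return "rejected"
```

The point of the sketch is the ordering: the abort check runs before every injection, and findings are recorded on every failed round so the backlog grows even when the hypothesis is rejected.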

Steady-State Hypothesis Template

Every chaos experiment begins with a hypothesis. This template ensures experiments are structured and measurable.

# Steady-State Hypothesis Document
experiment:
  name: "Payment service database failover"
  id: "CHAOS-042"
  date: "2026-02-07"
  owner: "team-payments"
  reviewer: "sre-team"

steady_state:
  description: "Payment service processes transactions within SLO"
  metrics:
    - name: "p99 latency"
      source: "prometheus"
      query: 'histogram_quantile(0.99, rate(payment_request_duration_seconds_bucket[5m]))'
      threshold: "< 500ms"
    - name: "error rate"
      source: "prometheus"
      query: 'rate(payment_request_errors_total[5m]) / rate(payment_request_total[5m])'
      threshold: "< 0.1%"
    - name: "transaction throughput"
      source: "prometheus"
      query: 'rate(payment_transactions_total[5m])'
      threshold: "> 100 tx/s"

hypothesis: >
  When the primary database replica fails, the payment service will
  failover to the secondary replica within 30 seconds, maintaining
  p99 latency below 2 seconds and error rate below 1% during failover.

experiment_design:
  action: "Kill primary PostgreSQL pod in payment-db StatefulSet"
  duration: "5 minutes"
  blast_radius: "payment namespace only"
  rollback: "PostgreSQL operator will auto-recreate pod; manual failback if needed"

abort_conditions:
  - "Error rate exceeds 5% for more than 60 seconds"
  - "Total service outage detected (zero throughput for 30 seconds)"
  - "Cascading failures detected in upstream services"
  - "Any P1/P2 incident triggered by unrelated system"

expected_outcome:
  - "Failover completes within 30 seconds"
  - "p99 latency spikes to < 2s during failover, recovers to < 500ms"
  - "Error rate stays below 1%"
  - "No data loss or corruption"

actual_outcome: null  # Filled after experiment
findings: null
action_items: null
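
The steady_state block lends itself to automation: each metric is a query plus a threshold string, so a runner can evaluate the whole hypothesis mechanically. Below is a minimal sketch, assuming hypothetical metric names and threshold strings of the form "< 0.5" (trailing units such as "ms" are ignored by the parser); this is not a Litmus or Gremlin API.

```python
import operator
import re

# Map comparison symbols from threshold strings to Python operators.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def check_threshold(observed: float, threshold: str) -> bool:
    """Evaluate a threshold string such as '< 0.5' against an observed value."""
    m = re.match(r"(<=|>=|<|>)\s*([\d.]+)", threshold)
    if not m:
        raise ValueError(f"unparseable threshold: {threshold!r}")
    op, bound = OPS[m.group(1)], float(m.group(2))
    return op(observed, bound)

def steady_state_holds(observations: dict, thresholds: dict) -> bool:
    """True only when every metric satisfies its threshold."""
    return all(check_threshold(observations[name], spec)
               for name, spec in thresholds.items())
```

A runner would fill `observations` from the Prometheus queries in the template, then gate the experiment verdict on `steady_state_holds` before, during, and after injection.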

Litmus Chaos Experiments

LitmusChaos Engine Manifest

LitmusChaos is a CNCF project for Kubernetes-native chaos engineering. The ChaosEngine connects your application to chaos experiments.

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-service-chaos
  namespace: payments
spec:
  appinfo:
    appns: payments
    applabel: "app=payment-service"
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  monitoring: true

  # Steady-state checks before and after experiment
  components:
    runner:
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "200m"
          memory: "256Mi"

  experiments:
    - name: pod-delete
      spec:
        probe:
          # Steady-state verification probe
          - name: "payment-api-health"
            type: httpProbe
            mode: Continuous
            httpProbe/inputs:
              url: "http://payment-service.payments.svc:8080/health"
              insecureSkipVerify: false
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            runProperties:
              probeTimeout: 5s
              interval: 10s
              retry: 3
              probePollingInterval: 2s

          - name: "payment-latency-check"
            type: promProbe
            mode: Edge
            promProbe/inputs:
              endpoint: "http://prometheus.monitoring.svc:9090"
              query: 'histogram_quantile(0.99, rate(payment_request_duration_seconds_bucket{namespace="payments"}[1m]))'
              comparator:
                type: float
                criteria: "<="
                value: "2.0"
            runProperties:
              probeTimeout: 10s
              interval: 30s
              retry: 2

        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: CHAOS_INTERVAL
              value: "30"
            - name: FORCE
              value: "true"
            - name: PODS_AFFECTED_PERC
              value: "50"

Litmus Network Chaos Experiment

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-network-chaos
  namespace: payments
spec:
  appinfo:
    appns: payments
    applabel: "app=payment-service"
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin

  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "180"
            - name: NETWORK_LATENCY
              value: "300"  # 300ms added latency
            - name: JITTER
              value: "100"  # 100ms jitter
            - name: NETWORK_INTERFACE
              value: "eth0"
            - name: DESTINATION_IPS
              value: "10.96.0.10"  # Target specific dependency (e.g., database)
            - name: CONTAINER_RUNTIME
              value: "containerd"
            - name: SOCKET_PATH
              value: "/run/containerd/containerd.sock"
        probe:
          - name: "transaction-success-rate"
            type: promProbe
            mode: Continuous
            promProbe/inputs:
              endpoint: "http://prometheus.monitoring.svc:9090"
              query: 'rate(payment_transactions_success_total{namespace="payments"}[1m]) / rate(payment_transactions_total{namespace="payments"}[1m]) * 100'
              comparator:
                type: float
                criteria: ">="
                value: "95.0"
            runProperties:
              probeTimeout: 10s
              interval: 15s
              retry: 3

Chaos Monkey Configuration

Netflix's Chaos Monkey randomly terminates instances in production. Its descendants carry the idea into other layers: Chaos Monkey for Spring Boot injects assaults inside the application, kube-monkey terminates pods in Kubernetes, and Netflix's own Chaos Monkey 2.0 runs through Spinnaker.

# Chaos Monkey for Spring Boot (Simian Army successor)
# application.yml
chaos:
  monkey:
    enabled: true
    watcher:
      controller: true
      restController: true
      service: true
      repository: true
      component: false

    assaults:
      level: 5                    # 1 in 5 requests affected
      latencyActive: true
      latencyRangeStart: 1000     # 1 second
      latencyRangeEnd: 5000       # 5 seconds
      exceptionsActive: true
      exception:
        type: java.lang.RuntimeException
        arguments:
          - className: java.lang.String
            value: "Chaos Monkey - simulated failure"
      killApplicationActive: false  # DANGER: only enable in controlled tests
      memoryActive: false

    runtime:
      # Only active during business hours (safety net)
      scheduleEnabled: true
      scheduleExpression: "0 0 9-17 * * MON-FRI"

---
# Kube-monkey configuration (Kubernetes-native Chaos Monkey)
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-monkey-config
  namespace: kube-system
data:
  config.toml: |
    [kubemonkey]
    run_hour = 8
    start_hour = 10
    end_hour = 16
    grace_period_sec = 5
    cluster_dns_name = "cluster.local"
    whitelisted_namespaces = ["payments", "orders", "inventory"]
    blacklisted_namespaces = ["kube-system", "monitoring", "istio-system"]
    time_zone = "America/New_York"

    [debug]
    enabled = true
    schedule_immediate_kill = false
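
Both configs share the same safety idea: kills are only sanctioned inside a known window against an allow-listed target. A hypothetical sketch of that check is below, with the hours taken from the TOML above and the weekday guard mirroring the Spring Boot cron; this is not kube-monkey's actual scheduling code.

```python
from datetime import datetime

START_HOUR = 10   # kills may begin at 10:00 (start_hour above)
END_HOUR = 16     # no kills at or after 16:00 (end_hour above)

def kill_allowed(now: datetime, namespace: str,
                 whitelisted=("payments", "orders", "inventory"),
                 blacklisted=("kube-system", "monitoring", "istio-system")) -> bool:
    """Return True only when a termination is inside the sanctioned window."""
    if namespace in blacklisted or namespace not in whitelisted:
        return False
    if now.weekday() >= 5:                  # weekend: no chaos
        return False
    return START_HOUR <= now.hour < END_HOUR
```

Note the double guard on namespaces: being absent from the blacklist is not enough, the target must also be explicitly whitelisted.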

Blast Radius Control Matrix

Controlling blast radius is the difference between chaos engineering and simply breaking things.

| Level | Scope | Example | Risk | When to Use |
|-------|-------|---------|------|-------------|
| 1 - Unit | Single container/process | Kill one pod replica | Minimal | Starting out, new experiments |
| 2 - Service | All replicas of one service | Delete all pods in deployment | Low | After Level 1 succeeds |
| 3 - Dependency | Degrade a dependency | Add latency to database connection | Medium | Testing circuit breakers, retries |
| 4 - Zone | Entire availability zone | Drain all nodes in one AZ | High | Quarterly DR exercises |
| 5 - Region | Full region failure | Redirect all traffic to secondary region | Critical | Annual DR exercises, game days |

Blast Radius Escalation Protocol

Level 1: Single pod kill
  |
  +-- Pass? --> Level 2: Full service disruption
  |               |
  |               +-- Pass? --> Level 3: Dependency failure
  |               |               |
  |               |               +-- Pass? --> Level 4: Zone failure (game day)
  |               |               |               |
  |               |               |               +-- Pass? --> Level 5: Region failover (annual)
  |               |               |
  |               |               +-- Fail --> Fix, retest Level 3
  |               |
  |               +-- Fail --> Fix, retest Level 2
  |
  +-- Fail --> Fix, retest Level 1
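
The gate above amounts to a one-step state machine: advance only on a pass, stay and retest on a failure. A minimal sketch with level names from the matrix (the function name is hypothetical):

```python
# Blast radius levels, ordered from smallest to largest scope.
LEVELS = ["unit", "service", "dependency", "zone", "region"]

def next_level(current: str, passed: bool) -> str:
    """Advance one level only on a pass; on failure, fix and retest the same level."""
    i = LEVELS.index(current)
    if not passed:
        return current                      # findings must be fixed first
    return LEVELS[min(i + 1, len(LEVELS) - 1)]
```

Encoding the protocol this way makes it easy to enforce in an experiment scheduler: no experiment definition can request a level its service has not yet earned.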

Game Day Runbook Template

Game days are structured chaos engineering exercises involving multiple teams. They test both technical resilience and human response.

# Game Day Runbook: [Scenario Name]

## Metadata
- **Date:** YYYY-MM-DD
- **Time Window:** HH:MM - HH:MM (timezone)
- **Game Master:** [name]
- **Participants:** [teams]
- **Communication Channel:** #gameday-YYYY-MM-DD

## Pre-Game Checklist
- [ ] All participating teams briefed (do NOT reveal exact failure scenario)
- [ ] Monitoring dashboards open and shared
- [ ] Rollback procedures documented and tested
- [ ] Customer communication templates prepared
- [ ] Stakeholders notified of game day window
- [ ] On-call engineers aware and standing by
- [ ] Kill switch tested and ready

## Scenario
**Narrative:** [Real-world scenario description]
**Technical Action:** [Exact fault injection steps]
**Expected Impact:** [What should happen if systems are resilient]
**Worst Case:** [What happens if resilience mechanisms fail]

## Timeline
| Time | Action | Owner |
|------|--------|-------|
| T-30m | Final go/no-go check | Game Master |
| T-15m | Verify steady state metrics | SRE |
| T-0 | Inject failure | Game Master |
| T+5m | Observe initial response | All teams |
| T+15m | Check: has system self-healed? | SRE |
| T+30m | Decision: continue or abort | Game Master |
| T+60m | End experiment, begin recovery | Game Master |
| T+90m | Verify full recovery | SRE |
| T+120m | Hot debrief | All participants |

## Observation Checklist
- [ ] Did alerts fire within expected time?
- [ ] Did on-call respond within SLO?
- [ ] Did failover mechanisms activate?
- [ ] Were customers impacted? For how long?
- [ ] Did runbooks match actual recovery steps?
- [ ] Were any cascading failures observed?
- [ ] Did communication flow correctly?

## Post-Game
- [ ] Write-up completed within 48 hours
- [ ] Action items created with owners and deadlines
- [ ] Findings shared with engineering org
- [ ] Next game day scenario identified

Progressive Chaos Maturity Levels

| Level | Name | Practices | Experiments | Frequency |
|-------|------|-----------|-------------|-----------|
| 0 | None | No chaos practice | None | Never |
| 1 | Exploratory | Ad-hoc experiments in staging | Pod kills, restarts | Quarterly |
| 2 | Systematic | Hypothesis-driven, documented | Network faults, dependency failures | Monthly |
| 3 | Automated | Chaos in CI/CD, scheduled experiments | Multi-service scenarios, zone failures | Weekly |
| 4 | Advanced | Production chaos, game days, culture of resilience | Region failover, data plane chaos | Continuous |

Chaos in CI/CD Pipelines

Integrate chaos experiments as quality gates in your deployment pipeline.

GitHub Actions Chaos Workflow

name: Resilience Tests

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 10 * * 1'  # Every Monday at 10:00 UTC

permissions:
  contents: read

jobs:
  chaos-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 30

    steps:
      - uses: actions/checkout@v4

      - name: Set up test cluster
        uses: helm/kind-action@v1
        with:
          cluster_name: chaos-test
          config: test/kind-config.yaml

      - name: Deploy application
        run: |
          kubectl apply -f k8s/namespace.yaml
          kubectl apply -f k8s/deployment.yaml
          kubectl apply -f k8s/service.yaml
          kubectl wait --for=condition=available deployment/payment-service \
            -n payments --timeout=120s

      - name: Install LitmusChaos
        run: |
          kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v3.0.0.yaml
          kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=litmus \
            -n litmus --timeout=120s

      - name: Verify steady state
        run: |
          # Check application health before chaos
          kubectl exec -n payments deploy/payment-service -- \
            curl -sf http://localhost:8080/health || exit 1

      - name: Run chaos experiment
        run: |
          kubectl apply -f chaos/pod-delete-experiment.yaml
          # Wait for experiment to complete
          kubectl wait --for=jsonpath='{.status.engineStatus}'=completed \
            chaosengine/payment-chaos -n payments --timeout=300s

      - name: Verify resilience
        run: |
          # Check experiment verdict
          VERDICT=$(kubectl get chaosresult payment-chaos-pod-delete \
            -n payments -o jsonpath='{.status.experimentStatus.verdict}')
          echo "Experiment verdict: $VERDICT"
          if [ "$VERDICT" != "Pass" ]; then
            echo "CHAOS TEST FAILED: System did not maintain steady state"
            kubectl logs -n payments -l app=payment-service --tail=100
            exit 1
          fi

      - name: Collect results
        if: always()
        run: |
          kubectl get chaosresult -n payments -o yaml > chaos-results.yaml

      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: chaos-results
          path: chaos-results.yaml
          retention-days: 30

Common Failure Modes to Test

| Failure Mode | Tool/Method | What It Validates |
|--------------|-------------|-------------------|
| Pod termination | LitmusChaos pod-delete | Auto-scaling, health checks, restart policies |
| Network latency | tc netem / LitmusChaos | Timeouts, circuit breakers, retry logic |
| Network partition | iptables / LitmusChaos | Split-brain handling, quorum mechanisms |
| DNS failure | CoreDNS manipulation | DNS caching, fallback resolution |
| CPU stress | stress-ng / LitmusChaos | Autoscaling triggers, throttling behavior |
| Memory pressure | stress-ng / LitmusChaos | OOM handling, graceful degradation |
| Disk I/O saturation | fio / LitmusChaos | Write-ahead log performance, disk alerts |
| Dependency unavailable | Network block / mock | Circuit breakers, fallback responses, bulkheads |
| Clock skew | chrony manipulation | Certificate validation, token expiry, cron jobs |
| Configuration drift | Mutate ConfigMap/Secret | Config reload, graceful failure on bad config |
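
As an illustration of what the network-latency row validates, here is a sketch in pure Python with hypothetical names: a fault wrapper that injects random latency around a dependency call, and retry logic that treats latency-budget overruns as failures. For simplicity the "timeout" is measured after the fact; a real client would cancel the in-flight request instead of waiting it out.

```python
import random
import time

class TransientLatencyFault:
    """Wraps a callable and injects random latency, simulating a slow
    dependency. Illustrative only; names and defaults are hypothetical."""

    def __init__(self, delay_s=0.05, probability=0.5, seed=None):
        self.delay_s = delay_s
        self.probability = probability
        self.rng = random.Random(seed)

    def wrap(self, fn):
        def faulty(*args, **kwargs):
            if self.rng.random() < self.probability:
                time.sleep(self.delay_s)        # injected fault: added latency
            return fn(*args, **kwargs)
        return faulty

def call_with_retry(fn, timeout_s=0.02, retries=3):
    """Retry until a call finishes within its latency budget."""
    for attempt in range(1, retries + 1):
        start = time.monotonic()
        result = fn()
        if time.monotonic() - start <= timeout_s:
            return result, attempt
    raise TimeoutError("all retries exceeded the latency budget")
```

With a fixed seed the fault is deterministic, so the same experiment can be replayed: the retry loop should absorb occasional slow calls, and exhausting all retries is exactly the failure mode a circuit breaker would then trip on.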

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| Chaos without hypothesis | Random destruction teaches nothing | Always write a hypothesis before injecting failures |
| Testing only in staging | Staging rarely matches production complexity | Graduate to production chaos with proper safeguards |
| No abort criteria | Experiments can escalate into real outages | Define and automate kill switches before every experiment |
| Big bang experiments | Starting with region-level failures on day one | Follow blast radius levels: pod -> service -> dependency -> zone -> region |
| Chaos as punishment | Using chaos to blame teams for failures | Frame chaos as learning; celebrate finding weaknesses |
| No follow-through | Running experiments but never fixing findings | Track action items with owners and deadlines; re-test fixes |
| Manual-only experiments | Experiments that depend on one person to run | Automate experiments, integrate into CI/CD pipeline |
| Ignoring human factors | Only testing technical resilience | Game days should test alerting, communication, and runbooks too |
| Secret chaos | Running experiments without telling anyone | Communicate schedules; surprise chaos erodes trust |
| Skipping steady-state verification | No baseline to compare against | Always measure before, during, and after injection |
| Chaos without observability | Cannot measure impact of experiments | Instrument first, then inject chaos; monitoring is prerequisite |
| One-and-done experiments | Running an experiment once and declaring success | Systems change; re-run experiments regularly to catch regressions |

Chaos Engineering Readiness Checklist

  • Observability in place (metrics, logs, traces) for target services
  • Steady-state metrics identified and baselined for target services
  • Chaos tooling installed and configured (LitmusChaos, Gremlin, or equivalent)
  • Service account and RBAC configured for chaos operator
  • Abort criteria defined and kill switch mechanism tested
  • First hypothesis document written and reviewed
  • Blast radius limited to single pod/container for initial experiments
  • Monitoring dashboards prepared for experiment observation
  • On-call team briefed and aware of experiment schedule
  • Rollback procedures documented and tested independently
  • Communication channel established for experiment coordination
  • First experiment successfully run in non-production environment
  • Results documented with findings and action items
  • Leadership briefed on chaos engineering program and value
  • Game day planned within 90 days of first successful experiment
  • Chaos experiments integrated into CI/CD pipeline (or plan to do so)
  • Resilience scorecard created for tracked services