Vibecosystem chaos-engineering

Failure injection patterns, blast radius control, steady state hypothesis, and gameday planning for resilience testing.

Install: clone the upstream repo
git clone https://github.com/vibeeval/vibecosystem
Manifest: skills/chaos-engineering/skill.md

Chaos Engineering

Systematic resilience testing to discover weaknesses before they cause outages.

Steady State Hypothesis

# Define BEFORE injecting chaos - what "normal" looks like
steady_state_hypothesis:
  title: "API serves traffic within SLO"
  probes:
    - name: "API response time p95 < 500ms"
      type: http
      url: "https://api.example.com/health"
      threshold: 500

    - name: "Error rate < 1%"
      type: prometheus
      query: "rate(http_requests_total{status=~'5..'}[5m]) / rate(http_requests_total[5m])"
      threshold: 0.01

    - name: "Order processing queue depth < 100"
      type: cloudwatch
      metric: "ApproximateNumberOfMessagesVisible"
      threshold: 100

    - name: "Database connections < 80% capacity"
      type: prometheus
      query: "pg_stat_activity_count / pg_settings_max_connections"
      threshold: 0.8
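
A probe definition is only useful if it can be executed on demand. As a minimal sketch, assuming a Prometheus server at a hypothetical PROM_URL, the error-rate probe above could be checked before any injection starts:

# Pre-flight probe check (sketch) - PROM_URL is a placeholder; the query
# and threshold mirror the error-rate probe in the hypothesis above.
import httpx

PROM_URL = "https://prometheus.example.com"  # hypothetical endpoint

ERROR_RATE_QUERY = (
    "rate(http_requests_total{status=~'5..'}[5m]) "
    "/ rate(http_requests_total[5m])"
)

def probe_error_rate(threshold: float = 0.01) -> bool:
    """Return True if the current error rate is within the steady state."""
    resp = httpx.get(f"{PROM_URL}/api/v1/query", params={"query": ERROR_RATE_QUERY})
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # No samples means no traffic - treat that as a failed probe, not a pass
    if not results:
        return False
    return float(results[0]["value"][1]) < threshold

if __name__ == "__main__":
    print("steady state holds:", probe_error_rate())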

Failure Injection Patterns

# Using Chaos Toolkit (chaostoolkit.org)
# experiment.json

{
  "title": "Database failover resilience",
  "description": "Verify app handles primary DB failover gracefully",

  "steady-state-hypothesis": {
    "title": "API responds normally",
    "probes": [
      {
        "name": "api-health",
        "type": "probe",
        "provider": {
          "type": "http",
          "url": "https://api.example.com/health",
          "timeout": 5
        },
        "tolerance": {"status": 200}
      }
    ]
  },

  "method": [
    {
      "name": "failover-primary-db",
      "type": "action",
      "provider": {
        "type": "python",
        "module": "chaosaws.rds.actions",
        "func": "failover_db_cluster",
        "arguments": {
          "db_cluster_identifier": "prod-cluster"
        }
      },
      "pauses": {"after": 60}
    }
  ],

  "rollbacks": [
    {
      "name": "verify-db-recovered",
      "type": "probe",
      "provider": {
        "type": "python",
        "module": "chaosaws.rds.probes",
        "func": "cluster_status",
        "arguments": {
          "db_cluster_identifier": "prod-cluster"
        }
      },
      "tolerance": "available"
    }
  ]
}
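
With the CLI installed (pip install chaostoolkit chaostoolkit-aws), check and execute the experiment with chaos validate experiment.json, then chaos run experiment.json. Chaos Toolkit verifies the steady-state hypothesis before the method runs and again after it, so the cluster-available probe doubles as the post-failover recovery check; rollbacks, when declared, contain remediation actions rather than probes.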

Blast Radius Control

# ALWAYS limit the impact of chaos experiments

from datetime import datetime

import httpx


class BlastRadiusController:
    """Control and limit chaos experiment impact."""

    def __init__(self, config: dict):
        self.max_affected_percentage = config.get('max_affected_pct', 5)
        self.max_duration_seconds = config.get('max_duration_s', 300)
        self.excluded_services = config.get('excluded', ['auth', 'payments'])
        self.kill_switch_url = config.get('kill_switch_url')

    def can_inject(self, target: str, scope: str) -> bool:
        # Never chaos-test critical services without explicit approval
        if target in self.excluded_services:
            return False

        # Never inject during peak hours
        hour = datetime.now().hour
        if 9 <= hour <= 17:  # Business hours (adjust per timezone)
            return False

        # Never affect more than N% of instances
        if self.get_affected_percentage(target, scope) > self.max_affected_percentage:
            return False

        return True

    def get_affected_percentage(self, target: str, scope: str) -> float:
        total = self.get_total_instances(target)
        affected = self.get_affected_instances(target, scope)
        # Fail closed: an unknown or empty fleet counts as 100% affected
        return (affected / total) * 100 if total > 0 else 100

    def get_total_instances(self, target: str) -> int:
        """Integration point: query your orchestrator for the fleet size."""
        raise NotImplementedError

    def get_affected_instances(self, target: str, scope: str) -> int:
        """Integration point: count instances matched by the scope selector."""
        raise NotImplementedError

    async def emergency_stop(self) -> None:
        """Kill switch: immediately halt all chaos experiments."""
        # httpx.post() is synchronous; use an AsyncClient inside async code
        async with httpx.AsyncClient() as client:
            await client.post(self.kill_switch_url, json={"action": "stop_all"})
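
Wiring it up might look like the sketch below; the config values and service name are illustrative, and the two instance-count methods must be implemented against your orchestrator first.

# Illustrative usage (sketch) - values and names are placeholders
controller = BlastRadiusController({
    'max_affected_pct': 5,
    'max_duration_s': 300,
    'excluded': ['auth', 'payments'],
    'kill_switch_url': 'https://chaos.example.com/kill',  # hypothetical endpoint
})

if controller.can_inject(target='checkout', scope='one-pod'):
    print('blast radius check passed - safe to inject')
else:
    print('blast radius check failed - skipping injection')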

Common Chaos Experiments

# Experiment catalog - start with these

level_1_basic:
  - name: "Kill a single pod"
    tool: "kubectl delete pod <name>"
    validates: "Pod auto-recovery, health checks"
    blast_radius: "1 pod"

  - name: "CPU stress on one node"
    tool: "stress-ng --cpu 4 --timeout 60"
    validates: "Autoscaling, request routing"
    blast_radius: "1 node"

  - name: "Inject 500ms network latency"
    tool: "tc qdisc add dev eth0 root netem delay 500ms"
    validates: "Timeout handling, circuit breakers"
    blast_radius: "1 container"

level_2_intermediate:
  - name: "Kill entire availability zone"
    tool: "Chaos Toolkit / AWS FIS"
    validates: "Multi-AZ failover, data replication"
    blast_radius: "1 AZ"

  - name: "DNS resolution failure"
    tool: "iptables -A OUTPUT -p udp --dport 53 -j DROP"
    validates: "DNS caching, fallback resolution"
    blast_radius: "1 service"

  - name: "Disk fill to 95%"
    tool: "fallocate -l 50G /tmp/disk_fill"
    validates: "Disk space alerts, log rotation"
    blast_radius: "1 node"

level_3_advanced:
  - name: "Split brain network partition"
    tool: "Toxiproxy / Linux iptables"
    validates: "Consensus protocols, data consistency"
    blast_radius: "Cluster segment"

  - name: "Clock skew injection"
    tool: "timedatectl set-time +5min"
    validates: "Certificate validation, token expiry"
    blast_radius: "1 node"
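
Whatever the level, wrap every injection so that cleanup runs even if the experiment crashes. A minimal sketch for the latency entry above, assuming root privileges and an eth0 interface:

# Time-boxed latency injection with guaranteed cleanup (sketch)
# Assumes root and an eth0 interface, as in the catalog entry above.
import subprocess
import time

INJECT = ['tc', 'qdisc', 'add', 'dev', 'eth0', 'root', 'netem', 'delay', '500ms']
CLEANUP = ['tc', 'qdisc', 'del', 'dev', 'eth0', 'root']

def inject_latency(duration_s: int = 60) -> None:
    subprocess.run(INJECT, check=True)
    try:
        # Observation window: keep it short, watch dashboards meanwhile
        time.sleep(duration_s)
    finally:
        # Always remove the qdisc, even on interrupt or failure
        subprocess.run(CLEANUP, check=False)

if __name__ == '__main__':
    inject_latency(60)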

Gameday Checklist

## Pre-Gameday (1 week before)
- [ ] Define steady state hypothesis with measurable probes
- [ ] Identify blast radius and set hard limits
- [ ] Ensure kill switch is tested and accessible
- [ ] Notify on-call team and stakeholders
- [ ] Verify rollback procedures are documented and tested
- [ ] Set up monitoring dashboards for the experiment
- [ ] Run experiment in staging first

## During Gameday
- [ ] Verify steady state BEFORE injecting chaos
- [ ] Start with smallest blast radius, escalate gradually
- [ ] Monitor dashboards continuously during experiment
- [ ] Document observations in real-time (shared doc)
- [ ] If SLO violated: trigger kill switch immediately
- [ ] Time-box each experiment (max 5 minutes per injection)

## Post-Gameday
- [ ] Verify system returned to steady state
- [ ] Document findings: what broke, what recovered, and what surprised you
- [ ] Create action items for discovered weaknesses
- [ ] Update runbooks based on learnings
- [ ] Share results with broader engineering team
- [ ] Schedule fixes and re-test
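
The during-gameday steps are mechanical enough to script. A hedged sketch of one gameday step, reusing the hypothetical probe_error_rate and BlastRadiusController pieces from the earlier examples:

# Gameday step driver (sketch) - probe_error_rate and the controller are
# the hypothetical pieces defined in the earlier examples.
import asyncio
import time

async def run_gameday_step(controller, inject, duration_s: int = 300) -> None:
    # Verify steady state BEFORE injecting chaos
    if not probe_error_rate():
        print('steady state violated before injection - aborting')
        return

    inject()  # e.g. inject_latency from the catalog sketch
    deadline = time.monotonic() + duration_s

    # Monitor continuously; trigger the kill switch on SLO violation
    while time.monotonic() < deadline:
        if not probe_error_rate():
            await controller.emergency_stop()
            print('SLO violated - kill switch triggered')
            return
        await asyncio.sleep(10)

    print('experiment completed within steady state')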

Checklist

  • Define steady state hypothesis before every experiment
  • Never run chaos in production without a tested kill switch
  • Start in staging, graduate to production with reduced blast radius
  • Exclude critical services (auth, payments) unless specifically targeting them
  • Time-box experiments (max 5 minutes injection, 30 minutes observation)
  • Run during low-traffic windows, never during peak
  • Document every experiment: hypothesis, method, observations, findings
  • Automate recurring experiments in CI/CD pipeline

Anti-Patterns

  • Chaos without hypothesis: random breaking is not engineering
  • No kill switch: unable to stop experiment when things go wrong
  • Running in production first: always validate in staging
  • Affecting too many instances: never exceed 5% without explicit approval
  • Chaos during incidents: only inject chaos on healthy systems
  • Not fixing findings: experiments without follow-up action items are wasted