Claude-skill-registry chaos-engineering
Test system resilience through controlled failures. Use when validating fault tolerance, disaster recovery, or system reliability. Covers chaos experiments.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/chaos-engineering" ~/.claude/skills/majiayu000-claude-skill-registry-chaos-engineering && rm -rf "$T"
manifest: skills/data/chaos-engineering/SKILL.md
Chaos Engineering
Principles
- Build a Hypothesis: Define expected behavior
- Minimize Blast Radius: Start small
- Run in Production: Real conditions matter, so graduate experiments to production once they pass in non-production
- Automate: Make experiments repeatable
- Minimize Impact: Have abort conditions
Experiment Process
- Steady State: Define normal metrics
- Hypothesis: "System will maintain X under condition Y"
- Introduce Variables: Inject failure
- Observe: Compare to steady state
- Analyze: Confirm or disprove hypothesis
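The five steps above can be sketched as a small shell harness. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names (`measure`, `inject`, `rollback`, `run_experiment`), the percentage tolerance, and the stand-in metric values are all hypothetical and would be replaced by real probes and real fault injection.

```shell
#!/usr/bin/env bash
# Hypothetical experiment harness: every name and value here is illustrative.

# Stand-in metric probe; replace with a real query (e.g. p99 latency in ms).
measure() { echo "${FAKE_METRIC:-100}"; }

# Stand-in fault injection and rollback; replace with real commands.
inject()   { FAKE_METRIC=130; }
rollback() { FAKE_METRIC=100; }

# run_experiment MAX_DEVIATION_PERCENT
# Prints "confirmed" if the metric stays within tolerance, else "disproved".
run_experiment() {
  local tolerance=$1 baseline observed limit
  baseline=$(measure)                                 # 1. steady state
  limit=$(( baseline + baseline * tolerance / 100 ))  # 2. hypothesis bound
  inject                                              # 3. introduce variable
  observed=$(measure)                                 # 4. observe
  rollback                                            # undo the fault
  if [ "$observed" -le "$limit" ]; then               # 5. analyze
    echo "confirmed"
  else
    echo "disproved"
  fi
}
```

With the stand-in values, `run_experiment 50` prints `confirmed` and `run_experiment 10` prints `disproved`. Expressing the hypothesis as a numeric bound keeps the analysis step mechanical and repeatable.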
Common Experiments
Network Failures
# Add latency (requires root)
tc qdisc add dev eth0 root netem delay 100ms
# Packet loss ("replace" swaps out any root qdisc already attached, such as the delay above)
tc qdisc replace dev eth0 root netem loss 10%
# Remove
tc qdisc del dev eth0 root
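To keep the blast radius small, a latency rule can be applied for a fixed duration and then removed in the same call. The sketch below assumes a hypothetical `with_latency` wrapper and a `DRY_RUN` variable (neither is part of `tc`) so the commands can be previewed without root privileges.

```shell
#!/usr/bin/env bash
# Sketch: apply netem latency for a bounded time, then remove it.
# with_latency, run, and DRY_RUN are illustrative names, not part of tc.

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# with_latency IFACE DELAY DURATION_SECONDS
with_latency() {
  local iface=$1 delay=$2 duration=$3
  # Stop early if the rule cannot be added (e.g. not root, qdisc exists).
  run tc qdisc add dev "$iface" root netem delay "$delay" || return 1
  run sleep "$duration"
  run tc qdisc del dev "$iface" root
}
```

Calling `DRY_RUN=1 with_latency eth0 100ms 60` prints the three commands instead of executing them. Note that this sketch does not remove the rule if the script is interrupted during the sleep; a `trap` would be needed for that.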
Resource Exhaustion
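# CPU stress: 4 workers for 60 seconds
stress --cpu 4 --timeout 60s
# Memory stress: 2 workers allocating 1G each (2G total)
stress --vm 2 --vm-bytes 1G --timeout 60s
# Disk fill: write a 1 GiB file (delete /tmp/fill afterwards)
dd if=/dev/zero of=/tmp/fill bs=1M count=1024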
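A disk-fill experiment can take a host down if it consumes all free space. One possible guard, sketched below, caps the write so a reserve always stays free. The helper names `safe_fill_mb` and `fill_disk` and the reserve parameter are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: cap a disk-fill so RESERVE_MB always stays free.
# safe_fill_mb and fill_disk are hypothetical helpers.

# safe_fill_mb AVAILABLE_MB RESERVE_MB REQUESTED_MB
# Prints how many MiB may be written while leaving RESERVE_MB free.
safe_fill_mb() {
  local avail=$1 reserve=$2 want=$3 room
  room=$(( avail - reserve ))
  if [ "$room" -le 0 ]; then
    echo 0
  elif [ "$want" -lt "$room" ]; then
    echo "$want"
  else
    echo "$room"
  fi
}

# fill_disk DIR REQUESTED_MB RESERVE_MB
fill_disk() {
  local dir=$1 want=$2 reserve=$3 avail count
  # df -Pk reports available space in 1024-byte blocks in column 4.
  avail=$(( $(df -Pk "$dir" | awk 'NR==2 {print $4}') / 1024 ))
  count=$(safe_fill_mb "$avail" "$reserve" "$want")
  [ "$count" -gt 0 ] || { echo "not enough free space" >&2; return 1; }
  dd if=/dev/zero of="$dir/fill" bs=1M count="$count"
}
```

For example, `fill_disk /tmp 1024 2048` writes at most 1024 MiB and refuses to run unless more than 2048 MiB is free; `rm -f /tmp/fill` ends the experiment.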
Service Failures
- Kill processes
- Restart containers
- Terminate instances
- Block dependencies
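For service failures, the interesting measurement is usually how long recovery takes. The sketch below polls a health check after a fault is injected; `wait_for_recovery` is a hypothetical helper, and the service name and health URL in the usage note are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: wait for a service to recover after a fault is injected.
# wait_for_recovery is an illustrative helper.

# wait_for_recovery TIMEOUT_SECONDS CHECK_COMMAND...
# Polls CHECK_COMMAND once per second; succeeds as soon as it passes,
# fails if TIMEOUT_SECONDS elapse first.
wait_for_recovery() {
  local timeout=$1 elapsed=0
  shift
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "recovered after ${elapsed}s"
      return 0
    fi
    sleep 1
    elapsed=$(( elapsed + 1 ))
  done
  echo "no recovery within ${timeout}s" >&2
  return 1
}
```

A possible use, assuming a process named `my-service` with a health endpoint on port 8080: `pkill -f my-service; wait_for_recovery 30 curl -fsS http://localhost:8080/health`. A non-zero exit status means the hypothesis "the service restarts within 30 seconds" was disproved.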
Chaos Tools
- Chaos Monkey: Random instance termination
- Gremlin: Comprehensive chaos platform
- Litmus: Kubernetes chaos engineering
- Chaos Mesh: Cloud-native chaos
Experiment Template
## Experiment: [Name]

### Hypothesis
If [condition], then [expected behavior].

### Steady State
- Metric A: [baseline value]
- Metric B: [baseline value]

### Method
1. [Step 1]
2. [Step 2]
3. [Step 3]

### Abort Conditions
- If [condition], stop immediately

### Results
[What happened]

### Findings
[What we learned]
Safety Rules
- Start in non-production
- Have rollback ready
- Monitor continuously
- Communicate with team
- Document everything
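"Have rollback ready" can be made mechanical by registering the rollback before the fault is injected. The sketch below is one way to do that with a shell `trap`; the `run_with_rollback` wrapper is an assumption introduced here, not an established tool.

```shell
#!/usr/bin/env bash
# Sketch: guarantee that a rollback command runs however the experiment ends.
# run_with_rollback is an illustrative wrapper.

# run_with_rollback 'ROLLBACK COMMAND STRING' EXPERIMENT_COMMAND...
run_with_rollback() {
  local rollback=$1
  shift
  (
    trap "$rollback" EXIT     # runs when the subshell exits for any reason
    trap 'exit 130' INT TERM  # turn signals into an exit so EXIT fires
    "$@"
  )
}
```

For example, `run_with_rollback 'tc qdisc del dev eth0 root' sleep 60` would remove a previously added netem rule after 60 seconds, or earlier if the run is interrupted with Ctrl-C. The wrapper returns the exit status of the experiment command.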