Awesome-omni-skill Chaos Engineering
Design and execute controlled failure experiments to validate system resilience
install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/design/chaos-engineering" ~/.claude/skills/diegosouzapw-awesome-omni-skill-chaos-engineering && rm -rf "$T"
manifest: skills/design/chaos-engineering/SKILL.md
source content
Chaos Engineering Skill
Trigger Conditions
- Resilience validation is needed before a release
- Fault tolerance needs to be verified after a deploy
- User invokes with "chaos experiment" or "resilience test"
Input Contract
- Required: System under test
- Required: Steady-state hypothesis (measurable)
- Optional: Blast radius constraints, failure types to inject
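One way to make the input contract concrete is a small typed record that rejects a missing system or an unmeasurable hypothesis. This is only a sketch: the class names, the field names, and the metric name `api_latency_p99_ms` are hypothetical and do not come from the skill itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SteadyStateHypothesis:
    metric: str        # hypothetical metric name, e.g. "api_latency_p99_ms"
    comparator: str    # "<" or ">"
    threshold: float

    def holds(self, observed: float) -> bool:
        """Return True when the observed value satisfies the hypothesis."""
        if self.comparator == "<":
            return observed < self.threshold
        if self.comparator == ">":
            return observed > self.threshold
        raise ValueError(f"unsupported comparator: {self.comparator!r}")


@dataclass
class ExperimentInput:
    system_under_test: str                                  # required
    hypothesis: SteadyStateHypothesis                       # required, measurable
    blast_radius: Optional[str] = None                      # optional constraint
    failure_types: List[str] = field(default_factory=list)  # optional

    def validate(self) -> None:
        """Raise ValueError if a required input is missing."""
        if not self.system_under_test.strip():
            raise ValueError("system under test is required")
        if not self.hypothesis.metric.strip():
            raise ValueError("hypothesis must name a measurable metric")


experiment = ExperimentInput(
    system_under_test="checkout-api",  # hypothetical service name
    hypothesis=SteadyStateHypothesis("api_latency_p99_ms", "<", 500.0),
    blast_radius="single pod",
    failure_types=["pod kill"],
)
experiment.validate()
```

Expressing the hypothesis as metric, comparator, and threshold keeps it measurable, which is what the contract asks for.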
Output Contract
- Experiment definition with hypothesis
- Results report with pass/fail
- Findings and remediation recommendations
- Updated resilience scorecard
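The first three output items could be carried in a structure like the one below. It is a sketch under assumptions: the class names and the three-level severity scale are hypothetical, and the resilience scorecard is left out because the skill does not define its format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    summary: str
    severity: str      # assumed scale: "low", "medium", "high"
    remediation: str


@dataclass
class ExperimentReport:
    hypothesis: str
    passed: bool
    findings: List[Finding] = field(default_factory=list)

    def render(self) -> str:
        """Format the report as plain text, one finding per line."""
        lines = [
            f"Hypothesis: {self.hypothesis}",
            f"Result: {'PASS' if self.passed else 'FAIL'}",
        ]
        for finding in self.findings:
            lines.append(
                f"- [{finding.severity}] {finding.summary} -> {finding.remediation}"
            )
        return "\n".join(lines)


report = ExperimentReport(
    hypothesis="p99 API latency stays below 500 ms",
    passed=False,
    findings=[
        Finding(
            summary="circuit breaker did not trip",
            severity="high",
            remediation="lower the error-rate threshold",
        )
    ],
)
print(report.render())
```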
Tool Permissions
- Read: Service configs, circuit breaker configs, monitoring dashboards
- Write: Experiment logs, findings reports
- Execute: Failure injection tools (network, compute, storage)
Execution Steps
- Define steady-state hypothesis with measurable metrics
- Select failure injection type (network, pod kill, CPU, disk, dependency)
- Constrain blast radius (start small: single pod, single AZ)
- Execute experiment while monitoring steady state
- Observe and record system behavior
- Compare actual behavior against hypothesis
- Document findings and remediation
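The execute, observe, and compare steps above can be sketched as a small runner. This is a minimal illustration and not the skill's implementation: `measure`, `inject`, and `rollback` are hypothetical callables that the caller supplies, and the hypothesis is assumed to take the form "the measured value stays below a threshold".

```python
from typing import Callable, Dict, List


def run_experiment(
    measure: Callable[[], float],   # returns the steady-state metric
    inject: Callable[[], None],     # starts the failure
    rollback: Callable[[], None],   # removes the failure
    threshold: float,
    samples: int = 5,
) -> Dict[str, object]:
    """Run one experiment; the hypothesis is 'measure() stays below threshold'."""
    # Confirm steady state first; injecting into an unhealthy system proves nothing.
    baseline = measure()
    if baseline >= threshold:
        return {"status": "aborted", "baseline": baseline, "observations": []}

    observations: List[float] = []
    inject()
    try:
        # Observe and record behavior while the failure is active.
        for _ in range(samples):
            observations.append(measure())
    finally:
        # Always undo the injection, even if measuring raises.
        rollback()

    # Compare actual behavior against the hypothesis.
    passed = all(value < threshold for value in observations)
    return {
        "status": "pass" if passed else "fail",
        "baseline": baseline,
        "observations": observations,
    }
```

Running the rollback in a `finally` block is a deliberate choice: the injected failure is removed even when monitoring itself breaks mid-experiment.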
Success Criteria
- Hypothesis clearly defined before experiment
- Blast radius contained as planned
- Monitoring remained functional during experiment
- Findings documented with severity and remediation
Escalation Rules
- Escalate if experiment causes unexpected customer impact
- Escalate if monitoring fails during the experiment
- Escalate if recovery takes longer than the MTTR (mean time to recovery) target
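The three escalation rules translate directly into a check that returns every rule that fired. The function name and parameters are hypothetical; how customer impact and monitoring health are detected is left to the caller.

```python
from typing import List


def escalation_reasons(
    customer_impact: bool,
    monitoring_healthy: bool,
    recovery_seconds: float,
    mttr_target_seconds: float,
) -> List[str]:
    """Return the escalation rules that fired; an empty list means no escalation."""
    reasons: List[str] = []
    if customer_impact:
        reasons.append("unexpected customer impact")
    if not monitoring_healthy:
        reasons.append("monitoring failed during the experiment")
    if recovery_seconds > mttr_target_seconds:
        reasons.append("recovery exceeded the MTTR target")
    return reasons


# Recovery took 10 minutes against a 5-minute target, so one rule fires.
print(escalation_reasons(False, True, recovery_seconds=600, mttr_target_seconds=300))
```

Returning all reasons instead of a single boolean keeps the escalation message specific when more than one rule is violated.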
Example Invocations
Input: "Test what happens when the Redis cache becomes unavailable"
Output: Hypothesis: p99 API latency stays below 500 ms, with cache misses falling back to the DB. Experiment: kill the Redis pod. Result: FAIL. Latency spiked to 3.2 s and the circuit breaker did not trip (misconfigured threshold). Remediation: lower the circuit breaker's error-rate threshold from 50% to 20% and add cache stampede protection.
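To see why the threshold matters in this example, consider a toy error-rate circuit breaker. Everything here is an illustrative assumption: the `ErrorRateBreaker` class is hypothetical rather than the breaker used in the example, and the 30% failure rate is invented for the sketch because the example does not report the observed error rate.

```python
from collections import deque
from typing import Iterable


class ErrorRateBreaker:
    """Opens once the error rate over the last `window` calls reaches `threshold`."""

    def __init__(self, threshold: float, window: int = 20) -> None:
        self.threshold = threshold
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.open = False

    def record(self, ok: bool) -> None:
        self.results.append(ok)
        # Only evaluate once the window is full, to avoid tripping on a few calls.
        if len(self.results) == self.results.maxlen:
            error_rate = self.results.count(False) / len(self.results)
            if error_rate >= self.threshold:
                self.open = True


def trips(threshold: float, outcomes: Iterable[bool]) -> bool:
    """Feed call outcomes to a fresh breaker and report whether it opened."""
    breaker = ErrorRateBreaker(threshold)
    for ok in outcomes:
        breaker.record(ok)
    return breaker.open


# Assumed traffic: 3 of every 10 calls fail (30% error rate) after the cache is lost.
outcomes = [i % 10 >= 3 for i in range(20)]

print(trips(0.50, outcomes))  # 50% threshold
print(trips(0.20, outcomes))  # 20% threshold
```

Under these assumed numbers the 50% breaker stays closed while the 20% breaker opens, which is the kind of gap the remediation is meant to close.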