Awesome-omni-skill Chaos Engineering

Design and execute controlled failure experiments to validate system resilience

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/design/chaos-engineering" ~/.claude/skills/diegosouzapw-awesome-omni-skill-chaos-engineering && rm -rf "$T"
manifest: skills/design/chaos-engineering/SKILL.md
source content

Chaos Engineering Skill

Design and execute controlled failure experiments to validate system resilience.

Trigger Conditions

  • Pre-release resilience validation needed
  • Post-deploy verification of fault tolerance
  • User invokes with "chaos experiment" or "resilience test"

Input Contract

  • Required: System under test
  • Required: Steady-state hypothesis (measurable; see the probe sketch after this list)
  • Optional: Blast radius constraints, failure types to inject
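
A "measurable" hypothesis is one a script can check, not just a sentence. A minimal probe sketch, assuming a hypothetical /health endpoint and a p99 latency budget of 0.5s (both illustrative, not part of this skill):

  # Sample latency 100 times and test the 99th sorted value against the budget.
  URL="https://api.example.com/health"   # placeholder endpoint
  for _ in $(seq 1 100); do
    curl -s -o /dev/null -w '%{time_total}\n' "$URL"
  done | sort -n | awk 'NR == 99 { exit !($1 < 0.5) }' \
    && echo "steady state holds" \
    || echo "steady state violated"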

Output Contract

  • Experiment definition with hypothesis
  • Results report with pass/fail
  • Findings and remediation recommendations
  • Updated resilience scorecard

Tool Permissions

  • Read: Service configs, circuit breaker configs, monitoring dashboards
  • Write: Experiment logs, findings reports
  • Execute: Failure injection tools (network, compute, storage; examples below)
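
Concretely, the execute permission maps to commands like the following. All tools and targets here are illustrative assumptions (a Kubernetes cluster, root access on the node), not requirements of the skill:

  # Network fault: 200ms delay plus 5% packet loss on the node interface (root).
  tc qdisc add dev eth0 root netem delay 200ms loss 5%
  tc qdisc del dev eth0 root        # revert the network fault

  # Compute fault: kill one pod and let the orchestrator reschedule it.
  kubectl -n payments delete pod payments-api-0

  # CPU pressure: saturate 4 cores for 60 seconds.
  stress-ng --cpu 4 --timeout 60s

  # Storage pressure: 2 disk-writer workers for 60 seconds.
  stress-ng --hdd 2 --timeout 60s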

Execution Steps

  1. Define steady-state hypothesis with measurable metrics
  2. Select failure injection type (network, pod kill, CPU, disk, dependency)
  3. Constrain blast radius (start small: single pod, single AZ)
  4. Execute the experiment while monitoring steady state (see the harness sketch after this list)
  5. Observe and record system behavior
  6. Compare actual behavior against hypothesis
  7. Document findings and remediation
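
Steps 3 through 6 can be wired into a small harness. A minimal sketch, assuming a Kubernetes target, a single-pod blast radius, and the same placeholder endpoint and threshold as above; the pod name and namespace are illustrative:

  #!/usr/bin/env bash
  # Minimal chaos harness sketch: baseline, inject, observe, compare.
  set -euo pipefail

  NS="payments"                          # placeholder namespace
  POD="payments-api-0"                   # placeholder pod: single-pod blast radius
  URL="https://api.example.com/health"   # placeholder steady-state probe
  THRESHOLD=0.5                          # hypothesis budget: p99 < 0.5s

  probe_p99() {                          # take 100 samples, print the 99th as p99
    for _ in $(seq 1 100); do
      curl -s -o /dev/null -w '%{time_total}\n' "$URL"
    done | sort -n | sed -n '99p'
  }

  baseline=$(probe_p99)
  echo "baseline p99: ${baseline}s"

  kubectl -n "$NS" delete pod "$POD"     # inject: kill one pod

  during=$(probe_p99)                    # observe while the system recovers
  echo "p99 during experiment: ${during}s"

  # Compare actual behavior against the hypothesis.
  if awk -v d="$during" -v t="$THRESHOLD" 'BEGIN { exit !(d < t) }'; then
    echo "PASS: steady state held (p99 ${during}s < ${THRESHOLD}s)"
  else
    echo "FAIL: steady state violated; document findings and remediation"
  fi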

Success Criteria

  • Hypothesis clearly defined before experiment
  • Blast radius contained as planned
  • Monitoring remained functional during experiment
  • Findings documented with severity and remediation

Escalation Rules

  • Escalate if the experiment causes unexpected customer impact
  • Escalate if monitoring fails during the experiment
  • Escalate if recovery exceeds the MTTR target (a mechanical check is sketched below)
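
The MTTR rule can be checked mechanically rather than eyeballed. A minimal sketch that times recovery after injection and flags escalation; the 120s target and the probe URL are assumptions for illustration:

  MTTR_TARGET=120                 # seconds; assumed target for illustration
  start=$(date +%s)
  # ...failure injection happens here...
  until curl -sf -o /dev/null "https://api.example.com/health"; do
    sleep 5                       # poll until the service answers again
  done
  recovery=$(( $(date +%s) - start ))
  if (( recovery > MTTR_TARGET )); then
    echo "ESCALATE: recovery took ${recovery}s, exceeding the ${MTTR_TARGET}s MTTR target"
  fi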

Example Invocations

Input: "Test what happens when the Redis cache becomes unavailable"

Output:

Hypothesis: API latency stays below 500ms p99, with cache misses falling back to the DB.
Experiment: kill the Redis pod.
Result: FAIL. Latency spiked to 3.2s and the circuit breaker did not trip (its threshold was misconfigured).
Remediation: lower the circuit breaker threshold from a 50% to a 20% error rate and add cache stampede protection.
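
As a concrete sketch, this invocation maps onto the harness above: the injection is a pod kill and the probe is the same latency check; the namespace and pod name are placeholders for the real Redis deployment:

  # Inject: make the cache unavailable (placeholder names).
  kubectl -n cache delete pod redis-0
  # Observe: reuse probe_p99 from the harness; the hypothesis fails
  # if it prints a value at or above 0.5s (e.g. the 3.2s reading above).
  probe_p99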