install
source · Clone the upstream repo
git clone https://github.com/Intense-Visions/harness-engineering
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Intense-Visions/harness-engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agents/skills/claude-code/resilience-chaos-testing" ~/.claude/skills/intense-visions-harness-engineering-resilience-chaos-testing && rm -rf "$T"
manifest: agents/skills/claude-code/resilience-chaos-testing/SKILL.md
Chaos Testing
Validate resilience by injecting controlled failures to verify that fallbacks, retries, and circuit breakers work under real conditions
When to Use
- Validating that resilience patterns (circuit breakers, retries, fallbacks) actually work
- Preparing for production incidents by simulating them in controlled environments
- Building confidence that the system degrades gracefully under partial failure
- Discovering hidden dependencies and single points of failure
Instructions
- Start with a steady state hypothesis: "Users can still check out even when the recommendation service is down."
- Inject one failure at a time. Do not combine failures until individual effects are understood.
- Start in development/staging. Move to production only with tight blast radius controls.
- Choose the type of failure to inject: latency, errors, resource exhaustion, dependency unavailability, clock skew, or network partition.
- Measure impact on user-facing metrics (error rate, latency p99, success rate), not just internal metrics.
- Build failure injection as middleware or wrappers that can be toggled on/off.
// chaos/fault-injector.ts
interface FaultConfig {
  enabled: boolean;
  latencyMs?: number; // Add artificial latency
  errorRate?: number; // 0.0 to 1.0 probability of error
  errorCode?: number; // HTTP status to return
  timeoutRate?: number; // 0.0 to 1.0 probability of timeout
  targetServices?: string[]; // Only affect specific services
}

export class FaultInjector {
  private config: FaultConfig = { enabled: false };

  // Merge new settings into the current config; unspecified fields keep their previous values
  configure(config: Partial<FaultConfig>) {
    this.config = { ...this.config, ...config };
  }

  async maybeInjectFault(serviceName: string): Promise<void> {
    if (!this.config.enabled) return;
    if (this.config.targetServices && !this.config.targetServices.includes(serviceName)) return;

    // Inject latency
    if (this.config.latencyMs) {
      await new Promise((r) => setTimeout(r, this.config.latencyMs));
    }

    // Inject timeout: this promise never settles, so the caller must enforce its own deadline
    if (this.config.timeoutRate && Math.random() < this.config.timeoutRate) {
      await new Promise(() => {}); // Hang forever; the caller's timeout should catch this
    }

    // Inject error
    if (this.config.errorRate && Math.random() < this.config.errorRate) {
      throw new ChaosError(`Injected fault for ${serviceName}`, this.config.errorCode ?? 500);
    }
  }
}

export class ChaosError extends Error {
  constructor(
    message: string,
    public readonly statusCode: number
  ) {
    super(message);
    this.name = 'ChaosError';
  }
}
// Integration with services
const faultInjector = new FaultInjector();

// Enable in test/staging via environment variable
if (process.env.CHAOS_ENABLED === 'true') {
  faultInjector.configure({
    enabled: true,
    targetServices: ['payment-api'],
    errorRate: 0.3, // 30% of payment API calls fail
    latencyMs: 2000, // Add 2s latency to all calls
  });
}

// Wrap service calls
export async function callPaymentAPI(orderId: string): Promise<PaymentResult> {
  await faultInjector.maybeInjectFault('payment-api');
  return fetch(`https://payment.example.com/charge/${orderId}`).then((r) => r.json());
}
// Chaos test scenario
describe('checkout resilience', () => {
  // Disable fault injection after every test, even when an assertion fails,
  // so injected faults cannot leak into later tests
  afterEach(() => {
    faultInjector.configure({ enabled: false });
  });

  it('completes checkout when payment service has 50% error rate', async () => {
    faultInjector.configure({
      enabled: true,
      targetServices: ['payment-api'],
      errorRate: 0.5,
    });

    // Circuit breaker + retry should handle transient failures.
    // Faults are random, so this outcome is probabilistic unless retries are generous.
    const result = await checkout(testOrder);
    expect(result.status).toBe('completed');
  });

  it('uses cached prices when pricing service is down', async () => {
    faultInjector.configure({
      enabled: true,
      targetServices: ['pricing-api'],
      errorRate: 1.0, // 100% failure
    });

    const result = await getProductPrice('sku-123');
    expect(result.source).toBe('cache');
    expect(result.price).toBeGreaterThan(0);
  });
});
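The timeout fault in the injector only exercises anything useful if the caller bounds how long it waits. The following is a minimal sketch of such a bound, not part of the original skill: `withTimeout` and `TimeoutError` are hypothetical names introduced here for illustration.

```typescript
// Hypothetical error type so callers can tell a deadline miss from other failures.
export class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Timed out after ${ms}ms`);
    this.name = 'TimeoutError';
  }
}

// Hypothetical helper: rejects with TimeoutError if `work` has not settled within `ms`.
// The underlying work is abandoned, not cancelled.
export async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  try {
    // Whichever promise settles first decides the outcome.
    return await Promise.race([work, deadline]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}
```

A caller could then write `await withTimeout(callPaymentAPI(orderId), 3000)` and treat `TimeoutError` like any other transient failure for retry or fallback purposes. The 3000 ms budget is an arbitrary illustration, not a recommendation.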
Details
Chaos engineering principles (Netflix):
- Define steady state (what "normal" looks like in metrics)
- Hypothesize that steady state continues during failure
- Introduce real-world failures (network, disk, process)
- Try to disprove the hypothesis
- Fix weaknesses found
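The steady-state steps above can be sketched as a comparison between a baseline window and an experiment window. This is an illustrative sketch only: the `RequestSample` shape, the function names, and the default tolerances (a 1 percentage point drop in success rate, 1.5x p99 latency) are assumptions, not values from the source.

```typescript
// Hypothetical record of one user-facing request.
interface RequestSample {
  ok: boolean; // did the user-visible request succeed?
  latencyMs: number; // end-to-end latency as seen by the user
}

interface SteadyState {
  successRate: number; // fraction in [0, 1]
  p99LatencyMs: number;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
export function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

export function summarize(samples: RequestSample[]): SteadyState {
  if (samples.length === 0) throw new Error('no samples');
  const okCount = samples.filter((s) => s.ok).length;
  return {
    successRate: okCount / samples.length,
    p99LatencyMs: percentile(samples.map((s) => s.latencyMs), 99),
  };
}

// The hypothesis holds if the experiment window stays within tolerance of the baseline.
export function hypothesisHolds(
  baseline: SteadyState,
  experiment: SteadyState,
  tolerance = { successRateDrop: 0.01, p99Multiplier: 1.5 } // assumed thresholds
): boolean {
  return (
    experiment.successRate >= baseline.successRate - tolerance.successRateDrop &&
    experiment.p99LatencyMs <= baseline.p99LatencyMs * tolerance.p99Multiplier
  );
}
```

Writing the tolerances down before the experiment starts keeps the result falsifiable: a `false` return disproves the hypothesis and points at a weakness to fix.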
Failure types to test:
- Latency injection: Simulate slow responses (100ms, 1s, 5s, 30s)
- Error injection: Return 500, 503, connection refused
- Resource exhaustion: Fill disk, exhaust memory, saturate CPU
- Dependency death: Kill a database, cache, or downstream service entirely
- Clock skew: Jump time forward/backward (affects TTLs, JWT expiry)
- Network partition: Split services so they cannot communicate
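Clock skew is the hardest of these to inject from outside a process, so one common approach is to have code read time through an injectable clock. The sketch below assumes that design; `SkewableClock` and `TtlCache` are hypothetical names used only to show how a forward jump expires entries early.

```typescript
// The code under test reads time through this interface instead of calling Date.now() directly.
interface Clock {
  now(): number; // milliseconds since the Unix epoch
}

// Hypothetical chaos clock: the base time source plus an adjustable offset.
export class SkewableClock implements Clock {
  private offsetMs = 0;
  private readonly base: () => number;

  constructor(base: () => number = () => Date.now()) {
    this.base = base;
  }

  // Positive values jump forward, negative values jump backward.
  skew(deltaMs: number): void {
    this.offsetMs += deltaMs;
  }

  reset(): void {
    this.offsetMs = 0;
  }

  now(): number {
    return this.base() + this.offsetMs;
  }
}

// Minimal TTL cache, included only to demonstrate the effect of skew.
export class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly clock: Clock;
  private readonly ttlMs: number;

  constructor(clock: Clock, ttlMs: number) {
    this.clock = clock;
    this.ttlMs = ttlMs;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.clock.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.clock.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired according to the (possibly skewed) clock
      return undefined;
    }
    return entry.value;
  }
}
```

With a 60-second TTL, calling `clock.skew(61_000)` after a `set` makes the next `get` return `undefined`, while a backward skew keeps an entry alive longer than intended. The same clock can be passed to token-expiry checks to test JWT handling.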
Tools:
- toxiproxy: TCP proxy with configurable toxics
- chaos-mesh: Kubernetes-native chaos
- litmus: Kubernetes chaos
- gremlin: SaaS platform
- pumba: Docker container chaos
Production chaos safety:
- Always have a kill switch to stop the experiment immediately
- Limit blast radius (specific percentage of traffic, specific instances)
- Run during business hours when the team is available
- Start with the smallest possible impact and scale up
- Monitor user-facing metrics, not just infrastructure metrics
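The kill switch and blast radius rules above can be combined in a small guard that sits in front of fault injection. This is a sketch under stated assumptions: `ExperimentGuard`, its fields, and the automatic abort on error rate are hypothetical and would need to be wired to real user-facing metrics.

```typescript
interface ExperimentLimits {
  trafficFraction: number; // share of requests eligible for faults, 0.0 to 1.0
  maxErrorRate: number; // abort when the observed user-facing error rate exceeds this
  minSamples: number; // do not judge the error rate on fewer requests than this
}

// Hypothetical guard combining a manual kill switch, traffic sampling, and an automatic abort.
export class ExperimentGuard {
  private killed = false;
  private total = 0;
  private failed = 0;
  private readonly limits: ExperimentLimits;
  private readonly random: () => number;

  constructor(limits: ExperimentLimits, random: () => number = () => Math.random()) {
    this.limits = limits;
    this.random = random;
  }

  // Manual kill switch: once called, no further faults are injected.
  kill(): void {
    this.killed = true;
  }

  isKilled(): boolean {
    return this.killed;
  }

  // Record the outcome of every user-facing request, faulted or not.
  record(ok: boolean): void {
    this.total += 1;
    if (!ok) this.failed += 1;
    const enoughData = this.total >= this.limits.minSamples;
    if (enoughData && this.failed / this.total > this.limits.maxErrorRate) {
      this.killed = true; // automatic abort
    }
  }

  // Decide per request whether it falls inside the blast radius.
  shouldInject(): boolean {
    if (this.killed) return false;
    return this.random() < this.limits.trafficFraction;
  }
}
```

A service wrapper would check `guard.shouldInject()` before calling `faultInjector.maybeInjectFault(...)`, and an operator endpoint or feature flag would call `guard.kill()`. Starting with a small `trafficFraction` and raising it between runs matches the advice to begin with the smallest possible impact.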
Source
https://principlesofchaos.org/
Process
- Read the instructions and examples in this document.
- Apply the patterns to your implementation, adapting to your specific context.
- Verify your implementation against the details and edge cases listed above.
Harness Integration
- Type: knowledge — this skill is a reference document, not a procedural workflow.
- No tools or state — consumed as context by other skills and agents.
Success Criteria
- The patterns described in this document are applied correctly in the implementation.
- Edge cases and anti-patterns listed in this document are avoided.