Harness-engineering resilience-chaos-testing

Chaos Testing

install

source · Clone the upstream repo

git clone https://github.com/Intense-Visions/harness-engineering

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/Intense-Visions/harness-engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agents/skills/claude-code/resilience-chaos-testing" ~/.claude/skills/intense-visions-harness-engineering-resilience-chaos-testing && rm -rf "$T"

manifest: agents/skills/claude-code/resilience-chaos-testing/SKILL.md

source content

Inject controlled failures to verify that fallbacks, retries, and circuit breakers work under real conditions.

When to Use

Validating that resilience patterns (circuit breakers, retries, fallbacks) actually work
Preparing for production incidents by simulating them in controlled environments
Building confidence that the system degrades gracefully under partial failure
Discovering hidden dependencies and single points of failure

Instructions

Start with a steady-state hypothesis, for example: "Users can still check out even when the recommendation service is down."
Inject one failure at a time. Do not combine failures until individual effects are understood.
Start in development/staging. Move to production only with tight blast radius controls.
Types of failure injection: latency, errors, resource exhaustion, dependency unavailability, clock skew.
Measure impact on user-facing metrics (error rate, p99 latency, success rate), not just internal metrics.
Build failure injection as middleware or wrappers that can be toggled on/off.

// chaos/fault-injector.ts
interface FaultConfig {
  enabled: boolean;
  latencyMs?: number; // Add artificial latency
  errorRate?: number; // 0.0 to 1.0 probability of error
  errorCode?: number; // HTTP status to return
  timeoutRate?: number; // 0.0 to 1.0 probability of timeout
  targetServices?: string[]; // Only affect specific services
}

export class FaultInjector {
  private config: FaultConfig = { enabled: false };

  configure(config: Partial<FaultConfig>) {
    this.config = { ...this.config, ...config };
  }

  async maybeInjectFault(serviceName: string, signal?: AbortSignal): Promise<void> {
    if (!this.config.enabled) return;
    if (this.config.targetServices && !this.config.targetServices.includes(serviceName)) return;

    // Inject latency
    if (this.config.latencyMs) {
      await new Promise((r) => setTimeout(r, this.config.latencyMs));
    }

    // Inject timeout: hang until the caller's AbortSignal fires.
    // Without a signal this never settles, so the caller must enforce its own timeout.
    if (this.config.timeoutRate && Math.random() < this.config.timeoutRate) {
      await new Promise<never>((_, reject) => {
        if (signal?.aborted) return reject(signal.reason);
        signal?.addEventListener('abort', () => reject(signal.reason), { once: true });
      });
    }

    // Inject error
    if (this.config.errorRate && Math.random() < this.config.errorRate) {
      throw new ChaosError(`Injected fault for ${serviceName}`, this.config.errorCode ?? 500);
    }
  }
}

export class ChaosError extends Error {
  constructor(
    message: string,
    public readonly statusCode: number
  ) {
    super(message);
    this.name = 'ChaosError';
  }
}

// Integration with services
const faultInjector = new FaultInjector();

// Enable in test/staging via environment variable
if (process.env.CHAOS_ENABLED === 'true') {
  faultInjector.configure({
    enabled: true,
    targetServices: ['payment-api'],
    errorRate: 0.3, // 30% of payment API calls fail
    latencyMs: 2000, // Add 2s latency to every targeted call, including the ones that then fail
  });
}

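// Wrap service calls (PaymentResult is assumed to be defined elsewhere in the service)
export async function callPaymentAPI(orderId: string): Promise<PaymentResult> {
  await faultInjector.maybeInjectFault('payment-api');
  const response = await fetch(`https://payment.example.com/charge/${orderId}`);
  // Surface real HTTP failures the same way injected ones surface, so retries see both
  if (!response.ok) throw new Error(`Payment API responded with ${response.status}`);
  return response.json();
}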

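// Chaos test scenarios (checkout, testOrder, and getProductPrice come from the application under test)
describe('checkout resilience', () => {
  // Reset in afterEach so a failed assertion cannot leave faults enabled for later tests.
  // configure() merges into the existing config, so stale fields are cleared explicitly.
  afterEach(() => {
    faultInjector.configure({
      enabled: false,
      targetServices: undefined,
      latencyMs: undefined,
      errorRate: undefined,
      timeoutRate: undefined,
    });
  });

  it('completes checkout when payment service has 50% error rate', async () => {
    faultInjector.configure({
      enabled: true,
      targetServices: ['payment-api'],
      errorRate: 0.5,
    });

    // Circuit breaker + retry should handle transient failures.
    // Failures are random, so this can still flake if every retry happens to fail.
    const result = await checkout(testOrder);
    expect(result.status).toBe('completed');
  });

  it('uses cached prices when pricing service is down', async () => {
    faultInjector.configure({
      enabled: true,
      targetServices: ['pricing-api'],
      errorRate: 1.0, // 100% failure
    });

    const result = await getProductPrice('sku-123');
    expect(result.source).toBe('cache');
    expect(result.price).toBeGreaterThan(0);
  });
});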
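
The timeout fault above hangs on purpose, so it only proves something if the caller enforces a deadline of its own. The following is a minimal sketch of such a caller-side deadline. `withTimeout` and `TimeoutError` are hypothetical helpers introduced here for illustration and are not part of the skill's code.

```typescript
// Hypothetical helper: the error a caller sees when its deadline passes.
export class TimeoutError extends Error {
  readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    super(`Operation exceeded ${timeoutMs}ms`);
    this.name = 'TimeoutError';
    this.timeoutMs = timeoutMs;
  }
}

// Races the operation against a timer and rejects with TimeoutError if the timer wins.
// The underlying operation is not cancelled; it is only abandoned by the caller.
export async function withTimeout<T>(operation: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(timeoutMs)), timeoutMs);
  });
  try {
    return await Promise.race([operation, deadline]);
  } finally {
    // Always clear the timer so a fast operation does not keep the process alive.
    clearTimeout(timer);
  }
}
```

A call such as `withTimeout(callPaymentAPI(orderId), 3000)` would then turn an injected hang into a rejection that retry or fallback logic can act on. The 3000 ms budget is only a placeholder.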

Details

Chaos engineering principles (Netflix):

Define steady state (what "normal" looks like in metrics)
Hypothesize that steady state continues during failure
Introduce real-world failures (network, disk, process)
Try to disprove the hypothesis
Fix weaknesses found
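
The first two steps above can be made concrete by expressing steady state as thresholds on user-facing metrics and checking them against samples collected during the experiment. This is a sketch under assumptions: the type names, the nearest-rank percentile, and the idea of collecting `RequestSample` records are illustrative choices, not something the source prescribes.

```typescript
// Hypothetical shapes for illustration only.
export interface RequestSample {
  durationMs: number;
  ok: boolean;
}

export interface SteadyStateThresholds {
  maxErrorRate: number; // 0.0 to 1.0
  maxP99Ms: number;
}

export interface SteadyStateReport {
  errorRate: number;
  p99Ms: number;
  holds: boolean;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length, Math.max(1, rank)) - 1;
  return sorted[index];
}

// Returns whether the hypothesis still holds for the observed samples.
export function evaluateSteadyState(
  samples: RequestSample[],
  thresholds: SteadyStateThresholds
): SteadyStateReport {
  // With no traffic there is no evidence, so the hypothesis is not treated as confirmed.
  if (samples.length === 0) return { errorRate: 0, p99Ms: 0, holds: false };

  const failures = samples.filter((s) => !s.ok).length;
  const errorRate = failures / samples.length;
  const p99Ms = percentile(samples.map((s) => s.durationMs), 99);
  const holds = errorRate <= thresholds.maxErrorRate && p99Ms <= thresholds.maxP99Ms;
  return { errorRate, p99Ms, holds };
}
```

Running the same evaluation on a baseline window and on the fault window gives a direct comparison. A report with `holds: false` during the fault window is the disproof the experiment is looking for.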

Failure types to test:

Latency injection: Simulate slow responses (100ms, 1s, 5s, 30s)
Error injection: Return 500, 503, connection refused
Resource exhaustion: Fill disk, exhaust memory, saturate CPU
Dependency death: Kill a database, cache, or downstream service entirely
Clock skew: Jump time forward/backward (affects TTLs, JWT expiry)
Network partition: Split services so they cannot communicate
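
Clock skew is easier to inject when code reads time through an interface instead of calling `Date.now()` directly. The sketch below assumes that design. `Clock`, `SkewedClock`, and `TtlCache` are hypothetical names used only to show how a forward jump expires a cached entry early.

```typescript
// Hypothetical names throughout; shown only to illustrate clock-skew injection.
export interface Clock {
  now(): number; // milliseconds since the Unix epoch
}

export const systemClock: Clock = { now: () => Date.now() };

// Wraps another clock and shifts its reading by an adjustable offset.
export class SkewedClock implements Clock {
  private offsetMs = 0;
  private readonly base: Clock;

  constructor(base: Clock = systemClock) {
    this.base = base;
  }

  // Positive values jump forward, negative values jump backward.
  skew(deltaMs: number): void {
    this.offsetMs += deltaMs;
  }

  now(): number {
    return this.base.now() + this.offsetMs;
  }
}

// A tiny TTL cache that reads time only through the injected clock.
export class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly clock: Clock;

  constructor(ttlMs: number, clock: Clock) {
    this.ttlMs = ttlMs;
    this.clock = clock;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.clock.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.clock.now() >= entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}

// Usage: a fixed base clock keeps the example deterministic.
const fixedClock: Clock = { now: () => 1_000_000 };
const skewedClock = new SkewedClock(fixedClock);
const priceCache = new TtlCache<number>(60_000, skewedClock);
priceCache.set('sku-123', 1999);
const beforeSkew = priceCache.get('sku-123'); // 1999
skewedClock.skew(61_000); // jump 61 seconds forward, past the 60-second TTL
const afterSkew = priceCache.get('sku-123'); // undefined
```

A backward jump has the opposite effect: entries outlive their intended TTL, which is the case worth checking for tokens and cached credentials.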

Tools:

toxiproxy: TCP proxy with configurable toxics
chaos-mesh: Kubernetes-native
litmus: Kubernetes chaos
gremlin: SaaS platform
pumba: Docker container chaos

Production chaos safety:

Always have a kill switch to stop the experiment immediately
Limit blast radius (specific percentage of traffic, specific instances)
Run during business hours when the team is available
Start with the smallest possible impact and scale up
Monitor user-facing metrics, not just infrastructure metrics
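
The first two safety rules above, a kill switch and a limited blast radius, can be combined into one guard that every injection point consults. This is a sketch under assumptions: `ExperimentGuard`, `BlastRadius`, and `bucketOf` are hypothetical names, and hashing the request ID into 100 buckets is one possible way to select a stable percentage of traffic.

```typescript
// Hypothetical names; a sketch of a kill switch plus blast-radius limits.
export interface BlastRadius {
  trafficPercent: number; // 0 to 100
  allowedInstances?: string[]; // omit to allow every instance
}

// FNV-1a style 32-bit hash over UTF-16 code units, reduced to a bucket in 0..99.
// The same ID always lands in the same bucket, so a request is consistently in or out.
export function bucketOf(id: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

export class ExperimentGuard {
  private killed = false;
  private readonly radius: BlastRadius;

  constructor(radius: BlastRadius) {
    this.radius = radius;
  }

  // Kill switch: once called, no further faults are injected.
  kill(): void {
    this.killed = true;
  }

  // True only if the experiment is live, the instance is in scope,
  // and the request falls inside the selected traffic percentage.
  shouldInject(requestId: string, instanceId: string): boolean {
    if (this.killed) return false;
    const allowed = this.radius.allowedInstances;
    if (allowed && !allowed.includes(instanceId)) return false;
    return bucketOf(requestId) < this.radius.trafficPercent;
  }
}
```

Starting with a small `trafficPercent` on a single instance and raising it gradually matches the advice to begin with the smallest possible impact. In a multi-process deployment the `killed` flag would need to live in shared configuration; this in-memory version only covers one process.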

Source

https://principlesofchaos.org/

Process

Read the instructions and examples in this document.
Apply the patterns to your implementation, adapting to your specific context.
Verify your implementation against the details and edge cases listed above.

Harness Integration

Type: knowledge — this skill is a reference document, not a procedural workflow.
No tools or state — consumed as context by other skills and agents.

Success Criteria

The patterns described in this document are applied correctly in the implementation.
Edge cases and anti-patterns listed in this document are avoided.