Awesome-omni-skill qa-resilience

Resilience engineering for QA: failure mode testing (timeouts/retries/dependency failures), chaos experiments with blast-radius controls, degraded-mode UX expectations, and reliability gates.

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/design/qa-resilience" ~/.claude/skills/diegosouzapw-awesome-omni-skill-qa-resilience && rm -rf "$T"
manifest: skills/design/qa-resilience/SKILL.md
source content

QA Resilience (Dec 2025) — Failure Mode Testing & Production Hardening

This skill provides execution-ready patterns for building resilient, fault-tolerant systems that handle failures gracefully, and for validating those behaviors with tests.

Core references: Principles of Chaos Engineering (https://principlesofchaos.org/), AWS Well-Architected Reliability Pillar (https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html), and Kubernetes probes (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).


When to Use This Skill

Claude should invoke this skill when a user requests:

  • Circuit breaker implementation
  • Retry strategies and exponential backoff
  • Bulkhead pattern for resource isolation
  • Timeout policies for external dependencies
  • Graceful degradation and fallback mechanisms
  • Health check design (liveness vs readiness)
  • Error handling best practices
  • Chaos engineering setup
  • Production hardening strategies
  • Fault injection testing

Core QA (Default)

Failure Mode Testing (What to Validate)

  • Timeouts: every network call and DB query has a bounded timeout; validate timeout budgets across chained calls.
  • Retries: bounded retries with backoff + jitter; validate idempotency and “retry storm” safeguards.
  • Dependency failure: partial outage, slow downstream, rate limiting, DNS failures, auth failures.
  • Degraded-mode UX: what the user sees/gets when dependencies fail (cached/stale/partial responses).
  • Health checks: validate liveness/readiness/startup probe behavior (Kubernetes probes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).
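
A minimal sketch of the timeout bullet above: a fault-injection test that stalls a dependency and asserts the client respects its timeout budget. It assumes Node 18+ (global fetch, AbortSignal.timeout) and the built-in node:test runner; the /orders path is hypothetical.

  // Fault injection: a dependency that accepts connections but never responds.
  import { test } from 'node:test';
  import assert from 'node:assert/strict';
  import { createServer } from 'node:http';

  test('call fails within its timeout budget when the dependency hangs', async () => {
    const stalled = createServer(() => { /* never write a response */ });
    await new Promise<void>((resolve) => stalled.listen(0, () => resolve()));
    const { port } = stalled.address() as { port: number };

    const started = Date.now();
    await assert.rejects(
      fetch(`http://127.0.0.1:${port}/orders`, { signal: AbortSignal.timeout(500) }),
    );
    const elapsed = Date.now() - started;

    // The call must honor its 500 ms budget (small scheduling slack allowed).
    assert.ok(elapsed < 1_000, `expected fast failure, took ${elapsed} ms`);
    stalled.close();
  });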

Right-Sized Chaos Engineering (Safe by Construction)

  • Define steady state and hypothesis (Principles of Chaos Engineering: https://principlesofchaos.org/).
  • Start in non-prod; in prod, use minimal blast radius, timeboxed runs, and explicit abort criteria.
  • REQUIRED: rollback plan, owners, and observability signals before running experiments.
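
A minimal TypeScript sketch of timeboxing and abort criteria around a fault-injection run. The inject, revert, and fetchErrorRate hooks are hypothetical placeholders for your chaos tooling and metrics backend.

  // Run a fault-injection experiment with a hard timebox, a steady-state abort
  // criterion, and a rollback that always executes. All hooks are hypothetical.
  type Experiment = {
    durationMs: number;                    // hard timebox for the run
    abortErrorRate: number;                // abort if the error rate exceeds this
    inject: () => Promise<void>;           // start the fault (e.g. added latency)
    revert: () => Promise<void>;           // remove the fault / roll back
    fetchErrorRate: () => Promise<number>; // read the steady-state signal
  };

  async function runTimeboxed(exp: Experiment): Promise<'completed' | 'aborted'> {
    const deadline = Date.now() + exp.durationMs;
    await exp.inject();
    try {
      while (Date.now() < deadline) {
        const errorRate = await exp.fetchErrorRate();
        if (errorRate > exp.abortErrorRate) return 'aborted'; // abort criterion hit
        await new Promise((r) => setTimeout(r, 5_000));       // poll the steady state
      }
      return 'completed';
    } finally {
      await exp.revert(); // rollback runs on completion, abort, or error
    }
  }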

Load/Perf + Production Guardrails

  • Load tests validate capacity and tail latency; resilience tests validate behavior under failure.
  • Guardrails [Inference]:
    • Run heavy resilience/perf suites on schedule (nightly) and on canary deploys, not on every PR.
    • Gate releases on regression budgets (p99 latency, error rate) rather than on raw CPU/memory.
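
To make the gating guardrail concrete, a minimal sketch of a canary gate that compares p99 latency and error rate against regression budgets; the snapshot values and budget numbers are hypothetical placeholders for your own SLOs.

  // Gate a canary on regression budgets (p99 latency, error rate), not raw CPU/memory.
  // Snapshot values and budgets below are hypothetical.
  interface Snapshot { p99Ms: number; errorRate: number; }

  const BUDGET = { p99RegressionMs: 50, errorRateRegression: 0.005 };

  function canaryPasses(baseline: Snapshot, canary: Snapshot): boolean {
    return (
      canary.p99Ms - baseline.p99Ms <= BUDGET.p99RegressionMs &&
      canary.errorRate - baseline.errorRate <= BUDGET.errorRateRegression
    );
  }

  // Example: block the release when the canary regresses beyond budget.
  const baseline: Snapshot = { p99Ms: 180, errorRate: 0.002 };
  const canary: Snapshot = { p99Ms: 260, errorRate: 0.004 };
  if (!canaryPasses(baseline, canary)) {
    console.error('Canary exceeded regression budget; blocking release.');
    process.exitCode = 1;
  }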

Flake Control for Resilience Tests

  • Chaos/fault injection can look “flaky” if the experiment is not deterministic.
  • Stabilize the experiment first: fixed blast radius, controlled fault parameters, deterministic duration, strong observability.

Debugging Ergonomics

  • Every resilience test run should capture: experiment parameters, target scope, timestamps, and trace/log links for failures.
  • Prefer tracing/metrics to confirm the failure is the expected one (not collateral damage).

Do / Avoid

Do:

  • Test degraded mode explicitly; document expected UX and API responses.
  • Validate retries/timeouts in integration tests with fault injection.

Avoid:

  • Unbounded retries and missing timeouts (amplifies incidents).
  • “Happy-path only” testing that ignores downstream failure classes.

Quick Reference

Pattern · Library/Tool · When to Use · Configuration

  • Circuit Breaker · Opossum (Node.js), pybreaker (Python) · External API calls, database connections · Threshold: 50%, timeout: 30s, volume: 10
  • Retry with Backoff · p-retry (Node.js), tenacity (Python) · Transient failures, rate limits · Max retries: 5, exponential backoff with jitter
  • Bulkhead Isolation · Semaphore pattern, thread pools · Prevent resource exhaustion · Pool size based on workload (CPU cores + wait/service time)
  • Timeout Policies · AbortSignal, statement timeout · Slow dependencies, database queries · Connection: 5s, API: 10-30s, DB query: 5-10s
  • Graceful Degradation · Feature flags, cached fallback · Non-critical features, ML recommendations · Cache recent data, default values, reduced functionality
  • Health Checks · Kubernetes probes, /health endpoints · Service orchestration, load balancing · Liveness: simple, readiness: dependency checks, startup: slow apps
  • Chaos Engineering · Chaos Toolkit, Netflix Chaos Monkey · Proactive resilience testing · Start non-prod, define hypothesis, automate failure injection
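
A minimal circuit breaker sketch matching the Circuit Breaker entry above, assuming the opossum package for Node.js; the fetchInventory function and its URL are hypothetical.

  // Wrap an external call in a circuit breaker using opossum (hypothetical service URL).
  import CircuitBreaker from 'opossum';

  async function fetchInventory(sku: string): Promise<unknown> {
    const res = await fetch(`https://inventory.internal/skus/${sku}`, {
      signal: AbortSignal.timeout(3_000),        // per-call timeout
    });
    if (!res.ok) throw new Error(`inventory responded ${res.status}`);
    return res.json();
  }

  const breaker = new CircuitBreaker(fetchInventory, {
    errorThresholdPercentage: 50, // open when half of recent calls fail
    resetTimeout: 30_000,         // after 30s, allow a probe request (half-open)
    volumeThreshold: 10,          // require 10 calls before the stats count
  });

  breaker.fallback(() => ({ available: false, source: 'fallback' }));
  breaker.on('open', () => console.warn('inventory breaker opened'));

  // Usage: fire() forwards its arguments to the wrapped function.
  const inventory = await breaker.fire('SKU-123');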

Decision Tree: Resilience Pattern Selection

Failure scenario: [System Dependency Type]
    ├─ External API/Service?
    │   ├─ Transient errors? → Retry with exponential backoff + jitter
    │   ├─ Cascading failures? → Circuit breaker + fallback
    │   ├─ Rate limiting? → Retry, respecting the Retry-After header
    │   └─ Slow response? → Timeout + circuit breaker
    │
    ├─ Database Dependency?
    │   ├─ Connection pool exhaustion? → Bulkhead isolation + timeout
    │   ├─ Query timeout? → Statement timeout (5-10s)
    │   ├─ Replica lag? → Fall back to reads from the primary
    │   └─ Connection failures? → Retry + circuit breaker
    │
    ├─ Non-Critical Feature?
    │   ├─ ML recommendations? → Feature flag + default values fallback
    │   ├─ Search service? → Cached results or basic SQL fallback
    │   ├─ Email/notifications? → Log error, don't block main flow
    │   └─ Analytics? → Fire-and-forget, circuit breaker for protection
    │
    ├─ Kubernetes/Orchestration?
    │   ├─ Service discovery? → Liveness + readiness + startup probes
    │   ├─ Slow startup? → Startup probe (failureThreshold: 30)
    │   ├─ Load balancing? → Readiness probe (check dependencies)
    │   └─ Auto-restart? → Liveness probe (simple check)
    │
    └─ Testing Resilience?
        ├─ Pre-production? → Chaos Toolkit experiments
        ├─ Production (low risk)? → Feature flags + canary deployments
        ├─ Scheduled testing? → Game days (quarterly)
        └─ Continuous chaos? → Netflix Chaos Monkey (1% failure injection)
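
For the transient-error and rate-limit branches above, a minimal retry sketch assuming the p-retry package; the API URL is hypothetical and the Retry-After handling is deliberately simplified.

  // Bounded retries with exponential backoff + jitter, plus basic respect for
  // Retry-After on 429s. The endpoint is hypothetical.
  import pRetry, { AbortError } from 'p-retry';

  async function getProfile(id: string): Promise<unknown> {
    const res = await fetch(`https://api.example.com/profiles/${id}`);
    if (res.status === 429) {
      // Simplified: wait out Retry-After here, then surface a retryable error.
      const waitSec = Number(res.headers.get('retry-after') ?? 1);
      await new Promise((r) => setTimeout(r, waitSec * 1_000));
      throw new Error('rate limited');
    }
    if (res.status >= 400 && res.status < 500) {
      throw new AbortError(`client error ${res.status}`); // do not retry other 4xx
    }
    if (!res.ok) throw new Error(`server error ${res.status}`); // retryable
    return res.json();
  }

  const profile = await pRetry(() => getProfile('user-42'), {
    retries: 5,        // bounded, matching the quick reference above
    factor: 2,         // exponential backoff
    minTimeout: 200,
    maxTimeout: 5_000,
    randomize: true,   // jitter to avoid synchronized retry storms
    onFailedAttempt: (err) =>
      console.warn(`attempt ${err.attemptNumber} failed; ${err.retriesLeft} left`),
  });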

Navigation: Core Resilience Patterns

  • Circuit Breaker Patterns - Prevent cascading failures

    • Classic circuit breaker implementation (Node.js, Python)
    • Adaptive circuit breakers with ML-based thresholds (2024-2025)
    • Fallback strategies and event monitoring
  • Retry Patterns - Handle transient failures

    • Exponential backoff with jitter
    • Retry decision table (which errors to retry)
    • Idempotency patterns and Retry-After headers
  • Bulkhead Isolation - Resource compartmentalization

    • Semaphore pattern for thread/connection pools
    • Database connection pooling strategies
    • Queue-based bulkheads with load shedding
  • Timeout Policies - Prevent resource exhaustion

    • Connection, request, and idle timeouts
    • Database query timeouts (PostgreSQL, MySQL)
    • Nested timeout budgets for chained operations
  • Graceful Degradation - Maintain partial functionality

    • Cached fallback strategies
    • Default values and feature toggles
    • Partial responses with Promise.allSettled
  • Health Check Patterns - Service availability monitoring

    • Liveness, readiness, and startup probes
    • Kubernetes probe configuration
    • Shallow vs deep health checks
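
Following the health check entry above, a minimal sketch of separate liveness and readiness endpoints using Node's built-in http module; checkDatabase is a hypothetical dependency probe, and the /livez and /readyz paths are conventions, not requirements.

  // Liveness stays dependency-free; readiness checks critical dependencies.
  import { createServer } from 'node:http';

  async function checkDatabase(): Promise<boolean> {
    // Hypothetical: e.g. run `SELECT 1` against the pool; stubbed for the sketch.
    return true;
  }

  createServer(async (req, res) => {
    if (req.url === '/livez') {
      // Liveness: the process is up and the event loop responds.
      res.writeHead(200).end('ok');
    } else if (req.url === '/readyz') {
      // Readiness: only accept traffic when critical dependencies are reachable.
      const ready = await checkDatabase();
      res.writeHead(ready ? 200 : 503).end(ready ? 'ready' : 'not ready');
    } else {
      res.writeHead(404).end();
    }
  }).listen(8080);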

Navigation: Operational Resources

  • Resilience Checklists - Production hardening checklists

    • Dependency resilience
    • Health and readiness probes
    • Observability for resilience
    • Failure testing
  • Chaos Engineering Guide - Safe reliability experiments

    • Planning chaos experiments
    • Common failure injection scenarios
    • Execution steps and debrief checklist

Navigation: Templates

  • Resilience Runbook Template - Service hardening profile

    • Dependencies and SLOs
    • Fallback strategies
    • Rollback procedures
  • Fault Injection Playbook - Chaos testing script

    • Success signals
    • Rollback criteria
    • Post-experiment debrief
  • Resilience Test Plan Template - Failure mode test plan (timeouts/retries/degraded mode)

    • Scope and dependencies
    • Fault matrix and expected behavior
    • Observability signals and pass/fail criteria

Quick Decision Matrix

Scenario · Recommendation

  • External API calls · Circuit breaker + retry with exponential backoff
  • Database queries · Timeout + connection pooling + circuit breaker
  • Slow dependency · Bulkhead isolation + timeout
  • Non-critical feature · Feature flag + graceful degradation
  • Kubernetes deployment · Liveness + readiness + startup probes
  • Testing resilience · Chaos engineering experiments
  • Transient failures · Retry with exponential backoff + jitter
  • Cascading failures · Circuit breaker + bulkhead
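
To illustrate the non-critical feature row above, a minimal partial-response sketch with Promise.allSettled; the orders and recommendations services are hypothetical, with orders treated as critical and recommendations as degradable.

  // Build a partial response when a non-critical dependency fails.
  type PageData = { orders: unknown[]; recommendations: unknown[]; degraded: boolean };

  async function fetchJson(url: string): Promise<unknown[]> {
    const res = await fetch(url, { signal: AbortSignal.timeout(2_000) });
    if (!res.ok) throw new Error(`${url} responded ${res.status}`);
    return res.json() as Promise<unknown[]>;
  }

  async function loadAccountPage(userId: string): Promise<PageData> {
    const [orders, recs] = await Promise.allSettled([
      fetchJson(`https://orders.internal/users/${userId}`), // critical
      fetchJson(`https://recs.internal/users/${userId}`),   // non-critical
    ]);

    if (orders.status === 'rejected') throw orders.reason;  // critical path still fails fast
    return {
      orders: orders.value,
      recommendations: recs.status === 'fulfilled' ? recs.value : [], // default-value fallback
      degraded: recs.status === 'rejected', // surface degraded mode to the caller/UI
    };
  }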

Anti-Patterns to Avoid

  • No timeouts - Infinite waits exhaust resources
  • Infinite retries - Amplifies problems (thundering herd)
  • No circuit breakers - Cascading failures
  • Tight coupling - One failure breaks everything
  • Silent failures - No observability into degraded state
  • No bulkheads - Shared thread pools exhaust all resources
  • Testing only happy path - Production reveals failures
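
In contrast to the "no bulkheads" anti-pattern above, a minimal semaphore bulkhead sketch in TypeScript; the concurrency limit and the wrapped reporting call are hypothetical.

  // Cap concurrency for one dependency so it cannot consume every slot in the process.
  class Semaphore {
    private waiters: Array<() => void> = [];
    private available: number;
    constructor(limit: number) { this.available = limit; }

    private async acquire(): Promise<void> {
      if (this.available > 0) { this.available--; return; }
      await new Promise<void>((resolve) => this.waiters.push(() => resolve()));
    }

    private release(): void {
      const next = this.waiters.shift();
      if (next) next();        // hand the slot straight to the next waiter
      else this.available++;
    }

    async run<T>(task: () => Promise<T>): Promise<T> {
      await this.acquire();
      try { return await task(); } finally { this.release(); }
    }
  }

  // Usage: limit the slow reporting dependency to 5 concurrent calls so it
  // cannot starve latency-sensitive work sharing the same process.
  const reportingBulkhead = new Semaphore(5);
  const summary = await reportingBulkhead.run(() =>
    fetch('https://reporting.internal/summary').then((r) => r.json()),
  );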

Optional: AI / Automation

Do:

  • Use AI to propose failure-mode scenarios from an explicit risk register; keep only scenarios that map to known dependencies and business journeys.
  • Use AI to summarize experiment results (metrics deltas, error clusters) and draft postmortem timelines; verify with telemetry.

Avoid:

  • “Scenario generation” without a risk map (creates noise and wasted load).
  • Letting AI relax timeouts/retries or remove guardrails.

Usage Notes

Pattern Selection:

  • Start with circuit breakers for external dependencies
  • Add retries for transient failures (network, rate limits)
  • Use bulkheads to prevent resource exhaustion
  • Combine patterns for defense-in-depth
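
A minimal sketch of defense-in-depth by composing the patterns above: a per-attempt timeout inside bounded retries, all wrapped by a circuit breaker. It assumes the opossum and p-retry packages; the checkout URL is hypothetical.

  // Composition order: timeout (innermost) -> bounded retries -> circuit breaker (outermost).
  import CircuitBreaker from 'opossum';
  import pRetry from 'p-retry';

  const callCheckout = (cartId: string) =>
    pRetry(
      async () => {
        const res = await fetch(`https://checkout.internal/carts/${cartId}`, {
          signal: AbortSignal.timeout(2_000),  // innermost: per-attempt timeout
        });
        if (!res.ok) throw new Error(`checkout responded ${res.status}`);
        return res.json();
      },
      { retries: 2, randomize: true },         // middle: bounded retries with jitter
    );

  const checkoutBreaker = new CircuitBreaker(callCheckout, {
    errorThresholdPercentage: 50,              // outermost: stop calling while it is down
    resetTimeout: 30_000,
  });

  const cart = await checkoutBreaker.fire('cart-7');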

Observability:

  • Track circuit breaker state changes
  • Monitor retry attempts and success rates
  • Alert on degraded mode duration
  • Measure recovery time after failures
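
A minimal sketch of the first two bullets, assuming the opossum breaker and p-retry usage from the earlier sketches; the metrics client is a hypothetical stats interface, and the breaker is typed loosely here to keep the snippet self-contained.

  // Emit metrics for breaker state changes; count retries in p-retry's onFailedAttempt hook.
  declare const metrics: { increment: (name: string) => void };
  declare const breaker: { on: (event: string, handler: () => void) => void };

  breaker.on('open', () => metrics.increment('inventory.breaker.open'));
  breaker.on('halfOpen', () => metrics.increment('inventory.breaker.half_open'));
  breaker.on('close', () => metrics.increment('inventory.breaker.close'));
  breaker.on('fallback', () => metrics.increment('inventory.breaker.fallback'));

  // Retry attempts and success rates: increment a counter inside onFailedAttempt
  // (see the retry sketch earlier) and after each successful pRetry call.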

Testing:

  • Start chaos experiments in non-production
  • Define hypothesis before failure injection
  • Set blast radius limits and auto-revert
  • Document learnings and action items

Success Criteria: Systems gracefully handle failures, recover automatically, maintain partial functionality during outages, and fail fast to prevent cascading failures. Resilience is tested proactively through chaos engineering.