```bash
# Clone the full repository
git clone https://github.com/Intense-Visions/harness-engineering

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/Intense-Visions/harness-engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agents/skills/claude-code/harness-resilience" ~/.claude/skills/intense-visions-harness-engineering-harness-resilience-3ddba7 && rm -rf "$T"
```
`agents/skills/claude-code/harness-resilience/SKILL.md`

# Harness Resilience
Circuit breakers, rate limiting, bulkheads, retry patterns, and fault tolerance analysis. Detects missing resilience patterns, evaluates failure modes, and recommends concrete configurations for production-grade fault tolerance.
## When to Use
- When adding new external service integrations (APIs, databases, message queues) that need fault tolerance
- On PRs that modify service-to-service communication, HTTP clients, or middleware chains
- To audit existing resilience patterns for correctness, completeness, and observability
- NOT for load testing or capacity planning (use harness-load-testing)
- NOT for incident response after a failure has occurred (use harness-incident-response)
- NOT for security-focused rate limiting like DDoS protection (use harness-security-review)
## Process

### Phase 1: DETECT -- Identify Dependencies and Existing Patterns
- Inventory external dependencies. Scan the codebase for outbound connections:
  - HTTP clients: `axios`, `fetch`, `got`, `HttpClient`, `RestTemplate`, `reqwest`
  - Database connections: connection pool configs, ORM initialization, query builders
  - Message queues: RabbitMQ, Kafka, SQS, Redis pub/sub client initialization
  - gRPC channels: proto client stubs, channel creation, dial options
  - Third-party SDKs: Stripe, Twilio, SendGrid, AWS SDK calls
- Map existing resilience patterns. For each dependency found, check for:
  - Circuit breakers: `hystrix`, `opossum`, `cockatiel`, `Polly`, `resilience4j` usage
  - Retry logic: exponential backoff, jitter, max attempts configuration
  - Timeouts: connection and request timeout settings
  - Rate limiters: token bucket, sliding window, or fixed window implementations
  - Bulkheads: thread pool isolation, semaphore limits, connection pool sizing
  - Fallbacks: cache-aside patterns, default values, degraded responses
- Detect anti-patterns. Flag common resilience mistakes (a fix for the first one is sketched after this list):
  - Unbounded retries without backoff or max attempts
  - Missing timeouts on HTTP clients or database queries
  - Circuit breaker without a fallback handler
  - Retry on non-idempotent operations (POST, DELETE without idempotency keys)
  - Rate limiter with no monitoring or alerting on limit hits
- Build the dependency map. Produce a structured inventory (one possible shape follows this list):
  - Dependency name, type (HTTP, gRPC, database, queue), criticality (critical, degraded, optional)
  - Current resilience patterns applied (or "none")
  - Identified gaps and anti-patterns
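The most damaging anti-pattern above is the unbounded retry. A minimal sketch of the fix, assuming a runtime with global `fetch` and `AbortSignal.timeout` (Node 18+); the function name and values are illustrative:

```typescript
// Fix for the unbounded-retry anti-pattern: a hard attempt cap, exponential
// backoff, and full jitter. Names and values are illustrative.
async function fetchWithRetry(
  url: string,
  maxAttempts = 3,    // hard cap -- never retry forever
  baseDelayMs = 200,  // first backoff step
  maxDelayMs = 5_000, // ceiling so the backoff cannot grow without bound
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      // Every attempt carries its own timeout; a hung socket cannot stall the loop.
      const res = await fetch(url, { signal: AbortSignal.timeout(3_000) });
      if (res.ok) return res;
      // 4xx (except 429) is not retryable: repeating a bad request cannot fix it.
      if (res.status < 500 && res.status !== 429) return res;
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      lastError = err; // network error or timeout -- retryable
    }
    // Exponential backoff with full jitter: sleep a random time in [0, cap).
    const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
    await new Promise((resolve) => setTimeout(resolve, Math.random() * cap));
  }
  throw lastError;
}
```

Full jitter spreads retries uniformly across the backoff window, so clients that failed together do not retry together.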
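And one possible shape for the dependency map itself; the field names are illustrative assumptions, not a format the skill mandates:

```typescript
// One possible shape for the Phase 1 inventory. Field names are assumptions.
type DependencyType = 'http' | 'grpc' | 'database' | 'queue';
type Criticality = 'critical' | 'degraded' | 'optional';

interface DependencyEntry {
  name: string;           // e.g. "Stripe API"
  type: DependencyType;
  criticality: Criticality;
  patterns: string[];     // applied patterns; empty array means "none"
  gaps: string[];         // patterns that should exist but do not
  antiPatterns: string[]; // active risks, with file:line where known
}

const inventory: DependencyEntry[] = [
  {
    name: 'Stripe API',
    type: 'http',
    criticality: 'critical',
    patterns: ['timeout'],
    gaps: ['circuit breaker', 'fallback'],
    antiPatterns: ['retry on POST /charges without idempotency key'],
  },
];
```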
### Phase 2: ANALYZE -- Evaluate Failure Modes
- Classify failure modes per dependency. For each external dependency:
  - Timeout: The dependency responds too slowly or not at all
  - Error burst: The dependency returns errors at a rate above normal
  - Partial degradation: The dependency responds but with reduced functionality
  - Total outage: The dependency is completely unreachable
  - Data inconsistency: The dependency returns stale or incorrect data
- Assess blast radius. For each failure mode:
  - Which features become unavailable?
  - Which downstream services are affected?
  - What is the user-visible impact?
  - Can the system continue to serve other requests?
- Evaluate current coverage. Score each dependency on resilience coverage:
  - Full: Circuit breaker + retry + timeout + fallback + monitoring
  - Partial: Some patterns present but gaps exist (e.g., retry without circuit breaker)
  - None: No resilience patterns applied
- Prioritize gaps by risk. Combine criticality and coverage:
  - Critical dependency with no resilience = P0 (immediate)
  - Critical dependency with partial resilience = P1 (next sprint)
  - Optional dependency with no resilience = P2 (backlog)
  - Any dependency with anti-patterns = P0 (anti-patterns are active risks)
- Check observability. For existing patterns, verify they emit metrics (a wiring sketch follows this list):
  - Circuit breaker state changes (open/half-open/closed)
  - Retry attempt counts and final outcomes
  - Rate limiter rejection counts
  - Timeout occurrences
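A minimal wiring sketch for this check, assuming `opossum` breakers and `prom-client` for metrics; the metric name and label set are illustrative:

```typescript
import CircuitBreaker from 'opossum';
import { Counter } from 'prom-client';

// Count every breaker lifecycle event so an open circuit is never silent.
const breakerEvents = new Counter({
  name: 'circuit_breaker_events_total',
  help: 'Circuit breaker lifecycle events by dependency',
  labelNames: ['dependency', 'event'],
});

function instrument(breaker: CircuitBreaker, dependency: string): void {
  breaker.on('open', () => breakerEvents.inc({ dependency, event: 'open' }));
  breaker.on('halfOpen', () => breakerEvents.inc({ dependency, event: 'halfOpen' }));
  breaker.on('close', () => breakerEvents.inc({ dependency, event: 'close' }));
  breaker.on('reject', () => breakerEvents.inc({ dependency, event: 'reject' }));
  breaker.on('fallback', () => breakerEvents.inc({ dependency, event: 'fallback' }));
}
```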
### Phase 3: DESIGN -- Recommend Resilience Patterns
- Select patterns per dependency. Based on the failure mode analysis:
  - HTTP APIs: Circuit breaker (opossum/cockatiel) + exponential backoff with jitter + request timeout + fallback
  - Databases: Connection pool sizing + query timeout + read replica fallback + bulkhead isolation
  - Message queues: Dead letter queue + retry with backoff + idempotent consumers + circuit breaker on publish
  - gRPC services: Deadline propagation + retry policy + load balancing + circuit breaker
- Provide concrete configurations. For each recommended pattern, specify (the annotated template after this list shows the expected level of detail):
  - Library and version to use
  - Configuration values with rationale (e.g., "timeout: 3000ms based on p99 latency of 1200ms with 2.5x headroom")
  - Threshold values for circuit breakers (failure rate, sample window, reset timeout)
  - Retry parameters (max attempts, base delay, max delay, jitter factor)
  - Rate limits (requests per window, window size, burst allowance)
- Design fallback strategies. For each critical dependency (a cache-fallback sketch follows this list):
  - Cache fallback: Serve stale data from Redis/memory cache with a staleness indicator
  - Default fallback: Return a safe default value with a degraded flag
  - Queue fallback: Accept the request and process it asynchronously when the dependency recovers
  - Feature flag fallback: Disable the feature entirely via feature flag
- Generate implementation templates. Produce code snippets (one such template follows this list) for:
  - Circuit breaker wrapping an existing HTTP client
  - Retry middleware with exponential backoff and jitter
  - Rate limiter middleware for Express/Fastify/NestJS
  - Bulkhead pattern using semaphore or connection pool limits
- Define health check contracts. Specify how each dependency should be health-checked (see the contract sketch below):
  - Endpoint or query to use for liveness check
  - Timeout for the health check itself
  - Frequency and failure threshold before marking unhealthy
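One implementation template: a circuit breaker wrapping an existing HTTP client, with each configuration value annotated with its rationale. A minimal sketch assuming `opossum` and global `fetch`; the endpoint, thresholds, and latency figures are illustrative, not measured:

```typescript
import CircuitBreaker from 'opossum';

// The call being protected. The endpoint is hypothetical.
async function getInventory(sku: string): Promise<unknown> {
  const res = await fetch(`https://inventory.internal/items/${sku}`);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
}

const breaker = new CircuitBreaker(getInventory, {
  timeout: 3_000,               // e.g. p99 latency ~1200ms with 2.5x headroom
  errorThresholdPercentage: 50, // open once half of sampled calls fail
  rollingCountTimeout: 10_000,  // sample window for the failure rate
  resetTimeout: 30_000,         // attempt a half-open probe after 30s
});

// Gate: every circuit breaker needs a fallback. Here, a degraded default.
breaker.fallback(() => ({ available: null, degraded: true }));

// Callers fire through the breaker instead of calling getInventory directly.
const item = await breaker.fire('sku-123');
```

The fallback keeps an open circuit from surfacing raw errors, which is what the circuit-breaker gate below requires.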
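For the cache fallback strategy, a minimal sketch in which an in-memory `Map` stands in for Redis; all names are illustrative:

```typescript
// Cache-aside fallback: serve stale data with a staleness flag when the
// primary call fails. The in-memory Map stands in for Redis.
interface CachedValue<T> {
  value: T;
  storedAt: number;
}

const cache = new Map<string, CachedValue<unknown>>();

async function withCacheFallback<T>(
  key: string,
  primary: () => Promise<T>,
): Promise<{ value: T; stale: boolean }> {
  try {
    const value = await primary();
    cache.set(key, { value, storedAt: Date.now() });
    return { value, stale: false };
  } catch (err) {
    const hit = cache.get(key) as CachedValue<T> | undefined;
    if (hit) return { value: hit.value, stale: true }; // degraded but functional
    throw err; // nothing cached -- surface the original failure
  }
}
```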
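And a sketch of a health check contract; the field names are assumptions to adapt to the project's conventions:

```typescript
// Illustrative health check contract. Adapt field names to the project.
interface HealthCheck {
  dependency: string;
  probe: () => Promise<void>; // rejects on failure
  timeoutMs: number;          // the check itself must be bounded
  intervalMs: number;         // how often to run the probe
  failureThreshold: number;   // consecutive failures before "unhealthy"
}

const postgresCheck: HealthCheck = {
  dependency: 'postgres',
  probe: async () => {
    // e.g. run `SELECT 1` through the existing pool
  },
  timeoutMs: 1_000,
  intervalMs: 10_000,
  failureThreshold: 3,
};
```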
### Phase 4: VALIDATE -- Verify Implementation and Observability
- Check pattern correctness. For each implemented pattern:
  - Circuit breaker: Verify threshold configuration, half-open behavior, and reset timeout
  - Retry: Verify idempotency of retried operations, backoff curve, and max attempts
  - Timeout: Verify timeout values are set on both client and server sides
  - Rate limiter: Verify limit values, window type, and rejection response format
- Verify test coverage. Check that resilience patterns are tested (a state-transition test sketch follows this phase):
  - Circuit breaker tests: closed-to-open transition, open rejection, half-open recovery
  - Retry tests: successful retry, max attempts exhaustion, non-retryable error bypass
  - Timeout tests: timeout triggers fallback, timeout does not leak connections
  - Rate limiter tests: under-limit passes, at-limit rejects, window reset behavior
- Verify observability. Confirm that metrics are emitted:
  - Check for Prometheus counters/histograms or StatsD calls on pattern events
  - Verify structured logging includes circuit breaker state, retry attempt number, and rate limit headers
  - Confirm dashboard or alert configurations reference the new metrics
- Produce the resilience report. Output a summary:
  - Number of dependencies analyzed
  - Coverage before and after (percentage with full/partial/none resilience)
  - Anti-patterns found and resolved
  - Remaining gaps with priority and recommended timeline
- Run integration verification. If integration tests exist:
  - Execute tests that exercise the resilience patterns (chaos test stubs, fault injection)
  - Verify graceful degradation under simulated failure conditions
  - Confirm that fallbacks produce acceptable user-facing responses
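A sketch of the closed-to-open transition test, assuming `opossum` and a Vitest-style runner; the thresholds are deliberately small so the breaker trips inside the test:

```typescript
import CircuitBreaker from 'opossum';
import { describe, expect, it } from 'vitest';

describe('circuit breaker transitions', () => {
  it('opens after repeated failures and rejects without calling the action', async () => {
    const alwaysFails = async () => { throw new Error('boom'); };
    const breaker = new CircuitBreaker(alwaysFails, {
      errorThresholdPercentage: 50,
      volumeThreshold: 2,         // minimum calls before the breaker may trip
      rollingCountTimeout: 1_000, // short sample window for the test
      resetTimeout: 60_000,       // long enough that the test observes "open"
    });

    // While closed, failures propagate from the action itself.
    await expect(breaker.fire()).rejects.toThrow('boom');
    await expect(breaker.fire()).rejects.toThrow('boom');

    expect(breaker.opened).toBe(true);
    // While open, calls are rejected fast without invoking the action.
    await expect(breaker.fire()).rejects.toThrow(/breaker is open/i);
  });
});
```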
## Harness Integration

- `harness skill run harness-resilience` -- Primary CLI entry point. Runs all four phases.
- `harness validate` -- Run after implementing recommended patterns to verify project integrity.
- `harness check-deps` -- Verify that new resilience libraries are properly declared and within boundary rules.
- `emit_interaction` -- Used at pattern selection (checkpoint:decision) when multiple valid patterns exist and trade-offs require human judgment.
- `Glob` -- Discover HTTP clients, middleware chains, and existing resilience pattern files.
- `Grep` -- Search for timeout configurations, retry logic, circuit breaker initialization, and anti-patterns.
- `Write` -- Generate implementation templates and resilience configuration files.
- `Edit` -- Add resilience wrappers to existing service clients.
## Success Criteria
- All external dependencies are inventoried with their resilience coverage level
- Anti-patterns are identified with specific file locations and line numbers
- Recommendations include concrete library versions and configuration values, not just pattern names
- Fallback strategies are defined for every critical dependency
- Implementation templates compile and follow the project's existing code style
- Observability is addressed: every pattern emits metrics or structured logs
## Examples

### Example: Express.js API with Stripe and PostgreSQL

```
Phase 1: DETECT
Dependencies found:
- Stripe API (HTTP, critical): axios client in src/payments/stripe-client.ts
  Resilience: timeout=5000ms, no retry, no circuit breaker, no fallback
- PostgreSQL (database, critical): pg pool in src/db/pool.ts
  Resilience: pool max=20, no query timeout, no read replica fallback
- SendGrid (HTTP, optional): @sendgrid/mail in src/notifications/email.ts
  Resilience: none
Anti-patterns:
- src/payments/stripe-client.ts:45 — retry on POST /charges without idempotency key
- src/db/pool.ts — no statement_timeout configured

Phase 2: ANALYZE
Stripe failure modes:
- Timeout: Payment page hangs, user retries, duplicate charges possible
- Outage: All payments fail, revenue impact immediate
- Blast radius: checkout flow, subscription renewal, refund processing
Risk: P0 (critical + partial coverage + anti-pattern)

Phase 3: DESIGN
Stripe recommendations:
- Add opossum circuit breaker: failureThreshold=50%, resetTimeout=30s
- Add idempotency key to all Stripe charge requests
- Set timeout to 8000ms (Stripe p99 is ~3s, 2.5x headroom)
- Fallback: queue payment for async retry via Bull queue
PostgreSQL recommendations:
- Set statement_timeout=5000 in pool config
- Add pg-pool error handler with connection retry
- Configure read replica for GET endpoints via pgBouncer

Phase 4: VALIDATE
Resilience coverage: 33% -> 100% (3/3 dependencies covered)
Anti-patterns resolved: 2/2
Tests needed: circuit breaker state transitions, idempotency key generation
```

### Example: NestJS Microservices with gRPC and Redis

```
Phase 1: DETECT
Dependencies found:
- user-service (gRPC, critical): @grpc/grpc-js in src/clients/user.client.ts
  Resilience: deadline=5s, no retry, no circuit breaker
- inventory-service (gRPC, critical): no resilience configured
- Redis (cache, degraded): ioredis in src/cache/redis.ts
  Resilience: reconnectOnError, no bulkhead, no fallback

Phase 2: ANALYZE
inventory-service outage:
- Product pages return 503, search results empty
- Blast radius: catalog, search, cart validation
- Risk: P0 (critical + no coverage)

Phase 3: DESIGN
inventory-service recommendations:
- Add cockatiel circuit breaker with ConsecutiveBreaker(5)
- Add retry with exponentialBackoff(1000, 2) maxAttempts=3
- Add deadline propagation from gateway timeout
- Fallback: serve cached inventory from Redis with staleness header
Redis recommendations:
- Add bulkhead: maxPoolSize=50, separate pools for cache vs sessions
- Add fallback: in-memory LRU cache (lru-cache, max 1000 items)
- Monitor: emit redis.command.duration histogram

Phase 4: VALIDATE
Coverage: 33% -> 100%
Tests verified: gRPC circuit breaker opens after 5 failures, Redis fallback serves from LRU when Redis is down
```
## Rationalizations to Reject
| Rationalization | Reality |
|---|---|
| "That third-party API has 99.99% uptime — we don't need a circuit breaker" | 99.99% uptime means 52 minutes of downtime per year. That downtime will not occur as one predictable window — it will happen as degraded responses and timeouts during a traffic spike. Without a circuit breaker, every caller blocks for the full timeout duration, exhausting thread pools and cascading across the system. |
| "We have retry logic, so failures are handled" | Retry logic without a circuit breaker amplifies failures. When the downstream service is degraded, retries multiply the load on an already struggling system. Circuit breakers and retries are complementary controls, not alternatives. |
| "The fallback adds complexity — we'll add it if the circuit breaker actually opens" | A circuit breaker without a fallback is a different kind of failure mode, not resilience. When the circuit opens, users see an error instead of a degraded-but-functional experience. Fallbacks must be designed and tested before the circuit ever opens in production. |
| "Our database connection pool is 100 connections — that's plenty" | Connection pool size without query timeouts means slow queries hold connections indefinitely. A single slow query spike can exhaust the pool, causing every subsequent request to wait. Pool sizing and query timeouts are both required. |
| "The service is internal — it doesn't need rate limiting" | Internal services are often called by automated processes, CI pipelines, and batch jobs that can spike traffic in ways user-facing services do not. Missing rate limiting on internal services is a common cause of self-inflicted outages during deployments and data migrations. |
## Gates
- No retry on non-idempotent operations without idempotency keys. Retrying a POST or DELETE that lacks an idempotency mechanism can cause data duplication or data loss. This is a blocking finding. The operation must be made idempotent before retry logic is added.
- No circuit breaker without a fallback. A circuit breaker that opens and returns a raw error to the user is not resilience -- it is a different kind of failure. Every circuit breaker must have a defined fallback behavior (cache, default, queue, or feature flag).
- No unbounded retries. Retry logic must have a max attempt limit and use exponential backoff with jitter. Unbounded retries with fixed delays cause thundering herd problems and amplify failures.
- No resilience pattern without observability. A circuit breaker that opens silently is invisible to operations. Every pattern must emit metrics or structured logs that can trigger alerts.
## Escalation
- When a dependency has no documentation on failure behavior: Report: "The [dependency] has no documented error codes or failure modes. Recommend contacting the provider for SLA details, or instrumenting the client to collect failure statistics over a 2-week baseline period."
- When resilience patterns conflict with latency requirements: Adding retries and circuit breakers increases tail latency. Report: "The recommended retry configuration adds up to [N]ms to worst-case latency. If the latency budget is [M]ms, consider reducing max attempts or using a hedged request pattern instead."
- When the team has no experience with the recommended library: Report: "The team has not used [library] before. Recommend starting with a single non-critical dependency as a pilot, with a production bake time of 2 weeks before rolling out to critical paths."
- When existing resilience patterns use a different library than recommended: Do not recommend switching libraries mid-project. Report: "The project already uses [existing library] for resilience. Recommend continuing with [existing library] for consistency, adapting the configuration recommendations to its API."