Harness-engineering resilience-bulkhead-pattern

Bulkhead Pattern

install

source · Clone the upstream repo

git clone https://github.com/Intense-Visions/harness-engineering

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/Intense-Visions/harness-engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agents/skills/claude-code/resilience-bulkhead-pattern" ~/.claude/skills/intense-visions-harness-engineering-resilience-bulkhead-pattern && rm -rf "$T"

manifest: agents/skills/claude-code/resilience-bulkhead-pattern/SKILL.md

source content

Bulkhead Pattern

Isolate failures by partitioning resources so one failing component cannot exhaust capacity for others

When to Use

A slow dependency is consuming all available connections, starving other services
Need to guarantee capacity for critical operations even when non-critical ones are overloaded
Limiting concurrent requests to a specific external service
Preventing a single API consumer from monopolizing shared resources

Instructions

Partition resources by concern. Each partition (bulkhead) has its own concurrency limit.
Implement as a semaphore that limits concurrent executions. When the limit is reached, new requests are queued or rejected.
Set separate bulkheads for each external dependency — the payment API and the notification API should not share a concurrency pool.
Configure a queue size for burst handling. Reject requests that exceed the queue to prevent unbounded memory growth.
Combine with timeouts — a bulkhead without timeouts can still deadlock if slots are never released.
Monitor bulkhead utilization and rejection rate to tune limits.

// utils/bulkhead.ts
export class Bulkhead {
  private active = 0;
  private queue: Array<{ resolve: () => void; reject: (err: Error) => void }> = [];

  constructor(
    private readonly maxConcurrent: number,
    private readonly maxQueue: number = 100
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await fn();
    } finally {
      this.release();
    }
  }

  private acquire(): Promise<void> {
    if (this.active < this.maxConcurrent) {
      this.active++;
      return Promise.resolve();
    }

    if (this.queue.length >= this.maxQueue) {
      return Promise.reject(
        new BulkheadRejectError(`Bulkhead full: ${this.active} active, ${this.queue.length} queued`)
      );
    }

    return new Promise((resolve, reject) => {
      this.queue.push({ resolve, reject });
    });
  }

  private release(): void {
    const next = this.queue.shift();
    if (next) {
      next.resolve();
    } else {
      this.active--;
    }
  }

  get stats() {
    return { active: this.active, queued: this.queue.length };
  }
}

export class BulkheadRejectError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'BulkheadRejectError';
  }
}

// services/payment-service.ts
const paymentBulkhead = new Bulkhead(10, 50); // Max 10 concurrent, 50 queued
const notificationBulkhead = new Bulkhead(20, 100);

export async function processPayment(orderId: string): Promise<PaymentResult> {
  return paymentBulkhead.execute(() =>
    fetch(`https://payment-api.example.com/charge/${orderId}`, { method: 'POST' }).then((r) =>
      r.json()
    )
  );
}

export async function sendNotification(userId: string, message: string): Promise<void> {
  return notificationBulkhead.execute(() =>
    fetch('https://notification-api.example.com/send', {
      method: 'POST',
      body: JSON.stringify({ userId, message }),
    }).then(() => undefined)
  );
}

Details

Types of bulkheads:

Semaphore-based: Limits concurrent executions (shown above). Best for async I/O in Node.js.
Thread pool-based: Separate thread pools per dependency. Common in Java (Hystrix). Less relevant in Node.js due to its single-threaded model, but applicable with worker threads.
Connection pool-based: Separate connection pools per database or service.

Sizing bulkheads: Start with

expected_concurrent_requests * 1.5

. Monitor rejection rates. If rejections are too high, increase the limit. If the downstream service is overwhelmed, decrease it.

Combining patterns: Bulkhead + circuit breaker + timeout is the resilience trifecta:

Timeout: Prevents individual requests from hanging
Bulkhead: Limits total resource consumption
Circuit breaker: Stops all requests when failure rate is too high

async function resilientCall<T>(fn: () => Promise<T>): Promise<T> {
  return circuitBreaker.execute(() => bulkhead.execute(() => withTimeout(fn, 5000)));
}

Libraries:

cockatiel

provides composable bulkhead policy.

p-limit

is a simpler concurrency limiter.

bottleneck

adds rate limiting and clustering support.

Source

https://learn.microsoft.com/en-us/azure/architecture/patterns/bulkhead

Process

Read the instructions and examples in this document.
Apply the patterns to your implementation, adapting to your specific context.
Verify your implementation against the details and edge cases listed above.

Harness Integration

Type: knowledge — this skill is a reference document, not a procedural workflow.
No tools or state — consumed as context by other skills and agents.

Success Criteria

The patterns described in this document are applied correctly in the implementation.
Edge cases and anti-patterns listed in this document are avoided.