Agent-almanac circuit-breaker-pattern

install

source · Clone the upstream repo

git clone https://github.com/pjt222/agent-almanac

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/pjt222/agent-almanac "$T" && mkdir -p ~/.claude/skills && cp -r "$T/i18n/caveman-ultra/skills/circuit-breaker-pattern" ~/.claude/skills/pjt222-agent-almanac-circuit-breaker-pattern-b2d4bd && rm -rf "$T"

manifest: i18n/caveman-ultra/skills/circuit-breaker-pattern/SKILL.md

source content

Circuit Breaker Pattern

Graceful degradation when tools fail. Agent w/ 5 tools, 1 broken → don't fail whole → spot broken tool, stop calling, shrink scope to achievable, report honest what skipped. Codifies circuit breaker from distributed systems → agentic tool orchestration.

Core insight from kirapixelads' "Kitchen Fire Problem": expeditor (orch layer) must NOT cook. Separate what to attempt from how → orchestrator stays out of broken tool's retry loop.

Use When

Building agents w/ many tools, varying reliability
Fault-tolerant workflows → partial > total failure
Agent stuck in retry loop on broken tool, not moving forward
Mid-task tool outage → graceful recovery
Hardening existing agents vs. cascading failures
Stale/cached tool out being treated as fresh

In

Required: Tool list (names + purposes)
Required: Task to accomplish
Optional: Known reliability issues / past fail patterns
Optional: Fail threshold (default: 3 consecutive fails → open)
Optional: Fail budget per cycle (default: 5 total fails → pause)
Optional: Half-open probe interval (default: every 3rd attempt post-open)

Do

Step 1: Build Capability Map

Declare each tool's capability + alternatives. Map = foundation for scope reduction → w/o it, fail leaves agent guessing.

capability_map:
  - tool: Grep
    provides: content search across files
    alternatives:
      - tool: Bash
        method: "rg or grep command"
        degradation: "loses Grep's built-in output formatting"
      - tool: Read
        method: "read suspected files directly"
        degradation: "requires knowing which files to check; no broad search"
    fallback: "ask the user which files to examine"

  - tool: Bash
    provides: command execution, build tools, git operations
    alternatives: []
    fallback: "report commands that need to be run manually"

  - tool: Read
    provides: file content inspection
    alternatives:
      - tool: Bash
        method: "cat or head command"
        degradation: "loses line numbering and truncation safety"
    fallback: "ask the user to paste file contents"

  - tool: Write
    provides: file creation
    alternatives:
      - tool: Edit
        method: "create via full-file edit"
        degradation: "requires file to already exist for Edit"
      - tool: Bash
        method: "echo/cat heredoc"
        degradation: "loses Write's atomic file creation"
    fallback: "output file contents for the user to save manually"

  - tool: WebSearch
    provides: external information retrieval
    alternatives: []
    fallback: "state what information is needed; ask user to provide it"

Each tool, doc:

Capability (one line)
Alternative tools (w/ degradation notes)
Manual fallback when no tool alternative

→ Full map covers every tool agent uses. Each entry has fallback even if no tool alt. Map makes explicit what's implicit: critical tools (no alts) vs. substitutable.

If err: Tool list unclear → start w/

allowed-tools

from skill frontmatter. Alts uncertain → mark

degradation: "unknown — test before relying on this route"

vs. omit.

Step 2: Initialize Circuit Breaker State

State tracker per tool. All tools start CLOSED (healthy).

Circuit Breaker State Table:
+------------+--------+-------------------+------------------+-----------------+
| Tool       | State  | Consecutive Fails | Last Failure     | Last Success    |
+------------+--------+-------------------+------------------+-----------------+
| Grep       | CLOSED | 0                 | —                | —               |
| Bash       | CLOSED | 0                 | —                | —               |
| Read       | CLOSED | 0                 | —                | —               |
| Write      | CLOSED | 0                 | —                | —               |
| Edit       | CLOSED | 0                 | —                | —               |
| WebSearch  | CLOSED | 0                 | —                | —               |
+------------+--------+-------------------+------------------+-----------------+

Failure budget: 0 / 5 consumed

State defs:

CLOSED — Tool healthy. Use normally. Track consecutive fails.
OPEN — Tool known-broken. Don't call. Route to alts or shrink scope.
HALF-OPEN — Tool was broken, maybe recovered. Single probe call. Success → CLOSED. Fail → OPEN.

Transitions:

CLOSED → OPEN: Consecutive fails ≥ threshold (default: 3)
OPEN → HALF-OPEN: After interval (e.g., every 3rd task step)
HALF-OPEN → CLOSED: Probe success
HALF-OPEN → OPEN: Probe fail

→ State table init'd all tools CLOSED, zero fails. Threshold + budget declared.

If err: Can't enumerate tools upfront (dynamic discovery) → init state on first use. Pattern still works → build table incrementally.

Step 3: Implement Call-and-Track Loop

Agent needs tool call → follow decision seq. This = expeditor logic → decides whether to call, not how to execute.

BEFORE each tool call:
  1. Check tool state in the circuit breaker table
  2. If OPEN:
     a. Check if it is time for a half-open probe
        - Yes → transition to HALF-OPEN, proceed with probe call
        - No  → skip this tool, route to alternative (Step 4)
  3. If HALF-OPEN:
     a. Make one probe call
     b. Success → transition to CLOSED, reset consecutive fails to 0
     c. Failure → transition to OPEN, increment failure budget
  4. If CLOSED:
     a. Make the call normally

AFTER each tool call:
  1. Success:
     - Reset consecutive fails to 0
     - Record last success timestamp
  2. Failure:
     - Increment consecutive fails
     - Record last failure timestamp and error message
     - Increment failure budget consumed
     - If consecutive fails >= threshold:
         transition to OPEN
         log: "Circuit OPENED for [tool]: [failure count] consecutive failures"
     - If failure budget exhausted:
         PAUSE — do not continue the task
         Report to user (Step 6)

Expeditor NEVER retries failed call immediately. Record fail, check thresholds, move on. Retries via HALF-OPEN probe at later step only.

→ Clear decision loop before + after every tool call. Tool health tracked continuous. Expeditor layer never blocks on failing tool.

If err: Tracking state across calls impractical (stateless exec) → degrade to simpler: count total fails, pause at budget. Three-state breaker = ideal; fail counter = min viable.

Step 4: Route to Alternatives on Open Circuit

Tool OPEN → consult capability map (Step 1), route to best alt.

Routing priority:

Tool alt, low degradation — Similar capability tool. Note degradation in out.
Tool alt, high degradation — Big capability loss. Label what's missing.
Manual fallback — Report what agent can't do, what user needs to provide.
Scope reduction — No alt + no fallback → drop dependent sub-task (Step 5).

Example routing decision:

Tool needed: Grep (circuit OPEN)
Task: find all files containing "API_KEY"

Route 1: Bash with rg command
  → Degradation: loses Grep's built-in formatting
  → Decision: ACCEPTABLE — use this route

If Bash also OPEN:
Route 2: Read suspected config files directly
  → Degradation: requires guessing which files; no broad search
  → Decision: PARTIAL — try known config paths only

If Read also OPEN:
Route 3: Ask user
  → "I need to find files containing 'API_KEY' but my search
     tools are unavailable. Can you run: grep -r 'API_KEY' ."
  → Decision: FALLBACK — user provides the information

If user unavailable:
Route 4: Scope reduction
  → Remove "find API key references" from task scope
  → Document: "SKIPPED: API key search — no tools available"

→ Tool circuit opens → agent transparently routes to alt or degrades scope. Decision + degradation documented in out → user knows what's affected.

If err: Map incomplete (no alts listed) → default scope reduction + report. NEVER silently skip → always doc what + why.

Step 5: Reduce Scope to Achievable Work

Tools OPEN + alts exhausted → shrink task to what working tools can do. Not failure → honest scope mgmt.

Scope reduction proc:

List remaining sub-tasks
Each sub-task → check tools required
All required tools CLOSED or viable alts → keep
Any required tool OPEN no alt → mark DEFERRED
Continue w/ reduced scope
Report deferred at end

Scope Reduction Report:

Original scope: 5 sub-tasks
  [x] 1. Read configuration files          (Read: CLOSED)
  [x] 2. Search for deprecated patterns    (Grep: CLOSED)
  [ ] 3. Run test suite                    (Bash: OPEN — no alternative)
  [x] 4. Update documentation             (Edit: CLOSED)
  [ ] 5. Deploy to staging                 (Bash: OPEN — no alternative)

Reduced scope: 3 sub-tasks achievable
Deferred: 2 sub-tasks require Bash (circuit OPEN)

Recommendation: Complete sub-tasks 1, 2, 4 now.
Sub-tasks 3 and 5 require Bash — will probe on next cycle
or user can run commands manually.

Do NOT attempt deferred. Do NOT retry open-circuited tools hoping they work. Breaker exists to prevent this → trust its state.

→ Clear partition of task → achievable + deferred. Agent finishes achievable, reports deferred w/ reason + unblock path.

If err: Scope reduction removes all sub-tasks (all tools broken) → skip to Step 6 pause-and-report. Agent w/ no working tools must not fake progress.

Step 6: Handle Staleness + Label Data Quality

Tool returns maybe-stale data (cached, old snapshot, prev fetched) → label explicit, not treat as fresh.

Staleness indicators:

Tool out matches prev call exactly (cache hit?)
Data timestamps older than current task
Tool doc mentions caching
Results contradict other recent observations

Labeling proc:

When presenting potentially stale data:

"[STALE DATA — retrieved at {timestamp}, may not reflect current state]
 File contents as of last successful Read:
 ..."

"[CACHED RESULT — Grep returned identical results to previous call;
 filesystem may have changed since]"

"[UNVERIFIED — WebSearch result from {date}; current status unknown]"

NEVER silently present stale as current. User / downstream agent must know data quality.

→ All maybe-stale outs labeled. Fresh not labeled (labels = uncertainty, not confirmation).

If err: Can't determine staleness (no timestamps, no baseline) → note: "[FRESHNESS UNKNOWN — no baseline for comparison]". Uncertainty = info.

Step 7: Enforce Failure Budget

Track total fails across all tools. Budget exhausted → pause + report vs. keep accumulating errs.

Failure Budget Enforcement:

Budget: 5 failures per cycle
Current: 4 / 5 consumed

  Failure 1: Bash — "permission denied" (step 3)
  Failure 2: Bash — "command not found" (step 3)
  Failure 3: Bash — "timeout after 120s" (step 4)
  Failure 4: WebSearch — "connection refused" (step 5)

Status: 1 failure remaining before mandatory pause

→ Next tool call proceeds with heightened caution
→ If it fails: PAUSE and generate status report

On budget exhaustion:

FAILURE BUDGET EXHAUSTED — PAUSING

Completed work:
  - Sub-task 1: Read configuration files (SUCCESS)
  - Sub-task 2: Search for deprecated patterns (SUCCESS)

Incomplete work:
  - Sub-task 3: Run test suite (FAILED — Bash circuit OPEN)
  - Sub-task 4: Update documentation (NOT ATTEMPTED — paused)
  - Sub-task 5: Deploy to staging (NOT ATTEMPTED — paused)

Tool health:
  Grep: CLOSED (healthy)
  Read: CLOSED (healthy)
  Edit: CLOSED (healthy)
  Bash: OPEN (3 consecutive failures — permission/command/timeout)
  WebSearch: OPEN (1 failure — connection refused)

Failures: 5 / 5 budget consumed

Recommendation:
  1. Investigate Bash failures — likely environment issue
  2. Check network connectivity for WebSearch
  3. Resume from sub-task 4 after resolution

Pause-and-report = breaker in electrical systems → prevents damage accumulating. Agent that keeps calling broken tools wastes ctx, confuses user w/ repeat errs, inconsistent partial results.

→ Agent stops clean on budget exhaust. Report covers done work, incomplete work, tool health, actionable next steps.

If err: Can't generate clean report (state tracking lost) → out whatever avail. Partial report > silent continuation.

Step 8: Separation of Concerns — Expeditor vs. Executor

Valid. orchestration logic (Steps 2-7) cleanly separated from tool exec.

Expeditor (orch) does:

Track tool health state
Decide call, skip, probe
Route to alts on open
Enforce fail budget
Generate status reports

Expeditor does NOT:

Retry failed calls immediately
Modify call params to work around errs
Catch + suppress tool errs
Assume why tool failed
Exec fallback logic that itself needs tools

Expeditor "cooking" (calling tools to work around other fails) → separation broken. Expeditor routes to alt or shrinks scope, NOT fixes broken tool.

→ Clean boundary orch decisions vs. tool exec. Expeditor described w/o ref to specific tool APIs or err types.

If err: Orch + exec entangled → refactor → extract decision logic to separate step before each tool call. Decision step outs: CALL, SKIP, PROBE, PAUSE. Exec step acts on out.

Step 9: Detect Cascading Failures

Many tools share infra (network, fs, perms) → single root cause trips many breakers at once. Detect correlated pattern vs. treat each breaker indep.

Cascade indicators:

3+ tools OPEN in same task step / narrow window
Fails share common err signature (e.g., "connection refused," "permission denied")
Tools w/ indep fail history suddenly fail together

Response proc:

Second breaker opens → check if fail category matches first
Correlated → flag systemic failure → pause all tool calls, not just broken
Report suspected root: "Multiple tools failing with [shared pattern] — likely [network/filesystem/permissions] issue"
Don't probe half-open during systemic fail → probes also fail, waste budget
Resume probing only after user confirms infra fixed

Backoff compounding: Cascade trips → exponential backoff for half-open probes: probe step 3, then 6, then 12. Cap max interval 20 steps → prevent permanent circuit lock. Stops rapid-fire probes overwhelming recovering system.

→ Correlated fails detected + treated as single systemic event, not N indep trips. Fail budget counts systemic event once, not N times.

If err: Correlation detection impractical (diff err sigs, shared cause) → fallback indep per-tool breakers. System still degrades → just consumes budget faster.

Step 10: Pre-Call Tool Selection Layer

Before circuit breaker loop (Step 3) → optionally valid. tool available + likely succeed. Cuts unnecessary trips from predictable fails.

Pre-call checks:

Check	Method	Action on failure
Tool exists	Verify tool is in the allowed-tools list	Skip — do not even attempt
MCP server health	Check server process/connection status	Route to alternative immediately
Resource availability	Verify target file/URL/endpoint exists	Route or degrade scope

Decision table:

Pre-call score:
  AVAILABLE  → proceed to circuit breaker loop (Step 3)
  DEGRADED   → proceed with caution, lower the failure threshold by 1
  UNAVAILABLE → skip tool, route to alternative (Step 4) without consuming budget

Pre-call = advisory, not authoritative. Tool passing pre-call can still fail during exec. Breaker = primary reliability mechanism.

→ Predictable fails (missing tools, unreachable servers) caught before budget consumed. Breaker handles only genuine runtime fails.

If err: Pre-call checks unavail or too much overhead → skip entirely. Step 3 breaker loop handles all fails → pre-call = opt, not req.

Check

Traps

Retry vs. circuit-break: Calling broken tool repeat wastes budget + ctx. 3 consecutive fails = pattern, not bad luck. OPEN it.
Cooking in expeditor: Orch decides what, not how to fix broken. Expeditor crafting workaround cmds for Bash fails → crossed the boundary.
Silent scope reduction: Dropping sub-tasks w/o doc → looks complete, isn't. Always report skipped.
Treat stale as fresh: Cached/prev results maybe not current. Label uncertainty, don't ignore.
Open circuits too eagerly: Single transient fail shouldn't open. Threshold (default: 3) filters noise from signal.
Never probe post-open: Permanently open → agent never finds recovery. Half-open probes essential.
Ignore fail budget: W/o budget → agent accumulates dozens of fails across tools while "progressing" on paper. Budget forces honest checkpoint.
Cascade backoff multiply: Many tools in dep chain each apply own exp backoff → compound delay grows multiplicatively. Cap total aggregate backoff across chain, not just per tool.
Stale discovery scores: Step 10 caches tool avail. No invalidation when conditions change → agent skips recovered tool or attempts unavail. Re-check scores after systemic fail event.

→

```
fail-early-pattern
```
— complementary: fail-early validates in before work; circuit-breaker manages fails during work
```
escalate-issues
```
— budget exhausted / scope reduction significant → escalate to specialist or human
```
write-incident-runbook
```
— doc recurring fail patterns as runbooks
```
assess-context
```
— eval if current approach can adapt when many tools degraded; pairs w/ scope reduction
```
du-dum
```
— two-clock arch sep'ng observe from decide; complement for cutting observe cost in agent loops