Marketplace error-recovery
Strategies for handling subagent failures with retry logic and escalation patterns.
install
source · Clone the upstream repo
git clone https://github.com/aiskillstore/marketplace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiskillstore/marketplace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/clouder0/error-recovery" ~/.claude/skills/aiskillstore-marketplace-error-recovery && rm -rf "$T"
manifest: skills/clouder0/error-recovery/SKILL.md
Error Recovery Skill
Pattern for handling subagent failures gracefully with appropriate retry strategies.
When to Load This Skill
- You are spawning subagents that may fail
- A subagent returned an error or unexpected output
- You need to decide whether to retry, escalate, or abort
Failure Categories
| Category | Symptoms | Strategy |
|---|---|---|
| Transient | Timeout, malformed output, parsing error | Simple Retry |
| Context Gap | "I don't have enough information", unclear task | Context Enhancement |
| Complexity | Partial completion, scope creep, tangents | Scope Reduction |
| Boundary/Contract | , boundary_violation, contract_change | Escalation |
| Fatal | Repeated failures (3+), fundamental misunderstanding | Abort with Report |
Retry Strategies
Strategy 1: Simple Retry
For transient failures: resend the same prompt, up to 3 attempts.

    # Track attempts
    attempts: 0
    max_attempts: 3

    # On failure
    IF attempts < max_attempts:
      attempts += 1
      Task(same_subagent_type, same_model, same_prompt)
    ELSE:
      Mark as FAILED, move on
Use when:
- Output was malformed or truncated
- Timeout occurred
- Agent returned empty/null response
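The retry loop for Strategy 1 can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: `run_task` stands in for whatever call spawns the subagent, `is_transient_failure` is a hypothetical check, and a timeout is modelled as `run_task` raising `TimeoutError`. Here `max_attempts` caps the total number of calls, which is one possible reading of "up to 3 attempts".

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # mirrors max_attempts in the pseudocode above


def is_transient_failure(output: Optional[str]) -> bool:
    """Treat a missing or blank response as transient (hypothetical check)."""
    return output is None or not output.strip()


def simple_retry(run_task: Callable[[], Optional[str]],
                 max_attempts: int = MAX_ATTEMPTS) -> tuple[str, Optional[str], int]:
    """Call run_task with the same inputs until it yields usable output.

    Returns (status, output, attempts_made).
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            output = run_task()
        except TimeoutError:
            continue  # a timeout counts as one failed attempt
        if not is_transient_failure(output):
            return "completed", output, attempts
    # All attempts used up: mark as failed and let the caller move on.
    return "failed", None, attempts
```

The caller can store the returned attempt count in the execution state so that later strategies know how many plain retries were already spent.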
Strategy 2: Context Enhancement
Add more information to help the agent succeed.

    Task(
      subagent_type: "implementer",
      model: "sonnet",
      prompt: |
        ## PREVIOUS ATTEMPT FAILED
        Error: {error_message}
        Output received: {partial_output}

        ## ADDITIONAL CONTEXT
        Here is more information that may help:
        - Related file: @{additional_file_path}
        - Pattern to follow: {example_pattern}
        - Specific guidance: {clarification}

        ## ORIGINAL TASK
        {original_task_description}

        Output to: {output_path}
    )
Use when:
- Agent said "I don't understand" or "unclear requirements"
- Agent made incorrect assumptions
- Agent asked questions in output
Context to add:
- Related code files the agent might need
- Similar implementations as examples
- Explicit clarification of ambiguous points
- Error message from previous attempt
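Assembling the enhanced prompt can be sketched as a small helper. The function name and parameters below are assumptions made for illustration; only the section headings and their order come from the Strategy 2 template.

```python
def build_enhanced_prompt(original_task: str, error_message: str,
                          partial_output: str, related_files: list[str],
                          clarification: str, output_path: str) -> str:
    """Build a retry prompt that follows the section order of the template."""
    lines = [
        "## PREVIOUS ATTEMPT FAILED",
        f"Error: {error_message}",
        f"Output received: {partial_output}",
        "",
        "## ADDITIONAL CONTEXT",
        "Here is more information that may help:",
    ]
    # One bullet per extra file the agent might need.
    lines += [f"- Related file: @{path}" for path in related_files]
    lines += [
        f"- Specific guidance: {clarification}",
        "",
        "## ORIGINAL TASK",
        original_task,
        "",
        f"Output to: {output_path}",
    ]
    return "\n".join(lines)
```

Keeping the failure details first and the original task last matches the template, so the agent reads what went wrong before it rereads the task.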
Strategy 3: Scope Reduction
Break the failing task into smaller, more manageable pieces.

    # Original task failed
    Task: "Implement full authentication system"

    # Split into subtasks
    Task(implementer, "Implement password hashing utility")
    Task(implementer, "Implement session token generation")
    Task(implementer, "Implement login endpoint")
    Task(implementer, "Implement logout endpoint")
Use when:
- Agent completed partial work then failed
- Task description was too broad
- Agent went off on tangents
- Output shows confusion about scope
Splitting guidelines:
- Each subtask should be independently completable
- Each subtask should have clear boundaries
- Subtasks can run in parallel if they do not depend on each other
- Recombine outputs after all subtasks complete
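One way to decide which subtasks may run in parallel is to group them into dependency-ordered batches. The sketch below uses Python's standard `graphlib.TopologicalSorter`; the subtask names and the dependencies between them are hypothetical and chosen only to illustrate the splitting guidelines.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group subtasks into batches whose members have no unmet dependencies.

    `dependencies` maps each subtask to the set of subtasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock dependents
    return batches


# Hypothetical dependencies for the authentication example.
deps = {
    "password_hashing": set(),
    "session_token": set(),
    "login_endpoint": {"password_hashing", "session_token"},
    "logout_endpoint": {"session_token"},
}
batches = parallel_batches(deps)
```

Each inner list can be spawned concurrently; outputs are recombined once the last batch completes.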
Strategy 4: Escalation
Route to a specialized agent for resolution.

    # For boundary violations
    Task(
      subagent_type: "contract-resolver",
      model: "sonnet",
      prompt: |
        A task is blocked due to boundary/contract issues.

        Blocked task output: memory/tasks/{task_id}/output.json
        Blocked reason: {blocked_reason}
        Current contracts: {contract_paths}

        Analyze impact and provide resolution.

        Output to: memory/contracts/resolution_{task_id}.json
    )
Escalation paths:
| Failure Type | Escalate To | Action |
|---|---|---|
| | contract-resolver | Expand boundaries or redesign |
| | contract-resolver | Modify contract, re-verify dependents |
| | executor (self) | Re-check dependency status |
| Repeated implementation failures | architect | Reconsider design approach |
Strategy 5: Abort with Report
When recovery is not possible, fail gracefully.

    {
      "tasks": [
        {
          "id": "{task_id}",
          "status": "failed",
          "failure_reason": "{specific reason}",
          "attempts_made": 3,
          "recovery_attempted": [
            {"strategy": "simple_retry", "result": "same_error"},
            {"strategy": "context_enhancement", "result": "different_error"},
            {"strategy": "scope_reduction", "result": "subtasks_also_failed"}
          ],
          "recommendation": "Task may need architectural redesign"
        }
      ]
    }
Use when:
- 3+ retry attempts failed
- Different strategies all failed
- Fundamental misunderstanding of requirements
- Task is actually impossible given constraints
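A failure report with the same fields as the Strategy 5 example can be produced by a small helper. This is a sketch under assumptions: the function name and its parameters are illustrative, and only the JSON field names come from the report format shown in this section.

```python
import json


def build_abort_report(task_id: str, failure_reason: str, attempts_made: int,
                       recovery_attempted: list[tuple[str, str]],
                       recommendation: str) -> str:
    """Serialize a failure report using the field names from the example."""
    report = {
        "tasks": [{
            "id": task_id,
            "status": "failed",
            "failure_reason": failure_reason,
            "attempts_made": attempts_made,
            # Each entry records one strategy and what happened when it ran.
            "recovery_attempted": [
                {"strategy": strategy, "result": result}
                for strategy, result in recovery_attempted
            ],
            "recommendation": recommendation,
        }]
    }
    return json.dumps(report, indent=2)
```

Recording every attempted strategy alongside its result gives a later reviewer enough detail to see why recovery stopped.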
Decision Tree

    On Subagent Failure:
    │
    ├─ Is output malformed/empty/timeout?
    │   └─ YES → Strategy 1: Simple Retry (up to 3x)
    │
    ├─ Did agent say "unclear" or ask questions?
    │   └─ YES → Strategy 2: Context Enhancement
    │
    ├─ Did agent complete partial work?
    │   └─ YES → Strategy 3: Scope Reduction
    │
    ├─ Is status "blocked" with boundary/contract reason?
    │   └─ YES → Strategy 4: Escalation to contract-resolver
    │
    ├─ Have we tried 3+ strategies already?
    │   └─ YES → Strategy 5: Abort with Report
    │
    └─ Unknown error
        └─ Try Strategy 2 first, then escalate
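The decision tree can be expressed as a single function. The sketch below is an assumption-laden illustration: the `Failure` fields, the `partial_work` flag, and the text heuristics are hypothetical, and malformed output is not detected here. One deliberate deviation from the tree: the "3+ strategies" check runs first so that a task cannot cycle through strategies indefinitely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Failure:
    status: str                    # e.g. "failed" or "blocked"
    output: Optional[str] = None   # raw subagent output, if any
    timed_out: bool = False
    partial_work: bool = False     # hypothetical flag set by the caller
    strategies_tried: int = 0


def choose_strategy(f: Failure) -> str:
    """Pick a recovery strategy for one failed subagent run."""
    if f.strategies_tried >= 3:
        return "abort_with_report"
    if f.timed_out or not (f.output or "").strip():
        return "simple_retry"
    text = f.output.lower()
    # Crude heuristic: the agent said "unclear" or asked a question.
    if "unclear" in text or "?" in text:
        return "context_enhancement"
    if f.partial_work:
        return "scope_reduction"
    if f.status == "blocked":
        return "escalation"
    # Unknown error: try Strategy 2 first, escalate if that fails too.
    return "context_enhancement"
```

Returning a strategy name rather than executing it keeps the classification step easy to log and to test in isolation.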
Retry State Tracking
Track retry attempts in the execution state file:

    {
      "tasks": [
        {
          "id": "task-001",
          "status": "running",
          "attempts": 2,
          "last_error": "Timeout after 120s",
          "retry_strategy": "simple_retry"
        },
        {
          "id": "task-002",
          "status": "running",
          "attempts": 1,
          "last_error": "Needs access to src/config/db.ts",
          "retry_strategy": "context_enhancement",
          "context_added": ["src/config/db.ts", "src/types/config.ts"]
        }
      ]
    }
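Updating a task entry after each retry might look like the following. This is a minimal sketch that works on the already-parsed state (a `dict` loaded with `json.load`); the function name is hypothetical, and the field names are taken from the state example in this section.

```python
from typing import Optional


def record_retry(state: dict, task_id: str, last_error: str, strategy: str,
                 context_added: Optional[list[str]] = None) -> dict:
    """Update one task entry in place and return it."""
    for task in state["tasks"]:
        if task["id"] == task_id:
            task["attempts"] = task.get("attempts", 0) + 1
            task["last_error"] = last_error
            task["retry_strategy"] = strategy
            if context_added:
                # Accumulate files across retries instead of overwriting.
                task.setdefault("context_added", []).extend(context_added)
            return task
    raise KeyError(f"unknown task id: {task_id}")
```

The caller is expected to write the modified state back to the state file after the update.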
Integration with Executor Loop

    # Enhanced execution loop
    WHILE tasks remain incomplete:
      1. Read state file
      2. Find ready tasks
      3. Spawn ready tasks
      4. Check completed tasks:
         FOR each completed task:
           IF status == pre_complete:
             spawn verifier
           ELIF status == blocked:
             apply Strategy 4 (Escalation)
           ELIF status == failed:
             determine_failure_category()
             apply_appropriate_strategy()
             update_retry_state()
      5. Update state file
      6. IF all verified: EXIT
      7. IF all failed with no recovery: EXIT with failure report
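Steps 4, 6, and 7 of the loop can be sketched as two small helpers. The action names, the `"verified"` status value, and the `recoverable` key are assumptions made for this illustration; only `pre_complete`, `blocked`, and `failed` appear in the loop itself.

```python
from typing import Optional


def next_action(task: dict) -> str:
    """Map a finished task's status to the follow-up named in step 4."""
    status = task["status"]
    if status == "pre_complete":
        return "spawn_verifier"
    if status == "blocked":
        return "escalate"   # Strategy 4
    if status == "failed":
        return "recover"    # categorize, apply a strategy, update retry state
    return "wait"


def exit_state(tasks: list[dict]) -> Optional[str]:
    """Return 'success', 'failure', or None while work remains (steps 6-7)."""
    if all(t["status"] == "verified" for t in tasks):
        return "success"
    if all(t["status"] == "failed" and not t.get("recoverable", False)
           for t in tasks):
        return "failure"
    return None
```

Separating the per-task dispatch from the exit check keeps the loop body short and makes each condition testable on its own.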
Principles
- Fail fast, recover smart - Don't retry blindly; analyze the failure first
- Preserve partial work - If the agent completed 50%, don't discard it
- Escalate early - Boundary/contract issues need a resolver, not retries
- Track everything - Log all attempts for the reflection phase
- Know when to quit - 3 failed strategies = abort, don't loop forever