Claude-skill-registry agent-ops-recovery

Handle failures and errors during workflow. Use when build breaks, tests fail unexpectedly, or agent gets stuck. Semi-automatic recovery with user confirmation for destructive actions.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/agent-ops-recovery" ~/.claude/skills/majiayu000-claude-skill-registry-agent-ops-recovery && rm -rf "$T"
manifest: skills/data/agent-ops-recovery/SKILL.md
source content

Error Recovery workflow

Trigger conditions

Use this skill when:

  • Build/lint fails unexpectedly after agent changes
  • Tests fail that were passing in baseline
  • Agent encounters ambiguity it cannot resolve
  • Implementation is stuck or going in circles

Recovery procedure

Step 1: Diagnose (invoke debugging)

For non-trivial failures, invoke

agent-ops-debugging
:

  1. Apply systematic debugging process:
    • Reproduce the issue consistently
    • Define expected vs actual behavior
    • Form hypothesis about root cause
  2. Use debugging output to inform recovery decision
  3. If root cause unclear after initial analysis, continue debugging before recovery

Step 2: Assess rollback options

  • Option A: Fix forward — issue is minor, can be resolved quickly
  • Option B: Partial rollback — revert specific file(s) to last good state
  • Option C: Full rollback — revert all agent changes since checkpoint
  • Option D: Escalate — document the issue, mark task blocked, ask user

Step 3: Propose action

Present options to user with:

  • What will be reverted/changed
  • Risk assessment
  • Recommendation

Step 4: Execute (with confirmation)

  • For non-destructive actions (fix forward): proceed
  • For destructive actions (rollback): ask user first
  • Update
    .agent/focus.md
    with recovery action taken

Destructive actions (require confirmation)

  • git reset
  • git checkout -- <file>
    (discard changes)
  • git revert
  • Deleting files
  • Overwriting files with previous versions

Non-destructive actions (can proceed)

  • git stash
  • Reading files
  • Running diagnostics
  • Updating focus/tasks with findings

Post-recovery

  1. Update
    .agent/focus.md
    with what happened
  2. Invoke
    agent-ops-tasks
    to create issue for root cause investigation
  3. Update
    .agent/memory.md
    with "pitfall to avoid" if applicable
  4. Re-run baseline comparison before continuing

Issue Discovery After Recovery

After recovery, invoke

agent-ops-tasks
discovery procedure:

  1. Create issue for the incident:

    📋 Recovery completed. Create issue to track root cause?
    
    Suggested:
    - [BUG] Investigate: {description of what failed}
      - What happened: {failure description}
      - Recovery action: {what was done}
      - Root cause: TBD
    
    Create this issue? [Y]es / [N]o
    
  2. If pattern detected, create prevention issue:

    This failure pattern has occurred before. Create improvement issue?
    
    - [CHORE] Add validation to prevent {failure type}
    - [TEST] Add regression test for {scenario}
    
    Create these? [A]ll / [S]elect / [N]one
    
  3. After creating issues:

    Created {N} issues for tracking. What's next?
    
    1. Investigate root cause now (BUG-0024@abc123)
    2. Continue with original work (defer investigation)
    3. Review recovery actions