Claude-skill-registry agent-ops-recovery

Handle failures and errors during workflow. Use when build breaks, tests fail unexpectedly, or agent gets stuck. Semi-automatic recovery with user confirmation for destructive actions.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/agent-ops-recovery" ~/.claude/skills/majiayu000-claude-skill-registry-agent-ops-recovery && rm -rf "$T"

manifest: skills/data/agent-ops-recovery/SKILL.md

Error Recovery workflow

Trigger conditions

Use this skill when:

Build/lint fails unexpectedly after agent changes
Tests fail that were passing in baseline
Agent encounters ambiguity it cannot resolve
Implementation is stuck or going in circles

Recovery procedure

Step 1: Diagnose (invoke debugging)

For non-trivial failures, invoke

agent-ops-debugging

Apply systematic debugging process:
- Reproduce the issue consistently
- Define expected vs actual behavior
- Form hypothesis about root cause
Use debugging output to inform recovery decision
If root cause unclear after initial analysis, continue debugging before recovery

Step 2: Assess rollback options

Option A: Fix forward — issue is minor, can be resolved quickly
Option B: Partial rollback — revert specific file(s) to last good state
Option C: Full rollback — revert all agent changes since checkpoint
Option D: Escalate — document the issue, mark task blocked, ask user

Step 3: Propose action

Present options to user with:

What will be reverted/changed
Risk assessment
Recommendation

Step 4: Execute (with confirmation)

For non-destructive actions (fix forward): proceed
For destructive actions (rollback): ask user first
Update
```
.agent/focus.md
```
with recovery action taken

Destructive actions (require confirmation)

```
git reset
```
```
git checkout -- <file>
```
(discard changes)
```
git revert
```
Deleting files
Overwriting files with previous versions

Non-destructive actions (can proceed)

```
git stash
```
Reading files
Running diagnostics
Updating focus/tasks with findings

Post-recovery

Update
```
.agent/focus.md
```
with what happened
Invoke
```
agent-ops-tasks
```
to create issue for root cause investigation
Update
```
.agent/memory.md
```
with "pitfall to avoid" if applicable
Re-run baseline comparison before continuing

Issue Discovery After Recovery

After recovery, invoke

agent-ops-tasks

discovery procedure:

Create issue for the incident:

📋 Recovery completed. Create issue to track root cause?

Suggested:
- [BUG] Investigate: {description of what failed}
  - What happened: {failure description}
  - Recovery action: {what was done}
  - Root cause: TBD

Create this issue? [Y]es / [N]o

If pattern detected, create prevention issue:

This failure pattern has occurred before. Create improvement issue?

- [CHORE] Add validation to prevent {failure type}
- [TEST] Add regression test for {scenario}

Create these? [A]ll / [S]elect / [N]one

After creating issues:

Created {N} issues for tracking. What's next?

1. Investigate root cause now (BUG-0024@abc123)
2. Continue with original work (defer investigation)
3. Review recovery actions