Claude-skill-registry error-recovery
Use when encountering failures - assess severity, preserve evidence, execute rollback decision tree, and verify post-recovery state
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/error-recovery" ~/.claude/skills/majiayu000-claude-skill-registry-error-recovery-5877f0 && rm -rf "$T"
skills/data/error-recovery/SKILL.md
Error Recovery
Overview
Handle failures gracefully with structured recovery.
Core principle: When things break, don't panic. Assess, preserve, recover, verify.
Announce at start: "I'm using error-recovery to handle this failure."
The Recovery Protocol
    Error Detected
           │
           ▼
    ┌─────────────┐
    │ 1. ASSESS   │ ← Severity? Scope? Impact?
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │ 2. PRESERVE │ ← Capture evidence before it's lost
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │ 3. RECOVER  │ ← Follow decision tree
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │ 4. VERIFY   │ ← Confirm clean state
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │ 5. DOCUMENT │ ← Record what happened
    └─────────────┘
Step 1: Assess Severity
Severity Levels
| Level | Description | Examples |
|---|---|---|
| Critical | System unusable, data at risk | Build completely broken, tests cause data loss |
| Major | Significant functionality broken | Feature doesn't work, many tests failing |
| Minor | Isolated issue, workaround exists | Single test flaky, style error |
| Info | Warning only, not blocking | Deprecation notice, performance hint |
Assessment Questions
    ## Error Assessment

    **Error:** [Description of error]
    **Location:** [Where it occurred]

    ### Severity Checklist
    - [ ] Is the system still functional?
    - [ ] Is any data at risk?
    - [ ] Are other features affected?
    - [ ] Is this blocking progress?

    ### Scope
    - Files affected: [list]
    - Features affected: [list]
    - Users affected: [none/some/all]
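The checklist answers can be folded into a severity level mechanically. The following is a minimal sketch, assuming a hypothetical `classify_severity` helper; the mapping from answers to levels is one reasonable reading of the severity table, not a rule the skill prescribes.

```shell
#!/usr/bin/env bash
# classify_severity FUNCTIONAL DATA_AT_RISK OTHERS_AFFECTED BLOCKING
# Each argument is "yes" or "no". The mapping below is an assumption
# made for illustration; adjust it to your own severity policy.
classify_severity() {
  local functional=$1 data_at_risk=$2 others_affected=$3 blocking=$4
  if [ "$functional" = "no" ] || [ "$data_at_risk" = "yes" ]; then
    echo "Critical"   # system unusable or data at risk
  elif [ "$others_affected" = "yes" ]; then
    echo "Major"      # breakage spreads beyond one feature
  elif [ "$blocking" = "yes" ]; then
    echo "Minor"      # isolated, but it does block progress
  else
    echo "Info"       # warning only
  fi
}
```

For example, `classify_severity yes no yes yes` prints `Major`.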
Step 2: Preserve Evidence
Capture BEFORE attempting fixes:
Error Logs
    # Capture error output
    pnpm test 2>&1 | tee error-log.txt

    # Or from failed command
    ./failing-command 2>&1 | tee error-log.txt
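One caveat with the pipelines above: the exit status of a pipeline is that of `tee`, so a failing command looks successful to any script that checks `$?`. A minimal Bash-specific sketch that keeps the log and the real exit status, assuming a hypothetical `capture_log` helper:

```shell
#!/usr/bin/env bash
# capture_log LOGFILE COMMAND [ARGS...]
# Runs the command, mirrors stdout and stderr into LOGFILE, and returns
# the command's own exit status instead of tee's.
# PIPESTATUS is a Bash array; this does not work in plain POSIX sh.
capture_log() {
  local logfile=$1
  shift
  "$@" 2>&1 | tee "$logfile"
  return "${PIPESTATUS[0]}"
}
```

Usage would look like `capture_log error-log.txt pnpm test`, after which `$?` reflects the test run itself.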
Stack Traces
    ## Stack Trace

    Error: Connection refused
        at Database.connect (src/db/connection.ts:45)
        at UserService.init (src/services/user.ts:23)
        at main (src/index.ts:12)
State Capture
    # Git state
    git status
    git diff

    # Environment state
    env | grep -E "NODE|NPM|PATH"

    # Dependency state
    pnpm list
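Printing state to the terminal is easy to lose once recovery starts, so it may help to write it to files instead. A minimal sketch, assuming a hypothetical `snapshot_state` helper; the directory layout and file names are illustrative, and the dependency listing (`pnpm list`) is left out to keep the sketch free of project-specific tooling.

```shell
#!/usr/bin/env bash
# snapshot_state DIR
# Writes environment and git state into DIR so it survives later resets.
# File names are assumptions chosen for this example.
snapshot_state() {
  local dir=$1
  mkdir -p "$dir"
  # Environment: same filter as the command shown above.
  env | grep -E "NODE|NPM|PATH" > "$dir/env.txt" || true
  # Git state, captured only when run inside a work tree.
  if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    git status --short > "$dir/git-status.txt"
    git diff > "$dir/git-diff.patch"
    # Fails harmlessly in a repository with no commits yet.
    git rev-parse HEAD > "$dir/git-head.txt" 2>/dev/null || true
  fi
}
```

Calling `snapshot_state evidence/` before any fix leaves a patch file that can be inspected or re-applied later.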
Screenshot (if visual)
For UI errors, capture screenshots before changes.
Step 3: Recover
Decision Tree
                        What type of failure?
                                  │
             ┌─────────────┬──────┴──────┬─────────────┐
             │             │             │             │
           Code          Build      Environment    External
           Error         Error         Issue        Service
             │             │             │             │
             ▼             ▼             ▼             ▼
        ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
        │   Git   │   │  Clean  │   │ Re-init │   │  Wait/  │
        │ recovery│   │  build  │   │         │   │  Retry  │
        └─────────┘   └─────────┘   └─────────┘   └─────────┘
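The tree amounts to a four-way dispatch on failure type. A minimal sketch, assuming a hypothetical `recovery_strategy` helper whose input keywords and output labels are invented for illustration:

```shell
#!/usr/bin/env bash
# recovery_strategy FAILURE_TYPE
# Prints the branch of the decision tree that applies.
# Keywords and labels are illustrative, not defined by the skill.
recovery_strategy() {
  case "$1" in
    code)        echo "git-recovery" ;;
    build)       echo "clean-build" ;;
    environment) echo "re-init" ;;
    external)    echo "wait-retry" ;;
    *)           echo "unknown failure type: $1" >&2; return 1 ;;
  esac
}
```

An unrecognized type returns a non-zero status, which is a natural point to stop and reassess rather than guess at a recovery path.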
Code Error Recovery
Single file broken:
    # Revert just that file
    git checkout HEAD -- path/to/file.ts
Feature broken (multiple files):
    # Find last good commit
    git log --oneline

    # Soft reset: moves HEAD to that commit; later changes stay staged
    git reset --soft [GOOD_COMMIT]

    # Or hard reset: discards later commits' changes and any uncommitted work
    git reset --hard [GOOD_COMMIT]
Working directory is a mess:
    # Stash current changes (add --include-untracked to stash new files too)
    git stash

    # Verify clean state
    git status

    # Optionally recover stash later
    git stash pop
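The hard reset shown earlier in this section makes the abandoned commits hard to find again. One way to keep them reachable is to leave a branch pointing at the old HEAD first. A minimal sketch, assuming a hypothetical `safe_hard_reset` helper and an illustrative `backup/` branch prefix:

```shell
#!/usr/bin/env bash
# safe_hard_reset GOOD_COMMIT
# Leaves a backup branch at the current HEAD, then hard-resets.
# Uncommitted changes are still lost; stash them first if they matter.
safe_hard_reset() {
  local good_commit=$1
  local backup
  backup="backup/pre-reset-$(date +%Y%m%d-%H%M%S)"
  git branch "$backup" || return 1
  echo "Saved current HEAD as $backup"
  git reset --hard "$good_commit"
}
```

If the reset turns out to be a mistake, `git reset --hard` to the backup branch restores the previous commits.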
Build Error Recovery
    # Clean build artifacts
    rm -rf node_modules dist build .cache

    # Reinstall dependencies
    pnpm install --frozen-lockfile  # Clean install from lock file

    # Rebuild
    pnpm build
Environment Error Recovery
    # Check environment
    env | grep -E "NODE|PNPM"

    # Reset Node modules
    rm -rf node_modules
    pnpm install --frozen-lockfile

    # If using nvm, verify version
    nvm use

    # Re-run init script
    ./scripts/init.sh
External Service Error
    # Check if service is up
    curl -I https://service.example.com/health

    # If down, wait and retry
    sleep 60
    curl -I https://service.example.com/health

    # If still down, check status page
    # Document as external blocker
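The wait-and-retry step generalizes to a small loop. A minimal sketch, assuming a hypothetical `retry` helper; the doubling delay is a simple backoff chosen for illustration, not something the skill specifies.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs the command until it succeeds or attempts run out.
# The delay doubles after each failure (assumed backoff policy).
retry() {
  local max=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

With the health check above this could be `retry 3 60 curl -fsI https://service.example.com/health`. The `-f` flag matters here: without it, curl exits 0 even when the server answers with an HTTP error status, so the loop would stop on the first response.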
Step 4: Verify
After recovery, verify clean state:
Basic Verification
    # Clean working directory
    git status
    # Expected: "nothing to commit, working tree clean" or known changes

    # Tests pass
    pnpm test

    # Build succeeds
    pnpm build

    # Types check
    pnpm typecheck
Functionality Verification
    # Run the specific thing that was broken
    # (--grep is Mocha's filter flag; other test runners use different options)
    pnpm test --grep "specific test"

    # Or verify the feature manually
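Running the verification commands one by one makes it easy to stop at the first failure and miss a second one. A minimal sketch that runs every check and summarizes the results, assuming a hypothetical `verify_all` helper:

```shell
#!/usr/bin/env bash
# verify_all "CMD1" "CMD2" ...
# Runs every check even if an earlier one fails, prints a PASS/FAIL line
# per check, and returns non-zero if any check failed.
verify_all() {
  local failed=0 check
  for check in "$@"; do
    if sh -c "$check" >/dev/null 2>&1; then
      echo "PASS  $check"
    else
      echo "FAIL  $check"
      failed=1
    fi
  done
  return "$failed"
}
```

For the checks listed in this step, the call would be `verify_all "pnpm test" "pnpm build" "pnpm typecheck"`. Command output is discarded here to keep the summary short; redirect it to a log file instead if the details are needed.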
Step 5: Document
Issue Comment
    gh issue comment [ISSUE_NUMBER] --body "## Error Recovery

    **Error encountered:** [Description]
    **Severity:** Major

    **Evidence:**
    \`\`\`
    [Error output]
    \`\`\`

    **Recovery actions:**
    1. [Action 1]
    2. [Action 2]

    **Verification:**
    - [x] Tests pass
    - [x] Build succeeds

    **Root cause:** [If known]
    **Prevention:** [If applicable]
    "
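Quoting a long Markdown body inline is fragile once real error output contains quotes or backticks. An alternative is to render the body to a file and pass it with `gh issue comment --body-file`. A minimal sketch, assuming a hypothetical `render_recovery_comment` helper; the 20-line tail is an arbitrary limit, and only the first fields of the template are shown.

```shell
#!/usr/bin/env bash
# render_recovery_comment DESCRIPTION SEVERITY LOGFILE
# Prints a Markdown comment body to stdout.
# Only the last 20 log lines are included (assumed limit).
render_recovery_comment() {
  local description=$1 severity=$2 logfile=$3
  cat <<EOF
## Error Recovery

**Error encountered:** $description
**Severity:** $severity

**Evidence:**
\`\`\`
$(tail -n 20 "$logfile")
\`\`\`
EOF
}
```

Usage might look like `render_recovery_comment "DB connection refused" Major error-log.txt > comment.md`, followed by `gh issue comment [ISSUE_NUMBER] --body-file comment.md`.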
Knowledge Graph
    // Store for future reference
    mcp__memory__add_observations({
      observations: [{
        entityName: "Issue #[NUMBER]",
        contents: [
          "Encountered [error type] on [date]",
          "Caused by: [root cause]",
          "Resolved by: [recovery action]"
        ]
      }]
    });
Common Recovery Patterns
"Tests were passing, now failing"
    # What changed?
    git diff HEAD~3

    # Did dependencies change?
    git diff HEAD~3 -- pnpm-lock.yaml

    # Clean reinstall
    rm -rf node_modules && pnpm install --frozen-lockfile
"Works locally, fails in CI"
    # Check for environment differences
    # - Node version
    # - OS differences
    # - Env vars

    # Run with CI-like settings
    CI=true pnpm test
"Build was working, now broken"
    # Check TypeScript errors
    pnpm typecheck

    # Check for circular dependencies
    pnpm dlx madge --circular src/

    # Clean build
    rm -rf dist && pnpm build
"I broke everything"
    # Don't panic

    # Find last known good state
    git log --oneline

    # Reset to that state
    git reset --hard [GOOD_COMMIT]

    # Verify
    pnpm test

    # Start again more carefully
Escalation
If recovery fails after 2-3 attempts:
    ## Escalation: Unrecoverable Error

    **Issue:** #[NUMBER]
    **Error:** [Description]

    **Recovery attempts:**
    1. [Attempt 1] - [Result]
    2. [Attempt 2] - [Result]

    **Current state:** [Broken/Partially working]

    **Evidence preserved:** [Links to logs, screenshots]

    **Requesting help with:** [Specific question]
Mark issue as Blocked and await human input.
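The attempt limit can be enforced in a script rather than by memory. A minimal sketch, assuming a hypothetical `recover_or_escalate` helper that takes an attempt limit, a verification command, and an ordered list of recovery commands:

```shell
#!/usr/bin/env bash
# recover_or_escalate MAX_ATTEMPTS VERIFY_CMD RECOVERY_CMD...
# Tries each recovery command in order, running VERIFY_CMD after each.
# Returns 0 on the first verified recovery. After MAX_ATTEMPTS failures,
# or when the commands run out, prints an ESCALATE line and returns 1.
recover_or_escalate() {
  local max=$1 verify=$2 attempt=0 step
  shift 2
  for step in "$@"; do
    [ "$attempt" -ge "$max" ] && break
    attempt=$((attempt + 1))
    echo "attempt $attempt: $step"
    if sh -c "$step" >/dev/null 2>&1 && sh -c "$verify" >/dev/null 2>&1; then
      echo "recovered after $attempt attempt(s)"
      return 0
    fi
  done
  echo "ESCALATE: $attempt attempt(s) failed, mark issue as Blocked"
  return 1
}
```

The printed attempt lines map directly onto the "Recovery attempts" entries of the escalation template.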
Checklist
When error occurs:
- Severity assessed
- Evidence preserved (logs, state, screenshots)
- Recovery action selected
- Recovery executed
- Clean state verified
- Tests pass
- Build succeeds
- Issue documented
Integration
This skill is called by:
- issue-driven-development: when errors occur
- ci-monitoring: CI failures
This skill may trigger:
- research-after-failure: if cause is unknown
- issue-lifecycle: issue update