Learn-skills.dev error-coordinator
Expert in making multi-agent systems resilient. Specializes in detecting loops, hallucinations, and failures, and implementing self-healing workflows. Use when designing error handling for agent systems, implementing retry strategies, or building resilient AI workflows.
install
source · Clone the upstream repo
git clone https://github.com/NeverSight/learn-skills.dev
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/404kidwiz/claude-supercode-skills/error-coordinator" ~/.claude/skills/neversight-learn-skills-dev-error-coordinator && rm -rf "$T"
manifest:
data/skills-md/404kidwiz/claude-supercode-skills/error-coordinator/SKILL.md
Error Coordinator
Purpose
Provides expertise in building resilient multi-agent systems with robust error handling, failure detection, and recovery mechanisms. Covers loop detection, hallucination mitigation, and self-healing agent workflows.
When to Use
- Designing error handling for agent systems
- Implementing retry and recovery strategies
- Building self-healing AI workflows
- Detecting agent loops and infinite recursion
- Mitigating hallucinations in agent outputs
- Implementing circuit breakers for agents
- Coordinating failure recovery across agents
Quick Start
Invoke this skill when:
- Designing error handling for agent systems
- Implementing retry and recovery strategies
- Building self-healing AI workflows
- Detecting agent loops and infinite recursion
- Coordinating failure recovery across agents
Do NOT invoke when:
- Organizing agent teams (use agent-organizer)
- Debugging application errors (use debugger)
- Handling production incidents (use incident-responder)
- Detecting code error patterns (use error-detective)
Decision Framework
Error Type Handling:
├── Transient failure → Retry with backoff
├── Rate limiting → Backoff + queue
├── Invalid output → Validation + retry with feedback
├── Loop detected → Break + escalate
├── Hallucination → Ground with context, retry
├── Agent timeout → Cancel + fallback
└── Cascading failure → Circuit breaker

Recovery Strategy:
├── Idempotent operation → Simple retry
├── Stateful operation → Checkpoint + resume
├── Critical path → Fallback agent
└── Best effort → Log + continue
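One way a coordinator might encode the error-type tree above is a lookup table keyed by error category. The sketch below is illustrative only: `ErrorType`, `HANDLING`, `classify`, and `choose_handling` are hypothetical names, the strategy strings are labels rather than real APIs, and the mapping from built-in exceptions to categories is an assumption that a real system would replace with its own classification.

```python
from enum import Enum, auto


class ErrorType(Enum):
    TRANSIENT = auto()
    RATE_LIMITED = auto()
    INVALID_OUTPUT = auto()
    LOOP_DETECTED = auto()
    HALLUCINATION = auto()
    AGENT_TIMEOUT = auto()
    CASCADING = auto()


# Mirrors the "Error Type Handling" tree; values are labels, not real functions.
HANDLING = {
    ErrorType.TRANSIENT: "retry_with_backoff",
    ErrorType.RATE_LIMITED: "backoff_and_queue",
    ErrorType.INVALID_OUTPUT: "validate_and_retry_with_feedback",
    ErrorType.LOOP_DETECTED: "break_and_escalate",
    ErrorType.HALLUCINATION: "ground_with_context_and_retry",
    ErrorType.AGENT_TIMEOUT: "cancel_and_fallback",
    ErrorType.CASCADING: "circuit_breaker",
}


def classify(exc: Exception) -> ErrorType:
    """Hypothetical classifier mapping built-in exceptions to categories."""
    if isinstance(exc, TimeoutError):
        return ErrorType.AGENT_TIMEOUT
    if isinstance(exc, ValueError):
        return ErrorType.INVALID_OUTPUT
    if isinstance(exc, RecursionError):
        return ErrorType.LOOP_DETECTED
    # Assumption: unknown errors are treated as transient, which is only
    # safe when retries are capped.
    return ErrorType.TRANSIENT


def choose_handling(exc: Exception) -> str:
    """Return the handling label for an exception."""
    return HANDLING[classify(exc)]
```

Keeping the table separate from the classifier means new error categories can be added without touching the routing logic.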
Core Workflows
1. Loop Detection System
- Track agent invocation history
- Detect repeated state patterns
- Set maximum iteration limits
- Implement escape hatch triggers
- Log loop occurrences for analysis
- Escalate to supervisor or human
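The steps above could be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: `LoopDetector`, `LoopDetected`, and the default limits are hypothetical, and it assumes agent state can be serialized to JSON so that repeated states produce identical fingerprints.

```python
import hashlib
import json
import logging
from collections import Counter

logger = logging.getLogger("loop_detector")


class LoopDetected(Exception):
    """Raised so a supervisor or human can take over."""


class LoopDetector:
    def __init__(self, max_iterations=20, max_repeats=3):
        self.max_iterations = max_iterations  # hard cap on total invocations
        self.max_repeats = max_repeats        # identical states tolerated
        self.history = []                     # invocation fingerprints, in order
        self.seen = Counter()                 # fingerprint -> occurrence count

    def record(self, agent, state):
        """Record one invocation; raise LoopDetected when a limit is hit."""
        payload = json.dumps(
            {"agent": agent, "state": state}, sort_keys=True, default=str
        )
        fingerprint = hashlib.sha256(payload.encode()).hexdigest()
        self.history.append(fingerprint)
        self.seen[fingerprint] += 1

        if len(self.history) > self.max_iterations:
            logger.warning("iteration limit exceeded by agent %s", agent)
            raise LoopDetected(f"more than {self.max_iterations} iterations")
        if self.seen[fingerprint] >= self.max_repeats:
            logger.warning("repeated state detected for agent %s", agent)
            raise LoopDetected(
                f"state repeated {self.seen[fingerprint]} times for {agent}"
            )
```

A caller would invoke `record` before each agent step and treat `LoopDetected` as the escape hatch, handing control to a supervisor or a human. Exact-match fingerprints miss near-duplicate states, so the iteration cap remains the backstop.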
2. Hallucination Mitigation
- Ground responses with source data
- Implement output validation
- Cross-check with retrieval
- Add confidence scoring
- Flag low-confidence outputs
- Provide feedback for retry
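A rough sketch of validation with feedback-driven retry follows. Everything here is an assumption made for illustration: the agent is modeled as a plain callable that returns a list of claim strings, grounding is checked by naive case-insensitive substring matching against the sources, and the confidence score is simply the fraction of supported claims. A real system would likely use retrieval or an entailment check instead.

```python
def unsupported_claims(claims: list[str], sources: list[str]) -> list[str]:
    """Return claims that do not appear (case-insensitively) in any source."""
    corpus = " ".join(sources).lower()
    return [c for c in claims if c.lower() not in corpus]


def answer_with_grounding(agent, question, sources,
                          min_confidence=0.8, max_attempts=3):
    """Call the agent, score its claims, and retry with feedback if needed.

    Assumes max_attempts >= 1. `agent` is a hypothetical callable:
    prompt (str) -> list of claims (list[str]).
    """
    prompt = question
    for _ in range(max_attempts):
        claims = agent(prompt)
        missing = unsupported_claims(claims, sources)
        # Confidence = share of claims backed by the sources.
        confidence = 1.0 - len(missing) / len(claims) if claims else 0.0
        if confidence >= min_confidence:
            return {"claims": claims, "confidence": confidence, "flagged": False}
        # Feed the unsupported claims back so the next attempt can fix them.
        prompt = (
            f"{question}\n"
            f"These claims are not supported by the sources; "
            f"drop or correct them: {missing}"
        )
    # Out of attempts: return the last output, flagged as low confidence.
    return {"claims": claims, "confidence": confidence, "flagged": True}
```

Flagged results are returned rather than raised so that downstream code can decide whether to escalate or degrade gracefully.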
3. Circuit Breaker Implementation
- Track failure rates per agent
- Define failure threshold
- Open circuit on threshold breach
- Provide fallback behavior
- Implement half-open state for testing
- Close circuit on recovery
- Monitor and alert on breaker state
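The closed, open, and half-open states described above might look like this in a minimal form. The `CircuitBreaker` class and its parameters are hypothetical, and the sketch simplifies "failure rate" to a count of consecutive failures; a sliding-window rate would be a natural refinement. The injectable `clock` is an assumption added to make the behavior testable.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, fallback):
        """Run fn() through the breaker; use fallback() when it fails or is open."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()        # still open: skip the agent entirely
            self.state = "half_open"     # timeout elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if (self.state == "half_open"
                    or self.failures >= self.failure_threshold):
                self.state = "open"
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.state = "closed"
        return result
```

To track failures per agent, one breaker instance could be kept per agent name, for example in a dictionary. Exposing `state` makes it straightforward to emit a metric or alert whenever it changes.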
Best Practices
- Implement timeouts for all agent calls
- Use exponential backoff with jitter
- Log all failures with full context
- Design for graceful degradation
- Test failure scenarios explicitly
- Monitor error rates and patterns
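The first two practices can be combined in a small helper, sketched here with `asyncio`. The names `backoff_delay` and `call_with_retry` and all default values are illustrative assumptions; the jitter variant shown is "full jitter" (a uniformly random delay up to the exponential cap), and only timeouts and `ConnectionError` are treated as retryable.

```python
import asyncio
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full jitter: random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


async def call_with_retry(make_call, *, timeout=10.0, max_attempts=4,
                          base=0.5, cap=30.0):
    """Await make_call() with a per-attempt timeout and capped retries.

    `make_call` is a hypothetical zero-argument function returning a fresh
    coroutine for each attempt (a coroutine cannot be awaited twice).
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            # Sleep only if another attempt will follow.
            if attempt < max_attempts - 1:
                await asyncio.sleep(backoff_delay(attempt, base, cap))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The randomized delay matters as much as the exponential growth: without it, clients that failed together retry together.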
Anti-Patterns
| Anti-Pattern | Problem | Correct Approach |
|---|---|---|
| Infinite retries | Resource exhaustion | Max retry limits |
| Silent failures | Hidden problems | Log and alert |
| No timeouts | Hung processes | Always set timeouts |
| Same retry interval | Thundering herd | Exponential backoff with jitter |
| No fallbacks | Complete failure | Graceful degradation |