Learn-skills.dev error-coordinator

Expert in making multi-agent systems resilient. Specializes in detecting loops, hallucinations, and failures, and implementing self-healing workflows. Use when designing error handling for agent systems, implementing retry strategies, or building resilient AI workflows.

install

source · Clone the upstream repo

git clone https://github.com/NeverSight/learn-skills.dev

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/404kidwiz/claude-supercode-skills/error-coordinator" ~/.claude/skills/neversight-learn-skills-dev-error-coordinator && rm -rf "$T"

manifest: data/skills-md/404kidwiz/claude-supercode-skills/error-coordinator/SKILL.md

source content

Error Coordinator

Purpose

Provides expertise in building resilient multi-agent systems with robust error handling, failure detection, and recovery mechanisms. Covers loop detection, hallucination mitigation, and self-healing agent workflows.

When to Use

Designing error handling for agent systems
Implementing retry and recovery strategies
Building self-healing AI workflows
Detecting agent loops and infinite recursion
Mitigating hallucinations in agent outputs
Implementing circuit breakers for agents
Coordinating failure recovery across agents

Quick Start

Invoke this skill for any of the tasks listed under When to Use.

Do NOT invoke when:

Organizing agent teams (use agent-organizer)
Debugging application errors (use debugger)
Handling production incidents (use incident-responder)
Detecting code error patterns (use error-detective)

Decision Framework

Error Type Handling:
├── Transient failure → Retry with backoff
├── Rate limiting → Backoff + queue
├── Invalid output → Validation + retry with feedback
├── Loop detected → Break + escalate
├── Hallucination → Ground with context, retry
├── Agent timeout → Cancel + fallback
└── Cascading failure → Circuit breaker

Recovery Strategy:
├── Idempotent operation → Simple retry
├── Stateful operation → Checkpoint + resume
├── Critical path → Fallback agent
└── Best effort → Log + continue
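
The two trees above can be encoded as plain lookup tables so that a coordinator picks a response without branching logic. This is a minimal sketch; the enum names, the string labels, and the `plan_response` helper are hypothetical and are not part of the skill itself.

```python
from enum import Enum


class ErrorType(Enum):
    TRANSIENT_FAILURE = "transient_failure"
    RATE_LIMITING = "rate_limiting"
    INVALID_OUTPUT = "invalid_output"
    LOOP_DETECTED = "loop_detected"
    HALLUCINATION = "hallucination"
    AGENT_TIMEOUT = "agent_timeout"
    CASCADING_FAILURE = "cascading_failure"


class OperationKind(Enum):
    IDEMPOTENT = "idempotent"
    STATEFUL = "stateful"
    CRITICAL_PATH = "critical_path"
    BEST_EFFORT = "best_effort"


# Mirrors the "Error Type Handling" tree, one entry per branch.
ERROR_HANDLING = {
    ErrorType.TRANSIENT_FAILURE: "retry with backoff",
    ErrorType.RATE_LIMITING: "backoff and queue",
    ErrorType.INVALID_OUTPUT: "validate and retry with feedback",
    ErrorType.LOOP_DETECTED: "break and escalate",
    ErrorType.HALLUCINATION: "ground with context and retry",
    ErrorType.AGENT_TIMEOUT: "cancel and fall back",
    ErrorType.CASCADING_FAILURE: "circuit breaker",
}

# Mirrors the "Recovery Strategy" tree.
RECOVERY_STRATEGY = {
    OperationKind.IDEMPOTENT: "simple retry",
    OperationKind.STATEFUL: "checkpoint and resume",
    OperationKind.CRITICAL_PATH: "fallback agent",
    OperationKind.BEST_EFFORT: "log and continue",
}


def plan_response(error_type: ErrorType, operation_kind: OperationKind) -> tuple[str, str]:
    """Return (handling, recovery) labels for an observed error."""
    return ERROR_HANDLING[error_type], RECOVERY_STRATEGY[operation_kind]
```

Keeping the mapping in data rather than in `if` chains makes it easy to check that every error type has a handler.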

Core Workflows

1. Loop Detection System

Track agent invocation history
Detect repeated state patterns
Set maximum iteration limits
Implement escape hatch triggers
Log loop occurrences for analysis
Escalate to supervisor or human
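
The steps above can be sketched as a small detector that fingerprints each invocation and raises once a state repeats too often or the iteration limit is exceeded. The class name, the default limits, and the choice of a SHA-256 fingerprint over JSON-serialized state are assumptions made for illustration; escalation is left to whoever catches the exception.

```python
import hashlib
import json
import logging
from collections import Counter

logger = logging.getLogger("loop_detector")


class LoopDetected(Exception):
    """Raised when an agent run should be broken out of and escalated."""


class LoopDetector:
    def __init__(self, max_iterations: int = 25, max_repeats: int = 3):
        self.max_iterations = max_iterations  # hard cap on recorded invocations
        self.max_repeats = max_repeats        # identical states tolerated before breaking
        self.history: list[str] = []          # ordered invocation fingerprints
        self.seen: Counter = Counter()

    @staticmethod
    def _fingerprint(agent: str, state: dict) -> str:
        # Stable hash of (agent, state); non-JSON values fall back to str().
        payload = json.dumps({"agent": agent, "state": state}, sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, agent: str, state: dict) -> None:
        """Record one invocation; raise LoopDetected when a trigger fires."""
        fp = self._fingerprint(agent, state)
        self.history.append(fp)
        self.seen[fp] += 1
        if self.seen[fp] >= self.max_repeats:
            logger.warning("loop: %s repeated state %s", agent, fp[:12])
            raise LoopDetected(f"{agent} repeated the same state {self.seen[fp]} times")
        if len(self.history) > self.max_iterations:
            logger.warning("loop: iteration limit %d exceeded", self.max_iterations)
            raise LoopDetected(f"iteration limit of {self.max_iterations} exceeded")
```

A supervisor would call `record` before each agent step and treat `LoopDetected` as the signal to stop and hand off to a human or a higher-level agent.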

2. Hallucination Mitigation

Ground responses with source data
Implement output validation
Cross-check with retrieval
Add confidence scoring
Flag low-confidence outputs
Provide feedback for retry
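
One way to wire these steps together is a validate-and-retry loop. In the sketch below, `generate` is a hypothetical callable that takes optional feedback text and returns a list of claims, and the confidence score is simply the fraction of claims found verbatim in the source text. That substring check is a deliberately naive stand-in for a real retrieval or entailment check.

```python
from typing import Callable, Optional


def grounding_score(claims: list[str], sources: list[str]) -> tuple[float, list[str]]:
    """Return (fraction of supported claims, list of unsupported claims)."""
    if not claims:
        return 0.0, []
    corpus = " ".join(sources).lower()
    # Naive placeholder check: a claim counts as supported if it appears verbatim.
    unsupported = [c for c in claims if c.lower() not in corpus]
    return 1 - len(unsupported) / len(claims), unsupported


def generate_grounded(
    generate: Callable[[Optional[str]], list[str]],
    sources: list[str],
    threshold: float = 0.8,
    max_attempts: int = 3,
) -> dict:
    """Retry generation with feedback until the score clears the threshold."""
    feedback: Optional[str] = None
    claims: list[str] = []
    score = 0.0
    for _ in range(max_attempts):
        claims = generate(feedback)
        score, unsupported = grounding_score(claims, sources)
        if score >= threshold:
            return {"claims": claims, "confidence": score, "flagged": False}
        # Tell the agent exactly which claims lacked support.
        feedback = "Unsupported by sources: " + "; ".join(unsupported)
    # Attempts exhausted: return the last output, flagged as low confidence.
    return {"claims": claims, "confidence": score, "flagged": True}
```

Flagged results are returned rather than discarded so that a downstream reviewer can decide what to do with them.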

3. Circuit Breaker Implementation

Track failure rates per agent
Define failure threshold
Open circuit on threshold breach
Provide fallback behavior
Implement a half-open state that lets a trial call through to probe for recovery
Close circuit on recovery
Monitor and alert on breaker state
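
A minimal sketch of the three-state breaker described above. It counts consecutive failures rather than a true failure rate, which is a simplification; the class name, the default threshold and timeout, and the injectable `clock` parameter are assumptions made for illustration.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("circuit_breaker")


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to wait before a trial call
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func: Callable, fallback: Callable):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                return fallback()          # still open: skip the agent entirely
            self._set_state("half_open")   # timeout elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self._on_failure()
            return fallback()
        self._on_success()
        return result

    def _on_failure(self) -> None:
        self.failures += 1
        # A failed trial call, or too many failures, (re)opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
            self._set_state("open")

    def _on_success(self) -> None:
        self.failures = 0
        self._set_state("closed")

    def _set_state(self, new_state: str) -> None:
        if new_state != self.state:
            logger.warning("circuit %s -> %s", self.state, new_state)  # hook for alerting
            self.state = new_state
```

Creating one `CircuitBreaker` per agent keeps failure tracking isolated, so a single misbehaving agent does not open the circuit for the others.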

Best Practices

Implement timeouts for all agent calls
Use exponential backoff with jitter
Log all failures with full context
Design for graceful degradation
Test failure scenarios explicitly
Monitor error rates and patterns
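
As an illustration of bounded retries with exponential backoff and jitter, the sketch below uses "full jitter": each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name and default values are assumptions, and `sleep` is injectable so the behavior can be tested without waiting.

```python
import logging
import random
import time
from typing import Callable

logger = logging.getLogger("retry")


def retry_with_backoff(func: Callable, max_attempts: int = 4, base: float = 0.5,
                       cap: float = 30.0, sleep: Callable[[float], None] = time.sleep):
    """Call func, retrying failures up to max_attempts times in total."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts - 1:
                # Retry budget exhausted: surface the last error to the caller.
                logger.error("giving up after %d attempts: %r", max_attempts, exc)
                raise
            # Ceiling doubles each attempt (base, 2*base, 4*base, ...) up to cap.
            ceiling = min(cap, base * 2 ** attempt)
            delay = random.uniform(0, ceiling)  # full jitter spreads out retries
            logger.warning("attempt %d failed (%r); retrying in %.2fs",
                           attempt + 1, exc, delay)
            sleep(delay)
```

The randomized delay is what prevents many agents that failed at the same moment from retrying in lockstep.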

Anti-Patterns

Anti-Pattern	Problem	Correct Approach
Infinite retries	Resource exhaustion	Max retry limits
Silent failures	Hidden problems	Log and alert
No timeouts	Hung processes	Always set timeouts
Fixed retry interval	Thundering herd	Exponential backoff with jitter
No fallbacks	Complete failure	Graceful degradation
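
The "No timeouts" and "No fallbacks" rows can be addressed together. The sketch below assumes agents are exposed as async callables; `call_with_timeout` and its parameters are hypothetical names. `asyncio.wait_for` cancels the pending call when the timeout expires, after which the fallback runs.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_timeout(agent_call: Callable[[], Awaitable],
                            fallback: Callable[[], Awaitable],
                            timeout_s: float = 5.0):
    """Await agent_call, cancelling it and using fallback if it runs too long."""
    try:
        return await asyncio.wait_for(agent_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The primary call has been cancelled; degrade gracefully instead of hanging.
        return await fallback()
```

Usage, with two stand-in agents:

```python
async def slow_agent():
    await asyncio.sleep(1)
    return "primary"

async def cached_answer():
    return "fallback"

result = asyncio.run(call_with_timeout(slow_agent, cached_answer, timeout_s=0.01))
```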