Skills robust-agent-design

Apply robust Agent design patterns for building fault-tolerant, state-driven automation systems. Use when designing or refactoring systems that require high reliability, error recovery, graceful degradation, and distributed component coordination. Triggers on requests involving Agent architecture, fault tolerance design, state management, retry mechanisms, compensation transactions, or system robustness improvements.

install
source · Clone the upstream repo
git clone https://github.com/openclaw/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bhbb2000/robust-agent-design" ~/.claude/skills/openclaw-skills-robust-agent-design && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/bhbb2000/robust-agent-design" ~/.openclaw/skills/openclaw-skills-robust-agent-design && rm -rf "$T"
manifest: skills/bhbb2000/robust-agent-design/SKILL.md

Robust Agent Design Patterns

A design methodology based on loose coupling, state-driven architecture, and fault-tolerance-first principles.

Core Design Principles

1. Node-Based vs Function-Based

  • Each functional unit is encapsulated as an independent Agent
  • Agents communicate via messages/state rather than function calls
  • Each Agent has its own lifecycle and state management
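
As a minimal sketch of message-based communication, the two agents below share only a queue and a message shape. The names `producer`, `consumer`, and `run_pipeline`, and the `{"type": ..., "payload": ...}` message format, are assumptions made for illustration and are not part of this skill's reference code.

```python
import asyncio


async def producer(outbox: asyncio.Queue) -> None:
    # Emit messages instead of calling the next agent directly.
    for n in (1, 2, 3):
        await outbox.put({"type": "item", "payload": n})
    await outbox.put({"type": "done"})  # sentinel: no more work


async def consumer(inbox: asyncio.Queue) -> list:
    # The consumer only knows the message shape, not who produced it.
    results = []
    while True:
        message = await inbox.get()
        if message["type"] == "done":
            return results
        results.append(message["payload"] * 2)


async def run_pipeline() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    _, results = await asyncio.gather(producer(queue), consumer(queue))
    return results


doubled = asyncio.run(run_pipeline())
```

Because neither coroutine references the other, either side could be swapped for a different implementation, or for a durable message broker, without touching its peer.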

2. State-Driven vs Flow-Driven

  • System state is explicitly stored and managed
  • Decisions are based on state rather than hardcoded flows
  • Supports checkpoint recovery and state restoration
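
One way to read "decisions are based on state" is that the next step can be derived from the stored state alone. The sketch below assumes a hypothetical transition table (`TRANSITIONS`) and helper names (`next_action`, `transition`). The state names follow the Agent template in this document, but the allowed transitions are an illustrative guess.

```python
import json

# Hypothetical transition table: current state -> allowed next states.
TRANSITIONS = {
    "initialized": {"processing"},
    "processing": {"completed", "failed", "waiting"},
    "waiting": {"processing"},
    "failed": {"processing"},
    "completed": set(),
}


def next_action(state: str) -> str:
    """Decide what to do from the stored state alone."""
    if state in ("initialized", "waiting"):
        return "start"
    if state in ("processing", "failed"):
        # A stored 'processing' state means the agent stopped mid-task.
        return "resume"
    return "skip"


def transition(current: str, target: str) -> str:
    """Return the new state, rejecting moves the table does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target


# Simulate a restart: the checkpoint says the agent was mid-task.
checkpoint = json.dumps({"agent_id": "a-1", "state": "processing"})
restored = json.loads(checkpoint)
action = next_action(restored["state"])
```

After a crash, the orchestrator in this sketch needs only the checkpoint to know that the task should be resumed rather than started again.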

3. Fault-Tolerance-First vs Success-First

  • Assume all components can fail
  • Design recovery strategies for each failure scenario
  • "Failure is the norm, success requires guarantees"

Three-Level Fault Handling Mechanism

| Level | Fault Type      | Handling Strategy                | Applicable Scenarios                                                      |
|-------|-----------------|----------------------------------|---------------------------------------------------------------------------|
| L1    | Transient Fault | Auto-retry + Exponential Backoff | Network jitter, API rate limiting, temporary unavailability               |
| L2    | Resource Fault  | Resource cleanup + State reset   | Disk space exhausted, memory overflow, connection pool depleted           |
| L3    | Logic Fault     | Human intervention + Compensation | Data inconsistency, business logic errors, external dependency failures |
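
The three levels can be sketched as a classifier plus a backoff calculation. The mapping from exception types to levels, the function names `classify_fault` and `backoff_delay`, and the default `base` and `cap` values are all assumptions. A real system would classify on its own error types.

```python
import random


def classify_fault(error: Exception) -> str:
    """Map an exception to L1/L2/L3 (assumed mapping, adjust per system)."""
    # Check L1 first: TimeoutError and ConnectionError are subclasses of OSError.
    if isinstance(error, (TimeoutError, ConnectionError)):
        return "L1"
    if isinstance(error, (MemoryError, OSError)):
        return "L2"
    return "L3"


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  jitter: bool = True) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    delay = min(cap, base * 2 ** (attempt - 1))
    if jitter:
        # Randomize within [0, delay] so many agents do not retry in lockstep.
        delay = random.uniform(0, delay)
    return delay
```

With jitter disabled, attempts 1, 2, and 3 wait 0.5, 1.0, and 2.0 seconds under these defaults, and the delay never exceeds `cap`.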

Agent Design Template

Basic Agent Class Structure

import uuid
from datetime import datetime


class RobustAgent:
    def __init__(self, config):
        self.id = str(uuid.uuid4())
        self.state = 'initialized'  # initialized|waiting|processing|completed|failed
        self.input_queue = []
        self.output_queue = []
        self.retry_count = 0
        self.max_retries = config.get('max_retries', 3)
        self.compensation_actions = config.get('compensation_actions', [])
        self.state_persistence = config.get('state_persistence', 'file')  # file|db|memory
    
    async def execute(self, task):
        """Main execution entry point"""
        try:
            # 1. State transition
            self.state = 'processing'
            self._persist_state()
            
            # 2. Execute work
            result = await self._do_work(task)
            
            # 3. Validate result
            await self._validate_result(result)
            
            # 4. Complete state
            self.state = 'completed'
            self._persist_state()
            self.retry_count = 0  # the next task starts with a fresh retry budget
            return result
            
        except Exception as error:
            # 5. Fault handling
            return await self._handle_failure(error, task)
    
    async def _handle_failure(self, error, task):
        """Fault handling logic"""
        # L1: Transient fault - retry
        if self._is_transient_error(error) and self.retry_count < self.max_retries:
            self.retry_count += 1
            await self._exponential_backoff(self.retry_count)
            return await self.execute(task)
        
        # L2: Resource fault - cleanup and reset
        if self._is_resource_error(error):
            await self._cleanup_resources()
            self.state = 'waiting'
            self._persist_state()
            raise ResourceExhaustedError(f"Resource fault: {error}") from error
        
        # L3: Logic fault, or a transient fault that exhausted its retries - compensate
        self.state = 'failed'
        self._persist_state()
        await self._execute_compensation()
        raise BusinessLogicError(f"Logic fault: {error}") from error
    
    def _persist_state(self):
        """State persistence"""
        state_data = {
            'agent_id': self.id,
            'state': self.state,
            'retry_count': self.retry_count,
            'timestamp': datetime.now().isoformat()
        }
        # Persist to file/database based on configuration
        save_state(state_data, self.state_persistence)
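
The template leaves the persistence helper undefined. As one possible file-backed sketch, the functions below write each agent's state to a JSON file via a temporary file and an atomic rename, so a crash mid-write does not leave a truncated checkpoint. The names `save_state_to_file` and `load_state_from_file`, and the one-file-per-agent layout, are hypothetical and do not claim to match the reference implementation.

```python
import json
import os
import tempfile


def save_state_to_file(state_data: dict, directory: str) -> str:
    """Write state atomically: temp file first, then rename over the target."""
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, f"{state_data['agent_id']}.json")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state_data, handle)
        handle.flush()
        os.fsync(handle.fileno())  # push the bytes to disk before renaming
    os.replace(tmp_path, target)  # atomic when both paths share a filesystem
    return target


def load_state_from_file(agent_id: str, directory: str):
    """Return the last saved state, or None if the agent has no checkpoint."""
    path = os.path.join(directory, f"{agent_id}.json")
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

On restart, an agent could call `load_state_from_file` with its own id and resume from the recorded `state` and `retry_count`.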

State Management Protocol

{
  "agent_id": "uuid",
  "current_state": "initialized|waiting|processing|completed|failed",
  "input_state": {
    "data": {},
    "checksum": "md5_hash",
    "source": "previous_agent_id",
    "timestamp": "iso8601"
  },
  "output_state": {
    "data": {},
    "quality_metrics": {},
    "validation_status": "passed|failed",
    "next_step": "agent_id_to_notify"
  },
  "retry_info": {
    "count": 0,
    "max_retries": 3,
    "backoff_strategy": "exponential"
  }
}
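
The `checksum` field lets a receiving agent detect a payload that was corrupted or altered between agents. The sketch below assumes the checksum is an MD5 digest of the payload serialized as canonical JSON. The serialization rule and the helper names are assumptions, and MD5 here guards only against accidental corruption, not deliberate tampering.

```python
import hashlib
import json
from datetime import datetime, timezone


def payload_checksum(data: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal payloads hash equally.
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def build_input_state(data: dict, source: str) -> dict:
    """Producer side: attach checksum, source, and timestamp to the payload."""
    return {
        "data": data,
        "checksum": payload_checksum(data),
        "source": source,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def verify_input_state(input_state: dict) -> bool:
    """Consumer side: recompute the checksum and compare before processing."""
    return payload_checksum(input_state["data"]) == input_state["checksum"]
```

A consumer that gets `False` from `verify_input_state` would treat the message as a fault rather than process it.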

Compensation Transaction Pattern

Compensation Chain

class CompensationError(Exception):
    """Raised when an action in the chain fails; completed actions were rolled back."""


class CompensationChain:
    def __init__(self):
        self.actions = []
    
    def add_action(self, action_func, params, rollback_func=None):
        self.actions.append({
            'action': action_func,
            'params': params,
            'rollback': rollback_func
        })
    
    async def execute(self):
        executed = []
        try:
            for action in self.actions:
                await action['action'](**action['params'])
                executed.append(action)
            return True
        except Exception as e:
            # Roll back the actions that completed, newest first
            for action in reversed(executed):
                if action['rollback']:
                    await action['rollback'](**action['params'])
            raise CompensationError(f"Compensation failed: {e}") from e

Usage Example

# Compensation after email sending failure
class MailAgent(RobustAgent):
    async def send_with_compensation(self, email_data):
        try:
            result = await mail_service.send(email_data)
            return result
        except Exception as error:
            compensation = CompensationChain()
            compensation.add_action(
                log_failure, 
                {'error': error, 'email': email_data}
            )
            compensation.add_action(
                notify_monitoring,
                {'severity': 'warning', 'agent_id': self.id}
            )
            compensation.add_action(
                queue_for_retry,
                {'email': email_data, 'delay': 300}
            )
            compensation.add_action(
                fallback_to_sms,
                {'summary': email_data.subject, 'recipient': email_data.to}
            )
            await compensation.execute()
            raise

Graceful Degradation Strategies

DEGRADATION_STRATEGIES = {
    "primary_service_unavailable": {
        "primary": "wait_and_retry",
        "fallback": "use_backup_service",
        "final": "queue_for_manual_processing"
    },
    "resource_exhausted": {
        "primary": "clean_temp_files",
        "fallback": "compress_existing_data",
        "final": "pause_until_manual_cleanup"
    },
    "quality_threshold_not_met": {
        "primary": "retry_with_different_params",
        "fallback": "use_simplified_algorithm",
        "final": "flag_for_human_review"
    }
}
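
A strategy table like this one still needs something to walk it. The sketch below tries the `primary`, `fallback`, and `final` tiers in order and stops at the first that succeeds. The `run_with_degradation` function and the `handlers` mapping from strategy names to callables are hypothetical, and `STRATEGIES` repeats a single scenario from the table so the block is self-contained.

```python
STRATEGIES = {
    "primary_service_unavailable": {
        "primary": "wait_and_retry",
        "fallback": "use_backup_service",
        "final": "queue_for_manual_processing",
    },
}


def run_with_degradation(scenario: str, strategies: dict, handlers: dict):
    """Try each tier in order; return (tier, result) for the first success."""
    errors = []
    for tier in ("primary", "fallback", "final"):
        name = strategies[scenario][tier]
        try:
            return tier, handlers[name]()
        except Exception as error:
            # Record the failure and degrade to the next tier.
            errors.append(f"{tier}:{name}: {error}")
    raise RuntimeError("all tiers failed: " + "; ".join(errors))


def primary_down():
    raise ConnectionError("primary service unavailable")


example_handlers = {
    "wait_and_retry": primary_down,
    "use_backup_service": lambda: "sent via backup",
    "queue_for_manual_processing": lambda: "queued",
}
tier, result = run_with_degradation(
    "primary_service_unavailable", STRATEGIES, example_handlers
)
```

Returning the tier alongside the result lets monitoring record how often the system runs in a degraded mode.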

System Architecture Patterns

Basic Architecture

┌───────────────────────────────────────────────┐
│                 Orchestrator                  │
│  ┌───────┬───────┬───────┬───────┬───────┐    │
│  │Collect│Process│Report │ Send  │Monitor│    │
│  │ Agent │ Agent │ Agent │ Agent │ Agent │    │
│  └───────┴───────┴───────┴───────┴───────┘    │
└───────────────────────────────────────────────┘
        ↓               ↓                ↓
  [State Store]  [Message Queue]  [Monitoring Log]

Agent Collaboration Flow

Input → Agent A → [State A] → Agent B → [State B] → Agent C → Output
         ↓ Failure          ↓ Failure          ↓ Failure
    [Compensation]    [Retry/Degrade]    [Human Intervention]

Implementation Checklist

Each Agent Must Include

  • Unique identifier (UUID)
  • Clear input/output interface definitions
  • Built-in result validation mechanism
  • State persistence capability
  • Fault recovery logic (three-level handling)
  • Monitoring metrics reporting
  • Logging and tracing integration

System-Level Guarantees

  • At-least-once message delivery guarantee
  • Eventual state consistency guarantee
  • Data integrity verification (checksum)
  • Operation traceability (full-link tracing)
  • Performance monitoring and alerting
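
At-least-once delivery means the same message can arrive more than once, so consumers have to tolerate duplicates. A minimal sketch, assuming every message carries a unique `id` field (a hypothetical convention, as is the `IdempotentConsumer` class):

```python
class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.processed_ids = set()
        self.total = 0

    def handle(self, message: dict) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message["id"] in self.processed_ids:
            return False
        self.total += message["amount"]
        # Mark as processed only after the effect has been applied.
        self.processed_ids.add(message["id"])
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"id": "m1", "amount": 10},
    {"id": "m2", "amount": 5},
    {"id": "m1", "amount": 10},  # redelivery of m1
]
applied = [consumer.handle(message) for message in deliveries]
```

In a real deployment the set of processed ids would live in the state store, so that deduplication survives an agent restart.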

Application Scenarios

Scenario 1: Information Collection System

CrawlerAgent → ClassifierAgent → ReporterAgent → MailerAgent
     ↓               ↓                ↓              ↓
 [State:Collecting][State:Classifying][State:Generating][State:Sending]

Scenario 2: Data Analysis Pipeline

DataFetcherAgent → CleanerAgent → AnalyzerAgent → VisualizationAgent

Scenario 3: Automation Workflow

TriggerAgent → ApprovalAgent → ExecutorAgent → NotifyAgent

Best Practices

1. Interface Design

  • Interfaces are stable and backward compatible
  • Versioned API design (v1, v2)
  • Clear error code system

2. State Management

  • State storage separated from business logic
  • Support for snapshots and rollback
  • State change audit tracking

3. Testing Strategy

  • Unit tests: Individual Agent functionality
  • Integration tests: Agent collaboration
  • Chaos engineering: Fault injection testing
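
Fault injection can start small: wrap a dependency so that it fails a set number of times, then assert that the retry logic recovers. The `FlakyService` class and `call_with_retry` function below are hypothetical test helpers, not part of the Agent template, and the retry loop omits the backoff delay to keep the test fast.

```python
import asyncio


class FlakyService:
    """Test double that raises a transient error for the first N calls."""

    def __init__(self, failures_before_success: int):
        self.remaining_failures = failures_before_success
        self.calls = 0

    async def call(self) -> str:
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected fault")
        return "ok"


async def call_with_retry(operation, max_retries: int = 3):
    """Retry on ConnectionError up to max_retries times, then re-raise."""
    for attempt in range(max_retries + 1):
        try:
            return await operation()
        except ConnectionError:
            if attempt == max_retries:
                raise
            await asyncio.sleep(0)  # a real agent would back off here


service = FlakyService(failures_before_success=2)
outcome = asyncio.run(call_with_retry(service.call))
```

The same double can check the exhaustion path: with more injected failures than retries, the `ConnectionError` should propagate to the caller.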

4. Observability

  • Each Agent reports health status
  • Real-time monitoring of key metrics
  • Full link tracing coverage

Anti-Pattern Warnings

❌ Don't Do This

  • Design Agents as pure functions without state management
  • Ignore failure scenarios and assume everything works
  • Hardcode flows that cannot be adjusted dynamically
  • Omit compensation mechanisms, so that any failure terminates the run immediately

✅ Do This Instead

  • Explicitly manage state and lifecycle
  • Design recovery strategies for each failure scenario
  • Make decisions based on state, support dynamic flows
  • Implement compensation transactions, support graceful degradation

Reference Implementation

See the references/ directory:

  • agent_template.py - Complete Agent template
  • compensation_example.py - Compensation transaction examples