Claude-skill-registry checkpoint-workflow-builder
Build resumable state-machine workflows with checkpoint patterns, progress preservation, and automatic recovery for complex multi-phase operations that need to survive interruptions, timeouts, and failures.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/checkpoint-workflow-builder" ~/.claude/skills/majiayu000-claude-skill-registry-checkpoint-workflow-builder && rm -rf "$T"
manifest:
skills/data/checkpoint-workflow-builder/SKILL.mdsource content
Checkpoint Workflow Builder
Design and implement fault-tolerant workflows that can resume from any point of failure.
Overview
Complex workflows often fail mid-execution due to:
- Network timeouts
- System crashes
- Resource exhaustion
- External service failures
- User interruptions
This skill teaches you to build workflows that:
- Save progress at key checkpoints
- Resume from last successful state
- Handle partial failures gracefully
- Provide clear progress visibility
- Enable manual intervention points
When to Use
Use this skill when:
- Building multi-phase data pipelines
- Implementing long-running migration scripts
- Creating deployment workflows
- Processing large batches of items
- Orchestrating multi-system operations
- Building ETL (Extract, Transform, Load) workflows
- Implementing saga patterns for distributed systems
- Creating user-facing wizards with save/resume
Core Concepts
State Machine Pattern
┌─────────┐ │ INIT │ └────┬────┘ │ ▼ ┌────────────┐ │ DOWNLOAD │ └─────┬──────┘ │ ▼ ┌────────────┐ │ PROCESS │ └─────┬──────┘ │ ▼ ┌────────────┐ │ VALIDATE │ └─────┬──────┘ │ ▼ ┌────────────┐ │ FINALIZE │ └─────┬──────┘ │ ▼ ┌────────────┐ │ COMPLETE │ └────────────┘
Each state:
- Has clear entry conditions
- Performs specific operations
- Saves checkpoint before transition
- Can be resumed independently
Basic Implementation
Simple State Machine
#!/bin/bash # state-machine.sh - Basic resumable workflow STATE_FILE=".workflow_state" # Read current state (default: INIT) CURRENT_STATE=$(cat "$STATE_FILE" 2>/dev/null || echo "INIT") echo "Current state: $CURRENT_STATE" case "$CURRENT_STATE" in INIT) echo "=== Phase 1: Initialization ===" # Initialize workspace mkdir -p workspace mkdir -p results # Download dependencies echo "Setting up environment..." # Save next state echo "DOWNLOAD" > "$STATE_FILE" echo "✓ Initialization complete" echo "Run again to continue" ;; DOWNLOAD) echo "=== Phase 2: Download Data ===" # Download data files echo "Downloading data..." # curl -o workspace/data.zip https://example.com/data.zip # Verify download if [ -f "workspace/data.zip" ]; then echo "EXTRACT" > "$STATE_FILE" echo "✓ Download complete" echo "Run again to continue" else echo "✗ Download failed - fix and run again" exit 1 fi ;; EXTRACT) echo "=== Phase 3: Extract Data ===" # Extract files echo "Extracting data..." # unzip workspace/data.zip -d workspace/ echo "PROCESS" > "$STATE_FILE" echo "✓ Extraction complete" echo "Run again to continue" ;; PROCESS) echo "=== Phase 4: Process Data ===" # Process data echo "Processing data..." # ./process_data.sh workspace/ results/ echo "VALIDATE" > "$STATE_FILE" echo "✓ Processing complete" echo "Run again to continue" ;; VALIDATE) echo "=== Phase 5: Validate Results ===" # Validate results echo "Validating results..." # ./validate.sh results/ if [ $? -eq 0 ]; then echo "FINALIZE" > "$STATE_FILE" echo "✓ Validation passed" echo "Run again to finalize" else echo "✗ Validation failed" echo "Fix issues and change state to PROCESS to reprocess" exit 1 fi ;; FINALIZE) echo "=== Phase 6: Finalize ===" # Cleanup and finalize echo "Finalizing workflow..." # mv results/ final/ # rm -rf workspace/ echo "COMPLETE" > "$STATE_FILE" echo "✓ Workflow complete!" ;; COMPLETE) echo "=== Workflow Already Complete ===" echo "Results available in: final/" ;; *) echo "✗ Unknown state: $CURRENT_STATE" echo "Reset with: echo 'INIT' > $STATE_FILE" exit 1 ;; esac
Enhanced with Progress Tracking
#!/bin/bash # enhanced-state-machine.sh - With detailed progress STATE_FILE=".workflow_state" PROGRESS_FILE=".workflow_progress.json" # Initialize progress tracking init_progress() { cat > "$PROGRESS_FILE" << EOF { "current_state": "INIT", "started_at": "$(date -Iseconds)", "updated_at": "$(date -Iseconds)", "phases": { "INIT": {"status": "pending", "started": null, "completed": null}, "DOWNLOAD": {"status": "pending", "started": null, "completed": null}, "PROCESS": {"status": "pending", "started": null, "completed": null}, "VALIDATE": {"status": "pending", "started": null, "completed": null}, "FINALIZE": {"status": "pending", "started": null, "completed": null} } } EOF } # Update progress update_progress() { local state="$1" local status="$2" # "running", "completed", "failed" local timestamp="$(date -Iseconds)" if [ ! -f "$PROGRESS_FILE" ]; then init_progress fi jq --arg state "$state" \ --arg status "$status" \ --arg timestamp "$timestamp" \ '.current_state = $state | .updated_at = $timestamp | .phases[$state].status = $status | (.phases[$state].started //= $timestamp) | (if $status == "completed" then .phases[$state].completed = $timestamp else . end)' \ "$PROGRESS_FILE" > "$PROGRESS_FILE.tmp" mv "$PROGRESS_FILE.tmp" "$PROGRESS_FILE" } # Show progress show_progress() { if [ ! -f "$PROGRESS_FILE" ]; then echo "No progress file found" return fi echo "=== Workflow Progress ===" echo "" jq -r '.phases | to_entries | .[] | "[\(.value.status | ascii_upcase)] \(.key)" + (if .value.started then " (started: " + .value.started + ")" else "" end)' \ "$PROGRESS_FILE" echo "" echo "Current: $(jq -r '.current_state' $PROGRESS_FILE)" echo "Updated: $(jq -r '.updated_at' $PROGRESS_FILE)" } # Workflow implementation run_workflow() { CURRENT_STATE=$(cat "$STATE_FILE" 2>/dev/null || echo "INIT") # Show progress before executing show_progress case "$CURRENT_STATE" in INIT) update_progress "INIT" "running" echo "Initializing..." # ... initialization logic ... update_progress "INIT" "completed" echo "DOWNLOAD" > "$STATE_FILE" ;; DOWNLOAD) update_progress "DOWNLOAD" "running" echo "Downloading..." # ... download logic ... update_progress "DOWNLOAD" "completed" echo "PROCESS" > "$STATE_FILE" ;; # ... other states ... esac } run_workflow
Advanced Patterns
Pattern 1: Batched Checkpoint
#!/bin/bash # batched-checkpoint.sh - Process items with batch checkpoints ITEMS_FILE="items.txt" CHECKPOINT_FILE=".batch_checkpoint" BATCH_SIZE=10 # Load checkpoint if [ -f "$CHECKPOINT_FILE" ]; then LAST_COMPLETED=$(cat "$CHECKPOINT_FILE") echo "Resuming from item $LAST_COMPLETED" else LAST_COMPLETED=0 fi TOTAL_ITEMS=$(wc -l < "$ITEMS_FILE") ITEMS_PROCESSED=0 # Process in batches tail -n +$((LAST_COMPLETED + 1)) "$ITEMS_FILE" | while read -r item; do # Process item echo "Processing: $item" process_item "$item" ITEMS_PROCESSED=$((ITEMS_PROCESSED + 1)) # Checkpoint every batch if [ $((ITEMS_PROCESSED % BATCH_SIZE)) -eq 0 ]; then CURRENT_POSITION=$((LAST_COMPLETED + ITEMS_PROCESSED)) echo "$CURRENT_POSITION" > "$CHECKPOINT_FILE" echo "Checkpoint: $CURRENT_POSITION/$TOTAL_ITEMS" # Optional: Break for timeout management if [ $((ITEMS_PROCESSED)) -ge $((BATCH_SIZE * 5)) ]; then echo "Processed 5 batches ($(($BATCH_SIZE * 5)) items)" echo "Run again to continue" exit 0 fi fi done # Final checkpoint echo "$TOTAL_ITEMS" > "$CHECKPOINT_FILE" echo "✓ All items processed!"
Pattern 2: Rollback Support
#!/bin/bash # rollback-workflow.sh - State machine with rollback STATE_FILE=".state" ROLLBACK_DIR=".rollback" mkdir -p "$ROLLBACK_DIR" # Save rollback point save_rollback() { local state="$1" local timestamp=$(date +%s) tar -czf "$ROLLBACK_DIR/${state}_${timestamp}.tar.gz" workspace/ 2>/dev/null || true echo "${state}_${timestamp}" > "$ROLLBACK_DIR/latest" } # Perform rollback rollback() { local rollback_point="$1" if [ -z "$rollback_point" ]; then rollback_point=$(cat "$ROLLBACK_DIR/latest" 2>/dev/null) fi if [ -z "$rollback_point" ] || [ ! -f "$ROLLBACK_DIR/${rollback_point}.tar.gz" ]; then echo "✗ No rollback point found" exit 1 fi echo "Rolling back to: $rollback_point" # Restore from backup rm -rf workspace/ tar -xzf "$ROLLBACK_DIR/${rollback_point}.tar.gz" # Set state STATE=$(echo "$rollback_point" | cut -d'_' -f1) echo "$STATE" > "$STATE_FILE" echo "✓ Rolled back to state: $STATE" } # In workflow case "$CURRENT_STATE" in DOWNLOAD) save_rollback "PRE_DOWNLOAD" # ... download logic ... ;; PROCESS) save_rollback "PRE_PROCESS" # ... process logic ... ;; esac
Pattern 3: Parallel Phase Execution
#!/bin/bash # parallel-phases.sh - Execute independent phases in parallel STATE_FILE=".parallel_state.json" # Initialize parallel state init_parallel_state() { cat > "$STATE_FILE" << EOF { "phase_a": "pending", "phase_b": "pending", "phase_c": "pending", "finalize": "pending" } EOF } # Update phase status update_phase() { local phase="$1" local status="$2" jq --arg phase "$phase" --arg status "$status" \ '.[$phase] = $status' "$STATE_FILE" > "$STATE_FILE.tmp" mv "$STATE_FILE.tmp" "$STATE_FILE" } # Check if all phases complete all_phases_complete() { local incomplete=$(jq -r '[.phase_a, .phase_b, .phase_c] | map(select(. != "completed")) | length' "$STATE_FILE") [ "$incomplete" -eq 0 ] } # Execute phase execute_phase() { local phase="$1" local current_status=$(jq -r ".$phase" "$STATE_FILE") if [ "$current_status" = "completed" ]; then echo "$phase: Already complete" return 0 fi echo "$phase: Running" update_phase "$phase" "running" # Phase-specific logic case "$phase" in phase_a) # Independent task A task_a ;; phase_b) # Independent task B task_b ;; phase_c) # Independent task C task_c ;; esac if [ $? -eq 0 ]; then update_phase "$phase" "completed" echo "$phase: Completed" else update_phase "$phase" "failed" echo "$phase: Failed" return 1 fi } # Main workflow [ ! -f "$STATE_FILE" ] && init_parallel_state # Run independent phases execute_phase "phase_a" execute_phase "phase_b" execute_phase "phase_c" # Check if ready for finalization if all_phases_complete; then finalize_status=$(jq -r '.finalize' "$STATE_FILE") if [ "$finalize_status" != "completed" ]; then echo "All phases complete. Finalizing..." finalize update_phase "finalize" "completed" echo "✓ Workflow complete!" fi else echo "Some phases incomplete. Run again to retry." fi
Real-World Examples
Example 1: Database Migration Workflow
#!/bin/bash # db-migration-workflow.sh STATE_FILE=".migration_state" MIGRATION_LOG="migration.log" log() { echo "[$(date -Iseconds)] $1" | tee -a "$MIGRATION_LOG" } CURRENT_STATE=$(cat "$STATE_FILE" 2>/dev/null || echo "INIT") case "$CURRENT_STATE" in INIT) log "=== Starting Migration ===" log "Checking prerequisites..." # Verify database connection psql -U user -d database -c "SELECT 1" > /dev/null 2>&1 || { log "✗ Database connection failed" exit 1 } # Verify backup destination [ -d "/backups" ] || mkdir -p /backups echo "BACKUP" > "$STATE_FILE" log "✓ Prerequisites checked" ;; BACKUP) log "=== Creating Backup ===" BACKUP_FILE="/backups/db_$(date +%Y%m%d_%H%M%S).sql" pg_dump -U user database > "$BACKUP_FILE" if [ -f "$BACKUP_FILE" ] && [ -s "$BACKUP_FILE" ]; then log "✓ Backup created: $BACKUP_FILE" echo "MIGRATE" > "$STATE_FILE" else log "✗ Backup failed" exit 1 fi ;; MIGRATE) log "=== Running Migration ===" # Apply migration scripts in order for script in migrations/*.sql; do log "Applying: $(basename $script)" psql -U user -d database -f "$script" >> "$MIGRATION_LOG" 2>&1 if [ $? -ne 0 ]; then log "✗ Migration failed: $script" log "To rollback: ./db-migration-workflow.sh rollback" exit 1 fi done echo "VALIDATE" > "$STATE_FILE" log "✓ Migrations applied" ;; VALIDATE) log "=== Validating Migration ===" # Run validation queries VALIDATION_RESULT=$(psql -U user -d database -f validation.sql 2>&1) if echo "$VALIDATION_RESULT" | grep -q "PASS"; then log "✓ Validation passed" echo "COMPLETE" > "$STATE_FILE" else log "✗ Validation failed" log "$VALIDATION_RESULT" log "Review and fix, then change state to MIGRATE to retry" exit 1 fi ;; COMPLETE) log "=== Migration Complete ===" log "Database migrated successfully" ;; rollback) log "=== Rolling Back Migration ===" # Find latest backup LATEST_BACKUP=$(ls -t /backups/db_*.sql | head -1) if [ -z "$LATEST_BACKUP" ]; then log "✗ No backup found" exit 1 fi log "Restoring from: $LATEST_BACKUP" psql -U user -d database < "$LATEST_BACKUP" log "✓ Rollback complete" echo "INIT" > "$STATE_FILE" ;; esac
Example 2: Data Pipeline Workflow
#!/bin/bash # data-pipeline.sh - ETL pipeline with checkpoints STATE_FILE=".pipeline_state.json" init_state() { cat > "$STATE_FILE" << EOF { "state": "EXTRACT", "extracted_files": [], "transformed_files": [], "loaded_records": 0, "total_records": 0 } EOF } [ ! -f "$STATE_FILE" ] && init_state CURRENT_STATE=$(jq -r '.state' "$STATE_FILE") case "$CURRENT_STATE" in EXTRACT) echo "=== Extract Phase ===" # Extract from sources for source in source1 source2 source3; do # Check if already extracted if jq -e ".extracted_files | index(\"$source\")" "$STATE_FILE" > /dev/null; then echo "✓ $source already extracted" continue fi echo "Extracting from: $source" extract_from_source "$source" > "raw_data/$source.csv" # Update state jq ".extracted_files += [\"$source\"]" "$STATE_FILE" > "$STATE_FILE.tmp" mv "$STATE_FILE.tmp" "$STATE_FILE" done # Move to next state jq '.state = "TRANSFORM"' "$STATE_FILE" > "$STATE_FILE.tmp" mv "$STATE_FILE.tmp" "$STATE_FILE" echo "✓ Extraction complete" ;; TRANSFORM) echo "=== Transform Phase ===" for file in raw_data/*.csv; do filename=$(basename "$file") # Check if already transformed if jq -e ".transformed_files | index(\"$filename\")" "$STATE_FILE" > /dev/null; then echo "✓ $filename already transformed" continue fi echo "Transforming: $filename" transform_data "$file" > "processed_data/$filename" # Update state jq ".transformed_files += [\"$filename\"]" "$STATE_FILE" > "$STATE_FILE.tmp" mv "$STATE_FILE.tmp" "$STATE_FILE" done # Count total records for loading TOTAL=$(wc -l processed_data/*.csv | tail -1 | awk '{print $1}') jq ".total_records = $TOTAL | .state = \"LOAD\"" "$STATE_FILE" > "$STATE_FILE.tmp" mv "$STATE_FILE.tmp" "$STATE_FILE" echo "✓ Transformation complete" ;; LOAD) echo "=== Load Phase ===" LOADED=$(jq -r '.loaded_records' "$STATE_FILE") TOTAL=$(jq -r '.total_records' "$STATE_FILE") echo "Progress: $LOADED/$TOTAL records" # Load data in batches BATCH_SIZE=1000 tail -n +$((LOADED + 1)) processed_data/combined.csv | head -$BATCH_SIZE | while read record; do load_record "$record" LOADED=$((LOADED + 1)) # Checkpoint every 100 records if [ $((LOADED % 100)) -eq 0 ]; then jq ".loaded_records = $LOADED" "$STATE_FILE" > "$STATE_FILE.tmp" mv "$STATE_FILE.tmp" "$STATE_FILE" fi done # Check if complete LOADED=$(jq -r '.loaded_records' "$STATE_FILE") if [ "$LOADED" -ge "$TOTAL" ]; then jq '.state = "COMPLETE"' "$STATE_FILE" > "$STATE_FILE.tmp" mv "$STATE_FILE.tmp" "$STATE_FILE" echo "✓ Loading complete" else echo "Loaded $LOADED/$TOTAL - run again to continue" fi ;; COMPLETE) echo "=== Pipeline Complete ===" jq '.' "$STATE_FILE" ;; esac
Best Practices
✅ DO
- Design clear states - Each state has single responsibility
- Save frequently - Checkpoint after each significant operation
- Validate transitions - Verify state before transitioning
- Handle failures - Plan for failure at every state
- Log everything - Comprehensive logging for debugging
- Enable inspection - Make state file human-readable (JSON)
- Support rollback - Save enough info to undo
- Test resumption - Verify workflow resumes correctly
❌ DON'T
- Don't mix concerns - Keep states focused
- Don't skip validation - Validate before each transition
- Don't hide state - Make current state obvious
- Don't lose progress - Always save before risky operations
- Don't ignore errors - Handle errors explicitly
- Don't hardcode paths - Use configuration
- Don't forget cleanup - Remove checkpoint files when done
- Don't nest too deep - Keep state machine flat
Quick Reference
# Basic state machine STATE=$(cat .state || echo "INIT") case "$STATE" in STATE1) action1; echo "STATE2" > .state ;; STATE2) action2; echo "STATE3" > .state ;; esac # With progress tracking (JSON) jq '.state = "NEW_STATE" | .updated = "'$(date -Iseconds)'"' .state.json # Checkpoint pattern process_batch && echo "$POSITION" > .checkpoint # Rollback support tar -czf .rollback/pre_${STATE}.tar.gz data/ # Reset workflow echo "INIT" > .state && rm .progress.json
Version: 1.0.0 Author: Harvested from timeout-prevention and state machine patterns Last Updated: 2025-11-18 License: MIT Key Principle: Every operation should be resumable from its last successful checkpoint.
Build workflows that never lose progress! 🔄