Oh-my-droid swarm
N coordinated agents on shared task list with SQLite-based atomic claiming
git clone https://github.com/MeroZemory/oh-my-droid
T=$(mktemp -d) && git clone --depth=1 https://github.com/MeroZemory/oh-my-droid "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/swarm" ~/.claude/skills/merozemory-oh-my-droid-swarm && rm -rf "$T"
skills/swarm/SKILL.md
Swarm Skill
Spawn N coordinated agents working on a shared task list with SQLite-based atomic claiming. Like a dev team tackling multiple files in parallel—fast, reliable, and with full fault tolerance.
Usage
/swarm N:agent-type "task description"
Parameters
- N - Number of agents (1-5, enforced by Factory Droid limit)
- agent-type - Agent to spawn (e.g., executor, build-fixer, architect)
- task - High-level task to decompose and distribute
Examples

    /swarm 5:executor "fix all TypeScript errors"
    /swarm 3:build-fixer "fix build errors in src/"
    /swarm 4:designer "implement responsive layouts for all components"
    /swarm 2:architect "analyze and document all API endpoints"

Architecture

    User: "/swarm 5:executor fix all TypeScript errors"
                |
                v
      [SWARM ORCHESTRATOR]
                |
          +--+--+--+--+
          |  |  |  |  |
          v  v  v  v  v
          E1 E2 E3 E4 E5
          |  |  |  |  |
          +--+--+--+--+
                |
                v
        [SQLITE DATABASE]
    ┌──────────────────────────┐
    │ tasks table              │
    ├──────────────────────────┤
    │ id, description          │
    │ status (pending,         │
    │   claimed, done, failed) │
    │ claimed_by, claimed_at   │
    │ completed_at, result     │
    │ error                    │
    ├──────────────────────────┤
    │ heartbeats table         │
    │ (agent monitoring)       │
    └──────────────────────────┘

Key Features:
- SQLite transactions ensure only one agent can claim a task
- Lease-based ownership with automatic timeout and recovery
- Heartbeat monitoring for detecting dead agents
- Full ACID compliance for task state
Workflow
1. Parse Input
- Extract N (agent count)
- Extract agent-type
- Extract task description
- Validate N <= 5
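The parsing step can be sketched as follows. This is a minimal illustration, not the skill's actual parser: `parseSwarmCommand`, `SwarmCommand`, and the regular expression are assumptions introduced here. Only the `/swarm N:agent-type "task"` shape and the 1-5 agent limit come from the text above.

```typescript
// Hypothetical helper: not part of the oh-my-droid API.
interface SwarmCommand {
  agentCount: number;
  agentType: string;
  task: string;
}

const MAX_AGENTS = 5; // limit stated in the Parameters section

function parseSwarmCommand(input: string): SwarmCommand {
  // Matches: /swarm <N>:<agent-type> "<task>" (quotes around the task are optional)
  const match = /^\/swarm\s+(\d+):([A-Za-z][\w-]*)\s+"?(.+?)"?$/.exec(input.trim());
  if (!match) {
    throw new Error('Expected: /swarm N:agent-type "task description"');
  }
  const agentCount = Number(match[1]);
  if (agentCount < 1 || agentCount > MAX_AGENTS) {
    throw new Error(`Agent count must be between 1 and ${MAX_AGENTS}, got ${agentCount}`);
  }
  return { agentCount, agentType: match[2], task: match[3] };
}

const parsed = parseSwarmCommand('/swarm 3:build-fixer "fix build errors in src/"');
```

Rejecting a count of 0 as well as counts above 5 keeps the orchestrator from creating a task pool that no agent will ever drain.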
2. Create Task Pool
- Analyze codebase based on task
- Break into file-specific subtasks
- Initialize SQLite database with task pool
- Each task gets: id, description, status (pending), and metadata columns
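As a rough sketch of the decomposition step, the function below turns a list of files into one pending task per file. `buildTaskPool`, `PendingTaskRow`, and the `task-N` id format are assumptions made for illustration. How the orchestrator actually selects files and names tasks is not specified here.

```typescript
// Hypothetical row shape mirroring the columns described above.
interface PendingTaskRow {
  id: string;
  description: string;
  status: 'pending' | 'claimed' | 'done' | 'failed';
  claimed_by: string | null;
  claimed_at: number | null;
}

// Build one pending task per file. The "task-N" id format is an assumption.
function buildTaskPool(task: string, files: string[]): PendingTaskRow[] {
  return files.map((file, index) => ({
    id: `task-${index + 1}`,
    description: `${task}: ${file}`,
    status: 'pending' as const,
    claimed_by: null,
    claimed_at: null,
  }));
}

const pool = buildTaskPool('fix TypeScript errors', ['src/a.ts', 'src/b.ts']);
```

If tasks are claimed in `id` order and ids are stored as text, zero-padded ids (for example `task-001`) would keep that order numeric once the pool grows past nine tasks.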
3. Spawn Agents
- Launch N agents via Task tool
- Set run_in_background: true for all
- Each agent connects to the SQLite database
- Agents enter claiming loop automatically
3.1. Agent Preamble (IMPORTANT)
When spawning swarm agents, ALWAYS wrap the task with the worker preamble to prevent recursive sub-agent spawning:

    import { wrapWithPreamble } from '../droids/preamble.js';

    // When spawning each agent:
    const agentPrompt = wrapWithPreamble(`
    Connect to swarm at ${cwd}/.omd/state/swarm.db
    Claim tasks with claimTask('agent-${n}')
    Complete work with completeTask() or failTask()
    Send heartbeat every 60 seconds
    Exit when hasPendingWork() returns false
    `);

    Task({
      subagent_type: 'oh-my-droid:executor',
      prompt: agentPrompt,
      run_in_background: true
    });

The worker preamble ensures agents:
- Execute tasks directly using tools (Read, Write, Edit, Bash)
- Do NOT spawn sub-agents (prevents recursive agent storms)
- Report results with absolute file paths
4. Task Claiming Protocol (SQLite Transactional)
Each agent follows this loop:

    LOOP:
      1. Call claimTask(agentId)
      2. SQLite transaction:
         - Find first pending task
         - UPDATE status='claimed', claimed_by=agentId, claimed_at=now
         - INSERT/UPDATE heartbeat record
         - Atomically commit (only one agent succeeds)
      3. Execute task
      4. Call completeTask(agentId, taskId, result) or failTask()
      5. GOTO LOOP (until hasPendingWork() returns false)

Atomic Claiming Details:
- SQLite IMMEDIATE transaction prevents race conditions
- Only the agent whose UPDATE succeeds gets the task
- Heartbeat automatically updated on claim
- If claim fails (already claimed), agent retries with next task
- Lease Timeout: 5 minutes per task
- If timeout exceeded + no heartbeat, cleanupStaleClaims releases task back to pending
5. Heartbeat Protocol
- Agents call heartbeat(agentId) every 60 seconds (or custom interval)
- Heartbeat records: agent_id, last_heartbeat timestamp, current_task_id
- Orchestrator runs cleanupStaleClaims every 60 seconds
- If heartbeat is stale (>5 minutes old) and task claimed, task auto-releases
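The release rule combines two conditions: the lease must have expired, and the owning agent must have no recent heartbeat. The pure function below is a sketch of that rule under stated assumptions. `isClaimStale` and its parameter names are illustrative and not part of the swarm API, and timestamps are assumed to be in milliseconds.

```typescript
const DEFAULT_LEASE_MS = 5 * 60 * 1000; // 5-minute lease from the text above

// Illustrative predicate (not part of the swarm API): a claim is stale only when
// the lease has expired AND the owning agent has no recent heartbeat.
function isClaimStale(
  claimedAt: number,
  lastHeartbeat: number | null,
  now: number,
  leaseMs: number = DEFAULT_LEASE_MS,
): boolean {
  const cutoff = now - leaseMs;
  const leaseExpired = claimedAt < cutoff;
  const heartbeatMissing = lastHeartbeat === null || lastHeartbeat < cutoff;
  return leaseExpired && heartbeatMissing;
}

const MINUTE = 60 * 1000;
const now = 10 * 60 * MINUTE; // arbitrary reference time for the examples

// Claimed 10 minutes ago but still heartbeating: the claim is kept.
const activeLongTask = isClaimStale(now - 10 * MINUTE, now - 30 * 1000, now);
// Claimed 10 minutes ago, last heartbeat 6 minutes ago: released.
const deadAgent = isClaimStale(now - 10 * MINUTE, now - 6 * MINUTE, now);
```

Requiring both conditions means a healthy agent working on a long task keeps its claim as long as it continues to send heartbeats.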
6. Progress Tracking
- Orchestrator monitors via TaskOutput
- Shows live progress: pending/claimed/done/failed counts
- Active agent count via getActiveAgents()
- Reports which agent is working on which task via getAgentTasks()
- Detects idle agents (all tasks claimed by others)
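A progress line like the one described above can be derived from the task list alone. The sketch below assumes a simplified task shape. `summarizeTasks`, `formatProgress`, and `TaskView` are hypothetical names introduced for illustration.

```typescript
type TaskStatus = 'pending' | 'claimed' | 'done' | 'failed';

// Simplified view of a task; only the status is needed for counting.
interface TaskView {
  id: string;
  status: TaskStatus;
}

// Hypothetical helper: count tasks per status.
function summarizeTasks(tasks: TaskView[]): Record<TaskStatus, number> {
  const counts: Record<TaskStatus, number> = { pending: 0, claimed: 0, done: 0, failed: 0 };
  for (const task of tasks) {
    counts[task.status] += 1;
  }
  return counts;
}

// Hypothetical helper: render the counts as a one-line progress report.
function formatProgress(counts: Record<TaskStatus, number>): string {
  const total = counts.pending + counts.claimed + counts.done + counts.failed;
  const finished = counts.done + counts.failed;
  return `${finished}/${total} finished ` +
    `(pending ${counts.pending}, claimed ${counts.claimed}, done ${counts.done}, failed ${counts.failed})`;
}

const progressLine = formatProgress(summarizeTasks([
  { id: 'task-1', status: 'done' },
  { id: 'task-2', status: 'done' },
  { id: 'task-3', status: 'claimed' },
  { id: 'task-4', status: 'pending' },
]));
```

The same counts also give the completion check: the swarm is finished when both `pending` and `claimed` are zero.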
7. Completion
Exit when ANY of:
- isSwarmComplete() returns true (all tasks done or failed)
- All agents idle (no pending tasks, no claimed tasks)
- User cancels via /cancel
Storage
SQLite Database (.omd/state/swarm.db)
The swarm uses a single SQLite database stored at .omd/state/swarm.db. This provides:
- ACID compliance - All task state transitions are atomic
- Concurrent access - Multiple agents query/update safely
- Persistence - State survives agent crashes
- Query efficiency - Fast status lookups and filtering
tasks Table Schema

    CREATE TABLE tasks (
      id TEXT PRIMARY KEY,
      description TEXT NOT NULL,
      status TEXT NOT NULL DEFAULT 'pending',
        -- pending: waiting to be claimed
        -- claimed: claimed by an agent, in progress
        -- done: completed successfully
        -- failed: completed with error
      claimed_by TEXT,       -- agent ID that claimed this task
      claimed_at INTEGER,    -- Unix timestamp when claimed
      completed_at INTEGER,  -- Unix timestamp when completed
      result TEXT,           -- Optional result/output from task
      error TEXT             -- Error message if task failed
    );

heartbeats Table Schema

    CREATE TABLE heartbeats (
      agent_id TEXT PRIMARY KEY,
      last_heartbeat INTEGER NOT NULL,  -- Unix timestamp of last heartbeat
      current_task_id TEXT              -- Task agent is currently working on
    );

session Table Schema

    CREATE TABLE session (
      id TEXT PRIMARY KEY,
      agent_count INTEGER NOT NULL,
      started_at INTEGER NOT NULL,
      completed_at INTEGER,
      active INTEGER DEFAULT 1
    );

Task Claiming Protocol (Detailed)
Atomic Claim Operation with SQLite
The core strength of the new implementation is transactional atomicity:

    function claimTask(agentId: string): ClaimResult {
      // Transaction ensures only ONE agent succeeds
      const claimTransaction = db.transaction((): ClaimResult => {
        // Step 1: Find first pending task
        // (SQL string literals use single quotes; double quotes denote identifiers)
        const task = db.prepare(
          "SELECT id, description FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).get();

        if (!task) {
          return { success: false, taskId: null, reason: 'No pending tasks' };
        }

        // Step 2: Attempt claim (will only succeed if status is still 'pending')
        const result = db.prepare(
          "UPDATE tasks SET status = 'claimed', claimed_by = ?, claimed_at = ? WHERE id = ? AND status = 'pending'"
        ).run(agentId, Date.now(), task.id);

        if (result.changes === 0) {
          // Another agent claimed it between SELECT and UPDATE - try next
          return { success: false, taskId: null, reason: 'Task was claimed by another agent' };
        }

        // Step 3: Update heartbeat to show we're alive and working
        db.prepare(
          'INSERT OR REPLACE INTO heartbeats (agent_id, last_heartbeat, current_task_id) VALUES (?, ?, ?)'
        ).run(agentId, Date.now(), task.id);

        return { success: true, taskId: task.id, description: task.description };
      });

      // Run with BEGIN IMMEDIATE so the RESERVED lock is taken before the SELECT.
      // Assumes a better-sqlite3-style API, where .immediate() executes the transaction
      // and returns its result (so it must not be called a second time).
      return claimTransaction.immediate();
    }

Why SQLite Transactions Work:
- Transactions are run via .immediate() (BEGIN IMMEDIATE), which acquires the RESERVED lock up front
- Prevents other agents from modifying rows between SELECT and UPDATE
- All-or-nothing atomicity: claim succeeds completely or fails completely
- No race conditions, no lost updates
Lease Timeout & Auto-Release
Tasks are automatically released if claimed too long without heartbeat:

    function cleanupStaleClaims(leaseTimeout: number = 5 * 60 * 1000) {
      // Default 5-minute timeout
      const cutoffTime = Date.now() - leaseTimeout;

      const cleanupTransaction = db.transaction(() => {
        // Find claimed tasks where BOTH hold:
        // 1. Claimed longer ago than the timeout, AND
        // 2. The agent has no heartbeat row, or no heartbeat within that time
        const staleTasks = db.prepare(`
          SELECT t.id
          FROM tasks t
          LEFT JOIN heartbeats h ON t.claimed_by = h.agent_id
          WHERE t.status = 'claimed'
            AND t.claimed_at < ?
            AND (h.last_heartbeat IS NULL OR h.last_heartbeat < ?)
        `).all(cutoffTime, cutoffTime);

        // Release each stale task back to pending
        const release = db.prepare(
          "UPDATE tasks SET status = 'pending', claimed_by = NULL, claimed_at = NULL WHERE id = ?"
        );
        for (const staleTask of staleTasks) {
          release.run(staleTask.id);
        }

        return staleTasks.length;
      });

      // Run with BEGIN IMMEDIATE (better-sqlite3-style API assumed, as in claimTask)
      return cleanupTransaction.immediate();
    }

How Recovery Works:
- Orchestrator calls cleanupStaleClaims() every 60 seconds
- If agent hasn't sent heartbeat in 5 minutes, task is auto-released
- Another agent picks up the orphaned task
- Original agent can continue working (it doesn't know it was released)
- When original agent tries to mark task as done, verification fails safely
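The last point depends on completion being conditional on ownership. The in-memory model below sketches that check. It assumes `completeTask()` behaves like an UPDATE whose WHERE clause also matches on the owner and on the claimed status, which this document implies but does not show. `completeIfOwner` and `taskTable` are illustrative names.

```typescript
interface ClaimedTask {
  id: string;
  status: 'pending' | 'claimed' | 'done' | 'failed';
  claimedBy: string | null;
  result?: string;
}

// In-memory stand-in for the tasks table, for illustration only.
const taskTable = new Map<string, ClaimedTask>();

// Models UPDATE ... WHERE id = ? AND claimed_by = ? AND status = 'claimed':
// returns false when no row matches, i.e. the caller no longer owns the task.
function completeIfOwner(agentId: string, taskId: string, result: string): boolean {
  const task = taskTable.get(taskId);
  if (!task || task.status !== 'claimed' || task.claimedBy !== agentId) {
    return false;
  }
  task.status = 'done';
  task.result = result;
  return true;
}

// Scenario: agent-1 stalled, its claim was released, and agent-2 now owns task-1.
taskTable.set('task-1', { id: 'task-1', status: 'claimed', claimedBy: 'agent-2' });
const lateWrite = completeIfOwner('agent-1', 'task-1', 'stale result');
const ownerWrite = completeIfOwner('agent-2', 'task-1', 'fixed');
```

Here the late write from agent-1 is rejected and the task keeps the result written by agent-2, so a released agent cannot overwrite the new owner's work.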
API Reference
Agents interact with the swarm via a TypeScript API:
Initialization

    import { startSwarm, connectToSwarm } from './swarm';

    // Orchestrator starts the swarm
    await startSwarm({
      agentCount: 5,
      tasks: ['fix a.ts', 'fix b.ts', ...],
      leaseTimeout: 5 * 60 * 1000,  // 5 minutes (default)
      heartbeatInterval: 60 * 1000  // 60 seconds (default)
    });

    // Agents join existing swarm
    await connectToSwarm(process.cwd());

Agent Loop Pattern

    import {
      claimTask, completeTask, failTask,
      heartbeat, hasPendingWork, disconnectFromSwarm
    } from './swarm';

    const agentId = 'agent-1';

    // Main work loop
    while (hasPendingWork()) {
      // Claim next task
      const claim = claimTask(agentId);

      if (!claim.success) {
        console.log('No tasks available:', claim.reason);
        break;
      }

      const { taskId, description } = claim;
      console.log(`Agent ${agentId} working on: ${description}`);

      try {
        // Do the work... (executeTask is a placeholder for the agent's own logic)
        const result = await executeTask(description);

        // Mark complete
        completeTask(agentId, taskId, result);
        console.log(`Agent ${agentId} completed task ${taskId}`);
      } catch (error) {
        // Mark failed; a thrown value is not guaranteed to be an Error
        const message = error instanceof Error ? error.message : String(error);
        failTask(agentId, taskId, message);
        console.error(`Agent ${agentId} failed on ${taskId}:`, error);
      }

      // Heartbeat between tasks. Long-running tasks should also call
      // heartbeat() at least every 60 seconds while they work.
      heartbeat(agentId);
    }

    // Cleanup
    disconnectFromSwarm();

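The loop above sends a heartbeat between tasks, which is not enough for a task that runs longer than the lease. One way to keep the lease alive during long work is a timer that is always cleared afterwards. This is a sketch under assumptions: `startHeartbeat` and `withHeartbeat` are hypothetical helpers, and the `beat` callback stands in for a call such as `heartbeat(agentId)`.

```typescript
// Hypothetical helper: sends one heartbeat right away, then one per interval,
// and returns a function that stops the timer.
function startHeartbeat(beat: () => void, intervalMs: number = 60 * 1000): () => void {
  beat();
  const timer = setInterval(beat, intervalMs);
  return () => clearInterval(timer);
}

// Hypothetical wrapper: keeps heartbeats flowing while an async task runs.
async function withHeartbeat<T>(beat: () => void, work: () => Promise<T>): Promise<T> {
  const stop = startHeartbeat(beat);
  try {
    return await work();
  } finally {
    stop(); // always clear the timer, even if the task throws
  }
}
```

Inside the agent loop this could be used as `await withHeartbeat(() => heartbeat(agentId), () => executeTask(description))`. The timer only fires while the event loop is free, so the work itself needs to be asynchronous.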
Core API Functions
startSwarm(config: SwarmConfig): Promise<boolean>
Initialize the swarm with task pool and start cleanup timer.

    const success = await startSwarm({
      agentCount: 5,
      tasks: ['task 1', 'task 2', 'task 3'],
      leaseTimeout: 5 * 60 * 1000,
      heartbeatInterval: 60 * 1000
    });

stopSwarm(deleteDatabase?: boolean): boolean
Stop the swarm and optionally delete the database.

    stopSwarm(true); // Delete database on cleanup

claimTask(agentId: string): ClaimResult
Claim the next pending task. Returns { success, taskId, description, reason }.

    const claim = claimTask('agent-1');
    if (claim.success) {
      console.log(`Claimed: ${claim.description}`);
    }

completeTask(agentId: string, taskId: string, result?: string): boolean
Mark a task as done. Only succeeds if agent still owns the task.

    completeTask('agent-1', 'task-1', 'Fixed the bug');

failTask(agentId: string, taskId: string, error: string): boolean
Mark a task as failed with error details.

    failTask('agent-1', 'task-1', 'Could not compile: missing dependency');

heartbeat(agentId: string): boolean
Send a heartbeat to indicate agent is alive. Call every 60 seconds during long-running tasks.

    heartbeat('agent-1');

cleanupStaleClaims(leaseTimeout?: number): number
Manually trigger cleanup of expired claims. Called automatically every 60 seconds.

    const released = cleanupStaleClaims(5 * 60 * 1000);
    console.log(`Released ${released} stale tasks`);

hasPendingWork(): boolean
Check if there are unclaimed tasks available.

    if (!hasPendingWork()) {
      console.log('All tasks claimed or completed');
    }

isSwarmComplete(): boolean
Check if all tasks are done or failed.

    if (isSwarmComplete()) {
      console.log('Swarm finished!');
    }

getSwarmStats(): SwarmStats | null
Get task counts and timing info. The result can be null, so check it before use.

    const stats = getSwarmStats();
    if (stats) {
      console.log(`${stats.doneTasks}/${stats.totalTasks} done`);
    }

getActiveAgents(): number
Get count of agents with recent heartbeats.

    const active = getActiveAgents();
    console.log(`${active} agents currently active`);

getAllTasks(): SwarmTask[]
Get all tasks with current status.

    const tasks = getAllTasks();
    const pending = tasks.filter(t => t.status === 'pending');

getTasksWithStatus(status: string): SwarmTask[]
Filter tasks by status: 'pending', 'claimed', 'done', 'failed'.

    const failed = getTasksWithStatus('failed');

getAgentTasks(agentId: string): SwarmTask[]
Get all tasks claimed by a specific agent.

    const myTasks = getAgentTasks('agent-1');

retryTask(agentId: string, taskId: string): ClaimResult
Attempt to reclaim a failed task.

    const retry = retryTask('agent-1', 'task-1');
    if (retry.success) {
      console.log('Task reclaimed, trying again...');
    }

Configuration (SwarmConfig)

    interface SwarmConfig {
      agentCount: number;          // Number of agents (1-5)
      tasks: string[];             // Task descriptions
      agentType?: string;          // Agent type (default: 'executor')
      leaseTimeout?: number;       // Milliseconds (default: 5 min)
      heartbeatInterval?: number;  // Milliseconds (default: 60 sec)
      cwd?: string;                // Working directory
    }

Types

    interface SwarmTask {
      id: string;
      description: string;
      status: 'pending' | 'claimed' | 'done' | 'failed';
      claimedBy: string | null;
      claimedAt: number | null;
      completedAt: number | null;
      error?: string;
      result?: string;
    }

    interface ClaimResult {
      success: boolean;
      taskId: string | null;
      description?: string;
      reason?: string;
    }

    interface SwarmStats {
      totalTasks: number;
      pendingTasks: number;
      claimedTasks: number;
      doneTasks: number;
      failedTasks: number;
      activeAgents: number;
      elapsedTime: number;
    }

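The table columns use snake_case (`claimed_by`, `claimed_at`) while the `SwarmTask` type uses camelCase, so some mapping layer has to sit between them. The sketch below shows one plausible shape for it. `TaskRow` and `rowToTask` are assumptions, not the library's actual internals, and `SwarmTask` is repeated only so the example is self-contained.

```typescript
// Row shape as stored in the tasks table (snake_case columns).
interface TaskRow {
  id: string;
  description: string;
  status: 'pending' | 'claimed' | 'done' | 'failed';
  claimed_by: string | null;
  claimed_at: number | null;
  completed_at: number | null;
  result: string | null;
  error: string | null;
}

// Same shape as the SwarmTask type above.
interface SwarmTask {
  id: string;
  description: string;
  status: 'pending' | 'claimed' | 'done' | 'failed';
  claimedBy: string | null;
  claimedAt: number | null;
  completedAt: number | null;
  error?: string;
  result?: string;
}

// Hypothetical mapper from a database row to the API type.
function rowToTask(row: TaskRow): SwarmTask {
  const task: SwarmTask = {
    id: row.id,
    description: row.description,
    status: row.status,
    claimedBy: row.claimed_by,
    claimedAt: row.claimed_at,
    completedAt: row.completed_at,
  };
  // Optional fields are only set when the column is non-NULL.
  if (row.error !== null) task.error = row.error;
  if (row.result !== null) task.result = row.result;
  return task;
}

const mapped = rowToTask({
  id: 'task-1', description: 'fix a.ts', status: 'failed',
  claimed_by: 'agent-1', claimed_at: 1000, completed_at: 2000,
  result: null, error: 'Could not compile',
});
```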
Key Parameters
- Max Agents: 5 (enforced by Factory Droid background task limit)
- Lease Timeout: 5 minutes (default, configurable)
  - Tasks claimed longer than this without heartbeat are auto-released
- Heartbeat Interval: 60 seconds (recommended)
  - Agents should call heartbeat() at least this often
  - Prevents false timeout while working on long tasks
- Cleanup Interval: 60 seconds
  - Orchestrator automatically runs cleanupStaleClaims() to release orphaned tasks
- Database: SQLite (stored at .omd/state/swarm.db)
  - One database per swarm session
  - Survives agent crashes
  - Provides ACID guarantees
Error Handling & Recovery
Agent Crash
- Task is claimed but agent stops sending heartbeats
- After 5 minutes of no heartbeat, cleanupStaleClaims() releases the task
- Task returns to 'pending' status for another agent to claim
- Original agent's incomplete work is safely abandoned
Task Completion Failure
- Agent calls completeTask() but is no longer the owner (was released)
- The update silently fails (no row matches the WHERE clause)
- Agent can detect this by checking return value
- Agent should log error and continue to next task
Database Unavailable
- startSwarm() returns false if database initialization fails
- claimTask() returns { success: false, reason: 'Database not initialized' }
- Check isSwarmReady() before proceeding
All Agents Idle
- Orchestrator detects via getActiveAgents() === 0 or hasPendingWork() === false
- Triggers final cleanup and marks swarm as complete
- Remaining failed tasks are preserved in database
No Tasks Available
- claimTask() returns success=false with reason 'No pending tasks available'
- Agent should check hasPendingWork() before looping
- Safe for agent to exit cleanly when no work remains
Cancel Swarm
User can cancel via /cancel:
- Stops orchestrator monitoring
- Signals all background agents to exit
- Preserves partial progress in SQLite database
- Marks session as "cancelled" in database
Use Cases
1. Fix All Type Errors
/swarm 5:executor "fix all TypeScript type errors"
Spawns 5 executors, each claiming and fixing individual files.
2. Implement UI Components
/swarm 3:designer "implement Material-UI styling for all components in src/components/"
Spawns 3 designers, each styling different component files.
3. Security Audit
/swarm 4:security-reviewer "review all API endpoints for vulnerabilities"
Spawns 4 security reviewers, each auditing different endpoints.
4. Documentation Sprint
/swarm 2:writer "add JSDoc comments to all exported functions"
Spawns 2 writers, each documenting different modules.
Benefits of SQLite-Based Implementation
Atomicity & Safety
- Race-Condition Free: SQLite transactions guarantee only one agent claims each task
- No Lost Updates: ACID compliance means state changes are durable
- Orphan Prevention: Expired claims are automatically released without manual intervention
Performance
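- Fast Queries: Status and agent lookups stay cheap at swarm scale; the schema shown here declares no secondary indexes, so consider indexing status and claimed_by if the task pool grows large
- Concurrent Access: Multiple agents can read at the same time; SQLite serializes writes, which is what keeps claims safe
- Minimal Lock Time: Each claim transaction touches a single task row, so the write lock is held only briefly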
Reliability
- Crash Recovery: Database survives agent failures
- Automatic Cleanup: Stale claims don't block progress
- Lease-Based: Time-based expiration prevents indefinite hangs
Developer Experience
- Simple API: Just claimTask(), completeTask(), heartbeat()
- Full Visibility: Query any task or agent status at any time
- Easy Debugging: SQL queries show exact state without decoding JSON
Scalability
- 10s to 1000s of Tasks: SQLite handles easily
- Full Task Retention: Complete history in database for analysis
- Extensible Schema: Add custom columns for task metadata
STATE CLEANUP ON COMPLETION
IMPORTANT: Delete state files on completion - do NOT just set active: false
When all tasks are done:

    # Delete swarm state files
    rm -f .omd/state/swarm-state.json
    rm -f .omd/state/swarm-tasks.json
    rm -f .omd/state/swarm-claims.json

Implementation Notes
The orchestrator (main skill handler) is responsible for:
- Initial task decomposition (via explore/architect)
- Creating and initializing SQLite database via startSwarm()
- Spawning N background agents
- Monitoring progress via getSwarmStats() and getActiveAgents()
- Running cleanupStaleClaims() automatically (via setInterval)
- Detecting completion via isSwarmComplete()
- Reporting final summary from database query
Each agent is a standard Task invocation with:
- run_in_background: true
- Agent-specific prompt with work loop instructions
- API import: import { claimTask, completeTask, ... } from './swarm'
- Connection: await connectToSwarm(cwd) to join existing swarm
- Loop: repeatedly call claimTask() → do work → completeTask() or failTask()