Skills clawdcursor
git clone https://github.com/openclaw/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/amrdab/clawdcursor" ~/.claude/skills/openclaw-skills-clawdcursor-270728 && rm -rf "$T"
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/amrdab/clawdcursor" ~/.openclaw/skills/openclaw-skills-clawdcursor-270728 && rm -rf "$T"
skills/amrdab/clawdcursor/SKILL.md
If a human can do it on a screen, you can too. No API? No integration? No problem.
USE AS A FALLBACK — NOT FIRST CHOICE
Before reaching for any ClawdCursor tool, ask:
- Is there a native API? (Gmail API, GitHub API, Slack API) → use the API
- Is there a CLI? (git, npm, curl) → use the CLI
- Can you edit the file directly? → do that
- Is there a browser automation layer? (Playwright, Puppeteer) → use that
None of the above work? Now use ClawdCursor. It's for the last mile.
Modes at a Glance
| Mode | Command | Brain | Tools available |
|---|---|---|---|
| serve | clawdcursor serve | You (REST client) | All 42 tools via HTTP |
| mcp | clawdcursor mcp | You (MCP client) | All 42 tools via MCP stdio |
| start | clawdcursor start | Built-in LLM pipeline | All 42 tools + autonomous agent |
In serve and mcp modes, you reason and ClawdCursor acts. There is no built-in LLM: you call tools, interpret results, and decide next steps.
Connecting
Option A: REST (clawdcursor serve)
    clawdcursor serve   # starts on http://127.0.0.1:3847
All POST endpoints require:
Authorization: Bearer <token> (token saved to ~/.clawdcursor/token)
    GET  /tools            → all tool schemas (OpenAI function-calling format)
    POST /execute/{name}   → run a tool: {"param": "value"}
    GET  /health           → {"status":"ok","version":"0.7.5"}
    GET  /docs             → full documentation
Example:
    POST /execute/get_windows {}
    POST /execute/mouse_click {"x": 640, "y": 400}
    POST /execute/type_text {"text": "hello world"}
If the server isn't running, start it yourself — don't ask the user:
    clawdcursor serve
    # wait 2 seconds, then verify:
    GET /health
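The REST calls above can be sketched in Python using only the standard library. This is a minimal sketch, not ClawdCursor's own client: `read_token`, `build_execute_request`, and `call_tool` are hypothetical helper names, and it assumes the token file holds only the bare token and that replies are JSON.

```python
import json
import urllib.request
from pathlib import Path
from typing import Optional

BASE_URL = "http://127.0.0.1:3847"  # default address of `clawdcursor serve`
TOKEN_PATH = Path.home() / ".clawdcursor" / "token"


def read_token(path: Path = TOKEN_PATH) -> str:
    """Read the bearer token (assumes the file contains only the token)."""
    return path.read_text(encoding="utf-8").strip()


def build_execute_request(name: str, params: dict, token: str) -> urllib.request.Request:
    """Build an authenticated POST /execute/{name} request without sending it."""
    return urllib.request.Request(
        f"{BASE_URL}/execute/{name}",
        data=json.dumps(params).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def call_tool(name: str, params: Optional[dict] = None, timeout: float = 30.0) -> dict:
    """Send the request and decode the reply (assumed to be a JSON object)."""
    request = build_execute_request(name, params or {}, read_token())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `call_tool("mouse_click", {"x": 640, "y": 400})` would correspond to the second example request above.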
Option B: MCP (clawdcursor mcp)
    {
      "mcpServers": {
        "clawdcursor": {
          "command": "clawdcursor",
          "args": ["mcp"]
        }
      }
    }
Works with Claude Code, Cursor, Windsurf, Zed, or any MCP-compatible client. All 42 tools are exposed identically.
Option C: Autonomous agent (clawdcursor start)
    POST /task {"task": "Open Notepad and write Hello"}   → submit task
    GET  /status                        → {"status": "acting"} | "idle" | "waiting_confirm"
    POST /confirm {"approved": true}    → approve safety-gated action
    POST /abort                         → stop current task
Use the delegate_to_agent tool to submit tasks from within MCP/REST sessions. It requires clawdcursor start running on port 3847.
Polling pattern:
    POST /task {"task": "...", "returnPartial": true}
    → poll GET /status every 2s:
        "acting"          → still running, keep polling
        "waiting_confirm" → STOP. Ask user → POST /confirm {"approved": true}
        "idle"            → done, check GET /task-logs for result
    → if 60s+ with no progress: POST /abort, retry with simpler phrasing
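The polling pattern can be sketched as a small loop. This is an illustration under assumptions: `poll_task` is a hypothetical helper, `get_status` stands in for whatever code performs GET /status and returns the status string, and "no progress" is simplified to total elapsed time.

```python
import time
from typing import Callable


def poll_task(
    get_status: Callable[[], str],
    interval: float = 2.0,
    stall_timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the agent is idle, needs confirmation, or stalls.

    Returns "done", "needs_confirmation", or "stalled". The caller decides
    what happens next: read /task-logs, ask the user, or POST /abort.
    """
    started = clock()
    while True:
        status = get_status()
        if status == "idle":
            return "done"
        if status == "waiting_confirm":
            return "needs_confirmation"  # never auto-approve; ask the user
        if clock() - started >= stall_timeout:
            return "stalled"  # simplification: elapsed time, not real progress
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop testable without waiting in real time.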
returnPartial mode: send {"returnPartial": true} with POST /task.
ClawdCursor skips Stage 3 (expensive vision) and returns control to you if Stage 2 fails:
{"partial": true, "stepsCompleted": [...], "context": "got stuck on dialog"}
You finish the task with MCP tools, then call POST /learn to save what worked.
POST /learn — adaptive learning: After completing a task with your own tool calls, teach ClawdCursor for next time:
    POST /learn
    {
      "processName": "EXCEL",
      "task": "create table with headers",
      "actions": [
        {"action": "key", "description": "Ctrl+Home to go to A1"},
        {"action": "type", "description": "Type header name"},
        {"action": "key", "description": "Tab to next column"}
      ],
      "shortcuts": {"next_cell": "Tab", "next_row": "Enter"},
      "tips": ["Use Tab between columns, Enter between rows"]
    }
This enriches the app's guide JSON. Stage 2 reads it on the next run — no vision fallback needed.
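A /learn body like the one above could be assembled as follows. This is a sketch: `build_learn_payload` is a hypothetical helper, the field names are copied from the example, and the check that every step has both keys is an assumption rather than a documented server rule.

```python
import json
from typing import Dict, List, Optional


def build_learn_payload(
    process_name: str,
    task: str,
    actions: List[Dict[str, str]],
    shortcuts: Optional[Dict[str, str]] = None,
    tips: Optional[List[str]] = None,
) -> str:
    """Serialize a /learn body using the field names from the example."""
    for step in actions:
        # Every step in the documented example carries both keys.
        if "action" not in step or "description" not in step:
            raise ValueError(f"step needs 'action' and 'description': {step!r}")
    payload = {
        "processName": process_name,
        "task": task,
        "actions": actions,
        "shortcuts": shortcuts or {},
        "tips": tips or [],
    }
    return json.dumps(payload)
```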
The Universal Loop
Every GUI task follows the same pattern regardless of transport:
    1. ORIENT → read_screen() or get_windows()                see what's open and focused
    2. ACT    → smart_click() / smart_type() / key_press()    do the thing
    3. VERIFY → check return value → window state → text check → screenshot
    4. REPEAT → until done
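The loop can be sketched independently of transport. In this illustration `run_steps` is a hypothetical helper; `orient`, `act`, and `verify` are callables you supply, wrapping whichever tools you call over REST or MCP.

```python
from typing import Any, Callable, Iterable, Tuple

# One step is an (act, verify) pair: act receives the current screen reading,
# verify receives whatever act returned and answers True on success.
Step = Tuple[Callable[[Any], Any], Callable[[Any], bool]]


def run_steps(orient: Callable[[], Any], steps: Iterable[Step]) -> int:
    """Run steps in order and return how many completed before a failure."""
    completed = 0
    for act, verify in steps:
        screen = orient()        # 1. ORIENT
        result = act(screen)     # 2. ACT
        if not verify(result):   # 3. VERIFY
            break                # stop so the caller can recover or re-plan
        completed += 1           # 4. REPEAT with the next step
    return completed
```

Stopping at the first failed verification keeps later actions from running against an unexpected screen state.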
Verification (cheapest to most expensive)
- Tool return value: every tool reports success/failure. Check it first.
- Window state: get_active_window(), get_windows(). Did a dialog appear? Did the title change?
- Text check: read_screen() or smart_read(). Is the expected text visible?
- Screenshot: desktop_screenshot(). Use only when text methods fail; it costs the most.
- Negative check: look for error dialogs, wrong window, unchanged screen.
Always verify after: sends, saves, deletes, form submissions. Skip verification for: mid-sequence keystrokes, scrolling.
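The cheapest-first ordering can be sketched as a ladder of checks that stops at the first conclusive answer. `first_conclusive` is a hypothetical helper; each check returns True or False when it can decide and None when it cannot.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], Optional[bool]]]


def first_conclusive(checks: List[Check]) -> Tuple[Optional[str], bool]:
    """Run checks in the given (cheapest-first) order.

    Returns (name_of_deciding_check, outcome). Later, more expensive checks
    are never called once an earlier one gives a definite answer.
    """
    for name, check in checks:
        outcome = check()
        if outcome is not None:
            return name, outcome
    return None, False  # nothing could confirm success
```

In practice the list would run from the tool's return value through window state and text checks, with the screenshot check last.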
Tool Decision Trees
Perception — always start here
    read_screen()          → FIRST. Accessibility tree: buttons, inputs, text, with coords.
                             Fast, structured, works on native apps.
    ocr_read_screen()      → When a11y tree is empty (canvas UIs, image-based apps).
    smart_read()           → Combines OCR + a11y. Good first call when unsure.
    desktop_screenshot()   → LAST RESORT. Only when you need pixel-level visual detail.
    desktop_screenshot_region(x,y,w,h) → Zoomed crop when you need detail in one area.
Clicking
    smart_click("Save")           → FIRST. Finds by label/text via OCR + a11y, clicks.
                                    Pass processId to target the right window.
    invoke_element(name="Save")   → When you know the exact automation ID from read_screen.
    cdp_click(text="Submit")      → Browser elements. Requires cdp_connect() first.
    mouse_click(x, y)             → LAST RESORT. Raw coordinates from a screenshot.
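The clicking order above can be sketched as a fallback chain. Several details are assumptions: `click_with_fallback` is a hypothetical helper, `call_tool` is any function that runs a tool and returns a dict, the `"success"` key in that dict is hypothetical, and so is the `"text"` parameter name for smart_click. Check GET /tools for the real schemas.

```python
from typing import Any, Callable, Dict, Optional, Tuple

ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def click_with_fallback(
    call_tool: ToolCaller,
    label: str,
    coords: Optional[Tuple[int, int]] = None,
) -> Optional[str]:
    """Try click tools from preferred to last resort; return the one that worked."""
    attempts = [
        ("smart_click", {"text": label}),     # "text" is an assumed parameter name
        ("invoke_element", {"name": label}),  # name= appears in the tree above
    ]
    if coords is not None:
        # Raw coordinates are the last resort, and only if the caller has them.
        attempts.append(("mouse_click", {"x": coords[0], "y": coords[1]}))
    for name, params in attempts:
        result = call_tool(name, params)
        if result.get("success"):  # assumed result shape
            return name
    return None
```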
Typing
    smart_type("Email", "user@x.com")      → FIRST. Finds field by label, focuses, types.
    cdp_type(label="Email", text="…")      → Browser inputs. Requires cdp_connect() first.
    type_text("hello")                     → Clipboard paste into whatever is focused.
                                             Use after manually focusing with smart_click.
Browser / CDP
    1. navigate_browser(url)        → opens URL, auto-enables CDP
    2. cdp_connect()                → connect to browser DevTools Protocol
    3. cdp_page_context()           → list interactive elements on page
    4. cdp_read_text()              → extract DOM text (returns empty on canvas apps → use OCR)
    5. cdp_click(text="…")          → click by visible text
    6. cdp_type(label, text)        → fill input by label
    7. cdp_evaluate(script)         → run JavaScript in page context
    8. cdp_scroll(direction, px)    → scroll page via DOM (not mouse wheel)
    9. cdp_list_tabs()              → list all open tabs
    10. cdp_switch_tab(target)      → switch to a specific tab
If CDP isn't connected, switch tabs with keyboard:
    key_press("ctrl+1")           → tab 1
    key_press("ctrl+tab")         → next tab
    key_press("ctrl+shift+tab")   → previous tab
Window Management
    get_windows()                         → list all open windows (use to find PIDs)
    get_active_window()                   → what's in the foreground right now
    focus_window(processName="Discord")   → bring to front (auto-minimizes phantom off-screen windows)
    minimize_window(processName="calc")   → minimize a window (1 call, cross-platform)
                                            also accepts: processId, title
Rule: Always call focus_window() before key_press() or type_text(). Keystrokes go to whatever has focus; if that is your terminal, they land there instead of in the target app.
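One way to make the rule hard to forget is to wrap both calls in a single helper. This is a sketch: `press_in_app` is a hypothetical name, `call_tool` is any function that runs a tool by name, and `"key"` is an assumed parameter name for key_press.

```python
from typing import Any, Callable, Dict

ToolCaller = Callable[[str, Dict[str, Any]], Any]


def press_in_app(call_tool: ToolCaller, process_name: str, combo: str) -> Any:
    """Focus the target window first, then send the key combo."""
    call_tool("focus_window", {"processName": process_name})
    # "key" is a hypothetical parameter name; check GET /tools for the real schema.
    return call_tool("key_press", {"key": combo})
```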
Canvas apps (Google Docs, Figma, Notion)
DOM has no readable text. Pattern:
    ocr_read_screen()        → read content (DOM extraction fails)
    mouse_click(x, y)        → click into the canvas area
    type_text("your text")   → clipboard paste works even on canvas
Quick Patterns
Open app and type:
open_app("notepad") → wait(2) → smart_read() → type_text("Hello") → smart_read()
Read a webpage:
navigate_browser(url) → wait(3) → cdp_connect() → cdp_read_text()
Fill a web form:
cdp_connect() → cdp_type("Email", "x@x.com") → cdp_type("Password", "…") → cdp_click("Submit")
Cross-app copy/paste:
focus_window("Chrome") → key_press("ctrl+a") → key_press("ctrl+c") → read_clipboard() → focus_window("Notepad") → type_text(clipboard)
Send email via Outlook:
open_app("outlook") → wait(2) → smart_click("New Email") → mouse_click(to_field_x, to_field_y) → type_text("recipient@x.com") → key_press("Tab") → mouse_click(subject_x, subject_y) → type_text("Subject") → key_press("Tab") → mouse_click(body_x, body_y) → type_text("Body text") → mouse_click(send_x, send_y)
Autonomous complex task (requires clawdcursor start):
    delegate_to_agent("Open Gmail, find latest email from Stripe, forward to billing@x.com")
    → poll GET /status every 2s
    → if waiting_confirm: ask user → POST /confirm {"approved": true}
    → if idle: task done
Full Tool Reference (42 tools)
Speed: ⚡ Free/instant · 🔵 Cheap · 🟡 Moderate · 🔴 Vision (expensive)
Perception (6)
| Tool | What it does | When |
|---|---|---|
| read_screen | A11y tree: buttons, inputs, text, coords | ⚡ Default first read |
| smart_read | OCR + a11y combined | 🔵 When unsure which to use |
| ocr_read_screen | Raw OCR text with bounding boxes | 🔵 Canvas UIs, empty a11y trees |
| desktop_screenshot | Full screen image (1280px wide) | ⚡ Last resort visual check |
| desktop_screenshot_region | Zoomed crop of specific area | ⚡ Fine-grained visual detail |
| | Screen dimensions and DPI | ⚡ Coordinate calculations |
Mouse (7)
| Tool | What it does | When |
|---|---|---|
| smart_click | Find element by text/label, click | 🔵 First choice for clicking |
| mouse_click | Left click at (x, y) | ⚡ Last resort |
| | Double click at (x, y) | ⚡ Open files, select words |
| | Right click at (x, y) | ⚡ Context menus |
| | Move cursor without clicking | ⚡ Hover menus |
| | Scroll at position (physical mouse wheel) | ⚡ Scroll content |
| | Drag from start to end | ⚡ Resize, select ranges |
Keyboard (5)
| Tool | What it does | When |
|---|---|---|
| smart_type | Find input by label, focus it, type | 🔵 First choice for form fields |
| type_text | Clipboard paste into focused element | ⚡ After manually focusing |
| key_press | Send key combo | ⚡ After focus_window |
| | List keyboard shortcuts for current app | ⚡ Before reaching for mouse |
| | Run a named shortcut (fuzzy match) | ⚡ Save, copy, paste, undo |
Window Management (5)
| Tool | What it does | When |
|---|---|---|
| get_windows | List all open windows with PIDs and bounds | ⚡ Situational awareness |
| get_active_window | Current foreground window | ⚡ Check current focus |
| | Element with keyboard focus | ⚡ Debug wrong-field typing |
| focus_window | Bring window to front (auto-clears off-screen phantoms) | ⚡ Always before key_press |
| minimize_window | Minimize by processName, processId, or title | ⚡ Clear focus stealers |
UI Elements (2)
| Tool | What it does | When |
|---|---|---|
| | Search UI tree by name or type | ⚡ Find automation IDs |
| invoke_element | Invoke element by automation ID or name | ⚡ When ID known from read_screen |
Clipboard (2)
| Tool | What it does | When |
|---|---|---|
| read_clipboard | Read clipboard text | ⚡ After copy operations |
| | Write text to clipboard | ⚡ Before paste operations |
Browser / CDP (11)
| Tool | What it does | When |
|---|---|---|
| cdp_connect | Connect to browser DevTools Protocol | ⚡ First step for any browser task |
| cdp_page_context | List interactive elements on page | ⚡ After connect |
| cdp_read_text | Extract DOM text | ⚡ Read page content |
| cdp_click | Click by CSS selector or visible text | ⚡ Browser clicks |
| cdp_type | Type into input by label or selector | ⚡ Browser form filling |
| | Select dropdown option | ⚡ Select elements |
| cdp_evaluate | Run JavaScript in page context | ⚡ Custom queries |
| cdp_scroll | Scroll page via DOM (direction, px) | ⚡ DOM-level scroll |
| | Wait for element to appear | ⚡ After navigation/AJAX |
| cdp_list_tabs | List all browser tabs | ⚡ When on wrong tab |
| cdp_switch_tab | Switch to a tab by title or index | ⚡ After cdp_list_tabs |
Orchestration (4)
| Tool | What it does | When |
|---|---|---|
| open_app | Launch application by name | ⚡ First step for desktop tasks |
| navigate_browser | Open URL (auto-enables CDP) | ⚡ First step for browser tasks |
| wait | Pause N seconds | ⚡ After opening apps, let UI render |
| delegate_to_agent | Send task to built-in autonomous agent | 🟡 Complex multi-step tasks (requires clawdcursor start) |
Provider Setup (agent mode only)
| Provider | Setup | Cost |
|---|---|---|
| Ollama (local) | | $0 — fully offline, no data leaves machine |
| Any cloud | Set the provider's env var | Varies |
| OpenClaw users | Auto-detected | No extra setup |
Run clawdcursor doctor to auto-detect and validate providers.
Security
- Network isolation: Binds to 127.0.0.1 only. Verify: netstat -an | findstr 3847 should show 127.0.0.1:3847, never 0.0.0.0:3847.
- Ollama: 100% offline. Screenshots stay in RAM, never leave the machine.
- Cloud providers: Screenshots/text sent only to your configured provider. No telemetry, no analytics, no third-party logging.
- Token auth: All mutating POST endpoints require Authorization: Bearer <token>. Token at ~/.clawdcursor/token.
- Safety tiers: Auto / Preview / Confirm. Agents must never self-approve Confirm actions.
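The network-isolation check can be automated by parsing `netstat -an` output. This sketch assumes the Windows column layout (protocol, local address, foreign address, state); `listening_addresses` and `is_loopback_only` are hypothetical helper names.

```python
from typing import Set


def listening_addresses(netstat_output: str, port: int = 3847) -> Set[str]:
    """Collect local addresses that use the given port from `netstat -an` text."""
    suffix = f":{port}"
    found = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed row layout: Proto, Local Address, Foreign Address, State
        if len(fields) >= 2 and fields[1].endswith(suffix):
            found.add(fields[1])
    return found


def is_loopback_only(netstat_output: str, port: int = 3847) -> bool:
    """True only if the port appears and every binding is on 127.0.0.1."""
    addresses = listening_addresses(netstat_output, port)
    return bool(addresses) and all(a.startswith("127.0.0.1:") for a in addresses)
```

The output text could come from `subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout`.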
Coordinate System
All mouse tools use image-space coordinates from a 1280px-wide viewport, matching screenshots from desktop_screenshot. DPI scaling is handled automatically. Do not pre-scale coordinates.
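If you resize a screenshot before analyzing it, the points you find have to be mapped back to the 1280px-wide image space before they are passed to a mouse tool. This sketch covers that client-side step only; `to_image_space` is a hypothetical helper and it assumes the resize preserved the aspect ratio.

```python
from typing import Tuple

IMAGE_WIDTH = 1280  # width of the image space the mouse tools expect


def to_image_space(
    x: float,
    y: float,
    analyzed_width: int,
    image_width: int = IMAGE_WIDTH,
) -> Tuple[int, int]:
    """Map a point from a resized screenshot copy back to image space."""
    if analyzed_width <= 0:
        raise ValueError("analyzed_width must be positive")
    scale = image_width / analyzed_width
    return round(x * scale), round(y * scale)
```

Coordinates read directly from an unmodified desktop_screenshot need no conversion.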
Safety
| Tier | Actions | Behavior |
|---|---|---|
| 🟢 Auto | Navigation, reading, opening apps | Runs immediately |
| 🟡 Preview | Typing, form filling | Logged |
| 🔴 Confirm | Send, delete, purchase | Pauses — always ask user first |
- Never self-approve Confirm actions.
- Alt+F4 and Ctrl+Alt+Delete are blocked.
- Server binds to 127.0.0.1 only.
- First run requires explicit user consent for desktop control.
Error Recovery
| Problem | Fix |
|---|---|
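| Port 3847 not responding | clawdcursor serve, wait 2s, then GET /health |
| 401 Unauthorized | Token changed: read ~/.clawdcursor/token and use fresh value |
| CDP not available | Chrome must be open. navigate_browser auto-enables it. |
| CDP on wrong tab | cdp_list_tabs() → cdp_switch_tab() |
| focus_window fails | get_windows() to confirm title/processName, then retry |
| smart_click can't find element | read_screen() for coords → mouse_click(x, y) |
| key_press goes to wrong window | You skipped focus_window(): always focus first |
| cdp_read_text returns empty | Canvas app: use ocr_read_screen() instead |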
| Same action fails 3+ times | Try a completely different approach |
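
The 401 recovery can be sketched as a single retry with a freshly read token. `send_with_token_refresh` is a hypothetical helper; `send` stands in for whatever performs the HTTP call and returns a (status, body) pair, and `read_token` rereads ~/.clawdcursor/token.

```python
from typing import Callable, Tuple


def send_with_token_refresh(
    send: Callable[[str], Tuple[int, str]],
    cached_token: str,
    read_token: Callable[[], str],
) -> Tuple[int, str, str]:
    """Call send(token); on 401, reread the token once and retry.

    Returns (status, body, token_in_use) so the caller can update its cache.
    """
    status, body = send(cached_token)
    if status == 401:
        fresh = read_token()
        if fresh != cached_token:  # retry only if the token really changed
            status, body = send(fresh)
            return status, body, fresh
    return status, body, cached_token
```

Retrying only once avoids looping when the 401 has another cause.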
Platform Support
| Platform | A11y | OCR | CDP |
|---|---|---|---|
| Windows (x64/ARM64) | PowerShell + .NET UIA | Windows.Media.Ocr | Chrome/Edge |
| macOS (Intel/Apple Silicon) | JXA + System Events | Apple Vision | Chrome/Edge |
| Linux (x64/ARM64) | AT-SPI | Tesseract | Chrome/Edge |
macOS: Grant Accessibility in System Settings → Privacy → Accessibility.
Linux: Run sudo apt install tesseract-ocr for OCR support.