Skills clawdcursor

install
source · Clone the upstream repo
git clone https://github.com/openclaw/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/amrdab/clawdcursor" ~/.claude/skills/openclaw-skills-clawdcursor-270728 && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/amrdab/clawdcursor" ~/.openclaw/skills/openclaw-skills-clawdcursor-270728 && rm -rf "$T"
manifest: skills/amrdab/clawdcursor/SKILL.md
source content

If a human can do it on a screen, you can too. No API? No integration? No problem.

USE AS A FALLBACK — NOT FIRST CHOICE

Before reaching for any ClawdCursor tool, ask:

  1. Is there a native API? (Gmail API, GitHub API, Slack API) → use the API
  2. Is there a CLI? (
    git
    ,
    npm
    ,
    curl
    ) → use the CLI
  3. Can you edit the file directly? → do that
  4. Is there a browser automation layer? (Playwright, Puppeteer) → use that

None of the above work? Now use ClawdCursor. It's for the last mile.


Modes at a Glance

ModeCommandBrainTools available
serve
clawdcursor serve
You (REST client)All 42 tools via HTTP
mcp
clawdcursor mcp
You (MCP client)All 42 tools via MCP stdio
start
clawdcursor start
Built-in LLM pipelineAll 42 tools + autonomous agent

In

serve
and
mcp
modes: you reason, ClawdCursor acts. There is no built-in LLM. You call tools, interpret results, decide next steps.


Connecting

Option A — REST (
clawdcursor serve
)

clawdcursor serve        # starts on http://127.0.0.1:3847

All POST endpoints require:

Authorization: Bearer <token>
(token saved to
~/.clawdcursor/token
)

GET  /tools              → all tool schemas (OpenAI function-calling format)
POST /execute/{name}     → run a tool: {"param": "value"}
GET  /health             → {"status":"ok","version":"0.7.5"}
GET  /docs               → full documentation

Example:

POST /execute/get_windows     {}
POST /execute/mouse_click     {"x": 640, "y": 400}
POST /execute/type_text       {"text": "hello world"}

If the server isn't running, start it yourself — don't ask the user:

clawdcursor serve
# wait 2 seconds, then verify: GET /health

Option B — MCP (
clawdcursor mcp
)

{
  "mcpServers": {
    "clawdcursor": {
      "command": "clawdcursor",
      "args": ["mcp"]
    }
  }
}

Works with Claude Code, Cursor, Windsurf, Zed, or any MCP-compatible client. All 42 tools are exposed identically.

Option C — Autonomous agent (
clawdcursor start
)

POST /task    {"task": "Open Notepad and write Hello"}   → submit task
GET  /status  → {"status": "acting"} | "idle" | "waiting_confirm"
POST /confirm {"approved": true}                         → approve safety-gated action
POST /abort                                              → stop current task

Use

delegate_to_agent
tool to submit tasks from within MCP/REST sessions. Requires
clawdcursor start
running on port 3847.

Polling pattern:

POST /task  {"task": "...", "returnPartial": true}
→ poll GET /status every 2s:
    "acting"           → still running, keep polling
    "waiting_confirm"  → STOP. Ask user → POST /confirm {"approved": true}
    "idle"             → done, check GET /task-logs for result
→ if 60s+ with no progress: POST /abort, retry with simpler phrasing

returnPartial mode — send

{"returnPartial": true}
with POST /task: ClawdCursor skips Stage 3 (expensive vision) and returns control to you if Stage 2 fails:

{"partial": true, "stepsCompleted": [...], "context": "got stuck on dialog"}

You finish the task with MCP tools, then call POST /learn to save what worked.

POST /learn — adaptive learning: After completing a task with your own tool calls, teach ClawdCursor for next time:

POST /learn
{
  "processName": "EXCEL",
  "task": "create table with headers",
  "actions": [
    {"action": "key", "description": "Ctrl+Home to go to A1"},
    {"action": "type", "description": "Type header name"},
    {"action": "key", "description": "Tab to next column"}
  ],
  "shortcuts": {"next_cell": "Tab", "next_row": "Enter"},
  "tips": ["Use Tab between columns, Enter between rows"]
}

This enriches the app's guide JSON. Stage 2 reads it on the next run — no vision fallback needed.


The Universal Loop

Every GUI task follows the same pattern regardless of transport:

1. ORIENT  →  read_screen() or get_windows()          see what's open and focused
2. ACT     →  smart_click() / smart_type() / key_press()   do the thing
3. VERIFY  →  check return value → window state → text check → screenshot
4. REPEAT  →  until done

Verification (cheapest to most expensive)

  1. Tool return value — every tool reports success/failure. Check it first.
  2. Window state
    get_active_window()
    ,
    get_windows()
    — did a dialog appear? Did the title change?
  3. Text check
    read_screen()
    or
    smart_read()
    — is the expected text visible?
  4. Screenshot
    desktop_screenshot()
    — only when text methods fail. Costs the most.
  5. Negative check — look for error dialogs, wrong window, unchanged screen.

Always verify after: sends, saves, deletes, form submissions. Skip verification for: mid-sequence keystrokes, scrolling.


Tool Decision Trees

Perception — always start here

read_screen()          → FIRST. Accessibility tree: buttons, inputs, text, with coords.
                          Fast, structured, works on native apps.
ocr_read_screen()      → When a11y tree is empty (canvas UIs, image-based apps).
smart_read()           → Combines OCR + a11y. Good first call when unsure.
desktop_screenshot()   → LAST RESORT. Only when you need pixel-level visual detail.
desktop_screenshot_region(x,y,w,h) → Zoomed crop when you need detail in one area.

Clicking

smart_click("Save")              → FIRST. Finds by label/text via OCR + a11y, clicks.
                                   Pass processId to target the right window.
invoke_element(name="Save")      → When you know the exact automation ID from read_screen.
cdp_click(text="Submit")         → Browser elements. Requires cdp_connect() first.
mouse_click(x, y)                → LAST RESORT. Raw coordinates from a screenshot.

Typing

smart_type("Email", "user@x.com")  → FIRST. Finds field by label, focuses, types.
cdp_type(label="Email", text="…")  → Browser inputs. Requires cdp_connect() first.
type_text("hello")                 → Clipboard paste into whatever is focused.
                                     Use after manually focusing with smart_click.

Browser / CDP

1. navigate_browser(url)     → opens URL, auto-enables CDP
2. cdp_connect()             → connect to browser DevTools Protocol
3. cdp_page_context()        → list interactive elements on page
4. cdp_read_text()           → extract DOM text (returns empty on canvas apps → use OCR)
5. cdp_click(text="…")       → click by visible text
6. cdp_type(label, text)     → fill input by label
7. cdp_evaluate(script)      → run JavaScript in page context
8. cdp_scroll(direction, px) → scroll page via DOM (not mouse wheel)
9. cdp_list_tabs()           → list all open tabs
10. cdp_switch_tab(target)   → switch to a specific tab

If CDP isn't connected, switch tabs with keyboard:

key_press("ctrl+1")          → tab 1
key_press("ctrl+tab")        → next tab
key_press("ctrl+shift+tab")  → previous tab

Window Management

get_windows()                         → list all open windows (use to find PIDs)
get_active_window()                   → what's in the foreground right now
focus_window(processName="Discord")   → bring to front (auto-minimizes phantom off-screen windows)
minimize_window(processName="calc")   → minimize a window — 1 call, cross-platform
                                        also accepts: processId, title

Rule: Always

focus_window()
before
key_press()
or
type_text()
. Keystrokes go to whatever has focus — if that's your terminal, not the target app.

Canvas apps (Google Docs, Figma, Notion)

DOM has no readable text. Pattern:

ocr_read_screen()          → read content (DOM extraction fails)
mouse_click(x, y)          → click into the canvas area
type_text("your text")     → clipboard paste works even on canvas

Quick Patterns

Open app and type:

open_app("notepad") → wait(2) → smart_read() → type_text("Hello") → smart_read()

Read a webpage:

navigate_browser(url) → wait(3) → cdp_connect() → cdp_read_text()

Fill a web form:

cdp_connect() → cdp_type("Email", "x@x.com") → cdp_type("Password", "…") → cdp_click("Submit")

Cross-app copy/paste:

focus_window("Chrome") → key_press("ctrl+a") → key_press("ctrl+c")
→ read_clipboard() → focus_window("Notepad") → type_text(clipboard)

Send email via Outlook:

open_app("outlook") → wait(2) → smart_click("New Email")
→ mouse_click(to_field_x, to_field_y) → type_text("recipient@x.com") → key_press("Tab")
→ mouse_click(subject_x, subject_y) → type_text("Subject") → key_press("Tab")
→ mouse_click(body_x, body_y) → type_text("Body text")
→ mouse_click(send_x, send_y)

Autonomous complex task (requires

clawdcursor start
):

delegate_to_agent("Open Gmail, find latest email from Stripe, forward to billing@x.com")
→ poll GET /status every 2s
→ if waiting_confirm: ask user → POST /confirm {"approved": true}
→ if idle: task done

Full Tool Reference (42 tools)

Speed: ⚡ Free/instant · 🔵 Cheap · 🟡 Moderate · 🔴 Vision (expensive)

Perception (6)

ToolWhat it doesWhen
read_screen
A11y tree — buttons, inputs, text, coords⚡ Default first read
smart_read
OCR + a11y combined🔵 When unsure which to use
ocr_read_screen
Raw OCR text with bounding boxes🔵 Canvas UIs, empty a11y trees
desktop_screenshot
Full screen image (1280px wide)⚡ Last resort visual check
desktop_screenshot_region
Zoomed crop of specific area⚡ Fine-grained visual detail
get_screen_size
Screen dimensions and DPI⚡ Coordinate calculations

Mouse (7)

ToolWhat it doesWhen
smart_click
Find element by text/label, click🔵 First choice for clicking
mouse_click
Left click at (x, y)⚡ Last resort
mouse_double_click
Double click at (x, y)⚡ Open files, select words
mouse_right_click
Right click at (x, y)⚡ Context menus
mouse_hover
Move cursor without clicking⚡ Hover menus
mouse_scroll
Scroll at position (physical mouse wheel)⚡ Scroll content
mouse_drag
Drag from start to end — accepts
startX/startY/endX/endY
or
x1/y1/x2/y2
⚡ Resize, select ranges

Keyboard (5)

ToolWhat it doesWhen
smart_type
Find input by label, focus it, type🔵 First choice for form fields
type_text
Clipboard paste into focused element⚡ After manually focusing
key_press
Send key combo (
ctrl+s
,
Return
,
alt+tab
)
⚡ After focus_window
shortcuts_list
List keyboard shortcuts for current app⚡ Before reaching for mouse
shortcuts_execute
Run a named shortcut (fuzzy match)⚡ Save, copy, paste, undo

Window Management (5)

ToolWhat it doesWhen
get_windows
List all open windows with PIDs and bounds⚡ Situational awareness
get_active_window
Current foreground window⚡ Check current focus
get_focused_element
Element with keyboard focus⚡ Debug wrong-field typing
focus_window
Bring window to front (auto-clears off-screen phantoms)⚡ Always before key_press
minimize_window
Minimize by processName, processId, or title⚡ Clear focus stealers

UI Elements (2)

ToolWhat it doesWhen
find_element
Search UI tree by name or type⚡ Find automation IDs
invoke_element
Invoke element by automation ID or name⚡ When ID known from read_screen

Clipboard (2)

ToolWhat it doesWhen
read_clipboard
Read clipboard text⚡ After copy operations
write_clipboard
Write text to clipboard⚡ Before paste operations

Browser / CDP (11)

ToolWhat it doesWhen
cdp_connect
Connect to browser DevTools Protocol⚡ First step for any browser task
cdp_page_context
List interactive elements on page⚡ After connect
cdp_read_text
Extract DOM text⚡ Read page content
cdp_click
Click by CSS selector or visible text⚡ Browser clicks
cdp_type
Type into input by label or selector⚡ Browser form filling
cdp_select_option
Select dropdown option⚡ Select elements
cdp_evaluate
Run JavaScript in page context⚡ Custom queries
cdp_scroll
Scroll page via DOM (
direction
,
amount
px)
⚡ DOM-level scroll
cdp_wait_for_selector
Wait for element to appear⚡ After navigation/AJAX
cdp_list_tabs
List all browser tabs⚡ When on wrong tab
cdp_switch_tab
Switch to a tab by title or index⚡ After cdp_list_tabs

Orchestration (4)

ToolWhat it doesWhen
open_app
Launch application by name⚡ First step for desktop tasks
navigate_browser
Open URL (auto-enables CDP)⚡ First step for browser tasks
wait
Pause N seconds⚡ After opening apps, let UI render
delegate_to_agent
Send task to built-in autonomous agent🟡 Complex multi-step tasks (requires
clawdcursor start
)

Provider Setup (agent mode only)

ProviderSetupCost
Ollama (local)
ollama pull qwen2.5:7b && ollama serve
$0 — fully offline, no data leaves machine
Any cloudSet env var:
ANTHROPIC_API_KEY
,
OPENAI_API_KEY
,
GEMINI_API_KEY
,
MOONSHOT_API_KEY
, etc.
Varies
OpenClaw usersAuto-detected from
~/.openclaw/agents/main/auth-profiles.json
No extra setup

Run

clawdcursor doctor
to auto-detect and validate providers.


Security

  • Network isolation: Binds to
    127.0.0.1
    only. Verify:
    netstat -an | findstr 3847
    — should show
    127.0.0.1:3847
    , never
    0.0.0.0:3847
  • Ollama: 100% offline. Screenshots stay in RAM, never leave the machine.
  • Cloud providers: Screenshots/text sent only to your configured provider. No telemetry, no analytics, no third-party logging.
  • Token auth: All mutating POST endpoints require
    Authorization: Bearer <token>
    . Token at
    ~/.clawdcursor/token
    .
  • Safety tiers: Auto / Preview / Confirm. Agents must never self-approve Confirm actions.

Coordinate System

All mouse tools use image-space coordinates from a 1280px-wide viewport — matching screenshots from

desktop_screenshot
. DPI scaling is handled automatically. Do not pre-scale coordinates.


Safety

TierActionsBehavior
🟢 AutoNavigation, reading, opening appsRuns immediately
🟡 PreviewTyping, form fillingLogged
🔴 ConfirmSend, delete, purchasePauses — always ask user first
  • Never self-approve Confirm actions.
  • Alt+F4
    and
    Ctrl+Alt+Delete
    are blocked.
  • Server binds to
    127.0.0.1
    only.
  • First run requires explicit user consent for desktop control.

Error Recovery

ProblemFix
Port 3847 not responding
clawdcursor serve
— wait 2s —
GET /health
401 UnauthorizedToken changed — read
~/.clawdcursor/token
and use fresh value
CDP not availableChrome must be open.
navigate_browser(url)
auto-enables it.
CDP on wrong tab
cdp_list_tabs()
cdp_switch_tab(target)
focus_window
fails
get_windows()
to confirm title/processName, then retry
smart_click
can't find element
read_screen()
for coords →
mouse_click(x, y)
key_press
goes to wrong window
You skipped
focus_window
— always focus first
cdp_read_text
returns empty
Canvas app — use
ocr_read_screen()
instead
Same action fails 3+ timesTry a completely different approach

Platform Support

PlatformA11yOCRCDP
Windows (x64/ARM64)PowerShell + .NET UIAWindows.Media.OcrChrome/Edge
macOS (Intel/Apple Silicon)JXA + System EventsApple VisionChrome/Edge
Linux (x64/ARM64)AT-SPITesseractChrome/Edge

macOS: Grant Accessibility in System Settings → Privacy → Accessibility. Linux:

sudo apt install tesseract-ocr
for OCR support.