install
source · Clone the upstream repo
git clone https://github.com/openclaw/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/alfredjamesli/gui-claw/skills/gui-observe" ~/.claude/skills/openclaw-skills-gui-observe && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/alfredjamesli/gui-claw/skills/gui-observe" ~/.openclaw/skills/openclaw-skills-gui-observe && rm -rf "$T"
manifest:
skills/alfredjamesli/gui-claw/skills/gui-observe/SKILL.md
Observe — Know Before You Act
Three Visual Methods (see main SKILL.md for full details)
| Method | Returns | Coordinates? |
|---|---|---|
| OCR (`detect_text`) | Text + bounding box | ✅ YES |
| GPA-GUI-Detector (`detect_icons`) | UI components + bounding box | ✅ YES (no labels) |
| image tool | Semantic understanding | ⛔ NEVER |
Phase 1: First encounter / unfamiliar page (DEFAULT)
- Take screenshot
- Run OCR (`detect_text`) → read all text + get coordinates
- Send screenshot to image tool → understand layout and semantics (⛔ no coordinates from this)
- Run GPA-GUI-Detector (`detect_icons`) → detect all UI components + coordinates
- Combine all three to understand current state (sketched below)
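A minimal sketch of the Phase 1 flow. The `ImageContext` import path and the `el["cx"]`/`el["cy"]` fields come from the coordinate-system section below; assuming `detect_text` and `detect_icons` are importable from the same `scripts.ui_detector` module and take a screenshot path — their exact signatures are not shown in this document:

```python
# Hypothetical sketch of Phase 1 (full first-encounter observation).
# Assumes detect_text / detect_icons accept an image path and return
# lists of dicts with image-pixel centers "cx"/"cy" — unverified.
from scripts.ui_detector import ImageContext, detect_text, detect_icons

def observe_full(screenshot_path: str) -> dict:
    ctx = ImageContext.mac_fullscreen()       # pick the factory matching your setup
    texts = detect_text(screenshot_path)      # OCR: text + bounding boxes (image pixels)
    icons = detect_icons(screenshot_path)     # GPA-GUI-Detector: components (image pixels)
    # The screenshot itself goes to the image tool separately for semantics;
    # never take coordinates from that step.
    for el in texts + icons:
        el["click"] = ctx.image_to_click(el["cx"], el["cy"])  # for gui_action.py clicks
    return {"texts": texts, "icons": icons}
```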
Phase 2: Familiar page (OPTIMIZATION)
- Take screenshot (but don't send to image tool)
- Run OCR + GPA-GUI-Detector → get text + coordinates as structured text
- LLM reads text output directly → decide without visual analysis
- If uncertain → fall back to Phase 1 (see the sketch below)
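A corresponding sketch of the Phase 2 fast path, reusing the hypothetical `observe_full` from the Phase 1 sketch; the `text`/`label` field names are assumptions about the detection output:

```python
def observe_fast(screenshot_path: str) -> str:
    """Familiar page: skip the image tool, hand the LLM structured text."""
    obs = observe_full(screenshot_path)   # OCR + detector only; no image-tool call
    lines = [
        f'{el.get("text", el.get("label", "?"))} @ {el["click"]}'
        for el in obs["texts"] + obs["icons"]
    ]
    return "\n".join(lines)  # LLM decides from this; if uncertain, rerun Phase 1
```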
For known apps with saved memory
Use template matching instead of full detection:
- `_detect_visible_components()` → which saved components are on screen
- `identify_state_by_components()` → which known state matches
- If state is known → proceed with `click_component` (no GPA-GUI-Detector needed)
- If state is unknown → Phase 1 (full observation; see the sketch below)
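A hedged sketch of that decision. Only the two `app_memory` functions and their `(state, f1)` return shape are confirmed by the State Detection snippet at the end of this document; `click_component` is named but its signature is not shown, so it stays abstract here:

```python
from app_memory import identify_state_by_components, _detect_visible_components

def route_known_app(app_name: str):
    visible = _detect_visible_components(app_name)   # template matching, no detector run
    state, f1 = identify_state_by_components(app_name, visible)
    if state:                 # known state → act via click_component and saved coords
        return state, visible
    return None, visible      # unknown state → Phase 1 (full observation)
```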
Coordinate System — ImageContext
detect_all() returns image pixel coordinates (raw detection output).
Callers create an ImageContext to convert to screen click coordinates.
Cropping uses image pixel coords directly — no conversion needed.
ImageContext (in `ui_detector.py`)

```python
from scripts.ui_detector import ImageContext

ctx = ImageContext.mac_fullscreen()    # Mac screencapture fullscreen
ctx = ImageContext.mac_window(wx, wy)  # Mac window screenshot (win pos in click-space)
ctx = ImageContext.remote()            # VM / remote / downloaded image (1:1)

# Image pixels → screen click coords
click_x, click_y = ctx.image_to_click(el["cx"], el["cy"])

# Screen click coords → image pixels (for cropping)
px_x, px_y = ctx.click_to_image(click_x, click_y)
```
How it works
ImageContext knows two things:
- `pixel_scale` — image pixels per click-space unit (from `backingScaleFactor`: Retina=2.0, else 1.0)
- `origin` — image top-left in screen click-space (fullscreen=(0,0), window=(win_x, win_y))
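Those two fields imply a simple affine conversion. A sketch of what `image_to_click` / `click_to_image` plausibly compute, inferred from the definitions above rather than taken from the real `ui_detector.py`:

```python
from dataclasses import dataclass

@dataclass
class ImageContextSketch:
    pixel_scale: float           # image pixels per click-space unit
    origin: tuple[float, float]  # image top-left in screen click-space

    def image_to_click(self, px: float, py: float) -> tuple[float, float]:
        ox, oy = self.origin
        return ox + px / self.pixel_scale, oy + py / self.pixel_scale

    def click_to_image(self, cx: float, cy: float) -> tuple[float, float]:
        ox, oy = self.origin
        return (cx - ox) * self.pixel_scale, (cy - oy) * self.pixel_scale

# Retina fullscreen: a detection at image pixel (1600, 900) clicks at (800.0, 450.0)
ctx = ImageContextSketch(pixel_scale=2.0, origin=(0.0, 0.0))
assert ctx.image_to_click(1600, 900) == (800.0, 450.0)
```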
| Source | Coordinates |
|---|---|
| detect_all output | image pixels |
| detect_icons / detect_text | image pixels |
| cv2 image crop | image pixels |
| gui_action.py click | click-space (convert with ImageContext) |
| template_match raw | image pixels |
- Mac Retina: pixel_scale=2.0 (e.g., 3024×1964 image, 1512×982 click-space)
- Mac non-Retina: pixel_scale=1.0
- Remote VMs: pixel_scale=1.0, origin=(0,0)
- Templates: saved in image pixel coordinates
State Detection
States are identified by which components are visible (F1 score matching):
```python
from app_memory import identify_state_by_components, _detect_visible_components

visible = _detect_visible_components(app_name)
state, f1 = identify_state_by_components(app_name, visible)
```
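For intuition, F1 here presumably compares the set of visible components against each saved state's component set. A self-contained sketch of that matching — set-based F1 is the standard formula, but treating components as name sets is an assumption about the stored format:

```python
def f1_match(visible: set[str], expected: set[str]) -> float:
    """F1 between components seen on screen and a saved state's components."""
    if not visible or not expected:
        return 0.0
    tp = len(visible & expected)   # components both seen and expected
    if tp == 0:
        return 0.0
    precision = tp / len(visible)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)

# The best-scoring saved state wins:
states = {"login": {"user_field", "pass_field", "login_btn"},
          "home":  {"search_bar", "avatar", "feed"}}
seen = {"search_bar", "avatar", "notifications"}
best = max(states, key=lambda s: f1_match(seen, states[s]))  # → "home"
```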