Skills gui-learn

Learn a new app's UI — detect all components, identify, filter, save to visual memory. Run before operating any app not yet in memory.

install

source · Clone the upstream repo

git clone https://github.com/openclaw/skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/alfredjamesli/gui-claw/skills/gui-learn" ~/.claude/skills/clawdbot-skills-gui-learn && rm -rf "$T"

manifest: skills/alfredjamesli/gui-claw/skills/gui-learn/SKILL.md

source content

Learn — Build Visual Memory

Learning detects and saves all UI components for future template matching. Run this whenever:

App not in `memory/apps/<appname>/` → full learn
New page/state in known app → `learn --page <pagename>`
Match rate < 80% on eval → incremental re-learn

Command

python3 scripts/agent.py learn --app AppName

What Happens

1. Activate app, ensure window ≥ 800x600
2. agent.py runs learn:
   a. Takes full-screen screenshot, crops window region
   b. Runs Salesforce/GPA-GUI-Detector → icons, buttons, UI elements
   c. Merges results with IoU dedup
   d. Crops each element → saves to memory/apps/<appname>/components/
   e. Reports unlabeled icons
3. YOU identify all components:
   a. Use `image` tool to view each cropped image (**one at a time** for accuracy)
   b. For each: read text, describe icon, determine actual name
   c. Only label GENERIC UI components (buttons, icons, tabs, nav)
   d. DELETE temporary/dynamic content (to prevent storage bloat)
   e. Verify _find_nearest_text names (often wrong in dense UIs)
   f. Rename: app_memory.py rename --old X --new Y
4. After identification + task complete:
   a. Run: agent.py cleanup --app AppName
   b. Remove dynamic content (timestamps, message previews)
   c. Keep all stable UI elements (no privacy filtering needed — data stays local)
5. Result: ~20-30 named, fixed UI components per page
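
Step 2c merges detections with IoU dedup. The following is a minimal sketch of what such a merge could look like, not the code in agent.py. The box format `(x1, y1, x2, y2)`, the per-detection score, the 0.5 threshold, and the function names `iou` and `dedup` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in screen pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dedup(dets: List[Tuple[Box, float]], threshold: float = 0.5) -> List[Tuple[Box, float]]:
    """Keep the highest-scoring box among boxes that overlap at or above threshold."""
    kept: List[Tuple[Box, float]] = []
    # Visit detections from most to least confident (assumed score field).
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) < threshold for k in kept):
            kept.append((box, score))
    return kept
```

With this greedy scheme, two detections of the same button collapse to the more confident one, while boxes that do not overlap are all kept.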

Important: Components are cropped from full-screen screenshots so they match perfectly when doing full-screen template matching later. This is why `capture_window` uses full-screen screenshot + crop (via gui_action.py screenshot).

`_find_nearest_text` is a hint, not truth: always verify by viewing the cropped image.

Component Filtering

Only save stable UI elements — things that look the same next session:

SAVE (stable):

Sidebar elements (left ~15% of window)
Toolbar elements (top ~12%)
Footer elements (bottom ~12%)
Any element with OCR text label

SKIP (dynamic):

Tiny elements (< 25×25 pixels)
Content area icons without labels
Temporary content that changes every session

Naming:

Has OCR label → label as filename (`Search.png`, `Settings.png`)
No label + stable region → `unlabeled_<region>_<x>_<y>.png`
No label + content area → SKIP
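
The filtering and naming rules above can be sketched as a single function. This is an illustrative reading, not the skill's actual implementation: the helper names are hypothetical, and several details are assumptions the text does not settle, namely that regions are decided by the element's center point, that the sidebar test takes precedence over toolbar and footer, that "tiny" means either side is under 25 pixels, and that the size check runs before the OCR-label check.

```python
from typing import Optional


def classify_region(cx: float, cy: float, win_w: int, win_h: int) -> str:
    """Map an element's center point to a coarse window region."""
    if cx < 0.15 * win_w:   # left ~15% of window
        return "sidebar"
    if cy < 0.12 * win_h:   # top ~12%
        return "toolbar"
    if cy > 0.88 * win_h:   # bottom ~12%
        return "footer"
    return "content"


def component_filename(x: int, y: int, w: int, h: int,
                       win_w: int, win_h: int,
                       ocr_label: Optional[str] = None) -> Optional[str]:
    """Return the filename to save a component under, or None to skip it."""
    if w < 25 or h < 25:
        return None                      # tiny element: skip (assumed to win over a label)
    if ocr_label:
        return f"{ocr_label}.png"        # labeled: the label is the filename
    region = classify_region(x + w / 2, y + h / 2, win_w, win_h)
    if region == "content":
        return None                      # unlabeled content-area icon: skip
    return f"unlabeled_{region}_{x}_{y}.png"
```

A real version would also need to sanitize OCR labels before using them as filenames, which this sketch leaves out.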

What to KEEP vs REMOVE

Golden rule: only save things that look the same next time you open the app.

KEEP: sidebar nav icons, toolbar buttons, input controls, window controls, tab headers, fixed logos

REMOVE: chat messages, timestamps, user avatars in lists, notification badges, contact names, web content, text >15 chars in content area, profile pictures

Quick test: "Same place, same appearance tomorrow?" → KEEP. Otherwise → REMOVE.

Post-Learn Checklist

No `unlabeled_` files remain
No timestamps, message previews, or chat content
Each filename describes what it IS
No duplicates
~20-30 components per page
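
Parts of this checklist can be checked mechanically. The sketch below is a hypothetical helper, not part of the skill's scripts. It assumes one flat directory of `.png` files per page, and it only catches byte-identical duplicates. Spotting dynamic content, near-duplicates, or filenames that do not describe what the component is still requires viewing the images.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def check_components(components_dir: str, min_count: int = 20,
                     max_count: int = 30) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    files = sorted(Path(components_dir).glob("*.png"))
    problems: List[str] = []

    # Checklist item: no unlabeled_ files remain.
    unlabeled = [f.name for f in files if f.name.startswith("unlabeled_")]
    if unlabeled:
        problems.append(f"unlabeled files remain: {unlabeled}")

    # Checklist item: no duplicates (exact byte matches only).
    seen: Dict[str, str] = {}
    for f in files:
        digest = hashlib.sha256(f.read_bytes()).hexdigest()
        if digest in seen:
            problems.append(f"duplicate image: {f.name} == {seen[digest]}")
        else:
            seen[digest] = f.name

    # Checklist item: roughly 20-30 components per page.
    if not min_count <= len(files) <= max_count:
        problems.append(
            f"{len(files)} components, expected roughly {min_count}-{max_count}")
    return problems
```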

Ensure App Ready (Eval Logic)

Task arrives → ensure_app_ready(app, workflow)
  │
  ├── Never learned? → full learn
  ├── Known app, new page? → learn --page <name>
  └── Known app, known page → template match:
        ├── ≥ 80% → proceed
        └── < 80% → incremental learn
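
The decision tree above can be written as a small pure function. This is a sketch under assumptions: `decide_action` and its return strings are hypothetical names, and how `ensure_app_ready` actually determines whether an app or page is known, or computes the match rate, is not shown in this document.

```python
def decide_action(app_known: bool, page_known: bool,
                  match_rate: float = 0.0, threshold: float = 0.80) -> str:
    """Pick the learn action for a task, following the tree above."""
    if not app_known:
        return "full_learn"          # never learned
    if not page_known:
        return "learn_page"          # known app, new page
    # Known app and page: the template match rate decides.
    return "proceed" if match_rate >= threshold else "incremental_learn"
```

For example, `decide_action(True, True, match_rate=0.75)` falls below the 80% threshold and selects the incremental learn branch.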

For memory rules, naming, dedup, privacy, and browser per-site memory → see `skills/gui-memory/SKILL.md`