# gui-learn
Learn a new app's UI — detect all components, identify and filter them, and save them to visual memory. Run this before operating any app not yet in memory.
```shell
# Browse the full skills repo:
git clone https://github.com/openclaw/skills
# Or install just this skill:
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/alfredjamesli/gui-claw/skills/gui-learn" ~/.claude/skills/clawdbot-skills-gui-learn && rm -rf "$T"
```
## Learn — Build Visual Memory
Learning detects and saves all UI components for future template matching. Run this whenever:
- App not in `memory/apps/<appname>/` → full learn
- New page/state in known app → `learn --page <pagename>`
- Match rate < 80% on eval → incremental re-learn
## Command

```shell
python3 scripts/agent.py learn --app AppName
```
## What Happens
1. Activate the app, ensure the window is ≥ 800x600
2. `agent.py` runs learn:
   a. Takes a full-screen screenshot, crops the window region
   b. Runs Salesforce/GPA-GUI-Detector → icons, buttons, UI elements
   c. Merges results with IoU dedup
   d. Crops each element → saves to `memory/apps/<appname>/components/`
   e. Reports unlabeled icons
3. YOU identify all components:
   a. Use the `image` tool to view each cropped image (**one at a time** for accuracy)
   b. For each: read the text, describe the icon, determine its actual name
   c. Only label GENERIC UI components (buttons, icons, tabs, nav)
   d. DELETE temporary/dynamic content (to prevent storage bloat)
   e. Verify `_find_nearest_text` names (often wrong in dense UIs)
   f. Rename: `app_memory.py rename --old X --new Y`
4. After identification + task complete:
   a. Run: `agent.py cleanup --app AppName`
   b. Remove dynamic content (timestamps, message previews)
   c. Keep all stable UI elements (no privacy filtering needed — data stays local)
5. Result: ~20-30 named, fixed UI components per page
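The IoU dedup in step 2c can be sketched as follows. This is a minimal illustration, not the skill's actual implementation; the `(x1, y1, x2, y2)` box format and the 0.5 overlap threshold are assumptions:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def dedup(boxes, threshold=0.5):
    """Keep each box only if it does not overlap an already-kept box above threshold."""
    kept = []
    for box in boxes:
        if all(iou(box, k) < threshold for k in kept):
            kept.append(box)
    return kept
```

When two detectors report nearly the same element, only the first box survives; disjoint boxes are all kept.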
**Important:** Components are cropped from full-screen screenshots so they match perfectly when doing full-screen template matching later. This is why `capture_window` uses a full-screen screenshot + crop (via `gui_action.py screenshot`). `_find_nearest_text` is a hint, not truth — always verify by viewing the cropped image.
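The payoff of full-screen crops: a later template match on a full-screen screenshot yields coordinates that are already in screen space, so the click target needs no window-offset translation. A tiny sketch (function name is an assumption, not the skill's API):

```python
def click_point_from_match(match_x, match_y, tmpl_w, tmpl_h):
    """Template matched at (match_x, match_y) on a full-screen screenshot:
    the click target is the template's center, already in screen coordinates."""
    return (match_x + tmpl_w // 2, match_y + tmpl_h // 2)
```

Had the component been cropped from a window-relative image, every match would first need the window origin added back, which breaks as soon as the window moves between sessions.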
## Component Filtering
Only save stable UI elements — things that look the same next session:
SAVE (stable):
- Sidebar elements (left ~15% of window)
- Toolbar elements (top ~12%)
- Footer elements (bottom ~12%)
- Any element with OCR text label
SKIP (dynamic):
- Tiny elements (< 25×25 pixels)
- Content area icons without labels
- Temporary content that changes every session
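The SAVE/SKIP rules above can be collapsed into a single predicate. A minimal sketch using the region percentages from this doc; the function name, coordinate convention (element top-left relative to the window), and box format are assumptions:

```python
def should_save(x, y, w, h, win_w, win_h, has_label):
    """Return True if a detected element is stable enough to save."""
    if w < 25 or h < 25:                 # tiny elements are noise
        return False
    if has_label:                        # any element with an OCR text label is kept
        return True
    in_sidebar = x + w <= win_w * 0.15   # left ~15% of window
    in_toolbar = y + h <= win_h * 0.12   # top ~12%
    in_footer = y >= win_h * 0.88        # bottom ~12%
    # unlabeled content-area icons fall through to False
    return in_sidebar or in_toolbar or in_footer
```

Labeled elements are kept anywhere; unlabeled ones survive only in the stable chrome regions.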
Naming:
- Has OCR label → label as filename (`Settings.png`, `Search.png`)
- No label + stable region → `unlabeled_<region>_<x>_<y>.png`
- No label + content area → SKIP
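The naming rules map directly to a small helper. A sketch only; the function name and the `None`-means-skip convention are assumptions:

```python
def component_filename(label, region, x, y):
    """Pick a filename per the naming rules; None means skip the element."""
    if label:
        return f"{label}.png"                      # OCR label wins
    if region in ("sidebar", "toolbar", "footer"):
        return f"unlabeled_{region}_{x}_{y}.png"   # stable region, pending identification
    return None                                    # unlabeled content-area element: skip
```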
## What to KEEP vs REMOVE
Golden rule: only save things that look the same next time you open the app.
KEEP: sidebar nav icons, toolbar buttons, input controls, window controls, tab headers, fixed logos
REMOVE: chat messages, timestamps, user avatars in lists, notification badges, contact names, web content, text >15 chars in content area, profile pictures
Quick test: "Same place, same appearance tomorrow?" → KEEP. Otherwise → REMOVE.
## Post-Learn Checklist
- No `unlabeled_*` files remain
- No timestamps, message previews, or chat content
- Each filename describes what it IS
- No duplicates
- ~20-30 components per page
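The first and third checklist items can be verified mechanically. A minimal sketch, assuming components live as `.png` files in one directory (the skill itself may organize memory differently):

```python
import pathlib

def post_learn_issues(components_dir):
    """Flag leftover unlabeled files and duplicate names (case-insensitive)."""
    issues, seen = [], {}
    for p in sorted(pathlib.Path(components_dir).glob("*.png")):
        if p.name.startswith("unlabeled_"):
            issues.append(f"unlabeled file remains: {p.name}")
        key = p.stem.lower()
        if key in seen:
            issues.append(f"possible duplicate: {p.name} vs {seen[key]}")
        else:
            seen[key] = p.name
    return issues
```

An empty return list means those two checks pass; the remaining items (descriptive names, no chat content, ~20-30 per page) still need a human look.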
## Ensure App Ready (Eval Logic)
```
Task arrives → ensure_app_ready(app, workflow)
│
├── Never learned?          → full learn
├── Known app, new page?    → learn --page <name>
└── Known app, known page   → template match:
    ├── ≥ 80% → proceed
    └── < 80% → incremental learn
```
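The decision tree above reduces to a few lines. A sketch only — the boolean inputs and returned action names are assumptions, not the skill's real signature:

```python
def ensure_app_ready(app_known, page_known, match_rate):
    """Map the eval-logic decision tree to an action string."""
    if not app_known:
        return "full_learn"            # never learned this app
    if not page_known:
        return "learn_page"            # known app, new page/state
    # known app, known page: gate on the template-match rate
    return "proceed" if match_rate >= 0.80 else "incremental_learn"
```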
For memory rules, naming, dedup, privacy, and browser per-site memory → see `skills/gui-memory/SKILL.md`.