openclaw-ultra-scraping
install
source · Clone the upstream repo
git clone https://github.com/LeoYeAI/openclaw-ultra-scraping
Claude Code · Install into ~/.claude/skills/
git clone --depth=1 https://github.com/LeoYeAI/openclaw-ultra-scraping ~/.claude/skills/leoyeai-openclaw-ultra-scraping-openclaw-ultra-scraping
OpenClaw · Install into ~/.openclaw/skills/
git clone --depth=1 https://github.com/LeoYeAI/openclaw-ultra-scraping ~/.openclaw/skills/leoyeai-openclaw-ultra-scraping-openclaw-ultra-scraping
manifest · SKILL.md
OpenClaw Ultra Scraping
Adaptive web scraping framework for OpenClaw agents. Handles everything from single-page extraction to full-scale concurrent crawls with anti-bot bypass.
Setup
Run once before first use:

bash scripts/setup.sh

This installs all dependencies and browser engines into /opt/scrapling-venv.
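To confirm the environment is ready before scraping, a quick import check like the one below can be run. This is a minimal sketch that assumes the /opt/scrapling-venv path created by setup.sh; it only verifies that the bundled interpreter can import scrapling.

```python
#!/opt/scrapling-venv/bin/python3
# Sanity check: the venv interpreter exists and scrapling is importable.
# Assumes the install path from the Setup section above.
import scrapling

print("scrapling import OK:", scrapling.__file__)
```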
Quick Start — CLI Script
The bundled scripts/scrape.py provides a unified CLI:

PYTHON=/opt/scrapling-venv/bin/python3

# Simple fetch (JSON output)
$PYTHON scripts/scrape.py fetch "https://example.com" --css ".content"

# Extract text
$PYTHON scripts/scrape.py extract "https://example.com" --css "h1"

# Stealth mode (bypass Cloudflare)
$PYTHON scripts/scrape.py fetch "https://protected-site.com" --stealth --solve-cloudflare --css ".data"

# Dynamic (full browser rendering)
$PYTHON scripts/scrape.py fetch "https://spa-site.com" --dynamic --css ".product"

# Extract links
$PYTHON scripts/scrape.py links "https://example.com" --filter "\.pdf$"

# Multi-page crawl
$PYTHON scripts/scrape.py crawl "https://example.com" --depth 2 --concurrency 10 --css ".item" -o results.json

# Output formats: json, jsonl, csv, text, markdown, html
$PYTHON scripts/scrape.py fetch "https://example.com" -f markdown -o page.md
Quick Start — Python
For complex tasks, write Python directly using the venv:
#!/opt/scrapling-venv/bin/python3
from scrapling.fetchers import Fetcher, StealthyFetcher

# Simple HTTP
page = Fetcher.get('https://example.com', impersonate='chrome')
titles = page.css('h1::text').getall()

# Bypass Cloudflare
page = StealthyFetcher.fetch('https://protected.com', headless=True, solve_cloudflare=True)
data = page.css('.product').getall()
Fetcher Selection Guide
| Scenario | Fetcher | Flag |
|---|---|---|
| Normal sites, fast scraping | Fetcher | (default) |
| JS-rendered SPAs | DynamicFetcher | --dynamic |
| Cloudflare/anti-bot protected | StealthyFetcher | --stealth |
| Cloudflare Turnstile challenge | StealthyFetcher | --stealth --solve-cloudflare |
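The choice can also be made at runtime. The following is a minimal sketch, not part of the skill itself: it reuses the Fetcher/StealthyFetcher calls from the Quick Start, assumes the response object exposes an HTTP status code as .status, and uses a placeholder URL.

```python
#!/opt/scrapling-venv/bin/python3
# Sketch: try a plain HTTP fetch first and escalate to the stealth browser
# only when the site appears to block it. URL and status codes are illustrative.
from scrapling.fetchers import Fetcher, StealthyFetcher

def fetch_with_fallback(url: str):
    page = Fetcher.get(url, impersonate='chrome')
    if page.status in (403, 429, 503):  # common anti-bot responses
        # Escalate: full stealth browser with the Cloudflare solver enabled
        page = StealthyFetcher.fetch(url, headless=True, solve_cloudflare=True)
    return page

page = fetch_with_fallback('https://protected-site.com')
print(page.css('title::text').get())
```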
Selector Cheat Sheet
page.css('.class')                    # CSS
page.css('.class::text').getall()     # Text extraction
page.xpath('//div[@id="main"]')       # XPath
page.find_all('div', class_='item')   # BS4-style
page.find_by_text('keyword')          # Text search
page.css('.item', adaptive=True)      # Adaptive (survives redesigns)
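These selectors compose into structured extraction. A minimal sketch follows, assuming parsel-style ::text and ::attr() pseudo-elements as shown above; the URL and the .item / h2 / a selectors are placeholders for a real site's markup.

```python
#!/opt/scrapling-venv/bin/python3
# Sketch: turn repeated page elements into a list of dicts and dump as JSON.
import json
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com', impersonate='chrome')

records = []
for item in page.css('.item'):
    records.append({
        'title': item.css('h2::text').get(),       # first matching text node
        'link': item.css('a::attr(href)').get(),   # href attribute
    })

print(json.dumps(records, indent=2))
```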
Advanced Features
- Adaptive tracking: pass auto_save=True on the first run, then adaptive=True on later runs; elements are found even after a site redesign (see the sketch after this list)
- Proxy rotation: pass proxy="http://host:port" or use ProxyRotator
- Sessions: FetcherSession, StealthySession, DynamicSession for cookie/state persistence
- Spider framework: Scrapy-like concurrent crawling with pause/resume
- Async support: all fetchers have async variants
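A minimal sketch of the adaptive-tracking flow described above, using only the auto_save/adaptive keyword arguments named in this list; the URL and the .price selector are placeholders.

```python
#!/opt/scrapling-venv/bin/python3
# Sketch: adaptive element tracking. The first run records how the elements
# were matched; a later run can relocate them even if the original selector
# breaks after a redesign.
from scrapling.fetchers import Fetcher

# First run: save the match information
page = Fetcher.get('https://example.com/products', impersonate='chrome')
prices = page.css('.price', auto_save=True)

# Later run (possibly after a redesign): relocate the same elements
page = Fetcher.get('https://example.com/products', impersonate='chrome')
prices = page.css('.price', adaptive=True)
print([p.text for p in prices])
```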
For full API details, read references/api-reference.md.