openclaw-ultra-scraping
install
source · Clone the upstream repo
git clone https://github.com/LeoYeAI/openclaw-ultra-scraping
Claude Code · Install into ~/.claude/skills/
git clone --depth=1 https://github.com/LeoYeAI/openclaw-ultra-scraping ~/.claude/skills/leoyeai-openclaw-ultra-scraping-openclaw-ultra-scraping
OpenClaw · Install into ~/.openclaw/skills/
git clone --depth=1 https://github.com/LeoYeAI/openclaw-ultra-scraping ~/.openclaw/skills/leoyeai-openclaw-ultra-scraping-openclaw-ultra-scraping
manifest · SKILL.md
OpenClaw Ultra Scraping
Adaptive web scraping framework for OpenClaw agents. Handles everything from single-page extraction to full-scale concurrent crawls with anti-bot bypass.
Setup
Run once before first use:

bash scripts/setup.sh

This installs all dependencies and browser engines into /opt/scrapling-venv.
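To confirm the environment is ready before scraping, a quick import check like the one below can be run. This is a minimal sketch that assumes the /opt/scrapling-venv path created by setup.sh; it only verifies that the bundled interpreter can import scrapling.

```python
#!/opt/scrapling-venv/bin/python3
# Sanity check: the venv interpreter exists and scrapling is importable.
# Assumes the install path from the Setup section above.
import scrapling

print("scrapling import OK:", scrapling.__file__)
```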
Quick Start — CLI Script
The bundled scripts/scrape.py provides a unified CLI:

PYTHON=/opt/scrapling-venv/bin/python3

# Simple fetch (JSON output)
$PYTHON scripts/scrape.py fetch "https://example.com" --css ".content"

# Extract text
$PYTHON scripts/scrape.py extract "https://example.com" --css "h1"

# Stealth mode (bypass Cloudflare)
$PYTHON scripts/scrape.py fetch "https://protected-site.com" --stealth --solve-cloudflare --css ".data"

# Dynamic (full browser rendering)
$PYTHON scripts/scrape.py fetch "https://spa-site.com" --dynamic --css ".product"

# Extract links
$PYTHON scripts/scrape.py links "https://example.com" --filter "\.pdf$"

# Multi-page crawl
$PYTHON scripts/scrape.py crawl "https://example.com" --depth 2 --concurrency 10 --css ".item" -o results.json

# Output formats: json, jsonl, csv, text, markdown, html
$PYTHON scripts/scrape.py fetch "https://example.com" -f markdown -o page.md
Quick Start — Python
For complex tasks, write Python directly using the venv:
#!/opt/scrapling-venv/bin/python3
from scrapling.fetchers import Fetcher, StealthyFetcher

# Simple HTTP
page = Fetcher.get('https://example.com', impersonate='chrome')
titles = page.css('h1::text').getall()

# Bypass Cloudflare
page = StealthyFetcher.fetch('https://protected.com', headless=True, solve_cloudflare=True)
data = page.css('.product').getall()
Fetcher Selection Guide
| Scenario | Fetcher | Flag |
|---|---|---|
| Normal sites, fast scraping | Fetcher | (default) |
| JS-rendered SPAs | DynamicFetcher | --dynamic |
| Cloudflare/anti-bot protected | StealthyFetcher | --stealth |
| Cloudflare Turnstile challenge | StealthyFetcher | --stealth --solve-cloudflare |
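The choice can also be made at runtime. The following is a minimal sketch, not part of the skill itself: it reuses the Fetcher/StealthyFetcher calls from the Quick Start, assumes the response object exposes an HTTP status code as .status, and uses a placeholder URL.

```python
#!/opt/scrapling-venv/bin/python3
# Sketch: try a plain HTTP fetch first and escalate to the stealth browser
# only when the site appears to block it. URL and status codes are illustrative.
from scrapling.fetchers import Fetcher, StealthyFetcher

def fetch_with_fallback(url: str):
    page = Fetcher.get(url, impersonate='chrome')
    if page.status in (403, 429, 503):  # common anti-bot responses
        # Escalate: full stealth browser with the Cloudflare solver enabled
        page = StealthyFetcher.fetch(url, headless=True, solve_cloudflare=True)
    return page

page = fetch_with_fallback('https://protected-site.com')
print(page.css('title::text').get())
```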
Selector Cheat Sheet
page.css('.class')                    # CSS
page.css('.class::text').getall()     # Text extraction
page.xpath('//div[@id="main"]')       # XPath
page.find_all('div', class_='item')   # BS4-style
page.find_by_text('keyword')          # Text search
page.css('.item', adaptive=True)      # Adaptive (survives redesigns)
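These selectors compose into structured extraction. A minimal sketch follows, assuming parsel-style ::text and ::attr() pseudo-elements as shown above; the URL and the .item / h2 / a selectors are placeholders for a real site's markup.

```python
#!/opt/scrapling-venv/bin/python3
# Sketch: turn repeated page elements into a list of dicts and dump as JSON.
import json
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com', impersonate='chrome')

records = []
for item in page.css('.item'):
    records.append({
        'title': item.css('h2::text').get(),       # first matching text node
        'link': item.css('a::attr(href)').get(),   # href attribute
    })

print(json.dumps(records, indent=2))
```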
Advanced Features
- Adaptive tracking: pass auto_save=True on the first run, then adaptive=True on later runs; elements are found even after a site redesign (see the sketch after this list)
- Proxy rotation: pass proxy="http://host:port" or use ProxyRotator
- Sessions: FetcherSession, StealthySession, DynamicSession for cookie/state persistence
- Spider framework: Scrapy-like concurrent crawling with pause/resume
- Async support: all fetchers have async variants
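A minimal sketch of the adaptive-tracking flow described above, using only the auto_save/adaptive keyword arguments named in this list; the URL and the .price selector are placeholders.

```python
#!/opt/scrapling-venv/bin/python3
# Sketch: adaptive element tracking. The first run records how the elements
# were matched; a later run can relocate them even if the original selector
# breaks after a redesign.
from scrapling.fetchers import Fetcher

# First run: save the match information
page = Fetcher.get('https://example.com/products', impersonate='chrome')
prices = page.css('.price', auto_save=True)

# Later run (possibly after a redesign): relocate the same elements
page = Fetcher.get('https://example.com/products', impersonate='chrome')
prices = page.css('.price', adaptive=True)
print([p.text for p in prices])
```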
For full API details, read references/api-reference.md.