# claude-code-minoan scrapling
Scrape pages locally with anti-bot bypass, TLS impersonation, and adaptive element tracking — no API keys, no cloud. Handles Cloudflare protection, CSS/XPath element extraction, and survives site redesigns. Complements firecrawl (cloud) with 100% local execution. Triggers on Cloudflare bypass, anti-bot scraping, stealth fetch, local scraping, Scrapling.
```bash
git clone https://github.com/tdimino/claude-code-minoan
```

Or install directly into `~/.claude/skills` with a one-liner:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/tdimino/claude-code-minoan "$T" && \
  mkdir -p ~/.claude/skills && \
  cp -r "$T/skills/research/scrapling" ~/.claude/skills/tdimino-claude-code-minoan-scrapling && \
  rm -rf "$T"
```
`skills/research/scrapling/SKILL.md`

# Scrapling -- Local Stealth Web Scraping
100% local Python library (BSD-3, D4Vinci/Scrapling). No API keys, no cloud dependencies. Built-in Cloudflare solver, TLS impersonation, and adaptive element tracking.
## When to Use Scrapling vs Firecrawl

| Need | Use | Why |
|---|---|---|
| Clean markdown from a URL | Firecrawl | Optimized for LLM markdown conversion |
| Bypass Cloudflare/anti-bot | Scrapling stealth fetch | Built-in Turnstile solver, Patchright stealth |
| Extract specific elements | Scrapling with CSS/XPath selectors | Element-level precision, adaptive tracking |
| No API key available | Scrapling | 100% local, zero credentials |
| Batch cloud scraping | Firecrawl | Cloud infrastructure, parallel processing |
| Site redesign resilience | Scrapling adaptive mode | SQLite-backed similarity matching |
| Full-site concurrent crawl | Scrapling Spider framework | Scrapy-like with pause/resume |
| Web search + scrape | Firecrawl | Combined search + extraction |
## Installation

Run once to set up Scrapling with all features:

```bash
~/.claude/skills/scrapling/scripts/scrapling_install.sh
```

Installs `scrapling[all]` via uv, downloads Chromium + system dependencies, and verifies all fetchers load.
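To double-check the install beyond the script's own verification, a minimal sanity check (a sketch; assumes network access to example.com):

```python
# Confirm all three fetchers import and a basic TLS-impersonated fetch succeeds
from scrapling.fetchers import Fetcher, DynamicFetcher, StealthyFetcher

page = Fetcher.get('https://example.com', impersonate='chrome')
print(page.status)  # expect 200
```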
## Quick Start -- Stdout Wrapper

The wrapper uses Scrapling's Python API directly (faster than the CLI, and avoids curl_cffi cert issues) and writes to stdout for piping into `filter_web_results.py`:

```bash
# Basic HTTP fetch (fastest, TLS impersonation)
python3 ~/.claude/skills/scrapling/scripts/scrapling_fetch.py https://example.com

# Stealth mode (Patchright, anti-bot bypass)
python3 ~/.claude/skills/scrapling/scripts/scrapling_fetch.py https://protected.site --stealth

# Stealth + Cloudflare solver
python3 ~/.claude/skills/scrapling/scripts/scrapling_fetch.py https://cf-protected.site --stealth --solve-cloudflare

# Dynamic (Playwright Chromium, JS rendering)
python3 ~/.claude/skills/scrapling/scripts/scrapling_fetch.py https://js-heavy.site --dynamic

# With CSS selector for targeted extraction
python3 ~/.claude/skills/scrapling/scripts/scrapling_fetch.py https://example.com --css ".product-list"

# Pipe through firecrawl's filter for token efficiency
python3 ~/.claude/skills/scrapling/scripts/scrapling_fetch.py https://example.com --stealth | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "Pricing" --max-chars 5000
```
Full path: `python3 ~/.claude/skills/scrapling/scripts/scrapling_fetch.py`

Flags: `--stealth`, `--dynamic`, `--css SELECTOR`, `--solve-cloudflare`, `--impersonate BROWSER`, `--format {text,html}`, `--no-headless`, `--timeout SECONDS`
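Flags can be combined; a hypothetical example using only the flags listed above:

```bash
# HTTP fetch impersonating Chrome's TLS fingerprint, raw HTML output, 60s timeout
python3 ~/.claude/skills/scrapling/scripts/scrapling_fetch.py https://example.com \
  --impersonate chrome --format html --timeout 60
```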
For `--network-idle`, `--real-chrome`, or POST requests, use the CLI direct path below.
## Quick Start -- CLI Direct

For file-based output (Scrapling's native CLI):

```bash
# HTTP fetch -> markdown
scrapling extract get 'https://example.com' content.md

# HTTP with CSS selector and browser impersonation
scrapling extract get 'https://example.com' content.md --css-selector '.main-content' --impersonate chrome

# Dynamic fetch (Playwright, JS rendering)
scrapling extract fetch 'https://example.com' content.md

# Stealth with Cloudflare bypass
scrapling extract stealthy-fetch 'https://protected.site' content.md --solve-cloudflare

# Stealth with CSS selector, visible browser for debugging
scrapling extract stealthy-fetch 'https://site.com' content.md --css-selector '.data' --no-headless
```
## Three Fetcher Tiers

| CLI verb | Engine | Stealth | JS | Speed |
|---|---|---|---|---|
| `get` / `post` | curl_cffi (HTTP) | TLS impersonation | No | Fast |
| `fetch` | Playwright/Chromium | Medium | Yes | Medium |
| `stealthy-fetch` | Patchright/Chrome | Maximum | Yes | Slower |
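A practical pattern is to escalate tiers only when the cheaper one fails. A sketch, assuming the CLI exits nonzero on a failed fetch:

```bash
# Start with plain HTTP; fall back to JS rendering, then full stealth
scrapling extract get 'https://example.com' out.md || \
scrapling extract fetch 'https://example.com' out.md || \
scrapling extract stealthy-fetch 'https://example.com' out.md --solve-cloudflare
```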
## Python API

For element-level extraction, automation, or when the CLI is insufficient:

```python
from scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher

# Simple HTTP fetch with CSS extraction
page = Fetcher.get('https://example.com', impersonate='chrome')
titles = page.css('.item h2::text').getall()
links = page.css('a::attr(href)').getall()

# Stealth with Cloudflare bypass
page = StealthyFetcher.fetch('https://protected.site', headless=True,
                             solve_cloudflare=True, hide_canvas=True, block_webrtc=True)
data = page.css('.content').get_all_text()

# Page automation (login, click, fill)
def login(page):
    page.fill('#username', 'user')
    page.fill('#password', 'pass')
    page.click('#submit')

page = StealthyFetcher.fetch('https://app.example.com', page_action=login)
```
## Parsing (no fetching)

```python
from scrapling.parser import Selector

page = Selector("<html>...</html>")
page.css('.item::text').getall()        # CSS with pseudo-elements
page.xpath('//div[@class="item"]')      # XPath
page.find_all('div', class_='item')     # BeautifulSoup-style
page.find_by_text('Add to Cart')        # Text search
page.find_by_regex(r'Price: \$\d+')     # Regex search
```
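The same parser works on HTML you have already saved to disk, such as output from the CLI examples above. A minimal sketch (the `content.html` filename is hypothetical):

```python
# Re-parse previously saved HTML offline -- no network round-trip
from pathlib import Path
from scrapling.parser import Selector

page = Selector(Path("content.html").read_text())
prices = page.find_by_regex(r'\$\d+\.\d{2}')  # elements whose text matches a price
```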
## Adaptive Scraping

Scrapling's signature feature. Elements are fingerprinted to SQLite and relocated by similarity scoring after site redesigns.

```python
from scrapling.fetchers import Fetcher

Fetcher.adaptive = True
page = Fetcher.get('https://example.com')

# First run: save element fingerprint
products = page.css('.product-list', auto_save=True)

# Later, after a site redesign breaks the selector:
products = page.css('.product-list', adaptive=True)  # Still finds it
```

Read: `references/adaptive-scraping.md`
## Spider Framework

For full-site crawling with concurrency, pause/resume, and session routing:

```python
from scrapling.spiders import Spider, Response, Request
from scrapling.fetchers import FetcherSession, StealthySession

class ResearchSpider(Spider):
    name = "research"
    start_urls = ["https://example.com/"]
    concurrent_requests = 10

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", StealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast")
        for item in response.css('.article'):
            yield {"title": item.css('h2::text').get(), "url": response.url}

result = ResearchSpider().start(crawldir="./data")  # Pause/resume enabled
result.items.to_json("output.json")
```
## Troubleshooting

- Browser not found: Run `scrapling install` to download browser dependencies
- Import error on fetchers: Install the full package: `uv pip install "scrapling[all]"`
- Cloudflare still blocking: Combine flags: `--solve-cloudflare --block-webrtc --hide-canvas`
- SSL cert error with HTTP fetcher: curl_cffi's bundled CA certs can be stale on macOS/pyenv. The wrapper handles this automatically (`verify=False`). For the Python API, pass `verify=False` to `Fetcher.get()`, as in the sketch after this list.
- Slow stealthy fetch: Expected; browser automation is inherently slower than HTTP requests. Use `get` (HTTP) when stealth is not needed.
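A sketch of the `verify=False` workaround, appropriate only for hosts you already trust, since it disables TLS certificate verification:

```python
# Workaround for a stale curl_cffi CA bundle: skip certificate verification
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com', verify=False)  # trusted hosts only
```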
## Reference Documentation

| File | Contents |
|---|---|
| | Full CLI extract command reference (get, post, fetch, stealthy-fetch) |
| | Python API (Fetcher, DynamicFetcher, StealthyFetcher, Sessions, Spiders) |
| `references/adaptive-scraping.md` | Adaptive element tracking deep-dive (save, match, similarity scoring) |