# Hermes-agent scrapling

Web scraping with Scrapling: HTTP fetching, stealth browser automation, Cloudflare bypass, and spider crawling via CLI and Python.

Clone the full repository:

```shell
git clone https://github.com/NousResearch/hermes-agent
```

Or install just this skill:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/NousResearch/hermes-agent "$T" && mkdir -p ~/.claude/skills && cp -r "$T/optional-skills/research/scrapling" ~/.claude/skills/nousresearch-hermes-agent-scrapling-8ebf88 && rm -rf "$T"
```

Source: `optional-skills/research/scrapling/SKILL.md`

## Scrapling
Scrapling is a web scraping framework with anti-bot bypass, stealth browser automation, and a spider framework. It provides three fetching strategies (HTTP, dynamic JS, stealth/Cloudflare) and a full CLI.
This skill is for educational and research purposes only. Users must comply with local/international data scraping laws and respect website Terms of Service.
## When to Use
- Scraping static HTML pages (faster than browser tools)
- Scraping JS-rendered pages that need a real browser
- Bypassing Cloudflare Turnstile or bot detection
- Crawling multiple pages with a spider
- When the built-in `web_extract` tool does not return the data you need
## Installation

```shell
pip install "scrapling[all]"
scrapling install
```
Minimal install (HTTP only, no browser):
```shell
pip install scrapling
```
With browser automation only:
```shell
pip install "scrapling[fetchers]"
scrapling install
```
## Quick Reference

| Approach | Class | Use When |
|---|---|---|
| HTTP | `Fetcher` / `FetcherSession` | Static pages, APIs, fast bulk requests |
| Dynamic | `DynamicFetcher` / `DynamicSession` | JS-rendered content, SPAs |
| Stealth | `StealthyFetcher` / `StealthySession` | Cloudflare, anti-bot protected sites |
| Spider | `Spider` | Multi-page crawling with link following |
## CLI Usage

### Extract Static Page

```shell
scrapling extract get 'https://example.com' output.md
```
With CSS selector and browser impersonation:
```shell
scrapling extract get 'https://example.com' output.md \
  --css-selector '.content' \
  --impersonate 'chrome'
```
### Extract JS-Rendered Page

```shell
scrapling extract fetch 'https://example.com' output.md \
  --css-selector '.dynamic-content' \
  --disable-resources \
  --network-idle
```
### Extract Cloudflare-Protected Page

```shell
scrapling extract stealthy-fetch 'https://protected-site.com' output.html \
  --solve-cloudflare \
  --block-webrtc \
  --hide-canvas
```
### POST Request

```shell
scrapling extract post 'https://example.com/api' output.json \
  --json '{"query": "search term"}'
```
## Output Formats
The output format is determined by the file extension:
- `.html` -- raw HTML
- `.md` -- converted to Markdown
- `.txt` -- plain text
- `.json` / `.jsonl` -- JSON
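The extension-based dispatch can be pictured as a simple suffix lookup. This is an illustrative sketch only, not Scrapling's actual implementation:

```python
from pathlib import Path

# Illustrative extension-to-format mapping; Scrapling's real dispatch may differ.
OUTPUT_FORMATS = {
    ".html": "raw HTML",
    ".md": "Markdown",
    ".txt": "plain text",
    ".json": "JSON",
    ".jsonl": "JSON lines",
}

def output_format(path: str) -> str:
    """Return the output format implied by a file's extension."""
    return OUTPUT_FORMATS.get(Path(path).suffix, "unknown")

print(output_format("output.md"))  # Markdown
```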
## Python: HTTP Scraping

### Single Request

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
for q in quotes:
    print(q)
```
### Session (Persistent Cookies)

```python
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate='chrome') as session:
    page = session.get('https://example.com/', stealthy_headers=True)
    links = page.css('a::attr(href)').getall()
    for link in links[:5]:
        sub = session.get(link)
        print(sub.css('h1::text').get())
```
### POST / PUT / DELETE

```python
page = Fetcher.post('https://api.example.com/data', json={"key": "value"})
page = Fetcher.put('https://api.example.com/item/1', data={"name": "updated"})
page = Fetcher.delete('https://api.example.com/item/1')
```
### With Proxy

```python
page = Fetcher.get('https://example.com', proxy='http://user:pass@proxy:8080')
```
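Proxy credentials are usually kept out of source code. A minimal sketch (the environment variable names are hypothetical) that assembles the proxy URL from the environment:

```python
import os

# Hypothetical env-var names; adjust to your deployment.
user = os.environ.get("PROXY_USER", "user")
password = os.environ.get("PROXY_PASS", "pass")
host = os.environ.get("PROXY_HOST", "proxy")
port = os.environ.get("PROXY_PORT", "8080")

proxy_url = f"http://{user}:{password}@{host}:{port}"
# Then pass it along: Fetcher.get('https://example.com', proxy=proxy_url)
print(proxy_url)
```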
## Python: Dynamic Pages (JS-Rendered)

For pages that require JavaScript execution (SPAs, lazy-loaded content):

```python
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch('https://example.com', headless=True)
data = page.css('.js-loaded-content::text').getall()
```
### Wait for Specific Element

```python
page = DynamicFetcher.fetch(
    'https://example.com',
    wait_selector=('.results', 'visible'),
    network_idle=True,
)
```
### Disable Resources for Speed

Blocks fonts, images, media, stylesheets (~25% faster):

```python
from scrapling.fetchers import DynamicSession

with DynamicSession(headless=True, disable_resources=True, network_idle=True) as session:
    page = session.fetch('https://example.com')
    items = page.css('.item::text').getall()
```
### Custom Page Automation

```python
from playwright.sync_api import Page
from scrapling.fetchers import DynamicFetcher

def scroll_and_click(page: Page):
    page.mouse.wheel(0, 3000)
    page.wait_for_timeout(1000)
    page.click('button.load-more')
    page.wait_for_selector('.extra-results')

page = DynamicFetcher.fetch('https://example.com', page_action=scroll_and_click)
results = page.css('.extra-results .item::text').getall()
```
## Python: Stealth Mode (Anti-Bot Bypass)

For Cloudflare-protected or heavily fingerprinted sites:

```python
from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    'https://protected-site.com',
    headless=True,
    solve_cloudflare=True,
    block_webrtc=True,
    hide_canvas=True,
)
content = page.css('.protected-content::text').getall()
```
### Stealth Session

```python
from scrapling.fetchers import StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:
    page1 = session.fetch('https://protected-site.com/page1')
    page2 = session.fetch('https://protected-site.com/page2')
```
## Element Selection

All fetchers return a `Selector` object with these methods:
### CSS Selectors

```python
page.css('h1::text').get()               # First h1 text
page.css('a::attr(href)').getall()       # All link hrefs
page.css('.quote .text::text').getall()  # Nested selection
```
### XPath

```python
page.xpath('//div[@class="content"]/text()').getall()
page.xpath('//a/@href').getall()
```
### Find Methods

```python
page.find_all('div', class_='quote')     # By tag + attribute
page.find_by_text('Read more', tag='a')  # By text content
page.find_by_regex(r'\$\d+\.\d{2}')      # By regex pattern
```
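The pattern passed to `find_by_regex` is a standard Python regular expression. For example, the price pattern above behaves like this with the stdlib `re` module:

```python
import re

# Same pattern as above: a dollar sign, one or more digits, a dot, two digits.
price_pattern = re.compile(r'\$\d+\.\d{2}')

text = "Sale: $19.99 today, was $24.50"
prices = price_pattern.findall(text)
print(prices)  # ['$19.99', '$24.50']
```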
### Similar Elements

Find elements with similar structure (useful for product listings, etc.):

```python
first_product = page.css('.product')[0]
all_similar = first_product.find_similar()
```
### Navigation

```python
el = page.css('.target')[0]
el.parent        # Parent element
el.children      # Child elements
el.next_sibling  # Next sibling
el.prev_sibling  # Previous sibling
```
## Python: Spider Framework

For multi-page crawling with link following:

```python
from scrapling.spiders import Spider, Request, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10
    download_delay = 1

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
                "tags": quote.css('.tag::text').getall(),
            }
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
```
### Multi-Session Spider

Route requests to different fetcher types:

```python
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class SmartSpider(Spider):
    name = "smart"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)
```
### Pause/Resume Crawling

```python
spider = QuotesSpider(crawldir="./crawl_checkpoint")
spider.start()  # Ctrl+C to pause, re-run to resume from checkpoint
```
## Pitfalls

- Browser install required: run `scrapling install` after pip install -- without it, `DynamicFetcher` and `StealthyFetcher` will fail
- Timeouts: `DynamicFetcher`/`StealthyFetcher` timeout is in milliseconds (default 30000); `Fetcher` timeout is in seconds
- Cloudflare bypass: `solve_cloudflare=True` adds 5-15 seconds to fetch time -- only enable when needed
- Resource usage: `StealthyFetcher` runs a real browser -- limit concurrent usage
- Legal: always check robots.txt and website ToS before scraping. This library is for educational and research purposes
- Python version: requires Python 3.10+
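Because the timeout units differ between fetchers, a small conversion helper (hypothetical, not part of Scrapling) can keep them from being mixed up:

```python
def seconds_to_ms(seconds: float) -> int:
    """Convert seconds to the milliseconds expected by DynamicFetcher/StealthyFetcher."""
    return int(seconds * 1000)

# Fetcher takes seconds:                 Fetcher.get(url, timeout=30)
# Dynamic/Stealthy take milliseconds:    DynamicFetcher.fetch(url, timeout=seconds_to_ms(30))
print(seconds_to_ms(30))  # 30000
```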