Open-skills crawl-websites-at-scale
Scrape websites at scale using Scrapy, a Python web crawling and scraping framework. Use when: (1) Crawling multiple pages or entire sites, (2) Extracting structured data from HTML/XML, or (3) Building automated data pipelines from web sources.
```bash
# Clone the full open-skills repository
git clone https://github.com/besoeasy/open-skills

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/besoeasy/open-skills "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/crawl-websites-at-scale" ~/.claude/skills/besoeasy-open-skills-crawl-websites-at-scale \
  && rm -rf "$T"
```
Scrapy Web Scraping Skill
Scrapy is a fast, high-level Python web crawling and scraping framework. It enables structured data extraction from websites, supports crawling entire sites, and integrates pipelines to process and store scraped data.
When to use
- Crawl entire websites or follow links across many pages
- Extract structured data (prices, articles, product listings) into JSON/CSV
- Run scheduled or large-scale scraping pipelines
- Need built-in support for request throttling, retries, and middlewares
Required tools / APIs
- No external API required
- Python 3.8+ required
- Scrapy: Web crawling and scraping framework
Install options:
```bash
# pip
pip install scrapy

# Ubuntu/Debian
sudo apt-get install -y python3-pip && pip install scrapy

# macOS
brew install python && pip install scrapy

# Verify installation
scrapy version
```
Skills
basic_usage
Create and run a simple Scrapy spider to scrape a single page.
```bash
# Create a new Scrapy project
scrapy startproject myproject
cd myproject

# Generate a spider
scrapy genspider quotes quotes.toscrape.com

# Run the spider and save to JSON
scrapy crawl quotes -o output.json

# Run the spider and save to CSV
scrapy crawl quotes -o output.csv
```
Python spider (quotes.py):
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("a.tag::text").getall(),
            }
        # Follow pagination links
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
robust_usage
Production-oriented spider with settings, item pipelines, and error handling.
```bash
# Run with custom settings (rate limiting, retries)
scrapy crawl quotes \
  -s DOWNLOAD_DELAY=1 \
  -s AUTOTHROTTLE_ENABLED=True \
  -s RETRY_TIMES=3 \
  -o output.json

# Run from a script (no project required)
scrapy runspider spider.py -o output.json
```
Python with error handling and structured items:
```python
import scrapy
from scrapy.crawler import CrawlerProcess


class ArticleSpider(scrapy.Spider):
    name = "articles"
    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 1,
        "AUTOTHROTTLE_MAX_DELAY": 10,
        "ROBOTSTXT_OBEY": True,
        "USER_AGENT": "open-skills-bot/1.0 (+https://github.com/besoeasy/open-skills)",
        "RETRY_TIMES": 3,
        "FEEDS": {"output.json": {"format": "json"}},
    }

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url or "https://quotes.toscrape.com"]

    def parse(self, response):
        for article in response.css("article, div.post, div.entry"):
            yield {
                "url": response.url,
                "title": article.css("h1::text, h2::text").get("").strip(),
                "body": " ".join(article.css("p::text").getall()),
            }
        # Follow same-site links, routing failures to errback
        for link in response.css("a::attr(href)").getall():
            if link.startswith("/") or response.url in link:
                yield response.follow(link, self.parse, errback=self.errback)

    def errback(self, failure):
        # Called for failed requests (DNS errors, timeouts, connection failures)
        self.logger.error(f"Request failed: {failure.request.url} — {failure.value}")


# Run without a Scrapy project
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ArticleSpider, start_url="https://quotes.toscrape.com")
    process.start()
```
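This section mentions item pipelines; a minimal one drops incomplete items before they reach the output feed. A sketch, where `RequireTitlePipeline` and the `myproject` module path are hypothetical names:

```python
# pipelines.py: drop items that have no title (hypothetical example)
from scrapy.exceptions import DropItem


class RequireTitlePipeline:
    def process_item(self, item, spider):
        # Reject the item before it reaches the feed exporter
        if not item.get("title"):
            raise DropItem(f"missing title: {item.get('url')}")
        return item
```

Enable it via `ITEM_PIPELINES = {"myproject.pipelines.RequireTitlePipeline": 300}` in settings or `custom_settings`; the number orders pipelines, lower values run first.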
extract_with_xpath
Use XPath selectors for precise extraction from complex HTML structures.
```python
import scrapy


class XPathSpider(scrapy.Spider):
    name = "xpath_example"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                "text": quote.xpath(".//span[@class='text']/text()").get(),
                "author": quote.xpath(".//small[@class='author']/text()").get(),
                "tags": quote.xpath(".//a[@class='tag']/text()").getall(),
            }
```
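A few XPath patterns come up constantly and are worth keeping at hand; a quick sketch to run inside a `parse()` method or in `scrapy shell`:

```python
# Handy XPath patterns on a Scrapy response object
response.xpath("//a[contains(@class, 'tag')]/text()").getall()  # partial class match
response.xpath("//li[@class='next']/a/@href").get()             # extract an attribute
response.xpath("normalize-space(//h1)").get()                   # collapse whitespace
```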
Output format
Scrapy yields Python dicts (or Item objects) per scraped record. When saved to file:
- `output.json`: array of JSON objects, one per item
- `output.csv`: CSV with headers matching dict keys
- `output.jsonl`: one JSON object per line (memory-efficient for large crawls)
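Scrapy infers the feed format from the output file's extension, so switching formats is just a filename change; for large crawls JSON Lines is usually the safest choice:

```bash
# Feed format is inferred from the extension (.json, .csv, .jsonl)
scrapy crawl quotes -o output.jsonl   # JSON Lines; use .jl on older Scrapy versions
```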
Example item:
{ "text": "The world as we have created it is a process of our thinking.", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"] }
Error shape: Scrapy logs errors to stderr; failed requests invoke the spider's `errback` callback when one is attached to the request.
Rate limits / Best practices
- Enable `ROBOTSTXT_OBEY = True` to respect robots.txt automatically
- Set `DOWNLOAD_DELAY` (seconds between requests) to avoid overloading servers
- Enable `AUTOTHROTTLE_ENABLED = True` for adaptive rate limiting
- Set a descriptive `USER_AGENT` identifying your bot
- Use `CONCURRENT_REQUESTS_PER_DOMAIN = 1` for polite single-domain crawling
- Cache responses during development: `HTTPCACHE_ENABLED = True`
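Taken together, these settings make a reasonable polite-crawling baseline for a project's settings.py. A sketch; the values are starting points to tune per site, and the user agent string is a placeholder:

```python
# settings.py: polite-crawling baseline (values are starting points, tune per site)
ROBOTSTXT_OBEY = True                # honor robots.txt automatically
DOWNLOAD_DELAY = 1                   # seconds between requests
AUTOTHROTTLE_ENABLED = True          # adapt the delay to server latency
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one in-flight request per domain
USER_AGENT = "my-bot/1.0 (+https://example.com/bot)"  # placeholder bot identity
HTTPCACHE_ENABLED = True             # cache responses while developing selectors
```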
Agent prompt
You have scrapy web-scraping capability. When a user asks to scrape or crawl a website:
1. Confirm the target URL and data fields to extract (e.g., title, price, link)
2. Create a Scrapy spider using CSS or XPath selectors to target those fields
3. Enable ROBOTSTXT_OBEY=True and set DOWNLOAD_DELAY>=1 to be polite
4. Follow pagination links if the user needs data across multiple pages
5. Save results to output.json or output.csv
Always identify your bot with a descriptive USER_AGENT and never scrape login-protected or paywalled content.
Troubleshooting
Error: "Forbidden by robots.txt"
- Symptom: Spider skips URLs and logs "Forbidden by robots.txt"
- Solution: Review the site's robots.txt; only scrape paths that are allowed, or set `ROBOTSTXT_OBEY = False` if you have explicit permission from the site owner
Error: "Empty or missing data"
- Symptom: Items are yielded with empty strings or `None` values
- Solution: Inspect the page source (`scrapy shell <url>`) and adjust your CSS/XPath selectors to match the actual HTML structure
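The `scrapy shell` command is the quickest way to debug selectors interactively; a typical session, using the demo site as a stand-in target:

```bash
# Open an interactive shell with the page already downloaded
scrapy shell "https://quotes.toscrape.com"
# Then, inside the Python shell:
#   response.css("div.quote span.text::text").get()          # test a CSS selector
#   response.xpath("//small[@class='author']/text()").get()  # test an XPath
#   view(response)   # open the fetched HTML in your browser
```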
Error: "Too many redirects / 429 Too Many Requests"
- Symptom: Requests fail with HTTP 429 or redirect loops
- Solution: Increase `DOWNLOAD_DELAY`, enable `AUTOTHROTTLE_ENABLED = True`, or add a `Retry-After`-respecting middleware
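As a first step, the back-off can be applied from the command line without touching the spider; the values here are illustrative:

```bash
# Slow down and let AutoThrottle target well under one concurrent request
scrapy crawl quotes \
  -s DOWNLOAD_DELAY=5 \
  -s AUTOTHROTTLE_ENABLED=True \
  -s AUTOTHROTTLE_TARGET_CONCURRENCY=0.5
```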
Error: "JavaScript-rendered content not found"
- Symptom: Expected data is missing because the site uses client-side rendering
- Solution: Use `scrapy-playwright` or `scrapy-splash` middleware to render JavaScript before parsing
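A minimal scrapy-playwright setup looks roughly like the sketch below (install with `pip install scrapy-playwright` then `playwright install chromium`); the handler and reactor paths are those documented in the project's README, so verify them against the version you install:

```python
# Minimal scrapy-playwright wiring (a sketch; verify against the
# scrapy-playwright README for the version you install)
import scrapy


class JsSpider(scrapy.Spider):
    name = "js_example"
    custom_settings = {
        # Route downloads through Playwright's browser instead of Scrapy's HTTP client
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        # meta={"playwright": True} marks this request for browser rendering;
        # the /js/ variant of the demo site renders its quotes client-side
        yield scrapy.Request(
            "https://quotes.toscrape.com/js/", meta={"playwright": True}
        )

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
```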
See also
- ../using-web-scraping/SKILL.md — Browser-based scraping with Playwright/Puppeteer
- ../phone-specs-scraper/SKILL.md — Scraping phone specifications from public sites
- ../web-search-api/SKILL.md — Find target URLs to scrape via search APIs