
Install

Clone the upstream repo:

git clone https://github.com/openclaw/skills

Or, for Claude Code, install directly into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/abhishekj9621/web-scraper-skill" ~/.claude/skills/clawdbot-skills-web-scraper-skill && rm -rf "$T"

Manifest: skills/abhishekj9621/web-scraper-skill/SKILL.md

Web Scraper Skill (Apify + Firecrawl)

This skill helps Openclaw scrape and extract data from websites using two powerful APIs:

  • Firecrawl — best for scraping individual pages, crawling entire sites, and getting LLM-ready content (markdown)
  • Apify — best for specialized scrapers (social media, Google Maps, e-commerce, etc.) via pre-built Actors

Quick Decision Guide: Apify vs Firecrawl

Use Case                                | Recommended Tool
----------------------------------------|---------------------------------------
Scrape a single page into markdown/JSON | Firecrawl /scrape
Crawl an entire website (follow links)  | Firecrawl /crawl
Map all URLs on a site                  | Firecrawl /map
Search web + scrape results             | Firecrawl /search
Scrape Instagram / TikTok / Twitter     | Apify (social actors)
Scrape Google Maps / reviews            | Apify (compass/crawler-google-places)
Scrape Amazon products                  | Apify (apify/amazon-scraper)
Scrape Google Search results            | Apify (apify/google-search-scraper)
Custom actor / any Apify Store actor    | Apify

Authentication

Both APIs require API keys passed via headers. Always ask the user for their key if not provided.

Firecrawl:

Authorization: Bearer fc-YOUR_API_KEY

Apify:

Authorization: Bearer YOUR_APIFY_TOKEN

(or ?token=YOUR_TOKEN in the URL)
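
For illustration, a minimal Python sketch of both header styles (the requests library and the environment variable names FIRECRAWL_API_KEY / APIFY_TOKEN are assumptions, not part of either API):

import os
import requests

# Assumed env var names; read keys from the environment rather than hardcoding.
firecrawl_headers = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}
apify_headers = {"Authorization": f"Bearer {os.environ['APIFY_TOKEN']}"}

# The Apify token can alternatively be passed in the URL:
# requests.get("https://api.apify.com/v2/acts", params={"token": os.environ["APIFY_TOKEN"]})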


Firecrawl API Reference

Base URL:

https://api.firecrawl.dev/v2

1. Scrape a Single Page

POST /v2/scrape
Authorization: Bearer fc-YOUR_API_KEY
Content-Type: application/json

{
  "url": "https://example.com",
  "formats": ["markdown"],          // Options: markdown, html, rawHtml, links, screenshot, json
  "onlyMainContent": true,          // Strips nav/footer/ads
  "waitFor": 0,                     // ms to wait before scraping (for JS-heavy pages)
  "timeout": 30000,                 // ms
  "blockAds": true,
  "proxy": "auto"                   // "auto", "basic", or "stealth"
}

Response:

{ "success": true, "data": { "markdown": "...", "metadata": {...} } }

2. Crawl an Entire Website

Crawling is async — start a job, then poll for results.

POST /v2/crawl
{
  "url": "https://docs.example.com",
  "limit": 50,                      // Max pages
  "maxDepth": 3,
  "allowExternalLinks": false,
  "scrapeOptions": {
    "formats": ["markdown"],
    "onlyMainContent": true
  }
}

Response:

{ "success": true, "id": "crawl-job-id" }

Poll status:

GET /v2/crawl/{crawl-job-id}

Response:

{ "status": "completed", "total": 50, "data": [...] }

3. Map a Website's URLs

POST /v2/map
{ "url": "https://example.com" }

Response:

{ "success": true, "links": [{ "url": "...", "title": "..." }] }

4. Search + Scrape in One Call

POST /v2/search
{
  "query": "best web scraping tools 2025",
  "limit": 5,
  "scrapeOptions": { "formats": ["markdown"] }
}

Response:

{ "data": [{ "url": "...", "title": "...", "markdown": "..." }] }

5. Batch Scrape Multiple URLs

POST /v2/batch/scrape
{
  "urls": ["https://a.com", "https://b.com"],
  "formats": ["markdown"]
}

Returns a job ID; poll with:

GET /v2/batch/scrape/{id}
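
Batch scraping follows the same start-then-poll pattern as crawling; a brief sketch (again assuming "completed" and "failed" are the terminal statuses, as with crawl):

import time
import requests

BASE = "https://api.firecrawl.dev/v2"
HEADERS = {"Authorization": "Bearer fc-YOUR_API_KEY"}

job = requests.post(f"{BASE}/batch/scrape", headers=HEADERS, json={
    "urls": ["https://a.com", "https://b.com"],
    "formats": ["markdown"],
}).json()

status = requests.get(f"{BASE}/batch/scrape/{job['id']}", headers=HEADERS).json()
while status.get("status") not in ("completed", "failed"):
    time.sleep(2)
    status = requests.get(f"{BASE}/batch/scrape/{job['id']}", headers=HEADERS).json()

print(len(status.get("data", [])), "pages scraped")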


Apify API Reference

Base URL:

https://api.apify.com/v2

Auth: pass the token as a query param (?token=YOUR_TOKEN) or in the Authorization header.
Core Workflow

Apify runs "Actors" (pre-built scrapers). The flow is:

  1. Start a run → get a runId and defaultDatasetId
  2. Poll status until SUCCEEDED
  3. Fetch results from the dataset

1. Run an Actor (Async)

POST /v2/acts/{actorId}/runs?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor-specific input... }

Response:

{
  "data": {
    "id": "RUN_ID",
    "status": "RUNNING",
    "defaultDatasetId": "DATASET_ID"
  }
}

Common Actor IDs:

  • apify/web-scraper — generic JS scraper
  • apify/google-search-scraper — Google SERPs
  • compass/crawler-google-places — Google Maps
  • apify/instagram-scraper — Instagram
  • clockworks/free-tiktok-scraper — TikTok
  • apify/amazon-scraper — Amazon products

2. Poll Run Status

GET /v2/acts/{actorId}/runs/{runId}?token=YOUR_TOKEN

Poll until status is SUCCEEDED or FAILED. Recommended interval: 5 seconds.

3. Fetch Results

GET /v2/datasets/{datasetId}/items?token=YOUR_TOKEN&format=json

Optional params: format (json/csv/xlsx/xml), limit, offset
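
Putting steps 1–3 together, a minimal Python sketch of the async flow (the actor and its input are illustrative; note that the "/" in an actor ID is written as "~" in API URL paths):

import time
import requests

BASE = "https://api.apify.com/v2"
TOKEN = "YOUR_APIFY_TOKEN"
ACTOR = "apify~google-search-scraper"  # apify/google-search-scraper, with "/" as "~"

# 1. Start the run.
run = requests.post(
    f"{BASE}/acts/{ACTOR}/runs",
    params={"token": TOKEN},
    json={"queries": "web scraping tools", "maxPagesPerQuery": 1},
).json()["data"]

# 2. Poll until a terminal status (ABORTED / TIMED-OUT are additional
#    terminal states beyond the two named above).
while run["status"] not in ("SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"):
    time.sleep(5)
    run = requests.get(
        f"{BASE}/acts/{ACTOR}/runs/{run['id']}", params={"token": TOKEN}
    ).json()["data"]

# 3. Fetch results from the default dataset.
items = requests.get(
    f"{BASE}/datasets/{run['defaultDatasetId']}/items",
    params={"token": TOKEN, "format": "json"},
).json()
print(run["status"], "-", len(items), "items")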

4. Run Synchronously (≤5 minutes)

For short runs, use the sync endpoint — it waits and returns dataset items directly:

POST /v2/acts/{actorId}/run-sync-get-dataset-items?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor input... }
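
A hedged one-call sketch, using the Google Search input shown in "Common Actor Inputs" below:

import requests

# Short run via the sync endpoint; returns dataset items directly.
items = requests.post(
    "https://api.apify.com/v2/acts/apify~google-search-scraper/run-sync-get-dataset-items",
    params={"token": "YOUR_APIFY_TOKEN"},
    json={"queries": "web scraping tools", "maxPagesPerQuery": 1, "resultsPerPage": 10},
    timeout=330,  # allow for the ~5-minute sync cap plus overhead
).json()
print(len(items), "results")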

Common Actor Inputs

Google Search Scraper:

{ "queries": "web scraping tools", "maxPagesPerQuery": 1, "resultsPerPage": 10 }

Google Maps Scraper:

{ "searchStringsArray": ["restaurants in Mumbai"], "maxCrawledPlaces": 20 }

Web Scraper (generic):

{
  "startUrls": [{ "url": "https://example.com" }],
  "pageFunction": "async function pageFunction(context) { const $ = context.jQuery; return { title: $('title').text() }; }",
  "maxPagesPerCrawl": 10
}

Output Handling

  • Firecrawl returns data directly in the response (or via polling for crawl/batch).
  • Apify stores results in a dataset; retrieve with GET /v2/datasets/{id}/items.
  • Both support JSON output. Firecrawl also provides clean markdown ideal for LLMs.
  • Apify also supports CSV, XLSX, and XML output formats.

Code Templates

See references/code-templates.md for ready-to-run Python and JavaScript code for both APIs.


Error Handling

  • Firecrawl 402 → out of credits; the user needs to upgrade their plan
  • Firecrawl 429 → rate limited; add delays between requests
  • Apify FAILED run → check run logs via GET /v2/acts/{id}/runs/{runId}/log
  • Always wrap API calls in try/catch and check for success: false in Firecrawl responses (see the sketch after this list)
  • Firecrawl crawls respect robots.txt by default
  • For JS-heavy pages, increase waitFor (Firecrawl) or use Playwright/Puppeteer actors (Apify)
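
The sketch promised above: a minimal Python wrapper applying these rules (the retry count and backoff values are arbitrary choices, not API requirements):

import time
import requests

def firecrawl_scrape(url, key, retries=3):
    """Scrape one page, retrying on 429 and surfacing 402 / success: false."""
    for attempt in range(retries):
        try:
            resp = requests.post(
                "https://api.firecrawl.dev/v2/scrape",
                headers={"Authorization": f"Bearer {key}"},
                json={"url": url, "formats": ["markdown"]},
                timeout=60,
            )
        except requests.RequestException as exc:
            raise RuntimeError(f"network error: {exc}") from exc
        if resp.status_code == 402:
            raise RuntimeError("out of credits; upgrade plan")
        if resp.status_code == 429:
            time.sleep(2 ** attempt)  # back off, then retry
            continue
        body = resp.json()
        if not body.get("success"):
            raise RuntimeError(f"scrape failed: {body}")
        return body["data"]
    raise RuntimeError("still rate limited after retries")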

Best Practices

  1. Start small — test with 1 URL or a small limit before scaling
  2. Use onlyMainContent: true in Firecrawl to remove nav/footer noise
  3. Choose async for large jobs — don't use sync endpoints for crawls with 50+ pages
  4. Store API keys securely — never hardcode them; use environment variables
  5. Check rate limits — Firecrawl's vary by plan; Apify allows 250k requests/min globally
  6. Prefer Firecrawl for LLM pipelines — its markdown output is clean and ready for RAG/AI
  7. Prefer Apify for social/structured data — specialized actors handle anti-bot measures better