
Install

Clone the upstream repo:

git clone https://github.com/openclaw/skills

Or, for Claude Code, install directly into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/abhishekj9621/web-scraper-skill" ~/.claude/skills/clawdbot-skills-web-scraper-skill && rm -rf "$T"

Manifest: skills/abhishekj9621/web-scraper-skill/SKILL.md

Web Scraper Skill (Apify + Firecrawl)

This skill helps Openclaw scrape and extract data from websites using two powerful APIs:

  • Firecrawl — best for scraping individual pages, crawling entire sites, and getting LLM-ready content (markdown)
  • Apify — best for specialized scrapers (social media, Google Maps, e-commerce, etc.) via pre-built Actors

Quick Decision Guide: Apify vs Firecrawl

Use Case                                | Recommended Tool
----------------------------------------|---------------------------------------
Scrape a single page into markdown/JSON | Firecrawl /scrape
Crawl an entire website (follow links)  | Firecrawl /crawl
Map all URLs on a site                  | Firecrawl /map
Search web + scrape results             | Firecrawl /search
Scrape Instagram / TikTok / Twitter     | Apify (social actors)
Scrape Google Maps / reviews            | Apify (compass/crawler-google-places)
Scrape Amazon products                  | Apify (apify/amazon-scraper)
Scrape Google Search results            | Apify (apify/google-search-scraper)
Custom actor / any Apify Store actor    | Apify

Authentication

Both APIs require API keys passed via headers. Always ask the user for their key if not provided.

Firecrawl:

Authorization: Bearer fc-YOUR_API_KEY

Apify:

Authorization: Bearer YOUR_APIFY_TOKEN

(or ?token=YOUR_TOKEN in the URL)
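
For illustration, a minimal Python sketch of both header styles (the requests library and the environment variable names FIRECRAWL_API_KEY / APIFY_TOKEN are assumptions, not part of either API):

import os
import requests

# Assumed env var names; read keys from the environment rather than hardcoding.
firecrawl_headers = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}
apify_headers = {"Authorization": f"Bearer {os.environ['APIFY_TOKEN']}"}

# The Apify token can alternatively be passed in the URL:
# requests.get("https://api.apify.com/v2/acts", params={"token": os.environ["APIFY_TOKEN"]})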


Firecrawl API Reference

Base URL:

https://api.firecrawl.dev/v2

1. Scrape a Single Page

POST /v2/scrape
Authorization: Bearer fc-YOUR_API_KEY
Content-Type: application/json

{
  "url": "https://example.com",
  "formats": ["markdown"],          // Options: markdown, html, rawHtml, links, screenshot, json
  "onlyMainContent": true,          // Strips nav/footer/ads
  "waitFor": 0,                     // ms to wait before scraping (for JS-heavy pages)
  "timeout": 30000,                 // ms
  "blockAds": true,
  "proxy": "auto"                   // "auto", "basic", or "stealth"
}

Response:

{ "success": true, "data": { "markdown": "...", "metadata": {...} } }

2. Crawl an Entire Website

Crawling is async — start a job, then poll for results.

POST /v2/crawl
{
  "url": "https://docs.example.com",
  "limit": 50,                      // Max pages
  "maxDepth": 3,
  "allowExternalLinks": false,
  "scrapeOptions": {
    "formats": ["markdown"],
    "onlyMainContent": true
  }
}

Response:

{ "success": true, "id": "crawl-job-id" }

Poll status:

GET /v2/crawl/{crawl-job-id}

Response:

{ "status": "completed", "total": 50, "data": [...] }

3. Map a Website's URLs

POST /v2/map
{ "url": "https://example.com" }

Response:

{ "success": true, "links": [{ "url": "...", "title": "..." }] }

4. Search + Scrape in One Call

POST /v2/search
{
  "query": "best web scraping tools 2025",
  "limit": 5,
  "scrapeOptions": { "formats": ["markdown"] }
}

Response:

{ "data": [{ "url": "...", "title": "...", "markdown": "..." }] }

5. Batch Scrape Multiple URLs

POST /v2/batch/scrape
{
  "urls": ["https://a.com", "https://b.com"],
  "formats": ["markdown"]
}

Returns a job ID; poll with:

GET /v2/batch/scrape/{id}
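
Batch scraping follows the same start-then-poll pattern as crawling; a brief sketch (again assuming "completed" and "failed" are the terminal statuses, as with crawl):

import time
import requests

BASE = "https://api.firecrawl.dev/v2"
HEADERS = {"Authorization": "Bearer fc-YOUR_API_KEY"}

job = requests.post(f"{BASE}/batch/scrape", headers=HEADERS, json={
    "urls": ["https://a.com", "https://b.com"],
    "formats": ["markdown"],
}).json()

status = requests.get(f"{BASE}/batch/scrape/{job['id']}", headers=HEADERS).json()
while status.get("status") not in ("completed", "failed"):
    time.sleep(2)
    status = requests.get(f"{BASE}/batch/scrape/{job['id']}", headers=HEADERS).json()

print(len(status.get("data", [])), "pages scraped")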


Apify API Reference

Base URL:

https://api.apify.com/v2

Auth: pass the token as a query param (?token=YOUR_TOKEN) or in the Authorization header.
Core Workflow

Apify runs "Actors" (pre-built scrapers). The flow is:

  1. Start a run → get a runId and defaultDatasetId
  2. Poll status until SUCCEEDED
  3. Fetch results from the dataset

1. Run an Actor (Async)

POST /v2/acts/{actorId}/runs?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor-specific input... }

Response:

{
  "data": {
    "id": "RUN_ID",
    "status": "RUNNING",
    "defaultDatasetId": "DATASET_ID"
  }
}

Common Actor IDs:

  • apify/web-scraper — generic JS scraper
  • apify/google-search-scraper — Google SERPs
  • compass/crawler-google-places — Google Maps
  • apify/instagram-scraper — Instagram
  • clockworks/free-tiktok-scraper — TikTok
  • apify/amazon-scraper — Amazon products

2. Poll Run Status

GET /v2/acts/{actorId}/runs/{runId}?token=YOUR_TOKEN

Poll until status is SUCCEEDED or FAILED. Recommended interval: 5 seconds.

3. Fetch Results

GET /v2/datasets/{datasetId}/items?token=YOUR_TOKEN&format=json

Optional params: format (json/csv/xlsx/xml), limit, offset
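
Putting steps 1–3 together, a minimal Python sketch of the async flow (the actor and its input are illustrative; note that the "/" in an actor ID is written as "~" in API URL paths):

import time
import requests

BASE = "https://api.apify.com/v2"
TOKEN = "YOUR_APIFY_TOKEN"
ACTOR = "apify~google-search-scraper"  # apify/google-search-scraper, with "/" as "~"

# 1. Start the run.
run = requests.post(
    f"{BASE}/acts/{ACTOR}/runs",
    params={"token": TOKEN},
    json={"queries": "web scraping tools", "maxPagesPerQuery": 1},
).json()["data"]

# 2. Poll until a terminal status (ABORTED / TIMED-OUT are additional
#    terminal states beyond the two named above).
while run["status"] not in ("SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"):
    time.sleep(5)
    run = requests.get(
        f"{BASE}/acts/{ACTOR}/runs/{run['id']}", params={"token": TOKEN}
    ).json()["data"]

# 3. Fetch results from the default dataset.
items = requests.get(
    f"{BASE}/datasets/{run['defaultDatasetId']}/items",
    params={"token": TOKEN, "format": "json"},
).json()
print(run["status"], "-", len(items), "items")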

4. Run Synchronously (≤5 minutes)

For short runs, use the sync endpoint — it waits and returns dataset items directly:

POST /v2/acts/{actorId}/run-sync-get-dataset-items?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor input... }
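
A hedged one-call sketch, using the Google Search input shown in "Common Actor Inputs" below:

import requests

# Short run via the sync endpoint; returns dataset items directly.
items = requests.post(
    "https://api.apify.com/v2/acts/apify~google-search-scraper/run-sync-get-dataset-items",
    params={"token": "YOUR_APIFY_TOKEN"},
    json={"queries": "web scraping tools", "maxPagesPerQuery": 1, "resultsPerPage": 10},
    timeout=330,  # allow for the ~5-minute sync cap plus overhead
).json()
print(len(items), "results")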

Common Actor Inputs

Google Search Scraper:

{ "queries": "web scraping tools", "maxPagesPerQuery": 1, "resultsPerPage": 10 }

Google Maps Scraper:

{ "searchStringsArray": ["restaurants in Mumbai"], "maxCrawledPlaces": 20 }

Web Scraper (generic):

{
  "startUrls": [{ "url": "https://example.com" }],
  "pageFunction": "async function pageFunction(context) { const $ = context.jQuery; return { title: $('title').text() }; }",
  "maxPagesPerCrawl": 10
}

Output Handling

  • Firecrawl returns data directly in the response (or via polling for crawl/batch).
  • Apify stores results in a dataset; retrieve with GET /v2/datasets/{id}/items.
  • Both support JSON output. Firecrawl also provides clean markdown ideal for LLMs.
  • Apify also supports CSV, XLSX, and XML output formats.

Code Templates

See references/code-templates.md for ready-to-run Python and JavaScript code for both APIs.


Error Handling

  • Firecrawl 402 → out of credits; the user needs to upgrade their plan
  • Firecrawl 429 → rate limited; add delays between requests
  • Apify FAILED run → check run logs via GET /v2/acts/{id}/runs/{runId}/log
  • Always wrap API calls in try/catch and check for success: false in Firecrawl responses (see the sketch after this list)
  • Firecrawl crawls respect robots.txt by default
  • For JS-heavy pages, increase waitFor (Firecrawl) or use Playwright/Puppeteer actors (Apify)
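
The sketch promised above: a minimal Python wrapper applying these rules (the retry count and backoff values are arbitrary choices, not API requirements):

import time
import requests

def firecrawl_scrape(url, key, retries=3):
    """Scrape one page, retrying on 429 and surfacing 402 / success: false."""
    for attempt in range(retries):
        try:
            resp = requests.post(
                "https://api.firecrawl.dev/v2/scrape",
                headers={"Authorization": f"Bearer {key}"},
                json={"url": url, "formats": ["markdown"]},
                timeout=60,
            )
        except requests.RequestException as exc:
            raise RuntimeError(f"network error: {exc}") from exc
        if resp.status_code == 402:
            raise RuntimeError("out of credits; upgrade plan")
        if resp.status_code == 429:
            time.sleep(2 ** attempt)  # back off, then retry
            continue
        body = resp.json()
        if not body.get("success"):
            raise RuntimeError(f"scrape failed: {body}")
        return body["data"]
    raise RuntimeError("still rate limited after retries")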

Best Practices

  1. Start small — test with 1 URL or a small limit before scaling
  2. Use onlyMainContent: true in Firecrawl to remove nav/footer noise
  3. Choose async for large jobs — don't use sync endpoints for crawls with 50+ pages
  4. Store API keys securely — never hardcode them; use environment variables
  5. Check rate limits — Firecrawl's vary by plan; Apify allows 250k requests/min globally
  6. Prefer Firecrawl for LLM pipelines — its markdown output is clean and ready for RAG/AI
  7. Prefer Apify for social/structured data — specialized actors handle anti-bot measures better