# web-scraper-skill

## Install

Clone the upstream repo:

```shell
git clone https://github.com/openclaw/skills
```

Claude Code (install into `~/.claude/skills/`):

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/abhishekj9621/web-scraper-skill" ~/.claude/skills/clawdbot-skills-web-scraper-skill && rm -rf "$T"
```

Manifest: `skills/abhishekj9621/web-scraper-skill/SKILL.md`
# Web Scraper Skill (Apify + Firecrawl)
This skill helps Openclaw scrape and extract data from websites using two powerful APIs:
- Firecrawl — best for scraping individual pages, crawling entire sites, and getting LLM-ready content (markdown)
- Apify — best for specialized scrapers (social media, Google Maps, e-commerce, etc.) via pre-built Actors
## Quick Decision Guide: Apify vs Firecrawl
| Use Case | Recommended Tool |
|---|---|
| Scrape a single page into markdown/JSON | Firecrawl |
| Crawl an entire website (follow links) | Firecrawl |
| Map all URLs on a site | Firecrawl |
| Search web + scrape results | Firecrawl |
| Scrape Instagram / TikTok / Twitter | Apify (social actors) |
| Scrape Google Maps / reviews | Apify (compass/crawler-google-places) |
| Scrape Amazon products | Apify (apify/amazon-scraper) |
| Scrape Google Search results | Apify (apify/google-search-scraper) |
| Custom actor / any Apify Store actor | Apify |
## Authentication

Both APIs require API keys passed via headers. Always ask the user for their key if not provided.

Firecrawl: `Authorization: Bearer fc-YOUR_API_KEY`

Apify: `Authorization: Bearer YOUR_APIFY_TOKEN` (or `?token=YOUR_TOKEN` in the URL)
## Firecrawl API Reference

Base URL: `https://api.firecrawl.dev/v2`
### 1. Scrape a Single Page

```
POST /v2/scrape
Authorization: Bearer fc-YOUR_API_KEY
Content-Type: application/json

{
  "url": "https://example.com",
  "formats": ["markdown"],  // Options: markdown, html, rawHtml, links, screenshot, json
  "onlyMainContent": true,  // Strips nav/footer/ads
  "waitFor": 0,             // ms to wait before scraping (for JS-heavy pages)
  "timeout": 30000,         // ms
  "blockAds": true,
  "proxy": "auto"           // "auto", "basic", or "stealth"
}
```
Response:

```
{ "success": true, "data": { "markdown": "...", "metadata": {...} } }
```
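As a minimal sketch, the scrape call can be wrapped with Python's standard library. The helper names `build_scrape_request` and `scrape_page` are illustrative, not part of the Firecrawl API:

```python
import json
import urllib.request

FIRECRAWL_BASE = "https://api.firecrawl.dev/v2"

def build_scrape_request(url: str, api_key: str) -> urllib.request.Request:
    """Build the POST /v2/scrape request for a markdown scrape."""
    payload = {"url": url, "formats": ["markdown"], "onlyMainContent": True}
    return urllib.request.Request(
        f"{FIRECRAWL_BASE}/scrape",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def scrape_page(url: str, api_key: str) -> str:
    """Return the page as markdown, raising if Firecrawl reports failure."""
    req = build_scrape_request(url, api_key)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    if not body.get("success"):
        raise RuntimeError(f"Firecrawl scrape failed: {body}")
    return body["data"]["markdown"]
```

Pass the user's key in from an environment variable rather than hardcoding it, e.g. `scrape_page("https://example.com", os.environ["FIRECRAWL_API_KEY"])`.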
### 2. Crawl an Entire Website

Crawling is async — it starts a job, which you then poll for results.

```
POST /v2/crawl

{
  "url": "https://docs.example.com",
  "limit": 50,  // Max pages
  "maxDepth": 3,
  "allowExternalLinks": false,
  "scrapeOptions": { "formats": ["markdown"], "onlyMainContent": true }
}
```
Response:

```
{ "success": true, "id": "crawl-job-id" }
```

Poll status:

```
GET /v2/crawl/{crawl-job-id}
```

Response:

```
{ "status": "completed", "total": 50, "data": [...] }
```
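A start-then-poll loop might look like the sketch below (stdlib only). `crawl_status_url` and `wait_for_crawl` are illustrative helpers, and checking for a `"failed"` status is an assumption to cover unsuccessful jobs:

```python
import json
import time
import urllib.request

FIRECRAWL_BASE = "https://api.firecrawl.dev/v2"

def crawl_status_url(job_id: str) -> str:
    """URL polled for crawl progress (GET /v2/crawl/{id})."""
    return f"{FIRECRAWL_BASE}/crawl/{job_id}"

def wait_for_crawl(job_id: str, api_key: str,
                   interval: float = 5.0, timeout: float = 600.0) -> list:
    """Poll the crawl job until it completes, then return the scraped pages."""
    headers = {"Authorization": f"Bearer {api_key}"}
    deadline = time.monotonic() + timeout
    while True:
        req = urllib.request.Request(crawl_status_url(job_id), headers=headers)
        with urllib.request.urlopen(req, timeout=60) as resp:
            body = json.load(resp)
        if body["status"] == "completed":
            return body["data"]
        if body["status"] == "failed":
            raise RuntimeError(f"crawl {job_id} failed")
        if time.monotonic() > deadline:
            raise TimeoutError(f"crawl {job_id} still {body['status']}")
        time.sleep(interval)  # wait between polls; avoids hammering the endpoint
```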
### 3. Map a Website's URLs

```
POST /v2/map

{ "url": "https://example.com" }
```

Response:

```
{ "success": true, "links": [{ "url": "...", "title": "..." }] }
```
### 4. Search + Scrape in One Call

```
POST /v2/search

{ "query": "best web scraping tools 2025", "limit": 5, "scrapeOptions": { "formats": ["markdown"] } }
```

Response:

```
{ "data": [{ "url": "...", "title": "...", "markdown": "..." }] }
```
### 5. Batch Scrape Multiple URLs

```
POST /v2/batch/scrape

{ "urls": ["https://a.com", "https://b.com"], "formats": ["markdown"] }
```

Returns a job ID; poll with:

```
GET /v2/batch/scrape/{id}
```
## Apify API Reference

Base URL: `https://api.apify.com/v2`

Auth: pass the token as a query param (`?token=YOUR_TOKEN`) or in the `Authorization` header.
### Core Workflow

Apify runs "Actors" (pre-built scrapers). The flow is:

1. Start a run → get a `runId` and `defaultDatasetId`
2. Poll status until `SUCCEEDED`
3. Fetch results from the dataset
### 1. Run an Actor (Async)

```
POST /v2/acts/{actorId}/runs?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor-specific input... }
```

Response:

```
{ "data": { "id": "RUN_ID", "status": "RUNNING", "defaultDatasetId": "DATASET_ID" } }
```
Common Actor IDs:

- `apify/web-scraper` — generic JS scraper
- `apify/google-search-scraper` — Google SERPs
- `compass/crawler-google-places` — Google Maps
- `apify/instagram-scraper` — Instagram
- `clockworks/free-tiktok-scraper` — TikTok
- `apify/amazon-scraper` — Amazon products
### 2. Poll Run Status

```
GET /v2/acts/{actorId}/runs/{runId}?token=YOUR_TOKEN
```

Poll until `status` is `SUCCEEDED` or `FAILED`. Recommended interval: 5 seconds.
### 3. Fetch Results

```
GET /v2/datasets/{datasetId}/items?token=YOUR_TOKEN&format=json
```

Optional params: `format` (json/csv/xlsx/xml), `limit`, `offset`
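The three-step workflow can be sketched end to end in Python (stdlib only). `actor_runs_url` and `run_actor_and_fetch` are illustrative helpers; the `~` substitution follows Apify's URL convention for `username/actor-name` IDs, and the list of terminal statuses is an assumption based on Apify's documented run states:

```python
import json
import time
import urllib.request

APIFY_BASE = "https://api.apify.com/v2"

def actor_runs_url(actor_id: str, token: str) -> str:
    """Actor IDs like "apify/web-scraper" use "~" instead of "/" in URLs."""
    return f"{APIFY_BASE}/acts/{actor_id.replace('/', '~')}/runs?token={token}"

def run_actor_and_fetch(actor_id: str, actor_input: dict, token: str,
                        interval: float = 5.0, timeout: float = 600.0) -> list:
    """Start an Actor run, poll until it finishes, then fetch dataset items."""
    # 1. Start the run
    start = urllib.request.Request(
        actor_runs_url(actor_id, token),
        data=json.dumps(actor_input).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(start, timeout=60) as resp:
        run = json.load(resp)["data"]
    # 2. Poll run status until it reaches a terminal state
    status_url = (f"{APIFY_BASE}/acts/{actor_id.replace('/', '~')}"
                  f"/runs/{run['id']}?token={token}")
    deadline = time.monotonic() + timeout
    while run["status"] not in ("SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"):
        if time.monotonic() > deadline:
            raise TimeoutError(f"run {run['id']} still {run['status']}")
        time.sleep(interval)  # recommended 5 s interval
        with urllib.request.urlopen(status_url, timeout=60) as resp:
            run = json.load(resp)["data"]
    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"run ended with status {run['status']}")
    # 3. Fetch results from the default dataset
    items_url = (f"{APIFY_BASE}/datasets/{run['defaultDatasetId']}"
                 f"/items?token={token}&format=json")
    with urllib.request.urlopen(items_url, timeout=60) as resp:
        return json.load(resp)
```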
### 4. Run Synchronously (≤5 minutes)

For short runs, use the sync endpoint — it waits and returns dataset items directly:

```
POST /v2/acts/{actorId}/run-sync-get-dataset-items?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor input... }
```
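A minimal sync-run wrapper, under the same assumptions as above (illustrative helper names; `~` replaces `/` in the actor ID):

```python
import json
import urllib.request

def run_sync_url(actor_id: str, token: str) -> str:
    """Sync endpoint URL; "/" in the actor ID becomes "~" in the path."""
    return (f"https://api.apify.com/v2/acts/{actor_id.replace('/', '~')}"
            f"/run-sync-get-dataset-items?token={token}")

def run_actor_sync(actor_id: str, actor_input: dict, token: str) -> list:
    """Run a short (<=5 min) Actor and return its dataset items directly."""
    req = urllib.request.Request(
        run_sync_url(actor_id, token),
        data=json.dumps(actor_input).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=360) as resp:
        return json.load(resp)
```

For example, a quick SERP check could be `run_actor_sync("apify/google-search-scraper", {"queries": "web scraping tools", "resultsPerPage": 10}, token)`.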
### Common Actor Inputs

Google Search Scraper:

```
{ "queries": "web scraping tools", "maxPagesPerQuery": 1, "resultsPerPage": 10 }
```

Google Maps Scraper:

```
{ "searchStringsArray": ["restaurants in Mumbai"], "maxCrawledPlaces": 20 }
```

Web Scraper (generic):

```
{
  "startUrls": [{ "url": "https://example.com" }],
  "pageFunction": "async function pageFunction(context) { const $ = context.jQuery; return { title: $('title').text() }; }",
  "maxPagesPerCrawl": 10
}
```
## Output Handling

- Firecrawl returns data directly in the response (or via polling for crawl/batch).
- Apify stores results in a dataset; retrieve them with `GET /v2/datasets/{id}/items`.
- Both support JSON output. Firecrawl also provides clean markdown ideal for LLMs.
- Apify also supports CSV, XLSX, and XML output formats.
## Code Templates

See `references/code-templates.md` for ready-to-run Python and JavaScript code for both APIs.
## Error Handling

- Firecrawl 402 → out of credits; the user needs to upgrade their plan
- Firecrawl 429 → rate limited; add delays between requests
- Apify `FAILED` run → check run logs via `GET /v2/acts/{id}/runs/{runId}/log`
- Always wrap API calls in try/catch and check for `success: false` in Firecrawl responses
- Firecrawl crawls respect `robots.txt` by default
- For JS-heavy pages, increase `waitFor` (Firecrawl) or use Playwright/Puppeteer actors (Apify)
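For the 402/429 cases above, one possible retry pattern is exponential backoff. This is a sketch, not either vendor's prescribed client behavior; the helper names and default delays are arbitrary:

```python
import time
import urllib.error
import urllib.request

def backoff_delays(retries: int, base: float = 2.0) -> list:
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def open_with_backoff(req, retries: int = 4, base: float = 2.0):
    """Retry on HTTP 429; surface 402 (out of credits) with a clear message."""
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return urllib.request.urlopen(req, timeout=60)
        except urllib.error.HTTPError as err:
            if err.code == 402:
                raise RuntimeError("Out of credits (402); upgrade the plan") from err
            if err.code == 429 and attempt < retries:
                time.sleep(delays[attempt])  # back off before retrying
                continue
            raise
```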
## Best Practices

- Start small — test with 1 URL or a small `limit` before scaling
- Use `onlyMainContent: true` in Firecrawl to remove nav/footer noise
- Choose async for large jobs — don't use sync endpoints for crawls with 50+ pages
- Store API keys securely — never hardcode them; use environment variables
- Check rate limits — Firecrawl: varies by plan; Apify: 250k requests/min global
- Prefer Firecrawl for LLM pipelines — markdown output is clean and ready for RAG/AI
- Prefer Apify for social/structured data — specialized actors handle anti-bot better