# claude-skill-registry · apify-scraper-builder
Build Apify Actors (web scrapers) using Node.js and Crawlee. Use when creating new scrapers, defining input schemas, configuring Dockerfiles, or deploying to Apify. Triggers include apify, actor, scraper, crawlee, web scraping, data extraction.
## Install

**Source · Clone the upstream repo**

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

**Claude Code · Install into `~/.claude/skills/`**

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/apify-scraper-builder" ~/.claude/skills/majiayu000-claude-skill-registry-apify-scraper-builder && rm -rf "$T"
```
**Manifest:** `skills/data/apify-scraper-builder/SKILL.md`

## Safety · Automated scan (medium risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
- global npm install
- eval/exec/Function constructor
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
## Source content
# Apify Scraper Builder
Build production-ready Apify Actors using Node.js/TypeScript and Crawlee.
## Crawler Type Decision Tree
| Scenario | Crawler | Why |
|---|---|---|
| Static HTML, no JavaScript | CheerioCrawler | Fastest, lowest memory |
| JavaScript-rendered content | PlaywrightCrawler | Modern, cross-browser |
| Legacy sites, specific Chrome behavior | PuppeteerCrawler | Chrome-specific features |
| Need to handle both static and JS | PlaywrightCrawler | More versatile |
| High-volume scraping (1000s pages) | CheerioCrawler | Best performance |
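All three crawlers share the same constructor options and request-handling API, so switching types later is mostly a class swap. A minimal sketch of that symmetry (the request limit is a placeholder; `PlaywrightCrawler` additionally requires the `playwright` package to be installed):

```typescript
import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// Options common to both crawler types.
const shared = { maxRequestsPerCrawl: 50 };

// Static HTML: the handler context exposes a Cheerio handle `$`.
const staticCrawler = new CheerioCrawler({
  ...shared,
  async requestHandler({ request, $ }) {
    console.log(request.url, $('title').text());
  },
});

// JS-rendered pages: same shape, but the context exposes a Playwright `page`.
const browserCrawler = new PlaywrightCrawler({
  ...shared,
  async requestHandler({ request, page }) {
    console.log(request.url, await page.title());
  },
});
```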
## Actor Creation Workflow

### Step 1: Initialize Project

```bash
python scripts/init_actor.py my-scraper --type cheerio
```

Or manually create the structure:

```text
my-scraper/
├── .actor/
│   ├── actor.json          # REQUIRED
│   ├── input_schema.json   # Recommended
│   └── Dockerfile          # REQUIRED
├── src/
│   └── main.ts             # Entry point
├── package.json
└── tsconfig.json
```
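The structure lists a `tsconfig.json` that the skill itself doesn't show. A minimal sketch that matches the `package.json` further below (an ESM project compiled from `src/` to `dist/`; the exact compiler options are our assumption):

```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "outDir": "dist",
    "rootDir": "src",
    "strict": true,
    "skipLibCheck": true
  },
  "include": ["src"]
}
```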
### Step 2: Configure actor.json

```json
{
  "actorSpecification": 1,
  "name": "my-scraper",
  "version": "0.0",
  "buildTag": "latest",
  "input": "./input_schema.json",
  "dockerfile": "./Dockerfile"
}
```
### Step 3: Define Input Schema

```bash
python scripts/generate_input_schema.py "Scrape product pages with URLs, max items limit, and proxy support"
```

Or use the templates from `references/input-schema-guide.md`.
### Step 4: Implement Crawler

Use the patterns from `references/crawlee-patterns.md`.
### Step 5: Validate Configuration

```bash
python scripts/validate_actor.py /path/to/actor
```
### Step 6: Deploy

```bash
apify login
apify push
```
## Project Structure

### Required Files

**.actor/actor.json**

```json
{
  "actorSpecification": 1,
  "name": "my-scraper",
  "version": "0.0",
  "buildTag": "latest",
  "minMemoryMbytes": 256,
  "maxMemoryMbytes": 4096,
  "dockerfile": "./Dockerfile",
  "input": "./input_schema.json",
  "storages": {
    "dataset": "./dataset_schema.json"
  }
}
```
**.actor/Dockerfile (Node.js)**

```dockerfile
FROM apify/actor-node:20

COPY package*.json ./

RUN npm --quiet set progress=false \
    && npm install --omit=dev --omit=optional \
    && echo "Installed NPM packages:" \
    && npm list || true \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version

COPY . ./

CMD npm start
```
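Note that this Dockerfile installs only production dependencies and never runs the TypeScript build, while the `package.json` below starts `dist/main.js`; either compile locally before pushing or build inside the image. A sketch of the latter, assuming a two-stage layout in the spirit of Apify's TypeScript templates and the base image's default `/usr/src/app` working directory:

```dockerfile
# Stage 1: install dev dependencies and compile TypeScript.
FROM apify/actor-node:20 AS builder
COPY package*.json ./
RUN npm install --include=dev
COPY . ./
RUN npm run build

# Stage 2: production image with compiled output only.
FROM apify/actor-node:20
COPY package*.json ./
RUN npm install --omit=dev --omit=optional
COPY --from=builder /usr/src/app/dist ./dist
CMD npm start
```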
**package.json**

```json
{
  "name": "my-scraper",
  "version": "0.0.1",
  "type": "module",
  "main": "dist/main.js",
  "scripts": {
    "start": "node dist/main.js",
    "build": "tsc"
  },
  "dependencies": {
    "apify": "^3.0.0",
    "crawlee": "^3.0.0"
  },
  "devDependencies": {
    "typescript": "^5.0.0"
  }
}
```
## Input Schema Editors

| Editor | Use Case |
|---|---|
| Single-line text | Name, URL |
| Multi-line text | CSS selectors, notes |
| URL list with labels | Start URLs |
| Proxy configuration | Apify Proxy settings |
| JSON object/array | Custom configuration |
| Dropdown options | Country, category |
| Boolean toggle | Debug mode |
| Integer/float | Max items, delay |
| Date selection | Date range filter |
### Common Input Schema Pattern

```json
{
  "title": "Scraper Input",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "startUrls": {
      "title": "Start URLs",
      "type": "array",
      "description": "URLs to start scraping from",
      "editor": "requestListSources",
      "prefill": [{ "url": "https://example.com" }]
    },
    "maxItems": {
      "title": "Max Items",
      "type": "integer",
      "description": "Maximum number of items to scrape",
      "default": 100,
      "minimum": 1
    },
    "proxyConfig": {
      "title": "Proxy Configuration",
      "type": "object",
      "description": "Proxy settings for the scraper",
      "editor": "proxy",
      "default": { "useApifyProxy": true }
    }
  },
  "required": ["startUrls"]
}
```
## Crawlee Patterns

### CheerioCrawler (Fast HTML Parsing)

```typescript
import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
  startUrls: { url: string }[];
  maxItems: number;
}>();

const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: input?.maxItems || 100,
  async requestHandler({ request, $, enqueueLinks }) {
    const title = $('h1').text().trim();
    const price = $('.price').text().trim();

    await Dataset.pushData({
      url: request.url,
      title,
      price,
    });

    // Enqueue pagination links
    await enqueueLinks({
      selector: 'a.next-page',
    });
  },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);

await Actor.exit();
```
### PlaywrightCrawler (JavaScript Rendering)

```typescript
import { Actor } from 'apify';
import { PlaywrightCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
  startUrls: { url: string }[];
  maxItems: number;
  proxyConfig?: { useApifyProxy: boolean };
}>();

const proxyConfiguration = await Actor.createProxyConfiguration(
  input?.proxyConfig
);

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  maxRequestsPerCrawl: input?.maxItems || 100,
  async requestHandler({ page, request, enqueueLinks }) {
    // Wait for dynamic content
    await page.waitForSelector('.product-list');

    const products = await page.$$eval('.product', items =>
      items.map(item => ({
        title: item.querySelector('h2')?.textContent?.trim(),
        price: item.querySelector('.price')?.textContent?.trim(),
      }))
    );

    for (const product of products) {
      await Dataset.pushData({
        url: request.url,
        ...product,
      });
    }

    await enqueueLinks({
      selector: 'a.pagination',
    });
  },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);

await Actor.exit();
```
### PuppeteerCrawler (Chrome-specific)

```typescript
import { Actor } from 'apify';
import { PuppeteerCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
  startUrls: { url: string }[];
}>();

const crawler = new PuppeteerCrawler({
  launchContext: {
    launchOptions: {
      headless: true,
    },
  },
  async requestHandler({ page, request }) {
    await page.waitForSelector('.content');

    const data = await page.evaluate(() => ({
      title: document.querySelector('h1')?.textContent,
      content: document.querySelector('.content')?.innerHTML,
    }));

    await Dataset.pushData({
      url: request.url,
      ...data,
    });
  },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);

await Actor.exit();
```
## Scripts

### Initialize New Actor

```bash
python scripts/init_actor.py <name> --type <cheerio|playwright|puppeteer> [--path <dir>]
```

### Validate Actor Configuration

```bash
python scripts/validate_actor.py <actor-path>
```

### Generate Input Schema

```bash
python scripts/generate_input_schema.py "<description>" [--output <path>]
```
## Deployment Commands

```bash
# Install Apify CLI
npm install -g @apify/cli

# Login to Apify
apify login

# Create new Actor from template (interactive)
apify create my-actor

# Run Actor locally
apify run --purge

# Push to Apify platform
apify push

# Build Actor remotely
apify actors build

# Call Actor remotely
apify actors call <actor-id>

# Pull Actor code from Apify
apify actors pull <actor-id>
```
## Validation Checklist

### Before Building
- Correct crawler type selected for target site
- Input schema defines all required parameters
- Dependencies in package.json are correct
### Configuration
- actor.json has actorSpecification: 1
- actor.json has valid name and version
- Dockerfile uses correct Node.js base image
- Input schema editors match field types
### Code Quality
- Error handling for network failures (see the sketch after this list)
- Proxy configuration used for production
- Rate limiting/delays configured
- Data validation before pushData
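A minimal sketch of the error-handling, rate-limiting, and validation items above, using Crawlee's built-in crawler options (the limits and selector are placeholder values):

```typescript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  // Rate limiting: cap concurrency and per-minute throughput.
  maxConcurrency: 10,
  maxRequestsPerMinute: 120,
  // Retry transient network failures before giving up.
  maxRequestRetries: 3,
  async requestHandler({ request, $ }) {
    const title = $('h1').text().trim();
    // Validate before pushing to the dataset.
    if (!title) {
      throw new Error(`No title found on ${request.url}`);
    }
    await Dataset.pushData({ url: request.url, title });
  },
  // Called once all retries are exhausted.
  async failedRequestHandler({ request }, error) {
    console.error(`Request ${request.url} failed: ${error.message}`);
  },
});
```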
### Pre-Deployment

- `apify run --purge` succeeds locally
- Output data structure is correct
- Memory limits are appropriate
## References

| Topic | File |
|---|---|
| actor.json Specification | |
| Input Schema Editors | `references/input-schema-guide.md` |
| Crawlee Patterns | `references/crawlee-patterns.md` |
## Templates
| Template | Description | Path |
|---|---|---|
| Cheerio | Fast HTML scraping | |
| Playwright | JS-rendered content | |
| Puppeteer | Chrome-specific | |