claude-skill-registry · apify-scraper-builder

Build Apify Actors (web scrapers) using Node.js and Crawlee. Use when creating new scrapers, defining input schemas, configuring Dockerfiles, or deploying to Apify. Triggers include apify, actor, scraper, crawlee, web scraping, data extraction.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/apify-scraper-builder" ~/.claude/skills/majiayu000-claude-skill-registry-apify-scraper-builder && rm -rf "$T"
manifest: skills/data/apify-scraper-builder/SKILL.md
safety · automated scan (medium risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
  • global npm install
  • eval/exec/Function constructor
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
source content

Apify Scraper Builder

Build production-ready Apify Actors using Node.js/TypeScript and Crawlee.

Crawler Type Decision Tree

| Scenario | Crawler | Why |
|---|---|---|
| Static HTML, no JavaScript | CheerioCrawler | Fastest, lowest memory |
| JavaScript-rendered content | PlaywrightCrawler | Modern, cross-browser |
| Legacy sites, specific Chrome behavior | PuppeteerCrawler | Chrome-specific features |
| Need to handle both static and JS | PlaywrightCrawler | More versatile |
| High-volume scraping (1000s of pages) | CheerioCrawler | Best performance |
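
The last two rows overlap on purpose: when one Actor must handle both static and JS-rendered pages, a common compromise is a boolean input that picks the crawler at startup. A minimal sketch, assuming a hypothetical renderJavaScript input flag:

import { Actor } from 'apify';
import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

await Actor.init();

// `renderJavaScript` is a hypothetical input flag used only in this sketch.
const input = await Actor.getInput<{ renderJavaScript?: boolean }>();
const startUrls = ['https://example.com'];

if (input?.renderJavaScript) {
    // Full browser for pages that build their content with JavaScript.
    const crawler = new PlaywrightCrawler({
        async requestHandler({ request, page }) {
            console.log(request.url, await page.title());
        },
    });
    await crawler.run(startUrls);
} else {
    // Plain HTTP + HTML parsing for static pages: far faster and cheaper.
    const crawler = new CheerioCrawler({
        async requestHandler({ request, $ }) {
            console.log(request.url, $('title').text());
        },
    });
    await crawler.run(startUrls);
}

await Actor.exit();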

Actor Creation Workflow

Step 1: Initialize Project

python scripts/init_actor.py my-scraper --type cheerio

Or manually create structure:

my-scraper/
├── .actor/
│   ├── actor.json           # REQUIRED
│   ├── input_schema.json    # Recommended
│   └── Dockerfile           # REQUIRED
├── src/
│   └── main.ts              # Entry point
├── package.json
└── tsconfig.json

Step 2: Configure actor.json

{
    "actorSpecification": 1,
    "name": "my-scraper",
    "version": "0.0",
    "buildTag": "latest",
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile"
}

Step 3: Define Input Schema

python scripts/generate_input_schema.py "Scrape product pages with URLs, max items limit, and proxy support"

Or use the templates in references/input-schema-guide.md.

Step 4: Implement Crawler

Use the patterns in references/crawlee-patterns.md.

Step 5: Validate Configuration

python scripts/validate_actor.py /path/to/actor

Step 6: Deploy

apify login
apify push

Project Structure

Required Files

.actor/actor.json

{
    "actorSpecification": 1,
    "name": "my-scraper",
    "version": "0.0",
    "buildTag": "latest",
    "minMemoryMbytes": 256,
    "maxMemoryMbytes": 4096,
    "dockerfile": "./Dockerfile",
    "input": "./input_schema.json",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}

.actor/Dockerfile (Node.js)

FROM apify/actor-node:20

COPY package*.json ./
# Install dev dependencies too: TypeScript is needed to compile src/ into dist/
RUN npm --quiet set progress=false \
    && npm install --include=dev --omit=optional \
    && echo "Installed NPM packages:" \
    && npm list || true \
    && echo "Node.js version:" \
    && node --version \
    && echo "NPM version:" \
    && npm --version

COPY . ./
# Compile TypeScript (src/main.ts -> dist/main.js) so npm start has something to run
RUN npm run build
CMD npm start

Note: the package.json below compiles TypeScript to dist/, so the image must either build it during the Docker build (as above) or use a multi-stage Dockerfile like Apify's official TypeScript template, which compiles in a builder stage and ships only production dependencies.

package.json

{
    "name": "my-scraper",
    "version": "0.0.1",
    "type": "module",
    "main": "dist/main.js",
    "scripts": {
        "start": "node dist/main.js",
        "build": "tsc"
    },
    "dependencies": {
        "apify": "^3.0.0",
        "crawlee": "^3.0.0"
    },
    "devDependencies": {
        "typescript": "^5.0.0"
    }
}
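
The project structure in Step 1 also lists tsconfig.json, which never appears above. A minimal sketch compatible with "type": "module" and the dist/ layout used here; the settings are illustrative (Apify's own TypeScript template instead extends @apify/tsconfig):

{
    "compilerOptions": {
        "target": "ES2022",
        "module": "NodeNext",
        "moduleResolution": "NodeNext",
        "outDir": "dist",
        "strict": true,
        "skipLibCheck": true
    },
    "include": ["src"]
}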

Input Schema Editors

| Editor | Use Case | Example |
|---|---|---|
| textfield | Single-line text | Name, URL |
| textarea | Multi-line text | CSS selectors, notes |
| requestListSources | URL list with labels | Start URLs |
| proxy | Proxy configuration | Apify Proxy settings |
| json | JSON object/array | Custom configuration |
| select | Dropdown options | Country, category |
| checkbox | Boolean toggle | Debug mode |
| number | Integer/float | Max items, delay |
| datepicker | Date selection | Date range filter |

Common Input Schema Pattern

{
    "title": "Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start scraping from",
            "editor": "requestListSources",
            "prefill": [{"url": "https://example.com"}]
        },
        "maxItems": {
            "title": "Max Items",
            "type": "integer",
            "description": "Maximum number of items to scrape",
            "default": 100,
            "minimum": 1
        },
        "proxyConfig": {
            "title": "Proxy Configuration",
            "type": "object",
            "description": "Proxy settings for the scraper",
            "editor": "proxy",
            "default": {"useApifyProxy": true}
        }
    },
    "required": ["startUrls"]
}
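
On the code side, this schema maps directly onto a typed Actor.getInput() call. A minimal sketch; the ScraperInput interface is defined here only for illustration and mirrors the properties above:

import { Actor } from 'apify';

// Mirrors the input schema above; defined here only for this sketch.
interface ScraperInput {
    startUrls: { url: string }[];
    maxItems?: number;
    proxyConfig?: { useApifyProxy: boolean };
}

await Actor.init();

const input = await Actor.getInput<ScraperInput>();

// `startUrls` is the only required field, so fail fast if it is missing.
if (!input?.startUrls?.length) {
    throw new Error('At least one start URL is required.');
}

const maxItems = input.maxItems ?? 100; // same fallback as the schema default

console.log(`Scraping up to ${maxItems} items from ${input.startUrls.length} start URLs.`);
await Actor.exit();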

Crawlee Patterns

CheerioCrawler (Fast HTML Parsing)

import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
}>();

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $('h1').text().trim();
        const price = $('.price').text().trim();

        await Dataset.pushData({
            url: request.url,
            title,
            price,
        });

        // Enqueue pagination links
        await enqueueLinks({
            selector: 'a.next-page',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();

PlaywrightCrawler (JavaScript Rendering)

import { Actor } from 'apify';
import { PlaywrightCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
    maxItems: number;
    proxyConfig?: { useApifyProxy: boolean };
}>();

const proxyConfiguration = await Actor.createProxyConfiguration(
    input?.proxyConfig
);

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl: input?.maxItems || 100,
    async requestHandler({ page, request, enqueueLinks }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-list');

        const products = await page.$$eval('.product', items =>
            items.map(item => ({
                title: item.querySelector('h2')?.textContent?.trim(),
                price: item.querySelector('.price')?.textContent?.trim(),
            }))
        );

        for (const product of products) {
            await Dataset.pushData({
                url: request.url,
                ...product,
            });
        }

        await enqueueLinks({
            selector: 'a.pagination',
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();
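
Browser crawls are much slower and more memory-hungry than Cheerio, so a common optimization is to block heavy resources before navigation. A sketch using Crawlee's preNavigationHooks together with Playwright's page.route(); the list of blocked resource types is illustrative:

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Abort requests for resources the scraper doesn't need.
            await page.route('**/*', (route) => {
                const type = route.request().resourceType();
                return ['image', 'font', 'media'].includes(type)
                    ? route.abort()
                    : route.continue();
            });
        },
    ],
    async requestHandler({ page, request }) {
        console.log(request.url, await page.title());
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();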

PuppeteerCrawler (Chrome-specific)

import { Actor } from 'apify';
import { PuppeteerCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput<{
    startUrls: { url: string }[];
}>();

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
        },
    },
    async requestHandler({ page, request }) {
        await page.waitForSelector('.content');

        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            content: document.querySelector('.content')?.innerHTML,
        }));

        await Dataset.pushData({
            url: request.url,
            ...data,
        });
    },
});

await crawler.run(input?.startUrls?.map(u => u.url) || []);
await Actor.exit();

Scripts

Initialize New Actor

python scripts/init_actor.py <name> --type <cheerio|playwright|puppeteer> [--path <dir>]

Validate Actor Configuration

python scripts/validate_actor.py <actor-path>

Generate Input Schema

python scripts/generate_input_schema.py "<description>" [--output <path>]

Deployment Commands

# Install Apify CLI
npm install -g apify-cli

# Login to Apify
apify login

# Create new Actor from template (interactive)
apify create my-actor

# Run Actor locally
apify run --purge

# Push to Apify platform
apify push

# Build Actor remotely
apify actors build

# Call Actor remotely
apify actors call <actor-id>

# Pull Actor code from Apify
apify actors pull <actor-id>

Validation Checklist

Before Building

  • Correct crawler type selected for target site
  • Input schema defines all required parameters
  • Dependencies in package.json are correct

Configuration

  • actor.json has actorSpecification: 1
  • actor.json has valid name and version
  • Dockerfile uses correct Node.js base image
  • Input schema editors match field types

Code Quality

  • Error handling for network failures
  • Proxy configuration used for production
  • Rate limiting/delays configured
  • Data validation before pushData (the sketch after this list shows one way to wire all four)
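
A sketch of how these four concerns might be wired into a CheerioCrawler. maxRequestRetries, maxConcurrency, maxRequestsPerMinute, and failedRequestHandler are standard Crawlee options; isValidItem is a hypothetical name for whatever validation fits your data:

import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

// Hypothetical validation helper -- adapt to your item shape.
const isValidItem = (item: { title?: string }) => Boolean(item.title);

// Proxies keep production traffic from being blocked or rate-limited.
const proxyConfiguration = await Actor.createProxyConfiguration({
    useApifyProxy: true,
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestRetries: 3,        // retry transient network failures
    maxConcurrency: 10,          // cap parallel requests
    maxRequestsPerMinute: 120,   // coarse rate limiting
    async requestHandler({ request, $ }) {
        const item = { url: request.url, title: $('h1').text().trim() };
        if (isValidItem(item)) {
            await Dataset.pushData(item); // validate before pushing
        }
    },
    // Runs after all retries are exhausted.
    failedRequestHandler({ request }, error) {
        console.error(`${request.url} failed: ${error.message}`);
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();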

Pre-Deployment

  • apify run --purge succeeds locally
  • Output data structure is correct
  • Memory limits are appropriate

References

| Topic | File |
|---|---|
| actor.json Specification | references/actor-json-spec.md |
| Input Schema Editors | references/input-schema-guide.md |
| Crawlee Patterns | references/crawlee-patterns.md |

Templates

| Template | Description | Path |
|---|---|---|
| Cheerio | Fast HTML scraping | templates/crawlee-cheerio/ |
| Playwright | JS-rendered content | templates/crawlee-playwright/ |
| Puppeteer | Chrome-specific | templates/crawlee-puppeteer/ |