Awesome-omni-skill web-scraping

This skill activates for web scraping and Actor development. It proactively discovers sitemaps/APIs, recommends optimal strategy (sitemap/API/Playwright/hybrid), and implements iteratively. For production, it guides TypeScript Actor creation via Apify CLI.

install

Source · Clone the upstream repo:
git clone https://github.com/diegosouzapw/awesome-omni-skill

Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/development/web-scraping-majiayu000" ~/.claude/skills/diegosouzapw-awesome-omni-skill-web-scraping && rm -rf "$T"

Manifest: skills/development/web-scraping-majiayu000/SKILL.md

source content

Web Scraping with Intelligent Strategy Selection

When This Skill Activates

Activate automatically when user requests:

  • "Scrape [website]"
  • "Extract data from [site]"
  • "Get product information from [URL]"
  • "Find all links/pages on [site]"
  • "I'm getting blocked" or "Getting 403 errors" (loads strategies/anti-blocking.md)
  • "Make this an Apify Actor" (loads the apify/ subdirectory)
  • "Productionize this scraper"

Proactive Workflow

This skill follows a systematic 5-phase approach to web scraping, always starting with interactive reconnaissance and ending with production-ready code.

Phase 1: INTERACTIVE RECONNAISSANCE (Critical First Step)

When user says "scrape X", immediately start with hands-on reconnaissance using MCP tools:

DO NOT jump to automated checks or implementation - reconnaissance prevents wasted effort and discovers hidden APIs.

Use Playwright MCP & Chrome DevTools MCP:

1. Open site in real browser (Playwright MCP)

  • Navigate like a real user
  • Observe page loading behavior (SSR? SPA? Loading states?)
  • Take screenshots for reference
  • Test basic interactions

2. Monitor network traffic (Chrome DevTools via Playwright)

  • Watch XHR/Fetch requests in real-time
  • Find API endpoints returning JSON (10-100x faster than HTML scraping!)
  • Analyze request/response patterns
  • Document headers, cookies, authentication tokens
  • Extract pagination parameters

3. Test site interactions

  • Pagination: URL-based? API? Infinite scroll?
  • Filtering and search: How do they work?
  • Dynamic content loading: Triggers and patterns
  • Authentication flows: Required? Optional?

4. Assess protection mechanisms

  • Cloudflare/bot detection
  • CAPTCHA requirements
  • Rate limiting behavior (test with multiple requests)
  • Fingerprinting scripts

5. Generate Intelligence Report

  • Site architecture (framework, rendering method)
  • Discovered APIs/endpoints with full specs
  • Protection mechanisms and required countermeasures
  • Optimal extraction strategy (API > Sitemap > HTML)
  • Time/complexity estimates

See workflows/reconnaissance.md for the complete reconnaissance guide with MCP examples.

Why this matters: Reconnaissance discovers hidden APIs (eliminating need for HTML scraping), identifies blockers before coding, and provides intelligence for optimal strategy selection. Never skip this step.

Phase 2: AUTOMATIC DISCOVERY (Validate Reconnaissance)

After Phase 1 reconnaissance, validate findings with automated checks:

1. Check for Sitemaps

# Automatically check these locations
curl -s https://[site]/robots.txt | grep -i Sitemap
curl -I https://[site]/sitemap.xml
curl -I https://[site]/sitemap_index.xml

Log findings clearly:

  • ✓ "Found sitemap at /sitemap.xml with ~1,234 URLs"
  • ✓ "Found sitemap index with 5 sub-sitemaps"
  • ✗ "No sitemap detected at common locations"

Why this matters: sitemaps provide instant URL discovery (60x faster than crawling).
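A sitemap is just XML, so URL extraction can be sketched in a few lines. This is an illustrative helper only (the extractSitemapUrls function and sample XML are invented for this example); production code should use a real parser such as crawlee's RobotsFile, shown in the patterns below:

```javascript
// Minimal sketch: pull <loc> URLs out of a fetched sitemap with a regex.
// Good enough for a quick count during discovery; not a full sitemap parser.
function extractSitemapUrls(xml) {
    return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)].map((m) => m[1]);
}

const sampleSitemap = `
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/products/1</loc></url>
    <url><loc>https://example.com/products/2</loc></url>
</urlset>`;

console.log(extractSitemapUrls(sampleSitemap));
// [ 'https://example.com/products/1', 'https://example.com/products/2' ]
```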

2. Investigate APIs

Prompt user:

Should I check for JSON APIs first? (Highly recommended)

Benefits of APIs vs HTML scraping:
• 10-100x faster execution
• More reliable (structured JSON vs fragile HTML)
• Less bandwidth usage
• Easier to maintain

Check for APIs? [Y/n]

If yes, guide user:

  1. Open browser DevTools → Network tab
  2. Navigate the target website
  3. Look for XHR/Fetch requests
  4. Check for common endpoint paths: /api/, /v1/, /v2/, /graphql, /_next/data/
  5. Analyze request/response format (JSON, GraphQL, REST)

Log findings:

  • ✓ "Found API: GET /api/products/{id} (returns JSON)"
  • ✓ "Found GraphQL endpoint: /graphql"
  • ✗ "No obvious public APIs detected"
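The endpoint paths listed above can also be screened programmatically against URLs captured during reconnaissance (for example, via Playwright's request events). A minimal sketch, where findApiCandidates and the sample URLs are illustrative and the path prefixes are common conventions rather than guarantees:

```javascript
// Minimal sketch: flag likely JSON API endpoints among captured request URLs.
const API_HINTS = ['/api/', '/v1/', '/v2/', '/graphql', '/_next/data/'];

function findApiCandidates(requestUrls) {
    return requestUrls.filter((url) =>
        API_HINTS.some((hint) => new URL(url).pathname.includes(hint)));
}

const captured = [
    'https://example.com/assets/app.js',
    'https://example.com/api/products?page=1',
    'https://example.com/graphql',
];
console.log(findApiCandidates(captured));
// [ 'https://example.com/api/products?page=1', 'https://example.com/graphql' ]
```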

3. Analyze Site Structure

Automatically assess:

  • JavaScript-heavy? (Look for React, Vue, Angular indicators)
  • Authentication required? (Login walls, auth tokens)
  • Page count estimate (from sitemap or site exploration)
  • Rate limiting indicators (robots.txt directives)

Phase 3: STRATEGY RECOMMENDATION

Based on Phases 1-2 findings, present 2-3 options with clear reasoning:

Example Output Template:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Analysis of example.com
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Phase 1 Intelligence (Reconnaissance):
✓ API discovered via DevTools: GET /api/products?page=N&limit=100
✓ Framework: Next.js (SSR + CSR hybrid)
✓ Protection: Cloudflare detected, rate limit ~60/min
✗ No authentication required

Phase 2 Validation:
✓ Sitemap found: 1,234 product URLs (validates API total)
✓ Static HTML fallback available if needed

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Recommended Approaches:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⭐ Option 1: Hybrid (Sitemap + API) [RECOMMENDED]
   ✓ Use sitemap to get all 1,234 product URLs instantly
   ✓ Extract product IDs from URLs
   ✓ Fetch data via API (fast, reliable JSON)

   Estimated time: 8-12 minutes
   Complexity: Low-Medium
   Data quality: Excellent
   Speed: Very Fast

⚡ Option 2: Sitemap + Playwright
   ✓ Use sitemap for URLs
   ✓ Scrape HTML with Playwright

   Estimated time: 15-20 minutes
   Complexity: Medium
   Data quality: Good
   Speed: Fast

🔧 Option 3: Pure API (if sitemap fails)
   ✓ Discover product IDs through API exploration
   ✓ Fetch all data via API

   Estimated time: 10-15 minutes
   Complexity: Medium
   Data quality: Excellent
   Speed: Fast

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
My Recommendation: Option 1 (Hybrid)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Reasoning:
• Sitemap gives us complete URL list (instant discovery)
• API provides clean, structured data (no HTML parsing)
• Combines speed of sitemap with reliability of API
• Best of both worlds

Proceed with Option 1? [Y/n]

Key principles:

  • Always recommend the SIMPLEST approach that works
  • Sitemap > API > Playwright (in terms of simplicity)
  • Show time estimates and complexity
  • Explain reasoning clearly
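The "simplest approach that works" principle can be sketched as a small decision function. Field names such as hasSitemap and needsJsRendering are illustrative, not part of the skill's API:

```javascript
// Minimal sketch: pick the simplest strategy the discovery findings support.
function recommendStrategy({ hasSitemap, hasApi, needsJsRendering }) {
    if (hasSitemap && hasApi) return 'hybrid (sitemap + API)';
    if (hasApi) return 'pure API';
    if (hasSitemap && !needsJsRendering) return 'sitemap + HTTP (Cheerio)';
    if (hasSitemap) return 'sitemap + Playwright';
    return needsJsRendering ? 'Playwright crawl' : 'HTTP crawl (Cheerio)';
}

console.log(recommendStrategy({ hasSitemap: true, hasApi: true, needsJsRendering: false }));
// hybrid (sitemap + API)
```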

Phase 4: ITERATIVE IMPLEMENTATION

Implement scraper incrementally, starting simple and adding complexity only as needed.

Core Pattern:

  1. Implement recommended approach (minimal code)
  2. Test with small batch (5-10 items)
  3. Validate data quality
  4. Scale to full dataset or fallback
  5. Handle blocking if encountered
  6. Add robustness (error handling, retries, logging)
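Steps 2-4 above can be sketched as a small batching helper: carve off a test batch, validate every record, and only then scrape the remainder. The helpers planBatches and validateRecords are illustrative names, not part of any library:

```javascript
// Minimal sketch: split work into a small test batch plus the remainder.
function planBatches(items, testSize = 5) {
    return { testBatch: items.slice(0, testSize), remainder: items.slice(testSize) };
}

// Minimal sketch: require every record to have non-empty required fields.
function validateRecords(records, requiredFields) {
    return records.every((r) =>
        requiredFields.every((f) => r[f] != null && r[f] !== ''));
}

const urls = Array.from({ length: 20 }, (_, i) => `https://example.com/p/${i}`);
const { testBatch, remainder } = planBatches(urls);
console.log(testBatch.length, remainder.length); // 5 15
```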

See workflows/implementation.md for complete implementation patterns and code examples.

Phase 5: PRODUCTIONIZATION (On Request)

Convert scraper to production-ready Apify Actor.

Activation triggers:

  • "Make this an Apify Actor"
  • "Productionize this scraper"
  • "Deploy to Apify"
  • "Create an actor from this"

Core Pattern:

  1. Confirm TypeScript preference (STRONGLY RECOMMENDED)
  2. Initialize with the apify create command (CRITICAL)
  3. Port scraping logic to Actor format
  4. Test locally and deploy

See workflows/productionization.md for the complete productionization workflow, and the apify/ directory for all Actor development guides.

Quick Reference

| Task | Pattern/Command | Documentation |
| --- | --- | --- |
| Reconnaissance | Playwright + DevTools MCP | workflows/reconnaissance.md |
| Find sitemaps | RobotsFile.find(url) | strategies/sitemap-discovery.md |
| Filter sitemap URLs | RequestList + regex | reference/regex-patterns.md |
| Discover APIs | DevTools → Network tab | strategies/api-discovery.md |
| Playwright scraping | PlaywrightCrawler | strategies/playwright-scraping.md |
| HTTP scraping | CheerioCrawler | strategies/cheerio-scraping.md |
| Hybrid approach | Sitemap + API | strategies/hybrid-approaches.md |
| Handle blocking | fingerprint-suite + proxies | strategies/anti-blocking.md |
| Fingerprint configs | Quick patterns | reference/fingerprint-patterns.md |
| Create Apify Actor | apify create | apify/cli-workflow.md |
| Template selection | Cheerio vs Playwright | workflows/productionization.md |
| Input schema | .actor/input_schema.json | apify/input-schemas.md |
| Deploy actor | apify push | apify/deployment.md |

Common Patterns

Pattern 1: Sitemap-Based Scraping

import { RobotsFile, PlaywrightCrawler, Dataset } from 'crawlee';

// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        const data = await page.evaluate(() => ({
            title: document.title,
            // ... extract data
        }));
        await Dataset.pushData(data);
    },
});

await crawler.addRequests(urls);
await crawler.run();

See examples/sitemap-basic.js for a complete example.

Pattern 2: API-Based Scraping

import { gotScraping } from 'got-scraping';

const productIds = [123, 456, 789];

for (const id of productIds) {
    const response = await gotScraping({
        url: `https://api.example.com/products/${id}`,
        responseType: 'json',
    });

    console.log(response.body);
}

See examples/api-scraper.js for a complete example.

Pattern 3: Hybrid (Sitemap + API)

import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract IDs from URLs
const productIds = urls
    .map(url => url.match(/\/products\/(\d+)/)?.[1])
    .filter(Boolean);

// Fetch data via API
for (const id of productIds) {
    const data = await gotScraping({
        url: `https://api.shop.com/v1/products/${id}`,
        responseType: 'json',
    });
    // Process data
}

See examples/hybrid-sitemap-api.js for a complete example.

Directory Navigation

This skill uses progressive disclosure - detailed information is organized in subdirectories and loaded only when needed.

Workflows (Implementation Patterns)

For: Step-by-step workflow guides for each phase

  • workflows/reconnaissance.md
    - Phase 1 interactive reconnaissance (CRITICAL)
  • workflows/implementation.md
    - Phase 4 iterative implementation patterns
  • workflows/productionization.md
    - Phase 5 Apify Actor creation workflow

Strategies (Deep Dives)

For: Detailed guides on specific scraping approaches

  • strategies/sitemap-discovery.md
    - Complete sitemap guide (4 patterns)
  • strategies/api-discovery.md
    - Finding and using APIs
  • strategies/playwright-scraping.md
    - Browser-based scraping
  • strategies/cheerio-scraping.md
    - HTTP-only scraping
  • strategies/hybrid-approaches.md
    - Combining strategies
  • strategies/anti-blocking.md
    - Fingerprinting & proxies for blocked sites

Examples (Runnable Code)

For: Working code to reference or execute

JavaScript Learning Examples (Simple standalone scripts):

  • examples/sitemap-basic.js
    - Simple sitemap scraper
  • examples/api-scraper.js
    - Pure API approach
  • examples/playwright-basic.js
    - Basic Playwright scraper
  • examples/hybrid-sitemap-api.js
    - Combined approach
  • examples/iterative-fallback.js
    - Try sitemap→API→Playwright

TypeScript Production Examples (Complete Actors):

  • apify/examples/basic-scraper/
    - Sitemap + Playwright
  • apify/examples/anti-blocking/
    - Fingerprinting + proxies
  • apify/examples/hybrid-api/
    - Sitemap + API (optimal)

Reference (Quick Lookup)

For: Quick patterns and troubleshooting

  • reference/regex-patterns.md
    - Common URL regex patterns
  • reference/selector-guide.md
    - Playwright selector strategies
  • reference/fingerprint-patterns.md
    - Common fingerprint configurations
  • reference/anti-patterns.md
    - What NOT to do

Apify (Production Deployment)

For: Creating production Apify Actors

  • apify/README.md
    - When and how to use Apify
  • apify/typescript-first.md
    - Why TypeScript for actors
  • apify/cli-workflow.md
    - apify create workflow (CRITICAL)
  • apify/initialization.md
    - Complete setup guide
  • apify/input-schemas.md
    - Input validation patterns
  • apify/configuration.md
    - actor.json setup
  • apify/deployment.md
    - Testing and deployment
  • apify/templates/
    - TypeScript boilerplate

Note: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.

Core Principles

1. Progressive Enhancement

Start with the simplest approach that works:

  • Sitemap > API > Playwright
  • Static > Dynamic
  • HTTP > Browser

2. Proactive Discovery

Always investigate before implementing:

  • Check for sitemaps automatically
  • Look for APIs (ask user to check DevTools)
  • Analyze site structure

3. Iterative Implementation

Build incrementally:

  • Small test batch first (5-10 items)
  • Validate quality
  • Scale or fallback
  • Add robustness last

4. Production-Ready Code

When productionizing:

  • Use TypeScript (strongly recommended)
  • Use apify create (never manual setup)
  • Add proper error handling
  • Include logging and monitoring

Remember: Sitemaps first, APIs second, scraping last!

For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.