Awesome-omni-skill web-scraping

This skill activates for web scraping and Actor development. It proactively discovers sitemaps/APIs, recommends optimal strategy (sitemap/API/Playwright/hybrid), and implements iteratively. For production, it guides TypeScript Actor creation via Apify CLI.

install

Source · Clone the upstream repo:
git clone https://github.com/diegosouzapw/awesome-omni-skill

Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/development/web-scraping-majiayu000" ~/.claude/skills/diegosouzapw-awesome-omni-skill-web-scraping && rm -rf "$T"

Manifest: skills/development/web-scraping-majiayu000/SKILL.md

source content

Web Scraping with Intelligent Strategy Selection

When This Skill Activates

Activate automatically when user requests:

  • "Scrape [website]"
  • "Extract data from [site]"
  • "Get product information from [URL]"
  • "Find all links/pages on [site]"
  • "I'm getting blocked" or "Getting 403 errors" (loads strategies/anti-blocking.md)
  • "Make this an Apify Actor" (loads the apify/ subdirectory)
  • "Productionize this scraper"

Proactive Workflow

This skill follows a systematic 5-phase approach to web scraping, always starting with interactive reconnaissance and ending with production-ready code.

Phase 1: INTERACTIVE RECONNAISSANCE (Critical First Step)

When user says "scrape X", immediately start with hands-on reconnaissance using MCP tools:

DO NOT jump to automated checks or implementation - reconnaissance prevents wasted effort and discovers hidden APIs.

Use Playwright MCP & Chrome DevTools MCP:

1. Open site in real browser (Playwright MCP)

  • Navigate like a real user
  • Observe page loading behavior (SSR? SPA? Loading states?)
  • Take screenshots for reference
  • Test basic interactions

2. Monitor network traffic (Chrome DevTools via Playwright)

  • Watch XHR/Fetch requests in real-time
  • Find API endpoints returning JSON (10-100x faster than HTML scraping!)
  • Analyze request/response patterns
  • Document headers, cookies, authentication tokens
  • Extract pagination parameters

3. Test site interactions

  • Pagination: URL-based? API? Infinite scroll?
  • Filtering and search: How do they work?
  • Dynamic content loading: Triggers and patterns
  • Authentication flows: Required? Optional?

4. Assess protection mechanisms

  • Cloudflare/bot detection
  • CAPTCHA requirements
  • Rate limiting behavior (test with multiple requests)
  • Fingerprinting scripts

5. Generate Intelligence Report

  • Site architecture (framework, rendering method)
  • Discovered APIs/endpoints with full specs
  • Protection mechanisms and required countermeasures
  • Optimal extraction strategy (API > Sitemap > HTML)
  • Time/complexity estimates

See workflows/reconnaissance.md for the complete reconnaissance guide with MCP examples.

Why this matters: Reconnaissance discovers hidden APIs (eliminating need for HTML scraping), identifies blockers before coding, and provides intelligence for optimal strategy selection. Never skip this step.

Phase 2: AUTOMATIC DISCOVERY (Validate Reconnaissance)

After Phase 1 reconnaissance, validate findings with automated checks:

1. Check for Sitemaps

# Automatically check these locations
curl -s https://[site]/robots.txt | grep -i Sitemap
curl -I https://[site]/sitemap.xml
curl -I https://[site]/sitemap_index.xml

Log findings clearly:

  • ✓ "Found sitemap at /sitemap.xml with ~1,234 URLs"
  • ✓ "Found sitemap index with 5 sub-sitemaps"
  • ✗ "No sitemap detected at common locations"

Why this matters: sitemaps provide instant URL discovery (60x faster than crawling).
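A sitemap is just XML, so URL extraction can be sketched in a few lines. This is an illustrative helper only (the extractSitemapUrls function and sample XML are invented for this example); production code should use a real parser such as crawlee's RobotsFile, shown in the patterns below:

```javascript
// Minimal sketch: pull <loc> URLs out of a fetched sitemap with a regex.
// Good enough for a quick count during discovery; not a full sitemap parser.
function extractSitemapUrls(xml) {
    return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)].map((m) => m[1]);
}

const sampleSitemap = `
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/products/1</loc></url>
    <url><loc>https://example.com/products/2</loc></url>
</urlset>`;

console.log(extractSitemapUrls(sampleSitemap));
// [ 'https://example.com/products/1', 'https://example.com/products/2' ]
```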

2. Investigate APIs

Prompt user:

Should I check for JSON APIs first? (Highly recommended)

Benefits of APIs vs HTML scraping:
• 10-100x faster execution
• More reliable (structured JSON vs fragile HTML)
• Less bandwidth usage
• Easier to maintain

Check for APIs? [Y/n]

If yes, guide user:

  1. Open browser DevTools → Network tab
  2. Navigate the target website
  3. Look for XHR/Fetch requests
  4. Check for common endpoint paths: /api/, /v1/, /v2/, /graphql, /_next/data/
  5. Analyze request/response format (JSON, GraphQL, REST)

Log findings:

  • ✓ "Found API: GET /api/products/{id} (returns JSON)"
  • ✓ "Found GraphQL endpoint: /graphql"
  • ✗ "No obvious public APIs detected"
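The endpoint paths listed above can also be screened programmatically against URLs captured during reconnaissance (for example, via Playwright's request events). A minimal sketch, where findApiCandidates and the sample URLs are illustrative and the path prefixes are common conventions rather than guarantees:

```javascript
// Minimal sketch: flag likely JSON API endpoints among captured request URLs.
const API_HINTS = ['/api/', '/v1/', '/v2/', '/graphql', '/_next/data/'];

function findApiCandidates(requestUrls) {
    return requestUrls.filter((url) =>
        API_HINTS.some((hint) => new URL(url).pathname.includes(hint)));
}

const captured = [
    'https://example.com/assets/app.js',
    'https://example.com/api/products?page=1',
    'https://example.com/graphql',
];
console.log(findApiCandidates(captured));
// [ 'https://example.com/api/products?page=1', 'https://example.com/graphql' ]
```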

3. Analyze Site Structure

Automatically assess:

  • JavaScript-heavy? (Look for React, Vue, Angular indicators)
  • Authentication required? (Login walls, auth tokens)
  • Page count estimate (from sitemap or site exploration)
  • Rate limiting indicators (robots.txt directives)

Phase 3: STRATEGY RECOMMENDATION

Based on Phases 1-2 findings, present 2-3 options with clear reasoning:

Example Output Template:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Analysis of example.com
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Phase 1 Intelligence (Reconnaissance):
✓ API discovered via DevTools: GET /api/products?page=N&limit=100
✓ Framework: Next.js (SSR + CSR hybrid)
✓ Protection: Cloudflare detected, rate limit ~60/min
✗ No authentication required

Phase 2 Validation:
✓ Sitemap found: 1,234 product URLs (validates API total)
✓ Static HTML fallback available if needed

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Recommended Approaches:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⭐ Option 1: Hybrid (Sitemap + API) [RECOMMENDED]
   ✓ Use sitemap to get all 1,234 product URLs instantly
   ✓ Extract product IDs from URLs
   ✓ Fetch data via API (fast, reliable JSON)

   Estimated time: 8-12 minutes
   Complexity: Low-Medium
   Data quality: Excellent
   Speed: Very Fast

⚡ Option 2: Sitemap + Playwright
   ✓ Use sitemap for URLs
   ✓ Scrape HTML with Playwright

   Estimated time: 15-20 minutes
   Complexity: Medium
   Data quality: Good
   Speed: Fast

🔧 Option 3: Pure API (if sitemap fails)
   ✓ Discover product IDs through API exploration
   ✓ Fetch all data via API

   Estimated time: 10-15 minutes
   Complexity: Medium
   Data quality: Excellent
   Speed: Fast

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
My Recommendation: Option 1 (Hybrid)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Reasoning:
• Sitemap gives us complete URL list (instant discovery)
• API provides clean, structured data (no HTML parsing)
• Combines speed of sitemap with reliability of API
• Best of both worlds

Proceed with Option 1? [Y/n]

Key principles:

  • Always recommend the SIMPLEST approach that works
  • Sitemap > API > Playwright (in terms of simplicity)
  • Show time estimates and complexity
  • Explain reasoning clearly
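The "simplest approach that works" principle can be sketched as a small decision function. Field names such as hasSitemap and needsJsRendering are illustrative, not part of the skill's API:

```javascript
// Minimal sketch: pick the simplest strategy the discovery findings support.
function recommendStrategy({ hasSitemap, hasApi, needsJsRendering }) {
    if (hasSitemap && hasApi) return 'hybrid (sitemap + API)';
    if (hasApi) return 'pure API';
    if (hasSitemap && !needsJsRendering) return 'sitemap + HTTP (Cheerio)';
    if (hasSitemap) return 'sitemap + Playwright';
    return needsJsRendering ? 'Playwright crawl' : 'HTTP crawl (Cheerio)';
}

console.log(recommendStrategy({ hasSitemap: true, hasApi: true, needsJsRendering: false }));
// hybrid (sitemap + API)
```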

Phase 4: ITERATIVE IMPLEMENTATION

Implement scraper incrementally, starting simple and adding complexity only as needed.

Core Pattern:

  1. Implement recommended approach (minimal code)
  2. Test with small batch (5-10 items)
  3. Validate data quality
  4. Scale to full dataset or fallback
  5. Handle blocking if encountered
  6. Add robustness (error handling, retries, logging)
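Steps 2-4 above can be sketched as a small batching helper: carve off a test batch, validate every record, and only then scrape the remainder. The helpers planBatches and validateRecords are illustrative names, not part of any library:

```javascript
// Minimal sketch: split work into a small test batch plus the remainder.
function planBatches(items, testSize = 5) {
    return { testBatch: items.slice(0, testSize), remainder: items.slice(testSize) };
}

// Minimal sketch: require every record to have non-empty required fields.
function validateRecords(records, requiredFields) {
    return records.every((r) =>
        requiredFields.every((f) => r[f] != null && r[f] !== ''));
}

const urls = Array.from({ length: 20 }, (_, i) => `https://example.com/p/${i}`);
const { testBatch, remainder } = planBatches(urls);
console.log(testBatch.length, remainder.length); // 5 15
```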

See workflows/implementation.md for complete implementation patterns and code examples.

Phase 5: PRODUCTIONIZATION (On Request)

Convert scraper to production-ready Apify Actor.

Activation triggers:

  • "Make this an Apify Actor"
  • "Productionize this scraper"
  • "Deploy to Apify"
  • "Create an actor from this"

Core Pattern:

  1. Confirm TypeScript preference (STRONGLY RECOMMENDED)
  2. Initialize with the apify create command (CRITICAL)
  3. Port scraping logic to Actor format
  4. Test locally and deploy

See workflows/productionization.md for the complete productionization workflow, and the apify/ directory for all Actor development guides.

Quick Reference

| Task | Pattern/Command | Documentation |
| --- | --- | --- |
| Reconnaissance | Playwright + DevTools MCP | workflows/reconnaissance.md |
| Find sitemaps | RobotsFile.find(url) | strategies/sitemap-discovery.md |
| Filter sitemap URLs | RequestList + regex | reference/regex-patterns.md |
| Discover APIs | DevTools → Network tab | strategies/api-discovery.md |
| Playwright scraping | PlaywrightCrawler | strategies/playwright-scraping.md |
| HTTP scraping | CheerioCrawler | strategies/cheerio-scraping.md |
| Hybrid approach | Sitemap + API | strategies/hybrid-approaches.md |
| Handle blocking | fingerprint-suite + proxies | strategies/anti-blocking.md |
| Fingerprint configs | Quick patterns | reference/fingerprint-patterns.md |
| Create Apify Actor | apify create | apify/cli-workflow.md |
| Template selection | Cheerio vs Playwright | workflows/productionization.md |
| Input schema | .actor/input_schema.json | apify/input-schemas.md |
| Deploy actor | apify push | apify/deployment.md |

Common Patterns

Pattern 1: Sitemap-Based Scraping

import { RobotsFile, PlaywrightCrawler, Dataset } from 'crawlee';

// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        const data = await page.evaluate(() => ({
            title: document.title,
            // ... extract data
        }));
        await Dataset.pushData(data);
    },
});

await crawler.addRequests(urls);
await crawler.run();

See examples/sitemap-basic.js for a complete example.

Pattern 2: API-Based Scraping

import { gotScraping } from 'got-scraping';

const productIds = [123, 456, 789];

for (const id of productIds) {
    const response = await gotScraping({
        url: `https://api.example.com/products/${id}`,
        responseType: 'json',
    });

    console.log(response.body);
}

See examples/api-scraper.js for a complete example.

Pattern 3: Hybrid (Sitemap + API)

import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract IDs from URLs
const productIds = urls
    .map(url => url.match(/\/products\/(\d+)/)?.[1])
    .filter(Boolean);

// Fetch data via API
for (const id of productIds) {
    const data = await gotScraping({
        url: `https://api.shop.com/v1/products/${id}`,
        responseType: 'json',
    });
    // Process data
}

See examples/hybrid-sitemap-api.js for a complete example.

Directory Navigation

This skill uses progressive disclosure - detailed information is organized in subdirectories and loaded only when needed.

Workflows (Implementation Patterns)

For: Step-by-step workflow guides for each phase

  • workflows/reconnaissance.md
    - Phase 1 interactive reconnaissance (CRITICAL)
  • workflows/implementation.md
    - Phase 4 iterative implementation patterns
  • workflows/productionization.md
    - Phase 5 Apify Actor creation workflow

Strategies (Deep Dives)

For: Detailed guides on specific scraping approaches

  • strategies/sitemap-discovery.md
    - Complete sitemap guide (4 patterns)
  • strategies/api-discovery.md
    - Finding and using APIs
  • strategies/playwright-scraping.md
    - Browser-based scraping
  • strategies/cheerio-scraping.md
    - HTTP-only scraping
  • strategies/hybrid-approaches.md
    - Combining strategies
  • strategies/anti-blocking.md
    - Fingerprinting & proxies for blocked sites

Examples (Runnable Code)

For: Working code to reference or execute

JavaScript Learning Examples (Simple standalone scripts):

  • examples/sitemap-basic.js
    - Simple sitemap scraper
  • examples/api-scraper.js
    - Pure API approach
  • examples/playwright-basic.js
    - Basic Playwright scraper
  • examples/hybrid-sitemap-api.js
    - Combined approach
  • examples/iterative-fallback.js
    - Try sitemap→API→Playwright

TypeScript Production Examples (Complete Actors):

  • apify/examples/basic-scraper/
    - Sitemap + Playwright
  • apify/examples/anti-blocking/
    - Fingerprinting + proxies
  • apify/examples/hybrid-api/
    - Sitemap + API (optimal)

Reference (Quick Lookup)

For: Quick patterns and troubleshooting

  • reference/regex-patterns.md
    - Common URL regex patterns
  • reference/selector-guide.md
    - Playwright selector strategies
  • reference/fingerprint-patterns.md
    - Common fingerprint configurations
  • reference/anti-patterns.md
    - What NOT to do

Apify (Production Deployment)

For: Creating production Apify Actors

  • apify/README.md
    - When and how to use Apify
  • apify/typescript-first.md
    - Why TypeScript for actors
  • apify/cli-workflow.md
    - apify create workflow (CRITICAL)
  • apify/initialization.md
    - Complete setup guide
  • apify/input-schemas.md
    - Input validation patterns
  • apify/configuration.md
    - actor.json setup
  • apify/deployment.md
    - Testing and deployment
  • apify/templates/
    - TypeScript boilerplate

Note: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.

Core Principles

1. Progressive Enhancement

Start with the simplest approach that works:

  • Sitemap > API > Playwright
  • Static > Dynamic
  • HTTP > Browser

2. Proactive Discovery

Always investigate before implementing:

  • Check for sitemaps automatically
  • Look for APIs (ask user to check DevTools)
  • Analyze site structure

3. Iterative Implementation

Build incrementally:

  • Small test batch first (5-10 items)
  • Validate quality
  • Scale or fallback
  • Add robustness last

4. Production-Ready Code

When productionizing:

  • Use TypeScript (strongly recommended)
  • Use apify create (never manual setup)
  • Add proper error handling
  • Include logging and monitoring

Remember: Sitemaps first, APIs second, scraping last!

For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.