# web-scraping

This skill activates for web scraping and Actor development. It proactively discovers APIs via traffic interception, recommends optimal strategy (traffic interception/sitemap/API/DOM scraping/hybrid), and implements iteratively. For production, it guides TypeScript Actor creation via Apify CLI.
Install:

```bash
git clone https://github.com/yfe404/web-scraper
git clone --depth=1 https://github.com/yfe404/web-scraper ~/.claude/skills/yfe404-web-scraper-web-scraping
```
# Web Scraping with Intelligent Strategy Selection
## When This Skill Activates
Activate automatically when the user requests:
- "Scrape [website]"
- "Extract data from [site]"
- "Get product information from [URL]"
- "Find all links/pages on [site]"
- "I'm getting blocked" or "Getting 403 errors" (loads `strategies/anti-blocking.md`)
- "Make this an Apify Actor" (loads the `apify/` subdirectory)
- "Productionize this scraper"
## Input Parsing
Determine reconnaissance depth from user request:
| User Says | Mode | Phases Run |
|---|---|---|
| "quick recon", "just check", "what framework" | Quick | Phase 0 only |
| "scrape X", "extract data from X" (default) | Standard | Phases 0-3 + 5, Phase 4 only if protection signals detected |
| "full recon", "deep scan", "production scraping" | Full | All phases (0-5) including protection testing |
Default is Standard mode. Escalate to Full if protection signals appear during any phase.
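The mode selection above can be sketched as a small phrase matcher. This is an illustration only; `pickMode` and its phrase lists are hypothetical, not part of the skill's tooling:

```javascript
// Illustrative sketch of the mode table above. The trigger phrases come
// straight from the table; real requests would need fuzzier matching.
function pickMode(request) {
  const text = request.toLowerCase();
  if (/(quick recon|just check|what framework)/.test(text)) return 'quick';
  if (/(full recon|deep scan|production scraping)/.test(text)) return 'full';
  return 'standard'; // default mode
}

console.log(pickMode('Run a quick recon on example.com')); // → quick
console.log(pickMode('Scrape https://example.com/products')); // → standard
```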
## Adaptive Reconnaissance Workflow
This skill uses an adaptive phased workflow with quality gates. Each gate asks "Do I have enough?" — continue only when the answer is no.
See `strategies/framework-signatures.md` for the framework detection tables referenced throughout.
### Phase 0: QUICK ASSESSMENT (curl, no browser)
Gather maximum intelligence with minimum cost — a single HTTP request.
**Step 0a: Fetch raw HTML and headers**

```bash
curl -s -D- -L "https://target.com/page" -o response.html
```
**Step 0b: Check response headers**
- Match headers against `strategies/framework-signatures.md` → Response Header Signatures table
- Note `Server`, `X-Powered-By`, `X-Shopify-Stage`, `Set-Cookie` (protection markers)
- Check the HTTP status code (200 = accessible, 403 = protected, 3xx = redirects)
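A minimal sketch of this header check — the `MARKERS` map and `classifyHeaders` helper are illustrative, not proxy-mcp tools:

```javascript
// Classify raw response headers captured by the curl above.
// Header names mirror the markers listed; the category labels are examples.
const MARKERS = {
  'server': 'infrastructure',
  'x-powered-by': 'framework',
  'x-shopify-stage': 'platform (Shopify)',
  'cf-ray': 'protection (Cloudflare)',
  'set-cookie': 'session / protection cookies',
};

function classifyHeaders(rawHeaders) {
  const hits = [];
  for (const line of rawHeaders.split('\n')) {
    const name = line.split(':')[0].trim().toLowerCase();
    if (MARKERS[name]) hits.push(`${name}: ${MARKERS[name]}`);
  }
  return hits;
}

console.log(classifyHeaders('HTTP/2 200\nserver: cloudflare\ncf-ray: 8a1b2c3d4e5f-CDG'));
// → [ 'server: infrastructure', 'cf-ray: protection (Cloudflare)' ]
```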
**Step 0c: Check the Known Major Sites table**
- Match the domain against `strategies/framework-signatures.md` → Known Major Sites
- If matched: use the specified data strategy and skip generic pattern scanning
**Step 0d: Detect framework from HTML**
- Search the raw HTML for signatures in `strategies/framework-signatures.md` → HTML Signatures table
- Look for `__NEXT_DATA__`, `__NUXT__`, `ld+json`, `/wp-content/`, `data-reactroot`
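A hedged sketch of this signature scan. The signature-to-framework map below is illustrative and far smaller than the real tables in `strategies/framework-signatures.md`:

```javascript
// Scan raw HTML for framework marker strings (the markers named above).
const FRAMEWORK_SIGNATURES = {
  'Next.js': '__NEXT_DATA__',
  'Nuxt': '__NUXT__',
  'WordPress': '/wp-content/',
  'React': 'data-reactroot',
  'Structured data (ld+json)': 'ld+json',
};

function detectFrameworks(html) {
  // Return every framework whose marker appears in the raw HTML.
  return Object.entries(FRAMEWORK_SIGNATURES)
    .filter(([, marker]) => html.includes(marker))
    .map(([name]) => name);
}

// Example: a page embedding Next.js bootstrap data.
console.log(detectFrameworks('<script id="__NEXT_DATA__" type="application/json">{}</script>'));
// → [ 'Next.js' ]
```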
**Step 0e: Search for target data points**
- For each data point the user wants: search the raw HTML for that content
- Track which data points are found vs. missing
- Check for sitemaps:

```bash
curl -s https://[site]/robots.txt | grep -i Sitemap
```
**Step 0f: Note protection signals**
- 403/503 status, Cloudflare challenge HTML, CAPTCHA elements, `cf-ray` header
- Record for the Phase 4 decision
See `strategies/cheerio-vs-browser-test.md` for the Cheerio viability assessment.
**QUALITY GATE A:** All target data points found in raw HTML + no protection signals?
- YES → Skip to Phase 3 (Validate Findings). No browser needed.
- NO → Continue to Phase 1.
### Phase 1: BROWSER RECONNAISSANCE (only if Phase 0 needs it)
Launch browser only for data points missing from raw HTML or when JavaScript rendering is required.
**Step 1a: Initialize browser session**
- `proxy_start()` → Start traffic interception proxy
- `interceptor_chrome_launch(url, stealthMode: true)` → Launch Chrome with anti-detection
- `interceptor_chrome_devtools_attach(target_id)` → Attach DevTools bridge
- `interceptor_chrome_devtools_screenshot()` → Capture visual state
**Step 1b: Capture traffic and rendered DOM**
- `proxy_list_traffic()` → Review all traffic from page load
- `proxy_search_traffic(query: "application/json")` → Find JSON responses
- `interceptor_chrome_devtools_list_network(resource_types: ["xhr", "fetch"])` → XHR/fetch calls
- `interceptor_chrome_devtools_snapshot()` → Accessibility tree (rendered DOM)
**Step 1c: Search rendered DOM for missing data points**
- For each data point NOT found in Phase 0: search the rendered DOM
- Use the framework-specific search strategy from `strategies/framework-signatures.md` → Framework → Search Strategy table
- Only search patterns relevant to the detected framework
**Step 1d: Inspect discovered endpoints**
- `proxy_get_exchange(exchange_id)` → Full request/response for promising endpoints
- Document: method, headers, auth, response structure, pagination
**QUALITY GATE B:** All target data points now covered (raw HTML + rendered DOM + traffic)?
- YES → Skip to Phase 3 (Validate Findings). No deep scan needed.
- NO → Continue to Phase 2 for the missing data points only.
### Phase 2: DEEP SCAN (only for missing data points)
Targeted investigation for data points not yet found. Only search for what's missing.
**Step 2a: Test interactions for missing data**
- `proxy_clear_traffic()` before each action → Isolate API calls
- `humanizer_click(target_id, selector)` → Trigger dynamic content loads
- `humanizer_scroll(target_id, direction, amount)` → Trigger lazy loading / infinite scroll
- `humanizer_idle(target_id, duration_ms)` → Wait for delayed content
- After each action: `proxy_list_traffic()` → Check for new API calls
**Step 2b: Sniff APIs (framework-aware)**
- Search only patterns relevant to the detected framework:
  - Next.js → `proxy_list_traffic(url_filter: "/_next/data/")`
  - WordPress → `proxy_list_traffic(url_filter: "/wp-json/")`
  - GraphQL → `proxy_search_traffic(query: "graphql")`
  - Generic → `proxy_list_traffic(url_filter: "/api/")` + `proxy_search_traffic(query: "application/json")`
- Skip patterns that don't apply to the detected framework
**Step 2c: Test pagination and filtering**
- Only if pagination data is a missing data point or needed for coverage assessment
- `proxy_clear_traffic()` → click next page → `proxy_list_traffic(url_filter: "page=")`
- Document the pagination type (URL-based, API offset, cursor, infinite scroll)
**QUALITY GATE C:** Enough data points covered for a useful report?
- YES → Go to Phase 3.
- NO → Document the gaps and go to Phase 3 anyway (the report will note missing data in the self-critique).
### Phase 3: VALIDATE FINDINGS
Every claimed extraction method must be verified. A data point is not "found" until the extraction path is specified and tested.
See `strategies/cheerio-vs-browser-test.md` for validation methodology.
**Step 3a: Validate CSS selectors**
- For each Cheerio/selector-based method: confirm the selector matches actual HTML
- Test against raw HTML (curl output) or rendered DOM (snapshot)
- Confirm selector extracts the correct value, not a different element
**Step 3b: Validate JSON paths**
- For each JSON extraction (e.g., `__NEXT_DATA__`, API response): confirm the path resolves
- Parse the JSON, follow the path, and verify it returns the expected data type and value
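A hedged sketch of this check — `resolvePath` is an illustrative helper (not a proxy-mcp tool), and the payload shape is an example:

```javascript
// Follow a dotted path into parsed JSON and verify the value before
// marking the extraction VALIDATED in the report.
function resolvePath(obj, path) {
  return path.split('.').reduce((node, key) => (node == null ? undefined : node[key]), obj);
}

// Example __NEXT_DATA__-shaped payload (illustrative values).
const nextData = JSON.parse(
  '{"props":{"pageProps":{"product":{"name":"Widget","price":19.99}}}}'
);

const price = resolvePath(nextData, 'props.pageProps.product.price');
console.log(typeof price === 'number' && price > 0 ? 'VALIDATED' : 'FAILED', price);
// → VALIDATED 19.99
```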
**Step 3c: Validate API endpoints**
- For each discovered API: replay the request (curl or `proxy_get_exchange`)
- Confirm: response status 200, expected data structure, correct values
- Test pagination if claimed (at least page 1 and page 2)
**Step 3d: Downgrade or re-investigate failures**
- If a selector doesn't match: try alternative selectors, or downgrade to PARTIAL confidence
- If an API returns 403: note protection requirement, flag for Phase 4
- If a JSON path is wrong: re-examine the JSON structure, correct the path
### Phase 4: PROTECTION TESTING (conditional)
See `strategies/proxy-escalation.md` for complete skip/run decision logic.
Skip Phase 4 when ALL true:
- No protection signals detected in Phases 0-2
- All data points have validated extraction methods
- User didn't request "full recon"
Run Phase 4 when ANY true:
- 403/challenge page observed during any phase
- Known high-protection domain
- High-volume or production intent
- User explicitly requested it
If running:

**Step 4a: Test raw HTTP access**

```bash
curl -s -o /dev/null -w "%{http_code}" "https://target.com/page"
```

- 200 → Cheerio viable; no browser needed for accessible endpoints
- 403/503 → Escalate to stealth browser
**Step 4b: Test with stealth browser (if needed)**
- Already running from Phase 1 — check whether pages loaded without challenges
- `interceptor_chrome_devtools_list_cookies(domain_filter: "cloudflare")` → Protection cookies
- `interceptor_chrome_devtools_list_storage_keys(storage_type: "local")` → Fingerprint markers
- `proxy_get_tls_fingerprints()` → TLS fingerprint analysis
**Step 4c: Test with upstream proxy (if needed)**
- `proxy_set_upstream("http://user:pass@proxy-provider:port")`
- Re-test blocked endpoints through the proxy
- Document the minimum access level for each data point
**Step 4d: Document protection profile**
- What protections exist, what worked to bypass them, what production scrapers will need
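As an illustration, a protection profile could be recorded as a plain object. Every field name and value below is a hypothetical example; the actual report structure is defined in `reference/report-schema.md`:

```javascript
// Hypothetical Step 4d protection profile; all values are illustrative.
const protectionProfile = {
  domain: 'target.com',
  signals: ['403 on raw curl', 'cf-ray header', 'Cloudflare challenge HTML'],
  bypass: {
    stealthBrowser: true,         // Phase 1 stealth launch loaded pages cleanly
    upstreamProxy: 'residential', // minimum access level that unblocked the API
  },
  productionNeeds: ['PlaywrightCrawler + stealth', 'residential proxy pool'],
};

console.log(Object.keys(protectionProfile).join(', '));
// → domain, signals, bypass, productionNeeds
```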
### Phase 5: REPORT + SELF-CRITIQUE
Generate the intelligence report, then critically review it for gaps.
See `reference/report-schema.md` for the complete report format.
**Step 5a: Generate report**
- Follow the `reference/report-schema.md` schema (Sections 1-6)
- Include a `Validated?` status for every strategy (YES / PARTIAL / NO)
- Include all discovered endpoints with full specs
**Step 5b: Self-critique**
- Write Section 7 (Self-Critique) per `reference/report-schema.md`:
  - Gaps: Data points not found — why, and what would find them
  - Skipped steps: Which phases skipped, with quality gate reasoning
  - Unvalidated claims: Anything marked PARTIAL or NO
  - Assumptions: Things not verified (e.g., "consistent layout across categories")
  - Staleness risk: Geo-dependent prices, A/B layouts, session-specific content
  - Recommendations: Targeted next steps (not "re-run everything")
**Step 5c: Fix gaps with targeted re-investigation**
- If self-critique reveals fixable gaps: go back to the specific phase/step, not a full re-run
- Example: "Price selector untested" → run one curl + parse, don't re-launch browser
- Update report with results
**Step 5d: Record session (if browser was used)**
- `proxy_session_start(name)`
- `proxy_session_stop(session_id)` → `proxy_export_har(session_id, path)`
- The HAR file captures all traffic for replay. See `strategies/session-workflows.md`.
## IMPLEMENTATION (after reconnaissance)
After reconnaissance report is accepted, implement scraper iteratively.
Core Pattern:
- Implement recommended approach (minimal code)
- Test with small batch (5-10 items)
- Validate data quality
- Scale to the full dataset, or fall back
- Handle blocking if encountered
- Add robustness (error handling, retries, logging)
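The small-batch quality check in the pattern above can be sketched as a field-completeness score. This is a hedged sketch: `scrapeItem` is a placeholder for whichever extraction strategy the reconnaissance report recommended, and the 0.9 threshold is an assumption:

```javascript
// Placeholder for the recommended extraction strategy (sitemap, API, DOM, …).
async function scrapeItem(url) {
  return { title: `Item for ${url}`, price: 9.99 };
}

// Scrape a small batch and score what fraction of items have every
// required field populated. 1.0 means every field present on every item.
async function smallBatchQuality(urls, requiredFields = ['title', 'price']) {
  const sample = urls.slice(0, 10); // 5-10 items, per the core pattern
  const items = await Promise.all(sample.map((url) => scrapeItem(url)));
  const complete = items.filter((item) =>
    requiredFields.every((f) => item[f] !== undefined && item[f] !== '')
  );
  return complete.length / items.length;
}

smallBatchQuality(['https://example.com/p/1', 'https://example.com/p/2'])
  .then((score) => console.log(score >= 0.9 ? 'scale up' : 'fall back')); // → scale up
```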
See `workflows/implementation.md` for complete implementation patterns and code examples.
## PRODUCTIONIZATION (on request)
Convert scraper to production-ready Apify Actor.
Activation triggers: "Make this an Apify Actor", "Productionize this", "Deploy to Apify"
Core Pattern:
- Confirm TypeScript preference (STRONGLY RECOMMENDED)
- Initialize with the `apify create` command (CRITICAL)
- Port scraping logic to Actor format
- Test locally and deploy
Note: During development, proxy-mcp provides reconnaissance and traffic analysis. For production Actors, use Crawlee crawlers (CheerioCrawler/PlaywrightCrawler) on Apify infrastructure.
See `workflows/productionization.md` for the complete workflow and the `apify/` directory for Actor development guides.
## Quick Reference
| Task | Pattern/Command | Documentation |
|---|---|---|
| Reconnaissance | Adaptive Phases 0-5 | `workflows/reconnaissance.md` |
| Framework detection | Header + HTML signature matching | `strategies/framework-signatures.md` |
| Cheerio vs Browser | Three-way test + early exit | `strategies/cheerio-vs-browser-test.md` |
| Traffic analysis | `proxy_list_traffic()` + `proxy_search_traffic()` | `strategies/traffic-interception.md` |
| Protection testing | Conditional escalation | `strategies/proxy-escalation.md` |
| Report format | Sections 1-7 with self-critique | `reference/report-schema.md` |
| Find sitemaps | `curl -s https://[site]/robots.txt \| grep -i Sitemap` | `strategies/sitemap-discovery.md` |
| Filter sitemap URLs | URL regex matching | `reference/regex-patterns.md` |
| Discover APIs | Traffic capture (automatic) | `strategies/api-discovery.md` |
| DOM scraping | DevTools bridge + humanizer | `strategies/dom-scraping.md` |
| HTTP scraping | CheerioCrawler / got-scraping | `strategies/cheerio-scraping.md` |
| Hybrid approach | Sitemap + API | `strategies/hybrid-approaches.md` |
| Handle blocking | Stealth mode + upstream proxies | `strategies/anti-blocking.md` |
| Session recording | `proxy_session_start()` / `proxy_export_har()` | `strategies/session-workflows.md` |
| Proxy-MCP tools | Complete reference | `reference/proxy-tool-reference.md` |
| Fingerprint configs | Stealth + TLS presets | `reference/fingerprint-patterns.md` |
| Create Apify Actor | `apify create` | `apify/cli-workflow.md` |
| Template selection | Cheerio vs Playwright | `apify/templates/` |
| Input schema | Input validation patterns | `apify/input-schemas.md` |
| Deploy actor | Test locally and deploy | `apify/deployment.md` |
## Common Patterns
### Pattern 1: Sitemap-Based Scraping
```javascript
import { RobotsFile, CheerioCrawler, Dataset } from 'crawlee';

// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

const crawler = new CheerioCrawler({
  async requestHandler({ $, request }) {
    const data = {
      title: $('h1').text().trim(),
      // ... extract data
    };
    await Dataset.pushData(data);
  },
});

await crawler.addRequests(urls);
await crawler.run();
```
See `examples/sitemap-basic.js` for the complete example.
### Pattern 2: API-Based Scraping
```javascript
import { gotScraping } from 'got-scraping';

const productIds = [123, 456, 789];

for (const id of productIds) {
  const response = await gotScraping({
    url: `https://api.example.com/products/${id}`,
    responseType: 'json',
  });
  console.log(response.body);
}
```
See `examples/api-scraper.js` for the complete example.
### Pattern 3: Hybrid (Sitemap + API)
```javascript
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract IDs from URLs
const productIds = urls
  .map(url => url.match(/\/products\/(\d+)/)?.[1])
  .filter(Boolean);

// Fetch data via API
for (const id of productIds) {
  const data = await gotScraping({
    url: `https://api.shop.com/v1/products/${id}`,
    responseType: 'json',
  });
  // Process data
}
```
See `examples/hybrid-sitemap-api.js` for the complete example.
## Directory Navigation
This skill uses progressive disclosure - detailed information is organized in subdirectories and loaded only when needed.
### Workflows (Implementation Patterns)
For: Step-by-step workflow guides for each phase
- `workflows/reconnaissance.md`: Interactive reconnaissance (CRITICAL)
- `workflows/implementation.md`: Iterative implementation patterns
- `workflows/productionization.md`: Apify Actor creation workflow
### Strategies (Deep Dives)
For: Detailed guides on specific scraping approaches
- `strategies/framework-signatures.md`: Framework detection lookup tables (Phase 0/1)
- `strategies/cheerio-vs-browser-test.md`: Cheerio vs Browser decision test with early exit
- `strategies/proxy-escalation.md`: Protection testing skip/run conditions (Phase 4)
- `strategies/traffic-interception.md`: Traffic interception via MITM proxy
- `strategies/sitemap-discovery.md`: Complete sitemap guide (4 patterns)
- `strategies/api-discovery.md`: Finding and using APIs
- `strategies/dom-scraping.md`: DOM scraping via DevTools bridge
- `strategies/cheerio-scraping.md`: HTTP-only scraping
- `strategies/hybrid-approaches.md`: Combining strategies
- `strategies/anti-blocking.md`: Multi-layer anti-detection (stealth, humanizer, proxies, TLS)
- `strategies/session-workflows.md`: Session recording, HAR export, replay
### Examples (Runnable Code)
For: Working code to reference or execute

JavaScript learning examples (simple standalone scripts):
- `examples/sitemap-basic.js`: Simple sitemap scraper
- `examples/api-scraper.js`: Pure API approach
- `examples/traffic-interception-basic.js`: Proxy-based reconnaissance
- `examples/hybrid-sitemap-api.js`: Combined approach
- `examples/iterative-fallback.js`: Try traffic interception → sitemap → API → DOM scraping

TypeScript production examples (complete Actors):
- `apify/examples/basic-scraper/`: Sitemap + Playwright
- `apify/examples/anti-blocking/`: Fingerprinting + proxies
- `apify/examples/hybrid-api/`: Sitemap + API (optimal)
### Reference (Quick Lookup)
For: Quick patterns and troubleshooting
- `reference/report-schema.md`: Intelligence report format (Sections 1-7 + self-critique)
- `reference/proxy-tool-reference.md`: Proxy-MCP tool reference (all 80+ tools)
- `reference/regex-patterns.md`: Common URL regex patterns
- `reference/fingerprint-patterns.md`: Stealth mode + TLS fingerprint presets
- `reference/anti-patterns.md`: What NOT to do
### Apify (Production Deployment)
For: Creating production Apify Actors
- `apify/README.md`: When and how to use Apify
- `apify/typescript-first.md`: Why TypeScript for Actors
- `apify/cli-workflow.md`: `apify create` workflow (CRITICAL)
- `apify/initialization.md`: Complete setup guide
- `apify/input-schemas.md`: Input validation patterns
- `apify/configuration.md`: actor.json setup
- `apify/deployment.md`: Testing and deployment
- `apify/templates/`: TypeScript boilerplate
Note: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.
## Core Principles
### 1. Assess Before Committing Resources
Start cheap (curl), escalate only when needed:
- Phase 0 (curl) before Phase 1 (browser) before Phase 2 (deep scan)
- Quality gates skip phases when data is sufficient
- Never launch a browser if curl gives you everything
### 2. Detect First, Then Search Relevant Patterns
Use framework detection to focus searches:
- Match against `strategies/framework-signatures.md` before scanning
- Skip patterns that don't apply (no `__NEXT_DATA__` on Amazon)
- Known major sites get a direct strategy lookup
### 3. Validate, Don't Assume
Every claimed extraction method must be tested:
- "Found text in HTML" is not enough — need a working selector/path
- Phase 3 validates every finding before the report
- Unvalidated claims are marked PARTIAL or NO in the report
### 4. Iterative Implementation
Build incrementally:
- Small test batch first (5-10 items)
- Validate quality
- Scale or fallback
- Add robustness last
### 5. Production-Ready Code
When productionizing:
- Use TypeScript (strongly recommended)
- Use `apify create` (never manual setup)
- Add proper error handling
- Include logging and monitoring
Remember: Traffic interception first, sitemaps second, APIs third, DOM scraping last!
For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.