# web-scraping

This skill activates for web scraping and Actor development. It proactively discovers APIs via traffic interception, recommends optimal strategy (traffic interception/sitemap/API/DOM scraping/hybrid), and implements iteratively. For production, it guides TypeScript Actor creation via Apify CLI.
Install:

```bash
git clone https://github.com/yfe404/web-scraper
git clone --depth=1 https://github.com/yfe404/web-scraper ~/.claude/skills/yfe404-web-scraper-web-scraping
```
# Web Scraping with Intelligent Strategy Selection
## When This Skill Activates
Activate automatically when the user requests:
- "Scrape [website]"
- "Extract data from [site]"
- "Get product information from [URL]"
- "Find all links/pages on [site]"
- "I'm getting blocked" or "Getting 403 errors" (loads `strategies/anti-blocking.md`)
- "Make this an Apify Actor" (loads the `apify/` subdirectory)
- "Productionize this scraper"
## Input Parsing
Determine reconnaissance depth from user request:
| User Says | Mode | Phases Run |
|---|---|---|
| "quick recon", "just check", "what framework" | Quick | Phase 0 only |
| "scrape X", "extract data from X" (default) | Standard | Phases 0-3 + 5, Phase 4 only if protection signals detected |
| "full recon", "deep scan", "production scraping" | Full | All phases (0-5) including protection testing |
Default is Standard mode. Escalate to Full if protection signals appear during any phase.
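The mode selection above can be sketched as a small phrase matcher. This is an illustration only; `pickMode` and its phrase lists are hypothetical, not part of the skill's tooling:

```javascript
// Illustrative sketch of the mode table above. The trigger phrases come
// straight from the table; real requests would need fuzzier matching.
function pickMode(request) {
  const text = request.toLowerCase();
  if (/(quick recon|just check|what framework)/.test(text)) return 'quick';
  if (/(full recon|deep scan|production scraping)/.test(text)) return 'full';
  return 'standard'; // default mode
}

console.log(pickMode('Run a quick recon on example.com')); // → quick
console.log(pickMode('Scrape https://example.com/products')); // → standard
```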
## Adaptive Reconnaissance Workflow
This skill uses an adaptive phased workflow with quality gates. Each gate asks "Do I have enough?" — continue only when the answer is no.
See `strategies/framework-signatures.md` for the framework detection tables referenced throughout.
### Phase 0: QUICK ASSESSMENT (curl, no browser)
Gather maximum intelligence with minimum cost — a single HTTP request.
**Step 0a: Fetch raw HTML and headers**

```bash
curl -s -D- -L "https://target.com/page" -o response.html
```
**Step 0b: Check response headers**
- Match headers against `strategies/framework-signatures.md` → Response Header Signatures table
- Note `Server`, `X-Powered-By`, `X-Shopify-Stage`, `Set-Cookie` (protection markers)
- Check the HTTP status code (200 = accessible, 403 = protected, 3xx = redirects)
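A minimal sketch of this header check — the `MARKERS` map and `classifyHeaders` helper are illustrative, not proxy-mcp tools:

```javascript
// Classify raw response headers captured by the curl above.
// Header names mirror the markers listed; the category labels are examples.
const MARKERS = {
  'server': 'infrastructure',
  'x-powered-by': 'framework',
  'x-shopify-stage': 'platform (Shopify)',
  'cf-ray': 'protection (Cloudflare)',
  'set-cookie': 'session / protection cookies',
};

function classifyHeaders(rawHeaders) {
  const hits = [];
  for (const line of rawHeaders.split('\n')) {
    const name = line.split(':')[0].trim().toLowerCase();
    if (MARKERS[name]) hits.push(`${name}: ${MARKERS[name]}`);
  }
  return hits;
}

console.log(classifyHeaders('HTTP/2 200\nserver: cloudflare\ncf-ray: 8a1b2c3d4e5f-CDG'));
// → [ 'server: infrastructure', 'cf-ray: protection (Cloudflare)' ]
```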
**Step 0c: Check the Known Major Sites table**
- Match the domain against `strategies/framework-signatures.md` → Known Major Sites
- If matched: use the specified data strategy and skip generic pattern scanning
**Step 0d: Detect framework from HTML**
- Search the raw HTML for signatures in `strategies/framework-signatures.md` → HTML Signatures table
- Look for `__NEXT_DATA__`, `__NUXT__`, `ld+json`, `/wp-content/`, `data-reactroot`
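A hedged sketch of this signature scan. The signature-to-framework map below is illustrative and far smaller than the real tables in `strategies/framework-signatures.md`:

```javascript
// Scan raw HTML for framework marker strings (the markers named above).
const FRAMEWORK_SIGNATURES = {
  'Next.js': '__NEXT_DATA__',
  'Nuxt': '__NUXT__',
  'WordPress': '/wp-content/',
  'React': 'data-reactroot',
  'Structured data (ld+json)': 'ld+json',
};

function detectFrameworks(html) {
  // Return every framework whose marker appears in the raw HTML.
  return Object.entries(FRAMEWORK_SIGNATURES)
    .filter(([, marker]) => html.includes(marker))
    .map(([name]) => name);
}

// Example: a page embedding Next.js bootstrap data.
console.log(detectFrameworks('<script id="__NEXT_DATA__" type="application/json">{}</script>'));
// → [ 'Next.js' ]
```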
**Step 0e: Search for target data points**
- For each data point the user wants: search the raw HTML for that content
- Track which data points are found vs. missing
- Check for sitemaps:

```bash
curl -s https://[site]/robots.txt | grep -i Sitemap
```
**Step 0f: Note protection signals**
- 403/503 status, Cloudflare challenge HTML, CAPTCHA elements, `cf-ray` header
- Record for the Phase 4 decision
See `strategies/cheerio-vs-browser-test.md` for the Cheerio viability assessment.
**QUALITY GATE A:** All target data points found in raw HTML + no protection signals?
- YES → Skip to Phase 3 (Validate Findings). No browser needed.
- NO → Continue to Phase 1.
### Phase 1: BROWSER RECONNAISSANCE (only if Phase 0 needs it)
Launch browser only for data points missing from raw HTML or when JavaScript rendering is required.
**Step 1a: Initialize browser session**
- `proxy_start()` → Start traffic interception proxy
- `interceptor_chrome_launch(url, stealthMode: true)` → Launch Chrome with anti-detection
- `interceptor_chrome_devtools_attach(target_id)` → Attach DevTools bridge
- `interceptor_chrome_devtools_screenshot()` → Capture visual state
**Step 1b: Capture traffic and rendered DOM**
- `proxy_list_traffic()` → Review all traffic from page load
- `proxy_search_traffic(query: "application/json")` → Find JSON responses
- `interceptor_chrome_devtools_list_network(resource_types: ["xhr", "fetch"])` → XHR/fetch calls
- `interceptor_chrome_devtools_snapshot()` → Accessibility tree (rendered DOM)
**Step 1c: Search rendered DOM for missing data points**
- For each data point NOT found in Phase 0: search the rendered DOM
- Use the framework-specific search strategy from `strategies/framework-signatures.md` → Framework → Search Strategy table
- Only search patterns relevant to the detected framework
**Step 1d: Inspect discovered endpoints**
- `proxy_get_exchange(exchange_id)` → Full request/response for promising endpoints
- Document: method, headers, auth, response structure, pagination
**QUALITY GATE B:** All target data points now covered (raw HTML + rendered DOM + traffic)?
- YES → Skip to Phase 3 (Validate Findings). No deep scan needed.
- NO → Continue to Phase 2 for the missing data points only.
### Phase 2: DEEP SCAN (only for missing data points)
Targeted investigation for data points not yet found. Only search for what's missing.
**Step 2a: Test interactions for missing data**
- `proxy_clear_traffic()` before each action → Isolate API calls
- `humanizer_click(target_id, selector)` → Trigger dynamic content loads
- `humanizer_scroll(target_id, direction, amount)` → Trigger lazy loading / infinite scroll
- `humanizer_idle(target_id, duration_ms)` → Wait for delayed content
- After each action: `proxy_list_traffic()` → Check for new API calls
**Step 2b: Sniff APIs (framework-aware)**
- Search only patterns relevant to the detected framework:
  - Next.js → `proxy_list_traffic(url_filter: "/_next/data/")`
  - WordPress → `proxy_list_traffic(url_filter: "/wp-json/")`
  - GraphQL → `proxy_search_traffic(query: "graphql")`
  - Generic → `proxy_list_traffic(url_filter: "/api/")` + `proxy_search_traffic(query: "application/json")`
- Skip patterns that don't apply to the detected framework
**Step 2c: Test pagination and filtering**
- Only if pagination data is a missing data point or needed for coverage assessment
- `proxy_clear_traffic()` → click next page → `proxy_list_traffic(url_filter: "page=")`
- Document the pagination type (URL-based, API offset, cursor, infinite scroll)
**QUALITY GATE C:** Enough data points covered for a useful report?
- YES → Go to Phase 3.
- NO → Document the gaps and go to Phase 3 anyway (the report will note missing data in the self-critique).
### Phase 3: VALIDATE FINDINGS
Every claimed extraction method must be verified. A data point is not "found" until the extraction path is specified and tested.
See `strategies/cheerio-vs-browser-test.md` for validation methodology.
**Step 3a: Validate CSS selectors**
- For each Cheerio/selector-based method: confirm the selector matches actual HTML
- Test against raw HTML (curl output) or rendered DOM (snapshot)
- Confirm selector extracts the correct value, not a different element
**Step 3b: Validate JSON paths**
- For each JSON extraction (e.g., `__NEXT_DATA__`, API response): confirm the path resolves
- Parse the JSON, follow the path, and verify it returns the expected data type and value
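A hedged sketch of this check — `resolvePath` is an illustrative helper (not a proxy-mcp tool), and the payload shape is an example:

```javascript
// Follow a dotted path into parsed JSON and verify the value before
// marking the extraction VALIDATED in the report.
function resolvePath(obj, path) {
  return path.split('.').reduce((node, key) => (node == null ? undefined : node[key]), obj);
}

// Example __NEXT_DATA__-shaped payload (illustrative values).
const nextData = JSON.parse(
  '{"props":{"pageProps":{"product":{"name":"Widget","price":19.99}}}}'
);

const price = resolvePath(nextData, 'props.pageProps.product.price');
console.log(typeof price === 'number' && price > 0 ? 'VALIDATED' : 'FAILED', price);
// → VALIDATED 19.99
```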
**Step 3c: Validate API endpoints**
- For each discovered API: replay the request (curl or `proxy_get_exchange`)
- Confirm: response status 200, expected data structure, correct values
- Test pagination if claimed (at least page 1 and page 2)
**Step 3d: Downgrade or re-investigate failures**
- If a selector doesn't match: try alternative selectors, or downgrade to PARTIAL confidence
- If an API returns 403: note protection requirement, flag for Phase 4
- If a JSON path is wrong: re-examine the JSON structure, correct the path
### Phase 4: PROTECTION TESTING (conditional)
See `strategies/proxy-escalation.md` for complete skip/run decision logic.
Skip Phase 4 when ALL true:
- No protection signals detected in Phases 0-2
- All data points have validated extraction methods
- User didn't request "full recon"
Run Phase 4 when ANY true:
- 403/challenge page observed during any phase
- Known high-protection domain
- High-volume or production intent
- User explicitly requested it
If running:

**Step 4a: Test raw HTTP access**

```bash
curl -s -o /dev/null -w "%{http_code}" "https://target.com/page"
```

- 200 → Cheerio viable; no browser needed for accessible endpoints
- 403/503 → Escalate to stealth browser
**Step 4b: Test with stealth browser (if needed)**
- Already running from Phase 1 — check whether pages loaded without challenges
- `interceptor_chrome_devtools_list_cookies(domain_filter: "cloudflare")` → Protection cookies
- `interceptor_chrome_devtools_list_storage_keys(storage_type: "local")` → Fingerprint markers
- `proxy_get_tls_fingerprints()` → TLS fingerprint analysis
**Step 4c: Test with upstream proxy (if needed)**
- `proxy_set_upstream("http://user:pass@proxy-provider:port")`
- Re-test blocked endpoints through the proxy
- Document the minimum access level for each data point
**Step 4d: Document protection profile**
- What protections exist, what worked to bypass them, what production scrapers will need
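As an illustration, a protection profile could be recorded as a plain object. Every field name and value below is a hypothetical example; the actual report structure is defined in `reference/report-schema.md`:

```javascript
// Hypothetical Step 4d protection profile; all values are illustrative.
const protectionProfile = {
  domain: 'target.com',
  signals: ['403 on raw curl', 'cf-ray header', 'Cloudflare challenge HTML'],
  bypass: {
    stealthBrowser: true,         // Phase 1 stealth launch loaded pages cleanly
    upstreamProxy: 'residential', // minimum access level that unblocked the API
  },
  productionNeeds: ['PlaywrightCrawler + stealth', 'residential proxy pool'],
};

console.log(Object.keys(protectionProfile).join(', '));
// → domain, signals, bypass, productionNeeds
```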
### Phase 5: REPORT + SELF-CRITIQUE
Generate the intelligence report, then critically review it for gaps.
See `reference/report-schema.md` for the complete report format.
**Step 5a: Generate report**
- Follow the `reference/report-schema.md` schema (Sections 1-6)
- Include a `Validated?` status for every strategy (YES / PARTIAL / NO)
- Include all discovered endpoints with full specs
**Step 5b: Self-critique**
- Write Section 7 (Self-Critique) per `reference/report-schema.md`:
  - Gaps: Data points not found — why, and what would find them
  - Skipped steps: Which phases skipped, with quality gate reasoning
  - Unvalidated claims: Anything marked PARTIAL or NO
  - Assumptions: Things not verified (e.g., "consistent layout across categories")
  - Staleness risk: Geo-dependent prices, A/B layouts, session-specific content
  - Recommendations: Targeted next steps (not "re-run everything")
**Step 5c: Fix gaps with targeted re-investigation**
- If self-critique reveals fixable gaps: go back to the specific phase/step, not a full re-run
- Example: "Price selector untested" → run one curl + parse, don't re-launch browser
- Update report with results
**Step 5d: Record session (if browser was used)**
- `proxy_session_start(name)`
- `proxy_session_stop(session_id)` → `proxy_export_har(session_id, path)`
- The HAR file captures all traffic for replay. See `strategies/session-workflows.md`.
## IMPLEMENTATION (after reconnaissance)
After reconnaissance report is accepted, implement scraper iteratively.
Core Pattern:
- Implement recommended approach (minimal code)
- Test with small batch (5-10 items)
- Validate data quality
- Scale to the full dataset, or fall back
- Handle blocking if encountered
- Add robustness (error handling, retries, logging)
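The small-batch quality check in the pattern above can be sketched as a field-completeness score. This is a hedged sketch: `scrapeItem` is a placeholder for whichever extraction strategy the reconnaissance report recommended, and the 0.9 threshold is an assumption:

```javascript
// Placeholder for the recommended extraction strategy (sitemap, API, DOM, …).
async function scrapeItem(url) {
  return { title: `Item for ${url}`, price: 9.99 };
}

// Scrape a small batch and score what fraction of items have every
// required field populated. 1.0 means every field present on every item.
async function smallBatchQuality(urls, requiredFields = ['title', 'price']) {
  const sample = urls.slice(0, 10); // 5-10 items, per the core pattern
  const items = await Promise.all(sample.map((url) => scrapeItem(url)));
  const complete = items.filter((item) =>
    requiredFields.every((f) => item[f] !== undefined && item[f] !== '')
  );
  return complete.length / items.length;
}

smallBatchQuality(['https://example.com/p/1', 'https://example.com/p/2'])
  .then((score) => console.log(score >= 0.9 ? 'scale up' : 'fall back')); // → scale up
```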
See `workflows/implementation.md` for complete implementation patterns and code examples.
## PRODUCTIONIZATION (on request)
Convert scraper to production-ready Apify Actor.
Activation triggers: "Make this an Apify Actor", "Productionize this", "Deploy to Apify"
Core Pattern:
- Confirm TypeScript preference (STRONGLY RECOMMENDED)
- Initialize with the `apify create` command (CRITICAL)
- Port scraping logic to Actor format
- Test locally and deploy
Note: During development, proxy-mcp provides reconnaissance and traffic analysis. For production Actors, use Crawlee crawlers (CheerioCrawler/PlaywrightCrawler) on Apify infrastructure.
See `workflows/productionization.md` for the complete workflow and the `apify/` directory for Actor development guides.
## Quick Reference
| Task | Pattern/Command | Documentation |
|---|---|---|
| Reconnaissance | Adaptive Phases 0-5 | `workflows/reconnaissance.md` |
| Framework detection | Header + HTML signature matching | `strategies/framework-signatures.md` |
| Cheerio vs Browser | Three-way test + early exit | `strategies/cheerio-vs-browser-test.md` |
| Traffic analysis | `proxy_list_traffic()` + `proxy_search_traffic()` | `strategies/traffic-interception.md` |
| Protection testing | Conditional escalation | `strategies/proxy-escalation.md` |
| Report format | Sections 1-7 with self-critique | `reference/report-schema.md` |
| Find sitemaps | `curl -s https://[site]/robots.txt \| grep -i Sitemap` | `strategies/sitemap-discovery.md` |
| Filter sitemap URLs | URL regex matching | `reference/regex-patterns.md` |
| Discover APIs | Traffic capture (automatic) | `strategies/api-discovery.md` |
| DOM scraping | DevTools bridge + humanizer | `strategies/dom-scraping.md` |
| HTTP scraping | CheerioCrawler / got-scraping | `strategies/cheerio-scraping.md` |
| Hybrid approach | Sitemap + API | `strategies/hybrid-approaches.md` |
| Handle blocking | Stealth mode + upstream proxies | `strategies/anti-blocking.md` |
| Session recording | `proxy_session_start()` / `proxy_export_har()` | `strategies/session-workflows.md` |
| Proxy-MCP tools | Complete reference | `reference/proxy-tool-reference.md` |
| Fingerprint configs | Stealth + TLS presets | `reference/fingerprint-patterns.md` |
| Create Apify Actor | `apify create` | `apify/cli-workflow.md` |
| Template selection | Cheerio vs Playwright | `apify/templates/` |
| Input schema | Input validation patterns | `apify/input-schemas.md` |
| Deploy actor | Test locally and deploy | `apify/deployment.md` |
## Common Patterns
### Pattern 1: Sitemap-Based Scraping
```javascript
import { RobotsFile, CheerioCrawler, Dataset } from 'crawlee';

// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

const crawler = new CheerioCrawler({
  async requestHandler({ $, request }) {
    const data = {
      title: $('h1').text().trim(),
      // ... extract data
    };
    await Dataset.pushData(data);
  },
});

await crawler.addRequests(urls);
await crawler.run();
```
See `examples/sitemap-basic.js` for the complete example.
### Pattern 2: API-Based Scraping
```javascript
import { gotScraping } from 'got-scraping';

const productIds = [123, 456, 789];

for (const id of productIds) {
  const response = await gotScraping({
    url: `https://api.example.com/products/${id}`,
    responseType: 'json',
  });
  console.log(response.body);
}
```
See `examples/api-scraper.js` for the complete example.
### Pattern 3: Hybrid (Sitemap + API)
```javascript
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract IDs from URLs
const productIds = urls
  .map(url => url.match(/\/products\/(\d+)/)?.[1])
  .filter(Boolean);

// Fetch data via API
for (const id of productIds) {
  const data = await gotScraping({
    url: `https://api.shop.com/v1/products/${id}`,
    responseType: 'json',
  });
  // Process data
}
```
See `examples/hybrid-sitemap-api.js` for the complete example.
## Directory Navigation
This skill uses progressive disclosure - detailed information is organized in subdirectories and loaded only when needed.
### Workflows (Implementation Patterns)
For: Step-by-step workflow guides for each phase
- `workflows/reconnaissance.md`: Interactive reconnaissance (CRITICAL)
- `workflows/implementation.md`: Iterative implementation patterns
- `workflows/productionization.md`: Apify Actor creation workflow
### Strategies (Deep Dives)
For: Detailed guides on specific scraping approaches
- `strategies/framework-signatures.md`: Framework detection lookup tables (Phase 0/1)
- `strategies/cheerio-vs-browser-test.md`: Cheerio vs Browser decision test with early exit
- `strategies/proxy-escalation.md`: Protection testing skip/run conditions (Phase 4)
- `strategies/traffic-interception.md`: Traffic interception via MITM proxy
- `strategies/sitemap-discovery.md`: Complete sitemap guide (4 patterns)
- `strategies/api-discovery.md`: Finding and using APIs
- `strategies/dom-scraping.md`: DOM scraping via DevTools bridge
- `strategies/cheerio-scraping.md`: HTTP-only scraping
- `strategies/hybrid-approaches.md`: Combining strategies
- `strategies/anti-blocking.md`: Multi-layer anti-detection (stealth, humanizer, proxies, TLS)
- `strategies/session-workflows.md`: Session recording, HAR export, replay
### Examples (Runnable Code)
For: Working code to reference or execute

JavaScript learning examples (simple standalone scripts):
- `examples/sitemap-basic.js`: Simple sitemap scraper
- `examples/api-scraper.js`: Pure API approach
- `examples/traffic-interception-basic.js`: Proxy-based reconnaissance
- `examples/hybrid-sitemap-api.js`: Combined approach
- `examples/iterative-fallback.js`: Try traffic interception → sitemap → API → DOM scraping

TypeScript production examples (complete Actors):
- `apify/examples/basic-scraper/`: Sitemap + Playwright
- `apify/examples/anti-blocking/`: Fingerprinting + proxies
- `apify/examples/hybrid-api/`: Sitemap + API (optimal)
### Reference (Quick Lookup)
For: Quick patterns and troubleshooting
- `reference/report-schema.md`: Intelligence report format (Sections 1-7 + self-critique)
- `reference/proxy-tool-reference.md`: Proxy-MCP tool reference (all 80+ tools)
- `reference/regex-patterns.md`: Common URL regex patterns
- `reference/fingerprint-patterns.md`: Stealth mode + TLS fingerprint presets
- `reference/anti-patterns.md`: What NOT to do
### Apify (Production Deployment)
For: Creating production Apify Actors
- `apify/README.md`: When and how to use Apify
- `apify/typescript-first.md`: Why TypeScript for Actors
- `apify/cli-workflow.md`: `apify create` workflow (CRITICAL)
- `apify/initialization.md`: Complete setup guide
- `apify/input-schemas.md`: Input validation patterns
- `apify/configuration.md`: actor.json setup
- `apify/deployment.md`: Testing and deployment
- `apify/templates/`: TypeScript boilerplate
Note: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.
## Core Principles
### 1. Assess Before Committing Resources
Start cheap (curl), escalate only when needed:
- Phase 0 (curl) before Phase 1 (browser) before Phase 2 (deep scan)
- Quality gates skip phases when data is sufficient
- Never launch a browser if curl gives you everything
### 2. Detect First, Then Search Relevant Patterns
Use framework detection to focus searches:
- Match against `strategies/framework-signatures.md` before scanning
- Skip patterns that don't apply (no `__NEXT_DATA__` on Amazon)
- Known major sites get a direct strategy lookup
### 3. Validate, Don't Assume
Every claimed extraction method must be tested:
- "Found text in HTML" is not enough — need a working selector/path
- Phase 3 validates every finding before the report
- Unvalidated claims are marked PARTIAL or NO in the report
### 4. Iterative Implementation
Build incrementally:
- Small test batch first (5-10 items)
- Validate quality
- Scale or fallback
- Add robustness last
### 5. Production-Ready Code
When productionizing:
- Use TypeScript (strongly recommended)
- Use `apify create` (never manual setup)
- Add proper error handling
- Include logging and monitoring
Remember: Traffic interception first, sitemaps second, APIs third, DOM scraping last!
For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.