Awesome-omni-skill web-scraping
This skill activates for web scraping and Actor development. It proactively discovers sitemaps and APIs, recommends an optimal strategy (sitemap, API, Playwright, or hybrid), and implements it iteratively. For production, it guides TypeScript Actor creation via the Apify CLI.
Clone the full repository:

```bash
git clone https://github.com/diegosouzapw/awesome-omni-skill
```

Or install just this skill into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/development/web-scraping-majiayu000" ~/.claude/skills/diegosouzapw-awesome-omni-skill-web-scraping && rm -rf "$T"
```
`skills/development/web-scraping-majiayu000/SKILL.md`

Web Scraping with Intelligent Strategy Selection
When This Skill Activates
Activate automatically when user requests:
- "Scrape [website]"
- "Extract data from [site]"
- "Get product information from [URL]"
- "Find all links/pages on [site]"
- "I'm getting blocked" or "Getting 403 errors" (loads
)strategies/anti-blocking.md - "Make this an Apify Actor" (loads
subdirectory)apify/ - "Productionize this scraper"
Proactive Workflow
This skill follows a systematic 5-phase approach to web scraping, always starting with interactive reconnaissance and ending with production-ready code.
Phase 1: INTERACTIVE RECONNAISSANCE (Critical First Step)
When user says "scrape X", immediately start with hands-on reconnaissance using MCP tools:
DO NOT jump to automated checks or implementation - reconnaissance prevents wasted effort and discovers hidden APIs.
Use Playwright MCP & Chrome DevTools MCP:
1. Open site in real browser (Playwright MCP)
- Navigate like a real user
- Observe page loading behavior (SSR? SPA? Loading states?)
- Take screenshots for reference
- Test basic interactions
2. Monitor network traffic (Chrome DevTools via Playwright; see the sketch after this list)
- Watch XHR/Fetch requests in real-time
- Find API endpoints returning JSON (10-100x faster than HTML scraping!)
- Analyze request/response patterns
- Document headers, cookies, authentication tokens
- Extract pagination parameters
3. Test site interactions
- Pagination: URL-based? API? Infinite scroll?
- Filtering and search: How do they work?
- Dynamic content loading: Triggers and patterns
- Authentication flows: Required? Optional?
4. Assess protection mechanisms
- Cloudflare/bot detection
- CAPTCHA requirements
- Rate limiting behavior (test with multiple requests)
- Fingerprinting scripts
5. Generate Intelligence Report
- Site architecture (framework, rendering method)
- Discovered APIs/endpoints with full specs
- Protection mechanisms and required countermeasures
- Optimal extraction strategy (API > Sitemap > HTML)
- Time/complexity estimates
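MCP tools drive this reconnaissance interactively, so there is no single script to run. As a rough stand-in for step 2 (network monitoring), here is a minimal plain-Playwright sketch that logs JSON-returning XHR/Fetch responses; `example.com` is a placeholder target:

```javascript
// Minimal reconnaissance sketch in plain Playwright (a stand-in for the MCP flow).
// example.com is a placeholder target.
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

// Log every XHR/Fetch response that returns JSON - these are API candidates.
page.on('response', (response) => {
  const type = response.request().resourceType();
  const contentType = response.headers()['content-type'] ?? '';
  if ((type === 'xhr' || type === 'fetch') && contentType.includes('json')) {
    console.log(`API candidate: ${response.request().method()} ${response.url()}`);
  }
});

await page.goto('https://example.com');
await page.screenshot({ path: 'recon.png', fullPage: true }); // reference screenshot
await browser.close();
```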
See workflows/reconnaissance.md for the complete reconnaissance guide with MCP examples.
Why this matters: Reconnaissance discovers hidden APIs (eliminating the need for HTML scraping), identifies blockers before any code is written, and provides the intelligence needed for optimal strategy selection. Never skip this step.
Phase 2: AUTOMATIC DISCOVERY (Validate Reconnaissance)
After Phase 1 reconnaissance, validate findings with automated checks:
1. Check for Sitemaps
```bash
# Automatically check these locations
curl -s https://[site]/robots.txt | grep -i Sitemap
curl -I https://[site]/sitemap.xml
curl -I https://[site]/sitemap_index.xml
```
Log findings clearly:
- ✓ "Found sitemap at /sitemap.xml with ~1,234 URLs"
- ✓ "Found sitemap index with 5 sub-sitemaps"
- ✗ "No sitemap detected at common locations"
Why this matters: Sitemaps provide instant URL discovery (60x faster than crawling).
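The same check can be scripted; a minimal sketch using Crawlee's RobotsFile helper (the same helper used in Pattern 1 below), with `example.com` as a placeholder:

```javascript
// Discover sitemaps via robots.txt and count the URLs they contain.
import { RobotsFile } from 'crawlee';

const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

if (urls.length > 0) {
  console.log(`✓ Found sitemap(s) with ~${urls.length} URLs`);
} else {
  console.log('✗ No sitemap detected via robots.txt');
}
```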
2. Investigate APIs
Prompt user:
```
Should I check for JSON APIs first? (Highly recommended)

Benefits of APIs vs HTML scraping:
• 10-100x faster execution
• More reliable (structured JSON vs fragile HTML)
• Less bandwidth usage
• Easier to maintain

Check for APIs? [Y/n]
```
If yes, guide user:
- Open browser DevTools → Network tab
- Navigate the target website
- Look for XHR/Fetch requests
- Check for common endpoint paths: `/api/`, `/v1/`, `/v2/`, `/graphql`, `/_next/data/`
- Analyze request/response format (JSON, GraphQL, REST)
Log findings:
- ✓ "Found API: GET /api/products/{id} (returns JSON)"
- ✓ "Found GraphQL endpoint: /graphql"
- ✗ "No obvious public APIs detected"
3. Analyze Site Structure
Automatically assess:
- JavaScript-heavy? (Look for React, Vue, Angular indicators)
- Authentication required? (Login walls, auth tokens)
- Page count estimate (from sitemap or site exploration)
- Rate limiting indicators (robots.txt directives)
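One quick heuristic for the framework question is to grep the raw HTML for well-known markers; a minimal sketch (the marker list is illustrative, not exhaustive):

```javascript
// Fetch raw HTML and look for framework fingerprints.
import { gotScraping } from 'got-scraping';

const { body } = await gotScraping({ url: 'https://example.com' });

const indicators = {
  'Next.js': body.includes('__NEXT_DATA__'),
  'Nuxt (Vue)': body.includes('__NUXT__'),
  'React root': body.includes('id="root"') || body.includes('data-reactroot'),
  Angular: body.includes('ng-version'),
};

for (const [framework, found] of Object.entries(indicators)) {
  if (found) console.log(`✓ Indicator found: ${framework}`);
}
```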
Phase 3: STRATEGY RECOMMENDATION
Based on Phases 1-2 findings, present 2-3 options with clear reasoning:
Example Output Template:
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Analysis of example.com
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Phase 1 Intelligence (Reconnaissance):
✓ API discovered via DevTools: GET /api/products?page=N&limit=100
✓ Framework: Next.js (SSR + CSR hybrid)
✓ Protection: Cloudflare detected, rate limit ~60/min
✗ No authentication required

Phase 2 Validation:
✓ Sitemap found: 1,234 product URLs (validates API total)
✓ Static HTML fallback available if needed

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Recommended Approaches:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⭐ Option 1: Hybrid (Sitemap + API) [RECOMMENDED]
  ✓ Use sitemap to get all 1,234 product URLs instantly
  ✓ Extract product IDs from URLs
  ✓ Fetch data via API (fast, reliable JSON)
  Estimated time: 8-12 minutes
  Complexity: Low-Medium
  Data quality: Excellent
  Speed: Very Fast

⚡ Option 2: Sitemap + Playwright
  ✓ Use sitemap for URLs
  ✓ Scrape HTML with Playwright
  Estimated time: 15-20 minutes
  Complexity: Medium
  Data quality: Good
  Speed: Fast

🔧 Option 3: Pure API (if sitemap fails)
  ✓ Discover product IDs through API exploration
  ✓ Fetch all data via API
  Estimated time: 10-15 minutes
  Complexity: Medium
  Data quality: Excellent
  Speed: Fast

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
My Recommendation: Option 1 (Hybrid)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Reasoning:
• Sitemap gives us complete URL list (instant discovery)
• API provides clean, structured data (no HTML parsing)
• Combines speed of sitemap with reliability of API
• Best of both worlds

Proceed with Option 1? [Y/n]
```
Key principles:
- Always recommend the SIMPLEST approach that works
- Sitemap > API > Playwright (in terms of simplicity)
- Show time estimates and complexity
- Explain reasoning clearly
Phase 4: ITERATIVE IMPLEMENTATION
Implement scraper incrementally, starting simple and adding complexity only as needed.
Core Pattern:
- Implement recommended approach (minimal code)
- Test with small batch (5-10 items)
- Validate data quality
- Scale to full dataset or fallback
- Handle blocking if encountered
- Add robustness (error handling, retries, logging)
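A minimal sketch of the small-batch-first part of this pattern; `scrapeItem()` is a hypothetical stand-in for your actual extraction logic:

```javascript
// Small-batch-first pattern: validate on 10 items before scaling.
import { gotScraping } from 'got-scraping';

// Placeholder extractor - swap in your real scraping logic.
async function scrapeItem(url) {
  const { body } = await gotScraping({ url });
  const title = body.match(/<title>([^<]*)<\/title>/)?.[1] ?? null;
  return { url, title };
}

const allUrls = ['https://example.com/']; // placeholder URL list
const testBatch = allUrls.slice(0, 10);

const sample = [];
for (const url of testBatch) sample.push(await scrapeItem(url));

// Validate quality before committing to a full run.
const okRatio = sample.filter((item) => item.title).length / sample.length;
if (okRatio < 0.9) {
  throw new Error('Test batch below 90% quality - fix extraction before scaling');
}

// Only now scale to the full dataset (add retries, logging, etc. here).
for (const url of allUrls.slice(10)) {
  await scrapeItem(url);
}
```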
See workflows/implementation.md for complete implementation patterns and code examples.
Phase 5: PRODUCTIONIZATION (On Request)
Convert scraper to production-ready Apify Actor.
Activation triggers:
- "Make this an Apify Actor"
- "Productionize this scraper"
- "Deploy to Apify"
- "Create an actor from this"
Core Pattern:
- Confirm TypeScript preference (STRONGLY RECOMMENDED)
- Initialize with the `apify create` command (CRITICAL)
- Port scraping logic to Actor format
- Test locally and deploy
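In CLI terms, the pattern boils down to three standard Apify CLI commands (`my-actor` is a placeholder name):

```bash
# Scaffold a new Actor from an official template (choose a TypeScript one)
apify create my-actor
cd my-actor

# Run locally against test input
apify run

# Build and deploy to the Apify platform
apify push
```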
See workflows/productionization.md for the complete productionization workflow, and the apify/ directory for all Actor development guides.
Quick Reference
| Task | Pattern/Command | Documentation |
|---|---|---|
| Reconnaissance | Playwright + DevTools MCP | `workflows/reconnaissance.md` |
| Find sitemaps | `robots.txt` / `RobotsFile.find()` | `strategies/sitemap-discovery.md` |
| Filter sitemap URLs | URL regex patterns | `reference/regex-patterns.md` |
| Discover APIs | DevTools → Network tab | `strategies/api-discovery.md` |
| Playwright scraping | `PlaywrightCrawler` | `strategies/playwright-scraping.md` |
| HTTP scraping | `gotScraping` | `strategies/cheerio-scraping.md` |
| Hybrid approach | Sitemap + API | `strategies/hybrid-approaches.md` |
| Handle blocking | fingerprint-suite + proxies | `strategies/anti-blocking.md` |
| Fingerprint configs | Quick patterns | `reference/fingerprint-patterns.md` |
| Create Apify Actor | `apify create` | `apify/cli-workflow.md` |
| Template selection | Cheerio vs Playwright | `apify/cli-workflow.md` |
| Input schema | Input validation patterns | `apify/input-schemas.md` |
| Deploy actor | `apify push` | `apify/deployment.md` |
Common Patterns
Pattern 1: Sitemap-Based Scraping
```javascript
import { RobotsFile, PlaywrightCrawler, Dataset } from 'crawlee';

// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, request }) {
    const data = await page.evaluate(() => ({
      title: document.title,
      // ... extract data
    }));
    await Dataset.pushData(data);
  },
});

await crawler.addRequests(urls);
await crawler.run();
```

See examples/sitemap-basic.js for the complete example.
Pattern 2: API-Based Scraping
```javascript
import { gotScraping } from 'got-scraping';

const productIds = [123, 456, 789];

for (const id of productIds) {
  const response = await gotScraping({
    url: `https://api.example.com/products/${id}`,
    responseType: 'json',
  });
  console.log(response.body);
}
```

See examples/api-scraper.js for the complete example.
Pattern 3: Hybrid (Sitemap + API)
```javascript
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract IDs from URLs
const productIds = urls
  .map((url) => url.match(/\/products\/(\d+)/)?.[1])
  .filter(Boolean);

// Fetch data via API
for (const id of productIds) {
  const data = await gotScraping({
    url: `https://api.shop.com/v1/products/${id}`,
    responseType: 'json',
  });
  // Process data
}
```

See examples/hybrid-sitemap-api.js for the complete example.
Directory Navigation
This skill uses progressive disclosure - detailed information is organized in subdirectories and loaded only when needed.
Workflows (Implementation Patterns)
For: Step-by-step workflow guides for each phase
- `workflows/reconnaissance.md`: Phase 1 interactive reconnaissance (CRITICAL)
- `workflows/implementation.md`: Phase 4 iterative implementation patterns
- `workflows/productionization.md`: Phase 5 Apify Actor creation workflow
Strategies (Deep Dives)
For: Detailed guides on specific scraping approaches
- `strategies/sitemap-discovery.md`: Complete sitemap guide (4 patterns)
- `strategies/api-discovery.md`: Finding and using APIs
- `strategies/playwright-scraping.md`: Browser-based scraping
- `strategies/cheerio-scraping.md`: HTTP-only scraping
- `strategies/hybrid-approaches.md`: Combining strategies
- `strategies/anti-blocking.md`: Fingerprinting & proxies for blocked sites
Examples (Runnable Code)
For: Working code to reference or execute
JavaScript Learning Examples (Simple standalone scripts):
- `examples/sitemap-basic.js`: Simple sitemap scraper
- `examples/api-scraper.js`: Pure API approach
- `examples/playwright-basic.js`: Basic Playwright scraper
- `examples/hybrid-sitemap-api.js`: Combined approach
- `examples/iterative-fallback.js`: Try sitemap → API → Playwright
TypeScript Production Examples (Complete Actors):
- `apify/examples/basic-scraper/`: Sitemap + Playwright
- `apify/examples/anti-blocking/`: Fingerprinting + proxies
- `apify/examples/hybrid-api/`: Sitemap + API (optimal)
Reference (Quick Lookup)
For: Quick patterns and troubleshooting
- `reference/regex-patterns.md`: Common URL regex patterns
- `reference/selector-guide.md`: Playwright selector strategies
- `reference/fingerprint-patterns.md`: Common fingerprint configurations
- `reference/anti-patterns.md`: What NOT to do
Apify (Production Deployment)
For: Creating production Apify Actors
- `apify/README.md`: When and how to use Apify
- `apify/typescript-first.md`: Why TypeScript for Actors
- `apify/cli-workflow.md`: The `apify create` workflow (CRITICAL)
- `apify/initialization.md`: Complete setup guide
- `apify/input-schemas.md`: Input validation patterns
- `apify/configuration.md`: `actor.json` setup
- `apify/deployment.md`: Testing and deployment
- `apify/templates/`: TypeScript boilerplate
Note: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.
Core Principles
1. Progressive Enhancement
Start with the simplest approach that works:
- Sitemap > API > Playwright
- Static > Dynamic
- HTTP > Browser
2. Proactive Discovery
Always investigate before implementing:
- Check for sitemaps automatically
- Look for APIs (ask user to check DevTools)
- Analyze site structure
3. Iterative Implementation
Build incrementally:
- Small test batch first (5-10 items)
- Validate quality
- Scale or fallback
- Add robustness last
4. Production-Ready Code
When productionizing:
- Use TypeScript (strongly recommended)
- Use `apify create` (never manual setup)
- Add proper error handling
- Include logging and monitoring
Remember: Sitemaps first, APIs second, browser scraping last!
For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.