Skillshub crawl4ai
Complete toolkit for web crawling and data extraction using Crawl4AI. This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.
git clone https://github.com/ComeOnOliver/skillshub
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/Harmeet10000/skills/crawl4ai-skill" ~/.claude/skills/comeonoliver-skillshub-crawl4ai && rm -rf "$T"
skills/Harmeet10000/skills/crawl4ai-skill/SKILL.md
Crawl4AI
Overview
This skill provides comprehensive support for web crawling and data extraction using the Crawl4AI library, including the complete SDK reference, ready-to-use scripts for common patterns, and optimized workflows for efficient data extraction.
Quick Start
Installation Check
# Verify installation
crawl4ai-doctor

# If issues, run setup
crawl4ai-setup
Basic First Crawl
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])  # First 500 chars

asyncio.run(main())
Using Provided Scripts
# Simple markdown extraction
python scripts/basic_crawler.py https://example.com

# Batch processing
python scripts/batch_crawler.py urls.txt

# Data extraction
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
Core Crawling Fundamentals
1. Basic Crawling
Understanding the core components for any crawl:
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Browser configuration (controls browser behavior)
browser_config = BrowserConfig(
    headless=True,                 # Run without GUI
    viewport_width=1920,
    viewport_height=1080,
    user_agent="custom-agent"      # Optional custom user agent
)

# Crawler configuration (controls crawl behavior)
crawler_config = CrawlerRunConfig(
    page_timeout=30000,            # 30 seconds timeout
    screenshot=True,               # Take screenshot
    remove_overlay_elements=True   # Remove popups/overlays
)

# Execute crawl with arun()
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        config=crawler_config
    )

    # CrawlResult contains everything
    print(f"Success: {result.success}")
    print(f"HTML length: {len(result.html)}")
    print(f"Markdown length: {len(result.markdown)}")
    print(f"Links found: {len(result.links)}")
2. Configuration Deep Dive
BrowserConfig - Controls the browser instance:
- headless: Run with/without GUI
- viewport_width/height: Browser dimensions
- user_agent: Custom user agent string
- cookies: Pre-set cookies
- headers: Custom HTTP headers
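A minimal sketch of these options together. The cookie dict shape shown follows Playwright's add_cookies convention and the values are placeholders; treat it as an assumption and adjust to your Crawl4AI version:

from crawl4ai import BrowserConfig

browser_config = BrowserConfig(
    headless=True,
    viewport_width=1366,
    viewport_height=768,
    user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    cookies=[{"name": "session", "value": "abc123", "url": "https://example.com"}],  # assumed Playwright-style cookie dict
    headers={"Accept-Language": "en-US,en;q=0.9"}
)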
CrawlerRunConfig - Controls each crawl:
- page_timeout: Maximum page load/JS execution time (ms)
- wait_for: CSS selector or JS condition to wait for (optional)
- cache_mode: Control caching behavior
- js_code: Execute custom JavaScript
- screenshot: Capture page screenshot
- session_id: Persist session across crawls
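A short sketch combining several of these options into one run configuration; the selector, URL, and session name are placeholders:

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(
    page_timeout=45000,                       # 45s budget for page load + JS
    wait_for="css:#content",                  # placeholder selector to wait for
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    screenshot=True,
    session_id="demo_session"                 # keep the same page/session across arun() calls
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com", config=config)
    print(result.success)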
3. Content Processing
Basic content operations available in every crawl:
result = await crawler.arun(url)

# Access extracted content
markdown = result.markdown        # Clean markdown
html = result.html                # Raw HTML
text = result.cleaned_html        # Cleaned HTML

# Media and links
images = result.media["images"]
videos = result.media["videos"]
internal_links = result.links["internal"]
external_links = result.links["external"]

# Metadata
title = result.metadata["title"]
description = result.metadata["description"]
Markdown Generation (Primary Use Case)
1. Basic Markdown Extraction
Crawl4AI excels at generating clean, well-formatted markdown:
# Simple markdown extraction
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://docs.example.com")

    # High-quality markdown ready for LLMs
    with open("documentation.md", "w") as f:
        f.write(result.markdown)
2. Fit Markdown (Content Filtering)
Use content filters to get only relevant content:
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Option 1: Pruning filter (removes low-quality content)
pruning_filter = PruningContentFilter(threshold=0.4, threshold_type="fixed")

# Option 2: BM25 filter (relevance-based filtering)
bm25_filter = BM25ContentFilter(user_query="machine learning tutorials", bm25_threshold=1.0)

md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)

result = await crawler.arun(url, config=config)

# Access filtered content
print(result.markdown.fit_markdown)   # Filtered markdown
print(result.markdown.raw_markdown)   # Original markdown
3. Markdown Customization
Control markdown generation with options:
config = CrawlerRunConfig(
    # Exclude elements from markdown
    excluded_tags=["nav", "footer", "aside"],
    # Focus on specific CSS selector
    css_selector=".main-content",
    # Clean up formatting
    remove_forms=True,
    remove_overlay_elements=True,
    # Control link handling
    exclude_external_links=True,
    exclude_internal_links=False
)

# Custom markdown generation
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

generator = DefaultMarkdownGenerator(
    options={
        "ignore_links": False,
        "ignore_images": False,
        "image_alt_text": True
    }
)
Data Extraction
1. Schema-Based Extraction (Most Efficient)
For repetitive patterns, generate schema once and reuse:
# Step 1: Generate schema with LLM (one-time)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"

# Step 2: Use schema for fast extraction (no LLM)
python scripts/extraction_pipeline.py --use-schema https://shop.com generated_schema.json
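In Python, step 2 looks roughly like the sketch below, assuming generated_schema.json follows the schema format shown in the next section:

import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Load the schema produced in step 1
with open("generated_schema.json") as f:
    schema = json.load(f)

config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema=schema))

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://shop.com", config=config)
    products = json.loads(result.extracted_content)  # list of dicts, one per matched element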
2. Manual CSS/JSON Extraction
When you know the structure:
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "date", "selector": ".date", "type": "text"},
        {"name": "content", "selector": ".content", "type": "text"}
    ]
}

extraction_strategy = JsonCssExtractionStrategy(schema=schema)
config = CrawlerRunConfig(extraction_strategy=extraction_strategy)
3. LLM-Based Extraction
For complex or irregular content:
from crawl4ai.extraction_strategy import LLMExtractionStrategy

extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    instruction="Extract key financial metrics and quarterly trends"
)
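A brief usage sketch (the URL is a placeholder); the extracted data comes back as a JSON string in result.extracted_content:

import json

config = CrawlerRunConfig(extraction_strategy=extraction_strategy)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com/quarterly-report", config=config)
    if result.success:
        data = json.loads(result.extracted_content)  # parsed extraction output
        print(data)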
Advanced Patterns
1. Deep Crawling
Discover and crawl links from a page:
# Basic link discovery
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url)

    # Extract and process discovered links
    internal_links = result.links.get("internal", [])
    external_links = result.links.get("external", [])

    # Crawl discovered internal links
    for link in internal_links:
        href = link.get("href", "")  # each link entry is a dict with href/text metadata
        if "/blog/" in href and "/tag/" not in href:  # Filter links
            sub_result = await crawler.arun(href)
            # Process sub-page

# For advanced deep crawling, consider using URL seeding patterns
# or custom crawl strategies (see complete-sdk-reference.md)
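Recent Crawl4AI releases also ship dedicated deep-crawl strategies that attach to the run config. The sketch below assumes BFSDeepCrawlStrategy and the deep_crawl_strategy parameter are available in your version; check references/complete-sdk-reference.md for the exact API before relying on it:

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy  # availability depends on your Crawl4AI version

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,              # follow links up to two hops from the start page
        include_external=False    # stay on the same domain
    )
)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun("https://example.com/blog/", config=config)
    for page in results:          # deep crawls typically return one result per visited page
        print(page.url)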
2. Batch & Multi-URL Processing
Efficiently crawl multiple URLs:
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]

async with AsyncWebCrawler() as crawler:
    # Concurrent crawling with arun_many()
    results = await crawler.arun_many(
        urls=urls,
        config=crawler_config,
        max_concurrent=5  # Control concurrency
    )

    for result in results:
        if result.success:
            print(f"✅ {result.url}: {len(result.markdown)} chars")
3. Session & Authentication
Handle login-required content:
# First crawl - establish session and login
login_config = CrawlerRunConfig(
    session_id="user_session",
    js_code="""
        document.querySelector('#username').value = 'myuser';
        document.querySelector('#password').value = 'mypass';
        document.querySelector('#submit').click();
    """,
    wait_for="css:.dashboard"  # Wait for post-login element
)
await crawler.arun("https://site.com/login", config=login_config)

# Subsequent crawls - reuse session
config = CrawlerRunConfig(session_id="user_session")
await crawler.arun("https://site.com/protected-content", config=config)
4. Dynamic Content Handling
For JavaScript-heavy sites:
config = CrawlerRunConfig(
    # Wait for dynamic content
    wait_for="css:.ajax-content",
    # Execute JavaScript
    js_code="""
        // Scroll to load content
        window.scrollTo(0, document.body.scrollHeight);
        // Click load more button
        document.querySelector('.load-more')?.click();
    """,
    # Note: For virtual scrolling (Twitter/Instagram-style),
    # use virtual_scroll_config parameter (see docs)
    # Extended timeout for slow loading
    page_timeout=60000
)
5. Anti-Detection & Proxies
Avoid bot detection:
# Proxy configuration
browser_config = BrowserConfig(
    headless=True,
    proxy_config={
        "server": "http://proxy.server:8080",
        "username": "user",
        "password": "pass"
    }
)

# For stealth/undetected browsing, consider:
# - Rotating user agents via user_agent parameter
# - Using different viewport sizes
# - Adding delays between requests

# Rate limiting
import asyncio

for url in urls:
    result = await crawler.arun(url)
    await asyncio.sleep(2)  # Delay between requests
Common Use Cases
Documentation to Markdown
# Convert entire documentation site to clean markdown
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://docs.example.com")

    # Save as markdown for LLM consumption
    with open("docs.md", "w") as f:
        f.write(result.markdown)
E-commerce Product Monitoring
# Generate schema once for product pages
# Then monitor prices/availability without LLM costs
import json

with open("product_schema.json") as f:
    schema = json.load(f)

products = await crawler.arun_many(
    product_urls,
    config=CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
)
News Aggregation
# Crawl multiple news sources concurrently
news_urls = ["https://news1.com", "https://news2.com", "https://news3.com"]
results = await crawler.arun_many(news_urls, max_concurrent=5)

# Extract articles with Fit Markdown
# (requires a content filter on the markdown generator, as shown above)
for result in results:
    if result.success:
        # Get only relevant content
        article = result.markdown.fit_markdown
Research & Data Collection
# Academic paper collection with focused extraction
# (fit markdown is produced by attaching a content filter, as shown earlier)
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=BM25ContentFilter(user_query="machine learning transformers")
    )
)
# Trim the filtered markdown afterwards if you need a token budget (e.g. ~10k tokens)
Resources
scripts/
- extraction_pipeline.py - Three extraction approaches with schema generation
- basic_crawler.py - Simple markdown extraction with screenshots
- batch_crawler.py - Multi-URL concurrent processing
references/
- complete-sdk-reference.md - Complete SDK documentation (23K words) with all parameters, methods, and advanced features
Example Code Repository
The Crawl4AI repository includes extensive examples in
docs/examples/:
Core Examples
- quickstart.py - Comprehensive starter with all basic patterns:
- Simple crawling, JavaScript execution, CSS selectors
- Content filtering, link analysis, media handling
- LLM extraction, CSS extraction, dynamic content
- Browser comparison, SSL certificates
Specialized Examples
- amazon_product_extraction_*.py - Three approaches for e-commerce scraping
- extraction_strategies_examples.py - All extraction strategies demonstrated
- deepcrawl_example.py - Advanced deep crawling patterns
- crypto_analysis_example.py - Complex data extraction with analysis
- parallel_execution_example.py - High-performance concurrent crawling
- session_management_example.py - Authentication and session handling
- markdown_generation_example.py - Advanced markdown customization
- hooks_example.py - Custom hooks for crawl lifecycle events
- proxy_rotation_example.py - Proxy management and rotation
- router_example.py - Request routing and URL patterns
Advanced Patterns
- adaptive_crawling/ - Intelligent crawling strategies
- c4a_script/ - C4A script examples
- docker_*.py - Docker deployment patterns
To explore examples:
# The examples are located in your Crawl4AI installation:
# Look in: docs/examples/ directory

# Start with quickstart.py for comprehensive patterns
# It includes: simple crawl, JS execution, CSS selectors,
# content filtering, LLM extraction, dynamic pages, and more

# For specific use cases:
# - E-commerce: amazon_product_extraction_*.py
# - High performance: parallel_execution_example.py
# - Authentication: session_management_example.py
# - Deep crawling: deepcrawl_example.py

# Run any example directly:
# python docs/examples/quickstart.py
Best Practices
- Start with basic crawling - Understand BrowserConfig, CrawlerRunConfig, and arun() before moving to advanced features
- Use markdown generation for documentation and content - Crawl4AI excels at clean markdown extraction
- Try schema generation first for structured data - 10-100x more efficient than LLM extraction
- Enable caching during development - cache_mode=CacheMode.ENABLED to avoid repeated requests (see the sketch after this list)
- Set appropriate timeouts - 30s for normal sites, 60s+ for JavaScript-heavy sites
- Respect rate limits - Use delays and the max_concurrent parameter
- Reuse sessions for authenticated content instead of logging in again
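A minimal caching sketch for development, assuming the CacheMode enum exported by crawl4ai: repeated runs of the same URL are served from the local cache instead of the network.

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

dev_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)   # reuse cached pages while iterating
# prod_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS) # always fetch fresh content

async with AsyncWebCrawler() as crawler:
    first = await crawler.arun("https://example.com", config=dev_config)   # network fetch, then cached
    second = await crawler.arun("https://example.com", config=dev_config)  # served from cache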
Troubleshooting
JavaScript not loading:
config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",  # Wait for specific element
    page_timeout=60000                # Increase timeout
)
Bot detection issues:
import asyncio
import random

browser_config = BrowserConfig(
    headless=False,  # Sometimes visible browsing helps
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)

# Add delays between requests
await asyncio.sleep(random.uniform(2, 5))
Content extraction problems:
# Debug what's being extracted
result = await crawler.arun(url)
print(f"HTML length: {len(result.html)}")
print(f"Markdown length: {len(result.markdown)}")
print(f"Links found: {len(result.links)}")

# Try different wait strategies
config = CrawlerRunConfig(
    wait_for="js:document.querySelector('.content') !== null"
)
Session/auth issues:
# Verify session is maintained
config = CrawlerRunConfig(session_id="test_session")
result = await crawler.arun(url, config=config)
print(f"Session ID: {result.session_id}")
print(f"Cookies: {result.cookies}")
For more details on any topic, refer to references/complete-sdk-reference.md, which contains comprehensive documentation of all features, parameters, and advanced usage patterns.