install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/TerminalSkills/skills/crawl4ai" ~/.claude/skills/comeonoliver-skillshub-crawl4ai-4a55d9 && rm -rf "$T"
manifest:
skills/TerminalSkills/skills/crawl4ai/SKILL.md
Crawl4AI — LLM-Friendly Web Crawler
You are an expert in Crawl4AI, the open-source web crawler built for AI applications. You help developers extract clean, structured data from websites for LLM training, RAG pipelines, and content analysis — with automatic markdown conversion, JavaScript rendering, CSS-based extraction, LLM-powered structured extraction, and session management for multi-page crawling.
Core Capabilities
Basic Crawling
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://docs.example.com/getting-started",
            config=CrawlerRunConfig(
                word_count_threshold=10,  # Skip blocks with <10 words
                cache_mode=CacheMode.ENABLED,
            ),
        )
        print(result.markdown)           # Clean markdown (LLM-ready)
        print(result.cleaned_html)       # Cleaned HTML
        print(result.media["images"])    # Extracted images
        print(result.links["internal"])  # Internal links
        print(result.links["external"])  # External links
        print(result.metadata)           # Title, description, keywords

asyncio.run(main())
```
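Beyond `CacheMode.ENABLED`, it can help to force a fresh fetch while iterating on a crawl. A minimal sketch; `CacheMode.BYPASS` exists in recent crawl4ai releases, but verify against your installed version:

```python
from crawl4ai import CacheMode, CrawlerRunConfig

# BYPASS skips the local cache for this run entirely; useful when the target
# page changes between runs. (Assumption: recent crawl4ai releases also ship
# DISABLED, READ_ONLY, and WRITE_ONLY members.)
fresh_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
```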
LLM-Powered Structured Extraction
```python
import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    description: str
    rating: float | None
    features: list[str]

extraction = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    api_token=os.environ["OPENAI_API_KEY"],
    schema=Product.model_json_schema(),
    extraction_type="schema",  # Extract against the schema rather than free-form blocks
    instruction="Extract all products from this page",
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://shop.example.com/keyboards",
            config=CrawlerRunConfig(extraction_strategy=extraction),
        )
    products = [Product(**p) for p in json.loads(result.extracted_content)]
    for p in products:
        print(f"{p.name}: ${p.price} — {p.rating}★")

asyncio.run(main())
```
CSS-Based Extraction (No LLM)
```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product listings",
    "baseSelector": ".product-card",  # Repeating element
    "fields": [
        {"name": "title", "selector": "h3.product-title", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ],
}

extraction = JsonCssExtractionStrategy(schema)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://shop.example.com/all",
            config=CrawlerRunConfig(
                extraction_strategy=extraction,
                js_code="window.scrollTo(0, document.body.scrollHeight);",  # Scroll to load more
                wait_for=".product-card:nth-child(20)",  # Wait for 20 items
            ),
        )
    products = json.loads(result.extracted_content)
    print(f"Extracted {len(products)} products")

asyncio.run(main())
```
Multi-Page Crawling
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        # Session-based: maintains cookies, auth state
        session_id = "docs-crawl"

        # Log in first
        await crawler.arun(
            url="https://docs.example.com/login",
            config=CrawlerRunConfig(
                session_id=session_id,
                js_code="""
                    document.querySelector('#email').value = 'user@example.com';
                    document.querySelector('#password').value = 'pass';
                    document.querySelector('form').submit();
                """,
                wait_for="#dashboard",
            ),
        )

        # Crawl authenticated pages in the same browser session
        urls = ["https://docs.example.com/api", "https://docs.example.com/guides"]
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=CrawlerRunConfig(session_id=session_id),
            )
            # Save markdown for RAG indexing (user-supplied helper; sketch below)
            save_to_knowledge_base(url, result.markdown)

asyncio.run(main())
```
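`save_to_knowledge_base` is not part of Crawl4AI; it stands in for whatever indexing step your pipeline uses. A minimal sketch that writes each page's markdown to a local `kb/` directory, keyed by host and URL path:

```python
from pathlib import Path
from urllib.parse import urlparse

def save_to_knowledge_base(url: str, markdown: str, root: str = "kb") -> None:
    """Hypothetical helper: persist one page's markdown for later RAG indexing."""
    parsed = urlparse(url)
    # Derive a filesystem-safe name from the URL path, e.g. /api -> api.md
    name = parsed.path.strip("/").replace("/", "_") or "index"
    out_dir = Path(root) / parsed.netloc
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / f"{name}.md").write_text(markdown, encoding="utf-8")
```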
Installation
```bash
pip install crawl4ai
crawl4ai-setup  # Install Playwright browsers
```
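To confirm the install worked (including the Playwright browsers), a quick smoke test; the target URL is only an example:

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown[:200])  # First 200 characters of the converted markdown

asyncio.run(main())
```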
Best Practices
- Markdown output — Use `result.markdown` for LLM/RAG input; clean, structured, no HTML noise
- CSS extraction — Use `JsonCssExtractionStrategy` for structured pages; no LLM cost, fast, deterministic
- LLM extraction — Use `LLMExtractionStrategy` for unstructured pages; Pydantic schema ensures valid output
- JavaScript rendering — Crawl4AI uses Playwright; handles SPAs, infinite scroll, dynamic content
- Sessions — Use `session_id` for multi-page crawls; maintains cookies, auth state across requests
- Caching — Enable `CacheMode.ENABLED` for development; avoid re-crawling the same pages
- Wait conditions — Use `wait_for` CSS selectors to ensure content is loaded before extraction
- Rate limiting — Add delays between requests; respect robots.txt; be a good citizen (a sketch follows this list)
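A minimal sketch of that last point: sequential crawling with a fixed delay between requests. The one-second delay and the URL list are illustrative, and robots.txt checking (for example via Python's `urllib.robotparser`) is left out for brevity:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

URLS = [  # Illustrative targets
    "https://docs.example.com/api",
    "https://docs.example.com/guides",
]

async def polite_crawl(urls: list[str], delay_s: float = 1.0) -> list[str]:
    pages = []
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=CrawlerRunConfig(cache_mode=CacheMode.ENABLED),
            )
            pages.append(result.markdown)
            await asyncio.sleep(delay_s)  # Fixed delay between requests
    return pages

asyncio.run(polite_crawl(URLS))
```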