AgentSkillOS firecrawl-scraper
git clone https://github.com/ynulihao/AgentSkillOS
T=$(mktemp -d) && git clone --depth=1 https://github.com/ynulihao/AgentSkillOS "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skill_seeds/firecrawl-scraper" ~/.claude/skills/ynulihao-agentskillos-firecrawl-scraper && rm -rf "$T"
Firecrawl Web Scraper Skill
Status: Production Ready ✅
Last Updated: 2025-10-24
Official Docs: https://docs.firecrawl.dev
API Version: v2
What is Firecrawl?
Firecrawl is a Web Data API for AI that turns entire websites into LLM-ready markdown or structured data. It handles:
- JavaScript rendering - Executes client-side JavaScript to capture dynamic content
- Anti-bot bypass - Gets past CAPTCHA and bot detection systems
- Format conversion - Outputs as markdown, JSON, or structured data
- Screenshot capture - Saves visual representations of pages
- Browser automation - Full headless browser capabilities
API Endpoints
1. /v2/scrape - Single Page Scraping
Scrapes a single webpage and returns clean, structured content.
Use Cases:
- Extract article content
- Get product details
- Scrape specific pages
- Convert HTML to markdown
Key Options:
- formats: ["markdown", "html", "screenshot"]
- onlyMainContent: true/false (removes nav, footer, ads)
- waitFor: milliseconds to wait before scraping
- actions: browser automation actions (click, scroll, etc.)
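As a rough illustration of how these options map onto a request, the sketch below builds (but does not send) a POST to /v2/scrape using only the Python standard library. The endpoint URL, Bearer header, and body fields follow the Workers example later in this document; the helper name build_scrape_request and its parameter names are hypothetical, not part of any SDK.

```python
import json
import urllib.request

SCRAPE_ENDPOINT = "https://api.firecrawl.dev/v2/scrape"


def build_scrape_request(url, api_key, formats=("markdown",),
                         only_main_content=True, wait_for_ms=None):
    """Build (but do not send) a POST request for the scrape endpoint."""
    body = {
        "url": url,
        "formats": list(formats),
        "onlyMainContent": only_main_content,
    }
    # waitFor is optional, so only include it when the caller sets it.
    if wait_for_ms is not None:
        body["waitFor"] = wait_for_ms
    return urllib.request.Request(
        SCRAPE_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; keeping construction separate from sending makes the payload easy to inspect or log before any credits are spent.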
2. /v2/crawl - Full Site Crawling
Crawls all accessible pages from a starting URL.
Use Cases:
- Index entire documentation sites
- Archive website content
- Build knowledge bases
- Scrape multi-page content
Key Options:
- limit: max pages to crawl
- maxDepth: how many links deep to follow
- allowedDomains: restrict to specific domains
- excludePaths: skip certain URL patterns
3. /v2/map - URL Discovery
Maps all URLs on a website without scraping content.
Use Cases:
- Find sitemap
- Discover all pages
- Plan crawling strategy
- Audit website structure
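To make the "plan crawling strategy" use case concrete, here is a small sketch that groups an already discovered list of URLs by their first path segment, which can help decide which sections deserve a crawl and what limit to set. It assumes nothing about the /v2/map response shape: the input is a plain list of URL strings, and summarize_sections is a hypothetical helper name.

```python
from collections import Counter
from urllib.parse import urlparse


def summarize_sections(urls):
    """Count URLs per first path segment, e.g. '/docs' or '/blog'."""
    counts = Counter()
    for raw in urls:
        path = urlparse(raw).path
        segments = [s for s in path.split("/") if s]
        # URLs with no path segment are grouped under the site root.
        section = "/" + segments[0] if segments else "/"
        counts[section] += 1
    return counts
```

Calling summarize_sections(discovered).most_common() lists the largest sections first, which gives a quick sense of where most of the pages live before any crawl is started.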
4. /v2/extract - Structured Data Extraction
Uses AI to extract specific data fields from pages.
Use Cases:
- Extract product prices and names
- Parse contact information
- Build structured datasets
- Custom data schemas
Key Options:
- schema: Zod or JSON schema defining desired structure
- systemPrompt: guide AI extraction behavior
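Because extraction is AI-driven, it may be worth checking each returned record against the schema before storing it. The sketch below is a deliberately small check covering only top-level required keys and primitive types; it is not a full JSON Schema validator, and check_against_schema is a hypothetical helper, not an SDK function.

```python
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_against_schema(record, schema):
    """Return a list of problems found in a flat record; empty means it passed."""
    problems = []
    for key in schema.get("required", []):
        if key not in record:
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key not in record:
            continue
        type_name = spec.get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue  # unknown or absent type: nothing to check
        value = record[key]
        # bool is a subclass of int in Python, so reject it explicitly for "number".
        is_bool_as_number = type_name == "number" and isinstance(value, bool)
        if is_bool_as_number or not isinstance(value, expected):
            problems.append(
                f"{key}: expected {type_name}, got {type(value).__name__}"
            )
    return problems
```

For anything beyond flat records, a dedicated validator library would be the more robust choice; this version only aims to catch the most common surprises, such as a price returned as a string.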
Authentication
Firecrawl requires an API key for all requests.
Get API Key
- Sign up at https://www.firecrawl.dev
- Go to dashboard → API Keys
- Copy your API key (starts with fc-)
Store Securely
NEVER hardcode API keys in code!
    # .env file
    FIRECRAWL_API_KEY=fc-your-api-key-here

    # .env.local (for local development)
    FIRECRAWL_API_KEY=fc-your-api-key-here
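One way to read the key at startup and fail early with a clear message is sketched below. The FIRECRAWL_API_KEY variable name and the fc- prefix come from this document; the load_firecrawl_key helper and its optional env parameter (used to make it easy to test) are hypothetical.

```python
import os


def load_firecrawl_key(env=None):
    """Read FIRECRAWL_API_KEY and fail early with a clear message."""
    env = os.environ if env is None else env
    key = env.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set")
    if not key.startswith("fc-"):
        raise RuntimeError("FIRECRAWL_API_KEY does not start with 'fc-'")
    return key
```

Checking the prefix locally does not prove the key is valid, but it catches the common case of an empty variable or a value pasted from the wrong service before any request is made.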
Python SDK Usage
Installation
pip install firecrawl-py
Latest Version: firecrawl-py v4.5.0+
Basic Scrape
    import os
    from firecrawl import FirecrawlApp

    # Initialize client
    app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

    # Scrape a single page
    result = app.scrape_url(
        url="https://example.com/article",
        params={
            "formats": ["markdown", "html"],
            "onlyMainContent": True
        }
    )

    # Access markdown content
    markdown = result.get("markdown")
    print(markdown)
Crawl Multiple Pages
    import os
    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

    # Start crawl
    crawl_result = app.crawl_url(
        url="https://docs.example.com",
        params={
            "limit": 100,
            "scrapeOptions": {
                "formats": ["markdown"]
            }
        },
        poll_interval=5  # Check status every 5 seconds
    )

    # Process results
    for page in crawl_result.get("data", []):
        url = page.get("url")
        markdown = page.get("markdown")
        print(f"Scraped: {url}")
Extract Structured Data
    import os
    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

    # Define schema
    schema = {
        "type": "object",
        "properties": {
            "company_name": {"type": "string"},
            "product_price": {"type": "number"},
            "availability": {"type": "string"}
        },
        "required": ["company_name", "product_price"]
    }

    # Extract data
    result = app.extract(
        urls=["https://example.com/product"],
        params={
            "schema": schema,
            "systemPrompt": "Extract product information from the page"
        }
    )

    print(result)
TypeScript/Node.js SDK Usage
Installation
    npm install @mendable/firecrawl-js
    # or
    pnpm add @mendable/firecrawl-js
    # or use the unscoped package:
    npm install firecrawl
Latest Version: @mendable/firecrawl-js v4.4.1+ (or firecrawl v4.4.1+)
Basic Scrape
    import FirecrawlApp from '@mendable/firecrawl-js';

    // Initialize client
    const app = new FirecrawlApp({
      apiKey: process.env.FIRECRAWL_API_KEY
    });

    // Scrape a single page
    const result = await app.scrapeUrl('https://example.com/article', {
      formats: ['markdown', 'html'],
      onlyMainContent: true
    });

    // Access markdown content
    const markdown = result.markdown;
    console.log(markdown);
Crawl Multiple Pages
    import FirecrawlApp from '@mendable/firecrawl-js';

    const app = new FirecrawlApp({
      apiKey: process.env.FIRECRAWL_API_KEY
    });

    // Start crawl
    const crawlResult = await app.crawlUrl('https://docs.example.com', {
      limit: 100,
      scrapeOptions: {
        formats: ['markdown']
      }
    });

    // Process results
    for (const page of crawlResult.data) {
      console.log(`Scraped: ${page.url}`);
      console.log(page.markdown);
    }
Extract Structured Data with Zod
    import FirecrawlApp from '@mendable/firecrawl-js';
    import { z } from 'zod';

    const app = new FirecrawlApp({
      apiKey: process.env.FIRECRAWL_API_KEY
    });

    // Define schema with Zod
    const schema = z.object({
      company_name: z.string(),
      product_price: z.number(),
      availability: z.string()
    });

    // Extract data
    const result = await app.extract({
      urls: ['https://example.com/product'],
      schema: schema,
      systemPrompt: 'Extract product information from the page'
    });

    console.log(result);
Common Use Cases
1. Documentation Scraping
Scenario: Convert entire documentation site to markdown for RAG/chatbot
    import os
    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

    docs = app.crawl_url(
        url="https://docs.myapi.com",
        params={
            "limit": 500,
            "scrapeOptions": {
                "formats": ["markdown"],
                "onlyMainContent": True
            },
            "allowedDomains": ["docs.myapi.com"]
        }
    )

    # Make sure the output directory exists before writing
    os.makedirs("docs", exist_ok=True)

    # Save to files
    for page in docs.get("data", []):
        filename = page["url"].replace("https://", "").replace("/", "_") + ".md"
        with open(f"docs/{filename}", "w") as f:
            f.write(page["markdown"])
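The filename logic above only replaces slashes, so URLs with query strings or unusual characters could produce awkward or colliding names. A possible alternative is sketched below using only the standard library; url_to_filename and the 80-character cap are hypothetical choices, not something the skill prescribes.

```python
import hashlib
import re
from urllib.parse import urlparse


def url_to_filename(url, max_len=80):
    """Turn a URL into a filesystem-friendly, reasonably unique .md name."""
    parsed = urlparse(url)
    stem = (parsed.netloc + parsed.path).strip("/")
    # Collapse anything outside a conservative character set into underscores.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", stem)[:max_len] or "index"
    # A short hash of the full URL keeps names distinct when only the query differs.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return f"{stem}-{digest}.md"
```

The hash suffix trades a little readability for uniqueness, which matters most on large crawls where many pages share a path and differ only in parameters.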
2. Product Data Extraction
Scenario: Extract structured product data for e-commerce
    const schema = z.object({
      title: z.string(),
      price: z.number(),
      description: z.string(),
      images: z.array(z.string()),
      in_stock: z.boolean()
    });

    const products = await app.extract({
      urls: productUrls,
      schema: schema,
      systemPrompt: 'Extract all product details including price and availability'
    });
3. News Article Scraping
Scenario: Extract clean article content without ads/navigation
    article = app.scrape_url(
        url="https://news.com/article",
        params={
            "formats": ["markdown"],
            "onlyMainContent": True,
            "removeBase64Images": True
        }
    )

    # Get clean markdown
    content = article.get("markdown")
Error Handling
Python
    import os

    from firecrawl import FirecrawlApp
    from firecrawl.exceptions import FirecrawlException

    app = FirecrawlApp(api_key=os.environ.get("FIRECRAWL_API_KEY"))

    try:
        result = app.scrape_url("https://example.com")
    except FirecrawlException as e:
        print(f"Firecrawl error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
TypeScript
    import FirecrawlApp from '@mendable/firecrawl-js';

    const app = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY });

    try {
      const result = await app.scrapeUrl('https://example.com');
    } catch (error) {
      if (error.response) {
        // API error
        console.error('API Error:', error.response.data);
      } else {
        // Network or other error
        console.error('Error:', error.message);
      }
    }
Rate Limits & Best Practices
Rate Limits
- Free tier: 500 credits/month
- Paid tiers: Higher limits based on plan
- Credits consumed vary by endpoint and options
Best Practices
- Use onlyMainContent: true to reduce credits and get cleaner data
- Set reasonable limits on crawls to avoid excessive costs
- Handle retries with exponential backoff for transient errors
- Cache results locally to avoid re-scraping same content
- Use map endpoint first to plan crawling strategy
- Batch extract calls when processing multiple URLs
- Monitor credit usage in dashboard
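The retry advice above could be implemented along these lines. This is a generic sketch, not Firecrawl-specific code: retry_with_backoff is a hypothetical helper, and which exceptions count as transient (the retry_on parameter) depends on the SDK or HTTP client in use, so that choice is left to the caller.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            # Up to 10% random jitter spreads out retries from parallel workers.
            sleep(delay + random.uniform(0, delay * 0.1))
```

A call might look like retry_with_backoff(lambda: app.scrape_url(url), retry_on=(ConnectionError,)). Narrowing retry_on avoids burning credits by retrying errors that will never succeed, such as an invalid API key.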
Cloudflare Workers Integration
⚠️ Important: SDK Compatibility
The Firecrawl SDK cannot run in Cloudflare Workers due to Node.js dependencies (specifically axios, which uses the Node.js http module). Workers require Web Standard APIs.
✅ Use the direct REST API with fetch instead (see example below).
Alternative: Self-host with workers-firecrawl - a Workers-native implementation (requires Workers Paid Plan, only implements the /search endpoint).
Workers Example: Direct REST API
This example uses the fetch API to call Firecrawl directly, so it runs in Cloudflare Workers without the SDK:
    interface Env {
      FIRECRAWL_API_KEY: string;
      SCRAPED_CACHE?: KVNamespace; // Optional: for caching results
    }

    interface FirecrawlScrapeResponse {
      success: boolean;
      data: {
        markdown?: string;
        html?: string;
        metadata: {
          title?: string;
          description?: string;
          language?: string;
          sourceURL: string;
        };
      };
    }

    export default {
      async fetch(request: Request, env: Env): Promise<Response> {
        if (request.method !== 'POST') {
          return Response.json({ error: 'Method not allowed' }, { status: 405 });
        }

        try {
          const { url } = await request.json<{ url: string }>();

          if (!url) {
            return Response.json({ error: 'URL is required' }, { status: 400 });
          }

          // Check cache (optional)
          if (env.SCRAPED_CACHE) {
            const cached = await env.SCRAPED_CACHE.get(url, 'json');
            if (cached) {
              return Response.json({ cached: true, data: cached });
            }
          }

          // Call Firecrawl API directly using fetch
          const response = await fetch('https://api.firecrawl.dev/v2/scrape', {
            method: 'POST',
            headers: {
              'Authorization': `Bearer ${env.FIRECRAWL_API_KEY}`,
              'Content-Type': 'application/json',
            },
            body: JSON.stringify({
              url: url,
              formats: ['markdown'],
              onlyMainContent: true,
              removeBase64Images: true
            })
          });

          if (!response.ok) {
            const errorText = await response.text();
            throw new Error(`Firecrawl API error (${response.status}): ${errorText}`);
          }

          const result = await response.json<FirecrawlScrapeResponse>();

          // Cache for 1 hour (optional)
          if (env.SCRAPED_CACHE && result.success) {
            await env.SCRAPED_CACHE.put(
              url,
              JSON.stringify(result.data),
              { expirationTtl: 3600 }
            );
          }

          return Response.json({ cached: false, data: result.data });
        } catch (error) {
          console.error('Scraping error:', error);
          return Response.json(
            { error: error instanceof Error ? error.message : 'Unknown error' },
            { status: 500 }
          );
        }
      }
    };
Environment Setup: Add FIRECRAWL_API_KEY in Wrangler secrets:

    npx wrangler secret put FIRECRAWL_API_KEY
Optional KV Binding (for caching - add to wrangler.jsonc):

    {
      "kv_namespaces": [
        {
          "binding": "SCRAPED_CACHE",
          "id": "your-kv-namespace-id"
        }
      ]
    }
See templates/firecrawl-worker-fetch.ts for a complete production-ready example.
When to Use This Skill
✅ Use Firecrawl when:
- Scraping modern websites with JavaScript
- Need clean markdown output for LLMs
- Building RAG systems from web content
- Extracting structured data at scale
- Dealing with bot protection
- Need reliable, production-ready scraping
❌ Don't use Firecrawl when:
- Scraping simple static HTML (use cheerio/beautifulsoup)
- Have existing Puppeteer/Playwright setup working well
- Working with APIs (use direct API calls instead)
- Budget constraints (free tier has limits)
Common Issues & Solutions
Issue: "Invalid API Key"
Cause: API key not set or incorrect
Fix:

    # Check env variable is set
    echo $FIRECRAWL_API_KEY
    # Verify key format (should start with fc-)
Issue: "Rate limit exceeded"
Cause: Exceeded monthly credits
Fix:
- Check usage in dashboard
- Upgrade plan or wait for reset
- Use onlyMainContent: true to reduce credits
Issue: "Timeout error"
Cause: Page takes too long to load
Fix:

    result = app.scrape_url(url, params={"waitFor": 10000})  # Wait 10s
Issue: "Content is empty"
Cause: Content loaded via JavaScript after initial render
Fix:

    result = app.scrape_url(url, params={
        "waitFor": 5000,
        "actions": [{"type": "wait", "milliseconds": 3000}]
    })
Advanced Features
Browser Actions
Perform interactions before scraping:
    result = app.scrape_url(
        url="https://example.com",
        params={
            "actions": [
                {"type": "click", "selector": "button.load-more"},
                {"type": "wait", "milliseconds": 2000},
                {"type": "scroll", "direction": "down"}
            ]
        }
    )
Custom Headers
    result = app.scrape_url(
        url="https://example.com",
        params={
            "headers": {
                "User-Agent": "Custom Bot 1.0",
                "Accept-Language": "en-US"
            }
        }
    )
Webhooks for Long Crawls
Instead of polling, receive results via webhook:
    crawl = app.crawl_url(
        url="https://docs.example.com",
        params={
            "limit": 1000,
            "webhook": "https://your-domain.com/webhook"
        }
    )
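On the receiving side, something has to accept the POST at the webhook URL. The sketch below is a minimal standard-library receiver that makes no assumption about the payload fields: it only parses the body as a JSON object and reports its top-level keys. The names summarize_payload, WebhookHandler, and run, as well as port 8000, are hypothetical; check the official docs for the actual event format before relying on any field.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_payload(raw):
    """Parse a JSON webhook body and return its sorted top-level keys."""
    payload = json.loads(raw.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return sorted(payload)


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        try:
            keys = summarize_payload(raw)
        except ValueError:
            # Malformed JSON, bad encoding, or a non-object body.
            self.send_response(400)
            self.end_headers()
            return
        print("webhook received with keys:", keys)
        self.send_response(200)
        self.end_headers()


def run(port=8000):
    """Serve until interrupted; intended for local experiments only."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

Calling run() starts the server locally. A real deployment would also need a publicly reachable HTTPS endpoint and some way to verify that requests genuinely come from Firecrawl, neither of which this sketch attempts.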
Package Versions
| Package | Version | Last Checked |
|---|---|---|
| firecrawl-py | 4.5.0+ | 2025-10-20 |
| @mendable/firecrawl-js (or firecrawl) | 4.4.1+ | 2025-10-24 |
| API Version | v2 | Current |
Note: The Node.js SDK requires Node.js >=22.0.0 and cannot run in Cloudflare Workers. Use direct REST API calls in Workers (see Cloudflare Workers Integration section).
Official Documentation
- Docs: https://docs.firecrawl.dev
- Python SDK: https://docs.firecrawl.dev/sdks/python
- Node.js SDK: https://docs.firecrawl.dev/sdks/node
- API Reference: https://docs.firecrawl.dev/api-reference
- GitHub: https://github.com/mendableai/firecrawl
- Dashboard: https://www.firecrawl.dev/app
Next Steps After Using This Skill
- Store scraped data: Use Cloudflare D1, R2, or KV to persist results
- Build RAG system: Combine with Vectorize for semantic search
- Add scheduling: Use Cloudflare Queues for recurring scrapes
- Process content: Use Workers AI to analyze scraped data
Token Savings: ~60% vs manual integration
Error Prevention: API authentication, rate limiting, format handling
Production Ready: ✅