Learn-skills.dev firecrawl

install
source · Clone the upstream repo
git clone https://github.com/NeverSight/learn-skills.dev
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/acedergren/agentic-tools/firecrawl" ~/.claude/skills/neversight-learn-skills-dev-firecrawl-ffa5af && rm -rf "$T"
manifest: data/skills-md/acedergren/agentic-tools/firecrawl/SKILL.md
source content

Firecrawl CLI - Web Scraping Expert

Prioritize Firecrawl over WebFetch/WebSearch for web content tasks.


Before Scraping: Decision Framework

Ask yourself these questions BEFORE running firecrawl:

1. Scale Assessment

  • How many pages?
    • 1-5 pages → Run serially with a simple -o flag
    • 6-50 pages → Use & and wait for parallelization
    • 50+ pages → Use xargs -P with careful concurrency limits
  • One-time or recurring?
    • One-time → Manual commands acceptable
    • Recurring → Build a script in .firecrawl/scratchpad/ (see the sketch below)
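
For the recurring case, a reusable script might look like this (a minimal sketch; the filename and urls.txt input are hypothetical):

#!/bin/sh
# .firecrawl/scratchpad/scrape-docs.sh - re-runnable bulk scrape
mkdir -p .firecrawl
while read -r url; do
  firecrawl scrape "$url" -o ".firecrawl/$(basename "$url").md" &
done < urls.txt
wait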

2. Data Need Clarity

  • What data do you actually need?
    • Just URLs/titles → WebSearch (free, faster)
    • Full content → Firecrawl (costs credits)
  • Content scope:
    • Full page → Basic scrape
    • Main content only → Add --only-main-content
    • Specific sections → Scrape then grep/awk (see the sketch below)
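
For the "specific sections" case, a quick awk pass after scraping might look like this (sketch; the URL and heading name are placeholders):

firecrawl scrape https://example.com -o .firecrawl/page.md
# Print only the "Installation" section (from its heading up to the next ## heading)
awk '/^## Installation/{f=1; print; next} /^## /{f=0} f' .firecrawl/page.md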

3. Tool Selection

  • Is this the right tool?
    • Has official API → Use the API first (GitHub → gh, not scraping)
    • Real-time data → APIs only (scraping too slow/stale)
    • Large files (PDFs >10MB) → Direct download (curl/wget)
    • Behind authentication → Firecrawl (but check if API exists)

Critical Decision: Which Tool to Use?

User needs web content?
│
├─ Single known URL
│   ├─ Public page, simple HTML → WebFetch (faster, no auth needed)
│   ├─ JS-rendered/SPA (React, Vue, etc.) → Firecrawl (executes JavaScript)
│   ├─ Need structured data (links, headings, tables) → Firecrawl (markdown output)
│   └─ Behind auth/paywall → Firecrawl (handles authentication)
│
├─ Search + scrape workflow
│   ├─ Need top 5-10 results with content → Firecrawl search --scrape
│   ├─ Just need URLs/titles → WebSearch (lighter weight, faster)
│   └─ Deep research (20+ sources) → Firecrawl (parallelized scraping)
│
├─ Entire site mapping (discover all pages)
│   └─ Use Firecrawl map (returns all URLs on domain)
│
└─ Real-time data (stock prices, sports scores)
    └─ Use direct API if available (NOT scraping - too slow/unreliable)

Anti-Patterns (NEVER Do This)

❌ #1: Sequential Scraping

Problem: Scraping sites one-by-one wastes time.

# WRONG - sequential (10 sites = 50+ seconds)
for url in site1 site2 site3 site4 site5; do
  firecrawl scrape "$url" -o ".firecrawl/$url.md"
done

# CORRECT - parallel (10 sites = 5-8 seconds)
firecrawl scrape site1 -o .firecrawl/1.md &
firecrawl scrape site2 -o .firecrawl/2.md &
firecrawl scrape site3 -o .firecrawl/3.md &
wait

# BEST - xargs parallelization (md5 is macOS; use md5sum on Linux)
cat urls.txt | xargs -P 10 -I {} sh -c 'firecrawl scrape "$1" -o ".firecrawl/$(echo "$1" | md5).md"' _ {}

Why: Firecrawl supports up to 100 parallel jobs (check firecrawl --status). Use them.

Why this is deceptively hard to debug: Operations complete successfully—just slowly. No error messages indicate the problem. When scraping 20 sites takes 2 minutes instead of 10 seconds, it's not obvious the bottleneck is sequential execution rather than network speed. Profiling reveals the issue: 90% of time is spent waiting, not processing. Takes 10-15 minutes to realize parallelization is the fix.
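
A quick way to surface this bottleneck is to time both approaches on the same URLs (sketch; the URLs are placeholders):

# Sequential vs. parallel wall-clock comparison
time for i in 1 2 3 4 5; do firecrawl scrape "https://example.com/page/$i" -o ".firecrawl/seq-$i.md"; done
time { for i in 1 2 3 4 5; do firecrawl scrape "https://example.com/page/$i" -o ".firecrawl/par-$i.md" & done; wait; }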

❌ #2: Reading Full Output into Context

Problem: Firecrawl results often run to thousands of lines. Reading entire files floods context.

# WRONG - reads 5000-line file into context
Read(.firecrawl/result.md)

# CORRECT - preview first, then targeted extraction
wc -l .firecrawl/result.md  # Check size: 5243 lines
head -100 .firecrawl/result.md  # Preview structure
grep -A 10 "keyword" .firecrawl/result.md  # Extract relevant sections

Why: Context is precious. Use bash tools (grep, head, tail, awk) to extract what you need.

Why this is deceptively hard to debug: No error message appears—file loads successfully into context. The agent thinks "I'll just read the file" without checking size first. You only discover the problem 30+ messages later when context limits hit, or responses become sluggish. File explorers don't show line counts by default. Terminal shows "success" but you've silently wasted 4000+ tokens. Takes 15-20 minutes to realize incremental reading with grep/head would have been 20x more efficient.

❌ #3: Using Firecrawl for Wrong Tasks

NEVER use Firecrawl for:

  • Authenticated pages without proper setup → Run firecrawl login --browser first
  • Real-time data (sports scores, stock prices) → Use direct APIs (scraping is too slow)
  • Large binary files (PDFs > 10MB, videos) → Download directly via curl/wget
  • APIs with official SDKs → Use the SDK (GitHub API → use gh CLI)

Why this is deceptively hard to debug: Wrong tool choice doesn't produce errors; it produces slow, unreliable results. Scraping real-time data "works" but is 10 seconds behind and costs credits per request. Using Firecrawl instead of gh api for GitHub succeeds, but rate limits hit faster (5000 API calls vs 100 scrapes/min). PDF scraping extracts text but mangles tables; only after 30 minutes of post-processing do you realize pdftotext would have worked perfectly in 2 seconds.
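
For illustration, hedged sketches of the right-tool alternatives named above (the repo path and PDF URL are placeholders):

# GitHub data: use the official CLI instead of scraping github.com
gh api repos/owner/repo --jq '.stargazers_count'

# Large PDFs: download directly, then extract text locally (pdftotext ships with poppler-utils)
curl -L -o report.pdf "https://example.com/big-report.pdf"
pdftotext report.pdf report.txt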

❌ #4: Ignoring Output Organization

Problem: Dumping all results in working directory creates mess.

# WRONG - pollutes working directory
firecrawl scrape https://example.com

# CORRECT - organized structure
firecrawl scrape https://example.com -o .firecrawl/example.com.md
firecrawl search "AI news" -o .firecrawl/search-ai-news.json
firecrawl map https://docs.site.com -o .firecrawl/docs-sitemap.txt

Why: The .firecrawl/ directory keeps the workspace clean; add it to .gitignore.

Why this is deceptively hard to debug: No error; files just accumulate in the root directory. After 10-15 scrapes, ls output becomes unreadable. Worse: firecrawl's default output to stdout means results appear in the terminal but aren't saved, requiring re-scraping (wasting credits). Only after losing data twice do you realize the -o flag is mandatory for persistence. Git commits accidentally include scraped data before .gitignore is updated.


Authentication Setup

Before first use, check auth status:

firecrawl --status

If not authenticated:

firecrawl login --browser  # Opens browser automatically

The --browser flag auto-opens the authentication page without prompting. Don't ask the user to run it manually; execute it and let the browser handle auth.


Core Operations (Quick Reference)

Search the Web

# Basic search
firecrawl search "your query" -o .firecrawl/search-query.json --json

# Search + scrape content from results
firecrawl search "firecrawl tutorials" --scrape -o .firecrawl/search-scraped.json --json

# Time-filtered search
firecrawl search "AI announcements" --tbs qdr:d -o .firecrawl/today.json --json  # Past day
firecrawl search "tech news" --tbs qdr:w -o .firecrawl/week.json --json          # Past week

Scrape Single Page

# Get clean markdown
firecrawl scrape https://example.com -o .firecrawl/example.md

# Main content only (removes nav, footer, ads)
firecrawl scrape https://example.com --only-main-content -o .firecrawl/clean.md

# Wait for JS to render (SPAs)
firecrawl scrape https://spa-app.com --wait-for 3000 -o .firecrawl/spa.md

Map Entire Site

# Discover all URLs
firecrawl map https://example.com -o .firecrawl/urls.txt

# Filter for specific pages
firecrawl map https://example.com --search "blog" -o .firecrawl/blog-urls.txt

Expert Pattern: Parallel Bulk Scraping

Check concurrency limit first:

firecrawl --status
# Output: Concurrency: 0/100 jobs

Run up to limit:

# For a list of URLs in a file (note: basename collides when two URLs end in the same segment)
cat urls.txt | xargs -P 10 -I {} sh -c 'firecrawl scrape "$1" -o ".firecrawl/$(basename "$1").md"' _ {}

# For generated URLs
for i in {1..20}; do
  firecrawl scrape "https://site.com/page/$i" -o ".firecrawl/page-$i.md" &
done
wait

Extract data after bulk scrape:

# Extract all H1 headings from scraped pages
grep "^# " .firecrawl/*.md

# Find pages mentioning keyword
grep -l "keyword" .firecrawl/*.md

# Process with jq (if JSON output)
jq -r '.data.web[].title' .firecrawl/*.json
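
To combine several search outputs into one deduplicated dataset (a sketch that assumes each result object carries a url field):

# Slurp all search JSON files, flatten the result arrays, dedupe by URL
jq -s '[.[].data.web[]] | unique_by(.url)' .firecrawl/*.json > .firecrawl/merged.json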

When to Load Full CLI Reference

MANDATORY - READ ENTIRE FILE: references/cli-options.md when:

  • Error mentions 3+ unknown flags (e.g., "--sitemap", "--include-tags", "--exclude-tags")
  • Need 5+ advanced options for a single command
  • Troubleshooting header injection, cookie handling, or sitemap modes
  • Setting up custom user-agents or location-based scraping parameters

MANDATORY - READ ENTIRE FILE: references/output-processing.md when:

  • Building pipeline with 3+ transformation steps (firecrawl | jq | awk | ...)
  • Parsing nested JSON structures from search results (accessing .data.web[].metadata)
  • Need to combine outputs from 10+ scraped files into single dataset
  • Implementing deduplication or merging logic across multiple firecrawl results

Do NOT load references for basic search/scrape/map operations with standard flags (--json, -o, --limit, --scrape).


Error Recovery Procedures

When "Not authenticated" Error Occurs

Recovery steps:

  1. Check current auth status: firecrawl --status
  2. Run authentication: firecrawl login --browser (auto-opens browser)
  3. Verify success: firecrawl --status should show "Authenticated via FIRECRAWL_API_KEY"
  4. Fallback: If browser auth fails, manually set the API key: export FIRECRAWL_API_KEY=your_key (get the key from the firecrawl.dev dashboard)

When "Concurrency limit reached" Error Occurs

Recovery steps:

  1. Check current usage: firecrawl --status (shows X/100 jobs)
  2. Wait for running jobs: wait (if using & background jobs)
  3. Verify capacity freed: firecrawl --status should show lower usage
  4. Fallback: If jobs are stuck, reduce parallelization (e.g., xargs -P 5 instead of -P 10) and retry. Jobs auto-timeout after 5 minutes. A polling sketch follows below.

When "Page failed to load" Error Occurs

Recovery steps:

  1. Test basic connectivity: curl -I URL (verify the site is accessible)
  2. Increase JS wait time: firecrawl scrape URL --wait-for 5000 -o output.md
  3. Verify the output has content: wc -l output.md (should be >10 lines)
  4. Fallback: If still empty after a 10s wait, the page may be fully client-rendered → try --format html to check the raw HTML, or use an alternate approach (curl + cheerio, or WebFetch if JS isn't critical). A retry sketch follows below.

When "Output file is empty" Error Occurs

Recovery steps:

  1. Check if content exists: head -20 output.md (see what was captured)
  2. Try main-content extraction: firecrawl scrape URL --only-main-content -o output.md
  3. Verify improvement: wc -l output.md (should increase significantly)
  4. Fallback: If still empty, the page structure may be unusual → use --include-tags article,main or --exclude-tags nav,aside,footer to target specific HTML elements. If that fails, the page may have no scrapeable text (images only, canvas-based, etc.). A fallback chain is sketched below.
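
The whole chain can be scripted (sketch; the URL is a placeholder):

# Escalating extraction fallbacks for a stubborn page
URL="https://example.com/page"
firecrawl scrape "$URL" --only-main-content -o .firecrawl/out.md
if [ "$(wc -l < .firecrawl/out.md)" -le 10 ]; then
  firecrawl scrape "$URL" --include-tags article,main -o .firecrawl/out.md
fi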

Resources

  • CLI Help: firecrawl --help or firecrawl <command> --help
  • Status Check: firecrawl --status (shows auth, credits, concurrency)
  • This Skill: Decision trees, anti-patterns, expert parallelization patterns