# claude-code-plugins · firecrawl-architecture-variants

## Install

**Source** — clone the upstream repo:

```bash
git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills
```

**Claude Code** — install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jeremylongshore/claude-code-plugins-plus-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/saas-packs/firecrawl-pack/skills/firecrawl-architecture-variants" ~/.claude/skills/jeremylongshore-claude-code-plugins-firecrawl-architecture-variants && rm -rf "$T"
```

Manifest: `plugins/saas-packs/firecrawl-pack/skills/firecrawl-architecture-variants/SKILL.md`
# Firecrawl Architecture Variants

## Overview
Three deployment architectures for Firecrawl at different scales: on-demand scraping for simple use cases, scheduled crawl pipelines for content monitoring, and real-time ingestion pipelines for AI/RAG applications. Choose based on volume, latency requirements, and cost budget.
## Decision Matrix
| Factor | On-Demand | Scheduled Pipeline | Real-Time Pipeline |
|---|---|---|---|
| Volume | < 500 pages/day | 500-10K pages/day | 10K+ pages/day |
| Latency | Sync (2-10s) | Async (hours) | Async (minutes) |
| Use Case | Single page lookup | Site monitoring | Knowledge base, RAG |
| Credit Control | Per-request | Per-crawl budget | Credit pipeline |
| Complexity | Low | Medium | High |
## Instructions

### Architecture 1: On-Demand Scraping

```
User Request → Backend API → firecrawl.scrapeUrl → Clean Content → Response
```
Best for: chatbots, content preview, single-page extraction.
```typescript
import express from "express";
import FirecrawlApp from "@mendable/firecrawl-js";

const app = express();
app.use(express.json());

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY!,
});

// Simple API endpoint
app.post("/api/scrape", async (req, res) => {
  const { url } = req.body;
  const result = await firecrawl.scrapeUrl(url, {
    formats: ["markdown"],
    onlyMainContent: true,
    waitFor: 3000, // give JS-heavy pages time to render
  });
  res.json({
    title: result.metadata?.title,
    content: result.markdown,
    url: result.metadata?.sourceURL,
  });
});

// With LLM extraction
app.post("/api/extract", async (req, res) => {
  const { url, schema } = req.body;
  const result = await firecrawl.scrapeUrl(url, {
    formats: ["extract"],
    extract: { schema },
  });
  res.json({ data: result.extract });
});
```
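If the same URLs are requested repeatedly, a small cache in front of `scrapeUrl` keeps responses fast and saves credits (this is also the fix suggested for slow on-demand responses under Error Handling below). A minimal in-memory sketch; the 15-minute TTL and `Map`-based store are assumptions, and a shared store such as Redis would replace the `Map` in multi-process deployments:

```typescript
// Minimal in-memory cache sketch; key and TTL are illustrative choices.
const CACHE_TTL_MS = 15 * 60 * 1000; // assumption: 15-minute freshness is acceptable
const cache = new Map<string, { content: string; expiresAt: number }>();

async function scrapeWithCache(url: string): Promise<string> {
  const hit = cache.get(url);
  if (hit && hit.expiresAt > Date.now()) return hit.content; // serve cached copy

  const result = await firecrawl.scrapeUrl(url, {
    formats: ["markdown"],
    onlyMainContent: true,
  });
  const content = result.markdown ?? "";
  cache.set(url, { content, expiresAt: Date.now() + CACHE_TTL_MS });
  return content;
}
```

The cache key here is the raw URL; normalize query strings first if your traffic includes tracking parameters.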
### Architecture 2: Scheduled Crawl Pipeline

```
Scheduler (cron) → Crawl Queue → firecrawl.asyncCrawlUrl → Result Store
                                                                │
                                                                ▼
                                               Content Processor → Search Index
```
Best for: documentation monitoring, content indexing, competitive analysis.
```typescript
import cron from "node-cron";
// `firecrawl` is the FirecrawlApp instance from Architecture 1;
// `db` is your persistence layer for crawl-job bookkeeping.

interface CrawlTarget {
  id: string;
  url: string;
  maxPages: number;
  paths?: string[];
  schedule: string; // cron expression
}

const targets: CrawlTarget[] = [
  { id: "docs", url: "https://docs.example.com", maxPages: 100, paths: ["/docs/*"], schedule: "0 2 * * *" },
  { id: "blog", url: "https://blog.example.com", maxPages: 50, schedule: "0 4 * * 1" },
];

// Schedule crawls
for (const target of targets) {
  cron.schedule(target.schedule, async () => {
    console.log(`Starting scheduled crawl: ${target.id}`);
    const job = await firecrawl.asyncCrawlUrl(target.url, {
      limit: target.maxPages,
      includePaths: target.paths,
      scrapeOptions: { formats: ["markdown"], onlyMainContent: true },
    });
    await db.saveCrawlJob({ targetId: target.id, jobId: job.id, startedAt: new Date() });
  });
}

// Separate worker polls for results
async function processPendingCrawls() {
  const pending = await db.getPendingCrawlJobs();
  for (const job of pending) {
    const status = await firecrawl.checkCrawlStatus(job.jobId);
    if (status.status === "completed") {
      await indexPages(job.targetId, status.data || []);
      await db.markComplete(job.id, status.data?.length || 0);
      console.log(`Crawl ${job.targetId} complete: ${status.data?.length} pages indexed`);
    }
  }
}

setInterval(processPendingCrawls, 30000); // poll every 30 seconds
```
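The worker above calls an `indexPages` helper that the snippet leaves undefined. A minimal sketch, assuming a generic `searchIndex.add` interface; both `searchIndex` and the document shape are stand-ins for whatever index you run:

```typescript
// Stand-in for your search index client (Elasticsearch, Meilisearch, etc.).
declare const searchIndex: { add(doc: Record<string, unknown>): Promise<void> };

interface CrawledPage {
  markdown?: string;
  metadata?: { title?: string; sourceURL?: string };
}

// Pushes crawled pages into the search index with a stable per-URL id.
async function indexPages(targetId: string, pages: CrawledPage[]) {
  for (const page of pages) {
    await searchIndex.add({
      id: `${targetId}:${page.metadata?.sourceURL}`,
      title: page.metadata?.title,
      url: page.metadata?.sourceURL,
      body: page.markdown ?? "",
      crawledAt: new Date().toISOString(),
    });
  }
}
```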
### Architecture 3: Real-Time Content Pipeline

```
URL Sources → Priority Queue → Firecrawl Workers → Content Validation
                                                           │
                                                           ▼
                                              Vector DB + Search Index
                                                           │
                                                           ▼
                                                  RAG / AI Pipeline
```
Best for: AI training data, knowledge base, enterprise content platform.
```typescript
import PQueue from "p-queue";
import FirecrawlApp from "@mendable/firecrawl-js";
// `vectorStore` is your vector DB client (Pinecone, Qdrant, etc.).

class ContentPipeline {
  private queue: PQueue;
  private firecrawl: FirecrawlApp;
  private creditBudget: number;
  private creditsUsed = 0;

  constructor(concurrency = 5, dailyBudget = 10000) {
    // Cap both parallelism and request rate (10 jobs/second)
    this.queue = new PQueue({ concurrency, interval: 1000, intervalCap: 10 });
    this.firecrawl = new FirecrawlApp({ apiKey: process.env.FIRECRAWL_API_KEY! });
    this.creditBudget = dailyBudget;
  }

  async ingest(urls: string[]) {
    if (this.creditsUsed + urls.length > this.creditBudget) {
      throw new Error("Daily credit budget exceeded");
    }

    // Use batch scrape for efficiency
    const result = await this.queue.add(() =>
      this.firecrawl.batchScrapeUrls(urls, {
        formats: ["markdown"],
        onlyMainContent: true,
      })
    );
    this.creditsUsed += urls.length;

    // Validate: drop near-empty pages and obvious block pages
    const pages = (result?.data || []).filter((page) => {
      const md = page.markdown || "";
      return md.length > 100 && !/captcha|access denied/i.test(md);
    });

    // Store in vector DB
    for (const page of pages) {
      await vectorStore.upsert({
        id: page.metadata?.sourceURL,
        content: page.markdown,
        metadata: { title: page.metadata?.title, url: page.metadata?.sourceURL },
      });
    }

    return { ingested: pages.length, rejected: urls.length - pages.length };
  }

  async discover(siteUrl: string, pathFilter: string) {
    const map = await this.firecrawl.mapUrl(siteUrl);
    return (map.links || []).filter((url) => url.includes(pathFilter));
  }
}

// Usage
const pipeline = new ContentPipeline(5, 10000);
const urls = await pipeline.discover("https://docs.example.com", "/api/");
const result = await pipeline.ingest(urls.slice(0, 100));
console.log(`Ingested ${result.ingested} pages into vector store`);
```
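One gap worth noting: `creditsUsed` accrues forever, while the budget is described as daily. A self-resetting budget tracker that `ingest()` could delegate to is sketched below; all names are illustrative:

```typescript
// Minimal sketch of a rolling daily budget, factored out of the pipeline.
class DailyBudget {
  private used = 0;
  private windowStart = new Date().toDateString();

  constructor(private limit: number) {}

  /** Reserve `n` credits, resetting the counter when the calendar day rolls over. */
  reserve(n: number): void {
    const today = new Date().toDateString();
    if (today !== this.windowStart) {
      this.windowStart = today; // new day, new budget
      this.used = 0;
    }
    if (this.used + n > this.limit) {
      throw new Error("Daily credit budget exceeded");
    }
    this.used += n;
  }
}
```

Calling `budget.reserve(urls.length)` at the top of `ingest()` then replaces both the manual check and the `creditsUsed` bookkeeping.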
## Choosing Your Architecture

```
Need real-time, user-facing response?
├── YES → On-Demand (Architecture 1)
└── NO → How many pages/day?
         ├── < 500   → On-Demand with caching
         ├── 500-10K → Scheduled Pipeline (Architecture 2)
         └── 10K+    → Real-Time Pipeline (Architecture 3)
```
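For teams that want the same decision encoded in code, here is a small helper mirroring the thresholds above; the function name and option shape are illustrative:

```typescript
type Architecture = "on-demand" | "scheduled-pipeline" | "real-time-pipeline";

// Mirrors the decision tree: user-facing latency forces on-demand;
// otherwise daily page volume picks the pipeline.
function chooseArchitecture(opts: { userFacing: boolean; pagesPerDay: number }): Architecture {
  if (opts.userFacing) return "on-demand";
  if (opts.pagesPerDay < 500) return "on-demand"; // pair with a caching layer
  if (opts.pagesPerDay <= 10_000) return "scheduled-pipeline";
  return "real-time-pipeline";
}
```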
## Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Slow on-demand response | JS-heavy target page | Add caching layer, reduce waitFor |
| Stale indexed content | Crawl schedule too infrequent | Increase frequency for critical sources |
| Credit overrun | Pipeline ingesting too aggressively | Implement daily budget with hard cap |
| Duplicate content | Re-crawling same pages | Deduplicate by content hash before indexing |
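For the duplicate-content row, a minimal content-hash gate using Node's built-in `crypto`; in practice the seen-set would live in your result store rather than process memory:

```typescript
import { createHash } from "node:crypto";

// Hashes of already-indexed content; back this with your database in production.
const seenHashes = new Set<string>();

/** Returns true if this markdown has not been indexed before. */
function isNewContent(markdown: string): boolean {
  const hash = createHash("sha256").update(markdown).digest("hex");
  if (seenHashes.has(hash)) return false;
  seenHashes.add(hash);
  return true;
}
```

Gate `vectorStore.upsert` (or `searchIndex.add`) calls on `isNewContent(page.markdown)` before indexing.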
## Next Steps

For common pitfalls, see `firecrawl-known-pitfalls`.