Vibecosystem harvest-deep-crawl

Multi-page deep crawling - documentation sites, wikis, knowledge bases

Install

Source · Clone the upstream repo

git clone https://github.com/vibeeval/vibecosystem

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/vibeeval/vibecosystem "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/harvest-deep-crawl" ~/.claude/skills/vibeeval-vibecosystem-harvest-deep-crawl && rm -rf "$T"

Manifest: skills/harvest-deep-crawl/SKILL.md

Source Content

Harvest Deep Crawl

Crawl multi-page websites following internal links to a specified depth. Ideal for building complete knowledge bases from documentation sites, wikis, and reference materials.

Usage

/crawl <url> --depth <N>

Examples

# Crawl docs site 3 levels deep
/crawl https://docs.example.com --depth 3

# Crawl a specific section
/crawl https://docs.example.com/api --depth 2

# Crawl with page limit
/crawl https://wiki.example.com --depth 5 --max-pages 50

Parameters

Param          Default  Description
--depth        2        Max link-following depth
--max-pages    100      Max pages to crawl
--same-domain  true     Stay on same domain
--include      *        URL pattern to include
--exclude      -        URL pattern to exclude
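
The manifest doesn't pin down the pattern syntax for --include and --exclude; here is a minimal Python sketch assuming glob-style patterns (url_allowed is a hypothetical helper, not part of the skill):

from fnmatch import fnmatch

def url_allowed(url, include="*", exclude=None):
    """Apply --include first, then --exclude (glob semantics assumed)."""
    if not fnmatch(url, include):
        return False
    if exclude and fnmatch(url, exclude):
        return False
    return True

# e.g. crawl only the API section while skipping legacy v1 pages
url_allowed("https://docs.example.com/api/v2/auth",
            include="*/api/*", exclude="*/v1/*")   # True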

How It Works

  1. Start at root URL, extract all internal links
  2. Follow links up to the specified depth (BFS order; steps 1-4 are sketched after this list)
  3. Extract content from each page
  4. Deduplicate pages with > 90% content overlap
  5. Build table of contents from page hierarchy
  6. Merge into coherent knowledge base
  7. Save to .claude/cache/agents/harvest/crawl-{domain}-{timestamp}/
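
A condensed Python sketch of steps 1 through 4. fetch_page and extract_links are hypothetical stand-ins for the real engine, and SequenceMatcher is just one plausible way to approximate the 90% overlap check:

from collections import deque
from difflib import SequenceMatcher
from urllib.parse import urlparse

def crawl(root, depth=2, max_pages=100, same_domain=True):
    domain = urlparse(root).netloc
    queue = deque([(root, 0)])            # BFS frontier of (url, depth) pairs
    seen, pages = {root}, {}
    while queue and len(pages) < max_pages:
        url, d = queue.popleft()
        text = fetch_page(url)            # hypothetical fetcher
        # step 4: skip near-duplicates (> 90% content overlap)
        if any(SequenceMatcher(None, text, kept).ratio() > 0.9
               for kept in pages.values()):
            continue
        pages[url] = text
        if d == depth:
            continue                      # depth limit reached on this branch
        for link in extract_links(text, base=url):   # hypothetical link parser
            if same_domain and urlparse(link).netloc != domain:
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, d + 1))
    return pages

Comparing every new page against every kept page is quadratic; a production crawler would hash shingles instead, but the control flow is the same.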

Output Structure

crawl-{domain}-{timestamp}/
  index.md          # Table of contents + summary
  page-001.md       # First page content
  page-002.md       # Second page content
  ...
  metadata.json     # Crawl stats, URLs, timings
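
Neither index.md nor metadata.json has a documented schema; this hypothetical sketch materializes the layout above from the pages dict returned by crawl():

import json, time
from pathlib import Path

def save_crawl(pages, domain):
    stamp = time.strftime("%Y%m%dT%H%M%S")
    out = Path(f".claude/cache/agents/harvest/crawl-{domain}-{stamp}")
    out.mkdir(parents=True, exist_ok=True)
    toc = []
    for i, (url, text) in enumerate(pages.items(), start=1):
        name = f"page-{i:03d}.md"              # page-001.md, page-002.md, ...
        (out / name).write_text(text, encoding="utf-8")
        toc.append(f"- [{url}]({name})")
    (out / "index.md").write_text("# Table of Contents\n\n" + "\n".join(toc) + "\n")
    # field names below are assumptions, not documented by the skill
    (out / "metadata.json").write_text(json.dumps(
        {"domain": domain, "page_count": len(pages), "urls": list(pages)}, indent=2))
    return out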

Crawl Engine

Primary: crawl4ai (Docker port 11235)

curl -s http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.example.com"],
    "max_depth": 3,
    "same_domain": true,
    "word_count_threshold": 50
  }'
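
The same request from Python can double as an availability probe: if the connection is refused, the skill should drop to the fallback below. The payload mirrors the curl example; everything else is a sketch:

import requests

def crawl_via_engine(url, depth):
    try:
        resp = requests.post(
            "http://localhost:11235/crawl",
            json={"urls": [url], "max_depth": depth,
                  "same_domain": True, "word_count_threshold": 50},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()
    except requests.ConnectionError:
        return None   # Docker engine unreachable -> use the fallback below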

Fallback: Manual Link Following

When the Docker engine is unavailable, the skill falls back to manual link following (a rough stand-in is sketched after these steps):

  1. WebFetch root URL
  2. Parse links from markdown output
  3. WebFetch each linked page (depth-limited)
  4. Compile results
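
WebFetch is a Claude Code tool, so outside that environment steps 1-2 might look like this rough Python stand-in, using requests and the stdlib HTMLParser to collect links from a fetched page:

from html.parser import HTMLParser
from urllib.parse import urljoin
import requests

class LinkParser(HTMLParser):
    """Collect absolute URLs from every <a href=...> on the page."""
    def __init__(self, base):
        super().__init__()
        self.base, self.links = base, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base, href))

def fetch_and_extract(url):
    html = requests.get(url, timeout=30).text
    parser = LinkParser(url)
    parser.feed(html)
    return html, parser.links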

Use Cases

Scenario                 Depth  Max Pages
API reference            2-3    50
Full documentation site  3-5    100
Wiki section             2      30
Changelog history        1-2    20
Tutorial series          2-3    30

Rules

  • Respect robots.txt
  • Max 2 requests/second (throttling is sketched below)
  • Skip binary files (PDF, images, videos)
  • Detect and skip infinite pagination
  • Cache results for 24 hours
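
A sketch of the first two rules, using the stdlib robots.txt parser plus a fixed minimum interval between requests; caching one parser per origin is an implementation assumption, not something the manifest specifies:

import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class PoliteFetcher:
    """Enforce robots.txt plus the 2 requests/second cap."""
    MIN_INTERVAL = 0.5            # seconds between requests = 2 req/s

    def __init__(self):
        self._robots = {}         # one parsed robots.txt per origin
        self._last = 0.0

    def allowed(self, url):
        origin = "{0.scheme}://{0.netloc}".format(urlparse(url))
        rp = self._robots.get(origin)
        if rp is None:
            rp = RobotFileParser(origin + "/robots.txt")
            rp.read()             # fetch and parse robots.txt once per origin
            self._robots[origin] = rp
        return rp.can_fetch("*", url)

    def throttle(self):
        wait = self.MIN_INTERVAL - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)      # sleep just long enough to stay under the cap
        self._last = time.monotonic()

A crawler would call allowed() before queueing a URL and throttle() before each fetch.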