Vibecosystem harvest-deep-crawl

Multi-page deep crawling - documentation sites, wikis, knowledge bases

Install

Source · Clone the upstream repo

git clone https://github.com/vibeeval/vibecosystem

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/vibeeval/vibecosystem "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/harvest-deep-crawl" ~/.claude/skills/vibeeval-vibecosystem-harvest-deep-crawl && rm -rf "$T"

Manifest: skills/harvest-deep-crawl/SKILL.md

Source Content

Harvest Deep Crawl

Crawl multi-page websites following internal links to a specified depth. Ideal for building complete knowledge bases from documentation sites, wikis, and reference materials.

Usage

/crawl <url> --depth <N>

Examples

# Crawl docs site 3 levels deep
/crawl https://docs.example.com --depth 3

# Crawl a specific section
/crawl https://docs.example.com/api --depth 2

# Crawl with page limit
/crawl https://wiki.example.com --depth 5 --max-pages 50

Parameters

Param          Default  Description
--depth        2        Max link-following depth
--max-pages    100      Max pages to crawl
--same-domain  true     Stay on same domain
--include      *        URL pattern to include
--exclude      -        URL pattern to exclude
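
The manifest doesn't pin down the pattern syntax for --include and --exclude; here is a minimal Python sketch assuming glob-style patterns (url_allowed is a hypothetical helper, not part of the skill):

from fnmatch import fnmatch

def url_allowed(url, include="*", exclude=None):
    """Apply --include first, then --exclude (glob semantics assumed)."""
    if not fnmatch(url, include):
        return False
    if exclude and fnmatch(url, exclude):
        return False
    return True

# e.g. crawl only the API section while skipping legacy v1 pages
url_allowed("https://docs.example.com/api/v2/auth",
            include="*/api/*", exclude="*/v1/*")   # True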

How It Works

  1. Start at root URL, extract all internal links
  2. Follow links up to the specified depth (BFS order; steps 1-4 are sketched after this list)
  3. Extract content from each page
  4. Deduplicate pages with > 90% content overlap
  5. Build table of contents from page hierarchy
  6. Merge into coherent knowledge base
  7. Save to .claude/cache/agents/harvest/crawl-{domain}-{timestamp}/
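
A condensed Python sketch of steps 1 through 4. fetch_page and extract_links are hypothetical stand-ins for the real engine, and SequenceMatcher is just one plausible way to approximate the 90% overlap check:

from collections import deque
from difflib import SequenceMatcher
from urllib.parse import urlparse

def crawl(root, depth=2, max_pages=100, same_domain=True):
    domain = urlparse(root).netloc
    queue = deque([(root, 0)])            # BFS frontier of (url, depth) pairs
    seen, pages = {root}, {}
    while queue and len(pages) < max_pages:
        url, d = queue.popleft()
        text = fetch_page(url)            # hypothetical fetcher
        # step 4: skip near-duplicates (> 90% content overlap)
        if any(SequenceMatcher(None, text, kept).ratio() > 0.9
               for kept in pages.values()):
            continue
        pages[url] = text
        if d == depth:
            continue                      # depth limit reached on this branch
        for link in extract_links(text, base=url):   # hypothetical link parser
            if same_domain and urlparse(link).netloc != domain:
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, d + 1))
    return pages

Comparing every new page against every kept page is quadratic; a production crawler would hash shingles instead, but the control flow is the same.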

Output Structure

crawl-{domain}-{timestamp}/
  index.md          # Table of contents + summary
  page-001.md       # First page content
  page-002.md       # Second page content
  ...
  metadata.json     # Crawl stats, URLs, timings
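
Neither index.md nor metadata.json has a documented schema; this hypothetical sketch materializes the layout above from the pages dict returned by crawl():

import json, time
from pathlib import Path

def save_crawl(pages, domain):
    stamp = time.strftime("%Y%m%dT%H%M%S")
    out = Path(f".claude/cache/agents/harvest/crawl-{domain}-{stamp}")
    out.mkdir(parents=True, exist_ok=True)
    toc = []
    for i, (url, text) in enumerate(pages.items(), start=1):
        name = f"page-{i:03d}.md"              # page-001.md, page-002.md, ...
        (out / name).write_text(text, encoding="utf-8")
        toc.append(f"- [{url}]({name})")
    (out / "index.md").write_text("# Table of Contents\n\n" + "\n".join(toc) + "\n")
    # field names below are assumptions, not documented by the skill
    (out / "metadata.json").write_text(json.dumps(
        {"domain": domain, "page_count": len(pages), "urls": list(pages)}, indent=2))
    return out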

Crawl Engine

Primary: crawl4ai (Docker port 11235)

curl -s http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.example.com"],
    "max_depth": 3,
    "same_domain": true,
    "word_count_threshold": 50
  }'
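
The same request from Python can double as an availability probe: if the connection is refused, the skill should drop to the fallback below. The payload mirrors the curl example; everything else is a sketch:

import requests

def crawl_via_engine(url, depth):
    try:
        resp = requests.post(
            "http://localhost:11235/crawl",
            json={"urls": [url], "max_depth": depth,
                  "same_domain": True, "word_count_threshold": 50},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()
    except requests.ConnectionError:
        return None   # Docker engine unreachable -> use the fallback below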

Fallback: Manual Link Following

When the Docker engine is unavailable, the skill falls back to manual link following (a rough stand-in is sketched after these steps):

  1. WebFetch root URL
  2. Parse links from markdown output
  3. WebFetch each linked page (depth-limited)
  4. Compile results
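
WebFetch is a Claude Code tool, so outside that environment steps 1-2 might look like this rough Python stand-in, using requests and the stdlib HTMLParser to collect links from a fetched page:

from html.parser import HTMLParser
from urllib.parse import urljoin
import requests

class LinkParser(HTMLParser):
    """Collect absolute URLs from every <a href=...> on the page."""
    def __init__(self, base):
        super().__init__()
        self.base, self.links = base, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base, href))

def fetch_and_extract(url):
    html = requests.get(url, timeout=30).text
    parser = LinkParser(url)
    parser.feed(html)
    return html, parser.links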

Use Cases

Scenario                 Depth  Max Pages
API reference            2-3    50
Full documentation site  3-5    100
Wiki section             2      30
Changelog history        1-2    20
Tutorial series          2-3    30

Rules

  • Respect robots.txt
  • Max 2 requests/second (throttling is sketched below)
  • Skip binary files (PDF, images, videos)
  • Detect and skip infinite pagination
  • Cache results for 24 hours
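
A sketch of the first two rules, using the stdlib robots.txt parser plus a fixed minimum interval between requests; caching one parser per origin is an implementation assumption, not something the manifest specifies:

import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class PoliteFetcher:
    """Enforce robots.txt plus the 2 requests/second cap."""
    MIN_INTERVAL = 0.5            # seconds between requests = 2 req/s

    def __init__(self):
        self._robots = {}         # one parsed robots.txt per origin
        self._last = 0.0

    def allowed(self, url):
        origin = "{0.scheme}://{0.netloc}".format(urlparse(url))
        rp = self._robots.get(origin)
        if rp is None:
            rp = RobotFileParser(origin + "/robots.txt")
            rp.read()             # fetch and parse robots.txt once per origin
            self._robots[origin] = rp
        return rp.can_fetch("*", url)

    def throttle(self):
        wait = self.MIN_INTERVAL - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)      # sleep just long enough to stay under the cap
        self._last = time.monotonic()

A crawler would call allowed() before queueing a URL and throttle() before each fetch.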