Vibecosystem harvest-deep-crawl
Multi-page deep crawling - documentation sites, wikis, knowledge bases
Install
source · Clone the upstream repo
git clone https://github.com/vibeeval/vibecosystem
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/vibeeval/vibecosystem "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/harvest-deep-crawl" ~/.claude/skills/vibeeval-vibecosystem-harvest-deep-crawl && rm -rf "$T"
Manifest
skills/harvest-deep-crawl/SKILL.md
Harvest Deep Crawl
Crawl multi-page websites following internal links to a specified depth. Ideal for building complete knowledge bases from documentation sites, wikis, and reference materials.
Usage
```
/crawl <url> --depth <N>
```
Examples
```
# Crawl docs site 3 levels deep
/crawl https://docs.example.com --depth 3

# Crawl a specific section
/crawl https://docs.example.com/api --depth 2

# Crawl with page limit
/crawl https://wiki.example.com --depth 5 --max-pages 50
```
Parameters
| Param | Default | Description |
|---|---|---|
| `--depth` | 2 | Max link-following depth |
| `--max-pages` | 100 | Max pages to crawl |
| `--same-domain` | true | Stay on the same domain |
| `--include` | * | URL pattern to include |
| `--exclude` | - | URL pattern to exclude |
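The include/exclude patterns act as filters applied to each discovered URL before it is queued. A minimal sketch of that filtering, assuming glob-style semantics (the skill does not pin down the exact matching syntax, and `should_crawl` is a hypothetical helper):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_crawl(url, root_domain, same_domain=True,
                 include="*", exclude=None):
    """Hypothetical filter: decide whether a discovered URL is queued.

    `include`/`exclude` are treated here as glob patterns -- an
    assumption; the skill does not specify the pattern syntax.
    """
    if same_domain and urlparse(url).netloc != root_domain:
        return False
    if not fnmatch(url, include):
        return False
    if exclude and fnmatch(url, exclude):
        return False
    return True

# e.g. crawl only the API section, skipping changelog pages
should_crawl("https://docs.example.com/api/auth",
             root_domain="docs.example.com",
             include="*/api/*", exclude="*/changelog/*")
```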
How It Works
- Start at root URL, extract all internal links
- Follow links up to the specified depth in BFS order (sketched after this list)
- Extract content from each page
- Deduplicate pages with > 90% content overlap
- Build table of contents from page hierarchy
- Merge into coherent knowledge base
- Save to `.claude/cache/agents/harvest/crawl-{domain}/`
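A minimal sketch of the BFS loop and the overlap-based dedup step, assuming the caller supplies a `fetch(url) -> (text, links)` function (a stand-in for whichever engine is active) and using difflib's similarity ratio as one way to approximate ">90% content overlap":

```python
from collections import deque
from difflib import SequenceMatcher

def crawl(root_url, fetch, max_depth=2, max_pages=100):
    """Breadth-first crawl sketch.

    `fetch(url) -> (text, links)` is supplied by the caller --
    the crawl engine, or the WebFetch fallback sketched later.
    """
    queue = deque([(root_url, 0)])
    seen, pages = {root_url}, []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        text, links = fetch(url)
        # Drop near-duplicates: >90% overlap with an already-kept page.
        if any(SequenceMatcher(None, text, p["text"]).ratio() > 0.9
               for p in pages):
            continue
        pages.append({"url": url, "depth": depth, "text": text})
        if depth < max_depth:
            for link in links:
                if link not in seen:  # queue each internal link once
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages
```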
Output Structure
```
crawl-{domain}-{timestamp}/
  index.md         # Table of contents + summary
  page-001.md      # First page content
  page-002.md      # Second page content
  ...
  metadata.json    # Crawl stats, URLs, timings
```
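One way these files could be assembled from the pages returned by the crawl loop above; `save_crawl` is a hypothetical helper and the metadata fields shown are illustrative, not a documented schema:

```python
import json
import time
from pathlib import Path

def save_crawl(pages, domain):
    """Hypothetical writer: lay out index, pages, and metadata on disk."""
    out = Path(f"crawl-{domain}-{int(time.time())}")
    out.mkdir(parents=True, exist_ok=True)
    toc = []
    for i, page in enumerate(pages, start=1):
        name = f"page-{i:03d}.md"
        (out / name).write_text(page["text"])
        toc.append(f"- [{page['url']}]({name})")
    (out / "index.md").write_text("# Table of Contents\n" + "\n".join(toc))
    # Illustrative stats only -- the real metadata.json schema is unspecified.
    (out / "metadata.json").write_text(json.dumps({
        "domain": domain,
        "pages": len(pages),
        "urls": [p["url"] for p in pages],
    }, indent=2))
```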
Crawl Engine
Primary: crawl4ai (Docker port 11235)
```
curl -s http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.example.com"],
    "max_depth": 3,
    "same_domain": true,
    "word_count_threshold": 50
  }'
```
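The same request from Python, mirroring the curl call above; a sketch, not an official client, and it assumes the service accepts and returns JSON:

```python
import requests

resp = requests.post(
    "http://localhost:11235/crawl",
    json={
        "urls": ["https://docs.example.com"],
        "max_depth": 3,
        "same_domain": True,
        "word_count_threshold": 50,
    },
    timeout=300,
)
resp.raise_for_status()
result = resp.json()  # assumption: the service responds with JSON
```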
Fallback: Manual Link Following
When Docker is unavailable (see the sketch after this list):
- WebFetch root URL
- Parse links from markdown output
- WebFetch each linked page (depth-limited)
- Compile results
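A sketch of that fallback, using plain HTTP plus a naive markdown-link regex as a stand-in for the WebFetch tool (both stand-ins are assumptions; the skill itself drives WebFetch, which already returns pages as markdown):

```python
import re
import requests

MD_LINK = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")

def fetch_markdown(url):
    """Stand-in for WebFetch: fetch a page and scan its body for links.

    WebFetch returns the page converted to markdown; here we simply
    regex absolute markdown-style links out of the raw response.
    """
    body = requests.get(url, timeout=30).text
    return body, MD_LINK.findall(body)
```

This plugs directly into the `crawl()` loop sketched earlier, e.g. `crawl("https://docs.example.com", fetch=fetch_markdown, max_depth=2)`.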
Use Cases
| Scenario | Depth | Max Pages |
|---|---|---|
| API reference | 2-3 | 50 |
| Full documentation site | 3-5 | 100 |
| Wiki section | 2 | 30 |
| Changelog history | 1-2 | 20 |
| Tutorial series | 2-3 | 30 |
Rules
- Respect robots.txt
- Max 2 requests/second (throttle sketched after this list)
- Skip binary files (PDF, images, videos)
- Detect and skip infinite pagination
- Cache results for 24 hours
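The robots.txt and rate-limit rules can be enforced with the standard library's robots parser plus a minimum-interval throttle. A minimal sketch, assuming one fetcher per crawl and a fixed 0.5 s spacing (2 req/s); `PoliteFetcher` is a hypothetical helper:

```python
import time
from urllib import robotparser
from urllib.parse import urlparse

class PoliteFetcher:
    """Hypothetical helper: robots.txt checks plus 2 req/s spacing."""

    def __init__(self, root_url, min_interval=0.5):
        base = urlparse(root_url)
        self.robots = robotparser.RobotFileParser(
            f"{base.scheme}://{base.netloc}/robots.txt")
        self.robots.read()  # fetch and parse the site's robots.txt
        self.min_interval = min_interval
        self._last = 0.0

    def allowed(self, url, agent="*"):
        return self.robots.can_fetch(agent, url)

    def wait(self):
        # Sleep just long enough to stay under the 2 req/s cap.
        delta = time.monotonic() - self._last
        if delta < self.min_interval:
            time.sleep(self.min_interval - delta)
        self._last = time.monotonic()
```

Calling `wait()` before every request and skipping URLs where `allowed()` is false covers the first two rules; binary-file and pagination checks would sit alongside the URL filter.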