# Claude-skill-registry: documentation-scraper
Use when needing to scrape documentation websites into markdown for AI context. Triggers on "scrape docs", "download documentation", "get docs for [library]", or creating local copies of online documentation. CRITICAL - always analyze sitemap first before scraping.
```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Or install just this skill:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/documentation-scraper" ~/.claude/skills/majiayu000-claude-skill-registry-documentation-scraper && rm -rf "$T"
```
`skills/data/documentation-scraper/SKILL.md`

# Documentation Scraper with slurp-ai
## Overview
slurp-ai scrapes documentation websites and compiles them into a single markdown file optimized for AI agent context. It is lightweight, fast, and deterministic: it does NOT use AI to scrape; it is FOR AI consumption.
## CRITICAL: Run Outside Sandbox
All commands in this skill MUST be run outside the sandbox. Use `dangerouslyDisableSandbox: true` for all Bash commands, including:

- `which slurp` (installation check)
- `node analyze-sitemap.js` (sitemap analysis)
- `slurp` (scraping)
- File inspection commands (`cat`, `head`, `wc`, etc.)
The sandbox blocks network access and file operations required for web scraping.
## Pre-Flight: Check Installation
Before scraping, verify slurp-ai is installed:
```bash
which slurp || echo "NOT INSTALLED"
```
If not installed, ask the user to run:
```bash
npm install -g slurp-ai
```
Requires: Node.js v20+
Do NOT proceed with scraping until slurp-ai is confirmed installed.
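A minimal pre-flight sketch combining both checks (the version parsing is illustrative, not part of the skill):

```bash
#!/usr/bin/env bash
# Verify slurp-ai and Node.js v20+ before any scraping.
if ! command -v slurp >/dev/null 2>&1; then
  echo "slurp NOT INSTALLED - run: npm install -g slurp-ai" >&2
  exit 1
fi

# node --version prints e.g. "v20.11.0"; strip the "v", keep the major number.
node_major=$(node --version | sed 's/^v//' | cut -d. -f1)
if [ "$node_major" -lt 20 ]; then
  echo "Node.js v20+ required (found v$node_major)" >&2
  exit 1
fi
echo "Pre-flight OK (Node v$node_major)"
```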
## Commands
| Command | Purpose |
|---|---|
| `slurp <url>` | Fetch and compile in one step |
| `slurp fetch <url>` | Download docs to partials only |
| `slurp compile` | Compile partials into single file |
| `slurp read <package>` | Read local documentation |
Output: Creates `slurp_compiled/compiled_docs.md` from partials in `slurp_partials/`.
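For the two-step workflow, a hedged sketch (the `fetch`/`compile` subcommand names follow the table above; confirm against `slurp --help` for your installed version):

```bash
# Step 1: download pages into slurp_partials/ without compiling
slurp fetch https://docs.example.com/docs/ --base-path https://docs.example.com/docs/

# Step 2: compile the partials into slurp_compiled/compiled_docs.md
slurp compile
```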
## CRITICAL: Analyze Sitemap First
Before running slurp, ALWAYS analyze the sitemap. This reveals the complete site structure and informs your `--base-path` and `--max` decisions.
### Step 1: Run Sitemap Analysis
Use the included `analyze-sitemap.js` script:
```bash
node analyze-sitemap.js https://docs.example.com
```
This outputs:

- Total page count (informs `--max`)
- URLs grouped by section (informs `--base-path`)
- Suggested slurp commands with appropriate flags
- Sample URLs to understand naming patterns
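To sanity-check the script's numbers by hand, a rough shell equivalent (assumes the site exposes a standard `sitemap.xml`):

```bash
# Total page count
curl -s https://docs.example.com/sitemap.xml | grep -c '<loc>'

# URLs grouped by top-level section
curl -s https://docs.example.com/sitemap.xml \
  | grep -o '<loc>[^<]*</loc>' \
  | sed -e 's|<loc>||' -e 's|</loc>||' \
  | awk -F/ '{print "/" $4}' \
  | sort | uniq -c | sort -rn
```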
### Step 2: Interpret the Output
Example output:
```
📊 Total URLs in sitemap: 247

📁 URLs by top-level section:
  /docs   182 pages
  /api     45 pages
  /blog    20 pages

🎯 Suggested --base-path options:
  https://docs.example.com/docs/guides/     (67 pages)
  https://docs.example.com/docs/reference/  (52 pages)
  https://docs.example.com/api/             (45 pages)

💡 Recommended slurp commands:
  # Just "/docs/guides" section (67 pages)
  slurp https://docs.example.com/docs/guides/ --base-path https://docs.example.com/docs/guides/ --max 80
```
### Step 3: Choose Scope Based on Analysis
| Sitemap Shows | Action |
|---|---|
| < 50 pages total | Scrape entire site: `slurp <url>` |
| 50-200 pages | Scope to relevant section with `--base-path` |
| 200+ pages | Must scope down - pick a specific subsection |
| No sitemap found | Start with the default `--max 20`, inspect partials, adjust |
### Step 4: Frame the Slurp Command
With sitemap data, you can now set accurate parameters:
```bash
# From sitemap: /docs/api has 45 pages
slurp https://docs.example.com/docs/api/intro \
  --base-path https://docs.example.com/docs/api/ \
  --max 55
```
**Key insight:** The starting URL is where crawling begins; the base path filters which links get followed. They can differ (useful when the base path itself returns a 404).
## Common Scraping Patterns
### Library Documentation (versioned)
```bash
# Express.js 4.x docs
slurp https://expressjs.com/en/4x/api.html --base-path https://expressjs.com/en/4x/

# React docs (latest)
slurp https://react.dev/learn --base-path https://react.dev/learn
```
### API Reference Only
```bash
slurp https://docs.example.com/api/introduction --base-path https://docs.example.com/api/
```
### Full Documentation Site
```bash
slurp https://docs.example.com/
```
## CLI Options
| Flag | Default | Purpose |
|---|---|---|
| `--max` | 20 | Maximum pages to scrape |
| `--concurrency` | 5 | Parallel page requests |
| `--headless` | true | Use headless browser |
| `--base-path` | start URL | Filter links to this prefix |
| `--output` | `slurp_partials/` | Output directory for partials |
| `--retry-count` | 3 | Retries for failed requests |
| `--retry-delay` | 1000 | Delay between retries (ms) |
| `--yes` | - | Skip confirmation prompts |
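Combining several flags, a hedged example (flag names as in the table above; confirm with `slurp --help`):

```bash
# Gentle crawl of a rate-limited site: fewer parallel requests, longer retry delay
slurp https://docs.example.com/docs/ \
  --base-path https://docs.example.com/docs/ \
  --max 80 \
  --concurrency 2 \
  --retry-delay 2000 \
  --yes
```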
## Compile Options
| Flag | Default | Purpose |
|---|---|---|
| `--input` | `slurp_partials/` | Input directory |
| `--output` | `slurp_compiled/compiled_docs.md` | Output file |
| `--preserve-metadata` | true | Keep metadata blocks |
| `--remove-navigation` | true | Strip nav elements |
| `--remove-duplicates` | true | Eliminate duplicates |
| `--exclude` | - | JSON array of regex patterns to exclude |
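A sketch of recompiling existing partials with these options (flag names and boolean syntax as in the table above; verify with `slurp compile --help`):

```bash
# Recompile, dropping metadata blocks and filtering out changelog pages
slurp compile --preserve-metadata false --exclude '["changelog", "release-notes"]'
```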
## When to Disable Headless Mode
Use `--headless false` for:

- Static HTML documentation sites
- Faster scraping when JS rendering not needed
Default is headless (true) - works for most modern doc sites including SPAs.
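For example:

```bash
# Static HTML site: skip the browser for a faster crawl
slurp https://docs.example.com/ --headless false
```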
## Output Structure
```
slurp_partials/              # Intermediate files
├── page1.md
└── page2.md
slurp_compiled/              # Final output
└── compiled_docs.md         # Compiled result
```
## Quick Reference
```bash
# 1. ALWAYS analyze sitemap first
node analyze-sitemap.js https://docs.example.com

# 2. Scrape with informed parameters (from sitemap analysis)
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 80

# 3. Skip prompts for automation
slurp https://docs.example.com/ --yes

# 4. Check output
head -100 slurp_compiled/compiled_docs.md
```
## Common Issues
| Problem | Cause | Solution |
|---|---|---|
| Wrong `--max` value | Guessing page count | Run `analyze-sitemap.js` first |
| Too few pages scraped | `--max` limit (default 20) | Set `--max` based on sitemap analysis |
| Missing content | JS not rendering | Ensure `--headless true` (default) |
| Crawl stuck/slow | Rate limiting | Reduce `--concurrency` |
| Duplicate sections | Similar content | Use `--remove-duplicates` (default) |
| Wrong pages included | Base path too broad | Use sitemap to find correct `--base-path` |
| Prompts blocking automation | Interactive mode | Add `--yes` flag |
## Post-Scrape Usage
The output markdown is designed for AI context injection:
```bash
# Check file size (context budget)
wc -c slurp_compiled/compiled_docs.md

# Preview structure
grep "^#" slurp_compiled/compiled_docs.md | head -30

# Use with Claude Code - reference in prompt or via @file
```
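To estimate whether the file fits a context window, a quick sketch using the common ~4 characters per token heuristic (an approximation, not an exact count):

```bash
# Rough token estimate: characters / 4
chars=$(wc -c < slurp_compiled/compiled_docs.md)
echo "~$((chars / 4)) tokens"
```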
## When NOT to Use
- API specs in OpenAPI/Swagger: Use dedicated parsers instead
- GitHub READMEs: Fetch directly via raw.githubusercontent.com (see the sketch after this list)
- npm package docs: Often better to read source + README
- Frequently updated docs: Consider caching strategy
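For the GitHub README case, a direct fetch is simpler than scraping (OWNER/REPO and the `main` branch are placeholders; some repos use `master`):

```bash
# Grab a README directly - no scraping needed
curl -s https://raw.githubusercontent.com/OWNER/REPO/main/README.md -o README.md
```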