Claude-skill-registry documentation-scraper

Use when needing to scrape documentation websites into markdown for AI context. Triggers on "scrape docs", "download documentation", "get docs for [library]", or creating local copies of online documentation. CRITICAL - always analyze sitemap first before scraping.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/documentation-scraper" ~/.claude/skills/majiayu000-claude-skill-registry-documentation-scraper && rm -rf "$T"
manifest: skills/data/documentation-scraper/SKILL.md
source content

Documentation Scraper with slurp-ai

Overview

slurp-ai scrapes documentation websites and compiles them into a single markdown file optimized for AI agent context. It is lightweight, fast, and deterministic: it does NOT use AI to scrape; its output is FOR AI consumption.

CRITICAL: Run Outside Sandbox

All commands in this skill MUST be run outside the sandbox. Use dangerouslyDisableSandbox: true for all Bash commands, including:

  • which slurp (installation check)
  • node analyze-sitemap.js (sitemap analysis)
  • slurp (scraping)
  • File inspection commands (wc, head, cat, etc.)

The sandbox blocks network access and file operations required for web scraping.

Pre-Flight: Check Installation

Before scraping, verify slurp-ai is installed:

which slurp || echo "NOT INSTALLED"

If not installed, ask the user to run:

npm install -g slurp-ai

Requires: Node.js v20+

Do NOT proceed with scraping until slurp-ai is confirmed installed.
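
The pre-flight check above can be wrapped in two small helpers, sketched here as one possible approach. The function names check_cmd and node_version_ok are hypothetical and not part of slurp-ai; the version check assumes the usual "v20.11.0" format printed by node --version.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "installed"
  else
    echo "NOT INSTALLED"
    return 1
  fi
}

# Hypothetical helper: check a Node.js version string against the v20+ requirement.
node_version_ok() {
  local version="${1#v}"        # "v20.11.0" -> "20.11.0"
  local major="${version%%.*}"  # "20.11.0"  -> "20"
  if [ "$major" -ge 20 ] 2>/dev/null; then
    echo "ok"
  else
    echo "too old"
    return 1
  fi
}

# Usage (run outside the sandbox):
#   check_cmd slurp && node_version_ok "$(node --version)"
```

If either check fails, stop and ask the user to install or upgrade rather than continuing.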

Commands

| Command | Purpose |
| --- | --- |
| slurp <url> | Fetch and compile in one step |
| slurp fetch <url> [version] | Download docs to partials only |
| slurp compile | Compile partials into single file |
| slurp read <package> [version] | Read local documentation |

Output: Creates slurp_compiled/compiled_docs.md from partials in slurp_partials/.

CRITICAL: Analyze Sitemap First

Before running slurp, ALWAYS analyze the sitemap. This reveals the complete site structure and informs your --base-path and --max decisions.

Step 1: Run Sitemap Analysis

Use the included analyze-sitemap.js script:

node analyze-sitemap.js https://docs.example.com

This outputs:

  • Total page count (informs --max)
  • URLs grouped by section (informs --base-path)
  • Suggested slurp commands with appropriate flags
  • Sample URLs to understand naming patterns

Step 2: Interpret the Output

Example output:

📊 Total URLs in sitemap: 247

📁 URLs by top-level section:
   /docs                          182 pages
   /api                            45 pages
   /blog                           20 pages

🎯 Suggested --base-path options:
   https://docs.example.com/docs/guides/     (67 pages)
   https://docs.example.com/docs/reference/  (52 pages)
   https://docs.example.com/api/             (45 pages)

💡 Recommended slurp commands:

   # Just "/docs/guides" section (67 pages)
   slurp https://docs.example.com/docs/guides/ --base-path https://docs.example.com/docs/guides/ --max 80
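
In this example the suggested --max sits a little above the section's page count (67 pages, --max 80). One way to derive such a value by hand is sketched below. The 20% headroom and the rounding to a multiple of 5 are assumptions chosen to reproduce the numbers in this document; they are not necessarily the rule analyze-sitemap.js applies, and max_with_headroom is a hypothetical helper name.

```shell
# Hypothetical helper: pad a sitemap page count by ~20% and round up
# to the next multiple of 5, so the crawl is not cut off by --max.
max_with_headroom() {
  local pages="$1"
  local padded=$(( pages * 12 / 10 ))   # add ~20% headroom (integer math)
  echo $(( (padded + 4) / 5 * 5 ))      # round up to a multiple of 5
}

# Usage:
#   slurp <url> --base-path <url> --max "$(max_with_headroom 67)"
```

The headroom covers pages that are linked from the section but missing from the sitemap.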

Step 3: Choose Scope Based on Analysis

| Sitemap Shows | Action |
| --- | --- |
| < 50 pages total | Scrape entire site: slurp <url> --max 60 |
| 50-200 pages | Scope to relevant section with --base-path |
| 200+ pages | Must scope down - pick specific subsection |
| No sitemap found | Start with --max 30, inspect partials, adjust |
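
The scoping rules above can be written down as a small decision function. This is only a sketch: choose_scope is a hypothetical name, an empty argument stands for "no sitemap found", and a count of exactly 200 is treated as part of the 50-200 band, which is an assumption since the two bands overlap at that value.

```shell
# Hypothetical helper: map a sitemap page count to the recommended action.
# Pass an empty string when no sitemap was found.
choose_scope() {
  local pages="$1"
  if [ -z "$pages" ]; then
    echo "no sitemap: start with --max 30, inspect partials, adjust"
  elif [ "$pages" -lt 50 ]; then
    echo "scrape entire site with --max 60"
  elif [ "$pages" -le 200 ]; then
    echo "scope to the relevant section with --base-path"
  else
    echo "scope down to a specific subsection"
  fi
}
```

For the 247-page example site, choose_scope 247 lands in the last branch, so a single subsection such as /docs/guides/ should be picked.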

Step 4: Frame the Slurp Command

With sitemap data, you can now set accurate parameters:

# From sitemap: /docs/api has 45 pages
slurp https://docs.example.com/docs/api/intro \
  --base-path https://docs.example.com/docs/api/ \
  --max 55

Key insight: Starting URL is where crawling begins. Base path filters which links get followed. They can differ (useful when base path itself returns 404).
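
To make the difference between starting URL and base path concrete, the filter can be modeled as a plain string-prefix test. This is a simplified sketch, assuming prefix matching as described in the CLI options; slurp's actual link handling may normalize URLs differently, and in_scope is a hypothetical helper.

```shell
# Hypothetical model of the base-path filter: a link is followed only
# if it starts with the base path (literal prefix match, no globbing).
in_scope() {
  local url="$1" base="$2"
  case "$url" in
    "$base"*) echo "follow" ;;
    *)        echo "skip" ;;
  esac
}

base="https://docs.example.com/docs/api/"
in_scope "https://docs.example.com/docs/api/intro" "$base"   # prints "follow"
in_scope "https://docs.example.com/blog/post" "$base"        # prints "skip"
```

The starting URL (.../docs/api/intro) only needs to pass this test itself; it does not have to equal the base path.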

Common Scraping Patterns

Library Documentation (versioned)

# Express.js 4.x docs
slurp https://expressjs.com/en/4x/api.html --base-path https://expressjs.com/en/4x/

# React docs (latest)
slurp https://react.dev/learn --base-path https://react.dev/learn

API Reference Only

slurp https://docs.example.com/api/introduction --base-path https://docs.example.com/api/

Full Documentation Site

slurp https://docs.example.com/

CLI Options

| Flag | Default | Purpose |
| --- | --- | --- |
| --max <n> | 20 | Maximum pages to scrape |
| --concurrency <n> | 5 | Parallel page requests |
| --headless <bool> | true | Use headless browser |
| --base-path <url> | start URL | Filter links to this prefix |
| --output <dir> | ./slurp_partials | Output directory for partials |
| --retry-count <n> | 3 | Retries for failed requests |
| --retry-delay <ms> | 1000 | Delay between retries |
| --yes | - | Skip confirmation prompts |

Compile Options

| Flag | Default | Purpose |
| --- | --- | --- |
| --input <dir> | ./slurp_partials | Input directory |
| --output <file> | ./slurp_compiled/compiled_docs.md | Output file |
| --preserve-metadata | true | Keep metadata blocks |
| --remove-navigation | true | Strip nav elements |
| --remove-duplicates | true | Eliminate duplicates |
| --exclude <json> | - | JSON array of regex patterns to exclude |
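
Because --exclude takes a JSON array, shell quoting is the usual stumbling block. The sketch below builds the array from plain arguments. json_array is a hypothetical helper, the patterns are illustrative only, and what the patterns are matched against is not specified here.

```shell
# Hypothetical helper: print its arguments as a JSON array of strings,
# escaping backslashes and double quotes as JSON requires.
json_array() {
  local out="" item
  for item in "$@"; do
    item=$(printf '%s' "$item" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
    out="${out:+$out,}\"$item\""
  done
  printf '[%s]\n' "$out"
}

json_array '^Changelog' 'Edit this page'   # prints ["^Changelog","Edit this page"]

# Usage (patterns are examples, not recommendations):
#   slurp compile --exclude "$(json_array '^Changelog' 'Edit this page')"
```

Writing the array by hand also works as long as the whole value is wrapped in single quotes so the shell leaves the double quotes intact.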

When to Disable Headless Mode

Use --headless false for:

  • Static HTML documentation sites
  • Faster scraping when JS rendering not needed

Default is headless (true) - works for most modern doc sites including SPAs.

Output Structure

slurp_partials/              # Intermediate files
  ├── page1.md
  └── page2.md
slurp_compiled/              # Final output
  └── compiled_docs.md       # Compiled result

Quick Reference

# 1. ALWAYS analyze sitemap first
node analyze-sitemap.js https://docs.example.com

# 2. Scrape with informed parameters (from sitemap analysis)
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 80

# 3. Skip prompts for automation
slurp https://docs.example.com/ --yes

# 4. Check output
head -100 slurp_compiled/compiled_docs.md

Common Issues

| Problem | Cause | Solution |
| --- | --- | --- |
| Wrong --max value | Guessing page count | Run analyze-sitemap.js first |
| Too few pages scraped | --max limit (default 20) | Set --max based on sitemap analysis |
| Missing content | JS not rendering | Ensure --headless true (default) |
| Crawl stuck/slow | Rate limiting | Reduce concurrency below the default of 5, e.g. --concurrency 3 |
| Duplicate sections | Similar content | Use --remove-duplicates (default) |
| Wrong pages included | Base path too broad | Use sitemap to find correct --base-path |
| Prompts blocking automation | Interactive mode | Add --yes flag |

Post-Scrape Usage

The output markdown is designed for AI context injection:

# Check file size (context budget)
wc -c slurp_compiled/compiled_docs.md

# Preview structure
grep "^#" slurp_compiled/compiled_docs.md | head -30

# Use with Claude Code - reference in prompt or via @file
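
To turn the byte count into a rough context budget, a common rule of thumb is about four characters per token for English text. That ratio is an assumption, not a property of slurp-ai or of any particular tokenizer, so treat the result as an order-of-magnitude estimate. approx_tokens is a hypothetical helper.

```shell
# Hypothetical helper: estimate token count from file size,
# assuming roughly 4 bytes per token (a heuristic, not a measurement).
approx_tokens() {
  local bytes
  bytes=$(wc -c < "$1")
  echo $(( bytes / 4 ))
}

# Usage:
#   approx_tokens slurp_compiled/compiled_docs.md
```

If the estimate is far above the available context, re-scrape with a narrower --base-path rather than truncating the compiled file.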

When NOT to Use

  • API specs in OpenAPI/Swagger: Use dedicated parsers instead
  • GitHub READMEs: Fetch directly via raw.githubusercontent.com
  • npm package docs: Often better to read source + README
  • Frequently updated docs: Consider caching strategy