
Snowflake Documentation Scraper

Install

Source: clone the upstream repo:

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code: install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/doc-scraper" ~/.claude/skills/majiayu000-claude-skill-registry-doc-scraper && rm -rf "$T"

Manifest: skills/data/doc-scraper/SKILL.md
Source content

Snowflake Documentation Scraper

Scrapes docs.snowflake.com sections to Markdown with SQLite caching (7-day expiration).

Usage

First-time setup (auto-installs uv and the doc-scraper CLI):

python3 .claude/skills/doc-scraper/scripts/doc_scraper.py
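
To confirm the bootstrap worked, you can check that the doc-scraper entry point is now on your PATH (the exact install location depends on your uv setup):

command -v doc-scraper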

Subsequent runs:

doc-scraper --output-dir=./snowflake-docs
doc-scraper --output-dir=./snowflake-docs --base-path="/en/sql-reference/"
doc-scraper --output-dir=./snowflake-docs --spider-depth=2

Command Options

| Option | Default | Description |
| --- | --- | --- |
| --output-dir | Required | Output directory for scraped docs |
| --base-path | /en/migrations/ | URL section to scrape |
| --spider-depth | 1 | Link depth: 0 = seed pages only, 1 = seeds plus linked pages, 2 = two levels of links |
| --limit | None | Cap the number of URLs fetched (useful for testing) |
| --dry-run | – | Preview what would be scraped without writing files |
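
These flags compose, so a cautious first pass can cap the URL count and preview without writing anything:

doc-scraper --output-dir=./snowflake-docs --base-path="/en/sql-reference/" --limit=10 --dry-run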

Output

output-dir/
├── SKILL.md              # Auto-generated index
├── scraper_config.yaml   # Editable config (auto-created)
├── .cache/               # SQLite cache (auto-managed)
└── en/migrations/*.md    # Scraped pages with frontmatter
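
To spot-check a run, list the scraped pages and look at the top of one file; the exact frontmatter fields depend on what the scraper emits:

ls ./snowflake-docs/en/migrations/
head -n 10 ./snowflake-docs/en/migrations/*.md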

Configuration

Auto-created at {output-dir}/scraper_config.yaml:

rate_limiting:
  max_concurrent_threads: 4
spider:
  max_pages: 1000
  allowed_paths: ["/en/"]
scraped_pages:
  expiration_days: 7
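
Because the file is auto-created, deleting it and re-running resets the configuration to these defaults (assuming re-creation behaves like the first run):

rm ./snowflake-docs/scraper_config.yaml
doc-scraper --output-dir=./snowflake-docs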

Troubleshooting

| Issue | Solution |
| --- | --- |
| Too many pages | Lower --spider-depth or edit the config |
| Missing pages | Increase --spider-depth |
| Cache corruption (rare) | Delete {output-dir}/.cache/ |
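
For the cache-corruption case, the cache is auto-managed, so it is safe to delete and let the next run rebuild it (pages will be re-fetched):

rm -rf ./snowflake-docs/.cache/
doc-scraper --output-dir=./snowflake-docs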