# Claude-skill-registry fetcher

## Install

Source · Clone the upstream repo:

```shell
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into `~/.claude/skills/`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/fetcher" ~/.claude/skills/majiayu000-claude-skill-registry-fetcher && rm -rf "$T"
```

## Manifest: `skills/data/fetcher/SKILL.md` (source content)
# Fetcher - Web Crawling

Fetch web pages and documents with automatic fallbacks, proxy rotation, and content extraction.

Self-contained skill: auto-installs via `uvx` from git (no pre-installation needed). Fully automatic: Playwright browsers are installed on first run for SPA/JS page support.
## Simplest Usage

```shell
# Via wrapper (recommended - auto-installs)
.agents/skills/fetcher/run.sh get https://example.com

# Or directly if fetcher is installed
fetcher get https://example.com
```
## Common Commands

```shell
./run.sh get https://example.com       # Fetch single URL
./run.sh get-manifest urls.txt         # Fetch list of URLs
./run.sh get-manifest - < urls.txt     # Fetch from stdin
```
## Common Patterns

### Fetch a single URL

```shell
fetcher get https://www.nasa.gov --out run/nasa
```

Outputs to `run/nasa/`:
- `consumer_summary.json` - structured result
- `Walkthrough.md` - human-readable summary
- `downloads/` - raw content files
### Fetch multiple URLs

```shell
# From file (one URL per line)
fetcher get-manifest urls.txt --out run/batch

# From stdin
echo -e "https://example.com\nhttps://nasa.gov" | fetcher get-manifest -
```
### ETL mode (full control)

```shell
fetcher-etl --inventory urls.jsonl --out run/etl_batch
fetcher-etl --manifest urls.txt --out run/demo
```
### Check environment

```shell
fetcher doctor                 # Check dependencies and config
fetcher get --dry-run <url>    # Validate without fetching
fetcher-etl --help-full        # All options
fetcher-etl --find metrics     # Search options
```
## Output Structure

```
run/artifacts/<run-id>/
├── results.jsonl          # Fetch results per URL
├── consumer_summary.json  # Summary stats
├── Walkthrough.md         # Human-readable summary
├── downloads/             # Raw files (HTML, PDF, etc.)
├── text_blobs/            # Extracted text
├── markdown/              # LLM-friendly markdown
├── fit_markdown/          # Pruned markdown for LLM input
├── junk_results.jsonl     # Failed/junk URLs
└── junk_table.md          # Quick triage table
```
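Since `results.jsonl` holds one JSON object per line, a run can be triaged with a few lines of Python. A minimal sketch, assuming records expose the `content_verdict` field shown in the Python API section (`summarize_run` is an illustrative helper, not part of fetcher):

```python
import json
from collections import Counter
from pathlib import Path

def summarize_run(results_path: Path) -> Counter:
    """Tally content verdicts across a run's results.jsonl."""
    verdicts = Counter()
    for line in results_path.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        verdicts[record.get("content_verdict", "unknown")] += 1
    return verdicts
```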
## Content Extraction

### Enable markdown output

```shell
export FETCHER_EMIT_MARKDOWN=1
export FETCHER_EMIT_FIT_MARKDOWN=1   # Pruned for LLM input
fetcher get https://example.com
```
### Rolling windows (for chunking)

```shell
export FETCHER_DOWNLOAD_MODE=rolling_extract
export FETCHER_ROLLING_WINDOW_SIZE=6000
export FETCHER_ROLLING_WINDOW_STEP=3000
fetcher get https://example.com
```
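With a window size of 6000 and a step of 3000, consecutive windows overlap by half, so a chunk boundary never cuts context off completely. The windowing scheme can be sketched like this (an illustration of the idea, not fetcher's actual implementation):

```python
def rolling_windows(text: str, size: int = 6000, step: int = 3000):
    """Yield overlapping character windows starting at offsets 0, step, 2*step, ..."""
    if not text:
        return
    start = 0
    while True:
        yield text[start:start + size]
        if start + size >= len(text):
            return  # last window reached the end of the text
        start += step
```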
## Advanced Features

### HTTP caching

```shell
# Cache enabled by default
fetcher get https://example.com

# Disable cache for fresh fetch
fetcher get https://example.com --no-http-cache
```
### PDF discovery

```shell
# Auto-fetch PDF links from HTML pages
export FETCHER_ENABLE_PDF_DISCOVERY=1
export FETCHER_PDF_DISCOVERY_MAX=3
fetcher get https://example.com
```
### Proxy rotation (rate-limited sites)

```shell
export SPARTA_STEP06_PROXY_HOST=gw.iproyal.com
export SPARTA_STEP06_PROXY_PORT=12321
export SPARTA_STEP06_PROXY_USER=team
export SPARTA_STEP06_PROXY_PASSWORD=secret
fetcher-etl --inventory urls.jsonl
```
### Brave/Wayback fallbacks

```shell
# Enable alternate URL resolution
export BRAVE_API_KEY=sk-your-key
fetcher-etl --use-alternates --inventory urls.jsonl
```
## Python API

```python
import asyncio
from pathlib import Path

from fetcher.workflows.web_fetch import URLFetcher, FetchConfig, write_results

async def main():
    config = FetchConfig(concurrency=4, per_domain=2)
    fetcher = URLFetcher(config)
    entries = [{"url": "https://www.nasa.gov"}]
    results, audit = await fetcher.fetch_many(entries)
    write_results(results, Path("artifacts/nasa.jsonl"))
    print(audit)

asyncio.run(main())
```
### Single URL helper

```python
import asyncio

from fetcher.workflows.fetcher import fetch_url

result = asyncio.run(fetch_url("https://example.com"))
print(result.content_verdict)  # "ok", "empty", "paywall", etc.
print(result.text)             # Extracted text
```
## FetchResult Fields

| Field | Description |
|---|---|
|  | Original URL |
|  | After redirects |
| `content_verdict` | `"ok"`, `"empty"`, `"paywall"`, etc. |
| `text` | Extracted text content |
|  | Path to raw download |
|  | Path to markdown (if enabled) |
|  | Whether result came from cache |
|  | Content hash for change detection |
## Environment Variables

| Variable | Purpose |
|---|---|
| `BRAVE_API_KEY` | Enable Brave search fallbacks |
| `FETCHER_EMIT_MARKDOWN` | Generate LLM-friendly markdown |
| `FETCHER_EMIT_FIT_MARKDOWN` | Generate pruned markdown |
| `FETCHER_DOWNLOAD_MODE`, `FETCHER_ROLLING_WINDOW_SIZE`, `FETCHER_ROLLING_WINDOW_STEP` | Rolling-window extraction (chunking) |
| `FETCHER_HTTP_CACHE_DISABLE` | Disable HTTP caching |
| `FETCHER_ENABLE_PDF_DISCOVERY` | Auto-fetch embedded PDFs |
## Troubleshooting

| Problem | Solution |
|---|---|
| Playwright missing | Browsers install automatically on first run |
| SPA page returns empty/thin | Playwright auto-fallback should trigger; check `used_playwright` in summary |
| Stale cached results | Set `FETCHER_HTTP_CACHE_DISABLE=1` for fresh fetch |
| Rate limited | Configure proxy rotation or reduce concurrency |
| Paywall detected | Check `content_verdict` and use alternates |
| Empty content | Check `junk_results.jsonl` for diagnosis |

Run `fetcher doctor` to check environment and dependencies.
## SPA/JavaScript Page Support

Fetcher automatically falls back to Playwright for known SPA domains. If a page returns thin/empty content:

- Check if `used_playwright: 1` appears in `consumer_summary.json`
- If not, the domain may need to be added to `SPA_FALLBACK_DOMAINS` in the fetcher source
- Force a fresh fetch with `FETCHER_HTTP_CACHE_DISABLE=1`
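The first check in the list above can be scripted. Assuming `consumer_summary.json` carries a top-level `used_playwright` flag as shown (the helper itself is illustrative, not part of fetcher):

```python
import json
from pathlib import Path

def used_playwright(summary_path: Path) -> bool:
    """Return True if the run's summary records a Playwright fallback."""
    summary = json.loads(summary_path.read_text())
    return bool(summary.get("used_playwright", 0))
```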