Claude-skill-registry fetcher

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/fetcher" ~/.claude/skills/majiayu000-claude-skill-registry-fetcher && rm -rf "$T"
manifest: skills/data/fetcher/SKILL.md
source content

Fetcher - Web Crawling

Fetch web pages and documents with automatic fallbacks, proxy rotation, and content extraction.

Self-contained skill - auto-installs via uvx from git (no pre-installation needed).

Fully automatic - Playwright browsers are installed on first run for SPA/JS page support.
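
Because installation is automatic, a one-off run through uvx itself should also work. This is a sketch, assuming the fetcher CLI entry point ships from the same git source used in the Troubleshooting section below:

uvx --from "git+https://github.com/grahama1970/fetcher.git" fetcher get https://example.com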

Simplest Usage

# Via wrapper (recommended - auto-installs)
.agents/skills/fetcher/run.sh get https://example.com

# Or directly if fetcher is installed
fetcher get https://example.com

Common Commands

./run.sh get https://example.com                   # Fetch single URL
./run.sh get-manifest urls.txt                     # Fetch list of URLs
./run.sh get-manifest - < urls.txt                 # Fetch from stdin
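
A minimal end-to-end sketch of the manifest flow, using only the commands above (the manifest format is one URL per line, as noted under Common Patterns):

printf '%s\n' https://example.com https://www.nasa.gov > urls.txt   # one URL per line
./run.sh get-manifest urls.txt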

Common Patterns

Fetch a single URL

fetcher get https://www.nasa.gov --out run/nasa

Outputs to run/nasa/:

  • consumer_summary.json - structured result
  • Walkthrough.md - human-readable summary
  • downloads/ - raw content files
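
A quick way to inspect the structured result is jq. In this sketch, used_playwright is the only field name confirmed in this document (see the SPA section at the end); treat other keys as run-dependent:

jq '.used_playwright' run/nasa/consumer_summary.json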

Fetch multiple URLs

# From file (one URL per line)
fetcher get-manifest urls.txt --out run/batch

# From stdin
echo -e "https://example.com\nhttps://nasa.gov" | fetcher get-manifest -
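
Stdin mode composes with ordinary shell filters. A sketch, assuming --out is accepted in stdin mode the same way it is with a file:

grep -Ev '^\s*(#|$)' sources.txt | fetcher get-manifest - --out run/curated   # drop comments/blank lines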

ETL mode (full control)

fetcher-etl --inventory urls.jsonl --out run/etl_batch
fetcher-etl --manifest urls.txt --out run/demo
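
The inventory format is not spelled out here; a plausible minimal sketch, assuming JSONL records shaped like the Python API entries further below ({"url": ...}):

printf '%s\n' '{"url": "https://example.com"}' '{"url": "https://www.nasa.gov"}' > urls.jsonl
fetcher-etl --inventory urls.jsonl --out run/etl_batch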

Check environment

fetcher doctor                    # Check dependencies and config
fetcher get --dry-run <url>       # Validate without fetching
fetcher-etl --help-full           # All options
fetcher-etl --find metrics        # Search options
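
The --dry-run flag makes pre-flight validation of a whole manifest straightforward; this sketch assumes a non-zero exit code on validation failure:

while read -r url; do
  fetcher get --dry-run "$url" || echo "skip: $url"
done < urls.txt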

Output Structure

run/artifacts/<run-id>/
├── results.jsonl              # Fetch results per URL
├── consumer_summary.json      # Summary stats
├── Walkthrough.md             # Human-readable summary
├── downloads/                 # Raw files (HTML, PDF, etc.)
├── text_blobs/                # Extracted text
├── markdown/                  # LLM-friendly markdown
├── fit_markdown/              # Pruned markdown for LLM input
├── junk_results.jsonl         # Failed/junk URLs
└── junk_table.md              # Quick triage table
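
Since results.jsonl holds one fetch result per URL, a jq filter gives quick triage. This sketch assumes each row carries the FetchResult fields documented below (url, content_verdict):

jq -r 'select(.content_verdict != "ok") | .url' run/artifacts/<run-id>/results.jsonl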

Content Extraction

Enable markdown output

export FETCHER_EMIT_MARKDOWN=1
export FETCHER_EMIT_FIT_MARKDOWN=1  # Pruned for LLM input
fetcher get https://example.com

Rolling windows (for chunking)

export FETCHER_DOWNLOAD_MODE=rolling_extract
export FETCHER_ROLLING_WINDOW_SIZE=6000
export FETCHER_ROLLING_WINDOW_STEP=3000
fetcher get https://example.com
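
With these settings each window spans 6000 characters and starts 3000 characters after the previous one, so consecutive windows overlap by 50% - a common choice for chunked LLM input.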

Advanced Features

HTTP caching

# Cache enabled by default
fetcher get https://example.com

# Disable cache for fresh fetch
fetcher get https://example.com --no-http-cache
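
The same effect should be achievable via the FETCHER_HTTP_CACHE_DISABLE variable listed under Environment Variables, which is handy where flags cannot be passed:

FETCHER_HTTP_CACHE_DISABLE=1 fetcher get https://example.com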

PDF discovery

# Auto-fetch PDF links from HTML pages
export FETCHER_ENABLE_PDF_DISCOVERY=1
export FETCHER_PDF_DISCOVERY_MAX=3
fetcher get https://example.com

Proxy rotation (rate-limited sites)

export SPARTA_STEP06_PROXY_HOST=gw.iproyal.com
export SPARTA_STEP06_PROXY_PORT=12321
export SPARTA_STEP06_PROXY_USER=team
export SPARTA_STEP06_PROXY_PASSWORD=secret
fetcher-etl --inventory urls.jsonl

Brave/Wayback fallbacks

# Enable alternate URL resolution
export BRAVE_API_KEY=sk-your-key
fetcher-etl --use-alternates --inventory urls.jsonl

Python API

import asyncio
from fetcher.workflows.web_fetch import URLFetcher, FetchConfig, write_results
from pathlib import Path

async def main():
    config = FetchConfig(concurrency=4, per_domain=2)
    fetcher = URLFetcher(config)
    entries = [{"url": "https://www.nasa.gov"}]
    results, audit = await fetcher.fetch_many(entries)
    write_results(results, Path("artifacts/nasa.jsonl"))
    print(audit)

asyncio.run(main())

Single URL helper

import asyncio
from fetcher.workflows.fetcher import fetch_url

async def main():
    result = await fetch_url("https://example.com")
    print(result.content_verdict)  # "ok", "empty", "paywall", etc.
    print(result.text)             # Extracted text

asyncio.run(main())

FetchResult Fields

| Field | Description |
| --- | --- |
| url | Original URL |
| final_url | After redirects |
| content_verdict | ok, empty, paywall, error, etc. |
| text | Extracted text content |
| file_path | Path to raw download |
| markdown_path | Path to markdown (if enabled) |
| from_cache | Whether result came from cache |
| content_sha256 | Content hash for change detection |

Environment Variables

| Variable | Purpose |
| --- | --- |
| BRAVE_API_KEY | Enable Brave search fallbacks |
| FETCHER_EMIT_MARKDOWN | Generate LLM-friendly markdown |
| FETCHER_EMIT_FIT_MARKDOWN | Generate pruned markdown |
| FETCHER_DOWNLOAD_MODE | text, download_only, rolling_extract |
| FETCHER_HTTP_CACHE_DISABLE | Disable HTTP caching |
| FETCHER_ENABLE_PDF_DISCOVERY | Auto-fetch embedded PDFs |
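
Putting the variables together, a sketch of a markdown-focused batch run using only the settings documented above:

export FETCHER_EMIT_MARKDOWN=1
export FETCHER_EMIT_FIT_MARKDOWN=1
export FETCHER_DOWNLOAD_MODE=text
fetcher get-manifest urls.txt --out run/llm_batch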

Troubleshooting

| Problem | Solution |
| --- | --- |
| Playwright missing | uvx --from "git+https://github.com/grahama1970/fetcher.git" playwright install chromium |
| SPA page returns empty/thin | Playwright auto-fallback should trigger; check used_playwright in summary |
| Stale cached results | Set FETCHER_HTTP_CACHE_DISABLE=1 for fresh fetch |
| Rate limited | Configure proxy rotation or reduce concurrency |
| Paywall detected | Check content_verdict and use alternates |
| Empty content | Check junk_results.jsonl for diagnosis |

Run fetcher doctor to check environment and dependencies.

SPA/JavaScript Page Support

Fetcher automatically falls back to Playwright for known SPA domains. If a page returns thin/empty content:

  1. Check whether used_playwright: 1 appears in consumer_summary.json
  2. If not, the domain may need to be added to SPA_FALLBACK_DOMAINS in the fetcher source
  3. Force a fresh fetch with FETCHER_HTTP_CACHE_DISABLE=1 (see the sketch below)
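
A sketch combining steps 1 and 3: force a fresh fetch, then grep the summary for the fallback flag (flags and variables as documented above):

FETCHER_HTTP_CACHE_DISABLE=1 fetcher get https://example.com --out run/retry
grep used_playwright run/retry/consumer_summary.json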