Claude-skill-registry fetcher

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/fetcher" ~/.claude/skills/majiayu000-claude-skill-registry-fetcher && rm -rf "$T"
manifest: skills/data/fetcher/SKILL.md
source content

Fetcher - Web Crawling

Fetch web pages and documents with automatic fallbacks, proxy rotation, and content extraction.

Self-contained skill - auto-installs via uvx from git (no pre-installation needed).

Fully automatic - Playwright browsers are installed on first run for SPA/JS page support.
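
Because installation is automatic, a one-off run through uvx itself should also work. This is a sketch, assuming the fetcher CLI entry point ships from the same git source used in the Troubleshooting section below:

uvx --from "git+https://github.com/grahama1970/fetcher.git" fetcher get https://example.com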

Simplest Usage

# Via wrapper (recommended - auto-installs)
.agents/skills/fetcher/run.sh get https://example.com

# Or directly if fetcher is installed
fetcher get https://example.com

Common Commands

./run.sh get https://example.com                   # Fetch single URL
./run.sh get-manifest urls.txt                     # Fetch list of URLs
./run.sh get-manifest - < urls.txt                 # Fetch from stdin
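
A minimal end-to-end sketch of the manifest flow, using only the commands above (the manifest format is one URL per line, as noted under Common Patterns):

printf '%s\n' https://example.com https://www.nasa.gov > urls.txt   # one URL per line
./run.sh get-manifest urls.txt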

Common Patterns

Fetch a single URL

fetcher get https://www.nasa.gov --out run/nasa

Outputs to run/nasa/:

  • consumer_summary.json - structured result
  • Walkthrough.md - human-readable summary
  • downloads/ - raw content files
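
A quick way to inspect the structured result is jq. In this sketch, used_playwright is the only field name confirmed in this document (see the SPA section at the end); treat other keys as run-dependent:

jq '.used_playwright' run/nasa/consumer_summary.json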

Fetch multiple URLs

# From file (one URL per line)
fetcher get-manifest urls.txt --out run/batch

# From stdin
echo -e "https://example.com\nhttps://nasa.gov" | fetcher get-manifest -
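
Stdin mode composes with ordinary shell filters. A sketch, assuming --out is accepted in stdin mode the same way it is with a file:

grep -Ev '^\s*(#|$)' sources.txt | fetcher get-manifest - --out run/curated   # drop comments/blank lines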

ETL mode (full control)

fetcher-etl --inventory urls.jsonl --out run/etl_batch
fetcher-etl --manifest urls.txt --out run/demo
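
The inventory format is not spelled out here; a plausible minimal sketch, assuming JSONL records shaped like the Python API entries further below ({"url": ...}):

printf '%s\n' '{"url": "https://example.com"}' '{"url": "https://www.nasa.gov"}' > urls.jsonl
fetcher-etl --inventory urls.jsonl --out run/etl_batch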

Check environment

fetcher doctor                    # Check dependencies and config
fetcher get --dry-run <url>       # Validate without fetching
fetcher-etl --help-full           # All options
fetcher-etl --find metrics        # Search options
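
The --dry-run flag makes pre-flight validation of a whole manifest straightforward; this sketch assumes a non-zero exit code on validation failure:

while read -r url; do
  fetcher get --dry-run "$url" || echo "skip: $url"
done < urls.txt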

Output Structure

run/artifacts/<run-id>/
├── results.jsonl              # Fetch results per URL
├── consumer_summary.json      # Summary stats
├── Walkthrough.md             # Human-readable summary
├── downloads/                 # Raw files (HTML, PDF, etc.)
├── text_blobs/                # Extracted text
├── markdown/                  # LLM-friendly markdown
├── fit_markdown/              # Pruned markdown for LLM input
├── junk_results.jsonl         # Failed/junk URLs
└── junk_table.md              # Quick triage table
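
Since results.jsonl holds one fetch result per URL, a jq filter gives quick triage. This sketch assumes each row carries the FetchResult fields documented below (url, content_verdict):

jq -r 'select(.content_verdict != "ok") | .url' run/artifacts/<run-id>/results.jsonl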

Content Extraction

Enable markdown output

export FETCHER_EMIT_MARKDOWN=1
export FETCHER_EMIT_FIT_MARKDOWN=1  # Pruned for LLM input
fetcher get https://example.com

Rolling windows (for chunking)

export FETCHER_DOWNLOAD_MODE=rolling_extract
export FETCHER_ROLLING_WINDOW_SIZE=6000
export FETCHER_ROLLING_WINDOW_STEP=3000
fetcher get https://example.com
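
With these settings each window spans 6000 characters and starts 3000 characters after the previous one, so consecutive windows overlap by 50% - a common choice for chunked LLM input.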

Advanced Features

HTTP caching

# Cache enabled by default
fetcher get https://example.com

# Disable cache for fresh fetch
fetcher get https://example.com --no-http-cache
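
The same effect should be achievable via the FETCHER_HTTP_CACHE_DISABLE variable listed under Environment Variables, which is handy where flags cannot be passed:

FETCHER_HTTP_CACHE_DISABLE=1 fetcher get https://example.com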

PDF discovery

# Auto-fetch PDF links from HTML pages
export FETCHER_ENABLE_PDF_DISCOVERY=1
export FETCHER_PDF_DISCOVERY_MAX=3
fetcher get https://example.com

Proxy rotation (rate-limited sites)

export SPARTA_STEP06_PROXY_HOST=gw.iproyal.com
export SPARTA_STEP06_PROXY_PORT=12321
export SPARTA_STEP06_PROXY_USER=team
export SPARTA_STEP06_PROXY_PASSWORD=secret
fetcher-etl --inventory urls.jsonl

Brave/Wayback fallbacks

# Enable alternate URL resolution
export BRAVE_API_KEY=sk-your-key
fetcher-etl --use-alternates --inventory urls.jsonl

Python API

import asyncio
from fetcher.workflows.web_fetch import URLFetcher, FetchConfig, write_results
from pathlib import Path

async def main():
    config = FetchConfig(concurrency=4, per_domain=2)
    fetcher = URLFetcher(config)
    entries = [{"url": "https://www.nasa.gov"}]
    results, audit = await fetcher.fetch_many(entries)
    write_results(results, Path("artifacts/nasa.jsonl"))
    print(audit)

asyncio.run(main())

Single URL helper

import asyncio
from fetcher.workflows.fetcher import fetch_url

async def main():
    result = await fetch_url("https://example.com")
    print(result.content_verdict)  # "ok", "empty", "paywall", etc.
    print(result.text)             # Extracted text

asyncio.run(main())

FetchResult Fields

| Field | Description |
| --- | --- |
| url | Original URL |
| final_url | After redirects |
| content_verdict | ok, empty, paywall, error, etc. |
| text | Extracted text content |
| file_path | Path to raw download |
| markdown_path | Path to markdown (if enabled) |
| from_cache | Whether result came from cache |
| content_sha256 | Content hash for change detection |

Environment Variables

| Variable | Purpose |
| --- | --- |
| BRAVE_API_KEY | Enable Brave search fallbacks |
| FETCHER_EMIT_MARKDOWN | Generate LLM-friendly markdown |
| FETCHER_EMIT_FIT_MARKDOWN | Generate pruned markdown |
| FETCHER_DOWNLOAD_MODE | text, download_only, rolling_extract |
| FETCHER_HTTP_CACHE_DISABLE | Disable HTTP caching |
| FETCHER_ENABLE_PDF_DISCOVERY | Auto-fetch embedded PDFs |
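
Putting the variables together, a sketch of a markdown-focused batch run using only the settings documented above:

export FETCHER_EMIT_MARKDOWN=1
export FETCHER_EMIT_FIT_MARKDOWN=1
export FETCHER_DOWNLOAD_MODE=text
fetcher get-manifest urls.txt --out run/llm_batch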

Troubleshooting

| Problem | Solution |
| --- | --- |
| Playwright missing | uvx --from "git+https://github.com/grahama1970/fetcher.git" playwright install chromium |
| SPA page returns empty/thin | Playwright auto-fallback should trigger; check used_playwright in summary |
| Stale cached results | Set FETCHER_HTTP_CACHE_DISABLE=1 for fresh fetch |
| Rate limited | Configure proxy rotation or reduce concurrency |
| Paywall detected | Check content_verdict and use alternates |
| Empty content | Check junk_results.jsonl for diagnosis |

Run fetcher doctor to check environment and dependencies.

SPA/JavaScript Page Support

Fetcher automatically falls back to Playwright for known SPA domains. If a page returns thin/empty content:

  1. Check whether used_playwright: 1 appears in consumer_summary.json
  2. If not, the domain may need to be added to SPA_FALLBACK_DOMAINS in the fetcher source
  3. Force a fresh fetch with FETCHER_HTTP_CACHE_DISABLE=1 (see the sketch below)
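
A sketch combining steps 1 and 3: force a fresh fetch, then grep the summary for the fallback flag (flags and variables as documented above):

FETCHER_HTTP_CACHE_DISABLE=1 fetcher get https://example.com --out run/retry
grep used_playwright run/retry/consumer_summary.json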