Goose-skills web-archive-scraper
install
source · Clone the upstream repo
git clone https://github.com/gooseworks-ai/goose-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/gooseworks-ai/goose-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/capabilities/web-archive-scraper" ~/.claude/skills/gooseworks-ai-goose-skills-web-archive-scraper && rm -rf "$T"
manifest: skills/capabilities/web-archive-scraper/SKILL.md
Web Archive Scraper
Search the Wayback Machine (Internet Archive) for archived snapshots of websites. Fetch cached page content to find customer lists, testimonials, partner directories, and other information from sites that have changed or shut down.
Quick Start
The only dependency is `requests`. No API key is needed.
    # Find all snapshots of a URL
    python3 skills/web-archive-scraper/scripts/search_archive.py \
      --url "https://botkeeper.com/customers"

    # Search with date range
    python3 skills/web-archive-scraper/scripts/search_archive.py \
      --url "https://botkeeper.com" --from 2025-01-01 --to 2026-02-01

    # Search all pages under a domain (prefix match)
    python3 skills/web-archive-scraper/scripts/search_archive.py \
      --url "https://botkeeper.com" --match prefix --limit 50

    # Fetch the actual archived page content
    python3 skills/web-archive-scraper/scripts/search_archive.py \
      --url "https://botkeeper.com/customers" --fetch

    # Output formats
    python3 skills/web-archive-scraper/scripts/search_archive.py --url URL --output json
    python3 skills/web-archive-scraper/scripts/search_archive.py --url URL --output csv
    python3 skills/web-archive-scraper/scripts/search_archive.py --url URL --output summary
How It Works
- CDX API search: Queries `web.archive.org/cdx/search/cdx` for snapshots matching the URL
- Filtering: Filters by date range, HTTP status code, and MIME type
- Dedup: Collapses to one snapshot per day by default to avoid redundant results
- Content fetch: Optionally fetches the raw archived HTML (using the `id_` modifier to skip the Wayback toolbar)
- Text extraction: Strips HTML tags for readable text output when fetching content
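
The search and text-extraction steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the skill's actual `search_archive.py`: the helper names `build_cdx_params` and `strip_html` are hypothetical, and the query parameters (`matchType`, `from`, `to`, `filter`, `collapse`, `limit`, `output`) are those of the public Wayback CDX server API. Collapsing on the first 8 timestamp digits is assumed here as one way to keep a single snapshot per day.

```python
from html.parser import HTMLParser

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_params(url, match="exact", date_from=None, date_to=None,
                     status="200", limit=25, dedup_digits=8):
    """Build CDX query parameters (hypothetical helper, not the skill's code)."""
    params = {"url": url, "matchType": match, "output": "json", "limit": str(limit)}
    if date_from:
        params["from"] = date_from.replace("-", "")  # CDX takes YYYYMMDD[hhmmss]
    if date_to:
        params["to"] = date_to.replace("-", "")
    if status != "any":
        params["filter"] = f"statuscode:{status}"
    if dedup_digits:
        # timestamp:8 compares YYYYMMDD only, i.e. one snapshot per day
        params["collapse"] = f"timestamp:{dedup_digits}"
    return params


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def strip_html(html):
    """Return the readable text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The resulting dictionary could be passed to `requests.get(CDX_ENDPOINT, params=params, timeout=30)`, and the body of a fetched `id_` snapshot to `strip_html`.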
CLI Reference
| Flag | Default | Description |
|---|---|---|
| `--url` | required | Target URL to search in the archive |
| `--match` | exact | Match type, e.g. `exact` or `prefix` |
| `--from` | none | Start date (YYYY-MM-DD) |
| `--to` | none | End date (YYYY-MM-DD) |
| `--limit` | 25 | Max number of snapshots to return |
| `--fetch` | false | Fetch and display the content of the most recent snapshot |
| `--fetch-all` | false | Fetch content of ALL matched snapshots (use with a small `--limit`) |
| | 200 | HTTP status filter (set to "any" to include all) |
| `--output` | json | Output format: `json`, `csv`, `summary` |
| | day | Dedup level (collapses to one snapshot per day by default) |
Output Schema
    {
      "url": "https://botkeeper.com/customers",
      "timestamp": "20250915143022",
      "datetime": "2025-09-15T14:30:22",
      "status_code": "200",
      "mime_type": "text/html",
      "archive_url": "https://web.archive.org/web/20250915143022/https://botkeeper.com/customers",
      "raw_url": "https://web.archive.org/web/20250915143022id_/https://botkeeper.com/customers",
      "content": "..."
    }
The `content` field is only populated when `--fetch` or `--fetch-all` is used.
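
As an illustration of how records in this shape might be derived, the sketch below converts CDX JSON output into schema-style dictionaries. It assumes the CDX server's `output=json` layout, where the first row is a header naming the columns (`timestamp`, `original`, `mimetype`, `statuscode`, and so on). The function name `rows_to_records` is hypothetical and the skill's own script may build its records differently.

```python
from datetime import datetime

WAYBACK = "https://web.archive.org/web"


def rows_to_records(rows):
    """Convert CDX JSON rows (header row first) into schema-style dicts."""
    if not rows:
        return []
    header, *data = rows
    records = []
    for row in data:
        fields = dict(zip(header, row))
        ts = fields["timestamp"]
        original = fields["original"]
        records.append({
            "url": original,
            "timestamp": ts,
            # 14-digit Wayback timestamp -> ISO 8601 string
            "datetime": datetime.strptime(ts, "%Y%m%d%H%M%S").isoformat(),
            "status_code": fields["statuscode"],
            "mime_type": fields["mimetype"],
            "archive_url": f"{WAYBACK}/{ts}/{original}",
            # the id_ suffix requests the raw capture without the toolbar
            "raw_url": f"{WAYBACK}/{ts}id_/{original}",
        })
    return records
```

The `content` key is left out here because it only applies when a snapshot body has been fetched.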
Cost
Free. The Wayback Machine CDX API requires no authentication or API key. Rate limit is ~15 requests/minute.
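
When fetching many snapshots (for example with `--fetch-all`), staying under roughly 15 requests per minute means spacing requests about 4 seconds apart. The sketch below shows one way a caller could do that. The `Throttle` class is hypothetical and not part of the skill; the clock and sleep functions are injectable only so the behaviour can be checked without waiting.

```python
import time


class Throttle:
    """Space calls at least 60 / per_minute seconds apart (illustrative)."""

    def __init__(self, per_minute=15, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each HTTP request keeps the request rate at or below the configured limit.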
Common Use Cases
- Find customer lists from shut-down companies (e.g., botkeeper.com)
- Recover testimonials/case studies before a site redesign
- Track how a competitor's messaging changed over time
- Find partner directories that have been removed
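
For the messaging-tracking use case, one possible follow-up step is to diff the extracted text of two snapshots. The sketch below is an assumption about how that could be done with the standard library's `difflib`. The function `messaging_changes` is hypothetical, and it expects the two inputs to be the `content` values of an older and a newer record, with one statement per line.

```python
import difflib


def messaging_changes(old_text, new_text):
    """Return (removed, added) lines between two snapshots' text."""
    old_lines = [line.strip() for line in old_text.splitlines() if line.strip()]
    new_lines = [line.strip() for line in new_text.splitlines() if line.strip()]
    removed, added = [], []
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("- "):
            removed.append(line[2:])   # present only in the older snapshot
        elif line.startswith("+ "):
            added.append(line[2:])     # present only in the newer snapshot
    return removed, added
```

Lines that appear in both snapshots are ignored, so the result highlights only what was dropped or introduced between the two captures.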