Aiwg doc-scraper
Scrape documentation websites into organized reference files. Use when converting docs sites to searchable references or building Claude skills.
install
source · Clone the upstream repo
git clone https://github.com/jmagly/aiwg
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agentic/code/addons/doc-intelligence/skills/doc-scraper" ~/.claude/skills/jmagly-aiwg-doc-scraper-fff465 && rm -rf "$T"
manifest:
agentic/code/addons/doc-intelligence/skills/doc-scraper/SKILL.md
Documentation Scraper Skill
Purpose
Single responsibility: Convert documentation websites into organized, categorized reference files suitable for Claude skills or offline archives. (BP-4)
Grounding Checkpoint (Archetype 1 Mitigation)
Before executing, VERIFY:
- Target URL is accessible (test with `curl -I`)
- Documentation structure is identifiable (inspect page for content selectors)
- Output directory is writable
- Rate limiting requirements are known (check robots.txt)
DO NOT proceed without verification. Inspect before scraping.
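The "output directory is writable" check can be scripted. The following is a minimal sketch, not part of the skill itself; the function name `output_dir_is_writable` is hypothetical. It probes the directory by creating a temporary file rather than inspecting permission bits.

```python
import tempfile
from pathlib import Path


def output_dir_is_writable(path: str) -> bool:
    """Return True if a file can be created inside `path` (creating it if needed)."""
    target = Path(path)
    try:
        target.mkdir(parents=True, exist_ok=True)
        # Creating and discarding a temp file is a more reliable probe
        # than reading permission bits.
        with tempfile.TemporaryFile(dir=target):
            pass
        return True
    except OSError:
        return False
```

Calling `output_dir_is_writable("output/skill-name")` before scraping lets the run stop early instead of failing after pages have already been fetched. Note that the probe creates the directory if it does not exist.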
Uncertainty Escalation (Archetype 2 Mitigation)
ASK USER instead of guessing when:
- Content selector is ambiguous (multiple `<article>` or `<main>` elements)
- URL patterns unclear (can't determine include/exclude rules)
- Category mapping uncertain (content doesn't fit predefined categories)
- Rate limiting unknown (no robots.txt, unclear ToS)
NEVER substitute missing configuration with assumptions.
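The first escalation trigger, an ambiguous content selector, can be detected mechanically. This is an illustrative sketch using only the standard library; `ContainerCounter` and `selector_is_ambiguous` are hypothetical names, and treating "more than one `<article>` or `<main>` element" as the threshold is an assumption drawn from the bullet above.

```python
from html.parser import HTMLParser


class ContainerCounter(HTMLParser):
    """Count candidate content containers (<article> and <main>) in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = {"article": 0, "main": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag in self.counts:
            self.counts[tag] += 1


def selector_is_ambiguous(html: str) -> bool:
    """Return True when more than one candidate container is present."""
    counter = ContainerCounter()
    counter.feed(html)
    return sum(counter.counts.values()) > 1
```

When the function returns True, the safe move is to show the user the candidates and ask which one holds the documentation body. A page with no candidate container at all is a separate case that also deserves a question rather than a guess.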
Context Scope (Archetype 3 Mitigation)
| Context Type | Included | Excluded |
|---|---|---|
| RELEVANT | Target URL, selectors, output path | Unrelated documentation |
| PERIPHERAL | Similar site examples for selector hints | Historical scrape data |
| DISTRACTOR | Other projects, unrelated URLs | Previous failed attempts |
Workflow Steps
Step 1: Verify Target (Grounding)
    # Test URL accessibility
    curl -I <target-url>
    # Check robots.txt
    curl <base-url>/robots.txt
    # Inspect page structure (use browser dev tools or fetch sample)
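Once robots.txt has been fetched, its rules can be evaluated with Python's standard `urllib.robotparser` instead of being read by eye. A small sketch, assuming the file's text is already in hand; the helper name `robots_policy` is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def robots_policy(robots_txt: str, url: str, user_agent: str = "*"):
    """Return (allowed, crawl_delay) for `url` given the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() returns None when the file declares no Crawl-delay.
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)
```

If the returned delay is `None`, the rate limiting requirement is still unknown, which is one of the cases where the user should be asked.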
Step 2: Create Configuration
Generate scraper config based on inspection:
    {
      "name": "skill-name",
      "description": "When to use this skill",
      "base_url": "https://docs.example.com/",
      "selectors": {
        "main_content": "article",
        "title": "h1",
        "code_blocks": "pre code"
      },
      "url_patterns": {
        "include": ["/docs", "/guide", "/api"],
        "exclude": ["/blog", "/changelog", "/releases"]
      },
      "categories": {
        "getting_started": ["intro", "quickstart", "installation"],
        "api_reference": ["api", "reference", "methods"],
        "guides": ["guide", "tutorial", "how-to"]
      },
      "rate_limit": 0.5,
      "max_pages": 500
    }
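To make the `url_patterns` section concrete, here is one plausible reading of how include and exclude rules could be applied. This is a sketch under stated assumptions, not the documented behavior of skill-seekers: patterns are treated as substrings of the URL path, exclusions win over inclusions, and URLs on another host are rejected. The function name `should_scrape` is hypothetical.

```python
from urllib.parse import urlparse


def should_scrape(url: str, config: dict) -> bool:
    """Decide whether `url` falls inside the configured scrape scope."""
    base = urlparse(config["base_url"])
    target = urlparse(url)
    if target.netloc != base.netloc:
        return False  # assumption: stay on the documentation host
    patterns = config.get("url_patterns", {})
    path = target.path
    # Assumption: an exclude match always overrides an include match.
    if any(pattern in path for pattern in patterns.get("exclude", [])):
        return False
    include = patterns.get("include", [])
    # With no include list, everything not excluded is in scope.
    return not include or any(pattern in path for pattern in include)
```

Under these assumptions, `https://docs.example.com/blog/docs-update` is skipped even though its path contains `/docs`, because the `/blog` exclusion is checked first.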
Step 3: Execute Scraping
Option A: With skill-seekers (if installed)
    # Verify skill-seekers is available
    pip show skill-seekers
    # Run scraper
    skill-seekers scrape --config config.json
    # For large docs, use async mode
    skill-seekers scrape --config config.json --async --workers 8
Option B: Manual scraping guidance
- Use sitemap.xml or crawl starting URL
- Extract content using configured selectors
- Categorize pages based on URL patterns and keywords
- Save to organized directory structure
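For the categorization step of the manual route, a simple keyword match against the `categories` mapping from the configuration may be enough. The sketch below assumes case-insensitive substring matching on the URL and title, with the first matching category winning; the function name `categorize` and the `other` fallback are hypothetical.

```python
def categorize(url: str, title: str, categories: dict, default: str = "other") -> str:
    """Return the first category whose keywords appear in the URL or title."""
    haystack = f"{url} {title}".lower()
    # Dicts preserve insertion order, so categories are tried in config order.
    for name, keywords in categories.items():
        if any(keyword.lower() in haystack for keyword in keywords):
            return name
    return default
```

Substring matching is deliberately crude: a keyword such as `api` also matches `rapid`. Pages that land in the fallback category, or in a surprising one, are candidates for asking the user rather than guessing.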
Step 4: Validate Output
    # Check output structure
    ls -la output/<skill-name>/
    # Verify content quality
    head -50 output/<skill-name>/references/index.md
    # Count extracted pages
    find output/<skill-name>_data/pages -name "*.json" | wc -l
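The same checks can be collected into one report. This sketch assumes the directory layout described in this document (`<skill-name>/references/index.md` and `<skill-name>_data/pages/*.json`); the function name `validate_output` and the shape of the returned dictionary are hypothetical.

```python
import json
from pathlib import Path


def validate_output(output_root: str, skill_name: str) -> dict:
    """Summarize whether a scrape produced an index and parseable page files."""
    root = Path(output_root)
    index = root / skill_name / "references" / "index.md"
    pages_dir = root / f"{skill_name}_data" / "pages"
    # rglob mirrors the recursive behavior of `find ... -name "*.json"`.
    pages = sorted(pages_dir.rglob("*.json")) if pages_dir.is_dir() else []
    unreadable = []
    for page in pages:
        try:
            json.loads(page.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            unreadable.append(page.name)
    return {
        "index_exists": index.is_file(),
        "page_count": len(pages),
        "unreadable_pages": unreadable,
    }
```

A page count of zero or a non-empty `unreadable_pages` list points back to the selector and recovery checks rather than to the output step.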
Recovery Protocol (Archetype 4 Mitigation)
On error:
- PAUSE - Stop scraping, preserve already-fetched pages
- DIAGNOSE - Check error type:
  - Connection error → Verify URL, check network
  - Selector not found → Re-inspect page structure
  - Rate limited → Increase delay, reduce workers
  - Memory/disk → Reduce batch size, clear temp files
- ADAPT - Adjust configuration based on diagnosis
- RETRY - Resume from checkpoint (max 3 attempts)
- ESCALATE - Ask user for guidance
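The protocol above can be sketched as a bounded retry loop. This is an illustration, not the skill's implementation: it reads "max 3 attempts" as three attempts in total, doubles the delay between attempts as its only adaptation, and raises a hypothetical `EscalateToUser` exception when attempts run out. The `fetch` callable is supplied by the caller.

```python
import time
from typing import Callable

MAX_ATTEMPTS = 3  # assumption: "max 3 attempts" means three tries in total


class EscalateToUser(Exception):
    """Raised when retries are exhausted and a human decision is needed."""


def fetch_with_recovery(
    fetch: Callable[[str], str],
    url: str,
    delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch(url)`, backing off and retrying before escalating."""
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return fetch(url)
        except Exception as error:  # a fuller DIAGNOSE step would branch on the type
            last_error = error
            if attempt < MAX_ATTEMPTS:
                delay *= 2  # ADAPT: wait longer before the next attempt
                sleep(delay)
    raise EscalateToUser(
        f"{url} failed after {MAX_ATTEMPTS} attempts: {last_error}"
    ) from last_error
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. Pages fetched before the failure are untouched, which matches the PAUSE step.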
Checkpoint Support
State saved to:
.aiwg/working/checkpoints/doc-scraper/
Resume interrupted scrape:
skill-seekers scrape --config config.json --resume
Clear checkpoint and start fresh:
skill-seekers scrape --config config.json --fresh
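For a manual scrape without skill-seekers, checkpointing can be approximated with a small JSON file in the same directory. The file name, the `visited`/`pending` layout, and both function names below are hypothetical; the actual checkpoint format written by skill-seekers is not described here and may differ.

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path(".aiwg/working/checkpoints/doc-scraper")


def save_checkpoint(name, visited, pending, directory=CHECKPOINT_DIR) -> Path:
    """Persist scrape progress so an interrupted run can resume."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.json"
    tmp = directory / f"{name}.json.tmp"
    state = {"visited": sorted(visited), "pending": list(pending)}
    tmp.write_text(json.dumps(state), encoding="utf-8")
    # Renaming over the old file avoids leaving a half-written checkpoint.
    tmp.replace(path)
    return path


def load_checkpoint(name, directory=CHECKPOINT_DIR):
    """Return (visited, pending); both are empty when no checkpoint exists."""
    path = Path(directory) / f"{name}.json"
    if not path.is_file():
        return set(), []
    state = json.loads(path.read_text(encoding="utf-8"))
    return set(state["visited"]), list(state["pending"])
```

Resuming then means loading the checkpoint and continuing from `pending`; starting fresh means deleting the file first.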
Output Structure
    output/<skill-name>/
    ├── SKILL.md              # Main skill description
    ├── references/           # Categorized documentation
    │   ├── index.md          # Category index
    │   ├── getting_started.md
    │   ├── api_reference.md
    │   └── guides.md
    ├── scripts/              # (empty, for user additions)
    └── assets/               # (empty, for user additions)
    output/<skill-name>_data/
    ├── pages/                # Raw scraped JSON (one per page)
    └── summary.json          # Scrape statistics
Configuration Templates
Minimal Config
    {
      "name": "myframework",
      "base_url": "https://docs.example.com/",
      "max_pages": 100
    }
Full Config
    {
      "name": "myframework",
      "description": "MyFramework documentation for building web apps",
      "base_url": "https://docs.example.com/",
      "selectors": {
        "main_content": "article, main, div[role='main']",
        "title": "h1, .title",
        "code_blocks": "pre code, .highlight code",
        "navigation": "nav, .sidebar"
      },
      "url_patterns": {
        "include": ["/docs/", "/api/", "/guide/"],
        "exclude": ["/blog/", "/changelog/", "/v1/", "/v2/"]
      },
      "categories": {
        "getting_started": ["intro", "quickstart", "install", "setup"],
        "concepts": ["concept", "overview", "architecture"],
        "api": ["api", "reference", "method", "function"],
        "guides": ["guide", "tutorial", "how-to", "example"],
        "advanced": ["advanced", "internals", "customize"]
      },
      "rate_limit": 0.5,
      "max_pages": 1000,
      "checkpoint": {
        "enabled": true,
        "interval": 100
      }
    }
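Before a config built from either template is used, a quick sanity check can catch obvious mistakes. The rules in this sketch are assumptions rather than a published schema: `name` and `base_url` are treated as required because both templates include them, and the optional numeric fields are checked only for plausible values. The function name `validate_config` is hypothetical.

```python
from urllib.parse import urlparse


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in ("name", "base_url"):
        value = config.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"missing or empty '{key}'")
    base_url = config.get("base_url")
    if isinstance(base_url, str) and base_url:
        parsed = urlparse(base_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("'base_url' must be an absolute http(s) URL")
    rate = config.get("rate_limit")
    if rate is not None and (not isinstance(rate, (int, float)) or rate < 0):
        problems.append("'rate_limit' must be a non-negative number of seconds")
    pages = config.get("max_pages")
    if pages is not None and (not isinstance(pages, int) or pages <= 0):
        problems.append("'max_pages' must be a positive integer")
    return problems
```

Returning a list instead of raising on the first problem lets all issues be reported to the user in one pass.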
Troubleshooting
| Issue | Diagnosis | Solution |
|---|---|---|
| No content extracted | Selector mismatch | Inspect page, update selector |
| Wrong pages scraped | URL pattern issue | Check `include`/`exclude` patterns |
| Rate limited | Too aggressive | Increase `rate_limit` to 1.0+ seconds |
| Memory issues | Too many pages | Add `max_pages` limit, enable checkpoints |
| Categories wrong | Keyword mismatch | Update category keywords in config |
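For manual scraping, the rate limiting fix can be enforced in code. The sketch below assumes `rate_limit` is a minimum delay in seconds between requests, as the "1.0+ seconds" suggestion implies; the `RateLimiter` class is hypothetical and takes its clock and sleep functions as parameters so it can be exercised without real waiting.

```python
import time


class RateLimiter:
    """Enforce a minimum delay, in seconds, between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.min_interval - (now - self._last))
            if pause:
                self._sleep(pause)
        self._last = now + pause
        return pause
```

Calling `limiter.wait()` immediately before each request keeps the spacing at or above the configured interval, and raising the interval after a rate limit response needs only a change to `min_interval`.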
References
- Skill Seekers: https://github.com/jmagly/Skill_Seekers
- REF-001: Production-Grade Agentic Workflows (BP-1, BP-4, BP-9)
- REF-002: LLM Failure Modes (Archetype 1-4 mitigations)