Aiwg doc-scraper

Scrape documentation websites into organized reference files. Use when converting docs sites to searchable references or building Claude skills.

install

source · Clone the upstream repo

git clone https://github.com/jmagly/aiwg

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agentic/code/addons/doc-intelligence/skills/doc-scraper" ~/.claude/skills/jmagly-aiwg-doc-scraper-fff465 && rm -rf "$T"

manifest: agentic/code/addons/doc-intelligence/skills/doc-scraper/SKILL.md

source content

Documentation Scraper Skill

Purpose

Single responsibility: Convert documentation websites into organized, categorized reference files suitable for Claude skills or offline archives. (BP-4)

Grounding Checkpoint (Archetype 1 Mitigation)

Before executing, VERIFY:

Target URL is accessible (test with
```
curl -I
```
)
Documentation structure is identifiable (inspect page for content selectors)
Output directory is writable
Rate limiting requirements are known (check robots.txt)

DO NOT proceed without verification. Inspect before scraping.

Uncertainty Escalation (Archetype 2 Mitigation)

ASK USER instead of guessing when:

Content selector is ambiguous (multiple
```
<article>
```
or
```
<main>
```
elements)
URL patterns unclear (can't determine include/exclude rules)
Category mapping uncertain (content doesn't fit predefined categories)
Rate limiting unknown (no robots.txt, unclear ToS)

NEVER substitute missing configuration with assumptions.

Context Scope (Archetype 3 Mitigation)

Context Type	Included	Excluded
RELEVANT	Target URL, selectors, output path	Unrelated documentation
PERIPHERAL	Similar site examples for selector hints	Historical scrape data
DISTRACTOR	Other projects, unrelated URLs	Previous failed attempts

Workflow Steps

Step 1: Verify Target (Grounding)

# Test URL accessibility
curl -I <target-url>

# Check robots.txt
curl <base-url>/robots.txt

# Inspect page structure (use browser dev tools or fetch sample)

Step 2: Create Configuration

Generate scraper config based on inspection:

{
  "name": "skill-name",
  "description": "When to use this skill",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/docs", "/guide", "/api"],
    "exclude": ["/blog", "/changelog", "/releases"]
  },
  "categories": {
    "getting_started": ["intro", "quickstart", "installation"],
    "api_reference": ["api", "reference", "methods"],
    "guides": ["guide", "tutorial", "how-to"]
  },
  "rate_limit": 0.5,
  "max_pages": 500
}

Step 3: Execute Scraping

Option A: With skill-seekers (if installed)

# Verify skill-seekers is available
pip show skill-seekers

# Run scraper
skill-seekers scrape --config config.json

# For large docs, use async mode
skill-seekers scrape --config config.json --async --workers 8

Option B: Manual scraping guidance

Use sitemap.xml or crawl starting URL
Extract content using configured selectors
Categorize pages based on URL patterns and keywords
Save to organized directory structure

Step 4: Validate Output

# Check output structure
ls -la output/<skill-name>/

# Verify content quality
head -50 output/<skill-name>/references/index.md

# Count extracted pages
find output/<skill-name>_data/pages -name "*.json" | wc -l

Recovery Protocol (Archetype 4 Mitigation)

On error:

PAUSE - Stop scraping, preserve already-fetched pages
DIAGNOSE - Check error type:
- ```
Connection error
```
  → Verify URL, check network
- ```
Selector not found
```
  → Re-inspect page structure
- ```
Rate limited
```
  → Increase delay, reduce workers
- ```
Memory/disk
```
  → Reduce batch size, clear temp files
ADAPT - Adjust configuration based on diagnosis
RETRY - Resume from checkpoint (max 3 attempts)
ESCALATE - Ask user for guidance

Checkpoint Support

State saved to:

.aiwg/working/checkpoints/doc-scraper/

Resume interrupted scrape:

skill-seekers scrape --config config.json --resume

Clear checkpoint and start fresh:

skill-seekers scrape --config config.json --fresh

Output Structure

output/<skill-name>/
├── SKILL.md              # Main skill description
├── references/           # Categorized documentation
│   ├── index.md          # Category index
│   ├── getting_started.md
│   ├── api_reference.md
│   └── guides.md
├── scripts/              # (empty, for user additions)
└── assets/               # (empty, for user additions)

output/<skill-name>_data/
├── pages/                # Raw scraped JSON (one per page)
└── summary.json          # Scrape statistics

Configuration Templates

Minimal Config

{
  "name": "myframework",
  "base_url": "https://docs.example.com/",
  "max_pages": 100
}

Full Config

{
  "name": "myframework",
  "description": "MyFramework documentation for building web apps",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article, main, div[role='main']",
    "title": "h1, .title",
    "code_blocks": "pre code, .highlight code",
    "navigation": "nav, .sidebar"
  },
  "url_patterns": {
    "include": ["/docs/", "/api/", "/guide/"],
    "exclude": ["/blog/", "/changelog/", "/v1/", "/v2/"]
  },
  "categories": {
    "getting_started": ["intro", "quickstart", "install", "setup"],
    "concepts": ["concept", "overview", "architecture"],
    "api": ["api", "reference", "method", "function"],
    "guides": ["guide", "tutorial", "how-to", "example"],
    "advanced": ["advanced", "internals", "customize"]
  },
  "rate_limit": 0.5,
  "max_pages": 1000,
  "checkpoint": {
    "enabled": true,
    "interval": 100
  }
}

Troubleshooting

Issue	Diagnosis	Solution
No content extracted	Selector mismatch	Inspect page, update `main_content` selector
Wrong pages scraped	URL pattern issue	Check `include` / `exclude` patterns
Rate limited	Too aggressive	Increase `rate_limit` to 1.0+ seconds
Memory issues	Too many pages	Add `max_pages` limit, enable checkpoints
Categories wrong	Keyword mismatch	Update category keywords in config

References

Skill Seekers: https://github.com/jmagly/Skill_Seekers
REF-001: Production-Grade Agentic Workflows (BP-1, BP-4, BP-9)
REF-002: LLM Failure Modes (Archetype 1-4 mitigations)