Learn-skills.dev scrape-webpage
Scrape webpage content, extract metadata, download images, and prepare for import/migration to AEM Edge Delivery Services. Returns analysis JSON with paths, metadata, cleaned HTML, and local images.
install
source · Clone the upstream repo
git clone https://github.com/NeverSight/learn-skills.dev
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/adobe/helix-website/scrape-webpage" ~/.claude/skills/neversight-learn-skills-dev-scrape-webpage && rm -rf "$T"
manifest: data/skills-md/adobe/helix-website/scrape-webpage/SKILL.md
Scrape Webpage
Extract content, metadata, and images from a webpage for import/migration.
When to Use This Skill
Use this skill when:
- You are starting a page import and need to extract content from the source URL
- You need a webpage analysis with local image downloads
- You want metadata extraction (Open Graph, JSON-LD, etc.)
Invoked by: page-import skill (Step 1)
Prerequisites
Before using this skill, ensure:
- ✅ Node.js is available
- ✅ Playwright is installed (npm install playwright)
- ✅ The Chromium browser is installed (npx playwright install chromium)
- ✅ The Sharp image library is installed (cd .claude/skills/scrape-webpage/scripts && npm install)
Related Skills
- page-import - Orchestrator that invokes this skill
- identify-page-structure - Uses this skill's output (screenshot, HTML, metadata)
- generate-import-html - Uses image mapping and paths from this skill
Scraping Workflow
Step 1: Run Analysis Script
Command:
node .claude/skills/scrape-webpage/scripts/analyze-webpage.js "https://example.com/page" --output ./import-work
What the script does:
- Sets up network interception to capture all images
- Loads page in headless Chromium
- Scrolls through entire page to trigger lazy-loaded images
- Downloads all images locally (converts WebP/AVIF/SVG to PNG)
- Captures full-page screenshot for visual reference
- Extracts metadata (title, description, Open Graph, JSON-LD, canonical)
- Fixes images in DOM (background-image→img, picture elements, srcset→src, relative→absolute, inline SVG→img)
- Extracts cleaned HTML (removes scripts/styles)
- Replaces image URLs in HTML with local paths (./images/...)
- Generates document paths (sanitized, lowercase, no .html extension)
- Saves complete analysis with image mapping to metadata.json
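The document-path generation step above can be sketched in plain Node.js. This is a simplified illustration under assumed sanitization rules (lowercase, drop the .html extension, replace odd characters); the actual analyze-webpage.js may implement it differently:

```javascript
// Hypothetical sketch of document-path generation: turn a page URL
// into a sanitized, lowercase document path with no .html extension.
function toDocumentPath(pageUrl) {
  const { pathname } = new URL(pageUrl);
  return pathname
    .replace(/\.html?$/i, '')        // drop .html/.htm extension
    .toLowerCase()
    .replace(/[^a-z0-9\/-]+/g, '-')  // sanitize unexpected characters
    .replace(/\/+$/, '') || '/';     // trim trailing slash, keep root
}

console.log(toDocumentPath('https://example.com/US/En/About.html'));
// "/us/en/about"
```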
For a detailed explanation, see resources/web-page-analysis.md
Step 2: Verify Output
Output files:
- ./import-work/metadata.json: complete analysis with paths and image mapping
- ./import-work/screenshot.png: visual reference for layout comparison
- ./import-work/cleaned.html: main content HTML with local image paths
- ./import-work/images/: all downloaded images (WebP/AVIF/SVG converted to PNG)
Verify files exist:
ls -lh ./import-work/metadata.json ./import-work/screenshot.png ./import-work/cleaned.html
ls -lh ./import-work/images/ | head -5
Step 3: Review Metadata JSON
Output JSON structure:
{
  "url": "https://example.com/page",
  "timestamp": "2025-01-12T10:30:00.000Z",
  "paths": {
    "documentPath": "/us/en/about",
    "htmlFilePath": "us/en/about.plain.html",
    "mdFilePath": "us/en/about.md",
    "dirPath": "us/en",
    "filename": "about"
  },
  "screenshot": "./import-work/screenshot.png",
  "html": {
    "filePath": "./import-work/cleaned.html",
    "size": 45230
  },
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "og:image": "https://example.com/image.jpg",
    "canonical": "https://example.com/page"
  },
  "images": {
    "count": 15,
    "mapping": {
      "https://example.com/hero.jpg": "./images/a1b2c3d4e5f6.jpg",
      "https://example.com/logo.webp": "./images/f6e5d4c3b2a1.png"
    },
    "stats": {
      "total": 15,
      "converted": 3,
      "skipped": 12,
      "failed": 0
    }
  }
}
Key fields:
- paths.documentPath: used for the browser preview URL
- paths.htmlFilePath: where to save the final HTML file
- images.mapping: original URLs mapped to local paths
- metadata: extracted page metadata
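The images.mapping table is what ties the cleaned HTML back to the downloaded files. A minimal sketch of applying such a mapping to an HTML string (the mapping object here is hand-written for illustration, not real script output):

```javascript
// Rewrite remote image URLs in an HTML string to their local paths,
// given an images.mapping object shaped like the one in metadata.json.
function applyImageMapping(html, mapping) {
  let out = html;
  for (const [remote, local] of Object.entries(mapping)) {
    out = out.split(remote).join(local); // literal global replace
  }
  return out;
}

const mapping = {
  'https://example.com/hero.jpg': './images/a1b2c3d4e5f6.jpg',
};
const html = '<img src="https://example.com/hero.jpg" alt="Hero">';
console.log(applyImageMapping(html, mapping));
// <img src="./images/a1b2c3d4e5f6.jpg" alt="Hero">
```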
Output
This skill provides:
- ✅ metadata.json with paths, metadata, image mapping
- ✅ screenshot.png for visual reference
- ✅ cleaned.html with local image references
- ✅ images/ folder with all downloaded images
Next step: Pass these outputs to the identify-page-structure skill.
Troubleshooting
Browser not installed:
npx playwright install chromium
Sharp not installed:
cd .claude/skills/scrape-webpage/scripts && npm install
Image download failures:
- Check images.stats.failed count in metadata.json
- Some images may require authentication or be blocked by CORS
- Failed images will be noted but won't stop the scraping process
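A quick way to surface download failures programmatically is to read metadata.json and inspect images.stats. A sketch, with field names taken from the JSON structure shown earlier (in practice you would load the file with fs.readFileSync and JSON.parse; a literal object stands in here):

```javascript
// Summarize image-download failures from a parsed metadata.json object.
function reportImageFailures(analysis) {
  const { total = 0, failed = 0 } = analysis.images?.stats ?? {};
  if (failed > 0) {
    return `WARNING: ${failed} of ${total} images failed to download`;
  }
  return `All ${total} images downloaded`;
}

const analysis = {
  images: { stats: { total: 15, converted: 3, skipped: 12, failed: 0 } },
};
console.log(reportImageFailures(analysis)); // "All 15 images downloaded"
```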
Lazy-loaded images not captured:
- Script scrolls through page to trigger lazy loading
- Some advanced lazy-loading may need customization in scripts/analyze-webpage.js