Claude-skill-registry get-web-page-raw
Capture a web page as raw material (extracted text/Markdown) with metadata, storing it in the raw/ directory for later distillation.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/get-web-page-raw" ~/.claude/skills/majiayu000-claude-skill-registry-get-web-page-raw && rm -rf "$T"
manifest:
skills/data/get-web-page-raw/SKILL.mdsource content
Get Web Page (Raw)
When to use
Use this skill when:
- The user provides a URL to capture as source material
- You need to store a web page for later distillation
- You're building up raw materials for the distillation pipeline
Keywords: capture, web page, URL, fetch, raw
Inputs
Required:
(string): The URL of the web page to captureurl
Optional:
(string): A human-provided hint for the title if extraction failstitle_hint
Outputs
This skill produces:
- A new file in
with the naming pattern:raw/YYYYMMDD-HHMMSSZ--<slug>.md - Metadata returned to the agent:
: full path to the created fileraw_path
: extracted or inferred titletitle
: canonical URL if found, otherwise the input URLcanonical_url
: ISO timestamp (UTC)captured_at
Procedure
1. Generate timestamp and slug
- Get current UTC timestamp:
date -u +"%Y%m%d-%H%M%SZ" - Create a slug from the page title or URL path:
- Use the main topic/title in lowercase
- Replace spaces and special chars with hyphens
- Keep it concise (typically 3-8 words)
- Example:
patterns-for-ai-assisted-software-development
2. Fetch the web page content
Primary method: fetch_webpage tool
- Call
with the URL and a descriptive query about the contentfetch_webpage - This extracts main content, removes navigation/ads, and converts to clean Markdown
- Captures metadata like title and author when available
Alternative methods (when needed):
- Playwright MCP: For complex interactive pages, JavaScript-heavy sites, or when you need precise browser control
- curl + HTML parsing: For simple pages when fetch_webpage isn't available, though requires manual Markdown conversion
- Browser tools: Built-in browser navigation and snapshot capabilities
Note: Most public articles and blog posts work well with fetch_webpage
3. Build front matter
Create YAML front matter with required and available fields:
--- title: "<extracted from page>" source_url: "<original URL provided>" captured_at: "<ISO timestamp from step 1>" capture_type: web_page capture_tool: fetch_webpage # or browser if used raw_format: markdown status: captured author: "<if clearly visible>" # optional published_at: "<YYYY-MM-DD if available>" # optional tags: ["keyword1", "keyword2"] # optional, 3-5 relevant tags ---
Tips:
- Use the page's actual title, not a description
- Convert published_at to simple date format (YYYY-MM-DD)
- Tags should be broad topics, not exhaustive keywords
4. Write the file
- Filename:
(e.g.,<timestamp>--<slug>.md
)20260102-095107Z--patterns-for-ai-assisted-software-development.md - Full path:
raw/<filename> - Content: front matter + blank line + extracted content
- Use
tool with the full contentcreate_file
5. Confirm to user
Provide a brief confirmation:
- Link to the created file using relative path
- One-sentence summary of what the article covers (2-3 key topics)
Examples
Example 1: Dev.to article capture
Input:
url: "https://dev.to/javatarz/patterns-for-ai-assisted-software-development-4ga2"
Actions:
- Generate timestamp:
20260102-095107Z - Fetch with
toolfetch_webpage - Extract title: "Patterns for AI assisted software development"
- Create slug:
patterns-for-ai-assisted-software-development - Write to:
raw/20260102-095107Z--patterns-for-ai-assisted-software-development.md
Output:
--- title: "Patterns for AI assisted software development" source_url: "https://dev.to/javatarz/patterns-for-ai-assisted-software-development-4ga2" captured_at: "2026-01-02T09:51:07Z" capture_type: web_page capture_tool: fetch_webpage raw_format: markdown status: captured author: "Karun Japhet" published_at: "2025-12-27" tags: ["ai", "programming", "productivity", "patterns"] ---
Edge cases
fetch_webpage fails:
- Check if authentication is required
- Try browser-based extraction if needed
- Inform user if the page cannot be accessed
Missing metadata:
- If no title is extracted, use the URL path or ask user for a title hint
- If no author/date visible, omit those fields
- Minimal front matter is acceptable
Paywalled or restricted content:
- Inform user about the restriction
- Capture what's visible if useful (preview text)
- Consider asking if they have alternative access
References
- Parent workflow:
docs/distillation/distillation-pipeline.md - Front matter schema: see "Raw file front matter" in distillation-pipeline.md
- Next step after capture: use
skill to process this raw filemake-distilled