Claude-skill-registry get-web-page-raw

Capture a web page as raw material (extracted text/Markdown) with metadata, storing it in the raw/ directory for later distillation.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/get-web-page-raw" ~/.claude/skills/majiayu000-claude-skill-registry-get-web-page-raw && rm -rf "$T"

manifest: skills/data/get-web-page-raw/SKILL.md

source content

Get Web Page (Raw)

When to use

Use this skill when:

The user provides a URL to capture as source material
You need to store a web page for later distillation
You're building up raw materials for the distillation pipeline

Keywords: capture, web page, URL, fetch, raw

Inputs

Required:

```
url
```
(string): The URL of the web page to capture

Optional:

```
title_hint
```
(string): A human-provided hint for the title if extraction fails

Outputs

This skill produces:

A new file in
```
raw/
```
with the naming pattern:
```
YYYYMMDD-HHMMSSZ--<slug>.md
```
Metadata returned to the agent:
- ```
raw_path
```
  : full path to the created file
- ```
title
```
  : extracted or inferred title
- ```
canonical_url
```
  : canonical URL if found, otherwise the input URL
- ```
captured_at
```
  : ISO timestamp (UTC)

Procedure

1. Generate timestamp and slug

Get current UTC timestamp:
```
date -u +"%Y%m%d-%H%M%SZ"
```
Create a slug from the page title or URL path:
- Use the main topic/title in lowercase
- Replace spaces and special chars with hyphens
- Keep it concise (typically 3-8 words)
- Example:
```
patterns-for-ai-assisted-software-development
```

2. Fetch the web page content

Primary method: fetch_webpage tool

Call
```
fetch_webpage
```
with the URL and a descriptive query about the content
This extracts main content, removes navigation/ads, and converts to clean Markdown
Captures metadata like title and author when available

Alternative methods (when needed):

Playwright MCP: For complex interactive pages, JavaScript-heavy sites, or when you need precise browser control
curl + HTML parsing: For simple pages when fetch_webpage isn't available, though requires manual Markdown conversion
Browser tools: Built-in browser navigation and snapshot capabilities

Note: Most public articles and blog posts work well with fetch_webpage

3. Build front matter

Create YAML front matter with required and available fields:

---
title: "<extracted from page>"
source_url: "<original URL provided>"
captured_at: "<ISO timestamp from step 1>"
capture_type: web_page
capture_tool: fetch_webpage  # or browser if used
raw_format: markdown
status: captured
author: "<if clearly visible>"  # optional
published_at: "<YYYY-MM-DD if available>"  # optional
tags: ["keyword1", "keyword2"]  # optional, 3-5 relevant tags
---

Tips:

Use the page's actual title, not a description
Convert published_at to simple date format (YYYY-MM-DD)
Tags should be broad topics, not exhaustive keywords

4. Write the file

Filename:

<timestamp>--<slug>.md

(e.g.,

20260102-095107Z--patterns-for-ai-assisted-software-development.md

)

Full path:
```
raw/<filename>
```
Content: front matter + blank line + extracted content
Use
```
create_file
```
tool with the full content

5. Confirm to user

Provide a brief confirmation:

Link to the created file using relative path
One-sentence summary of what the article covers (2-3 key topics)

Examples

Example 1: Dev.to article capture

Input:

url: "https://dev.to/javatarz/patterns-for-ai-assisted-software-development-4ga2"

Actions:

Generate timestamp:
```
20260102-095107Z
```
Fetch with
```
fetch_webpage
```
tool
Extract title: "Patterns for AI assisted software development"

Create slug:

patterns-for-ai-assisted-software-development

Write to:

raw/20260102-095107Z--patterns-for-ai-assisted-software-development.md

Output:

---
title: "Patterns for AI assisted software development"
source_url: "https://dev.to/javatarz/patterns-for-ai-assisted-software-development-4ga2"
captured_at: "2026-01-02T09:51:07Z"
capture_type: web_page
capture_tool: fetch_webpage
raw_format: markdown
status: captured
author: "Karun Japhet"
published_at: "2025-12-27"
tags: ["ai", "programming", "productivity", "patterns"]
---

Edge cases

fetch_webpage fails:

Check if authentication is required
Try browser-based extraction if needed
Inform user if the page cannot be accessed

Missing metadata:

If no title is extracted, use the URL path or ask user for a title hint
If no author/date visible, omit those fields
Minimal front matter is acceptable

Paywalled or restricted content:

Inform user about the restriction
Capture what's visible if useful (preview text)
Consider asking if they have alternative access

References

Parent workflow:

docs/distillation/distillation-pipeline.md

Front matter schema: see "Raw file front matter" in distillation-pipeline.md
Next step after capture: use
```
make-distilled
```
skill to process this raw file