Claude-skill-registry blog-scraper

Fetch and compress blog articles from tech-lab.sios.jp into the docs/data/ directory, with token usage statistics and OGP metadata.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/blog-scraper" ~/.claude/skills/majiayu000-claude-skill-registry-blog-scraper && rm -rf "$T"

manifest: skills/data/blog-scraper/SKILL.md

Blog Scraper Skill

Overview

This skill fetches blog articles from `tech-lab.sios.jp/archives/*`, compresses the HTML content by removing unnecessary attributes and whitespace, and saves the result to the `docs/data/` directory with metadata.

When to Use

- The user asks to fetch a specific blog article
- The user wants to update existing cached articles
- The user needs to scrape multiple articles for analysis or documentation

Usage

Single Article

URL=https://tech-lab.sios.jp/archives/[article-id] npm run scraper

Example:

URL=https://tech-lab.sios.jp/archives/48397 npm run scraper

Multiple Articles

For multiple articles, run the command sequentially for each URL.
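As a rough illustration of running the command once per URL, the sketch below loops over a list and sets the `URL` environment variable for each run. The names `scrapeAll`, `Runner`, and `npmRunner` are hypothetical and not part of the skill. The sketch also assumes the scraper signals failure with a non-zero exit code, which the source does not confirm.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical runner type: takes an article URL, returns a process exit code.
type Runner = (url: string) => number;

// Default runner: equivalent to `URL=<url> npm run scraper`.
const npmRunner: Runner = (url) => {
  const result = spawnSync("npm", ["run", "scraper"], {
    env: { ...process.env, URL: url },
    stdio: "inherit",
  });
  // A null status (for example, killed by a signal) is treated as a failure.
  return result.status ?? 1;
};

// Runs the scraper for each URL in order and returns the URLs that failed.
// The runner is injectable so the loop can be exercised without npm.
function scrapeAll(urls: string[], run: Runner = npmRunner): string[] {
  const failed: string[] = [];
  for (const url of urls) {
    if (run(url) !== 0) {
      failed.push(url);
    }
  }
  return failed;
}
```

Calling `scrapeAll` with two or three archive URLs would process them one at a time, which matches the sequential approach described above and avoids sending parallel requests to the blog.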

Output

The scraper will:

1. Fetch and parse the HTML from the specified URL.
2. Extract the article content using the CSS selector `section.entry-content`.
3. Compress the content by removing:
   - Scripts, styles, and noscript tags
   - Class, ID, and style attributes
   - Whitespace between tags
4. Preserve:
   - Image alt text as `[画像: alt]` (画像 means "image")
   - Image src URLs
   - Link href attributes
5. Add metadata as an HTML comment:
   - OGP title
   - Source URL
   - OGP image URL
   - Extraction timestamp
6. Save the result to `docs/data/tech-lab-sios-jp-archives-[id].html`.
7. Report compression statistics:
   - Token count reduction (estimated for Claude)
   - Compression ratio percentages
   - File size
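The removal steps and the reduction percentage can be approximated as follows. This is a deliberately simplified, regex-based sketch and not the skill's code: the actual scraper parses HTML with cheerio, and regexes can also match attribute-like text inside prose. The function names `compressHtml` and `reductionPercent` are hypothetical, and the image and metadata handling is omitted.

```typescript
// Simplified sketch of the removal steps; the real scraper uses cheerio.
function compressHtml(html: string): string {
  return html
    // Drop script, style, and noscript elements together with their contents.
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, "")
    // Drop class, id, and style attributes (quoted values only).
    .replace(/\s+(?:class|id|style)\s*=\s*(?:"[^"]*"|'[^']*')/gi, "")
    // Remove whitespace that sits between two tags.
    .replace(/>\s+</g, "><")
    .trim();
}

// Percentage reduction between two counts (tokens or bytes).
function reductionPercent(before: number, after: number): number {
  return before === 0 ? 0 : ((before - after) / before) * 100;
}
```

Attributes such as `href` and `src` are left untouched because only `class`, `id`, and `style` are matched.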

Cache Behavior

- If the target HTML file already exists in `docs/data/`, the scraper skips fetching and reports the existing file size.
- To re-fetch, delete the existing HTML file first.
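The cache check could look roughly like this. The sketch assumes the file name is derived by joining the host and path of the article URL and replacing runs of non-alphanumeric characters with `-`, which is consistent with the `tech-lab-sios-jp-archives-[id].html` pattern but is not confirmed by the source. `cachePathFor` and `cachedSize` are hypothetical names.

```typescript
import { existsSync, statSync } from "node:fs";

// Assumed mapping from article URL to cache path (not confirmed by the source).
function cachePathFor(articleUrl: string): string {
  const { hostname, pathname } = new URL(articleUrl);
  const slug = `${hostname}${pathname}`
    .replace(/[^a-zA-Z0-9]+/g, "-") // dots and slashes become hyphens
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
  return `docs/data/${slug}.html`;
}

// Returns the cached file size in bytes, or null when nothing is cached.
function cachedSize(articleUrl: string): number | null {
  const path = cachePathFor(articleUrl);
  return existsSync(path) ? statSync(path).size : null;
}
```

A caller would skip the fetch whenever `cachedSize` returns a number and report that size instead.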

Token Estimation

The scraper estimates Claude token usage for Japanese content:

- Hiragana/Katakana: ~1.5 chars/token
- Kanji: ~1 char/token
- ASCII: ~4 chars/token
- Other: ~2 chars/token

Typical compression achieves 60-85% token reduction.
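A minimal sketch of an estimator built on the ratios above follows. The Unicode ranges used to classify characters, the rounding with `Math.ceil`, and the name `estimateTokens` are assumptions; the source only gives the characters-per-token figures.

```typescript
// Rough token estimate from per-class character counts.
function estimateTokens(text: string): number {
  let kana = 0;
  let kanji = 0;
  let ascii = 0;
  let other = 0;

  for (const ch of text) {
    if (/[\u3040-\u30ff]/.test(ch)) {
      kana++; // Hiragana and Katakana blocks
    } else if (/[\u4e00-\u9fff]/.test(ch)) {
      kanji++; // CJK Unified Ideographs (common kanji)
    } else if (ch.charCodeAt(0) < 128) {
      ascii++;
    } else {
      other++;
    }
  }

  // ~1.5 chars/token for kana, ~1 for kanji, ~4 for ASCII, ~2 for the rest.
  return Math.ceil(kana / 1.5 + kanji + ascii / 4 + other / 2);
}
```

Counting characters per class first and dividing once keeps the arithmetic simple and avoids accumulating floating-point error character by character.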

Implementation Details

See `application/tools/scraper.ts` for the TypeScript implementation, which uses:

- `node-fetch` for HTTP requests
- `cheerio` for HTML parsing
- OGP metadata extraction
- Custom token estimation for Japanese text

Permissions Required

This skill requires the following permissions in `.claude/settings.local.json`:

{
  "permissions": {
    "allow": [
      "Bash(npm run scraper:*)",
      "Bash(URL=:*)"
    ]
  }
}

Note: The `Bash(URL=:*)` permission uses prefix matching, so it allows any command that begins with `URL=`, whatever follows. This is a broad permission; consider restricting it to specific domains if security is a concern.
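Independently of the permission rule, a wrapper script could validate the URL before invoking the scraper. The sketch below is one possible check, not part of the skill: `isAllowedArticleUrl` is a hypothetical name, and it assumes article IDs are numeric, as in the `48397` example. It complements a narrower permission rule rather than replacing it.

```typescript
// Accepts only https URLs of the form https://tech-lab.sios.jp/archives/<digits>.
function isAllowedArticleUrl(raw: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }
  return (
    parsed.protocol === "https:" &&
    parsed.hostname === "tech-lab.sios.jp" &&
    /^\/archives\/\d+\/?$/.test(parsed.pathname)
  );
}
```

Comparing the parsed `hostname` exactly, instead of checking a string prefix, rejects look-alike hosts such as `tech-lab.sios.jp.example.com`.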