Claude-skill-registry blog-scraper

Fetch and compress blog articles from tech-lab.sios.jp into the docs/data/ directory with token usage statistics and OGP metadata

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/blog-scraper" ~/.claude/skills/majiayu000-claude-skill-registry-blog-scraper && rm -rf "$T"
manifest: skills/data/blog-scraper/SKILL.md
source content

Blog Scraper Skill

Overview

This skill fetches blog articles from tech-lab.sios.jp/archives/*, compresses the HTML content by removing unnecessary attributes and whitespace, and saves the result to the docs/data/ directory with metadata.

When to Use

  • User requests to fetch a specific blog article
  • User wants to update existing cached articles
  • User needs to scrape multiple articles for analysis or documentation

Usage

Single Article

URL=https://tech-lab.sios.jp/archives/[article-id] npm run scraper

Example:

URL=https://tech-lab.sios.jp/archives/48397 npm run scraper

Multiple Articles

For multiple articles, run the command sequentially for each URL.
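
The sequential run can also be scripted. The following is a minimal sketch, assuming Node.js is available and that it is run from the repository root; archiveUrl, runScraper, and scrapeAll are hypothetical helper names, not part of the skill.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper: build an article URL from an archive ID.
function archiveUrl(id: number): string {
  return `https://tech-lab.sios.jp/archives/${id}`;
}

// Run `npm run scraper` once with URL set in the environment and
// return its exit code. On Windows, npm may need `shell: true`.
function runScraper(url: string): number {
  const result = spawnSync("npm", ["run", "scraper"], {
    env: { ...process.env, URL: url },
    stdio: "inherit",
  });
  return result.status ?? 1;
}

// Process the URLs one at a time, in order, and collect the ones that failed.
// The runner is injectable so the loop can be exercised without npm.
function scrapeAll(
  urls: string[],
  run: (url: string) => number = runScraper
): string[] {
  const failed: string[] = [];
  for (const url of urls) {
    if (run(url) !== 0) {
      failed.push(url);
    }
  }
  return failed;
}
```

Calling scrapeAll([archiveUrl(48397)]) would invoke the scraper once for the example article and return an empty array if it exits with code 0.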

Output

The scraper will:

  1. Fetch and parse the HTML from the specified URL
  2. Extract content using the CSS selector section.entry-content
  3. Compress by removing:
    • Scripts, styles, and noscript tags
    • Class, ID, and style attributes
    • Whitespace between tags
  4. Preserve:
    • Image alt text as [画像: alt] (画像 is Japanese for "image")
    • Image src URLs
    • Link href attributes
  5. Add metadata as an HTML comment:
    • OGP title
    • Source URL
    • OGP image URL
    • Extraction timestamp
  6. Save to docs/data/tech-lab-sios-jp-archives-[id].html
  7. Report compression statistics:
    • Token count reduction (estimated for Claude)
    • Compression ratio percentages
    • File size
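
Steps 5 to 7 can be sketched without any dependencies. This is an illustration only: ArticleMeta, metadataComment, outputPath, and reductionPercent are hypothetical names, the comment layout is an assumption, and the file-name rule is inferred from the example path above rather than taken from scraper.ts.

```typescript
// Hypothetical shape for the metadata listed in step 5.
interface ArticleMeta {
  title: string;     // OGP title
  url: string;       // source URL
  image: string;     // OGP image URL
  extractedAt: Date; // extraction timestamp
}

// Step 5: render the metadata as an HTML comment (field labels are assumptions).
function metadataComment(meta: ArticleMeta): string {
  // A value containing "-->" would end the comment early,
  // so put a space between consecutive hyphens.
  const safe = (s: string): string => s.replace(/-(?=-)/g, "- ");
  return [
    "<!--",
    `title: ${safe(meta.title)}`,
    `url: ${safe(meta.url)}`,
    `image: ${safe(meta.image)}`,
    `extracted: ${meta.extractedAt.toISOString()}`,
    "-->",
  ].join("\n");
}

// Step 6: derive the output path from the article URL by replacing
// every run of non-alphanumeric characters with a single hyphen.
function outputPath(articleUrl: string): string {
  const { hostname, pathname } = new URL(articleUrl);
  const slug = `${hostname}${pathname}`
    .replace(/[^a-zA-Z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return `docs/data/${slug}.html`;
}

// Step 7: percentage reduction, rounded to one decimal place.
function reductionPercent(before: number, after: number): number {
  if (before <= 0) {
    return 0;
  }
  return Math.round(((before - after) / before) * 1000) / 10;
}
```

With these definitions, outputPath("https://tech-lab.sios.jp/archives/48397") yields docs/data/tech-lab-sios-jp-archives-48397.html, matching the naming pattern in step 6.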

Cache Behavior

  • If the target HTML file already exists in docs/data/, the scraper skips fetching and reports the existing file size
  • To re-fetch, delete the existing HTML file first
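
The cache check described above amounts to a file-existence test before fetching. A minimal sketch, assuming Node.js; cachedSize and scrapeUnlessCached are hypothetical names, and the log message is illustrative rather than the scraper's actual output.

```typescript
import * as fs from "node:fs";

// Return the cached file's size in bytes, or null if it has not been saved yet.
function cachedSize(filePath: string): number | null {
  return fs.existsSync(filePath) ? fs.statSync(filePath).size : null;
}

// Skip the fetch when the file already exists; otherwise call fetchAndSave.
// Returns true if a fetch happened, false if the cached file was used.
function scrapeUnlessCached(
  filePath: string,
  fetchAndSave: (path: string) => void
): boolean {
  const size = cachedSize(filePath);
  if (size !== null) {
    console.log(`Skipping fetch: ${filePath} already exists (${size} bytes)`);
    return false;
  }
  fetchAndSave(filePath);
  return true;
}
```

Deleting the file makes cachedSize return null again, so the next call fetches a fresh copy.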

Token Estimation

The scraper estimates Claude token usage for Japanese content:

  • Hiragana/Katakana: ~1.5 chars/token
  • Kanji: ~1 char/token
  • ASCII: ~4 chars/token
  • Other: ~2 chars/token

Typical compression achieves 60-85% token reduction.
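
The ratios above can be turned into a small estimator. This is a sketch, not the code in scraper.ts: the Unicode ranges for each class and the choice to round up are assumptions, and estimateTokens is a hypothetical name.

```typescript
// Characters per token for each character class, as listed above.
const CHARS_PER_TOKEN = { kana: 1.5, kanji: 1, ascii: 4, other: 2 };

type CharClass = keyof typeof CHARS_PER_TOKEN;

// Classify one UTF-16 code unit. The ranges are assumptions: the text
// names the classes but does not give their exact boundaries.
function classify(code: number): CharClass {
  if (code <= 0x7f) return "ascii";
  if (code >= 0x3040 && code <= 0x30ff) return "kana";  // Hiragana and Katakana blocks
  if (code >= 0x4e00 && code <= 0x9fff) return "kanji"; // CJK Unified Ideographs
  return "other";
}

// Count code units per class, divide by the class ratio, and round up.
function estimateTokens(text: string): number {
  const counts: Record<CharClass, number> = { kana: 0, kanji: 0, ascii: 0, other: 0 };
  for (let i = 0; i < text.length; i++) {
    counts[classify(text.charCodeAt(i))]++;
  }
  let tokens = 0;
  for (const key of Object.keys(counts) as CharClass[]) {
    tokens += counts[key] / CHARS_PER_TOKEN[key];
  }
  return Math.ceil(tokens);
}
```

For example, the string "abcdあいう漢字" has 4 ASCII characters (1 token), 3 kana (2 tokens), and 2 kanji (2 tokens), so the estimate is 5.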

Implementation Details

See application/tools/scraper.ts for the TypeScript implementation using:

  • node-fetch for HTTP requests
  • cheerio for HTML parsing
  • OGP metadata extraction
  • Custom token estimation for Japanese text

Permissions Required

This skill requires the following permissions in .claude/settings.local.json:

{
  "permissions": {
    "allow": [
      "Bash(npm run scraper:*)",
      "Bash(URL=:*)"
    ]
  }
}

Note: The Bash(URL=:*) permission uses prefix matching, so it allows any command that starts with a URL= environment variable assignment. This is a broad permission; consider restricting it to specific domains if security requires it.
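
Independently of the permission rule, the script itself could refuse URLs outside the expected site. The sketch below shows one way to do that; it does not change how the permission is matched, isAllowedArticleUrl is a hypothetical name, and the numeric-ID pattern is an assumption based on the example URL.

```typescript
// Accept only https URLs of the form https://tech-lab.sios.jp/archives/<digits>.
function isAllowedArticleUrl(raw: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }
  return (
    parsed.protocol === "https:" &&
    parsed.hostname === "tech-lab.sios.jp" &&
    /^\/archives\/\d+\/?$/.test(parsed.pathname)
  );
}
```

Comparing the parsed hostname exactly, rather than checking a string prefix, rejects look-alike hosts such as tech-lab.sios.jp.example.com.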