Claude-skills markdown-fetch

Fetch and extract web content as clean Markdown when provided with URLs. Use this skill whenever a user provides a URL (http/https link) that needs to be read, analyzed, summarized, or extracted. Converts web pages to Markdown with 80% fewer tokens than raw HTML. Handles all content types including JS-heavy sites, documentation, articles, and blog posts. Supports three conversion methods (auto, AI, browser rendering). Always use this instead of web_fetch when working with URLs - it's more efficient and provides cleaner output.

install

source · Clone the upstream repo

git clone https://github.com/ckorhonen/claude-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/ckorhonen/claude-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/markdown-fetch" ~/.claude/skills/ckorhonen-claude-skills-markdown-fetch && rm -rf "$T"

manifest: skills/markdown-fetch/SKILL.md

source content

Markdown Fetch

Efficiently fetch web content as clean Markdown using the markdown.new service.

Why Use This

80% fewer tokens than raw HTML
5x more content fits in context window
No external dependencies or parsing libraries needed
Three-tier conversion (Markdown-first, AI fallback, browser rendering)

Triggering

This skill should trigger automatically when:

User provides a URL (e.g., "Read https://example.com")
User asks to extract/fetch/analyze web content
User requests summarization of a webpage
User needs to process article/blog/documentation URLs

Quick Start

# Fetch any URL
scripts/fetch.sh "https://example.com"

# Use browser rendering for JS-heavy sites
scripts/fetch.sh "https://example.com" --method browser

# Retain images in output
scripts/fetch.sh "https://example.com" --retain-images

Typical Usage Patterns

When a user says:

"Read this article: https://..." → Use this skill to fetch the content
"Summarize https://..." → Fetch with this skill first, then summarize
"What does this page say: https://..." → Fetch the content
"Extract the text from https://..." → Use this skill

Conversion Methods

auto (default) - Try Markdown-first, fall back to AI or browser as needed
ai - Use Cloudflare Workers AI for conversion
browser - Full browser rendering for JS-heavy content

Options

--method <auto|ai|browser>

- Conversion method

--retain-images

- Keep image references in output

--output <file>

- Save to file instead of stdout

Output

Returns clean Markdown with metadata:

---
title: Page Title
url: https://example.com
method: auto
duration_ms: 725
fetched_at: 2026-03-07T12:00:00Z
---

# Content here...

When to Use

Extracting articles, documentation, or blog posts
Building RAG pipelines with web content
Summarizing web pages
Fetching content for analysis
Converting sites to Markdown format

Implementation Notes

The service handles:

Content negotiation (Accept: text/markdown)
Cloudflare Workers AI conversion
Browser rendering for dynamic content
Automatic fallback between methods

Common Pitfalls

HTML Parsing Failures

Symptom: Empty output, truncated content, or "Error: No content in response"

Root Causes:

CSS selectors broken by page layout changes
Dynamic content not rendered by default (use
```
--method browser
```
)
Page structure uses unusual HTML patterns (iframes, shadow DOM, Web Components)

Debugging & Workarounds:

# 1. Check if the page loads at all
curl -s "https://example.com" | head -c 500

# 2. Try browser rendering (handles JS-heavy sites)
scripts/fetch.sh "https://example.com" --method browser

# 3. Check what the service is getting
scripts/fetch.sh "https://example.com" 2>&1 | head -20

# 4. If still empty, the page may use iframes or require auth

Best Practice: Always use

--method browser

for single-page apps, social platforms, or content-heavy dashboards.

Rate Limiting & IP Blocks

Symptom: HTTP 429 (Too Many Requests), 403 (Forbidden), or connection timeouts

Root Causes:

Multiple rapid requests to same domain
Service IP address blacklisted by the target site
Bot detection triggering (User-Agent headers, request patterns)
Cloudflare/WAF blocks (common on high-traffic sites)

Debugging & Workarounds:

# 1. Check HTTP status code
curl -s -w "\n%{http_code}\n" -o /dev/null "https://example.com"

# 2. Test with curl directly (bypasses service)
curl -H "User-Agent: Mozilla/5.0" "https://example.com" | head -c 500

# 3. Add delays between requests (if fetching multiple URLs)
for url in url1 url2 url3; do
  scripts/fetch.sh "$url"
  sleep 5  # Wait 5 seconds between requests
done

# 4. If blocked, try from different IP or use residential proxy
# For markdown.new service: no built-in proxy rotation; consider mirror services

Best Practice: Respect site robots.txt and use sensible request delays when processing multiple pages. Some sites require retry-after headers.

Encoding Issues

Symptom: Garbled text, mojibake (weird characters), missing non-ASCII content (emojis, accents, CJK)

Root Causes:

Page declares wrong charset in headers vs actual content
Service not respecting Content-Type charset declaration
UTF-8 vs Latin-1 mismatch
Emoji or special symbols stripped during conversion

Debugging & Workarounds:

# 1. Check the page's declared charset
curl -sI "https://example.com" | grep -i "charset"

# 2. Save raw HTML and inspect encoding
curl -s "https://example.com" > raw.html
file raw.html
head -c 200 raw.html | od -c  # Show raw bytes

# 3. Try the markdown.new service and check output
scripts/fetch.sh "https://example.com" | file -  # Check output encoding

# 4. If output is garbled, convert explicitly
scripts/fetch.sh "https://example.com" | iconv -f UTF-8 -t UTF-8 > fixed.md

Best Practice: Always verify output is valid UTF-8. If encountering garbled content, the page likely has a charset mismatch — notify the site owner or use

--method browser

(more robust).

JavaScript-Rendered Content

Symptom: Page returns but content is missing, loads as empty, or shows loading spinners instead of data

Root Causes:

Page uses client-side JavaScript to render content (React, Vue, Angular, etc.)
Default
```
auto
```
method only fetches HTML skeleton without executing JS
Content loaded after page render (lazy loading, infinite scroll)
JavaScript requires authentication or API keys

Debugging & Workarounds:

# 1. Fetch with browser rendering (executes JS)
scripts/fetch.sh "https://example.com" --method browser

# 2. Check what the auto method returns
scripts/fetch.sh "https://example.com" --method auto

# 3. Inspect page source to confirm JS-rendering
curl -s "https://example.com" | grep -i "react\|vue\|angular\|<div id=\"app\""

# 4. If content still missing, page may require interaction
# (e.g., clicking buttons, scrolling) — no workaround in current service

Best Practice: Use

--method browser

by default for modern web apps, news sites, and any site built in the last 5 years. The

auto

method is faster but misses JS-rendered content.

Authentication & Paywall Content

Symptom: Returns login page instead of content, shows "Subscribe to read more", or HTTP 401/403

Root Causes:

Page requires login/authentication
Content behind paywall (subscription, membership)
IP-based access restrictions (geo-blocking)
Session-based authentication (cookies) not provided

Debugging & Workarounds:

# 1. Check if curl can access without auth
curl -s "https://example.com/article" | head -c 300

# 2. Identify authentication method
curl -sI "https://example.com" | grep -i "auth\|set-cookie\|www-authenticate"

# 3. Try public/preview version if available
# For paywalled sites, look for preview URLs or RSS feeds
curl -s "https://example.com/rss" | head -c 500

# 4. Check for paywall indicators
curl -s "https://example.com/article" | grep -i "paywall\|subscribe\|login required"

# 5. If member-only, request your credentials in the conversation
# (Don't embed in scripts — use prompt)

Best Practice: This skill cannot bypass authentication or paywalls by design. For protected content:

Use public preview links if available
Ask the user for authenticated access
Try RSS feeds (often unpaywalled summaries)
Check archive services (archive.org, 12ft.io for paywalled content) separately

Service-Side Failures

Symptom: HTTP error codes (5xx), timeout, or malformed JSON responses

Root Causes:

markdown.new service is down or overloaded
Request body malformed (invalid URL, missing required fields)
Service response is corrupted or incomplete
Cloudflare Workers timeout (pages >10MB or slow servers)

Debugging & Workarounds:

# 1. Check service health
curl -s "https://markdown.new/health" 2>&1 | head

# 2. Verify request format
jq -n --arg url "https://example.com" --arg method "auto" \
  '{url: $url, method: $method, retain_images: false}' | jq .

# 3. Retry with exponential backoff
for attempt in {1..3}; do
  result=$(scripts/fetch.sh "https://example.com" 2>&1)
  if [[ $? -eq 0 ]]; then
    echo "$result"
    break
  fi
  sleep $((2 ** attempt))  # 2s, 4s, 8s
done

# 4. If service is down, no workaround available
# — try again later or use alternative (curl, browser, etc.)

Best Practice: The service is stateless and generally reliable. If failures persist, fall back to

curl

directly or ask the user for an alternative source.

Troubleshooting Decision Tree

No content returned?
  ├─ Check if site requires JavaScript
  │   └─ YES → Use --method browser
  │   └─ NO → Continue
  ├─ Check if site requires authentication
  │   └─ YES → Use authenticated request (see Paywall section)
  │   └─ NO → Continue
  ├─ Check for encoding issues
  │   └─ YES → Garbled text → likely site charset mismatch
  │   └─ NO → Continue
  └─ Service may be down → Wait and retry

Getting rate-limited (HTTP 429)?
  ├─ Add delays between requests (5-10 seconds)
  └─ If persistent, try different time or request fewer pages

Truncated or broken output?
  └─ Try --method browser (more robust but slower)

Can't access paywalled content?
  └─ No workaround — try public preview, RSS, or archive services