Claude-skills markdown-fetch
Fetch and extract web content as clean Markdown when provided with URLs. Use this skill whenever a user provides a URL (http/https link) that needs to be read, analyzed, summarized, or extracted. Converts web pages to Markdown with 80% fewer tokens than raw HTML. Handles all content types including JS-heavy sites, documentation, articles, and blog posts. Supports three conversion methods (auto, AI, browser rendering). Always use this instead of web_fetch when working with URLs - it's more efficient and provides cleaner output.
git clone https://github.com/ckorhonen/claude-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/ckorhonen/claude-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/markdown-fetch" ~/.claude/skills/ckorhonen-claude-skills-markdown-fetch && rm -rf "$T"
skills/markdown-fetch/SKILL.mdMarkdown Fetch
Efficiently fetch web content as clean Markdown using the markdown.new service.
Why Use This
- 80% fewer tokens than raw HTML
- 5x more content fits in context window
- No external dependencies or parsing libraries needed
- Three-tier conversion (Markdown-first, AI fallback, browser rendering)
Triggering
This skill should trigger automatically when:
- User provides a URL (e.g., "Read https://example.com")
- User asks to extract/fetch/analyze web content
- User requests summarization of a webpage
- User needs to process article/blog/documentation URLs
Quick Start
# Fetch any URL scripts/fetch.sh "https://example.com" # Use browser rendering for JS-heavy sites scripts/fetch.sh "https://example.com" --method browser # Retain images in output scripts/fetch.sh "https://example.com" --retain-images
Typical Usage Patterns
When a user says:
- "Read this article: https://..." → Use this skill to fetch the content
- "Summarize https://..." → Fetch with this skill first, then summarize
- "What does this page say: https://..." → Fetch the content
- "Extract the text from https://..." → Use this skill
Conversion Methods
auto (default) - Try Markdown-first, fall back to AI or browser as needed
ai - Use Cloudflare Workers AI for conversion
browser - Full browser rendering for JS-heavy content
Options
--method <auto|ai|browser> - Conversion method--retain-images - Keep image references in output--output <file> - Save to file instead of stdout
Output
Returns clean Markdown with metadata:
--- title: Page Title url: https://example.com method: auto duration_ms: 725 fetched_at: 2026-03-07T12:00:00Z --- # Content here...
When to Use
- Extracting articles, documentation, or blog posts
- Building RAG pipelines with web content
- Summarizing web pages
- Fetching content for analysis
- Converting sites to Markdown format
Implementation Notes
The service handles:
- Content negotiation (Accept: text/markdown)
- Cloudflare Workers AI conversion
- Browser rendering for dynamic content
- Automatic fallback between methods
Common Pitfalls
HTML Parsing Failures
Symptom: Empty output, truncated content, or "Error: No content in response"
Root Causes:
- CSS selectors broken by page layout changes
- Dynamic content not rendered by default (use
)--method browser - Page structure uses unusual HTML patterns (iframes, shadow DOM, Web Components)
Debugging & Workarounds:
# 1. Check if the page loads at all curl -s "https://example.com" | head -c 500 # 2. Try browser rendering (handles JS-heavy sites) scripts/fetch.sh "https://example.com" --method browser # 3. Check what the service is getting scripts/fetch.sh "https://example.com" 2>&1 | head -20 # 4. If still empty, the page may use iframes or require auth
Best Practice: Always use
--method browser for single-page apps, social platforms, or content-heavy dashboards.
Rate Limiting & IP Blocks
Symptom: HTTP 429 (Too Many Requests), 403 (Forbidden), or connection timeouts
Root Causes:
- Multiple rapid requests to same domain
- Service IP address blacklisted by the target site
- Bot detection triggering (User-Agent headers, request patterns)
- Cloudflare/WAF blocks (common on high-traffic sites)
Debugging & Workarounds:
# 1. Check HTTP status code curl -s -w "\n%{http_code}\n" -o /dev/null "https://example.com" # 2. Test with curl directly (bypasses service) curl -H "User-Agent: Mozilla/5.0" "https://example.com" | head -c 500 # 3. Add delays between requests (if fetching multiple URLs) for url in url1 url2 url3; do scripts/fetch.sh "$url" sleep 5 # Wait 5 seconds between requests done # 4. If blocked, try from different IP or use residential proxy # For markdown.new service: no built-in proxy rotation; consider mirror services
Best Practice: Respect site robots.txt and use sensible request delays when processing multiple pages. Some sites require retry-after headers.
Encoding Issues
Symptom: Garbled text, mojibake (weird characters), missing non-ASCII content (emojis, accents, CJK)
Root Causes:
- Page declares wrong charset in headers vs actual content
- Service not respecting Content-Type charset declaration
- UTF-8 vs Latin-1 mismatch
- Emoji or special symbols stripped during conversion
Debugging & Workarounds:
# 1. Check the page's declared charset curl -sI "https://example.com" | grep -i "charset" # 2. Save raw HTML and inspect encoding curl -s "https://example.com" > raw.html file raw.html head -c 200 raw.html | od -c # Show raw bytes # 3. Try the markdown.new service and check output scripts/fetch.sh "https://example.com" | file - # Check output encoding # 4. If output is garbled, convert explicitly scripts/fetch.sh "https://example.com" | iconv -f UTF-8 -t UTF-8 > fixed.md
Best Practice: Always verify output is valid UTF-8. If encountering garbled content, the page likely has a charset mismatch — notify the site owner or use
--method browser (more robust).
JavaScript-Rendered Content
Symptom: Page returns but content is missing, loads as empty, or shows loading spinners instead of data
Root Causes:
- Page uses client-side JavaScript to render content (React, Vue, Angular, etc.)
- Default
method only fetches HTML skeleton without executing JSauto - Content loaded after page render (lazy loading, infinite scroll)
- JavaScript requires authentication or API keys
Debugging & Workarounds:
# 1. Fetch with browser rendering (executes JS) scripts/fetch.sh "https://example.com" --method browser # 2. Check what the auto method returns scripts/fetch.sh "https://example.com" --method auto # 3. Inspect page source to confirm JS-rendering curl -s "https://example.com" | grep -i "react\|vue\|angular\|<div id=\"app\"" # 4. If content still missing, page may require interaction # (e.g., clicking buttons, scrolling) — no workaround in current service
Best Practice: Use
--method browser by default for modern web apps, news sites, and any site built in the last 5 years. The auto method is faster but misses JS-rendered content.
Authentication & Paywall Content
Symptom: Returns login page instead of content, shows "Subscribe to read more", or HTTP 401/403
Root Causes:
- Page requires login/authentication
- Content behind paywall (subscription, membership)
- IP-based access restrictions (geo-blocking)
- Session-based authentication (cookies) not provided
Debugging & Workarounds:
# 1. Check if curl can access without auth curl -s "https://example.com/article" | head -c 300 # 2. Identify authentication method curl -sI "https://example.com" | grep -i "auth\|set-cookie\|www-authenticate" # 3. Try public/preview version if available # For paywalled sites, look for preview URLs or RSS feeds curl -s "https://example.com/rss" | head -c 500 # 4. Check for paywall indicators curl -s "https://example.com/article" | grep -i "paywall\|subscribe\|login required" # 5. If member-only, request your credentials in the conversation # (Don't embed in scripts — use prompt)
Best Practice: This skill cannot bypass authentication or paywalls by design. For protected content:
- Use public preview links if available
- Ask the user for authenticated access
- Try RSS feeds (often unpaywalled summaries)
- Check archive services (archive.org, 12ft.io for paywalled content) separately
Service-Side Failures
Symptom: HTTP error codes (5xx), timeout, or malformed JSON responses
Root Causes:
- markdown.new service is down or overloaded
- Request body malformed (invalid URL, missing required fields)
- Service response is corrupted or incomplete
- Cloudflare Workers timeout (pages >10MB or slow servers)
Debugging & Workarounds:
# 1. Check service health curl -s "https://markdown.new/health" 2>&1 | head # 2. Verify request format jq -n --arg url "https://example.com" --arg method "auto" \ '{url: $url, method: $method, retain_images: false}' | jq . # 3. Retry with exponential backoff for attempt in {1..3}; do result=$(scripts/fetch.sh "https://example.com" 2>&1) if [[ $? -eq 0 ]]; then echo "$result" break fi sleep $((2 ** attempt)) # 2s, 4s, 8s done # 4. If service is down, no workaround available # — try again later or use alternative (curl, browser, etc.)
Best Practice: The service is stateless and generally reliable. If failures persist, fall back to
curl directly or ask the user for an alternative source.
Troubleshooting Decision Tree
No content returned? ├─ Check if site requires JavaScript │ └─ YES → Use --method browser │ └─ NO → Continue ├─ Check if site requires authentication │ └─ YES → Use authenticated request (see Paywall section) │ └─ NO → Continue ├─ Check for encoding issues │ └─ YES → Garbled text → likely site charset mismatch │ └─ NO → Continue └─ Service may be down → Wait and retry Getting rate-limited (HTTP 429)? ├─ Add delays between requests (5-10 seconds) └─ If persistent, try different time or request fewer pages Truncated or broken output? └─ Try --method browser (more robust but slower) Can't access paywalled content? └─ No workaround — try public preview, RSS, or archive services