Llm-wiki-skill baoyu-url-to-markdown
Fetch any URL and convert to markdown using Chrome CDP. Saves the rendered HTML snapshot alongside the markdown, uses an upgraded Defuddle pipeline with better web-component handling and YouTube transcript extraction, and automatically falls back to the pre-Defuddle HTML-to-Markdown pipeline when needed. If local browser capture fails entirely, it can fall back to the hosted defuddle.md API. Supports two modes - auto-capture on page load, or wait for user signal (for pages requiring login). Use when user wants to save a webpage as markdown.
git clone https://github.com/sdyckjq-lab/llm-wiki-skill
T=$(mktemp -d) && git clone --depth=1 https://github.com/sdyckjq-lab/llm-wiki-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/deps/baoyu-url-to-markdown" ~/.claude/skills/sdyckjq-lab-llm-wiki-skill-baoyu-url-to-markdown && rm -rf "$T"
deps/baoyu-url-to-markdown/SKILL.md
URL to Markdown
Fetches any URL via Chrome CDP, saves the rendered HTML snapshot, and converts it to clean markdown.
Script Directory
Important: All scripts are located in the scripts/ subdirectory of this skill.
Agent Execution Instructions:
- Determine this SKILL.md file's directory path as {baseDir}
- Script path = {baseDir}/scripts/<script-name>.ts
- Resolve the ${BUN_X} runtime: if bun is installed → bun; if npx is available → npx -y bun; else suggest installing bun
- Replace all {baseDir} and ${BUN_X} in this document with actual values
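The runtime resolution rule above can be sketched as a small shell function. This is a minimal sketch rather than part of the skill: the name resolve_bun_x is hypothetical, and it only checks what is reachable on PATH.

```shell
# Hypothetical helper (not shipped with the skill):
# prints the command to substitute for ${BUN_X}.
resolve_bun_x() {
  if command -v bun >/dev/null 2>&1; then
    echo "bun"                 # bun is installed: use it directly
  elif command -v npx >/dev/null 2>&1; then
    echo "npx -y bun"          # no bun, but npx can fetch and run it
  else
    echo "Neither bun nor npx was found; please install bun." >&2
    return 1
  fi
}
```

A caller might capture the result with BUN_X=$(resolve_bun_x) and then expand $BUN_X unquoted, so that "npx -y bun" splits into separate words.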
Script Reference:
| Script | Purpose |
|---|---|
| scripts/main.ts | CLI entry point for URL fetching |
| | Markdown conversion entry point and converter selection |
| | Defuddle-based conversion |
| | Pre-Defuddle legacy extraction and markdown conversion |
| | Shared metadata parsing and markdown document helpers |
Preferences (EXTEND.md)
Check EXTEND.md existence (priority order):
    # macOS, Linux, WSL, Git Bash
    test -f .baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo "project"
    test -f "${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "xdg"
    test -f "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "user"

    # PowerShell (Windows)
    if (Test-Path .baoyu-skills/baoyu-url-to-markdown/EXTEND.md) { "project" }
    $xdg = if ($env:XDG_CONFIG_HOME) { $env:XDG_CONFIG_HOME } else { "$HOME/.config" }
    if (Test-Path "$xdg/baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "xdg" }
    if (Test-Path "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "user" }
| Path | Location |
|---|---|
| .baoyu-skills/baoyu-url-to-markdown/EXTEND.md | Project directory |
| ${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md | XDG config directory |
| $HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md | User home |

| Result | Action |
|---|---|
| Found | Read, parse, apply settings |
| Not found | MUST run first-time setup (see below); do NOT silently create defaults |
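When the file path itself is needed rather than a label, the same priority order can be folded into one function. A minimal sketch: find_extend_md is a hypothetical name, and the three candidate locations simply mirror the checks listed above.

```shell
# Hypothetical helper: print the first EXTEND.md found, in priority order
# (project, then XDG config, then user home).
find_extend_md() {
  rel="baoyu-skills/baoyu-url-to-markdown/EXTEND.md"
  for candidate in \
    ".$rel" \
    "${XDG_CONFIG_HOME:-$HOME/.config}/$rel" \
    "$HOME/.$rel"
  do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1   # not found: first-time setup must run before any conversion
}
```

A non-zero exit status is the signal to run the blocking first-time setup instead of creating defaults.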
EXTEND.md Supports: Download media by default | Default output directory | Default capture mode | Timeout settings
First-Time Setup (BLOCKING)
CRITICAL: When EXTEND.md is not found, you MUST use AskUserQuestion to ask the user for their preferences before creating EXTEND.md. NEVER create EXTEND.md with defaults without asking. This is a BLOCKING operation: do NOT proceed with any conversion until setup is complete.
Use AskUserQuestion with ALL questions in ONE call:
Question 1 — header: "Media", question: "How to handle images and videos in pages?"
- "Ask each time (Recommended)" — After saving markdown, ask whether to download media
- "Always download" — Always download media to local imgs/ and videos/ directories
- "Never download" — Keep original remote URLs in markdown
Question 2 — header: "Output", question: "Default output directory?"
- "url-to-markdown (Recommended)" — Save to ./url-to-markdown/{domain}/{slug}.md
- (User may choose "Other" to type a custom path)
Question 3 — header: "Save", question: "Where to save preferences?"
- "User (Recommended)" — ~/.baoyu-skills/ (all projects)
- "Project" — .baoyu-skills/ (this project only)
After user answers, create EXTEND.md at the chosen location, confirm "Preferences saved to [path]", then continue.
Full reference: references/config/first-time-setup.md
Supported Keys
| Key | Default | Values | Description |
|---|---|---|---|
| download_media | | | One of three settings: prompt each time (the default), always download, or never download |
| | empty | path or empty | Default output directory (empty = ./url-to-markdown/) |
EXTEND.md → CLI mapping:
| EXTEND.md key | CLI argument | Notes |
|---|---|---|
| download_media (always download) | --download-media | |
| | --output-dir | Directory path. Do NOT pass to -o (which expects a file path) |
Value priority:
- CLI arguments (--download-media, -o, --output-dir)
- EXTEND.md
- Skill defaults
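The precedence above amounts to "first non-empty value wins". A minimal sketch, assuming an empty string means "not set" at each level; resolve_setting is a hypothetical helper, not something the skill ships.

```shell
# Hypothetical helper: resolve_setting CLI_VALUE EXTEND_VALUE DEFAULT_VALUE
# Prints the first non-empty value: CLI argument, then EXTEND.md, then skill default.
resolve_setting() {
  if [ -n "$1" ]; then
    printf '%s\n' "$1"
  elif [ -n "$2" ]; then
    printf '%s\n' "$2"
  else
    printf '%s\n' "$3"
  fi
}

# No CLI value given, EXTEND.md supplies ./posts/, skill default is ./url-to-markdown/
resolve_setting "" "./posts/" "./url-to-markdown/"   # prints ./posts/
```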
Features
- Chrome CDP for full JavaScript rendering
- Two capture modes: auto or wait-for-user
- Save rendered HTML as a sibling -captured.html file
- Clean markdown output with metadata
- Upgraded Defuddle-first markdown conversion with automatic fallback to the pre-Defuddle extractor from git history
- Materializes shadow DOM content before conversion so web-component pages survive serialization better
- YouTube pages can include transcript/caption text in the markdown when YouTube exposes a caption track
- If local browser capture fails completely, can fall back to defuddle.md/<url> and still save markdown
- Handles login-required pages via wait mode
- Download images and videos to local directories
Usage
    # Auto mode (default) - capture when page loads
    ${BUN_X} {baseDir}/scripts/main.ts <url>
    # Wait mode - wait for user signal before capture
    ${BUN_X} {baseDir}/scripts/main.ts <url> --wait
    # Save to specific file
    ${BUN_X} {baseDir}/scripts/main.ts <url> -o output.md
    # Save to a custom output directory (auto-generates filename)
    ${BUN_X} {baseDir}/scripts/main.ts <url> --output-dir ./posts/
    # Download images and videos to local directories
    ${BUN_X} {baseDir}/scripts/main.ts <url> --download-media
Options
| Option | Description |
|---|---|
| <url> | URL to fetch |
| -o | Output file path; must be a file path, not a directory (default: auto-generated) |
| --output-dir | Base output directory; auto-generates <domain>/<slug>.md under it (default: url-to-markdown/) |
| --wait | Wait for user signal before capturing |
| --timeout | Page load timeout (default: 30000) |
| --download-media | Download image/video assets to local imgs/ and videos/, and rewrite markdown links to local relative paths |
Capture Modes
| Mode | Behavior | Use When |
|---|---|---|
| Auto (default) | Capture on network idle | Public pages, static content |
| Wait (--wait) | User signals when ready | Login-required, lazy loading, paywalls |
Wait mode workflow:
- Run with --wait → script outputs "Press Enter when ready"
- Ask user to confirm page is ready
- Send newline to stdin to trigger capture
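One way to drive the workflow above from a shell is to keep the script's stdin attached to a named pipe and write the newline only after the user confirms. The sketch below is illustrative: fake_wait_capture is a hypothetical stand-in for running main.ts with --wait, and the FIFO approach is an assumption about how a caller might hold stdin open, not something the skill prescribes.

```shell
# Hypothetical stand-in for `${BUN_X} {baseDir}/scripts/main.ts <url> --wait`.
fake_wait_capture() {
  echo "Press Enter when ready"
  IFS= read -r _line        # blocks until a newline arrives on stdin
  echo "captured"
}

run_wait_mode_demo() {
  workdir=$(mktemp -d)
  fifo="$workdir/stdin.fifo"
  mkfifo "$fifo"
  # Start the capture in the background with stdin attached to the FIFO
  fake_wait_capture < "$fifo" > "$workdir/out.log" &
  exec 3> "$fifo"           # hold the write end open so the reader sees no EOF
  # ... ask the user to confirm the page is ready ...
  printf '\n' >&3           # send the newline that triggers capture
  exec 3>&-                 # close the write end
  wait                      # let the capture finish
  cat "$workdir/out.log"
  rm -rf "$workdir"
}
```

Calling run_wait_mode_demo prints the prompt line followed by "captured"; with the real script, the background command would be the --wait invocation instead of the stand-in.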
Output Format
Each run saves two files side by side:
- Markdown: YAML front matter with url, title, description, author, published, optional coverImage, and captured_at, followed by converted markdown content
- HTML snapshot: *-captured.html, containing the rendered page HTML captured from Chrome
When Defuddle or page metadata provides a language hint, the markdown front matter also includes language.
The HTML snapshot is saved before any markdown media localization, so it stays a faithful capture of the page DOM used for conversion. If the hosted defuddle.md API fallback is used, markdown is still saved, but there is no local -captured.html snapshot for that run.
Output Directory
Default: url-to-markdown/<domain>/<slug>.md
With --output-dir ./posts/: ./posts/<domain>/<slug>.md
HTML snapshot path uses the same basename:
- url-to-markdown/<domain>/<slug>-captured.html
- ./posts/<domain>/<slug>-captured.html

- <slug>: From page title or URL path (kebab-case, 2-6 words)
- Conflict resolution: Append timestamp <slug>-YYYYMMDD-HHMMSS.md
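The naming rules above can be approximated in a few lines of shell. This is a sketch under stated assumptions, not the script's actual logic: slugify, resolve_output_path, and snapshot_path are hypothetical names, the slug handles ASCII titles only, and the two-word minimum is not enforced.

```shell
# Hypothetical helper: lower-case, hyphen-separated slug, capped at six words.
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-*//; s/-*$//' \
    | cut -d- -f1-6
}

# Hypothetical helper: resolve_output_path BASE_DIR DOMAIN SLUG
resolve_output_path() {
  path="$1/$2/$3.md"
  if [ -e "$path" ]; then
    # Conflict: append a timestamp in the <slug>-YYYYMMDD-HHMMSS.md form
    path="$1/$2/$3-$(date +%Y%m%d-%H%M%S).md"
  fi
  printf '%s\n' "$path"
}

# Hypothetical helper: the HTML snapshot shares the markdown file's basename.
snapshot_path() {
  printf '%s\n' "${1%.md}-captured.html"
}
```

For example, slugify 'Hello, World! This Is A Very Long Title Indeed' yields hello-world-this-is-a-very, and snapshot_path url-to-markdown/example.com/my-post.md yields url-to-markdown/example.com/my-post-captured.html.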
When --download-media is enabled:
- Images are saved to imgs/ next to the markdown file
- Videos are saved to videos/ next to the markdown file
- Markdown media links are rewritten to local relative paths
Conversion Fallback
Conversion order:
- Try Defuddle first
- For rich pages such as YouTube, prefer Defuddle's extractor-specific output (including transcripts when available) instead of replacing it with the legacy pipeline
- If Defuddle throws, cannot load, returns obviously incomplete markdown, or captures lower-quality content than the legacy pipeline, automatically fall back to the pre-Defuddle extractor
- If the entire local browser capture flow fails before markdown can be produced, try the hosted https://defuddle.md/<url> API and save its markdown output directly
- The legacy fallback path uses the older Readability/selector/Next.js-data based HTML-to-Markdown implementation recovered from git history
CLI output will show:
- "Converter: defuddle" when Defuddle succeeds
- "Converter: legacy:..." plus "Fallback used: ..." when fallback was needed
- "Converter: defuddle-api" when local browser capture failed and the hosted API was used instead
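An agent that captures the CLI output can use those Converter lines to tell which path produced the markdown. A minimal sketch, assuming the output contains the strings exactly as listed above; classify_converter and its return labels are hypothetical.

```shell
# Hypothetical helper: classify_converter CLI_OUTPUT
# Prints which conversion path a run reported.
classify_converter() {
  case "$1" in
    # Check defuddle-api first: it also contains the shorter "Converter: defuddle"
    *"Converter: defuddle-api"*) echo "hosted-api" ;;
    *"Converter: legacy:"*)      echo "legacy-fallback" ;;
    *"Converter: defuddle"*)     echo "defuddle" ;;
    *)                           echo "unknown" ;;
  esac
}
```

A "legacy-fallback" or "hosted-api" result is a cue to check the output more closely, since both indicate the preferred local Defuddle path was not used.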
Media Download Workflow
Based on the download_media setting in EXTEND.md:
| Setting | Behavior |
|---|---|
| Always download | Run script with --download-media flag |
| Never download | Run script without --download-media flag |
| Ask each time (default) | Follow the ask-each-time flow below |
Ask-Each-Time Flow
- Run script without --download-media → markdown saved
- Check saved markdown for remote media URLs (https:// in image/video links)
- If no remote media found → done, no prompt needed
- If remote media found → use AskUserQuestion:
  - header: "Media", question: "Download N images/videos to local files?"
  - "Yes": Download to local directories
  - "No": Keep remote URLs
- If user confirms → run script again with --download-media (overwrites markdown with localized links)
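The "check saved markdown for remote media URLs" step can be approximated with grep. A minimal sketch, assuming media appears as standard markdown image links of the form ![alt](url); count_remote_media is a hypothetical helper, and video links written in other forms would need their own pattern.

```shell
# Hypothetical helper: count markdown image links that still point at http(s) URLs.
count_remote_media() {
  # grep exits non-zero when nothing matches, so fall through to a count of 0
  { grep -oE '!\[[^]]*\]\(https?://[^)]*\)' "$1" || true; } | wc -l | tr -d ' '
}
```

A count of 0 means no prompt is needed; any other value can fill in the N of the "Download N images/videos to local files?" question.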
Environment Variables
| Variable | Description |
|---|---|
| URL_CHROME_PATH | Custom Chrome executable path |
| | Custom data directory |
| | Custom Chrome profile directory |
Troubleshooting: Chrome not found → set URL_CHROME_PATH. Timeout → increase --timeout. Complex pages → try --wait mode. If markdown quality is poor, inspect the saved -captured.html and check whether the run logged a legacy fallback.
YouTube Notes
- The upgraded Defuddle path uses async extractors, so YouTube pages can include transcript text directly in the markdown body.
- Transcript availability depends on YouTube exposing a caption track. Videos with captions disabled, restricted playback, or blocked regional access may still produce description-only output.
- If the page needs time to finish loading descriptions, chapters, or player metadata, prefer --wait and capture after the watch page is fully hydrated.
Hosted API Fallback
- The hosted fallback endpoint is https://defuddle.md/<url>. In shell form: curl https://defuddle.md/stephango.com
- Use it only when the local Chrome/CDP capture path fails outright. The local path still has higher fidelity because it can save the captured HTML and handle authenticated pages.
- The hosted API already returns Markdown with YAML frontmatter, so save that response as-is and then apply the normal media-localization step if requested.
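The hosted fallback can be wrapped so the response is written straight to the markdown file. A minimal sketch: both function names are hypothetical, and dropping the URL scheme is an assumption made only to match the defuddle.md/stephango.com form shown above.

```shell
# Hypothetical helper: build the hosted endpoint for a page URL.
# Assumption: the scheme is dropped to mirror the scheme-less example above.
hosted_fallback_url() {
  printf 'https://defuddle.md/%s\n' "${1#*://}"
}

# Hypothetical helper: fetch_hosted_markdown PAGE_URL OUTPUT_FILE
# Saves the API response as-is; -f makes curl fail on HTTP errors.
fetch_hosted_markdown() {
  curl -fsSL "$(hosted_fallback_url "$1")" -o "$2"
}
```

Because the response already carries YAML front matter, the saved file needs no further wrapping before the optional media-localization step.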
Extension Support
Custom configurations via EXTEND.md. See Preferences section for paths and supported options.