Claude-skill-registry document-to-markdown

Convert PDF, Office, images, HTML, and URLs to Markdown. Use for document conversion, text extraction, OCR, web scraping, RAG pipelines. Triggers: "convert to markdown", "extract text", "read PDF", "scrape URL".

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/document-to-markdown" ~/.claude/skills/majiayu000-claude-skill-registry-document-to-markdown && rm -rf "$T"

manifest: skills/data/document-to-markdown/SKILL.md

Document to Markdown

Convert documents and URLs to clean Markdown for LLM/RAG use.

Prerequisites (Auto-handled)

Before running any conversion, check and install dependencies if needed:

# Check if dependencies are installed, install if missing
pip show pymupdf4llm markitdown pyyaml > /dev/null 2>&1 || pip install -r ~/.claude/skills/document-to-markdown/requirements.txt

Run this check automatically before first use. Do not prompt user for installation.

Quick Start

# Single file
python scripts/gateway.py --input <file_or_url> [--json]

# Batch processing
python scripts/gateway.py --input-dir <folder> --output-dir <out> [--recursive] [--parallel 4]

Supported Inputs

Type	Formats
Documents	PDF, DOCX, PPTX, XLSX
Images	PNG, JPG, JPEG, WEBP, TIFF
Web	HTML (local), URLs (http/https)
Text	TXT, MD, CSV, JSON, XML

Key Options

Option	Purpose
`--format human`	Clean, readable output (default)
`--format rag`	Structured output for LLM/RAG
`--json`	Structured output for agents
`--input-dir`	Batch process entire directory
`--output-dir`	Output directory for batch mode
`--recursive`	Include subdirectories in batch
`--parallel N`	Process N files concurrently
`--frontmatter`	Add YAML metadata header
`--pdf-backend marker`	For scanned PDFs (slow, 1.3GB models)
`--pdf-backend paddleocr`	For scanned PDFs + Chinese (fast, <10MB)
`--table-mode`	Enable table recognition (requires `paddlex[ocr]` )
`--use-gpu`	GPU acceleration for PaddleOCR
`--to-traditional`	Convert Simplified to Traditional Chinese
`--pages 1-10`	Convert specific pages only

Workflow

Check dependencies (auto-install if missing)

Run:

python scripts/gateway.py --input <path> --json

Check JSON
```
success
```
field
If
```
warnings
```
present, consider switching backend
Read output file to present content to user

Format Selection

Default:

--format human

(clean, readable for humans)

Use

--format rag

when user prompt mentions:

"for RAG", "for LLM", "for embedding", "for AI"
"vector database", "chunking", "indexing"
"給 AI 讀", "餵給模型", "向量資料庫"

Format Output Style

human

## Title

Name (email)

/ clean links

rag

## **Title**

**Name** _email_

/ full metadata

Conditional Logic

IF warning "Complex tables detected":
  → Retry with --pdf-backend marker (slower but better tables)

IF output is empty or very short:
  → Retry with --pdf-backend marker (for scanned PDFs)

IF URL timeout:
  → Increase --url-timeout or use --url-backend markitdown

IF OCR quality poor:
  → Specify --lang for correct language

Output Format

Single file:

{"success": true, "output_path": "doc.md", "backend_used": "pymupdf4llm"}

Batch:

{"success": true, "total": 10, "converted": 9, "failed": 1, "results": [...]}

For backend details, see

references/backends.md

. For troubleshooting, see

references/troubleshooting.md