Claude-skill-registry document-to-markdown

Convert PDF, Office, images, HTML, and URLs to Markdown. Use for document conversion, text extraction, OCR, web scraping, RAG pipelines. Triggers: "convert to markdown", "extract text", "read PDF", "scrape URL".

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/document-to-markdown" ~/.claude/skills/majiayu000-claude-skill-registry-document-to-markdown && rm -rf "$T"
manifest: skills/data/document-to-markdown/SKILL.md
source content

Document to Markdown

Convert documents and URLs to clean Markdown for LLM/RAG use.

Prerequisites (Auto-handled)

Before running any conversion, check and install dependencies if needed:

# Check if dependencies are installed, install if missing
pip show pymupdf4llm markitdown pyyaml > /dev/null 2>&1 || pip install -r ~/.claude/skills/document-to-markdown/requirements.txt

Run this check automatically before first use. Do not prompt user for installation.

Quick Start

# Single file
python scripts/gateway.py --input <file_or_url> [--json]

# Batch processing
python scripts/gateway.py --input-dir <folder> --output-dir <out> [--recursive] [--parallel 4]

Supported Inputs

TypeFormats
DocumentsPDF, DOCX, PPTX, XLSX
ImagesPNG, JPG, JPEG, WEBP, TIFF
WebHTML (local), URLs (http/https)
TextTXT, MD, CSV, JSON, XML

Key Options

OptionPurpose
--format human
Clean, readable output (default)
--format rag
Structured output for LLM/RAG
--json
Structured output for agents
--input-dir
Batch process entire directory
--output-dir
Output directory for batch mode
--recursive
Include subdirectories in batch
--parallel N
Process N files concurrently
--frontmatter
Add YAML metadata header
--pdf-backend marker
For scanned PDFs (slow, 1.3GB models)
--pdf-backend paddleocr
For scanned PDFs + Chinese (fast, <10MB)
--table-mode
Enable table recognition (requires
paddlex[ocr]
)
--use-gpu
GPU acceleration for PaddleOCR
--to-traditional
Convert Simplified to Traditional Chinese
--pages 1-10
Convert specific pages only

Workflow

  1. Check dependencies (auto-install if missing)
  2. Run:
    python scripts/gateway.py --input <path> --json
  3. Check JSON
    success
    field
  4. If
    warnings
    present, consider switching backend
  5. Read output file to present content to user

Format Selection

Default:

--format human
(clean, readable for humans)

Use

--format rag
when user prompt mentions:

  • "for RAG", "for LLM", "for embedding", "for AI"
  • "vector database", "chunking", "indexing"
  • "給 AI 讀", "餵給模型", "向量資料庫"
FormatOutput Style
human
## Title
/
Name (email)
/ clean links
rag
## **Title**
/
**Name** _email_
/ full metadata

Conditional Logic

IF warning "Complex tables detected":
  → Retry with --pdf-backend marker (slower but better tables)

IF output is empty or very short:
  → Retry with --pdf-backend marker (for scanned PDFs)

IF URL timeout:
  → Increase --url-timeout or use --url-backend markitdown

IF OCR quality poor:
  → Specify --lang for correct language

Output Format

Single file:

{"success": true, "output_path": "doc.md", "backend_used": "pymupdf4llm"}

Batch:

{"success": true, "total": 10, "converted": 9, "failed": 1, "results": [...]}

For backend details, see

references/backends.md
. For troubleshooting, see
references/troubleshooting.md
.