Claude-skill-registry document-to-markdown
Convert PDF, Office, images, HTML, and URLs to Markdown. Use for document conversion, text extraction, OCR, web scraping, RAG pipelines. Triggers: "convert to markdown", "extract text", "read PDF", "scrape URL".
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/document-to-markdown" ~/.claude/skills/majiayu000-claude-skill-registry-document-to-markdown && rm -rf "$T"
manifest: skills/data/document-to-markdown/SKILL.md
Document to Markdown
Convert documents and URLs to clean Markdown for LLM/RAG use.
Prerequisites (Auto-handled)
Before running any conversion, check and install dependencies if needed:
    # Check if dependencies are installed; install if missing
    pip show pymupdf4llm markitdown pyyaml > /dev/null 2>&1 || pip install -r ~/.claude/skills/document-to-markdown/requirements.txt
Run this check automatically before first use. Do not prompt user for installation.
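The same check can be sketched in Python, for callers that prefer not to shell out to `pip show`. This is a minimal sketch, not part of the skill: `missing_distributions` and `ensure_dependencies` are hypothetical helpers, and the requirements path is copied from the command above on the assumption that the skill is installed there.

```python
import subprocess
import sys
from importlib import metadata
from pathlib import Path

# Distribution names taken from the `pip show` command above.
REQUIRED = ("pymupdf4llm", "markitdown", "pyyaml")

# Assumed install location, copied from the command above; adjust if needed.
REQUIREMENTS = Path.home() / ".claude/skills/document-to-markdown/requirements.txt"


def missing_distributions(names=REQUIRED):
    """Return the distribution names that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def ensure_dependencies(requirements=REQUIREMENTS):
    """Install from requirements.txt only when something is missing."""
    missing = missing_distributions()
    if missing:
        # Use the current interpreter's pip so packages land in the right environment.
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "-r", str(requirements)],
            check=True,
        )
    return missing
```

Calling `ensure_dependencies()` once before the first conversion matches the "no prompt" behavior described above; it returns the list of names that were missing before installation.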
Quick Start
    # Single file
    python scripts/gateway.py --input <file_or_url> [--json]

    # Batch processing
    python scripts/gateway.py --input-dir <folder> --output-dir <out> [--recursive] [--parallel 4]
Supported Inputs
| Type | Formats |
|---|---|
| Documents | PDF, DOCX, PPTX, XLSX |
| Images | PNG, JPG, JPEG, WEBP, TIFF |
| Web | HTML (local), URLs (http/https) |
| Text | TXT, MD, CSV, JSON, XML |
Key Options
| Option | Purpose |
|---|---|
| `--format human` | Clean, readable output (default) |
| `--format rag` | Structured output for LLM/RAG |
| `--json` | Structured output for agents |
| `--input-dir` | Batch process entire directory |
| `--output-dir` | Output directory for batch mode |
| `--recursive` | Include subdirectories in batch |
| `--parallel N` | Process N files concurrently |
| | Add YAML metadata header |
| `--pdf-backend marker` | For scanned PDFs (slow, 1.3GB models) |
| | For scanned PDFs + Chinese (fast, <10MB) |
| | Enable table recognition (requires ) |
| | GPU acceleration for PaddleOCR |
| | Convert Simplified to Traditional Chinese |
| | Convert specific pages only |
Workflow
- Check dependencies (auto-install if missing)
- Run: `python scripts/gateway.py --input <path> --json`
- Check the JSON `success` field
- If `warnings` is present, consider switching backend
- Read the output file to present content to the user
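The workflow steps can be sketched as a small Python wrapper. This is an illustration under stated assumptions: that `gateway.py` prints its JSON result to stdout, that `warnings` (when present) is a list of strings, and that the command is run from the skill directory. The helper names `build_command`, `interpret_result`, and `convert` are hypothetical.

```python
import json
import subprocess
import sys
from pathlib import Path


def build_command(input_path):
    """Build the single-file command from the workflow (run from the skill directory)."""
    return [sys.executable, "scripts/gateway.py", "--input", input_path, "--json"]


def interpret_result(stdout):
    """Parse the gateway's JSON output into (ok, output_path, warnings)."""
    result = json.loads(stdout)
    ok = bool(result.get("success"))
    # Assumption: `warnings` is a list when present; treat a missing key as empty.
    warnings = result.get("warnings") or []
    return ok, result.get("output_path"), warnings


def convert(input_path):
    """Run one conversion and return the Markdown text, or None on failure."""
    proc = subprocess.run(build_command(input_path), capture_output=True, text=True)
    ok, output_path, warnings = interpret_result(proc.stdout)
    for warning in warnings:
        # Surface warnings so the caller can decide whether to switch backend.
        print(f"warning: {warning}", file=sys.stderr)
    if not ok or output_path is None:
        return None
    return Path(output_path).read_text(encoding="utf-8")
```

Keeping the JSON parsing in `interpret_result` separate from the subprocess call makes the success and warning handling easy to check without running a conversion.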
Format Selection
Default: `--format human` (clean, readable for humans)
Use `--format rag` when the user prompt mentions:
- "for RAG", "for LLM", "for embedding", "for AI"
- "vector database", "chunking", "indexing"
- "給 AI 讀" (for AI to read), "餵給模型" (feed to the model), "向量資料庫" (vector database)
| Format | Output Style |
|---|---|
| `human` | / / clean links |
| `rag` | / / full metadata |
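The trigger rule above amounts to a keyword check on the user prompt. A minimal sketch, assuming plain case-insensitive substring matching is good enough; `choose_format` is a hypothetical helper, and the trigger list is copied from the bullets above.

```python
# Trigger phrases copied from the list above, lowercased for matching.
RAG_TRIGGERS = tuple(
    phrase.lower()
    for phrase in (
        "for RAG", "for LLM", "for embedding", "for AI",
        "vector database", "chunking", "indexing",
        "給 AI 讀", "餵給模型", "向量資料庫",
    )
)


def choose_format(prompt):
    """Return "rag" if the prompt mentions a RAG/LLM trigger, else "human"."""
    text = prompt.lower()
    return "rag" if any(trigger in text for trigger in RAG_TRIGGERS) else "human"
```

The returned value is meant to be passed as `--format <value>`. Substring matching is deliberately naive (for example, "for AI" also matches "for airports"), so a stricter matcher may be preferable in practice.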
Conditional Logic
    IF warning "Complex tables detected":
      → Retry with --pdf-backend marker (slower but better tables)
    IF output is empty or very short:
      → Retry with --pdf-backend marker (for scanned PDFs)
    IF URL timeout:
      → Increase --url-timeout or use --url-backend markitdown
    IF OCR quality poor:
      → Specify --lang for correct language
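The first two rules can be sketched as a function that proposes retry arguments. This is an illustration, not the skill's own logic: `retry_args` is a hypothetical helper, the 200-character threshold for "very short" is an arbitrary assumption, and `"marker"` as a value of `backend_used` is assumed. The URL-timeout and OCR rules are left out because they need values (a timeout, a language code) that only the caller can supply.

```python
def retry_args(result, markdown_text, min_chars=200):
    """Return extra CLI arguments for a retry, or [] if no rule applies.

    `result` is the parsed JSON from a --json run; `markdown_text` is the
    content of the output file. `min_chars` is an assumed cutoff for
    "very short" output.
    """
    # Assumption: backend_used reports "marker" once that backend has run,
    # in which case retrying with the same backend would not help.
    if result.get("backend_used") == "marker":
        return []
    warnings = result.get("warnings") or []
    if any("Complex tables detected" in warning for warning in warnings):
        return ["--pdf-backend", "marker"]
    if len(markdown_text.strip()) < min_chars:
        # Empty or near-empty output suggests a scanned PDF.
        return ["--pdf-backend", "marker"]
    return []
```

A caller would append the returned list to the original `gateway.py` command and run it once more; an empty list means the first result should be kept.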
Output Format
Single file:
{"success": true, "output_path": "doc.md", "backend_used": "pymupdf4llm"}
Batch:
{"success": true, "total": 10, "converted": 9, "failed": 1, "results": [...]}
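A batch report can be condensed into a one-line summary using only the top-level counters shown above. A minimal sketch, assuming the report arrives as a JSON string; `summarize_batch` is a hypothetical helper, and the per-file `results` entries are not inspected because their shape is not documented here.

```python
import json


def summarize_batch(stdout):
    """Summarize a batch JSON report using its top-level counters."""
    report = json.loads(stdout)
    return f'{report["converted"]}/{report["total"]} converted, {report["failed"]} failed'
```

For example, a report with `"total": 10, "converted": 9, "failed": 1` yields `9/10 converted, 1 failed`.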
For backend details, see references/backends.md.
For troubleshooting, see references/troubleshooting.md.