Claude-skill-registry document-conversion
Convert DOC/DOCX/PDF/PPT/PPTX documents to Markdown. Automatically detects the PDF type (electronic or scanned) and extracts images to a separate directory. Use this Skill when an administrator onboards non-Markdown documents. Trigger condition: onboarding files in DOC/DOCX/PDF/PPT/PPTX format.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/document-conversion" ~/.claude/skills/majiayu000-claude-skill-registry-document-conversion && rm -rf "$T"
manifest: skills/data/document-conversion/SKILL.md
source content
Document Format Conversion
Convert various document formats to Markdown for knowledge base onboarding.
Supported Formats
| Format | Processing Method |
|---|---|
| DOCX | Pandoc conversion, preserve formatting and images |
| DOC | LibreOffice → DOCX → Pandoc |
| PDF Electronic | PyMuPDF4LLM fast conversion |
| PDF Scanned | PaddleOCR-VL online OCR |
| PPTX | pptx2md professional conversion |
| PPT | LibreOffice → PPTX → pptx2md |
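The table above can be read as a routing rule keyed on file extension. The following is a minimal sketch of that idea, not the logic of `smart_convert.py` itself: `PIPELINES` and `choose_pipeline` are hypothetical names, the tool names are plain labels (nothing is invoked), and the scanned/electronic decision is passed in as a flag because the detection method is not described here.

```python
from pathlib import Path

# Hypothetical routing table mirroring the "Supported Formats" table.
# The strings are labels only; no external tool is called.
PIPELINES = {
    ".docx": ["pandoc"],
    ".doc": ["libreoffice", "pandoc"],
    ".pptx": ["pptx2md"],
    ".ppt": ["libreoffice", "pptx2md"],
}


def choose_pipeline(filename: str, pdf_is_scanned: bool = False) -> list[str]:
    """Return the ordered tool labels for a file, based on its extension."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        # Assumption: the caller already knows whether the PDF is scanned.
        return ["paddleocr-vl"] if pdf_is_scanned else ["pymupdf4llm"]
    if ext not in PIPELINES:
        raise ValueError(f"unsupported format: {ext or filename}")
    return PIPELINES[ext]


print(choose_pipeline("slides.PPT"))  # ['libreoffice', 'pptx2md']
```

Legacy formats (DOC, PPT) appear as two-step chains because they are first converted to their modern counterparts.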
Usage
python .claude/skills/document-conversion/scripts/smart_convert.py \
  <temp_path> \
  --original-name "<original_filename>" \
  --json-output
Parameters:
- `<temp_path>`: Temporary file path (e.g. `/tmp/kb_upload_xxx.pptx`)
- `--original-name`: Required. The original filename, used to generate the correct image directory name
- `--json-output`: Output the result in JSON format
Output Format
{
  "success": true,
  "markdown_file": "/path/to/output.md",
  "images_dir": "original_filename_images",
  "image_count": 5,
  "input_file": "/path/to/input.pptx"
}
Processing Flow
- Execute conversion command (must use `--original-name` and `--json-output`)
- Parse JSON output, check `success` field
- If `success: false`, report error and end
- If `success: true`, record generated file path and image directory
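The flow above can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the Skill: `build_command`, `parse_result`, and `convert` are hypothetical helper names, and the sketch assumes the script prints the JSON result on stdout with nothing else around it.

```python
import json
import subprocess

SCRIPT = ".claude/skills/document-conversion/scripts/smart_convert.py"


def build_command(temp_path: str, original_name: str) -> list[str]:
    """Build the documented command line, always passing both required flags."""
    return ["python", SCRIPT, temp_path,
            "--original-name", original_name, "--json-output"]


def parse_result(stdout: str) -> tuple[str, str]:
    """Check the `success` field and return (markdown_file, images_dir)."""
    result = json.loads(stdout)
    if not result.get("success"):
        # success: false -> report the error and stop.
        raise RuntimeError(f"conversion failed: {result}")
    # success: true -> record the generated file path and image directory.
    return result["markdown_file"], result["images_dir"]


def convert(temp_path: str, original_name: str) -> tuple[str, str]:
    """Run the conversion and parse its output.

    Assumption: the JSON result is the only content written to stdout.
    """
    proc = subprocess.run(build_command(temp_path, original_name),
                          capture_output=True, text=True)
    return parse_result(proc.stdout)
```

Keeping the parsing step separate from the subprocess call makes the success check easy to exercise without running a real conversion.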
Important Notes
- The image directory is named after the original filename (e.g. `培训资料_images/`)
- Not passing `--original-name` will cause incorrect image reference paths
- The PDF type is detected automatically; scanned PDFs take longer to process (tens of seconds to minutes)
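To illustrate why the original filename matters, here is a small sketch of how the image directory name could be derived. The rule used here (filename stem plus `_images`) is inferred from the examples in this document and is an assumption, not a confirmed detail of `smart_convert.py`; `images_dir_name` is a hypothetical helper.

```python
from pathlib import Path


def images_dir_name(filename: str) -> str:
    """Assumed naming rule: file stem followed by "_images"."""
    return f"{Path(filename).stem}_images"


# With the original filename, the directory matches what the user uploaded.
print(images_dir_name("培训资料.pptx"))  # 培训资料_images

# With only the temporary path, the name would follow the temp file instead.
print(images_dir_name("/tmp/kb_upload_xxx.pptx"))  # kb_upload_xxx_images
```

Under this assumption, a directory named after the temporary upload file would not match the image references expected for the original document.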
Format Details
For detailed processing instructions for each format, see FORMATS.md.