Skillshub doc-to-markdown

Converts DOCX/PDF/PPTX to high-quality Markdown with automatic post-processing. Fixes pandoc grid tables, simple tables, image paths, CJK bold spacing, attribute noise, and code blocks. Benchmarked best-in-class (7.6/10) against Docling, MarkItDown, Pandoc raw, and Mammoth. Trigger on "convert document", "docx to markdown", "parse word", "doc to markdown", "解析word", "转换文档".

install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/daymade/claude-code-skills/doc-to-markdown" ~/.claude/skills/comeonoliver-skillshub-doc-to-markdown && rm -rf "$T"
manifest: skills/daymade/claude-code-skills/doc-to-markdown/SKILL.md

Doc to Markdown

Convert documents to high-quality markdown with intelligent multi-tool orchestration and automatic DOCX post-processing.

Architecture: Pandoc (best-in-class extraction) + 8 post-processing fixes (our value-add).

Quick Start

# DOCX → Markdown (one command, zero manual fixes)
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.docx -o output.md --assets-dir ./media

# PDF → Markdown
uv run --with pymupdf4llm --with markitdown scripts/convert.py document.pdf -o output.md

# Run tests
uv run --with pytest pytest scripts/test_convert.py -v

Dual Mode

| Mode | Speed | Quality | Use Case |
|---|---|---|---|
| Quick (default) | Fast | Good | Drafts, simple documents |
| Heavy | Slower | Best | Final documents, complex layouts |

Tool Selection

| Format | Quick Mode | Heavy Mode |
|---|---|---|
| PDF | pymupdf4llm | pymupdf4llm + markitdown |
| DOCX | pandoc + post-processing | pandoc + markitdown |
| PPTX | markitdown | markitdown + pandoc |
| XLSX | markitdown | markitdown |
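
The tool selection table can be read as a lookup keyed on file extension and mode. The following is a minimal sketch of that lookup, not the code in `convert.py`; the names `TOOLS` and `select_tools` are hypothetical and exist only for illustration.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Hypothetical lookup mirroring the table: (extension, mode) -> tools to run.
TOOLS: Dict[Tuple[str, str], List[str]] = {
    ("pdf", "quick"): ["pymupdf4llm"],
    ("pdf", "heavy"): ["pymupdf4llm", "markitdown"],
    ("docx", "quick"): ["pandoc"],  # post-processing is applied afterwards
    ("docx", "heavy"): ["pandoc", "markitdown"],
    ("pptx", "quick"): ["markitdown"],
    ("pptx", "heavy"): ["markitdown", "pandoc"],
    ("xlsx", "quick"): ["markitdown"],
    ("xlsx", "heavy"): ["markitdown"],
}


def select_tools(path: str, mode: str = "quick") -> List[str]:
    """Return the tool names for a file, based on its extension and the mode."""
    ext = Path(path).suffix.lower().lstrip(".")
    try:
        return TOOLS[(ext, mode)]
    except KeyError:
        raise ValueError(f"unsupported format or mode: {ext!r}, {mode!r}") from None
```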

DOCX Post-Processing (automatic)

When converting DOCX via pandoc, 8 cleanups are applied automatically:

| Problem | Fix | Test coverage |
|---|---|---|
| Grid tables (`+:---+`) | Single-column → blockquote, multi-column → pipe table | `TestPostprocessPipeline` |
| Simple tables (`  ---- ----`) | Multi-column images → pipe table with captions | `TestSimpleTable` |
| Image path nesting (`media/media/`) | Flatten to `media/`, absolute → relative | `test_stats_tracking` |
| Pandoc attributes (`{width="..."}`) | Removed | `test_pandoc_attributes_removed` |
| CJK bold spacing (`**粗体**中文`) | Add space around `**` for CJK bold spans | `TestCjkBoldSpacing` (15 cases) |
| Indented dashed code blocks | Fenced code block (```) with language detection | `test_code_block_with_language` |
| Escaped brackets (`\[...\]`) | `[...]` | `test_escaped_brackets_fixed` |
| Double-bracket links (`[[text]](url)`) | `[text](url)` | `test_double_bracket_links_fixed` |
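
Four of these cleanups are simple text substitutions. The sketch below shows one way they could be written with regular expressions. It is an illustration under assumptions, not the implementation in `convert.py`: the function names are hypothetical, and the attribute pattern only covers `width` and `height`.

```python
import re

# Assumption: pandoc image attributes look like {width="..." height="..."}.
_ATTR = r'(?:width|height)="[^"]*"'
_ATTR_BLOCK = re.compile(r"\{" + _ATTR + r"(?:\s+" + _ATTR + r")*\}")


def strip_pandoc_attributes(text: str) -> str:
    """Remove {width="..."} style attribute blocks."""
    return _ATTR_BLOCK.sub("", text)


def flatten_media_path(text: str) -> str:
    """Collapse the nested media/media/ prefix into media/."""
    return text.replace("media/media/", "media/")


def unescape_brackets(text: str) -> str:
    r"""Turn \[ and \] back into plain brackets."""
    return re.sub(r"\\([\[\]])", r"\1", text)


def fix_double_bracket_links(text: str) -> str:
    """Rewrite [[text]](url) as [text](url)."""
    return re.sub(r"\[\[([^\[\]]+)\]\]\(([^)]+)\)", r"[\1](\2)", text)
```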

CJK Bold Spacing — why and how

DOCX uses run-level styling (no spaces between bold/normal runs in CJK text). Markdown renderers need whitespace around `**` to recognize bold boundaries.

Rule: if a `**content**` span contains any CJK character, ensure both sides have a space, unless already spaced or at line boundary. This handles CJK punctuation, emoji adjacency, and mixed content.

Before: 打开**飞书**,就可以    → some renderers fail to bold
After:  打开 **飞书** ,就可以  → universally renders correctly
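
The rule can be sketched as a small line-level function. This is a minimal illustration, not the tested implementation behind `TestCjkBoldSpacing`; the name `space_cjk_bold` is hypothetical, and the character ranges (ideographs, kana, Hangul) are an assumption about what counts as CJK.

```python
import re

# Assumed CJK ranges: kana, CJK ideographs (incl. Extension A), Hangul syllables.
CJK = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")
# A bold span whose content neither starts nor ends with whitespace.
BOLD = re.compile(r"\*\*(?=\S)(.+?)(?<=\S)\*\*")


def space_cjk_bold(line: str) -> str:
    """Pad CJK bold spans with a space on each side when one is missing."""
    out = []
    pos = 0
    for m in BOLD.finditer(line):
        out.append(line[pos:m.start()])
        span = m.group(0)
        if CJK.search(m.group(1)):
            before = line[m.start() - 1] if m.start() > 0 else ""
            after = line[m.end()] if m.end() < len(line) else ""
            # Empty string means line boundary: leave that side untouched.
            if before and not before.isspace():
                span = " " + span
            if after and not after.isspace():
                span = span + " "
        out.append(span)
        pos = m.end()
    out.append(line[pos:])
    return "".join(out)
```

Spans without any CJK character are passed through unchanged, so ordinary English bold text is not affected.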

Heavy Mode Workflow

Heavy Mode runs multiple tools in parallel and selects the best segments:

  1. Parallel Execution: Run all applicable tools simultaneously
  2. Segment Analysis: Parse each output into segments (tables, headings, images, paragraphs)
  3. Quality Scoring: Score each segment based on completeness and structure
  4. Intelligent Merge: Select best version of each segment across tools
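
The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the logic in `convert.py` or `merge_outputs.py`: converters are modeled as plain callables, segments are split on blank lines and aligned by position, and the score is just segment length. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def split_segments(markdown: str) -> List[str]:
    """Assumption: a segment is any block separated by a blank line."""
    return [block.strip() for block in markdown.split("\n\n") if block.strip()]


def score(segment: str) -> int:
    """Placeholder score: longer segments are treated as more complete."""
    return len(segment)


def heavy_merge(converters: Dict[str, Callable[[str], str]], path: str) -> str:
    # 1. Parallel execution: run every converter on the same input.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, path) for name, fn in converters.items()}
        outputs = {name: future.result() for name, future in futures.items()}

    # 2. Segment analysis: split each output into blocks.
    segmented = [split_segments(text) for text in outputs.values()]

    # 3 and 4. Score the candidates at each position and keep the best one.
    merged = []
    for i in range(max((len(s) for s in segmented), default=0)):
        candidates = [s[i] for s in segmented if i < len(s)]
        merged.append(max(candidates, key=score))
    return "\n\n".join(merged)
```

Positional alignment breaks down when tools produce different numbers of segments, so a real merge would need to match segments by content or type first.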

Merge Criteria

| Segment Type | Selection Criteria |
|---|---|
| Tables | More rows/columns, proper header separator |
| Images | Alt text present, local paths preferred |
| Headings | Proper hierarchy, appropriate length |
| Lists | More items, nested structure preserved |
| Paragraphs | Content completeness |
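
As an illustration of the table criterion, the sketch below scores a pipe table by its cell count and gives a bonus when the second line is a header separator. The weights and the name `table_score` are assumptions made for this example; the actual scoring may differ.

```python
import re

# A pipe-table separator line such as |---|:---:| (assumes at least three dashes).
SEPARATOR = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$")


def table_score(table: str) -> int:
    """Higher is better: more rows and columns, plus a header separator bonus."""
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    rows = [ln for ln in lines if not SEPARATOR.match(ln)]
    cols = max((len(ln.strip().strip("|").split("|")) for ln in rows), default=0)
    has_separator = len(lines) > 1 and bool(SEPARATOR.match(lines[1]))
    # Hypothetical weighting: cell count plus a fixed bonus of 10.
    return len(rows) * cols + (10 if has_separator else 0)
```

Given two versions of the same table, the merge step would keep the one with the higher score.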

Image Extraction

# Extract images with metadata
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf -o ./assets

# Generate markdown references file
uv run --with pymupdf scripts/extract_pdf_images.py document.pdf --markdown refs.md

Output:

  • Images: `assets/img_page1_1.png`, `assets/img_page2_1.jpg`
  • Metadata: `assets/images_metadata.json` (page, position, dimensions)

Quality Validation

# Validate conversion quality
uv run --with pymupdf scripts/validate_output.py document.pdf output.md

# Generate HTML report
uv run --with pymupdf scripts/validate_output.py document.pdf output.md --report report.html

Quality Metrics

| Metric | Pass | Warn | Fail |
|---|---|---|---|
| Text Retention | >95% | 85-95% | <85% |
| Table Retention | 100% | 90-99% | <90% |
| Image Retention | 100% | 80-99% | <80% |
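
The thresholds translate directly into a small grading function. The sketch below is illustrative only and is not taken from `validate_output.py`; the names are hypothetical, and it assumes that values falling between the listed bands (for example 99.5% table retention) count as Warn.

```python
from typing import Dict, Tuple

# metric -> (pass threshold, warn floor, whether pass requires strictly greater)
THRESHOLDS: Dict[str, Tuple[float, float, bool]] = {
    "text": (95.0, 85.0, True),     # pass only above 95%
    "table": (100.0, 90.0, False),  # pass only at 100%
    "image": (100.0, 80.0, False),  # pass only at 100%
}


def grade(metric: str, retention_pct: float) -> str:
    """Map a retention percentage to 'pass', 'warn', or 'fail'."""
    pass_at, warn_floor, strict = THRESHOLDS[metric]
    passed = retention_pct > pass_at if strict else retention_pct >= pass_at
    if passed:
        return "pass"
    return "warn" if retention_pct >= warn_floor else "fail"
```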

Merge Outputs Manually

# Merge multiple markdown files
python scripts/merge_outputs.py output1.md output2.md -o merged.md

# Show segment attribution
python scripts/merge_outputs.py output1.md output2.md -o merged.md --verbose

Path Conversion (Windows/WSL)

# Windows to WSL conversion
python scripts/convert_path.py "C:\Users\name\Documents\file.pdf"
# Output: /mnt/c/Users/name/Documents/file.pdf
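
The mapping itself is mechanical: the drive letter becomes a lowercase directory under `/mnt/` and backslashes become forward slashes. A minimal sketch of that mapping, assuming the default WSL mount point and using a hypothetical function name rather than the contents of `convert_path.py`:

```python
import re


def windows_to_wsl(path: str) -> str:
    """Convert a drive-letter Windows path to its /mnt/<drive>/ equivalent."""
    match = re.match(r"^([A-Za-z]):[\\/](.*)$", path)
    if not match:
        # Not a drive-letter path: only normalize the separators.
        return path.replace("\\", "/")
    drive, rest = match.groups()
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")
```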

Common Issues

"No conversion tools available"

# Install all tools
pip install pymupdf4llm
uv tool install "markitdown[pdf]"
brew install pandoc
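
To see which tools are visible to the current environment before converting, a quick check along these lines may help. It is a sketch, not the detection logic in `convert.py`: it assumes pandoc is found on `PATH` and that the two Python tools are importable as `pymupdf4llm` and `markitdown`.

```python
import importlib.util
import shutil
from typing import List


def available_tools() -> List[str]:
    """List the conversion tools that can be found in this environment."""
    found = []
    # pandoc is an external binary, so look for it on PATH.
    if shutil.which("pandoc"):
        found.append("pandoc")
    # The other tools are Python packages; check importability without importing.
    for module in ("pymupdf4llm", "markitdown"):
        if importlib.util.find_spec(module) is not None:
            found.append(module)
    return found


if __name__ == "__main__":
    print(available_tools() or "No conversion tools available")
```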

FontBBox warnings during PDF conversion

  • These font-parsing warnings are harmless; the output is still correct

Images missing from output

  • Use Heavy Mode for better image preservation
  • Or extract separately with `scripts/extract_pdf_images.py`

Tables broken in output

  • Use Heavy Mode - it selects the most complete table version
  • Or validate with `scripts/validate_output.py`

Bundled Scripts

| Script | Purpose |
|---|---|
| `convert.py` | Main orchestrator with Quick/Heavy mode + DOCX post-processing |
| `test_convert.py` | 31 tests covering all post-processing functions |
| `merge_outputs.py` | Merge multiple markdown outputs |
| `validate_output.py` | Quality validation with HTML report |
| `extract_pdf_images.py` | PDF image extraction with metadata |
| `convert_path.py` | Windows to WSL path converter |

References

  • `references/benchmark-2026-03-22.md` - 5-tool benchmark (Docling/MarkItDown/Pandoc/Mammoth/ours)
  • `references/heavy-mode-guide.md` - Detailed Heavy Mode documentation
  • `references/tool-comparison.md` - Tool capabilities comparison
  • `references/conversion-examples.md` - Batch operation examples