Medical-research-skills markitdown
Convert files and Office documents into clean Markdown when you need LLM-friendly, token-efficient text (e.g., for summarization, search, RAG ingestion, or dataset preparation).
install
source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Other/markitdown" ~/.claude/skills/aipoch-medical-research-skills-markitdown && rm -rf "$T"
manifest: scientific-skills/Other/markitdown/SKILL.md
When to Use
- Converting research papers or reports (PDF/DOCX/EPUB/HTML) into Markdown for LLM summarization, Q&A, or RAG indexing.
- Extracting tables and structured content from spreadsheets (XLSX/CSV) into Markdown for analysis or documentation.
- Turning slide decks (PPTX) into Markdown notes, including speaker notes and (optionally) AI-generated image descriptions.
- Processing images or scanned documents with OCR to obtain searchable, editable Markdown text.
- Transcribing audio (WAV/MP3) or pulling YouTube transcripts into Markdown for meeting notes, content analysis, or knowledge bases.
Key Features
- Converts many formats to structured Markdown (PDF, DOCX, PPTX, XLSX, images, audio, HTML, CSV, JSON, XML, ZIP, EPUB, YouTube URLs, etc.).
- Produces token-efficient output suitable for LLM pipelines (summarization, chunking, embedding).
- OCR support for images/scans (when OCR dependencies are installed).
- Audio transcription support (when transcription dependencies are installed).
- Optional AI-enhanced image/slide descriptions via an OpenAI-compatible client (e.g., OpenRouter).
- Plugin system to extend format support and custom behaviors.
- Stream-based conversion API for large files.
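Token-efficient Markdown is usually split into chunks before embedding or summarization. The following is a minimal sketch of heading-based chunking, assuming the converted text uses `#`-style headings. `chunk_by_heading` and `max_chars` are hypothetical names for illustration and are not part of markitdown.

```python
import re


def chunk_by_heading(markdown: str, max_chars: int = 2000) -> list:
    """Split Markdown at '#'-style headings, then cap each chunk at max_chars."""
    # Zero-width split before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Hard-split oversized sections so no chunk exceeds max_chars.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks
```

Each chunk keeps its heading as the first line, which tends to help retrieval. The character cap is a crude stand-in for a real token count.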
Dependencies
- Python: >=3.9 (recommended)
- Package: markitdown[all] (installs all optional format handlers)
Optional system dependencies (feature-dependent):
- Tesseract OCR: tesseract-ocr (for image/scanned-text OCR)
Optional external services (feature-dependent):
- Azure Document Intelligence endpoint (for enhanced PDF extraction)
- OpenAI-compatible LLM endpoint (e.g., OpenRouter) for AI image descriptions
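A quick pre-flight check can confirm that the Python-side dependencies listed above are present before a batch run. This is a sketch using only the standard library. `check_environment` is a hypothetical helper, and it assumes the `openai` package is the client used for AI image descriptions, as in the example in this document.

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Report which Python-level requirements appear to be satisfied."""
    return {
        # Recommended interpreter version from the Dependencies list.
        "python>=3.9": sys.version_info >= (3, 9),
        # True when the markitdown package can be imported.
        "markitdown": importlib.util.find_spec("markitdown") is not None,
        # Only needed for the optional AI image descriptions.
        "openai": importlib.util.find_spec("openai") is not None,
    }
```

Calling `check_environment()` returns a dictionary of booleans that can be logged or used to skip optional features.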
Example Usage
Install
pip install 'markitdown[all]'
CLI: Convert a PDF to Markdown
markitdown document.pdf -o output.md
Python: Convert multiple formats (PDF/XLSX/PPTX/DOCX) and save outputs
from pathlib import Path

from markitdown import MarkItDown

md = MarkItDown()
files = [
    "document.pdf",
    "spreadsheet.xlsx",
    "presentation.pptx",
    "notes.docx",
]

for path in files:
    result = md.convert(path)
    out = Path(path).with_suffix(".md")
    out.write_text(result.text_content, encoding="utf-8")
    print(f"Converted {path} -> {out}")
Python: Stream conversion (useful for large files)
from markitdown import MarkItDown

md = MarkItDown()

with open("large_file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")

with open("large_file.md", "w", encoding="utf-8") as out:
    out.write(result.text_content)
Python: AI-enhanced image/slide descriptions (OpenAI-compatible, e.g., OpenRouter)
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_OPENROUTER_API_KEY",
    base_url="https://openrouter.ai/api/v1",
)

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-opus-4.5",
    llm_prompt="Describe this image in detail for scientific documentation.",
)

result = md.convert("presentation.pptx")
print(result.text_content)
Implementation Details
Conversion entry points
- MarkItDown().convert(path) converts a file by path/URL and returns an object whose primary payload is result.text_content (Markdown).
- MarkItDown().convert_stream(stream, file_extension=".pdf") converts from a binary stream; use this for large files or when data is not on disk.
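When the data is not on disk (for example, bytes fetched from object storage), the stream entry point can be fed from memory. This is a minimal sketch, assuming `converter` is a `MarkItDown()` instance or any object with the same `convert_stream(stream, file_extension=...)` call shown in this document. `bytes_to_markdown` is a hypothetical helper name.

```python
import io


def bytes_to_markdown(converter, data: bytes, file_extension: str) -> str:
    """Wrap raw bytes in a binary stream and return the converted Markdown."""
    with io.BytesIO(data) as stream:
        # The extension hint tells the converter which format handler to try.
        result = converter.convert_stream(stream, file_extension=file_extension)
    return result.text_content
```

Passing the converter in as an argument keeps the helper easy to test with a stub and lets one `MarkItDown()` instance be reused across calls.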
Format handling
- Format support is provided by optional extras (e.g., pdf, docx, pptx, xlsx, audio-transcription, youtube-transcription) or all.
- ZIP inputs are typically processed by iterating through contained files and converting each supported entry.
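The ZIP behaviour described above can be pictured as a loop over archive entries. The sketch below illustrates that idea with the standard library only; it is not markitdown's internal implementation. `convert_zip`, `convert_entry`, and the `SUPPORTED` set are hypothetical names, and the extension list is an illustrative subset.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Illustrative subset of extensions; adjust to the extras you installed.
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".html"}


def convert_zip(zip_bytes: bytes, convert_entry) -> dict:
    """Convert every supported file in a ZIP; returns {entry name: Markdown}."""
    results = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            ext = PurePosixPath(info.filename).suffix.lower()
            if ext not in SUPPORTED:
                continue  # skip entries without a known handler
            with archive.open(info) as entry:
                # Buffer the entry so the callback receives a seekable stream.
                results[info.filename] = convert_entry(io.BytesIO(entry.read()), ext)
    return results
```

With markitdown installed, `convert_entry` could be `lambda s, ext: md.convert_stream(s, file_extension=ext).text_content`, where `md` is a `MarkItDown()` instance.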
OCR
- For images/scanned documents, OCR is enabled when OCR tooling is available (commonly Tesseract). Ensure the OS-level OCR binary is installed and accessible in PATH.
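Whether the OCR binary is reachable can be verified before converting scans. This is a small sketch, assuming the executable is named `tesseract` on your platform. `tesseract_version` is a hypothetical helper name.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not in PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    proc = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr instead of stdout.
    output = (proc.stdout or proc.stderr).strip()
    return output.splitlines()[0] if output else None
```

A `None` result means the binary is missing from `PATH` or produced no version output.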
AI image descriptions
- When llm_client, llm_model, and llm_prompt are provided, MarkItDown can request model-generated descriptions for images (including slide images), then inject those descriptions into the Markdown output.
- Any OpenAI-compatible client can be used (e.g., OpenRouter) by setting base_url and api_key.
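Rather than hard-coding credentials, `base_url` and `api_key` can be read from the environment. This is a sketch under stated assumptions: the variable names `OPENROUTER_API_KEY` and `OPENROUTER_BASE_URL` and the helper `llm_client_options` are hypothetical, and the default URL is the OpenRouter endpoint used in this document's example.

```python
import os

DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"


def llm_client_options(env=None):
    """Build client keyword arguments from the environment, or None if no key."""
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY")  # hypothetical variable name
    if not api_key:
        return None  # caller can fall back to conversion without descriptions
    return {
        "api_key": api_key,
        "base_url": env.get("OPENROUTER_BASE_URL", DEFAULT_BASE_URL),
    }
```

When the result is not `None`, it can be expanded into the client constructor, for example `OpenAI(**options)`, and that client passed as `llm_client`.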
Enhanced PDF extraction (Azure Document Intelligence)
- When configured with a Document Intelligence endpoint, PDF extraction can be improved for complex layouts (tables, multi-column text, scanned PDFs), producing more faithful Markdown structure.
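One way to make the endpoint optional is a small factory that only enables Document Intelligence when an endpoint is supplied. This sketch assumes the `docintel_endpoint` keyword described in the upstream markitdown README; verify it against your installed version. `make_converter` is a hypothetical helper name.

```python
def make_converter(docintel_endpoint=None):
    """Return a MarkItDown instance, with Document Intelligence if configured."""
    # Imported lazily so the module loads even where markitdown is absent.
    from markitdown import MarkItDown

    if docintel_endpoint:
        # Assumption: keyword name as documented upstream.
        return MarkItDown(docintel_endpoint=docintel_endpoint)
    return MarkItDown()
```

The same converter object is then used with `convert(...)` as in the earlier examples; only construction differs.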
Plugins
- Plugins can be listed and enabled from the CLI (e.g., --list-plugins, --use-plugins) to extend conversion behavior or add new format handlers.
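For scripted runs, the CLI invocation with plugins enabled can be assembled as an argument list. This is a sketch that combines the `--use-plugins` flag with the `-o` option from the CLI example in this document. `markitdown_command` is a hypothetical helper, and the flag ordering is an assumption.

```python
import shlex


def markitdown_command(input_path, output_path, use_plugins=False):
    """Build the argv list for a markitdown CLI call."""
    argv = ["markitdown"]
    if use_plugins:
        argv.append("--use-plugins")
    argv += [input_path, "-o", output_path]
    return argv


def as_shell(argv):
    """Render argv as a safely quoted shell string for logging."""
    return shlex.join(argv)
```

The list can be passed to `subprocess.run(argv, check=True)`, which avoids shell quoting issues with file names that contain spaces.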