Medical-research-skills markitdown

Convert files and Office documents into clean Markdown when you need LLM-friendly, token-efficient text (e.g., for summarization, search, RAG ingestion, or dataset preparation).

install

source · Clone the upstream repo

git clone https://github.com/aipoch/medical-research-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Other/markitdown" ~/.claude/skills/aipoch-medical-research-skills-markitdown && rm -rf "$T"

manifest: scientific-skills/Other/markitdown/SKILL.md

source content

Source: https://github.com/aipoch/medical-research-skills

When to Use

Converting research papers or reports (PDF/DOCX/EPUB/HTML) into Markdown for LLM summarization, Q&A, or RAG indexing.
Extracting tables and structured content from spreadsheets (XLSX/CSV) into Markdown for analysis or documentation.
Turning slide decks (PPTX) into Markdown notes, including speaker notes and (optionally) AI-generated image descriptions.
Processing images or scanned documents with OCR to obtain searchable, editable Markdown text.
Transcribing audio (WAV/MP3) or pulling YouTube transcripts into Markdown for meeting notes, content analysis, or knowledge bases.

Key Features

Converts many formats to structured Markdown (PDF, DOCX, PPTX, XLSX, images, audio, HTML, CSV, JSON, XML, ZIP, EPUB, YouTube URLs, etc.).
Produces token-efficient output suitable for LLM pipelines (summarization, chunking, embedding).
OCR support for images/scans (when OCR dependencies are installed).
Audio transcription support (when transcription dependencies are installed).
Optional AI-enhanced image/slide descriptions via an OpenAI-compatible client (e.g., OpenRouter).
Plugin system to extend format support and custom behaviors.
Stream-based conversion API for large files.

Dependencies

Python:
```
>=3.9
```
(recommended)
Package:
- ```
markitdown[all]
```
  (installs all optional format handlers)

Optional system dependencies (feature-dependent):

Tesseract OCR:
```
tesseract-ocr
```
(for image/scanned-text OCR)

Optional external services (feature-dependent):

Azure Document Intelligence endpoint (for enhanced PDF extraction)
OpenAI-compatible LLM endpoint (e.g., OpenRouter) for AI image descriptions

Example Usage

Install

pip install 'markitdown[all]'

CLI: Convert a PDF to Markdown

markitdown document.pdf -o output.md

Python: Convert multiple formats (PDF/XLSX/PPTX/DOCX) and save outputs

from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown()

files = [
    "document.pdf",
    "spreadsheet.xlsx",
    "presentation.pptx",
    "notes.docx",
]

for path in files:
    result = md.convert(path)
    out = Path(path).with_suffix(".md")
    out.write_text(result.text_content, encoding="utf-8")
    print(f"Converted {path} -> {out}")

Python: Stream conversion (useful for large files)

from markitdown import MarkItDown

md = MarkItDown()

with open("large_file.pdf", "rb") as f:
    result = md.convert_stream(f, file_extension=".pdf")

with open("large_file.md", "w", encoding="utf-8") as out:
    out.write(result.text_content)

Python: AI-enhanced image/slide descriptions (OpenAI-compatible, e.g., OpenRouter)

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_OPENROUTER_API_KEY",
    base_url="https://openrouter.ai/api/v1",
)

md = MarkItDown(
    llm_client=client,
    llm_model="anthropic/claude-opus-4.5",
    llm_prompt="Describe this image in detail for scientific documentation.",
)

result = md.convert("presentation.pptx")
print(result.text_content)

Implementation Details

Conversion entry points
- ```
MarkItDown().convert(path)
```
  converts a file by path/URL and returns an object whose primary payload is
```
result.text_content
```
  (Markdown).
- ```
MarkItDown().convert_stream(stream, file_extension=".pdf")
```
  converts from a binary stream; use this for large files or when data is not on disk.
Format handling
- Format support is provided by optional extras (e.g.,
```
pdf
```
  ,
```
docx
```
  ,
```
pptx
```
  ,
```
xlsx
```
  ,
```
audio-transcription
```
  ,
```
youtube-transcription
```
  ) or
```
all
```
  .
- ZIP inputs are typically processed by iterating through contained files and converting each supported entry.
OCR
- For images/scanned documents, OCR is enabled when OCR tooling is available (commonly Tesseract). Ensure the OS-level OCR binary is installed and accessible in
```
PATH
```
  .
AI image descriptions
- When
```
llm_client
```
  ,
```
llm_model
```
  , and
```
llm_prompt
```
  are provided, MarkItDown can request model-generated descriptions for images (including slide images), then inject those descriptions into the Markdown output.
- Any OpenAI-compatible client can be used (e.g., OpenRouter) by setting
```
base_url
```
  and
```
api_key
```
  .
Enhanced PDF extraction (Azure Document Intelligence)
- When configured with a Document Intelligence endpoint, PDF extraction can be improved for complex layouts (tables, multi-column text, scanned PDFs), producing more faithful Markdown structure.
Plugins
- Plugins can be listed and enabled from the CLI (e.g.,
```
--list-plugins
```
  ,
```
--use-plugins
```
  ) to extend conversion behavior or add new format handlers.