Full-stack-skills ocrmypdf
OCRmyPDF core skill — add searchable OCR text layer to scanned PDFs, convert images to searchable PDFs, support 100+ languages via Tesseract. Use when the user needs to OCR a PDF, make a scanned PDF searchable, or extract text from scanned documents.
git clone https://github.com/partme-ai/full-stack-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/partme-ai/full-stack-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ocrmypdf-skills/ocrmypdf" ~/.claude/skills/partme-ai-full-stack-skills-ocrmypdf && rm -rf "$T"
OCRmyPDF: Core OCR Guide
Overview
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted. It uses Tesseract OCR, supports 100+ languages, produces PDF/A by default, and distributes work across all CPU cores.
For image processing (deskew, rotate, clean), see the ocrmypdf-image skill. For optimization and PDF/A options, see ocrmypdf-optimize. For batch/Docker/scripting, see ocrmypdf-batch. For Python API and plugins, see ocrmypdf-api.
Installation
One-liner installs (recommended)
| OS | Command |
|---|---|
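| Debian / Ubuntu | `apt install ocrmypdf` |
| Fedora | `dnf install ocrmypdf tesseract-osd` |
| macOS (Homebrew) | `brew install ocrmypdf` |
| macOS (MacPorts) | `port install ocrmypdf` |
| FreeBSD | `pkg install py-ocrmypdf` |
| Snap | `snap install ocrmypdf` |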
pip install (latest version)
    # After installing system dependencies (Tesseract, Ghostscript)
    pip install ocrmypdf
Verify
    ocrmypdf --version
    ocrmypdf --help
Requirements
- Python 3.11+
- Tesseract 4.1.1+ (OCR engine)
- Ghostscript 9.54+ or pypdfium2 (PDF rasterization)
- Optional: jbig2enc (compression), pngquant (image optimization), unpaper (cleaning)
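The requirements above can be checked before the first run. The following is a minimal sketch, assuming the usual executable names (`tesseract`, `gs` for Ghostscript, `jbig2` for jbig2enc, `pngquant`, `unpaper`); `check_tool` is a hypothetical helper, not part of OCRmyPDF, and a missing `gs` is not fatal if pypdfium2 is installed instead.

```sh
# Hypothetical helper: succeeds if the named executable is on PATH.
check_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Required tools (gs may be replaced by the pypdfium2 Python package).
for tool in tesseract gs; do
    if check_tool "$tool"; then
        echo "ok: $tool"
    else
        echo "missing (required): $tool"
    fi
done

# Optional tools: OCRmyPDF runs without them, with fewer features.
for tool in jbig2 pngquant unpaper; do
    if check_tool "$tool"; then
        echo "ok: $tool"
    else
        echo "missing (optional): $tool"
    fi
done
```

The script only reports what it finds and never aborts, so it is safe to run on a partially configured machine.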
Quick Start
    # Basic OCR: input scanned PDF, output searchable PDF/A
    ocrmypdf input.pdf output.pdf

    # OCR an image file directly
    ocrmypdf --image-dpi 300 scan.png output.pdf

    # OCR in place (only overwrites on success)
    ocrmypdf myfile.pdf myfile.pdf
Language Support
OCRmyPDF uses Tesseract language packs. Install them for your OS:
    # Debian / Ubuntu
    apt-cache search tesseract-ocr        # List all language packs
    apt install tesseract-ocr-chi-sim     # Chinese Simplified
    apt install tesseract-ocr-fra         # French

    # macOS (Homebrew)
    brew install tesseract-lang           # All languages

    # Fedora
    dnf search tesseract-langpack
    dnf install tesseract-langpack-ita    # Italian
Using languages
    # Single language
    ocrmypdf -l fra document.pdf output.pdf

    # Multiple languages
    ocrmypdf -l eng+fra bilingual.pdf output.pdf

    # Chinese Simplified + English
    ocrmypdf -l chi_sim+eng chinese-doc.pdf output.pdf
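Note: Language identifiers are Tesseract's three-letter codes, which mostly follow ISO 639-3 (for example `eng`, `fra`, `deu`). Script variants add a suffix, as in `chi_sim`, and multiple languages are joined with `+`.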
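A common failure is passing a language to `-l` whose pack is not installed. The sketch below compares a `-l` style string against a list of installed packs; `missing_langs` is a hypothetical helper, and the usage comment assumes that `tesseract --list-langs` prints one header line followed by one language per line.

```sh
# Hypothetical helper: print each requested language that is not installed.
#   $1  language string in -l syntax, e.g. "chi_sim+eng"
#   $2  newline-separated list of installed language packs
missing_langs() {
    ml_requested=$1
    ml_installed=$2
    printf '%s\n' "$ml_requested" | tr '+' '\n' | while IFS= read -r ml_lang; do
        # -x matches the whole line, -F treats the code as a literal string
        printf '%s\n' "$ml_installed" | grep -qxF -- "$ml_lang" \
            || printf '%s\n' "$ml_lang"
    done
}

# Usage (requires Tesseract; tail skips the assumed header line):
#   installed=$(tesseract --list-langs 2>/dev/null | tail -n +2)
#   missing_langs "chi_sim+eng" "$installed"
```

An empty result means every requested pack is available.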
OCR Modes
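Default mode
    # With no mode option, OCR pages that have no text layer. If a page already
    # contains text, ocrmypdf stops with an error (see Troubleshooting) unless
    # one of the modes below is selected.
    ocrmypdf input.pdf output.pdf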
Force OCR (`--force-ocr` or `-m force`)
    # Rasterize and OCR all pages, even those with existing text
    ocrmypdf --force-ocr input.pdf output.pdf

    # v17+ short form:
    ocrmypdf -m force input.pdf output.pdf
Redo OCR (`--redo-ocr` or `-m redo`)
    # Replace existing OCR without rasterizing (preserves quality)
    ocrmypdf --redo-ocr input.pdf output.pdf

    # v17+ short form:
    ocrmypdf -m redo input.pdf output.pdf
Skip text (`--skip-text` or `-m skip`)
    # Skip pages with any text, only OCR blank/image pages
    ocrmypdf --skip-text input.pdf output.pdf

    # v17+ short form:
    ocrmypdf -m skip input.pdf output.pdf
No OCR (image processing only)
    # Apply image processing / PDF/A conversion without OCR
    ocrmypdf --ocr-engine none input.pdf output.pdf
Page Selection
    # OCR only specific pages
    ocrmypdf --pages 1,3,5-10 input.pdf output.pdf

    # OCR only the first page, minimal changes elsewhere
    ocrmypdf --pages 1 --output-type pdf --optimize 0 input.pdf output.pdf
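To make the `--pages` syntax concrete, the sketch below expands a specification such as `1,3,5-10` into the individual page numbers it selects. `expand_pages` is a hypothetical illustration and not how OCRmyPDF parses the option; it handles only comma-separated numbers and closed `a-b` ranges.

```sh
# Hypothetical helper: expand "1,3,5-7" into "1 3 5 6 7".
expand_pages() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r ep_part; do
        case $ep_part in
            *-*)
                # Closed range: count from the start page to the end page.
                ep_i=${ep_part%-*}
                ep_end=${ep_part#*-}
                while [ "$ep_i" -le "$ep_end" ]; do
                    printf '%s\n' "$ep_i"
                    ep_i=$((ep_i + 1))
                done
                ;;
            *)
                # Single page number.
                printf '%s\n' "$ep_part"
                ;;
        esac
    done | paste -sd ' ' -
}

# Usage: preview which pages a specification covers before running OCR.
#   expand_pages 1,3,5-10
```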
Output Types
    # PDF/A (default), for archival
    ocrmypdf --output-type pdfa input.pdf output.pdf

    # Standard PDF
    ocrmypdf --output-type pdf input.pdf output.pdf

    # Auto (v17+): speculative PDF/A, falls back to standard PDF
    ocrmypdf --output-type auto input.pdf output.pdf

    # No output PDF, only produce sidecar text
    ocrmypdf --output-type none --sidecar text.txt input.pdf -
Sidecar Text File
    # Produce a companion text file with OCR text
    ocrmypdf --sidecar output.txt input.pdf output.pdf
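Once a sidecar file exists, it can be searched with ordinary text tools. The sketch below reports which pages mention a given word. It assumes that pages in the sidecar file are separated by a form feed character, which is worth confirming on your own output; `pages_matching` is a hypothetical helper.

```sh
# Hypothetical helper: print the page numbers whose text contains a word.
#   $1  literal text to look for
#   $2  sidecar text file written by --sidecar
# Assumption: pages are separated by form feed (\f), so awk can treat
# each page as one record and NR is the page number.
pages_matching() {
    awk -v RS='\f' -v word="$1" 'index($0, word) { print NR }' "$2"
}

# Usage (after: ocrmypdf --sidecar output.txt input.pdf output.pdf):
#   pages_matching "invoice" output.txt
```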
Metadata
    # Set output PDF metadata
    ocrmypdf --title "My Document" --author "Author Name" --subject "Subject" input.pdf output.pdf
Parallel Processing
    # Use 4 CPU cores (default: all available)
    ocrmypdf --jobs 4 input.pdf output.pdf

    # Single-threaded
    ocrmypdf --jobs 1 input.pdf output.pdf
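On a shared machine it can be useful to leave some cores free rather than accept the default of all cores. The sketch below computes half of the available cores, with a minimum of one, for use with `--jobs`. `half_cores` is a hypothetical helper, and the core count is read with `getconf _NPROCESSORS_ONLN`, which is assumed to be available on the target system.

```sh
# Hypothetical helper: half of the given core count, never less than 1.
half_cores() {
    hc_n=$(( $1 / 2 ))
    if [ "$hc_n" -lt 1 ]; then
        hc_n=1
    fi
    printf '%s\n' "$hc_n"
}

# Detect the core count, falling back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "suggested: ocrmypdf --jobs $(half_cores "$cores") input.pdf output.pdf"
```

Halving is an arbitrary policy chosen for illustration; any value from 1 up to the core count is valid for `--jobs`.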
Common Recipes
Make a scanned PDF searchable
    ocrmypdf scanned.pdf searchable.pdf
Convert image to searchable PDF
    ocrmypdf --image-dpi 300 scan.jpg output.pdf
OCR a multilingual document
    ocrmypdf -l eng+deu+fra multilingual.pdf output.pdf
Re-OCR with newer Tesseract
    ocrmypdf --redo-ocr old-ocr.pdf updated.pdf
Strip all text/OCR from a PDF
    ocrmypdf --ocr-engine none --force-ocr input.pdf stripped.pdf
Quick Reference
| Task | Command |
|---|---|
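| Basic OCR | `ocrmypdf input.pdf output.pdf` |
| Specify language | `ocrmypdf -l fra input.pdf output.pdf` |
| Multiple languages | `ocrmypdf -l eng+fra input.pdf output.pdf` |
| Force re-OCR all pages | `ocrmypdf --force-ocr input.pdf output.pdf` |
| Replace existing OCR | `ocrmypdf --redo-ocr input.pdf output.pdf` |
| Skip pages with text | `ocrmypdf --skip-text input.pdf output.pdf` |
| Specific pages only | `ocrmypdf --pages 1,3,5-10 input.pdf output.pdf` |
| Output standard PDF | `ocrmypdf --output-type pdf input.pdf output.pdf` |
| Extract text sidecar | `ocrmypdf --sidecar output.txt input.pdf output.pdf` |
| Image to PDF | `ocrmypdf --image-dpi 300 scan.png output.pdf` |
| In-place OCR | `ocrmypdf myfile.pdf myfile.pdf` |
| Set metadata | `ocrmypdf --title "My Document" input.pdf output.pdf` |
| Parallel jobs | `ocrmypdf --jobs 4 input.pdf output.pdf` |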
Troubleshooting
- "Tesseract not found": Install Tesseract and ensure it's on PATH.
- Poor OCR quality: Check language packs (`-l`), try `--deskew` (see ocrmypdf-image), or `--oversample 300`.
- "Input file has text": Use `--force-ocr`, `--redo-ocr`, or `--skip-text` as appropriate.
- Large output files: See ocrmypdf-optimize for `--optimize` levels and JBIG2.
- Signed PDFs: Use `--invalidate-digital-signatures` to override (signatures will be invalidated).
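The "Input file has text" case can also be handled in a script. The sketch below retries with `--skip-text` when the first run reports existing text. It assumes that OCRmyPDF signals this condition with exit status 6, which should be checked against the exit codes documented for your installed version. `ocr_with_fallback` and the `OCRMYPDF` variable are hypothetical names introduced here.

```sh
# Command to run; override OCRMYPDF to point at a specific install.
OCRMYPDF=${OCRMYPDF:-ocrmypdf}

# Hypothetical wrapper: run OCR, and if the input already contains text
# (assumed exit status 6), retry while skipping the pages that have text.
ocr_with_fallback() {
    "$OCRMYPDF" "$1" "$2"
    owf_status=$?
    if [ "$owf_status" -eq 6 ]; then
        echo "existing text found, retrying with --skip-text" >&2
        "$OCRMYPDF" --skip-text "$1" "$2"
        owf_status=$?
    fi
    # Any other failure is passed through unchanged.
    return "$owf_status"
}

# Usage:
#   ocr_with_fallback input.pdf output.pdf
```

Falling back to `--skip-text` keeps existing text untouched; substitute `--redo-ocr` or `--force-ocr` in the retry if replacing the old text layer is the goal.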