Full-stack-skills ocrmypdf

OCRmyPDF core skill: adds a searchable OCR text layer to scanned PDFs, converts images to searchable PDFs, and supports 100+ languages via Tesseract. Use when the user needs to OCR a PDF, make a scanned PDF searchable, or extract text from scanned documents.

install

source · Clone the upstream repo

git clone https://github.com/partme-ai/full-stack-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/partme-ai/full-stack-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ocrmypdf-skills/ocrmypdf" ~/.claude/skills/partme-ai-full-stack-skills-ocrmypdf && rm -rf "$T"

manifest: skills/ocrmypdf-skills/ocrmypdf/SKILL.md

OCRmyPDF — Core OCR Guide

Overview

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted. It uses Tesseract OCR, supports 100+ languages, produces PDF/A by default, and distributes work across all CPU cores.

For image processing (deskew, rotate, clean), see the ocrmypdf-image skill. For optimization and PDF/A options, see ocrmypdf-optimize. For batch/Docker/scripting, see ocrmypdf-batch. For Python API and plugins, see ocrmypdf-api.

Installation

One-liner installs (recommended)

OS	Command
Debian / Ubuntu	`apt install ocrmypdf`
Fedora	`dnf install ocrmypdf tesseract-osd`
macOS (Homebrew)	`brew install ocrmypdf`
macOS (MacPorts)	`port install ocrmypdf`
FreeBSD	`pkg install py-ocrmypdf`
Snap	`snap install ocrmypdf`

pip install (latest version)

# After installing system dependencies (Tesseract, Ghostscript)
pip install ocrmypdf

Verify

ocrmypdf --version
ocrmypdf --help

Requirements

Python 3.11+
Tesseract 4.1.1+ (OCR engine)
Ghostscript 9.54+ or pypdfium2 (PDF rasterization)
Optional: jbig2enc (compression), pngquant (image optimization), unpaper (cleaning)
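
Before running OCRmyPDF, it can help to confirm that the tools listed above are actually on PATH. The following is a minimal sketch, not part of OCRmyPDF itself: `missing_tools` is a hypothetical helper, and the binary names `tesseract` and `gs` (Ghostscript) are assumed to match your installation. If you rasterize with pypdfium2 instead of Ghostscript, a missing `gs` is not necessarily a problem.

```shell
#!/bin/sh
# Hypothetical helper: print each named tool that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed binary names: ocrmypdf, tesseract, gs (Ghostscript).
missing=$(missing_tools ocrmypdf tesseract gs)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the newline-separated names print on one line.
    echo "Missing required tools:" $missing >&2
else
    echo "All required tools found."
fi
```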

Quick Start

# Basic OCR — input scanned PDF, output searchable PDF/A
ocrmypdf input.pdf output.pdf

# OCR an image file directly
ocrmypdf --image-dpi 300 scan.png output.pdf

# OCR in place (only overwrites on success)
ocrmypdf myfile.pdf myfile.pdf

Language Support

OCRmyPDF uses Tesseract language packs. Install them for your OS:

# Debian / Ubuntu
apt-cache search tesseract-ocr          # List all language packs
apt install tesseract-ocr-chi-sim       # Chinese Simplified
apt install tesseract-ocr-fra           # French

# macOS (Homebrew)
brew install tesseract-lang             # All languages

# Fedora
dnf search tesseract-langpack
dnf install tesseract-langpack-ita      # Italian

Using languages

# Single language
ocrmypdf -l fra document.pdf output.pdf

# Multiple languages
ocrmypdf -l eng+fra bilingual.pdf output.pdf

# Chinese Simplified + English
ocrmypdf -l chi_sim+eng chinese-doc.pdf output.pdf

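Note: Language identifiers are Tesseract language pack names. Most are three-letter ISO 639 codes (for example `fra`), with some Tesseract-specific variants such as `chi_sim`. Run `tesseract --list-langs` to see which ones are installed.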
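
When the language list is built in a script, the `+` separator and the installed-pack check are easy to get wrong. The sketch below shows one way to handle both. `join_langs` and `lang_in_list` are hypothetical helper names, and the usage line assumes `tesseract --list-langs` prints one language code per line on standard output.

```shell
#!/bin/sh
# Hypothetical helper: join language codes with "+" to form the value for -l.
# The function body runs in a subshell so the IFS change does not leak out.
join_langs() (
    IFS='+'
    printf '%s\n' "$*"
)

# Hypothetical helper: succeed if code $1 appears as a whole line on stdin.
lang_in_list() {
    grep -qx -- "$1"
}

# Usage (assumes tesseract is on PATH):
#   tesseract --list-langs | lang_in_list fra &&
#       ocrmypdf -l "$(join_langs eng fra)" bilingual.pdf output.pdf
```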

OCR Modes

Default mode (skip existing text)

# Skip pages that already have text — only OCR pages without text
ocrmypdf input.pdf output.pdf

Force OCR (--force-ocr or -m force)

# Rasterize and OCR all pages, even those with existing text
ocrmypdf --force-ocr input.pdf output.pdf
# v17+ short form:
ocrmypdf -m force input.pdf output.pdf

Redo OCR (--redo-ocr or -m redo)

# Replace existing OCR without rasterizing (preserves quality)
ocrmypdf --redo-ocr input.pdf output.pdf
# v17+ short form:
ocrmypdf -m redo input.pdf output.pdf

Skip text (--skip-text or -m skip)

# Skip pages with any text, only OCR blank/image pages
ocrmypdf --skip-text input.pdf output.pdf
# v17+ short form:
ocrmypdf -m skip input.pdf output.pdf
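
The force, redo, and skip modes are alternatives, so a script should pass exactly one of them. As a rough sketch, a wrapper can map a mode name to the corresponding long flag shown above. `mode_flag` is a hypothetical helper, not an OCRmyPDF feature.

```shell
#!/bin/sh
# Hypothetical helper: map a mode name to the long flag used in the examples above.
mode_flag() {
    case $1 in
        force) echo "--force-ocr" ;;
        redo)  echo "--redo-ocr" ;;
        skip)  echo "--skip-text" ;;
        *)     echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   ocrmypdf "$(mode_flag redo)" input.pdf output.pdf
```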

No OCR (image processing only)

# Apply image processing / PDF/A conversion without OCR
ocrmypdf --ocr-engine none input.pdf output.pdf

Page Selection

# OCR only specific pages
ocrmypdf --pages 1,3,5-10 input.pdf output.pdf

# OCR only the first page, minimal changes elsewhere
ocrmypdf --pages 1 --output-type pdf --optimize 0 input.pdf output.pdf
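
To make the page specification concrete, the sketch below expands a value such as `1,3,5-10` into the individual page numbers it selects. `expand_pages` is a hypothetical illustration that handles only the two forms shown above (single pages and closed ranges). It is not how OCRmyPDF parses the option internally.

```shell
#!/bin/sh
# Hypothetical helper: expand "1,3,5-7" into "1 3 5 6 7".
# Runs in a subshell so IFS and the loop variables stay local.
expand_pages() (
    out=""
    IFS=','
    set -f                      # no filename globbing while splitting on commas
    for part in $1; do
        case $part in
            *-*) start=${part%-*}; end=${part#*-} ;;   # range "a-b"
            *)   start=$part; end=$part ;;             # single page
        esac
        i=$start
        while [ "$i" -le "$end" ]; do
            out="$out${out:+ }$i"
            i=$((i + 1))
        done
    done
    printf '%s\n' "$out"
)

expand_pages 1,3,5-7
```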

Output Types

# PDF/A (default) — for archival
ocrmypdf --output-type pdfa input.pdf output.pdf

# Standard PDF
ocrmypdf --output-type pdf input.pdf output.pdf

# Auto (v17+) — speculative PDF/A, falls back to standard PDF
ocrmypdf --output-type auto input.pdf output.pdf

# No output PDF — only produce sidecar text
ocrmypdf --output-type none --sidecar text.txt input.pdf -

Sidecar Text File

# Produce a companion text file with OCR text
ocrmypdf --sidecar output.txt input.pdf output.pdf
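
A sidecar file also gives scripts a cheap way to check whether OCR recognized anything at all. The following is a minimal sketch under that assumption. `sidecar_has_text` is a hypothetical helper that succeeds only if the file contains at least one non-whitespace character.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the sidecar file contains any non-whitespace character.
sidecar_has_text() {
    grep -q '[^[:space:]]' "$1"
}

# Usage, after running: ocrmypdf --sidecar output.txt input.pdf output.pdf
#   sidecar_has_text output.txt || echo "no text recognized" >&2
```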

Metadata

# Set output PDF metadata
ocrmypdf --title "My Document" --author "Author Name" --subject "Subject" input.pdf output.pdf

Parallel Processing

# Use 4 CPU cores (default: all available)
ocrmypdf --jobs 4 input.pdf output.pdf

# Single-threaded
ocrmypdf --jobs 1 input.pdf output.pdf
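
On a shared machine it may be preferable to leave some cores free. The sketch below derives a `--jobs` value as half of the detected core count. `half_jobs` is a hypothetical helper, and the core count is read with `getconf _NPROCESSORS_ONLN`, which is assumed to be available (it is on typical Linux and macOS systems). The script only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical helper: half of the given core count, but never less than 1.
half_jobs() {
    n=$(( $1 / 2 ))
    [ "$n" -lt 1 ] && n=1
    echo "$n"
}

# Fall back to 1 core if getconf is unavailable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "would run: ocrmypdf --jobs $(half_jobs "$cores") input.pdf output.pdf"
```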

Common Recipes

Make a scanned PDF searchable

ocrmypdf scanned.pdf searchable.pdf

Convert image to searchable PDF

ocrmypdf --image-dpi 300 scan.jpg output.pdf

OCR a multilingual document

ocrmypdf -l eng+deu+fra multilingual.pdf output.pdf

Re-OCR with newer Tesseract

ocrmypdf --redo-ocr old-ocr.pdf updated.pdf

Strip all text/OCR from a PDF

ocrmypdf --ocr-engine none --force-ocr input.pdf stripped.pdf

Quick Reference

Task	Command
Basic OCR	`ocrmypdf input.pdf output.pdf`
Specify language	`ocrmypdf -l fra input.pdf output.pdf`
Multiple languages	`ocrmypdf -l eng+fra input.pdf output.pdf`
Force re-OCR all pages	`ocrmypdf --force-ocr input.pdf output.pdf`
Replace existing OCR	`ocrmypdf --redo-ocr input.pdf output.pdf`
Skip pages with text	`ocrmypdf --skip-text input.pdf output.pdf`
Specific pages only	`ocrmypdf --pages 1,3,5-10 input.pdf output.pdf`
Output standard PDF	`ocrmypdf --output-type pdf input.pdf output.pdf`
Extract text sidecar	`ocrmypdf --sidecar text.txt input.pdf output.pdf`
Image to PDF	`ocrmypdf --image-dpi 300 image.png output.pdf`
In-place OCR	`ocrmypdf myfile.pdf myfile.pdf`
Set metadata	`ocrmypdf --title "Title" input.pdf output.pdf`
Parallel jobs	`ocrmypdf --jobs 4 input.pdf output.pdf`

Troubleshooting

"Tesseract not found": Install Tesseract and ensure it's on PATH.
Poor OCR quality: Check language packs (`-l`), try `--deskew` (see ocrmypdf-image), or `--oversample 300`.
"Input file has text": Use `--force-ocr`, `--redo-ocr`, or `--skip-text` as appropriate.
Large output files: See ocrmypdf-optimize for `--optimize` levels and JBIG2.
Signed PDFs: Use `--invalidate-digital-signatures` to override (signatures will be invalidated).
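
In scripts, several of the problems above can be told apart by the exit status of `ocrmypdf`. The sketch below maps a few exit codes to short messages. `explain_exit` is a hypothetical helper, and the code meanings are an assumption based on OCRmyPDF's documented exit codes, so verify them against the version you have installed.

```shell
#!/bin/sh
# Hypothetical helper: describe a few ocrmypdf exit codes.
# Code meanings are assumed from OCRmyPDF's documentation; check your version.
explain_exit() {
    case $1 in
        0) echo "success" ;;
        2) echo "input file problem" ;;
        3) echo "missing dependency" ;;
        6) echo "page already has text" ;;
        8) echo "encrypted PDF" ;;
        *) echo "other error (exit code $1)" ;;
    esac
}

# Usage:
#   ocrmypdf input.pdf output.pdf; explain_exit $?
```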
