Skills pdf-ocr-extractor
Extract text from image-based or scanned PDFs using Tesseract OCR.
install
source · Clone the upstream repo
git clone https://github.com/openclaw/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bilicen700/pdf-ocr-extraction" ~/.claude/skills/openclaw-skills-pdf-ocr-extractor && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/bilicen700/pdf-ocr-extraction" ~/.openclaw/skills/openclaw-skills-pdf-ocr-extractor && rm -rf "$T"
manifest: skills/bilicen700/pdf-ocr-extraction/SKILL.md
PDF OCR Extractor
Use this skill to extract text from scanned or image-based PDFs that lack a native text layer. It renders each PDF page to an image and runs optical character recognition (OCR) on it locally, so it is free, uses no third-party APIs, and has no usage limits.
Dependencies
This skill requires:
- System Binary: tesseract (along with required language data packs like chi_sim or eng).
- Python Packages: pypdfium2, pytesseract, and Pillow.
Note: Do not run automated pip install commands at runtime. Rely on the user or the environment to pre-install the dependencies defined in the metadata block.
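Because nothing may be installed at runtime, it can help to check up front what is missing and report it to the user. The following is a minimal sketch, not part of the original skill: the helper name missing_dependencies and its constants are hypothetical, and it assumes the packages are importable as pypdfium2, pytesseract, and PIL (the import name of Pillow).

```python
import importlib.util
import shutil

# Import names for the required packages; Pillow is imported as "PIL".
REQUIRED_MODULES = ("pypdfium2", "pytesseract", "PIL")
REQUIRED_BINARIES = ("tesseract",)


def missing_dependencies(modules=REQUIRED_MODULES, binaries=REQUIRED_BINARIES):
    """Return the names of dependencies that cannot be found. Installs nothing."""
    # find_spec returns None when a top-level module is not importable.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when a binary is not on PATH.
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing dependencies (please install them manually): " + ", ".join(missing))
    else:
        print("All dependencies found.")
```

The check only looks things up; if anything is reported missing, the decision to install stays with the user.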
Quick Start
Create a Python script (e.g., extract.py) in a temporary directory to handle the extraction safely:
import pypdfium2 as pdfium
import pytesseract
from PIL import Image
import sys
import os

def extract(pdf_path):
    doc = pdfium.PdfDocument(pdf_path)
    full_text = []
    for i, page in enumerate(doc):
        # Render page to a high-resolution image
        bitmap = page.render(scale=2)
        tmp_img = f"/tmp/page_{i}.png"
        bitmap.to_pil().save(tmp_img)
        # Run OCR (assuming English and Simplified Chinese packs are installed)
        text = pytesseract.image_to_string(Image.open(tmp_img), lang='chi_sim+eng')
        full_text.append(text)
        # Cleanup temporary file
        os.remove(tmp_img)
    return "\n".join(full_text)

if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(extract(sys.argv[1]))
Then execute the script:
python3 extract.py /path/to/document.pdf
Security & Sandbox Constraints
- Write temporary images only to /tmp/ and clean them up immediately after extraction.
- Do not attempt to dynamically download or install language packs via shell commands; notify the user if a specific language is missing.
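One way to honor both constraints is sketched below, under stated assumptions. The helper names missing_languages and with_scratch_dir are hypothetical and not part of the skill. The list of available languages is passed in as a plain argument; in practice it could presumably come from pytesseract.get_languages(), but that wiring is an assumption and is left out so the sketch runs with the standard library alone.

```python
import os
import tempfile


def missing_languages(requested, available):
    """Return the codes in a 'chi_sim+eng' style string that are not in available."""
    return [code for code in requested.split("+") if code and code not in available]


def with_scratch_dir(work):
    """Run work(scratch_dir) in a private directory under /tmp, then delete it."""
    # TemporaryDirectory removes the directory and everything in it on exit,
    # even if work() raises, so no page images are left behind.
    with tempfile.TemporaryDirectory(prefix="pdf_ocr_", dir="/tmp") as scratch:
        return work(scratch)


if __name__ == "__main__":
    # Example: the environment only has English installed.
    absent = missing_languages("chi_sim+eng", ["eng", "osd"])
    if absent:
        print("Missing Tesseract language packs, please install: " + ", ".join(absent))
    # Temporary page images would be written inside the scratch directory.
    with_scratch_dir(lambda d: print("working in", d))
```

A private scratch directory also avoids filename collisions between concurrent runs, which fixed names such as /tmp/page_0.png would not.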